netdev.vger.kernel.org archive mirror
* [RFC PATCH net-next v6 0/4] Enable virtio_net to act as a backup for a passthru device
@ 2018-04-10 18:59 Sridhar Samudrala
  2018-04-10 18:59 ` [RFC PATCH net-next v6 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit Sridhar Samudrala
                   ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri

The main motivation for this patch series is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor-controlled live migration of VMs that have directly
attached SR-IOV VF devices.

Patch 1 introduces a new feature bit, VIRTIO_NET_F_BACKUP, that the
hypervisor can use to indicate that the virtio_net interface should act as
a backup for another device with the same MAC address.

Patch 2 introduces a bypass module that provides a generic interface for
paravirtual drivers to listen for netdev register/unregister/link change
events from PCI Ethernet devices with the same MAC and take over their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. It provides 2 sets of interfaces to paravirtual
drivers to support the 2-netdev (netvsc) and 3-netdev (virtio_net) models.

Patch 3 extends virtio_net to use an alternate datapath when one is available
and registered. When the BACKUP feature is enabled, the virtio_net driver
creates an additional 'bypass' netdev that acts as a master device and
controls 2 slave devices. The original virtio_net netdev is registered as the
'backup' netdev and a passthru/VF device with the same MAC gets
registered as the 'active' netdev. Both the 'bypass' and 'backup' netdevs are
associated with the same 'pci' device. The user accesses the network
interface via the 'bypass' netdev. The 'bypass' netdev chooses the 'active'
netdev as the default for transmits when it is available with link up and
running.
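
The transmit preference described for patch 3 can be illustrated with a small
userspace model. This is only a sketch: struct model_dev and the model_*
helpers are hypothetical stand-ins, not kernel types or functions, and the
real code operates on struct net_device under RCU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a slave netdev; not a kernel type. */
struct model_dev {
	const char *name;
	bool running;     /* models netif_running() */
	bool carrier_ok;  /* models netif_carrier_ok() */
};

/* A slave can transmit only when it exists, is up, and has link. */
static bool model_xmit_ready(const struct model_dev *dev)
{
	return dev && dev->running && dev->carrier_ok;
}

/* Prefer 'active', fall back to 'backup'; NULL means the packet is dropped. */
static const struct model_dev *
model_pick_xmit_dev(const struct model_dev *active,
		    const struct model_dev *backup)
{
	if (model_xmit_ready(active))
		return active;
	if (model_xmit_ready(backup))
		return backup;
	return NULL;
}
```

With both slaves ready the model picks 'active'; once the VF loses link or is
unplugged it picks 'backup', and with neither ready the packet is dropped.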

Patch 4 refactors netvsc to use the registration/notification framework
provided by the bypass module.

As this patch series initially focuses on use cases where the hypervisor
fully controls the VM networking and the guest is not expected to directly
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional use cases,
it should be possible to enable additional ops later by caching the state
in the virtio netdev and replaying it when the 'active' netdev gets registered.
 
The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

This patch series is based on the discussion initiated by Jesse in this thread:
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

v6 RFC:
- Simplified virtio_net changes by moving all the ndo_ops of the
  bypass_netdev and create/destroy of bypass_netdev to the 'bypass' module.
- Avoided 2-phase registration (driver + instances).
- Introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags.
- Replaced mutex with a spinlock.

v5 RFC:
- Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
- Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- Made a small change to the virtio-net xmit path to only use the VF datapath
  for unicasts. Broadcasts/multicasts use the virtio datapath. This avoids
  sending east-west broadcasts over the PCI link.
- Added support for the feature bit in qemu

Sridhar Samudrala (4):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  net: Introduce generic bypass module
  virtio_net: Extend virtio to use VF datapath when available
  netvsc: refactor notifier/event handling code to use the bypass
    framework

 drivers/net/Kconfig             |   1 +
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/netvsc_drv.c | 219 ++++----------
 drivers/net/virtio_net.c        | 614 +++++++++++++++++++++++++++++++++++++++-
 include/net/bypass.h            |  80 ++++++
 include/uapi/linux/virtio_net.h |   3 +
 net/Kconfig                     |  18 ++
 net/core/Makefile               |   1 +
 net/core/bypass.c               | 406 ++++++++++++++++++++++++++
 9 files changed, 1184 insertions(+), 159 deletions(-)
 create mode 100644 include/net/bypass.h
 create mode 100644 net/core/bypass.c

-- 
2.14.3


* [RFC PATCH net-next v6 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  2018-04-10 18:59 [RFC PATCH net-next v6 0/4] Enable virtio_net to act as a backup for a passthru device Sridhar Samudrala
@ 2018-04-10 18:59 ` Sridhar Samudrala
  2018-04-10 18:59 ` [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module Sridhar Samudrala
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri

This feature bit can be used by the hypervisor to indicate that the virtio_net
device should act as a backup for another device with the same MAC address.

VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.
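
As an illustration only, testing bit 62 in a 64-bit feature mask looks like
the sketch below. model_has_feature() is a hypothetical userspace helper
written for this example; in the kernel a driver would normally query
negotiated features through virtio_has_feature() instead.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_F_BACKUP 62	/* value taken from this patch */

/* Hypothetical helper: true if 'bit' is set in a 64-bit feature mask. */
static bool model_has_feature(uint64_t features, unsigned int bit)
{
	return bit < 64 && (features & (UINT64_C(1) << bit)) != 0;
}
```

The 64-bit constant matters here: shifting a plain int by 62 would be
undefined behaviour in C.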

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..befb5944f3fd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2962,7 +2962,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP	  62	/* Act as backup for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3


* [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-10 18:59 [RFC PATCH net-next v6 0/4] Enable virtio_net to act as a backup for a passthru device Sridhar Samudrala
  2018-04-10 18:59 ` [RFC PATCH net-next v6 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit Sridhar Samudrala
@ 2018-04-10 18:59 ` Sridhar Samudrala
  2018-04-11 15:51   ` Jiri Pirko
  2018-04-10 18:59 ` [RFC PATCH net-next v6 3/4] virtio_net: Extend virtio to use VF datapath when available Sridhar Samudrala
  2018-04-10 18:59 ` [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework Sridhar Samudrala
  3 siblings, 1 reply; 63+ messages in thread
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri

This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from PCI Ethernet
devices with the same MAC and take over their datapath. The notifier and
event handling code is based on the existing netvsc implementation.

It exposes 2 sets of interfaces to the paravirtual drivers.
1. The existing netvsc driver uses the 2-netdev model. In this model, no
master netdev is created. The paravirtual driver registers each bypass
instance along with a set of ops to manage the slave events.
     bypass_master_register()
     bypass_master_unregister()
2. The new virtio_net based solution uses the 3-netdev model. In this model,
the bypass module provides interfaces to create/destroy an additional master
netdev and all the slave events are managed internally.
     bypass_master_create()
     bypass_master_destroy()
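
The difference between the two models can be sketched as a userspace model of
the dispatch: when a driver supplied ops (2-netdev), slave events are
delegated to its callbacks; when no ops were supplied (3-netdev), the module
handles the event itself. All model_* names and return codes below are
hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical result codes for the model; not kernel values. */
enum {
	MODEL_HANDLED_BY_DRIVER = 1,
	MODEL_HANDLED_INTERNALLY = 2,
	MODEL_EINVAL = -22,
};

/* Hypothetical, reduced version of a driver-supplied ops table. */
struct model_ops {
	int (*slave_join)(const char *slave, const char *master);
};

/* Example callback a 2-netdev driver might register. */
static int model_driver_join(const char *slave, const char *master)
{
	(void)slave;
	(void)master;
	return MODEL_HANDLED_BY_DRIVER;
}

/* Delegate to the driver when ops are present, else handle internally. */
static int model_slave_join(const char *slave, const char *master,
			    const struct model_ops *ops)
{
	if (ops) {
		if (!ops->slave_join)
			return MODEL_EINVAL;	/* ops table is incomplete */
		return ops->slave_join(slave, master);
	}

	/* 3-netdev model: the module tracks the slave itself. */
	(void)slave;
	(void)master;
	return MODEL_HANDLED_INTERNALLY;
}
```

Passing NULL ops in the model corresponds to a master created by the module
itself; a non-NULL table corresponds to a driver that registered its own ops.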

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/linux/netdevice.h |  14 +
 include/net/bypass.h      |  96 ++++++
 net/Kconfig               |  18 +
 net/core/Makefile         |   1 +
 net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 973 insertions(+)
 create mode 100644 include/net/bypass.h
 create mode 100644 net/core/bypass.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503ea81a..587293728f70 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
 	IFF_PHONY_HEADROOM		= 1<<24,
 	IFF_MACSEC			= 1<<25,
 	IFF_NO_RX_HANDLER		= 1<<26,
+	IFF_BYPASS			= 1 << 27,
+	IFF_BYPASS_SLAVE		= 1 << 28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
 #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
+#define IFF_BYPASS			IFF_BYPASS
+#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
 }
 
+static inline bool netif_is_bypass_master(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_BYPASS;
+}
+
+static inline bool netif_is_bypass_slave(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_BYPASS_SLAVE;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/bypass.h b/include/net/bypass.h
new file mode 100644
index 000000000000..86b02cb894cf
--- /dev/null
+++ b/include/net/bypass.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_BYPASS_H
+#define _NET_BYPASS_H
+
+#include <linux/netdevice.h>
+
+struct bypass_ops {
+	int (*slave_pre_register)(struct net_device *slave_netdev,
+				  struct net_device *bypass_netdev);
+	int (*slave_join)(struct net_device *slave_netdev,
+			  struct net_device *bypass_netdev);
+	int (*slave_pre_unregister)(struct net_device *slave_netdev,
+				    struct net_device *bypass_netdev);
+	int (*slave_release)(struct net_device *slave_netdev,
+			     struct net_device *bypass_netdev);
+	int (*slave_link_change)(struct net_device *slave_netdev,
+				 struct net_device *bypass_netdev);
+	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
+};
+
+struct bypass_master {
+	struct list_head list;
+	struct net_device __rcu *bypass_netdev;
+	struct bypass_ops __rcu *ops;
+};
+
+/* bypass state */
+struct bypass_info {
+	/* passthru netdev with same MAC */
+	struct net_device __rcu *active_netdev;
+
+	/* virtio_net netdev */
+	struct net_device __rcu *backup_netdev;
+
+	/* active netdev stats */
+	struct rtnl_link_stats64 active_stats;
+
+	/* backup netdev stats */
+	struct rtnl_link_stats64 backup_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 bypass_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
+#if IS_ENABLED(CONFIG_NET_BYPASS)
+
+int bypass_master_create(struct net_device *backup_netdev,
+			 struct bypass_master **pbypass_master);
+void bypass_master_destroy(struct bypass_master *bypass_master);
+
+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
+			   struct bypass_master **pbypass_master);
+void bypass_master_unregister(struct bypass_master *bypass_master);
+
+int bypass_slave_unregister(struct net_device *slave_netdev);
+
+#else
+
+static inline
+int bypass_master_create(struct net_device *backup_netdev,
+			 struct bypass_master **pbypass_master)
+{
+	return 0;
+}
+
+static inline
+void bypass_master_destroy(struct bypass_master *bypass_master)
+{
+}
+
+static inline
+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
+			   struct bypass_master **pbypass_master)
+{
+	return 0;
+}
+
+static inline
+void bypass_master_unregister(struct bypass_master *bypass_master)
+{
+}
+
+static inline
+int bypass_slave_unregister(struct net_device *slave_netdev)
+{
+	return 0;
+}
+
+#endif
+
+#endif /* _NET_BYPASS_H */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..994445f4a96a 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
 	  devlink is a loadable module and the driver using it is built-in.
 
+config NET_BYPASS
+	tristate "Bypass interface"
+	---help---
+	  This provides a generic interface for paravirtual drivers to listen
+	  for netdev register/unregister/link change events from PCI Ethernet
+	  devices with the same MAC and take over their datapath. This also
+	  enables live migration of a VM with a directly attached VF by failing
+	  over to the paravirtual datapath when the VF is unplugged.
+
+config MAY_USE_BYPASS
+	tristate
+	default m if NET_BYPASS=m
+	default y if NET_BYPASS=y || NET_BYPASS=n
+	help
+	  Drivers using the bypass infrastructure should have a dependency
+	  on MAY_USE_BYPASS to ensure they do not cause link errors when
+	  bypass is a loadable module and the driver using it is built-in.
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..a9727ed1c8fc 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
 obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_NET_BYPASS) += bypass.o
diff --git a/net/core/bypass.c b/net/core/bypass.c
new file mode 100644
index 000000000000..b5b9cb554c3f
--- /dev/null
+++ b/net/core/bypass.c
@@ -0,0 +1,844 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* A common module to handle registrations and notifications for paravirtual
+ * drivers to enable accelerated datapath and support VF live migration.
+ *
+ * The notifier and event handling code is based on netvsc driver.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <linux/pci.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/bypass.h>
+
+static LIST_HEAD(bypass_master_list);
+static DEFINE_SPINLOCK(bypass_lock);
+
+static int bypass_slave_pre_register(struct net_device *slave_netdev,
+				     struct net_device *bypass_netdev,
+				     struct bypass_ops *bypass_ops)
+{
+	struct bypass_info *bi;
+	bool backup;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_pre_register)
+			return -EINVAL;
+
+		return bypass_ops->slave_pre_register(slave_netdev,
+						      bypass_netdev);
+	}
+
+	bi = netdev_priv(bypass_netdev);
+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
+	if (backup ? rtnl_dereference(bi->backup_netdev) :
+			rtnl_dereference(bi->active_netdev)) {
+		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
+			   slave_netdev->name, backup ? "backup" : "active");
+		return -EEXIST;
+	}
+
+	/* Avoid non pci devices as active netdev */
+	if (!backup && (!slave_netdev->dev.parent ||
+			!dev_is_pci(slave_netdev->dev.parent)))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int bypass_slave_join(struct net_device *slave_netdev,
+			     struct net_device *bypass_netdev,
+			     struct bypass_ops *bypass_ops)
+{
+	struct bypass_info *bi;
+	bool backup;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_join)
+			return -EINVAL;
+
+		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
+	}
+
+	bi = netdev_priv(bypass_netdev);
+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
+
+	dev_hold(slave_netdev);
+
+	if (backup) {
+		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
+		dev_get_stats(slave_netdev, &bi->backup_stats);
+	} else {
+		rcu_assign_pointer(bi->active_netdev, slave_netdev);
+		dev_get_stats(slave_netdev, &bi->active_stats);
+		bypass_netdev->min_mtu = slave_netdev->min_mtu;
+		bypass_netdev->max_mtu = slave_netdev->max_mtu;
+	}
+
+	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
+		    slave_netdev->name);
+
+	return 0;
+}
+
+/* Called when slave dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+	skb->dev = ndev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+static struct net_device *bypass_master_get_bymac(u8 *mac,
+						  struct bypass_ops **ops)
+{
+	struct bypass_master *bypass_master;
+	struct net_device *bypass_netdev;
+
+	spin_lock(&bypass_lock);
+	list_for_each_entry(bypass_master, &bypass_master_list, list) {
+		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
+		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
+			*ops = rcu_dereference(bypass_master->ops);
+			spin_unlock(&bypass_lock);
+			return bypass_netdev;
+		}
+	}
+	spin_unlock(&bypass_lock);
+	return NULL;
+}
+
+static int bypass_slave_register(struct net_device *slave_netdev)
+{
+	struct net_device *bypass_netdev;
+	struct bypass_ops *bypass_ops;
+	int ret, orig_mtu;
+
+	ASSERT_RTNL();
+
+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
+						&bypass_ops);
+	if (!bypass_netdev)
+		goto done;
+
+	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
+					bypass_ops);
+	if (ret != 0)
+		goto done;
+
+	ret = netdev_rx_handler_register(slave_netdev,
+					 bypass_ops ? bypass_ops->handle_frame :
+					 bypass_handle_frame, bypass_netdev);
+	if (ret != 0) {
+		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
+			   ret);
+		goto done;
+	}
+
+	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
+	if (ret != 0) {
+		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
+			   bypass_netdev->name, ret);
+		goto upper_link_failed;
+	}
+
+	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
+
+	if (netif_running(bypass_netdev)) {
+		ret = dev_open(slave_netdev);
+		if (ret && (ret != -EBUSY)) {
+			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
+				   slave_netdev->name, ret);
+			goto err_interface_up;
+		}
+	}
+
+	/* Align MTU of slave with master */
+	orig_mtu = slave_netdev->mtu;
+	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
+	if (ret != 0) {
+		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
+			   slave_netdev->name, bypass_netdev->mtu);
+		goto err_set_mtu;
+	}
+
+	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
+	if (ret != 0)
+		goto err_join;
+
+	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
+
+	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
+		    slave_netdev->name);
+
+	goto done;
+
+err_join:
+	dev_set_mtu(slave_netdev, orig_mtu);
+err_set_mtu:
+	dev_close(slave_netdev);
+err_interface_up:
+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
+upper_link_failed:
+	netdev_rx_handler_unregister(slave_netdev);
+done:
+	return NOTIFY_DONE;
+}
+
+static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
+				       struct net_device *bypass_netdev,
+				       struct bypass_ops *bypass_ops)
+{
+	struct net_device *backup_netdev, *active_netdev;
+	struct bypass_info *bi;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_pre_unregister)
+			return -EINVAL;
+
+		return bypass_ops->slave_pre_unregister(slave_netdev,
+							bypass_netdev);
+	}
+
+	bi = netdev_priv(bypass_netdev);
+	active_netdev = rtnl_dereference(bi->active_netdev);
+	backup_netdev = rtnl_dereference(bi->backup_netdev);
+
+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int bypass_slave_release(struct net_device *slave_netdev,
+				struct net_device *bypass_netdev,
+				struct bypass_ops *bypass_ops)
+{
+	struct net_device *backup_netdev, *active_netdev;
+	struct bypass_info *bi;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_release)
+			return -EINVAL;
+
+		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
+	}
+
+	bi = netdev_priv(bypass_netdev);
+	active_netdev = rtnl_dereference(bi->active_netdev);
+	backup_netdev = rtnl_dereference(bi->backup_netdev);
+
+	if (slave_netdev == backup_netdev) {
+		RCU_INIT_POINTER(bi->backup_netdev, NULL);
+	} else {
+		RCU_INIT_POINTER(bi->active_netdev, NULL);
+		if (backup_netdev) {
+			bypass_netdev->min_mtu = backup_netdev->min_mtu;
+			bypass_netdev->max_mtu = backup_netdev->max_mtu;
+		}
+	}
+
+	dev_put(slave_netdev);
+
+	netdev_info(bypass_netdev, "bypass slave:%s released\n",
+		    slave_netdev->name);
+
+	return 0;
+}
+
+int bypass_slave_unregister(struct net_device *slave_netdev)
+{
+	struct net_device *bypass_netdev;
+	struct bypass_ops *bypass_ops;
+	int ret;
+
+	if (!netif_is_bypass_slave(slave_netdev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
+						&bypass_ops);
+	if (!bypass_netdev)
+		goto done;
+
+	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
+					  bypass_ops);
+	if (ret != 0)
+		goto done;
+
+	netdev_rx_handler_unregister(slave_netdev);
+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
+
+	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
+
+	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
+		    slave_netdev->name);
+
+done:
+	return NOTIFY_DONE;
+}
+EXPORT_SYMBOL_GPL(bypass_slave_unregister);
+
+static bool bypass_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static int bypass_slave_link_change(struct net_device *slave_netdev)
+{
+	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
+	struct bypass_ops *bypass_ops;
+	struct bypass_info *bi;
+
+	if (!netif_is_bypass_slave(slave_netdev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
+						&bypass_ops);
+	if (!bypass_netdev)
+		goto done;
+
+	if (bypass_ops) {
+		if (!bypass_ops->slave_link_change)
+			goto done;
+
+		return bypass_ops->slave_link_change(slave_netdev,
+						     bypass_netdev);
+	}
+
+	if (!netif_running(bypass_netdev))
+		return 0;
+
+	bi = netdev_priv(bypass_netdev);
+
+	active_netdev = rtnl_dereference(bi->active_netdev);
+	backup_netdev = rtnl_dereference(bi->backup_netdev);
+
+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
+		goto done;
+
+	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
+	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
+		netif_carrier_on(bypass_netdev);
+		netif_tx_wake_all_queues(bypass_netdev);
+	} else {
+		netif_carrier_off(bypass_netdev);
+		netif_tx_stop_all_queues(bypass_netdev);
+	}
+
+done:
+	return NOTIFY_DONE;
+}
+
+static bool bypass_validate_event_dev(struct net_device *dev)
+{
+	/* Skip parent events */
+	if (netif_is_bypass_master(dev))
+		return false;
+
+	/* Avoid non-Ethernet type devices */
+	if (dev->type != ARPHRD_ETHER)
+		return false;
+
+	/* Avoid Vlan dev with same MAC registering as VF */
+	if (is_vlan_dev(dev))
+		return false;
+
+	/* Avoid Bonding master dev with same MAC registering as slave dev */
+	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
+		return false;
+
+	return true;
+}
+
+static int
+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	if (!bypass_validate_event_dev(event_dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return bypass_slave_register(event_dev);
+	case NETDEV_UNREGISTER:
+		return bypass_slave_unregister(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return bypass_slave_link_change(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block bypass_notifier = {
+	.notifier_call = bypass_event,
+};
+
+int bypass_open(struct net_device *dev)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *active_netdev, *backup_netdev;
+	int err;
+
+	netif_carrier_off(dev);
+	netif_tx_wake_all_queues(dev);
+
+	active_netdev = rtnl_dereference(bi->active_netdev);
+	if (active_netdev) {
+		err = dev_open(active_netdev);
+		if (err)
+			goto err_active_open;
+	}
+
+	backup_netdev = rtnl_dereference(bi->backup_netdev);
+	if (backup_netdev) {
+		err = dev_open(backup_netdev);
+		if (err)
+			goto err_backup_open;
+	}
+
+	return 0;
+
+err_backup_open:
+	if (active_netdev)
+		dev_close(active_netdev);
+err_active_open:
+	netif_tx_disable(dev);
+	return err;
+}
+EXPORT_SYMBOL_GPL(bypass_open);
+
+int bypass_close(struct net_device *dev)
+{
+	struct bypass_info *vi = netdev_priv(dev);
+	struct net_device *slave_netdev;
+
+	netif_tx_disable(dev);
+
+	slave_netdev = rtnl_dereference(vi->active_netdev);
+	if (slave_netdev)
+		dev_close(slave_netdev);
+
+	slave_netdev = rtnl_dereference(vi->backup_netdev);
+	if (slave_netdev)
+		dev_close(slave_netdev);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(bypass_close);
+
+static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via active netdev followed by backup netdev */
+	xmit_dev = rcu_dereference_bh(bi->active_netdev);
+	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
+		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
+			return bypass_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+EXPORT_SYMBOL_GPL(bypass_start_xmit);
+
+u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
+			void *accel_priv, select_queue_fallback_t fallback)
+{
+	/* This helper function exists to help dev_pick_tx get the correct
+	 * destination queue.  Using a helper function skips a call to
+	 * skb_tx_hash and will put the skbs in the queue we expect on their
+	 * way down to the slave device.
+	 */
+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+EXPORT_SYMBOL_GPL(bypass_select_queue);
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
+			      const struct rtnl_link_stats64 *_new,
+			      const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *slave_netdev;
+
+	spin_lock(&bi->stats_lock);
+	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	slave_netdev = rcu_dereference(bi->active_netdev);
+	if (slave_netdev) {
+		new = dev_get_stats(slave_netdev, &temp);
+		bypass_fold_stats(stats, new, &bi->active_stats);
+		memcpy(&bi->active_stats, new, sizeof(*new));
+	}
+
+	slave_netdev = rcu_dereference(bi->backup_netdev);
+	if (slave_netdev) {
+		new = dev_get_stats(slave_netdev, &temp);
+		bypass_fold_stats(stats, new, &bi->backup_stats);
+		memcpy(&bi->backup_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
+	spin_unlock(&bi->stats_lock);
+}
+EXPORT_SYMBOL_GPL(bypass_get_stats);
+
+int bypass_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *active_netdev, *backup_netdev;
+	int ret = 0;
+
+	active_netdev = rtnl_dereference(bi->active_netdev);
+	if (active_netdev) {
+		ret = dev_set_mtu(active_netdev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	backup_netdev = rtnl_dereference(bi->backup_netdev);
+	if (backup_netdev) {
+		ret = dev_set_mtu(backup_netdev, new_mtu);
+		if (ret) {
+			/* Roll back the active slave only if one is present */
+			if (active_netdev)
+				dev_set_mtu(active_netdev, dev->mtu);
+			return ret;
+		}
+	}
+
+	dev->mtu = new_mtu;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(bypass_change_mtu);
+
+void bypass_set_rx_mode(struct net_device *dev)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *slave_netdev;
+
+	rcu_read_lock();
+
+	slave_netdev = rcu_dereference(bi->active_netdev);
+	if (slave_netdev) {
+		dev_uc_sync_multiple(slave_netdev, dev);
+		dev_mc_sync_multiple(slave_netdev, dev);
+	}
+
+	slave_netdev = rcu_dereference(bi->backup_netdev);
+	if (slave_netdev) {
+		dev_uc_sync_multiple(slave_netdev, dev);
+		dev_mc_sync_multiple(slave_netdev, dev);
+	}
+
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
+
+static const struct net_device_ops bypass_netdev_ops = {
+	.ndo_open		= bypass_open,
+	.ndo_stop		= bypass_close,
+	.ndo_start_xmit		= bypass_start_xmit,
+	.ndo_select_queue	= bypass_select_queue,
+	.ndo_get_stats64	= bypass_get_stats,
+	.ndo_change_mtu		= bypass_change_mtu,
+	.ndo_set_rx_mode	= bypass_set_rx_mode,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+#define BYPASS_DRV_NAME "bypass"
+#define BYPASS_DRV_VERSION "0.1"
+
+static void bypass_ethtool_get_drvinfo(struct net_device *dev,
+				       struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
+}
+
+int bypass_ethtool_get_link_ksettings(struct net_device *dev,
+				      struct ethtool_link_ksettings *cmd)
+{
+	struct bypass_info *bi = netdev_priv(dev);
+	struct net_device *slave_netdev;
+
+	slave_netdev = rtnl_dereference(bi->active_netdev);
+	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
+		slave_netdev = rtnl_dereference(bi->backup_netdev);
+		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(slave_netdev, cmd);
+}
+EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
+
+static const struct ethtool_ops bypass_ethtool_ops = {
+	.get_drvinfo            = bypass_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
+};
+
+static void bypass_register_existing_slave(struct net_device *bypass_netdev)
+{
+	struct net *net = dev_net(bypass_netdev);
+	struct net_device *dev;
+
+	rtnl_lock();
+	for_each_netdev(net, dev) {
+		if (dev == bypass_netdev)
+			continue;
+		if (!bypass_validate_event_dev(dev))
+			continue;
+		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
+			bypass_slave_register(dev);
+	}
+	rtnl_unlock();
+}
+
+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
+			   struct bypass_master **pbypass_master)
+{
+	struct bypass_master *bypass_master;
+
+	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
+	if (!bypass_master)
+		return -ENOMEM;
+
+	rcu_assign_pointer(bypass_master->ops, ops);
+	dev_hold(dev);
+	dev->priv_flags |= IFF_BYPASS;
+	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
+
+	spin_lock(&bypass_lock);
+	list_add_tail(&bypass_master->list, &bypass_master_list);
+	spin_unlock(&bypass_lock);
+
+	bypass_register_existing_slave(dev);
+
+	*pbypass_master = bypass_master;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(bypass_master_register);
+
+void bypass_master_unregister(struct bypass_master *bypass_master)
+{
+	struct net_device *bypass_netdev;
+
+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
+
+	bypass_netdev->priv_flags &= ~IFF_BYPASS;
+	dev_put(bypass_netdev);
+
+	spin_lock(&bypass_lock);
+	list_del(&bypass_master->list);
+	spin_unlock(&bypass_lock);
+
+	kfree(bypass_master);
+}
+EXPORT_SYMBOL_GPL(bypass_master_unregister);
+
+int bypass_master_create(struct net_device *backup_netdev,
+			 struct bypass_master **pbypass_master)
+{
+	struct device *dev = backup_netdev->dev.parent;
+	struct net_device *bypass_netdev;
+	int err;
+
+	/* Alloc at least 2 queues, for now we are going with 16 assuming
+	 * that most devices being bonded won't have too many queues.
+	 */
+	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
+	if (!bypass_netdev) {
+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
+		return -ENOMEM;
+	}
+
+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
+	SET_NETDEV_DEV(bypass_netdev, dev);
+
+	bypass_netdev->netdev_ops = &bypass_netdev_ops;
+	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
+
+	/* Initialize the device options */
+	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
+	bypass_netdev->features |= NETIF_F_LLTX;
+
+	/* Don't allow bypass devices to change network namespaces. */
+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
+
+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	bypass_netdev->features |= bypass_netdev->hw_features;
+
+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
+	       bypass_netdev->addr_len);
+
+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
+
+	err = register_netdev(bypass_netdev);
+	if (err < 0) {
+		dev_err(dev, "Unable to register bypass_netdev!\n");
+		goto err_register_netdev;
+	}
+
+	netif_carrier_off(bypass_netdev);
+
+	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
+	if (err < 0)
+		goto err_bypass;
+
+	return 0;
+
+err_bypass:
+	unregister_netdev(bypass_netdev);
+err_register_netdev:
+	free_netdev(bypass_netdev);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(bypass_master_create);
+
+void bypass_master_destroy(struct bypass_master *bypass_master)
+{
+	struct net_device *bypass_netdev;
+	struct net_device *slave_netdev;
+	struct bypass_info *bi;
+
+	if (!bypass_master)
+		return;
+
+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
+	bi = netdev_priv(bypass_netdev);
+
+	netif_device_detach(bypass_netdev);
+
+	rtnl_lock();
+
+	slave_netdev = rtnl_dereference(bi->active_netdev);
+	if (slave_netdev)
+		bypass_slave_unregister(slave_netdev);
+
+	slave_netdev = rtnl_dereference(bi->backup_netdev);
+	if (slave_netdev)
+		bypass_slave_unregister(slave_netdev);
+
+	bypass_master_unregister(bypass_master);
+
+	unregister_netdevice(bypass_netdev);
+
+	rtnl_unlock();
+
+	free_netdev(bypass_netdev);
+}
+EXPORT_SYMBOL_GPL(bypass_master_destroy);
+
+static __init int
+bypass_init(void)
+{
+	register_netdevice_notifier(&bypass_notifier);
+
+	return 0;
+}
+module_init(bypass_init);
+
+static __exit
+void bypass_exit(void)
+{
+	unregister_netdevice_notifier(&bypass_notifier);
+}
+module_exit(bypass_exit);
+
+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [RFC PATCH net-next v6 3/4] virtio_net: Extend virtio to use VF datapath when available
  2018-04-10 18:59 [RFC PATCH net-next v6 0/4] Enable virtio_net to act as a backup for a passthru device Sridhar Samudrala
  2018-04-10 18:59 ` [RFC PATCH net-next v6 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit Sridhar Samudrala
  2018-04-10 18:59 ` [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module Sridhar Samudrala
@ 2018-04-10 18:59 ` Sridhar Samudrala
  2018-04-10 18:59 ` [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework Sridhar Samudrala
  3 siblings, 0 replies; 63+ messages in thread
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to set up a bond/team
between a VF and a virtio_net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
into the guest to switch over to the VF datapath.

It uses the generic bypass framework, which provides 2 functions to create
and destroy a master bypass netdev. When the BACKUP feature is enabled, an
additional netdev (the bypass netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as the 'backup' netdev and a passthru device with the same MAC is
registered as the 'active' netdev.
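
As a rough illustration of the master/backup/active relationship described
above, the following userspace C sketch models how a master might choose
which lower device carries traffic: the 'active' (VF) device when it is
usable, otherwise the 'backup' (virtio_net) device. The fake_* types and
functions are hypothetical stand-ins and are not part of this series; the
readiness test (device running and carrier up) is an assumption about what
bypass_xmit_ready() checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct net_device. */
struct fake_netdev {
	const char *name;
	bool running;	/* models netif_running() */
	bool carrier;	/* models netif_carrier_ok() */
};

/* Hypothetical stand-in for the master's view of its two lower devices. */
struct fake_bypass_info {
	struct fake_netdev *active;	/* passthru/VF device, may be NULL */
	struct fake_netdev *backup;	/* virtio_net device, may be NULL */
};

/* Assumed readiness rule: the device is up and has link. */
static bool fake_xmit_ready(const struct fake_netdev *dev)
{
	return dev->running && dev->carrier;
}

/* Prefer the active device, fall back to backup, NULL if neither is usable. */
struct fake_netdev *fake_pick_slave(const struct fake_bypass_info *bi)
{
	assert(bi != NULL);

	if (bi->active && fake_xmit_ready(bi->active))
		return bi->active;
	if (bi->backup && fake_xmit_ready(bi->backup))
		return bi->backup;
	return NULL;
}
```

With both devices ready the sketch picks the VF; once the VF loses carrier
or is unplugged (active set to NULL), it falls back to virtio_net, which is
the failover behaviour the hypervisor relies on before starting a migration.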

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/Kconfig      |  1 +
 drivers/net/virtio_net.c | 36 +++++++++++++++++++++++++++++++++++-
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 891846655000..9e2cf61fd1c1 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -331,6 +331,7 @@ config VETH
 config VIRTIO_NET
 	tristate "Virtio network driver"
 	depends on VIRTIO
+	depends on MAY_USE_BYPASS
 	---help---
 	  This is the virtual network driver for virtio.  It can be used with
 	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index befb5944f3fd..99aa52d5ac9b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
 #include <net/route.h>
 #include <net/xdp.h>
+#include <net/bypass.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -206,6 +209,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* bypass_master created when BACKUP feature enabled */
+	struct bypass_master *bypass_master;
 };
 
 struct padded_vnet_hdr {
@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+				      size_t len)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int ret;
+
+	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
+		return -EOPNOTSUPP;
+
+	ret = snprintf(buf, len, "_bkup");
+	if (ret >= len)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
 	.ndo_xdp_flush		= virtnet_xdp_flush,
 	.ndo_features_check	= passthru_features_check,
+	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2839,10 +2862,16 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
+		err = bypass_master_create(vi->dev, &vi->bypass_master);
+		if (err)
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_bypass;
 	}
 
 	virtio_device_ready(vdev);
@@ -2879,6 +2908,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_bypass:
+	bypass_master_destroy(vi->bypass_master);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2913,6 +2944,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	bypass_master_destroy(vi->bypass_master);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
@@ -3010,6 +3043,7 @@ static __init int virtio_net_driver_init(void)
         ret = register_virtio_driver(&virtio_net_driver);
 	if (ret)
 		goto err_virtio;
+
 	return 0;
 err_virtio:
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
  2018-04-10 18:59 [RFC PATCH net-next v6 0/4] Enable virtio_net to act as a backup for a passthru device Sridhar Samudrala
                   ` (2 preceding siblings ...)
  2018-04-10 18:59 ` [RFC PATCH net-next v6 3/4] virtio_net: Extend virtio to use VF datapath when available Sridhar Samudrala
@ 2018-04-10 18:59 ` Sridhar Samudrala
  2018-04-10 21:26   ` Stephen Hemminger
  3 siblings, 1 reply; 63+ messages in thread
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri

Use the registration/notification framework supported by the generic
bypass infrastructure.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 208 ++++++++++------------------------------
 3 files changed, 55 insertions(+), 156 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 936968d23559..cc3a721baa18 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -1,5 +1,6 @@
 config HYPERV_NET
 	tristate "Microsoft Hyper-V virtual network driver"
 	depends on HYPERV
+	depends on MAY_USE_BYPASS
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 960f06141472..5f8137bc5c1c 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -768,6 +768,8 @@ struct net_device_context {
 	u32 vf_alloc;
 	/* Serial number of the VF to team with */
 	u32 vf_serial;
+
+	struct bypass_master *bypass_master;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..87c2a276e62f 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
 #include <net/pkt_sched.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
+#include <net/bypass.h>
 
 #include "hyperv_net.h"
 
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		if (ether_addr_equal(mac, dev->perm_addr))
-			return dev;
-	}
-
-	return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		struct net_device_context *net_device_ctx;
-
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		net_device_ctx = netdev_priv(dev);
-		if (!rtnl_dereference(net_device_ctx->nvdev))
-			continue;	/* device is removed */
-
-		if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-			return dev;	/* a match */
-	}
-
-	return NULL;
-}
-
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1829,39 +1790,15 @@ static int netvsc_vf_join(struct net_device *vf_netdev,
 			  struct net_device *ndev)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(ndev);
-	int ret;
-
-	ret = netdev_rx_handler_register(vf_netdev,
-					 netvsc_vf_handle_frame, ndev);
-	if (ret != 0) {
-		netdev_err(vf_netdev,
-			   "can not register netvsc VF receive handler (err = %d)\n",
-			   ret);
-		goto rx_handler_failed;
-	}
-
-	ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
-	if (ret != 0) {
-		netdev_err(vf_netdev,
-			   "can not set master device %s (err = %d)\n",
-			   ndev->name, ret);
-		goto upper_link_failed;
-	}
-
-	/* set slave flag before open to prevent IPv6 addrconf */
-	vf_netdev->flags |= IFF_SLAVE;
 
 	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
 
-	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
 	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
-	return 0;
 
-upper_link_failed:
-	netdev_rx_handler_unregister(vf_netdev);
-rx_handler_failed:
-	return ret;
+	dev_hold(vf_netdev);
+	rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
+
+	return 0;
 }
 
 static void __netvsc_vf_setup(struct net_device *ndev,
@@ -1914,85 +1851,82 @@ static void netvsc_vf_setup(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_vf_pre_register(struct net_device *vf_netdev,
+				  struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
 
-	if (vf_netdev->addr_len != ETH_ALEN)
-		return NOTIFY_DONE;
-
-	/*
-	 * We will use the MAC address to locate the synthetic interface to
-	 * associate with the VF interface. If we don't find a matching
-	 * synthetic interface, move on.
-	 */
-	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
-		return NOTIFY_DONE;
-
-	if (netvsc_vf_join(vf_netdev, ndev) != 0)
-		return NOTIFY_DONE;
+		return -EEXIST;
 
 	netdev_info(ndev, "VF registering: %s\n", vf_netdev->name);
 
-	dev_hold(vf_netdev);
-	rcu_assign_pointer(net_device_ctx->vf_netdev, vf_netdev);
-	return NOTIFY_OK;
+	return 0;
 }
 
 /* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
-	struct net_device *ndev;
 	bool vf_is_up = netif_running(vf_netdev);
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev)
-		return NOTIFY_DONE;
+		return -EINVAL;
 
 	netvsc_switch_datapath(ndev, vf_is_up);
 	netdev_info(ndev, "Data path switched %s VF: %s\n",
 		    vf_is_up ? "to" : "from", vf_netdev->name);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_vf_release(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
-	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
+	if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+		return -EINVAL;
 
-	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
 
-	netdev_rx_handler_unregister(vf_netdev);
-	netdev_upper_dev_unlink(vf_netdev, ndev);
 	RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
 	dev_put(vf_netdev);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
+static int netvsc_vf_pre_unregister(struct net_device *vf_netdev,
+				    struct net_device *ndev)
+{
+	struct net_device_context *net_device_ctx;
+
+	net_device_ctx = netdev_priv(ndev);
+	if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+		return -EINVAL;
+
+	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+
+	return 0;
+}
+
+static struct bypass_ops netvsc_bypass_ops = {
+	.slave_pre_register	= netvsc_vf_pre_register,
+	.slave_join		= netvsc_vf_join,
+	.slave_pre_unregister	= netvsc_vf_pre_unregister,
+	.slave_release		= netvsc_vf_release,
+	.slave_link_change	= netvsc_vf_changed,
+	.handle_frame		= netvsc_vf_handle_frame,
+};
+
 static int netvsc_probe(struct hv_device *dev,
 			const struct hv_vmbus_device_id *dev_id)
 {
@@ -2082,8 +2016,15 @@ static int netvsc_probe(struct hv_device *dev,
 		goto register_failed;
 	}
 
+	ret = bypass_master_register(net, &netvsc_bypass_ops,
+				     &net_device_ctx->bypass_master);
+	if (ret != 0)
+		goto err_bypass;
+
 	return ret;
 
+err_bypass:
+	unregister_netdev(net);
 register_failed:
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
@@ -2124,13 +2065,15 @@ static int netvsc_remove(struct hv_device *dev)
 	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
 	if (vf_netdev)
-		netvsc_unregister_vf(vf_netdev);
+		bypass_slave_unregister(vf_netdev);
 
 	if (nvdev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
 
+	bypass_master_unregister(ndev_ctx->bypass_master);
+
 	rtnl_unlock();
 	rcu_read_unlock();
 
@@ -2157,54 +2100,8 @@ static struct  hv_driver netvsc_drv = {
 	.remove = netvsc_remove,
 };
 
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
-			       unsigned long event, void *ptr)
-{
-	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
-	/* Skip our own events */
-	if (event_dev->netdev_ops == &device_ops)
-		return NOTIFY_DONE;
-
-	/* Avoid non-Ethernet type devices */
-	if (event_dev->type != ARPHRD_ETHER)
-		return NOTIFY_DONE;
-
-	/* Avoid Vlan dev with same MAC registering as VF */
-	if (is_vlan_dev(event_dev))
-		return NOTIFY_DONE;
-
-	/* Avoid Bonding master dev with same MAC registering as VF */
-	if ((event_dev->priv_flags & IFF_BONDING) &&
-	    (event_dev->flags & IFF_MASTER))
-		return NOTIFY_DONE;
-
-	switch (event) {
-	case NETDEV_REGISTER:
-		return netvsc_register_vf(event_dev);
-	case NETDEV_UNREGISTER:
-		return netvsc_unregister_vf(event_dev);
-	case NETDEV_UP:
-	case NETDEV_DOWN:
-		return netvsc_vf_changed(event_dev);
-	default:
-		return NOTIFY_DONE;
-	}
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
-	.notifier_call = netvsc_netdev_event,
-};
-
 static void __exit netvsc_drv_exit(void)
 {
-	unregister_netdevice_notifier(&netvsc_netdev_notifier);
 	vmbus_driver_unregister(&netvsc_drv);
 }
 
@@ -2224,7 +2121,6 @@ static int __init netvsc_drv_init(void)
 	if (ret)
 		return ret;
 
-	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
  2018-04-10 18:59 ` [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework Sridhar Samudrala
@ 2018-04-10 21:26   ` Stephen Hemminger
  2018-04-10 22:56     ` Samudrala, Sridhar
                       ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: Stephen Hemminger @ 2018-04-10 21:26 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, davem, netdev, virtualization, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici, jasowang, loseweigh, jiri

On Tue, 10 Apr 2018 11:59:50 -0700
Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:

> Use the registration/notification framework supported by the generic
> bypass infrastructure.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---

Thanks for doing this.  Your current version has a couple of show-stopper
issues.

First, the master device is instantly taking over the slave.
This doesn't allow udev/systemd to do its device rename of the slave
device. Netvsc uses a delayed work to work around this.

Secondly, the select queue needs to call queue selection in VF.
The bonding/teaming logic doesn't work well for UDP flows.
Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
fixed this performance problem.

Lastly, more indirection is bad in the current climate.

I am not completely averse to this but it needs to be fast, simple
and completely transparent.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
  2018-04-10 21:26   ` Stephen Hemminger
@ 2018-04-10 22:56     ` Samudrala, Sridhar
  2018-04-10 23:28     ` Michael S. Tsirkin
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Samudrala, Sridhar @ 2018-04-10 22:56 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
	virtualization, loseweigh, davem

On 4/10/2018 2:26 PM, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
> Thanks for doing this.  Your current version has a couple of show-stopper
> issues.
>
> First, the master device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to work around this.

OK. I guess you are referring to the dev_set_mtu() and dev_open() calls that are
made in bypass_slave_register() and you want to defer them to be done after
a delay.  I could avoid these calls in case of netvsc based on bypass_ops.


>
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.

netvsc should not be using bypass_select_queue(), as that ndo op gets used
only with the 3-netdev model.
Anyway, I will look into updating bypass_select_queue() based on your fix.
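
To make the "defer queue selection to VF" idea concrete, here is a minimal
userspace C sketch. It is not the code from commit b3bf5666a510, which is
not shown in this thread; the fake_* names, the hash-modulo fallback and the
defensive clamp are all assumptions made for illustration. The point is only
that the master asks the slave's own selection hook when one exists instead
of applying its own bonding-style hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fake_dev;

/* Hypothetical per-device hook, loosely modeled on ndo_select_queue. */
typedef uint16_t (*fake_select_queue_t)(const struct fake_dev *dev,
					uint32_t flow_hash);

struct fake_dev {
	uint16_t real_num_tx_queues;
	fake_select_queue_t select_queue;	/* NULL if the driver has no hook */
};

/* Generic fallback: spread flows across the TX queues by flow hash. */
uint16_t fake_pick_tx(const struct fake_dev *dev, uint32_t flow_hash)
{
	assert(dev->real_num_tx_queues > 0);
	return (uint16_t)(flow_hash % dev->real_num_tx_queues);
}

/* Master-side selection: defer to the slave's hook when it provides one. */
uint16_t fake_master_select_queue(const struct fake_dev *slave,
				  uint32_t flow_hash)
{
	uint16_t txq;

	assert(slave != NULL);
	assert(slave->real_num_tx_queues > 0);

	if (slave->select_queue)
		txq = slave->select_queue(slave, flow_hash);
	else
		txq = fake_pick_tx(slave, flow_hash);

	/* Clamp in case the hook returns a queue the slave does not have. */
	if (txq >= slave->real_num_tx_queues)
		txq %= slave->real_num_tx_queues;
	return txq;
}

/* Example VF hook that pins every flow to queue 3 (purely illustrative). */
uint16_t fake_vf_select_queue(const struct fake_dev *dev, uint32_t flow_hash)
{
	(void)dev;
	(void)flow_hash;
	return 3;
}
```

A slave without a hook gets the hash-based fallback, while a slave with a
hook keeps whatever flow-steering policy its driver implements.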

>
> Lastly, more indirection is bad in the current climate.
>
> I am not completely averse to this but it needs to be fast, simple
> and completely transparent.

Not sure we can avoid this indirection if we want to share the code but use
different models for virtio-net and netvsc.

On the other hand, these patches avoid calls to get_netvsc_bymac() and
get_netvsc_byref(), which go through all the devices for all the netdev events.
netvsc lookups should be much faster.
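
A small userspace C sketch of the lookup-cost argument follows. It compares
a scan over every netdev, as get_netvsc_bymac() does, with a scan over only
the registered masters. That the bypass module resolves a MAC by walking its
master list is an assumption here (the lookup helper is not quoted in this
mail); all fake_* names are hypothetical, and the visited counter exists only
to make the difference observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FAKE_ETH_ALEN 6

/* Hypothetical stand-in for struct net_device. */
struct fake_netdev {
	unsigned char perm_addr[FAKE_ETH_ALEN];
	bool is_netvsc;	/* models the dev->netdev_ops == &device_ops test */
};

/* Old approach: walk every netdev in the namespace on each event. */
const struct fake_netdev *fake_get_bymac_all(const struct fake_netdev *devs,
					     size_t ndevs,
					     const unsigned char *mac,
					     size_t *visited)
{
	size_t i;

	*visited = 0;
	for (i = 0; i < ndevs; i++) {
		(*visited)++;
		if (!devs[i].is_netvsc)
			continue;	/* not a netvsc device */
		if (memcmp(devs[i].perm_addr, mac, FAKE_ETH_ALEN) == 0)
			return &devs[i];
	}
	return NULL;
}

/* Assumed new approach: walk only the masters that registered themselves. */
const struct fake_netdev *
fake_get_bymac_masters(const struct fake_netdev *const *masters,
		       size_t nmasters, const unsigned char *mac,
		       size_t *visited)
{
	size_t i;

	*visited = 0;
	for (i = 0; i < nmasters; i++) {
		(*visited)++;
		if (memcmp(masters[i]->perm_addr, mac, FAKE_ETH_ALEN) == 0)
			return masters[i];
	}
	return NULL;
}
```

With many netdevs and few masters, the second walk touches far fewer
entries per event, which is the saving being claimed.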

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
  2018-04-10 21:26   ` Stephen Hemminger
  2018-04-10 22:56     ` Samudrala, Sridhar
@ 2018-04-10 23:28     ` Michael S. Tsirkin
  2018-04-10 23:44       ` Siwei Liu
  2018-04-11  7:50       ` Jiri Pirko
  2018-04-11  1:21     ` Michael S. Tsirkin
  2018-04-11  7:53     ` Jiri Pirko
  3 siblings, 2 replies; 63+ messages in thread
From: Michael S. Tsirkin @ 2018-04-10 23:28 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Sridhar Samudrala, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, jiri

On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> 
> Thanks for doing this.  Your current version has a couple of show-stopper
> issues.
> 
> First, the master device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to work around this.

Interesting. Does this mean udev must act within a specific time window
then?

> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in the current climate.
> 
> I am not completely averse to this but it needs to be fast, simple
> and completely transparent.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
  2018-04-10 23:28     ` Michael S. Tsirkin
@ 2018-04-10 23:44       ` Siwei Liu
  2018-04-10 23:59         ` Stephen Hemminger
  2018-04-11  7:50       ` Jiri Pirko
  1 sibling, 1 reply; 63+ messages in thread
From: Siwei Liu @ 2018-04-10 23:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Sridhar Samudrala, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski, Jason Wang, Jiri Pirko

On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>>
>> Thanks for doing this.  Your current version has a couple of show-stopper
>> issues.
>>
>> First, the master device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to work around this.
>
> Interesting. Does this mean udev must act within a specific time window
> then?

Sigh, lots of hacks. Why propagate this from a driver to a common
module? We really need a clean solution.

-Siwei


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in the current climate.
>>
>> I am not completely averse to this but it needs to be fast, simple
>> and completely transparent.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
  2018-04-10 23:44       ` Siwei Liu
@ 2018-04-10 23:59         ` Stephen Hemminger
  0 siblings, 0 replies; 63+ messages in thread
From: Stephen Hemminger @ 2018-04-10 23:59 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Sridhar Samudrala, virtualization, Netdev,
	David Miller

On Tue, 10 Apr 2018 16:44:47 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:  
> >> On Tue, 10 Apr 2018 11:59:50 -0700
> >> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >>  
> >> > Use the registration/notification framework supported by the generic
> >> > bypass infrastructure.
> >> >
> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > ---  
> >>
> >> Thanks for doing this.  Your current version has a couple of show-stopper
> >> issues.
> >>
> >> First, the master device is instantly taking over the slave.
> >> This doesn't allow udev/systemd to do its device rename of the slave
> >> device. Netvsc uses a delayed work to work around this.
> >
> > Interesting. Does this mean udev must act within a specific time window
> > then?  
> 
> Sigh, lots of hacks. Why propagate this from a driver to a common
> module? We really need a clean solution.
> 

I had a patch to wait for udev to do the rename and go from there
but davem rejected it.


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
  2018-04-10 21:26   ` Stephen Hemminger
  2018-04-10 22:56     ` Samudrala, Sridhar
  2018-04-10 23:28     ` Michael S. Tsirkin
@ 2018-04-11  1:21     ` Michael S. Tsirkin
  2018-04-11  7:53     ` Jiri Pirko
  3 siblings, 0 replies; 63+ messages in thread
From: Michael S. Tsirkin @ 2018-04-11  1:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Sridhar Samudrala, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, jiri

On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.
> 
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.

Well right now netvsc does an indirect call to the PT device,
does it not? If you really want max performance when PT
is in use you need to do the reverse and have PT forward to netvsc.

> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
  2018-04-10 23:28     ` Michael S. Tsirkin
  2018-04-10 23:44       ` Siwei Liu
@ 2018-04-11  7:50       ` Jiri Pirko
  1 sibling, 0 replies; 63+ messages in thread
From: Jiri Pirko @ 2018-04-11  7:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Sridhar Samudrala, davem, netdev,
	virtualization, virtio-dev, jesse.brandeburg, alexander.h.duyck,
	kubakici, jasowang, loseweigh

Wed, Apr 11, 2018 at 01:28:51AM CEST, mst@redhat.com wrote:
>On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>> 
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>> 
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>> 
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
>Interesting. Does this mean udev must act within a specific time window
>then?

Yeah. That is scary. Also, wrong.


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>> 
>> Lastly, more indirection is bad in current climate.
>> 
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.


* Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
  2018-04-10 21:26   ` Stephen Hemminger
                       ` (2 preceding siblings ...)
  2018-04-11  1:21     ` Michael S. Tsirkin
@ 2018-04-11  7:53     ` Jiri Pirko
  2019-02-22  1:14       ` net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework) Siwei Liu
  3 siblings, 1 reply; 63+ messages in thread
From: Jiri Pirko @ 2018-04-11  7:53 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Sridhar Samudrala, mst, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh

Tue, Apr 10, 2018 at 11:26:08PM CEST, stephen@networkplumber.org wrote:
>On Tue, 10 Apr 2018 11:59:50 -0700
>Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>> 
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>
>Thanks for doing this.  Your current version has couple show stopper
>issues.
>
>First, the slave device is instantly taking over the slave.
>This doesn't allow udev/systemd to do its device rename of the slave
>device. Netvsc uses a delayed work to workaround this.

Wait. Why does the fact that a device is enslaved have to affect udev in
any way? If it does, that smells like a bug in udev.


>
>Secondly, the select queue needs to call queue selection in VF.
>The bonding/teaming logic doesn't work well for UDP flows.
>Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>fixed this performance problem.
>
>Lastly, more indirection is bad in current climate.
>
>I am not completely adverse to this but it needs to be fast, simple
>and completely transparent.


* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-10 18:59 ` [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module Sridhar Samudrala
@ 2018-04-11 15:51   ` Jiri Pirko
  2018-04-11 19:13     ` Samudrala, Sridhar
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Pirko @ 2018-04-11 15:51 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh

Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. existing netvsc driver that uses 2 netdev model. In this model, no
>master netdev is created. The paravirtual driver registers each bypass
>instance along with a set of ops to manage the slave events.
>     bypass_master_register()
>     bypass_master_unregister()
>2. new virtio_net based solution that uses 3 netdev model. In this model,
>the bypass module provides interfaces to create/destroy additional master
>netdev and all the slave events are managed internally.
>      bypass_master_create()
>      bypass_master_destroy()
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> include/linux/netdevice.h |  14 +
> include/net/bypass.h      |  96 ++++++
> net/Kconfig               |  18 +
> net/core/Makefile         |   1 +
> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 973 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index cf44503ea81a..587293728f70 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> 	IFF_PHONY_HEADROOM		= 1<<24,
> 	IFF_MACSEC			= 1<<25,
> 	IFF_NO_RX_HANDLER		= 1<<26,
>+	IFF_BYPASS			= 1 << 27,
>+	IFF_BYPASS_SLAVE		= 1 << 28,

I wonder why you don't follow the existing coding style... Also, please
add these to the comment above.


> };
> 
> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
> #define IFF_MACSEC			IFF_MACSEC
> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>+#define IFF_BYPASS			IFF_BYPASS
>+#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
> 
> /**
>  *	struct net_device - The DEVICE structure.
>@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
> }
> 
>+static inline bool netif_is_bypass_master(const struct net_device *dev)
>+{
>+	return dev->priv_flags & IFF_BYPASS;
>+}
>+
>+static inline bool netif_is_bypass_slave(const struct net_device *dev)
>+{
>+	return dev->priv_flags & IFF_BYPASS_SLAVE;
>+}
>+
> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> static inline void netif_keep_dst(struct net_device *dev)
> {
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index 000000000000..86b02cb894cf
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,96 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include <linux/netdevice.h>
>+
>+struct bypass_ops {
>+	int (*slave_pre_register)(struct net_device *slave_netdev,
>+				  struct net_device *bypass_netdev);
>+	int (*slave_join)(struct net_device *slave_netdev,
>+			  struct net_device *bypass_netdev);
>+	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>+				    struct net_device *bypass_netdev);
>+	int (*slave_release)(struct net_device *slave_netdev,
>+			     struct net_device *bypass_netdev);
>+	int (*slave_link_change)(struct net_device *slave_netdev,
>+				 struct net_device *bypass_netdev);
>+	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_master {
>+	struct list_head list;
>+	struct net_device __rcu *bypass_netdev;
>+	struct bypass_ops __rcu *ops;
>+};
>+
>+/* bypass state */
>+struct bypass_info {
>+	/* passthru netdev with same MAC */
>+	struct net_device __rcu *active_netdev;

You still use the "active"/"backup" names, which is highly misleading as
they have a completely different meaning than in bond, for example.
I noted that in my previous review already. Please change it.


>+
>+	/* virtio_net netdev */
>+	struct net_device __rcu *backup_netdev;
>+
>+	/* active netdev stats */
>+	struct rtnl_link_stats64 active_stats;
>+
>+	/* backup netdev stats */
>+	struct rtnl_link_stats64 backup_stats;
>+
>+	/* aggregated stats */
>+	struct rtnl_link_stats64 bypass_stats;
>+
>+	/* spinlock while updating stats */
>+	spinlock_t stats_lock;
>+};
>+
>+#if IS_ENABLED(CONFIG_NET_BYPASS)
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master);
>+void bypass_master_destroy(struct bypass_master *bypass_master);
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct bypass_master **pbypass_master);
>+void bypass_master_unregister(struct bypass_master *bypass_master);
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev);
>+
>+#else
>+
>+static inline
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master);
>+{
>+	return 0;
>+}
>+
>+static inline
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct pbypass_master **pbypass_master);
>+{
>+	return 0;
>+}
>+
>+static inline
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+	return 0;
>+}
>+
>+#endif
>+
>+#endif /* _NET_BYPASS_H */
>diff --git a/net/Kconfig b/net/Kconfig
>index 0428f12c25c2..994445f4a96a 100644
>--- a/net/Kconfig
>+++ b/net/Kconfig
>@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
> 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
> 	  devlink is a loadable module and the driver using it is built-in.
> 
>+config NET_BYPASS
>+	tristate "Bypass interface"
>+	---help---
>+	  This provides a generic interface for paravirtual drivers to listen
>+	  for netdev register/unregister/link change events from pci ethernet
>+	  devices with the same MAC and takeover their datapath. This also
>+	  enables live migration of a VM with direct attached VF by failing
>+	  over to the paravirtual datapath when the VF is unplugged.
>+
>+config MAY_USE_BYPASS
>+	tristate
>+	default m if NET_BYPASS=m
>+	default y if NET_BYPASS=y || NET_BYPASS=n
>+	help
>+	  Drivers using the bypass infrastructure should have a dependency
>+	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>+	  bypass is a loadable module and the driver using it is built-in.
>+
> endif   # if NET
> 
> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>diff --git a/net/core/Makefile b/net/core/Makefile
>index 6dbbba8c57ae..a9727ed1c8fc 100644
>--- a/net/core/Makefile
>+++ b/net/core/Makefile
>@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
> obj-$(CONFIG_HWBM) += hwbm.o
> obj-$(CONFIG_NET_DEVLINK) += devlink.o
> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>+obj-$(CONFIG_NET_BYPASS) += bypass.o
>diff --git a/net/core/bypass.c b/net/core/bypass.c
>new file mode 100644
>index 000000000000..b5b9cb554c3f
>--- /dev/null
>+++ b/net/core/bypass.c
>@@ -0,0 +1,844 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+/* A common module to handle registrations and notifications for paravirtual
>+ * drivers to enable accelerated datapath and support VF live migration.
>+ *
>+ * The notifier and event handling code is based on netvsc driver.
>+ */
>+
>+#include <linux/netdevice.h>
>+#include <linux/etherdevice.h>
>+#include <linux/ethtool.h>
>+#include <linux/module.h>
>+#include <linux/slab.h>
>+#include <linux/netdevice.h>
>+#include <linux/netpoll.h>
>+#include <linux/rtnetlink.h>
>+#include <linux/if_vlan.h>
>+#include <linux/pci.h>
>+#include <net/sch_generic.h>
>+#include <uapi/linux/if_arp.h>
>+#include <net/bypass.h>
>+
>+static LIST_HEAD(bypass_master_list);
>+static DEFINE_SPINLOCK(bypass_lock);
>+
>+static int bypass_slave_pre_register(struct net_device *slave_netdev,
>+				     struct net_device *bypass_netdev,
>+				     struct bypass_ops *bypass_ops)
>+{
>+	struct bypass_info *bi;
>+	bool backup;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_pre_register)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_pre_register(slave_netdev,
>+						      bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+	if (backup ? rtnl_dereference(bi->backup_netdev) :
>+			rtnl_dereference(bi->active_netdev)) {
>+		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>+			   slave_netdev->name, backup ? "backup" : "active");
>+		return -EEXIST;
>+	}
>+
>+	/* Avoid non pci devices as active netdev */
>+	if (!backup && (!slave_netdev->dev.parent ||
>+			!dev_is_pci(slave_netdev->dev.parent)))
>+		return -EINVAL;
>+
>+	return 0;
>+}
>+
>+static int bypass_slave_join(struct net_device *slave_netdev,
>+			     struct net_device *bypass_netdev,
>+			     struct bypass_ops *bypass_ops)
>+{
>+	struct bypass_info *bi;
>+	bool backup;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_join)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+
>+	dev_hold(slave_netdev);
>+
>+	if (backup) {
>+		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>+		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>+	} else {
>+		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>+		dev_get_stats(bi->active_netdev, &bi->active_stats);
>+		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>+		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>+	}
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>+		    slave_netdev->name);
>+
>+	return 0;
>+}
>+
>+/* Called when slave dev is injecting data into network stack.
>+ * Change the associated network device from lower dev to virtio.
>+ * note: already called with rcu_read_lock
>+ */
>+static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>+{
>+	struct sk_buff *skb = *pskb;
>+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>+
>+	skb->dev = ndev;
>+
>+	return RX_HANDLER_ANOTHER;
>+}
>+
>+static struct net_device *bypass_master_get_bymac(u8 *mac,
>+						  struct bypass_ops **ops)
>+{
>+	struct bypass_master *bypass_master;
>+	struct net_device *bypass_netdev;
>+
>+	spin_lock(&bypass_lock);
>+	list_for_each_entry(bypass_master, &bypass_master_list, list) {

As I wrote the last time, you don't need this list or the spinlock.
You can just do something like:
	for_each_net(net) {
		for_each_netdev(net, dev) {
			if (netif_is_bypass_master(dev)) {
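
To make the shape of that lookup concrete, here is a rough userspace
model of it. It is only a sketch: struct model_dev, struct model_net,
MODEL_IFF_BYPASS and model_master_get_bymac() are made-up stand-ins for
struct net_device, struct net, the proposed IFF_BYPASS flag and
bypass_master_get_bymac(), and plain arrays replace the kernel's
namespace and netdev lists. Locking (RTNL in the real code) is not
modelled at all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_ETH_ALEN   6
#define MODEL_IFF_BYPASS (1u << 27)	/* mirrors the bit used in the patch */

/* Hypothetical stand-in for struct net_device. */
struct model_dev {
	const char *name;
	unsigned int priv_flags;
	uint8_t perm_addr[MODEL_ETH_ALEN];
};

/* Hypothetical stand-in for one network namespace and its devices. */
struct model_net {
	struct model_dev *devs;
	size_t ndevs;
};

static bool model_is_bypass_master(const struct model_dev *dev)
{
	return dev->priv_flags & MODEL_IFF_BYPASS;
}

/*
 * Walk every device in every namespace and return the first one that is
 * flagged as a bypass master and has a matching permanent MAC. No private
 * list and no private lock are needed: the flag identifies the master.
 */
static struct model_dev *model_master_get_bymac(struct model_net *nets,
						size_t nnets,
						const uint8_t *mac)
{
	for (size_t n = 0; n < nnets; n++) {
		for (size_t d = 0; d < nets[n].ndevs; d++) {
			struct model_dev *dev = &nets[n].devs[d];

			if (model_is_bypass_master(dev) &&
			    memcmp(dev->perm_addr, mac, MODEL_ETH_ALEN) == 0)
				return dev;
		}
	}
	return NULL;
}
```

The point of the model is only that the device flag already tells you
which netdevs are masters, so a second bookkeeping structure adds
nothing.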




>+		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>+			*ops = rcu_dereference(bypass_master->ops);

I don't see how rcu_dereference is ok here.
1) I don't see rcu_read_lock taken
2) Looks like bypass_master->ops has the same value across the whole
   existence.


>+			spin_unlock(&bypass_lock);
>+			return bypass_netdev;
>+		}
>+	}
>+	spin_unlock(&bypass_lock);
>+	return NULL;
>+}
>+
>+static int bypass_slave_register(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev;
>+	struct bypass_ops *bypass_ops;
>+	int ret, orig_mtu;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);

For master, could you use word "master" in the variables so it is clear?
Also, "dev" is fine instead of "netdev".
Something like "bpmaster_dev"


>+	if (!bypass_netdev)
>+		goto done;
>+
>+	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>+					bypass_ops);
>+	if (ret != 0)

	Just "if (ret)" will do. You have this in more places.


>+		goto done;
>+
>+	ret = netdev_rx_handler_register(slave_netdev,
>+					 bypass_ops ? bypass_ops->handle_frame :
>+					 bypass_handle_frame, bypass_netdev);
>+	if (ret != 0) {
>+		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>+			   ret);
>+		goto done;
>+	}
>+
>+	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>+	if (ret != 0) {
>+		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>+			   bypass_netdev->name, ret);
>+		goto upper_link_failed;
>+	}
>+
>+	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>+
>+	if (netif_running(bypass_netdev)) {
>+		ret = dev_open(slave_netdev);
>+		if (ret && (ret != -EBUSY)) {
>+			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>+				   slave_netdev->name, ret);
>+			goto err_interface_up;
>+		}
>+	}
>+
>+	/* Align MTU of slave with master */
>+	orig_mtu = slave_netdev->mtu;
>+	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>+	if (ret != 0) {
>+		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>+			   slave_netdev->name, bypass_netdev->mtu);
>+		goto err_set_mtu;
>+	}
>+
>+	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>+	if (ret != 0)
>+		goto err_join;
>+
>+	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>+		    slave_netdev->name);
>+
>+	goto done;
>+
>+err_join:
>+	dev_set_mtu(slave_netdev, orig_mtu);
>+err_set_mtu:
>+	dev_close(slave_netdev);
>+err_interface_up:
>+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+upper_link_failed:
>+	netdev_rx_handler_unregister(slave_netdev);
>+done:
>+	return NOTIFY_DONE;
>+}
>+
>+static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>+				       struct net_device *bypass_netdev,
>+				       struct bypass_ops *bypass_ops)
>+{
>+	struct net_device *backup_netdev, *active_netdev;
>+	struct bypass_info *bi;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_pre_unregister)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_pre_unregister(slave_netdev,
>+							bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+		return -EINVAL;
>+
>+	return 0;
>+}
>+
>+static int bypass_slave_release(struct net_device *slave_netdev,
>+				struct net_device *bypass_netdev,
>+				struct bypass_ops *bypass_ops)
>+{
>+	struct net_device *backup_netdev, *active_netdev;
>+	struct bypass_info *bi;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_release)
>+			return -EINVAL;

I think it would be good to make the API to the driver more strict and
have a separate set of ops for "active" and "backup" netdevices.
That should stop people thinking about extending this to more slaves in
the future.
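
A possible shape for such a stricter API, purely as an illustration:
the struct and function names below (model_slave_ops, model_bypass_ops,
model_slave_join and so on) are invented for this sketch and are not
part of the patch. Each role gets its own fixed slot of callbacks, so
there is no generic "slave" entry point that could later grow a third
device.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct model_dev {
	const char *name;
	struct model_dev *active;	/* used on the master only */
	struct model_dev *backup;	/* used on the master only */
};

/* Callbacks for exactly one role. */
struct model_slave_ops {
	int (*join)(struct model_dev *slave, struct model_dev *master);
	int (*release)(struct model_dev *slave, struct model_dev *master);
};

/* Two fixed slots: the API itself says there are only two slaves. */
struct model_bypass_ops {
	struct model_slave_ops active;
	struct model_slave_ops backup;
};

enum model_role { MODEL_ROLE_ACTIVE, MODEL_ROLE_BACKUP };

/* Dispatch a join to the ops of the given role; -EINVAL if not provided. */
static int model_slave_join(const struct model_bypass_ops *ops,
			    enum model_role role,
			    struct model_dev *slave, struct model_dev *master)
{
	const struct model_slave_ops *sops =
		role == MODEL_ROLE_ACTIVE ? &ops->active : &ops->backup;

	if (!sops->join)
		return -EINVAL;
	return sops->join(slave, master);
}

/* Example driver callbacks, one per role. */
static int model_active_join(struct model_dev *slave, struct model_dev *master)
{
	if (master->active)
		return -EEXIST;
	master->active = slave;
	return 0;
}

static int model_backup_join(struct model_dev *slave, struct model_dev *master)
{
	if (master->backup)
		return -EEXIST;
	master->backup = slave;
	return 0;
}
```

A driver would fill in both slots, for example
{ .active = { .join = model_active_join }, .backup = { .join =
model_backup_join } }, and the core would pick the slot from the role
it already computes.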



>+
>+		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev == backup_netdev) {
>+		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>+	} else {
>+		RCU_INIT_POINTER(bi->active_netdev, NULL);
>+		if (backup_netdev) {
>+			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+		}
>+	}
>+
>+	dev_put(slave_netdev);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>+		    slave_netdev->name);
>+
>+	return 0;
>+}
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev;
>+	struct bypass_ops *bypass_ops;
>+	int ret;
>+
>+	if (!netif_is_bypass_slave(slave_netdev))
>+		goto done;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);
>+	if (!bypass_netdev)
>+		goto done;
>+
>+	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>+					  bypass_ops);
>+	if (ret != 0)
>+		goto done;
>+
>+	netdev_rx_handler_unregister(slave_netdev);
>+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+
>+	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>+		    slave_netdev->name);
>+
>+done:
>+	return NOTIFY_DONE;
>+}
>+EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>+
>+static bool bypass_xmit_ready(struct net_device *dev)
>+{
>+	return netif_running(dev) && netif_carrier_ok(dev);
>+}
>+
>+static int bypass_slave_link_change(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>+	struct bypass_ops *bypass_ops;
>+	struct bypass_info *bi;
>+
>+	if (!netif_is_bypass_slave(slave_netdev))
>+		goto done;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);
>+	if (!bypass_netdev)
>+		goto done;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_link_change)
>+			goto done;
>+
>+		return bypass_ops->slave_link_change(slave_netdev,
>+						     bypass_netdev);
>+	}
>+
>+	if (!netif_running(bypass_netdev))
>+		return 0;
>+
>+	bi = netdev_priv(bypass_netdev);
>+
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+		goto done;

You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
above is enough.


>+
>+	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>+	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>+		netif_carrier_on(bypass_netdev);
>+		netif_tx_wake_all_queues(bypass_netdev);
>+	} else {
>+		netif_carrier_off(bypass_netdev);
>+		netif_tx_stop_all_queues(bypass_netdev);
>+	}
>+
>+done:
>+	return NOTIFY_DONE;
>+}
>+
>+static bool bypass_validate_event_dev(struct net_device *dev)
>+{
>+	/* Skip parent events */
>+	if (netif_is_bypass_master(dev))
>+		return false;
>+
>+	/* Avoid non-Ethernet type devices */
>+	if (dev->type != ARPHRD_ETHER)
>+		return false;
>+
>+	/* Avoid Vlan dev with same MAC registering as VF */
>+	if (is_vlan_dev(dev))
>+		return false;
>+
>+	/* Avoid Bonding master dev with same MAC registering as slave dev */
>+	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))

Yeah, this is certainly incorrect. For one thing, you should be using the
helper netif_is_bond_master().
But what about the rest? macsec, macvlan, team, bridge, ovs and others?

You need to do this by whitelisting, not by blacklisting. You need
to whitelist VF devices. My port flavours patchset might help with this.
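
The difference between the two approaches can be shown with a small
userspace model. Everything here is an assumption made for the sketch:
enum model_flavour is a hypothetical per-device tag (loosely in the
spirit of a port flavour), not an existing kernel field, and the two
validate functions only imitate the structure of
bypass_validate_event_dev().

```c
#include <stdbool.h>

#define MODEL_ARPHRD_ETHER 1

/* Hypothetical device classification; not a real kernel attribute. */
enum model_flavour {
	MODEL_FLAVOUR_PHYSICAL,
	MODEL_FLAVOUR_VF,
	MODEL_FLAVOUR_VLAN,
	MODEL_FLAVOUR_BOND,
	MODEL_FLAVOUR_MACVLAN,
	MODEL_FLAVOUR_BRIDGE,
};

/* Hypothetical stand-in for struct net_device. */
struct model_dev {
	int type;
	enum model_flavour flavour;
};

/* Blacklist: reject the kinds we happened to think of, accept the rest. */
static bool model_validate_blacklist(const struct model_dev *dev)
{
	if (dev->type != MODEL_ARPHRD_ETHER)
		return false;
	if (dev->flavour == MODEL_FLAVOUR_VLAN)
		return false;
	if (dev->flavour == MODEL_FLAVOUR_BOND)
		return false;
	return true;	/* macvlan, bridge, ... slip through */
}

/* Whitelist: accept only Ethernet VFs, reject everything else. */
static bool model_validate_whitelist(const struct model_dev *dev)
{
	return dev->type == MODEL_ARPHRD_ETHER &&
	       dev->flavour == MODEL_FLAVOUR_VF;
}
```

With the blacklist, every new stacked device type needs another
exclusion; with the whitelist, nothing but a VF can ever be picked up
as a slave.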


>+		return false;
>+
>+	return true;
>+}
>+
>+static int
>+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>+{
>+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>+
>+	if (!bypass_validate_event_dev(event_dev))
>+		return NOTIFY_DONE;
>+
>+	switch (event) {
>+	case NETDEV_REGISTER:
>+		return bypass_slave_register(event_dev);
>+	case NETDEV_UNREGISTER:
>+		return bypass_slave_unregister(event_dev);
>+	case NETDEV_UP:
>+	case NETDEV_DOWN:
>+	case NETDEV_CHANGE:
>+		return bypass_slave_link_change(event_dev);
>+	default:
>+		return NOTIFY_DONE;
>+	}
>+}
>+
>+static struct notifier_block bypass_notifier = {
>+	.notifier_call = bypass_event,
>+};
>+
>+int bypass_open(struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *active_netdev, *backup_netdev;
>+	int err;
>+
>+	netif_carrier_off(dev);
>+	netif_tx_wake_all_queues(dev);
>+
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	if (active_netdev) {
>+		err = dev_open(active_netdev);
>+		if (err)
>+			goto err_active_open;
>+	}
>+
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+	if (backup_netdev) {
>+		err = dev_open(backup_netdev);
>+		if (err)
>+			goto err_backup_open;
>+	}
>+
>+	return 0;
>+
>+err_backup_open:
>+	dev_close(active_netdev);
>+err_active_open:
>+	netif_tx_disable(dev);
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_open);
>+
>+int bypass_close(struct net_device *dev)
>+{
>+	struct bypass_info *vi = netdev_priv(dev);

This should probably be "bi".


>+	struct net_device *slave_netdev;
>+
>+	netif_tx_disable(dev);
>+
>+	slave_netdev = rtnl_dereference(vi->active_netdev);
>+	if (slave_netdev)
>+		dev_close(slave_netdev);
>+
>+	slave_netdev = rtnl_dereference(vi->backup_netdev);
>+	if (slave_netdev)
>+		dev_close(slave_netdev);
>+
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_close);
>+
>+static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+	atomic_long_inc(&dev->tx_dropped);
>+	dev_kfree_skb_any(skb);
>+	return NETDEV_TX_OK;
>+}
>+
>+netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);

If you rename the other variable to "bpmaster_dev", it would be nice to
rename this to bpinfo or something more descriptive. "bi" is too short
to know what that is right away.


>+	struct net_device *xmit_dev;

Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.



>+
>+	/* Try xmit via active netdev followed by backup netdev */
>+	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>+	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>+		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>+		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>+			return bypass_drop_xmit(skb, dev);
>+	}
>+
>+	skb->dev = xmit_dev;
>+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>+
>+	return dev_queue_xmit(skb);
>+}
>+EXPORT_SYMBOL_GPL(bypass_start_xmit);
>+
>+u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>+			void *accel_priv, select_queue_fallback_t fallback)
>+{
>+	/* This helper function exists to help dev_pick_tx get the correct
>+	 * destination queue.  Using a helper function skips a call to
>+	 * skb_tx_hash and will put the skbs in the queue we expect on their
>+	 * way down to the bonding driver.
>+	 */
>+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>+
>+	/* Save the original txq to restore before passing to the driver */
>+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>+
>+	if (unlikely(txq >= dev->real_num_tx_queues)) {
>+		do {
>+			txq -= dev->real_num_tx_queues;
>+		} while (txq >= dev->real_num_tx_queues);
>+	}
>+
>+	return txq;
>+}
>+EXPORT_SYMBOL_GPL(bypass_select_queue);
>+
>+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>+ * that some drivers can provide 32bit values only.
>+ */
>+static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>+			      const struct rtnl_link_stats64 *_new,
>+			      const struct rtnl_link_stats64 *_old)
>+{
>+	const u64 *new = (const u64 *)_new;
>+	const u64 *old = (const u64 *)_old;
>+	u64 *res = (u64 *)_res;
>+	int i;
>+
>+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>+		u64 nv = new[i];
>+		u64 ov = old[i];
>+		s64 delta = nv - ov;
>+
>+		/* detects if this particular field is 32bit only */
>+		if (((nv | ov) >> 32) == 0)
>+			delta = (s64)(s32)((u32)nv - (u32)ov);
>+
>+		/* filter anomalies, some drivers reset their stats
>+		 * at down/up events.
>+		 */
>+		if (delta > 0)
>+			res[i] += delta;
>+	}
>+}
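
For readers following the delta logic above, here is the per-counter
step pulled out into a standalone userspace function. This is a sketch
that mirrors the arithmetic of the quoted loop body for a single
counter; model_fold_counter() is a name made up for the example, and
stdint types replace the kernel's u64/s64/u32/s32.

```c
#include <stdint.h>

/*
 * Add the increase from 'ov' (old value) to 'nv' (new value) onto *res.
 * If both values fit in 32 bits, the counter is assumed to be a 32-bit
 * one and the difference is taken modulo 2^32, so a wrap from
 * 0xfffffffe to 5 counts as +7. Non-positive deltas (for example a
 * driver resetting its stats on down/up) are ignored.
 */
static void model_fold_counter(uint64_t *res, uint64_t nv, uint64_t ov)
{
	int64_t delta = (int64_t)(nv - ov);

	/* detect a counter that only ever holds 32-bit values */
	if (((nv | ov) >> 32) == 0)
		delta = (int64_t)(int32_t)((uint32_t)nv - (uint32_t)ov);

	if (delta > 0)
		*res += (uint64_t)delta;
}
```

Note that the 32-bit detection is a heuristic: a genuine 64-bit counter
that is still below 2^32 takes the same path, which gives the same
result as long as it has not gone backwards.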
>+
>+void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);

You can WARN_ON and return in case the dev is not a bypass master, just
to catch buggy drivers. Same with the other helpers.
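
Roughly what such a guard looks like, modelled in userspace. The names
are invented for the sketch: model_warn_on() only imitates the "warn
and report the condition" behaviour of the kernel's WARN_ON(), and
model_get_rx_packets() stands in for an exported helper such as
bypass_get_stats().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MODEL_IFF_BYPASS (1u << 27)	/* mirrors the bit used in the patch */

/* Hypothetical stand-in for struct net_device plus its private stats. */
struct model_dev {
	unsigned int priv_flags;
	uint64_t rx_packets;
};

/* Print a warning when 'cond' is true and hand the condition back. */
static bool model_warn_on(bool cond, const char *what)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", what);
	return cond;
}

/*
 * Exported-style helper: refuse to touch the private area of a device
 * that is not a bypass master, and leave the output untouched.
 */
static void model_get_rx_packets(const struct model_dev *dev, uint64_t *out)
{
	if (model_warn_on(!(dev->priv_flags & MODEL_IFF_BYPASS),
			  "helper called on a non-bypass-master device"))
		return;

	*out = dev->rx_packets;
}
```

In the kernel the same pattern would be a single
"if (WARN_ON(!netif_is_bypass_master(dev))) return;" at the top of each
exported helper.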


>+	const struct rtnl_link_stats64 *new;
>+	struct rtnl_link_stats64 temp;
>+	struct net_device *slave_netdev;
>+
>+	spin_lock(&bi->stats_lock);
>+	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>+
>+	rcu_read_lock();
>+
>+	slave_netdev = rcu_dereference(bi->active_netdev);
>+	if (slave_netdev) {
>+		new = dev_get_stats(slave_netdev, &temp);
>+		bypass_fold_stats(stats, new, &bi->active_stats);
>+		memcpy(&bi->active_stats, new, sizeof(*new));
>+	}
>+
>+	slave_netdev = rcu_dereference(bi->backup_netdev);
>+	if (slave_netdev) {
>+		new = dev_get_stats(slave_netdev, &temp);
>+		bypass_fold_stats(stats, new, &bi->backup_stats);
>+		memcpy(&bi->backup_stats, new, sizeof(*new));
>+	}
>+
>+	rcu_read_unlock();
>+
>+	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>+	spin_unlock(&bi->stats_lock);
>+}
>+EXPORT_SYMBOL_GPL(bypass_get_stats);
>+
>+int bypass_change_mtu(struct net_device *dev, int new_mtu)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *active_netdev, *backup_netdev;
>+	int ret = 0;

Pointless initialization.


>+
>+	active_netdev = rcu_dereference(bi->active_netdev);
>+	if (active_netdev) {
>+		ret = dev_set_mtu(active_netdev, new_mtu);
>+		if (ret)
>+			return ret;
>+	}
>+
>+	backup_netdev = rcu_dereference(bi->backup_netdev);
>+	if (backup_netdev) {
>+		ret = dev_set_mtu(backup_netdev, new_mtu);
>+		if (ret) {
>+			dev_set_mtu(active_netdev, dev->mtu);
>+			return ret;
>+		}
>+	}
>+
>+	dev->mtu = new_mtu;
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_change_mtu);
>+
>+void bypass_set_rx_mode(struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *slave_netdev;
>+
>+	rcu_read_lock();
>+
>+	slave_netdev = rcu_dereference(bi->active_netdev);
>+	if (slave_netdev) {
>+		dev_uc_sync_multiple(slave_netdev, dev);
>+		dev_mc_sync_multiple(slave_netdev, dev);
>+	}
>+
>+	slave_netdev = rcu_dereference(bi->backup_netdev);
>+	if (slave_netdev) {
>+		dev_uc_sync_multiple(slave_netdev, dev);
>+		dev_mc_sync_multiple(slave_netdev, dev);
>+	}
>+
>+	rcu_read_unlock();
>+}
>+EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>+
>+static const struct net_device_ops bypass_netdev_ops = {
>+	.ndo_open		= bypass_open,
>+	.ndo_stop		= bypass_close,
>+	.ndo_start_xmit		= bypass_start_xmit,
>+	.ndo_select_queue	= bypass_select_queue,
>+	.ndo_get_stats64	= bypass_get_stats,
>+	.ndo_change_mtu		= bypass_change_mtu,
>+	.ndo_set_rx_mode	= bypass_set_rx_mode,
>+	.ndo_validate_addr	= eth_validate_addr,
>+	.ndo_features_check	= passthru_features_check,
>+};
>+
>+#define BYPASS_DRV_NAME "bypass"
>+#define BYPASS_DRV_VERSION "0.1"
>+
>+static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>+				       struct ethtool_drvinfo *drvinfo)
>+{
>+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>+}
>+
>+int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>+				      struct ethtool_link_ksettings *cmd)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *slave_netdev;
>+
>+	slave_netdev = rtnl_dereference(bi->active_netdev);
>+	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+		slave_netdev = rtnl_dereference(bi->backup_netdev);
>+		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+			cmd->base.duplex = DUPLEX_UNKNOWN;
>+			cmd->base.port = PORT_OTHER;
>+			cmd->base.speed = SPEED_UNKNOWN;
>+
>+			return 0;
>+		}
>+	}
>+
>+	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>+}
>+EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>+
>+static const struct ethtool_ops bypass_ethtool_ops = {
>+	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>+	.get_link               = ethtool_op_get_link,
>+	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>+};
>+
>+static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>+{
>+	struct net *net = dev_net(bypass_netdev);
>+	struct net_device *dev;
>+
>+	rtnl_lock();
>+	for_each_netdev(net, dev) {
>+		if (dev == bypass_netdev)
>+			continue;
>+		if (!bypass_validate_event_dev(dev))
>+			continue;
>+		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>+			bypass_slave_register(dev);
>+	}
>+	rtnl_unlock();
>+}
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct bypass_master **pbypass_master)
>+{
>+	struct bypass_master *bypass_master;
>+
>+	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>+	if (!bypass_master)
>+		return -ENOMEM;
>+
>+	rcu_assign_pointer(bypass_master->ops, ops);
>+	dev_hold(dev);
>+	dev->priv_flags |= IFF_BYPASS;
>+	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>+
>+	spin_lock(&bypass_lock);
>+	list_add_tail(&bypass_master->list, &bypass_master_list);
>+	spin_unlock(&bypass_lock);
>+
>+	bypass_register_existing_slave(dev);
>+
>+	*pbypass_master = bypass_master;
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_register);
>+
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+	struct net_device *bypass_netdev;
>+
>+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+
>+	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>+	dev_put(bypass_netdev);
>+
>+	spin_lock(&bypass_lock);
>+	list_del(&bypass_master->list);
>+	spin_unlock(&bypass_lock);
>+
>+	kfree(bypass_master);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_unregister);
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master)
>+{
>+	struct device *dev = backup_netdev->dev.parent;
>+	struct net_device *bypass_netdev;
>+	int err;
>+
>+	/* Alloc at least 2 queues, for now we are going with 16 assuming
>+	 * that most devices being bonded won't have too many queues.
>+	 */
>+	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>+	if (!bypass_netdev) {
>+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>+		return -ENOMEM;
>+	}
>+
>+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>+	SET_NETDEV_DEV(bypass_netdev, dev);
>+
>+	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>+	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>+
>+	/* Initialize the device options */
>+	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>+				       IFF_TX_SKB_SHARING);
>+
>+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>+	bypass_netdev->features |= NETIF_F_LLTX;
>+
>+	/* Don't allow bypass devices to change network namespaces. */
>+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>+
>+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>+
>+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>+	bypass_netdev->features |= bypass_netdev->hw_features;
>+
>+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>+	       bypass_netdev->addr_len);
>+
>+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+
>+	err = register_netdev(bypass_netdev);
>+	if (err < 0) {
>+		dev_err(dev, "Unable to register bypass_netdev!\n");
>+		goto err_register_netdev;
>+	}
>+
>+	netif_carrier_off(bypass_netdev);
>+
>+	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>+	if (err < 0)

just "if (err)" would do.


>+		goto err_bypass;
>+
>+	return 0;
>+
>+err_bypass:
>+	unregister_netdev(bypass_netdev);
>+err_register_netdev:
>+	free_netdev(bypass_netdev);
>+
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_create);
>+
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+	struct net_device *bypass_netdev;
>+	struct net_device *slave_netdev;
>+	struct bypass_info *bi;
>+
>+	if (!bypass_master)
>+		return;
>+
>+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+	bi = netdev_priv(bypass_netdev);
>+
>+	netif_device_detach(bypass_netdev);
>+
>+	rtnl_lock();
>+
>+	slave_netdev = rtnl_dereference(bi->active_netdev);
>+	if (slave_netdev)
>+		bypass_slave_unregister(slave_netdev);
>+
>+	slave_netdev = rtnl_dereference(bi->backup_netdev);
>+	if (slave_netdev)
>+		bypass_slave_unregister(slave_netdev);
>+
>+	bypass_master_unregister(bypass_master);
>+
>+	unregister_netdevice(bypass_netdev);
>+
>+	rtnl_unlock();
>+
>+	free_netdev(bypass_netdev);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_destroy);
>+
>+static __init int
>+bypass_init(void)
>+{
>+	register_netdevice_notifier(&bypass_notifier);
>+
>+	return 0;
>+}
>+module_init(bypass_init);
>+
>+static __exit
>+void bypass_exit(void)
>+{
>+	unregister_netdevice_notifier(&bypass_notifier);
>+}
>+module_exit(bypass_exit);
>+
>+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>+MODULE_LICENSE("GPL v2");
>-- 
>2.14.3
>


* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-11 15:51   ` Jiri Pirko
@ 2018-04-11 19:13     ` Samudrala, Sridhar
  2018-04-18  9:25       ` Jiri Pirko
  0 siblings, 1 reply; 63+ messages in thread
From: Samudrala, Sridhar @ 2018-04-11 19:13 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh

On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> This provides a generic interface for paravirtual drivers to listen
>> for netdev register/unregister/link change events from pci ethernet
>> devices with the same MAC and takeover their datapath. The notifier and
>> event handling code is based on the existing netvsc implementation.
>>
>> It exposes 2 sets of interfaces to the paravirtual drivers.
>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> master netdev is created. The paravirtual driver registers each bypass
>> instance along with a set of ops to manage the slave events.
>>      bypass_master_register()
>>      bypass_master_unregister()
>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> the bypass module provides interfaces to create/destroy additional master
>> netdev and all the slave events are managed internally.
>>       bypass_master_create()
>>       bypass_master_destroy()
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>> include/linux/netdevice.h |  14 +
>> include/net/bypass.h      |  96 ++++++
>> net/Kconfig               |  18 +
>> net/core/Makefile         |   1 +
>> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 973 insertions(+)
>> create mode 100644 include/net/bypass.h
>> create mode 100644 net/core/bypass.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cf44503ea81a..587293728f70 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> 	IFF_PHONY_HEADROOM		= 1<<24,
>> 	IFF_MACSEC			= 1<<25,
>> 	IFF_NO_RX_HANDLER		= 1<<26,
>> +	IFF_BYPASS			= 1 << 27,
>> +	IFF_BYPASS_SLAVE		= 1 << 28,
> I wonder why you don't follow the existing coding style... Also, please
> add these into the comment above.

I used this style to avoid checkpatch warnings. If it is OK to ignore these warnings,
I can switch back to the existing coding style to be consistent.

>
>
>> };
>>
>> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> #define IFF_MACSEC			IFF_MACSEC
>> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> +#define IFF_BYPASS			IFF_BYPASS
>> +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>>
>> /**
>>   *	struct net_device - The DEVICE structure.
>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> }
>>
>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_BYPASS;
>> +}
>> +
>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> +}
>> +
>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> static inline void netif_keep_dst(struct net_device *dev)
>> {
>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>> new file mode 100644
>> index 000000000000..86b02cb894cf
>> --- /dev/null
>> +++ b/include/net/bypass.h
>> @@ -0,0 +1,96 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +#ifndef _NET_BYPASS_H
>> +#define _NET_BYPASS_H
>> +
>> +#include <linux/netdevice.h>
>> +
>> +struct bypass_ops {
>> +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> +				  struct net_device *bypass_netdev);
>> +	int (*slave_join)(struct net_device *slave_netdev,
>> +			  struct net_device *bypass_netdev);
>> +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> +				    struct net_device *bypass_netdev);
>> +	int (*slave_release)(struct net_device *slave_netdev,
>> +			     struct net_device *bypass_netdev);
>> +	int (*slave_link_change)(struct net_device *slave_netdev,
>> +				 struct net_device *bypass_netdev);
>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> +};
>> +
>> +struct bypass_master {
>> +	struct list_head list;
>> +	struct net_device __rcu *bypass_netdev;
>> +	struct bypass_ops __rcu *ops;
>> +};
>> +
>> +/* bypass state */
>> +struct bypass_info {
>> +	/* passthru netdev with same MAC */
>> +	struct net_device __rcu *active_netdev;
> You still use "active"/"backup" names, which is highly misleading as
> it has a completely different meaning than in bond, for example.
> I noted that in my previous review already. Please change it.

I guess the issue is only with the 'active' name. 'backup' should be fine, as it also
matches the BACKUP feature bit we are adding to virtio_net.

As for alternate names for 'active', you suggested 'stolen', but I
am not too happy with it.
netvsc uses vf_netdev; are you OK with that? Another option is 'passthru'.



>
>
>> +
>> +	/* virtio_net netdev */
>> +	struct net_device __rcu *backup_netdev;
>> +
>> +	/* active netdev stats */
>> +	struct rtnl_link_stats64 active_stats;
>> +
>> +	/* backup netdev stats */
>> +	struct rtnl_link_stats64 backup_stats;
>> +
>> +	/* aggregated stats */
>> +	struct rtnl_link_stats64 bypass_stats;
>> +
>> +	/* spinlock while updating stats */
>> +	spinlock_t stats_lock;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET_BYPASS)
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master);
>> +void bypass_master_destroy(struct bypass_master *bypass_master);
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct bypass_master **pbypass_master);
>> +void bypass_master_unregister(struct bypass_master *bypass_master);
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev);
>> +
>> +#else
>> +
>> +static inline
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct pbypass_master **pbypass_master);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> +	return 0;
>> +}
>> +
>> +#endif
>> +
>> +#endif /* _NET_BYPASS_H */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 0428f12c25c2..994445f4a96a 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>> 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
>> 	  devlink is a loadable module and the driver using it is built-in.
>>
>> +config NET_BYPASS
>> +	tristate "Bypass interface"
>> +	---help---
>> +	  This provides a generic interface for paravirtual drivers to listen
>> +	  for netdev register/unregister/link change events from pci ethernet
>> +	  devices with the same MAC and takeover their datapath. This also
>> +	  enables live migration of a VM with direct attached VF by failing
>> +	  over to the paravirtual datapath when the VF is unplugged.
>> +
>> +config MAY_USE_BYPASS
>> +	tristate
>> +	default m if NET_BYPASS=m
>> +	default y if NET_BYPASS=y || NET_BYPASS=n
>> +	help
>> +	  Drivers using the bypass infrastructure should have a dependency
>> +	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>> +	  bypass is a loadable module and the driver using it is built-in.
>> +
>> endif   # if NET
>>
>> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 6dbbba8c57ae..a9727ed1c8fc 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>> obj-$(CONFIG_HWBM) += hwbm.o
>> obj-$(CONFIG_NET_DEVLINK) += devlink.o
>> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> +obj-$(CONFIG_NET_BYPASS) += bypass.o
>> diff --git a/net/core/bypass.c b/net/core/bypass.c
>> new file mode 100644
>> index 000000000000..b5b9cb554c3f
>> --- /dev/null
>> +++ b/net/core/bypass.c
>> @@ -0,0 +1,844 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +/* A common module to handle registrations and notifications for paravirtual
>> + * drivers to enable accelerated datapath and support VF live migration.
>> + *
>> + * The notifier and event handling code is based on netvsc driver.
>> + */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/if_vlan.h>
>> +#include <linux/pci.h>
>> +#include <net/sch_generic.h>
>> +#include <uapi/linux/if_arp.h>
>> +#include <net/bypass.h>
>> +
>> +static LIST_HEAD(bypass_master_list);
>> +static DEFINE_SPINLOCK(bypass_lock);
>> +
>> +static int bypass_slave_pre_register(struct net_device *slave_netdev,
>> +				     struct net_device *bypass_netdev,
>> +				     struct bypass_ops *bypass_ops)
>> +{
>> +	struct bypass_info *bi;
>> +	bool backup;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_pre_register)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_pre_register(slave_netdev,
>> +						      bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> +	if (backup ? rtnl_dereference(bi->backup_netdev) :
>> +			rtnl_dereference(bi->active_netdev)) {
>> +		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>> +			   slave_netdev->name, backup ? "backup" : "active");
>> +		return -EEXIST;
>> +	}
>> +
>> +	/* Avoid non pci devices as active netdev */
>> +	if (!backup && (!slave_netdev->dev.parent ||
>> +			!dev_is_pci(slave_netdev->dev.parent)))
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int bypass_slave_join(struct net_device *slave_netdev,
>> +			     struct net_device *bypass_netdev,
>> +			     struct bypass_ops *bypass_ops)
>> +{
>> +	struct bypass_info *bi;
>> +	bool backup;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_join)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> +
>> +	dev_hold(slave_netdev);
>> +
>> +	if (backup) {
>> +		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>> +		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>> +	} else {
>> +		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>> +		dev_get_stats(bi->active_netdev, &bi->active_stats);
>> +		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>> +		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>> +	}
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>> +		    slave_netdev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +/* Called when slave dev is injecting data into network stack.
>> + * Change the associated network device from lower dev to virtio.
>> + * note: already called with rcu_read_lock
>> + */
>> +static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>> +{
>> +	struct sk_buff *skb = *pskb;
>> +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> +
>> +	skb->dev = ndev;
>> +
>> +	return RX_HANDLER_ANOTHER;
>> +}
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> +						  struct bypass_ops **ops)
>> +{
>> +	struct bypass_master *bypass_master;
>> +	struct net_device *bypass_netdev;
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
> As I wrote the last time, you don't need this list, spinlock.
> You can do just something like:
>          for_each_net(net) {
>                  for_each_netdev(net, dev) {
> 			if (netif_is_bypass_master(dev)) {

This function returns the upper netdev as well as the ops associated
with that netdev.
bypass_master_list is a list of 'struct bypass_master' entries, each of which
associates a 'bypass_netdev' with its 'bypass_ops'; entries are added via
bypass_master_register().
We need 'ops' only to support the 2-netdev model of netvsc. ops will be
NULL for the 3-netdev model.
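
To illustrate the association described above, here is a minimal userspace
sketch, not the patch's code. The demo_* types and functions are hypothetical
stand-ins for struct net_device, struct bypass_ops and struct bypass_master;
locking (bypass_lock, RTNL) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical stand-ins for struct bypass_ops / struct net_device. */
struct demo_ops { int unused; };

struct demo_dev {
	unsigned char perm_addr[ETH_ALEN];
};

/* One registry entry: pairs a master device with its (optional) ops. */
struct demo_master {
	struct demo_master *next;
	struct demo_dev *dev;
	struct demo_ops *ops;	/* NULL in the 3-netdev model */
};

static struct demo_master *demo_master_list;

/* Record the dev/ops pair and link the entry into the registry. */
static void demo_master_register(struct demo_master *m, struct demo_dev *dev,
				 struct demo_ops *ops)
{
	assert(m && dev);
	m->dev = dev;
	m->ops = ops;
	m->next = demo_master_list;
	demo_master_list = m;
}

/* Return the master whose permanent MAC matches; hand back its ops too. */
static struct demo_dev *demo_master_get_bymac(const unsigned char *mac,
					      struct demo_ops **ops)
{
	struct demo_master *m;

	for (m = demo_master_list; m; m = m->next) {
		if (memcmp(m->dev->perm_addr, mac, ETH_ALEN) == 0) {
			*ops = m->ops;
			return m->dev;
		}
	}
	return NULL;
}
```

A plain walk over all netdevs checking netif_is_bypass_master() would find
the device, but it would not give the caller the ops pointer, which is the
extra piece the list carries here.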


>
>
>
>
>> +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> +			*ops = rcu_dereference(bypass_master->ops);
> I don't see how rcu_dereference is ok here.
> 1) I don't see rcu_read_lock taken
> 2) Looks like bypass_master->ops has the same value across the whole
>     existence.

We hold rtnl_lock(), so I think I need to change this to rtnl_dereference().
Yes, ops doesn't change.

>
>
>> +			spin_unlock(&bypass_lock);
>> +			return bypass_netdev;
>> +		}
>> +	}
>> +	spin_unlock(&bypass_lock);
>> +	return NULL;
>> +}
>> +
>> +static int bypass_slave_register(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	int ret, orig_mtu;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
> For master, could you use word "master" in the variables so it is clear?
> Also, "dev" is fine instead of "netdev".
> Something like "bpmaster_dev"

bypass_master is of type struct bypass_master; bypass_netdev is of type struct net_device.
I can change all _netdev suffixes to _dev to make the names shorter.


>
>
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> +					bypass_ops);
>> +	if (ret != 0)
> 	Just "if (ret)" will do. You have this in more places.

OK.


>
>
>> +		goto done;
>> +
>> +	ret = netdev_rx_handler_register(slave_netdev,
>> +					 bypass_ops ? bypass_ops->handle_frame :
>> +					 bypass_handle_frame, bypass_netdev);
>> +	if (ret != 0) {
>> +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> +			   ret);
>> +		goto done;
>> +	}
>> +
>> +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> +	if (ret != 0) {
>> +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> +			   bypass_netdev->name, ret);
>> +		goto upper_link_failed;
>> +	}
>> +
>> +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> +
>> +	if (netif_running(bypass_netdev)) {
>> +		ret = dev_open(slave_netdev);
>> +		if (ret && (ret != -EBUSY)) {
>> +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> +				   slave_netdev->name, ret);
>> +			goto err_interface_up;
>> +		}
>> +	}
>> +
>> +	/* Align MTU of slave with master */
>> +	orig_mtu = slave_netdev->mtu;
>> +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> +	if (ret != 0) {
>> +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> +			   slave_netdev->name, bypass_netdev->mtu);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> +	if (ret != 0)
>> +		goto err_join;
>> +
>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> +		    slave_netdev->name);
>> +
>> +	goto done;
>> +
>> +err_join:
>> +	dev_set_mtu(slave_netdev, orig_mtu);
>> +err_set_mtu:
>> +	dev_close(slave_netdev);
>> +err_interface_up:
>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +upper_link_failed:
>> +	netdev_rx_handler_unregister(slave_netdev);
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> +				       struct net_device *bypass_netdev,
>> +				       struct bypass_ops *bypass_ops)
>> +{
>> +	struct net_device *backup_netdev, *active_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_pre_unregister)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> +							bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int bypass_slave_release(struct net_device *slave_netdev,
>> +				struct net_device *bypass_netdev,
>> +				struct bypass_ops *bypass_ops)
>> +{
>> +	struct net_device *backup_netdev, *active_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_release)
>> +			return -EINVAL;
> I think it would be good to make the API to the driver more strict and
> have a separate set of ops for "active" and "backup" netdevices.
> That should stop people thinking about extending this to more slaves in
> the future.

We have checks in slave_pre_register() that allow only 1 'backup' and 1
'active' slave.
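
As a rough illustration of that check, the following userspace sketch models
the two slots and rejects a second slave of either kind with -EEXIST. The
demo_* names are hypothetical; the real code decides 'backup' vs 'active' by
comparing dev.parent pointers, which is reduced to a bool here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a slave net_device. */
struct demo_slave { int id; };

/* Hypothetical stand-in for struct bypass_info: exactly two slots. */
struct demo_info {
	struct demo_slave *active;
	struct demo_slave *backup;
};

/* Refuse registration if the slot for this kind of slave is taken. */
static int demo_slave_pre_register(const struct demo_info *bi, bool backup)
{
	assert(bi);
	if (backup ? bi->backup != NULL : bi->active != NULL)
		return -EEXIST;
	return 0;
}

/* Run the pre-register check, then fill the matching slot. */
static int demo_slave_join(struct demo_info *bi, struct demo_slave *slave,
			   bool backup)
{
	int ret = demo_slave_pre_register(bi, backup);

	if (ret)
		return ret;

	if (backup)
		bi->backup = slave;
	else
		bi->active = slave;
	return 0;
}
```

With one virtio slave and one VF joined, any further demo_slave_join() call
fails, whichever kind it claims to be.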


>
>
>
>> +
>> +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev == backup_netdev) {
>> +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> +	} else {
>> +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> +		if (backup_netdev) {
>> +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +		}
>> +	}
>> +
>> +	dev_put(slave_netdev);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> +		    slave_netdev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	int ret;
>> +
>> +	if (!netif_is_bypass_slave(slave_netdev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> +					  bypass_ops);
>> +	if (ret != 0)
>> +		goto done;
>> +
>> +	netdev_rx_handler_unregister(slave_netdev);
>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +
>> +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> +		    slave_netdev->name);
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> +
>> +static bool bypass_xmit_ready(struct net_device *dev)
>> +{
>> +	return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	struct bypass_info *bi;
>> +
>> +	if (!netif_is_bypass_slave(slave_netdev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_link_change)
>> +			goto done;
>> +
>> +		return bypass_ops->slave_link_change(slave_netdev,
>> +						     bypass_netdev);
>> +	}
>> +
>> +	if (!netif_running(bypass_netdev))
>> +		return 0;
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> +		goto done;
> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> above is enough.

I think we need this check to not allow events from a slave that is not
attached to this master but has the same MAC.
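
A small userspace sketch of the distinction, under the assumption that the
slave flag alone does not say which master a device belongs to. The demo_*
names are hypothetical and only model the two pointer comparisons from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a net_device carrying IFF_BYPASS_SLAVE. */
struct demo_dev { bool is_bypass_slave; };

/* Hypothetical stand-in for struct bypass_info of one master. */
struct demo_info {
	const struct demo_dev *active;
	const struct demo_dev *backup;
};

/* True only if the event device is one of this master's two slaves. */
static bool demo_event_is_for_master(const struct demo_info *bi,
				     const struct demo_dev *event_dev)
{
	assert(bi && event_dev);

	/* The flag check filters out devices that are not slaves at all. */
	if (!event_dev->is_bypass_slave)
		return false;

	/* The pointer check ties the slave to this particular master. */
	return event_dev == bi->active || event_dev == bi->backup;
}
```

A device that carries the flag but occupies neither slot is ignored, which is
the case the extra comparison is meant to cover.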

>
>
>> +
>> +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> +		netif_carrier_on(bypass_netdev);
>> +		netif_tx_wake_all_queues(bypass_netdev);
>> +	} else {
>> +		netif_carrier_off(bypass_netdev);
>> +		netif_tx_stop_all_queues(bypass_netdev);
>> +	}
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static bool bypass_validate_event_dev(struct net_device *dev)
>> +{
>> +	/* Skip parent events */
>> +	if (netif_is_bypass_master(dev))
>> +		return false;
>> +
>> +	/* Avoid non-Ethernet type devices */
>> +	if (dev->type != ARPHRD_ETHER)
>> +		return false;
>> +
>> +	/* Avoid Vlan dev with same MAC registering as VF */
>> +	if (is_vlan_dev(dev))
>> +		return false;
>> +
>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> Yeah, this is certainly incorrect. One thing is, you should be using the
> helpers netif_is_bond_master().
> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>
> You need to do it not by blacklisting, but with whitelisting. You need
> to whitelist VF devices. My port flavours patchset might help with this.

Maybe I can use the netdev_has_lower_dev() helper to make sure that the slave
device is not an upper dev.
Can you point me to your port flavours patchset? Is it upstream?

>
>
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int
>> +bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>> +{
>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	if (!bypass_validate_event_dev(event_dev))
>> +		return NOTIFY_DONE;
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return bypass_slave_register(event_dev);
>> +	case NETDEV_UNREGISTER:
>> +		return bypass_slave_unregister(event_dev);
>> +	case NETDEV_UP:
>> +	case NETDEV_DOWN:
>> +	case NETDEV_CHANGE:
>> +		return bypass_slave_link_change(event_dev);
>> +	default:
>> +		return NOTIFY_DONE;
>> +	}
>> +}
>> +
>> +static struct notifier_block bypass_notifier = {
>> +	.notifier_call = bypass_event,
>> +};
>> +
>> +int bypass_open(struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *active_netdev, *backup_netdev;
>> +	int err;
>> +
>> +	netif_carrier_off(dev);
>> +	netif_tx_wake_all_queues(dev);
>> +
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (active_netdev) {
>> +		err = dev_open(active_netdev);
>> +		if (err)
>> +			goto err_active_open;
>> +	}
>> +
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +	if (backup_netdev) {
>> +		err = dev_open(backup_netdev);
>> +		if (err)
>> +			goto err_backup_open;
>> +	}
>> +
>> +	return 0;
>> +
>> +err_backup_open:
>> +	dev_close(active_netdev);
>> +err_active_open:
>> +	netif_tx_disable(dev);
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_open);
>> +
>> +int bypass_close(struct net_device *dev)
>> +{
>> +	struct bypass_info *vi = netdev_priv(dev);
> This should be probably "bi"

Yes.


>
>
>> +	struct net_device *slave_netdev;
>> +
>> +	netif_tx_disable(dev);
>> +
>> +	slave_netdev = rtnl_dereference(vi->active_netdev);
>> +	if (slave_netdev)
>> +		dev_close(slave_netdev);
>> +
>> +	slave_netdev = rtnl_dereference(vi->backup_netdev);
>> +	if (slave_netdev)
>> +		dev_close(slave_netdev);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_close);
>> +
>> +static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	atomic_long_inc(&dev->tx_dropped);
>> +	dev_kfree_skb_any(skb);
>> +	return NETDEV_TX_OK;
>> +}
>> +
>> +netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
> If you rename the other variable to "bpmaster_dev", it would be nice to
> rename this to bpinfo or something more descriptive. "bi" is too short
> to know what that is right away.

Will rename bypass_netdev to bypass_dev. bypass indicates that it is
an upper master dev.


>
>
>> +	struct net_device *xmit_dev;
> Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.

OK.


>
>
>
>> +
>> +	/* Try xmit via active netdev followed by backup netdev */
>> +	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>> +	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>> +		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>> +		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>> +			return bypass_drop_xmit(skb, dev);
>> +	}
>> +
>> +	skb->dev = xmit_dev;
>> +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> +	return dev_queue_xmit(skb);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_start_xmit);
>> +
>> +u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>> +			void *accel_priv, select_queue_fallback_t fallback)
>> +{
>> +	/* This helper function exists to help dev_pick_tx get the correct
>> +	 * destination queue.  Using a helper function skips a call to
>> +	 * skb_tx_hash and will put the skbs in the queue we expect on their
>> +	 * way down to the bonding driver.
>> +	 */
>> +	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> +	/* Save the original txq to restore before passing to the driver */
>> +	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +	if (unlikely(txq >= dev->real_num_tx_queues)) {
>> +		do {
>> +			txq -= dev->real_num_tx_queues;
>> +		} while (txq >= dev->real_num_tx_queues);
>> +	}
>> +
>> +	return txq;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_select_queue);
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32bit values only.
>> + */
>> +static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>> +			      const struct rtnl_link_stats64 *_new,
>> +			      const struct rtnl_link_stats64 *_old)
>> +{
>> +	const u64 *new = (const u64 *)_new;
>> +	const u64 *old = (const u64 *)_old;
>> +	u64 *res = (u64 *)_res;
>> +	int i;
>> +
>> +	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> +		u64 nv = new[i];
>> +		u64 ov = old[i];
>> +		s64 delta = nv - ov;
>> +
>> +		/* detects if this particular field is 32bit only */
>> +		if (((nv | ov) >> 32) == 0)
>> +			delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> +		/* filter anomalies, some drivers reset their stats
>> +		 * at down/up events.
>> +		 */
>> +		if (delta > 0)
>> +			res[i] += delta;
>> +	}
>> +}
>> +
>> +void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
> You can WARN_ON and return in case the dev is not bypass master, just
> to catch buggy drivers. Same with other helpers.

I can make this static and not export this helper as well as all
bypass_netdev ops.
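
As an aside, the delta folding done by bypass_fold_stats() in the quoted
patch can be tried out in user space. The following is only a minimal
sketch under stated assumptions: it works on plain arrays of counters
rather than struct rtnl_link_stats64, and fold_counters() is a
hypothetical name that does not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space analogue of bypass_fold_stats(): add the
 * per-counter growth (cur - prev) to res, treating a counter as 32-bit
 * when neither sample uses the upper 32 bits.
 */
static void fold_counters(uint64_t *res, const uint64_t *cur,
			  const uint64_t *prev, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t nv = cur[i];
		uint64_t ov = prev[i];
		int64_t delta = (int64_t)(nv - ov);

		/* both samples fit in 32 bits: assume a 32-bit counter so
		 * that a wrap past 2^32 still yields a small positive delta
		 */
		if (((nv | ov) >> 32) == 0)
			delta = (int64_t)(int32_t)((uint32_t)nv - (uint32_t)ov);

		/* a non-positive delta (e.g. counters reset) is ignored */
		if (delta > 0)
			res[i] += (uint64_t)delta;
	}
}
```

With this sketch, a 32-bit counter that goes from 0xFFFFFFF0 to 0x10
folds in a delta of 32, while a counter that drops from 1000 to 10
(a reset) contributes nothing.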

>
>
>> +	const struct rtnl_link_stats64 *new;
>> +	struct rtnl_link_stats64 temp;
>> +	struct net_device *slave_netdev;
>> +
>> +	spin_lock(&bi->stats_lock);
>> +	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_netdev = rcu_dereference(bi->active_netdev);
>> +	if (slave_netdev) {
>> +		new = dev_get_stats(slave_netdev, &temp);
>> +		bypass_fold_stats(stats, new, &bi->active_stats);
>> +		memcpy(&bi->active_stats, new, sizeof(*new));
>> +	}
>> +
>> +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (slave_netdev) {
>> +		new = dev_get_stats(slave_netdev, &temp);
>> +		bypass_fold_stats(stats, new, &bi->backup_stats);
>> +		memcpy(&bi->backup_stats, new, sizeof(*new));
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>> +	spin_unlock(&bi->stats_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_get_stats);
>> +
>> +int bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *active_netdev, *backup_netdev;
>> +	int ret = 0;
> Pointless initialization.
>
>
>> +
>> +	active_netdev = rcu_dereference(bi->active_netdev);
>> +	if (active_netdev) {
>> +		ret = dev_set_mtu(active_netdev, new_mtu);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	backup_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (backup_netdev) {
>> +		ret = dev_set_mtu(backup_netdev, new_mtu);
>> +		if (ret) {
>> +			dev_set_mtu(active_netdev, dev->mtu);
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	dev->mtu = new_mtu;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_change_mtu);
>> +
>> +void bypass_set_rx_mode(struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *slave_netdev;
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_netdev = rcu_dereference(bi->active_netdev);
>> +	if (slave_netdev) {
>> +		dev_uc_sync_multiple(slave_netdev, dev);
>> +		dev_mc_sync_multiple(slave_netdev, dev);
>> +	}
>> +
>> +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (slave_netdev) {
>> +		dev_uc_sync_multiple(slave_netdev, dev);
>> +		dev_mc_sync_multiple(slave_netdev, dev);
>> +	}
>> +
>> +	rcu_read_unlock();
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>> +
>> +static const struct net_device_ops bypass_netdev_ops = {
>> +	.ndo_open		= bypass_open,
>> +	.ndo_stop		= bypass_close,
>> +	.ndo_start_xmit		= bypass_start_xmit,
>> +	.ndo_select_queue	= bypass_select_queue,
>> +	.ndo_get_stats64	= bypass_get_stats,
>> +	.ndo_change_mtu		= bypass_change_mtu,
>> +	.ndo_set_rx_mode	= bypass_set_rx_mode,
>> +	.ndo_validate_addr	= eth_validate_addr,
>> +	.ndo_features_check	= passthru_features_check,
>> +};
>> +
>> +#define BYPASS_DRV_NAME "bypass"
>> +#define BYPASS_DRV_VERSION "0.1"
>> +
>> +static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>> +				       struct ethtool_drvinfo *drvinfo)
>> +{
>> +	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>> +	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>> +}
>> +
>> +int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>> +				      struct ethtool_link_ksettings *cmd)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *slave_netdev;
>> +
>> +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> +		slave_netdev = rtnl_dereference(bi->backup_netdev);
>> +		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> +			cmd->base.duplex = DUPLEX_UNKNOWN;
>> +			cmd->base.port = PORT_OTHER;
>> +			cmd->base.speed = SPEED_UNKNOWN;
>> +
>> +			return 0;
>> +		}
>> +	}
>> +
>> +	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>> +
>> +static const struct ethtool_ops bypass_ethtool_ops = {
>> +	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>> +	.get_link               = ethtool_op_get_link,
>> +	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>> +};
>> +
>> +static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>> +{
>> +	struct net *net = dev_net(bypass_netdev);
>> +	struct net_device *dev;
>> +
>> +	rtnl_lock();
>> +	for_each_netdev(net, dev) {
>> +		if (dev == bypass_netdev)
>> +			continue;
>> +		if (!bypass_validate_event_dev(dev))
>> +			continue;
>> +		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>> +			bypass_slave_register(dev);
>> +	}
>> +	rtnl_unlock();
>> +}
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct bypass_master **pbypass_master)
>> +{
>> +	struct bypass_master *bypass_master;
>> +
>> +	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>> +	if (!bypass_master)
>> +		return -ENOMEM;
>> +
>> +	rcu_assign_pointer(bypass_master->ops, ops);
>> +	dev_hold(dev);
>> +	dev->priv_flags |= IFF_BYPASS;
>> +	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_add_tail(&bypass_master->list, &bypass_master_list);
>> +	spin_unlock(&bypass_lock);
>> +
>> +	bypass_register_existing_slave(dev);
>> +
>> +	*pbypass_master = bypass_master;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_register);
>> +
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> +	struct net_device *bypass_netdev;
>> +
>> +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +
>> +	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>> +	dev_put(bypass_netdev);
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_del(&bypass_master->list);
>> +	spin_unlock(&bypass_lock);
>> +
>> +	kfree(bypass_master);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_unregister);
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master)
>> +{
>> +	struct device *dev = backup_netdev->dev.parent;
>> +	struct net_device *bypass_netdev;
>> +	int err;
>> +
>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> +	 * that most devices being bonded won't have too many queues.
>> +	 */
>> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>> +	if (!bypass_netdev) {
>> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> +	SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> +	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>> +	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>> +
>> +	/* Initialize the device options */
>> +	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> +	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> +				       IFF_TX_SKB_SHARING);
>> +
>> +	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>> +	bypass_netdev->features |= NETIF_F_LLTX;
>> +
>> +	/* Don't allow bypass devices to change network namespaces. */
>> +	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>> +
>> +	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> +				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> +				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>> +
>> +	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> +	bypass_netdev->features |= bypass_netdev->hw_features;
>> +
>> +	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>> +	       bypass_netdev->addr_len);
>> +
>> +	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> +	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +
>> +	err = register_netdev(bypass_netdev);
>> +	if (err < 0) {
>> +		dev_err(dev, "Unable to register bypass_netdev!\n");
>> +		goto err_register_netdev;
>> +	}
>> +
>> +	netif_carrier_off(bypass_netdev);
>> +
>> +	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>> +	if (err < 0)
> just "if (err)" would do.

OK

>
>
>> +		goto err_bypass;
>> +
>> +	return 0;
>> +
>> +err_bypass:
>> +	unregister_netdev(bypass_netdev);
>> +err_register_netdev:
>> +	free_netdev(bypass_netdev);
>> +
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_create);
>> +
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct net_device *slave_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (!bypass_master)
>> +		return;
>> +
>> +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +	bi = netdev_priv(bypass_netdev);
>> +
>> +	netif_device_detach(bypass_netdev);
>> +
>> +	rtnl_lock();
>> +
>> +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (slave_netdev)
>> +		bypass_slave_unregister(slave_netdev);
>> +
>> +	slave_netdev = rtnl_dereference(bi->backup_netdev);
>> +	if (slave_netdev)
>> +		bypass_slave_unregister(slave_netdev);
>> +
>> +	bypass_master_unregister(bypass_master);
>> +
>> +	unregister_netdevice(bypass_netdev);
>> +
>> +	rtnl_unlock();
>> +
>> +	free_netdev(bypass_netdev);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_destroy);
>> +
>> +static __init int
>> +bypass_init(void)
>> +{
>> +	register_netdevice_notifier(&bypass_notifier);
>> +
>> +	return 0;
>> +}
>> +module_init(bypass_init);
>> +
>> +static __exit
>> +void bypass_exit(void)
>> +{
>> +	unregister_netdevice_notifier(&bypass_notifier);
>> +}
>> +module_exit(bypass_exit);
>> +
>> +MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>> +MODULE_LICENSE("GPL v2");
>> -- 
>> 2.14.3
>>


* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-11 19:13     ` Samudrala, Sridhar
@ 2018-04-18  9:25       ` Jiri Pirko
  2018-04-18 18:43         ` Samudrala, Sridhar
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Pirko @ 2018-04-18  9:25 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh

Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> > This provides a generic interface for paravirtual drivers to listen
>> > for netdev register/unregister/link change events from pci ethernet
>> > devices with the same MAC and takeover their datapath. The notifier and
>> > event handling code is based on the existing netvsc implementation.
>> > 
>> > It exposes 2 sets of interfaces to the paravirtual drivers.
>> > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> > master netdev is created. The paravirtual driver registers each bypass
>> > instance along with a set of ops to manage the slave events.
>> >      bypass_master_register()
>> >      bypass_master_unregister()
>> > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> > the bypass module provides interfaces to create/destroy additional master
>> > netdev and all the slave events are managed internally.
>> >       bypass_master_create()
>> >       bypass_master_destroy()
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>> > include/linux/netdevice.h |  14 +
>> > include/net/bypass.h      |  96 ++++++
>> > net/Kconfig               |  18 +
>> > net/core/Makefile         |   1 +
>> > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> > 5 files changed, 973 insertions(+)
>> > create mode 100644 include/net/bypass.h
>> > create mode 100644 net/core/bypass.c
>> > 
>> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > index cf44503ea81a..587293728f70 100644
>> > --- a/include/linux/netdevice.h
>> > +++ b/include/linux/netdevice.h
>> > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> > 	IFF_PHONY_HEADROOM		= 1<<24,
>> > 	IFF_MACSEC			= 1<<25,
>> > 	IFF_NO_RX_HANDLER		= 1<<26,
>> > +	IFF_BYPASS			= 1 << 27,
>> > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> I wonder why you don't follow the existing coding style... Also, please
>> add these to the comment above.
>
>To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>to the existing coding style to be consistent.

Please do.


>
>> 
>> 
>> > };
>> > 
>> > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> > #define IFF_MACSEC			IFF_MACSEC
>> > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> > +#define IFF_BYPASS			IFF_BYPASS
>> > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> > 
>> > /**
>> >   *	struct net_device - The DEVICE structure.
>> > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> > }
>> > 
>> > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> > +{
>> > +	return dev->priv_flags & IFF_BYPASS;
>> > +}
>> > +
>> > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> > +{
>> > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> > +}
>> > +
>> > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> > static inline void netif_keep_dst(struct net_device *dev)
>> > {
>> > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> > new file mode 100644
>> > index 000000000000..86b02cb894cf
>> > --- /dev/null
>> > +++ b/include/net/bypass.h
>> > @@ -0,0 +1,96 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +/* Copyright (c) 2018, Intel Corporation. */
>> > +
>> > +#ifndef _NET_BYPASS_H
>> > +#define _NET_BYPASS_H
>> > +
>> > +#include <linux/netdevice.h>
>> > +
>> > +struct bypass_ops {
>> > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> > +				  struct net_device *bypass_netdev);
>> > +	int (*slave_join)(struct net_device *slave_netdev,
>> > +			  struct net_device *bypass_netdev);
>> > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> > +				    struct net_device *bypass_netdev);
>> > +	int (*slave_release)(struct net_device *slave_netdev,
>> > +			     struct net_device *bypass_netdev);
>> > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> > +				 struct net_device *bypass_netdev);
>> > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> > +};
>> > +
>> > +struct bypass_master {
>> > +	struct list_head list;
>> > +	struct net_device __rcu *bypass_netdev;
>> > +	struct bypass_ops __rcu *ops;
>> > +};
>> > +
>> > +/* bypass state */
>> > +struct bypass_info {
>> > +	/* passthru netdev with same MAC */
>> > +	struct net_device __rcu *active_netdev;
>> You still use the "active"/"backup" names, which is highly misleading as
>> they have a completely different meaning than in bond, for example.
>> I noted that in my previous review already. Please change it.
>
>I guess the issue is only with the 'active' name. 'backup' should be fine as it also
>matches the BACKUP feature bit we are adding to virtio_net.

I think that "backup" is also misleading. Both "active" and "backup"
mean a *state* of slaves. This should be named differently.



>
>With regard to alternate names for 'active', you suggested 'stolen', but I
>am not too happy with it.
>netvsc uses vf_netdev; are you OK with this? Another option is 'passthru'.

No. The netdev could be any netdevice. It does not have to be a "VF".
I think "stolen" is quite appropriate since it describes the modus
operandi. The bypass master steals some netdevice according to some
match.

But I don't insist on "stolen". Just sounds right.



>
>
>
>> 
>> 
>> > +
>> > +	/* virtio_net netdev */
>> > +	struct net_device __rcu *backup_netdev;
>> > +
>> > +	/* active netdev stats */
>> > +	struct rtnl_link_stats64 active_stats;
>> > +
>> > +	/* backup netdev stats */
>> > +	struct rtnl_link_stats64 backup_stats;
>> > +
>> > +	/* aggregated stats */
>> > +	struct rtnl_link_stats64 bypass_stats;
>> > +
>> > +	/* spinlock while updating stats */
>> > +	spinlock_t stats_lock;
>> > +};
>> > +
>> > +#if IS_ENABLED(CONFIG_NET_BYPASS)
>> > +
>> > +int bypass_master_create(struct net_device *backup_netdev,
>> > +			 struct bypass_master **pbypass_master);
>> > +void bypass_master_destroy(struct bypass_master *bypass_master);
>> > +
>> > +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> > +			   struct bypass_master **pbypass_master);
>> > +void bypass_master_unregister(struct bypass_master *bypass_master);
>> > +
>> > +int bypass_slave_unregister(struct net_device *slave_netdev);
>> > +
>> > +#else
>> > +
>> > +static inline
>> > +int bypass_master_create(struct net_device *backup_netdev,
>> > +			 struct bypass_master **pbypass_master)
>> > +{
>> > +	return 0;
>> > +}
>> > +
>> > +static inline
>> > +void bypass_master_destroy(struct bypass_master *bypass_master)
>> > +{
>> > +}
>> > +
>> > +static inline
>> > +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> > +			   struct bypass_master **pbypass_master)
>> > +{
>> > +	return 0;
>> > +}
>> > +
>> > +static inline
>> > +void bypass_master_unregister(struct bypass_master *bypass_master)
>> > +{
>> > +}
>> > +
>> > +static inline
>> > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> > +{
>> > +	return 0;
>> > +}
>> > +
>> > +#endif
>> > +
>> > +#endif /* _NET_BYPASS_H */
>> > diff --git a/net/Kconfig b/net/Kconfig
>> > index 0428f12c25c2..994445f4a96a 100644
>> > --- a/net/Kconfig
>> > +++ b/net/Kconfig
>> > @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>> > 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
>> > 	  devlink is a loadable module and the driver using it is built-in.
>> > 
>> > +config NET_BYPASS
>> > +	tristate "Bypass interface"
>> > +	---help---
>> > +	  This provides a generic interface for paravirtual drivers to listen
>> > +	  for netdev register/unregister/link change events from pci ethernet
>> > +	  devices with the same MAC and takeover their datapath. This also
>> > +	  enables live migration of a VM with direct attached VF by failing
>> > +	  over to the paravirtual datapath when the VF is unplugged.
>> > +
>> > +config MAY_USE_BYPASS
>> > +	tristate
>> > +	default m if NET_BYPASS=m
>> > +	default y if NET_BYPASS=y || NET_BYPASS=n
>> > +	help
>> > +	  Drivers using the bypass infrastructure should have a dependency
>> > +	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>> > +	  bypass is a loadable module and the driver using it is built-in.
>> > +
>> > endif   # if NET
>> > 
>> > # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> > diff --git a/net/core/Makefile b/net/core/Makefile
>> > index 6dbbba8c57ae..a9727ed1c8fc 100644
>> > --- a/net/core/Makefile
>> > +++ b/net/core/Makefile
>> > @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>> > obj-$(CONFIG_HWBM) += hwbm.o
>> > obj-$(CONFIG_NET_DEVLINK) += devlink.o
>> > obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> > +obj-$(CONFIG_NET_BYPASS) += bypass.o
>> > diff --git a/net/core/bypass.c b/net/core/bypass.c
>> > new file mode 100644
>> > index 000000000000..b5b9cb554c3f
>> > --- /dev/null
>> > +++ b/net/core/bypass.c
>> > @@ -0,0 +1,844 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +/* Copyright (c) 2018, Intel Corporation. */
>> > +
>> > +/* A common module to handle registrations and notifications for paravirtual
>> > + * drivers to enable accelerated datapath and support VF live migration.
>> > + *
>> > + * The notifier and event handling code is based on netvsc driver.
>> > + */
>> > +
>> > +#include <linux/netdevice.h>
>> > +#include <linux/etherdevice.h>
>> > +#include <linux/ethtool.h>
>> > +#include <linux/module.h>
>> > +#include <linux/slab.h>
>> > +#include <linux/netdevice.h>
>> > +#include <linux/netpoll.h>
>> > +#include <linux/rtnetlink.h>
>> > +#include <linux/if_vlan.h>
>> > +#include <linux/pci.h>
>> > +#include <net/sch_generic.h>
>> > +#include <uapi/linux/if_arp.h>
>> > +#include <net/bypass.h>
>> > +
>> > +static LIST_HEAD(bypass_master_list);
>> > +static DEFINE_SPINLOCK(bypass_lock);
>> > +
>> > +static int bypass_slave_pre_register(struct net_device *slave_netdev,
>> > +				     struct net_device *bypass_netdev,
>> > +				     struct bypass_ops *bypass_ops)
>> > +{
>> > +	struct bypass_info *bi;
>> > +	bool backup;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_pre_register)
>> > +			return -EINVAL;
>> > +
>> > +		return bypass_ops->slave_pre_register(slave_netdev,
>> > +						      bypass_netdev);
>> > +	}
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> > +	if (backup ? rtnl_dereference(bi->backup_netdev) :
>> > +			rtnl_dereference(bi->active_netdev)) {
>> > +		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>> > +			   slave_netdev->name, backup ? "backup" : "active");
>> > +		return -EEXIST;
>> > +	}
>> > +
>> > +	/* Avoid non pci devices as active netdev */
>> > +	if (!backup && (!slave_netdev->dev.parent ||
>> > +			!dev_is_pci(slave_netdev->dev.parent)))
>> > +		return -EINVAL;
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +static int bypass_slave_join(struct net_device *slave_netdev,
>> > +			     struct net_device *bypass_netdev,
>> > +			     struct bypass_ops *bypass_ops)
>> > +{
>> > +	struct bypass_info *bi;
>> > +	bool backup;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_join)
>> > +			return -EINVAL;
>> > +
>> > +		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>> > +	}
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> > +
>> > +	dev_hold(slave_netdev);
>> > +
>> > +	if (backup) {
>> > +		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>> > +		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>> > +	} else {
>> > +		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>> > +		dev_get_stats(bi->active_netdev, &bi->active_stats);
>> > +		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>> > +		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>> > +	}
>> > +
>> > +	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>> > +		    slave_netdev->name);
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +/* Called when slave dev is injecting data into network stack.
>> > + * Change the associated network device from lower dev to virtio.
>> > + * note: already called with rcu_read_lock
>> > + */
>> > +static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>> > +{
>> > +	struct sk_buff *skb = *pskb;
>> > +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> > +
>> > +	skb->dev = ndev;
>> > +
>> > +	return RX_HANDLER_ANOTHER;
>> > +}
>> > +
>> > +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> > +						  struct bypass_ops **ops)
>> > +{
>> > +	struct bypass_master *bypass_master;
>> > +	struct net_device *bypass_netdev;
>> > +
>> > +	spin_lock(&bypass_lock);
>> > +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> As I wrote the last time, you don't need this list or the spinlock.
>> You can do just something like:
>>          for_each_net(net) {
>>                  for_each_netdev(net, dev) {
>> 			if (netif_is_bypass_master(dev)) {
>
>This function returns the upper netdev as well as the ops associated
>with that netdev.
>bypass_master_list is a list of 'struct bypass_master' that associates

Well, can't you have it in netdev priv?


>'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>NULL for 3-netdev model.

I see :(


>
>
>> 
>> 
>> 
>> 
>> > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> > +			*ops = rcu_dereference(bypass_master->ops);
>> I don't see how rcu_dereference is ok here.
>> 1) I don't see rcu_read_lock taken
>> 2) Looks like bypass_master->ops has the same value across the whole
>>     existence.
>
>We hold rtnl_lock(), so I think I need to change this to rtnl_dereference().
>Yes, ops doesn't change.

If it does not change, you can just access it directly.


>
>> 
>> 
>> > +			spin_unlock(&bypass_lock);
>> > +			return bypass_netdev;
>> > +		}
>> > +	}
>> > +	spin_unlock(&bypass_lock);
>> > +	return NULL;
>> > +}
>> > +
>> > +static int bypass_slave_register(struct net_device *slave_netdev)
>> > +{
>> > +	struct net_device *bypass_netdev;
>> > +	struct bypass_ops *bypass_ops;
>> > +	int ret, orig_mtu;
>> > +
>> > +	ASSERT_RTNL();
>> > +
>> > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > +						&bypass_ops);
>> For master, could you use word "master" in the variables so it is clear?
>> Also, "dev" is fine instead of "netdev".
>> Something like "bpmaster_dev"
>
>bypass_master is of type struct bypass_master; bypass_netdev is of type struct net_device.

I was trying to point out that "bypass_netdev" represents a "master"
netdev, yet it does not say master. That is why I suggested
"bpmaster_dev".


>I can change all _netdev suffixes to _dev to make the names shorter.

ok.


>
>
>> 
>> 
>> > +	if (!bypass_netdev)
>> > +		goto done;
>> > +
>> > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> > +					bypass_ops);
>> > +	if (ret != 0)
>> 	Just "if (ret)" will do. You have this in more places.
>
>OK.
>
>
>> 
>> 
>> > +		goto done;
>> > +
>> > +	ret = netdev_rx_handler_register(slave_netdev,
>> > +					 bypass_ops ? bypass_ops->handle_frame :
>> > +					 bypass_handle_frame, bypass_netdev);
>> > +	if (ret != 0) {
>> > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> > +			   ret);
>> > +		goto done;
>> > +	}
>> > +
>> > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> > +	if (ret != 0) {
>> > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> > +			   bypass_netdev->name, ret);
>> > +		goto upper_link_failed;
>> > +	}
>> > +
>> > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> > +
>> > +	if (netif_running(bypass_netdev)) {
>> > +		ret = dev_open(slave_netdev);
>> > +		if (ret && (ret != -EBUSY)) {
>> > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> > +				   slave_netdev->name, ret);
>> > +			goto err_interface_up;
>> > +		}
>> > +	}
>> > +
>> > +	/* Align MTU of slave with master */
>> > +	orig_mtu = slave_netdev->mtu;
>> > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> > +	if (ret != 0) {
>> > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> > +			   slave_netdev->name, bypass_netdev->mtu);
>> > +		goto err_set_mtu;
>> > +	}
>> > +
>> > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> > +	if (ret != 0)
>> > +		goto err_join;
>> > +
>> > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> > +
>> > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> > +		    slave_netdev->name);
>> > +
>> > +	goto done;
>> > +
>> > +err_join:
>> > +	dev_set_mtu(slave_netdev, orig_mtu);
>> > +err_set_mtu:
>> > +	dev_close(slave_netdev);
>> > +err_interface_up:
>> > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > +upper_link_failed:
>> > +	netdev_rx_handler_unregister(slave_netdev);
>> > +done:
>> > +	return NOTIFY_DONE;
>> > +}
>> > +
>> > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> > +				       struct net_device *bypass_netdev,
>> > +				       struct bypass_ops *bypass_ops)
>> > +{
>> > +	struct net_device *backup_netdev, *active_netdev;
>> > +	struct bypass_info *bi;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_pre_unregister)
>> > +			return -EINVAL;
>> > +
>> > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> > +							bypass_netdev);
>> > +	}
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > +
>> > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > +		return -EINVAL;
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +static int bypass_slave_release(struct net_device *slave_netdev,
>> > +				struct net_device *bypass_netdev,
>> > +				struct bypass_ops *bypass_ops)
>> > +{
>> > +	struct net_device *backup_netdev, *active_netdev;
>> > +	struct bypass_info *bi;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_release)
>> > +			return -EINVAL;
>> I think it would be good to make the API to the driver more strict and
>> have a separate set of ops for "active" and "backup" netdevices.
>> That should stop people thinking about extending this to more slaves in
>> the future.
>
>We have checks in slave_pre_register() that allow only 1 'backup' and 1
>'active' slave.

I'm very well aware of that. I just thought that explicit ops for the
two slaves would make this more clear.
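
One possible shape for such a stricter API is sketched below. This is an
illustration only, not something taken from the patch: struct
bypass_slave_ops, struct bypass_split_ops, bypass_split_join() and
example_join() are hypothetical names, and struct net_device is stubbed
out so that the fragment builds in user space.

```c
#include <errno.h>
#include <stddef.h>

/* Stub standing in for the kernel's struct net_device. */
struct net_device {
	const char *name;
};

/* Hypothetical: callbacks for exactly one slave role. */
struct bypass_slave_ops {
	int (*join)(struct net_device *slave_dev,
		    struct net_device *bypass_dev);
	int (*release)(struct net_device *slave_dev,
		       struct net_device *bypass_dev);
};

/* Hypothetical: one fixed slot per role, so there is no place to
 * plug in a third slave.
 */
struct bypass_split_ops {
	struct bypass_slave_ops active;	/* passthru device */
	struct bypass_slave_ops backup;	/* paravirtual device */
};

/* Example callback: accept any slave. */
static int example_join(struct net_device *slave_dev,
			struct net_device *bypass_dev)
{
	(void)slave_dev;
	(void)bypass_dev;
	return 0;
}

/* Dispatch a join to the callback set for the given role. */
static int bypass_split_join(const struct bypass_split_ops *ops,
			     int is_backup,
			     struct net_device *slave_dev,
			     struct net_device *bypass_dev)
{
	const struct bypass_slave_ops *sops;

	sops = is_backup ? &ops->backup : &ops->active;
	if (!sops->join)
		return -EINVAL;

	return sops->join(slave_dev, bypass_dev);
}
```

With one slot per role, a driver that only fills in the 'active'
callbacks gets -EINVAL for a 'backup' join instead of silently sharing
one handler between both roles.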


>
>
>> 
>> 
>> 
>> > +
>> > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> > +	}
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > +
>> > +	if (slave_netdev == backup_netdev) {
>> > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> > +	} else {
>> > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> > +		if (backup_netdev) {
>> > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> > +		}
>> > +	}
>> > +
>> > +	dev_put(slave_netdev);
>> > +
>> > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> > +		    slave_netdev->name);
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> > +{
>> > +	struct net_device *bypass_netdev;
>> > +	struct bypass_ops *bypass_ops;
>> > +	int ret;
>> > +
>> > +	if (!netif_is_bypass_slave(slave_netdev))
>> > +		goto done;
>> > +
>> > +	ASSERT_RTNL();
>> > +
>> > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > +						&bypass_ops);
>> > +	if (!bypass_netdev)
>> > +		goto done;
>> > +
>> > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> > +					  bypass_ops);
>> > +	if (ret != 0)
>> > +		goto done;
>> > +
>> > +	netdev_rx_handler_unregister(slave_netdev);
>> > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > +
>> > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> > +
>> > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> > +		    slave_netdev->name);
>> > +
>> > +done:
>> > +	return NOTIFY_DONE;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> > +
>> > +static bool bypass_xmit_ready(struct net_device *dev)
>> > +{
>> > +	return netif_running(dev) && netif_carrier_ok(dev);
>> > +}
>> > +
>> > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> > +{
>> > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> > +	struct bypass_ops *bypass_ops;
>> > +	struct bypass_info *bi;
>> > +
>> > +	if (!netif_is_bypass_slave(slave_netdev))
>> > +		goto done;
>> > +
>> > +	ASSERT_RTNL();
>> > +
>> > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > +						&bypass_ops);
>> > +	if (!bypass_netdev)
>> > +		goto done;
>> > +
>> > +	if (bypass_ops) {
>> > +		if (!bypass_ops->slave_link_change)
>> > +			goto done;
>> > +
>> > +		return bypass_ops->slave_link_change(slave_netdev,
>> > +						     bypass_netdev);
>> > +	}
>> > +
>> > +	if (!netif_running(bypass_netdev))
>> > +		return 0;
>> > +
>> > +	bi = netdev_priv(bypass_netdev);
>> > +
>> > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > +
>> > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > +		goto done;
>> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> above is enough.
>
>I think we need this check to not allow events from a slave that is not
>attached to this master but has the same MAC.

Why do we need such events? Seems wrong to me. Consider:

bp1      bp2
a1 b1    a2 b2


a1 and a2 have the same MAC, and bp1 and bp2 have the same MAC.
bypass_master_get_bymac() will then always return the same master, either
bp1 or bp2, depending on the order of creation.
Let's say it returns bp1. Then, when we get an event for a2,
bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.


You cannot use bypass_master_get_bymac() here.
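
To make the scenario concrete, here is a minimal userspace sketch (not
kernel code). struct dev, master_get_bymac() and master_get_bylink() are
hypothetical stand-ins made up for this illustration: the first lookup
models "first master with a matching MAC wins", the second models
resolving the master through a link recorded when the slave was attached.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical userspace stand-in for struct net_device. */
struct dev {
	const char *name;
	unsigned char perm_addr[ETH_ALEN];
	struct dev *master;	/* upper device this slave was attached to */
};

/* Models the by-MAC lookup: the first master with a matching MAC wins. */
static struct dev *master_get_bymac(struct dev **masters, size_t n,
				    const unsigned char *mac)
{
	size_t i;

	assert(masters && mac);
	for (i = 0; i < n; i++)
		if (!memcmp(masters[i]->perm_addr, mac, ETH_ALEN))
			return masters[i];
	return NULL;
}

/* Models a lookup that follows the slave's own link to its master. */
static struct dev *master_get_bylink(const struct dev *slave)
{
	assert(slave);
	return slave->master;
}

/* bp1/a1 and bp2/a2 from the diagram, all sharing one MAC address. */
static struct dev bp1 = { "bp1", { 2, 0, 0, 0, 0, 1 }, NULL };
static struct dev bp2 = { "bp2", { 2, 0, 0, 0, 0, 1 }, NULL };
static struct dev a1 = { "a1", { 2, 0, 0, 0, 0, 1 }, &bp1 };
static struct dev a2 = { "a2", { 2, 0, 0, 0, 0, 1 }, &bp2 };
static struct dev *masters[] = { &bp1, &bp2 };
```

With bp1 created first, master_get_bymac(masters, 2, a2.perm_addr) hands
back bp1, so an event for a2 is delivered to the wrong master, while
master_get_bylink(&a2) yields bp2.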



>
>> 
>> 
>> > +
>> > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> > +		netif_carrier_on(bypass_netdev);
>> > +		netif_tx_wake_all_queues(bypass_netdev);
>> > +	} else {
>> > +		netif_carrier_off(bypass_netdev);
>> > +		netif_tx_stop_all_queues(bypass_netdev);
>> > +	}
>> > +
>> > +done:
>> > +	return NOTIFY_DONE;
>> > +}
>> > +
>> > +static bool bypass_validate_event_dev(struct net_device *dev)
>> > +{
>> > +	/* Skip parent events */
>> > +	if (netif_is_bypass_master(dev))
>> > +		return false;
>> > +
>> > +	/* Avoid non-Ethernet type devices */
>> > +	if (dev->type != ARPHRD_ETHER)
>> > +		return false;
>> > +
>> > +	/* Avoid Vlan dev with same MAC registering as VF */
>> > +	if (is_vlan_dev(dev))
>> > +		return false;
>> > +
>> > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> Yeah, this is certainly incorrect. One thing is, you should be using the
>> helpers netif_is_bond_master().
>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> 
>> You need to do it not by blacklisting, but with whitelisting. You need
>> to whitelist VF devices. My port flavours patchset might help with this.
>
>Maybe I can use the netdev_has_lower_dev() helper to make sure that the slave

I don't see such a function in the code.


>device is not an upper dev.
>Can you point to your port flavours patchset? Is it upstream?

I sent an RFC a couple of weeks ago:
[patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation


>
>> 
>> 
>> > +		return false;
>> > +
>> > +	return true;
>> > +}
>> > +
>> > +static int
>> > +bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>> > +{
>> > +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> > +
>> > +	if (!bypass_validate_event_dev(event_dev))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	switch (event) {
>> > +	case NETDEV_REGISTER:
>> > +		return bypass_slave_register(event_dev);
>> > +	case NETDEV_UNREGISTER:
>> > +		return bypass_slave_unregister(event_dev);
>> > +	case NETDEV_UP:
>> > +	case NETDEV_DOWN:
>> > +	case NETDEV_CHANGE:
>> > +		return bypass_slave_link_change(event_dev);
>> > +	default:
>> > +		return NOTIFY_DONE;
>> > +	}
>> > +}
>> > +
>> > +static struct notifier_block bypass_notifier = {
>> > +	.notifier_call = bypass_event,
>> > +};
>> > +
>> > +int bypass_open(struct net_device *dev)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> > +	struct net_device *active_netdev, *backup_netdev;
>> > +	int err;
>> > +
>> > +	netif_carrier_off(dev);
>> > +	netif_tx_wake_all_queues(dev);
>> > +
>> > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > +	if (active_netdev) {
>> > +		err = dev_open(active_netdev);
>> > +		if (err)
>> > +			goto err_active_open;
>> > +	}
>> > +
>> > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > +	if (backup_netdev) {
>> > +		err = dev_open(backup_netdev);
>> > +		if (err)
>> > +			goto err_backup_open;
>> > +	}
>> > +
>> > +	return 0;
>> > +
>> > +err_backup_open:
>> > +	dev_close(active_netdev);
>> > +err_active_open:
>> > +	netif_tx_disable(dev);
>> > +	return err;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_open);
>> > +
>> > +int bypass_close(struct net_device *dev)
>> > +{
>> > +	struct bypass_info *vi = netdev_priv(dev);
>> This should be probably "bi"
>
>Yes.
>
>
>> 
>> 
>> > +	struct net_device *slave_netdev;
>> > +
>> > +	netif_tx_disable(dev);
>> > +
>> > +	slave_netdev = rtnl_dereference(vi->active_netdev);
>> > +	if (slave_netdev)
>> > +		dev_close(slave_netdev);
>> > +
>> > +	slave_netdev = rtnl_dereference(vi->backup_netdev);
>> > +	if (slave_netdev)
>> > +		dev_close(slave_netdev);
>> > +
>> > +	return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_close);
>> > +
>> > +static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>> > +{
>> > +	atomic_long_inc(&dev->tx_dropped);
>> > +	dev_kfree_skb_any(skb);
>> > +	return NETDEV_TX_OK;
>> > +}
>> > +
>> > +netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> If you rename the other variable to "bpmaster_dev", it would be nice to
>> rename this to bpinfo or something more descriptive. "bi" is too short
>> to know what that is right away.
>
>Will rename bypass_netdev to bypass_dev. bypass indicates that it is
>an upper master dev.
>
>
>> 
>> 
>> > +	struct net_device *xmit_dev;
>> Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.
>
>OK.
>
>
>> 
>> 
>> 
>> > +
>> > +	/* Try xmit via active netdev followed by backup netdev */
>> > +	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>> > +	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>> > +		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>> > +		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>> > +			return bypass_drop_xmit(skb, dev);
>> > +	}
>> > +
>> > +	skb->dev = xmit_dev;
>> > +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> > +
>> > +	return dev_queue_xmit(skb);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_start_xmit);
>> > +
>> > +u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>> > +			void *accel_priv, select_queue_fallback_t fallback)
>> > +{
>> > +	/* This helper function exists to help dev_pick_tx get the correct
>> > +	 * destination queue.  Using a helper function skips a call to
>> > +	 * skb_tx_hash and will put the skbs in the queue we expect on their
>> > +	 * way down to the bonding driver.
>> > +	 */
>> > +	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> > +
>> > +	/* Save the original txq to restore before passing to the driver */
>> > +	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> > +
>> > +	if (unlikely(txq >= dev->real_num_tx_queues)) {
>> > +		do {
>> > +			txq -= dev->real_num_tx_queues;
>> > +		} while (txq >= dev->real_num_tx_queues);
>> > +	}
>> > +
>> > +	return txq;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_select_queue);
>> > +
>> > +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> > + * that some drivers can provide 32bit values only.
>> > + */
>> > +static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>> > +			      const struct rtnl_link_stats64 *_new,
>> > +			      const struct rtnl_link_stats64 *_old)
>> > +{
>> > +	const u64 *new = (const u64 *)_new;
>> > +	const u64 *old = (const u64 *)_old;
>> > +	u64 *res = (u64 *)_res;
>> > +	int i;
>> > +
>> > +	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> > +		u64 nv = new[i];
>> > +		u64 ov = old[i];
>> > +		s64 delta = nv - ov;
>> > +
>> > +		/* detects if this particular field is 32bit only */
>> > +		if (((nv | ov) >> 32) == 0)
>> > +			delta = (s64)(s32)((u32)nv - (u32)ov);
>> > +
>> > +		/* filter anomalies, some drivers reset their stats
>> > +		 * at down/up events.
>> > +		 */
>> > +		if (delta > 0)
>> > +			res[i] += delta;
>> > +	}
>> > +}
>> > +
>> > +void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> You can WARN_ON and return in case the dev is not bypass master, just
>> to catch buggy drivers. Same with other helpers.
>
>I can make this static and not export this helper as well as all
>bypass_netdev ops.

Ok.
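
For anyone following the stats path, the delta folding that
bypass_fold_stats() performs in the quoted patch can be tried out in
userspace. This is a minimal sketch assuming a plain array of 64-bit
counters instead of struct rtnl_link_stats64; fold_stats() and fold_one()
are names made up for this example. Like the kernel code, it relies on
the usual wrapping behaviour of unsigned-to-signed conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the per-counter growth (new - old) to res, field by field. */
static void fold_stats(uint64_t *res, const uint64_t *new_vals,
		       const uint64_t *old_vals, size_t n)
{
	size_t i;

	assert(res && new_vals && old_vals);
	for (i = 0; i < n; i++) {
		uint64_t nv = new_vals[i];
		uint64_t ov = old_vals[i];
		int64_t delta = (int64_t)(nv - ov);

		/* Neither value uses the upper 32 bits: treat the field as a
		 * 32-bit counter so that a 32-bit wraparound still gives a
		 * small positive delta.
		 */
		if (((nv | ov) >> 32) == 0)
			delta = (int64_t)(int32_t)((uint32_t)nv - (uint32_t)ov);

		/* Ignore negative deltas, e.g. a driver that reset its
		 * counters on a down/up cycle.
		 */
		if (delta > 0)
			res[i] += (uint64_t)delta;
	}
}

/* Convenience wrapper: fold a single counter and return the new total. */
static uint64_t fold_one(uint64_t res, uint64_t nv, uint64_t ov)
{
	fold_stats(&res, &nv, &ov, 1);
	return res;
}
```

For example, a 32-bit counter going from 0xFFFFFFF0 to 0x10 contributes
32 rather than a huge negative value, and a counter that dropped from
1000 to 5 after a reset contributes nothing.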


>
>> 
>> 
>> > +	const struct rtnl_link_stats64 *new;
>> > +	struct rtnl_link_stats64 temp;
>> > +	struct net_device *slave_netdev;
>> > +
>> > +	spin_lock(&bi->stats_lock);
>> > +	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>> > +
>> > +	rcu_read_lock();
>> > +
>> > +	slave_netdev = rcu_dereference(bi->active_netdev);
>> > +	if (slave_netdev) {
>> > +		new = dev_get_stats(slave_netdev, &temp);
>> > +		bypass_fold_stats(stats, new, &bi->active_stats);
>> > +		memcpy(&bi->active_stats, new, sizeof(*new));
>> > +	}
>> > +
>> > +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> > +	if (slave_netdev) {
>> > +		new = dev_get_stats(slave_netdev, &temp);
>> > +		bypass_fold_stats(stats, new, &bi->backup_stats);
>> > +		memcpy(&bi->backup_stats, new, sizeof(*new));
>> > +	}
>> > +
>> > +	rcu_read_unlock();
>> > +
>> > +	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>> > +	spin_unlock(&bi->stats_lock);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_get_stats);
>> > +
>> > +int bypass_change_mtu(struct net_device *dev, int new_mtu)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> > +	struct net_device *active_netdev, *backup_netdev;
>> > +	int ret = 0;
>> Pointless initialization.
>> 
>> 
>> > +
>> > +	active_netdev = rcu_dereference(bi->active_netdev);
>> > +	if (active_netdev) {
>> > +		ret = dev_set_mtu(active_netdev, new_mtu);
>> > +		if (ret)
>> > +			return ret;
>> > +	}
>> > +
>> > +	backup_netdev = rcu_dereference(bi->backup_netdev);
>> > +	if (backup_netdev) {
>> > +		ret = dev_set_mtu(backup_netdev, new_mtu);
>> > +		if (ret) {
>> > +			dev_set_mtu(active_netdev, dev->mtu);
>> > +			return ret;
>> > +		}
>> > +	}
>> > +
>> > +	dev->mtu = new_mtu;
>> > +	return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_change_mtu);
>> > +
>> > +void bypass_set_rx_mode(struct net_device *dev)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> > +	struct net_device *slave_netdev;
>> > +
>> > +	rcu_read_lock();
>> > +
>> > +	slave_netdev = rcu_dereference(bi->active_netdev);
>> > +	if (slave_netdev) {
>> > +		dev_uc_sync_multiple(slave_netdev, dev);
>> > +		dev_mc_sync_multiple(slave_netdev, dev);
>> > +	}
>> > +
>> > +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> > +	if (slave_netdev) {
>> > +		dev_uc_sync_multiple(slave_netdev, dev);
>> > +		dev_mc_sync_multiple(slave_netdev, dev);
>> > +	}
>> > +
>> > +	rcu_read_unlock();
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>> > +
>> > +static const struct net_device_ops bypass_netdev_ops = {
>> > +	.ndo_open		= bypass_open,
>> > +	.ndo_stop		= bypass_close,
>> > +	.ndo_start_xmit		= bypass_start_xmit,
>> > +	.ndo_select_queue	= bypass_select_queue,
>> > +	.ndo_get_stats64	= bypass_get_stats,
>> > +	.ndo_change_mtu		= bypass_change_mtu,
>> > +	.ndo_set_rx_mode	= bypass_set_rx_mode,
>> > +	.ndo_validate_addr	= eth_validate_addr,
>> > +	.ndo_features_check	= passthru_features_check,
>> > +};
>> > +
>> > +#define BYPASS_DRV_NAME "bypass"
>> > +#define BYPASS_DRV_VERSION "0.1"
>> > +
>> > +static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>> > +				       struct ethtool_drvinfo *drvinfo)
>> > +{
>> > +	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>> > +	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>> > +}
>> > +
>> > +int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>> > +				      struct ethtool_link_ksettings *cmd)
>> > +{
>> > +	struct bypass_info *bi = netdev_priv(dev);
>> > +	struct net_device *slave_netdev;
>> > +
>> > +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> > +	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> > +		slave_netdev = rtnl_dereference(bi->backup_netdev);
>> > +		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> > +			cmd->base.duplex = DUPLEX_UNKNOWN;
>> > +			cmd->base.port = PORT_OTHER;
>> > +			cmd->base.speed = SPEED_UNKNOWN;
>> > +
>> > +			return 0;
>> > +		}
>> > +	}
>> > +
>> > +	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>> > +
>> > +static const struct ethtool_ops bypass_ethtool_ops = {
>> > +	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>> > +	.get_link               = ethtool_op_get_link,
>> > +	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>> > +};
>> > +
>> > +static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>> > +{
>> > +	struct net *net = dev_net(bypass_netdev);
>> > +	struct net_device *dev;
>> > +
>> > +	rtnl_lock();
>> > +	for_each_netdev(net, dev) {
>> > +		if (dev == bypass_netdev)
>> > +			continue;
>> > +		if (!bypass_validate_event_dev(dev))
>> > +			continue;
>> > +		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>> > +			bypass_slave_register(dev);
>> > +	}
>> > +	rtnl_unlock();
>> > +}
>> > +
>> > +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> > +			   struct bypass_master **pbypass_master)
>> > +{
>> > +	struct bypass_master *bypass_master;
>> > +
>> > +	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>> > +	if (!bypass_master)
>> > +		return -ENOMEM;
>> > +
>> > +	rcu_assign_pointer(bypass_master->ops, ops);
>> > +	dev_hold(dev);
>> > +	dev->priv_flags |= IFF_BYPASS;
>> > +	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>> > +
>> > +	spin_lock(&bypass_lock);
>> > +	list_add_tail(&bypass_master->list, &bypass_master_list);
>> > +	spin_unlock(&bypass_lock);
>> > +
>> > +	bypass_register_existing_slave(dev);
>> > +
>> > +	*pbypass_master = bypass_master;
>> > +	return 0;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_master_register);
>> > +
>> > +void bypass_master_unregister(struct bypass_master *bypass_master)
>> > +{
>> > +	struct net_device *bypass_netdev;
>> > +
>> > +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > +
>> > +	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>> > +	dev_put(bypass_netdev);
>> > +
>> > +	spin_lock(&bypass_lock);
>> > +	list_del(&bypass_master->list);
>> > +	spin_unlock(&bypass_lock);
>> > +
>> > +	kfree(bypass_master);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_master_unregister);
>> > +
>> > +int bypass_master_create(struct net_device *backup_netdev,
>> > +			 struct bypass_master **pbypass_master)
>> > +{
>> > +	struct device *dev = backup_netdev->dev.parent;
>> > +	struct net_device *bypass_netdev;
>> > +	int err;
>> > +
>> > +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> > +	 * that most devices being bonded won't have too many queues.
>> > +	 */
>> > +	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>> > +	if (!bypass_netdev) {
>> > +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> > +		return -ENOMEM;
>> > +	}
>> > +
>> > +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> > +	SET_NETDEV_DEV(bypass_netdev, dev);
>> > +
>> > +	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>> > +	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>> > +
>> > +	/* Initialize the device options */
>> > +	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> > +	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> > +				       IFF_TX_SKB_SHARING);
>> > +
>> > +	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>> > +	bypass_netdev->features |= NETIF_F_LLTX;
>> > +
>> > +	/* Don't allow bypass devices to change network namespaces. */
>> > +	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>> > +
>> > +	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> > +				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> > +				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>> > +
>> > +	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> > +	bypass_netdev->features |= bypass_netdev->hw_features;
>> > +
>> > +	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>> > +	       bypass_netdev->addr_len);
>> > +
>> > +	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> > +	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> > +
>> > +	err = register_netdev(bypass_netdev);
>> > +	if (err < 0) {
>> > +		dev_err(dev, "Unable to register bypass_netdev!\n");
>> > +		goto err_register_netdev;
>> > +	}
>> > +
>> > +	netif_carrier_off(bypass_netdev);
>> > +
>> > +	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>> > +	if (err < 0)
>> just "if (err)" would do.
>
>OK
>
>> 
>> 
>> > +		goto err_bypass;
>> > +
>> > +	return 0;
>> > +
>> > +err_bypass:
>> > +	unregister_netdev(bypass_netdev);
>> > +err_register_netdev:
>> > +	free_netdev(bypass_netdev);
>> > +
>> > +	return err;
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_master_create);
>> > +
>> > +void bypass_master_destroy(struct bypass_master *bypass_master)
>> > +{
>> > +	struct net_device *bypass_netdev;
>> > +	struct net_device *slave_netdev;
>> > +	struct bypass_info *bi;
>> > +
>> > +	if (!bypass_master)
>> > +		return;
>> > +
>> > +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > +	bi = netdev_priv(bypass_netdev);
>> > +
>> > +	netif_device_detach(bypass_netdev);
>> > +
>> > +	rtnl_lock();
>> > +
>> > +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> > +	if (slave_netdev)
>> > +		bypass_slave_unregister(slave_netdev);
>> > +
>> > +	slave_netdev = rtnl_dereference(bi->backup_netdev);
>> > +	if (slave_netdev)
>> > +		bypass_slave_unregister(slave_netdev);
>> > +
>> > +	bypass_master_unregister(bypass_master);
>> > +
>> > +	unregister_netdevice(bypass_netdev);
>> > +
>> > +	rtnl_unlock();
>> > +
>> > +	free_netdev(bypass_netdev);
>> > +}
>> > +EXPORT_SYMBOL_GPL(bypass_master_destroy);
>> > +
>> > +static __init int
>> > +bypass_init(void)
>> > +{
>> > +	register_netdevice_notifier(&bypass_notifier);
>> > +
>> > +	return 0;
>> > +}
>> > +module_init(bypass_init);
>> > +
>> > +static __exit
>> > +void bypass_exit(void)
>> > +{
>> > +	unregister_netdevice_notifier(&bypass_notifier);
>> > +}
>> > +module_exit(bypass_exit);
>> > +
>> > +MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>> > +MODULE_LICENSE("GPL v2");
>> > -- 
>> > 2.14.3
>> > 
>


* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-18  9:25       ` Jiri Pirko
@ 2018-04-18 18:43         ` Samudrala, Sridhar
  2018-04-18 19:13           ` Jiri Pirko
  0 siblings, 1 reply; 63+ messages in thread
From: Samudrala, Sridhar @ 2018-04-18 18:43 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh

On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>>>> This provides a generic interface for paravirtual drivers to listen
>>>> for netdev register/unregister/link change events from pci ethernet
>>>> devices with the same MAC and takeover their datapath. The notifier and
>>>> event handling code is based on the existing netvsc implementation.
>>>>
>>>> It exposes 2 sets of interfaces to the paravirtual drivers.
>>>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>>>> master netdev is created. The paravirtual driver registers each bypass
>>>> instance along with a set of ops to manage the slave events.
>>>>       bypass_master_register()
>>>>       bypass_master_unregister()
>>>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>>>> the bypass module provides interfaces to create/destroy additional master
>>>> netdev and all the slave events are managed internally.
>>>>        bypass_master_create()
>>>>        bypass_master_destroy()
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> ---
>>>> include/linux/netdevice.h |  14 +
>>>> include/net/bypass.h      |  96 ++++++
>>>> net/Kconfig               |  18 +
>>>> net/core/Makefile         |   1 +
>>>> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>>>> 5 files changed, 973 insertions(+)
>>>> create mode 100644 include/net/bypass.h
>>>> create mode 100644 net/core/bypass.c
>>>>
>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>> index cf44503ea81a..587293728f70 100644
>>>> --- a/include/linux/netdevice.h
>>>> +++ b/include/linux/netdevice.h
>>>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>>>> 	IFF_PHONY_HEADROOM		= 1<<24,
>>>> 	IFF_MACSEC			= 1<<25,
>>>> 	IFF_NO_RX_HANDLER		= 1<<26,
>>>> +	IFF_BYPASS			= 1 << 27,
>>>> +	IFF_BYPASS_SLAVE		= 1 << 28,
>>> I wonder why you don't follow the existing coding style... Also, please
>>> add these to the comment above.
>> To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> to the existing coding style to be consistent.
> Please do.
>
>
>>>
>>>> };
>>>>
>>>> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>>>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>>>> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>>>> #define IFF_MACSEC			IFF_MACSEC
>>>> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>>>> +#define IFF_BYPASS			IFF_BYPASS
>>>> +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>>>>
>>>> /**
>>>>    *	struct net_device - The DEVICE structure.
>>>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>>>> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>>>> }
>>>>
>>>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>>>> +{
>>>> +	return dev->priv_flags & IFF_BYPASS;
>>>> +}
>>>> +
>>>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>>>> +{
>>>> +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>>>> +}
>>>> +
>>>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>>>> static inline void netif_keep_dst(struct net_device *dev)
>>>> {
>>>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>>>> new file mode 100644
>>>> index 000000000000..86b02cb894cf
>>>> --- /dev/null
>>>> +++ b/include/net/bypass.h
>>>> @@ -0,0 +1,96 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/* Copyright (c) 2018, Intel Corporation. */
>>>> +
>>>> +#ifndef _NET_BYPASS_H
>>>> +#define _NET_BYPASS_H
>>>> +
>>>> +#include <linux/netdevice.h>
>>>> +
>>>> +struct bypass_ops {
>>>> +	int (*slave_pre_register)(struct net_device *slave_netdev,
>>>> +				  struct net_device *bypass_netdev);
>>>> +	int (*slave_join)(struct net_device *slave_netdev,
>>>> +			  struct net_device *bypass_netdev);
>>>> +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>>>> +				    struct net_device *bypass_netdev);
>>>> +	int (*slave_release)(struct net_device *slave_netdev,
>>>> +			     struct net_device *bypass_netdev);
>>>> +	int (*slave_link_change)(struct net_device *slave_netdev,
>>>> +				 struct net_device *bypass_netdev);
>>>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>>>> +};
>>>> +
>>>> +struct bypass_master {
>>>> +	struct list_head list;
>>>> +	struct net_device __rcu *bypass_netdev;
>>>> +	struct bypass_ops __rcu *ops;
>>>> +};
>>>> +
>>>> +/* bypass state */
>>>> +struct bypass_info {
>>>> +	/* passthru netdev with same MAC */
>>>> +	struct net_device __rcu *active_netdev;
>>> You still use the "active"/"backup" names, which is highly misleading as
>>> they have a completely different meaning than in bond, for example.
>>> I noted that in my previous review already. Please change it.
>> I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> matches with the BACKUP feature bit we are adding to virtio_net.
> I think that "backup" is also misleading. Both "active" and "backup"
> mean a *state* of slaves. This should be named differently.
>
>
>
>> With regard to alternate names for 'active', you suggested 'stolen', but I
>> am not too happy with it.
>> netvsc uses vf_netdev; are you OK with this? Another option is 'passthru'.
> No. The netdev could be any netdevice. It does not have to be a "VF".
> I think "stolen" is quite appropriate since it describes the modus
> operandi. The bypass master steals some netdevice according to some
> match.
>
> But I don't insist on "stolen". Just sounds right.

We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, so I think
the 'backup' name is consistent.

The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
a PCI device is a VF in the guest kernel, we could restrict the 'active' netdev to be a VF.

I will look for suggestions in the next day or two. If I don't get any, I will go
with 'stolen'.

<snip>


> +
> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> +						  struct bypass_ops **ops)
> +{
> +	struct bypass_master *bypass_master;
> +	struct net_device *bypass_netdev;
> +
> +	spin_lock(&bypass_lock);
> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>>> As I wrote the last time, you don't need this list, spinlock.
>>> You can do just something like:
>>>           for_each_net(net) {
>>>                   for_each_netdev(net, dev) {
>>> 			if (netif_is_bypass_master(dev)) {
>> This function returns the upper netdev as well as the ops associated
>> with that netdev.
>> bypass_master_list is a list of 'struct bypass_master' that associates
> Well, can't you have it in netdev priv?

We cannot do this for the 2-netdev model, as no bypass_netdev is created there.

>
>
>> 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> NULL for 3-netdev model.
> I see :(
>
>
>>
>>>
>>>
>>>
>>>> +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>>>> +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>>>> +			*ops = rcu_dereference(bypass_master->ops);
>>> I don't see how rcu_dereference is ok here.
>>> 1) I don't see rcu_read_lock taken
>>> 2) Looks like bypass_master->ops has the same value across the whole
>>>      existence.
>> We hold rtnl_lock(); I think I need to change this to rtnl_dereference().
>> Yes, ops doesn't change.
> If it does not change, you can just access it directly.
>
>
>>>
>>>> +			spin_unlock(&bypass_lock);
>>>> +			return bypass_netdev;
>>>> +		}
>>>> +	}
>>>> +	spin_unlock(&bypass_lock);
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +static int bypass_slave_register(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	int ret, orig_mtu;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>> For master, could you use word "master" in the variables so it is clear?
>>> Also, "dev" is fine instead of "netdev".
>>> Something like "bpmaster_dev"
>> bypass_master is of type struct bypass_master, bypass_netdev is of type struct net_device.
> I was trying to point out, that "bypass_netdev" represents a "master"
> netdev, yet it does not say master. That is why I suggested
> "bpmaster_dev"
>
>
>> I can change all _netdev suffixes to _dev to make the names shorter.
> ok.
>
>
>>
>>>
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>>>> +					bypass_ops);
>>>> +	if (ret != 0)
>>> 	Just "if (ret)" will do. You have this on more places.
>> OK.
>>
>>
>>>
>>>> +		goto done;
>>>> +
>>>> +	ret = netdev_rx_handler_register(slave_netdev,
>>>> +					 bypass_ops ? bypass_ops->handle_frame :
>>>> +					 bypass_handle_frame, bypass_netdev);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>>>> +			   ret);
>>>> +		goto done;
>>>> +	}
>>>> +
>>>> +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>>>> +			   bypass_netdev->name, ret);
>>>> +		goto upper_link_failed;
>>>> +	}
>>>> +
>>>> +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>>>> +
>>>> +	if (netif_running(bypass_netdev)) {
>>>> +		ret = dev_open(slave_netdev);
>>>> +		if (ret && (ret != -EBUSY)) {
>>>> +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>>>> +				   slave_netdev->name, ret);
>>>> +			goto err_interface_up;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* Align MTU of slave with master */
>>>> +	orig_mtu = slave_netdev->mtu;
>>>> +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>>>> +			   slave_netdev->name, bypass_netdev->mtu);
>>>> +		goto err_set_mtu;
>>>> +	}
>>>> +
>>>> +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>>>> +	if (ret != 0)
>>>> +		goto err_join;
>>>> +
>>>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +	goto done;
>>>> +
>>>> +err_join:
>>>> +	dev_set_mtu(slave_netdev, orig_mtu);
>>>> +err_set_mtu:
>>>> +	dev_close(slave_netdev);
>>>> +err_interface_up:
>>>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +upper_link_failed:
>>>> +	netdev_rx_handler_unregister(slave_netdev);
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>>>> +				       struct net_device *bypass_netdev,
>>>> +				       struct bypass_ops *bypass_ops)
>>>> +{
>>>> +	struct net_device *backup_netdev, *active_netdev;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_pre_unregister)
>>>> +			return -EINVAL;
>>>> +
>>>> +		return bypass_ops->slave_pre_unregister(slave_netdev,
>>>> +							bypass_netdev);
>>>> +	}
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> +		return -EINVAL;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int bypass_slave_release(struct net_device *slave_netdev,
>>>> +				struct net_device *bypass_netdev,
>>>> +				struct bypass_ops *bypass_ops)
>>>> +{
>>>> +	struct net_device *backup_netdev, *active_netdev;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_release)
>>>> +			return -EINVAL;
>>> I think it would be good to make the API to the driver more strict and
>>> have a separate set of ops for "active" and "backup" netdevices.
>>> That should stop people thinking about extending this to more slaves in
>>> the future.
>> We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> 'active' slave.
> I'm very well aware of that. I just thought that explicit ops for the
> two slaves would make this more clear.
>
>
>>
>>>
>>>
>>>> +
>>>> +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>>>> +	}
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev == backup_netdev) {
>>>> +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>>>> +	} else {
>>>> +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>>>> +		if (backup_netdev) {
>>>> +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>>>> +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	dev_put(slave_netdev);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	int ret;
>>>> +
>>>> +	if (!netif_is_bypass_slave(slave_netdev))
>>>> +		goto done;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>>>> +					  bypass_ops);
>>>> +	if (ret != 0)
>>>> +		goto done;
>>>> +
>>>> +	netdev_rx_handler_unregister(slave_netdev);
>>>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +
>>>> +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>>>> +
>>>> +static bool bypass_xmit_ready(struct net_device *dev)
>>>> +{
>>>> +	return netif_running(dev) && netif_carrier_ok(dev);
>>>> +}
>>>> +
>>>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (!netif_is_bypass_slave(slave_netdev))
>>>> +		goto done;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_link_change)
>>>> +			goto done;
>>>> +
>>>> +		return bypass_ops->slave_link_change(slave_netdev,
>>>> +						     bypass_netdev);
>>>> +	}
>>>> +
>>>> +	if (!netif_running(bypass_netdev))
>>>> +		return 0;
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> +		goto done;
>>> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>>> above is enough.
>> I think we need this check to not allow events from a slave that is not
>> attached to this master but has the same MAC.
> Why do we need such events? Seems wrong to me.

We want to avoid events from a netdev that is mis-configured with the same MAC as
a bypass setup.

>   Consider:
>
> bp1      bp2
> a1 b1    a2 b2
>
>
> a1 and a2 have the same mac and bp1 and bp2 have the same mac.

We should not have 2 bypass configs with the same MAC.
I need to add a check in the bypass_master_register() to prevent this.

The above check covers the case where we have
bp1(a1, b1) with mac1
and a2 is misconfigured with mac1; we want to avoid using a2's link events to update bp1.
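
A minimal userspace sketch of that case, not the kernel code. The names
model_dev and model_bypass are hypothetical stand-ins for struct net_device
and struct bypass_info. It shows that a MAC comparison alone would accept a
stray a2, while the pointer comparison used by the check rejects it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device. */
struct model_dev {
	const char *name;
	unsigned char mac[6];
};

/* Hypothetical stand-in for struct bypass_info. */
struct model_bypass {
	struct model_dev *active;	/* passthru slave, may be NULL */
	struct model_dev *backup;	/* paravirtual slave, may be NULL */
};

/* MAC-only match: true for any device that carries the same address. */
static bool model_same_mac(const struct model_dev *a,
			   const struct model_dev *b)
{
	return memcmp(a->mac, b->mac, sizeof(a->mac)) == 0;
}

/* Mirrors the "slave != active && slave != backup" test in the patch. */
static bool model_event_is_ours(const struct model_bypass *bp,
				const struct model_dev *slave)
{
	return slave == bp->active || slave == bp->backup;
}
```

With bp1 = (a1, b1) and a2 sharing mac1, model_same_mac() is true for a2,
but model_event_is_ours() is false, so its link events would be ignored.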

> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> the order of creation.
> Let's say it will return bp1. Then when we have event for a2, the
> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>
>
> You cannot use bypass_master_get_bymac() here.
>
>
>
>>>
>>>> +
>>>> +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>>>> +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>>>> +		netif_carrier_on(bypass_netdev);
>>>> +		netif_tx_wake_all_queues(bypass_netdev);
>>>> +	} else {
>>>> +		netif_carrier_off(bypass_netdev);
>>>> +		netif_tx_stop_all_queues(bypass_netdev);
>>>> +	}
>>>> +
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static bool bypass_validate_event_dev(struct net_device *dev)
>>>> +{
>>>> +	/* Skip parent events */
>>>> +	if (netif_is_bypass_master(dev))
>>>> +		return false;
>>>> +
>>>> +	/* Avoid non-Ethernet type devices */
>>>> +	if (dev->type != ARPHRD_ETHER)
>>>> +		return false;
>>>> +
>>>> +	/* Avoid Vlan dev with same MAC registering as VF */
>>>> +	if (is_vlan_dev(dev))
>>>> +		return false;
>>>> +
>>>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>>>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>>> Yeah, this is certainly incorrect. One thing is, you should be using the
>>> helpers netif_is_bond_master().
>>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>>>
>>> You need to do it not by blacklisting, but with whitelisting. You need
>>> to whitelist VF devices. My port flavours patchset might help with this.
>> May be i can use netdev_has_lower_dev() helper to make sure that the slave
> I don't see such function in the code.

It is netdev_has_any_lower_dev(). I need to export it.

>
>
>> device is not an upper dev.
>> Can you point to your port flavours patchset? Is it upstream?
> I sent rfc couple of weeks ago:
> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-18 18:43         ` Samudrala, Sridhar
@ 2018-04-18 19:13           ` Jiri Pirko
  2018-04-18 19:46             ` Michael S. Tsirkin
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Pirko @ 2018-04-18 19:13 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh

Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > This provides a generic interface for paravirtual drivers to listen
>> > > > for netdev register/unregister/link change events from pci ethernet
>> > > > devices with the same MAC and takeover their datapath. The notifier and
>> > > > event handling code is based on the existing netvsc implementation.
>> > > > 
>> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> > > > master netdev is created. The paravirtual driver registers each bypass
>> > > > instance along with a set of ops to manage the slave events.
>> > > >       bypass_master_register()
>> > > >       bypass_master_unregister()
>> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> > > > the bypass module provides interfaces to create/destroy additional master
>> > > > netdev and all the slave events are managed internally.
>> > > >        bypass_master_create()
>> > > >        bypass_master_destroy()
>> > > > 
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > ---
>> > > > include/linux/netdevice.h |  14 +
>> > > > include/net/bypass.h      |  96 ++++++
>> > > > net/Kconfig               |  18 +
>> > > > net/core/Makefile         |   1 +
>> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> > > > 5 files changed, 973 insertions(+)
>> > > > create mode 100644 include/net/bypass.h
>> > > > create mode 100644 net/core/bypass.c
>> > > > 
>> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > > > index cf44503ea81a..587293728f70 100644
>> > > > --- a/include/linux/netdevice.h
>> > > > +++ b/include/linux/netdevice.h
>> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
>> > > > 	IFF_MACSEC			= 1<<25,
>> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
>> > > > +	IFF_BYPASS			= 1 << 27,
>> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> > > I wonder, why you don't follow the existing coding style... Also, please
>> > > add these to into the comment above.
>> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> > to the existing coding style to be consistent.
>> Please do.
>> 
>> 
>> > > 
>> > > > };
>> > > > 
>> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> > > > #define IFF_MACSEC			IFF_MACSEC
>> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> > > > +#define IFF_BYPASS			IFF_BYPASS
>> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> > > > 
>> > > > /**
>> > > >    *	struct net_device - The DEVICE structure.
>> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> > > > }
>> > > > 
>> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> > > > +{
>> > > > +	return dev->priv_flags & IFF_BYPASS;
>> > > > +}
>> > > > +
>> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> > > > +{
>> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> > > > +}
>> > > > +
>> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> > > > static inline void netif_keep_dst(struct net_device *dev)
>> > > > {
>> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> > > > new file mode 100644
>> > > > index 000000000000..86b02cb894cf
>> > > > --- /dev/null
>> > > > +++ b/include/net/bypass.h
>> > > > @@ -0,0 +1,96 @@
>> > > > +// SPDX-License-Identifier: GPL-2.0
>> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> > > > +
>> > > > +#ifndef _NET_BYPASS_H
>> > > > +#define _NET_BYPASS_H
>> > > > +
>> > > > +#include <linux/netdevice.h>
>> > > > +
>> > > > +struct bypass_ops {
>> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> > > > +				  struct net_device *bypass_netdev);
>> > > > +	int (*slave_join)(struct net_device *slave_netdev,
>> > > > +			  struct net_device *bypass_netdev);
>> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> > > > +				    struct net_device *bypass_netdev);
>> > > > +	int (*slave_release)(struct net_device *slave_netdev,
>> > > > +			     struct net_device *bypass_netdev);
>> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> > > > +				 struct net_device *bypass_netdev);
>> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> > > > +};
>> > > > +
>> > > > +struct bypass_master {
>> > > > +	struct list_head list;
>> > > > +	struct net_device __rcu *bypass_netdev;
>> > > > +	struct bypass_ops __rcu *ops;
>> > > > +};
>> > > > +
>> > > > +/* bypass state */
>> > > > +struct bypass_info {
>> > > > +	/* passthru netdev with same MAC */
>> > > > +	struct net_device __rcu *active_netdev;
>> > > You still use "active"/"backup" names which is highly misleading as
>> > > it has completely different meaning that in bond for example.
>> > > I noted that in my previous review already. Please change it.
>> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> > matches with the BACKUP feature bit we are adding to virtio_net.
>> I think that "backup" is also misleading. Both "active" and "backup"
>> mean a *state* of slaves. This should be named differently.
>> 
>> 
>> 
>> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> > am not too happy with it.
>> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> No. The netdev could be any netdevice. It does not have to be a "VF".
>> I think "stolen" is quite appropriate since it describes the modus
>> operandi. The bypass master steals some netdevice according to some
>> match.
>> 
>> But I don't insist on "stolen". Just sounds right.
>
>We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>'backup' name is consistent.

It perhaps makes sense from the point of view of the virtio device. However,
as I have described a couple of times, for a master/slave device the name
"backup" is highly misleading.


>
>The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>
>Will look for any suggestions in the next day or two. If i don't get any, i will go
>with 'stolen'
>
><snip>
>
>
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> +						  struct bypass_ops **ops)
>> +{
>> +	struct bypass_master *bypass_master;
>> +	struct net_device *bypass_netdev;
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> > > As I wrote the last time, you don't need this list, spinlock.
>> > > You can do just something like:
>> > >           for_each_net(net) {
>> > >                   for_each_netdev(net, dev) {
>> > > 			if (netif_is_bypass_master(dev)) {
>> > This function returns the upper netdev as well as the ops associated
>> > with that netdev.
>> > bypass_master_list is a list of 'struct bypass_master' that associates
>> Well, can't you have it in netdev priv?
>
>We cannot do this for 2-netdev model as there is no bypass_netdev created.

How come? You have no master? I don't understand.



>
>> 
>> 
>> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> > NULL for 3-netdev model.
>> I see :(
>> 
>> 
>> > 
>> > > 
>> > > 
>> > > 
>> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> > > > +			*ops = rcu_dereference(bypass_master->ops);
>> > > I don't see how rcu_dereference is ok here.
>> > > 1) I don't see rcu_read_lock taken
>> > > 2) Looks like bypass_master->ops has the same value across the whole
>> > >      existence.
>> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> > Yes. ops doesn't change.
>> If it does not change, you can just access it directly.
>> 
>> 
>> > > 
>> > > > +			spin_unlock(&bypass_lock);
>> > > > +			return bypass_netdev;
>> > > > +		}
>> > > > +	}
>> > > > +	spin_unlock(&bypass_lock);
>> > > > +	return NULL;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	int ret, orig_mtu;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > For master, could you use word "master" in the variables so it is clear?
>> > > Also, "dev" is fine instead of "netdev".
>> > > Something like "bpmaster_dev"
>> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
>> I was trying to point out, that "bypass_netdev" represents a "master"
>> netdev, yet it does not say master. That is why I suggested
>> "bpmaster_dev"
>> 
>> 
>> > I can change all _netdev suffixes to _dev to make the names shorter.
>> ok.
>> 
>> 
>> > 
>> > > 
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> > > > +					bypass_ops);
>> > > > +	if (ret != 0)
>> > > 	Just "if (ret)" will do. You have this on more places.
>> > OK.
>> > 
>> > 
>> > > 
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = netdev_rx_handler_register(slave_netdev,
>> > > > +					 bypass_ops ? bypass_ops->handle_frame :
>> > > > +					 bypass_handle_frame, bypass_netdev);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> > > > +			   ret);
>> > > > +		goto done;
>> > > > +	}
>> > > > +
>> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> > > > +			   bypass_netdev->name, ret);
>> > > > +		goto upper_link_failed;
>> > > > +	}
>> > > > +
>> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> > > > +
>> > > > +	if (netif_running(bypass_netdev)) {
>> > > > +		ret = dev_open(slave_netdev);
>> > > > +		if (ret && (ret != -EBUSY)) {
>> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> > > > +				   slave_netdev->name, ret);
>> > > > +			goto err_interface_up;
>> > > > +		}
>> > > > +	}
>> > > > +
>> > > > +	/* Align MTU of slave with master */
>> > > > +	orig_mtu = slave_netdev->mtu;
>> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> > > > +			   slave_netdev->name, bypass_netdev->mtu);
>> > > > +		goto err_set_mtu;
>> > > > +	}
>> > > > +
>> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> > > > +	if (ret != 0)
>> > > > +		goto err_join;
>> > > > +
>> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +	goto done;
>> > > > +
>> > > > +err_join:
>> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
>> > > > +err_set_mtu:
>> > > > +	dev_close(slave_netdev);
>> > > > +err_interface_up:
>> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +upper_link_failed:
>> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> > > > +				       struct net_device *bypass_netdev,
>> > > > +				       struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > +	struct net_device *backup_netdev, *active_netdev;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_pre_unregister)
>> > > > +			return -EINVAL;
>> > > > +
>> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> > > > +							bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > +		return -EINVAL;
>> > > > +
>> > > > +	return 0;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> > > > +				struct net_device *bypass_netdev,
>> > > > +				struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > +	struct net_device *backup_netdev, *active_netdev;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_release)
>> > > > +			return -EINVAL;
>> > > I think it would be good to make the API to the driver more strict and
>> > > have a separate set of ops for "active" and "backup" netdevices.
>> > > That should stop people thinking about extending this to more slaves in
>> > > the future.
>> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> > 'active' slave.
>> I'm very well aware of that. I just thought that explicit ops for the
>> two slaves would make this more clear.
>> 
>> 
>> > 
>> > > 
>> > > 
>> > > > +
>> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev == backup_netdev) {
>> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> > > > +	} else {
>> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> > > > +		if (backup_netdev) {
>> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> > > > +		}
>> > > > +	}
>> > > > +
>> > > > +	dev_put(slave_netdev);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +	return 0;
>> > > > +}
>> > > > +
>> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	int ret;
>> > > > +
>> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> > > > +		goto done;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> > > > +					  bypass_ops);
>> > > > +	if (ret != 0)
>> > > > +		goto done;
>> > > > +
>> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +
>> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> > > > +
>> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> > > > +{
>> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> > > > +		goto done;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_link_change)
>> > > > +			goto done;
>> > > > +
>> > > > +		return bypass_ops->slave_link_change(slave_netdev,
>> > > > +						     bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	if (!netif_running(bypass_netdev))
>> > > > +		return 0;
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > +		goto done;
>> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> > > above is enough.
>> > I think we need this check to not allow events from a slave that is not
>> > attached to this master but has the same MAC.
>> Why do we need such events? Seems wrong to me.
>
>We want to avoid events from a netdev that is mis-configured with the same MAC as
>a bypass setup.
>
>>   Consider:
>> 
>> bp1      bp2
>> a1 b1    a2 b2
>> 
>> 
>> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>
>We should not have 2 bypass configs with the same MAC.
>I need to add a check in the bypass_master_register() to prevent this.

The MAC can change, so you would have to check on change as well. Feels
odd, though.
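
A rough userspace sketch of what checking at both points could look like.
Everything here is hypothetical: dupmac_master, dupmac_registry and the two
functions are illustration only and do not exist in the patch. The same
"is this MAC already used by another master" test runs at registration and
again when an address is changed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DUPMAC_MAX_MASTERS 8
#define DUPMAC_ETH_ALEN 6

/* Hypothetical master record; only the MAC matters for this sketch. */
struct dupmac_master {
	unsigned char mac[DUPMAC_ETH_ALEN];
};

struct dupmac_registry {
	struct dupmac_master *masters[DUPMAC_MAX_MASTERS];
	size_t count;
};

/* True if a registered master other than 'skip' already uses 'mac'. */
static bool dupmac_in_use(const struct dupmac_registry *reg,
			  const unsigned char *mac,
			  const struct dupmac_master *skip)
{
	size_t i;

	for (i = 0; i < reg->count; i++) {
		if (reg->masters[i] != skip &&
		    !memcmp(reg->masters[i]->mac, mac, DUPMAC_ETH_ALEN))
			return true;
	}
	return false;
}

/* Registration-time check: refuse a second master with the same MAC. */
static int dupmac_register(struct dupmac_registry *reg,
			   struct dupmac_master *m)
{
	if (reg->count == DUPMAC_MAX_MASTERS)
		return -ENOSPC;
	if (dupmac_in_use(reg, m->mac, NULL))
		return -EEXIST;
	reg->masters[reg->count++] = m;
	return 0;
}

/* Change-time check: the same test, skipping the master being changed. */
static int dupmac_set_mac(const struct dupmac_registry *reg,
			  struct dupmac_master *m,
			  const unsigned char *mac)
{
	if (dupmac_in_use(reg, mac, m))
		return -EEXIST;
	memmove(m->mac, mac, DUPMAC_ETH_ALEN);
	return 0;
}
```

Without the second function, a master could be registered with a unique
MAC and then be changed to collide with an existing one.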


>
>The above check is to avoid cases where we have
>bp1(a1, b1) with mac1
>and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>
>> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> the order of creation.
>> Let's say it will return bp1. Then when we have event for a2, the
>> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> 
>> 
>> You cannot use bypass_master_get_bymac() here.
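
The ambiguity in the bp1/bp2 example can be modelled in userspace as below.
This is only a sketch: lookup_master and lookup_slave are hypothetical
types, and resolving the master through a link stored on the slave is an
assumed alternative, not something proposed in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a bypass master netdev. */
struct lookup_master {
	const char *name;
	unsigned char mac[6];
};

/* Hypothetical stand-in for a slave netdev. */
struct lookup_slave {
	const char *name;
	unsigned char mac[6];
	struct lookup_master *upper;	/* master this slave was linked to */
};

/* Returns the first master whose MAC matches, as a list walk keyed on MAC. */
static struct lookup_master *lookup_get_bymac(struct lookup_master **list,
					      size_t n,
					      const unsigned char *mac)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!memcmp(list[i]->mac, mac, sizeof(list[i]->mac)))
			return list[i];
	}
	return NULL;
}

/* Resolves the master from the slave's own link; no MAC comparison. */
static struct lookup_master *lookup_get_byupper(const struct lookup_slave *s)
{
	return s->upper;
}
```

With bp1 created first and a2 linked to bp2, the MAC walk returns bp1 for
a2, while the stored link returns bp2.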
>> 
>> 
>> 
>> > > 
>> > > > +
>> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> > > > +		netif_carrier_on(bypass_netdev);
>> > > > +		netif_tx_wake_all_queues(bypass_netdev);
>> > > > +	} else {
>> > > > +		netif_carrier_off(bypass_netdev);
>> > > > +		netif_tx_stop_all_queues(bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> > > > +{
>> > > > +	/* Skip parent events */
>> > > > +	if (netif_is_bypass_master(dev))
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid non-Ethernet type devices */
>> > > > +	if (dev->type != ARPHRD_ETHER)
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
>> > > > +	if (is_vlan_dev(dev))
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> > > helpers netif_is_bond_master().
>> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> > > 
>> > > You need to do it not by blacklisting, but with whitelisting. You need
>> > > to whitelist VF devices. My port flavours patchset might help with this.
>> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> I don't see such function in the code.
>
>It is netdev_has_any_lower_dev(). I need to export it.

Come on, you cannot use that. That would allow a bond without slaves,
but the slaves could be added later on.

What exactly are you trying to achieve by this?
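
A small userspace model of the problem, with hypothetical names (wl_dev,
wl_kind and both helpers are illustration only, not kernel API). A test on
the current stacking passes for an empty bond and stops holding once a
slave is added, whereas a test on the device kind gives the same answer at
any time.

```c
#include <stdbool.h>

/* Hypothetical device kinds; a real whitelist would need a kernel-side
 * way to identify VF devices, which this sketch does not provide.
 */
enum wl_kind { WL_KIND_VF, WL_KIND_BOND, WL_KIND_VLAN };

struct wl_dev {
	enum wl_kind kind;
	int lower_count;	/* lower devices currently attached */
};

/* Point-in-time test: true only while nothing is stacked below. */
static bool wl_has_no_lower(const struct wl_dev *dev)
{
	return dev->lower_count == 0;
}

/* Whitelist test: depends on what the device is, not on its stacking. */
static bool wl_is_vf(const struct wl_dev *dev)
{
	return dev->kind == WL_KIND_VF;
}
```

An empty bond satisfies wl_has_no_lower() at registration time and could be
accepted as a slave; after its first slave is enslaved the condition no
longer holds, but nothing re-evaluates it. wl_is_vf() rejects the bond in
both states.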


>
>> 
>> 
>> > device is not an upper dev.
>> > Can you point to your port flavours patchset? Is it upstream?
>> I sent rfc couple of weeks ago:
>> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>
>
>

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-18 19:13           ` Jiri Pirko
@ 2018-04-18 19:46             ` Michael S. Tsirkin
  2018-04-18 20:32               ` Jiri Pirko
  0 siblings, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2018-04-18 19:46 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
	virtualization, loseweigh, netdev, davem

On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > This provides a generic interface for paravirtual drivers to listen
> >> > > > for netdev register/unregister/link change events from pci ethernet
> >> > > > devices with the same MAC and takeover their datapath. The notifier and
> >> > > > event handling code is based on the existing netvsc implementation.
> >> > > > 
> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
> >> > > > master netdev is created. The paravirtual driver registers each bypass
> >> > > > instance along with a set of ops to manage the slave events.
> >> > > >       bypass_master_register()
> >> > > >       bypass_master_unregister()
> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
> >> > > > the bypass module provides interfaces to create/destroy additional master
> >> > > > netdev and all the slave events are managed internally.
> >> > > >        bypass_master_create()
> >> > > >        bypass_master_destroy()
> >> > > > 
> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > > > ---
> >> > > > include/linux/netdevice.h |  14 +
> >> > > > include/net/bypass.h      |  96 ++++++
> >> > > > net/Kconfig               |  18 +
> >> > > > net/core/Makefile         |   1 +
> >> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> >> > > > 5 files changed, 973 insertions(+)
> >> > > > create mode 100644 include/net/bypass.h
> >> > > > create mode 100644 net/core/bypass.c
> >> > > > 
> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> > > > index cf44503ea81a..587293728f70 100644
> >> > > > --- a/include/linux/netdevice.h
> >> > > > +++ b/include/linux/netdevice.h
> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> >> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
> >> > > > 	IFF_MACSEC			= 1<<25,
> >> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
> >> > > > +	IFF_BYPASS			= 1 << 27,
> >> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
> >> > > I wonder, why you don't follow the existing coding style... Also, please
> >> > > add these to into the comment above.
> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
> >> > to the existing coding style to be consistent.
> >> Please do.
> >> 
> >> 
> >> > > 
> >> > > > };
> >> > > > 
> >> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> >> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
> >> > > > #define IFF_MACSEC			IFF_MACSEC
> >> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
> >> > > > +#define IFF_BYPASS			IFF_BYPASS
> >> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
> >> > > > 
> >> > > > /**
> >> > > >    *	struct net_device - The DEVICE structure.
> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> >> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
> >> > > > }
> >> > > > 
> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
> >> > > > +{
> >> > > > +	return dev->priv_flags & IFF_BYPASS;
> >> > > > +}
> >> > > > +
> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
> >> > > > +{
> >> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
> >> > > > +}
> >> > > > +
> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> >> > > > static inline void netif_keep_dst(struct net_device *dev)
> >> > > > {
> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
> >> > > > new file mode 100644
> >> > > > index 000000000000..86b02cb894cf
> >> > > > --- /dev/null
> >> > > > +++ b/include/net/bypass.h
> >> > > > @@ -0,0 +1,96 @@
> >> > > > +// SPDX-License-Identifier: GPL-2.0
> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
> >> > > > +
> >> > > > +#ifndef _NET_BYPASS_H
> >> > > > +#define _NET_BYPASS_H
> >> > > > +
> >> > > > +#include <linux/netdevice.h>
> >> > > > +
> >> > > > +struct bypass_ops {
> >> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
> >> > > > +				  struct net_device *bypass_netdev);
> >> > > > +	int (*slave_join)(struct net_device *slave_netdev,
> >> > > > +			  struct net_device *bypass_netdev);
> >> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
> >> > > > +				    struct net_device *bypass_netdev);
> >> > > > +	int (*slave_release)(struct net_device *slave_netdev,
> >> > > > +			     struct net_device *bypass_netdev);
> >> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
> >> > > > +				 struct net_device *bypass_netdev);
> >> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
> >> > > > +};
> >> > > > +
> >> > > > +struct bypass_master {
> >> > > > +	struct list_head list;
> >> > > > +	struct net_device __rcu *bypass_netdev;
> >> > > > +	struct bypass_ops __rcu *ops;
> >> > > > +};
> >> > > > +
> >> > > > +/* bypass state */
> >> > > > +struct bypass_info {
> >> > > > +	/* passthru netdev with same MAC */
> >> > > > +	struct net_device __rcu *active_netdev;
> >> > > You still use "active"/"backup" names which is highly misleading as
> >> > > it has a completely different meaning than in bond, for example.
> >> > > I noted that in my previous review already. Please change it.
> >> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
> >> > matches with the BACKUP feature bit we are adding to virtio_net.
> >> I think that "backup" is also misleading. Both "active" and "backup"
> >> mean a *state* of slaves. This should be named differently.
> >> 
> >> 
> >> 
> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
> >> > am not too happy with it.
> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> >> No. The netdev could be any netdevice. It does not have to be a "VF".
> >> I think "stolen" is quite appropriate since it describes the modus
> >> operandi. The bypass master steals some netdevice according to some
> >> match.
> >> 
> >> But I don't insist on "stolen". Just sounds right.
> >
> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
> >'backup' name is consistent.
> 
> It perhaps makes sense from the view of virtio device. However, as I
> described couple of times, for master/slave device the name "backup" is
> highly misleading.

virtio is the backup. You are supposed to use another
(typically passthrough) device, if that fails use virtio.
It does seem appropriate to me. If you like, we can
change that to "standby".  Active I don't like either. "main"?

In fact would failover be better than bypass?


> 
> >
> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
> >
> >Will look for any suggestions in the next day or two. If i don't get any, i will go
> >with 'stolen'
> >
> ><snip>
> >
> >
> >> +
> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> >> +						  struct bypass_ops **ops)
> >> +{
> >> +	struct bypass_master *bypass_master;
> >> +	struct net_device *bypass_netdev;
> >> +
> >> +	spin_lock(&bypass_lock);
> >> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
> >> > > As I wrote the last time, you don't need this list, spinlock.
> >> > > You can do just something like:
> >> > >           for_each_net(net) {
> >> > >                   for_each_netdev(net, dev) {
> >> > > 			if (netif_is_bypass_master(dev)) {
> >> > This function returns the upper netdev as well as the ops associated
> >> > with that netdev.
> >> > bypass_master_list is a list of 'struct bypass_master' that associates
> >> Well, can't you have it in netdev priv?
> >
> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
> 
> > How come? You have no master? I don't understand..
> 
> 
> 
> >
> >> 
> >> 
> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
> >> > NULL for 3-netdev model.
> >> I see :(
> >> 
> >> 
> >> > 
> >> > > 
> >> > > 
> >> > > 
> >> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
> >> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
> >> > > > +			*ops = rcu_dereference(bypass_master->ops);
> >> > > I don't see how rcu_dereference is ok here.
> >> > > 1) I don't see rcu_read_lock taken
> >> > > 2) Looks like bypass_master->ops has the same value across the whole
> >> > >      existence.
> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
> >> > Yes. ops doesn't change.
> >> If it does not change, you can just access it directly.
> >> 
> >> 
> >> > > 
> >> > > > +			spin_unlock(&bypass_lock);
> >> > > > +			return bypass_netdev;
> >> > > > +		}
> >> > > > +	}
> >> > > > +	spin_unlock(&bypass_lock);
> >> > > > +	return NULL;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	int ret, orig_mtu;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > For master, could you use word "master" in the variables so it is clear?
> >> > > Also, "dev" is fine instead of "netdev".
> >> > > Something like "bpmaster_dev"
> >> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
> >> I was trying to point out, that "bypass_netdev" represents a "master"
> >> netdev, yet it does not say master. That is why I suggested
> >> "bpmaster_dev"
> >> 
> >> 
> >> > I can change all _netdev suffixes to _dev to make the names shorter.
> >> ok.
> >> 
> >> 
> >> > 
> >> > > 
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
> >> > > > +					bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > 	Just "if (ret)" will do. You have this on more places.
> >> > OK.
> >> > 
> >> > 
> >> > > 
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = netdev_rx_handler_register(slave_netdev,
> >> > > > +					 bypass_ops ? bypass_ops->handle_frame :
> >> > > > +					 bypass_handle_frame, bypass_netdev);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
> >> > > > +			   ret);
> >> > > > +		goto done;
> >> > > > +	}
> >> > > > +
> >> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
> >> > > > +			   bypass_netdev->name, ret);
> >> > > > +		goto upper_link_failed;
> >> > > > +	}
> >> > > > +
> >> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > +	if (netif_running(bypass_netdev)) {
> >> > > > +		ret = dev_open(slave_netdev);
> >> > > > +		if (ret && (ret != -EBUSY)) {
> >> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
> >> > > > +				   slave_netdev->name, ret);
> >> > > > +			goto err_interface_up;
> >> > > > +		}
> >> > > > +	}
> >> > > > +
> >> > > > +	/* Align MTU of slave with master */
> >> > > > +	orig_mtu = slave_netdev->mtu;
> >> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
> >> > > > +			   slave_netdev->name, bypass_netdev->mtu);
> >> > > > +		goto err_set_mtu;
> >> > > > +	}
> >> > > > +
> >> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > > +		goto err_join;
> >> > > > +
> >> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +	goto done;
> >> > > > +
> >> > > > +err_join:
> >> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
> >> > > > +err_set_mtu:
> >> > > > +	dev_close(slave_netdev);
> >> > > > +err_interface_up:
> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +upper_link_failed:
> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
> >> > > > +				       struct net_device *bypass_netdev,
> >> > > > +				       struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > +	struct net_device *backup_netdev, *active_netdev;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_pre_unregister)
> >> > > > +			return -EINVAL;
> >> > > > +
> >> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
> >> > > > +							bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > +		return -EINVAL;
> >> > > > +
> >> > > > +	return 0;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
> >> > > > +				struct net_device *bypass_netdev,
> >> > > > +				struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > +	struct net_device *backup_netdev, *active_netdev;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_release)
> >> > > > +			return -EINVAL;
> >> > > I think it would be good to make the API to the driver more strict and
> >> > > have a separate set of ops for "active" and "backup" netdevices.
> >> > > That should stop people thinking about extending this to more slaves in
> >> > > the future.
> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
> >> > 'active' slave.
> >> I'm very well aware of that. I just thought that explicit ops for the
> >> two slaves would make this more clear.
> >> 
> >> 
> >> > 
> >> > > 
> >> > > 
> >> > > > +
> >> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev == backup_netdev) {
> >> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
> >> > > > +	} else {
> >> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
> >> > > > +		if (backup_netdev) {
> >> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
> >> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
> >> > > > +		}
> >> > > > +	}
> >> > > > +
> >> > > > +	dev_put(slave_netdev);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +	return 0;
> >> > > > +}
> >> > > > +
> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	int ret;
> >> > > > +
> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
> >> > > > +					  bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
> >> > > > +
> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
> >> > > > +{
> >> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_link_change)
> >> > > > +			goto done;
> >> > > > +
> >> > > > +		return bypass_ops->slave_link_change(slave_netdev,
> >> > > > +						     bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	if (!netif_running(bypass_netdev))
> >> > > > +		return 0;
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > +		goto done;
> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> >> > > above is enough.
> >> > I think we need this check to not allow events from a slave that is not
> >> > attached to this master but has the same MAC.
> >> Why do we need such events? Seems wrong to me.
> >
> >We want to avoid events from a netdev that is mis-configured with the same MAC as
> >a bypass setup.
> >
> >>   Consider:
> >> 
> >> bp1      bp2
> >> a1 b1    a2 b2
> >> 
> >> 
> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
> >
> >We should not have 2 bypass configs with the same MAC.
> >I need to add a check in the bypass_master_register() to prevent this.
> 
> > The MAC can change, so you would have to check on change as well. Feels odd
> > though.
> 
> 
> >
> >The above check is to avoid cases where we have
> >bp1(a1, b1) with mac1
> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
> >
> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> >> the order of creation.
> >> Let's say it will return bp1. Then when we have event for a2, the
> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
> >> 
> >> 
> >> You cannot use bypass_master_get_bymac() here.
> >> 
> >> 
> >> 
> >> > > 
> >> > > > +
> >> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
> >> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
> >> > > > +		netif_carrier_on(bypass_netdev);
> >> > > > +		netif_tx_wake_all_queues(bypass_netdev);
> >> > > > +	} else {
> >> > > > +		netif_carrier_off(bypass_netdev);
> >> > > > +		netif_tx_stop_all_queues(bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
> >> > > > +{
> >> > > > +	/* Skip parent events */
> >> > > > +	if (netif_is_bypass_master(dev))
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid non-Ethernet type devices */
> >> > > > +	if (dev->type != ARPHRD_ETHER)
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
> >> > > > +	if (is_vlan_dev(dev))
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
> >> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
> >> > > helpers netif_is_bond_master().
> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
> >> > > 
> >> > > You need to do it not by blacklisting, but with whitelisting. You need
> >> > > to whitelist VF devices. My port flavours patchset might help with this.
> >> > Maybe I can use the netdev_has_lower_dev() helper to make sure that the slave
> >> I don't see such function in the code.
> >
> >It is netdev_has_any_lower_dev(). I need to export it.
> 
> Come on, you cannot use that. That would allow bonding without slaves,
> but the slaves could be added later on.
> 
> > What exactly are you trying to achieve by this?
> 
> 
> >
> >> 
> >> 
> >> > device is not an upper dev.
> >> > Can you point to your port flavours patchset? Is it upstream?
> >> I sent rfc couple of weeks ago:
> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
> >
> >
> >
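
To make the bp1/bp2 objection quoted above concrete, here is a minimal
userspace sketch in C. It is a model under stated assumptions, not code
from the patch: struct model_dev and every model_* helper are hypothetical
names, and the array walk only imitates what bypass_master_get_bymac() is
described as doing (return the first registered master whose permanent MAC
matches). The "by link" lookup stands in for any scheme that remembers
which master a slave was actually attached to.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_ETH_ALEN 6

/* Hypothetical userspace stand-in; not the kernel's struct net_device. */
struct model_dev {
	const char *name;
	unsigned char perm_addr[MODEL_ETH_ALEN];
	struct model_dev *master;	/* link recorded when the slave was attached */
};

/* Models a lookup by permanent MAC: the first registered match wins. */
struct model_dev *model_master_get_bymac(struct model_dev **masters,
					 size_t count,
					 const unsigned char *mac)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (memcmp(masters[i]->perm_addr, mac, MODEL_ETH_ALEN) == 0)
			return masters[i];
	}
	return NULL;
}

/* Alternative: follow the link stored on the slave itself. */
struct model_dev *model_master_get_bylink(const struct model_dev *slave)
{
	return slave->master;
}

/*
 * Builds bp1(a1) and bp2(a2), all sharing one MAC, with bp1 registered
 * first. Returns 1 if the MAC lookup for a2 disagrees with the master
 * that a2 was attached to, 0 otherwise.
 */
int model_bymac_mismatch(void)
{
	static const unsigned char mac[MODEL_ETH_ALEN] = {
		0x52, 0x54, 0x00, 0x12, 0x34, 0x56
	};
	struct model_dev bp1 = { .name = "bp1" };
	struct model_dev bp2 = { .name = "bp2" };
	struct model_dev a1 = { .name = "a1", .master = &bp1 };
	struct model_dev a2 = { .name = "a2", .master = &bp2 };
	struct model_dev *masters[] = { &bp1, &bp2 };

	memcpy(bp1.perm_addr, mac, MODEL_ETH_ALEN);
	memcpy(bp2.perm_addr, mac, MODEL_ETH_ALEN);
	memcpy(a1.perm_addr, mac, MODEL_ETH_ALEN);
	memcpy(a2.perm_addr, mac, MODEL_ETH_ALEN);

	/* An event for a2 arrives; resolve its master both ways. */
	return model_master_get_bymac(masters, 2, a2.perm_addr) !=
	       model_master_get_bylink(&a2);
}
```

In this model the MAC lookup hands back bp1 for an event on a2, while the
recorded link points at bp2, so model_bymac_mismatch() reports a
disagreement. That is the (a2, bp1) pairing the review calls wrong.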


* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-18 19:46             ` Michael S. Tsirkin
@ 2018-04-18 20:32               ` Jiri Pirko
  2018-04-18 22:46                 ` Samudrala, Sridhar
  2018-04-19  4:08                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 63+ messages in thread
From: Jiri Pirko @ 2018-04-18 20:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh

Wed, Apr 18, 2018 at 09:46:04PM CEST, mst@redhat.com wrote:
>On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > This provides a generic interface for paravirtual drivers to listen
>> >> > > > for netdev register/unregister/link change events from pci ethernet
>> >> > > > devices with the same MAC and takeover their datapath. The notifier and
>> >> > > > event handling code is based on the existing netvsc implementation.
>> >> > > > 
>> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> >> > > > master netdev is created. The paravirtual driver registers each bypass
>> >> > > > instance along with a set of ops to manage the slave events.
>> >> > > >       bypass_master_register()
>> >> > > >       bypass_master_unregister()
>> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> >> > > > the bypass module provides interfaces to create/destroy additional master
>> >> > > > netdev and all the slave events are managed internally.
>> >> > > >        bypass_master_create()
>> >> > > >        bypass_master_destroy()
>> >> > > > 
>> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >> > > > ---
>> >> > > > include/linux/netdevice.h |  14 +
>> >> > > > include/net/bypass.h      |  96 ++++++
>> >> > > > net/Kconfig               |  18 +
>> >> > > > net/core/Makefile         |   1 +
>> >> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> > > > 5 files changed, 973 insertions(+)
>> >> > > > create mode 100644 include/net/bypass.h
>> >> > > > create mode 100644 net/core/bypass.c
>> >> > > > 
>> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> >> > > > index cf44503ea81a..587293728f70 100644
>> >> > > > --- a/include/linux/netdevice.h
>> >> > > > +++ b/include/linux/netdevice.h
>> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> >> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
>> >> > > > 	IFF_MACSEC			= 1<<25,
>> >> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
>> >> > > > +	IFF_BYPASS			= 1 << 27,
>> >> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> >> > > I wonder, why you don't follow the existing coding style... Also, please
>> >> > > add these into the comment above.
>> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> >> > to the existing coding style to be consistent.
>> >> Please do.
>> >> 
>> >> 
>> >> > > 
>> >> > > > };
>> >> > > > 
>> >> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> >> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> >> > > > #define IFF_MACSEC			IFF_MACSEC
>> >> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> >> > > > +#define IFF_BYPASS			IFF_BYPASS
>> >> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> >> > > > 
>> >> > > > /**
>> >> > > >    *	struct net_device - The DEVICE structure.
>> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> >> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> >> > > > }
>> >> > > > 
>> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return dev->priv_flags & IFF_BYPASS;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> >> > > > +}
>> >> > > > +
>> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> >> > > > static inline void netif_keep_dst(struct net_device *dev)
>> >> > > > {
>> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> >> > > > new file mode 100644
>> >> > > > index 000000000000..86b02cb894cf
>> >> > > > --- /dev/null
>> >> > > > +++ b/include/net/bypass.h
>> >> > > > @@ -0,0 +1,96 @@
>> >> > > > +// SPDX-License-Identifier: GPL-2.0
>> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> >> > > > +
>> >> > > > +#ifndef _NET_BYPASS_H
>> >> > > > +#define _NET_BYPASS_H
>> >> > > > +
>> >> > > > +#include <linux/netdevice.h>
>> >> > > > +
>> >> > > > +struct bypass_ops {
>> >> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> >> > > > +				  struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_join)(struct net_device *slave_netdev,
>> >> > > > +			  struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> >> > > > +				    struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_release)(struct net_device *slave_netdev,
>> >> > > > +			     struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> >> > > > +				 struct net_device *bypass_netdev);
>> >> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> >> > > > +};
>> >> > > > +
>> >> > > > +struct bypass_master {
>> >> > > > +	struct list_head list;
>> >> > > > +	struct net_device __rcu *bypass_netdev;
>> >> > > > +	struct bypass_ops __rcu *ops;
>> >> > > > +};
>> >> > > > +
>> >> > > > +/* bypass state */
>> >> > > > +struct bypass_info {
>> >> > > > +	/* passthru netdev with same MAC */
>> >> > > > +	struct net_device __rcu *active_netdev;
>> >> > > You still use "active"/"backup" names which is highly misleading as
>> >> > > it has a completely different meaning than in bond, for example.
>> >> > > I noted that in my previous review already. Please change it.
>> >> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> >> > matches with the BACKUP feature bit we are adding to virtio_net.
>> >> I think that "backup" is also misleading. Both "active" and "backup"
>> >> mean a *state* of slaves. This should be named differently.
>> >> 
>> >> 
>> >> 
>> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> >> > am not too happy with it.
>> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> I think "stolen" is quite appropriate since it describes the modus
>> >> operandi. The bypass master steals some netdevice according to some
>> >> match.
>> >> 
>> >> But I don't insist on "stolen". Just sounds right.
>> >
>> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> >'backup' name is consistent.
>> 
>> It perhaps makes sense from the view of virtio device. However, as I
>> described couple of times, for master/slave device the name "backup" is
>> highly misleading.
>
>virtio is the backup. You are supposed to use another
>(typically passthrough) device, if that fails use virtio.
>It does seem appropriate to me. If you like, we can
>change that to "standby".  Active I don't like either. "main"?

Sounds much better, yes.


>
>In fact would failover be better than bypass?

Also, much better.


>
>
>> 
>> >
>> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>> >
>> >Will look for any suggestions in the next day or two. If i don't get any, i will go
>> >with 'stolen'
>> >
>> ><snip>
>> >
>> >
>> >> +
>> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> >> +						  struct bypass_ops **ops)
>> >> +{
>> >> +	struct bypass_master *bypass_master;
>> >> +	struct net_device *bypass_netdev;
>> >> +
>> >> +	spin_lock(&bypass_lock);
>> >> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> >> > > As I wrote the last time, you don't need this list, spinlock.
>> >> > > You can do just something like:
>> >> > >           for_each_net(net) {
>> >> > >                   for_each_netdev(net, dev) {
>> >> > > 			if (netif_is_bypass_master(dev)) {
>> >> > This function returns the upper netdev as well as the ops associated
>> >> > with that netdev.
>> >> > bypass_master_list is a list of 'struct bypass_master' that associates
>> >> Well, can't you have it in netdev priv?
>> >
>> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
>> 
>> How come? You have no master? I don't understand..
>> 
>> 
>> 
>> >
>> >> 
>> >> 
>> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> >> > NULL for 3-netdev model.
>> >> I see :(
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > 
>> >> > > 
>> >> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> >> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> >> > > > +			*ops = rcu_dereference(bypass_master->ops);
>> >> > > I don't see how rcu_dereference is ok here.
>> >> > > 1) I don't see rcu_read_lock taken
>> >> > > 2) Looks like bypass_master->ops has the same value across the whole
>> >> > >      existence.
>> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> >> > Yes. ops doesn't change.
>> >> If it does not change, you can just access it directly.
>> >> 
>> >> 
>> >> > > 
>> >> > > > +			spin_unlock(&bypass_lock);
>> >> > > > +			return bypass_netdev;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +	spin_unlock(&bypass_lock);
>> >> > > > +	return NULL;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	int ret, orig_mtu;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > For master, could you use word "master" in the variables so it is clear?
>> >> > > Also, "dev" is fine instead of "netdev".
>> >> > > Something like "bpmaster_dev"
>> >> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
>> >> I was trying to point out, that "bypass_netdev" represents a "master"
>> >> netdev, yet it does not say master. That is why I suggested
>> >> "bpmaster_dev"
>> >> 
>> >> 
>> >> > I can change all _netdev suffixes to _dev to make the names shorter.
>> >> ok.
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> >> > > > +					bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > 	Just "if (ret)" will do. You have this on more places.
>> >> > OK.
>> >> > 
>> >> > 
>> >> > > 
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = netdev_rx_handler_register(slave_netdev,
>> >> > > > +					 bypass_ops ? bypass_ops->handle_frame :
>> >> > > > +					 bypass_handle_frame, bypass_netdev);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> >> > > > +			   ret);
>> >> > > > +		goto done;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> >> > > > +			   bypass_netdev->name, ret);
>> >> > > > +		goto upper_link_failed;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > +	if (netif_running(bypass_netdev)) {
>> >> > > > +		ret = dev_open(slave_netdev);
>> >> > > > +		if (ret && (ret != -EBUSY)) {
>> >> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> >> > > > +				   slave_netdev->name, ret);
>> >> > > > +			goto err_interface_up;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	/* Align MTU of slave with master */
>> >> > > > +	orig_mtu = slave_netdev->mtu;
>> >> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> >> > > > +			   slave_netdev->name, bypass_netdev->mtu);
>> >> > > > +		goto err_set_mtu;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > > +		goto err_join;
>> >> > > > +
>> >> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +	goto done;
>> >> > > > +
>> >> > > > +err_join:
>> >> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
>> >> > > > +err_set_mtu:
>> >> > > > +	dev_close(slave_netdev);
>> >> > > > +err_interface_up:
>> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +upper_link_failed:
>> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> >> > > > +				       struct net_device *bypass_netdev,
>> >> > > > +				       struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > +	struct net_device *backup_netdev, *active_netdev;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_pre_unregister)
>> >> > > > +			return -EINVAL;
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> >> > > > +							bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > +		return -EINVAL;
>> >> > > > +
>> >> > > > +	return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> >> > > > +				struct net_device *bypass_netdev,
>> >> > > > +				struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > +	struct net_device *backup_netdev, *active_netdev;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_release)
>> >> > > > +			return -EINVAL;
>> >> > > I think it would be good to make the API to the driver more strict and
>> >> > > have a separate set of ops for "active" and "backup" netdevices.
>> >> > > That should stop people thinking about extending this to more slaves in
>> >> > > the future.
>> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> >> > 'active' slave.
>> >> I'm very well aware of that. I just thought that explicit ops for the
>> >> two slaves would make this more clear.
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > 
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev == backup_netdev) {
>> >> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> >> > > > +	} else {
>> >> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> >> > > > +		if (backup_netdev) {
>> >> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> >> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	dev_put(slave_netdev);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +	return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	int ret;
>> >> > > > +
>> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> >> > > > +					  bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> >> > > > +
>> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_link_change)
>> >> > > > +			goto done;
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_link_change(slave_netdev,
>> >> > > > +						     bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	if (!netif_running(bypass_netdev))
>> >> > > > +		return 0;
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > +		goto done;
>> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> >> > > above is enough.
>> >> > I think we need this check to not allow events from a slave that is not
>> >> > attached to this master but has the same MAC.
>> >> Why do we need such events? Seems wrong to me.
>> >
>> >We want to avoid events from a netdev that is mis-configured with the same MAC as
>> >a bypass setup.
>> >
>> >>   Consider:
>> >> 
>> >> bp1      bp2
>> >> a1 b1    a2 b2
>> >> 
>> >> 
>> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>> >
>> >We should not have 2 bypass configs with the same MAC.
>> >I need to add a check in the bypass_master_register() to prevent this.
>> 
>> Mac can change, you would have to check in change as well. Feels odd
>> thought. 
>> 
>> 
>> >
>> >The above check is to avoid cases where we have
>> >bp1(a1, b1) with mac1
>> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>> >
>> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> >> the order of creation.
>> >> Let's say it will return bp1. Then when we have event for a2, the
>> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> >> 
>> >> 
>> >> You cannot use bypass_master_get_bymac() here.
>> >> 
>> >> 
>> >> 
>> >> > > 
>> >> > > > +
>> >> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> >> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> >> > > > +		netif_carrier_on(bypass_netdev);
>> >> > > > +		netif_tx_wake_all_queues(bypass_netdev);
>> >> > > > +	} else {
>> >> > > > +		netif_carrier_off(bypass_netdev);
>> >> > > > +		netif_tx_stop_all_queues(bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> >> > > > +{
>> >> > > > +	/* Skip parent events */
>> >> > > > +	if (netif_is_bypass_master(dev))
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid non-Ethernet type devices */
>> >> > > > +	if (dev->type != ARPHRD_ETHER)
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
>> >> > > > +	if (is_vlan_dev(dev))
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> >> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> >> > > helpers netif_is_bond_master().
>> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> >> > > 
>> >> > > You need to do it not by blacklisting, but with whitelisting. You need
>> >> > > to whitelist VF devices. My port flavours patchset might help with this.
>> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> >> I don't see such function in the code.
>> >
>> >It is netdev_has_any_lower_dev(). I need to export it.
>> 
>> Come on, you cannot use that. That would allow bonding without slaves,
>> but the slaves could be added later on.
>> 
>> What exactly you are trying to achieve by this?
>> 
>> 
>> >
>> >> 
>> >> 
>> >> > device is not an upper dev.
>> >> > Can you point to your port flavours patchset? Is it upstream?
>> >> I sent rfc couple of weeks ago:
>> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>> >
>> >
>> >
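
To illustrate the bp1/bp2 case discussed above, here is a minimal
userspace sketch. It is not the patch code: struct model_dev,
master_get_bymac() and master_get_bylink() are hypothetical stand-ins,
and the first-match behaviour of the MAC lookup is an assumption taken
from the description in this thread. It shows a slave of bp2 being
resolved to bp1 when the lookup goes by MAC, and to the right master
when the link recorded at enslave time is used instead.

```c
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical userspace stand-in for struct net_device. */
struct model_dev {
	const char *name;
	unsigned char mac[ETH_ALEN];
	struct model_dev *master;	/* recorded when the slave is linked */
};

/* First-match lookup by MAC, in creation order (the pattern under review). */
static struct model_dev *master_get_bymac(struct model_dev **masters,
					  size_t n, const unsigned char *mac)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!memcmp(masters[i]->mac, mac, ETH_ALEN))
			return masters[i];
	return NULL;
}

/* Lookup through the master pointer stored at enslave time. */
static struct model_dev *master_get_bylink(const struct model_dev *slave)
{
	return slave->master;
}
```

In the kernel, the upper-device link set up by netdev_upper_dev_link()
can play the role of the stored pointer, for example via
netdev_master_upper_dev_get(); whether that fits the 2-netdev model is
left open here.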

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-18 20:32               ` Jiri Pirko
@ 2018-04-18 22:46                 ` Samudrala, Sridhar
  2018-04-19  6:35                   ` Jiri Pirko
  2018-04-19  4:08                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 63+ messages in thread
From: Samudrala, Sridhar @ 2018-04-18 22:46 UTC (permalink / raw)
  To: Jiri Pirko, Michael S. Tsirkin
  Cc: stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh

On 4/18/2018 1:32 PM, Jiri Pirko wrote:
>>>>>>> You still use "active"/"backup" names which is highly misleading as
>>>>>>> it has completely different meaning that in bond for example.
>>>>>>> I noted that in my previous review already. Please change it.
>>>>>> I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>>>>>> matches with the BACKUP feature bit we are adding to virtio_net.
>>>>> I think that "backup" is also misleading. Both "active" and "backup"
>>>>> mean a *state* of slaves. This should be named differently.
>>>>>
>>>>>
>>>>>
>>>>>> With regards to alternate names for 'active', you suggested 'stolen', but i
>>>>>> am not too happy with it.
>>>>>> netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>>>>> No. The netdev could be any netdevice. It does not have to be a "VF".
>>>>> I think "stolen" is quite appropriate since it describes the modus
>>>>> operandi. The bypass master steals some netdevice according to some
>>>>> match.
>>>>>
>>>>> But I don't insist on "stolen". Just sounds right.
>>>> We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>>>> 'backup' name is consistent.
>>> It perhaps makes sense from the view of virtio device. However, as I
>>> described couple of times, for master/slave device the name "backup" is
>>> highly misleading.
>> virtio is the backup. You are supposed to use another
>> (typically passthrough) device, if that fails use virtio.
>> It does seem appropriate to me. If you like, we can
>> change that to "standby".  Active I don't like either. "main"?
> Sounds much better, yes.

OK. Will change backup to 'standby'.
'main' is fine, what about 'primary'?


>
>
>> In fact would failover be better than bypass?
> Also, much better.

So do we want to change all 'bypass' references to 'failover', including
the filenames (net/core/failover.c and include/net/failover.h)?

<snip>



>
>
>>
>>>> The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>>>> a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>>>>
>>>> Will look for any suggestions in the next day or two. If i don't get any, i will go
>>>> with 'stolen'
>>>>
>>>> <snip>
>>>>
>>>>
>>>>> +
>>>>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>>>>> +						  struct bypass_ops **ops)
>>>>> +{
>>>>> +	struct bypass_master *bypass_master;
>>>>> +	struct net_device *bypass_netdev;
>>>>> +
>>>>> +	spin_lock(&bypass_lock);
>>>>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>>>>>>> As I wrote the last time, you don't need this list, spinlock.
>>>>>>> You can do just something like:
>>>>>>>            for_each_net(net) {
>>>>>>>                    for_each_netdev(net, dev) {
>>>>>>> 			if (netif_is_bypass_master(dev)) {
>>>>>> This function returns the upper netdev as well as the ops associated
>>>>>> with that netdev.
>>>>>> bypass_master_list is a list of 'struct bypass_master' that associates
>>>>> Well, can't you have it in netdev priv?
>>>> We cannot do this for 2-netdev model as there is no bypass_netdev created.
>>> Howcome? You have no master? I don't understand..

For the 2-netdev model, the master netdev is not a new one created by the bypass module.
It is created by netvsc internally and passed in via bypass_master_register().

<snip>



>>>
>>>>>>>> +
>>>>>>>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>>>>>>>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>>>>>>> Yeah, this is certainly incorrect. One thing is, you should be using the
>>>>>>> helpers netif_is_bond_master().
>>>>>>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>>>>>>>
>>>>>>> You need to do it not by blacklisting, but with whitelisting. You need
>>>>>>> to whitelist VF devices. My port flavours patchset might help with this.
>>>>>> May be i can use netdev_has_lower_dev() helper to make sure that the slave
>>>>> I don't see such function in the code.
>>>> It is netdev_has_any_lower_dev(). I need to export it.
>>> Come on, you cannot use that. That would allow bonding without slaves,
>>> but the slaves could be added later on.
>>>
>>> What exactly you are trying to achieve by this?

I think I can remove this check. In pre-register, for the backup device,
I check that its parent matches the bypass device, and for the VF device
we make sure that it is a PCI device.
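
A rough sketch of that pre-register check as a whitelist, written as a
userspace model rather than kernel code. The types model_parent and
model_dev, the is_pci flag, and classify_slave() are all hypothetical
names for illustration. The assumption that stacked software devices
(bond, vlan, bridge and so on) have no bus parent is part of the model,
not something established in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the bus device a netdev hangs off. */
struct model_parent {
	bool is_pci;
};

/* Hypothetical stand-in for struct net_device. */
struct model_dev {
	const char *name;
	const struct model_parent *parent;	/* NULL for software devices */
};

enum slave_role { SLAVE_REJECTED, SLAVE_STANDBY, SLAVE_PRIMARY };

/* Accept only the two device kinds named above; reject everything else. */
static enum slave_role classify_slave(const struct model_dev *failover,
				      const struct model_dev *slave)
{
	if (!slave->parent)
		return SLAVE_REJECTED;	/* no bus parent: not a candidate */
	if (slave->parent == failover->parent)
		return SLAVE_STANDBY;	/* same parent as the failover dev */
	if (slave->parent->is_pci)
		return SLAVE_PRIMARY;	/* passthru/VF-style PCI device */
	return SLAVE_REJECTED;
}
```

Because anything that is not explicitly matched is rejected, no
per-type exclusions for bonding, vlan and the like are needed in this
model.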

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-18 20:32               ` Jiri Pirko
  2018-04-18 22:46                 ` Samudrala, Sridhar
@ 2018-04-19  4:08                 ` Michael S. Tsirkin
  2018-04-19  7:22                   ` Jiri Pirko
  1 sibling, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2018-04-19  4:08 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh

On Wed, Apr 18, 2018 at 10:32:06PM +0200, Jiri Pirko wrote:
> >> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
> >> >> > am not too happy with it.
> >> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> >> >> No. The netdev could be any netdevice. It does not have to be a "VF".
> >> >> I think "stolen" is quite appropriate since it describes the modus
> >> >> operandi. The bypass master steals some netdevice according to some
> >> >> match.
> >> >> 
> >> >> But I don't insist on "stolen". Just sounds right.
> >> >
> >> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
> >> >'backup' name is consistent.
> >> 
> >> It perhaps makes sense from the view of virtio device. However, as I
> >> described couple of times, for master/slave device the name "backup" is
> >> highly misleading.
> >
> >virtio is the backup. You are supposed to use another
> >(typically passthrough) device, if that fails use virtio.
> >It does seem appropriate to me. If you like, we can
> >change that to "standby".  Active I don't like either. "main"?
> 
> Sounds much better, yes.

Excuse me, which of the versions is better in your eyes?


> 
> >
> >In fact would failover be better than bypass?
> 
> Also, much better.
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-18 22:46                 ` Samudrala, Sridhar
@ 2018-04-19  6:35                   ` Jiri Pirko
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Pirko @ 2018-04-19  6:35 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh

Thu, Apr 19, 2018 at 12:46:11AM CEST, sridhar.samudrala@intel.com wrote:
>On 4/18/2018 1:32 PM, Jiri Pirko wrote:
>> > > > > > > You still use "active"/"backup" names which is highly misleading as
>> > > > > > > it has completely different meaning that in bond for example.
>> > > > > > > I noted that in my previous review already. Please change it.
>> > > > > > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> > > > > > matches with the BACKUP feature bit we are adding to virtio_net.
>> > > > > I think that "backup" is also misleading. Both "active" and "backup"
>> > > > > mean a *state* of slaves. This should be named differently.
>> > > > > 
>> > > > > 
>> > > > > 
>> > > > > > With regards to alternate names for 'active', you suggested 'stolen', but i
>> > > > > > am not too happy with it.
>> > > > > > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> > > > > No. The netdev could be any netdevice. It does not have to be a "VF".
>> > > > > I think "stolen" is quite appropriate since it describes the modus
>> > > > > operandi. The bypass master steals some netdevice according to some
>> > > > > match.
>> > > > > 
>> > > > > But I don't insist on "stolen". Just sounds right.
>> > > > We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> > > > 'backup' name is consistent.
>> > > It perhaps makes sense from the view of virtio device. However, as I
>> > > described couple of times, for master/slave device the name "backup" is
>> > > highly misleading.
>> > virtio is the backup. You are supposed to use another
>> > (typically passthrough) device, if that fails use virtio.
>> > It does seem appropriate to me. If you like, we can
>> > change that to "standby".  Active I don't like either. "main"?
>> Sounds much better, yes.
>
>OK. Will change backup to 'standby'.
>'main' is fine, what about 'primary'?

Primary is also bonding terminology. But in this case, I think it would
fit. The primary slave is used as the active one whenever the link is
up.
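
A small userspace sketch of that rule, for illustration only:
model_dev, xmit_ready() and select_xmit_dev() are hypothetical names,
with xmit_ready() modelled on the bypass_xmit_ready() helper quoted
from the patch (running and carrier ok). The primary is chosen
whenever it is ready, the standby otherwise, and the master has no
carrier when neither is ready.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state bits of a slave net_device. */
struct model_dev {
	const char *name;
	bool running;	/* models netif_running() */
	bool carrier;	/* models netif_carrier_ok() */
};

/* A slave can transmit only if it exists, is running and has carrier. */
static bool xmit_ready(const struct model_dev *dev)
{
	return dev && dev->running && dev->carrier;
}

/* Prefer the primary; fall back to the standby; NULL means carrier off. */
static const struct model_dev *
select_xmit_dev(const struct model_dev *primary,
		const struct model_dev *standby)
{
	if (xmit_ready(primary))
		return primary;
	if (xmit_ready(standby))
		return standby;
	return NULL;
}
```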


>
>
>> 
>> 
>> > In fact would failover be better than bypass?
>> Also, much better.
>
>So do we want to change all 'bypass' references to 'failover' including
>the filenames.(net/core/failover.c and include/net/failover.h)
>
><snip>
>
>
>
>> 
>> 
>> > 
>> > > > The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>> > > > a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>> > > > 
>> > > > Will look for any suggestions in the next day or two. If i don't get any, i will go
>> > > > with 'stolen'
>> > > > 
>> > > > <snip>
>> > > > 
>> > > > 
>> > > > > +
>> > > > > +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> > > > > +						  struct bypass_ops **ops)
>> > > > > +{
>> > > > > +	struct bypass_master *bypass_master;
>> > > > > +	struct net_device *bypass_netdev;
>> > > > > +
>> > > > > +	spin_lock(&bypass_lock);
>> > > > > +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> > > > > > > As I wrote the last time, you don't need this list, spinlock.
>> > > > > > > You can do just something like:
>> > > > > > >            for_each_net(net) {
>> > > > > > >                    for_each_netdev(net, dev) {
>> > > > > > > 			if (netif_is_bypass_master(dev)) {
>> > > > > > This function returns the upper netdev as well as the ops associated
>> > > > > > with that netdev.
>> > > > > > bypass_master_list is a list of 'struct bypass_master' that associates
>> > > > > Well, can't you have it in netdev priv?
>> > > > We cannot do this for 2-netdev model as there is no bypass_netdev created.
>> > > Howcome? You have no master? I don't understand..
>
>For 2-netdev model, the master netdev is not a new one created by the bypass module.
>It is created by netvsc internally and passed via bypass_master_register()

But virtio_net also has to create the master and pass it down to the
bypass module. How come it is different?


>
><snip>
>
>
>
>> > > 
>> > > > > > > > +
>> > > > > > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> > > > > > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> > > > > > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> > > > > > > helpers netif_is_bond_master().
>> > > > > > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> > > > > > > 
>> > > > > > > You need to do it not by blacklisting, but with whitelisting. You need
>> > > > > > > to whitelist VF devices. My port flavours patchset might help with this.
>> > > > > > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> > > > > I don't see such function in the code.
>> > > > It is netdev_has_any_lower_dev(). I need to export it.
>> > > Come on, you cannot use that. That would allow bonding without slaves,
>> > > but the slaves could be added later on.
>> > > 
>> > > What exactly you are trying to achieve by this?
>
>I think i can remove this check.  In pre-register,
>for backup device, i check that its parent matches bypass device &
>for vf device, we make sure that it is a pci device.

Okay. That is a start.


>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
  2018-04-19  4:08                 ` Michael S. Tsirkin
@ 2018-04-19  7:22                   ` Jiri Pirko
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Pirko @ 2018-04-19  7:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh

Thu, Apr 19, 2018 at 06:08:58AM CEST, mst@redhat.com wrote:
>On Wed, Apr 18, 2018 at 10:32:06PM +0200, Jiri Pirko wrote:
>> >> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> >> >> > am not too happy with it.
>> >> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> >> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> >> I think "stolen" is quite appropriate since it describes the modus
>> >> >> operandi. The bypass master steals some netdevice according to some
>> >> >> match.
>> >> >> 
>> >> >> But I don't insist on "stolen". Just sounds right.
>> >> >
>> >> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> >> >'backup' name is consistent.
>> >> 
>> >> It perhaps makes sense from the view of virtio device. However, as I
>> >> described couple of times, for master/slave device the name "backup" is
>> >> highly misleading.
>> >
>> >virtio is the backup. You are supposed to use another
>> >(typically passthrough) device, if that fails use virtio.
>> >It does seem appropriate to me. If you like, we can
>> >change that to "standby".  Active I don't like either. "main"?
>> 
>> Sounds much better, yes.
>
>Excuse me, which of the versions are better in your eyes?

standby is okay. main/primary is fine too.

>
>
>> 
>> >
>> >In fact would failover be better than bypass?
>> 
>> Also, much better.
>> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2018-04-11  7:53     ` Jiri Pirko
@ 2019-02-22  1:14       ` Siwei Liu
  2019-02-22  1:39         ` Michael S. Tsirkin
  0 siblings, 1 reply; 63+ messages in thread
From: Siwei Liu @ 2019-02-22  1:14 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Stephen Hemminger, Sridhar Samudrala, Michael S. Tsirkin,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon,
	si-wei liu

Sorry for replying to this ancient thread. There is a remaining
issue that I don't think the initial net_failover patch addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically written for such automatic in-kernel enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* the virtual device gets created. That is what udev can
control (without the netdev being opened early by another part of the
kernel), and other userspace components, e.g. initramfs and
init-scripts, can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention and
provides no solution if the user cares about consistent naming
of the slave netdevs specifically.

Previously this issue was specifically called out when IFF_HIDDEN
and the 1-netdev model were proposed, but no one has offered a solution to
this problem since. Please share your thoughts on how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

On Wed, Apr 11, 2018 at 12:53 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Tue, Apr 10, 2018 at 11:26:08PM CEST, stephen@networkplumber.org wrote:
> >On Tue, 10 Apr 2018 11:59:50 -0700
> >Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >
> >> Use the registration/notification framework supported by the generic
> >> bypass infrastructure.
> >>
> >> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> ---
> >
> >Thanks for doing this.  Your current version has couple show stopper
> >issues.
> >
> >First, the slave device is instantly taking over the slave.
> >This doesn't allow udev/systemd to do its device rename of the slave
> >device. Netvsc uses a delayed work to workaround this.
>
> Wait. Why the fact a device is enslaved has to affect the udev in any
> way? If it does, smells like a bug in udev.

See above for clarifications.

Thanks,


>
>
> >
> >Secondly, the select queue needs to call queue selection in VF.
> >The bonding/teaming logic doesn't work well for UDP flows.
> >Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> >fixed this performance problem.
> >
> >Lastly, more indirection is bad in current climate.
> >
> >I am not completely adverse to this but it needs to be fast, simple
> >and completely transparent.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-22  1:14       ` net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework) Siwei Liu
@ 2019-02-22  1:39         ` Michael S. Tsirkin
  2019-02-22  3:33           ` [virtio-dev] " si-wei liu
  0 siblings, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-22  1:39 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Jiri Pirko, Stephen Hemminger, Sridhar Samudrala, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
	Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon,
	si-wei liu

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> Sorry for replying to this ancient thread. There was some remaining
> issue that I don't think the initial net_failover patch got addressed
> cleanly, see:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> 
> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> not specifically written for such kernel automatic enslavement.
> Specifically, if it is a bond or team, the slave would typically get
> renamed *before* virtual device gets created, that's what udev can
> control (without getting netdev opened early by the other part of
> kernel) and other userspace components for e.g. initramfs,
> init-scripts can coordinate well in between. The in-kernel
> auto-enslavement of net_failover breaks this userspace convention,
> which don't provides a solution if user care about consistent naming
> on the slave netdevs specifically.
> 
> Previously this issue had been specifically called out when IFF_HIDDEN
> and the 1-netdev was proposed, but no one gives out a solution to this
> problem ever since. Please share your mind how to proceed and solve
> this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

	there's no motivation in the systemd/udevd community at
	this point to refactor the rename logic and make it work well with
	3-netdev.

What would the fix be? Skip slave devices?

-- 
MST

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-22  1:39         ` Michael S. Tsirkin
@ 2019-02-22  3:33           ` si-wei liu
       [not found]             ` <91d4cbb1-be7a-b53c-6b2a-99bef07e7c53@intel.com>
  0 siblings, 1 reply; 63+ messages in thread
From: si-wei liu @ 2019-02-22  3:33 UTC (permalink / raw)
  To: Michael S. Tsirkin, Siwei Liu
  Cc: Jiri Pirko, Stephen Hemminger, Sridhar Samudrala, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
	Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon



On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>> Sorry for replying to this ancient thread. There was some remaining
>> issue that I don't think the initial net_failover patch got addressed
>> cleanly, see:
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>
>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>> not specifically written for such kernel automatic enslavement.
>> Specifically, if it is a bond or team, the slave would typically get
>> renamed *before* virtual device gets created, that's what udev can
>> control (without getting netdev opened early by the other part of
>> kernel) and other userspace components for e.g. initramfs,
>> init-scripts can coordinate well in between. The in-kernel
>> auto-enslavement of net_failover breaks this userspace convention,
>> which don't provides a solution if user care about consistent naming
>> on the slave netdevs specifically.
>>
>> Previously this issue had been specifically called out when IFF_HIDDEN
>> and the 1-netdev was proposed, but no one gives out a solution to this
>> problem ever since. Please share your mind how to proceed and solve
>> this userspace issue if netdev does not welcome a 1-netdev model.
> Above says:
>
> 	there's no motivation in the systemd/udevd community at
> 	this point to refactor the rename logic and make it work well with
> 	3-netdev.
>
> What would the fix be? Skip slave devices?
>
There's nothing the user gains from just skipping slave devices: the name
is still unchanged and unpredictable, e.g. eth0, or eth1 on the next reboot,
while the rest may conform to the naming scheme (ens3 and such). There's
no way one can fix this in userspace alone: when the failover is
created, the enslaved netdev is opened by the kernel before
userspace is made aware of it, and there's no negotiation protocol for the
kernel to know when userspace has done the initial renaming of the
interface. I would expect the netdev list to at least provide a
general direction for how this can be solved...

-Siwei



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
       [not found]             ` <91d4cbb1-be7a-b53c-6b2a-99bef07e7c53@intel.com>
@ 2019-02-22  7:55               ` si-wei liu
  2019-02-22 12:58                 ` Rob Miller
  2019-02-22 15:14                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 63+ messages in thread
From: si-wei liu @ 2019-02-22  7:55 UTC (permalink / raw)
  To: Samudrala, Sridhar, Michael S. Tsirkin, Siwei Liu
  Cc: Jiri Pirko, Stephen Hemminger, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski, Jason Wang, liran.alon



On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>
>
> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>
>>
>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>> Sorry for replying to this ancient thread. There was some remaining
>>>> issue that I don't think the initial net_failover patch got addressed
>>>> cleanly, see:
>>>>
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>
>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>> not specifically written for such kernel automatic enslavement.
>>>> Specifically, if it is a bond or team, the slave would typically get
>>>> renamed *before* virtual device gets created, that's what udev can
>>>> control (without getting netdev opened early by the other part of
>>>> kernel) and other userspace components for e.g. initramfs,
>>>> init-scripts can coordinate well in between. The in-kernel
>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>> which don't provides a solution if user care about consistent naming
>>>> on the slave netdevs specifically.
>>>>
>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>> problem ever since. Please share your mind how to proceed and solve
>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>> Above says:
>>>
>>>     there's no motivation in the systemd/udevd community at
>>>     this point to refactor the rename logic and make it work well with
>>>     3-netdev.
>>>
>>> What would the fix be? Skip slave devices?
>>>
>> There's nothing user can get if just skipping slave devices - the 
>> name is still unchanged and unpredictable e.g. eth0, or eth1 the next 
>> reboot, while the rest may conform to the naming scheme (ens3 and 
>> such). There's no way one can fix this in userspace alone - when the 
>> failover is created the enslaved netdev was opened by the kernel 
>> earlier than the userspace is made aware of, and there's no 
>> negotiation protocol for kernel to know when userspace has done 
>> initial renaming of the interface. I would expect netdev list should 
>> at least provide the direction in general for how this can be solved...
>>
> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> to only work with the master failover device.
Where does this expectation come from?

Admin users may have ethtool or tc configurations that depend on a
predictable interface name. Third-party apps built around a specific
interface name can't be modified to chase dynamic names.

Specifically, we have pre-canned images that use ethtool to fine-tune VF
offload settings post boot for specific workloads. Those images won't
work well if the name keeps changing after just a couple of rounds of
live migration.

> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
Yes, that's one possible implementation (IMHO the key is to make the
1-netdev model behave as much like a real NIC as possible, while a
hidden netns is just the vehicle). However, I recall there was
resistance in that discussion, to the point that the very concept of
hiding is taboo for Linux netdev. I would like to solicit potential
alternatives before concluding too soon that 1-netdev is the only
solution.

Thanks,
-Siwei

>
>> -Siwei
>>
>>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-22  7:55               ` si-wei liu
@ 2019-02-22 12:58                 ` Rob Miller
  2019-02-22 15:14                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 63+ messages in thread
From: Rob Miller @ 2019-02-22 12:58 UTC (permalink / raw)
  To: si-wei liu
  Cc: Samudrala, Sridhar, Michael S. Tsirkin, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jakub Kicinski,
	Jason Wang, liran.alon

I don’t know enough about how they get named, but is it possible for
user space to suggest the interface name, such that the name would be
as unique as the VM name itself and scoped to the network boundary of
an organization?

In other words, say that as a company I decide to name my VMs co-vm-1
through co-vm-xxx; I would leave off the location of the VM because
that will change. My interfaces would then be named co-vm-1.0 through
co-vm-1.x.

Just thinking out loud.

> On Feb 22, 2019, at 2:55 AM, si-wei liu <si-wei.liu@oracle.com> wrote:
>
>
>
>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>
>>
>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>
>>>
>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>> cleanly, see:
>>>>>
>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>
>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>> not specifically written for such kernel automatic enslavement.
>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>> control (without getting netdev opened early by the other part of
>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>> which don't provides a solution if user care about consistent naming
>>>>> on the slave netdevs specifically.
>>>>>
>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>> Above says:
>>>>
>>>>    there's no motivation in the systemd/udevd community at
>>>>    this point to refactor the rename logic and make it work well with
>>>>    3-netdev.
>>>>
>>>> What would the fix be? Skip slave devices?
>>>>
>>> There's nothing user can get if just skipping slave devices - the name is still unchanged and unpredictable e.g. eth0, or eth1 the next reboot, while the rest may conform to the naming scheme (ens3 and such). There's no way one can fix this in userspace alone - when the failover is created the enslaved netdev was opened by the kernel earlier than the userspace is made aware of, and there's no negotiation protocol for kernel to know when userspace has done initial renaming of the interface. I would expect netdev list should at least provide the direction in general for how this can be solved...
>>>
>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>> to only work with the master failover device.
> Where does this expectation come from?
>
> Admin users may have ethtool or tc configurations that need to deal with predictable interface name. Third-party app which was built upon specifying certain interface name can't be modified to chase dynamic names.
>
> Specifically, we have pre-canned image that uses ethtool to fine tune VF offload settings post boot for specific workload. Those images won't work well if the name is constantly changing just after couple rounds of live migration.
>
>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> Yes, that's one possible implementation (IMHO the key is to make 1-netdev model as much transparent to a real NIC as possible, while a hidden netns is just the vehicle). However, I recall there was resistance around this discussion that even the concept of hiding itself is a taboo for Linux netdev. I would like to summon potential alternatives before concluding 1-netdev is the only solution too soon.
>
> Thanks,
> -Siwei
>
>>
>>> -Siwei
>>>
>>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-22  7:55               ` si-wei liu
  2019-02-22 12:58                 ` Rob Miller
@ 2019-02-22 15:14                 ` Michael S. Tsirkin
  2019-02-26  0:58                   ` si-wei liu
  1 sibling, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-22 15:14 UTC (permalink / raw)
  To: si-wei liu
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> 
> 
> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > 
> > 
> > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > 
> > > 
> > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > cleanly, see:
> > > > > 
> > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > 
> > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > not specifically written for such kernel automatic enslavement.
> > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > control (without getting netdev opened early by the other part of
> > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > which don't provides a solution if user care about consistent naming
> > > > > on the slave netdevs specifically.
> > > > > 
> > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > Above says:
> > > > 
> > > >     there's no motivation in the systemd/udevd community at
> > > >     this point to refactor the rename logic and make it work well with
> > > >     3-netdev.
> > > > 
> > > > What would the fix be? Skip slave devices?
> > > > 
> > > There's nothing user can get if just skipping slave devices - the
> > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > next reboot, while the rest may conform to the naming scheme (ens3
> > > and such). There's no way one can fix this in userspace alone - when
> > > the failover is created the enslaved netdev was opened by the kernel
> > > earlier than the userspace is made aware of, and there's no
> > > negotiation protocol for kernel to know when userspace has done
> > > initial renaming of the interface. I would expect netdev list should
> > > at least provide the direction in general for how this can be
> > > solved...


I was just wondering what you meant when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal that udev rejected?

Anyway, can we write a time diagram showing what happens, and in which
order, to produce the failure?  That would help us look for triggers
that we can tie into, or add new ones.






> > > 
> > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > to only work with the master failover device.
> Where does this expectation come from?
> 
> Admin users may have ethtool or tc configurations that need to deal with
> predictable interface name. Third-party app which was built upon specifying
> certain interface name can't be modified to chase dynamic names.
> 
> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> offload settings post boot for specific workload. Those images won't work
> well if the name is constantly changing just after couple rounds of live
> migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.
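
As a rough illustration of the propagation idea above, here is a toy
Python model, not kernel code and not how net_failover behaves today.
The class names (FailoverMaster, Slave) and the feature keys are
hypothetical. The point is only that settings recorded on the master can
be replayed onto whichever slave registers later, so a script never has
to know the slave's name.

```python
class Slave:
    """Hypothetical stand-in for a VF or virtio-net slave netdev."""

    def __init__(self, name):
        self.name = name
        self.features = {}


class FailoverMaster:
    """Hypothetical master that remembers settings and replays them."""

    def __init__(self):
        self.wanted = {}   # settings the admin applied to the master
        self.slaves = []

    def set_feature(self, key, value):
        # Record the setting, then push it to every current slave.
        self.wanted[key] = value
        for slave in self.slaves:
            slave.features[key] = value

    def register_slave(self, slave):
        # Replay everything configured so far onto the newcomer.
        slave.features.update(self.wanted)
        self.slaves.append(slave)

    def unregister_slave(self, slave):
        self.slaves.remove(slave)


master = FailoverMaster()
master.set_feature("tso", False)      # configured once, on the master

vf = Slave("eth0")
master.register_slave(vf)             # VF hot plugged
master.unregister_slave(vf)           # VF unplugged for live migration

new_vf = Slave("eth1")                # comes back under a different name
master.register_slave(new_vf)
print(new_vf.name, new_vf.features)
```

In this model the slave's name never appears in the configuration step,
which is why the name changing across migrations does not matter.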

> > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> model as much transparent to a real NIC as possible, while a hidden netns is
> just the vehicle). However, I recall there was resistance around this
> discussion that even the concept of hiding itself is a taboo for Linux
> netdev. I would like to summon potential alternatives before concluding
> 1-netdev is the only solution too soon.
> 
> Thanks,
> -Siwei

Your scripts would not work at all then, right?


> > 
> > > -Siwei
> > > 
> > > 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-22 15:14                 ` Michael S. Tsirkin
@ 2019-02-26  0:58                   ` si-wei liu
  2019-02-26  1:39                     ` Stephen Hemminger
  2019-02-26  2:08                     ` Michael S. Tsirkin
  0 siblings, 2 replies; 63+ messages in thread
From: si-wei liu @ 2019-02-26  0:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

[-- Attachment #1: Type: text/plain, Size: 6340 bytes --]



On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>
>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>
>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>
>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>> cleanly, see:
>>>>>>
>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>
>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>> control (without getting netdev opened early by the other part of
>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>> which don't provides a solution if user care about consistent naming
>>>>>> on the slave netdevs specifically.
>>>>>>
>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>> Above says:
>>>>>
>>>>>      there's no motivation in the systemd/udevd community at
>>>>>      this point to refactor the rename logic and make it work well with
>>>>>      3-netdev.
>>>>>
>>>>> What would the fix be? Skip slave devices?
>>>>>
>>>> There's nothing user can get if just skipping slave devices - the
>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>> and such). There's no way one can fix this in userspace alone - when
>>>> the failover is created the enslaved netdev was opened by the kernel
>>>> earlier than the userspace is made aware of, and there's no
>>>> negotiation protocol for kernel to know when userspace has done
>>>> initial renaming of the interface. I would expect netdev list should
>>>> at least provide the direction in general for how this can be
>>>> solved...
>
> I was just wondering what did you mean when you said
> "refactor the rename logic and make it work well with 3-netdev" -
> was there a proposal udev rejected?
No. I never believed this particular issue could be fixed in userspace
alone. Someone previously said it could be, but I have never seen any
work or relevant discussion in the various userspace communities
(e.g. dracut, initramfs-tools, systemd, udev, and NetworkManager).
IMHO the root of the issue lies in the kernel, so it makes more sense
to start from netdev and decide on a solution there: see what can be
done in the kernel to fix it, then engage the userspace communities on
feasibility...

> Anyway, can we write a time diagram for what happens in which order that
> leads to failure?  That would help look for triggers that we can tie
> into, or add new ones.
>

See attached diagram.

>
>
>
>
>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>> to only work with the master failover device.
>> Where does this expectation come from?
>>
>> Admin users may have ethtool or tc configurations that need to deal with
>> predictable interface name. Third-party app which was built upon specifying
>> certain interface name can't be modified to chase dynamic names.
>>
>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>> offload settings post boot for specific workload. Those images won't work
>> well if the name is constantly changing just after couple rounds of live
>> migration.
> It should be possible to specify the ethtool configuration on the
> master and have it automatically propagated to the slave.
>
> BTW this is something we should look at IMHO.
I was giving a few examples to show that the assumption that
user/admin scripts only deal with the master failover device is
incorrect. This was never properly taken care of, although I did try to
emphasize it from the very beginning.

Basically, what you said about propagating the ethtool configuration down
to the slave is the key goal of the 1-netdev model. However, what I am
looking for now is any alternative that can also fix the specific udev
rename problem, before concluding that 1-netdev is the only solution.
A 1-netdev scheme would generally take time to implement, while I'm
trying to find a way to fix this particular naming problem under
3-netdev.

>
>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>> model as much transparent to a real NIC as possible, while a hidden netns is
>> just the vehicle). However, I recall there was resistance around this
>> discussion that even the concept of hiding itself is a taboo for Linux
>> netdev. I would like to summon potential alternatives before concluding
>> 1-netdev is the only solution too soon.
>>
>> Thanks,
>> -Siwei
> Your scripts would not work at all then, right?
At this point we don't claim that images with such usage are SR-IOV live
migrate-able. We won't flag them as live migrate-able until this ethtool
config issue is fully addressed and a transparent live migration
solution eventually emerges upstream.


Thanks,
-Siwei
>
>
>>>> -Siwei
>>>>
>>>>


[-- Attachment #2: net_failover_rename_race.txt --]
[-- Type: text/plain, Size: 3587 bytes --]


  net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
--------------------------------------------------+------------------------------+--------------------------------------------
(standby virtio-net and net_failover              |                              |
devices created and initialized,                  |                              |
i.e. virtnet_probe()->                            |                              |
       net_failover_create()                      |                              |
was done.)                                        |                              |
                                                  |                              |
                                                  |  runs `ifup ens3' ->         |
                                                  |    ip link set dev ens3 up   |
net_failover_open()                               |                              |
  dev_open(virtnet_dev)                           |                              |
    virtnet_open(virtnet_dev)                     |                              |
  netif_carrier_on(failover_dev)                  |                              |
  ...                                             |                              |
                                                  |                              |
(VF hot plugged in)                               |                              |
ixgbevf_probe()                                   |                              |
 register_netdev(ixgbevf_netdev)                  |                              |
  netdev_register_kobject(ixgbevf_netdev)         |                              |
   kobject_add(ixgbevf_dev)                       |                              |
    device_add(ixgbevf_dev)                       |                              |
     kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
      netlink_broadcast()                         |                              |
  ...                                             |                              |
  call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
   failover_event(..., NETDEV_REGISTER, ...)      |                              |
    failover_slave_register(ixgbevf_netdev)       |                              |
     net_failover_slave_register(ixgbevf_netdev)  |                              |
      dev_open(ixgbevf_netdev)                    |                              |
                                                  |                              |
                                                  |                              |
                                                  |                              |   received ADD uevent from netlink fd
                                                  |                              |   ...
                                                  |                              |   udev-builtin-net_id.c:dev_pci_slot()
                                                  |                              |   (decided to rename 'eth0')
                                                  |                              |     ip link set dev eth0 name ens4
(dev_change_name() returns -EBUSY as              |                              |
ixgbevf_netdev->flags has IFF_UP)                 |                              |
                                                  |                              |
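
The ordering in the diagram can be replayed with a small toy model. This
is a Python sketch, not kernel code: NetDev and hotplug_vf are
hypothetical names, and only the two facts shown in the diagram are
modelled, namely that the auto-enslavement path opens the VF right after
registration and that a rename of a device with IFF_UP set is refused
with -EBUSY.

```python
import errno

IFF_UP = 0x1  # same bit as the kernel's IFF_UP interface flag


class NetDev:
    """Toy stand-in for struct net_device: just a name and flags."""

    def __init__(self, name):
        self.name = name
        self.flags = 0


def dev_open(dev):
    dev.flags |= IFF_UP


def dev_change_name(dev, new_name):
    """Mimic the check in the diagram: an IFF_UP device cannot be renamed."""
    if dev.flags & IFF_UP:
        return -errno.EBUSY
    dev.name = new_name
    return 0


def hotplug_vf(auto_enslave):
    """Register the VF, optionally auto-open it, then let udev rename it."""
    vf = NetDev("eth0")                  # name given at register_netdev()
    if auto_enslave:
        dev_open(vf)                     # slave_register path -> dev_open()
    rc = dev_change_name(vf, "ens4")     # udev: ip link set dev eth0 name ens4
    return vf.name, rc


print("with auto-enslavement:   ", hotplug_vf(auto_enslave=True))
print("without auto-enslavement:", hotplug_vf(auto_enslave=False))
```

With auto-enslavement the device keeps the name 'eth0' and the rename
reports -EBUSY; without it the rename to 'ens4' goes through.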


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-26  0:58                   ` si-wei liu
@ 2019-02-26  1:39                     ` Stephen Hemminger
  2019-02-26  2:05                       ` Michael S. Tsirkin
  2019-02-26  2:08                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 63+ messages in thread
From: Stephen Hemminger @ 2019-02-26  1:39 UTC (permalink / raw)
  To: si-wei liu
  Cc: Michael S. Tsirkin, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Mon, 25 Feb 2019 16:58:07 -0800
si-wei liu <si-wei.liu@oracle.com> wrote:

> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> >>
> >> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> >>>
> >>> On 2/21/2019 7:33 PM, si-wei liu wrote:  
> >>>>
> >>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> >>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> >>>>>> Sorry for replying to this ancient thread. There was some remaining
> >>>>>> issue that I don't think the initial net_failover patch got addressed
> >>>>>> cleanly, see:
> >>>>>>
> >>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> >>>>>>
> >>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> >>>>>> not specifically written for such kernel automatic enslavement.
> >>>>>> Specifically, if it is a bond or team, the slave would typically get
> >>>>>> renamed *before* virtual device gets created, that's what udev can
> >>>>>> control (without getting netdev opened early by the other part of
> >>>>>> kernel) and other userspace components for e.g. initramfs,
> >>>>>> init-scripts can coordinate well in between. The in-kernel
> >>>>>> auto-enslavement of net_failover breaks this userspace convention,
> >>>>>> which don't provides a solution if user care about consistent naming
> >>>>>> on the slave netdevs specifically.
> >>>>>>
> >>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
> >>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
> >>>>>> problem ever since. Please share your mind how to proceed and solve
> >>>>>> this userspace issue if netdev does not welcome a 1-netdev model.  
> >>>>> Above says:
> >>>>>
> >>>>>      there's no motivation in the systemd/udevd community at
> >>>>>      this point to refactor the rename logic and make it work well with
> >>>>>      3-netdev.
> >>>>>
> >>>>> What would the fix be? Skip slave devices?
> >>>>>  
> >>>> There's nothing user can get if just skipping slave devices - the
> >>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
> >>>> next reboot, while the rest may conform to the naming scheme (ens3
> >>>> and such). There's no way one can fix this in userspace alone - when
> >>>> the failover is created the enslaved netdev was opened by the kernel
> >>>> earlier than the userspace is made aware of, and there's no
> >>>> negotiation protocol for kernel to know when userspace has done
> >>>> initial renaming of the interface. I would expect netdev list should
> >>>> at least provide the direction in general for how this can be
> >>>> solved...  
> >
> > I was just wondering what did you mean when you said
> > "refactor the rename logic and make it work well with 3-netdev" -
> > was there a proposal udev rejected?  
> No. I never believed this particular issue can be fixed in userspace 
> alone. Previously someone had said it could be, but I never see any work 
> or relevant discussion ever happened in various userspace communities 
> (for e.g. dracut, initramfs-tools, systemd, udev, and NetworkManager). 
> IMHO the root of the issue derives from the kernel, it makes more sense 
> to start from netdev, work out and decide on a solution: see what can be 
> done in the kernel in order to fix it, then after that engage userspace 
> community for the feasibility...
> 
> > Anyway, can we write a time diagram for what happens in which order that
> > leads to failure?  That would help look for triggers that we can tie
> > into, or add new ones.
> >  
> 
> See attached diagram.
> 
> >
> >
> >
> >  
> >>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> >>> to only work with the master failover device.  
> >> Where does this expectation come from?
> >>
> >> Admin users may have ethtool or tc configurations that need to deal with
> >> predictable interface name. Third-party app which was built upon specifying
> >> certain interface name can't be modified to chase dynamic names.
> >>
> >> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> >> offload settings post boot for specific workload. Those images won't work
> >> well if the name is constantly changing just after couple rounds of live
> >> migration.  
> > It should be possible to specify the ethtool configuration on the
> > master and have it automatically propagated to the slave.
> >
> > BTW this is something we should look at IMHO.  
> I was elaborating a few examples that the expectation and assumption 
> that user/admin scripts only deal with master failover device is 
> incorrect. It had never been taken good care of, although I did try to 
> emphasize it from the very beginning.
> 
> Basically what you said about propagating the ethtool configuration down 
> to the slave is the key pursuance of 1-netdev model. However, what I am 
> seeking now is any alternative that can also fix the specific udev 
> rename problem, before concluding that 1-netdev is the only solution. 
> Generally a 1-netdev scheme would take time to implement, while I'm 
> trying to find a way out to fix this particular naming problem under 
> 3-netdev.
> 
> >  
> >>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> >>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> >>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> >>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> >> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> >> model as much transparent to a real NIC as possible, while a hidden netns is
> >> just the vehicle). However, I recall there was resistance around this
> >> discussion that even the concept of hiding itself is a taboo for Linux
> >> netdev. I would like to summon potential alternatives before concluding
> >> 1-netdev is the only solution too soon.
> >>
> >> Thanks,
> >> -Siwei  
> > Your scripts would not work at all then, right?  
> At this point we don't claim images with such usage as SR-IOV live 
> migrate-able. We would flag it as live migrate-able until this ethtool 
> config issue is fully addressed and a transparent live migration 
> solution emerges in upstream eventually.

The Hyper-V netvsc driver with the 1-dev model uses a timeout to allow udev to do its rename.
I proposed a patch to key the state change off the udev rename, but that patch was
rejected.



* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-26  1:39                     ` Stephen Hemminger
@ 2019-02-26  2:05                       ` Michael S. Tsirkin
  2019-02-27  0:49                         ` si-wei liu
  0 siblings, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-26  2:05 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Mon, Feb 25, 2019 at 05:39:12PM -0800, Stephen Hemminger wrote:
> > >>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > >>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > >>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > >>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> > >> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > >> model as much transparent to a real NIC as possible, while a hidden netns is
> > >> just the vehicle). However, I recall there was resistance around this
> > >> discussion that even the concept of hiding itself is a taboo for Linux
> > >> netdev. I would like to summon potential alternatives before concluding
> > >> 1-netdev is the only solution too soon.
> > >>
> > >> Thanks,
> > >> -Siwei  
> > > Your scripts would not work at all then, right?  
> > At this point we don't claim images with such usage as SR-IOV live 
> > migrate-able. We would flag it as live migrate-able until this ethtool 
> > config issue is fully addressed and a transparent live migration 
> > solution emerges in upstream eventually.
> 
> The hyper-v netvsc with 1-dev model uses a timeout to allow  udev to do its rename.
> I proposed a patch to key state change off of the udev rename, but that patch was
> rejected.

Of course, that would mean nothing works without udev - was
that the objection? Could you help me find that discussion, please?

-- 
MST


* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-26  0:58                   ` si-wei liu
  2019-02-26  1:39                     ` Stephen Hemminger
@ 2019-02-26  2:08                     ` Michael S. Tsirkin
  2019-02-27  0:17                       ` si-wei liu
  1 sibling, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-26  2:08 UTC (permalink / raw)
  To: si-wei liu
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> 
> 
> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > 
> > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > 
> > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > 
> > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > cleanly, see:
> > > > > > > 
> > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > 
> > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > which doesn't provide a solution if the user cares about consistent naming
> > > > > > > on the slave netdevs specifically.
> > > > > > > 
> > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > Above says:
> > > > > > 
> > > > > >      there's no motivation in the systemd/udevd community at
> > > > > >      this point to refactor the rename logic and make it work well with
> > > > > >      3-netdev.
> > > > > > 
> > > > > > What would the fix be? Skip slave devices?
> > > > > > 
> > > > > There's nothing user can get if just skipping slave devices - the
> > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > earlier than the userspace is made aware of, and there's no
> > > > > negotiation protocol for kernel to know when userspace has done
> > > > > initial renaming of the interface. I would expect netdev list should
> > > > > at least provide the direction in general for how this can be
> > > > > solved...
> > 
> > I was just wondering what did you mean when you said
> > "refactor the rename logic and make it work well with 3-netdev" -
> > was there a proposal udev rejected?
> No. I never believed this particular issue can be fixed in userspace alone.
> Previously someone had said it could be, but I never see any work or
> relevant discussion ever happened in various userspace communities (for e.g.
> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> of the issue derives from the kernel, it makes more sense to start from
> netdev, work out and decide on a solution: see what can be done in the
> kernel in order to fix it, then after that engage userspace community for
> the feasibility...
> 
> > Anyway, can we write a time diagram for what happens in which order that
> > leads to failure?  That would help look for triggers that we can tie
> > into, or add new ones.
> > 
> 
> See attached diagram.
> 
> > 
> > 
> > 
> > 
> > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > to only work with the master failover device.
> > > Where does this expectation come from?
> > > 
> > > Admin users may have ethtool or tc configurations that need to deal with
> > > predictable interface name. Third-party app which was built upon specifying
> > > certain interface name can't be modified to chase dynamic names.
> > > 
> > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > offload settings post boot for specific workload. Those images won't work
> > > well if the name is constantly changing just after couple rounds of live
> > > migration.
> > It should be possible to specify the ethtool configuration on the
> > master and have it automatically propagated to the slave.
> > 
> > BTW this is something we should look at IMHO.
> I was elaborating a few examples that the expectation and assumption that
> user/admin scripts only deal with master failover device is incorrect. It
> had never been taken good care of, although I did try to emphasize it from
> the very beginning.
> 
> Basically what you said about propagating the ethtool configuration down to
> the slave is the key pursuance of 1-netdev model. However, what I am seeking
> now is any alternative that can also fix the specific udev rename problem,
> before concluding that 1-netdev is the only solution. Generally a 1-netdev
> scheme would take time to implement, while I'm trying to find a way out to
> fix this particular naming problem under 3-netdev.
> 
> > 
> > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > just the vehicle). However, I recall there was resistance around this
> > > discussion that even the concept of hiding itself is a taboo for Linux
> > > netdev. I would like to summon potential alternatives before concluding
> > > 1-netdev is the only solution too soon.
> > > 
> > > Thanks,
> > > -Siwei
> > Your scripts would not work at all then, right?
> At this point we don't claim images with such usage as SR-IOV live
> migrate-able. We would flag it as live migrate-able until this ethtool
> config issue is fully addressed and a transparent live migration solution
> emerges in upstream eventually.
> 
> 
> Thanks,
> -Siwei
> > 
> > 
> > > > > -Siwei
> > > > > 
> > > > > 
> > 
> 

> 
>   net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> --------------------------------------------------+------------------------------+--------------------------------------------
> (standby virtio-net and net_failover              |                              |
> devices created and initialized,                  |                              |
> i.e. virtnet_probe()->                            |                              |
>        net_failover_create()                      |                              |
> was done.)                                        |                              |
>                                                   |                              |
>                                                   |  runs `ifup ens3' ->         |
>                                                   |    ip link set dev ens3 up   |
> net_failover_open()                               |                              |
>   dev_open(virtnet_dev)                           |                              |
>     virtnet_open(virtnet_dev)                     |                              |
>   netif_carrier_on(failover_dev)                  |                              |
>   ...                                             |                              |
>                                                   |                              |
> (VF hot plugged in)                               |                              |
> ixgbevf_probe()                                   |                              |
>  register_netdev(ixgbevf_netdev)                  |                              |
>   netdev_register_kobject(ixgbevf_netdev)         |                              |
>    kobject_add(ixgbevf_dev)                       |                              |
>     device_add(ixgbevf_dev)                       |                              |
>      kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>       netlink_broadcast()                         |                              |
>   ...                                             |                              |
>   call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>    failover_event(..., NETDEV_REGISTER, ...)      |                              |
>     failover_slave_register(ixgbevf_netdev)       |                              |
>      net_failover_slave_register(ixgbevf_netdev)  |                              |
>       dev_open(ixgbevf_netdev)                    |                              |
>                                                   |                              |
>                                                   |                              |
>                                                   |                              |   received ADD uevent from netlink fd
>                                                   |                              |   ...
>                                                   |                              |   udev-builtin-net_id.c:dev_pci_slot()
>                                                   |                              |   (decided to rename 'eth0')
>                                                   |                              |     ip link set dev eth0 name ens4
> (dev_change_name() returns -EBUSY as              |                              |
> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>                                                   |                              |
> 

Given renaming slaves does not work anyway: would it work if we just
hard-coded slave names instead?

E.g.
1. fail slave renames
2. rename of failover to XX automatically renames standby to XXnsby
   and primary to XXnpry
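
To make the naming scheme in item 2 concrete, here is a minimal userspace
sketch of how the slave names could be derived from the failover name. It is
not kernel code and not an existing API: the helper failover_slave_name() and
the two suffix macros are hypothetical names invented for illustration, and
IFNAMSIZ is redefined locally with the same value the kernel uses (16,
including the terminating NUL). The one thing it is meant to show is that a
derived name can overflow the interface name limit, so that case needs a
defined outcome.

```c
#include <stdio.h>
#include <string.h>

/* Local stand-in for the kernel's IFNAMSIZ (includes the NUL byte). */
#define IFNAMSIZ 16

/* Hypothetical suffixes, taken from the proposal above. */
#define STANDBY_SUFFIX "nsby"
#define PRIMARY_SUFFIX "npry"

/*
 * Build "<master><suffix>" into out.
 * Returns 0 on success, or -1 if the derived name would not fit
 * into IFNAMSIZ bytes (in which case out must not be used).
 */
static int failover_slave_name(const char *master, const char *suffix,
                               char out[IFNAMSIZ])
{
    int n = snprintf(out, IFNAMSIZ, "%s%s", master, suffix);

    if (n < 0 || n >= IFNAMSIZ)
        return -1;
    return 0;
}
```

With a failover device named ens3 this yields ens3nsby and ens3npry. A
failover name longer than 11 characters cannot take a 4-character suffix, so
a real implementation would have to either refuse the rename or truncate.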


-- 
MST


* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-26  2:08                     ` Michael S. Tsirkin
@ 2019-02-27  0:17                       ` si-wei liu
  2019-02-27 21:57                         ` Stephen Hemminger
  2019-02-27 22:38                         ` Michael S. Tsirkin
  0 siblings, 2 replies; 63+ messages in thread
From: si-wei liu @ 2019-02-27  0:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon



On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>
>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>> cleanly, see:
>>>>>>>>
>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>
>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>> which doesn't provide a solution if the user cares about consistent naming
>>>>>>>> on the slave netdevs specifically.
>>>>>>>>
>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>> Above says:
>>>>>>>
>>>>>>>       there's no motivation in the systemd/udevd community at
>>>>>>>       this point to refactor the rename logic and make it work well with
>>>>>>>       3-netdev.
>>>>>>>
>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>
>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>> at least provide the direction in general for how this can be
>>>>>> solved...
>>> I was just wondering what did you mean when you said
>>> "refactor the rename logic and make it work well with 3-netdev" -
>>> was there a proposal udev rejected?
>> No. I never believed this particular issue can be fixed in userspace alone.
>> Previously someone had said it could be, but I never see any work or
>> relevant discussion ever happened in various userspace communities (for e.g.
>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>> of the issue derives from the kernel, it makes more sense to start from
>> netdev, work out and decide on a solution: see what can be done in the
>> kernel in order to fix it, then after that engage userspace community for
>> the feasibility...
>>
>>> Anyway, can we write a time diagram for what happens in which order that
>>> leads to failure?  That would help look for triggers that we can tie
>>> into, or add new ones.
>>>
>> See attached diagram.
>>
>>>
>>>
>>>
>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>> to only work with the master failover device.
>>>> Where does this expectation come from?
>>>>
>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>> predictable interface name. Third-party app which was built upon specifying
>>>> certain interface name can't be modified to chase dynamic names.
>>>>
>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>> offload settings post boot for specific workload. Those images won't work
>>>> well if the name is constantly changing just after couple rounds of live
>>>> migration.
>>> It should be possible to specify the ethtool configuration on the
>>> master and have it automatically propagated to the slave.
>>>
>>> BTW this is something we should look at IMHO.
>> I was elaborating a few examples that the expectation and assumption that
>> user/admin scripts only deal with master failover device is incorrect. It
>> had never been taken good care of, although I did try to emphasize it from
>> the very beginning.
>>
>> Basically what you said about propagating the ethtool configuration down to
>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>> now is any alternative that can also fix the specific udev rename problem,
>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>> scheme would take time to implement, while I'm trying to find a way out to
>> fix this particular naming problem under 3-netdev.
>>
>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>> just the vehicle). However, I recall there was resistance around this
>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>> netdev. I would like to summon potential alternatives before concluding
>>>> 1-netdev is the only solution too soon.
>>>>
>>>> Thanks,
>>>> -Siwei
>>> Your scripts would not work at all then, right?
>> At this point we don't claim images with such usage as SR-IOV live
>> migrate-able. We would flag it as live migrate-able until this ethtool
>> config issue is fully addressed and a transparent live migration solution
>> emerges in upstream eventually.
>>
>>
>> Thanks,
>> -Siwei
>>>
>>>>>> -Siwei
>>>>>>
>>>>>>
>>>
>>    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>> --------------------------------------------------+------------------------------+--------------------------------------------
>> (standby virtio-net and net_failover              |                              |
>> devices created and initialized,                  |                              |
>> i.e. virtnet_probe()->                            |                              |
>>         net_failover_create()                      |                              |
>> was done.)                                        |                              |
>>                                                    |                              |
>>                                                    |  runs `ifup ens3' ->         |
>>                                                    |    ip link set dev ens3 up   |
>> net_failover_open()                               |                              |
>>    dev_open(virtnet_dev)                           |                              |
>>      virtnet_open(virtnet_dev)                     |                              |
>>    netif_carrier_on(failover_dev)                  |                              |
>>    ...                                             |                              |
>>                                                    |                              |
>> (VF hot plugged in)                               |                              |
>> ixgbevf_probe()                                   |                              |
>>   register_netdev(ixgbevf_netdev)                  |                              |
>>    netdev_register_kobject(ixgbevf_netdev)         |                              |
>>     kobject_add(ixgbevf_dev)                       |                              |
>>      device_add(ixgbevf_dev)                       |                              |
>>       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>        netlink_broadcast()                         |                              |
>>    ...                                             |                              |
>>    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>     failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>      failover_slave_register(ixgbevf_netdev)       |                              |
>>       net_failover_slave_register(ixgbevf_netdev)  |                              |
>>        dev_open(ixgbevf_netdev)                    |                              |
>>                                                    |                              |
>>                                                    |                              |
>>                                                    |                              |   received ADD uevent from netlink fd
>>                                                    |                              |   ...
>>                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>                                                    |                              |   (decided to rename 'eth0')
>>                                                    |                              |     ip link set dev eth0 name ens4
>> (dev_change_name() returns -EBUSY as              |                              |
>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>                                                    |                              |
>>
> Given renaming slaves does not work anyway:
I was actually wondering: what if we relaxed the rename restriction just 
for the failover slave? What would the impact be? I think users don't 
care about a slave being renamed while it's in use, especially for the 
initial rename. Thoughts?
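
To illustrate what I mean, below is a small userspace model of the check, not
the actual kernel code. The only fact taken from the diagram is that
dev_change_name() returns -EBUSY when the device has IFF_UP set. Everything
else is an assumption made up for the sketch: struct model_netdev, the
MODEL_* flag values, the relax_for_slave switch and model_change_name()
itself are hypothetical and do not correspond to existing kernel symbols.

```c
#include <errno.h>
#include <string.h>

#define IFNAMSIZ 16                  /* local stand-in for the kernel value */
#define MODEL_IFF_UP         0x1     /* stands in for IFF_UP */
#define MODEL_FAILOVER_SLAVE 0x2     /* hypothetical "is a failover slave" marker */

struct model_netdev {
    char name[IFNAMSIZ];
    unsigned int flags;
};

/*
 * Model of the rename check. With relax_for_slave == 0 this mirrors the
 * behaviour in the diagram: any device that is up refuses the rename.
 * With relax_for_slave != 0, a device marked as a failover slave may be
 * renamed even while it is up.
 */
static int model_change_name(struct model_netdev *dev, const char *newname,
                             int relax_for_slave)
{
    size_t len = strlen(newname);

    if (len == 0 || len >= IFNAMSIZ)
        return -EINVAL;

    if ((dev->flags & MODEL_IFF_UP) &&
        !(relax_for_slave && (dev->flags & MODEL_FAILOVER_SLAVE)))
        return -EBUSY;

    strcpy(dev->name, newname);      /* length was checked above */
    return 0;
}
```

For a VF that net_failover has already opened, the strict variant fails with
-EBUSY and leaves the name as eth0, while the relaxed variant lets the rename
to ens4 through. What the model leaves out is exactly the open question:
whether anything else in the kernel or in userspace depends on the name of an
up device staying fixed.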

>   would it work if we just
> hard-coded slave names instead?
>
> E.g.
> 1. fail slave renames
> 2. rename of failover to XX automatically renames standby to XXnsby
>     and primary to XXnpry
That wouldn't help. At the time the failover master gets renamed, the 
VF may not be present. I don't like the idea of delaying exposure of the 
failover master until the VF is hot plugged in later (which is probably 
subject to various failures).

Thanks,
-Siwei

>
>



* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-26  2:05                       ` Michael S. Tsirkin
@ 2019-02-27  0:49                         ` si-wei liu
  0 siblings, 0 replies; 63+ messages in thread
From: si-wei liu @ 2019-02-27  0:49 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stephen Hemminger
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, David Miller, Netdev,
	virtualization, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski, Jason Wang, liran.alon



On 2/25/2019 6:05 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 25, 2019 at 05:39:12PM -0800, Stephen Hemminger wrote:
>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>> just the vehicle). However, I recall there was resistance around this
>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>> 1-netdev is the only solution too soon.
>>>>>
>>>>> Thanks,
>>>>> -Siwei
>>>> Your scripts would not work at all then, right?
>>> At this point we don't claim images with such usage as SR-IOV live
>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>> config issue is fully addressed and a transparent live migration
>>> solution emerges in upstream eventually.
>> The hyper-v netvsc with 1-dev model uses a timeout to allow  udev to do its rename.
>> I proposed a patch to key state change off of the udev rename, but that patch was
>> rejected.
> Of course that would mean nothing works without udev - was
> that the objection? Could you help me find that discussion pls?
Yeah, the kernel should work both with and without a udev rename - 
typically the kernel is agnostic of any upcoming rename. Users may opt 
for kernel-provided names (particularly on older distros), in which case 
no rename would ever happen.

I once thought about this approach but didn't think it would fit. But 
what is the historical reason that prevents a slave from being renamed 
after it has been opened? Could we special-case a code path for this 
kernel-created device, as net_failover shouldn't carry any historical 
burden?


Thanks,
-Siwei




* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-27  0:17                       ` si-wei liu
@ 2019-02-27 21:57                         ` Stephen Hemminger
  2019-02-27 22:30                           ` si-wei liu
  2019-02-27 22:38                         ` Michael S. Tsirkin
  1 sibling, 1 reply; 63+ messages in thread
From: Stephen Hemminger @ 2019-02-27 21:57 UTC (permalink / raw)
  To: si-wei liu
  Cc: Michael S. Tsirkin, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Tue, 26 Feb 2019 16:17:21 -0800
si-wei liu <si-wei.liu@oracle.com> wrote:

> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:  
> >>
> >> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:  
> >>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> >>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> >>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:  
> >>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> >>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> >>>>>>>> Sorry for replying to this ancient thread. There was some remaining
> >>>>>>>> issue that I don't think the initial net_failover patch got addressed
> >>>>>>>> cleanly, see:
> >>>>>>>>
> >>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> >>>>>>>>
> >>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> >>>>>>>> not specifically written for such kernel automatic enslavement.
> >>>>>>>> Specifically, if it is a bond or team, the slave would typically get
> >>>>>>>> renamed *before* virtual device gets created, that's what udev can
> >>>>>>>> control (without getting netdev opened early by the other part of
> >>>>>>>> kernel) and other userspace components for e.g. initramfs,
> >>>>>>>> init-scripts can coordinate well in between. The in-kernel
> >>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
> >>>>>>>> which doesn't provide a solution if users care about consistent naming
> >>>>>>>> on the slave netdevs specifically.
> >>>>>>>>
> >>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
> >>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
> >>>>>>>> problem ever since. Please share your mind how to proceed and solve
> >>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.  
> >>>>>>> Above says:
> >>>>>>>
> >>>>>>>       there's no motivation in the systemd/udevd community at
> >>>>>>>       this point to refactor the rename logic and make it work well with
> >>>>>>>       3-netdev.
> >>>>>>>
> >>>>>>> What would the fix be? Skip slave devices?
> >>>>>>>  
> >>>>>> There's nothing user can get if just skipping slave devices - the
> >>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
> >>>>>> next reboot, while the rest may conform to the naming scheme (ens3
> >>>>>> and such). There's no way one can fix this in userspace alone - when
> >>>>>> the failover is created the enslaved netdev was opened by the kernel
> >>>>>> earlier than the userspace is made aware of, and there's no
> >>>>>> negotiation protocol for kernel to know when userspace has done
> >>>>>> initial renaming of the interface. I would expect netdev list should
> >>>>>> at least provide the direction in general for how this can be
> >>>>>> solved...  
> >>> I was just wondering what did you mean when you said
> >>> "refactor the rename logic and make it work well with 3-netdev" -
> >>> was there a proposal udev rejected?  
> >> No. I never believed this particular issue can be fixed in userspace alone.
> >> Previously someone had said it could be, but I never see any work or
> >> relevant discussion ever happened in various userspace communities (for e.g.
> >> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> >> of the issue derives from the kernel, it makes more sense to start from
> >> netdev, work out and decide on a solution: see what can be done in the
> >> kernel in order to fix it, then after that engage userspace community for
> >> the feasibility...
> >>  
> >>> Anyway, can we write a time diagram for what happens in which order that
> >>> leads to failure?  That would help look for triggers that we can tie
> >>> into, or add new ones.
> >>>  
> >> See attached diagram.
> >>  
> >>>
> >>>
> >>>  
> >>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> >>>>> to only work with the master failover device.  
> >>>> Where does this expectation come from?
> >>>>
> >>>> Admin users may have ethtool or tc configurations that need to deal with
> >>>> predictable interface name. Third-party app which was built upon specifying
> >>>> certain interface name can't be modified to chase dynamic names.
> >>>>
> >>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> >>>> offload settings post boot for specific workload. Those images won't work
> >>>> well if the name is constantly changing just after couple rounds of live
> >>>> migration.  
> >>> It should be possible to specify the ethtool configuration on the
> >>> master and have it automatically propagated to the slave.
> >>>
> >>> BTW this is something we should look at IMHO.  
> >> I was elaborating a few examples that the expectation and assumption that
> >> user/admin scripts only deal with master failover device is incorrect. It
> >> had never been taken good care of, although I did try to emphasize it from
> >> the very beginning.
> >>
> >> Basically what you said about propagating the ethtool configuration down to
> >> the slave is the key pursuance of 1-netdev model. However, what I am seeking
> >> now is any alternative that can also fix the specific udev rename problem,
> >> before concluding that 1-netdev is the only solution. Generally a 1-netdev
> >> scheme would take time to implement, while I'm trying to find a way out to
> >> fix this particular naming problem under 3-netdev.
> >>  
> >>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> >>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> >>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> >>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> >>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> >>>> model as much transparent to a real NIC as possible, while a hidden netns is
> >>>> just the vehicle). However, I recall there was resistance around this
> >>>> discussion that even the concept of hiding itself is a taboo for Linux
> >>>> netdev. I would like to summon potential alternatives before concluding
> >>>> 1-netdev is the only solution too soon.
> >>>>
> >>>> Thanks,
> >>>> -Siwei  
> >>> Your scripts would not work at all then, right?  
> >> At this point we don't claim images with such usage as SR-IOV live
> >> migrate-able. We would not flag it as live migrate-able until this ethtool
> >> config issue is fully addressed and a transparent live migration solution
> >> emerges in upstream eventually.
> >>
> >>
> >> Thanks,
> >> -Siwei  
> >>>  
> >>>>>> -Siwei
> >>>>>>
> >>>>>>  
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> >>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >>>  
> >>    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> >> --------------------------------------------------+------------------------------+--------------------------------------------
> >> (standby virtio-net and net_failover              |                              |
> >> devices created and initialized,                  |                              |
> >> i.e. virtnet_probe()->                            |                              |
> >>         net_failover_create()                      |                              |
> >> was done.)                                        |                              |
> >>                                                    |                              |
> >>                                                    |  runs `ifup ens3' ->         |
> >>                                                    |    ip link set dev ens3 up   |
> >> net_failover_open()                               |                              |
> >>    dev_open(virtnet_dev)                           |                              |
> >>      virtnet_open(virtnet_dev)                     |                              |
> >>    netif_carrier_on(failover_dev)                  |                              |
> >>    ...                                             |                              |
> >>                                                    |                              |
> >> (VF hot plugged in)                               |                              |
> >> ixgbevf_probe()                                   |                              |
> >>   register_netdev(ixgbevf_netdev)                  |                              |
> >>    netdev_register_kobject(ixgbevf_netdev)         |                              |
> >>     kobject_add(ixgbevf_dev)                       |                              |
> >>      device_add(ixgbevf_dev)                       |                              |
> >>       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> >>        netlink_broadcast()                         |                              |
> >>    ...                                             |                              |
> >>    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> >>     failover_event(..., NETDEV_REGISTER, ...)      |                              |
> >>      failover_slave_register(ixgbevf_netdev)       |                              |
> >>       net_failover_slave_register(ixgbevf_netdev)  |                              |
> >>        dev_open(ixgbevf_netdev)                    |                              |
> >>                                                    |                              |
> >>                                                    |                              |
> >>                                                    |                              |   received ADD uevent from netlink fd
> >>                                                    |                              |   ...
> >>                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
> >>                                                    |                              |   (decided to rename 'eth0')
> >>                                                    |                              |     ip link set dev eth0 name ens4
> >> (dev_change_name() returns -EBUSY as              |                              |
> >> ixgbevf_netdev->flags has IFF_UP)                 |                              |
> >>                                                    |                              |
> >>  
> > Given renaming slaves does not work anyway:  
> I was actually thinking what if we relieve the rename restriction just 
> for the failover slave? What the impact would be? I think users don't 
> care about slave being renamed when it's in use, especially the initial 
> rename. Thoughts?
> 
> >   would it work if we just
> > hard-coded slave names instead?
> >
> > E.g.
> > 1. fail slave renames
> > 2. rename of failover to XX automatically renames standby to XXnsby
> >     and primary to XXnpry  
> That wouldn't help. The time when the failover master gets renamed, the 
> VF may not be present. I don't like the idea to delay exposing failover 
> master until VF is hot plugged in (probably subject to various failures) 
> later.


What netvsc does now is wait 2 seconds (to allow udev to do the rename)
before bringing the VF link up. This works, has had no problems even
with slow distributions, and is widely used.

A patch to allow ending the timeout after rename was proposed but
rejected.

https://lore.kernel.org/netdev/20171220223323.21125-1-sthemmin@microsoft.com/

Allowing network devices to change name while up is too risky. There are things
like netfilter rules and other state in and out of the kernel that may break.
Userspace does not like it when the rules change.
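
To make the timing argument concrete, here is a toy model in Python. It is
only a sketch, not kernel code: the NetDev class, the hotplug_vf() helper and
the latency numbers are invented for illustration, and the only rule it
encodes is the one from the diagram above (a rename is refused with -EBUSY
while IFF_UP is set).

```python
import errno
from dataclasses import dataclass

IFF_UP = 0x1  # same bit value as the kernel's IFF_UP flag


@dataclass
class NetDev:
    """Hypothetical stand-in for a net_device; only name and flags are modelled."""
    name: str
    flags: int = 0

    def change_name(self, new_name):
        # Model of the rename rule from the diagram: refuse while the device is up.
        if self.flags & IFF_UP:
            return -errno.EBUSY
        self.name = new_name
        return 0


def hotplug_vf(open_delay, udev_latency):
    """Replay one VF hot plug.

    open_delay:   seconds after registration at which the driver opens the VF.
    udev_latency: seconds after registration at which udev issues the rename.
    Returns (final_name, rename_result).
    """
    vf = NetDev("eth0")
    # On a tie the open is processed first, which is the worst case for udev.
    events = sorted([(open_delay, 0, "open"), (udev_latency, 1, "rename")])
    result = None
    for _, _, what in events:
        if what == "open":
            vf.flags |= IFF_UP
        else:
            result = vf.change_name("ens4")
    return vf.name, result


if __name__ == "__main__":
    print(hotplug_vf(open_delay=0.0, udev_latency=0.05))  # immediate open
    print(hotplug_vf(open_delay=2.0, udev_latency=0.05))  # delayed open
```

With an immediate open (the net_failover behaviour in the diagram) the name
stays eth0 and the rename returns -EBUSY; with a 2 second delay the rename
lands first and the device ends up as ens4. The model also shows the limit of
the approach: if udev takes longer than the delay, the rename still fails.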



* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-27 21:57                         ` Stephen Hemminger
@ 2019-02-27 22:30                           ` si-wei liu
  0 siblings, 0 replies; 63+ messages in thread
From: si-wei liu @ 2019-02-27 22:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael S. Tsirkin, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon



On 2/27/2019 1:57 PM, Stephen Hemminger wrote:
> On Tue, 26 Feb 2019 16:17:21 -0800
> si-wei liu <si-wei.liu@oracle.com> wrote:
>
>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>> cleanly, see:
>>>>>>>>>>
>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>
>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>> which doesn't provide a solution if users care about consistent naming
>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>
>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>> Above says:
>>>>>>>>>
>>>>>>>>>        there's no motivation in the systemd/udevd community at
>>>>>>>>>        this point to refactor the rename logic and make it work well with
>>>>>>>>>        3-netdev.
>>>>>>>>>
>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>   
>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>> solved...
>>>>> I was just wondering what did you mean when you said
>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>> was there a proposal udev rejected?
>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>> Previously someone had said it could be, but I never see any work or
>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>> of the issue derives from the kernel, it makes more sense to start from
>>>> netdev, work out and decide on a solution: see what can be done in the
>>>> kernel in order to fix it, then after that engage userspace community for
>>>> the feasibility...
>>>>   
>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>> into, or add new ones.
>>>>>   
>>>> See attached diagram.
>>>>   
>>>>>
>>>>>   
>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>> to only work with the master failover device.
>>>>>> Where does this expectation come from?
>>>>>>
>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>
>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>> migration.
>>>>> It should be possible to specify the ethtool configuration on the
>>>>> master and have it automatically propagated to the slave.
>>>>>
>>>>> BTW this is something we should look at IMHO.
>>>> I was elaborating a few examples that the expectation and assumption that
>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>> had never been taken good care of, although I did try to emphasize it from
>>>> the very beginning.
>>>>
>>>> Basically what you said about propagating the ethtool configuration down to
>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>> now is any alternative that can also fix the specific udev rename problem,
>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>> fix this particular naming problem under 3-netdev.
>>>>   
>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>> 1-netdev is the only solution too soon.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>> Your scripts would not work at all then, right?
>>>> At this point we don't claim images with such usage as SR-IOV live
>>>> migrate-able. We would not flag it as live migrate-able until this ethtool
>>>> config issue is fully addressed and a transparent live migration solution
>>>> emerges in upstream eventually.
>>>>
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>>   
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>   
>>>>>   
>>>>     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>> (standby virtio-net and net_failover              |                              |
>>>> devices created and initialized,                  |                              |
>>>> i.e. virtnet_probe()->                            |                              |
>>>>          net_failover_create()                      |                              |
>>>> was done.)                                        |                              |
>>>>                                                     |                              |
>>>>                                                     |  runs `ifup ens3' ->         |
>>>>                                                     |    ip link set dev ens3 up   |
>>>> net_failover_open()                               |                              |
>>>>     dev_open(virtnet_dev)                           |                              |
>>>>       virtnet_open(virtnet_dev)                     |                              |
>>>>     netif_carrier_on(failover_dev)                  |                              |
>>>>     ...                                             |                              |
>>>>                                                     |                              |
>>>> (VF hot plugged in)                               |                              |
>>>> ixgbevf_probe()                                   |                              |
>>>>    register_netdev(ixgbevf_netdev)                  |                              |
>>>>     netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>      kobject_add(ixgbevf_dev)                       |                              |
>>>>       device_add(ixgbevf_dev)                       |                              |
>>>>        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>         netlink_broadcast()                         |                              |
>>>>     ...                                             |                              |
>>>>     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>      failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>       failover_slave_register(ixgbevf_netdev)       |                              |
>>>>        net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>         dev_open(ixgbevf_netdev)                    |                              |
>>>>                                                     |                              |
>>>>                                                     |                              |
>>>>                                                     |                              |   received ADD uevent from netlink fd
>>>>                                                     |                              |   ...
>>>>                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>                                                     |                              |   (decided to rename 'eth0')
>>>>                                                     |                              |     ip link set dev eth0 name ens4
>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>                                                     |                              |
>>>>   
>>> Given renaming slaves does not work anyway:
>> I was actually thinking what if we relieve the rename restriction just
>> for the failover slave? What the impact would be? I think users don't
>> care about slave being renamed when it's in use, especially the initial
>> rename. Thoughts?
>>
>>>    would it work if we just
>>> hard-coded slave names instead?
>>>
>>> E.g.
>>> 1. fail slave renames
>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>      and primary to XXnpry
>> That wouldn't help. The time when the failover master gets renamed, the
>> VF may not be present. I don't like the idea to delay exposing failover
>> master until VF is hot plugged in (probably subject to various failures)
>> later.
>
> What netvsc does now is wait 2 seconds (to allow udev to do rename)
> before bringing the VF link up. This works, has had no problems even
> with slow distributions and is widely used.
>
> A patch to allow ending the timeout after rename was proposed but
> rejected.
>
> https://lore.kernel.org/netdev/20171220223323.21125-1-sthemmin@microsoft.com/
>
> Allow network devices to change name when up is too risky.
I understand the concern in general. The thread above referenced this patch:

https://patchwork.ozlabs.org/patch/799646/

That was in the context of netvsc without a proper framework (net_failover).

What I was saying is that we should consider relaxing the rename
restriction for IFF_FAILOVER_SLAVE. It looks to me like all the
userspace users are trying to ignore the slave instead of operating on
it directly. The netfilter rules and the other state mentioned below
can/should be applied on top of the master, if I'm not mistaken. The
current userspace doesn't speak the net_failover way, and it has been
broken since net_failover's introduction. If anything, that userspace
can be fixed up to listen for rename events to track name changes.
Whatever those cases are, relaxing the restriction should not affect
current use cases.
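
As a rough model of the relaxation described above, the check could look like
the Python sketch below. It is an illustration and not a patch: the NetDev
class and the relax_for_failover_slave switch are hypothetical, and the
IFF_FAILOVER_SLAVE bit value is a placeholder rather than the kernel's.

```python
import errno

IFF_UP = 0x1                 # same bit value as the kernel's IFF_UP flag
IFF_FAILOVER_SLAVE = 1 << 0  # placeholder bit for this model's priv_flags only


class NetDev:
    """Hypothetical stand-in for a net_device."""

    def __init__(self, name, flags=0, priv_flags=0):
        self.name = name
        self.flags = flags
        self.priv_flags = priv_flags


def change_name(dev, new_name, relax_for_failover_slave=False):
    """Rename dev, refusing with -EBUSY while it is up.

    When relax_for_failover_slave is set, a device marked as a failover
    slave may be renamed even though it is up; every other device keeps
    the existing restriction.
    """
    if dev.flags & IFF_UP:
        exempt = relax_for_failover_slave and (dev.priv_flags & IFF_FAILOVER_SLAVE)
        if not exempt:
            return -errno.EBUSY
    dev.name = new_name
    return 0


if __name__ == "__main__":
    vf = NetDev("eth0", flags=IFF_UP, priv_flags=IFF_FAILOVER_SLAVE)
    print(change_name(vf, "ens4"), vf.name)  # current rule: refused
    print(change_name(vf, "ens4", relax_for_failover_slave=True), vf.name)
```

In this model only the auto-enslaved device gains the ability to be renamed
while up; a regular interface that is up is still refused, so existing
behaviour for other devices is unchanged.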

Thanks,
-Siwei



> There are things
> like netfilter rules and other state in and out of the kernel that may break.
> Userspace does not like it when the rules change.




* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-27  0:17                       ` si-wei liu
  2019-02-27 21:57                         ` Stephen Hemminger
@ 2019-02-27 22:38                         ` Michael S. Tsirkin
  2019-02-27 23:34                           ` si-wei liu
  1 sibling, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-27 22:38 UTC (permalink / raw)
  To: si-wei liu
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> 
> 
> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> > > 
> > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > cleanly, see:
> > > > > > > > > 
> > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > 
> > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > which doesn't provide a solution if users care about consistent naming
> > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > 
> > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > > > Above says:
> > > > > > > > 
> > > > > > > >       there's no motivation in the systemd/udevd community at
> > > > > > > >       this point to refactor the rename logic and make it work well with
> > > > > > > >       3-netdev.
> > > > > > > > 
> > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > 
> > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > at least provide the direction in general for how this can be
> > > > > > > solved...
> > > > I was just wondering what did you mean when you said
> > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > was there a proposal udev rejected?
> > > No. I never believed this particular issue can be fixed in userspace alone.
> > > Previously someone had said it could be, but I never see any work or
> > > relevant discussion ever happened in various userspace communities (for e.g.
> > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > of the issue derives from the kernel, it makes more sense to start from
> > > netdev, work out and decide on a solution: see what can be done in the
> > > kernel in order to fix it, then after that engage userspace community for
> > > the feasibility...
> > > 
> > > > Anyway, can we write a time diagram for what happens in which order that
> > > > leads to failure?  That would help look for triggers that we can tie
> > > > into, or add new ones.
> > > > 
> > > See attached diagram.
> > > 
> > > > 
> > > > 
> > > > 
> > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > to only work with the master failover device.
> > > > > Where does this expectation come from?
> > > > > 
> > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > certain interface name can't be modified to chase dynamic names.
> > > > > 
> > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > offload settings post boot for specific workload. Those images won't work
> > > > > well if the name is constantly changing just after couple rounds of live
> > > > > migration.
> > > > It should be possible to specify the ethtool configuration on the
> > > > master and have it automatically propagated to the slave.
> > > > 
> > > > BTW this is something we should look at IMHO.
> > > I was elaborating a few examples that the expectation and assumption that
> > > user/admin scripts only deal with master failover device is incorrect. It
> > > had never been taken good care of, although I did try to emphasize it from
> > > the very beginning.
> > > 
> > > Basically what you said about propagating the ethtool configuration down to
> > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > now is any alternative that can also fix the specific udev rename problem,
> > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > scheme would take time to implement, while I'm trying to find a way out to
> > > fix this particular naming problem under 3-netdev.
> > > 
> > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > just the vehicle). However, I recall there was resistance around this
> > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > 1-netdev is the only solution too soon.
> > > > > 
> > > > > Thanks,
> > > > > -Siwei
> > > > Your scripts would not work at all then, right?
> > > At this point we don't claim images with such usage as SR-IOV live
> > > migrate-able. We would flag it as live migrate-able until this ethtool
> > > config issue is fully addressed and a transparent live migration solution
> > > emerges in upstream eventually.
> > > 
> > > 
> > > Thanks,
> > > -Siwei
> > > > 
> > > > > > > -Siwei
> > > > > > > 
> > > > > > > 
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > 
> > >    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > (standby virtio-net and net_failover              |                              |
> > > devices created and initialized,                  |                              |
> > > i.e. virtnet_probe()->                            |                              |
> > >         net_failover_create()                      |                              |
> > > was done.)                                        |                              |
> > >                                                    |                              |
> > >                                                    |  runs `ifup ens3' ->         |
> > >                                                    |    ip link set dev ens3 up   |
> > > net_failover_open()                               |                              |
> > >    dev_open(virtnet_dev)                           |                              |
> > >      virtnet_open(virtnet_dev)                     |                              |
> > >    netif_carrier_on(failover_dev)                  |                              |
> > >    ...                                             |                              |
> > >                                                    |                              |
> > > (VF hot plugged in)                               |                              |
> > > ixgbevf_probe()                                   |                              |
> > >   register_netdev(ixgbevf_netdev)                  |                              |
> > >    netdev_register_kobject(ixgbevf_netdev)         |                              |
> > >     kobject_add(ixgbevf_dev)                       |                              |
> > >      device_add(ixgbevf_dev)                       |                              |
> > >       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > >        netlink_broadcast()                         |                              |
> > >    ...                                             |                              |
> > >    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > >     failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > >      failover_slave_register(ixgbevf_netdev)       |                              |
> > >       net_failover_slave_register(ixgbevf_netdev)  |                              |
> > >        dev_open(ixgbevf_netdev)                    |                              |
> > >                                                    |                              |
> > >                                                    |                              |
> > >                                                    |                              |   received ADD uevent from netlink fd
> > >                                                    |                              |   ...
> > >                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > >                                                    |                              |   (decided to renamed 'eth0' )
> > >                                                    |                              |     ip link set dev eth0 name ens4
> > > (dev_change_name() returns -EBUSY as              |                              |
> > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > >                                                    |                              |
> > > 
> > Given renaming slaves does not work anyway:
> I was actually thinking what if we relieve the rename restriction just for
> the failover slave? What the impact would be? I think users don't care about
> slave being renamed when it's in use, especially the initial rename.
> Thoughts?
> 
> >   would it work if we just
> > hard-coded slave names instead?
> > 
> > E.g.
> > 1. fail slave renames
> > 2. rename of failover to XX automatically renames standby to XXnsby
> >     and primary to XXnpry
> That wouldn't help. The time when the failover master gets renamed, the VF
> may not be present.

In this scheme if VF is not there it will be renamed immediately after registration.

> I don't like the idea to delay exposing failover master
> until VF is hot plugged in (probably subject to various failures) later.
> 
> Thanks,
> -Siwei


I agree, this was not what I meant.

> > 
> > 


* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-27 22:38                         ` Michael S. Tsirkin
@ 2019-02-27 23:34                           ` si-wei liu
  2019-02-27 23:50                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 63+ messages in thread
From: si-wei liu @ 2019-02-27 23:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon



On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>
>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>> cleanly, see:
>>>>>>>>>>
>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>
>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>> which doesn't provide a solution if users care about consistent naming
>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>
>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>> Above says:
>>>>>>>>>
>>>>>>>>>        there's no motivation in the systemd/udevd community at
>>>>>>>>>        this point to refactor the rename logic and make it work well with
>>>>>>>>>        3-netdev.
>>>>>>>>>
>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>
>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>> solved...
>>>>> I was just wondering what did you mean when you said
>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>> was there a proposal udev rejected?
>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>> Previously someone had said it could be, but I never see any work or
>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>> of the issue derives from the kernel, it makes more sense to start from
>>>> netdev, work out and decide on a solution: see what can be done in the
>>>> kernel in order to fix it, then after that engage userspace community for
>>>> the feasibility...
>>>>
>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>> into, or add new ones.
>>>>>
>>>> See attached diagram.
>>>>
>>>>>
>>>>>
>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>> to only work with the master failover device.
>>>>>> Where does this expectation come from?
>>>>>>
>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>
>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>> migration.
>>>>> It should be possible to specify the ethtool configuration on the
>>>>> master and have it automatically propagated to the slave.
>>>>>
>>>>> BTW this is something we should look at IMHO.
>>>> I was elaborating a few examples that the expectation and assumption that
>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>> had never been taken good care of, although I did try to emphasize it from
>>>> the very beginning.
>>>>
>>>> Basically what you said about propagating the ethtool configuration down to
>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>> now is any alternative that can also fix the specific udev rename problem,
>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>> fix this particular naming problem under 3-netdev.
>>>>
>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>> 1-netdev is the only solution too soon.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>> Your scripts would not work at all then, right?
>>>> At this point we don't claim images with such usage as SR-IOV live
>>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>>> config issue is fully addressed and a transparent live migration solution
>>>> emerges in upstream eventually.
>>>>
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>
>>>>>
>>>>     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>> (standby virtio-net and net_failover              |                              |
>>>> devices created and initialized,                  |                              |
>>>> i.e. virtnet_probe()->                            |                              |
>>>>          net_failover_create()                      |                              |
>>>> was done.)                                        |                              |
>>>>                                                     |                              |
>>>>                                                     |  runs `ifup ens3' ->         |
>>>>                                                     |    ip link set dev ens3 up   |
>>>> net_failover_open()                               |                              |
>>>>     dev_open(virtnet_dev)                           |                              |
>>>>       virtnet_open(virtnet_dev)                     |                              |
>>>>     netif_carrier_on(failover_dev)                  |                              |
>>>>     ...                                             |                              |
>>>>                                                     |                              |
>>>> (VF hot plugged in)                               |                              |
>>>> ixgbevf_probe()                                   |                              |
>>>>    register_netdev(ixgbevf_netdev)                  |                              |
>>>>     netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>      kobject_add(ixgbevf_dev)                       |                              |
>>>>       device_add(ixgbevf_dev)                       |                              |
>>>>        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>         netlink_broadcast()                         |                              |
>>>>     ...                                             |                              |
>>>>     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>      failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>       failover_slave_register(ixgbevf_netdev)       |                              |
>>>>        net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>         dev_open(ixgbevf_netdev)                    |                              |
>>>>                                                     |                              |
>>>>                                                     |                              |
>>>>                                                     |                              |   received ADD uevent from netlink fd
>>>>                                                     |                              |   ...
>>>>                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>                                                     |                              |   (decided to rename 'eth0')
>>>>                                                     |                              |     ip link set dev eth0 name ens4
>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>                                                     |                              |
>>>>
>>> Given renaming slaves does not work anyway:
>> I was actually thinking what if we relieve the rename restriction just for
>> the failover slave? What the impact would be? I think users don't care about
>> slave being renamed when it's in use, especially the initial rename.
>> Thoughts?
>>
>>>    would it work if we just
>>> hard-coded slave names instead?
>>>
>>> E.g.
>>> 1. fail slave renames
>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>      and primary to XXnpry
>> That wouldn't help. The time when the failover master gets renamed, the VF
>> may not be present.
> In this scheme if VF is not there it will be renamed immediately after registration.
Who will be responsible to rename the slave, the kernel? Note the 
master's name may or may not come from the userspace. If it comes from 
the userspace, should the userspace daemon change their expectation not 
to name/rename _any_ slaves (today there's no distinction)? How do users 
know which name to trust, depending on which wins the race more often? 
Say the kernel wants an ens3npry name while userspace wants it named ens4.
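
For reference, the ordering problem in the quoted diagram can be reduced
to a small userspace model. This is only a sketch, not kernel code: the
struct, the MODEL_IFF_UP flag and both function names are hypothetical
stand-ins, and the only behaviour taken from the diagram is that a rename
is refused with -EBUSY once the interface is up.

```c
#include <errno.h>
#include <string.h>

#define IFNAMSIZ 16          /* same value as the Linux IFNAMSIZ */
#define MODEL_IFF_UP 0x1     /* stand-in for the kernel's IFF_UP flag */

/* Hypothetical model of a netdev; not struct net_device. */
struct model_netdev {
    char name[IFNAMSIZ];
    unsigned int flags;
};

/* Models the check cited in the diagram: an interface that is up
 * cannot be renamed. */
static int model_change_name(struct model_netdev *dev, const char *newname)
{
    if (dev->flags & MODEL_IFF_UP)
        return -EBUSY;
    if (strlen(newname) >= IFNAMSIZ)
        return -EINVAL;
    strcpy(dev->name, newname);
    return 0;
}

/* Models the failover core opening the VF as soon as it registers,
 * i.e. before udev has processed the ADD uevent. */
static void model_failover_enslave(struct model_netdev *dev)
{
    dev->flags |= MODEL_IFF_UP;
}
```

If model_failover_enslave() runs before the rename request, the request
fails and the VF keeps its kernel-assigned name. If the rename arrives
first, it succeeds. That is the race in question.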

-Siwei

>
>> I don't like the idea to delay exposing failover master
>> until VF is hot plugged in (probably subject to various failures) later.
>>
>> Thanks,
>> -Siwei
>
> I agree, this was not what I meant.
>
>>>



* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-27 23:34                           ` si-wei liu
@ 2019-02-27 23:50                             ` Michael S. Tsirkin
  2019-02-28  0:00                               ` Liran Alon
                                                 ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-27 23:50 UTC (permalink / raw)
  To: si-wei liu
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> 
> 
> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
> > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> > > 
> > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > > > cleanly, see:
> > > > > > > > > > > 
> > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > 
> > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > > > which doesn't provide a solution if users care about consistent naming
> > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > 
> > > > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > > > > > Above says:
> > > > > > > > > > 
> > > > > > > > > >        there's no motivation in the systemd/udevd community at
> > > > > > > > > >        this point to refactor the rename logic and make it work well with
> > > > > > > > > >        3-netdev.
> > > > > > > > > > 
> > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > 
> > > > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > > > at least provide the direction in general for how this can be
> > > > > > > > > solved...
> > > > > > I was just wondering what did you mean when you said
> > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > was there a proposal udev rejected?
> > > > > No. I never believed this particular issue can be fixed in userspace alone.
> > > > > Previously someone had said it could be, but I never see any work or
> > > > > relevant discussion ever happened in various userspace communities (for e.g.
> > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > > > of the issue derives from the kernel, it makes more sense to start from
> > > > > netdev, work out and decide on a solution: see what can be done in the
> > > > > kernel in order to fix it, then after that engage userspace community for
> > > > > the feasibility...
> > > > > 
> > > > > > Anyway, can we write a time diagram for what happens in which order that
> > > > > > leads to failure?  That would help look for triggers that we can tie
> > > > > > into, or add new ones.
> > > > > > 
> > > > > See attached diagram.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > > > to only work with the master failover device.
> > > > > > > Where does this expectation come from?
> > > > > > > 
> > > > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > > > certain interface name can't be modified to chase dynamic names.
> > > > > > > 
> > > > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > > > offload settings post boot for specific workload. Those images won't work
> > > > > > > well if the name is constantly changing just after couple rounds of live
> > > > > > > migration.
> > > > > > It should be possible to specify the ethtool configuration on the
> > > > > > master and have it automatically propagated to the slave.
> > > > > > 
> > > > > > BTW this is something we should look at IMHO.
> > > > > I was elaborating a few examples that the expectation and assumption that
> > > > > user/admin scripts only deal with master failover device is incorrect. It
> > > > > had never been taken good care of, although I did try to emphasize it from
> > > > > the very beginning.
> > > > > 
> > > > > Basically what you said about propagating the ethtool configuration down to
> > > > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > > > now is any alternative that can also fix the specific udev rename problem,
> > > > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > > > scheme would take time to implement, while I'm trying to find a way out to
> > > > > fix this particular naming problem under 3-netdev.
> > > > > 
> > > > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > > > just the vehicle). However, I recall there was resistance around this
> > > > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > > > 1-netdev is the only solution too soon.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > -Siwei
> > > > > > Your scripts would not work at all then, right?
> > > > > At this point we don't claim images with such usage as SR-IOV live
> > > > > migrate-able. We would flag it as live migrate-able until this ethtool
> > > > > config issue is fully addressed and a transparent live migration solution
> > > > > emerges in upstream eventually.
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > -Siwei
> > > > > > > > > -Siwei
> > > > > > > > > 
> > > > > > > > > 
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > 
> > > > >     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > > > (standby virtio-net and net_failover              |                              |
> > > > > devices created and initialized,                  |                              |
> > > > > i.e. virtnet_probe()->                            |                              |
> > > > >          net_failover_create()                      |                              |
> > > > > was done.)                                        |                              |
> > > > >                                                     |                              |
> > > > >                                                     |  runs `ifup ens3' ->         |
> > > > >                                                     |    ip link set dev ens3 up   |
> > > > > net_failover_open()                               |                              |
> > > > >     dev_open(virtnet_dev)                           |                              |
> > > > >       virtnet_open(virtnet_dev)                     |                              |
> > > > >     netif_carrier_on(failover_dev)                  |                              |
> > > > >     ...                                             |                              |
> > > > >                                                     |                              |
> > > > > (VF hot plugged in)                               |                              |
> > > > > ixgbevf_probe()                                   |                              |
> > > > >    register_netdev(ixgbevf_netdev)                  |                              |
> > > > >     netdev_register_kobject(ixgbevf_netdev)         |                              |
> > > > >      kobject_add(ixgbevf_dev)                       |                              |
> > > > >       device_add(ixgbevf_dev)                       |                              |
> > > > >        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > > > >         netlink_broadcast()                         |                              |
> > > > >     ...                                             |                              |
> > > > >     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > > > >      failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > > > >       failover_slave_register(ixgbevf_netdev)       |                              |
> > > > >        net_failover_slave_register(ixgbevf_netdev)  |                              |
> > > > >         dev_open(ixgbevf_netdev)                    |                              |
> > > > >                                                     |                              |
> > > > >                                                     |                              |
> > > > >                                                     |                              |   received ADD uevent from netlink fd
> > > > >                                                     |                              |   ...
> > > > >                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > > > >                                                     |                              |   (decided to renamed 'eth0' )
> > > > >                                                     |                              |     ip link set dev eth0 name ens4
> > > > > (dev_change_name() returns -EBUSY as              |                              |
> > > > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > > > >                                                     |                              |
> > > > > 
> > > > Given renaming slaves does not work anyway:
> > > I was actually thinking what if we relieve the rename restriction just for
> > > the failover slave? What the impact would be? I think users don't care about
> > > slave being renamed when it's in use, especially the initial rename.
> > > Thoughts?
> > > 
> > > >    would it work if we just
> > > > hard-coded slave names instead?
> > > > 
> > > > E.g.
> > > > 1. fail slave renames
> > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > >      and primary to XXnpry
> > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > may not be present.
> > In this scheme if VF is not there it will be renamed immediately after registration.
> Who will be responsible to rename the slave, the kernel?

That's the idea.

> Note the master's
> name may or may not come from the userspace. If it comes from the userspace,
> should the userspace daemon change their expectation not to name/rename
> _any_ slaves (today there's no distinction)?

Yes the idea would be to fail renaming slaves.

> How do users know which name to
> trust, depending on which wins the race more often? Say if kernel wants a
> ens3npry name while userspace wants it named as ens4.
> 
> -Siwei

With this approach the kernel will deny attempts by userspace to rename
slaves.  Slaves will always be named XXnsby and XXnpry. Master renames
will rename both slaves.

It seems pretty solid to me; the only issue is that in theory userspace
can use a name like XXnsby for something else. But this seems unlikely.
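
To make the naming rule concrete, here is a minimal userspace sketch, not
kernel code. The helper name failover_slave_name() is hypothetical; only
the "nsby" and "npry" suffixes come from the proposal above. One thing the
sketch makes visible: interface names are limited to IFNAMSIZ (16 bytes
including the terminating NUL), so a long master name may leave no room
for the suffix, and the scheme would need a policy for that case.

```c
#include <stdio.h>

#define IFNAMSIZ 16 /* size of the kernel's interface-name buffer, NUL included */

/*
 * Hypothetical helper: derive a slave name as "<master><suffix>".
 * Returns 0 on success, -1 if the result would not fit in IFNAMSIZ.
 */
static int failover_slave_name(const char *master, const char *suffix,
                               char out[IFNAMSIZ])
{
    int n = snprintf(out, IFNAMSIZ, "%s%s", master, suffix);

    /* snprintf reports the untruncated length; reject anything too long */
    return (n < 0 || n >= IFNAMSIZ) ? -1 : 0;
}
```

With a master named ens3, calling the helper with "nsby" and "npry" gives
ens3nsby for the standby and ens3npry for the primary.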


> > 
> > > I don't like the idea to delay exposing failover master
> > > until VF is hot plugged in (probably subject to various failures) later.
> > > 
> > > Thanks,
> > > -Siwei
> > 
> > I agree, this was not what I meant.
> > 
> > > > 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-27 23:50                             ` Michael S. Tsirkin
@ 2019-02-28  0:00                               ` Liran Alon
  2019-02-28  0:03                               ` Stephen Hemminger
  2019-02-28  0:38                               ` si-wei liu
  2 siblings, 0 replies; 63+ messages in thread
From: Liran Alon @ 2019-02-28  0:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jakub Kicinski,
	Jason Wang



> On 28 Feb 2019, at 1:50, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>> 
>> 
>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>> 
>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>> 
>>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>>> 
>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>> which doesn't provide a solution if users care about consistent naming
>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>> 
>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>> Above says:
>>>>>>>>>>> 
>>>>>>>>>>>       there's no motivation in the systemd/udevd community at
>>>>>>>>>>>       this point to refactor the rename logic and make it work well with
>>>>>>>>>>>       3-netdev.
>>>>>>>>>>> 
>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>> 
>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>> solved...
>>>>>>> I was just wondering what did you mean when you said
>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>> was there a proposal udev rejected?
>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>> the feasibility...
>>>>>> 
>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>>>> into, or add new ones.
>>>>>>> 
>>>>>> See attached diagram.
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>> to only work with the master failover device.
>>>>>>>> Where does this expectation come from?
>>>>>>>> 
>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>> 
>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>> migration.
>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>> master and have it automatically propagated to the slave.
>>>>>>> 
>>>>>>> BTW this is something we should look at IMHO.
>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>> the very beginning.
>>>>>> 
>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>> fix this particular naming problem under 3-netdev.
>>>>>> 
>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>> Your scripts would not work at all then, right?
>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>> migrate-able. We would not flag it as live migrate-able until this ethtool
>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>> emerges in upstream eventually.
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>>>> -Siwei
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>>    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>>>> (standby virtio-net and net_failover              |                              |
>>>>>> devices created and initialized,                  |                              |
>>>>>> i.e. virtnet_probe()->                            |                              |
>>>>>>         net_failover_create()                      |                              |
>>>>>> was done.)                                        |                              |
>>>>>>                                                    |                              |
>>>>>>                                                    |  runs `ifup ens3' ->         |
>>>>>>                                                    |    ip link set dev ens3 up   |
>>>>>> net_failover_open()                               |                              |
>>>>>>    dev_open(virtnet_dev)                           |                              |
>>>>>>      virtnet_open(virtnet_dev)                     |                              |
>>>>>>    netif_carrier_on(failover_dev)                  |                              |
>>>>>>    ...                                             |                              |
>>>>>>                                                    |                              |
>>>>>> (VF hot plugged in)                               |                              |
>>>>>> ixgbevf_probe()                                   |                              |
>>>>>>   register_netdev(ixgbevf_netdev)                  |                              |
>>>>>>    netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>>>     kobject_add(ixgbevf_dev)                       |                              |
>>>>>>      device_add(ixgbevf_dev)                       |                              |
>>>>>>       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>>>        netlink_broadcast()                         |                              |
>>>>>>    ...                                             |                              |
>>>>>>    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>>>     failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>>>      failover_slave_register(ixgbevf_netdev)       |                              |
>>>>>>       net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>>>        dev_open(ixgbevf_netdev)                    |                              |
>>>>>>                                                    |                              |
>>>>>>                                                    |                              |
>>>>>>                                                    |                              |   received ADD uevent from netlink fd
>>>>>>                                                    |                              |   ...
>>>>>>                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>>>                                                    |                              |   (decided to rename 'eth0')
>>>>>>                                                    |                              |     ip link set dev eth0 name ens4
>>>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>>>                                                    |                              |
>>>>>> 
>>>>> Given renaming slaves does not work anyway:
>>>> I was actually thinking what if we relieve the rename restriction just for
>>>> the failover slave? What the impact would be? I think users don't care about
>>>> slave being renamed when it's in use, especially the initial rename.
>>>> Thoughts?
>>>> 
>>>>>   would it work if we just
>>>>> hard-coded slave names instead?
>>>>> 
>>>>> E.g.
>>>>> 1. fail slave renames
>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>>     and primary to XXnpry
>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>> may not be present.
>>> In this scheme if VF is not there it will be renamed immediately after registration.
>> Who will be responsible to rename the slave, the kernel?
> 
> That's the idea.
> 
>> Note the master's
>> name may or may not come from the userspace. If it comes from the userspace,
>> should the userspace daemon change their expectation not to name/rename
>> _any_ slaves (today there's no distinction)?
> 
> Yes the idea would be to fail renaming slaves.
> 
>> How do users know which name to
>> trust, depending on which wins the race more often? Say if kernel wants a
>> ens3npry name while userspace wants it named as ens4.
>> 
>> -Siwei
> 
> With this approach kernel will deny attempts by userspace to rename
> slaves.  Slaves will always be named XXnsby and XXnpry. Master renames
> will rename both slaves.
> 
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXnsby for something else. But this seems unlikely.

I’m fond of this idea and hold a similar opinion.
I think it simplifies the issue here.
I don’t see a real reason for a customer to define a udev rule that renames a net_failover slave with a different suffix.

-Liran

> 
> 
>>> 
>>>> I don't like the idea to delay exposing failover master
>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>> 
>>>> Thanks,
>>>> -Siwei
>>> 
>>> I agree, this was not what I meant.
>>> 
>>>>> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-27 23:50                             ` Michael S. Tsirkin
  2019-02-28  0:00                               ` Liran Alon
@ 2019-02-28  0:03                               ` Stephen Hemminger
  2019-02-28  0:38                                 ` Michael S. Tsirkin
  2019-02-28  0:38                               ` si-wei liu
  2 siblings, 1 reply; 63+ messages in thread
From: Stephen Hemminger @ 2019-02-28  0:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Wed, 27 Feb 2019 18:50:44 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> > 
> > 
> > On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:  
> > > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:  
> > > > 
> > > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:  
> > > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:  
> > > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:  
> > > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> > > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> > > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:  
> > > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> > > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> > > > > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > > > > cleanly, see:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > > 
> > > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > > > > which doesn't provide a solution if users care about consistent naming
> > > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > > 
> > > > > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.  
> > > > > > > > > > > Above says:
> > > > > > > > > > > 
> > > > > > > > > > >        there's no motivation in the systemd/udevd community at
> > > > > > > > > > >        this point to refactor the rename logic and make it work well with
> > > > > > > > > > >        3-netdev.
> > > > > > > > > > > 
> > > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > >   
> > > > > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > > > > at least provide the direction in general for how this can be
> > > > > > > > > > solved...  
> > > > > > > I was just wondering what did you mean when you said
> > > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > > was there a proposal udev rejected?  
> > > > > > No. I never believed this particular issue can be fixed in userspace alone.
> > > > > > Previously someone had said it could be, but I never see any work or
> > > > > > relevant discussion ever happened in various userspace communities (for e.g.
> > > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > > > > of the issue derives from the kernel, it makes more sense to start from
> > > > > > netdev, work out and decide on a solution: see what can be done in the
> > > > > > kernel in order to fix it, then after that engage userspace community for
> > > > > > the feasibility...
> > > > > >   
> > > > > > > Anyway, can we write a time diagram for what happens in which order that
> > > > > > > leads to failure?  That would help look for triggers that we can tie
> > > > > > > into, or add new ones.
> > > > > > >   
> > > > > > See attached diagram.
> > > > > >   
> > > > > > > 
> > > > > > >   
> > > > > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > > > > to only work with the master failover device.  
> > > > > > > > Where does this expectation come from?
> > > > > > > > 
> > > > > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > > > > certain interface name can't be modified to chase dynamic names.
> > > > > > > > 
> > > > > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > > > > offload settings post boot for specific workload. Those images won't work
> > > > > > > > well if the name is constantly changing just after couple rounds of live
> > > > > > > > migration.  
> > > > > > > It should be possible to specify the ethtool configuration on the
> > > > > > > master and have it automatically propagated to the slave.
> > > > > > > 
> > > > > > > BTW this is something we should look at IMHO.  
> > > > > > I was elaborating a few examples that the expectation and assumption that
> > > > > > user/admin scripts only deal with master failover device is incorrect. It
> > > > > > had never been taken good care of, although I did try to emphasize it from
> > > > > > the very beginning.
> > > > > > 
> > > > > > Basically what you said about propagating the ethtool configuration down to
> > > > > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > > > > now is any alternative that can also fix the specific udev rename problem,
> > > > > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > > > > scheme would take time to implement, while I'm trying to find a way out to
> > > > > > fix this particular naming problem under 3-netdev.
> > > > > >   
> > > > > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> > > > > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > > > > just the vehicle). However, I recall there was resistance around this
> > > > > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > > > > 1-netdev is the only solution too soon.
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > -Siwei  
> > > > > > > Your scripts would not work at all then, right?  
> > > > > > At this point we don't claim images with such usage as SR-IOV live
> > > > > > > migrate-able. We would not flag it as live migrate-able until this ethtool
> > > > > > config issue is fully addressed and a transparent live migration solution
> > > > > > emerges in upstream eventually.
> > > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > -Siwei  
> > > > > > > > > > -Siwei
> > > > > > > > > > 
> > > > > > > > > >   
> > > > > > >   
> > > > > >     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > > > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > > > > (standby virtio-net and net_failover              |                              |
> > > > > > devices created and initialized,                  |                              |
> > > > > > i.e. virtnet_probe()->                            |                              |
> > > > > >          net_failover_create()                      |                              |
> > > > > > was done.)                                        |                              |
> > > > > >                                                     |                              |
> > > > > >                                                     |  runs `ifup ens3' ->         |
> > > > > >                                                     |    ip link set dev ens3 up   |
> > > > > > net_failover_open()                               |                              |
> > > > > >     dev_open(virtnet_dev)                           |                              |
> > > > > >       virtnet_open(virtnet_dev)                     |                              |
> > > > > >     netif_carrier_on(failover_dev)                  |                              |
> > > > > >     ...                                             |                              |
> > > > > >                                                     |                              |
> > > > > > (VF hot plugged in)                               |                              |
> > > > > > ixgbevf_probe()                                   |                              |
> > > > > >    register_netdev(ixgbevf_netdev)                  |                              |
> > > > > >     netdev_register_kobject(ixgbevf_netdev)         |                              |
> > > > > >      kobject_add(ixgbevf_dev)                       |                              |
> > > > > >       device_add(ixgbevf_dev)                       |                              |
> > > > > >        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > > > > >         netlink_broadcast()                         |                              |
> > > > > >     ...                                             |                              |
> > > > > >     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > > > > >      failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > > > > >       failover_slave_register(ixgbevf_netdev)       |                              |
> > > > > >        net_failover_slave_register(ixgbevf_netdev)  |                              |
> > > > > >         dev_open(ixgbevf_netdev)                    |                              |
> > > > > >                                                     |                              |
> > > > > >                                                     |                              |
> > > > > >                                                     |                              |   received ADD uevent from netlink fd
> > > > > >                                                     |                              |   ...
> > > > > >                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > > > > >                                                     |                              |   (decided to renamed 'eth0' )
> > > > > >                                                     |                              |     ip link set dev eth0 name ens4
> > > > > > (dev_change_name() returns -EBUSY as              |                              |
> > > > > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > > > > >                                                     |                              |
> > > > > >   
> > > > > Given renaming slaves does not work anyway:  
> > > > I was actually thinking what if we relieve the rename restriction just for
> > > > the failover slave? What the impact would be? I think users don't care about
> > > > slave being renamed when it's in use, especially the initial rename.
> > > > Thoughts?
> > > >   
> > > > >    would it work if we just
> > > > > hard-coded slave names instead?
> > > > > 
> > > > > E.g.
> > > > > 1. fail slave renames
> > > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > > >      and primary to XXnpry  
> > > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > > may not be present.  
> > > In this scheme if VF is not there it will be renamed immediately after registration.  
> > Who will be responsible to rename the slave, the kernel?  
> 
> That's the idea.
> 
> > Note the master's
> > name may or may not come from the userspace. If it comes from the userspace,
> > should the userspace daemon change their expectation not to name/rename
> > _any_ slaves (today there's no distinction)?  
> 
> Yes the idea would be to fail renaming slaves.
> 
> > How do users know which name to
> > trust, depending on which wins the race more often? Say if kernel wants a
> > ens3npry name while userspace wants it named as ens4.
> > 
> > -Siwei  
> 
> With this approach kernel will deny attempts by userspace to rename
> slaves.  Slaves will always be named XXnsby and XXnpry. Master renames
> will rename both slaves.
> 
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXnsby for something else. But this seems unlikely.

Similar schemes (with kernel providing naming) were also previously rejected
upstream. It has been a consistent theme that the kernel should not be in
the renaming business. It will certainly break userspace.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-27 23:50                             ` Michael S. Tsirkin
  2019-02-28  0:00                               ` Liran Alon
  2019-02-28  0:03                               ` Stephen Hemminger
@ 2019-02-28  0:38                               ` si-wei liu
  2019-02-28  0:41                                 ` Michael S. Tsirkin
  2 siblings, 1 reply; 63+ messages in thread
From: si-wei liu @ 2019-02-28  0:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon



On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>>
>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>>
>>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>>>
>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>> which doesn't provide a solution if the user cares about consistent naming
>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>>
>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>> Above says:
>>>>>>>>>>>
>>>>>>>>>>>         there's no motivation in the systemd/udevd community at
>>>>>>>>>>>         this point to refactor the rename logic and make it work well with
>>>>>>>>>>>         3-netdev.
>>>>>>>>>>>
>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>>
>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>> solved...
>>>>>>> I was just wondering what did you mean when you said
>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>> was there a proposal udev rejected?
>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>> the feasibility...
>>>>>>
>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>>>> into, or add new ones.
>>>>>>>
>>>>>> See attached diagram.
>>>>>>
>>>>>>>
>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>> to only work with the master failover device.
>>>>>>>> Where does this expectation come from?
>>>>>>>>
>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>>
>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>> migration.
>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>> master and have it automatically propagated to the slave.
>>>>>>>
>>>>>>> BTW this is something we should look at IMHO.
>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>> the very beginning.
>>>>>>
>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>> fix this particular naming problem under 3-netdev.
>>>>>>
>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>> Your scripts would not work at all then, right?
>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>> migrate-able. We would not flag it as live migrate-able until this ethtool
>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>> emerges in upstream eventually.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>>>> -Siwei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>      net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>>>> -----------------------------------------------------+------------------------------+--------------------------------------------
>>>>>> (standby virtio-net and net_failover                 |                              |
>>>>>> devices created and initialized,                     |                              |
>>>>>> i.e. virtnet_probe()->                               |                              |
>>>>>>           net_failover_create()                      |                              |
>>>>>> was done.)                                           |                              |
>>>>>>                                                      |                              |
>>>>>>                                                      |  runs `ifup ens3' ->         |
>>>>>>                                                      |    ip link set dev ens3 up   |
>>>>>> net_failover_open()                                  |                              |
>>>>>>      dev_open(virtnet_dev)                           |                              |
>>>>>>        virtnet_open(virtnet_dev)                     |                              |
>>>>>>      netif_carrier_on(failover_dev)                  |                              |
>>>>>>      ...                                             |                              |
>>>>>>                                                      |                              |
>>>>>> (VF hot plugged in)                                  |                              |
>>>>>> ixgbevf_probe()                                      |                              |
>>>>>>     register_netdev(ixgbevf_netdev)                  |                              |
>>>>>>      netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>>>       kobject_add(ixgbevf_dev)                       |                              |
>>>>>>        device_add(ixgbevf_dev)                       |                              |
>>>>>>         kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>>>          netlink_broadcast()                         |                              |
>>>>>>      ...                                             |                              |
>>>>>>      call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>>>       failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>>>        failover_slave_register(ixgbevf_netdev)       |                              |
>>>>>>         net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>>>          dev_open(ixgbevf_netdev)                    |                              |
>>>>>>                                                      |                              |
>>>>>>                                                      |                              |
>>>>>>                                                      |                              |   received ADD uevent from netlink fd
>>>>>>                                                      |                              |   ...
>>>>>>                                                      |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>>>                                                      |                              |   (decided to rename 'eth0')
>>>>>>                                                      |                              |     ip link set dev eth0 name ens4
>>>>>> (dev_change_name() returns -EBUSY as                 |                              |
>>>>>> ixgbevf_netdev->flags has IFF_UP)                    |                              |
>>>>>>                                                      |                              |
>>>>>>
>>>>> Given renaming slaves does not work anyway:
>>>> I was actually thinking what if we relieve the rename restriction just for
>>>> the failover slave? What the impact would be? I think users don't care about
>>>> slave being renamed when it's in use, especially the initial rename.
>>>> Thoughts?
>>>>
>>>>>     would it work if we just
>>>>> hard-coded slave names instead?
>>>>>
>>>>> E.g.
>>>>> 1. fail slave renames
>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>>       and primary to XXnpry
>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>> may not be present.
>>> In this scheme if VF is not there it will be renamed immediately after registration.
>> Who will be responsible to rename the slave, the kernel?
> That's the idea.
>
>> Note the master's
>> name may or may not come from the userspace. If it comes from the userspace,
>> should the userspace daemon change their expectation not to name/rename
>> _any_ slaves (today there's no distinction)?
> Yes the idea would be to fail renaming slaves.
No, I was asking about the userspace expectation: whether it should track
and detect the lifecycle events of failover slaves and decide what to
do. How does it get back to the user-specified name if the VF is not
enslaved (say someone unloads the virtio-net module)?

As this scheme adds much complexity to the kernel naming convention
(currently it's just ethX names) that no userspace can understand, will
the change break userspace further?

-Siwei

>
>> How do users know which name to
>> trust, depending on which wins the race more often? Say if kernel wants a
>> ens3npry name while userspace wants it named as ens4.
>>
>> -Siwei
> With this approach kernel will deny attempts by userspace to rename
> slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> will rename both slaves.
>
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXXnsby for something else. But this seems unlikely.
>
>
>>>> I don't like the idea to delay exposing failover master
>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>>
>>>> Thanks,
>>>> -Siwei
>>> I agree, this was not what I meant.
>>>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28  0:03                               ` Stephen Hemminger
@ 2019-02-28  0:38                                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28  0:38 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Wed, Feb 27, 2019 at 04:03:42PM -0800, Stephen Hemminger wrote:
> > With this approach kernel will deny attempts by userspace to rename
> > slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> > will rename both slaves.
> > 
> > It seems pretty solid to me, the only issue is that in theory userspace
> > can use a name like XXXnsby for something else. But this seems unlikely.
> 
> Similar schemes (with kernel providing naming) were also previously rejected
> upstream.

Links?
I'm inclined to try and see what happens.

> It has been a consistent theme that the kernel should not be in
> the renaming business.

In this case it's not in the renaming business per se. The only reason
we even have the original name is the way the internal APIs
work. You can look at it as simply having the slaves' names be
part of the master.
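
To make that concrete, here is a minimal sketch in plain userspace C (not
kernel code) of deriving both slave names from the master name. The helper
failover_slave_name() is hypothetical; the only things taken as given are
the "nsby"/"npry" suffixes from the proposal above and the 16-byte IFNAMSIZ
limit, which leaves room for the suffix only if the master name is at most
11 characters long:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Size of a Linux interface name buffer, including the trailing NUL. */
#define IFNAMSIZ 16

/*
 * Hypothetical helper: write "<master><suffix>" into out, which must be
 * at least IFNAMSIZ bytes. Returns 0 on success, or -ENAMETOOLONG if the
 * combined name would not fit in an interface name.
 */
static int failover_slave_name(char *out, const char *master,
			       const char *suffix)
{
	int n = snprintf(out, IFNAMSIZ, "%s%s", master, suffix);

	if (n < 0 || n >= IFNAMSIZ)
		return -ENAMETOOLONG;
	return 0;
}
```

With master "ens3" this yields "ens3nsby" for the standby and "ens3npry"
for the primary. What should happen when the master name is too long
(fail the master rename, or truncate) is left open in this sketch.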

> It will certainly break userspace.

That's a strong claim. What is it based on?  It so happens that
userspace renaming slaves is already broken on virtio. So we can fix it
any way we like :)

And yes, it won't help netvsc, because netvsc wants compatibility with old
scripts, but then netvsc uses a 2-device model anyway.

-- 
MST

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28  0:38                               ` si-wei liu
@ 2019-02-28  0:41                                 ` Michael S. Tsirkin
  2019-02-28  0:52                                   ` Jakub Kicinski
  2019-02-28  9:32                                   ` si-wei liu
  0 siblings, 2 replies; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28  0:41 UTC (permalink / raw)
  To: si-wei liu
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Wed, Feb 27, 2019 at 04:38:00PM -0800, si-wei liu wrote:
> 
> 
> On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> > > 
> > > On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
> > > > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> > > > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> > > > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > > > > > cleanly, see:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > > > > which doesn't provide a solution if the user cares about consistent naming
> > > > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > > > > > > > Above says:
> > > > > > > > > > > > 
> > > > > > > > > > > >         there's no motivation in the systemd/udevd community at
> > > > > > > > > > > >         this point to refactor the rename logic and make it work well with
> > > > > > > > > > > >         3-netdev.
> > > > > > > > > > > > 
> > > > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > > > 
> > > > > > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > > > > > at least provide the direction in general for how this can be
> > > > > > > > > > > solved...
> > > > > > > > I was just wondering what did you mean when you said
> > > > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > > > was there a proposal udev rejected?
> > > > > > > No. I never believed this particular issue can be fixed in userspace alone.
> > > > > > > Previously someone had said it could be, but I never see any work or
> > > > > > > relevant discussion ever happened in various userspace communities (for e.g.
> > > > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > > > > > of the issue derives from the kernel, it makes more sense to start from
> > > > > > > netdev, work out and decide on a solution: see what can be done in the
> > > > > > > kernel in order to fix it, then after that engage userspace community for
> > > > > > > the feasibility...
> > > > > > > 
> > > > > > > > Anyway, can we write a time diagram for what happens in which order that
> > > > > > > > leads to failure?  That would help look for triggers that we can tie
> > > > > > > > into, or add new ones.
> > > > > > > > 
> > > > > > > See attached diagram.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > > > > > to only work with the master failover device.
> > > > > > > > > Where does this expectation come from?
> > > > > > > > > 
> > > > > > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > > > > > certain interface name can't be modified to chase dynamic names.
> > > > > > > > > 
> > > > > > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > > > > > offload settings post boot for specific workload. Those images won't work
> > > > > > > > > well if the name is constantly changing just after couple rounds of live
> > > > > > > > > migration.
> > > > > > > > It should be possible to specify the ethtool configuration on the
> > > > > > > > master and have it automatically propagated to the slave.
> > > > > > > > 
> > > > > > > > BTW this is something we should look at IMHO.
> > > > > > > I was elaborating a few examples that the expectation and assumption that
> > > > > > > user/admin scripts only deal with master failover device is incorrect. It
> > > > > > > had never been taken good care of, although I did try to emphasize it from
> > > > > > > the very beginning.
> > > > > > > 
> > > > > > > Basically what you said about propagating the ethtool configuration down to
> > > > > > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > > > > > now is any alternative that can also fix the specific udev rename problem,
> > > > > > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > > > > > scheme would take time to implement, while I'm trying to find a way out to
> > > > > > > fix this particular naming problem under 3-netdev.
> > > > > > > 
> > > > > > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > > > > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > > > > > just the vehicle). However, I recall there was resistance around this
> > > > > > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > > > > > 1-netdev is the only solution too soon.
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > -Siwei
> > > > > > > > Your scripts would not work at all then, right?
> > > > > > > At this point we don't claim images with such usage as SR-IOV live
> > > > > > > migrate-able. We would not flag it as live migrate-able until this ethtool
> > > > > > > config issue is fully addressed and a transparent live migration solution
> > > > > > > emerges in upstream eventually.
> > > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > -Siwei
> > > > > > > > > > > -Siwei
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > > > 
> > > > > > >      net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > > > > > -----------------------------------------------------+------------------------------+--------------------------------------------
> > > > > > > (standby virtio-net and net_failover                 |                              |
> > > > > > > devices created and initialized,                     |                              |
> > > > > > > i.e. virtnet_probe()->                               |                              |
> > > > > > >           net_failover_create()                      |                              |
> > > > > > > was done.)                                           |                              |
> > > > > > >                                                      |                              |
> > > > > > >                                                      |  runs `ifup ens3' ->         |
> > > > > > >                                                      |    ip link set dev ens3 up   |
> > > > > > > net_failover_open()                                  |                              |
> > > > > > >      dev_open(virtnet_dev)                           |                              |
> > > > > > >        virtnet_open(virtnet_dev)                     |                              |
> > > > > > >      netif_carrier_on(failover_dev)                  |                              |
> > > > > > >      ...                                             |                              |
> > > > > > >                                                      |                              |
> > > > > > > (VF hot plugged in)                                  |                              |
> > > > > > > ixgbevf_probe()                                      |                              |
> > > > > > >     register_netdev(ixgbevf_netdev)                  |                              |
> > > > > > >      netdev_register_kobject(ixgbevf_netdev)         |                              |
> > > > > > >       kobject_add(ixgbevf_dev)                       |                              |
> > > > > > >        device_add(ixgbevf_dev)                       |                              |
> > > > > > >         kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > > > > > >          netlink_broadcast()                         |                              |
> > > > > > >      ...                                             |                              |
> > > > > > >      call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > > > > > >       failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > > > > > >        failover_slave_register(ixgbevf_netdev)       |                              |
> > > > > > >         net_failover_slave_register(ixgbevf_netdev)  |                              |
> > > > > > >          dev_open(ixgbevf_netdev)                    |                              |
> > > > > > >                                                      |                              |
> > > > > > >                                                      |                              |
> > > > > > >                                                      |                              |   received ADD uevent from netlink fd
> > > > > > >                                                      |                              |   ...
> > > > > > >                                                      |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > > > > > >                                                      |                              |   (decided to rename 'eth0')
> > > > > > >                                                      |                              |     ip link set dev eth0 name ens4
> > > > > > > (dev_change_name() returns -EBUSY as                 |                              |
> > > > > > > ixgbevf_netdev->flags has IFF_UP)                    |                              |
> > > > > > >                                                      |                              |
> > > > > > > 
> > > > > > Given renaming slaves does not work anyway:
> > > > > I was actually thinking what if we relieve the rename restriction just for
> > > > > the failover slave? What the impact would be? I think users don't care about
> > > > > slave being renamed when it's in use, especially the initial rename.
> > > > > Thoughts?
> > > > > 
> > > > > >     would it work if we just
> > > > > > hard-coded slave names instead?
> > > > > > 
> > > > > > E.g.
> > > > > > 1. fail slave renames
> > > > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > > > >       and primary to XXnpry
> > > > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > > > may not be present.
> > > > In this scheme if VF is not there it will be renamed immediately after registration.
> > > Who will be responsible to rename the slave, the kernel?
> > That's the idea.
> > 
> > > Note the master's
> > > name may or may not come from the userspace. If it comes from the userspace,
> > > should the userspace daemon change their expectation not to name/rename
> > > _any_ slaves (today there's no distinction)?
> > Yes the idea would be to fail renaming slaves.
> No I was asking about the userspace expectation: whether it should track and
> detect the lifecycle events of failover slaves and decide what to do. How
> does it get back to the user specified name if VF is not enslaved (say
> someone unloads the virtio-net module)?

When virtio-net is removed, the VF will shortly be removed too.

> As this scheme adds much complexity to the kernel naming convention
> (currently it's just ethX names) that no userspace can understand.

Anything that pokes at slaves needs to be specially designed anyway.
Naming seems like a minor issue.

> Will the
> change break userspace further?
> 
> -Siwei

Didn't you show that userspace is already broken? You can't "further
break it"; rename already fails.
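
For reference, the failure in the diagram comes down to a check of roughly
this shape. This is a simplified userspace model, not the kernel's
dev_change_name(); struct model_netdev, MODEL_IFF_UP and the model_*
helpers are made up for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define IFNAMSIZ 16
#define MODEL_IFF_UP 0x1	/* stand-in for the kernel's IFF_UP flag */

/* Hypothetical, stripped-down stand-in for struct net_device. */
struct model_netdev {
	char name[IFNAMSIZ];
	unsigned int flags;
};

/* Models the enslavement path opening the device (dev_open()). */
static void model_open(struct model_netdev *dev)
{
	dev->flags |= MODEL_IFF_UP;
}

/* Models the rename check: a device that is already up refuses renames. */
static int model_change_name(struct model_netdev *dev, const char *newname)
{
	if (dev->flags & MODEL_IFF_UP)
		return -EBUSY;
	if (strlen(newname) >= IFNAMSIZ)
		return -EINVAL;
	strcpy(dev->name, newname);
	return 0;
}
```

If the rename to ens4 happens before model_open(), it succeeds; once the
slave has been opened by the auto-enslavement path, the same rename returns
-EBUSY and the device stays eth0, which is the ordering udev loses today.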

> > 
> > > How do users know which name to
> > > trust, depending on which wins the race more often? Say if kernel wants a
> > > ens3npry name while userspace wants it named as ens4.
> > > 
> > > -Siwei
> > With this approach kernel will deny attempts by userspace to rename
> > slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> > will rename both slaves.
> > 
> > It seems pretty solid to me, the only issue is that in theory userspace
> > can use a name like XXXnsby for something else. But this seems unlikely.
> > 
> > 
> > > > > I don't like the idea to delay exposing failover master
> > > > > until VF is hot plugged in (probably subject to various failures) later.
> > > > > 
> > > > > Thanks,
> > > > > -Siwei
> > > > I agree, this was not what I meant.
> > > > 

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28  0:41                                 ` Michael S. Tsirkin
@ 2019-02-28  0:52                                   ` Jakub Kicinski
  2019-02-28  1:26                                     ` Michael S. Tsirkin
  2019-02-28  9:32                                   ` si-wei liu
  1 sibling, 1 reply; 63+ messages in thread
From: Jakub Kicinski @ 2019-02-28  0:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jason Wang,
	liran.alon

On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:
> > As this scheme adds much complexity to the kernel naming convention
> > (currently it's just ethX names) that no userspace can understand.  
> 
> Anything that pokes at slaves needs to be specially designed anyway.
> Naming seems like a minor issue.

Can the users who care about the naming put net_failover into
"user space will do the bond enslavement" mode, and do the bond
creation/management themselves from user space (in systemd/ 
Network Manager) based on the failover flag?

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28  0:52                                   ` Jakub Kicinski
@ 2019-02-28  1:26                                     ` Michael S. Tsirkin
  2019-02-28  1:52                                       ` Jakub Kicinski
  0 siblings, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28  1:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jason Wang,
	liran.alon

On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:
> > > As this scheme adds much complexity to the kernel naming convention
> > > (currently it's just ethX names) that no userspace can understand.  
> > 
> > Anything that pokes at slaves needs to be specially designed anyway.
> > Naming seems like a minor issue.
> 
> Can the users who care about the naming put net_failover into
> "user space will do the bond enslavement" mode, and do the bond
> creation/management themselves from user space (in systemd/ 
> Network Manager) based on the failover flag?

Putting issues of compatibility aside (userspace tends to be confused if
you give it two devices with same MAC), how would you have it work in
practice? Timer based hacks like netvsc where if userspace didn't
respond within X seconds we assume it won't and do everything ourselves?
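
A rough userspace model of the pairing problem, assuming the standby and
the VF are matched purely by MAC address. The struct and
find_mac_peer() are made up for illustration. A real daemon would read
link data via netlink rather than from a static array.

```c
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6 /* length of an Ethernet MAC address in bytes */

/* Hypothetical, simplified view of a network interface. */
struct model_if {
    const char *name;
    unsigned char mac[ETH_ALEN];
};

/*
 * Return the index of the first other interface that carries the same
 * MAC as ifs[self], or -1 if there is none.
 */
static int find_mac_peer(const struct model_if *ifs, size_t n, size_t self)
{
    for (size_t i = 0; i < n; i++) {
        if (i != self &&
            memcmp(ifs[i].mac, ifs[self].mac, ETH_ALEN) == 0)
            return (int)i;
    }
    return -1;
}
```

Tooling that assumes MACs are unique never performs this lookup, which
is where the confusion with two same-MAC devices comes from.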

-- 
MST

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28  1:26                                     ` Michael S. Tsirkin
@ 2019-02-28  1:52                                       ` Jakub Kicinski
  2019-02-28  4:47                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 63+ messages in thread
From: Jakub Kicinski @ 2019-02-28  1:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jason Wang,
	liran.alon

On Wed, 27 Feb 2019 20:26:02 -0500, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> > On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:  
> > > > As this scheme adds much complexity to the kernel naming convention
> > > > (currently it's just ethX names) that no userspace can understand.    
> > > 
> > > Anything that pokes at slaves needs to be specially designed anyway.
> > > Naming seems like a minor issue.  
> > 
> > Can the users who care about the naming put net_failover into
> > "user space will do the bond enslavement" mode, and do the bond
> > creation/management themselves from user space (in systemd/ 
> > Network Manager) based on the failover flag?  
> 
> Putting issues of compatibility aside (userspace tends to be confused if
> you give it two devices with same MAC), how would you have it work in
> practice? Timer based hacks like netvsc where if userspace didn't
> respond within X seconds we assume it won't and do everything ourselves?

Well, what I'm saying is basically if user space knows how to deal with
the auto-bonding, we can put aside net_failover for the most part.  It
can either be blacklisted or it can have some knob which will
effectively disable the auto-enslavement.

Auto-bonding capable user space can do the renames, spawn the bond,
etc. all by itself.  I'm basically going back to my initial proposal
here :)  There is a RedHat bugzilla for the NetworkManager team to do
this, but we merged net_failover before those folks got around to
implementing it.

IOW if NM/systemd is capable of doing the auto-bonding itself it can
disable the kernel mechanism and take care of it all.  If kernel is
booted with an old user space which doesn't have capable NM/systemd -
net_failover will kick in and do its best.
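
One way to read the fallback described above, as a small decision
function. This is only a sketch: the enum, the function name, and the
idea of a single boolean "kernel failover disabled" knob are assumptions
for illustration, not an existing kernel or NetworkManager interface.

```c
#include <stdbool.h>

/* Hypothetical: who ends up pairing the standby with the VF. */
enum enslave_owner {
    ENSLAVE_NONE,      /* device does not offer STANDBY, nothing to pair */
    ENSLAVE_KERNEL,    /* net_failover auto-enslaves in the kernel */
    ENSLAVE_USERSPACE  /* NM/systemd renames and bonds by itself */
};

static enum enslave_owner pick_enslave_owner(bool has_standby,
                                             bool kernel_failover_disabled)
{
    if (!has_standby)
        return ENSLAVE_NONE;

    /* Old user space never sets the knob, so the kernel path stays default. */
    return kernel_failover_disabled ? ENSLAVE_USERSPACE : ENSLAVE_KERNEL;
}
```

The default (knob unset) keeps today's behaviour, so an old user space
still gets the in-kernel failover.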

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28  1:52                                       ` Jakub Kicinski
@ 2019-02-28  4:47                                         ` Michael S. Tsirkin
  2019-02-28 18:13                                           ` Jakub Kicinski
  0 siblings, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28  4:47 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jason Wang,
	liran.alon

On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 20:26:02 -0500, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> > > On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:  
> > > > > As this scheme adds much complexity to the kernel naming convention
> > > > > (currently it's just ethX names) that no userspace can understand.    
> > > > 
> > > > Anything that pokes at slaves needs to be specially designed anyway.
> > > > Naming seems like a minor issue.  
> > > 
> > > Can the users who care about the naming put net_failover into
> > > "user space will do the bond enslavement" mode, and do the bond
> > > creation/management themselves from user space (in systemd/ 
> > > Network Manager) based on the failover flag?  
> > 
> > Putting issues of compatibility aside (userspace tends to be confused if
> > you give it two devices with same MAC), how would you have it work in
> > practice? Timer based hacks like netvsc where if userspace didn't
> > respond within X seconds we assume it won't and do everything ourselves?
> 
> Well, what I'm saying is basically if user space knows how to deal with
> the auto-bonding, we can put aside net_failover for the most part.  It
> can either be blacklisted or it can have some knob which will
> effectively disable the auto-enslavement.

OK I guess we could add a module parameter to skip this.
Is this what you mean?

> Auto-bonding capable user space can do the renames, spawn the bond,
> etc. all by itself.  I'm basically going back to my initial proposal
> here :)  There is a RedHat bugzilla for the NetworkManager team to do
> this, but we merged net_failover before those folks got around to
> implementing it.

In particular because there's no policy involved whatsoever
here so it's just mechanism being pushed up to userspace.

> IOW if NM/systemd is capable of doing the auto-bonding itself it can
> disable the kernel mechanism and take care of it all.  If kernel is
> booted with an old user space which doesn't have capable NM/systemd -
> net_failover will kick in and do its best.

Sure - it's just 2 lines of code, see below.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

But I don't intend to bother until there's actual interest from
userspace developers. In particular it is not just NM/systemd
even on Fedora - e.g. you will need to teach dracut to somehow detect
and handle this - right now it gets confused if there are two devices
with the same MAC address.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 955b3e76eb8d..dd2b2c370003 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
+module_param(disable_failover, bool, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	virtnet_init_settings(dev);
 
-	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
+		!disable_failover) {
 		vi->failover = net_failover_create(vi->dev);
 		if (IS_ERR(vi->failover)) {
 			err = PTR_ERR(vi->failover);


* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28  0:41                                 ` Michael S. Tsirkin
  2019-02-28  0:52                                   ` Jakub Kicinski
@ 2019-02-28  9:32                                   ` si-wei liu
  2019-02-28 14:26                                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 63+ messages in thread
From: si-wei liu @ 2019-02-28  9:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon



On 2/27/2019 4:41 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 04:38:00PM -0800, si-wei liu wrote:
>>
>> On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
>>> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>>>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>>>> Sorry for replying to this ancient thread. There was a remaining
>>>>>>>>>>>>>> issue that I don't think the initial net_failover patch addressed
>>>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>>>> not specifically written for such kernel automatic enslavement.
>>>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>>>> renamed *before* the virtual device gets created; that's what udev can
>>>>>>>>>>>>>> control (without getting the netdev opened early by another part of
>>>>>>>>>>>>>> the kernel), and other userspace components, e.g. initramfs and
>>>>>>>>>>>>>> init-scripts, can coordinate well in between. The in-kernel
>>>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention
>>>>>>>>>>>>>> and provides no solution if the user cares about consistent naming
>>>>>>>>>>>>>> of the slave netdevs specifically.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>>>> and the 1-netdev model were proposed, but no one has offered a solution
>>>>>>>>>>>>>> to this problem since. Please share your thoughts on how to proceed and
>>>>>>>>>>>>>> solve this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>>>> Above says:
>>>>>>>>>>>>>
>>>>>>>>>>>>>          there's no motivation in the systemd/udevd community at
>>>>>>>>>>>>>          this point to refactor the rename logic and make it work well with
>>>>>>>>>>>>>          3-netdev.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>>>>
>>>>>>>>>>>> The user gains nothing from just skipping slave devices - the
>>>>>>>>>>>> name is still unchanged and unpredictable, e.g. eth0, or eth1 on the
>>>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>>>> the failover is created, the enslaved netdev is opened by the kernel
>>>>>>>>>>>> before userspace is made aware of it, and there's no
>>>>>>>>>>>> negotiation protocol for the kernel to know when userspace has done
>>>>>>>>>>>> the initial renaming of the interface. I would expect the netdev list
>>>>>>>>>>>> to at least provide a general direction for how this can be
>>>>>>>>>>>> solved...
>>>>>>>>> I was just wondering what did you mean when you said
>>>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>>>> was there a proposal udev rejected?
>>>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>>>> Previously someone had said it could be, but I never saw any work or
>>>>>>>> relevant discussion happen in the various userspace communities (e.g.
>>>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>>>> of the issue lies in the kernel, so it makes more sense to start from
>>>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>>>> kernel in order to fix it, then engage the userspace community on
>>>>>>>> the feasibility...
>>>>>>>>
>>>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>>>>>> into, or add new ones.
>>>>>>>>>
>>>>>>>> See attached diagram.
>>>>>>>>
>>>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>>>> to only work with the master failover device.
>>>>>>>>>> Where does this expectation come from?
>>>>>>>>>>
>>>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>>>> predictable interface names. A third-party app built around specifying a
>>>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>>>>
>>>>>>>>>> Specifically, we have pre-canned images that use ethtool to fine-tune VF
>>>>>>>>>> offload settings post boot for specific workloads. Those images won't work
>>>>>>>>>> well if the name keeps changing after just a couple of rounds of live
>>>>>>>>>> migration.
>>>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>>>> master and have it automatically propagated to the slave.
>>>>>>>>>
>>>>>>>>> BTW this is something we should look at IMHO.
>>>>>>>> I was giving a few examples showing that the expectation and assumption that
>>>>>>>> user/admin scripts only deal with the master failover device is incorrect. It
>>>>>>>> has never been taken good care of, although I did try to emphasize it from
>>>>>>>> the very beginning.
>>>>>>>>
>>>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>>>> the slave is the key goal of the 1-netdev model. However, what I am seeking
>>>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>>>> scheme would take time to implement, while I'm trying to find a way to
>>>>>>>> fix this particular naming problem under 3-netdev.
>>>>>>>>
>>>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Siwei
>>>>>>>>> Your scripts would not work at all then, right?
>>>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>>>> migrate-able. We would not flag them as live migrate-able until this ethtool
>>>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>>>> eventually emerges upstream.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>>>>> -Siwei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>   net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>>>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>>>>>> (standby virtio-net and net_failover              |                              |
>>>>>>>> devices created and initialized,                  |                              |
>>>>>>>> i.e. virtnet_probe()->                            |                              |
>>>>>>>>        net_failover_create()                      |                              |
>>>>>>>> was done.)                                        |                              |
>>>>>>>>                                                   |                              |
>>>>>>>>                                                   |  runs `ifup ens3' ->         |
>>>>>>>>                                                   |    ip link set dev ens3 up   |
>>>>>>>> net_failover_open()                               |                              |
>>>>>>>>   dev_open(virtnet_dev)                           |                              |
>>>>>>>>     virtnet_open(virtnet_dev)                     |                              |
>>>>>>>>   netif_carrier_on(failover_dev)                  |                              |
>>>>>>>>   ...                                             |                              |
>>>>>>>>                                                   |                              |
>>>>>>>> (VF hot plugged in)                               |                              |
>>>>>>>> ixgbevf_probe()                                   |                              |
>>>>>>>>  register_netdev(ixgbevf_netdev)                  |                              |
>>>>>>>>   netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>>>>>    kobject_add(ixgbevf_dev)                       |                              |
>>>>>>>>     device_add(ixgbevf_dev)                       |                              |
>>>>>>>>      kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>>>>>       netlink_broadcast()                         |                              |
>>>>>>>>   ...                                             |                              |
>>>>>>>>   call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>>>>>    failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>>>>>     failover_slave_register(ixgbevf_netdev)       |                              |
>>>>>>>>      net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>>>>>       dev_open(ixgbevf_netdev)                    |                              |
>>>>>>>>                                                   |                              |
>>>>>>>>                                                   |                              |
>>>>>>>>                                                   |                              |   received ADD uevent from netlink fd
>>>>>>>>                                                   |                              |   ...
>>>>>>>>                                                   |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>>>>>                                                   |                              |   (decided to rename 'eth0')
>>>>>>>>                                                   |                              |     ip link set dev eth0 name ens4
>>>>>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>>>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>>>>>                                                   |                              |
>>>>>>>>
>>>>>>> Given renaming slaves does not work anyway:
>>>>>> I was actually thinking what if we relieve the rename restriction just for
>>>>>> the failover slave? What would the impact be? I think users don't care about
>>>>>> the slave being renamed while it's in use, especially the initial rename.
>>>>>> Thoughts?
>>>>>>
>>>>>>>      would it work if we just
>>>>>>> hard-coded slave names instead?
>>>>>>>
>>>>>>> E.g.
>>>>>>> 1. fail slave renames
>>>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>>>>        and primary to XXnpry
>>>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>>>> may not be present.
>>>>> In this scheme if VF is not there it will be renamed immediately after registration.
>>>> Who will be responsible to rename the slave, the kernel?
>>> That's the idea.
>>>
>>>> Note the master's
>>>> name may or may not come from the userspace. If it comes from the userspace,
>>>> should the userspace daemon change their expectation not to name/rename
>>>> _any_ slaves (today there's no distinction)?
>>> Yes the idea would be to fail renaming slaves.
>> No I was asking about the userspace expectation: whether it should track and
>> detect the lifecycle events of failover slaves and decide what to do. How
>> does it get back to the user specified name if VF is not enslaved (say
>> someone unloads the virtio-net module)?
> When virtio net is removed VF will shortly be removed too.
>
>> As this scheme adds much complexity to the kernel naming convention
>> (currently it's just ethX names) that no userspace can understand.
> Anything that pokes at slaves needs to be specially designed anyway.
> Naming seems like a minor issue.
>
>> Will the
>> change break userspace further?
>>
>> -Siwei
> Didn't you show userspace is already broken. You can't "further
> break it", rename already fails.
It's a race: userspace tends to give the slave a user(space)-desired name
but sometimes fails due to this race. Today, if the failover master is not
up, the rename succeeds anyway. What you proposed, however, prohibits the
user from providing a name in all circumstances, if I understand you
correctly. That's what I meant by breaking userspace further. On the
other hand, you seem to tie the kernel default naming to udev
predictable names, which derive only from recent systemd-udevd,
while many other userspace naming schemes exist.
Users today who deliberately choose to disable predictable naming
(net.ifnames=0 biosdevname=0) and fall back to kernel-provided names
expect the ethX pattern; with this change, admin/user scripts that
match the ethX pattern could potentially break.

IMHO that change is riskier than allowing userspace to change the name
of a failover slave in any case. I would remind everyone that the
target users of net_failover are very specific to the live migration
scenario; they typically don't have the in-depth knowledge to fiddle with
the low-level plumbing and just expect to operate on the master device
directly. I don't have much concern about breaking slave netfilter rules
or the like if we just lift the rename restriction: failover slave
naming is already unreliable, so how can we further break apps that rely
on consistent naming without fixing it in the first place? It could be
just a two-line code change. Any net_failover user who might break due
to this change would have come here and complained about the naming
issue earlier. IOW at the very least, the change below
shouldn't make the current situation any worse.

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1127,7 +1127,8 @@ int dev_change_name(struct net_device *dev, const char *newname)
         BUG_ON(!dev_net(dev));

         net = dev_net(dev);
-       if (dev->flags & IFF_UP)
+       if (dev->flags & IFF_UP &&
+           !(dev->priv_flags & IFF_FAILOVER_SLAVE))
                 return -EBUSY;

         write_seqcount_begin(&devnet_rename_seq);

Thanks,
-Siwei
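
To make the effect of that check concrete, here is a userspace model of
the two rename policies. It is a sketch, not kernel code: struct
model_dev and both functions are hypothetical, and MODEL_FAILOVER_SLAVE
is a placeholder bit rather than the real IFF_FAILOVER_SLAVE value.

```c
#include <errno.h>

#define IFF_UP 0x1 /* same value as the kernel's IFF_UP */

/* Placeholder for this model only; not the kernel's IFF_FAILOVER_SLAVE. */
#define MODEL_FAILOVER_SLAVE 0x1

/* Hypothetical stand-in for the two flag words consulted on rename. */
struct model_dev {
    unsigned int flags;
    unsigned int priv_flags;
};

/* Current behaviour: any device that is up refuses a rename. */
static int rename_check_current(const struct model_dev *dev)
{
    return (dev->flags & IFF_UP) ? -EBUSY : 0;
}

/* Proposed behaviour: a failover slave may be renamed even while up. */
static int rename_check_proposed(const struct model_dev *dev)
{
    if ((dev->flags & IFF_UP) &&
        !(dev->priv_flags & MODEL_FAILOVER_SLAVE))
        return -EBUSY;
    return 0;
}
```

A VF that net_failover has already opened fails the first check with
-EBUSY and passes the second, while an ordinary NIC that is up is
rejected by both.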


>
>>>> How do users know which name to
>>>> trust, depending on which wins the race more often? Say if kernel wants a
>>>> ens3npry name while userspace wants it named as ens4.
>>>>
>>>> -Siwei
>>> With this approach kernel will deny attempts by userspace to rename
>>> slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
>>> will rename both slaves.
>>>
>>> It seems pretty solid to me, the only issue is that in theory userspace
>>> can use a name like XXXnsby for something else. But this seems unlikely.
>>>
>>>
>>>>>> I don't like the idea to delay exposing failover master
>>>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>> I agree, this was not what I meant.
>>>>>


* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28  9:32                                   ` si-wei liu
@ 2019-02-28 14:26                                     ` Michael S. Tsirkin
  2019-03-01  1:30                                       ` si-wei liu
  0 siblings, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28 14:26 UTC (permalink / raw)
  To: si-wei liu
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
> > > Will the
> > > change break userspace further?
> > > 
> > > -Siwei
> > Didn't you show userspace is already broken. You can't "further
> > break it", rename already fails.
> It's a race: userspace tends to give the slave a user(space)-desired name but
> sometimes fails due to this race. Today, if the failover master is not up,
> the rename succeeds anyway. What you proposed, however, prohibits the user from
> providing a name in all circumstances, if I understand you correctly. That's
> what I meant by breaking userspace further. On the other hand, you seem to
> tie the kernel default naming to udev predictable names, which derive
> only from recent systemd-udevd, while many other userspace naming schemes
> exist. Users today who deliberately choose
> to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
> kernel-provided names expect the ethX pattern; with this change,
> admin/user scripts that match the ethX pattern could potentially break.

Whatever crashes with a name not matching ethX will crash on the
standby interface *anyway*.

So I think what you are saying is that someone might have already
written scripts and gotten them to work on v4.17 when STANDBY was
included and these scripts rely on ethX. Now these scripts
will break.

Maybe it is still early enough (just half a year passed) that the
number of these users would be small.  So how about a kernel config
option and maybe a module parameter to rename the primary?  People can
then opt in to the old broken behaviour.

-- 
MST

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28  4:47                                         ` Michael S. Tsirkin
@ 2019-02-28 18:13                                           ` Jakub Kicinski
  2019-02-28 19:36                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 63+ messages in thread
From: Jakub Kicinski @ 2019-02-28 18:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jason Wang,
	liran.alon

On Wed, 27 Feb 2019 23:47:33 -0500, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> > > > Can the users who care about the naming put net_failover into
> > > > "user space will do the bond enslavement" mode, and do the bond
> > > > creation/management themselves from user space (in systemd/ 
> > > > Network Manager) based on the failover flag?    
> > > 
> > > Putting issues of compatibility aside (userspace tends to be confused if
> > > you give it two devices with same MAC), how would you have it work in
> > > practice? Timer based hacks like netvsc where if userspace didn't
> > > respond within X seconds we assume it won't and do everything ourselves?  
> > 
> > Well, what I'm saying is basically if user space knows how to deal with
> > the auto-bonding, we can put aside net_failover for the most part.  It
> > can either be blacklisted or it can have some knob which will
> > effectively disable the auto-enslavement.  
> 
> OK I guess we could add a module parameter to skip this.
> Is this what you mean?

Yup.

> > Auto-bonding capable user space can do the renames, spawn the bond,
> > etc. all by itself.  I'm basically going back to my initial proposal
> > here :)  There is a RedHat bugzilla for the NetworkManager team to do
> > this, but we merged net_failover before those folks got around to
> > implementing it.  
> 
> In particular because there's no policy involved whatsoever
> here so it's just mechanism being pushed up to userspace.
> 
> > IOW if NM/systemd is capable of doing the auto-bonding itself it can
> > disable the kernel mechanism and take care of it all.  If kernel is
> > booted with an old user space which doesn't have capable NM/systemd -
> > net_failover will kick in and do its best.  
> 
> Sure - it's just 2 lines of code, see below.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> But I don't intend to bother until there's actual interest from
> userspace developers to bother. In particular it is not just NM/systemd
> even on Fedora - e.g. you will need to teach dracut to somehow detect
> and handle this - right now it gets confused if there are two devices
> with same MAC addresses.

It is a bit of a chicken-and-egg situation ;)  But users can
just blacklist, too.  Anyway, I think this is far better than module
parameters for twiddling kernel-based interface naming policy.. :S

> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 955b3e76eb8d..dd2b2c370003 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
>  module_param(csum, bool, 0444);
>  module_param(gso, bool, 0444);
>  module_param(napi_tx, bool, 0644);
> +module_param(disable_failover, bool, 0644);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> @@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	virtnet_init_settings(dev);
>  
> -	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
> +		!disable_failover) {
>  		vi->failover = net_failover_create(vi->dev);
>  		if (IS_ERR(vi->failover)) {
>  			err = PTR_ERR(vi->failover);
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28 18:13                                           ` Jakub Kicinski
@ 2019-02-28 19:36                                             ` Michael S. Tsirkin
  2019-02-28 19:56                                               ` Jakub Kicinski
  0 siblings, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28 19:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	virtio-dev, Brandeburg, Jesse, Alexander Duyck, Jason Wang,
	liran.alon

On Thu, Feb 28, 2019 at 10:13:56AM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 23:47:33 -0500, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> > > > > Can the users who care about the naming put net_failover into
> > > > > "user space will do the bond enslavement" mode, and do the bond
> > > > > creation/management themselves from user space (in systemd/ 
> > > > > Network Manager) based on the failover flag?    
> > > > 
> > > > Putting issues of compatibility aside (userspace tends to be confused if
> > > > you give it two devices with same MAC), how would you have it work in
> > > > practice? Timer based hacks like netvsc where if userspace didn't
> > > > respond within X seconds we assume it won't and do everything ourselves?  
> > > 
> > > Well, what I'm saying is basically if user space knows how to deal with
> > > the auto-bonding, we can put aside net_failover for the most part.  It
> > > can either be blacklisted or it can have some knob which will
> > > effectively disable the auto-enslavement.  
> > 
> > OK I guess we could add a module parameter to skip this.
> > Is this what you mean?
> 
> Yup.
> 
> > > Auto-bonding capable user space can do the renames, spawn the bond,
> > > etc. all by itself.  I'm basically going back to my initial proposal
> > > here :)  There is a RedHat bugzilla for the NetworkManager team to do
> > > this, but we merged net_failover before those folks got around to
> > > implementing it.  
> > 
> > In particular because there's no policy involved whatsoever
> > here so it's just mechanism being pushed up to userspace.
> > 
> > > IOW if NM/systemd is capable of doing the auto-bonding itself it can
> > > disable the kernel mechanism and take care of it all.  If kernel is
> > > booted with an old user space which doesn't have capable NM/systemd -
> > > net_failover will kick in and do its best.  
> > 
> > Sure - it's just 2 lines of code, see below.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > But I don't intend to bother until there's actual interest from
> > userspace developers to bother. In particular it is not just NM/systemd
> > even on Fedora - e.g. you will need to teach dracut to somehow detect
> > and handle this - right now it gets confused if there are two devices
> > with same MAC addresses.
> 
> It is a bit of a chicken-and-egg situation ;)  But users can
> just blacklist, too.  Anyway, I think this is far better than module
> parameters

Sorry I'm a bit confused. What is better than what?

> for twiddling kernel-based interface naming policy.. :S

I see your point. But my point is slave names don't really matter, only
master name matters.  So I am not sure there's any policy worth talking
about here.

> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 955b3e76eb8d..dd2b2c370003 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
> >  module_param(csum, bool, 0444);
> >  module_param(gso, bool, 0444);
> >  module_param(napi_tx, bool, 0644);
> > +module_param(disable_failover, bool, 0644);
> >  
> >  /* FIXME: MTU in config. */
> >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > @@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> >  	virtnet_init_settings(dev);
> >  
> > -	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
> > +	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
> > +		!disable_failover) {
> >  		vi->failover = net_failover_create(vi->dev);
> >  		if (IS_ERR(vi->failover)) {
> >  			err = PTR_ERR(vi->failover);
> > 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28 19:36                                             ` Michael S. Tsirkin
@ 2019-02-28 19:56                                               ` Jakub Kicinski
  2019-02-28 20:14                                                 ` Michael S. Tsirkin
  2019-03-01  0:20                                                 ` Siwei Liu
  0 siblings, 2 replies; 63+ messages in thread
From: Jakub Kicinski @ 2019-02-28 19:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	Brandeburg, Jesse, Alexander Duyck, Jason Wang, liran.alon

On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > It is a bit of a chicken-and-egg situation ;)  But users can
> > just blacklist, too.  Anyway, I think this is far better than module
> > parameters  
> 
> Sorry I'm a bit confused. What is better than what?

I mean that blacklisting net_failover, or adding a module param that
disables net_failover, and handling this in user space is better than
trying to solve the renaming at kernel level (either by adding module
params that make the kernel rename devices or by letting user space
change names of running devices if they are slaves).

> > for twiddling kernel-based interface naming policy.. :S  
> 
> I see your point. But my point is slave names don't really matter, only
> master name matters.  So I am not sure there's any policy worth talking
> about here.

Oh yes, I don't disagree with you, but others seem to want to rename
the auto-bonded lower devices.  Which can be done trivially if it was 
a daemon in user space instantiating the auto-bond.  We are just
providing a basic version of auto-bonding in the kernel.  If there are
extra requirements on policy, or naming - the whole thing is better
solved in user space.
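
For illustration only: the first thing such a user space daemon would
have to do is spot the netdevs that belong together. Below is a minimal
sketch, assuming (as elsewhere in this thread) that the failover members
share one MAC address and that the address is read from the per-device
"address" attribute under /sys/class/net. The function name dup_macs and
its optional directory argument are made up for this example and are not
part of any existing tool.

```sh
#!/bin/sh
# dup_macs [ROOT]: print every MAC address used by more than one netdev.
# ROOT defaults to /sys/class/net; passing another directory is a
# hypothetical convenience for trying this out without real devices.
dup_macs() {
    root=${1:-/sys/class/net}
    for dev in "$root"/*; do
        # Skip entries that have no readable "address" attribute.
        [ -r "$dev/address" ] || continue
        cat "$dev/address"
    done | sort | uniq -d
}
```

Devices that legitimately report an all-zero address (loopback, some
tunnels) would show up here as well, so a real implementation would
need to filter those out before acting on the result.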

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28 19:56                                               ` Jakub Kicinski
@ 2019-02-28 20:14                                                 ` Michael S. Tsirkin
  2019-02-28 23:31                                                   ` Jakub Kicinski
  2019-03-01  0:20                                                 ` Siwei Liu
  1 sibling, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-02-28 20:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	Brandeburg, Jesse, Alexander Duyck, Jason Wang, liran.alon

On Thu, Feb 28, 2019 at 11:56:41AM -0800, Jakub Kicinski wrote:
> On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > It is a bit of a chicken-and-egg situation ;)  But users can
> > > just blacklist, too.  Anyway, I think this is far better than module
> > > parameters  
> > 
> > Sorry I'm a bit confused. What is better than what?
> 
> I mean that blacklist net_failover or module param to disable
> net_failover and handle in user space are better than trying to solve
> the renaming at kernel level (either by adding module params that make
> the kernel rename devices or letting user space change names of running
> devices if they are slaves).
> 
> > > for twiddling kernel-based interface naming policy.. :S  
> > 
> > I see your point. But my point is slave names don't really matter, only
> > master name matters.  So I am not sure there's any policy worth talking
> > about here.
> 
> Oh yes, I don't disagree with you, but others seem to want to rename
> the auto-bonded lower devices.  Which can be done trivially if it was 
> a daemon in user space instantiating the auto-bond.  We are just
> providing a basic version of auto-bonding in the kernel.  If there are
> extra requirements on policy, or naming - the whole thing is better
> solved in user space.

OK so it seems that you would be happy with a combination of the module
parameter disabling failover completely and renaming primary in kernel?
Did I get it right?

-- 
MST

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28 20:14                                                 ` Michael S. Tsirkin
@ 2019-02-28 23:31                                                   ` Jakub Kicinski
  0 siblings, 0 replies; 63+ messages in thread
From: Jakub Kicinski @ 2019-02-28 23:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: si-wei liu, Samudrala, Sridhar, Siwei Liu, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	Brandeburg, Jesse, Alexander Duyck, Jason Wang, liran.alon

On Thu, 28 Feb 2019 15:14:55 -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 28, 2019 at 11:56:41AM -0800, Jakub Kicinski wrote:
> > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:  
> > > > It is a bit of a chicken-and-egg situation ;)  But users can
> > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > parameters    
> > > 
> > > Sorry I'm a bit confused. What is better than what?  
> > 
> > I mean that blacklist net_failover or module param to disable
> > net_failover and handle in user space are better than trying to solve
> > the renaming at kernel level (either by adding module params that make
> > the kernel rename devices or letting user space change names of running
> > devices if they are slaves).
> >   
> > > > for twiddling kernel-based interface naming policy.. :S    
> > > 
> > > I see your point. But my point is slave names don't really matter, only
> > > master name matters.  So I am not sure there's any policy worth talking
> > > about here.  
> > 
> > Oh yes, I don't disagree with you, but others seem to want to rename
> > the auto-bonded lower devices.  Which can be done trivially if it was 
> > a daemon in user space instantiating the auto-bond.  We are just
> > providing a basic version of auto-bonding in the kernel.  If there are
> > extra requirements on policy, or naming - the whole thing is better
> > solved in user space.  
> 
> OK so it seems that you would be happy with a combination of the module
> parameter disabling failover completely and renaming primary in kernel?
> Did I get it right?

Not 100%, I'm personally not convinced that renaming primary in the
kernel is okay.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28 19:56                                               ` Jakub Kicinski
  2019-02-28 20:14                                                 ` Michael S. Tsirkin
@ 2019-03-01  0:20                                                 ` Siwei Liu
  2019-03-01  1:05                                                   ` Jakub Kicinski
  1 sibling, 1 reply; 63+ messages in thread
From: Siwei Liu @ 2019-03-01  0:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Michael S. Tsirkin, si-wei liu, Samudrala, Sridhar, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	Brandeburg, Jesse, Alexander Duyck, Jason Wang, liran.alon

On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski <kubakici@wp.pl> wrote:
>
> On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > It is a bit of a chicken-and-egg situation ;)  But users can
> > > just blacklist, too.  Anyway, I think this is far better than module
> > > parameters
> >
> > Sorry I'm a bit confused. What is better than what?
>
> I mean that blacklist net_failover or module param to disable
> net_failover and handle in user space are better than trying to solve
> the renaming at kernel level (either by adding module params that make
> the kernel rename devices or letting user space change names of running
> devices if they are slaves).

Before I was asked to revive this old mail thread, I knew the
discussion could end up somewhere like this. Yes, theoretically
there's a point - basically you don't believe the kernel should take
the risk of fixing the issue, so you pin the hope on a hypothetical
solution that was never actually done and is hard to get done in
reality. It's not too different from saying "hey, what you're asking
for is simply wrong, don't do it! Go back and modify userspace to
create a bond or team instead!"

FWIW I want to emphasize that the debate over the right place to
implement this failover facility, userspace versus kernel, has been
around for almost a decade, and no real work ever happened in
userspace to "standardize" this in the Linux world.  The truth is that
it takes quite a lot of complex work to get it implemented right in
userspace: what Michael mentions about making dracut auto-bonding
aware is just the tip of the iceberg. Basically one would need to
modify all the existing network config tools to work well with this
new auto-bonding concept: handle duplicate MACs, differentiate it
from a regular bond/team, fix the boot time dependencies of network
boot, etc. Moreover, from a cloud provider's perspective it's not a
single distro's effort, and certainly not as simple as saying "just
move it to a daemon in systemd/NM and the work is done". We (Oracle)
have done extensive work in the past year to help align various
userspace components and to work with distro vendors to patch shipped
packages so that they work with the failover 3-netdev model. The work
needed for userspace auto-bonding would be more involved than that,
with quite trivial value (just naming?) in return, so I doubt any
userspace developer could be motivated to do it.

So, simply put, no, we have zero interest in this direction. If
upstream believes this is the final conclusion, I think we can stop
discussing.

Thanks,
-Siwei
>
> > > for twiddling kernel-based interface naming policy.. :S
> >
> > I see your point. But my point is slave names don't really matter, only
> > master name matters.  So I am not sure there's any policy worth talking
> > about here.
>
> Oh yes, I don't disagree with you, but others seem to want to rename
> the auto-bonded lower devices.  Which can be done trivially if it was
> a daemon in user space instantiating the auto-bond.  We are just
> providing a basic version of auto-bonding in the kernel.  If there are
> extra requirements on policy, or naming - the whole thing is better
> solved in user space.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-03-01  0:20                                                 ` Siwei Liu
@ 2019-03-01  1:05                                                   ` Jakub Kicinski
  2019-03-02  0:30                                                     ` Siwei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Jakub Kicinski @ 2019-03-01  1:05 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Michael S. Tsirkin, si-wei liu, Samudrala, Sridhar, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	Brandeburg, Jesse, Alexander Duyck, Jason Wang, liran.alon

On Thu, 28 Feb 2019 16:20:28 -0800, Siwei Liu wrote:
> On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski wrote:
> > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:  
> > > > It is a bit of a chicken-and-egg situation ;)  But users can
> > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > parameters  
> > >
> > > Sorry I'm a bit confused. What is better than what?  
> >
> > I mean that blacklist net_failover or module param to disable
> > net_failover and handle in user space are better than trying to solve
> > the renaming at kernel level (either by adding module params that make
> > the kernel rename devices or letting user space change names of running
> > devices if they are slaves).  
> 
> Before I was asked to revive this old mail thread, I knew the
> discussion could end up with something like this. Yes, theoretically
> there's a point - basically you don't believe kernel should take risk
> in fixing the issue, so you push back the hope to something in
> hypothesis that actually wasn't done and hard to get done in reality.
> It's not too different than saying "hey, what you're asking for is
> simply wrong, don't do it! Go back to modify userspace to create a
> bond or team instead!" FWIW I want to emphasize that the debate for
> what should be the right place to implement this failover facility:
> userspace versus kernel, had been around for almost a decade, and no
> real work ever happened in userspace to "standardize" this in the
> Linux world.

Let me offer you my very subjective opinion of why "no real work ever
happened in user space".  The actors who have a primary interest in
getting the auto-bonding working are HW vendors who are either trying
to convince customers to use SR-IOV, or being pressured by customers to
make SR-IOV easier to consume.  HW vendors hire driver developers, not
user space developers.  So the solution we arrive at is in the kernel
for a non-technical reason (Conway's law, sort of).

$ cd NetworkManager/
$ git log --pretty=format:"%ae" | \
    grep '\(mellanox\|intel\|broadcom\|netronome\)' | sort | uniq -c
     81 andrew.zaborowski@intel.com
      2 David.Woodhouse@intel.com
      2 ismo.puustinen@intel.com
      1 michael.i.doherty@intel.com

Andrew works on WiFi.

I asked the NetworkManager folks to implement this feature last
year when net_failover got dangerously close to getting merged, and
they said they had never been approached with this request before, much
less offered code that solves it.  Unfortunately net_failover was merged
before they got around to it, and they didn't proceed.

So to my knowledge nobody ever tried to solve this in user space.
I don't think net_failover is particularly terrible, or that renaming
of primary in the kernel is the end of the world, but I'd appreciate if
you could point me to efforts to solve it upstream in user space
components, or acknowledge that nobody actually tried that.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-02-28 14:26                                     ` Michael S. Tsirkin
@ 2019-03-01  1:30                                       ` si-wei liu
  2019-03-01 13:27                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 63+ messages in thread
From: si-wei liu @ 2019-03-01  1:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon



On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
>>>> Will the
>>>> change break userspace further?
>>>>
>>>> -Siwei
>>> Didn't you show userspace is already broken. You can't "further
>>> break it", rename already fails.
>> It's a race, userspace tends to give slave a user(space) desired name but
>> sometimes may fail due to this race. Today if failover master is not up,
>> rename would succeed anyway. While what you proposed prohibits user from
>> providing a name in all circumstances if I understand you correctly. That's
>> what I meant of breaking userspace further. On the other hand, you seem to
>> tighten the kernel default naming to udev predictable names, which is
>> derived from only recent systemd-udevd, while there exists many possible
>> userspace naming schemes out of that. Users today who deliberately chooses
>> to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
>> kernel provided names would expect the ethX pattern, with this change
>> admin/user scripts which matches the ethX pattern could potentially break.
> Whatever crashes with a name not matching ethX will crash on the
> standby interface *anyway*.
With udev predictable naming disabled they should not. It's not hard for
a user to look up a device attribute and persist the name in a
consistent and reliable way.
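
As a rough illustration of identifying devices by attribute rather than
by name pattern, here is a minimal sketch. It assumes that sysfs exposes
an "address" attribute per netdev and that an enslaved netdev carries a
"master" symlink (this is the case for bonding slaves; it is assumed
here to hold for net_failover slaves as well). The function name
ifaces_by_mac and its optional directory argument are hypothetical.

```sh
#!/bin/sh
# ifaces_by_mac MAC [ROOT]: list the netdevs whose address equals MAC and
# report whether each one is enslaved ("slave") or not ("top").
# MAC is compared as a plain string, so pass it in lowercase, which is
# how sysfs prints it. ROOT defaults to /sys/class/net.
ifaces_by_mac() {
    mac=$1
    root=${2:-/sys/class/net}
    for dev in "$root"/*; do
        # Skip entries that have no readable "address" attribute.
        [ -r "$dev/address" ] || continue
        [ "$(cat "$dev/address")" = "$mac" ] || continue
        if [ -L "$dev/master" ]; then
            echo "${dev##*/} slave"
        else
            echo "${dev##*/} top"
        fi
    done
}
```

A script built this way does not care whether the kernel or udev named
the devices eth0, ens4 or anything else; it only relies on the MAC and
on the presence of the master link.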

>
> So I think what you are saying is that someone might have already
> written scripts and gotten them to work on v4.17 when STANDBY was
> included and these scripts rely on ethX. Now these scripts
> will break.
The controversial part is the new kernel naming pattern. Initially I
thought there shouldn't be such crazy scripts relying on the pattern,
but when I worked on cloud-init I realized that there's already a lot
of software making assumptions about the 'eth0' name. In the past I've
seen random scripts that parse the ethX name and assume (incorrectly)
that the name ends with digits, or even that the digits and names are
mapped 1:1. Of course, you can say these are bugs in the scripts
themselves.

Anyway, I'll let others on netdev comment on this new scheme; maybe
the concern is mine alone. The good part of your proposal is that we
get consistent slave names, which still play a role until we move
towards making slave names less relevant, i.e. ideally a 1-netdev
model. I think we both agree that the master matters more than the
slave names.

>
> Maybe it is still early enough (just half a year passed) that the
> number of these users would be small.  So how about a kernel config
> option and maybe a module parameter to rename the primary?  People can
> then opt in to the old broken behaviour.
If I could, I would ask why a similar opt-in (kernel config or module
parameter) couldn't be implemented to lift the rename restriction on
slaves, for net_failover in particular. I feel this rename restriction
exists more for historical reasons than anything else, while
net_failover is a comparatively new type of link: we are still
designing the use cases it should support, and can shape it to
whatever fits. My personal view is that "the slave can't be renamed
while the master is running" is just an implementation detail that has
been incorrectly exposed to userspace apps for many years. It's old
behavior with historical reasons for sure, but I don't think it has to
apply to net_failover.

(FWIW, as a former bond maintainer for another OS: we lifted the
rename restriction on slaves 13 years ago, and not a single complaint
or issue was ever raised because of this change over the years, neither
from customers with an installed base of tens of millions, nor from the
FOSS software running on top. Of course, Linux is different so that
experience doesn't count.)

Thanks,
-Siwei



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-03-01  1:30                                       ` si-wei liu
@ 2019-03-01 13:27                                         ` Michael S. Tsirkin
  2019-03-01 20:55                                           ` si-wei liu
  0 siblings, 1 reply; 63+ messages in thread
From: Michael S. Tsirkin @ 2019-03-01 13:27 UTC (permalink / raw)
  To: si-wei liu
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon

On Thu, Feb 28, 2019 at 05:30:56PM -0800, si-wei liu wrote:
> 
> 
> On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
> > > > > Will the
> > > > > change break userspace further?
> > > > > 
> > > > > -Siwei
> > > > Didn't you show userspace is already broken. You can't "further
> > > > break it", rename already fails.
> > > It's a race, userspace tends to give slave a user(space) desired name but
> > > sometimes may fail due to this race. Today if failover master is not up,
> > > rename would succeed anyway. While what you proposed prohibits user from
> > > providing a name in all circumstances if I understand you correctly. That's
> > > what I meant of breaking userspace further. On the other hand, you seem to
> > > tighten the kernel default naming to udev predictable names, which is
> > > derived from only recent systemd-udevd, while there exists many possible
> > > userspace naming schemes out of that. Users today who deliberately chooses
> > > to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
> > > kernel provided names would expect the ethX pattern, with this change
> > > admin/user scripts which matches the ethX pattern could potentially break.
> > Whatever crashes with a name not matching ethX will crash on the
> > standby interface *anyway*.
> With udev predictable naming disabled they should not. It's not hard for
> user to look for device attribute to persistent the name well, in a
> consistent and reliable way.

Well that's special code for failover already. So far we just
taught userspace to skip renaming slave interfaces.

> > 
> > So I think what you are saying is that someone might have already
> > written scripts and gotten them to work on v4.17 when STANDBY was
> > included and these scripts rely on ethX. Now these scripts
> > will break.
> The controversial part is the new kernel naming pattern. Initially I thought
> there shouldn't be such crazy scripts relying on the pattern, but when I
> worked on cloud-init it I realized that there's already a lot of software
> taking assumption around the 'eth0' name. In the past I've seen random
> scripts that parses the ethX name assumes (incorrectly) the name ends up
> with digits, or even the digits and name are 1:1 mapped. Of course, you can
> say these are bugs in scripts themselves.

No what I say is that they will crash on rename of standby too.

> Anyway, I'll let others in the netdev to comment on this new scheme, maybe
> that's the concern of merely myself. The good part of your proposal is that
> we can get consistent slave name, which still plays its role until we move
> towards making slave names less relevant, i.e. ideally a 1-netdev model. I
> think we both agree that the master matters more than the slave names.
> > 
> > Maybe it is still early enough (just half a year passed) that the
> > number of these users would be small.  So how about a kernel config
> > option and maybe a module parameter to rename the primary?  People can
> > then opt in to the old broken behaviour.
> Were I could I would ask  why a similar opt-in (kernel config or module
> parameter) couldn't be implemented to open up the rename restriction on
> slave, net_failover in particular. What I felt about this rename restriction
> was more because of historical reason than anything else, while net_failover
> is comparatively a new type of link that we are now designing proper use
> case it should support, and can get it shaped to whatever it fits. My
> personal view is that the slave can't be renamed when master is running is
> just implementation details that got incorrectly exposed to userspace apps
> for many years. It's old behavior with historical reason for sure, but I
> don't think this applies to net_failover.
> 
> (FWIW as one previous bond maintainer for another OS, we relieved the rename
> restriction slaves 13 year ago, while no single complaint or issue was ever
> raised because of this change over the years, neither from the customers of
> tens of millions of installation base, nor the FOSS software running atop.
> Of course, Linux is different so that experience doesn't count.)
> 
> Thanks,
> -Siwei
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-03-01 13:27                                         ` Michael S. Tsirkin
@ 2019-03-01 20:55                                           ` si-wei liu
  0 siblings, 0 replies; 63+ messages in thread
From: si-wei liu @ 2019-03-01 20:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Siwei Liu, Jiri Pirko, Stephen Hemminger,
	David Miller, Netdev, virtualization, virtio-dev, Brandeburg,
	Jesse, Alexander Duyck, Jakub Kicinski, Jason Wang, liran.alon



On 3/1/2019 5:27 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 28, 2019 at 05:30:56PM -0800, si-wei liu wrote:
>>
>> On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
>>>>>> Will the
>>>>>> change break userspace further?
>>>>>>
>>>>>> -Siwei
>>>>> Didn't you show userspace is already broken. You can't "further
>>>>> break it", rename already fails.
>>>> It's a race, userspace tends to give slave a user(space) desired name but
>>>> sometimes may fail due to this race. Today if failover master is not up,
>>>> rename would succeed anyway. While what you proposed prohibits user from
>>>> providing a name in all circumstances if I understand you correctly. That's
>>>> what I meant of breaking userspace further. On the other hand, you seem to
>>>> tighten the kernel default naming to udev predictable names, which is
>>>> derived from only recent systemd-udevd, while there exists many possible
>>>> userspace naming schemes out of that. Users today who deliberately chooses
>>>> to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
>>>> kernel provided names would expect the ethX pattern, with this change
>>>> admin/user scripts which matches the ethX pattern could potentially break.
>>> Whatever crashes with a name not matching ethX will crash on the
>>> standby interface *anyway*.
>> With udev predictable naming disabled they should not. It's not hard for
>> user to look for device attribute to persistent the name well, in a
>> consistent and reliable way.
> Well that's special code for failover already. So far we just
> taught userspace to skip renaming slave interfaces.
I think today kernel-provided names never collide, e.g. master gets
eth0 then standby will get eth1. It's the userspace-specified name that
suffers from name clashes, mostly with the default predictable naming
pattern from systemd-udevd.

The kernel should not assume there's only one naming pattern in
userspace. Users can customize naming with udev rules in /etc which do
not conform to the default udevd pattern at all. It's a pretty
legitimate use case.


>
>>> So I think what you are saying is that someone might have already
>>> written scripts and gotten them to work on v4.17 when STANDBY was
>>> included and these scripts rely on ethX. Now these scripts
>>> will break.
>> The controversial part is the new kernel naming pattern. Initially I thought
>> there shouldn't be such crazy scripts relying on the pattern, but when I
>> worked on cloud-init it I realized that there's already a lot of software
>> taking assumption around the 'eth0' name. In the past I've seen random
>> scripts that parses the ethX name assumes (incorrectly) the name ends up
>> with digits, or even the digits and name are 1:1 mapped. Of course, you can
>> say these are bugs in scripts themselves.
> No what I say is that they will crash on rename of standby too.
What do you mean by crashing on standby rename? First off, if master is
not up, rename on both standby and primary should not fail. If master is
up, the standby should be named before userspace brings up the master,
so what's the issue you are talking about?

Thanks,
-Siwei

>
>> Anyway, I'll let others in the netdev to comment on this new scheme, maybe
>> that's the concern of merely myself. The good part of your proposal is that
>> we can get consistent slave name, which still plays its role until we move
>> towards making slave names less relevant, i.e. ideally a 1-netdev model. I
>> think we both agree that the master matters more than the slave names.
>>> Maybe it is still early enough (just half a year passed) that the
>>> number of these users would be small.  So how about a kernel config
>>> option and maybe a module parameter to rename the primary?  People can
>>> then opt in to the old broken behaviour.
>> Were I could I would ask  why a similar opt-in (kernel config or module
>> parameter) couldn't be implemented to open up the rename restriction on
>> slave, net_failover in particular. What I felt about this rename restriction
>> was more because of historical reason than anything else, while net_failover
>> is comparatively a new type of link that we are now designing proper use
>> case it should support, and can get it shaped to whatever it fits. My
>> personal view is that the slave can't be renamed when master is running is
>> just implementation details that got incorrectly exposed to userspace apps
>> for many years. It's old behavior with historical reason for sure, but I
>> don't think this applies to net_failover.
>>
>> (FWIW as one previous bond maintainer for another OS, we relieved the rename
>> restriction slaves 13 year ago, while no single complaint or issue was ever
>> raised because of this change over the years, neither from the customers of
>> tens of millions of installation base, nor the FOSS software running atop.
>> Of course, Linux is different so that experience doesn't count.)
>>
>> Thanks,
>> -Siwei
>>



* Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
  2019-03-01  1:05                                                   ` Jakub Kicinski
@ 2019-03-02  0:30                                                     ` Siwei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Siwei Liu @ 2019-03-02  0:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Michael S. Tsirkin, si-wei liu, Samudrala, Sridhar, Jiri Pirko,
	Stephen Hemminger, David Miller, Netdev, virtualization,
	Brandeburg, Jesse, Alexander Duyck, Jason Wang, liran.alon

On Thu, Feb 28, 2019 at 5:05 PM Jakub Kicinski <kubakici@wp.pl> wrote:
>
> On Thu, 28 Feb 2019 16:20:28 -0800, Siwei Liu wrote:
> > On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski wrote:
> > > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > > > It is a bit of a chicken-or-egg situation ;)  But users can
> > > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > > parameters
> > > >
> > > > Sorry I'm a bit confused. What is better than what?
> > >
> > > I mean that blacklist net_failover or module param to disable
> > > net_failover and handle in user space are better than trying to solve
> > > the renaming at kernel level (either by adding module params that make
> > > the kernel rename devices or letting user space change names of running
> > > devices if they are slaves).
> >
> > Before I was asked to revive this old mail thread, I knew the
> > discussion could end up with something like this. Yes, theoretically
> > there's a point - basically you don't believe kernel should take risk
> > in fixing the issue, so you push back the hope to something in
> > hypothesis that actually wasn't done and hard to get done in reality.
> > It's not too different than saying "hey, what you're asking for is
> > simply wrong, don't do it! Go back to modify userspace to create a
> > bond or team instead!" FWIW I want to emphasize that the debate for
> > what should be the right place to implement this failover facility:
> > userspace versus kernel, had been around for almost a decade, and no
> > real work ever happened in userspace to "standardize" this in the
> > Linux world.
>
> Let me offer you my very subjective opinion of why "no real work ever
> happened in user space".  The actors who have primary interest to get
> the auto-bonding working are HW vendors trying to either convince
> customers to use SR-IOV, or being pressured by customers to make SR-IOV
> easier to consume.  HW vendors hire driver developers, not user space
> developers.  So the solution we arrive at is in the kernel for a non
> technical reason (Conway's law, sort of).
>
> $ cd NetworkManager/
> $ git log --pretty=format:"%ae" | \
>     grep '\(mellanox\|intel\|broadcom\|netronome\)' | sort | uniq -c
>      81 andrew.zaborowski@intel.com
>       2 David.Woodhouse@intel.com
>       2 ismo.puustinen@intel.com
>       1 michael.i.doherty@intel.com
>
> Andrew works on WiFi.
>

I'm sorry, but we don't use NetworkManager in our cloud images at all.
We suffered from lots of problems when booting from a remote iSCSI disk
with NetworkManager enabled, and it looks like those issues are still
there, while that's not something (my subjective impression) a network
config tool mainly targeting desktop and WiFi users ever cares about.
At least it's a sign of a lack of sufficient testing there.

From a cloud service provider's perspective, we always prefer a single
central solution over speaking to various distro vendors, each with
their own network daemons/config tools and thus different solutions.
It's hard to coordinate all efforts in one place. From my personal
perspective, the in-kernel auto-slave solution is in no way technically
inferior to any userspace implementation, and every major OS/cloud
provider chooses to implement this in-kernel model for the same reason.
I don't want to argue more about whether there's value in net_failover
being in the Linux kernel; given that it's already there, I think it's
better to move on.

We have done extensive work in reporting (actually, fixing them
internally before posting) issues to the dracut, udev,
initramfs-tools, and cloud-init communities. Although the 3-netdev
model is claimed to be transparent to userspace in general, the reality
is the opposite: the effort is no different from bringing up a new type
of virtual bond, rather than what any existing userspace tool would
otherwise expect for a regular physical netdev. If there's ever a
concern about breaking userspace, I bet no one has ever tried to start
using it. If they did, they'd know what I am saying. The dup MAC
address setting and plugging order are so totally new to userspace that
none of the userspace tools know how to plumb the failover interface in
a proper way without being fixed one way or another.

-Siwei

> I have asked the NetworkManager folks to implement this feature last
> year when net_failover got dangerously close to getting merged, and
> they said they were never approached with this request before, much less
> offered code that solve it.  Unfortunately before they got around to it
> net_failover was merged already, and they didn't proceed.
>
> So to my knowledge nobody ever tried to solve this in user space.
> I don't think net_failover is particularly terrible, or that renaming
> of primary in the kernel is the end of the world, but I'd appreciate if
> you could point me to efforts to solve it upstream in user space
> components, or acknowledge that nobody actually tried that.


end of thread

Thread overview: 63+ messages
2018-04-10 18:59 [RFC PATCH net-next v6 0/4] Enable virtio_net to act as a backup for a passthru device Sridhar Samudrala
2018-04-10 18:59 ` [RFC PATCH net-next v6 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit Sridhar Samudrala
2018-04-10 18:59 ` [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module Sridhar Samudrala
2018-04-11 15:51   ` Jiri Pirko
2018-04-11 19:13     ` Samudrala, Sridhar
2018-04-18  9:25       ` Jiri Pirko
2018-04-18 18:43         ` Samudrala, Sridhar
2018-04-18 19:13           ` Jiri Pirko
2018-04-18 19:46             ` Michael S. Tsirkin
2018-04-18 20:32               ` Jiri Pirko
2018-04-18 22:46                 ` Samudrala, Sridhar
2018-04-19  6:35                   ` Jiri Pirko
2018-04-19  4:08                 ` Michael S. Tsirkin
2018-04-19  7:22                   ` Jiri Pirko
2018-04-10 18:59 ` [RFC PATCH net-next v6 3/4] virtio_net: Extend virtio to use VF datapath when available Sridhar Samudrala
2018-04-10 18:59 ` [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework Sridhar Samudrala
2018-04-10 21:26   ` Stephen Hemminger
2018-04-10 22:56     ` Samudrala, Sridhar
2018-04-10 23:28     ` Michael S. Tsirkin
2018-04-10 23:44       ` Siwei Liu
2018-04-10 23:59         ` Stephen Hemminger
2018-04-11  7:50       ` Jiri Pirko
2018-04-11  1:21     ` Michael S. Tsirkin
2018-04-11  7:53     ` Jiri Pirko
2019-02-22  1:14       ` net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework) Siwei Liu
2019-02-22  1:39         ` Michael S. Tsirkin
2019-02-22  3:33           ` [virtio-dev] " si-wei liu
     [not found]             ` <91d4cbb1-be7a-b53c-6b2a-99bef07e7c53@intel.com>
2019-02-22  7:55               ` si-wei liu
2019-02-22 12:58                 ` Rob Miller
2019-02-22 15:14                 ` Michael S. Tsirkin
2019-02-26  0:58                   ` si-wei liu
2019-02-26  1:39                     ` Stephen Hemminger
2019-02-26  2:05                       ` Michael S. Tsirkin
2019-02-27  0:49                         ` si-wei liu
2019-02-26  2:08                     ` Michael S. Tsirkin
2019-02-27  0:17                       ` si-wei liu
2019-02-27 21:57                         ` Stephen Hemminger
2019-02-27 22:30                           ` si-wei liu
2019-02-27 22:38                         ` Michael S. Tsirkin
2019-02-27 23:34                           ` si-wei liu
2019-02-27 23:50                             ` Michael S. Tsirkin
2019-02-28  0:00                               ` Liran Alon
2019-02-28  0:03                               ` Stephen Hemminger
2019-02-28  0:38                                 ` Michael S. Tsirkin
2019-02-28  0:38                               ` si-wei liu
2019-02-28  0:41                                 ` Michael S. Tsirkin
2019-02-28  0:52                                   ` Jakub Kicinski
2019-02-28  1:26                                     ` Michael S. Tsirkin
2019-02-28  1:52                                       ` Jakub Kicinski
2019-02-28  4:47                                         ` Michael S. Tsirkin
2019-02-28 18:13                                           ` Jakub Kicinski
2019-02-28 19:36                                             ` Michael S. Tsirkin
2019-02-28 19:56                                               ` Jakub Kicinski
2019-02-28 20:14                                                 ` Michael S. Tsirkin
2019-02-28 23:31                                                   ` Jakub Kicinski
2019-03-01  0:20                                                 ` Siwei Liu
2019-03-01  1:05                                                   ` Jakub Kicinski
2019-03-02  0:30                                                     ` Siwei Liu
2019-02-28  9:32                                   ` si-wei liu
2019-02-28 14:26                                     ` Michael S. Tsirkin
2019-03-01  1:30                                       ` si-wei liu
2019-03-01 13:27                                         ` Michael S. Tsirkin
2019-03-01 20:55                                           ` si-wei liu
