* [RFC PATCH 0/3] Userspace compatible driver model for virtio_bypass
@ 2018-04-01  9:13 ` Si-Wei Liu
  0 siblings, 0 replies; 109+ messages in thread
From: Si-Wei Liu @ 2018-04-01  9:13 UTC (permalink / raw)
  To: mst, jiri, stephen, alexander.h.duyck, davem, jesse.brandeburg,
	kubakici, jasowang, sridhar.samudrala, netdev, virtualization,
	virtio-dev

This RFC patch series attempts to hide the lower netdevs for virtio_bypass
from userspace, and to tighten the association between virtio_bypass and the
lower passthrough netdev to be enslaved by binding explicitly to a specific
device identifier. This brings in the main merit of the 2-netdev driver model
from netvsc (userspace compatibility) while keeping the internal
implementation a 3-netdev model. No features such as XDP are lost, and
continued improvements in performance and features, thanks to the bypass
nature of the 3-netdev model, remain possible in the long run.

As said, this change should make code sharing between netvsc and virtio_bypass
easier and more approachable, as I think the concerns Stephen pointed out were
mainly about userspace compatibility, and not about the hardware offloading
tunables on the VF slave that have to be exposed to netvsc users today, if I'm
not mistaken.

Jiri expressed concerns about the weak check that depends on the MAC address
only during enslavement; we really need stricter checks than that. With
the change requiring the user to explicitly specify the passthrough device
to which virtio_bypass is expected to be bound, virtio_bypass now matches the
device based on the PCI slot info in the device tree, rather than relying on
the MAC address alone. In addition, the PCI slot info passed in will help
udevd name the virtio_bypass interface specifically, making a transparent and
automatic upgrade from an existing VF setup to virtio_bypass possible (expect
a udevd patch to come later on).
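
To make the stricter check concrete, here is a minimal sketch of a pairing
predicate that requires both the PCI location and the MAC address to match.
The struct and function names are hypothetical and do not appear in this
series; the packed 16-bit location is an assumption that follows the layout
used by the bsf2backup field in patch 1/3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical identity of a NIC as seen by the pairing logic. */
struct nic_ident {
    uint16_t bsf;               /* (bus << 8) | (slot << 3) | function */
    uint8_t  mac[ETH_ALEN];
};

/* Pair only if the candidate sits at the expected PCI location AND
 * carries the expected MAC; a MAC match alone is not sufficient. */
static bool bypass_may_enslave(const struct nic_ident *wanted,
                               const struct nic_ident *cand)
{
    assert(wanted && cand);

    if (wanted->bsf != cand->bsf)
        return false;
    return memcmp(wanted->mac, cand->mac, ETH_ALEN) == 0;
}
```

With such a predicate, a second device configured with the same MAC but
plugged into a different slot would be rejected.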

Since I'd like to get the discussion going as early as possible, this series
shows only a minimal set of essential changes. Although not included in the
series, I would like to note up front that a few necessary pieces must be built
on the assumption of hidden lower netdevs and explicit binding, such as
sysfs interfaces for udev's naming of the virtio_bypass interface, and passing
HW offloading configs down to the active lower slave and making them
persistent across live migration, and so on.

The current patch series is based on Sridhar's v4 patch "Enable virtio to act
as a backup for a passthru device", but I can resync to his upcoming
version once it is posted.


Si-Wei Liu (1):
  qemu: virtio-bypass should explicitly bind to a passthrough device

 hw/net/virtio-net.c                         | 29 ++++++++++++-
 include/hw/pci/pci.h                        |  3 ++
 include/hw/virtio/virtio-net.h              |  2 +
 include/standard-headers/linux/virtio_net.h |  1 +
 qdev-monitor.c                              | 64 +++++++++++++++++++++++++++++
 5 files changed, 97 insertions(+), 2 deletions(-)

Si-Wei Liu (2):
  netdev: kernel-only IFF_HIDDEN netdevice
  virtio_net: make lower netdevs for virtio_bypass hidden

 drivers/net/virtio_net.c        | 159 +++++++++++++++++++++--
 include/linux/netdevice.h       |  12 ++
 include/net/net_namespace.h     |   2 +
 include/uapi/linux/virtio_net.h |   2 +
 net/core/dev.c                  | 281 +++++++++++++++++++++++++++++++++++-----
 net/core/net_namespace.c        |   1 +
 6 files changed, 411 insertions(+), 46 deletions(-)

-- 
1.8.3.1

* [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device
  2018-04-01  9:13 ` [virtio-dev] " Si-Wei Liu
@ 2018-04-01  9:13   ` Si-Wei Liu
  -1 siblings, 0 replies; 109+ messages in thread
From: Si-Wei Liu @ 2018-04-01  9:13 UTC (permalink / raw)
  To: mst, jiri, stephen, alexander.h.duyck, davem, jesse.brandeburg,
	kubakici, jasowang, sridhar.samudrala, netdev, virtualization,
	virtio-dev

The new backup option allows the guest virtio-bypass driver to explicitly
bind to a corresponding passthrough instance, which is identified by
the <bus>:<slot>.<function> notation. The MAC address is still validated
in the guest, but it is no longer the only criterion for pairing two devices.
A MAC address is more a matter of network configuration than a (virtual)
device identifier; the latter needs to be unique as part of the
VM configuration. Technically it is possible for more than one device in
the guest to be configured with the same MAC, with each belonging
to a completely isolated network.
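
As a rough illustration of how the <bus>:<slot>.<function> triple fits into
the single 16-bit bsf2backup config field (bus number in the high byte, devfn
in the low byte, as the diff below computes it), here is a small sketch. The
helper names are made up for illustration and are not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* devfn: 5-bit slot in bits 7..3, 3-bit function in bits 2..0 */
static inline uint8_t bsf_devfn(unsigned slot, unsigned func)
{
    return (uint8_t)(((slot & 0x1F) << 3) | (func & 0x7));
}

/* Pack bus number (high byte) and devfn (low byte) into 16 bits. */
static inline uint16_t bsf_pack(unsigned bus, unsigned slot, unsigned func)
{
    assert(bus <= 0xFF);
    return (uint16_t)(((bus & 0xFF) << 8) | bsf_devfn(slot, func));
}

/* Inverse helpers for the guest side of the illustration. */
static inline unsigned bsf_bus(uint16_t bsf)  { return bsf >> 8; }
static inline unsigned bsf_slot(uint16_t bsf) { return (bsf >> 3) & 0x1F; }
static inline unsigned bsf_func(uint16_t bsf) { return bsf & 0x7; }
```

For instance, bus number 2, slot 3, function 0 packs to 0x0218.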

The direct benefit of the explicit binding (or pairing) is that it
prevents improper binding or malicious pairing that would otherwise be
possible through the flexibility of guest MAC address config.

More importantly, the indicator of the guest device location can also
be used to reserve the slot for the corresponding passthrough
device in the PCI bus tree if that device is temporarily absent and
yet to be hot plugged into the VM. We need to preserve the slot for
the passthrough device to which virtio-bypass is bound, such that once
it is unplugged as a result of migration, the slot cannot be
occupied by other devices, and any userspace application that
assumes a consistent device location in the bus tree still works.

The usage for the backup option is as follows:

   -device virtio-net-pci, ... ,backup=<bus-id>:<slot>[.<function>]

For example:

   -device virtio-net-pci,id=net0,mac=52:54:00:e0:58:80,backup=pci.2:0x3
   ...
   -device vfio-pci,host=02:10.1,id=hostdev0,bus=pci.2,addr=0x3
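
The option format above can be sketched as a standalone parser. This is not
the QEMU code in this patch (which relies on get_opt_name(),
parse_option_number() and qbus_find()); it is a plain C approximation using
only the standard library. The struct and function names are hypothetical,
resolving the bus id to a bus number is left out, and, as an assumption of
the sketch, it is stricter than the patch in that an empty slot or a trailing
'.' is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type, for illustration only. */
struct backup_id {
    char    bus_id[64];  /* e.g. "pci.2" */
    uint8_t devfn;       /* (slot << 3) | function */
};

/* Parse a number (decimal, 0x hex or octal) that must span [s, end). */
static int parse_num(const char *s, const char *end, unsigned max,
                     unsigned *out)
{
    char *stop;
    unsigned long v;

    if (s == end)
        return -1;                      /* empty field */
    errno = 0;
    v = strtoul(s, &stop, 0);
    if (errno || stop != end || v > max)
        return -1;
    *out = (unsigned)v;
    return 0;
}

/* Parse "<bus-id>:<slot>[.<function>]"; returns 0 on success, -1 otherwise. */
static int parse_backup_id(const char *id, struct backup_id *out)
{
    const char *colon, *dot, *end;
    unsigned slot, func = 0;
    size_t buslen;

    assert(id && out);

    colon = strchr(id, ':');
    if (!colon || colon == id)
        return -1;                      /* missing ':' or empty bus id */
    buslen = (size_t)(colon - id);
    if (buslen >= sizeof(out->bus_id))
        return -1;

    end = id + strlen(id);
    /* The bus id itself may contain '.', so search only after the ':'. */
    dot = strchr(colon + 1, '.');
    if (parse_num(colon + 1, dot ? dot : end, 0x1F, &slot))
        return -1;
    if (dot && parse_num(dot + 1, end, 0x7, &func))
        return -1;

    memcpy(out->bus_id, id, buslen);
    out->bus_id[buslen] = '\0';
    out->devfn = (uint8_t)((slot << 3) | func);
    return 0;
}
```

Parsing "pci.2:0x3" with this sketch yields bus id "pci.2" and devfn 0x18.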

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
---
 hw/net/virtio-net.c                         | 29 ++++++++++++-
 include/hw/pci/pci.h                        |  3 ++
 include/hw/virtio/virtio-net.h              |  2 +
 include/standard-headers/linux/virtio_net.h |  1 +
 qdev-monitor.c                              | 64 +++++++++++++++++++++++++++++
 5 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index de31b1b98c..a36b169958 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -26,6 +26,7 @@
 #include "qapi-event.h"
 #include "hw/virtio/virtio-access.h"
 #include "migration/misc.h"
+#include "hw/pci/pci.h"
 
 #define VIRTIO_NET_VM_VERSION    11
 
@@ -61,6 +62,8 @@ static VirtIOFeature feature_sizes[] = {
      .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
     {.flags = 1 << VIRTIO_NET_F_MTU,
      .end = endof(struct virtio_net_config, mtu)},
+    {.flags = 1 << VIRTIO_NET_F_BACKUP,
+     .end = endof(struct virtio_net_config, bsf2backup)},
     {}
 };
 
@@ -84,10 +87,24 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 {
     VirtIONet *n = VIRTIO_NET(vdev);
     struct virtio_net_config netcfg;
+    uint16_t busdevfn;
 
     virtio_stw_p(vdev, &netcfg.status, n->status);
     virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
     virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
+    if (n->net_conf.backup) {
+        /* Below function should not fail as the backup ID string
+         * has been validated when device is being realized.
+         * Until guest starts to run we can get to the
+         * effective bus num in use from pci config space where
+         * guest had written to.
+         */
+        pci_get_busdevfn_by_id(n->net_conf.backup, &busdevfn,
+                               NULL, NULL);
+        busdevfn <<= 8;
+        busdevfn |= (n->backup_devfn & 0xFF);
+        virtio_stw_p(vdev, &netcfg.bsf2backup, busdevfn);
+    }
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, n->config_size);
 }
@@ -1935,11 +1952,19 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIONet *n = VIRTIO_NET(dev);
     NetClientState *nc;
+    uint16_t bdevfn;
     int i;
 
     if (n->net_conf.mtu) {
         n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
     }
+    if (n->net_conf.backup) {
+        if (pci_get_busdevfn_by_id(n->net_conf.backup, NULL,
+                                   &bdevfn, errp))
+            return;
+        n->backup_devfn = bdevfn;
+        n->host_features |= (0x1 << VIRTIO_NET_F_BACKUP);
+    }
 
     virtio_net_set_config_size(n, n->host_features);
     virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
@@ -2160,8 +2185,8 @@ static Property virtio_net_properties[] = {
     DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
     DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
                      true),
-    DEFINE_PROP_BIT("backup", VirtIONet, host_features,
-                     VIRTIO_NET_F_BACKUP, false),
+    DEFINE_PROP_STRING("backup", VirtIONet, net_conf.backup),
+
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index d8c18c7fa4..dbb910d162 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -431,6 +431,9 @@ PCIDevice *pci_nic_init_nofail(NICInfo *nd, PCIBus *rootbus,
 
 PCIDevice *pci_vga_init(PCIBus *bus);
 
+int pci_get_busdevfn_by_id(const char *id, uint16_t *busnr,
+                           uint16_t *devfn, Error **errp);
+
 static inline PCIBus *pci_get_bus(const PCIDevice *dev)
 {
     return PCI_BUS(qdev_get_parent_bus(DEVICE(dev)));
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index b81b6a4624..276b39f64f 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -38,6 +38,7 @@ typedef struct virtio_net_conf
     uint16_t rx_queue_size;
     uint16_t tx_queue_size;
     uint16_t mtu;
+    char *backup;
 } virtio_net_conf;
 
 /* Maximum packet size we can receive from tap device: header + 64k */
@@ -99,6 +100,7 @@ typedef struct VirtIONet {
     int announce_counter;
     bool needs_vnet_hdr_swap;
     bool mtu_bypass_backend;
+    uint16_t backup_devfn;
 } VirtIONet;
 
 void virtio_net_set_netclient_name(VirtIONet *n, const char *name,
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 65dde3209d..cd936e5521 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -79,6 +79,7 @@ struct virtio_net_config {
 	uint16_t max_virtqueue_pairs;
 	/* Default maximum transmit unit advice */
 	uint16_t mtu;
+	uint16_t bsf2backup;
 } QEMU_PACKED;
 
 /*
diff --git a/qdev-monitor.c b/qdev-monitor.c
index 846238175f..600a81c73e 100644
--- a/qdev-monitor.c
+++ b/qdev-monitor.c
@@ -32,6 +32,8 @@
 #include "qemu/help_option.h"
 #include "qemu/option.h"
 #include "sysemu/block-backend.h"
+#include "hw/pci/pci.h"
+#include "hw/vfio/pci.h"
 #include "migration/misc.h"
 
 /*
@@ -896,6 +898,68 @@ void qmp_device_del(const char *id, Error **errp)
     }
 }
 
+int pci_get_busdevfn_by_id(const char *id, uint16_t *busnr,
+                           uint16_t *devfn, Error **errp)
+{
+    uint16_t busnum = 0, slot = 0, func = 0;
+    const char *pc, *pd, *pe;
+    Error *local_err = NULL;
+    ObjectClass *class;
+    char value[1024];
+    BusState *bus;
+    uint64_t u64;
+
+    if (!(pc = strchr(id, ':'))) {
+        error_setg(errp, "Invalid id: backup=%s, "
+                   "correct format should be backup="
+                   "'<bus-id>:<slot>[.<function>]'", id);
+        return -1;
+    }
+    get_opt_name(value, sizeof(value), id, ':');
+    if (pc != id) {
+        bus = qbus_find(value, errp);
+        if (!bus)
+            return -1;
+
+        class = object_get_class(OBJECT(bus));
+        if (class != object_class_by_name(TYPE_PCI_BUS) &&
+            class != object_class_by_name(TYPE_PCIE_BUS)) {
+            error_setg(errp, "%s is not a device on pci bus", id);
+            return -1;
+        }
+        busnum = (uint16_t)pci_bus_num(PCI_BUS(bus));
+    }
+
+    if (!devfn)
+        goto out;
+
+    pd = strchr(pc, '.');
+    pe = get_opt_name(value, sizeof(value), pc + 1, '.');
+    if (pe != pc + 1) {
+        parse_option_number("slot", value, &u64, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return -1;
+        }
+        slot = (uint16_t)u64;
+    }
+    if (pd && *(pd + 1) != '\0') {
+        parse_option_number("function", pd + 1, &u64, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return -1;
+        }
+        func = (uint16_t)u64;
+    }
+
+out:
+    if (busnr)
+        *busnr = busnum;
+    if (devfn)
+        *devfn = ((slot & 0x1F) << 3) | (func & 0x7);
+    return 0;
+}
+
 BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
 {
     DeviceState *dev;
-- 
1.8.3.1

* [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-01  9:13 ` [virtio-dev] " Si-Wei Liu
@ 2018-04-01  9:13   ` Si-Wei Liu
  -1 siblings, 0 replies; 109+ messages in thread
From: Si-Wei Liu @ 2018-04-01  9:13 UTC (permalink / raw)
  To: mst, jiri, stephen, alexander.h.duyck, davem, jesse.brandeburg,
	kubakici, jasowang, sridhar.samudrala, netdev, virtualization,
	virtio-dev

A hidden netdevice is not visible to userspace, such that
typical network utilities, e.g. ip, ifconfig et al.,
cannot sense its existence or configure it. Internally, a
hidden netdev may be associated with an upper-level netdev
that userspace has access to. Although userspace cannot
manipulate the lower netdev directly, the user may control
or configure the underlying hidden device through the
upper-level netdev. For identification purposes, the
kobject for a hidden netdev is still present in the sysfs
hierarchy; however, no uevent message is generated
when the sysfs entry is created, modified or destroyed.

To that end, a separate name scope needs to be carved
out for IFF_HIDDEN netdevs. As of now, a netdev name that
starts with a colon, i.e. ':', is invalid in userspace,
since socket ioctls such as SIOCGIFCONF use ':' as the
separator for ifname. The unused name scope starting
with ':' can therefore serve as the name scope for
the kernel-only IFF_HIDDEN netdevs.
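
The name-scope split can be pictured with a small userspace sketch that
classifies a name as visible, hidden or invalid. The enum and function names
are hypothetical. The per-character checks are loosely modeled on the
kernel's dev_valid_name(), and the rule that a hidden name is ':' followed by
an otherwise valid name is an assumption of this sketch, not something this
patch states.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#ifndef IFNAMSIZ
#define IFNAMSIZ 16     /* same value as the kernel's IFNAMSIZ */
#endif

enum name_scope { NAME_INVALID, NAME_VISIBLE, NAME_HIDDEN };

/* Checks for an ordinary, userspace-visible interface name. */
static bool plain_name_ok(const char *name, size_t max_len)
{
    size_t len = strlen(name);

    if (len == 0 || len > max_len)
        return false;
    if (!strcmp(name, ".") || !strcmp(name, ".."))
        return false;
    for (; *name; name++) {
        if (*name == '/' || *name == ':' || isspace((unsigned char)*name))
            return false;
    }
    return true;
}

/* A leading ':' selects the hidden scope; the rest must be a valid name. */
static enum name_scope classify_name(const char *name)
{
    assert(name);

    if (name[0] == ':')
        return plain_name_ok(name + 1, IFNAMSIZ - 2) ? NAME_HIDDEN
                                                     : NAME_INVALID;
    return plain_name_ok(name, IFNAMSIZ - 1) ? NAME_VISIBLE : NAME_INVALID;
}
```

Under these assumptions "eth0" is visible, ":eth0" falls into the hidden
scope, and "eth0:1" is rejected because ':' may only appear as the first
character.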

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
---
 include/linux/netdevice.h   |  12 ++
 include/net/net_namespace.h |   2 +
 net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
 net/core/net_namespace.c    |   1 +
 4 files changed, 263 insertions(+), 33 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef789e1..1a70f3a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1380,6 +1380,7 @@ struct net_device_ops {
  * @IFF_PHONY_HEADROOM: the headroom value is controlled by an external
  *	entity (i.e. the master device for bridged veth)
  * @IFF_MACSEC: device is a MACsec device
+ * @IFF_HIDDEN: device is not visible to userspace
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1410,6 +1411,7 @@ enum netdev_priv_flags {
 	IFF_RXFH_CONFIGURED		= 1<<25,
 	IFF_PHONY_HEADROOM		= 1<<26,
 	IFF_MACSEC			= 1<<27,
+	IFF_HIDDEN			= 1<<28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1439,6 +1441,7 @@ enum netdev_priv_flags {
 #define IFF_TEAM			IFF_TEAM
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
+#define IFF_HIDDEN			IFF_HIDDEN
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -1659,6 +1662,7 @@ enum netdev_priv_flags {
 struct net_device {
 	char			name[IFNAMSIZ];
 	struct hlist_node	name_hlist;
+	struct hlist_node	name_cmpl_hlist;
 	struct dev_ifalias	__rcu *ifalias;
 	/*
 	 *	I/O specific fields
@@ -1680,6 +1684,7 @@ struct net_device {
 	unsigned long		state;
 
 	struct list_head	dev_list;
+	struct list_head	dev_cmpl_list;
 	struct list_head	napi_list;
 	struct list_head	unreg_list;
 	struct list_head	close_list;
@@ -2326,6 +2331,7 @@ struct netdev_lag_lower_state_info {
 #define NETDEV_UDP_TUNNEL_PUSH_INFO	0x001C
 #define NETDEV_UDP_TUNNEL_DROP_INFO	0x001D
 #define NETDEV_CHANGE_TX_QUEUE_LEN	0x001E
+#define NETDEV_PRE_GETNAME	0x001F
 
 int register_netdevice_notifier(struct notifier_block *nb);
 int unregister_netdevice_notifier(struct notifier_block *nb);
@@ -2393,6 +2399,8 @@ static inline void netdev_notifier_info_init(struct netdev_notifier_info *info,
 		for_each_netdev_rcu(&init_net, slave)	\
 			if (netdev_master_upper_dev_get_rcu(slave) == (bond))
 #define net_device_entry(lh)	list_entry(lh, struct net_device, dev_list)
+#define for_each_netdev_complete(net, d)		\
+		list_for_each_entry(d, &(net)->dev_cmpl_head, dev_cmpl_list)
 
 static inline struct net_device *next_net_device(struct net_device *dev)
 {
@@ -2462,6 +2470,10 @@ static inline void unregister_netdevice(struct net_device *dev)
 	unregister_netdevice_queue(dev, NULL);
 }
 
+void netdev_set_hidden(struct net_device *dev);
+int hide_netdevice(struct net_device *dev);
+void unhide_netdevice(struct net_device *dev);
+
 int netdev_refcnt_read(const struct net_device *dev);
 void free_netdev(struct net_device *dev);
 void netdev_freemem(struct net_device *dev);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 0490084..f9ce9b4 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -80,7 +80,9 @@ struct net {
 	struct sock		*genl_sock;
 
 	struct list_head 	dev_base_head;
+	struct list_head 	dev_cmpl_head;
 	struct hlist_head 	*dev_name_head;
+	struct hlist_head 	*dev_name_cmpl_head;
 	struct hlist_head	*dev_index_head;
 	unsigned int		dev_base_seq;	/* protected by rtnl_mutex */
 	int			ifindex;
diff --git a/net/core/dev.c b/net/core/dev.c
index 613fb40..a991b35 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -211,6 +211,13 @@ static inline struct hlist_head *dev_name_hash(struct net *net, const char *name
 	return &net->dev_name_head[hash_32(hash, NETDEV_HASHBITS)];
 }
 
+static inline struct hlist_head *dev_cname_hash(struct net *net, const char *name)
+{
+	unsigned int hash = full_name_hash(net, name, strnlen(name, IFNAMSIZ));
+
+	return &net->dev_name_cmpl_head[hash_32(hash, NETDEV_HASHBITS)];
+}
+
 static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 {
 	return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
@@ -237,11 +244,19 @@ static void list_netdevice(struct net_device *dev)
 
 	ASSERT_RTNL();
 
+
 	write_lock_bh(&dev_base_lock);
-	list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
-	hlist_add_head_rcu(&dev->name_hlist, dev_name_hash(net, dev->name));
-	hlist_add_head_rcu(&dev->index_hlist,
-			   dev_index_hash(net, dev->ifindex));
+	if (!(dev->priv_flags & IFF_HIDDEN)) {
+		list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
+		hlist_add_head_rcu(&dev->name_hlist,
+				   dev_name_hash(net, dev->name));
+		hlist_add_head_rcu(&dev->index_hlist,
+				   dev_index_hash(net, dev->ifindex));
+	}
+	list_add_tail_rcu(&dev->dev_cmpl_list,
+			  &net->dev_cmpl_head);
+	hlist_add_head_rcu(&dev->name_cmpl_hlist,
+			   dev_cname_hash(net, dev->name));
 	write_unlock_bh(&dev_base_lock);
 
 	dev_base_seq_inc(net);
@@ -256,9 +271,13 @@ static void unlist_netdevice(struct net_device *dev)
 
 	/* Unlink dev from the device chain */
 	write_lock_bh(&dev_base_lock);
-	list_del_rcu(&dev->dev_list);
-	hlist_del_rcu(&dev->name_hlist);
-	hlist_del_rcu(&dev->index_hlist);
+	if (!(dev->priv_flags & IFF_HIDDEN)) {
+		list_del_rcu(&dev->dev_list);
+		hlist_del_rcu(&dev->name_hlist);
+		hlist_del_rcu(&dev->index_hlist);
+	}
+	list_del_rcu(&dev->dev_cmpl_list);
+	hlist_del_rcu(&dev->name_cmpl_hlist);
 	write_unlock_bh(&dev_base_lock);
 
 	dev_base_seq_inc(dev_net(dev));
@@ -736,11 +755,15 @@ int dev_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
 struct net_device *__dev_get_by_name(struct net *net, const char *name)
 {
 	struct net_device *dev;
-	struct hlist_head *head = dev_name_hash(net, name);
+	struct hlist_head *head = dev_cname_hash(net, name);
+	bool hidden_name = (*name == ':');
 
-	hlist_for_each_entry(dev, head, name_hlist)
+	hlist_for_each_entry(dev, head, name_cmpl_hlist) {
+		if (hidden_name && !(dev->priv_flags & IFF_HIDDEN))
+			continue;
 		if (!strncmp(dev->name, name, IFNAMSIZ))
 			return dev;
+	}
 
 	return NULL;
 }
@@ -1015,15 +1038,7 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short if_flags,
 }
 EXPORT_SYMBOL(__dev_get_by_flags);
 
-/**
- *	dev_valid_name - check if name is okay for network device
- *	@name: name string
- *
- *	Network device names need to be valid file names to
- *	to allow sysfs to work.  We also disallow any kind of
- *	whitespace.
- */
-bool dev_valid_name(const char *name)
+static bool __dev_valid_name(const char *name, bool hidden)
 {
 	if (*name == '\0')
 		return false;
@@ -1033,12 +1048,27 @@ bool dev_valid_name(const char *name)
 		return false;
 
 	while (*name) {
-		if (*name == '/' || *name == ':' || isspace(*name))
+		if (*name == '/' || isspace(*name))
+			return false;
+		if (!hidden && *name == ':')
 			return false;
 		name++;
 	}
 	return true;
 }
+
+/**
+ *	dev_valid_name - check if name is okay for network device
+ *	@name: name string
+ *
+ *	Network device names need to be valid file names to
+ *	allow sysfs to work.  We also disallow any kind of
+ *	whitespace.
+ */
+bool dev_valid_name(const char *name)
+{
+	return __dev_valid_name(name, false);
+}
 EXPORT_SYMBOL(dev_valid_name);
 
 /**
@@ -1064,9 +1094,6 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 	unsigned long *inuse;
 	struct net_device *d;
 
-	if (!dev_valid_name(name))
-		return -EINVAL;
-
 	p = strchr(name, '%');
 	if (p) {
 		/*
@@ -1082,7 +1109,7 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 		if (!inuse)
 			return -ENOMEM;
 
-		for_each_netdev(net, d) {
+		for_each_netdev_complete(net, d) {
 			if (!sscanf(d->name, name, &i))
 				continue;
 			if (i < 0 || i >= max_netdevices)
@@ -1139,18 +1166,18 @@ static int dev_alloc_name_ns(struct net *net,
 
 int dev_alloc_name(struct net_device *dev, const char *name)
 {
+	if (!dev_valid_name(name))
+		return -EINVAL;
+
 	return dev_alloc_name_ns(dev_net(dev), dev, name);
 }
 EXPORT_SYMBOL(dev_alloc_name);
 
-int dev_get_valid_name(struct net *net, struct net_device *dev,
-		       const char *name)
+static int __dev_get_name(struct net *net, struct net_device *dev,
+			  const char *name)
 {
 	BUG_ON(!net);
 
-	if (!dev_valid_name(name))
-		return -EINVAL;
-
 	if (strchr(name, '%'))
 		return dev_alloc_name_ns(net, dev, name);
 	else if (__dev_get_by_name(net, name))
@@ -1160,6 +1187,15 @@ int dev_get_valid_name(struct net *net, struct net_device *dev,
 
 	return 0;
 }
+
+int dev_get_valid_name(struct net *net, struct net_device *dev,
+		       const char *name)
+{
+	if (!__dev_valid_name(name, (dev->priv_flags & IFF_HIDDEN)))
+		return -EINVAL;
+
+	return __dev_get_name(net, dev, name);
+}
 EXPORT_SYMBOL(dev_get_valid_name);
 
 /**
@@ -1221,12 +1257,15 @@ int dev_change_name(struct net_device *dev, const char *newname)
 
 	write_lock_bh(&dev_base_lock);
 	hlist_del_rcu(&dev->name_hlist);
+	hlist_del_rcu(&dev->name_cmpl_hlist);
 	write_unlock_bh(&dev_base_lock);
 
 	synchronize_rcu();
 
 	write_lock_bh(&dev_base_lock);
 	hlist_add_head_rcu(&dev->name_hlist, dev_name_hash(net, dev->name));
+	hlist_add_head_rcu(&dev->name_cmpl_hlist,
+			   dev_cname_hash(net, dev->name));
 	write_unlock_bh(&dev_base_lock);
 
 	ret = call_netdevice_notifiers(NETDEV_CHANGENAME, dev);
@@ -1594,7 +1633,7 @@ int register_netdevice_notifier(struct notifier_block *nb)
 	if (dev_boot_phase)
 		goto unlock;
 	for_each_net(net) {
-		for_each_netdev(net, dev) {
+		for_each_netdev_complete(net, dev) {
 			err = call_netdevice_notifier(nb, NETDEV_REGISTER, dev);
 			err = notifier_to_errno(err);
 			if (err)
@@ -1614,7 +1653,7 @@ int register_netdevice_notifier(struct notifier_block *nb)
 rollback:
 	last = dev;
 	for_each_net(net) {
-		for_each_netdev(net, dev) {
+		for_each_netdev_complete(net, dev) {
 			if (dev == last)
 				goto outroll;
 
@@ -1659,7 +1698,7 @@ int unregister_netdevice_notifier(struct notifier_block *nb)
 		goto unlock;
 
 	for_each_net(net) {
-		for_each_netdev(net, dev) {
+		for_each_netdev_complete(net, dev) {
 			if (dev->flags & IFF_UP) {
 				call_netdevice_notifier(nb, NETDEV_GOING_DOWN,
 							dev);
@@ -7642,6 +7681,11 @@ int register_netdevice(struct net_device *dev)
 	spin_lock_init(&dev->addr_list_lock);
 	netdev_set_addr_lockdep_class(dev);
 
+	ret = call_netdevice_notifiers(NETDEV_PRE_GETNAME, dev);
+	ret = notifier_to_errno(ret);
+	if (ret)
+		goto out;
+
 	ret = dev_get_valid_name(net, dev, dev->name);
 	if (ret < 0)
 		goto out;
@@ -8461,6 +8505,166 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 }
 EXPORT_SYMBOL_GPL(dev_change_net_namespace);
 
+/**
+ *	netdev_set_hidden - indicate a hidden netdev before or at
+ *			    early point of driver registration
+ *	@dev: device
+ *
+ *	Callers must hold the rtnl semaphore, typically before or
+ *	at some early point (e.g. in NETDEV_PRE_GETNAME notifier)
+ *	of driver registration, or it won't take effect to hide
+ *	the netdev post registration.
+ */
+void netdev_set_hidden(struct net_device *dev)
+{
+	dev->priv_flags |= IFF_HIDDEN;
+	strlcpy(dev->name, ":eth%d", IFNAMSIZ);
+}
+EXPORT_SYMBOL(netdev_set_hidden);
+
+/**
+ *	hide_netdevice - hide device from userspace's visibility
+ *	@dev: device
+ *
+ *	This function shuts down a device interface and removes it
+ *	from all userspace visible dev lists, and moves it to
+ *	comprehensive dev lists containing both userspace-visible
+ *	and kernel-only devices. On success 0 is returned, on
+ *	failure a negative errno code is returned.
+ */
+int hide_netdevice(struct net_device *dev)
+{
+	int err;
+
+	rtnl_lock();
+
+	err = 0;
+	/* Get out if there is nothing to do */
+	if (dev->priv_flags & IFF_HIDDEN)
+		goto out;
+
+	err = -EINVAL;
+	/* Ensure the device has been registered */
+	if (dev->reg_state != NETREG_REGISTERED)
+		goto out;
+
+	err = __dev_get_name(dev_net(dev), dev, ":eth%d");
+	if (err < 0)
+		goto out;
+
+	/*
+	 * And now a mini version of register_netdevice unregister_netdevice.
+	 */
+
+	/* If device is running close it first. */
+	dev_close(dev);
+
+	/* And unlink it from device chain */
+	unlist_netdevice(dev);
+	synchronize_net();
+
+	/* Shutdown queueing discipline. */
+	dev_shutdown(dev);
+
+	/* Notify protocols, that we are about to destroy
+	 * this device. They should clean all the things.
+	 *
+	 * Note that dev->reg_state stays at NETREG_REGISTERED.
+	 * This is wanted because this way 8021q and macvlan know
+	 * the device is just moving and can keep their slaves up.
+	 */
+	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
+	rcu_barrier();
+	call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
+	rtmsg_ifinfo(RTM_DELLINK, dev, ~0U, GFP_KERNEL);
+
+	/*
+	 *	Flush the unicast and multicast chains
+	 */
+	dev_uc_flush(dev);
+	dev_mc_flush(dev);
+
+	/* Send a netdev-removed uevent to the old namespace */
+	kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
+	netdev_adjacent_del_links(dev);
+
+	/* Fixup kobjects */
+	err = device_rename(&dev->dev, dev->name);
+	WARN_ON(err);
+
+	dev->priv_flags |= IFF_HIDDEN;
+	list_netdevice(dev);
+
+	/* Notify protocols, that a new device appeared. */
+	call_netdevice_notifiers(NETDEV_REGISTER, dev);
+
+	synchronize_net();
+	err = 0;
+out:
+	rtnl_unlock();
+	return err;
+}
+EXPORT_SYMBOL(hide_netdevice);
+
+/**
+ *	unhide_netdevice - make a hidden device visible to userspace
+ *	@dev: device
+ *
+ *	This function moves a hidden device to userspace visible
+ *	interfaces. A %NETDEV_REGISTER message will be sent to
+ *	the netdev notifier chain.
+ */
+void unhide_netdevice(struct net_device *dev)
+{
+	int err;
+
+	rtnl_lock();
+	/* Get out if there is nothing to do */
+	if (!(dev->priv_flags & IFF_HIDDEN))
+		goto out;
+
+	/* Ensure the device has been registered */
+	if (dev->reg_state != NETREG_REGISTERED)
+		goto out;
+
+	err = __dev_get_name(dev_net(dev), dev, "eth%d");
+	WARN_ON(err < 0);
+
+	/* If device is running close it first. */
+	dev_close(dev);
+	unlist_netdevice(dev);
+	synchronize_net();
+
+	/* Shutdown queueing discipline. */
+	dev_shutdown(dev);
+
+	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
+	rcu_barrier();
+	call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
+	dev_uc_flush(dev);
+	dev_mc_flush(dev);
+
+	/* Send a netdev-add uevent to the new namespace */
+	kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
+	netdev_adjacent_add_links(dev);
+
+	/* Fixup kobjects */
+	err = device_rename(&dev->dev, dev->name);
+	WARN_ON(err);
+
+	/* Add the device back in the hashes */
+	dev->priv_flags &= ~IFF_HIDDEN;
+	list_netdevice(dev);
+
+	call_netdevice_notifiers(NETDEV_REGISTER, dev);
+
+	rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL);
+	synchronize_net();
+out:
+	rtnl_unlock();
+}
+EXPORT_SYMBOL(unhide_netdevice);
+
 static int dev_cpu_dead(unsigned int oldcpu)
 {
 	struct sk_buff **list_skb;
@@ -8571,13 +8775,19 @@ static struct hlist_head * __net_init netdev_create_hash(void)
 /* Initialize per network namespace state */
 static int __net_init netdev_init(struct net *net)
 {
-	if (net != &init_net)
+	if (net != &init_net) {
 		INIT_LIST_HEAD(&net->dev_base_head);
+		INIT_LIST_HEAD(&net->dev_cmpl_head);
+	}
 
 	net->dev_name_head = netdev_create_hash();
 	if (net->dev_name_head == NULL)
 		goto err_name;
 
+	net->dev_name_cmpl_head = netdev_create_hash();
+	if (net->dev_name_cmpl_head == NULL)
+		goto err_cname;
+
 	net->dev_index_head = netdev_create_hash();
 	if (net->dev_index_head == NULL)
 		goto err_idx;
@@ -8585,6 +8795,8 @@ static int __net_init netdev_init(struct net *net)
 	return 0;
 
 err_idx:
+	kfree(net->dev_name_cmpl_head);
+err_cname:
 	kfree(net->dev_name_head);
 err_name:
 	return -ENOMEM;
@@ -8676,9 +8888,12 @@ void func(const struct net_device *dev, const char *fmt, ...)	\
 static void __net_exit netdev_exit(struct net *net)
 {
 	kfree(net->dev_name_head);
+	kfree(net->dev_name_cmpl_head);
 	kfree(net->dev_index_head);
-	if (net != &init_net)
+	if (net != &init_net) {
 		WARN_ON_ONCE(!list_empty(&net->dev_base_head));
+		WARN_ON_ONCE(!list_empty(&net->dev_cmpl_head));
+	}
 }
 
 static struct pernet_operations __net_initdata netdev_net_ops = {
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 60a71be..1c399e9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -37,6 +37,7 @@
 struct net init_net = {
 	.count		= ATOMIC_INIT(1),
 	.dev_base_head	= LIST_HEAD_INIT(init_net.dev_base_head),
+	.dev_cmpl_head	= LIST_HEAD_INIT(init_net.dev_cmpl_head),
 };
 EXPORT_SYMBOL(init_net);
 
-- 
1.8.3.1

* [virtio-dev] [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
@ 2018-04-01  9:13   ` Si-Wei Liu
  0 siblings, 0 replies; 109+ messages in thread
From: Si-Wei Liu @ 2018-04-01  9:13 UTC (permalink / raw)
  To: mst, jiri, stephen, alexander.h.duyck, davem, jesse.brandeburg,
	kubakici, jasowang, sridhar.samudrala, netdev, virtualization,
	virtio-dev

A hidden netdevice is not visible to userspace, so typical
network utilities (ip, ifconfig, et al.) cannot sense its
existence or configure it. Internally, a hidden netdev may be
associated with an upper-level netdev that userspace has
access to. Although userspace cannot manipulate the lower
netdev directly, the user may control or configure the
underlying hidden device through the upper-level netdev. For
identification purposes, the kobject for a hidden netdev is
still present in the sysfs hierarchy; however, no uevent
message is generated when the sysfs entry is created,
modified or destroyed.
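
As a rough illustration of the visibility rule described above,
here is a minimal userspace sketch in C. It is not kernel code:
the toy_* types and functions are hypothetical stand-ins for
struct net_device, the per-namespace lists (dev_base_head and
dev_cmpl_head) and list_netdevice(), and plain arrays replace the
RCU-protected lists. It only shows that a hidden device is kept
on the complete list while being left off the userspace-visible
one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEVS 8

/* Toy stand-in for struct net_device; not the kernel type. */
struct toy_dev {
	const char *name;
	bool hidden;		/* models IFF_HIDDEN */
};

/* Toy stand-in for the per-namespace device lists. */
struct toy_net {
	const struct toy_dev *complete[MAX_DEVS];	/* models dev_cmpl_head */
	size_t n_complete;
	const struct toy_dev *visible[MAX_DEVS];	/* models dev_base_head */
	size_t n_visible;
};

/*
 * Models list_netdevice(): every device goes on the complete list,
 * but only non-hidden devices go on the userspace-visible list.
 * Returns false when the toy table is full.
 */
static bool toy_list(struct toy_net *net, const struct toy_dev *dev)
{
	if (net->n_complete == MAX_DEVS)
		return false;
	net->complete[net->n_complete++] = dev;
	if (!dev->hidden)
		net->visible[net->n_visible++] = dev;
	return true;
}

/*
 * Look a device up by name. kernel_view selects the complete list
 * (as for_each_netdev_complete would walk it); otherwise only the
 * userspace-visible list is searched.
 */
static const struct toy_dev *toy_lookup(const struct toy_net *net,
					const char *name, bool kernel_view)
{
	const struct toy_dev *const *list =
		kernel_view ? net->complete : net->visible;
	size_t n = kernel_view ? net->n_complete : net->n_visible;
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(list[i]->name, name) == 0)
			return list[i];
	return NULL;
}
```

With one visible device "eth0" and one hidden device ":eth0"
listed, a userspace-view lookup of ":eth0" finds nothing, while
the kernel-view lookup returns the hidden device.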

To that end, a separate namescope needs to be carved
out for IFF_HIDDEN netdevs. As of now, a netdev name that
starts with a colon (':') is invalid in userspace,
since socket ioctls such as SIOCGIFCONF use ':' as the
separator for ifname. Because no userspace-visible name
can start with ':', that prefix can serve as the namescope
for the kernel-only IFF_HIDDEN netdevs.
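
The naming rule can be sketched in userspace C as follows. This
is only an approximation of the __dev_valid_name() helper added
by this patch: toy_valid_name and TOY_IFNAMSIZ are hypothetical
names, and the checks are reproduced from the diff as I read it.
Note that, as in the patch, the hidden case accepts ':' anywhere
in the name, not only as the first character.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define TOY_IFNAMSIZ 16	/* same value as the kernel's IFNAMSIZ */

/*
 * Return true if name is acceptable for a netdev. A ':' is only
 * accepted when the device is hidden (kernel-only); '/' and
 * whitespace are rejected in both cases.
 */
static bool toy_valid_name(const char *name, bool hidden)
{
	if (*name == '\0')
		return false;
	/* The name plus its terminating NUL must fit in the buffer. */
	if (strlen(name) >= TOY_IFNAMSIZ)
		return false;
	if (!strcmp(name, ".") || !strcmp(name, ".."))
		return false;

	for (; *name; name++) {
		if (*name == '/' || isspace((unsigned char)*name))
			return false;
		if (!hidden && *name == ':')
			return false;
	}
	return true;
}
```

Under this sketch ":eth0" is rejected for an ordinary device and
accepted for a hidden one, which is what keeps the two
namescopes from colliding.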

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
---
 include/linux/netdevice.h   |  12 ++
 include/net/net_namespace.h |   2 +
 net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
 net/core/net_namespace.c    |   1 +
 4 files changed, 263 insertions(+), 33 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef789e1..1a70f3a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1380,6 +1380,7 @@ struct net_device_ops {
  * @IFF_PHONY_HEADROOM: the headroom value is controlled by an external
  *	entity (i.e. the master device for bridged veth)
  * @IFF_MACSEC: device is a MACsec device
+ * @IFF_HIDDEN: device is not visible to userspace
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1410,6 +1411,7 @@ enum netdev_priv_flags {
 	IFF_RXFH_CONFIGURED		= 1<<25,
 	IFF_PHONY_HEADROOM		= 1<<26,
 	IFF_MACSEC			= 1<<27,
+	IFF_HIDDEN			= 1<<28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1439,6 +1441,7 @@ enum netdev_priv_flags {
 #define IFF_TEAM			IFF_TEAM
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
+#define IFF_HIDDEN			IFF_HIDDEN
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -1659,6 +1662,7 @@ enum netdev_priv_flags {
 struct net_device {
 	char			name[IFNAMSIZ];
 	struct hlist_node	name_hlist;
+	struct hlist_node	name_cmpl_hlist;
 	struct dev_ifalias	__rcu *ifalias;
 	/*
 	 *	I/O specific fields
@@ -1680,6 +1684,7 @@ struct net_device {
 	unsigned long		state;
 
 	struct list_head	dev_list;
+	struct list_head	dev_cmpl_list;
 	struct list_head	napi_list;
 	struct list_head	unreg_list;
 	struct list_head	close_list;
@@ -2326,6 +2331,7 @@ struct netdev_lag_lower_state_info {
 #define NETDEV_UDP_TUNNEL_PUSH_INFO	0x001C
 #define NETDEV_UDP_TUNNEL_DROP_INFO	0x001D
 #define NETDEV_CHANGE_TX_QUEUE_LEN	0x001E
+#define NETDEV_PRE_GETNAME	0x001F
 
 int register_netdevice_notifier(struct notifier_block *nb);
 int unregister_netdevice_notifier(struct notifier_block *nb);
@@ -2393,6 +2399,8 @@ static inline void netdev_notifier_info_init(struct netdev_notifier_info *info,
 		for_each_netdev_rcu(&init_net, slave)	\
 			if (netdev_master_upper_dev_get_rcu(slave) == (bond))
 #define net_device_entry(lh)	list_entry(lh, struct net_device, dev_list)
+#define for_each_netdev_complete(net, d)		\
+		list_for_each_entry(d, &(net)->dev_cmpl_head, dev_cmpl_list)
 
 static inline struct net_device *next_net_device(struct net_device *dev)
 {
@@ -2462,6 +2470,10 @@ static inline void unregister_netdevice(struct net_device *dev)
 	unregister_netdevice_queue(dev, NULL);
 }
 
+void netdev_set_hidden(struct net_device *dev);
+int hide_netdevice(struct net_device *dev);
+void unhide_netdevice(struct net_device *dev);
+
 int netdev_refcnt_read(const struct net_device *dev);
 void free_netdev(struct net_device *dev);
 void netdev_freemem(struct net_device *dev);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 0490084..f9ce9b4 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -80,7 +80,9 @@ struct net {
 	struct sock		*genl_sock;
 
 	struct list_head 	dev_base_head;
+	struct list_head 	dev_cmpl_head;
 	struct hlist_head 	*dev_name_head;
+	struct hlist_head 	*dev_name_cmpl_head;
 	struct hlist_head	*dev_index_head;
 	unsigned int		dev_base_seq;	/* protected by rtnl_mutex */
 	int			ifindex;
diff --git a/net/core/dev.c b/net/core/dev.c
index 613fb40..a991b35 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -211,6 +211,13 @@ static inline struct hlist_head *dev_name_hash(struct net *net, const char *name
 	return &net->dev_name_head[hash_32(hash, NETDEV_HASHBITS)];
 }
 
+static inline struct hlist_head *dev_cname_hash(struct net *net, const char *name)
+{
+	unsigned int hash = full_name_hash(net, name, strnlen(name, IFNAMSIZ));
+
+	return &net->dev_name_cmpl_head[hash_32(hash, NETDEV_HASHBITS)];
+}
+
 static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 {
 	return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
@@ -237,11 +244,19 @@ static void list_netdevice(struct net_device *dev)
 
 	ASSERT_RTNL();
 
+
 	write_lock_bh(&dev_base_lock);
-	list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
-	hlist_add_head_rcu(&dev->name_hlist, dev_name_hash(net, dev->name));
-	hlist_add_head_rcu(&dev->index_hlist,
-			   dev_index_hash(net, dev->ifindex));
+	if (!(dev->priv_flags & IFF_HIDDEN)) {
+		list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
+		hlist_add_head_rcu(&dev->name_hlist,
+				   dev_name_hash(net, dev->name));
+		hlist_add_head_rcu(&dev->index_hlist,
+				   dev_index_hash(net, dev->ifindex));
+	}
+	list_add_tail_rcu(&dev->dev_cmpl_list,
+			  &net->dev_cmpl_head);
+	hlist_add_head_rcu(&dev->name_cmpl_hlist,
+			   dev_cname_hash(net, dev->name));
 	write_unlock_bh(&dev_base_lock);
 
 	dev_base_seq_inc(net);
@@ -256,9 +271,13 @@ static void unlist_netdevice(struct net_device *dev)
 
 	/* Unlink dev from the device chain */
 	write_lock_bh(&dev_base_lock);
-	list_del_rcu(&dev->dev_list);
-	hlist_del_rcu(&dev->name_hlist);
-	hlist_del_rcu(&dev->index_hlist);
+	if (!(dev->priv_flags & IFF_HIDDEN)) {
+		list_del_rcu(&dev->dev_list);
+		hlist_del_rcu(&dev->name_hlist);
+		hlist_del_rcu(&dev->index_hlist);
+	}
+	list_del_rcu(&dev->dev_cmpl_list);
+	hlist_del_rcu(&dev->name_cmpl_hlist);
 	write_unlock_bh(&dev_base_lock);
 
 	dev_base_seq_inc(dev_net(dev));
@@ -736,11 +755,15 @@ int dev_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
 struct net_device *__dev_get_by_name(struct net *net, const char *name)
 {
 	struct net_device *dev;
-	struct hlist_head *head = dev_name_hash(net, name);
+	struct hlist_head *head = dev_cname_hash(net, name);
+	bool hidden_name = (*name == ':');
 
-	hlist_for_each_entry(dev, head, name_hlist)
+	hlist_for_each_entry(dev, head, name_cmpl_hlist) {
+		if (hidden_name && !(dev->priv_flags & IFF_HIDDEN))
+			continue;
 		if (!strncmp(dev->name, name, IFNAMSIZ))
 			return dev;
+	}
 
 	return NULL;
 }
@@ -1015,15 +1038,7 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short if_flags,
 }
 EXPORT_SYMBOL(__dev_get_by_flags);
 
-/**
- *	dev_valid_name - check if name is okay for network device
- *	@name: name string
- *
- *	Network device names need to be valid file names to
- *	to allow sysfs to work.  We also disallow any kind of
- *	whitespace.
- */
-bool dev_valid_name(const char *name)
+static bool __dev_valid_name(const char *name, bool hidden)
 {
 	if (*name == '\0')
 		return false;
@@ -1033,12 +1048,27 @@ bool dev_valid_name(const char *name)
 		return false;
 
 	while (*name) {
-		if (*name == '/' || *name == ':' || isspace(*name))
+		if (*name == '/' || isspace(*name))
+			return false;
+		if (!hidden && *name == ':')
 			return false;
 		name++;
 	}
 	return true;
 }
+
+/**
+ *	dev_valid_name - check if name is okay for network device
+ *	@name: name string
+ *
+ *	Network device names need to be valid file names to
+ *	allow sysfs to work.  We also disallow any kind of
+ *	whitespace.
+ */
+bool dev_valid_name(const char *name)
+{
+	return __dev_valid_name(name, false);
+}
 EXPORT_SYMBOL(dev_valid_name);
 
 /**
@@ -1064,9 +1094,6 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 	unsigned long *inuse;
 	struct net_device *d;
 
-	if (!dev_valid_name(name))
-		return -EINVAL;
-
 	p = strchr(name, '%');
 	if (p) {
 		/*
@@ -1082,7 +1109,7 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 		if (!inuse)
 			return -ENOMEM;
 
-		for_each_netdev(net, d) {
+		for_each_netdev_complete(net, d) {
 			if (!sscanf(d->name, name, &i))
 				continue;
 			if (i < 0 || i >= max_netdevices)
@@ -1139,18 +1166,18 @@ static int dev_alloc_name_ns(struct net *net,
 
 int dev_alloc_name(struct net_device *dev, const char *name)
 {
+	if (!dev_valid_name(name))
+		return -EINVAL;
+
 	return dev_alloc_name_ns(dev_net(dev), dev, name);
 }
 EXPORT_SYMBOL(dev_alloc_name);
 
-int dev_get_valid_name(struct net *net, struct net_device *dev,
-		       const char *name)
+static int __dev_get_name(struct net *net, struct net_device *dev,
+			  const char *name)
 {
 	BUG_ON(!net);
 
-	if (!dev_valid_name(name))
-		return -EINVAL;
-
 	if (strchr(name, '%'))
 		return dev_alloc_name_ns(net, dev, name);
 	else if (__dev_get_by_name(net, name))
@@ -1160,6 +1187,15 @@ int dev_get_valid_name(struct net *net, struct net_device *dev,
 
 	return 0;
 }
+
+int dev_get_valid_name(struct net *net, struct net_device *dev,
+		       const char *name)
+{
+	if (!__dev_valid_name(name, (dev->priv_flags & IFF_HIDDEN)))
+		return -EINVAL;
+
+	return __dev_get_name(net, dev, name);
+}
 EXPORT_SYMBOL(dev_get_valid_name);
 
 /**
@@ -1221,12 +1257,15 @@ int dev_change_name(struct net_device *dev, const char *newname)
 
 	write_lock_bh(&dev_base_lock);
 	hlist_del_rcu(&dev->name_hlist);
+	hlist_del_rcu(&dev->name_cmpl_hlist);
 	write_unlock_bh(&dev_base_lock);
 
 	synchronize_rcu();
 
 	write_lock_bh(&dev_base_lock);
 	hlist_add_head_rcu(&dev->name_hlist, dev_name_hash(net, dev->name));
+	hlist_add_head_rcu(&dev->name_cmpl_hlist,
+			   dev_cname_hash(net, dev->name));
 	write_unlock_bh(&dev_base_lock);
 
 	ret = call_netdevice_notifiers(NETDEV_CHANGENAME, dev);
@@ -1594,7 +1633,7 @@ int register_netdevice_notifier(struct notifier_block *nb)
 	if (dev_boot_phase)
 		goto unlock;
 	for_each_net(net) {
-		for_each_netdev(net, dev) {
+		for_each_netdev_complete(net, dev) {
 			err = call_netdevice_notifier(nb, NETDEV_REGISTER, dev);
 			err = notifier_to_errno(err);
 			if (err)
@@ -1614,7 +1653,7 @@ int register_netdevice_notifier(struct notifier_block *nb)
 rollback:
 	last = dev;
 	for_each_net(net) {
-		for_each_netdev(net, dev) {
+		for_each_netdev_complete(net, dev) {
 			if (dev == last)
 				goto outroll;
 
@@ -1659,7 +1698,7 @@ int unregister_netdevice_notifier(struct notifier_block *nb)
 		goto unlock;
 
 	for_each_net(net) {
-		for_each_netdev(net, dev) {
+		for_each_netdev_complete(net, dev) {
 			if (dev->flags & IFF_UP) {
 				call_netdevice_notifier(nb, NETDEV_GOING_DOWN,
 							dev);
@@ -7642,6 +7681,11 @@ int register_netdevice(struct net_device *dev)
 	spin_lock_init(&dev->addr_list_lock);
 	netdev_set_addr_lockdep_class(dev);
 
+	ret = call_netdevice_notifiers(NETDEV_PRE_GETNAME, dev);
+	ret = notifier_to_errno(ret);
+	if (ret)
+		goto out;
+
 	ret = dev_get_valid_name(net, dev, dev->name);
 	if (ret < 0)
 		goto out;
@@ -8461,6 +8505,166 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 }
 EXPORT_SYMBOL_GPL(dev_change_net_namespace);
 
+/**
+ *	netdev_set_hidden - indicate a hidden netdev before or at
+ *			    early point of driver registration
+ *	@dev: device
+ *
+ *	Callers must hold the rtnl semaphore, typically before or
+ *	at some early point (e.g. in NETDEV_PRE_GETNAME notifier)
+ *	of driver registration, or it won't take effect to hide
+ *	the netdev post registration.
+ */
+void netdev_set_hidden(struct net_device *dev)
+{
+	dev->priv_flags |= IFF_HIDDEN;
+	strlcpy(dev->name, ":eth%d", IFNAMSIZ);
+}
+EXPORT_SYMBOL(netdev_set_hidden);
+
+/**
+ *	hide_netdevice - hide device from userspace's visibility
+ *	@dev: device
+ *
+ *	This function shuts down a device interface and removes it
+ *	from all userspace visible dev lists, and moves it to
+ *	comprehensive dev lists containing both userspace-visible
+ *	and kernel-only devices. On success 0 is returned, on
+ *	failure a negative errno code is returned.
+ */
+int hide_netdevice(struct net_device *dev)
+{
+	int err;
+
+	rtnl_lock();
+
+	err = 0;
+	/* Get out if there is nothing to do */
+	if (dev->priv_flags & IFF_HIDDEN)
+		goto out;
+
+	err = -EINVAL;
+	/* Ensure the device has been registered */
+	if (dev->reg_state != NETREG_REGISTERED)
+		goto out;
+
+	err = __dev_get_name(dev_net(dev), dev, ":eth%d");
+	if (err < 0)
+		goto out;
+
+	/*
+	 * And now a mini version of register_netdevice unregister_netdevice.
+	 */
+
+	/* If device is running close it first. */
+	dev_close(dev);
+
+	/* And unlink it from device chain */
+	unlist_netdevice(dev);
+	synchronize_net();
+
+	/* Shutdown queueing discipline. */
+	dev_shutdown(dev);
+
+	/* Notify protocols, that we are about to destroy
+	 * this device. They should clean all the things.
+	 *
+	 * Note that dev->reg_state stays at NETREG_REGISTERED.
+	 * This is wanted because this way 8021q and macvlan know
+	 * the device is just moving and can keep their slaves up.
+	 */
+	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
+	rcu_barrier();
+	call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
+	rtmsg_ifinfo(RTM_DELLINK, dev, ~0U, GFP_KERNEL);
+
+	/*
+	 *	Flush the unicast and multicast chains
+	 */
+	dev_uc_flush(dev);
+	dev_mc_flush(dev);
+
+	/* Send a netdev-removed uevent to the old namespace */
+	kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
+	netdev_adjacent_del_links(dev);
+
+	/* Fixup kobjects */
+	err = device_rename(&dev->dev, dev->name);
+	WARN_ON(err);
+
+	dev->priv_flags |= IFF_HIDDEN;
+	list_netdevice(dev);
+
+	/* Notify protocols, that a new device appeared. */
+	call_netdevice_notifiers(NETDEV_REGISTER, dev);
+
+	synchronize_net();
+	err = 0;
+out:
+	rtnl_unlock();
+	return err;
+}
+EXPORT_SYMBOL(hide_netdevice);
+
+/**
+ *	unhide_netdevice - make a hidden device visible to userspace
+ *	@dev: device
+ *
+ *	This function moves a hidden device to userspace visible
+ *	interfaces. A %NETDEV_REGISTER message will be sent to
+ *	the netdev notifier chain.
+ */
+void unhide_netdevice(struct net_device *dev)
+{
+	int err;
+
+	rtnl_lock();
+	/* Get out if there is nothing to do */
+	if (!(dev->priv_flags & IFF_HIDDEN))
+		goto out;
+
+	/* Ensure the device has been registered */
+	if (dev->reg_state != NETREG_REGISTERED)
+		goto out;
+
+	err = __dev_get_name(dev_net(dev), dev, "eth%d");
+	WARN_ON(err < 0);
+
+	/* If device is running close it first. */
+	dev_close(dev);
+	unlist_netdevice(dev);
+	synchronize_net();
+
+	/* Shutdown queueing discipline. */
+	dev_shutdown(dev);
+
+	call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
+	rcu_barrier();
+	call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
+	dev_uc_flush(dev);
+	dev_mc_flush(dev);
+
+	/* Send a netdev-add uevent to the new namespace */
+	kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
+	netdev_adjacent_add_links(dev);
+
+	/* Fixup kobjects */
+	err = device_rename(&dev->dev, dev->name);
+	WARN_ON(err);
+
+	/* Add the device back in the hashes */
+	dev->priv_flags &= ~IFF_HIDDEN;
+	list_netdevice(dev);
+
+	call_netdevice_notifiers(NETDEV_REGISTER, dev);
+
+	rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL);
+	synchronize_net();
+out:
+	rtnl_unlock();
+}
+EXPORT_SYMBOL(unhide_netdevice);
+
 static int dev_cpu_dead(unsigned int oldcpu)
 {
 	struct sk_buff **list_skb;
@@ -8571,13 +8775,19 @@ static struct hlist_head * __net_init netdev_create_hash(void)
 /* Initialize per network namespace state */
 static int __net_init netdev_init(struct net *net)
 {
-	if (net != &init_net)
+	if (net != &init_net) {
 		INIT_LIST_HEAD(&net->dev_base_head);
+		INIT_LIST_HEAD(&net->dev_cmpl_head);
+	}
 
 	net->dev_name_head = netdev_create_hash();
 	if (net->dev_name_head == NULL)
 		goto err_name;
 
+	net->dev_name_cmpl_head = netdev_create_hash();
+	if (net->dev_name_cmpl_head == NULL)
+		goto err_cname;
+
 	net->dev_index_head = netdev_create_hash();
 	if (net->dev_index_head == NULL)
 		goto err_idx;
@@ -8585,6 +8795,8 @@ static int __net_init netdev_init(struct net *net)
 	return 0;
 
 err_idx:
+	kfree(net->dev_name_cmpl_head);
+err_cname:
 	kfree(net->dev_name_head);
 err_name:
 	return -ENOMEM;
@@ -8676,9 +8888,12 @@ void func(const struct net_device *dev, const char *fmt, ...)	\
 static void __net_exit netdev_exit(struct net *net)
 {
 	kfree(net->dev_name_head);
+	kfree(net->dev_name_cmpl_head);
 	kfree(net->dev_index_head);
-	if (net != &init_net)
+	if (net != &init_net) {
 		WARN_ON_ONCE(!list_empty(&net->dev_base_head));
+		WARN_ON_ONCE(!list_empty(&net->dev_cmpl_head));
+	}
 }
 
 static struct pernet_operations __net_initdata netdev_net_ops = {
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 60a71be..1c399e9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -37,6 +37,7 @@
 struct net init_net = {
 	.count		= ATOMIC_INIT(1),
 	.dev_base_head	= LIST_HEAD_INIT(init_net.dev_base_head),
+	.dev_cmpl_head	= LIST_HEAD_INIT(init_net.dev_cmpl_head),
 };
 EXPORT_SYMBOL(init_net);
 
-- 
1.8.3.1


* [RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden
  2018-04-01  9:13 ` [virtio-dev] " Si-Wei Liu
@ 2018-04-01  9:13   ` Si-Wei Liu
  -1 siblings, 0 replies; 109+ messages in thread
From: Si-Wei Liu @ 2018-04-01  9:13 UTC (permalink / raw)
  To: mst, jiri, stephen, alexander.h.duyck, davem, jesse.brandeburg,
	kubakici, jasowang, sridhar.samudrala, netdev, virtualization,
	virtio-dev

We should move virtio_bypass to a 1-upper-with-2-hidden-lower
driver model for greater compatibility with regard to preserving
the userspace API and ABI.

In addition, virtio_bypass should make stricter checks before
automatically enslaving the corresponding virtual function or
passthrough device. It is more reasonable to pair a
virtio_bypass instance with a VF or passthrough device 1:1
than to rely on searching for a random non-virtio netdev with
the exact same MAC address. One possible way of doing this is to
bind virtio_bypass explicitly to a guest PCI device by specifying
its <bus> and <slot>:<function> location. Change the BACKUP
feature to take these configs into account, so that verifying the
target device for auto-enslavement no longer relies on the MAC
address alone.
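
A minimal userspace sketch of the stricter check follows. It
assumes the candidate's PCI bus number and devfn are compared
against the configured <bus> and <slot>:<function>; the exact
comparison used by the patch may differ. The toy_* names are
hypothetical. TOY_PCI_DEVFN uses the same bit layout as the
kernel's PCI_DEVFN() macro (5-bit slot, 3-bit function).

```c
#include <stdbool.h>

/* Same encoding as the kernel's PCI_DEVFN(): slot in bits 7..3,
 * function in bits 2..0. */
#define TOY_PCI_DEVFN(slot, func) \
	((((slot) & 0x1fu) << 3) | ((func) & 0x07u))

/* Location the bypass instance was told to bind to (hypothetical). */
struct toy_bypass_cfg {
	unsigned int bus;
	unsigned int slot;
	unsigned int function;
};

/* PCI location of a candidate lower device (hypothetical). */
struct toy_pci_loc {
	unsigned int bus;
	unsigned int devfn;
};

/* True only if the candidate sits at exactly the configured
 * bus, slot and function. */
static bool toy_bypass_matches(const struct toy_bypass_cfg *cfg,
			       const struct toy_pci_loc *cand)
{
	return cand->bus == cfg->bus &&
	       cand->devfn == TOY_PCI_DEVFN(cfg->slot, cfg->function);
}
```

A device with the right MAC address but at a different slot or
bus would fail this check, which is the 1:1 pairing the text
above argues for.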

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
---
 drivers/net/virtio_net.c        | 159 ++++++++++++++++++++++++++++++++++++----
 include/uapi/linux/virtio_net.h |   2 +
 2 files changed, 148 insertions(+), 13 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f850cf6..c54a5bd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -77,6 +77,8 @@ struct virtnet_stats {
 	u64 rx_packets;
 };
 
+static struct workqueue_struct *virtnet_bypass_wq;
+
 /* Internal representation of a send virtqueue */
 struct send_queue {
 	/* Virtqueue associated with this send _queue */
@@ -196,6 +198,13 @@ struct padded_vnet_hdr {
 	char padding[4];
 };
 
+struct virtnet_bypass_task {
+	struct work_struct	work;
+	unsigned long		event;
+	struct net_device	*child_netdev;
+	struct net_device	*bypass_netdev;
+};
+
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -2557,6 +2566,11 @@ struct virtnet_bypass_info {
 
 	/* spinlock while updating stats */
 	spinlock_t stats_lock;
+
+	int bus;
+	int slot;
+	int function;
+
 };
 
 static void virtnet_bypass_child_open(struct net_device *dev,
@@ -2822,10 +2836,13 @@ static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
 	.get_link_ksettings     = virtnet_bypass_ethtool_get_link_ksettings,
 };
 
-static struct net_device *get_virtnet_bypass_bymac(struct net *net,
-						   const u8 *mac)
+static struct net_device *
+get_virtnet_bypass_bymac(struct net_device *child_netdev)
 {
+	struct net *net = dev_net(child_netdev);
 	struct net_device *dev;
+	struct virtnet_bypass_info *vbi;
+	int devfn;
 
 	ASSERT_RTNL();
 
@@ -2833,7 +2850,29 @@ static struct net_device *get_virtnet_bypass_bymac(struct net *net,
 		if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
 			continue;       /* not a virtnet_bypass device */
 
-		if (ether_addr_equal(mac, dev->perm_addr))
+		if (!ether_addr_equal(child_netdev->dev_addr, dev->perm_addr))
+			continue;       /* not matching MAC address */
+
+		if (!child_netdev->dev.parent)
+			continue;
+
+		/* Is child_netdev a backup netdev ? */
+		if (child_netdev->dev.parent == dev->dev.parent)
+			return dev;
+
+		/* Avoid non pci devices as active netdev */
+		if (!dev_is_pci(child_netdev->dev.parent))
+			continue;
+
+		vbi = netdev_priv(dev);
+		devfn = PCI_DEVFN(vbi->slot, vbi->function);
+
+		netdev_info(dev, "bus %d slot %d func %d\n",
+			    vbi->bus, vbi->slot, vbi->function);
+
+		/* Need to match <bus>:<slot>.<function> */
+		if (pci_get_bus_and_slot(vbi->bus, devfn) ==
+		    to_pci_dev(child_netdev->dev.parent))
 			return dev;
 	}
 
@@ -2878,10 +2917,61 @@ static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
 	return RX_HANDLER_ANOTHER;
 }
 
+static int virtnet_bypass_pregetname_child(struct net_device *child_netdev)
+{
+	struct net_device *dev;
+
+	if (child_netdev->addr_len != ETH_ALEN)
+		return NOTIFY_DONE;
+
+	/* We will use the MAC address to locate the virtnet_bypass netdev
+	 * to associate with the child netdev. If we don't find a matching
+	 * bypass netdev, move on.
+	 */
+	dev = get_virtnet_bypass_bymac(child_netdev);
+	if (!dev)
+		return NOTIFY_DONE;
+
+	if (child_netdev->dev.parent &&
+	    child_netdev->dev.parent != dev->dev.parent)
+		netdev_set_hidden(child_netdev);
+
+	return NOTIFY_OK;
+}
+
+static void virtnet_bypass_task_fn(struct work_struct *work)
+{
+	struct virtnet_bypass_task *task;
+	struct net_device *child_netdev;
+	int rc;
+
+	task = container_of(work, struct virtnet_bypass_task, work);
+	child_netdev = task->child_netdev;
+
+	switch (task->event) {
+	case NETDEV_REGISTER:
+		rc = hide_netdevice(child_netdev);
+		if (rc)
+			netdev_err(child_netdev,
+				   "hide netdev %s failed with error %d\n",
+				   child_netdev->name, rc);
+
+		break;
+	case NETDEV_UNREGISTER:
+		unhide_netdevice(child_netdev);
+		break;
+	default:
+		break;
+	}
+	dev_put(child_netdev);
+	kfree(task);
+}
+
 static int virtnet_bypass_register_child(struct net_device *child_netdev)
 {
 	struct virtnet_bypass_info *vbi;
 	struct net_device *dev;
+	struct virtnet_bypass_task *task;
 	bool backup;
 	int ret;
 
@@ -2892,25 +2982,34 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 	 * to associate with the child netdev. If we don't find a matching
 	 * bypass netdev, move on.
 	 */
-	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
-				       child_netdev->perm_addr);
+	dev = get_virtnet_bypass_bymac(child_netdev);
 	if (!dev)
 		return NOTIFY_DONE;
 
 	vbi = netdev_priv(dev);
 	backup = (child_netdev->dev.parent == dev->dev.parent);
 	if (backup ? rtnl_dereference(vbi->backup_netdev) :
-			rtnl_dereference(vbi->active_netdev)) {
+		     rtnl_dereference(vbi->active_netdev)) {
 		netdev_info(dev,
 			    "%s attempting to join bypass dev when %s already present\n",
 			    child_netdev->name, backup ? "backup" : "active");
 		return NOTIFY_DONE;
 	}
 
-	/* Avoid non pci devices as active netdev */
-	if (!backup && (!child_netdev->dev.parent ||
-			!dev_is_pci(child_netdev->dev.parent)))
-		return NOTIFY_DONE;
+	/* Verify <bus>:<slot>.<function> info */
+	if (!backup && !(child_netdev->priv_flags & IFF_HIDDEN)) {
+		task = kzalloc(sizeof(*task), GFP_ATOMIC);
+		if (!task)
+			return NOTIFY_DONE;
+		task->event = NETDEV_REGISTER;
+		task->bypass_netdev = dev;
+		task->child_netdev = child_netdev;
+		INIT_WORK(&task->work, virtnet_bypass_task_fn);
+		dev_hold(child_netdev);
+		queue_work(virtnet_bypass_wq, &task->work);
+
+		return NOTIFY_OK;
+	}
 
 	ret = netdev_rx_handler_register(child_netdev,
 					 virtnet_bypass_handle_frame, dev);
@@ -2981,6 +3080,7 @@ static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
 {
 	struct virtnet_bypass_info *vbi;
 	struct net_device *dev, *backup;
+	struct virtnet_bypass_task *task;
 
 	dev = get_virtnet_bypass_byref(child_netdev);
 	if (!dev)
@@ -3003,6 +3103,16 @@ static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
 			dev->min_mtu = backup->min_mtu;
 			dev->max_mtu = backup->max_mtu;
 		}
+
+		task = kzalloc(sizeof(*task), GFP_ATOMIC);
+		if (task) {
+			task->event = NETDEV_UNREGISTER;
+			task->bypass_netdev = dev;
+			task->child_netdev = child_netdev;
+			INIT_WORK(&task->work, virtnet_bypass_task_fn);
+			dev_hold(child_netdev);
+			queue_work(virtnet_bypass_wq, &task->work);
+		}
 	}
 
 	dev_put(child_netdev);
@@ -3059,6 +3169,8 @@ static int virtnet_bypass_event(struct notifier_block *this,
 		return NOTIFY_DONE;
 
 	switch (event) {
+	case NETDEV_PRE_GETNAME:
+		return virtnet_bypass_pregetname_child(event_dev);
 	case NETDEV_REGISTER:
 		return virtnet_bypass_register_child(event_dev);
 	case NETDEV_UNREGISTER:
@@ -3076,11 +3188,12 @@ static int virtnet_bypass_event(struct notifier_block *this,
 	.notifier_call = virtnet_bypass_event,
 };
 
-static int virtnet_bypass_create(struct virtnet_info *vi)
+static int virtnet_bypass_create(struct virtnet_info *vi, int bsf)
 {
 	struct net_device *backup_netdev = vi->dev;
 	struct device *dev = &vi->vdev->dev;
 	struct net_device *bypass_netdev;
+	struct virtnet_bypass_info *vbi;
 	int res;
 
 	/* Alloc at least 2 queues, for now we are going with 16 assuming
@@ -3095,6 +3208,11 @@ static int virtnet_bypass_create(struct virtnet_info *vi)
 
 	dev_net_set(bypass_netdev, dev_net(backup_netdev));
 	SET_NETDEV_DEV(bypass_netdev, dev);
+	vbi = netdev_priv(bypass_netdev);
+
+	vbi->bus = (bsf >> 8) & 0xFF;
+	vbi->slot = (bsf >> 3) & 0x1F;
+	vbi->function = bsf & 0x7;
 
 	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
 	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
@@ -3183,7 +3301,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	struct net_device *dev;
 	struct virtnet_info *vi;
 	u16 max_queue_pairs;
-	int mtu;
+	int mtu, bsf;
 
 	/* Find if host supports multiqueue virtio_net device */
 	err = virtio_cread_feature(vdev, VIRTIO_NET_F_MQ,
@@ -3339,8 +3457,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	virtnet_init_settings(dev);
 
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
-		if (virtnet_bypass_create(vi) != 0)
+		bsf = virtio_cread16(vdev,
+				     offsetof(struct virtio_net_config,
+					      bsf2backup));
+		if (virtnet_bypass_create(vi, bsf) != 0)
 			goto free_vqs;
+		netdev_set_hidden(dev);
 	}
 
 	err = register_netdev(dev);
@@ -3384,6 +3506,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	unregister_netdev(dev);
 free_bypass:
 	virtnet_bypass_destroy(vi);
+
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -3517,6 +3640,12 @@ static __init int virtio_net_driver_init(void)
 	if (ret)
 		goto err_dead;
 
+	virtnet_bypass_wq = create_singlethread_workqueue("virtio_bypass");
+	if (!virtnet_bypass_wq) {
+		ret = -ENOMEM;
+		goto err_wq;
+	}
+
         ret = register_virtio_driver(&virtio_net_driver);
 	if (ret)
 		goto err_virtio;
@@ -3524,6 +3653,8 @@ static __init int virtio_net_driver_init(void)
 	register_netdevice_notifier(&virtnet_bypass_notifier);
 	return 0;
 err_virtio:
+	destroy_workqueue(virtnet_bypass_wq);
+err_wq:
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
 err_dead:
 	cpuhp_remove_multi_state(virtionet_online);
@@ -3535,6 +3666,8 @@ static __init int virtio_net_driver_init(void)
 static __exit void virtio_net_driver_exit(void)
 {
 	unregister_netdevice_notifier(&virtnet_bypass_notifier);
+	if (virtnet_bypass_wq)
+		destroy_workqueue(virtnet_bypass_wq);
 	unregister_virtio_driver(&virtio_net_driver);
 	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
 	cpuhp_remove_multi_state(virtionet_online);
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index aa40664..0827b7e 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -80,6 +80,8 @@ struct virtio_net_config {
 	__u16 max_virtqueue_pairs;
 	/* Default maximum transmit unit advice */
 	__u16 mtu;
+	/* Device at bus:slot.function backed up by virtio_net */
+	__u16 bsf2backup;
 } __attribute__((packed));
 
 /*
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-01  9:13   ` [virtio-dev] " Si-Wei Liu
  (?)
@ 2018-04-01 16:11   ` David Ahern
  2018-04-03  7:40     ` Siwei Liu
                       ` (3 more replies)
  -1 siblings, 4 replies; 109+ messages in thread
From: David Ahern @ 2018-04-01 16:11 UTC (permalink / raw)
  To: Si-Wei Liu, mst, jiri, stephen, alexander.h.duyck, davem,
	jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
	virtualization, virtio-dev

On 4/1/18 3:13 AM, Si-Wei Liu wrote:
> Hidden netdevice is not visible to userspace such that
> typical network utilities e.g. ip, ifconfig and et al,
> cannot sense its existence or configure it. Internally
> hidden netdev may associate with an upper level netdev
> that userspace has access to. Although userspace cannot
> manipulate the lower netdev directly, user may control
> or configure the underlying hidden device through the
> upper-level netdev. For identification purpose, the
> kobject for hidden netdev still presents in the sysfs
> hierarchy, however, no uevent message will be generated
> when the sysfs entry is created, modified or destroyed.
> 
> For that end, a separate namescope needs to be carved
> out for IFF_HIDDEN netdevs. As of now netdev name that
> starts with colon i.e. ':' is invalid in userspace,
> since socket ioctls such as SIOCGIFCONF use ':' as the
> separator for ifname. The absence of namescope started
> with ':' can rightly be used as the namescope for
> the kernel-only IFF_HIDDEN netdevs.
> 
> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> ---
>  include/linux/netdevice.h   |  12 ++
>  include/net/net_namespace.h |   2 +
>  net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
>  net/core/net_namespace.c    |   1 +
>  4 files changed, 263 insertions(+), 33 deletions(-)
> 

There are other use cases that want to hide a device from userspace. I
would prefer a better solution than playing games with name prefixes and
one that includes an API for users to list all devices -- even ones
hidden by default.

https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00

https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed

Also, why are you suggesting that the device should still be visible via
/sysfs? That leads to inconsistent views of networking state - /sys
shows a device but a link dump does not.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-01 16:11   ` David Ahern
@ 2018-04-03  7:40       ` Siwei Liu
  2018-04-03  7:40       ` [virtio-dev] " Siwei Liu
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-03  7:40 UTC (permalink / raw)
  To: David Ahern
  Cc: Si-Wei Liu, Michael S. Tsirkin, Jiri Pirko, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev

On Sun, Apr 1, 2018 at 9:11 AM, David Ahern <dsahern@gmail.com> wrote:
> On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> Hidden netdevice is not visible to userspace such that
>> typical network utilities e.g. ip, ifconfig and et al,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purpose, the
>> kobject for hidden netdev still presents in the sysfs
>> hierarchy, however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>>
>> For that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now netdev name that
>> starts with colon i.e. ':' is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. The absence of namescope started
>> with ':' can rightly be used as the namescope for
>> the kernel-only IFF_HIDDEN netdevs.
>>
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
>>  net/core/net_namespace.c    |   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>
>
> There are other use cases that want to hide a device from userspace.

Can you elaborate on your use case in more detail? Looking at the links
below, I realize that the purpose of hiding devices in your case is
quite different from our migration case. In particular, I don't like
deliberately allowing the user to manipulate the link's visibility:
things fall apart easily while live migration is ongoing. Also, why add
a check for invisible links to every for_each_netdev() and its friends?
That effectively changes the semantics of internal APIs that have
existed for decades.

> I would prefer a better solution than playing games with name prefixes and

This part was intentionally left that way, as I wanted to gather
feedback before going too far. The idea was that I need a completely
new device namespace underneath (or namescope, which is != netns) that
covers all netdevs, both the kernel-only IFF_HIDDEN network devices and
the regular ones. The current namespace for devname is already exposed
to userspace and visible in the sysfs hierarchy, but for backwards
compatibility reasons an old udevd must still be able to reference the
entry of an IFF_HIDDEN netdev under the /sys/net directory. Using the
':' prefix keeps the changes minimal, without introducing another
namespace or the accompanying complexity of managing two separate
namespaces.
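
To make the ':' namescope idea concrete, here is a minimal userspace
sketch, not code from the series. It assumes a hidden netdev's name is
simply the visible name with ':' prepended and that the result must still
fit the kernel's 16-byte IFNAMSIZ limit. SKETCH_IFNAMSIZ,
ifname_is_hidden() and ifname_make_hidden() are hypothetical names used
only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Assumption: mirrors the kernel's IFNAMSIZ (16 bytes including NUL). */
#define SKETCH_IFNAMSIZ 16

/* A name belongs to the hidden namescope iff it starts with ':'. */
static bool ifname_is_hidden(const char *name)
{
	return name[0] == ':';
}

/*
 * Build the hidden-namescope name for a visible name.
 * Returns 0 on success, -1 if the prefixed name would not fit.
 */
static int ifname_make_hidden(const char *name, char out[SKETCH_IFNAMSIZ])
{
	int n;

	assert(!ifname_is_hidden(name));
	n = snprintf(out, SKETCH_IFNAMSIZ, ":%s", name);
	return (n < 0 || n >= SKETCH_IFNAMSIZ) ? -1 : 0;
}
```

Since userspace cannot create a name with a leading ':', a check like
ifname_is_hidden() would be enough to tell the two namescopes apart
without a second lookup table.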

Technically, I could create a separate sysfs directory for the new
namescope, say /sys/net-kernel, which would include all netdev entries
such as ':eth0' and 'ens3', and which could be referenced from /sys/net.
That would make /sys/net consistent with the view of userspace
utilities, but I am not sure whether that is overkill, and I would
like to gather more feedback before moving to that model.

> one that includes an API for users to list all devices -- even ones

What kind of API would you like for querying hidden devices?
rtnetlink? A private socket API? Or something else?

For our case, the sysfs interface is all we need, since udev is the
main target we'd like to support in order to keep the naming of
virtio_bypass consistent and compatible.

> hidden by default.
>
> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
> Also, why are you suggesting that the device should still be visible via
> /sysfs? That leads to inconsistent views of networking state - /sys
> shows a device but a link dump does not.

See my clarifications above. I don't mind kernel-only netdevs being
visible via sysfs, as that gives us a good trade-off between
backwards compatibility and visibility. A kobject is still created
there anyway. The bottom line is that all kernel-only devices and their
life-cycle uevents are made invisible to userspace network utilities,
and I think that simply achieves the goal of not breaking existing apps
while still being able to add new features.

-Siwei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-01 16:11   ` David Ahern
@ 2018-04-03  7:40     ` Siwei Liu
  2018-04-03  7:40       ` [virtio-dev] " Siwei Liu
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-03  7:40 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
	Si-Wei Liu, David Miller

On Sun, Apr 1, 2018 at 9:11 AM, David Ahern <dsahern@gmail.com> wrote:
> On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> Hidden netdevice is not visible to userspace such that
>> typical network utilities e.g. ip, ifconfig and et al,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purpose, the
>> kobject for hidden netdev still presents in the sysfs
>> hierarchy, however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>>
>> For that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now netdev name that
>> starts with colon i.e. ':' is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. The absence of namescope started
>> with ':' can rightly be used as the namescope for
>> the kernel-only IFF_HIDDEN netdevs.
>>
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
>>  net/core/net_namespace.c    |   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>
>
> There are other use cases that want to hide a device from userspace.

Can you elaborate your case in more details? Looking at the links
below I realize that the purpose of hiding devices in your case is
quite different from the our migration case. Particularly, I don't
like the part of elaborately allowing user to manipulate the link's
visibility - things fall apart easily while live migration is on
going. And, why doing additional check for invisible links in every
for_each_netdev() and its friends. This is effectively changing
semantics of internal APIs that exist for decades.

> I would prefer a better solution than playing games with name prefixes and

This part was intentionally left that way, as I wanted feedback
before going too far. My idea was that I need a completely new device
namespace underneath (or namescope, which is != netns) covering all
netdevs: kernel-only IFF_HIDDEN network devices and the rest. The
current namespace for devname is already exposed to userspace and
visible in the sysfs hierarchy, but for backwards compatibility the
old udevd must still be able to reference the entry of an IFF_HIDDEN
netdev under the /sys/net directory. Using the ':' prefix has the
benefit of minimal changes, without introducing another namespace or
the accompanying complexity of managing two separate namespaces.

Technically, I can create a separate sysfs directory for the new
namescope, say /sys/net-kernel, which includes all netdev entries like
':eth0' and 'ens3', and which can be referenced from /sys/net. It
would make /sys/net consistent with the view of userspace
utilities, but I am not sure whether that's overkill, and I would
like to gather more feedback before moving to that model.

> one that includes an API for users to list all devices -- even ones

What kind of API would you like for querying hidden devices?
rtnetlink? A private socket API? Or something else?

For our case, the sysfs interface is what we need and is sufficient,
since udev is the main target we'd like to support to make the naming
of virtio_bypass consistent and compatible.

> hidden by default.
>
> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
> Also, why are you suggesting that the device should still be visible via
> /sysfs? That leads to inconsistent views of networking state - /sys
> shows a device but a link dump does not.

See my clarifications above. I don't mind kernel-only netdevs being
visible via sysfs, as that way we get a good trade-off between
backwards compatibility and visibility. A kobject is still created
there, right? The bottom line is that all kernel-only devices and
their life-cycle uevents are made invisible to userspace network
utilities, and I think that achieves the goal of not breaking
existing apps while still being able to add new features.

-Siwei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden
  2018-04-01  9:13   ` [virtio-dev] " Si-Wei Liu
@ 2018-04-03 12:20     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 109+ messages in thread
From: Michael S. Tsirkin @ 2018-04-03 12:20 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: jiri, stephen, alexander.h.duyck, davem, jesse.brandeburg,
	kubakici, jasowang, sridhar.samudrala, netdev, virtualization,
	virtio-dev

On Sun, Apr 01, 2018 at 05:13:10AM -0400, Si-Wei Liu wrote:
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index aa40664..0827b7e 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -80,6 +80,8 @@ struct virtio_net_config {
>  	__u16 max_virtqueue_pairs;
>  	/* Default maximum transmit unit advice */
>  	__u16 mtu;
> +	/* Device at bus:slot.function backed up by virtio_net */
> +	__u16 bsf2backup;
>  } __attribute__((packed));

I'm not sure this is a good interface.  This isn't unique even on some
PCI systems, not to speak of non-PCI ones.

>  /*
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device
  2018-04-01  9:13   ` [virtio-dev] " Si-Wei Liu
@ 2018-04-03 12:25     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 109+ messages in thread
From: Michael S. Tsirkin @ 2018-04-03 12:25 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: jiri, stephen, alexander.h.duyck, davem, jesse.brandeburg,
	kubakici, jasowang, sridhar.samudrala, netdev, virtualization,
	virtio-dev

On Sun, Apr 01, 2018 at 05:13:08AM -0400, Si-Wei Liu wrote:
> @@ -896,6 +898,68 @@ void qmp_device_del(const char *id, Error **errp)
>      }
>  }
>  
> +int pci_get_busdevfn_by_id(const char *id, uint16_t *busnr,
> +                           uint16_t *devfn, Error **errp)
> +{
> +    uint16_t busnum = 0, slot = 0, func = 0;
> +    const char *pc, *pd, *pe;
> +    Error *local_err = NULL;
> +    ObjectClass *class;
> +    char value[1024];
> +    BusState *bus;
> +    uint64_t u64;
> +
> +    if (!(pc = strchr(id, ':'))) {
> +        error_setg(errp, "Invalid id: backup=%s, "
> +                   "correct format should be backup="
> +                   "'<bus-id>:<slot>[.<function>]'", id);
> +        return -1;
> +    }
> +    get_opt_name(value, sizeof(value), id, ':');
> +    if (pc != id + 1) {
> +        bus = qbus_find(value, errp);
> +        if (!bus)
> +            return -1;
> +
> +        class = object_get_class(OBJECT(bus));
> +        if (class != object_class_by_name(TYPE_PCI_BUS) &&
> +            class != object_class_by_name(TYPE_PCIE_BUS)) {
> +            error_setg(errp, "%s is not a device on pci bus", id);
> +            return -1;
> +        }
> +        busnum = (uint16_t)pci_bus_num(PCI_BUS(bus));
> +    }

Calling pci_bus_num is almost always a bug unless it is done within
the context of a PCI host, bridge, etc.

In particular, this will not do the right thing if run before the
guest assigns bus numbers.


> +
> +    if (!devfn)
> +        goto out;
> +
> +    pd = strchr(pc, '.');
> +    pe = get_opt_name(value, sizeof(value), pc + 1, '.');
> +    if (pe != pc + 1) {
> +        parse_option_number("slot", value, &u64, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return -1;
> +        }
> +        slot = (uint16_t)u64;
> +    }
> +    if (pd && *(pd + 1) != '\0') {
> +        parse_option_number("function", pd, &u64, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return -1;
> +        }
> +        func = (uint16_t)u64;
> +    }
> +
> +out:
> +    if (busnr)
> +        *busnr = busnum;
> +    if (devfn)
> +        *devfn = ((slot & 0x1F) << 3) | (func & 0x7);
> +    return 0;
> +}
> +
>  BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
>  {
>      DeviceState *dev;
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-03  7:40       ` [virtio-dev] " Siwei Liu
  (?)
@ 2018-04-03 14:57       ` David Ahern
  -1 siblings, 0 replies; 109+ messages in thread
From: David Ahern @ 2018-04-03 14:57 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Si-Wei Liu, Michael S. Tsirkin, Jiri Pirko, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev

On 4/3/18 1:40 AM, Siwei Liu wrote:
>> There are other use cases that want to hide a device from userspace.
> 
> Can you elaborate your case in more details? Looking at the links
> below I realize that the purpose of hiding devices in your case is
> quite different from the our migration case. Particularly, I don't

Some kernel drivers create "control" netdevs. They are not intended for
users to manipulate, and doing so may actually break networking.

> like the part of elaborately allowing user to manipulate the link's
> visibility - things fall apart easily while live migration is on
> going. And, why doing additional check for invisible links in every
> for_each_netdev() and its friends. This is effectively changing
> semantics of internal APIs that exist for decades.

Read the patch again: there are 40 references to for_each_netdev and
that patch touches 2 of them -- link dumps via rtnetlink and link dumps
via ioctl.

>> one that includes an API for users to list all devices -- even ones
> 
> What kind of API you would like to query for hidden devices?
> rtnetlink? a private socket API? or something else?

There are existing, established APIs for dumping links. No new API is
needed. As suggested in the 2 patches I referenced, the hidden /
invisibility cloak is an attribute of the device. When a link dump is
requested and the attribute is set, the device is skipped and not
included in the dump. However, if the user knows the device name, the
GETLINK / SETLINK / DELLINK APIs all work as normal. This allows the
device to be hidden from apps like snmpd, lldpd, etc., yet still usable.

> 
> For our case, the sysfs interface is what we need and is sufficient,
> since udev is the main target we'd like to support to make the naming
> of virtio_bypass consistent and compatible.

You are not hiding a device if it is visible in one API (/sysfs) and
not visible in another (rtnetlink). That only creates confusion.

> 
>> hidden by default.
>>
>> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>>
>> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>>
>> Also, why are you suggesting that the device should still be visible via
>> /sysfs? That leads to inconsistent views of networking state - /sys
>> shows a device but a link dump does not.
> 
> See my clarifications above. I don't mind kernel-only netdevs being
> visible via sysfs, as that way we get a good trade-off between
> backwards compatibility and visibility. There's still kobject created
> there right. Bottom line is that all kernel devices and its life-cycle
> uevents are made invisible to userpace network utilities, and I think
> it simply gets to the goal of not breaking existing apps while being
> able to add new features.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-01 16:11   ` David Ahern
                       ` (2 preceding siblings ...)
  2018-04-03 15:42     ` Jiri Pirko
@ 2018-04-03 15:42     ` Jiri Pirko
  2018-04-03 19:23       ` Siwei Liu
                         ` (2 more replies)
  3 siblings, 3 replies; 109+ messages in thread
From: Jiri Pirko @ 2018-04-03 15:42 UTC (permalink / raw)
  To: David Ahern
  Cc: Si-Wei Liu, mst, stephen, alexander.h.duyck, davem,
	jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
	virtualization, virtio-dev

Sun, Apr 01, 2018 at 06:11:29PM CEST, dsahern@gmail.com wrote:
>On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> Hidden netdevice is not visible to userspace such that
>> typical network utilities e.g. ip, ifconfig and et al,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purpose, the
>> kobject for hidden netdev still presents in the sysfs
>> hierarchy, however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>> 
>> For that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now netdev name that
>> starts with colon i.e. ':' is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. The absence of namescope started
>> with ':' can rightly be used as the namescope for
>> the kernel-only IFF_HIDDEN netdevs.
>> 
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
>>  net/core/net_namespace.c    |   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>> 
>
>There are other use cases that want to hide a device from userspace. I

What use cases do you have in mind?

>would prefer a better solution than playing games with name prefixes and
>one that includes an API for users to list all devices -- even ones
>hidden by default.

Netdevice hiding feels a bit scary to me. This smells like a workaround
for userspace issues. Why can't the netdevice always be visible, with
userspace knowing what it is and what it should do with it?

Once we start hiding, other related questions appear, like who can see
what, levels of visibility, etc.


>
>https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
>https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
>Also, why are you suggesting that the device should still be visible via
>/sysfs? That leads to inconsistent views of networking state - /sys
>shows a device but a link dump does not.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-01  9:13   ` [virtio-dev] " Si-Wei Liu
  (?)
  (?)
@ 2018-04-03 17:35   ` Stephen Hemminger
       [not found]     ` <CADGSJ23vZdtQzWdc_6M_Hr4MUej--wgvJ785DwRF3VaPWS1rpA@mail.gmail.com>
  -1 siblings, 1 reply; 109+ messages in thread
From: Stephen Hemminger @ 2018-04-03 17:35 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: mst, jiri, alexander.h.duyck, davem, jesse.brandeburg, kubakici,
	jasowang, sridhar.samudrala, netdev, virtualization, virtio-dev

On Sun,  1 Apr 2018 05:13:09 -0400
Si-Wei Liu <si-wei.liu@oracle.com> wrote:

> Hidden netdevice is not visible to userspace such that
> typical network utilities e.g. ip, ifconfig and et al,
> cannot sense its existence or configure it. Internally
> hidden netdev may associate with an upper level netdev
> that userspace has access to. Although userspace cannot
> manipulate the lower netdev directly, user may control
> or configure the underlying hidden device through the
> upper-level netdev. For identification purpose, the
> kobject for hidden netdev still presents in the sysfs
> hierarchy, however, no uevent message will be generated
> when the sysfs entry is created, modified or destroyed.
> 
> For that end, a separate namescope needs to be carved
> out for IFF_HIDDEN netdevs. As of now netdev name that
> starts with colon i.e. ':' is invalid in userspace,
> since socket ioctls such as SIOCGIFCONF use ':' as the
> separator for ifname. The absence of namescope started
> with ':' can rightly be used as the namescope for
> the kernel-only IFF_HIDDEN netdevs.
> 
> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
> ---

I understand the use case. I proposed using '.' as a prefix before,
but that ran into resistance. Using a colon seems worse.

Rather than playing with names and all the issues that can cause,
why not make it an attribute flag of the device in netlink?

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-03 15:42     ` Jiri Pirko
@ 2018-04-03 19:23         ` Siwei Liu
  2018-04-03 19:23         ` [virtio-dev] " Siwei Liu
  2018-04-04  1:04       ` David Ahern
  2 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-03 19:23 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Ahern, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev

On Tue, Apr 3, 2018 at 8:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Sun, Apr 01, 2018 at 06:11:29PM CEST, dsahern@gmail.com wrote:
>>On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>>> Hidden netdevice is not visible to userspace such that
>>> typical network utilities, e.g. ip, ifconfig, et al.,
>>> cannot sense its existence or configure it. Internally
>>> hidden netdev may associate with an upper level netdev
>>> that userspace has access to. Although userspace cannot
>>> manipulate the lower netdev directly, user may control
>>> or configure the underlying hidden device through the
>>> upper-level netdev. For identification purpose, the
>>> kobject for hidden netdev still presents in the sysfs
>>> hierarchy, however, no uevent message will be generated
>>> when the sysfs entry is created, modified or destroyed.
>>>
>>> For that end, a separate namescope needs to be carved
>>> out for IFF_HIDDEN netdevs. As of now netdev name that
>>> starts with colon i.e. ':' is invalid in userspace,
>>> since socket ioctls such as SIOCGIFCONF use ':' as the
>>> separator for ifname. The absence of namescope started
>>> with ':' can rightly be used as the namescope for
>>> the kernel-only IFF_HIDDEN netdevs.
>>>
>>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>>> ---
>>>  include/linux/netdevice.h   |  12 ++
>>>  include/net/net_namespace.h |   2 +
>>>  net/core/dev.c              | 281 ++++++++++++++++++++++++++++++++++++++------
>>>  net/core/net_namespace.c    |   1 +
>>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>>
>>
>>There are other use cases that want to hide a device from userspace. I
>
> What usecases do you have in mind?

Hope you're not staring at me and shouting. :)

I think we have discussed this a lot, and if the common goal is to merge
the two drivers rather than diverge, there's no better way than to hide
the lower devices from all existing userspace management utilities
(NetworkManager, cloud-init). This does not mean a loss of visibility, as
we can add a new API or CLI later on to expose the hidden devices
as needed, in a way that existing userspace apps don't break while new
apps aware of the feature know where to find them. This requirement is
critical to cloud providers; I can't stress enough how crazy it
would drive me not to see this resolved.

Thanks,
-Siwei

>
>>would prefer a better solution than playing games with name prefixes and
>>one that includes an API for users to list all devices -- even ones
>>hidden by default.
>
> Netdevice hiding feels a bit scary for me. This smells like a workaround
> for userspace issues. Why can't the netdevice be visible always and
> userspace would know what is it and what should it do with it?
>
> Once we start with hiding, there are other things related to that which
> appear. Like who can see what, levels of visibility etc...
>
>
>>
>>https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>>
>>https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>>
>>Also, why are you suggesting that the device should still be visible via
>>/sysfs? That leads to inconsistent views of networking state - /sys
>>shows a device but a link dump does not.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-03 15:42     ` Jiri Pirko
  2018-04-03 19:23       ` Siwei Liu
  2018-04-03 19:23         ` [virtio-dev] " Siwei Liu
@ 2018-04-04  1:04       ` David Ahern
  2018-04-04  6:19         ` Jiri Pirko
                           ` (5 more replies)
  2 siblings, 6 replies; 109+ messages in thread
From: David Ahern @ 2018-04-04  1:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Si-Wei Liu, mst, stephen, alexander.h.duyck, davem,
	jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
	virtualization, virtio-dev

On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>
>> There are other use cases that want to hide a device from userspace. I
> 
> What usecases do you have in mind?

As mentioned in a previous response, some kernel drivers create control
netdevs. Just as in this case, users should not be mucking with them, and
S/W like lldpd should ignore them.

> 
>> would prefer a better solution than playing games with name prefixes and
>> one that includes an API for users to list all devices -- even ones
>> hidden by default.
> 
> Netdevice hiding feels a bit scary for me. This smells like a workaround
> for userspace issues. Why can't the netdevice be visible always and
> userspace would know what is it and what should it do with it?
> 
> Once we start with hiding, there are other things related to that which
> appear. Like who can see what, levels of visibility etc...
> 

I would not advocate for any API that does not allow users to have full
introspection. The intent is to hide the netdev by default but have an
option to see it.
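
As a rough sketch of "hidden by default, with an option to see it", the
filter below walks a device array and skips hidden entries unless the
caller asks for them. Everything here (struct dump_dev, DUMP_DEV_HIDDEN,
dump_links) is a hypothetical name for illustration, not an existing
kernel or iproute2 interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit marking a device as hidden by default. */
#define DUMP_DEV_HIDDEN (1u << 0)

/* Hypothetical stand-in for a net device. */
struct dump_dev {
    const char *name;
    unsigned int flags;
};

/* Copy pointers to the devices that should be reported into out[]
 * (which must have room for n entries) and return how many were
 * copied. Hidden devices are skipped unless show_hidden is true. */
static size_t dump_links(const struct dump_dev *devs, size_t n,
                         bool show_hidden, const struct dump_dev **out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        bool hidden = (devs[i].flags & DUMP_DEV_HIDDEN) != 0;

        if (hidden && !show_hidden)
            continue;
        out[count++] = &devs[i];
    }
    return count;
}
```

The point of the design is that full introspection stays one option
away; nothing becomes permanently invisible.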

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04  1:04       ` David Ahern
@ 2018-04-04  6:19         ` Jiri Pirko
  2018-04-04  8:01             ` [virtio-dev] " Siwei Liu
  2018-04-04  8:01           ` Siwei Liu
  2018-04-04  6:19         ` Jiri Pirko
                           ` (4 subsequent siblings)
  5 siblings, 2 replies; 109+ messages in thread
From: Jiri Pirko @ 2018-04-04  6:19 UTC (permalink / raw)
  To: David Ahern
  Cc: Si-Wei Liu, mst, stephen, alexander.h.duyck, davem,
	jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
	virtualization, virtio-dev

Wed, Apr 04, 2018 at 03:04:26AM CEST, dsahern@gmail.com wrote:
>On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>> 
>> What usecases do you have in mind?
>
>As mentioned in a previous response some kernel drivers create control
>netdevs. Just as in this case users should not be mucking with it, and

virtio_net. Any other drivers?


>S/W like lldpd should ignore it.

It's just a matter of identification of the netdevs, so the user knows
what to do.


>
>> 
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>> 
>> Netdevice hiding feels a bit scary for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>> 
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>> 
>
>I would not advocate for any API that does not allow users to have full
>introspection. The intent is to hide the netdev by default but have an
>option to see it.

As an administrator, I want to see everything by default. I think that is
a reasonable requirement. Again, this awfully smells like a workaround...

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04  1:04       ` David Ahern
@ 2018-04-04  7:36           ` Siwei Liu
  2018-04-04  6:19         ` Jiri Pirko
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-04  7:36 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev

On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>>
>> What usecases do you have in mind?
>
> As mentioned in a previous response some kernel drivers create control
> netdevs. Just as in this case users should not be mucking with it, and
> S/W like lldpd should ignore it.
>
>>
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>>
>> Netdevice hiding feels a bit scary for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>>
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>>
>
> I would not advocate for any API that does not allow users to have full
> introspection. The intent is to hide the netdev by default but have an
> option to see it.

I'm fine with having a link dump API to inspect the hidden netdev. As I
said, the names of hidden netdevs should live in a separate device
namespace, and we have not yet gotten close to what that should look
like, as I don't want to make it just an option for 'ip link'. Perhaps a
new set of sub-commands under, say, 'ip device'.

-Siwei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04  6:19         ` Jiri Pirko
@ 2018-04-04  8:01             ` Siwei Liu
  2018-04-04  8:01           ` Siwei Liu
  1 sibling, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-04  8:01 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Ahern, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev

On Tue, Apr 3, 2018 at 11:19 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Apr 04, 2018 at 03:04:26AM CEST, dsahern@gmail.com wrote:
>>On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>>As mentioned in a previous response some kernel drivers create control
>>netdevs. Just as in this case users should not be mucking with it, and
>
> virtio_net. Any other drivers?

netvsc, if virtio_bypass is factored out into a common driver.

>
>
>>S/W like lldpd should ignore it.
>
> It's just a matter of identification of the netdevs, so the user knows
> what to do.
>
>
>>
>>>
>>>> would prefer a better solution than playing games with name prefixes and
>>>> one that includes an API for users to list all devices -- even ones
>>>> hidden by default.
>>>
>>> Netdevice hiding feels a bit scary for me. This smells like a workaround
>>> for userspace issues. Why can't the netdevice be visible always and
>>> userspace would know what is it and what should it do with it?
>>>
>>> Once we start with hiding, there are other things related to that which
>>> appear. Like who can see what, levels of visibility etc...
>>>
>>
>>I would not advocate for any API that does not allow users to have full
>>introspection. The intent is to hide the netdev by default but have an
>>option to see it.
>
> As an administrator, I want to see all by default. I think it is
> reasonable requirements. Again, this awfully smells like a workaround...

If the requirement is just to dump the link info, i.e. to perform
read-only operations on the hidden netdev, that's completely fine.
However, I am not a big fan of creating a weird mechanism that lets the
user deliberately manipulate the visibility (hide/unhide) of a netdev
in any case at any time. That risks becoming a slippery slope
for working around any software issue that should get fixed in the right
place.

Let's treat IFF_HIDDEN as a means to hide auto-managed netdevices. If
it's just that the name is misleading, I can rename it to something
like IFF_AUTO_MANAGED, which might reflect its nature more accurately.
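
A minimal sketch of the read-only policy described above, assuming a
hypothetical flag bit and a two-value operation enum: dumps are always
allowed, while userspace writes to an auto-managed device are refused.
None of these names or values come from the patch; they only illustrate
the intended behavior.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bit for the proposed IFF_AUTO_MANAGED flag; the value
 * is an assumption made for this sketch. */
#define SKETCH_IFF_AUTO_MANAGED (1u << 0)

/* Hypothetical classes of link operation. */
enum sketch_link_op {
    SKETCH_OP_DUMP, /* read-only: report link info */
    SKETCH_OP_SET   /* modify the device from userspace */
};

/* Hypothetical stand-in for a net device. */
struct managed_dev {
    const char *name;
    unsigned int flags;
};

/* Return 0 if the operation is allowed, -EPERM otherwise. Dumps always
 * pass; writes are refused on auto-managed devices, which are meant to
 * be configured only through their upper-level netdev. */
static int sketch_check_link_op(const struct managed_dev *dev,
                                enum sketch_link_op op)
{
    bool auto_managed = (dev->flags & SKETCH_IFF_AUTO_MANAGED) != 0;

    if (auto_managed && op == SKETCH_OP_SET)
        return -EPERM;
    return 0;
}
```

Note that the flag itself is never toggled through this path, which
matches the objection to a user-controlled hide/unhide mechanism.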

Thanks,
-Siwei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
@ 2018-04-04  8:01             ` Siwei Liu
  0 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-04  8:01 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Ahern, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev

On Tue, Apr 3, 2018 at 11:19 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Apr 04, 2018 at 03:04:26AM CEST, dsahern@gmail.com wrote:
>>On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>>As mentioned in a previous response some kernel drivers create control
>>netdevs. Just as in this case users should not be mucking with it, and
>
> virtio_net. Any other drivers?

netvsc if factoring out virtio_bypass to a common driver.

>
>
>>S/W like lldpd should ignore it.
>
> It's just a matter of identification of the netdevs, so the user knows
> what to do.
>
>
>>
>>>
>>>> would prefer a better solution than playing games with name prefixes and
>>>> one that includes an API for users to list all devices -- even ones
>>>> hidden by default.
>>>
>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>> for userspace issues. Why can't the netdevice be visible always and
>>> userspace would know what is it and what should it do with it?
>>>
>>> Once we start with hiding, there are other things related to that which
>>> appear. Like who can see what, levels of visibility etc...
>>>
>>
>>I would not advocate for any API that does not allow users to have full
>>introspection. The intent is to hide the netdev by default but have an
>>option to see it.
>
> As an administrator, I want to see all by default. I think it is
> reasonable requirements. Again, this awfully smells like a workaround...

If the requirement is just for dumping the link info i.e. perform
read-only operation on the hidden netdev, it's completely fine.
However, I am not a big fan of creating a weird mechanism to allow
user deliberately manipulate the visibility (hide/unhide) of a netdev
in any case at any time. This is subject to becoming a slippery slope
to work around any software issue that should get fixed in the right
place.

Let's treat IFF_HIDDEN as a means to hide auto-managed netdevices. If
it's just that the name is misleading, I can rename it to something
like IFF_AUTO_MANAGED, which might reflect its nature better.

Thanks,
-Siwei
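
For illustration only, the "hidden by default, read-only dump on request"
behaviour discussed above could be modelled roughly as below. This is a
standalone sketch, not kernel code: EXAMPLE_IFF_HIDDEN, struct
example_netdev and dump_links() are hypothetical names made up for this
example, and the flag value is arbitrary.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical private flag; the value is arbitrary for this sketch. */
#define EXAMPLE_IFF_HIDDEN (1u << 0)

/* Minimal stand-in for a netdevice: just a name and private flags. */
struct example_netdev {
    const char *name;
    unsigned int priv_flags;
};

/*
 * Collect the devices a link dump should report.
 * Hidden devices are skipped unless the caller explicitly asks for them.
 * The dump is read-only: it never changes a device's flags.
 * out[] must have room for n entries. Returns the number reported.
 */
static size_t dump_links(const struct example_netdev *devs, size_t n,
                         bool show_hidden,
                         const struct example_netdev **out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        bool hidden = (devs[i].priv_flags & EXAMPLE_IFF_HIDDEN) != 0;

        if (hidden && !show_hidden)
            continue;
        out[count++] = &devs[i];
    }
    return count;
}
```

Note there is deliberately no function here to set or clear the flag from
the caller's side: visibility is decided by whoever created the device,
and the dump only chooses whether to include it.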

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device
  2018-04-03 12:25     ` [virtio-dev] " Michael S. Tsirkin
@ 2018-04-04  8:02       ` Siwei Liu
  -1 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-04  8:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Si-Wei Liu, Jiri Pirko, Stephen Hemminger, Alexander Duyck,
	David Miller, Brandeburg, Jesse, Jakub Kicinski, Jason Wang,
	Samudrala, Sridhar, Netdev, virtualization, virtio-dev

On Tue, Apr 3, 2018 at 5:25 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Sun, Apr 01, 2018 at 05:13:08AM -0400, Si-Wei Liu wrote:
>> @@ -896,6 +898,68 @@ void qmp_device_del(const char *id, Error **errp)
>>      }
>>  }
>>
>> +int pci_get_busdevfn_by_id(const char *id, uint16_t *busnr,
>> +                           uint16_t *devfn, Error **errp)
>> +{
>> +    uint16_t busnum = 0, slot = 0, func = 0;
>> +    const char *pc, *pd, *pe;
>> +    Error *local_err = NULL;
>> +    ObjectClass *class;
>> +    char value[1024];
>> +    BusState *bus;
>> +    uint64_t u64;
>> +
>> +    if (!(pc = strchr(id, ':'))) {
>> +        error_setg(errp, "Invalid id: backup=%s, "
>> +                   "correct format should be backup="
>> +                   "'<bus-id>:<slot>[.<function>]'", id);
>> +        return -1;
>> +    }
>> +    get_opt_name(value, sizeof(value), id, ':');
>> +    if (pc != id + 1) {
>> +        bus = qbus_find(value, errp);
>> +        if (!bus)
>> +            return -1;
>> +
>> +        class = object_get_class(OBJECT(bus));
>> +        if (class != object_class_by_name(TYPE_PCI_BUS) &&
>> +            class != object_class_by_name(TYPE_PCIE_BUS)) {
>> +            error_setg(errp, "%s is not a device on pci bus", id);
>> +            return -1;
>> +        }
>> +        busnum = (uint16_t)pci_bus_num(PCI_BUS(bus));
>> +    }
>
> pci_bus_num is almost always a bug if not done within
> a context of a PCI host, bridge, etc.
>
> In particular this will not DTRT if run before guest assigns bus
> numbers.
>
I was looking for a way to reserve a specific PCI bus slot from drivers
and to update the driver when the guest assigns the bus number, but it
seems there's no low-hanging fruit. For that reason the bus_num is not
obtained until it's really needed (during get_config), and I assume
that by that point the PCI bus assignment is already done. I know the
current approach is not perfect, but we need that information (the PCI
bus:slot.func number) to name the guest device correctly.

Regards,
-Siwei
>
>> +
>> +    if (!devfn)
>> +        goto out;
>> +
>> +    pd = strchr(pc, '.');
>> +    pe = get_opt_name(value, sizeof(value), pc + 1, '.');
>> +    if (pe != pc + 1) {
>> +        parse_option_number("slot", value, &u64, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return -1;
>> +        }
>> +        slot = (uint16_t)u64;
>> +    }
>> +    if (pd && *(pd + 1) != '\0') {
>> +        parse_option_number("function", pd, &u64, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return -1;
>> +        }
>> +        func = (uint16_t)u64;
>> +    }
>> +
>> +out:
>> +    if (busnr)
>> +        *busnr = busnum;
>> +    if (devfn)
>> +        *devfn = ((slot & 0x1F) << 3) | (func & 0x7);
>> +    return 0;
>> +}
>> +
>>  BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
>>  {
>>      DeviceState *dev;
>> --
>> 1.8.3.1
>
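
As a side note for readers, the '<bus-id>:<slot>[.<function>]' format and
the devfn packing used by the quoted patch can be sketched in isolation as
below. This is a standalone illustration, not the QEMU code:
parse_backup_id() and pack_devfn() are hypothetical helpers, and parsing
slot and function as hexadecimal is an assumption borrowed from the usual
PCI address notation. It does not resolve the bus number, which is the part
that depends on guest enumeration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Pack a PCI slot (0-31) and function (0-7) into an 8-bit devfn. */
static uint8_t pack_devfn(unsigned int slot, unsigned int func)
{
    return (uint8_t)(((slot & 0x1F) << 3) | (func & 0x7));
}

/*
 * Parse "<bus-id>:<slot>[.<function>]" (hypothetical helper).
 * Copies the bus id into bus_id and stores the packed devfn.
 * Slot and function are read as hexadecimal (an assumption).
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
static int parse_backup_id(const char *id, char *bus_id, size_t bus_id_len,
                           uint8_t *devfn)
{
    const char *colon = strchr(id, ':');
    unsigned long slot, func = 0;
    char *end;
    size_t n;

    if (!colon)
        return -1;

    /* Everything before the colon is the bus id. */
    n = (size_t)(colon - id);
    if (n >= bus_id_len)
        return -1;
    memcpy(bus_id, id, n);
    bus_id[n] = '\0';

    /* The slot is mandatory and must fit in 5 bits. */
    slot = strtoul(colon + 1, &end, 16);
    if (end == colon + 1 || slot > 0x1F)
        return -1;

    /* The function is optional; skip the '.' before parsing it. */
    if (*end == '.') {
        const char *func_str = end + 1;

        func = strtoul(func_str, &end, 16);
        if (end == func_str || func > 0x7)
            return -1;
    }
    if (*end != '\0')
        return -1;

    *devfn = pack_devfn((unsigned int)slot, (unsigned int)func);
    return 0;
}
```

With this sketch, "pci.0:1f.7" yields bus id "pci.0" and devfn 0xff, while
"pci.0:3" yields devfn 0x18 because the function defaults to 0.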

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden
  2018-04-03 12:20     ` [virtio-dev] " Michael S. Tsirkin
@ 2018-04-04  8:03       ` Siwei Liu
  -1 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-04  8:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Si-Wei Liu, Jiri Pirko, Stephen Hemminger, Alexander Duyck,
	David Miller, Brandeburg, Jesse, Jakub Kicinski, Jason Wang,
	Samudrala, Sridhar, Netdev, virtualization, virtio-dev

On Tue, Apr 3, 2018 at 5:20 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Sun, Apr 01, 2018 at 05:13:10AM -0400, Si-Wei Liu wrote:
>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>> index aa40664..0827b7e 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -80,6 +80,8 @@ struct virtio_net_config {
>>       __u16 max_virtqueue_pairs;
>>       /* Default maximum transmit unit advice */
>>       __u16 mtu;
>> +     /* Device at bus:slot.function backed up by virtio_net */
>> +     __u16 bsf2backup;
>>  } __attribute__((packed));
>
> I'm not sure this is a good interface.  This isn't unique even on some
> PCI systems, not to speak of non-PCI ones.

Are you suggesting adding the PCI address domain as well to make it
universally unique? And what non-PCI device do you envision that the
main target, essentially live migration, can/should cover? Or do you
already have a better option in mind?

Thanks,
-Siwei
>
>>  /*
>> --
>> 1.8.3.1
>
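
One reading of the uniqueness concern is that a 16-bit bus:slot.function
value has no room for the PCI domain (segment), so two devices in different
domains can map to the same value. The sketch below shows that under an
assumed layout (bus in bits 15:8, slot in 7:3, function in 2:0); the patch
excerpt does not spell out the encoding of bsf2backup, so the layout,
struct pci_addr, pack_bsf16() and pack_dbsf32() are all assumptions made
for illustration.

```c
#include <stdint.h>

/* A full PCI address: domain (segment), bus, slot, function. */
struct pci_addr {
    uint16_t domain;
    uint8_t bus;
    uint8_t slot;   /* 0-31 */
    uint8_t func;   /* 0-7 */
};

/* Assumed 16-bit layout: bus in 15:8, slot in 7:3, function in 2:0. */
static uint16_t pack_bsf16(const struct pci_addr *a)
{
    return (uint16_t)(((unsigned int)a->bus << 8) |
                      ((a->slot & 0x1Fu) << 3) |
                      (a->func & 0x7u));
}

/* 32-bit alternative that keeps the domain in the upper half. */
static uint32_t pack_dbsf32(const struct pci_addr *a)
{
    return ((uint32_t)a->domain << 16) | pack_bsf16(a);
}
```

For example, 0000:00:03.0 and 0001:00:03.0 both pack to 0x0018 with
pack_bsf16(), but to 0x00000018 and 0x00010018 with pack_dbsf32(). Neither
form says anything about non-PCI transports.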

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04  1:04       ` David Ahern
@ 2018-04-04  8:28           ` Siwei Liu
  2018-04-04  6:19         ` Jiri Pirko
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-04  8:28 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev

On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>>
>> What usecases do you have in mind?
>
> As mentioned in a previous response some kernel drivers create control
> netdevs. Just as in this case users should not be mucking with it, and
> S/W like lldpd should ignore it.

I'm still not sure I understand your case: why do you want to hide the
control netdev, when I assume those devices could choose either to
silently ignore the request or to fail loudly on user operations?
Is it creating issues already, or what problem do you want to solve if
not making the netdev invisible? Why couldn't lldpd check some
specific flag and ignore the control netdevice? (Can you please give an
example of a concrete driver for a control netdevice *in tree*?)

And I'm completely lost as to why you want an API to make a hidden
netdev visible again for these control devices.

Thanks,
-Siwei


>
>>
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>>
>> Netdevice hiding feels a bit scary to me. This smells like a workaround
>> for userspace issues. Why can't the netdevice always be visible, so that
>> userspace knows what it is and what it should do with it?
>>
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>>
>
> I would not advocate for any API that does not allow users to have full
> introspection. The intent is to hide the netdev by default but have an
> option to see it.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04  7:36           ` [virtio-dev] " Siwei Liu
  (?)
@ 2018-04-04 17:21           ` David Ahern
  2018-04-04 17:37             ` David Miller
                               ` (3 more replies)
  -1 siblings, 4 replies; 109+ messages in thread
From: David Ahern @ 2018-04-04 17:21 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev

On 4/4/18 1:36 AM, Siwei Liu wrote:
> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>> As mentioned in a previous response some kernel drivers create control
>> netdevs. Just as in this case users should not be mucking with it, and
>> S/W like lldpd should ignore it.
>>
>>>
>>>> would prefer a better solution than playing games with name prefixes and
>>>> one that includes an API for users to list all devices -- even ones
>>>> hidden by default.
>>>
>>> Netdevice hiding feels a bit scary to me. This smells like a workaround
>>> for userspace issues. Why can't the netdevice always be visible, so that
>>> userspace knows what it is and what it should do with it?
>>>
>>> Once we start with hiding, there are other things related to that which
>>> appear. Like who can see what, levels of visibility etc...
>>>
>>
>> I would not advocate for any API that does not allow users to have full
>> introspection. The intent is to hide the netdev by default but have an
>> option to see it.
> 
> I'm fine with having a link dump API to inspect the hidden netdev. As
> said, the name for hidden netdevs should be in a separate device
> namespace, and we did not even get closer to what it should look like
> as I don't want to make it just an option for ip link. Perhaps a new
> set of sub-commands of, say, 'ip device'.

It is a netdev so there is no reason to have a separate ip command to
inspect it. 'ip link' is the right place.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04 17:21           ` David Ahern
@ 2018-04-04 17:37             ` David Miller
  2018-04-04 18:20               ` Jiri Pirko
                                 ` (2 more replies)
  2018-04-04 17:37             ` David Miller
                               ` (2 subsequent siblings)
  3 siblings, 3 replies; 109+ messages in thread
From: David Miller @ 2018-04-04 17:37 UTC (permalink / raw)
  To: dsahern
  Cc: loseweigh, jiri, si-wei.liu, mst, stephen, alexander.h.duyck,
	jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
	virtualization, virtio-dev

From: David Ahern <dsahern@gmail.com>
Date: Wed, 4 Apr 2018 11:21:54 -0600

> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.

I agree on this.

What I really don't understand still is the use case... really.

So there are control netdevs, what exactly is the problem with that?

Are we not exporting enough information for applications to handle
these devices sanely?  If so, then let's add that information.

We can set netdev->type to ETH_P_LINUXCONTROL or something like that.

Another alternative is to add an interface flag like IFF_CONTROL or
similar, and that probably is much nicer.

Hiding the devices means that we acknowledge that applications are
currently broken with control netdevs... and we want them to stay
broken!

That doesn't sound like a good plan to me.

So let's fix handling of control netdevs instead of hiding them.

Thanks.
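
A rough userspace sketch of the flag-based alternative, for illustration:
a daemon such as lldpd would test a flag and skip control netdevs rather
than relying on the kernel to hide them. No IFF_CONTROL exists today, so
EXAMPLE_IFF_CONTROL and its bit value are placeholders, and should_manage()
and print_managed() are hypothetical helpers. How such a flag would really
be exported to userspace is left open here.

```c
#include <sys/types.h>
#include <ifaddrs.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag bit: no IFF_CONTROL is defined in mainline. */
#define EXAMPLE_IFF_CONTROL (1u << 31)

/* True if a daemon such as lldpd should operate on this interface. */
static bool should_manage(unsigned int ifflags)
{
    return (ifflags & EXAMPLE_IFF_CONTROL) == 0;
}

/*
 * Print the interfaces the daemon would manage.
 * getifaddrs() returns one entry per address, so an interface name
 * can be printed more than once.
 * Returns 0 on success, -1 if the interface list cannot be read.
 */
static int print_managed(void)
{
    struct ifaddrs *list, *ifa;

    if (getifaddrs(&list) != 0)
        return -1;

    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
        if (should_manage(ifa->ifa_flags))
            printf("%s\n", ifa->ifa_name);
    }

    freeifaddrs(list);
    return 0;
}
```

The device stays fully visible to the administrator in this model; only
applications that know about the flag change their behaviour.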

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04  8:28           ` [virtio-dev] " Siwei Liu
  (?)
@ 2018-04-04 17:37           ` David Ahern
  2018-04-04 17:42             ` David Miller
                               ` (5 more replies)
  -1 siblings, 6 replies; 109+ messages in thread
From: David Ahern @ 2018-04-04 17:37 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization

[ dropping virtio-dev@lists.oasis-open.org since it is a closed list and
I am tired of deleting bounces ]

On 4/4/18 2:28 AM, Siwei Liu wrote:
> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>
>>>> There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>> As mentioned in a previous response some kernel drivers create control
>> netdevs. Just as in this case users should not be mucking with it, and
>> S/W like lldpd should ignore it.
> 
> I'm still not sure I understand your case: why do you want to hide the
> control netdev, when I assume those devices could choose either to
> silently ignore the request or to fail loudly on user operations?
> Is it creating issues already, or what problem do you want to solve if
> not making the netdev invisible? Why couldn't lldpd check some
> specific flag and ignore the control netdevice? (Can you please give an
> example of a concrete driver for a control netdevice *in tree*?)
> 
> And I'm completely lost as to why you want an API to make a hidden
> netdev visible again for these control devices.

Networking vendors have out of tree kernel modules. Those modules use a
netdev (call it a master netdev, a control netdev, cpu port, whatever)
to pull packets from the ASIC and deliver to virtual netdevices
representing physical ports. The master netdev should not be mucked with
by a user. It should be ignored by certain s/w with lldpd as just an
*example*.

The short of it is that you have your reasons for wanting to hide the
virtio bypass device; other users have other arguments for wanting a
similar capability.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04 17:37           ` David Ahern
@ 2018-04-04 17:42             ` David Miller
  2018-04-04 17:42             ` David Miller
                               ` (4 subsequent siblings)
  5 siblings, 0 replies; 109+ messages in thread
From: David Miller @ 2018-04-04 17:42 UTC (permalink / raw)
  To: dsahern
  Cc: loseweigh, jiri, si-wei.liu, mst, stephen, alexander.h.duyck,
	jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
	virtualization

From: David Ahern <dsahern@gmail.com>
Date: Wed, 4 Apr 2018 11:37:52 -0600

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports. The master netdev should not be mucked with
> by a user. It should be ignored by certain s/w with lldpd as just an
> *example*.

Two approaches:

1) Add an IFF_CONTROL and make userspace understand this.  It is probably
   long overdue.

2) Design the driver properly.  Have a non-netdev master device like
   mlxsw does, and control it using devlink or similar.  This is exactly
   how this stuff was meant to be architected.
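
For illustration, approach 1 could look roughly like this from the
userspace side. This is a minimal sketch only: no IFF_CONTROL flag exists
in the kernel headers, so SKETCH_IFF_CONTROL and its bit value are made up
here, and sketch_should_manage() is a hypothetical helper. The flags word
is assumed to come from the 32-bit ifi_flags field of an RTM_NEWLINK
message.

```c
#include <assert.h>
#include <stdbool.h>

/* HYPOTHETICAL: there is no IFF_CONTROL in the kernel headers; this bit
 * value is invented purely for the sketch. */
#define SKETCH_IFF_CONTROL  (1u << 30)
/* Mirrors the real IFF_LOOPBACK value (0x8) from <net/if.h>. */
#define SKETCH_IFF_LOOPBACK 0x8u

/* Decide whether a daemon such as lldpd should operate on an interface.
 * The device stays fully visible to the admin; the tool merely learns
 * to leave it alone. */
bool sketch_should_manage(unsigned int flags)
{
    if (flags & SKETCH_IFF_LOOPBACK)
        return false;
    if (flags & SKETCH_IFF_CONTROL)
        return false;   /* control netdev: listed, but not managed */
    return true;
}
```

The point of the sketch is that the decision lives in the tool, not in
the kernel's listing code, so 'ip link' would still show every device.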

> From there I think you are confusing my intentions: I fundamentally do
> not believe the kernel should be hiding anything from an admin. Not
> showing data by default is completely different than not showing that
> data at all.

It is the same, David.

It means we have no intention of fixing applications to properly know
what to do with and how to handle these devices.

If you hide these objects, we are basically giving up on fixing the
tools and/or the drivers themselves to be architected differently
(see #2 above).

That really isn't acceptable in my opinion.

> The intention of my patch with the IFF_HIDDEN attribute is:
> 1. it is a netdev attribute
> 
> 2. that attribute can be used by userspace to indicate to the kernel I
> want all or I want the default
> 
> 3. that attribute can be controlled by an admin.
> 
> The patches go beyond my specific use case (preventing a user from
> modifying a netdev it should not be touching) but to defining the
> semantics of a generic capability which is what the kernel should have.

"Teach, do not hide!" -Yoda

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04 17:37           ` David Ahern
  2018-04-04 17:42             ` David Miller
  2018-04-04 17:42             ` David Miller
@ 2018-04-04 17:44             ` Stephen Hemminger
  2018-04-04 17:44             ` Stephen Hemminger
                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 109+ messages in thread
From: Stephen Hemminger @ 2018-04-04 17:44 UTC (permalink / raw)
  To: David Ahern
  Cc: Siwei Liu, Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization

On Wed, 4 Apr 2018 11:37:52 -0600
David Ahern <dsahern@gmail.com> wrote:

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports. The master netdev should not be mucked with
> by a user. It should be ignored by certain s/w with lldpd as just an
> *example*.

Sorry, the Linux kernel maintainers have a clear, well-defined attitude
about out-of-tree kernel modules...

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04 17:21           ` David Ahern
                               ` (2 preceding siblings ...)
  2018-04-04 18:02             ` Siwei Liu
@ 2018-04-04 18:02             ` Siwei Liu
  3 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-04 18:02 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization

On Wed, Apr 4, 2018 at 10:21 AM, David Ahern <dsahern@gmail.com> wrote:
> On 4/4/18 1:36 AM, Siwei Liu wrote:
>> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsahern@gmail.com> wrote:
>>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>>
>>>>> There are other use cases that want to hide a device from userspace. I
>>>>
>>>> What usecases do you have in mind?
>>>
>>> As mentioned in a previous response some kernel drivers create control
>>> netdevs. Just as in this case users should not be mucking with it, and
>>> S/W like lldpd should ignore it.
>>>
>>>>
>>>>> would prefer a better solution than playing games with name prefixes and
>>>>> one that includes an API for users to list all devices -- even ones
>>>>> hidden by default.
>>>>
>>>> Netdevice hiding feels a bit scary to me. This smells like a workaround
>>>> for userspace issues. Why can't the netdevice be visible always and
>>>> userspace would know what it is and what it should do with it?
>>>>
>>>> Once we start with hiding, there are other things related to that which
>>>> appear. Like who can see what, levels of visibility etc...
>>>>
>>>
>>> I would not advocate for any API that does not allow users to have full
>>> introspection. The intent is to hide the netdev by default but have an
>>> option to see it.
>>
>> I'm fine with having a link dump API to inspect the hidden netdev. As
>> said, the name for hidden netdevs should be in a separate device
>> namespace, and we did not even get closer to what it should look like
>> as I don't want to make it just an option for ip link. Perhaps a new
>> set of sub-commands of, say, 'ip device'.
>
> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.

If you're still thinking the visibility is part of link attribute
rather than a separate namespace, I'd say we are trying to solve
essentially different problems, and I really don't understand your
proposal in solving that problem to be honest.

So, let's step back on studying your case if that's the right thing
and let's talk about concrete examples.

-Siwei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04 17:37             ` David Miller
@ 2018-04-04 18:20               ` Jiri Pirko
  2018-04-04 18:20               ` Jiri Pirko
  2018-04-07  2:32                 ` [virtio-dev] " Siwei Liu
  2 siblings, 0 replies; 109+ messages in thread
From: Jiri Pirko @ 2018-04-04 18:20 UTC (permalink / raw)
  To: David Miller
  Cc: dsahern, loseweigh, si-wei.liu, mst, stephen, alexander.h.duyck,
	jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
	virtualization, virtio-dev

Wed, Apr 04, 2018 at 07:37:49PM CEST, davem@davemloft.net wrote:
>From: David Ahern <dsahern@gmail.com>
>Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
>I agree on this.
>
>What I really don't understand still is the use case... really.
>
>So there are control netdevs, what exactly is the problem with that?
>
>Are we not exporting enough information for applications to handle
>these devices sanely?  If so, then let's add that information.
>
>We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
>Another alternative is to add an interface flag like IFF_CONTROL or
>similar, and that probably is much nicer.
>
>Hiding the devices means that we acknowledge that applications are
>currently broken with control netdevs... and we want them to stay
>broken!
>
>That doesn't sound like a good plan to me.
>
>So let's fix handling of control netdevs instead of hiding them.

Exactly. Don't work around userspace issues with kernel patches.
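
As a rough sketch of the "export more information" route quoted above:
a tool could read the link type the kernel already publishes in
/sys/class/net/<ifname>/type (dev->type normally carries ARPHRD_* values)
and skip a dedicated control type. SKETCH_ARPHRD_CONTROL is a made-up
value, not an existing constant, and both helper names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_ARPHRD_ETHER   1        /* real ARPHRD_ETHER value */
#define SKETCH_ARPHRD_CONTROL 0xFFF0   /* HYPOTHETICAL control-netdev type */

/* Parse the decimal text read from /sys/class/net/<ifname>/type.
 * Returns the type value, or -1 if the text is not a clean number. */
long sketch_parse_link_type(const char *text)
{
    assert(text != NULL);

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);

    if (errno != 0 || end == text || value < 0)
        return -1;
    if (*end == '\n')       /* sysfs attributes end with a newline */
        end++;
    if (*end != '\0')
        return -1;
    return value;
}

/* True if the parsed type marks a (hypothetical) control netdev. */
bool sketch_is_control_type(long type)
{
    return type == SKETCH_ARPHRD_CONTROL;
}
```

With something like this, an LLDP daemon could keep seeing the device
and simply decline to run on it.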

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04 17:37           ` David Ahern
                               ` (4 preceding siblings ...)
  2018-04-04 20:08             ` Andrew Lunn
@ 2018-04-04 20:08             ` Andrew Lunn
  5 siblings, 0 replies; 109+ messages in thread
From: Andrew Lunn @ 2018-04-04 20:08 UTC (permalink / raw)
  To: David Ahern
  Cc: Siwei Liu, Jiri Pirko, Si-Wei Liu, Michael S. Tsirkin,
	Stephen Hemminger, Alexander Duyck, David Miller, Brandeburg,
	Jesse, Jakub Kicinski, Jason Wang, Samudrala, Sridhar, Netdev,
	virtualization

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports.

Sounds a lot like DSA. Please ask the vendor to contribute the drivers
:-)

> The master netdev should not be mucked with by a user. It should be
> ignored by certain s/w with lldpd as just an *example*.

I have come across occasional problems with the master device in DSA.
But nothing too serious. Generally the switch will just toss frames it
gets which don't have the needed header, when they come direct from
the master device, rather than via the slave devices.
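
To make the "toss frames without the needed header" behaviour concrete,
here is a small sketch of the kind of check a switch might apply. The tag
format is an assumption: real DSA tagging protocols differ per vendor
(some use trailers instead of a header), and SKETCH_TAG_ETHERTYPE is a
made-up value rather than any real protocol's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_HLEN      14        /* dst MAC + src MAC + EtherType */
#define SKETCH_TAG_ETHERTYPE 0x9999u   /* HYPOTHETICAL switch tag marker */

/* Return true if the frame carries the (hypothetical) switch tag in the
 * EtherType position. Frames sent straight on the master device by a
 * user would lack it and be dropped. */
bool sketch_switch_accepts(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);

    if (len < SKETCH_ETH_HLEN)
        return false;

    /* EtherType is big-endian at byte offsets 12 and 13. */
    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    return ethertype == SKETCH_TAG_ETHERTYPE;
}
```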

    Andrew

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device
  2018-04-04  8:02       ` Siwei Liu
  (?)
@ 2018-04-05 15:31       ` Paolo Bonzini
  2018-04-07  2:54         ` Siwei Liu
  2018-04-07  2:54           ` [virtio-dev] " Siwei Liu
  -1 siblings, 2 replies; 109+ messages in thread
From: Paolo Bonzini @ 2018-04-05 15:31 UTC (permalink / raw)
  To: Siwei Liu, Michael S. Tsirkin
  Cc: Si-Wei Liu, Jiri Pirko, Stephen Hemminger, Alexander Duyck,
	David Miller, Brandeburg, Jesse, Jakub Kicinski, Jason Wang,
	Samudrala, Sridhar, Netdev, virtualization, virtio-dev

On 04/04/2018 10:02, Siwei Liu wrote:
>> pci_bus_num is almost always a bug if not done within
>> a context of a PCI host, bridge, etc.
>>
>> In particular this will not DTRT if run before guest assigns bus
>> numbers.
>>
> I was seeking means to reserve a specific pci bus slot from drivers,
> and update the driver when guest assigns the bus number but it seems
> there's no low-hanging fruits. Because of that reason the bus_num is
> only obtained until it's really needed (during get_config) and I
> assume at that point the pci bus assignment is already done. I know
> the current one is not perfect, but we need that information (PCI
> bus:slot.func number) to name the guest device correctly.

Can you use the -device "id", and look it up as

    devices = container_get(qdev_get_machine(), "/peripheral");
    return object_resolve_path_component(devices, id);

?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
       [not found]       ` <20180403160834.51594373@xeon-e3>
@ 2018-04-06 21:29           ` Siwei Liu
  2018-04-06 21:29           ` [virtio-dev] " Siwei Liu
  1 sibling, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-06 21:29 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael S. Tsirkin, Jiri Pirko, Alexander Duyck, David Miller,
	Brandeburg, Jesse, Jakub Kicinski, Jason Wang, Samudrala,
	Sridhar, Netdev, virtualization, virtio-dev, si-wei liu

(put discussions back on list which got accidentally removed)

On Tue, Apr 3, 2018 at 4:08 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 3 Apr 2018 12:04:38 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Tue, Apr 3, 2018 at 10:35 AM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Sun,  1 Apr 2018 05:13:09 -0400
>> > Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> >
>> >> Hidden netdevice is not visible to userspace such that
>> >> typical network utilities, e.g. ip, ifconfig et al.,
>> >> cannot sense its existence or configure it. Internally
>> >> hidden netdev may associate with an upper level netdev
>> >> that userspace has access to. Although userspace cannot
>> >> manipulate the lower netdev directly, user may control
>> >> or configure the underlying hidden device through the
>> >> upper-level netdev. For identification purpose, the
>> >> kobject for hidden netdev still presents in the sysfs
>> >> hierarchy, however, no uevent message will be generated
>> >> when the sysfs entry is created, modified or destroyed.
>> >>
>> >> For that end, a separate namescope needs to be carved
>> >> out for IFF_HIDDEN netdevs. As of now netdev name that
>> >> starts with colon i.e. ':' is invalid in userspace,
>> >> since socket ioctls such as SIOCGIFCONF use ':' as the
>> >> separator for ifname. The absence of namescope started
>> >> with ':' can rightly be used as the namescope for
>> >> the kernel-only IFF_HIDDEN netdevs.
>> >>
>> >> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> >> ---
>> >
>> > I understand the use case. I proposed using . as a prefix before
>> > but that ran into resistance. Using colon seems worse.
>>
>> Using dot (.) can't be good because it would cause namespace collision
>> and thus breaking apps when you hide the device. Imagine a user really
>> wants to add a link with the same name as the one hidden and it starts
>> with a dot. It would fail, and users don't know it's just because the
>> name starts with dot. IMHO users should be agnostic of (the namespace
>> of) hidden device at all if what they pick is a valid name.
>>
>> ":" is an invalid prefix to userspace, there's no such problem if
>> being used to construct the namescope for hidden devices.
>>
>> However, technically, just as what I alluded to in the reply earlier,
>> it might really be consistent to put this under a separate namespace
>> rather than fiddling with name prefix. But I am just not sure if that
>> is a big hammer and would like to earn enough feedback and attention
>> before going that way too quickly.
>>
>>
>> >
>> > Rather than playing with names and all the issues that can cause,
>> > why not make it an attribute flag of the device in netlink.
>>
>> Attribute flag doesn't help. It's a matter of namespace.
>>
>> Regards,
>> -Siwei
>
> In Vyatta, we used names like ".isatap" for devices that would clutter up
> the user experience. They are naturally not visible by simple scans of
> /sys/class/net, and there was a patch to ignore them in iproute2.
> It was a hack which worked but not really worth upstreaming.
>
> The question is if this a security feature then it needs to be more

I don't expect the namespace to be a security feature, but rather a way
to let old, unmodified userspace work with a new feature. We're going
to add an API to expose the netdev info for the invisible
IFF_AUTO_MANAGED links anyway, so we don't need to make it secure and
keep everything hidden in the dark, to be honest.
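
To illustrate the name-prefix argument quoted above, here is a sketch of
a userspace-facing name check, loosely modeled on what I understand the
kernel's dev_valid_name() to do. The function names are hypothetical and
this is not the kernel code itself. It shows why a dot prefix collides
with names a user may legitimately pick, while a colon prefix cannot.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_IFNAMSIZ 16   /* same value as the kernel's IFNAMSIZ */

/* True if userspace could create a netdev with this name. */
bool sketch_name_ok_for_userspace(const char *name)
{
    assert(name != NULL);

    size_t len = strlen(name);
    if (len == 0 || len >= SKETCH_IFNAMSIZ)
        return false;
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return false;

    for (const char *p = name; *p != '\0'; p++) {
        /* ':' is rejected because it is the alias separator used by
         * the old socket ioctls. */
        if (*p == '/' || *p == ':' || isspace((unsigned char)*p))
            return false;
    }
    return true;
}

/* True if a plain directory scan (as 'ls' does) would skip the entry. */
bool sketch_is_dotfile(const char *name)
{
    assert(name != NULL);
    return name[0] == '.';
}
```

Under these assumptions ".isatap" passes the check, so a hidden device
using that name would collide with a user's choice, whereas a name
starting with ':' is never valid for userspace to begin with.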

> robust than just name prefix. Plus it took years to handle network
> namespaces everywhere; this kind of flag would start same problems.
>
> Network namespaces work but have the problem namespaces only weakly
> support hierarchy and nesting. I prefer the namespace approach
> because it fits better and has less impact.

Great, thanks!

-Siwei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
       [not found]       ` <20180403160834.51594373@xeon-e3>
@ 2018-04-06 21:29         ` Siwei Liu
  2018-04-06 21:29           ` [virtio-dev] " Siwei Liu
  1 sibling, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-06 21:29 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
	si-wei liu, David Miller

(put discussions back on list which got accidentally removed)

On Tue, Apr 3, 2018 at 4:08 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 3 Apr 2018 12:04:38 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Tue, Apr 3, 2018 at 10:35 AM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Sun,  1 Apr 2018 05:13:09 -0400
>> > Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> >
>> >> Hidden netdevice is not visible to userspace such that
>> >> typical network utilites e.g. ip, ifconfig and et al,
>> >> cannot sense its existence or configure it. Internally
>> >> hidden netdev may associate with an upper level netdev
>> >> that userspace has access to. Although userspace cannot
>> >> manipulate the lower netdev directly, user may control
>> >> or configure the underlying hidden device through the
>> >> upper-level netdev. For identification purpose, the
>> >> kobject for hidden netdev still presents in the sysfs
>> >> hierarchy, however, no uevent message will be generated
>> >> when the sysfs entry is created, modified or destroyed.
>> >>
>> >> For that end, a separate namescope needs to be carved
>> >> out for IFF_HIDDEN netdevs. As of now netdev name that
>> >> starts with colon i.e. ':' is invalid in userspace,
>> >> since socket ioctls such as SIOCGIFCONF use ':' as the
>> >> separator for ifname. The absence of namescope started
>> >> with ':' can rightly be used as the namescope for
>> >> the kernel-only IFF_HIDDEN netdevs.
>> >>
>> >> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> >> ---
>> >
>> > I understand the use case. I proposed using . as a prefix before
>> > but that ran into resistance. Using colon seems worse.
>>
>> Using dot (.) can't be good because it would cause namespace collision
>> and thus breaking apps when you hide the device. Imagine a user really
>> wants to add a link with the same name as the one hidden and it starts
>> with a dot. It would fail, and users don't know its just because the
>> name starts with dot. IMHO users should be agnostic of (the namespace
>> of) hidden device at all if what they pick is a valid name.
>>
>> ":" is an invalid prefix to userspace, there's no such problem if
>> being used to construct the namescope for hidden devices.
>>
>> However, technically, just as what I alluded to in the reply earlier,
>> it might really be consistent to put this under a separeate namespace
>> instead than fiddling with name prefix. But I am just not sure if that
>> is a big hammer and would like to earn enough feedback and attention
>> before going that way too quickly.
>>
>>
>> >
>> > Rather than playing with names and all the issues that can cause,
>> > why not make it an attribute flag of the device in netlink.
>>
>> Atrribute flag doesn't help. It's a matter of namespace.
>>
>> Regards,
>> -Siwei
>
> In Vyatta, we used names like ".isatap" for devices that would clutter up
> the user experience. They are naturally not visible by simple scans of
> /sys/class/net, and there was a patch to ignore them in iproute2.
> It was a hack which worked but not really worth upstreaming.
>
> The question is if this a security feature then it needs to be more

I don't expect the namespace to be a security aspect of feature, but
rather a way to make old userspace unmodified  to work with a new
feature. And, we're going to add API to expose the netdev info for the
invisible IFF_AUTO_MANAGED links anyway. We don't need to make it
secure and all hidden under the dark to be honest.

> robust than just name prefix. Plus it took years to handle network
> namespaces everywhere; this kind of flag would start same problems.
>
> Network namespaces work but have the problem namespaces only weakly
> support hierarchy and nesting. I prefer the namespace approach
> because it fits better and has less impact.

Great, thanks!

-Siwei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
@ 2018-04-06 21:29           ` Siwei Liu
  0 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-06 21:29 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael S. Tsirkin, Jiri Pirko, Alexander Duyck, David Miller,
	Brandeburg, Jesse, Jakub Kicinski, Jason Wang, Samudrala,
	Sridhar, Netdev, virtualization, virtio-dev, si-wei liu

(put discussions back on list which got accidentally removed)

On Tue, Apr 3, 2018 at 4:08 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 3 Apr 2018 12:04:38 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Tue, Apr 3, 2018 at 10:35 AM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Sun,  1 Apr 2018 05:13:09 -0400
>> > Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> >
>> >> Hidden netdevice is not visible to userspace such that
>> >> typical network utilites e.g. ip, ifconfig and et al,
>> >> cannot sense its existence or configure it. Internally
>> >> hidden netdev may associate with an upper level netdev
>> >> that userspace has access to. Although userspace cannot
>> >> manipulate the lower netdev directly, user may control
>> >> or configure the underlying hidden device through the
>> >> upper-level netdev. For identification purpose, the
>> >> kobject for hidden netdev still presents in the sysfs
>> >> hierarchy, however, no uevent message will be generated
>> >> when the sysfs entry is created, modified or destroyed.
>> >>
>> >> For that end, a separate namescope needs to be carved
>> >> out for IFF_HIDDEN netdevs. As of now netdev name that
>> >> starts with colon i.e. ':' is invalid in userspace,
>> >> since socket ioctls such as SIOCGIFCONF use ':' as the
>> >> separator for ifname. The absence of namescope started
>> >> with ':' can rightly be used as the namescope for
>> >> the kernel-only IFF_HIDDEN netdevs.
>> >>
>> >> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> >> ---
>> >
>> > I understand the use case. I proposed using . as a prefix before
>> > but that ran into resistance. Using colon seems worse.
>>
>> Using dot (.) can't be good because it would cause namespace collision
>> and thus breaking apps when you hide the device. Imagine a user really
>> wants to add a link with the same name as the one hidden and it starts
>> with a dot. It would fail, and users don't know its just because the
>> name starts with dot. IMHO users should be agnostic of (the namespace
>> of) hidden device at all if what they pick is a valid name.
>>
>> ":" is an invalid prefix to userspace, there's no such problem if
>> being used to construct the namescope for hidden devices.
>>
>> However, technically, just as what I alluded to in the reply earlier,
>> it might really be consistent to put this under a separate namespace
>> rather than fiddling with name prefix. But I am just not sure if that
>> is a big hammer and would like to earn enough feedback and attention
>> before going that way too quickly.
>>
>>
>> >
>> > Rather than playing with names and all the issues that can cause,
>> > why not make it an attribute flag of the device in netlink.
>>
>> Attribute flag doesn't help. It's a matter of namespace.
>>
>> Regards,
>> -Siwei
>
> In Vyatta, we used names like ".isatap" for devices that would clutter up
> the user experience. They are naturally not visible by simple scans of
> /sys/class/net, and there was a patch to ignore them in iproute2.
> It was a hack which worked but not really worth upstreaming.
>
> The question is if this a security feature then it needs to be more

I don't expect the namespace to be a security feature, but rather a
way to let old, unmodified userspace work with a new feature. And
we're going to add an API to expose the netdev info for the invisible
IFF_AUTO_MANAGED links anyway. To be honest, we don't need to make it
secure or keep everything hidden in the dark.

> robust than just name prefix. Plus it took years to handle network
> namespaces everywhere; this kind of flag would start same problems.
>
> Network namespaces work but have the problem namespaces only weakly
> support hierarchy and nesting. I prefer the namespace approach
> because it fits better and has less impact.

Great, thanks!

-Siwei

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-04 17:37             ` David Miller
@ 2018-04-07  2:32                 ` Siwei Liu
  2018-04-04 18:20               ` Jiri Pirko
  2018-04-07  2:32                 ` [virtio-dev] " Siwei Liu
  2 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-07  2:32 UTC (permalink / raw)
  To: David Miller
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
	David Ahern, si-wei liu

On Wed, Apr 4, 2018 at 10:37 AM, David Miller <davem@davemloft.net> wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
> I agree on this.

I'm completely fine with having an API for inspection purposes. The
thing is that we'd perhaps need to go for the namespace approach: I
think everyone seems to agree not to fiddle with the ":" prefix, but
rather to have a new class of network subsystem under /sys/class, i.e.
a separate device namespace such as /sys/class/net-kernel for those
auto-managed lower netdevs.

And I assume everyone here understands the use case for live migration
(in the context of providing cloud service) is very different, and we
have to hide the netdevs. If not, I'm more than happy to clarify.

With that in mind, if we have a new net-kernel class namespace, we
can name the kernel device elaborately, with a name that is not
necessarily equal to the device name exposed to userspace. For
example, we can use the driver name as the prefix as opposed to "eth"
or ":eth". And we don't need to have auto-managed netdevs locked into
the ":" prefix at all (I intentionally left it out in this RFC patch
to ask for comments on the namespace solution, which is much cleaner).
That said, a device named by userspace through udev may be called
something like ens3 or switch1-port2, while in the kernel-net
namespace it may look like ixgbevf0 or mlxsw1p2.
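
Just to illustrate the naming idea, a rough userspace sketch follows.
It is not part of the RFC: kernel_devname() is a made-up helper, and
the "<driver><index>" scheme is only my assumption of how such names
could be built within the usual IFNAMSIZ limit.

```c
#include <stdio.h>
#include <string.h>

#define IFNAMSIZ 16 /* same length limit the kernel applies to ifnames */

/*
 * Hypothetical helper: build a kernel-side name "<driver><index>",
 * e.g. "ixgbevf0". Returns 0 on success, -1 if the result would not
 * fit in the buffer (including the terminating NUL).
 */
static int kernel_devname(char *buf, size_t len,
			  const char *driver, unsigned int idx)
{
	int n = snprintf(buf, len, "%s%u", driver, idx);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

Calling kernel_devname(buf, IFNAMSIZ, "ixgbevf", 0) fills buf with
"ixgbevf0"; a driver name too long for IFNAMSIZ is rejected rather
than silently truncated.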

So if we all agree that introducing a new namespace is the right thing
to do, `ip link' will no longer serve the purpose of displaying the
information for kernel-net devnames, for the sake of avoiding
ambiguity and namespace collision: it's entirely possible that an ip
link name could collide with a kernel-net devname, and then it becomes
unclear which name of a netdev object the command is expected to
operate on. That's why I thought showing the kernel-only netdevs using
a separate subcommand makes more sense.
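
To make the collision concrete, here is a minimal sketch of a lookup
over two name spaces. Everything in it (struct netdev_names, enum
name_space, lookup()) is hypothetical and only models the idea; it is
not how the kernel stores names today.

```c
#include <stddef.h>
#include <string.h>

enum name_space { NS_USER, NS_KERNEL };

/* Hypothetical model: every device has a kernel name; only devices
 * visible to userspace also carry a user name. */
struct netdev_names {
	const char *kernel_name; /* always present */
	const char *user_name;   /* NULL for auto-managed (hidden) devices */
};

/* Resolve a name within exactly one name space; NULL if not found. */
static const struct netdev_names *
lookup(const struct netdev_names *devs, size_t n,
       enum name_space ns, const char *name)
{
	for (size_t i = 0; i < n; i++) {
		const char *cand = (ns == NS_KERNEL) ? devs[i].kernel_name
						     : devs[i].user_name;

		if (cand && strcmp(cand, name) == 0)
			return &devs[i];
	}
	return NULL;
}
```

If a hidden VF has the kernel name "ixgbevf0" and a user happens to
rename another link to "ixgbevf0", the same string resolves to two
different devices depending on the name space. A single command that
accepts either kind of name cannot tell which one is meant.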

Thoughts and comments? Please let me know.

Thanks,
-Siwei

>
> What I really don't understand still is the use case... really.
>
> So there are control netdevs, what exactly is the problem with that?
>
> Are we not exporting enough information for applications to handle
> these devices sanely?  If so, then let's add that information.
>
> We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
> Another alternative is to add an interface flag like IFF_CONTROL or
> similar, and that probably is much nicer.
>
> Hiding the devices means that we acknowledge that applications are
> currently broken with control netdevs... and we want them to stay
> broken!
>
> That doesn't sound like a good plan to me.
>
> So let's fix handling of control netdevs instead of hiding them.
>
> Thanks.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device
  2018-04-05 15:31       ` Paolo Bonzini
@ 2018-04-07  2:54           ` Siwei Liu
  2018-04-07  2:54           ` [virtio-dev] " Siwei Liu
  1 sibling, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-07  2:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, Si-Wei Liu, Jiri Pirko, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev

(clicked the wrong reply button again, sorry)


On Thu, Apr 5, 2018 at 8:31 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 04/04/2018 10:02, Siwei Liu wrote:
>>> pci_bus_num is almost always a bug if not done within
>>> a context of a PCI host, bridge, etc.
>>>
>>> In particular this will not DTRT if run before guest assigns bus
>>> numbers.
>>>
>> I was seeking means to reserve a specific pci bus slot from drivers,
>> and update the driver when guest assigns the bus number but it seems
>> there's no low-hanging fruits. Because of that reason the bus_num is
>> only obtained until it's really needed (during get_config) and I
>> assume at that point the pci bus assignment is already done. I know
>> the current one is not perfect, but we need that information (PCI
>> bus:slot.func number) to name the guest device correctly.
>
> Can you use the -device "id", and look it up as
>
>     devices = container_get(qdev_get_machine(), "/peripheral");
>     return object_resolve_path_component(devices, id);


No. The problem with using the device id is that the vfio device may
come and go at any time; this is particularly true when live migration
is happening. There's no guarantee we can get the bus:device.func info
if that device is gone. Currently the binding between vfio and
virtio-net is weakly coupled through the backup property, so there's
no better way than specifying the bus id and addr property directly.
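
For illustration only: if the address is handed over as a
"bus:slot.func" string, parsing and range-checking it could look
roughly like the sketch below. parse_bdf() and struct pci_bdf are
made-up names, not QEMU code, and the hex "bb:ss.f" layout is assumed
to match what lspci prints.

```c
#include <stdio.h>

/* Hypothetical container for a PCI address. */
struct pci_bdf {
	unsigned int bus;  /* 0x00..0xff */
	unsigned int slot; /* 0x00..0x1f */
	unsigned int func; /* 0x0..0x7 */
};

/* Parse "bb:ss.f" (hex fields). Returns 0 on success, -1 on a
 * malformed string, trailing characters or out-of-range fields. */
static int parse_bdf(const char *s, struct pci_bdf *out)
{
	unsigned int bus, slot, func;
	char trailing;

	/* A fourth conversion succeeding means trailing characters. */
	if (sscanf(s, "%x:%x.%x%c", &bus, &slot, &func, &trailing) != 3)
		return -1;
	if (bus > 0xff || slot > 0x1f || func > 0x7)
		return -1;

	out->bus = bus;
	out->slot = slot;
	out->func = func;
	return 0;
}
```

The point is that the address is available from the command line even
when the vfio device itself is currently unplugged.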

Regards,
-Siwei

>
> ?
>
> Thanks,
>
> Paolo

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-07  2:32                 ` [virtio-dev] " Siwei Liu
  (?)
@ 2018-04-07  3:19                 ` Andrew Lunn
  2018-04-09 22:07                   ` Siwei Liu
  2018-04-09 22:07                     ` [virtio-dev] " Siwei Liu
  -1 siblings, 2 replies; 109+ messages in thread
From: Andrew Lunn @ 2018-04-07  3:19 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
	David Ahern, si-wei liu, David Miller

Hi Siwei

> I think everyone seems to agree not to fiddle with the ":" prefix, but
> rather have a new class of network subsystem under /sys/class thus a
> separate device namespace e.g. /sys/class/net-kernel for those
> auto-managed lower netdevs is needed.
 
How do you get a device into this new class? I don't know the Linux
driver model too well, but to get a device out of one class and into
another, i think you need to device_del(dev). modify dev->class and
then device_add(dev). However, device_add() says you are not allowed
to do this.

And i don't even see how this helps. Are you also not going to call
list_netdevice()? Are you going to add some other list for these
devices in a different class?

   Andrew

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-07  2:32                 ` [virtio-dev] " Siwei Liu
  (?)
  (?)
@ 2018-04-08 16:32                 ` David Miller
  2018-04-10  6:48                     ` [virtio-dev] " Siwei Liu
  2018-04-10  6:48                   ` Siwei Liu
  -1 siblings, 2 replies; 109+ messages in thread
From: David Miller @ 2018-04-08 16:32 UTC (permalink / raw)
  To: loseweigh
  Cc: dsahern, jiri, si-wei.liu, mst, stephen, alexander.h.duyck,
	jesse.brandeburg, kubakici, jasowang, sridhar.samudrala, netdev,
	virtualization, virtio-dev

From: Siwei Liu <loseweigh@gmail.com>
Date: Fri, 6 Apr 2018 19:32:05 -0700

> And I assume everyone here understands the use case for live
> migration (in the context of providing cloud service) is very
> different, and we have to hide the netdevs. If not, I'm more than
> happy to clarify.

I think you still need to clarify.

netdevs are netdevs.  If they have special attributes, mark them as
such and the tools base their actions upon that.

"Hiding", or changing classes, doesn't make any sense to me still.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-07  3:19                 ` Andrew Lunn
@ 2018-04-09 22:07                     ` Siwei Liu
  2018-04-09 22:07                     ` [virtio-dev] " Siwei Liu
  1 sibling, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-09 22:07 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Michael S. Tsirkin, Stephen Hemminger, Alexander Duyck,
	Brandeburg, Jesse, Jakub Kicinski, Jason Wang, Samudrala,
	Sridhar, Netdev, virtualization, virtio-dev

On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> Hi Siwei
>
>> I think everyone seems to agree not to fiddle with the ":" prefix, but
>> rather have a new class of network subsystem under /sys/class thus a
>> separate device namespace e.g. /sys/class/net-kernel for those
>> auto-managed lower netdevs is needed.
>
> How do you get a device into this new class? I don't know the Linux
> driver model too well, but to get a device out of one class and into
> another, i think you need to device_del(dev). modify dev->class and
> then device_add(dev). However, device_add() says you are not allowed
> to do this.

No, implementation-wise I'd avoid changing the class on the fly. What
I'm looking for is a means to add a secondary class or class aliasing
mechanism for netdevs that allows mapping from a kernel device
namespace (/class/net-kernel) to userspace (/class/net). Imagine
creating symlinks between these two namespaces as an analogy. All
netdevs visible to userspace today will have both a kernel name and a
userspace-visible name, with one (/class/net) referencing the other
(/class/net-kernel) in its own namespace. The newly introduced
IFF_AUTO_MANAGED device will have a kernel name only
(/class/net-kernel). As a result, the existing applications using
/class/net don't break, while we add a kernel namespace that can hold
IFF_AUTO_MANAGED devices, which will not be exposed to userspace at
all.

As this requires changing the internals of the driver model core, it's
a rather big-hammer approach, I'd think. If there exists a better
implementation than this to allow adding a separate layer of in-kernel
device namespace, I'd be more than happy to hear about it.

>
> And i don't even see how this helps. Are you also not going to call
> list_netdevice()? Are you going to add some other list for these
> devices in a different class?

list_netdevice() is still called. With the current RFC patch, I've
added two lists for netdevs under the kernel namespace: dev_cmpl_list
and name_cmpl_hlist. As a result, every userspace netdev that gets
registered will be added to two types of lists: the userspace list,
e.g. dev_list, and also the kernelspace list, e.g. dev_cmpl_list (I
can rename it to something more accurate). An IFF_AUTO_MANAGED device
will be added only to the kernelspace list, e.g. dev_cmpl_list.
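
A simplified userspace model of that registration rule is sketched
below. It is not the RFC code: the fixed-size arrays,
list_netdevice_sketch() and the IFF_AUTO_MANAGED value are
placeholders I picked for the example; only the list names come from
the patch.

```c
#include <stddef.h>

#define IFF_AUTO_MANAGED 0x1 /* placeholder value for this sketch only */
#define MAX_DEVS 8

struct netdev {
	const char *name;
	unsigned int flags;
};

struct dev_lists {
	const struct netdev *dev_list[MAX_DEVS];      /* visible to userspace */
	size_t n_dev;
	const struct netdev *dev_cmpl_list[MAX_DEVS]; /* complete kernel view */
	size_t n_cmpl;
};

/* Every device goes on the complete list; only devices without
 * IFF_AUTO_MANAGED also go on the userspace-visible list. */
static int list_netdevice_sketch(struct dev_lists *l,
				 const struct netdev *dev)
{
	if (l->n_cmpl == MAX_DEVS)
		return -1;

	l->dev_cmpl_list[l->n_cmpl++] = dev;
	if (!(dev->flags & IFF_AUTO_MANAGED))
		l->dev_list[l->n_dev++] = dev;
	return 0;
}
```

Registering an ordinary upper netdev and an auto-managed VF leaves two
entries on dev_cmpl_list but only the upper netdev on dev_list, so
anything that walks dev_list never sees the VF.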

Hope all your questions are answered.

Thanks,
-Siwei


>
>    Andrew

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-09 22:07                     ` [virtio-dev] " Siwei Liu
  (?)
  (?)
@ 2018-04-09 22:15                     ` Andrew Lunn
  2018-04-09 22:30                         ` [virtio-dev] " Siwei Liu
  -1 siblings, 1 reply; 109+ messages in thread
From: Andrew Lunn @ 2018-04-09 22:15 UTC (permalink / raw)
  To: Siwei Liu
  Cc: David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Michael S. Tsirkin, Stephen Hemminger, Alexander Duyck,
	Brandeburg, Jesse, Jakub Kicinski, Jason Wang, Samudrala,
	Sridhar, Netdev, virtualization, virtio-dev

> No, implementation wise I'd avoid changing the class on the fly. What
> I'm looking to is a means to add a secondary class or class aliasing
> mechanism for netdevs that allows mapping for a kernel device
> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> creating symlinks between these two namespaces as an analogy. All
> userspace visible netdevs today will have both a kernel name and a
> userspace visible name, having one (/class/net) referencing the other
> (/class/net-kernel) in its own namespace. The newly introduced
> IFF_AUTO_MANAGED device will have a kernel name only
> (/class/net-kernel). As a result, the existing applications using
> /class/net don't break, while we're adding the kernel namespace that
> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> at all.

My gut feeling is this whole scheme will not fly. You really should be
talking to GregKH.

Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
A device can start out as a normal device, and will change to being
automatic later, when the user on top of it probes.

	Andrew

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-09 22:15                     ` Andrew Lunn
@ 2018-04-09 22:30                         ` Siwei Liu
  0 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-09 22:30 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
	David Ahern, si-wei liu, David Miller

On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> No, implementation wise I'd avoid changing the class on the fly. What
>> I'm looking to is a means to add a secondary class or class aliasing
>> mechanism for netdevs that allows mapping for a kernel device
>> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> creating symlinks between these two namespaces as an analogy. All
>> userspace visible netdevs today will have both a kernel name and a
>> userspace visible name, having one (/class/net) referencing the other
>> (/class/net-kernel) in its own namespace. The newly introduced
>> IFF_AUTO_MANAGED device will have a kernel name only
>> (/class/net-kernel). As a result, the existing applications using
>> /class/net don't break, while we're adding the kernel namespace that
>> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> at all.
>
> My gut feeling is this whole scheme will not fly. You really should be
> talking to GregKH.

Will do. Before spreading it more widely I'd like to run it within
netdev to clarify why not exposing the lower netdevs is critical for
cloud service providers when introducing a new feature. We are not
hiding anything, but exposing it in a way that doesn't break existing
userspace applications, so that the feature can be introduced under
the constraint that old userspace stays unchanged.

>
> Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> A device can start out as a normal device, and will change to being
> automatic later, when the user on top of it probes.

Sure. In whatever form it's still a netdev, and changing the namespace
should be more dynamic than changing the class.

Thanks a lot,
-Siwei

>
>         Andrew

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-09 22:30                         ` [virtio-dev] " Siwei Liu
  (?)
@ 2018-04-09 23:03                         ` Stephen Hemminger
  2018-04-09 23:31                           ` Siwei Liu
  2018-04-09 23:31                             ` [virtio-dev] " Siwei Liu
  -1 siblings, 2 replies; 109+ messages in thread
From: Stephen Hemminger @ 2018-04-09 23:03 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Andrew Lunn, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Michael S. Tsirkin, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Samudrala, Sridhar, Netdev,
	virtualization, virtio-dev

On Mon, 9 Apr 2018 15:30:42 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> >> No, implementation wise I'd avoid changing the class on the fly. What
> >> I'm looking to is a means to add a secondary class or class aliasing
> >> mechanism for netdevs that allows mapping for a kernel device
> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> >> creating symlinks between these two namespaces as an analogy. All
> >> userspace visible netdevs today will have both a kernel name and a
> >> userspace visible name, having one (/class/net) referencing the other
> >> (/class/net-kernel) in its own namespace. The newly introduced
> >> IFF_AUTO_MANAGED device will have a kernel name only
> >> (/class/net-kernel). As a result, the existing applications using
> >> /class/net don't break, while we're adding the kernel namespace that
> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> >> at all.  
> >
> > My gut feeling is this whole scheme will not fly. You really should be
> > talking to GregKH.  
> 
> Will do. Before spreading it out loudly I'd run it within netdev to
> clarify the need for why not exposing the lower netdevs is critical
> for cloud service providers in the face of introducing a new feature,
> and we are not hiding anything but exposing it in a way that don't
> break existing userspace applications while introducing feature is
> possible with the limitation of keeping old userspace still.
> 
> >
> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> > A device can start out as a normal device, and will change to being
> > automatic later, when the user on top of it probes.  
> 
> Sure. In whatever form it's still a netdev, and changing the namespace
> should be more dynamic than changing the class.
> 
> Thanks a lot,
> -Siwei
> 
> >
> >         Andrew  

Also, remember for netdev's /sys is really a third class API.
The primary API's are netlink and ioctl. Also why not use existing
network namespaces rather than inventing a new abstraction?

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-09 23:03                         ` Stephen Hemminger
@ 2018-04-09 23:31                             ` Siwei Liu
  2018-04-09 23:31                             ` [virtio-dev] " Siwei Liu
  1 sibling, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-09 23:31 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Andrew Lunn, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Michael S. Tsirkin, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Samudrala, Sridhar, Netdev,
	virtualization, virtio-dev

On Mon, Apr 9, 2018 at 4:03 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Mon, 9 Apr 2018 15:30:42 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> >> No, implementation wise I'd avoid changing the class on the fly. What
>> >> I'm looking to is a means to add a secondary class or class aliasing
>> >> mechanism for netdevs that allows mapping for a kernel device
>> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> >> creating symlinks between these two namespaces as an analogy. All
>> >> userspace visible netdevs today will have both a kernel name and a
>> >> userspace visible name, having one (/class/net) referencing the other
>> >> (/class/net-kernel) in its own namespace. The newly introduced
>> >> IFF_AUTO_MANAGED device will have a kernel name only
>> >> (/class/net-kernel). As a result, the existing applications using
>> >> /class/net don't break, while we're adding the kernel namespace that
>> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> >> at all.
>> >
>> > My gut feeling is this whole scheme will not fly. You really should be
>> > talking to GregKH.
>>
>> Will do. Before spreading it out loudly I'd run it within netdev to
>> clarify the need for why not exposing the lower netdevs is critical
>> for cloud service providers in the face of introducing a new feature,
>> and we are not hiding anything but exposing it in a way that don't
>> break existing userspace applications while introducing feature is
>> possible with the limitation of keeping old userspace still.
>>
>> >
>> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
>> > A device can start out as a normal device, and will change to being
>> > automatic later, when the user on top of it probes.
>>
>> Sure. In whatever form it's still a netdev, and changing the namespace
>> should be more dynamic than changing the class.
>>
>> Thanks a lot,
>> -Siwei
>>
>> >
>> >         Andrew
>
> Also, remember for netdev's /sys is really a third class API.
> The primary API's are netlink and ioctl. Also why not use existing
> network namespaces rather than inventing a new abstraction?

Because we want to leave old userspace unmodified while making SR-IOV
live migration transparent to users. Specifically, we'd want old udevd
to skip the uevents for the lower netdevs, while also letting new
udevd name the bypass_master interface by referencing the PCI slot
information, which is only present in the sysfs entry for the
IFF_AUTO_MANAGED net device.

The problem with using a network namespace is that no sysfs entry will
be around for the IFF_AUTO_MANAGED netdev if we isolate it in a
separate netns, unless we generalize the scope of what netns is
designed for (isolation, I mean). For auto-managed netdevs we don't
necessarily want strict isolation, but rather a way of sticking to the
1-netdev model for the strict backward compatibility requirement of
the old userspace, while exposing the information in a way new
userspace understands.

Thanks,
-Siwei

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-08 16:32                 ` David Miller
@ 2018-04-10  6:48                     ` Siwei Liu
  2018-04-10  6:48                   ` Siwei Liu
  1 sibling, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-10  6:48 UTC (permalink / raw)
  To: David Miller
  Cc: David Ahern, Jiri Pirko, si-wei liu, Michael S. Tsirkin,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Samudrala, Sridhar, Netdev,
	virtualization, virtio-dev

On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
> From: Siwei Liu <loseweigh@gmail.com>
> Date: Fri, 6 Apr 2018 19:32:05 -0700
>
>> And I assume everyone here understands the use case for live
>> migration (in the context of providing cloud service) is very
>> different, and we have to hide the netdevs. If not, I'm more than
>> happy to clarify.
>
> I think you still need to clarify.

OK. The short answer is cloud users really want *transparent* live migration.

By being transparent I mean they don't and shouldn't care about the
existence or occurrence of live migration, but they do care if the
userspace toolstack and libraries have to be updated or modified,
which means potential dependency breakage for their applications. They
don't like any change to the userspace environment (existing apps
lift-and-shift, no recompilation, no re-packaging, no re-certification
needed), while hardly anyone cares about ABI or API compatibility at
the kernel level, as long as their applications don't break.

I agree the current bypass solution for SR-IOV live migration requires
guest cooperation, though that doesn't mean guest *userspace*
cooperation. As a matter of fact, technically it shouldn't involve
userspace at all to get SR-IOV migration working. It's the kernel that
does the real work. If I understand the goal of this in-kernel
approach correctly, it was meant to save userspace from modification
or corresponding toolstack support, as those 2 additional interfaces
are more a side product of this approach than something users need to
be aware of. All the user needs to deal with is one single interface,
and that's what they care about. It's more trouble than help when they
see that 2 extra interfaces are present. Management tools in the old
distros don't recognize them and try to bring up those extra
interfaces on their own. Various odd warnings start to spew out, and
there are a lot of caveats for the users to get around...

On the other hand, if we "teach" those cloud users to update the
userspace toolstack just in exchange for a feature they don't need, no
one is likely to embrace the change. As such there's just no real
value in adopting this in-kernel bypass facility for any cloud service
provider. It does not look more appealing than just configuring
generic bonding using its own set of daemons or scripts. But again,
cloud users don't welcome that facility, and basically it would run
into nearly the same set of problems if we leave userspace alone.

IMHO we're not hiding the devices; think of it as adding a feature
that is transparent to the user. Those auto-managed slaves are ones
users don't care about much, and the user is still able to see and
configure the lower netdevs if they really desire to do so. But
generally the target user for this feature won't need to know that.
Why would they care how many interfaces a VM virtually has rather than
how many interfaces are actually _usable_ to them?

Thanks,
-Siwei


>
> netdevs are netdevs.  If they have special attributes, mark them as
> such and the tools base their actions upon that.
>
> "Hiding", or changing classes, doesn't make any sense to me still.
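
For comparison, the alternative quoted above (mark the device with an
attribute and let tools act on it) could look roughly like the sketch
below on the tool side. Everything in it is hypothetical: the
EX_ATTR_AUTO_MANAGED bit, the ex_netdev structure and the helper names
are invented for illustration, and the thread does not settle how such
an attribute would be exported to userspace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL attribute bit; no value or export path is defined in the thread. */
#define EX_ATTR_AUTO_MANAGED (1u << 0)

/* HYPOTHETICAL view of a netdev as a management tool might see it. */
struct ex_netdev {
	const char *name;
	unsigned int attrs;	/* attributes as reported to the tool */
};

/* A tool leaves alone any device the kernel marked as auto-managed. */
bool ex_tool_should_configure(const struct ex_netdev *dev)
{
	assert(dev);
	return !(dev->attrs & EX_ATTR_AUTO_MANAGED);
}

/* Count the devices a tool would actually act on. */
size_t ex_count_configurable(const struct ex_netdev *devs, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (ex_tool_should_configure(&devs[i]))
			count++;
	return count;
}
```

The catch raised in this thread is that only updated tools would
perform such a check; unmodified tools in old distros would still see
and try to configure all three interfaces.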

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-10  6:48                     ` [virtio-dev] " Siwei Liu
@ 2018-04-18  0:26                       ` Siwei Liu
  -1 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-18  0:26 UTC (permalink / raw)
  To: David Miller
  Cc: David Ahern, Jiri Pirko, si-wei liu, Michael S. Tsirkin,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Samudrala, Sridhar, Netdev,
	virtualization, virtio-dev

I ran this by a few folks offline and gathered some good feedback
that I'd like to share, and thus revive the discussion.

First of all, as illustrated in the reply below, cloud service
providers require transparent live migration. Specifically, the main
target of our case is to support SR-IOV live migration via kernel
upgrade while keeping the userspace of old distros unmodified. If
this use case is simply not appealing enough for the mainline to
adopt, I will shut up and not continue discussing, although
technically it's entirely possible (and there's precedent in other
implementations) to do so to benefit any cloud service provider.

If it's just that the implementation of netdev hiding itself needs to
be improved, such as implementing it as an attribute flag or adding a
linkdump API, that's completely fine and we can look into that.
However, the specific issue that needs to be understood beforehand is
how to make transparent SR-IOV able to take over the name (and so
inherit all the configs) from the lower netdev, which needs some games
with uevents and name space reservation. So far I don't think it's
been well discussed.
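
To make the "attribute flag" option concrete, here is a minimal
userspace sketch in C. It is not code from this series: the struct,
the DEMO_IFF_HIDDEN bit and demo_linkdump() are hypothetical names
made up for illustration. It only shows the idea that a default dump
skips flagged devices while an explicit request still returns them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit marking an auto-managed lower netdev. */
#define DEMO_IFF_HIDDEN 0x1u

/* Hypothetical stand-in for a netdev as a dump routine sees it. */
struct demo_netdev {
	const char *name;
	unsigned int flags;
};

/*
 * Copy the names of the devices to report into out[] and return how
 * many were written. Flagged devices are skipped unless show_hidden
 * is non-zero, so they stay reachable for users who ask for them.
 */
size_t demo_linkdump(const struct demo_netdev *devs, size_t n,
		     int show_hidden, const char **out, size_t out_cap)
{
	size_t i, written = 0;

	assert(devs != NULL && out != NULL);
	for (i = 0; i < n && written < out_cap; i++) {
		if ((devs[i].flags & DEMO_IFF_HIDDEN) && !show_hidden)
			continue;
		out[written++] = devs[i].name;
	}
	return written;
}
```

With one upper device and two flagged lower devices, a default call
reports a single name, and a call with show_hidden set reports all
three. How a real tool would ask for the hidden ones is exactly the
open interface question.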

One thing in particular I'd like to point out is that the 3-netdev
model currently fails to address the core problem of live migration:
migration of hardware specific features/state, e.g. ethtool configs
and hardware offloading states. Only general network state (e.g. IP
address, gateway) associated with the bypass interface can be
migrated. As follow-up work, the bypass driver can/should be enhanced
to save and apply those hardware specific configs before or after
migration as needed. The transparent 1-netdev model being proposed as
part of this patch series will be able to solve that problem naturally
by making all hardware specific configurations go through the central
bypass driver, such that hardware configurations can be replayed when
a new VF or passthrough device gets plugged back in. Although the
corresponding function hasn't been implemented today, I'd like to
remind everyone that this is the core problem any live migration
proposal should address.
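
As a rough illustration of the record-and-replay idea, here is a
self-contained C sketch. It is a userspace model only, under the
assumption that every hardware setting is a simple name/value pair;
struct bypass_dev, struct lower_dev and the bypass_*() helpers are
hypothetical and do not exist in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SETTINGS 16

/* One recorded hardware setting, e.g. an ethtool-style toggle. */
struct hw_setting {
	char name[32];
	int value;
};

/* Hypothetical stand-in for a lower VF/passthrough device. */
struct lower_dev {
	struct hw_setting applied[MAX_SETTINGS];
	size_t n_applied;
};

/* Hypothetical central bypass device that owns the saved config. */
struct bypass_dev {
	struct hw_setting saved[MAX_SETTINGS];
	size_t n_saved;
	struct lower_dev *lower;	/* NULL while the VF is unplugged */
};

/* Apply one setting to a lower device, overwriting an older value. */
static int lower_apply(struct lower_dev *ld, const struct hw_setting *s)
{
	size_t i;

	for (i = 0; i < ld->n_applied; i++) {
		if (strcmp(ld->applied[i].name, s->name) == 0) {
			ld->applied[i].value = s->value;
			return 0;
		}
	}
	if (ld->n_applied == MAX_SETTINGS)
		return -1;
	ld->applied[ld->n_applied++] = *s;
	return 0;
}

/* Look up what a lower device currently has applied. */
int lower_get(const struct lower_dev *ld, const char *name, int *value)
{
	size_t i;

	for (i = 0; i < ld->n_applied; i++) {
		if (strcmp(ld->applied[i].name, name) == 0) {
			*value = ld->applied[i].value;
			return 0;
		}
	}
	return -1;
}

/* All configuration goes through here: record first, then forward. */
int bypass_set(struct bypass_dev *bd, const char *name, int value)
{
	struct hw_setting *slot = NULL;
	size_t i;

	for (i = 0; i < bd->n_saved; i++)
		if (strcmp(bd->saved[i].name, name) == 0)
			slot = &bd->saved[i];
	if (!slot) {
		if (bd->n_saved == MAX_SETTINGS)
			return -1;
		slot = &bd->saved[bd->n_saved++];
		strncpy(slot->name, name, sizeof(slot->name) - 1);
		slot->name[sizeof(slot->name) - 1] = '\0';
	}
	slot->value = value;
	return bd->lower ? lower_apply(bd->lower, slot) : 0;
}

/* Hot-plug of a new lower device: replay everything recorded so far. */
int bypass_plug(struct bypass_dev *bd, struct lower_dev *ld)
{
	size_t i;

	assert(ld != NULL);
	bd->lower = ld;
	for (i = 0; i < bd->n_saved; i++)
		if (lower_apply(ld, &bd->saved[i]) != 0)
			return -1;
	return 0;
}

/* Hot-unplug before migration: the saved settings are kept. */
void bypass_unplug(struct bypass_dev *bd)
{
	bd->lower = NULL;
}
```

In this model a setting made while the VF is unplugged is still
recorded, and a freshly plugged VF on the destination ends up with the
same settings as the old one. A real driver would also have to deal
with settings the new hardware cannot honor, which the sketch ignores.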

If it would make things clearer to defer netdev hiding until all
functionality regarding centralizing and replay is implemented,
we'd take that advice and move on to implementing those features
as follow-up patches. Once all needed features are done, we'd resume
the work on hiding the lower netdevs at that point. I think it would
be best for everyone to understand the big picture in advance before
going too far.

Thanks, comments welcome.

-Siwei


On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
>> From: Siwei Liu <loseweigh@gmail.com>
>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>
>>> And I assume everyone here understands the use case for live
>>> migration (in the context of providing cloud service) is very
>>> different, and we have to hide the netdevs. If not, I'm more than
>>> happy to clarify.
>>
>> I think you still need to clarify.
>
> OK. The short answer is cloud users really want *transparent* live migration.
>
> By being transparent it means they don't and shouldn't care about the
> existence and the occurrence of live migration, but they do if
> userspace toolstack and libraries have to be updated or modified,
> which means potential dependency breakage for their applications. They
> don't like any change to the userspace environment (existing apps
> lift-and-shift, no recompilation, no re-packaging, no re-certification
> needed), while hardly anyone cares about ABI or API compatibility at
> the kernel level, as long as their applications don't break.
>
> I agree the current bypass solution for SR-IOV live migration requires
> guest cooperation, though that doesn't mean guest *userspace*
> cooperation. As a matter of fact, technically it shouldn't involve
> userspace at all to get SR-IOV migration working. It's the kernel that
> does the real work. If I understand the goal of this in-kernel
> approach correctly, it was meant to save userspace from modification
> or corresponding toolstack support, as those 2 additional interfaces
> are more a side product of this approach than something users
> need to be aware of. All the user needs to deal with is one
> single interface, and that's what they care about. It's more trouble
> than help when they see 2 extra interfaces present. Management
> tools in the old distros don't recognize them and try to bring up
> those extra interfaces on their own. Various odd warnings start to spew
> out, and there are a lot of caveats for the users to get around...
>
> On the other hand, if we "teach" those cloud users to update the
> userspace toolstack just in exchange for a feature they don't need, no
> one is likely to embrace the change. As such there's just no real
> value in adopting this in-kernel bypass facility for any cloud service
> provider. It does not look more appealing than just configuring generic
> bonding using its own set of daemons or scripts. But again, cloud
> users don't welcome that facility. And basically we would run into
> nearly the same set of problems if userspace is left alone.
>
> IMHO we're not hiding the devices; think of it as adding a feature
> that is transparent to the user. Those auto-managed slaves are ones
> users don't care about much. And the user is still able to see and
> configure the lower netdevs if they really desire to do so. But
> generally the target user for this feature won't need to know that.
> Why would they care how many interfaces a VM virtually has rather than
> how many interfaces are actually _usable_ to them?
>
> Thanks,
> -Siwei
>
>
>>
>> netdevs are netdevs.  If they have special attributes, mark them as
>> such and the tools base their actions upon that.
>>
>> "Hiding", or changing classes, doesn't make any sense to me still.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-18  0:26                       ` [virtio-dev] " Siwei Liu
@ 2018-04-18 23:33                         ` Samudrala, Sridhar
  -1 siblings, 0 replies; 109+ messages in thread
From: Samudrala, Sridhar @ 2018-04-18 23:33 UTC (permalink / raw)
  To: Siwei Liu, David Miller
  Cc: David Ahern, Jiri Pirko, si-wei liu, Michael S. Tsirkin,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev

On 4/17/2018 5:26 PM, Siwei Liu wrote:
> I ran this by a few folks offline and gathered some good feedback
> that I'd like to share, and thus revive the discussion.
>
> First of all, as illustrated in the reply below, cloud service
> providers require transparent live migration. Specifically, the main
> target of our case is to support SR-IOV live migration via kernel
> upgrade while keeping the userspace of old distros unmodified. If
> this use case is simply not appealing enough for the mainline to
> adopt, I will shut up and not continue discussing, although
> technically it's entirely possible (and there's precedent in other
> implementations) to do so to benefit any cloud service provider.
>
> If it's just that the implementation of netdev hiding itself needs to
> be improved, such as implementing it as an attribute flag or adding a
> linkdump API, that's completely fine and we can look into that.
> However, the specific issue that needs to be understood beforehand is
> how to make transparent SR-IOV able to take over the name (and so
> inherit all the configs) from the lower netdev, which needs some games
> with uevents and name space reservation. So far I don't think it's
> been well discussed.
>
> One thing in particular I'd like to point out is that the 3-netdev
> model currently fails to address the core problem of live migration:
> migration of hardware specific features/state, e.g. ethtool configs
> and hardware offloading states. Only general network state (e.g. IP
> address, gateway) associated with the bypass interface can be
> migrated. As follow-up work, the bypass driver can/should be enhanced
> to save and apply those hardware specific configs before or after
> migration as needed. The transparent 1-netdev model being proposed as
> part of this patch series will be able to solve that problem naturally
> by making all hardware specific configurations go through the central
> bypass driver, such that hardware configurations can be replayed when
> a new VF or passthrough device gets plugged back in. Although the
> corresponding function hasn't been implemented today, I'd like to
> remind everyone that this is the core problem any live migration
> proposal should address.
>
> If it would make things clearer to defer netdev hiding until all
> functionality regarding centralizing and replay is implemented,
> we'd take that advice and move on to implementing those features
> as follow-up patches. Once all needed features are done, we'd resume
> the work on hiding the lower netdevs at that point. I think it would
> be best for everyone to understand the big picture in advance before
> going too far.

I think we should get the 3-netdev model integrated and add any additional
ndo_ops/ethtool ops that we would like to support/migrate before looking into
hiding the lower netdevs.


>
> Thanks, comments welcome.
>
> -Siwei
>
>
> On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
>>> From: Siwei Liu <loseweigh@gmail.com>
>>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>>
>>>> And I assume everyone here understands the use case for live
>>>> migration (in the context of providing cloud service) is very
>>>> different, and we have to hide the netdevs. If not, I'm more than
>>>> happy to clarify.
>>> I think you still need to clarify.
>> OK. The short answer is cloud users really want *transparent* live migration.
>>
>> By being transparent it means they don't and shouldn't care about the
>> existence and the occurrence of live migration, but they do if
>> userspace toolstack and libraries have to be updated or modified,
>> which means potential dependency breakage for their applications. They
>> don't like any change to the userspace environment (existing apps
>> lift-and-shift, no recompilation, no re-packaging, no re-certification
>> needed), while hardly anyone cares about ABI or API compatibility at
>> the kernel level, as long as their applications don't break.
>>
>> I agree the current bypass solution for SR-IOV live migration requires
>> guest cooperation, though that doesn't mean guest *userspace*
>> cooperation. As a matter of fact, technically it shouldn't involve
>> userspace at all to get SR-IOV migration working. It's the kernel that
>> does the real work. If I understand the goal of this in-kernel
>> approach correctly, it was meant to save userspace from modification
>> or corresponding toolstack support, as those 2 additional interfaces
>> are more a side product of this approach than something users
>> need to be aware of. All the user needs to deal with is one
>> single interface, and that's what they care about. It's more trouble
>> than help when they see 2 extra interfaces present. Management
>> tools in the old distros don't recognize them and try to bring up
>> those extra interfaces on their own. Various odd warnings start to spew
>> out, and there are a lot of caveats for the users to get around...
>>
>> On the other hand, if we "teach" those cloud users to update the
>> userspace toolstack just in exchange for a feature they don't need, no
>> one is likely to embrace the change. As such there's just no real
>> value in adopting this in-kernel bypass facility for any cloud service
>> provider. It does not look more appealing than just configuring generic
>> bonding using its own set of daemons or scripts. But again, cloud
>> users don't welcome that facility. And basically we would run into
>> nearly the same set of problems if userspace is left alone.
>>
>> IMHO we're not hiding the devices; think of it as adding a feature
>> that is transparent to the user. Those auto-managed slaves are ones
>> users don't care about much. And the user is still able to see and
>> configure the lower netdevs if they really desire to do so. But
>> generally the target user for this feature won't need to know that.
>> Why would they care how many interfaces a VM virtually has rather than
>> how many interfaces are actually _usable_ to them?
>>
>> Thanks,
>> -Siwei
>>
>>
>>> netdevs are netdevs.  If they have special attributes, mark them as
>>> such and the tools base their actions upon that.
>>>
>>> "Hiding", or changing classes, doesn't make any sense to me still.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-18  0:26                       ` [virtio-dev] " Siwei Liu
  (?)
@ 2018-04-18 23:33                       ` Samudrala, Sridhar
  -1 siblings, 0 replies; 109+ messages in thread
From: Samudrala, Sridhar @ 2018-04-18 23:33 UTC (permalink / raw)
  To: Siwei Liu, David Miller
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Netdev, virtualization, David Ahern, si-wei liu

On 4/17/2018 5:26 PM, Siwei Liu wrote:
> I ran this by a few folks offline and gathered some good feedback
> that I'd like to share, and thus revive the discussion.
>
> First of all, as illustrated in the reply below, cloud service
> providers require transparent live migration. Specifically, the main
> target of our case is to support SR-IOV live migration via kernel
> upgrade while keeping the userspace of old distros unmodified. If
> this use case is simply not appealing enough for the mainline to
> adopt, I will shut up and not continue discussing, although
> technically it's entirely possible (and there's precedent in other
> implementations) to do so to benefit any cloud service provider.
>
> If it's just that the implementation of netdev hiding itself needs to
> be improved, such as implementing it as an attribute flag or adding a
> linkdump API, that's completely fine and we can look into that.
> However, the specific issue that needs to be understood beforehand is
> how to make transparent SR-IOV able to take over the name (and so
> inherit all the configs) from the lower netdev, which needs some games
> with uevents and name space reservation. So far I don't think it's
> been well discussed.
>
> One thing in particular I'd like to point out is that the 3-netdev
> model currently fails to address the core problem of live migration:
> migration of hardware specific features/state, e.g. ethtool configs
> and hardware offloading states. Only general network state (e.g. IP
> address, gateway) associated with the bypass interface can be
> migrated. As follow-up work, the bypass driver can/should be enhanced
> to save and apply those hardware specific configs before or after
> migration as needed. The transparent 1-netdev model being proposed as
> part of this patch series will be able to solve that problem naturally
> by making all hardware specific configurations go through the central
> bypass driver, such that hardware configurations can be replayed when
> a new VF or passthrough device gets plugged back in. Although the
> corresponding function hasn't been implemented today, I'd like to
> remind everyone that this is the core problem any live migration
> proposal should address.
>
> If it would make things clearer to defer netdev hiding until all
> functionality regarding centralizing and replay is implemented,
> we'd take that advice and move on to implementing those features
> as follow-up patches. Once all needed features are done, we'd resume
> the work on hiding the lower netdevs at that point. I think it would
> be best for everyone to understand the big picture in advance before
> going too far.

I think we should get the 3-netdev model integrated and add any additional
ndo_ops/ethtool ops that we would like to support/migrate before looking into
hiding the lower netdevs.


>
> Thanks, comments welcome.
>
> -Siwei
>
>
> On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
>>> From: Siwei Liu <loseweigh@gmail.com>
>>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>>
>>>> And I assume everyone here understands the use case for live
>>>> migration (in the context of providing cloud service) is very
>>>> different, and we have to hide the netdevs. If not, I'm more than
>>>> happy to clarify.
>>> I think you still need to clarify.
>> OK. The short answer is cloud users really want *transparent* live migration.
>>
>> By being transparent it means they don't and shouldn't care about the
>> existence and the occurence of live migration, but they do if
>> userspace toolstack and libraries have to be updated or modified,
>> which means potential dependency brokeness of their applications. They
>> don't like any change to the userspace envinroment (existing apps
>> lift-and-shift, no recompilation, no re-packaging, no re-certification
>> needed), while no one barely cares about ABI or API compatibility in
>> the kernel level, as long as their applications don't break.
>>
>> I agree the current bypass solution for SR-IOV live migration requires
>> guest cooperation. Though it doesn't mean guest *userspace*
>> cooperation. As a matter of fact, technically it shouldn't involve
>> userspace at all to get SR-IOV migration working. It's the kernel that
>> does the real work. If I understand the goal of this in-kernel
>> approach correctly, it was meant to save userspace from modification
>> or corresponding toolstack support, as those additional 2 interfaces
>> are more a side product of this approach, rather than being necessary
>> for users to be aware of. All the user needs to deal with is one
>> single interface, and that's what they care about. It's more a trouble
>> than help when they see 2 extra interfaces are present. Management
>> tools in the old distros don't recognize them and try to bring up
>> those extra interfaces on their own. Various odd warnings start to spew
>> out, and there are a lot of caveats for the users to get around...
>>
>> On the other hand, if we "teach" those cloud users to update the
>> userspace toolstack just for trading a feature they don't need, no one
>> is likely going to embrace the change. As such there's just no real
>> value of adopting this in-kernel bypass facility for any cloud service
>> provider. It does not look more appealing than just configuring generic
>> bonding using its own set of daemons or scripts. But again, cloud
>> users don't welcome that facility. And basically it would get to
>> nearly the same set of problems if leaving userspace alone.
>>
>> IMHO we're not hiding the devices; think of it as adding a
>> feature transparent to the user. Those auto-managed slaves are ones users
>> don't care about much. And the user is still able to see and configure the
>> lower netdevs if they really desire to do so. But generally the
>> target user for this feature won't need to know that. Why would they care
>> how many interfaces a VM virtually has rather than how many interfaces
>> are actually _useable_ to them??
>>
>> Thanks,
>> -Siwei
>>
>>
>>> netdevs are netdevs.  If they have special attributes, mark them as
>>> such and the tools base their actions upon that.
>>>
>>> "Hiding", or changing classes, doesn't make any sense to me still.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-18 23:33                         ` [virtio-dev] " Samudrala, Sridhar
@ 2018-04-19  4:41                           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 109+ messages in thread
From: Michael S. Tsirkin @ 2018-04-19  4:41 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Siwei Liu, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev

On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
> On 4/17/2018 5:26 PM, Siwei Liu wrote:
> > I ran this by a few folks offline and gathered some good feedback
> > that I'd like to share, and thus revive the discussion.
> > 
> > First of all, as illustrated in the reply below, cloud service
> > providers require transparent live migration. Specifically, the main
> > target of our case is to support SR-IOV live migration via kernel
> > upgrade while keeping the userspace of old distros unmodified. If it's
> > because this use case is not appealing enough for the mainline to
> > adopt, I will shut up and not continue discussing, although
> > technically it's entirely possible (and there's precedent in other
> > implementation) to do so to benefit any cloud service providers.
> > 
> > If it's just that the implementation of hiding the netdev itself needs to be
> > improved, such as implementing it as an attribute flag or adding a linkdump
> > API, that's completely fine and we can look into that. However, the
> > specific issue that needs to be understood beforehand is how to make transparent
> > SR-IOV able to take over the name (and so inherit all the configs)
> > from the lower netdev, which needs some games with uevents and name
> > space reservation. So far I don't think it's been well discussed.
> > 
> > One thing in particular I'd like to point out is that the 3-netdev
> > model currently fails to address the core problem of live migration:
> > migration of hardware specific feature/state, e.g. ethtool configs
> > and hardware offloading states. Only general network state (IP
> > address, gateway, etc.) associated with the bypass interface can be
> > migrated. As follow-up work, the bypass driver can/should be enhanced to
> > save and apply those hardware specific configs before or after
> > migration as needed. The transparent 1-netdev model being proposed as
> > part of this patch series will be able to solve that problem naturally
> > by making all hardware specific configurations go through the central
> > bypass driver, such that hardware configurations can be replayed when
> > a new VF or passthrough gets plugged back in. Although that
> > corresponding function hasn't been implemented today, I'd like to
> > remind everyone that this is the core problem any live migration
> > proposal should have addressed.
> > 
> > If it would make things more clear to defer netdev hiding until all
> > functionalities regarding centralizing and replay are implemented,
> > we'd take advice like that and move on to implementing those features
> > as follow-up patches. Once all needed features get done, we'd resume
> > the work for hiding lower netdev at that point. Think it would be the
> > best to make everyone understand the big picture in advance before
> > going too far.
> 
> I think we should get the 3-netdev model integrated and add any additional
> > ndo_ops/ethtool ops that we would like to support/migrate before looking into
> hiding the lower netdevs.

Once they are exposed, I don't think we'll be able to hide them -
they will be a kernel ABI.

Do you think everyone needs to hide the SRIOV device?
Or that only some users need this?

-- 
MST

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-19  4:41                           ` [virtio-dev] " Michael S. Tsirkin
@ 2018-04-19  5:00                             ` Samudrala, Sridhar
  -1 siblings, 0 replies; 109+ messages in thread
From: Samudrala, Sridhar @ 2018-04-19  5:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev

On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>> I ran this by a few folks offline and gathered some good feedback
>>> that I'd like to share, and thus revive the discussion.
>>>
>>> First of all, as illustrated in the reply below, cloud service
>>> providers require transparent live migration. Specifically, the main
>>> target of our case is to support SR-IOV live migration via kernel
>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>> because this use case is not appealing enough for the mainline to
>>> adopt, I will shut up and not continue discussing, although
>>> technically it's entirely possible (and there's precedent in other
>>> implementation) to do so to benefit any cloud service providers.
>>>
>>> If it's just that the implementation of hiding the netdev itself needs to be
>>> improved, such as implementing it as an attribute flag or adding a linkdump
>>> API, that's completely fine and we can look into that. However, the
>>> specific issue that needs to be understood beforehand is how to make transparent
>>> SR-IOV able to take over the name (and so inherit all the configs)
>>> from the lower netdev, which needs some games with uevents and name
>>> space reservation. So far I don't think it's been well discussed.
>>>
>>> One thing in particular I'd like to point out is that the 3-netdev
>>> model currently fails to address the core problem of live migration:
>>> migration of hardware specific feature/state, e.g. ethtool configs
>>> and hardware offloading states. Only general network state (IP
>>> address, gateway, etc.) associated with the bypass interface can be
>>> migrated. As follow-up work, the bypass driver can/should be enhanced to
>>> save and apply those hardware specific configs before or after
>>> migration as needed. The transparent 1-netdev model being proposed as
>>> part of this patch series will be able to solve that problem naturally
>>> by making all hardware specific configurations go through the central
>>> bypass driver, such that hardware configurations can be replayed when
>>> a new VF or passthrough gets plugged back in. Although that
>>> corresponding function hasn't been implemented today, I'd like to
>>> remind everyone that this is the core problem any live migration
>>> proposal should have addressed.
>>>
>>> If it would make things more clear to defer netdev hiding until all
>>> functionalities regarding centralizing and replay are implemented,
>>> we'd take advice like that and move on to implementing those features
>>> as follow-up patches. Once all needed features get done, we'd resume
>>> the work for hiding lower netdev at that point. Think it would be the
>>> best to make everyone understand the big picture in advance before
>>> going too far.
>> I think we should get the 3-netdev model integrated and add any additional
>> ndo_ops/ethtool ops that we would like to support/migrate before looking into
>> hiding the lower netdevs.
> Once they are exposed, I don't think we'll be able to hide them -
> they will be a kernel ABI.
>
> Do you think everyone needs to hide the SRIOV device?
> Or that only some users need this?

Hyper-V currently supports live migration without hiding the SR-IOV device, so I don't
think it is a hard requirement. Also, as we don't yet have a consensus on how to hide
the lower netdevs, we could add another feature bit to hide lower netdevs once
we have an acceptable solution.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-19  4:41                           ` [virtio-dev] " Michael S. Tsirkin
  (?)
  (?)
@ 2018-04-19  5:00                           ` Samudrala, Sridhar
  -1 siblings, 0 replies; 109+ messages in thread
From: Samudrala, Sridhar @ 2018-04-19  5:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski, Netdev,
	virtualization, Siwei Liu, David Ahern, si-wei liu, David Miller

On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>> I ran this with a few folks offline and gathered some good feedbacks
>>> that I'd like to share thus revive the discussion.
>>>
>>> First of all, as illustrated in the reply below, cloud service
>>> providers require transparent live migration. Specifically, the main
>>> target of our case is to support SR-IOV live migration via kernel
>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>> because this use case is not appealing enough for the mainline to
>>> adopt, I will shut up and not continue discussing, although
>>> technically it's entirely possible (and there's precedent in other
>>> implementation) to do so to benefit any cloud service providers.
>>>
>>> If it's just the implementation of hiding netdev itself needs to be
>>> improved, such as implementing it as attribute flag or adding linkdump
>>> API, that's completely fine and we can look into that. However, the
>>> specific issue needs to be undestood beforehand is to make transparent
>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>> from the lower netdev, which needs some games with uevents and name
>>> space reservation. So far I don't think it's been well discussed.
>>>
>>> One thing in particular I'd like to point out is that the 3-netdev
>>> model currently missed to address the core problem of live migration:
>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>> and hardware offloading states. Only general network state (IP
>>> address, gateway, for eg.) associated with the bypass interface can be
>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>> save and apply those hardware specific configs before or after
>>> migration as needed. The transparent 1-netdev model being proposed as
>>> part of this patch series will be able to solve that problem naturally
>>> by making all hardware specific configurations go through the central
>>> bypass driver, such that hardware configurations can be replayed when
>>> new VF or passthrough gets plugged back in. Although that
>>> corresponding function hasn't been implemented today, I'd like to
>>> refresh everyone's mind that is the core problem any live migration
>>> proposal should have addressed.
>>>
>>> If it would make things more clear to defer netdev hiding until all
>>> functionalities regarding centralizing and replay are implemented,
>>> we'd take advices like that and move on to implementing those features
>>> as follow-up patches. Once all needed features get done, we'd resume
>>> the work for hiding lower netdev at that point. Think it would be the
>>> best to make everyone understand the big picture in advance before
>>> going too far.
>> I think we should get the 3-netdev model integrated and add any additional
>> ndo_ops/ethool ops that we would like to support/migrate before looking into
>> hiding the lower netdevs.
> Once they are exposed, I don't think we'll be able to hide them -
> they will be a kernel ABI.
>
> Do you think everyone needs to hide the SRIOV device?
> Or that only some users need this?

Hyper-V is currently supporting live migration without hiding the SR-IOV device. So i don't
think it is a hard requirement. And also,  as we don't yet have a consensus on how to hide
the lower netdevs, we could make it as another feature bit to hide lower netdevs once
we have an acceptable solution.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
@ 2018-04-19  5:00                             ` Samudrala, Sridhar
  0 siblings, 0 replies; 109+ messages in thread
From: Samudrala, Sridhar @ 2018-04-19  5:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev

On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>> I ran this with a few folks offline and gathered some good feedbacks
>>> that I'd like to share thus revive the discussion.
>>>
>>> First of all, as illustrated in the reply below, cloud service
>>> providers require transparent live migration. Specifically, the main
>>> target of our case is to support SR-IOV live migration via kernel
>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>> because this use case is not appealing enough for the mainline to
>>> adopt, I will shut up and not continue discussing, although
>>> technically it's entirely possible (and there's precedent in other
>>> implementation) to do so to benefit any cloud service providers.
>>>
>>> If it's just the implementation of hiding netdev itself needs to be
>>> improved, such as implementing it as attribute flag or adding linkdump
>>> API, that's completely fine and we can look into that. However, the
>>> specific issue needs to be undestood beforehand is to make transparent
>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>> from the lower netdev, which needs some games with uevents and name
>>> space reservation. So far I don't think it's been well discussed.
>>>
>>> One thing in particular I'd like to point out is that the 3-netdev
>>> model currently missed to address the core problem of live migration:
>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>> and hardware offloading states. Only general network state (IP
>>> address, gateway, for eg.) associated with the bypass interface can be
>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>> save and apply those hardware specific configs before or after
>>> migration as needed. The transparent 1-netdev model being proposed as
>>> part of this patch series will be able to solve that problem naturally
>>> by making all hardware specific configurations go through the central
>>> bypass driver, such that hardware configurations can be replayed when
>>> new VF or passthrough gets plugged back in. Although that
>>> corresponding function hasn't been implemented today, I'd like to
>>> refresh everyone's mind that is the core problem any live migration
>>> proposal should have addressed.
>>>
>>> If it would make things more clear to defer netdev hiding until all
>>> functionalities regarding centralizing and replay are implemented,
>>> we'd take advices like that and move on to implementing those features
>>> as follow-up patches. Once all needed features get done, we'd resume
>>> the work for hiding lower netdev at that point. Think it would be the
>>> best to make everyone understand the big picture in advance before
>>> going too far.
>> I think we should get the 3-netdev model integrated and add any additional
>> ndo_ops/ethtool ops that we would like to support/migrate before looking into
>> hiding the lower netdevs.
> Once they are exposed, I don't think we'll be able to hide them -
> they will be a kernel ABI.
>
> Do you think everyone needs to hide the SRIOV device?
> Or that only some users need this?

Hyper-V is currently supporting live migration without hiding the SR-IOV device. So I don't
think it is a hard requirement. And also, as we don't yet have a consensus on how to hide
the lower netdevs, we could make it another feature bit to hide lower netdevs once
we have an acceptable solution.


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-19  5:00                             ` Samudrala, Sridhar
@ 2018-04-19  5:07                               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 109+ messages in thread
From: Michael S. Tsirkin @ 2018-04-19  5:07 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Siwei Liu, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev

On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
> > On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
> > > On 4/17/2018 5:26 PM, Siwei Liu wrote:
> > > > I ran this with a few folks offline and gathered some good feedbacks
> > > > that I'd like to share thus revive the discussion.
> > > > 
> > > > First of all, as illustrated in the reply below, cloud service
> > > > providers require transparent live migration. Specifically, the main
> > > > target of our case is to support SR-IOV live migration via kernel
> > > > upgrade while keeping the userspace of old distros unmodified. If it's
> > > > because this use case is not appealing enough for the mainline to
> > > > adopt, I will shut up and not continue discussing, although
> > > > technically it's entirely possible (and there's precedent in other
> > > > implementation) to do so to benefit any cloud service providers.
> > > > 
> > > > If it's just the implementation of hiding netdev itself needs to be
> > > > improved, such as implementing it as attribute flag or adding linkdump
> > > > API, that's completely fine and we can look into that. However, the
> > > > specific issue that needs to be understood beforehand is to make transparent
> > > > SR-IOV to be able to take over the name (so inherit all the configs)
> > > > from the lower netdev, which needs some games with uevents and name
> > > > space reservation. So far I don't think it's been well discussed.
> > > > 
> > > > One thing in particular I'd like to point out is that the 3-netdev
> > > > model currently missed to address the core problem of live migration:
> > > > migration of hardware specific feature/state, for e.g. ethtool configs
> > > > and hardware offloading states. Only general network state (IP
> > > > address, gateway, for eg.) associated with the bypass interface can be
> > > > migrated. As a follow-up work, bypass driver can/should be enhanced to
> > > > save and apply those hardware specific configs before or after
> > > > migration as needed. The transparent 1-netdev model being proposed as
> > > > part of this patch series will be able to solve that problem naturally
> > > > by making all hardware specific configurations go through the central
> > > > bypass driver, such that hardware configurations can be replayed when
> > > > new VF or passthrough gets plugged back in. Although that
> > > > corresponding function hasn't been implemented today, I'd like to
> > > > refresh everyone's mind that is the core problem any live migration
> > > > proposal should have addressed.
> > > > 
> > > > If it would make things more clear to defer netdev hiding until all
> > > > functionalities regarding centralizing and replay are implemented,
> > > > we'd take advices like that and move on to implementing those features
> > > > as follow-up patches. Once all needed features get done, we'd resume
> > > > the work for hiding lower netdev at that point. Think it would be the
> > > > best to make everyone understand the big picture in advance before
> > > > going too far.
> > > I think we should get the 3-netdev model integrated and add any additional
> > > ndo_ops/ethtool ops that we would like to support/migrate before looking into
> > > hiding the lower netdevs.
> > Once they are exposed, I don't think we'll be able to hide them -
> > they will be a kernel ABI.
> > 
> > Do you think everyone needs to hide the SRIOV device?
> > Or that only some users need this?
> 
> Hyper-V is currently supporting live migration without hiding the SR-IOV device. So I don't
> think it is a hard requirement.

OK, fine.

> And also, as we don't yet have a consensus on how to hide
> the lower netdevs, we could make it another feature bit to hide lower netdevs once
> we have an acceptable solution.

Guest/host interface isn't more flexible than the userspace/kernel
interface.  The feature bit you propose would say what exactly?
Hypervisor has no idea what guest kernel shows guest userspace.
Note that the backup flag doesn't tell guest kernel what to do,
it just tells guest that there is or will be a faster main device
connected to the same backend, so the backup should only be used
when main device is not present.
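
To make the backup-flag semantics above concrete, here is a minimal sketch of
the selection rule it implies on the guest side. All names (struct
guest_netdevs, select_datapath, the enum values) are hypothetical and exist
only for illustration; they are not taken from the patch series or from any
kernel API.

```c
#include <stdbool.h>

/* Hypothetical types for illustration only. */
enum datapath { DATAPATH_NONE, DATAPATH_BACKUP, DATAPATH_MAIN };

struct guest_netdevs {
	bool backup_present;	/* device that was offered with the backup flag */
	bool main_present;	/* faster main device on the same backend */
};

/*
 * The flag only describes the topology. The policy sketched here
 * (prefer main, use backup only when main is absent) is the guest's
 * own decision.
 */
static enum datapath select_datapath(const struct guest_netdevs *d)
{
	if (d->main_present)
		return DATAPATH_MAIN;
	if (d->backup_present)
		return DATAPATH_BACKUP;
	return DATAPATH_NONE;
}
```

Nothing in this sketch says anything about which netdevs guest userspace gets
to see, which is the point being made: that part is outside what the flag
describes.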

-- 
MST

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-19  5:07                               ` [virtio-dev] " Michael S. Tsirkin
@ 2018-04-19  6:10                                 ` Samudrala, Sridhar
  -1 siblings, 0 replies; 109+ messages in thread
From: Samudrala, Sridhar @ 2018-04-19  6:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev


On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>>> that I'd like to share thus revive the discussion.
>>>>>
>>>>> First of all, as illustrated in the reply below, cloud service
>>>>> providers require transparent live migration. Specifically, the main
>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>> because this use case is not appealing enough for the mainline to
>>>>> adopt, I will shut up and not continue discussing, although
>>>>> technically it's entirely possible (and there's precedent in other
>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>
>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>> API, that's completely fine and we can look into that. However, the
>>>>> specific issue that needs to be understood beforehand is to make transparent
>>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>>> from the lower netdev, which needs some games with uevents and name
>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>
>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>> model currently missed to address the core problem of live migration:
>>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>>> and hardware offloading states. Only general network state (IP
>>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>> save and apply those hardware specific configs before or after
>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>> part of this patch series will be able to solve that problem naturally
>>>>> by making all hardware specific configurations go through the central
>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>> new VF or passthrough gets plugged back in. Although that
>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>> refresh everyone's mind that is the core problem any live migration
>>>>> proposal should have addressed.
>>>>>
>>>>> If it would make things more clear to defer netdev hiding until all
>>>>> functionalities regarding centralizing and replay are implemented,
>>>>> we'd take advices like that and move on to implementing those features
>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>> best to make everyone understand the big picture in advance before
>>>>> going too far.
>>>> I think we should get the 3-netdev model integrated and add any additional
>>>> ndo_ops/ethtool ops that we would like to support/migrate before looking into
>>>> hiding the lower netdevs.
>>> Once they are exposed, I don't think we'll be able to hide them -
>>> they will be a kernel ABI.
>>>
>>> Do you think everyone needs to hide the SRIOV device?
>>> Or that only some users need this?
>> Hyper-V is currently supporting live migration without hiding the SR-IOV device. So I don't
>> think it is a hard requirement.
> OK, fine.
>
>> And also, as we don't yet have a consensus on how to hide
>> the lower netdevs, we could make it another feature bit to hide lower netdevs once
>> we have an acceptable solution.
> Guest/host interface isn't more flexible than the userspace/kernel
> interface.  The feature bit you propose would say what exactly?
> Hypervisor has no idea what guest kernel shows guest userspace.
> Note that the backup flag doesn't tell guest kernel what to do,
> it just tells guest that there is or will be a faster main device
> connected to the same backend, so the backup should only be used
> when main device is not present.

The current bypass module supports the 3-netdev and 2-netdev models via 2 sets of interfaces:
bypass_master_create/destroy and bypass_master_register/unregister.  So theoretically
we can support the 2 models via 2 different feature bits: BACKUP and BACKUP_2_NETDEV.

Similarly, if we can figure out a way to hide both the netdevs, we can add a BACKUP_1_NETDEV
feature bit and update the bypass module to provide another set of interfaces that can
be used by virtio_net to support this model.
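
A rough sketch of how such feature bits could map to a model, purely to
illustrate the idea. The bit positions, the F_ and MODEL_ names and the
function are all hypothetical. Pairing BACKUP with the 3-netdev model, and
letting the bit for fewer netdevs win when several are set, are assumptions
made for this sketch only.

```c
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define F_BACKUP		(1ULL << 0)	/* assumed: 3-netdev model */
#define F_BACKUP_2_NETDEV	(1ULL << 1)	/* 2-netdev model */
#define F_BACKUP_1_NETDEV	(1ULL << 2)	/* needs both lower netdevs hidden */

enum bypass_model {
	MODEL_NONE = 0,
	MODEL_1_NETDEV = 1,
	MODEL_2_NETDEV = 2,
	MODEL_3_NETDEV = 3,
};

/* Pick a model from the negotiated bits; fewer netdevs takes precedence. */
static enum bypass_model model_from_features(uint64_t features)
{
	if (features & F_BACKUP_1_NETDEV)
		return MODEL_1_NETDEV;
	if (features & F_BACKUP_2_NETDEV)
		return MODEL_2_NETDEV;
	if (features & F_BACKUP)
		return MODEL_3_NETDEV;
	return MODEL_NONE;
}
```

With this shape, each model would select one of the interface sets mentioned
above, and a 1-netdev variant would need the additional set of interfaces.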

Now that we are leaning towards 'standby' as the name for the lower virtio-net, should we
change the feature bit name also to VIRTIO_NET_F_STANDBY?


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
@ 2018-04-19  6:10                                 ` Samudrala, Sridhar
  0 siblings, 0 replies; 109+ messages in thread
From: Samudrala, Sridhar @ 2018-04-19  6:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Siwei Liu, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Netdev, virtualization, virtio-dev


On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>>> that I'd like to share thus revive the discussion.
>>>>>
>>>>> First of all, as illustrated in the reply below, cloud service
>>>>> providers require transparent live migration. Specifically, the main
>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>> because this use case is not appealing enough for the mainline to
>>>>> adopt, I will shut up and not continue discussing, although
>>>>> technically it's entirely possible (and there's precedent in other
>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>
>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>> API, that's completely fine and we can look into that. However, the
>>>>> specific issue needs to be undestood beforehand is to make transparent
>>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>>> from the lower netdev, which needs some games with uevents and name
>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>
>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>> model currently missed to address the core problem of live migration:
>>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>>> and hardware offloading states. Only general network state (IP
>>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>> save and apply those hardware specific configs before or after
>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>> part of this patch series will be able to solve that problem naturally
>>>>> by making all hardware specific configurations go through the central
>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>> new VF or passthrough gets plugged back in. Although that
>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>> refresh everyone's mind that is the core problem any live migration
>>>>> proposal should have addressed.
>>>>>
>>>>> If it would make things more clear to defer netdev hiding until all
>>>>> functionalities regarding centralizing and replay are implemented,
>>>>> we'd take advices like that and move on to implementing those features
>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>> best to make everyone understand the big picture in advance before
>>>>> going too far.
>>>> I think we should get the 3-netdev model integrated and add any additional
>>>> ndo_ops/ethool ops that we would like to support/migrate before looking into
>>>> hiding the lower netdevs.
>>> Once they are exposed, I don't think we'll be able to hide them -
>>> they will be a kernel ABI.
>>>
>>> Do you think everyone needs to hide the SRIOV device?
>>> Or that only some users need this?
>> Hyper-V is currently supporting live migration without hiding the SR-IOV device. So i don't
>> think it is a hard requirement.
> OK, fine.
>
>> And also,  as we don't yet have a consensus on how to hide
>> the lower netdevs, we could make it as another feature bit to hide lower netdevs once
>> we have an acceptable solution.
> Guest/host interface isn't more flexible than the userspace/kernel
> interface.  The feature bit you propose would say what exactly?
> Hypervisor has no idea what guest kernel shows guest userspace.
> Note that the backup flag doesn't tell guest kernel what to do,
> it just tells guest that there is or will be a faster main device
> connected to the same backend, so the backup should only be used
> when main device is not present.

The current bypass module supports 3-netdev and 2-netdev models via 2 sets of interfaces:
bypass_master_create/destroy and bypass_master_register/unregister.  So theoretically
we can support the 2 models via 2 different feature bits: BACKUP and BACKUP_2_NETDEV.

Similarly, if we can figure out a way to hide both netdevs, we can add a BACKUP_1_NETDEV
feature bit and update the bypass module to provide another set of interfaces that can
be used by virtio_net to support this model.
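
A minimal sketch of how a guest driver might map offered feature bits to one
of the models discussed above. The bit positions, the F_BACKUP_2_NETDEV and
F_BACKUP_1_NETDEV macros, the bypass_model enum and pick_model() are all
hypothetical and exist only for illustration; preferring the model with the
fewest visible netdevs is likewise an assumption, not something agreed on in
this thread.

```c
#include <stdint.h>

/* Bit positions are illustrative assumptions, not taken from the virtio spec. */
#define F_BACKUP          (1ULL << 62) /* backup device, 3-netdev model */
#define F_BACKUP_2_NETDEV (1ULL << 61) /* hypothetical: 2-netdev model */
#define F_BACKUP_1_NETDEV (1ULL << 60) /* hypothetical: 1-netdev model */

/* The enum value is the number of netdevs visible to userspace. */
enum bypass_model {
	MODEL_NONE = 0,
	MODEL_1_NETDEV = 1,
	MODEL_2_NETDEV = 2,
	MODEL_3_NETDEV = 3,
};

/*
 * Pick a model from the offered feature bits. When several bits are
 * offered, prefer the model that exposes the fewest netdevs (an
 * assumed policy for this sketch).
 */
enum bypass_model pick_model(uint64_t features)
{
	if (features & F_BACKUP_1_NETDEV)
		return MODEL_1_NETDEV;
	if (features & F_BACKUP_2_NETDEV)
		return MODEL_2_NETDEV;
	if (features & F_BACKUP)
		return MODEL_3_NETDEV;
	return MODEL_NONE;
}
```

With this layout, each new model costs one more feature bit and one more
branch in the driver, which is the trade-off the separate-bits approach
implies.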

Now that we are leaning towards 'standby' as the name for the lower virtio-net, should we
also change the feature bit name to VIRTIO_NET_F_STANDBY?

>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-19  5:00                             ` Samudrala, Sridhar
@ 2018-04-19  6:31                               ` Siwei Liu
  -1 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-19  6:31 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, David Miller, David Ahern, Jiri Pirko,
	si-wei liu, Stephen Hemminger, Alexander Duyck, Brandeburg,
	Jesse, Jakub Kicinski, Jason Wang, Netdev, virtualization,
	virtio-dev

On Wed, Apr 18, 2018 at 10:00 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>
>>>> I ran this by a few folks offline and gathered some good feedback
>>>> that I'd like to share to revive the discussion.
>>>>
>>>> First of all, as illustrated in the reply below, cloud service
>>>> providers require transparent live migration. Specifically, the main
>>>> target of our case is to support SR-IOV live migration via kernel
>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>> because this use case is not appealing enough for the mainline to
>>>> adopt, I will shut up and not continue discussing, although
>>>> technically it's entirely possible (and there's precedent in other
>>>> implementation) to do so to benefit any cloud service providers.
>>>>
>>>> If it's just the implementation of hiding netdev itself needs to be
>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>> API, that's completely fine and we can look into that. However, the
>>>> specific issue that needs to be understood beforehand is how to make
>>>> transparent SR-IOV able to take over the name (and so inherit all the configs)
>>>> from the lower netdev, which needs some games with uevents and name
>>>> space reservation. So far I don't think it's been well discussed.
>>>>
>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>> model currently fails to address the core problem of live migration:
>>>> migration of hardware specific feature/state, e.g. ethtool configs
>>>> and hardware offloading states. Only general network state (IP
>>>> address, gateway, e.g.) associated with the bypass interface can be
>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>> save and apply those hardware specific configs before or after
>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>> part of this patch series will be able to solve that problem naturally
>>>> by making all hardware specific configurations go through the central
>>>> bypass driver, such that hardware configurations can be replayed when
>>>> new VF or passthrough gets plugged back in. Although that
>>>> corresponding function hasn't been implemented today, I'd like to
>>>> refresh everyone's mind that is the core problem any live migration
>>>> proposal should have addressed.
>>>>
>>>> If it would make things more clear to defer netdev hiding until all
>>>> functionalities regarding centralizing and replay are implemented,
>>>> we'd take advice like that and move on to implementing those features
>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>> the work for hiding lower netdev at that point. Think it would be the
>>>> best to make everyone understand the big picture in advance before
>>>> going too far.
>>>
>>> I think we should get the 3-netdev model integrated and add any
>>> additional
>>> ndo_ops/ethtool ops that we would like to support/migrate before looking
>>> into
>>> hiding the lower netdevs.
>>
>> Once they are exposed, I don't think we'll be able to hide them -
>> they will be a kernel ABI.
>>
>> Do you think everyone needs to hide the SRIOV device?
>> Or that only some users need this?
>
>
> Hyper-V is currently supporting live migration without hiding the SR-IOV
> device. So I don't
> think it is a hard requirement. And also,  as we don't yet have a consensus
This is a vague point, as Hyper-V is mostly Windows oriented: the
target users rarely change adapter settings in Device Manager, since
they are buried too deep already. It addresses only a subset of
SR-IOV live migration rather than the general case, so why are we
making such a comparison?

Note that it's always a hard requirement for live migration that *all
state* be migrated, no matter what the implementation is going to be.
The current 3-netdev model is far from useful for real-world scenarios
and has no advantage over generic userspace bonding.

-Siwei

> on how to hide
> the lower netdevs, we could make it as another feature bit to hide lower
> netdevs once
> we have an acceptable solution.
>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
  2018-04-19  6:10                                 ` Samudrala, Sridhar
@ 2018-04-19  6:43                                   ` Siwei Liu
  -1 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-19  6:43 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, David Miller, David Ahern, Jiri Pirko,
	si-wei liu, Stephen Hemminger, Alexander Duyck, Brandeburg,
	Jesse, Jakub Kicinski, Jason Wang, Netdev, virtualization,
	virtio-dev

On Wed, Apr 18, 2018 at 11:10 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>>>
>>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>>>
>>>>>> I ran this by a few folks offline and gathered some good feedback
>>>>>> that I'd like to share to revive the discussion.
>>>>>>
>>>>>> First of all, as illustrated in the reply below, cloud service
>>>>>> providers require transparent live migration. Specifically, the main
>>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>>> because this use case is not appealing enough for the mainline to
>>>>>> adopt, I will shut up and not continue discussing, although
>>>>>> technically it's entirely possible (and there's precedent in other
>>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>>
>>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>>> API, that's completely fine and we can look into that. However, the
>>>>>> specific issue that needs to be understood beforehand is how to make
>>>>>> transparent SR-IOV able to take over the name (and so inherit all the configs)
>>>>>> from the lower netdev, which needs some games with uevents and name
>>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>>
>>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>>> model currently fails to address the core problem of live migration:
>>>>>> migration of hardware specific feature/state, e.g. ethtool configs
>>>>>> and hardware offloading states. Only general network state (IP
>>>>>> address, gateway, e.g.) associated with the bypass interface can be
>>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>>> save and apply those hardware specific configs before or after
>>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>>> part of this patch series will be able to solve that problem naturally
>>>>>> by making all hardware specific configurations go through the central
>>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>>> new VF or passthrough gets plugged back in. Although that
>>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>>> refresh everyone's mind that is the core problem any live migration
>>>>>> proposal should have addressed.
>>>>>>
>>>>>> If it would make things more clear to defer netdev hiding until all
>>>>>> functionalities regarding centralizing and replay are implemented,
>>>>>> we'd take advice like that and move on to implementing those features
>>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>>> best to make everyone understand the big picture in advance before
>>>>>> going too far.
>>>>>
>>>>> I think we should get the 3-netdev model integrated and add any
>>>>> additional
>>>>> ndo_ops/ethtool ops that we would like to support/migrate before looking
>>>>> into
>>>>> hiding the lower netdevs.
>>>>
>>>> Once they are exposed, I don't think we'll be able to hide them -
>>>> they will be a kernel ABI.
>>>>
>>>> Do you think everyone needs to hide the SRIOV device?
>>>> Or that only some users need this?
>>>
>>> Hyper-V is currently supporting live migration without hiding the SR-IOV
>>> device. So I don't
>>> think it is a hard requirement.
>>
>> OK, fine.
>>
>>> And also,  as we don't yet have a consensus on how to hide
>>> the lower netdevs, we could make it as another feature bit to hide lower
>>> netdevs once
>>> we have an acceptable solution.
>>
>> Guest/host interface isn't more flexible than the userspace/kernel
>> interface.  The feature bit you propose would say what exactly?
>> Hypervisor has no idea what guest kernel shows guest userspace.
>> Note that the backup flag doesn't tell guest kernel what to do,
>> it just tells guest that there is or will be a faster main device
>> connected to the same backend, so the backup should only be used
>> when main device is not present.
>
>
> The current bypass module supports 3-netdev and 2-netdev models via 2 sets
> of interfaces
> bypass_master_create/destroy and bypass_master_register/unregister.  So
> theoretically
> we can support the 2 models via 2 different feature bits: BACKUP and
> BACKUP_2_NETDEV.

I'm still trying to understand the value of supporting so many models.
If we all eventually agree that the transparent 1-netdev model can address
the more general case while the 2-netdev and 3-netdev models cannot, what's
the point of supporting this many features?

-Siwei
>
> Similarly if we can figure out a way to hide both the netdevs, we can add
> BACKUP_1_NETDEV
> feature bit and update the bypass module to provide another set of
> interfaces that can
> be used by virtio_net to support this model.
>
> Now that we are leaning towards 'standby' as the name for the lower
> virtio-net, should we
> change the feature bit name also to VIRTIO_NET_F_STANDBY?
>
>>
>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
@ 2018-04-19  6:43                                   ` Siwei Liu
  0 siblings, 0 replies; 109+ messages in thread
From: Siwei Liu @ 2018-04-19  6:43 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, David Miller, David Ahern, Jiri Pirko,
	si-wei liu, Stephen Hemminger, Alexander Duyck, Brandeburg,
	Jesse, Jakub Kicinski, Jason Wang, Netdev, virtualization,
	virtio-dev

On Wed, Apr 18, 2018 at 11:10 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>>>
>>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>>>
>>>>>> I ran this by a few folks offline and gathered some good feedback
>>>>>> that I'd like to share to revive the discussion.
>>>>>>
>>>>>> First of all, as illustrated in the reply below, cloud service
>>>>>> providers require transparent live migration. Specifically, the main
>>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>>> because this use case is not appealing enough for the mainline to
>>>>>> adopt, I will shut up and not continue discussing, although
>>>>>> technically it's entirely possible (and there's precedent in other
>>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>>
>>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>>> API, that's completely fine and we can look into that. However, the
>>>>>> specific issue that needs to be understood beforehand is how to make
>>>>>> transparent SR-IOV able to take over the name (and so inherit all the configs)
>>>>>> from the lower netdev, which needs some games with uevents and name
>>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>>
>>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>>> model currently fails to address the core problem of live migration:
>>>>>> migration of hardware specific feature/state, e.g. ethtool configs
>>>>>> and hardware offloading states. Only general network state (IP
>>>>>> address, gateway, e.g.) associated with the bypass interface can be
>>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>>> save and apply those hardware specific configs before or after
>>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>>> part of this patch series will be able to solve that problem naturally
>>>>>> by making all hardware specific configurations go through the central
>>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>>> new VF or passthrough gets plugged back in. Although that
>>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>>> refresh everyone's mind that is the core problem any live migration
>>>>>> proposal should have addressed.
>>>>>>
>>>>>> If it would make things more clear to defer netdev hiding until all
>>>>>> functionalities regarding centralizing and replay are implemented,
>>>>>> we'd take advice like that and move on to implementing those features
>>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>>> best to make everyone understand the big picture in advance before
>>>>>> going too far.
>>>>>
>>>>> I think we should get the 3-netdev model integrated and add any
>>>>> additional ndo_ops/ethtool ops that we would like to support/migrate
>>>>> before looking into hiding the lower netdevs.
>>>>
>>>> Once they are exposed, I don't think we'll be able to hide them -
>>>> they will be a kernel ABI.
>>>>
>>>> Do you think everyone needs to hide the SRIOV device?
>>>> Or that only some users need this?
>>>
>>> Hyper-V currently supports live migration without hiding the SR-IOV
>>> device, so I don't think it is a hard requirement.
>>
>> OK, fine.
>>
>>> Also, as we don't yet have a consensus on how to hide the lower
>>> netdevs, we could add another feature bit to hide the lower netdevs
>>> once we have an acceptable solution.
>>
>> Guest/host interface isn't more flexible than the userspace/kernel
>> interface.  The feature bit you propose would say what exactly?
>> Hypervisor has no idea what guest kernel shows guest userspace.
>> Note that the backup flag doesn't tell guest kernel what to do,
>> it just tells guest that there is or will be a faster main device
>> connected to the same backend, so the backup should only be used
>> when main device is not present.
>
>
> The current bypass module supports the 3-netdev and 2-netdev models via 2
> sets of interfaces: bypass_master_create/destroy and
> bypass_master_register/unregister. So theoretically we can support the 2
> models via 2 different feature bits: BACKUP and BACKUP_2_NETDEV.

I'm still trying to understand the value of supporting so many models.
If we all eventually agree that the transparent 1-netdev model can address
the more general case while the 2-netdev and 3-netdev models cannot, what's
the point of supporting this many features?
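
For illustration only, the per-model feature bits suggested in the quoted
text could be sketched roughly as below. This is not code from the patch
series: the bit positions, the enum, and pick_model() are hypothetical
names made up for this example, and the precedence used when several bits
are offered at once is an assumption, since nothing in the thread defines
it.

```c
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define F_BACKUP           (1ULL << 62)  /* 3-netdev model */
#define F_BACKUP_2_NETDEV  (1ULL << 61)  /* 2-netdev model */
#define F_BACKUP_1_NETDEV  (1ULL << 60)  /* transparent 1-netdev model */

enum bypass_model {
	MODEL_NONE,      /* no backup feature negotiated */
	MODEL_3_NETDEV,
	MODEL_2_NETDEV,
	MODEL_1_NETDEV,
};

/*
 * Map negotiated feature bits to a driver model.
 * Assumption: if more than one bit is set, the model that exposes the
 * fewest netdevs to userspace wins.
 */
static enum bypass_model pick_model(uint64_t features)
{
	if (features & F_BACKUP_1_NETDEV)
		return MODEL_1_NETDEV;
	if (features & F_BACKUP_2_NETDEV)
		return MODEL_2_NETDEV;
	if (features & F_BACKUP)
		return MODEL_3_NETDEV;
	return MODEL_NONE;
}
```

Every additional bit adds another branch here and another set of
interfaces in the bypass module to maintain and test, which is the cost
being questioned above.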

-Siwei
>
> Similarly, if we can figure out a way to hide both the netdevs, we can add
> a BACKUP_1_NETDEV feature bit and update the bypass module to provide
> another set of interfaces that can be used by virtio_net to support this
> model.
>
> Now that we are leaning towards 'standby' as the name for the lower
> virtio-net, should we also change the feature bit name to
> VIRTIO_NET_F_STANDBY?
>
>>
>




Thread overview: 109+ messages
2018-04-01  9:13 [RFC PATCH 0/3] Userspace compatible driver model for virtio_bypass Si-Wei Liu
2018-04-01  9:13 ` [virtio-dev] " Si-Wei Liu
2018-04-01  9:13 ` [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device Si-Wei Liu
2018-04-01  9:13   ` [virtio-dev] " Si-Wei Liu
2018-04-03 12:25   ` Michael S. Tsirkin
2018-04-03 12:25   ` Michael S. Tsirkin
2018-04-03 12:25     ` [virtio-dev] " Michael S. Tsirkin
2018-04-04  8:02     ` Siwei Liu
2018-04-04  8:02     ` Siwei Liu
2018-04-04  8:02       ` Siwei Liu
2018-04-05 15:31       ` Paolo Bonzini
2018-04-07  2:54         ` Siwei Liu
2018-04-07  2:54         ` Siwei Liu
2018-04-07  2:54           ` [virtio-dev] " Siwei Liu
2018-04-01  9:13 ` [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice Si-Wei Liu
2018-04-01  9:13   ` [virtio-dev] " Si-Wei Liu
2018-04-01 16:11   ` David Ahern
2018-04-03  7:40     ` Siwei Liu
2018-04-03  7:40     ` Siwei Liu
2018-04-03  7:40       ` [virtio-dev] " Siwei Liu
2018-04-03 14:57       ` David Ahern
2018-04-03 15:42     ` Jiri Pirko
2018-04-03 15:42     ` Jiri Pirko
2018-04-03 19:23       ` Siwei Liu
2018-04-03 19:23       ` Siwei Liu
2018-04-03 19:23         ` [virtio-dev] " Siwei Liu
2018-04-04  1:04       ` David Ahern
2018-04-04  6:19         ` Jiri Pirko
2018-04-04  8:01           ` Siwei Liu
2018-04-04  8:01             ` [virtio-dev] " Siwei Liu
2018-04-04  8:01           ` Siwei Liu
2018-04-04  6:19         ` Jiri Pirko
2018-04-04  7:36         ` Siwei Liu
2018-04-04  7:36         ` Siwei Liu
2018-04-04  7:36           ` [virtio-dev] " Siwei Liu
2018-04-04 17:21           ` David Ahern
2018-04-04 17:37             ` David Miller
2018-04-04 18:20               ` Jiri Pirko
2018-04-04 18:20               ` Jiri Pirko
2018-04-07  2:32               ` Siwei Liu
2018-04-07  2:32                 ` [virtio-dev] " Siwei Liu
2018-04-07  3:19                 ` Andrew Lunn
2018-04-09 22:07                   ` Siwei Liu
2018-04-09 22:07                   ` Siwei Liu
2018-04-09 22:07                     ` [virtio-dev] " Siwei Liu
2018-04-09 22:15                     ` Andrew Lunn
2018-04-09 22:15                     ` Andrew Lunn
2018-04-09 22:30                       ` Siwei Liu
2018-04-09 22:30                         ` [virtio-dev] " Siwei Liu
2018-04-09 23:03                         ` Stephen Hemminger
2018-04-09 23:31                           ` Siwei Liu
2018-04-09 23:31                           ` Siwei Liu
2018-04-09 23:31                             ` [virtio-dev] " Siwei Liu
2018-04-09 23:03                         ` Stephen Hemminger
2018-04-08 16:32                 ` David Miller
2018-04-10  6:48                   ` Siwei Liu
2018-04-10  6:48                     ` [virtio-dev] " Siwei Liu
2018-04-18  0:26                     ` Siwei Liu
2018-04-18  0:26                     ` Siwei Liu
2018-04-18  0:26                       ` [virtio-dev] " Siwei Liu
2018-04-18 23:33                       ` Samudrala, Sridhar
2018-04-18 23:33                       ` Samudrala, Sridhar
2018-04-18 23:33                         ` [virtio-dev] " Samudrala, Sridhar
2018-04-19  4:41                         ` Michael S. Tsirkin
2018-04-19  4:41                           ` [virtio-dev] " Michael S. Tsirkin
2018-04-19  5:00                           ` Samudrala, Sridhar
2018-04-19  5:00                             ` Samudrala, Sridhar
2018-04-19  5:07                             ` Michael S. Tsirkin
2018-04-19  5:07                             ` Michael S. Tsirkin
2018-04-19  5:07                               ` [virtio-dev] " Michael S. Tsirkin
2018-04-19  6:10                               ` Samudrala, Sridhar
2018-04-19  6:10                               ` Samudrala, Sridhar
2018-04-19  6:10                                 ` Samudrala, Sridhar
2018-04-19  6:43                                 ` Siwei Liu
2018-04-19  6:43                                 ` Siwei Liu
2018-04-19  6:43                                   ` Siwei Liu
2018-04-19  6:31                             ` Siwei Liu
2018-04-19  6:31                             ` Siwei Liu
2018-04-19  6:31                               ` [virtio-dev] " Siwei Liu
2018-04-19  5:00                           ` Samudrala, Sridhar
2018-04-19  4:41                         ` Michael S. Tsirkin
2018-04-10  6:48                   ` Siwei Liu
2018-04-08 16:32                 ` David Miller
2018-04-04 17:37             ` David Miller
2018-04-04 18:02             ` Siwei Liu
2018-04-04 18:02             ` Siwei Liu
2018-04-04  8:28         ` Siwei Liu
2018-04-04  8:28           ` [virtio-dev] " Siwei Liu
2018-04-04 17:37           ` David Ahern
2018-04-04 17:42             ` David Miller
2018-04-04 17:42             ` David Miller
2018-04-04 17:44             ` Stephen Hemminger
2018-04-04 17:44             ` Stephen Hemminger
2018-04-04 20:08             ` Andrew Lunn
2018-04-04 20:08             ` Andrew Lunn
2018-04-04  8:28         ` Siwei Liu
2018-04-03 17:35   ` Stephen Hemminger
     [not found]     ` <CADGSJ23vZdtQzWdc_6M_Hr4MUej--wgvJ785DwRF3VaPWS1rpA@mail.gmail.com>
     [not found]       ` <20180403160834.51594373@xeon-e3>
2018-04-06 21:29         ` Siwei Liu
2018-04-06 21:29         ` Siwei Liu
2018-04-06 21:29           ` [virtio-dev] " Siwei Liu
2018-04-03 17:35   ` Stephen Hemminger
2018-04-01  9:13 ` [RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden Si-Wei Liu
2018-04-01  9:13   ` [virtio-dev] " Si-Wei Liu
2018-04-03 12:20   ` Michael S. Tsirkin
2018-04-03 12:20   ` Michael S. Tsirkin
2018-04-03 12:20     ` [virtio-dev] " Michael S. Tsirkin
2018-04-04  8:03     ` Siwei Liu
2018-04-04  8:03     ` Siwei Liu
2018-04-04  8:03       ` Siwei Liu
