qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/10] add failover feature for assigned network devices
@ 2019-10-11 11:20 Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 01/10] qdev/qbus: add hidden device support Jens Freimann
                   ` (12 more replies)
  0 siblings, 13 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

This is implementing the host side of the net_failover concept
(https://www.kernel.org/doc/html/latest/networking/net_failover.html)

Changes since v2:
* back out of creating failover pair when it is a non-networking
  vfio-pci device (Alex W)
* handle migration state change from within the migration thread. I do a
  timed wait on a semaphore and then check if all unplugs were
  succesful. Added a new function to each device that checks the device
  if the unplug for it has happened. When all devices report the succesful
  unplug *or* the time/retries is up, continue with the migration or
  cancel. When not all devices could be unplugged I am cancelling at the
  moment. It is likely that we can't plug it back at the destination which
  would result in degraded network performance.
* fix a few bugs regarding re-plug on migration source and target 
* run full set of tests including migration tests
* add patch for libqos to tolerate new migration state
* squashed patch 1 and 2, added patch 8 
 
The general idea is that we have a pair of devices, a vfio-pci and a
virtio-net device. Before migration the vfio device is unplugged and data
flows to the virtio-net device, on the target side another vfio-pci device
is plugged in to take over the data-path. In the guest the net_failover
module will pair net devices with the same MAC address.

* Patch 1 adds the infrastructure to hide the device for the qbus and qdev APIs

* Patch 2 sets a new flag for PCIDevice 'partially_hotplugged' which we
  use to skip the unrealize code path when doing a unplug of the primary
  device

* Patch 3 sets the pending_deleted_event before triggering the guest
  unplug request

* Patch 4 and 5 add new qmp events, one sends the device id of a device
  that was just requested to be unplugged from the guest and another one
  to let libvirt know if VIRTIO_NET_F_STANDBY was negotiated

* Patch 6 make sure that we can unplug the vfio-device before
  migration starts

* Patch 7 adds a new migration state that is entered while we wait for
  devices to be unplugged by guest OS

* Patch 8 just adds the new migration state to a check in libqos code

* Patch 9 In the second patch the virtio-net uses the API to defer adding the vfio
  device until the VIRTIO_NET_F_STANDBY feature is acked. It also
  implements the migration handler to unplug the device from the guest and
  re-plug in case of migration failure

* Patch 10 allows migration for failover vfio-pci devices, but only when it is
  a networking device 

Previous discussion:
  RFC v1 https://patchwork.ozlabs.org/cover/989098/
  RFC v2 https://www.mail-archive.com/qemu-devel@nongnu.org/msg606906.html
  v1: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03968.html
  v2: https://www.mail-archive.com/qemu-devel@nongnu.org/msg635214.html

To summarize concerns/feedback from previous discussion:
1.- guest OS can reject or worse _delay_ unplug by any amount of time.
  Migration might get stuck for unpredictable time with unclear reason.
  This approach combines two tricky things, hot/unplug and migration.
  -> We need to let libvirt know what's happening. Add new qmp events
     and a new migration state. When a primary device is (partially)
     unplugged (only from guest) we send a qmp event with the device id. When
     it is unplugged from the guest the DEVICE_DELETED event is sent.
     Migration will enter the wait-unplug state while waiting for the guest
     os to unplug all primary devices and then move on with migration.
2. PCI devices are a precious ressource. The primary device should never
  be added to QEMU if it won't be used by guest instead of hiding it in
  QEMU.
  -> We only hotplug the device when the standby feature bit was
     negotiated. We save the device cmdline options until we need it for
     qdev_device_add()
     Hiding a device can be a useful concept to model. For example a
     pci device in a powered-off slot could be marked as hidden until the slot is
     powered on (mst).
3. Management layer software should handle this. Open Stack already has
  components/code to handle unplug/replug VFIO devices and metadata to
  provide to the guest for detecting which devices should be paired.
  -> An approach that includes all software from firmware to
     higher-level management software wasn't tried in the last years. This is
     an attempt to keep it simple and contained in QEMU as much as possible.
     One of the problems that stopped management software and libvirt from
     implementing this idea is that it can't be sure that it's possible to
     re-plug the primary device. By not freeing the devices resources in QEMU
     and only asking the guest OS to unplug it is possible to re-plug the
     device in case of a migration failure.
4. Hotplugging a device and then making it part of a failover setup is
   not possible
  -> addressed by extending qdev hotplug functions to check for hidden
     attribute, so e.g. device_add can be used to plug a device.


I have tested this with a mlx5 and igb NIC and was able to migrate the VM.

Command line example:

qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
        -machine q35,kernel-irqchip=split -cpu host   \
        -serial stdio   \
        -net none \
        -qmp unix:/tmp/qmp.socket,server,nowait \
        -monitor telnet:127.0.0.1:5555,server,nowait \
        -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
        -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
        -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
        -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
        -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
	-device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,net_failover_pair_id =net1 \
        /root/rhel-guest-image-8.0-1781.x86_64.qcow2

I'm grateful for any remarks or ideas!

Thanks!

regards,
Jens 


Jens Freimann (10):
  qdev/qbus: add hidden device support
  pci: mark devices partially unplugged
  pci: mark device having guest unplug request pending
  qapi: add unplug primary event
  qapi: add failover negotiated event
  migration: allow unplug during migration for failover devices
  migration: add new migration state wait-unplug
  libqos: tolerate wait-unplug migration state
  net/virtio: add failover support
  vfio: unplug failover primary device before migration

 hw/core/qdev.c                 |  20 +++
 hw/net/virtio-net.c            | 267 +++++++++++++++++++++++++++++++++
 hw/pci/pci.c                   |   2 +
 hw/pci/pcie.c                  |   6 +
 hw/vfio/pci.c                  |  35 ++++-
 hw/vfio/pci.h                  |   2 +
 include/hw/pci/pci.h           |   1 +
 include/hw/qdev-core.h         |  10 ++
 include/hw/virtio/virtio-net.h |  12 ++
 include/hw/virtio/virtio.h     |   1 +
 include/migration/vmstate.h    |   2 +
 migration/migration.c          |  34 +++++
 migration/migration.h          |   3 +
 migration/savevm.c             |  36 +++++
 migration/savevm.h             |   1 +
 qapi/migration.json            |  24 ++-
 qapi/net.json                  |  16 ++
 qdev-monitor.c                 |  43 +++++-
 tests/libqos/libqos.c          |   3 +-
 vl.c                           |   6 +-
 20 files changed, 515 insertions(+), 9 deletions(-)

-- 
2.21.0



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v3 01/10] qdev/qbus: add hidden device support
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-14  9:46   ` Cornelia Huck
  2019-10-15 19:19   ` Alex Williamson
  2019-10-11 11:20 ` [PATCH v3 02/10] pci: mark devices partially unplugged Jens Freimann
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

This adds support for hiding a device to the qbus and qdev APIs.  The
first user of this will be the virtio-net failover feature but the API
introduced with this patch could be used to implement other features as
well, for example hiding pci devices when a pci bus is powered off.

qdev_device_add() is modified to check for a net_failover_pair_id
argument in the option string. A DeviceListener callback
should_be_hidden() is added. It can be used by a standby device to
inform qdev that this device should not be added now. The standby device
handler can store the device options to plug the device in at a later
point in time.

One reason for hiding the device is that we don't want to expose both
devices to the guest kernel until the respective virtio feature bit
VIRTIO_NET_F_STANDBY was negotiated and we know that the devices will be
handled correctly by the guest.

More information on the kernel feature this is using:
 https://www.kernel.org/doc/html/latest/networking/net_failover.html

An example where the primary device is a vfio-pci device and the standby
device is a virtio-net device:

A device is hidden when it has an "net_failover_pair_id" option, e.g.

 -device virtio-net-pci,...,failover=on,...
 -device vfio-pci,...,net_failover_pair_id=net1,...

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 hw/core/qdev.c         | 19 +++++++++++++++++++
 include/hw/qdev-core.h |  9 +++++++++
 qdev-monitor.c         | 43 ++++++++++++++++++++++++++++++++++++++----
 vl.c                   |  6 ++++--
 4 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index cbad6c1d55..84fac591ca 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -212,6 +212,25 @@ void device_listener_unregister(DeviceListener *listener)
     QTAILQ_REMOVE(&device_listeners, listener, link);
 }
 
+bool qdev_should_hide_device(QemuOpts *opts, Error **errp)
+{
+    bool res = false;
+    bool match_found = false;
+    DeviceListener *listener;
+
+    QTAILQ_FOREACH(listener, &device_listeners, link) {
+       if (listener->should_be_hidden) {
+            listener->should_be_hidden(listener, opts, &match_found, &res);
+        }
+
+        if (match_found) {
+            break;
+        }
+    }
+
+    return res;
+}
+
 void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
                                  int required_for_version)
 {
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index aa123f88cb..b61cf82ded 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -154,6 +154,13 @@ struct DeviceState {
 struct DeviceListener {
     void (*realize)(DeviceListener *listener, DeviceState *dev);
     void (*unrealize)(DeviceListener *listener, DeviceState *dev);
+    /*
+     * This callback is called just upon init of the DeviceState
+     * and can be used by a standby device for informing qdev if this
+     * device should be hidden by checking the device opts
+     */
+    void (*should_be_hidden)(DeviceListener *listener, QemuOpts *device_opts,
+            bool *match_found, bool *res);
     QTAILQ_ENTRY(DeviceListener) link;
 };
 
@@ -451,4 +458,6 @@ static inline bool qbus_is_hotpluggable(BusState *bus)
 void device_listener_register(DeviceListener *listener);
 void device_listener_unregister(DeviceListener *listener);
 
+bool qdev_should_hide_device(QemuOpts *opts, Error **errp);
+
 #endif
diff --git a/qdev-monitor.c b/qdev-monitor.c
index 148df9cacf..9fc8331157 100644
--- a/qdev-monitor.c
+++ b/qdev-monitor.c
@@ -32,9 +32,11 @@
 #include "qemu/help_option.h"
 #include "qemu/option.h"
 #include "qemu/qemu-print.h"
+#include "qemu/option_int.h"
 #include "sysemu/block-backend.h"
 #include "sysemu/sysemu.h"
 #include "migration/misc.h"
+#include "migration/migration.h"
 
 /*
  * Aliases were a bad idea from the start.  Let's keep them
@@ -562,14 +564,45 @@ void qdev_set_id(DeviceState *dev, const char *id)
     }
 }
 
+static int is_failover_device(void *opaque, const char *name, const char *value,
+                        Error **errp)
+{
+    if (strcmp(name, "net_failover_pair_id") == 0) {
+        QemuOpts *opts = (QemuOpts *)opaque;
+
+        if (qdev_should_hide_device(opts, errp) && errp && !*errp) {
+            return 1;
+        } else if (errp && *errp) {
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
+static bool should_hide_device(QemuOpts *opts, Error **err)
+{
+    if (qemu_opt_foreach(opts, is_failover_device, opts, err) == 0) {
+        return false;
+    }
+    return true;
+}
+
 DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
 {
     DeviceClass *dc;
     const char *driver, *path;
-    DeviceState *dev;
+    DeviceState *dev = NULL;
     BusState *bus = NULL;
     Error *err = NULL;
 
+    if (opts && should_hide_device(opts, &err)) {
+        if (err) {
+            goto err_del_dev;
+        }
+        return NULL;
+    }
+
     driver = qemu_opt_get(opts, "driver");
     if (!driver) {
         error_setg(errp, QERR_MISSING_PARAMETER, "driver");
@@ -648,8 +681,10 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
 
 err_del_dev:
     error_propagate(errp, err);
-    object_unparent(OBJECT(dev));
-    object_unref(OBJECT(dev));
+    if (dev) {
+        object_unparent(OBJECT(dev));
+        object_unref(OBJECT(dev));
+    }
     return NULL;
 }
 
@@ -818,7 +853,7 @@ void qdev_unplug(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (!migration_is_idle()) {
+    if (!migration_is_idle() && !migration_in_setup(migrate_get_current())) {
         error_setg(errp, "device_del not allowed while migrating");
         return;
     }
diff --git a/vl.c b/vl.c
index 002bf4919e..898b9901c8 100644
--- a/vl.c
+++ b/vl.c
@@ -2204,10 +2204,12 @@ static int device_init_func(void *opaque, QemuOpts *opts, Error **errp)
     DeviceState *dev;
 
     dev = qdev_device_add(opts, errp);
-    if (!dev) {
+    if (!dev && *errp) {
+        error_report_err(*errp);
         return -1;
+    } else if (dev) {
+        object_unref(OBJECT(dev));
     }
-    object_unref(OBJECT(dev));
     return 0;
 }
 
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 02/10] pci: mark devices partially unplugged
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 01/10] qdev/qbus: add hidden device support Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-16  1:53   ` Alex Williamson
  2019-10-11 11:20 ` [PATCH v3 03/10] pci: mark device having guest unplug request pending Jens Freimann
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

Only the guest unplug request was triggered. This is needed for
the failover feature. In case of a failed migration we need to
plug the device back to the guest.

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 hw/pci/pci.c         | 2 ++
 hw/pci/pcie.c        | 3 +++
 include/hw/pci/pci.h | 1 +
 3 files changed, 6 insertions(+)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index aa05c2b9b2..c140b37765 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2078,6 +2078,8 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
     Error *local_err = NULL;
     bool is_default_rom;
 
+    pci_dev->partially_hotplugged = false;
+
     /* initialize cap_present for pci_is_express() and pci_config_size(),
      * Note that hybrid PCIs are not set automatically and need to manage
      * QEMU_PCI_CAP_EXPRESS manually */
diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index a6beb567bd..19363ff8ce 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -456,6 +456,9 @@ static void pcie_unplug_device(PCIBus *bus, PCIDevice *dev, void *opaque)
 {
     HotplugHandler *hotplug_ctrl = qdev_get_hotplug_handler(DEVICE(dev));
 
+    if (dev->partially_hotplugged) {
+        return;
+    }
     hotplug_handler_unplug(hotplug_ctrl, DEVICE(dev), &error_abort);
     object_unparent(OBJECT(dev));
 }
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index f3f0ffd5fb..f3a39c9bbd 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -265,6 +265,7 @@ typedef struct PCIReqIDCache PCIReqIDCache;
 
 struct PCIDevice {
     DeviceState qdev;
+    bool partially_hotplugged;
 
     /* PCI config space */
     uint8_t *config;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 03/10] pci: mark device having guest unplug request pending
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 01/10] qdev/qbus: add hidden device support Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 02/10] pci: mark devices partially unplugged Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 04/10] qapi: add unplug primary event Jens Freimann
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

Set pending_deleted_event in DeviceState for failover
primary devices that were successfully unplugged by the Guest OS.

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 hw/pci/pcie.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 19363ff8ce..08718188bb 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -457,6 +457,7 @@ static void pcie_unplug_device(PCIBus *bus, PCIDevice *dev, void *opaque)
     HotplugHandler *hotplug_ctrl = qdev_get_hotplug_handler(DEVICE(dev));
 
     if (dev->partially_hotplugged) {
+        dev->qdev.pending_deleted_event = false;
         return;
     }
     hotplug_handler_unplug(hotplug_ctrl, DEVICE(dev), &error_abort);
@@ -476,6 +477,8 @@ void pcie_cap_slot_unplug_request_cb(HotplugHandler *hotplug_dev,
         return;
     }
 
+    dev->pending_deleted_event = true;
+
     /* In case user cancel the operation of multi-function hot-add,
      * remove the function that is unexposed to guest individually,
      * without interaction with guest.
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 04/10] qapi: add unplug primary event
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (2 preceding siblings ...)
  2019-10-11 11:20 ` [PATCH v3 03/10] pci: mark device having guest unplug request pending Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 05/10] qapi: add failover negotiated event Jens Freimann
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

This event is emitted when we sent a request to unplug a
failover primary device from the Guest OS and it includes the
device id of the primary device.

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 qapi/migration.json | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/qapi/migration.json b/qapi/migration.json
index 82feb5bd39..52e69e2868 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1448,3 +1448,22 @@
 # Since: 3.0
 ##
 { 'command': 'migrate-pause', 'allow-oob': true }
+
+##
+# @UNPLUG_PRIMARY:
+#
+# Emitted from source side of a migration when migration state is
+# WAIT_UNPLUG. Device was unplugged by guest operating system.
+# Device resources in QEMU are kept on standby to be able to re-plug it in case
+# of migration failure.
+#
+# @device_id: QEMU device id of the unplugged device
+#
+# Since: 4.2
+#
+# Example:
+#   {"event": "UNPLUG_PRIMARY", "data": {"device_id": "hostdev0"} }
+#
+##
+{ 'event': 'UNPLUG_PRIMARY',
+  'data': { 'device_id': 'str' } }
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 05/10] qapi: add failover negotiated event
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (3 preceding siblings ...)
  2019-10-11 11:20 ` [PATCH v3 04/10] qapi: add unplug primary event Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 06/10] migration: allow unplug during migration for failover devices Jens Freimann
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

This event is sent to let libvirt know that VIRTIO_NET_F_STANDBY
feature was not negotiated during virtio feature negotiation. If this
event is received it means any primary devices hotplugged before
this were were never really added to QEMU devices.

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 qapi/net.json | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/qapi/net.json b/qapi/net.json
index 728990f4fb..8c5f3f1fb2 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -737,3 +737,19 @@
 ##
 { 'command': 'announce-self', 'boxed': true,
   'data' : 'AnnounceParameters'}
+
+##
+# @FAILOVER_NEGOTIATED:
+#
+# Emitted when VIRTIO_NET_F_STANDBY was negotiated during feature negotiation
+#
+# Since: 4.2
+#
+# Example:
+#
+# <- { "event": "FAILOVER_NEGOTIATED",
+#      "data": {} }
+#
+##
+{ 'event': 'FAILOVER_NEGOTIATED',
+  'data': {} }
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 06/10] migration: allow unplug during migration for failover devices
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (4 preceding siblings ...)
  2019-10-11 11:20 ` [PATCH v3 05/10] qapi: add failover negotiated event Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 07/10] migration: add new migration state wait-unplug Jens Freimann
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

In "b06424de62 migration: Disable hotplug/unplug during migration" we
added a check to disable unplug for all devices until we have figured
out what works. For failover primary devices qdev_unplug() is called
from the migration handler, i.e. during migration.

This patch adds a flag to DeviceState which is set to false for all
devices and makes an exception for vfio-pci devices that are also
primary devices in a failover pair.

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 hw/core/qdev.c         | 1 +
 include/hw/qdev-core.h | 1 +
 qdev-monitor.c         | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 84fac591ca..fb76c6eeaf 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -991,6 +991,7 @@ static void device_initfn(Object *obj)
 
     dev->instance_id_alias = -1;
     dev->realized = false;
+    dev->allow_unplug_during_migration = false;
 
     object_property_add_bool(obj, "realized",
                              device_get_realized, device_set_realized, NULL);
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index b61cf82ded..f023c6e9c7 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -143,6 +143,7 @@ struct DeviceState {
     bool pending_deleted_event;
     QemuOpts *opts;
     int hotplugged;
+    bool allow_unplug_during_migration;
     BusState *parent_bus;
     QLIST_HEAD(, NamedGPIOList) gpios;
     QLIST_HEAD(, BusState) child_bus;
diff --git a/qdev-monitor.c b/qdev-monitor.c
index 9fc8331157..cf9d1fa901 100644
--- a/qdev-monitor.c
+++ b/qdev-monitor.c
@@ -853,7 +853,7 @@ void qdev_unplug(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (!migration_is_idle() && !migration_in_setup(migrate_get_current())) {
+    if (!migration_is_idle() && !dev->allow_unplug_during_migration) {
         error_setg(errp, "device_del not allowed while migrating");
         return;
     }
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 07/10] migration: add new migration state wait-unplug
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (5 preceding siblings ...)
  2019-10-11 11:20 ` [PATCH v3 06/10] migration: allow unplug during migration for failover devices Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-11 12:57   ` Michael S. Tsirkin
                     ` (2 more replies)
  2019-10-11 11:20 ` [PATCH v3 08/10] libqos: tolerate wait-unplug migration state Jens Freimann
                   ` (5 subsequent siblings)
  12 siblings, 3 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

This patch adds a new migration state called wait-unplug.  It is entered
after the SETUP state and will transition into ACTIVE once all devices
were succesfully unplugged from the guest.

So if a guest doesn't respond or takes long to honor the unplug request
the user will see the migration state 'wait-unplug'.

In the migration thread we query failover devices if they're are still
pending the guest unplug. When all are unplugged the migration
continues. We give it a defined number of iterations including small
waiting periods before we proceed.

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 include/migration/vmstate.h |  2 ++
 migration/migration.c       | 34 ++++++++++++++++++++++++++++++++++
 migration/migration.h       |  3 +++
 migration/savevm.c          | 36 ++++++++++++++++++++++++++++++++++++
 migration/savevm.h          |  1 +
 qapi/migration.json         |  5 ++++-
 6 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 1fbfd099dd..39ef125225 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -186,6 +186,8 @@ struct VMStateDescription {
     int (*pre_save)(void *opaque);
     int (*post_save)(void *opaque);
     bool (*needed)(void *opaque);
+    bool (*dev_unplug_pending)(void *opaque);
+
     const VMStateField *fields;
     const VMStateDescription **subsections;
 };
diff --git a/migration/migration.c b/migration/migration.c
index 5f7e4d15e9..a17d9fb990 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -52,9 +52,14 @@
 #include "hw/qdev-properties.h"
 #include "monitor/monitor.h"
 #include "net/announce.h"
+#include "qemu/queue.h"
 
 #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
 
+/* Time in milliseconds to wait for guest OS to unplug PCI device */
+#define FAILOVER_GUEST_UNPLUG_WAIT 10000
+#define FAILOVER_UNPLUG_RETRIES 5
+
 /* Amount of time to allocate to each "chunk" of bandwidth-throttled
  * data. */
 #define BUFFER_DELAY     100
@@ -954,6 +959,9 @@ static void fill_source_migration_info(MigrationInfo *info)
     case MIGRATION_STATUS_CANCELLED:
         info->has_status = true;
         break;
+    case MIGRATION_STATUS_WAIT_UNPLUG:
+        info->has_status = true;
+        break;
     }
     info->status = s->state;
 }
@@ -1695,6 +1703,7 @@ bool migration_is_idle(void)
     case MIGRATION_STATUS_COLO:
     case MIGRATION_STATUS_PRE_SWITCHOVER:
     case MIGRATION_STATUS_DEVICE:
+    case MIGRATION_STATUS_WAIT_UNPLUG:
         return false;
     case MIGRATION_STATUS__MAX:
         g_assert_not_reached();
@@ -3224,6 +3233,8 @@ static void *migration_thread(void *opaque)
     int64_t setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
     MigThrError thr_error;
     bool urgent = false;
+    bool all_unplugged = true;
+    int i = 0;
 
     rcu_register_thread();
 
@@ -3260,6 +3271,27 @@ static void *migration_thread(void *opaque)
 
     qemu_savevm_state_setup(s->to_dst_file);
 
+    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                      MIGRATION_STATUS_WAIT_UNPLUG);
+
+    while (i < FAILOVER_UNPLUG_RETRIES &&
+           s->state == MIGRATION_STATUS_WAIT_UNPLUG) {
+        i++;
+        qemu_sem_timedwait(&s->wait_unplug_sem, FAILOVER_GUEST_UNPLUG_WAIT);
+        all_unplugged = qemu_savevm_state_guest_unplug_pending();
+        if (all_unplugged) {
+            break;
+        }
+    }
+
+    if (all_unplugged) {
+        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
+                MIGRATION_STATUS_ACTIVE);
+    } else {
+        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
+                          MIGRATION_STATUS_CANCELLING);
+    }
+
     s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                       MIGRATION_STATUS_ACTIVE);
@@ -3508,6 +3540,7 @@ static void migration_instance_finalize(Object *obj)
     qemu_mutex_destroy(&ms->qemu_file_lock);
     g_free(params->tls_hostname);
     g_free(params->tls_creds);
+    qemu_sem_destroy(&ms->wait_unplug_sem);
     qemu_sem_destroy(&ms->rate_limit_sem);
     qemu_sem_destroy(&ms->pause_sem);
     qemu_sem_destroy(&ms->postcopy_pause_sem);
@@ -3553,6 +3586,7 @@ static void migration_instance_init(Object *obj)
     qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
     qemu_sem_init(&ms->rp_state.rp_sem, 0);
     qemu_sem_init(&ms->rate_limit_sem, 0);
+    qemu_sem_init(&ms->wait_unplug_sem, 0);
     qemu_mutex_init(&ms->qemu_file_lock);
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index 4f2fe193dc..79b3dda146 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -206,6 +206,9 @@ struct MigrationState
     /* Flag set once the migration thread called bdrv_inactivate_all */
     bool block_inactive;
 
+    /* Migration is waiting for guest to unplug device */
+    QemuSemaphore wait_unplug_sem;
+
     /* Migration is paused due to pause-before-switchover */
     QemuSemaphore pause_sem;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index bb9462a54d..26e5bde687 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -942,6 +942,20 @@ static void qemu_savevm_command_send(QEMUFile *f,
     qemu_fflush(f);
 }
 
+static int qemu_savevm_nr_failover_devices(void)
+{
+    SaveStateEntry *se;
+    int n = 0;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (se->vmsd && se->vmsd->dev_unplug_pending) {
+            n++;
+        }
+    }
+
+    return n;
+}
+
 void qemu_savevm_send_colo_enable(QEMUFile *f)
 {
     trace_savevm_send_colo_enable();
@@ -1113,6 +1127,28 @@ void qemu_savevm_state_header(QEMUFile *f)
     }
 }
 
+bool qemu_savevm_state_guest_unplug_pending(void)
+{
+    int nr_failover_devs;
+    SaveStateEntry *se;
+    bool ret = false;
+    int n = 0;
+
+    nr_failover_devs = qemu_savevm_nr_failover_devices();
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->vmsd || !se->vmsd->dev_unplug_pending) {
+            continue;
+        }
+        ret = se->vmsd->dev_unplug_pending(se->opaque);
+        if (!ret) {
+            n++;
+        }
+    }
+
+    return n == nr_failover_devs;
+}
+
 void qemu_savevm_state_setup(QEMUFile *f)
 {
     SaveStateEntry *se;
diff --git a/migration/savevm.h b/migration/savevm.h
index 51a4b9caa8..ba64a7e271 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -31,6 +31,7 @@
 
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_setup(QEMUFile *f);
+bool qemu_savevm_state_guest_unplug_pending(void);
 int qemu_savevm_state_resume_prepare(MigrationState *s);
 void qemu_savevm_state_header(QEMUFile *f);
 int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
diff --git a/qapi/migration.json b/qapi/migration.json
index 52e69e2868..5a06cd489f 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -133,6 +133,9 @@
 # @device: During device serialisation when pause-before-switchover is enabled
 #        (since 2.11)
 #
+# @wait-unplug: wait for device unplug request by guest OS to be completed.
+#               (since 4.2)
+#
 # Since: 2.3
 #
 ##
@@ -140,7 +143,7 @@
   'data': [ 'none', 'setup', 'cancelling', 'cancelled',
             'active', 'postcopy-active', 'postcopy-paused',
             'postcopy-recover', 'completed', 'failed', 'colo',
-            'pre-switchover', 'device' ] }
+            'pre-switchover', 'device', 'wait-unplug' ] }
 
 ##
 # @MigrationInfo:
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 08/10] libqos: tolerate wait-unplug migration state
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (6 preceding siblings ...)
  2019-10-11 11:20 ` [PATCH v3 07/10] migration: add new migration state wait-unplug Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 09/10] net/virtio: add failover support Jens Freimann
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 tests/libqos/libqos.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tests/libqos/libqos.c b/tests/libqos/libqos.c
index d71557c5cb..f229eb2cb8 100644
--- a/tests/libqos/libqos.c
+++ b/tests/libqos/libqos.c
@@ -125,7 +125,8 @@ void migrate(QOSState *from, QOSState *to, const char *uri)
             break;
         }
 
-        if ((strcmp(st, "setup") == 0) || (strcmp(st, "active") == 0)) {
+        if ((strcmp(st, "setup") == 0) || (strcmp(st, "active") == 0)
+            || (strcmp(st, "wait-unplug") == 0)) {
             qobject_unref(rsp);
             g_usleep(5000);
             continue;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 09/10] net/virtio: add failover support
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (7 preceding siblings ...)
  2019-10-11 11:20 ` [PATCH v3 08/10] libqos: tolerate wait-unplug migration state Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-11 11:20 ` [PATCH v3 10/10] vfio: unplug failover primary device before migration Jens Freimann
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

This patch adds support to handle failover device pairs of a virtio-net
device and a vfio-pci device, where the virtio-net acts as the standby
device and the vfio-pci device as the primary.

The general idea is that we have a pair of devices, a vfio-pci and a
emulated (virtio-net) device. Before migration the vfio device is
unplugged and data flows to the emulated device, on the target side
another vfio-pci device is plugged in to take over the data-path. In the
guest the net_failover module will pair net devices with the same MAC
address.

To achieve this we need:

1. Provide a callback function for the should_be_hidden DeviceListener.
   It is called when the primary device is plugged in. Evaluate the QOpt
   passed in to check if it is the matching primary device. It returns
   two values:
     - one to signal if the device to be added is the matching
       primary device
     - another one to signal to qdev if it should actually
       continue with adding the device or skip it.

   In the latter case it stores the device options in the VirtioNet
   struct and the device is added once the VIRTIO_NET_F_STANDBY feature is
   negotiated during virtio feature negotiation.

   If the virtio-net devices are not realized at the time the vfio-pci
   devices are realized, we need to connect the devices later. This way
   we make sure primary and standby devices can be specified in any
   order.

2. Register a callback for migration status notifier. When called it
   will unplug its primary device before the migration happens.

3. Register a callback for the migration code that checks if a device
   needs to be unplugged from the guest.

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 hw/net/virtio-net.c            | 267 +++++++++++++++++++++++++++++++++
 include/hw/virtio/virtio-net.h |  12 ++
 include/hw/virtio/virtio.h     |   1 +
 3 files changed, 280 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 9f11422337..fb62e18b65 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -12,6 +12,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/atomic.h"
 #include "qemu/iov.h"
 #include "qemu/main-loop.h"
 #include "qemu/module.h"
@@ -21,6 +22,10 @@
 #include "net/tap.h"
 #include "qemu/error-report.h"
 #include "qemu/timer.h"
+#include "qemu/option.h"
+#include "qemu/option_int.h"
+#include "qemu/config-file.h"
+#include "qapi/qmp/qdict.h"
 #include "hw/virtio/virtio-net.h"
 #include "net/vhost_net.h"
 #include "net/announce.h"
@@ -28,11 +33,15 @@
 #include "qapi/error.h"
 #include "qapi/qapi-events-net.h"
 #include "hw/qdev-properties.h"
+#include "qapi/qapi-types-migration.h"
+#include "qapi/qapi-events-migration.h"
 #include "hw/virtio/virtio-access.h"
 #include "migration/misc.h"
 #include "standard-headers/linux/ethtool.h"
 #include "sysemu/sysemu.h"
 #include "trace.h"
+#include "monitor/qdev.h"
+#include "hw/pci/pci.h"
 
 #define VIRTIO_NET_VM_VERSION    11
 
@@ -746,9 +755,85 @@ static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet *n)
     return virtio_net_guest_offloads_by_features(vdev->guest_features);
 }
 
+static void failover_add_primary(VirtIONet *n)
+{
+    Error *err = NULL;
+
+    n->primary_device_opts = qemu_opts_find(qemu_find_opts("device"),
+            n->primary_device_id);
+    if (n->primary_device_opts) {
+        n->primary_dev = qdev_device_add(n->primary_device_opts, &err);
+        if (err) {
+            qemu_opts_del(n->primary_device_opts);
+        }
+        if (n->primary_dev) {
+            n->primary_bus = n->primary_dev->parent_bus;
+            if (err) {
+                qdev_unplug(n->primary_dev, &err);
+                qdev_set_id(n->primary_dev, "");
+
+            }
+        }
+    }
+    if (err) {
+        error_report_err(err);
+    }
+}
+
+static int is_my_primary(void *opaque, QemuOpts *opts, Error **errp)
+{
+    VirtIONet *n = opaque;
+    int ret = 0;
+
+    const char *standby_id = qemu_opt_get(opts, "net_failover_pair_id");
+
+    if (standby_id != NULL && (g_strcmp0(standby_id, n->netclient_name) == 0)) {
+        n->primary_device_id = g_strdup(opts->id);
+        ret = 1;
+    }
+
+    return ret;
+}
+
+static DeviceState *virtio_net_find_primary(VirtIONet *n, Error *err)
+{
+    DeviceState *dev = NULL;
+
+    if (qemu_opts_foreach(qemu_find_opts("device"),
+                         is_my_primary, n, &err)) {
+        if (n->primary_device_id) {
+            dev = qdev_find_recursive(sysbus_get_default(),
+                    n->primary_device_id);
+        } else {
+            return NULL;
+        }
+    }
+    return dev;
+}
+
+
+
+static DeviceState *virtio_connect_failover_devices(VirtIONet *n,
+                                                    DeviceState *dev,
+                                                    Error **errp)
+{
+    DeviceState *prim_dev = NULL;
+
+    prim_dev = virtio_net_find_primary(n, *errp);
+    if (prim_dev) {
+        n->primary_device_id = g_strdup(prim_dev->id);
+        n->primary_device_opts = prim_dev->opts;
+    } else {
+        error_setg(errp, "connect: virtio_net: Could not find primary device");
+    }
+
+    return prim_dev;
+}
+
 static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
 {
     VirtIONet *n = VIRTIO_NET(vdev);
+    Error *err = NULL;
     int i;
 
     if (n->mtu_bypass_backend &&
@@ -790,6 +875,22 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
     } else {
         memset(n->vlans, 0xff, MAX_VLAN >> 3);
     }
+
+    if (virtio_has_feature(features, VIRTIO_NET_F_STANDBY)) {
+        atomic_set(&n->primary_should_be_hidden, false);
+        failover_add_primary(n);
+        if (!n->primary_dev) {
+            n->primary_dev = virtio_connect_failover_devices(n, n->qdev, &err);
+            if (!n->primary_dev) {
+                failover_add_primary(n);
+                n->primary_dev = virtio_connect_failover_devices(n, n->qdev,
+                                                                 &err);
+            }
+        }
+    }
+    if (err) {
+        error_report_err(err);
+    }
 }
 
 static int virtio_net_handle_rx_mode(VirtIONet *n, uint8_t cmd,
@@ -2630,6 +2731,135 @@ void virtio_net_set_netclient_name(VirtIONet *n, const char *name,
     n->netclient_type = g_strdup(type);
 }
 
+static bool failover_unplug_primary(VirtIONet *n)
+{
+    HotplugHandler *hotplug_ctrl;
+    PCIDevice *pci_dev;
+    Error *err = NULL;
+
+    hotplug_ctrl = qdev_get_hotplug_handler(n->primary_dev);
+    if (hotplug_ctrl) {
+        pci_dev = PCI_DEVICE(n->primary_dev);
+        pci_dev->partially_hotplugged = true;
+        hotplug_handler_unplug_request(hotplug_ctrl, n->primary_dev, &err);
+        if (err) {
+            error_report_err(err);
+            return false;
+        }
+    } else {
+        return false;
+    }
+    return true;
+}
+
+static bool failover_replug_primary(VirtIONet *n, Error **errp)
+{
+    HotplugHandler *hotplug_ctrl;
+    PCIDevice *pdev = PCI_DEVICE(n->primary_dev);
+
+    if (!pdev->partially_hotplugged) {
+        return true;
+    }
+    if (!n->primary_device_opts) {
+        n->primary_device_opts = qemu_opts_from_qdict(
+                qemu_find_opts("device"),
+                n->primary_device_dict, errp);
+    }
+    if (n->primary_device_opts) {
+        qdev_set_parent_bus(n->primary_dev, n->primary_bus);
+        n->primary_should_be_hidden = false;
+        qemu_opt_set_bool(n->primary_device_opts,
+                "partially_hotplugged", true, errp);
+        hotplug_ctrl = qdev_get_hotplug_handler(n->primary_dev);
+        if (hotplug_ctrl) {
+            hotplug_handler_pre_plug(hotplug_ctrl, n->primary_dev, errp);
+            hotplug_handler_plug(hotplug_ctrl, n->primary_dev, errp);
+        }
+        if (!n->primary_dev) {
+            error_setg(errp, "virtio_net: couldn't find primary device");
+        }
+    }
+    return *errp != NULL;
+}
+
+static void virtio_net_handle_migration_primary(VirtIONet *n,
+                                                MigrationState *s)
+{
+    bool should_be_hidden;
+    Error *err = NULL;
+
+    should_be_hidden = atomic_read(&n->primary_should_be_hidden);
+
+    if (!n->primary_dev) {
+        n->primary_dev = virtio_connect_failover_devices(n, n->qdev, &err);
+        if (!n->primary_dev) {
+            return;
+        }
+    }
+
+    if (migration_in_setup(s) && !should_be_hidden &&
+        n->primary_dev) {
+        if (failover_unplug_primary(n)) {
+            vmstate_unregister(n->primary_dev, qdev_get_vmsd(n->primary_dev),
+                    n->primary_dev);
+            qapi_event_send_unplug_primary(n->primary_device_id);
+            atomic_set(&n->primary_should_be_hidden, true);
+        } else {
+            warn_report("virtio_net: Couldn't unplug primary device");
+        }
+    } else if (migration_has_failed(s)) {
+        /* We already unplugged the device let's plugged it back */
+        if (!failover_replug_primary(n, &err)) {
+            if (err) {
+                error_report_err(err);
+            }
+        }
+    }
+}
+
+static void virtio_net_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *s = data;
+    VirtIONet *n = container_of(notifier, VirtIONet, migration_state);
+    virtio_net_handle_migration_primary(n, s);
+}
+
+static void virtio_net_primary_should_be_hidden(DeviceListener *listener,
+            QemuOpts *device_opts, bool *match_found, bool *hide)
+{
+    VirtIONet *n = container_of(listener, VirtIONet, primary_listener);
+
+    n->primary_device_dict = qemu_opts_to_qdict(device_opts,
+            n->primary_device_dict);
+    if (n->primary_device_dict) {
+        g_free(n->standby_id);
+        n->standby_id = g_strdup(qdict_get_try_str(n->primary_device_dict,
+                    "net_failover_pair_id"));
+    }
+    if (device_opts && g_strcmp0(n->standby_id, n->netclient_name) == 0) {
+        *match_found = true;
+    } else {
+        *match_found = false;
+        *hide = false;
+        g_free(n->standby_id);
+        n->primary_device_dict = NULL;
+        return;
+    }
+
+    n->primary_device_opts = device_opts;
+
+    /* primary_should_be_hidden is set during feature negotiation */
+    *hide = atomic_read(&n->primary_should_be_hidden);
+
+    if (n->primary_device_dict) {
+        g_free(n->primary_device_id);
+        n->primary_device_id = g_strdup(qdict_get_try_str(
+                    n->primary_device_dict, "id"));
+        if (!n->primary_device_id) {
+            warn_report("primary_device_id NULL");
+        }
+    }
+}
 static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -2660,6 +2890,16 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
         n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
     }
 
+    if (n->failover) {
+        n->primary_listener.should_be_hidden =
+            virtio_net_primary_should_be_hidden;
+        atomic_set(&n->primary_should_be_hidden, true);
+        device_listener_register(&n->primary_listener);
+        n->migration_state.notify = virtio_net_migration_state_notifier;
+        add_migration_state_change_notifier(&n->migration_state);
+        n->host_features |= (1ULL << VIRTIO_NET_F_STANDBY);
+    }
+
     virtio_net_set_config_size(n, n->host_features);
     virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
 
@@ -2782,6 +3022,13 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
     g_free(n->mac_table.macs);
     g_free(n->vlans);
 
+    if (n->failover) {
+        g_free(n->primary_device_id);
+        g_free(n->standby_id);
+        qobject_unref(n->primary_device_dict);
+        n->primary_device_dict = NULL;
+    }
+
     max_queues = n->multiqueue ? n->max_queues : 1;
     for (i = 0; i < max_queues; i++) {
         virtio_net_del_queue(n, i);
@@ -2819,6 +3066,23 @@ static int virtio_net_pre_save(void *opaque)
     return 0;
 }
 
+static bool primary_unplug_pending(void *opaque)
+{
+    DeviceState *dev = opaque;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VirtIONet *n = VIRTIO_NET(vdev);
+
+    return n->primary_dev ? n->primary_dev->pending_deleted_event : false;
+}
+
+static bool dev_unplug_pending(void *opaque)
+{
+    DeviceState *dev = opaque;
+    VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(dev);
+
+    return vdc->primary_unplug_pending(dev);
+}
+
 static const VMStateDescription vmstate_virtio_net = {
     .name = "virtio-net",
     .minimum_version_id = VIRTIO_NET_VM_VERSION,
@@ -2828,6 +3092,7 @@ static const VMStateDescription vmstate_virtio_net = {
         VMSTATE_END_OF_LIST()
     },
     .pre_save = virtio_net_pre_save,
+    .dev_unplug_pending = dev_unplug_pending,
 };
 
 static Property virtio_net_properties[] = {
@@ -2889,6 +3154,7 @@ static Property virtio_net_properties[] = {
                      true),
     DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
     DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
+    DEFINE_PROP_BOOL("failover", VirtIONet, failover, false),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -2913,6 +3179,7 @@ static void virtio_net_class_init(ObjectClass *klass, void *data)
     vdc->guest_notifier_pending = virtio_net_guest_notifier_pending;
     vdc->legacy_features |= (0x1 << VIRTIO_NET_F_GSO);
     vdc->vmsd = &vmstate_virtio_net_device;
+    vdc->primary_unplug_pending = primary_unplug_pending;
 }
 
 static const TypeInfo virtio_net_info = {
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index b96f0c643f..3da4ca382a 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -18,6 +18,7 @@
 #include "standard-headers/linux/virtio_net.h"
 #include "hw/virtio/virtio.h"
 #include "net/announce.h"
+#include "qemu/option_int.h"
 
 #define TYPE_VIRTIO_NET "virtio-net-device"
 #define VIRTIO_NET(obj) \
@@ -43,6 +44,7 @@ typedef struct virtio_net_conf
     int32_t speed;
     char *duplex_str;
     uint8_t duplex;
+    char *primary_id_str;
 } virtio_net_conf;
 
 /* Coalesced packets type & status */
@@ -185,6 +187,16 @@ struct VirtIONet {
     AnnounceTimer announce_timer;
     bool needs_vnet_hdr_swap;
     bool mtu_bypass_backend;
+    QemuOpts *primary_device_opts;
+    QDict *primary_device_dict;
+    DeviceState *primary_dev;
+    BusState *primary_bus;
+    char *primary_device_id;
+    char *standby_id;
+    bool primary_should_be_hidden;
+    bool failover;
+    DeviceListener primary_listener;
+    Notifier migration_state;
 };
 
 void virtio_net_set_netclient_name(VirtIONet *n, const char *name,
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 48e8d04ff6..0c857ecf1a 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -159,6 +159,7 @@ typedef struct VirtioDeviceClass {
     void (*save)(VirtIODevice *vdev, QEMUFile *f);
     int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
     const VMStateDescription *vmsd;
+    bool (*primary_unplug_pending)(void *opaque);
 } VirtioDeviceClass;
 
 void virtio_instance_init_common(Object *proxy_obj, void *data,
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 10/10] vfio: unplug failover primary device before migration
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (8 preceding siblings ...)
  2019-10-11 11:20 ` [PATCH v3 09/10] net/virtio: add failover support Jens Freimann
@ 2019-10-11 11:20 ` Jens Freimann
  2019-10-14 10:05   ` Cornelia Huck
  2019-10-16  1:52   ` Alex Williamson
  2019-10-11 14:28 ` [PATCH v3 0/10] add failover feature for assigned network devices Michael S. Tsirkin
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 11:20 UTC (permalink / raw)
  To: qemu-devel
  Cc: parav, mst, aadam, dgilbert, alex.williamson, laine, ailan, ehabkost

As usual block all vfio-pci devices from being migrated, but make an
exception for failover primary devices. This is achieved by setting
unmigratable to 0 but also add a migration blocker for all vfio-pci
devices except failover primary devices. These will be unplugged before
migration happens by the migration handler of the corresponding
virtio-net standby device.

Signed-off-by: Jens Freimann <jfreimann@redhat.com>
---
 hw/vfio/pci.c | 35 ++++++++++++++++++++++++++++++++++-
 hw/vfio/pci.h |  2 ++
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c5e6fe61cb..64cf8e07d9 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -40,6 +40,9 @@
 #include "pci.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/blocker.h"
+#include "qemu/option.h"
+#include "qemu/option_int.h"
 
 #define TYPE_VFIO_PCI "vfio-pci"
 #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
@@ -2698,6 +2701,12 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     vdev->req_enabled = false;
 }
 
+static int has_net_failover_arg(void *opaque, const char *name,
+                           const char *value, Error **errp)
+{
+    return (strcmp(name, "net_failover_pair_id") == 0);
+}
+
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = PCI_VFIO(pdev);
@@ -2710,6 +2719,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     int groupid;
     int i, ret;
     bool is_mdev;
+    uint16_t class_id;
+
+    if (qemu_opt_foreach(pdev->qdev.opts, has_net_failover_arg,
+                         (void *) pdev->qdev.opts, &err) == 0) {
+        error_setg(&vdev->migration_blocker,
+                "VFIO device doesn't support migration");
+        ret = migrate_add_blocker(vdev->migration_blocker, &err);
+        if (err) {
+            error_propagate(errp, err);
+            error_free(vdev->migration_blocker);
+        }
+    } else {
+        pdev->qdev.allow_unplug_during_migration = true;
+    }
 
     if (!vdev->vbasedev.sysfsdev) {
         if (!(~vdev->host.domain || ~vdev->host.bus ||
@@ -2812,6 +2835,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
+    if (vdev->net_failover_pair_id != NULL) {
+        class_id = pci_get_word(pdev->config + PCI_CLASS_DEVICE);
+        if (class_id != PCI_CLASS_NETWORK_ETHERNET) {
+            error_setg(errp, "failover device is not an Ethernet device");
+            goto error;
+        }
+    }
+
     /* vfio emulates a lot for us, but some bits need extra love */
     vdev->emulated_config_bits = g_malloc0(vdev->config_size);
 
@@ -3110,6 +3141,8 @@ static Property vfio_pci_dev_properties[] = {
                             display, ON_OFF_AUTO_OFF),
     DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
     DEFINE_PROP_UINT32("yres", VFIOPCIDevice, display_yres, 0),
+    DEFINE_PROP_STRING("net_failover_pair_id", VFIOPCIDevice,
+            net_failover_pair_id),
     DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
                        intx.mmap_timeout, 1100),
     DEFINE_PROP_BIT("x-vga", VFIOPCIDevice, features,
@@ -3152,7 +3185,7 @@ static Property vfio_pci_dev_properties[] = {
 
 static const VMStateDescription vfio_pci_vmstate = {
     .name = "vfio-pci",
-    .unmigratable = 1,
+    .unmigratable = 0,
 };
 
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 834a90d646..da4417071a 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
+    char *net_failover_pair_id;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
     uint32_t device_id;
@@ -168,6 +169,7 @@ typedef struct VFIOPCIDevice {
     bool no_vfio_ioeventfd;
     bool enable_ramfb;
     VFIODisplay *dpy;
+    Error *migration_blocker;
 } VFIOPCIDevice;
 
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] migration: add new migration state wait-unplug
  2019-10-11 11:20 ` [PATCH v3 07/10] migration: add new migration state wait-unplug Jens Freimann
@ 2019-10-11 12:57   ` Michael S. Tsirkin
  2019-10-11 14:22     ` Jens Freimann
  2019-10-11 16:49   ` Dr. David Alan Gilbert
  2019-10-11 17:11   ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2019-10-11 12:57 UTC (permalink / raw)
  To: Jens Freimann
  Cc: parav, aadam, qemu-devel, dgilbert, alex.williamson, laine,
	ailan, ehabkost

On Fri, Oct 11, 2019 at 01:20:12PM +0200, Jens Freimann wrote:
> This patch adds a new migration state called wait-unplug.  It is entered
> after the SETUP state and will transition into ACTIVE once all devices
> were succesfully unplugged from the guest.
> 
> So if a guest doesn't respond or takes long to honor the unplug request
> the user will see the migration state 'wait-unplug'.
> 
> In the migration thread we query failover devices if they're are still
> pending the guest unplug. When all are unplugged the migration
> continues. We give it a defined number of iterations including small
> waiting periods before we proceed.
> 
> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> ---
>  include/migration/vmstate.h |  2 ++
>  migration/migration.c       | 34 ++++++++++++++++++++++++++++++++++
>  migration/migration.h       |  3 +++
>  migration/savevm.c          | 36 ++++++++++++++++++++++++++++++++++++
>  migration/savevm.h          |  1 +
>  qapi/migration.json         |  5 ++++-
>  6 files changed, 80 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 1fbfd099dd..39ef125225 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -186,6 +186,8 @@ struct VMStateDescription {
>      int (*pre_save)(void *opaque);
>      int (*post_save)(void *opaque);
>      bool (*needed)(void *opaque);
> +    bool (*dev_unplug_pending)(void *opaque);
> +
>      const VMStateField *fields;
>      const VMStateDescription **subsections;
>  };
> diff --git a/migration/migration.c b/migration/migration.c
> index 5f7e4d15e9..a17d9fb990 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -52,9 +52,14 @@
>  #include "hw/qdev-properties.h"
>  #include "monitor/monitor.h"
>  #include "net/announce.h"
> +#include "qemu/queue.h"
>  
>  #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
>  
> +/* Time in milliseconds to wait for guest OS to unplug PCI device */
> +#define FAILOVER_GUEST_UNPLUG_WAIT 10000

This value should be controllable from QMP.
And I think the default should be infinite wait.

> +#define FAILOVER_UNPLUG_RETRIES 5


This is a bit of a hack. We really should just
have unplug signal wakeup, or time expire.
E.g. a new kind of notifier?

> +
>  /* Amount of time to allocate to each "chunk" of bandwidth-throttled
>   * data. */
>  #define BUFFER_DELAY     100
> @@ -954,6 +959,9 @@ static void fill_source_migration_info(MigrationInfo *info)
>      case MIGRATION_STATUS_CANCELLED:
>          info->has_status = true;
>          break;
> +    case MIGRATION_STATUS_WAIT_UNPLUG:
> +        info->has_status = true;
> +        break;
>      }
>      info->status = s->state;
>  }
> @@ -1695,6 +1703,7 @@ bool migration_is_idle(void)
>      case MIGRATION_STATUS_COLO:
>      case MIGRATION_STATUS_PRE_SWITCHOVER:
>      case MIGRATION_STATUS_DEVICE:
> +    case MIGRATION_STATUS_WAIT_UNPLUG:
>          return false;
>      case MIGRATION_STATUS__MAX:
>          g_assert_not_reached();
> @@ -3224,6 +3233,8 @@ static void *migration_thread(void *opaque)
>      int64_t setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>      MigThrError thr_error;
>      bool urgent = false;
> +    bool all_unplugged = true;
> +    int i = 0;
>  
>      rcu_register_thread();
>  
> @@ -3260,6 +3271,27 @@ static void *migration_thread(void *opaque)
>  
>      qemu_savevm_state_setup(s->to_dst_file);
>  
> +    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> +                      MIGRATION_STATUS_WAIT_UNPLUG);
> +
> +    while (i < FAILOVER_UNPLUG_RETRIES &&
> +           s->state == MIGRATION_STATUS_WAIT_UNPLUG) {
> +        i++;
> +        qemu_sem_timedwait(&s->wait_unplug_sem, FAILOVER_GUEST_UNPLUG_WAIT);

Should be FAILOVER_GUEST_UNPLUG_WAIT / FAILOVER_UNPLUG_RETRIES

such that time set is the total.

> +        all_unplugged = qemu_savevm_state_guest_unplug_pending();
> +        if (all_unplugged) {
> +            break;
> +        }
> +    }
> +
> +    if (all_unplugged) {
> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
> +                MIGRATION_STATUS_ACTIVE);
> +    } else {
> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
> +                          MIGRATION_STATUS_CANCELLING);
> +    }
> +
>      s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
>      migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                        MIGRATION_STATUS_ACTIVE);
> @@ -3508,6 +3540,7 @@ static void migration_instance_finalize(Object *obj)
>      qemu_mutex_destroy(&ms->qemu_file_lock);
>      g_free(params->tls_hostname);
>      g_free(params->tls_creds);
> +    qemu_sem_destroy(&ms->wait_unplug_sem);
>      qemu_sem_destroy(&ms->rate_limit_sem);
>      qemu_sem_destroy(&ms->pause_sem);
>      qemu_sem_destroy(&ms->postcopy_pause_sem);
> @@ -3553,6 +3586,7 @@ static void migration_instance_init(Object *obj)
>      qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
>      qemu_sem_init(&ms->rp_state.rp_sem, 0);
>      qemu_sem_init(&ms->rate_limit_sem, 0);
> +    qemu_sem_init(&ms->wait_unplug_sem, 0);
>      qemu_mutex_init(&ms->qemu_file_lock);
>  }
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index 4f2fe193dc..79b3dda146 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,9 @@ struct MigrationState
>      /* Flag set once the migration thread called bdrv_inactivate_all */
>      bool block_inactive;
>  
> +    /* Migration is waiting for guest to unplug device */
> +    QemuSemaphore wait_unplug_sem;
> +
>      /* Migration is paused due to pause-before-switchover */
>      QemuSemaphore pause_sem;
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index bb9462a54d..26e5bde687 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -942,6 +942,20 @@ static void qemu_savevm_command_send(QEMUFile *f,
>      qemu_fflush(f);
>  }
>  
> +static int qemu_savevm_nr_failover_devices(void)
> +{
> +    SaveStateEntry *se;
> +    int n = 0;
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (se->vmsd && se->vmsd->dev_unplug_pending) {
> +            n++;
> +        }
> +    }
> +
> +    return n;
> +}
> +
>  void qemu_savevm_send_colo_enable(QEMUFile *f)
>  {
>      trace_savevm_send_colo_enable();
> @@ -1113,6 +1127,28 @@ void qemu_savevm_state_header(QEMUFile *f)
>      }
>  }
>  
> +bool qemu_savevm_state_guest_unplug_pending(void)
> +{
> +    int nr_failover_devs;
> +    SaveStateEntry *se;
> +    bool ret = false;
> +    int n = 0;
> +
> +    nr_failover_devs = qemu_savevm_nr_failover_devices();
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->vmsd || !se->vmsd->dev_unplug_pending) {
> +            continue;
> +        }
> +        ret = se->vmsd->dev_unplug_pending(se->opaque);
> +        if (!ret) {
> +            n++;
> +        }
> +    }
> +
> +    return n == nr_failover_devs;
> +}
> +
>  void qemu_savevm_state_setup(QEMUFile *f)
>  {
>      SaveStateEntry *se;
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 51a4b9caa8..ba64a7e271 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -31,6 +31,7 @@
>  
>  bool qemu_savevm_state_blocked(Error **errp);
>  void qemu_savevm_state_setup(QEMUFile *f);
> +bool qemu_savevm_state_guest_unplug_pending(void);
>  int qemu_savevm_state_resume_prepare(MigrationState *s);
>  void qemu_savevm_state_header(QEMUFile *f);
>  int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 52e69e2868..5a06cd489f 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -133,6 +133,9 @@
>  # @device: During device serialisation when pause-before-switchover is enabled
>  #        (since 2.11)
>  #
> +# @wait-unplug: wait for device unplug request by guest OS to be completed.
> +#               (since 4.2)
> +#
>  # Since: 2.3
>  #
>  ##
> @@ -140,7 +143,7 @@
>    'data': [ 'none', 'setup', 'cancelling', 'cancelled',
>              'active', 'postcopy-active', 'postcopy-paused',
>              'postcopy-recover', 'completed', 'failed', 'colo',
> -            'pre-switchover', 'device' ] }
> +            'pre-switchover', 'device', 'wait-unplug' ] }
>  
>  ##
>  # @MigrationInfo:
> -- 
> 2.21.0


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] migration: add new migration state wait-unplug
  2019-10-11 12:57   ` Michael S. Tsirkin
@ 2019-10-11 14:22     ` Jens Freimann
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-11 14:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: parav, aadam, qemu-devel, dgilbert, alex.williamson, laine,
	ailan, ehabkost

On Fri, Oct 11, 2019 at 08:57:48AM -0400, Michael S. Tsirkin wrote:
>On Fri, Oct 11, 2019 at 01:20:12PM +0200, Jens Freimann wrote:
>> This patch adds a new migration state called wait-unplug.  It is entered
>> after the SETUP state and will transition into ACTIVE once all devices
>> were succesfully unplugged from the guest.
>>
>> So if a guest doesn't respond or takes long to honor the unplug request
>> the user will see the migration state 'wait-unplug'.
>>
>> In the migration thread we query failover devices if they're are still
>> pending the guest unplug. When all are unplugged the migration
>> continues. We give it a defined number of iterations including small
>> waiting periods before we proceed.
>>
>> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
>> ---
>>  include/migration/vmstate.h |  2 ++
>>  migration/migration.c       | 34 ++++++++++++++++++++++++++++++++++
>>  migration/migration.h       |  3 +++
>>  migration/savevm.c          | 36 ++++++++++++++++++++++++++++++++++++
>>  migration/savevm.h          |  1 +
>>  qapi/migration.json         |  5 ++++-
>>  6 files changed, 80 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index 1fbfd099dd..39ef125225 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -186,6 +186,8 @@ struct VMStateDescription {
>>      int (*pre_save)(void *opaque);
>>      int (*post_save)(void *opaque);
>>      bool (*needed)(void *opaque);
>> +    bool (*dev_unplug_pending)(void *opaque);
>> +
>>      const VMStateField *fields;
>>      const VMStateDescription **subsections;
>>  };
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 5f7e4d15e9..a17d9fb990 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -52,9 +52,14 @@
>>  #include "hw/qdev-properties.h"
>>  #include "monitor/monitor.h"
>>  #include "net/announce.h"
>> +#include "qemu/queue.h"
>>
>>  #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
>>
>> +/* Time in milliseconds to wait for guest OS to unplug PCI device */
>> +#define FAILOVER_GUEST_UNPLUG_WAIT 10000
>
>This value should be controllable from QMP.
>And I think the default should be infinite wait.

ok, I can do that.
>
>> +#define FAILOVER_UNPLUG_RETRIES 5
>
>
>This is a bit of a hack. We really should just
>have unplug signal wakeup, or time expire.
>E.g. a new kind of notifier?

I tried to keep it simple. But you're probably right.
Seems like a new notifier similar to the migration state notifier is a good
option. Maybe I could also reuse the migration state notifier I have
added and check for the new migration state in it. 

Thanks for the review!

regards,
Jens 

>
>> +
>>  /* Amount of time to allocate to each "chunk" of bandwidth-throttled
>>   * data. */
>>  #define BUFFER_DELAY     100
>> @@ -954,6 +959,9 @@ static void fill_source_migration_info(MigrationInfo *info)
>>      case MIGRATION_STATUS_CANCELLED:
>>          info->has_status = true;
>>          break;
>> +    case MIGRATION_STATUS_WAIT_UNPLUG:
>> +        info->has_status = true;
>> +        break;
>>      }
>>      info->status = s->state;
>>  }
>> @@ -1695,6 +1703,7 @@ bool migration_is_idle(void)
>>      case MIGRATION_STATUS_COLO:
>>      case MIGRATION_STATUS_PRE_SWITCHOVER:
>>      case MIGRATION_STATUS_DEVICE:
>> +    case MIGRATION_STATUS_WAIT_UNPLUG:
>>          return false;
>>      case MIGRATION_STATUS__MAX:
>>          g_assert_not_reached();
>> @@ -3224,6 +3233,8 @@ static void *migration_thread(void *opaque)
>>      int64_t setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>>      MigThrError thr_error;
>>      bool urgent = false;
>> +    bool all_unplugged = true;
>> +    int i = 0;
>>
>>      rcu_register_thread();
>>
>> @@ -3260,6 +3271,27 @@ static void *migration_thread(void *opaque)
>>
>>      qemu_savevm_state_setup(s->to_dst_file);
>>
>> +    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>> +                      MIGRATION_STATUS_WAIT_UNPLUG);
>> +
>> +    while (i < FAILOVER_UNPLUG_RETRIES &&
>> +           s->state == MIGRATION_STATUS_WAIT_UNPLUG) {
>> +        i++;
>> +        qemu_sem_timedwait(&s->wait_unplug_sem, FAILOVER_GUEST_UNPLUG_WAIT);
>
>Should be FAILOVER_GUEST_UNPLUG_WAIT / FAILOVER_UNPLUG_RETRIES
>
>such that time set is the total.
>
>> +        all_unplugged = qemu_savevm_state_guest_unplug_pending();
>> +        if (all_unplugged) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    if (all_unplugged) {
>> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
>> +                MIGRATION_STATUS_ACTIVE);
>> +    } else {
>> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
>> +                          MIGRATION_STATUS_CANCELLING);
>> +    }
>> +
>>      s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
>>      migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>>                        MIGRATION_STATUS_ACTIVE);
>> @@ -3508,6 +3540,7 @@ static void migration_instance_finalize(Object *obj)
>>      qemu_mutex_destroy(&ms->qemu_file_lock);
>>      g_free(params->tls_hostname);
>>      g_free(params->tls_creds);
>> +    qemu_sem_destroy(&ms->wait_unplug_sem);
>>      qemu_sem_destroy(&ms->rate_limit_sem);
>>      qemu_sem_destroy(&ms->pause_sem);
>>      qemu_sem_destroy(&ms->postcopy_pause_sem);
>> @@ -3553,6 +3586,7 @@ static void migration_instance_init(Object *obj)
>>      qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
>>      qemu_sem_init(&ms->rp_state.rp_sem, 0);
>>      qemu_sem_init(&ms->rate_limit_sem, 0);
>> +    qemu_sem_init(&ms->wait_unplug_sem, 0);
>>      qemu_mutex_init(&ms->qemu_file_lock);
>>  }
>>
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 4f2fe193dc..79b3dda146 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -206,6 +206,9 @@ struct MigrationState
>>      /* Flag set once the migration thread called bdrv_inactivate_all */
>>      bool block_inactive;
>>
>> +    /* Migration is waiting for guest to unplug device */
>> +    QemuSemaphore wait_unplug_sem;
>> +
>>      /* Migration is paused due to pause-before-switchover */
>>      QemuSemaphore pause_sem;
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index bb9462a54d..26e5bde687 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -942,6 +942,20 @@ static void qemu_savevm_command_send(QEMUFile *f,
>>      qemu_fflush(f);
>>  }
>>
>> +static int qemu_savevm_nr_failover_devices(void)
>> +{
>> +    SaveStateEntry *se;
>> +    int n = 0;
>> +
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +        if (se->vmsd && se->vmsd->dev_unplug_pending) {
>> +            n++;
>> +        }
>> +    }
>> +
>> +    return n;
>> +}
>> +
>>  void qemu_savevm_send_colo_enable(QEMUFile *f)
>>  {
>>      trace_savevm_send_colo_enable();
>> @@ -1113,6 +1127,28 @@ void qemu_savevm_state_header(QEMUFile *f)
>>      }
>>  }
>>
>> +bool qemu_savevm_state_guest_unplug_pending(void)
>> +{
>> +    int nr_failover_devs;
>> +    SaveStateEntry *se;
>> +    bool ret = false;
>> +    int n = 0;
>> +
>> +    nr_failover_devs = qemu_savevm_nr_failover_devices();
>> +
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +        if (!se->vmsd || !se->vmsd->dev_unplug_pending) {
>> +            continue;
>> +        }
>> +        ret = se->vmsd->dev_unplug_pending(se->opaque);
>> +        if (!ret) {
>> +            n++;
>> +        }
>> +    }
>> +
>> +    return n == nr_failover_devs;
>> +}
>> +
>>  void qemu_savevm_state_setup(QEMUFile *f)
>>  {
>>      SaveStateEntry *se;
>> diff --git a/migration/savevm.h b/migration/savevm.h
>> index 51a4b9caa8..ba64a7e271 100644
>> --- a/migration/savevm.h
>> +++ b/migration/savevm.h
>> @@ -31,6 +31,7 @@
>>
>>  bool qemu_savevm_state_blocked(Error **errp);
>>  void qemu_savevm_state_setup(QEMUFile *f);
>> +bool qemu_savevm_state_guest_unplug_pending(void);
>>  int qemu_savevm_state_resume_prepare(MigrationState *s);
>>  void qemu_savevm_state_header(QEMUFile *f);
>>  int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 52e69e2868..5a06cd489f 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -133,6 +133,9 @@
>>  # @device: During device serialisation when pause-before-switchover is enabled
>>  #        (since 2.11)
>>  #
>> +# @wait-unplug: wait for device unplug request by guest OS to be completed.
>> +#               (since 4.2)
>> +#
>>  # Since: 2.3
>>  #
>>  ##
>> @@ -140,7 +143,7 @@
>>    'data': [ 'none', 'setup', 'cancelling', 'cancelled',
>>              'active', 'postcopy-active', 'postcopy-paused',
>>              'postcopy-recover', 'completed', 'failed', 'colo',
>> -            'pre-switchover', 'device' ] }
>> +            'pre-switchover', 'device', 'wait-unplug' ] }
>>
>>  ##
>>  # @MigrationInfo:
>> --
>> 2.21.0


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/10] add failover feature for assigned network devices
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (9 preceding siblings ...)
  2019-10-11 11:20 ` [PATCH v3 10/10] vfio: unplug failover primary device before migration Jens Freimann
@ 2019-10-11 14:28 ` Michael S. Tsirkin
  2019-10-11 16:04 ` no-reply
  2019-10-15 19:03 ` Alex Williamson
  12 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2019-10-11 14:28 UTC (permalink / raw)
  To: Jens Freimann
  Cc: parav, aadam, qemu-devel, dgilbert, alex.williamson, laine,
	ailan, ehabkost

On Fri, Oct 11, 2019 at 01:20:05PM +0200, Jens Freimann wrote:
> This is implementing the host side of the net_failover concept
> (https://www.kernel.org/doc/html/latest/networking/net_failover.html)


I posted a comment about the migration changes.
Besides that this looks good to me.
I did not look at VFIO parts at all yet.
Alex, could you pls review/ack the VFIO patch?


> Changes since v2:
> * back out of creating failover pair when it is a non-networking
>   vfio-pci device (Alex W)
> * handle migration state change from within the migration thread. I do a
>   timed wait on a semaphore and then check if all unplugs were
>   succesful. Added a new function to each device that checks the device
>   if the unplug for it has happened. When all devices report the succesful
>   unplug *or* the time/retries is up, continue with the migration or
>   cancel. When not all devices could be unplugged I am cancelling at the
>   moment. It is likely that we can't plug it back at the destination which
>   would result in degraded network performance.
> * fix a few bugs regarding re-plug on migration source and target 
> * run full set of tests including migration tests
> * add patch for libqos to tolerate new migration state
> * squashed patch 1 and 2, added patch 8 
>  
> The general idea is that we have a pair of devices, a vfio-pci and a
> virtio-net device. Before migration the vfio device is unplugged and data
> flows to the virtio-net device, on the target side another vfio-pci device
> is plugged in to take over the data-path. In the guest the net_failover
> module will pair net devices with the same MAC address.
> 
> * Patch 1 adds the infrastructure to hide the device for the qbus and qdev APIs
> 
> * Patch 2 sets a new flag for PCIDevice 'partially_hotplugged' which we
>   use to skip the unrealize code path when doing a unplug of the primary
>   device
> 
> * Patch 3 sets the pending_deleted_event before triggering the guest
>   unplug request
> 
> * Patch 4 and 5 add new qmp events, one sends the device id of a device
>   that was just requested to be unplugged from the guest and another one
>   to let libvirt know if VIRTIO_NET_F_STANDBY was negotiated
> 
> * Patch 6 make sure that we can unplug the vfio-device before
>   migration starts
> 
> * Patch 7 adds a new migration state that is entered while we wait for
>   devices to be unplugged by guest OS
> 
> * Patch 8 just adds the new migration state to a check in libqos code
> 
> * Patch 9 In the second patch the virtio-net uses the API to defer adding the vfio
>   device until the VIRTIO_NET_F_STANDBY feature is acked. It also
>   implements the migration handler to unplug the device from the guest and
>   re-plug in case of migration failure
> 
> * Patch 10 allows migration for failover vfio-pci devices, but only when it is
>   a networking device 
> 
> Previous discussion:
>   RFC v1 https://patchwork.ozlabs.org/cover/989098/
>   RFC v2 https://www.mail-archive.com/qemu-devel@nongnu.org/msg606906.html
>   v1: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03968.html
>   v2: https://www.mail-archive.com/qemu-devel@nongnu.org/msg635214.html
> 
> To summarize concerns/feedback from previous discussion:
> 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
>   Migration might get stuck for unpredictable time with unclear reason.
>   This approach combines two tricky things, hot/unplug and migration.
>   -> We need to let libvirt know what's happening. Add new qmp events
>      and a new migration state. When a primary device is (partially)
>      unplugged (only from guest) we send a qmp event with the device id. When
>      it is unplugged from the guest the DEVICE_DELETED event is sent.
>      Migration will enter the wait-unplug state while waiting for the guest
>      os to unplug all primary devices and then move on with migration.
> 2. PCI devices are a precious ressource. The primary device should never
>   be added to QEMU if it won't be used by guest instead of hiding it in
>   QEMU.
>   -> We only hotplug the device when the standby feature bit was
>      negotiated. We save the device cmdline options until we need it for
>      qdev_device_add()
>      Hiding a device can be a useful concept to model. For example a
>      pci device in a powered-off slot could be marked as hidden until the slot is
>      powered on (mst).
> 3. Management layer software should handle this. Open Stack already has
>   components/code to handle unplug/replug VFIO devices and metadata to
>   provide to the guest for detecting which devices should be paired.
>   -> An approach that includes all software from firmware to
>      higher-level management software wasn't tried in the last years. This is
>      an attempt to keep it simple and contained in QEMU as much as possible.
>      One of the problems that stopped management software and libvirt from
>      implementing this idea is that it can't be sure that it's possible to
>      re-plug the primary device. By not freeing the devices resources in QEMU
>      and only asking the guest OS to unplug it is possible to re-plug the
>      device in case of a migration failure.
> 4. Hotplugging a device and then making it part of a failover setup is
>    not possible
>   -> addressed by extending qdev hotplug functions to check for hidden
>      attribute, so e.g. device_add can be used to plug a device.
> 
> 
> I have tested this with a mlx5 and igb NIC and was able to migrate the VM.
> 
> Command line example:
> 
> qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>         -machine q35,kernel-irqchip=split -cpu host   \
>         -serial stdio   \
>         -net none \
>         -qmp unix:/tmp/qmp.socket,server,nowait \
>         -monitor telnet:127.0.0.1:5555,server,nowait \
>         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
> 	-device vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,net_failover_pair_id =net1 \
>         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> 
> I'm grateful for any remarks or ideas!
> 
> Thanks!
> 
> regards,
> Jens 
> 
> 
> Jens Freimann (10):
>   qdev/qbus: add hidden device support
>   pci: mark devices partially unplugged
>   pci: mark device having guest unplug request pending
>   qapi: add unplug primary event
>   qapi: add failover negotiated event
>   migration: allow unplug during migration for failover devices
>   migration: add new migration state wait-unplug
>   libqos: tolerate wait-unplug migration state
>   net/virtio: add failover support
>   vfio: unplug failover primary device before migration
> 
>  hw/core/qdev.c                 |  20 +++
>  hw/net/virtio-net.c            | 267 +++++++++++++++++++++++++++++++++
>  hw/pci/pci.c                   |   2 +
>  hw/pci/pcie.c                  |   6 +
>  hw/vfio/pci.c                  |  35 ++++-
>  hw/vfio/pci.h                  |   2 +
>  include/hw/pci/pci.h           |   1 +
>  include/hw/qdev-core.h         |  10 ++
>  include/hw/virtio/virtio-net.h |  12 ++
>  include/hw/virtio/virtio.h     |   1 +
>  include/migration/vmstate.h    |   2 +
>  migration/migration.c          |  34 +++++
>  migration/migration.h          |   3 +
>  migration/savevm.c             |  36 +++++
>  migration/savevm.h             |   1 +
>  qapi/migration.json            |  24 ++-
>  qapi/net.json                  |  16 ++
>  qdev-monitor.c                 |  43 +++++-
>  tests/libqos/libqos.c          |   3 +-
>  vl.c                           |   6 +-
>  20 files changed, 515 insertions(+), 9 deletions(-)
> 
> -- 
> 2.21.0


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/10] add failover feature for assigned network devices
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (10 preceding siblings ...)
  2019-10-11 14:28 ` [PATCH v3 0/10] add failover feature for assigned network devices Michael S. Tsirkin
@ 2019-10-11 16:04 ` no-reply
  2019-10-15 19:03 ` Alex Williamson
  12 siblings, 0 replies; 35+ messages in thread
From: no-reply @ 2019-10-11 16:04 UTC (permalink / raw)
  To: jfreimann
  Cc: parav, mst, aadam, qemu-devel, dgilbert, alex.williamson, laine,
	ailan, ehabkost

Patchew URL: https://patchew.org/QEMU/20191011112015.11785-1-jfreimann@redhat.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  TEST    iotest-qcow2: 025
  TEST    iotest-qcow2: 027
**
ERROR:/tmp/qemu-test/src/tests/migration-test.c:903:wait_for_migration_fail: assertion failed: (!strcmp(status, "setup") || !strcmp(status, "failed") || (allow_active && !strcmp(status, "active")))
ERROR - Bail out! ERROR:/tmp/qemu-test/src/tests/migration-test.c:903:wait_for_migration_fail: assertion failed: (!strcmp(status, "setup") || !strcmp(status, "failed") || (allow_active && !strcmp(status, "active")))
make: *** [check-qtest-aarch64] Error 1
make: *** Waiting for unfinished jobs....
  TEST    iotest-qcow2: 029
  TEST    check-unit: tests/test-qdist
---
  TEST    iotest-qcow2: 172
  TEST    iotest-qcow2: 174
**
ERROR:/tmp/qemu-test/src/tests/migration-test.c:903:wait_for_migration_fail: assertion failed: (!strcmp(status, "setup") || !strcmp(status, "failed") || (allow_active && !strcmp(status, "active")))
ERROR - Bail out! ERROR:/tmp/qemu-test/src/tests/migration-test.c:903:wait_for_migration_fail: assertion failed: (!strcmp(status, "setup") || !strcmp(status, "failed") || (allow_active && !strcmp(status, "active")))
make: *** [check-qtest-x86_64] Error 1
  TEST    iotest-qcow2: 176
  TEST    iotest-qcow2: 177
  TEST    iotest-qcow2: 179
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=8a50cfce678f45bfb554befdb722504e', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-lrrx9ztr/src/docker-src.2019-10-11-11.53.24.30391:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=8a50cfce678f45bfb554befdb722504e
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-lrrx9ztr/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    10m49.386s
user    0m8.737s


The full log is available at
http://patchew.org/logs/20191011112015.11785-1-jfreimann@redhat.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] migration: add new migration state wait-unplug
  2019-10-11 11:20 ` [PATCH v3 07/10] migration: add new migration state wait-unplug Jens Freimann
  2019-10-11 12:57   ` Michael S. Tsirkin
@ 2019-10-11 16:49   ` Dr. David Alan Gilbert
  2019-10-11 17:11   ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2019-10-11 16:49 UTC (permalink / raw)
  To: Jens Freimann
  Cc: ehabkost, mst, aadam, qemu-devel, alex.williamson, laine, ailan, parav

* Jens Freimann (jfreimann@redhat.com) wrote:
> This patch adds a new migration state called wait-unplug.  It is entered
> after the SETUP state and will transition into ACTIVE once all devices
> were succesfully unplugged from the guest.
> 
> So if a guest doesn't respond or takes long to honor the unplug request
> the user will see the migration state 'wait-unplug'.
> 
> In the migration thread we query failover devices if they're are still
> pending the guest unplug. When all are unplugged the migration
> continues. We give it a defined number of iterations including small
> waiting periods before we proceed.
> 
> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> ---
>  include/migration/vmstate.h |  2 ++
>  migration/migration.c       | 34 ++++++++++++++++++++++++++++++++++
>  migration/migration.h       |  3 +++
>  migration/savevm.c          | 36 ++++++++++++++++++++++++++++++++++++
>  migration/savevm.h          |  1 +
>  qapi/migration.json         |  5 ++++-
>  6 files changed, 80 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 1fbfd099dd..39ef125225 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -186,6 +186,8 @@ struct VMStateDescription {
>      int (*pre_save)(void *opaque);
>      int (*post_save)(void *opaque);
>      bool (*needed)(void *opaque);
> +    bool (*dev_unplug_pending)(void *opaque);
> +
>      const VMStateField *fields;
>      const VMStateDescription **subsections;
>  };
> diff --git a/migration/migration.c b/migration/migration.c
> index 5f7e4d15e9..a17d9fb990 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -52,9 +52,14 @@
>  #include "hw/qdev-properties.h"
>  #include "monitor/monitor.h"
>  #include "net/announce.h"
> +#include "qemu/queue.h"
>  
>  #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
>  
> +/* Time in milliseconds to wait for guest OS to unplug PCI device */
> +#define FAILOVER_GUEST_UNPLUG_WAIT 10000
> +#define FAILOVER_UNPLUG_RETRIES 5
> +
>  /* Amount of time to allocate to each "chunk" of bandwidth-throttled
>   * data. */
>  #define BUFFER_DELAY     100
> @@ -954,6 +959,9 @@ static void fill_source_migration_info(MigrationInfo *info)
>      case MIGRATION_STATUS_CANCELLED:
>          info->has_status = true;
>          break;
> +    case MIGRATION_STATUS_WAIT_UNPLUG:
> +        info->has_status = true;
> +        break;
>      }
>      info->status = s->state;
>  }
> @@ -1695,6 +1703,7 @@ bool migration_is_idle(void)
>      case MIGRATION_STATUS_COLO:
>      case MIGRATION_STATUS_PRE_SWITCHOVER:
>      case MIGRATION_STATUS_DEVICE:
> +    case MIGRATION_STATUS_WAIT_UNPLUG:

I think the patchew failure that it shows in the migration test is
because that new state is visible to the test and it's not expecting it.

Dave

>          return false;
>      case MIGRATION_STATUS__MAX:
>          g_assert_not_reached();
> @@ -3224,6 +3233,8 @@ static void *migration_thread(void *opaque)
>      int64_t setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>      MigThrError thr_error;
>      bool urgent = false;
> +    bool all_unplugged = true;
> +    int i = 0;
>  
>      rcu_register_thread();
>  
> @@ -3260,6 +3271,27 @@ static void *migration_thread(void *opaque)
>  
>      qemu_savevm_state_setup(s->to_dst_file);
>  
> +    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> +                      MIGRATION_STATUS_WAIT_UNPLUG);
> +
> +    while (i < FAILOVER_UNPLUG_RETRIES &&
> +           s->state == MIGRATION_STATUS_WAIT_UNPLUG) {
> +        i++;
> +        qemu_sem_timedwait(&s->wait_unplug_sem, FAILOVER_GUEST_UNPLUG_WAIT);
> +        all_unplugged = qemu_savevm_state_guest_unplug_pending();
> +        if (all_unplugged) {
> +            break;
> +        }
> +    }
> +
> +    if (all_unplugged) {
> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
> +                MIGRATION_STATUS_ACTIVE);
> +    } else {
> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
> +                          MIGRATION_STATUS_CANCELLING);
> +    }
> +
>      s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
>      migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                        MIGRATION_STATUS_ACTIVE);
> @@ -3508,6 +3540,7 @@ static void migration_instance_finalize(Object *obj)
>      qemu_mutex_destroy(&ms->qemu_file_lock);
>      g_free(params->tls_hostname);
>      g_free(params->tls_creds);
> +    qemu_sem_destroy(&ms->wait_unplug_sem);
>      qemu_sem_destroy(&ms->rate_limit_sem);
>      qemu_sem_destroy(&ms->pause_sem);
>      qemu_sem_destroy(&ms->postcopy_pause_sem);
> @@ -3553,6 +3586,7 @@ static void migration_instance_init(Object *obj)
>      qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
>      qemu_sem_init(&ms->rp_state.rp_sem, 0);
>      qemu_sem_init(&ms->rate_limit_sem, 0);
> +    qemu_sem_init(&ms->wait_unplug_sem, 0);
>      qemu_mutex_init(&ms->qemu_file_lock);
>  }
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index 4f2fe193dc..79b3dda146 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,9 @@ struct MigrationState
>      /* Flag set once the migration thread called bdrv_inactivate_all */
>      bool block_inactive;
>  
> +    /* Migration is waiting for guest to unplug device */
> +    QemuSemaphore wait_unplug_sem;
> +
>      /* Migration is paused due to pause-before-switchover */
>      QemuSemaphore pause_sem;
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index bb9462a54d..26e5bde687 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -942,6 +942,20 @@ static void qemu_savevm_command_send(QEMUFile *f,
>      qemu_fflush(f);
>  }
>  
> +static int qemu_savevm_nr_failover_devices(void)
> +{
> +    SaveStateEntry *se;
> +    int n = 0;
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (se->vmsd && se->vmsd->dev_unplug_pending) {
> +            n++;
> +        }
> +    }
> +
> +    return n;
> +}
> +
>  void qemu_savevm_send_colo_enable(QEMUFile *f)
>  {
>      trace_savevm_send_colo_enable();
> @@ -1113,6 +1127,28 @@ void qemu_savevm_state_header(QEMUFile *f)
>      }
>  }
>  
> +bool qemu_savevm_state_guest_unplug_pending(void)
> +{
> +    int nr_failover_devs;
> +    SaveStateEntry *se;
> +    bool ret = false;
> +    int n = 0;
> +
> +    nr_failover_devs = qemu_savevm_nr_failover_devices();
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->vmsd || !se->vmsd->dev_unplug_pending) {
> +            continue;
> +        }
> +        ret = se->vmsd->dev_unplug_pending(se->opaque);
> +        if (!ret) {
> +            n++;
> +        }
> +    }
> +
> +    return n == nr_failover_devs;
> +}
> +
>  void qemu_savevm_state_setup(QEMUFile *f)
>  {
>      SaveStateEntry *se;
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 51a4b9caa8..ba64a7e271 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -31,6 +31,7 @@
>  
>  bool qemu_savevm_state_blocked(Error **errp);
>  void qemu_savevm_state_setup(QEMUFile *f);
> +bool qemu_savevm_state_guest_unplug_pending(void);
>  int qemu_savevm_state_resume_prepare(MigrationState *s);
>  void qemu_savevm_state_header(QEMUFile *f);
>  int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 52e69e2868..5a06cd489f 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -133,6 +133,9 @@
>  # @device: During device serialisation when pause-before-switchover is enabled
>  #        (since 2.11)
>  #
> +# @wait-unplug: wait for device unplug request by guest OS to be completed.
> +#               (since 4.2)
> +#
>  # Since: 2.3
>  #
>  ##
> @@ -140,7 +143,7 @@
>    'data': [ 'none', 'setup', 'cancelling', 'cancelled',
>              'active', 'postcopy-active', 'postcopy-paused',
>              'postcopy-recover', 'completed', 'failed', 'colo',
> -            'pre-switchover', 'device' ] }
> +            'pre-switchover', 'device', 'wait-unplug' ] }
>  
>  ##
>  # @MigrationInfo:
> -- 
> 2.21.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] migration: add new migration state wait-unplug
  2019-10-11 11:20 ` [PATCH v3 07/10] migration: add new migration state wait-unplug Jens Freimann
  2019-10-11 12:57   ` Michael S. Tsirkin
  2019-10-11 16:49   ` Dr. David Alan Gilbert
@ 2019-10-11 17:11   ` Dr. David Alan Gilbert
  2019-10-15  9:45     ` Jens Freimann
  2 siblings, 1 reply; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2019-10-11 17:11 UTC (permalink / raw)
  To: Jens Freimann
  Cc: ehabkost, mst, aadam, qemu-devel, alex.williamson, laine, ailan, parav

* Jens Freimann (jfreimann@redhat.com) wrote:
> This patch adds a new migration state called wait-unplug.  It is entered
> after the SETUP state and will transition into ACTIVE once all devices
> were succesfully unplugged from the guest.
> 
> So if a guest doesn't respond or takes long to honor the unplug request
> the user will see the migration state 'wait-unplug'.
> 
> In the migration thread we query failover devices if they're are still
> pending the guest unplug. When all are unplugged the migration
> continues. We give it a defined number of iterations including small
> waiting periods before we proceed.
> 
> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> ---
>  include/migration/vmstate.h |  2 ++
>  migration/migration.c       | 34 ++++++++++++++++++++++++++++++++++
>  migration/migration.h       |  3 +++
>  migration/savevm.c          | 36 ++++++++++++++++++++++++++++++++++++
>  migration/savevm.h          |  1 +
>  qapi/migration.json         |  5 ++++-
>  6 files changed, 80 insertions(+), 1 deletion(-)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 1fbfd099dd..39ef125225 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -186,6 +186,8 @@ struct VMStateDescription {
>      int (*pre_save)(void *opaque);
>      int (*post_save)(void *opaque);
>      bool (*needed)(void *opaque);
> +    bool (*dev_unplug_pending)(void *opaque);
> +
>      const VMStateField *fields;
>      const VMStateDescription **subsections;
>  };
> diff --git a/migration/migration.c b/migration/migration.c
> index 5f7e4d15e9..a17d9fb990 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -52,9 +52,14 @@
>  #include "hw/qdev-properties.h"
>  #include "monitor/monitor.h"
>  #include "net/announce.h"
> +#include "qemu/queue.h"
>  
>  #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
>  
> +/* Time in milliseconds to wait for guest OS to unplug PCI device */
> +#define FAILOVER_GUEST_UNPLUG_WAIT 10000
> +#define FAILOVER_UNPLUG_RETRIES 5
> +
>  /* Amount of time to allocate to each "chunk" of bandwidth-throttled
>   * data. */
>  #define BUFFER_DELAY     100
> @@ -954,6 +959,9 @@ static void fill_source_migration_info(MigrationInfo *info)
>      case MIGRATION_STATUS_CANCELLED:
>          info->has_status = true;
>          break;
> +    case MIGRATION_STATUS_WAIT_UNPLUG:
> +        info->has_status = true;
> +        break;
>      }
>      info->status = s->state;
>  }
> @@ -1695,6 +1703,7 @@ bool migration_is_idle(void)
>      case MIGRATION_STATUS_COLO:
>      case MIGRATION_STATUS_PRE_SWITCHOVER:
>      case MIGRATION_STATUS_DEVICE:
> +    case MIGRATION_STATUS_WAIT_UNPLUG:
>          return false;
>      case MIGRATION_STATUS__MAX:
>          g_assert_not_reached();
> @@ -3224,6 +3233,8 @@ static void *migration_thread(void *opaque)
>      int64_t setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>      MigThrError thr_error;
>      bool urgent = false;
> +    bool all_unplugged = true;
> +    int i = 0;
>  
>      rcu_register_thread();
>  
> @@ -3260,6 +3271,27 @@ static void *migration_thread(void *opaque)
>  
>      qemu_savevm_state_setup(s->to_dst_file);
>  
> +    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> +                      MIGRATION_STATUS_WAIT_UNPLUG);

I think I'd prefer if you only went into this state if you had any
devices that were going to need unplugging.

> +    while (i < FAILOVER_UNPLUG_RETRIES &&
> +           s->state == MIGRATION_STATUS_WAIT_UNPLUG) {
> +        i++;
> +        qemu_sem_timedwait(&s->wait_unplug_sem, FAILOVER_GUEST_UNPLUG_WAIT);
> +        all_unplugged = qemu_savevm_state_guest_unplug_pending();
> +        if (all_unplugged) {
> +            break;
> +        }
> +    }
> +
> +    if (all_unplugged) {
> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
> +                MIGRATION_STATUS_ACTIVE);
> +    } else {
> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
> +                          MIGRATION_STATUS_CANCELLING);
> +    }

I think you can get rid of both the timeout and the count and just make
sure that migrate_cancel works at this point.
This pushes the problem up a layer, which I think is fine.

>      s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
>      migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                        MIGRATION_STATUS_ACTIVE);
> @@ -3508,6 +3540,7 @@ static void migration_instance_finalize(Object *obj)
>      qemu_mutex_destroy(&ms->qemu_file_lock);
>      g_free(params->tls_hostname);
>      g_free(params->tls_creds);
> +    qemu_sem_destroy(&ms->wait_unplug_sem);
>      qemu_sem_destroy(&ms->rate_limit_sem);
>      qemu_sem_destroy(&ms->pause_sem);
>      qemu_sem_destroy(&ms->postcopy_pause_sem);
> @@ -3553,6 +3586,7 @@ static void migration_instance_init(Object *obj)
>      qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
>      qemu_sem_init(&ms->rp_state.rp_sem, 0);
>      qemu_sem_init(&ms->rate_limit_sem, 0);
> +    qemu_sem_init(&ms->wait_unplug_sem, 0);
>      qemu_mutex_init(&ms->qemu_file_lock);
>  }
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index 4f2fe193dc..79b3dda146 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,9 @@ struct MigrationState
>      /* Flag set once the migration thread called bdrv_inactivate_all */
>      bool block_inactive;
>  
> +    /* Migration is waiting for guest to unplug device */
> +    QemuSemaphore wait_unplug_sem;
> +
>      /* Migration is paused due to pause-before-switchover */
>      QemuSemaphore pause_sem;
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index bb9462a54d..26e5bde687 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -942,6 +942,20 @@ static void qemu_savevm_command_send(QEMUFile *f,
>      qemu_fflush(f);
>  }
>  
> +static int qemu_savevm_nr_failover_devices(void)
> +{
> +    SaveStateEntry *se;
> +    int n = 0;
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (se->vmsd && se->vmsd->dev_unplug_pending) {
> +            n++;
> +        }
> +    }
> +
> +    return n;
> +}
> +
>  void qemu_savevm_send_colo_enable(QEMUFile *f)
>  {
>      trace_savevm_send_colo_enable();
> @@ -1113,6 +1127,28 @@ void qemu_savevm_state_header(QEMUFile *f)
>      }
>  }
>  
> +bool qemu_savevm_state_guest_unplug_pending(void)
> +{
> +    int nr_failover_devs;
> +    SaveStateEntry *se;
> +    bool ret = false;
> +    int n = 0;
> +
> +    nr_failover_devs = qemu_savevm_nr_failover_devices();
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->vmsd || !se->vmsd->dev_unplug_pending) {
> +            continue;
> +        }
> +        ret = se->vmsd->dev_unplug_pending(se->opaque);
> +        if (!ret) {
> +            n++;
> +        }
> +    }
> +
> +    return n == nr_failover_devs;
> +}
> +
>  void qemu_savevm_state_setup(QEMUFile *f)
>  {
>      SaveStateEntry *se;
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 51a4b9caa8..ba64a7e271 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -31,6 +31,7 @@
>  
>  bool qemu_savevm_state_blocked(Error **errp);
>  void qemu_savevm_state_setup(QEMUFile *f);
> +bool qemu_savevm_state_guest_unplug_pending(void);
>  int qemu_savevm_state_resume_prepare(MigrationState *s);
>  void qemu_savevm_state_header(QEMUFile *f);
>  int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 52e69e2868..5a06cd489f 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -133,6 +133,9 @@
>  # @device: During device serialisation when pause-before-switchover is enabled
>  #        (since 2.11)
>  #
> +# @wait-unplug: wait for device unplug request by guest OS to be completed.
> +#               (since 4.2)
> +#
>  # Since: 2.3
>  #
>  ##
> @@ -140,7 +143,7 @@
>    'data': [ 'none', 'setup', 'cancelling', 'cancelled',
>              'active', 'postcopy-active', 'postcopy-paused',
>              'postcopy-recover', 'completed', 'failed', 'colo',
> -            'pre-switchover', 'device' ] }
> +            'pre-switchover', 'device', 'wait-unplug' ] }
>  
>  ##
>  # @MigrationInfo:
> -- 
> 2.21.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/10] qdev/qbus: add hidden device support
  2019-10-11 11:20 ` [PATCH v3 01/10] qdev/qbus: add hidden device support Jens Freimann
@ 2019-10-14  9:46   ` Cornelia Huck
  2019-10-14 12:02     ` Jens Freimann
  2019-10-15 19:19   ` Alex Williamson
  1 sibling, 1 reply; 35+ messages in thread
From: Cornelia Huck @ 2019-10-14  9:46 UTC (permalink / raw)
  To: Jens Freimann
  Cc: parav, mst, aadam, qemu-devel, dgilbert, alex.williamson, laine,
	ailan, ehabkost

On Fri, 11 Oct 2019 13:20:06 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> This adds support for hiding a device to the qbus and qdev APIs.  The
> first user of this will be the virtio-net failover feature but the API
> introduced with this patch could be used to implement other features as
> well, for example hiding pci devices when a pci bus is powered off.
> 
> qdev_device_add() is modified to check for a net_failover_pair_id
> argument in the option string. A DeviceListener callback
> should_be_hidden() is added. It can be used by a standby device to
> inform qdev that this device should not be added now. The standby device
> handler can store the device options to plug the device in at a later
> point in time.
> 
> One reason for hiding the device is that we don't want to expose both
> devices to the guest kernel until the respective virtio feature bit
> VIRTIO_NET_F_STANDBY was negotiated and we know that the devices will be
> handled correctly by the guest.
> 
> More information on the kernel feature this is using:
>  https://www.kernel.org/doc/html/latest/networking/net_failover.html
> 
> An example where the primary device is a vfio-pci device and the standby
> device is a virtio-net device:
> 
> A device is hidden when it has an "net_failover_pair_id" option, e.g.
> 
>  -device virtio-net-pci,...,failover=on,...
>  -device vfio-pci,...,net_failover_pair_id=net1,...
> 
> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> ---
>  hw/core/qdev.c         | 19 +++++++++++++++++++
>  include/hw/qdev-core.h |  9 +++++++++
>  qdev-monitor.c         | 43 ++++++++++++++++++++++++++++++++++++++----
>  vl.c                   |  6 ++++--
>  4 files changed, 71 insertions(+), 6 deletions(-)
> 

(...)

> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index aa123f88cb..b61cf82ded 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -154,6 +154,13 @@ struct DeviceState {
>  struct DeviceListener {
>      void (*realize)(DeviceListener *listener, DeviceState *dev);
>      void (*unrealize)(DeviceListener *listener, DeviceState *dev);
> +    /*
> +     * This callback is called just upon init of the DeviceState
> +     * and can be used by a standby device for informing qdev if this
> +     * device should be hidden by checking the device opts
> +     */

Maybe tweak this comment a bit:

/*
 * This callback is called upon init of the DeviceState and allows to
 * inform qdev that a device should be hidden, depending on the device
 * opts, for example, to hide a standby device.
 */

This makes it clearer that this interface could be reused for other
purposes.

> +    void (*should_be_hidden)(DeviceListener *listener, QemuOpts *device_opts,
> +            bool *match_found, bool *res);
>      QTAILQ_ENTRY(DeviceListener) link;
>  };
>  
> @@ -451,4 +458,6 @@ static inline bool qbus_is_hotpluggable(BusState *bus)
>  void device_listener_register(DeviceListener *listener);
>  void device_listener_unregister(DeviceListener *listener);
>  
> +bool qdev_should_hide_device(QemuOpts *opts, Error **errp);
> +
>  #endif
> diff --git a/qdev-monitor.c b/qdev-monitor.c
> index 148df9cacf..9fc8331157 100644
> --- a/qdev-monitor.c
> +++ b/qdev-monitor.c
> @@ -32,9 +32,11 @@
>  #include "qemu/help_option.h"
>  #include "qemu/option.h"
>  #include "qemu/qemu-print.h"
> +#include "qemu/option_int.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/sysemu.h"
>  #include "migration/misc.h"
> +#include "migration/migration.h"
>  
>  /*
>   * Aliases were a bad idea from the start.  Let's keep them
> @@ -562,14 +564,45 @@ void qdev_set_id(DeviceState *dev, const char *id)
>      }
>  }
>  
> +static int is_failover_device(void *opaque, const char *name, const char *value,
> +                        Error **errp)
> +{
> +    if (strcmp(name, "net_failover_pair_id") == 0) {
> +        QemuOpts *opts = (QemuOpts *)opaque;
> +
> +        if (qdev_should_hide_device(opts, errp) && errp && !*errp) {
> +            return 1;
> +        } else if (errp && *errp) {
> +            return -1;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static bool should_hide_device(QemuOpts *opts, Error **err)
> +{
> +    if (qemu_opt_foreach(opts, is_failover_device, opts, err) == 0) {
> +        return false;

Maybe turn that check around? I.e. return true if the failover property
is present, else return false. Makes it easier to add checks for other
properties in the future.

> +    }
> +    return true;
> +}
> +
>  DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
>  {
>      DeviceClass *dc;
>      const char *driver, *path;
> -    DeviceState *dev;
> +    DeviceState *dev = NULL;
>      BusState *bus = NULL;
>      Error *err = NULL;
>  
> +    if (opts && should_hide_device(opts, &err)) {
> +        if (err) {
> +            goto err_del_dev;
> +        }
> +        return NULL;
> +    }
> +
>      driver = qemu_opt_get(opts, "driver");
>      if (!driver) {
>          error_setg(errp, QERR_MISSING_PARAMETER, "driver");
> @@ -648,8 +681,10 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
>  
>  err_del_dev:
>      error_propagate(errp, err);
> -    object_unparent(OBJECT(dev));
> -    object_unref(OBJECT(dev));
> +    if (dev) {
> +        object_unparent(OBJECT(dev));
> +        object_unref(OBJECT(dev));
> +    }
>      return NULL;
>  }
>  
> @@ -818,7 +853,7 @@ void qdev_unplug(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    if (!migration_is_idle()) {
> +    if (!migration_is_idle() && !migration_in_setup(migrate_get_current())) {

Hm, should that hunk go into another patch?

>          error_setg(errp, "device_del not allowed while migrating");
>          return;
>      }

(...)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 10/10] vfio: unplug failover primary device before migration
  2019-10-11 11:20 ` [PATCH v3 10/10] vfio: unplug failover primary device before migration Jens Freimann
@ 2019-10-14 10:05   ` Cornelia Huck
  2019-10-16  1:52   ` Alex Williamson
  1 sibling, 0 replies; 35+ messages in thread
From: Cornelia Huck @ 2019-10-14 10:05 UTC (permalink / raw)
  To: Jens Freimann
  Cc: parav, mst, aadam, qemu-devel, dgilbert, alex.williamson, laine,
	ailan, ehabkost

On Fri, 11 Oct 2019 13:20:15 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> As usual block all vfio-pci devices from being migrated, but make an
> exception for failover primary devices. This is achieved by setting
> unmigratable to 0 but also add a migration blocker for all vfio-pci
> devices except failover primary devices. These will be unplugged before
> migration happens by the migration handler of the corresponding
> virtio-net standby device.
> 
> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> ---
>  hw/vfio/pci.c | 35 ++++++++++++++++++++++++++++++++++-
>  hw/vfio/pci.h |  2 ++
>  2 files changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c5e6fe61cb..64cf8e07d9 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -40,6 +40,9 @@
>  #include "pci.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> +#include "migration/blocker.h"
> +#include "qemu/option.h"
> +#include "qemu/option_int.h"
>  
>  #define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> @@ -2698,6 +2701,12 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>      vdev->req_enabled = false;
>  }
>  
> +static int has_net_failover_arg(void *opaque, const char *name,
> +                           const char *value, Error **errp)
> +{
> +    return (strcmp(name, "net_failover_pair_id") == 0);
> +}
> +
>  static void vfio_realize(PCIDevice *pdev, Error **errp)
>  {
>      VFIOPCIDevice *vdev = PCI_VFIO(pdev);
> @@ -2710,6 +2719,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      int groupid;
>      int i, ret;
>      bool is_mdev;
> +    uint16_t class_id;
> +
> +    if (qemu_opt_foreach(pdev->qdev.opts, has_net_failover_arg,
> +                         (void *) pdev->qdev.opts, &err) == 0) {
> +        error_setg(&vdev->migration_blocker,
> +                "VFIO device doesn't support migration");
> +        ret = migrate_add_blocker(vdev->migration_blocker, &err);
> +        if (err) {
> +            error_propagate(errp, err);
> +            error_free(vdev->migration_blocker);
> +        }

Do we really want to continue here if adding the migration blocker
failed? Previously, adding the vmsd with unmigratable=1 would have
prevented realization of the device if --only-migratable was specified;
I guess we should keep that behaviour and just bail out here?

> +    } else {
> +        pdev->qdev.allow_unplug_during_migration = true;
> +    }
>  
>      if (!vdev->vbasedev.sysfsdev) {
>          if (!(~vdev->host.domain || ~vdev->host.bus ||
> @@ -2812,6 +2835,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          goto error;
>      }
>  
> +    if (vdev->net_failover_pair_id != NULL) {
> +        class_id = pci_get_word(pdev->config + PCI_CLASS_DEVICE);
> +        if (class_id != PCI_CLASS_NETWORK_ETHERNET) {
> +            error_setg(errp, "failover device is not an Ethernet device");
> +            goto error;
> +        }
> +    }
> +
>      /* vfio emulates a lot for us, but some bits need extra love */
>      vdev->emulated_config_bits = g_malloc0(vdev->config_size);
>  
> @@ -3110,6 +3141,8 @@ static Property vfio_pci_dev_properties[] = {
>                              display, ON_OFF_AUTO_OFF),
>      DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>      DEFINE_PROP_UINT32("yres", VFIOPCIDevice, display_yres, 0),
> +    DEFINE_PROP_STRING("net_failover_pair_id", VFIOPCIDevice,
> +            net_failover_pair_id),
>      DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
>                         intx.mmap_timeout, 1100),
>      DEFINE_PROP_BIT("x-vga", VFIOPCIDevice, features,
> @@ -3152,7 +3185,7 @@ static Property vfio_pci_dev_properties[] = {
>  
>  static const VMStateDescription vfio_pci_vmstate = {
>      .name = "vfio-pci",
> -    .unmigratable = 1,
> +    .unmigratable = 0,

Is this vmsd useful in any way with that change anymore?

>  };
>  

(...)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/10] qdev/qbus: add hidden device support
  2019-10-14  9:46   ` Cornelia Huck
@ 2019-10-14 12:02     ` Jens Freimann
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-14 12:02 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: parav, mst, aadam, qemu-devel, dgilbert, alex.williamson, laine,
	ailan, ehabkost

On Mon, Oct 14, 2019 at 11:46:22AM +0200, Cornelia Huck wrote:
>On Fri, 11 Oct 2019 13:20:06 +0200
>Jens Freimann <jfreimann@redhat.com> wrote:
>
>> This adds support for hiding a device to the qbus and qdev APIs.  The
>> first user of this will be the virtio-net failover feature but the API
>> introduced with this patch could be used to implement other features as
>> well, for example hiding pci devices when a pci bus is powered off.
>>
>> qdev_device_add() is modified to check for a net_failover_pair_id
>> argument in the option string. A DeviceListener callback
>> should_be_hidden() is added. It can be used by a standby device to
>> inform qdev that this device should not be added now. The standby device
>> handler can store the device options to plug the device in at a later
>> point in time.
>>
>> One reason for hiding the device is that we don't want to expose both
>> devices to the guest kernel until the respective virtio feature bit
>> VIRTIO_NET_F_STANDBY was negotiated and we know that the devices will be
>> handled correctly by the guest.
>>
>> More information on the kernel feature this is using:
>>  https://www.kernel.org/doc/html/latest/networking/net_failover.html
>>
>> An example where the primary device is a vfio-pci device and the standby
>> device is a virtio-net device:
>>
>> A device is hidden when it has an "net_failover_pair_id" option, e.g.
>>
>>  -device virtio-net-pci,...,failover=on,...
>>  -device vfio-pci,...,net_failover_pair_id=net1,...
>>
>> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
>> ---
>>  hw/core/qdev.c         | 19 +++++++++++++++++++
>>  include/hw/qdev-core.h |  9 +++++++++
>>  qdev-monitor.c         | 43 ++++++++++++++++++++++++++++++++++++++----
>>  vl.c                   |  6 ++++--
>>  4 files changed, 71 insertions(+), 6 deletions(-)
>>
>
>(...)
>
>> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
>> index aa123f88cb..b61cf82ded 100644
>> --- a/include/hw/qdev-core.h
>> +++ b/include/hw/qdev-core.h
>> @@ -154,6 +154,13 @@ struct DeviceState {
>>  struct DeviceListener {
>>      void (*realize)(DeviceListener *listener, DeviceState *dev);
>>      void (*unrealize)(DeviceListener *listener, DeviceState *dev);
>> +    /*
>> +     * This callback is called just upon init of the DeviceState
>> +     * and can be used by a standby device for informing qdev if this
>> +     * device should be hidden by checking the device opts
>> +     */
>
>Maybe tweak this comment a bit:
>
>/*
> * This callback is called upon init of the DeviceState and allows to
> * inform qdev that a device should be hidden, depending on the device
> * opts, for example, to hide a standby device.
> */
>
>This makes it clearer that this interface could be reused for other
>purposes.

Sounds good. 
>
>> +    void (*should_be_hidden)(DeviceListener *listener, QemuOpts *device_opts,
>> +            bool *match_found, bool *res);
>>      QTAILQ_ENTRY(DeviceListener) link;
>>  };
>>
>> @@ -451,4 +458,6 @@ static inline bool qbus_is_hotpluggable(BusState *bus)
>>  void device_listener_register(DeviceListener *listener);
>>  void device_listener_unregister(DeviceListener *listener);
>>
>> +bool qdev_should_hide_device(QemuOpts *opts, Error **errp);
>> +
>>  #endif
>> diff --git a/qdev-monitor.c b/qdev-monitor.c
>> index 148df9cacf..9fc8331157 100644
>> --- a/qdev-monitor.c
>> +++ b/qdev-monitor.c
>> @@ -32,9 +32,11 @@
>>  #include "qemu/help_option.h"
>>  #include "qemu/option.h"
>>  #include "qemu/qemu-print.h"
>> +#include "qemu/option_int.h"
>>  #include "sysemu/block-backend.h"
>>  #include "sysemu/sysemu.h"
>>  #include "migration/misc.h"
>> +#include "migration/migration.h"
>>
>>  /*
>>   * Aliases were a bad idea from the start.  Let's keep them
>> @@ -562,14 +564,45 @@ void qdev_set_id(DeviceState *dev, const char *id)
>>      }
>>  }
>>
>> +static int is_failover_device(void *opaque, const char *name, const char *value,
>> +                        Error **errp)
>> +{
>> +    if (strcmp(name, "net_failover_pair_id") == 0) {
>> +        QemuOpts *opts = (QemuOpts *)opaque;
>> +
>> +        if (qdev_should_hide_device(opts, errp) && errp && !*errp) {
>> +            return 1;
>> +        } else if (errp && *errp) {
>> +            return -1;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static bool should_hide_device(QemuOpts *opts, Error **err)
>> +{
>> +    if (qemu_opt_foreach(opts, is_failover_device, opts, err) == 0) {
>> +        return false;
>
>Maybe turn that check around? I.e. return true if the failover property
>is present, else return false. Makes it easier to add checks for other
>properties in the future.

Done.

>> +    }
>> +    return true;
>> +}
>> +
>>  DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
>>  {
>>      DeviceClass *dc;
>>      const char *driver, *path;
>> -    DeviceState *dev;
>> +    DeviceState *dev = NULL;
>>      BusState *bus = NULL;
>>      Error *err = NULL;
>>
>> +    if (opts && should_hide_device(opts, &err)) {
>> +        if (err) {
>> +            goto err_del_dev;
>> +        }
>> +        return NULL;
>> +    }
>> +
>>      driver = qemu_opt_get(opts, "driver");
>>      if (!driver) {
>>          error_setg(errp, QERR_MISSING_PARAMETER, "driver");
>> @@ -648,8 +681,10 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
>>
>>  err_del_dev:
>>      error_propagate(errp, err);
>> -    object_unparent(OBJECT(dev));
>> -    object_unref(OBJECT(dev));
>> +    if (dev) {
>> +        object_unparent(OBJECT(dev));
>> +        object_unref(OBJECT(dev));
>> +    }
>>      return NULL;
>>  }
>>
>> @@ -818,7 +853,7 @@ void qdev_unplug(DeviceState *dev, Error **errp)
>>          return;
>>      }
>>
>> -    if (!migration_is_idle()) {
>> +    if (!migration_is_idle() && !migration_in_setup(migrate_get_current())) {
>
>Hm, should that hunk go into another patch?

Yes it should :)

Thanks for the review!

regards,
Jens 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] migration: add new migration state wait-unplug
  2019-10-11 17:11   ` Dr. David Alan Gilbert
@ 2019-10-15  9:45     ` Jens Freimann
  2019-10-15 10:50       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Freimann @ 2019-10-15  9:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: ehabkost, mst, aadam, qemu-devel, alex.williamson, laine, ailan, parav

On Fri, Oct 11, 2019 at 06:11:33PM +0100, Dr. David Alan Gilbert wrote:
>* Jens Freimann (jfreimann@redhat.com) wrote:
>> This patch adds a new migration state called wait-unplug.  It is entered
>> after the SETUP state and will transition into ACTIVE once all devices
>> were succesfully unplugged from the guest.
>>
>> So if a guest doesn't respond or takes long to honor the unplug request
>> the user will see the migration state 'wait-unplug'.
>>
>> In the migration thread we query failover devices if they're are still
>> pending the guest unplug. When all are unplugged the migration
>> continues. We give it a defined number of iterations including small
>> waiting periods before we proceed.
>>
>> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
[..]
>> @@ -3260,6 +3271,27 @@ static void *migration_thread(void *opaque)
>>
>>      qemu_savevm_state_setup(s->to_dst_file);
>>
>> +    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>> +                      MIGRATION_STATUS_WAIT_UNPLUG);
>
>I think I'd prefer if you only went into this state if you had any
>devices that were going to need unplugging.

Sure, that makes sense. I'll change it.

>> +    while (i < FAILOVER_UNPLUG_RETRIES &&
>> +           s->state == MIGRATION_STATUS_WAIT_UNPLUG) {
>> +        i++;
>> +        qemu_sem_timedwait(&s->wait_unplug_sem, FAILOVER_GUEST_UNPLUG_WAIT);
>> +        all_unplugged = qemu_savevm_state_guest_unplug_pending();
>> +        if (all_unplugged) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    if (all_unplugged) {
>> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
>> +                MIGRATION_STATUS_ACTIVE);
>> +    } else {
>> +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
>> +                          MIGRATION_STATUS_CANCELLING);
>> +    }
>
>I think you can get rid of both the timeout and the count and just make
>sure that migrate_cancel works at this point.

I see, I need to add the new state to migration_is_setup_or_active() or
a cancel won't work.  

>This pushes the problem up a layer, which I think is fine.

Seems good to me. To be clear, you're saying I should just poll on
the device unplugged state? Like

         while (s->state == MIGRATION_STATUS_WAIT_UNPLUG &&
                !qemu_savevm_state_guest_unplug_pending()) {
_            /* This block intentionally left blank */
         }

regards,
Jens  


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] migration: add new migration state wait-unplug
  2019-10-15  9:45     ` Jens Freimann
@ 2019-10-15 10:50       ` Dr. David Alan Gilbert
  2019-10-16  7:07         ` Jens Freimann
  0 siblings, 1 reply; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2019-10-15 10:50 UTC (permalink / raw)
  To: Jens Freimann
  Cc: ehabkost, mst, aadam, qemu-devel, alex.williamson, laine, ailan, parav

* Jens Freimann (jfreimann@redhat.com) wrote:
> On Fri, Oct 11, 2019 at 06:11:33PM +0100, Dr. David Alan Gilbert wrote:
> > * Jens Freimann (jfreimann@redhat.com) wrote:
> > > This patch adds a new migration state called wait-unplug.  It is entered
> > > after the SETUP state and will transition into ACTIVE once all devices
> > > were succesfully unplugged from the guest.
> > > 
> > > So if a guest doesn't respond or takes long to honor the unplug request
> > > the user will see the migration state 'wait-unplug'.
> > > 
> > > In the migration thread we query failover devices if they're are still
> > > pending the guest unplug. When all are unplugged the migration
> > > continues. We give it a defined number of iterations including small
> > > waiting periods before we proceed.
> > > 
> > > Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> [..]
> > > @@ -3260,6 +3271,27 @@ static void *migration_thread(void *opaque)
> > > 
> > >      qemu_savevm_state_setup(s->to_dst_file);
> > > 
> > > +    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> > > +                      MIGRATION_STATUS_WAIT_UNPLUG);
> > 
> > I think I'd prefer if you only went into this state if you had any
> > devices that were going to need unplugging.
> 
> Sure, that makes sense. I'll change it.
> 
> > > +    while (i < FAILOVER_UNPLUG_RETRIES &&
> > > +           s->state == MIGRATION_STATUS_WAIT_UNPLUG) {
> > > +        i++;
> > > +        qemu_sem_timedwait(&s->wait_unplug_sem, FAILOVER_GUEST_UNPLUG_WAIT);
> > > +        all_unplugged = qemu_savevm_state_guest_unplug_pending();
> > > +        if (all_unplugged) {
> > > +            break;
> > > +        }
> > > +    }
> > > +
> > > +    if (all_unplugged) {
> > > +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
> > > +                MIGRATION_STATUS_ACTIVE);
> > > +    } else {
> > > +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
> > > +                          MIGRATION_STATUS_CANCELLING);
> > > +    }
> > 
> > I think you can get rid of both the timeout and the count and just make
> > sure that migrate_cancel works at this point.
> 
> I see, I need to add the new state to migration_is_setup_or_active() or
> a cancel won't work.

You probably need to do that anyway given all the other places
is_setup_or_active is called.

> > This pushes the problem up a layer, which I think is fine.
> 
> Seems good to me. To be clear, you're saying I should just poll on
> the device unplugged state? Like
> 
>         while (s->state == MIGRATION_STATUS_WAIT_UNPLUG &&
>                !qemu_savevm_state_guest_unplug_pending()) {
> _            /* This block intentionally left blank */
>         }

I'd keep the qemu_sem_timedwait in there, but with a short time out
(e.g. 250ms say); that way it doesn't eat cpu, but also the cancel still
happens quickly.

Dave

> 
> regards,
> Jens
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/10] add failover feature for assigned network devices
  2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
                   ` (11 preceding siblings ...)
  2019-10-11 16:04 ` no-reply
@ 2019-10-15 19:03 ` Alex Williamson
  2019-10-15 21:17   ` Michael S. Tsirkin
  2019-10-17 10:33   ` Jens Freimann
  12 siblings, 2 replies; 35+ messages in thread
From: Alex Williamson @ 2019-10-15 19:03 UTC (permalink / raw)
  To: Jens Freimann
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Fri, 11 Oct 2019 13:20:05 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> This is implementing the host side of the net_failover concept
> (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> 
> Changes since v2:
> * back out of creating failover pair when it is a non-networking
>   vfio-pci device (Alex W)
> * handle migration state change from within the migration thread. I do a
>   timed wait on a semaphore and then check if all unplugs were
>   succesful. Added a new function to each device that checks the device
>   if the unplug for it has happened. When all devices report the succesful
>   unplug *or* the time/retries is up, continue with the migration or
>   cancel. When not all devices could be unplugged I am cancelling at the
>   moment. It is likely that we can't plug it back at the destination which
>   would result in degraded network performance.
> * fix a few bugs regarding re-plug on migration source and target 
> * run full set of tests including migration tests
> * add patch for libqos to tolerate new migration state
> * squashed patch 1 and 2, added patch 8 
>  
> The general idea is that we have a pair of devices, a vfio-pci and a
> virtio-net device. Before migration the vfio device is unplugged and data
> flows to the virtio-net device, on the target side another vfio-pci device
> is plugged in to take over the data-path. In the guest the net_failover
> module will pair net devices with the same MAC address.
> 
> * Patch 1 adds the infrastructure to hide the device for the qbus and qdev APIs
> 
> * Patch 2 sets a new flag for PCIDevice 'partially_hotplugged' which we
>   use to skip the unrealize code path when doing a unplug of the primary
>   device
> 
> * Patch 3 sets the pending_deleted_event before triggering the guest
>   unplug request

These only cover pcie hotplug, is this feature somehow dependent on
pcie?  There's also ACPI-based PCI hotplug, SHPC hotplug, and it looks
like s390 has it's own version (of course) of PCI hotplug.  IMO, we
either need to make an attempt to support this universally or the
option needs to fail if the hotplug controller doesn't support partial
removal.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/10] qdev/qbus: add hidden device support
  2019-10-11 11:20 ` [PATCH v3 01/10] qdev/qbus: add hidden device support Jens Freimann
  2019-10-14  9:46   ` Cornelia Huck
@ 2019-10-15 19:19   ` Alex Williamson
  2019-10-16  7:04     ` Jens Freimann
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2019-10-15 19:19 UTC (permalink / raw)
  To: Jens Freimann
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Fri, 11 Oct 2019 13:20:06 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> This adds support for hiding a device to the qbus and qdev APIs.  The
> first user of this will be the virtio-net failover feature but the API
> introduced with this patch could be used to implement other features as
> well, for example hiding pci devices when a pci bus is powered off.
> 
> qdev_device_add() is modified to check for a net_failover_pair_id
> argument in the option string. A DeviceListener callback
> should_be_hidden() is added. It can be used by a standby device to
> inform qdev that this device should not be added now. The standby device
> handler can store the device options to plug the device in at a later
> point in time.
> 
> One reason for hiding the device is that we don't want to expose both
> devices to the guest kernel until the respective virtio feature bit
> VIRTIO_NET_F_STANDBY was negotiated and we know that the devices will be
> handled correctly by the guest.
> 
> More information on the kernel feature this is using:
>  https://www.kernel.org/doc/html/latest/networking/net_failover.html
> 
> An example where the primary device is a vfio-pci device and the standby
> device is a virtio-net device:
> 
> A device is hidden when it has an "net_failover_pair_id" option, e.g.
> 
>  -device virtio-net-pci,...,failover=on,...
>  -device vfio-pci,...,net_failover_pair_id=net1,...
> 
> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> ---
>  hw/core/qdev.c         | 19 +++++++++++++++++++
>  include/hw/qdev-core.h |  9 +++++++++
>  qdev-monitor.c         | 43 ++++++++++++++++++++++++++++++++++++++----
>  vl.c                   |  6 ++++--
>  4 files changed, 71 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> index cbad6c1d55..84fac591ca 100644
> --- a/hw/core/qdev.c
> +++ b/hw/core/qdev.c
> @@ -212,6 +212,25 @@ void device_listener_unregister(DeviceListener *listener)
>      QTAILQ_REMOVE(&device_listeners, listener, link);
>  }
>  
> +bool qdev_should_hide_device(QemuOpts *opts, Error **errp)
> +{
> +    bool res = false;
> +    bool match_found = false;
> +    DeviceListener *listener;
> +
> +    QTAILQ_FOREACH(listener, &device_listeners, link) {
> +       if (listener->should_be_hidden) {
> +            listener->should_be_hidden(listener, opts, &match_found, &res);
> +        }
> +
> +        if (match_found) {
> +            break;
> +        }

Calling convention here seems overly complicated, couldn't
should_be_hidden() just return >0 (should be hidden), 0 (should not be
hidden), <0 (don't care), ie. continue until >=0?  The errp arg is
unused and using "res" to return should/shouldn't hide is very unclear.
The virtio callback renames this to hide, which makes more sense, but
as above, both the stop and hidden state could be conveyed with a
simple int return value.  Thanks,

Alex

> +    }
> +
> +    return res;
> +}
> +
>  void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
>                                   int required_for_version)
>  {
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index aa123f88cb..b61cf82ded 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -154,6 +154,13 @@ struct DeviceState {
>  struct DeviceListener {
>      void (*realize)(DeviceListener *listener, DeviceState *dev);
>      void (*unrealize)(DeviceListener *listener, DeviceState *dev);
> +    /*
> +     * This callback is called just upon init of the DeviceState
> +     * and can be used by a standby device for informing qdev if this
> +     * device should be hidden by checking the device opts
> +     */
> +    void (*should_be_hidden)(DeviceListener *listener, QemuOpts *device_opts,
> +            bool *match_found, bool *res);
>      QTAILQ_ENTRY(DeviceListener) link;
>  };
>  
> @@ -451,4 +458,6 @@ static inline bool qbus_is_hotpluggable(BusState *bus)
>  void device_listener_register(DeviceListener *listener);
>  void device_listener_unregister(DeviceListener *listener);
>  
> +bool qdev_should_hide_device(QemuOpts *opts, Error **errp);
> +
>  #endif
> diff --git a/qdev-monitor.c b/qdev-monitor.c
> index 148df9cacf..9fc8331157 100644
> --- a/qdev-monitor.c
> +++ b/qdev-monitor.c
> @@ -32,9 +32,11 @@
>  #include "qemu/help_option.h"
>  #include "qemu/option.h"
>  #include "qemu/qemu-print.h"
> +#include "qemu/option_int.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/sysemu.h"
>  #include "migration/misc.h"
> +#include "migration/migration.h"
>  
>  /*
>   * Aliases were a bad idea from the start.  Let's keep them
> @@ -562,14 +564,45 @@ void qdev_set_id(DeviceState *dev, const char *id)
>      }
>  }
>  
> +static int is_failover_device(void *opaque, const char *name, const char *value,
> +                        Error **errp)
> +{
> +    if (strcmp(name, "net_failover_pair_id") == 0) {
> +        QemuOpts *opts = (QemuOpts *)opaque;
> +
> +        if (qdev_should_hide_device(opts, errp) && errp && !*errp) {
> +            return 1;
> +        } else if (errp && *errp) {
> +            return -1;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static bool should_hide_device(QemuOpts *opts, Error **err)
> +{
> +    if (qemu_opt_foreach(opts, is_failover_device, opts, err) == 0) {
> +        return false;
> +    }
> +    return true;
> +}
> +
>  DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
>  {
>      DeviceClass *dc;
>      const char *driver, *path;
> -    DeviceState *dev;
> +    DeviceState *dev = NULL;
>      BusState *bus = NULL;
>      Error *err = NULL;
>  
> +    if (opts && should_hide_device(opts, &err)) {
> +        if (err) {
> +            goto err_del_dev;
> +        }
> +        return NULL;
> +    }
> +
>      driver = qemu_opt_get(opts, "driver");
>      if (!driver) {
>          error_setg(errp, QERR_MISSING_PARAMETER, "driver");
> @@ -648,8 +681,10 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
>  
>  err_del_dev:
>      error_propagate(errp, err);
> -    object_unparent(OBJECT(dev));
> -    object_unref(OBJECT(dev));
> +    if (dev) {
> +        object_unparent(OBJECT(dev));
> +        object_unref(OBJECT(dev));
> +    }
>      return NULL;
>  }
>  
> @@ -818,7 +853,7 @@ void qdev_unplug(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    if (!migration_is_idle()) {
> +    if (!migration_is_idle() && !migration_in_setup(migrate_get_current())) {
>          error_setg(errp, "device_del not allowed while migrating");
>          return;
>      }
> diff --git a/vl.c b/vl.c
> index 002bf4919e..898b9901c8 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -2204,10 +2204,12 @@ static int device_init_func(void *opaque, QemuOpts *opts, Error **errp)
>      DeviceState *dev;
>  
>      dev = qdev_device_add(opts, errp);
> -    if (!dev) {
> +    if (!dev && *errp) {
> +        error_report_err(*errp);
>          return -1;
> +    } else if (dev) {
> +        object_unref(OBJECT(dev));
>      }
> -    object_unref(OBJECT(dev));
>      return 0;
>  }
>  



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/10] add failover feature for assigned network devices
  2019-10-15 19:03 ` Alex Williamson
@ 2019-10-15 21:17   ` Michael S. Tsirkin
  2019-10-17 10:33   ` Jens Freimann
  1 sibling, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2019-10-15 21:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: ehabkost, aadam, qemu-devel, dgilbert, laine, Jens Freimann,
	ailan, parav

On Tue, Oct 15, 2019 at 01:03:17PM -0600, Alex Williamson wrote:
> On Fri, 11 Oct 2019 13:20:05 +0200
> Jens Freimann <jfreimann@redhat.com> wrote:
> 
> > This is implementing the host side of the net_failover concept
> > (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> > 
> > Changes since v2:
> > * back out of creating failover pair when it is a non-networking
> >   vfio-pci device (Alex W)
> > * handle migration state change from within the migration thread. I do a
> >   timed wait on a semaphore and then check if all unplugs were
> >   succesful. Added a new function to each device that checks the device
> >   if the unplug for it has happened. When all devices report the succesful
> >   unplug *or* the time/retries is up, continue with the migration or
> >   cancel. When not all devices could be unplugged I am cancelling at the
> >   moment. It is likely that we can't plug it back at the destination which
> >   would result in degraded network performance.
> > * fix a few bugs regarding re-plug on migration source and target 
> > * run full set of tests including migration tests
> > * add patch for libqos to tolerate new migration state
> > * squashed patch 1 and 2, added patch 8 
> >  
> > The general idea is that we have a pair of devices, a vfio-pci and a
> > virtio-net device. Before migration the vfio device is unplugged and data
> > flows to the virtio-net device, on the target side another vfio-pci device
> > is plugged in to take over the data-path. In the guest the net_failover
> > module will pair net devices with the same MAC address.
> > 
> > * Patch 1 adds the infrastructure to hide the device for the qbus and qdev APIs
> > 
> > * Patch 2 sets a new flag for PCIDevice 'partially_hotplugged' which we
> >   use to skip the unrealize code path when doing a unplug of the primary
> >   device
> > 
> > * Patch 3 sets the pending_deleted_event before triggering the guest
> >   unplug request
> 
> These only cover pcie hotplug, is this feature somehow dependent on
> pcie?  There's also ACPI-based PCI hotplug, SHPC hotplug, and it looks
> like s390 has it's own version (of course) of PCI hotplug.  IMO, we
> either need to make an attempt to support this universally or the
> option needs to fail if the hotplug controller doesn't support partial
> removal.  Thanks,
> 
> Alex


Alex, could you please comment a bit more on vfio patches?
Besides what you point out here, any other issues?

-- 
MST


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 10/10] vfio: unplug failover primary device before migration
  2019-10-11 11:20 ` [PATCH v3 10/10] vfio: unplug failover primary device before migration Jens Freimann
  2019-10-14 10:05   ` Cornelia Huck
@ 2019-10-16  1:52   ` Alex Williamson
  2019-10-16 20:18     ` Jens Freimann
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2019-10-16  1:52 UTC (permalink / raw)
  To: Jens Freimann
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Fri, 11 Oct 2019 13:20:15 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> As usual block all vfio-pci devices from being migrated, but make an
> exception for failover primary devices. This is achieved by setting
> unmigratable to 0 but also add a migration blocker for all vfio-pci
> devices except failover primary devices. These will be unplugged before
> migration happens by the migration handler of the corresponding
> virtio-net standby device.
> 
> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> ---
>  hw/vfio/pci.c | 35 ++++++++++++++++++++++++++++++++++-
>  hw/vfio/pci.h |  2 ++
>  2 files changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c5e6fe61cb..64cf8e07d9 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -40,6 +40,9 @@
>  #include "pci.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> +#include "migration/blocker.h"
> +#include "qemu/option.h"
> +#include "qemu/option_int.h"
>  
>  #define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> @@ -2698,6 +2701,12 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>      vdev->req_enabled = false;
>  }
>  
> +static int has_net_failover_arg(void *opaque, const char *name,
> +                           const char *value, Error **errp)
> +{
> +    return (strcmp(name, "net_failover_pair_id") == 0);
> +}
> +
>  static void vfio_realize(PCIDevice *pdev, Error **errp)
>  {
>      VFIOPCIDevice *vdev = PCI_VFIO(pdev);
> @@ -2710,6 +2719,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      int groupid;
>      int i, ret;
>      bool is_mdev;
> +    uint16_t class_id;
> +
> +    if (qemu_opt_foreach(pdev->qdev.opts, has_net_failover_arg,
> +                         (void *) pdev->qdev.opts, &err) == 0) {

Why do we need a qemu_opt_foreach here versus testing
vdev->net_failover_pair_id as you do below or similar to how we test
sysfsdev immediately below this chunk?

> +        error_setg(&vdev->migration_blocker,
> +                "VFIO device doesn't support migration");
> +        ret = migrate_add_blocker(vdev->migration_blocker, &err);

Where's the migrate_del_blocker()/error_free() for any other realize
error or device removal?

> +        if (err) {
> +            error_propagate(errp, err);
> +            error_free(vdev->migration_blocker);
> +        }

As Connie noted, unclear if this aborts or continues without a
migration blocker, which would be bad.

> +    } else {
> +        pdev->qdev.allow_unplug_during_migration = true;
> +    }
>  
>      if (!vdev->vbasedev.sysfsdev) {
>          if (!(~vdev->host.domain || ~vdev->host.bus ||
> @@ -2812,6 +2835,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          goto error;
>      }
>  
> +    if (vdev->net_failover_pair_id != NULL) {
> +        class_id = pci_get_word(pdev->config + PCI_CLASS_DEVICE);
> +        if (class_id != PCI_CLASS_NETWORK_ETHERNET) {
> +            error_setg(errp, "failover device is not an Ethernet device");
> +            goto error;
> +        }
> +    }

Not clear to me why we do this separate from setting up the migration
blocker or why we use a different mechanism to test for the property.

> +
>      /* vfio emulates a lot for us, but some bits need extra love */
>      vdev->emulated_config_bits = g_malloc0(vdev->config_size);
>  
> @@ -3110,6 +3141,8 @@ static Property vfio_pci_dev_properties[] = {
>                              display, ON_OFF_AUTO_OFF),
>      DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>      DEFINE_PROP_UINT32("yres", VFIOPCIDevice, display_yres, 0),
> +    DEFINE_PROP_STRING("net_failover_pair_id", VFIOPCIDevice,
> +            net_failover_pair_id),

Should this and the Ethernet class test be done in PCIDevice?  The
migration aspect is the only thing unique to vfio since we don't
otherwise support it, right?  For instance, I should be able to
setup an emulated NIC with this failover pair id too, right?  Thanks,

Alex

>      DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
>                         intx.mmap_timeout, 1100),
>      DEFINE_PROP_BIT("x-vga", VFIOPCIDevice, features,
> @@ -3152,7 +3185,7 @@ static Property vfio_pci_dev_properties[] = {
>  
>  static const VMStateDescription vfio_pci_vmstate = {
>      .name = "vfio-pci",
> -    .unmigratable = 1,
> +    .unmigratable = 0,
>  };
>  
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 834a90d646..da4417071a 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -134,6 +134,7 @@ typedef struct VFIOPCIDevice {
>      PCIHostDeviceAddress host;
>      EventNotifier err_notifier;
>      EventNotifier req_notifier;
> +    char *net_failover_pair_id;
>      int (*resetfn)(struct VFIOPCIDevice *);
>      uint32_t vendor_id;
>      uint32_t device_id;
> @@ -168,6 +169,7 @@ typedef struct VFIOPCIDevice {
>      bool no_vfio_ioeventfd;
>      bool enable_ramfb;
>      VFIODisplay *dpy;
> +    Error *migration_blocker;
>  } VFIOPCIDevice;
>  
>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 02/10] pci: mark devices partially unplugged
  2019-10-11 11:20 ` [PATCH v3 02/10] pci: mark devices partially unplugged Jens Freimann
@ 2019-10-16  1:53   ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2019-10-16  1:53 UTC (permalink / raw)
  To: Jens Freimann
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Fri, 11 Oct 2019 13:20:07 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> Only the guest unplug request was triggered. This is needed for
> the failover feature. In case of a failed migration we need to
> plug the device back to the guest.
> 
> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> ---
>  hw/pci/pci.c         | 2 ++
>  hw/pci/pcie.c        | 3 +++
>  include/hw/pci/pci.h | 1 +
>  3 files changed, 6 insertions(+)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index aa05c2b9b2..c140b37765 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2078,6 +2078,8 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
>      Error *local_err = NULL;
>      bool is_default_rom;
>  
> +    pci_dev->partially_hotplugged = false;

This is redundant though since the object is zero initialized on
allocation, right?  Thanks,

Alex

> +
>      /* initialize cap_present for pci_is_express() and pci_config_size(),
>       * Note that hybrid PCIs are not set automatically and need to manage
>       * QEMU_PCI_CAP_EXPRESS manually */
> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> index a6beb567bd..19363ff8ce 100644
> --- a/hw/pci/pcie.c
> +++ b/hw/pci/pcie.c
> @@ -456,6 +456,9 @@ static void pcie_unplug_device(PCIBus *bus, PCIDevice *dev, void *opaque)
>  {
>      HotplugHandler *hotplug_ctrl = qdev_get_hotplug_handler(DEVICE(dev));
>  
> +    if (dev->partially_hotplugged) {
> +        return;
> +    }
>      hotplug_handler_unplug(hotplug_ctrl, DEVICE(dev), &error_abort);
>      object_unparent(OBJECT(dev));
>  }
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index f3f0ffd5fb..f3a39c9bbd 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -265,6 +265,7 @@ typedef struct PCIReqIDCache PCIReqIDCache;
>  
>  struct PCIDevice {
>      DeviceState qdev;
> +    bool partially_hotplugged;
>  
>      /* PCI config space */
>      uint8_t *config;



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/10] qdev/qbus: add hidden device support
  2019-10-15 19:19   ` Alex Williamson
@ 2019-10-16  7:04     ` Jens Freimann
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-16  7:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Tue, Oct 15, 2019 at 01:19:33PM -0600, Alex Williamson wrote:
>On Fri, 11 Oct 2019 13:20:06 +0200
>Jens Freimann <jfreimann@redhat.com> wrote:
>
>> This adds support for hiding a device to the qbus and qdev APIs.  The
>> first user of this will be the virtio-net failover feature but the API
>> introduced with this patch could be used to implement other features as
>> well, for example hiding pci devices when a pci bus is powered off.
>>
>> qdev_device_add() is modified to check for a net_failover_pair_id
>> argument in the option string. A DeviceListener callback
>> should_be_hidden() is added. It can be used by a standby device to
>> inform qdev that this device should not be added now. The standby device
>> handler can store the device options to plug the device in at a later
>> point in time.
>>
>> One reason for hiding the device is that we don't want to expose both
>> devices to the guest kernel until the respective virtio feature bit
>> VIRTIO_NET_F_STANDBY was negotiated and we know that the devices will be
>> handled correctly by the guest.
>>
>> More information on the kernel feature this is using:
>>  https://www.kernel.org/doc/html/latest/networking/net_failover.html
>>
>> An example where the primary device is a vfio-pci device and the standby
>> device is a virtio-net device:
>>
>> A device is hidden when it has an "net_failover_pair_id" option, e.g.
>>
>>  -device virtio-net-pci,...,failover=on,...
>>  -device vfio-pci,...,net_failover_pair_id=net1,...
>>
>> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
>> ---
>>  hw/core/qdev.c         | 19 +++++++++++++++++++
>>  include/hw/qdev-core.h |  9 +++++++++
>>  qdev-monitor.c         | 43 ++++++++++++++++++++++++++++++++++++++----
>>  vl.c                   |  6 ++++--
>>  4 files changed, 71 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
>> index cbad6c1d55..84fac591ca 100644
>> --- a/hw/core/qdev.c
>> +++ b/hw/core/qdev.c
>> @@ -212,6 +212,25 @@ void device_listener_unregister(DeviceListener *listener)
>>      QTAILQ_REMOVE(&device_listeners, listener, link);
>>  }
>>
>> +bool qdev_should_hide_device(QemuOpts *opts, Error **errp)
>> +{
>> +    bool res = false;
>> +    bool match_found = false;
>> +    DeviceListener *listener;
>> +
>> +    QTAILQ_FOREACH(listener, &device_listeners, link) {
>> +       if (listener->should_be_hidden) {
>> +            listener->should_be_hidden(listener, opts, &match_found, &res);
>> +        }
>> +
>> +        if (match_found) {
>> +            break;
>> +        }
>
>Calling convention here seems overly complicated, couldn't
>should_be_hidden() just return >0 (should be hidden), 0 (should not be
>hidden), <0 (don't care), ie. continue until >=0?  The errp arg is
>unused and using "res" to return should/shouldn't hide is very unclear.
>The virtio callback renames this to hide, which makes more sense, but
>as above, both the stop and hidden state could be conveyed with a
>simple int return value.  Thanks,

Yes, this can be simplified. Will be in the next version.

Thanks!

regards,
Jens  


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/10] migration: add new migration state wait-unplug
  2019-10-15 10:50       ` Dr. David Alan Gilbert
@ 2019-10-16  7:07         ` Jens Freimann
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-16  7:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: ehabkost, mst, aadam, qemu-devel, alex.williamson, laine, ailan, parav

On Tue, Oct 15, 2019 at 11:50:08AM +0100, Dr. David Alan Gilbert wrote:
>* Jens Freimann (jfreimann@redhat.com) wrote:
>> On Fri, Oct 11, 2019 at 06:11:33PM +0100, Dr. David Alan Gilbert wrote:
>> > * Jens Freimann (jfreimann@redhat.com) wrote:
>> > > This patch adds a new migration state called wait-unplug.  It is entered
>> > > after the SETUP state and will transition into ACTIVE once all devices
>> > > were succesfully unplugged from the guest.
>> > >
>> > > So if a guest doesn't respond or takes long to honor the unplug request
>> > > the user will see the migration state 'wait-unplug'.
>> > >
>> > > In the migration thread we query failover devices if they're are still
>> > > pending the guest unplug. When all are unplugged the migration
>> > > continues. We give it a defined number of iterations including small
>> > > waiting periods before we proceed.
>> > >
>> > > Signed-off-by: Jens Freimann <jfreimann@redhat.com>
>> [..]
>> > > +    while (i < FAILOVER_UNPLUG_RETRIES &&
>> > > +           s->state == MIGRATION_STATUS_WAIT_UNPLUG) {
>> > > +        i++;
>> > > +        qemu_sem_timedwait(&s->wait_unplug_sem, FAILOVER_GUEST_UNPLUG_WAIT);
>> > > +        all_unplugged = qemu_savevm_state_guest_unplug_pending();
>> > > +        if (all_unplugged) {
>> > > +            break;
>> > > +        }
>> > > +    }
>> > > +
>> > > +    if (all_unplugged) {
>> > > +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
>> > > +                MIGRATION_STATUS_ACTIVE);
>> > > +    } else {
>> > > +        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
>> > > +                          MIGRATION_STATUS_CANCELLING);
>> > > +    }
>> >
>> > I think you can get rid of both the timeout and the count and just make
>> > sure that migrate_cancel works at this point.
>>
>> I see, I need to add the new state to migration_is_setup_or_active() or
>> a cancel won't work.
>
>You probably need to do that anyway given all the other places
>is_setup_or_active is called.

Yes, done.

>> > This pushes the problem up a layer, which I think is fine.
>>
>> Seems good to me. To be clear, you're saying I should just poll on
>> the device unplugged state? Like
>>
>>         while (s->state == MIGRATION_STATUS_WAIT_UNPLUG &&
>>                !qemu_savevm_state_guest_unplug_pending()) {
>> _            /* This block intentionally left blank */
>>         }
>
>I'd keep the qemu_sem_timedwait in there, but with a short time out
>(e.g. 250ms say); that way it doesn't eat cpu, but also the cancel still
>happens quickly.

Yes, that's what I do now and it works fine.

Thanks!

regards,
Jens 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 10/10] vfio: unplug failover primary device before migration
  2019-10-16  1:52   ` Alex Williamson
@ 2019-10-16 20:18     ` Jens Freimann
  2019-10-17  0:39       ` Alex Williamson
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Freimann @ 2019-10-16 20:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Tue, Oct 15, 2019 at 07:52:12PM -0600, Alex Williamson wrote:
>On Fri, 11 Oct 2019 13:20:15 +0200
>Jens Freimann <jfreimann@redhat.com> wrote:
>
>> As usual block all vfio-pci devices from being migrated, but make an
>> exception for failover primary devices. This is achieved by setting
>> unmigratable to 0 but also add a migration blocker for all vfio-pci
>> devices except failover primary devices. These will be unplugged before
>> migration happens by the migration handler of the corresponding
>> virtio-net standby device.
>>
>> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
>> ---
>>  hw/vfio/pci.c | 35 ++++++++++++++++++++++++++++++++++-
>>  hw/vfio/pci.h |  2 ++
>>  2 files changed, 36 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index c5e6fe61cb..64cf8e07d9 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -40,6 +40,9 @@
>>  #include "pci.h"
>>  #include "trace.h"
>>  #include "qapi/error.h"
>> +#include "migration/blocker.h"
>> +#include "qemu/option.h"
>> +#include "qemu/option_int.h"
>>
>>  #define TYPE_VFIO_PCI "vfio-pci"
>>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>> @@ -2698,6 +2701,12 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>      vdev->req_enabled = false;
>>  }
>>
>> +static int has_net_failover_arg(void *opaque, const char *name,
>> +                           const char *value, Error **errp)
>> +{
>> +    return (strcmp(name, "net_failover_pair_id") == 0);
>> +}
>> +
>>  static void vfio_realize(PCIDevice *pdev, Error **errp)
>>  {
>>      VFIOPCIDevice *vdev = PCI_VFIO(pdev);
>> @@ -2710,6 +2719,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>      int groupid;
>>      int i, ret;
>>      bool is_mdev;
>> +    uint16_t class_id;
>> +
>> +    if (qemu_opt_foreach(pdev->qdev.opts, has_net_failover_arg,
>> +                         (void *) pdev->qdev.opts, &err) == 0) {
>
>Why do we need a qemu_opt_foreach here versus testing
>vdev->net_failover_pair_id as you do below or similar to how we test
>sysfsdev immediately below this chunk?

We don't need it, I will change it and move it to where we check for
the PCI class.
>
>> +        error_setg(&vdev->migration_blocker,
>> +                "VFIO device doesn't support migration");
>> +        ret = migrate_add_blocker(vdev->migration_blocker, &err);
>
>Where's the migrate_del_blocker()/error_free() for any other realize
>error or device removal?
>
>> +        if (err) {
>> +            error_propagate(errp, err);
>> +            error_free(vdev->migration_blocker);
>> +        }
>
>As Connie noted, unclear if this aborts or continues without a
>migration blocker, which would be bad.

It aborts in my test. PCI realize propagates it further and eventually
it leads to aborting qemu.

It looks like this now:

     if (!pdev->net_failover_pair_id) {
          error_setg(&vdev->migration_blocker,
                  "VFIO device doesn't support migration");
          ret = migrate_add_blocker(vdev->migration_blocker, &err);
          if (err) {
              error_propagate(errp, err);
          } else {
              error_propagate(errp, vdev->migration_blocker);
          }
          goto error;
      } else {
          pdev->qdev.allow_unplug_during_migration = true;
      }

>> +    } else {
>> +        pdev->qdev.allow_unplug_during_migration = true;
>> +    }
>>
>>      if (!vdev->vbasedev.sysfsdev) {
>>          if (!(~vdev->host.domain || ~vdev->host.bus ||
>> @@ -2812,6 +2835,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>          goto error;
>>      }
>>
>> +    if (vdev->net_failover_pair_id != NULL) {
>> +        class_id = pci_get_word(pdev->config + PCI_CLASS_DEVICE);
>> +        if (class_id != PCI_CLASS_NETWORK_ETHERNET) {
>> +            error_setg(errp, "failover device is not an Ethernet device");
>> +            goto error;
>> +        }
>> +    }
>
>Not clear to me why we do this separate from setting up the migration
>blocker or why we use a different mechanism to test for the property.

I'm moving this check to hw/pci/pci.c as you suggested.

>> +
>>      /* vfio emulates a lot for us, but some bits need extra love */
>>      vdev->emulated_config_bits = g_malloc0(vdev->config_size);
>>
>> @@ -3110,6 +3141,8 @@ static Property vfio_pci_dev_properties[] = {
>>                              display, ON_OFF_AUTO_OFF),
>>      DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>>      DEFINE_PROP_UINT32("yres", VFIOPCIDevice, display_yres, 0),
>> +    DEFINE_PROP_STRING("net_failover_pair_id", VFIOPCIDevice,
>> +            net_failover_pair_id),
>
>Should this and the Ethernet class test be done in PCIDevice?  The
>migration aspect is the only thing unique to vfio since we don't
>otherwise support it, right?  For instance, I should be able to
>setup an emulated NIC with this failover pair id too, right?  Thanks,

Yes, we can do it in PCIDevice. Using it with an emulated device.
It wouldn't make sense for production but could make sense for
testing purposes.

Thanks for the review!

regards,
Jens


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 10/10] vfio: unplug failover primary device before migration
  2019-10-16 20:18     ` Jens Freimann
@ 2019-10-17  0:39       ` Alex Williamson
  2019-10-17  7:45         ` Jens Freimann
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2019-10-17  0:39 UTC (permalink / raw)
  To: Jens Freimann
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Wed, 16 Oct 2019 22:18:47 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> On Tue, Oct 15, 2019 at 07:52:12PM -0600, Alex Williamson wrote:
> >On Fri, 11 Oct 2019 13:20:15 +0200
> >Jens Freimann <jfreimann@redhat.com> wrote:
> >  
> >> As usual block all vfio-pci devices from being migrated, but make an
> >> exception for failover primary devices. This is achieved by setting
> >> unmigratable to 0 but also add a migration blocker for all vfio-pci
> >> devices except failover primary devices. These will be unplugged before
> >> migration happens by the migration handler of the corresponding
> >> virtio-net standby device.
> >>
> >> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
> >> ---
> >>  hw/vfio/pci.c | 35 ++++++++++++++++++++++++++++++++++-
> >>  hw/vfio/pci.h |  2 ++
> >>  2 files changed, 36 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index c5e6fe61cb..64cf8e07d9 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -40,6 +40,9 @@
> >>  #include "pci.h"
> >>  #include "trace.h"
> >>  #include "qapi/error.h"
> >> +#include "migration/blocker.h"
> >> +#include "qemu/option.h"
> >> +#include "qemu/option_int.h"
> >>
> >>  #define TYPE_VFIO_PCI "vfio-pci"
> >>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> >> @@ -2698,6 +2701,12 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
> >>      vdev->req_enabled = false;
> >>  }
> >>
> >> +static int has_net_failover_arg(void *opaque, const char *name,
> >> +                           const char *value, Error **errp)
> >> +{
> >> +    return (strcmp(name, "net_failover_pair_id") == 0);
> >> +}
> >> +
> >>  static void vfio_realize(PCIDevice *pdev, Error **errp)
> >>  {
> >>      VFIOPCIDevice *vdev = PCI_VFIO(pdev);
> >> @@ -2710,6 +2719,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >>      int groupid;
> >>      int i, ret;
> >>      bool is_mdev;
> >> +    uint16_t class_id;
> >> +
> >> +    if (qemu_opt_foreach(pdev->qdev.opts, has_net_failover_arg,
> >> +                         (void *) pdev->qdev.opts, &err) == 0) {  
> >
> >Why do we need a qemu_opt_foreach here versus testing
> >vdev->net_failover_pair_id as you do below or similar to how we test
> >sysfsdev immediately below this chunk?  
> 
> We don't need it, I will change it and move it to where we check for
> the PCI class.
> >  
> >> +        error_setg(&vdev->migration_blocker,
> >> +                "VFIO device doesn't support migration");
> >> +        ret = migrate_add_blocker(vdev->migration_blocker, &err);  
> >
> >Where's the migrate_del_blocker()/error_free() for any other realize
> >error or device removal?
> >  
> >> +        if (err) {
> >> +            error_propagate(errp, err);
> >> +            error_free(vdev->migration_blocker);
> >> +        }  
> >
> >As Connie noted, unclear if this aborts or continues without a
> >migration blocker, which would be bad.  
> 
> It aborts in my test. PCI realize propagates it further and eventually
> it leads to aborting qemu.
> 
> It looks like this now:
> 
>      if (!pdev->net_failover_pair_id) {
>           error_setg(&vdev->migration_blocker,
>                   "VFIO device doesn't support migration");
>           ret = migrate_add_blocker(vdev->migration_blocker, &err);
>           if (err) {
>               error_propagate(errp, err);
>           } else {
>               error_propagate(errp, vdev->migration_blocker);
>           }
>           goto error;

This unconditionally goes to error when we don't have a failover pair
set :-\

I suspect we don't want any sort of error propagate in the success
case, the migration_blocker pre-defines the error when the migration is
blocked, right?  Thanks,

Alex

>       } else {
>           pdev->qdev.allow_unplug_during_migration = true;
>       }
> 
> >> +    } else {
> >> +        pdev->qdev.allow_unplug_during_migration = true;
> >> +    }
> >>
> >>      if (!vdev->vbasedev.sysfsdev) {
> >>          if (!(~vdev->host.domain || ~vdev->host.bus ||
> >> @@ -2812,6 +2835,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >>          goto error;
> >>      }
> >>
> >> +    if (vdev->net_failover_pair_id != NULL) {
> >> +        class_id = pci_get_word(pdev->config + PCI_CLASS_DEVICE);
> >> +        if (class_id != PCI_CLASS_NETWORK_ETHERNET) {
> >> +            error_setg(errp, "failover device is not an Ethernet device");
> >> +            goto error;
> >> +        }
> >> +    }  
> >
> >Not clear to me why we do this separate from setting up the migration
> >blocker or why we use a different mechanism to test for the property.  
> 
> I'm moving this check to hw/pci/pci.c as you suggested.
> 
> >> +
> >>      /* vfio emulates a lot for us, but some bits need extra love */
> >>      vdev->emulated_config_bits = g_malloc0(vdev->config_size);
> >>
> >> @@ -3110,6 +3141,8 @@ static Property vfio_pci_dev_properties[] = {
> >>                              display, ON_OFF_AUTO_OFF),
> >>      DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
> >>      DEFINE_PROP_UINT32("yres", VFIOPCIDevice, display_yres, 0),
> >> +    DEFINE_PROP_STRING("net_failover_pair_id", VFIOPCIDevice,
> >> +            net_failover_pair_id),  
> >
> >Should this and the Ethernet class test be done in PCIDevice?  The
> >migration aspect is the only thing unique to vfio since we don't
> >otherwise support it, right?  For instance, I should be able to
> >setup an emulated NIC with this failover pair id too, right?  Thanks,  
> 
> Yes, we can do it in PCIDevice. Using it with an emulated device.
> It wouldn't make sense for production but could make sense for
> testing purposes.
> 
> Thanks for the review!
> 
> regards,
> Jens



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 10/10] vfio: unplug failover primary device before migration
  2019-10-17  0:39       ` Alex Williamson
@ 2019-10-17  7:45         ` Jens Freimann
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-17  7:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Wed, Oct 16, 2019 at 06:39:58PM -0600, Alex Williamson wrote:
>On Wed, 16 Oct 2019 22:18:47 +0200
>Jens Freimann <jfreimann@redhat.com> wrote:
>
>> On Tue, Oct 15, 2019 at 07:52:12PM -0600, Alex Williamson wrote:
>> >On Fri, 11 Oct 2019 13:20:15 +0200
>> >Jens Freimann <jfreimann@redhat.com> wrote:
>> >
>> >> As usual block all vfio-pci devices from being migrated, but make an
>> >> exception for failover primary devices. This is achieved by setting
>> >> unmigratable to 0 but also add a migration blocker for all vfio-pci
>> >> devices except failover primary devices. These will be unplugged before
>> >> migration happens by the migration handler of the corresponding
>> >> virtio-net standby device.
>> >>
>> >> Signed-off-by: Jens Freimann <jfreimann@redhat.com>
>> >> ---
>> >>  hw/vfio/pci.c | 35 ++++++++++++++++++++++++++++++++++-
>> >>  hw/vfio/pci.h |  2 ++
>> >>  2 files changed, 36 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> >> index c5e6fe61cb..64cf8e07d9 100644
>> >> --- a/hw/vfio/pci.c
>> >> +++ b/hw/vfio/pci.c
>> >> @@ -40,6 +40,9 @@
>> >>  #include "pci.h"
>> >>  #include "trace.h"
>> >>  #include "qapi/error.h"
>> >> +#include "migration/blocker.h"
>> >> +#include "qemu/option.h"
>> >> +#include "qemu/option_int.h"
>> >>
>> >>  #define TYPE_VFIO_PCI "vfio-pci"
>> >>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>> >> @@ -2698,6 +2701,12 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>> >>      vdev->req_enabled = false;
>> >>  }
>> >>
>> >> +static int has_net_failover_arg(void *opaque, const char *name,
>> >> +                           const char *value, Error **errp)
>> >> +{
>> >> +    return (strcmp(name, "net_failover_pair_id") == 0);
>> >> +}
>> >> +
>> >>  static void vfio_realize(PCIDevice *pdev, Error **errp)
>> >>  {
>> >>      VFIOPCIDevice *vdev = PCI_VFIO(pdev);
>> >> @@ -2710,6 +2719,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>> >>      int groupid;
>> >>      int i, ret;
>> >>      bool is_mdev;
>> >> +    uint16_t class_id;
>> >> +
>> >> +    if (qemu_opt_foreach(pdev->qdev.opts, has_net_failover_arg,
>> >> +                         (void *) pdev->qdev.opts, &err) == 0) {
>> >
>> >Why do we need a qemu_opt_foreach here versus testing
>> >vdev->net_failover_pair_id as you do below or similar to how we test
>> >sysfsdev immediately below this chunk?
>>
>> We don't need it, I will change it and move it to where we check for
>> the PCI class.
>> >
>> >> +        error_setg(&vdev->migration_blocker,
>> >> +                "VFIO device doesn't support migration");
>> >> +        ret = migrate_add_blocker(vdev->migration_blocker, &err);
>> >
>> >Where's the migrate_del_blocker()/error_free() for any other realize
>> >error or device removal?
>> >
>> >> +        if (err) {
>> >> +            error_propagate(errp, err);
>> >> +            error_free(vdev->migration_blocker);
>> >> +        }
>> >
>> >As Connie noted, unclear if this aborts or continues without a
>> >migration blocker, which would be bad.
>>
>> It aborts in my test. PCI realize propagates it further and eventually
>> it leads to aborting qemu.
>>
>> It looks like this now:
>>
>>      if (!pdev->net_failover_pair_id) {
>>           error_setg(&vdev->migration_blocker,
>>                   "VFIO device doesn't support migration");
>>           ret = migrate_add_blocker(vdev->migration_blocker, &err);
>>           if (err) {
>>               error_propagate(errp, err);
>>           } else {
>>               error_propagate(errp, vdev->migration_blocker);
>>           }
>>           goto error;
>
>This unconditionally goes to error when we don't have a failover pair
>set :-\
>
>I suspect we don't want any sort of error propagate in the success
>case, the migration_blocker pre-defines the error when the migration is
>blocked, right?  Thanks,

I shouldn't code or send mails late at night :) I'll fix it in v4

Thanks!

regards,
Jens 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/10] add failover feature for assigned network devices
  2019-10-15 19:03 ` Alex Williamson
  2019-10-15 21:17   ` Michael S. Tsirkin
@ 2019-10-17 10:33   ` Jens Freimann
  2019-10-17 12:51     ` Alex Williamson
  1 sibling, 1 reply; 35+ messages in thread
From: Jens Freimann @ 2019-10-17 10:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Tue, Oct 15, 2019 at 01:03:17PM -0600, Alex Williamson wrote:
>On Fri, 11 Oct 2019 13:20:05 +0200
>Jens Freimann <jfreimann@redhat.com> wrote:
>
>> This is implementing the host side of the net_failover concept
>> (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
>>
>> Changes since v2:
>> * back out of creating failover pair when it is a non-networking
>>   vfio-pci device (Alex W)
>> * handle migration state change from within the migration thread. I do a
>>   timed wait on a semaphore and then check if all unplugs were
>>   succesful. Added a new function to each device that checks the device
>>   if the unplug for it has happened. When all devices report the succesful
>>   unplug *or* the time/retries is up, continue with the migration or
>>   cancel. When not all devices could be unplugged I am cancelling at the
>>   moment. It is likely that we can't plug it back at the destination which
>>   would result in degraded network performance.
>> * fix a few bugs regarding re-plug on migration source and target
>> * run full set of tests including migration tests
>> * add patch for libqos to tolerate new migration state
>> * squashed patch 1 and 2, added patch 8
>>
>> The general idea is that we have a pair of devices, a vfio-pci and a
>> virtio-net device. Before migration the vfio device is unplugged and data
>> flows to the virtio-net device, on the target side another vfio-pci device
>> is plugged in to take over the data-path. In the guest the net_failover
>> module will pair net devices with the same MAC address.
>>
>> * Patch 1 adds the infrastructure to hide the device for the qbus and qdev APIs
>>
>> * Patch 2 sets a new flag for PCIDevice 'partially_hotplugged' which we
>>   use to skip the unrealize code path when doing a unplug of the primary
>>   device
>>
>> * Patch 3 sets the pending_deleted_event before triggering the guest
>>   unplug request
>
>These only cover pcie hotplug, is this feature somehow dependent on
>pcie?  There's also ACPI-based PCI hotplug, SHPC hotplug, and it looks
>like s390 has it's own version (of course) of PCI hotplug.  IMO, we
>either need to make an attempt to support this universally or the
>option needs to fail if the hotplug controller doesn't support partial
>removal.  Thanks,

It is possible to make it work with non-pcie hotplug but as the first
step I want to enable it for pcie only. For that I would add a check
into pci_qdev_realize(), where I also check if the device is an
ethernet device, and fail if it is not a pcie device. Would that work
for you?

regards,
Jens 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/10] add failover feature for assigned network devices
  2019-10-17 10:33   ` Jens Freimann
@ 2019-10-17 12:51     ` Alex Williamson
  2019-10-17 14:04       ` Jens Freimann
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2019-10-17 12:51 UTC (permalink / raw)
  To: Jens Freimann
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Thu, 17 Oct 2019 12:33:47 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> On Tue, Oct 15, 2019 at 01:03:17PM -0600, Alex Williamson wrote:
> >On Fri, 11 Oct 2019 13:20:05 +0200
> >Jens Freimann <jfreimann@redhat.com> wrote:
> >  
> >> This is implementing the host side of the net_failover concept
> >> (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> >>
> >> Changes since v2:
> >> * back out of creating failover pair when it is a non-networking
> >>   vfio-pci device (Alex W)
> >> * handle migration state change from within the migration thread. I do a
> >>   timed wait on a semaphore and then check if all unplugs were
> >>   succesful. Added a new function to each device that checks the device
> >>   if the unplug for it has happened. When all devices report the succesful
> >>   unplug *or* the time/retries is up, continue with the migration or
> >>   cancel. When not all devices could be unplugged I am cancelling at the
> >>   moment. It is likely that we can't plug it back at the destination which
> >>   would result in degraded network performance.
> >> * fix a few bugs regarding re-plug on migration source and target
> >> * run full set of tests including migration tests
> >> * add patch for libqos to tolerate new migration state
> >> * squashed patch 1 and 2, added patch 8
> >>
> >> The general idea is that we have a pair of devices, a vfio-pci and a
> >> virtio-net device. Before migration the vfio device is unplugged and data
> >> flows to the virtio-net device, on the target side another vfio-pci device
> >> is plugged in to take over the data-path. In the guest the net_failover
> >> module will pair net devices with the same MAC address.
> >>
> >> * Patch 1 adds the infrastructure to hide the device for the qbus and qdev APIs
> >>
> >> * Patch 2 sets a new flag for PCIDevice 'partially_hotplugged' which we
> >>   use to skip the unrealize code path when doing a unplug of the primary
> >>   device
> >>
> >> * Patch 3 sets the pending_deleted_event before triggering the guest
> >>   unplug request  
> >
> >These only cover pcie hotplug, is this feature somehow dependent on
> >pcie?  There's also ACPI-based PCI hotplug, SHPC hotplug, and it looks
> >like s390 has it's own version (of course) of PCI hotplug.  IMO, we
> >either need to make an attempt to support this universally or the
> >option needs to fail if the hotplug controller doesn't support partial
> >removal.  Thanks,  
> 
> It is possible to make it work with non-pcie hotplug but as the first
> step I want to enable it for pcie only. For that I would add a check
> into pci_qdev_realize(), where I also check if the device is an
> ethernet device, and fail if it is not a pcie device. Would that work
> for you?

How would libvirt introspect what topologies are supported rather than
trial and error?  I think this solves my issue that I get bugs that the
failover pair option doesn't work on vfio-pci depending on the
topology, but it really just pushes the problem up the stack.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/10] add failover feature for assigned network devices
  2019-10-17 12:51     ` Alex Williamson
@ 2019-10-17 14:04       ` Jens Freimann
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Freimann @ 2019-10-17 14:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: ehabkost, mst, aadam, qemu-devel, dgilbert, laine, ailan, parav

On Thu, Oct 17, 2019 at 06:51:54AM -0600, Alex Williamson wrote:
>On Thu, 17 Oct 2019 12:33:47 +0200
>Jens Freimann <jfreimann@redhat.com> wrote:
>
>> On Tue, Oct 15, 2019 at 01:03:17PM -0600, Alex Williamson wrote:
>> >On Fri, 11 Oct 2019 13:20:05 +0200
>> >Jens Freimann <jfreimann@redhat.com> wrote:
>> >
>> >> This is implementing the host side of the net_failover concept
>> >> (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
>> >>
>> >> Changes since v2:
>> >> * back out of creating failover pair when it is a non-networking
>> >>   vfio-pci device (Alex W)
>> >> * handle migration state change from within the migration thread. I do a
>> >>   timed wait on a semaphore and then check if all unplugs were
>> >>   succesful. Added a new function to each device that checks the device
>> >>   if the unplug for it has happened. When all devices report the succesful
>> >>   unplug *or* the time/retries is up, continue with the migration or
>> >>   cancel. When not all devices could be unplugged I am cancelling at the
>> >>   moment. It is likely that we can't plug it back at the destination which
>> >>   would result in degraded network performance.
>> >> * fix a few bugs regarding re-plug on migration source and target
>> >> * run full set of tests including migration tests
>> >> * add patch for libqos to tolerate new migration state
>> >> * squashed patch 1 and 2, added patch 8
>> >>
>> >> The general idea is that we have a pair of devices, a vfio-pci and a
>> >> virtio-net device. Before migration the vfio device is unplugged and data
>> >> flows to the virtio-net device, on the target side another vfio-pci device
>> >> is plugged in to take over the data-path. In the guest the net_failover
>> >> module will pair net devices with the same MAC address.
>> >>
>> >> * Patch 1 adds the infrastructure to hide the device for the qbus and qdev APIs
>> >>
>> >> * Patch 2 sets a new flag for PCIDevice 'partially_hotplugged' which we
>> >>   use to skip the unrealize code path when doing a unplug of the primary
>> >>   device
>> >>
>> >> * Patch 3 sets the pending_deleted_event before triggering the guest
>> >>   unplug request
>> >
>> >These only cover pcie hotplug, is this feature somehow dependent on
>> >pcie?  There's also ACPI-based PCI hotplug, SHPC hotplug, and it looks
>> >like s390 has it's own version (of course) of PCI hotplug.  IMO, we
>> >either need to make an attempt to support this universally or the
>> >option needs to fail if the hotplug controller doesn't support partial
>> >removal.  Thanks,
>>
>> It is possible to make it work with non-pcie hotplug but as the first
>> step I want to enable it for pcie only. For that I would add a check
>> into pci_qdev_realize(), where I also check if the device is an
>> ethernet device, and fail if it is not a pcie device. Would that work
>> for you?
>
>How would libvirt introspect what topologies are supported rather than
>trial and error?  I think this solves my issue that I get bugs that the
>failover pair option doesn't work on vfio-pci depending on the
>topology, but it really just pushes the problem up the stack.  Thanks,

Good point Alex. Laine, any idea what would be required for this?
How does libvirt introspect other properties of PCI devices if any?
I only found checks for commandline options in libvirt, but I might
miss something.

regards,
Jens 


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2019-10-17 15:14 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-11 11:20 [PATCH v3 0/10] add failover feature for assigned network devices Jens Freimann
2019-10-11 11:20 ` [PATCH v3 01/10] qdev/qbus: add hidden device support Jens Freimann
2019-10-14  9:46   ` Cornelia Huck
2019-10-14 12:02     ` Jens Freimann
2019-10-15 19:19   ` Alex Williamson
2019-10-16  7:04     ` Jens Freimann
2019-10-11 11:20 ` [PATCH v3 02/10] pci: mark devices partially unplugged Jens Freimann
2019-10-16  1:53   ` Alex Williamson
2019-10-11 11:20 ` [PATCH v3 03/10] pci: mark device having guest unplug request pending Jens Freimann
2019-10-11 11:20 ` [PATCH v3 04/10] qapi: add unplug primary event Jens Freimann
2019-10-11 11:20 ` [PATCH v3 05/10] qapi: add failover negotiated event Jens Freimann
2019-10-11 11:20 ` [PATCH v3 06/10] migration: allow unplug during migration for failover devices Jens Freimann
2019-10-11 11:20 ` [PATCH v3 07/10] migration: add new migration state wait-unplug Jens Freimann
2019-10-11 12:57   ` Michael S. Tsirkin
2019-10-11 14:22     ` Jens Freimann
2019-10-11 16:49   ` Dr. David Alan Gilbert
2019-10-11 17:11   ` Dr. David Alan Gilbert
2019-10-15  9:45     ` Jens Freimann
2019-10-15 10:50       ` Dr. David Alan Gilbert
2019-10-16  7:07         ` Jens Freimann
2019-10-11 11:20 ` [PATCH v3 08/10] libqos: tolerate wait-unplug migration state Jens Freimann
2019-10-11 11:20 ` [PATCH v3 09/10] net/virtio: add failover support Jens Freimann
2019-10-11 11:20 ` [PATCH v3 10/10] vfio: unplug failover primary device before migration Jens Freimann
2019-10-14 10:05   ` Cornelia Huck
2019-10-16  1:52   ` Alex Williamson
2019-10-16 20:18     ` Jens Freimann
2019-10-17  0:39       ` Alex Williamson
2019-10-17  7:45         ` Jens Freimann
2019-10-11 14:28 ` [PATCH v3 0/10] add failover feature for assigned network devices Michael S. Tsirkin
2019-10-11 16:04 ` no-reply
2019-10-15 19:03 ` Alex Williamson
2019-10-15 21:17   ` Michael S. Tsirkin
2019-10-17 10:33   ` Jens Freimann
2019-10-17 12:51     ` Alex Williamson
2019-10-17 14:04       ` Jens Freimann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).