* [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
@ 2017-06-29  0:24 Michael Roth
  2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 1/5] virhostdev: factor release out from reattach and export it for use later Michael Roth
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Michael Roth @ 2017-06-29  0:24 UTC (permalink / raw)
  To: libvir-list
  Cc: qemu-devel, qemu-ppc, aik, alex.williamson, abologna, laine,
	pkrempa, berrange, mprivozn, sbhat

Hi everyone. Hoping to get some feedback on this approach, or some
alternatives proposed below, to the following issue:

Currently libvirt immediately attempts to rebind a managed device back to the
host driver when it receives a DEVICE_DELETED event from QEMU. This is
problematic for 2 reasons:

1) If multiple devices from a group are attached to a guest, this can move
   the group into a "non-viable" state where some devices are assigned to
   the host and some to the guest.

2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
   where additional cleanup occurs. In most cases libvirt can ignore this
   cleanup, but in the case of VFIO devices this is where closing of a VFIO
   group FD occurs, and failing to wait before rebinding the device to the
   host driver can result in unexpected behavior. In the case of powernv
   hosts at least, this can lead to a host driver crashing due to the default
   DMA windows not having been fully restored yet. The window between this
   and the initial DEVICE_DELETED event seems to be ~6 seconds in practice. We've
   seen host dumps with Mellanox CX4 VFs being rebound to host driver during
   this period (on powernv hosts).

Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
until all the devices in the group have been detached, at which point all
the hostdevs are rebound as a group. Until that point, the devices are tracked
by the drvManager's inactiveList in a similar manner to hostdevs that are
assigned to VFIO via the nodedev-detach interface.

Patch 5 addresses 2) by adding an additional check that, when the last device
from a group is detached, polls /proc for open FDs referencing the VFIO group
path in /dev/vfio/<iommu_group> and waits for the FD to be closed. If we
time out, we abandon rebinding the hostdevs back to the host.
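
For illustration, here is a rough standalone sketch of the /proc-based check
(the helper name below is made up; patch 5 implements the real thing as
virFileIsOpenByPid() using libvirt's util APIs):

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Return 1 if @pid has an open FD resolving to @path (e.g. "/dev/vfio/26"),
 * 0 if not, -1 on error. */
static int pid_has_path_open(pid_t pid, const char *path)
{
    char fddir[64], lnk[PATH_MAX], tgt[PATH_MAX];
    struct dirent *ent;
    ssize_t len;
    int found = 0;
    DIR *dir;

    snprintf(fddir, sizeof(fddir), "/proc/%lld/fd", (long long)pid);
    if (!(dir = opendir(fddir)))
        return -1;

    while ((ent = readdir(dir))) {
        if (ent->d_name[0] == '.')
            continue;
        snprintf(lnk, sizeof(lnk), "%s/%s", fddir, ent->d_name);
        if ((len = readlink(lnk, tgt, sizeof(tgt) - 1)) < 0)
            continue;                      /* FD may have been closed already */
        tgt[len] = '\0';
        if (strcmp(tgt, path) == 0) {      /* QEMU still holds the group open */
            found = 1;
            break;
        }
    }
    closedir(dir);
    return found;
}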

There are a couple alternatives to Patch 5 that might be worth considering:

a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
   DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
   favor of minimal changes to libvirt's event handlers.

   The downsides are:
    - we'd incur some latency for all device-detach calls, but it's not
      apparent to me whether this delay is significant for anything outside
      of VFIO.
    - there may be cases where finalization after DEVICE_DELETED/unparent
      is not guaranteed, and I'm not sure QOM would encourage such
      expectations even if that's currently the case.

b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
   direct solution. With this we could completely separate out the handling
   of rebinding to the host driver based on receipt of this event.

   The downsides are:
    - this would only work for newer versions of QEMU, though we could use
      the poll-wait in patch 5 as a fallback.
    - synchronizing sync/async device-detach threads with sync/async
      handlers for this would be a bit hairy, but I have a WIP that
      seems *fairly reasonable*

c) Take the approach in Patch 5, either as a precursor to implementing b) or
   something else, or just stick with it for now.

d) ???

Personally I'm leaning toward c), but I'm wondering if that's "good enough"
for now, or if we should pursue something more robust from the start, or
perhaps something else entirely?

Any feedback is greatly appreciated!

 src/libvirt_private.syms |   5 ++
 src/qemu/qemu_hostdev.c  |  16 +++++
 src/qemu/qemu_hostdev.h  |   4 ++
 src/qemu/qemu_hotplug.c  |  56 ++++++++++++++----
 src/util/virfile.c       |  52 +++++++++++++++++
 src/util/virfile.h       |   1 +
 src/util/virhostdev.c    | 173 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 src/util/virhostdev.h    |  16 +++++
 src/util/virpci.c        |  69 ++++++++++++++++++----
 src/util/virpci.h        |   4 ++
 10 files changed, 360 insertions(+), 36 deletions(-)


* [Qemu-devel] [RFC PATCH 1/5] virhostdev: factor release out from reattach and export it for use later
  2017-06-29  0:24 [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Michael Roth
@ 2017-06-29  0:24 ` Michael Roth
  2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 2/5] qemu_hotplug: squash qemuDomainRemovePCIHostDevice into caller Michael Roth
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Michael Roth @ 2017-06-29  0:24 UTC (permalink / raw)
  To: libvir-list
  Cc: qemu-devel, qemu-ppc, aik, alex.williamson, abologna, laine,
	pkrempa, berrange, mprivozn, sbhat

Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
 src/libvirt_private.syms |  1 +
 src/util/virhostdev.c    | 83 ++++++++++++++++++++++++++++++++++++++++--------
 src/util/virhostdev.h    |  8 +++++
 3 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index c1e9471..2bd3581 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -1798,6 +1798,7 @@ virHostdevReAttachPCIDevices;
 virHostdevReAttachSCSIDevices;
 virHostdevReAttachSCSIVHostDevices;
 virHostdevReAttachUSBDevices;
+virHostdevReleasePCIDevices;
 virHostdevUpdateActiveDomainDevices;
 virHostdevUpdateActiveMediatedDevices;
 virHostdevUpdateActivePCIDevices;
diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c
index 579563c..2cd3f34 100644
--- a/src/util/virhostdev.c
+++ b/src/util/virhostdev.c
@@ -932,16 +932,20 @@ virHostdevReattachPCIDevice(virHostdevManagerPtr mgr,
     }
 }
 
-/* @oldStateDir:
- * For upgrade purpose: see virHostdevRestoreNetConfig
+/*
+ * Move PCI devices to inactive list and prepare them for reattaching
+ * to host driver
+ *
+ * Pre-condition: inactivePCIHostdevs & activePCIHostdevs
+ * are locked
  */
-void
-virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr,
-                             const char *drv_name,
-                             const char *dom_name,
-                             virDomainHostdevDefPtr *hostdevs,
-                             int nhostdevs,
-                             const char *oldStateDir)
+static void
+virHostdevReleasePCIDevicesInternal(virHostdevManagerPtr mgr,
+                                    const char *drv_name,
+                                    const char *dom_name,
+                                    virDomainHostdevDefPtr *hostdevs,
+                                    int nhostdevs,
+                                    const char *oldStateDir)
 {
     virPCIDeviceListPtr pcidevs;
     size_t i;
@@ -949,9 +953,6 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr,
     if (!nhostdevs)
         return;
 
-    virObjectLock(mgr->activePCIHostdevs);
-    virObjectLock(mgr->inactivePCIHostdevs);
-
     if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) {
         VIR_ERROR(_("Failed to allocate PCI device list: %s"),
                   virGetLastErrorMessage());
@@ -1056,8 +1057,62 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr,
         }
     }
 
-    /* Step 5: Reattach managed devices to their host drivers; unmanaged
-     *         devices don't need to be processed further */
+ cleanup:
+    virObjectUnref(pcidevs);
+}
+
+void
+virHostdevReleasePCIDevices(virHostdevManagerPtr mgr,
+                            const char *drv_name,
+                            const char *dom_name,
+                            virDomainHostdevDefPtr *hostdevs,
+                            int nhostdevs,
+                            const char *oldStateDir)
+{
+    virObjectLock(mgr->activePCIHostdevs);
+    virObjectLock(mgr->inactivePCIHostdevs);
+
+
+    virHostdevReleasePCIDevicesInternal(mgr, drv_name, dom_name,
+                                        hostdevs, nhostdevs, oldStateDir);
+
+    virObjectUnlock(mgr->activePCIHostdevs);
+    virObjectUnlock(mgr->inactivePCIHostdevs);
+}
+
+/* @oldStateDir:
+ * For upgrade purpose: see virHostdevRestoreNetConfig
+ */
+void
+virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr,
+                             const char *drv_name,
+                             const char *dom_name,
+                             virDomainHostdevDefPtr *hostdevs,
+                             int nhostdevs,
+                             const char *oldStateDir)
+{
+    virPCIDeviceListPtr pcidevs;
+    size_t i;
+
+    if (!nhostdevs)
+        return;
+
+    virObjectLock(mgr->activePCIHostdevs);
+    virObjectLock(mgr->inactivePCIHostdevs);
+
+    /* Release PCI devices to the inactive list */
+    virHostdevReleasePCIDevicesInternal(mgr, drv_name, dom_name,
+                                        hostdevs, nhostdevs, oldStateDir);
+
+    if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) {
+        VIR_ERROR(_("Failed to allocate PCI device list: %s"),
+                  virGetLastErrorMessage());
+        virResetLastError();
+        goto cleanup;
+    }
+
+    /* Reattach managed devices to their host drivers; unmanaged
+     * devices don't need to be processed further */
     for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) {
         virPCIDevicePtr pci = virPCIDeviceListGet(pcidevs, i);
         virPCIDevicePtr actual;
diff --git a/src/util/virhostdev.h b/src/util/virhostdev.h
index 54e1c66..fbc7fbd 100644
--- a/src/util/virhostdev.h
+++ b/src/util/virhostdev.h
@@ -114,6 +114,14 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr hostdev_mgr,
                              const char *oldStateDir)
     ATTRIBUTE_NONNULL(1);
 void
+virHostdevReleasePCIDevices(virHostdevManagerPtr mgr,
+                            const char *drv_name,
+                            const char *dom_name,
+                            virDomainHostdevDefPtr *hostdevs,
+                            int nhostdevs,
+                            const char *oldStateDir)
+    ATTRIBUTE_NONNULL(1);
+void
 virHostdevReAttachUSBDevices(virHostdevManagerPtr hostdev_mgr,
                               const char *drv_name,
                               const char *dom_name,
-- 
2.7.4


* [Qemu-devel] [RFC PATCH 2/5] qemu_hotplug: squash qemuDomainRemovePCIHostDevice into caller
  2017-06-29  0:24 [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Michael Roth
  2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 1/5] virhostdev: factor release out from reattach and export it for use later Michael Roth
@ 2017-06-29  0:24 ` Michael Roth
  2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 3/5] virpci: introduce virPCIIOMMUGroupIterate() Michael Roth
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Michael Roth @ 2017-06-29  0:24 UTC (permalink / raw)
  To: libvir-list
  Cc: qemu-devel, qemu-ppc, aik, alex.williamson, abologna, laine,
	pkrempa, berrange, mprivozn, sbhat

It's only called from one place, and it only adds the extra step of
freeing the device alias after reattach. Since another path through
qemuDomainRemoveHostDevice, introduced in a subsequent patch, will also
need to free the device alias, it's more readable to just call the
underlying functions directly from the caller.

Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
 src/qemu/qemu_hotplug.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index a486fb4..b557e82 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -3805,15 +3805,6 @@ qemuDomainRemoveMemoryDevice(virQEMUDriverPtr driver,
 
 
 static void
-qemuDomainRemovePCIHostDevice(virQEMUDriverPtr driver,
-                              virDomainObjPtr vm,
-                              virDomainHostdevDefPtr hostdev)
-{
-    qemuHostdevReAttachPCIDevices(driver, vm->def->name, &hostdev, 1);
-    qemuDomainReleaseDeviceAddress(vm, hostdev->info, NULL);
-}
-
-static void
 qemuDomainRemoveUSBHostDevice(virQEMUDriverPtr driver,
                               virDomainObjPtr vm,
                               virDomainHostdevDefPtr hostdev)
@@ -3905,7 +3896,8 @@ qemuDomainRemoveHostDevice(virQEMUDriverPtr driver,
 
     switch ((virDomainHostdevSubsysType) hostdev->source.subsys.type) {
     case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
-        qemuDomainRemovePCIHostDevice(driver, vm, hostdev);
+        qemuHostdevReAttachPCIDevices(driver, vm->def->name, &hostdev, 1);
+        qemuDomainReleaseDeviceAddress(vm, hostdev->info, NULL);
         /* QEMU might no longer need to lock as much memory, eg. we just
          * detached the last VFIO device, so adjust the limit here */
         if (qemuDomainAdjustMaxMemLock(vm) < 0)
-- 
2.7.4


* [Qemu-devel] [RFC PATCH 3/5] virpci: introduce virPCIIOMMUGroupIterate()
  2017-06-29  0:24 [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Michael Roth
  2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 1/5] virhostdev: factor release out from reattach and export it for use later Michael Roth
  2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 2/5] qemu_hotplug: squash qemuDomainRemovePCIHostDevice into caller Michael Roth
@ 2017-06-29  0:24 ` Michael Roth
  2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 4/5] qemu: hotplug: unbind VFIO devices as a group Michael Roth
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Michael Roth @ 2017-06-29  0:24 UTC (permalink / raw)
  To: libvir-list
  Cc: qemu-devel, qemu-ppc, aik, alex.williamson, abologna, laine,
	pkrempa, berrange, mprivozn, sbhat

This serves a similar purpose to virPCIDeviceAddressIOMMUGroupIterate(),
but looks devices up by IOMMU group number rather than by the address of
a device within the group. Refactor the existing code to use the new
function and export it for use in subsequent patches.
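
As a purely illustrative usage sketch (the counting helpers below are
hypothetical and not part of this patch), a caller can now walk a group
knowing only its number, passing any virPCIDeviceAddressActor:

/* assumes util/virpci.h and libvirt's internal.h (for ATTRIBUTE_UNUSED) */
static int
countGroupDeviceActor(virPCIDeviceAddressPtr addr ATTRIBUTE_UNUSED,
                      void *opaque)
{
    size_t *count = opaque;

    (*count)++;
    return 0;
}

static ssize_t
countDevicesInIOMMUGroup(int iommu_group)
{
    size_t count = 0;

    /* visits every entry in /sys/kernel/iommu_groups/<N>/devices */
    if (virPCIIOMMUGroupIterate(iommu_group, countGroupDeviceActor, &count) < 0)
        return -1;
    return count;
}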

Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
 src/util/virpci.c | 42 ++++++++++++++++++++++++++++++------------
 src/util/virpci.h |  3 +++
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/src/util/virpci.c b/src/util/virpci.c
index 2c1b758..b842f44 100644
--- a/src/util/virpci.c
+++ b/src/util/virpci.c
@@ -49,6 +49,7 @@
 VIR_LOG_INIT("util.pci");
 
 #define PCI_SYSFS "/sys/bus/pci/"
+#define IOMMU_GROUP_SYSFS "/sys/kernel/iommu_groups/"
 #define PCI_ID_LEN 10   /* "XXXX XXXX" */
 #define PCI_ADDR_LEN 13 /* "XXXX:XX:XX.X" */
 
@@ -2178,16 +2179,13 @@ int virPCIDeviceFileIterate(virPCIDevicePtr dev,
     return ret;
 }
 
-
-/* virPCIDeviceAddressIOMMUGroupIterate:
- *   Call @actor for all devices in the same iommu_group as orig
- *   (including orig itself) Even if there is no iommu_group for the
- *   device, call @actor once for orig.
+/* virPCIIOMMUGroupIterate:
+ *   Call @actor for all devices in a particular iommu_group.
  */
 int
-virPCIDeviceAddressIOMMUGroupIterate(virPCIDeviceAddressPtr orig,
-                                     virPCIDeviceAddressActor actor,
-                                     void *opaque)
+virPCIIOMMUGroupIterate(int iommu_group,
+                        virPCIDeviceAddressActor actor,
+                        void *opaque)
 {
     char *groupPath = NULL;
     DIR *groupDir = NULL;
@@ -2196,13 +2194,11 @@ virPCIDeviceAddressIOMMUGroupIterate(virPCIDeviceAddressPtr orig,
     int direrr;
 
     if (virAsprintf(&groupPath,
-                    PCI_SYSFS "devices/%04x:%02x:%02x.%x/iommu_group/devices",
-                    orig->domain, orig->bus, orig->slot, orig->function) < 0)
+                    IOMMU_GROUP_SYSFS "%d/devices",
+                    iommu_group) < 0)
         goto cleanup;
 
     if (virDirOpenQuiet(&groupDir, groupPath) < 0) {
-        /* just process the original device, nothing more */
-        ret = (actor)(orig, opaque);
         goto cleanup;
     }
 
@@ -2230,6 +2226,28 @@ virPCIDeviceAddressIOMMUGroupIterate(virPCIDeviceAddressPtr orig,
     return ret;
 }
 
+/* virPCIDeviceAddressIOMMUGroupIterate:
+ *   Call @actor for all devices in the same iommu_group as orig
+ *   (including orig itself) Even if there is no iommu_group for the
+ *   device, call @actor once for orig.
+ */
+int
+virPCIDeviceAddressIOMMUGroupIterate(virPCIDeviceAddressPtr orig,
+                                     virPCIDeviceAddressActor actor,
+                                     void *opaque)
+{
+    int ret = -1;
+
+    ret = virPCIIOMMUGroupIterate(virPCIDeviceAddressGetIOMMUGroupNum(orig),
+                                  actor, opaque);
+    if (ret < 0) {
+        /* just process the original device, nothing more */
+        ret = (actor)(orig, opaque);
+    }
+
+    return ret;
+}
+
 
 static int
 virPCIDeviceGetIOMMUGroupAddOne(virPCIDeviceAddressPtr newDevAddr, void *opaque)
diff --git a/src/util/virpci.h b/src/util/virpci.h
index 570684e..5ec1306 100644
--- a/src/util/virpci.h
+++ b/src/util/virpci.h
@@ -176,6 +176,9 @@ typedef int (*virPCIDeviceAddressActor)(virPCIDeviceAddressPtr addr,
 int virPCIDeviceAddressIOMMUGroupIterate(virPCIDeviceAddressPtr orig,
                                          virPCIDeviceAddressActor actor,
                                          void *opaque);
+int virPCIIOMMUGroupIterate(int iommu_group,
+                            virPCIDeviceAddressActor actor,
+                            void *opaque);
 virPCIDeviceListPtr virPCIDeviceGetIOMMUGroupList(virPCIDevicePtr dev);
 int virPCIDeviceAddressGetIOMMUGroupAddresses(virPCIDeviceAddressPtr devAddr,
                                               virPCIDeviceAddressPtr **iommuGroupDevices,
-- 
2.7.4


* [Qemu-devel] [RFC PATCH 4/5] qemu: hotplug: unbind VFIO devices as a group
  2017-06-29  0:24 [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Michael Roth
                   ` (2 preceding siblings ...)
  2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 3/5] virpci: introduce virPCIIOMMUGroupIterate() Michael Roth
@ 2017-06-29  0:24 ` Michael Roth
  2017-06-29  0:25 ` [Qemu-devel] [RFC PATCH 5/5] qemu: hotplug: wait for VFIO group FD close before unbind Michael Roth
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Michael Roth @ 2017-06-29  0:24 UTC (permalink / raw)
  To: libvir-list
  Cc: qemu-devel, qemu-ppc, aik, alex.williamson, abologna, laine,
	pkrempa, berrange, mprivozn, sbhat

Currently we bind a managed hostdev back to the host driver (or
"unbind" from the perspective of the stub driver) immediately upon
receiving a DEVICE_DELETED event from QEMU. In cases where we have
more than one device from the same IOMMU group attached to a guest, this
runs the risk of putting the group into a "non-viable" state where the
guest and the host are using devices from the group simultaneously.

This patch addresses this by deferring the unbind step until all
hostdevs from a group have been detached from the guest. In the
meantime, they are left on the drvManager's inactiveList, in a similar
state as they would be if they were unmanaged devices that were bound
to VFIO via nodedev-detach but not yet plugged into a guest.
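
For context, "viable" is the kernel's notion from the VFIO group status
ioctl: a group only reports VFIO_GROUP_FLAGS_VIABLE once every device in it
is bound to a VFIO driver (or to no driver at all). A standalone check would
look roughly like the sketch below (illustrative only, not part of this
patch; note the group chardev allows only a single open, so this only works
when nothing else, e.g. QEMU, already holds it):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Return 1 if IOMMU group @group_num is viable, 0 if not, -1 on error. */
static int group_is_viable(int group_num)
{
    struct vfio_group_status status = { .argsz = sizeof(status) };
    char path[64];
    int fd, viable = -1;

    snprintf(path, sizeof(path), "/dev/vfio/%d", group_num);
    if ((fd = open(path, O_RDWR)) < 0)
        return -1;
    if (ioctl(fd, VFIO_GROUP_GET_STATUS, &status) == 0)
        viable = !!(status.flags & VFIO_GROUP_FLAGS_VIABLE);
    close(fd);
    return viable;
}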

Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
 src/libvirt_private.syms |  3 ++
 src/qemu/qemu_hostdev.c  | 16 +++++++++
 src/qemu/qemu_hostdev.h  |  4 +++
 src/qemu/qemu_hotplug.c  | 16 ++++++++-
 src/util/virhostdev.c    | 90 ++++++++++++++++++++++++++++++++++++++++++++++++
 src/util/virhostdev.h    |  8 +++++
 src/util/virpci.c        | 27 +++++++++++++++
 src/util/virpci.h        |  1 +
 8 files changed, 164 insertions(+), 1 deletion(-)

diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index 2bd3581..ba7fa39 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -1783,6 +1783,8 @@ virHostCPUStatsAssign;
 virHostdevFindUSBDevice;
 virHostdevIsSCSIDevice;
 virHostdevManagerGetDefault;
+virHostdevPCIDeviceGroupUnbind;
+virHostdevPCIDeviceGroupUnbindable;
 virHostdevPCINodeDeviceDetach;
 virHostdevPCINodeDeviceReAttach;
 virHostdevPCINodeDeviceReset;
@@ -2342,6 +2344,7 @@ virPCIDeviceWaitForCleanup;
 virPCIEDeviceInfoFree;
 virPCIGetDeviceAddressFromSysfsLink;
 virPCIGetHeaderType;
+virPCIGetIOMMUGroupList;
 virPCIGetNetName;
 virPCIGetPhysicalFunction;
 virPCIGetVirtualFunctionIndex;
diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c
index 73d26f4..fdc52fe 100644
--- a/src/qemu/qemu_hostdev.c
+++ b/src/qemu/qemu_hostdev.c
@@ -384,6 +384,22 @@ qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver,
 }
 
 void
+qemuHostdevReleasePCIDevices(virQEMUDriverPtr driver,
+                             const char *name,
+                             virDomainHostdevDefPtr *hostdevs,
+                             int nhostdevs)
+{
+    virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
+    const char *oldStateDir = cfg->stateDir;
+    virHostdevManagerPtr hostdev_mgr = driver->hostdevMgr;
+
+    virHostdevReleasePCIDevices(hostdev_mgr, QEMU_DRIVER_NAME, name,
+                                hostdevs, nhostdevs, oldStateDir);
+
+    virObjectUnref(cfg);
+}
+
+void
 qemuHostdevReAttachPCIDevices(virQEMUDriverPtr driver,
                               const char *name,
                               virDomainHostdevDefPtr *hostdevs,
diff --git a/src/qemu/qemu_hostdev.h b/src/qemu/qemu_hostdev.h
index 9a7c7f1..b010085 100644
--- a/src/qemu/qemu_hostdev.h
+++ b/src/qemu/qemu_hostdev.h
@@ -74,6 +74,10 @@ void qemuHostdevReAttachPCIDevices(virQEMUDriverPtr driver,
                                    const char *name,
                                    virDomainHostdevDefPtr *hostdevs,
                                    int nhostdevs);
+void qemuHostdevReleasePCIDevices(virQEMUDriverPtr driver,
+                                  const char *name,
+                                  virDomainHostdevDefPtr *hostdevs,
+                                  int nhostdevs);
 void qemuHostdevReAttachUSBDevices(virQEMUDriverPtr driver,
                                    const char *name,
                                    virDomainHostdevDefPtr *hostdevs,
diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index b557e82..af5ee6f 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -3896,7 +3896,10 @@ qemuDomainRemoveHostDevice(virQEMUDriverPtr driver,
 
     switch ((virDomainHostdevSubsysType) hostdev->source.subsys.type) {
     case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
-        qemuHostdevReAttachPCIDevices(driver, vm->def->name, &hostdev, 1);
+        if (is_vfio)
+            qemuHostdevReleasePCIDevices(driver, vm->def->name, &hostdev, 1);
+        else
+            qemuHostdevReAttachPCIDevices(driver, vm->def->name, &hostdev, 1);
         qemuDomainReleaseDeviceAddress(vm, hostdev->info, NULL);
         /* QEMU might no longer need to lock as much memory, eg. we just
          * detached the last VFIO device, so adjust the limit here */
@@ -3925,6 +3928,17 @@ qemuDomainRemoveHostDevice(virQEMUDriverPtr driver,
         virDomainNetDefFree(net);
     }
 
+    if (is_vfio) {
+        int iommu_group =
+            virPCIDeviceAddressGetIOMMUGroupNum(&hostdev->source.subsys.u.pci.addr);
+        if (virHostdevPCIDeviceGroupUnbindable(driver->hostdevMgr,
+                                               iommu_group)) {
+            virHostdevPCIDeviceGroupUnbind(driver->hostdevMgr,
+                                           iommu_group);
+        }
+    }
+
+
     ret = 0;
 
  cleanup:
diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c
index 2cd3f34..a7f04fe 100644
--- a/src/util/virhostdev.c
+++ b/src/util/virhostdev.c
@@ -905,6 +905,96 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr,
     return ret;
 }
 
+static bool
+virHostdevPCIDeviceUnbindableInternal(virHostdevManagerPtr mgr,
+                                      int iommu_group)
+{
+    struct virHostdevIsPCINodeDeviceUsedData data = { mgr, NULL, true };
+
+    if (virPCIIOMMUGroupIterate(iommu_group,
+                                virHostdevIsPCINodeDeviceUsed,
+                                &data) < 0) {
+        VIR_DEBUG("IOMMU group %d is not unbindable", iommu_group);
+        return false;
+    }
+
+    VIR_DEBUG("IOMMU group %d is unbindable", iommu_group);
+    return true;
+}
+
+/*
+ * Check if devices within IOMMU group are in use by any domains
+ */
+bool
+virHostdevPCIDeviceGroupUnbindable(virHostdevManagerPtr mgr,
+                                   int iommu_group)
+{
+    bool result;
+
+    virObjectLock(mgr->activePCIHostdevs);
+    result = virHostdevPCIDeviceUnbindableInternal(mgr, iommu_group);
+    virObjectUnlock(mgr->activePCIHostdevs);
+
+    return result;
+}
+
+/*
+ * Confirm all devices in IOMMU group are in inactiveList
+ * before attempting to reattach to host driver. Devices in IOMMU
+ * group that aren't in either activeList or inactiveList are considered
+ * outside our control, so we treat them as inactive as well.
+ *
+ * Callers can check virHostdevPCIDeviceGroupUnbindable() beforehand
+ * for some indication that the group is ready for reattach to the
+ * host, but since it's possible for a hostdev from the group to get
+ * re-attached to a guest prior to subsequently calling this function
+ * there is no guarantee of this, which should be fine since it would
+ * only be immediately rebound to the stub driver anyway.
+ */
+void
+virHostdevPCIDeviceGroupUnbind(virHostdevManagerPtr mgr,
+                               int iommu_group)
+{
+    virPCIDeviceListPtr pcidevs = NULL;
+    size_t i;
+
+    virObjectLock(mgr->activePCIHostdevs);
+    virObjectLock(mgr->inactivePCIHostdevs);
+
+    if (!virHostdevPCIDeviceUnbindableInternal(mgr, iommu_group)) {
+        VIR_DEBUG("IOMMU group %d still in use, deferring reattach "
+                  "of PCI devices to host", iommu_group);
+        goto cleanup;
+    }
+
+    if (!(pcidevs = virPCIGetIOMMUGroupList(iommu_group)))
+        goto cleanup;
+    for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) {
+        virPCIDevicePtr actual, pci = virPCIDeviceListGet(pcidevs, i);
+        virPCIDeviceAddressPtr devAddr = virPCIDeviceGetAddress(pci);
+
+        actual = virPCIDeviceListFindByIDs(mgr->inactivePCIHostdevs,
+                                           devAddr->domain,
+                                           devAddr->bus,
+                                           devAddr->slot,
+                                           devAddr->function);
+        if (actual) {
+            VIR_DEBUG("Reattaching PCI device %s", virPCIDeviceGetName(actual));
+            if (virPCIDeviceGetManaged(actual))
+                if (virPCIDeviceReattach(actual, mgr->activePCIHostdevs,
+                                         mgr->inactivePCIHostdevs) < 0) {
+                    VIR_ERROR(_("Failed to re-attach PCI device: %s"),
+                              virGetLastErrorMessage());
+                    virResetLastError();
+                }
+        }
+    }
+
+ cleanup:
+    virObjectUnref(pcidevs);
+    virObjectUnlock(mgr->activePCIHostdevs);
+    virObjectUnlock(mgr->inactivePCIHostdevs);
+}
+
 /*
  * Pre-condition: inactivePCIHostdevs & activePCIHostdevs
  * are locked
diff --git a/src/util/virhostdev.h b/src/util/virhostdev.h
index fbc7fbd..2ab8101 100644
--- a/src/util/virhostdev.h
+++ b/src/util/virhostdev.h
@@ -122,6 +122,14 @@ virHostdevReleasePCIDevices(virHostdevManagerPtr mgr,
                             const char *oldStateDir)
     ATTRIBUTE_NONNULL(1);
 void
+virHostdevPCIDeviceGroupUnbind(virHostdevManagerPtr mgr,
+                               int iommu_group)
+    ATTRIBUTE_NONNULL(1);
+bool
+virHostdevPCIDeviceGroupUnbindable(virHostdevManagerPtr mgr,
+                                   int iommu_group)
+    ATTRIBUTE_NONNULL(1);
+void
 virHostdevReAttachUSBDevices(virHostdevManagerPtr hostdev_mgr,
                               const char *drv_name,
                               const char *dom_name,
diff --git a/src/util/virpci.c b/src/util/virpci.c
index b842f44..a8e5190 100644
--- a/src/util/virpci.c
+++ b/src/util/virpci.c
@@ -2298,6 +2298,33 @@ virPCIDeviceGetIOMMUGroupList(virPCIDevicePtr dev)
 }
 
 
+/*
+ * virPCIGetIOMMUGroupList - return a virPCIDeviceList containing
+ * all of the devices in @iommu_group.
+ *
+ * Return the new list, or NULL on failure
+ */
+virPCIDeviceListPtr
+virPCIGetIOMMUGroupList(int iommu_group)
+{
+    virPCIDeviceListPtr groupList = virPCIDeviceListNew();
+
+    if (!groupList)
+        goto error;
+
+    if (virPCIIOMMUGroupIterate(iommu_group,
+                                virPCIDeviceGetIOMMUGroupAddOne,
+                                groupList) < 0)
+        goto error;
+
+    return groupList;
+
+ error:
+    virObjectUnref(groupList);
+    return NULL;
+}
+
+
 typedef struct {
     virPCIDeviceAddressPtr **iommuGroupDevices;
     size_t *nIommuGroupDevices;
diff --git a/src/util/virpci.h b/src/util/virpci.h
index 5ec1306..5bcacb2 100644
--- a/src/util/virpci.h
+++ b/src/util/virpci.h
@@ -180,6 +180,7 @@ int virPCIIOMMUGroupIterate(int iommu_group,
                             virPCIDeviceAddressActor actor,
                             void *opaque);
 virPCIDeviceListPtr virPCIDeviceGetIOMMUGroupList(virPCIDevicePtr dev);
+virPCIDeviceListPtr virPCIGetIOMMUGroupList(int iommu_group);
 int virPCIDeviceAddressGetIOMMUGroupAddresses(virPCIDeviceAddressPtr devAddr,
                                               virPCIDeviceAddressPtr **iommuGroupDevices,
                                               size_t *nIommuGroupDevices);
-- 
2.7.4


* [Qemu-devel] [RFC PATCH 5/5] qemu: hotplug: wait for VFIO group FD close before unbind
  2017-06-29  0:24 [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Michael Roth
                   ` (3 preceding siblings ...)
  2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 4/5] qemu: hotplug: unbind VFIO devices as a group Michael Roth
@ 2017-06-29  0:25 ` Michael Roth
  2017-06-29  8:33 ` [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Daniel P. Berrange
  2017-06-29 18:50 ` Laine Stump
  6 siblings, 0 replies; 16+ messages in thread
From: Michael Roth @ 2017-06-29  0:25 UTC (permalink / raw)
  To: libvir-list
  Cc: qemu-devel, qemu-ppc, aik, alex.williamson, abologna, laine,
	pkrempa, berrange, mprivozn, sbhat

QEMU emits DEVICE_DELETED events during a device's "unparent"
callback, but some additional cleanup occurs afterward via
"finalize". In most cases libvirt can ignore the latter, but in the
case of VFIO the closing of a device's group FD happens here, which
is something libvirt needs to wait for before attempting to bind a
hostdev back to a host driver. In the case of powernv, and possibly
other host archs as well, failing to do this can lead to the host
device driver crashing due to necessary setup (like restoring default
DMA windows for the IOMMU group) not being completed yet.

We attempt to avoid this here by polling the QEMU process for open
FDs referencing /dev/vfio/<group num> and waiting for a certain period
of time. In practice the delay between the DEVICE_DELETED
event and closing of the group FD seems to be around 6 seconds, so we
set the max wait time at 15 seconds. If we time out, we leave the
device in the inactiveList and bound to VFIO. We only attempt the
wait if the last hostdev from an IOMMU group is being detached and
there's a reasonable expectation that the group FD will be closed soon.

There are alternatives to this approach, like adding a specific
group-delete event to QEMU and handling this cleanup via an
asynchronous event handler, but since we do a similar poll-wait for
things like KVM device passthrough this simple approach is hopefully
a reasonable starting point.

Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
 src/libvirt_private.syms |  1 +
 src/qemu/qemu_hotplug.c  | 34 +++++++++++++++++++++++++++++--
 src/util/virfile.c       | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
 src/util/virfile.h       |  1 +
 4 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index ba7fa39..787267c 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -1657,6 +1657,7 @@ virFileIsDir;
 virFileIsExecutable;
 virFileIsLink;
 virFileIsMountPoint;
+virFileIsOpenByPid;
 virFileIsSharedFS;
 virFileIsSharedFSType;
 virFileLength;
diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index af5ee6f..d200bab 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -68,6 +68,8 @@ VIR_LOG_INIT("qemu.qemu_hotplug");
 /* Wait up to 5 seconds for device removal to finish. */
 unsigned long long qemuDomainRemoveDeviceWaitTime = 1000ull * 5;
 
+/* Wait up to 15 seconds for iommu group close */
+unsigned long long qemuDomainRemoveDeviceGroupWaitTime = 1000ull * 15;
 
 /**
  * qemuDomainPrepareDisk:
@@ -3830,6 +3832,32 @@ qemuDomainRemoveSCSIVHostDevice(virQEMUDriverPtr driver,
 }
 
 static int
+qemuDomainWaitForDeviceGroupClose(virDomainObjPtr vm, int iommu_group)
+{
+    char *group_path;
+    unsigned long long remaining_ms = qemuDomainRemoveDeviceGroupWaitTime;
+    int rc = -1;
+
+    if (virAsprintf(&group_path, "/dev/vfio/%d", iommu_group) < 0)
+        return -1;
+
+    while ((rc = virFileIsOpenByPid(group_path, vm->pid)) == 1) {
+        if (remaining_ms <= 0)
+            break;
+        usleep(100*1000);
+        remaining_ms -= 100;
+    }
+
+    VIR_DEBUG("IOMMU group %d FD status: %d, wait time: %llu ms",
+              iommu_group, rc,
+              qemuDomainRemoveDeviceGroupWaitTime - remaining_ms);
+
+    VIR_FREE(group_path);
+    return rc;
+}
+
+
+static int
 qemuDomainRemoveHostDevice(virQEMUDriverPtr driver,
                            virDomainObjPtr vm,
                            virDomainHostdevDefPtr hostdev)
@@ -3933,8 +3961,10 @@ qemuDomainRemoveHostDevice(virQEMUDriverPtr driver,
             virPCIDeviceAddressGetIOMMUGroupNum(&hostdev->source.subsys.u.pci.addr);
         if (virHostdevPCIDeviceGroupUnbindable(driver->hostdevMgr,
                                                iommu_group)) {
-            virHostdevPCIDeviceGroupUnbind(driver->hostdevMgr,
-                                           iommu_group);
+            if (qemuDomainWaitForDeviceGroupClose(vm, iommu_group) == 0) {
+                virHostdevPCIDeviceGroupUnbind(driver->hostdevMgr,
+                                               iommu_group);
+            }
         }
     }
 
diff --git a/src/util/virfile.c b/src/util/virfile.c
index d444b32..29b762f 100644
--- a/src/util/virfile.c
+++ b/src/util/virfile.c
@@ -4162,3 +4162,55 @@ virFileReadValueString(char **value, const char *format, ...)
     VIR_FREE(str);
     return ret;
 }
+
+int
+virFileIsOpenByPid(const char *path, pid_t pid)
+{
+    struct dirent *ent;
+    DIR *filelist_dir;
+    char *filelist_path;
+    bool found = false;
+    int rc = -1;
+
+    if (!path || !IS_ABSOLUTE_FILE_NAME(path)) {
+        virReportError(VIR_ERR_INTERNAL_ERROR,
+                       _("invalid path: %s"), path ? path : "null");
+        goto error;
+    }
+
+    if (virAsprintf(&filelist_path, "/proc/%d/fd", pid) < 0)
+        goto error;
+
+    if (virDirOpen(&filelist_dir, filelist_path) < 0) {
+        virReportError(VIR_ERR_INTERNAL_ERROR,
+                       _("unable to open directory: %s"), filelist_path);
+        goto error;
+    }
+
+    while (!found &&
+           (rc = virDirRead(filelist_dir, &ent, filelist_path)) == 1) {
+        char *resolved_path = NULL;
+        char *link_path;
+        if ((rc = virAsprintf(&link_path, "%s/%s", filelist_path, ent->d_name)) < 0)
+            break;
+        if (virFileResolveLink(link_path, &resolved_path) == 0) {
+            if (resolved_path) {
+                VIR_DEBUG("checking absolute path for match (need: %s, got: %s)",
+                          path, resolved_path);
+                if (STREQ(resolved_path, path))
+                    found = true;
+                VIR_FREE(resolved_path);
+            }
+        }
+        VIR_FREE(link_path); /* allocated fresh on each iteration */
+    }
+
+    VIR_DIR_CLOSE(filelist_dir);
+ error:
+    VIR_FREE(filelist_path);
+
+    VIR_DEBUG("returning, rc: %d, found: %d", rc, found);
+    if (rc < 0)
+        return rc;
+
+    return found ? 1 : 0;
+}
diff --git a/src/util/virfile.h b/src/util/virfile.h
index 57ceb80..fb86786 100644
--- a/src/util/virfile.h
+++ b/src/util/virfile.h
@@ -347,6 +347,7 @@ int virFileReadValueScaledInt(unsigned long long *value, const char *format, ...
 int virFileReadValueString(char **value, const char *format, ...)
  ATTRIBUTE_FMT_PRINTF(2, 3);
 
+int virFileIsOpenByPid(const char *path, pid_t pid);
 
 int virFileInData(int fd,
                   int *inData,
-- 
2.7.4


* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29  0:24 [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Michael Roth
                   ` (4 preceding siblings ...)
  2017-06-29  0:25 ` [Qemu-devel] [RFC PATCH 5/5] qemu: hotplug: wait for VFIO group FD close before unbind Michael Roth
@ 2017-06-29  8:33 ` Daniel P. Berrange
  2017-06-29 18:22   ` Michael Roth
  2017-06-29 19:28   ` Alex Williamson
  2017-06-29 18:50 ` Laine Stump
  6 siblings, 2 replies; 16+ messages in thread
From: Daniel P. Berrange @ 2017-06-29  8:33 UTC (permalink / raw)
  To: Michael Roth
  Cc: libvir-list, qemu-devel, qemu-ppc, aik, alex.williamson,
	abologna, laine, pkrempa, mprivozn, sbhat

On Wed, Jun 28, 2017 at 07:24:55PM -0500, Michael Roth wrote:
> Hi everyone. Hoping to get some feedback on this approach, or some
> alternatives proposed below, to the following issue:
> 
> Currently libvirt immediately attempts to rebind a managed device back to the
> host driver when it receives a DEVICE_DELETED event from QEMU. This is
> problematic for 2 reasons:
> 
> 1) If multiple devices from a group are attached to a guest, this can move
>    the group into a "non-viable" state where some devices are assigned to
>    the host and some to the guest.
> 
> 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
>    where additional cleanup occurs. In most cases libvirt can ignore this
>    cleanup, but in the case of VFIO devices this is where closing of a VFIO
>    group FD occurs, and failing to wait before rebinding the device to the
>    host driver can result in unexpected behavior. In the case of powernv
>    hosts at least, this can lead to a host driver crashing due to the default
>    DMA windows not having been fully-restored yet. The window between this is
>    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
>    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
>    this period (on powernv hosts).

Why on earth does QEMU's device finalization take 6 seconds to complete?
That feels very broken to me, unless QEMU is not being scheduled due to
the host being overcommitted. If that's not the case, then we have a bug to
investigate in QEMU to find out why cleanup is delayed so long.


From libvirt's POV, we consider 'DEVICE_DELETED' as meaning both that the
frontend has gone *and* the corresponding backend has gone. Aside from
cleaning up the VFIO group, we use this as a trigger for all other
device-related cleanup like SELinux labelling, cgroup device ACLs, etc. If
the backend is not guaranteed to be closed in QEMU when this event is
emitted, then either QEMU needs to delay the event until it is really
cleaned up, or QEMU needs to add a further event to emit when the backend
is clean.


> Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> until all the devices in the group have been detached, at which point all
> the hostdevs are rebound as a group. Until that point, the devices are traced
> by the drvManager's inactiveList in a similar manner to hostdevs that are
> assigned to VFIO via the nodedev-detach interface.
> 
> Patch 5 addresses 2) by adding an additional check that, when the last device
> from a group is detached, polls /proc for open FDs referencing the VFIO group
> path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> time out, we abandon rebinding the hostdevs back to the host.

That is just gross - it is tieing libvirt to details of the QEMU internal
implementation. I really don't think we should be doing that. So NACK to
this from my POV.

> There are a couple alternatives to Patch 5 that might be worth considering:
> 
> a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
>    DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
>    favor of minimal changes to libvirt's event handlers.
> 
>    The downsides are:
>     - that we'd incur some latency for all device-detach calls, but it's not
>       apparent to me whether this delay is significant for anything outside
>       of VFIO.
>     - there may be cases where finalization after DEVICE_DELETE/unparent are
>       is not guaranteed, and I'm not sure QOM would encourage such
>       expectations even if that's currently the case.
> 
> b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
>    direct solution. With this we could completely separate out the handling
>    of rebinding to host driver based on receival of this event.
> 
>    The downsides are:
>     - this would only work for newer versions of QEMU, though we could use
>       the poll-wait in patch 5 as a fallback.
>     - synchronizing sync/async device-detach threads with sync/async
>       handlers for this would be a bit hairy, but I have a WIP in progress
>       that seems *fairly reasonable*
> 
> c) Take the approach in Patch 5, either as a precursor to implementing b) or
>    something else, or just sticking with that for now.
> 
> d) ???

Fix DEVICE_DELETED so it's only emitted when the backend associated with
the device is fully cleaned up.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29  8:33 ` [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Daniel P. Berrange
@ 2017-06-29 18:22   ` Michael Roth
  2017-06-29 19:31     ` Alex Williamson
  2017-06-29 19:28   ` Alex Williamson
  1 sibling, 1 reply; 16+ messages in thread
From: Michael Roth @ 2017-06-29 18:22 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: libvir-list, qemu-devel, qemu-ppc, aik, alex.williamson,
	abologna, laine, pkrempa, mprivozn, sbhat

Quoting Daniel P. Berrange (2017-06-29 03:33:19)
> On Wed, Jun 28, 2017 at 07:24:55PM -0500, Michael Roth wrote:
> > Hi everyone. Hoping to get some feedback on this approach, or some
> > alternatives proposed below, to the following issue:
> > 
> > Currently libvirt immediately attempts to rebind a managed device back to the
> > host driver when it receives a DEVICE_DELETED event from QEMU. This is
> > problematic for 2 reasons:
> > 
> > 1) If multiple devices from a group are attached to a guest, this can move
> >    the group into a "non-viable" state where some devices are assigned to
> >    the host and some to the guest.
> > 
> > 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
> >    where additional cleanup occurs. In most cases libvirt can ignore this
> >    cleanup, but in the case of VFIO devices this is where closing of a VFIO
> >    group FD occurs, and failing to wait before rebinding the device to the
> >    host driver can result in unexpected behavior. In the case of powernv
> >    hosts at least, this can lead to a host driver crashing due to the default
> >    DMA windows not having been fully-restored yet. The window between this is
> >    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
> >    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
> >    this period (on powernv hosts).
> 
> Why on earth does QEMU's device finalization take 6 seconds to complete.
> That feels very broken to me, unless QEMU is not being schedled due to
> host being overcomitted. If that's not the case, then we have a bug to
> investigate in QEMU to find out why cleanup is delayed so long.

In this particular case QEMU starts finalization almost immediately
after the DEVICE_DELETED, but it looks like most of the time between that and
closing of the group FD is spent in the kernel. Here's what it looks
like from QEMU with vfio*/monitor* traces enabled (in this case
unplugging a Mellanox CX3 PF):

# device_del called
...
# vfio device's device_unparent() called:
  # unrealize->exit->vfio_exitfn called:
61948@1498759308.951038:vfio_intx_disable  (0002:01:00.0)
61948@1498759308.953657:vfio_region_exit Device 0002:01:00.0, region 0
61948@1498759308.954532:vfio_region_exit Device 0002:01:00.0, region 2
  # unrealize->exit->vfio_exitfn returns, DEVICE_DELETED sent:
61948@1498759308.954633:monitor_protocol_event_queue event=9 data=0x3fff6c508690 rate=0
61948@1498759308.954669:monitor_protocol_event_emit event=9 data=0x3fff6c508690
  # last unref of vfio device, vfio_instance_finalize() called:
61948@1498759308.955660:vfio_region_finalize Device 0002:01:00.0, region 0
61948@1498759308.955742:vfio_region_finalize Device 0002:01:00.0, region 2
61948@1498759308.955751:vfio_put_base_device close vdev->fd=30

    # close(VFIODevice.fd) <- 5 SECOND DELAY

61948@1498759313.140515:vfio_put_base_device_completed close completed vdev->fd=30
61948@1498759313.181124:vfio_disconnect_container close container->fd=102
61948@1498759313.181152:vfio_put_group close group->fd=101
    # close(VFIOGroup.fd)
61948@1498759313.181157:vfio_put_group_completed close completed group->fd=101
  # vfio_instance_finalize() returns
# vfio device's device_unparent() returns

I poked around in the VFIO group close path and figured restoring ownership
of the IOMMU to the host via vfio_iommu_driver_ops.release() (via
close(groupfd)) was where all the time was spent, but didn't expect it to
be the close(VFIODevice.fd). Maybe Alexey/Alex have a better idea. I'll
look into it more as well.

But suffice it to say there's not much QEMU can do about the delay other
than deferring the DEVICE_DELETED event or adding a later-stage
event.

> 
> 
> From libvirt's POV, we consider 'DEVICE_DELETED' as meaning both that the
> frontend has gone *and* the corresponding backend has gone. Aside from
> cleaning the VFIO group, we use this as a trigger for all other device
> related cleanup like SELinux labelling, cgroup device ACLs, etc. If the
> backend is not guaranteed to be closed in QEMU when this emit is emitted
> then either QEMU needs to delay the event until it is really cleaned up,
> or QEMU needs to add a further event to emit when the backend is clean.
> 
> 
> > Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> > until all the devices in the group have been detached, at which point all
> > the hostdevs are rebound as a group. Until that point, the devices are traced
> > by the drvManager's inactiveList in a similar manner to hostdevs that are
> > assigned to VFIO via the nodedev-detach interface.
> > 
> > Patch 5 addresses 2) by adding an additional check that, when the last device
> > from a group is detached, polls /proc for open FDs referencing the VFIO group
> > path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> > time out, we abandon rebinding the hostdevs back to the host.
> 
> That is just gross - it is tieing libvirt to details of the QEMU internal
> implementation. I really don't think we should be doing that. So NACK to
> this from my POV.
> 
> > There are a couple alternatives to Patch 5 that might be worth considering:
> > 
> > a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
> >    DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
> >    favor of minimal changes to libvirt's event handlers.
> > 
> >    The downsides are:
> >     - that we'd incur some latency for all device-detach calls, but it's not
> >       apparent to me whether this delay is significant for anything outside
> >       of VFIO.
> >     - there may be cases where finalization after DEVICE_DELETE/unparent are
> >       is not guaranteed, and I'm not sure QOM would encourage such
> >       expectations even if that's currently the case.
> > 
> > b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
> >    direct solution. With this we could completely separate out the handling
> >    of rebinding to host driver based on receival of this event.
> > 
> >    The downsides are:
> >     - this would only work for newer versions of QEMU, though we could use
> >       the poll-wait in patch 5 as a fallback.
> >     - synchronizing sync/async device-detach threads with sync/async
> >       handlers for this would be a bit hairy, but I have a WIP in progress
> >       that seems *fairly reasonable*
> > 
> > c) Take the approach in Patch 5, either as a precursor to implementing b) or
> >    something else, or just sticking with that for now.
> > 
> > d) ???
> 
> Fix DEVICE_DELETE so its only emitted when the backend associated with
> the device is fully cleaned up.

Need to explore the same concerns I had WRT DEVICE_FINALIZED above, but
assuming those aren't an issue this would indeed make things even
simpler. Will look into this more.

Would patch 5 be out of the question even as a fallback for downlevel
QEMUs? Or is that scenario too unlikely to warrant it?

Thanks!

> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 


* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29  0:24 [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Michael Roth
                   ` (5 preceding siblings ...)
  2017-06-29  8:33 ` [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Daniel P. Berrange
@ 2017-06-29 18:50 ` Laine Stump
  2017-06-29 19:44   ` Alex Williamson
  2017-06-29 20:59   ` Michael Roth
  6 siblings, 2 replies; 16+ messages in thread
From: Laine Stump @ 2017-06-29 18:50 UTC (permalink / raw)
  To: libvir-list
  Cc: Michael Roth, qemu-devel, qemu-ppc, aik, alex.williamson,
	abologna, pkrempa, berrange, mprivozn, sbhat

On 06/28/2017 08:24 PM, Michael Roth wrote:
> Hi everyone. Hoping to get some feedback on this approach, or some
> alternatives proposed below, to the following issue:
> 
> Currently libvirt immediately attempts to rebind a managed device back to the
> host driver when it receives a DEVICE_DELETED event from QEMU. This is
> problematic for 2 reasons:
> 
> 1) If multiple devices from a group are attached to a guest, this can move
>    the group into a "non-viable" state where some devices are assigned to
>    the host and some to the guest.

Since we don't support hotplug with managed='yes' of individual (or
multiple) functions of a multifunction host device, I don't know that
it's very useful to support hot *un*plug of it - it would only be useful
if the multi-function device were present in the guest when it was
started, and then was hot-unplugged later. This is all a lot of
extra complexity, though, so it would be useful to know the
scenarios where it would actually be used (i.e. is this a legitimate
need, or just an interesting exercise?).

> 
> 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
>    where additional cleanup occurs. In most cases libvirt can ignore this
>    cleanup, but in the case of VFIO devices this is where closing of a VFIO
>    group FD occurs, and failing to wait before rebinding the device to the
>    host driver can result in unexpected behavior. In the case of powernv
>    hosts at least, this can lead to a host driver crashing due to the default
>    DMA windows not having been fully-restored yet. The window between this is
>    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
>    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
>    this period (on powernv hosts).

I agree with Dan that the situation described here should be considered
a qemu bug - according to my understanding (from back at the time
DEVICE_DELETED was added to qemu, I think at libvirt's request), qemu
should never emit the DEVICE_DELETED event until *everything* related to
the device is finished - that was the whole point of adding the event in
the first place. Covering up this bug with a bunch of extra libvirt
complexity is just creating the potential for even more bugs in the more
complex code.

> 
> Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> until all the devices in the group have been detached, at which point all
> the hostdevs are rebound as a group. Until that point, the devices are traced
> by the drvManager's inactiveList in a similar manner to hostdevs that are
> assigned to VFIO via the nodedev-detach interface.

What happens if libvirtd is restarted during this period? Is the
inactiveList rebuilt with all the info necessary to assure that the
nodedev-reattach does/doesn't happen (as appropriate) for all devices?


> 
> Patch 5 addresses 2) by adding an additional check that, when the last device
> from a group is detached, polls /proc for open FDs referencing the VFIO group
> path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> time out, we abandon rebinding the hostdevs back to the host.
> 
> There are a couple alternatives to Patch 5 that might be worth considering:
> 
> a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
>    DEVICE_DELETED.

Is the "device is *almost* completely deleted" event (current
DEVIE_DELETED) really something that anyone wants/needs to know about?
Or is the only useful event just the one that you're calling
DEVICE_FINALIZED? If the latter, then I think it would be better to just
change when DEVICE_DELETED is emitted.

>    Paired with patches 1-4 this would let us drop patch 5 in
>    favor of minimal changes to libvirt's event handlers.
> 
>    The downsides are:
>     - that we'd incur some latency for all device-detach calls, but it's not
>       apparent to me whether this delay is significant for anything outside
>       of VFIO.
>     - there may be cases where finalization after DEVICE_DELETE/unparent are
>       is not guaranteed, and I'm not sure QOM would encourage such
>       expectations even if that's currently the case.
> 
> b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
>    direct solution. With this we could completely separate out the handling
>    of rebinding to host driver based on receival of this event.
> 
>    The downsides are:
>     - this would only work for newer versions of QEMU, though we could use
>       the poll-wait in patch 5 as a fallback.

I don't think we should add such a complex patch as a fallback to
support older versions of qemu that don't have the bug fixed. Instead,
just tell people to upgrade qemu (after all, they're going to need to
update *something* (either libvirt or qemu); no need to update libvirt
just in order to avoid updating qemu).


>     - synchronizing sync/async device-detach threads with sync/async
>       handlers for this would be a bit hairy, but I have a WIP in progress
>       that seems *fairly reasonable*
> 
> c) Take the approach in Patch 5, either as a precursor to implementing b) or
>    something else, or just sticking with that for now.
> 
> d) ???
> 
> Personally I'm leaning toward c), but I'm wondering if that's "good enough"
> for now, or if we should pursue something more robust from the start, or
> perhaps something else entirely?
> 
> Any feedback is greatly appreciated!
> 
>  src/libvirt_private.syms |   5 ++
>  src/qemu/qemu_hostdev.c  |  16 +++++
>  src/qemu/qemu_hostdev.h  |   4 ++
>  src/qemu/qemu_hotplug.c  |  56 ++++++++++++++----
>  src/util/virfile.c       |  52 +++++++++++++++++
>  src/util/virfile.h       |   1 +
>  src/util/virhostdev.c    | 173 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  src/util/virhostdev.h    |  16 +++++
>  src/util/virpci.c        |  69 ++++++++++++++++++----
>  src/util/virpci.h        |   4 ++
>  10 files changed, 360 insertions(+), 36 deletions(-)
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29  8:33 ` [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Daniel P. Berrange
  2017-06-29 18:22   ` Michael Roth
@ 2017-06-29 19:28   ` Alex Williamson
  2017-06-29 20:21     ` Michael Roth
  1 sibling, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2017-06-29 19:28 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Michael Roth, libvir-list, qemu-devel, qemu-ppc, aik, abologna,
	laine, pkrempa, mprivozn, sbhat

On Thu, 29 Jun 2017 09:33:19 +0100
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Wed, Jun 28, 2017 at 07:24:55PM -0500, Michael Roth wrote:
> > Hi everyone. Hoping to get some feedback on this approach, or some
> > alternatives proposed below, to the following issue:
> > 
> > Currently libvirt immediately attempts to rebind a managed device back to the
> > host driver when it receives a DEVICE_DELETED event from QEMU. This is
> > problematic for 2 reasons:
> > 
> > 1) If multiple devices from a group are attached to a guest, this can move
> >    the group into a "non-viable" state where some devices are assigned to
> >    the host and some to the guest.
> > 
> > 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
> >    where additional cleanup occurs. In most cases libvirt can ignore this
> >    cleanup, but in the case of VFIO devices this is where closing of a VFIO
> >    group FD occurs, and failing to wait before rebinding the device to the
> >    host driver can result in unexpected behavior. In the case of powernv
> >    hosts at least, this can lead to a host driver crashing due to the default
> >    DMA windows not having been fully-restored yet. The window between this is
> >    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
> >    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
> >    this period (on powernv hosts).  

I've been trying to tackle this from the kernel side too, currently
Linux's driver model really neither allows vfio bus drivers to nak
unbinding a device from an in-use group nor nak binding a device
from an in-use group to an incompatible driver.  The issue you identify
in QEMU/libvirt exacerbates the problem as QEMU has not yet finalized
the device/group references before libvirt tries to unbind the device
from the vfio bus driver and attach it to a host driver.  I'd love to
solve this from both sides by allowing the kernel to prevent driver
binds that we'd consider compromising and also introduce a bit of
patience in the QEMU/libvirt path to avoid the kernel needing to impose
that driver blocking.

> Why on earth does QEMU's device finalization take 6 seconds to complete?
> That feels very broken to me, unless QEMU is not being scheduled due to
> the host being overcommitted. If that's not the case, then we have a bug to
> investigate in QEMU to find out why cleanup is delayed so long.

I wouldn't necessarily jump to the conclusion that this is a bug; if
it's related to tearing down the IOMMU mappings for the device, gigs
of mappings can take non-trivial time.  Is that the scenario here?  Is
that 6s somehow proportional to guest memory size?
 
> From libvirt's POV, we consider 'DEVICE_DELETED' as meaning both that the
> frontend has gone *and* the corresponding backend has gone. Aside from
> cleaning the VFIO group, we use this as a trigger for all other device
> related cleanup like SELinux labelling, cgroup device ACLs, etc. If the
> backend is not guaranteed to be closed in QEMU when this event is emitted
> then either QEMU needs to delay the event until it is really cleaned up,
> or QEMU needs to add a further event to emit when the backend is clean.

Clearly libvirt and QEMU's idea of what DEVICE_DELETED means don't
align.

> > Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> > until all the devices in the group have been detached, at which point all
> > the hostdevs are rebound as a group. Until that point, the devices are traced
> > by the drvManager's inactiveList in a similar manner to hostdevs that are
> > assigned to VFIO via the nodedev-detach interface.

There are certainly some benefits to group-awareness here: currently
an admin user like libvirt can trigger a BUG_ON by trying to bind a
device back to a host driver when a group is still in use; at best we
might improve that to rejecting the compromising bind.  
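
For context, the group membership being discussed is already visible to
userspace through sysfs, so the check itself is cheap.  A rough standalone
sketch along these lines (hypothetical code, not the series' actual
virPCIIOMMUGroupIterate() helper) lists a device's IOMMU-group siblings and
the driver each one is currently bound to; a group-aware libvirt would defer
the reattach until every sibling had been released:

  /* Sketch: enumerate the members of a PCI device's IOMMU group via sysfs
   * and print which driver each member is currently bound to. */
  #include <dirent.h>
  #include <limits.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static void list_group_members(const char *pci_addr)
  {
      char path[PATH_MAX];
      DIR *dir;
      struct dirent *ent;

      snprintf(path, sizeof(path),
               "/sys/bus/pci/devices/%s/iommu_group/devices", pci_addr);
      if (!(dir = opendir(path))) {
          perror(path);
          return;
      }

      while ((ent = readdir(dir)) != NULL) {
          char drvlink[PATH_MAX], drv[PATH_MAX];
          const char *base;
          ssize_t n;

          if (ent->d_name[0] == '.')
              continue;

          /* /sys/bus/pci/devices/<sibling>/driver is a symlink to the
           * currently bound driver, or absent if the device is unbound. */
          snprintf(drvlink, sizeof(drvlink),
                   "/sys/bus/pci/devices/%s/driver", ent->d_name);
          n = readlink(drvlink, drv, sizeof(drv) - 1);
          if (n < 0) {
              printf("%s: no driver\n", ent->d_name);
              continue;
          }
          drv[n] = '\0';
          base = strrchr(drv, '/');
          printf("%s: %s\n", ent->d_name, base ? base + 1 : drv);
      }
      closedir(dir);
  }

  int main(int argc, char **argv)
  {
      list_group_members(argc > 1 ? argv[1] : "0000:01:00.0");
      return 0;
  }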

> > Patch 5 addresses 2) by adding an additional check that, when the last device
> > from a group is detached, polls /proc for open FDs referencing the VFIO group
> > path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> > time out, we abandon rebinding the hostdevs back to the host.  
> 
> That is just gross - it is tieing libvirt to details of the QEMU internal
> implementation. I really don't think we should be doing that. So NACK to
> this from my POV.

It seems a little silly for QEMU to emit the event while it's still in
use; clearly emitting the event at the right point would negate any
need for snooping around in proc.
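
For the record, the "snooping" in question boils down to walking every
process's fd table for a link back to the group node and polling until it
disappears.  A rough standalone sketch of the idea (hypothetical code with a
hard-coded group, not the actual virfile.c helper added by patch 5) might
look like:

  #include <dirent.h>
  #include <limits.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Return 1 if some process still holds an open fd on group_path
   * (e.g. "/dev/vfio/26"), 0 if none do, -1 on error. */
  static int vfio_group_fd_in_use(const char *group_path)
  {
      DIR *proc;
      struct dirent *pid;
      int in_use = 0;

      if (!(proc = opendir("/proc")))
          return -1;

      while (!in_use && (pid = readdir(proc)) != NULL) {
          char fddir[PATH_MAX];
          DIR *fds;
          struct dirent *fd;

          if (strspn(pid->d_name, "0123456789") != strlen(pid->d_name))
              continue;   /* not a PID directory */

          snprintf(fddir, sizeof(fddir), "/proc/%s/fd", pid->d_name);
          if (!(fds = opendir(fddir)))
              continue;   /* process exited, or we lack permission */

          while ((fd = readdir(fds)) != NULL) {
              char link[PATH_MAX], target[PATH_MAX];
              ssize_t n;

              snprintf(link, sizeof(link), "%s/%s", fddir, fd->d_name);
              if ((n = readlink(link, target, sizeof(target) - 1)) < 0)
                  continue;
              target[n] = '\0';
              if (strcmp(target, group_path) == 0) {
                  in_use = 1;
                  break;
              }
          }
          closedir(fds);
      }
      closedir(proc);
      return in_use;
  }

  int main(void)
  {
      int i;

      /* Poll for up to ~10 seconds before giving up on the rebind. */
      for (i = 0; i < 100; i++) {
          if (vfio_group_fd_in_use("/dev/vfio/26") <= 0)
              return 0;
          usleep(100 * 1000);
      }
      fprintf(stderr, "timed out waiting for the VFIO group fd to close\n");
      return 1;
  }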

> > There are a couple alternatives to Patch 5 that might be worth considering:
> > 
> > a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
> >    DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
> >    favor of minimal changes to libvirt's event handlers.
> > 
> >    The downsides are:
> >     - that we'd incur some latency for all device-detach calls, but it's not
> >       apparent to me whether this delay is significant for anything outside
> >       of VFIO.
> >     - there may be cases where finalization after DEVICE_DELETE/unparent are
> >       is not guaranteed, and I'm not sure QOM would encourage such
> >       expectations even if that's currently the case.
> > 
> > b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
> >    direct solution. With this we could completely separate out the handling
> >    of rebinding to host driver based on receival of this event.
> > 
> >    The downsides are:
> >     - this would only work for newer versions of QEMU, though we could use
> >       the poll-wait in patch 5 as a fallback.
> >     - synchronizing sync/async device-detach threads with sync/async
> >       handlers for this would be a bit hairy, but I have a WIP in progress
> >       that seems *fairly reasonable*
> > 
> > c) Take the approach in Patch 5, either as a precursor to implementing b) or
> >    something else, or just sticking with that for now.
> > 
> > d) ???  
> 
> Fix DEVICE_DELETE so it's only emitted when the backend associated with
> the device is fully cleaned up.

Adding a FINALIZE seems to require a two-step fix, fix QEMU then fix
libvirt, whereas moving DELETE to the correct location automatically
fixes the behavior with existing libvirt.  I don't know that a
GROUP_DELETED makes much sense, libvirt can know about groups on its
own and it just leads to a vfio specific path.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29 18:22   ` Michael Roth
@ 2017-06-29 19:31     ` Alex Williamson
  0 siblings, 0 replies; 16+ messages in thread
From: Alex Williamson @ 2017-06-29 19:31 UTC (permalink / raw)
  To: Michael Roth
  Cc: Daniel P. Berrange, libvir-list, qemu-devel, qemu-ppc, aik,
	abologna, laine, pkrempa, mprivozn, sbhat

On Thu, 29 Jun 2017 13:22:44 -0500
Michael Roth <mdroth@linux.vnet.ibm.com> wrote:

> Quoting Daniel P. Berrange (2017-06-29 03:33:19)
> > On Wed, Jun 28, 2017 at 07:24:55PM -0500, Michael Roth wrote:  
> > > Hi everyone. Hoping to get some feedback on this approach, or some
> > > alternatives proposed below, to the following issue:
> > > 
> > > Currently libvirt immediately attempts to rebind a managed device back to the
> > > host driver when it receives a DEVICE_DELETED event from QEMU. This is
> > > problematic for 2 reasons:
> > > 
> > > 1) If multiple devices from a group are attached to a guest, this can move
> > >    the group into a "non-viable" state where some devices are assigned to
> > >    the host and some to the guest.
> > > 
> > > 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
> > >    where additional cleanup occurs. In most cases libvirt can ignore this
> > >    cleanup, but in the case of VFIO devices this is where closing of a VFIO
> > >    group FD occurs, and failing to wait before rebinding the device to the
> > >    host driver can result in unexpected behavior. In the case of powernv
> > >    hosts at least, this can lead to a host driver crashing due to the default
> > >    DMA windows not having been fully-restored yet. The window between this is
> > >    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
> > >    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
> > >    this period (on powernv hosts).  
> > 
> > Why on earth does QEMU's device finalization take 6 seconds to complete?
> > That feels very broken to me, unless QEMU is not being scheduled due to
> > the host being overcommitted. If that's not the case, then we have a bug to
> > investigate in QEMU to find out why cleanup is delayed so long.  
> 
> In this particular case QEMU starts finalization almost immediately
> after the DEVICE_DELETED, but it looks like most of the time between that and
> closing of the group FD is spent in the kernel. Here's what it looks
> like from QEMU with vfio*/monitor* traces enabled (in this case
> unplugging a Mellanox CX3 PF):
> 
> # device_del called
> ...
> # vfio device's device_unparent() called:
>   # unrealize->exit->vfio_exitfn called:
> 61948@1498759308.951038:vfio_intx_disable  (0002:01:00.0)
> 61948@1498759308.953657:vfio_region_exit Device 0002:01:00.0, region 0
> 61948@1498759308.954532:vfio_region_exit Device 0002:01:00.0, region 2
>   # unrealize->exit->vfio_exitfn returns, DEVICE_DELETED sent:
> 61948@1498759308.954633:monitor_protocol_event_queue event=9 data=0x3fff6c508690 rate=0
> 61948@1498759308.954669:monitor_protocol_event_emit event=9 data=0x3fff6c508690
>   # last unref of vfio device, vfio_instance_finalize() called:
> 61948@1498759308.955660:vfio_region_finalize Device 0002:01:00.0, region 0
> 61948@1498759308.955742:vfio_region_finalize Device 0002:01:00.0, region 2
> 61948@1498759308.955751:vfio_put_base_device close vdev->fd=30
> 
>     # close(VFIODevice.fd) <- 5 SECOND DELAY
> 
> 61948@1498759313.140515:vfio_put_base_device_completed close completed vdev->fd=30
> 61948@1498759313.181124:vfio_disconnect_container close container->fd=102
> 61948@1498759313.181152:vfio_put_group close group->fd=101
>     # close(VFIOGroup.fd)
> 61948@1498759313.181157:vfio_put_group_completed close completed group->fd=101
>   # vfio_instance_finalize() returns
> # vfio device's device_unparent() returns
> 
> I poked around in the VFIO group close path and figured restoring ownership
> of IOMMU to the host via vfio_iommu_driver_ops.release() (via
> close(groupfd)) was where all the time was spent, but didn't expect it to
> be the close(VFIODevice.fd). Maybe Alexey/Alex have a better idea. I'll
> look into it more as well.
> 
> But suffice to say there's not much QEMU can do about the delay other
> than deferring the DEVICE_DELETED event or adding a later-stage
> event.

The device close needs to do a device reset and attempt to restore the
state of the device, plus whatever that EEH code is doing.  I would
have expected the group close to tear down the IOMMU, but maybe I'm
forgetting some subtlety; in either case it's going to happen as part
of the finalize and it's going to take some time.  Thanks,

Alex
 
> > From libvirt's POV, we consider 'DEVICE_DELETED' as meaning both that the
> > frontend has gone *and* the corresponding backend has gone. Aside from
> > cleaning the VFIO group, we use this as a trigger for all other device
> > related cleanup like SELinux labelling, cgroup device ACLs, etc. If the
> > backend is not guaranteed to be closed in QEMU when this event is emitted
> > then either QEMU needs to delay the event until it is really cleaned up,
> > or QEMU needs to add a further event to emit when the backend is clean.
> > 
> >   
> > > Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> > > until all the devices in the group have been detached, at which point all
> > > the hostdevs are rebound as a group. Until that point, the devices are traced
> > > by the drvManager's inactiveList in a similar manner to hostdevs that are
> > > assigned to VFIO via the nodedev-detach interface.
> > > 
> > > Patch 5 addresses 2) by adding an additional check that, when the last device
> > > from a group is detached, polls /proc for open FDs referencing the VFIO group
> > > path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> > > time out, we abandon rebinding the hostdevs back to the host.  
> > 
> > That is just gross - it is tieing libvirt to details of the QEMU internal
> > implementation. I really don't think we should be doing that. So NACK to
> > this from my POV.
> >   
> > > There are a couple alternatives to Patch 5 that might be worth considering:
> > > 
> > > a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
> > >    DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
> > >    favor of minimal changes to libvirt's event handlers.
> > > 
> > >    The downsides are:
> > >     - that we'd incur some latency for all device-detach calls, but it's not
> > >       apparent to me whether this delay is significant for anything outside
> > >       of VFIO.
> > >     - there may be cases where finalization after DEVICE_DELETE/unparent are
> > >       is not guaranteed, and I'm not sure QOM would encourage such
> > >       expectations even if that's currently the case.
> > > 
> > > b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
> > >    direct solution. With this we could completely separate out the handling
> > >    of rebinding to host driver based on receival of this event.
> > > 
> > >    The downsides are:
> > >     - this would only work for newer versions of QEMU, though we could use
> > >       the poll-wait in patch 5 as a fallback.
> > >     - synchronizing sync/async device-detach threads with sync/async
> > >       handlers for this would be a bit hairy, but I have a WIP in progress
> > >       that seems *fairly reasonable*
> > > 
> > > c) Take the approach in Patch 5, either as a precursor to implementing b) or
> > >    something else, or just sticking with that for now.
> > > 
> > > d) ???  
> > 
> > Fix DEVICE_DELETE so it's only emitted when the backend associated with
> > the device is fully cleaned up.  
> 
> Need to explore the same concerns I had WRT DEVICE_FINALIZED above, but
> assuming those aren't an issue this would indeed make things even
> simpler. Will look into this more.
> 
> Would patch 5 be out of the question even as a fallback for downlevel
> QEMUs? Or is that scenario too unlikely to warrant it?
> 
> Thanks!
> 
> > 
> > 
> > Regards,
> > Daniel
> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> >   
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29 18:50 ` Laine Stump
@ 2017-06-29 19:44   ` Alex Williamson
  2017-06-30  2:27     ` Laine Stump
  2017-06-29 20:59   ` Michael Roth
  1 sibling, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2017-06-29 19:44 UTC (permalink / raw)
  To: Laine Stump
  Cc: libvir-list, Michael Roth, qemu-devel, qemu-ppc, aik, abologna,
	pkrempa, berrange, mprivozn, sbhat

On Thu, 29 Jun 2017 14:50:15 -0400
Laine Stump <laine@redhat.com> wrote:

> On 06/28/2017 08:24 PM, Michael Roth wrote:
> > Hi everyone. Hoping to get some feedback on this approach, or some
> > alternatives proposed below, to the following issue:
> > 
> > Currently libvirt immediately attempts to rebind a managed device back to the
> > host driver when it receives a DEVICE_DELETED event from QEMU. This is
> > problematic for 2 reasons:
> > 
> > 1) If multiple devices from a group are attached to a guest, this can move
> >    the group into a "non-viable" state where some devices are assigned to
> >    the host and some to the guest.  
> 
> Since we don't support hotplug with managed='yes' of individual (or
> multiple) functions of a multifunction host device, I don't know that
> it's very useful to support hot *un*plug of it - it would only be useful
> if the multi-function device were present in the guest when it was
> started, and then was hot-unplugged later. And this is all a lot of
> extra complexity, though, so it would be useful to know what are the
> scenarios where it would actually be used (i.e. is this a legitimate
> need, or just an interesting exercise?)

This doesn't make sense to me, since when do we not support hotplug
with managed='yes' and how is it prevented?  Also, let's just not talk
about multifunction, a multifunction device does not imply a shared
group, nor does a shared group imply multifunction.  So is it hotplug
of a device which is in a shared group that is not supported, and if so
how?  I think libvirt tries to do the hot-add, but it hits the
non-viable group when it gives it to QEMU.  On hot-remove, I'm pretty
sure libvirt just lets the host crash into the ground by re-binding the
device to the host driver.  If we don't want to support it, that's one
thing, but the current model is more just neglectful than unsupported.


> > 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
> >    where additional cleanup occurs. In most cases libvirt can ignore this
> >    cleanup, but in the case of VFIO devices this is where closing of a VFIO
> >    group FD occurs, and failing to wait before rebinding the device to the
> >    host driver can result in unexpected behavior. In the case of powernv
> >    hosts at least, this can lead to a host driver crashing due to the default
> >    DMA windows not having been fully-restored yet. The window between this is
> >    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
> >    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
> >    this period (on powernv hosts).  
> 
> I agree with Dan that the situation described here should be considered
> a qemu bug - according to my understanding from back at the time
> DEVICE_DELETED was added to qemu (I think at libvirt's request), qemu
> should never emit the DEVICE_DELETED event until *everything* related to
> the device is finished - that was the whole point of adding the event in
> the first place. Covering up this bug with a bunch of extra libvirt
> complexity is just creating the potential for even more bugs in the more
> complex code.

Agree, but ISTR not everyone thinks that way.  I don't remember the
opposing viewpoint though.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29 19:28   ` Alex Williamson
@ 2017-06-29 20:21     ` Michael Roth
  0 siblings, 0 replies; 16+ messages in thread
From: Michael Roth @ 2017-06-29 20:21 UTC (permalink / raw)
  To: Daniel P. Berrange, Alex Williamson
  Cc: libvir-list, qemu-devel, qemu-ppc, aik, abologna, laine, pkrempa,
	mprivozn, sbhat

Quoting Alex Williamson (2017-06-29 14:28:11)
> On Thu, 29 Jun 2017 09:33:19 +0100
> "Daniel P. Berrange" <berrange@redhat.com> wrote:
> 
> > On Wed, Jun 28, 2017 at 07:24:55PM -0500, Michael Roth wrote:
> > > Hi everyone. Hoping to get some feedback on this approach, or some
> > > alternatives proposed below, to the following issue:
> > > 
> > > Currently libvirt immediately attempts to rebind a managed device back to the
> > > host driver when it receives a DEVICE_DELETED event from QEMU. This is
> > > problematic for 2 reasons:
> > > 
> > > 1) If multiple devices from a group are attached to a guest, this can move
> > >    the group into a "non-viable" state where some devices are assigned to
> > >    the host and some to the guest.
> > > 
> > > 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
> > >    where additional cleanup occurs. In most cases libvirt can ignore this
> > >    cleanup, but in the case of VFIO devices this is where closing of a VFIO
> > >    group FD occurs, and failing to wait before rebinding the device to the
> > >    host driver can result in unexpected behavior. In the case of powernv
> > >    hosts at least, this can lead to a host driver crashing due to the default
> > >    DMA windows not having been fully-restored yet. The window between this is
> > >    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
> > >    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
> > >    this period (on powernv hosts).  
> 
> I've been trying to tackle this from the kernel side too, currently
> Linux's driver model really neither allows vfio bus drivers to nak
> unbinding a device from an in-use group nor nak binding a device
> from an in-use group to an incompatible driver.  The issue you identify
> in QEMU/libvirt exacerbates the problem as QEMU has not yet finalized
> the device/group references before libvirt tries to unbind the device
> from the vfio bus driver and attach it to a host driver.  I'd love to
> solve this from both sides by allowing the kernel to prevent driver
> binds that we'd consider compromising and also introduce a bit of
> patience in the QEMU/libvirt path to avoid the kernel needing to impose
> that driver blocking.

Perfect, I'd meant to ask about this and it skipped my mind. I had some
concerns about the fact that even with these qemu/libvirt changes in place
we could still trigger these sorts of crashes by issuing such a rebind
outside of libvirt, so good to know that aspect is being looked at as well
and that the 2 aren't in conflict.

> 
> > Why on earth does QEMU's device finalization take 6 seconds to complete?
> > That feels very broken to me, unless QEMU is not being scheduled due to
> > the host being overcommitted. If that's not the case, then we have a bug to
> > investigate in QEMU to find out why cleanup is delayed so long.
> 
> I wouldn't necessarily jump to the conclusion that this is a bug; if
> it's related to tearing down the IOMMU mappings for the device, gigs
> of mappings can take non-trivial time.  Is that the scenario here?  Is
> that 6s somehow proportional to guest memory size?

I think you're right that the teardown is via group close. Since the traces
seem to pin it on the vfio device FD close, the device reset you mentioned
is probably what's going on here, but I'll try to confirm.

> 
> > From libvirt's POV, we consider 'DEVICE_DELETED' as meaning both that the
> > frontend has gone *and* the corresponding backend has gone. Aside from
> > cleaning the VFIO group, we use this as a trigger for all other device
> > related cleanup like SELinux labelling, cgroup device ACLs, etc. If the
> > backend is not guaranteed to be closed in QEMU when this event is emitted
> > then either QEMU needs to delay the event until it is really cleaned up,
> > or QEMU needs to add a further event to emit when the backend is clean.
> 
> Clearly libvirt and QEMU's idea of what DEVICE_DELETED means don't
> align.
> 
> > > Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> > > until all the devices in the group have been detached, at which point all
> > > the hostdevs are rebound as a group. Until that point, the devices are traced
> > > by the drvManager's inactiveList in a similar manner to hostdevs that are
> > > assigned to VFIO via the nodedev-detach interface.
> 
> There are certainly some benefits to group-awareness here: currently
> an admin user like libvirt can trigger a BUG_ON by trying to bind a
> device back to a host driver when a group is still in use; at best we
> might improve that to rejecting the compromising bind.  
> 
> > > Patch 5 addresses 2) by adding an additional check that, when the last device
> > > from a group is detached, polls /proc for open FDs referencing the VFIO group
> > > path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> > > time out, we abandon rebinding the hostdevs back to the host.  
> > 
> > That is just gross - it is tieing libvirt to details of the QEMU internal
> > implementation. I really don't think we should be doing that. So NACK to
> > this from my POV.
> 
> It seems a little silly for QEMU to emit the event while it's still in
> use; clearly emitting the event at the right point would negate any
> need for snooping around in proc.
> 
> > > There are a couple alternatives to Patch 5 that might be worth considering:
> > > 
> > > a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
> > >    DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
> > >    favor of minimal changes to libvirt's event handlers.
> > > 
> > >    The downsides are:
> > >     - that we'd incur some latency for all device-detach calls, but it's not
> > >       apparent to me whether this delay is significant for anything outside
> > >       of VFIO.
> > >     - there may be cases where finalization after DEVICE_DELETE/unparent are
> > >       is not guaranteed, and I'm not sure QOM would encourage such
> > >       expectations even if that's currently the case.
> > > 
> > > b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
> > >    direct solution. With this we could completely separate out the handling
> > >    of rebinding to host driver based on receival of this event.
> > > 
> > >    The downsides are:
> > >     - this would only work for newer versions of QEMU, though we could use
> > >       the poll-wait in patch 5 as a fallback.
> > >     - synchronizing sync/async device-detach threads with sync/async
> > >       handlers for this would be a bit hairy, but I have a WIP in progress
> > >       that seems *fairly reasonable*
> > > 
> > > c) Take the approach in Patch 5, either as a precursor to implementing b) or
> > >    something else, or just sticking with that for now.
> > > 
> > > d) ???  
> > 
> > Fix DEVICE_DELETE so it's only emitted when the backend associated with
> > the device is fully cleaned up.
> 
> Adding a FINALIZE seems to require a two-step fix, fix QEMU then fix
> libvirt, whereas moving DELETE to the correct location automatically
> fixes the behavior with existing libvirt.  I don't know that a
> GROUP_DELETED makes much sense, libvirt can know about groups on its
> own and it just leads to a vfio specific path.  Thanks,

Ok, seems like there's a consensus there. I'll report back if there's
any obvious showstopper with this, but otherwise I'll work on a patch to
defer the DEVICE_DELETED, and drop patch 5 from the libvirt side of
things.

Thanks!

> 
> Alex
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29 18:50 ` Laine Stump
  2017-06-29 19:44   ` Alex Williamson
@ 2017-06-29 20:59   ` Michael Roth
  2017-06-30  6:59     ` Andrea Bolognani
  1 sibling, 1 reply; 16+ messages in thread
From: Michael Roth @ 2017-06-29 20:59 UTC (permalink / raw)
  To: Laine Stump, libvir-list
  Cc: qemu-devel, qemu-ppc, aik, alex.williamson, abologna, pkrempa,
	berrange, mprivozn, sbhat

Quoting Laine Stump (2017-06-29 13:50:15)
> On 06/28/2017 08:24 PM, Michael Roth wrote:
> > Hi everyone. Hoping to get some feedback on this approach, or some
> > alternatives proposed below, to the following issue:
> > 
> > Currently libvirt immediately attempts to rebind a managed device back to the
> > host driver when it receives a DEVICE_DELETED event from QEMU. This is
> > problematic for 2 reasons:
> > 
> > 1) If multiple devices from a group are attached to a guest, this can move
> >    the group into a "non-viable" state where some devices are assigned to
> >    the host and some to the guest.
> 
> Since we don't support hotplug with managed='yes' of individual (or
> multiple) functions of a multifunction host device, I don't know that
> it's very useful to support hot *un*plug of it - it would only be useful
> if the multi-function device were present in the guest when it was
> started, and then was hot-unplugged later. And this is all a lot of
> extra complexity, though, so it would be useful to know what are the
> scenarios where it would actually be used (i.e. is this a legitimate
> need, or just an interesting exercise?)
> 
> > 
> > 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
> >    where additional cleanup occurs. In most cases libvirt can ignore this
> >    cleanup, but in the case of VFIO devices this is where closing of a VFIO
> >    group FD occurs, and failing to wait before rebinding the device to the
> >    host driver can result in unexpected behavior. In the case of powernv
> >    hosts at least, this can lead to a host driver crashing due to the default
> >    DMA windows not having been fully-restored yet. The window between this is
> >    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
> >    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
> >    this period (on powernv hosts).
> 
> I agree with Dan that the situation described here should be considered
> a qemu bug - according to my understanding from back at the time
> DEVICE_DELETED was added to qemu (I think at libvirt's request), qemu
> should never emit the DEVICE_DELETED event until *everything* related to
> the device is finished - that was the whole point of adding the event in
> the first place. Covering up this bug with a bunch of extra libvirt
> complexity is just creating the potential for even more bugs in the more
> complex code.
> 
> > 
> > Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> > until all the devices in the group have been detached, at which point all
> > the hostdevs are rebound as a group. Until that point, the devices are traced
> > by the drvManager's inactiveList in a similar manner to hostdevs that are
> > assigned to VFIO via the nodedev-detach interface.
> 
> What happens if libvirtd is restarted during this period? Is the
> inactiveList rebuilt with all the info necessary to assure that the
> nodedev-reattach does/doesn't happen (as appropriate) for all devices?

Hmm, good question.

The Unbindable() check only needs to know that nothing in the activeList
belongs to the group we're checking, and that list at least seems to get
rebuilt appropriately on restart.

But the Unbind() relies on inactiveList and the behavior there is what
nodedev-detach currently does, which is to add it to inactive list while
libvirtd is running, and then just forget about it when libvirtd restarts.
For nodedev-detach it's fine since virHostdevPreparePCIDevices() re-adds
it as needed in the device-attach path. But yah, for this purpose it ends
up losing track of hostdevs that are still pending rebind to the host, and
that means some devices may not get rebound at the appropriate time if
there was a libvirtd restart.

Unlike with device-attach, we can't just re-add it on-demand because we
actually need to know whether or not it was previously in the list. So
I think we'd need to add some persistent state to track this. Will look
into adding handling for that.
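
FWIW the persistent state could be as small as a line-per-device file that
gets replayed when the daemon starts back up.  A minimal sketch of the idea
(hypothetical path and format, not an actual libvirt interface):

  #include <stdio.h>
  #include <string.h>

  /* Hypothetical location for the "pending rebind to host" list. */
  #define PENDING_REBIND_STATE "/var/run/hostdev-pending-rebind"

  /* Record a device as detached from the guest but not yet rebound. */
  static int pending_rebind_add(const char *pci_addr)
  {
      FILE *f = fopen(PENDING_REBIND_STATE, "a");

      if (!f)
          return -1;
      fprintf(f, "%s\n", pci_addr);
      return fclose(f);
  }

  /* On daemon restart, replay the file into the in-memory list. */
  static void pending_rebind_load(void (*readd)(const char *pci_addr))
  {
      char line[64];
      FILE *f = fopen(PENDING_REBIND_STATE, "r");

      if (!f)
          return;   /* no state file: nothing was pending */
      while (fgets(line, sizeof(line), f)) {
          line[strcspn(line, "\n")] = '\0';
          if (*line)
              readd(line);
      }
      fclose(f);
  }

  static void readd_cb(const char *pci_addr)
  {
      printf("still pending rebind: %s\n", pci_addr);
  }

  int main(void)
  {
      pending_rebind_add("0002:01:00.0");
      pending_rebind_load(readd_cb);
      return 0;
  }

(Entries would of course also need to be dropped once the group really is
rebound; the sketch glosses over that and over any locking.)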

> 
> 
> > 
> > Patch 5 addresses 2) by adding an additional check that, when the last device
> > from a group is detached, polls /proc for open FDs referencing the VFIO group
> > path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> > time out, we abandon rebinding the hostdevs back to the host.
> > 
> > There are a couple alternatives to Patch 5 that might be worth considering:
> > 
> > a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
> >    DEVICE_DELETED.
> 
> Is the "device is *almost* completely deleted" event (current
> DEVICE_DELETED) really something that anyone wants/needs to know about?
> Or is the only useful event just the one that you're calling
> DEVICE_FINALIZED? If the latter, then I think it would be better to just
> change when DEVICE_DELETED is emitted.
> 
> >    Paired with patches 1-4 this would let us drop patch 5 in
> >    favor of minimal changes to libvirt's event handlers.
> > 
> >    The downsides are:
> >     - that we'd incur some latency for all device-detach calls, but it's not
> >       apparent to me whether this delay is significant for anything outside
> >       of VFIO.
> >     - there may be cases where finalization after DEVICE_DELETE/unparent are
> >       is not guaranteed, and I'm not sure QOM would encourage such
> >       expectations even if that's currently the case.
> > 
> > b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
> >    direct solution. With this we could completely separate out the handling
> >    of rebinding to host driver based on receival of this event.
> > 
> >    The downsides are:
> >     - this would only work for newer versions of QEMU, though we could use
> >       the poll-wait in patch 5 as a fallback.
> 
> I don't think we should add such a complex patch as a fallback to
> support older versions of qemu that don't have the bug fixed. Instead,
> just tell people to upgrade qemu (after all, they're going to need to
> update *something* (either libvirt or qemu); no need to update libvirt
> just in order to avoid updating qemu).

Ok, makes sense.

Thanks!

> 
> 
> >     - synchronizing sync/async device-detach threads with sync/async
> >       handlers for this would be a bit hairy, but I have a WIP in progress
> >       that seems *fairly reasonable*
> > 
> > c) Take the approach in Patch 5, either as a precursor to implementing b) or
> >    something else, or just sticking with that for now.
> > 
> > d) ???
> > 
> > Personally I'm leaning toward c), but I'm wondering if that's "good enough"
> > for now, or if we should pursue something more robust from the start, or
> > perhaps something else entirely?
> > 
> > Any feedback is greatly appreciated!
> > 
> >  src/libvirt_private.syms |   5 ++
> >  src/qemu/qemu_hostdev.c  |  16 +++++
> >  src/qemu/qemu_hostdev.h  |   4 ++
> >  src/qemu/qemu_hotplug.c  |  56 ++++++++++++++----
> >  src/util/virfile.c       |  52 +++++++++++++++++
> >  src/util/virfile.h       |   1 +
> >  src/util/virhostdev.c    | 173 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >  src/util/virhostdev.h    |  16 +++++
> >  src/util/virpci.c        |  69 ++++++++++++++++++----
> >  src/util/virpci.h        |   4 ++
> >  10 files changed, 360 insertions(+), 36 deletions(-)
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29 19:44   ` Alex Williamson
@ 2017-06-30  2:27     ` Laine Stump
  0 siblings, 0 replies; 16+ messages in thread
From: Laine Stump @ 2017-06-30  2:27 UTC (permalink / raw)
  To: libvir-list
  Cc: Alex Williamson, Michael Roth, qemu-devel, qemu-ppc, aik,
	abologna, pkrempa, berrange, mprivozn, sbhat

On 06/29/2017 03:44 PM, Alex Williamson wrote:
> On Thu, 29 Jun 2017 14:50:15 -0400
> Laine Stump <laine@redhat.com> wrote:
> 
>> On 06/28/2017 08:24 PM, Michael Roth wrote:
>>> Hi everyone. Hoping to get some feedback on this approach, or some
>>> alternatives proposed below, to the following issue:
>>>
>>> Currently libvirt immediately attempts to rebind a managed device back to the
>>> host driver when it receives a DEVICE_DELETED event from QEMU. This is
>>> problematic for 2 reasons:
>>>
>>> 1) If multiple devices from a group are attached to a guest, this can move
>>>    the group into a "non-viable" state where some devices are assigned to
>>>    the host and some to the guest.  
>>
>> Since we don't support hotplug with managed='yes' of individual (or
>> multiple) functions of a multifunction host device, I don't know that
>> it's very useful to support hot *un*plug of it - it would only be useful
>> if the multi-function device were present in the guest when it was
>> started, and then was hot-unplugged later. And this is all a lot of
>> extra complexity, though, so it would be useful to know what are the
>> scenarios where it would actually be used (i.e. is this a legitimate
>> need, or just an interesting exercise?)
> 
> This doesn't make sense to me, since when do we not support hotplug
> with managed='yes' and how is it prevented?

Wait a minute. I wasn't thinking straight, got sidetracked into a
different problem, and completely screwed up what I was trying to say.

So basically forget everything in that paragraph :-/

What I started out to say was simply that we didn't support
managed='yes' hotplugging of a single device that's in an iommu group
containing multiple devices (and that led me to wander into the topic of
multifunction devices, which as you say is really unrelated). We don't
do anything specifically to prevent it, but as you say, it will fail
when it gets to qemu.

At one time someone talked about having libvirt automatically detach the
other devices in the same group at times like this, but that was
rightfully detected as ridiculous (it may have been me talking to
myself, and I realized the stupidity of the idea before I managed to
repeat it to anyone else).

>  Also, let's just not talk
> about multifunction, a multifunction device does not imply a shared
> group, nor does a shared group imply multifunction. 

Yeah, that was me being distracted while thinking about the situation
and mistakenly conflating two different problems.

> So is it hotplug
> of a device which is in a shared group that is not supported, and if so
> how?

"not supported" != "actively prevented", it just means "there's no way
to successfully do that"

>  I think libvirt tries to do the hot-add, but it hits the
> non-viable group when it gives it to QEMU.

Correct.


> On hot-remove, I'm pretty
> sure libvirt just lets the host crash into the ground by re-binding the
> device to the host driver.

I haven't tried, but you're likely right.

>  If we don't want to support it, that's one
> thing, but the current model is more just neglectful than unsupported.

Semantics. Potato, Potahto. The end result is "it doesn't work". If
we're going to try to fix one, we should try to fix the other too.

> 
> 
>>> 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
>>>    where additional cleanup occurs. In most cases libvirt can ignore this
>>>    cleanup, but in the case of VFIO devices this is where closing of a VFIO
>>>    group FD occurs, and failing to wait before rebinding the device to the
>>>    host driver can result in unexpected behavior. In the case of powernv
>>>    hosts at least, this can lead to a host driver crashing due to the default
>>>    DMA windows not having been fully-restored yet. The window between this is
>>>    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
>>>    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
>>>    this period (on powernv hosts).  
>>
>> I agree with Dan that the situation described here should be considered
>> a qemu bug - according to my understanding from back at the time
>> DEVICE_DELETED was added to qemu (I think at libvirt's request), qemu
>> should never emit the DEVICE_DELETED event until *everything* related to
>> the device is finished - that was the whole point of adding the event in
>> the first place. Covering up this bug with a bunch of extra libvirt
>> complexity is just creating the potential for even more bugs in the more
>> complex code.
> 
> Agree, but ISTR not everyone thinks that way.  I don't remember the
> opposing viewpoint though.

Possible. I only retain about 1% of what goes by the screen, and even
most of that is gone within 6 months. I thought I recalled that there
was no DEVICE_DELETED event before libvirt requested it, so it would
make sense for it to behave according to what libvirt needs (assuming
nobody else is using it). That may be a narcissistic and incorrect view
though.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
  2017-06-29 20:59   ` Michael Roth
@ 2017-06-30  6:59     ` Andrea Bolognani
  0 siblings, 0 replies; 16+ messages in thread
From: Andrea Bolognani @ 2017-06-30  6:59 UTC (permalink / raw)
  To: Michael Roth, Laine Stump, libvir-list
  Cc: qemu-devel, qemu-ppc, aik, alex.williamson, pkrempa, berrange,
	mprivozn, sbhat

On Thu, 2017-06-29 at 15:59 -0500, Michael Roth wrote:
> > > Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> > > until all the devices in the group have been detached, at which point all
> > > the hostdevs are rebound as a group. Until that point, the devices are traced
> > > by the drvManager's inactiveList in a similar manner to hostdevs that are
> > > assigned to VFIO via the nodedev-detach interface.
> > 
> > What happens if libvirtd is restarted during this period? Is the
> > inactiveList rebuilt with all the info necessary to assure that the
> > nodedev-reattach does/doesn't happen (as appropriate) for all devices?
> 
> Hmm, good question.
> 
> The Unbindable() check only needs to know that nothing in the activeList
> belongs to the group we're checking, and that list at least seems to get
> rebuilt appropriately on restart.
> 
> But the Unbind() relies on inactiveList and the behavior there is what
> nodedev-detach currently does, which is to add it to inactive list while
> libvirtd is running, and then just forget about it when libvirtd restarts.
> For nodedev-detach it's fine since virHostdevPreparePCIDevices() re-adds
> it as needed in the device-attach path. But yah, for this purpose it ends
> up losing track of hostdevs that are still pending rebind to the host, and
> that means some devices may not get rebound at the appropriate time if
> there was a libvirtd restart.
> 
> Unlike with device-attach, we can't just re-add it on-demand because we
> actually need to know whether or not it was previously in the list. So
> I think we'd need to add some persistent state to track this. Will look
> into adding handling for that.

FWIW last time I tried to attack this issue[1] I got pretty
much as far as this, e.g. the code worked as intended but
restarting libvirtd would result in an inconsistent state
which prevented you from performing some operations.

Unfortunately I got sidetracked by other work and stopped
just short of implementing a way to persist the relevant
information on disk :(


[1] ~1.5 years ago, according to git log
-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-06-30  7:00 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-29  0:24 [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Michael Roth
2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 1/5] virhostdev: factor release out from reattach and export it for use later Michael Roth
2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 2/5] qemu_hotplug: squash qemuDomainRemovePCIHostDevice into caller Michael Roth
2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 3/5] virpci: introduce virPCIIOMMUGroupIterate() Michael Roth
2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 4/5] qemu: hotplug: unbind VFIO devices as a group Michael Roth
2017-06-29  0:25 ` [Qemu-devel] [RFC PATCH 5/5] qemu: hotplug: wait for VFIO group FD close before unbind Michael Roth
2017-06-29  8:33 ` [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Daniel P. Berrange
2017-06-29 18:22   ` Michael Roth
2017-06-29 19:31     ` Alex Williamson
2017-06-29 19:28   ` Alex Williamson
2017-06-29 20:21     ` Michael Roth
2017-06-29 18:50 ` Laine Stump
2017-06-29 19:44   ` Alex Williamson
2017-06-30  2:27     ` Laine Stump
2017-06-29 20:59   ` Michael Roth
2017-06-30  6:59     ` Andrea Bolognani
