All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest
@ 2016-05-18  3:30 Zhou Jie
  2016-05-18  3:30 ` [Qemu-devel] [PATCH 01/12] vfio: extract vfio_get_hot_reset_info as a single function Zhou Jie
                   ` (11 more replies)
  0 siblings, 12 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:30 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

v6-v7:
   1. Stall any access to the device until resume is signaled.
      Do guest directed bus reset after receive resume notification.
      If don't get the resume notification after some timeout unplug the device.
   2. fix the patches 11/12 as Alex sugguestion.

v5-v6:
   1. register resume handler both qemu and kernel to ensure the
      reset in order.
   2. fix the patches 6/12, 7/12 as Alex and MST sugguestion.

v4-v5:
   1. add back the common function 0 hotplug code in pci core.
   2. fix a sporadic device stuck on D3 problem when doing aer recovery.
   3. fix patches 5/12 ~ 9/12 as Alex sugguestion.

v3-v4:
   1. rebase patchset to fit latest master branch.
   2. modifying patches 5/10 ~ 8/10 as Alex sugguestion(Thanks).

v2-v3:
   1. fix patch 4/9, 5/9 as Alex sugguestion.
   2. patches 5/9 ~ 8/9 are made to force limiting that all vfio functions
      are combined in the same way as on the host.

v1-v2:
   1. limit all devices on same bus in guest are on same bus in host in patch 

For now, for vfio pci passthough devices, when qemu receives
an error from host aer report, it just terminate the guest.
But usually user want to know what error occurred. So these patches add
aer capability support for vfio devices, and pass the error to guest.
Then let guest driver to recover from the error.

notes: this series patches enable aer support single/multi-function,
for multi-function, require all the function of the slot assigned to
VM and on the same slot.

Chen Fan (12):
  vfio: extract vfio_get_hot_reset_info as a single function
  vfio: squeeze out vfio_pci_do_hot_reset for support bus reset
  vfio: add pcie extended capability support
  vfio: add aer support for vfio device
  vfio: refine function vfio_pci_host_match
  vfio: add check host bus reset is support or not
  pci: add a pci_function_is_valid callback to check function if valid
  vfio: add check aer functionality for hotplug device
  vfio: vote the function 0 to do host bus reset when aer occurred
  vfio-pci: pass the aer error to guest
  vfio: register aer resume notification handler for aer resume
  vfio: add 'aer' property to expose aercap

 hw/pci/pci.c               |  32 ++
 hw/vfio/pci.c              | 826 +++++++++++++++++++++++++++++++++++++++++----
 hw/vfio/pci.h              |  10 +
 include/hw/pci/pci.h       |   1 +
 linux-headers/linux/vfio.h |   1 +
 5 files changed, 804 insertions(+), 66 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 01/12] vfio: extract vfio_get_hot_reset_info as a single function
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
@ 2016-05-18  3:30 ` Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 02/12] vfio: squeeze out vfio_pci_do_hot_reset for support bus reset Zhou Jie
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:30 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

The function is used to get affected devices by bus reset.
So here extract it, and can used for aer soon.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 66 +++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 48 insertions(+), 18 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d091d8c..cf40f9e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1701,6 +1701,51 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
     }
 }
 
+/*
+ * return negative with errno, return 0 on success.
+ * if success, the point of ret_info fill with the affected device reset info.
+ *
+ */
+static int vfio_get_hot_reset_info(VFIOPCIDevice *vdev,
+                                   struct vfio_pci_hot_reset_info **ret_info)
+{
+    struct vfio_pci_hot_reset_info *info;
+    int ret, count;
+
+    *ret_info = NULL;
+
+    info = g_malloc0(sizeof(*info));
+    info->argsz = sizeof(*info);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
+    if (ret && errno != ENOSPC) {
+        ret = -errno;
+        goto error;
+    }
+
+    count = info->count;
+
+    info = g_realloc(info, sizeof(*info) +
+                     (count * sizeof(struct vfio_pci_dependent_device)));
+    info->argsz = sizeof(*info) +
+                  (count * sizeof(struct vfio_pci_dependent_device));
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
+    if (ret) {
+        ret = -errno;
+        error_report("vfio: hot reset info failed: %m");
+        goto error;
+    }
+
+    *ret_info = info;
+    info = NULL;
+
+    return 0;
+error:
+    g_free(info);
+    return ret;
+}
+
 static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
 {
     PCIDevice *pdev = &vdev->pdev;
@@ -1842,7 +1887,7 @@ static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name)
 static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
 {
     VFIOGroup *group;
-    struct vfio_pci_hot_reset_info *info;
+    struct vfio_pci_hot_reset_info *info = NULL;
     struct vfio_pci_dependent_device *devices;
     struct vfio_pci_hot_reset *reset;
     int32_t *fds;
@@ -1854,12 +1899,8 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     vfio_pci_pre_reset(vdev);
     vdev->vbasedev.needs_reset = false;
 
-    info = g_malloc0(sizeof(*info));
-    info->argsz = sizeof(*info);
-
-    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
-    if (ret && errno != ENOSPC) {
-        ret = -errno;
+    ret = vfio_get_hot_reset_info(vdev, &info);
+    if (ret) {
         if (!vdev->has_pm_reset) {
             error_report("vfio: Cannot reset device %s, "
                          "no available reset mechanism.", vdev->vbasedev.name);
@@ -1867,18 +1908,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
         goto out_single;
     }
 
-    count = info->count;
-    info = g_realloc(info, sizeof(*info) + (count * sizeof(*devices)));
-    info->argsz = sizeof(*info) + (count * sizeof(*devices));
     devices = &info->devices[0];
-
-    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
-    if (ret) {
-        ret = -errno;
-        error_report("vfio: hot reset info failed: %m");
-        goto out_single;
-    }
-
     trace_vfio_pci_hot_reset_has_dep_devices(vdev->vbasedev.name);
 
     /* Verify that we have all the groups required */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 02/12] vfio: squeeze out vfio_pci_do_hot_reset for support bus reset
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
  2016-05-18  3:30 ` [Qemu-devel] [PATCH 01/12] vfio: extract vfio_get_hot_reset_info as a single function Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 03/12] vfio: add pcie extended capability support Zhou Jie
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

Squeeze out vfio_pci_do_hot_reset to do host bus reset when AER recovery.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 75 +++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 31 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index cf40f9e..1ad47ef 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1746,6 +1746,48 @@ error:
     return ret;
 }
 
+static int vfio_pci_do_hot_reset(VFIOPCIDevice *vdev,
+                                 struct vfio_pci_hot_reset_info *info)
+{
+    VFIOGroup *group;
+    struct vfio_pci_hot_reset *reset;
+    int32_t *fds;
+    int ret, i, count;
+    struct vfio_pci_dependent_device *devices;
+
+    /* Determine how many group fds need to be passed */
+    count = 0;
+    devices = &info->devices[0];
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        for (i = 0; i < info->count; i++) {
+            if (group->groupid == devices[i].group_id) {
+                count++;
+                break;
+            }
+        }
+    }
+
+    reset = g_malloc0(sizeof(*reset) + (count * sizeof(*fds)));
+    reset->argsz = sizeof(*reset) + (count * sizeof(*fds));
+    fds = &reset->group_fds[0];
+
+    /* Fill in group fds */
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        for (i = 0; i < info->count; i++) {
+            if (group->groupid == devices[i].group_id) {
+                fds[reset->count++] = group->fd;
+                break;
+            }
+        }
+    }
+
+    /* Bus reset! */
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_PCI_HOT_RESET, reset);
+    g_free(reset);
+
+    return ret;
+}
+
 static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
 {
     PCIDevice *pdev = &vdev->pdev;
@@ -1889,9 +1931,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     VFIOGroup *group;
     struct vfio_pci_hot_reset_info *info = NULL;
     struct vfio_pci_dependent_device *devices;
-    struct vfio_pci_hot_reset *reset;
-    int32_t *fds;
-    int ret, i, count;
+    int ret, i;
     bool multi = false;
 
     trace_vfio_pci_hot_reset(vdev->vbasedev.name, single ? "one" : "multi");
@@ -1969,34 +2009,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
         goto out_single;
     }
 
-    /* Determine how many group fds need to be passed */
-    count = 0;
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        for (i = 0; i < info->count; i++) {
-            if (group->groupid == devices[i].group_id) {
-                count++;
-                break;
-            }
-        }
-    }
-
-    reset = g_malloc0(sizeof(*reset) + (count * sizeof(*fds)));
-    reset->argsz = sizeof(*reset) + (count * sizeof(*fds));
-    fds = &reset->group_fds[0];
-
-    /* Fill in group fds */
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        for (i = 0; i < info->count; i++) {
-            if (group->groupid == devices[i].group_id) {
-                fds[reset->count++] = group->fd;
-                break;
-            }
-        }
-    }
-
-    /* Bus reset! */
-    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_PCI_HOT_RESET, reset);
-    g_free(reset);
+    ret = vfio_pci_do_hot_reset(vdev, info);
 
     trace_vfio_pci_hot_reset_result(vdev->vbasedev.name,
                                     ret ? "%m" : "Success");
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 03/12] vfio: add pcie extended capability support
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
  2016-05-18  3:30 ` [Qemu-devel] [PATCH 01/12] vfio: extract vfio_get_hot_reset_info as a single function Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 02/12] vfio: squeeze out vfio_pci_do_hot_reset for support bus reset Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-06-28 20:04   ` Laszlo Ersek
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 04/12] vfio: add aer support for vfio device Zhou Jie
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

For vfio pcie device, we could expose the extended capability on
PCIE bus. due to add a new pcie capability at the tail of the chain,
in order to avoid config space overwritten, we introduce a copy config
for parsing extended caps. and rebuild the pcie extended config space.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 71 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1ad47ef..f697853 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1528,6 +1528,21 @@ static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
     return next - pos;
 }
 
+
+static uint16_t vfio_ext_cap_max_size(const uint8_t *config, uint16_t pos)
+{
+    uint16_t tmp, next = PCIE_CONFIG_SPACE_SIZE;
+
+    for (tmp = PCI_CONFIG_SPACE_SIZE; tmp;
+        tmp = PCI_EXT_CAP_NEXT(pci_get_long(config + tmp))) {
+        if (tmp > pos && tmp < next) {
+            next = tmp;
+        }
+    }
+
+    return next - pos;
+}
+
 static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
 {
     pci_set_word(buf, (pci_get_word(buf) & ~mask) | val);
@@ -1862,16 +1877,71 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
     return 0;
 }
 
+static int vfio_add_ext_cap(VFIOPCIDevice *vdev)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t header;
+    uint16_t cap_id, next, size;
+    uint8_t cap_ver;
+    uint8_t *config;
+
+    /*
+     * pcie_add_capability always inserts the new capability at the tail
+     * of the chain.  Therefore to end up with a chain that matches the
+     * physical device, we cache the config space to avoid overwriting
+     * the original config space when we parse the extended capabilities.
+     */
+    config = g_memdup(pdev->config, vdev->config_size);
+
+    for (next = PCI_CONFIG_SPACE_SIZE; next;
+         next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
+        header = pci_get_long(config + next);
+        cap_id = PCI_EXT_CAP_ID(header);
+        cap_ver = PCI_EXT_CAP_VER(header);
+
+        /*
+         * If it becomes important to configure extended capabilities to their
+         * actual size, use this as the default when it's something we don't
+         * recognize. Since QEMU doesn't actually handle many of the config
+         * accesses, exact size doesn't seem worthwhile.
+         */
+        size = vfio_ext_cap_max_size(config, next);
+
+        pcie_add_capability(pdev, cap_id, cap_ver, next, size);
+        pci_set_long(pdev->config + next, PCI_EXT_CAP(cap_id, cap_ver, 0));
+
+        /* Use emulated next pointer to allow dropping extended caps */
+        pci_long_test_and_set_mask(vdev->emulated_config_bits + next,
+                                   PCI_EXT_CAP_NEXT_MASK);
+    }
+
+    g_free(config);
+    return 0;
+}
+
 static int vfio_add_capabilities(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
+    int ret;
 
     if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST) ||
         !pdev->config[PCI_CAPABILITY_LIST]) {
         return 0; /* Nothing to add */
     }
 
-    return vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
+    ret = vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
+    if (ret) {
+        return ret;
+    }
+
+    /* on PCI bus, it doesn't make sense to expose extended capabilities. */
+    if (!pci_is_express(pdev) ||
+        !pci_bus_is_express(pdev->bus) ||
+        !pci_get_long(pdev->config + PCI_CONFIG_SPACE_SIZE)) {
+        return 0;
+    }
+
+    return vfio_add_ext_cap(vdev);
 }
 
 static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 04/12] vfio: add aer support for vfio device
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (2 preceding siblings ...)
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 03/12] vfio: add pcie extended capability support Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 05/12] vfio: refine function vfio_pci_host_match Zhou Jie
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

Calling pcie_aer_init to initilize aer related registers for
vfio device, then reload physical related registers to expose
device capability.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/vfio/pci.h |  3 +++
 2 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index f697853..0516d94 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1877,6 +1877,66 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
     return 0;
 }
 
+static int vfio_setup_aer(VFIOPCIDevice *vdev, uint8_t cap_ver,
+                          int pos, uint16_t size)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    PCIDevice *dev_iter;
+    uint8_t type;
+    uint32_t errcap;
+
+    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER)) {
+        pcie_add_capability(pdev, PCI_EXT_CAP_ID_ERR,
+                            cap_ver, pos, size);
+        return 0;
+    }
+
+    dev_iter = pci_bridge_get_device(pdev->bus);
+    if (!dev_iter) {
+        goto error;
+    }
+
+    while (dev_iter) {
+        if (!pci_is_express(dev_iter)) {
+            goto error;
+        }
+
+        type = pcie_cap_get_type(dev_iter);
+        if ((type != PCI_EXP_TYPE_ROOT_PORT &&
+             type != PCI_EXP_TYPE_UPSTREAM &&
+             type != PCI_EXP_TYPE_DOWNSTREAM)) {
+            goto error;
+        }
+
+        if (!dev_iter->exp.aer_cap) {
+            goto error;
+        }
+
+        dev_iter = pci_bridge_get_device(dev_iter->bus);
+    }
+
+    errcap = vfio_pci_read_config(pdev, pos + PCI_ERR_CAP, 4);
+    /*
+     * The ability to record multiple headers is depending on
+     * the state of the Multiple Header Recording Capable bit and
+     * enabled by the Multiple Header Recording Enable bit.
+     */
+    if ((errcap & PCI_ERR_CAP_MHRC) &&
+        (errcap & PCI_ERR_CAP_MHRE)) {
+        pdev->exp.aer_log.log_max = PCIE_AER_LOG_MAX_DEFAULT;
+    } else {
+        pdev->exp.aer_log.log_max = 0;
+    }
+
+    pcie_cap_deverr_init(pdev);
+    return pcie_aer_init(pdev, pos, size);
+
+error:
+    error_report("vfio: Unable to enable AER for device %s, parent bus "
+                 "does not support AER signaling", vdev->vbasedev.name);
+    return -1;
+}
+
 static int vfio_add_ext_cap(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
@@ -1884,6 +1944,7 @@ static int vfio_add_ext_cap(VFIOPCIDevice *vdev)
     uint16_t cap_id, next, size;
     uint8_t cap_ver;
     uint8_t *config;
+    int ret = 0;
 
     /*
      * pcie_add_capability always inserts the new capability at the tail
@@ -1907,7 +1968,19 @@ static int vfio_add_ext_cap(VFIOPCIDevice *vdev)
          */
         size = vfio_ext_cap_max_size(config, next);
 
-        pcie_add_capability(pdev, cap_id, cap_ver, next, size);
+        switch (cap_id) {
+        case PCI_EXT_CAP_ID_ERR:
+            ret = vfio_setup_aer(vdev, cap_ver, next, size);
+            break;
+        default:
+            pcie_add_capability(pdev, cap_id, cap_ver, next, size);
+            break;
+        }
+
+        if (ret) {
+            goto out;
+        }
+
         pci_set_long(pdev->config + next, PCI_EXT_CAP(cap_id, cap_ver, 0));
 
         /* Use emulated next pointer to allow dropping extended caps */
@@ -1915,8 +1988,9 @@ static int vfio_add_ext_cap(VFIOPCIDevice *vdev)
                                    PCI_EXT_CAP_NEXT_MASK);
     }
 
+out:
     g_free(config);
-    return 0;
+    return ret;
 }
 
 static int vfio_add_capabilities(VFIOPCIDevice *vdev)
@@ -2673,6 +2747,11 @@ static int vfio_initfn(PCIDevice *pdev)
         goto out_teardown;
     }
 
+    if ((vdev->features & VFIO_FEATURE_ENABLE_AER) &&
+        !pdev->exp.aer_cap) {
+        goto out_teardown;
+    }
+
     /* QEMU emulates all of MSI & MSIX */
     if (pdev->cap_present & QEMU_PCI_CAP_MSIX) {
         memset(vdev->emulated_config_bits + pdev->msix_cap, 0xff,
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 3976f68..7b3924e 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -15,6 +15,7 @@
 #include "qemu-common.h"
 #include "exec/memory.h"
 #include "hw/pci/pci.h"
+#include "hw/pci/pci_bridge.h"
 #include "hw/vfio/vfio-common.h"
 #include "qemu/event_notifier.h"
 #include "qemu/queue.h"
@@ -128,6 +129,8 @@ typedef struct VFIOPCIDevice {
 #define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
 #define VFIO_FEATURE_ENABLE_REQ_BIT 1
 #define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
+#define VFIO_FEATURE_ENABLE_AER_BIT 2
+#define VFIO_FEATURE_ENABLE_AER (1 << VFIO_FEATURE_ENABLE_AER_BIT)
     int32_t bootindex;
     uint8_t pm_cap;
     bool has_vga;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 05/12] vfio: refine function vfio_pci_host_match
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (3 preceding siblings ...)
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 04/12] vfio: add aer support for vfio device Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 06/12] vfio: add check host bus reset is support or not Zhou Jie
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0516d94..5b23a86 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2060,14 +2060,27 @@ static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
     vfio_intx_enable(vdev);
 }
 
+static int vfio_pci_name_to_addr(const char *name, PCIHostDeviceAddress *addr)
+{
+    if (strlen(name) != 12 ||
+        sscanf(name, "%04x:%02x:%02x.%1x", &addr->domain,
+               &addr->bus, &addr->slot, &addr->function) != 4) {
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
 static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name)
 {
-    char tmp[13];
+    PCIHostDeviceAddress tmp;
 
-    sprintf(tmp, "%04x:%02x:%02x.%1x", addr->domain,
-            addr->bus, addr->slot, addr->function);
+    if (vfio_pci_name_to_addr(name, &tmp)) {
+        return false;
+    }
 
-    return (strcmp(tmp, name) == 0);
+    return (tmp.domain == addr->domain && tmp.bus == addr->bus &&
+            tmp.slot == addr->slot && tmp.function == addr->function);
 }
 
 static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 06/12] vfio: add check host bus reset is support or not
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (4 preceding siblings ...)
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 05/12] vfio: refine function vfio_pci_host_match Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 07/12] pci: add a pci_function_is_valid callback to check function if valid Zhou Jie
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

When assigning a vfio device with AER enabled, we must check whether
the device supports a host bus reset (ie. hot reset) as this may be
used by the guest OS in order to recover the device from an AER
error.  QEMU must therefore have the ability to perform a physical
host bus reset using the existing vfio APIs in response to a virtual
bus reset in the VM.  A physical bus reset affects all of the devices
on the host bus, therefore we place a few simplifying configuration
restriction on the VM:

 - All physical devices affected by a bus reset must be assigned to
   the VM with AER enabled on each and be configured on the same
   virtual bus in the VM.

 - No devices unaffected by the bus reset, be they physical, emulated,
   or paravirtual may be configured on the same virtual bus as a
   device supporting AER signaling through vfio.

In other words users wishing to enable AER on a multifunction device
need to assign all functions of the device to the same virtual bus
and enable AER support for each device.  The easiest way to
accomplish this is to identity map the physical functions to virtual
functions with multifunction enabled on the virtual device.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 279 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 hw/vfio/pci.h |   1 +
 2 files changed, 257 insertions(+), 23 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5b23a86..832f25e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -19,6 +19,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qapi/error.h"
 #include <linux/vfio.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
@@ -1716,6 +1717,42 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
     }
 }
 
+static int vfio_pci_name_to_addr(const char *name, PCIHostDeviceAddress *addr)
+{
+    if (strlen(name) != 12 ||
+        sscanf(name, "%04x:%02x:%02x.%1x", &addr->domain,
+               &addr->bus, &addr->slot, &addr->function) != 4) {
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name)
+{
+    PCIHostDeviceAddress tmp;
+
+    if (vfio_pci_name_to_addr(name, &tmp)) {
+        return false;
+    }
+
+    return (tmp.domain == addr->domain && tmp.bus == addr->bus &&
+            tmp.slot == addr->slot && tmp.function == addr->function);
+}
+
+static bool vfio_pci_host_match_slot(PCIHostDeviceAddress *addr,
+                                     const char *name)
+{
+    PCIHostDeviceAddress tmp;
+
+    if (vfio_pci_name_to_addr(name, &tmp)) {
+        return false;
+    }
+
+    return (tmp.domain == addr->domain && tmp.bus == addr->bus &&
+            tmp.slot == addr->slot);
+}
+
 /*
  * return negative with errno, return 0 on success.
  * if success, the point of ret_info fill with the affected device reset info.
@@ -1877,6 +1914,200 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
     return 0;
 }
 
+static int vfio_device_range_limit(PCIBus *bus)
+{
+    PCIDevice *br;
+
+    br = pci_bridge_get_device(bus);
+    if (!br ||
+        !pci_is_express(br) ||
+        !(br->exp.exp_cap) ||
+        pcie_cap_is_arifwd_enabled(br)) {
+        return 255;
+    }
+
+    return 8;
+}
+
+static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
+{
+    PCIBus *bus = vdev->pdev.bus;
+    struct vfio_pci_hot_reset_info *info = NULL;
+    struct vfio_pci_dependent_device *devices;
+    VFIOGroup *group;
+    int ret, i, devfn, range_limit;
+
+    ret = vfio_get_hot_reset_info(vdev, &info);
+    if (ret) {
+        error_setg(errp, "vfio: Cannot enable AER for device %s,"
+                   " device does not support hot reset.",
+                   vdev->vbasedev.name);
+        return;
+    }
+
+    /* List all affected devices by bus reset */
+    devices = &info->devices[0];
+
+    /* Verify that we have all the groups required */
+    for (i = 0; i < info->count; i++) {
+        PCIHostDeviceAddress host;
+        VFIOPCIDevice *tmp;
+        VFIODevice *vbasedev_iter;
+        bool found = false;
+
+        host.domain = devices[i].segment;
+        host.bus = devices[i].bus;
+        host.slot = PCI_SLOT(devices[i].devfn);
+        host.function = PCI_FUNC(devices[i].devfn);
+
+        /* Skip the current device */
+        if (vfio_pci_host_match(&host, vdev->vbasedev.name)) {
+            continue;
+        }
+
+        /* Ensure we own the group of the affected device */
+        QLIST_FOREACH(group, &vfio_group_list, next) {
+            if (group->groupid == devices[i].group_id) {
+                break;
+            }
+        }
+
+        if (!group) {
+            error_setg(errp, "vfio: Cannot enable AER for device %s, "
+                       "depends on group %d which is not owned.",
+                       vdev->vbasedev.name, devices[i].group_id);
+            goto out;
+        }
+
+        /* Ensure affected devices for reset on the same bus */
+        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
+            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
+                continue;
+            }
+            tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
+            if (vfio_pci_host_match(&host, tmp->vbasedev.name)) {
+                /*
+                 * AER errors may be broadcast to all functions of a multi-
+                 * function endpoint.  If any of those sibling functions are
+                 * also assigned, they need to have AER enabled or else an
+                 * error may continue to cause a vm_stop condition.  IOW,
+                 * AER setup of this function would be pointless.
+                 */
+                if (vfio_pci_host_match_slot(&host, vdev->vbasedev.name) &&
+                    !(tmp->features & VFIO_FEATURE_ENABLE_AER)) {
+                    error_setg(errp, "vfio: Cannot enable AER for device %s,"
+                               " on same slot the dependent device %s which"
+                               " does not enable AER.",
+                               vdev->vbasedev.name, tmp->vbasedev.name);
+                    goto out;
+                }
+
+                if (tmp->pdev.bus != bus) {
+                    error_setg(errp, "vfio: Cannot enable AER for device %s, "
+                               "the dependent device %s is not on the same bus",
+                               vdev->vbasedev.name, tmp->vbasedev.name);
+                    goto out;
+                }
+                found = true;
+                break;
+            }
+        }
+
+        /* Ensure all affected devices assigned to VM */
+        if (!found) {
+            error_setg(errp, "vfio: Cannot enable AER for device %s, "
+                       "the dependent device %04x:%02x:%02x.%x "
+                       "is not assigned to VM.",
+                       vdev->vbasedev.name, host.domain, host.bus,
+                       host.slot, host.function);
+            goto out;
+        }
+    }
+
+    /*
+     * The above code verified that all devices affected by a bus reset
+     * exist on the same bus in the VM.  To further simplify, we also
+     * require that there are no additional devices beyond those existing on
+     * the VM bus.
+     */
+    range_limit = vfio_device_range_limit(bus);
+    for (devfn = 0; devfn < range_limit; devfn++) {
+        VFIOPCIDevice *tmp;
+        PCIDevice *dev;
+        bool found = false;
+
+        dev = pci_find_device(bus, pci_bus_num(bus),
+                  PCI_DEVFN(PCI_SLOT(vdev->pdev.devfn), devfn));
+
+        if (!dev) {
+            continue;
+        }
+
+        if (!object_dynamic_cast(OBJECT(dev), "vfio-pci")) {
+            error_setg(errp, "vfio: Cannot enable AER for device %s, device"
+                             " %s: slot %d function%d cannot be configured"
+                             " on the same virtual bus",
+                             vdev->vbasedev.name, dev->name,
+                             PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+            goto out;
+        }
+
+        tmp = DO_UPCAST(VFIOPCIDevice, pdev, dev);
+        for (i = 0; i < info->count; i++) {
+            PCIHostDeviceAddress host;
+
+            host.domain = devices[i].segment;
+            host.bus = devices[i].bus;
+            host.slot = PCI_SLOT(devices[i].devfn);
+            host.function = PCI_FUNC(devices[i].devfn);
+
+            if (vfio_pci_host_match(&host, tmp->vbasedev.name)) {
+                found = true;
+                break;
+            }
+        }
+
+        if (!found) {
+            error_setg(errp, "vfio: Cannot enable AER for device %s, affected"
+                             " device %s does not be configured on the same"
+                             " virtual bus",
+                       vdev->vbasedev.name, tmp->vbasedev.name);
+            goto out;
+        }
+    }
+
+out:
+    g_free(info);
+    return;
+}
+
+static void vfio_aer_check_host_bus_reset(Error **errp)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+    VFIOPCIDevice *vdev;
+    Error *local_err = NULL;
+
+    /* Check All vfio-pci devices if have bus reset capability */
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
+                continue;
+            }
+            vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+            if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
+                vfio_check_hot_bus_reset(vdev, &local_err);
+                if (local_err) {
+                    error_propagate(errp, local_err);
+                    return;
+                }
+            }
+        }
+    }
+
+    return;
+}
+
 static int vfio_setup_aer(VFIOPCIDevice *vdev, uint8_t cap_ver,
                           int pos, uint16_t size)
 {
@@ -2060,29 +2291,6 @@ static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
     vfio_intx_enable(vdev);
 }
 
-static int vfio_pci_name_to_addr(const char *name, PCIHostDeviceAddress *addr)
-{
-    if (strlen(name) != 12 ||
-        sscanf(name, "%04x:%02x:%02x.%1x", &addr->domain,
-               &addr->bus, &addr->slot, &addr->function) != 4) {
-        return -EINVAL;
-    }
-
-    return 0;
-}
-
-static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name)
-{
-    PCIHostDeviceAddress tmp;
-
-    if (vfio_pci_name_to_addr(name, &tmp)) {
-        return false;
-    }
-
-    return (tmp.domain == addr->domain && tmp.bus == addr->bus &&
-            tmp.slot == addr->slot && tmp.function == addr->function);
-}
-
 static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
 {
     VFIOGroup *group;
@@ -2589,6 +2797,22 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     vdev->req_enabled = false;
 }
 
+static void vfio_pci_machine_done_notify(Notifier *notifier, void *unused)
+{
+    Error *local_err = NULL;
+
+    vfio_aer_check_host_bus_reset(&local_err);
+    if (local_err) {
+        fprintf(stderr, "%s\n", error_get_pretty(local_err));
+        error_free(local_err);
+        exit(1);
+    }
+}
+
+static Notifier machine_notifier = {
+    .notify = vfio_pci_machine_done_notify,
+};
+
 static int vfio_initfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
@@ -2934,6 +3158,15 @@ static const TypeInfo vfio_pci_dev_info = {
 static void register_vfio_pci_dev_type(void)
 {
     type_register_static(&vfio_pci_dev_info);
+
+    /*
+     * The AER configuration may depend on multiple devices, so we cannot
+     * validate consistency after each device is initialized.  We can only
+     * depend on function initialization order (function 0 last) for hotplug
+     * devices, therefore a machine-init-done notifier is used to validate
+     * the configuration after all cold-plug devices are processed.
+     */
+     qemu_add_machine_init_done_notifier(&machine_notifier);
 }
 
 type_init(register_vfio_pci_dev_type)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 7b3924e..db7c6d5 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -15,6 +15,7 @@
 #include "qemu-common.h"
 #include "exec/memory.h"
 #include "hw/pci/pci.h"
+#include "hw/pci/pci_bus.h"
 #include "hw/pci/pci_bridge.h"
 #include "hw/vfio/vfio-common.h"
 #include "qemu/event_notifier.h"
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 07/12] pci: add a pci_function_is_valid callback to check function if valid
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (5 preceding siblings ...)
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 06/12] vfio: add check host bus reset is support or not Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 08/12] vfio: add check aer functionality for hotplug device Zhou Jie
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

PCI hotplug requires that function 0 is added last to close the
slot.  Since vfio supporting AER, we require that the VM bus
contains the same set of devices as the host bus to support AER,
we can perform an AER validation test whenever a function 0 in
the VM is hot-added.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/pci/pci.c         | 32 ++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index bb605ef..acd48eb 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1841,6 +1841,22 @@ PCIDevice *pci_find_device(PCIBus *bus, int bus_num, uint8_t devfn)
     return bus->devices[devfn];
 }
 
+static void pci_function_is_valid(PCIBus *bus, PCIDevice *d, void *opaque)
+{
+    PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(d);
+    Error **errp = opaque;
+
+    if (*errp) {
+        return;
+    }
+
+    if (!pc->is_valid_func) {
+        return;
+    }
+
+    pc->is_valid_func(d, errp);
+}
+
 static void pci_qdev_realize(DeviceState *qdev, Error **errp)
 {
     PCIDevice *pci_dev = (PCIDevice *)qdev;
@@ -1883,6 +1899,22 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
         pci_qdev_unrealize(DEVICE(pci_dev), NULL);
         return;
     }
+
+    /*
+     *  If the function number is 0, indicate the closure of the slot.
+     *  then we get the chance to check all functions on same device
+     *  if valid.
+     */
+    if (DEVICE(pci_dev)->hotplugged &&
+        pci_get_function_0(pci_dev) == pci_dev) {
+        pci_for_each_device(bus, pci_bus_num(bus),
+                            pci_function_is_valid, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            pci_qdev_unrealize(DEVICE(pci_dev), NULL);
+            return;
+        }
+    }
 }
 
 static void pci_default_realize(PCIDevice *dev, Error **errp)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index ef6ba51..848fa7b 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -197,6 +197,7 @@ typedef struct PCIDeviceClass {
 
     void (*realize)(PCIDevice *dev, Error **errp);
     int (*init)(PCIDevice *dev);/* TODO convert to realize() and remove */
+    void (*is_valid_func)(PCIDevice *dev, Error **errp);
     PCIUnregisterFunc *exit;
     PCIConfigReadFunc *config_read;
     PCIConfigWriteFunc *config_write;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 08/12] vfio: add check aer functionality for hotplug device
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (6 preceding siblings ...)
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 07/12] pci: add a pci_function_is_valid callback to check function if valid Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 09/12] vfio: vote the function 0 to do host bus reset when aer occurred Zhou Jie
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

when function 0 is hot-added, we can check the vfio device
whether support hot bus reset.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 832f25e..109b11b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3087,6 +3087,19 @@ post_reset:
     vfio_pci_post_reset(vdev);
 }
 
+static void vfio_pci_is_valid(PCIDevice *dev, Error **errp)
+{
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, dev);
+    Error *local_err = NULL;
+
+    if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
+        vfio_check_hot_bus_reset(vdev, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+        }
+    }
+}
+
 static void vfio_instance_init(Object *obj)
 {
     PCIDevice *pci_dev = PCI_DEVICE(obj);
@@ -3141,6 +3154,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->init = vfio_initfn;
     pdc->exit = vfio_exitfn;
+    pdc->is_valid_func = vfio_pci_is_valid;
     pdc->config_read = vfio_pci_read_config;
     pdc->config_write = vfio_pci_write_config;
     pdc->is_express = 1; /* We might be */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 09/12] vfio: vote the function 0 to do host bus reset when aer occurred
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (7 preceding siblings ...)
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 08/12] vfio: add check aer functionality for hotplug device Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 10/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

Due to all devices assigned to VM on the same way as host if enable
aer, so we can easily do the hot reset by selecting the function #0
to do the hot reset.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 14 ++++++++++++++
 hw/vfio/pci.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 109b11b..47fd407 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1948,6 +1948,8 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
     /* List all affected devices by bus reset */
     devices = &info->devices[0];
 
+    vdev->single_depend_dev = (info->count == 1);
+
     /* Verify that we have all the groups required */
     for (i = 0; i < info->count; i++) {
         PCIHostDeviceAddress host;
@@ -3058,6 +3060,18 @@ static void vfio_pci_reset(DeviceState *dev)
 
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
+    if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
+        PCIDevice *br = pci_bridge_get_device(pdev->bus);
+
+        if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
+             PCI_BRIDGE_CTL_BUS_RESET)) {
+            if (pci_get_function_0(pdev) == pdev) {
+                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+            }
+            return;
+        }
+    }
+
     vfio_pci_pre_reset(vdev);
 
     if (vdev->resetfn && !vdev->resetfn(vdev)) {
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index db7c6d5..9fb0206 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -143,6 +143,7 @@ typedef struct VFIOPCIDevice {
     bool no_kvm_intx;
     bool no_kvm_msi;
     bool no_kvm_msix;
+    bool single_depend_dev;
 } VFIOPCIDevice;
 
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 10/12] vfio-pci: pass the aer error to guest
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (8 preceding siblings ...)
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 09/12] vfio: vote the function 0 to do host bus reset when aer occurred Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume Zhou Jie
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 12/12] vfio: add 'aer' property to expose aercap Zhou Jie
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

when the vfio device encounters an uncorrectable error in host,
the vfio_pci driver will signal the eventfd registered by this
vfio device, resulting in the qemu eventfd handler getting
invoked.

this patch is to pass the error to guest and let the guest driver
recover from the error.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 47fd407..6877a3d 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2609,18 +2609,66 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
 static void vfio_err_notifier_handler(void *opaque)
 {
     VFIOPCIDevice *vdev = opaque;
+    PCIDevice *dev = &vdev->pdev;
+    Error *local_err = NULL;
+    PCIEAERMsg msg = {
+        .severity = 0,
+        .source_id = (pci_bus_num(dev->bus) << 8) | dev->devfn,
+    };
 
     if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
         return;
     }
 
+
+    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER)) {
+        goto stop;
+    }
+
+    /*
+     * in case the real hardware configuration has been changed,
+     * here we should recheck the bus reset capability.
+     */
+    vfio_check_hot_bus_reset(vdev, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        goto stop;
+    }
+
+    /*
+     * we should read the error details from the real hardware
+     * configuration spaces, here we only need to do is signaling
+     * to guest an uncorrectable error has occurred.
+     */
+    if (dev->exp.aer_cap) {
+        uint8_t *aer_cap = dev->config + dev->exp.aer_cap;
+        uint32_t uncor_status;
+        bool isfatal;
+
+        uncor_status = vfio_pci_read_config(dev,
+                           dev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
+
+        /*
+         * if the error is not emitted by this device, we can
+         * just ignore it.
+         */
+        if (!(uncor_status & ~0UL)) {
+            return;
+        }
+
+        isfatal = uncor_status & pci_get_long(aer_cap + PCI_ERR_UNCOR_SEVER);
+
+        msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
+                                 PCI_ERR_ROOT_CMD_NONFATAL_EN;
+
+        pcie_aer_msg(dev, &msg);
+        return;
+    }
+
+stop:
     /*
-     * TBD. Retrieve the error details and decide what action
-     * needs to be taken. One of the actions could be to pass
-     * the error to the guest and have the guest driver recover
-     * from the error. This requires that PCIe capabilities be
-     * exposed to the guest. For now, we just terminate the
-     * guest to contain the error.
+     * If the aer capability is not exposed to the guest. we just
+     * terminate the guest to contain the error.
      */
 
     error_report("%s(%s) Unrecoverable error detected. Please collect any data possible and then kill the guest", __func__, vdev->vbasedev.name);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (9 preceding siblings ...)
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 10/12] vfio-pci: pass the aer error to guest Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  2016-05-18 18:26   ` Alex Williamson
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 12/12] vfio: add 'aer' property to expose aercap Zhou Jie
  11 siblings, 1 reply; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson
  Cc: izumi.taku, mst, caoj.fnst, Chen Fan, Zhou Jie

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

For supporting aer recovery, host and guest would run the same aer
recovery code, that would do the secondary bus reset if the error
is fatal, the aer recovery process:
  1. error_detected
  2. reset_link (if fatal)
  3. slot_reset/mmio_enabled
  4. resume

It indicates that host will do secondary bus reset to reset
the physical devices under bus in step 2, that would cause
devices in D3 status in a short time. But in qemu, we register
an error detected handler, that would be invoked as host broadcasts
the error-detected event in step 1, in order to avoid guest do
reset_link when host do reset_link simultaneously. it may cause
fatal error. we introduce a resmue notifier to assure host reset
completely.
In qemu, the aer recovery process:
  1. Detect support for resume notification
     If host vfio driver does not support for resume notification,
     directly fail to boot up VM as with aer enabled.
  2. Immediately notify the VM on error detected.
  3. Stall any access to the device until resume is signaled.
  4. Delay the guest directed bus reset.
  5. Wait for resume notification.
     If we don't get the resume notification from the host after
     some timeout, we would abort the guest directed bus reset
     altogether and unplug of the device to prevent it from further
     interacting with the VM.
  6. After get the resume notification reset bus.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
Signed-off-by: Zhou Jie <zhoujie2011@cn.fujitsu.com>
---
 hw/vfio/pci.c              | 182 ++++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/pci.h              |   5 ++
 linux-headers/linux/vfio.h |   1 +
 3 files changed, 186 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6877a3d..39a9a9f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -35,6 +35,7 @@
 #include "trace.h"
 
 #define MSIX_CAP_LENGTH 12
+#define VFIO_RESET_TIMEOUT 1000
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
@@ -1937,6 +1938,14 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
     VFIOGroup *group;
     int ret, i, devfn, range_limit;
 
+    if (!vdev->vfio_resume_cap) {
+        error_setg(errp, "vfio: Cannot enable AER for device %s,"
+                   " host vfio driver does not support for"
+                   " resume notification",
+                   vdev->vbasedev.name);
+        return;
+    }
+
     ret = vfio_get_hot_reset_info(vdev, &info);
     if (ret) {
         error_setg(errp, "vfio: Cannot enable AER for device %s,"
@@ -2594,6 +2603,23 @@ static int vfio_populate_device(VFIOPCIDevice *vdev)
                      vbasedev->name);
     }
 
+    irq_info.index = VFIO_PCI_RESUME_IRQ_INDEX;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+    if (ret) {
+        /* This can fail for an old kernel or legacy PCI dev */
+        trace_vfio_populate_device_get_irq_info_failure();
+        ret = 0;
+    } else if (irq_info.count == 1) {
+        vdev->vfio_resume_cap = true;
+    } else {
+        error_report("vfio: %s "
+                     "Could not enable error recovery for the device,"
+                     " because host vfio driver does not support for"
+                     " resume notification",
+                     vbasedev->name);
+    }
+
 error:
     return ret;
 }
@@ -2661,6 +2687,17 @@ static void vfio_err_notifier_handler(void *opaque)
         msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
                                  PCI_ERR_ROOT_CMD_NONFATAL_EN;
 
+        if (isfatal) {
+            PCIDevice *dev_0 = pci_get_function_0(dev);
+            vdev->vfio_resume_wait = true;
+            vdev->vfio_resume_arrived = false;
+            vfio_mmap_set_enabled(vdev, false);
+            if (dev_0 != dev) {
+                VFIOPCIDevice *vdev_0 = DO_UPCAST(VFIOPCIDevice, pdev, dev_0);
+                vdev_0->vfio_resume_wait = true;
+                vdev_0->vfio_resume_arrived = false;
+            }
+        }
         pcie_aer_msg(dev, &msg);
         return;
     }
@@ -2756,6 +2793,96 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
     event_notifier_cleanup(&vdev->err_notifier);
 }
 
+static void vfio_resume_notifier_handler(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+
+    if (!event_notifier_test_and_clear(&vdev->resume_notifier)) {
+        return;
+    }
+
+    vdev->vfio_resume_arrived = true;
+    if (vdev->reset_timer != NULL) {
+        timer_mod(vdev->reset_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL));
+    }
+}
+
+static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev)
+{
+    int ret;
+    int argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+
+    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER) ||
+        !vdev->vfio_resume_cap) {
+        return;
+    }
+
+    if (event_notifier_init(&vdev->resume_notifier, 0)) {
+        error_report("vfio: Unable to init event notifier for"
+                     " resume notification");
+        vdev->vfio_resume_cap = false;
+        return;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+                     VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&vdev->resume_notifier);
+    qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    if (ret) {
+        error_report("vfio: Failed to set up resume notification");
+        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->resume_notifier);
+        vdev->vfio_resume_cap = false;
+    }
+    g_free(irq_set);
+}
+
+static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev)
+{
+    int argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+    int ret;
+
+    if (!vdev->vfio_resume_cap) {
+        return;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+                     VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+    *pfd = -1;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    if (ret) {
+        error_report("vfio: Failed to de-assign error fd: %m");
+    }
+    g_free(irq_set);
+    qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier),
+                        NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->resume_notifier);
+}
+
 static void vfio_req_notifier_handler(void *opaque)
 {
     VFIOPCIDevice *vdev = opaque;
@@ -3061,6 +3188,7 @@ static int vfio_initfn(PCIDevice *pdev)
     }
 
     vfio_register_err_notifier(vdev);
+    vfio_register_aer_resume_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
 
@@ -3091,6 +3219,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
 
     vfio_unregister_req_notifier(vdev);
+    vfio_unregister_aer_resume_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     vfio_disable_interrupts(vdev);
@@ -3101,6 +3230,34 @@ static void vfio_exitfn(PCIDevice *pdev)
     vfio_bars_exit(vdev);
 }
 
+static void vfio_pci_delayed_reset(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+
+    timer_free(vdev->reset_timer);
+    vdev->reset_timer = NULL;
+
+    if (!vdev->vfio_resume_wait) {
+        return;
+    }
+    vdev->vfio_resume_wait = false;
+
+    if (vdev->vfio_resume_arrived) {
+        vfio_mmap_set_enabled(vdev, true);
+        if (pci_get_function_0(pdev) == pdev) {
+            vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+        }
+    } else {
+        uint32_t uncor_status;
+        uncor_status = vfio_pci_read_config(pdev,
+                           pdev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
+        if (uncor_status & ~0UL) {
+            qdev_unplug(&vdev->pdev.qdev, NULL);
+        }
+    }
+}
+
 static void vfio_pci_reset(DeviceState *dev)
 {
     PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
@@ -3113,13 +3270,34 @@ static void vfio_pci_reset(DeviceState *dev)
 
         if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
              PCI_BRIDGE_CTL_BUS_RESET)) {
-            if (pci_get_function_0(pdev) == pdev) {
-                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+            if (!vdev->vfio_resume_wait) {
+                if (pci_get_function_0(pdev) == pdev) {
+                    vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+                }
+            } else {
+                if (vdev->vfio_resume_arrived) {
+                    vdev->vfio_resume_wait = false;
+                    vfio_mmap_set_enabled(vdev, true);
+                    if (pci_get_function_0(pdev) == pdev) {
+                        vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+                    }
+                } else {
+                    vdev->reset_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                            vfio_pci_delayed_reset, vdev);
+                    timer_mod(vdev->reset_timer,
+                              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)
+                              + VFIO_RESET_TIMEOUT);
+                }
             }
             return;
         }
     }
 
+    if (vdev->vfio_resume_wait) {
+        vdev->vfio_resume_wait = false;
+        vfio_mmap_set_enabled(vdev, true);
+    }
+
     vfio_pci_pre_reset(vdev);
 
     if (vdev->resetfn && !vdev->resetfn(vdev)) {
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 9fb0206..49e28d8 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice {
     VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
+    EventNotifier resume_notifier;
     EventNotifier req_notifier;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
@@ -144,6 +145,10 @@ typedef struct VFIOPCIDevice {
     bool no_kvm_msi;
     bool no_kvm_msix;
     bool single_depend_dev;
+    bool vfio_resume_cap;
+    bool vfio_resume_wait;
+    bool vfio_resume_arrived;
+    QEMUTimer *reset_timer;
 } VFIOPCIDevice;
 
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 759b850..01dfd5d 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -433,6 +433,7 @@ enum {
 	VFIO_PCI_MSIX_IRQ_INDEX,
 	VFIO_PCI_ERR_IRQ_INDEX,
 	VFIO_PCI_REQ_IRQ_INDEX,
+        VFIO_PCI_RESUME_IRQ_INDEX,
 	VFIO_PCI_NUM_IRQS
 };
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 12/12] vfio: add 'aer' property to expose aercap
  2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
                   ` (10 preceding siblings ...)
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume Zhou Jie
@ 2016-05-18  3:31 ` Zhou Jie
  11 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-18  3:31 UTC (permalink / raw)
  To: qemu-devel, alex.williamson; +Cc: izumi.taku, mst, caoj.fnst, Chen Fan

From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>

add 'aer' property to let user able to decide whether expose
the aer capability. by default we should disable aer feature,
because it needs configuration restrictions.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 39a9a9f..bf8fc95 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3369,6 +3369,8 @@ static Property vfio_pci_dev_properties[] = {
                        sub_vendor_id, PCI_ANY_ID),
     DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
                        sub_device_id, PCI_ANY_ID),
+    DEFINE_PROP_BIT("aer", VFIOPCIDevice, features,
+                    VFIO_FEATURE_ENABLE_AER_BIT, false),
     /*
      * TODO - support passed fds... is this necessary?
      * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume Zhou Jie
@ 2016-05-18 18:26   ` Alex Williamson
  2016-05-19  1:49     ` Zhou Jie
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2016-05-18 18:26 UTC (permalink / raw)
  To: Zhou Jie; +Cc: qemu-devel, Chen Fan, izumi.taku, caoj.fnst, mst

On Wed, 18 May 2016 11:31:09 +0800
Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:

> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> 
> For supporting aer recovery, host and guest would run the same aer
> recovery code, that would do the secondary bus reset if the error
> is fatal, the aer recovery process:
>   1. error_detected
>   2. reset_link (if fatal)
>   3. slot_reset/mmio_enabled
>   4. resume
> 
> It indicates that host will do secondary bus reset to reset
> the physical devices under bus in step 2, that would cause
> devices in D3 status in a short time. But in qemu, we register
> an error detected handler, that would be invoked as host broadcasts
> the error-detected event in step 1, in order to avoid guest do
> reset_link when host do reset_link simultaneously. it may cause
> fatal error. we introduce a resmue notifier to assure host reset
> completely.
> In qemu, the aer recovery process:
>   1. Detect support for resume notification
>      If host vfio driver does not support for resume notification,
>      directly fail to boot up VM as with aer enabled.
>   2. Immediately notify the VM on error detected.
>   3. Stall any access to the device until resume is signaled.

The code below doesn't actually do this, mmaps are disabled, but that
just means any device access will use the slow path through QEMU.  That
path is still fully enabled and so is config space access.

>   4. Delay the guest directed bus reset.

It's delayed, but not the way I expected.  The guest goes on executing
and then we do the reset at some point later?  More comments below...

>   5. Wait for resume notification.
>      If we don't get the resume notification from the host after
>      some timeout, we would abort the guest directed bus reset
>      altogether and unplug of the device to prevent it from further
>      interacting with the VM.
>   6. After get the resume notification reset bus.
> 
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> Signed-off-by: Zhou Jie <zhoujie2011@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c              | 182 ++++++++++++++++++++++++++++++++++++++++++++-
>  hw/vfio/pci.h              |   5 ++
>  linux-headers/linux/vfio.h |   1 +
>  3 files changed, 186 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 6877a3d..39a9a9f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -35,6 +35,7 @@
>  #include "trace.h"
>  
>  #define MSIX_CAP_LENGTH 12
> +#define VFIO_RESET_TIMEOUT 1000

It deserves at least a comment as to why this value was chosen.

>  
>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
> @@ -1937,6 +1938,14 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
>      VFIOGroup *group;
>      int ret, i, devfn, range_limit;
>  
> +    if (!vdev->vfio_resume_cap) {
> +        error_setg(errp, "vfio: Cannot enable AER for device %s,"
> +                   " host vfio driver does not support for"
> +                   " resume notification",
> +                   vdev->vbasedev.name);
> +        return;
> +    }
> +
>      ret = vfio_get_hot_reset_info(vdev, &info);
>      if (ret) {
>          error_setg(errp, "vfio: Cannot enable AER for device %s,"
> @@ -2594,6 +2603,23 @@ static int vfio_populate_device(VFIOPCIDevice *vdev)
>                       vbasedev->name);
>      }
>  
> +    irq_info.index = VFIO_PCI_RESUME_IRQ_INDEX;
> +
> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> +    if (ret) {
> +        /* This can fail for an old kernel or legacy PCI dev */
> +        trace_vfio_populate_device_get_irq_info_failure();
> +        ret = 0;
> +    } else if (irq_info.count == 1) {
> +        vdev->vfio_resume_cap = true;

"vfio" in the name is redundant, ERR_IRQ uses pci_aer, so a logical
name might be pci_aer_has_resume.

> +    } else {
> +        error_report("vfio: %s "
> +                     "Could not enable error recovery for the device,"
> +                     " because host vfio driver does not support for"
> +                     " resume notification",
> +                     vbasedev->name);
> +    }

This error_report makes sense for ERR_IRQ because halt-on-AER is setup
transparently, but I don't think it makes sense here.  If the user has
specified to enable AER then it should either work or they should get
an error message.  If they have not specified to enable AER, why does
the user care if there's an inconsistency here?

> +
>  error:
>      return ret;
>  }
> @@ -2661,6 +2687,17 @@ static void vfio_err_notifier_handler(void *opaque)
>          msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>                                   PCI_ERR_ROOT_CMD_NONFATAL_EN;
>  
> +        if (isfatal) {
> +            PCIDevice *dev_0 = pci_get_function_0(dev);
> +            vdev->vfio_resume_wait = true;
> +            vdev->vfio_resume_arrived = false;

Possible names:

pci_aer_error_signaled
pci_aer_resume_signaled

> +            vfio_mmap_set_enabled(vdev, false);
> +            if (dev_0 != dev) {
> +                VFIOPCIDevice *vdev_0 = DO_UPCAST(VFIOPCIDevice, pdev, dev_0);
> +                vdev_0->vfio_resume_wait = true;
> +                vdev_0->vfio_resume_arrived = false;
> +            }

Why is function 0 special here?  Don't we expect that it will also get
an ERR_IRQ?

> +        }
>          pcie_aer_msg(dev, &msg);
>          return;
>      }
> @@ -2756,6 +2793,96 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
>      event_notifier_cleanup(&vdev->err_notifier);
>  }
>  
> +static void vfio_resume_notifier_handler(void *opaque)

Please use "aer" in the name, otherwise resume might refer to
suspend-resume.

> +{
> +    VFIOPCIDevice *vdev = opaque;
> +
> +    if (!event_notifier_test_and_clear(&vdev->resume_notifier)) {
> +        return;
> +    }
> +
> +    vdev->vfio_resume_arrived = true;
> +    if (vdev->reset_timer != NULL) {

pci_aer_reset_blocked_timer

> +        timer_mod(vdev->reset_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL));
> +    }
> +}
> +
> +static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev)
> +{
> +    int ret;
> +    int argsz;
> +    struct vfio_irq_set *irq_set;
> +    int32_t *pfd;
> +
> +    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER) ||
> +        !vdev->vfio_resume_cap) {
> +        return;
> +    }
> +
> +    if (event_notifier_init(&vdev->resume_notifier, 0)) {
> +        error_report("vfio: Unable to init event notifier for"
> +                     " resume notification");
> +        vdev->vfio_resume_cap = false;
> +        return;
> +    }
> +
> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
> +
> +    irq_set = g_malloc0(argsz);
> +    irq_set->argsz = argsz;
> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
> +    irq_set->start = 0;
> +    irq_set->count = 1;
> +    pfd = (int32_t *)&irq_set->data;
> +
> +    *pfd = event_notifier_get_fd(&vdev->resume_notifier);
> +    qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev);
> +
> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +    if (ret) {
> +        error_report("vfio: Failed to set up resume notification");
> +        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
> +        event_notifier_cleanup(&vdev->resume_notifier);
> +        vdev->vfio_resume_cap = false;
> +    }
> +    g_free(irq_set);
> +}
> +
> +static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev)
> +{
> +    int argsz;
> +    struct vfio_irq_set *irq_set;
> +    int32_t *pfd;
> +    int ret;
> +
> +    if (!vdev->vfio_resume_cap) {
> +        return;
> +    }
> +
> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
> +
> +    irq_set = g_malloc0(argsz);
> +    irq_set->argsz = argsz;
> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
> +    irq_set->start = 0;
> +    irq_set->count = 1;
> +    pfd = (int32_t *)&irq_set->data;
> +    *pfd = -1;
> +
> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +    if (ret) {
> +        error_report("vfio: Failed to de-assign error fd: %m");
> +    }
> +    g_free(irq_set);
> +    qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier),
> +                        NULL, NULL, vdev);
> +    event_notifier_cleanup(&vdev->resume_notifier);
> +}
> +
>  static void vfio_req_notifier_handler(void *opaque)
>  {
>      VFIOPCIDevice *vdev = opaque;
> @@ -3061,6 +3188,7 @@ static int vfio_initfn(PCIDevice *pdev)
>      }
>  
>      vfio_register_err_notifier(vdev);
> +    vfio_register_aer_resume_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);
>  
> @@ -3091,6 +3219,7 @@ static void vfio_exitfn(PCIDevice *pdev)
>      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>  
>      vfio_unregister_req_notifier(vdev);
> +    vfio_unregister_aer_resume_notifier(vdev);
>      vfio_unregister_err_notifier(vdev);
>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>      vfio_disable_interrupts(vdev);
> @@ -3101,6 +3230,34 @@ static void vfio_exitfn(PCIDevice *pdev)
>      vfio_bars_exit(vdev);
>  }
>  
> +static void vfio_pci_delayed_reset(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    timer_free(vdev->reset_timer);
> +    vdev->reset_timer = NULL;

Racy, the timer gets freed before set to NULL so the test above could
see it non-NULL as it's being freed, assuming QEMU supports that sort
of concurrency.

> +
> +    if (!vdev->vfio_resume_wait) {
> +        return;
> +    }
> +    vdev->vfio_resume_wait = false;
> +
> +    if (vdev->vfio_resume_arrived) {
> +        vfio_mmap_set_enabled(vdev, true);

How do you know if mmap was enabled when you started?  This could
interfere with other cases where mmaps get disabled.

> +        if (pci_get_function_0(pdev) == pdev) {
> +            vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> +        }
> +    } else {
> +        uint32_t uncor_status;
> +        uncor_status = vfio_pci_read_config(pdev,
> +                           pdev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
> +        if (uncor_status & ~0UL) {
> +            qdev_unplug(&vdev->pdev.qdev, NULL);
> +        }
> +    }
> +}
> +
>  static void vfio_pci_reset(DeviceState *dev)
>  {
>      PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
> @@ -3113,13 +3270,34 @@ static void vfio_pci_reset(DeviceState *dev)
>  
>          if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
>               PCI_BRIDGE_CTL_BUS_RESET)) {
> -            if (pci_get_function_0(pdev) == pdev) {
> -                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> +            if (!vdev->vfio_resume_wait) {
> +                if (pci_get_function_0(pdev) == pdev) {
> +                    vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> +                }
> +            } else {
> +                if (vdev->vfio_resume_arrived) {
> +                    vdev->vfio_resume_wait = false;
> +                    vfio_mmap_set_enabled(vdev, true);

mmap is getting restored in too many places, it should be disabled on
ERR_IRQ and re-enabled on ERR_RESUME, no more.

> +                    if (pci_get_function_0(pdev) == pdev) {
> +                        vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> +                    }
> +                } else {
> +                    vdev->reset_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
> +                                            vfio_pci_delayed_reset, vdev);
> +                    timer_mod(vdev->reset_timer,
> +                              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)
> +                              + VFIO_RESET_TIMEOUT);
> +                }

Wait, so we just set a timer and return pretending the reset occurred
when actually it might occur at some point in the future?  How is that
supposed to work?  I thought the plan was to block here.

>              }
>              return;
>          }
>      }
>  
> +    if (vdev->vfio_resume_wait) {
> +        vdev->vfio_resume_wait = false;
> +        vfio_mmap_set_enabled(vdev, true);
> +    }

Again, these are getting changed in too many places, the state machine
is too complicated.  Thanks,

Alex

> +
>      vfio_pci_pre_reset(vdev);
>  
>      if (vdev->resetfn && !vdev->resetfn(vdev)) {
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 9fb0206..49e28d8 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice {
>      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
>      PCIHostDeviceAddress host;
>      EventNotifier err_notifier;
> +    EventNotifier resume_notifier;
>      EventNotifier req_notifier;
>      int (*resetfn)(struct VFIOPCIDevice *);
>      uint32_t vendor_id;
> @@ -144,6 +145,10 @@ typedef struct VFIOPCIDevice {
>      bool no_kvm_msi;
>      bool no_kvm_msix;
>      bool single_depend_dev;
> +    bool vfio_resume_cap;
> +    bool vfio_resume_wait;
> +    bool vfio_resume_arrived;
> +    QEMUTimer *reset_timer;
>  } VFIOPCIDevice;
>  
>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 759b850..01dfd5d 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -433,6 +433,7 @@ enum {
>  	VFIO_PCI_MSIX_IRQ_INDEX,
>  	VFIO_PCI_ERR_IRQ_INDEX,
>  	VFIO_PCI_REQ_IRQ_INDEX,
> +        VFIO_PCI_RESUME_IRQ_INDEX,
>  	VFIO_PCI_NUM_IRQS
>  };
>  

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume
  2016-05-18 18:26   ` Alex Williamson
@ 2016-05-19  1:49     ` Zhou Jie
  2016-05-19  2:18       ` Alex Williamson
  2016-05-25  6:23       ` Zhou Jie
  0 siblings, 2 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-19  1:49 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, Chen Fan, izumi.taku, caoj.fnst, mst



On 2016/5/19 2:26, Alex Williamson wrote:
> On Wed, 18 May 2016 11:31:09 +0800
> Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
>
>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>
>> For supporting aer recovery, host and guest would run the same aer
>> recovery code, that would do the secondary bus reset if the error
>> is fatal, the aer recovery process:
>>   1. error_detected
>>   2. reset_link (if fatal)
>>   3. slot_reset/mmio_enabled
>>   4. resume
>>
>> It indicates that host will do secondary bus reset to reset
>> the physical devices under bus in step 2, that would cause
>> devices in D3 status in a short time. But in qemu, we register
>> an error detected handler, that would be invoked as host broadcasts
>> the error-detected event in step 1, in order to avoid guest do
>> reset_link when host do reset_link simultaneously. it may cause
>> fatal error. we introduce a resmue notifier to assure host reset
>> completely.
>> In qemu, the aer recovery process:
>>   1. Detect support for resume notification
>>      If host vfio driver does not support for resume notification,
>>      directly fail to boot up VM as with aer enabled.
>>   2. Immediately notify the VM on error detected.
>>   3. Stall any access to the device until resume is signaled.
>
> The code below doesn't actually do this, mmaps are disabled, but that
> just means any device access will use the slow path through QEMU.  That
> path is still fully enabled and so is config space access.
I will modify the code and find other way to
stall any access to the device.

>>   4. Delay the guest directed bus reset.
>
> It's delayed, but not the way I expected.  The guest goes on executing
> and then we do the reset at some point later?  More comments below...
>
>>   5. Wait for resume notification.
>>      If we don't get the resume notification from the host after
>>      some timeout, we would abort the guest directed bus reset
>>      altogether and unplug of the device to prevent it from further
>>      interacting with the VM.
>>   6. After get the resume notification reset bus.
>>
>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>> Signed-off-by: Zhou Jie <zhoujie2011@cn.fujitsu.com>
>> ---
>>  hw/vfio/pci.c              | 182 ++++++++++++++++++++++++++++++++++++++++++++-
>>  hw/vfio/pci.h              |   5 ++
>>  linux-headers/linux/vfio.h |   1 +
>>  3 files changed, 186 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 6877a3d..39a9a9f 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -35,6 +35,7 @@
>>  #include "trace.h"
>>
>>  #define MSIX_CAP_LENGTH 12
>> +#define VFIO_RESET_TIMEOUT 1000
>
> It deserves at least a comment as to why this value was chosen.
OK. I will add a comment here.

>>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>> @@ -1937,6 +1938,14 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
>>      VFIOGroup *group;
>>      int ret, i, devfn, range_limit;
>>
>> +    if (!vdev->vfio_resume_cap) {
>> +        error_setg(errp, "vfio: Cannot enable AER for device %s,"
>> +                   " host vfio driver does not support for"
>> +                   " resume notification",
>> +                   vdev->vbasedev.name);
>> +        return;
>> +    }
>> +
>>      ret = vfio_get_hot_reset_info(vdev, &info);
>>      if (ret) {
>>          error_setg(errp, "vfio: Cannot enable AER for device %s,"
>> @@ -2594,6 +2603,23 @@ static int vfio_populate_device(VFIOPCIDevice *vdev)
>>                       vbasedev->name);
>>      }
>>
>> +    irq_info.index = VFIO_PCI_RESUME_IRQ_INDEX;
>> +
>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
>> +    if (ret) {
>> +        /* This can fail for an old kernel or legacy PCI dev */
>> +        trace_vfio_populate_device_get_irq_info_failure();
>> +        ret = 0;
>> +    } else if (irq_info.count == 1) {
>> +        vdev->vfio_resume_cap = true;
>
> "vfio" in the name is redundant, ERR_IRQ uses pci_aer, so a logical
> name might be pci_aer_has_resume.
OK.

>> +    } else {
>> +        error_report("vfio: %s "
>> +                     "Could not enable error recovery for the device,"
>> +                     " because host vfio driver does not support for"
>> +                     " resume notification",
>> +                     vbasedev->name);
>> +    }
>
> This error_report makes sense for ERR_IRQ because halt-on-AER is setup
> transparently, but I don't think it makes sense here.  If the user has
> specified to enable AER then it should either work or they should get
> an error message.  If they have not specified to enable AER, why does
> the user care if there's an inconsistency here?
OK. I will delete the error report at here.

>> +
>>  error:
>>      return ret;
>>  }
>> @@ -2661,6 +2687,17 @@ static void vfio_err_notifier_handler(void *opaque)
>>          msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>>                                   PCI_ERR_ROOT_CMD_NONFATAL_EN;
>>
>> +        if (isfatal) {
>> +            PCIDevice *dev_0 = pci_get_function_0(dev);
>> +            vdev->vfio_resume_wait = true;
>> +            vdev->vfio_resume_arrived = false;
>
> Possible names:
>
> pci_aer_error_signaled
> pci_aer_resume_signaled
OK.

>> +            vfio_mmap_set_enabled(vdev, false);
>> +            if (dev_0 != dev) {
>> +                VFIOPCIDevice *vdev_0 = DO_UPCAST(VFIOPCIDevice, pdev, dev_0);
>> +                vdev_0->vfio_resume_wait = true;
>> +                vdev_0->vfio_resume_arrived = false;
>> +            }
>
> Why is function 0 special here?  Don't we expect that it will also get
> an ERR_IRQ?
I tested in this condition.
The device is multifunction.
And function 0 is OK, function 1 occured an error.
function 0 got an ERR_IRQ, but returned at following code.
if (!(uncor_status & ~0UL)) {
     return;
}
If don't set the function 0 at here, it will not wait in vfio_pci_reset.

And I also tested the nofatal error.
The aer will send the nofatal error notification to qemu,
but the guest will not invoke vfio_pci_reset.
So, I need know whether the error is fatal,
and then set the pci_aer_error_signaled.

>> +        }
>>          pcie_aer_msg(dev, &msg);
>>          return;
>>      }
>> @@ -2756,6 +2793,96 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
>>      event_notifier_cleanup(&vdev->err_notifier);
>>  }
>>
>> +static void vfio_resume_notifier_handler(void *opaque)
>
> Please use "aer" in the name, otherwise resume might refer to
> suspend-resume.
OK.

>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +
>> +    if (!event_notifier_test_and_clear(&vdev->resume_notifier)) {
>> +        return;
>> +    }
>> +
>> +    vdev->vfio_resume_arrived = true;
>> +    if (vdev->reset_timer != NULL) {
>
> pci_aer_reset_blocked_timer
OK

>> +        timer_mod(vdev->reset_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL));
>> +    }
>> +}
>> +
>> +static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev)
>> +{
>> +    int ret;
>> +    int argsz;
>> +    struct vfio_irq_set *irq_set;
>> +    int32_t *pfd;
>> +
>> +    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER) ||
>> +        !vdev->vfio_resume_cap) {
>> +        return;
>> +    }
>> +
>> +    if (event_notifier_init(&vdev->resume_notifier, 0)) {
>> +        error_report("vfio: Unable to init event notifier for"
>> +                     " resume notification");
>> +        vdev->vfio_resume_cap = false;
>> +        return;
>> +    }
>> +
>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>> +
>> +    irq_set = g_malloc0(argsz);
>> +    irq_set->argsz = argsz;
>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
>> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
>> +    irq_set->start = 0;
>> +    irq_set->count = 1;
>> +    pfd = (int32_t *)&irq_set->data;
>> +
>> +    *pfd = event_notifier_get_fd(&vdev->resume_notifier);
>> +    qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev);
>> +
>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>> +    if (ret) {
>> +        error_report("vfio: Failed to set up resume notification");
>> +        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
>> +        event_notifier_cleanup(&vdev->resume_notifier);
>> +        vdev->vfio_resume_cap = false;
>> +    }
>> +    g_free(irq_set);
>> +}
>> +
>> +static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev)
>> +{
>> +    int argsz;
>> +    struct vfio_irq_set *irq_set;
>> +    int32_t *pfd;
>> +    int ret;
>> +
>> +    if (!vdev->vfio_resume_cap) {
>> +        return;
>> +    }
>> +
>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>> +
>> +    irq_set = g_malloc0(argsz);
>> +    irq_set->argsz = argsz;
>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
>> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
>> +    irq_set->start = 0;
>> +    irq_set->count = 1;
>> +    pfd = (int32_t *)&irq_set->data;
>> +    *pfd = -1;
>> +
>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>> +    if (ret) {
>> +        error_report("vfio: Failed to de-assign error fd: %m");
>> +    }
>> +    g_free(irq_set);
>> +    qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier),
>> +                        NULL, NULL, vdev);
>> +    event_notifier_cleanup(&vdev->resume_notifier);
>> +}
>> +
>>  static void vfio_req_notifier_handler(void *opaque)
>>  {
>>      VFIOPCIDevice *vdev = opaque;
>> @@ -3061,6 +3188,7 @@ static int vfio_initfn(PCIDevice *pdev)
>>      }
>>
>>      vfio_register_err_notifier(vdev);
>> +    vfio_register_aer_resume_notifier(vdev);
>>      vfio_register_req_notifier(vdev);
>>      vfio_setup_resetfn_quirk(vdev);
>>
>> @@ -3091,6 +3219,7 @@ static void vfio_exitfn(PCIDevice *pdev)
>>      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>
>>      vfio_unregister_req_notifier(vdev);
>> +    vfio_unregister_aer_resume_notifier(vdev);
>>      vfio_unregister_err_notifier(vdev);
>>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>>      vfio_disable_interrupts(vdev);
>> @@ -3101,6 +3230,34 @@ static void vfio_exitfn(PCIDevice *pdev)
>>      vfio_bars_exit(vdev);
>>  }
>>
>> +static void vfio_pci_delayed_reset(void *opaque)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    timer_free(vdev->reset_timer);
>> +    vdev->reset_timer = NULL;
>
> Racy, the timer gets freed before set to NULL so the test above could
> see it non-NULL as it's being freed, assuming QEMU supports that sort
> of concurrency.
I will use sem to avoid race condition.
More comments below...

>> +
>> +    if (!vdev->vfio_resume_wait) {
>> +        return;
>> +    }
>> +    vdev->vfio_resume_wait = false;
>> +
>> +    if (vdev->vfio_resume_arrived) {
>> +        vfio_mmap_set_enabled(vdev, true);
>
> How do you know if mmap was enabled when you started?  This could
> interfere with other cases where mmaps get disabled.
Yes, I will modify the code.

>> +        if (pci_get_function_0(pdev) == pdev) {
>> +            vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>> +        }
>> +    } else {
>> +        uint32_t uncor_status;
>> +        uncor_status = vfio_pci_read_config(pdev,
>> +                           pdev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
>> +        if (uncor_status & ~0UL) {
>> +            qdev_unplug(&vdev->pdev.qdev, NULL);
>> +        }
>> +    }
>> +}
>> +
>>  static void vfio_pci_reset(DeviceState *dev)
>>  {
>>      PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
>> @@ -3113,13 +3270,34 @@ static void vfio_pci_reset(DeviceState *dev)
>>
>>          if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
>>               PCI_BRIDGE_CTL_BUS_RESET)) {
>> -            if (pci_get_function_0(pdev) == pdev) {
>> -                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>> +            if (!vdev->vfio_resume_wait) {
>> +                if (pci_get_function_0(pdev) == pdev) {
>> +                    vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>> +                }
>> +            } else {
>> +                if (vdev->vfio_resume_arrived) {
>> +                    vdev->vfio_resume_wait = false;
>> +                    vfio_mmap_set_enabled(vdev, true);
>
> mmap is getting restored in too many places, it should be disabled on
> ERR_IRQ and re-enabled on ERR_RESUME, no more.
OK.

>> +                    if (pci_get_function_0(pdev) == pdev) {
>> +                        vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>> +                    }
>> +                } else {
>> +                    vdev->reset_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
>> +                                            vfio_pci_delayed_reset, vdev);
>> +                    timer_mod(vdev->reset_timer,
>> +                              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)
>> +                              + VFIO_RESET_TIMEOUT);
>> +                }
>
> Wait, so we just set a timer and return pretending the reset occurred
> when actually it might occur at some point in the future?  How is that
> supposed to work?  I thought the plan was to block here.
How about invoke sem_post at vfio_resume_notifier_handler,
and wait here use sem_timedwait?

>>              }
>>              return;
>>          }
>>      }
>>
>> +    if (vdev->vfio_resume_wait) {
>> +        vdev->vfio_resume_wait = false;
>> +        vfio_mmap_set_enabled(vdev, true);
>> +    }
>
> Again, these are getting changed in too many places, the state machine
> is too complicated.  Thanks,
In my test this code will never be invoked.
But I add this code to clear vfio_resume_wait if the guest don't
  reset the bus.

Sincerely,
Zhou Jie


>
> Alex
>
>> +
>>      vfio_pci_pre_reset(vdev);
>>
>>      if (vdev->resetfn && !vdev->resetfn(vdev)) {
>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>> index 9fb0206..49e28d8 100644
>> --- a/hw/vfio/pci.h
>> +++ b/hw/vfio/pci.h
>> @@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice {
>>      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
>>      PCIHostDeviceAddress host;
>>      EventNotifier err_notifier;
>> +    EventNotifier resume_notifier;
>>      EventNotifier req_notifier;
>>      int (*resetfn)(struct VFIOPCIDevice *);
>>      uint32_t vendor_id;
>> @@ -144,6 +145,10 @@ typedef struct VFIOPCIDevice {
>>      bool no_kvm_msi;
>>      bool no_kvm_msix;
>>      bool single_depend_dev;
>> +    bool vfio_resume_cap;
>> +    bool vfio_resume_wait;
>> +    bool vfio_resume_arrived;
>> +    QEMUTimer *reset_timer;
>>  } VFIOPCIDevice;
>>
>>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index 759b850..01dfd5d 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -433,6 +433,7 @@ enum {
>>  	VFIO_PCI_MSIX_IRQ_INDEX,
>>  	VFIO_PCI_ERR_IRQ_INDEX,
>>  	VFIO_PCI_REQ_IRQ_INDEX,
>> +        VFIO_PCI_RESUME_IRQ_INDEX,
>>  	VFIO_PCI_NUM_IRQS
>>  };
>>
>
>
>
> .
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume
  2016-05-19  1:49     ` Zhou Jie
@ 2016-05-19  2:18       ` Alex Williamson
  2016-05-19  2:41         ` Zhou Jie
  2016-05-19  2:41         ` Zhou Jie
  2016-05-25  6:23       ` Zhou Jie
  1 sibling, 2 replies; 21+ messages in thread
From: Alex Williamson @ 2016-05-19  2:18 UTC (permalink / raw)
  To: Zhou Jie; +Cc: qemu-devel, Chen Fan, izumi.taku, caoj.fnst, mst

On Thu, 19 May 2016 09:49:00 +0800
Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:

> On 2016/5/19 2:26, Alex Williamson wrote:
> > On Wed, 18 May 2016 11:31:09 +0800
> > Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
> >  
> >> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >>
> >> For supporting aer recovery, host and guest would run the same aer
> >> recovery code, that would do the secondary bus reset if the error
> >> is fatal, the aer recovery process:
> >>   1. error_detected
> >>   2. reset_link (if fatal)
> >>   3. slot_reset/mmio_enabled
> >>   4. resume
> >>
> >> It indicates that host will do secondary bus reset to reset
> >> the physical devices under bus in step 2, that would cause
> >> devices in D3 status in a short time. But in qemu, we register
> >> an error detected handler, that would be invoked as host broadcasts
> >> the error-detected event in step 1, in order to avoid guest do
> >> reset_link when host do reset_link simultaneously. it may cause
> >> fatal error. we introduce a resmue notifier to assure host reset
> >> completely.
> >> In qemu, the aer recovery process:
> >>   1. Detect support for resume notification
> >>      If host vfio driver does not support for resume notification,
> >>      directly fail to boot up VM as with aer enabled.
> >>   2. Immediately notify the VM on error detected.
> >>   3. Stall any access to the device until resume is signaled.  
> >
> > The code below doesn't actually do this, mmaps are disabled, but that
> > just means any device access will use the slow path through QEMU.  That
> > path is still fully enabled and so is config space access.  
> I will modify the code and find other way to
> stall any access to the device.
> 
> >>   4. Delay the guest directed bus reset.  
> >
> > It's delayed, but not the way I expected.  The guest goes on executing
> > and then we do the reset at some point later?  More comments below...
> >  
> >>   5. Wait for resume notification.
> >>      If we don't get the resume notification from the host after
> >>      some timeout, we would abort the guest directed bus reset
> >>      altogether and unplug of the device to prevent it from further
> >>      interacting with the VM.
> >>   6. After get the resume notification reset bus.
> >>
> >> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >> Signed-off-by: Zhou Jie <zhoujie2011@cn.fujitsu.com>
> >> ---
> >>  hw/vfio/pci.c              | 182 ++++++++++++++++++++++++++++++++++++++++++++-
> >>  hw/vfio/pci.h              |   5 ++
> >>  linux-headers/linux/vfio.h |   1 +
> >>  3 files changed, 186 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 6877a3d..39a9a9f 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -35,6 +35,7 @@
> >>  #include "trace.h"
> >>
> >>  #define MSIX_CAP_LENGTH 12
> >> +#define VFIO_RESET_TIMEOUT 1000  
> >
> > It deserves at least a comment as to why this value was chosen.  
> OK. I will add a comment here.
> 
> >>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> >>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
> >> @@ -1937,6 +1938,14 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
> >>      VFIOGroup *group;
> >>      int ret, i, devfn, range_limit;
> >>
> >> +    if (!vdev->vfio_resume_cap) {
> >> +        error_setg(errp, "vfio: Cannot enable AER for device %s,"
> >> +                   " host vfio driver does not support for"
> >> +                   " resume notification",
> >> +                   vdev->vbasedev.name);
> >> +        return;
> >> +    }
> >> +
> >>      ret = vfio_get_hot_reset_info(vdev, &info);
> >>      if (ret) {
> >>          error_setg(errp, "vfio: Cannot enable AER for device %s,"
> >> @@ -2594,6 +2603,23 @@ static int vfio_populate_device(VFIOPCIDevice *vdev)
> >>                       vbasedev->name);
> >>      }
> >>
> >> +    irq_info.index = VFIO_PCI_RESUME_IRQ_INDEX;
> >> +
> >> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> >> +    if (ret) {
> >> +        /* This can fail for an old kernel or legacy PCI dev */
> >> +        trace_vfio_populate_device_get_irq_info_failure();
> >> +        ret = 0;
> >> +    } else if (irq_info.count == 1) {
> >> +        vdev->vfio_resume_cap = true;  
> >
> > "vfio" in the name is redundant, ERR_IRQ uses pci_aer, so a logical
> > name might be pci_aer_has_resume.  
> OK.
> 
> >> +    } else {
> >> +        error_report("vfio: %s "
> >> +                     "Could not enable error recovery for the device,"
> >> +                     " because host vfio driver does not support for"
> >> +                     " resume notification",
> >> +                     vbasedev->name);
> >> +    }  
> >
> > This error_report makes sense for ERR_IRQ because halt-on-AER is setup
> > transparently, but I don't think it makes sense here.  If the user has
> > specified to enable AER then it should either work or they should get
> > an error message.  If they have not specified to enable AER, why does
> > the user care if there's an inconsistency here?  
> OK. I will delete the error report at here.
> 
> >> +
> >>  error:
> >>      return ret;
> >>  }
> >> @@ -2661,6 +2687,17 @@ static void vfio_err_notifier_handler(void *opaque)
> >>          msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
> >>                                   PCI_ERR_ROOT_CMD_NONFATAL_EN;
> >>
> >> +        if (isfatal) {
> >> +            PCIDevice *dev_0 = pci_get_function_0(dev);
> >> +            vdev->vfio_resume_wait = true;
> >> +            vdev->vfio_resume_arrived = false;  
> >
> > Possible names:
> >
> > pci_aer_error_signaled
> > pci_aer_resume_signaled  
> OK.
> 
> >> +            vfio_mmap_set_enabled(vdev, false);
> >> +            if (dev_0 != dev) {
> >> +                VFIOPCIDevice *vdev_0 = DO_UPCAST(VFIOPCIDevice, pdev, dev_0);
> >> +                vdev_0->vfio_resume_wait = true;
> >> +                vdev_0->vfio_resume_arrived = false;
> >> +            }  
> >
> > Why is function 0 special here?  Don't we expect that it will also get
> > an ERR_IRQ?  
> I tested in this condition.
> The device is multifunction.
> And function 0 is OK, function 1 occured an error.
> function 0 got an ERR_IRQ, but returned at following code.
> if (!(uncor_status & ~0UL)) {
>      return;
> }
> If don't set the function 0 at here, it will not wait in vfio_pci_reset.
> 
> And I also tested the nofatal error.
> The aer will send the nofatal error notification to qemu,
> but the guest will not invoke vfio_pci_reset.
> So, I need know whether the error is fatal,
> and then set the pci_aer_error_signaled.

Do we need to worry about some functions getting a resume while others
don't?  Should the reset stall until all functions that received a
fatal error receive a resume?  Maybe all tracking should be done with
an 8-byte (256-bit to support ARI) atomic bitmap on function 0 and
reset is blocked until the mask of all functions receiving a fatal
error is equal to the mask of all functions receiving a resume.
 
> >> +        }
> >>          pcie_aer_msg(dev, &msg);
> >>          return;
> >>      }
> >> @@ -2756,6 +2793,96 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
> >>      event_notifier_cleanup(&vdev->err_notifier);
> >>  }
> >>
> >> +static void vfio_resume_notifier_handler(void *opaque)  
> >
> > Please use "aer" in the name, otherwise resume might refer to
> > suspend-resume.  
> OK.
> 
> >> +{
> >> +    VFIOPCIDevice *vdev = opaque;
> >> +
> >> +    if (!event_notifier_test_and_clear(&vdev->resume_notifier)) {
> >> +        return;
> >> +    }
> >> +
> >> +    vdev->vfio_resume_arrived = true;
> >> +    if (vdev->reset_timer != NULL) {  
> >
> > pci_aer_reset_blocked_timer  
> OK
> 
> >> +        timer_mod(vdev->reset_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL));
> >> +    }
> >> +}
> >> +
> >> +static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev)
> >> +{
> >> +    int ret;
> >> +    int argsz;
> >> +    struct vfio_irq_set *irq_set;
> >> +    int32_t *pfd;
> >> +
> >> +    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER) ||
> >> +        !vdev->vfio_resume_cap) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (event_notifier_init(&vdev->resume_notifier, 0)) {
> >> +        error_report("vfio: Unable to init event notifier for"
> >> +                     " resume notification");
> >> +        vdev->vfio_resume_cap = false;
> >> +        return;
> >> +    }
> >> +
> >> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
> >> +
> >> +    irq_set = g_malloc0(argsz);
> >> +    irq_set->argsz = argsz;
> >> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> >> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
> >> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
> >> +    irq_set->start = 0;
> >> +    irq_set->count = 1;
> >> +    pfd = (int32_t *)&irq_set->data;
> >> +
> >> +    *pfd = event_notifier_get_fd(&vdev->resume_notifier);
> >> +    qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev);
> >> +
> >> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >> +    if (ret) {
> >> +        error_report("vfio: Failed to set up resume notification");
> >> +        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
> >> +        event_notifier_cleanup(&vdev->resume_notifier);
> >> +        vdev->vfio_resume_cap = false;
> >> +    }
> >> +    g_free(irq_set);
> >> +}
> >> +
> >> +static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev)
> >> +{
> >> +    int argsz;
> >> +    struct vfio_irq_set *irq_set;
> >> +    int32_t *pfd;
> >> +    int ret;
> >> +
> >> +    if (!vdev->vfio_resume_cap) {
> >> +        return;
> >> +    }
> >> +
> >> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
> >> +
> >> +    irq_set = g_malloc0(argsz);
> >> +    irq_set->argsz = argsz;
> >> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> >> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
> >> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
> >> +    irq_set->start = 0;
> >> +    irq_set->count = 1;
> >> +    pfd = (int32_t *)&irq_set->data;
> >> +    *pfd = -1;
> >> +
> >> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >> +    if (ret) {
> >> +        error_report("vfio: Failed to de-assign error fd: %m");
> >> +    }
> >> +    g_free(irq_set);
> >> +    qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier),
> >> +                        NULL, NULL, vdev);
> >> +    event_notifier_cleanup(&vdev->resume_notifier);
> >> +}
> >> +
> >>  static void vfio_req_notifier_handler(void *opaque)
> >>  {
> >>      VFIOPCIDevice *vdev = opaque;
> >> @@ -3061,6 +3188,7 @@ static int vfio_initfn(PCIDevice *pdev)
> >>      }
> >>
> >>      vfio_register_err_notifier(vdev);
> >> +    vfio_register_aer_resume_notifier(vdev);
> >>      vfio_register_req_notifier(vdev);
> >>      vfio_setup_resetfn_quirk(vdev);
> >>
> >> @@ -3091,6 +3219,7 @@ static void vfio_exitfn(PCIDevice *pdev)
> >>      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> >>
> >>      vfio_unregister_req_notifier(vdev);
> >> +    vfio_unregister_aer_resume_notifier(vdev);
> >>      vfio_unregister_err_notifier(vdev);
> >>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
> >>      vfio_disable_interrupts(vdev);
> >> @@ -3101,6 +3230,34 @@ static void vfio_exitfn(PCIDevice *pdev)
> >>      vfio_bars_exit(vdev);
> >>  }
> >>
> >> +static void vfio_pci_delayed_reset(void *opaque)
> >> +{
> >> +    VFIOPCIDevice *vdev = opaque;
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +
> >> +    timer_free(vdev->reset_timer);
> >> +    vdev->reset_timer = NULL;  
> >
> > Racy, the timer gets freed before set to NULL so the test above could
> > see it non-NULL as it's being freed, assuming QEMU supports that sort
> > of concurrency.  
> I will use sem to avoid race condition.
> More comments below...

Seems like you could just reverse the order, cache vdev->reset_timer,
set it to NULL, then call timer_free() on the cached value.  But as I
question below, it'd be more simple to not have a timer.
 
> >> +
> >> +    if (!vdev->vfio_resume_wait) {
> >> +        return;
> >> +    }
> >> +    vdev->vfio_resume_wait = false;
> >> +
> >> +    if (vdev->vfio_resume_arrived) {
> >> +        vfio_mmap_set_enabled(vdev, true);  
> >
> > How do you know if mmap was enabled when you started?  This could
> > interfere with other cases where mmaps get disabled.  
> Yes, I will modify the code.
> 
> >> +        if (pci_get_function_0(pdev) == pdev) {
> >> +            vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> >> +        }
> >> +    } else {
> >> +        uint32_t uncor_status;
> >> +        uncor_status = vfio_pci_read_config(pdev,
> >> +                           pdev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
> >> +        if (uncor_status & ~0UL) {
> >> +            qdev_unplug(&vdev->pdev.qdev, NULL);
> >> +        }
> >> +    }
> >> +}
> >> +
> >>  static void vfio_pci_reset(DeviceState *dev)
> >>  {
> >>      PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
> >> @@ -3113,13 +3270,34 @@ static void vfio_pci_reset(DeviceState *dev)
> >>
> >>          if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
> >>               PCI_BRIDGE_CTL_BUS_RESET)) {
> >> -            if (pci_get_function_0(pdev) == pdev) {
> >> -                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> >> +            if (!vdev->vfio_resume_wait) {
> >> +                if (pci_get_function_0(pdev) == pdev) {
> >> +                    vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> >> +                }
> >> +            } else {
> >> +                if (vdev->vfio_resume_arrived) {
> >> +                    vdev->vfio_resume_wait = false;
> >> +                    vfio_mmap_set_enabled(vdev, true);  
> >
> > mmap is getting restored in too many places, it should be disabled on
> > ERR_IRQ and re-enabled on ERR_RESUME, no more.  
> OK.
> 
> >> +                    if (pci_get_function_0(pdev) == pdev) {
> >> +                        vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> >> +                    }
> >> +                } else {
> >> +                    vdev->reset_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
> >> +                                            vfio_pci_delayed_reset, vdev);
> >> +                    timer_mod(vdev->reset_timer,
> >> +                              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)
> >> +                              + VFIO_RESET_TIMEOUT);
> >> +                }  
> >
> > Wait, so we just set a timer and return pretending the reset occurred
> > when actually it might occur at some point in the future?  How is that
> > supposed to work?  I thought the plan was to block here.  
> How about invoke sem_post at vfio_resume_notifier_handler,
> and wait here use sem_timedwait?

Do we even need a timer?  What if we simply spin?

for (i = 0; i < 1000; i++) {
    if (vdev->pci_aer_resume_signaled) {
        break;
    }
    g_usleep(1000);
}

if (i == 1000) {
    /* We failed */
} else {
    /* Proceed with reset */
}

Does QEMU have enough concurrency to do this?  Thanks,

Alex

> >>              }
> >>              return;
> >>          }
> >>      }
> >>
> >> +    if (vdev->vfio_resume_wait) {
> >> +        vdev->vfio_resume_wait = false;
> >> +        vfio_mmap_set_enabled(vdev, true);
> >> +    }  
> >
> > Again, these are getting changed in too many places, the state machine
> > is too complicated.  Thanks,  
> In my test this code will never be invoked.
> But I add this code to clear vfio_resume_wait if the guest don't
>   reset the bus.
> 
> Sincerely,
> Zhou Jie
> 
> 
> >
> > Alex
> >  
> >> +
> >>      vfio_pci_pre_reset(vdev);
> >>
> >>      if (vdev->resetfn && !vdev->resetfn(vdev)) {
> >> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> >> index 9fb0206..49e28d8 100644
> >> --- a/hw/vfio/pci.h
> >> +++ b/hw/vfio/pci.h
> >> @@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice {
> >>      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
> >>      PCIHostDeviceAddress host;
> >>      EventNotifier err_notifier;
> >> +    EventNotifier resume_notifier;
> >>      EventNotifier req_notifier;
> >>      int (*resetfn)(struct VFIOPCIDevice *);
> >>      uint32_t vendor_id;
> >> @@ -144,6 +145,10 @@ typedef struct VFIOPCIDevice {
> >>      bool no_kvm_msi;
> >>      bool no_kvm_msix;
> >>      bool single_depend_dev;
> >> +    bool vfio_resume_cap;
> >> +    bool vfio_resume_wait;
> >> +    bool vfio_resume_arrived;
> >> +    QEMUTimer *reset_timer;
> >>  } VFIOPCIDevice;
> >>
> >>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 759b850..01dfd5d 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -433,6 +433,7 @@ enum {
> >>  	VFIO_PCI_MSIX_IRQ_INDEX,
> >>  	VFIO_PCI_ERR_IRQ_INDEX,
> >>  	VFIO_PCI_REQ_IRQ_INDEX,
> >> +        VFIO_PCI_RESUME_IRQ_INDEX,
> >>  	VFIO_PCI_NUM_IRQS
> >>  };
> >>  
> >
> >
> >
> > .
> >  
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume
  2016-05-19  2:18       ` Alex Williamson
@ 2016-05-19  2:41         ` Zhou Jie
  2016-05-19  2:41         ` Zhou Jie
  1 sibling, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-19  2:41 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, Chen Fan, izumi.taku, caoj.fnst, mst



On 2016/5/19 10:18, Alex Williamson wrote:
> On Thu, 19 May 2016 09:49:00 +0800
> Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
>
>> On 2016/5/19 2:26, Alex Williamson wrote:
>>> On Wed, 18 May 2016 11:31:09 +0800
>>> Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
>>>
>>>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>
>>>> For supporting aer recovery, host and guest would run the same aer
>>>> recovery code, that would do the secondary bus reset if the error
>>>> is fatal, the aer recovery process:
>>>>   1. error_detected
>>>>   2. reset_link (if fatal)
>>>>   3. slot_reset/mmio_enabled
>>>>   4. resume
>>>>
>>>> It indicates that host will do secondary bus reset to reset
>>>> the physical devices under bus in step 2, that would cause
>>>> devices in D3 status in a short time. But in qemu, we register
>>>> an error detected handler, that would be invoked as host broadcasts
>>>> the error-detected event in step 1, in order to avoid guest do
>>>> reset_link when host do reset_link simultaneously. it may cause
>>>> fatal error. we introduce a resmue notifier to assure host reset
>>>> completely.
>>>> In qemu, the aer recovery process:
>>>>   1. Detect support for resume notification
>>>>      If host vfio driver does not support for resume notification,
>>>>      directly fail to boot up VM as with aer enabled.
>>>>   2. Immediately notify the VM on error detected.
>>>>   3. Stall any access to the device until resume is signaled.
>>>
>>> The code below doesn't actually do this, mmaps are disabled, but that
>>> just means any device access will use the slow path through QEMU.  That
>>> path is still fully enabled and so is config space access.
>> I will modify the code and find other way to
>> stall any access to the device.
>>
>>>>   4. Delay the guest directed bus reset.
>>>
>>> It's delayed, but not the way I expected.  The guest goes on executing
>>> and then we do the reset at some point later?  More comments below...
>>>
>>>>   5. Wait for resume notification.
>>>>      If we don't get the resume notification from the host after
>>>>      some timeout, we would abort the guest directed bus reset
>>>>      altogether and unplug of the device to prevent it from further
>>>>      interacting with the VM.
>>>>   6. After get the resume notification reset bus.
>>>>
>>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>> Signed-off-by: Zhou Jie <zhoujie2011@cn.fujitsu.com>
>>>> ---
>>>>  hw/vfio/pci.c              | 182 ++++++++++++++++++++++++++++++++++++++++++++-
>>>>  hw/vfio/pci.h              |   5 ++
>>>>  linux-headers/linux/vfio.h |   1 +
>>>>  3 files changed, 186 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 6877a3d..39a9a9f 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -35,6 +35,7 @@
>>>>  #include "trace.h"
>>>>
>>>>  #define MSIX_CAP_LENGTH 12
>>>> +#define VFIO_RESET_TIMEOUT 1000
>>>
>>> It deserves at least a comment as to why this value was chosen.
>> OK. I will add a comment here.
>>
>>>>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>>>>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>>>> @@ -1937,6 +1938,14 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
>>>>      VFIOGroup *group;
>>>>      int ret, i, devfn, range_limit;
>>>>
>>>> +    if (!vdev->vfio_resume_cap) {
>>>> +        error_setg(errp, "vfio: Cannot enable AER for device %s,"
>>>> +                   " host vfio driver does not support for"
>>>> +                   " resume notification",
>>>> +                   vdev->vbasedev.name);
>>>> +        return;
>>>> +    }
>>>> +
>>>>      ret = vfio_get_hot_reset_info(vdev, &info);
>>>>      if (ret) {
>>>>          error_setg(errp, "vfio: Cannot enable AER for device %s,"
>>>> @@ -2594,6 +2603,23 @@ static int vfio_populate_device(VFIOPCIDevice *vdev)
>>>>                       vbasedev->name);
>>>>      }
>>>>
>>>> +    irq_info.index = VFIO_PCI_RESUME_IRQ_INDEX;
>>>> +
>>>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
>>>> +    if (ret) {
>>>> +        /* This can fail for an old kernel or legacy PCI dev */
>>>> +        trace_vfio_populate_device_get_irq_info_failure();
>>>> +        ret = 0;
>>>> +    } else if (irq_info.count == 1) {
>>>> +        vdev->vfio_resume_cap = true;
>>>
>>> "vfio" in the name is redundant, ERR_IRQ uses pci_aer, so a logical
>>> name might be pci_aer_has_resume.
>> OK.
>>
>>>> +    } else {
>>>> +        error_report("vfio: %s "
>>>> +                     "Could not enable error recovery for the device,"
>>>> +                     " because host vfio driver does not support for"
>>>> +                     " resume notification",
>>>> +                     vbasedev->name);
>>>> +    }
>>>
>>> This error_report makes sense for ERR_IRQ because halt-on-AER is setup
>>> transparently, but I don't think it makes sense here.  If the user has
>>> specified to enable AER then it should either work or they should get
>>> an error message.  If they have not specified to enable AER, why does
>>> the user care if there's an inconsistency here?
>> OK. I will delete the error report at here.
>>
>>>> +
>>>>  error:
>>>>      return ret;
>>>>  }
>>>> @@ -2661,6 +2687,17 @@ static void vfio_err_notifier_handler(void *opaque)
>>>>          msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>>>>                                   PCI_ERR_ROOT_CMD_NONFATAL_EN;
>>>>
>>>> +        if (isfatal) {
>>>> +            PCIDevice *dev_0 = pci_get_function_0(dev);
>>>> +            vdev->vfio_resume_wait = true;
>>>> +            vdev->vfio_resume_arrived = false;
>>>
>>> Possible names:
>>>
>>> pci_aer_error_signaled
>>> pci_aer_resume_signaled
>> OK.
>>
>>>> +            vfio_mmap_set_enabled(vdev, false);
>>>> +            if (dev_0 != dev) {
>>>> +                VFIOPCIDevice *vdev_0 = DO_UPCAST(VFIOPCIDevice, pdev, dev_0);
>>>> +                vdev_0->vfio_resume_wait = true;
>>>> +                vdev_0->vfio_resume_arrived = false;
>>>> +            }
>>>
>>> Why is function 0 special here?  Don't we expect that it will also get
>>> an ERR_IRQ?
>> I tested in this condition.
>> The device is multifunction.
>> And function 0 is OK, function 1 occured an error.
>> function 0 got an ERR_IRQ, but returned at following code.
>> if (!(uncor_status & ~0UL)) {
>>      return;
>> }
>> If don't set the function 0 at here, it will not wait in vfio_pci_reset.
>>
>> And I also tested the nofatal error.
>> The aer will send the nofatal error notification to qemu,
>> but the guest will not invoke vfio_pci_reset.
>> So, I need know whether the error is fatal,
>> and then set the pci_aer_error_signaled.
>
> Do we need to worry about some functions getting a resume while others
> don't?  Should the reset stall until all functions that received a
> fatal error receive a resume?  Maybe all tracking should be done with
> an 8-byte (256-bit to support ARI) atomic bitmap on function 0 and
> reset is blocked until the mask of all functions receiving a fatal
> error is equal to the mask of all functions receiving a resume.
It is a good idea. I will try it.

>>>> +        }
>>>>          pcie_aer_msg(dev, &msg);
>>>>          return;
>>>>      }
>>>> @@ -2756,6 +2793,96 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
>>>>      event_notifier_cleanup(&vdev->err_notifier);
>>>>  }
>>>>
>>>> +static void vfio_resume_notifier_handler(void *opaque)
>>>
>>> Please use "aer" in the name, otherwise resume might refer to
>>> suspend-resume.
>> OK.
>>
>>>> +{
>>>> +    VFIOPCIDevice *vdev = opaque;
>>>> +
>>>> +    if (!event_notifier_test_and_clear(&vdev->resume_notifier)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vdev->vfio_resume_arrived = true;
>>>> +    if (vdev->reset_timer != NULL) {
>>>
>>> pci_aer_reset_blocked_timer
>> OK
>>
>>>> +        timer_mod(vdev->reset_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL));
>>>> +    }
>>>> +}
>>>> +
>>>> +static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev)
>>>> +{
>>>> +    int ret;
>>>> +    int argsz;
>>>> +    struct vfio_irq_set *irq_set;
>>>> +    int32_t *pfd;
>>>> +
>>>> +    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER) ||
>>>> +        !vdev->vfio_resume_cap) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (event_notifier_init(&vdev->resume_notifier, 0)) {
>>>> +        error_report("vfio: Unable to init event notifier for"
>>>> +                     " resume notification");
>>>> +        vdev->vfio_resume_cap = false;
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>>>> +
>>>> +    irq_set = g_malloc0(argsz);
>>>> +    irq_set->argsz = argsz;
>>>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>>>> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
>>>> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
>>>> +    irq_set->start = 0;
>>>> +    irq_set->count = 1;
>>>> +    pfd = (int32_t *)&irq_set->data;
>>>> +
>>>> +    *pfd = event_notifier_get_fd(&vdev->resume_notifier);
>>>> +    qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev);
>>>> +
>>>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>>>> +    if (ret) {
>>>> +        error_report("vfio: Failed to set up resume notification");
>>>> +        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
>>>> +        event_notifier_cleanup(&vdev->resume_notifier);
>>>> +        vdev->vfio_resume_cap = false;
>>>> +    }
>>>> +    g_free(irq_set);
>>>> +}
>>>> +
>>>> +static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev)
>>>> +{
>>>> +    int argsz;
>>>> +    struct vfio_irq_set *irq_set;
>>>> +    int32_t *pfd;
>>>> +    int ret;
>>>> +
>>>> +    if (!vdev->vfio_resume_cap) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>>>> +
>>>> +    irq_set = g_malloc0(argsz);
>>>> +    irq_set->argsz = argsz;
>>>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>>>> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
>>>> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
>>>> +    irq_set->start = 0;
>>>> +    irq_set->count = 1;
>>>> +    pfd = (int32_t *)&irq_set->data;
>>>> +    *pfd = -1;
>>>> +
>>>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>>>> +    if (ret) {
>>>> +        error_report("vfio: Failed to de-assign error fd: %m");
>>>> +    }
>>>> +    g_free(irq_set);
>>>> +    qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier),
>>>> +                        NULL, NULL, vdev);
>>>> +    event_notifier_cleanup(&vdev->resume_notifier);
>>>> +}
>>>> +
>>>>  static void vfio_req_notifier_handler(void *opaque)
>>>>  {
>>>>      VFIOPCIDevice *vdev = opaque;
>>>> @@ -3061,6 +3188,7 @@ static int vfio_initfn(PCIDevice *pdev)
>>>>      }
>>>>
>>>>      vfio_register_err_notifier(vdev);
>>>> +    vfio_register_aer_resume_notifier(vdev);
>>>>      vfio_register_req_notifier(vdev);
>>>>      vfio_setup_resetfn_quirk(vdev);
>>>>
>>>> @@ -3091,6 +3219,7 @@ static void vfio_exitfn(PCIDevice *pdev)
>>>>      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>>
>>>>      vfio_unregister_req_notifier(vdev);
>>>> +    vfio_unregister_aer_resume_notifier(vdev);
>>>>      vfio_unregister_err_notifier(vdev);
>>>>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>>>>      vfio_disable_interrupts(vdev);
>>>> @@ -3101,6 +3230,34 @@ static void vfio_exitfn(PCIDevice *pdev)
>>>>      vfio_bars_exit(vdev);
>>>>  }
>>>>
>>>> +static void vfio_pci_delayed_reset(void *opaque)
>>>> +{
>>>> +    VFIOPCIDevice *vdev = opaque;
>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>> +
>>>> +    timer_free(vdev->reset_timer);
>>>> +    vdev->reset_timer = NULL;
>>>
>>> Racy, the timer gets freed before set to NULL so the test above could
>>> see it non-NULL as it's being freed, assuming QEMU supports that sort
>>> of concurrency.
>> I will use sem to avoid race condition.
>> More comments below...
>
> Seems like you could just reverse the order, cache vdev->reset_timer,
> set it to NULL, then call timer_free() on the cached value.  But as I
> question below, it'd be more simple to not have a timer.

I meant I will use sem instead of timer.

>>>> +
>>>> +    if (!vdev->vfio_resume_wait) {
>>>> +        return;
>>>> +    }
>>>> +    vdev->vfio_resume_wait = false;
>>>> +
>>>> +    if (vdev->vfio_resume_arrived) {
>>>> +        vfio_mmap_set_enabled(vdev, true);
>>>
>>> How do you know if mmap was enabled when you started?  This could
>>> interfere with other cases where mmaps get disabled.
>> Yes, I will modify the code.
>>
>>>> +        if (pci_get_function_0(pdev) == pdev) {
>>>> +            vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>>> +        }
>>>> +    } else {
>>>> +        uint32_t uncor_status;
>>>> +        uncor_status = vfio_pci_read_config(pdev,
>>>> +                           pdev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
>>>> +        if (uncor_status & ~0UL) {
>>>> +            qdev_unplug(&vdev->pdev.qdev, NULL);
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>>  static void vfio_pci_reset(DeviceState *dev)
>>>>  {
>>>>      PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
>>>> @@ -3113,13 +3270,34 @@ static void vfio_pci_reset(DeviceState *dev)
>>>>
>>>>          if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
>>>>               PCI_BRIDGE_CTL_BUS_RESET)) {
>>>> -            if (pci_get_function_0(pdev) == pdev) {
>>>> -                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>>> +            if (!vdev->vfio_resume_wait) {
>>>> +                if (pci_get_function_0(pdev) == pdev) {
>>>> +                    vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>>> +                }
>>>> +            } else {
>>>> +                if (vdev->vfio_resume_arrived) {
>>>> +                    vdev->vfio_resume_wait = false;
>>>> +                    vfio_mmap_set_enabled(vdev, true);
>>>
>>> mmap is getting restored in too many places, it should be disabled on
>>> ERR_IRQ and re-enabled on ERR_RESUME, no more.
>> OK.
>>
>>>> +                    if (pci_get_function_0(pdev) == pdev) {
>>>> +                        vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>>> +                    }
>>>> +                } else {
>>>> +                    vdev->reset_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
>>>> +                                            vfio_pci_delayed_reset, vdev);
>>>> +                    timer_mod(vdev->reset_timer,
>>>> +                              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)
>>>> +                              + VFIO_RESET_TIMEOUT);
>>>> +                }
>>>
>>> Wait, so we just set a timer and return pretending the reset occurred
>>> when actually it might occur at some point in the future?  How is that
>>> supposed to work?  I thought the plan was to block here.
>> How about invoke sem_post at vfio_resume_notifier_handler,
>> and wait here use sem_timedwait?
>
> Do we even need a timer?  What if we simply spin?
>
> for (i = 0; i < 1000; i++) {
>     if (vdev->pci_aer_resume_signaled) {
>         break;
>     }
>     g_usleep(1000);
> }
>
> if (i == 1000) {
>     /* We failed */
> } else {
>     /* Proceed with reset */
> }
>
> Does QEMU have enough concurrency to do this?  Thanks,

I meant I will use sem instead of timer.
But think about the new idea you proposed.
I will use g_usleep and check bitmap to
wait mask of all functions receiving a fatal
error is equal to the mask of all functions receiving a resume.

Sincerely,
Zhou Jie

>
> Alex
>
>>>>              }
>>>>              return;
>>>>          }
>>>>      }
>>>>
>>>> +    if (vdev->vfio_resume_wait) {
>>>> +        vdev->vfio_resume_wait = false;
>>>> +        vfio_mmap_set_enabled(vdev, true);
>>>> +    }
>>>
>>> Again, these are getting changed in too many places, the state machine
>>> is too complicated.  Thanks,
>> In my test this code will never be invoked.
>> But I add this code to clear vfio_resume_wait if the guest don't
>>   reset the bus.
>>
>> Sincerely,
>> Zhou Jie
>>
>>
>>>
>>> Alex
>>>
>>>> +
>>>>      vfio_pci_pre_reset(vdev);
>>>>
>>>>      if (vdev->resetfn && !vdev->resetfn(vdev)) {
>>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>>> index 9fb0206..49e28d8 100644
>>>> --- a/hw/vfio/pci.h
>>>> +++ b/hw/vfio/pci.h
>>>> @@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice {
>>>>      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
>>>>      PCIHostDeviceAddress host;
>>>>      EventNotifier err_notifier;
>>>> +    EventNotifier resume_notifier;
>>>>      EventNotifier req_notifier;
>>>>      int (*resetfn)(struct VFIOPCIDevice *);
>>>>      uint32_t vendor_id;
>>>> @@ -144,6 +145,10 @@ typedef struct VFIOPCIDevice {
>>>>      bool no_kvm_msi;
>>>>      bool no_kvm_msix;
>>>>      bool single_depend_dev;
>>>> +    bool vfio_resume_cap;
>>>> +    bool vfio_resume_wait;
>>>> +    bool vfio_resume_arrived;
>>>> +    QEMUTimer *reset_timer;
>>>>  } VFIOPCIDevice;
>>>>
>>>>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>> index 759b850..01dfd5d 100644
>>>> --- a/linux-headers/linux/vfio.h
>>>> +++ b/linux-headers/linux/vfio.h
>>>> @@ -433,6 +433,7 @@ enum {
>>>>  	VFIO_PCI_MSIX_IRQ_INDEX,
>>>>  	VFIO_PCI_ERR_IRQ_INDEX,
>>>>  	VFIO_PCI_REQ_IRQ_INDEX,
>>>> +        VFIO_PCI_RESUME_IRQ_INDEX,
>>>>  	VFIO_PCI_NUM_IRQS
>>>>  };
>>>>
>>>
>>>
>>>
>>> .
>>>
>>
>>
>
>
>
> .
>

-- 
------------------------------------------------
周潔
Dept 1
No. 6 Wenzhu Road,
Nanjing, 210012, China
TEL:+86+25-86630566-8557
FUJITSU INTERNAL:7998-8557
E-Mail:zhoujie2011@cn.fujitsu.com
------------------------------------------------

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume
  2016-05-19  2:18       ` Alex Williamson
  2016-05-19  2:41         ` Zhou Jie
@ 2016-05-19  2:41         ` Zhou Jie
  1 sibling, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-19  2:41 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, Chen Fan, izumi.taku, caoj.fnst, mst



On 2016/5/19 10:18, Alex Williamson wrote:
> On Thu, 19 May 2016 09:49:00 +0800
> Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
>
>> On 2016/5/19 2:26, Alex Williamson wrote:
>>> On Wed, 18 May 2016 11:31:09 +0800
>>> Zhou Jie <zhoujie2011@cn.fujitsu.com> wrote:
>>>
>>>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>
>>>> For supporting aer recovery, host and guest would run the same aer
>>>> recovery code, that would do the secondary bus reset if the error
>>>> is fatal, the aer recovery process:
>>>>   1. error_detected
>>>>   2. reset_link (if fatal)
>>>>   3. slot_reset/mmio_enabled
>>>>   4. resume
>>>>
>>>> It indicates that host will do secondary bus reset to reset
>>>> the physical devices under bus in step 2, that would cause
>>>> devices in D3 status in a short time. But in qemu, we register
>>>> an error detected handler, that would be invoked as host broadcasts
>>>> the error-detected event in step 1, in order to avoid guest do
>>>> reset_link when host do reset_link simultaneously. it may cause
>>>> fatal error. we introduce a resmue notifier to assure host reset
>>>> completely.
>>>> In qemu, the aer recovery process:
>>>>   1. Detect support for resume notification
>>>>      If host vfio driver does not support for resume notification,
>>>>      directly fail to boot up VM as with aer enabled.
>>>>   2. Immediately notify the VM on error detected.
>>>>   3. Stall any access to the device until resume is signaled.
>>>
>>> The code below doesn't actually do this, mmaps are disabled, but that
>>> just means any device access will use the slow path through QEMU.  That
>>> path is still fully enabled and so is config space access.
>> I will modify the code and find other way to
>> stall any access to the device.
>>
>>>>   4. Delay the guest directed bus reset.
>>>
>>> It's delayed, but not the way I expected.  The guest goes on executing
>>> and then we do the reset at some point later?  More comments below...
>>>
>>>>   5. Wait for resume notification.
>>>>      If we don't get the resume notification from the host after
>>>>      some timeout, we would abort the guest directed bus reset
>>>>      altogether and unplug of the device to prevent it from further
>>>>      interacting with the VM.
>>>>   6. After get the resume notification reset bus.
>>>>
>>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>> Signed-off-by: Zhou Jie <zhoujie2011@cn.fujitsu.com>
>>>> ---
>>>>  hw/vfio/pci.c              | 182 ++++++++++++++++++++++++++++++++++++++++++++-
>>>>  hw/vfio/pci.h              |   5 ++
>>>>  linux-headers/linux/vfio.h |   1 +
>>>>  3 files changed, 186 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 6877a3d..39a9a9f 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -35,6 +35,7 @@
>>>>  #include "trace.h"
>>>>
>>>>  #define MSIX_CAP_LENGTH 12
>>>> +#define VFIO_RESET_TIMEOUT 1000
>>>
>>> It deserves at least a comment as to why this value was chosen.
>> OK. I will add a comment here.
>>
>>>>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>>>>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>>>> @@ -1937,6 +1938,14 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
>>>>      VFIOGroup *group;
>>>>      int ret, i, devfn, range_limit;
>>>>
>>>> +    if (!vdev->vfio_resume_cap) {
>>>> +        error_setg(errp, "vfio: Cannot enable AER for device %s,"
>>>> +                   " host vfio driver does not support for"
>>>> +                   " resume notification",
>>>> +                   vdev->vbasedev.name);
>>>> +        return;
>>>> +    }
>>>> +
>>>>      ret = vfio_get_hot_reset_info(vdev, &info);
>>>>      if (ret) {
>>>>          error_setg(errp, "vfio: Cannot enable AER for device %s,"
>>>> @@ -2594,6 +2603,23 @@ static int vfio_populate_device(VFIOPCIDevice *vdev)
>>>>                       vbasedev->name);
>>>>      }
>>>>
>>>> +    irq_info.index = VFIO_PCI_RESUME_IRQ_INDEX;
>>>> +
>>>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
>>>> +    if (ret) {
>>>> +        /* This can fail for an old kernel or legacy PCI dev */
>>>> +        trace_vfio_populate_device_get_irq_info_failure();
>>>> +        ret = 0;
>>>> +    } else if (irq_info.count == 1) {
>>>> +        vdev->vfio_resume_cap = true;
>>>
>>> "vfio" in the name is redundant, ERR_IRQ uses pci_aer, so a logical
>>> name might be pci_aer_has_resume.
>> OK.
>>
>>>> +    } else {
>>>> +        error_report("vfio: %s "
>>>> +                     "Could not enable error recovery for the device,"
>>>> +                     " because host vfio driver does not support for"
>>>> +                     " resume notification",
>>>> +                     vbasedev->name);
>>>> +    }
>>>
>>> This error_report makes sense for ERR_IRQ because halt-on-AER is setup
>>> transparently, but I don't think it makes sense here.  If the user has
>>> specified to enable AER then it should either work or they should get
>>> an error message.  If they have not specified to enable AER, why does
>>> the user care if there's an inconsistency here?
>> OK. I will delete the error report at here.
>>
>>>> +
>>>>  error:
>>>>      return ret;
>>>>  }
>>>> @@ -2661,6 +2687,17 @@ static void vfio_err_notifier_handler(void *opaque)
>>>>          msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
>>>>                                   PCI_ERR_ROOT_CMD_NONFATAL_EN;
>>>>
>>>> +        if (isfatal) {
>>>> +            PCIDevice *dev_0 = pci_get_function_0(dev);
>>>> +            vdev->vfio_resume_wait = true;
>>>> +            vdev->vfio_resume_arrived = false;
>>>
>>> Possible names:
>>>
>>> pci_aer_error_signaled
>>> pci_aer_resume_signaled
>> OK.
>>
>>>> +            vfio_mmap_set_enabled(vdev, false);
>>>> +            if (dev_0 != dev) {
>>>> +                VFIOPCIDevice *vdev_0 = DO_UPCAST(VFIOPCIDevice, pdev, dev_0);
>>>> +                vdev_0->vfio_resume_wait = true;
>>>> +                vdev_0->vfio_resume_arrived = false;
>>>> +            }
>>>
>>> Why is function 0 special here?  Don't we expect that it will also get
>>> an ERR_IRQ?
>> I tested in this condition.
>> The device is multifunction.
>> And function 0 is OK, function 1 occured an error.
>> function 0 got an ERR_IRQ, but returned at following code.
>> if (!(uncor_status & ~0UL)) {
>>      return;
>> }
>> If don't set the function 0 at here, it will not wait in vfio_pci_reset.
>>
>> And I also tested the nofatal error.
>> The aer will send the nofatal error notification to qemu,
>> but the guest will not invoke vfio_pci_reset.
>> So, I need know whether the error is fatal,
>> and then set the pci_aer_error_signaled.
>
> Do we need to worry about some functions getting a resume while others
> don't?  Should the reset stall until all functions that received a
> fatal error receive a resume?  Maybe all tracking should be done with
> an 8-byte (256-bit to support ARI) atomic bitmap on function 0 and
> reset is blocked until the mask of all functions receiving a fatal
> error is equal to the mask of all functions receiving a resume.
It is a good idea. I will try it.

>>>> +        }
>>>>          pcie_aer_msg(dev, &msg);
>>>>          return;
>>>>      }
>>>> @@ -2756,6 +2793,96 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
>>>>      event_notifier_cleanup(&vdev->err_notifier);
>>>>  }
>>>>
>>>> +static void vfio_resume_notifier_handler(void *opaque)
>>>
>>> Please use "aer" in the name, otherwise resume might refer to
>>> suspend-resume.
>> OK.
>>
>>>> +{
>>>> +    VFIOPCIDevice *vdev = opaque;
>>>> +
>>>> +    if (!event_notifier_test_and_clear(&vdev->resume_notifier)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vdev->vfio_resume_arrived = true;
>>>> +    if (vdev->reset_timer != NULL) {
>>>
>>> pci_aer_reset_blocked_timer
>> OK
>>
>>>> +        timer_mod(vdev->reset_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL));
>>>> +    }
>>>> +}
>>>> +
>>>> +static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev)
>>>> +{
>>>> +    int ret;
>>>> +    int argsz;
>>>> +    struct vfio_irq_set *irq_set;
>>>> +    int32_t *pfd;
>>>> +
>>>> +    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER) ||
>>>> +        !vdev->vfio_resume_cap) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (event_notifier_init(&vdev->resume_notifier, 0)) {
>>>> +        error_report("vfio: Unable to init event notifier for"
>>>> +                     " resume notification");
>>>> +        vdev->vfio_resume_cap = false;
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>>>> +
>>>> +    irq_set = g_malloc0(argsz);
>>>> +    irq_set->argsz = argsz;
>>>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>>>> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
>>>> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
>>>> +    irq_set->start = 0;
>>>> +    irq_set->count = 1;
>>>> +    pfd = (int32_t *)&irq_set->data;
>>>> +
>>>> +    *pfd = event_notifier_get_fd(&vdev->resume_notifier);
>>>> +    qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev);
>>>> +
>>>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>>>> +    if (ret) {
>>>> +        error_report("vfio: Failed to set up resume notification");
>>>> +        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
>>>> +        event_notifier_cleanup(&vdev->resume_notifier);
>>>> +        vdev->vfio_resume_cap = false;
>>>> +    }
>>>> +    g_free(irq_set);
>>>> +}
>>>> +
>>>> +static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev)
>>>> +{
>>>> +    int argsz;
>>>> +    struct vfio_irq_set *irq_set;
>>>> +    int32_t *pfd;
>>>> +    int ret;
>>>> +
>>>> +    if (!vdev->vfio_resume_cap) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
>>>> +
>>>> +    irq_set = g_malloc0(argsz);
>>>> +    irq_set->argsz = argsz;
>>>> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>>>> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
>>>> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
>>>> +    irq_set->start = 0;
>>>> +    irq_set->count = 1;
>>>> +    pfd = (int32_t *)&irq_set->data;
>>>> +    *pfd = -1;
>>>> +
>>>> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>>>> +    if (ret) {
>>>> +        error_report("vfio: Failed to de-assign error fd: %m");
>>>> +    }
>>>> +    g_free(irq_set);
>>>> +    qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier),
>>>> +                        NULL, NULL, vdev);
>>>> +    event_notifier_cleanup(&vdev->resume_notifier);
>>>> +}
>>>> +
>>>>  static void vfio_req_notifier_handler(void *opaque)
>>>>  {
>>>>      VFIOPCIDevice *vdev = opaque;
>>>> @@ -3061,6 +3188,7 @@ static int vfio_initfn(PCIDevice *pdev)
>>>>      }
>>>>
>>>>      vfio_register_err_notifier(vdev);
>>>> +    vfio_register_aer_resume_notifier(vdev);
>>>>      vfio_register_req_notifier(vdev);
>>>>      vfio_setup_resetfn_quirk(vdev);
>>>>
>>>> @@ -3091,6 +3219,7 @@ static void vfio_exitfn(PCIDevice *pdev)
>>>>      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>>
>>>>      vfio_unregister_req_notifier(vdev);
>>>> +    vfio_unregister_aer_resume_notifier(vdev);
>>>>      vfio_unregister_err_notifier(vdev);
>>>>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>>>>      vfio_disable_interrupts(vdev);
>>>> @@ -3101,6 +3230,34 @@ static void vfio_exitfn(PCIDevice *pdev)
>>>>      vfio_bars_exit(vdev);
>>>>  }
>>>>
>>>> +static void vfio_pci_delayed_reset(void *opaque)
>>>> +{
>>>> +    VFIOPCIDevice *vdev = opaque;
>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>> +
>>>> +    timer_free(vdev->reset_timer);
>>>> +    vdev->reset_timer = NULL;
>>>
>>> Racy, the timer gets freed before set to NULL so the test above could
>>> see it non-NULL as it's being freed, assuming QEMU supports that sort
>>> of concurrency.
>> I will use sem to avoid race condition.
>> More comments below...
>
> Seems like you could just reverse the order, cache vdev->reset_timer,
> set it to NULL, then call timer_free() on the cached value.  But as I
> question below, it'd be more simple to not have a timer.

I meant I will use sem instead of timer.

>>>> +
>>>> +    if (!vdev->vfio_resume_wait) {
>>>> +        return;
>>>> +    }
>>>> +    vdev->vfio_resume_wait = false;
>>>> +
>>>> +    if (vdev->vfio_resume_arrived) {
>>>> +        vfio_mmap_set_enabled(vdev, true);
>>>
>>> How do you know if mmap was enabled when you started?  This could
>>> interfere with other cases where mmaps get disabled.
>> Yes, I will modify the code.
>>
>>>> +        if (pci_get_function_0(pdev) == pdev) {
>>>> +            vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>>> +        }
>>>> +    } else {
>>>> +        uint32_t uncor_status;
>>>> +        uncor_status = vfio_pci_read_config(pdev,
>>>> +                           pdev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
>>>> +        if (uncor_status & ~0UL) {
>>>> +            qdev_unplug(&vdev->pdev.qdev, NULL);
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>>  static void vfio_pci_reset(DeviceState *dev)
>>>>  {
>>>>      PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
>>>> @@ -3113,13 +3270,34 @@ static void vfio_pci_reset(DeviceState *dev)
>>>>
>>>>          if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
>>>>               PCI_BRIDGE_CTL_BUS_RESET)) {
>>>> -            if (pci_get_function_0(pdev) == pdev) {
>>>> -                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>>> +            if (!vdev->vfio_resume_wait) {
>>>> +                if (pci_get_function_0(pdev) == pdev) {
>>>> +                    vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>>> +                }
>>>> +            } else {
>>>> +                if (vdev->vfio_resume_arrived) {
>>>> +                    vdev->vfio_resume_wait = false;
>>>> +                    vfio_mmap_set_enabled(vdev, true);
>>>
>>> mmap is getting restored in too many places, it should be disabled on
>>> ERR_IRQ and re-enabled on ERR_RESUME, no more.
>> OK.
>>
>>>> +                    if (pci_get_function_0(pdev) == pdev) {
>>>> +                        vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>>> +                    }
>>>> +                } else {
>>>> +                    vdev->reset_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
>>>> +                                            vfio_pci_delayed_reset, vdev);
>>>> +                    timer_mod(vdev->reset_timer,
>>>> +                              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)
>>>> +                              + VFIO_RESET_TIMEOUT);
>>>> +                }
>>>
>>> Wait, so we just set a timer and return pretending the reset occurred
>>> when actually it might occur at some point in the future?  How is that
>>> supposed to work?  I thought the plan was to block here.
>> How about invoke sem_post at vfio_resume_notifier_handler,
>> and wait here use sem_timedwait?
>
> Do we even need a timer?  What if we simply spin?
>
> for (i = 0; i < 1000; i++) {
>     if (vdev->pci_aer_resume_signaled) {
>         break;
>     }
>     g_usleep(1000);
> }
>
> if (i == 1000) {
>     /* We failed */
> } else {
>     /* Proceed with reset */
> }
>
> Does QEMU have enough concurrency to do this?  Thanks,

I meant I will use sem instead of timer.
But think about the new idea you proposed.
I will use g_usleep and check bitmap to
wait mask of all functions receiving a fatal
error is equal to the mask of all functions receiving a resume.

Sincerely,
Zhou Jie

>
> Alex
>
>>>>              }
>>>>              return;
>>>>          }
>>>>      }
>>>>
>>>> +    if (vdev->vfio_resume_wait) {
>>>> +        vdev->vfio_resume_wait = false;
>>>> +        vfio_mmap_set_enabled(vdev, true);
>>>> +    }
>>>
>>> Again, these are getting changed in too many places, the state machine
>>> is too complicated.  Thanks,
>> In my test this code will never be invoked.
>> But I add this code to clear vfio_resume_wait if the guest don't
>>   reset the bus.
>>
>> Sincerely,
>> Zhou Jie
>>
>>
>>>
>>> Alex
>>>
>>>> +
>>>>      vfio_pci_pre_reset(vdev);
>>>>
>>>>      if (vdev->resetfn && !vdev->resetfn(vdev)) {
>>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>>> index 9fb0206..49e28d8 100644
>>>> --- a/hw/vfio/pci.h
>>>> +++ b/hw/vfio/pci.h
>>>> @@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice {
>>>>      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
>>>>      PCIHostDeviceAddress host;
>>>>      EventNotifier err_notifier;
>>>> +    EventNotifier resume_notifier;
>>>>      EventNotifier req_notifier;
>>>>      int (*resetfn)(struct VFIOPCIDevice *);
>>>>      uint32_t vendor_id;
>>>> @@ -144,6 +145,10 @@ typedef struct VFIOPCIDevice {
>>>>      bool no_kvm_msi;
>>>>      bool no_kvm_msix;
>>>>      bool single_depend_dev;
>>>> +    bool vfio_resume_cap;
>>>> +    bool vfio_resume_wait;
>>>> +    bool vfio_resume_arrived;
>>>> +    QEMUTimer *reset_timer;
>>>>  } VFIOPCIDevice;
>>>>
>>>>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>> index 759b850..01dfd5d 100644
>>>> --- a/linux-headers/linux/vfio.h
>>>> +++ b/linux-headers/linux/vfio.h
>>>> @@ -433,6 +433,7 @@ enum {
>>>>  	VFIO_PCI_MSIX_IRQ_INDEX,
>>>>  	VFIO_PCI_ERR_IRQ_INDEX,
>>>>  	VFIO_PCI_REQ_IRQ_INDEX,
>>>> +        VFIO_PCI_RESUME_IRQ_INDEX,
>>>>  	VFIO_PCI_NUM_IRQS
>>>>  };
>>>>
>>>
>>>
>>>
>>> .
>>>
>>
>>
>
>
>
> .
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume
  2016-05-19  1:49     ` Zhou Jie
  2016-05-19  2:18       ` Alex Williamson
@ 2016-05-25  6:23       ` Zhou Jie
  2016-05-25 14:06         ` Zhou Jie
  1 sibling, 1 reply; 21+ messages in thread
From: Zhou Jie @ 2016-05-25  6:23 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Chen Fan, izumi.taku, caoj.fnst, qemu-devel, mst

Hi, Alex

>>>   3. Stall any access to the device until resume is signaled.
>>
>> The code below doesn't actually do this, mmaps are disabled, but that
>> just means any device access will use the slow path through QEMU.  That
>> path is still fully enabled and so is config space access.
> I will modify the code and find other way to
> stall any access to the device.

I find that to stall any access to the device,
I need modify the following function.
1. vfio_region_read and vfio_region_write
    For stalling any access to the device bar region.
2. vfio_vga_read and vfio_vga_write
    For stalling any access to the vga device region.
3. vfio_pci_read_config and vfio_pci_write_config
    For stalling any access to the device config space.

What will happen if I don't stall any access to the device?

Sincerely,
Zhou Jie

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume
  2016-05-25  6:23       ` Zhou Jie
@ 2016-05-25 14:06         ` Zhou Jie
  0 siblings, 0 replies; 21+ messages in thread
From: Zhou Jie @ 2016-05-25 14:06 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Chen Fan, izumi.taku, caoj.fnst, qemu-devel, mst

Hi, Alex

 >> Do we even need a timer?  What if we simply spin?
 >>
 >> for (i = 0; i < 1000; i++) {
 >>     if (vdev->pci_aer_resume_signaled) {
 >>         break;
 >>     }
 >>     g_usleep(1000);
 >> }
 >>
 >> if (i == 1000) {
 >>     /* We failed */
 >> } else {
 >>     /* Proceed with reset */
 >> }
 >>
 >> Does QEMU have enough concurrency to do this?  Thanks,On 2016/5/25
In my test, it can not work.
vfio_aer_resume_notifier_handler is invoked after vfio_pci_reset.
It is useless to wait in vfio_pci_reset.

Sincerely,
Zhou Jie

14:23, Zhou Jie wrote:
> Hi, Alex
>
>>>>   3. Stall any access to the device until resume is signaled.
>>>
>>> The code below doesn't actually do this, mmaps are disabled, but that
>>> just means any device access will use the slow path through QEMU.  That
>>> path is still fully enabled and so is config space access.
>> I will modify the code and find other way to
>> stall any access to the device.
>
> I find that to stall any access to the device,
> I need modify the following function.
> 1. vfio_region_read and vfio_region_write
>    For stalling any access to the device bar region.
> 2. vfio_vga_read and vfio_vga_write
>    For stalling any access to the vga device region.
> 3. vfio_pci_read_config and vfio_pci_write_config
>    For stalling any access to the device config space.
>
> What will happen if I don't stall any access to the device?
>
> Sincerely,
> Zhou Jie
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 03/12] vfio: add pcie extended capability support
  2016-05-18  3:31 ` [Qemu-devel] [PATCH 03/12] vfio: add pcie extended capability support Zhou Jie
@ 2016-06-28 20:04   ` Laszlo Ersek
  0 siblings, 0 replies; 21+ messages in thread
From: Laszlo Ersek @ 2016-06-28 20:04 UTC (permalink / raw)
  To: Zhou Jie, qemu-devel, alex.williamson
  Cc: Chen Fan, izumi.taku, caoj.fnst, mst

On 05/18/16 05:31, Zhou Jie wrote:
> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> 
> For vfio pcie device, we could expose the extended capability on
> PCIE bus. due to add a new pcie capability at the tail of the chain,
> in order to avoid config space overwritten, we introduce a copy config
> for parsing extended caps. and rebuild the pcie extended config space.
> 
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 71 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 1ad47ef..f697853 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1528,6 +1528,21 @@ static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
>      return next - pos;
>  }
>  
> +
> +static uint16_t vfio_ext_cap_max_size(const uint8_t *config, uint16_t pos)
> +{
> +    uint16_t tmp, next = PCIE_CONFIG_SPACE_SIZE;
> +
> +    for (tmp = PCI_CONFIG_SPACE_SIZE; tmp;
> +        tmp = PCI_EXT_CAP_NEXT(pci_get_long(config + tmp))) {
> +        if (tmp > pos && tmp < next) {
> +            next = tmp;
> +        }
> +    }
> +
> +    return next - pos;
> +}
> +
>  static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
>  {
>      pci_set_word(buf, (pci_get_word(buf) & ~mask) | val);
> @@ -1862,16 +1877,71 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
>      return 0;
>  }
>  
> +static int vfio_add_ext_cap(VFIOPCIDevice *vdev)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t header;
> +    uint16_t cap_id, next, size;
> +    uint8_t cap_ver;
> +    uint8_t *config;
> +
> +    /*
> +     * pcie_add_capability always inserts the new capability at the tail
> +     * of the chain.  Therefore to end up with a chain that matches the
> +     * physical device, we cache the config space to avoid overwriting
> +     * the original config space when we parse the extended capabilities.
> +     */
> +    config = g_memdup(pdev->config, vdev->config_size);
> +
> +    for (next = PCI_CONFIG_SPACE_SIZE; next;
> +         next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> +        header = pci_get_long(config + next);
> +        cap_id = PCI_EXT_CAP_ID(header);
> +        cap_ver = PCI_EXT_CAP_VER(header);
> +
> +        /*
> +         * If it becomes important to configure extended capabilities to their
> +         * actual size, use this as the default when it's something we don't
> +         * recognize. Since QEMU doesn't actually handle many of the config
> +         * accesses, exact size doesn't seem worthwhile.
> +         */
> +        size = vfio_ext_cap_max_size(config, next);
> +
> +        pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> +        pci_set_long(pdev->config + next, PCI_EXT_CAP(cap_id, cap_ver, 0));
> +
> +        /* Use emulated next pointer to allow dropping extended caps */
> +        pci_long_test_and_set_mask(vdev->emulated_config_bits + next,
> +                                   PCI_EXT_CAP_NEXT_MASK);
> +    }
> +
> +    g_free(config);
> +    return 0;
> +}
> +
>  static int vfio_add_capabilities(VFIOPCIDevice *vdev)
>  {
>      PCIDevice *pdev = &vdev->pdev;
> +    int ret;
>  
>      if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST) ||
>          !pdev->config[PCI_CAPABILITY_LIST]) {
>          return 0; /* Nothing to add */
>      }
>  
> -    return vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
> +    ret = vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    /* on PCI bus, it doesn't make sense to expose extended capabilities. */
> +    if (!pci_is_express(pdev) ||
> +        !pci_bus_is_express(pdev->bus) ||
> +        !pci_get_long(pdev->config + PCI_CONFIG_SPACE_SIZE)) {
> +        return 0;
> +    }
> +
> +    return vfio_add_ext_cap(vdev);
>  }
>  
>  static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
> 

Tested-by: Laszlo Ersek <lersek@redhat.com>

(as a prerequisite for
<http://thread.gmane.org/gmane.comp.emulators.qemu/420774>)

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-06-28 20:04 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-18  3:30 [Qemu-devel] [PATCH v7 00/12] vfio-pci: pass the aer error to guest Zhou Jie
2016-05-18  3:30 ` [Qemu-devel] [PATCH 01/12] vfio: extract vfio_get_hot_reset_info as a single function Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 02/12] vfio: squeeze out vfio_pci_do_hot_reset for support bus reset Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 03/12] vfio: add pcie extended capability support Zhou Jie
2016-06-28 20:04   ` Laszlo Ersek
2016-05-18  3:31 ` [Qemu-devel] [PATCH 04/12] vfio: add aer support for vfio device Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 05/12] vfio: refine function vfio_pci_host_match Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 06/12] vfio: add check host bus reset is support or not Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 07/12] pci: add a pci_function_is_valid callback to check function if valid Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 08/12] vfio: add check aer functionality for hotplug device Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 09/12] vfio: vote the function 0 to do host bus reset when aer occurred Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 10/12] vfio-pci: pass the aer error to guest Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume Zhou Jie
2016-05-18 18:26   ` Alex Williamson
2016-05-19  1:49     ` Zhou Jie
2016-05-19  2:18       ` Alex Williamson
2016-05-19  2:41         ` Zhou Jie
2016-05-19  2:41         ` Zhou Jie
2016-05-25  6:23       ` Zhou Jie
2016-05-25 14:06         ` Zhou Jie
2016-05-18  3:31 ` [Qemu-devel] [PATCH 12/12] vfio: add 'aer' property to expose aercap Zhou Jie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.