All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest
@ 2015-05-27  2:46 Chen Fan
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 01/13] vfio: extract vfio_get_hot_reset_info as a single function Chen Fan
                   ` (11 more replies)
  0 siblings, 12 replies; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

For now, for vfio pci passthough devices when qemu receives
an error from host aer report, there just terminate the guest,
but usually user want to know what error occurred but stop the
guest, so this patches add aer capability support for vfio device,
and pass the error to guest, and have guest driver to recover
from the error.
and turning on SERR# for error forwording in bridge control register
patch in seabios has been merged.

v8-v8.1:
   1. add check the case when one aer devices is re-hotplugged
      to upper bus which will conflict the present which enable aer.
   2. add vfio_pci_affect_devices_is_multi to specify signal or
      multi in-use when call hot reset.

v7-v8:
   1. some bug fixes requested by Alex

v6-v7:
   1. add has_bus_reset to support reset host bus.

v5-v6:
   1. add secondary bus callbacks to reset host bus.

v3-v4:
   1. add 'x-aer' for user to off aer capability.
   2. refactor vfio device to parse extended capabilities.

v2-v3:
   1. refactor vfio device to parse extended capability.
   2. add global property for piix4 to disable vfio aer cap.

v1-v2:
   1. turn on SERR# for bridge control register in firmware.
   2. initilize aer capability for vfio device.
   3. fix some trivial bug.

Chen Fan (13):
  vfio: extract vfio_get_hot_reset_info as a single function
  vfio: squeeze out vfio_pci_do_hot_reset for support bus reset
  vfio: add pcie extanded capability support
  aer: impove pcie_aer_init to support vfio device
  vfio: add aer support for vfio device
  vfio: add check host bus reset is support or not
  vfio: add check for vfio devices which enable aer should support bus
    reset
  pci: add bus reset_notifiers callbacks for host bus reset
  vfio: add sec_bus_reset notifier to notify physical bus reset is
    needed
  vfio: do hot bus reset when do virtual secondary bus reset
  pcie_aer: expose pcie_aer_msg() interface
  vfio-pci: pass the aer error to guest
  vfio: add 'aer' property to expose aercap

 hw/pci-bridge/ioh3420.c            |   2 +-
 hw/pci-bridge/xio3130_downstream.c |   2 +-
 hw/pci-bridge/xio3130_upstream.c   |   2 +-
 hw/pci/pci.c                       |  16 +
 hw/pci/pci_bridge.c                |   5 +
 hw/pci/pcie_aer.c                  |   6 +-
 hw/vfio/pci.c                      | 587 +++++++++++++++++++++++++++++++++----
 include/hw/pci/pci.h               |   4 +
 include/hw/pci/pci_bus.h           |   2 +
 include/hw/pci/pcie_aer.h          |   3 +-
 10 files changed, 566 insertions(+), 63 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 01/13] vfio: extract vfio_get_hot_reset_info as a single function
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27 21:31   ` Alex Williamson
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 03/13] vfio: add pcie extanded capability support Chen Fan
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

the function is used to get affected devices by bus reset.
so here extract it, and can used for aer soon.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 67 +++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e0e339a..9c05304 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2641,6 +2641,50 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
     }
 }
 
+/*
+ * return negative with errno, return 0 on success.
+ * if success, the point of ret_info fill with the affected device reset info.
+ *
+ */
+static int vfio_get_hot_reset_info(VFIOPCIDevice *vdev,
+                                   struct vfio_pci_hot_reset_info **ret_info)
+{
+    struct vfio_pci_hot_reset_info *info;
+    int ret, count;
+
+    *ret_info = NULL;
+
+    info = g_malloc0(sizeof(*info));
+    info->argsz = sizeof(*info);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
+    if (ret && errno != ENOSPC) {
+        ret = -errno;
+        goto error;
+    }
+
+    count = info->count;
+
+    info = g_realloc(info, sizeof(*info) +
+                     (count * sizeof(struct vfio_pci_dependent_device)));
+    info->argsz = sizeof(*info) +
+                  (count * sizeof(struct vfio_pci_dependent_device));
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
+    if (ret) {
+        ret = -errno;
+        goto error;
+    }
+
+    *ret_info = info;
+    info = NULL;
+
+    ret = 0;
+error:
+    g_free(info);
+    return ret;
+}
+
 static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
 {
     PCIDevice *pdev = &vdev->pdev;
@@ -2780,7 +2824,7 @@ static bool vfio_pci_host_match(PCIHostDeviceAddress *host1,
 static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
 {
     VFIOGroup *group;
-    struct vfio_pci_hot_reset_info *info;
+    struct vfio_pci_hot_reset_info *info = NULL;
     struct vfio_pci_dependent_device *devices;
     struct vfio_pci_hot_reset *reset;
     int32_t *fds;
@@ -2792,32 +2836,19 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     vfio_pci_pre_reset(vdev);
     vdev->vbasedev.needs_reset = false;
 
-    info = g_malloc0(sizeof(*info));
-    info->argsz = sizeof(*info);
-
-    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
-    if (ret && errno != ENOSPC) {
-        ret = -errno;
+    ret = vfio_get_hot_reset_info(vdev, &info);
+    if (ret) {
         if (!vdev->has_pm_reset) {
             error_report("vfio: Cannot reset device %04x:%02x:%02x.%x, "
                          "no available reset mechanism.", vdev->host.domain,
                          vdev->host.bus, vdev->host.slot, vdev->host.function);
+        } else {
+            error_report("vfio: hot reset info failed: %m");
         }
         goto out_single;
     }
 
-    count = info->count;
-    info = g_realloc(info, sizeof(*info) + (count * sizeof(*devices)));
-    info->argsz = sizeof(*info) + (count * sizeof(*devices));
     devices = &info->devices[0];
-
-    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
-    if (ret) {
-        ret = -errno;
-        error_report("vfio: hot reset info failed: %m");
-        goto out_single;
-    }
-
     trace_vfio_pci_hot_reset_has_dep_devices(vdev->vbasedev.name);
 
     /* Verify that we have all the groups required */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 03/13] vfio: add pcie extanded capability support
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 01/13] vfio: extract vfio_get_hot_reset_info as a single function Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27 21:31   ` Alex Williamson
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 04/13] aer: impove pcie_aer_init to support vfio device Chen Fan
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

For vfio pcie device, we could expose the extended capability on
PCIE bus. in order to avoid config space broken, we introduce
a copy config for parsing extended caps. and rebuild the pcie
extended config space.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 78f4c82..81a4a9a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2493,6 +2493,21 @@ static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
     return next - pos;
 }
 
+
+static uint16_t vfio_ext_cap_max_size(const uint8_t *config, uint16_t pos)
+{
+    uint16_t tmp, next = PCIE_CONFIG_SPACE_SIZE - 1;
+
+    for (tmp = PCI_CONFIG_SPACE_SIZE; tmp;
+        tmp = PCI_EXT_CAP_NEXT(pci_get_long(config + tmp))) {
+        if (tmp > pos && tmp < next) {
+            next = tmp;
+        }
+    }
+
+    return next - pos;
+}
+
 static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
 {
     pci_set_word(buf, (pci_get_word(buf) & ~mask) | val);
@@ -2802,16 +2817,72 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
     return 0;
 }
 
+static int vfio_add_ext_cap(VFIOPCIDevice *vdev, const uint8_t *config)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    uint32_t header;
+    uint16_t cap_id, next, size;
+    uint8_t cap_ver;
+
+    for (next = PCI_CONFIG_SPACE_SIZE; next;
+         next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
+        header = pci_get_long(config + next);
+        cap_id = PCI_EXT_CAP_ID(header);
+        cap_ver = PCI_EXT_CAP_VER(header);
+
+        /*
+         * If it becomes important to configure extended capabilities to their
+         * actual size, use this as the default when it's something we don't
+         * recognize. Since QEMU doesn't actually handle many of the config
+         * accesses, exact size doesn't seem worthwhile.
+         */
+        size = vfio_ext_cap_max_size(config, next);
+
+        pcie_add_capability(pdev, cap_id, cap_ver, next, size);
+        if (next == PCI_CONFIG_SPACE_SIZE) {
+            /* Begin the rebuild, we should set the next offset zero. */
+            pci_set_long(pdev->config + next, PCI_EXT_CAP(cap_id, cap_ver, 0));
+        }
+
+        /* Use emulated header pointer to allow dropping extended caps */
+        pci_set_long(vdev->emulated_config_bits + next, 0xffffffff);
+    }
+
+    return 0;
+}
+
 static int vfio_add_capabilities(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
+    int ret;
+    uint8_t *config;
 
     if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST) ||
         !pdev->config[PCI_CAPABILITY_LIST]) {
         return 0; /* Nothing to add */
     }
 
-    return vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
+    ret = vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
+    if (ret) {
+        return ret;
+    }
+
+    /* on PCI bus, it doesn't make sense to expose extended capabilities. */
+    if (!pci_is_express(pdev) ||
+        !pci_bus_is_express(pdev->bus) ||
+        !pci_get_long(pdev->config + PCI_CONFIG_SPACE_SIZE)) {
+        return 0;
+    }
+
+    /*
+     * In order to avoid config space broken, here using a copy config to
+     * parse extended capabilities.
+     */
+    config = g_memdup(pdev->config, vdev->config_size);
+    ret = vfio_add_ext_cap(vdev, config);
+
+    g_free(config);
+    return ret;
 }
 
 static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 04/13] aer: impove pcie_aer_init to support vfio device
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 01/13] vfio: extract vfio_get_hot_reset_info as a single function Chen Fan
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 03/13] vfio: add pcie extanded capability support Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 05/13] vfio: add aer support for " Chen Fan
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

pcie_aer_init was used to emulate an aer capability for pcie device,
but for vfio device, the aer config space size is mutable and is not
always equal to PCI_ERR_SIZEOF(0x48). it depends on where the TLP Prefix
register required, so here we add a size argument.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/pci-bridge/ioh3420.c            | 2 +-
 hw/pci-bridge/xio3130_downstream.c | 2 +-
 hw/pci-bridge/xio3130_upstream.c   | 2 +-
 hw/pci/pcie_aer.c                  | 4 ++--
 include/hw/pci/pcie_aer.h          | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/pci-bridge/ioh3420.c b/hw/pci-bridge/ioh3420.c
index cce2fdd..4d9cd3f 100644
--- a/hw/pci-bridge/ioh3420.c
+++ b/hw/pci-bridge/ioh3420.c
@@ -129,7 +129,7 @@ static int ioh3420_initfn(PCIDevice *d)
         goto err_pcie_cap;
     }
     pcie_cap_root_init(d);
-    rc = pcie_aer_init(d, IOH_EP_AER_OFFSET);
+    rc = pcie_aer_init(d, IOH_EP_AER_OFFSET, PCI_ERR_SIZEOF);
     if (rc < 0) {
         goto err;
     }
diff --git a/hw/pci-bridge/xio3130_downstream.c b/hw/pci-bridge/xio3130_downstream.c
index b3a6479..9737041 100644
--- a/hw/pci-bridge/xio3130_downstream.c
+++ b/hw/pci-bridge/xio3130_downstream.c
@@ -92,7 +92,7 @@ static int xio3130_downstream_initfn(PCIDevice *d)
         goto err_pcie_cap;
     }
     pcie_cap_arifwd_init(d);
-    rc = pcie_aer_init(d, XIO3130_AER_OFFSET);
+    rc = pcie_aer_init(d, XIO3130_AER_OFFSET, PCI_ERR_SIZEOF);
     if (rc < 0) {
         goto err;
     }
diff --git a/hw/pci-bridge/xio3130_upstream.c b/hw/pci-bridge/xio3130_upstream.c
index eada582..4d7f894 100644
--- a/hw/pci-bridge/xio3130_upstream.c
+++ b/hw/pci-bridge/xio3130_upstream.c
@@ -81,7 +81,7 @@ static int xio3130_upstream_initfn(PCIDevice *d)
     }
     pcie_cap_flr_init(d);
     pcie_cap_deverr_init(d);
-    rc = pcie_aer_init(d, XIO3130_AER_OFFSET);
+    rc = pcie_aer_init(d, XIO3130_AER_OFFSET, PCI_ERR_SIZEOF);
     if (rc < 0) {
         goto err;
     }
diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
index b48c09c..ccf577e 100644
--- a/hw/pci/pcie_aer.c
+++ b/hw/pci/pcie_aer.c
@@ -94,12 +94,12 @@ static void aer_log_clear_all_err(PCIEAERLog *aer_log)
     aer_log->log_num = 0;
 }
 
-int pcie_aer_init(PCIDevice *dev, uint16_t offset)
+int pcie_aer_init(PCIDevice *dev, uint16_t offset, uint16_t size)
 {
     PCIExpressDevice *exp;
 
     pcie_add_capability(dev, PCI_EXT_CAP_ID_ERR, PCI_ERR_VER,
-                        offset, PCI_ERR_SIZEOF);
+                        offset, size);
     exp = &dev->exp;
     exp->aer_cap = offset;
 
diff --git a/include/hw/pci/pcie_aer.h b/include/hw/pci/pcie_aer.h
index 2fb8388..156acb0 100644
--- a/include/hw/pci/pcie_aer.h
+++ b/include/hw/pci/pcie_aer.h
@@ -87,7 +87,7 @@ struct PCIEAERErr {
 
 extern const VMStateDescription vmstate_pcie_aer_log;
 
-int pcie_aer_init(PCIDevice *dev, uint16_t offset);
+int pcie_aer_init(PCIDevice *dev, uint16_t offset, uint16_t size);
 void pcie_aer_exit(PCIDevice *dev);
 void pcie_aer_write_config(PCIDevice *dev,
                            uint32_t addr, uint32_t val, int len);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 05/13] vfio: add aer support for vfio device
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
                   ` (2 preceding siblings ...)
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 04/13] aer: impove pcie_aer_init to support vfio device Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27 21:32   ` Alex Williamson
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not Chen Fan
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

Calling pcie_aer_init to initilize aer related registers for
vfio device, then reload physical related registers to expose
device capability.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 81a4a9a..f4e7855 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -32,6 +32,7 @@
 #include "hw/pci/msi.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci.h"
+#include "hw/pci/pci_bridge.h"
 #include "qemu-common.h"
 #include "qemu/error-report.h"
 #include "qemu/event_notifier.h"
@@ -160,6 +161,8 @@ typedef struct VFIOPCIDevice {
 #define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
 #define VFIO_FEATURE_ENABLE_REQ_BIT 1
 #define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
+#define VFIO_FEATURE_ENABLE_AER_BIT 2
+#define VFIO_FEATURE_ENABLE_AER (1 << VFIO_FEATURE_ENABLE_AER_BIT)
     int32_t bootindex;
     uint8_t pm_cap;
     bool has_vga;
@@ -2817,12 +2820,78 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
     return 0;
 }
 
+static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    uint8_t *exp_cap = pdev->config + pdev->exp.exp_cap;
+    PCIDevice *dev_iter;
+    uint8_t type;
+    uint32_t severity, errcap;
+    int ret;
+
+    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER)) {
+        return 0;
+    }
+
+    dev_iter = pci_bridge_get_device(pdev->bus);
+    if (!dev_iter) {
+        goto error;
+    }
+
+    while (dev_iter) {
+        type = pcie_cap_get_type(dev_iter);
+        if ((type != PCI_EXP_TYPE_ROOT_PORT &&
+             type != PCI_EXP_TYPE_UPSTREAM &&
+             type != PCI_EXP_TYPE_DOWNSTREAM)) {
+            goto error;
+        }
+
+        if (!dev_iter->exp.aer_cap) {
+            goto error;
+        }
+
+        dev_iter = pci_bridge_get_device(dev_iter->bus);
+    }
+
+    errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4);
+    /*
+     * The ability to record multiple headers is depending on
+     * the state of the Multiple Header Recording Capable bit and
+     * enabled by the Multiple Header Recording Enable bit.
+     */
+    if ((errcap & PCI_ERR_CAP_MHRC) &&
+        (errcap & PCI_ERR_CAP_MHRE)) {
+        pdev->exp.aer_log.log_max = PCIE_AER_LOG_MAX_DEFAULT;
+    } else {
+        pdev->exp.aer_log.log_max = 0;
+    }
+
+    pcie_cap_deverr_init(pdev);
+    ret = pcie_aer_init(pdev, pos, size);
+    if (ret) {
+        return ret;
+    }
+
+    /* load physical registers */
+    severity = vfio_pci_read_config(pdev,
+                   pdev->exp.aer_cap + PCI_ERR_UNCOR_SEVER, 4);
+    pci_long_test_and_clear_mask(exp_cap + PCI_ERR_UNCOR_SEVER, ~severity);
+
+    return 0;
+
+error:
+    error_report("vfio: Unable to enable AER for device %s, parent bus "
+                 "do not support signal AER", vdev->vbasedev.name);
+    return -1;
+}
+
 static int vfio_add_ext_cap(VFIOPCIDevice *vdev, const uint8_t *config)
 {
     PCIDevice *pdev = &vdev->pdev;
     uint32_t header;
     uint16_t cap_id, next, size;
     uint8_t cap_ver;
+    int ret = 0;
 
     for (next = PCI_CONFIG_SPACE_SIZE; next;
          next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
@@ -2838,7 +2907,19 @@ static int vfio_add_ext_cap(VFIOPCIDevice *vdev, const uint8_t *config)
          */
         size = vfio_ext_cap_max_size(config, next);
 
-        pcie_add_capability(pdev, cap_id, cap_ver, next, size);
+        switch (cap_id) {
+        case PCI_EXT_CAP_ID_ERR:
+            ret = vfio_setup_aer(vdev, next, size);
+            break;
+        default:
+            pcie_add_capability(pdev, cap_id, cap_ver, next, size);
+            break;
+        }
+
+        if (ret) {
+            return ret;
+        }
+
         if (next == PCI_CONFIG_SPACE_SIZE) {
             /* Begin the rebuild, we should set the next offset zero. */
             pci_set_long(pdev->config + next, PCI_EXT_CAP(cap_id, cap_ver, 0));
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
                   ` (3 preceding siblings ...)
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 05/13] vfio: add aer support for " Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27 21:32   ` Alex Williamson
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 07/13] vfio: add check for vfio devices which enable aer should support bus reset Chen Fan
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

we introduce a has_bus_reset capability to sign the vfio
devices if support host bus reset.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index f4e7855..5934fd7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -33,6 +33,7 @@
 #include "hw/pci/msix.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/pci_bridge.h"
+#include "hw/pci/pci_bus.h"
 #include "qemu-common.h"
 #include "qemu/error-report.h"
 #include "qemu/event_notifier.h"
@@ -170,6 +171,7 @@ typedef struct VFIOPCIDevice {
     bool req_enabled;
     bool has_flr;
     bool has_pm_reset;
+    bool has_bus_reset;
     bool rom_read_failed;
 } VFIOPCIDevice;
 
@@ -203,6 +205,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
 static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                                   uint32_t val, int len);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
+static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev);
 
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
@@ -2853,6 +2856,20 @@ static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size)
         dev_iter = pci_bridge_get_device(dev_iter->bus);
     }
 
+    /*
+     * Don't check bus reset capability when device is enabled during
+     * qemu machine creation, which is done by machine init function.
+     */
+    if (DEVICE(vdev)->hotplugged) {
+        vfio_check_host_bus_reset(vdev);
+        if (!vdev->has_bus_reset) {
+            error_report("vfio: Cannot enable AER for device %s, "
+                         "which is not support host bus reset.",
+                         vdev->vbasedev.name);
+            goto error;
+        }
+    }
+
     errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4);
     /*
      * The ability to record multiple headers is depending on
@@ -3678,6 +3695,112 @@ static void vfio_setup_resetfn(VFIOPCIDevice *vdev)
     }
 }
 
+struct VfioDeviceFind {
+    PCIBus *pbus;
+    PCIDevice *pdev;
+    bool found;
+};
+
+static void find_devices(PCIBus *bus, void *opaque)
+{
+    struct VfioDeviceFind *find = opaque;
+    int i;
+
+    if (find->found == true) {
+        return;
+    }
+
+    for (i = 0; i < ARRAY_SIZE(bus->devices); i++) {
+        if (!bus->devices[i]) {
+            continue;
+        }
+
+        if (bus->devices[i] == find->pdev) {
+            find->pbus = bus;
+            find->found = true;
+            break;
+        }
+    }
+}
+
+static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev)
+{
+    PCIBus *bus = vdev->pdev.bus;
+    struct vfio_pci_hot_reset_info *info = NULL;
+    struct vfio_pci_dependent_device *devices;
+    VFIOGroup *group;
+    int ret, i;
+    bool has_bus_reset = false;
+
+    ret = vfio_get_hot_reset_info(vdev, &info);
+    if (ret < 0) {
+        goto out;
+    }
+
+    /* List all affected devices by bus reset */
+    devices = &info->devices[0];
+
+    /* Verify that we have all the groups required */
+    for (i = 0; i < info->count; i++) {
+        PCIHostDeviceAddress host;
+        VFIOPCIDevice *tmp;
+        VFIODevice *vbasedev_iter;
+
+        host.domain = devices[i].segment;
+        host.bus = devices[i].bus;
+        host.slot = PCI_SLOT(devices[i].devfn);
+        host.function = PCI_FUNC(devices[i].devfn);
+
+        /* Skip the current device */
+        if (vfio_pci_host_match(&host, &vdev->host)) {
+            continue;
+        }
+
+        /* Ensure we own the group of the affected device */
+        QLIST_FOREACH(group, &vfio_group_list, next) {
+            if (group->groupid == devices[i].group_id) {
+                break;
+            }
+        }
+
+        if (!group) {
+            goto out;
+        }
+
+        /* Ensure affected devices for reset under the same bus */
+        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
+            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
+                continue;
+            }
+            tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
+            if (vfio_pci_host_match(&host, &tmp->host)) {
+                struct VfioDeviceFind find = { .pdev = &tmp->pdev, .found = false };
+
+                pci_for_each_bus(bus, find_devices, &find);
+                if (!find.found) {
+                    goto out;
+                }
+                /*
+                 * When the check device is hotplugged to a higher bus again,
+                 * which would influence the affected device which enable aer
+                 * below the bus.
+                 */
+                if (tmp->features & VFIO_FEATURE_ENABLE_AER &&
+                    find.pbus != bus) {
+                    goto out;
+                }
+                break;
+            }
+        }
+    }
+
+    has_bus_reset = true;
+
+out:
+    vdev->has_bus_reset = has_bus_reset;
+    g_free(info);
+}
+
 static int vfio_initfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 07/13] vfio: add check for vfio devices which enable aer should support bus reset
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
                   ` (4 preceding siblings ...)
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27 21:32   ` Alex Williamson
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 08/13] pci: add bus reset_notifiers callbacks for host " Chen Fan
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5934fd7..91ad9ad 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3801,6 +3801,33 @@ out:
     g_free(info);
 }
 
+static void vfio_pci_machine_done_notify(Notifier *notifier, void *unused)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+    VFIOPCIDevice *vdev;
+
+    /* Check All support AER devices if has bus reset capability */
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+            if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
+                vfio_check_host_bus_reset(vdev);
+                if (!vdev->has_bus_reset) {
+                    error_report("vfio: Cannot enable AER for device %s, "
+                                 "which is not support host bus reset.",
+                                 vdev->vbasedev.name);
+                    exit(1);
+                }
+            }
+        }
+    }
+}
+
+static Notifier machine_notifier = {
+    .notify = vfio_pci_machine_done_notify,
+};
+
 static int vfio_initfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
@@ -4086,6 +4113,11 @@ static const TypeInfo vfio_pci_dev_info = {
 static void register_vfio_pci_dev_type(void)
 {
     type_register_static(&vfio_pci_dev_info);
+    /*
+     * Register notifier when machine init is done, since we need
+     * check the configration manner after all vfio devices are inited.
+     */
+    qemu_add_machine_init_done_notifier(&machine_notifier);
 }
 
 type_init(register_vfio_pci_dev_type)
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 08/13] pci: add bus reset_notifiers callbacks for host bus reset
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
                   ` (5 preceding siblings ...)
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 07/13] vfio: add check for vfio devices which enable aer should support bus reset Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 09/13] vfio: add sec_bus_reset notifier to notify physical bus reset is needed Chen Fan
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

Particularly, For vfio devices, Once need to recovery devices
by bus reset such as AER, we always need to reset the host bus
to recovery the devices under the bus, so we need to add pci bus
callbacks to specify to do host bus reset.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/pci/pci.c             | 16 ++++++++++++++++
 hw/pci/pci_bridge.c      |  5 +++++
 include/hw/pci/pci.h     |  4 ++++
 include/hw/pci/pci_bus.h |  2 ++
 4 files changed, 27 insertions(+)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 48f19a3..0878834 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -74,11 +74,27 @@ static const VMStateDescription vmstate_pcibus = {
     }
 };
 
+void pci_bus_add_reset_notifier(PCIBus *bus, Notifier *notify)
+{
+    notifier_list_add(&bus->reset_notifiers, notify);
+}
+
+void pci_bus_remove_reset_notifier(Notifier *notify)
+{
+    notifier_remove(notify);
+}
+
+void pci_bus_run_reset_notifier(PCIBus *bus, void *opaque)
+{
+    notifier_list_notify(&bus->reset_notifiers, opaque);
+}
+
 static void pci_bus_realize(BusState *qbus, Error **errp)
 {
     PCIBus *bus = PCI_BUS(qbus);
 
     vmstate_register(NULL, -1, &vmstate_pcibus, bus);
+    notifier_list_init(&bus->reset_notifiers);
 }
 
 static void pci_bus_unrealize(BusState *qbus, Error **errp)
diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index 40c97b1..75d2b4b 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -267,6 +267,11 @@ void pci_bridge_write_config(PCIDevice *d,
 
     newctl = pci_get_word(d->config + PCI_BRIDGE_CONTROL);
     if (~oldctl & newctl & PCI_BRIDGE_CTL_BUS_RESET) {
+        /*
+         * Notify all vfio-pci devices under the bus
+         * to do physical bus reset.
+         */
+        pci_for_each_bus(&s->sec_bus, pci_bus_run_reset_notifier, NULL);
         /* Trigger hot reset on 0->1 transition. */
         qbus_reset_all(&s->sec_bus.qbus);
     }
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 5d050c8..5a9357c 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -7,6 +7,7 @@
 #include "exec/memory.h"
 #include "sysemu/dma.h"
 #include "qapi/error.h"
+#include "qemu/notify.h"
 
 /* PCI includes legacy ISA access.  */
 #include "hw/isa/isa.h"
@@ -371,6 +372,9 @@ void pci_bus_fire_intx_routing_notifier(PCIBus *bus);
 void pci_device_set_intx_routing_notifier(PCIDevice *dev,
                                           PCIINTxRoutingNotifier notifier);
 void pci_device_reset(PCIDevice *dev);
+void pci_bus_add_reset_notifier(PCIBus *bus, Notifier *notify);
+void pci_bus_remove_reset_notifier(Notifier *notify);
+void pci_bus_run_reset_notifier(PCIBus *bus, void *opaque);
 
 PCIDevice *pci_nic_init_nofail(NICInfo *nd, PCIBus *rootbus,
                                const char *default_model,
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index fabaeee..3b551d7 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -29,6 +29,8 @@ struct PCIBus {
        Keep a count of the number of devices with raised IRQs.  */
     int nirq;
     int *irq_count;
+
+    NotifierList reset_notifiers;
 };
 
 typedef struct PCIBridgeWindows PCIBridgeWindows;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 09/13] vfio: add sec_bus_reset notifier to notify physical bus reset is needed
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
                   ` (6 preceding siblings ...)
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 08/13] pci: add bus reset_notifiers callbacks for host " Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27 21:32   ` Alex Williamson
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 10/13] vfio: do hot bus reset when do virtual secondary bus reset Chen Fan
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 91ad9ad..a8c5988 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -156,6 +156,7 @@ typedef struct VFIOPCIDevice {
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
+    Notifier sec_bus_reset_notifier;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t features;
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
@@ -172,6 +173,7 @@ typedef struct VFIOPCIDevice {
     bool has_flr;
     bool has_pm_reset;
     bool has_bus_reset;
+    bool needs_bus_reset;
     bool rom_read_failed;
 } VFIOPCIDevice;
 
@@ -3828,6 +3830,17 @@ static Notifier machine_notifier = {
     .notify = vfio_pci_machine_done_notify,
 };
 
+static void vfio_pci_host_needs_bus_reset(Notifier *n, void *opaque)
+{
+    VFIOPCIDevice *vdev = container_of(n, VFIOPCIDevice, sec_bus_reset_notifier);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+
+    if (vdev->has_bus_reset) {
+        vdev->needs_bus_reset = true;
+        vbasedev->needs_reset = true;
+    }
+}
+
 static int vfio_initfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
@@ -3978,6 +3991,9 @@ static int vfio_initfn(PCIDevice *pdev)
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn(vdev);
 
+    vdev->sec_bus_reset_notifier.notify = vfio_pci_host_needs_bus_reset;
+    pci_bus_add_reset_notifier(pdev->bus, &vdev->sec_bus_reset_notifier);
+
     return 0;
 
 out_teardown:
@@ -4006,6 +4022,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
+    pci_bus_remove_reset_notifier(&vdev->sec_bus_reset_notifier);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     vfio_disable_interrupts(vdev);
     if (vdev->intx.mmap_timer) {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 10/13] vfio: do hot bus reset when do virtual secondary bus reset
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
                   ` (7 preceding siblings ...)
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 09/13] vfio: add sec_bus_reset notifier to notify physical bus reset is needed Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27 21:33   ` Alex Williamson
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 11/13] pcie_aer: expose pcie_aer_msg() interface Chen Fan
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

when do virtual secondary bus reset, the vfio device under
this bus need to do host bus reset to reset the device.
so add this case.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a8c5988..b05ccdf 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3803,6 +3803,64 @@ out:
     g_free(info);
 }
 
+/* Check when device reset the affected devices is single or multi,
+ * because there was differentiate the hot reset of mulitple in-use
+ * devices and hot reset of a single in-use device. */
+static int vfio_pci_affect_devices_is_multi(VFIOPCIDevice *vdev)
+{
+    VFIOGroup *group;
+    struct vfio_pci_hot_reset_info *info = NULL;
+    struct vfio_pci_dependent_device *devices;
+    int ret, i, count = 0;
+
+    ret = vfio_get_hot_reset_info(vdev, &info);
+    if (ret < 0) {
+        goto out;
+    }
+
+    devices = &info->devices[0];
+
+    /* Verify that we have all the groups required */
+    for (i = 0; i < info->count; i++) {
+        PCIHostDeviceAddress host;
+        VFIODevice *vbasedev_iter;
+
+        host.domain = devices[i].segment;
+        host.bus = devices[i].bus;
+        host.slot = PCI_SLOT(devices[i].devfn);
+        host.function = PCI_FUNC(devices[i].devfn);
+
+        /* Skip the current device */
+        if (vfio_pci_host_match(&host, &vdev->host)) {
+            continue;
+        }
+
+        /* Ensure we own the group of the affected device */
+        QLIST_FOREACH(group, &vfio_group_list, next) {
+            if (group->groupid == devices[i].group_id) {
+                break;
+            }
+        }
+
+        if (!group) {
+            ret = -1;
+            goto out;
+        }
+
+        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
+            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
+                continue;
+            }
+            count++;
+        }
+    }
+
+    ret = count;
+out:
+    g_free(info);
+    return ret;
+}
+
 static void vfio_pci_machine_done_notify(Notifier *notifier, void *unused)
 {
     VFIOGroup *group;
@@ -4039,6 +4097,23 @@ static void vfio_pci_reset(DeviceState *dev)
 
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
+    if (vdev->needs_bus_reset) {
+        vdev->needs_bus_reset = false;
+        if (vdev->vbasedev.needs_reset) {
+            int ret;
+
+            ret = vfio_pci_affect_devices_is_multi(vdev);
+            if (ret < 0) {
+                error_report("vfio: Cannot reset device %04x:%02x:%02x.%x, "
+                             "get hot reset info failed.", vdev->host.domain,
+                             vdev->host.bus, vdev->host.slot, vdev->host.function);
+            } else {
+                vfio_pci_hot_reset(vdev, ret ? true : false);
+            }
+        }
+        return;
+    }
+
     vfio_pci_pre_reset(vdev);
 
     if (vdev->resetfn && !vdev->resetfn(vdev)) {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 11/13] pcie_aer: expose pcie_aer_msg() interface
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
                   ` (8 preceding siblings ...)
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 10/13] vfio: do hot bus reset when do virtual secondary bus reset Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 12/13] vfio-pci: pass the aer error to guest Chen Fan
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 13/13] vfio: add 'aer' property to expose aercap Chen Fan
  11 siblings, 0 replies; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

For vfio device, we need to propagate the aer error to
Guest OS. we use the pcie_aer_msg() to send aer error
to guest.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/pci/pcie_aer.c         | 2 +-
 include/hw/pci/pcie_aer.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
index ccf577e..58010c4 100644
--- a/hw/pci/pcie_aer.c
+++ b/hw/pci/pcie_aer.c
@@ -370,7 +370,7 @@ static void pcie_aer_msg_root_port(PCIDevice *dev, const PCIEAERMsg *msg)
  *
  * Walk up the bus tree from the device, propagate the error message.
  */
-static void pcie_aer_msg(PCIDevice *dev, const PCIEAERMsg *msg)
+void pcie_aer_msg(PCIDevice *dev, const PCIEAERMsg *msg)
 {
     uint8_t type;
 
diff --git a/include/hw/pci/pcie_aer.h b/include/hw/pci/pcie_aer.h
index 156acb0..c2ee4e2 100644
--- a/include/hw/pci/pcie_aer.h
+++ b/include/hw/pci/pcie_aer.h
@@ -102,5 +102,6 @@ void pcie_aer_root_write_config(PCIDevice *dev,
 
 /* error injection */
 int pcie_aer_inject_error(PCIDevice *dev, const PCIEAERErr *err);
+void pcie_aer_msg(PCIDevice *dev, const PCIEAERMsg *msg);
 
 #endif /* QEMU_PCIE_AER_H */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 12/13] vfio-pci: pass the aer error to guest
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
                   ` (9 preceding siblings ...)
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 11/13] pcie_aer: expose pcie_aer_msg() interface Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  2015-05-27 21:33   ` Alex Williamson
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 13/13] vfio: add 'aer' property to expose aercap Chen Fan
  11 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

when the vfio device encounters an uncorrectable error in host,
the vfio_pci driver will signal the eventfd registered by this
vfio device, the results in the qemu eventfd handler getting
invoked.

this patch is to pass the error to guest and have the guest driver
recover from the error.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b05ccdf..855a0a6 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3348,18 +3348,48 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
 static void vfio_err_notifier_handler(void *opaque)
 {
     VFIOPCIDevice *vdev = opaque;
+    PCIDevice *dev = &vdev->pdev;
+    PCIEAERMsg msg = {
+        .severity = 0,
+        .source_id = (pci_bus_num(dev->bus) << 8) | dev->devfn,
+    };
 
     if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
         return;
     }
 
     /*
-     * TBD. Retrieve the error details and decide what action
-     * needs to be taken. One of the actions could be to pass
-     * the error to the guest and have the guest driver recover
-     * from the error. This requires that PCIe capabilities be
-     * exposed to the guest. For now, we just terminate the
-     * guest to contain the error.
+     * in case the real hardware configration has been changed,
+     * here we should recheck the bus reset capability.
+     */
+    vfio_check_host_bus_reset(vdev);
+
+    /*
+     * we should read the error details from the real hardware
+     * configuration spaces, here we only need to do is signaling
+     * to guest an uncorrectable error has occurred.
+     */
+    if (dev->exp.aer_cap &&
+        vdev->has_bus_reset) {
+        uint8_t *aer_cap = dev->config + dev->exp.aer_cap;
+        uint32_t uncor_status;
+        bool isfatal;
+
+        uncor_status = vfio_pci_read_config(dev,
+                           dev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
+
+        isfatal = uncor_status & pci_get_long(aer_cap + PCI_ERR_UNCOR_SEVER);
+
+        msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
+                                 PCI_ERR_ROOT_CMD_NONFATAL_EN;
+
+        pcie_aer_msg(dev, &msg);
+        return;
+    }
+
+    /*
+     * If the aer capability is not exposed to the guest. we just
+     * terminate the guest to contain the error.
      */
 
     error_report("%s(%04x:%02x:%02x.%x) Unrecoverable error detected.  "
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [RFC v8.1 13/13] vfio: add 'aer' property to expose aercap
  2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
                   ` (10 preceding siblings ...)
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 12/13] vfio-pci: pass the aer error to guest Chen Fan
@ 2015-05-27  2:46 ` Chen Fan
  11 siblings, 0 replies; 27+ messages in thread
From: Chen Fan @ 2015-05-27  2:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: izumi.taku, alex.williamson

add 'aer' property to let user able to decide whether expose
the aer capability. by default we should disable aer feature,
because it needs configuration restrictions.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
---
 hw/vfio/pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 855a0a6..161479b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -4191,6 +4191,8 @@ static Property vfio_pci_dev_properties[] = {
                     VFIO_FEATURE_ENABLE_VGA_BIT, false),
     DEFINE_PROP_BIT("x-req", VFIOPCIDevice, features,
                     VFIO_FEATURE_ENABLE_REQ_BIT, true),
+    DEFINE_PROP_BIT("aer", VFIOPCIDevice, features,
+                    VFIO_FEATURE_ENABLE_AER_BIT, false),
     DEFINE_PROP_INT32("bootindex", VFIOPCIDevice, bootindex, -1),
     DEFINE_PROP_BOOL("x-mmap", VFIOPCIDevice, vbasedev.allow_mmap, true),
     /*
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 01/13] vfio: extract vfio_get_hot_reset_info as a single function
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 01/13] vfio: extract vfio_get_hot_reset_info as a single function Chen Fan
@ 2015-05-27 21:31   ` Alex Williamson
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Williamson @ 2015-05-27 21:31 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> the function is used to get affected devices by bus reset.
> so here extract it, and can used for aer soon.
> 
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 67 +++++++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 49 insertions(+), 18 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index e0e339a..9c05304 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2641,6 +2641,50 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
>      }
>  }
>  
> +/*
> + * return negative with errno, return 0 on success.
> + * if success, the point of ret_info fill with the affected device reset info.
> + *
> + */
> +static int vfio_get_hot_reset_info(VFIOPCIDevice *vdev,
> +                                   struct vfio_pci_hot_reset_info **ret_info)
> +{
> +    struct vfio_pci_hot_reset_info *info;
> +    int ret, count;
> +
> +    *ret_info = NULL;
> +
> +    info = g_malloc0(sizeof(*info));
> +    info->argsz = sizeof(*info);
> +
> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
> +    if (ret && errno != ENOSPC) {
> +        ret = -errno;
> +        goto error;
> +    }
> +
> +    count = info->count;
> +
> +    info = g_realloc(info, sizeof(*info) +
> +                     (count * sizeof(struct vfio_pci_dependent_device)));
> +    info->argsz = sizeof(*info) +
> +                  (count * sizeof(struct vfio_pci_dependent_device));
> +
> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
> +    if (ret) {
> +        ret = -errno;
> +        goto error;
> +    }
> +
> +    *ret_info = info;
> +    info = NULL;
> +
> +    ret = 0;

Just return 0 here.

> +error:
> +    g_free(info);
> +    return ret;
> +}
> +
>  static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
>  {
>      PCIDevice *pdev = &vdev->pdev;
> @@ -2780,7 +2824,7 @@ static bool vfio_pci_host_match(PCIHostDeviceAddress *host1,
>  static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
>  {
>      VFIOGroup *group;
> -    struct vfio_pci_hot_reset_info *info;
> +    struct vfio_pci_hot_reset_info *info = NULL;
>      struct vfio_pci_dependent_device *devices;
>      struct vfio_pci_hot_reset *reset;
>      int32_t *fds;
> @@ -2792,32 +2836,19 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
>      vfio_pci_pre_reset(vdev);
>      vdev->vbasedev.needs_reset = false;
>  
> -    info = g_malloc0(sizeof(*info));
> -    info->argsz = sizeof(*info);
> -
> -    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
> -    if (ret && errno != ENOSPC) {
> -        ret = -errno;
> +    ret = vfio_get_hot_reset_info(vdev, &info);
> +    if (ret) {
>          if (!vdev->has_pm_reset) {
>              error_report("vfio: Cannot reset device %04x:%02x:%02x.%x, "
>                           "no available reset mechanism.", vdev->host.domain,
>                           vdev->host.bus, vdev->host.slot, vdev->host.function);
> +        } else {
> +            error_report("vfio: hot reset info failed: %m");

What if we pass through a root complex device where we can't do a bus
reset.  With this change, we get this useless error_report every time we
come through this path.  There are cases where this is not an error and
should not be reported to the user.

>          }
>          goto out_single;
>      }
>  
> -    count = info->count;
> -    info = g_realloc(info, sizeof(*info) + (count * sizeof(*devices)));
> -    info->argsz = sizeof(*info) + (count * sizeof(*devices));
>      devices = &info->devices[0];
> -
> -    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
> -    if (ret) {
> -        ret = -errno;
> -        error_report("vfio: hot reset info failed: %m");
> -        goto out_single;
> -    }
> -
>      trace_vfio_pci_hot_reset_has_dep_devices(vdev->vbasedev.name);
>  
>      /* Verify that we have all the groups required */

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 03/13] vfio: add pcie extanded capability support
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 03/13] vfio: add pcie extanded capability support Chen Fan
@ 2015-05-27 21:31   ` Alex Williamson
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Williamson @ 2015-05-27 21:31 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> For vfio pcie device, we could expose the extended capability on
> PCIE bus. in order to avoid config space broken, we introduce
> a copy config for parsing extended caps. and rebuild the pcie
> extended config space.
> 
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 78f4c82..81a4a9a 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2493,6 +2493,21 @@ static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
>      return next - pos;
>  }
>  
> +
> +static uint16_t vfio_ext_cap_max_size(const uint8_t *config, uint16_t pos)
> +{
> +    uint16_t tmp, next = PCIE_CONFIG_SPACE_SIZE - 1;
> +
> +    for (tmp = PCI_CONFIG_SPACE_SIZE; tmp;
> +        tmp = PCI_EXT_CAP_NEXT(pci_get_long(config + tmp))) {
> +        if (tmp > pos && tmp < next) {
> +            next = tmp;
> +        }
> +    }
> +
> +    return next - pos;
> +}


A bug in the existing vfio_std_cap_max_size() too, but shouldn't 'next'
start at PCIE_CONFIG_SPACE_SIZE instead of PCIE_CONFIG_SPACE_SIZE-1?
Capabilities need to be DWORD aligned, so say I have a capability
starting 4 bytes from the end, 0xFFC.  The max size should be 4 bytes,
0x1000 - 0xFFC, not 3 bytes, 0xFFF - 0xFFC.

> +
>  static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
>  {
>      pci_set_word(buf, (pci_get_word(buf) & ~mask) | val);
> @@ -2802,16 +2817,72 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
>      return 0;
>  }
>  
> +static int vfio_add_ext_cap(VFIOPCIDevice *vdev, const uint8_t *config)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint32_t header;
> +    uint16_t cap_id, next, size;
> +    uint8_t cap_ver;
> +
> +    for (next = PCI_CONFIG_SPACE_SIZE; next;
> +         next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> +        header = pci_get_long(config + next);
> +        cap_id = PCI_EXT_CAP_ID(header);
> +        cap_ver = PCI_EXT_CAP_VER(header);
> +
> +        /*
> +         * If it becomes important to configure extended capabilities to their
> +         * actual size, use this as the default when it's something we don't
> +         * recognize. Since QEMU doesn't actually handle many of the config
> +         * accesses, exact size doesn't seem worthwhile.
> +         */
> +        size = vfio_ext_cap_max_size(config, next);
> +
> +        pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> +        if (next == PCI_CONFIG_SPACE_SIZE) {
> +            /* Begin the rebuild, we should set the next offset zero. */
> +            pci_set_long(pdev->config + next, PCI_EXT_CAP(cap_id, cap_ver, 0));
> +        }
> +
> +        /* Use emulated header pointer to allow dropping extended caps */
> +        pci_set_long(vdev->emulated_config_bits + next, 0xffffffff);
> +    }
> +
> +    return 0;
> +}
> +
>  static int vfio_add_capabilities(VFIOPCIDevice *vdev)
>  {
>      PCIDevice *pdev = &vdev->pdev;
> +    int ret;
> +    uint8_t *config;
>  
>      if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST) ||
>          !pdev->config[PCI_CAPABILITY_LIST]) {
>          return 0; /* Nothing to add */
>      }
>  
> -    return vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
> +    ret = vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    /* on PCI bus, it doesn't make sense to expose extended capabilities. */
> +    if (!pci_is_express(pdev) ||
> +        !pci_bus_is_express(pdev->bus) ||
> +        !pci_get_long(pdev->config + PCI_CONFIG_SPACE_SIZE)) {
> +        return 0;
> +    }
> +
> +    /*
> +     * In order to avoid config space broken, here using a copy config to
> +     * parse extended capabilities.

I think this should read something more like "In order to avoid breaking
config space, create a copy to use for parsing extended capabilities".

> +     */
> +    config = g_memdup(pdev->config, vdev->config_size);
> +    ret = vfio_add_ext_cap(vdev, config);
> +
> +    g_free(config);
> +    return ret;
>  }
>  
>  static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 05/13] vfio: add aer support for vfio device
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 05/13] vfio: add aer support for " Chen Fan
@ 2015-05-27 21:32   ` Alex Williamson
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Williamson @ 2015-05-27 21:32 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> Calling pcie_aer_init to initilize aer related registers for
> vfio device, then reload physical related registers to expose
> device capability.
> 
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 82 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 81a4a9a..f4e7855 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -32,6 +32,7 @@
>  #include "hw/pci/msi.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pci.h"
> +#include "hw/pci/pci_bridge.h"
>  #include "qemu-common.h"
>  #include "qemu/error-report.h"
>  #include "qemu/event_notifier.h"
> @@ -160,6 +161,8 @@ typedef struct VFIOPCIDevice {
>  #define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
>  #define VFIO_FEATURE_ENABLE_REQ_BIT 1
>  #define VFIO_FEATURE_ENABLE_REQ (1 << VFIO_FEATURE_ENABLE_REQ_BIT)
> +#define VFIO_FEATURE_ENABLE_AER_BIT 2
> +#define VFIO_FEATURE_ENABLE_AER (1 << VFIO_FEATURE_ENABLE_AER_BIT)
>      int32_t bootindex;
>      uint8_t pm_cap;
>      bool has_vga;
> @@ -2817,12 +2820,78 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
>      return 0;
>  }
>  
> +static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint8_t *exp_cap = pdev->config + pdev->exp.exp_cap;
> +    PCIDevice *dev_iter;
> +    uint8_t type;
> +    uint32_t severity, errcap;
> +    int ret;
> +
> +    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER)) {
> +        return 0;

There's a subtle behavioral change here that we not only do not enable
AER through QEMU, but we hide the AER capability if the AER feature bit
is not enabled.  Do we want to use pcie_add_capability() here to
continue exposing the dummy AER capability?

> +    }
> +
> +    dev_iter = pci_bridge_get_device(pdev->bus);
> +    if (!dev_iter) {
> +        goto error;
> +    }
> +
> +    while (dev_iter) {
> +        type = pcie_cap_get_type(dev_iter);
> +        if ((type != PCI_EXP_TYPE_ROOT_PORT &&
> +             type != PCI_EXP_TYPE_UPSTREAM &&
> +             type != PCI_EXP_TYPE_DOWNSTREAM)) {
> +            goto error;
> +        }
> +
> +        if (!dev_iter->exp.aer_cap) {
> +            goto error;
> +        }
> +
> +        dev_iter = pci_bridge_get_device(dev_iter->bus);
> +    }
> +
> +    errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4);
> +    /*
> +     * The ability to record multiple headers is depending on
> +     * the state of the Multiple Header Recording Capable bit and
> +     * enabled by the Multiple Header Recording Enable bit.
> +     */
> +    if ((errcap & PCI_ERR_CAP_MHRC) &&
> +        (errcap & PCI_ERR_CAP_MHRE)) {
> +        pdev->exp.aer_log.log_max = PCIE_AER_LOG_MAX_DEFAULT;
> +    } else {
> +        pdev->exp.aer_log.log_max = 0;
> +    }
> +
> +    pcie_cap_deverr_init(pdev);
> +    ret = pcie_aer_init(pdev, pos, size);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    /* load physical registers */
> +    severity = vfio_pci_read_config(pdev,
> +                   pdev->exp.aer_cap + PCI_ERR_UNCOR_SEVER, 4);
> +    pci_long_test_and_clear_mask(exp_cap + PCI_ERR_UNCOR_SEVER, ~severity);
> +
> +    return 0;
> +
> +error:
> +    error_report("vfio: Unable to enable AER for device %s, parent bus "
> +                 "do not support signal AER", vdev->vbasedev.name);

"does not support AER signaling"

> +    return -1;
> +}
> +
>  static int vfio_add_ext_cap(VFIOPCIDevice *vdev, const uint8_t *config)
>  {
>      PCIDevice *pdev = &vdev->pdev;
>      uint32_t header;
>      uint16_t cap_id, next, size;
>      uint8_t cap_ver;
> +    int ret = 0;
>  
>      for (next = PCI_CONFIG_SPACE_SIZE; next;
>           next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> @@ -2838,7 +2907,19 @@ static int vfio_add_ext_cap(VFIOPCIDevice *vdev, const uint8_t *config)
>           */
>          size = vfio_ext_cap_max_size(config, next);
>  
> -        pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> +        switch (cap_id) {
> +        case PCI_EXT_CAP_ID_ERR:
> +            ret = vfio_setup_aer(vdev, next, size);
> +            break;
> +        default:
> +            pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> +            break;
> +        }
> +
> +        if (ret) {
> +            return ret;
> +        }
> +
>          if (next == PCI_CONFIG_SPACE_SIZE) {
>              /* Begin the rebuild, we should set the next offset zero. */
>              pci_set_long(pdev->config + next, PCI_EXT_CAP(cap_id, cap_ver, 0));

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not Chen Fan
@ 2015-05-27 21:32   ` Alex Williamson
  2015-06-02  7:54     ` Chen Fan
  0 siblings, 1 reply; 27+ messages in thread
From: Alex Williamson @ 2015-05-27 21:32 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> we introduce a has_bus_reset capability to sign the vfio
> devices if support host bus reset.
> 
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 123 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index f4e7855..5934fd7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -33,6 +33,7 @@
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pci.h"
>  #include "hw/pci/pci_bridge.h"
> +#include "hw/pci/pci_bus.h"
>  #include "qemu-common.h"
>  #include "qemu/error-report.h"
>  #include "qemu/event_notifier.h"
> @@ -170,6 +171,7 @@ typedef struct VFIOPCIDevice {
>      bool req_enabled;
>      bool has_flr;
>      bool has_pm_reset;
> +    bool has_bus_reset;

I still think that caching this is a bad idea, there's no point at which
we can blindly assume the capability is still present.

>      bool rom_read_failed;
>  } VFIOPCIDevice;
>  
> @@ -203,6 +205,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>  static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>                                    uint32_t val, int len);
>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev);
>  
>  /*
>   * Disabling BAR mmaping can be slow, but toggling it around INTx can
> @@ -2853,6 +2856,20 @@ static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size)
>          dev_iter = pci_bridge_get_device(dev_iter->bus);
>      }
>  
> +    /*
> +     * Don't check bus reset capability when device is enabled during
> +     * qemu machine creation, which is done by machine init function.
> +     */
> +    if (DEVICE(vdev)->hotplugged) {
> +        vfio_check_host_bus_reset(vdev);
> +        if (!vdev->has_bus_reset) {
> +            error_report("vfio: Cannot enable AER for device %s, "
> +                         "which is not support host bus reset.",

"which does not support host bus reset."

> +                         vdev->vbasedev.name);
> +            goto error;
> +        }
> +    }
> +
>      errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4);
>      /*
>       * The ability to record multiple headers is depending on
> @@ -3678,6 +3695,112 @@ static void vfio_setup_resetfn(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +struct VfioDeviceFind {

We use VFIOFooBar for all other camel case definitions, much like PCIBus
and PCIDevice below.

> +    PCIBus *pbus;
> +    PCIDevice *pdev;
> +    bool found;
> +};
> +
> +static void find_devices(PCIBus *bus, void *opaque)
> +{
> +    struct VfioDeviceFind *find = opaque;
> +    int i;
> +
> +    if (find->found == true) {

if (find->found) {...

> +        return;
> +    }
> +
> +    for (i = 0; i < ARRAY_SIZE(bus->devices); i++) {
> +        if (!bus->devices[i]) {
> +            continue;
> +        }
> +
> +        if (bus->devices[i] == find->pdev) {
> +            find->pbus = bus;
> +            find->found = true;
> +            break;
> +        }
> +    }
> +}
> +
> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev)
> +{
> +    PCIBus *bus = vdev->pdev.bus;
> +    struct vfio_pci_hot_reset_info *info = NULL;
> +    struct vfio_pci_dependent_device *devices;
> +    VFIOGroup *group;
> +    int ret, i;
> +    bool has_bus_reset = false;
> +
> +    ret = vfio_get_hot_reset_info(vdev, &info);
> +    if (ret < 0) {

if (ret) {...

> +        goto out;
> +    }
> +
> +    /* List all affected devices by bus reset */
> +    devices = &info->devices[0];
> +
> +    /* Verify that we have all the groups required */
> +    for (i = 0; i < info->count; i++) {
> +        PCIHostDeviceAddress host;
> +        VFIOPCIDevice *tmp;
> +        VFIODevice *vbasedev_iter;
> +
> +        host.domain = devices[i].segment;
> +        host.bus = devices[i].bus;
> +        host.slot = PCI_SLOT(devices[i].devfn);
> +        host.function = PCI_FUNC(devices[i].devfn);
> +
> +        /* Skip the current device */
> +        if (vfio_pci_host_match(&host, &vdev->host)) {
> +            continue;
> +        }
> +
> +        /* Ensure we own the group of the affected device */
> +        QLIST_FOREACH(group, &vfio_group_list, next) {
> +            if (group->groupid == devices[i].group_id) {
> +                break;
> +            }
> +        }
> +
> +        if (!group) {
> +            goto out;
> +        }
> +
> +        /* Ensure affected devices for reset under the same bus */
> +        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> +            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> +                continue;
> +            }
> +            tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
> +            if (vfio_pci_host_match(&host, &tmp->host)) {
> +                struct VfioDeviceFind find = { .pdev = &tmp->pdev, .found = false };
> +
> +                pci_for_each_bus(bus, find_devices, &find);
> +                if (!find.found) {
> +                    goto out;
> +                }
> +                /*
> +                 * When the check device is hotplugged to a higher bus again,
> +                 * which would influence the affected device which enable aer
> +                 * below the bus.
> +                 */
> +                if (tmp->features & VFIO_FEATURE_ENABLE_AER &&
> +                    find.pbus != bus) {
> +                    goto out;
> +                }

I think what you're trying to do here is assume that if a reset of A
affects B, then a reset of B affects A, and if B is on a subordinate bus
from A, then our configuration is broken, right?  Can we really
guarantee that assumption?  If we had a physical topology that mirrored
this virtual topology, that wouldn't necessarily be true.  For instance,
if A was a function of a multi-function device where another function
was a PCIe upstream switch, B could be subordinate to that switch, so a
reset of A affects B, but a reset of B doesn't affect A.

> +                break;
> +            }
> +        }
> +    }
> +
> +    has_bus_reset = true;
> +
> +out:
> +    vdev->has_bus_reset = has_bus_reset;

I don't see any value is storing this, it can't be trusted at any point
in the future.  I think that any time we add a device or think about
forwarding an AER message to the guest, we need to do a validation pass,
testing the entire topology.

To do that we'd iterate through every device in every group for PCI
devices that support AER.  For each one, get the hot reset info for the
affected device list.  For each affected device, if it's attached to the
VM, it needs to be on or below the bus of the target device.
Additionally, we should test all of the non-AER supporting vfio-pci
devices on or below the target bus to verify they have a reset
mechanism.  Run that function for each vfio-pci device as it's added,
regardless of whether it's hotplug.  If the test fails, fail the initfn
for the current device.  Also run the test prior to AER injection, if it
fails demote the AER injection to a machine halt as we have now.

> +    g_free(info);
> +}
> +
>  static int vfio_initfn(PCIDevice *pdev)
>  {
>      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 07/13] vfio: add check for vfio devices which enable aer should support bus reset
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 07/13] vfio: add check for vfio devices which enable aer should support bus reset Chen Fan
@ 2015-05-27 21:32   ` Alex Williamson
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Williamson @ 2015-05-27 21:32 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5934fd7..91ad9ad 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3801,6 +3801,33 @@ out:
>      g_free(info);
>  }
>  
> +static void vfio_pci_machine_done_notify(Notifier *notifier, void *unused)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +    VFIOPCIDevice *vdev;
> +
> +    /* Check All support AER devices if has bus reset capability */
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +            if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
> +                vfio_check_host_bus_reset(vdev);
> +                if (!vdev->has_bus_reset) {
> +                    error_report("vfio: Cannot enable AER for device %s, "
> +                                 "which is not support host bus reset.",
> +                                 vdev->vbasedev.name);
> +                    exit(1);
> +                }
> +            }
> +        }
> +    }
> +}
> +
> +static Notifier machine_notifier = {
> +    .notify = vfio_pci_machine_done_notify,
> +};
> +
>  static int vfio_initfn(PCIDevice *pdev)
>  {
>      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> @@ -4086,6 +4113,11 @@ static const TypeInfo vfio_pci_dev_info = {
>  static void register_vfio_pci_dev_type(void)
>  {
>      type_register_static(&vfio_pci_dev_info);
> +    /*
> +     * Register notifier when machine init is done, since we need
> +     * check the configration manner after all vfio devices are inited.
> +     */
> +    qemu_add_machine_init_done_notifier(&machine_notifier);
>  }
>  
>  type_init(register_vfio_pci_dev_type)

I don't see what we're gaining by waiting until after machine init done
to test everything.  It makes cold-plug a different case than hot-plug,
which makes things error prone.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 09/13] vfio: add sec_bus_reset notifier to notify physical bus reset is needed
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 09/13] vfio: add sec_bus_reset notifier to notify physical bus reset is needed Chen Fan
@ 2015-05-27 21:32   ` Alex Williamson
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Williamson @ 2015-05-27 21:32 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 91ad9ad..a8c5988 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -156,6 +156,7 @@ typedef struct VFIOPCIDevice {
>      PCIHostDeviceAddress host;
>      EventNotifier err_notifier;
>      EventNotifier req_notifier;
> +    Notifier sec_bus_reset_notifier;
>      int (*resetfn)(struct VFIOPCIDevice *);
>      uint32_t features;
>  #define VFIO_FEATURE_ENABLE_VGA_BIT 0
> @@ -172,6 +173,7 @@ typedef struct VFIOPCIDevice {
>      bool has_flr;
>      bool has_pm_reset;
>      bool has_bus_reset;
> +    bool needs_bus_reset;
>      bool rom_read_failed;
>  } VFIOPCIDevice;
>  
> @@ -3828,6 +3830,17 @@ static Notifier machine_notifier = {
>      .notify = vfio_pci_machine_done_notify,
>  };
>  
> +static void vfio_pci_host_needs_bus_reset(Notifier *n, void *opaque)
> +{
> +    VFIOPCIDevice *vdev = container_of(n, VFIOPCIDevice, sec_bus_reset_notifier);
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +
> +    if (vdev->has_bus_reset) {
> +        vdev->needs_bus_reset = true;
> +        vbasedev->needs_reset = true;

We really can't trust has_bus_reset, so this could really just be
features & AER, which then begs the question why we register a notifier
for devices that don't support the AER feature.

> +    }
> +}
> +
>  static int vfio_initfn(PCIDevice *pdev)
>  {
>      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> @@ -3978,6 +3991,9 @@ static int vfio_initfn(PCIDevice *pdev)
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn(vdev);
>  
> +    vdev->sec_bus_reset_notifier.notify = vfio_pci_host_needs_bus_reset;
> +    pci_bus_add_reset_notifier(pdev->bus, &vdev->sec_bus_reset_notifier);
> +
>      return 0;
>  
>  out_teardown:
> @@ -4006,6 +4022,7 @@ static void vfio_exitfn(PCIDevice *pdev)
>  
>      vfio_unregister_req_notifier(vdev);
>      vfio_unregister_err_notifier(vdev);
> +    pci_bus_remove_reset_notifier(&vdev->sec_bus_reset_notifier);
>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>      vfio_disable_interrupts(vdev);
>      if (vdev->intx.mmap_timer) {

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 10/13] vfio: do hot bus reset when do virtual secondary bus reset
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 10/13] vfio: do hot bus reset when do virtual secondary bus reset Chen Fan
@ 2015-05-27 21:33   ` Alex Williamson
       [not found]     ` <557020F1.7070705@cn.fujitsu.com>
  0 siblings, 1 reply; 27+ messages in thread
From: Alex Williamson @ 2015-05-27 21:33 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> when do virtual secondary bus reset, the vfio device under
> this bus need to do host bus reset to reset the device.
> so add this case.
> 
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 75 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a8c5988..b05ccdf 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3803,6 +3803,64 @@ out:
>      g_free(info);
>  }
>  
> +/* Check when device reset the affected devices is single or multi,
> + * because there was differentiate the hot reset of mulitple in-use
> + * devices and hot reset of a single in-use device. */
> +static int vfio_pci_affect_devices_is_multi(VFIOPCIDevice *vdev)
> +{
> +    VFIOGroup *group;
> +    struct vfio_pci_hot_reset_info *info = NULL;
> +    struct vfio_pci_dependent_device *devices;
> +    int ret, i, count = 0;
> +
> +    ret = vfio_get_hot_reset_info(vdev, &info);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    devices = &info->devices[0];
> +
> +    /* Verify that we have all the groups required */
> +    for (i = 0; i < info->count; i++) {
> +        PCIHostDeviceAddress host;
> +        VFIODevice *vbasedev_iter;
> +
> +        host.domain = devices[i].segment;
> +        host.bus = devices[i].bus;
> +        host.slot = PCI_SLOT(devices[i].devfn);
> +        host.function = PCI_FUNC(devices[i].devfn);
> +
> +        /* Skip the current device */
> +        if (vfio_pci_host_match(&host, &vdev->host)) {
> +            continue;
> +        }
> +
> +        /* Ensure we own the group of the affected device */
> +        QLIST_FOREACH(group, &vfio_group_list, next) {
> +            if (group->groupid == devices[i].group_id) {
> +                break;
> +            }
> +        }
> +
> +        if (!group) {
> +            ret = -1;
> +            goto out;
> +        }
> +
> +        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> +            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> +                continue;
> +            }
> +            count++;
> +        }
> +    }
> +
> +    ret = count;
> +out:
> +    g_free(info);
> +    return ret;
> +}

This basically does almost all the work of the hot reset function just
to figure out whether it's a single or multi-reset to call the hot reset
function with the right parameter.  Why not just do a hot reset?

> +
>  static void vfio_pci_machine_done_notify(Notifier *notifier, void *unused)
>  {
>      VFIOGroup *group;
> @@ -4039,6 +4097,23 @@ static void vfio_pci_reset(DeviceState *dev)
>  
>      trace_vfio_pci_reset(vdev->vbasedev.name);
>  
> +    if (vdev->needs_bus_reset) {
> +        vdev->needs_bus_reset = false;

A comment would be nice to signal that this secondary test avoids
duplicate bus resets.

> +        if (vdev->vbasedev.needs_reset) {
> +            int ret;
> +
> +            ret = vfio_pci_affect_devices_is_multi(vdev);
> +            if (ret < 0) {
> +                error_report("vfio: Cannot reset device %04x:%02x:%02x.%x, "
> +                             "get hot reset info failed.", vdev->host.domain,
> +                             vdev->host.bus, vdev->host.slot, vdev->host.function);
> +            } else {
> +                vfio_pci_hot_reset(vdev, ret ? true : false);
> +            }
> +        }
> +        return;
> +    }
> +
>      vfio_pci_pre_reset(vdev);
>  
>      if (vdev->resetfn && !vdev->resetfn(vdev)) {

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 12/13] vfio-pci: pass the aer error to guest
  2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 12/13] vfio-pci: pass the aer error to guest Chen Fan
@ 2015-05-27 21:33   ` Alex Williamson
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Williamson @ 2015-05-27 21:33 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> when the vfio device encounters an uncorrectable error in host,
> the vfio_pci driver will signal the eventfd registered by this
> vfio device, the results in the qemu eventfd handler getting
> invoked.
> 
> this patch is to pass the error to guest and have the guest driver
> recover from the error.
> 
> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 42 ++++++++++++++++++++++++++++++++++++------
>  1 file changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index b05ccdf..855a0a6 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3348,18 +3348,48 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
>  static void vfio_err_notifier_handler(void *opaque)
>  {
>      VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *dev = &vdev->pdev;
> +    PCIEAERMsg msg = {
> +        .severity = 0,
> +        .source_id = (pci_bus_num(dev->bus) << 8) | dev->devfn,
> +    };
>  
>      if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>          return;
>      }
>  
>      /*
> -     * TBD. Retrieve the error details and decide what action
> -     * needs to be taken. One of the actions could be to pass
> -     * the error to the guest and have the guest driver recover
> -     * from the error. This requires that PCIe capabilities be
> -     * exposed to the guest. For now, we just terminate the
> -     * guest to contain the error.
> +     * in case the real hardware configration has been changed,
> +     * here we should recheck the bus reset capability.
> +     */
> +    vfio_check_host_bus_reset(vdev);

Why would we run this on devices without the AER feature?  I agree that
we do need to revalidate for devices that are exposing AER forwarding.

> +
> +    /*
> +     * we should read the error details from the real hardware
> +     * configuration spaces, here we only need to do is signaling
> +     * to guest an uncorrectable error has occurred.
> +     */
> +    if (dev->exp.aer_cap &&
> +        vdev->has_bus_reset) {
> +        uint8_t *aer_cap = dev->config + dev->exp.aer_cap;
> +        uint32_t uncor_status;
> +        bool isfatal;
> +
> +        uncor_status = vfio_pci_read_config(dev,
> +                           dev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
> +
> +        isfatal = uncor_status & pci_get_long(aer_cap + PCI_ERR_UNCOR_SEVER);
> +
> +        msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
> +                                 PCI_ERR_ROOT_CMD_NONFATAL_EN;
> +
> +        pcie_aer_msg(dev, &msg);
> +        return;
> +    }
> +
> +    /*
> +     * If the aer capability is not exposed to the guest. we just
> +     * terminate the guest to contain the error.
>       */
>  
>      error_report("%s(%04x:%02x:%02x.%x) Unrecoverable error detected.  "

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not
  2015-05-27 21:32   ` Alex Williamson
@ 2015-06-02  7:54     ` Chen Fan
  2015-06-02 16:47       ` Alex Williamson
  0 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-06-02  7:54 UTC (permalink / raw)
  To: Alex Williamson; +Cc: izumi.taku, qemu-devel


On 05/28/2015 05:32 AM, Alex Williamson wrote:
> On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
>> we introduce a has_bus_reset capability to sign the vfio
>> devices if support host bus reset.
>>
>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>> ---
>>   hw/vfio/pci.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 123 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index f4e7855..5934fd7 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -33,6 +33,7 @@
>>   #include "hw/pci/msix.h"
>>   #include "hw/pci/pci.h"
>>   #include "hw/pci/pci_bridge.h"
>> +#include "hw/pci/pci_bus.h"
>>   #include "qemu-common.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/event_notifier.h"
>> @@ -170,6 +171,7 @@ typedef struct VFIOPCIDevice {
>>       bool req_enabled;
>>       bool has_flr;
>>       bool has_pm_reset;
>> +    bool has_bus_reset;
> I still think that caching this is a bad idea, there's no point at which
> we can blindly assume the capability is still present.
>
>>       bool rom_read_failed;
>>   } VFIOPCIDevice;
>>   
>> @@ -203,6 +205,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>   static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>>                                     uint32_t val, int len);
>>   static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev);
>>   
>>   /*
>>    * Disabling BAR mmaping can be slow, but toggling it around INTx can
>> @@ -2853,6 +2856,20 @@ static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size)
>>           dev_iter = pci_bridge_get_device(dev_iter->bus);
>>       }
>>   
>> +    /*
>> +     * Don't check bus reset capability when device is enabled during
>> +     * qemu machine creation, which is done by machine init function.
>> +     */
>> +    if (DEVICE(vdev)->hotplugged) {
>> +        vfio_check_host_bus_reset(vdev);
>> +        if (!vdev->has_bus_reset) {
>> +            error_report("vfio: Cannot enable AER for device %s, "
>> +                         "which is not support host bus reset.",
> "which does not support host bus reset."
>
>> +                         vdev->vbasedev.name);
>> +            goto error;
>> +        }
>> +    }
>> +
>>       errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4);
>>       /*
>>        * The ability to record multiple headers is depending on
>> @@ -3678,6 +3695,112 @@ static void vfio_setup_resetfn(VFIOPCIDevice *vdev)
>>       }
>>   }
>>   
>> +struct VfioDeviceFind {
> We use VFIOFooBar for all other camel case definitions, much like PCIBus
> and PCIDevice below.
>
>> +    PCIBus *pbus;
>> +    PCIDevice *pdev;
>> +    bool found;
>> +};
>> +
>> +static void find_devices(PCIBus *bus, void *opaque)
>> +{
>> +    struct VfioDeviceFind *find = opaque;
>> +    int i;
>> +
>> +    if (find->found == true) {
> if (find->found) {...
>
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < ARRAY_SIZE(bus->devices); i++) {
>> +        if (!bus->devices[i]) {
>> +            continue;
>> +        }
>> +
>> +        if (bus->devices[i] == find->pdev) {
>> +            find->pbus = bus;
>> +            find->found = true;
>> +            break;
>> +        }
>> +    }
>> +}
>> +
>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev)
>> +{
>> +    PCIBus *bus = vdev->pdev.bus;
>> +    struct vfio_pci_hot_reset_info *info = NULL;
>> +    struct vfio_pci_dependent_device *devices;
>> +    VFIOGroup *group;
>> +    int ret, i;
>> +    bool has_bus_reset = false;
>> +
>> +    ret = vfio_get_hot_reset_info(vdev, &info);
>> +    if (ret < 0) {
> if (ret) {...
>
>> +        goto out;
>> +    }
>> +
>> +    /* List all affected devices by bus reset */
>> +    devices = &info->devices[0];
>> +
>> +    /* Verify that we have all the groups required */
>> +    for (i = 0; i < info->count; i++) {
>> +        PCIHostDeviceAddress host;
>> +        VFIOPCIDevice *tmp;
>> +        VFIODevice *vbasedev_iter;
>> +
>> +        host.domain = devices[i].segment;
>> +        host.bus = devices[i].bus;
>> +        host.slot = PCI_SLOT(devices[i].devfn);
>> +        host.function = PCI_FUNC(devices[i].devfn);
>> +
>> +        /* Skip the current device */
>> +        if (vfio_pci_host_match(&host, &vdev->host)) {
>> +            continue;
>> +        }
>> +
>> +        /* Ensure we own the group of the affected device */
>> +        QLIST_FOREACH(group, &vfio_group_list, next) {
>> +            if (group->groupid == devices[i].group_id) {
>> +                break;
>> +            }
>> +        }
>> +
>> +        if (!group) {
>> +            goto out;
>> +        }
>> +
>> +        /* Ensure affected devices for reset under the same bus */
>> +        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>> +            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
>> +                continue;
>> +            }
>> +            tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
>> +            if (vfio_pci_host_match(&host, &tmp->host)) {
>> +                struct VfioDeviceFind find = { .pdev = &tmp->pdev, .found = false };
>> +
>> +                pci_for_each_bus(bus, find_devices, &find);
>> +                if (!find.found) {
>> +                    goto out;
>> +                }
>> +                /*
>> +                 * When the check device is hotplugged to a higher bus again,
>> +                 * which would influence the affected device which enable aer
>> +                 * below the bus.
>> +                 */
>> +                if (tmp->features & VFIO_FEATURE_ENABLE_AER &&
>> +                    find.pbus != bus) {
>> +                    goto out;
>> +                }
> I think what you're trying to do here is assume that if a reset of A
> affects B, then a reset of B affects A, and if B is on a subordinate bus
> from A, then our configuration is broken, right?  Can we really
> guarantee that assumption?  If we had a physical topology that mirrored
> this virtual topology, that wouldn't necessarily be true.  For instance,
> if A was a function of a multi-function device where another function
> was a PCIe upstream switch, B could be subordinate to that switch, so a
> reset of A affects B, but a reset of B doesn't affect A.
>
>> +                break;
>> +            }
>> +        }
>> +    }
>> +
>> +    has_bus_reset = true;
>> +
>> +out:
>> +    vdev->has_bus_reset = has_bus_reset;
> I don't see any value is storing this, it can't be trusted at any point
> in the future.  I think that any time we add a device or think about
> forwarding an AER message to the guest, we need to do a validation pass,
> testing the entire topology.
>
> To do that we'd iterate through every device in every group for PCI
> devices that support AER.  For each one, get the hot reset info for the
> affected device list.  For each affected device, if it's attached to the
> VM, it needs to be on or below the bus of the target device.
> Additionally, we should test all of the non-AER supporting vfio-pci
> devices on or below the target bus to verify they have a reset
> mechanism.  Run that function for each vfio-pci device as it's added,
> regardless of whether it's hotplug.  If the test fails, fail the initfn
> for the current device.  Also run the test prior to AER injection, if it
> fails demote the AER injection to a machine halt as we have now.
I'm worry about is the case that the affected devices belonged to
another groups but when initialize this device the another group
has not been added. it would cause fail the initfn, but the group
maybe added later. so I used the machine done event notifier to
check this case. if we don't do like that. how can we check the
case when all vfio-pci devices initfn ?

Thanks,
Chen

>
>> +    g_free(info);
>> +}
>> +
>>   static int vfio_initfn(PCIDevice *pdev)
>>   {
>>       VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>
>
> .
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not
  2015-06-02  7:54     ` Chen Fan
@ 2015-06-02 16:47       ` Alex Williamson
  2015-06-03  0:52         ` Chen Fan
  0 siblings, 1 reply; 27+ messages in thread
From: Alex Williamson @ 2015-06-02 16:47 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Tue, 2015-06-02 at 15:54 +0800, Chen Fan wrote:
> On 05/28/2015 05:32 AM, Alex Williamson wrote:
> > On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> >> we introduce a has_bus_reset capability to sign the vfio
> >> devices if support host bus reset.
> >>
> >> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >> ---
> >>   hw/vfio/pci.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 123 insertions(+)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index f4e7855..5934fd7 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -33,6 +33,7 @@
> >>   #include "hw/pci/msix.h"
> >>   #include "hw/pci/pci.h"
> >>   #include "hw/pci/pci_bridge.h"
> >> +#include "hw/pci/pci_bus.h"
> >>   #include "qemu-common.h"
> >>   #include "qemu/error-report.h"
> >>   #include "qemu/event_notifier.h"
> >> @@ -170,6 +171,7 @@ typedef struct VFIOPCIDevice {
> >>       bool req_enabled;
> >>       bool has_flr;
> >>       bool has_pm_reset;
> >> +    bool has_bus_reset;
> > I still think that caching this is a bad idea, there's no point at which
> > we can blindly assume the capability is still present.
> >
> >>       bool rom_read_failed;
> >>   } VFIOPCIDevice;
> >>   
> >> @@ -203,6 +205,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> >>   static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
> >>                                     uint32_t val, int len);
> >>   static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
> >> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev);
> >>   
> >>   /*
> >>    * Disabling BAR mmaping can be slow, but toggling it around INTx can
> >> @@ -2853,6 +2856,20 @@ static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size)
> >>           dev_iter = pci_bridge_get_device(dev_iter->bus);
> >>       }
> >>   
> >> +    /*
> >> +     * Don't check bus reset capability when device is enabled during
> >> +     * qemu machine creation, which is done by machine init function.
> >> +     */
> >> +    if (DEVICE(vdev)->hotplugged) {
> >> +        vfio_check_host_bus_reset(vdev);
> >> +        if (!vdev->has_bus_reset) {
> >> +            error_report("vfio: Cannot enable AER for device %s, "
> >> +                         "which is not support host bus reset.",
> > "which does not support host bus reset."
> >
> >> +                         vdev->vbasedev.name);
> >> +            goto error;
> >> +        }
> >> +    }
> >> +
> >>       errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4);
> >>       /*
> >>        * The ability to record multiple headers is depending on
> >> @@ -3678,6 +3695,112 @@ static void vfio_setup_resetfn(VFIOPCIDevice *vdev)
> >>       }
> >>   }
> >>   
> >> +struct VfioDeviceFind {
> > We use VFIOFooBar for all other camel case definitions, much like PCIBus
> > and PCIDevice below.
> >
> >> +    PCIBus *pbus;
> >> +    PCIDevice *pdev;
> >> +    bool found;
> >> +};
> >> +
> >> +static void find_devices(PCIBus *bus, void *opaque)
> >> +{
> >> +    struct VfioDeviceFind *find = opaque;
> >> +    int i;
> >> +
> >> +    if (find->found == true) {
> > if (find->found) {...
> >
> >> +        return;
> >> +    }
> >> +
> >> +    for (i = 0; i < ARRAY_SIZE(bus->devices); i++) {
> >> +        if (!bus->devices[i]) {
> >> +            continue;
> >> +        }
> >> +
> >> +        if (bus->devices[i] == find->pdev) {
> >> +            find->pbus = bus;
> >> +            find->found = true;
> >> +            break;
> >> +        }
> >> +    }
> >> +}
> >> +
> >> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev)
> >> +{
> >> +    PCIBus *bus = vdev->pdev.bus;
> >> +    struct vfio_pci_hot_reset_info *info = NULL;
> >> +    struct vfio_pci_dependent_device *devices;
> >> +    VFIOGroup *group;
> >> +    int ret, i;
> >> +    bool has_bus_reset = false;
> >> +
> >> +    ret = vfio_get_hot_reset_info(vdev, &info);
> >> +    if (ret < 0) {
> > if (ret) {...
> >
> >> +        goto out;
> >> +    }
> >> +
> >> +    /* List all affected devices by bus reset */
> >> +    devices = &info->devices[0];
> >> +
> >> +    /* Verify that we have all the groups required */
> >> +    for (i = 0; i < info->count; i++) {
> >> +        PCIHostDeviceAddress host;
> >> +        VFIOPCIDevice *tmp;
> >> +        VFIODevice *vbasedev_iter;
> >> +
> >> +        host.domain = devices[i].segment;
> >> +        host.bus = devices[i].bus;
> >> +        host.slot = PCI_SLOT(devices[i].devfn);
> >> +        host.function = PCI_FUNC(devices[i].devfn);
> >> +
> >> +        /* Skip the current device */
> >> +        if (vfio_pci_host_match(&host, &vdev->host)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        /* Ensure we own the group of the affected device */
> >> +        QLIST_FOREACH(group, &vfio_group_list, next) {
> >> +            if (group->groupid == devices[i].group_id) {
> >> +                break;
> >> +            }
> >> +        }
> >> +
> >> +        if (!group) {
> >> +            goto out;
> >> +        }
> >> +
> >> +        /* Ensure affected devices for reset under the same bus */
> >> +        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> >> +            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> >> +                continue;
> >> +            }
> >> +            tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
> >> +            if (vfio_pci_host_match(&host, &tmp->host)) {
> >> +                struct VfioDeviceFind find = { .pdev = &tmp->pdev, .found = false };
> >> +
> >> +                pci_for_each_bus(bus, find_devices, &find);
> >> +                if (!find.found) {
> >> +                    goto out;
> >> +                }
> >> +                /*
> >> +                 * When the check device is hotplugged to a higher bus again,
> >> +                 * which would influence the affected device which enable aer
> >> +                 * below the bus.
> >> +                 */
> >> +                if (tmp->features & VFIO_FEATURE_ENABLE_AER &&
> >> +                    find.pbus != bus) {
> >> +                    goto out;
> >> +                }
> > I think what you're trying to do here is assume that if a reset of A
> > affects B, then a reset of B affects A, and if B is on a subordinate bus
> > from A, then our configuration is broken, right?  Can we really
> > guarantee that assumption?  If we had a physical topology that mirrored
> > this virtual topology, that wouldn't necessarily be true.  For instance,
> > if A was a function of a multi-function device where another function
> > was a PCIe upstream switch, B could be subordinate to that switch, so a
> > reset of A affects B, but a reset of B doesn't affect A.
> >
> >> +                break;
> >> +            }
> >> +        }
> >> +    }
> >> +
> >> +    has_bus_reset = true;
> >> +
> >> +out:
> >> +    vdev->has_bus_reset = has_bus_reset;
> > I don't see any value is storing this, it can't be trusted at any point
> > in the future.  I think that any time we add a device or think about
> > forwarding an AER message to the guest, we need to do a validation pass,
> > testing the entire topology.
> >
> > To do that we'd iterate through every device in every group for PCI
> > devices that support AER.  For each one, get the hot reset info for the
> > affected device list.  For each affected device, if it's attached to the
> > VM, it needs to be on or below the bus of the target device.
> > Additionally, we should test all of the non-AER supporting vfio-pci
> > devices on or below the target bus to verify they have a reset
> > mechanism.  Run that function for each vfio-pci device as it's added,
> > regardless of whether it's hotplug.  If the test fails, fail the initfn
> > for the current device.  Also run the test prior to AER injection, if it
> > fails demote the AER injection to a machine halt as we have now.
> I'm worry about is the case that the affected devices belonged to
> another groups but when initialize this device the another group
> has not been added. it would cause fail the initfn, but the group
> maybe added later. so I used the machine done event notifier to
> check this case. if we don't do like that. how can we check the
> case when all vfio-pci devices initfn ?

That's why the initfn test needs to test the entire topology, not just
the device being added.  If we have the case you describe where we have
two devices in separate groups, we add the first device, test the
topology, see that the second device is affected but not yet added and
allow AER to be enabled.  When the second device is added, we again test
the topology, we see the potential conflict and fail the initfn for the
second device if the bus requirements are not met.

I don't see the advantage of using a machine init done notifier.  You
can perform fewer topology verification passes using that notifier, but
I think you can provide a more useful error message by testing on each
device addition.  We also use the exact same strategy for cold-plug and
hot-plug, which makes maintenance easier.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not
  2015-06-02 16:47       ` Alex Williamson
@ 2015-06-03  0:52         ` Chen Fan
  2015-06-04 15:59           ` Alex Williamson
  0 siblings, 1 reply; 27+ messages in thread
From: Chen Fan @ 2015-06-03  0:52 UTC (permalink / raw)
  To: Alex Williamson; +Cc: izumi.taku, qemu-devel


On 06/03/2015 12:47 AM, Alex Williamson wrote:
> On Tue, 2015-06-02 at 15:54 +0800, Chen Fan wrote:
>> On 05/28/2015 05:32 AM, Alex Williamson wrote:
>>> On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
>>>> we introduce a has_bus_reset capability to sign the vfio
>>>> devices if support host bus reset.
>>>>
>>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>> ---
>>>>    hw/vfio/pci.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 123 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index f4e7855..5934fd7 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -33,6 +33,7 @@
>>>>    #include "hw/pci/msix.h"
>>>>    #include "hw/pci/pci.h"
>>>>    #include "hw/pci/pci_bridge.h"
>>>> +#include "hw/pci/pci_bus.h"
>>>>    #include "qemu-common.h"
>>>>    #include "qemu/error-report.h"
>>>>    #include "qemu/event_notifier.h"
>>>> @@ -170,6 +171,7 @@ typedef struct VFIOPCIDevice {
>>>>        bool req_enabled;
>>>>        bool has_flr;
>>>>        bool has_pm_reset;
>>>> +    bool has_bus_reset;
>>> I still think that caching this is a bad idea, there's no point at which
>>> we can blindly assume the capability is still present.
>>>
>>>>        bool rom_read_failed;
>>>>    } VFIOPCIDevice;
>>>>    
>>>> @@ -203,6 +205,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>>>    static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>>>>                                      uint32_t val, int len);
>>>>    static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>>>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev);
>>>>    
>>>>    /*
>>>>     * Disabling BAR mmaping can be slow, but toggling it around INTx can
>>>> @@ -2853,6 +2856,20 @@ static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size)
>>>>            dev_iter = pci_bridge_get_device(dev_iter->bus);
>>>>        }
>>>>    
>>>> +    /*
>>>> +     * Don't check bus reset capability when device is enabled during
>>>> +     * qemu machine creation, which is done by machine init function.
>>>> +     */
>>>> +    if (DEVICE(vdev)->hotplugged) {
>>>> +        vfio_check_host_bus_reset(vdev);
>>>> +        if (!vdev->has_bus_reset) {
>>>> +            error_report("vfio: Cannot enable AER for device %s, "
>>>> +                         "which is not support host bus reset.",
>>> "which does not support host bus reset."
>>>
>>>> +                         vdev->vbasedev.name);
>>>> +            goto error;
>>>> +        }
>>>> +    }
>>>> +
>>>>        errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4);
>>>>        /*
>>>>         * The ability to record multiple headers is depending on
>>>> @@ -3678,6 +3695,112 @@ static void vfio_setup_resetfn(VFIOPCIDevice *vdev)
>>>>        }
>>>>    }
>>>>    
>>>> +struct VfioDeviceFind {
>>> We use VFIOFooBar for all other camel case definitions, much like PCIBus
>>> and PCIDevice below.
>>>
>>>> +    PCIBus *pbus;
>>>> +    PCIDevice *pdev;
>>>> +    bool found;
>>>> +};
>>>> +
>>>> +static void find_devices(PCIBus *bus, void *opaque)
>>>> +{
>>>> +    struct VfioDeviceFind *find = opaque;
>>>> +    int i;
>>>> +
>>>> +    if (find->found == true) {
>>> if (find->found) {...
>>>
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    for (i = 0; i < ARRAY_SIZE(bus->devices); i++) {
>>>> +        if (!bus->devices[i]) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        if (bus->devices[i] == find->pdev) {
>>>> +            find->pbus = bus;
>>>> +            find->found = true;
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev)
>>>> +{
>>>> +    PCIBus *bus = vdev->pdev.bus;
>>>> +    struct vfio_pci_hot_reset_info *info = NULL;
>>>> +    struct vfio_pci_dependent_device *devices;
>>>> +    VFIOGroup *group;
>>>> +    int ret, i;
>>>> +    bool has_bus_reset = false;
>>>> +
>>>> +    ret = vfio_get_hot_reset_info(vdev, &info);
>>>> +    if (ret < 0) {
>>> if (ret) {...
>>>
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    /* List all affected devices by bus reset */
>>>> +    devices = &info->devices[0];
>>>> +
>>>> +    /* Verify that we have all the groups required */
>>>> +    for (i = 0; i < info->count; i++) {
>>>> +        PCIHostDeviceAddress host;
>>>> +        VFIOPCIDevice *tmp;
>>>> +        VFIODevice *vbasedev_iter;
>>>> +
>>>> +        host.domain = devices[i].segment;
>>>> +        host.bus = devices[i].bus;
>>>> +        host.slot = PCI_SLOT(devices[i].devfn);
>>>> +        host.function = PCI_FUNC(devices[i].devfn);
>>>> +
>>>> +        /* Skip the current device */
>>>> +        if (vfio_pci_host_match(&host, &vdev->host)) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        /* Ensure we own the group of the affected device */
>>>> +        QLIST_FOREACH(group, &vfio_group_list, next) {
>>>> +            if (group->groupid == devices[i].group_id) {
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        if (!group) {
>>>> +            goto out;
>>>> +        }
>>>> +
>>>> +        /* Ensure affected devices for reset under the same bus */
>>>> +        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>>>> +            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
>>>> +                continue;
>>>> +            }
>>>> +            tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
>>>> +            if (vfio_pci_host_match(&host, &tmp->host)) {
>>>> +                struct VfioDeviceFind find = { .pdev = &tmp->pdev, .found = false };
>>>> +
>>>> +                pci_for_each_bus(bus, find_devices, &find);
>>>> +                if (!find.found) {
>>>> +                    goto out;
>>>> +                }
>>>> +                /*
>>>> +                 * When the check device is hotplugged to a higher bus again,
>>>> +                 * which would influence the affected device which enable aer
>>>> +                 * below the bus.
>>>> +                 */
>>>> +                if (tmp->features & VFIO_FEATURE_ENABLE_AER &&
>>>> +                    find.pbus != bus) {
>>>> +                    goto out;
>>>> +                }
>>> I think what you're trying to do here is assume that if a reset of A
>>> affects B, then a reset of B affects A, and if B is on a subordinate bus
>>> from A, then our configuration is broken, right?  Can we really
>>> guarantee that assumption?  If we had a physical topology that mirrored
>>> this virtual topology, that wouldn't necessarily be true.  For instance,
>>> if A was a function of a multi-function device where another function
>>> was a PCIe upstream switch, B could be subordinate to that switch, so a
>>> reset of A affects B, but a reset of B doesn't affect A.
>>>
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +    has_bus_reset = true;
>>>> +
>>>> +out:
>>>> +    vdev->has_bus_reset = has_bus_reset;
>>> I don't see any value is storing this, it can't be trusted at any point
>>> in the future.  I think that any time we add a device or think about
>>> forwarding an AER message to the guest, we need to do a validation pass,
>>> testing the entire topology.
>>>
>>> To do that we'd iterate through every device in every group for PCI
>>> devices that support AER.  For each one, get the hot reset info for the
>>> affected device list.  For each affected device, if it's attached to the
>>> VM, it needs to be on or below the bus of the target device.
>>> Additionally, we should test all of the non-AER supporting vfio-pci
>>> devices on or below the target bus to verify they have a reset
>>> mechanism.  Run that function for each vfio-pci device as it's added,
>>> regardless of whether it's hotplug.  If the test fails, fail the initfn
>>> for the current device.  Also run the test prior to AER injection, if it
>>> fails demote the AER injection to a machine halt as we have now.
>> I'm worry about is the case that the affected devices belonged to
>> another groups but when initialize this device the another group
>> has not been added. it would cause fail the initfn, but the group
>> maybe added later. so I used the machine done event notifier to
>> check this case. if we don't do like that. how can we check the
>> case when all vfio-pci devices initfn ?
> That's why the initfn test needs to test the entire topology, not just
> the device being added.  If we have the case you describe where we have
> two devices in separate groups, we add the first device, test the
> topology, see that the second device is affected but not yet added and
> allow AER to be enabled.  When the second device is added, we again test
> the topology, we see the potential conflict and fail the initfn for the
> second device if the bus requirements are not met.
In case that the second device may not be added. in this case the
first device enable the aer, and pass the validate. how do we know
the second device if be added ?

Thanks,
Chen



>
> I don't see the advantage of using a machine init done notifier.  You
> can perform fewer topology verification passes using that notifier, but
> I think you can provide a more useful error message by testing on each
> device addition.  We also use the exact same strategy for cold-plug and
> hot-plug, which makes maintenance easier.  Thanks,
>
> Alex
>
> .
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not
  2015-06-03  0:52         ` Chen Fan
@ 2015-06-04 15:59           ` Alex Williamson
  2015-06-09  3:43             ` Chen Fan
  0 siblings, 1 reply; 27+ messages in thread
From: Alex Williamson @ 2015-06-04 15:59 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Wed, 2015-06-03 at 08:52 +0800, Chen Fan wrote:
> On 06/03/2015 12:47 AM, Alex Williamson wrote:
> > On Tue, 2015-06-02 at 15:54 +0800, Chen Fan wrote:
> >> On 05/28/2015 05:32 AM, Alex Williamson wrote:
> >>> On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> >>>> we introduce a has_bus_reset capability to sign the vfio
> >>>> devices if support host bus reset.
> >>>>
> >>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >>>> ---
> >>>>    hw/vfio/pci.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>    1 file changed, 123 insertions(+)
> >>>>
> >>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>> index f4e7855..5934fd7 100644
> >>>> --- a/hw/vfio/pci.c
> >>>> +++ b/hw/vfio/pci.c
> >>>> @@ -33,6 +33,7 @@
> >>>>    #include "hw/pci/msix.h"
> >>>>    #include "hw/pci/pci.h"
> >>>>    #include "hw/pci/pci_bridge.h"
> >>>> +#include "hw/pci/pci_bus.h"
> >>>>    #include "qemu-common.h"
> >>>>    #include "qemu/error-report.h"
> >>>>    #include "qemu/event_notifier.h"
> >>>> @@ -170,6 +171,7 @@ typedef struct VFIOPCIDevice {
> >>>>        bool req_enabled;
> >>>>        bool has_flr;
> >>>>        bool has_pm_reset;
> >>>> +    bool has_bus_reset;
> >>> I still think that caching this is a bad idea, there's no point at which
> >>> we can blindly assume the capability is still present.
> >>>
> >>>>        bool rom_read_failed;
> >>>>    } VFIOPCIDevice;
> >>>>    
> >>>> @@ -203,6 +205,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> >>>>    static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
> >>>>                                      uint32_t val, int len);
> >>>>    static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
> >>>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev);
> >>>>    
> >>>>    /*
> >>>>     * Disabling BAR mmaping can be slow, but toggling it around INTx can
> >>>> @@ -2853,6 +2856,20 @@ static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size)
> >>>>            dev_iter = pci_bridge_get_device(dev_iter->bus);
> >>>>        }
> >>>>    
> >>>> +    /*
> >>>> +     * Don't check bus reset capability when device is enabled during
> >>>> +     * qemu machine creation, which is done by machine init function.
> >>>> +     */
> >>>> +    if (DEVICE(vdev)->hotplugged) {
> >>>> +        vfio_check_host_bus_reset(vdev);
> >>>> +        if (!vdev->has_bus_reset) {
> >>>> +            error_report("vfio: Cannot enable AER for device %s, "
> >>>> +                         "which is not support host bus reset.",
> >>> "which does not support host bus reset."
> >>>
> >>>> +                         vdev->vbasedev.name);
> >>>> +            goto error;
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>>        errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4);
> >>>>        /*
> >>>>         * The ability to record multiple headers is depending on
> >>>> @@ -3678,6 +3695,112 @@ static void vfio_setup_resetfn(VFIOPCIDevice *vdev)
> >>>>        }
> >>>>    }
> >>>>    
> >>>> +struct VfioDeviceFind {
> >>> We use VFIOFooBar for all other camel case definitions, much like PCIBus
> >>> and PCIDevice below.
> >>>
> >>>> +    PCIBus *pbus;
> >>>> +    PCIDevice *pdev;
> >>>> +    bool found;
> >>>> +};
> >>>> +
> >>>> +static void find_devices(PCIBus *bus, void *opaque)
> >>>> +{
> >>>> +    struct VfioDeviceFind *find = opaque;
> >>>> +    int i;
> >>>> +
> >>>> +    if (find->found == true) {
> >>> if (find->found) {...
> >>>
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    for (i = 0; i < ARRAY_SIZE(bus->devices); i++) {
> >>>> +        if (!bus->devices[i]) {
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        if (bus->devices[i] == find->pdev) {
> >>>> +            find->pbus = bus;
> >>>> +            find->found = true;
> >>>> +            break;
> >>>> +        }
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev)
> >>>> +{
> >>>> +    PCIBus *bus = vdev->pdev.bus;
> >>>> +    struct vfio_pci_hot_reset_info *info = NULL;
> >>>> +    struct vfio_pci_dependent_device *devices;
> >>>> +    VFIOGroup *group;
> >>>> +    int ret, i;
> >>>> +    bool has_bus_reset = false;
> >>>> +
> >>>> +    ret = vfio_get_hot_reset_info(vdev, &info);
> >>>> +    if (ret < 0) {
> >>> if (ret) {...
> >>>
> >>>> +        goto out;
> >>>> +    }
> >>>> +
> >>>> +    /* List all affected devices by bus reset */
> >>>> +    devices = &info->devices[0];
> >>>> +
> >>>> +    /* Verify that we have all the groups required */
> >>>> +    for (i = 0; i < info->count; i++) {
> >>>> +        PCIHostDeviceAddress host;
> >>>> +        VFIOPCIDevice *tmp;
> >>>> +        VFIODevice *vbasedev_iter;
> >>>> +
> >>>> +        host.domain = devices[i].segment;
> >>>> +        host.bus = devices[i].bus;
> >>>> +        host.slot = PCI_SLOT(devices[i].devfn);
> >>>> +        host.function = PCI_FUNC(devices[i].devfn);
> >>>> +
> >>>> +        /* Skip the current device */
> >>>> +        if (vfio_pci_host_match(&host, &vdev->host)) {
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        /* Ensure we own the group of the affected device */
> >>>> +        QLIST_FOREACH(group, &vfio_group_list, next) {
> >>>> +            if (group->groupid == devices[i].group_id) {
> >>>> +                break;
> >>>> +            }
> >>>> +        }
> >>>> +
> >>>> +        if (!group) {
> >>>> +            goto out;
> >>>> +        }
> >>>> +
> >>>> +        /* Ensure affected devices for reset under the same bus */
> >>>> +        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> >>>> +            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> >>>> +                continue;
> >>>> +            }
> >>>> +            tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
> >>>> +            if (vfio_pci_host_match(&host, &tmp->host)) {
> >>>> +                struct VfioDeviceFind find = { .pdev = &tmp->pdev, .found = false };
> >>>> +
> >>>> +                pci_for_each_bus(bus, find_devices, &find);
> >>>> +                if (!find.found) {
> >>>> +                    goto out;
> >>>> +                }
> >>>> +                /*
> >>>> +                 * When the check device is hotplugged to a higher bus again,
> >>>> +                 * which would influence the affected device which enable aer
> >>>> +                 * below the bus.
> >>>> +                 */
> >>>> +                if (tmp->features & VFIO_FEATURE_ENABLE_AER &&
> >>>> +                    find.pbus != bus) {
> >>>> +                    goto out;
> >>>> +                }
> >>> I think what you're trying to do here is assume that if a reset of A
> >>> affects B, then a reset of B affects A, and if B is on a subordinate bus
> >>> from A, then our configuration is broken, right?  Can we really
> >>> guarantee that assumption?  If we had a physical topology that mirrored
> >>> this virtual topology, that wouldn't necessarily be true.  For instance,
> >>> if A was a function of a multi-function device where another function
> >>> was a PCIe upstream switch, B could be subordinate to that switch, so a
> >>> reset of A affects B, but a reset of B doesn't affect A.
> >>>
> >>>> +                break;
> >>>> +            }
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    has_bus_reset = true;
> >>>> +
> >>>> +out:
> >>>> +    vdev->has_bus_reset = has_bus_reset;
> >>> I don't see any value is storing this, it can't be trusted at any point
> >>> in the future.  I think that any time we add a device or think about
> >>> forwarding an AER message to the guest, we need to do a validation pass,
> >>> testing the entire topology.
> >>>
> >>> To do that we'd iterate through every device in every group for PCI
> >>> devices that support AER.  For each one, get the hot reset info for the
> >>> affected device list.  For each affected device, if it's attached to the
> >>> VM, it needs to be on or below the bus of the target device.
> >>> Additionally, we should test all of the non-AER supporting vfio-pci
> >>> devices on or below the target bus to verify they have a reset
> >>> mechanism.  Run that function for each vfio-pci device as it's added,
> >>> regardless of whether it's hotplug.  If the test fails, fail the initfn
> >>> for the current device.  Also run the test prior to AER injection, if it
> >>> fails demote the AER injection to a machine halt as we have now.
> >> I'm worry about is the case that the affected devices belonged to
> >> another groups but when initialize this device the another group
> >> has not been added. it would cause fail the initfn, but the group
> >> maybe added later. so I used the machine done event notifier to
> >> check this case. if we don't do like that. how can we check the
> >> case when all vfio-pci devices initfn ?
> > That's why the initfn test needs to test the entire topology, not just
> > the device being added.  If we have the case you describe where we have
> > two devices in separate groups, we add the first device, test the
> > topology, see that the second device is affected but not yet added and
> > allow AER to be enabled.  When the second device is added, we again test
> > the topology, we see the potential conflict and fail the initfn for the
> > second device if the bus requirements are not met.
> In case that the second device may not be added. in this case the
> first device enable the aer, and pass the validate. how do we know
> the second device if be added ?

Yeah, I see what you're getting at here.  If we have a dual port NIC
with isolation between functions such that each is a separate IOMMU
group, when we add the first device with AER enabled it will fail
because a bus reset affects both groups and we don't yet own the second
group.

My proposal wouldn't provide a way to ever enable AER for these devices.
However, the proposal of using a machine-init-done hook only allows
cold-plug of such devices with AER, hotplug would get the same issue.  I
don't think that sort of asymmetry is supportable either.

We almost need some sort of vfio-group stub driver in qemu that we can
claim ownership of a group without adding any of the devices in the
group to the VM.  Another option might be to make AER a "soft"
requirement, demote it to a VM stop unless the topology allows it.  That
also creates a confusing user scenario that probably looks nearly random
from an outside perspective.  The only other idea I can see would be to
allow some manipulation of iommu groups at the kernel level, perhaps
creating a meta-group composed of one or more iommu groups.  Or maybe a
kernel option that could ignore isolation for specified devices to
broaden the group.  Are we really sure exposing AER to guests is a good
idea?  Requiring guest bus resets to map to host bus rests is really a
mess.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 10/13] vfio: do hot bus reset when do virtual secondary bus reset
       [not found]     ` <557020F1.7070705@cn.fujitsu.com>
@ 2015-06-04 16:06       ` Alex Williamson
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Williamson @ 2015-06-04 16:06 UTC (permalink / raw)
  To: Chen Fan; +Cc: izumi.taku, qemu-devel

On Thu, 2015-06-04 at 17:57 +0800, Chen Fan wrote:
> On 05/28/2015 05:33 AM, Alex Williamson wrote:
> > On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
> >> when do virtual secondary bus reset, the vfio device under
> >> this bus need to do host bus reset to reset the device.
> >> so add this case.
> >>
> >> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >> ---
> >>   hw/vfio/pci.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 75 insertions(+)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index a8c5988..b05ccdf 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -3803,6 +3803,64 @@ out:
> >>       g_free(info);
> >>   }
> >>   
> >> +/* Check when device reset the affected devices is single or multi,
> >> + * because there was differentiate the hot reset of mulitple in-use
> >> + * devices and hot reset of a single in-use device. */
> >> +static int vfio_pci_affect_devices_is_multi(VFIOPCIDevice *vdev)
> >> +{
> >> +    VFIOGroup *group;
> >> +    struct vfio_pci_hot_reset_info *info = NULL;
> >> +    struct vfio_pci_dependent_device *devices;
> >> +    int ret, i, count = 0;
> >> +
> >> +    ret = vfio_get_hot_reset_info(vdev, &info);
> >> +    if (ret < 0) {
> >> +        goto out;
> >> +    }
> >> +
> >> +    devices = &info->devices[0];
> >> +
> >> +    /* Verify that we have all the groups required */
> >> +    for (i = 0; i < info->count; i++) {
> >> +        PCIHostDeviceAddress host;
> >> +        VFIODevice *vbasedev_iter;
> >> +
> >> +        host.domain = devices[i].segment;
> >> +        host.bus = devices[i].bus;
> >> +        host.slot = PCI_SLOT(devices[i].devfn);
> >> +        host.function = PCI_FUNC(devices[i].devfn);
> >> +
> >> +        /* Skip the current device */
> >> +        if (vfio_pci_host_match(&host, &vdev->host)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        /* Ensure we own the group of the affected device */
> >> +        QLIST_FOREACH(group, &vfio_group_list, next) {
> >> +            if (group->groupid == devices[i].group_id) {
> >> +                break;
> >> +            }
> >> +        }
> >> +
> >> +        if (!group) {
> >> +            ret = -1;
> >> +            goto out;
> >> +        }
> >> +
> >> +        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> >> +            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> >> +                continue;
> >> +            }
> >> +            count++;
> >> +        }
> >> +    }
> >> +
> >> +    ret = count;
> >> +out:
> >> +    g_free(info);
> >> +    return ret;
> >> +}
> > This basically does almost all the work of the hot reset function just
> > to figure out whether it's a single or multi-reset to call the hot reset
> > function with the right parameter.  Why not just do a hot reset?
> 
> Because the vfio_pci_hot_reset function has a bit odd, if one group
> has two or more function, even though only one function was
> passthroughed to VM, when we need to do bus reset, we should
> call vfio_pci_hot_reset(single), but if multi devices were passed to
> VM, we should call with muliti. also we should update the affected
> devices needs_reset status when do host bus reset. we should not
> do bus reset multi-times, which vfio_pci_hot_reset has done for
> us.

Right, the current function is called via two paths, one for a
per-device reset and one for a system reset.  On a system reset, we
don't mind if we gratuitously reset other assigned or non-assigned
devices.  On a device reset, we can only do a bus reset if there are no
other affected assigned devices.  But still the problem remains that the
above code change duplicates a lot of the hot reset code just to figure
out which way to call the existing function, when it seems like the
easier and more maintainable path is to modify the existing function to
incorporate the functionality we need here.  Thanks,

Alex

> >> +
> >>   static void vfio_pci_machine_done_notify(Notifier *notifier, void *unused)
> >>   {
> >>       VFIOGroup *group;
> >> @@ -4039,6 +4097,23 @@ static void vfio_pci_reset(DeviceState *dev)
> >>   
> >>       trace_vfio_pci_reset(vdev->vbasedev.name);
> >>   
> >> +    if (vdev->needs_bus_reset) {
> >> +        vdev->needs_bus_reset = false;
> > A comment would be nice to signal that this secondary test avoids
> > duplicate bus resets.
> >
> >> +        if (vdev->vbasedev.needs_reset) {
> >> +            int ret;
> >> +
> >> +            ret = vfio_pci_affect_devices_is_multi(vdev);
> >> +            if (ret < 0) {
> >> +                error_report("vfio: Cannot reset device %04x:%02x:%02x.%x, "
> >> +                             "get hot reset info failed.", vdev->host.domain,
> >> +                             vdev->host.bus, vdev->host.slot, vdev->host.function);
> >> +            } else {
> >> +                vfio_pci_hot_reset(vdev, ret ? true : false);
> >> +            }
> >> +        }
> >> +        return;
> >> +    }
> >> +
> >>       vfio_pci_pre_reset(vdev);
> >>   
> >>       if (vdev->resetfn && !vdev->resetfn(vdev)) {
> >
> >
> > .
> >
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not
  2015-06-04 15:59           ` Alex Williamson
@ 2015-06-09  3:43             ` Chen Fan
  0 siblings, 0 replies; 27+ messages in thread
From: Chen Fan @ 2015-06-09  3:43 UTC (permalink / raw)
  To: Alex Williamson; +Cc: izumi.taku, qemu-devel


On 06/04/2015 11:59 PM, Alex Williamson wrote:
> On Wed, 2015-06-03 at 08:52 +0800, Chen Fan wrote:
>> On 06/03/2015 12:47 AM, Alex Williamson wrote:
>>> On Tue, 2015-06-02 at 15:54 +0800, Chen Fan wrote:
>>>> On 05/28/2015 05:32 AM, Alex Williamson wrote:
>>>>> On Wed, 2015-05-27 at 10:46 +0800, Chen Fan wrote:
>>>>>> we introduce a has_bus_reset capability to sign the vfio
>>>>>> devices if support host bus reset.
>>>>>>
>>>>>> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>>> ---
>>>>>>     hw/vfio/pci.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 123 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>> index f4e7855..5934fd7 100644
>>>>>> --- a/hw/vfio/pci.c
>>>>>> +++ b/hw/vfio/pci.c
>>>>>> @@ -33,6 +33,7 @@
>>>>>>     #include "hw/pci/msix.h"
>>>>>>     #include "hw/pci/pci.h"
>>>>>>     #include "hw/pci/pci_bridge.h"
>>>>>> +#include "hw/pci/pci_bus.h"
>>>>>>     #include "qemu-common.h"
>>>>>>     #include "qemu/error-report.h"
>>>>>>     #include "qemu/event_notifier.h"
>>>>>> @@ -170,6 +171,7 @@ typedef struct VFIOPCIDevice {
>>>>>>         bool req_enabled;
>>>>>>         bool has_flr;
>>>>>>         bool has_pm_reset;
>>>>>> +    bool has_bus_reset;
>>>>> I still think that caching this is a bad idea, there's no point at which
>>>>> we can blindly assume the capability is still present.
>>>>>
>>>>>>         bool rom_read_failed;
>>>>>>     } VFIOPCIDevice;
>>>>>>     
>>>>>> @@ -203,6 +205,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>>>>>     static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
>>>>>>                                       uint32_t val, int len);
>>>>>>     static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>>>>>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev);
>>>>>>     
>>>>>>     /*
>>>>>>      * Disabling BAR mmaping can be slow, but toggling it around INTx can
>>>>>> @@ -2853,6 +2856,20 @@ static int vfio_setup_aer(VFIOPCIDevice *vdev, int pos, uint16_t size)
>>>>>>             dev_iter = pci_bridge_get_device(dev_iter->bus);
>>>>>>         }
>>>>>>     
>>>>>> +    /*
>>>>>> +     * Don't check bus reset capability when device is enabled during
>>>>>> +     * qemu machine creation, which is done by machine init function.
>>>>>> +     */
>>>>>> +    if (DEVICE(vdev)->hotplugged) {
>>>>>> +        vfio_check_host_bus_reset(vdev);
>>>>>> +        if (!vdev->has_bus_reset) {
>>>>>> +            error_report("vfio: Cannot enable AER for device %s, "
>>>>>> +                         "which is not support host bus reset.",
>>>>> "which does not support host bus reset."
>>>>>
>>>>>> +                         vdev->vbasedev.name);
>>>>>> +            goto error;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>>         errcap = vfio_pci_read_config(pdev, pdev->exp.aer_cap + PCI_ERR_CAP, 4);
>>>>>>         /*
>>>>>>          * The ability to record multiple headers is depending on
>>>>>> @@ -3678,6 +3695,112 @@ static void vfio_setup_resetfn(VFIOPCIDevice *vdev)
>>>>>>         }
>>>>>>     }
>>>>>>     
>>>>>> +struct VfioDeviceFind {
>>>>> We use VFIOFooBar for all other camel case definitions, much like PCIBus
>>>>> and PCIDevice below.
>>>>>
>>>>>> +    PCIBus *pbus;
>>>>>> +    PCIDevice *pdev;
>>>>>> +    bool found;
>>>>>> +};
>>>>>> +
>>>>>> +static void find_devices(PCIBus *bus, void *opaque)
>>>>>> +{
>>>>>> +    struct VfioDeviceFind *find = opaque;
>>>>>> +    int i;
>>>>>> +
>>>>>> +    if (find->found == true) {
>>>>> if (find->found) {...
>>>>>
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    for (i = 0; i < ARRAY_SIZE(bus->devices); i++) {
>>>>>> +        if (!bus->devices[i]) {
>>>>>> +            continue;
>>>>>> +        }
>>>>>> +
>>>>>> +        if (bus->devices[i] == find->pdev) {
>>>>>> +            find->pbus = bus;
>>>>>> +            find->found = true;
>>>>>> +            break;
>>>>>> +        }
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +static void vfio_check_host_bus_reset(VFIOPCIDevice *vdev)
>>>>>> +{
>>>>>> +    PCIBus *bus = vdev->pdev.bus;
>>>>>> +    struct vfio_pci_hot_reset_info *info = NULL;
>>>>>> +    struct vfio_pci_dependent_device *devices;
>>>>>> +    VFIOGroup *group;
>>>>>> +    int ret, i;
>>>>>> +    bool has_bus_reset = false;
>>>>>> +
>>>>>> +    ret = vfio_get_hot_reset_info(vdev, &info);
>>>>>> +    if (ret < 0) {
>>>>> if (ret) {...
>>>>>
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* List all affected devices by bus reset */
>>>>>> +    devices = &info->devices[0];
>>>>>> +
>>>>>> +    /* Verify that we have all the groups required */
>>>>>> +    for (i = 0; i < info->count; i++) {
>>>>>> +        PCIHostDeviceAddress host;
>>>>>> +        VFIOPCIDevice *tmp;
>>>>>> +        VFIODevice *vbasedev_iter;
>>>>>> +
>>>>>> +        host.domain = devices[i].segment;
>>>>>> +        host.bus = devices[i].bus;
>>>>>> +        host.slot = PCI_SLOT(devices[i].devfn);
>>>>>> +        host.function = PCI_FUNC(devices[i].devfn);
>>>>>> +
>>>>>> +        /* Skip the current device */
>>>>>> +        if (vfio_pci_host_match(&host, &vdev->host)) {
>>>>>> +            continue;
>>>>>> +        }
>>>>>> +
>>>>>> +        /* Ensure we own the group of the affected device */
>>>>>> +        QLIST_FOREACH(group, &vfio_group_list, next) {
>>>>>> +            if (group->groupid == devices[i].group_id) {
>>>>>> +                break;
>>>>>> +            }
>>>>>> +        }
>>>>>> +
>>>>>> +        if (!group) {
>>>>>> +            goto out;
>>>>>> +        }
>>>>>> +
>>>>>> +        /* Ensure affected devices for reset under the same bus */
>>>>>> +        QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>>>>>> +            if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
>>>>>> +                continue;
>>>>>> +            }
>>>>>> +            tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
>>>>>> +            if (vfio_pci_host_match(&host, &tmp->host)) {
>>>>>> +                struct VfioDeviceFind find = { .pdev = &tmp->pdev, .found = false };
>>>>>> +
>>>>>> +                pci_for_each_bus(bus, find_devices, &find);
>>>>>> +                if (!find.found) {
>>>>>> +                    goto out;
>>>>>> +                }
>>>>>> +                /*
>>>>>> +                 * When the check device is hotplugged to a higher bus again,
>>>>>> +                 * which would influence the affected device which enable aer
>>>>>> +                 * below the bus.
>>>>>> +                 */
>>>>>> +                if (tmp->features & VFIO_FEATURE_ENABLE_AER &&
>>>>>> +                    find.pbus != bus) {
>>>>>> +                    goto out;
>>>>>> +                }
>>>>> I think what you're trying to do here is assume that if a reset of A
>>>>> affects B, then a reset of B affects A, and if B is on a subordinate bus
>>>>> from A, then our configuration is broken, right?  Can we really
>>>>> guarantee that assumption?  If we had a physical topology that mirrored
>>>>> this virtual topology, that wouldn't necessarily be true.  For instance,
>>>>> if A was a function of a multi-function device where another function
>>>>> was a PCIe upstream switch, B could be subordinate to that switch, so a
>>>>> reset of A affects B, but a reset of B doesn't affect A.
>>>>>
>>>>>> +                break;
>>>>>> +            }
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    has_bus_reset = true;
>>>>>> +
>>>>>> +out:
>>>>>> +    vdev->has_bus_reset = has_bus_reset;
>>>>> I don't see any value is storing this, it can't be trusted at any point
>>>>> in the future.  I think that any time we add a device or think about
>>>>> forwarding an AER message to the guest, we need to do a validation pass,
>>>>> testing the entire topology.
>>>>>
>>>>> To do that we'd iterate through every device in every group for PCI
>>>>> devices that support AER.  For each one, get the hot reset info for the
>>>>> affected device list.  For each affected device, if it's attached to the
>>>>> VM, it needs to be on or below the bus of the target device.
>>>>> Additionally, we should test all of the non-AER supporting vfio-pci
>>>>> devices on or below the target bus to verify they have a reset
>>>>> mechanism.  Run that function for each vfio-pci device as it's added,
>>>>> regardless of whether it's hotplug.  If the test fails, fail the initfn
>>>>> for the current device.  Also run the test prior to AER injection, if it
>>>>> fails demote the AER injection to a machine halt as we have now.
>>>> I'm worry about is the case that the affected devices belonged to
>>>> another groups but when initialize this device the another group
>>>> has not been added. it would cause fail the initfn, but the group
>>>> maybe added later. so I used the machine done event notifier to
>>>> check this case. if we don't do like that. how can we check the
>>>> case when all vfio-pci devices initfn ?
>>> That's why the initfn test needs to test the entire topology, not just
>>> the device being added.  If we have the case you describe where we have
>>> two devices in separate groups, we add the first device, test the
>>> topology, see that the second device is affected but not yet added and
>>> allow AER to be enabled.  When the second device is added, we again test
>>> the topology, we see the potential conflict and fail the initfn for the
>>> second device if the bus requirements are not met.
>> In case that the second device may not be added. in this case the
>> first device enable the aer, and pass the validate. how do we know
>> the second device if be added ?
> Yeah, I see what you're getting at here.  If we have a dual port NIC
> with isolation between functions such that each is a separate IOMMU
> group, when we add the first device with AER enabled it will fail
> because a bus reset affects both groups and we don't yet own the second
> group.
>
> My proposal wouldn't provide a way to ever enable AER for these devices.
> However, the proposal of using a machine-init-done hook only allows
> cold-plug of such devices with AER, hotplug would get the same issue.  I
> don't think that sort of asymmetry is supportable either.
>
> We almost need some sort of vfio-group stub driver in qemu that we can
> claim ownership of a group without adding any of the devices in the
> group to the VM.  Another option might be to make AER a "soft"
> requirement, demote it to a VM stop unless the topology allows it.  That
> also creates a confusing user scenario that probably looks nearly random
> from an outside perspective.  The only other idea I can see would be to
> allow some manipulation of iommu groups at the kernel level, perhaps
> creating a meta-group composed of one or more iommu groups.  Or maybe a
> kernel option that could ignore isolation for specified devices to
> broaden the group.  Are we really sure exposing AER to guests is a good
> idea?  Requiring guest bus resets to map to host bus rests is really a
> mess.  Thanks,
>
> Alex
Usually the user and the administer are not the same one, when device
eject aer error, the user isn't aware of, so I think this feature should
have. I think own the affected group in VM for aer devices is good idea.
and I sent out the new version 9 to fix the init issues. pls help to 
review it.

Thanks,
Chen

>
> .
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-06-09  4:02 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-27  2:46 [Qemu-devel] [RFC v8.1 00/13] vfio-pci: pass the aer error to guest Chen Fan
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 01/13] vfio: extract vfio_get_hot_reset_info as a single function Chen Fan
2015-05-27 21:31   ` Alex Williamson
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 03/13] vfio: add pcie extanded capability support Chen Fan
2015-05-27 21:31   ` Alex Williamson
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 04/13] aer: impove pcie_aer_init to support vfio device Chen Fan
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 05/13] vfio: add aer support for " Chen Fan
2015-05-27 21:32   ` Alex Williamson
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 06/13] vfio: add check host bus reset is support or not Chen Fan
2015-05-27 21:32   ` Alex Williamson
2015-06-02  7:54     ` Chen Fan
2015-06-02 16:47       ` Alex Williamson
2015-06-03  0:52         ` Chen Fan
2015-06-04 15:59           ` Alex Williamson
2015-06-09  3:43             ` Chen Fan
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 07/13] vfio: add check for vfio devices which enable aer should support bus reset Chen Fan
2015-05-27 21:32   ` Alex Williamson
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 08/13] pci: add bus reset_notifiers callbacks for host " Chen Fan
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 09/13] vfio: add sec_bus_reset notifier to notify physical bus reset is needed Chen Fan
2015-05-27 21:32   ` Alex Williamson
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 10/13] vfio: do hot bus reset when do virtual secondary bus reset Chen Fan
2015-05-27 21:33   ` Alex Williamson
     [not found]     ` <557020F1.7070705@cn.fujitsu.com>
2015-06-04 16:06       ` Alex Williamson
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 11/13] pcie_aer: expose pcie_aer_msg() interface Chen Fan
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 12/13] vfio-pci: pass the aer error to guest Chen Fan
2015-05-27 21:33   ` Alex Williamson
2015-05-27  2:46 ` [Qemu-devel] [RFC v8.1 13/13] vfio: add 'aer' property to expose aercap Chen Fan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.