* [Qemu-devel] [PATCH V3 00/10] Xen PCI Passthrough
From: Anthony PERARD @ 2011-10-28 15:07 UTC
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel

Hi all,

This patch series introduces PCI passthrough for Xen.

First, we have HostPCIDevice, which helps to access one PCI device of the host.
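
To illustrate the kind of access involved, here is a minimal sketch, assuming
the Linux sysfs layout /sys/bus/pci/devices/<domain>:<bus>:<dev>.<func>/. It
builds the path to one attribute and parses one "start end flags" line of the
resource file. The names sysfs_path and parse_resource_line are hypothetical
and are not functions of this series.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: build the sysfs path of one attribute of a host
 * PCI device. Returns the snprintf() result. */
static int sysfs_path(char *buf, size_t size, unsigned domain, unsigned bus,
                      unsigned dev, unsigned func, const char *name)
{
    return snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%x/%s",
                    domain, bus, dev, func, name);
}

/* Hypothetical helper: parse one line of the sysfs "resource" file, which
 * holds three hexadecimal fields: start, end and flags. A region with
 * start == 0 is treated as unused and reported with size 0.
 * Returns 0 on success, -1 on a malformed line. */
static int parse_resource_line(const char *line, uint64_t *base,
                               uint64_t *size, uint64_t *flags)
{
    unsigned long long start, end, fl;

    if (sscanf(line, "%llx %llx %llx", &start, &end, &fl) != 3) {
        return -1;
    }
    *base = start;
    *size = start ? end - start + 1 : 0;
    *flags = fl;
    return 0;
}
```

The end address is inclusive, which is why the size is end - start + 1.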

Then, there is an addition to the QEMU PCI code, pci_check_bar_overlap.
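
As background only, the interval test that a BAR overlap check presumably
relies on can be sketched as follows. This is not the code of
pci_check_bar_overlap; the name ranges_overlap and its signature are
assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if [a, a + a_size) and [b, b + b_size)
 * share at least one address. Empty ranges never overlap. Assumes that
 * base + size does not wrap around. */
static bool ranges_overlap(uint64_t a, uint64_t a_size,
                           uint64_t b, uint64_t b_size)
{
    if (a_size == 0 || b_size == 0) {
        return false;
    }
    return a < b + b_size && b < a + a_size;
}
```

Using half-open ranges means two adjacent regions are not reported as
overlapping.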

There are also several changes in pci_ids and pci_regs.

Last but not least, the PCI passthrough device itself. It is split into three
parts (or files). The first takes care of the initialisation of a passthrough
device. The second handles everything about the config address space; there
are specific functions for every config register. The third handles MSI.
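
The idea of one function per config register can be pictured as a lookup table
keyed by register offset. The sketch below is an assumption about the general
shape only; the RegHandler type, the handlers and the table contents are
hypothetical and do not reproduce the structures used in the patches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type: returns the value the guest should see for
 * a register, given the value read from the real device. */
typedef uint32_t (*reg_read_fn)(uint32_t host_value);

typedef struct RegHandler {
    uint32_t offset;   /* offset in config space */
    uint32_t size;     /* register width in bytes */
    reg_read_fn read;
} RegHandler;

/* Pass the host value through unchanged. */
static uint32_t read_passthrough(uint32_t host_value)
{
    return host_value;
}

/* Hide the host value from the guest: always read as zero. */
static uint32_t read_as_zero(uint32_t host_value)
{
    (void)host_value;
    return 0;
}

/* Illustrative table only; which handler fits which register is a
 * decision made by the real code, not by this example. */
static const RegHandler reg_table[] = {
    { 0x00, 2, read_passthrough },  /* vendor ID */
    { 0x02, 2, read_passthrough },  /* device ID */
    { 0x3c, 1, read_as_zero },      /* interrupt line */
};

/* Find the handler covering a given config space offset, or NULL. */
static const RegHandler *find_reg_handler(uint32_t offset)
{
    size_t i;

    for (i = 0; i < sizeof(reg_table) / sizeof(reg_table[0]); i++) {
        const RegHandler *h = &reg_table[i];
        if (offset >= h->offset && offset < h->offset + h->size) {
            return h;
        }
    }
    return NULL;
}
```

A config read would look up the handler for the accessed offset and fall back
to a default policy when find_reg_handler returns NULL.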

There is a patch series on xen-devel that adds support for setting up a PCI
passthrough device through QMP from libxl (the Xen tool stack). It is just a
call to device_add, with the driver parameter hostaddr="0000:00:1b.0".
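
A hostaddr string of this form has four hexadecimal fields: domain, bus, slot
and function. A minimal sketch of splitting it, with the hypothetical name
parse_hostaddr (the parsing code of the series itself may differ):

```c
#include <stdio.h>

/* Hypothetical helper: split "DDDD:BB:SS.F" (all fields hexadecimal) into
 * its four components. Returns 0 on success, -1 if the string is
 * malformed or a field is out of range. */
static int parse_hostaddr(const char *s, unsigned *domain, unsigned *bus,
                          unsigned *slot, unsigned *func)
{
    char trailing;

    /* The extra %c catches junk after the function number. */
    if (sscanf(s, "%x:%x:%x.%x%c", domain, bus, slot, func, &trailing) != 4) {
        return -1;
    }
    if (*domain > 0xffff || *bus > 0xff || *slot > 0x1f || *func > 7) {
        return -1;
    }
    return 0;
}
```

For "0000:00:1b.0" this yields domain 0, bus 0, slot 0x1b and function 0.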

Changes since v2:
  - in host-pci-device.c:
    - Return a more useful error code in get_resource().
    - Use a macro in host_pci_find_ext_cap_offset instead of a raw number. I am
      still not sure whether PCI_MAX_EXT_CAP is right, but its result is 480,
      as before, so it is probably OK.
  - All uses of MSI in the first two PCI passthrough patches have been removed
    and moved to the last patch.
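
The 480 figure can be checked by hand, and the bounded walk can be tried on an
in-memory copy of a config space. The sketch below assumes the constant values
0x100, 0x1000 and 4 for PCI_CONFIG_SPACE_SIZE, PCIE_CONFIG_SPACE_SIZE and
PCI_CAP_SIZEOF, and the usual PCI_EXT_CAP_ID/PCI_EXT_CAP_NEXT definitions. The
functions cfg_get_long and find_ext_cap_offset are hypothetical stand-ins that
read from a buffer instead of the host device.

```c
#include <stdint.h>
#include <string.h>

/* Values assumed to match QEMU's pci.h and pci_regs.h. */
#define PCI_CONFIG_SPACE_SIZE   0x100
#define PCIE_CONFIG_SPACE_SIZE  0x1000
#define PCI_CAP_SIZEOF          4
#define PCI_EXT_CAP_ID(header)    ((header) & 0x0000ffff)
#define PCI_EXT_CAP_NEXT(header)  (((header) >> 20) & 0xffc)

/* (4096 - 256) / (4 + 4) = 480 */
#define PCI_MAX_EXT_CAP \
    ((PCIE_CONFIG_SPACE_SIZE - PCI_CONFIG_SPACE_SIZE) / (PCI_CAP_SIZEOF + 4))

/* Read a little-endian 32-bit value from an in-memory config space copy. */
static uint32_t cfg_get_long(const uint8_t *cfg, int pos)
{
    return (uint32_t)cfg[pos] | ((uint32_t)cfg[pos + 1] << 8) |
           ((uint32_t)cfg[pos + 2] << 16) | ((uint32_t)cfg[pos + 3] << 24);
}

/* Walk the extended capability list of a 4096-byte config space copy.
 * Returns the offset of capability `cap`, or 0 if it is not found. */
static uint32_t find_ext_cap_offset(const uint8_t *cfg, uint32_t cap)
{
    int max_cap = PCI_MAX_EXT_CAP;
    int pos = PCI_CONFIG_SPACE_SIZE;

    while (max_cap-- > 0) {
        uint32_t header = cfg_get_long(cfg, pos);

        if (header == 0) {
            break;              /* no (more) capabilities */
        }
        if (PCI_EXT_CAP_ID(header) == cap) {
            return pos;
        }
        pos = PCI_EXT_CAP_NEXT(header);
        if (pos < PCI_CONFIG_SPACE_SIZE) {
            break;              /* end of the list */
        }
    }
    return 0;
}
```

The counter only guards against a malformed list whose next pointers loop; a
well-formed list ends when the next pointer is zero.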


Changes v1-v2:
  - fix style issues (checkpatch.pl)
  - set the original authors, add some missing copyright headers
  - HostPCIDevice:
    - introduce HostPCIIORegions (with base_addr, size, flags)
    - save all flags from ./resource and store them in a separate field.
    - fix endianness on write
    - new host_pci_dev_put function
    - use a pci.c-like interface, host_pci_get/set_byte/word/long (instead of
      host_pci_read/write_)
  - compile HostPCIDevice only on Linux (as well as xen_pci_passthrough)
  - introduce apic-msidef.h file.
  - no more run_one_timer; if a PCI device is in the middle of a power
    transition, just "return an error" in config read/write
  - use a global var mapped_machine_irq (local to xen_pci_passthrough.c)
  - add msitranslate and power-mgmt as qdev properties
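
On the endianness point: PCI config space is little-endian, so multi-byte
values have to be converted on both read and write. The series uses QEMU's
cpu_to_le16/32 and le16/32_to_cpu for this. The sketch below shows the same
idea with byte-wise helpers that do not depend on the host byte order; the
names put_le16, get_le16, put_le32 and get_le32 are hypothetical.

```c
#include <stdint.h>

/* Store a 16-bit value in little-endian order, whatever the host is. */
static void put_le16(uint8_t *buf, uint16_t v)
{
    buf[0] = v & 0xff;
    buf[1] = v >> 8;
}

/* Load a 16-bit little-endian value. */
static uint16_t get_le16(const uint8_t *buf)
{
    return (uint16_t)(buf[0] | (buf[1] << 8));
}

/* Store a 32-bit value in little-endian order. */
static void put_le32(uint8_t *buf, uint32_t v)
{
    buf[0] = v & 0xff;
    buf[1] = (v >> 8) & 0xff;
    buf[2] = (v >> 16) & 0xff;
    buf[3] = v >> 24;
}

/* Load a 32-bit little-endian value. */
static uint32_t get_le32(const uint8_t *buf)
{
    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
           ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}
```

On a little-endian host the conversions are no-ops, which is how a missing
conversion on the write path can go unnoticed.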


Allen Kay (2):
  Introduce Xen PCI Passthrough, qdevice (1/3)
  Introduce Xen PCI Passthrough, PCI config space helpers (2/3)

Anthony PERARD (6):
  configure: Introduce --enable-xen-pci-passthrough.
  Introduce HostPCIDevice to access a pci device on the host.
  pci_ids: Add INTEL_82599_VF id.
  pci_regs: Fix value of PCI_EXP_TYPE_RC_EC.
  pci_regs: Add PCI_EXP_TYPE_PCIE_BRIDGE
  Introduce apic-msidef.h

Jiang Yunhong (1):
  Introduce Xen PCI Passthrough, MSI (3/3)

Yuji Shimada (1):
  pci.c: Add pci_check_bar_overlap

 Makefile.target                      |    7 +
 configure                            |   25 +
 hw/apic-msidef.h                     |   30 +
 hw/apic.c                            |   11 +-
 hw/host-pci-device.c                 |  252 ++++
 hw/host-pci-device.h                 |   75 +
 hw/pci.c                             |   47 +
 hw/pci.h                             |    3 +
 hw/pci_ids.h                         |    1 +
 hw/pci_regs.h                        |    3 +-
 hw/xen_pci_passthrough.c             |  861 ++++++++++++
 hw/xen_pci_passthrough.h             |  280 ++++
 hw/xen_pci_passthrough_config_init.c | 2553 ++++++++++++++++++++++++++++++++++
 hw/xen_pci_passthrough_helpers.c     |   46 +
 hw/xen_pci_passthrough_msi.c         |  667 +++++++++
 15 files changed, 4850 insertions(+), 11 deletions(-)
 create mode 100644 hw/apic-msidef.h
 create mode 100644 hw/host-pci-device.c
 create mode 100644 hw/host-pci-device.h
 create mode 100644 hw/xen_pci_passthrough.c
 create mode 100644 hw/xen_pci_passthrough.h
 create mode 100644 hw/xen_pci_passthrough_config_init.c
 create mode 100644 hw/xen_pci_passthrough_helpers.c
 create mode 100644 hw/xen_pci_passthrough_msi.c

-- 
Anthony PERARD

* [Qemu-devel] [PATCH V3 01/10] configure: Introduce --enable-xen-pci-passthrough.
From: Anthony PERARD @ 2011-10-28 15:07 UTC
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
 Makefile.target |    2 ++
 configure       |   25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Makefile.target b/Makefile.target
index fe5f6f7..867d687 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -215,6 +215,8 @@ obj-$(CONFIG_NO_XEN) += xen-stub.o
 
 obj-i386-$(CONFIG_XEN) += xen_platform.o
 
+# Xen PCI Passthrough
+
 # Inter-VM PCI shared memory
 CONFIG_IVSHMEM =
 ifeq ($(CONFIG_KVM), y)
diff --git a/configure b/configure
index 4f87e0a..301ab44 100755
--- a/configure
+++ b/configure
@@ -127,6 +127,7 @@ vnc_png=""
 vnc_thread="no"
 xen=""
 xen_ctrl_version=""
+xen_pci_passthrough=""
 linux_aio=""
 attr=""
 xfs=""
@@ -641,6 +642,10 @@ for opt do
   ;;
   --enable-xen) xen="yes"
   ;;
+  --disable-xen-pci-passthrough) xen_pci_passthrough="no"
+  ;;
+  --enable-xen-pci-passthrough) xen_pci_passthrough="yes"
+  ;;
   --disable-brlapi) brlapi="no"
   ;;
   --enable-brlapi) brlapi="yes"
@@ -979,6 +984,8 @@ echo "                           (affects only QEMU, not qemu-img)"
 echo "  --enable-mixemu          enable mixer emulation"
 echo "  --disable-xen            disable xen backend driver support"
 echo "  --enable-xen             enable xen backend driver support"
+echo "  --disable-xen-pci-passthrough"
+echo "  --enable-xen-pci-passthrough"
 echo "  --disable-brlapi         disable BrlAPI"
 echo "  --enable-brlapi          enable BrlAPI"
 echo "  --disable-vnc-tls        disable TLS encryption for VNC server"
@@ -1342,6 +1349,21 @@ EOF
   fi
 fi
 
+if test "$xen_pci_passthrough" != "no"; then
+  if test "$xen" = "yes" && test "$linux" = "yes"; then
+    xen_pci_passthrough=yes
+  else
+    if test "$xen_pci_passthrough" = "yes"; then
+      echo "ERROR"
+      echo "ERROR: User requested feature Xen PCI Passthrough"
+      echo "ERROR: but this feature requires /sys from Linux"
+      echo "ERROR"
+      exit 1;
+    fi
+    xen_pci_passthrough=no
+  fi
+fi
+
 ##########################################
 # pkg-config probe
 
@@ -3398,6 +3420,9 @@ case "$target_arch2" in
     if test "$xen" = "yes" -a "$target_softmmu" = "yes" ; then
       target_phys_bits=64
       echo "CONFIG_XEN=y" >> $config_target_mak
+      if test "$xen_pci_passthrough" = yes; then
+        echo "CONFIG_XEN_PCI_PASSTHROUGH=y" >> "$config_target_mak"
+      fi
     else
       echo "CONFIG_NO_XEN=y" >> $config_target_mak
     fi
-- 
Anthony PERARD

* [Qemu-devel] [PATCH V3 02/10] Introduce HostPCIDevice to access a pci device on the host.
From: Anthony PERARD @ 2011-10-28 15:07 UTC
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
 Makefile.target      |    1 +
 hw/host-pci-device.c |  252 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/host-pci-device.h |   75 +++++++++++++++
 3 files changed, 328 insertions(+), 0 deletions(-)
 create mode 100644 hw/host-pci-device.c
 create mode 100644 hw/host-pci-device.h

diff --git a/Makefile.target b/Makefile.target
index 867d687..243f9f2 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -216,6 +216,7 @@ obj-$(CONFIG_NO_XEN) += xen-stub.o
 obj-i386-$(CONFIG_XEN) += xen_platform.o
 
 # Xen PCI Passthrough
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
 
 # Inter-VM PCI shared memory
 CONFIG_IVSHMEM =
diff --git a/hw/host-pci-device.c b/hw/host-pci-device.c
new file mode 100644
index 0000000..5eafc49
--- /dev/null
+++ b/hw/host-pci-device.c
@@ -0,0 +1,252 @@
+/*
+ * Copyright (C) 2011       Citrix Ltd.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "host-pci-device.h"
+
+#define PCI_MAX_EXT_CAP \
+    ((PCIE_CONFIG_SPACE_SIZE - PCI_CONFIG_SPACE_SIZE) / (PCI_CAP_SIZEOF + 4))
+
+enum error_code {
+    ERROR_SYNTAX = 1,
+};
+
+static int path_to(const HostPCIDevice *d,
+                   const char *name, char *buf, ssize_t size)
+{
+    return snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%x/%s",
+                    d->domain, d->bus, d->dev, d->func, name);
+}
+
+static int get_resource(HostPCIDevice *d)
+{
+    int i, rc = 0;
+    FILE *f;
+    char path[PATH_MAX];
+    unsigned long long start, end, flags, size;
+
+    path_to(d, "resource", path, sizeof (path));
+    f = fopen(path, "r");
+    if (!f) {
+        fprintf(stderr, "Error: Can't open %s: %s\n", path, strerror(errno));
+        return -errno;
+    }
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        if (fscanf(f, "%llx %llx %llx", &start, &end, &flags) != 3) {
+            fprintf(stderr, "Error: Syntax error in %s\n", path);
+            rc = ERROR_SYNTAX;
+            break;
+        }
+        if (start) {
+            size = end - start + 1;
+        } else {
+            size = 0;
+        }
+
+        if (i < PCI_ROM_SLOT) {
+            d->io_regions[i].base_addr = start;
+            d->io_regions[i].size = size;
+            d->io_regions[i].flags = flags;
+        } else {
+            d->rom.base_addr = start;
+            d->rom.size = size;
+            d->rom.flags = flags;
+        }
+    }
+
+    fclose(f);
+    return rc;
+}
+
+static unsigned long get_value(HostPCIDevice *d, const char *name)
+{
+    char path[PATH_MAX];
+    FILE *f;
+    unsigned long value;
+
+    path_to(d, name, path, sizeof (path));
+    f = fopen(path, "r");
+    if (!f) {
+        fprintf(stderr, "Error: Can't open %s: %s\n", path, strerror(errno));
+        return -1;
+    }
+    if (fscanf(f, "%lx\n", &value) != 1) {
+        fprintf(stderr, "Error: Syntax error in %s\n", path);
+        value = -1;
+    }
+    fclose(f);
+    return value;
+}
+
+static int pci_dev_is_virtfn(HostPCIDevice *d)
+{
+    int rc;
+    char path[PATH_MAX];
+    struct stat buf;
+
+    path_to(d, "physfn", path, sizeof (path));
+    rc = !stat(path, &buf);
+
+    return rc;
+}
+
+static int host_pci_config_fd(HostPCIDevice *d)
+{
+    char path[PATH_MAX];
+
+    if (d->config_fd < 0) {
+        path_to(d, "config", path, sizeof (path));
+        d->config_fd = open(path, O_RDWR);
+        if (d->config_fd < 0) {
+            fprintf(stderr, "HostPCIDevice: Can not open '%s': %s\n",
+                    path, strerror(errno));
+        }
+    }
+    return d->config_fd;
+}
+static int host_pci_config_read(HostPCIDevice *d, int pos, void *buf, int len)
+{
+    int fd = host_pci_config_fd(d);
+    int res = 0;
+
+    res = pread(fd, buf, len, pos);
+    if (res < 0) {
+        fprintf(stderr, "host_pci_config: read failed: %s (fd: %i)\n",
+                strerror(errno), fd);
+        return -1;
+    }
+    return res;
+}
+static int host_pci_config_write(HostPCIDevice *d,
+                                 int pos, const void *buf, int len)
+{
+    int fd = host_pci_config_fd(d);
+    int res = 0;
+
+    res = pwrite(fd, buf, len, pos);
+    if (res < 0) {
+        fprintf(stderr, "host_pci_config: write failed: %s\n",
+                strerror(errno));
+        return -1;
+    }
+    return res;
+}
+
+uint8_t host_pci_get_byte(HostPCIDevice *d, int pos)
+{
+    uint8_t buf;
+    host_pci_config_read(d, pos, &buf, 1);
+    return buf;
+}
+uint16_t host_pci_get_word(HostPCIDevice *d, int pos)
+{
+    uint16_t buf;
+    host_pci_config_read(d, pos, &buf, 2);
+    return le16_to_cpu(buf);
+}
+uint32_t host_pci_get_long(HostPCIDevice *d, int pos)
+{
+    uint32_t buf;
+    host_pci_config_read(d, pos, &buf, 4);
+    return le32_to_cpu(buf);
+}
+int host_pci_get_block(HostPCIDevice *d, int pos, uint8_t *buf, int len)
+{
+    return host_pci_config_read(d, pos, buf, len);
+}
+
+int host_pci_set_byte(HostPCIDevice *d, int pos, uint8_t data)
+{
+    return host_pci_config_write(d, pos, &data, 1);
+}
+int host_pci_set_word(HostPCIDevice *d, int pos, uint16_t data)
+{
+    data = cpu_to_le16(data);
+    return host_pci_config_write(d, pos, &data, 2);
+}
+int host_pci_set_long(HostPCIDevice *d, int pos, uint32_t data)
+{
+    data = cpu_to_le32(data);
+    return host_pci_config_write(d, pos, &data, 4);
+}
+int host_pci_set_block(HostPCIDevice *d, int pos, uint8_t *buf, int len)
+{
+    return host_pci_config_write(d, pos, buf, len);
+}
+
+uint32_t host_pci_find_ext_cap_offset(HostPCIDevice *d, uint32_t cap)
+{
+    uint32_t header = 0;
+    int max_cap = PCI_MAX_EXT_CAP;
+    int pos = PCI_CONFIG_SPACE_SIZE;
+
+    do {
+        header = host_pci_get_long(d, pos);
+        /*
+         * If we have no capabilities, this is indicated by cap ID,
+         * cap version and next pointer all being 0.
+         */
+        if (header == 0) {
+            break;
+        }
+
+        if (PCI_EXT_CAP_ID(header) == cap) {
+            return pos;
+        }
+
+        pos = PCI_EXT_CAP_NEXT(header);
+        if (pos < PCI_CONFIG_SPACE_SIZE) {
+            break;
+        }
+
+        max_cap--;
+    } while (max_cap > 0);
+
+    return 0;
+}
+
+HostPCIDevice *host_pci_device_get(uint8_t bus, uint8_t dev, uint8_t func)
+{
+    HostPCIDevice *d = NULL;
+
+    d = g_new0(HostPCIDevice, 1);
+
+    d->config_fd = -1;
+    d->domain = 0;
+    d->bus = bus;
+    d->dev = dev;
+    d->func = func;
+
+    if (host_pci_config_fd(d) == -1) {
+        goto error;
+    }
+    if (get_resource(d) != 0) {
+        goto error;
+    }
+
+    d->vendor_id = get_value(d, "vendor");
+    d->device_id = get_value(d, "device");
+    d->is_virtfn = pci_dev_is_virtfn(d);
+
+    return d;
+error:
+    if (d->config_fd >= 0) {
+        close(d->config_fd);
+    }
+    g_free(d);
+    return NULL;
+}
+
+void host_pci_device_put(HostPCIDevice *d)
+{
+    if (d->config_fd >= 0) {
+        close(d->config_fd);
+    }
+    g_free(d);
+}
diff --git a/hw/host-pci-device.h b/hw/host-pci-device.h
new file mode 100644
index 0000000..d79ba48
--- /dev/null
+++ b/hw/host-pci-device.h
@@ -0,0 +1,75 @@
+#ifndef HW_HOST_PCI_DEVICE
+#  define HW_HOST_PCI_DEVICE
+
+#include "pci.h"
+
+/*
+ * from linux/ioport.h
+ * IO resources have these defined flags.
+ */
+#define IORESOURCE_BITS         0x000000ff      /* Bus-specific bits */
+
+#define IORESOURCE_TYPE_BITS    0x00000f00      /* Resource type */
+#define IORESOURCE_IO           0x00000100
+#define IORESOURCE_MEM          0x00000200
+#define IORESOURCE_IRQ          0x00000400
+#define IORESOURCE_DMA          0x00000800
+
+#define IORESOURCE_PREFETCH     0x00001000      /* No side effects */
+#define IORESOURCE_READONLY     0x00002000
+#define IORESOURCE_CACHEABLE    0x00004000
+#define IORESOURCE_RANGELENGTH  0x00008000
+#define IORESOURCE_SHADOWABLE   0x00010000
+
+#define IORESOURCE_SIZEALIGN    0x00020000      /* size indicates alignment */
+#define IORESOURCE_STARTALIGN   0x00040000      /* start field is alignment */
+
+#define IORESOURCE_MEM_64       0x00100000
+
+    /* Userland may not map this resource */
+#define IORESOURCE_EXCLUSIVE    0x08000000
+#define IORESOURCE_DISABLED     0x10000000
+#define IORESOURCE_UNSET        0x20000000
+#define IORESOURCE_AUTO         0x40000000
+    /* Driver has marked this resource busy */
+#define IORESOURCE_BUSY         0x80000000
+
+
+typedef struct HostPCIIORegion {
+    unsigned long flags;
+    pcibus_t base_addr;
+    pcibus_t size;
+} HostPCIIORegion;
+
+typedef struct HostPCIDevice {
+    uint16_t domain;
+    uint8_t bus;
+    uint8_t dev;
+    uint8_t func;
+
+    uint16_t vendor_id;
+    uint16_t device_id;
+
+    HostPCIIORegion io_regions[PCI_NUM_REGIONS - 1];
+    HostPCIIORegion rom;
+
+    bool is_virtfn;
+
+    int config_fd;
+} HostPCIDevice;
+
+HostPCIDevice *host_pci_device_get(uint8_t bus, uint8_t dev, uint8_t func);
+void host_pci_device_put(HostPCIDevice *pci_dev);
+
+uint8_t host_pci_get_byte(HostPCIDevice *d, int pos);
+uint16_t host_pci_get_word(HostPCIDevice *d, int pos);
+uint32_t host_pci_get_long(HostPCIDevice *d, int pos);
+int host_pci_get_block(HostPCIDevice *d, int pos, uint8_t *buf, int len);
+int host_pci_set_byte(HostPCIDevice *d, int pos, uint8_t data);
+int host_pci_set_word(HostPCIDevice *d, int pos, uint16_t data);
+int host_pci_set_long(HostPCIDevice *d, int pos, uint32_t data);
+int host_pci_set_block(HostPCIDevice *d, int pos, uint8_t *buf, int len);
+
+uint32_t host_pci_find_ext_cap_offset(HostPCIDevice *s, uint32_t cap);
+
+#endif /* !HW_HOST_PCI_DEVICE */
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V3 02/10] Introduce HostPCIDevice to access a pci device on the host.
@ 2011-10-28 15:07   ` Anthony PERARD
  0 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
 Makefile.target      |    1 +
 hw/host-pci-device.c |  252 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/host-pci-device.h |   75 +++++++++++++++
 3 files changed, 328 insertions(+), 0 deletions(-)
 create mode 100644 hw/host-pci-device.c
 create mode 100644 hw/host-pci-device.h

diff --git a/Makefile.target b/Makefile.target
index 867d687..243f9f2 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -216,6 +216,7 @@ obj-$(CONFIG_NO_XEN) += xen-stub.o
 obj-i386-$(CONFIG_XEN) += xen_platform.o
 
 # Xen PCI Passthrough
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
 
 # Inter-VM PCI shared memory
 CONFIG_IVSHMEM =
diff --git a/hw/host-pci-device.c b/hw/host-pci-device.c
new file mode 100644
index 0000000..5eafc49
--- /dev/null
+++ b/hw/host-pci-device.c
@@ -0,0 +1,252 @@
+/*
+ * Copyright (C) 2011       Citrix Ltd.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "host-pci-device.h"
+
+#define PCI_MAX_EXT_CAP \
+    ((PCIE_CONFIG_SPACE_SIZE - PCI_CONFIG_SPACE_SIZE) / (PCI_CAP_SIZEOF + 4))
+
+enum error_code {
+    ERROR_SYNTAX = 1,
+};
+
+static int path_to(const HostPCIDevice *d,
+                   const char *name, char *buf, ssize_t size)
+{
+    return snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%x/%s",
+                    d->domain, d->bus, d->dev, d->func, name);
+}
+
+static int get_resource(HostPCIDevice *d)
+{
+    int i, rc = 0;
+    FILE *f;
+    char path[PATH_MAX];
+    unsigned long long start, end, flags, size;
+
+    path_to(d, "resource", path, sizeof (path));
+    f = fopen(path, "r");
+    if (!f) {
+        fprintf(stderr, "Error: Can't open %s: %s\n", path, strerror(errno));
+        return -errno;
+    }
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        if (fscanf(f, "%llx %llx %llx", &start, &end, &flags) != 3) {
+            fprintf(stderr, "Error: Syntax error in %s\n", path);
+            rc = ERROR_SYNTAX;
+            break;
+        }
+        if (start) {
+            size = end - start + 1;
+        } else {
+            size = 0;
+        }
+
+        if (i < PCI_ROM_SLOT) {
+            d->io_regions[i].base_addr = start;
+            d->io_regions[i].size = size;
+            d->io_regions[i].flags = flags;
+        } else {
+            d->rom.base_addr = start;
+            d->rom.size = size;
+            d->rom.flags = flags;
+        }
+    }
+
+    fclose(f);
+    return rc;
+}
+
+static unsigned long get_value(HostPCIDevice *d, const char *name)
+{
+    char path[PATH_MAX];
+    FILE *f;
+    unsigned long value;
+
+    path_to(d, name, path, sizeof (path));
+    f = fopen(path, "r");
+    if (!f) {
+        fprintf(stderr, "Error: Can't open %s: %s\n", path, strerror(errno));
+        return -1;
+    }
+    if (fscanf(f, "%lx\n", &value) != 1) {
+        fprintf(stderr, "Error: Syntax error in %s\n", path);
+        value = -1;
+    }
+    fclose(f);
+    return value;
+}
+
+static int pci_dev_is_virtfn(HostPCIDevice *d)
+{
+    int rc;
+    char path[PATH_MAX];
+    struct stat buf;
+
+    path_to(d, "physfn", path, sizeof (path));
+    rc = !stat(path, &buf);
+
+    return rc;
+}
+
+static int host_pci_config_fd(HostPCIDevice *d)
+{
+    char path[PATH_MAX];
+
+    if (d->config_fd < 0) {
+        path_to(d, "config", path, sizeof (path));
+        d->config_fd = open(path, O_RDWR);
+        if (d->config_fd < 0) {
+            fprintf(stderr, "HostPCIDevice: Can not open '%s': %s\n",
+                    path, strerror(errno));
+        }
+    }
+    return d->config_fd;
+}
+static int host_pci_config_read(HostPCIDevice *d, int pos, void *buf, int len)
+{
+    int fd = host_pci_config_fd(d);
+    int res = 0;
+
+    res = pread(fd, buf, len, pos);
+    if (res < 0) {
+        fprintf(stderr, "host_pci_config: read failed: %s (fd: %i)\n",
+                strerror(errno), fd);
+        return -1;
+    }
+    return res;
+}
+static int host_pci_config_write(HostPCIDevice *d,
+                                 int pos, const void *buf, int len)
+{
+    int fd = host_pci_config_fd(d);
+    int res = 0;
+
+    res = pwrite(fd, buf, len, pos);
+    if (res < 0) {
+        fprintf(stderr, "host_pci_config: write failed: %s\n",
+                strerror(errno));
+        return -1;
+    }
+    return res;
+}
+
+uint8_t host_pci_get_byte(HostPCIDevice *d, int pos)
+{
+  uint8_t buf;
+  host_pci_config_read(d, pos, &buf, 1);
+  return buf;
+}
+uint16_t host_pci_get_word(HostPCIDevice *d, int pos)
+{
+  uint16_t buf;
+  host_pci_config_read(d, pos, &buf, 2);
+  return le16_to_cpu(buf);
+}
+uint32_t host_pci_get_long(HostPCIDevice *d, int pos)
+{
+  uint32_t buf;
+  host_pci_config_read(d, pos, &buf, 4);
+  return le32_to_cpu(buf);
+}
+int host_pci_get_block(HostPCIDevice *d, int pos, uint8_t *buf, int len)
+{
+  return host_pci_config_read(d, pos, buf, len);
+}
+
+int host_pci_set_byte(HostPCIDevice *d, int pos, uint8_t data)
+{
+  return host_pci_config_write(d, pos, &data, 1);
+}
+int host_pci_set_word(HostPCIDevice *d, int pos, uint16_t data)
+{
+  data = cpu_to_le16(data);
+  return host_pci_config_write(d, pos, &data, 2);
+}
+int host_pci_set_long(HostPCIDevice *d, int pos, uint32_t data)
+{
+  data = cpu_to_le32(data);
+  return host_pci_config_write(d, pos, &data, 4);
+}
+int host_pci_set_block(HostPCIDevice *d, int pos, uint8_t *buf, int len)
+{
+  return host_pci_config_write(d, pos, buf, len);
+}
+
+uint32_t host_pci_find_ext_cap_offset(HostPCIDevice *d, uint32_t cap)
+{
+    uint32_t header = 0;
+    int max_cap = PCI_MAX_EXT_CAP;
+    int pos = PCI_CONFIG_SPACE_SIZE;
+
+    do {
+        header = host_pci_get_long(d, pos);
+        /*
+         * If we have no capabilities, this is indicated by cap ID,
+         * cap version and next pointer all being 0.
+         */
+        if (header == 0) {
+            break;
+        }
+
+        if (PCI_EXT_CAP_ID(header) == cap) {
+            return pos;
+        }
+
+        pos = PCI_EXT_CAP_NEXT(header);
+        if (pos < PCI_CONFIG_SPACE_SIZE) {
+            break;
+        }
+
+        max_cap--;
+    } while (max_cap > 0);
+
+    return 0;
+}
+
+HostPCIDevice *host_pci_device_get(uint8_t bus, uint8_t dev, uint8_t func)
+{
+    HostPCIDevice *d = NULL;
+
+    d = g_new0(HostPCIDevice, 1);
+
+    d->config_fd = -1;
+    d->domain = 0;
+    d->bus = bus;
+    d->dev = dev;
+    d->func = func;
+
+    if (host_pci_config_fd(d) == -1) {
+        goto error;
+    }
+    if (get_resource(d) != 0) {
+        goto error;
+    }
+
+    d->vendor_id = get_value(d, "vendor");
+    d->device_id = get_value(d, "device");
+    d->is_virtfn = pci_dev_is_virtfn(d);
+
+    return d;
+error:
+    if (d->config_fd >= 0) {
+        close(d->config_fd);
+    }
+    g_free(d);
+    return NULL;
+}
+
+void host_pci_device_put(HostPCIDevice *d)
+{
+    if (d->config_fd >= 0) {
+        close(d->config_fd);
+    }
+    g_free(d);
+}
diff --git a/hw/host-pci-device.h b/hw/host-pci-device.h
new file mode 100644
index 0000000..d79ba48
--- /dev/null
+++ b/hw/host-pci-device.h
@@ -0,0 +1,75 @@
+#ifndef HW_HOST_PCI_DEVICE
+#  define HW_HOST_PCI_DEVICE
+
+#include "pci.h"
+
+/*
+ * from linux/ioport.h
+ * IO resources have these defined flags.
+ */
+#define IORESOURCE_BITS         0x000000ff      /* Bus-specific bits */
+
+#define IORESOURCE_TYPE_BITS    0x00000f00      /* Resource type */
+#define IORESOURCE_IO           0x00000100
+#define IORESOURCE_MEM          0x00000200
+#define IORESOURCE_IRQ          0x00000400
+#define IORESOURCE_DMA          0x00000800
+
+#define IORESOURCE_PREFETCH     0x00001000      /* No side effects */
+#define IORESOURCE_READONLY     0x00002000
+#define IORESOURCE_CACHEABLE    0x00004000
+#define IORESOURCE_RANGELENGTH  0x00008000
+#define IORESOURCE_SHADOWABLE   0x00010000
+
+#define IORESOURCE_SIZEALIGN    0x00020000      /* size indicates alignment */
+#define IORESOURCE_STARTALIGN   0x00040000      /* start field is alignment */
+
+#define IORESOURCE_MEM_64       0x00100000
+
+    /* Userland may not map this resource */
+#define IORESOURCE_EXCLUSIVE    0x08000000
+#define IORESOURCE_DISABLED     0x10000000
+#define IORESOURCE_UNSET        0x20000000
+#define IORESOURCE_AUTO         0x40000000
+    /* Driver has marked this resource busy */
+#define IORESOURCE_BUSY         0x80000000
+
+
+typedef struct HostPCIIORegion {
+    unsigned long flags;
+    pcibus_t base_addr;
+    pcibus_t size;
+} HostPCIIORegion;
+
+typedef struct HostPCIDevice {
+    uint16_t domain;
+    uint8_t bus;
+    uint8_t dev;
+    uint8_t func;
+
+    uint16_t vendor_id;
+    uint16_t device_id;
+
+    HostPCIIORegion io_regions[PCI_NUM_REGIONS - 1];
+    HostPCIIORegion rom;
+
+    bool is_virtfn;
+
+    int config_fd;
+} HostPCIDevice;
+
+HostPCIDevice *host_pci_device_get(uint8_t bus, uint8_t dev, uint8_t func);
+void host_pci_device_put(HostPCIDevice *pci_dev);
+
+uint8_t host_pci_get_byte(HostPCIDevice *d, int pos);
+uint16_t host_pci_get_word(HostPCIDevice *d, int pos);
+uint32_t host_pci_get_long(HostPCIDevice *d, int pos);
+int host_pci_get_block(HostPCIDevice *d, int pos, uint8_t *buf, int len);
+int host_pci_set_byte(HostPCIDevice *d, int pos, uint8_t data);
+int host_pci_set_word(HostPCIDevice *d, int pos, uint16_t data);
+int host_pci_set_long(HostPCIDevice *d, int pos, uint32_t data);
+int host_pci_set_block(HostPCIDevice *d, int pos, uint8_t *buf, int len);
+
+uint32_t host_pci_find_ext_cap_offset(HostPCIDevice *s, uint32_t cap);
+
+#endif /* !HW_HOST_PCI_DEVICE */
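
Note for readers (not part of the patch): the walk done by
host_pci_find_ext_cap_offset() can be tried standalone. The sketch below is
only an illustration under stated assumptions: it reads from an in-memory copy
of config space instead of a HostPCIDevice, the helper names (cfg_get_long,
cfg_set_long, ext_cap_header, find_ext_cap_offset) are hypothetical, and the
macros are local stand-ins for PCI_EXT_CAP_ID, PCI_EXT_CAP_NEXT and
PCI_MAX_EXT_CAP as commonly defined in pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for the pci_regs.h macros (assumed values). */
#define CFG_SPACE_SIZE      256   /* extended capabilities start here */
#define CFG_SPACE_EXP_SIZE  4096  /* size of PCIe extended config space */
#define EXT_CAP_ID(h)       ((h) & 0x0000ffff)
#define EXT_CAP_NEXT(h)     (((h) >> 20) & 0xffc)
/* Each capability takes at least 8 bytes: (4096 - 256) / 8 = 480. */
#define MAX_EXT_CAP         ((CFG_SPACE_EXP_SIZE - CFG_SPACE_SIZE) / 8)

/* Hypothetical helper: read a little-endian dword from the copy. */
static uint32_t cfg_get_long(const uint8_t *cfg, int pos)
{
    return (uint32_t)cfg[pos] |
           ((uint32_t)cfg[pos + 1] << 8) |
           ((uint32_t)cfg[pos + 2] << 16) |
           ((uint32_t)cfg[pos + 3] << 24);
}

/* Hypothetical helper: store a little-endian dword into the copy. */
static void cfg_set_long(uint8_t *cfg, int pos, uint32_t v)
{
    cfg[pos] = v & 0xff;
    cfg[pos + 1] = (v >> 8) & 0xff;
    cfg[pos + 2] = (v >> 16) & 0xff;
    cfg[pos + 3] = (v >> 24) & 0xff;
}

/* Hypothetical helper: build an extended capability header. */
static uint32_t ext_cap_header(uint32_t id, uint32_t version, uint32_t next)
{
    return (id & 0xffff) | ((version & 0xf) << 16) | ((next & 0xffc) << 20);
}

/* Same loop shape as the patch: returns the offset, or 0 if not found. */
static int find_ext_cap_offset(const uint8_t *cfg, uint32_t cap)
{
    int max_cap = MAX_EXT_CAP;
    int pos = CFG_SPACE_SIZE;

    do {
        uint32_t header = cfg_get_long(cfg, pos);

        if (header == 0) {
            break;              /* no (more) capabilities */
        }
        if (EXT_CAP_ID(header) == cap) {
            return pos;
        }
        pos = EXT_CAP_NEXT(header);
        if (pos < CFG_SPACE_SIZE) {
            break;              /* next pointer of 0 ends the list */
        }
        max_cap--;              /* bound the walk against looping lists */
    } while (max_cap > 0);

    return 0;
}
```

With a capability 0x0001 at offset 0x100 chained to a capability 0x0003 at
0x140, a lookup for 0x0003 returns 0x140 and a lookup for an absent ID
returns 0. The max_cap counter is what keeps a malformed, self-referencing
list from spinning forever.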
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH V3 03/10] pci.c: Add pci_check_bar_overlap
  2011-10-28 15:07 ` Anthony PERARD
@ 2011-10-28 15:07   ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Yuji Shimada, Xen Devel

From: Yuji Shimada <shimada-yxb@necst.nec.co.jp>

This function helps the Xen PCI Passthrough device check whether a BAR
overlaps a region of another device on the same bus.

Signed-off-by: Yuji Shimada <shimada-yxb@necst.nec.co.jp>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
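Note for readers (not part of the commit message): the test at the core of
pci_check_bar_overlap() is a half-open interval intersection, combined with a
rule that I/O regions are only compared against I/O regions while prefetchable
and non-prefetchable memory are compared against each other. A minimal sketch,
assuming plain uint64_t addresses and no wraparound; the names ranges_overlap,
types_can_conflict and bar_conflicts are hypothetical, and the SPACE_* values
mirror the usual PCI_BASE_ADDRESS_* bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror PCI_BASE_ADDRESS_SPACE_* / PCI_BASE_ADDRESS_MEM_PREFETCH. */
enum {
    SPACE_MEM    = 0x00,
    SPACE_IO     = 0x01,
    MEM_PREFETCH = 0x08,
};

/* [a, a + a_size) and [b, b + b_size) intersect; adjacent ranges do not. */
static bool ranges_overlap(uint64_t a, uint64_t a_size,
                           uint64_t b, uint64_t b_size)
{
    return a < b + b_size && a + a_size > b;
}

/* Different types are skipped only when one of the two is I/O space. */
static bool types_can_conflict(uint8_t a, uint8_t b)
{
    if (a == b) {
        return true;
    }
    return a != SPACE_IO && b != SPACE_IO;
}

/* Combined check for one candidate BAR against one existing region. */
static bool bar_conflicts(uint64_t addr, uint64_t size, uint8_t type,
                          uint64_t r_addr, uint64_t r_size, uint8_t r_type)
{
    return types_can_conflict(type, r_type) &&
           ranges_overlap(addr, size, r_addr, r_size);
}
```

For example, a 4 KiB memory BAR at 0x1000 conflicts with a prefetchable
region at 0x1800 but not with one starting at 0x2000, and never with an I/O
region at the same numeric address.
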
 hw/pci.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/pci.h |    3 +++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index e8cc1b0..9f65216 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -2120,3 +2120,50 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
 {
     return dev->bus->address_space_io;
 }
+
+int pci_check_bar_overlap(PCIDevice *dev,
+                          pcibus_t addr, pcibus_t size, uint8_t type)
+{
+    PCIBus *bus = dev->bus;
+    PCIDevice *devices = NULL;
+    PCIIORegion *r;
+    int i, j;
+    int rc = 0;
+
+    /* check for overlap with the regions of the other devices on the bus */
+    for (i = 0; i < ARRAY_SIZE(bus->devices); i++) {
+        devices = bus->devices[i];
+        if (!devices) {
+            continue;
+        }
+
+        /* skip itself */
+        if (devices->devfn == dev->devfn) {
+            continue;
+        }
+
+        for (j = 0; j < PCI_NUM_REGIONS; j++) {
+            r = &devices->io_regions[j];
+
+            /* skip different resource type, but don't skip when
+             * prefetch and non-prefetch memory are compared.
+             */
+            if (type != r->type) {
+                if (type == PCI_BASE_ADDRESS_SPACE_IO ||
+                    r->type == PCI_BASE_ADDRESS_SPACE_IO) {
+                    continue;
+                }
+            }
+
+            if ((addr < (r->addr + r->size)) && ((addr + size) > r->addr)) {
+                printf("Overlapped to device[%02x:%02x.%x][Region:%d]"
+                       "[Address:%"PRIx64"h][Size:%"PRIx64"h]\n",
+                       pci_bus_num(bus), PCI_SLOT(devices->devfn),
+                       PCI_FUNC(devices->devfn), j, r->addr, r->size);
+                rc = 1;
+            }
+        }
+    }
+
+    return rc;
+}
diff --git a/hw/pci.h b/hw/pci.h
index 86a81c8..0e1a07d 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -487,4 +487,7 @@ static inline uint32_t pci_config_size(const PCIDevice *d)
     return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE : PCI_CONFIG_SPACE_SIZE;
 }
 
+int pci_check_bar_overlap(PCIDevice *dev,
+                          pcibus_t addr, pcibus_t size, uint8_t type);
+
 #endif
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH V3 04/10] pci_ids: Add INTEL_82599_VF id.
  2011-10-28 15:07 ` Anthony PERARD
@ 2011-10-28 15:07   ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
 hw/pci_ids.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/hw/pci_ids.h b/hw/pci_ids.h
index 83f3893..2ea5ec2 100644
--- a/hw/pci_ids.h
+++ b/hw/pci_ids.h
@@ -117,6 +117,7 @@
 #define PCI_DEVICE_ID_INTEL_82801I_UHCI6 0x2939
 #define PCI_DEVICE_ID_INTEL_82801I_EHCI1 0x293a
 #define PCI_DEVICE_ID_INTEL_82801I_EHCI2 0x293c
+#define PCI_DEVICE_ID_INTEL_82599_VF     0x10ed
 
 #define PCI_VENDOR_ID_XEN               0x5853
 #define PCI_DEVICE_ID_XEN_PLATFORM      0x0001
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH V3 05/10] pci_regs: Fix value of PCI_EXP_TYPE_RC_EC.
  2011-10-28 15:07 ` Anthony PERARD
@ 2011-10-28 15:07   ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel

Value checked against the PCI Express Base Specification, rev 1.1.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
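Note for readers (not part of the commit message): one way to see why 0x10
could not have been right is that the device/port type is a 4-bit field, bits
7:4 of the PCI Express Capabilities register (PCI_EXP_FLAGS_TYPE, 0x00f0 in
pci_regs.h). A minimal sketch, where pcie_port_type is a hypothetical helper
and the EXP_* macros are local copies of the pci_regs.h values:

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the pci_regs.h values (assumed). */
#define EXP_FLAGS_TYPE   0x00f0  /* device/port type field, bits 7:4 */
#define EXP_TYPE_RC_END  0x9     /* Root Complex Integrated Endpoint */
#define EXP_TYPE_RC_EC   0xa     /* Root Complex Event Collector */

/* Extract the 4-bit device/port type from the capabilities register. */
static unsigned pcie_port_type(uint16_t exp_flags)
{
    return (exp_flags & EXP_FLAGS_TYPE) >> 4;
}
```

Whatever the register holds, the extracted type is at most 0xf, so a
comparison against 0x10 can never be true, while 0xa matches an Event
Collector.
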
 hw/pci_regs.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/pci_regs.h b/hw/pci_regs.h
index e8357c3..6b42515 100644
--- a/hw/pci_regs.h
+++ b/hw/pci_regs.h
@@ -393,7 +393,7 @@
 #define  PCI_EXP_TYPE_DOWNSTREAM 0x6	/* Downstream Port */
 #define  PCI_EXP_TYPE_PCI_BRIDGE 0x7	/* PCI/PCI-X Bridge */
 #define  PCI_EXP_TYPE_RC_END	0x9	/* Root Complex Integrated Endpoint */
-#define  PCI_EXP_TYPE_RC_EC	0x10	/* Root Complex Event Collector */
+#define  PCI_EXP_TYPE_RC_EC     0xa     /* Root Complex Event Collector */
 #define PCI_EXP_FLAGS_SLOT	0x0100	/* Slot implemented */
 #define PCI_EXP_FLAGS_IRQ	0x3e00	/* Interrupt message number */
 #define PCI_EXP_DEVCAP		4	/* Device capabilities */
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH V3 06/10] pci_regs: Add PCI_EXP_TYPE_PCIE_BRIDGE
  2011-10-28 15:07 ` Anthony PERARD
@ 2011-10-28 15:07   ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
 hw/pci_regs.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/hw/pci_regs.h b/hw/pci_regs.h
index 6b42515..56a404b 100644
--- a/hw/pci_regs.h
+++ b/hw/pci_regs.h
@@ -392,6 +392,7 @@
 #define  PCI_EXP_TYPE_UPSTREAM	0x5	/* Upstream Port */
 #define  PCI_EXP_TYPE_DOWNSTREAM 0x6	/* Downstream Port */
 #define  PCI_EXP_TYPE_PCI_BRIDGE 0x7	/* PCI/PCI-X Bridge */
+#define  PCI_EXP_TYPE_PCIE_BRIDGE 0x8   /* PCI/PCI-X to PCIE Bridge */
 #define  PCI_EXP_TYPE_RC_END	0x9	/* Root Complex Integrated Endpoint */
 #define  PCI_EXP_TYPE_RC_EC     0xa     /* Root Complex Event Collector */
 #define PCI_EXP_FLAGS_SLOT	0x0100	/* Slot implemented */
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
  2011-10-28 15:07 ` Anthony PERARD
@ 2011-10-28 15:07   ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini
  Cc: Anthony PERARD, Guy Zana, Xen Devel, Allen Kay

From: Allen Kay <allen.m.kay@intel.com>

Signed-off-by: Allen Kay <allen.m.kay@intel.com>
Signed-off-by: Guy Zana <guy@neocleus.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
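Note for readers (not part of the commit message): the config space handlers
in this patch first validate the access (offset range, length of 1, 2 or 4,
natural alignment) and then shift the value into its byte lane within the
enclosing dword before emulating individual registers. The sketch below
illustrates both steps under assumptions: the function names are hypothetical,
and offsets 00h through FFh are treated as valid, as the log message in
pt_pci_config_access_check() describes.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the access is acceptable, 0 otherwise. */
static int config_access_ok(uint32_t address, int len)
{
    if (address > 0xff) {
        return 0;               /* beyond the 256-byte config space */
    }
    if (len != 1 && len != 2 && len != 4) {
        return 0;               /* only byte, word and dword accesses */
    }
    if (address & (uint32_t)(len - 1)) {
        return 0;               /* not naturally aligned */
    }
    return 1;
}

/* Move a value read at 'address' into its lane of the aligned dword. */
static uint32_t to_dword_window(uint32_t val, uint32_t address)
{
    return val << ((address & 3) << 3);
}

/* Undo to_dword_window() before handing the value back to the caller. */
static uint32_t from_dword_window(uint32_t val, uint32_t address)
{
    return val >> ((address & 3) << 3);
}
```

A byte 0xab read at offset 0x06 lands in bits 23:16 of the dword at 0x04
(0x00ab0000), which lets per-register handlers work on dword-relative masks;
shifting back recovers 0xab.
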
 Makefile.target                  |    2 +
 hw/xen_pci_passthrough.c         |  838 ++++++++++++++++++++++++++++++++++++++
 hw/xen_pci_passthrough.h         |  223 ++++++++++
 hw/xen_pci_passthrough_helpers.c |   46 ++
 4 files changed, 1109 insertions(+), 0 deletions(-)
 create mode 100644 hw/xen_pci_passthrough.c
 create mode 100644 hw/xen_pci_passthrough.h
 create mode 100644 hw/xen_pci_passthrough_helpers.c

diff --git a/Makefile.target b/Makefile.target
index 243f9f2..36ea47d 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -217,6 +217,8 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
 
 # Xen PCI Passthrough
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
 
 # Inter-VM PCI shared memory
 CONFIG_IVSHMEM =
diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
new file mode 100644
index 0000000..b97c5b6
--- /dev/null
+++ b/hw/xen_pci_passthrough.c
@@ -0,0 +1,838 @@
+/*
+ * Copyright (c) 2007, Neocleus Corporation.
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Alex Novik <alex@neocleus.com>
+ * Allen Kay <allen.m.kay@intel.com>
+ * Guy Zana <guy@neocleus.com>
+ *
+ * This file implements direct PCI assignment to an HVM guest
+ */
+
+/*
+ * Interrupt Disable policy:
+ *
+ * INTx interrupt:
+ *   Initialize(register_real_device)
+ *     Map INTx(xc_physdev_map_pirq):
+ *       <fail>
+ *         - Set real Interrupt Disable bit to '1'.
+ *         - Set machine_irq and assigned_device->machine_irq to '0'.
+ *         * Don't bind INTx.
+ *
+ *     Bind INTx(xc_domain_bind_pt_pci_irq):
+ *       <fail>
+ *         - Set real Interrupt Disable bit to '1'.
+ *         - Unmap INTx.
+ *         - Decrement mapped_machine_irq[machine_irq]
+ *         - Set assigned_device->machine_irq to '0'.
+ *
+ *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
+ *     Write '0'
+ *       <ptdev->msi_trans_en is false>
+ *         - Set real bit to '0' if assigned_device->machine_irq isn't '0'.
+ *
+ *     Write '1'
+ *       <ptdev->msi_trans_en is false>
+ *         - Set real bit to '1'.
+ *
+ * MSI-INTx translation.
+ *   Initialize(xc_physdev_map_pirq_msi/pt_msi_setup)
+ *     Bind MSI-INTx(xc_domain_bind_pt_irq)
+ *       <fail>
+ *         - Unmap MSI.
+ *           <success>
+ *             - Set dev->msi->pirq to '-1'.
+ *           <fail>
+ *             - Do nothing.
+ *
+ *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
+ *     Write '0'
+ *       <ptdev->msi_trans_en is true>
+ *         - Set MSI Enable bit to '1'.
+ *
+ *     Write '1'
+ *       <ptdev->msi_trans_en is true>
+ *         - Set MSI Enable bit to '0'.
+ *
+ * MSI interrupt:
+ *   Initialize MSI register(pt_msi_setup, pt_msi_update)
+ *     Bind MSI(xc_domain_update_msi_irq)
+ *       <fail>
+ *         - Unmap MSI.
+ *         - Set dev->msi->pirq to '-1'.
+ *
+ * MSI-X interrupt:
+ *   Initialize MSI-X register(pt_msix_update_one)
+ *     Bind MSI-X(xc_domain_update_msi_irq)
+ *       <fail>
+ *         - Unmap MSI-X.
+ *         - Set entry->pirq to '-1'.
+ */
+
+#include <sys/ioctl.h>
+
+#include "pci.h"
+#include "xen.h"
+#include "xen_backend.h"
+#include "xen_pci_passthrough.h"
+
+#define PCI_BAR_ENTRIES (6)
+
+#define PT_NR_IRQS          (256)
+char mapped_machine_irq[PT_NR_IRQS] = {0};
+
+/* Config Space */
+static int pt_pci_config_access_check(PCIDevice *d, uint32_t address, int len)
+{
+    /* check offset range */
+    if (address > 0xFF) {
+        PT_LOG("Error: Failed to access register with offset exceeding FFh. "
+               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               address, len);
+        return -1;
+    }
+
+    /* check access length */
+    if ((len != 1) && (len != 2) && (len != 4)) {
+        PT_LOG("Error: Failed to access register with invalid access length. "
+               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               address, len);
+        return -1;
+    }
+
+    /* check offset alignment */
+    if (address & (len - 1)) {
+        PT_LOG("Error: Failed to access register with invalid access size "
+            "alignment. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+            pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+            address, len);
+        return -1;
+    }
+
+    return 0;
+}
+
+int pt_bar_offset_to_index(uint32_t offset)
+{
+    int index = 0;
+
+    /* check Exp ROM BAR */
+    if (offset == PCI_ROM_ADDRESS) {
+        return PCI_ROM_SLOT;
+    }
+
+    /* calculate BAR index */
+    index = (offset - PCI_BASE_ADDRESS_0) >> 2;
+    if (index >= PCI_NUM_REGIONS) {
+        return -1;
+    }
+
+    return index;
+}
+
+static uint32_t pt_pci_read_config(PCIDevice *d, uint32_t address, int len)
+{
+    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
+    uint32_t val = 0;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    int rc = 0;
+    int emul_len = 0;
+    uint32_t find_addr = address;
+
+    if (pt_pci_config_access_check(d, address, len)) {
+        goto exit;
+    }
+
+    /* check power state transition flags */
+    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
+        /* can't accept a request until the previous power state transition
+         * has completed, so finish this request here.
+         */
+        PT_LOG("Warning: guest wants to read during power state transition\n");
+        goto exit;
+    }
+
+    /* find register group entry */
+    reg_grp_entry = pt_find_reg_grp(s, address);
+    if (reg_grp_entry) {
+        /* check 0 Hardwired register group */
+        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
+            /* no need to emulate, just return 0 */
+            val = 0;
+            goto exit;
+        }
+    }
+
+    /* read I/O device register value */
+    rc = host_pci_get_block(s->real_device, address, (uint8_t *)&val, len);
+    if (!rc) {
+        PT_LOG("Error: host_pci_get_block failed. return value[%d].\n", rc);
+        memset(&val, 0xff, len);
+    }
+
+    /* just return the I/O device register value for
+     * passthrough type register group */
+    if (reg_grp_entry == NULL) {
+        goto exit;
+    }
+
+    /* adjust the read value to appropriate CFC-CFF window */
+    val <<= (address & 3) << 3;
+    emul_len = len;
+
+    /* loop Guest request size */
+    while (emul_len > 0) {
+        /* find register entry to be emulated */
+        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
+        if (reg_entry) {
+            XenPTRegInfo *reg = reg_entry->reg;
+            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
+            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
+            uint8_t *ptr_val = NULL;
+
+            valid_mask <<= (find_addr - real_offset) << 3;
+            ptr_val = (uint8_t *)&val + (real_offset & 3);
+
+            /* do emulation depending on register size */
+            switch (reg->size) {
+            case 1:
+                if (reg->u.b.read) {
+                    rc = reg->u.b.read(s, reg_entry, ptr_val, valid_mask);
+                }
+                break;
+            case 2:
+                if (reg->u.w.read) {
+                    rc = reg->u.w.read(s, reg_entry,
+                                       (uint16_t *)ptr_val, valid_mask);
+                }
+                break;
+            case 4:
+                if (reg->u.dw.read) {
+                    rc = reg->u.dw.read(s, reg_entry,
+                                        (uint32_t *)ptr_val, valid_mask);
+                }
+                break;
+            }
+
+            if (rc < 0) {
+                hw_error("Internal error: Invalid read emulation "
+                         "return value[%d]. I/O emulator exit.\n", rc);
+            }
+
+            /* calculate next address to find */
+            emul_len -= reg->size;
+            if (emul_len > 0) {
+                find_addr = real_offset + reg->size;
+            }
+        } else {
+            /* nothing to do with passthrough type register,
+             * continue to find next byte */
+            emul_len--;
+            find_addr++;
+        }
+    }
+
+    /* need to shift back before returning them to pci bus emulator */
+    val >>= ((address & 3) << 3);
+
+exit:
+    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
+                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+                  address, val, len);
+    return val;
+}
+
+static void pt_pci_write_config(PCIDevice *d, uint32_t address,
+                                uint32_t val, int len)
+{
+    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
+    int index = 0;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    int rc = 0;
+    uint32_t read_val = 0;
+    int emul_len = 0;
+    XenPTReg *reg_entry = NULL;
+    uint32_t find_addr = address;
+    XenPTRegInfo *reg = NULL;
+
+    if (pt_pci_config_access_check(d, address, len)) {
+        return;
+    }
+
+    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
+                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+                  address, val, len);
+
+    /* check unused BAR register */
+    index = pt_bar_offset_to_index(address);
+    if ((index >= 0) && (val > 0 && val < PT_BAR_ALLF) &&
+        (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED)) {
+        PT_LOG("Warning: Guest attempt to set address to unused Base Address "
+               "Register. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               address, len);
+    }
+
+    /* check power state transition flags */
+    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
+        /* can't accept a request until the previous power state transition
+         * has completed, so finish this request here.
+         */
+        PT_LOG("Warning: guest wants to write during power state transition\n");
+        return;
+    }
+
+    /* find register group entry */
+    reg_grp_entry = pt_find_reg_grp(s, address);
+    if (reg_grp_entry) {
+        /* check 0 Hardwired register group */
+        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
+            /* ignore silently */
+            PT_LOG("Warning: Access to 0 Hardwired register. "
+                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+                   address, len);
+            return;
+        }
+    }
+
+    /* read I/O device register value */
+    rc = host_pci_get_block(s->real_device, address,
+                             (uint8_t *)&read_val, len);
+    if (!rc) {
+        PT_LOG("Error: host_pci_get_block failed. return value[%d].\n", rc);
+        memset(&read_val, 0xff, len);
+    }
+
+    /* pass directly to libpci for passthrough type register group */
+    if (reg_grp_entry == NULL) {
+        goto out;
+    }
+
+    /* adjust the read and write value to appropriate CFC-CFF window */
+    read_val <<= (address & 3) << 3;
+    val <<= (address & 3) << 3;
+    emul_len = len;
+
+    /* loop Guest request size */
+    while (emul_len > 0) {
+        /* find register entry to be emulated */
+        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
+        if (reg_entry) {
+            reg = reg_entry->reg;
+            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
+            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
+            uint8_t *ptr_val = NULL;
+
+            valid_mask <<= (find_addr - real_offset) << 3;
+            ptr_val = (uint8_t *)&val + (real_offset & 3);
+
+            /* do emulation depending on register size */
+            switch (reg->size) {
+            case 1:
+                if (reg->u.b.write) {
+                    rc = reg->u.b.write(s, reg_entry, ptr_val,
+                                        read_val >> ((real_offset & 3) << 3),
+                                        valid_mask);
+                }
+                break;
+            case 2:
+                if (reg->u.w.write) {
+                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
+                                        (read_val >> ((real_offset & 3) << 3)),
+                                        valid_mask);
+                }
+                break;
+            case 4:
+                if (reg->u.dw.write) {
+                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
+                                         (read_val >> ((real_offset & 3) << 3)),
+                                         valid_mask);
+                }
+                break;
+            }
+
+            if (rc < 0) {
+                hw_error("Internal error: Invalid write emulation "
+                         "return value[%d]. I/O emulator exit.\n", rc);
+            }
+
+            /* calculate next address to find */
+            emul_len -= reg->size;
+            if (emul_len > 0) {
+                find_addr = real_offset + reg->size;
+            }
+        } else {
+            /* nothing to do with passthrough type register,
+             * continue to find next byte */
+            emul_len--;
+            find_addr++;
+        }
+    }
+
+    /* need to shift back before passing them to libpci */
+    val >>= (address & 3) << 3;
+
+out:
+    if (!(reg && reg->no_wb)) {
+        /* unknown regs are passed through */
+        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
+
+        if (!rc) {
+            PT_LOG("Error: host_pci_set_block failed. return value[%d].\n", rc);
+        }
+    }
+
+    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
+        qemu_mod_timer(s->pm_state->pm_timer,
+                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
+    }
+}
+
+/* ioport/iomem space */
+static void pt_iomem_map(XenPCIPassthroughState *s, int i,
+                         pcibus_t e_phys, pcibus_t e_size, int type)
+{
+    uint32_t old_ebase = s->bases[i].e_physbase;
+    bool first_map = s->bases[i].e_size == 0;
+    int ret = 0;
+
+    s->bases[i].e_physbase = e_phys;
+    s->bases[i].e_size = e_size;
+
+    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%d"
+           " len=%#"PRIx64" index=%d first_map=%d\n",
+           e_phys, s->bases[i].access.maddr, type,
+           e_size, i, first_map);
+
+    if (e_size == 0) {
+        return;
+    }
+
+    if (!first_map && old_ebase != -1) {
+        /* Remove old mapping */
+        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
+                               old_ebase >> XC_PAGE_SHIFT,
+                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
+                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
+                               DPCI_REMOVE_MAPPING);
+        if (ret != 0) {
+            PT_LOG("Error: remove old mapping failed!\n");
+            return;
+        }
+    }
+
+    /* map only valid guest address */
+    if (e_phys != -1) {
+        /* Create new mapping */
+        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
+                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
+                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
+                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
+                                   DPCI_ADD_MAPPING);
+
+        if (ret != 0) {
+            PT_LOG("Error: create new mapping failed!\n");
+        }
+    }
+}
+
+static void pt_ioport_map(XenPCIPassthroughState *s, int i,
+                          pcibus_t e_phys, pcibus_t e_size, int type)
+{
+    uint32_t old_ebase = s->bases[i].e_physbase;
+    bool first_map = s->bases[i].e_size == 0;
+    int ret = 0;
+
+    s->bases[i].e_physbase = e_phys;
+    s->bases[i].e_size = e_size;
+
+    PT_LOG("e_phys=%#04"PRIx64" pio_base=%#04"PRIx64" len=%"PRId64" index=%d"
+           " first_map=%d\n",
+           e_phys, s->bases[i].access.pio_base, e_size, i, first_map);
+
+    if (e_size == 0) {
+        return;
+    }
+
+    if (!first_map && old_ebase != -1) {
+        /* Remove old mapping */
+        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, old_ebase,
+                                       s->bases[i].access.pio_base, e_size,
+                                       DPCI_REMOVE_MAPPING);
+        if (ret != 0) {
+            PT_LOG("Error: remove old mapping failed!\n");
+            return;
+        }
+    }
+
+    /* map only valid guest address (include 0) */
+    if (e_phys != -1) {
+        /* Create new mapping */
+        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, e_phys,
+                                       s->bases[i].access.pio_base, e_size,
+                                       DPCI_ADD_MAPPING);
+        if (ret != 0) {
+            PT_LOG("Error: create new mapping failed!\n");
+        }
+    }
+
+}
+
+
+/* mapping BAR */
+
+void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
+                        int io_enable, int mem_enable)
+{
+    PCIDevice *dev = &s->dev;
+    PCIIORegion *r;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegion *base = NULL;
+    pcibus_t r_size = 0, r_addr = -1;
+    int rc = 0;
+
+    r = &dev->io_regions[bar];
+
+    /* check valid region */
+    if (!r->size) {
+        return;
+    }
+
+    base = &s->bases[bar];
+    /* skip unused BAR or upper 64bit BAR */
+    if ((base->bar_flag == PT_BAR_FLAG_UNUSED)
+        || (base->bar_flag == PT_BAR_FLAG_UPPER)) {
+        return;
+    }
+
+    /* copy region address to temporary */
+    r_addr = r->addr;
+
+    /* unmap if I/O Space or Memory Space decoding is disabled */
+    if (((base->bar_flag == PT_BAR_FLAG_IO) && !io_enable) ||
+        ((base->bar_flag == PT_BAR_FLAG_MEM) && !mem_enable)) {
+        r_addr = -1;
+    }
+    if ((bar == PCI_ROM_SLOT) && (r_addr != -1)) {
+        reg_grp_entry = pt_find_reg_grp(s, PCI_ROM_ADDRESS);
+        if (reg_grp_entry) {
+            reg_entry = pt_find_reg(reg_grp_entry, PCI_ROM_ADDRESS);
+            if (reg_entry && !(reg_entry->data & PCI_ROM_ADDRESS_ENABLE)) {
+                r_addr = -1;
+            }
+        }
+    }
+
+    /* prevent guest software mapping memory resource to 00000000h */
+    if ((base->bar_flag == PT_BAR_FLAG_MEM) && (r_addr == 0)) {
+        r_addr = -1;
+    }
+
+    r_size = pt_get_emul_size(base->bar_flag, r->size);
+
+    rc = pci_check_bar_overlap(dev, r_addr, r_size, r->type);
+    if (rc > 0) {
+        PT_LOG("Warning: s[%02x:%02x.%x][Region:%d][Address:%"FMT_PCIBUS"h]"
+               "[Size:%"FMT_PCIBUS"h] is overlapped.\n", pci_bus_num(dev->bus),
+               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), bar,
+               r_addr, r_size);
+    }
+
+    /* check whether we need to update the mapping or not */
+    if (r_addr != s->bases[bar].e_physbase) {
+        /* mapping BAR */
+        if (base->bar_flag == PT_BAR_FLAG_IO) {
+            pt_ioport_map(s, bar, r_addr, r_size, r->type);
+        } else {
+            pt_iomem_map(s, bar, r_addr, r_size, r->type);
+        }
+    }
+}
+
+void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
+{
+    int i;
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        pt_bar_mapping_one(s, i, io_enable, mem_enable);
+    }
+}
+
+/* register regions */
+static int pt_register_regions(XenPCIPassthroughState *s)
+{
+    int i = 0;
+    uint32_t bar_data = 0;
+    HostPCIDevice *d = s->real_device;
+
+    /* Register PIO/MMIO BARs */
+    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
+        HostPCIIORegion *r = &d->io_regions[i];
+
+        if (r->base_addr) {
+            s->bases[i].e_physbase = r->base_addr;
+            s->bases[i].access.u = r->base_addr;
+
+            /* Register current region */
+            if (r->flags & IORESOURCE_IO) {
+                memory_region_init_io(&s->bar[i], NULL, NULL,
+                                      "xen-pci-pt-bar", r->size);
+                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
+                                 &s->bar[i]);
+            } else if (r->flags & IORESOURCE_PREFETCH) {
+                memory_region_init_io(&s->bar[i], NULL, NULL,
+                                      "xen-pci-pt-bar", r->size);
+                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
+                                 &s->bar[i]);
+            } else {
+                memory_region_init_io(&s->bar[i], NULL, NULL,
+                                      "xen-pci-pt-bar", r->size);
+                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                                 &s->bar[i]);
+            }
+
+            PT_LOG("IO region registered (size=0x%08"PRIx64
+                   " base_addr=0x%08"PRIx64")\n",
+                   r->size, r->base_addr);
+        }
+    }
+
+    /* Register expansion ROM address */
+    if (d->rom.base_addr && d->rom.size) {
+        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
+        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
+        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
+            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
+            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
+        }
+
+        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
+        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
+
+        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
+                                      "xen-pci-pt-rom", d->rom.size);
+        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
+                         &s->rom);
+
+        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
+               " base_addr=0x%08"PRIx64")\n",
+               d->rom.size, d->rom.base_addr);
+    }
+
+    return 0;
+}
+
+static void pt_unregister_regions(XenPCIPassthroughState *s)
+{
+    int i, type, rc;
+    uint32_t e_size;
+    PCIDevice *d = &s->dev;
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        e_size = s->bases[i].e_size;
+        if ((e_size == 0) || (s->bases[i].e_physbase == -1)) {
+            continue;
+        }
+
+        type = d->io_regions[i].type;
+
+        if (type == PCI_BASE_ADDRESS_SPACE_MEMORY
+            || type == PCI_BASE_ADDRESS_MEM_PREFETCH) {
+            rc = xc_domain_memory_mapping(xen_xc, xen_domid,
+                    s->bases[i].e_physbase >> XC_PAGE_SHIFT,
+                    s->bases[i].access.maddr >> XC_PAGE_SHIFT,
+                    (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
+                    DPCI_REMOVE_MAPPING);
+            if (rc != 0) {
+                PT_LOG("Error: remove old mem mapping failed!\n");
+                continue;
+            }
+
+        } else if (type == PCI_BASE_ADDRESS_SPACE_IO) {
+            rc = xc_domain_ioport_mapping(xen_xc, xen_domid,
+                        s->bases[i].e_physbase,
+                        s->bases[i].access.pio_base,
+                        e_size,
+                        DPCI_REMOVE_MAPPING);
+            if (rc != 0) {
+                PT_LOG("Error: remove old io mapping failed!\n");
+                continue;
+            }
+        }
+    }
+}
+
+static int pt_initfn(PCIDevice *pcidev)
+{
+    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
+    int dom, bus;
+    unsigned slot, func;
+    int rc = 0;
+    uint32_t machine_irq;
+    int pirq = -1;
+
+    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
+        fprintf(stderr, "error parsing BDF: %s\n", s->hostaddr);
+        return -1;
+    }
+
+    /* register real device */
+    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
+           bus, slot, func, s->dev.devfn);
+
+    s->real_device = host_pci_device_get(bus, slot, func);
+    if (!s->real_device) {
+        return -1;
+    }
+
+    s->is_virtfn = s->real_device->is_virtfn;
+    if (s->is_virtfn) {
+        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
+               s->real_device->domain, bus, slot, func);
+    }
+
+    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
+    if (host_pci_get_block(s->real_device, 0, pcidev->config,
+                           PCI_CONFIG_SPACE_SIZE) == -1) {
+        return -1;
+    }
+
+    /* Handle real device's MMIO/PIO BARs */
+    pt_register_regions(s);
+
+    /* reinitialize each config register to be emulated */
+    pt_config_init(s);
+
+    /* Bind interrupt */
+    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
+        PT_LOG("no pin interrupt\n");
+        goto out;
+    }
+
+    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
+    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
+
+    if (rc) {
+        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
+
+        /* Disable PCI INTx assertion (set bit 10 of PCI_COMMAND) */
+        host_pci_set_word(s->real_device,
+                          PCI_COMMAND,
+                          pci_get_word(s->dev.config + PCI_COMMAND)
+                          | PCI_COMMAND_INTX_DISABLE);
+        machine_irq = 0;
+        s->machine_irq = 0;
+    } else {
+        machine_irq = pirq;
+        s->machine_irq = pirq;
+        mapped_machine_irq[machine_irq]++;
+    }
+
+    /* bind machine_irq to device */
+    if (machine_irq != 0) {
+        uint8_t e_device = PCI_SLOT(s->dev.devfn);
+        uint8_t e_intx = pci_intx(s);
+
+        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
+                                       e_device, e_intx);
+        if (rc < 0) {
+            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);
+
+            /* Disable PCI INTx assertion (set bit 10 of PCI_COMMAND) */
+            host_pci_set_word(s->real_device, PCI_COMMAND,
+                              pci_get_word(s->dev.config + PCI_COMMAND)
+                              | PCI_COMMAND_INTX_DISABLE);
+            mapped_machine_irq[machine_irq]--;
+
+            if (mapped_machine_irq[machine_irq] == 0) {
+                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
+                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
+                           rc);
+                }
+            }
+            s->machine_irq = 0;
+        }
+    }
+
+out:
+    PT_LOG("Real physical device %02x:%02x.%x registered successfully!\n"
+           "IRQ type = %s\n", bus, slot, func, "INTx");
+
+    return 0;
+}
+
+static int pt_unregister_device(PCIDevice *pcidev)
+{
+    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
+    uint8_t e_device, e_intx;
+    uint32_t machine_irq;
+    int rc;
+
+    /* Unbind interrupt */
+    e_device = PCI_SLOT(s->dev.devfn);
+    e_intx = pci_intx(s);
+    machine_irq = s->machine_irq;
+
+    if (machine_irq) {
+        rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
+                                     PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
+        if (rc < 0) {
+            PT_LOG("Error: Unbinding of interrupt failed! rc=%d\n", rc);
+        }
+    }
+
+    if (machine_irq) {
+        mapped_machine_irq[machine_irq]--;
+
+        if (mapped_machine_irq[machine_irq] == 0) {
+            rc = xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq);
+
+            if (rc < 0) {
+                PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n", rc);
+            }
+        }
+    }
+
+    /* delete all emulated config registers */
+    pt_config_delete(s);
+
+    /* unregister real device's MMIO/PIO BARs */
+    pt_unregister_regions(s);
+
+    host_pci_device_put(s->real_device);
+
+    return 0;
+}
+
+static PCIDeviceInfo xen_pci_passthrough = {
+    .init = pt_initfn,
+    .exit = pt_unregister_device,
+    .qdev.name = "xen-pci-passthrough",
+    .qdev.desc = "Assign a host PCI device with Xen",
+    .qdev.size = sizeof(XenPCIPassthroughState),
+    .config_read = pt_pci_read_config,
+    .config_write = pt_pci_write_config,
+    .is_express = 0,
+    .qdev.props = (Property[]) {
+        DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
+        DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
+                        0, false),
+        DEFINE_PROP_END_OF_LIST(),
+    }
+};
+
+static void xen_passthrough_register(void)
+{
+    pci_qdev_register(&xen_pci_passthrough);
+}
+
+device_init(xen_passthrough_register);
diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
new file mode 100644
index 0000000..2d1979d
--- /dev/null
+++ b/hw/xen_pci_passthrough.h
@@ -0,0 +1,223 @@
+#ifndef QEMU_HW_XEN_PCI_PASSTHROUGH_H
+#  define QEMU_HW_XEN_PCI_PASSTHROUGH_H
+
+#include "qemu-common.h"
+#include "xen_common.h"
+#include "pci.h"
+#include "host-pci-device.h"
+
+#define PT_LOGGING_ENABLED
+#define PT_DEBUG_PCI_CONFIG_ACCESS
+
+#ifdef PT_LOGGING_ENABLED
+#  define PT_LOG(_f, _a...)   fprintf(stderr, "%s: " _f, __func__, ##_a)
+#else
+#  define PT_LOG(_f, _a...)
+#endif
+
+#ifdef PT_DEBUG_PCI_CONFIG_ACCESS
+#  define PT_LOG_CONFIG(_f, _a...) PT_LOG(_f, ##_a)
+#else
+#  define PT_LOG_CONFIG(_f, _a...)
+#endif
+
+
+typedef struct XenPTRegInfo XenPTRegInfo;
+typedef struct XenPTReg XenPTReg;
+
+typedef struct XenPCIPassthroughState XenPCIPassthroughState;
+
+/* function type for config reg */
+typedef uint32_t (*conf_reg_init)
+    (XenPCIPassthroughState *, XenPTRegInfo *, uint32_t real_offset);
+typedef int (*conf_dword_write)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint32_t *val, uint32_t dev_value, uint32_t valid_mask);
+typedef int (*conf_word_write)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint16_t *val, uint16_t dev_value, uint16_t valid_mask);
+typedef int (*conf_byte_write)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint8_t *val, uint8_t dev_value, uint8_t valid_mask);
+typedef int (*conf_dword_read)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint32_t *val, uint32_t valid_mask);
+typedef int (*conf_word_read)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint16_t *val, uint16_t valid_mask);
+typedef int (*conf_byte_read)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint8_t *val, uint8_t valid_mask);
+typedef int (*conf_dword_restore)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
+     uint32_t dev_value, uint32_t *val);
+typedef int (*conf_word_restore)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
+     uint16_t dev_value, uint16_t *val);
+typedef int (*conf_byte_restore)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
+     uint8_t dev_value, uint8_t *val);
+
+/* power state transition */
+#define PT_FLAG_TRANSITING 0x0001
+
+
+typedef enum {
+    GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
+    GRP_TYPE_EMU,                               /* emul reg group */
+} RegisterGroupType;
+
+typedef enum {
+    PT_BAR_FLAG_MEM = 0,                        /* Memory type BAR */
+    PT_BAR_FLAG_IO,                             /* I/O type BAR */
+    PT_BAR_FLAG_UPPER,                          /* upper 64bit BAR */
+    PT_BAR_FLAG_UNUSED,                         /* unused BAR */
+} PTBarFlag;
+
+
+typedef struct XenPTRegion {
+    /* Virtual phys base & size */
+    uint32_t e_physbase;
+    uint32_t e_size;
+    /* Index of region in qemu */
+    uint32_t memory_index;
+    /* BAR flag */
+    PTBarFlag bar_flag;
+    /* Translation of the emulated address */
+    union {
+        uint64_t maddr;
+        uint64_t pio_base;
+        uint64_t u;
+    } access;
+} XenPTRegion;
+
+/* XenPTRegInfo declaration
+ * - only for emulated register (either a part or whole bit).
+ * - for passthrough register that need special behavior (like interacting with
+ *   other component), set emu_mask to all 0 and specify r/w func properly.
+ * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
+ */
+
+/* emulated register information */
+struct XenPTRegInfo {
+    uint32_t offset;
+    uint32_t size;
+    uint32_t init_val;
+    /* reg read only field mask (ON:RO/ROS, OFF:other) */
+    uint32_t ro_mask;
+    /* reg emulate field mask (ON:emu, OFF:passthrough) */
+    uint32_t emu_mask;
+    /* no write back allowed */
+    uint32_t no_wb;
+    conf_reg_init init;
+    /* read/write/restore function pointer
+     * for double_word/word/byte size */
+    union {
+        struct {
+            conf_dword_write write;
+            conf_dword_read read;
+            conf_dword_restore restore;
+        } dw;
+        struct {
+            conf_word_write write;
+            conf_word_read read;
+            conf_word_restore restore;
+        } w;
+        struct {
+            conf_byte_write write;
+            conf_byte_read read;
+            conf_byte_restore restore;
+        } b;
+    } u;
+};
+
+/* emulated register management */
+struct XenPTReg {
+    QLIST_ENTRY(XenPTReg) entries;
+    XenPTRegInfo *reg;
+    uint32_t data;
+};
+
+typedef struct XenPTRegGroupInfo XenPTRegGroupInfo;
+
+/* emul reg group size initialize method */
+typedef uint8_t (*pt_reg_size_init_fn)
+    (XenPCIPassthroughState *, const XenPTRegGroupInfo *,
+     uint32_t base_offset);
+
+/* emulated register group information */
+struct XenPTRegGroupInfo {
+    uint8_t grp_id;
+    RegisterGroupType grp_type;
+    uint8_t grp_size;
+    pt_reg_size_init_fn size_init;
+    XenPTRegInfo *emu_reg_tbl;
+};
+
+/* emul register group management table */
+typedef struct XenPTRegGroup {
+    QLIST_ENTRY(XenPTRegGroup) entries;
+    const XenPTRegGroupInfo *reg_grp;
+    uint32_t base_offset;
+    uint8_t size;
+    QLIST_HEAD(, XenPTReg) reg_tbl_list;
+} XenPTRegGroup;
+
+
+typedef struct XenPTPM {
+    QEMUTimer *pm_timer;  /* QEMUTimer struct */
+    int no_soft_reset;    /* No Soft Reset flags */
+    uint16_t flags;       /* power state transition flags */
+    uint16_t pmc_field;   /* Power Management Capabilities field */
+    int pm_delay;         /* power state transition delay */
+    uint16_t cur_state;   /* current power state */
+    uint16_t req_state;   /* requested power state */
+    uint32_t pm_base;     /* Power Management Capability reg base offset */
+    uint32_t aer_base;    /* AER Capability reg base offset */
+} XenPTPM;
+
+struct XenPCIPassthroughState {
+    PCIDevice dev;
+
+    char *hostaddr;
+    bool is_virtfn;
+    HostPCIDevice *real_device;
+    XenPTRegion bases[PCI_NUM_REGIONS]; /* Access regions */
+    QLIST_HEAD(, XenPTRegGroup) reg_grp_tbl;
+
+    uint32_t machine_irq;
+
+    uint32_t power_mgmt;
+    XenPTPM *pm_state;
+
+    MemoryRegion bar[PCI_NUM_REGIONS - 1];
+    MemoryRegion rom;
+};
+
+void pt_config_init(XenPCIPassthroughState *s);
+void pt_config_delete(XenPCIPassthroughState *s);
+void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable);
+void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
+                        int io_enable, int mem_enable);
+XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address);
+XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address);
+int pt_bar_offset_to_index(uint32_t offset);
+
+static inline pcibus_t pt_get_emul_size(PTBarFlag flag, pcibus_t r_size)
+{
+    /* align resource size (memory type only) */
+    if (flag == PT_BAR_FLAG_MEM) {
+        return (r_size + XC_PAGE_SIZE - 1) & XC_PAGE_MASK;
+    } else {
+        return r_size;
+    }
+}
+
+/* INTx */
+static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
+{
+    return host_pci_get_byte(s->real_device, PCI_INTERRUPT_PIN);
+}
+uint8_t pci_intx(XenPCIPassthroughState *ptdev);
+
+#endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
diff --git a/hw/xen_pci_passthrough_helpers.c b/hw/xen_pci_passthrough_helpers.c
new file mode 100644
index 0000000..192e918
--- /dev/null
+++ b/hw/xen_pci_passthrough_helpers.c
@@ -0,0 +1,46 @@
+#include "xen_pci_passthrough.h"
+
+/* The PCI Local Bus Specification, Rev. 3.0, {
+ * Section 6.2.4 Miscellaneous Registers, pp 223
+ * outlines 5 valid values for the interrupt pin (intx).
+ *  0: For devices (or device functions) that don't use an interrupt pin
+ *  1: INTA#
+ *  2: INTB#
+ *  3: INTC#
+ *  4: INTD#
+ *
+ * Xen uses the following 4 values for intx
+ *  0: INTA#
+ *  1: INTB#
+ *  2: INTC#
+ *  3: INTD#
+ *
+ * Observing that these lists of values are not the same, pci_intx()
+ * uses the following mapping from hw to xen values.
+ * This seems to reflect the current usage within Xen.
+ *
+ * PCI hardware    | Xen | Notes
+ * ----------------+-----+----------------------------------------------------
+ * 0               | 0   | No interrupt
+ * 1               | 0   | INTA#
+ * 2               | 1   | INTB#
+ * 3               | 2   | INTC#
+ * 4               | 3   | INTD#
+ * any other value | 0   | This should never happen, log error message
+}
+ */
+uint8_t pci_intx(XenPCIPassthroughState *ptdev)
+{
+    uint8_t r_val = pci_read_intx(ptdev);
+
+    PT_LOG("intx=%i\n", r_val);
+    if (r_val < 1 || r_val > 4) {
+        PT_LOG("Interrupt pin read from hardware is out of range: "
+               "value=%i, acceptable range is 1 - 4\n", r_val);
+        r_val = 0;
+    } else {
+        r_val -= 1;
+    }
+
+    return r_val;
+}
-- 
Anthony PERARD
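
For reference, the hardware-to-Xen INTx mapping tabulated in
xen_pci_passthrough_helpers.c can be sketched as a stand-alone function.
This is only a minimal sketch under stated assumptions, not the code of the
patch: hw_intx_to_xen() is a hypothetical name, and the function takes the
raw Interrupt Pin value as an argument instead of reading it from the host
device.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-alone helper (not part of the patch).
 * Hardware Interrupt Pin values 1..4 (INTA#..INTD#) map to the Xen values
 * 0..3.  Every other value, including 0 ("no interrupt pin"), maps to 0.
 */
static uint8_t hw_intx_to_xen(uint8_t hw_pin)
{
    if (hw_pin < 1 || hw_pin > 4) {
        return 0;
    }
    return (uint8_t)(hw_pin - 1);
}
```

Note that a device without an interrupt pin and a device using INTA# both
end up with the Xen value 0, so a caller has to check the Interrupt Pin
register itself to tell the two cases apart.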


* [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
@ 2011-10-28 15:07   ` Anthony PERARD
  0 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini
  Cc: Anthony PERARD, Guy Zana, Xen Devel, Allen Kay

From: Allen Kay <allen.m.kay@intel.com>

Signed-off-by: Allen Kay <allen.m.kay@intel.com>
Signed-off-by: Guy Zana <guy@neocleus.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
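
The three checks done by pt_pci_config_access_check() in this patch (offset
range, access length, natural alignment) can be sketched stand-alone as
below. This is an illustration only: config_access_ok() is a hypothetical
name, and the sketch assumes that offsets up to and including FFh are
valid, following the "offset exceeding FFh" wording of the error message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-alone helper (not part of the patch).
 * Returns true when a config space access would be accepted.
 */
static bool config_access_ok(uint32_t address, int len)
{
    /* the offset must lie inside the 256-byte config space (assumption) */
    if (address > 0xFF) {
        return false;
    }

    /* only byte, word and dword accesses are allowed */
    if (len != 1 && len != 2 && len != 4) {
        return false;
    }

    /* the offset must be naturally aligned to the access length */
    if (address & (uint32_t)(len - 1)) {
        return false;
    }

    return true;
}
```

With these rules a word access at offset 04h is accepted, a word access at
05h is rejected as misaligned, and a 3-byte access is rejected whatever its
offset.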
 Makefile.target                  |    2 +
 hw/xen_pci_passthrough.c         |  838 ++++++++++++++++++++++++++++++++++++++
 hw/xen_pci_passthrough.h         |  223 ++++++++++
 hw/xen_pci_passthrough_helpers.c |   46 ++
 4 files changed, 1109 insertions(+), 0 deletions(-)
 create mode 100644 hw/xen_pci_passthrough.c
 create mode 100644 hw/xen_pci_passthrough.h
 create mode 100644 hw/xen_pci_passthrough_helpers.c
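
In the config read/write loops of this patch, valid_mask tells a register
handler which bytes of the emulated register the guest access still covers.
The computation can be sketched stand-alone as below. valid_mask_for() is a
hypothetical name, and the assertions state assumptions of this sketch (at
most 4 bytes left to emulate, find_addr inside the register) that the patch
itself does not check.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-alone helper (not part of the patch).
 *   find_addr:   config space address currently being emulated
 *   real_offset: config space address where the register starts
 *   emul_len:    number of bytes of the guest access left to emulate
 */
static uint32_t valid_mask_for(uint32_t find_addr, uint32_t real_offset,
                               int emul_len)
{
    uint32_t mask;

    assert(emul_len >= 1 && emul_len <= 4);
    assert(find_addr >= real_offset && find_addr - real_offset < 4);

    /* keep one 0xFF byte per remaining byte of the access */
    mask = 0xFFFFFFFFu >> ((4 - emul_len) << 3);
    /* move the mask to the byte position of find_addr in the register */
    mask <<= (find_addr - real_offset) << 3;

    return mask;
}
```

For example, a 1-byte access to the upper byte of a 2-byte register that
starts at 04h gives valid_mask_for(0x05, 0x04, 1) == 0x0000FF00, so the
handler only touches bits 8 to 15.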

diff --git a/Makefile.target b/Makefile.target
index 243f9f2..36ea47d 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -217,6 +217,8 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
 
 # Xen PCI Passthrough
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
 
 # Inter-VM PCI shared memory
 CONFIG_IVSHMEM =
diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
new file mode 100644
index 0000000..b97c5b6
--- /dev/null
+++ b/hw/xen_pci_passthrough.c
@@ -0,0 +1,838 @@
+/*
+ * Copyright (c) 2007, Neocleus Corporation.
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Alex Novik <alex@neocleus.com>
+ * Allen Kay <allen.m.kay@intel.com>
+ * Guy Zana <guy@neocleus.com>
+ *
+ * This file implements direct PCI assignment to a HVM guest
+ */
+
+/*
+ * Interrupt Disable policy:
+ *
+ * INTx interrupt:
+ *   Initialize(register_real_device)
+ *     Map INTx(xc_physdev_map_pirq):
+ *       <fail>
+ *         - Set real Interrupt Disable bit to '1'.
+ *         - Set machine_irq and assigned_device->machine_irq to '0'.
+ *         * Don't bind INTx.
+ *
+ *     Bind INTx(xc_domain_bind_pt_pci_irq):
+ *       <fail>
+ *         - Set real Interrupt Disable bit to '1'.
+ *         - Unmap INTx.
+ *         - Decrement mapped_machine_irq[machine_irq]
+ *         - Set assigned_device->machine_irq to '0'.
+ *
+ *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
+ *     Write '0'
+ *       <ptdev->msi_trans_en is false>
+ *         - Set real bit to '0' if assigned_device->machine_irq isn't '0'.
+ *
+ *     Write '1'
+ *       <ptdev->msi_trans_en is false>
+ *         - Set real bit to '1'.
+ *
+ * MSI-INTx translation.
+ *   Initialize(xc_physdev_map_pirq_msi/pt_msi_setup)
+ *     Bind MSI-INTx(xc_domain_bind_pt_irq)
+ *       <fail>
+ *         - Unmap MSI.
+ *           <success>
+ *             - Set dev->msi->pirq to '-1'.
+ *           <fail>
+ *             - Do nothing.
+ *
+ *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
+ *     Write '0'
+ *       <ptdev->msi_trans_en is true>
+ *         - Set MSI Enable bit to '1'.
+ *
+ *     Write '1'
+ *       <ptdev->msi_trans_en is true>
+ *         - Set MSI Enable bit to '0'.
+ *
+ * MSI interrupt:
+ *   Initialize MSI register(pt_msi_setup, pt_msi_update)
+ *     Bind MSI(xc_domain_update_msi_irq)
+ *       <fail>
+ *         - Unmap MSI.
+ *         - Set dev->msi->pirq to '-1'.
+ *
+ * MSI-X interrupt:
+ *   Initialize MSI-X register(pt_msix_update_one)
+ *     Bind MSI-X(xc_domain_update_msi_irq)
+ *       <fail>
+ *         - Unmap MSI-X.
+ *         - Set entry->pirq to '-1'.
+ */
+
+#include <sys/ioctl.h>
+
+#include "pci.h"
+#include "xen.h"
+#include "xen_backend.h"
+#include "xen_pci_passthrough.h"
+
+#define PCI_BAR_ENTRIES (6)
+
+#define PT_NR_IRQS          (256)
+char mapped_machine_irq[PT_NR_IRQS] = {0};
+
+/* Config Space */
+static int pt_pci_config_access_check(PCIDevice *d, uint32_t address, int len)
+{
+    /* check offset range */
+    if (address > 0xFF) {
+        PT_LOG("Error: Failed to access register with offset exceeding FFh. "
+               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               address, len);
+        return -1;
+    }
+
+    /* check read size */
+    if ((len != 1) && (len != 2) && (len != 4)) {
+        PT_LOG("Error: Failed to access register with invalid access length. "
+               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               address, len);
+        return -1;
+    }
+
+    /* check offset alignment */
+    if (address & (len - 1)) {
+        PT_LOG("Error: Failed to access register with invalid access size "
+            "alignment. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+            pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+            address, len);
+        return -1;
+    }
+
+    return 0;
+}
+
+int pt_bar_offset_to_index(uint32_t offset)
+{
+    int index = 0;
+
+    /* check Exp ROM BAR */
+    if (offset == PCI_ROM_ADDRESS) {
+        return PCI_ROM_SLOT;
+    }
+
+    /* calculate BAR index */
+    index = (offset - PCI_BASE_ADDRESS_0) >> 2;
+    if (index >= PCI_NUM_REGIONS) {
+        return -1;
+    }
+
+    return index;
+}
+
+static uint32_t pt_pci_read_config(PCIDevice *d, uint32_t address, int len)
+{
+    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
+    uint32_t val = 0;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    int rc = 0;
+    int emul_len = 0;
+    uint32_t find_addr = address;
+
+    if (pt_pci_config_access_check(d, address, len)) {
+        goto exit;
+    }
+
+    /* check power state transition flags */
+    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
+        /* Can't accept the access until the previous power state transition
+         * has completed, so end this request here.
+         */
+        PT_LOG("Warning: guest wants to read during power state transition\n");
+        goto exit;
+    }
+
+    /* find register group entry */
+    reg_grp_entry = pt_find_reg_grp(s, address);
+    if (reg_grp_entry) {
+        /* check 0 Hardwired register group */
+        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
+            /* no need to emulate, just return 0 */
+            val = 0;
+            goto exit;
+        }
+    }
+
+    /* read I/O device register value */
+    rc = host_pci_get_block(s->real_device, address, (uint8_t *)&val, len);
+    if (rc < 0) {
+        PT_LOG("Error: host_pci_get_block failed. return value[%d].\n", rc);
+        memset(&val, 0xff, len);
+    }
+
+    /* just return the I/O device register value for
+     * passthrough type register group */
+    if (reg_grp_entry == NULL) {
+        goto exit;
+    }
+
+    /* adjust the read value to appropriate CFC-CFF window */
+    val <<= (address & 3) << 3;
+    emul_len = len;
+
+    /* loop Guest request size */
+    while (emul_len > 0) {
+        /* find register entry to be emulated */
+        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
+        if (reg_entry) {
+            XenPTRegInfo *reg = reg_entry->reg;
+            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
+            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
+            uint8_t *ptr_val = NULL;
+
+            valid_mask <<= (find_addr - real_offset) << 3;
+            ptr_val = (uint8_t *)&val + (real_offset & 3);
+
+            /* do emulation depend on register size */
+            switch (reg->size) {
+            case 1:
+                if (reg->u.b.read) {
+                    rc = reg->u.b.read(s, reg_entry, ptr_val, valid_mask);
+                }
+                break;
+            case 2:
+                if (reg->u.w.read) {
+                    rc = reg->u.w.read(s, reg_entry,
+                                       (uint16_t *)ptr_val, valid_mask);
+                }
+                break;
+            case 4:
+                if (reg->u.dw.read) {
+                    rc = reg->u.dw.read(s, reg_entry,
+                                        (uint32_t *)ptr_val, valid_mask);
+                }
+                break;
+            }
+
+            if (rc < 0) {
+                hw_error("Internal error: Invalid read emulation "
+                         "return value[%d]. I/O emulator exit.\n", rc);
+            }
+
+            /* calculate next address to find */
+            emul_len -= reg->size;
+            if (emul_len > 0) {
+                find_addr = real_offset + reg->size;
+            }
+        } else {
+            /* nothing to do with passthrough type register,
+             * continue to find next byte */
+            emul_len--;
+            find_addr++;
+        }
+    }
+
+    /* need to shift back before returning them to pci bus emulator */
+    val >>= ((address & 3) << 3);
+
+exit:
+    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
+                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+                  address, val, len);
+    return val;
+}
+
+static void pt_pci_write_config(PCIDevice *d, uint32_t address,
+                                uint32_t val, int len)
+{
+    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
+    int index = 0;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    int rc = 0;
+    uint32_t read_val = 0;
+    int emul_len = 0;
+    XenPTReg *reg_entry = NULL;
+    uint32_t find_addr = address;
+    XenPTRegInfo *reg = NULL;
+
+    if (pt_pci_config_access_check(d, address, len)) {
+        return;
+    }
+
+    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
+                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+                  address, val, len);
+
+    /* check unused BAR register */
+    index = pt_bar_offset_to_index(address);
+    if ((index >= 0) && (val > 0 && val < PT_BAR_ALLF) &&
+        (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED)) {
+        PT_LOG("Warning: Guest attempt to set address to unused Base Address "
+               "Register. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               address, len);
+    }
+
+    /* check power state transition flags */
+    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
+        /* Can't accept the access until the previous power state transition
+         * has completed, so end this request here.
+         */
+        PT_LOG("Warning: guest wants to write during power state transition\n");
+        return;
+    }
+
+    /* find register group entry */
+    reg_grp_entry = pt_find_reg_grp(s, address);
+    if (reg_grp_entry) {
+        /* check 0 Hardwired register group */
+        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
+            /* ignore silently */
+            PT_LOG("Warning: Access to 0 Hardwired register. "
+                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
+                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+                   address, len);
+            return;
+        }
+    }
+
+    /* read I/O device register value */
+    rc = host_pci_get_block(s->real_device, address,
+                             (uint8_t *)&read_val, len);
+    if (rc < 0) {
+        PT_LOG("Error: host_pci_get_block failed. return value[%d].\n", rc);
+        memset(&read_val, 0xff, len);
+    }
+
+    /* pass directly to the host device for passthrough type register group */
+    if (reg_grp_entry == NULL) {
+        goto out;
+    }
+
+    /* adjust the read and write value to appropriate CFC-CFF window */
+    read_val <<= (address & 3) << 3;
+    val <<= (address & 3) << 3;
+    emul_len = len;
+
+    /* loop Guest request size */
+    while (emul_len > 0) {
+        /* find register entry to be emulated */
+        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
+        if (reg_entry) {
+            reg = reg_entry->reg;
+            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
+            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
+            uint8_t *ptr_val = NULL;
+
+            valid_mask <<= (find_addr - real_offset) << 3;
+            ptr_val = (uint8_t *)&val + (real_offset & 3);
+
+            /* do emulation depending on register size */
+            switch (reg->size) {
+            case 1:
+                if (reg->u.b.write) {
+                    rc = reg->u.b.write(s, reg_entry, ptr_val,
+                                        read_val >> ((real_offset & 3) << 3),
+                                        valid_mask);
+                }
+                break;
+            case 2:
+                if (reg->u.w.write) {
+                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
+                                        (read_val >> ((real_offset & 3) << 3)),
+                                        valid_mask);
+                }
+                break;
+            case 4:
+                if (reg->u.dw.write) {
+                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
+                                         (read_val >> ((real_offset & 3) << 3)),
+                                         valid_mask);
+                }
+                break;
+            }
+
+            if (rc < 0) {
+                hw_error("Internal error: Invalid write emulation "
+                         "return value[%d]. I/O emulator exit.\n", rc);
+            }
+
+            /* calculate next address to find */
+            emul_len -= reg->size;
+            if (emul_len > 0) {
+                find_addr = real_offset + reg->size;
+            }
+        } else {
+            /* nothing to do with passthrough type register,
+             * continue to find next byte */
+            emul_len--;
+            find_addr++;
+        }
+    }
+
+    /* need to shift back before passing them to the host device */
+    val >>= (address & 3) << 3;
+
+out:
+    if (!(reg && reg->no_wb)) {
+        /* unknown regs are passed through */
+        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
+
+        if (rc < 0) {
+            PT_LOG("Error: host_pci_set_block failed. return value[%d].\n", rc);
+        }
+    }
+
+    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
+        qemu_mod_timer(s->pm_state->pm_timer,
+                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
+    }
+}
+
+/* ioport/iomem space */
+static void pt_iomem_map(XenPCIPassthroughState *s, int i,
+                         pcibus_t e_phys, pcibus_t e_size, int type)
+{
+    uint32_t old_ebase = s->bases[i].e_physbase;
+    bool first_map = s->bases[i].e_size == 0;
+    int ret = 0;
+
+    s->bases[i].e_physbase = e_phys;
+    s->bases[i].e_size = e_size;
+
+    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%d"
+           " len=%#"PRIx64" index=%d first_map=%d\n",
+           e_phys, s->bases[i].access.maddr, type,
+           e_size, i, first_map);
+
+    if (e_size == 0) {
+        return;
+    }
+
+    if (!first_map && old_ebase != -1) {
+        /* Remove old mapping */
+        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
+                               old_ebase >> XC_PAGE_SHIFT,
+                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
+                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
+                               DPCI_REMOVE_MAPPING);
+        if (ret != 0) {
+            PT_LOG("Error: remove old mapping failed!\n");
+            return;
+        }
+    }
+
+    /* map only valid guest address */
+    if (e_phys != -1) {
+        /* Create new mapping */
+        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
+                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
+                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
+                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
+                                   DPCI_ADD_MAPPING);
+
+        if (ret != 0) {
+            PT_LOG("Error: create new mapping failed!\n");
+        }
+    }
+}
+
+static void pt_ioport_map(XenPCIPassthroughState *s, int i,
+                          pcibus_t e_phys, pcibus_t e_size, int type)
+{
+    uint32_t old_ebase = s->bases[i].e_physbase;
+    bool first_map = s->bases[i].e_size == 0;
+    int ret = 0;
+
+    s->bases[i].e_physbase = e_phys;
+    s->bases[i].e_size = e_size;
+
+    PT_LOG("e_phys=%#04"PRIx64" pio_base=%#04"PRIx64" len=%"PRId64" index=%d"
+           " first_map=%d\n",
+           e_phys, s->bases[i].access.pio_base, e_size, i, first_map);
+
+    if (e_size == 0) {
+        return;
+    }
+
+    if (!first_map && old_ebase != -1) {
+        /* Remove old mapping */
+        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, old_ebase,
+                                       s->bases[i].access.pio_base, e_size,
+                                       DPCI_REMOVE_MAPPING);
+        if (ret != 0) {
+            PT_LOG("Error: remove old mapping failed!\n");
+            return;
+        }
+    }
+
+    /* map only valid guest address (include 0) */
+    if (e_phys != -1) {
+        /* Create new mapping */
+        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, e_phys,
+                                       s->bases[i].access.pio_base, e_size,
+                                       DPCI_ADD_MAPPING);
+        if (ret != 0) {
+            PT_LOG("Error: create new mapping failed!\n");
+        }
+    }
+
+}
+
+
+/* mapping BAR */
+
+void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
+                        int io_enable, int mem_enable)
+{
+    PCIDevice *dev = &s->dev;
+    PCIIORegion *r;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegion *base = NULL;
+    pcibus_t r_size = 0, r_addr = -1;
+    int rc = 0;
+
+    r = &dev->io_regions[bar];
+
+    /* check valid region */
+    if (!r->size) {
+        return;
+    }
+
+    base = &s->bases[bar];
+    /* skip unused BAR or upper 64bit BAR */
+    if ((base->bar_flag == PT_BAR_FLAG_UNUSED)
+        || (base->bar_flag == PT_BAR_FLAG_UPPER)) {
+        return;
+    }
+
+    /* copy region address to temporary */
+    r_addr = r->addr;
+
+    /* need unmapping in case I/O Space or Memory Space is disabled */
+    if (((base->bar_flag == PT_BAR_FLAG_IO) && !io_enable) ||
+        ((base->bar_flag == PT_BAR_FLAG_MEM) && !mem_enable)) {
+        r_addr = -1;
+    }
+    if ((bar == PCI_ROM_SLOT) && (r_addr != -1)) {
+        reg_grp_entry = pt_find_reg_grp(s, PCI_ROM_ADDRESS);
+        if (reg_grp_entry) {
+            reg_entry = pt_find_reg(reg_grp_entry, PCI_ROM_ADDRESS);
+            if (reg_entry && !(reg_entry->data & PCI_ROM_ADDRESS_ENABLE)) {
+                r_addr = -1;
+            }
+        }
+    }
+
+    /* prevent guest software mapping memory resource to 00000000h */
+    if ((base->bar_flag == PT_BAR_FLAG_MEM) && (r_addr == 0)) {
+        r_addr = -1;
+    }
+
+    r_size = pt_get_emul_size(base->bar_flag, r->size);
+
+    rc = pci_check_bar_overlap(dev, r_addr, r_size, r->type);
+    if (rc > 0) {
+        PT_LOG("Warning: s[%02x:%02x.%x][Region:%d][Address:%"FMT_PCIBUS"h]"
+               "[Size:%"FMT_PCIBUS"h] is overlapped.\n", pci_bus_num(dev->bus),
+               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), bar,
+               r_addr, r_size);
+    }
+
+    /* check whether we need to update the mapping or not */
+    if (r_addr != s->bases[bar].e_physbase) {
+        /* mapping BAR */
+        if (base->bar_flag == PT_BAR_FLAG_IO) {
+            pt_ioport_map(s, bar, r_addr, r_size, r->type);
+        } else {
+            pt_iomem_map(s, bar, r_addr, r_size, r->type);
+        }
+    }
+}
+
+void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
+{
+    int i;
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        pt_bar_mapping_one(s, i, io_enable, mem_enable);
+    }
+}
+
+/* register regions */
+static int pt_register_regions(XenPCIPassthroughState *s)
+{
+    int i = 0;
+    uint32_t bar_data = 0;
+    HostPCIDevice *d = s->real_device;
+
+    /* Register PIO/MMIO BARs */
+    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
+        HostPCIIORegion *r = &d->io_regions[i];
+
+        if (r->base_addr) {
+            s->bases[i].e_physbase = r->base_addr;
+            s->bases[i].access.u = r->base_addr;
+
+            /* Register current region */
+            if (r->flags & IORESOURCE_IO) {
+                memory_region_init_io(&s->bar[i], NULL, NULL,
+                                      "xen-pci-pt-bar", r->size);
+                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
+                                 &s->bar[i]);
+            } else if (r->flags & IORESOURCE_PREFETCH) {
+                memory_region_init_io(&s->bar[i], NULL, NULL,
+                                      "xen-pci-pt-bar", r->size);
+                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
+                                 &s->bar[i]);
+            } else {
+                memory_region_init_io(&s->bar[i], NULL, NULL,
+                                      "xen-pci-pt-bar", r->size);
+                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
+                                 &s->bar[i]);
+            }
+
+            PT_LOG("IO region registered (size=0x%08"PRIx64
+                   " base_addr=0x%08"PRIx64")\n",
+                   r->size, r->base_addr);
+        }
+    }
+
+    /* Register expansion ROM address */
+    if (d->rom.base_addr && d->rom.size) {
+        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
+        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
+        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
+            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
+            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
+        }
+
+        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
+        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
+
+        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
+                                      "xen-pci-pt-rom", d->rom.size);
+        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
+                         &s->rom);
+
+        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
+               " base_addr=0x%08"PRIx64")\n",
+               d->rom.size, d->rom.base_addr);
+    }
+
+    return 0;
+}
+
+static void pt_unregister_regions(XenPCIPassthroughState *s)
+{
+    int i, type, rc;
+    uint32_t e_size;
+    PCIDevice *d = &s->dev;
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        e_size = s->bases[i].e_size;
+        if ((e_size == 0) || (s->bases[i].e_physbase == -1)) {
+            continue;
+        }
+
+        type = d->io_regions[i].type;
+
+        if (type == PCI_BASE_ADDRESS_SPACE_MEMORY
+            || type == PCI_BASE_ADDRESS_MEM_PREFETCH) {
+            rc = xc_domain_memory_mapping(xen_xc, xen_domid,
+                    s->bases[i].e_physbase >> XC_PAGE_SHIFT,
+                    s->bases[i].access.maddr >> XC_PAGE_SHIFT,
+                    (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
+                    DPCI_REMOVE_MAPPING);
+            if (rc != 0) {
+                PT_LOG("Error: remove old mem mapping failed!\n");
+                continue;
+            }
+
+        } else if (type == PCI_BASE_ADDRESS_SPACE_IO) {
+            rc = xc_domain_ioport_mapping(xen_xc, xen_domid,
+                        s->bases[i].e_physbase,
+                        s->bases[i].access.pio_base,
+                        e_size,
+                        DPCI_REMOVE_MAPPING);
+            if (rc != 0) {
+                PT_LOG("Error: remove old io mapping failed!\n");
+                continue;
+            }
+        }
+    }
+}
+
+static int pt_initfn(PCIDevice *pcidev)
+{
+    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
+    int dom, bus;
+    unsigned slot, func;
+    int rc = 0;
+    uint32_t machine_irq;
+    int pirq = -1;
+
+    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
+        fprintf(stderr, "error parsing BDF: %s\n", s->hostaddr);
+        return -1;
+    }
+
+    /* register real device */
+    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
+           bus, slot, func, s->dev.devfn);
+
+    s->real_device = host_pci_device_get(bus, slot, func);
+    if (!s->real_device) {
+        return -1;
+    }
+
+    s->is_virtfn = s->real_device->is_virtfn;
+    if (s->is_virtfn) {
+        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
+               s->real_device->domain, bus, slot, func);
+    }
+
+    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
+    if (host_pci_get_block(s->real_device, 0, pcidev->config,
+                           PCI_CONFIG_SPACE_SIZE) == -1) {
+        return -1;
+    }
+
+    /* Handle real device's MMIO/PIO BARs */
+    pt_register_regions(s);
+
+    /* reinitialize each config register to be emulated */
+    pt_config_init(s);
+
+    /* Bind interrupt */
+    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
+        PT_LOG("no pin interrupt\n");
+        goto out;
+    }
+
+    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
+    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
+
+    if (rc) {
+        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
+
+        /* Disable PCI INTx assertion (set bit 10 of PCI_COMMAND) */
+        host_pci_set_word(s->real_device,
+                          PCI_COMMAND,
+                          pci_get_word(s->dev.config + PCI_COMMAND)
+                          | PCI_COMMAND_INTX_DISABLE);
+        machine_irq = 0;
+        s->machine_irq = 0;
+    } else {
+        machine_irq = pirq;
+        s->machine_irq = pirq;
+        mapped_machine_irq[machine_irq]++;
+    }
+
+    /* bind machine_irq to device (machine_irq is 0 if the mapping failed) */
+    if (machine_irq != 0) {
+        uint8_t e_device = PCI_SLOT(s->dev.devfn);
+        uint8_t e_intx = pci_intx(s);
+
+        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
+                                       e_device, e_intx);
+        if (rc < 0) {
+            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);
+
+            /* Disable PCI INTx assertion (set bit 10 of PCI_COMMAND) */
+            host_pci_set_word(s->real_device, PCI_COMMAND,
+                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
+                              | PCI_COMMAND_INTX_DISABLE);
+            mapped_machine_irq[machine_irq]--;
+
+            if (mapped_machine_irq[machine_irq] == 0) {
+                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
+                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
+                           rc);
+                }
+            }
+            s->machine_irq = 0;
+        }
+    }
+
+out:
+    PT_LOG("Real physical device %02x:%02x.%x registered successfully!\n"
+           "IRQ type = %s\n", bus, slot, func, "INTx");
+
+    return 0;
+}
+
+static int pt_unregister_device(PCIDevice *pcidev)
+{
+    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
+    uint8_t e_device, e_intx;
+    uint32_t machine_irq;
+    int rc;
+
+    /* Unbind interrupt */
+    e_device = PCI_SLOT(s->dev.devfn);
+    e_intx = pci_intx(s);
+    machine_irq = s->machine_irq;
+
+    if (machine_irq) {
+        rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
+                                     PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
+        if (rc < 0) {
+            PT_LOG("Error: Unbinding of interrupt failed! rc=%d\n", rc);
+        }
+    }
+
+    if (machine_irq) {
+        mapped_machine_irq[machine_irq]--;
+
+        if (mapped_machine_irq[machine_irq] == 0) {
+            rc = xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq);
+
+            if (rc < 0) {
+                PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n", rc);
+            }
+        }
+    }
+
+    /* delete all emulated config registers */
+    pt_config_delete(s);
+
+    /* unregister real device's MMIO/PIO BARs */
+    pt_unregister_regions(s);
+
+    host_pci_device_put(s->real_device);
+
+    return 0;
+}
+
+static PCIDeviceInfo xen_pci_passthrough = {
+    .init = pt_initfn,
+    .exit = pt_unregister_device,
+    .qdev.name = "xen-pci-passthrough",
+    .qdev.desc = "Assign a host PCI device with Xen",
+    .qdev.size = sizeof(XenPCIPassthroughState),
+    .config_read = pt_pci_read_config,
+    .config_write = pt_pci_write_config,
+    .is_express = 0,
+    .qdev.props = (Property[]) {
+        DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
+        DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
+                        0, false),
+        DEFINE_PROP_END_OF_LIST(),
+    }
+};
+
+static void xen_passthrough_register(void)
+{
+    pci_qdev_register(&xen_pci_passthrough);
+}
+
+device_init(xen_passthrough_register);
diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
new file mode 100644
index 0000000..2d1979d
--- /dev/null
+++ b/hw/xen_pci_passthrough.h
@@ -0,0 +1,223 @@
+#ifndef QEMU_HW_XEN_PCI_PASSTHROUGH_H
+#  define QEMU_HW_XEN_PCI_PASSTHROUGH_H
+
+#include "qemu-common.h"
+#include "xen_common.h"
+#include "pci.h"
+#include "host-pci-device.h"
+
+#define PT_LOGGING_ENABLED
+#define PT_DEBUG_PCI_CONFIG_ACCESS
+
+#ifdef PT_LOGGING_ENABLED
+#  define PT_LOG(_f, _a...)   fprintf(stderr, "%s: " _f, __func__, ##_a)
+#else
+#  define PT_LOG(_f, _a...)
+#endif
+
+#ifdef PT_DEBUG_PCI_CONFIG_ACCESS
+#  define PT_LOG_CONFIG(_f, _a...) PT_LOG(_f, ##_a)
+#else
+#  define PT_LOG_CONFIG(_f, _a...)
+#endif
+
+
+typedef struct XenPTRegInfo XenPTRegInfo;
+typedef struct XenPTReg XenPTReg;
+
+typedef struct XenPCIPassthroughState XenPCIPassthroughState;
+
+/* function type for config reg */
+typedef uint32_t (*conf_reg_init)
+    (XenPCIPassthroughState *, XenPTRegInfo *, uint32_t real_offset);
+typedef int (*conf_dword_write)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint32_t *val, uint32_t dev_value, uint32_t valid_mask);
+typedef int (*conf_word_write)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint16_t *val, uint16_t dev_value, uint16_t valid_mask);
+typedef int (*conf_byte_write)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint8_t *val, uint8_t dev_value, uint8_t valid_mask);
+typedef int (*conf_dword_read)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint32_t *val, uint32_t valid_mask);
+typedef int (*conf_word_read)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint16_t *val, uint16_t valid_mask);
+typedef int (*conf_byte_read)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
+     uint8_t *val, uint8_t valid_mask);
+typedef int (*conf_dword_restore)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
+     uint32_t dev_value, uint32_t *val);
+typedef int (*conf_word_restore)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
+     uint16_t dev_value, uint16_t *val);
+typedef int (*conf_byte_restore)
+    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
+     uint8_t dev_value, uint8_t *val);
+
+/* power state transition */
+#define PT_FLAG_TRANSITING 0x0001
+
+
+typedef enum {
+    GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
+    GRP_TYPE_EMU,                               /* emul reg group */
+} RegisterGroupType;
+
+typedef enum {
+    PT_BAR_FLAG_MEM = 0,                        /* Memory type BAR */
+    PT_BAR_FLAG_IO,                             /* I/O type BAR */
+    PT_BAR_FLAG_UPPER,                          /* upper 64bit BAR */
+    PT_BAR_FLAG_UNUSED,                         /* unused BAR */
+} PTBarFlag;
+
+
+typedef struct XenPTRegion {
+    /* Virtual phys base & size */
+    uint32_t e_physbase;
+    uint32_t e_size;
+    /* Index of region in qemu */
+    uint32_t memory_index;
+    /* BAR flag */
+    PTBarFlag bar_flag;
+    /* Translation of the emulated address */
+    union {
+        uint64_t maddr;
+        uint64_t pio_base;
+        uint64_t u;
+    } access;
+} XenPTRegion;
+
+/* XenPTRegInfo declaration
+ * - only for emulated register (either a part or whole bit).
+ * - for passthrough register that need special behavior (like interacting with
+ *   other component), set emu_mask to all 0 and specify r/w func properly.
+ * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
+ */
+
+/* emulated register information */
+struct XenPTRegInfo {
+    uint32_t offset;
+    uint32_t size;
+    uint32_t init_val;
+    /* reg read only field mask (ON:RO/ROS, OFF:other) */
+    uint32_t ro_mask;
+    /* reg emulate field mask (ON:emu, OFF:passthrough) */
+    uint32_t emu_mask;
+    /* no write back allowed */
+    uint32_t no_wb;
+    conf_reg_init init;
+    /* read/write/restore function pointer
+     * for double_word/word/byte size */
+    union {
+        struct {
+            conf_dword_write write;
+            conf_dword_read read;
+            conf_dword_restore restore;
+        } dw;
+        struct {
+            conf_word_write write;
+            conf_word_read read;
+            conf_word_restore restore;
+        } w;
+        struct {
+            conf_byte_write write;
+            conf_byte_read read;
+            conf_byte_restore restore;
+        } b;
+    } u;
+};
+
+/* emulated register management */
+struct XenPTReg {
+    QLIST_ENTRY(XenPTReg) entries;
+    XenPTRegInfo *reg;
+    uint32_t data;
+};
+
+typedef struct XenPTRegGroupInfo XenPTRegGroupInfo;
+
+/* emul reg group size initialize method */
+typedef uint8_t (*pt_reg_size_init_fn)
+    (XenPCIPassthroughState *, const XenPTRegGroupInfo *,
+     uint32_t base_offset);
+
+/* emulated register group information */
+struct XenPTRegGroupInfo {
+    uint8_t grp_id;
+    RegisterGroupType grp_type;
+    uint8_t grp_size;
+    pt_reg_size_init_fn size_init;
+    XenPTRegInfo *emu_reg_tbl;
+};
+
+/* emul register group management table */
+typedef struct XenPTRegGroup {
+    QLIST_ENTRY(XenPTRegGroup) entries;
+    const XenPTRegGroupInfo *reg_grp;
+    uint32_t base_offset;
+    uint8_t size;
+    QLIST_HEAD(, XenPTReg) reg_tbl_list;
+} XenPTRegGroup;
+
+
+typedef struct XenPTPM {
+    QEMUTimer *pm_timer;  /* QEMUTimer struct */
+    int no_soft_reset;    /* No Soft Reset flags */
+    uint16_t flags;       /* power state transition flags */
+    uint16_t pmc_field;   /* Power Management Capabilities field */
+    int pm_delay;         /* power state transition delay */
+    uint16_t cur_state;   /* current power state */
+    uint16_t req_state;   /* requested power state */
+    uint32_t pm_base;     /* Power Management Capability reg base offset */
+    uint32_t aer_base;    /* AER Capability reg base offset */
+} XenPTPM;
+
+struct XenPCIPassthroughState {
+    PCIDevice dev;
+
+    char *hostaddr;
+    bool is_virtfn;
+    HostPCIDevice *real_device;
+    XenPTRegion bases[PCI_NUM_REGIONS]; /* Access regions */
+    QLIST_HEAD(, XenPTRegGroup) reg_grp_tbl;
+
+    uint32_t machine_irq;
+
+    uint32_t power_mgmt;
+    XenPTPM *pm_state;
+
+    MemoryRegion bar[PCI_NUM_REGIONS - 1];
+    MemoryRegion rom;
+};
+
+void pt_config_init(XenPCIPassthroughState *s);
+void pt_config_delete(XenPCIPassthroughState *s);
+void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable);
+void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
+                        int io_enable, int mem_enable);
+XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address);
+XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address);
+int pt_bar_offset_to_index(uint32_t offset);
+
+static inline pcibus_t pt_get_emul_size(PTBarFlag flag, pcibus_t r_size)
+{
+    /* align resource size (memory type only) */
+    if (flag == PT_BAR_FLAG_MEM) {
+        return (r_size + XC_PAGE_SIZE - 1) & XC_PAGE_MASK;
+    } else {
+        return r_size;
+    }
+}
+
+/* INTx */
+static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
+{
+    return host_pci_get_byte(s->real_device, PCI_INTERRUPT_PIN);
+}
+uint8_t pci_intx(XenPCIPassthroughState *ptdev);
+
+#endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
diff --git a/hw/xen_pci_passthrough_helpers.c b/hw/xen_pci_passthrough_helpers.c
new file mode 100644
index 0000000..192e918
--- /dev/null
+++ b/hw/xen_pci_passthrough_helpers.c
@@ -0,0 +1,46 @@
+#include "xen_pci_passthrough.h"
+
+/* The PCI Local Bus Specification, Rev. 3.0,
+ * Section 6.2.4 Miscellaneous Registers, pp 223
+ * outlines 5 valid values for the interrupt pin (intx).
+ *  0: For devices (or device functions) that don't use an interrupt pin
+ *  1: INTA#
+ *  2: INTB#
+ *  3: INTC#
+ *  4: INTD#
+ *
+ * Xen uses the following 4 values for intx
+ *  0: INTA#
+ *  1: INTB#
+ *  2: INTC#
+ *  3: INTD#
+ *
+ * Observing that these lists of values are not the same, pci_intx()
+ * uses the following mapping from hw to xen values.
+ * This seems to reflect the current usage within Xen.
+ *
+ * PCI hardware    | Xen | Notes
+ * ----------------+-----+----------------------------------------------------
+ * 0               | 0   | No interrupt
+ * 1               | 0   | INTA#
+ * 2               | 1   | INTB#
+ * 3               | 2   | INTC#
+ * 4               | 3   | INTD#
+ * any other value | 0   | This should never happen, log error message
+ *
+ */
+uint8_t pci_intx(XenPCIPassthroughState *ptdev)
+{
+    uint8_t r_val = pci_read_intx(ptdev);
+
+    PT_LOG("intx=%i\n", r_val);
+    if (r_val < 1 || r_val > 4) {
+        PT_LOG("Interrupt pin read from hardware is out of range: "
+               "value=%i, acceptable range is 1 - 4\n", r_val);
+        r_val = 0;
+    } else {
+        r_val -= 1;
+    }
+
+    return r_val;
+}
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread
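
For reference, the hardware-to-Xen INTx mapping documented in
xen_pci_passthrough_helpers.c above can be tried out in isolation. The
following is only a minimal sketch and is not part of the patch:
map_intx_to_xen() is a hypothetical name, and it takes the raw Interrupt Pin
value as an argument instead of reading it from the host device.

```c
#include <stdint.h>

/* Hypothetical standalone version of the mapping in pci_intx(): the raw
 * Interrupt Pin value is passed in rather than read from the host device. */
static uint8_t map_intx_to_xen(uint8_t hw_pin)
{
    /* 0 (no interrupt pin) and anything above 4 fall back to Xen value 0 */
    if (hw_pin < 1 || hw_pin > 4) {
        return 0;
    }
    /* INTA#..INTD# are 1..4 in hardware and 0..3 for Xen */
    return (uint8_t)(hw_pin - 1);
}
```

Hardware values 0 and 1 both end up as Xen value 0, so a caller that needs to
tell "no interrupt pin" apart from INTA# has to look at PCI_INTERRUPT_PIN
itself, as pt_initfn() does before it tries to bind the interrupt.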

* [Qemu-devel] [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
  2011-10-28 15:07 ` Anthony PERARD
@ 2011-10-28 15:07   ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini
  Cc: Anthony PERARD, Guy Zana, Xen Devel, Allen Kay

From: Allen Kay <allen.m.kay@intel.com>

Signed-off-by: Allen Kay <allen.m.kay@intel.com>
Signed-off-by: Guy Zana <guy@neocleus.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
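
The read handlers in this patch rely on PT_MERGE_VALUE to combine the value
read from the device with the emulated copy. As a rough illustration of what
pt_byte_reg_read() computes, here is a standalone sketch; merge_byte_read() is
a hypothetical helper written for this note, and only the macro itself is
taken from the patch.

```c
#include <stdint.h>

/* Same definition as in xen_pci_passthrough_config_init.c. */
#define PT_MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

/* Hypothetical helper mirroring pt_byte_reg_read(): bits selected by
 * (emu_mask & valid_mask) come from the emulated copy, all other bits
 * come from the value read from the device. */
static uint8_t merge_byte_read(uint8_t dev_value, uint8_t emu_data,
                               uint8_t emu_mask, uint8_t valid_mask)
{
    uint8_t valid_emu_mask = emu_mask & valid_mask;

    /* The mask is inverted on purpose: PT_MERGE_VALUE keeps "value" where
     * the mask is set, and here that must be the non-emulated bits. */
    return (uint8_t)PT_MERGE_VALUE(dev_value, emu_data, ~valid_emu_mask);
}
```

With dev_value 0xAB, emu_data 0xCD, emu_mask 0x0F and valid_mask 0xFF, the low
nibble is emulated and the high nibble is passed through, which gives 0xAD.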
 Makefile.target                      |    1 +
 hw/xen_pci_passthrough.h             |    2 +
 hw/xen_pci_passthrough_config_init.c | 2068 ++++++++++++++++++++++++++++++++++
 3 files changed, 2071 insertions(+), 0 deletions(-)
 create mode 100644 hw/xen_pci_passthrough_config_init.c
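
Every read and write handler added here receives a valid_mask argument. On the
write path it is built by pt_pci_write_config() from the remaining access
length and the position of the access inside the register. The sketch below
repeats that arithmetic in a standalone function; byte_lane_mask() is a
hypothetical name, and the asserted preconditions are assumptions about how
the caller uses it, not checks that exist in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask of the bytes of a register that a guest access
 * touches. find_addr is the first accessed address, real_offset is where the
 * register starts, emul_len is the number of bytes still to be emulated. */
static uint32_t byte_lane_mask(uint32_t find_addr, uint32_t real_offset,
                               int emul_len)
{
    uint32_t mask;

    /* Assumed preconditions: 1 to 4 bytes, access starts inside the register */
    assert(emul_len >= 1 && emul_len <= 4);
    assert(find_addr >= real_offset && find_addr - real_offset <= 3);

    /* one 0xFF per accessed byte, starting at the low byte */
    mask = 0xFFFFFFFFu >> ((4 - emul_len) << 3);
    /* move the mask up to the byte where the access starts */
    mask <<= (find_addr - real_offset) << 3;
    return mask;
}
```

For example, a 2-byte access at the start of a register gives 0x0000FFFF, and
a 1-byte access at offset 1 inside a register gives 0x0000FF00.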

diff --git a/Makefile.target b/Makefile.target
index 36ea47d..c32c688 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -219,6 +219,7 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
 
 # Inter-VM PCI shared memory
 CONFIG_IVSHMEM =
diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
index 2d1979d..ebc04fd 100644
--- a/hw/xen_pci_passthrough.h
+++ b/hw/xen_pci_passthrough.h
@@ -61,6 +61,8 @@ typedef int (*conf_byte_restore)
 /* power state transition */
 #define PT_FLAG_TRANSITING 0x0001
 
+#define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
+
 
 typedef enum {
     GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
new file mode 100644
index 0000000..4103b59
--- /dev/null
+++ b/hw/xen_pci_passthrough_config_init.c
@@ -0,0 +1,2068 @@
+/*
+ * Copyright (c) 2007, Neocleus Corporation.
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Alex Novik <alex@neocleus.com>
+ * Allen Kay <allen.m.kay@intel.com>
+ * Guy Zana <guy@neocleus.com>
+ *
+ * This file implements direct PCI assignment to a HVM guest
+ */
+
+#include "qemu-timer.h"
+#include "xen_backend.h"
+#include "xen_pci_passthrough.h"
+
+#define PT_MERGE_VALUE(value, data, val_mask) \
+    (((value) & (val_mask)) | ((data) & ~(val_mask)))
+
+#define PT_INVALID_REG          0xFFFFFFFF      /* invalid register value */
+
+/* prototype */
+
+static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
+                                uint32_t real_offset);
+static int pt_init_pci_config(XenPCIPassthroughState *s);
+
+
+/* helper */
+
+/* A return value of 1 means the capability should NOT be exposed to guest. */
+static int pt_hide_dev_cap(const HostPCIDevice *d, uint8_t grp_id)
+{
+    switch (grp_id) {
+    case PCI_CAP_ID_EXP:
+        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
+         * Controller looks trivial, e.g., the PCI Express Capabilities
+         * Register is 0. We should not try to expose it to guest.
+         */
+        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
+                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
+            return 1;
+        }
+        break;
+    }
+    return 0;
+}
+
+/* find emulated register group entry */
+XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
+{
+    XenPTRegGroup *entry = NULL;
+
+    /* find register group entry */
+    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
+        /* check address */
+        if ((entry->base_offset <= address)
+            && ((entry->base_offset + entry->size) > address)) {
+            return entry;
+        }
+    }
+
+    /* group entry not found */
+    return NULL;
+}
+
+/* find emulated register entry */
+XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
+{
+    XenPTReg *reg_entry = NULL;
+    XenPTRegInfo *reg = NULL;
+    uint32_t real_offset = 0;
+
+    /* find register entry */
+    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
+        reg = reg_entry->reg;
+        real_offset = reg_grp->base_offset + reg->offset;
+        /* check address */
+        if ((real_offset <= address)
+            && ((real_offset + reg->size) > address)) {
+            return reg_entry;
+        }
+    }
+
+    return NULL;
+}
+
+/* parse BAR */
+static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
+{
+    PCIDevice *d = &s->dev;
+    XenPTRegion *region = NULL;
+    PCIIORegion *r;
+    int index = 0;
+
+    /* check 64bit BAR */
+    index = pt_bar_offset_to_index(reg->offset);
+    if ((0 < index) && (index < PCI_ROM_SLOT)) {
+        int flags = s->real_device->io_regions[index - 1].flags;
+
+        if ((flags & IORESOURCE_MEM) && (flags & IORESOURCE_MEM_64)) {
+            region = &s->bases[index - 1];
+            if (region->bar_flag != PT_BAR_FLAG_UPPER) {
+                return PT_BAR_FLAG_UPPER;
+            }
+        }
+    }
+
+    /* check unused BAR */
+    r = &d->io_regions[index];
+    if (r->size == 0) {
+        return PT_BAR_FLAG_UNUSED;
+    }
+
+    /* for ExpROM BAR */
+    if (index == PCI_ROM_SLOT) {
+        return PT_BAR_FLAG_MEM;
+    }
+
+    /* check BAR I/O indicator */
+    if (s->real_device->io_regions[index].flags & IORESOURCE_IO) {
+        return PT_BAR_FLAG_IO;
+    } else {
+        return PT_BAR_FLAG_MEM;
+    }
+}
+
+
+/****************
+ * general register functions
+ */
+
+/* register initialization function */
+
+static uint32_t pt_common_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    return reg->init_val;
+}
+
+/* Read register functions */
+
+static int pt_byte_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint8_t *value, uint8_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint8_t valid_emu_mask = 0;
+
+    /* emulate byte register */
+    valid_emu_mask = reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+static int pt_word_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint16_t *value, uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t valid_emu_mask = 0;
+
+    /* emulate word register */
+    valid_emu_mask = reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+static int pt_long_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint32_t *value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t valid_emu_mask = 0;
+
+    /* emulate long register */
+    valid_emu_mask = reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+
+/* Write register functions */
+
+static int pt_byte_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                             uint8_t *value, uint8_t dev_value,
+                             uint8_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint8_t writable_mask = 0;
+    uint8_t throughable_mask = 0;
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    return 0;
+}
+static int pt_word_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                             uint16_t *value, uint16_t dev_value,
+                             uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    return 0;
+}
+static int pt_long_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                             uint32_t *value, uint32_t dev_value,
+                             uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    return 0;
+}
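
The generic read and write helpers above reduce to two mask merges. A minimal standalone sketch follows, assuming PT_MERGE_VALUE(value, data, mask) takes the bits selected by mask from value and the remaining bits from data (its definition is not part of this hunk). The sketch_* names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics: bits set in val_mask come from value, the rest from data. */
#define PT_MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

/* Hypothetical result pair for a guest write. */
struct sketch_write_result {
    uint16_t shadow;     /* new emulated (shadow) register content */
    uint16_t to_device;  /* value that would be written to the real device */
};

/* Guest read: emulated bits come from the shadow copy, the rest from hardware. */
static uint16_t sketch_word_read(uint16_t dev_value, uint16_t shadow,
                                 uint16_t emu_mask, uint16_t valid_mask)
{
    uint16_t valid_emu_mask = emu_mask & valid_mask;

    return PT_MERGE_VALUE(dev_value, shadow, (uint16_t)~valid_emu_mask);
}

/* Guest write: only emulated, non-read-only bits reach the shadow copy;
 * only non-emulated bits are passed through to the device.
 */
static struct sketch_write_result sketch_word_write(uint16_t guest_value,
                                                    uint16_t dev_value,
                                                    uint16_t shadow,
                                                    uint16_t emu_mask,
                                                    uint16_t ro_mask,
                                                    uint16_t valid_mask)
{
    struct sketch_write_result r;
    uint16_t writable_mask = emu_mask & ~ro_mask & valid_mask;
    uint16_t throughable_mask = ~emu_mask & valid_mask;

    r.shadow = PT_MERGE_VALUE(guest_value, shadow, writable_mask);
    r.to_device = PT_MERGE_VALUE(guest_value, dev_value, throughable_mask);
    return r;
}
```

With emu_mask = 0x00FF and ro_mask = 0x000F, a guest write of 0xFFFF changes only bits 4-7 of the shadow copy, while the upper byte goes to the device and the device keeps its own low byte.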
+
+/* common restore register functions */
+static int pt_byte_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                               uint32_t real_offset, uint8_t dev_value,
+                               uint8_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    PCIDevice *d = &s->dev;
+
+    /* use I/O device register's value as restore value */
+    *value = pci_get_byte(d->config + real_offset);
+
+    /* create value for restoring to I/O device register */
+    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
+
+    return 0;
+}
+static int pt_word_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                               uint32_t real_offset, uint16_t dev_value,
+                               uint16_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    PCIDevice *d = &s->dev;
+
+    /* use I/O device register's value as restore value */
+    *value = pci_get_word(d->config + real_offset);
+
+    /* create value for restoring to I/O device register */
+    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
+
+    return 0;
+}
+
+
+/* XenPTRegInfo declaration
+ * - only for emulated registers (either some bits or the whole register).
+ * - for passthrough registers needing special behavior (like interacting with
+ *   other components), set emu_mask to all 0 and specify r/w func properly.
+ * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
+ */
+
+/********************
+ * Header Type0
+ */
+
+static uint32_t pt_vendor_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    return s->real_device->vendor_id;
+}
+static uint32_t pt_device_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    return s->real_device->device_id;
+}
+static uint32_t pt_status_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    int reg_field = 0;
+
+    /* find Header register group */
+    reg_grp_entry = pt_find_reg_grp(s, PCI_CAPABILITY_LIST);
+    if (reg_grp_entry) {
+        /* find Capabilities Pointer register */
+        reg_entry = pt_find_reg(reg_grp_entry, PCI_CAPABILITY_LIST);
+        if (reg_entry) {
+            /* check Capabilities Pointer register */
+            if (reg_entry->data) {
+                reg_field |= PCI_STATUS_CAP_LIST;
+            } else {
+                reg_field &= ~PCI_STATUS_CAP_LIST;
+            }
+        } else {
+            hw_error("Internal error: Couldn't find pt_reg_tbl for "
+                     "Capabilities Pointer register. I/O emulator exit.\n");
+        }
+    } else {
+        hw_error("Internal error: Couldn't find pt_reg_grp_tbl for Header. "
+                 "I/O emulator exit.\n");
+    }
+
+    return reg_field;
+}
+static uint32_t pt_header_type_reg_init(XenPCIPassthroughState *s,
+                                        XenPTRegInfo *reg,
+                                        uint32_t real_offset)
+{
+    /* report PCI_HEADER_TYPE with the multi-function bit (0x80) set */
+    return reg->init_val | 0x80;
+}
+
+/* initialize Interrupt Pin register */
+static uint32_t pt_irqpin_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    return pci_read_intx(s);
+}
+
+/* Command register */
+static int pt_cmd_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                           uint16_t *value, uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t valid_emu_mask = 0;
+    uint16_t emu_mask = reg->emu_mask;
+
+    if (s->is_virtfn) {
+        emu_mask |= PCI_COMMAND_MEMORY;
+    }
+
+    /* emulate word register */
+    valid_emu_mask = emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint16_t *value, uint16_t dev_value,
+                            uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+    uint16_t wr_value = *value;
+    uint16_t emu_mask = reg->emu_mask;
+
+    if (s->is_virtfn) {
+        emu_mask |= PCI_COMMAND_MEMORY;
+    }
+
+    /* modify emulate register */
+    writable_mask = ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~emu_mask & valid_mask;
+
+    if (*value & PCI_COMMAND_INTX_DISABLE) {
+        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
+    } else {
+        if (s->machine_irq) {
+            throughable_mask |= PCI_COMMAND_INTX_DISABLE;
+        }
+    }
+
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* mapping BAR */
+    pt_bar_mapping(s, wr_value & PCI_COMMAND_IO,
+                   wr_value & PCI_COMMAND_MEMORY);
+
+    return 0;
+}
+static int pt_cmd_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                              uint32_t real_offset, uint16_t dev_value,
+                              uint16_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    PCIDevice *d = &s->dev;
+    uint16_t restorable_mask = 0;
+
+    /* use I/O device register's value as restore value */
+    *value = pci_get_word(d->config + real_offset);
+
+    /* create value for restoring to I/O device register
+     * but do not include Fast Back-to-Back Enable bit.
+     */
+    restorable_mask = reg->emu_mask & ~PCI_COMMAND_FAST_BACK;
+    *value = PT_MERGE_VALUE(*value, dev_value, restorable_mask);
+
+    if (!s->machine_irq) {
+        *value |= PCI_COMMAND_INTX_DISABLE;
+    } else {
+        *value &= ~PCI_COMMAND_INTX_DISABLE;
+    }
+
+    return 0;
+}
+
+/* BAR */
+#define PT_BAR_MEM_RO_MASK      0x0000000F      /* BAR ReadOnly mask(Memory) */
+#define PT_BAR_MEM_EMU_MASK     0xFFFFFFF0      /* BAR emul mask(Memory) */
+#define PT_BAR_IO_RO_MASK       0x00000003      /* BAR ReadOnly mask(I/O) */
+#define PT_BAR_IO_EMU_MASK      0xFFFFFFFC      /* BAR emul mask(I/O) */
+
+static inline uint32_t base_address_with_flags(HostPCIIORegion *hr)
+{
+    if ((hr->flags & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
+        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_IO_MASK);
+    } else {
+        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_MEM_MASK);
+    }
+}
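
For readers without pci_regs.h at hand, base_address_with_flags() above can be sketched standalone. The SKETCH_* constants mirror the usual pci_regs.h values (PCI_BASE_ADDRESS_SPACE = 0x01, I/O flag bits = 0x03, memory flag bits = 0x0f), and struct sketch_region is a hypothetical stand-in for HostPCIIORegion. Both are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the usual pci_regs.h values (assumed, for illustration). */
#define SKETCH_BAR_SPACE     0x01u     /* bit 0: 0 = memory, 1 = I/O */
#define SKETCH_BAR_SPACE_IO  0x01u
#define SKETCH_BAR_IO_MASK   (~0x03u)  /* address bits of an I/O BAR */
#define SKETCH_BAR_MEM_MASK  (~0x0fu)  /* address bits of a memory BAR */

/* Hypothetical stand-in for HostPCIIORegion. */
struct sketch_region {
    uint64_t base_addr;
    uint64_t flags;
};

/* Rebuild the low 32 bits of a BAR: base address plus its low flag bits. */
static uint32_t sketch_base_with_flags(const struct sketch_region *hr)
{
    uint32_t low_flags;

    if ((hr->flags & SKETCH_BAR_SPACE) == SKETCH_BAR_SPACE_IO) {
        low_flags = (uint32_t)hr->flags & ~SKETCH_BAR_IO_MASK;   /* keep 0x3 */
    } else {
        low_flags = (uint32_t)hr->flags & ~SKETCH_BAR_MEM_MASK;  /* keep 0xf */
    }
    return (uint32_t)hr->base_addr | low_flags;
}
```

Any resource flag bits above the low nibble (for example 0x100 in the second test value) are dropped, so only the bits that exist in a hardware BAR are reported.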
+
+static uint32_t pt_bar_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
+                                uint32_t real_offset)
+{
+    int reg_field = 0;
+    int index;
+
+    /* get BAR index */
+    index = pt_bar_offset_to_index(reg->offset);
+    if (index < 0) {
+        hw_error("Internal error: Invalid BAR index[%d]. "
+                 "I/O emulator exit.\n", index);
+    }
+
+    /* set initial guest physical base address to -1 */
+    s->bases[index].e_physbase = -1;
+
+    /* set BAR flag */
+    s->bases[index].bar_flag = pt_bar_reg_parse(s, reg);
+    if (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED) {
+        reg_field = PT_INVALID_REG;
+    }
+
+    return reg_field;
+}
+static int pt_bar_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                           uint32_t *value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t valid_emu_mask = 0;
+    uint32_t bar_emu_mask = 0;
+    int index;
+
+    /* get BAR index */
+    index = pt_bar_offset_to_index(reg->offset);
+    if (index < 0) {
+        hw_error("Internal error: Invalid BAR index[%d]. "
+                 "I/O emulator exit.\n", index);
+    }
+
+    /* use fixed-up value from kernel sysfs */
+    *value = base_address_with_flags(&s->real_device->io_regions[index]);
+
+    /* set emulate mask depending on BAR flag */
+    switch (s->bases[index].bar_flag) {
+    case PT_BAR_FLAG_MEM:
+        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
+        break;
+    case PT_BAR_FLAG_IO:
+        bar_emu_mask = PT_BAR_IO_EMU_MASK;
+        break;
+    case PT_BAR_FLAG_UPPER:
+        bar_emu_mask = PT_BAR_ALLF;
+        break;
+    default:
+        break;
+    }
+
+    /* emulate BAR */
+    valid_emu_mask = bar_emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+static int pt_bar_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint32_t *value, uint32_t dev_value,
+                            uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegion *base = NULL;
+    PCIDevice *d = &s->dev;
+    PCIIORegion *r;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+    uint32_t bar_emu_mask = 0;
+    uint32_t bar_ro_mask = 0;
+    uint32_t new_addr, last_addr;
+    uint32_t prev_offset;
+    uint32_t r_size = 0;
+    int index = 0;
+
+    /* get BAR index */
+    index = pt_bar_offset_to_index(reg->offset);
+    if (index < 0) {
+        hw_error("Internal error: Invalid BAR index[%d]. "
+                 "I/O emulator exit.\n", index);
+    }
+
+    r = &d->io_regions[index];
+    base = &s->bases[index];
+    r_size = pt_get_emul_size(base->bar_flag, r->size);
+
+    /* set emulate mask and read-only mask depending on BAR flag */
+    switch (s->bases[index].bar_flag) {
+    case PT_BAR_FLAG_MEM:
+        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
+        bar_ro_mask = PT_BAR_MEM_RO_MASK | (r_size - 1);
+        break;
+    case PT_BAR_FLAG_IO:
+        bar_emu_mask = PT_BAR_IO_EMU_MASK;
+        bar_ro_mask = PT_BAR_IO_RO_MASK | (r_size - 1);
+        break;
+    case PT_BAR_FLAG_UPPER:
+        bar_emu_mask = PT_BAR_ALLF;
+        bar_ro_mask = 0;    /* all upper 32bit are R/W */
+        break;
+    default:
+        break;
+    }
+
+    /* modify emulate register */
+    writable_mask = bar_emu_mask & ~bar_ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* check whether we need to update the virtual region address or not */
+    switch (s->bases[index].bar_flag) {
+    case PT_BAR_FLAG_MEM:
+        /* nothing to do */
+        break;
+    case PT_BAR_FLAG_IO:
+        new_addr = cfg_entry->data;
+        last_addr = new_addr + r_size - 1;
+        /* check invalid address */
+        if (last_addr <= new_addr || !new_addr || last_addr >= 0x10000) {
+            /* check 64K range */
+            if ((last_addr >= 0x10000) &&
+                (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask))) {
+                PT_LOG("Warning: Guest attempt to set Base Address "
+                       "over the 64KB. [%02x:%02x.%x][Offset:%02xh]"
+                       "[Address:%08xh][Size:%08xh]\n",
+                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                       PCI_FUNC(d->devfn),
+                       reg->offset, new_addr, r_size);
+            }
+            /* just remove mapping */
+            r->addr = -1;
+            goto exit;
+        }
+        break;
+    case PT_BAR_FLAG_UPPER:
+        if (cfg_entry->data) {
+            if (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask)) {
+                PT_LOG("Warning: Guest attempt to set high MMIO Base Address. "
+                       "Ignore mapping. "
+                       "[%02x:%02x.%x][Offset:%02xh][High Address:%08xh]\n",
+                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                       PCI_FUNC(d->devfn), reg->offset, cfg_entry->data);
+            }
+            /* clear lower address */
+            d->io_regions[index-1].addr = -1;
+        } else {
+            /* find lower 32bit BAR */
+            prev_offset = (reg->offset - 4);
+            reg_grp_entry = pt_find_reg_grp(s, prev_offset);
+            if (reg_grp_entry) {
+                reg_entry = pt_find_reg(reg_grp_entry, prev_offset);
+                if (reg_entry) {
+                    /* restore lower address */
+                    d->io_regions[index-1].addr = reg_entry->data;
+                } else {
+                    return -1;
+                }
+            } else {
+                return -1;
+            }
+        }
+
+        /* never map the 'empty' upper region,
+         * because mapping the lower region is enough.
+         */
+        r->addr = -1;
+        goto exit;
+    default:
+        break;
+    }
+
+    /* update the corresponding virtual region address */
+    /*
+     * When guest code tries to get block size of mmio, it will write all "1"s
+     * into pci bar register. In this case, cfg_entry->data == writable_mask.
+     * Especially for devices with large mmio, the value of writable_mask
+     * is likely to be a guest physical address that has been mapped to ram
+     * rather than mmio. Remapping this value to mmio should be prevented.
+     */
+
+    if (cfg_entry->data != writable_mask) {
+        r->addr = cfg_entry->data;
+    }
+
+exit:
+    /* create value for writing to I/O device register */
+    throughable_mask = ~bar_emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* After BAR reg update, we need to remap BAR */
+    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
+    if (reg_grp_entry) {
+        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
+        if (reg_entry) {
+            pt_bar_mapping_one(s, index, reg_entry->data & PCI_COMMAND_IO,
+                               reg_entry->data & PCI_COMMAND_MEMORY);
+        }
+    }
+
+    return 0;
+}
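
The "all 1s" check in pt_bar_reg_write() matches the usual BAR sizing probe. The standalone sketch below models a 32-bit memory BAR only. struct sketch_bar and the sketch_* functions are hypothetical, the mask values copy PT_BAR_MEM_RO_MASK and PT_BAR_MEM_EMU_MASK from this patch, and the region size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MEM_RO_MASK   0x0000000Fu  /* same value as PT_BAR_MEM_RO_MASK */
#define SKETCH_MEM_EMU_MASK  0xFFFFFFF0u  /* same value as PT_BAR_MEM_EMU_MASK */

/* Hypothetical model of one emulated 32-bit memory BAR. */
struct sketch_bar {
    uint32_t data;         /* emulated register content */
    uint32_t size;         /* region size, assumed to be a power of two */
    uint32_t mapped_addr;  /* guest address the region is currently mapped at */
};

/* Emulated write: bits below the region size are read-only. A value equal to
 * writable_mask is treated as a sizing probe and does not move the mapping.
 */
static void sketch_bar_write(struct sketch_bar *bar, uint32_t guest_value)
{
    uint32_t ro_mask = SKETCH_MEM_RO_MASK | (bar->size - 1);
    uint32_t writable_mask = SKETCH_MEM_EMU_MASK & ~ro_mask;

    bar->data = (guest_value & writable_mask) | (bar->data & ~writable_mask);
    if (bar->data != writable_mask) {
        bar->mapped_addr = bar->data;
    }
}

/* What a guest typically does to size a BAR: write all 1s, read back,
 * restore the old value, then invert the address bits.
 */
static uint32_t sketch_probe_size(struct sketch_bar *bar)
{
    uint32_t saved = bar->data;
    uint32_t readback;

    sketch_bar_write(bar, 0xFFFFFFFFu);
    readback = bar->data & ~SKETCH_MEM_RO_MASK;
    sketch_bar_write(bar, saved);
    return ~readback + 1;
}
```

In this model the probe returns the region size and leaves mapped_addr untouched, which is the behaviour the comment in pt_bar_reg_write() asks for.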
+static int pt_bar_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                              uint32_t real_offset, uint32_t dev_value,
+                              uint32_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t bar_emu_mask = 0;
+    int index = 0;
+
+    /* get BAR index */
+    index = pt_bar_offset_to_index(reg->offset);
+    if (index < 0) {
+        hw_error("Internal error: Invalid BAR index[%d]. "
+                 "I/O emulator exit.\n", index);
+    }
+
+    /* use value from kernel sysfs */
+    if (s->bases[index].bar_flag == PT_BAR_FLAG_UPPER) {
+        *value = s->real_device->io_regions[index - 1].base_addr >> 32;
+    } else {
+        *value = base_address_with_flags(&s->real_device->io_regions[index]);
+    }
+
+    /* set emulate mask depending on BAR flag */
+    switch (s->bases[index].bar_flag) {
+    case PT_BAR_FLAG_MEM:
+        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
+        break;
+    case PT_BAR_FLAG_IO:
+        bar_emu_mask = PT_BAR_IO_EMU_MASK;
+        break;
+    case PT_BAR_FLAG_UPPER:
+        bar_emu_mask = PT_BAR_ALLF;
+        break;
+    default:
+        break;
+    }
+
+    /* create value for restoring to I/O device register */
+    *value = PT_MERGE_VALUE(*value, dev_value, bar_emu_mask);
+
+    return 0;
+}
+
+/* write Exp ROM BAR */
+static int pt_exp_rom_bar_reg_write(XenPCIPassthroughState *s,
+                                    XenPTReg *cfg_entry, uint32_t *value,
+                                    uint32_t dev_value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegion *base = NULL;
+    PCIDevice *d = (PCIDevice *)&s->dev;
+    PCIIORegion *r;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+    pcibus_t r_size = 0;
+    uint32_t bar_emu_mask = 0;
+    uint32_t bar_ro_mask = 0;
+
+    r = &d->io_regions[PCI_ROM_SLOT];
+    r_size = r->size;
+    base = &s->bases[PCI_ROM_SLOT];
+    /* align memory type resource size */
+    r_size = pt_get_emul_size(base->bar_flag, r_size);
+
+    /* set emulate mask and read-only mask */
+    bar_emu_mask = reg->emu_mask;
+    bar_ro_mask = (reg->ro_mask | (r_size - 1)) & ~PCI_ROM_ADDRESS_ENABLE;
+
+    /* modify emulate register */
+    writable_mask = ~bar_ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* update the corresponding virtual region address */
+    /*
+     * When guest code tries to get block size of mmio, it will write all "1"s
+     * into pci bar register. In this case, cfg_entry->data == writable_mask.
+     * Especially for devices with large mmio, the value of writable_mask
+     * is likely to be a guest physical address that has been mapped to ram
+     * rather than mmio. Remapping this value to mmio should be prevented.
+     */
+
+    if (cfg_entry->data != writable_mask) {
+        r->addr = cfg_entry->data;
+    }
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~bar_emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* After BAR reg update, we need to remap BAR */
+    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
+    if (reg_grp_entry) {
+        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
+        if (reg_entry) {
+            pt_bar_mapping_one(s, PCI_ROM_SLOT,
+                               reg_entry->data & PCI_COMMAND_IO,
+                               reg_entry->data & PCI_COMMAND_MEMORY);
+        }
+    }
+
+    return 0;
+}
+/* restore ROM BAR */
+static int pt_exp_rom_bar_reg_restore(XenPCIPassthroughState *s,
+                                      XenPTReg *cfg_entry,
+                                      uint32_t real_offset,
+                                      uint32_t dev_value, uint32_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+
+    /* use value from kernel sysfs */
+    *value =
+        PT_MERGE_VALUE(host_pci_get_long(s->real_device, PCI_ROM_ADDRESS),
+                       dev_value, reg->emu_mask);
+    return 0;
+}
+
+/* Header Type0 reg static information table */
+static XenPTRegInfo pt_emu_reg_header0_tbl[] = {
+    /* Vendor ID reg */
+    {
+        .offset     = PCI_VENDOR_ID,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFFFF,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_vendor_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Device ID reg */
+    {
+        .offset     = PCI_DEVICE_ID,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFFFF,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_device_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Command reg */
+    {
+        .offset     = PCI_COMMAND,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xF880,
+        .emu_mask   = 0x0740,
+        .init       = pt_common_reg_init,
+        .u.w.read   = pt_cmd_reg_read,
+        .u.w.write  = pt_cmd_reg_write,
+        .u.w.restore  = pt_cmd_reg_restore,
+    },
+    /* Capabilities Pointer reg */
+    {
+        .offset     = PCI_CAPABILITY_LIST,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Status reg */
+    /* use emulated Cap Ptr value to initialize,
+     * so it needs to be declared after the Cap Ptr reg
+     */
+    {
+        .offset     = PCI_STATUS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x06FF,
+        .emu_mask   = 0x0010,
+        .init       = pt_status_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Cache Line Size reg */
+    {
+        .offset     = PCI_CACHE_LINE_SIZE,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0x00,
+        .emu_mask   = 0xFF,
+        .init       = pt_common_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = pt_byte_reg_restore,
+    },
+    /* Latency Timer reg */
+    {
+        .offset     = PCI_LATENCY_TIMER,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0x00,
+        .emu_mask   = 0xFF,
+        .init       = pt_common_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = pt_byte_reg_restore,
+    },
+    /* Header Type reg */
+    {
+        .offset     = PCI_HEADER_TYPE,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0x00,
+        .init       = pt_header_type_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Interrupt Line reg */
+    {
+        .offset     = PCI_INTERRUPT_LINE,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0x00,
+        .emu_mask   = 0xFF,
+        .init       = pt_common_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Interrupt Pin reg */
+    {
+        .offset     = PCI_INTERRUPT_PIN,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_irqpin_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* BAR 0 reg */
+    /* BAR masks are decided later, depending on IO/MEM type */
+    {
+        .offset     = PCI_BASE_ADDRESS_0,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 1 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_1,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 2 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_2,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 3 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_3,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 4 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_4,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 5 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_5,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* Expansion ROM BAR reg */
+    {
+        .offset     = PCI_ROM_ADDRESS,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x000007FE,
+        .emu_mask   = 0xFFFFF800,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_long_reg_read,
+        .u.dw.write = pt_exp_rom_bar_reg_write,
+        .u.dw.restore = pt_exp_rom_bar_reg_restore,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/*********************************
+ * Vital Product Data Capability
+ */
+
+/* Vital Product Data Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_vpd_tbl[] = {
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/**************************************
+ * Vendor Specific Capability
+ */
+
+/* Vendor Specific Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_vendor_tbl[] = {
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/*****************************
+ * PCI Express Capability
+ */
+
+/* initialize Link Control register */
+static uint32_t pt_linkctrl_reg_init(XenPCIPassthroughState *s,
+                                     XenPTRegInfo *reg, uint32_t real_offset)
+{
+    uint8_t cap_ver = 0;
+    uint8_t dev_type = 0;
+
+    /* TODO maybe better to use a function from hw/pcie.c */
+    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
+                           + PCI_EXP_FLAGS)
+        & PCI_EXP_FLAGS_VERS;
+    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
+                             + PCI_EXP_FLAGS)
+                & PCI_EXP_FLAGS_TYPE) >> 4;
+
+    /* no need to initialize in case of Root Complex Integrated Endpoint
+     * with cap_ver 1.x
+     */
+    if ((dev_type == PCI_EXP_TYPE_RC_END) && (cap_ver == 1)) {
+        return PT_INVALID_REG;
+    }
+
+    return reg->init_val;
+}
+/* initialize Device Control 2 register */
+static uint32_t pt_devctrl2_reg_init(XenPCIPassthroughState *s,
+                                     XenPTRegInfo *reg, uint32_t real_offset)
+{
+    uint8_t cap_ver = 0;
+
+    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
+                           + PCI_EXP_FLAGS)
+        & PCI_EXP_FLAGS_VERS;
+
+    /* no need to initialize in case of cap_ver 1.x */
+    if (cap_ver == 1) {
+        return PT_INVALID_REG;
+    }
+
+    return reg->init_val;
+}
+/* initialize Link Control 2 register */
+static uint32_t pt_linkctrl2_reg_init(XenPCIPassthroughState *s,
+                                      XenPTRegInfo *reg, uint32_t real_offset)
+{
+    int reg_field = 0;
+    uint8_t cap_ver = 0;
+
+    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
+                           + PCI_EXP_FLAGS)
+        & PCI_EXP_FLAGS_VERS;
+
+    /* no need to initialize in case of cap_ver 1.x */
+    if (cap_ver == 1) {
+        return PT_INVALID_REG;
+    }
+
+    /* set Supported Link Speed */
+    reg_field |= PCI_EXP_LNKCAP_SLS &
+        pci_get_byte(s->dev.config + real_offset - reg->offset
+                     + PCI_EXP_LNKCAP);
+
+    return reg_field;
+}
+
+/* PCI Express Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_pcie_tbl[] = {
+    /* Next Pointer reg */
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Device Capabilities reg */
+    {
+        .offset     = PCI_EXP_DEVCAP,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x1FFCFFFF,
+        .emu_mask   = 0x10000000,
+        .init       = pt_common_reg_init,
+        .u.dw.read  = pt_long_reg_read,
+        .u.dw.write = pt_long_reg_write,
+        .u.dw.restore = NULL,
+    },
+    /* Device Control reg */
+    {
+        .offset     = PCI_EXP_DEVCTL,
+        .size       = 2,
+        .init_val   = 0x2810,
+        .ro_mask    = 0x8400,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_common_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = pt_word_reg_restore,
+    },
+    /* Link Control reg */
+    {
+        .offset     = PCI_EXP_LNKCTL,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFC34,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_linkctrl_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = pt_word_reg_restore,
+    },
+    /* Device Control 2 reg */
+    {
+        .offset     = 0x28,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFFE0,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_devctrl2_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = pt_word_reg_restore,
+    },
+    /* Link Control 2 reg */
+    {
+        .offset     = 0x30,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xE040,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_linkctrl2_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = pt_word_reg_restore,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/*********************************
+ * Power Management Capability
+ */
+
+/* initialize Power Management Capabilities register */
+static uint32_t pt_pmc_reg_init(XenPCIPassthroughState *s,
+                                XenPTRegInfo *reg, uint32_t real_offset)
+{
+    PCIDevice *d = &s->dev;
+
+    if (!s->power_mgmt) {
+        return reg->init_val;
+    }
+
+    /* set Power Management Capabilities register */
+    s->pm_state->pmc_field = pci_get_word(d->config + real_offset);
+
+    return reg->init_val;
+}
+/* initialize PCI Power Management Control/Status register */
+static uint32_t pt_pmcsr_reg_init(XenPCIPassthroughState *s,
+                                  XenPTRegInfo *reg, uint32_t real_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t cap_ver  = 0;
+
+    if (!s->power_mgmt) {
+        return reg->init_val;
+    }
+
+    /* check PCI Power Management support version */
+    cap_ver = s->pm_state->pmc_field & PCI_PM_CAP_VER_MASK;
+
+    if (cap_ver > 2) {
+        /* set No Soft Reset */
+        s->pm_state->no_soft_reset =
+            pci_get_byte(d->config + real_offset) & PCI_PM_CTRL_NO_SOFT_RESET;
+    }
+
+    /* wake up real physical device */
+    switch (host_pci_get_word(s->real_device, real_offset)
+            & PCI_PM_CTRL_STATE_MASK) {
+    case 0:
+        break;
+    case 1:
+        PT_LOG("Power state transition D1 -> D0active\n");
+        host_pci_set_word(s->real_device, real_offset, 0);
+        break;
+    case 2:
+        PT_LOG("Power state transition D2 -> D0active\n");
+        host_pci_set_word(s->real_device, real_offset, 0);
+        usleep(200);
+        break;
+    case 3:
+        PT_LOG("Power state transition D3hot -> D0active\n");
+        host_pci_set_word(s->real_device, real_offset, 0);
+        usleep(10 * 1000);
+        pt_init_pci_config(s);
+        break;
+    }
+
+    return reg->init_val;
+}
+/* read Power Management Control/Status register */
+static int pt_pmcsr_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                             uint16_t *value, uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t valid_emu_mask = reg->emu_mask;
+
+    if (!s->power_mgmt) {
+        valid_emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
+    }
+
+    valid_emu_mask = valid_emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+/* reset interrupt and I/O resource mapping */
+static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
+{
+    PCIDevice *d = &s->dev;
+    PCIIORegion *r;
+    int i = 0;
+    uint8_t e_device = 0;
+    uint8_t e_intx = 0;
+
+    /* unbind INTx */
+    e_device = PCI_SLOT(s->dev.devfn);
+    e_intx = pci_intx(s);
+
+    if (s->machine_irq) {
+        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
+                                    PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
+            PT_LOG("Error: Unbinding of interrupt failed!\n");
+        }
+    }
+
+    /* clear all virtual region addresses */
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        r = &d->io_regions[i];
+        r->addr = -1;
+    }
+
+    /* unmapping BAR */
+    pt_bar_mapping(s, 0, 0);
+}
+/* check power state transition */
+static int check_power_state(XenPCIPassthroughState *s)
+{
+    XenPTPM *pm_state = s->pm_state;
+    PCIDevice *d = &s->dev;
+    uint16_t read_val = 0;
+    uint16_t cur_state = 0;
+
+    /* get current power state */
+    read_val = host_pci_get_word(s->real_device,
+                                 pm_state->pm_base + PCI_PM_CTRL);
+    cur_state = read_val & PCI_PM_CTRL_STATE_MASK;
+
+    if (pm_state->req_state != cur_state) {
+        PT_LOG("Error: Failed to change power state. "
+               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               pm_state->req_state, cur_state);
+        return -1;
+    }
+    return 0;
+}
+/* timer callback: complete a D3hot -> D0 transition with device reset */
+static void pt_from_d3hot_to_d0_with_reset(void *opaque)
+{
+    XenPCIPassthroughState *s = opaque;
+    XenPTPM *pm_state = s->pm_state;
+    int ret = 0;
+
+    /* check power state */
+    ret = check_power_state(s);
+
+    if (ret < 0) {
+        goto out;
+    }
+
+    pt_init_pci_config(s);
+
+out:
+    /* power state transition flags off */
+    pm_state->flags &= ~PT_FLAG_TRANSITING;
+
+    qemu_free_timer(pm_state->pm_timer);
+    pm_state->pm_timer = NULL;
+}
+static void pt_default_power_transition(void *opaque)
+{
+    XenPCIPassthroughState *ptdev = opaque;
+    XenPTPM *pm_state = ptdev->pm_state;
+
+    /* check power state */
+    check_power_state(ptdev);
+
+    /* power state transition flags off */
+    pm_state->flags &= ~PT_FLAG_TRANSITING;
+
+    qemu_free_timer(pm_state->pm_timer);
+    pm_state->pm_timer = NULL;
+}
+static int pt_pmcsr_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                              uint16_t *value, uint16_t dev_value,
+                              uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    PCIDevice *d = &s->dev;
+    uint16_t emu_mask = reg->emu_mask;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+    XenPTPM *pm_state = s->pm_state;
+
+    if (!s->power_mgmt) {
+        emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
+    }
+
+    /* modify emulated register */
+    writable_mask = emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    if (!s->power_mgmt) {
+        return 0;
+    }
+
+    /* set I/O device power state */
+    pm_state->cur_state = dev_value & PCI_PM_CTRL_STATE_MASK;
+
+    /* set Guest requested PowerState */
+    pm_state->req_state = *value & PCI_PM_CTRL_STATE_MASK;
+
+    /* check whether a power state transition is requested */
+    if (pm_state->cur_state == pm_state->req_state) {
+        /* no power state transition requested */
+        return 0;
+    }
+
+    /* reject a move from a deeper state to a shallower one other than D0 */
+    if ((pm_state->req_state != 0) &&
+        (pm_state->cur_state > pm_state->req_state)) {
+        PT_LOG("Error: Invalid power transition. "
+               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               pm_state->req_state, pm_state->cur_state);
+
+        return 0;
+    }
+
+    /* check if this device supports the requested power state */
+    if (((pm_state->req_state == 1) && !(pm_state->pmc_field & PCI_PM_CAP_D1))
+        || ((pm_state->req_state == 2) &&
+            !(pm_state->pmc_field & PCI_PM_CAP_D2))) {
+        PT_LOG("Error: Invalid power transition. "
+               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               pm_state->req_state, pm_state->cur_state);
+
+        return 0;
+    }
+
+    /* A transition to or from D3hot requires a 10 ms wait.
+     * The register write itself is performed later, so do not start the
+     * QEMUTimer now; only allocate and initialize it here.
+     */
+    if ((pm_state->cur_state == 3) || (pm_state->req_state == 3)) {
+        if (pm_state->req_state == 0) {
+            /* alloc and init QEMUTimer */
+            if (!pm_state->no_soft_reset) {
+                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
+                    pt_from_d3hot_to_d0_with_reset, s);
+
+                /* reset Interrupt and I/O resource mapping */
+                pt_reset_interrupt_and_io_mapping(s);
+            } else {
+                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
+                                        pt_default_power_transition, s);
+            }
+        } else {
+            /* alloc and init QEMUTimer */
+            pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
+                pt_default_power_transition, s);
+        }
+
+        /* set power state transition delay */
+        pm_state->pm_delay = 10;
+
+        /* power state transition flags on */
+        pm_state->flags |= PT_FLAG_TRANSITING;
+    }
+    /* Transitions among D0, D1 and D2 need no QEMUTimer.
+     * Write the register here and then read it back
+     * to check the resulting state.
+     */
+    else {
+        /* write power state to I/O device register */
+        host_pci_set_word(s->real_device, pm_state->pm_base + PCI_PM_CTRL,
+                          *value);
+
+        /* in case of a transition to or from D2,
+         * it's necessary to wait 200 usec.
+         * QEMUTimer does not support microsecond units right now,
+         * so we wait here ourselves.
+         */
+        if ((pm_state->cur_state == 2) || (pm_state->req_state == 2)) {
+            usleep(200);
+        }
+
+        /* check power state */
+        check_power_state(s);
+
+        /* recreate value for writing to I/O device register */
+        *value = host_pci_get_word(s->real_device,
+                                   pm_state->pm_base + PCI_PM_CTRL);
+    }
+
+    return 0;
+}
+
+/* restore Power Management Control/Status register */
+static int pt_pmcsr_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                                uint32_t real_offset, uint16_t dev_value,
+                                uint16_t *value)
+{
+    /* create value for restoring to I/O device register
+     * No need to restore, just clear PME Enable and PME Status bit
+     * Note: register type of PME Status bit is RW1C, so clear by writing 1b
+     */
+    *value = (dev_value & ~PCI_PM_CTRL_PME_ENABLE) | PCI_PM_CTRL_PME_STATUS;
+
+    return 0;
+}
+
+
+/* Power Management Capability reg static information table */
+static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
+    /* Next Pointer reg */
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Power Management Capabilities reg */
+    {
+        .offset     = PCI_CAP_FLAGS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFFFF,
+        .emu_mask   = 0xF9C8,
+        .init       = pt_pmc_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* PCI Power Management Control/Status reg */
+    {
+        .offset     = PCI_PM_CTRL,
+        .size       = 2,
+        .init_val   = 0x0008,
+        .ro_mask    = 0xE1FC,
+        .emu_mask   = 0x8100,
+        .init       = pt_pmcsr_reg_init,
+        .u.w.read   = pt_pmcsr_reg_read,
+        .u.w.write  = pt_pmcsr_reg_write,
+        .u.w.restore  = pt_pmcsr_reg_restore,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/****************************
+ * Capabilities
+ */
+
+/* AER register operations */
+
+static void aer_save_one_register(XenPCIPassthroughState *s, int offset)
+{
+    PCIDevice *d = &s->dev;
+    uint32_t aer_base = s->pm_state->aer_base;
+    uint32_t val = 0;
+
+    val = host_pci_get_long(s->real_device, aer_base + offset);
+    pci_set_long(d->config + aer_base + offset, val);
+}
+static void pt_aer_reg_save(XenPCIPassthroughState *s)
+{
+    /* after reset, following register values should be restored.
+     * So, save them.
+     */
+    aer_save_one_register(s, PCI_ERR_UNCOR_MASK);
+    aer_save_one_register(s, PCI_ERR_UNCOR_SEVER);
+    aer_save_one_register(s, PCI_ERR_COR_MASK);
+    aer_save_one_register(s, PCI_ERR_CAP);
+}
+static void aer_restore_one_register(XenPCIPassthroughState *s, int offset)
+{
+    PCIDevice *d = &s->dev;
+    uint32_t aer_base = s->pm_state->aer_base;
+    uint32_t config = 0;
+
+    config = pci_get_long(d->config + aer_base + offset);
+    host_pci_set_long(s->real_device, aer_base + offset, config);
+}
+static void pt_aer_reg_restore(XenPCIPassthroughState *s)
+{
+    /* the following registers should be reconfigured to correct values
+     * after reset. restore them.
+     * other registers should not be reconfigured after reset
+     * if there is no reason
+     */
+    aer_restore_one_register(s, PCI_ERR_UNCOR_MASK);
+    aer_restore_one_register(s, PCI_ERR_UNCOR_SEVER);
+    aer_restore_one_register(s, PCI_ERR_COR_MASK);
+    aer_restore_one_register(s, PCI_ERR_CAP);
+}
+
+/* capability structure register group size functions */
+
+static uint8_t pt_reg_grp_size_init(XenPCIPassthroughState *s,
+                                    const XenPTRegGroupInfo *grp_reg,
+                                    uint32_t base_offset)
+{
+    return grp_reg->grp_size;
+}
+/* get Power Management Capability Structure register group size */
+static uint8_t pt_pm_size_init(XenPCIPassthroughState *s,
+                               const XenPTRegGroupInfo *grp_reg,
+                               uint32_t base_offset)
+{
+    if (!s->power_mgmt) {
+        return grp_reg->grp_size;
+    }
+
+    s->pm_state = g_malloc0(sizeof (XenPTPM));
+
+    /* set Power Management Capability base offset */
+    s->pm_state->pm_base = base_offset;
+
+    /* find AER register and set AER Capability base offset */
+    s->pm_state->aer_base = host_pci_find_ext_cap_offset(s->real_device,
+                                                         PCI_EXT_CAP_ID_ERR);
+
+    /* save AER register */
+    if (s->pm_state->aer_base) {
+        pt_aer_reg_save(s);
+    }
+
+    return grp_reg->grp_size;
+}
+/* get Vendor Specific Capability Structure register group size */
+static uint8_t pt_vendor_size_init(XenPCIPassthroughState *s,
+                                   const XenPTRegGroupInfo *grp_reg,
+                                   uint32_t base_offset)
+{
+    return pci_get_byte(s->dev.config + base_offset + 0x02);
+}
+/* get PCI Express Capability Structure register group size */
+static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
+                                 const XenPTRegGroupInfo *grp_reg,
+                                 uint32_t base_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t exp_flag = 0;
+    uint16_t type = 0;
+    uint16_t version = 0;
+    uint8_t pcie_size = 0;
+
+    exp_flag = pci_get_word(d->config + base_offset + PCI_EXP_FLAGS);
+    type = (exp_flag & PCI_EXP_FLAGS_TYPE) >> 4;
+    version = exp_flag & PCI_EXP_FLAGS_VERS;
+
+    /* calculate size depending on capability version and device/port type */
+    /* in case of PCI Express Base Specification Rev 1.x */
+    if (version == 1) {
+        /* The PCI Express Capabilities, Device Capabilities, and Device
+         * Status/Control registers are required for all PCI Express devices.
+         * The Link Capabilities and Link Status/Control are required for all
+         * Endpoints that are not Root Complex Integrated Endpoints. Endpoints
+         * are not required to implement registers other than those listed
+         * above and terminate the capability structure.
+         */
+        switch (type) {
+        case PCI_EXP_TYPE_ENDPOINT:
+        case PCI_EXP_TYPE_LEG_END:
+            pcie_size = 0x14;
+            break;
+        case PCI_EXP_TYPE_RC_END:
+            /* has no link */
+            pcie_size = 0x0C;
+            break;
+        /* only EndPoint passthrough is supported */
+        case PCI_EXP_TYPE_ROOT_PORT:
+        case PCI_EXP_TYPE_UPSTREAM:
+        case PCI_EXP_TYPE_DOWNSTREAM:
+        case PCI_EXP_TYPE_PCI_BRIDGE:
+        case PCI_EXP_TYPE_PCIE_BRIDGE:
+        case PCI_EXP_TYPE_RC_EC:
+        default:
+            hw_error("Internal error: Unsupported device/port type[%d]. "
+                     "I/O emulator exit.\n", type);
+        }
+    }
+    /* in case of PCI Express Base Specification Rev 2.0 */
+    else if (version == 2) {
+        switch (type) {
+        case PCI_EXP_TYPE_ENDPOINT:
+        case PCI_EXP_TYPE_LEG_END:
+        case PCI_EXP_TYPE_RC_END:
+            /* For Functions that do not implement the registers,
+             * these spaces must be hardwired to 0b.
+             */
+            pcie_size = 0x3C;
+            break;
+        /* only EndPoint passthrough is supported */
+        case PCI_EXP_TYPE_ROOT_PORT:
+        case PCI_EXP_TYPE_UPSTREAM:
+        case PCI_EXP_TYPE_DOWNSTREAM:
+        case PCI_EXP_TYPE_PCI_BRIDGE:
+        case PCI_EXP_TYPE_PCIE_BRIDGE:
+        case PCI_EXP_TYPE_RC_EC:
+        default:
+            hw_error("Internal error: Unsupported device/port type[%d]. "
+                     "I/O emulator exit.\n", type);
+        }
+    } else {
+        hw_error("Internal error: Unsupported capability version[%d]. "
+                 "I/O emulator exit.\n", version);
+    }
+
+    return pcie_size;
+}
+
+static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
+    /* Header Type0 reg group */
+    {
+        .grp_id      = 0xFF,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0x40,
+        .size_init   = pt_reg_grp_size_init,
+        .emu_reg_tbl = pt_emu_reg_header0_tbl,
+    },
+    /* PCI PowerManagement Capability reg group */
+    {
+        .grp_id      = PCI_CAP_ID_PM,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = PCI_PM_SIZEOF,
+        .size_init   = pt_pm_size_init,
+        .emu_reg_tbl = pt_emu_reg_pm_tbl,
+    },
+    /* AGP Capability Structure reg group */
+    {
+        .grp_id     = PCI_CAP_ID_AGP,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x30,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* Vital Product Data Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_VPD,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0x08,
+        .size_init   = pt_reg_grp_size_init,
+        .emu_reg_tbl = pt_emu_reg_vpd_tbl,
+    },
+    /* Slot Identification reg group */
+    {
+        .grp_id     = PCI_CAP_ID_SLOTID,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x04,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* PCI-X Capabilities List Item reg group */
+    {
+        .grp_id     = PCI_CAP_ID_PCIX,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x18,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* Vendor Specific Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_VNDR,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0xFF,
+        .size_init   = pt_vendor_size_init,
+        .emu_reg_tbl = pt_emu_reg_vendor_tbl,
+    },
+    /* SHPC Capability List Item reg group */
+    {
+        .grp_id     = PCI_CAP_ID_SHPC,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x08,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* Subsystem ID and Subsystem Vendor ID Capability List Item reg group */
+    {
+        .grp_id     = PCI_CAP_ID_SSVID,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x08,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* AGP 8x Capability Structure reg group */
+    {
+        .grp_id     = PCI_CAP_ID_AGP3,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x30,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* PCI Express Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_EXP,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0xFF,
+        .size_init   = pt_pcie_size_init,
+        .emu_reg_tbl = pt_emu_reg_pcie_tbl,
+    },
+    {
+        .grp_size = 0,
+    },
+};
+
+/* initialize Capabilities Pointer or Next Pointer register */
+static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s,
+                                XenPTRegInfo *reg, uint32_t real_offset)
+{
+    /* uint32_t reg_field = (uint32_t)s->dev.config[real_offset]; */
+    uint32_t reg_field = pci_get_byte(s->dev.config + real_offset);
+    int i;
+
+    /* find capability offset */
+    while (reg_field) {
+        for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
+            if (pt_hide_dev_cap(s->real_device,
+                                pt_emu_reg_grp_tbl[i].grp_id)) {
+                continue;
+            }
+            if (pt_emu_reg_grp_tbl[i].grp_id == s->dev.config[reg_field]) {
+                if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
+                    goto out;
+                }
+                /* skip a capability hardwired to 0, try the next one */
+                break;
+            }
+        }
+        /* next capability */
+        /* reg_field = (uint32_t)s->dev.config[reg_field + 1]; */
+        reg_field = pci_get_byte(s->dev.config + reg_field + 1);
+    }
+
+out:
+    return reg_field;
+}
+
+
+/*************
+ * Main
+ */
+
+/* restore part of the I/O device registers */
+static void pt_config_restore(XenPCIPassthroughState *s)
+{
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegInfo *reg = NULL;
+    uint32_t real_offset = 0;
+    uint32_t read_val = 0;
+    uint32_t val = 0;
+    int ret = 0;
+
+    /* walk every emulated register group entry */
+    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
+        /* walk every emulated register entry */
+        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
+            reg = reg_entry->reg;
+
+            /* check whether restoring is needed */
+            if (!reg->u.b.restore) {
+                continue;
+            }
+
+            real_offset = reg_grp_entry->base_offset + reg->offset;
+
+            /* read I/O device register value */
+            ret = host_pci_get_block(s->real_device, real_offset,
+                                     (uint8_t *)&read_val, reg->size);
+
+            if (!ret) {
+                PT_LOG("Error: host_pci_get_block failed. "
+                       "return value[%d].\n", ret);
+                memset(&read_val, 0xff, reg->size);
+            }
+
+            val = 0;
+
+            /* restore based on register size */
+            switch (reg->size) {
+            case 1:
+                /* byte register */
+                ret = reg->u.b.restore(s, reg_entry, real_offset,
+                                       (uint8_t)read_val, (uint8_t *)&val);
+                break;
+            case 2:
+                /* word register */
+                ret = reg->u.w.restore(s, reg_entry, real_offset,
+                                       (uint16_t)read_val, (uint16_t *)&val);
+                break;
+            case 4:
+                /* double word register */
+                ret = reg->u.dw.restore(s, reg_entry, real_offset,
+                                        (uint32_t)read_val, (uint32_t *)&val);
+                break;
+            }
+
+            /* restoring error */
+            if (ret < 0) {
+                hw_error("Internal error: Invalid restoring "
+                         "return value[%d]. I/O emulator exit.\n", ret);
+            }
+
+            PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
+                          pci_bus_num(s->dev.bus), PCI_SLOT(s->dev.devfn),
+                          PCI_FUNC(s->dev.devfn),
+                          real_offset, val, reg->size);
+
+            ret = host_pci_set_block(s->real_device, real_offset,
+                                     (uint8_t *)&val, reg->size);
+
+            if (!ret) {
+                PT_LOG("Error: host_pci_set_block failed. "
+                       "return value[%d].\n", ret);
+            }
+        }
+    }
+
+    /* if AER supported, restore it */
+    if (s->pm_state->aer_base) {
+        pt_aer_reg_restore(s);
+    }
+}
+/* reinitialize all emulated registers */
+static void pt_config_reinit(XenPCIPassthroughState *s)
+{
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegInfo *reg = NULL;
+
+    /* walk every emulated register group entry */
+    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
+        /* walk every emulated register entry */
+        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
+            reg = reg_entry->reg;
+            if (reg->init) {
+                /* initialize emulated register */
+                reg_entry->data =
+                    reg->init(s, reg_entry->reg,
+                              reg_grp_entry->base_offset + reg->offset);
+            }
+        }
+    }
+}
+
+static int pt_init_pci_config(XenPCIPassthroughState *s)
+{
+    PCIDevice *d = &s->dev;
+    int ret = 0;
+
+    PT_LOG("Reinitialize PCI configuration registers due to power state"
+           " transition with internal reset. [%02x:%02x.%x]\n",
+           pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));
+
+    /* restore part of the I/O device registers */
+    pt_config_restore(s);
+
+    /* reinitialize all emulated registers */
+    pt_config_reinit(s);
+
+    /* rebind machine_irq to device */
+    if (s->machine_irq != 0) {
+        uint8_t e_device = PCI_SLOT(s->dev.devfn);
+        uint8_t e_intx = pci_intx(s);
+
+        ret = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq, 0,
+                                        e_device, e_intx);
+        if (ret < 0) {
+            PT_LOG("Error: Rebinding of interrupt failed! ret=%d\n", ret);
+        }
+    }
+
+    return ret;
+}
+
+static uint8_t find_cap_offset(XenPCIPassthroughState *s, uint8_t cap)
+{
+    int id;
+    int max_cap = 48;
+    int pos = PCI_CAPABILITY_LIST;
+    int status;
+
+    status = host_pci_get_byte(s->real_device, PCI_STATUS);
+    if ((status & PCI_STATUS_CAP_LIST) == 0) {
+        return 0;
+    }
+
+    while (max_cap--) {
+        pos = host_pci_get_byte(s->real_device, pos);
+        if (pos < 0x40) {
+            break;
+        }
+
+        pos &= ~3;
+        id = host_pci_get_byte(s->real_device, pos + PCI_CAP_LIST_ID);
+
+        if (id == 0xff) {
+            break;
+        }
+        if (id == cap) {
+            return pos;
+        }
+
+        pos += PCI_CAP_LIST_NEXT;
+    }
+    return 0;
+}
+
+static void pt_config_reg_init(XenPCIPassthroughState *s,
+                               XenPTRegGroup *reg_grp, XenPTRegInfo *reg)
+{
+    XenPTReg *reg_entry;
+    uint32_t data = 0;
+
+    reg_entry = g_malloc0(sizeof (XenPTReg));
+
+    reg_entry->reg = reg;
+    reg_entry->data = 0;
+
+    if (reg->init) {
+        /* initialize emulated register */
+        data = reg->init(s, reg_entry->reg,
+                         reg_grp->base_offset + reg->offset);
+        if (data == PT_INVALID_REG) {
+            /* free unused register entry (allocated with g_malloc0) */
+            g_free(reg_entry);
+            return;
+        }
+        /* set register value */
+        reg_entry->data = data;
+    }
+    /* list add register entry */
+    QLIST_INSERT_HEAD(&reg_grp->reg_tbl_list, reg_entry, entries);
+
+    return;
+}
+
+void pt_config_init(XenPCIPassthroughState *s)
+{
+    XenPTRegGroup *reg_grp_entry = NULL;
+    uint32_t reg_grp_offset = 0;
+    XenPTRegInfo *reg_tbl = NULL;
+    int i, j;
+
+    QLIST_INIT(&s->reg_grp_tbl);
+
+    for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
+        if (pt_emu_reg_grp_tbl[i].grp_id != 0xFF) {
+            if (pt_hide_dev_cap(s->real_device,
+                                pt_emu_reg_grp_tbl[i].grp_id)) {
+                continue;
+            }
+
+            reg_grp_offset = find_cap_offset(s, pt_emu_reg_grp_tbl[i].grp_id);
+
+            if (!reg_grp_offset) {
+                continue;
+            }
+        }
+
+        reg_grp_entry = g_malloc0(sizeof (XenPTRegGroup));
+        QLIST_INIT(&reg_grp_entry->reg_tbl_list);
+        QLIST_INSERT_HEAD(&s->reg_grp_tbl, reg_grp_entry, entries);
+
+        reg_grp_entry->base_offset = reg_grp_offset;
+        reg_grp_entry->reg_grp = pt_emu_reg_grp_tbl + i;
+        if (pt_emu_reg_grp_tbl[i].size_init) {
+            /* get register group size */
+            reg_grp_entry->size =
+                pt_emu_reg_grp_tbl[i].size_init(s, reg_grp_entry->reg_grp,
+                                                reg_grp_offset);
+        }
+
+        if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
+            if (pt_emu_reg_grp_tbl[i].emu_reg_tbl) {
+                reg_tbl = pt_emu_reg_grp_tbl[i].emu_reg_tbl;
+                /* walk the table up to its size == 0 terminator */
+                for (j = 0; reg_tbl->size != 0; j++, reg_tbl++) {
+                    /* initialize capability register */
+                    pt_config_reg_init(s, reg_grp_entry, reg_tbl);
+                }
+            }
+        }
+        reg_grp_offset = 0;
+    }
+
+    return;
+}
+
+/* delete all emulated registers */
+void pt_config_delete(XenPCIPassthroughState *s)
+{
+    struct XenPTRegGroup *reg_group, *next_grp;
+    struct XenPTReg *reg, *next_reg;
+
+    /* free Power Management info table */
+    if (s->pm_state) {
+        if (s->pm_state->pm_timer) {
+            qemu_del_timer(s->pm_state->pm_timer);
+            qemu_free_timer(s->pm_state->pm_timer);
+            s->pm_state->pm_timer = NULL;
+        }
+
+        g_free(s->pm_state);
+    }
+
+    /* free all register group entries */
+    QLIST_FOREACH_SAFE(reg_group, &s->reg_grp_tbl, entries, next_grp) {
+        /* free all register entries */
+        QLIST_FOREACH_SAFE(reg, &reg_group->reg_tbl_list, entries, next_reg) {
+            QLIST_REMOVE(reg, entries);
+            g_free(reg);
+        }
+
+        QLIST_REMOVE(reg_group, entries);
+        g_free(reg_group);
+    }
+}
-- 
Anthony PERARD

* [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
@ 2011-10-28 15:07   ` Anthony PERARD
From: Anthony PERARD @ 2011-10-28 15:07 UTC
  To: QEMU-devel, Stefano Stabellini
  Cc: Anthony PERARD, Guy Zana, Xen Devel, Allen Kay

From: Allen Kay <allen.m.kay@intel.com>

Signed-off-by: Allen Kay <allen.m.kay@intel.com>
Signed-off-by: Guy Zana <guy@neocleus.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
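
Note on the mask handling in the generic read/write helpers: the sketch
below restates how PT_MERGE_VALUE combines the emulated copy with the
value of the real device, as pt_word_reg_read and pt_word_reg_write do
in this patch. DemoRegInfo, the demo_* functions and the mask values are
made up for illustration; only the merge expression and the mask
arithmetic are taken from the code.

```c
#include <assert.h>
#include <stdint.h>

/* Same expression as PT_MERGE_VALUE in the patch: bits set in val_mask are
 * taken from value, every other bit is taken from data. */
#define MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

/* Hypothetical stand-in for the two mask fields of XenPTRegInfo. */
typedef struct {
    uint16_t emu_mask;  /* bits kept in the emulated copy */
    uint16_t ro_mask;   /* bits the guest may not change */
} DemoRegInfo;

/* Read path: emulated bits come from emu_data, the rest from the device. */
static uint16_t demo_word_read(const DemoRegInfo *reg, uint16_t emu_data,
                               uint16_t dev_value, uint16_t valid_mask)
{
    uint16_t valid_emu_mask = reg->emu_mask & valid_mask;
    uint16_t pass_mask = (uint16_t)~valid_emu_mask;

    return MERGE_VALUE(dev_value, emu_data, pass_mask);
}

/* Write path: update the emulated copy, return the value for the device. */
static uint16_t demo_word_write(const DemoRegInfo *reg, uint16_t *emu_data,
                                uint16_t guest_value, uint16_t dev_value,
                                uint16_t valid_mask)
{
    uint16_t writable_mask =
        reg->emu_mask & (uint16_t)~reg->ro_mask & valid_mask;
    uint16_t throughable_mask = (uint16_t)~reg->emu_mask & valid_mask;

    /* only emulated, non read-only bits of the guest value are stored */
    *emu_data = MERGE_VALUE(guest_value, *emu_data, writable_mask);
    /* non-emulated bits go through, emulated bits keep the device value */
    return MERGE_VALUE(guest_value, dev_value, throughable_mask);
}

/* Worked example with illustrative masks (not taken from any table). */
static void merge_demo_selfcheck(void)
{
    DemoRegInfo reg = { .emu_mask = 0x00F0, .ro_mask = 0x0030 };
    uint16_t emu_data = 0x0050;
    uint16_t to_dev;

    to_dev = demo_word_write(&reg, &emu_data, 0xABCD, 0x1234, 0xFFFF);
    assert(emu_data == 0x00D0);  /* bits 7:6 updated, bits 5:4 read-only */
    assert(to_dev == 0xAB3D);    /* bits 7:4 keep the device value */
    assert(demo_word_read(&reg, emu_data, 0x1234, 0xFFFF) == 0x12D4);
}
```

Calling merge_demo_selfcheck() from a small main() exercises both paths.
Note that pt_cmd_reg_write deliberately differs from this generic form:
it does not apply emu_mask when computing writable_mask.
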
 Makefile.target                      |    1 +
 hw/xen_pci_passthrough.h             |    2 +
 hw/xen_pci_passthrough_config_init.c | 2068 ++++++++++++++++++++++++++++++++++
 3 files changed, 2071 insertions(+), 0 deletions(-)
 create mode 100644 hw/xen_pci_passthrough_config_init.c
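
Note on BAR sizing: pt_bar_reg_write skips the remap when the guest
writes all 1s to a BAR to discover its size. The sketch below shows why
the stored value then equals writable_mask. It assumes a 32-bit memory
BAR, a power-of-two size of at least 16 bytes, and valid_mask of all 1s;
r_size stands for the value returned by pt_get_emul_size, whose
definition is not part of this patch. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BAR_MEM_RO_MASK  0x0000000Fu  /* value of PT_BAR_MEM_RO_MASK */
#define DEMO_BAR_MEM_EMU_MASK 0xFFFFFFF0u  /* value of PT_BAR_MEM_EMU_MASK */

/* Update the emulated BAR the way the PT_BAR_FLAG_MEM case does.
 * Returns 1 when the stored value equals writable_mask, i.e. when the
 * write looks like a sizing probe and the region must not be remapped. */
static int demo_mem_bar_write(uint32_t *emu_data, uint32_t guest_value,
                              uint32_t r_size)
{
    uint32_t ro_mask = DEMO_BAR_MEM_RO_MASK | (r_size - 1);
    uint32_t writable_mask = DEMO_BAR_MEM_EMU_MASK & ~ro_mask;

    *emu_data = (guest_value & writable_mask) | (*emu_data & ~writable_mask);
    return *emu_data == writable_mask;
}

/* Size computation a guest typically performs on the value read back. */
static uint32_t demo_bar_decode_size(uint32_t readback)
{
    return ~(readback & ~0xFu) + 1u;
}

static void bar_demo_selfcheck(void)
{
    uint32_t emu_data = 0;

    /* sizing probe on a 4 KiB BAR: only the address bits above 4 KiB stick */
    assert(demo_mem_bar_write(&emu_data, 0xFFFFFFFFu, 0x1000) == 1);
    assert(emu_data == 0xFFFFF000u);
    assert(demo_bar_decode_size(emu_data) == 0x1000u);

    /* a real base address is stored and is not mistaken for a probe */
    assert(demo_mem_bar_write(&emu_data, 0xFEB00000u, 0x1000) == 0);
    assert(emu_data == 0xFEB00000u);
}
```

The read-only low bits derived from the size are what make the probe
read back as a size mask, so the comparison against writable_mask is
enough to tell a probe from a real base address in this simplified case.
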

diff --git a/Makefile.target b/Makefile.target
index 36ea47d..c32c688 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -219,6 +219,7 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
 
 # Inter-VM PCI shared memory
 CONFIG_IVSHMEM =
diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
index 2d1979d..ebc04fd 100644
--- a/hw/xen_pci_passthrough.h
+++ b/hw/xen_pci_passthrough.h
@@ -61,6 +61,8 @@ typedef int (*conf_byte_restore)
 /* power state transition */
 #define PT_FLAG_TRANSITING 0x0001
 
+#define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
+
 
 typedef enum {
     GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
new file mode 100644
index 0000000..4103b59
--- /dev/null
+++ b/hw/xen_pci_passthrough_config_init.c
@@ -0,0 +1,2068 @@
+/*
+ * Copyright (c) 2007, Neocleus Corporation.
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Alex Novik <alex@neocleus.com>
+ * Allen Kay <allen.m.kay@intel.com>
+ * Guy Zana <guy@neocleus.com>
+ *
+ * This file implements direct PCI assignment to an HVM guest
+ */
+
+#include "qemu-timer.h"
+#include "xen_backend.h"
+#include "xen_pci_passthrough.h"
+
+#define PT_MERGE_VALUE(value, data, val_mask) \
+    (((value) & (val_mask)) | ((data) & ~(val_mask)))
+
+#define PT_INVALID_REG          0xFFFFFFFF      /* invalid register value */
+
+/* prototype */
+
+static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
+                                uint32_t real_offset);
+static int pt_init_pci_config(XenPCIPassthroughState *s);
+
+
+/* helper */
+
+/* A return value of 1 means the capability should NOT be exposed to guest. */
+static int pt_hide_dev_cap(const HostPCIDevice *d, uint8_t grp_id)
+{
+    switch (grp_id) {
+    case PCI_CAP_ID_EXP:
+        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
+         * Controller looks trivial, e.g., the PCI Express Capabilities
+         * Register is 0. We should not try to expose it to guest.
+         */
+        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
+                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
+            return 1;
+        }
+        break;
+    }
+    return 0;
+}
+
+/* find emulated register group entry */
+XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
+{
+    XenPTRegGroup *entry = NULL;
+
+    /* find register group entry */
+    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
+        /* check address */
+        if ((entry->base_offset <= address)
+            && ((entry->base_offset + entry->size) > address)) {
+            return entry;
+        }
+    }
+
+    /* group entry not found */
+    return NULL;
+}
+
+/* find emulated register entry */
+XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
+{
+    XenPTReg *reg_entry = NULL;
+    XenPTRegInfo *reg = NULL;
+    uint32_t real_offset = 0;
+
+    /* find register entry */
+    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
+        reg = reg_entry->reg;
+        real_offset = reg_grp->base_offset + reg->offset;
+        /* check address */
+        if ((real_offset <= address)
+            && ((real_offset + reg->size) > address)) {
+            return reg_entry;
+        }
+    }
+
+    return NULL;
+}
+
+/* parse BAR */
+static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
+{
+    PCIDevice *d = &s->dev;
+    XenPTRegion *region = NULL;
+    PCIIORegion *r;
+    int index = 0;
+
+    /* check 64bit BAR */
+    index = pt_bar_offset_to_index(reg->offset);
+    if ((0 < index) && (index < PCI_ROM_SLOT)) {
+        int flags = s->real_device->io_regions[index - 1].flags;
+
+        if ((flags & IORESOURCE_MEM) && (flags & IORESOURCE_MEM_64)) {
+            region = &s->bases[index - 1];
+            if (region->bar_flag != PT_BAR_FLAG_UPPER) {
+                return PT_BAR_FLAG_UPPER;
+            }
+        }
+    }
+
+    /* check unused BAR */
+    r = &d->io_regions[index];
+    if (r->size == 0) {
+        return PT_BAR_FLAG_UNUSED;
+    }
+
+    /* for ExpROM BAR */
+    if (index == PCI_ROM_SLOT) {
+        return PT_BAR_FLAG_MEM;
+    }
+
+    /* check BAR I/O indicator */
+    if (s->real_device->io_regions[index].flags & IORESOURCE_IO) {
+        return PT_BAR_FLAG_IO;
+    } else {
+        return PT_BAR_FLAG_MEM;
+    }
+}
+
+
+/****************
+ * general register functions
+ */
+
+/* register initialization function */
+
+static uint32_t pt_common_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    return reg->init_val;
+}
+
+/* Read register functions */
+
+static int pt_byte_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint8_t *value, uint8_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint8_t valid_emu_mask = 0;
+
+    /* emulate byte register */
+    valid_emu_mask = reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+static int pt_word_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint16_t *value, uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t valid_emu_mask = 0;
+
+    /* emulate word register */
+    valid_emu_mask = reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+static int pt_long_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint32_t *value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t valid_emu_mask = 0;
+
+    /* emulate long register */
+    valid_emu_mask = reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+
+/* Write register functions */
+
+static int pt_byte_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                             uint8_t *value, uint8_t dev_value,
+                             uint8_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint8_t writable_mask = 0;
+    uint8_t throughable_mask = 0;
+
+    /* modify emulated register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    return 0;
+}
+static int pt_word_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                             uint16_t *value, uint16_t dev_value,
+                             uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+
+    /* modify emulated register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    return 0;
+}
+static int pt_long_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                             uint32_t *value, uint32_t dev_value,
+                             uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+
+    /* modify emulated register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    return 0;
+}
+
+/* common restore register functions */
+static int pt_byte_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                               uint32_t real_offset, uint8_t dev_value,
+                               uint8_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    PCIDevice *d = &s->dev;
+
+    /* use I/O device register's value as restore value */
+    *value = pci_get_byte(d->config + real_offset);
+
+    /* create value for restoring to I/O device register */
+    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
+
+    return 0;
+}
+static int pt_word_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                               uint32_t real_offset, uint16_t dev_value,
+                               uint16_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    PCIDevice *d = &s->dev;
+
+    /* use I/O device register's value as restore value */
+    *value = pci_get_word(d->config + real_offset);
+
+    /* create value for restoring to I/O device register */
+    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
+
+    return 0;
+}
+
+
+/* XenPTRegInfo declaration
+ * - only for emulated registers (either some bits or the whole register).
+ * - for passthrough registers that need special behavior (like interacting
+ *   with other components), set emu_mask to all 0 and specify r/w funcs.
+ * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
+ */
+
+/********************
+ * Header Type0
+ */
+
+static uint32_t pt_vendor_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    return s->real_device->vendor_id;
+}
+static uint32_t pt_device_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    return s->real_device->device_id;
+}
+static uint32_t pt_status_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    int reg_field = 0;
+
+    /* find Header register group */
+    reg_grp_entry = pt_find_reg_grp(s, PCI_CAPABILITY_LIST);
+    if (reg_grp_entry) {
+        /* find Capabilities Pointer register */
+        reg_entry = pt_find_reg(reg_grp_entry, PCI_CAPABILITY_LIST);
+        if (reg_entry) {
+            /* check Capabilities Pointer register */
+            if (reg_entry->data) {
+                reg_field |= PCI_STATUS_CAP_LIST;
+            } else {
+                reg_field &= ~PCI_STATUS_CAP_LIST;
+            }
+        } else {
+            hw_error("Internal error: Couldn't find pt_reg_tbl for "
+                     "Capabilities Pointer register. I/O emulator exit.\n");
+        }
+    } else {
+        hw_error("Internal error: Couldn't find pt_reg_grp_tbl for Header. "
+                 "I/O emulator exit.\n");
+    }
+
+    return reg_field;
+}
+static uint32_t pt_header_type_reg_init(XenPCIPassthroughState *s,
+                                        XenPTRegInfo *reg,
+                                        uint32_t real_offset)
+{
+    /* set the multi-function bit (bit 7) of PCI_HEADER_TYPE */
+    return reg->init_val | 0x80;
+}
+
+/* initialize Interrupt Pin register */
+static uint32_t pt_irqpin_reg_init(XenPCIPassthroughState *s,
+                                   XenPTRegInfo *reg, uint32_t real_offset)
+{
+    return pci_read_intx(s);
+}
+
+/* Command register */
+static int pt_cmd_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                           uint16_t *value, uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t valid_emu_mask = 0;
+    uint16_t emu_mask = reg->emu_mask;
+
+    if (s->is_virtfn) {
+        emu_mask |= PCI_COMMAND_MEMORY;
+    }
+
+    /* emulate word register */
+    valid_emu_mask = emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint16_t *value, uint16_t dev_value,
+                            uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+    uint16_t wr_value = *value;
+    uint16_t emu_mask = reg->emu_mask;
+
+    if (s->is_virtfn) {
+        emu_mask |= PCI_COMMAND_MEMORY;
+    }
+
+    /* modify emulated register */
+    writable_mask = ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~emu_mask & valid_mask;
+
+    if (*value & PCI_COMMAND_INTX_DISABLE) {
+        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
+    } else {
+        if (s->machine_irq) {
+            throughable_mask |= PCI_COMMAND_INTX_DISABLE;
+        }
+    }
+
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* mapping BAR */
+    pt_bar_mapping(s, wr_value & PCI_COMMAND_IO,
+                   wr_value & PCI_COMMAND_MEMORY);
+
+    return 0;
+}
+static int pt_cmd_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                              uint32_t real_offset, uint16_t dev_value,
+                              uint16_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    PCIDevice *d = &s->dev;
+    uint16_t restorable_mask = 0;
+
+    /* use I/O device register's value as restore value */
+    *value = pci_get_word(d->config + real_offset);
+
+    /* create value for restoring to I/O device register
+     * but do not include Fast Back-to-Back Enable bit.
+     */
+    restorable_mask = reg->emu_mask & ~PCI_COMMAND_FAST_BACK;
+    *value = PT_MERGE_VALUE(*value, dev_value, restorable_mask);
+
+    if (!s->machine_irq) {
+        *value |= PCI_COMMAND_INTX_DISABLE;
+    } else {
+        *value &= ~PCI_COMMAND_INTX_DISABLE;
+    }
+
+    return 0;
+}
+
+/* BAR */
+#define PT_BAR_MEM_RO_MASK      0x0000000F      /* BAR ReadOnly mask(Memory) */
+#define PT_BAR_MEM_EMU_MASK     0xFFFFFFF0      /* BAR emul mask(Memory) */
+#define PT_BAR_IO_RO_MASK       0x00000003      /* BAR ReadOnly mask(I/O) */
+#define PT_BAR_IO_EMU_MASK      0xFFFFFFFC      /* BAR emul mask(I/O) */
+
+static inline uint32_t base_address_with_flags(HostPCIIORegion *hr)
+{
+    if ((hr->flags & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
+        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_IO_MASK);
+    } else {
+        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_MEM_MASK);
+    }
+}
+
+static uint32_t pt_bar_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
+                                uint32_t real_offset)
+{
+    int reg_field = 0;
+    int index;
+
+    /* get BAR index */
+    index = pt_bar_offset_to_index(reg->offset);
+    if (index < 0) {
+        hw_error("Internal error: Invalid BAR index[%d]. "
+                 "I/O emulator exit.\n", index);
+    }
+
+    /* set initial guest physical base address to -1 */
+    s->bases[index].e_physbase = -1;
+
+    /* set BAR flag */
+    s->bases[index].bar_flag = pt_bar_reg_parse(s, reg);
+    if (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED) {
+        reg_field = PT_INVALID_REG;
+    }
+
+    return reg_field;
+}
+static int pt_bar_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                           uint32_t *value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t valid_emu_mask = 0;
+    uint32_t bar_emu_mask = 0;
+    int index;
+
+    /* get BAR index */
+    index = pt_bar_offset_to_index(reg->offset);
+    if (index < 0) {
+        hw_error("Internal error: Invalid BAR index[%d]. "
+                 "I/O emulator exit.\n", index);
+    }
+
+    /* use fixed-up value from kernel sysfs */
+    *value = base_address_with_flags(&s->real_device->io_regions[index]);
+
+    /* set emulation mask depending on BAR flag */
+    switch (s->bases[index].bar_flag) {
+    case PT_BAR_FLAG_MEM:
+        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
+        break;
+    case PT_BAR_FLAG_IO:
+        bar_emu_mask = PT_BAR_IO_EMU_MASK;
+        break;
+    case PT_BAR_FLAG_UPPER:
+        bar_emu_mask = PT_BAR_ALLF;
+        break;
+    default:
+        break;
+    }
+
+    /* emulate BAR */
+    valid_emu_mask = bar_emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+static int pt_bar_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                            uint32_t *value, uint32_t dev_value,
+                            uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegion *base = NULL;
+    PCIDevice *d = &s->dev;
+    PCIIORegion *r;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+    uint32_t bar_emu_mask = 0;
+    uint32_t bar_ro_mask = 0;
+    uint32_t new_addr, last_addr;
+    uint32_t prev_offset;
+    uint32_t r_size = 0;
+    int index = 0;
+
+    /* get BAR index */
+    index = pt_bar_offset_to_index(reg->offset);
+    if (index < 0) {
+        hw_error("Internal error: Invalid BAR index[%d]. "
+                 "I/O emulator exit.\n", index);
+    }
+
+    r = &d->io_regions[index];
+    base = &s->bases[index];
+    r_size = pt_get_emul_size(base->bar_flag, r->size);
+
+    /* set emulation mask and read-only mask depending on BAR flag */
+    switch (s->bases[index].bar_flag) {
+    case PT_BAR_FLAG_MEM:
+        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
+        bar_ro_mask = PT_BAR_MEM_RO_MASK | (r_size - 1);
+        break;
+    case PT_BAR_FLAG_IO:
+        bar_emu_mask = PT_BAR_IO_EMU_MASK;
+        bar_ro_mask = PT_BAR_IO_RO_MASK | (r_size - 1);
+        break;
+    case PT_BAR_FLAG_UPPER:
+        bar_emu_mask = PT_BAR_ALLF;
+        bar_ro_mask = 0;    /* all upper 32bit are R/W */
+        break;
+    default:
+        break;
+    }
+
+    /* modify emulated register */
+    writable_mask = bar_emu_mask & ~bar_ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* check whether we need to update the virtual region address or not */
+    switch (s->bases[index].bar_flag) {
+    case PT_BAR_FLAG_MEM:
+        /* nothing to do */
+        break;
+    case PT_BAR_FLAG_IO:
+        new_addr = cfg_entry->data;
+        last_addr = new_addr + r_size - 1;
+        /* check invalid address */
+        if (last_addr <= new_addr || !new_addr || last_addr >= 0x10000) {
+            /* check 64K range */
+            if ((last_addr >= 0x10000) &&
+                (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask))) {
+                PT_LOG("Warning: Guest attempt to set Base Address "
+                       "over the 64KB. [%02x:%02x.%x][Offset:%02xh]"
+                       "[Address:%08xh][Size:%08xh]\n",
+                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                       PCI_FUNC(d->devfn),
+                       reg->offset, new_addr, r_size);
+            }
+            /* just remove mapping */
+            r->addr = -1;
+            goto exit;
+        }
+        break;
+    case PT_BAR_FLAG_UPPER:
+        if (cfg_entry->data) {
+            if (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask)) {
+                PT_LOG("Warning: Guest attempt to set high MMIO Base Address. "
+                       "Ignore mapping. "
+                       "[%02x:%02x.%x][Offset:%02xh][High Address:%08xh]\n",
+                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                       PCI_FUNC(d->devfn), reg->offset, cfg_entry->data);
+            }
+            /* clear lower address */
+            d->io_regions[index-1].addr = -1;
+        } else {
+            /* find lower 32bit BAR */
+            prev_offset = (reg->offset - 4);
+            reg_grp_entry = pt_find_reg_grp(s, prev_offset);
+            if (reg_grp_entry) {
+                reg_entry = pt_find_reg(reg_grp_entry, prev_offset);
+                if (reg_entry) {
+                    /* restore lower address */
+                    d->io_regions[index-1].addr = reg_entry->data;
+                } else {
+                    return -1;
+                }
+            } else {
+                return -1;
+            }
+        }
+
+        /* never map the 'empty' upper region,
+         * because mapping the lower region is enough.
+         */
+        r->addr = -1;
+        goto exit;
+    default:
+        break;
+    }
+
+    /* update the corresponding virtual region address */
+    /*
+     * When guest code tries to get block size of mmio, it will write all "1"s
+     * into pci bar register. In this case, cfg_entry->data == writable_mask.
+     * Especially for devices with large mmio, the value of writable_mask
+     * is likely to be a guest physical address that has been mapped to ram
+     * rather than mmio. Remapping this value to mmio should be prevented.
+     */
+
+    if (cfg_entry->data != writable_mask) {
+        r->addr = cfg_entry->data;
+    }
+
+exit:
+    /* create value for writing to I/O device register */
+    throughable_mask = ~bar_emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* After BAR reg update, we need to remap BAR */
+    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
+    if (reg_grp_entry) {
+        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
+        if (reg_entry) {
+            pt_bar_mapping_one(s, index, reg_entry->data & PCI_COMMAND_IO,
+                               reg_entry->data & PCI_COMMAND_MEMORY);
+        }
+    }
+
+    return 0;
+}
+static int pt_bar_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                              uint32_t real_offset, uint32_t dev_value,
+                              uint32_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t bar_emu_mask = 0;
+    int index = 0;
+
+    /* get BAR index */
+    index = pt_bar_offset_to_index(reg->offset);
+    if (index < 0) {
+        hw_error("Internal error: Invalid BAR index[%d]. "
+                 "I/O emulator exit.\n", index);
+    }
+
+    /* use value from kernel sysfs */
+    if (s->bases[index].bar_flag == PT_BAR_FLAG_UPPER) {
+        *value = s->real_device->io_regions[index - 1].base_addr >> 32;
+    } else {
+        *value = base_address_with_flags(&s->real_device->io_regions[index]);
+    }
+
+    /* set emulate mask depend on BAR flag */
+    switch (s->bases[index].bar_flag) {
+    case PT_BAR_FLAG_MEM:
+        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
+        break;
+    case PT_BAR_FLAG_IO:
+        bar_emu_mask = PT_BAR_IO_EMU_MASK;
+        break;
+    case PT_BAR_FLAG_UPPER:
+        bar_emu_mask = PT_BAR_ALLF;
+        break;
+    default:
+        break;
+    }
+
+    /* create value for restoring to I/O device register */
+    *value = PT_MERGE_VALUE(*value, dev_value, bar_emu_mask);
+
+    return 0;
+}
+
+/* write Exp ROM BAR */
+static int pt_exp_rom_bar_reg_write(XenPCIPassthroughState *s,
+                                    XenPTReg *cfg_entry, uint32_t *value,
+                                    uint32_t dev_value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegion *base = NULL;
+    PCIDevice *d = (PCIDevice *)&s->dev;
+    PCIIORegion *r;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+    pcibus_t r_size = 0;
+    uint32_t bar_emu_mask = 0;
+    uint32_t bar_ro_mask = 0;
+
+    r = &d->io_regions[PCI_ROM_SLOT];
+    r_size = r->size;
+    base = &s->bases[PCI_ROM_SLOT];
+    /* align memory type resource size */
+    r_size = pt_get_emul_size(base->bar_flag, r_size);
+
+    /* set emulation mask and read-only mask */
+    bar_emu_mask = reg->emu_mask;
+    bar_ro_mask = (reg->ro_mask | (r_size - 1)) & ~PCI_ROM_ADDRESS_ENABLE;
+
+    /* modify emulated register */
+    writable_mask = ~bar_ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* update the corresponding virtual region address */
+    /*
+     * When guest code tries to get block size of mmio, it will write all "1"s
+     * into pci bar register. In this case, cfg_entry->data == writable_mask.
+     * Especially for devices with large mmio, the value of writable_mask
+     * is likely to be a guest physical address that has been mapped to ram
+     * rather than mmio. Remapping this value to mmio should be prevented.
+     */
+
+    if (cfg_entry->data != writable_mask) {
+        r->addr = cfg_entry->data;
+    }
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~bar_emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* After BAR reg update, we need to remap BAR */
+    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
+    if (reg_grp_entry) {
+        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
+        if (reg_entry) {
+            pt_bar_mapping_one(s, PCI_ROM_SLOT,
+                               reg_entry->data & PCI_COMMAND_IO,
+                               reg_entry->data & PCI_COMMAND_MEMORY);
+        }
+    }
+
+    return 0;
+}
+/* restore ROM BAR */
+static int pt_exp_rom_bar_reg_restore(XenPCIPassthroughState *s,
+                                      XenPTReg *cfg_entry,
+                                      uint32_t real_offset,
+                                      uint32_t dev_value, uint32_t *value)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+
+    /* use value from kernel sysfs */
+    *value =
+        PT_MERGE_VALUE(host_pci_get_long(s->real_device, PCI_ROM_ADDRESS),
+                       dev_value, reg->emu_mask);
+    return 0;
+}
+
+/* Header Type0 reg static information table */
+static XenPTRegInfo pt_emu_reg_header0_tbl[] = {
+    /* Vendor ID reg */
+    {
+        .offset     = PCI_VENDOR_ID,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFFFF,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_vendor_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Device ID reg */
+    {
+        .offset     = PCI_DEVICE_ID,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFFFF,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_device_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Command reg */
+    {
+        .offset     = PCI_COMMAND,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xF880,
+        .emu_mask   = 0x0740,
+        .init       = pt_common_reg_init,
+        .u.w.read   = pt_cmd_reg_read,
+        .u.w.write  = pt_cmd_reg_write,
+        .u.w.restore  = pt_cmd_reg_restore,
+    },
+    /* Capabilities Pointer reg */
+    {
+        .offset     = PCI_CAPABILITY_LIST,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Status reg */
+    /* uses the emulated Cap Ptr value to initialize,
+     * so it needs to be declared after the Cap Ptr reg
+     */
+    {
+        .offset     = PCI_STATUS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x06FF,
+        .emu_mask   = 0x0010,
+        .init       = pt_status_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Cache Line Size reg */
+    {
+        .offset     = PCI_CACHE_LINE_SIZE,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0x00,
+        .emu_mask   = 0xFF,
+        .init       = pt_common_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = pt_byte_reg_restore,
+    },
+    /* Latency Timer reg */
+    {
+        .offset     = PCI_LATENCY_TIMER,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0x00,
+        .emu_mask   = 0xFF,
+        .init       = pt_common_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = pt_byte_reg_restore,
+    },
+    /* Header Type reg */
+    {
+        .offset     = PCI_HEADER_TYPE,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0x00,
+        .init       = pt_header_type_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Interrupt Line reg */
+    {
+        .offset     = PCI_INTERRUPT_LINE,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0x00,
+        .emu_mask   = 0xFF,
+        .init       = pt_common_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Interrupt Pin reg */
+    {
+        .offset     = PCI_INTERRUPT_PIN,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_irqpin_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* BAR 0 reg */
+    /* the BAR masks are decided later, depending on the IO/MEM type */
+    {
+        .offset     = PCI_BASE_ADDRESS_0,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 1 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_1,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 2 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_2,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 3 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_3,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 4 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_4,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* BAR 5 reg */
+    {
+        .offset     = PCI_BASE_ADDRESS_5,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_bar_reg_read,
+        .u.dw.write = pt_bar_reg_write,
+        .u.dw.restore = pt_bar_reg_restore,
+    },
+    /* Expansion ROM BAR reg */
+    {
+        .offset     = PCI_ROM_ADDRESS,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x000007FE,
+        .emu_mask   = 0xFFFFF800,
+        .init       = pt_bar_reg_init,
+        .u.dw.read  = pt_long_reg_read,
+        .u.dw.write = pt_exp_rom_bar_reg_write,
+        .u.dw.restore = pt_exp_rom_bar_reg_restore,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/*********************************
+ * Vital Product Data Capability
+ */
+
+/* Vital Product Data Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_vpd_tbl[] = {
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/**************************************
+ * Vendor Specific Capability
+ */
+
+/* Vendor Specific Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_vendor_tbl[] = {
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/*****************************
+ * PCI Express Capability
+ */
+
+/* initialize Link Control register */
+static uint32_t pt_linkctrl_reg_init(XenPCIPassthroughState *s,
+                                     XenPTRegInfo *reg, uint32_t real_offset)
+{
+    uint8_t cap_ver = 0;
+    uint8_t dev_type = 0;
+
+    /* TODO maybe better to use a function from hw/pcie.c */
+    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
+                           + PCI_EXP_FLAGS)
+        & PCI_EXP_FLAGS_VERS;
+    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
+                             + PCI_EXP_FLAGS)
+                & PCI_EXP_FLAGS_TYPE) >> 4;
+
+    /* no need to initialize in case of Root Complex Integrated Endpoint
+     * with cap_ver 1.x
+     */
+    if ((dev_type == PCI_EXP_TYPE_RC_END) && (cap_ver == 1)) {
+        return PT_INVALID_REG;
+    }
+
+    return reg->init_val;
+}
+/* initialize Device Control 2 register */
+static uint32_t pt_devctrl2_reg_init(XenPCIPassthroughState *s,
+                                     XenPTRegInfo *reg, uint32_t real_offset)
+{
+    uint8_t cap_ver = 0;
+
+    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
+                           + PCI_EXP_FLAGS)
+        & PCI_EXP_FLAGS_VERS;
+
+    /* no need to initialize in case of cap_ver 1.x */
+    if (cap_ver == 1) {
+        return PT_INVALID_REG;
+    }
+
+    return reg->init_val;
+}
+/* initialize Link Control 2 register */
+static uint32_t pt_linkctrl2_reg_init(XenPCIPassthroughState *s,
+                                      XenPTRegInfo *reg, uint32_t real_offset)
+{
+    int reg_field = 0;
+    uint8_t cap_ver = 0;
+
+    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
+                           + PCI_EXP_FLAGS)
+        & PCI_EXP_FLAGS_VERS;
+
+    /* no need to initialize in case of cap_ver 1.x */
+    if (cap_ver == 1) {
+        return PT_INVALID_REG;
+    }
+
+    /* set Supported Link Speed */
+    reg_field |= PCI_EXP_LNKCAP_SLS &
+        pci_get_byte(s->dev.config + real_offset - reg->offset
+                     + PCI_EXP_LNKCAP);
+
+    return reg_field;
+}
+
+/* PCI Express Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_pcie_tbl[] = {
+    /* Next Pointer reg */
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Device Capabilities reg */
+    {
+        .offset     = PCI_EXP_DEVCAP,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x1FFCFFFF,
+        .emu_mask   = 0x10000000,
+        .init       = pt_common_reg_init,
+        .u.dw.read  = pt_long_reg_read,
+        .u.dw.write = pt_long_reg_write,
+        .u.dw.restore = NULL,
+    },
+    /* Device Control reg */
+    {
+        .offset     = PCI_EXP_DEVCTL,
+        .size       = 2,
+        .init_val   = 0x2810,
+        .ro_mask    = 0x8400,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_common_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = pt_word_reg_restore,
+    },
+    /* Link Control reg */
+    {
+        .offset     = PCI_EXP_LNKCTL,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFC34,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_linkctrl_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = pt_word_reg_restore,
+    },
+    /* Device Control 2 reg */
+    {
+        .offset     = 0x28,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFFE0,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_devctrl2_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = pt_word_reg_restore,
+    },
+    /* Link Control 2 reg */
+    {
+        .offset     = 0x30,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xE040,
+        .emu_mask   = 0xFFFF,
+        .init       = pt_linkctrl2_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = pt_word_reg_restore,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/*********************************
+ * Power Management Capability
+ */
+
+/* initialize Power Management Capabilities register */
+static uint32_t pt_pmc_reg_init(XenPCIPassthroughState *s,
+                                XenPTRegInfo *reg, uint32_t real_offset)
+{
+    PCIDevice *d = &s->dev;
+
+    if (!s->power_mgmt) {
+        return reg->init_val;
+    }
+
+    /* set Power Management Capabilities register */
+    s->pm_state->pmc_field = pci_get_word(d->config + real_offset);
+
+    return reg->init_val;
+}
+/* initialize PCI Power Management Control/Status register */
+static uint32_t pt_pmcsr_reg_init(XenPCIPassthroughState *s,
+                                  XenPTRegInfo *reg, uint32_t real_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t cap_ver  = 0;
+
+    if (!s->power_mgmt) {
+        return reg->init_val;
+    }
+
+    /* check PCI Power Management support version */
+    cap_ver = s->pm_state->pmc_field & PCI_PM_CAP_VER_MASK;
+
+    if (cap_ver > 2) {
+        /* set No Soft Reset */
+        s->pm_state->no_soft_reset =
+            pci_get_byte(d->config + real_offset) & PCI_PM_CTRL_NO_SOFT_RESET;
+    }
+
+    /* wake up real physical device */
+    switch (host_pci_get_word(s->real_device, real_offset)
+            & PCI_PM_CTRL_STATE_MASK) {
+    case 0:
+        break;
+    case 1:
+        PT_LOG("Power state transition D1 -> D0active\n");
+        host_pci_set_word(s->real_device, real_offset, 0);
+        break;
+    case 2:
+        PT_LOG("Power state transition D2 -> D0active\n");
+        host_pci_set_word(s->real_device, real_offset, 0);
+        usleep(200);
+        break;
+    case 3:
+        PT_LOG("Power state transition D3hot -> D0active\n");
+        host_pci_set_word(s->real_device, real_offset, 0);
+        usleep(10 * 1000);
+        pt_init_pci_config(s);
+        break;
+    }
+
+    return reg->init_val;
+}
+/* read Power Management Control/Status register */
+static int pt_pmcsr_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                             uint16_t *value, uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t valid_emu_mask = reg->emu_mask;
+
+    if (!s->power_mgmt) {
+        valid_emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
+    }
+
+    valid_emu_mask = valid_emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
+
+    return 0;
+}
+/* reset interrupt and I/O resource mapping */
+static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
+{
+    PCIDevice *d = &s->dev;
+    PCIIORegion *r;
+    int i = 0;
+    uint8_t e_device = 0;
+    uint8_t e_intx = 0;
+
+    /* unbind INTx */
+    e_device = PCI_SLOT(s->dev.devfn);
+    e_intx = pci_intx(s);
+
+    if (s->machine_irq) {
+        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
+                                    PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
+            PT_LOG("Error: Unbinding of interrupt failed!\n");
+        }
+    }
+
+    /* clear all virtual region address */
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        r = &d->io_regions[i];
+        r->addr = -1;
+    }
+
+    /* unmapping BAR */
+    pt_bar_mapping(s, 0, 0);
+}
+/* check power state transition */
+static int check_power_state(XenPCIPassthroughState *s)
+{
+    XenPTPM *pm_state = s->pm_state;
+    PCIDevice *d = &s->dev;
+    uint16_t read_val = 0;
+    uint16_t cur_state = 0;
+
+    /* get current power state */
+    read_val = host_pci_get_word(s->real_device,
+                                 pm_state->pm_base + PCI_PM_CTRL);
+    cur_state = read_val & PCI_PM_CTRL_STATE_MASK;
+
+    if (pm_state->req_state != cur_state) {
+        PT_LOG("Error: Failed to change power state. "
+               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               pm_state->req_state, cur_state);
+        return -1;
+    }
+    return 0;
+}
+/* timer callback: complete a D3hot -> D0 transition with internal reset */
+static void pt_from_d3hot_to_d0_with_reset(void *opaque)
+{
+    XenPCIPassthroughState *s = opaque;
+    XenPTPM *pm_state = s->pm_state;
+    int ret = 0;
+
+    /* check power state */
+    ret = check_power_state(s);
+
+    if (ret < 0) {
+        goto out;
+    }
+
+    pt_init_pci_config(s);
+
+out:
+    /* power state transition flags off */
+    pm_state->flags &= ~PT_FLAG_TRANSITING;
+
+    qemu_free_timer(pm_state->pm_timer);
+    pm_state->pm_timer = NULL;
+}
+static void pt_default_power_transition(void *opaque)
+{
+    XenPCIPassthroughState *ptdev = opaque;
+    XenPTPM *pm_state = ptdev->pm_state;
+
+    /* check power state */
+    check_power_state(ptdev);
+
+    /* power state transition flags off */
+    pm_state->flags &= ~PT_FLAG_TRANSITING;
+
+    qemu_free_timer(pm_state->pm_timer);
+    pm_state->pm_timer = NULL;
+}
+/* write Power Management Control/Status register */
+static int pt_pmcsr_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                              uint16_t *value, uint16_t dev_value,
+                              uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    PCIDevice *d = &s->dev;
+    uint16_t emu_mask = reg->emu_mask;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+    XenPTPM *pm_state = s->pm_state;
+
+    if (!s->power_mgmt) {
+        emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
+    }
+
+    /* modify emulate register */
+    writable_mask = emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    if (!s->power_mgmt) {
+        return 0;
+    }
+
+    /* set I/O device power state */
+    pm_state->cur_state = dev_value & PCI_PM_CTRL_STATE_MASK;
+
+    /* set Guest requested PowerState */
+    pm_state->req_state = *value & PCI_PM_CTRL_STATE_MASK;
+
+    /* check whether this is a power state transition at all */
+    if (pm_state->cur_state == pm_state->req_state) {
+        /* no power state transition requested */
+        return 0;
+    }
+
+    /* check whether the requested transition is allowed */
+    if ((pm_state->req_state != 0) &&
+        (pm_state->cur_state > pm_state->req_state)) {
+        PT_LOG("Error: Invalid power transition. "
+               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               pm_state->req_state, pm_state->cur_state);
+
+        return 0;
+    }
+
+    /* check if this device supports the requested power state */
+    if (((pm_state->req_state == 1) && !(pm_state->pmc_field & PCI_PM_CAP_D1))
+        || ((pm_state->req_state == 2) &&
+            !(pm_state->pmc_field & PCI_PM_CAP_D2))) {
+        PT_LOG("Error: Invalid power transition. "
+               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
+               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
+               pm_state->req_state, pm_state->cur_state);
+
+        return 0;
+    }
+
+    /* A transition to or from D3hot requires a 10 ms wait.
+     * The register write itself is performed later, so only allocate and
+     * initialise the QEMUTimer here; do not start it yet.
+     */
+    if ((pm_state->cur_state == 3) || (pm_state->req_state == 3)) {
+        if (pm_state->req_state == 0) {
+            /* alloc and init QEMUTimer */
+            if (!pm_state->no_soft_reset) {
+                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
+                    pt_from_d3hot_to_d0_with_reset, s);
+
+                /* reset Interrupt and I/O resource mapping */
+                pt_reset_interrupt_and_io_mapping(s);
+            } else {
+                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
+                                        pt_default_power_transition, s);
+            }
+        } else {
+            /* alloc and init QEMUTimer */
+            pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
+                pt_default_power_transition, s);
+        }
+
+        /* set power state transition delay */
+        pm_state->pm_delay = 10;
+
+        /* power state transition flags on */
+        pm_state->flags |= PT_FLAG_TRANSITING;
+    }
+    /* Transitions among D0, D1 and D2 do not need a QEMUTimer.
+     * So, we perform the register write here and then read it back.
+     */
+    else {
+        /* write power state to I/O device register */
+        host_pci_set_word(s->real_device, pm_state->pm_base + PCI_PM_CTRL,
+                          *value);
+
+        /* A transition to or from D2 requires a 200 usec wait.
+         * QEMUTimer does not support microsecond units right now,
+         * so we wait here ourselves.
+         */
+        if ((pm_state->cur_state == 2) || (pm_state->req_state == 2)) {
+            usleep(200);
+        }
+
+        /* check power state */
+        check_power_state(s);
+
+        /* recreate value for writing to I/O device register */
+        *value = host_pci_get_word(s->real_device,
+                                   pm_state->pm_base + PCI_PM_CTRL);
+    }
+
+    return 0;
+}
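The two mask computations in pt_pmcsr_reg_write can be sketched in isolation as follows. This is only an illustration: MERGE_VALUE is an assumed stand-in for PT_MERGE_VALUE (whose definition is not part of this hunk), and emu_update/dev_update are hypothetical helper names, not functions from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed behaviour of the merge macro: take the bits of `value` where
 * `mask` is set and the bits of `data` everywhere else. */
#define MERGE_VALUE(value, data, mask) \
    (((value) & (mask)) | ((data) & ~(mask)))

/* Hypothetical helper: new emulated copy of a 16-bit register.
 * Only bits that are emulated, not read-only and covered by the access
 * are taken from the guest value. */
static uint16_t emu_update(uint16_t guest, uint16_t emu_data,
                           uint16_t emu_mask, uint16_t ro_mask,
                           uint16_t valid_mask)
{
    uint16_t writable_mask = emu_mask & ~ro_mask & valid_mask;

    return (uint16_t)MERGE_VALUE(guest, emu_data, writable_mask);
}

/* Hypothetical helper: value forwarded to the physical device.
 * Only non-emulated bits covered by the access come from the guest;
 * the rest keep what the device currently holds. */
static uint16_t dev_update(uint16_t guest, uint16_t dev_value,
                           uint16_t emu_mask, uint16_t valid_mask)
{
    uint16_t throughable_mask = ~emu_mask & valid_mask;

    return (uint16_t)MERGE_VALUE(guest, dev_value, throughable_mask);
}
```

With the PMCSR masks from the table below (emu_mask 0x8100), a guest write of 0x8103 over a device value of 0x0100 would forward 0x0103: the power state bits come from the guest, the PME bits stay as the device has them.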
+
+/* restore Power Management Control/Status register */
+static int pt_pmcsr_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                                uint32_t real_offset, uint16_t dev_value,
+                                uint16_t *value)
+{
+    /* create value for restoring to I/O device register
+     * No need to restore, just clear PME Enable and PME Status bit
+     * Note: register type of PME Status bit is RW1C, so clear by writing 1b
+     */
+    *value = (dev_value & ~PCI_PM_CTRL_PME_ENABLE) | PCI_PM_CTRL_PME_STATUS;
+
+    return 0;
+}
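The RW1C handling above can be tried out standalone. The sketch below assumes the usual pci_regs.h values for the two PME bits (0x0100 for PME Enable, 0x8000 for PME Status); the PM_CTRL_* names and pmcsr_restore_value are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions, mirroring PCI_PM_CTRL_PME_ENABLE and
 * PCI_PM_CTRL_PME_STATUS from pci_regs.h. */
#define PM_CTRL_PME_ENABLE 0x0100
#define PM_CTRL_PME_STATUS 0x8000

/* Hypothetical helper: build the value written back to the device.
 * PME Enable is cleared; PME Status is RW1C, so writing 1 clears it. */
static uint16_t pmcsr_restore_value(uint16_t dev_value)
{
    return (uint16_t)((dev_value & ~PM_CTRL_PME_ENABLE) | PM_CTRL_PME_STATUS);
}
```

All other bits, including the power state field, are written back unchanged.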
+
+
+/* Power Management Capability reg static infomation table */
+static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
+    /* Next Pointer reg */
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Power Management Capabilities reg */
+    {
+        .offset     = PCI_CAP_FLAGS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFFFF,
+        .emu_mask   = 0xF9C8,
+        .init       = pt_pmc_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_word_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* PCI Power Management Control/Status reg */
+    {
+        .offset     = PCI_PM_CTRL,
+        .size       = 2,
+        .init_val   = 0x0008,
+        .ro_mask    = 0xE1FC,
+        .emu_mask   = 0x8100,
+        .init       = pt_pmcsr_reg_init,
+        .u.w.read   = pt_pmcsr_reg_read,
+        .u.w.write  = pt_pmcsr_reg_write,
+        .u.w.restore  = pt_pmcsr_reg_restore,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/****************************
+ * Capabilities
+ */
+
+/* AER register operations */
+
+static void aer_save_one_register(XenPCIPassthroughState *s, int offset)
+{
+    PCIDevice *d = &s->dev;
+    uint32_t aer_base = s->pm_state->aer_base;
+    uint32_t val = 0;
+
+    val = host_pci_get_long(s->real_device, aer_base + offset);
+    pci_set_long(d->config + aer_base + offset, val);
+}
+static void pt_aer_reg_save(XenPCIPassthroughState *s)
+{
+    /* The following register values should be restored after reset,
+     * so save them.
+     */
+    aer_save_one_register(s, PCI_ERR_UNCOR_MASK);
+    aer_save_one_register(s, PCI_ERR_UNCOR_SEVER);
+    aer_save_one_register(s, PCI_ERR_COR_MASK);
+    aer_save_one_register(s, PCI_ERR_CAP);
+}
+static void aer_restore_one_register(XenPCIPassthroughState *s, int offset)
+{
+    PCIDevice *d = &s->dev;
+    uint32_t aer_base = s->pm_state->aer_base;
+    uint32_t config = 0;
+
+    config = pci_get_long(d->config + aer_base + offset);
+    host_pci_set_long(s->real_device, aer_base + offset, config);
+}
+static void pt_aer_reg_restore(XenPCIPassthroughState *s)
+{
+    /* The following registers should be reconfigured to correct values
+     * after reset, so restore them.
+     * Other registers should not be reconfigured after reset
+     * unless there is a reason to.
+     */
+    aer_restore_one_register(s, PCI_ERR_UNCOR_MASK);
+    aer_restore_one_register(s, PCI_ERR_UNCOR_SEVER);
+    aer_restore_one_register(s, PCI_ERR_COR_MASK);
+    aer_restore_one_register(s, PCI_ERR_CAP);
+}
+
+/* capability structure register group size functions */
+
+static uint8_t pt_reg_grp_size_init(XenPCIPassthroughState *s,
+                                    const XenPTRegGroupInfo *grp_reg,
+                                    uint32_t base_offset)
+{
+    return grp_reg->grp_size;
+}
+/* get Power Management Capability Structure register group size */
+static uint8_t pt_pm_size_init(XenPCIPassthroughState *s,
+                               const XenPTRegGroupInfo *grp_reg,
+                               uint32_t base_offset)
+{
+    if (!s->power_mgmt) {
+        return grp_reg->grp_size;
+    }
+
+    s->pm_state = g_malloc0(sizeof (XenPTPM));
+
+    /* set Power Management Capability base offset */
+    s->pm_state->pm_base = base_offset;
+
+    /* find AER register and set AER Capability base offset */
+    s->pm_state->aer_base = host_pci_find_ext_cap_offset(s->real_device,
+                                                         PCI_EXT_CAP_ID_ERR);
+
+    /* save AER register */
+    if (s->pm_state->aer_base) {
+        pt_aer_reg_save(s);
+    }
+
+    return grp_reg->grp_size;
+}
+/* get Vendor Specific Capability Structure register group size */
+static uint8_t pt_vendor_size_init(XenPCIPassthroughState *s,
+                                   const XenPTRegGroupInfo *grp_reg,
+                                   uint32_t base_offset)
+{
+    return pci_get_byte(s->dev.config + base_offset + 0x02);
+}
+/* get PCI Express Capability Structure register group size */
+static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
+                                 const XenPTRegGroupInfo *grp_reg,
+                                 uint32_t base_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t exp_flag = 0;
+    uint16_t type = 0;
+    uint16_t version = 0;
+    uint8_t pcie_size = 0;
+
+    exp_flag = pci_get_word(d->config + base_offset + PCI_EXP_FLAGS);
+    type = (exp_flag & PCI_EXP_FLAGS_TYPE) >> 4;
+    version = exp_flag & PCI_EXP_FLAGS_VERS;
+
+    /* calculate size depending on capability version and device/port type */
+    /* in case of PCI Express Base Specification Rev 1.x */
+    if (version == 1) {
+        /* The PCI Express Capabilities, Device Capabilities, and Device
+         * Status/Control registers are required for all PCI Express devices.
+         * The Link Capabilities and Link Status/Control are required for all
+         * Endpoints that are not Root Complex Integrated Endpoints. Endpoints
+         * are not required to implement registers other than those listed
+         * above and terminate the capability structure.
+         */
+        switch (type) {
+        case PCI_EXP_TYPE_ENDPOINT:
+        case PCI_EXP_TYPE_LEG_END:
+            pcie_size = 0x14;
+            break;
+        case PCI_EXP_TYPE_RC_END:
+            /* has no link */
+            pcie_size = 0x0C;
+            break;
+        /* only EndPoint passthrough is supported */
+        case PCI_EXP_TYPE_ROOT_PORT:
+        case PCI_EXP_TYPE_UPSTREAM:
+        case PCI_EXP_TYPE_DOWNSTREAM:
+        case PCI_EXP_TYPE_PCI_BRIDGE:
+        case PCI_EXP_TYPE_PCIE_BRIDGE:
+        case PCI_EXP_TYPE_RC_EC:
+        default:
+            hw_error("Internal error: Unsupported device/port type[%d]. "
+                     "I/O emulator exit.\n", type);
+        }
+    }
+    /* in case of PCI Express Base Specification Rev 2.0 */
+    else if (version == 2) {
+        switch (type) {
+        case PCI_EXP_TYPE_ENDPOINT:
+        case PCI_EXP_TYPE_LEG_END:
+        case PCI_EXP_TYPE_RC_END:
+            /* For Functions that do not implement the registers,
+             * these spaces must be hardwired to 0b.
+             */
+            pcie_size = 0x3C;
+            break;
+        /* only EndPoint passthrough is supported */
+        case PCI_EXP_TYPE_ROOT_PORT:
+        case PCI_EXP_TYPE_UPSTREAM:
+        case PCI_EXP_TYPE_DOWNSTREAM:
+        case PCI_EXP_TYPE_PCI_BRIDGE:
+        case PCI_EXP_TYPE_PCIE_BRIDGE:
+        case PCI_EXP_TYPE_RC_EC:
+        default:
+            hw_error("Internal error: Unsupported device/port type[%d]. "
+                     "I/O emulator exit.\n", type);
+        }
+    } else {
+        hw_error("Internal error: Unsupported capability version[%d]. "
+                 "I/O emulator exit.\n", version);
+    }
+
+    return pcie_size;
+}
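The size selection in pt_pcie_size_init boils down to a small decision table. A minimal sketch, assuming the usual pci_regs.h encodings for the PCI Express Capabilities register (version in bits 3:0, device/port type in bits 7:4); the EXP_* names and pcie_cap_size are hypothetical, and unsupported cases return 0 here instead of calling hw_error:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encodings, mirroring the PCI_EXP_FLAGS_* and PCI_EXP_TYPE_*
 * constants from pci_regs.h. */
#define EXP_FLAGS_VERS    0x000f
#define EXP_FLAGS_TYPE    0x00f0
#define EXP_TYPE_ENDPOINT 0x0
#define EXP_TYPE_LEG_END  0x1
#define EXP_TYPE_RC_END   0x9

/* Hypothetical helper: capability structure size for an endpoint,
 * or 0 when the version or device/port type is not supported. */
static uint8_t pcie_cap_size(uint16_t exp_flags)
{
    uint16_t version = exp_flags & EXP_FLAGS_VERS;
    uint16_t type = (exp_flags & EXP_FLAGS_TYPE) >> 4;

    if (type != EXP_TYPE_ENDPOINT && type != EXP_TYPE_LEG_END &&
        type != EXP_TYPE_RC_END) {
        return 0;           /* only endpoints are passed through */
    }
    if (version == 1) {
        /* a Root Complex Integrated Endpoint has no link registers */
        return type == EXP_TYPE_RC_END ? 0x0C : 0x14;
    }
    if (version == 2) {
        return 0x3C;        /* unimplemented registers read as 0 */
    }
    return 0;
}
```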
+
+static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
+    /* Header Type0 reg group */
+    {
+        .grp_id      = 0xFF,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0x40,
+        .size_init   = pt_reg_grp_size_init,
+        .emu_reg_tbl = pt_emu_reg_header0_tbl,
+    },
+    /* PCI PowerManagement Capability reg group */
+    {
+        .grp_id      = PCI_CAP_ID_PM,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = PCI_PM_SIZEOF,
+        .size_init   = pt_pm_size_init,
+        .emu_reg_tbl = pt_emu_reg_pm_tbl,
+    },
+    /* AGP Capability Structure reg group */
+    {
+        .grp_id     = PCI_CAP_ID_AGP,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x30,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* Vital Product Data Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_VPD,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0x08,
+        .size_init   = pt_reg_grp_size_init,
+        .emu_reg_tbl = pt_emu_reg_vpd_tbl,
+    },
+    /* Slot Identification reg group */
+    {
+        .grp_id     = PCI_CAP_ID_SLOTID,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x04,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* PCI-X Capabilities List Item reg group */
+    {
+        .grp_id     = PCI_CAP_ID_PCIX,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x18,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* Vendor Specific Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_VNDR,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0xFF,
+        .size_init   = pt_vendor_size_init,
+        .emu_reg_tbl = pt_emu_reg_vendor_tbl,
+    },
+    /* SHPC Capability List Item reg group */
+    {
+        .grp_id     = PCI_CAP_ID_SHPC,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x08,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* Subsystem ID and Subsystem Vendor ID Capability List Item reg group */
+    {
+        .grp_id     = PCI_CAP_ID_SSVID,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x08,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* AGP 8x Capability Structure reg group */
+    {
+        .grp_id     = PCI_CAP_ID_AGP3,
+        .grp_type   = GRP_TYPE_HARDWIRED,
+        .grp_size   = 0x30,
+        .size_init  = pt_reg_grp_size_init,
+    },
+    /* PCI Express Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_EXP,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0xFF,
+        .size_init   = pt_pcie_size_init,
+        .emu_reg_tbl = pt_emu_reg_pcie_tbl,
+    },
+    {
+        .grp_size = 0,
+    },
+};
+
+/* initialize Capabilities Pointer or Next Pointer register */
+static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s,
+                                XenPTRegInfo *reg, uint32_t real_offset)
+{
+    /* uint32_t reg_field = (uint32_t)s->dev.config[real_offset]; */
+    uint32_t reg_field = pci_get_byte(s->dev.config + real_offset);
+    int i;
+
+    /* find capability offset */
+    while (reg_field) {
+        for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
+            if (pt_hide_dev_cap(s->real_device,
+                                pt_emu_reg_grp_tbl[i].grp_id)) {
+                continue;
+            }
+            if (pt_emu_reg_grp_tbl[i].grp_id == s->dev.config[reg_field]) {
+                if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
+                    goto out;
+                }
+                /* skip a capability hardwired to 0, find the next one */
+                break;
+            }
+        }
+        /* next capability */
+        /* reg_field = (uint32_t)s->dev.config[reg_field + 1]; */
+        reg_field = pci_get_byte(s->dev.config + reg_field + 1);
+    }
+
+out:
+    return reg_field;
+}
+
+
+/*************
+ * Main
+ */
+
+/* restore a part of I/O device register */
+static void pt_config_restore(XenPCIPassthroughState *s)
+{
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegInfo *reg = NULL;
+    uint32_t real_offset = 0;
+    uint32_t read_val = 0;
+    uint32_t val = 0;
+    int ret = 0;
+
+    /* find emulate register group entry */
+    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
+        /* find emulate register entry */
+        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
+            reg = reg_entry->reg;
+
+            /* check whether restoring is needed */
+            if (!reg->u.b.restore) {
+                continue;
+            }
+
+            real_offset = reg_grp_entry->base_offset + reg->offset;
+
+            /* read I/O device register value */
+            ret = host_pci_get_block(s->real_device, real_offset,
+                                     (uint8_t *)&read_val, reg->size);
+
+            if (!ret) {
+                PT_LOG("Error: host_pci_get_block failed. "
+                       "return value[%d].\n", ret);
+                memset(&read_val, 0xff, reg->size);
+            }
+
+            val = 0;
+
+            /* restore based on register size */
+            switch (reg->size) {
+            case 1:
+                /* byte register */
+                ret = reg->u.b.restore(s, reg_entry, real_offset,
+                                       (uint8_t)read_val, (uint8_t *)&val);
+                break;
+            case 2:
+                /* word register */
+                ret = reg->u.w.restore(s, reg_entry, real_offset,
+                                       (uint16_t)read_val, (uint16_t *)&val);
+                break;
+            case 4:
+                /* double word register */
+                ret = reg->u.dw.restore(s, reg_entry, real_offset,
+                                        (uint32_t)read_val, (uint32_t *)&val);
+                break;
+            }
+
+            /* restoring error */
+            if (ret < 0) {
+                hw_error("Internal error: Invalid restoring "
+                         "return value[%d]. I/O emulator exit.\n", ret);
+            }
+
+            PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
+                          pci_bus_num(s->dev.bus), PCI_SLOT(s->dev.devfn),
+                          PCI_FUNC(s->dev.devfn),
+                          real_offset, val, reg->size);
+
+            ret = host_pci_set_block(s->real_device, real_offset,
+                                     (uint8_t *)&val, reg->size);
+
+            if (!ret) {
+                PT_LOG("Error: host_pci_set_block failed. "
+                       "return value[%d].\n", ret);
+            }
+        }
+    }
+
+    /* if AER supported, restore it */
+    if (s->pm_state->aer_base) {
+        pt_aer_reg_restore(s);
+    }
+}
+/* reinitialize all emulate registers */
+static void pt_config_reinit(XenPCIPassthroughState *s)
+{
+    XenPTRegGroup *reg_grp_entry = NULL;
+    XenPTReg *reg_entry = NULL;
+    XenPTRegInfo *reg = NULL;
+
+    /* find emulate register group entry */
+    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
+        /* find emulate register entry */
+        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
+            reg = reg_entry->reg;
+            if (reg->init) {
+                /* initialize emulate register */
+                reg_entry->data =
+                    reg->init(s, reg_entry->reg,
+                              reg_grp_entry->base_offset + reg->offset);
+            }
+        }
+    }
+}
+
+static int pt_init_pci_config(XenPCIPassthroughState *s)
+{
+    PCIDevice *d = &s->dev;
+    int ret = 0;
+
+    PT_LOG("Reinitialize PCI configuration registers due to power state"
+           " transition with internal reset. [%02x:%02x.%x]\n",
+           pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));
+
+    /* restore a part of I/O device register */
+    pt_config_restore(s);
+
+    /* reinitialize all emulate register */
+    pt_config_reinit(s);
+
+    /* rebind machine_irq to device */
+    if (s->machine_irq != 0) {
+        uint8_t e_device = PCI_SLOT(s->dev.devfn);
+        uint8_t e_intx = pci_intx(s);
+
+        ret = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq, 0,
+                                        e_device, e_intx);
+        if (ret < 0) {
+            PT_LOG("Error: Rebinding of interrupt failed! ret=%d\n", ret);
+        }
+    }
+
+    return ret;
+}
+
+static uint8_t find_cap_offset(XenPCIPassthroughState *s, uint8_t cap)
+{
+    int id;
+    int max_cap = 48;
+    int pos = PCI_CAPABILITY_LIST;
+    int status;
+
+    status = host_pci_get_byte(s->real_device, PCI_STATUS);
+    if ((status & PCI_STATUS_CAP_LIST) == 0) {
+        return 0;
+    }
+
+    while (max_cap--) {
+        pos = host_pci_get_byte(s->real_device, pos);
+        if (pos < 0x40) {
+            break;
+        }
+
+        pos &= ~3;
+        id = host_pci_get_byte(s->real_device, pos + PCI_CAP_LIST_ID);
+
+        if (id == 0xff) {
+            break;
+        }
+        if (id == cap) {
+            return pos;
+        }
+
+        pos += PCI_CAP_LIST_NEXT;
+    }
+    return 0;
+}
+
+static void pt_config_reg_init(XenPCIPassthroughState *s,
+                               XenPTRegGroup *reg_grp, XenPTRegInfo *reg)
+{
+    XenPTReg *reg_entry;
+    uint32_t data = 0;
+
+    reg_entry = g_malloc0(sizeof (XenPTReg));
+
+    reg_entry->reg = reg;
+    reg_entry->data = 0;
+
+    if (reg->init) {
+        /* initialize emulate register */
+        data = reg->init(s, reg_entry->reg,
+                         reg_grp->base_offset + reg->offset);
+        if (data == PT_INVALID_REG) {
+            /* free unused BAR register entry */
+            g_free(reg_entry);
+            return;
+        }
+        /* set register value */
+        reg_entry->data = data;
+    }
+    /* list add register entry */
+    QLIST_INSERT_HEAD(&reg_grp->reg_tbl_list, reg_entry, entries);
+
+    return;
+}
+
+void pt_config_init(XenPCIPassthroughState *s)
+{
+    XenPTRegGroup *reg_grp_entry = NULL;
+    uint32_t reg_grp_offset = 0;
+    XenPTRegInfo *reg_tbl = NULL;
+    int i, j;
+
+    QLIST_INIT(&s->reg_grp_tbl);
+
+    for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
+        if (pt_emu_reg_grp_tbl[i].grp_id != 0xFF) {
+            if (pt_hide_dev_cap(s->real_device,
+                                pt_emu_reg_grp_tbl[i].grp_id)) {
+                continue;
+            }
+
+            reg_grp_offset = find_cap_offset(s, pt_emu_reg_grp_tbl[i].grp_id);
+
+            if (!reg_grp_offset) {
+                continue;
+            }
+        }
+
+        reg_grp_entry = g_malloc0(sizeof (XenPTRegGroup));
+        QLIST_INIT(&reg_grp_entry->reg_tbl_list);
+        QLIST_INSERT_HEAD(&s->reg_grp_tbl, reg_grp_entry, entries);
+
+        reg_grp_entry->base_offset = reg_grp_offset;
+        reg_grp_entry->reg_grp = pt_emu_reg_grp_tbl + i;
+        if (pt_emu_reg_grp_tbl[i].size_init) {
+            /* get register group size */
+            reg_grp_entry->size =
+                pt_emu_reg_grp_tbl[i].size_init(s, reg_grp_entry->reg_grp,
+                                                reg_grp_offset);
+        }
+
+        if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
+            if (pt_emu_reg_grp_tbl[i].emu_reg_tbl) {
+                reg_tbl = pt_emu_reg_grp_tbl[i].emu_reg_tbl;
+                /* initialize capability register */
+                for (j = 0; reg_tbl->size != 0; j++, reg_tbl++) {
+                    /* initialize capability register */
+                    pt_config_reg_init(s, reg_grp_entry, reg_tbl);
+                }
+            }
+        }
+        reg_grp_offset = 0;
+    }
+
+    return;
+}
+
+/* delete all emulate register */
+void pt_config_delete(XenPCIPassthroughState *s)
+{
+    struct XenPTRegGroup *reg_group, *next_grp;
+    struct XenPTReg *reg, *next_reg;
+
+    /* free Power Management info table */
+    if (s->pm_state) {
+        if (s->pm_state->pm_timer) {
+            qemu_del_timer(s->pm_state->pm_timer);
+            qemu_free_timer(s->pm_state->pm_timer);
+            s->pm_state->pm_timer = NULL;
+        }
+
+        g_free(s->pm_state);
+    }
+
+    /* free all register group entry */
+    QLIST_FOREACH_SAFE(reg_group, &s->reg_grp_tbl, entries, next_grp) {
+        /* free all register entry */
+        QLIST_FOREACH_SAFE(reg, &reg_group->reg_tbl_list, entries, next_reg) {
+            QLIST_REMOVE(reg, entries);
+            g_free(reg);
+        }
+
+        QLIST_REMOVE(reg_group, entries);
+        g_free(reg_group);
+    }
+}
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH V3 09/10] Introduce apic-msidef.h
  2011-10-28 15:07 ` Anthony PERARD
@ 2011-10-28 15:07   ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel

This patch moves the MSI definitions from apic.c to apic-msidef.h so that they
can also be used by other .c files.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
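Note for readers: the constants moved by this patch describe where each field
sits in an MSI address/data pair. Below is a minimal sketch of how they can be
used to pull those fields apart. The MSIFields struct and the msi_decode()
helper are hypothetical names made up for this illustration and are not part
of the patch; the macro values are copied from the new header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants as introduced in hw/apic-msidef.h by this patch. */
#define MSI_DATA_VECTOR_SHIFT           0
#define MSI_DATA_VECTOR_MASK            0x000000ff
#define MSI_DATA_DELIVERY_MODE_SHIFT    8
#define MSI_DATA_LEVEL_SHIFT            14
#define MSI_DATA_TRIGGER_SHIFT          15
#define MSI_ADDR_DEST_MODE_SHIFT        2
#define MSI_ADDR_DEST_ID_SHIFT          12
#define MSI_ADDR_DEST_ID_MASK           0x00ffff0

/* Hypothetical container for the decoded fields (illustration only). */
typedef struct MSIFields {
    uint8_t dest_id;        /* destination APIC ID, address bits 12..19 */
    uint8_t dest_mode;      /* address bit 2 */
    uint8_t vector;         /* data bits 0..7 */
    uint8_t delivery_mode;  /* data bits 8..10 */
    uint8_t level;          /* data bit 14 */
    uint8_t trigger_mode;   /* data bit 15 */
} MSIFields;

/* Hypothetical helper: split an MSI address/data pair into its fields. */
static void msi_decode(uint32_t addr, uint32_t data, MSIFields *out)
{
    assert(out != NULL);

    out->dest_id = (uint8_t)((addr & MSI_ADDR_DEST_ID_MASK)
                             >> MSI_ADDR_DEST_ID_SHIFT);
    out->dest_mode = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
    out->vector = (uint8_t)((data & MSI_DATA_VECTOR_MASK)
                            >> MSI_DATA_VECTOR_SHIFT);
    out->delivery_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
    out->level = (data >> MSI_DATA_LEVEL_SHIFT) & 0x1;
    out->trigger_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
}
```

For instance, msi_decode(0xFEE0A004, 0x4041, &f) would yield destination ID
0x0A, destination mode 1, vector 0x41, delivery mode 0, level 1 and trigger
mode 0.
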
 hw/apic-msidef.h |   28 ++++++++++++++++++++++++++++
 hw/apic.c        |   11 +----------
 2 files changed, 29 insertions(+), 10 deletions(-)
 create mode 100644 hw/apic-msidef.h

diff --git a/hw/apic-msidef.h b/hw/apic-msidef.h
new file mode 100644
index 0000000..3182f0b
--- /dev/null
+++ b/hw/apic-msidef.h
@@ -0,0 +1,28 @@
+#ifndef HW_APIC_MSIDEF_H
+#define HW_APIC_MSIDEF_H
+
+/*
+ * Intel APIC constants: from include/asm/msidef.h
+ */
+
+/*
+ * Shifts for MSI data
+ */
+
+#define MSI_DATA_VECTOR_SHIFT           0
+#define  MSI_DATA_VECTOR_MASK           0x000000ff
+
+#define MSI_DATA_DELIVERY_MODE_SHIFT    8
+#define MSI_DATA_LEVEL_SHIFT            14
+#define MSI_DATA_TRIGGER_SHIFT          15
+
+/*
+ * Shift/mask fields for msi address
+ */
+
+#define MSI_ADDR_DEST_MODE_SHIFT        2
+
+#define MSI_ADDR_DEST_ID_SHIFT          12
+#define  MSI_ADDR_DEST_ID_MASK          0x00ffff0
+
+#endif /* HW_APIC_MSIDEF_H */
diff --git a/hw/apic.c b/hw/apic.c
index 8289eef..18c4a87 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -24,6 +24,7 @@
 #include "sysbus.h"
 #include "trace.h"
 #include "pc.h"
+#include "apic-msidef.h"
 
 /* APIC Local Vector Table */
 #define APIC_LVT_TIMER   0
@@ -65,16 +66,6 @@
 #define MAX_APICS 255
 #define MAX_APIC_WORDS 8
 
-/* Intel APIC constants: from include/asm/msidef.h */
-#define MSI_DATA_VECTOR_SHIFT		0
-#define MSI_DATA_VECTOR_MASK		0x000000ff
-#define MSI_DATA_DELIVERY_MODE_SHIFT	8
-#define MSI_DATA_TRIGGER_SHIFT		15
-#define MSI_DATA_LEVEL_SHIFT		14
-#define MSI_ADDR_DEST_MODE_SHIFT	2
-#define MSI_ADDR_DEST_ID_SHIFT		12
-#define	MSI_ADDR_DEST_ID_MASK		0x00ffff0
-
 #define MSI_ADDR_SIZE                   0x100000
 
 typedef struct APICState APICState;
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [Qemu-devel] [PATCH V3 10/10] Introduce Xen PCI Passthrough, MSI (3/3)
  2011-10-28 15:07 ` Anthony PERARD
@ 2011-10-28 15:07   ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel, Shan Haitao

From: Jiang Yunhong <yunhong.jiang@intel.com>

Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Shan Haitao <haitao.shan@intel.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
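Note for readers: every *_reg_write handler in this patch splits a guest write
into two parts with the same mask arithmetic: the bits that update the emulated
copy, and the bits that are passed through to the real device. Below is a
minimal sketch of that arithmetic for a 16-bit register. PT_MERGE_VALUE is not
defined in this patch; the definition here is an assumption (take the masked
bits from the first argument and the remaining bits from the second). The
MergeResult struct and pt_merge_word() are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: bits in val_mask come from value, the rest from data. */
#define PT_MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

/* Hypothetical result pair (illustration only). */
typedef struct MergeResult {
    uint16_t emu_data;   /* new content of the emulated register */
    uint16_t dev_write;  /* value to write to the real device */
} MergeResult;

/* Hypothetical helper mirroring the mask logic of the write handlers. */
static MergeResult pt_merge_word(uint16_t guest_value, uint16_t emu_data,
                                 uint16_t dev_value, uint16_t emu_mask,
                                 uint16_t ro_mask, uint16_t valid_mask)
{
    MergeResult r;
    /* emulated, not read-only, and covered by this access */
    uint16_t writable_mask = (uint16_t)(emu_mask & ~ro_mask & valid_mask);
    /* not emulated and covered by this access */
    uint16_t throughable_mask = (uint16_t)(~emu_mask & valid_mask);

    r.emu_data = (uint16_t)PT_MERGE_VALUE(guest_value, emu_data,
                                          writable_mask);
    r.dev_write = (uint16_t)PT_MERGE_VALUE(guest_value, dev_value,
                                           throughable_mask);
    return r;
}
```

With the Message Control masks from this patch (ro_mask 0xFF8E, emu_mask
0x007F), a full-width guest write of 0x0031 over a device value of 0x0086
stores 0x0031 in the emulated copy and leaves 0x0006 as the value for the
device, before pt_msgctrl_reg_write adjusts the enable bit separately.
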
 Makefile.target                      |    1 +
 hw/apic-msidef.h                     |    2 +
 hw/xen_pci_passthrough.c             |   27 ++-
 hw/xen_pci_passthrough.h             |   55 +++
 hw/xen_pci_passthrough_config_init.c |  495 +++++++++++++++++++++++++-
 hw/xen_pci_passthrough_msi.c         |  667 ++++++++++++++++++++++++++++++++++
 6 files changed, 1240 insertions(+), 7 deletions(-)
 create mode 100644 hw/xen_pci_passthrough_msi.c
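
Note for readers: pt_msi_size_init() derives the size of the MSI capability
structure from the Message Control register. A minimal standalone sketch of
that calculation follows. The msi_cap_size() name is hypothetical, and the two
flag values are assumed to match the usual pci_regs.h definitions, which are
not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match pci_regs.h (not defined in this patch). */
#define PCI_MSI_FLAGS_64BIT     0x0080  /* 64-bit addresses allowed */
#define PCI_MSI_FLAGS_MASKBIT   0x0100  /* per-vector masking capable */

/* Hypothetical helper: same arithmetic as pt_msi_size_init(). */
static uint8_t msi_cap_size(uint16_t msg_ctrl)
{
    uint8_t size = 0xa;  /* 32-bit address, no masking */

    if (msg_ctrl & PCI_MSI_FLAGS_64BIT) {
        size += 4;       /* Message Upper Address register */
    }
    if (msg_ctrl & PCI_MSI_FLAGS_MASKBIT) {
        size += 10;      /* room for the Mask Bits and Pending Bits registers */
    }
    return size;
}
```

This gives 10, 14, 20 or 24 bytes depending on which of the two capability
bits the device advertises.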

diff --git a/Makefile.target b/Makefile.target
index c32c688..17b8857 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -220,6 +220,7 @@ obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_msi.o
 
 # Inter-VM PCI shared memory
 CONFIG_IVSHMEM =
diff --git a/hw/apic-msidef.h b/hw/apic-msidef.h
index 3182f0b..6e2eb71 100644
--- a/hw/apic-msidef.h
+++ b/hw/apic-msidef.h
@@ -22,6 +22,8 @@
 
 #define MSI_ADDR_DEST_MODE_SHIFT        2
 
+#define MSI_ADDR_REDIRECTION_SHIFT      3
+
 #define MSI_ADDR_DEST_ID_SHIFT          12
 #define  MSI_ADDR_DEST_ID_MASK          0x00ffff0
 
diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
index b97c5b6..4b9eb74 100644
--- a/hw/xen_pci_passthrough.c
+++ b/hw/xen_pci_passthrough.c
@@ -417,6 +417,7 @@ static void pt_iomem_map(XenPCIPassthroughState *s, int i,
     }
 
     if (!first_map && old_ebase != -1) {
+        pt_add_msix_mapping(s, i);
         /* Remove old mapping */
         ret = xc_domain_memory_mapping(xen_xc, xen_domid,
                                old_ebase >> XC_PAGE_SHIFT,
@@ -441,6 +442,15 @@ static void pt_iomem_map(XenPCIPassthroughState *s, int i,
         if (ret != 0) {
             PT_LOG("Error: create new mapping failed!\n");
         }
+
+        ret = pt_remove_msix_mapping(s, i);
+        if (ret != 0) {
+            PT_LOG("Error: remove MSI-X mmio mapping failed!\n");
+        }
+
+        if (old_ebase != e_phys && old_ebase != -1) {
+            pt_msix_update_remap(s, i);
+        }
     }
 }
 
@@ -737,6 +747,9 @@ static int pt_initfn(PCIDevice *pcidev)
         mapped_machine_irq[machine_irq]++;
     }
 
+    /* setup MSI-INTx translation if supported */
+    rc = pt_enable_msi_translate(s);
+
     /* bind machine_irq to device */
     if (rc < 0 && machine_irq != 0) {
         uint8_t e_device = PCI_SLOT(s->dev.devfn);
@@ -765,7 +778,8 @@ static int pt_initfn(PCIDevice *pcidev)
 
 out:
     PT_LOG("Real physical device %02x:%02x.%x registered successfuly!\n"
-           "IRQ type = %s\n", bus, slot, func, "INTx");
+           "IRQ type = %s\n", bus, slot, func,
+           s->msi_trans_en ? "MSI-INTx" : "INTx");
 
     return 0;
 }
@@ -782,7 +796,7 @@ static int pt_unregister_device(PCIDevice *pcidev)
     e_intx = pci_intx(s);
     machine_irq = s->machine_irq;
 
-    if (machine_irq) {
+    if (s->msi_trans_en == 0 && machine_irq) {
         rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
                                      PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
         if (rc < 0) {
@@ -790,6 +804,13 @@ static int pt_unregister_device(PCIDevice *pcidev)
         }
     }
 
+    if (s->msi) {
+        pt_msi_disable(s);
+    }
+    if (s->msix) {
+        pt_msix_disable(s);
+    }
+
     if (machine_irq) {
         mapped_machine_irq[machine_irq]--;
 
@@ -824,6 +845,8 @@ static PCIDeviceInfo xen_pci_passthrough = {
     .is_express = 0,
     .qdev.props = (Property[]) {
         DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
+        DEFINE_PROP_BIT("msitranslate", XenPCIPassthroughState, msi_trans_cap,
+                        0, true),
         DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
                         0, false),
         DEFINE_PROP_END_OF_LIST(),
diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
index ebc04fd..5f404b0 100644
--- a/hw/xen_pci_passthrough.h
+++ b/hw/xen_pci_passthrough.h
@@ -63,6 +63,10 @@ typedef int (*conf_byte_restore)
 
 #define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
 
+/* MSI-X */
+#define PT_MSI_FLAG_UNINIT 0x1000
+#define PT_MSI_FLAG_MAPPED 0x2000
+
 
 typedef enum {
     GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
@@ -166,6 +170,34 @@ typedef struct XenPTRegGroup {
 } XenPTRegGroup;
 
 
+typedef struct XenPTMSI {
+    uint32_t flags;
+    uint32_t ctrl_offset; /* saved control offset */
+    int pirq;          /* guest pirq corresponding */
+    uint32_t addr_lo;  /* guest message address */
+    uint32_t addr_hi;  /* guest message upper address */
+    uint16_t data;     /* guest message data */
+} XenPTMSI;
+
+typedef struct XenMSIXEntry {
+    int pirq;        /* -1 means unmapped */
+    int flags;       /* flags indicating whether MSI ADDR or DATA is updated */
+    uint32_t io_mem[4];
+} XenMSIXEntry;
+typedef struct XenPTMSIX {
+    uint32_t ctrl_offset;
+    int enabled;
+    int total_entries;
+    int bar_index;
+    uint64_t table_base;
+    uint32_t table_off;
+    uint32_t table_offset_adjust; /* page align mmap */
+    uint64_t mmio_base_addr;
+    int mmio_index;
+    void *phys_iomem_base;
+    XenMSIXEntry msix_entry[0];
+} XenPTMSIX;
+
 typedef struct XenPTPM {
     QEMUTimer *pm_timer;  /* QEMUTimer struct */
     int no_soft_reset;    /* No Soft Reset flags */
@@ -189,6 +221,13 @@ struct XenPCIPassthroughState {
 
     uint32_t machine_irq;
 
+    XenPTMSI *msi;
+    XenPTMSIX *msix;
+
+    /* Physical MSI to guest INTx translation when possible */
+    uint32_t msi_trans_cap;
+    bool msi_trans_en;
+
     uint32_t power_mgmt;
     XenPTPM *pm_state;
 
@@ -222,4 +261,20 @@ static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
 }
 uint8_t pci_intx(XenPCIPassthroughState *ptdev);
 
+/* MSI/MSI-X */
+void pt_msi_set_enable(XenPCIPassthroughState *s, int en);
+int pt_msi_setup(XenPCIPassthroughState *s);
+int pt_msi_update(XenPCIPassthroughState *d);
+void pt_msi_disable(XenPCIPassthroughState *s);
+int pt_enable_msi_translate(XenPCIPassthroughState *s);
+void pt_disable_msi_translate(XenPCIPassthroughState *s);
+
+int pt_msix_init(XenPCIPassthroughState *s, int pos);
+void pt_msix_delete(XenPCIPassthroughState *s);
+int pt_msix_update(XenPCIPassthroughState *s);
+int pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index);
+void pt_msix_disable(XenPCIPassthroughState *s);
+int pt_add_msix_mapping(XenPCIPassthroughState *s, int bar_index);
+int pt_remove_msix_mapping(XenPCIPassthroughState *s, int bar_index);
+
 #endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
index 4103b59..b4238ee 100644
--- a/hw/xen_pci_passthrough_config_init.c
+++ b/hw/xen_pci_passthrough_config_init.c
@@ -375,11 +375,19 @@ static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
     throughable_mask = ~emu_mask & valid_mask;
 
     if (*value & PCI_COMMAND_INTX_DISABLE) {
-        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
-    } else {
-        if (s->machine_irq) {
+        if (s->msi_trans_en) {
+            pt_msi_set_enable(s, 0);
+        } else {
             throughable_mask |= PCI_COMMAND_INTX_DISABLE;
         }
+    } else {
+        if (s->msi_trans_en) {
+            pt_msi_set_enable(s, 1);
+        } else {
+            if (s->machine_irq) {
+                throughable_mask |= PCI_COMMAND_INTX_DISABLE;
+            }
+        }
     }
 
     *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
@@ -1248,13 +1256,21 @@ static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
     e_device = PCI_SLOT(s->dev.devfn);
     e_intx = pci_intx(s);
 
-    if (s->machine_irq) {
+    if (s->msi_trans_en == 0 && s->machine_irq) {
         if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
                                     PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
             PT_LOG("Error: Unbinding of interrupt failed!\n");
         }
     }
 
+    /* disable MSI/MSI-X and MSI-INTx translation */
+    if (s->msi) {
+        pt_msi_disable(s);
+    }
+    if (s->msix) {
+        pt_msix_disable(s);
+    }
+
     /* clear all virtual region address */
     for (i = 0; i < PCI_NUM_REGIONS; i++) {
         r = &d->io_regions[i];
@@ -1501,6 +1517,406 @@ static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
     },
 };
 
+/********************************
+ * MSI Capability
+ */
+
+/* Message Control register */
+static uint32_t pt_msgctrl_reg_init(XenPCIPassthroughState *s,
+                                    XenPTRegInfo *reg, uint32_t real_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t reg_field = 0;
+
+    /* use I/O device register's value as initial value */
+    reg_field = pci_get_word(d->config + real_offset);
+
+    if (reg_field & PCI_MSI_FLAGS_ENABLE) {
+        PT_LOG("MSI enabled already, disable first\n");
+        host_pci_set_word(s->real_device, real_offset,
+                          reg_field & ~PCI_MSI_FLAGS_ENABLE);
+    }
+    s->msi->flags |= reg_field | PT_MSI_FLAG_UNINIT;
+    s->msi->ctrl_offset = real_offset;
+
+    return reg->init_val;
+}
+static int pt_msgctrl_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                                uint16_t *value, uint16_t dev_value,
+                                uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+    PCIDevice *pd = (PCIDevice *)s;
+    uint16_t val;
+
+    /* Currently no support for multi-vector */
+    if (*value & PCI_MSI_FLAGS_QSIZE) {
+        PT_LOG("Warning: try to set more than 1 vector ctrl %x\n", *value);
+    }
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+    /* update the msi_info too */
+    s->msi->flags |= cfg_entry->data &
+        ~(PT_MSI_FLAG_UNINIT | PT_MSI_FLAG_MAPPED | PCI_MSI_FLAGS_ENABLE);
+
+    /* create value for writing to I/O device register */
+    val = *value;
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI */
+    if (val & PCI_MSI_FLAGS_ENABLE) {
+        /* setup MSI pirq for the first time */
+        if (s->msi->flags & PT_MSI_FLAG_UNINIT) {
+            if (s->msi_trans_en) {
+                PT_LOG("guest enabling MSI, disable MSI-INTx translation\n");
+                pt_disable_msi_translate(s);
+            } else {
+                /* Init physical one */
+                PT_LOG("setup msi for dev %x\n", pd->devfn);
+                if (pt_msi_setup(s)) {
+                    /* We do not broadcast the error to the framework code, so
+                     * that MSI errors are contained in MSI emulation code and
+                     * QEMU can go on running.
+                     * Guest MSI will simply not work in that case.
+                     */
+                    *value &= ~PCI_MSI_FLAGS_ENABLE;
+                    PT_LOG("Warning: Can not map MSI for dev %x\n", pd->devfn);
+                    return 0;
+                }
+            }
+            if (pt_msi_update(s)) {
+                *value &= ~PCI_MSI_FLAGS_ENABLE;
+                PT_LOG("Warning: Can not bind MSI for dev %x\n", pd->devfn);
+                return 0;
+            }
+            s->msi->flags &= ~PT_MSI_FLAG_UNINIT;
+            s->msi->flags |= PT_MSI_FLAG_MAPPED;
+        }
+        s->msi->flags |= PCI_MSI_FLAGS_ENABLE;
+    } else {
+        s->msi->flags &= ~PCI_MSI_FLAGS_ENABLE;
+    }
+
+    /* pass through MSI_ENABLE bit when no MSI-INTx translation */
+    if (!s->msi_trans_en) {
+        *value &= ~PCI_MSI_FLAGS_ENABLE;
+        *value |= val & PCI_MSI_FLAGS_ENABLE;
+    }
+
+    return 0;
+}
+
+/* initialize Message Upper Address register */
+static uint32_t pt_msgaddr64_reg_init(XenPCIPassthroughState *ptdev,
+                                      XenPTRegInfo *reg, uint32_t real_offset)
+{
+    /* no need to initialize in case of 32 bit type */
+    if (!(ptdev->msi->flags & PCI_MSI_FLAGS_64BIT)) {
+        return PT_INVALID_REG;
+    }
+
+    return reg->init_val;
+}
+/* this function will be called twice (for 32 bit and 64 bit type) */
+/* initialize Message Data register */
+static uint32_t pt_msgdata_reg_init(XenPCIPassthroughState *ptdev,
+                                    XenPTRegInfo *reg, uint32_t real_offset)
+{
+    uint32_t flags = ptdev->msi->flags;
+    uint32_t offset = reg->offset;
+
+    /* check whether the offset matches the 32/64 bit type */
+    if (((offset == PCI_MSI_DATA_64) &&  (flags & PCI_MSI_FLAGS_64BIT)) ||
+        ((offset == PCI_MSI_DATA_32) && !(flags & PCI_MSI_FLAGS_64BIT))) {
+        return reg->init_val;
+    } else {
+        return PT_INVALID_REG;
+    }
+}
+
+/* write Message Address register */
+static int pt_msgaddr32_reg_write(XenPCIPassthroughState *s,
+                                  XenPTReg *cfg_entry, uint32_t *value,
+                                  uint32_t dev_value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+    uint32_t old_addr = cfg_entry->data;
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+    /* update the msi_info too */
+    s->msi->addr_lo = cfg_entry->data;
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI */
+    if (cfg_entry->data != old_addr) {
+        if (s->msi->flags & PT_MSI_FLAG_MAPPED) {
+            pt_msi_update(s);
+        }
+    }
+
+    return 0;
+}
+/* write Message Upper Address register */
+static int pt_msgaddr64_reg_write(XenPCIPassthroughState *s,
+                                  XenPTReg *cfg_entry, uint32_t *value,
+                                  uint32_t dev_value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+    uint32_t old_addr = cfg_entry->data;
+
+    /* check whether the type is 64 bit or not */
+    if (!(s->msi->flags & PCI_MSI_FLAGS_64BIT)) {
+        /* exit I/O emulator */
+        PT_LOG("Error: Upper Address written without 64 bit support\n");
+        return -1;
+    }
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+    /* update the msi_info too */
+    s->msi->addr_hi = cfg_entry->data;
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI */
+    if (cfg_entry->data != old_addr) {
+        if (s->msi->flags & PT_MSI_FLAG_MAPPED) {
+            pt_msi_update(s);
+        }
+    }
+
+    return 0;
+}
+
+
+/* this function will be called twice (for 32 bit and 64 bit type) */
+/* write Message Data register */
+static int pt_msgdata_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                                uint16_t *value, uint16_t dev_value,
+                                uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+    uint16_t old_data = cfg_entry->data;
+    uint32_t flags = s->msi->flags;
+    uint32_t offset = reg->offset;
+
+    /* check whether the offset matches the 32/64 bit type */
+    if (!((offset == PCI_MSI_DATA_64) &&  (flags & PCI_MSI_FLAGS_64BIT)) &&
+        !((offset == PCI_MSI_DATA_32) && !(flags & PCI_MSI_FLAGS_64BIT))) {
+        /* exit I/O emulator */
+        PT_LOG("Error: the offset does not match the 32/64 bit type\n");
+        return -1;
+    }
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+    /* update the msi_info too */
+    s->msi->data = cfg_entry->data;
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI */
+    if (cfg_entry->data != old_data) {
+        if (flags & PT_MSI_FLAG_MAPPED) {
+            pt_msi_update(s);
+        }
+    }
+
+    return 0;
+}
+
+/* MSI Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_msi_tbl[] = {
+    /* Next Pointer reg */
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Message Control reg */
+    {
+        .offset     = PCI_MSI_FLAGS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFF8E,
+        .emu_mask   = 0x007F,
+        .init       = pt_msgctrl_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_msgctrl_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Message Address reg */
+    {
+        .offset     = PCI_MSI_ADDRESS_LO,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x00000003,
+        .emu_mask   = 0xFFFFFFFF,
+        .no_wb      = 1,
+        .init       = pt_common_reg_init,
+        .u.dw.read  = pt_long_reg_read,
+        .u.dw.write = pt_msgaddr32_reg_write,
+        .u.dw.restore = NULL,
+    },
+    /* Message Upper Address reg (if PCI_MSI_FLAGS_64BIT set) */
+    {
+        .offset     = PCI_MSI_ADDRESS_HI,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x00000000,
+        .emu_mask   = 0xFFFFFFFF,
+        .no_wb      = 1,
+        .init       = pt_msgaddr64_reg_init,
+        .u.dw.read  = pt_long_reg_read,
+        .u.dw.write = pt_msgaddr64_reg_write,
+        .u.dw.restore = NULL,
+    },
+    /* Message Data reg (16 bits of data for 32-bit devices) */
+    {
+        .offset     = PCI_MSI_DATA_32,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x0000,
+        .emu_mask   = 0xFFFF,
+        .no_wb      = 1,
+        .init       = pt_msgdata_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_msgdata_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Message Data reg (16 bits of data for 64-bit devices) */
+    {
+        .offset     = PCI_MSI_DATA_64,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x0000,
+        .emu_mask   = 0xFFFF,
+        .no_wb      = 1,
+        .init       = pt_msgdata_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_msgdata_reg_write,
+        .u.w.restore  = NULL,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/**************************************
+ * MSI-X Capability
+ */
+
+/* Message Control register for MSI-X */
+static uint32_t pt_msixctrl_reg_init(XenPCIPassthroughState *s,
+                                     XenPTRegInfo *reg, uint32_t real_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t reg_field = 0;
+
+    /* use I/O device register's value as initial value */
+    reg_field = pci_get_word(d->config + real_offset);
+
+    if (reg_field & PCI_MSIX_FLAGS_ENABLE) {
+        PT_LOG("MSIX enabled already, disable first\n");
+        host_pci_set_word(s->real_device, real_offset,
+                          reg_field & ~PCI_MSIX_FLAGS_ENABLE);
+    }
+
+    s->msix->ctrl_offset = real_offset;
+
+    return reg->init_val;
+}
+static int pt_msixctrl_reg_write(XenPCIPassthroughState *s,
+                                 XenPTReg *cfg_entry, uint16_t *value,
+                                 uint16_t dev_value, uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI-X */
+    if ((*value & PCI_MSIX_FLAGS_ENABLE)
+        && !(*value & PCI_MSIX_FLAGS_MASKALL)) {
+        if (s->msi_trans_en) {
+            PT_LOG("guest enabling MSI-X, disable MSI-INTx translation\n");
+            pt_disable_msi_translate(s);
+        }
+        pt_msix_update(s);
+    }
+
+    s->msix->enabled = !!(*value & PCI_MSIX_FLAGS_ENABLE);
+
+    return 0;
+}
+
+/* MSI-X Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_msix_tbl[] = {
+    /* Next Pointer reg */
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Message Control reg */
+    {
+        .offset     = PCI_MSI_FLAGS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x3FFF,
+        .emu_mask   = 0x0000,
+        .init       = pt_msixctrl_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_msixctrl_reg_write,
+        .u.w.restore  = NULL,
+    },
+    {
+        .size = 0,
+    },
+};
+
 
 /****************************
  * Capabilities
@@ -1664,6 +2080,48 @@ static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
 
     return pcie_size;
 }
+/* get MSI Capability Structure register group size */
+static uint8_t pt_msi_size_init(XenPCIPassthroughState *s,
+                                const XenPTRegGroupInfo *grp_reg,
+                                uint32_t base_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t msg_ctrl = 0;
+    uint8_t msi_size = 0xa;
+
+    msg_ctrl = pci_get_word(d->config + (base_offset + PCI_MSI_FLAGS));
+
+    /* check 64 bit address capable & Per-vector masking capable */
+    if (msg_ctrl & PCI_MSI_FLAGS_64BIT) {
+        msi_size += 4;
+    }
+    if (msg_ctrl & PCI_MSI_FLAGS_MASKBIT) {
+        msi_size += 10;
+    }
+
+    s->msi = g_malloc0(sizeof (XenPTMSI));
+    s->msi->pirq = -1;
+    PT_LOG("done\n");
+
+    return msi_size;
+}
+/* get MSI-X Capability Structure register group size */
+static uint8_t pt_msix_size_init(XenPCIPassthroughState *s,
+                                 const XenPTRegGroupInfo *grp_reg,
+                                 uint32_t base_offset)
+{
+    int ret = 0;
+
+    ret = pt_msix_init(s, base_offset);
+
+    if (ret == -1) {
+        hw_error("Internal error: Invalid pt_msix_init return value[%d]. "
+                 "I/O emulator exit.\n", ret);
+    }
+
+    return grp_reg->grp_size;
+}
+
 
 static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
     /* Header Type0 reg group */
@@ -1704,6 +2162,14 @@ static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
         .grp_size   = 0x04,
         .size_init  = pt_reg_grp_size_init,
     },
+    /* MSI Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_MSI,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0xFF,
+        .size_init   = pt_msi_size_init,
+        .emu_reg_tbl = pt_emu_reg_msi_tbl,
+    },
     /* PCI-X Capabilities List Item reg group */
     {
         .grp_id     = PCI_CAP_ID_PCIX,
@@ -1748,6 +2214,14 @@ static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
         .size_init   = pt_pcie_size_init,
         .emu_reg_tbl = pt_emu_reg_pcie_tbl,
     },
+    /* MSI-X Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_MSIX,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0x0C,
+        .size_init   = pt_msix_size_init,
+        .emu_reg_tbl = pt_emu_reg_msix_tbl,
+    },
     {
         .grp_size = 0,
     },
@@ -1908,8 +2382,11 @@ static int pt_init_pci_config(XenPCIPassthroughState *s)
     /* reinitialize all emulate register */
     pt_config_reinit(s);
 
+    /* set up MSI-INTx translation if supported */
+    ret = pt_enable_msi_translate(s);
+
     /* rebind machine_irq to device */
-    if (s->machine_irq != 0) {
+    if (ret < 0 && s->machine_irq != 0) {
         uint8_t e_device = PCI_SLOT(s->dev.devfn);
         uint8_t e_intx = pci_intx(s);
 
@@ -2043,6 +2520,14 @@ void pt_config_delete(XenPCIPassthroughState *s)
     struct XenPTRegGroup *reg_group, *next_grp;
     struct XenPTReg *reg, *next_reg;
 
+    /* free MSI/MSI-X info table */
+    if (s->msix) {
+        pt_msix_delete(s);
+    }
+    if (s->msi) {
+        g_free(s->msi);
+    }
+
     /* free Power Management info table */
     if (s->pm_state) {
         if (s->pm_state->pm_timer) {
diff --git a/hw/xen_pci_passthrough_msi.c b/hw/xen_pci_passthrough_msi.c
new file mode 100644
index 0000000..533aef4
--- /dev/null
+++ b/hw/xen_pci_passthrough_msi.c
@@ -0,0 +1,667 @@
+/*
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Jiang Yunhong <yunhong.jiang@intel.com>
+ *
+ * This file implements direct PCI assignment to a HVM guest
+ */
+
+#include <sys/mman.h>
+
+#include "xen_backend.h"
+#include "xen_pci_passthrough.h"
+#include "apic-msidef.h"
+
+
+#define AUTO_ASSIGN -1
+
+/* shift count for gflags */
+#define GFLAGS_SHIFT_DEST_ID        0
+#define GFLAGS_SHIFT_RH             8
+#define GFLAGS_SHIFT_DM             9
+#define GLFAGS_SHIFT_DELIV_MODE     12
+#define GLFAGS_SHIFT_TRG_MODE       15
+
+
+void pt_msi_set_enable(XenPCIPassthroughState *s, int en)
+{
+    uint16_t val = 0;
+    uint32_t address = 0;
+    PT_LOG("enable: %i\n", en);
+
+    if (!s->msi) {
+        return;
+    }
+
+    address = s->msi->ctrl_offset;
+    if (!address) {
+        return;
+    }
+
+    val = host_pci_get_word(s->real_device, address);
+    val &= ~PCI_MSI_FLAGS_ENABLE;
+    val |= en ? PCI_MSI_FLAGS_ENABLE : 0;
+    host_pci_set_word(s->real_device, address, val);
+
+    PT_LOG("done, address: %#x, val: %#x\n", address, val);
+}
+
+static void msix_set_enable(XenPCIPassthroughState *s, int en)
+{
+    uint16_t val = 0;
+    uint32_t address = 0;
+
+    if (!s->msix) {
+        return;
+    }
+
+    address = s->msix->ctrl_offset;
+    if (!address) {
+        return;
+    }
+
+    val = host_pci_get_word(s->real_device, address);
+    val &= ~PCI_MSIX_FLAGS_ENABLE;
+    if (en) {
+        val |= PCI_MSIX_FLAGS_ENABLE;
+    }
+    host_pci_set_word(s->real_device, address, val);
+}
+
+/*********************************/
+/* MSI virtualization functions */
+
+/*
+ * set up the physical MSI, but do not enable it
+ */
+int pt_msi_setup(XenPCIPassthroughState *s)
+{
+    int pirq = -1;
+    uint8_t gvec = 0;
+
+    if (!(s->msi->flags & PT_MSI_FLAG_UNINIT)) {
+        PT_LOG("Error: physical MSI setup attempted after initialization\n");
+        return -1;
+    }
+
+    gvec = s->msi->data & 0xFF;
+    if (!gvec) {
+        /* if gvec is 0, the guest is asking for a particular pirq that
+         * is passed as dest_id */
+        pirq = (s->msi->addr_hi & 0xffffff00) |
+               ((s->msi->addr_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
+        if (!pirq) {
+            /* this probably identifies a misconfiguration of the guest,
+             * try the emulated path */
+            pirq = -1;
+        } else {
+            PT_LOG("pt_msi_setup requested pirq = %d\n", pirq);
+        }
+    }
+
+    if (xc_physdev_map_pirq_msi(xen_xc, xen_domid, AUTO_ASSIGN, &pirq,
+                                PCI_DEVFN(s->real_device->dev,
+                                          s->real_device->func),
+                                s->real_device->bus, 0, 0)) {
+        PT_LOG("Error: Mapping of MSI failed.\n");
+        return -1;
+    }
+
+    if (pirq < 0) {
+        PT_LOG("Error: Invalid pirq number\n");
+        return -1;
+    }
+
+    s->msi->pirq = pirq;
+    PT_LOG("msi mapped with pirq %x\n", pirq);
+
+    return 0;
+}
+
+static uint32_t __get_msi_gflags(uint32_t data, uint64_t addr)
+{
+    uint32_t result = 0;
+    int rh, dm, dest_id, deliv_mode, trig_mode;
+
+    rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
+    dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
+    dest_id = (addr >> MSI_ADDR_DEST_ID_SHIFT) & 0xff;
+    deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
+    trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+
+    result = dest_id | (rh << GFLAGS_SHIFT_RH) | (dm << GFLAGS_SHIFT_DM) |
+             (deliv_mode << GLFAGS_SHIFT_DELIV_MODE) |
+             (trig_mode << GLFAGS_SHIFT_TRG_MODE);
+
+    return result;
+}
+
+int pt_msi_update(XenPCIPassthroughState *s)
+{
+    uint8_t gvec = 0;
+    uint32_t gflags = 0;
+    uint64_t addr = 0;
+    int ret = 0;
+
+    /* get vector, address, flags info, etc. */
+    gvec = s->msi->data & 0xFF;
+    addr = (uint64_t)s->msi->addr_hi << 32 | s->msi->addr_lo;
+    gflags = __get_msi_gflags(s->msi->data, addr);
+
+    PT_LOG("Update msi with pirq %x gvec %x gflags %x\n",
+           s->msi->pirq, gvec, gflags);
+
+    ret = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
+                                   s->msi->pirq, gflags, 0);
+
+    if (ret) {
+        PT_LOG("Error: Binding of MSI failed.\n");
+
+        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
+            PT_LOG("Error: Unmapping of MSI failed.\n");
+        }
+        s->msi->pirq = -1;
+        return ret;
+    }
+    return 0;
+}
+
+void pt_msi_disable(XenPCIPassthroughState *s)
+{
+    PCIDevice *d = &s->dev;
+    uint8_t gvec = 0;
+    uint32_t gflags = 0;
+    uint64_t addr = 0;
+    uint8_t e_device = 0;
+    uint8_t e_intx = 0;
+
+    pt_msi_set_enable(s, 0);
+
+    e_device = PCI_SLOT(d->devfn);
+    e_intx = pci_intx(s);
+
+    if (s->msi_trans_en) {
+        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
+                                    PT_IRQ_TYPE_MSI_TRANSLATE, 0,
+                                    e_device, e_intx, 0)) {
+            PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!\n");
+            goto out;
+        }
+    } else if (!(s->msi->flags & PT_MSI_FLAG_UNINIT)) {
+        /* get vector, address, flags info, etc. */
+        gvec = s->msi->data & 0xFF;
+        addr = (uint64_t)s->msi->addr_hi << 32 | s->msi->addr_lo;
+        gflags = __get_msi_gflags(s->msi->data, addr);
+
+        PT_LOG("Unbind msi with pirq %x, gvec %x\n",
+                s->msi->pirq, gvec);
+
+        if (xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
+                                        s->msi->pirq, gflags)) {
+            PT_LOG("Error: Unbinding of MSI failed. [%02x:%02x.%x]\n",
+                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                   PCI_FUNC(d->devfn));
+            goto out;
+        }
+    }
+
+    if (s->msi->pirq != -1) {
+        PT_LOG("Unmap msi with pirq %x\n", s->msi->pirq);
+
+        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
+            PT_LOG("Error: Unmapping of MSI failed. [%02x:%02x.%x]\n",
+                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                   PCI_FUNC(d->devfn));
+            goto out;
+        }
+    }
+
+out:
+    /* clear msi info */
+    s->msi->flags = 0;
+    s->msi->pirq = -1;
+    s->msi_trans_en = 0;
+}
+
+/* MSI-INTx translation virtualization functions */
+int pt_enable_msi_translate(XenPCIPassthroughState *s)
+{
+    uint8_t e_device = 0;
+    uint8_t e_intx = 0;
+
+    if (!(s->msi && s->msi_trans_cap)) {
+        return -1;
+    }
+
+    pt_msi_set_enable(s, 0);
+    s->msi_trans_en = 0;
+
+    if (pt_msi_setup(s)) {
+        PT_LOG("Error: MSI-INTx translation MSI setup failed, fallback\n");
+        return -1;
+    }
+
+    e_device = PCI_SLOT(s->dev.devfn);
+    /* fix virtual interrupt pin to INTA# */
+    e_intx = pci_intx(s);
+
+    if (xc_domain_bind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
+                              PT_IRQ_TYPE_MSI_TRANSLATE, 0,
+                              e_device, e_intx, 0)) {
+        PT_LOG("Error: MSI-INTx translation bind failed, fallback\n");
+
+        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
+            PT_LOG("Error: Unmapping of MSI failed.\n");
+        }
+        s->msi->pirq = -1;
+        return -1;
+    }
+
+    pt_msi_set_enable(s, 1);
+    s->msi_trans_en = 1;
+
+    return 0;
+}
+
+void pt_disable_msi_translate(XenPCIPassthroughState *s)
+{
+    uint8_t e_device = 0;
+    uint8_t e_intx = 0;
+
+    /* MSI_ENABLE bit should be disabled until the new handler is set */
+    pt_msi_set_enable(s, 0);
+
+    e_device = PCI_SLOT(s->dev.devfn);
+    e_intx = pci_intx(s);
+
+    if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
+                                 PT_IRQ_TYPE_MSI_TRANSLATE, 0,
+                                 e_device, e_intx, 0)) {
+        PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!\n");
+    }
+
+    if (s->machine_irq) {
+        if (xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq,
+                                       0, e_device, e_intx)) {
+            PT_LOG("Error: Rebinding of interrupt failed!\n");
+        }
+    }
+
+    s->msi_trans_en = 0;
+}
+
+/*********************************/
+/* MSI-X virtualization functions */
+
+static void mask_physical_msix_entry(XenPCIPassthroughState *s,
+                                     int entry_nr, int mask)
+{
+    void *phys_off;
+
+    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
+    *(uint32_t *)phys_off = mask;
+}
+
+static int pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr)
+{
+    XenMSIXEntry *entry = &s->msix->msix_entry[entry_nr];
+    int pirq = entry->pirq;
+    int gvec = entry->io_mem[2] & 0xff;
+    uint64_t gaddr = *(uint64_t *)&entry->io_mem[0];
+    uint32_t gflags = __get_msi_gflags(entry->io_mem[2], gaddr);
+    int ret;
+
+    if (!entry->flags) {
+        return 0;
+    }
+
+    if (!gvec) {
+        /* if gvec is 0, the guest is asking for a particular pirq that
+         * is passed as dest_id */
+        pirq = ((gaddr >> 32) & 0xffffff00) |
+               (((gaddr & 0xffffffff) >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
+        if (!pirq) {
+            /* this probably identifies a misconfiguration of the guest,
+             * try the emulated path */
+            pirq = -1;
+        } else {
+            PT_LOG("pt_msix_update_one requested pirq = %d\n", pirq);
+        }
+    }
+
+    /* Check if this entry is already mapped */
+    if (entry->pirq == -1) {
+        ret = xc_physdev_map_pirq_msi(xen_xc, xen_domid, AUTO_ASSIGN, &pirq,
+                                      PCI_DEVFN(s->real_device->dev,
+                                                s->real_device->func),
+                                      s->real_device->bus, entry_nr,
+                                      s->msix->table_base);
+        if (ret) {
+            PT_LOG("Error: Mapping msix entry %x\n", entry_nr);
+            return ret;
+        }
+        entry->pirq = pirq;
+    }
+
+    PT_LOG("Update msix entry %x with pirq %x gvec %x\n",
+            entry_nr, pirq, gvec);
+
+    ret = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags,
+                                   s->msix->mmio_base_addr);
+    if (ret) {
+        PT_LOG("Error: Updating msix irq info for entry %d\n", entry_nr);
+
+        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, entry->pirq)) {
+            PT_LOG("Error: Unmapping of MSI-X failed.\n");
+        }
+        entry->pirq = -1;
+        return ret;
+    }
+
+    entry->flags = 0;
+
+    return 0;
+}
+
+int pt_msix_update(XenPCIPassthroughState *s)
+{
+    XenPTMSIX *msix = s->msix;
+    int i;
+
+    for (i = 0; i < msix->total_entries; i++) {
+        pt_msix_update_one(s, i);
+    }
+
+    return 0;
+}
+
+void pt_msix_disable(XenPCIPassthroughState *s)
+{
+    PCIDevice *d = &s->dev;
+    uint8_t gvec = 0;
+    uint32_t gflags = 0;
+    uint64_t addr = 0;
+    int i = 0;
+    XenMSIXEntry *entry = NULL;
+
+    msix_set_enable(s, 0);
+
+    for (i = 0; i < s->msix->total_entries; i++) {
+        entry = &s->msix->msix_entry[i];
+
+        if (entry->pirq == -1) {
+            continue;
+        }
+
+        gvec = entry->io_mem[2] & 0xff;
+        addr = *(uint64_t *)&entry->io_mem[0];
+        gflags = __get_msi_gflags(entry->io_mem[2], addr);
+
+        PT_LOG("Unbind msix with pirq %x, gvec %x\n",
+                entry->pirq, gvec);
+
+        if (xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
+                                        entry->pirq, gflags)) {
+            PT_LOG("Error: Unbinding of MSI-X failed. [%02x:%02x.%x]\n",
+                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                   PCI_FUNC(d->devfn));
+        } else {
+            PT_LOG("Unmap msix with pirq %x\n", entry->pirq);
+
+            if (xc_physdev_unmap_pirq(xen_xc, xen_domid, entry->pirq)) {
+                PT_LOG("Error: Unmapping of MSI-X failed. [%02x:%02x.%x]\n",
+                       pci_bus_num(d->bus),
+                       PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));
+            }
+        }
+        /* clear msi-x info */
+        entry->pirq = -1;
+        entry->flags = 0;
+    }
+}
+
+int pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index)
+{
+    XenMSIXEntry *entry;
+    int i, ret;
+
+    if (!(s->msix && s->msix->bar_index == bar_index)) {
+        return 0;
+    }
+
+    for (i = 0; i < s->msix->total_entries; i++) {
+        entry = &s->msix->msix_entry[i];
+        if (entry->pirq != -1) {
+            ret = xc_domain_unbind_pt_irq(xen_xc, xen_domid, entry->pirq,
+                                          PT_IRQ_TYPE_MSI, 0, 0, 0, 0);
+            if (ret) {
+                PT_LOG("Error: unbind MSI-X entry %d failed\n", i);
+            }
+            entry->flags = 1;
+        }
+    }
+    pt_msix_update(s);
+
+    return 0;
+}
+
+static void pci_msix_invalid_write(void *opaque, target_phys_addr_t addr,
+                                   uint32_t val)
+{
+    PT_LOG("Error: Invalid write to MSI-X table,"
+           " only dword access is allowed.\n");
+}
+
+static void pci_msix_writel(void *opaque, target_phys_addr_t addr,
+                            uint32_t val)
+{
+    XenPCIPassthroughState *s = (XenPCIPassthroughState *)opaque;
+    XenPTMSIX *msix = s->msix;
+    XenMSIXEntry *entry;
+    int entry_nr, offset;
+    void *phys_off;
+    uint32_t vec_ctrl;
+
+    if (addr % 4) {
+        PT_LOG("Error: Unaligned dword access to MSI-X table, "
+                "addr "TARGET_FMT_plx"\n", addr);
+        return;
+    }
+
+    PT_LOG("addr: "TARGET_FMT_plx", val: %#x\n", addr, val);
+
+    entry_nr = addr / 16;
+    entry = &msix->msix_entry[entry_nr];
+    offset = (addr % 16) / 4;
+
+    /*
+     * If Xen intercepts the mask bit access, io_mem[3] may not be
+     * up-to-date. Read from hardware directly.
+     */
+    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
+    vec_ctrl = *(uint32_t *)phys_off;
+
+    if (offset != 3 && msix->enabled && !(vec_ctrl & 0x1)) {
+        PT_LOG("Error: Can't update msix entry %d since MSI-X is already "
+                "enabled.\n", entry_nr);
+        return;
+    }
+
+    if (offset != 3 && entry->io_mem[offset] != val) {
+        entry->flags = 1;
+    }
+    entry->io_mem[offset] = val;
+
+    if (offset == 3) {
+        if (msix->enabled && !(val & 0x1)) {
+            pt_msix_update_one(s, entry_nr);
+        }
+        mask_physical_msix_entry(s, entry_nr, entry->io_mem[3] & 0x1);
+    }
+}
+
+static CPUWriteMemoryFunc *pci_msix_write[] = {
+    pci_msix_invalid_write,
+    pci_msix_invalid_write,
+    pci_msix_writel
+};
+
+static uint32_t pci_msix_invalid_read(void *opaque, target_phys_addr_t addr)
+{
+    PT_LOG("Error: Invalid read from MSI-X table,"
+           " only dword access is allowed.\n");
+    return 0;
+}
+
+static uint32_t pci_msix_readl(void *opaque, target_phys_addr_t addr)
+{
+    XenPCIPassthroughState *s = (XenPCIPassthroughState *)opaque;
+    XenPTMSIX *msix = s->msix;
+    int entry_nr, offset;
+
+    if (addr % 4) {
+        PT_LOG("Error: Unaligned dword access to MSI-X table, "
+                "addr "TARGET_FMT_plx"\n", addr);
+        return 0;
+    }
+
+    PT_LOG("addr: "TARGET_FMT_plx"\n", addr);
+
+    entry_nr = addr / 16;
+    offset = (addr % 16) / 4;
+
+    return msix->msix_entry[entry_nr].io_mem[offset];
+}
+
+static CPUReadMemoryFunc *pci_msix_read[] = {
+    pci_msix_invalid_read,
+    pci_msix_invalid_read,
+    pci_msix_readl
+};
+
+int pt_add_msix_mapping(XenPCIPassthroughState *s, int bar_index)
+{
+    if (!(s->msix && s->msix->bar_index == bar_index)) {
+        return 0;
+    }
+
+    return xc_domain_memory_mapping(xen_xc, xen_domid,
+         s->msix->mmio_base_addr >> XC_PAGE_SHIFT,
+         (s->bases[bar_index].access.maddr + s->msix->table_off)
+             >> XC_PAGE_SHIFT,
+         (s->msix->total_entries * 16 + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
+         DPCI_ADD_MAPPING);
+}
+
+int pt_remove_msix_mapping(XenPCIPassthroughState *s, int bar_index)
+{
+    if (!(s->msix && s->msix->bar_index == bar_index)) {
+        return 0;
+    }
+
+    s->msix->mmio_base_addr = s->bases[bar_index].e_physbase
+        + s->msix->table_off;
+
+    cpu_register_physical_memory(s->msix->mmio_base_addr,
+                                 s->msix->total_entries * 16,
+                                 s->msix->mmio_index);
+
+    return xc_domain_memory_mapping(xen_xc, xen_domid,
+         s->msix->mmio_base_addr >> XC_PAGE_SHIFT,
+         (s->bases[bar_index].access.maddr + s->msix->table_off)
+             >> XC_PAGE_SHIFT,
+         (s->msix->total_entries * 16 + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
+         DPCI_REMOVE_MAPPING);
+}
+
+int pt_msix_init(XenPCIPassthroughState *s, int base)
+{
+    uint8_t id;
+    uint16_t control;
+    int i, total_entries, table_off, bar_index;
+    HostPCIDevice *d = s->real_device;
+    int fd;
+
+    id = host_pci_get_byte(d, base + PCI_CAP_LIST_ID);
+
+    if (id != PCI_CAP_ID_MSIX) {
+        PT_LOG("Error: Invalid id %#x base %#x\n", id, base);
+        return -1;
+    }
+
+    control = host_pci_get_word(d, base + 2);
+    total_entries = control & 0x7ff;
+    total_entries += 1;
+
+    s->msix = g_malloc0(sizeof (XenPTMSIX)
+                        + total_entries * sizeof (XenMSIXEntry));
+
+    s->msix->total_entries = total_entries;
+    for (i = 0; i < total_entries; i++) {
+        s->msix->msix_entry[i].pirq = -1;
+    }
+
+    s->msix->mmio_index =
+        cpu_register_io_memory(pci_msix_read, pci_msix_write,
+                               s, DEVICE_NATIVE_ENDIAN);
+
+    table_off = host_pci_get_long(d, base + PCI_MSIX_TABLE);
+    bar_index = s->msix->bar_index = table_off & PCI_MSIX_FLAGS_BIRMASK;
+    table_off = s->msix->table_off = table_off & ~PCI_MSIX_FLAGS_BIRMASK;
+    s->msix->table_base = s->real_device->io_regions[bar_index].base_addr;
+    PT_LOG("get MSI-X table bar base %#"PRIx64"\n", s->msix->table_base);
+
+    fd = open("/dev/mem", O_RDWR);
+    if (fd == -1) {
+        PT_LOG("Error: Can't open /dev/mem: %s\n", strerror(errno));
+        goto error_out;
+    }
+    PT_LOG("table_off = %#x, total_entries = %d\n", table_off, total_entries);
+    s->msix->table_offset_adjust = table_off & 0x0fff;
+    s->msix->phys_iomem_base =
+        mmap(0,
+             total_entries * 16 + s->msix->table_offset_adjust,
+             PROT_WRITE | PROT_READ,
+             MAP_SHARED | MAP_LOCKED,
+             fd,
+             s->msix->table_base + table_off - s->msix->table_offset_adjust);
+
+    if (s->msix->phys_iomem_base == MAP_FAILED) {
+        PT_LOG("Error: Can't map physical MSI-X table: %s\n", strerror(errno));
+        close(fd);
+        goto error_out;
+    }
+    s->msix->phys_iomem_base = (char *)s->msix->phys_iomem_base
+        + s->msix->table_offset_adjust;
+
+    close(fd);
+
+    PT_LOG("mapping physical MSI-X table to %p\n", s->msix->phys_iomem_base);
+    return 0;
+
+error_out:
+    g_free(s->msix);
+    s->msix = NULL;
+    return -1;
+}
+
+void pt_msix_delete(XenPCIPassthroughState *s)
+{
+    /* unmap the MSI-X memory mapped register area */
+    if (s->msix->phys_iomem_base) {
+        PT_LOG("unmapping physical MSI-X table from %lx\n",
+           (unsigned long)s->msix->phys_iomem_base);
+        munmap((char *)s->msix->phys_iomem_base - s->msix->table_offset_adjust,
+               s->msix->total_entries * 16 + s->msix->table_offset_adjust);
+    }
+
+    if (s->msix->mmio_index > 0) {
+        cpu_unregister_io_memory(s->msix->mmio_index);
+    }
+
+    g_free(s->msix);
+    s->msix = NULL;
+}
-- 
Anthony PERARD

^ permalink raw reply related	[flat|nested] 60+ messages in thread
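
Note on the gflags word built by __get_msi_gflags() in
hw/xen_pci_passthrough_msi.c. The following is a minimal standalone sketch of
the same packing, not part of the patch. The helper name msi_pack_gflags is
hypothetical. The GFLAGS_SHIFT_* values are copied from the patch (with the
GLFAGS spelling normalised), the MSI_ADDR_* shifts follow the apic-msidef.h
hunk, and the two MSI_DATA_* shifts are assumed to be 8 and 15.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in the MSI address and data words (assumed x86 layout). */
#define MSI_ADDR_DEST_MODE_SHIFT      2
#define MSI_ADDR_REDIRECTION_SHIFT    3
#define MSI_ADDR_DEST_ID_SHIFT        12
#define MSI_DATA_DELIVERY_MODE_SHIFT  8   /* assumption */
#define MSI_DATA_TRIGGER_SHIFT        15  /* assumption */

/* Bit positions in the packed gflags word, as defined by the patch. */
#define GFLAGS_SHIFT_DEST_ID     0
#define GFLAGS_SHIFT_RH          8
#define GFLAGS_SHIFT_DM          9
#define GFLAGS_SHIFT_DELIV_MODE  12
#define GFLAGS_SHIFT_TRG_MODE    15

/* Hypothetical standalone equivalent of __get_msi_gflags(). */
static uint32_t msi_pack_gflags(uint32_t data, uint64_t addr)
{
    uint32_t rh         = (uint32_t)(addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
    uint32_t dm         = (uint32_t)(addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
    uint32_t dest_id    = (uint32_t)(addr >> MSI_ADDR_DEST_ID_SHIFT) & 0xff;
    uint32_t deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
    uint32_t trig_mode  = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;

    uint32_t result = (dest_id << GFLAGS_SHIFT_DEST_ID) |
                      (rh << GFLAGS_SHIFT_RH) |
                      (dm << GFLAGS_SHIFT_DM) |
                      (deliv_mode << GFLAGS_SHIFT_DELIV_MODE) |
                      (trig_mode << GFLAGS_SHIFT_TRG_MODE);

    /* Every field lands in the low 16 bits of the word. */
    assert((result >> 16) == 0);
    return result;
}
```

Under these assumptions, an address of 0xFEE1200C (dest_id 0x12, RH and DM
set) with data 0x8141 (delivery mode 1, trigger bit set) packs to 0x9312.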

* [PATCH V3 10/10] Introduce Xen PCI Passthrough, MSI (3/3)
@ 2011-10-28 15:07   ` Anthony PERARD
  0 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-10-28 15:07 UTC (permalink / raw)
  To: QEMU-devel, Stefano Stabellini; +Cc: Anthony PERARD, Xen Devel, Shan Haitao

From: Jiang Yunhong <yunhong.jiang@intel.com>

Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Shan Haitao <haitao.shan@intel.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---
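Note on MSI-X table addressing, as used by pci_msix_writel() and
pci_msix_readl(): each table entry is 16 bytes, held as four dwords in
XenMSIXEntry.io_mem[] (address low, address high, data, vector control). The
sketch below only illustrates how a byte offset into the table splits into an
entry number and a dword index. The type MsixSlot and both helper names are
hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE  16  /* bytes per MSI-X table entry */

/* Dword indices inside one entry, matching the io_mem[] usage in the patch. */
enum {
    MSIX_DWORD_ADDR_LO     = 0,
    MSIX_DWORD_ADDR_HI     = 1,
    MSIX_DWORD_DATA        = 2,
    MSIX_DWORD_VECTOR_CTRL = 3
};

/* Hypothetical result type: which entry and which dword an access hits. */
typedef struct {
    int entry_nr;
    int dword;
} MsixSlot;

/* Split a dword-aligned byte offset into (entry, dword). */
static MsixSlot msix_decode(uint64_t table_offset)
{
    MsixSlot slot;

    /* The handlers reject unaligned accesses before decoding. */
    assert(table_offset % 4 == 0);

    slot.entry_nr = (int)(table_offset / MSIX_ENTRY_SIZE);
    slot.dword    = (int)((table_offset % MSIX_ENTRY_SIZE) / 4);
    return slot;
}

/* Returns 1 when the access targets the vector control (mask bit) dword. */
static int msix_is_vector_ctrl(uint64_t table_offset)
{
    return msix_decode(table_offset).dword == MSIX_DWORD_VECTOR_CTRL;
}
```

For instance, offset 0x2c decodes to entry 2, dword 3, which is the vector
control word that pci_msix_writel() treats specially.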
 Makefile.target                      |    1 +
 hw/apic-msidef.h                     |    2 +
 hw/xen_pci_passthrough.c             |   27 ++-
 hw/xen_pci_passthrough.h             |   55 +++
 hw/xen_pci_passthrough_config_init.c |  495 +++++++++++++++++++++++++-
 hw/xen_pci_passthrough_msi.c         |  667 ++++++++++++++++++++++++++++++++++
 6 files changed, 1240 insertions(+), 7 deletions(-)
 create mode 100644 hw/xen_pci_passthrough_msi.c
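
Note on the /dev/mem mapping in pt_msix_init(): mmap() needs a page-aligned
file offset, so the table address is rounded down and the remainder is kept in
table_offset_adjust, then added back to the returned pointer. A rough sketch of
that arithmetic follows. MsixMapPlan, msix_plan_mapping and the 4 KiB page size
are assumptions made for illustration. Unlike the patch, which derives the
adjust from table_off alone, this sketch derives it from the full physical
address; the two agree whenever the BAR base is page aligned.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED  4096u  /* assumption: 4 KiB pages */
#define MSIX_ENTRY_SIZE    16u    /* bytes per MSI-X table entry */

/* Hypothetical description of the mapping to request from mmap(). */
typedef struct {
    uint64_t map_offset;  /* page-aligned offset passed to mmap() */
    uint64_t map_len;     /* length covering the adjust plus the table */
    uint32_t adjust;      /* bytes between map_offset and the table start */
} MsixMapPlan;

static MsixMapPlan msix_plan_mapping(uint64_t bar_base, uint32_t table_off,
                                     uint32_t total_entries)
{
    MsixMapPlan plan;
    uint64_t phys = bar_base + table_off;

    /* The Table Size field is 11 bits, encoded as N-1, so 1..2048 entries. */
    assert(total_entries >= 1 && total_entries <= 2048);

    plan.adjust     = (uint32_t)(phys & (PAGE_SIZE_ASSUMED - 1));
    plan.map_offset = phys - plan.adjust;
    plan.map_len    = (uint64_t)total_entries * MSIX_ENTRY_SIZE + plan.adjust;
    return plan;
}
```

The pointer returned by mmap() corresponds to map_offset, so the table itself
starts adjust bytes further in, and a later munmap() has to be given the
original page-aligned pointer together with map_len.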

diff --git a/Makefile.target b/Makefile.target
index c32c688..17b8857 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -220,6 +220,7 @@ obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
 obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
+obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_msi.o
 
 # Inter-VM PCI shared memory
 CONFIG_IVSHMEM =
diff --git a/hw/apic-msidef.h b/hw/apic-msidef.h
index 3182f0b..6e2eb71 100644
--- a/hw/apic-msidef.h
+++ b/hw/apic-msidef.h
@@ -22,6 +22,8 @@
 
 #define MSI_ADDR_DEST_MODE_SHIFT        2
 
+#define MSI_ADDR_REDIRECTION_SHIFT      3
+
 #define MSI_ADDR_DEST_ID_SHIFT          12
 #define  MSI_ADDR_DEST_ID_MASK          0x00ffff0
 
diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
index b97c5b6..4b9eb74 100644
--- a/hw/xen_pci_passthrough.c
+++ b/hw/xen_pci_passthrough.c
@@ -417,6 +417,7 @@ static void pt_iomem_map(XenPCIPassthroughState *s, int i,
     }
 
     if (!first_map && old_ebase != -1) {
+        pt_add_msix_mapping(s, i);
         /* Remove old mapping */
         ret = xc_domain_memory_mapping(xen_xc, xen_domid,
                                old_ebase >> XC_PAGE_SHIFT,
@@ -441,6 +442,15 @@ static void pt_iomem_map(XenPCIPassthroughState *s, int i,
         if (ret != 0) {
             PT_LOG("Error: create new mapping failed!\n");
         }
+
+        ret = pt_remove_msix_mapping(s, i);
+        if (ret != 0) {
+            PT_LOG("Error: remove MSI-X mmio mapping failed!\n");
+        }
+
+        if (old_ebase != e_phys && old_ebase != -1) {
+            pt_msix_update_remap(s, i);
+        }
     }
 }
 
@@ -737,6 +747,9 @@ static int pt_initfn(PCIDevice *pcidev)
         mapped_machine_irq[machine_irq]++;
     }
 
+    /* set up MSI-INTx translation if supported */
+    rc = pt_enable_msi_translate(s);
+
     /* bind machine_irq to device */
     if (rc < 0 && machine_irq != 0) {
         uint8_t e_device = PCI_SLOT(s->dev.devfn);
@@ -765,7 +778,8 @@ static int pt_initfn(PCIDevice *pcidev)
 
 out:
     PT_LOG("Real physical device %02x:%02x.%x registered successfuly!\n"
-           "IRQ type = %s\n", bus, slot, func, "INTx");
+           "IRQ type = %s\n", bus, slot, func,
+           s->msi_trans_en ? "MSI-INTx" : "INTx");
 
     return 0;
 }
@@ -782,7 +796,7 @@ static int pt_unregister_device(PCIDevice *pcidev)
     e_intx = pci_intx(s);
     machine_irq = s->machine_irq;
 
-    if (machine_irq) {
+    if (s->msi_trans_en == 0 && machine_irq) {
         rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
                                      PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
         if (rc < 0) {
@@ -790,6 +804,13 @@ static int pt_unregister_device(PCIDevice *pcidev)
         }
     }
 
+    if (s->msi) {
+        pt_msi_disable(s);
+    }
+    if (s->msix) {
+        pt_msix_disable(s);
+    }
+
     if (machine_irq) {
         mapped_machine_irq[machine_irq]--;
 
@@ -824,6 +845,8 @@ static PCIDeviceInfo xen_pci_passthrough = {
     .is_express = 0,
     .qdev.props = (Property[]) {
         DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
+        DEFINE_PROP_BIT("msitranslate", XenPCIPassthroughState, msi_trans_cap,
+                        0, true),
         DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
                         0, false),
         DEFINE_PROP_END_OF_LIST(),
diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
index ebc04fd..5f404b0 100644
--- a/hw/xen_pci_passthrough.h
+++ b/hw/xen_pci_passthrough.h
@@ -63,6 +63,10 @@ typedef int (*conf_byte_restore)
 
 #define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
 
+/* MSI/MSI-X */
+#define PT_MSI_FLAG_UNINIT 0x1000
+#define PT_MSI_FLAG_MAPPED 0x2000
+
 
 typedef enum {
     GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
@@ -166,6 +170,34 @@ typedef struct XenPTRegGroup {
 } XenPTRegGroup;
 
 
+typedef struct XenPTMSI {
+    uint32_t flags;
+    uint32_t ctrl_offset; /* saved control offset */
+    int pirq;          /* guest pirq corresponding */
+    uint32_t addr_lo;  /* guest message address */
+    uint32_t addr_hi;  /* guest message upper address */
+    uint16_t data;     /* guest message data */
+} XenPTMSI;
+
+typedef struct XenMSIXEntry {
+    int pirq;        /* -1 means unmapped */
+    int flags;       /* flags indicating whether MSI ADDR or DATA is updated */
+    uint32_t io_mem[4];
+} XenMSIXEntry;
+typedef struct XenPTMSIX {
+    uint32_t ctrl_offset;
+    int enabled;
+    int total_entries;
+    int bar_index;
+    uint64_t table_base;
+    uint32_t table_off;
+    uint32_t table_offset_adjust; /* page align mmap */
+    uint64_t mmio_base_addr;
+    int mmio_index;
+    void *phys_iomem_base;
+    XenMSIXEntry msix_entry[0];
+} XenPTMSIX;
+
 typedef struct XenPTPM {
     QEMUTimer *pm_timer;  /* QEMUTimer struct */
     int no_soft_reset;    /* No Soft Reset flags */
@@ -189,6 +221,13 @@ struct XenPCIPassthroughState {
 
     uint32_t machine_irq;
 
+    XenPTMSI *msi;
+    XenPTMSIX *msix;
+
+    /* Physical MSI to guest INTx translation when possible */
+    uint32_t msi_trans_cap;
+    bool msi_trans_en;
+
     uint32_t power_mgmt;
     XenPTPM *pm_state;
 
@@ -222,4 +261,20 @@ static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
 }
 uint8_t pci_intx(XenPCIPassthroughState *ptdev);
 
+/* MSI/MSI-X */
+void pt_msi_set_enable(XenPCIPassthroughState *s, int en);
+int pt_msi_setup(XenPCIPassthroughState *s);
+int pt_msi_update(XenPCIPassthroughState *d);
+void pt_msi_disable(XenPCIPassthroughState *s);
+int pt_enable_msi_translate(XenPCIPassthroughState *s);
+void pt_disable_msi_translate(XenPCIPassthroughState *s);
+
+int pt_msix_init(XenPCIPassthroughState *s, int pos);
+void pt_msix_delete(XenPCIPassthroughState *s);
+int pt_msix_update(XenPCIPassthroughState *s);
+int pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index);
+void pt_msix_disable(XenPCIPassthroughState *s);
+int pt_add_msix_mapping(XenPCIPassthroughState *s, int bar_index);
+int pt_remove_msix_mapping(XenPCIPassthroughState *s, int bar_index);
+
 #endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
index 4103b59..b4238ee 100644
--- a/hw/xen_pci_passthrough_config_init.c
+++ b/hw/xen_pci_passthrough_config_init.c
@@ -375,11 +375,19 @@ static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
     throughable_mask = ~emu_mask & valid_mask;
 
     if (*value & PCI_COMMAND_INTX_DISABLE) {
-        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
-    } else {
-        if (s->machine_irq) {
+        if (s->msi_trans_en) {
+            pt_msi_set_enable(s, 0);
+        } else {
             throughable_mask |= PCI_COMMAND_INTX_DISABLE;
         }
+    } else {
+        if (s->msi_trans_en) {
+            pt_msi_set_enable(s, 1);
+        } else {
+            if (s->machine_irq) {
+                throughable_mask |= PCI_COMMAND_INTX_DISABLE;
+            }
+        }
     }
 
     *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
@@ -1248,13 +1256,21 @@ static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
     e_device = PCI_SLOT(s->dev.devfn);
     e_intx = pci_intx(s);
 
-    if (s->machine_irq) {
+    if (s->msi_trans_en == 0 && s->machine_irq) {
         if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
                                     PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
             PT_LOG("Error: Unbinding of interrupt failed!\n");
         }
     }
 
+    /* disable MSI/MSI-X and MSI-INTx translation */
+    if (s->msi) {
+        pt_msi_disable(s);
+    }
+    if (s->msix) {
+        pt_msix_disable(s);
+    }
+
     /* clear all virtual region address */
     for (i = 0; i < PCI_NUM_REGIONS; i++) {
         r = &d->io_regions[i];
@@ -1501,6 +1517,406 @@ static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
     },
 };
 
+/********************************
+ * MSI Capability
+ */
+
+/* Message Control register */
+static uint32_t pt_msgctrl_reg_init(XenPCIPassthroughState *s,
+                                    XenPTRegInfo *reg, uint32_t real_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t reg_field = 0;
+
+    /* use I/O device register's value as initial value */
+    reg_field = pci_get_word(d->config + real_offset);
+
+    if (reg_field & PCI_MSI_FLAGS_ENABLE) {
+        PT_LOG("MSI enabled already, disable first\n");
+        host_pci_set_word(s->real_device, real_offset,
+                          reg_field & ~PCI_MSI_FLAGS_ENABLE);
+    }
+    s->msi->flags |= reg_field | PT_MSI_FLAG_UNINIT;
+    s->msi->ctrl_offset = real_offset;
+
+    return reg->init_val;
+}
+static int pt_msgctrl_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                                uint16_t *value, uint16_t dev_value,
+                                uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+    PCIDevice *pd = (PCIDevice *)s;
+    uint16_t val;
+
+    /* Currently no support for multi-vector */
+    if (*value & PCI_MSI_FLAGS_QSIZE) {
+        PT_LOG("Warning: try to set more than 1 vector ctrl %x\n", *value);
+    }
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+    /* update the msi_info too */
+    s->msi->flags |= cfg_entry->data &
+        ~(PT_MSI_FLAG_UNINIT | PT_MSI_FLAG_MAPPED | PCI_MSI_FLAGS_ENABLE);
+
+    /* create value for writing to I/O device register */
+    val = *value;
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI */
+    if (val & PCI_MSI_FLAGS_ENABLE) {
+        /* setup MSI pirq for the first time */
+        if (s->msi->flags & PT_MSI_FLAG_UNINIT) {
+            if (s->msi_trans_en) {
+                PT_LOG("guest enabling MSI, disable MSI-INTx translation\n");
+                pt_disable_msi_translate(s);
+            } else {
+                /* Init physical one */
+                PT_LOG("setup msi for dev %x\n", pd->devfn);
+                if (pt_msi_setup(s)) {
+                    /* We do not broadcast the error to the framework code, so
+                     * that MSI errors are contained in MSI emulation code and
+                     * QEMU can go on running.
+                     * Guest MSI will not actually work in that case.
+                     */
+                    *value &= ~PCI_MSI_FLAGS_ENABLE;
+                    PT_LOG("Warning: Can not map MSI for dev %x\n", pd->devfn);
+                    return 0;
+                }
+            }
+            if (pt_msi_update(s)) {
+                *value &= ~PCI_MSI_FLAGS_ENABLE;
+                PT_LOG("Warning: Can not bind MSI for dev %x\n", pd->devfn);
+                return 0;
+            }
+            s->msi->flags &= ~PT_MSI_FLAG_UNINIT;
+            s->msi->flags |= PT_MSI_FLAG_MAPPED;
+        }
+        s->msi->flags |= PCI_MSI_FLAGS_ENABLE;
+    } else {
+        s->msi->flags &= ~PCI_MSI_FLAGS_ENABLE;
+    }
+
+    /* pass through MSI_ENABLE bit when no MSI-INTx translation */
+    if (!s->msi_trans_en) {
+        *value &= ~PCI_MSI_FLAGS_ENABLE;
+        *value |= val & PCI_MSI_FLAGS_ENABLE;
+    }
+
+    return 0;
+}
+
+/* initialize Message Upper Address register */
+static uint32_t pt_msgaddr64_reg_init(XenPCIPassthroughState *ptdev,
+                                      XenPTRegInfo *reg, uint32_t real_offset)
+{
+    /* no need to initialize in case of 32 bit type */
+    if (!(ptdev->msi->flags & PCI_MSI_FLAGS_64BIT)) {
+        return PT_INVALID_REG;
+    }
+
+    return reg->init_val;
+}
+/* this function will be called twice (for 32 bit and 64 bit type) */
+/* initialize Message Data register */
+static uint32_t pt_msgdata_reg_init(XenPCIPassthroughState *ptdev,
+                                    XenPTRegInfo *reg, uint32_t real_offset)
+{
+    uint32_t flags = ptdev->msi->flags;
+    uint32_t offset = reg->offset;
+
+    /* check whether the offset matches the type */
+    if (((offset == PCI_MSI_DATA_64) &&  (flags & PCI_MSI_FLAGS_64BIT)) ||
+        ((offset == PCI_MSI_DATA_32) && !(flags & PCI_MSI_FLAGS_64BIT))) {
+        return reg->init_val;
+    } else {
+        return PT_INVALID_REG;
+    }
+}
+
+/* write Message Address register */
+static int pt_msgaddr32_reg_write(XenPCIPassthroughState *s,
+                                  XenPTReg *cfg_entry, uint32_t *value,
+                                  uint32_t dev_value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+    uint32_t old_addr = cfg_entry->data;
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+    /* update the msi_info too */
+    s->msi->addr_lo = cfg_entry->data;
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI */
+    if (cfg_entry->data != old_addr) {
+        if (s->msi->flags & PT_MSI_FLAG_MAPPED) {
+            pt_msi_update(s);
+        }
+    }
+
+    return 0;
+}
+/* write Message Upper Address register */
+static int pt_msgaddr64_reg_write(XenPCIPassthroughState *s,
+                                  XenPTReg *cfg_entry, uint32_t *value,
+                                  uint32_t dev_value, uint32_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = 0;
+    uint32_t old_addr = cfg_entry->data;
+
+    /* check whether the type is 64 bit or not */
+    if (!(s->msi->flags & PCI_MSI_FLAGS_64BIT)) {
+        /* exit I/O emulator */
+        PT_LOG("Error: Upper Address write without 64 bit support\n");
+        return -1;
+    }
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+    /* update the msi_info too */
+    s->msi->addr_hi = cfg_entry->data;
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI */
+    if (cfg_entry->data != old_addr) {
+        if (s->msi->flags & PT_MSI_FLAG_MAPPED) {
+            pt_msi_update(s);
+        }
+    }
+
+    return 0;
+}
+
+
+/* this function will be called twice (for 32 bit and 64 bit type) */
+/* write Message Data register */
+static int pt_msgdata_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
+                                uint16_t *value, uint16_t dev_value,
+                                uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+    uint16_t old_data = cfg_entry->data;
+    uint32_t flags = s->msi->flags;
+    uint32_t offset = reg->offset;
+
+    /* check whether the offset matches the type */
+    if (!((offset == PCI_MSI_DATA_64) &&  (flags & PCI_MSI_FLAGS_64BIT)) &&
+        !((offset == PCI_MSI_DATA_32) && !(flags & PCI_MSI_FLAGS_64BIT))) {
+        /* exit I/O emulator */
+        PT_LOG("Error: the offset does not match the 32/64 bit type\n");
+        return -1;
+    }
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+    /* update the msi_info too */
+    s->msi->data = cfg_entry->data;
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI */
+    if (cfg_entry->data != old_data) {
+        if (flags & PT_MSI_FLAG_MAPPED) {
+            pt_msi_update(s);
+        }
+    }
+
+    return 0;
+}
+
+/* MSI Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_msi_tbl[] = {
+    /* Next Pointer reg */
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Message Control reg */
+    {
+        .offset     = PCI_MSI_FLAGS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0xFF8E,
+        .emu_mask   = 0x007F,
+        .init       = pt_msgctrl_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_msgctrl_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Message Address reg */
+    {
+        .offset     = PCI_MSI_ADDRESS_LO,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x00000003,
+        .emu_mask   = 0xFFFFFFFF,
+        .no_wb      = 1,
+        .init       = pt_common_reg_init,
+        .u.dw.read  = pt_long_reg_read,
+        .u.dw.write = pt_msgaddr32_reg_write,
+        .u.dw.restore = NULL,
+    },
+    /* Message Upper Address reg (if PCI_MSI_FLAGS_64BIT set) */
+    {
+        .offset     = PCI_MSI_ADDRESS_HI,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x00000000,
+        .emu_mask   = 0xFFFFFFFF,
+        .no_wb      = 1,
+        .init       = pt_msgaddr64_reg_init,
+        .u.dw.read  = pt_long_reg_read,
+        .u.dw.write = pt_msgaddr64_reg_write,
+        .u.dw.restore = NULL,
+    },
+    /* Message Data reg (16 bits of data for 32-bit devices) */
+    {
+        .offset     = PCI_MSI_DATA_32,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x0000,
+        .emu_mask   = 0xFFFF,
+        .no_wb      = 1,
+        .init       = pt_msgdata_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_msgdata_reg_write,
+        .u.w.restore  = NULL,
+    },
+    /* Message Data reg (16 bits of data for 64-bit devices) */
+    {
+        .offset     = PCI_MSI_DATA_64,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x0000,
+        .emu_mask   = 0xFFFF,
+        .no_wb      = 1,
+        .init       = pt_msgdata_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_msgdata_reg_write,
+        .u.w.restore  = NULL,
+    },
+    {
+        .size = 0,
+    },
+};
+
+
+/**************************************
+ * MSI-X Capability
+ */
+
+/* Message Control register for MSI-X */
+static uint32_t pt_msixctrl_reg_init(XenPCIPassthroughState *s,
+                                     XenPTRegInfo *reg, uint32_t real_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t reg_field = 0;
+
+    /* use I/O device register's value as initial value */
+    reg_field = pci_get_word(d->config + real_offset);
+
+    if (reg_field & PCI_MSIX_FLAGS_ENABLE) {
+        PT_LOG("MSIX enabled already, disable first\n");
+        host_pci_set_word(s->real_device, real_offset,
+                          reg_field & ~PCI_MSIX_FLAGS_ENABLE);
+    }
+
+    s->msix->ctrl_offset = real_offset;
+
+    return reg->init_val;
+}
+static int pt_msixctrl_reg_write(XenPCIPassthroughState *s,
+                                 XenPTReg *cfg_entry, uint16_t *value,
+                                 uint16_t dev_value, uint16_t valid_mask)
+{
+    XenPTRegInfo *reg = cfg_entry->reg;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = 0;
+
+    /* modify emulate register */
+    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
+    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    throughable_mask = ~reg->emu_mask & valid_mask;
+    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
+
+    /* update MSI-X */
+    if ((*value & PCI_MSIX_FLAGS_ENABLE)
+        && !(*value & PCI_MSIX_FLAGS_MASKALL)) {
+        if (s->msi_trans_en) {
+            PT_LOG("guest enabling MSI-X, disable MSI-INTx translation\n");
+            pt_disable_msi_translate(s);
+        }
+        pt_msix_update(s);
+    }
+
+    s->msix->enabled = !!(*value & PCI_MSIX_FLAGS_ENABLE);
+
+    return 0;
+}
+
+/* MSI-X Capability Structure reg static information table */
+static XenPTRegInfo pt_emu_reg_msix_tbl[] = {
+    /* Next Pointer reg */
+    {
+        .offset     = PCI_CAP_LIST_NEXT,
+        .size       = 1,
+        .init_val   = 0x00,
+        .ro_mask    = 0xFF,
+        .emu_mask   = 0xFF,
+        .init       = pt_ptr_reg_init,
+        .u.b.read   = pt_byte_reg_read,
+        .u.b.write  = pt_byte_reg_write,
+        .u.b.restore  = NULL,
+    },
+    /* Message Control reg */
+    {
+        .offset     = PCI_MSI_FLAGS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x3FFF,
+        .emu_mask   = 0x0000,
+        .init       = pt_msixctrl_reg_init,
+        .u.w.read   = pt_word_reg_read,
+        .u.w.write  = pt_msixctrl_reg_write,
+        .u.w.restore  = NULL,
+    },
+    {
+        .size = 0,
+    },
+};
+
 
 /****************************
  * Capabilities
@@ -1664,6 +2080,48 @@ static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
 
     return pcie_size;
 }
+/* get MSI Capability Structure register group size */
+static uint8_t pt_msi_size_init(XenPCIPassthroughState *s,
+                                const XenPTRegGroupInfo *grp_reg,
+                                uint32_t base_offset)
+{
+    PCIDevice *d = &s->dev;
+    uint16_t msg_ctrl = 0;
+    uint8_t msi_size = 0xa;
+
+    msg_ctrl = pci_get_word(d->config + (base_offset + PCI_MSI_FLAGS));
+
+    /* check 64 bit address capable & Per-vector masking capable */
+    if (msg_ctrl & PCI_MSI_FLAGS_64BIT) {
+        msi_size += 4;
+    }
+    if (msg_ctrl & PCI_MSI_FLAGS_MASKBIT) {
+        msi_size += 10;
+    }
+
+    s->msi = g_malloc0(sizeof (XenPTMSI));
+    s->msi->pirq = -1;
+    PT_LOG("done\n");
+
+    return msi_size;
+}
+/* get MSI-X Capability Structure register group size */
+static uint8_t pt_msix_size_init(XenPCIPassthroughState *s,
+                                 const XenPTRegGroupInfo *grp_reg,
+                                 uint32_t base_offset)
+{
+    int ret = 0;
+
+    ret = pt_msix_init(s, base_offset);
+
+    if (ret == -1) {
+        hw_error("Internal error: Invalid pt_msix_init return value[%d]. "
+                 "I/O emulator exit.\n", ret);
+    }
+
+    return grp_reg->grp_size;
+}
+
 
 static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
     /* Header Type0 reg group */
@@ -1704,6 +2162,14 @@ static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
         .grp_size   = 0x04,
         .size_init  = pt_reg_grp_size_init,
     },
+    /* MSI Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_MSI,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0xFF,
+        .size_init   = pt_msi_size_init,
+        .emu_reg_tbl = pt_emu_reg_msi_tbl,
+    },
     /* PCI-X Capabilities List Item reg group */
     {
         .grp_id     = PCI_CAP_ID_PCIX,
@@ -1748,6 +2214,14 @@ static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
         .size_init   = pt_pcie_size_init,
         .emu_reg_tbl = pt_emu_reg_pcie_tbl,
     },
+    /* MSI-X Capability Structure reg group */
+    {
+        .grp_id      = PCI_CAP_ID_MSIX,
+        .grp_type    = GRP_TYPE_EMU,
+        .grp_size    = 0x0C,
+        .size_init   = pt_msix_size_init,
+        .emu_reg_tbl = pt_emu_reg_msix_tbl,
+    },
     {
         .grp_size = 0,
     },
@@ -1908,8 +2382,11 @@ static int pt_init_pci_config(XenPCIPassthroughState *s)
     /* reinitialize all emulate register */
     pt_config_reinit(s);
 
+    /* set up MSI-INTx translation if supported */
+    ret = pt_enable_msi_translate(s);
+
     /* rebind machine_irq to device */
-    if (s->machine_irq != 0) {
+    if (ret < 0 && s->machine_irq != 0) {
         uint8_t e_device = PCI_SLOT(s->dev.devfn);
         uint8_t e_intx = pci_intx(s);
 
@@ -2043,6 +2520,14 @@ void pt_config_delete(XenPCIPassthroughState *s)
     struct XenPTRegGroup *reg_group, *next_grp;
     struct XenPTReg *reg, *next_reg;
 
+    /* free MSI/MSI-X info table */
+    if (s->msix) {
+        pt_msix_delete(s);
+    }
+    if (s->msi) {
+        g_free(s->msi);
+    }
+
     /* free Power Management info table */
     if (s->pm_state) {
         if (s->pm_state->pm_timer) {
diff --git a/hw/xen_pci_passthrough_msi.c b/hw/xen_pci_passthrough_msi.c
new file mode 100644
index 0000000..533aef4
--- /dev/null
+++ b/hw/xen_pci_passthrough_msi.c
@@ -0,0 +1,667 @@
+/*
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Jiang Yunhong <yunhong.jiang@intel.com>
+ *
+ * This file implements direct PCI assignment to a HVM guest
+ */
+
+#include <sys/mman.h>
+
+#include "xen_backend.h"
+#include "xen_pci_passthrough.h"
+#include "apic-msidef.h"
+
+
+#define AUTO_ASSIGN -1
+
+/* shift count for gflags */
+#define GFLAGS_SHIFT_DEST_ID        0
+#define GFLAGS_SHIFT_RH             8
+#define GFLAGS_SHIFT_DM             9
+#define GLFAGS_SHIFT_DELIV_MODE     12
+#define GLFAGS_SHIFT_TRG_MODE       15
+
+
+void pt_msi_set_enable(XenPCIPassthroughState *s, int en)
+{
+    uint16_t val = 0;
+    uint32_t address = 0;
+    PT_LOG("enable: %i\n", en);
+
+    if (!s->msi) {
+        return;
+    }
+
+    address = s->msi->ctrl_offset;
+    if (!address) {
+        return;
+    }
+
+    val = host_pci_get_word(s->real_device, address);
+    val &= ~PCI_MSI_FLAGS_ENABLE;
+    val |= en & PCI_MSI_FLAGS_ENABLE;
+    host_pci_set_word(s->real_device, address, val);
+
+    PT_LOG("done, address: %#x, val: %#x\n", address, val);
+}
+
+static void msix_set_enable(XenPCIPassthroughState *s, int en)
+{
+    uint16_t val = 0;
+    uint32_t address = 0;
+
+    if (!s->msix) {
+        return;
+    }
+
+    address = s->msix->ctrl_offset;
+    if (!address) {
+        return;
+    }
+
+    val = host_pci_get_word(s->real_device, address);
+    val &= ~PCI_MSIX_FLAGS_ENABLE;
+    if (en) {
+        val |= PCI_MSIX_FLAGS_ENABLE;
+    }
+    host_pci_set_word(s->real_device, address, val);
+}
+
+/*********************************/
+/* MSI virtualization functions */
+
+/*
+ * Set up the physical MSI, but do not enable it.
+ */
+int pt_msi_setup(XenPCIPassthroughState *s)
+{
+    int pirq = -1;
+    uint8_t gvec = 0;
+
+    if (!(s->msi->flags & PT_MSI_FLAG_UNINIT)) {
+        PT_LOG("Error: physical MSI setup requested after init\n");
+        return -1;
+    }
+
+    gvec = s->msi->data & 0xFF;
+    if (!gvec) {
+        /* if gvec is 0, the guest is asking for a particular pirq that
+         * is passed as dest_id */
+        pirq = (s->msi->addr_hi & 0xffffff00) |
+               ((s->msi->addr_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
+        if (!pirq) {
+            /* this probably identifies a misconfiguration of the guest,
+             * try the emulated path */
+            pirq = -1;
+        } else {
+            PT_LOG("pt_msi_setup requested pirq = %d\n", pirq);
+        }
+    }
+
+    if (xc_physdev_map_pirq_msi(xen_xc, xen_domid, AUTO_ASSIGN, &pirq,
+                                PCI_DEVFN(s->real_device->dev,
+                                          s->real_device->func),
+                                s->real_device->bus, 0, 0)) {
+        PT_LOG("Error: Mapping of MSI failed.\n");
+        return -1;
+    }
+
+    if (pirq < 0) {
+        PT_LOG("Error: Invalid pirq number\n");
+        return -1;
+    }
+
+    s->msi->pirq = pirq;
+    PT_LOG("msi mapped with pirq %x\n", pirq);
+
+    return 0;
+}
+
+static uint32_t __get_msi_gflags(uint32_t data, uint64_t addr)
+{
+    uint32_t result = 0;
+    int rh, dm, dest_id, deliv_mode, trig_mode;
+
+    rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
+    dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
+    dest_id = (addr >> MSI_ADDR_DEST_ID_SHIFT) & 0xff;
+    deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
+    trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+
+    result = dest_id | (rh << GFLAGS_SHIFT_RH) | (dm << GFLAGS_SHIFT_DM) |
+             (deliv_mode << GLFAGS_SHIFT_DELIV_MODE) |
+             (trig_mode << GLFAGS_SHIFT_TRG_MODE);
+
+    return result;
+}
+
+int pt_msi_update(XenPCIPassthroughState *s)
+{
+    uint8_t gvec = 0;
+    uint32_t gflags = 0;
+    uint64_t addr = 0;
+    int ret = 0;
+
+    /* get vector, address, flags info, etc. */
+    gvec = s->msi->data & 0xFF;
+    addr = (uint64_t)s->msi->addr_hi << 32 | s->msi->addr_lo;
+    gflags = __get_msi_gflags(s->msi->data, addr);
+
+    PT_LOG("Update msi with pirq %x gvec %x gflags %x\n",
+           s->msi->pirq, gvec, gflags);
+
+    ret = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
+                                   s->msi->pirq, gflags, 0);
+
+    if (ret) {
+        PT_LOG("Error: Binding of MSI failed.\n");
+
+        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
+            PT_LOG("Error: Unmapping of MSI failed.\n");
+        }
+        s->msi->pirq = -1;
+        return ret;
+    }
+    return 0;
+}
+
+void pt_msi_disable(XenPCIPassthroughState *s)
+{
+    PCIDevice *d = &s->dev;
+    uint8_t gvec = 0;
+    uint32_t gflags = 0;
+    uint64_t addr = 0;
+    uint8_t e_device = 0;
+    uint8_t e_intx = 0;
+
+    pt_msi_set_enable(s, 0);
+
+    e_device = PCI_SLOT(d->devfn);
+    e_intx = pci_intx(s);
+
+    if (s->msi_trans_en) {
+        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
+                                    PT_IRQ_TYPE_MSI_TRANSLATE, 0,
+                                    e_device, e_intx, 0)) {
+            PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!\n");
+            goto out;
+        }
+    } else if (!(s->msi->flags & PT_MSI_FLAG_UNINIT)) {
+        /* get vector, address, flags info, etc. */
+        gvec = s->msi->data & 0xFF;
+        addr = (uint64_t)s->msi->addr_hi << 32 | s->msi->addr_lo;
+        gflags = __get_msi_gflags(s->msi->data, addr);
+
+        PT_LOG("Unbind msi with pirq %x, gvec %x\n",
+                s->msi->pirq, gvec);
+
+        if (xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
+                                        s->msi->pirq, gflags)) {
+            PT_LOG("Error: Unbinding of MSI failed. [%02x:%02x.%x]\n",
+                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                   PCI_FUNC(d->devfn));
+            goto out;
+        }
+    }
+
+    if (s->msi->pirq != -1) {
+        PT_LOG("Unmap msi with pirq %x\n", s->msi->pirq);
+
+        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
+            PT_LOG("Error: Unmapping of MSI failed. [%02x:%02x.%x]\n",
+                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                   PCI_FUNC(d->devfn));
+            goto out;
+        }
+    }
+
+out:
+    /* clear msi info */
+    s->msi->flags = 0;
+    s->msi->pirq = -1;
+    s->msi_trans_en = 0;
+}
+
+/* MSI-INTx translation virtualization functions */
+int pt_enable_msi_translate(XenPCIPassthroughState *s)
+{
+    uint8_t e_device = 0;
+    uint8_t e_intx = 0;
+
+    if (!(s->msi && s->msi_trans_cap)) {
+        return -1;
+    }
+
+    pt_msi_set_enable(s, 0);
+    s->msi_trans_en = 0;
+
+    if (pt_msi_setup(s)) {
+        PT_LOG("Error: MSI-INTx translation MSI setup failed, fallback\n");
+        return -1;
+    }
+
+    e_device = PCI_SLOT(s->dev.devfn);
+    /* fix virtual interrupt pin to INTA# */
+    e_intx = pci_intx(s);
+
+    if (xc_domain_bind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
+                              PT_IRQ_TYPE_MSI_TRANSLATE, 0,
+                              e_device, e_intx, 0)) {
+        PT_LOG("Error: MSI-INTx translation bind failed, fallback\n");
+
+        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
+            PT_LOG("Error: Unmapping of MSI failed.\n");
+        }
+        s->msi->pirq = -1;
+        return -1;
+    }
+
+    pt_msi_set_enable(s, 1);
+    s->msi_trans_en = 1;
+
+    return 0;
+}
+
+void pt_disable_msi_translate(XenPCIPassthroughState *s)
+{
+    uint8_t e_device = 0;
+    uint8_t e_intx = 0;
+
+    /* MSI_ENABLE bit should be disabled until the new handler is set */
+    pt_msi_set_enable(s, 0);
+
+    e_device = PCI_SLOT(s->dev.devfn);
+    e_intx = pci_intx(s);
+
+    if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
+                                 PT_IRQ_TYPE_MSI_TRANSLATE, 0,
+                                 e_device, e_intx, 0)) {
+        PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!\n");
+    }
+
+    if (s->machine_irq) {
+        if (xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq,
+                                       0, e_device, e_intx)) {
+            PT_LOG("Error: Rebinding of interrupt failed!\n");
+        }
+    }
+
+    s->msi_trans_en = 0;
+}
+
+/*********************************/
+/* MSI-X virtualization functions */
+
+static void mask_physical_msix_entry(XenPCIPassthroughState *s,
+                                     int entry_nr, int mask)
+{
+    void *phys_off;
+
+    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
+    *(uint32_t *)phys_off = mask;
+}
+
+static int pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr)
+{
+    XenMSIXEntry *entry = &s->msix->msix_entry[entry_nr];
+    int pirq = entry->pirq;
+    int gvec = entry->io_mem[2] & 0xff;
+    uint64_t gaddr = *(uint64_t *)&entry->io_mem[0];
+    uint32_t gflags = __get_msi_gflags(entry->io_mem[2], gaddr);
+    int ret;
+
+    if (!entry->flags) {
+        return 0;
+    }
+
+    if (!gvec) {
+        /* if gvec is 0, the guest is asking for a particular pirq that
+         * is passed as dest_id */
+        pirq = ((gaddr >> 32) & 0xffffff00) |
+               (((gaddr & 0xffffffff) >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
+        if (!pirq) {
+            /* this probably identifies a misconfiguration of the guest,
+             * try the emulated path */
+            pirq = -1;
+        } else {
+            PT_LOG("pt_msix_update_one requested pirq = %d\n", pirq);
+        }
+    }
+
+    /* Check if this entry is already mapped */
+    if (entry->pirq == -1) {
+        ret = xc_physdev_map_pirq_msi(xen_xc, xen_domid, AUTO_ASSIGN, &pirq,
+                                      PCI_DEVFN(s->real_device->dev,
+                                                s->real_device->func),
+                                      s->real_device->bus, entry_nr,
+                                      s->msix->table_base);
+        if (ret) {
+            PT_LOG("Error: Mapping msix entry %x\n", entry_nr);
+            return ret;
+        }
+        entry->pirq = pirq;
+    }
+
+    PT_LOG("Update msix entry %x with pirq %x gvec %x\n",
+            entry_nr, pirq, gvec);
+
+    ret = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags,
+                                   s->msix->mmio_base_addr);
+    if (ret) {
+        PT_LOG("Error: Updating msix irq info for entry %d\n", entry_nr);
+
+        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, entry->pirq)) {
+            PT_LOG("Error: Unmapping of MSI-X failed.\n");
+        }
+        entry->pirq = -1;
+        return ret;
+    }
+
+    entry->flags = 0;
+
+    return 0;
+}
+
+int pt_msix_update(XenPCIPassthroughState *s)
+{
+    XenPTMSIX *msix = s->msix;
+    int i;
+
+    for (i = 0; i < msix->total_entries; i++) {
+        pt_msix_update_one(s, i);
+    }
+
+    return 0;
+}
+
+void pt_msix_disable(XenPCIPassthroughState *s)
+{
+    PCIDevice *d = &s->dev;
+    uint8_t gvec = 0;
+    uint32_t gflags = 0;
+    uint64_t addr = 0;
+    int i = 0;
+    XenMSIXEntry *entry = NULL;
+
+    msix_set_enable(s, 0);
+
+    for (i = 0; i < s->msix->total_entries; i++) {
+        entry = &s->msix->msix_entry[i];
+
+        if (entry->pirq == -1) {
+            continue;
+        }
+
+        gvec = entry->io_mem[2] & 0xff;
+        addr = *(uint64_t *)&entry->io_mem[0];
+        gflags = __get_msi_gflags(entry->io_mem[2], addr);
+
+        PT_LOG("Unbind msix with pirq %x, gvec %x\n",
+                entry->pirq, gvec);
+
+        if (xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
+                                        entry->pirq, gflags)) {
+            PT_LOG("Error: Unbinding of MSI-X failed. [%02x:%02x.%x]\n",
+                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
+                   PCI_FUNC(d->devfn));
+        } else {
+            PT_LOG("Unmap msix with pirq %x\n", entry->pirq);
+
+            if (xc_physdev_unmap_pirq(xen_xc, xen_domid, entry->pirq)) {
+                PT_LOG("Error: Unmapping of MSI-X failed. [%02x:%02x.%x]\n",
+                       pci_bus_num(d->bus),
+                       PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));
+            }
+        }
+        /* clear msi-x info */
+        entry->pirq = -1;
+        entry->flags = 0;
+    }
+}
+
+int pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index)
+{
+    XenMSIXEntry *entry;
+    int i, ret;
+
+    if (!(s->msix && s->msix->bar_index == bar_index)) {
+        return 0;
+    }
+
+    for (i = 0; i < s->msix->total_entries; i++) {
+        entry = &s->msix->msix_entry[i];
+        if (entry->pirq != -1) {
+            ret = xc_domain_unbind_pt_irq(xen_xc, xen_domid, entry->pirq,
+                                          PT_IRQ_TYPE_MSI, 0, 0, 0, 0);
+            if (ret) {
+                PT_LOG("Error: unbind MSI-X entry %d failed\n", entry->pirq);
+            }
+            entry->flags = 1;
+        }
+    }
+    pt_msix_update(s);
+
+    return 0;
+}
+
+static void pci_msix_invalid_write(void *opaque, target_phys_addr_t addr,
+                                   uint32_t val)
+{
+    PT_LOG("Error: Invalid write to MSI-X table,"
+           " only dword access is allowed.\n");
+}
+
+static void pci_msix_writel(void *opaque, target_phys_addr_t addr,
+                            uint32_t val)
+{
+    XenPCIPassthroughState *s = (XenPCIPassthroughState *)opaque;
+    XenPTMSIX *msix = s->msix;
+    XenMSIXEntry *entry;
+    int entry_nr, offset;
+    void *phys_off;
+    uint32_t vec_ctrl;
+
+    if (addr % 4) {
+        PT_LOG("Error: Unaligned dword access to MSI-X table, "
+                "addr %016"PRIx64"\n", addr);
+        return;
+    }
+
+    PT_LOG("addr: "TARGET_FMT_plx", val: %#x\n", addr, val);
+
+    entry_nr = addr / 16;
+    entry = &msix->msix_entry[entry_nr];
+    offset = (addr % 16) / 4;
+
+    /*
+     * If Xen intercepts the mask bit access, io_mem[3] may not be
+     * up-to-date. Read from hardware directly.
+     */
+    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
+    vec_ctrl = *(uint32_t *)phys_off;
+
+    if (offset != 3 && msix->enabled && !(vec_ctrl & 0x1)) {
+        PT_LOG("Error: Can't update msix entry %d since MSI-X is already "
+                "enabled.\n", entry_nr);
+        return;
+    }
+
+    if (offset != 3 && entry->io_mem[offset] != val) {
+        entry->flags = 1;
+    }
+    entry->io_mem[offset] = val;
+
+    if (offset == 3) {
+        if (msix->enabled && !(val & 0x1)) {
+            pt_msix_update_one(s, entry_nr);
+        }
+        mask_physical_msix_entry(s, entry_nr, entry->io_mem[3] & 0x1);
+    }
+}
+
+static CPUWriteMemoryFunc *pci_msix_write[] = {
+    pci_msix_invalid_write,
+    pci_msix_invalid_write,
+    pci_msix_writel
+};
+
+static uint32_t pci_msix_invalid_read(void *opaque, target_phys_addr_t addr)
+{
+    PT_LOG("Error: Invalid read to MSI-X table,"
+           " only dword access is allowed.\n");
+    return 0;
+}
+
+static uint32_t pci_msix_readl(void *opaque, target_phys_addr_t addr)
+{
+    XenPCIPassthroughState *s = (XenPCIPassthroughState *)opaque;
+    XenPTMSIX *msix = s->msix;
+    int entry_nr, offset;
+
+    if (addr % 4) {
+        PT_LOG("Error: Unaligned dword access to MSI-X table, "
+                "addr %016"PRIx64"\n", addr);
+        return 0;
+    }
+
+    PT_LOG("addr: "TARGET_FMT_plx"\n", addr);
+
+    entry_nr = addr / 16;
+    offset = (addr % 16) / 4;
+
+    return msix->msix_entry[entry_nr].io_mem[offset];
+}
+
+static CPUReadMemoryFunc *pci_msix_read[] = {
+    pci_msix_invalid_read,
+    pci_msix_invalid_read,
+    pci_msix_readl
+};
+
+int pt_add_msix_mapping(XenPCIPassthroughState *s, int bar_index)
+{
+    if (!(s->msix && s->msix->bar_index == bar_index)) {
+        return 0;
+    }
+
+    return xc_domain_memory_mapping(xen_xc, xen_domid,
+         s->msix->mmio_base_addr >> XC_PAGE_SHIFT,
+         (s->bases[bar_index].access.maddr + s->msix->table_off)
+             >> XC_PAGE_SHIFT,
+         (s->msix->total_entries * 16 + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
+         DPCI_ADD_MAPPING);
+}
+
+int pt_remove_msix_mapping(XenPCIPassthroughState *s, int bar_index)
+{
+    if (!(s->msix && s->msix->bar_index == bar_index)) {
+        return 0;
+    }
+
+    s->msix->mmio_base_addr = s->bases[bar_index].e_physbase
+        + s->msix->table_off;
+
+    cpu_register_physical_memory(s->msix->mmio_base_addr,
+                                 s->msix->total_entries * 16,
+                                 s->msix->mmio_index);
+
+    return xc_domain_memory_mapping(xen_xc, xen_domid,
+         s->msix->mmio_base_addr >> XC_PAGE_SHIFT,
+         (s->bases[bar_index].access.maddr + s->msix->table_off)
+             >> XC_PAGE_SHIFT,
+         (s->msix->total_entries * 16 + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
+         DPCI_REMOVE_MAPPING);
+}
+
+int pt_msix_init(XenPCIPassthroughState *s, int base)
+{
+    uint8_t id;
+    uint16_t control;
+    int i, total_entries, table_off, bar_index;
+    HostPCIDevice *d = s->real_device;
+    int fd;
+
+    id = host_pci_get_byte(d, base + PCI_CAP_LIST_ID);
+
+    if (id != PCI_CAP_ID_MSIX) {
+        PT_LOG("Error: Invalid id %#x base %#x\n", id, base);
+        return -1;
+    }
+
+    control = host_pci_get_word(d, base + 2);
+    total_entries = control & 0x7ff;
+    total_entries += 1;
+
+    s->msix = g_malloc0(sizeof (XenPTMSIX)
+                        + total_entries * sizeof (XenMSIXEntry));
+
+    s->msix->total_entries = total_entries;
+    for (i = 0; i < total_entries; i++) {
+        s->msix->msix_entry[i].pirq = -1;
+    }
+
+    s->msix->mmio_index =
+        cpu_register_io_memory(pci_msix_read, pci_msix_write,
+                               s, DEVICE_NATIVE_ENDIAN);
+
+    table_off = host_pci_get_long(d, base + PCI_MSIX_TABLE);
+    bar_index = s->msix->bar_index = table_off & PCI_MSIX_FLAGS_BIRMASK;
+    table_off = s->msix->table_off = table_off & ~PCI_MSIX_FLAGS_BIRMASK;
+    s->msix->table_base = s->real_device->io_regions[bar_index].base_addr;
+    PT_LOG("get MSI-X table bar base %#"PRIx64"\n", s->msix->table_base);
+
+    fd = open("/dev/mem", O_RDWR);
+    if (fd == -1) {
+        PT_LOG("Error: Can't open /dev/mem: %s\n", strerror(errno));
+        goto error_out;
+    }
+    PT_LOG("table_off = %#x, total_entries = %d\n", table_off, total_entries);
+    s->msix->table_offset_adjust = table_off & 0x0fff;
+    s->msix->phys_iomem_base =
+        mmap(0,
+             total_entries * 16 + s->msix->table_offset_adjust,
+             PROT_WRITE | PROT_READ,
+             MAP_SHARED | MAP_LOCKED,
+             fd,
+             s->msix->table_base + table_off - s->msix->table_offset_adjust);
+
+    if (s->msix->phys_iomem_base == MAP_FAILED) {
+        PT_LOG("Error: Can't map physical MSI-X table: %s\n", strerror(errno));
+        close(fd);
+        goto error_out;
+    }
+    s->msix->phys_iomem_base = (char *)s->msix->phys_iomem_base
+        + s->msix->table_offset_adjust;
+
+    close(fd);
+
+    PT_LOG("mapping physical MSI-X table to %p\n", s->msix->phys_iomem_base);
+    return 0;
+
+error_out:
+    g_free(s->msix);
+    s->msix = NULL;
+    return -1;
+}
+
+void pt_msix_delete(XenPCIPassthroughState *s)
+{
+    /* unmap the MSI-X memory mapped register area */
+    if (s->msix->phys_iomem_base) {
+        PT_LOG("unmapping physical MSI-X table from %lx\n",
+           (unsigned long)s->msix->phys_iomem_base);
+        munmap(s->msix->phys_iomem_base, s->msix->total_entries * 16 +
+           s->msix->table_offset_adjust);
+    }
+
+    if (s->msix->mmio_index > 0) {
+        cpu_unregister_io_memory(s->msix->mmio_index);
+    }
+
+    g_free(s->msix);
+    s->msix = NULL;
+}
-- 
Anthony PERARD
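
The MSI-X table handlers in the patch above rely on a few small pieces of
arithmetic: the table offset is split into an entry index and a dword index,
the Table Size field is stored as N - 1, and the table offset is split into a
page-aligned part for mmap() plus a remainder. The following is only a
minimal sketch of that arithmetic. The helper names are hypothetical and do
not appear in the patch, and a 4 KiB page size is assumed, matching the
0x0fff mask used in pt_msix_init().

```c
#include <stdint.h>

/* Each MSI-X table entry is 16 bytes: four 32-bit dwords
 * (address low, address high, data, vector control). */
#define MSIX_ENTRY_SIZE 16u
#define MSIX_PAGE_MASK  0x0fffu  /* assumption: 4 KiB pages */

/* Index of the table entry that a byte offset into the table falls in. */
static inline unsigned msix_entry_nr(uint64_t addr)
{
    return (unsigned)(addr / MSIX_ENTRY_SIZE);
}

/* Dword within that entry (0..3); dword 3 is the vector control word. */
static inline unsigned msix_dword(uint64_t addr)
{
    return (unsigned)((addr % MSIX_ENTRY_SIZE) / 4);
}

/* Bits 10:0 of Message Control hold the table size encoded as N - 1. */
static inline unsigned msix_total_entries(uint16_t control)
{
    return (control & 0x7ffu) + 1;
}

/* Part of the table offset that lies below a page boundary. The mapping
 * starts at (offset - adjust) and the pointer is advanced by adjust. */
static inline unsigned msix_offset_adjust(uint32_t table_off)
{
    return table_off & MSIX_PAGE_MASK;
}
```

For example, a guest write at table offset 0x2c lands in entry 2, dword 3,
which is the vector control word of that entry.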

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH V3 05/10] pci_regs: Fix value of PCI_EXP_TYPE_RC_EC.
  2011-10-28 15:07   ` Anthony PERARD
@ 2011-11-04  7:36     ` Isaku Yamahata
  -1 siblings, 0 replies; 60+ messages in thread
From: Isaku Yamahata @ 2011-11-04  7:36 UTC (permalink / raw)
  To: Anthony PERARD, Stefan Hajnoczi, Michael S. Tsirkin
  Cc: qemu-trivial, Xen Devel, QEMU-devel, Stefano Stabellini

Good catch.
This patch (and the next one 6/10) can be picked up independently
through qemu-trivial or pci-tree.
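
As a small illustration of why the old value could never match: the
Device/Port Type field is four bits wide (bits 7:4 of the PCI Express
Capabilities register, mask PCI_EXP_FLAGS_TYPE 0x00f0), so an extracted type
is at most 0xf and can never equal 0x10. This is only a sketch; the two
helper functions are hypothetical and are not part of pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_EXP_FLAGS_TYPE  0x00f0  /* Device/Port type, bits 7:4 */
#define PCI_EXP_TYPE_RC_EC  0xa     /* Root Complex Event Collector */

/* Hypothetical helper: extract the 4-bit Device/Port Type field. */
static inline unsigned pcie_port_type(uint16_t flags)
{
    return (flags & PCI_EXP_FLAGS_TYPE) >> 4;
}

/* Hypothetical helper: true if the function is an RC Event Collector. */
static inline bool pcie_is_rc_event_collector(uint16_t flags)
{
    return pcie_port_type(flags) == PCI_EXP_TYPE_RC_EC;
}
```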

On Fri, Oct 28, 2011 at 04:07:31PM +0100, Anthony PERARD wrote:
> Value checked against the PCI Express Base Specification rev 1.1
> 
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  hw/pci_regs.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/hw/pci_regs.h b/hw/pci_regs.h
> index e8357c3..6b42515 100644
> --- a/hw/pci_regs.h
> +++ b/hw/pci_regs.h
> @@ -393,7 +393,7 @@
>  #define  PCI_EXP_TYPE_DOWNSTREAM 0x6	/* Downstream Port */
>  #define  PCI_EXP_TYPE_PCI_BRIDGE 0x7	/* PCI/PCI-X Bridge */
>  #define  PCI_EXP_TYPE_RC_END	0x9	/* Root Complex Integrated Endpoint */
> -#define  PCI_EXP_TYPE_RC_EC	0x10	/* Root Complex Event Collector */
> +#define  PCI_EXP_TYPE_RC_EC     0xa     /* Root Complex Event Collector */
>  #define PCI_EXP_FLAGS_SLOT	0x0100	/* Slot implemented */
>  #define PCI_EXP_FLAGS_IRQ	0x3e00	/* Interrupt message number */
>  #define PCI_EXP_DEVCAP		4	/* Device capabilities */
> -- 
> Anthony PERARD
> 
> 

-- 
yamahata

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 02/10] Introduce HostPCIDevice to access a pci device on the host.
  2011-10-28 15:07   ` Anthony PERARD
@ 2011-11-04 17:49     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-04 17:49 UTC (permalink / raw)
  To: Anthony PERARD; +Cc: Xen Devel, QEMU-devel, Stefano Stabellini

> +static unsigned long get_value(HostPCIDevice *d, const char *name)
> +{
> +    char path[PATH_MAX];
> +    FILE *f;
> +    unsigned long value;
> +
> +    path_to(d, name, path, sizeof (path));
> +    f = fopen(path, "r");
> +    if (!f) {
> +        fprintf(stderr, "Error: Can't open %s: %s\n", path, strerror(errno));
> +        return -1;

So the declaration is 'unsigned long' but you return -1 here.

Should the declaration be 'signed long'?

Or perhaps return the value through a parameter and return zero for
success and < 0 for failure?

> +    }
> +    if (fscanf(f, "%lx\n", &value) != 1) {
> +        fprintf(stderr, "Error: Syntax error in %s\n", path);
> +        value = -1;
> +    }
> +    fclose(f);
> +    return value;
> +}
> +
> +static int pci_dev_is_virtfn(HostPCIDevice *d)
> +{
> +    int rc;
> +    char path[PATH_MAX];
> +    struct stat buf;
> +
> +    path_to(d, "physfn", path, sizeof (path));
> +    rc = !stat(path, &buf);

> +
> +    return rc;

Seems like this could be a 'bool'?

> +}
> +
> +static int host_pci_config_fd(HostPCIDevice *d)
> +{
> +    char path[PATH_MAX];
> +
> +    if (d->config_fd < 0) {
> +        path_to(d, "config", path, sizeof (path));
> +        d->config_fd = open(path, O_RDWR);
> +        if (d->config_fd < 0) {
> +            fprintf(stderr, "HostPCIDevice: Can not open '%s': %s\n",
> +                    path, strerror(errno));
> +        }
> +    }
> +    return d->config_fd;
> +}
> +static int host_pci_config_read(HostPCIDevice *d, int pos, void *buf, int len)
> +{
> +    int fd = host_pci_config_fd(d);
> +    int res = 0;
> +
> +    res = pread(fd, buf, len, pos);
> +    if (res < 0) {
> +        fprintf(stderr, "host_pci_config: read failed: %s (fd: %i)\n",
> +                strerror(errno), fd);
> +        return -1;
> +    }
> +    return res;
> +}
> +static int host_pci_config_write(HostPCIDevice *d,
> +                                 int pos, const void *buf, int len)
> +{
> +    int fd = host_pci_config_fd(d);
> +    int res = 0;
> +
> +    res = pwrite(fd, buf, len, pos);
> +    if (res < 0) {
> +        fprintf(stderr, "host_pci_config: write failed: %s\n",
> +                strerror(errno));
> +        return -1;
> +    }
> +    return res;
> +}
> +
> +uint8_t host_pci_get_byte(HostPCIDevice *d, int pos)
> +{
> +  uint8_t buf;
> +  host_pci_config_read(d, pos, &buf, 1);

Not checking the return value?
> +  return buf;
> +}
> +uint16_t host_pci_get_word(HostPCIDevice *d, int pos)
> +{
> +  uint16_t buf;
> +  host_pci_config_read(d, pos, &buf, 2);

Here as well?
> +  return le16_to_cpu(buf);

So if we can't read those buffers, won't that mean we end up with
garbage in buf? As we haven't actually written anything to it?

Perhaps we should do:

 if (host_pci..() < 0)
	return 0;
 ... normal case?

> +}
> +uint32_t host_pci_get_long(HostPCIDevice *d, int pos)
> +{
> +  uint32_t buf;
> +  host_pci_config_read(d, pos, &buf, 4);
> +  return le32_to_cpu(buf);
> +}
> +int host_pci_get_block(HostPCIDevice *d, int pos, uint8_t *buf, int len)
> +{
> +  return host_pci_config_read(d, pos, buf, len);

Oh, so that is called.. Hm, there is not much chance of returning an
error there.

Can we propagate the errors in case there is some fundamental failure
when reading/writing the data stream? Say the PCI device gets
unplugged by the user.. won't pread return -ENXIO?

> +}
> +
> +int host_pci_set_byte(HostPCIDevice *d, int pos, uint8_t data)
> +{
> +  return host_pci_config_write(d, pos, &data, 1);
> +}
> +int host_pci_set_word(HostPCIDevice *d, int pos, uint16_t data)
> +{
> +  data = cpu_to_le16(data);
> +  return host_pci_config_write(d, pos, &data, 2);
> +}
> +int host_pci_set_long(HostPCIDevice *d, int pos, uint32_t data)
> +{
> +  data = cpu_to_le32(data);
> +  return host_pci_config_write(d, pos, &data, 4);
> +}
> +int host_pci_set_block(HostPCIDevice *d, int pos, uint8_t *buf, int len)
> +{
> +  return host_pci_config_write(d, pos, buf, len);
> +}
> +
> +uint32_t host_pci_find_ext_cap_offset(HostPCIDevice *d, uint32_t cap)
> +{
> +    uint32_t header = 0;
> +    int max_cap = PCI_MAX_EXT_CAP;
> +    int pos = PCI_CONFIG_SPACE_SIZE;
> +
> +    do {
> +        header = host_pci_get_long(d, pos);
> +        /*
> +         * If we have no capabilities, this is indicated by cap ID,
> +         * cap version and next pointer all being 0.
> +         */
> +        if (header == 0) {
> +            break;
> +        }
> +
> +        if (PCI_EXT_CAP_ID(header) == cap) {
> +            return pos;
> +        }
> +
> +        pos = PCI_EXT_CAP_NEXT(header);
> +        if (pos < PCI_CONFIG_SPACE_SIZE) {
> +            break;
> +        }
> +
> +        max_cap--;
> +    } while (max_cap > 0);
> +
> +    return 0;
> +}
> +
> +HostPCIDevice *host_pci_device_get(uint8_t bus, uint8_t dev, uint8_t func)
> +{
> +    HostPCIDevice *d = NULL;
> +
> +    d = g_new0(HostPCIDevice, 1);
> +
> +    d->config_fd = -1;
> +    d->domain = 0;
> +    d->bus = bus;
> +    d->dev = dev;
> +    d->func = func;
> +
> +    if (host_pci_config_fd(d) == -1) {
> +        goto error;
> +    }
> +    if (get_resource(d) != 0) {
> +        goto error;
> +    }
> +
> +    d->vendor_id = get_value(d, "vendor");
> +    d->device_id = get_value(d, "device");
> +    d->is_virtfn = pci_dev_is_virtfn(d);
> +
> +    return d;
> +error:
> +    if (d->config_fd >= 0) {
> +        close(d->config_fd);
> +    }
> +    g_free(d);
> +    return NULL;
> +}
> +
> +void host_pci_device_put(HostPCIDevice *d)
> +{
> +    if (d->config_fd >= 0) {
> +        close(d->config_fd);
> +    }
> +    g_free(d);
> +}
> diff --git a/hw/host-pci-device.h b/hw/host-pci-device.h
> new file mode 100644
> index 0000000..d79ba48
> --- /dev/null
> +++ b/hw/host-pci-device.h
> @@ -0,0 +1,75 @@
> +#ifndef HW_HOST_PCI_DEVICE
> +#  define HW_HOST_PCI_DEVICE
> +
> +#include "pci.h"
> +
> +/*
> + * from linux/ioport.h
> + * IO resources have these defined flags.
> + */
> +#define IORESOURCE_BITS         0x000000ff      /* Bus-specific bits */
> +
> +#define IORESOURCE_TYPE_BITS    0x00000f00      /* Resource type */
> +#define IORESOURCE_IO           0x00000100
> +#define IORESOURCE_MEM          0x00000200
> +#define IORESOURCE_IRQ          0x00000400
> +#define IORESOURCE_DMA          0x00000800
> +
> +#define IORESOURCE_PREFETCH     0x00001000      /* No side effects */
> +#define IORESOURCE_READONLY     0x00002000
> +#define IORESOURCE_CACHEABLE    0x00004000
> +#define IORESOURCE_RANGELENGTH  0x00008000
> +#define IORESOURCE_SHADOWABLE   0x00010000
> +
> +#define IORESOURCE_SIZEALIGN    0x00020000      /* size indicates alignment */
> +#define IORESOURCE_STARTALIGN   0x00040000      /* start field is alignment */
> +
> +#define IORESOURCE_MEM_64       0x00100000
> +
> +    /* Userland may not map this resource */
> +#define IORESOURCE_EXCLUSIVE    0x08000000
> +#define IORESOURCE_DISABLED     0x10000000
> +#define IORESOURCE_UNSET        0x20000000
> +#define IORESOURCE_AUTO         0x40000000
> +    /* Driver has marked this resource busy */
> +#define IORESOURCE_BUSY         0x80000000
> +
> +
> +typedef struct HostPCIIORegion {
> +    unsigned long flags;
> +    pcibus_t base_addr;
> +    pcibus_t size;
> +} HostPCIIORegion;
> +
> +typedef struct HostPCIDevice {
> +    uint16_t domain;
> +    uint8_t bus;
> +    uint8_t dev;
> +    uint8_t func;
> +
> +    uint16_t vendor_id;
> +    uint16_t device_id;
> +
> +    HostPCIIORegion io_regions[PCI_NUM_REGIONS - 1];
> +    HostPCIIORegion rom;
> +
> +    bool is_virtfn;
> +
> +    int config_fd;
> +} HostPCIDevice;
> +
> +HostPCIDevice *host_pci_device_get(uint8_t bus, uint8_t dev, uint8_t func);
> +void host_pci_device_put(HostPCIDevice *pci_dev);
> +
> +uint8_t host_pci_get_byte(HostPCIDevice *d, int pos);
> +uint16_t host_pci_get_word(HostPCIDevice *d, int pos);
> +uint32_t host_pci_get_long(HostPCIDevice *d, int pos);
> +int host_pci_get_block(HostPCIDevice *d, int pos, uint8_t *buf, int len);
> +int host_pci_set_byte(HostPCIDevice *d, int pos, uint8_t data);
> +int host_pci_set_word(HostPCIDevice *d, int pos, uint16_t data);
> +int host_pci_set_long(HostPCIDevice *d, int pos, uint32_t data);
> +int host_pci_set_block(HostPCIDevice *d, int pos, uint8_t *buf, int len);
> +
> +uint32_t host_pci_find_ext_cap_offset(HostPCIDevice *s, uint32_t cap);
> +
> +#endif /* !HW_HOST_PCI_DEVICE */
> -- 
> Anthony PERARD
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 02/10] Introduce HostPCIDevice to access a pci device on the host.
  2011-11-04 17:49     ` Konrad Rzeszutek Wilk
@ 2011-11-07 15:09       ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-07 15:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen Devel, QEMU-devel, Stefano Stabellini

On Fri, Nov 4, 2011 at 17:49, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> +static unsigned long get_value(HostPCIDevice *d, const char *name)
>> +{
>> +    char path[PATH_MAX];
>> +    FILE *f;
>> +    unsigned long value;
>> +
>> +    path_to(d, name, path, sizeof (path));
>> +    f = fopen(path, "r");
>> +    if (!f) {
>> +        fprintf(stderr, "Error: Can't open %s: %s\n", path, strerror(errno));
>> +        return -1;
>
> So the declaration is 'unsigned long' but you return -1 here.
>
> Should the declaration be 'signed long'?
>
> Or perhaps return the value through a parameter and return zero for
> success and < 0 for failure?

I will use an extra parameter for the value and return success/failure,
and check for errors.
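
A minimal sketch of what that could look like, assuming the value is parsed
from a single text file such as a sysfs attribute. The function name
read_hex_value and the choice of negative errno values as error codes are
assumptions for illustration, not what the next version of the patch
necessarily does.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical variant of get_value(): the parsed value is returned
 * through *pvalue and the return code carries only success (0) or
 * failure (< 0). *pvalue is left untouched on failure.
 */
static int read_hex_value(const char *path, unsigned long *pvalue)
{
    FILE *f = fopen(path, "r");
    unsigned long value;
    int rc = 0;

    if (!f) {
        return -errno;          /* could not open the file */
    }
    if (fscanf(f, "%lx", &value) != 1) {
        rc = -EINVAL;           /* content is not a hex number */
    } else {
        *pvalue = value;
    }
    fclose(f);
    return rc;
}
```

With this shape a caller can tell a failed read apart from a legitimate
value of 0xffffffff, which the 'unsigned long' return of -1 could not do.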

>> +    }
>> +    if (fscanf(f, "%lx\n", &value) != 1) {
>> +        fprintf(stderr, "Error: Syntax error in %s\n", path);
>> +        value = -1;
>> +    }
>> +    fclose(f);
>> +    return value;
>> +}
>> +
>> +static int pci_dev_is_virtfn(HostPCIDevice *d)
>> +{
>> +    int rc;
>> +    char path[PATH_MAX];
>> +    struct stat buf;
>> +
>> +    path_to(d, "physfn", path, sizeof (path));
>> +    rc = !stat(path, &buf);
>> +
>> +    return rc;
>
> Seems like this could be a 'bool'?

Yes, and the result is stored in a bool :-(, so I will just change the
return type of this function.
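
Roughly like this, as a sketch only. The helper takes a ready-made path so
that it stands alone; in the patch the path would come from path_to(), and
the name path_exists is made up for this example.

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper: true if stat() succeeds on the given path. */
static bool path_exists(const char *path)
{
    struct stat buf;

    return stat(path, &buf) == 0;
}
```

pci_dev_is_virtfn() would then return the result of this test on the
device's "physfn" entry.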

>> +}
>> +
>> +static int host_pci_config_fd(HostPCIDevice *d)
>> +{
>> +    char path[PATH_MAX];
>> +
>> +    if (d->config_fd < 0) {
>> +        path_to(d, "config", path, sizeof (path));
>> +        d->config_fd = open(path, O_RDWR);
>> +        if (d->config_fd < 0) {
>> +            fprintf(stderr, "HostPCIDevice: Can not open '%s': %s\n",
>> +                    path, strerror(errno));
>> +        }
>> +    }
>> +    return d->config_fd;
>> +}
>> +static int host_pci_config_read(HostPCIDevice *d, int pos, void *buf, int len)
>> +{
>> +    int fd = host_pci_config_fd(d);
>> +    int res = 0;
>> +
>> +    res = pread(fd, buf, len, pos);
>> +    if (res < 0) {
>> +        fprintf(stderr, "host_pci_config: read failed: %s (fd: %i)\n",
>> +                strerror(errno), fd);
>> +        return -1;
>> +    }
>> +    return res;
>> +}
>> +static int host_pci_config_write(HostPCIDevice *d,
>> +                                 int pos, const void *buf, int len)
>> +{
>> +    int fd = host_pci_config_fd(d);
>> +    int res = 0;
>> +
>> +    res = pwrite(fd, buf, len, pos);
>> +    if (res < 0) {
>> +        fprintf(stderr, "host_pci_config: write failed: %s\n",
>> +                strerror(errno));
>> +        return -1;
>> +    }
>> +    return res;
>> +}
>> +
>> +uint8_t host_pci_get_byte(HostPCIDevice *d, int pos)
>> +{
>> +  uint8_t buf;
>> +  host_pci_config_read(d, pos, &buf, 1);
>
> Not checking the return value?
>> +  return buf;
>> +}
>> +uint16_t host_pci_get_word(HostPCIDevice *d, int pos)
>> +{
>> +  uint16_t buf;
>> +  host_pci_config_read(d, pos, &buf, 2);
>
> Here as well?
>> +  return le16_to_cpu(buf);
>
> So if we can't read those buffers, won't that mean we end up with
> garbage in buf? As we haven't actually written anything to it?
>
> Perhaps we should do:
>
>  if (host_pci..() < 0)
>        return 0;
>  ... normal case?

Yes, I should probably check for errors, and also check that pread has
actually read the size we expect.

>> +}
>> +uint32_t host_pci_get_long(HostPCIDevice *d, int pos)
>> +{
>> +  uint32_t buf;
>> +  host_pci_config_read(d, pos, &buf, 4);
>> +  return le32_to_cpu(buf);
>> +}
>> +int host_pci_get_block(HostPCIDevice *d, int pos, uint8_t *buf, int len)
>> +{
>> +  return host_pci_config_read(d, pos, buf, len);
>
> Oh, so that is called.. Hm, there is not much chance of returning an
> error there.

Well, errors are already checked by _config_read, so this function is
just an alias.

> Can we propagate the errors in case there is some fundamental failure
> when reading/writing the data stream? Say the PCI device gets
> unplugged by the user.. won't pread return -ENXIO?

I could introduce another parameter, a pointer to a buffer where the
value is written, and return only success/failure. It should also be a
failure if pread reads fewer bytes than the requested size; I will fix
that as well.

And with this extra parameter, it should be better than returning 0 as
a read value.
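
Something along these lines, as a sketch under a few assumptions: the
function names are placeholders, the file descriptor is passed directly
instead of going through HostPCIDevice, and a short read is reported as
-EIO (the actual error code is still open).

```c
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly len bytes at pos; a short read counts as a failure. */
static int config_read_full(int fd, off_t pos, void *buf, size_t len)
{
    ssize_t res;

    do {
        res = pread(fd, buf, len, pos);
    } while (res < 0 && errno == EINTR);

    if (res < 0) {
        return -errno;
    }
    if ((size_t)res != len) {
        return -EIO;            /* assumption: short read maps to -EIO */
    }
    return 0;
}

/* Value goes out through *pvalue; return is only success or failure. */
static int config_get_word(int fd, off_t pos, uint16_t *pvalue)
{
    uint8_t raw[2];
    int rc = config_read_full(fd, pos, raw, sizeof(raw));

    if (rc == 0) {
        /* PCI config space is little-endian */
        *pvalue = (uint16_t)(raw[0] | (raw[1] << 8));
    }
    return rc;
}
```

The caller's variable is only written on success, so no uninitialised
stack contents can leak out as a register value.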


Thanks,

-- 
Anthony PERARD

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Xen-devel] [PATCH V3 02/10] Introduce HostPCIDevice to access a pci device on the host.
@ 2011-11-07 15:09       ` Anthony PERARD
  0 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-07 15:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen Devel, QEMU-devel, Stefano Stabellini

On Fri, Nov 4, 2011 at 17:49, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> +static unsigned long get_value(HostPCIDevice *d, const char *name)
>> +{
>> +    char path[PATH_MAX];
>> +    FILE *f;
>> +    unsigned long value;
>> +
>> +    path_to(d, name, path, sizeof (path));
>> +    f = fopen(path, "r");
>> +    if (!f) {
>> +        fprintf(stderr, "Error: Can't open %s: %s\n", path, strerror(errno));
>> +        return -1;
>
> So the declaration is 'unsigned long' but you return -1 here.
>
> Should the declaration be 'signed long' ?
>
> Or perhaps return the value as parameter and return zero for success
> and < 0 for failure?

I will use an extra parameter for the value and return success/failure.
I will also check for errors.
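
To illustrate the interface change discussed here, a minimal sketch of a
reader that returns the status separately from the parsed value. The
name read_hex_value and the choice of -EINVAL for a parse error are
assumptions made for this example, not code from the patch. Because the
value travels through the pointer, a legitimate all-ones value can no
longer be confused with the -1 error return.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical replacement for get_value(): parse one hexadecimal number
 * from the file at path.  Returns 0 and stores the number in *value on
 * success.  Returns a negative errno value on failure and leaves *value
 * untouched. */
static int read_hex_value(const char *path, unsigned long *value)
{
    FILE *f = fopen(path, "r");
    unsigned long v;
    int rc = 0;

    if (!f) {
        return -errno;          /* e.g. -ENOENT if the file is missing */
    }
    /* %lx accepts an optional "0x" prefix, as found in sysfs files. */
    if (fscanf(f, "%lx", &v) != 1) {
        rc = -EINVAL;           /* syntax error or empty file */
    } else {
        *value = v;
    }
    fclose(f);
    return rc;
}
```

A caller would then test the return code first and only use the value
when it is 0.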

>> +    }
>> +    if (fscanf(f, "%lx\n", &value) != 1) {
>> +        fprintf(stderr, "Error: Syntax error in %s\n", path);
>> +        value = -1;
>> +    }
>> +    fclose(f);
>> +    return value;
>> +}
>> +
>> +static int pci_dev_is_virtfn(HostPCIDevice *d)
>> +{
>> +    int rc;
>> +    char path[PATH_MAX];
>> +    struct stat buf;
>> +
>> +    path_to(d, "physfn", path, sizeof (path));
>> +    rc = !stat(path, &buf);
>> +
>> +    return rc;
>
> Seems like this could be a 'bool'?

Yes, and the result is stored in a bool :-(, so I will just change the
return type of this function.

>> +}
>> +
>> +static int host_pci_config_fd(HostPCIDevice *d)
>> +{
>> +    char path[PATH_MAX];
>> +
>> +    if (d->config_fd < 0) {
>> +        path_to(d, "config", path, sizeof (path));
>> +        d->config_fd = open(path, O_RDWR);
>> +        if (d->config_fd < 0) {
>> +            fprintf(stderr, "HostPCIDevice: Can not open '%s': %s\n",
>> +                    path, strerror(errno));
>> +        }
>> +    }
>> +    return d->config_fd;
>> +}
>> +static int host_pci_config_read(HostPCIDevice *d, int pos, void *buf, int len)
>> +{
>> +    int fd = host_pci_config_fd(d);
>> +    int res = 0;
>> +
>> +    res = pread(fd, buf, len, pos);
>> +    if (res < 0) {
>> +        fprintf(stderr, "host_pci_config: read failed: %s (fd: %i)\n",
>> +                strerror(errno), fd);
>> +        return -1;
>> +    }
>> +    return res;
>> +}
>> +static int host_pci_config_write(HostPCIDevice *d,
>> +                                 int pos, const void *buf, int len)
>> +{
>> +    int fd = host_pci_config_fd(d);
>> +    int res = 0;
>> +
>> +    res = pwrite(fd, buf, len, pos);
>> +    if (res < 0) {
>> +        fprintf(stderr, "host_pci_config: write failed: %s\n",
>> +                strerror(errno));
>> +        return -1;
>> +    }
>> +    return res;
>> +}
>> +
>> +uint8_t host_pci_get_byte(HostPCIDevice *d, int pos)
>> +{
>> +  uint8_t buf;
>> +  host_pci_config_read(d, pos, &buf, 1);
>
> Not checking the return value?
>> +  return buf;
>> +}
>> +uint16_t host_pci_get_word(HostPCIDevice *d, int pos)
>> +{
>> +  uint16_t buf;
>> +  host_pci_config_read(d, pos, &buf, 2);
>
> Here as well?
>> +  return le16_to_cpu(buf);
>
> So if we can't read those buffers, won't that mean we end up with
> garbage in buf? As we haven't actually written anything to it?
>
> Perhaps we should do:
>
>  if (host_pci..() < 0)
>        return 0;
>  ... normal case?

Yes, I should probably check for errors, and also check that pread has
actually read the size we expect.
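
The two checks mentioned here (the pread error and the number of bytes
actually read) could be combined in one helper. This is only a sketch:
config_read_exact is a hypothetical name, not a function from the patch,
and reporting a short read as -EIO is an assumption made for
illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read exactly len bytes at offset pos, or fail.
 * Returns 0 on success, a negative errno value if pread fails, and
 * -EIO if fewer than len bytes were available. */
static int config_read_exact(int fd, void *buf, size_t len, off_t pos)
{
    ssize_t res;

    do {
        res = pread(fd, buf, len, pos);
    } while (res < 0 && errno == EINTR);    /* retry interrupted reads */

    if (res < 0) {
        return -errno;          /* e.g. -EBADF when fd is not open */
    }
    if ((size_t)res != len) {
        return -EIO;            /* short read: buf is only partly filled */
    }
    return 0;
}
```

With a helper of this shape, the byte/word/long getters never decode a
buffer that was not completely filled.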

>> +}
>> +uint32_t host_pci_get_long(HostPCIDevice *d, int pos)
>> +{
>> +  uint32_t buf;
>> +  host_pci_config_read(d, pos, &buf, 4);
>> +  return le32_to_cpu(buf);
>> +}
>> +int host_pci_get_block(HostPCIDevice *d, int pos, uint8_t *buf, int len)
>> +{
>> +  return host_pci_config_read(d, pos, buf, len);
>
> Oh, so that is called.. Hm, not much chance of returning an error there is.

Well, errors are already checked by _config_read, so this function is
just an alias.

> Can we propagate the errors in case there is some fundamental failure
> when reading/writing the data stream? Say the PCI device gets
> unplugged by the user.. won't pread return -EXIO?

I could introduce another parameter, a pointer to a buffer where the
value is written, and return only success/failure. It should also be a
failure if pread reads fewer bytes than the requested size; I will fix
that as well.

With this extra parameter, it should be better than returning 0 as the
read value.
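
As a rough sketch of the getter shape proposed above, for the 16-bit
case. HostPCIDeviceSketch and host_pci_get_word_sketch are hypothetical
names used so that the example compiles on its own; the real
HostPCIDevice has more fields, and the final patch may well look
different.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for HostPCIDevice: only the config-space fd. */
typedef struct {
    int config_fd;
} HostPCIDeviceSketch;

/* Read the 16-bit register at pos into *out.
 * Returns 0 on success or a negative errno value.  *out is left
 * untouched on failure, so the caller never sees a made-up value. */
static int host_pci_get_word_sketch(HostPCIDeviceSketch *d, int pos,
                                    uint16_t *out)
{
    uint8_t raw[2];
    ssize_t res = pread(d->config_fd, raw, sizeof(raw), pos);

    if (res < 0) {
        return -errno;
    }
    if (res != (ssize_t)sizeof(raw)) {
        return -EIO;            /* a short read counts as a failure */
    }
    /* PCI config space is little-endian; assembling the value byte by
     * byte gives the right result on any host endianness. */
    *out = (uint16_t)(raw[0] | (raw[1] << 8));
    return 0;
}
```

The caller can then distinguish "the register really contains 0" from
"the read failed", which a plain return value of 0 cannot express.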


Thanks,

-- 
Anthony PERARD


* Re: [Qemu-devel] [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
  2011-10-28 15:07   ` Anthony PERARD
@ 2011-11-08 12:56     ` Stefano Stabellini
From: Stefano Stabellini @ 2011-11-08 12:56 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Fri, 28 Oct 2011, Anthony PERARD wrote:
> From: Allen Kay <allen.m.kay@intel.com>
> 
> Signed-off-by: Allen Kay <allen.m.kay@intel.com>
> Signed-off-by: Guy Zana <guy@neocleus.com>
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  Makefile.target                  |    2 +
>  hw/xen_pci_passthrough.c         |  838 ++++++++++++++++++++++++++++++++++++++
>  hw/xen_pci_passthrough.h         |  223 ++++++++++
>  hw/xen_pci_passthrough_helpers.c |   46 ++
>  4 files changed, 1109 insertions(+), 0 deletions(-)
>  create mode 100644 hw/xen_pci_passthrough.c
>  create mode 100644 hw/xen_pci_passthrough.h
>  create mode 100644 hw/xen_pci_passthrough_helpers.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index 243f9f2..36ea47d 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -217,6 +217,8 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
> 
>  # Xen PCI Passthrough
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
> 
>  # Inter-VM PCI shared memory
>  CONFIG_IVSHMEM =
> diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
> new file mode 100644
> index 0000000..b97c5b6
> --- /dev/null
> +++ b/hw/xen_pci_passthrough.c
> @@ -0,0 +1,838 @@
> +/*
> + * Copyright (c) 2007, Neocleus Corporation.
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Alex Novik <alex@neocleus.com>
> + * Allen Kay <allen.m.kay@intel.com>
> + * Guy Zana <guy@neocleus.com>
> + *
> + * This file implements direct PCI assignment to a HVM guest
> + */
> +
> +/*
> + * Interrupt Disable policy:
> + *
> + * INTx interrupt:
> + *   Initialize(register_real_device)
> + *     Map INTx(xc_physdev_map_pirq):
> + *       <fail>
> + *         - Set real Interrupt Disable bit to '1'.
> + *         - Set machine_irq and assigned_device->machine_irq to '0'.
> + *         * Don't bind INTx.
> + *
> + *     Bind INTx(xc_domain_bind_pt_pci_irq):
> + *       <fail>
> + *         - Set real Interrupt Disable bit to '1'.
> + *         - Unmap INTx.
> + *         - Decrement mapped_machine_irq[machine_irq]
> + *         - Set assigned_device->machine_irq to '0'.
> + *
> + *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
> + *     Write '0'
> + *       <ptdev->msi_trans_en is false>
> + *         - Set real bit to '0' if assigned_device->machine_irq isn't '0'.
> + *
> + *     Write '1'
> + *       <ptdev->msi_trans_en is false>
> + *         - Set real bit to '1'.
> + *
> + * MSI-INTx translation.
> + *   Initialize(xc_physdev_map_pirq_msi/pt_msi_setup)
> + *     Bind MSI-INTx(xc_domain_bind_pt_irq)
> + *       <fail>
> + *         - Unmap MSI.
> + *           <success>
> + *             - Set dev->msi->pirq to '-1'.
> + *           <fail>
> + *             - Do nothing.
> + *
> + *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
> + *     Write '0'
> + *       <ptdev->msi_trans_en is true>
> + *         - Set MSI Enable bit to '1'.
> + *
> + *     Write '1'
> + *       <ptdev->msi_trans_en is true>
> + *         - Set MSI Enable bit to '0'.
> + *
> + * MSI interrupt:
> + *   Initialize MSI register(pt_msi_setup, pt_msi_update)
> + *     Bind MSI(xc_domain_update_msi_irq)
> + *       <fail>
> + *         - Unmap MSI.
> + *         - Set dev->msi->pirq to '-1'.
> + *
> + * MSI-X interrupt:
> + *   Initialize MSI-X register(pt_msix_update_one)
> + *     Bind MSI-X(xc_domain_update_msi_irq)
> + *       <fail>
> + *         - Unmap MSI-X.
> + *         - Set entry->pirq to '-1'.
> + */
> +

You should move all the MSI-related comments to the MSI patch.
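
For the INTx part of the policy comment quoted above (the case where
msi_trans_en is false), the handling of a guest write to the Interrupt
Disable bit can be sketched as a pure function. The function name and
its parameters are assumptions for this example; only the bit position
(bit 10 of the command register, 0x400) is standard.

```c
#include <stdint.h>

#define SKETCH_INTX_DISABLE 0x400   /* bit 10 of the PCI command register */

/* Hypothetical helper: compute the value for the real command register
 * after the guest writes guest_cmd, following the quoted INTx policy:
 *   - guest writes '1': always set the real bit;
 *   - guest writes '0': clear the real bit only if an IRQ was mapped
 *     (machine_irq != 0), otherwise leave the real bit as it is. */
static uint16_t intx_apply_guest_write(uint16_t real_cmd, uint16_t guest_cmd,
                                       uint32_t machine_irq)
{
    if (guest_cmd & SKETCH_INTX_DISABLE) {
        return real_cmd | SKETCH_INTX_DISABLE;
    }
    if (machine_irq != 0) {
        return real_cmd & (uint16_t)~SKETCH_INTX_DISABLE;
    }
    return real_cmd;
}
```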


> +#include <sys/ioctl.h>
> +
> +#include "pci.h"
> +#include "xen.h"
> +#include "xen_backend.h"
> +#include "xen_pci_passthrough.h"
> +
> +#define PCI_BAR_ENTRIES (6)
> +
> +#define PT_NR_IRQS          (256)
> +char mapped_machine_irq[PT_NR_IRQS] = {0};
> +
> +/* Config Space */
> +static int pt_pci_config_access_check(PCIDevice *d, uint32_t address, int len)
> +{
> +    /* check offset range */
> +    if (address >= 0xFF) {
> +        PT_LOG("Error: Failed to access register with offset exceeding FFh. "
> +               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               address, len);
> +        return -1;
> +    }
> +
> +    /* check read size */
> +    if ((len != 1) && (len != 2) && (len != 4)) {
> +        PT_LOG("Error: Failed to access register with invalid access length. "
> +               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               address, len);
> +        return -1;
> +    }
> +
> +    /* check offset alignment */
> +    if (address & (len - 1)) {
> +        PT_LOG("Error: Failed to access register with invalid access size "
> +            "alignment. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +            pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +            address, len);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +int pt_bar_offset_to_index(uint32_t offset)
> +{
> +    int index = 0;
> +
> +    /* check Exp ROM BAR */
> +    if (offset == PCI_ROM_ADDRESS) {
> +        return PCI_ROM_SLOT;
> +    }
> +
> +    /* calculate BAR index */
> +    index = (offset - PCI_BASE_ADDRESS_0) >> 2;
> +    if (index >= PCI_NUM_REGIONS) {
> +        return -1;
> +    }
> +
> +    return index;
> +}
> +
> +static uint32_t pt_pci_read_config(PCIDevice *d, uint32_t address, int len)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> +    uint32_t val = 0;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    int rc = 0;
> +    int emul_len = 0;
> +    uint32_t find_addr = address;
> +
> +    if (pt_pci_config_access_check(d, address, len)) {
> +        goto exit;
> +    }
> +
> +    /* check power state transition flags */
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        /* can't accept until previous power state transition is completed.
> +         * so finished previous request here.
> +         */
> +        PT_LOG("Warning: guest want to write durring power state transition\n");
> +        goto exit;
> +    }
> +
> +    /* find register group entry */
> +    reg_grp_entry = pt_find_reg_grp(s, address);
> +    if (reg_grp_entry) {
> +        /* check 0 Hardwired register group */
> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> +            /* no need to emulate, just return 0 */
> +            val = 0;
> +            goto exit;
> +        }
> +    }
> +
> +    /* read I/O device register value */
> +    rc = host_pci_get_block(s->real_device, address, (uint8_t *)&val, len);
> +    if (!rc) {
> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
> +        memset(&val, 0xff, len);
> +    }
> +
> +    /* just return the I/O device register value for
> +     * passthrough type register group */
> +    if (reg_grp_entry == NULL) {
> +        goto exit;
> +    }
> +
> +    /* adjust the read value to appropriate CFC-CFF window */
> +    val <<= (address & 3) << 3;
> +    emul_len = len;
> +
> +    /* loop Guest request size */
> +    while (emul_len > 0) {
> +        /* find register entry to be emulated */
> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
> +        if (reg_entry) {
> +            XenPTRegInfo *reg = reg_entry->reg;
> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> +            uint8_t *ptr_val = NULL;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +
> +            /* do emulation depend on register size */
> +            switch (reg->size) {
> +            case 1:
> +                if (reg->u.b.read) {
> +                    rc = reg->u.b.read(s, reg_entry, ptr_val, valid_mask);
> +                }
> +                break;
> +            case 2:
> +                if (reg->u.w.read) {
> +                    rc = reg->u.w.read(s, reg_entry,
> +                                       (uint16_t *)ptr_val, valid_mask);
> +                }
> +                break;
> +            case 4:
> +                if (reg->u.dw.read) {
> +                    rc = reg->u.dw.read(s, reg_entry,
> +                                        (uint32_t *)ptr_val, valid_mask);
> +                }
> +                break;
> +            }
> +
> +            if (rc < 0) {
> +                hw_error("Internal error: Invalid read emulation "
> +                         "return value[%d]. I/O emulator exit.\n", rc);
> +            }
> +
> +            /* calculate next address to find */
> +            emul_len -= reg->size;
> +            if (emul_len > 0) {
> +                find_addr = real_offset + reg->size;
> +            }
> +        } else {
> +            /* nothing to do with passthrough type register,
> +             * continue to find next byte */
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* need to shift back before returning them to pci bus emulator */
> +    val >>= ((address & 3) << 3);
> +
> +exit:
> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                  address, val, len);
> +    return val;
> +}
> +
> +static void pt_pci_write_config(PCIDevice *d, uint32_t address,
> +                                uint32_t val, int len)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> +    int index = 0;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    int rc = 0;
> +    uint32_t read_val = 0;
> +    int emul_len = 0;
> +    XenPTReg *reg_entry = NULL;
> +    uint32_t find_addr = address;
> +    XenPTRegInfo *reg = NULL;
> +
> +    if (pt_pci_config_access_check(d, address, len)) {
> +        return;
> +    }
> +
> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                  address, val, len);
> +
> +    /* check unused BAR register */
> +    index = pt_bar_offset_to_index(address);
> +    if ((index >= 0) && (val > 0 && val < PT_BAR_ALLF) &&
> +        (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED)) {
> +        PT_LOG("Warning: Guest attempt to set address to unused Base Address "
> +               "Register. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               address, len);
> +    }
> +
> +    /* check power state transition flags */
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        /* can't accept untill previous power state transition is completed.
> +         * so finished previous request here.
> +         */
> +        PT_LOG("Warning: guest want to write durring power state transition\n");
> +        return;
> +    }
> +
> +    /* find register group entry */
> +    reg_grp_entry = pt_find_reg_grp(s, address);
> +    if (reg_grp_entry) {
> +        /* check 0 Hardwired register group */
> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> +            /* ignore silently */
> +            PT_LOG("Warning: Access to 0 Hardwired register. "
> +                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                   address, len);
> +            return;
> +        }
> +    }
> +
> +    /* read I/O device register value */
> +    rc = host_pci_get_block(s->real_device, address,
> +                             (uint8_t *)&read_val, len);
> +    if (!rc) {
> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
> +        memset(&read_val, 0xff, len);
> +    }
> +
> +    /* pass directly to libpci for passthrough type register group */
> +    if (reg_grp_entry == NULL) {
> +        goto out;
> +    }
> +
> +    /* adjust the read and write value to appropriate CFC-CFF window */
> +    read_val <<= (address & 3) << 3;
> +    val <<= (address & 3) << 3;
> +    emul_len = len;
> +
> +    /* loop Guest request size */
> +    while (emul_len > 0) {
> +        /* find register entry to be emulated */
> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
> +        if (reg_entry) {
> +            reg = reg_entry->reg;
> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> +            uint8_t *ptr_val = NULL;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +
> +            /* do emulation depend on register size */
> +            switch (reg->size) {
> +            case 1:
> +                if (reg->u.b.write) {
> +                    rc = reg->u.b.write(s, reg_entry, ptr_val,
> +                                        read_val >> ((real_offset & 3) << 3),
> +                                        valid_mask);
> +                }
> +                break;
> +            case 2:
> +                if (reg->u.w.write) {
> +                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
> +                                        (read_val >> ((real_offset & 3) << 3)),
> +                                        valid_mask);
> +                }
> +                break;
> +            case 4:
> +                if (reg->u.dw.write) {
> +                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
> +                                         (read_val >> ((real_offset & 3) << 3)),
> +                                         valid_mask);
> +                }
> +                break;
> +            }
> +
> +            if (rc < 0) {
> +                hw_error("Internal error: Invalid write emulation "
> +                         "return value[%d]. I/O emulator exit.\n", rc);
> +            }
> +
> +            /* calculate next address to find */
> +            emul_len -= reg->size;
> +            if (emul_len > 0) {
> +                find_addr = real_offset + reg->size;
> +            }
> +        } else {
> +            /* nothing to do with passthrough type register,
> +             * continue to find next byte */
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* need to shift back before passing them to libpci */
> +    val >>= (address & 3) << 3;
> +
> +out:
> +    if (!(reg && reg->no_wb)) {
> +        /* unknown regs are passed through */
> +        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
> +
> +        if (!rc) {
> +            PT_LOG("Error: pci_write_block failed. return value[%d].\n", rc);
> +        }
> +    }
> +
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        qemu_mod_timer(s->pm_state->pm_timer,
> +                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
> +    }
> +}

Where is this timer allocated and initialized?
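
As an aside on pt_pci_read_config and pt_pci_write_config quoted above:
both shift the value into its byte lane within the aligned 32-bit dword
and build a valid_mask for the bytes still covered by the request. The
arithmetic can be checked in isolation with the sketch below. The
function names are made up for this example, and it assumes emul_len is
between 1 and 4 and find_addr is at or after real_offset, as in the
quoted loops.

```c
#include <stdint.h>

/* Move a value read at 'address' into its byte lane within the aligned
 * dword, e.g. offset 0x3D is lane 1, so the value moves up by 8 bits. */
static uint32_t to_dword_window(uint32_t val, uint32_t address)
{
    return val << ((address & 3) << 3);
}

/* Inverse of to_dword_window(): shift back before returning the value. */
static uint32_t from_dword_window(uint32_t val, uint32_t address)
{
    return val >> ((address & 3) << 3);
}

/* Mask of the bytes that the remaining emul_len bytes of the request
 * cover, positioned relative to the start of the emulated register. */
static uint32_t emul_valid_mask(int emul_len, uint32_t find_addr,
                                uint32_t real_offset)
{
    uint32_t mask = 0xFFFFFFFFu >> ((4 - emul_len) << 3);

    return mask << ((find_addr - real_offset) << 3);
}
```

For instance, a 1-byte access that hits the second byte of a register
starting at 0x04 yields a mask of 0x0000FF00.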


> +/* ioport/iomem space*/
> +static void pt_iomem_map(XenPCIPassthroughState *s, int i,
> +                         pcibus_t e_phys, pcibus_t e_size, int type)
> +{
> +    uint32_t old_ebase = s->bases[i].e_physbase;
> +    bool first_map = s->bases[i].e_size == 0;
> +    int ret = 0;
> +
> +    s->bases[i].e_physbase = e_phys;
> +    s->bases[i].e_size = e_size;
> +
> +    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%%d"
> +           " len=%#"PRIx64" index=%d first_map=%d\n",
> +           e_phys, s->bases[i].access.maddr, /*type,*/
> +           e_size, i, first_map);
> +
> +    if (e_size == 0) {
> +        return;
> +    }
> +
> +    if (!first_map && old_ebase != -1) {
> +        /* Remove old mapping */
> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                               old_ebase >> XC_PAGE_SHIFT,
> +                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> +                               DPCI_REMOVE_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove old mapping failed!\n");
> +            return;
> +        }
> +    }
> +
> +    /* map only valid guest address */
> +    if (e_phys != -1) {
> +        /* Create new mapping */
> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> +                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> +                                   DPCI_ADD_MAPPING);
> +
> +        if (ret != 0) {
> +            PT_LOG("Error: create new mapping failed!\n");
> +        }
> +    }
> +}
> +
> +static void pt_ioport_map(XenPCIPassthroughState *s, int i,
> +                          pcibus_t e_phys, pcibus_t e_size, int type)
> +{
> +    uint32_t old_ebase = s->bases[i].e_physbase;
> +    bool first_map = s->bases[i].e_size == 0;
> +    int ret = 0;
> +
> +    s->bases[i].e_physbase = e_phys;
> +    s->bases[i].e_size = e_size;
> +
> +    PT_LOG("e_phys=%#04"PRIx64" pio_base=%#04"PRIx64" len=%"PRId64" index=%d"
> +           " first_map=%d\n",
> +           e_phys, s->bases[i].access.pio_base, e_size, i, first_map);
> +
> +    if (e_size == 0) {
> +        return;
> +    }
> +
> +    if (!first_map && old_ebase != -1) {
> +        /* Remove old mapping */
> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, old_ebase,
> +                                       s->bases[i].access.pio_base, e_size,
> +                                       DPCI_REMOVE_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove old mapping failed!\n");
> +            return;
> +        }
> +    }
> +
> +    /* map only valid guest address (include 0) */
> +    if (e_phys != -1) {
> +        /* Create new mapping */
> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, e_phys,
> +                                       s->bases[i].access.pio_base, e_size,
> +                                       DPCI_ADD_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: create new mapping failed!\n");
> +        }
> +    }
> +
> +}
> +
> +
> +/* mapping BAR */
> +
> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
> +                        int io_enable, int mem_enable)
> +{
> +    PCIDevice *dev = &s->dev;
> +    PCIIORegion *r;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    pcibus_t r_size = 0, r_addr = -1;
> +    int rc = 0;
> +
> +    r = &dev->io_regions[bar];
> +
> +    /* check valid region */
> +    if (!r->size) {
> +        return;
> +    }
> +
> +    base = &s->bases[bar];
> +    /* skip unused BAR or upper 64bit BAR */
> +    if ((base->bar_flag == PT_BAR_FLAG_UNUSED)
> +        || (base->bar_flag == PT_BAR_FLAG_UPPER)) {
> +           return;
> +    }
> +
> +    /* copy region address to temporary */
> +    r_addr = r->addr;
> +
> +    /* need unmapping in case I/O Space or Memory Space disable */
> +    if (((base->bar_flag == PT_BAR_FLAG_IO) && !io_enable) ||
> +        ((base->bar_flag == PT_BAR_FLAG_MEM) && !mem_enable)) {
> +        r_addr = -1;
> +    }
> +    if ((bar == PCI_ROM_SLOT) && (r_addr != -1)) {
> +        reg_grp_entry = pt_find_reg_grp(s, PCI_ROM_ADDRESS);
> +        if (reg_grp_entry) {
> +            reg_entry = pt_find_reg(reg_grp_entry, PCI_ROM_ADDRESS);
> +            if (reg_entry && !(reg_entry->data & PCI_ROM_ADDRESS_ENABLE)) {
> +                r_addr = -1;
> +            }
> +        }
> +    }
> +
> +    /* prevent guest software mapping memory resource to 00000000h */
> +    if ((base->bar_flag == PT_BAR_FLAG_MEM) && (r_addr == 0)) {
> +        r_addr = -1;
> +    }
> +
> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
> +
> +    rc = pci_check_bar_overlap(dev, r_addr, r_size, r->type);
> +    if (rc > 0) {
> +        PT_LOG("Warning: s[%02x:%02x.%x][Region:%d][Address:%"FMT_PCIBUS"h]"
> +               "[Size:%"FMT_PCIBUS"h] is overlapped.\n", pci_bus_num(dev->bus),
> +               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), bar,
> +               r_addr, r_size);
> +    }
> +
> +    /* check whether we need to update the mapping or not */
> +    if (r_addr != s->bases[bar].e_physbase) {
> +        /* mapping BAR */
> +        if (base->bar_flag == PT_BAR_FLAG_IO) {
> +            pt_ioport_map(s, bar, r_addr, r_size, r->type);
> +        } else {
> +            pt_iomem_map(s, bar, r_addr, r_size, r->type);
> +        }
> +    }
> +}
> +
> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
> +{
> +    int i;
> +
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        pt_bar_mapping_one(s, i, io_enable, mem_enable);
> +    }
> +}
> +
> +/* register regions */
> +static int pt_register_regions(XenPCIPassthroughState *s)
> +{
> +    int i = 0;
> +    uint32_t bar_data = 0;
> +    HostPCIDevice *d = s->real_device;
> +
> +    /* Register PIO/MMIO BARs */
> +    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
> +        HostPCIIORegion *r = &d->io_regions[i];
> +
> +        if (r->base_addr) {
> +            s->bases[i].e_physbase = r->base_addr;
> +            s->bases[i].access.u = r->base_addr;
> +
> +            /* Register current region */
> +            if (r->flags & IORESOURCE_IO) {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
> +                                 &s->bar[i]);
> +            } else if (r->flags & IORESOURCE_PREFETCH) {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
> +                                 &s->bar[i]);
> +            } else {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                                 &s->bar[i]);
> +            }
> +
> +            PT_LOG("IO region registered (size=0x%08"PRIx64
> +                   " base_addr=0x%08"PRIx64")\n",
> +                   r->size, r->base_addr);
> +        }
> +    }
> +
> +    /* Register expansion ROM address */
> +    if (d->rom.base_addr && d->rom.size) {
> +        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
> +        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
> +        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
> +            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
> +            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
> +        }
> +
> +        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
> +        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
> +
> +        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
> +                                      "xen-pci-pt-rom", d->rom.size);
> +        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
> +                         &s->rom);
> +
> +        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
> +               " base_addr=0x%08"PRIx64")\n",
> +               d->rom.size, d->rom.base_addr);
> +    }
> +
> +    return 0;
> +}
> +
> +static void pt_unregister_regions(XenPCIPassthroughState *s)
> +{
> +    int i, type, rc;
> +    uint32_t e_size;
> +    PCIDevice *d = &s->dev;
> +
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        e_size = s->bases[i].e_size;
> +        if ((e_size == 0) || (s->bases[i].e_physbase == -1)) {
> +            continue;
> +        }
> +
> +        type = d->io_regions[i].type;
> +
> +        if (type == PCI_BASE_ADDRESS_SPACE_MEMORY
> +            || type == PCI_BASE_ADDRESS_MEM_PREFETCH) {
> +            rc = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                    s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> +                    s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                    (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> +                    DPCI_REMOVE_MAPPING);
> +            if (rc != 0) {
> +                PT_LOG("Error: remove old mem mapping failed!\n");
> +                continue;
> +            }
> +
> +        } else if (type == PCI_BASE_ADDRESS_SPACE_IO) {
> +            rc = xc_domain_ioport_mapping(xen_xc, xen_domid,
> +                        s->bases[i].e_physbase,
> +                        s->bases[i].access.pio_base,
> +                        e_size,
> +                        DPCI_REMOVE_MAPPING);
> +            if (rc != 0) {
> +                PT_LOG("Error: remove old io mapping failed!\n");
> +                continue;
> +            }
> +        }
> +    }
> +}
> +
> +static int pt_initfn(PCIDevice *pcidev)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
> +    int dom, bus;
> +    unsigned slot, func;
> +    int rc = 0;
> +    uint32_t machine_irq;
> +    int pirq = -1;
> +
> +    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
> +        fprintf(stderr, "error parse bdf: %s\n", s->hostaddr);
> +        return -1;
> +    }
> +
> +    /* register real device */
> +    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
> +           bus, slot, func, s->dev.devfn);
> +
> +    s->real_device = host_pci_device_get(bus, slot, func);
> +    if (!s->real_device) {
> +        return -1;
> +    }
> +
> +    s->is_virtfn = s->real_device->is_virtfn;
> +    if (s->is_virtfn) {
> +        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
> +               s->real_device->domain, bus, slot, func);
> +    }
> +
> +    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
> +    if (host_pci_get_block(s->real_device, 0, pcidev->config,
> +                           PCI_CONFIG_SPACE_SIZE) == -1) {
> +        return -1;
> +    }
> +
> +    /* Handle real device's MMIO/PIO BARs */
> +    pt_register_regions(s);
> +
> +    /* reinitialize each config register to be emulated */
> +    pt_config_init(s);

This function is implemented in the next patch, so you might as well add
this call there.
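
Regarding pt_iomem_map and pt_unregister_regions quoted above: both
convert byte addresses and sizes to page frame counts before calling
xc_domain_memory_mapping, rounding the size up to whole pages. A small
sketch of that conversion follows. The SKETCH_ constants stand in for
XC_PAGE_SHIFT and XC_PAGE_SIZE and assume 4 KiB pages; the helper names
are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                            /* assumed: 4 KiB */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Number of page frames needed to cover 'size' bytes, rounded up. */
static uint64_t bytes_to_frames(uint64_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* Frame number that contains the byte address 'addr'. */
static uint64_t addr_to_frame(uint64_t addr)
{
    return addr >> SKETCH_PAGE_SHIFT;
}
```

A BAR of 0x1001 bytes therefore needs two frames, while one of exactly
0x1000 bytes needs one.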


> +    /* Bind interrupt */
> +    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
> +        PT_LOG("no pin interrupt\n");
> +        goto out;
> +    }
> +
> +    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
> +    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
> +
> +    if (rc) {
> +        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
> +
> +        /* Disable PCI intx assertion (turn on bit10 of devctl) */
> +        host_pci_set_word(s->real_device,
> +                          PCI_COMMAND,
> +                          pci_get_word(s->dev.config + PCI_COMMAND)
> +                          | PCI_COMMAND_INTX_DISABLE);
> +        machine_irq = 0;
> +        s->machine_irq = 0;
> +    } else {
> +        machine_irq = pirq;
> +        s->machine_irq = pirq;
> +        mapped_machine_irq[machine_irq]++;
> +    }
> +
> +    /* bind machine_irq to device */
> +    if (rc < 0 && machine_irq != 0) {
> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> +        uint8_t e_intx = pci_intx(s);
> +
> +        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
> +                                       e_device, e_intx);
> +        if (rc < 0) {
> +            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);
> +
> +            /* Disable PCI intx assertion (turn on bit10 of devctl) */
> +            host_pci_set_word(s->real_device, PCI_COMMAND,
> +                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
> +                              | PCI_COMMAND_INTX_DISABLE);
> +            mapped_machine_irq[machine_irq]--;
> +
> +            if (mapped_machine_irq[machine_irq] == 0) {
> +                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
> +                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
> +                           rc);
> +                }
> +            }
> +            s->machine_irq = 0;
> +        }
> +    }
> +
> +out:
> +    PT_LOG("Real physical device %02x:%02x.%x registered successfuly!\n"
> +           "IRQ type = %s\n", bus, slot, func, "INTx");
> +
> +    return 0;
> +}
> +
> +static int pt_unregister_device(PCIDevice *pcidev)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
> +    uint8_t e_device, e_intx;
> +    uint32_t machine_irq;
> +    int rc;
> +
> +    /* Unbind interrupt */
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    e_intx = pci_intx(s);
> +    machine_irq = s->machine_irq;
> +
> +    if (machine_irq) {
> +        rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
> +                                     PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
> +        if (rc < 0) {
> +            PT_LOG("Error: Unbinding of interrupt failed! rc=%d\n", rc);
> +        }
> +    }
> +
> +    if (machine_irq) {
> +        mapped_machine_irq[machine_irq]--;
> +
> +        if (mapped_machine_irq[machine_irq] == 0) {
> +            rc = xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq);
> +
> +            if (rc < 0) {
> +                PT_LOG("Error: Unmaping of interrupt failed! rc=%d\n", rc);
> +            }
> +        }
> +    }
> +
> +    /* delete all emulated config registers */
> +    pt_config_delete(s);
> +
> +    /* unregister real device's MMIO/PIO BARs */
> +    pt_unregister_regions(s);
> +
> +    host_pci_device_put(s->real_device);
> +
> +    return 0;
> +}
> +
> +static PCIDeviceInfo xen_pci_passthrough = {
> +    .init = pt_initfn,
> +    .exit = pt_unregister_device,
> +    .qdev.name = "xen-pci-passthrough",
> +    .qdev.desc = "Assign an host pci device with Xen",
> +    .qdev.size = sizeof(XenPCIPassthroughState),
> +    .config_read = pt_pci_read_config,
> +    .config_write = pt_pci_write_config,
> +    .is_express = 0,
> +    .qdev.props = (Property[]) {
> +        DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
> +        DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
> +                        0, false),
> +        DEFINE_PROP_END_OF_LIST(),
> +    }
> +};
> +
> +static void xen_passthrough_register(void)
> +{
> +    pci_qdev_register(&xen_pci_passthrough);
> +}
> +
> +device_init(xen_passthrough_register);
> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
> new file mode 100644
> index 0000000..2d1979d
> --- /dev/null
> +++ b/hw/xen_pci_passthrough.h
> @@ -0,0 +1,223 @@
> +#ifndef QEMU_HW_XEN_PCI_PASSTHROUGH_H
> +#  define QEMU_HW_XEN_PCI_PASSTHROUGH_H
> +
> +#include "qemu-common.h"
> +#include "xen_common.h"
> +#include "pci.h"
> +#include "host-pci-device.h"
> +
> +#define PT_LOGGING_ENABLED
> +#define PT_DEBUG_PCI_CONFIG_ACCESS
> +
> +#ifdef PT_LOGGING_ENABLED
> +#  define PT_LOG(_f, _a...)   fprintf(stderr, "%s: " _f, __func__, ##_a)
> +#else
> +#  define PT_LOG(_f, _a...)
> +#endif
> +
> +#ifdef PT_DEBUG_PCI_CONFIG_ACCESS
> +#  define PT_LOG_CONFIG(_f, _a...) PT_LOG(_f, ##_a)
> +#else
> +#  define PT_LOG_CONFIG(_f, _a...)
> +#endif
> +
> +
> +typedef struct XenPTRegInfo XenPTRegInfo;
> +typedef struct XenPTReg XenPTReg;
> +
> +typedef struct XenPCIPassthroughState XenPCIPassthroughState;
> +
> +/* function type for config reg */
> +typedef uint32_t (*conf_reg_init)
> +    (XenPCIPassthroughState *, XenPTRegInfo *, uint32_t real_offset);
> +typedef int (*conf_dword_write)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint32_t *val, uint32_t dev_value, uint32_t valid_mask);
> +typedef int (*conf_word_write)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint16_t *val, uint16_t dev_value, uint16_t valid_mask);
> +typedef int (*conf_byte_write)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint8_t *val, uint8_t dev_value, uint8_t valid_mask);
> +typedef int (*conf_dword_read)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint32_t *val, uint32_t valid_mask);
> +typedef int (*conf_word_read)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint16_t *val, uint16_t valid_mask);
> +typedef int (*conf_byte_read)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint8_t *val, uint8_t valid_mask);
> +typedef int (*conf_dword_restore)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
> +     uint32_t dev_value, uint32_t *val);
> +typedef int (*conf_word_restore)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
> +     uint16_t dev_value, uint16_t *val);
> +typedef int (*conf_byte_restore)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
> +     uint8_t dev_value, uint8_t *val);
> +
> +/* power state transition */
> +#define PT_FLAG_TRANSITING 0x0001
> +
> +
> +typedef enum {
> +    GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
> +    GRP_TYPE_EMU,                               /* emul reg group */
> +} RegisterGroupType;
> +
> +typedef enum {
> +    PT_BAR_FLAG_MEM = 0,                        /* Memory type BAR */
> +    PT_BAR_FLAG_IO,                             /* I/O type BAR */
> +    PT_BAR_FLAG_UPPER,                          /* upper 64bit BAR */
> +    PT_BAR_FLAG_UNUSED,                         /* unused BAR */
> +} PTBarFlag;
> +
> +
> +typedef struct XenPTRegion {
> +    /* Virtual phys base & size */
> +    uint32_t e_physbase;
> +    uint32_t e_size;
> +    /* Index of region in qemu */
> +    uint32_t memory_index;
> +    /* BAR flag */
> +    PTBarFlag bar_flag;
> +    /* Translation of the emulated address */
> +    union {
> +        uint64_t maddr;
> +        uint64_t pio_base;
> +        uint64_t u;
> +    } access;
> +} XenPTRegion;
> +
> +/* XenPTRegInfo declaration
> + * - only for emulated register (either a part or whole bit).
> + * - for passthrough register that need special behavior (like interacting with
> + *   other component), set emu_mask to all 0 and specify r/w func properly.
> + * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
> + */
> +
> +/* emulated register infomation */
> +struct XenPTRegInfo {
> +    uint32_t offset;
> +    uint32_t size;
> +    uint32_t init_val;
> +    /* reg read only field mask (ON:RO/ROS, OFF:other) */
> +    uint32_t ro_mask;
> +    /* reg emulate field mask (ON:emu, OFF:passthrough) */
> +    uint32_t emu_mask;
> +    /* no write back allowed */
> +    uint32_t no_wb;
> +    conf_reg_init init;
> +    /* read/write/restore function pointer
> +     * for double_word/word/byte size */
> +    union {
> +        struct {
> +            conf_dword_write write;
> +            conf_dword_read read;
> +            conf_dword_restore restore;
> +        } dw;
> +        struct {
> +            conf_word_write write;
> +            conf_word_read read;
> +            conf_word_restore restore;
> +        } w;
> +        struct {
> +            conf_byte_write write;
> +            conf_byte_read read;
> +            conf_byte_restore restore;
> +        } b;
> +    } u;
> +};
> +
> +/* emulated register management */
> +struct XenPTReg {
> +    QLIST_ENTRY(XenPTReg) entries;
> +    XenPTRegInfo *reg;
> +    uint32_t data;
> +};
> +
> +typedef struct XenPTRegGroupInfo XenPTRegGroupInfo;
> +
> +/* emul reg group size initialize method */
> +typedef uint8_t (*pt_reg_size_init_fn)
> +    (XenPCIPassthroughState *, const XenPTRegGroupInfo *,
> +     uint32_t base_offset);
> +
> +/* emulated register group infomation */
> +struct XenPTRegGroupInfo {
> +    uint8_t grp_id;
> +    RegisterGroupType grp_type;
> +    uint8_t grp_size;
> +    pt_reg_size_init_fn size_init;
> +    XenPTRegInfo *emu_reg_tbl;
> +};
> +
> +/* emul register group management table */
> +typedef struct XenPTRegGroup {
> +    QLIST_ENTRY(XenPTRegGroup) entries;
> +    const XenPTRegGroupInfo *reg_grp;
> +    uint32_t base_offset;
> +    uint8_t size;
> +    QLIST_HEAD(, XenPTReg) reg_tbl_list;
> +} XenPTRegGroup;
> +
> +
> +typedef struct XenPTPM {
> +    QEMUTimer *pm_timer;  /* QEMUTimer struct */
> +    int no_soft_reset;    /* No Soft Reset flags */
> +    uint16_t flags;       /* power state transition flags */
> +    uint16_t pmc_field;   /* Power Management Capabilities field */
> +    int pm_delay;         /* power state transition delay */
> +    uint16_t cur_state;   /* current power state */
> +    uint16_t req_state;   /* requested power state */
> +    uint32_t pm_base;     /* Power Management Capability reg base offset */
> +    uint32_t aer_base;    /* AER Capability reg base offset */
> +} XenPTPM;
> +
> +struct XenPCIPassthroughState {
> +    PCIDevice dev;
> +
> +    char *hostaddr;
> +    bool is_virtfn;
> +    HostPCIDevice *real_device;
> +    XenPTRegion bases[PCI_NUM_REGIONS]; /* Access regions */
> +    QLIST_HEAD(, XenPTRegGroup) reg_grp_tbl;
> +
> +    uint32_t machine_irq;
> +
> +    uint32_t power_mgmt;
> +    XenPTPM *pm_state;
> +
> +    MemoryRegion bar[PCI_NUM_REGIONS - 1];
> +    MemoryRegion rom;
> +};
> +
> +void pt_config_init(XenPCIPassthroughState *s);
> +void pt_config_delete(XenPCIPassthroughState *s);
> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable);
> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
> +                        int io_enable, int mem_enable);
> +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address);
> +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address);
> +int pt_bar_offset_to_index(uint32_t offset);
> +
> +static inline pcibus_t pt_get_emul_size(PTBarFlag flag, pcibus_t r_size)
> +{
> +    /* align resource size (memory type only) */
> +    if (flag == PT_BAR_FLAG_MEM) {
> +        return (r_size + XC_PAGE_SIZE - 1) & XC_PAGE_MASK;
> +    } else {
> +        return r_size;
> +    }
> +}
> +
> +/* INTx */
> +static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
> +{
> +    return host_pci_get_byte(s->real_device, PCI_INTERRUPT_PIN);
> +}
> +uint8_t pci_intx(XenPCIPassthroughState *ptdev);
> +
> +#endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
> diff --git a/hw/xen_pci_passthrough_helpers.c b/hw/xen_pci_passthrough_helpers.c
> new file mode 100644
> index 0000000..192e918
> --- /dev/null
> +++ b/hw/xen_pci_passthrough_helpers.c
> @@ -0,0 +1,46 @@
> +#include "xen_pci_passthrough.h"
> +
> +/* The PCI Local Bus Specification, Rev. 3.0, {
> + * Section 6.2.4 Miscellaneous Registers, pp 223
> + * outlines 5 valid values for the intertupt pin (intx).
> + *  0: For devices (or device functions) that don't use an interrupt in
> + *  1: INTA#
> + *  2: INTB#
> + *  3: INTC#
> + *  4: INTD#
> + *
> + * Xen uses the following 4 values for intx
> + *  0: INTA#
> + *  1: INTB#
> + *  2: INTC#
> + *  3: INTD#
> + *
> + * Observing that these list of values are not the same, pci_read_intx()
> + * uses the following mapping from hw to xen values.
> + * This seems to reflect the current usage within Xen.
> + *
> + * PCI hardware    | Xen | Notes
> + * ----------------+-----+----------------------------------------------------
> + * 0               | 0   | No interrupt
> + * 1               | 0   | INTA#
> + * 2               | 1   | INTB#
> + * 3               | 2   | INTC#
> + * 4               | 3   | INTD#
> + * any other value | 0   | This should never happen, log error message
> +}
> + */
> +uint8_t pci_intx(XenPCIPassthroughState *ptdev)
> +{
> +    uint8_t r_val = pci_read_intx(ptdev);
> +
> +    PT_LOG("intx=%i\n", r_val);
> +    if (r_val < 1 || r_val > 4) {
> +        PT_LOG("Interrupt pin read from hardware is out of range: "
> +               "value=%i, acceptable range is 1 - 4\n", r_val);
> +        r_val = 0;
> +    } else {
> +        r_val -= 1;
> +    }
> +
> +    return r_val;
> +}
 
If xen_pci_passthrough_helpers.c is only going to contain this function,
you might as well declare it static inline and move it to
xen_pci_passthrough.h.
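
For illustration, a minimal sketch of what the header version could look
like. The name pt_intx_from_pin is hypothetical, and the raw Interrupt Pin
value is passed in as a parameter (instead of being read through
pci_read_intx()) so that the snippet compiles on its own; the PT_LOG calls
are left out.

```c
#include <stdint.h>

/* Hypothetical helper: translate the PCI Interrupt Pin register value
 * (0 = no interrupt, 1..4 = INTA#..INTD#) into the 0..3 numbering that
 * Xen uses. Any value outside 1..4 falls back to 0, as in the patch. */
static inline uint8_t pt_intx_from_pin(uint8_t pin)
{
    if (pin < 1 || pin > 4) {
        return 0;
    }
    return pin - 1;
}
```

Note that both "no interrupt" (0) and INTA# (1) end up as 0, which matches
the table in the comment above.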

* Re: [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
@ 2011-11-08 12:56     ` Stefano Stabellini
From: Stefano Stabellini @ 2011-11-08 12:56 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Fri, 28 Oct 2011, Anthony PERARD wrote:
> From: Allen Kay <allen.m.kay@intel.com>
> 
> Signed-off-by: Allen Kay <allen.m.kay@intel.com>
> Signed-off-by: Guy Zana <guy@neocleus.com>
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  Makefile.target                  |    2 +
>  hw/xen_pci_passthrough.c         |  838 ++++++++++++++++++++++++++++++++++++++
>  hw/xen_pci_passthrough.h         |  223 ++++++++++
>  hw/xen_pci_passthrough_helpers.c |   46 ++
>  4 files changed, 1109 insertions(+), 0 deletions(-)
>  create mode 100644 hw/xen_pci_passthrough.c
>  create mode 100644 hw/xen_pci_passthrough.h
>  create mode 100644 hw/xen_pci_passthrough_helpers.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index 243f9f2..36ea47d 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -217,6 +217,8 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
> 
>  # Xen PCI Passthrough
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
> 
>  # Inter-VM PCI shared memory
>  CONFIG_IVSHMEM =
> diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
> new file mode 100644
> index 0000000..b97c5b6
> --- /dev/null
> +++ b/hw/xen_pci_passthrough.c
> @@ -0,0 +1,838 @@
> +/*
> + * Copyright (c) 2007, Neocleus Corporation.
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Alex Novik <alex@neocleus.com>
> + * Allen Kay <allen.m.kay@intel.com>
> + * Guy Zana <guy@neocleus.com>
> + *
> + * This file implements direct PCI assignment to a HVM guest
> + */
> +
> +/*
> + * Interrupt Disable policy:
> + *
> + * INTx interrupt:
> + *   Initialize(register_real_device)
> + *     Map INTx(xc_physdev_map_pirq):
> + *       <fail>
> + *         - Set real Interrupt Disable bit to '1'.
> + *         - Set machine_irq and assigned_device->machine_irq to '0'.
> + *         * Don't bind INTx.
> + *
> + *     Bind INTx(xc_domain_bind_pt_pci_irq):
> + *       <fail>
> + *         - Set real Interrupt Disable bit to '1'.
> + *         - Unmap INTx.
> + *         - Decrement mapped_machine_irq[machine_irq]
> + *         - Set assigned_device->machine_irq to '0'.
> + *
> + *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
> + *     Write '0'
> + *       <ptdev->msi_trans_en is false>
> + *         - Set real bit to '0' if assigned_device->machine_irq isn't '0'.
> + *
> + *     Write '1'
> + *       <ptdev->msi_trans_en is false>
> + *         - Set real bit to '1'.
> + *
> + * MSI-INTx translation.
> + *   Initialize(xc_physdev_map_pirq_msi/pt_msi_setup)
> + *     Bind MSI-INTx(xc_domain_bind_pt_irq)
> + *       <fail>
> + *         - Unmap MSI.
> + *           <success>
> + *             - Set dev->msi->pirq to '-1'.
> + *           <fail>
> + *             - Do nothing.
> + *
> + *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
> + *     Write '0'
> + *       <ptdev->msi_trans_en is true>
> + *         - Set MSI Enable bit to '1'.
> + *
> + *     Write '1'
> + *       <ptdev->msi_trans_en is true>
> + *         - Set MSI Enable bit to '0'.
> + *
> + * MSI interrupt:
> + *   Initialize MSI register(pt_msi_setup, pt_msi_update)
> + *     Bind MSI(xc_domain_update_msi_irq)
> + *       <fail>
> + *         - Unmap MSI.
> + *         - Set dev->msi->pirq to '-1'.
> + *
> + * MSI-X interrupt:
> + *   Initialize MSI-X register(pt_msix_update_one)
> + *     Bind MSI-X(xc_domain_update_msi_irq)
> + *       <fail>
> + *         - Unmap MSI-X.
> + *         - Set entry->pirq to '-1'.
> + */
> +

You should move all the MSI-related comments to the MSI patch.
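
The INTx part of the policy comment stays with this patch, and its failure
handling can be modelled in a few lines. This is only a sketch of the policy
as the comment describes it, not of the code in the patch: the struct and
function names are hypothetical, and the two libxc calls are replaced by
plain return-code parameters so that the snippet stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

#define PT_NR_IRQS 256

/* Reference count per machine IRQ, as in the patch. */
static char mapped_machine_irq[PT_NR_IRQS];

/* Hypothetical, reduced view of the per-device state. */
struct intx_model {
    uint32_t machine_irq;   /* 0 means no INTx is bound */
    bool intx_disabled;     /* models the real Interrupt Disable bit */
};

/* map_rc and bind_rc stand in for the results of xc_physdev_map_pirq()
 * and xc_domain_bind_pt_pci_irq(); pirq is assumed to be < PT_NR_IRQS.
 * Returns true when the caller would have to unmap the pirq again. */
static bool intx_setup_model(struct intx_model *s, int map_rc, int pirq,
                             int bind_rc)
{
    if (map_rc) {
        /* Map failed: disable INTx and do not try to bind. */
        s->intx_disabled = true;
        s->machine_irq = 0;
        return false;
    }

    s->machine_irq = pirq;
    mapped_machine_irq[pirq]++;

    if (bind_rc < 0) {
        /* Bind failed: disable INTx and drop our reference. */
        s->intx_disabled = true;
        mapped_machine_irq[pirq]--;
        s->machine_irq = 0;
        return mapped_machine_irq[pirq] == 0;
    }

    return false;
}
```

In this model the bind step is only reached after a successful map, which is
what the "Don't bind INTx" line of the comment asks for.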


> +#include <sys/ioctl.h>
> +
> +#include "pci.h"
> +#include "xen.h"
> +#include "xen_backend.h"
> +#include "xen_pci_passthrough.h"
> +
> +#define PCI_BAR_ENTRIES (6)
> +
> +#define PT_NR_IRQS          (256)
> +char mapped_machine_irq[PT_NR_IRQS] = {0};
> +
> +/* Config Space */
> +static int pt_pci_config_access_check(PCIDevice *d, uint32_t address, int len)
> +{
> +    /* check offset range */
> +    if (address >= 0xFF) {
> +        PT_LOG("Error: Failed to access register with offset exceeding FFh. "
> +               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               address, len);
> +        return -1;
> +    }
> +
> +    /* check read size */
> +    if ((len != 1) && (len != 2) && (len != 4)) {
> +        PT_LOG("Error: Failed to access register with invalid access length. "
> +               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               address, len);
> +        return -1;
> +    }
> +
> +    /* check offset alignment */
> +    if (address & (len - 1)) {
> +        PT_LOG("Error: Failed to access register with invalid access size "
> +            "alignment. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +            pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +            address, len);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +int pt_bar_offset_to_index(uint32_t offset)
> +{
> +    int index = 0;
> +
> +    /* check Exp ROM BAR */
> +    if (offset == PCI_ROM_ADDRESS) {
> +        return PCI_ROM_SLOT;
> +    }
> +
> +    /* calculate BAR index */
> +    index = (offset - PCI_BASE_ADDRESS_0) >> 2;
> +    if (index >= PCI_NUM_REGIONS) {
> +        return -1;
> +    }
> +
> +    return index;
> +}
> +
> +static uint32_t pt_pci_read_config(PCIDevice *d, uint32_t address, int len)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> +    uint32_t val = 0;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    int rc = 0;
> +    int emul_len = 0;
> +    uint32_t find_addr = address;
> +
> +    if (pt_pci_config_access_check(d, address, len)) {
> +        goto exit;
> +    }
> +
> +    /* check power state transition flags */
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        /* can't accept until previous power state transition is completed.
> +         * so finished previous request here.
> +         */
> +        PT_LOG("Warning: guest want to write durring power state transition\n");
> +        goto exit;
> +    }
> +
> +    /* find register group entry */
> +    reg_grp_entry = pt_find_reg_grp(s, address);
> +    if (reg_grp_entry) {
> +        /* check 0 Hardwired register group */
> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> +            /* no need to emulate, just return 0 */
> +            val = 0;
> +            goto exit;
> +        }
> +    }
> +
> +    /* read I/O device register value */
> +    rc = host_pci_get_block(s->real_device, address, (uint8_t *)&val, len);
> +    if (!rc) {
> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
> +        memset(&val, 0xff, len);
> +    }
> +
> +    /* just return the I/O device register value for
> +     * passthrough type register group */
> +    if (reg_grp_entry == NULL) {
> +        goto exit;
> +    }
> +
> +    /* adjust the read value to appropriate CFC-CFF window */
> +    val <<= (address & 3) << 3;
> +    emul_len = len;
> +
> +    /* loop Guest request size */
> +    while (emul_len > 0) {
> +        /* find register entry to be emulated */
> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
> +        if (reg_entry) {
> +            XenPTRegInfo *reg = reg_entry->reg;
> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> +            uint8_t *ptr_val = NULL;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +
> +            /* do emulation depend on register size */
> +            switch (reg->size) {
> +            case 1:
> +                if (reg->u.b.read) {
> +                    rc = reg->u.b.read(s, reg_entry, ptr_val, valid_mask);
> +                }
> +                break;
> +            case 2:
> +                if (reg->u.w.read) {
> +                    rc = reg->u.w.read(s, reg_entry,
> +                                       (uint16_t *)ptr_val, valid_mask);
> +                }
> +                break;
> +            case 4:
> +                if (reg->u.dw.read) {
> +                    rc = reg->u.dw.read(s, reg_entry,
> +                                        (uint32_t *)ptr_val, valid_mask);
> +                }
> +                break;
> +            }
> +
> +            if (rc < 0) {
> +                hw_error("Internal error: Invalid read emulation "
> +                         "return value[%d]. I/O emulator exit.\n", rc);
> +            }
> +
> +            /* calculate next address to find */
> +            emul_len -= reg->size;
> +            if (emul_len > 0) {
> +                find_addr = real_offset + reg->size;
> +            }
> +        } else {
> +            /* nothing to do with passthrough type register,
> +             * continue to find next byte */
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* need to shift back before returning them to pci bus emulator */
> +    val >>= ((address & 3) << 3);
> +
> +exit:
> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                  address, val, len);
> +    return val;
> +}
> +
> +static void pt_pci_write_config(PCIDevice *d, uint32_t address,
> +                                uint32_t val, int len)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> +    int index = 0;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    int rc = 0;
> +    uint32_t read_val = 0;
> +    int emul_len = 0;
> +    XenPTReg *reg_entry = NULL;
> +    uint32_t find_addr = address;
> +    XenPTRegInfo *reg = NULL;
> +
> +    if (pt_pci_config_access_check(d, address, len)) {
> +        return;
> +    }
> +
> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                  address, val, len);
> +
> +    /* check unused BAR register */
> +    index = pt_bar_offset_to_index(address);
> +    if ((index >= 0) && (val > 0 && val < PT_BAR_ALLF) &&
> +        (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED)) {
> +        PT_LOG("Warning: Guest attempt to set address to unused Base Address "
> +               "Register. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               address, len);
> +    }
> +
> +    /* check power state transition flags */
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        /* can't accept untill previous power state transition is completed.
> +         * so finished previous request here.
> +         */
> +        PT_LOG("Warning: guest want to write durring power state transition\n");
> +        return;
> +    }
> +
> +    /* find register group entry */
> +    reg_grp_entry = pt_find_reg_grp(s, address);
> +    if (reg_grp_entry) {
> +        /* check 0 Hardwired register group */
> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> +            /* ignore silently */
> +            PT_LOG("Warning: Access to 0 Hardwired register. "
> +                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                   address, len);
> +            return;
> +        }
> +    }
> +
> +    /* read I/O device register value */
> +    rc = host_pci_get_block(s->real_device, address,
> +                             (uint8_t *)&read_val, len);
> +    if (!rc) {
> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
> +        memset(&read_val, 0xff, len);
> +    }
> +
> +    /* pass directly to libpci for passthrough type register group */
> +    if (reg_grp_entry == NULL) {
> +        goto out;
> +    }
> +
> +    /* adjust the read and write value to appropriate CFC-CFF window */
> +    read_val <<= (address & 3) << 3;
> +    val <<= (address & 3) << 3;
> +    emul_len = len;
> +
> +    /* loop Guest request size */
> +    while (emul_len > 0) {
> +        /* find register entry to be emulated */
> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
> +        if (reg_entry) {
> +            reg = reg_entry->reg;
> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> +            uint8_t *ptr_val = NULL;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +
> +            /* do emulation depend on register size */
> +            switch (reg->size) {
> +            case 1:
> +                if (reg->u.b.write) {
> +                    rc = reg->u.b.write(s, reg_entry, ptr_val,
> +                                        read_val >> ((real_offset & 3) << 3),
> +                                        valid_mask);
> +                }
> +                break;
> +            case 2:
> +                if (reg->u.w.write) {
> +                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
> +                                        (read_val >> ((real_offset & 3) << 3)),
> +                                        valid_mask);
> +                }
> +                break;
> +            case 4:
> +                if (reg->u.dw.write) {
> +                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
> +                                         (read_val >> ((real_offset & 3) << 3)),
> +                                         valid_mask);
> +                }
> +                break;
> +            }
> +
> +            if (rc < 0) {
> +                hw_error("Internal error: Invalid write emulation "
> +                         "return value[%d]. I/O emulator exit.\n", rc);
> +            }
> +
> +            /* calculate next address to find */
> +            emul_len -= reg->size;
> +            if (emul_len > 0) {
> +                find_addr = real_offset + reg->size;
> +            }
> +        } else {
> +            /* nothing to do with passthrough type register,
> +             * continue to find next byte */
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* need to shift back before passing them to libpci */
> +    val >>= (address & 3) << 3;
> +
> +out:
> +    if (!(reg && reg->no_wb)) {
> +        /* unknown regs are passed through */
> +        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
> +
> +        if (!rc) {
> +            PT_LOG("Error: pci_write_block failed. return value[%d].\n", rc);
> +        }
> +    }
> +
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        qemu_mod_timer(s->pm_state->pm_timer,
> +                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
> +    }
> +}

Where is this timer allocated and initialized?
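
Separately, the valid_mask arithmetic used in both the read and the write
loop above is easy to check in isolation. A minimal sketch, assuming a
hypothetical helper pt_valid_mask that takes emul_len and
(find_addr - real_offset) as plain parameters:

```c
#include <stdint.h>

/* Hypothetical helper: mask of the register bytes covered by the access.
 * remaining:   bytes of the guest request still to be handled (1..4)
 * byte_offset: offset of the access inside the register (0..3) */
static inline uint32_t pt_valid_mask(int remaining, uint32_t byte_offset)
{
    uint32_t mask = 0xFFFFFFFFu >> ((4 - remaining) << 3);

    return mask << (byte_offset << 3);
}
```

For example, a one-byte access to the second byte of a register gives
0x0000FF00, and a two-byte access to its upper half gives 0xFFFF0000.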


> +/* ioport/iomem space*/
> +static void pt_iomem_map(XenPCIPassthroughState *s, int i,
> +                         pcibus_t e_phys, pcibus_t e_size, int type)
> +{
> +    uint32_t old_ebase = s->bases[i].e_physbase;
> +    bool first_map = s->bases[i].e_size == 0;
> +    int ret = 0;
> +
> +    s->bases[i].e_physbase = e_phys;
> +    s->bases[i].e_size = e_size;
> +
> +    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%%d"
> +           " len=%#"PRIx64" index=%d first_map=%d\n",
> +           e_phys, s->bases[i].access.maddr, /*type,*/
> +           e_size, i, first_map);
> +
> +    if (e_size == 0) {
> +        return;
> +    }
> +
> +    if (!first_map && old_ebase != -1) {
> +        /* Remove old mapping */
> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                               old_ebase >> XC_PAGE_SHIFT,
> +                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> +                               DPCI_REMOVE_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove old mapping failed!\n");
> +            return;
> +        }
> +    }
> +
> +    /* map only valid guest address */
> +    if (e_phys != -1) {
> +        /* Create new mapping */
> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> +                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> +                                   DPCI_ADD_MAPPING);
> +
> +        if (ret != 0) {
> +            PT_LOG("Error: create new mapping failed!\n");
> +        }
> +    }
> +}
> +
> +static void pt_ioport_map(XenPCIPassthroughState *s, int i,
> +                          pcibus_t e_phys, pcibus_t e_size, int type)
> +{
> +    uint32_t old_ebase = s->bases[i].e_physbase;
> +    bool first_map = s->bases[i].e_size == 0;
> +    int ret = 0;
> +
> +    s->bases[i].e_physbase = e_phys;
> +    s->bases[i].e_size = e_size;
> +
> +    PT_LOG("e_phys=%#04"PRIx64" pio_base=%#04"PRIx64" len=%"PRId64" index=%d"
> +           " first_map=%d\n",
> +           e_phys, s->bases[i].access.pio_base, e_size, i, first_map);
> +
> +    if (e_size == 0) {
> +        return;
> +    }
> +
> +    if (!first_map && old_ebase != -1) {
> +        /* Remove old mapping */
> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, old_ebase,
> +                                       s->bases[i].access.pio_base, e_size,
> +                                       DPCI_REMOVE_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove old mapping failed!\n");
> +            return;
> +        }
> +    }
> +
> +    /* map only valid guest address (include 0) */
> +    if (e_phys != -1) {
> +        /* Create new mapping */
> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, e_phys,
> +                                       s->bases[i].access.pio_base, e_size,
> +                                       DPCI_ADD_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: create new mapping failed!\n");
> +        }
> +    }
> +
> +}
> +
> +
> +/* mapping BAR */
> +
> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
> +                        int io_enable, int mem_enable)
> +{
> +    PCIDevice *dev = &s->dev;
> +    PCIIORegion *r;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    pcibus_t r_size = 0, r_addr = -1;
> +    int rc = 0;
> +
> +    r = &dev->io_regions[bar];
> +
> +    /* check valid region */
> +    if (!r->size) {
> +        return;
> +    }
> +
> +    base = &s->bases[bar];
> +    /* skip unused BAR or upper 64bit BAR */
> +    if ((base->bar_flag == PT_BAR_FLAG_UNUSED)
> +        || (base->bar_flag == PT_BAR_FLAG_UPPER)) {
> +           return;
> +    }
> +
> +    /* copy region address to temporary */
> +    r_addr = r->addr;
> +
> +    /* need unmapping in case I/O Space or Memory Space disable */
> +    if (((base->bar_flag == PT_BAR_FLAG_IO) && !io_enable) ||
> +        ((base->bar_flag == PT_BAR_FLAG_MEM) && !mem_enable)) {
> +        r_addr = -1;
> +    }
> +    if ((bar == PCI_ROM_SLOT) && (r_addr != -1)) {
> +        reg_grp_entry = pt_find_reg_grp(s, PCI_ROM_ADDRESS);
> +        if (reg_grp_entry) {
> +            reg_entry = pt_find_reg(reg_grp_entry, PCI_ROM_ADDRESS);
> +            if (reg_entry && !(reg_entry->data & PCI_ROM_ADDRESS_ENABLE)) {
> +                r_addr = -1;
> +            }
> +        }
> +    }
> +
> +    /* prevent guest software mapping memory resource to 00000000h */
> +    if ((base->bar_flag == PT_BAR_FLAG_MEM) && (r_addr == 0)) {
> +        r_addr = -1;
> +    }
> +
> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
> +
> +    rc = pci_check_bar_overlap(dev, r_addr, r_size, r->type);
> +    if (rc > 0) {
> +        PT_LOG("Warning: s[%02x:%02x.%x][Region:%d][Address:%"FMT_PCIBUS"h]"
> +               "[Size:%"FMT_PCIBUS"h] is overlapped.\n", pci_bus_num(dev->bus),
> +               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), bar,
> +               r_addr, r_size);
> +    }
> +
> +    /* check whether we need to update the mapping or not */
> +    if (r_addr != s->bases[bar].e_physbase) {
> +        /* mapping BAR */
> +        if (base->bar_flag == PT_BAR_FLAG_IO) {
> +            pt_ioport_map(s, bar, r_addr, r_size, r->type);
> +        } else {
> +            pt_iomem_map(s, bar, r_addr, r_size, r->type);
> +        }
> +    }
> +}
> +
> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
> +{
> +    int i;
> +
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        pt_bar_mapping_one(s, i, io_enable, mem_enable);
> +    }
> +}
> +
> +/* register regions */
> +static int pt_register_regions(XenPCIPassthroughState *s)
> +{
> +    int i = 0;
> +    uint32_t bar_data = 0;
> +    HostPCIDevice *d = s->real_device;
> +
> +    /* Register PIO/MMIO BARs */
> +    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
> +        HostPCIIORegion *r = &d->io_regions[i];
> +
> +        if (r->base_addr) {
> +            s->bases[i].e_physbase = r->base_addr;
> +            s->bases[i].access.u = r->base_addr;
> +
> +            /* Register current region */
> +            if (r->flags & IORESOURCE_IO) {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
> +                                 &s->bar[i]);
> +            } else if (r->flags & IORESOURCE_PREFETCH) {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
> +                                 &s->bar[i]);
> +            } else {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                                 &s->bar[i]);
> +            }
> +
> +            PT_LOG("IO region registered (size=0x%08"PRIx64
> +                   " base_addr=0x%08"PRIx64")\n",
> +                   r->size, r->base_addr);
> +        }
> +    }
> +
> +    /* Register expansion ROM address */
> +    if (d->rom.base_addr && d->rom.size) {
> +        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
> +        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
> +        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
> +            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
> +            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
> +        }
> +
> +        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
> +        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
> +
> +        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
> +                                      "xen-pci-pt-rom", d->rom.size);
> +        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
> +                         &s->rom);
> +
> +        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
> +               " base_addr=0x%08"PRIx64")\n",
> +               d->rom.size, d->rom.base_addr);
> +    }
> +
> +    return 0;
> +}
> +
> +static void pt_unregister_regions(XenPCIPassthroughState *s)
> +{
> +    int i, type, rc;
> +    uint32_t e_size;
> +    PCIDevice *d = &s->dev;
> +
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        e_size = s->bases[i].e_size;
> +        if ((e_size == 0) || (s->bases[i].e_physbase == -1)) {
> +            continue;
> +        }
> +
> +        type = d->io_regions[i].type;
> +
> +        if (type == PCI_BASE_ADDRESS_SPACE_MEMORY
> +            || type == PCI_BASE_ADDRESS_MEM_PREFETCH) {
> +            rc = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                    s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> +                    s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                    (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> +                    DPCI_REMOVE_MAPPING);
> +            if (rc != 0) {
> +                PT_LOG("Error: remove old mem mapping failed!\n");
> +                continue;
> +            }
> +
> +        } else if (type == PCI_BASE_ADDRESS_SPACE_IO) {
> +            rc = xc_domain_ioport_mapping(xen_xc, xen_domid,
> +                        s->bases[i].e_physbase,
> +                        s->bases[i].access.pio_base,
> +                        e_size,
> +                        DPCI_REMOVE_MAPPING);
> +            if (rc != 0) {
> +                PT_LOG("Error: remove old io mapping failed!\n");
> +                continue;
> +            }
> +        }
> +    }
> +}
> +
> +static int pt_initfn(PCIDevice *pcidev)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
> +    int dom, bus;
> +    unsigned slot, func;
> +    int rc = 0;
> +    uint32_t machine_irq;
> +    int pirq = -1;
> +
> +    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
> +        fprintf(stderr, "error parse bdf: %s\n", s->hostaddr);
> +        return -1;
> +    }
> +
> +    /* register real device */
> +    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
> +           bus, slot, func, s->dev.devfn);
> +
> +    s->real_device = host_pci_device_get(bus, slot, func);
> +    if (!s->real_device) {
> +        return -1;
> +    }
> +
> +    s->is_virtfn = s->real_device->is_virtfn;
> +    if (s->is_virtfn) {
> +        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
> +               s->real_device->domain, bus, slot, func);
> +    }
> +
> +    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
> +    if (host_pci_get_block(s->real_device, 0, pcidev->config,
> +                           PCI_CONFIG_SPACE_SIZE) == -1) {
> +        return -1;
> +    }
> +
> +    /* Handle real device's MMIO/PIO BARs */
> +    pt_register_regions(s);
> +
> +    /* reinitialize each config register to be emulated */
> +    pt_config_init(s);

pt_config_init() is implemented in the next patch, so you might as well
add this call there.


> +    /* Bind interrupt */
> +    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
> +        PT_LOG("no pin interrupt\n");
> +        goto out;
> +    }
> +
> +    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
> +    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
> +
> +    if (rc) {
> +        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
> +
> +        /* Disable PCI intx assertion (turn on bit10 of devctl) */
> +        host_pci_set_word(s->real_device,
> +                          PCI_COMMAND,
> +                          pci_get_word(s->dev.config + PCI_COMMAND)
> +                          | PCI_COMMAND_INTX_DISABLE);
> +        machine_irq = 0;
> +        s->machine_irq = 0;
> +    } else {
> +        machine_irq = pirq;
> +        s->machine_irq = pirq;
> +        mapped_machine_irq[machine_irq]++;
> +    }
> +
> +    /* bind machine_irq to device */
> +    if (rc < 0 && machine_irq != 0) {
> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> +        uint8_t e_intx = pci_intx(s);
> +
> +        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
> +                                       e_device, e_intx);
> +        if (rc < 0) {
> +            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);
> +
> +            /* Disable PCI intx assertion (turn on bit10 of devctl) */
> +            host_pci_set_word(s->real_device, PCI_COMMAND,
> +                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
> +                              | PCI_COMMAND_INTX_DISABLE);
> +            mapped_machine_irq[machine_irq]--;
> +
> +            if (mapped_machine_irq[machine_irq] == 0) {
> +                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
> +                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
> +                           rc);
> +                }
> +            }
> +            s->machine_irq = 0;
> +        }
> +    }
> +
> +out:
> +    PT_LOG("Real physical device %02x:%02x.%x registered successfuly!\n"
> +           "IRQ type = %s\n", bus, slot, func, "INTx");
> +
> +    return 0;
> +}
> +
> +static int pt_unregister_device(PCIDevice *pcidev)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
> +    uint8_t e_device, e_intx;
> +    uint32_t machine_irq;
> +    int rc;
> +
> +    /* Unbind interrupt */
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    e_intx = pci_intx(s);
> +    machine_irq = s->machine_irq;
> +
> +    if (machine_irq) {
> +        rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
> +                                     PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
> +        if (rc < 0) {
> +            PT_LOG("Error: Unbinding of interrupt failed! rc=%d\n", rc);
> +        }
> +    }
> +
> +    if (machine_irq) {
> +        mapped_machine_irq[machine_irq]--;
> +
> +        if (mapped_machine_irq[machine_irq] == 0) {
> +            rc = xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq);
> +
> +            if (rc < 0) {
> +                PT_LOG("Error: Unmaping of interrupt failed! rc=%d\n", rc);
> +            }
> +        }
> +    }
> +
> +    /* delete all emulated config registers */
> +    pt_config_delete(s);
> +
> +    /* unregister real device's MMIO/PIO BARs */
> +    pt_unregister_regions(s);
> +
> +    host_pci_device_put(s->real_device);
> +
> +    return 0;
> +}
> +
> +static PCIDeviceInfo xen_pci_passthrough = {
> +    .init = pt_initfn,
> +    .exit = pt_unregister_device,
> +    .qdev.name = "xen-pci-passthrough",
> +    .qdev.desc = "Assign an host pci device with Xen",
> +    .qdev.size = sizeof(XenPCIPassthroughState),
> +    .config_read = pt_pci_read_config,
> +    .config_write = pt_pci_write_config,
> +    .is_express = 0,
> +    .qdev.props = (Property[]) {
> +        DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
> +        DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
> +                        0, false),
> +        DEFINE_PROP_END_OF_LIST(),
> +    }
> +};
> +
> +static void xen_passthrough_register(void)
> +{
> +    pci_qdev_register(&xen_pci_passthrough);
> +}
> +
> +device_init(xen_passthrough_register);
> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
> new file mode 100644
> index 0000000..2d1979d
> --- /dev/null
> +++ b/hw/xen_pci_passthrough.h
> @@ -0,0 +1,223 @@
> +#ifndef QEMU_HW_XEN_PCI_PASSTHROUGH_H
> +#  define QEMU_HW_XEN_PCI_PASSTHROUGH_H
> +
> +#include "qemu-common.h"
> +#include "xen_common.h"
> +#include "pci.h"
> +#include "host-pci-device.h"
> +
> +#define PT_LOGGING_ENABLED
> +#define PT_DEBUG_PCI_CONFIG_ACCESS
> +
> +#ifdef PT_LOGGING_ENABLED
> +#  define PT_LOG(_f, _a...)   fprintf(stderr, "%s: " _f, __func__, ##_a)
> +#else
> +#  define PT_LOG(_f, _a...)
> +#endif
> +
> +#ifdef PT_DEBUG_PCI_CONFIG_ACCESS
> +#  define PT_LOG_CONFIG(_f, _a...) PT_LOG(_f, ##_a)
> +#else
> +#  define PT_LOG_CONFIG(_f, _a...)
> +#endif
> +
> +
> +typedef struct XenPTRegInfo XenPTRegInfo;
> +typedef struct XenPTReg XenPTReg;
> +
> +typedef struct XenPCIPassthroughState XenPCIPassthroughState;
> +
> +/* function type for config reg */
> +typedef uint32_t (*conf_reg_init)
> +    (XenPCIPassthroughState *, XenPTRegInfo *, uint32_t real_offset);
> +typedef int (*conf_dword_write)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint32_t *val, uint32_t dev_value, uint32_t valid_mask);
> +typedef int (*conf_word_write)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint16_t *val, uint16_t dev_value, uint16_t valid_mask);
> +typedef int (*conf_byte_write)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint8_t *val, uint8_t dev_value, uint8_t valid_mask);
> +typedef int (*conf_dword_read)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint32_t *val, uint32_t valid_mask);
> +typedef int (*conf_word_read)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint16_t *val, uint16_t valid_mask);
> +typedef int (*conf_byte_read)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
> +     uint8_t *val, uint8_t valid_mask);
> +typedef int (*conf_dword_restore)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
> +     uint32_t dev_value, uint32_t *val);
> +typedef int (*conf_word_restore)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
> +     uint16_t dev_value, uint16_t *val);
> +typedef int (*conf_byte_restore)
> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
> +     uint8_t dev_value, uint8_t *val);
> +
> +/* power state transition */
> +#define PT_FLAG_TRANSITING 0x0001
> +
> +
> +typedef enum {
> +    GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
> +    GRP_TYPE_EMU,                               /* emul reg group */
> +} RegisterGroupType;
> +
> +typedef enum {
> +    PT_BAR_FLAG_MEM = 0,                        /* Memory type BAR */
> +    PT_BAR_FLAG_IO,                             /* I/O type BAR */
> +    PT_BAR_FLAG_UPPER,                          /* upper 64bit BAR */
> +    PT_BAR_FLAG_UNUSED,                         /* unused BAR */
> +} PTBarFlag;
> +
> +
> +typedef struct XenPTRegion {
> +    /* Virtual phys base & size */
> +    uint32_t e_physbase;
> +    uint32_t e_size;
> +    /* Index of region in qemu */
> +    uint32_t memory_index;
> +    /* BAR flag */
> +    PTBarFlag bar_flag;
> +    /* Translation of the emulated address */
> +    union {
> +        uint64_t maddr;
> +        uint64_t pio_base;
> +        uint64_t u;
> +    } access;
> +} XenPTRegion;
> +
> +/* XenPTRegInfo declaration
> + * - only for emulated register (either a part or whole bit).
> + * - for passthrough register that need special behavior (like interacting with
> + *   other component), set emu_mask to all 0 and specify r/w func properly.
> + * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
> + */
> +
> +/* emulated register information */
> +struct XenPTRegInfo {
> +    uint32_t offset;
> +    uint32_t size;
> +    uint32_t init_val;
> +    /* reg read only field mask (ON:RO/ROS, OFF:other) */
> +    uint32_t ro_mask;
> +    /* reg emulate field mask (ON:emu, OFF:passthrough) */
> +    uint32_t emu_mask;
> +    /* no write back allowed */
> +    uint32_t no_wb;
> +    conf_reg_init init;
> +    /* read/write/restore function pointer
> +     * for double_word/word/byte size */
> +    union {
> +        struct {
> +            conf_dword_write write;
> +            conf_dword_read read;
> +            conf_dword_restore restore;
> +        } dw;
> +        struct {
> +            conf_word_write write;
> +            conf_word_read read;
> +            conf_word_restore restore;
> +        } w;
> +        struct {
> +            conf_byte_write write;
> +            conf_byte_read read;
> +            conf_byte_restore restore;
> +        } b;
> +    } u;
> +};
> +
> +/* emulated register management */
> +struct XenPTReg {
> +    QLIST_ENTRY(XenPTReg) entries;
> +    XenPTRegInfo *reg;
> +    uint32_t data;
> +};
> +
> +typedef struct XenPTRegGroupInfo XenPTRegGroupInfo;
> +
> +/* emul reg group size initialize method */
> +typedef uint8_t (*pt_reg_size_init_fn)
> +    (XenPCIPassthroughState *, const XenPTRegGroupInfo *,
> +     uint32_t base_offset);
> +
> +/* emulated register group information */
> +struct XenPTRegGroupInfo {
> +    uint8_t grp_id;
> +    RegisterGroupType grp_type;
> +    uint8_t grp_size;
> +    pt_reg_size_init_fn size_init;
> +    XenPTRegInfo *emu_reg_tbl;
> +};
> +
> +/* emul register group management table */
> +typedef struct XenPTRegGroup {
> +    QLIST_ENTRY(XenPTRegGroup) entries;
> +    const XenPTRegGroupInfo *reg_grp;
> +    uint32_t base_offset;
> +    uint8_t size;
> +    QLIST_HEAD(, XenPTReg) reg_tbl_list;
> +} XenPTRegGroup;
> +
> +
> +typedef struct XenPTPM {
> +    QEMUTimer *pm_timer;  /* QEMUTimer struct */
> +    int no_soft_reset;    /* No Soft Reset flags */
> +    uint16_t flags;       /* power state transition flags */
> +    uint16_t pmc_field;   /* Power Management Capabilities field */
> +    int pm_delay;         /* power state transition delay */
> +    uint16_t cur_state;   /* current power state */
> +    uint16_t req_state;   /* requested power state */
> +    uint32_t pm_base;     /* Power Management Capability reg base offset */
> +    uint32_t aer_base;    /* AER Capability reg base offset */
> +} XenPTPM;
> +
> +struct XenPCIPassthroughState {
> +    PCIDevice dev;
> +
> +    char *hostaddr;
> +    bool is_virtfn;
> +    HostPCIDevice *real_device;
> +    XenPTRegion bases[PCI_NUM_REGIONS]; /* Access regions */
> +    QLIST_HEAD(, XenPTRegGroup) reg_grp_tbl;
> +
> +    uint32_t machine_irq;
> +
> +    uint32_t power_mgmt;
> +    XenPTPM *pm_state;
> +
> +    MemoryRegion bar[PCI_NUM_REGIONS - 1];
> +    MemoryRegion rom;
> +};
> +
> +void pt_config_init(XenPCIPassthroughState *s);
> +void pt_config_delete(XenPCIPassthroughState *s);
> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable);
> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
> +                        int io_enable, int mem_enable);
> +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address);
> +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address);
> +int pt_bar_offset_to_index(uint32_t offset);
> +
> +static inline pcibus_t pt_get_emul_size(PTBarFlag flag, pcibus_t r_size)
> +{
> +    /* align resource size (memory type only) */
> +    if (flag == PT_BAR_FLAG_MEM) {
> +        return (r_size + XC_PAGE_SIZE - 1) & XC_PAGE_MASK;
> +    } else {
> +        return r_size;
> +    }
> +}
> +
> +/* INTx */
> +static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
> +{
> +    return host_pci_get_byte(s->real_device, PCI_INTERRUPT_PIN);
> +}
> +uint8_t pci_intx(XenPCIPassthroughState *ptdev);
> +
> +#endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
> diff --git a/hw/xen_pci_passthrough_helpers.c b/hw/xen_pci_passthrough_helpers.c
> new file mode 100644
> index 0000000..192e918
> --- /dev/null
> +++ b/hw/xen_pci_passthrough_helpers.c
> @@ -0,0 +1,46 @@
> +#include "xen_pci_passthrough.h"
> +
> +/* The PCI Local Bus Specification, Rev. 3.0, {
> + * Section 6.2.4 Miscellaneous Registers, pp 223
> + * outlines 5 valid values for the interrupt pin (intx).
> + *  0: For devices (or device functions) that don't use an interrupt pin
> + *  1: INTA#
> + *  2: INTB#
> + *  3: INTC#
> + *  4: INTD#
> + *
> + * Xen uses the following 4 values for intx
> + *  0: INTA#
> + *  1: INTB#
> + *  2: INTC#
> + *  3: INTD#
> + *
> + * Observing that these list of values are not the same, pci_read_intx()
> + * uses the following mapping from hw to xen values.
> + * This seems to reflect the current usage within Xen.
> + *
> + * PCI hardware    | Xen | Notes
> + * ----------------+-----+----------------------------------------------------
> + * 0               | 0   | No interrupt
> + * 1               | 0   | INTA#
> + * 2               | 1   | INTB#
> + * 3               | 2   | INTC#
> + * 4               | 3   | INTD#
> + * any other value | 0   | This should never happen, log error message
> +}
> + */
> +uint8_t pci_intx(XenPCIPassthroughState *ptdev)
> +{
> +    uint8_t r_val = pci_read_intx(ptdev);
> +
> +    PT_LOG("intx=%i\n", r_val);
> +    if (r_val < 1 || r_val > 4) {
> +        PT_LOG("Interrupt pin read from hardware is out of range: "
> +               "value=%i, acceptable range is 1 - 4\n", r_val);
> +        r_val = 0;
> +    } else {
> +        r_val -= 1;
> +    }
> +
> +    return r_val;
> +}
 
If xen_pci_passthrough_helpers.c is only going to contain this function,
you might as well declare it static inline and move it to
xen_pci_passthrough.h.
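
To illustrate, here is a rough sketch of what a static inline variant could
look like. It is only a sketch under stated assumptions: pt_pin_to_xen_intx
is a hypothetical name, it takes the raw Interrupt Pin byte instead of a
XenPCIPassthroughState so that it compiles on its own, and the PT_LOG calls
are left out.

```c
#include <stdint.h>

/*
 * Hypothetical standalone helper (name and signature are assumptions).
 * PCI Interrupt Pin register: 0 = no pin, 1..4 = INTA#..INTD#.
 * Xen intx numbering:         0..3 = INTA#..INTD#.
 * Any value outside 1..4 is mapped to 0, as in the quoted pci_intx().
 */
static inline uint8_t pt_pin_to_xen_intx(uint8_t pin)
{
    if (pin < 1 || pin > 4) {
        return 0;
    }
    return pin - 1;
}
```

In the header, the real function would presumably keep the
XenPCIPassthroughState argument and call pci_read_intx() to obtain the pin
value before applying this mapping.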


* Re: [Qemu-devel] [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
  2011-10-28 15:07   ` Anthony PERARD
@ 2011-11-08 12:57     ` Stefano Stabellini
  -1 siblings, 0 replies; 60+ messages in thread
From: Stefano Stabellini @ 2011-11-08 12:57 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

Obviously passthrough cannot work without this patch, but qemu should be
able to compile anyway. Please add to the previous patch empty stub
implementations for all the exported functions that you are going to
implement here.
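
Something along these lines would do as a sketch, not as the final code.
The prototypes are the ones declared in xen_pci_passthrough.h; the opaque
typedefs only stand in for the real definitions so that the fragment compiles
on its own, and I am assuming that pt_config_delete is among the functions
implemented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins for the types defined in xen_pci_passthrough.h. */
typedef struct XenPCIPassthroughState XenPCIPassthroughState;
typedef struct XenPTRegGroup XenPTRegGroup;
typedef struct XenPTReg XenPTReg;

/* Empty stub: no config register is emulated yet. */
void pt_config_init(XenPCIPassthroughState *s)
{
    (void)s;
}

/* Empty stub: nothing was allocated, so nothing to free. */
void pt_config_delete(XenPCIPassthroughState *s)
{
    (void)s;
}

/* Stub: no register group exists yet, so every lookup fails. */
XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
{
    (void)s;
    (void)address;
    return NULL;
}

/* Stub: no register entry exists yet, so every lookup fails. */
XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
{
    (void)reg_grp;
    (void)address;
    return NULL;
}
```

The callers in the previous patch already check the lookup results for NULL
(see pt_bar_mapping_one), so returning NULL from the stubs should be safe
there.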

I see that the timer is allocated here.
In that case it would make sense to move the timer update to this patch.

On Fri, 28 Oct 2011, Anthony PERARD wrote:
> From: Allen Kay <allen.m.kay@intel.com>
> 
> Signed-off-by: Allen Kay <allen.m.kay@intel.com>
> Signed-off-by: Guy Zana <guy@neocleus.com>
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  Makefile.target                      |    1 +
>  hw/xen_pci_passthrough.h             |    2 +
>  hw/xen_pci_passthrough_config_init.c | 2068 ++++++++++++++++++++++++++++++++++
>  3 files changed, 2071 insertions(+), 0 deletions(-)
>  create mode 100644 hw/xen_pci_passthrough_config_init.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index 36ea47d..c32c688 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -219,6 +219,7 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
> 
>  # Inter-VM PCI shared memory
>  CONFIG_IVSHMEM =
> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
> index 2d1979d..ebc04fd 100644
> --- a/hw/xen_pci_passthrough.h
> +++ b/hw/xen_pci_passthrough.h
> @@ -61,6 +61,8 @@ typedef int (*conf_byte_restore)
>  /* power state transition */
>  #define PT_FLAG_TRANSITING 0x0001
> 
> +#define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
> +
> 
>  typedef enum {
>      GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
> diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
> new file mode 100644
> index 0000000..4103b59
> --- /dev/null
> +++ b/hw/xen_pci_passthrough_config_init.c
> @@ -0,0 +1,2068 @@
> +/*
> + * Copyright (c) 2007, Neocleus Corporation.
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Alex Novik <alex@neocleus.com>
> + * Allen Kay <allen.m.kay@intel.com>
> + * Guy Zana <guy@neocleus.com>
> + *
> + * This file implements direct PCI assignment to a HVM guest
> + */
> +
> +#include "qemu-timer.h"
> +#include "xen_backend.h"
> +#include "xen_pci_passthrough.h"
> +
> +#define PT_MERGE_VALUE(value, data, val_mask) \
> +    (((value) & (val_mask)) | ((data) & ~(val_mask)))
> +
> +#define PT_INVALID_REG          0xFFFFFFFF      /* invalid register value */
> +
> +/* prototype */
> +
> +static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> +                                uint32_t real_offset);
> +static int pt_init_pci_config(XenPCIPassthroughState *s);
> +
> +
> +/* helper */
> +
> +/* A return value of 1 means the capability should NOT be exposed to guest. */
> +static int pt_hide_dev_cap(const HostPCIDevice *d, uint8_t grp_id)
> +{
> +    switch (grp_id) {
> +    case PCI_CAP_ID_EXP:
> +        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
> +         * Controller looks trivial, e.g., the PCI Express Capabilities
> +         * Register is 0. We should not try to expose it to guest.
> +         */
> +        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
> +                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
> +            return 1;
> +        }
> +        break;
> +    }
> +    return 0;
> +}
> +
> +/*   find emulate register group entry */
> +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
> +{
> +    XenPTRegGroup *entry = NULL;
> +
> +    /* find register group entry */
> +    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
> +        /* check address */
> +        if ((entry->base_offset <= address)
> +            && ((entry->base_offset + entry->size) > address)) {
> +            return entry;
> +        }
> +    }
> +
> +    /* group entry not found */
> +    return NULL;
> +}
> +
> +/* find emulate register entry */
> +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
> +{
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +    uint32_t real_offset = 0;
> +
> +    /* find register entry */
> +    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
> +        reg = reg_entry->reg;
> +        real_offset = reg_grp->base_offset + reg->offset;
> +        /* check address */
> +        if ((real_offset <= address)
> +            && ((real_offset + reg->size) > address)) {
> +            return reg_entry;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +/* parse BAR */
> +static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
> +{
> +    PCIDevice *d = &s->dev;
> +    XenPTRegion *region = NULL;
> +    PCIIORegion *r;
> +    int index = 0;
> +
> +    /* check 64bit BAR */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if ((0 < index) && (index < PCI_ROM_SLOT)) {
> +        int flags = s->real_device->io_regions[index - 1].flags;
> +
> +        if ((flags & IORESOURCE_MEM) && (flags & IORESOURCE_MEM_64)) {
> +            region = &s->bases[index - 1];
> +            if (region->bar_flag != PT_BAR_FLAG_UPPER) {
> +                return PT_BAR_FLAG_UPPER;
> +            }
> +        }
> +    }
> +
> +    /* check unused BAR */
> +    r = &d->io_regions[index];
> +    if (r->size == 0) {
> +        return PT_BAR_FLAG_UNUSED;
> +    }
> +
> +    /* for ExpROM BAR */
> +    if (index == PCI_ROM_SLOT) {
> +        return PT_BAR_FLAG_MEM;
> +    }
> +
> +    /* check BAR I/O indicator */
> +    if (s->real_device->io_regions[index].flags & IORESOURCE_IO) {
> +        return PT_BAR_FLAG_IO;
> +    } else {
> +        return PT_BAR_FLAG_MEM;
> +    }
> +}
> +
> +
> +/****************
> + * general register functions
> + */
> +
> +/* register initialization function */
> +
> +static uint32_t pt_common_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return reg->init_val;
> +}
> +
> +/* Read register functions */
> +
> +static int pt_byte_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint8_t *value, uint8_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint8_t valid_emu_mask = 0;
> +
> +    /* emulate byte register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = 0;
> +
> +    /* emulate word register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_long_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint32_t *value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t valid_emu_mask = 0;
> +
> +    /* emulate long register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +   return 0;
> +}
> +
> +/* Write register functions */
> +
> +static int pt_byte_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint8_t *value, uint8_t dev_value,
> +                             uint8_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint8_t writable_mask = 0;
> +    uint8_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint16_t *value, uint16_t dev_value,
> +                             uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
> +static int pt_long_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint32_t *value, uint32_t dev_value,
> +                             uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
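
The three write helpers above share one pattern: the guest value is merged twice, once into the emulated copy and once into the value forwarded to the device. A minimal sketch of that pattern for the byte case follows. PT_MERGE_VALUE is not defined in this hunk, so the MERGE_VALUE macro below is an assumed definition (bits selected by the mask come from the first argument, the rest from the second), and struct reg_masks is a hypothetical stand-in for the relevant XenPTRegInfo fields.

```c
#include <stdint.h>

/* Assumed semantics of PT_MERGE_VALUE: masked bits from value, rest from data. */
#define MERGE_VALUE(value, data, mask) \
    (((value) & (mask)) | ((data) & ~(mask)))

/* Hypothetical stand-in for the emu_mask/ro_mask pair of a register entry. */
struct reg_masks {
    uint8_t emu_mask;   /* bits that are emulated rather than passed through */
    uint8_t ro_mask;    /* bits the guest may not modify */
};

/*
 * Same two steps as pt_byte_reg_write: update the emulated copy with the
 * writable emulated bits, then return the value to forward to the device,
 * where only non-emulated bits come from the guest.
 */
static uint8_t byte_reg_write(const struct reg_masks *reg, uint8_t *emu_data,
                              uint8_t guest_value, uint8_t dev_value,
                              uint8_t valid_mask)
{
    uint8_t writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
    uint8_t throughable_mask = ~reg->emu_mask & valid_mask;

    *emu_data = MERGE_VALUE(guest_value, *emu_data, writable_mask);
    return MERGE_VALUE(guest_value, dev_value, throughable_mask);
}
```

With emu_mask 0xF0 and ro_mask 0x30, only bits 7:6 of the emulated copy follow the guest, and only bits 3:0 of the guest value reach the device.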
> +
> +/* common restore register functions */
> +static int pt_byte_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                               uint32_t real_offset, uint8_t dev_value,
> +                               uint8_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_byte(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                               uint32_t real_offset, uint16_t dev_value,
> +                               uint16_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_word(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
> +
> +    return 0;
> +}
> +
> +
> +/* XenPTRegInfo declaration
> + * - only for emulated register (either a part or whole bit).
> + * - for passthrough register that need special behavior (like interacting with
> + *   other component), set emu_mask to all 0 and specify r/w func properly.
> + * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
> + */
> +
> +/********************
> + * Header Type0
> + */
> +
> +static uint32_t pt_vendor_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return s->real_device->vendor_id;
> +}
> +static uint32_t pt_device_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return s->real_device->device_id;
> +}
> +static uint32_t pt_status_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    int reg_field = 0;
> +
> +    /* find Header register group */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_CAPABILITY_LIST);
> +    if (reg_grp_entry) {
> +        /* find Capabilities Pointer register */
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_CAPABILITY_LIST);
> +        if (reg_entry) {
> +            /* check Capabilities Pointer register */
> +            if (reg_entry->data) {
> +                reg_field |= PCI_STATUS_CAP_LIST;
> +            } else {
> +                reg_field &= ~PCI_STATUS_CAP_LIST;
> +            }
> +        } else {
> +            hw_error("Internal error: Couldn't find pt_reg_tbl for "
> +                     "Capabilities Pointer register. I/O emulator exit.\n");
> +        }
> +    } else {
> +        hw_error("Internal error: Couldn't find pt_reg_grp_tbl for Header. "
> +                 "I/O emulator exit.\n");
> +    }
> +
> +    return reg_field;
> +}
> +static uint32_t pt_header_type_reg_init(XenPCIPassthroughState *s,
> +                                        XenPTRegInfo *reg,
> +                                        uint32_t real_offset)
> +{
> +    /* read PCI_HEADER_TYPE */
> +    return reg->init_val | 0x80;
> +}
> +
> +/* initialize Interrupt Pin register */
> +static uint32_t pt_irqpin_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return pci_read_intx(s);
> +}
> +
> +/* Command register */
> +static int pt_cmd_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                           uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = 0;
> +    uint16_t emu_mask = reg->emu_mask;
> +
> +    if (s->is_virtfn) {
> +        emu_mask |= PCI_COMMAND_MEMORY;
> +    }
> +
> +    /* emulate word register */
> +    valid_emu_mask = emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint16_t *value, uint16_t dev_value,
> +                            uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    uint16_t wr_value = *value;
> +    uint16_t emu_mask = reg->emu_mask;
> +
> +    if (s->is_virtfn) {
> +        emu_mask |= PCI_COMMAND_MEMORY;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~emu_mask & valid_mask;
> +
> +    if (*value & PCI_COMMAND_INTX_DISABLE) {
> +        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +    } else {
> +        if (s->machine_irq) {
> +            throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +        }
> +    }
> +
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* mapping BAR */
> +    pt_bar_mapping(s, wr_value & PCI_COMMAND_IO,
> +                   wr_value & PCI_COMMAND_MEMORY);
> +
> +    return 0;
> +}
> +static int pt_cmd_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint32_t real_offset, uint16_t dev_value,
> +                              uint16_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +    uint16_t restorable_mask = 0;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_word(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register
> +     * but do not include Fast Back-to-Back Enable bit.
> +     */
> +    restorable_mask = reg->emu_mask & ~PCI_COMMAND_FAST_BACK;
> +    *value = PT_MERGE_VALUE(*value, dev_value, restorable_mask);
> +
> +    if (!s->machine_irq) {
> +        *value |= PCI_COMMAND_INTX_DISABLE;
> +    } else {
> +        *value &= ~PCI_COMMAND_INTX_DISABLE;
> +    }
> +
> +    return 0;
> +}
> +
> +/* BAR */
> +#define PT_BAR_MEM_RO_MASK      0x0000000F      /* BAR ReadOnly mask(Memory) */
> +#define PT_BAR_MEM_EMU_MASK     0xFFFFFFF0      /* BAR emul mask(Memory) */
> +#define PT_BAR_IO_RO_MASK       0x00000003      /* BAR ReadOnly mask(I/O) */
> +#define PT_BAR_IO_EMU_MASK      0xFFFFFFFC      /* BAR emul mask(I/O) */
> +
> +static inline uint32_t base_address_with_flags(HostPCIIORegion *hr)
> +{
> +    if ((hr->flags & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
> +        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_IO_MASK);
> +    } else {
> +        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_MEM_MASK);
> +    }
> +}
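
For readers following base_address_with_flags, here is a small standalone sketch of the same idea: the BAR value is the base address with only the low type bits of the flags word OR-ed back in (two bits for I/O, four for memory). struct host_region is a hypothetical stand-in for HostPCIIORegion, and the constants are written out on the assumption that they match the usual pci_regs.h values.

```c
#include <stdint.h>

/* Assumed to match PCI_BASE_ADDRESS_SPACE / _SPACE_IO in pci_regs.h. */
#define BAR_SPACE       0x01u
#define BAR_SPACE_IO    0x01u
/* Low flag bits of a BAR, i.e. the complement of the IO/MEM address masks. */
#define BAR_IO_FLAGS    0x03u
#define BAR_MEM_FLAGS   0x0Fu

/* Hypothetical stand-in for HostPCIIORegion. */
struct host_region {
    uint64_t base_addr;
    uint64_t flags;     /* low bits carry the BAR type bits */
};

/* Rebuild the 32-bit BAR value: base address plus the BAR type bits only. */
static uint32_t bar_value(const struct host_region *hr)
{
    uint64_t flag_bits = (hr->flags & BAR_SPACE) == BAR_SPACE_IO
                         ? BAR_IO_FLAGS : BAR_MEM_FLAGS;

    return (uint32_t)(hr->base_addr | (hr->flags & flag_bits));
}
```

Any higher bits in the flags word are dropped, so they cannot leak into the address field.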
> +
> +static uint32_t pt_bar_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> +                                uint32_t real_offset)
> +{
> +    int reg_field = 0;
> +    int index;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* set initial guest physical base address to -1 */
> +    s->bases[index].e_physbase = -1;
> +
> +    /* set BAR flag */
> +    s->bases[index].bar_flag = pt_bar_reg_parse(s, reg);
> +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED) {
> +        reg_field = PT_INVALID_REG;
> +    }
> +
> +    return reg_field;
> +}
> +static int pt_bar_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                           uint32_t *value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t valid_emu_mask = 0;
> +    uint32_t bar_emu_mask = 0;
> +    int index;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* use fixed-up value from kernel sysfs */
> +    *value = base_address_with_flags(&s->real_device->io_regions[index]);
> +
> +    /* set emulate mask depending on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* emulate BAR */
> +    valid_emu_mask = bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_bar_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint32_t *value, uint32_t dev_value,
> +                            uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    PCIDevice *d = &s->dev;
> +    PCIIORegion *r;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    uint32_t bar_emu_mask = 0;
> +    uint32_t bar_ro_mask = 0;
> +    uint32_t new_addr, last_addr;
> +    uint32_t prev_offset;
> +    uint32_t r_size = 0;
> +    int index = 0;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    r = &d->io_regions[index];
> +    base = &s->bases[index];
> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
> +
> +    /* set emulate mask and read-only mask depending on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        bar_ro_mask = PT_BAR_MEM_RO_MASK | (r_size - 1);
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        bar_ro_mask = PT_BAR_IO_RO_MASK | (r_size - 1);
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        bar_ro_mask = 0;    /* all upper 32bit are R/W */
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = bar_emu_mask & ~bar_ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* check whether we need to update the virtual region address or not */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        /* nothing to do */
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        new_addr = cfg_entry->data;
> +        last_addr = new_addr + r_size - 1;
> +        /* check invalid address */
> +        if (last_addr <= new_addr || !new_addr || last_addr >= 0x10000) {
> +            /* check 64K range */
> +            if ((last_addr >= 0x10000) &&
> +                (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask))) {
> +                PT_LOG("Warning: Guest attempt to set Base Address "
> +                       "over the 64KB. [%02x:%02x.%x][Offset:%02xh]"
> +                       "[Address:%08xh][Size:%08xh]\n",
> +                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                       PCI_FUNC(d->devfn),
> +                       reg->offset, new_addr, r_size);
> +            }
> +            /* just remove mapping */
> +            r->addr = -1;
> +            goto exit;
> +        }
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        if (cfg_entry->data) {
> +            if (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask)) {
> +                PT_LOG("Warning: Guest attempt to set high MMIO Base Address. "
> +                       "Ignore mapping. "
> +                       "[%02x:%02x.%x][Offset:%02xh][High Address:%08xh]\n",
> +                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                       PCI_FUNC(d->devfn), reg->offset, cfg_entry->data);
> +            }
> +            /* clear lower address */
> +            d->io_regions[index-1].addr = -1;
> +        } else {
> +            /* find lower 32bit BAR */
> +            prev_offset = (reg->offset - 4);
> +            reg_grp_entry = pt_find_reg_grp(s, prev_offset);
> +            if (reg_grp_entry) {
> +                reg_entry = pt_find_reg(reg_grp_entry, prev_offset);
> +                if (reg_entry) {
> +                    /* restore lower address */
> +                    d->io_regions[index-1].addr = reg_entry->data;
> +                } else {
> +                    return -1;
> +                }
> +            } else {
> +                return -1;
> +            }
> +        }
> +
> +        /* never map the 'empty' upper region,
> +         * because mapping the lower region is enough.
> +         */
> +        r->addr = -1;
> +        goto exit;
> +    default:
> +        break;
> +    }
> +
> +    /* update the corresponding virtual region address */
> +    /*
> +     * When guest code tries to get block size of mmio, it will write all "1"s
> +     * into pci bar register. In this case, cfg_entry->data == writable_mask.
> +     * Especially for devices with large mmio, the value of writable_mask
> +     * is likely to be a guest physical address that has been mapped to ram
> +     * rather than mmio. Remapping this value to mmio should be prevented.
> +     */
> +
> +    if (cfg_entry->data != writable_mask) {
> +        r->addr = cfg_entry->data;
> +    }
> +
> +exit:
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* After BAR reg update, we need to remap BAR */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
> +    if (reg_grp_entry) {
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
> +        if (reg_entry) {
> +            pt_bar_mapping_one(s, index, reg_entry->data & PCI_COMMAND_IO,
> +                               reg_entry->data & PCI_COMMAND_MEMORY);
> +        }
> +    }
> +
> +    return 0;
> +}
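
The comment about the all-ones write in pt_bar_reg_write is easier to follow with numbers. The sketch below covers only the memory BAR case with a full valid_mask, and assumes the emulated value starts at 0. struct emu_bar and mem_bar_write are hypothetical names for illustration, not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define BAR_MEM_EMU_MASK 0xFFFFFFF0u   /* same value as PT_BAR_MEM_EMU_MASK */
#define BAR_MEM_RO_MASK  0x0000000Fu   /* same value as PT_BAR_MEM_RO_MASK */

/* Hypothetical, reduced view of one emulated memory BAR. */
struct emu_bar {
    uint32_t data;  /* emulated register value (cfg_entry->data) */
    uint32_t addr;  /* guest address used for mapping (r->addr) */
    uint32_t size;  /* region size, assumed to be a power of two */
};

/*
 * Apply a guest write. Returns true if the mapping address was updated,
 * false if the write looked like a sizing probe (all writable bits set).
 */
static bool mem_bar_write(struct emu_bar *bar, uint32_t guest_value)
{
    uint32_t ro_mask = BAR_MEM_RO_MASK | (bar->size - 1);
    uint32_t writable_mask = BAR_MEM_EMU_MASK & ~ro_mask;

    bar->data = (guest_value & writable_mask) | (bar->data & ~writable_mask);

    if (bar->data == writable_mask) {
        return false;   /* sizing probe: keep the previous mapping address */
    }
    bar->addr = bar->data;
    return true;
}
```

For a 4 KiB region, a write of 0xFFFFFFFF leaves data at 0xFFFFF000, from which the guest derives the size as ~data + 1 = 0x1000, and addr is left alone. A later write of a real address updates addr.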
> +static int pt_bar_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint32_t real_offset, uint32_t dev_value,
> +                              uint32_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t bar_emu_mask = 0;
> +    int index = 0;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* use value from kernel sysfs */
> +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UPPER) {
> +        *value = s->real_device->io_regions[index - 1].base_addr >> 32;
> +    } else {
> +        *value = base_address_with_flags(&s->real_device->io_regions[index]);
> +    }
> +
> +    /* set emulate mask depending on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, bar_emu_mask);
> +
> +    return 0;
> +}
> +
> +/* write Exp ROM BAR */
> +static int pt_exp_rom_bar_reg_write(XenPCIPassthroughState *s,
> +                                    XenPTReg *cfg_entry, uint32_t *value,
> +                                    uint32_t dev_value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    PCIDevice *d = &s->dev;
> +    PCIIORegion *r;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    pcibus_t r_size = 0;
> +    uint32_t bar_emu_mask = 0;
> +    uint32_t bar_ro_mask = 0;
> +
> +    r = &d->io_regions[PCI_ROM_SLOT];
> +    r_size = r->size;
> +    base = &s->bases[PCI_ROM_SLOT];
> +    /* align memory type resource size */
> +    r_size = pt_get_emul_size(base->bar_flag, r_size);
> +
> +    /* set emulate mask and read-only mask */
> +    bar_emu_mask = reg->emu_mask;
> +    bar_ro_mask = (reg->ro_mask | (r_size - 1)) & ~PCI_ROM_ADDRESS_ENABLE;
> +
> +    /* modify emulate register */
> +    writable_mask = ~bar_ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* update the corresponding virtual region address */
> +    /*
> +     * When guest code tries to get block size of mmio, it will write all "1"s
> +     * into pci bar register. In this case, cfg_entry->data == writable_mask.
> +     * Especially for devices with large mmio, the value of writable_mask
> +     * is likely to be a guest physical address that has been mapped to ram
> +     * rather than mmio. Remapping this value to mmio should be prevented.
> +     */
> +
> +    if (cfg_entry->data != writable_mask) {
> +        r->addr = cfg_entry->data;
> +    }
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* After BAR reg update, we need to remap BAR */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
> +    if (reg_grp_entry) {
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
> +        if (reg_entry) {
> +            pt_bar_mapping_one(s, PCI_ROM_SLOT,
> +                               reg_entry->data & PCI_COMMAND_IO,
> +                               reg_entry->data & PCI_COMMAND_MEMORY);
> +        }
> +    }
> +
> +    return 0;
> +}
> +/* restore ROM BAR */
> +static int pt_exp_rom_bar_reg_restore(XenPCIPassthroughState *s,
> +                                      XenPTReg *cfg_entry,
> +                                      uint32_t real_offset,
> +                                      uint32_t dev_value, uint32_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +
> +    /* use value from kernel sysfs */
> +    *value =
> +        PT_MERGE_VALUE(host_pci_get_long(s->real_device, PCI_ROM_ADDRESS),
> +                       dev_value, reg->emu_mask);
> +    return 0;
> +}
> +
> +/* Header Type0 reg static information table */
> +static XenPTRegInfo pt_emu_reg_header0_tbl[] = {
> +    /* Vendor ID reg */
> +    {
> +        .offset     = PCI_VENDOR_ID,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_vendor_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Device ID reg */
> +    {
> +        .offset     = PCI_DEVICE_ID,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_device_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Command reg */
> +    {
> +        .offset     = PCI_COMMAND,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xF880,
> +        .emu_mask   = 0x0740,
> +        .init       = pt_common_reg_init,
> +        .u.w.read   = pt_cmd_reg_read,
> +        .u.w.write  = pt_cmd_reg_write,
> +        .u.w.restore  = pt_cmd_reg_restore,
> +    },
> +    /* Capabilities Pointer reg */
> +    {
> +        .offset     = PCI_CAPABILITY_LIST,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Status reg */
> +    /* use emulated Cap Ptr value to initialize,
> +     * so need to be declared after Cap Ptr reg
> +     */
> +    {
> +        .offset     = PCI_STATUS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x06FF,
> +        .emu_mask   = 0x0010,
> +        .init       = pt_status_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Cache Line Size reg */
> +    {
> +        .offset     = PCI_CACHE_LINE_SIZE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = pt_byte_reg_restore,
> +    },
> +    /* Latency Timer reg */
> +    {
> +        .offset     = PCI_LATENCY_TIMER,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = pt_byte_reg_restore,
> +    },
> +    /* Header Type reg */
> +    {
> +        .offset     = PCI_HEADER_TYPE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0x00,
> +        .init       = pt_header_type_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Interrupt Line reg */
> +    {
> +        .offset     = PCI_INTERRUPT_LINE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Interrupt Pin reg */
> +    {
> +        .offset     = PCI_INTERRUPT_PIN,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_irqpin_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* BAR 0 reg */
> +    /* mask of BAR need to be decided later, depends on IO/MEM type */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_0,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 1 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_1,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 2 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_2,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 3 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_3,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 4 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_4,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 5 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_5,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* Expansion ROM BAR reg */
> +    {
> +        .offset     = PCI_ROM_ADDRESS,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x000007FE,
> +        .emu_mask   = 0xFFFFF800,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_exp_rom_bar_reg_write,
> +        .u.dw.restore = pt_exp_rom_bar_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
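
Each of these tables ends with an entry whose .size is 0. The code that walks the tables is not in this hunk, so the following is only a sketch of how such a sentinel-terminated table could be searched by config-space offset. struct reg_info and find_reg_info are hypothetical names and carry just the fields needed for the lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced version of a register description entry. */
struct reg_info {
    uint32_t offset;    /* offset of the register in config space */
    uint32_t size;      /* size in bytes; 0 marks the end of the table */
};

/*
 * Return the entry covering the given offset, or NULL if no entry before
 * the size == 0 sentinel contains it.
 */
static const struct reg_info *find_reg_info(const struct reg_info *tbl,
                                            uint32_t offset)
{
    for (; tbl->size != 0; tbl++) {
        if (offset >= tbl->offset && offset < tbl->offset + tbl->size) {
            return tbl;
        }
    }
    return NULL;
}
```

The sentinel avoids passing an element count around, at the cost of requiring that no real entry ever has size 0.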
> +
> +
> +/*********************************
> + * Vital Product Data Capability
> + */
> +
> +/* Vital Product Data Capability Structure reg static information table */
> +static XenPTRegInfo pt_emu_reg_vpd_tbl[] = {
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/**************************************
> + * Vendor Specific Capability
> + */
> +
> +/* Vendor Specific Capability Structure reg static information table */
> +static XenPTRegInfo pt_emu_reg_vendor_tbl[] = {
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*****************************
> + * PCI Express Capability
> + */
> +
> +/* initialize Link Control register */
> +static uint32_t pt_linkctrl_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint8_t cap_ver = 0;
> +    uint8_t dev_type = 0;
> +
> +    /* TODO maybe better to use function from hw/pcie.c */
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
> +                             + PCI_EXP_FLAGS)
> +                & PCI_EXP_FLAGS_TYPE) >> 4;
> +
> +    /* no need to initialize in case of Root Complex Integrated Endpoint
> +     * with cap_ver 1.x
> +     */
> +    if ((dev_type == PCI_EXP_TYPE_RC_END) && (cap_ver == 1)) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* initialize Device Control 2 register */
> +static uint32_t pt_devctrl2_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint8_t cap_ver = 0;
> +
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +
> +    /* no need to initialize in case of cap_ver 1.x */
> +    if (cap_ver == 1) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* initialize Link Control 2 register */
> +static uint32_t pt_linkctrl2_reg_init(XenPCIPassthroughState *s,
> +                                      XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    int reg_field = 0;
> +    uint8_t cap_ver = 0;
> +
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +
> +    /* no need to initialize in case of cap_ver 1.x */
> +    if (cap_ver == 1) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    /* set Supported Link Speed */
> +    reg_field |= PCI_EXP_LNKCAP_SLS &
> +        pci_get_byte(s->dev.config + real_offset - reg->offset
> +                     + PCI_EXP_LNKCAP);
> +
> +    return reg_field;
> +}
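
The three init functions above all decode the low byte of the PCI Express Capabilities register in the same way. A standalone sketch of that decoding and of the two checks built on it is below. The constants are written out on the assumption that they match PCI_EXP_FLAGS_VERS, PCI_EXP_FLAGS_TYPE and PCI_EXP_TYPE_RC_END from pci_regs.h, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXP_FLAGS_VERS  0x0Fu   /* capability version, bits 3:0 */
#define EXP_FLAGS_TYPE  0xF0u   /* device/port type, bits 7:4 */
#define EXP_TYPE_RC_END 0x9u    /* Root Complex Integrated Endpoint */

static uint8_t exp_cap_version(uint8_t flags_lo)
{
    return flags_lo & EXP_FLAGS_VERS;
}

static uint8_t exp_dev_type(uint8_t flags_lo)
{
    return (flags_lo & EXP_FLAGS_TYPE) >> 4;
}

/* Link Control is skipped only for an RC integrated endpoint with version 1. */
static bool link_ctrl_is_emulated(uint8_t flags_lo)
{
    return !(exp_dev_type(flags_lo) == EXP_TYPE_RC_END &&
             exp_cap_version(flags_lo) == 1);
}

/* Device Control 2 and Link Control 2 are skipped for capability version 1. */
static bool ctrl2_regs_are_emulated(uint8_t flags_lo)
{
    return exp_cap_version(flags_lo) != 1;
}
```

Factoring the decoding out this way would also remove the repeated pci_get_byte expression from the three init functions.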
> +
> +/* PCI Express Capability Structure reg static information table */
> +static XenPTRegInfo pt_emu_reg_pcie_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Device Capabilities reg */
> +    {
> +        .offset     = PCI_EXP_DEVCAP,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x1FFCFFFF,
> +        .emu_mask   = 0x10000000,
> +        .init       = pt_common_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_long_reg_write,
> +        .u.dw.restore = NULL,
> +    },
> +    /* Device Control reg */
> +    {
> +        .offset     = PCI_EXP_DEVCTL,
> +        .size       = 2,
> +        .init_val   = 0x2810,
> +        .ro_mask    = 0x8400,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_common_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Link Control reg */
> +    {
> +        .offset     = PCI_EXP_LNKCTL,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFC34,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_linkctrl_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Device Control 2 reg */
> +    {
> +        .offset     = 0x28,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFE0,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_devctrl2_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Link Control 2 reg */
> +    {
> +        .offset     = 0x30,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xE040,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_linkctrl2_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*********************************
> + * Power Management Capability
> + */
> +
> +/* initialize Power Management Capabilities register */
> +static uint32_t pt_pmc_reg_init(XenPCIPassthroughState *s,
> +                                XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +
> +    if (!s->power_mgmt) {
> +        return reg->init_val;
> +    }
> +
> +    /* set Power Management Capabilities register */
> +    s->pm_state->pmc_field = pci_get_word(d->config + real_offset);
> +
> +    return reg->init_val;
> +}
> +/* initialize PCI Power Management Control/Status register */
> +static uint32_t pt_pmcsr_reg_init(XenPCIPassthroughState *s,
> +                                  XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t cap_ver  = 0;
> +
> +    if (!s->power_mgmt) {
> +        return reg->init_val;
> +    }
> +
> +    /* check PCI Power Management support version */
> +    cap_ver = s->pm_state->pmc_field & PCI_PM_CAP_VER_MASK;
> +
> +    if (cap_ver > 2) {
> +        /* set No Soft Reset */
> +        s->pm_state->no_soft_reset =
> +            pci_get_byte(d->config + real_offset) & PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    /* wake up real physical device */
> +    switch (host_pci_get_word(s->real_device, real_offset)
> +            & PCI_PM_CTRL_STATE_MASK) {
> +    case 0:
> +        break;
> +    case 1:
> +        PT_LOG("Power state transition D1 -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        break;
> +    case 2:
> +        PT_LOG("Power state transition D2 -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        usleep(200);
> +        break;
> +    case 3:
> +        PT_LOG("Power state transition D3hot -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        usleep(10 * 1000);
> +        pt_init_pci_config(s);
> +        break;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* read Power Management Control/Status register */
> +static int pt_pmcsr_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = reg->emu_mask;
> +
> +    if (!s->power_mgmt) {
> +        valid_emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    valid_emu_mask = valid_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +/* reset Interrupt and I/O resource  */
> +static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    PCIIORegion *r;
> +    int i = 0;
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    /* unbind INTx */
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    e_intx = pci_intx(s);
> +
> +    if (s->machine_irq) {
> +        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
> +                                    PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
> +            PT_LOG("Error: Unbinding of interrupt failed!\n");
> +        }
> +    }
> +
> +    /* clear all virtual region address */
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        r = &d->io_regions[i];
> +        r->addr = -1;
> +    }
> +
> +    /* unmapping BAR */
> +    pt_bar_mapping(s, 0, 0);
> +}
> +/* check power state transition */
> +static int check_power_state(XenPCIPassthroughState *s)
> +{
> +    XenPTPM *pm_state = s->pm_state;
> +    PCIDevice *d = &s->dev;
> +    uint16_t read_val = 0;
> +    uint16_t cur_state = 0;
> +
> +    /* get current power state */
> +    read_val = host_pci_get_word(s->real_device,
> +                                 pm_state->pm_base + PCI_PM_CTRL);
> +    cur_state = read_val & PCI_PM_CTRL_STATE_MASK;
> +
> +    if (pm_state->req_state != cur_state) {
> +        PT_LOG("Error: Failed to change power state. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, cur_state);
> +        return -1;
> +    }
> +    return 0;
> +}
> +/* timer callback: D3hot -> D0 transition with internal reset */
> +static void pt_from_d3hot_to_d0_with_reset(void *opaque)
> +{
> +    XenPCIPassthroughState *s = opaque;
> +    XenPTPM *pm_state = s->pm_state;
> +    int ret = 0;
> +
> +    /* check power state */
> +    ret = check_power_state(s);
> +
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    pt_init_pci_config(s);
> +
> +out:
> +    /* power state transition flags off */
> +    pm_state->flags &= ~PT_FLAG_TRANSITING;
> +
> +    qemu_free_timer(pm_state->pm_timer);
> +    pm_state->pm_timer = NULL;
> +}
> +static void pt_default_power_transition(void *opaque)
> +{
> +    XenPCIPassthroughState *ptdev = opaque;
> +    XenPTPM *pm_state = ptdev->pm_state;
> +
> +    /* check power state */
> +    check_power_state(ptdev);
> +
> +    /* power state transition flags off */
> +    pm_state->flags &= ~PT_FLAG_TRANSITING;
> +
> +    qemu_free_timer(pm_state->pm_timer);
> +    pm_state->pm_timer = NULL;
> +}
> +static int pt_pmcsr_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint16_t *value, uint16_t dev_value,
> +                              uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +    uint16_t emu_mask = reg->emu_mask;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    XenPTPM *pm_state = s->pm_state;
> +
> +    if (!s->power_mgmt) {
> +        emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    if (!s->power_mgmt) {
> +        return 0;
> +    }
> +
> +    /* set I/O device power state */
> +    pm_state->cur_state = dev_value & PCI_PM_CTRL_STATE_MASK;
> +
> +    /* set Guest requested PowerState */
> +    pm_state->req_state = *value & PCI_PM_CTRL_STATE_MASK;
> +
> +    /* check power state transition or not */
> +    if (pm_state->cur_state == pm_state->req_state) {
> +        /* not power state transition */
> +        return 0;
> +    }
> +
> +    /* check enable power state transition */
> +    if ((pm_state->req_state != 0) &&
> +        (pm_state->cur_state > pm_state->req_state)) {
> +        PT_LOG("Error: Invalid power transition. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, pm_state->cur_state);
> +
> +        return 0;
> +    }
> +
> +    /* check if this device supports the requested power state */
> +    if (((pm_state->req_state == 1) && !(pm_state->pmc_field & PCI_PM_CAP_D1))
> +        || ((pm_state->req_state == 2) &&
> +            !(pm_state->pmc_field & PCI_PM_CAP_D2))) {
> +        PT_LOG("Error: Invalid power transition. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, pm_state->cur_state);
> +
> +        return 0;
> +    }
> +
> +    /* in case of transition related to D3hot, it's necessary to wait 10 ms.
> +     * But because writing to register will be performed later on actually,
> +     * don't start QEMUTimer right now, just alloc and init QEMUTimer here.
> +     */
> +    if ((pm_state->cur_state == 3) || (pm_state->req_state == 3)) {
> +        if (pm_state->req_state == 0) {
> +            /* alloc and init QEMUTimer */
> +            if (!pm_state->no_soft_reset) {
> +                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                    pt_from_d3hot_to_d0_with_reset, s);
> +
> +                /* reset Interrupt and I/O resource mapping */
> +                pt_reset_interrupt_and_io_mapping(s);
> +            } else {
> +                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                                        pt_default_power_transition, s);
> +            }
> +        } else {
> +            /* alloc and init QEMUTimer */
> +            pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                pt_default_power_transition, s);
> +        }
> +
> +        /* set power state transition delay */
> +        pm_state->pm_delay = 10;
> +
> +        /* power state transition flags on */
> +        pm_state->flags |= PT_FLAG_TRANSITING;
> +    }
> +    /* in case of transition related to D0, D1 and D2,
> +     * no need to use QEMUTimer.
> +     * So, we perform the write to the register here and then read it back.
> +     */
> +    else {
> +        /* write power state to I/O device register */
> +        host_pci_set_word(s->real_device, pm_state->pm_base + PCI_PM_CTRL,
> +                          *value);
> +
> +        /* in case of transition related to D2,
> +         * it's necessary to wait 200 usec.
> +         * But because QEMUTimer does not support microsecond units right now,
> +         * we wait ourselves here.
> +         */
> +        if ((pm_state->cur_state == 2) || (pm_state->req_state == 2)) {
> +            usleep(200);
> +        }
> +
> +        /* check power state */
> +        check_power_state(s);
> +
> +        /* recreate value for writing to I/O device register */
> +        *value = host_pci_get_word(s->real_device,
> +                                   pm_state->pm_base + PCI_PM_CTRL);
> +    }
> +
> +    return 0;
> +}
> +
> +/* restore Power Management Control/Status register */
> +static int pt_pmcsr_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                                uint32_t real_offset, uint16_t dev_value,
> +                                uint16_t *value)
> +{
> +    /* create value for restoring to I/O device register
> +     * No need to restore, just clear PME Enable and PME Status bit
> +     * Note: register type of PME Status bit is RW1C, so clear by writing 1b
> +     */
> +    *value = (dev_value & ~PCI_PM_CTRL_PME_ENABLE) | PCI_PM_CTRL_PME_STATUS;
> +
> +    return 0;
> +}
> +
> +
> +/* Power Management Capability reg static information table */
> +static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Power Management Capabilities reg */
> +    {
> +        .offset     = PCI_CAP_FLAGS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xF9C8,
> +        .init       = pt_pmc_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* PCI Power Management Control/Status reg */
> +    {
> +        .offset     = PCI_PM_CTRL,
> +        .size       = 2,
> +        .init_val   = 0x0008,
> +        .ro_mask    = 0xE1FC,
> +        .emu_mask   = 0x8100,
> +        .init       = pt_pmcsr_reg_init,
> +        .u.w.read   = pt_pmcsr_reg_read,
> +        .u.w.write  = pt_pmcsr_reg_write,
> +        .u.w.restore  = pt_pmcsr_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/****************************
> + * Capabilities
> + */
> +
> +/* AER register operations */
> +
> +static void aer_save_one_register(XenPCIPassthroughState *s, int offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint32_t aer_base = s->pm_state->aer_base;
> +    uint32_t val = 0;
> +
> +    val = host_pci_get_long(s->real_device, aer_base + offset);
> +    pci_set_long(d->config + aer_base + offset, val);
> +}
> +static void pt_aer_reg_save(XenPCIPassthroughState *s)
> +{
> +    /* after reset, following register values should be restored.
> +     * So, save them.
> +     */
> +    aer_save_one_register(s, PCI_ERR_UNCOR_MASK);
> +    aer_save_one_register(s, PCI_ERR_UNCOR_SEVER);
> +    aer_save_one_register(s, PCI_ERR_COR_MASK);
> +    aer_save_one_register(s, PCI_ERR_CAP);
> +}
> +static void aer_restore_one_register(XenPCIPassthroughState *s, int offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint32_t aer_base = s->pm_state->aer_base;
> +    uint32_t config = 0;
> +
> +    config = pci_get_long(d->config + aer_base + offset);
> +    host_pci_set_long(s->real_device, aer_base + offset, config);
> +}
> +static void pt_aer_reg_restore(XenPCIPassthroughState *s)
> +{
> +    /* the following registers should be reconfigured to correct values
> +     * after reset. restore them.
> +     * other registers should not be reconfigured after reset
> +     * if there is no reason
> +     */
> +    aer_restore_one_register(s, PCI_ERR_UNCOR_MASK);
> +    aer_restore_one_register(s, PCI_ERR_UNCOR_SEVER);
> +    aer_restore_one_register(s, PCI_ERR_COR_MASK);
> +    aer_restore_one_register(s, PCI_ERR_CAP);
> +}
> +
> +/* capability structure register group size functions */
> +
> +static uint8_t pt_reg_grp_size_init(XenPCIPassthroughState *s,
> +                                    const XenPTRegGroupInfo *grp_reg,
> +                                    uint32_t base_offset)
> +{
> +    return grp_reg->grp_size;
> +}
> +/* get Power Management Capability Structure register group size */
> +static uint8_t pt_pm_size_init(XenPCIPassthroughState *s,
> +                               const XenPTRegGroupInfo *grp_reg,
> +                               uint32_t base_offset)
> +{
> +    if (!s->power_mgmt) {
> +        return grp_reg->grp_size;
> +    }
> +
> +    s->pm_state = g_malloc0(sizeof (XenPTPM));
> +
> +    /* set Power Management Capability base offset */
> +    s->pm_state->pm_base = base_offset;
> +
> +    /* find AER register and set AER Capability base offset */
> +    s->pm_state->aer_base = host_pci_find_ext_cap_offset(s->real_device,
> +                                                         PCI_EXT_CAP_ID_ERR);
> +
> +    /* save AER register */
> +    if (s->pm_state->aer_base) {
> +        pt_aer_reg_save(s);
> +    }
> +
> +    return grp_reg->grp_size;
> +}
> +/* get Vendor Specific Capability Structure register group size */
> +static uint8_t pt_vendor_size_init(XenPCIPassthroughState *s,
> +                                   const XenPTRegGroupInfo *grp_reg,
> +                                   uint32_t base_offset)
> +{
> +    return pci_get_byte(s->dev.config + base_offset + 0x02);
> +}
> +/* get PCI Express Capability Structure register group size */
> +static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
> +                                 const XenPTRegGroupInfo *grp_reg,
> +                                 uint32_t base_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t exp_flag = 0;
> +    uint16_t type = 0;
> +    uint16_t version = 0;
> +    uint8_t pcie_size = 0;
> +
> +    exp_flag = pci_get_word(d->config + base_offset + PCI_EXP_FLAGS);
> +    type = (exp_flag & PCI_EXP_FLAGS_TYPE) >> 4;
> +    version = exp_flag & PCI_EXP_FLAGS_VERS;
> +
> +    /* calculate size depending on capability version and device/port type */
> +    /* in case of PCI Express Base Specification Rev 1.x */
> +    if (version == 1) {
> +        /* The PCI Express Capabilities, Device Capabilities, and Device
> +         * Status/Control registers are required for all PCI Express devices.
> +         * The Link Capabilities and Link Status/Control are required for all
> +         * Endpoints that are not Root Complex Integrated Endpoints. Endpoints
> +         * are not required to implement registers other than those listed
> +         * above and terminate the capability structure.
> +         */
> +        switch (type) {
> +        case PCI_EXP_TYPE_ENDPOINT:
> +        case PCI_EXP_TYPE_LEG_END:
> +            pcie_size = 0x14;
> +            break;
> +        case PCI_EXP_TYPE_RC_END:
> +            /* has no link */
> +            pcie_size = 0x0C;
> +            break;
> +        /* only EndPoint passthrough is supported */
> +        case PCI_EXP_TYPE_ROOT_PORT:
> +        case PCI_EXP_TYPE_UPSTREAM:
> +        case PCI_EXP_TYPE_DOWNSTREAM:
> +        case PCI_EXP_TYPE_PCI_BRIDGE:
> +        case PCI_EXP_TYPE_PCIE_BRIDGE:
> +        case PCI_EXP_TYPE_RC_EC:
> +        default:
> +            hw_error("Internal error: Unsupported device/port type[%d]. "
> +                     "I/O emulator exit.\n", type);
> +        }
> +    }
> +    /* in case of PCI Express Base Specification Rev 2.0 */
> +    else if (version == 2) {
> +        switch (type) {
> +        case PCI_EXP_TYPE_ENDPOINT:
> +        case PCI_EXP_TYPE_LEG_END:
> +        case PCI_EXP_TYPE_RC_END:
> +            /* For Functions that do not implement the registers,
> +             * these spaces must be hardwired to 0b.
> +             */
> +            pcie_size = 0x3C;
> +            break;
> +        /* only EndPoint passthrough is supported */
> +        case PCI_EXP_TYPE_ROOT_PORT:
> +        case PCI_EXP_TYPE_UPSTREAM:
> +        case PCI_EXP_TYPE_DOWNSTREAM:
> +        case PCI_EXP_TYPE_PCI_BRIDGE:
> +        case PCI_EXP_TYPE_PCIE_BRIDGE:
> +        case PCI_EXP_TYPE_RC_EC:
> +        default:
> +            hw_error("Internal error: Unsupported device/port type[%d]. "
> +                     "I/O emulator exit.\n", type);
> +        }
> +    } else {
> +        hw_error("Internal error: Unsupported capability version[%d]. "
> +                 "I/O emulator exit.\n", version);
> +    }
> +
> +    return pcie_size;
> +}
> +
> +static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
> +    /* Header Type0 reg group */
> +    {
> +        .grp_id      = 0xFF,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x40,
> +        .size_init   = pt_reg_grp_size_init,
> +        .emu_reg_tbl = pt_emu_reg_header0_tbl,
> +    },
> +    /* PCI PowerManagement Capability reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_PM,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = PCI_PM_SIZEOF,
> +        .size_init   = pt_pm_size_init,
> +        .emu_reg_tbl = pt_emu_reg_pm_tbl,
> +    },
> +    /* AGP Capability Structure reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_AGP,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x30,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Vital Product Data Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_VPD,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x08,
> +        .size_init   = pt_reg_grp_size_init,
> +        .emu_reg_tbl = pt_emu_reg_vpd_tbl,
> +    },
> +    /* Slot Identification reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SLOTID,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x04,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* PCI-X Capabilities List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_PCIX,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x18,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Vendor Specific Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_VNDR,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_vendor_size_init,
> +        .emu_reg_tbl = pt_emu_reg_vendor_tbl,
> +    },
> +    /* SHPC Capability List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SHPC,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x08,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Subsystem ID and Subsystem Vendor ID Capability List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SSVID,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x08,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* AGP 8x Capability Structure reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_AGP3,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x30,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* PCI Express Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_EXP,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_pcie_size_init,
> +        .emu_reg_tbl = pt_emu_reg_pcie_tbl,
> +    },
> +    {
> +        .grp_size = 0,
> +    },
> +};
> +
> +/* initialize Capabilities Pointer or Next Pointer register */
> +static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s,
> +                                XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    /* uint32_t reg_field = (uint32_t)s->dev.config[real_offset]; */
> +    uint32_t reg_field = pci_get_byte(s->dev.config + real_offset);
> +    int i;
> +
> +    /* find capability offset */
> +    while (reg_field) {
> +        for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
> +            if (pt_hide_dev_cap(s->real_device,
> +                                pt_emu_reg_grp_tbl[i].grp_id)) {
> +                continue;
> +            }
> +            if (pt_emu_reg_grp_tbl[i].grp_id == s->dev.config[reg_field]) {
> +                if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
> +                    goto out;
> +                }
> +                /* ignore the 0 hardwired capability, find next one */
> +                break;
> +            }
> +        }
> +        /* next capability */
> +        /* reg_field = (uint32_t)s->dev.config[reg_field + 1]; */
> +        reg_field = pci_get_byte(s->dev.config + reg_field + 1);
> +    }
> +
> +out:
> +    return reg_field;
> +}
> +
> +
> +/*************
> + * Main
> + */
> +
> +/* restore a part of I/O device register */
> +static void pt_config_restore(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +    uint32_t real_offset = 0;
> +    uint32_t read_val = 0;
> +    uint32_t val = 0;
> +    int ret = 0;
> +
> +    /* find emulate register group entry */
> +    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
> +        /* find emulate register entry */
> +        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
> +            reg = reg_entry->reg;
> +
> +            /* check whether restoring is needed */
> +            if (!reg->u.b.restore) {
> +                continue;
> +            }
> +
> +            real_offset = reg_grp_entry->base_offset + reg->offset;
> +
> +            /* read I/O device register value */
> +            ret = host_pci_get_block(s->real_device, real_offset,
> +                                     (uint8_t *)&read_val, reg->size);
> +
> +            if (!ret) {
> +                PT_LOG("Error: pci_read_block failed. "
> +                       "return value[%d].\n", ret);
> +                memset(&read_val, 0xff, reg->size);
> +            }
> +
> +            val = 0;
> +
> +            /* restore based on register size */
> +            switch (reg->size) {
> +            case 1:
> +                /* byte register */
> +                ret = reg->u.b.restore(s, reg_entry, real_offset,
> +                                       (uint8_t)read_val, (uint8_t *)&val);
> +                break;
> +            case 2:
> +                /* word register */
> +                ret = reg->u.w.restore(s, reg_entry, real_offset,
> +                                       (uint16_t)read_val, (uint16_t *)&val);
> +                break;
> +            case 4:
> +                /* double word register */
> +                ret = reg->u.dw.restore(s, reg_entry, real_offset,
> +                                        (uint32_t)read_val, (uint32_t *)&val);
> +                break;
> +            }
> +
> +            /* restoring error */
> +            if (ret < 0) {
> +                hw_error("Internal error: Invalid restoring "
> +                         "return value[%d]. I/O emulator exit.\n", ret);
> +            }
> +
> +            PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                          pci_bus_num(s->dev.bus), PCI_SLOT(s->dev.devfn),
> +                          PCI_FUNC(s->dev.devfn),
> +                          real_offset, val, reg->size);
> +
> +            ret = host_pci_set_block(s->real_device, real_offset,
> +                                     (uint8_t *)&val, reg->size);
> +
> +            if (!ret) {
> +                PT_LOG("Error: pci_write_block failed. "
> +                       "return value[%d].\n", ret);
> +            }
> +        }
> +    }
> +
> +    /* if AER supported, restore it */
> +    if (s->pm_state->aer_base) {
> +        pt_aer_reg_restore(s);
> +    }
> +}
> +/* reinitialize all emulate registers */
> +static void pt_config_reinit(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +
> +    /* find emulate register group entry */
> +    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
> +        /* find emulate register entry */
> +        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
> +            reg = reg_entry->reg;
> +            if (reg->init) {
> +                /* initialize emulate register */
> +                reg_entry->data =
> +                    reg->init(s, reg_entry->reg,
> +                              reg_grp_entry->base_offset + reg->offset);
> +            }
> +        }
> +    }
> +}
> +
> +static int pt_init_pci_config(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    int ret = 0;
> +
> +    PT_LOG("Reinitialize PCI configuration registers due to power state"
> +           " transition with internal reset. [%02x:%02x.%x]\n",
> +           pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));
> +
> +    /* restore a part of I/O device register */
> +    pt_config_restore(s);
> +
> +    /* reinitialize all emulate register */
> +    pt_config_reinit(s);
> +
> +    /* rebind machine_irq to device */
> +    if (s->machine_irq != 0) {
> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> +        uint8_t e_intx = pci_intx(s);
> +
> +        ret = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq, 0,
> +                                        e_device, e_intx);
> +        if (ret < 0) {
> +            PT_LOG("Error: Rebinding of interrupt failed! ret=%d\n", ret);
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static uint8_t find_cap_offset(XenPCIPassthroughState *s, uint8_t cap)
> +{
> +    int id;
> +    int max_cap = 48;
> +    int pos = PCI_CAPABILITY_LIST;
> +    int status;
> +
> +    status = host_pci_get_byte(s->real_device, PCI_STATUS);
> +    if ((status & PCI_STATUS_CAP_LIST) == 0) {
> +        return 0;
> +    }
> +
> +    while (max_cap--) {
> +        pos = host_pci_get_byte(s->real_device, pos);
> +        if (pos < 0x40) {
> +            break;
> +        }
> +
> +        pos &= ~3;
> +        id = host_pci_get_byte(s->real_device, pos + PCI_CAP_LIST_ID);
> +
> +        if (id == 0xff) {
> +            break;
> +        }
> +        if (id == cap) {
> +            return pos;
> +        }
> +
> +        pos += PCI_CAP_LIST_NEXT;
> +    }
> +    return 0;
> +}
> +
> +static void pt_config_reg_init(XenPCIPassthroughState *s,
> +                               XenPTRegGroup *reg_grp, XenPTRegInfo *reg)
> +{
> +    XenPTReg *reg_entry;
> +    uint32_t data = 0;
> +
> +    reg_entry = g_malloc0(sizeof (XenPTReg));
> +
> +    reg_entry->reg = reg;
> +    reg_entry->data = 0;
> +
> +    if (reg->init) {
> +        /* initialize emulate register */
> +        data = reg->init(s, reg_entry->reg,
> +                         reg_grp->base_offset + reg->offset);
> +        if (data == PT_INVALID_REG) {
> +            /* free unused BAR register entry (allocated with g_malloc0) */
> +            g_free(reg_entry);
> +            return;
> +        }
> +        /* set register value */
> +        reg_entry->data = data;
> +    }
> +    /* list add register entry */
> +    QLIST_INSERT_HEAD(&reg_grp->reg_tbl_list, reg_entry, entries);
> +
> +    return;
> +}
> +
> +void pt_config_init(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    uint32_t reg_grp_offset = 0;
> +    XenPTRegInfo *reg_tbl = NULL;
> +    int i, j;
> +
> +    QLIST_INIT(&s->reg_grp_tbl);
> +
> +    for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
> +        if (pt_emu_reg_grp_tbl[i].grp_id != 0xFF) {
> +            if (pt_hide_dev_cap(s->real_device,
> +                                pt_emu_reg_grp_tbl[i].grp_id)) {
> +                continue;
> +            }
> +
> +            reg_grp_offset = find_cap_offset(s, pt_emu_reg_grp_tbl[i].grp_id);
> +
> +            if (!reg_grp_offset) {
> +                continue;
> +            }
> +        }
> +
> +        reg_grp_entry = g_malloc0(sizeof (XenPTRegGroup));
> +        QLIST_INIT(&reg_grp_entry->reg_tbl_list);
> +        QLIST_INSERT_HEAD(&s->reg_grp_tbl, reg_grp_entry, entries);
> +
> +        reg_grp_entry->base_offset = reg_grp_offset;
> +        reg_grp_entry->reg_grp = pt_emu_reg_grp_tbl + i;
> +        if (pt_emu_reg_grp_tbl[i].size_init) {
> +            /* get register group size */
> +            reg_grp_entry->size =
> +                pt_emu_reg_grp_tbl[i].size_init(s, reg_grp_entry->reg_grp,
> +                                                reg_grp_offset);
> +        }
> +
> +        if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
> +            if (pt_emu_reg_grp_tbl[i].emu_reg_tbl) {
> +                reg_tbl = pt_emu_reg_grp_tbl[i].emu_reg_tbl;
> +                /* initialize capability register */
> +                for (j = 0; reg_tbl->size != 0; j++, reg_tbl++) {
> +                    /* initialize capability register */
> +                    pt_config_reg_init(s, reg_grp_entry, reg_tbl);
> +                }
> +            }
> +        }
> +        reg_grp_offset = 0;
> +    }
> +
> +    return;
> +}
> +
> +/* delete all emulated registers */
> +void pt_config_delete(XenPCIPassthroughState *s)
> +{
> +    struct XenPTRegGroup *reg_group, *next_grp;
> +    struct XenPTReg *reg, *next_reg;
> +
> +    /* free Power Management info table */
> +    if (s->pm_state) {
> +        if (s->pm_state->pm_timer) {
> +            qemu_del_timer(s->pm_state->pm_timer);
> +            qemu_free_timer(s->pm_state->pm_timer);
> +            s->pm_state->pm_timer = NULL;
> +        }
> +
> +        g_free(s->pm_state);
> +    }
> +
> +    /* free all register group entry */
> +    QLIST_FOREACH_SAFE(reg_group, &s->reg_grp_tbl, entries, next_grp) {
> +        /* free all register entry */
> +        QLIST_FOREACH_SAFE(reg, &reg_group->reg_tbl_list, entries, next_reg) {
> +            QLIST_REMOVE(reg, entries);
> +            g_free(reg);
> +        }
> +
> +        QLIST_REMOVE(reg_group, entries);
> +        g_free(reg_group);
> +    }
> +}
> --
> Anthony PERARD
> 


* Re: [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
@ 2011-11-08 12:57     ` Stefano Stabellini
From: Stefano Stabellini @ 2011-11-08 12:57 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

Obviously passthrough cannot work without this patch, but qemu should be
able to compile anyway. Please add to the previous patch empty stub
implementations for all the exported functions that you are going to
implement here.
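
A minimal sketch of what such stubs could look like, assuming the exported
set is the three non-static functions visible in this patch
(pt_find_reg_grp, pt_find_reg and pt_config_delete). The opaque typedefs
are placeholders so the fragment builds on its own; the real declarations
would come from hw/xen_pci_passthrough.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder opaque types (assumption): in the tree these are provided by
 * hw/xen_pci_passthrough.h. */
typedef struct XenPCIPassthroughState XenPCIPassthroughState;
typedef struct XenPTRegGroup XenPTRegGroup;
typedef struct XenPTReg XenPTReg;

/* Stub: no register group is ever found until the real code is added. */
XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
{
    (void)s;
    (void)address;
    return NULL;
}

/* Stub: no emulated register is ever found. */
XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
{
    (void)reg_grp;
    (void)address;
    return NULL;
}

/* Stub: nothing has been allocated yet, so there is nothing to free. */
void pt_config_delete(XenPCIPassthroughState *s)
{
    (void)s;
}
```

With stubs of this shape in the earlier patch, this patch would then only
need to replace the bodies.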

I see that the timer is allocated here.
In that case it would make sense to move the timer update to this patch.

On Fri, 28 Oct 2011, Anthony PERARD wrote:
> From: Allen Kay <allen.m.kay@intel.com>
> 
> Signed-off-by: Allen Kay <allen.m.kay@intel.com>
> Signed-off-by: Guy Zana <guy@neocleus.com>
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  Makefile.target                      |    1 +
>  hw/xen_pci_passthrough.h             |    2 +
>  hw/xen_pci_passthrough_config_init.c | 2068 ++++++++++++++++++++++++++++++++++
>  3 files changed, 2071 insertions(+), 0 deletions(-)
>  create mode 100644 hw/xen_pci_passthrough_config_init.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index 36ea47d..c32c688 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -219,6 +219,7 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
> 
>  # Inter-VM PCI shared memory
>  CONFIG_IVSHMEM =
> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
> index 2d1979d..ebc04fd 100644
> --- a/hw/xen_pci_passthrough.h
> +++ b/hw/xen_pci_passthrough.h
> @@ -61,6 +61,8 @@ typedef int (*conf_byte_restore)
>  /* power state transition */
>  #define PT_FLAG_TRANSITING 0x0001
> 
> +#define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
> +
> 
>  typedef enum {
>      GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
> diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
> new file mode 100644
> index 0000000..4103b59
> --- /dev/null
> +++ b/hw/xen_pci_passthrough_config_init.c
> @@ -0,0 +1,2068 @@
> +/*
> + * Copyright (c) 2007, Neocleus Corporation.
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Alex Novik <alex@neocleus.com>
> + * Allen Kay <allen.m.kay@intel.com>
> + * Guy Zana <guy@neocleus.com>
> + *
> + * This file implements direct PCI assignment to a HVM guest
> + */
> +
> +#include "qemu-timer.h"
> +#include "xen_backend.h"
> +#include "xen_pci_passthrough.h"
> +
> +#define PT_MERGE_VALUE(value, data, val_mask) \
> +    (((value) & (val_mask)) | ((data) & ~(val_mask)))
> +
> +#define PT_INVALID_REG          0xFFFFFFFF      /* invalid register value */
> +
> +/* prototype */
> +
> +static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> +                                uint32_t real_offset);
> +static int pt_init_pci_config(XenPCIPassthroughState *s);
> +
> +
> +/* helper */
> +
> +/* A return value of 1 means the capability should NOT be exposed to guest. */
> +static int pt_hide_dev_cap(const HostPCIDevice *d, uint8_t grp_id)
> +{
> +    switch (grp_id) {
> +    case PCI_CAP_ID_EXP:
> +        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
> +         * Controller looks trivial, e.g., the PCI Express Capabilities
> +         * Register is 0. We should not try to expose it to guest.
> +         */
> +        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
> +                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
> +            return 1;
> +        }
> +        break;
> +    }
> +    return 0;
> +}
> +
> +/* find emulated register group entry */
> +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
> +{
> +    XenPTRegGroup *entry = NULL;
> +
> +    /* find register group entry */
> +    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
> +        /* check address */
> +        if ((entry->base_offset <= address)
> +            && ((entry->base_offset + entry->size) > address)) {
> +            return entry;
> +        }
> +    }
> +
> +    /* group entry not found */
> +    return NULL;
> +}
> +
> +/* find emulated register entry */
> +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
> +{
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +    uint32_t real_offset = 0;
> +
> +    /* find register entry */
> +    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
> +        reg = reg_entry->reg;
> +        real_offset = reg_grp->base_offset + reg->offset;
> +        /* check address */
> +        if ((real_offset <= address)
> +            && ((real_offset + reg->size) > address)) {
> +            return reg_entry;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +/* parse BAR */
> +static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
> +{
> +    PCIDevice *d = &s->dev;
> +    XenPTRegion *region = NULL;
> +    PCIIORegion *r;
> +    int index = 0;
> +
> +    /* check 64bit BAR */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if ((0 < index) && (index < PCI_ROM_SLOT)) {
> +        int flags = s->real_device->io_regions[index - 1].flags;
> +
> +        if ((flags & IORESOURCE_MEM) && (flags & IORESOURCE_MEM_64)) {
> +            region = &s->bases[index - 1];
> +            if (region->bar_flag != PT_BAR_FLAG_UPPER) {
> +                return PT_BAR_FLAG_UPPER;
> +            }
> +        }
> +    }
> +
> +    /* check unused BAR */
> +    r = &d->io_regions[index];
> +    if (r->size == 0) {
> +        return PT_BAR_FLAG_UNUSED;
> +    }
> +
> +    /* for ExpROM BAR */
> +    if (index == PCI_ROM_SLOT) {
> +        return PT_BAR_FLAG_MEM;
> +    }
> +
> +    /* check BAR I/O indicator */
> +    if (s->real_device->io_regions[index].flags & IORESOURCE_IO) {
> +        return PT_BAR_FLAG_IO;
> +    } else {
> +        return PT_BAR_FLAG_MEM;
> +    }
> +}
> +
> +
> +/****************
> + * general register functions
> + */
> +
> +/* register initialization function */
> +
> +static uint32_t pt_common_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return reg->init_val;
> +}
> +
> +/* Read register functions */
> +
> +static int pt_byte_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint8_t *value, uint8_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint8_t valid_emu_mask = 0;
> +
> +    /* emulate byte register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = 0;
> +
> +    /* emulate word register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_long_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint32_t *value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t valid_emu_mask = 0;
> +
> +    /* emulate long register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +
> +/* Write register functions */
> +
> +static int pt_byte_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint8_t *value, uint8_t dev_value,
> +                             uint8_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint8_t writable_mask = 0;
> +    uint8_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint16_t *value, uint16_t dev_value,
> +                             uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
> +static int pt_long_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint32_t *value, uint32_t dev_value,
> +                             uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
> +
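To illustrate how the generic read/write helpers above combine emu_mask,
ro_mask and valid_mask, here is a small standalone sketch. DemoRegInfo,
demo_byte_read and demo_byte_write are hypothetical names used only for
illustration; the masking mirrors pt_byte_reg_read and pt_byte_reg_write,
but takes plain values instead of a XenPTReg.

```c
#include <stdint.h>

/* Same macro as in the patch: bits set in val_mask are taken from value,
 * all other bits are taken from data. */
#define PT_MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

/* Hypothetical stand-in for the two mask fields of XenPTRegInfo. */
typedef struct {
    uint8_t emu_mask;   /* bits emulated by QEMU */
    uint8_t ro_mask;    /* bits the guest may not change */
} DemoRegInfo;

/* Guest read: emulated bits come from the emulated copy,
 * the remaining bits come from the value read from the device. */
static uint8_t demo_byte_read(const DemoRegInfo *reg, uint8_t emu_data,
                              uint8_t dev_value, uint8_t valid_mask)
{
    uint8_t valid_emu_mask = reg->emu_mask & valid_mask;

    return PT_MERGE_VALUE(dev_value, emu_data, ~valid_emu_mask);
}

/* Guest write: update the writable emulated bits in *emu_data and return
 * the value to write to the device (guest bits for the passed-through part,
 * current device bits for the rest). */
static uint8_t demo_byte_write(const DemoRegInfo *reg, uint8_t *emu_data,
                               uint8_t guest_value, uint8_t dev_value,
                               uint8_t valid_mask)
{
    uint8_t writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
    uint8_t throughable_mask = ~reg->emu_mask & valid_mask;

    *emu_data = PT_MERGE_VALUE(guest_value, *emu_data, writable_mask);
    return PT_MERGE_VALUE(guest_value, dev_value, throughable_mask);
}
```

For example, with emu_mask = 0xF0 and ro_mask = 0xC0, only bits 5:4 of the
emulated copy are guest-writable: a guest write of 0x5A over an emulated
copy of 0xA0 leaves 0x90 in the emulated copy, and the low nibble (0xA) is
what gets merged into the value sent to the device.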
> +/* common restore register functions */
> +static int pt_byte_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                               uint32_t real_offset, uint8_t dev_value,
> +                               uint8_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_byte(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                               uint32_t real_offset, uint16_t dev_value,
> +                               uint16_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_word(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
> +
> +    return 0;
> +}
> +
> +
> +/* XenPTRegInfo declaration
> + * - only for emulated registers (some or all of their bits).
> + * - for passthrough registers that need special behavior (like interacting
> + *   with other components), set emu_mask to all 0 and specify the r/w
> + *   functions properly.
> + * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
> + */
> +
> +/********************
> + * Header Type0
> + */
> +
> +static uint32_t pt_vendor_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return s->real_device->vendor_id;
> +}
> +static uint32_t pt_device_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return s->real_device->device_id;
> +}
> +static uint32_t pt_status_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    int reg_field = 0;
> +
> +    /* find Header register group */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_CAPABILITY_LIST);
> +    if (reg_grp_entry) {
> +        /* find Capabilities Pointer register */
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_CAPABILITY_LIST);
> +        if (reg_entry) {
> +            /* check Capabilities Pointer register */
> +            if (reg_entry->data) {
> +                reg_field |= PCI_STATUS_CAP_LIST;
> +            } else {
> +                reg_field &= ~PCI_STATUS_CAP_LIST;
> +            }
> +        } else {
> +            hw_error("Internal error: Couldn't find pt_reg_tbl for "
> +                     "Capabilities Pointer register. I/O emulator exit.\n");
> +        }
> +    } else {
> +        hw_error("Internal error: Couldn't find pt_reg_grp_tbl for Header. "
> +                 "I/O emulator exit.\n");
> +    }
> +
> +    return reg_field;
> +}
> +static uint32_t pt_header_type_reg_init(XenPCIPassthroughState *s,
> +                                        XenPTRegInfo *reg,
> +                                        uint32_t real_offset)
> +{
> +    /* report PCI_HEADER_TYPE with the multi-function bit (0x80) set */
> +    return reg->init_val | 0x80;
> +}
> +
> +/* initialize Interrupt Pin register */
> +static uint32_t pt_irqpin_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return pci_read_intx(s);
> +}
> +
> +/* Command register */
> +static int pt_cmd_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                           uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = 0;
> +    uint16_t emu_mask = reg->emu_mask;
> +
> +    if (s->is_virtfn) {
> +        emu_mask |= PCI_COMMAND_MEMORY;
> +    }
> +
> +    /* emulate word register */
> +    valid_emu_mask = emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint16_t *value, uint16_t dev_value,
> +                            uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    uint16_t wr_value = *value;
> +    uint16_t emu_mask = reg->emu_mask;
> +
> +    if (s->is_virtfn) {
> +        emu_mask |= PCI_COMMAND_MEMORY;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~emu_mask & valid_mask;
> +
> +    if (*value & PCI_COMMAND_INTX_DISABLE) {
> +        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +    } else {
> +        if (s->machine_irq) {
> +            throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +        }
> +    }
> +
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* mapping BAR */
> +    pt_bar_mapping(s, wr_value & PCI_COMMAND_IO,
> +                   wr_value & PCI_COMMAND_MEMORY);
> +
> +    return 0;
> +}
> +static int pt_cmd_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint32_t real_offset, uint16_t dev_value,
> +                              uint16_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +    uint16_t restorable_mask = 0;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_word(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register
> +     * but do not include Fast Back-to-Back Enable bit.
> +     */
> +    restorable_mask = reg->emu_mask & ~PCI_COMMAND_FAST_BACK;
> +    *value = PT_MERGE_VALUE(*value, dev_value, restorable_mask);
> +
> +    if (!s->machine_irq) {
> +        *value |= PCI_COMMAND_INTX_DISABLE;
> +    } else {
> +        *value &= ~PCI_COMMAND_INTX_DISABLE;
> +    }
> +
> +    return 0;
> +}
> +
> +/* BAR */
> +#define PT_BAR_MEM_RO_MASK      0x0000000F      /* BAR ReadOnly mask(Memory) */
> +#define PT_BAR_MEM_EMU_MASK     0xFFFFFFF0      /* BAR emul mask(Memory) */
> +#define PT_BAR_IO_RO_MASK       0x00000003      /* BAR ReadOnly mask(I/O) */
> +#define PT_BAR_IO_EMU_MASK      0xFFFFFFFC      /* BAR emul mask(I/O) */
> +
> +static inline uint32_t base_address_with_flags(HostPCIIORegion *hr)
> +{
> +    if ((hr->flags & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
> +        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_IO_MASK);
> +    } else {
> +        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_MEM_MASK);
> +    }
> +}
> +
> +static uint32_t pt_bar_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> +                                uint32_t real_offset)
> +{
> +    int reg_field = 0;
> +    int index;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* set initial guest physical base address to -1 */
> +    s->bases[index].e_physbase = -1;
> +
> +    /* set BAR flag */
> +    s->bases[index].bar_flag = pt_bar_reg_parse(s, reg);
> +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED) {
> +        reg_field = PT_INVALID_REG;
> +    }
> +
> +    return reg_field;
> +}
> +static int pt_bar_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                           uint32_t *value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t valid_emu_mask = 0;
> +    uint32_t bar_emu_mask = 0;
> +    int index;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* use fixed-up value from kernel sysfs */
> +    *value = base_address_with_flags(&s->real_device->io_regions[index]);
> +
> +    /* set emulate mask depend on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* emulate BAR */
> +    valid_emu_mask = bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_bar_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint32_t *value, uint32_t dev_value,
> +                            uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    PCIDevice *d = &s->dev;
> +    PCIIORegion *r;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    uint32_t bar_emu_mask = 0;
> +    uint32_t bar_ro_mask = 0;
> +    uint32_t new_addr, last_addr;
> +    uint32_t prev_offset;
> +    uint32_t r_size = 0;
> +    int index = 0;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    r = &d->io_regions[index];
> +    base = &s->bases[index];
> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
> +
> +    /* set emulate mask and read-only mask depend on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        bar_ro_mask = PT_BAR_MEM_RO_MASK | (r_size - 1);
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        bar_ro_mask = PT_BAR_IO_RO_MASK | (r_size - 1);
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        bar_ro_mask = 0;    /* all upper 32bit are R/W */
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = bar_emu_mask & ~bar_ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* check whether we need to update the virtual region address or not */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        /* nothing to do */
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        new_addr = cfg_entry->data;
> +        last_addr = new_addr + r_size - 1;
> +        /* check invalid address */
> +        if (last_addr <= new_addr || !new_addr || last_addr >= 0x10000) {
> +            /* check 64K range */
> +            if ((last_addr >= 0x10000) &&
> +                (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask))) {
> +                PT_LOG("Warning: Guest attempt to set Base Address "
> +                       "over the 64KB. [%02x:%02x.%x][Offset:%02xh]"
> +                       "[Address:%08xh][Size:%08xh]\n",
> +                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                       PCI_FUNC(d->devfn),
> +                       reg->offset, new_addr, r_size);
> +            }
> +            /* just remove mapping */
> +            r->addr = -1;
> +            goto exit;
> +        }
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        if (cfg_entry->data) {
> +            if (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask)) {
> +                PT_LOG("Warning: Guest attempt to set high MMIO Base Address. "
> +                       "Ignore mapping. "
> +                       "[%02x:%02x.%x][Offset:%02xh][High Address:%08xh]\n",
> +                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                       PCI_FUNC(d->devfn), reg->offset, cfg_entry->data);
> +            }
> +            /* clear lower address */
> +            d->io_regions[index-1].addr = -1;
> +        } else {
> +            /* find lower 32bit BAR */
> +            prev_offset = (reg->offset - 4);
> +            reg_grp_entry = pt_find_reg_grp(s, prev_offset);
> +            if (reg_grp_entry) {
> +                reg_entry = pt_find_reg(reg_grp_entry, prev_offset);
> +                if (reg_entry) {
> +                    /* restore lower address */
> +                    d->io_regions[index-1].addr = reg_entry->data;
> +                } else {
> +                    return -1;
> +                }
> +            } else {
> +                return -1;
> +            }
> +        }
> +
> +        /* never map the 'empty' upper region;
> +         * mapping the lower region is enough.
> +         */
> +        r->addr = -1;
> +        goto exit;
> +    default:
> +        break;
> +    }
> +
> +    /* update the corresponding virtual region address */
> +    /*
> +     * When guest code tries to get block size of mmio, it will write all "1"s
> +     * into pci bar register. In this case, cfg_entry->data == writable_mask.
> +     * Especially for devices with large mmio, the value of writable_mask
> +     * is likely to be a guest physical address that has been mapped to ram
> +     * rather than mmio. Remapping this value to mmio should be prevented.
> +     */
> +
> +    if (cfg_entry->data != writable_mask) {
> +        r->addr = cfg_entry->data;
> +    }
> +
> +exit:
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* After BAR reg update, we need to remap BAR */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
> +    if (reg_grp_entry) {
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
> +        if (reg_entry) {
> +            pt_bar_mapping_one(s, index, reg_entry->data & PCI_COMMAND_IO,
> +                               reg_entry->data & PCI_COMMAND_MEMORY);
> +        }
> +    }
> +
> +    return 0;
> +}
> +static int pt_bar_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint32_t real_offset, uint32_t dev_value,
> +                              uint32_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t bar_emu_mask = 0;
> +    int index = 0;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* use value from kernel sysfs */
> +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UPPER) {
> +        *value = s->real_device->io_regions[index - 1].base_addr >> 32;
> +    } else {
> +        *value = base_address_with_flags(&s->real_device->io_regions[index]);
> +    }
> +
> +    /* set emulate mask depend on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, bar_emu_mask);
> +
> +    return 0;
> +}
> +
> +/* write Exp ROM BAR */
> +static int pt_exp_rom_bar_reg_write(XenPCIPassthroughState *s,
> +                                    XenPTReg *cfg_entry, uint32_t *value,
> +                                    uint32_t dev_value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    PCIDevice *d = (PCIDevice *)&s->dev;
> +    PCIIORegion *r;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    pcibus_t r_size = 0;
> +    uint32_t bar_emu_mask = 0;
> +    uint32_t bar_ro_mask = 0;
> +
> +    r = &d->io_regions[PCI_ROM_SLOT];
> +    r_size = r->size;
> +    base = &s->bases[PCI_ROM_SLOT];
> +    /* align memory type resource size */
> +    r_size = pt_get_emul_size(base->bar_flag, r_size);
> +
> +    /* set emulate mask and read-only mask */
> +    bar_emu_mask = reg->emu_mask;
> +    bar_ro_mask = (reg->ro_mask | (r_size - 1)) & ~PCI_ROM_ADDRESS_ENABLE;
> +
> +    /* modify emulate register */
> +    writable_mask = ~bar_ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* update the corresponding virtual region address */
> +    /*
> +     * When guest code tries to get block size of mmio, it will write all "1"s
> +     * into pci bar register. In this case, cfg_entry->data == writable_mask.
> +     * Especially for devices with large mmio, the value of writable_mask
> +     * is likely to be a guest physical address that has been mapped to ram
> +     * rather than mmio. Remapping this value to mmio should be prevented.
> +     */
> +
> +    if (cfg_entry->data != writable_mask) {
> +        r->addr = cfg_entry->data;
> +    }
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* After BAR reg update, we need to remap BAR */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
> +    if (reg_grp_entry) {
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
> +        if (reg_entry) {
> +            pt_bar_mapping_one(s, PCI_ROM_SLOT,
> +                               reg_entry->data & PCI_COMMAND_IO,
> +                               reg_entry->data & PCI_COMMAND_MEMORY);
> +        }
> +    }
> +
> +    return 0;
> +}
> +/* restore ROM BAR */
> +static int pt_exp_rom_bar_reg_restore(XenPCIPassthroughState *s,
> +                                      XenPTReg *cfg_entry,
> +                                      uint32_t real_offset,
> +                                      uint32_t dev_value, uint32_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +
> +    /* use value from kernel sysfs */
> +    *value =
> +        PT_MERGE_VALUE(host_pci_get_long(s->real_device, PCI_ROM_ADDRESS),
> +                       dev_value, reg->emu_mask);
> +    return 0;
> +}
> +
> +/* Header Type0 reg static information table */
> +static XenPTRegInfo pt_emu_reg_header0_tbl[] = {
> +    /* Vendor ID reg */
> +    {
> +        .offset     = PCI_VENDOR_ID,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_vendor_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Device ID reg */
> +    {
> +        .offset     = PCI_DEVICE_ID,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_device_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Command reg */
> +    {
> +        .offset     = PCI_COMMAND,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xF880,
> +        .emu_mask   = 0x0740,
> +        .init       = pt_common_reg_init,
> +        .u.w.read   = pt_cmd_reg_read,
> +        .u.w.write  = pt_cmd_reg_write,
> +        .u.w.restore  = pt_cmd_reg_restore,
> +    },
> +    /* Capabilities Pointer reg */
> +    {
> +        .offset     = PCI_CAPABILITY_LIST,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Status reg */
> +    /* uses the emulated Cap Ptr value to initialize,
> +     * so it needs to be declared after the Cap Ptr reg
> +     */
> +    {
> +        .offset     = PCI_STATUS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x06FF,
> +        .emu_mask   = 0x0010,
> +        .init       = pt_status_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Cache Line Size reg */
> +    {
> +        .offset     = PCI_CACHE_LINE_SIZE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = pt_byte_reg_restore,
> +    },
> +    /* Latency Timer reg */
> +    {
> +        .offset     = PCI_LATENCY_TIMER,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = pt_byte_reg_restore,
> +    },
> +    /* Header Type reg */
> +    {
> +        .offset     = PCI_HEADER_TYPE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0x00,
> +        .init       = pt_header_type_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Interrupt Line reg */
> +    {
> +        .offset     = PCI_INTERRUPT_LINE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Interrupt Pin reg */
> +    {
> +        .offset     = PCI_INTERRUPT_PIN,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_irqpin_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* BAR 0 reg */
> +    /* the BAR masks are decided later, depending on IO/MEM type */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_0,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 1 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_1,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 2 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_2,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 3 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_3,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 4 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_4,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 5 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_5,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* Expansion ROM BAR reg */
> +    {
> +        .offset     = PCI_ROM_ADDRESS,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x000007FE,
> +        .emu_mask   = 0xFFFFF800,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_exp_rom_bar_reg_write,
> +        .u.dw.restore = pt_exp_rom_bar_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*********************************
> + * Vital Product Data Capability
> + */
> +
> +/* Vital Product Data Capability Structure reg static information table */
> +static XenPTRegInfo pt_emu_reg_vpd_tbl[] = {
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/**************************************
> + * Vendor Specific Capability
> + */
> +
> +/* Vendor Specific Capability Structure reg static information table */
> +static XenPTRegInfo pt_emu_reg_vendor_tbl[] = {
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*****************************
> + * PCI Express Capability
> + */
> +
> +/* initialize Link Control register */
> +static uint32_t pt_linkctrl_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint8_t cap_ver = 0;
> +    uint8_t dev_type = 0;
> +
> +    /* TODO maybe better to use functions from hw/pcie.c */
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
> +                             + PCI_EXP_FLAGS)
> +                & PCI_EXP_FLAGS_TYPE) >> 4;
> +
> +    /* no need to initialize in case of Root Complex Integrated Endpoint
> +     * with cap_ver 1.x
> +     */
> +    if ((dev_type == PCI_EXP_TYPE_RC_END) && (cap_ver == 1)) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* initialize Device Control 2 register */
> +static uint32_t pt_devctrl2_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint8_t cap_ver = 0;
> +
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +
> +    /* no need to initialize in case of cap_ver 1.x */
> +    if (cap_ver == 1) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* initialize Link Control 2 register */
> +static uint32_t pt_linkctrl2_reg_init(XenPCIPassthroughState *s,
> +                                      XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    int reg_field = 0;
> +    uint8_t cap_ver = 0;
> +
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +
> +    /* no need to initialize in case of cap_ver 1.x */
> +    if (cap_ver == 1) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    /* set Supported Link Speed */
> +    reg_field |= PCI_EXP_LNKCAP_SLS &
> +        pci_get_byte(s->dev.config + real_offset - reg->offset
> +                     + PCI_EXP_LNKCAP);
> +
> +    return reg_field;
> +}
> +
> +/* PCI Express Capability Structure reg static information table */
> +static XenPTRegInfo pt_emu_reg_pcie_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Device Capabilities reg */
> +    {
> +        .offset     = PCI_EXP_DEVCAP,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x1FFCFFFF,
> +        .emu_mask   = 0x10000000,
> +        .init       = pt_common_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_long_reg_write,
> +        .u.dw.restore = NULL,
> +    },
> +    /* Device Control reg */
> +    {
> +        .offset     = PCI_EXP_DEVCTL,
> +        .size       = 2,
> +        .init_val   = 0x2810,
> +        .ro_mask    = 0x8400,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_common_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Link Control reg */
> +    {
> +        .offset     = PCI_EXP_LNKCTL,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFC34,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_linkctrl_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Device Control 2 reg */
> +    {
> +        .offset     = 0x28,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFE0,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_devctrl2_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Link Control 2 reg */
> +    {
> +        .offset     = 0x30,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xE040,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_linkctrl2_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*********************************
> + * Power Management Capability
> + */
> +
> +/* initialize Power Management Capabilities register */
> +static uint32_t pt_pmc_reg_init(XenPCIPassthroughState *s,
> +                                XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +
> +    if (!s->power_mgmt) {
> +        return reg->init_val;
> +    }
> +
> +    /* set Power Management Capabilities register */
> +    s->pm_state->pmc_field = pci_get_word(d->config + real_offset);
> +
> +    return reg->init_val;
> +}
> +/* initialize PCI Power Management Control/Status register */
> +static uint32_t pt_pmcsr_reg_init(XenPCIPassthroughState *s,
> +                                  XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t cap_ver  = 0;
> +
> +    if (!s->power_mgmt) {
> +        return reg->init_val;
> +    }
> +
> +    /* check PCI Power Management support version */
> +    cap_ver = s->pm_state->pmc_field & PCI_PM_CAP_VER_MASK;
> +
> +    if (cap_ver > 2) {
> +        /* set No Soft Reset */
> +        s->pm_state->no_soft_reset =
> +            pci_get_byte(d->config + real_offset) & PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    /* wake up real physical device */
> +    switch (host_pci_get_word(s->real_device, real_offset)
> +            & PCI_PM_CTRL_STATE_MASK) {
> +    case 0:
> +        break;
> +    case 1:
> +        PT_LOG("Power state transition D1 -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        break;
> +    case 2:
> +        PT_LOG("Power state transition D2 -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        usleep(200);
> +        break;
> +    case 3:
> +        PT_LOG("Power state transition D3hot -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        usleep(10 * 1000);
> +        pt_init_pci_config(s);
> +        break;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* read Power Management Control/Status register */
> +static int pt_pmcsr_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = reg->emu_mask;
> +
> +    if (!s->power_mgmt) {
> +        valid_emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    valid_emu_mask = valid_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +/* reset Interrupt and I/O resource */
> +static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    PCIIORegion *r;
> +    int i = 0;
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    /* unbind INTx */
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    e_intx = pci_intx(s);
> +
> +    if (s->machine_irq) {
> +        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
> +                                    PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
> +            PT_LOG("Error: Unbinding of interrupt failed!\n");
> +        }
> +    }
> +
> +    /* clear all virtual region addresses */
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        r = &d->io_regions[i];
> +        r->addr = -1;
> +    }
> +
> +    /* unmapping BAR */
> +    pt_bar_mapping(s, 0, 0);
> +}
> +/* check power state transition */
> +static int check_power_state(XenPCIPassthroughState *s)
> +{
> +    XenPTPM *pm_state = s->pm_state;
> +    PCIDevice *d = &s->dev;
> +    uint16_t read_val = 0;
> +    uint16_t cur_state = 0;
> +
> +    /* get current power state */
> +    read_val = host_pci_get_word(s->real_device,
> +                                 pm_state->pm_base + PCI_PM_CTRL);
> +    cur_state = read_val & PCI_PM_CTRL_STATE_MASK;
> +
> +    if (pm_state->req_state != cur_state) {
> +        PT_LOG("Error: Failed to change power state. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, cur_state);
> +        return -1;
> +    }
> +    return 0;
> +}
> +/* timer callback: complete a D3hot -> D0 transition that resets the device */
> +static void pt_from_d3hot_to_d0_with_reset(void *opaque)
> +{
> +    XenPCIPassthroughState *s = opaque;
> +    XenPTPM *pm_state = s->pm_state;
> +    int ret = 0;
> +
> +    /* check power state */
> +    ret = check_power_state(s);
> +
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    pt_init_pci_config(s);
> +
> +out:
> +    /* power state transition flags off */
> +    pm_state->flags &= ~PT_FLAG_TRANSITING;
> +
> +    qemu_free_timer(pm_state->pm_timer);
> +    pm_state->pm_timer = NULL;
> +}
> +static void pt_default_power_transition(void *opaque)
> +{
> +    XenPCIPassthroughState *ptdev = opaque;
> +    XenPTPM *pm_state = ptdev->pm_state;
> +
> +    /* check power state */
> +    check_power_state(ptdev);
> +
> +    /* power state transition flags off */
> +    pm_state->flags &= ~PT_FLAG_TRANSITING;
> +
> +    qemu_free_timer(pm_state->pm_timer);
> +    pm_state->pm_timer = NULL;
> +}
> +static int pt_pmcsr_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint16_t *value, uint16_t dev_value,
> +                              uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +    uint16_t emu_mask = reg->emu_mask;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    XenPTPM *pm_state = s->pm_state;
> +
> +    if (!s->power_mgmt) {
> +        emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    if (!s->power_mgmt) {
> +        return 0;
> +    }
> +
> +    /* set I/O device power state */
> +    pm_state->cur_state = dev_value & PCI_PM_CTRL_STATE_MASK;
> +
> +    /* set Guest requested PowerState */
> +    pm_state->req_state = *value & PCI_PM_CTRL_STATE_MASK;
> +
> +    /* check whether a power state transition is requested */
> +    if (pm_state->cur_state == pm_state->req_state) {
> +        /* no power state transition */
> +        return 0;
> +    }
> +
> +    /* check that the requested transition is allowed */
> +    if ((pm_state->req_state != 0) &&
> +        (pm_state->cur_state > pm_state->req_state)) {
> +        PT_LOG("Error: Invalid power transition. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, pm_state->cur_state);
> +
> +        return 0;
> +    }
> +
> +    /* check if this device supports the requested power state */
> +    if (((pm_state->req_state == 1) && !(pm_state->pmc_field & PCI_PM_CAP_D1))
> +        || ((pm_state->req_state == 2) &&
> +            !(pm_state->pmc_field & PCI_PM_CAP_D2))) {
> +        PT_LOG("Error: Invalid power transition. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, pm_state->cur_state);
> +
> +        return 0;
> +    }
> +
> +    /* a transition to or from D3hot requires a 10 ms wait.
> +     * Because the register write is actually performed later on,
> +     * don't start the QEMUTimer yet; just allocate and initialize it here.
> +     */
> +    if ((pm_state->cur_state == 3) || (pm_state->req_state == 3)) {
> +        if (pm_state->req_state == 0) {
> +            /* alloc and init QEMUTimer */
> +            if (!pm_state->no_soft_reset) {
> +                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                    pt_from_d3hot_to_d0_with_reset, s);
> +
> +                /* reset Interrupt and I/O resource mapping */
> +                pt_reset_interrupt_and_io_mapping(s);
> +            } else {
> +                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                                        pt_default_power_transition, s);
> +            }
> +        } else {
> +            /* alloc and init QEMUTimer */
> +            pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                pt_default_power_transition, s);
> +        }
> +
> +        /* set power state transition delay */
> +        pm_state->pm_delay = 10;
> +
> +        /* power state transition flags on */
> +        pm_state->flags |= PT_FLAG_TRANSITING;
> +    }
> +    /* transitions among D0, D1 and D2 do not need a QEMUTimer,
> +     * so write the register here and then read it back.
> +     */
> +    else {
> +        /* write power state to I/O device register */
> +        host_pci_set_word(s->real_device, pm_state->pm_base + PCI_PM_CTRL,
> +                          *value);
> +
> +        /* a transition to or from D2 requires a 200 usec wait.
> +         * QEMUTimer does not support microsecond units right now,
> +         * so wait here ourselves.
> +         */
> +        if ((pm_state->cur_state == 2) || (pm_state->req_state == 2)) {
> +            usleep(200);
> +        }
> +
> +        /* check power state */
> +        check_power_state(s);
> +
> +        /* recreate value for writing to I/O device register */
> +        *value = host_pci_get_word(s->real_device,
> +                                   pm_state->pm_base + PCI_PM_CTRL);
> +    }
> +
> +    return 0;
> +}
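
The write path above splits one guest write into two values: the bits kept in the emulated copy and the bits forwarded to the device. A minimal standalone sketch of that split follows. It assumes PT_MERGE_VALUE(value, data, mask) means "take value under mask, data elsewhere" (its definition is not in this hunk); MERGE_VALUE, struct split_write and split_guest_write are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Assumed semantics of PT_MERGE_VALUE; the real macro is defined
 * elsewhere in the series and may differ. */
#define MERGE_VALUE(value, data, mask) \
    (((value) & (mask)) | ((data) & ~(mask)))

/* Hypothetical result type: what one guest write turns into. */
struct split_write {
    uint16_t emu_data;   /* new contents of the emulated register */
    uint16_t dev_write;  /* value that would be written to the device */
};

/* Hypothetical helper mirroring the two merges in pt_pmcsr_reg_write. */
static struct split_write split_guest_write(uint16_t guest,
                                            uint16_t emu_data,
                                            uint16_t dev_value,
                                            uint16_t emu_mask,
                                            uint16_t ro_mask,
                                            uint16_t valid_mask)
{
    struct split_write r;
    /* emulated bits the guest is allowed to change */
    uint16_t writable = emu_mask & ~ro_mask & valid_mask;
    /* non-emulated bits, which pass through to the hardware */
    uint16_t through = ~emu_mask & valid_mask;

    r.emu_data = MERGE_VALUE(guest, emu_data, writable);
    r.dev_write = MERGE_VALUE(guest, dev_value, through);
    return r;
}
```

With the PMCSR masks from the table below (emu_mask 0x8100, ro_mask 0xE1FC), no emulated bit is guest-writable, so a guest write only changes the pass-through bits such as the power state field.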
> +
> +/* restore Power Management Control/Status register */
> +static int pt_pmcsr_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                                uint32_t real_offset, uint16_t dev_value,
> +                                uint16_t *value)
> +{
> +    /* create value for restoring to I/O device register
> +     * No need to restore, just clear PME Enable and PME Status bit
> +     * Note: register type of PME Status bit is RW1C, so clear by writing 1b
> +     */
> +    *value = (dev_value & ~PCI_PM_CTRL_PME_ENABLE) | PCI_PM_CTRL_PME_STATUS;
> +
> +    return 0;
> +}
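
The restore value computed above can be checked in isolation. The sketch below is a standalone illustration, not part of the patch: the two constants carry the values that Linux's pci_regs.h gives PCI_PM_CTRL_PME_ENABLE and PCI_PM_CTRL_PME_STATUS, and pmcsr_restore_value is a hypothetical name.

```c
#include <stdint.h>

/* Same values as PCI_PM_CTRL_PME_ENABLE / PCI_PM_CTRL_PME_STATUS
 * in Linux's pci_regs.h. */
#define PM_CTRL_PME_ENABLE 0x0100
#define PM_CTRL_PME_STATUS 0x8000

/* Hypothetical helper: clear PME Enable and write 1 to PME Status.
 * PME Status is RW1C, so writing 1 clears a pending status bit. */
static uint16_t pmcsr_restore_value(uint16_t dev_value)
{
    return (uint16_t)((dev_value & ~PM_CTRL_PME_ENABLE) | PM_CTRL_PME_STATUS);
}
```

All other bits of the device value, including the power state field, are written back unchanged.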
> +
> +
> +/* Power Management Capability reg static information table */
> +static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Power Management Capabilities reg */
> +    {
> +        .offset     = PCI_CAP_FLAGS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xF9C8,
> +        .init       = pt_pmc_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* PCI Power Management Control/Status reg */
> +    {
> +        .offset     = PCI_PM_CTRL,
> +        .size       = 2,
> +        .init_val   = 0x0008,
> +        .ro_mask    = 0xE1FC,
> +        .emu_mask   = 0x8100,
> +        .init       = pt_pmcsr_reg_init,
> +        .u.w.read   = pt_pmcsr_reg_read,
> +        .u.w.write  = pt_pmcsr_reg_write,
> +        .u.w.restore  = pt_pmcsr_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/****************************
> + * Capabilities
> + */
> +
> +/* AER register operations */
> +
> +static void aer_save_one_register(XenPCIPassthroughState *s, int offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint32_t aer_base = s->pm_state->aer_base;
> +    uint32_t val = 0;
> +
> +    val = host_pci_get_long(s->real_device, aer_base + offset);
> +    pci_set_long(d->config + aer_base + offset, val);
> +}
> +static void pt_aer_reg_save(XenPCIPassthroughState *s)
> +{
> +    /* after reset, following register values should be restored.
> +     * So, save them.
> +     */
> +    aer_save_one_register(s, PCI_ERR_UNCOR_MASK);
> +    aer_save_one_register(s, PCI_ERR_UNCOR_SEVER);
> +    aer_save_one_register(s, PCI_ERR_COR_MASK);
> +    aer_save_one_register(s, PCI_ERR_CAP);
> +}
> +static void aer_restore_one_register(XenPCIPassthroughState *s, int offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint32_t aer_base = s->pm_state->aer_base;
> +    uint32_t config = 0;
> +
> +    config = pci_get_long(d->config + aer_base + offset);
> +    host_pci_set_long(s->real_device, aer_base + offset, config);
> +}
> +static void pt_aer_reg_restore(XenPCIPassthroughState *s)
> +{
> +    /* the following registers should be reconfigured to correct values
> +     * after reset. restore them.
> +     * other registers should not be reconfigured after reset
> +     * if there is no reason
> +     */
> +    aer_restore_one_register(s, PCI_ERR_UNCOR_MASK);
> +    aer_restore_one_register(s, PCI_ERR_UNCOR_SEVER);
> +    aer_restore_one_register(s, PCI_ERR_COR_MASK);
> +    aer_restore_one_register(s, PCI_ERR_CAP);
> +}
> +
> +/* capability structure register group size functions */
> +
> +static uint8_t pt_reg_grp_size_init(XenPCIPassthroughState *s,
> +                                    const XenPTRegGroupInfo *grp_reg,
> +                                    uint32_t base_offset)
> +{
> +    return grp_reg->grp_size;
> +}
> +/* get Power Management Capability Structure register group size */
> +static uint8_t pt_pm_size_init(XenPCIPassthroughState *s,
> +                               const XenPTRegGroupInfo *grp_reg,
> +                               uint32_t base_offset)
> +{
> +    if (!s->power_mgmt) {
> +        return grp_reg->grp_size;
> +    }
> +
> +    s->pm_state = g_malloc0(sizeof (XenPTPM));
> +
> +    /* set Power Management Capability base offset */
> +    s->pm_state->pm_base = base_offset;
> +
> +    /* find AER register and set AER Capability base offset */
> +    s->pm_state->aer_base = host_pci_find_ext_cap_offset(s->real_device,
> +                                                         PCI_EXT_CAP_ID_ERR);
> +
> +    /* save AER register */
> +    if (s->pm_state->aer_base) {
> +        pt_aer_reg_save(s);
> +    }
> +
> +    return grp_reg->grp_size;
> +}
> +/* get Vendor Specific Capability Structure register group size */
> +static uint8_t pt_vendor_size_init(XenPCIPassthroughState *s,
> +                                   const XenPTRegGroupInfo *grp_reg,
> +                                   uint32_t base_offset)
> +{
> +    return pci_get_byte(s->dev.config + base_offset + 0x02);
> +}
> +/* get PCI Express Capability Structure register group size */
> +static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
> +                                 const XenPTRegGroupInfo *grp_reg,
> +                                 uint32_t base_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t exp_flag = 0;
> +    uint16_t type = 0;
> +    uint16_t version = 0;
> +    uint8_t pcie_size = 0;
> +
> +    exp_flag = pci_get_word(d->config + base_offset + PCI_EXP_FLAGS);
> +    type = (exp_flag & PCI_EXP_FLAGS_TYPE) >> 4;
> +    version = exp_flag & PCI_EXP_FLAGS_VERS;
> +
> +    /* the size depends on capability version and device/port type */
> +    /* in case of PCI Express Base Specification Rev 1.x */
> +    if (version == 1) {
> +        /* The PCI Express Capabilities, Device Capabilities, and Device
> +         * Status/Control registers are required for all PCI Express devices.
> +         * The Link Capabilities and Link Status/Control are required for all
> +         * Endpoints that are not Root Complex Integrated Endpoints. Endpoints
> +         * are not required to implement registers other than those listed
> +         * above and terminate the capability structure.
> +         */
> +        switch (type) {
> +        case PCI_EXP_TYPE_ENDPOINT:
> +        case PCI_EXP_TYPE_LEG_END:
> +            pcie_size = 0x14;
> +            break;
> +        case PCI_EXP_TYPE_RC_END:
> +            /* has no link */
> +            pcie_size = 0x0C;
> +            break;
> +        /* only EndPoint passthrough is supported */
> +        case PCI_EXP_TYPE_ROOT_PORT:
> +        case PCI_EXP_TYPE_UPSTREAM:
> +        case PCI_EXP_TYPE_DOWNSTREAM:
> +        case PCI_EXP_TYPE_PCI_BRIDGE:
> +        case PCI_EXP_TYPE_PCIE_BRIDGE:
> +        case PCI_EXP_TYPE_RC_EC:
> +        default:
> +            hw_error("Internal error: Unsupported device/port type[%d]. "
> +                     "I/O emulator exit.\n", type);
> +        }
> +    }
> +    /* in case of PCI Express Base Specification Rev 2.0 */
> +    else if (version == 2) {
> +        switch (type) {
> +        case PCI_EXP_TYPE_ENDPOINT:
> +        case PCI_EXP_TYPE_LEG_END:
> +        case PCI_EXP_TYPE_RC_END:
> +            /* For Functions that do not implement the registers,
> +             * these spaces must be hardwired to 0b.
> +             */
> +            pcie_size = 0x3C;
> +            break;
> +        /* only EndPoint passthrough is supported */
> +        case PCI_EXP_TYPE_ROOT_PORT:
> +        case PCI_EXP_TYPE_UPSTREAM:
> +        case PCI_EXP_TYPE_DOWNSTREAM:
> +        case PCI_EXP_TYPE_PCI_BRIDGE:
> +        case PCI_EXP_TYPE_PCIE_BRIDGE:
> +        case PCI_EXP_TYPE_RC_EC:
> +        default:
> +            hw_error("Internal error: Unsupported device/port type[%d]. "
> +                     "I/O emulator exit.\n", type);
> +        }
> +    } else {
> +        hw_error("Internal error: Unsupported capability version[%d]. "
> +                 "I/O emulator exit.\n", version);
> +    }
> +
> +    return pcie_size;
> +}
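
To make the size selection above easier to check, here is a standalone sketch of the same decision table. It is an illustration under assumptions: the constants carry the values from Linux's pci_regs.h, pcie_cap_size is a hypothetical name, and it returns 0 for unsupported combinations where the patch calls hw_error().

```c
#include <stdint.h>

/* Values as in Linux's pci_regs.h */
#define EXP_FLAGS_VERS    0x000f  /* capability version */
#define EXP_FLAGS_TYPE    0x00f0  /* device/port type */
#define EXP_TYPE_ENDPOINT 0x0
#define EXP_TYPE_LEG_END  0x1
#define EXP_TYPE_RC_END   0x9

/* Hypothetical helper: size of the PCI Express capability structure
 * for an endpoint, or 0 if the version/type pair is not supported. */
static uint8_t pcie_cap_size(uint16_t exp_flags)
{
    uint16_t version = exp_flags & EXP_FLAGS_VERS;
    uint16_t type = (exp_flags & EXP_FLAGS_TYPE) >> 4;

    if (version == 1) {
        switch (type) {
        case EXP_TYPE_ENDPOINT:
        case EXP_TYPE_LEG_END:
            return 0x14;  /* includes the Link registers */
        case EXP_TYPE_RC_END:
            return 0x0C;  /* has no link */
        default:
            return 0;
        }
    }
    if (version == 2) {
        switch (type) {
        case EXP_TYPE_ENDPOINT:
        case EXP_TYPE_LEG_END:
        case EXP_TYPE_RC_END:
            return 0x3C;  /* unimplemented registers read as 0 */
        default:
            return 0;
        }
    }
    return 0;
}
```

For example, a flags word of 0x0091 (version 1, Root Complex Integrated Endpoint) gives 0x0C, while 0x0002 (version 2, Endpoint) gives 0x3C.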
> +
> +static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
> +    /* Header Type0 reg group */
> +    {
> +        .grp_id      = 0xFF,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x40,
> +        .size_init   = pt_reg_grp_size_init,
> +        .emu_reg_tbl = pt_emu_reg_header0_tbl,
> +    },
> +    /* PCI PowerManagement Capability reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_PM,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = PCI_PM_SIZEOF,
> +        .size_init   = pt_pm_size_init,
> +        .emu_reg_tbl = pt_emu_reg_pm_tbl,
> +    },
> +    /* AGP Capability Structure reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_AGP,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x30,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Vital Product Data Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_VPD,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x08,
> +        .size_init   = pt_reg_grp_size_init,
> +        .emu_reg_tbl = pt_emu_reg_vpd_tbl,
> +    },
> +    /* Slot Identification reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SLOTID,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x04,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* PCI-X Capabilities List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_PCIX,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x18,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Vendor Specific Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_VNDR,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_vendor_size_init,
> +        .emu_reg_tbl = pt_emu_reg_vendor_tbl,
> +    },
> +    /* SHPC Capability List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SHPC,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x08,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Subsystem ID and Subsystem Vendor ID Capability List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SSVID,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x08,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* AGP 8x Capability Structure reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_AGP3,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x30,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* PCI Express Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_EXP,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_pcie_size_init,
> +        .emu_reg_tbl = pt_emu_reg_pcie_tbl,
> +    },
> +    {
> +        .grp_size = 0,
> +    },
> +};
> +
> +/* initialize Capabilities Pointer or Next Pointer register */
> +static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s,
> +                                XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    /* uint32_t reg_field = (uint32_t)s->dev.config[real_offset]; */
> +    uint32_t reg_field = pci_get_byte(s->dev.config + real_offset);
> +    int i;
> +
> +    /* find capability offset */
> +    while (reg_field) {
> +        for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
> +            if (pt_hide_dev_cap(s->real_device,
> +                                pt_emu_reg_grp_tbl[i].grp_id)) {
> +                continue;
> +            }
> +            if (pt_emu_reg_grp_tbl[i].grp_id == s->dev.config[reg_field]) {
> +                if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
> +                    goto out;
> +                }
> +                /* ignore the 0 hardwired capability, find next one */
> +                break;
> +            }
> +        }
> +        /* next capability */
> +        /* reg_field = (uint32_t)s->dev.config[reg_field + 1]; */
> +        reg_field = pci_get_byte(s->dev.config + reg_field + 1);
> +    }
> +
> +out:
> +    return reg_field;
> +}
> +
> +
> +/*************
> + * Main
> + */
> +
> +/* restore a part of I/O device register */
> +static void pt_config_restore(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +    uint32_t real_offset = 0;
> +    uint32_t read_val = 0;
> +    uint32_t val = 0;
> +    int ret = 0;
> +
> +    /* find emulate register group entry */
> +    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
> +        /* find emulate register entry */
> +        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
> +            reg = reg_entry->reg;
> +
> +            /* check whether restoring is needed */
> +            if (!reg->u.b.restore) {
> +                continue;
> +            }
> +
> +            real_offset = reg_grp_entry->base_offset + reg->offset;
> +
> +            /* read I/O device register value */
> +            ret = host_pci_get_block(s->real_device, real_offset,
> +                                     (uint8_t *)&read_val, reg->size);
> +
> +            if (!ret) {
> +                PT_LOG("Error: pci_read_block failed. "
> +                       "return value[%d].\n", ret);
> +                memset(&read_val, 0xff, reg->size);
> +            }
> +
> +            val = 0;
> +
> +            /* restore based on register size */
> +            switch (reg->size) {
> +            case 1:
> +                /* byte register */
> +                ret = reg->u.b.restore(s, reg_entry, real_offset,
> +                                       (uint8_t)read_val, (uint8_t *)&val);
> +                break;
> +            case 2:
> +                /* word register */
> +                ret = reg->u.w.restore(s, reg_entry, real_offset,
> +                                       (uint16_t)read_val, (uint16_t *)&val);
> +                break;
> +            case 4:
> +                /* double word register */
> +                ret = reg->u.dw.restore(s, reg_entry, real_offset,
> +                                        (uint32_t)read_val, (uint32_t *)&val);
> +                break;
> +            }
> +
> +            /* restoring error */
> +            if (ret < 0) {
> +                hw_error("Internal error: Invalid restoring "
> +                         "return value[%d]. I/O emulator exit.\n", ret);
> +            }
> +
> +            PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                          pci_bus_num(s->dev.bus), PCI_SLOT(s->dev.devfn),
> +                          PCI_FUNC(s->dev.devfn),
> +                          real_offset, val, reg->size);
> +
> +            ret = host_pci_set_block(s->real_device, real_offset,
> +                                     (uint8_t *)&val, reg->size);
> +
> +            if (!ret) {
> +                PT_LOG("Error: pci_write_block failed. "
> +                       "return value[%d].\n", ret);
> +            }
> +        }
> +    }
> +
> +    /* if AER supported, restore it */
> +    if (s->pm_state->aer_base) {
> +        pt_aer_reg_restore(s);
> +    }
> +}
> +/* reinitialize all emulate registers */
> +static void pt_config_reinit(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +
> +    /* find emulate register group entry */
> +    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
> +        /* find emulate register entry */
> +        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
> +            reg = reg_entry->reg;
> +            if (reg->init) {
> +                /* initialize emulate register */
> +                reg_entry->data =
> +                    reg->init(s, reg_entry->reg,
> +                              reg_grp_entry->base_offset + reg->offset);
> +            }
> +        }
> +    }
> +}
> +
> +static int pt_init_pci_config(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    int ret = 0;
> +
> +    PT_LOG("Reinitialize PCI configuration registers due to power state"
> +           " transition with internal reset. [%02x:%02x.%x]\n",
> +           pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));
> +
> +    /* restore a part of I/O device register */
> +    pt_config_restore(s);
> +
> +    /* reinitialize all emulate register */
> +    pt_config_reinit(s);
> +
> +    /* rebind machine_irq to device */
> +    if (s->machine_irq != 0) {
> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> +        uint8_t e_intx = pci_intx(s);
> +
> +        ret = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq, 0,
> +                                        e_device, e_intx);
> +        if (ret < 0) {
> +            PT_LOG("Error: Rebinding of interrupt failed! ret=%d\n", ret);
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static uint8_t find_cap_offset(XenPCIPassthroughState *s, uint8_t cap)
> +{
> +    int id;
> +    int max_cap = 48;
> +    int pos = PCI_CAPABILITY_LIST;
> +    int status;
> +
> +    status = host_pci_get_byte(s->real_device, PCI_STATUS);
> +    if ((status & PCI_STATUS_CAP_LIST) == 0) {
> +        return 0;
> +    }
> +
> +    while (max_cap--) {
> +        pos = host_pci_get_byte(s->real_device, pos);
> +        if (pos < 0x40) {
> +            break;
> +        }
> +
> +        pos &= ~3;
> +        id = host_pci_get_byte(s->real_device, pos + PCI_CAP_LIST_ID);
> +
> +        if (id == 0xff) {
> +            break;
> +        }
> +        if (id == cap) {
> +            return pos;
> +        }
> +
> +        pos += PCI_CAP_LIST_NEXT;
> +    }
> +    return 0;
> +}
> +
> +static void pt_config_reg_init(XenPCIPassthroughState *s,
> +                               XenPTRegGroup *reg_grp, XenPTRegInfo *reg)
> +{
> +    XenPTReg *reg_entry;
> +    uint32_t data = 0;
> +
> +    reg_entry = g_malloc0(sizeof (XenPTReg));
> +
> +    reg_entry->reg = reg;
> +    reg_entry->data = 0;
> +
> +    if (reg->init) {
> +        /* initialize emulate register */
> +        data = reg->init(s, reg_entry->reg,
> +                         reg_grp->base_offset + reg->offset);
> +        if (data == PT_INVALID_REG) {
> +            /* free unused BAR register entry */
> +            g_free(reg_entry);
> +            return;
> +        }
> +        /* set register value */
> +        reg_entry->data = data;
> +    }
> +    /* list add register entry */
> +    QLIST_INSERT_HEAD(&reg_grp->reg_tbl_list, reg_entry, entries);
> +
> +    return;
> +}
> +
> +void pt_config_init(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    uint32_t reg_grp_offset = 0;
> +    XenPTRegInfo *reg_tbl = NULL;
> +    int i, j;
> +
> +    QLIST_INIT(&s->reg_grp_tbl);
> +
> +    for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
> +        if (pt_emu_reg_grp_tbl[i].grp_id != 0xFF) {
> +            if (pt_hide_dev_cap(s->real_device,
> +                                pt_emu_reg_grp_tbl[i].grp_id)) {
> +                continue;
> +            }
> +
> +            reg_grp_offset = find_cap_offset(s, pt_emu_reg_grp_tbl[i].grp_id);
> +
> +            if (!reg_grp_offset) {
> +                continue;
> +            }
> +        }
> +
> +        reg_grp_entry = g_malloc0(sizeof (XenPTRegGroup));
> +        QLIST_INIT(&reg_grp_entry->reg_tbl_list);
> +        QLIST_INSERT_HEAD(&s->reg_grp_tbl, reg_grp_entry, entries);
> +
> +        reg_grp_entry->base_offset = reg_grp_offset;
> +        reg_grp_entry->reg_grp = pt_emu_reg_grp_tbl + i;
> +        if (pt_emu_reg_grp_tbl[i].size_init) {
> +            /* get register group size */
> +            reg_grp_entry->size =
> +                pt_emu_reg_grp_tbl[i].size_init(s, reg_grp_entry->reg_grp,
> +                                                reg_grp_offset);
> +        }
> +
> +        if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
> +            if (pt_emu_reg_grp_tbl[i].emu_reg_tbl) {
> +                reg_tbl = pt_emu_reg_grp_tbl[i].emu_reg_tbl;
> +                /* initialize capability register */
> +                for (j = 0; reg_tbl->size != 0; j++, reg_tbl++) {
> +                    /* initialize capability register */
> +                    pt_config_reg_init(s, reg_grp_entry, reg_tbl);
> +                }
> +            }
> +        }
> +        reg_grp_offset = 0;
> +    }
> +
> +    return;
> +}
> +
> +/* delete all emulate register */
> +void pt_config_delete(XenPCIPassthroughState *s)
> +{
> +    struct XenPTRegGroup *reg_group, *next_grp;
> +    struct XenPTReg *reg, *next_reg;
> +
> +    /* free Power Management info table */
> +    if (s->pm_state) {
> +        if (s->pm_state->pm_timer) {
> +            qemu_del_timer(s->pm_state->pm_timer);
> +            qemu_free_timer(s->pm_state->pm_timer);
> +            s->pm_state->pm_timer = NULL;
> +        }
> +
> +        g_free(s->pm_state);
> +    }
> +
> +    /* free all register group entry */
> +    QLIST_FOREACH_SAFE(reg_group, &s->reg_grp_tbl, entries, next_grp) {
> +        /* free all register entry */
> +        QLIST_FOREACH_SAFE(reg, &reg_group->reg_tbl_list, entries, next_reg) {
> +            QLIST_REMOVE(reg, entries);
> +            g_free(reg);
> +        }
> +
> +        QLIST_REMOVE(reg_group, entries);
> +        g_free(reg_group);
> +    }
> +}
> --
> Anthony PERARD
> 
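
For anyone following the capability walk in find_cap_offset() above, here is a
minimal standalone sketch of the same loop, run against an in-memory copy of
the 256-byte config space instead of host_pci_get_byte(). The function name
cfg_find_cap and the CFG_* macros are made up for illustration and are not
part of the patch; the offsets are the standard PCI ones (status at 0x06,
capability pointer at 0x34).

```c
#include <stdint.h>

/* Standard PCI config-space offsets; the CFG_* names are hypothetical. */
#define CFG_STATUS          0x06  /* low byte of the Status register */
#define CFG_STATUS_CAP_LIST 0x10  /* "capabilities list present" bit */
#define CFG_CAP_PTR         0x34  /* Capabilities Pointer register */
#define CFG_CAP_ID          0     /* capability ID, relative to entry */
#define CFG_CAP_NEXT        1     /* next pointer, relative to entry */

/* Return the offset of capability `cap`, or 0 if it is not in the list. */
static uint8_t cfg_find_cap(const uint8_t cfg[256], uint8_t cap)
{
    int max_cap = 48;            /* bound the walk in case of a loop */
    int pos = CFG_CAP_PTR;

    if ((cfg[CFG_STATUS] & CFG_STATUS_CAP_LIST) == 0) {
        return 0;                /* device has no capability list */
    }

    while (max_cap--) {
        pos = cfg[pos];          /* follow the pointer stored at pos */
        if (pos < 0x40) {
            break;               /* pointers below 0x40 end the list */
        }
        pos &= ~3;               /* entries are dword aligned */
        if (cfg[pos + CFG_CAP_ID] == 0xff) {
            break;               /* 0xff usually means no device */
        }
        if (cfg[pos + CFG_CAP_ID] == cap) {
            return (uint8_t)pos;
        }
        pos += CFG_CAP_NEXT;     /* next iteration reads the next pointer */
    }
    return 0;
}
```

The max_cap bound matters: a malformed device could chain its capability
pointers into a cycle, and the counter guarantees the walk still terminates.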

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH V3 09/10] Introduce apic-msidef.h
  2011-10-28 15:07   ` Anthony PERARD
@ 2011-11-08 12:57     ` Stefano Stabellini
  -1 siblings, 0 replies; 60+ messages in thread
From: Stefano Stabellini @ 2011-11-08 12:57 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Xen Devel, Michael S. Tsirkin, QEMU-devel, Stefano Stabellini

On Fri, 28 Oct 2011, Anthony PERARD wrote:
> This patch moves the MSI definitions from apic.c to apic-msidef.h so that they
> can also be used by other .c files.


You should CC Michael on this one.


> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  hw/apic-msidef.h |   28 ++++++++++++++++++++++++++++
>  hw/apic.c        |   11 +----------
>  2 files changed, 29 insertions(+), 10 deletions(-)
>  create mode 100644 hw/apic-msidef.h
> 
> diff --git a/hw/apic-msidef.h b/hw/apic-msidef.h
> new file mode 100644
> index 0000000..3182f0b
> --- /dev/null
> +++ b/hw/apic-msidef.h
> @@ -0,0 +1,28 @@
> +#ifndef HW_APIC_MSIDEF_H
> +#define HW_APIC_MSIDEF_H
> +
> +/*
> + * Intel APIC constants: from include/asm/msidef.h
> + */
> +
> +/*
> + * Shifts for MSI data
> + */
> +
> +#define MSI_DATA_VECTOR_SHIFT           0
> +#define  MSI_DATA_VECTOR_MASK           0x000000ff
> +
> +#define MSI_DATA_DELIVERY_MODE_SHIFT    8
> +#define MSI_DATA_LEVEL_SHIFT            14
> +#define MSI_DATA_TRIGGER_SHIFT          15
> +
> +/*
> + * Shift/mask fields for msi address
> + */
> +
> +#define MSI_ADDR_DEST_MODE_SHIFT        2
> +
> +#define MSI_ADDR_DEST_ID_SHIFT          12
> +#define  MSI_ADDR_DEST_ID_MASK          0x00ffff0
> +
> +#endif /* HW_APIC_MSIDEF_H */
> diff --git a/hw/apic.c b/hw/apic.c
> index 8289eef..18c4a87 100644
> --- a/hw/apic.c
> +++ b/hw/apic.c
> @@ -24,6 +24,7 @@
>  #include "sysbus.h"
>  #include "trace.h"
>  #include "pc.h"
> +#include "apic-msidef.h"
>  
>  /* APIC Local Vector Table */
>  #define APIC_LVT_TIMER   0
> @@ -65,16 +66,6 @@
>  #define MAX_APICS 255
>  #define MAX_APIC_WORDS 8
>  
> -/* Intel APIC constants: from include/asm/msidef.h */
> -#define MSI_DATA_VECTOR_SHIFT		0
> -#define MSI_DATA_VECTOR_MASK		0x000000ff
> -#define MSI_DATA_DELIVERY_MODE_SHIFT	8
> -#define MSI_DATA_TRIGGER_SHIFT		15
> -#define MSI_DATA_LEVEL_SHIFT		14
> -#define MSI_ADDR_DEST_MODE_SHIFT	2
> -#define MSI_ADDR_DEST_ID_SHIFT		12
> -#define	MSI_ADDR_DEST_ID_MASK		0x00ffff0
> -
>  #define MSI_ADDR_SIZE                   0x100000
>  
>  typedef struct APICState APICState;
> -- 
> Anthony PERARD
> 
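
As a rough illustration of how another .c file could use these constants once
they live in a header, here is a small sketch that decodes an MSI
address/data pair. The struct msi_fields type and the msi_decode() function
are hypothetical names, not part of the patch. The macro values are copied
from the quoted header, and the 3-bit width of the delivery mode field is an
assumption based on the usual Intel MSI data layout.

```c
#include <stdint.h>

/* Values copied from the quoted apic-msidef.h. */
#define MSI_DATA_VECTOR_SHIFT           0
#define MSI_DATA_VECTOR_MASK            0x000000ff
#define MSI_DATA_DELIVERY_MODE_SHIFT    8
#define MSI_DATA_LEVEL_SHIFT            14
#define MSI_DATA_TRIGGER_SHIFT          15
#define MSI_ADDR_DEST_MODE_SHIFT        2
#define MSI_ADDR_DEST_ID_SHIFT          12
#define MSI_ADDR_DEST_ID_MASK           0x00ffff0

/* Hypothetical container for the decoded fields. */
struct msi_fields {
    uint8_t dest_id;
    uint8_t dest_mode;
    uint8_t vector;
    uint8_t delivery_mode;
    uint8_t level;
    uint8_t trigger_mode;
};

static struct msi_fields msi_decode(uint32_t addr, uint32_t data)
{
    struct msi_fields f;

    /* The mask keeps bits 4..19; after the shift, the uint8_t keeps 12..19. */
    f.dest_id = (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
    f.dest_mode = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
    f.vector = (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;
    /* Assumed 3-bit field (bits 8..10) per the usual MSI data format. */
    f.delivery_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
    f.level = (data >> MSI_DATA_LEVEL_SHIFT) & 0x1;
    f.trigger_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
    return f;
}
```

For example, msi_decode(0xfee05004, 0xc141) gives destination ID 5 with
destination mode 1, vector 0x41, delivery mode 1, and both the level and
trigger bits set.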

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
  2011-11-08 12:56     ` Stefano Stabellini
@ 2011-11-09 17:03       ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-09 17:03 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel

On Tue, Nov 8, 2011 at 12:56, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Fri, 28 Oct 2011, Anthony PERARD wrote:
>> From: Allen Kay <allen.m.kay@intel.com>
>>
>> Signed-off-by: Allen Kay <allen.m.kay@intel.com>
>> Signed-off-by: Guy Zana <guy@neocleus.com>
>> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
>> ---
>>  Makefile.target                  |    2 +
>>  hw/xen_pci_passthrough.c         |  838 ++++++++++++++++++++++++++++++++++++++
>>  hw/xen_pci_passthrough.h         |  223 ++++++++++
>>  hw/xen_pci_passthrough_helpers.c |   46 ++
>>  4 files changed, 1109 insertions(+), 0 deletions(-)
>>  create mode 100644 hw/xen_pci_passthrough.c
>>  create mode 100644 hw/xen_pci_passthrough.h
>>  create mode 100644 hw/xen_pci_passthrough_helpers.c
>>
>> diff --git a/Makefile.target b/Makefile.target
>> index 243f9f2..36ea47d 100644
>> --- a/Makefile.target
>> +++ b/Makefile.target
>> @@ -217,6 +217,8 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
>>
>>  # Xen PCI Passthrough
>>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
>> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
>> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
>>
>>  # Inter-VM PCI shared memory
>>  CONFIG_IVSHMEM =
>> diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
>> new file mode 100644
>> index 0000000..b97c5b6
>> --- /dev/null
>> +++ b/hw/xen_pci_passthrough.c
>> @@ -0,0 +1,838 @@
>> +/*
>> + * Copyright (c) 2007, Neocleus Corporation.
>> + * Copyright (c) 2007, Intel Corporation.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Alex Novik <alex@neocleus.com>
>> + * Allen Kay <allen.m.kay@intel.com>
>> + * Guy Zana <guy@neocleus.com>
>> + *
>> + * This file implements direct PCI assignment to a HVM guest
>> + */
>> +
>> +/*
>> + * Interrupt Disable policy:
>> + *
>> + * INTx interrupt:
>> + *   Initialize(register_real_device)
>> + *     Map INTx(xc_physdev_map_pirq):
>> + *       <fail>
>> + *         - Set real Interrupt Disable bit to '1'.
>> + *         - Set machine_irq and assigned_device->machine_irq to '0'.
>> + *         * Don't bind INTx.
>> + *
>> + *     Bind INTx(xc_domain_bind_pt_pci_irq):
>> + *       <fail>
>> + *         - Set real Interrupt Disable bit to '1'.
>> + *         - Unmap INTx.
>> + *         - Decrement mapped_machine_irq[machine_irq]
>> + *         - Set assigned_device->machine_irq to '0'.
>> + *
>> + *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
>> + *     Write '0'
>> + *       <ptdev->msi_trans_en is false>
>> + *         - Set real bit to '0' if assigned_device->machine_irq isn't '0'.
>> + *
>> + *     Write '1'
>> + *       <ptdev->msi_trans_en is false>
>> + *         - Set real bit to '1'.
>> + *
>> + * MSI-INTx translation.
>> + *   Initialize(xc_physdev_map_pirq_msi/pt_msi_setup)
>> + *     Bind MSI-INTx(xc_domain_bind_pt_irq)
>> + *       <fail>
>> + *         - Unmap MSI.
>> + *           <success>
>> + *             - Set dev->msi->pirq to '-1'.
>> + *           <fail>
>> + *             - Do nothing.
>> + *
>> + *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
>> + *     Write '0'
>> + *       <ptdev->msi_trans_en is true>
>> + *         - Set MSI Enable bit to '1'.
>> + *
>> + *     Write '1'
>> + *       <ptdev->msi_trans_en is true>
>> + *         - Set MSI Enable bit to '0'.
>> + *
>> + * MSI interrupt:
>> + *   Initialize MSI register(pt_msi_setup, pt_msi_update)
>> + *     Bind MSI(xc_domain_update_msi_irq)
>> + *       <fail>
>> + *         - Unmap MSI.
>> + *         - Set dev->msi->pirq to '-1'.
>> + *
>> + * MSI-X interrupt:
>> + *   Initialize MSI-X register(pt_msix_update_one)
>> + *     Bind MSI-X(xc_domain_update_msi_irq)
>> + *       <fail>
>> + *         - Unmap MSI-X.
>> + *         - Set entry->pirq to '-1'.
>> + */
>> +
>
> You should move all the MSI-related comments to the MSI patch.

OK, I will move the MSI comments.

>> +#include <sys/ioctl.h>
>> +
>> +#include "pci.h"
>> +#include "xen.h"
>> +#include "xen_backend.h"
>> +#include "xen_pci_passthrough.h"
>> +
>> +#define PCI_BAR_ENTRIES (6)
>> +
>> +#define PT_NR_IRQS          (256)
>> +char mapped_machine_irq[PT_NR_IRQS] = {0};
>> +
>> +/* Config Space */
>> +static int pt_pci_config_access_check(PCIDevice *d, uint32_t address, int len)
>> +{
>> +    /* check offset range */
>> +    if (address >= 0xFF) {
>> +        PT_LOG("Error: Failed to access register with offset exceeding FFh. "
>> +               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +               address, len);
>> +        return -1;
>> +    }
>> +
>> +    /* check read size */
>> +    if ((len != 1) && (len != 2) && (len != 4)) {
>> +        PT_LOG("Error: Failed to access register with invalid access length. "
>> +               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +               address, len);
>> +        return -1;
>> +    }
>> +
>> +    /* check offset alignment */
>> +    if (address & (len - 1)) {
>> +        PT_LOG("Error: Failed to access register with invalid access size "
>> +            "alignment. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +            pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +            address, len);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int pt_bar_offset_to_index(uint32_t offset)
>> +{
>> +    int index = 0;
>> +
>> +    /* check Exp ROM BAR */
>> +    if (offset == PCI_ROM_ADDRESS) {
>> +        return PCI_ROM_SLOT;
>> +    }
>> +
>> +    /* calculate BAR index */
>> +    index = (offset - PCI_BASE_ADDRESS_0) >> 2;
>> +    if (index >= PCI_NUM_REGIONS) {
>> +        return -1;
>> +    }
>> +
>> +    return index;
>> +}
>> +
>> +static uint32_t pt_pci_read_config(PCIDevice *d, uint32_t address, int len)
>> +{
>> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
>> +    uint32_t val = 0;
>> +    XenPTRegGroup *reg_grp_entry = NULL;
>> +    XenPTReg *reg_entry = NULL;
>> +    int rc = 0;
>> +    int emul_len = 0;
>> +    uint32_t find_addr = address;
>> +
>> +    if (pt_pci_config_access_check(d, address, len)) {
>> +        goto exit;
>> +    }
>> +
>> +    /* check power state transition flags */
>> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
>> +        /* can't accept a new request until the previous power state
>> +         * transition is completed, so finish this request here.
>> +         */
>> +        PT_LOG("Warning: guest wants to read during power state transition\n");
>> +        goto exit;
>> +    }
>> +
>> +    /* find register group entry */
>> +    reg_grp_entry = pt_find_reg_grp(s, address);
>> +    if (reg_grp_entry) {
>> +        /* check 0 Hardwired register group */
>> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
>> +            /* no need to emulate, just return 0 */
>> +            val = 0;
>> +            goto exit;
>> +        }
>> +    }
>> +
>> +    /* read I/O device register value */
>> +    rc = host_pci_get_block(s->real_device, address, (uint8_t *)&val, len);
>> +    if (!rc) {
>> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
>> +        memset(&val, 0xff, len);
>> +    }
>> +
>> +    /* just return the I/O device register value for
>> +     * passthrough type register group */
>> +    if (reg_grp_entry == NULL) {
>> +        goto exit;
>> +    }
>> +
>> +    /* adjust the read value to appropriate CFC-CFF window */
>> +    val <<= (address & 3) << 3;
>> +    emul_len = len;
>> +
>> +    /* loop Guest request size */
>> +    while (emul_len > 0) {
>> +        /* find register entry to be emulated */
>> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
>> +        if (reg_entry) {
>> +            XenPTRegInfo *reg = reg_entry->reg;
>> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
>> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
>> +            uint8_t *ptr_val = NULL;
>> +
>> +            valid_mask <<= (find_addr - real_offset) << 3;
>> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
>> +
>> +            /* do emulation depend on register size */
>> +            switch (reg->size) {
>> +            case 1:
>> +                if (reg->u.b.read) {
>> +                    rc = reg->u.b.read(s, reg_entry, ptr_val, valid_mask);
>> +                }
>> +                break;
>> +            case 2:
>> +                if (reg->u.w.read) {
>> +                    rc = reg->u.w.read(s, reg_entry,
>> +                                       (uint16_t *)ptr_val, valid_mask);
>> +                }
>> +                break;
>> +            case 4:
>> +                if (reg->u.dw.read) {
>> +                    rc = reg->u.dw.read(s, reg_entry,
>> +                                        (uint32_t *)ptr_val, valid_mask);
>> +                }
>> +                break;
>> +            }
>> +
>> +            if (rc < 0) {
>> +                hw_error("Internal error: Invalid read emulation "
>> +                         "return value[%d]. I/O emulator exit.\n", rc);
>> +            }
>> +
>> +            /* calculate next address to find */
>> +            emul_len -= reg->size;
>> +            if (emul_len > 0) {
>> +                find_addr = real_offset + reg->size;
>> +            }
>> +        } else {
>> +            /* nothing to do with passthrough type register,
>> +             * continue to find next byte */
>> +            emul_len--;
>> +            find_addr++;
>> +        }
>> +    }
>> +
>> +    /* need to shift back before returning them to pci bus emulator */
>> +    val >>= ((address & 3) << 3);
>> +
>> +exit:
>> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
>> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +                  address, val, len);
>> +    return val;
>> +}
>> +
>> +static void pt_pci_write_config(PCIDevice *d, uint32_t address,
>> +                                uint32_t val, int len)
>> +{
>> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
>> +    int index = 0;
>> +    XenPTRegGroup *reg_grp_entry = NULL;
>> +    int rc = 0;
>> +    uint32_t read_val = 0;
>> +    int emul_len = 0;
>> +    XenPTReg *reg_entry = NULL;
>> +    uint32_t find_addr = address;
>> +    XenPTRegInfo *reg = NULL;
>> +
>> +    if (pt_pci_config_access_check(d, address, len)) {
>> +        return;
>> +    }
>> +
>> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
>> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +                  address, val, len);
>> +
>> +    /* check unused BAR register */
>> +    index = pt_bar_offset_to_index(address);
>> +    if ((index >= 0) && (val > 0 && val < PT_BAR_ALLF) &&
>> +        (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED)) {
>> +        PT_LOG("Warning: Guest attempt to set address to unused Base Address "
>> +               "Register. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +               address, len);
>> +    }
>> +
>> +    /* check power state transition flags */
>> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
>> +        /* can't accept a new request until the previous power state
>> +         * transition is completed, so finish this request here.
>> +         */
>> +        PT_LOG("Warning: guest wants to write during power state transition\n");
>> +        return;
>> +    }
>> +
>> +    /* find register group entry */
>> +    reg_grp_entry = pt_find_reg_grp(s, address);
>> +    if (reg_grp_entry) {
>> +        /* check 0 Hardwired register group */
>> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
>> +            /* ignore silently */
>> +            PT_LOG("Warning: Access to 0 Hardwired register. "
>> +                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +                   address, len);
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* read I/O device register value */
>> +    rc = host_pci_get_block(s->real_device, address,
>> +                             (uint8_t *)&read_val, len);
>> +    if (!rc) {
>> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
>> +        memset(&read_val, 0xff, len);
>> +    }
>> +
>> +    /* pass directly to libpci for passthrough type register group */
>> +    if (reg_grp_entry == NULL) {
>> +        goto out;
>> +    }
>> +
>> +    /* adjust the read and write value to appropriate CFC-CFF window */
>> +    read_val <<= (address & 3) << 3;
>> +    val <<= (address & 3) << 3;
>> +    emul_len = len;
>> +
>> +    /* loop Guest request size */
>> +    while (emul_len > 0) {
>> +        /* find register entry to be emulated */
>> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
>> +        if (reg_entry) {
>> +            reg = reg_entry->reg;
>> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
>> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
>> +            uint8_t *ptr_val = NULL;
>> +
>> +            valid_mask <<= (find_addr - real_offset) << 3;
>> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
>> +
>> +            /* do emulation depend on register size */
>> +            switch (reg->size) {
>> +            case 1:
>> +                if (reg->u.b.write) {
>> +                    rc = reg->u.b.write(s, reg_entry, ptr_val,
>> +                                        read_val >> ((real_offset & 3) << 3),
>> +                                        valid_mask);
>> +                }
>> +                break;
>> +            case 2:
>> +                if (reg->u.w.write) {
>> +                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
>> +                                        (read_val >> ((real_offset & 3) << 3)),
>> +                                        valid_mask);
>> +                }
>> +                break;
>> +            case 4:
>> +                if (reg->u.dw.write) {
>> +                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
>> +                                         (read_val >> ((real_offset & 3) << 3)),
>> +                                         valid_mask);
>> +                }
>> +                break;
>> +            }
>> +
>> +            if (rc < 0) {
>> +                hw_error("Internal error: Invalid write emulation "
>> +                         "return value[%d]. I/O emulator exit.\n", rc);
>> +            }
>> +
>> +            /* calculate next address to find */
>> +            emul_len -= reg->size;
>> +            if (emul_len > 0) {
>> +                find_addr = real_offset + reg->size;
>> +            }
>> +        } else {
>> +            /* nothing to do with passthrough type register,
>> +             * continue to find next byte */
>> +            emul_len--;
>> +            find_addr++;
>> +        }
>> +    }
>> +
>> +    /* need to shift back before passing them to libpci */
>> +    val >>= (address & 3) << 3;
>> +
>> +out:
>> +    if (!(reg && reg->no_wb)) {
>> +        /* unknown regs are passed through */
>> +        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
>> +
>> +        if (!rc) {
>> +            PT_LOG("Error: pci_write_block failed. return value[%d].\n", rc);
>> +        }
>> +    }
>> +
>> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
>> +        qemu_mod_timer(s->pm_state->pm_timer,
>> +                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
>> +    }
>> +}
>
> Where is this timer allocated and initialized?

In the next patch. I will move these lines to the related patch.

>> +/* ioport/iomem space*/
>> +static void pt_iomem_map(XenPCIPassthroughState *s, int i,
>> +                         pcibus_t e_phys, pcibus_t e_size, int type)
>> +{
>> +    uint32_t old_ebase = s->bases[i].e_physbase;
>> +    bool first_map = s->bases[i].e_size == 0;
>> +    int ret = 0;
>> +
>> +    s->bases[i].e_physbase = e_phys;
>> +    s->bases[i].e_size = e_size;
>> +
>> +    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%%d"
>> +           " len=%#"PRIx64" index=%d first_map=%d\n",
>> +           e_phys, s->bases[i].access.maddr, /*type,*/
>> +           e_size, i, first_map);
>> +
>> +    if (e_size == 0) {
>> +        return;
>> +    }
>> +
>> +    if (!first_map && old_ebase != -1) {
>> +        /* Remove old mapping */
>> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
>> +                               old_ebase >> XC_PAGE_SHIFT,
>> +                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
>> +                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
>> +                               DPCI_REMOVE_MAPPING);
>> +        if (ret != 0) {
>> +            PT_LOG("Error: remove old mapping failed!\n");
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* map only valid guest address */
>> +    if (e_phys != -1) {
>> +        /* Create new mapping */
>> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
>> +                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
>> +                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
>> +                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
>> +                                   DPCI_ADD_MAPPING);
>> +
>> +        if (ret != 0) {
>> +            PT_LOG("Error: create new mapping failed!\n");
>> +        }
>> +    }
>> +}
>> +
>> +static void pt_ioport_map(XenPCIPassthroughState *s, int i,
>> +                          pcibus_t e_phys, pcibus_t e_size, int type)
>> +{
>> +    uint32_t old_ebase = s->bases[i].e_physbase;
>> +    bool first_map = s->bases[i].e_size == 0;
>> +    int ret = 0;
>> +
>> +    s->bases[i].e_physbase = e_phys;
>> +    s->bases[i].e_size = e_size;
>> +
>> +    PT_LOG("e_phys=%#04"PRIx64" pio_base=%#04"PRIx64" len=%"PRId64" index=%d"
>> +           " first_map=%d\n",
>> +           e_phys, s->bases[i].access.pio_base, e_size, i, first_map);
>> +
>> +    if (e_size == 0) {
>> +        return;
>> +    }
>> +
>> +    if (!first_map && old_ebase != -1) {
>> +        /* Remove old mapping */
>> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, old_ebase,
>> +                                       s->bases[i].access.pio_base, e_size,
>> +                                       DPCI_REMOVE_MAPPING);
>> +        if (ret != 0) {
>> +            PT_LOG("Error: remove old mapping failed!\n");
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* map only valid guest address (include 0) */
>> +    if (e_phys != -1) {
>> +        /* Create new mapping */
>> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, e_phys,
>> +                                       s->bases[i].access.pio_base, e_size,
>> +                                       DPCI_ADD_MAPPING);
>> +        if (ret != 0) {
>> +            PT_LOG("Error: create new mapping failed!\n");
>> +        }
>> +    }
>> +
>> +}
>> +
>> +
>> +/* mapping BAR */
>> +
>> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
>> +                        int io_enable, int mem_enable)
>> +{
>> +    PCIDevice *dev = &s->dev;
>> +    PCIIORegion *r;
>> +    XenPTRegGroup *reg_grp_entry = NULL;
>> +    XenPTReg *reg_entry = NULL;
>> +    XenPTRegion *base = NULL;
>> +    pcibus_t r_size = 0, r_addr = -1;
>> +    int rc = 0;
>> +
>> +    r = &dev->io_regions[bar];
>> +
>> +    /* check valid region */
>> +    if (!r->size) {
>> +        return;
>> +    }
>> +
>> +    base = &s->bases[bar];
>> +    /* skip unused BAR or upper 64bit BAR */
>> +    if ((base->bar_flag == PT_BAR_FLAG_UNUSED)
>> +        || (base->bar_flag == PT_BAR_FLAG_UPPER)) {
>> +           return;
>> +    }
>> +
>> +    /* copy region address to temporary */
>> +    r_addr = r->addr;
>> +
>> +    /* need unmapping in case I/O Space or Memory Space disable */
>> +    if (((base->bar_flag == PT_BAR_FLAG_IO) && !io_enable) ||
>> +        ((base->bar_flag == PT_BAR_FLAG_MEM) && !mem_enable)) {
>> +        r_addr = -1;
>> +    }
>> +    if ((bar == PCI_ROM_SLOT) && (r_addr != -1)) {
>> +        reg_grp_entry = pt_find_reg_grp(s, PCI_ROM_ADDRESS);
>> +        if (reg_grp_entry) {
>> +            reg_entry = pt_find_reg(reg_grp_entry, PCI_ROM_ADDRESS);
>> +            if (reg_entry && !(reg_entry->data & PCI_ROM_ADDRESS_ENABLE)) {
>> +                r_addr = -1;
>> +            }
>> +        }
>> +    }
>> +
>> +    /* prevent guest software mapping memory resource to 00000000h */
>> +    if ((base->bar_flag == PT_BAR_FLAG_MEM) && (r_addr == 0)) {
>> +        r_addr = -1;
>> +    }
>> +
>> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
>> +
>> +    rc = pci_check_bar_overlap(dev, r_addr, r_size, r->type);
>> +    if (rc > 0) {
>> +        PT_LOG("Warning: s[%02x:%02x.%x][Region:%d][Address:%"FMT_PCIBUS"h]"
>> +               "[Size:%"FMT_PCIBUS"h] is overlapped.\n", pci_bus_num(dev->bus),
>> +               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), bar,
>> +               r_addr, r_size);
>> +    }
>> +
>> +    /* check whether we need to update the mapping or not */
>> +    if (r_addr != s->bases[bar].e_physbase) {
>> +        /* mapping BAR */
>> +        if (base->bar_flag == PT_BAR_FLAG_IO) {
>> +            pt_ioport_map(s, bar, r_addr, r_size, r->type);
>> +        } else {
>> +            pt_iomem_map(s, bar, r_addr, r_size, r->type);
>> +        }
>> +    }
>> +}
>> +
>> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
>> +        pt_bar_mapping_one(s, i, io_enable, mem_enable);
>> +    }
>> +}
>> +
>> +/* register regions */
>> +static int pt_register_regions(XenPCIPassthroughState *s)
>> +{
>> +    int i = 0;
>> +    uint32_t bar_data = 0;
>> +    HostPCIDevice *d = s->real_device;
>> +
>> +    /* Register PIO/MMIO BARs */
>> +    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
>> +        HostPCIIORegion *r = &d->io_regions[i];
>> +
>> +        if (r->base_addr) {
>> +            s->bases[i].e_physbase = r->base_addr;
>> +            s->bases[i].access.u = r->base_addr;
>> +
>> +            /* Register current region */
>> +            if (r->flags & IORESOURCE_IO) {
>> +                memory_region_init_io(&s->bar[i], NULL, NULL,
>> +                                      "xen-pci-pt-bar", r->size);
>> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
>> +                                 &s->bar[i]);
>> +            } else if (r->flags & IORESOURCE_PREFETCH) {
>> +                memory_region_init_io(&s->bar[i], NULL, NULL,
>> +                                      "xen-pci-pt-bar", r->size);
>> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
>> +                                 &s->bar[i]);
>> +            } else {
>> +                memory_region_init_io(&s->bar[i], NULL, NULL,
>> +                                      "xen-pci-pt-bar", r->size);
>> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
>> +                                 &s->bar[i]);
>> +            }
>> +
>> +            PT_LOG("IO region registered (size=0x%08"PRIx64
>> +                   " base_addr=0x%08"PRIx64")\n",
>> +                   r->size, r->base_addr);
>> +        }
>> +    }
>> +
>> +    /* Register expansion ROM address */
>> +    if (d->rom.base_addr && d->rom.size) {
>> +        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
>> +        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
>> +        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
>> +            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
>> +            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
>> +        }
>> +
>> +        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
>> +        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
>> +
>> +        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
>> +                                      "xen-pci-pt-rom", d->rom.size);
>> +        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
>> +                         &s->rom);
>> +
>> +        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
>> +               " base_addr=0x%08"PRIx64")\n",
>> +               d->rom.size, d->rom.base_addr);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void pt_unregister_regions(XenPCIPassthroughState *s)
>> +{
>> +    int i, type, rc;
>> +    uint32_t e_size;
>> +    PCIDevice *d = &s->dev;
>> +
>> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
>> +        e_size = s->bases[i].e_size;
>> +        if ((e_size == 0) || (s->bases[i].e_physbase == -1)) {
>> +            continue;
>> +        }
>> +
>> +        type = d->io_regions[i].type;
>> +
>> +        if (type == PCI_BASE_ADDRESS_SPACE_MEMORY
>> +            || type == PCI_BASE_ADDRESS_MEM_PREFETCH) {
>> +            rc = xc_domain_memory_mapping(xen_xc, xen_domid,
>> +                    s->bases[i].e_physbase >> XC_PAGE_SHIFT,
>> +                    s->bases[i].access.maddr >> XC_PAGE_SHIFT,
>> +                    (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
>> +                    DPCI_REMOVE_MAPPING);
>> +            if (rc != 0) {
>> +                PT_LOG("Error: remove old mem mapping failed!\n");
>> +                continue;
>> +            }
>> +
>> +        } else if (type == PCI_BASE_ADDRESS_SPACE_IO) {
>> +            rc = xc_domain_ioport_mapping(xen_xc, xen_domid,
>> +                        s->bases[i].e_physbase,
>> +                        s->bases[i].access.pio_base,
>> +                        e_size,
>> +                        DPCI_REMOVE_MAPPING);
>> +            if (rc != 0) {
>> +                PT_LOG("Error: remove old io mapping failed!\n");
>> +                continue;
>> +            }
>> +        }
>> +    }
>> +}
>> +
>> +static int pt_initfn(PCIDevice *pcidev)
>> +{
>> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
>> +    int dom, bus;
>> +    unsigned slot, func;
>> +    int rc = 0;
>> +    uint32_t machine_irq;
>> +    int pirq = -1;
>> +
>> +    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
>> +        fprintf(stderr, "error parse bdf: %s\n", s->hostaddr);
>> +        return -1;
>> +    }
>> +
>> +    /* register real device */
>> +    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
>> +           bus, slot, func, s->dev.devfn);
>> +
>> +    s->real_device = host_pci_device_get(bus, slot, func);
>> +    if (!s->real_device) {
>> +        return -1;
>> +    }
>> +
>> +    s->is_virtfn = s->real_device->is_virtfn;
>> +    if (s->is_virtfn) {
>> +        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
>> +               s->real_device->domain, bus, slot, func);
>> +    }
>> +
>> +    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
>> +    if (host_pci_get_block(s->real_device, 0, pcidev->config,
>> +                           PCI_CONFIG_SPACE_SIZE) == -1) {
>> +        return -1;
>> +    }
>> +
>> +    /* Handle real device's MMIO/PIO BARs */
>> +    pt_register_regions(s);
>> +
>> +    /* reinitialize each config register to be emulated */
>> +    pt_config_init(s);
>
> this function is implemented in the next patch, so you might as well add
> this call there

Ok, I will move this.
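
As a side note on the mapping calls quoted earlier in this mail: they pass page frame numbers and a rounded-up page count to xc_domain_memory_mapping. A minimal sketch of that arithmetic follows, assuming a 4 KiB page (a shift of 12). The macro and function names are made up for the example and stand in for XC_PAGE_SHIFT and XC_PAGE_SIZE.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. These names are illustrative only. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Byte address -> page frame number (drops the offset within the page). */
static uint64_t sketch_addr_to_pfn(uint64_t addr)
{
    return addr >> SKETCH_PAGE_SHIFT;
}

/* Byte size -> number of pages, rounding a partial page up to a full one. */
static uint64_t sketch_size_to_pages(uint64_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

A BAR of 4097 bytes therefore needs two pages, while one of exactly 4096 bytes needs one.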

>> +    /* Bind interrupt */
>> +    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
>> +        PT_LOG("no pin interrupt\n");
>> +        goto out;
>> +    }
>> +
>> +    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
>> +    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
>> +
>> +    if (rc) {
>> +        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
>> +
>> +        /* Disable PCI intx assertion (turn on bit10 of devctl) */
>> +        host_pci_set_word(s->real_device,
>> +                          PCI_COMMAND,
>> +                          pci_get_word(s->dev.config + PCI_COMMAND)
>> +                          | PCI_COMMAND_INTX_DISABLE);
>> +        machine_irq = 0;
>> +        s->machine_irq = 0;
>> +    } else {
>> +        machine_irq = pirq;
>> +        s->machine_irq = pirq;
>> +        mapped_machine_irq[machine_irq]++;
>> +    }
>> +
>> +    /* bind machine_irq to device */
>> +    if (rc < 0 && machine_irq != 0) {
>> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
>> +        uint8_t e_intx = pci_intx(s);
>> +
>> +        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
>> +                                       e_device, e_intx);
>> +        if (rc < 0) {
>> +            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);
>> +
>> +            /* Disable PCI intx assertion (turn on bit10 of devctl) */
>> +            host_pci_set_word(s->real_device, PCI_COMMAND,
>> +                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
>> +                              | PCI_COMMAND_INTX_DISABLE);
>> +            mapped_machine_irq[machine_irq]--;
>> +
>> +            if (mapped_machine_irq[machine_irq] == 0) {
>> +                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
>> +                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
>> +                           rc);
>> +                }
>> +            }
>> +            s->machine_irq = 0;
>> +        }
>> +    }
>> +
>> +out:
>> +    PT_LOG("Real physical device %02x:%02x.%x registered successfully!\n"
>> +           "IRQ type = %s\n", bus, slot, func, "INTx");
>> +
>> +    return 0;
>> +}
>> +
>> +static int pt_unregister_device(PCIDevice *pcidev)
>> +{
>> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
>> +    uint8_t e_device, e_intx;
>> +    uint32_t machine_irq;
>> +    int rc;
>> +
>> +    /* Unbind interrupt */
>> +    e_device = PCI_SLOT(s->dev.devfn);
>> +    e_intx = pci_intx(s);
>> +    machine_irq = s->machine_irq;
>> +
>> +    if (machine_irq) {
>> +        rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
>> +                                     PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
>> +        if (rc < 0) {
>> +            PT_LOG("Error: Unbinding of interrupt failed! rc=%d\n", rc);
>> +        }
>> +    }
>> +
>> +    if (machine_irq) {
>> +        mapped_machine_irq[machine_irq]--;
>> +
>> +        if (mapped_machine_irq[machine_irq] == 0) {
>> +            rc = xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq);
>> +
>> +            if (rc < 0) {
>> +                PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n", rc);
>> +            }
>> +        }
>> +    }
>> +
>> +    /* delete all emulated config registers */
>> +    pt_config_delete(s);
>> +
>> +    /* unregister real device's MMIO/PIO BARs */
>> +    pt_unregister_regions(s);
>> +
>> +    host_pci_device_put(s->real_device);
>> +
>> +    return 0;
>> +}
>> +
>> +static PCIDeviceInfo xen_pci_passthrough = {
>> +    .init = pt_initfn,
>> +    .exit = pt_unregister_device,
>> +    .qdev.name = "xen-pci-passthrough",
>> +    .qdev.desc = "Assign an host pci device with Xen",
>> +    .qdev.size = sizeof(XenPCIPassthroughState),
>> +    .config_read = pt_pci_read_config,
>> +    .config_write = pt_pci_write_config,
>> +    .is_express = 0,
>> +    .qdev.props = (Property[]) {
>> +        DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
>> +        DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
>> +                        0, false),
>> +        DEFINE_PROP_END_OF_LIST(),
>> +    }
>> +};
>> +
>> +static void xen_passthrough_register(void)
>> +{
>> +    pci_qdev_register(&xen_pci_passthrough);
>> +}
>> +
>> +device_init(xen_passthrough_register);
>> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
>> new file mode 100644
>> index 0000000..2d1979d
>> --- /dev/null
>> +++ b/hw/xen_pci_passthrough.h
>> @@ -0,0 +1,223 @@
>> +#ifndef QEMU_HW_XEN_PCI_PASSTHROUGH_H
>> +#  define QEMU_HW_XEN_PCI_PASSTHROUGH_H
>> +
>> +#include "qemu-common.h"
>> +#include "xen_common.h"
>> +#include "pci.h"
>> +#include "host-pci-device.h"
>> +
>> +#define PT_LOGGING_ENABLED
>> +#define PT_DEBUG_PCI_CONFIG_ACCESS
>> +
>> +#ifdef PT_LOGGING_ENABLED
>> +#  define PT_LOG(_f, _a...)   fprintf(stderr, "%s: " _f, __func__, ##_a)
>> +#else
>> +#  define PT_LOG(_f, _a...)
>> +#endif
>> +
>> +#ifdef PT_DEBUG_PCI_CONFIG_ACCESS
>> +#  define PT_LOG_CONFIG(_f, _a...) PT_LOG(_f, ##_a)
>> +#else
>> +#  define PT_LOG_CONFIG(_f, _a...)
>> +#endif
>> +
>> +
>> +typedef struct XenPTRegInfo XenPTRegInfo;
>> +typedef struct XenPTReg XenPTReg;
>> +
>> +typedef struct XenPCIPassthroughState XenPCIPassthroughState;
>> +
>> +/* function type for config reg */
>> +typedef uint32_t (*conf_reg_init)
>> +    (XenPCIPassthroughState *, XenPTRegInfo *, uint32_t real_offset);
>> +typedef int (*conf_dword_write)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint32_t *val, uint32_t dev_value, uint32_t valid_mask);
>> +typedef int (*conf_word_write)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint16_t *val, uint16_t dev_value, uint16_t valid_mask);
>> +typedef int (*conf_byte_write)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint8_t *val, uint8_t dev_value, uint8_t valid_mask);
>> +typedef int (*conf_dword_read)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint32_t *val, uint32_t valid_mask);
>> +typedef int (*conf_word_read)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint16_t *val, uint16_t valid_mask);
>> +typedef int (*conf_byte_read)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint8_t *val, uint8_t valid_mask);
>> +typedef int (*conf_dword_restore)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
>> +     uint32_t dev_value, uint32_t *val);
>> +typedef int (*conf_word_restore)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
>> +     uint16_t dev_value, uint16_t *val);
>> +typedef int (*conf_byte_restore)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
>> +     uint8_t dev_value, uint8_t *val);
>> +
>> +/* power state transition */
>> +#define PT_FLAG_TRANSITING 0x0001
>> +
>> +
>> +typedef enum {
>> +    GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
>> +    GRP_TYPE_EMU,                               /* emul reg group */
>> +} RegisterGroupType;
>> +
>> +typedef enum {
>> +    PT_BAR_FLAG_MEM = 0,                        /* Memory type BAR */
>> +    PT_BAR_FLAG_IO,                             /* I/O type BAR */
>> +    PT_BAR_FLAG_UPPER,                          /* upper 64bit BAR */
>> +    PT_BAR_FLAG_UNUSED,                         /* unused BAR */
>> +} PTBarFlag;
>> +
>> +
>> +typedef struct XenPTRegion {
>> +    /* Virtual phys base & size */
>> +    uint32_t e_physbase;
>> +    uint32_t e_size;
>> +    /* Index of region in qemu */
>> +    uint32_t memory_index;
>> +    /* BAR flag */
>> +    PTBarFlag bar_flag;
>> +    /* Translation of the emulated address */
>> +    union {
>> +        uint64_t maddr;
>> +        uint64_t pio_base;
>> +        uint64_t u;
>> +    } access;
>> +} XenPTRegion;
>> +
>> +/* XenPTRegInfo declaration
>> + * - only for emulated register (either a part or whole bit).
>> + * - for passthrough register that need special behavior (like interacting with
>> + *   other component), set emu_mask to all 0 and specify r/w func properly.
>> + * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
>> + */
>> +
>> +/* emulated register information */
>> +struct XenPTRegInfo {
>> +    uint32_t offset;
>> +    uint32_t size;
>> +    uint32_t init_val;
>> +    /* reg read only field mask (ON:RO/ROS, OFF:other) */
>> +    uint32_t ro_mask;
>> +    /* reg emulate field mask (ON:emu, OFF:passthrough) */
>> +    uint32_t emu_mask;
>> +    /* no write back allowed */
>> +    uint32_t no_wb;
>> +    conf_reg_init init;
>> +    /* read/write/restore function pointer
>> +     * for double_word/word/byte size */
>> +    union {
>> +        struct {
>> +            conf_dword_write write;
>> +            conf_dword_read read;
>> +            conf_dword_restore restore;
>> +        } dw;
>> +        struct {
>> +            conf_word_write write;
>> +            conf_word_read read;
>> +            conf_word_restore restore;
>> +        } w;
>> +        struct {
>> +            conf_byte_write write;
>> +            conf_byte_read read;
>> +            conf_byte_restore restore;
>> +        } b;
>> +    } u;
>> +};
>> +
>> +/* emulated register management */
>> +struct XenPTReg {
>> +    QLIST_ENTRY(XenPTReg) entries;
>> +    XenPTRegInfo *reg;
>> +    uint32_t data;
>> +};
>> +
>> +typedef struct XenPTRegGroupInfo XenPTRegGroupInfo;
>> +
>> +/* emul reg group size initialize method */
>> +typedef uint8_t (*pt_reg_size_init_fn)
>> +    (XenPCIPassthroughState *, const XenPTRegGroupInfo *,
>> +     uint32_t base_offset);
>> +
>> +/* emulated register group information */
>> +struct XenPTRegGroupInfo {
>> +    uint8_t grp_id;
>> +    RegisterGroupType grp_type;
>> +    uint8_t grp_size;
>> +    pt_reg_size_init_fn size_init;
>> +    XenPTRegInfo *emu_reg_tbl;
>> +};
>> +
>> +/* emul register group management table */
>> +typedef struct XenPTRegGroup {
>> +    QLIST_ENTRY(XenPTRegGroup) entries;
>> +    const XenPTRegGroupInfo *reg_grp;
>> +    uint32_t base_offset;
>> +    uint8_t size;
>> +    QLIST_HEAD(, XenPTReg) reg_tbl_list;
>> +} XenPTRegGroup;
>> +
>> +
>> +typedef struct XenPTPM {
>> +    QEMUTimer *pm_timer;  /* QEMUTimer struct */
>> +    int no_soft_reset;    /* No Soft Reset flags */
>> +    uint16_t flags;       /* power state transition flags */
>> +    uint16_t pmc_field;   /* Power Management Capabilities field */
>> +    int pm_delay;         /* power state transition delay */
>> +    uint16_t cur_state;   /* current power state */
>> +    uint16_t req_state;   /* requested power state */
>> +    uint32_t pm_base;     /* Power Management Capability reg base offset */
>> +    uint32_t aer_base;    /* AER Capability reg base offset */
>> +} XenPTPM;
>> +
>> +struct XenPCIPassthroughState {
>> +    PCIDevice dev;
>> +
>> +    char *hostaddr;
>> +    bool is_virtfn;
>> +    HostPCIDevice *real_device;
>> +    XenPTRegion bases[PCI_NUM_REGIONS]; /* Access regions */
>> +    QLIST_HEAD(, XenPTRegGroup) reg_grp_tbl;
>> +
>> +    uint32_t machine_irq;
>> +
>> +    uint32_t power_mgmt;
>> +    XenPTPM *pm_state;
>> +
>> +    MemoryRegion bar[PCI_NUM_REGIONS - 1];
>> +    MemoryRegion rom;
>> +};
>> +
>> +void pt_config_init(XenPCIPassthroughState *s);
>> +void pt_config_delete(XenPCIPassthroughState *s);
>> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable);
>> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
>> +                        int io_enable, int mem_enable);
>> +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address);
>> +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address);
>> +int pt_bar_offset_to_index(uint32_t offset);
>> +
>> +static inline pcibus_t pt_get_emul_size(PTBarFlag flag, pcibus_t r_size)
>> +{
>> +    /* align resource size (memory type only) */
>> +    if (flag == PT_BAR_FLAG_MEM) {
>> +        return (r_size + XC_PAGE_SIZE - 1) & XC_PAGE_MASK;
>> +    } else {
>> +        return r_size;
>> +    }
>> +}
>> +
>> +/* INTx */
>> +static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
>> +{
>> +    return host_pci_get_byte(s->real_device, PCI_INTERRUPT_PIN);
>> +}
>> +uint8_t pci_intx(XenPCIPassthroughState *ptdev);
>> +
>> +#endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
>> diff --git a/hw/xen_pci_passthrough_helpers.c b/hw/xen_pci_passthrough_helpers.c
>> new file mode 100644
>> index 0000000..192e918
>> --- /dev/null
>> +++ b/hw/xen_pci_passthrough_helpers.c
>> @@ -0,0 +1,46 @@
>> +#include "xen_pci_passthrough.h"
>> +
>> +/* The PCI Local Bus Specification, Rev. 3.0, {
>> + * Section 6.2.4 Miscellaneous Registers, pp 223
>> + * outlines 5 valid values for the interrupt pin (intx).
>> + *  0: For devices (or device functions) that don't use an interrupt pin
>> + *  1: INTA#
>> + *  2: INTB#
>> + *  3: INTC#
>> + *  4: INTD#
>> + *
>> + * Xen uses the following 4 values for intx
>> + *  0: INTA#
>> + *  1: INTB#
>> + *  2: INTC#
>> + *  3: INTD#
>> + *
>> + * Observing that these list of values are not the same, pci_read_intx()
>> + * uses the following mapping from hw to xen values.
>> + * This seems to reflect the current usage within Xen.
>> + *
>> + * PCI hardware    | Xen | Notes
>> + * ----------------+-----+----------------------------------------------------
>> + * 0               | 0   | No interrupt
>> + * 1               | 0   | INTA#
>> + * 2               | 1   | INTB#
>> + * 3               | 2   | INTC#
>> + * 4               | 3   | INTD#
>> + * any other value | 0   | This should never happen, log error message
>> +}
>> + */
>> +uint8_t pci_intx(XenPCIPassthroughState *ptdev)
>> +{
>> +    uint8_t r_val = pci_read_intx(ptdev);
>> +
>> +    PT_LOG("intx=%i\n", r_val);
>> +    if (r_val < 1 || r_val > 4) {
>> +        PT_LOG("Interrupt pin read from hardware is out of range: "
>> +               "value=%i, acceptable range is 1 - 4\n", r_val);
>> +        r_val = 0;
>> +    } else {
>> +        r_val -= 1;
>> +    }
>> +
>> +    return r_val;
>> +}
>
> if xen_pci_passthrough_helpers.c is only going to contain this function
> you might as well declared it static inline and move it to
> xen_pci_passthrough.h

Ok, I will.
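
A minimal sketch of what the inlined helper could look like, with the pin value passed in so it can be shown without the rest of the device state. The function name is hypothetical, and the logging from the quoted version is left out.

```c
#include <stdint.h>

/* Hypothetical name. Maps the PCI Interrupt Pin register value (1..4 for
 * INTA#..INTD#) to the 0..3 numbering used on the Xen side, following the
 * table in the quoted comment. 0 (no interrupt) and any out-of-range value
 * both map to 0. */
static inline uint8_t sketch_pin_to_xen_intx(uint8_t pin)
{
    if (pin < 1 || pin > 4) {
        return 0;
    }
    return (uint8_t)(pin - 1);
}
```

Note that "no interrupt" and INTA# both end up as 0, so a caller still has to check the pin register itself to tell the two apart, as pt_initfn does before binding.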

-- 
Anthony PERARD

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
@ 2011-11-09 17:03       ` Anthony PERARD
  0 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-09 17:03 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel

On Tue, Nov 8, 2011 at 12:56, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Fri, 28 Oct 2011, Anthony PERARD wrote:
>> From: Allen Kay <allen.m.kay@intel.com>
>>
>> Signed-off-by: Allen Kay <allen.m.kay@intel.com>
>> Signed-off-by: Guy Zana <guy@neocleus.com>
>> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
>> ---
>>  Makefile.target                  |    2 +
>>  hw/xen_pci_passthrough.c         |  838 ++++++++++++++++++++++++++++++++++++++
>>  hw/xen_pci_passthrough.h         |  223 ++++++++++
>>  hw/xen_pci_passthrough_helpers.c |   46 ++
>>  4 files changed, 1109 insertions(+), 0 deletions(-)
>>  create mode 100644 hw/xen_pci_passthrough.c
>>  create mode 100644 hw/xen_pci_passthrough.h
>>  create mode 100644 hw/xen_pci_passthrough_helpers.c
>>
>> diff --git a/Makefile.target b/Makefile.target
>> index 243f9f2..36ea47d 100644
>> --- a/Makefile.target
>> +++ b/Makefile.target
>> @@ -217,6 +217,8 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
>>
>>  # Xen PCI Passthrough
>>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
>> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
>> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
>>
>>  # Inter-VM PCI shared memory
>>  CONFIG_IVSHMEM =
>> diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
>> new file mode 100644
>> index 0000000..b97c5b6
>> --- /dev/null
>> +++ b/hw/xen_pci_passthrough.c
>> @@ -0,0 +1,838 @@
>> +/*
>> + * Copyright (c) 2007, Neocleus Corporation.
>> + * Copyright (c) 2007, Intel Corporation.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Alex Novik <alex@neocleus.com>
>> + * Allen Kay <allen.m.kay@intel.com>
>> + * Guy Zana <guy@neocleus.com>
>> + *
>> + * This file implements direct PCI assignment to a HVM guest
>> + */
>> +
>> +/*
>> + * Interrupt Disable policy:
>> + *
>> + * INTx interrupt:
>> + *   Initialize(register_real_device)
>> + *     Map INTx(xc_physdev_map_pirq):
>> + *       <fail>
>> + *         - Set real Interrupt Disable bit to '1'.
>> + *         - Set machine_irq and assigned_device->machine_irq to '0'.
>> + *         * Don't bind INTx.
>> + *
>> + *     Bind INTx(xc_domain_bind_pt_pci_irq):
>> + *       <fail>
>> + *         - Set real Interrupt Disable bit to '1'.
>> + *         - Unmap INTx.
>> + *         - Decrement mapped_machine_irq[machine_irq]
>> + *         - Set assigned_device->machine_irq to '0'.
>> + *
>> + *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
>> + *     Write '0'
>> + *       <ptdev->msi_trans_en is false>
>> + *         - Set real bit to '0' if assigned_device->machine_irq isn't '0'.
>> + *
>> + *     Write '1'
>> + *       <ptdev->msi_trans_en is false>
>> + *         - Set real bit to '1'.
>> + *
>> + * MSI-INTx translation.
>> + *   Initialize(xc_physdev_map_pirq_msi/pt_msi_setup)
>> + *     Bind MSI-INTx(xc_domain_bind_pt_irq)
>> + *       <fail>
>> + *         - Unmap MSI.
>> + *           <success>
>> + *             - Set dev->msi->pirq to '-1'.
>> + *           <fail>
>> + *             - Do nothing.
>> + *
>> + *   Write to Interrupt Disable bit by guest software(pt_cmd_reg_write)
>> + *     Write '0'
>> + *       <ptdev->msi_trans_en is true>
>> + *         - Set MSI Enable bit to '1'.
>> + *
>> + *     Write '1'
>> + *       <ptdev->msi_trans_en is true>
>> + *         - Set MSI Enable bit to '0'.
>> + *
>> + * MSI interrupt:
>> + *   Initialize MSI register(pt_msi_setup, pt_msi_update)
>> + *     Bind MSI(xc_domain_update_msi_irq)
>> + *       <fail>
>> + *         - Unmap MSI.
>> + *         - Set dev->msi->pirq to '-1'.
>> + *
>> + * MSI-X interrupt:
>> + *   Initialize MSI-X register(pt_msix_update_one)
>> + *     Bind MSI-X(xc_domain_update_msi_irq)
>> + *       <fail>
>> + *         - Unmap MSI-X.
>> + *         - Set entry->pirq to '-1'.
>> + */
>> +
>
> you should move all the MSI related comments to the MSI patch

OK, I will move MSI comments.
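
For reference, the INTx part of that comment, which stays in this patch, can be sketched as a small state function. This is only an illustration of the policy as described in the comment. The struct, its fields and the function name are hypothetical, and the two booleans stand in for the results of xc_physdev_map_pirq and xc_domain_bind_pt_pci_irq.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the state touched by the INTx policy. */
struct sketch_intx_state {
    bool intx_disabled;    /* real Interrupt Disable bit */
    uint32_t machine_irq;  /* 0 means no IRQ is bound */
    int map_refcount;      /* stands in for mapped_machine_irq[irq] */
};

static struct sketch_intx_state sketch_intx_init(bool map_ok, bool bind_ok,
                                                 uint32_t pirq)
{
    struct sketch_intx_state st = { false, 0, 0 };

    if (!map_ok) {
        /* Map failed: set Interrupt Disable, keep machine_irq at 0 and
         * do not try to bind. */
        st.intx_disabled = true;
        return st;
    }

    st.machine_irq = pirq;
    st.map_refcount = 1;

    if (!bind_ok) {
        /* Bind failed: set Interrupt Disable, drop the reference (the
         * real code unmaps the pirq once the count reaches 0) and clear
         * machine_irq. */
        st.intx_disabled = true;
        st.map_refcount--;
        st.machine_irq = 0;
    }

    return st;
}
```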

>> +#include <sys/ioctl.h>
>> +
>> +#include "pci.h"
>> +#include "xen.h"
>> +#include "xen_backend.h"
>> +#include "xen_pci_passthrough.h"
>> +
>> +#define PCI_BAR_ENTRIES (6)
>> +
>> +#define PT_NR_IRQS          (256)
>> +char mapped_machine_irq[PT_NR_IRQS] = {0};
>> +
>> +/* Config Space */
>> +static int pt_pci_config_access_check(PCIDevice *d, uint32_t address, int len)
>> +{
>> +    /* check offset range */
>> +    if (address >= 0xFF) {
>> +        PT_LOG("Error: Failed to access register with offset exceeding FFh. "
>> +               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +               address, len);
>> +        return -1;
>> +    }
>> +
>> +    /* check read size */
>> +    if ((len != 1) && (len != 2) && (len != 4)) {
>> +        PT_LOG("Error: Failed to access register with invalid access length. "
>> +               "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +               address, len);
>> +        return -1;
>> +    }
>> +
>> +    /* check offset alignment */
>> +    if (address & (len - 1)) {
>> +        PT_LOG("Error: Failed to access register with invalid access size "
>> +            "alignment. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +            pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +            address, len);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int pt_bar_offset_to_index(uint32_t offset)
>> +{
>> +    int index = 0;
>> +
>> +    /* check Exp ROM BAR */
>> +    if (offset == PCI_ROM_ADDRESS) {
>> +        return PCI_ROM_SLOT;
>> +    }
>> +
>> +    /* calculate BAR index */
>> +    index = (offset - PCI_BASE_ADDRESS_0) >> 2;
>> +    if (index >= PCI_NUM_REGIONS) {
>> +        return -1;
>> +    }
>> +
>> +    return index;
>> +}
>> +
>> +static uint32_t pt_pci_read_config(PCIDevice *d, uint32_t address, int len)
>> +{
>> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
>> +    uint32_t val = 0;
>> +    XenPTRegGroup *reg_grp_entry = NULL;
>> +    XenPTReg *reg_entry = NULL;
>> +    int rc = 0;
>> +    int emul_len = 0;
>> +    uint32_t find_addr = address;
>> +
>> +    if (pt_pci_config_access_check(d, address, len)) {
>> +        goto exit;
>> +    }
>> +
>> +    /* check power state transition flags */
>> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
>> +        /* can't accept until previous power state transition is completed.
>> +         * so finish the previous request here.
>> +         */
>> +        PT_LOG("Warning: guest wants to read during power state transition\n");
>> +        goto exit;
>> +    }
>> +
>> +    /* find register group entry */
>> +    reg_grp_entry = pt_find_reg_grp(s, address);
>> +    if (reg_grp_entry) {
>> +        /* check 0 Hardwired register group */
>> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
>> +            /* no need to emulate, just return 0 */
>> +            val = 0;
>> +            goto exit;
>> +        }
>> +    }
>> +
>> +    /* read I/O device register value */
>> +    rc = host_pci_get_block(s->real_device, address, (uint8_t *)&val, len);
>> +    if (!rc) {
>> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
>> +        memset(&val, 0xff, len);
>> +    }
>> +
>> +    /* just return the I/O device register value for
>> +     * passthrough type register group */
>> +    if (reg_grp_entry == NULL) {
>> +        goto exit;
>> +    }
>> +
>> +    /* adjust the read value to appropriate CFC-CFF window */
>> +    val <<= (address & 3) << 3;
>> +    emul_len = len;
>> +
>> +    /* loop Guest request size */
>> +    while (emul_len > 0) {
>> +        /* find register entry to be emulated */
>> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
>> +        if (reg_entry) {
>> +            XenPTRegInfo *reg = reg_entry->reg;
>> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
>> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
>> +            uint8_t *ptr_val = NULL;
>> +
>> +            valid_mask <<= (find_addr - real_offset) << 3;
>> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
>> +
>> +            /* do emulation depending on register size */
>> +            switch (reg->size) {
>> +            case 1:
>> +                if (reg->u.b.read) {
>> +                    rc = reg->u.b.read(s, reg_entry, ptr_val, valid_mask);
>> +                }
>> +                break;
>> +            case 2:
>> +                if (reg->u.w.read) {
>> +                    rc = reg->u.w.read(s, reg_entry,
>> +                                       (uint16_t *)ptr_val, valid_mask);
>> +                }
>> +                break;
>> +            case 4:
>> +                if (reg->u.dw.read) {
>> +                    rc = reg->u.dw.read(s, reg_entry,
>> +                                        (uint32_t *)ptr_val, valid_mask);
>> +                }
>> +                break;
>> +            }
>> +
>> +            if (rc < 0) {
>> +                hw_error("Internal error: Invalid read emulation "
>> +                         "return value[%d]. I/O emulator exit.\n", rc);
>> +            }
>> +
>> +            /* calculate next address to find */
>> +            emul_len -= reg->size;
>> +            if (emul_len > 0) {
>> +                find_addr = real_offset + reg->size;
>> +            }
>> +        } else {
>> +            /* nothing to do with passthrough type register,
>> +             * continue to find next byte */
>> +            emul_len--;
>> +            find_addr++;
>> +        }
>> +    }
>> +
>> +    /* need to shift back before returning them to pci bus emulator */
>> +    val >>= ((address & 3) << 3);
>> +
>> +exit:
>> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
>> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +                  address, val, len);
>> +    return val;
>> +}
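
The valid_mask arithmetic in the emulation loop is easy to get wrong, so here is a minimal standalone sketch of the two shift steps. ex_valid_mask is a hypothetical helper, not something in the patch; it assumes the remaining length is 1..4 bytes and that find_addr lies inside the register starting at real_offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte-enable mask for the part of a register that the
 * guest access still covers, expressed relative to the register start. */
static uint32_t ex_valid_mask(int emul_len, uint32_t find_addr,
                              uint32_t real_offset)
{
    uint32_t mask;

    assert(emul_len >= 1 && emul_len <= 4);
    assert(find_addr >= real_offset && find_addr - real_offset < 4);

    /* keep one 0xFF byte per remaining byte of the guest access */
    mask = 0xFFFFFFFFu >> ((4 - emul_len) << 3);
    /* move the mask up to the byte lane where the access starts */
    return mask << ((find_addr - real_offset) << 3);
}
```

For example, a 2-byte access starting one byte into a dword register gives 0x00FFFF00, and a 1-byte access two bytes in gives 0x00FF0000.
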
>> +
>> +static void pt_pci_write_config(PCIDevice *d, uint32_t address,
>> +                                uint32_t val, int len)
>> +{
>> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
>> +    int index = 0;
>> +    XenPTRegGroup *reg_grp_entry = NULL;
>> +    int rc = 0;
>> +    uint32_t read_val = 0;
>> +    int emul_len = 0;
>> +    XenPTReg *reg_entry = NULL;
>> +    uint32_t find_addr = address;
>> +    XenPTRegInfo *reg = NULL;
>> +
>> +    if (pt_pci_config_access_check(d, address, len)) {
>> +        return;
>> +    }
>> +
>> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
>> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +                  address, val, len);
>> +
>> +    /* check unused BAR register */
>> +    index = pt_bar_offset_to_index(address);
>> +    if ((index >= 0) && (val > 0 && val < PT_BAR_ALLF) &&
>> +        (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED)) {
>> +        PT_LOG("Warning: Guest attempt to set address to unused Base Address "
>> +               "Register. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +               address, len);
>> +    }
>> +
>> +    /* check power state transition flags */
>> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
>> +        /* can't accept until previous power state transition is completed.
>> +         * so finish the previous request here.
>> +         */
>> +        PT_LOG("Warning: guest wants to write during power state transition\n");
>> +        return;
>> +    }
>> +
>> +    /* find register group entry */
>> +    reg_grp_entry = pt_find_reg_grp(s, address);
>> +    if (reg_grp_entry) {
>> +        /* check 0 Hardwired register group */
>> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
>> +            /* ignore silently */
>> +            PT_LOG("Warning: Access to 0 Hardwired register. "
>> +                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
>> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
>> +                   address, len);
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* read I/O device register value */
>> +    rc = host_pci_get_block(s->real_device, address,
>> +                             (uint8_t *)&read_val, len);
>> +    if (!rc) {
>> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
>> +        memset(&read_val, 0xff, len);
>> +    }
>> +
>> +    /* pass directly to libpci for passthrough type register group */
>> +    if (reg_grp_entry == NULL) {
>> +        goto out;
>> +    }
>> +
>> +    /* adjust the read and write value to appropriate CFC-CFF window */
>> +    read_val <<= (address & 3) << 3;
>> +    val <<= (address & 3) << 3;
>> +    emul_len = len;
>> +
>> +    /* loop Guest request size */
>> +    while (emul_len > 0) {
>> +        /* find register entry to be emulated */
>> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
>> +        if (reg_entry) {
>> +            reg = reg_entry->reg;
>> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
>> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
>> +            uint8_t *ptr_val = NULL;
>> +
>> +            valid_mask <<= (find_addr - real_offset) << 3;
>> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
>> +
>> +            /* do emulation depending on register size */
>> +            switch (reg->size) {
>> +            case 1:
>> +                if (reg->u.b.write) {
>> +                    rc = reg->u.b.write(s, reg_entry, ptr_val,
>> +                                        read_val >> ((real_offset & 3) << 3),
>> +                                        valid_mask);
>> +                }
>> +                break;
>> +            case 2:
>> +                if (reg->u.w.write) {
>> +                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
>> +                                        (read_val >> ((real_offset & 3) << 3)),
>> +                                        valid_mask);
>> +                }
>> +                break;
>> +            case 4:
>> +                if (reg->u.dw.write) {
>> +                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
>> +                                         (read_val >> ((real_offset & 3) << 3)),
>> +                                         valid_mask);
>> +                }
>> +                break;
>> +            }
>> +
>> +            if (rc < 0) {
>> +                hw_error("Internal error: Invalid write emulation "
>> +                         "return value[%d]. I/O emulator exit.\n", rc);
>> +            }
>> +
>> +            /* calculate next address to find */
>> +            emul_len -= reg->size;
>> +            if (emul_len > 0) {
>> +                find_addr = real_offset + reg->size;
>> +            }
>> +        } else {
>> +            /* nothing to do with passthrough type register,
>> +             * continue to find next byte */
>> +            emul_len--;
>> +            find_addr++;
>> +        }
>> +    }
>> +
>> +    /* need to shift back before passing them to libpci */
>> +    val >>= (address & 3) << 3;
>> +
>> +out:
>> +    if (!(reg && reg->no_wb)) {
>> +        /* unknown regs are passed through */
>> +        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
>> +
>> +        if (!rc) {
>> +            PT_LOG("Error: pci_write_block failed. return value[%d].\n", rc);
>> +        }
>> +    }
>> +
>> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
>> +        qemu_mod_timer(s->pm_state->pm_timer,
>> +                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
>> +    }
>> +}
>
> Where is this timer allocated and initialized?

In the next patch. I will move these lines to the related patch.

>> +/* ioport/iomem space*/
>> +static void pt_iomem_map(XenPCIPassthroughState *s, int i,
>> +                         pcibus_t e_phys, pcibus_t e_size, int type)
>> +{
>> +    uint32_t old_ebase = s->bases[i].e_physbase;
>> +    bool first_map = s->bases[i].e_size == 0;
>> +    int ret = 0;
>> +
>> +    s->bases[i].e_physbase = e_phys;
>> +    s->bases[i].e_size = e_size;
>> +
>> +    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%%d"
>> +           " len=%#"PRIx64" index=%d first_map=%d\n",
>> +           e_phys, s->bases[i].access.maddr, /*type,*/
>> +           e_size, i, first_map);
>> +
>> +    if (e_size == 0) {
>> +        return;
>> +    }
>> +
>> +    if (!first_map && old_ebase != -1) {
>> +        /* Remove old mapping */
>> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
>> +                               old_ebase >> XC_PAGE_SHIFT,
>> +                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
>> +                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
>> +                               DPCI_REMOVE_MAPPING);
>> +        if (ret != 0) {
>> +            PT_LOG("Error: remove old mapping failed!\n");
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* map only valid guest address */
>> +    if (e_phys != -1) {
>> +        /* Create new mapping */
>> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
>> +                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
>> +                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
>> +                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
>> +                                   DPCI_ADD_MAPPING);
>> +
>> +        if (ret != 0) {
>> +            PT_LOG("Error: create new mapping failed!\n");
>> +        }
>> +    }
>> +}
>> +
>> +static void pt_ioport_map(XenPCIPassthroughState *s, int i,
>> +                          pcibus_t e_phys, pcibus_t e_size, int type)
>> +{
>> +    uint32_t old_ebase = s->bases[i].e_physbase;
>> +    bool first_map = s->bases[i].e_size == 0;
>> +    int ret = 0;
>> +
>> +    s->bases[i].e_physbase = e_phys;
>> +    s->bases[i].e_size = e_size;
>> +
>> +    PT_LOG("e_phys=%#04"PRIx64" pio_base=%#04"PRIx64" len=%"PRId64" index=%d"
>> +           " first_map=%d\n",
>> +           e_phys, s->bases[i].access.pio_base, e_size, i, first_map);
>> +
>> +    if (e_size == 0) {
>> +        return;
>> +    }
>> +
>> +    if (!first_map && old_ebase != -1) {
>> +        /* Remove old mapping */
>> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, old_ebase,
>> +                                       s->bases[i].access.pio_base, e_size,
>> +                                       DPCI_REMOVE_MAPPING);
>> +        if (ret != 0) {
>> +            PT_LOG("Error: remove old mapping failed!\n");
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* map only valid guest address (include 0) */
>> +    if (e_phys != -1) {
>> +        /* Create new mapping */
>> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, e_phys,
>> +                                       s->bases[i].access.pio_base, e_size,
>> +                                       DPCI_ADD_MAPPING);
>> +        if (ret != 0) {
>> +            PT_LOG("Error: create new mapping failed!\n");
>> +        }
>> +    }
>> +
>> +}
>> +
>> +
>> +/* mapping BAR */
>> +
>> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
>> +                        int io_enable, int mem_enable)
>> +{
>> +    PCIDevice *dev = &s->dev;
>> +    PCIIORegion *r;
>> +    XenPTRegGroup *reg_grp_entry = NULL;
>> +    XenPTReg *reg_entry = NULL;
>> +    XenPTRegion *base = NULL;
>> +    pcibus_t r_size = 0, r_addr = -1;
>> +    int rc = 0;
>> +
>> +    r = &dev->io_regions[bar];
>> +
>> +    /* check valid region */
>> +    if (!r->size) {
>> +        return;
>> +    }
>> +
>> +    base = &s->bases[bar];
>> +    /* skip unused BAR or upper 64bit BAR */
>> +    if ((base->bar_flag == PT_BAR_FLAG_UNUSED)
>> +        || (base->bar_flag == PT_BAR_FLAG_UPPER)) {
>> +        return;
>> +    }
>> +
>> +    /* copy region address to temporary */
>> +    r_addr = r->addr;
>> +
>> +    /* need unmapping in case I/O Space or Memory Space disable */
>> +    if (((base->bar_flag == PT_BAR_FLAG_IO) && !io_enable) ||
>> +        ((base->bar_flag == PT_BAR_FLAG_MEM) && !mem_enable)) {
>> +        r_addr = -1;
>> +    }
>> +    if ((bar == PCI_ROM_SLOT) && (r_addr != -1)) {
>> +        reg_grp_entry = pt_find_reg_grp(s, PCI_ROM_ADDRESS);
>> +        if (reg_grp_entry) {
>> +            reg_entry = pt_find_reg(reg_grp_entry, PCI_ROM_ADDRESS);
>> +            if (reg_entry && !(reg_entry->data & PCI_ROM_ADDRESS_ENABLE)) {
>> +                r_addr = -1;
>> +            }
>> +        }
>> +    }
>> +
>> +    /* prevent guest software mapping memory resource to 00000000h */
>> +    if ((base->bar_flag == PT_BAR_FLAG_MEM) && (r_addr == 0)) {
>> +        r_addr = -1;
>> +    }
>> +
>> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
>> +
>> +    rc = pci_check_bar_overlap(dev, r_addr, r_size, r->type);
>> +    if (rc > 0) {
>> +        PT_LOG("Warning: s[%02x:%02x.%x][Region:%d][Address:%"FMT_PCIBUS"h]"
>> +               "[Size:%"FMT_PCIBUS"h] is overlapped.\n", pci_bus_num(dev->bus),
>> +               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), bar,
>> +               r_addr, r_size);
>> +    }
>> +
>> +    /* check whether we need to update the mapping or not */
>> +    if (r_addr != s->bases[bar].e_physbase) {
>> +        /* mapping BAR */
>> +        if (base->bar_flag == PT_BAR_FLAG_IO) {
>> +            pt_ioport_map(s, bar, r_addr, r_size, r->type);
>> +        } else {
>> +            pt_iomem_map(s, bar, r_addr, r_size, r->type);
>> +        }
>> +    }
>> +}
>> +
>> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
>> +        pt_bar_mapping_one(s, i, io_enable, mem_enable);
>> +    }
>> +}
>> +
>> +/* register regions */
>> +static int pt_register_regions(XenPCIPassthroughState *s)
>> +{
>> +    int i = 0;
>> +    uint32_t bar_data = 0;
>> +    HostPCIDevice *d = s->real_device;
>> +
>> +    /* Register PIO/MMIO BARs */
>> +    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
>> +        HostPCIIORegion *r = &d->io_regions[i];
>> +
>> +        if (r->base_addr) {
>> +            s->bases[i].e_physbase = r->base_addr;
>> +            s->bases[i].access.u = r->base_addr;
>> +
>> +            /* Register current region */
>> +            if (r->flags & IORESOURCE_IO) {
>> +                memory_region_init_io(&s->bar[i], NULL, NULL,
>> +                                      "xen-pci-pt-bar", r->size);
>> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
>> +                                 &s->bar[i]);
>> +            } else if (r->flags & IORESOURCE_PREFETCH) {
>> +                memory_region_init_io(&s->bar[i], NULL, NULL,
>> +                                      "xen-pci-pt-bar", r->size);
>> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
>> +                                 &s->bar[i]);
>> +            } else {
>> +                memory_region_init_io(&s->bar[i], NULL, NULL,
>> +                                      "xen-pci-pt-bar", r->size);
>> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
>> +                                 &s->bar[i]);
>> +            }
>> +
>> +            PT_LOG("IO region registered (size=0x%08"PRIx64
>> +                   " base_addr=0x%08"PRIx64")\n",
>> +                   r->size, r->base_addr);
>> +        }
>> +    }
>> +
>> +    /* Register expansion ROM address */
>> +    if (d->rom.base_addr && d->rom.size) {
>> +        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
>> +        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
>> +        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
>> +            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
>> +            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
>> +        }
>> +
>> +        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
>> +        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
>> +
>> +        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
>> +                                      "xen-pci-pt-rom", d->rom.size);
>> +        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
>> +                         &s->rom);
>> +
>> +        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
>> +               " base_addr=0x%08"PRIx64")\n",
>> +               d->rom.size, d->rom.base_addr);
>> +    }
>> +
>> +    return 0;
>> +}
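
The three branches in the BAR loop differ only in the type passed to pci_register_bar(). A minimal sketch of that selection as a pure function, assuming the usual Linux resource flag values and PCI BAR type bits; the EX_ constants and ex_bar_type_from_flags are hypothetical names, not part of the patch:

```c
#include <stdint.h>

/* Local copies of well-known values (Linux resource flags and PCI BAR type
 * bits); the EX_ names are illustrative only. */
#define EX_IORESOURCE_IO        0x00000100u
#define EX_IORESOURCE_PREFETCH  0x00002000u
#define EX_BAR_SPACE_MEMORY     0x00
#define EX_BAR_SPACE_IO         0x01
#define EX_BAR_MEM_PREFETCH     0x08

/* Pick the BAR type for a host resource: I/O wins over prefetchable
 * memory, and plain memory is the fallback, as in the loop above. */
static uint8_t ex_bar_type_from_flags(uint64_t flags)
{
    if (flags & EX_IORESOURCE_IO) {
        return EX_BAR_SPACE_IO;
    }
    if (flags & EX_IORESOURCE_PREFETCH) {
        return EX_BAR_MEM_PREFETCH;
    }
    return EX_BAR_SPACE_MEMORY;
}
```

With such a helper the loop body could call memory_region_init_io() and pci_register_bar() once instead of three times, though that is only a suggestion.
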
>> +
>> +static void pt_unregister_regions(XenPCIPassthroughState *s)
>> +{
>> +    int i, type, rc;
>> +    uint32_t e_size;
>> +    PCIDevice *d = &s->dev;
>> +
>> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
>> +        e_size = s->bases[i].e_size;
>> +        if ((e_size == 0) || (s->bases[i].e_physbase == -1)) {
>> +            continue;
>> +        }
>> +
>> +        type = d->io_regions[i].type;
>> +
>> +        if (type == PCI_BASE_ADDRESS_SPACE_MEMORY
>> +            || type == PCI_BASE_ADDRESS_MEM_PREFETCH) {
>> +            rc = xc_domain_memory_mapping(xen_xc, xen_domid,
>> +                    s->bases[i].e_physbase >> XC_PAGE_SHIFT,
>> +                    s->bases[i].access.maddr >> XC_PAGE_SHIFT,
>> +                    (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
>> +                    DPCI_REMOVE_MAPPING);
>> +            if (rc != 0) {
>> +                PT_LOG("Error: remove old mem mapping failed!\n");
>> +                continue;
>> +            }
>> +
>> +        } else if (type == PCI_BASE_ADDRESS_SPACE_IO) {
>> +            rc = xc_domain_ioport_mapping(xen_xc, xen_domid,
>> +                        s->bases[i].e_physbase,
>> +                        s->bases[i].access.pio_base,
>> +                        e_size,
>> +                        DPCI_REMOVE_MAPPING);
>> +            if (rc != 0) {
>> +                PT_LOG("Error: remove old io mapping failed!\n");
>> +                continue;
>> +            }
>> +        }
>> +    }
>> +}
>> +
>> +static int pt_initfn(PCIDevice *pcidev)
>> +{
>> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
>> +    int dom, bus;
>> +    unsigned slot, func;
>> +    int rc = 0;
>> +    uint32_t machine_irq;
>> +    int pirq = -1;
>> +
>> +    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
>> +        fprintf(stderr, "error parsing BDF: %s\n", s->hostaddr);
>> +        return -1;
>> +    }
>> +
>> +    /* register real device */
>> +    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
>> +           bus, slot, func, s->dev.devfn);
>> +
>> +    s->real_device = host_pci_device_get(bus, slot, func);
>> +    if (!s->real_device) {
>> +        return -1;
>> +    }
>> +
>> +    s->is_virtfn = s->real_device->is_virtfn;
>> +    if (s->is_virtfn) {
>> +        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
>> +               s->real_device->domain, bus, slot, func);
>> +    }
>> +
>> +    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
>> +    if (host_pci_get_block(s->real_device, 0, pcidev->config,
>> +                           PCI_CONFIG_SPACE_SIZE) == -1) {
>> +        return -1;
>> +    }
>> +
>> +    /* Handle real device's MMIO/PIO BARs */
>> +    pt_register_regions(s);
>> +
>> +    /* reinitialize each config register to be emulated */
>> +    pt_config_init(s);
>
> this function is implemented in the next patch, so you might as well add
> this call there

Ok, I will move this.

>> +    /* Bind interrupt */
>> +    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
>> +        PT_LOG("no pin interrupt\n");
>> +        goto out;
>> +    }
>> +
>> +    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
>> +    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
>> +
>> +    if (rc) {
>> +        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
>> +
>> +        /* Disable PCI intx assertion (set bit 10 of the command register) */
>> +        host_pci_set_word(s->real_device,
>> +                          PCI_COMMAND,
>> +                          pci_get_word(s->dev.config + PCI_COMMAND)
>> +                          | PCI_COMMAND_INTX_DISABLE);
>> +        machine_irq = 0;
>> +        s->machine_irq = 0;
>> +    } else {
>> +        machine_irq = pirq;
>> +        s->machine_irq = pirq;
>> +        mapped_machine_irq[machine_irq]++;
>> +    }
>> +
>> +    /* bind machine_irq to device */
>> +    if (machine_irq != 0) {
>> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
>> +        uint8_t e_intx = pci_intx(s);
>> +
>> +        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
>> +                                       e_device, e_intx);
>> +        if (rc < 0) {
>> +            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);
>> +
>> +            /* Disable PCI intx assertion
>> +             * (set bit 10 of the command register) */
>> +            host_pci_set_word(s->real_device, PCI_COMMAND,
>> +                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
>> +                              | PCI_COMMAND_INTX_DISABLE);
>> +            mapped_machine_irq[machine_irq]--;
>> +
>> +            if (mapped_machine_irq[machine_irq] == 0) {
>> +                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
>> +                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
>> +                           rc);
>> +                }
>> +            }
>> +            s->machine_irq = 0;
>> +        }
>> +    }
>> +
>> +out:
>> +    PT_LOG("Real physical device %02x:%02x.%x registered successfully!\n"
>> +           "IRQ type = %s\n", bus, slot, func, "INTx");
>> +
>> +    return 0;
>> +}
>> +
>> +static int pt_unregister_device(PCIDevice *pcidev)
>> +{
>> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
>> +    uint8_t e_device, e_intx;
>> +    uint32_t machine_irq;
>> +    int rc;
>> +
>> +    /* Unbind interrupt */
>> +    e_device = PCI_SLOT(s->dev.devfn);
>> +    e_intx = pci_intx(s);
>> +    machine_irq = s->machine_irq;
>> +
>> +    if (machine_irq) {
>> +        rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
>> +                                     PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
>> +        if (rc < 0) {
>> +            PT_LOG("Error: Unbinding of interrupt failed! rc=%d\n", rc);
>> +        }
>> +    }
>> +
>> +    if (machine_irq) {
>> +        mapped_machine_irq[machine_irq]--;
>> +
>> +        if (mapped_machine_irq[machine_irq] == 0) {
>> +            rc = xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq);
>> +
>> +            if (rc < 0) {
>> +                PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n", rc);
>> +            }
>> +        }
>> +    }
>> +
>> +    /* delete all emulated config registers */
>> +    pt_config_delete(s);
>> +
>> +    /* unregister real device's MMIO/PIO BARs */
>> +    pt_unregister_regions(s);
>> +
>> +    host_pci_device_put(s->real_device);
>> +
>> +    return 0;
>> +}
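
The mapped_machine_irq[] table acts as a per-pirq reference count shared between passthrough devices: the pirq is unmapped only when the last user goes away. A minimal sketch of that bookkeeping with bounds and overflow checks added; ex_irq_get, ex_irq_put and EX_NR_IRQS are hypothetical names, and the range checks are my addition rather than something the patch does:

```c
#include <limits.h>
#include <stdint.h>

#define EX_NR_IRQS 256
static unsigned char ex_irq_refs[EX_NR_IRQS];

/* Take a reference on a pirq. Returns the new count, or -1 if the pirq is
 * out of range or the counter would overflow. */
static int ex_irq_get(uint32_t pirq)
{
    if (pirq >= EX_NR_IRQS || ex_irq_refs[pirq] == UCHAR_MAX) {
        return -1;
    }
    return ++ex_irq_refs[pirq];
}

/* Drop a reference. Returns 1 when the last user is gone (the caller should
 * then unmap the pirq), 0 if other users remain, -1 on error. */
static int ex_irq_put(uint32_t pirq)
{
    if (pirq >= EX_NR_IRQS || ex_irq_refs[pirq] == 0) {
        return -1;
    }
    return (--ex_irq_refs[pirq]) == 0;
}
```
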
>> +
>> +static PCIDeviceInfo xen_pci_passthrough = {
>> +    .init = pt_initfn,
>> +    .exit = pt_unregister_device,
>> +    .qdev.name = "xen-pci-passthrough",
>> +    .qdev.desc = "Assign a host PCI device with Xen",
>> +    .qdev.size = sizeof(XenPCIPassthroughState),
>> +    .config_read = pt_pci_read_config,
>> +    .config_write = pt_pci_write_config,
>> +    .is_express = 0,
>> +    .qdev.props = (Property[]) {
>> +        DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
>> +        DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
>> +                        0, false),
>> +        DEFINE_PROP_END_OF_LIST(),
>> +    }
>> +};
>> +
>> +static void xen_passthrough_register(void)
>> +{
>> +    pci_qdev_register(&xen_pci_passthrough);
>> +}
>> +
>> +device_init(xen_passthrough_register);
>> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
>> new file mode 100644
>> index 0000000..2d1979d
>> --- /dev/null
>> +++ b/hw/xen_pci_passthrough.h
>> @@ -0,0 +1,223 @@
>> +#ifndef QEMU_HW_XEN_PCI_PASSTHROUGH_H
>> +#  define QEMU_HW_XEN_PCI_PASSTHROUGH_H
>> +
>> +#include "qemu-common.h"
>> +#include "xen_common.h"
>> +#include "pci.h"
>> +#include "host-pci-device.h"
>> +
>> +#define PT_LOGGING_ENABLED
>> +#define PT_DEBUG_PCI_CONFIG_ACCESS
>> +
>> +#ifdef PT_LOGGING_ENABLED
>> +#  define PT_LOG(_f, _a...)   fprintf(stderr, "%s: " _f, __func__, ##_a)
>> +#else
>> +#  define PT_LOG(_f, _a...)
>> +#endif
>> +
>> +#ifdef PT_DEBUG_PCI_CONFIG_ACCESS
>> +#  define PT_LOG_CONFIG(_f, _a...) PT_LOG(_f, ##_a)
>> +#else
>> +#  define PT_LOG_CONFIG(_f, _a...)
>> +#endif
>> +
>> +
>> +typedef struct XenPTRegInfo XenPTRegInfo;
>> +typedef struct XenPTReg XenPTReg;
>> +
>> +typedef struct XenPCIPassthroughState XenPCIPassthroughState;
>> +
>> +/* function type for config reg */
>> +typedef uint32_t (*conf_reg_init)
>> +    (XenPCIPassthroughState *, XenPTRegInfo *, uint32_t real_offset);
>> +typedef int (*conf_dword_write)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint32_t *val, uint32_t dev_value, uint32_t valid_mask);
>> +typedef int (*conf_word_write)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint16_t *val, uint16_t dev_value, uint16_t valid_mask);
>> +typedef int (*conf_byte_write)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint8_t *val, uint8_t dev_value, uint8_t valid_mask);
>> +typedef int (*conf_dword_read)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint32_t *val, uint32_t valid_mask);
>> +typedef int (*conf_word_read)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint16_t *val, uint16_t valid_mask);
>> +typedef int (*conf_byte_read)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry,
>> +     uint8_t *val, uint8_t valid_mask);
>> +typedef int (*conf_dword_restore)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
>> +     uint32_t dev_value, uint32_t *val);
>> +typedef int (*conf_word_restore)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
>> +     uint16_t dev_value, uint16_t *val);
>> +typedef int (*conf_byte_restore)
>> +    (XenPCIPassthroughState *, XenPTReg *cfg_entry, uint32_t real_offset,
>> +     uint8_t dev_value, uint8_t *val);
>> +
>> +/* power state transition */
>> +#define PT_FLAG_TRANSITING 0x0001
>> +
>> +
>> +typedef enum {
>> +    GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
>> +    GRP_TYPE_EMU,                               /* emul reg group */
>> +} RegisterGroupType;
>> +
>> +typedef enum {
>> +    PT_BAR_FLAG_MEM = 0,                        /* Memory type BAR */
>> +    PT_BAR_FLAG_IO,                             /* I/O type BAR */
>> +    PT_BAR_FLAG_UPPER,                          /* upper 64bit BAR */
>> +    PT_BAR_FLAG_UNUSED,                         /* unused BAR */
>> +} PTBarFlag;
>> +
>> +
>> +typedef struct XenPTRegion {
>> +    /* Virtual phys base & size */
>> +    uint32_t e_physbase;
>> +    uint32_t e_size;
>> +    /* Index of region in qemu */
>> +    uint32_t memory_index;
>> +    /* BAR flag */
>> +    PTBarFlag bar_flag;
>> +    /* Translation of the emulated address */
>> +    union {
>> +        uint64_t maddr;
>> +        uint64_t pio_base;
>> +        uint64_t u;
>> +    } access;
>> +} XenPTRegion;
>> +
>> +/* XenPTRegInfo declaration
>> + * - only for emulated register (either a part or whole bit).
>> + * - for passthrough register that need special behavior (like interacting with
>> + *   other component), set emu_mask to all 0 and specify r/w func properly.
>> + * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
>> + */
>> +
>> +/* emulated register information */
>> +struct XenPTRegInfo {
>> +    uint32_t offset;
>> +    uint32_t size;
>> +    uint32_t init_val;
>> +    /* reg read only field mask (ON:RO/ROS, OFF:other) */
>> +    uint32_t ro_mask;
>> +    /* reg emulate field mask (ON:emu, OFF:passthrough) */
>> +    uint32_t emu_mask;
>> +    /* no write back allowed */
>> +    uint32_t no_wb;
>> +    conf_reg_init init;
>> +    /* read/write/restore function pointer
>> +     * for double_word/word/byte size */
>> +    union {
>> +        struct {
>> +            conf_dword_write write;
>> +            conf_dword_read read;
>> +            conf_dword_restore restore;
>> +        } dw;
>> +        struct {
>> +            conf_word_write write;
>> +            conf_word_read read;
>> +            conf_word_restore restore;
>> +        } w;
>> +        struct {
>> +            conf_byte_write write;
>> +            conf_byte_read read;
>> +            conf_byte_restore restore;
>> +        } b;
>> +    } u;
>> +};
>> +
>> +/* emulated register management */
>> +struct XenPTReg {
>> +    QLIST_ENTRY(XenPTReg) entries;
>> +    XenPTRegInfo *reg;
>> +    uint32_t data;
>> +};
>> +
>> +typedef struct XenPTRegGroupInfo XenPTRegGroupInfo;
>> +
>> +/* emul reg group size initialize method */
>> +typedef uint8_t (*pt_reg_size_init_fn)
>> +    (XenPCIPassthroughState *, const XenPTRegGroupInfo *,
>> +     uint32_t base_offset);
>> +
>> +/* emulated register group information */
>> +struct XenPTRegGroupInfo {
>> +    uint8_t grp_id;
>> +    RegisterGroupType grp_type;
>> +    uint8_t grp_size;
>> +    pt_reg_size_init_fn size_init;
>> +    XenPTRegInfo *emu_reg_tbl;
>> +};
>> +
>> +/* emul register group management table */
>> +typedef struct XenPTRegGroup {
>> +    QLIST_ENTRY(XenPTRegGroup) entries;
>> +    const XenPTRegGroupInfo *reg_grp;
>> +    uint32_t base_offset;
>> +    uint8_t size;
>> +    QLIST_HEAD(, XenPTReg) reg_tbl_list;
>> +} XenPTRegGroup;
>> +
>> +
>> +typedef struct XenPTPM {
>> +    QEMUTimer *pm_timer;  /* QEMUTimer struct */
>> +    int no_soft_reset;    /* No Soft Reset flags */
>> +    uint16_t flags;       /* power state transition flags */
>> +    uint16_t pmc_field;   /* Power Management Capabilities field */
>> +    int pm_delay;         /* power state transition delay */
>> +    uint16_t cur_state;   /* current power state */
>> +    uint16_t req_state;   /* requested power state */
>> +    uint32_t pm_base;     /* Power Management Capability reg base offset */
>> +    uint32_t aer_base;    /* AER Capability reg base offset */
>> +} XenPTPM;
>> +
>> +struct XenPCIPassthroughState {
>> +    PCIDevice dev;
>> +
>> +    char *hostaddr;
>> +    bool is_virtfn;
>> +    HostPCIDevice *real_device;
>> +    XenPTRegion bases[PCI_NUM_REGIONS]; /* Access regions */
>> +    QLIST_HEAD(, XenPTRegGroup) reg_grp_tbl;
>> +
>> +    uint32_t machine_irq;
>> +
>> +    uint32_t power_mgmt;
>> +    XenPTPM *pm_state;
>> +
>> +    MemoryRegion bar[PCI_NUM_REGIONS - 1];
>> +    MemoryRegion rom;
>> +};
>> +
>> +void pt_config_init(XenPCIPassthroughState *s);
>> +void pt_config_delete(XenPCIPassthroughState *s);
>> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable);
>> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
>> +                        int io_enable, int mem_enable);
>> +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address);
>> +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address);
>> +int pt_bar_offset_to_index(uint32_t offset);
>> +
>> +static inline pcibus_t pt_get_emul_size(PTBarFlag flag, pcibus_t r_size)
>> +{
>> +    /* align resource size (memory type only) */
>> +    if (flag == PT_BAR_FLAG_MEM) {
>> +        return (r_size + XC_PAGE_SIZE - 1) & XC_PAGE_MASK;
>> +    } else {
>> +        return r_size;
>> +    }
>> +}
>> +
>> +/* INTx */
>> +static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
>> +{
>> +    return host_pci_get_byte(s->real_device, PCI_INTERRUPT_PIN);
>> +}
>> +uint8_t pci_intx(XenPCIPassthroughState *ptdev);
>> +
>> +#endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
>> diff --git a/hw/xen_pci_passthrough_helpers.c b/hw/xen_pci_passthrough_helpers.c
>> new file mode 100644
>> index 0000000..192e918
>> --- /dev/null
>> +++ b/hw/xen_pci_passthrough_helpers.c
>> @@ -0,0 +1,46 @@
>> +#include "xen_pci_passthrough.h"
>> +
>> +/* The PCI Local Bus Specification, Rev. 3.0, {
>> + * Section 6.2.4 Miscellaneous Registers, pp 223
>> + * outlines 5 valid values for the intertupt pin (intx).
>> + *  0: For devices (or device functions) that don't use an interrupt in
>> + *  1: INTA#
>> + *  2: INTB#
>> + *  3: INTC#
>> + *  4: INTD#
>> + *
>> + * Xen uses the following 4 values for intx
>> + *  0: INTA#
>> + *  1: INTB#
>> + *  2: INTC#
>> + *  3: INTD#
>> + *
>> + * Observing that these list of values are not the same, pci_read_intx()
>> + * uses the following mapping from hw to xen values.
>> + * This seems to reflect the current usage within Xen.
>> + *
>> + * PCI hardware    | Xen | Notes
>> + * ----------------+-----+----------------------------------------------------
>> + * 0               | 0   | No interrupt
>> + * 1               | 0   | INTA#
>> + * 2               | 1   | INTB#
>> + * 3               | 2   | INTC#
>> + * 4               | 3   | INTD#
>> + * any other value | 0   | This should never happen, log error message
>> +}
>> + */
>> +uint8_t pci_intx(XenPCIPassthroughState *ptdev)
>> +{
>> +    uint8_t r_val = pci_read_intx(ptdev);
>> +
>> +    PT_LOG("intx=%i\n", r_val);
>> +    if (r_val < 1 || r_val > 4) {
>> +        PT_LOG("Interrupt pin read from hardware is out of range: "
>> +               "value=%i, acceptable range is 1 - 4\n", r_val);
>> +        r_val = 0;
>> +    } else {
>> +        r_val -= 1;
>> +    }
>> +
>> +    return r_val;
>> +}
>
> if xen_pci_passthrough_helpers.c is only going to contain this function
> you might as well declare it static inline and move it to
> xen_pci_passthrough.h

Ok, I will.
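
For reference, a minimal sketch of what the header-only version of the mapping could look like. The hardware read is replaced by a plain parameter so the hw-to-Xen mapping from the table above can be checked in isolation; pt_intx_from_pin is a hypothetical name, not something the patch defines.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not in the patch): map the PCI Interrupt Pin
 * register value (0 = no interrupt, 1..4 = INTA#..INTD#) to the 0-based
 * value Xen uses. Anything outside 1..4, including 0, falls back to 0,
 * which matches what pci_intx() does in the patch. */
static inline uint8_t pt_intx_from_pin(uint8_t pin)
{
    uint8_t xen_intx = (pin >= 1 && pin <= 4) ? (uint8_t)(pin - 1) : 0;

    /* Xen only knows INTA#..INTD#, i.e. 0..3 */
    assert(xen_intx <= 3);
    return xen_intx;
}
```

In the header, pci_intx() would then reduce to a call such as pt_intx_from_pin(pci_read_intx(ptdev)), plus the existing log message for the out-of-range case.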

-- 
Anthony PERARD

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
  2011-11-08 12:57     ` Stefano Stabellini
@ 2011-11-09 17:05       ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-09 17:05 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel

On Tue, Nov 8, 2011 at 12:57, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> Obviously passthrough cannot work without this patch, but qemu should be
> able to compile anyway. Please add to the previous patch empty stub
> implementations for all the exported functions that you are going to
> implement here.
>
> I see that the timer is allocated here.
> In that case it would make sense to move the timer update to this patch.

Ok, I will do that.

-- 
Anthony PERARD

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
  2011-10-28 15:07   ` Anthony PERARD
@ 2011-11-10 21:28     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-10 21:28 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Fri, Oct 28, 2011 at 04:07:33PM +0100, Anthony PERARD wrote:
> From: Allen Kay <allen.m.kay@intel.com>
> 

This is going to be a bit of a lame review...

> +static uint32_t pt_pci_read_config(PCIDevice *d, uint32_t address, int len)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> +    uint32_t val = 0;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    int rc = 0;
> +    int emul_len = 0;
> +    uint32_t find_addr = address;
> +
> +    if (pt_pci_config_access_check(d, address, len)) {
> +        goto exit;
> +    }
> +
> +    /* check power state transition flags */
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        /* can't accept until previous power state transition is completed.
> +         * so finished previous request here.
> +         */
> +        PT_LOG("Warning: guest want to write durring power state transition\n");

during
> +        goto exit;
> +    }
> +
> +    /* find register group entry */
> +    reg_grp_entry = pt_find_reg_grp(s, address);
> +    if (reg_grp_entry) {
> +        /* check 0 Hardwired register group */
> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> +            /* no need to emulate, just return 0 */
> +            val = 0;
> +            goto exit;
> +        }
> +    }
> +
> +    /* read I/O device register value */
> +    rc = host_pci_get_block(s->real_device, address, (uint8_t *)&val, len);
> +    if (!rc) {
> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
> +        memset(&val, 0xff, len);
> +    }
> +
> +    /* just return the I/O device register value for
> +     * passthrough type register group */
> +    if (reg_grp_entry == NULL) {
> +        goto exit;
> +    }
> +
> +    /* adjust the read value to appropriate CFC-CFF window */
> +    val <<= (address & 3) << 3;
> +    emul_len = len;
> +
> +    /* loop Guest request size */

Perhaps 'loop around the guest request size' ?

> +    while (emul_len > 0) {
> +        /* find register entry to be emulated */
> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
> +        if (reg_entry) {
> +            XenPTRegInfo *reg = reg_entry->reg;
> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> +            uint8_t *ptr_val = NULL;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +
> +            /* do emulation depend on register size */

based on register size
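
The valid_mask arithmetic in the quoted loop is easy to misread, so here is a small standalone sketch of it. It copies the two expressions from the patch into a helper; the name pt_valid_mask and the asserted preconditions are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Mask of the bytes still to be emulated, positioned relative to the
 * start of the register that was found.
 *   emul_len    - bytes of the guest request not yet handled (1..4)
 *   find_addr   - config-space address currently being looked up
 *   real_offset - config-space offset at which the matched register starts
 */
static uint32_t pt_valid_mask(int emul_len, uint32_t find_addr,
                              uint32_t real_offset)
{
    /* assumed preconditions; the patch does not check these */
    assert(emul_len >= 1 && emul_len <= 4);
    assert(find_addr >= real_offset && find_addr - real_offset < 4);

    /* keep emul_len low-order bytes ... */
    uint32_t mask = 0xFFFFFFFFu >> ((4 - emul_len) << 3);
    /* ... then move them up by the byte offset inside the register */
    return mask << ((find_addr - real_offset) << 3);
}
```

For a 1-byte guest access to the second byte of a register, pt_valid_mask(1, 0x05, 0x04) gives 0x0000FF00; for a full dword access at the register start, pt_valid_mask(4, 0x04, 0x04) gives 0xFFFFFFFF.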

> +            switch (reg->size) {
> +            case 1:
> +                if (reg->u.b.read) {
> +                    rc = reg->u.b.read(s, reg_entry, ptr_val, valid_mask);
> +                }
> +                break;
> +            case 2:
> +                if (reg->u.w.read) {
> +                    rc = reg->u.w.read(s, reg_entry,
> +                                       (uint16_t *)ptr_val, valid_mask);
> +                }
> +                break;
> +            case 4:
> +                if (reg->u.dw.read) {
> +                    rc = reg->u.dw.read(s, reg_entry,
> +                                        (uint32_t *)ptr_val, valid_mask);
> +                }
> +                break;
> +            }
> +
> +            if (rc < 0) {
> +                hw_error("Internal error: Invalid read emulation "
> +                         "return value[%d]. I/O emulator exit.\n", rc);
> +            }
> +
> +            /* calculate next address to find */
> +            emul_len -= reg->size;
> +            if (emul_len > 0) {
> +                find_addr = real_offset + reg->size;
> +            }
> +        } else {
> +            /* nothing to do with passthrough type register,
> +             * continue to find next byte */
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* need to shift back before returning them to pci bus emulator */
> +    val >>= ((address & 3) << 3);
> +
> +exit:
> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                  address, val, len);
> +    return val;
> +}
> +
> +static void pt_pci_write_config(PCIDevice *d, uint32_t address,
> +                                uint32_t val, int len)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> +    int index = 0;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    int rc = 0;
> +    uint32_t read_val = 0;
> +    int emul_len = 0;
> +    XenPTReg *reg_entry = NULL;
> +    uint32_t find_addr = address;
> +    XenPTRegInfo *reg = NULL;
> +
> +    if (pt_pci_config_access_check(d, address, len)) {
> +        return;
> +    }
> +
> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                  address, val, len);
> +
> +    /* check unused BAR register */
> +    index = pt_bar_offset_to_index(address);
> +    if ((index >= 0) && (val > 0 && val < PT_BAR_ALLF) &&
> +        (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED)) {
> +        PT_LOG("Warning: Guest attempt to set address to unused Base Address "

So.. it is called PT_LOG, but the first thing it says is Warning. So should it be
PT_WARN?

> +               "Register. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               address, len);
> +    }
> +
> +    /* check power state transition flags */
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        /* can't accept untill previous power state transition is completed.

until
> +         * so finished previous request here.

finish
> +         */
> +        PT_LOG("Warning: guest want to write durring power state transition\n");

during
> +        return;
> +    }
> +
> +    /* find register group entry */
> +    reg_grp_entry = pt_find_reg_grp(s, address);
> +    if (reg_grp_entry) {
> +        /* check 0 Hardwired register group */
> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> +            /* ignore silently */
> +            PT_LOG("Warning: Access to 0 Hardwired register. "
> +                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                   address, len);
> +            return;
> +        }
> +    }
> +
> +    /* read I/O device register value */
> +    rc = host_pci_get_block(s->real_device, address,
> +                             (uint8_t *)&read_val, len);
> +    if (!rc) {
> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);

There isn't a PT_ERR? Hm, looking at the code there is only PT_LOG. Perhaps
declaring PT_ERR and PT_WARN might be a good idea, in case one wants
different log levels in the future? Or do we really not care much about that?
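
A rough sketch of what such leveled macros might look like. Everything here is hypothetical: the patch only has PT_LOG, and this version formats into a caller-supplied buffer (rather than printing) purely so the result can be inspected; the "xen-pt: " prefix is also an assumption.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical severity levels; not part of the patch. */
typedef enum { PT_LVL_ERR, PT_LVL_WARN, PT_LVL_INFO } PTLogLevel;

/* Format "xen-pt: <level prefix><message>" into buf.
 * Returns the number of characters the full message needs (as snprintf
 * does), or a negative value on an encoding error. */
static int pt_log_format(char *buf, size_t size, PTLogLevel lvl,
                         const char *fmt, ...)
{
    static const char *const prefix[] = { "Error: ", "Warning: ", "" };
    va_list ap;
    int n, m;

    assert(buf != NULL && size > 0);

    n = snprintf(buf, size, "xen-pt: %s", prefix[lvl]);
    if (n < 0 || (size_t)n >= size) {
        return n;   /* encoding error, or the prefix alone was truncated */
    }

    va_start(ap, fmt);
    m = vsnprintf(buf + n, size - (size_t)n, fmt, ap);
    va_end(ap);

    return m < 0 ? m : n + m;
}

#define PT_ERR(buf, size, ...) \
    pt_log_format(buf, size, PT_LVL_ERR, __VA_ARGS__)
#define PT_WARN(buf, size, ...) \
    pt_log_format(buf, size, PT_LVL_WARN, __VA_ARGS__)
```

With this shape the call sites no longer spell out "Error: " or "Warning: " in the format string, and a level filter could later be added in one place.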

> +        memset(&read_val, 0xff, len);
> +    }
> +
> +    /* pass directly to libpci for passthrough type register group */

Um, is libpci really a requirement here?

> +    if (reg_grp_entry == NULL) {
> +        goto out;
> +    }
> +
> +    /* adjust the read and write value to appropriate CFC-CFF window */
> +    read_val <<= (address & 3) << 3;
> +    val <<= (address & 3) << 3;
> +    emul_len = len;
> +
> +    /* loop Guest request size */

loop around what the guest requested..

> +    while (emul_len > 0) {
> +        /* find register entry to be emulated */
> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
> +        if (reg_entry) {
> +            reg = reg_entry->reg;
> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> +            uint8_t *ptr_val = NULL;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +
> +            /* do emulation depend on register size */

based 
> +            switch (reg->size) {
> +            case 1:
> +                if (reg->u.b.write) {
> +                    rc = reg->u.b.write(s, reg_entry, ptr_val,
> +                                        read_val >> ((real_offset & 3) << 3),
> +                                        valid_mask);
> +                }
> +                break;
> +            case 2:
> +                if (reg->u.w.write) {
> +                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
> +                                        (read_val >> ((real_offset & 3) << 3)),
> +                                        valid_mask);
> +                }
> +                break;
> +            case 4:
> +                if (reg->u.dw.write) {
> +                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
> +                                         (read_val >> ((real_offset & 3) << 3)),
> +                                         valid_mask);
> +                }
> +                break;
> +            }
> +
> +            if (rc < 0) {
> +                hw_error("Internal error: Invalid write emulation "
> +                         "return value[%d]. I/O emulator exit.\n", rc);

Oh. I hadn't realized this, but you are using hw_error. Which is
calling 'abort'! Yikes. Is there no way to recover from this? Say return 0xfffff?
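
One possible shape for the non-aborting variant, shown for the read side, where there is a value to hand back (on the write side the equivalent would presumably be to drop the write and log). This is only a sketch: pt_all_ones and pt_read_result are hypothetical names, and returning all-ones simply mirrors what the quoted code already does when host_pci_get_block fails.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones value for an access of len bytes (1, 2 or 4). */
static uint32_t pt_all_ones(int len)
{
    assert(len == 1 || len == 2 || len == 4);
    return len == 4 ? 0xFFFFFFFFu : (1u << (len * 8)) - 1u;
}

/* Hypothetical wrapper: rc < 0 means the emulation callback failed.
 * Instead of aborting, hand all-ones back to the guest; otherwise pass
 * the emulated value through unchanged. */
static uint32_t pt_read_result(int rc, uint32_t val, int len)
{
    if (rc < 0) {
        return pt_all_ones(len);
    }
    return val;
}
```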

> +            }
> +
> +            /* calculate next address to find */
> +            emul_len -= reg->size;
> +            if (emul_len > 0) {
> +                find_addr = real_offset + reg->size;
> +            }
> +        } else {
> +            /* nothing to do with passthrough type register,
> +             * continue to find next byte */
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* need to shift back before passing them to libpci */
> +    val >>= (address & 3) << 3;
> +
> +out:
> +    if (!(reg && reg->no_wb)) {
> +        /* unknown regs are passed through */
> +        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
> +
> +        if (!rc) {
> +            PT_LOG("Error: pci_write_block failed. return value[%d].\n", rc);
> +        }
> +    }
> +
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        qemu_mod_timer(s->pm_state->pm_timer,
> +                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
> +    }
> +}
> +
> +/* ioport/iomem space*/
> +static void pt_iomem_map(XenPCIPassthroughState *s, int i,
> +                         pcibus_t e_phys, pcibus_t e_size, int type)
> +{
> +    uint32_t old_ebase = s->bases[i].e_physbase;
> +    bool first_map = s->bases[i].e_size == 0;
> +    int ret = 0;
> +
> +    s->bases[i].e_physbase = e_phys;
> +    s->bases[i].e_size = e_size;
> +
> +    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%%d"
> +           " len=%#"PRIx64" index=%d first_map=%d\n",
> +           e_phys, s->bases[i].access.maddr, /*type,*/
> +           e_size, i, first_map);
> +
> +    if (e_size == 0) {
> +        return;
> +    }
> +
> +    if (!first_map && old_ebase != -1) {

old_ebase != PCI_BAR_UNMAPPED ?

> +        /* Remove old mapping */
> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                               old_ebase >> XC_PAGE_SHIFT,
> +                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> +                               DPCI_REMOVE_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove old mapping failed!\n");
> +            return;
> +        }
> +    }
> +
> +    /* map only valid guest address */
> +    if (e_phys != -1) {

PCI_BAR_UNMAPPED
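
One thing to watch when switching from -1 to a named constant: old_ebase is a uint32_t in the quoted code. The sketch below assumes the constant is an all-ones value of a 64-bit address type; bar_addr_t and BAR_UNMAPPED are local stand-ins for pcibus_t and PCI_BAR_UNMAPPED, so check the real definitions before relying on this.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t bar_addr_t;              /* stand-in for pcibus_t (assumed 64-bit) */
#define BAR_UNMAPPED (~(bar_addr_t)0)     /* stand-in for PCI_BAR_UNMAPPED */

/* Saved in a 32-bit local, as old_ebase is in the patch. */
static bool is_unmapped_32(bar_addr_t e_physbase)
{
    uint32_t old_ebase = (uint32_t)e_physbase;  /* truncates to 0xFFFFFFFF */

    /* old_ebase is widened to 0x00000000FFFFFFFF here, so this can never
     * match the 64-bit all-ones constant */
    return old_ebase == BAR_UNMAPPED;
}

/* Saved in a local of the full address width. */
static bool is_unmapped_64(bar_addr_t e_physbase)
{
    bar_addr_t old_ebase = e_physbase;

    return old_ebase == BAR_UNMAPPED;
}
```

The existing comparison against a plain -1 works with the 32-bit local because -1 is converted to a 32-bit unsigned value first. Under the assumptions above, replacing it with the constant would also mean widening old_ebase.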

> +        /* Create new mapping */
> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> +                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> +                                   DPCI_ADD_MAPPING);
> +
> +        if (ret != 0) {
> +            PT_LOG("Error: create new mapping failed!\n");
> +        }
> +    }
> +}
> +
> +static void pt_ioport_map(XenPCIPassthroughState *s, int i,
> +                          pcibus_t e_phys, pcibus_t e_size, int type)
> +{
> +    uint32_t old_ebase = s->bases[i].e_physbase;
> +    bool first_map = s->bases[i].e_size == 0;
> +    int ret = 0;
> +
> +    s->bases[i].e_physbase = e_phys;
> +    s->bases[i].e_size = e_size;
> +
> +    PT_LOG("e_phys=%#04"PRIx64" pio_base=%#04"PRIx64" len=%"PRId64" index=%d"
> +           " first_map=%d\n",
> +           e_phys, s->bases[i].access.pio_base, e_size, i, first_map);
> +
> +    if (e_size == 0) {
> +        return;
> +    }
> +
> +    if (!first_map && old_ebase != -1) {

PCI_BAR_UNMAPPED
> +        /* Remove old mapping */
> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, old_ebase,
> +                                       s->bases[i].access.pio_base, e_size,
> +                                       DPCI_REMOVE_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove old mapping failed!\n");
> +            return;
> +        }
> +    }
> +
> +    /* map only valid guest address (include 0) */
> +    if (e_phys != -1) {
> +        /* Create new mapping */
> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, e_phys,
> +                                       s->bases[i].access.pio_base, e_size,
> +                                       DPCI_ADD_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: create new mapping failed!\n");
> +        }
> +    }
> +
> +}
> +
> +
> +/* mapping BAR */
> +
> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
> +                        int io_enable, int mem_enable)
> +{
> +    PCIDevice *dev = &s->dev;
> +    PCIIORegion *r;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    pcibus_t r_size = 0, r_addr = -1;

PCI_BAR_UNMAPPED

> +    int rc = 0;
> +
> +    r = &dev->io_regions[bar];
> +
> +    /* check valid region */
> +    if (!r->size) {
> +        return;
> +    }
> +
> +    base = &s->bases[bar];
> +    /* skip unused BAR or upper 64bit BAR */
> +    if ((base->bar_flag == PT_BAR_FLAG_UNUSED)
> +        || (base->bar_flag == PT_BAR_FLAG_UPPER)) {
> +           return;
> +    }
> +
> +    /* copy region address to temporary */
> +    r_addr = r->addr;
> +
> +    /* need unmapping in case I/O Space or Memory Space disable */
> +    if (((base->bar_flag == PT_BAR_FLAG_IO) && !io_enable) ||
> +        ((base->bar_flag == PT_BAR_FLAG_MEM) && !mem_enable)) {
> +        r_addr = -1;
> +    }
> +    if ((bar == PCI_ROM_SLOT) && (r_addr != -1)) {
> +        reg_grp_entry = pt_find_reg_grp(s, PCI_ROM_ADDRESS);
> +        if (reg_grp_entry) {
> +            reg_entry = pt_find_reg(reg_grp_entry, PCI_ROM_ADDRESS);
> +            if (reg_entry && !(reg_entry->data & PCI_ROM_ADDRESS_ENABLE)) {
> +                r_addr = -1;

PCI_BAR_UNMAPPED

> +            }
> +        }
> +    }
> +
> +    /* prevent guest software mapping memory resource to 00000000h */
> +    if ((base->bar_flag == PT_BAR_FLAG_MEM) && (r_addr == 0)) {
> +        r_addr = -1;
> +    }
> +
> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
> +
> +    rc = pci_check_bar_overlap(dev, r_addr, r_size, r->type);
> +    if (rc > 0) {
> +        PT_LOG("Warning: s[%02x:%02x.%x][Region:%d][Address:%"FMT_PCIBUS"h]"
> +               "[Size:%"FMT_PCIBUS"h] is overlapped.\n", pci_bus_num(dev->bus),
> +               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), bar,
> +               r_addr, r_size);
> +    }
> +
> +    /* check whether we need to update the mapping or not */
> +    if (r_addr != s->bases[bar].e_physbase) {
> +        /* mapping BAR */
> +        if (base->bar_flag == PT_BAR_FLAG_IO) {
> +            pt_ioport_map(s, bar, r_addr, r_size, r->type);
> +        } else {
> +            pt_iomem_map(s, bar, r_addr, r_size, r->type);
> +        }
> +    }
> +}
> +
> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
> +{
> +    int i;
> +
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        pt_bar_mapping_one(s, i, io_enable, mem_enable);
> +    }
> +}
> +
> +/* register regions */
> +static int pt_register_regions(XenPCIPassthroughState *s)
> +{
> +    int i = 0;
> +    uint32_t bar_data = 0;
> +    HostPCIDevice *d = s->real_device;
> +
> +    /* Register PIO/MMIO BARs */
> +    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
> +        HostPCIIORegion *r = &d->io_regions[i];
> +
> +        if (r->base_addr) {

So should you check for PCI_BAR_UNMAPPED or is that not really
required here as the pci_register_bar would do it?

> +            s->bases[i].e_physbase = r->base_addr;
> +            s->bases[i].access.u = r->base_addr;
> +
> +            /* Register current region */
> +            if (r->flags & IORESOURCE_IO) {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);

You can make "xen-pci-pt-bar" a #define somewhere and reuse that.
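
A sketch of how that could look, which also folds the three near-identical branches into one type lookup. The macro name XEN_PT_BAR_NAME and the helper pt_bar_type are hypothetical; the flag and BAR-type constants are redefined locally only so the fragment compiles on its own, and are meant to mirror linux/ioport.h and the PCI BAR encoding.

```c
#include <stdint.h>

/* Local stand-ins so the sketch is self-contained. */
#define IORESOURCE_IO                 0x00000100
#define IORESOURCE_PREFETCH           0x00002000
#define PCI_BASE_ADDRESS_SPACE_IO     0x01
#define PCI_BASE_ADDRESS_SPACE_MEMORY 0x00
#define PCI_BASE_ADDRESS_MEM_PREFETCH 0x08

/* Hypothetical macro name for the memory region label. */
#define XEN_PT_BAR_NAME "xen-pci-pt-bar"

/* Pick the BAR type to register from the host resource flags, using the
 * same precedence as the if/else chain in the patch: I/O first, then
 * prefetchable memory, then plain memory. */
static uint8_t pt_bar_type(unsigned long flags)
{
    if (flags & IORESOURCE_IO) {
        return PCI_BASE_ADDRESS_SPACE_IO;
    }
    if (flags & IORESOURCE_PREFETCH) {
        return PCI_BASE_ADDRESS_MEM_PREFETCH;
    }
    return PCI_BASE_ADDRESS_SPACE_MEMORY;
}
```

The loop body would then need a single memory_region_init_io(..., XEN_PT_BAR_NAME, r->size) followed by one pci_register_bar() call taking pt_bar_type(r->flags).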

> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
> +                                 &s->bar[i]);
> +            } else if (r->flags & IORESOURCE_PREFETCH) {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
> +                                 &s->bar[i]);
> +            } else {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                                 &s->bar[i]);
> +            }
> +
> +            PT_LOG("IO region registered (size=0x%08"PRIx64
> +                   " base_addr=0x%08"PRIx64")\n",
> +                   r->size, r->base_addr);
> +        }
> +    }
> +
> +    /* Register expansion ROM address */
> +    if (d->rom.base_addr && d->rom.size) {
> +        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
> +        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
> +        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
> +            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
> +            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
> +        }
> +
> +        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
> +        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
> +
> +        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
> +                                      "xen-pci-pt-rom", d->rom.size);
> +        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
> +                         &s->rom);
> +
> +        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
> +               " base_addr=0x%08"PRIx64")\n",
> +               d->rom.size, d->rom.base_addr);
> +    }
> +
> +    return 0;
> +}
> +
> +static void pt_unregister_regions(XenPCIPassthroughState *s)
> +{
> +    int i, type, rc;
> +    uint32_t e_size;
> +    PCIDevice *d = &s->dev;
> +
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        e_size = s->bases[i].e_size;
> +        if ((e_size == 0) || (s->bases[i].e_physbase == -1)) {
> +            continue;
> +        }
> +
> +        type = d->io_regions[i].type;
> +
> +        if (type == PCI_BASE_ADDRESS_SPACE_MEMORY
> +            || type == PCI_BASE_ADDRESS_MEM_PREFETCH) {
> +            rc = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                    s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> +                    s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                    (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> +                    DPCI_REMOVE_MAPPING);
> +            if (rc != 0) {
> +                PT_LOG("Error: remove old mem mapping failed!\n");
> +                continue;
> +            }
> +
> +        } else if (type == PCI_BASE_ADDRESS_SPACE_IO) {
> +            rc = xc_domain_ioport_mapping(xen_xc, xen_domid,
> +                        s->bases[i].e_physbase,
> +                        s->bases[i].access.pio_base,
> +                        e_size,
> +                        DPCI_REMOVE_MAPPING);
> +            if (rc != 0) {
> +                PT_LOG("Error: remove old io mapping failed!\n");
> +                continue;
> +            }
> +        }
> +    }
> +}
> +
> +static int pt_initfn(PCIDevice *pcidev)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
> +    int dom, bus;
> +    unsigned slot, func;
> +    int rc = 0;
> +    uint32_t machine_irq;
> +    int pirq = -1;
> +
> +    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
> +        fprintf(stderr, "error parse bdf: %s\n", s->hostaddr);
> +        return -1;
> +    }
> +
> +    /* register real device */
> +    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
> +           bus, slot, func, s->dev.devfn);
> +
> +    s->real_device = host_pci_device_get(bus, slot, func);
> +    if (!s->real_device) {
> +        return -1;
> +    }
> +
> +    s->is_virtfn = s->real_device->is_virtfn;
> +    if (s->is_virtfn) {
> +        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
> +               s->real_device->domain, bus, slot, func);
> +    }
> +
> +    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
> +    if (host_pci_get_block(s->real_device, 0, pcidev->config,
> +                           PCI_CONFIG_SPACE_SIZE) == -1) {
> +        return -1;
> +    }
> +
> +    /* Handle real device's MMIO/PIO BARs */
> +    pt_register_regions(s);
> +
> +    /* reinitialize each config register to be emulated */
> +    pt_config_init(s);
> +
> +    /* Bind interrupt */
> +    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
> +        PT_LOG("no pin interrupt\n");

Perhaps include some details of which device failed?

> +        goto out;
> +    }
> +
> +    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
> +    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
> +
> +    if (rc) {
> +        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);

Can you also include the IRQ it tried to map (both machine and pirq)?

> +
> +        /* Disable PCI intx assertion (turn on bit10 of devctl) */
> +        host_pci_set_word(s->real_device,
> +                          PCI_COMMAND,
> +                          pci_get_word(s->dev.config + PCI_COMMAND)
> +                          | PCI_COMMAND_INTX_DISABLE);
> +        machine_irq = 0;
> +        s->machine_irq = 0;
> +    } else {
> +        machine_irq = pirq;
> +        s->machine_irq = pirq;
> +        mapped_machine_irq[machine_irq]++;
> +    }
> +
> +    /* bind machine_irq to device */
> +    if (rc < 0 && machine_irq != 0) {
> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> +        uint8_t e_intx = pci_intx(s);
> +
> +        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
> +                                       e_device, e_intx);
> +        if (rc < 0) {
> +            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);

A few more details here: name of the device, the IRQ, ...

> +
> +            /* Disable PCI intx assertion (turn on bit10 of devctl) */
> +            host_pci_set_word(s->real_device, PCI_COMMAND,
> +                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
> +                              | PCI_COMMAND_INTX_DISABLE);
> +            mapped_machine_irq[machine_irq]--;
> +
> +            if (mapped_machine_irq[machine_irq] == 0) {
> +                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
> +                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
> +                           rc);

And here too. It would be beneficial to have lots of detail on the error
paths, so that in the field it is easier to find out what went wrong
(and to match up the PIRQ with the GSI).
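
As an illustration of the kind of message meant here, a small sketch that puts the host device address and both IRQ numbers into one line. The helper name, its parameter list and the exact wording are all assumptions; in the patch this would go through PT_LOG rather than into a buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: build an error line that identifies the host
 * device (domain:bus:slot.func) and the IRQs involved.
 * Returns the snprintf result (length needed, or negative on error). */
static int pt_format_irq_error(char *buf, size_t size, const char *what,
                               int domain, int bus, int slot, int func,
                               uint32_t machine_irq, int pirq, int rc)
{
    assert(buf != NULL && size > 0);

    return snprintf(buf, size,
                    "Error: %s failed for %04x:%02x:%02x.%x "
                    "(machine irq=%u, pirq=%d, rc=%d)\n",
                    what, domain, bus, slot, func,
                    (unsigned)machine_irq, pirq, rc);
}
```

Called with "Mapping irq" for device 0000:00:1b.0, machine irq 22, pirq -1 and rc -22, this produces "Error: Mapping irq failed for 0000:00:1b.0 (machine irq=22, pirq=-1, rc=-22)".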

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
@ 2011-11-10 21:28     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-10 21:28 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Fri, Oct 28, 2011 at 04:07:33PM +0100, Anthony PERARD wrote:
> From: Allen Kay <allen.m.kay@intel.com>
> 

This is going to be a bit of a lame review...

> +static uint32_t pt_pci_read_config(PCIDevice *d, uint32_t address, int len)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> +    uint32_t val = 0;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    int rc = 0;
> +    int emul_len = 0;
> +    uint32_t find_addr = address;
> +
> +    if (pt_pci_config_access_check(d, address, len)) {
> +        goto exit;
> +    }
> +
> +    /* check power state transition flags */
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        /* can't accept until previous power state transition is completed.
> +         * so finished previous request here.
> +         */
> +        PT_LOG("Warning: guest want to write durring power state transition\n");

during
> +        goto exit;
> +    }
> +
> +    /* find register group entry */
> +    reg_grp_entry = pt_find_reg_grp(s, address);
> +    if (reg_grp_entry) {
> +        /* check 0 Hardwired register group */
> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> +            /* no need to emulate, just return 0 */
> +            val = 0;
> +            goto exit;
> +        }
> +    }
> +
> +    /* read I/O device register value */
> +    rc = host_pci_get_block(s->real_device, address, (uint8_t *)&val, len);
> +    if (!rc) {
> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
> +        memset(&val, 0xff, len);
> +    }
> +
> +    /* just return the I/O device register value for
> +     * passthrough type register group */
> +    if (reg_grp_entry == NULL) {
> +        goto exit;
> +    }
> +
> +    /* adjust the read value to appropriate CFC-CFF window */
> +    val <<= (address & 3) << 3;
> +    emul_len = len;
> +
> +    /* loop Guest request size */

Perhaps 'loop around the guest request size' ?
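
To make the arithmetic in that loop easier to follow, here is a small
standalone sketch of the shift and mask computations. The helper names
window_shift and valid_mask_for are made up for illustration; only the
expressions themselves come from the quoted patch, and emul_len is assumed
to be between 1 and 4.

```c
#include <stdint.h>

/* Bit offset of an access inside its aligned 32-bit config window.
 * Hypothetical helper; mirrors "(address & 3) << 3" in the patch. */
static inline unsigned window_shift(uint32_t address)
{
    return (address & 3) << 3;
}

/* Mask of the bytes still to be emulated, shifted to where find_addr sits
 * relative to the start of the register (real_offset).
 * Hypothetical helper; mirrors the two valid_mask statements in the patch. */
static inline uint32_t valid_mask_for(int emul_len, uint32_t find_addr,
                                      uint32_t real_offset)
{
    uint32_t mask = 0xFFFFFFFFu >> ((4 - emul_len) << 3);

    return mask << ((find_addr - real_offset) << 3);
}
```

For example, a 2-byte access at offset 0x06 into a 4-byte register at 0x04
gives a window shift of 16 bits and a mask covering the upper two bytes.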

> +    while (emul_len > 0) {
> +        /* find register entry to be emulated */
> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
> +        if (reg_entry) {
> +            XenPTRegInfo *reg = reg_entry->reg;
> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> +            uint8_t *ptr_val = NULL;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +
> +            /* do emulation depend on register size */

based on register size

> +            switch (reg->size) {
> +            case 1:
> +                if (reg->u.b.read) {
> +                    rc = reg->u.b.read(s, reg_entry, ptr_val, valid_mask);
> +                }
> +                break;
> +            case 2:
> +                if (reg->u.w.read) {
> +                    rc = reg->u.w.read(s, reg_entry,
> +                                       (uint16_t *)ptr_val, valid_mask);
> +                }
> +                break;
> +            case 4:
> +                if (reg->u.dw.read) {
> +                    rc = reg->u.dw.read(s, reg_entry,
> +                                        (uint32_t *)ptr_val, valid_mask);
> +                }
> +                break;
> +            }
> +
> +            if (rc < 0) {
> +                hw_error("Internal error: Invalid read emulation "
> +                         "return value[%d]. I/O emulator exit.\n", rc);
> +            }
> +
> +            /* calculate next address to find */
> +            emul_len -= reg->size;
> +            if (emul_len > 0) {
> +                find_addr = real_offset + reg->size;
> +            }
> +        } else {
> +            /* nothing to do with passthrough type register,
> +             * continue to find next byte */
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* need to shift back before returning them to pci bus emulator */
> +    val >>= ((address & 3) << 3);
> +
> +exit:
> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                  address, val, len);
> +    return val;
> +}
> +
> +static void pt_pci_write_config(PCIDevice *d, uint32_t address,
> +                                uint32_t val, int len)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> +    int index = 0;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    int rc = 0;
> +    uint32_t read_val = 0;
> +    int emul_len = 0;
> +    XenPTReg *reg_entry = NULL;
> +    uint32_t find_addr = address;
> +    XenPTRegInfo *reg = NULL;
> +
> +    if (pt_pci_config_access_check(d, address, len)) {
> +        return;
> +    }
> +
> +    PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                  pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                  address, val, len);
> +
> +    /* check unused BAR register */
> +    index = pt_bar_offset_to_index(address);
> +    if ((index >= 0) && (val > 0 && val < PT_BAR_ALLF) &&
> +        (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED)) {
> +        PT_LOG("Warning: Guest attempt to set address to unused Base Address "

So.. it is called PT_LOG, but the first thing it says is Warning. So should it be
PT_WARN?

> +               "Register. [%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               address, len);
> +    }
> +
> +    /* check power state transition flags */
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        /* can't accept untill previous power state transition is completed.

until
> +         * so finished previous request here.

finish
> +         */
> +        PT_LOG("Warning: guest want to write durring power state transition\n");

during
> +        return;
> +    }
> +
> +    /* find register group entry */
> +    reg_grp_entry = pt_find_reg_grp(s, address);
> +    if (reg_grp_entry) {
> +        /* check 0 Hardwired register group */
> +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> +            /* ignore silently */
> +            PT_LOG("Warning: Access to 0 Hardwired register. "
> +                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +                   address, len);
> +            return;
> +        }
> +    }
> +
> +    /* read I/O device register value */
> +    rc = host_pci_get_block(s->real_device, address,
> +                             (uint8_t *)&read_val, len);
> +    if (!rc) {
> +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);

There isn't a PT_ERR? Hm, looking at the code there is only PT_LOG. Perhaps
declaring PT_ERR and PT_WARN might be a good idea? In case in the future
one wants different levels of this? Or do we really not care much about that?
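
A minimal sketch of what such levels could look like. Everything here
(pt_level, pt_format, PT_EMIT, the "pt: " prefix) is hypothetical and not
part of the patch, which only has PT_LOG; it is meant to show the shape,
not to prescribe an implementation.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical severity levels; the patch itself only has PT_LOG. */
enum pt_level { PT_LVL_ERR, PT_LVL_WARN, PT_LVL_INFO };

/* Format "pt: <level>: <message>" into buf.
 * Returns a negative value on error, a non-negative length otherwise. */
static int pt_format(char *buf, size_t n, enum pt_level lvl,
                     const char *fmt, ...)
{
    static const char *const prefix[] = { "Error", "Warning", "Info" };
    va_list ap;
    int off, rc;

    off = snprintf(buf, n, "pt: %s: ", prefix[lvl]);
    if (off < 0 || (size_t)off >= n) {
        return off;
    }
    va_start(ap, fmt);
    rc = vsnprintf(buf + off, n - (size_t)off, fmt, ap);
    va_end(ap);
    return rc < 0 ? rc : off + rc;
}

/* One macro per level, so call sites no longer spell the level by hand. */
#define PT_EMIT(lvl, ...)                                                 \
    do {                                                                  \
        char pt_buf_[256];                                                \
        if (pt_format(pt_buf_, sizeof(pt_buf_), lvl, __VA_ARGS__) >= 0) { \
            fputs(pt_buf_, stderr);                                       \
        }                                                                 \
    } while (0)
#define PT_ERR(...)  PT_EMIT(PT_LVL_ERR, __VA_ARGS__)
#define PT_WARN(...) PT_EMIT(PT_LVL_WARN, __VA_ARGS__)
```

A call site would then read PT_ERR("pci_read_block failed. rc=%d\n", rc)
instead of carrying "Error: " inside the format string.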

> +        memset(&read_val, 0xff, len);
> +    }
> +
> +    /* pass directly to libpci for passthrough type register group */

Um, is libpci definitely a requirement here?

> +    if (reg_grp_entry == NULL) {
> +        goto out;
> +    }
> +
> +    /* adjust the read and write value to appropriate CFC-CFF window */
> +    read_val <<= (address & 3) << 3;
> +    val <<= (address & 3) << 3;
> +    emul_len = len;
> +
> +    /* loop Guest request size */

loop around what the guest requested..

> +    while (emul_len > 0) {
> +        /* find register entry to be emulated */
> +        reg_entry = pt_find_reg(reg_grp_entry, find_addr);
> +        if (reg_entry) {
> +            reg = reg_entry->reg;
> +            uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> +            uint8_t *ptr_val = NULL;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +
> +            /* do emulation depend on register size */

based 
> +            switch (reg->size) {
> +            case 1:
> +                if (reg->u.b.write) {
> +                    rc = reg->u.b.write(s, reg_entry, ptr_val,
> +                                        read_val >> ((real_offset & 3) << 3),
> +                                        valid_mask);
> +                }
> +                break;
> +            case 2:
> +                if (reg->u.w.write) {
> +                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
> +                                        (read_val >> ((real_offset & 3) << 3)),
> +                                        valid_mask);
> +                }
> +                break;
> +            case 4:
> +                if (reg->u.dw.write) {
> +                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
> +                                         (read_val >> ((real_offset & 3) << 3)),
> +                                         valid_mask);
> +                }
> +                break;
> +            }
> +
> +            if (rc < 0) {
> +                hw_error("Internal error: Invalid write emulation "
> +                         "return value[%d]. I/O emulator exit.\n", rc);

Oh. I hadn't realized this, but you are using hw_error, which
calls 'abort'! Yikes. Is there no way to recover from this? Say return 0xfffff?

> +            }
> +
> +            /* calculate next address to find */
> +            emul_len -= reg->size;
> +            if (emul_len > 0) {
> +                find_addr = real_offset + reg->size;
> +            }
> +        } else {
> +            /* nothing to do with passthrough type register,
> +             * continue to find next byte */
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* need to shift back before passing them to libpci */
> +    val >>= (address & 3) << 3;
> +
> +out:
> +    if (!(reg && reg->no_wb)) {
> +        /* unknown regs are passed through */
> +        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
> +
> +        if (!rc) {
> +            PT_LOG("Error: pci_write_block failed. return value[%d].\n", rc);
> +        }
> +    }
> +
> +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> +        qemu_mod_timer(s->pm_state->pm_timer,
> +                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
> +    }
> +}
> +
> +/* ioport/iomem space*/
> +static void pt_iomem_map(XenPCIPassthroughState *s, int i,
> +                         pcibus_t e_phys, pcibus_t e_size, int type)
> +{
> +    uint32_t old_ebase = s->bases[i].e_physbase;
> +    bool first_map = s->bases[i].e_size == 0;
> +    int ret = 0;
> +
> +    s->bases[i].e_physbase = e_phys;
> +    s->bases[i].e_size = e_size;
> +
> +    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%%d"
> +           " len=%#"PRIx64" index=%d first_map=%d\n",
> +           e_phys, s->bases[i].access.maddr, /*type,*/
> +           e_size, i, first_map);
> +
> +    if (e_size == 0) {
> +        return;
> +    }
> +
> +    if (!first_map && old_ebase != -1) {

old_ebase != PCI_BAR_UNMAPPED ?
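
One caveat with that substitution, sketched below under the assumption
that pcibus_t is a 64-bit type and PCI_BAR_UNMAPPED is (~(pcibus_t)0), as
hw/pci.h appears to define them: old_ebase is declared uint32_t, so once
it is compared against a 64-bit all-ones value instead of -1 the test can
never report "unmapped". The two helper functions are hypothetical and
exist only to show the difference.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match QEMU's hw/pci.h; repeated so the sketch compiles alone. */
typedef uint64_t pcibus_t;
#define PCI_BAR_UNMAPPED (~(pcibus_t)0)

/* Old base kept in a uint32_t, as in the quoted patch. */
static bool is_mapped_narrow(pcibus_t e_physbase)
{
    uint32_t old_ebase = (uint32_t)e_physbase;  /* drops the upper 32 bits */

    /* old_ebase widens to 0x00000000FFFFFFFF at most, so this is
     * true even for an unmapped BAR. */
    return old_ebase != PCI_BAR_UNMAPPED;
}

/* Old base kept in a pcibus_t: the comparison behaves as intended. */
static bool is_mapped_wide(pcibus_t e_physbase)
{
    pcibus_t old_ebase = e_physbase;

    return old_ebase != PCI_BAR_UNMAPPED;
}
```

So if the macro is used here, old_ebase probably wants to become a
pcibus_t at the same time.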

> +        /* Remove old mapping */
> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                               old_ebase >> XC_PAGE_SHIFT,
> +                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> +                               DPCI_REMOVE_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove old mapping failed!\n");
> +            return;
> +        }
> +    }
> +
> +    /* map only valid guest address */
> +    if (e_phys != -1) {

PCI_BAR_UNMAPPED

> +        /* Create new mapping */
> +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> +                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> +                                   DPCI_ADD_MAPPING);
> +
> +        if (ret != 0) {
> +            PT_LOG("Error: create new mapping failed!\n");
> +        }
> +    }
> +}
> +
> +static void pt_ioport_map(XenPCIPassthroughState *s, int i,
> +                          pcibus_t e_phys, pcibus_t e_size, int type)
> +{
> +    uint32_t old_ebase = s->bases[i].e_physbase;
> +    bool first_map = s->bases[i].e_size == 0;
> +    int ret = 0;
> +
> +    s->bases[i].e_physbase = e_phys;
> +    s->bases[i].e_size = e_size;
> +
> +    PT_LOG("e_phys=%#04"PRIx64" pio_base=%#04"PRIx64" len=%"PRId64" index=%d"
> +           " first_map=%d\n",
> +           e_phys, s->bases[i].access.pio_base, e_size, i, first_map);
> +
> +    if (e_size == 0) {
> +        return;
> +    }
> +
> +    if (!first_map && old_ebase != -1) {

PCI_BAR_UNMAPPED
> +        /* Remove old mapping */
> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, old_ebase,
> +                                       s->bases[i].access.pio_base, e_size,
> +                                       DPCI_REMOVE_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove old mapping failed!\n");
> +            return;
> +        }
> +    }
> +
> +    /* map only valid guest address (include 0) */
> +    if (e_phys != -1) {
> +        /* Create new mapping */
> +        ret = xc_domain_ioport_mapping(xen_xc, xen_domid, e_phys,
> +                                       s->bases[i].access.pio_base, e_size,
> +                                       DPCI_ADD_MAPPING);
> +        if (ret != 0) {
> +            PT_LOG("Error: create new mapping failed!\n");
> +        }
> +    }
> +
> +}
> +
> +
> +/* mapping BAR */
> +
> +void pt_bar_mapping_one(XenPCIPassthroughState *s, int bar,
> +                        int io_enable, int mem_enable)
> +{
> +    PCIDevice *dev = &s->dev;
> +    PCIIORegion *r;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    pcibus_t r_size = 0, r_addr = -1;

PCI_BAR_UNMAPPED

> +    int rc = 0;
> +
> +    r = &dev->io_regions[bar];
> +
> +    /* check valid region */
> +    if (!r->size) {
> +        return;
> +    }
> +
> +    base = &s->bases[bar];
> +    /* skip unused BAR or upper 64bit BAR */
> +    if ((base->bar_flag == PT_BAR_FLAG_UNUSED)
> +        || (base->bar_flag == PT_BAR_FLAG_UPPER)) {
> +           return;
> +    }
> +
> +    /* copy region address to temporary */
> +    r_addr = r->addr;
> +
> +    /* need unmapping in case I/O Space or Memory Space disable */
> +    if (((base->bar_flag == PT_BAR_FLAG_IO) && !io_enable) ||
> +        ((base->bar_flag == PT_BAR_FLAG_MEM) && !mem_enable)) {
> +        r_addr = -1;
> +    }
> +    if ((bar == PCI_ROM_SLOT) && (r_addr != -1)) {
> +        reg_grp_entry = pt_find_reg_grp(s, PCI_ROM_ADDRESS);
> +        if (reg_grp_entry) {
> +            reg_entry = pt_find_reg(reg_grp_entry, PCI_ROM_ADDRESS);
> +            if (reg_entry && !(reg_entry->data & PCI_ROM_ADDRESS_ENABLE)) {
> +                r_addr = -1;

PCI_BAR_UNMAPPED

> +            }
> +        }
> +    }
> +
> +    /* prevent guest software mapping memory resource to 00000000h */
> +    if ((base->bar_flag == PT_BAR_FLAG_MEM) && (r_addr == 0)) {
> +        r_addr = -1;
> +    }
> +
> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
> +
> +    rc = pci_check_bar_overlap(dev, r_addr, r_size, r->type);
> +    if (rc > 0) {
> +        PT_LOG("Warning: s[%02x:%02x.%x][Region:%d][Address:%"FMT_PCIBUS"h]"
> +               "[Size:%"FMT_PCIBUS"h] is overlapped.\n", pci_bus_num(dev->bus),
> +               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), bar,
> +               r_addr, r_size);
> +    }
> +
> +    /* check whether we need to update the mapping or not */
> +    if (r_addr != s->bases[bar].e_physbase) {
> +        /* mapping BAR */
> +        if (base->bar_flag == PT_BAR_FLAG_IO) {
> +            pt_ioport_map(s, bar, r_addr, r_size, r->type);
> +        } else {
> +            pt_iomem_map(s, bar, r_addr, r_size, r->type);
> +        }
> +    }
> +}
> +
> +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
> +{
> +    int i;
> +
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        pt_bar_mapping_one(s, i, io_enable, mem_enable);
> +    }
> +}
> +
> +/* register regions */
> +static int pt_register_regions(XenPCIPassthroughState *s)
> +{
> +    int i = 0;
> +    uint32_t bar_data = 0;
> +    HostPCIDevice *d = s->real_device;
> +
> +    /* Register PIO/MMIO BARs */
> +    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
> +        HostPCIIORegion *r = &d->io_regions[i];
> +
> +        if (r->base_addr) {

So should you check for PCI_BAR_UNMAPPED here, or is that not really
required because pci_register_bar would do it?

> +            s->bases[i].e_physbase = r->base_addr;
> +            s->bases[i].access.u = r->base_addr;
> +
> +            /* Register current region */
> +            if (r->flags & IORESOURCE_IO) {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);

You can make the "xen-pci-pt-bar" be a #define somewhere and reuse that.

> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
> +                                 &s->bar[i]);
> +            } else if (r->flags & IORESOURCE_PREFETCH) {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
> +                                 &s->bar[i]);
> +            } else {
> +                memory_region_init_io(&s->bar[i], NULL, NULL,
> +                                      "xen-pci-pt-bar", r->size);
> +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
> +                                 &s->bar[i]);
> +            }
> +
> +            PT_LOG("IO region registered (size=0x%08"PRIx64
> +                   " base_addr=0x%08"PRIx64")\n",
> +                   r->size, r->base_addr);
> +        }
> +    }
> +
> +    /* Register expansion ROM address */
> +    if (d->rom.base_addr && d->rom.size) {
> +        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
> +        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
> +        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
> +            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
> +            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
> +        }
> +
> +        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
> +        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
> +
> +        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
> +                                      "xen-pci-pt-rom", d->rom.size);
> +        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
> +                         &s->rom);
> +
> +        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
> +               " base_addr=0x%08"PRIx64")\n",
> +               d->rom.size, d->rom.base_addr);
> +    }
> +
> +    return 0;
> +}
> +
> +static void pt_unregister_regions(XenPCIPassthroughState *s)
> +{
> +    int i, type, rc;
> +    uint32_t e_size;
> +    PCIDevice *d = &s->dev;
> +
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        e_size = s->bases[i].e_size;
> +        if ((e_size == 0) || (s->bases[i].e_physbase == -1)) {
> +            continue;
> +        }
> +
> +        type = d->io_regions[i].type;
> +
> +        if (type == PCI_BASE_ADDRESS_SPACE_MEMORY
> +            || type == PCI_BASE_ADDRESS_MEM_PREFETCH) {
> +            rc = xc_domain_memory_mapping(xen_xc, xen_domid,
> +                    s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> +                    s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> +                    (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> +                    DPCI_REMOVE_MAPPING);
> +            if (rc != 0) {
> +                PT_LOG("Error: remove old mem mapping failed!\n");
> +                continue;
> +            }
> +
> +        } else if (type == PCI_BASE_ADDRESS_SPACE_IO) {
> +            rc = xc_domain_ioport_mapping(xen_xc, xen_domid,
> +                        s->bases[i].e_physbase,
> +                        s->bases[i].access.pio_base,
> +                        e_size,
> +                        DPCI_REMOVE_MAPPING);
> +            if (rc != 0) {
> +                PT_LOG("Error: remove old io mapping failed!\n");
> +                continue;
> +            }
> +        }
> +    }
> +}
> +
> +static int pt_initfn(PCIDevice *pcidev)
> +{
> +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
> +    int dom, bus;
> +    unsigned slot, func;
> +    int rc = 0;
> +    uint32_t machine_irq;
> +    int pirq = -1;
> +
> +    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
> +        fprintf(stderr, "error parse bdf: %s\n", s->hostaddr);
> +        return -1;
> +    }
> +
> +    /* register real device */
> +    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
> +           bus, slot, func, s->dev.devfn);
> +
> +    s->real_device = host_pci_device_get(bus, slot, func);
> +    if (!s->real_device) {
> +        return -1;
> +    }
> +
> +    s->is_virtfn = s->real_device->is_virtfn;
> +    if (s->is_virtfn) {
> +        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
> +               s->real_device->domain, bus, slot, func);
> +    }
> +
> +    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
> +    if (host_pci_get_block(s->real_device, 0, pcidev->config,
> +                           PCI_CONFIG_SPACE_SIZE) == -1) {
> +        return -1;
> +    }
> +
> +    /* Handle real device's MMIO/PIO BARs */
> +    pt_register_regions(s);
> +
> +    /* reinitialize each config register to be emulated */
> +    pt_config_init(s);
> +
> +    /* Bind interrupt */
> +    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
> +        PT_LOG("no pin interrupt\n");

Perhaps include some details of which device failed?

> +        goto out;
> +    }
> +
> +    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
> +    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
> +
> +    if (rc) {
> +        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);

Can you also include the IRQ it tried to map (both machine and pirq)?

> +
> +        /* Disable PCI intx assertion (turn on bit10 of devctl) */
> +        host_pci_set_word(s->real_device,
> +                          PCI_COMMAND,
> +                          pci_get_word(s->dev.config + PCI_COMMAND)
> +                          | PCI_COMMAND_INTX_DISABLE);
> +        machine_irq = 0;
> +        s->machine_irq = 0;
> +    } else {
> +        machine_irq = pirq;
> +        s->machine_irq = pirq;
> +        mapped_machine_irq[machine_irq]++;
> +    }
> +
> +    /* bind machine_irq to device */
> +    if (rc < 0 && machine_irq != 0) {
> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> +        uint8_t e_intx = pci_intx(s);
> +
> +        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
> +                                       e_device, e_intx);
> +        if (rc < 0) {
> +            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);

A bit more detail here: name of the device, the IRQ, ..

> +
> +            /* Disable PCI intx assertion (turn on bit10 of devctl) */
> +            host_pci_set_word(s->real_device, PCI_COMMAND,
> +                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
> +                              | PCI_COMMAND_INTX_DISABLE);
> +            mapped_machine_irq[machine_irq]--;
> +
> +            if (mapped_machine_irq[machine_irq] == 0) {
> +                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
> +                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
> +                           rc);

And here too. It would help to have plenty of detail on the error paths,
so that in the field it is easier to find out what went wrong (and to
match up the PIRQ with the GSI).
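
As a rough illustration of the level of detail meant here, a message could
carry the host device address, the machine IRQ and the pirq together. The
helper pt_fmt_irq_error and its message layout are hypothetical; in the
patch the same fields would simply be passed to PT_LOG.

```c
#include <stdio.h>

/* Hypothetical helper: builds an error line that names the host device
 * (domain:bus:slot.func), the machine IRQ, the pirq and the return code. */
static int pt_fmt_irq_error(char *buf, size_t n, const char *what,
                            unsigned domain, unsigned bus,
                            unsigned slot, unsigned func,
                            unsigned machine_irq, int pirq, int rc)
{
    return snprintf(buf, n,
                    "Error: %s failed for %04x:%02x:%02x.%x "
                    "(machine irq %u, pirq %d), rc=%d\n",
                    what, domain, bus, slot, func, machine_irq, pirq, rc);
}
```

With the device and both IRQ numbers in one line, a failure can be matched
against the host's interrupt routing without having to reproduce it.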


* Re: [Qemu-devel] [Xen-devel] [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
  2011-10-28 15:07   ` Anthony PERARD
@ 2011-11-10 21:53     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-10 21:53 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Fri, Oct 28, 2011 at 04:07:34PM +0100, Anthony PERARD wrote:
> From: Allen Kay <allen.m.kay@intel.com>
> 
> Signed-off-by: Allen Kay <allen.m.kay@intel.com>
> Signed-off-by: Guy Zana <guy@neocleus.com>
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  Makefile.target                      |    1 +
>  hw/xen_pci_passthrough.h             |    2 +
>  hw/xen_pci_passthrough_config_init.c | 2068 ++++++++++++++++++++++++++++++++++
>  3 files changed, 2071 insertions(+), 0 deletions(-)
>  create mode 100644 hw/xen_pci_passthrough_config_init.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index 36ea47d..c32c688 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -219,6 +219,7 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
>  
>  # Inter-VM PCI shared memory
>  CONFIG_IVSHMEM =
> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
> index 2d1979d..ebc04fd 100644
> --- a/hw/xen_pci_passthrough.h
> +++ b/hw/xen_pci_passthrough.h
> @@ -61,6 +61,8 @@ typedef int (*conf_byte_restore)
>  /* power state transition */
>  #define PT_FLAG_TRANSITING 0x0001
>  
> +#define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
> +
>  
>  typedef enum {
>      GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
> diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
> new file mode 100644
> index 0000000..4103b59
> --- /dev/null
> +++ b/hw/xen_pci_passthrough_config_init.c
> @@ -0,0 +1,2068 @@
> +/*
> + * Copyright (c) 2007, Neocleus Corporation.
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Alex Novik <alex@neocleus.com>
> + * Allen Kay <allen.m.kay@intel.com>
> + * Guy Zana <guy@neocleus.com>
> + *
> + * This file implements direct PCI assignment to a HVM guest
> + */
> +
> +#include "qemu-timer.h"
> +#include "xen_backend.h"
> +#include "xen_pci_passthrough.h"
> +
> +#define PT_MERGE_VALUE(value, data, val_mask) \
> +    (((value) & (val_mask)) | ((data) & ~(val_mask)))
> +
> +#define PT_INVALID_REG          0xFFFFFFFF      /* invalid register value */
> +
> +/* prototype */
> +
> +static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> +                                uint32_t real_offset);
> +static int pt_init_pci_config(XenPCIPassthroughState *s);
> +
> +
> +/* helper */
> +
> +/* A return value of 1 means the capability should NOT be exposed to guest. */
> +static int pt_hide_dev_cap(const HostPCIDevice *d, uint8_t grp_id)
> +{
> +    switch (grp_id) {
> +    case PCI_CAP_ID_EXP:
> +        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
> +         * Controller looks trivial, e.g., the PCI Express Capabilities
> +         * Register is 0. We should not try to expose it to guest.

Why not?
> +         */
> +        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
> +                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
> +            return 1;
> +        }
> +        break;
> +    }
> +    return 0;
> +}
> +
> +/*   find emulate register group entry */
> +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
> +{
> +    XenPTRegGroup *entry = NULL;
> +
> +    /* find register group entry */
> +    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
> +        /* check address */
> +        if ((entry->base_offset <= address)
> +            && ((entry->base_offset + entry->size) > address)) {
> +            return entry;
> +        }
> +    }
> +
> +    /* group entry not found */
> +    return NULL;
> +}
> +
> +/* find emulate register entry */
> +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
> +{
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +    uint32_t real_offset = 0;
> +
> +    /* find register entry */
> +    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
> +        reg = reg_entry->reg;
> +        real_offset = reg_grp->base_offset + reg->offset;
> +        /* check address */
> +        if ((real_offset <= address)
> +            && ((real_offset + reg->size) > address)) {
> +            return reg_entry;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +/* parse BAR */
> +static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
> +{
> +    PCIDevice *d = &s->dev;
> +    XenPTRegion *region = NULL;
> +    PCIIORegion *r;
> +    int index = 0;
> +
> +    /* check 64bit BAR */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if ((0 < index) && (index < PCI_ROM_SLOT)) {

This is a bit confusing. Can you make the index be on the same
side, like

if ((0 < index) && (PCI_ROM_SLOT > index))

or better:

if ((index < 0) && (index < PCI_ROM_SLOT))

um, which looks wrong. Should it be 'index > 0' ?

> +        int flags = s->real_device->io_regions[index - 1].flags;

Do we want to check the index - 1 to make sure it is not negative?
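
For reference, a sketch of the guard with index on the left of both
comparisons, assuming PCI_ROM_SLOT is 6 as in QEMU's hw/pci.h. The helper
name may_be_upper_half is made up. When the guard holds, index is at least
1, so index - 1 is at least 0 inside the block.

```c
#include <stdbool.h>

#define PCI_ROM_SLOT 6  /* assumed value, as in QEMU's hw/pci.h */

/* True when BAR `index` could be the upper half of a 64-bit BAR, that is,
 * when there is a preceding BAR (index - 1) whose flags can be checked. */
static bool may_be_upper_half(int index)
{
    return index > 0 && index < PCI_ROM_SLOT;
}
```

Note this only covers the index - 1 accesses inside the guarded block; it
says nothing about a negative index reaching the code after it.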

> +
> +        if ((flags & IORESOURCE_MEM) && (flags & IORESOURCE_MEM_64)) {
> +            region = &s->bases[index - 1];
> +            if (region->bar_flag != PT_BAR_FLAG_UPPER) {
> +                return PT_BAR_FLAG_UPPER;
> +            }
> +        }
> +    }
> +
> +    /* check unused BAR */
> +    r = &d->io_regions[index];
> +    if (r->size == 0) {
> +        return PT_BAR_FLAG_UNUSED;
> +    }
> +
> +    /* for ExpROM BAR */
> +    if (index == PCI_ROM_SLOT) {
> +        return PT_BAR_FLAG_MEM;
> +    }
> +
> +    /* check BAR I/O indicator */
> +    if (s->real_device->io_regions[index].flags & IORESOURCE_IO) {
> +        return PT_BAR_FLAG_IO;
> +    } else {
> +        return PT_BAR_FLAG_MEM;
> +    }
> +}
> +
> +
> +/****************
> + * general register functions
> + */
> +
> +/* register initialization function */
> +
> +static uint32_t pt_common_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return reg->init_val;
> +}
> +
> +/* Read register functions */
> +
> +static int pt_byte_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint8_t *value, uint8_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint8_t valid_emu_mask = 0;
> +
> +    /* emulate byte register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = 0;
> +
> +    /* emulate word register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_long_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint32_t *value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t valid_emu_mask = 0;
> +
> +    /* emulate long register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +   return 0;
> +}
> +
> +/* Write register functions */
> +
> +static int pt_byte_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint8_t *value, uint8_t dev_value,
> +                             uint8_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint8_t writable_mask = 0;
> +    uint8_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint16_t *value, uint16_t dev_value,
> +                             uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
> +static int pt_long_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint32_t *value, uint32_t dev_value,
> +                             uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
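
For anyone following along, this is how I read the mask logic in the read/write
helpers above. It is only a standalone sketch: PT_MERGE_VALUE is defined
elsewhere in the series, so I am assuming it takes the bits selected by the mask
from the first argument and the remaining bits from the second. The sketch_*
names and MERGE_VALUE are made up for illustration and are not in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics of PT_MERGE_VALUE: bits set in val_mask come from
 * value, all other bits come from data. */
#define MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

/* Mirrors pt_byte_reg_read: emulated bits are taken from the emulated
 * copy (emu_data), everything else from the real device value. */
static uint8_t sketch_byte_read(uint8_t dev_value, uint8_t emu_data,
                                uint8_t emu_mask, uint8_t valid_mask)
{
    uint8_t valid_emu_mask = emu_mask & valid_mask;

    return MERGE_VALUE(dev_value, emu_data, (uint8_t)~valid_emu_mask);
}

/* Mirrors pt_byte_reg_write: the guest value updates the emulated copy
 * only in emulated, non-read-only bits, and reaches the device only in
 * non-emulated bits. */
static void sketch_byte_write(uint8_t *value, uint8_t *emu_data,
                              uint8_t dev_value, uint8_t emu_mask,
                              uint8_t ro_mask, uint8_t valid_mask)
{
    uint8_t writable_mask = emu_mask & ~ro_mask & valid_mask;
    uint8_t throughable_mask = ~emu_mask & valid_mask;

    *emu_data = MERGE_VALUE(*value, *emu_data, writable_mask);
    *value = MERGE_VALUE(*value, dev_value, throughable_mask);
}
```

With emu_mask = 0xF0 and ro_mask = 0x80, a guest write of 0xFF changes only
bits 4-6 of the emulated copy, and only the low nibble is passed through to the
device. If my assumption about the macro is wrong, ignore this.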
> +
> +/* common restore register fonctions */
> +static int pt_byte_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                               uint32_t real_offset, uint8_t dev_value,
> +                               uint8_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_byte(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                               uint32_t real_offset, uint16_t dev_value,
> +                               uint16_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_word(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
> +
> +    return 0;
> +}
> +
> +
> +/* XenPTRegInfo declaration
> + * - only for emulated register (either a part or whole bit).
> + * - for passthrough register that need special behavior (like interacting with
> + *   other component), set emu_mask to all 0 and specify r/w func properly.
> + * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
> + */
> +
> +/********************
> + * Header Type0
> + */
> +
> +static uint32_t pt_vendor_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return s->real_device->vendor_id;
> +}
> +static uint32_t pt_device_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return s->real_device->device_id;
> +}
> +static uint32_t pt_status_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    int reg_field = 0;
> +
> +    /* find Header register group */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_CAPABILITY_LIST);
> +    if (reg_grp_entry) {
> +        /* find Capabilities Pointer register */
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_CAPABILITY_LIST);
> +        if (reg_entry) {
> +            /* check Capabilities Pointer register */
> +            if (reg_entry->data) {
> +                reg_field |= PCI_STATUS_CAP_LIST;
> +            } else {
> +                reg_field &= ~PCI_STATUS_CAP_LIST;
> +            }
> +        } else {
> +            hw_error("Internal error: Couldn't find pt_reg_tbl for "
> +                     "Capabilities Pointer register. I/O emulator exit.\n");

Yikes, hw_error() aborts here? Can we just return an error code instead?
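
Roughly what I have in mind, as a standalone sketch. The SketchReg type, the
sketch_* names and the int-return/out-parameter signature are hypothetical; the
real init callbacks return uint32_t, so this would mean changing the callback
prototype.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the emulated register entry (XenPTReg). */
typedef struct SketchReg {
    uint32_t data;
} SketchReg;

/* Same bit value as PCI_STATUS_CAP_LIST. */
#define SKETCH_STATUS_CAP_LIST 0x10

/* Returns 0 and stores the initial Status value in *out, or -ENOENT if
 * the Capabilities Pointer entry was not found. The caller can then
 * fail device initialisation instead of aborting the emulator. */
static int sketch_status_reg_init(const SketchReg *cap_ptr, uint32_t *out)
{
    if (cap_ptr == NULL) {
        return -ENOENT;
    }

    *out = cap_ptr->data ? SKETCH_STATUS_CAP_LIST : 0;
    return 0;
}
```

The caller would check the return value and unwind the device setup on failure.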

> +        }
> +    } else {
> +        hw_error("Internal error: Couldn't find pt_reg_grp_tbl for Header. "
> +                 "I/O emulator exit.\n");
> +    }
> +
> +    return reg_field;
> +}
> +static uint32_t pt_header_type_reg_init(XenPCIPassthroughState *s,
> +                                        XenPTRegInfo *reg,
> +                                        uint32_t real_offset)
> +{
> +    /* read PCI_HEADER_TYPE */
> +    return reg->init_val | 0x80;
> +}
> +
> +/* initialize Interrupt Pin register */
> +static uint32_t pt_irqpin_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return pci_read_intx(s);
> +}
> +
> +/* Command register */
> +static int pt_cmd_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                           uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = 0;
> +    uint16_t emu_mask = reg->emu_mask;
> +
> +    if (s->is_virtfn) {
> +        emu_mask |= PCI_COMMAND_MEMORY;
> +    }
> +
> +    /* emulate word register */
> +    valid_emu_mask = emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint16_t *value, uint16_t dev_value,
> +                            uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    uint16_t wr_value = *value;
> +    uint16_t emu_mask = reg->emu_mask;
> +
> +    if (s->is_virtfn) {
> +        emu_mask |= PCI_COMMAND_MEMORY;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~emu_mask & valid_mask;
> +
> +    if (*value & PCI_COMMAND_INTX_DISABLE) {
> +        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +    } else {
> +        if (s->machine_irq) {
> +            throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +        }
> +    }
> +
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* mapping BAR */
> +    pt_bar_mapping(s, wr_value & PCI_COMMAND_IO,
> +                   wr_value & PCI_COMMAND_MEMORY);
> +
> +    return 0;
> +}
> +static int pt_cmd_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint32_t real_offset, uint16_t dev_value,
> +                              uint16_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +    uint16_t restorable_mask = 0;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_word(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register
> +     * but do not include Fast Back-to-Back Enable bit.
> +     */
> +    restorable_mask = reg->emu_mask & ~PCI_COMMAND_FAST_BACK;
> +    *value = PT_MERGE_VALUE(*value, dev_value, restorable_mask);
> +
> +    if (!s->machine_irq) {
> +        *value |= PCI_COMMAND_INTX_DISABLE;
> +    } else {
> +        *value &= ~PCI_COMMAND_INTX_DISABLE;
> +    }
> +
> +    return 0;
> +}
> +
> +/* BAR */
> +#define PT_BAR_MEM_RO_MASK      0x0000000F      /* BAR ReadOnly mask(Memory) */
> +#define PT_BAR_MEM_EMU_MASK     0xFFFFFFF0      /* BAR emul mask(Memory) */
> +#define PT_BAR_IO_RO_MASK       0x00000003      /* BAR ReadOnly mask(I/O) */
> +#define PT_BAR_IO_EMU_MASK      0xFFFFFFFC      /* BAR emul mask(I/O) */
> +
> +static inline uint32_t base_address_with_flags(HostPCIIORegion *hr)
> +{
> +    if ((hr->flags & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
> +        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_IO_MASK);
> +    } else {
> +        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_MEM_MASK);
> +    }
> +}
> +
> +static uint32_t pt_bar_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> +                                uint32_t real_offset)
> +{
> +    int reg_field = 0;
> +    int index;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* set initial guest physical base address to -1 */
> +    s->bases[index].e_physbase = -1;

Please use a macro here instead of the raw -1 (the PCI_.. something define).
> +
> +    /* set BAR flag */
> +    s->bases[index].bar_flag = pt_bar_reg_parse(s, reg);
> +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED) {
> +        reg_field = PT_INVALID_REG;
> +    }
> +
> +    return reg_field;
> +}
> +static int pt_bar_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                           uint32_t *value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t valid_emu_mask = 0;
> +    uint32_t bar_emu_mask = 0;
> +    int index;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* use fixed-up value from kernel sysfs */
> +    *value = base_address_with_flags(&s->real_device->io_regions[index]);
> +
> +    /* set emulate mask depend on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* emulate BAR */
> +    valid_emu_mask = bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +   return 0;
> +}
> +static int pt_bar_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint32_t *value, uint32_t dev_value,
> +                            uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    PCIDevice *d = &s->dev;
> +    PCIIORegion *r;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    uint32_t bar_emu_mask = 0;
> +    uint32_t bar_ro_mask = 0;
> +    uint32_t new_addr, last_addr;
> +    uint32_t prev_offset;
> +    uint32_t r_size = 0;
> +    int index = 0;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    r = &d->io_regions[index];
> +    base = &s->bases[index];
> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
> +
> +    /* set emulate mask and read-only mask depend on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        bar_ro_mask = PT_BAR_MEM_RO_MASK | (r_size - 1);
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        bar_ro_mask = PT_BAR_IO_RO_MASK | (r_size - 1);
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        bar_ro_mask = 0;    /* all upper 32bit are R/W */
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = bar_emu_mask & ~bar_ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* check whether we need to update the virtual region address or not */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        /* nothing to do */
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        new_addr = cfg_entry->data;
> +        last_addr = new_addr + r_size - 1;
> +        /* check invalid address */
> +        if (last_addr <= new_addr || !new_addr || last_addr >= 0x10000) {

Please make a #define for 0x10000.
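
Something like the following, as a standalone sketch. SKETCH_IO_PORT_LIMIT and
sketch_io_bar_range_valid are made-up names; pick whatever fits the file. The
check itself is the same three-way test as in the quoted condition, just
inverted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name. The I/O port space is 64KB, so the last valid
 * port address is 0xFFFF. */
#define SKETCH_IO_PORT_LIMIT 0x10000u

/* True if [new_addr, new_addr + r_size - 1] is an acceptable I/O BAR
 * range: non-zero base, no wrap-around, and entirely below the limit.
 * Like the quoted code, a range whose last address is not above its
 * base (for example r_size == 1) is rejected. */
static bool sketch_io_bar_range_valid(uint32_t new_addr, uint32_t r_size)
{
    uint32_t last_addr = new_addr + r_size - 1;

    return !(last_addr <= new_addr || !new_addr ||
             last_addr >= SKETCH_IO_PORT_LIMIT);
}
```

That would also let the "check 64K range" test below use the same name.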

> +            /* check 64K range */
> +            if ((last_addr >= 0x10000) &&
> +                (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask))) {
> +                PT_LOG("Warning: Guest attempt to set Base Address "
> +                       "over the 64KB. [%02x:%02x.%x][Offset:%02xh]"
> +                       "[Address:%08xh][Size:%08xh]\n",
> +                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                       PCI_FUNC(d->devfn),
> +                       reg->offset, new_addr, r_size);
> +            }
> +            /* just remove mapping */
> +            r->addr = -1;
> +            goto exit;
> +        }
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        if (cfg_entry->data) {
> +            if (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask)) {
> +                PT_LOG("Warning: Guest attempt to set high MMIO Base Address. "
> +                       "Ignore mapping. "
> +                       "[%02x:%02x.%x][Offset:%02xh][High Address:%08xh]\n",
> +                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                       PCI_FUNC(d->devfn), reg->offset, cfg_entry->data);
> +            }
> +            /* clear lower address */
> +            d->io_regions[index-1].addr = -1;
> +        } else {
> +            /* find lower 32bit BAR */
> +            prev_offset = (reg->offset - 4);
> +            reg_grp_entry = pt_find_reg_grp(s, prev_offset);
> +            if (reg_grp_entry) {
> +                reg_entry = pt_find_reg(reg_grp_entry, prev_offset);
> +                if (reg_entry) {
> +                    /* restore lower address */
> +                    d->io_regions[index-1].addr = reg_entry->data;
> +                } else {
> +                    return -1;
> +                }
> +            } else {
> +                return -1;
> +            }
> +        }
> +
> +        /* never mapping the 'empty' upper region,
> +         * because we'll do it enough for the lower region.
> +         */
> +        r->addr = -1;
> +        goto exit;
> +    default:
> +        break;
> +    }
> +
> +    /* update the corresponding virtual region address */
> +    /*
> +     * When guest code tries to get block size of mmio, it will write all "1"s
> +     * into pci bar register. In this case, cfg_entry->data == writable_mask.
> +     * Especially for devices with large mmio, the value of writable_mask
> +     * is likely to be a guest physical address that has been mapped to ram
> +     * rather than mmio. Remapping this value to mmio should be prevented.
> +     */
> +
> +    if (cfg_entry->data != writable_mask) {
> +        r->addr = cfg_entry->data;
> +    }
> +
> +exit:
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* After BAR reg update, we need to remap BAR */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
> +    if (reg_grp_entry) {
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
> +        if (reg_entry) {
> +            pt_bar_mapping_one(s, index, reg_entry->data & PCI_COMMAND_IO,
> +                               reg_entry->data & PCI_COMMAND_MEMORY);
> +        }
> +    }
> +
> +    return 0;
> +}
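
On the "write all 1s" comment in pt_bar_reg_write: for readers who have not seen
BAR sizing before, here is a standalone sketch of the probe for a 32-bit memory
BAR. The sketch_* names are made up, and I am assuming a power-of-two size of at
least 16 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* The low 4 bits of a memory BAR are flag bits, not address bits. */
#define SKETCH_BAR_MEM_FLAG_MASK 0xFu

/* Value a memory BAR of the given size reads back after the guest
 * wrote `written`: address bits below the size are hardwired to zero
 * and the flag bits are read-only. */
static uint32_t sketch_mem_bar_readback(uint32_t written, uint32_t size,
                                        uint32_t flags)
{
    return (written & ~(size - 1)) | (flags & SKETCH_BAR_MEM_FLAG_MASK);
}

/* Guest-side decode of the size from the value read back after
 * writing all ones. */
static uint32_t sketch_mem_bar_size(uint32_t readback)
{
    return ~(readback & ~SKETCH_BAR_MEM_FLAG_MASK) + 1;
}
```

After such a probe the value held in the BAR is just the size mask, not a real
base address, which is why the quoted code skips updating r->addr when
cfg_entry->data == writable_mask.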
> +static int pt_bar_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint32_t real_offset, uint32_t dev_value,
> +                              uint32_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t bar_emu_mask = 0;
> +    int index = 0;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* use value from kernel sysfs */
> +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UPPER) {
> +        *value = s->real_device->io_regions[index - 1].base_addr >> 32;
> +    } else {
> +        *value = base_address_with_flags(&s->real_device->io_regions[index]);
> +    }
> +
> +    /* set emulate mask depend on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, bar_emu_mask);
> +
> +    return 0;
> +}
> +
> +/* write Exp ROM BAR */
> +static int pt_exp_rom_bar_reg_write(XenPCIPassthroughState *s,
> +                                    XenPTReg *cfg_entry, uint32_t *value,
> +                                    uint32_t dev_value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    PCIDevice *d = (PCIDevice *)&s->dev;
> +    PCIIORegion *r;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    pcibus_t r_size = 0;
> +    uint32_t bar_emu_mask = 0;
> +    uint32_t bar_ro_mask = 0;
> +
> +    r = &d->io_regions[PCI_ROM_SLOT];
> +    r_size = r->size;
> +    base = &s->bases[PCI_ROM_SLOT];
> +    /* align memory type resource size */
> +    pt_get_emul_size(base->bar_flag, r_size);
> +
> +    /* set emulate mask and read-only mask */
> +    bar_emu_mask = reg->emu_mask;
> +    bar_ro_mask = (reg->ro_mask | (r_size - 1)) & ~PCI_ROM_ADDRESS_ENABLE;
> +
> +    /* modify emulate register */
> +    writable_mask = ~bar_ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* update the corresponding virtual region address */
> +    /*
> +     * When guest code tries to get block size of mmio, it will write all "1"s
> +     * into pci bar register. In this case, cfg_entry->data == writable_mask.
> +     * Especially for devices with large mmio, the value of writable_mask
> +     * is likely to be a guest physical address that has been mapped to ram
> +     * rather than mmio. Remapping this value to mmio should be prevented.
> +     */
> +
> +    if (cfg_entry->data != writable_mask) {
> +        r->addr = cfg_entry->data;
> +    }
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* After BAR reg update, we need to remap BAR*/
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
> +    if (reg_grp_entry) {
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
> +        if (reg_entry) {
> +            pt_bar_mapping_one(s, PCI_ROM_SLOT,
> +                               reg_entry->data & PCI_COMMAND_IO,
> +                               reg_entry->data & PCI_COMMAND_MEMORY);
> +        }
> +    }
> +
> +    return 0;
> +}
> +/* restore ROM BAR */
> +static int pt_exp_rom_bar_reg_restore(XenPCIPassthroughState *s,
> +                                      XenPTReg *cfg_entry,
> +                                      uint32_t real_offset,
> +                                      uint32_t dev_value, uint32_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +
> +    /* use value from kernel sysfs */
> +    *value =
> +        PT_MERGE_VALUE(host_pci_get_long(s->real_device, PCI_ROM_ADDRESS),
> +                       dev_value, reg->emu_mask);
> +    return 0;
> +}
> +
> +/* Header Type0 reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_header0_tbl[] = {
> +    /* Vendor ID reg */
> +    {
> +        .offset     = PCI_VENDOR_ID,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_vendor_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Device ID reg */
> +    {
> +        .offset     = PCI_DEVICE_ID,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_device_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Command reg */
> +    {
> +        .offset     = PCI_COMMAND,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xF880,
> +        .emu_mask   = 0x0740,
> +        .init       = pt_common_reg_init,
> +        .u.w.read   = pt_cmd_reg_read,
> +        .u.w.write  = pt_cmd_reg_write,
> +        .u.w.restore  = pt_cmd_reg_restore,
> +    },
> +    /* Capabilities Pointer reg */
> +    {
> +        .offset     = PCI_CAPABILITY_LIST,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Status reg */
> +    /* use emulated Cap Ptr value to initialize,
> +     * so need to be declared after Cap Ptr reg
> +     */
> +    {
> +        .offset     = PCI_STATUS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x06FF,
> +        .emu_mask   = 0x0010,
> +        .init       = pt_status_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Cache Line Size reg */
> +    {
> +        .offset     = PCI_CACHE_LINE_SIZE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = pt_byte_reg_restore,
> +    },
> +    /* Latency Timer reg */
> +    {
> +        .offset     = PCI_LATENCY_TIMER,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = pt_byte_reg_restore,
> +    },
> +    /* Header Type reg */
> +    {
> +        .offset     = PCI_HEADER_TYPE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0x00,
> +        .init       = pt_header_type_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Interrupt Line reg */
> +    {
> +        .offset     = PCI_INTERRUPT_LINE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Interrupt Pin reg */
> +    {
> +        .offset     = PCI_INTERRUPT_PIN,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_irqpin_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* BAR 0 reg */
> +    /* mask of BAR need to be decided later, depends on IO/MEM type */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_0,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 1 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_1,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 2 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_2,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 3 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_3,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 4 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_4,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 5 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_5,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* Expansion ROM BAR reg */
> +    {
> +        .offset     = PCI_ROM_ADDRESS,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x000007FE,
> +        .emu_mask   = 0xFFFFF800,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_exp_rom_bar_reg_write,
> +        .u.dw.restore = pt_exp_rom_bar_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*********************************
> + * Vital Product Data Capability
> + */
> +
> +/* Vital Product Data Capability Structure reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_vpd_tbl[] = {
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/**************************************
> + * Vendor Specific Capability
> + */
> +
> +/* Vendor Specific Capability Structure reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_vendor_tbl[] = {
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*****************************
> + * PCI Express Capability
> + */
> +
> +/* initialize Link Control register */
> +static uint32_t pt_linkctrl_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint8_t cap_ver = 0;
> +    uint8_t dev_type = 0;
> +
> +    /* TODO maybe better to use fonction from hw/pcie.c */

s/fonction/function/
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
> +                             + PCI_EXP_FLAGS)
> +                & PCI_EXP_FLAGS_TYPE) >> 4;
> +
> +    /* no need to initialize in case of Root Complex Integrated Endpoint
> +     * with cap_ver 1.x

Why? Please explain the reason in the comment.

> +     */
> +    if ((dev_type == PCI_EXP_TYPE_RC_END) && (cap_ver == 1)) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* initialize Device Control 2 register */
> +static uint32_t pt_devctrl2_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint8_t cap_ver = 0;
> +
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +
> +    /* no need to initialize in case of cap_ver 1.x */
> +    if (cap_ver == 1) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* initialize Link Control 2 register */
> +static uint32_t pt_linkctrl2_reg_init(XenPCIPassthroughState *s,
> +                                      XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    int reg_field = 0;
> +    uint8_t cap_ver = 0;
> +
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;

The indentation looks odd here (maybe a tab issue), but it might just be my mailer.

> +
> +    /* no need to initialize in case of cap_ver 1.x */
> +    if (cap_ver == 1) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    /* set Supported Link Speed */
> +    reg_field |= PCI_EXP_LNKCAP_SLS &
> +        pci_get_byte(s->dev.config + real_offset - reg->offset
> +                     + PCI_EXP_LNKCAP);
> +
> +    return reg_field;
> +}
> +
> +/* PCI Express Capability Structure reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_pcie_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Device Capabilities reg */
> +    {
> +        .offset     = PCI_EXP_DEVCAP,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x1FFCFFFF,
> +        .emu_mask   = 0x10000000,
> +        .init       = pt_common_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_long_reg_write,
> +        .u.dw.restore = NULL,
> +    },
> +    /* Device Control reg */
> +    {
> +        .offset     = PCI_EXP_DEVCTL,
> +        .size       = 2,
> +        .init_val   = 0x2810,
> +        .ro_mask    = 0x8400,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_common_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Link Control reg */
> +    {
> +        .offset     = PCI_EXP_LNKCTL,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFC34,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_linkctrl_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Device Control 2 reg */
> +    {
> +        .offset     = 0x28,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFE0,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_devctrl2_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Link Control 2 reg */
> +    {
> +        .offset     = 0x30,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xE040,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_linkctrl2_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*********************************
> + * Power Management Capability
> + */
> +
> +/* initialize Power Management Capabilities register */
> +static uint32_t pt_pmc_reg_init(XenPCIPassthroughState *s,
> +                                XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +
> +    if (!s->power_mgmt) {
> +        return reg->init_val;
> +    }
> +
> +    /* set Power Management Capabilities register */
> +    s->pm_state->pmc_field = pci_get_word(d->config + real_offset);
> +
> +    return reg->init_val;
> +}
> +/* initialize PCI Power Management Control/Status register */
> +static uint32_t pt_pmcsr_reg_init(XenPCIPassthroughState *s,
> +                                  XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t cap_ver  = 0;
> +
> +    if (!s->power_mgmt) {
> +        return reg->init_val;
> +    }
> +
> +    /* check PCI Power Management support version */
> +    cap_ver = s->pm_state->pmc_field & PCI_PM_CAP_VER_MASK;
> +
> +    if (cap_ver > 2) {
> +        /* set No Soft Reset */
> +        s->pm_state->no_soft_reset =
> +            pci_get_byte(d->config + real_offset) & PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    /* wake up real physical device */
> +    switch (host_pci_get_word(s->real_device, real_offset)
> +            & PCI_PM_CTRL_STATE_MASK) {
> +    case 0:
> +        break;
> +    case 1:
> +        PT_LOG("Power state transition D1 -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        break;
> +    case 2:
> +        PT_LOG("Power state transition D2 -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        usleep(200);

Heheh..
> +        break;
> +    case 3:
> +        PT_LOG("Power state transition D3hot -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        usleep(10 * 1000);
> +        pt_init_pci_config(s);
> +        break;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* read Power Management Control/Status register */
> +static int pt_pmcsr_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = reg->emu_mask;
> +
> +    if (!s->power_mgmt) {
> +        valid_emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    valid_emu_mask = valid_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +/* reset Interrupt and I/O resource  */
> +static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    PCIIORegion *r;
> +    int i = 0;
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    /* unbind INTx */
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    e_intx = pci_intx(s);
> +
> +    if (s->machine_irq) {
> +        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
> +                                    PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
> +            PT_LOG("Error: Unbinding of interrupt failed!\n");
> +        }
> +    }
> +
> +    /* clear all virtual region address */
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        r = &d->io_regions[i];
> +        r->addr = -1;
> +    }
> +
> +    /* unmapping BAR */
> +    pt_bar_mapping(s, 0, 0);
> +}
> +/* check power state transition */
> +static int check_power_state(XenPCIPassthroughState *s)
> +{
> +    XenPTPM *pm_state = s->pm_state;
> +    PCIDevice *d = &s->dev;
> +    uint16_t read_val = 0;
> +    uint16_t cur_state = 0;
> +
> +    /* get current power state */
> +    read_val = host_pci_get_word(s->real_device,
> +                                 pm_state->pm_base + PCI_PM_CTRL);
> +    cur_state = read_val & PCI_PM_CTRL_STATE_MASK;
> +
> +    if (pm_state->req_state != cur_state) {
> +        PT_LOG("Error: Failed to change power state. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, cur_state);
> +        return -1;
> +    }
> +    return 0;
> +}
> +/* write Power Management Control/Status register */
> +static void pt_from_d3hot_to_d0_with_reset(void *opaque)
> +{
> +    XenPCIPassthroughState *s = opaque;
> +    XenPTPM *pm_state = s->pm_state;
> +    int ret = 0;
> +
> +    /* check power state */
> +    ret = check_power_state(s);
> +
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    pt_init_pci_config(s);
> +
> +out:
> +    /* power state transition flags off */
> +    pm_state->flags &= ~PT_FLAG_TRANSITING;
> +
> +    qemu_free_timer(pm_state->pm_timer);
> +    pm_state->pm_timer = NULL;
> +}
> +static void pt_default_power_transition(void *opaque)
> +{
> +    XenPCIPassthroughState *ptdev = opaque;
> +    XenPTPM *pm_state = ptdev->pm_state;
> +
> +    /* check power state */
> +    check_power_state(ptdev);
> +
> +    /* power state transition flags off */
> +    pm_state->flags &= ~PT_FLAG_TRANSITING;
> +
> +    qemu_free_timer(pm_state->pm_timer);
> +    pm_state->pm_timer = NULL;
> +}
> +static int pt_pmcsr_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint16_t *value, uint16_t dev_value,
> +                              uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +    uint16_t emu_mask = reg->emu_mask;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    XenPTPM *pm_state = s->pm_state;
> +
> +    if (!s->power_mgmt) {
> +        emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    if (!s->power_mgmt) {
> +        return 0;
> +    }
> +
> +    /* set I/O device power state */
> +    pm_state->cur_state = dev_value & PCI_PM_CTRL_STATE_MASK;
> +
> +    /* set Guest requested PowerState */
> +    pm_state->req_state = *value & PCI_PM_CTRL_STATE_MASK;
> +
> +    /* check power state transition or not */
> +    if (pm_state->cur_state == pm_state->req_state) {
> +        /* not power state transition */
> +        return 0;
> +    }
> +
> +    /* check enable power state transition */
> +    if ((pm_state->req_state != 0) &&
> +        (pm_state->cur_state > pm_state->req_state)) {
> +        PT_LOG("Error: Invalid power transition. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, pm_state->cur_state);
> +
> +        return 0;
> +    }
> +
> +    /* check if this device supports the requested power state */
> +    if (((pm_state->req_state == 1) && !(pm_state->pmc_field & PCI_PM_CAP_D1))
> +        || ((pm_state->req_state == 2) &&
> +            !(pm_state->pmc_field & PCI_PM_CAP_D2))) {
> +        PT_LOG("Error: Invalid power transition. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, pm_state->cur_state);
> +
> +        return 0;
> +    }
> +
> +    /* in case of transition related to D3hot, it's necessary to wait 10 ms.
> +     * But because writing to register will be performed later on actually,
> +     * don't start QEMUTimer right now, just alloc and init QEMUTimer here.
> +     */
> +    if ((pm_state->cur_state == 3) || (pm_state->req_state == 3)) {
> +        if (pm_state->req_state == 0) {
> +            /* alloc and init QEMUTimer */
> +            if (!pm_state->no_soft_reset) {
> +                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                    pt_from_d3hot_to_d0_with_reset, s);
> +
> +                /* reset Interrupt and I/O resource mapping */
> +                pt_reset_interrupt_and_io_mapping(s);
> +            } else {
> +                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                                        pt_default_power_transition, s);
> +            }
> +        } else {
> +            /* alloc and init QEMUTimer */
> +            pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                pt_default_power_transition, s);
> +        }
> +
> +        /* set power state transition delay */
> +        pm_state->pm_delay = 10;
> +
> +        /* power state transition flags on */
> +        pm_state->flags |= PT_FLAG_TRANSITING;
> +    }
> +    /* in case of transition related to D0, D1 and D2,
> +     * no need to use QEMUTimer.
> +     * So, we perform writing to register here and then read it back.
> +     */
> +    else {
> +        /* write power state to I/O device register */
> +        host_pci_set_word(s->real_device, pm_state->pm_base + PCI_PM_CTRL,
> +                          *value);
> +
> +        /* in case of transition related to D2,
> +         * it's necessary to wait 200 usec.
> +         * Because QEMUTimer does not support microsecond units right now,
> +         * we wait here ourselves.
> +         */
> +        if ((pm_state->cur_state == 2) || (pm_state->req_state == 2)) {
> +            usleep(200);
> +        }
> +
> +        /* check power state */
> +        check_power_state(s);
> +
> +        /* recreate value for writing to I/O device register */
> +        *value = host_pci_get_word(s->real_device,
> +                                   pm_state->pm_base + PCI_PM_CTRL);
> +    }
> +
> +    return 0;
> +}
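
A side note on the mask arithmetic in pt_pmcsr_reg_write() above. The following is a minimal standalone sketch, not code from the patch. It assumes PT_MERGE_VALUE(value, data, mask) takes bits from value where mask is set and from data elsewhere, which is consistent with how the macro is used here. MERGE_VALUE, struct pmcsr_write_result and split_guest_write() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics: bits from `value` where `mask` is set,
 * bits from `data` everywhere else. */
#define MERGE_VALUE(value, data, mask) \
    (((value) & (mask)) | ((data) & ~(mask)))

struct pmcsr_write_result {
    uint16_t emulated;   /* new content of the emulated register copy */
    uint16_t to_device;  /* value that would go to the real device */
};

/* Split one guest write into its emulated and passed-through parts. */
static struct pmcsr_write_result
split_guest_write(uint16_t guest_val, uint16_t emu_data, uint16_t dev_val,
                  uint16_t emu_mask, uint16_t ro_mask, uint16_t valid_mask)
{
    struct pmcsr_write_result r;
    /* emulated bits the guest is allowed to change */
    uint16_t writable_mask = emu_mask & ~ro_mask & valid_mask;
    /* non-emulated bits, forwarded to the device */
    uint16_t throughable_mask = ~emu_mask & valid_mask;

    /* the two masks are disjoint by construction */
    assert((writable_mask & throughable_mask) == 0);

    r.emulated = MERGE_VALUE(guest_val, emu_data, writable_mask);
    r.to_device = MERGE_VALUE(guest_val, dev_val, throughable_mask);
    return r;
}
```

With the PMCSR table values from this patch (emu_mask 0x8100, ro_mask 0xE1FC), writable_mask comes out as 0, so the emulated copy never changes on a guest write and only the non-emulated bits reach the device.
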
> +
> +/* restore Power Management Control/Status register */
> +static int pt_pmcsr_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                                uint32_t real_offset, uint16_t dev_value,
> +                                uint16_t *value)
> +{
> +    /* create value for restoring to I/O device register
> +     * No need to restore, just clear PME Enable and PME Status bit
> +     * Note: register type of PME Status bit is RW1C, so clear by writing 1b
> +     */
> +    *value = (dev_value & ~PCI_PM_CTRL_PME_ENABLE) | PCI_PM_CTRL_PME_STATUS;
> +
> +    return 0;
> +}
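
To make the RW1C remark in the comment above concrete, here is a small sketch, again not code from the patch. The bit values mirror PCI_PM_CTRL_PME_ENABLE (0x0100) and PCI_PM_CTRL_PME_STATUS (0x8000) from pci_regs.h. model_pmcsr_write() is a hypothetical, simplified device model that treats every bit other than PME Status as plain read/write, which a real device does not do.

```c
#include <assert.h>
#include <stdint.h>

#define PM_CTRL_PME_ENABLE 0x0100 /* read/write */
#define PM_CTRL_PME_STATUS 0x8000 /* RW1C: writing 1 clears, writing 0 keeps */

/* Value the restore handler computes from the current device value. */
static uint16_t pmcsr_restore_value(uint16_t dev_value)
{
    uint16_t v = (dev_value & ~PM_CTRL_PME_ENABLE) | PM_CTRL_PME_STATUS;

    /* PME Enable is always dropped from the restored value */
    assert((v & PM_CTRL_PME_ENABLE) == 0);
    return v;
}

/* Simplified model of how the device reacts to a PMCSR write. */
static uint16_t model_pmcsr_write(uint16_t current, uint16_t written)
{
    uint16_t rw_bits = written & (uint16_t)~PM_CTRL_PME_STATUS;
    uint16_t status = current & PM_CTRL_PME_STATUS;

    if (written & PM_CTRL_PME_STATUS) {
        status = 0; /* write-1-to-clear */
    }
    return rw_bits | status;
}
```

Under this model, writing the restore value back leaves both PME Enable and PME Status cleared, whereas writing 0 to the status bit would leave a pending PME Status untouched.
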
> +
> +
> +/* Power Management Capability reg static information table */
> +static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Power Management Capabilities reg */
> +    {
> +        .offset     = PCI_CAP_FLAGS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xF9C8,
> +        .init       = pt_pmc_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* PCI Power Management Control/Status reg */
> +    {
> +        .offset     = PCI_PM_CTRL,
> +        .size       = 2,
> +        .init_val   = 0x0008,
> +        .ro_mask    = 0xE1FC,
> +        .emu_mask   = 0x8100,
> +        .init       = pt_pmcsr_reg_init,
> +        .u.w.read   = pt_pmcsr_reg_read,
> +        .u.w.write  = pt_pmcsr_reg_write,
> +        .u.w.restore  = pt_pmcsr_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/****************************
> + * Capabilities
> + */
> +
> +/* AER register operations */
> +
> +static void aer_save_one_register(XenPCIPassthroughState *s, int offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint32_t aer_base = s->pm_state->aer_base;
> +    uint32_t val = 0;
> +
> +    val = host_pci_get_long(s->real_device, aer_base + offset);
> +    pci_set_long(d->config + aer_base + offset, val);
> +}
> +static void pt_aer_reg_save(XenPCIPassthroughState *s)
> +{
> +    /* after reset, following register values should be restored.
> +     * So, save them.
> +     */
> +    aer_save_one_register(s, PCI_ERR_UNCOR_MASK);
> +    aer_save_one_register(s, PCI_ERR_UNCOR_SEVER);
> +    aer_save_one_register(s, PCI_ERR_COR_MASK);
> +    aer_save_one_register(s, PCI_ERR_CAP);
> +}
> +static void aer_restore_one_register(XenPCIPassthroughState *s, int offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint32_t aer_base = s->pm_state->aer_base;
> +    uint32_t config = 0;
> +
> +    config = pci_get_long(d->config + aer_base + offset);
> +    host_pci_set_long(s->real_device, aer_base + offset, config);
> +}
> +static void pt_aer_reg_restore(XenPCIPassthroughState *s)
> +{
> +    /* the following registers should be reconfigured to correct values
> +     * after reset. restore them.
> +     * other registers should not be reconfigured after reset
> +     * if there is no reason
> +     */
> +    aer_restore_one_register(s, PCI_ERR_UNCOR_MASK);
> +    aer_restore_one_register(s, PCI_ERR_UNCOR_SEVER);
> +    aer_restore_one_register(s, PCI_ERR_COR_MASK);
> +    aer_restore_one_register(s, PCI_ERR_CAP);
> +}
> +
> +/* capability structure register group size functions */
> +
> +static uint8_t pt_reg_grp_size_init(XenPCIPassthroughState *s,
> +                                    const XenPTRegGroupInfo *grp_reg,
> +                                    uint32_t base_offset)
> +{
> +    return grp_reg->grp_size;
> +}
> +/* get Power Management Capability Structure register group size */
> +static uint8_t pt_pm_size_init(XenPCIPassthroughState *s,
> +                               const XenPTRegGroupInfo *grp_reg,
> +                               uint32_t base_offset)
> +{
> +    if (!s->power_mgmt) {
> +        return grp_reg->grp_size;
> +    }
> +
> +    s->pm_state = g_malloc0(sizeof (XenPTPM));
> +
> +    /* set Power Management Capability base offset */
> +    s->pm_state->pm_base = base_offset;
> +
> +    /* find AER register and set AER Capability base offset */
> +    s->pm_state->aer_base = host_pci_find_ext_cap_offset(s->real_device,
> +                                                         PCI_EXT_CAP_ID_ERR);
> +
> +    /* save AER register */
> +    if (s->pm_state->aer_base) {
> +        pt_aer_reg_save(s);
> +    }
> +
> +    return grp_reg->grp_size;
> +}
> +/* get Vendor Specific Capability Structure register group size */
> +static uint8_t pt_vendor_size_init(XenPCIPassthroughState *s,
> +                                   const XenPTRegGroupInfo *grp_reg,
> +                                   uint32_t base_offset)
> +{
> +    return pci_get_byte(s->dev.config + base_offset + 0x02);
> +}
> +/* get PCI Express Capability Structure register group size */
> +static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
> +                                 const XenPTRegGroupInfo *grp_reg,
> +                                 uint32_t base_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t exp_flag = 0;
> +    uint16_t type = 0;
> +    uint16_t version = 0;
> +    uint8_t pcie_size = 0;
> +
> +    exp_flag = pci_get_word(d->config + base_offset + PCI_EXP_FLAGS);
> +    type = (exp_flag & PCI_EXP_FLAGS_TYPE) >> 4;
> +    version = exp_flag & PCI_EXP_FLAGS_VERS;
> +
> +    /* calculate size depending on capability version and device/port type */
> +    /* in case of PCI Express Base Specification Rev 1.x */
> +    if (version == 1) {
> +        /* The PCI Express Capabilities, Device Capabilities, and Device
> +         * Status/Control registers are required for all PCI Express devices.
> +         * The Link Capabilities and Link Status/Control are required for all
> +         * Endpoints that are not Root Complex Integrated Endpoints. Endpoints
> +         * are not required to implement registers other than those listed
> +         * above and terminate the capability structure.
> +         */
> +        switch (type) {
> +        case PCI_EXP_TYPE_ENDPOINT:
> +        case PCI_EXP_TYPE_LEG_END:
> +            pcie_size = 0x14;
> +            break;
> +        case PCI_EXP_TYPE_RC_END:
> +            /* has no link */
> +            pcie_size = 0x0C;
> +            break;
> +        /* only EndPoint passthrough is supported */
> +        case PCI_EXP_TYPE_ROOT_PORT:
> +        case PCI_EXP_TYPE_UPSTREAM:
> +        case PCI_EXP_TYPE_DOWNSTREAM:
> +        case PCI_EXP_TYPE_PCI_BRIDGE:
> +        case PCI_EXP_TYPE_PCIE_BRIDGE:
> +        case PCI_EXP_TYPE_RC_EC:
> +        default:
> +            hw_error("Internal error: Unsupported device/port type[%d]. "
> +                     "I/O emulator exit.\n", type);
> +        }
> +    }
> +    /* in case of PCI Express Base Specification Rev 2.0 */
> +    else if (version == 2) {
> +        switch (type) {
> +        case PCI_EXP_TYPE_ENDPOINT:
> +        case PCI_EXP_TYPE_LEG_END:
> +        case PCI_EXP_TYPE_RC_END:
> +            /* For Functions that do not implement the registers,
> +             * these spaces must be hardwired to 0b.
> +             */
> +            pcie_size = 0x3C;
> +            break;
> +        /* only EndPoint passthrough is supported */
> +        case PCI_EXP_TYPE_ROOT_PORT:
> +        case PCI_EXP_TYPE_UPSTREAM:
> +        case PCI_EXP_TYPE_DOWNSTREAM:
> +        case PCI_EXP_TYPE_PCI_BRIDGE:
> +        case PCI_EXP_TYPE_PCIE_BRIDGE:
> +        case PCI_EXP_TYPE_RC_EC:
> +        default:
> +            hw_error("Internal error: Unsupported device/port type[%d]. "
> +                     "I/O emulator exit.\n", type);
> +        }
> +    } else {
> +        hw_error("Internal error: Unsupported capability version[%d]. "
> +                 "I/O emulator exit.\n", version);
> +    }
> +
> +    return pcie_size;
> +}
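
For reference, the size selection in pt_pcie_size_init() above can be sketched standalone as follows. This is an illustration under stated assumptions, not the patch's code: the field masks and type numbers are the usual pci_regs.h values (version in bits 3:0, device/port type in bits 7:4; endpoint 0x0, legacy endpoint 0x1, root complex integrated endpoint 0x9), and the hypothetical pcie_cap_size() returns 0 for unsupported combinations where the patch calls hw_error().

```c
#include <assert.h>
#include <stdint.h>

#define EXP_FLAGS_VERS 0x000f /* capability version */
#define EXP_FLAGS_TYPE 0x00f0 /* device/port type */

#define EXP_TYPE_ENDPOINT 0x0
#define EXP_TYPE_LEG_END  0x1
#define EXP_TYPE_RC_END   0x9

/* Return the PCIe capability structure size, or 0 if unsupported. */
static uint8_t pcie_cap_size(uint16_t exp_flags)
{
    unsigned version = exp_flags & EXP_FLAGS_VERS;
    unsigned type = (exp_flags & EXP_FLAGS_TYPE) >> 4;
    int is_endpoint = (type == EXP_TYPE_ENDPOINT || type == EXP_TYPE_LEG_END);
    int is_rc_endpoint = (type == EXP_TYPE_RC_END);

    if (version == 1) {
        if (is_endpoint) {
            return 0x14; /* up to and including Link Status/Control */
        }
        if (is_rc_endpoint) {
            return 0x0C; /* no link registers */
        }
    } else if (version == 2) {
        if (is_endpoint || is_rc_endpoint) {
            return 0x3C; /* full v2 layout, unimplemented parts read 0 */
        }
    }
    return 0; /* ports, bridges, unknown versions */
}
```
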
> +
> +static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
> +    /* Header Type0 reg group */
> +    {
> +        .grp_id      = 0xFF,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x40,
> +        .size_init   = pt_reg_grp_size_init,
> +        .emu_reg_tbl = pt_emu_reg_header0_tbl,
> +    },
> +    /* PCI PowerManagement Capability reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_PM,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = PCI_PM_SIZEOF,
> +        .size_init   = pt_pm_size_init,
> +        .emu_reg_tbl = pt_emu_reg_pm_tbl,
> +    },
> +    /* AGP Capability Structure reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_AGP,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x30,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Vital Product Data Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_VPD,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x08,
> +        .size_init   = pt_reg_grp_size_init,
> +        .emu_reg_tbl = pt_emu_reg_vpd_tbl,
> +    },
> +    /* Slot Identification reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SLOTID,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x04,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* PCI-X Capabilities List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_PCIX,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x18,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Vendor Specific Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_VNDR,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_vendor_size_init,
> +        .emu_reg_tbl = pt_emu_reg_vendor_tbl,
> +    },
> +    /* SHPC Capability List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SHPC,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x08,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Subsystem ID and Subsystem Vendor ID Capability List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SSVID,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x08,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* AGP 8x Capability Structure reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_AGP3,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x30,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* PCI Express Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_EXP,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_pcie_size_init,
> +        .emu_reg_tbl = pt_emu_reg_pcie_tbl,
> +    },
> +    {
> +        .grp_size = 0,
> +    },
> +};
> +
> +/* initialize Capabilities Pointer or Next Pointer register */
> +static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s,
> +                                XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    /* uint32_t reg_field = (uint32_t)s->dev.config[real_offset]; */
> +    uint32_t reg_field = pci_get_byte(s->dev.config + real_offset);
> +    int i;
> +
> +    /* find capability offset */
> +    while (reg_field) {
> +        for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
> +            if (pt_hide_dev_cap(s->real_device,
> +                                pt_emu_reg_grp_tbl[i].grp_id)) {
> +                continue;
> +            }
> +            if (pt_emu_reg_grp_tbl[i].grp_id == s->dev.config[reg_field]) {
> +                if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
> +                    goto out;
> +                }
> +                /* ignore the 0 hardwired capability, find next one */
> +                break;
> +            }
> +        }
> +        /* next capability */
> +        /* reg_field = (uint32_t)s->dev.config[reg_field + 1]; */
> +        reg_field = pci_get_byte(s->dev.config + reg_field + 1);
> +    }
> +
> +out:
> +    return reg_field;
> +}
> +
> +
> +/*************
> + * Main
> + */
> +
> +/* restore a part of I/O device register */
> +static void pt_config_restore(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +    uint32_t real_offset = 0;
> +    uint32_t read_val = 0;
> +    uint32_t val = 0;
> +    int ret = 0;
> +
> +    /* find emulate register group entry */
> +    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
> +        /* find emulate register entry */
> +        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
> +            reg = reg_entry->reg;
> +
> +            /* check whether restoring is needed */
> +            if (!reg->u.b.restore) {
> +                continue;
> +            }
> +
> +            real_offset = reg_grp_entry->base_offset + reg->offset;
> +
> +            /* read I/O device register value */
> +            ret = host_pci_get_block(s->real_device, real_offset,
> +                                     (uint8_t *)&read_val, reg->size);
> +
> +            if (!ret) {
> +                PT_LOG("Error: pci_read_block failed. "
> +                       "return value[%d].\n", ret);
> +                memset(&read_val, 0xff, reg->size);
> +            }
> +
> +            val = 0;
> +
> +            /* restore based on register size */
> +            switch (reg->size) {
> +            case 1:
> +                /* byte register */
> +                ret = reg->u.b.restore(s, reg_entry, real_offset,
> +                                       (uint8_t)read_val, (uint8_t *)&val);
> +                break;
> +            case 2:
> +                /* word register */
> +                ret = reg->u.w.restore(s, reg_entry, real_offset,
> +                                       (uint16_t)read_val, (uint16_t *)&val);
> +                break;
> +            case 4:
> +                /* double word register */
> +                ret = reg->u.dw.restore(s, reg_entry, real_offset,
> +                                        (uint32_t)read_val, (uint32_t *)&val);
> +                break;
> +            }
> +
> +            /* restoring error */
> +            if (ret < 0) {
> +                hw_error("Internal error: Invalid restoring "
> +                         "return value[%d]. I/O emulator exit.\n", ret);
> +            }
> +
> +            PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                          pci_bus_num(s->dev.bus), PCI_SLOT(s->dev.devfn),
> +                          PCI_FUNC(s->dev.devfn),
> +                          real_offset, val, reg->size);
> +
> +            ret = host_pci_set_block(s->real_device, real_offset,
> +                                     (uint8_t *)&val, reg->size);
> +
> +            if (!ret) {
> +                PT_LOG("Error: pci_write_block failed. "
> +                       "return value[%d].\n", ret);
> +            }
> +        }
> +    }
> +
> +    /* if AER supported, restore it */
> +    if (s->pm_state->aer_base) {
> +        pt_aer_reg_restore(s);
> +    }
> +}
> +/* reinitialize all emulate registers */
> +static void pt_config_reinit(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +
> +    /* find emulate register group entry */
> +    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
> +        /* find emulate register entry */
> +        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
> +            reg = reg_entry->reg;
> +            if (reg->init) {
> +                /* initialize emulate register */
> +                reg_entry->data =
> +                    reg->init(s, reg_entry->reg,
> +                              reg_grp_entry->base_offset + reg->offset);
> +            }
> +        }
> +    }
> +}
> +
> +static int pt_init_pci_config(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    int ret = 0;
> +
> +    PT_LOG("Reinitialize PCI configuration registers due to power state"
> +           " transition with internal reset. [%02x:%02x.%x]\n",
> +           pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));
> +
> +    /* restore a part of I/O device register */
> +    pt_config_restore(s);
> +
> +    /* reinitialize all emulate register */
> +    pt_config_reinit(s);
> +
> +    /* rebind machine_irq to device */
> +    if (s->machine_irq != 0) {
> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> +        uint8_t e_intx = pci_intx(s);
> +
> +        ret = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq, 0,
> +                                        e_device, e_intx);
> +        if (ret < 0) {
> +            PT_LOG("Error: Rebinding of interrupt failed! ret=%d\n", ret);
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static uint8_t find_cap_offset(XenPCIPassthroughState *s, uint8_t cap)
> +{
> +    int id;
> +    int max_cap = 48;
> +    int pos = PCI_CAPABILITY_LIST;
> +    int status;
> +
> +    status = host_pci_get_byte(s->real_device, PCI_STATUS);
> +    if ((status & PCI_STATUS_CAP_LIST) == 0) {
> +        return 0;
> +    }
> +
> +    while (max_cap--) {
> +        pos = host_pci_get_byte(s->real_device, pos);
> +        if (pos < 0x40) {
> +            break;
> +        }
> +
> +        pos &= ~3;
> +        id = host_pci_get_byte(s->real_device, pos + PCI_CAP_LIST_ID);
> +
> +        if (id == 0xff) {
> +            break;
> +        }
> +        if (id == cap) {
> +            return pos;
> +        }
> +
> +        pos += PCI_CAP_LIST_NEXT;
> +    }
> +    return 0;
> +}
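
The capability walk in find_cap_offset() above follows the usual linked-list pattern. Below is a minimal sketch of the same walk over an in-memory copy of the 256-byte config space instead of host_pci_get_byte(). find_cap() and the CFG_* names are hypothetical; the offsets are the standard ones (status at 0x06, capability pointer at 0x34, ID at +0 and next pointer at +1 within each capability).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SIZE            256
#define CFG_STATUS          0x06 /* low byte of the status register */
#define CFG_STATUS_CAP_LIST 0x10 /* capability list present */
#define CFG_CAP_PTR         0x34 /* offset of the first capability */
#define CAP_ID              0
#define CAP_NEXT            1

/* Return the offset of capability `cap`, or 0 if it is not present. */
static uint8_t find_cap(const uint8_t cfg[CFG_SIZE], uint8_t cap)
{
    int ttl = 48; /* bound the walk in case the list loops */
    uint8_t pos;

    assert(cfg != NULL);

    if (!(cfg[CFG_STATUS] & CFG_STATUS_CAP_LIST)) {
        return 0;
    }

    pos = cfg[CFG_CAP_PTR];
    while (ttl--) {
        if (pos < 0x40) {
            break; /* pointers into the standard header end the list */
        }
        pos &= 0xFC; /* the low two bits are reserved */
        if (cfg[pos + CAP_ID] == 0xff) {
            break;
        }
        if (cfg[pos + CAP_ID] == cap) {
            return pos;
        }
        pos = cfg[pos + CAP_NEXT];
    }
    return 0;
}
```

The ttl counter plays the role of max_cap in the patch: a malformed or looping list ends the search instead of spinning forever.
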
> +
> +static void pt_config_reg_init(XenPCIPassthroughState *s,
> +                               XenPTRegGroup *reg_grp, XenPTRegInfo *reg)
> +{
> +    XenPTReg *reg_entry;
> +    uint32_t data = 0;
> +
> +    reg_entry = g_malloc0(sizeof (XenPTReg));
> +
> +    reg_entry->reg = reg;
> +    reg_entry->data = 0;
> +
> +    if (reg->init) {
> +        /* initialize emulate register */
> +        data = reg->init(s, reg_entry->reg,
> +                         reg_grp->base_offset + reg->offset);
> +        if (data == PT_INVALID_REG) {
> +            /* free unused BAR register entry */
> +            g_free(reg_entry);
> +            return;
> +        }
> +        /* set register value */
> +        reg_entry->data = data;
> +    }
> +    /* list add register entry */
> +    QLIST_INSERT_HEAD(&reg_grp->reg_tbl_list, reg_entry, entries);
> +
> +    return;
> +}
> +
> +void pt_config_init(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    uint32_t reg_grp_offset = 0;
> +    XenPTRegInfo *reg_tbl = NULL;
> +    int i, j;
> +
> +    QLIST_INIT(&s->reg_grp_tbl);
> +
> +    for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
> +        if (pt_emu_reg_grp_tbl[i].grp_id != 0xFF) {
> +            if (pt_hide_dev_cap(s->real_device,
> +                                pt_emu_reg_grp_tbl[i].grp_id)) {
> +                continue;
> +            }
> +
> +            reg_grp_offset = find_cap_offset(s, pt_emu_reg_grp_tbl[i].grp_id);
> +
> +            if (!reg_grp_offset) {
> +                continue;
> +            }
> +        }
> +
> +        reg_grp_entry = g_malloc0(sizeof (XenPTRegGroup));
> +        QLIST_INIT(&reg_grp_entry->reg_tbl_list);
> +        QLIST_INSERT_HEAD(&s->reg_grp_tbl, reg_grp_entry, entries);
> +
> +        reg_grp_entry->base_offset = reg_grp_offset;
> +        reg_grp_entry->reg_grp = pt_emu_reg_grp_tbl + i;
> +        if (pt_emu_reg_grp_tbl[i].size_init) {
> +            /* get register group size */
> +            reg_grp_entry->size =
> +                pt_emu_reg_grp_tbl[i].size_init(s, reg_grp_entry->reg_grp,
> +                                                reg_grp_offset);
> +        }
> +
> +        if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
> +            if (pt_emu_reg_grp_tbl[i].emu_reg_tbl) {
> +                reg_tbl = pt_emu_reg_grp_tbl[i].emu_reg_tbl;
> +                /* initialize capability register */
> +                for (j = 0; reg_tbl->size != 0; j++, reg_tbl++) {
> +                    /* initialize capability register */
> +                    pt_config_reg_init(s, reg_grp_entry, reg_tbl);
> +                }
> +            }
> +        }
> +        reg_grp_offset = 0;
> +    }
> +
> +    return;
> +}
> +
> +/* delete all emulate register */
> +void pt_config_delete(XenPCIPassthroughState *s)
> +{
> +    struct XenPTRegGroup *reg_group, *next_grp;
> +    struct XenPTReg *reg, *next_reg;
> +
> +    /* free Power Management info table */
> +    if (s->pm_state) {
> +        if (s->pm_state->pm_timer) {
> +            qemu_del_timer(s->pm_state->pm_timer);
> +            qemu_free_timer(s->pm_state->pm_timer);
> +            s->pm_state->pm_timer = NULL;
> +        }
> +
> +        g_free(s->pm_state);
> +    }
> +
> +    /* free all register group entry */
> +    QLIST_FOREACH_SAFE(reg_group, &s->reg_grp_tbl, entries, next_grp) {
> +        /* free all register entry */
> +        QLIST_FOREACH_SAFE(reg, &reg_group->reg_tbl_list, entries, next_reg) {
> +            QLIST_REMOVE(reg, entries);
> +            g_free(reg);
> +        }
> +
> +        QLIST_REMOVE(reg_group, entries);
> +        g_free(reg_group);
> +    }
> +}
> -- 
> Anthony PERARD
> 
> 


* Re: [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
@ 2011-11-10 21:53     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-10 21:53 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Fri, Oct 28, 2011 at 04:07:34PM +0100, Anthony PERARD wrote:
> From: Allen Kay <allen.m.kay@intel.com>
> 
> Signed-off-by: Allen Kay <allen.m.kay@intel.com>
> Signed-off-by: Guy Zana <guy@neocleus.com>
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  Makefile.target                      |    1 +
>  hw/xen_pci_passthrough.h             |    2 +
>  hw/xen_pci_passthrough_config_init.c | 2068 ++++++++++++++++++++++++++++++++++
>  3 files changed, 2071 insertions(+), 0 deletions(-)
>  create mode 100644 hw/xen_pci_passthrough_config_init.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index 36ea47d..c32c688 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -219,6 +219,7 @@ obj-i386-$(CONFIG_XEN) += xen_platform.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
>  
>  # Inter-VM PCI shared memory
>  CONFIG_IVSHMEM =
> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
> index 2d1979d..ebc04fd 100644
> --- a/hw/xen_pci_passthrough.h
> +++ b/hw/xen_pci_passthrough.h
> @@ -61,6 +61,8 @@ typedef int (*conf_byte_restore)
>  /* power state transition */
>  #define PT_FLAG_TRANSITING 0x0001
>  
> +#define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
> +
>  
>  typedef enum {
>      GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
> diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
> new file mode 100644
> index 0000000..4103b59
> --- /dev/null
> +++ b/hw/xen_pci_passthrough_config_init.c
> @@ -0,0 +1,2068 @@
> +/*
> + * Copyright (c) 2007, Neocleus Corporation.
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Alex Novik <alex@neocleus.com>
> + * Allen Kay <allen.m.kay@intel.com>
> + * Guy Zana <guy@neocleus.com>
> + *
> + * This file implements direct PCI assignment to a HVM guest
> + */
> +
> +#include "qemu-timer.h"
> +#include "xen_backend.h"
> +#include "xen_pci_passthrough.h"
> +
> +#define PT_MERGE_VALUE(value, data, val_mask) \
> +    (((value) & (val_mask)) | ((data) & ~(val_mask)))
> +
> +#define PT_INVALID_REG          0xFFFFFFFF      /* invalid register value */
> +
> +/* prototype */
> +
> +static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> +                                uint32_t real_offset);
> +static int pt_init_pci_config(XenPCIPassthroughState *s);
> +
> +
> +/* helper */
> +
> +/* A return value of 1 means the capability should NOT be exposed to guest. */
> +static int pt_hide_dev_cap(const HostPCIDevice *d, uint8_t grp_id)
> +{
> +    switch (grp_id) {
> +    case PCI_CAP_ID_EXP:
> +        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
> +         * Controller looks trivial, e.g., the PCI Express Capabilities
> +         * Register is 0. We should not try to expose it to guest.

Why not?
> +         */
> +        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
> +                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
> +            return 1;
> +        }
> +        break;
> +    }
> +    return 0;
> +}
> +
> +/* find emulate register group entry */
> +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
> +{
> +    XenPTRegGroup *entry = NULL;
> +
> +    /* find register group entry */
> +    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
> +        /* check address */
> +        if ((entry->base_offset <= address)
> +            && ((entry->base_offset + entry->size) > address)) {
> +            return entry;
> +        }
> +    }
> +
> +    /* group entry not found */
> +    return NULL;
> +}
> +
> +/* find emulate register entry */
> +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
> +{
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +    uint32_t real_offset = 0;
> +
> +    /* find register entry */
> +    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
> +        reg = reg_entry->reg;
> +        real_offset = reg_grp->base_offset + reg->offset;
> +        /* check address */
> +        if ((real_offset <= address)
> +            && ((real_offset + reg->size) > address)) {
> +            return reg_entry;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +/* parse BAR */
> +static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
> +{
> +    PCIDevice *d = &s->dev;
> +    XenPTRegion *region = NULL;
> +    PCIIORegion *r;
> +    int index = 0;
> +
> +    /* check 64bit BAR */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if ((0 < index) && (index < PCI_ROM_SLOT)) {

This is a bit confusing. Can you keep index on the same side of both
comparisons, like

if ((0 < index) && (PCI_ROM_SLOT > index))

or better:

if ((index > 0) && (index < PCI_ROM_SLOT))
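
A rough sketch of that range check with index kept on the left. It assumes
PCI_ROM_SLOT is 6, as in QEMU's hw/pci.h; the helper name
bar_may_be_upper_half is made up for illustration and is not part of the
patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: same value as PCI_ROM_SLOT in QEMU's hw/pci.h. */
#define PCI_ROM_SLOT 6

/*
 * Hypothetical helper: true when BAR `index` could be the upper half of a
 * 64-bit BAR, i.e. a previous BAR (index - 1) exists and `index` is not
 * the expansion ROM slot.
 */
static bool bar_may_be_upper_half(int index)
{
    bool in_range = (index > 0) && (index < PCI_ROM_SLOT);

    /* index > 0 already guarantees that index - 1 is a valid subscript. */
    assert(!in_range || index - 1 >= 0);
    return in_range;
}
```

Written this way, both bounds read left to right and it is easier to see
that io_regions[index - 1] is only reached for index >= 1.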

> +        int flags = s->real_device->io_regions[index - 1].flags;

Do we want to check that index - 1 is not negative?

> +
> +        if ((flags & IORESOURCE_MEM) && (flags & IORESOURCE_MEM_64)) {
> +            region = &s->bases[index - 1];
> +            if (region->bar_flag != PT_BAR_FLAG_UPPER) {
> +                return PT_BAR_FLAG_UPPER;
> +            }
> +        }
> +    }
> +
> +    /* check unused BAR */
> +    r = &d->io_regions[index];
> +    if (r->size == 0) {
> +        return PT_BAR_FLAG_UNUSED;
> +    }
> +
> +    /* for ExpROM BAR */
> +    if (index == PCI_ROM_SLOT) {
> +        return PT_BAR_FLAG_MEM;
> +    }
> +
> +    /* check BAR I/O indicator */
> +    if (s->real_device->io_regions[index].flags & IORESOURCE_IO) {
> +        return PT_BAR_FLAG_IO;
> +    } else {
> +        return PT_BAR_FLAG_MEM;
> +    }
> +}
> +
> +
> +/****************
> + * general register functions
> + */
> +
> +/* register initialization function */
> +
> +static uint32_t pt_common_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return reg->init_val;
> +}
> +
> +/* Read register functions */
> +
> +static int pt_byte_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint8_t *value, uint8_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint8_t valid_emu_mask = 0;
> +
> +    /* emulate byte register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = 0;
> +
> +    /* emulate word register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_long_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint32_t *value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t valid_emu_mask = 0;
> +
> +    /* emulate long register */
> +    valid_emu_mask = reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +
> +/* Write register functions */
> +
> +static int pt_byte_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint8_t *value, uint8_t dev_value,
> +                             uint8_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint8_t writable_mask = 0;
> +    uint8_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint16_t *value, uint16_t dev_value,
> +                             uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
> +static int pt_long_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint32_t *value, uint32_t dev_value,
> +                             uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    return 0;
> +}
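
To make the mask arithmetic in the write helpers above easier to follow,
here is a small standalone sketch. PT_MERGE_VALUE is copied from the patch;
struct word_reg and word_reg_write are simplified stand-ins (assumptions),
not the real XenPTReg/XenPTRegInfo types:

```c
#include <assert.h>
#include <stdint.h>

/* Same macro as in the patch: take `value` where the mask is set,
 * `data` everywhere else. */
#define PT_MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

/* Hypothetical, reduced stand-in for XenPTReg + XenPTRegInfo. */
struct word_reg {
    uint16_t emu_mask;  /* bits emulated by QEMU */
    uint16_t ro_mask;   /* bits the guest may not change */
    uint16_t data;      /* emulated register contents */
};

/* Updates the emulated copy and returns the value to write to the device. */
static uint16_t word_reg_write(struct word_reg *r, uint16_t guest_value,
                               uint16_t dev_value, uint16_t valid_mask)
{
    uint16_t writable = r->emu_mask & ~r->ro_mask & valid_mask;
    uint16_t through = ~r->emu_mask & valid_mask;

    /* A bit is either emulated or passed through, never both. */
    assert((writable & through) == 0);

    r->data = PT_MERGE_VALUE(guest_value, r->data, writable);
    return PT_MERGE_VALUE(guest_value, dev_value, through);
}
```

With the Command register masks from the table later in this patch
(emu_mask 0x0740, ro_mask 0xF880), a guest write of 0x0547 over a device
value of 0x0100 stores 0x0540 in the emulated copy and sends 0x0107 to the
device: the emulated bits stay in QEMU, the remaining guest bits go
through, and the device keeps its own value for the emulated positions.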
> +
> +/* common restore register functions */
> +static int pt_byte_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                               uint32_t real_offset, uint8_t dev_value,
> +                               uint8_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_byte(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
> +
> +    return 0;
> +}
> +static int pt_word_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                               uint32_t real_offset, uint16_t dev_value,
> +                               uint16_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_word(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, reg->emu_mask);
> +
> +    return 0;
> +}
> +
> +
> +/* XenPTRegInfo declaration
> + * - only for emulated register (either a part or whole bit).
> + * - for passthrough register that need special behavior (like interacting with
> + *   other component), set emu_mask to all 0 and specify r/w func properly.
> + * - do NOT use ALL F for init_val, otherwise the tbl will not be registered.
> + */
> +
> +/********************
> + * Header Type0
> + */
> +
> +static uint32_t pt_vendor_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return s->real_device->vendor_id;
> +}
> +static uint32_t pt_device_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return s->real_device->device_id;
> +}
> +static uint32_t pt_status_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    int reg_field = 0;
> +
> +    /* find Header register group */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_CAPABILITY_LIST);
> +    if (reg_grp_entry) {
> +        /* find Capabilities Pointer register */
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_CAPABILITY_LIST);
> +        if (reg_entry) {
> +            /* check Capabilities Pointer register */
> +            if (reg_entry->data) {
> +                reg_field |= PCI_STATUS_CAP_LIST;
> +            } else {
> +                reg_field &= ~PCI_STATUS_CAP_LIST;
> +            }
> +        } else {
> +            hw_error("Internal error: Couldn't find pt_reg_tbl for "
> +                     "Capabilities Pointer register. I/O emulator exit.\n");

Yikes. abort here? Um, can we just return a fault code instead?
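
Something along these lines, as a sketch only. It assumes the init hooks
could return a status and hand the value back through a pointer, which is
not what the patch's init signature does today; struct cap_ptr_entry and
status_reg_init are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS_CAP_LIST 0x10    /* same value as in pci_regs.h */

/* Hypothetical stand-in for XenPTReg; only the field used here. */
struct cap_ptr_entry {
    uint32_t data;
};

/*
 * Returns 0 and stores the initial Status value in *data, or returns -1
 * when the Capabilities Pointer entry was not found, leaving *data
 * untouched so the caller can fail device init instead of aborting.
 */
static int status_reg_init(const struct cap_ptr_entry *cap_ptr,
                           uint32_t *data)
{
    assert(data != NULL);

    if (cap_ptr == NULL) {
        return -1;
    }
    *data = cap_ptr->data ? PCI_STATUS_CAP_LIST : 0;
    return 0;
}
```

The caller would then unwind and report the error for this one device
rather than taking the whole emulator down.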

> +        }
> +    } else {
> +        hw_error("Internal error: Couldn't find pt_reg_grp_tbl for Header. "
> +                 "I/O emulator exit.\n");
> +    }
> +
> +    return reg_field;
> +}
> +static uint32_t pt_header_type_reg_init(XenPCIPassthroughState *s,
> +                                        XenPTRegInfo *reg,
> +                                        uint32_t real_offset)
> +{
> +    /* read PCI_HEADER_TYPE */
> +    return reg->init_val | 0x80;
> +}
> +
> +/* initialize Interrupt Pin register */
> +static uint32_t pt_irqpin_reg_init(XenPCIPassthroughState *s,
> +                                   XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    return pci_read_intx(s);
> +}
> +
> +/* Command register */
> +static int pt_cmd_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                           uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = 0;
> +    uint16_t emu_mask = reg->emu_mask;
> +
> +    if (s->is_virtfn) {
> +        emu_mask |= PCI_COMMAND_MEMORY;
> +    }
> +
> +    /* emulate word register */
> +    valid_emu_mask = emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint16_t *value, uint16_t dev_value,
> +                            uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    uint16_t wr_value = *value;
> +    uint16_t emu_mask = reg->emu_mask;
> +
> +    if (s->is_virtfn) {
> +        emu_mask |= PCI_COMMAND_MEMORY;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~emu_mask & valid_mask;
> +
> +    if (*value & PCI_COMMAND_INTX_DISABLE) {
> +        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +    } else {
> +        if (s->machine_irq) {
> +            throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +        }
> +    }
> +
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* mapping BAR */
> +    pt_bar_mapping(s, wr_value & PCI_COMMAND_IO,
> +                   wr_value & PCI_COMMAND_MEMORY);
> +
> +    return 0;
> +}
> +static int pt_cmd_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint32_t real_offset, uint16_t dev_value,
> +                              uint16_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +    uint16_t restorable_mask = 0;
> +
> +    /* use I/O device register's value as restore value */
> +    *value = pci_get_word(d->config + real_offset);
> +
> +    /* create value for restoring to I/O device register
> +     * but do not include Fast Back-to-Back Enable bit.
> +     */
> +    restorable_mask = reg->emu_mask & ~PCI_COMMAND_FAST_BACK;
> +    *value = PT_MERGE_VALUE(*value, dev_value, restorable_mask);
> +
> +    if (!s->machine_irq) {
> +        *value |= PCI_COMMAND_INTX_DISABLE;
> +    } else {
> +        *value &= ~PCI_COMMAND_INTX_DISABLE;
> +    }
> +
> +    return 0;
> +}
> +
> +/* BAR */
> +#define PT_BAR_MEM_RO_MASK      0x0000000F      /* BAR ReadOnly mask(Memory) */
> +#define PT_BAR_MEM_EMU_MASK     0xFFFFFFF0      /* BAR emul mask(Memory) */
> +#define PT_BAR_IO_RO_MASK       0x00000003      /* BAR ReadOnly mask(I/O) */
> +#define PT_BAR_IO_EMU_MASK      0xFFFFFFFC      /* BAR emul mask(I/O) */
> +
> +static inline uint32_t base_address_with_flags(HostPCIIORegion *hr)
> +{
> +    if ((hr->flags & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
> +        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_IO_MASK);
> +    } else {
> +        return hr->base_addr | (hr->flags & ~PCI_BASE_ADDRESS_MEM_MASK);
> +    }
> +}
> +
> +static uint32_t pt_bar_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> +                                uint32_t real_offset)
> +{
> +    int reg_field = 0;
> +    int index;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* set initial guest physical base address to -1 */
> +    s->bases[index].e_physbase = -1;

Um, use that define PCI_.. something macro.
> +
> +    /* set BAR flag */
> +    s->bases[index].bar_flag = pt_bar_reg_parse(s, reg);
> +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED) {
> +        reg_field = PT_INVALID_REG;
> +    }
> +
> +    return reg_field;
> +}
> +static int pt_bar_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                           uint32_t *value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t valid_emu_mask = 0;
> +    uint32_t bar_emu_mask = 0;
> +    int index;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* use fixed-up value from kernel sysfs */
> +    *value = base_address_with_flags(&s->real_device->io_regions[index]);
> +
> +    /* set emulate mask depend on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* emulate BAR */
> +    valid_emu_mask = bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
> +static int pt_bar_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                            uint32_t *value, uint32_t dev_value,
> +                            uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    PCIDevice *d = &s->dev;
> +    PCIIORegion *r;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    uint32_t bar_emu_mask = 0;
> +    uint32_t bar_ro_mask = 0;
> +    uint32_t new_addr, last_addr;
> +    uint32_t prev_offset;
> +    uint32_t r_size = 0;
> +    int index = 0;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    r = &d->io_regions[index];
> +    base = &s->bases[index];
> +    r_size = pt_get_emul_size(base->bar_flag, r->size);
> +
> +    /* set emulate mask and read-only mask depend on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        bar_ro_mask = PT_BAR_MEM_RO_MASK | (r_size - 1);
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        bar_ro_mask = PT_BAR_IO_RO_MASK | (r_size - 1);
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        bar_ro_mask = 0;    /* all upper 32bit are R/W */
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = bar_emu_mask & ~bar_ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* check whether we need to update the virtual region address or not */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        /* nothing to do */
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        new_addr = cfg_entry->data;
> +        last_addr = new_addr + r_size - 1;
> +        /* check invalid address */
> +        if (last_addr <= new_addr || !new_addr || last_addr >= 0x10000) {

Make a #define for 0x10000.. 
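
For example (a sketch; PT_IO_SPACE_LIMIT and io_bar_range_valid are
made-up names, and the check simply mirrors the condition in the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: the I/O port space ends at 64KB. */
#define PT_IO_SPACE_LIMIT 0x10000u

/*
 * True when [new_addr, new_addr + r_size - 1] is a usable I/O range:
 * non-zero base, no 32-bit wrap-around, and entirely below 64KB.
 */
static bool io_bar_range_valid(uint32_t new_addr, uint32_t r_size)
{
    uint32_t last_addr;

    /* Unused (zero-sized) BARs are assumed to be filtered out earlier. */
    assert(r_size != 0);

    last_addr = new_addr + r_size - 1;
    return !(last_addr <= new_addr || !new_addr ||
             last_addr >= PT_IO_SPACE_LIMIT);
}
```

That keeps the magic number in one place, and the same name can be reused
in the "check 64K range" test just below.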

> +            /* check 64K range */
> +            if ((last_addr >= 0x10000) &&
> +                (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask))) {
> +                PT_LOG("Warning: Guest attempt to set Base Address "
> +                       "over the 64KB. [%02x:%02x.%x][Offset:%02xh]"
> +                       "[Address:%08xh][Size:%08xh]\n",
> +                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                       PCI_FUNC(d->devfn),
> +                       reg->offset, new_addr, r_size);
> +            }
> +            /* just remove mapping */
> +            r->addr = -1;
> +            goto exit;
> +        }
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        if (cfg_entry->data) {
> +            if (cfg_entry->data != (PT_BAR_ALLF & ~bar_ro_mask)) {
> +                PT_LOG("Warning: Guest attempt to set high MMIO Base Address. "
> +                       "Ignore mapping. "
> +                       "[%02x:%02x.%x][Offset:%02xh][High Address:%08xh]\n",
> +                       pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                       PCI_FUNC(d->devfn), reg->offset, cfg_entry->data);
> +            }
> +            /* clear lower address */
> +            d->io_regions[index-1].addr = -1;
> +        } else {
> +            /* find lower 32bit BAR */
> +            prev_offset = (reg->offset - 4);
> +            reg_grp_entry = pt_find_reg_grp(s, prev_offset);
> +            if (reg_grp_entry) {
> +                reg_entry = pt_find_reg(reg_grp_entry, prev_offset);
> +                if (reg_entry) {
> +                    /* restore lower address */
> +                    d->io_regions[index-1].addr = reg_entry->data;
> +                } else {
> +                    return -1;
> +                }
> +            } else {
> +                return -1;
> +            }
> +        }
> +
> +        /* never mapping the 'empty' upper region,
> +         * because we'll do it enough for the lower region.
> +         */
> +        r->addr = -1;
> +        goto exit;
> +    default:
> +        break;
> +    }
> +
> +    /* update the corresponding virtual region address */
> +    /*
> +     * When guest code tries to get block size of mmio, it will write all "1"s
> +     * into pci bar register. In this case, cfg_entry->data == writable_mask.
> +     * Especially for devices with large mmio, the value of writable_mask
> +     * is likely to be a guest physical address that has been mapped to ram
> +     * rather than mmio. Remapping this value to mmio should be prevented.
> +     */
> +
> +    if (cfg_entry->data != writable_mask) {
> +        r->addr = cfg_entry->data;
> +    }
> +
> +exit:
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* After BAR reg update, we need to remap BAR */
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
> +    if (reg_grp_entry) {
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
> +        if (reg_entry) {
> +            pt_bar_mapping_one(s, index, reg_entry->data & PCI_COMMAND_IO,
> +                               reg_entry->data & PCI_COMMAND_MEMORY);
> +        }
> +    }
> +
> +    return 0;
> +}
> +static int pt_bar_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint32_t real_offset, uint32_t dev_value,
> +                              uint32_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t bar_emu_mask = 0;
> +    int index = 0;
> +
> +    /* get BAR index */
> +    index = pt_bar_offset_to_index(reg->offset);
> +    if (index < 0) {
> +        hw_error("Internal error: Invalid BAR index[%d]. "
> +                 "I/O emulator exit.\n", index);
> +    }
> +
> +    /* use value from kernel sysfs */
> +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UPPER) {
> +        *value = s->real_device->io_regions[index - 1].base_addr >> 32;
> +    } else {
> +        *value = base_address_with_flags(&s->real_device->io_regions[index]);
> +    }
> +
> +    /* set emulate mask depend on BAR flag */
> +    switch (s->bases[index].bar_flag) {
> +    case PT_BAR_FLAG_MEM:
> +        bar_emu_mask = PT_BAR_MEM_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_IO:
> +        bar_emu_mask = PT_BAR_IO_EMU_MASK;
> +        break;
> +    case PT_BAR_FLAG_UPPER:
> +        bar_emu_mask = PT_BAR_ALLF;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* create value for restoring to I/O device register */
> +    *value = PT_MERGE_VALUE(*value, dev_value, bar_emu_mask);
> +
> +    return 0;
> +}
> +
> +/* write Exp ROM BAR */
> +static int pt_exp_rom_bar_reg_write(XenPCIPassthroughState *s,
> +                                    XenPTReg *cfg_entry, uint32_t *value,
> +                                    uint32_t dev_value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegion *base = NULL;
> +    PCIDevice *d = (PCIDevice *)&s->dev;
> +    PCIIORegion *r;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    pcibus_t r_size = 0;
> +    uint32_t bar_emu_mask = 0;
> +    uint32_t bar_ro_mask = 0;
> +
> +    r = &d->io_regions[PCI_ROM_SLOT];
> +    r_size = r->size;
> +    base = &s->bases[PCI_ROM_SLOT];
> +    /* align memory type resource size */
> +    r_size = pt_get_emul_size(base->bar_flag, r_size);
> +
> +    /* set emulate mask and read-only mask */
> +    bar_emu_mask = reg->emu_mask;
> +    bar_ro_mask = (reg->ro_mask | (r_size - 1)) & ~PCI_ROM_ADDRESS_ENABLE;
> +
> +    /* modify emulate register */
> +    writable_mask = ~bar_ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* update the corresponding virtual region address */
> +    /*
> +     * When guest code tries to get block size of mmio, it will write all "1"s
> +     * into pci bar register. In this case, cfg_entry->data == writable_mask.
> +     * Especially for devices with large mmio, the value of writable_mask
> +     * is likely to be a guest physical address that has been mapped to ram
> +     * rather than mmio. Remapping this value to mmio should be prevented.
> +     */
> +
> +    if (cfg_entry->data != writable_mask) {
> +        r->addr = cfg_entry->data;
> +    }
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~bar_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* After BAR reg update, we need to remap BAR*/
> +    reg_grp_entry = pt_find_reg_grp(s, PCI_COMMAND);
> +    if (reg_grp_entry) {
> +        reg_entry = pt_find_reg(reg_grp_entry, PCI_COMMAND);
> +        if (reg_entry) {
> +            pt_bar_mapping_one(s, PCI_ROM_SLOT,
> +                               reg_entry->data & PCI_COMMAND_IO,
> +                               reg_entry->data & PCI_COMMAND_MEMORY);
> +        }
> +    }
> +
> +    return 0;
> +}
> +/* restore ROM BAR */
> +static int pt_exp_rom_bar_reg_restore(XenPCIPassthroughState *s,
> +                                      XenPTReg *cfg_entry,
> +                                      uint32_t real_offset,
> +                                      uint32_t dev_value, uint32_t *value)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +
> +    /* use value from kernel sysfs */
> +    *value =
> +        PT_MERGE_VALUE(host_pci_get_long(s->real_device, PCI_ROM_ADDRESS),
> +                       dev_value, reg->emu_mask);
> +    return 0;
> +}
> +
> +/* Header Type0 reg static information table */
> +static XenPTRegInfo pt_emu_reg_header0_tbl[] = {
> +    /* Vendor ID reg */
> +    {
> +        .offset     = PCI_VENDOR_ID,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_vendor_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Device ID reg */
> +    {
> +        .offset     = PCI_DEVICE_ID,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_device_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Command reg */
> +    {
> +        .offset     = PCI_COMMAND,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xF880,
> +        .emu_mask   = 0x0740,
> +        .init       = pt_common_reg_init,
> +        .u.w.read   = pt_cmd_reg_read,
> +        .u.w.write  = pt_cmd_reg_write,
> +        .u.w.restore  = pt_cmd_reg_restore,
> +    },
> +    /* Capabilities Pointer reg */
> +    {
> +        .offset     = PCI_CAPABILITY_LIST,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Status reg */
> +    /* use emulated Cap Ptr value to initialize,
> +     * so need to be declared after Cap Ptr reg
> +     */
> +    {
> +        .offset     = PCI_STATUS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x06FF,
> +        .emu_mask   = 0x0010,
> +        .init       = pt_status_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Cache Line Size reg */
> +    {
> +        .offset     = PCI_CACHE_LINE_SIZE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = pt_byte_reg_restore,
> +    },
> +    /* Latency Timer reg */
> +    {
> +        .offset     = PCI_LATENCY_TIMER,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = pt_byte_reg_restore,
> +    },
> +    /* Header Type reg */
> +    {
> +        .offset     = PCI_HEADER_TYPE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0x00,
> +        .init       = pt_header_type_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Interrupt Line reg */
> +    {
> +        .offset     = PCI_INTERRUPT_LINE,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0x00,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_common_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Interrupt Pin reg */
> +    {
> +        .offset     = PCI_INTERRUPT_PIN,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_irqpin_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* BAR 0 reg */
> +    /* mask of BAR need to be decided later, depends on IO/MEM type */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_0,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 1 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_1,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 2 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_2,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 3 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_3,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 4 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_4,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* BAR 5 reg */
> +    {
> +        .offset     = PCI_BASE_ADDRESS_5,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_bar_reg_read,
> +        .u.dw.write = pt_bar_reg_write,
> +        .u.dw.restore = pt_bar_reg_restore,
> +    },
> +    /* Expansion ROM BAR reg */
> +    {
> +        .offset     = PCI_ROM_ADDRESS,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x000007FE,
> +        .emu_mask   = 0xFFFFF800,
> +        .init       = pt_bar_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_exp_rom_bar_reg_write,
> +        .u.dw.restore = pt_exp_rom_bar_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
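
For anyone following along: the tables in this file are terminated by an entry whose .size is 0. Below is a minimal sketch of how such a sentinel-terminated table can be searched. RegInfo, example_tbl and find_reg are hypothetical stand-ins for illustration, not the patch's XenPTRegInfo or pt_find_reg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for XenPTRegInfo. */
typedef struct RegInfo {
    uint32_t offset;
    uint32_t size;      /* 0 marks the end of the table */
    uint32_t ro_mask;
    uint32_t emu_mask;
} RegInfo;

static const RegInfo example_tbl[] = {
    { .offset = 0x00, .size = 2, .ro_mask = 0xFFFF, .emu_mask = 0xFFFF },
    { .offset = 0x04, .size = 2, .ro_mask = 0xF880, .emu_mask = 0x0740 },
    { .offset = 0x0C, .size = 1, .ro_mask = 0x00,   .emu_mask = 0xFF   },
    { .size = 0 },      /* sentinel */
};

/* Return the entry whose byte range covers 'offset', or NULL if none does. */
static const RegInfo *find_reg(const RegInfo *tbl, uint32_t offset)
{
    assert(tbl != NULL);
    for (; tbl->size != 0; tbl++) {
        if (offset >= tbl->offset && offset < tbl->offset + tbl->size) {
            return tbl;
        }
    }
    return NULL;
}
```

The sentinel means no element count has to be kept next to each table, at the cost of a silent overrun if the last entry is ever forgotten.
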
> +
> +
> +/*********************************
> + * Vital Product Data Capability
> + */
> +
> +/* Vital Product Data Capability Structure reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_vpd_tbl[] = {
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/**************************************
> + * Vendor Specific Capability
> + */
> +
> +/* Vendor Specific Capability Structure reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_vendor_tbl[] = {
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*****************************
> + * PCI Express Capability
> + */
> +
> +/* initialize Link Control register */
> +static uint32_t pt_linkctrl_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint8_t cap_ver = 0;
> +    uint8_t dev_type = 0;
> +
> +    /* TODO maybe better to use fonction from hw/pcie.c */

Typo: "fonction" should be "function".
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
> +                             + PCI_EXP_FLAGS)
> +                & PCI_EXP_FLAGS_TYPE) >> 4;
> +
> +    /* no need to initialize in case of Root Complex Integrated Endpoint
> +     * with cap_ver 1.x

Why?

> +     */
> +    if ((dev_type == PCI_EXP_TYPE_RC_END) && (cap_ver == 1)) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* initialize Device Control 2 register */
> +static uint32_t pt_devctrl2_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint8_t cap_ver = 0;
> +
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;
> +
> +    /* no need to initialize in case of cap_ver 1.x */
> +    if (cap_ver == 1) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* initialize Link Control 2 register */
> +static uint32_t pt_linkctrl2_reg_init(XenPCIPassthroughState *s,
> +                                      XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    int reg_field = 0;
> +    uint8_t cap_ver = 0;
> +
> +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> +                           + PCI_EXP_FLAGS)
> +        & PCI_EXP_FLAGS_VERS;

This looks like a stray tab or an indentation problem, though it might just be my mailer.

> +
> +    /* no need to initialize in case of cap_ver 1.x */
> +    if (cap_ver == 1) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    /* set Supported Link Speed */
> +    reg_field |= PCI_EXP_LNKCAP_SLS &
> +        pci_get_byte(s->dev.config + real_offset - reg->offset
> +                     + PCI_EXP_LNKCAP);
> +
> +    return reg_field;
> +}
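
The three init helpers above each repeat the same PCI_EXP_FLAGS read. Here is a rough sketch of how the version/type decode could be factored out. The helper names are hypothetical, and the constants are restated locally (with the values from Linux's pci_regs.h) only so the snippet stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_EXP_FLAGS       2       /* offset of the PCIe Capabilities register */
#define PCI_EXP_FLAGS_VERS  0x000f  /* capability version */
#define PCI_EXP_FLAGS_TYPE  0x00f0  /* device/port type */
#define PCI_EXP_TYPE_RC_END 0x9     /* Root Complex Integrated Endpoint */

/* 'cap' points at the start of the PCI Express capability in config space. */
static uint8_t pcie_cap_version(const uint8_t *cap)
{
    assert(cap != NULL);
    return cap[PCI_EXP_FLAGS] & PCI_EXP_FLAGS_VERS;
}

static uint8_t pcie_dev_type(const uint8_t *cap)
{
    assert(cap != NULL);
    return (cap[PCI_EXP_FLAGS] & PCI_EXP_FLAGS_TYPE) >> 4;
}

/* Same condition as in pt_linkctrl_reg_init(): the patch skips Link Control
 * for a Root Complex Integrated Endpoint with capability version 1. */
static int has_link_control(const uint8_t *cap)
{
    return !(pcie_dev_type(cap) == PCI_EXP_TYPE_RC_END &&
             pcie_cap_version(cap) == 1);
}
```

Both fields live in the low byte of the register, which is why a byte read is enough here.
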
> +
> +/* PCI Express Capability Structure reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_pcie_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Device Capabilities reg */
> +    {
> +        .offset     = PCI_EXP_DEVCAP,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x1FFCFFFF,
> +        .emu_mask   = 0x10000000,
> +        .init       = pt_common_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_long_reg_write,
> +        .u.dw.restore = NULL,
> +    },
> +    /* Device Control reg */
> +    {
> +        .offset     = PCI_EXP_DEVCTL,
> +        .size       = 2,
> +        .init_val   = 0x2810,
> +        .ro_mask    = 0x8400,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_common_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Link Control reg */
> +    {
> +        .offset     = PCI_EXP_LNKCTL,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFC34,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_linkctrl_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Device Control 2 reg */
> +    {
> +        .offset     = 0x28,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFE0,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_devctrl2_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    /* Link Control 2 reg */
> +    {
> +        .offset     = 0x30,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xE040,
> +        .emu_mask   = 0xFFFF,
> +        .init       = pt_linkctrl2_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = pt_word_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/*********************************
> + * Power Management Capability
> + */
> +
> +/* initialize Power Management Capabilities register */
> +static uint32_t pt_pmc_reg_init(XenPCIPassthroughState *s,
> +                                XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +
> +    if (!s->power_mgmt) {
> +        return reg->init_val;
> +    }
> +
> +    /* set Power Management Capabilities register */
> +    s->pm_state->pmc_field = pci_get_word(d->config + real_offset);
> +
> +    return reg->init_val;
> +}
> +/* initialize PCI Power Management Control/Status register */
> +static uint32_t pt_pmcsr_reg_init(XenPCIPassthroughState *s,
> +                                  XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t cap_ver  = 0;
> +
> +    if (!s->power_mgmt) {
> +        return reg->init_val;
> +    }
> +
> +    /* check PCI Power Management support version */
> +    cap_ver = s->pm_state->pmc_field & PCI_PM_CAP_VER_MASK;
> +
> +    if (cap_ver > 2) {
> +        /* set No Soft Reset */
> +        s->pm_state->no_soft_reset =
> +            pci_get_byte(d->config + real_offset) & PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    /* wake up real physical device */
> +    switch (host_pci_get_word(s->real_device, real_offset)
> +            & PCI_PM_CTRL_STATE_MASK) {
> +    case 0:
> +        break;
> +    case 1:
> +        PT_LOG("Power state transition D1 -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        break;
> +    case 2:
> +        PT_LOG("Power state transition D2 -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        usleep(200);

Heheh..
> +        break;
> +    case 3:
> +        PT_LOG("Power state transition D3hot -> D0active\n");
> +        host_pci_set_word(s->real_device, real_offset, 0);
> +        usleep(10 * 1000);
> +        pt_init_pci_config(s);
> +        break;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* read Power Management Control/Status register */
> +static int pt_pmcsr_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                             uint16_t *value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t valid_emu_mask = reg->emu_mask;
> +
> +    if (!s->power_mgmt) {
> +        valid_emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    valid_emu_mask = valid_emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> +
> +    return 0;
> +}
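
The handlers in this file all rest on PT_MERGE_VALUE and on the split between emu_mask and ro_mask. PT_MERGE_VALUE is not defined in this part of the patch, so the sketch below assumes it takes the bits selected by the mask from its first argument and the remaining bits from its second. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: bits set in val_mask come from 'value', the rest from 'data'. */
#define MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

/* Guest read: emulated bits come from the shadow copy, the rest from the device. */
static uint16_t merge_for_read(uint16_t dev_value, uint16_t shadow,
                               uint16_t emu_mask, uint16_t valid_mask)
{
    uint16_t valid_emu_mask = emu_mask & valid_mask;
    return MERGE_VALUE(dev_value, shadow, (uint16_t)~valid_emu_mask);
}

/* Guest write, step 1: update only emulated, non-read-only bits of the shadow. */
static uint16_t update_shadow(uint16_t guest, uint16_t shadow,
                              uint16_t emu_mask, uint16_t ro_mask,
                              uint16_t valid_mask)
{
    uint16_t writable_mask = emu_mask & ~ro_mask & valid_mask;
    return MERGE_VALUE(guest, shadow, writable_mask);
}

/* Guest write, step 2: pass only non-emulated bits through to the device. */
static uint16_t value_for_device(uint16_t guest, uint16_t dev_value,
                                 uint16_t emu_mask, uint16_t valid_mask)
{
    uint16_t throughable_mask = ~emu_mask & valid_mask;
    return MERGE_VALUE(guest, dev_value, throughable_mask);
}

/* Usage with the Command register masks from pt_emu_reg_header0_tbl. */
static void merge_self_check(void)
{
    assert(update_shadow(0xFFFF, 0x0000, 0x0740, 0xF880, 0xFFFF) == 0x0740);
    assert(value_for_device(0xFFFF, 0x0000, 0x0740, 0xFFFF) == 0xF8BF);
}
```

With these masks a guest write of 0xFFFF changes only bits 0x0740 of the shadow copy, and only the non-emulated bits are taken from the guest on the way to the device.
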
> +/* reset Interrupt and I/O resource  */
> +static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    PCIIORegion *r;
> +    int i = 0;
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    /* unbind INTx */
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    e_intx = pci_intx(s);
> +
> +    if (s->machine_irq) {
> +        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
> +                                    PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
> +            PT_LOG("Error: Unbinding of interrupt failed!\n");
> +        }
> +    }
> +
> +    /* clear all virtual region address */
> +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> +        r = &d->io_regions[i];
> +        r->addr = -1;
> +    }
> +
> +    /* unmapping BAR */
> +    pt_bar_mapping(s, 0, 0);
> +}
> +/* check power state transition */
> +static int check_power_state(XenPCIPassthroughState *s)
> +{
> +    XenPTPM *pm_state = s->pm_state;
> +    PCIDevice *d = &s->dev;
> +    uint16_t read_val = 0;
> +    uint16_t cur_state = 0;
> +
> +    /* get current power state */
> +    read_val = host_pci_get_word(s->real_device,
> +                                 pm_state->pm_base + PCI_PM_CTRL);
> +    cur_state = read_val & PCI_PM_CTRL_STATE_MASK;
> +
> +    if (pm_state->req_state != cur_state) {
> +        PT_LOG("Error: Failed to change power state. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, cur_state);
> +        return -1;
> +    }
> +    return 0;
> +}
> +/* write Power Management Control/Status register */
> +static void pt_from_d3hot_to_d0_with_reset(void *opaque)
> +{
> +    XenPCIPassthroughState *s = opaque;
> +    XenPTPM *pm_state = s->pm_state;
> +    int ret = 0;
> +
> +    /* check power state */
> +    ret = check_power_state(s);
> +
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    pt_init_pci_config(s);
> +
> +out:
> +    /* power state transition flags off */
> +    pm_state->flags &= ~PT_FLAG_TRANSITING;
> +
> +    qemu_free_timer(pm_state->pm_timer);
> +    pm_state->pm_timer = NULL;
> +}
> +static void pt_default_power_transition(void *opaque)
> +{
> +    XenPCIPassthroughState *ptdev = opaque;
> +    XenPTPM *pm_state = ptdev->pm_state;
> +
> +    /* check power state */
> +    check_power_state(ptdev);
> +
> +    /* power state transition flags off */
> +    pm_state->flags &= ~PT_FLAG_TRANSITING;
> +
> +    qemu_free_timer(pm_state->pm_timer);
> +    pm_state->pm_timer = NULL;
> +}
> +static int pt_pmcsr_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                              uint16_t *value, uint16_t dev_value,
> +                              uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    PCIDevice *d = &s->dev;
> +    uint16_t emu_mask = reg->emu_mask;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    XenPTPM *pm_state = s->pm_state;
> +
> +    if (!s->power_mgmt) {
> +        emu_mask |= PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_NO_SOFT_RESET;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    if (!s->power_mgmt) {
> +        return 0;
> +    }
> +
> +    /* set I/O device power state */
> +    pm_state->cur_state = dev_value & PCI_PM_CTRL_STATE_MASK;
> +
> +    /* set Guest requested PowerState */
> +    pm_state->req_state = *value & PCI_PM_CTRL_STATE_MASK;
> +
> +    /* check power state transition or not */
> +    if (pm_state->cur_state == pm_state->req_state) {
> +        /* not power state transition */
> +        return 0;
> +    }
> +
> +    /* check enable power state transition */
> +    if ((pm_state->req_state != 0) &&
> +        (pm_state->cur_state > pm_state->req_state)) {
> +        PT_LOG("Error: Invalid power transition. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, pm_state->cur_state);
> +
> +        return 0;
> +    }
> +
> +    /* check if this device supports the requested power state */
> +    if (((pm_state->req_state == 1) && !(pm_state->pmc_field & PCI_PM_CAP_D1))
> +        || ((pm_state->req_state == 2) &&
> +            !(pm_state->pmc_field & PCI_PM_CAP_D2))) {
> +        PT_LOG("Error: Invalid power transition. "
> +               "[%02x:%02x.%x][requested state:%d][current state:%d]\n",
> +               pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> +               pm_state->req_state, pm_state->cur_state);
> +
> +        return 0;
> +    }
> +
> +    /* in case of transition related to D3hot, it's necessary to wait 10 ms.
> +     * But because writing to register will be performed later on actually,
> +     * don't start QEMUTimer right now, just alloc and init QEMUTimer here.
> +     */
> +    if ((pm_state->cur_state == 3) || (pm_state->req_state == 3)) {
> +        if (pm_state->req_state == 0) {
> +            /* alloc and init QEMUTimer */
> +            if (!pm_state->no_soft_reset) {
> +                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                    pt_from_d3hot_to_d0_with_reset, s);
> +
> +                /* reset Interrupt and I/O resource mapping */
> +                pt_reset_interrupt_and_io_mapping(s);
> +            } else {
> +                pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                                        pt_default_power_transition, s);
> +            }
> +        } else {
> +            /* alloc and init QEMUTimer */
> +            pm_state->pm_timer = qemu_new_timer_ms(rt_clock,
> +                pt_default_power_transition, s);
> +        }
> +
> +        /* set power state transition delay */
> +        pm_state->pm_delay = 10;
> +
> +        /* power state transition flags on */
> +        pm_state->flags |= PT_FLAG_TRANSITING;
> +    }
> +    /* in case of transition related to D0, D1 and D2,
> +     * no need to use QEMUTimer.
> +     * So, we perfom writing to register here and then read it back.
> +     */
> +    else {
> +        /* write power state to I/O device register */
> +        host_pci_set_word(s->real_device, pm_state->pm_base + PCI_PM_CTRL,
> +                          *value);
> +
> +        /* in case of transition related to D2,
> +         * it's necessary to wait 200 usec.
> +         * But because QEMUTimer do not support microsec unit right now,
> +         * so we do wait ourself here.
> +         */
> +        if ((pm_state->cur_state == 2) || (pm_state->req_state == 2)) {
> +            usleep(200);
> +        }
> +
> +        /* check power state */
> +        check_power_state(s);
> +
> +        /* recreate value for writing to I/O device register */
> +        *value = host_pci_get_word(s->real_device,
> +                                   pm_state->pm_base + PCI_PM_CTRL);
> +    }
> +
> +    return 0;
> +}
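
To make the checks in pt_pmcsr_reg_write easier to review, here is a rough sketch of the same decision logic as two pure functions. The function names are hypothetical. PCI_PM_CAP_D1 and PCI_PM_CAP_D2 are restated with their pci_regs.h values, and the delays are the ones given in the patch's comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_PM_CAP_D1 0x0200    /* D1 power state supported */
#define PCI_PM_CAP_D2 0x0400    /* D2 power state supported */

/* States use the 2-bit PowerState encoding: 0 = D0, 1 = D1, 2 = D2, 3 = D3hot. */
static bool transition_allowed(uint16_t pmc, unsigned cur, unsigned req)
{
    assert(cur <= 3 && req <= 3);
    if (cur == req) {
        return true;            /* not a transition at all */
    }
    /* Leaving a deeper state is only accepted when the target is D0. */
    if (req != 0 && cur > req) {
        return false;
    }
    if (req == 1 && !(pmc & PCI_PM_CAP_D1)) {
        return false;
    }
    if (req == 2 && !(pmc & PCI_PM_CAP_D2)) {
        return false;
    }
    return true;
}

/* Delay to observe after the register write, in microseconds. */
static unsigned transition_delay_us(unsigned cur, unsigned req)
{
    if (cur == req) {
        return 0;
    }
    if (cur == 3 || req == 3) {
        return 10 * 1000;       /* D3hot involved: 10 ms */
    }
    if (cur == 2 || req == 2) {
        return 200;             /* D2 involved: 200 us */
    }
    return 0;
}
```

Keeping the policy separate from the register write and timer handling would also make it possible to test the policy without a device.
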
> +
> +/* restore Power Management Control/Status register */
> +static int pt_pmcsr_reg_restore(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                                uint32_t real_offset, uint16_t dev_value,
> +                                uint16_t *value)
> +{
> +    /* create value for restoring to I/O device register
> +     * No need to restore, just clear PME Enable and PME Status bit
> +     * Note: register type of PME Status bit is RW1C, so clear by writing 1b
> +     */
> +    *value = (dev_value & ~PCI_PM_CTRL_PME_ENABLE) | PCI_PM_CTRL_PME_STATUS;
> +
> +    return 0;
> +}
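
The RW1C note in the comment above is easy to misread, so here is a small sketch of why OR-ing in PME_Status clears it. hw_write is a deliberately simplified, hypothetical model of the device register (it ignores read-only bits). The two constants carry their pci_regs.h values.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_PM_CTRL_PME_ENABLE 0x0100
#define PCI_PM_CTRL_PME_STATUS 0x8000  /* RW1C: writing 1 clears the bit */

/* Value the restore handler computes from the current device value. */
static uint16_t pmcsr_restore_value(uint16_t dev_value)
{
    return (dev_value & ~PCI_PM_CTRL_PME_ENABLE) | PCI_PM_CTRL_PME_STATUS;
}

/* Toy register model: RW1C bits are cleared where a 1 is written and kept
 * otherwise; all other bits simply take the written value. */
static uint16_t hw_write(uint16_t current, uint16_t written)
{
    uint16_t rw1c = PCI_PM_CTRL_PME_STATUS;
    uint16_t plain = written & ~rw1c;
    uint16_t kept = current & rw1c & ~written;
    return plain | kept;
}

/* Usage: device in D3hot with PME_En and PME_Status both set. */
static void rw1c_self_check(void)
{
    uint16_t dev = 0x8103;
    assert(pmcsr_restore_value(dev) == 0x8003);
    assert(hw_write(dev, pmcsr_restore_value(dev)) == 0x0003);
    /* Writing 0 to the RW1C bit leaves PME_Status set. */
    assert(hw_write(dev, 0x0003) == 0x8003);
}
```
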
> +
> +
> +/* Power Management Capability reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Power Management Capabilities reg */
> +    {
> +        .offset     = PCI_CAP_FLAGS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFFFF,
> +        .emu_mask   = 0xF9C8,
> +        .init       = pt_pmc_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_word_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* PCI Power Management Control/Status reg */
> +    {
> +        .offset     = PCI_PM_CTRL,
> +        .size       = 2,
> +        .init_val   = 0x0008,
> +        .ro_mask    = 0xE1FC,
> +        .emu_mask   = 0x8100,
> +        .init       = pt_pmcsr_reg_init,
> +        .u.w.read   = pt_pmcsr_reg_read,
> +        .u.w.write  = pt_pmcsr_reg_write,
> +        .u.w.restore  = pt_pmcsr_reg_restore,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/****************************
> + * Capabilities
> + */
> +
> +/* AER register operations */
> +
> +static void aer_save_one_register(XenPCIPassthroughState *s, int offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint32_t aer_base = s->pm_state->aer_base;
> +    uint32_t val = 0;
> +
> +    val = host_pci_get_long(s->real_device, aer_base + offset);
> +    pci_set_long(d->config + aer_base + offset, val);
> +}
> +static void pt_aer_reg_save(XenPCIPassthroughState *s)
> +{
> +    /* after reset, following register values should be restored.
> +     * So, save them.
> +     */
> +    aer_save_one_register(s, PCI_ERR_UNCOR_MASK);
> +    aer_save_one_register(s, PCI_ERR_UNCOR_SEVER);
> +    aer_save_one_register(s, PCI_ERR_COR_MASK);
> +    aer_save_one_register(s, PCI_ERR_CAP);
> +}
> +static void aer_restore_one_register(XenPCIPassthroughState *s, int offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint32_t aer_base = s->pm_state->aer_base;
> +    uint32_t config = 0;
> +
> +    config = pci_get_long(d->config + aer_base + offset);
> +    host_pci_set_long(s->real_device, aer_base + offset, config);
> +}
> +static void pt_aer_reg_restore(XenPCIPassthroughState *s)
> +{
> +    /* the following registers should be reconfigured to correct values
> +     * after reset. restore them.
> +     * other registers should not be reconfigured after reset
> +     * if there is no reason
> +     */
> +    aer_restore_one_register(s, PCI_ERR_UNCOR_MASK);
> +    aer_restore_one_register(s, PCI_ERR_UNCOR_SEVER);
> +    aer_restore_one_register(s, PCI_ERR_COR_MASK);
> +    aer_restore_one_register(s, PCI_ERR_CAP);
> +}
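
The save and restore paths above list the same four registers twice. A table-driven variant might look like the sketch below. aer_saved_offsets and aer_copy_registers are hypothetical names, plain byte buffers stand in for the emulated and real config spaces, and the offsets are the pci_regs.h values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PCI_ERR_UNCOR_MASK  8
#define PCI_ERR_UNCOR_SEVER 12
#define PCI_ERR_COR_MASK    20
#define PCI_ERR_CAP         24

/* Registers that are saved before, and restored after, a reset. */
static const uint8_t aer_saved_offsets[] = {
    PCI_ERR_UNCOR_MASK,
    PCI_ERR_UNCOR_SEVER,
    PCI_ERR_COR_MASK,
    PCI_ERR_CAP,
};

/* Copy each listed 32-bit register from src to dst.
 * Both pointers refer to the base of an AER capability image. */
static void aer_copy_registers(uint8_t *dst, const uint8_t *src)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < sizeof(aer_saved_offsets); i++) {
        memcpy(dst + aer_saved_offsets[i], src + aer_saved_offsets[i], 4);
    }
}
```

Saving is then a copy from the device image to the shadow and restoring is the same call with the arguments swapped, so the two lists cannot drift apart.
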
> +
> +/* capability structure register group size functions */
> +
> +static uint8_t pt_reg_grp_size_init(XenPCIPassthroughState *s,
> +                                    const XenPTRegGroupInfo *grp_reg,
> +                                    uint32_t base_offset)
> +{
> +    return grp_reg->grp_size;
> +}
> +/* get Power Management Capability Structure register group size */
> +static uint8_t pt_pm_size_init(XenPCIPassthroughState *s,
> +                               const XenPTRegGroupInfo *grp_reg,
> +                               uint32_t base_offset)
> +{
> +    if (!s->power_mgmt) {
> +        return grp_reg->grp_size;
> +    }
> +
> +    s->pm_state = g_malloc0(sizeof (XenPTPM));
> +
> +    /* set Power Management Capability base offset */
> +    s->pm_state->pm_base = base_offset;
> +
> +    /* find AER register and set AER Capability base offset */
> +    s->pm_state->aer_base = host_pci_find_ext_cap_offset(s->real_device,
> +                                                         PCI_EXT_CAP_ID_ERR);
> +
> +    /* save AER register */
> +    if (s->pm_state->aer_base) {
> +        pt_aer_reg_save(s);
> +    }
> +
> +    return grp_reg->grp_size;
> +}
> +/* get Vendor Specific Capability Structure register group size */
> +static uint8_t pt_vendor_size_init(XenPCIPassthroughState *s,
> +                                   const XenPTRegGroupInfo *grp_reg,
> +                                   uint32_t base_offset)
> +{
> +    return pci_get_byte(s->dev.config + base_offset + 0x02);
> +}
> +/* get PCI Express Capability Structure register group size */
> +static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
> +                                 const XenPTRegGroupInfo *grp_reg,
> +                                 uint32_t base_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t exp_flag = 0;
> +    uint16_t type = 0;
> +    uint16_t version = 0;
> +    uint8_t pcie_size = 0;
> +
> +    exp_flag = pci_get_word(d->config + base_offset + PCI_EXP_FLAGS);
> +    type = (exp_flag & PCI_EXP_FLAGS_TYPE) >> 4;
> +    version = exp_flag & PCI_EXP_FLAGS_VERS;
> +
> +    /* calculate size depend on capability version and device/port type */
> +    /* in case of PCI Express Base Specification Rev 1.x */
> +    if (version == 1) {
> +        /* The PCI Express Capabilities, Device Capabilities, and Device
> +         * Status/Control registers are required for all PCI Express devices.
> +         * The Link Capabilities and Link Status/Control are required for all
> +         * Endpoints that are not Root Complex Integrated Endpoints. Endpoints
> +         * are not required to implement registers other than those listed
> +         * above and terminate the capability structure.
> +         */
> +        switch (type) {
> +        case PCI_EXP_TYPE_ENDPOINT:
> +        case PCI_EXP_TYPE_LEG_END:
> +            pcie_size = 0x14;
> +            break;
> +        case PCI_EXP_TYPE_RC_END:
> +            /* has no link */
> +            pcie_size = 0x0C;
> +            break;
> +        /* only EndPoint passthrough is supported */
> +        case PCI_EXP_TYPE_ROOT_PORT:
> +        case PCI_EXP_TYPE_UPSTREAM:
> +        case PCI_EXP_TYPE_DOWNSTREAM:
> +        case PCI_EXP_TYPE_PCI_BRIDGE:
> +        case PCI_EXP_TYPE_PCIE_BRIDGE:
> +        case PCI_EXP_TYPE_RC_EC:
> +        default:
> +            hw_error("Internal error: Unsupported device/port type[%d]. "
> +                     "I/O emulator exit.\n", type);
> +        }
> +    }
> +    /* in case of PCI Express Base Specification Rev 2.0 */
> +    else if (version == 2) {
> +        switch (type) {
> +        case PCI_EXP_TYPE_ENDPOINT:
> +        case PCI_EXP_TYPE_LEG_END:
> +        case PCI_EXP_TYPE_RC_END:
> +            /* For Functions that do not implement the registers,
> +             * these spaces must be hardwired to 0b.
> +             */
> +            pcie_size = 0x3C;
> +            break;
> +        /* only EndPoint passthrough is supported */
> +        case PCI_EXP_TYPE_ROOT_PORT:
> +        case PCI_EXP_TYPE_UPSTREAM:
> +        case PCI_EXP_TYPE_DOWNSTREAM:
> +        case PCI_EXP_TYPE_PCI_BRIDGE:
> +        case PCI_EXP_TYPE_PCIE_BRIDGE:
> +        case PCI_EXP_TYPE_RC_EC:
> +        default:
> +            hw_error("Internal error: Unsupported device/port type[%d]. "
> +                     "I/O emulator exit.\n", type);
> +        }
> +    } else {
> +        hw_error("Internal error: Unsupported capability version[%d]. "
> +                 "I/O emulator exit.\n", version);
> +    }
> +
> +    return pcie_size;
> +}
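
The size selection above boils down to a small lookup. The sketch below restates it as a pure function that returns -1 for unsupported combinations instead of calling hw_error(), so a caller could fail device initialisation rather than abort the emulator. That is only one possible design and not what the patch does. pcie_cap_size is a hypothetical name, and the type constants carry their pci_regs.h values.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_EXP_TYPE_ENDPOINT 0x0   /* Express Endpoint */
#define PCI_EXP_TYPE_LEG_END  0x1   /* Legacy Endpoint */
#define PCI_EXP_TYPE_RC_END   0x9   /* Root Complex Integrated Endpoint */

/* Return the capability size in bytes, or -1 if the combination of
 * capability version and device/port type is not supported. */
static int pcie_cap_size(unsigned version, unsigned type)
{
    int is_endpoint = (type == PCI_EXP_TYPE_ENDPOINT ||
                       type == PCI_EXP_TYPE_LEG_END);

    if (version == 1) {
        if (is_endpoint) {
            return 0x14;        /* up to and including Link Status/Control */
        }
        if (type == PCI_EXP_TYPE_RC_END) {
            return 0x0C;        /* no link registers */
        }
    } else if (version == 2) {
        if (is_endpoint || type == PCI_EXP_TYPE_RC_END) {
            return 0x3C;        /* full structure, unused parts hardwired to 0 */
        }
    }
    return -1;
}
```
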
> +
> +static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
> +    /* Header Type0 reg group */
> +    {
> +        .grp_id      = 0xFF,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x40,
> +        .size_init   = pt_reg_grp_size_init,
> +        .emu_reg_tbl = pt_emu_reg_header0_tbl,
> +    },
> +    /* PCI PowerManagement Capability reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_PM,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = PCI_PM_SIZEOF,
> +        .size_init   = pt_pm_size_init,
> +        .emu_reg_tbl = pt_emu_reg_pm_tbl,
> +    },
> +    /* AGP Capability Structure reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_AGP,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x30,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Vital Product Data Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_VPD,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x08,
> +        .size_init   = pt_reg_grp_size_init,
> +        .emu_reg_tbl = pt_emu_reg_vpd_tbl,
> +    },
> +    /* Slot Identification reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SLOTID,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x04,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* PCI-X Capabilities List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_PCIX,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x18,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Vendor Specific Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_VNDR,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_vendor_size_init,
> +        .emu_reg_tbl = pt_emu_reg_vendor_tbl,
> +    },
> +    /* SHPC Capability List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SHPC,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x08,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* Subsystem ID and Subsystem Vendor ID Capability List Item reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_SSVID,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x08,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* AGP 8x Capability Structure reg group */
> +    {
> +        .grp_id     = PCI_CAP_ID_AGP3,
> +        .grp_type   = GRP_TYPE_HARDWIRED,
> +        .grp_size   = 0x30,
> +        .size_init  = pt_reg_grp_size_init,
> +    },
> +    /* PCI Express Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_EXP,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_pcie_size_init,
> +        .emu_reg_tbl = pt_emu_reg_pcie_tbl,
> +    },
> +    {
> +        .grp_size = 0,
> +    },
> +};
> +
> +/* initialize Capabilities Pointer or Next Pointer register */
> +static uint32_t pt_ptr_reg_init(XenPCIPassthroughState *s,
> +                                XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    /* uint32_t reg_field = (uint32_t)s->dev.config[real_offset]; */
> +    uint32_t reg_field = pci_get_byte(s->dev.config + real_offset);
> +    int i;
> +
> +    /* find capability offset */
> +    while (reg_field) {
> +        for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
> +            if (pt_hide_dev_cap(s->real_device,
> +                                pt_emu_reg_grp_tbl[i].grp_id)) {
> +                continue;
> +            }
> +            if (pt_emu_reg_grp_tbl[i].grp_id == s->dev.config[reg_field]) {
> +                if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
> +                    goto out;
> +                }
> +                /* ignore the 0 hardwired capability, find next one */
> +                break;
> +            }
> +        }
> +        /* next capability */
> +        /* reg_field = (uint32_t)s->dev.config[reg_field + 1]; */
> +        reg_field = pci_get_byte(s->dev.config + reg_field + 1);
> +    }
> +
> +out:
> +    return reg_field;
> +}
> +
> +
> +/*************
> + * Main
> + */
> +
> +/* restore a part of I/O device register */
> +static void pt_config_restore(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +    uint32_t real_offset = 0;
> +    uint32_t read_val = 0;
> +    uint32_t val = 0;
> +    int ret = 0;
> +
> +    /* find emulate register group entry */
> +    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
> +        /* find emulate register entry */
> +        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
> +            reg = reg_entry->reg;
> +
> +            /* check whether restoring is needed */
> +            if (!reg->u.b.restore) {
> +                continue;
> +            }
> +
> +            real_offset = reg_grp_entry->base_offset + reg->offset;
> +
> +            /* read I/O device register value */
> +            ret = host_pci_get_block(s->real_device, real_offset,
> +                                     (uint8_t *)&read_val, reg->size);
> +
> +            if (!ret) {
> +                PT_LOG("Error: pci_read_block failed. "
> +                       "return value[%d].\n", ret);
> +                memset(&read_val, 0xff, reg->size);
> +            }
> +
> +            val = 0;
> +
> +            /* restore based on register size */
> +            switch (reg->size) {
> +            case 1:
> +                /* byte register */
> +                ret = reg->u.b.restore(s, reg_entry, real_offset,
> +                                       (uint8_t)read_val, (uint8_t *)&val);
> +                break;
> +            case 2:
> +                /* word register */
> +                ret = reg->u.w.restore(s, reg_entry, real_offset,
> +                                       (uint16_t)read_val, (uint16_t *)&val);
> +                break;
> +            case 4:
> +                /* double word register */
> +                ret = reg->u.dw.restore(s, reg_entry, real_offset,
> +                                        (uint32_t)read_val, (uint32_t *)&val);
> +                break;
> +            }
> +
> +            /* restoring error */
> +            if (ret < 0) {
> +                hw_error("Internal error: Invalid restoring "
> +                         "return value[%d]. I/O emulator exit.\n", ret);
> +            }
> +
> +            PT_LOG_CONFIG("[%02x:%02x.%x]: address=%04x val=0x%08x len=%d\n",
> +                          pci_bus_num(s->dev.bus), PCI_SLOT(s->dev.devfn),
> +                          PCI_FUNC(s->dev.devfn),
> +                          real_offset, val, reg->size);
> +
> +            ret = host_pci_set_block(s->real_device, real_offset,
> +                                     (uint8_t *)&val, reg->size);
> +
> +            if (!ret) {
> +                PT_LOG("Error: pci_write_block failed. "
> +                       "return value[%d].\n", ret);
> +            }
> +        }
> +    }
> +
> +    /* if AER supported, restore it */
> +    if (s->pm_state->aer_base) {
> +        pt_aer_reg_restore(s);
> +    }
> +}
> +/* reinitialize all emulate registers */
> +static void pt_config_reinit(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    XenPTReg *reg_entry = NULL;
> +    XenPTRegInfo *reg = NULL;
> +
> +    /* find emulate register group entry */
> +    QLIST_FOREACH(reg_grp_entry, &s->reg_grp_tbl, entries) {
> +        /* find emulate register entry */
> +        QLIST_FOREACH(reg_entry, &reg_grp_entry->reg_tbl_list, entries) {
> +            reg = reg_entry->reg;
> +            if (reg->init) {
> +                /* initialize emulate register */
> +                reg_entry->data =
> +                    reg->init(s, reg_entry->reg,
> +                              reg_grp_entry->base_offset + reg->offset);
> +            }
> +        }
> +    }
> +}
> +
> +static int pt_init_pci_config(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    int ret = 0;
> +
> +    PT_LOG("Reinitialize PCI configuration registers due to power state"
> +           " transition with internal reset. [%02x:%02x.%x]\n",
> +           pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));
> +
> +    /* restore a part of I/O device register */
> +    pt_config_restore(s);
> +
> +    /* reinitialize all emulate register */
> +    pt_config_reinit(s);
> +
> +    /* rebind machine_irq to device */
> +    if (s->machine_irq != 0) {
> +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> +        uint8_t e_intx = pci_intx(s);
> +
> +        ret = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq, 0,
> +                                        e_device, e_intx);
> +        if (ret < 0) {
> +            PT_LOG("Error: Rebinding of interrupt failed! ret=%d\n", ret);
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static uint8_t find_cap_offset(XenPCIPassthroughState *s, uint8_t cap)
> +{
> +    int id;
> +    int max_cap = 48;
> +    int pos = PCI_CAPABILITY_LIST;
> +    int status;
> +
> +    status = host_pci_get_byte(s->real_device, PCI_STATUS);
> +    if ((status & PCI_STATUS_CAP_LIST) == 0) {
> +        return 0;
> +    }
> +
> +    while (max_cap--) {
> +        pos = host_pci_get_byte(s->real_device, pos);
> +        if (pos < 0x40) {
> +            break;
> +        }
> +
> +        pos &= ~3;
> +        id = host_pci_get_byte(s->real_device, pos + PCI_CAP_LIST_ID);
> +
> +        if (id == 0xff) {
> +            break;
> +        }
> +        if (id == cap) {
> +            return pos;
> +        }
> +
> +        pos += PCI_CAP_LIST_NEXT;
> +    }
> +    return 0;
> +}
> +
> +static void pt_config_reg_init(XenPCIPassthroughState *s,
> +                               XenPTRegGroup *reg_grp, XenPTRegInfo *reg)
> +{
> +    XenPTReg *reg_entry;
> +    uint32_t data = 0;
> +
> +    reg_entry = g_malloc0(sizeof (XenPTReg));
> +
> +    reg_entry->reg = reg;
> +    reg_entry->data = 0;
> +
> +    if (reg->init) {
> +        /* initialize emulate register */
> +        data = reg->init(s, reg_entry->reg,
> +                         reg_grp->base_offset + reg->offset);
> +        if (data == PT_INVALID_REG) {
> +            /* free unused BAR register entry */
> +            g_free(reg_entry);
> +            return;
> +        }
> +        /* set register value */
> +        reg_entry->data = data;
> +    }
> +    /* list add register entry */
> +    QLIST_INSERT_HEAD(&reg_grp->reg_tbl_list, reg_entry, entries);
> +
> +    return;
> +}
> +
> +void pt_config_init(XenPCIPassthroughState *s)
> +{
> +    XenPTRegGroup *reg_grp_entry = NULL;
> +    uint32_t reg_grp_offset = 0;
> +    XenPTRegInfo *reg_tbl = NULL;
> +    int i, j;
> +
> +    QLIST_INIT(&s->reg_grp_tbl);
> +
> +    for (i = 0; pt_emu_reg_grp_tbl[i].grp_size != 0; i++) {
> +        if (pt_emu_reg_grp_tbl[i].grp_id != 0xFF) {
> +            if (pt_hide_dev_cap(s->real_device,
> +                                pt_emu_reg_grp_tbl[i].grp_id)) {
> +                continue;
> +            }
> +
> +            reg_grp_offset = find_cap_offset(s, pt_emu_reg_grp_tbl[i].grp_id);
> +
> +            if (!reg_grp_offset) {
> +                continue;
> +            }
> +        }
> +
> +        reg_grp_entry = g_malloc0(sizeof (XenPTRegGroup));
> +        QLIST_INIT(&reg_grp_entry->reg_tbl_list);
> +        QLIST_INSERT_HEAD(&s->reg_grp_tbl, reg_grp_entry, entries);
> +
> +        reg_grp_entry->base_offset = reg_grp_offset;
> +        reg_grp_entry->reg_grp = pt_emu_reg_grp_tbl + i;
> +        if (pt_emu_reg_grp_tbl[i].size_init) {
> +            /* get register group size */
> +            reg_grp_entry->size =
> +                pt_emu_reg_grp_tbl[i].size_init(s, reg_grp_entry->reg_grp,
> +                                                reg_grp_offset);
> +        }
> +
> +        if (pt_emu_reg_grp_tbl[i].grp_type == GRP_TYPE_EMU) {
> +            if (pt_emu_reg_grp_tbl[i].emu_reg_tbl) {
> +                reg_tbl = pt_emu_reg_grp_tbl[i].emu_reg_tbl;
> +                /* initialize capability register */
> +                for (j = 0; reg_tbl->size != 0; j++, reg_tbl++) {
> +                    /* initialize capability register */
> +                    pt_config_reg_init(s, reg_grp_entry, reg_tbl);
> +                }
> +            }
> +        }
> +        reg_grp_offset = 0;
> +    }
> +
> +    return;
> +}
> +
> +/* delete all emulate register */
> +void pt_config_delete(XenPCIPassthroughState *s)
> +{
> +    struct XenPTRegGroup *reg_group, *next_grp;
> +    struct XenPTReg *reg, *next_reg;
> +
> +    /* free Power Management info table */
> +    if (s->pm_state) {
> +        if (s->pm_state->pm_timer) {
> +            qemu_del_timer(s->pm_state->pm_timer);
> +            qemu_free_timer(s->pm_state->pm_timer);
> +            s->pm_state->pm_timer = NULL;
> +        }
> +
> +        g_free(s->pm_state);
> +    }
> +
> +    /* free all register group entry */
> +    QLIST_FOREACH_SAFE(reg_group, &s->reg_grp_tbl, entries, next_grp) {
> +        /* free all register entry */
> +        QLIST_FOREACH_SAFE(reg, &reg_group->reg_tbl_list, entries, next_reg) {
> +            QLIST_REMOVE(reg, entries);
> +            g_free(reg);
> +        }
> +
> +        QLIST_REMOVE(reg_group, entries);
> +        g_free(reg_group);
> +    }
> +}
> -- 
> Anthony PERARD
> 

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 10/10] Introduce Xen PCI Passthrough, MSI (3/3)
  2011-10-28 15:07   ` Anthony PERARD
@ 2011-11-10 22:10     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-10 22:10 UTC (permalink / raw)
  To: Anthony PERARD; +Cc: Xen Devel, Shan Haitao, QEMU-devel, Stefano Stabellini

On Fri, Oct 28, 2011 at 04:07:36PM +0100, Anthony PERARD wrote:
> From: Jiang Yunhong <yunhong.jiang@intel.com>
> 
> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
> Signed-off-by: Shan Haitao <haitao.shan@intel.com>
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  Makefile.target                      |    1 +
>  hw/apic-msidef.h                     |    2 +
>  hw/xen_pci_passthrough.c             |   27 ++-
>  hw/xen_pci_passthrough.h             |   55 +++
>  hw/xen_pci_passthrough_config_init.c |  495 +++++++++++++++++++++++++-
>  hw/xen_pci_passthrough_msi.c         |  667 ++++++++++++++++++++++++++++++++++
>  6 files changed, 1240 insertions(+), 7 deletions(-)
>  create mode 100644 hw/xen_pci_passthrough_msi.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index c32c688..17b8857 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -220,6 +220,7 @@ obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_msi.o
>  
>  # Inter-VM PCI shared memory
>  CONFIG_IVSHMEM =
> diff --git a/hw/apic-msidef.h b/hw/apic-msidef.h
> index 3182f0b..6e2eb71 100644
> --- a/hw/apic-msidef.h
> +++ b/hw/apic-msidef.h
> @@ -22,6 +22,8 @@
>  
>  #define MSI_ADDR_DEST_MODE_SHIFT        2
>  
> +#define MSI_ADDR_REDIRECTION_SHIFT      3
> +
>  #define MSI_ADDR_DEST_ID_SHIFT          12
>  #define  MSI_ADDR_DEST_ID_MASK          0x00ffff0
>  
> diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
> index b97c5b6..4b9eb74 100644
> --- a/hw/xen_pci_passthrough.c
> +++ b/hw/xen_pci_passthrough.c
> @@ -417,6 +417,7 @@ static void pt_iomem_map(XenPCIPassthroughState *s, int i,
>      }
>  
>      if (!first_map && old_ebase != -1) {
> +        pt_add_msix_mapping(s, i);
>          /* Remove old mapping */
>          ret = xc_domain_memory_mapping(xen_xc, xen_domid,
>                                 old_ebase >> XC_PAGE_SHIFT,
> @@ -441,6 +442,15 @@ static void pt_iomem_map(XenPCIPassthroughState *s, int i,
>          if (ret != 0) {
>              PT_LOG("Error: create new mapping failed!\n");
>          }
> +
> +        ret = pt_remove_msix_mapping(s, i);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove MSI-X mmio mapping failed!\n");
> +        }
> +
> +        if (old_ebase != e_phys && old_ebase != -1) {
> +            pt_msix_update_remap(s, i);
> +        }
>      }
>  }
>  
> @@ -737,6 +747,9 @@ static int pt_initfn(PCIDevice *pcidev)
>          mapped_machine_irq[machine_irq]++;
>      }
>  
> +    /* setup MSI-INTx translation if supported */
> +    rc = pt_enable_msi_translate(s);
> +
>      /* bind machine_irq to device */
>      if (rc < 0 && machine_irq != 0) {
>          uint8_t e_device = PCI_SLOT(s->dev.devfn);
> @@ -765,7 +778,8 @@ static int pt_initfn(PCIDevice *pcidev)
>  
>  out:
>      PT_LOG("Real physical device %02x:%02x.%x registered successfuly!\n"
> -           "IRQ type = %s\n", bus, slot, func, "INTx");
> +           "IRQ type = %s\n", bus, slot, func,
> +           s->msi_trans_en ? "MSI-INTx" : "INTx");
>  
>      return 0;
>  }
> @@ -782,7 +796,7 @@ static int pt_unregister_device(PCIDevice *pcidev)
>      e_intx = pci_intx(s);
>      machine_irq = s->machine_irq;
>  
> -    if (machine_irq) {
> +    if (s->msi_trans_en == 0 && machine_irq) {
>          rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
>                                       PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
>          if (rc < 0) {
> @@ -790,6 +804,13 @@ static int pt_unregister_device(PCIDevice *pcidev)
>          }
>      }
>  
> +    if (s->msi) {
> +        pt_msi_disable(s);
> +    }
> +    if (s->msix) {
> +        pt_msix_disable(s);
> +    }
> +
>      if (machine_irq) {
>          mapped_machine_irq[machine_irq]--;
>  
> @@ -824,6 +845,8 @@ static PCIDeviceInfo xen_pci_passthrough = {
>      .is_express = 0,
>      .qdev.props = (Property[]) {
>          DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
> +        DEFINE_PROP_BIT("msitranslate", XenPCIPassthroughState, msi_trans_cap,
> +                        0, true),
>          DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
>                          0, false),
>          DEFINE_PROP_END_OF_LIST(),
> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
> index ebc04fd..5f404b0 100644
> --- a/hw/xen_pci_passthrough.h
> +++ b/hw/xen_pci_passthrough.h
> @@ -63,6 +63,10 @@ typedef int (*conf_byte_restore)
>  
>  #define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
>  
> +/* MSI-X */
> +#define PT_MSI_FLAG_UNINIT 0x1000
> +#define PT_MSI_FLAG_MAPPED 0x2000
> +
>  
>  typedef enum {
>      GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
> @@ -166,6 +170,34 @@ typedef struct XenPTRegGroup {
>  } XenPTRegGroup;
>  
>  
> +typedef struct XenPTMSI {
> +    uint32_t flags;
> +    uint32_t ctrl_offset; /* saved control offset */
> +    int pirq;          /* guest pirq corresponding */
> +    uint32_t addr_lo;  /* guest message address */
> +    uint32_t addr_hi;  /* guest message upper address */
> +    uint16_t data;     /* guest message data */
> +} XenPTMSI;
> +
> +typedef struct XenMSIXEntry {
> +    int pirq;        /* -1 means unmapped */
> +    int flags;       /* flags indicating whether MSI ADDR or DATA is updated */
> +    uint32_t io_mem[4];
> +} XenMSIXEntry;
> +typedef struct XenPTMSIX {
> +    uint32_t ctrl_offset;
> +    int enabled;
> +    int total_entries;
> +    int bar_index;
> +    uint64_t table_base;
> +    uint32_t table_off;
> +    uint32_t table_offset_adjust; /* page align mmap */
> +    uint64_t mmio_base_addr;
> +    int mmio_index;
> +    void *phys_iomem_base;
> +    XenMSIXEntry msix_entry[0];
> +} XenPTMSIX;
> +
>  typedef struct XenPTPM {
>      QEMUTimer *pm_timer;  /* QEMUTimer struct */
>      int no_soft_reset;    /* No Soft Reset flags */
> @@ -189,6 +221,13 @@ struct XenPCIPassthroughState {
>  
>      uint32_t machine_irq;
>  
> +    XenPTMSI *msi;
> +    XenPTMSIX *msix;
> +
> +    /* Physical MSI to guest INTx translation when possible */
> +    uint32_t msi_trans_cap;
> +    bool msi_trans_en;
> +
>      uint32_t power_mgmt;
>      XenPTPM *pm_state;
>  
> @@ -222,4 +261,20 @@ static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
>  }
>  uint8_t pci_intx(XenPCIPassthroughState *ptdev);
>  
> +/* MSI/MSI-X */
> +void pt_msi_set_enable(XenPCIPassthroughState *s, int en);
> +int pt_msi_setup(XenPCIPassthroughState *s);
> +int pt_msi_update(XenPCIPassthroughState *d);
> +void pt_msi_disable(XenPCIPassthroughState *s);
> +int pt_enable_msi_translate(XenPCIPassthroughState *s);
> +void pt_disable_msi_translate(XenPCIPassthroughState *s);
> +
> +int pt_msix_init(XenPCIPassthroughState *s, int pos);
> +void pt_msix_delete(XenPCIPassthroughState *s);
> +int pt_msix_update(XenPCIPassthroughState *s);
> +int pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index);
> +void pt_msix_disable(XenPCIPassthroughState *s);
> +int pt_add_msix_mapping(XenPCIPassthroughState *s, int bar_index);
> +int pt_remove_msix_mapping(XenPCIPassthroughState *s, int bar_index);
> +
>  #endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
> diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
> index 4103b59..b4238ee 100644
> --- a/hw/xen_pci_passthrough_config_init.c
> +++ b/hw/xen_pci_passthrough_config_init.c
> @@ -375,11 +375,19 @@ static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
>      throughable_mask = ~emu_mask & valid_mask;
>  
>      if (*value & PCI_COMMAND_INTX_DISABLE) {
> -        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> -    } else {
> -        if (s->machine_irq) {
> +        if (s->msi_trans_en) {
> +            pt_msi_set_enable(s, 0);
> +        } else {
>              throughable_mask |= PCI_COMMAND_INTX_DISABLE;
>          }
> +    } else {
> +        if (s->msi_trans_en) {
> +            pt_msi_set_enable(s, 1);
> +        } else {
> +            if (s->machine_irq) {
> +                throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +            }
> +        }
>      }
>  
>      *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> @@ -1248,13 +1256,21 @@ static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
>      e_device = PCI_SLOT(s->dev.devfn);
>      e_intx = pci_intx(s);
>  
> -    if (s->machine_irq) {
> +    if (s->msi_trans_en == 0 && s->machine_irq) {
>          if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
>                                      PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
>              PT_LOG("Error: Unbinding of interrupt failed!\n");
>          }
>      }
>  
> +    /* disable MSI/MSI-X and MSI-INTx translation */
> +    if (s->msi) {
> +        pt_msi_disable(s);
> +    }
> +    if (s->msix) {
> +        pt_msix_disable(s);
> +    }
> +
>      /* clear all virtual region address */
>      for (i = 0; i < PCI_NUM_REGIONS; i++) {
>          r = &d->io_regions[i];
> @@ -1501,6 +1517,406 @@ static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
>      },
>  };
>  
> +/********************************
> + * MSI Capability
> + */
> +
> +/* Message Control register */
> +static uint32_t pt_msgctrl_reg_init(XenPCIPassthroughState *s,
> +                                    XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t reg_field = 0;
> +
> +    /* use I/O device register's value as initial value */
> +    reg_field = pci_get_word(d->config + real_offset);
> +
> +    if (reg_field & PCI_MSI_FLAGS_ENABLE) {
> +        PT_LOG("MSI enabled already, disable first\n");
> +        host_pci_set_word(s->real_device, real_offset,
> +                          reg_field & ~PCI_MSI_FLAGS_ENABLE);
> +    }
> +    s->msi->flags |= reg_field | PT_MSI_FLAG_UNINIT;
> +    s->msi->ctrl_offset = real_offset;
> +
> +    return reg->init_val;
> +}
> +static int pt_msgctrl_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                                uint16_t *value, uint16_t dev_value,
> +                                uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    PCIDevice *pd = (PCIDevice *)s;
> +    uint16_t val;
> +
> +    /* Currently no support for multi-vector */
> +    if (*value & PCI_MSI_FLAGS_QSIZE) {
> +        PT_LOG("Warning: try to set more than 1 vector ctrl %x\n", *value);
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +    /* update the msi_info too */
> +    s->msi->flags |= cfg_entry->data &
> +        ~(PT_MSI_FLAG_UNINIT | PT_MSI_FLAG_MAPPED | PCI_MSI_FLAGS_ENABLE);
> +
> +    /* create value for writing to I/O device register */
> +    val = *value;
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI */
> +    if (val & PCI_MSI_FLAGS_ENABLE) {
> +        /* setup MSI pirq for the first time */
> +        if (s->msi->flags & PT_MSI_FLAG_UNINIT) {
> +            if (s->msi_trans_en) {
> +                PT_LOG("guest enabling MSI, disable MSI-INTx translation\n");
> +                pt_disable_msi_translate(s);
> +            } else {
> +                /* Init physical one */
> +                PT_LOG("setup msi for dev %x\n", pd->devfn);
> +                if (pt_msi_setup(s)) {
> +                    /* We do not broadcast the error to the framework code, so
> +                     * that MSI errors are contained in MSI emulation code and
> +                     * QEMU can go on running.
> +                     * Guest MSI would be actually not working.
> +                     */
> +                    *value &= ~PCI_MSI_FLAGS_ENABLE;
> +                    PT_LOG("Warning: Can not map MSI for dev %x\n", pd->devfn);
> +                    return 0;
> +                }
> +            }
> +            if (pt_msi_update(s)) {
> +                *value &= ~PCI_MSI_FLAGS_ENABLE;
> +                PT_LOG("Warning: Can not bind MSI for dev %x\n", pd->devfn);
> +                return 0;
> +            }
> +            s->msi->flags &= ~PT_MSI_FLAG_UNINIT;
> +            s->msi->flags |= PT_MSI_FLAG_MAPPED;
> +        }
> +        s->msi->flags |= PCI_MSI_FLAGS_ENABLE;
> +    } else {
> +        s->msi->flags &= ~PCI_MSI_FLAGS_ENABLE;
> +    }
> +
> +    /* pass through MSI_ENABLE bit when no MSI-INTx translation */
> +    if (!s->msi_trans_en) {
> +        *value &= ~PCI_MSI_FLAGS_ENABLE;
> +        *value |= val & PCI_MSI_FLAGS_ENABLE;
> +    }
> +
> +    return 0;
> +}
> +
> +/* initialize Message Upper Address register */
> +static uint32_t pt_msgaddr64_reg_init(XenPCIPassthroughState *ptdev,
> +                                      XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    /* no need to initialize in case of 32 bit type */
> +    if (!(ptdev->msi->flags & PCI_MSI_FLAGS_64BIT)) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* this function will be called twice (for 32 bit and 64 bit type) */
> +/* initialize Message Data register */
> +static uint32_t pt_msgdata_reg_init(XenPCIPassthroughState *ptdev,
> +                                    XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint32_t flags = ptdev->msi->flags;
> +    uint32_t offset = reg->offset;
> +
> +    /* check the offset whether matches the type or not */
> +    if (((offset == PCI_MSI_DATA_64) &&  (flags & PCI_MSI_FLAGS_64BIT)) ||
> +        ((offset == PCI_MSI_DATA_32) && !(flags & PCI_MSI_FLAGS_64BIT))) {
> +        return reg->init_val;
> +    } else {
> +        return PT_INVALID_REG;
> +    }
> +}
> +
> +/* write Message Address register */
> +static int pt_msgaddr32_reg_write(XenPCIPassthroughState *s,
> +                                  XenPTReg *cfg_entry, uint32_t *value,
> +                                  uint32_t dev_value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    uint32_t old_addr = cfg_entry->data;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +    /* update the msi_info too */
> +    s->msi->addr_lo = cfg_entry->data;
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI */
> +    if (cfg_entry->data != old_addr) {
> +        if (s->msi->flags & PT_MSI_FLAG_MAPPED) {
> +            pt_msi_update(s);
> +        }
> +    }
> +
> +    return 0;
> +}
> +/* write Message Upper Address register */
> +static int pt_msgaddr64_reg_write(XenPCIPassthroughState *s,
> +                                  XenPTReg *cfg_entry, uint32_t *value,
> +                                  uint32_t dev_value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    uint32_t old_addr = cfg_entry->data;
> +
> +    /* check whether the type is 64 bit or not */
> +    if (!(s->msi->flags & PCI_MSI_FLAGS_64BIT)) {
> +        /* exit I/O emulator */
> +        PT_LOG("Error: why comes to Upper Address without 64 bit support??\n");

Um, I'm not sure what that error message is trying to say.
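
My reading is that the check guards the Message Upper Address dword, which
only exists when the 64-bit capable bit is set in Message Control. A minimal
sketch of that layout rule, assuming the usual pci_regs.h values; the msi_*
helper names are made up for illustration and are not part of the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Offsets and flag as defined in Linux's pci_regs.h. */
#define PCI_MSI_FLAGS_64BIT  0x80  /* 64-bit addresses allowed */
#define PCI_MSI_ADDRESS_LO   4     /* always present */
#define PCI_MSI_ADDRESS_HI   8     /* present only if PCI_MSI_FLAGS_64BIT */

/* Hypothetical helper: does the capability carry an Upper Address register? */
static bool msi_has_upper_address(uint16_t msgctrl)
{
    return (msgctrl & PCI_MSI_FLAGS_64BIT) != 0;
}

/* Hypothetical helper: is 'offset' a Message Address register that exists
 * for a device with this Message Control value? */
static bool msi_addr_offset_valid(uint32_t offset, uint16_t msgctrl)
{
    if (offset == PCI_MSI_ADDRESS_LO) {
        return true;
    }
    if (offset == PCI_MSI_ADDRESS_HI) {
        return msi_has_upper_address(msgctrl);
    }
    return false;
}
```

If that is the intent, something like "Upper Address written on a device
without 64-bit MSI support" might be a clearer message.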

> +        return -1;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +    /* update the msi_info too */
> +    s->msi->addr_hi = cfg_entry->data;
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI */
> +    if (cfg_entry->data != old_addr) {
> +        if (s->msi->flags & PT_MSI_FLAG_MAPPED) {
> +            pt_msi_update(s);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +
> +/* this function will be called twice (for 32 bit and 64 bit type) */
> +/* write Message Data register */
> +static int pt_msgdata_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                                uint16_t *value, uint16_t dev_value,
> +                                uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    uint16_t old_data = cfg_entry->data;
> +    uint32_t flags = s->msi->flags;
> +    uint32_t offset = reg->offset;
> +
> +    /* check the offset whether matches the type or not */
> +    if (!((offset == PCI_MSI_DATA_64) &&  (flags & PCI_MSI_FLAGS_64BIT)) &&
> +        !((offset == PCI_MSI_DATA_32) && !(flags & PCI_MSI_FLAGS_64BIT))) {
> +        /* exit I/O emulator */
> +        PT_LOG("Error: the offset is not match with the 32/64 bit type!!\n");

I think it means: "The offset does not match the 32/64 bit type"
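
Spelled out, the rule being enforced seems to be that Message Data lives at a
different offset depending on whether the capability is the 32-bit or the
64-bit variant. A small sketch of an equivalent check, assuming the usual
pci_regs.h values; msi_data_offset() and msi_data_offset_matches() are
hypothetical names, not functions from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Offsets and flag as defined in Linux's pci_regs.h. */
#define PCI_MSI_FLAGS_64BIT  0x80  /* 64-bit addresses allowed */
#define PCI_MSI_DATA_32      8     /* Message Data, 32-bit capability */
#define PCI_MSI_DATA_64      12    /* Message Data, 64-bit capability */

/* Hypothetical helper: where Message Data sits for the given flags. */
static uint32_t msi_data_offset(uint32_t flags)
{
    return (flags & PCI_MSI_FLAGS_64BIT) ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32;
}

/* Hypothetical helper: true when 'offset' is the Message Data register
 * that actually exists for this capability type. */
static bool msi_data_offset_matches(uint32_t offset, uint32_t flags)
{
    return offset == msi_data_offset(flags);
}
```

That gives the same result as the two-clause condition in the patch, since
only one of the two offsets can be valid for a given device.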

> +        return -1;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +    /* update the msi_info too */
> +    s->msi->data = cfg_entry->data;
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI */
> +    if (cfg_entry->data != old_data) {
> +        if (flags & PT_MSI_FLAG_MAPPED) {
> +            pt_msi_update(s);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +/* MSI Capability Structure reg static information table */
> +static XenPTRegInfo pt_emu_reg_msi_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Message Control reg */
> +    {
> +        .offset     = PCI_MSI_FLAGS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFF8E,
> +        .emu_mask   = 0x007F,
> +        .init       = pt_msgctrl_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_msgctrl_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Message Address reg */
> +    {
> +        .offset     = PCI_MSI_ADDRESS_LO,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x00000003,
> +        .emu_mask   = 0xFFFFFFFF,
> +        .no_wb      = 1,
> +        .init       = pt_common_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_msgaddr32_reg_write,
> +        .u.dw.restore = NULL,
> +    },
> +    /* Message Upper Address reg (if PCI_MSI_FLAGS_64BIT set) */
> +    {
> +        .offset     = PCI_MSI_ADDRESS_HI,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x00000000,
> +        .emu_mask   = 0xFFFFFFFF,
> +        .no_wb      = 1,
> +        .init       = pt_msgaddr64_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_msgaddr64_reg_write,
> +        .u.dw.restore = NULL,
> +    },
> +    /* Message Data reg (16 bits of data for 32-bit devices) */
> +    {
> +        .offset     = PCI_MSI_DATA_32,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x0000,
> +        .emu_mask   = 0xFFFF,
> +        .no_wb      = 1,
> +        .init       = pt_msgdata_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_msgdata_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Message Data reg (16 bits of data for 64-bit devices) */
> +    {
> +        .offset     = PCI_MSI_DATA_64,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x0000,
> +        .emu_mask   = 0xFFFF,
> +        .no_wb      = 1,
> +        .init       = pt_msgdata_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_msgdata_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/**************************************
> + * MSI-X Capability
> + */
> +
> +/* Message Control register for MSI-X */
> +static uint32_t pt_msixctrl_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t reg_field = 0;
> +
> +    /* use I/O device register's value as initial value */
> +    reg_field = pci_get_word(d->config + real_offset);
> +
> +    if (reg_field & PCI_MSIX_FLAGS_ENABLE) {
> +        PT_LOG("MSIX enabled already, disable first\n");
> +        host_pci_set_word(s->real_device, real_offset,
> +                          reg_field & ~PCI_MSIX_FLAGS_ENABLE);
> +    }
> +
> +    s->msix->ctrl_offset = real_offset;
> +
> +    return reg->init_val;
> +}
> +static int pt_msixctrl_reg_write(XenPCIPassthroughState *s,
> +                                 XenPTReg *cfg_entry, uint16_t *value,
> +                                 uint16_t dev_value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI-X */
> +    if ((*value & PCI_MSIX_FLAGS_ENABLE)
> +        && !(*value & PCI_MSIX_FLAGS_MASKALL)) {
> +        if (s->msi_trans_en) {
> +            PT_LOG("guest enabling MSI-X, disable MSI-INTx translation\n");
> +            pt_disable_msi_translate(s);
> +        }
> +        pt_msix_update(s);
> +    }
> +
> +    s->msix->enabled = !!(*value & PCI_MSIX_FLAGS_ENABLE);
> +
> +    return 0;
> +}
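
For readers following the mask logic in these write handlers, here is a minimal standalone sketch of how a guest write is split between the emulated copy and the value sent to the device. It assumes PT_MERGE_VALUE takes the masked bits from its first argument and the rest from its second (the macro is defined elsewhere in the series, so that definition is an assumption), and struct reg_masks is a hypothetical stand-in for the two XenPTRegInfo fields used.

```c
#include <stdint.h>

/* Assumed definition: bits selected by mask come from value,
 * all other bits come from data. */
#define PT_MERGE_VALUE(value, data, mask) \
    (((value) & (mask)) | ((data) & ~(mask)))

/* Hypothetical stand-in for the XenPTRegInfo fields used below. */
struct reg_masks {
    uint16_t ro_mask;   /* bits the guest may not change */
    uint16_t emu_mask;  /* bits kept in the emulated copy */
};

/* New emulated register content after a guest write. */
static uint16_t merge_emulated(const struct reg_masks *r, uint16_t guest,
                               uint16_t emulated, uint16_t valid_mask)
{
    uint16_t writable_mask = r->emu_mask & ~r->ro_mask & valid_mask;

    return PT_MERGE_VALUE(guest, emulated, writable_mask);
}

/* Value that is written through to the physical device. */
static uint16_t merge_device(const struct reg_masks *r, uint16_t guest,
                             uint16_t dev_value, uint16_t valid_mask)
{
    uint16_t throughable_mask = ~r->emu_mask & valid_mask;

    return PT_MERGE_VALUE(guest, dev_value, throughable_mask);
}
```

With the MSI-X Message Control masks from the table below (ro_mask 0x3FFF, emu_mask 0x0000), writable_mask comes out as 0 and throughable_mask as 0xFFFF, so under this reading the whole guest value is passed to the device.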
> +
> +/* MSI-X Capability Structure reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_msix_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Message Control reg */
> +    {
> +        .offset     = PCI_MSI_FLAGS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x3FFF,
> +        .emu_mask   = 0x0000,
> +        .init       = pt_msixctrl_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_msixctrl_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
>  
>  /****************************
>   * Capabilities
> @@ -1664,6 +2080,48 @@ static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
>  
>      return pcie_size;
>  }
> +/* get MSI Capability Structure register group size */
> +static uint8_t pt_msi_size_init(XenPCIPassthroughState *s,
> +                                const XenPTRegGroupInfo *grp_reg,
> +                                uint32_t base_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t msg_ctrl = 0;
> +    uint8_t msi_size = 0xa;
> +
> +    msg_ctrl = pci_get_word(d->config + (base_offset + PCI_MSI_FLAGS));
> +
> +    /* check 64 bit address capable & Per-vector masking capable */

ehh?


> +    if (msg_ctrl & PCI_MSI_FLAGS_64BIT) {
> +        msi_size += 4;
> +    }
> +    if (msg_ctrl & PCI_MSI_FLAGS_MASKBIT) {
> +        msi_size += 10;
> +    }
> +
> +    s->msi = g_malloc0(sizeof (XenPTMSI));
> +    s->msi->pirq = -1;

Is there a define for this -1?
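
Something along these lines is what I have in mind, as a rough sketch only. Both names are hypothetical and do not exist in the patch.

```c
#include <stdbool.h>

/* Hypothetical name for the "no pirq mapped yet" sentinel. */
#define XEN_PT_UNASSIGNED_PIRQ (-1)

/* Hypothetical helper: true once a real pirq has been stored. */
static bool pt_pirq_is_mapped(int pirq)
{
    return pirq != XEN_PT_UNASSIGNED_PIRQ;
}
```

The comparisons against a bare -1 further down (for example in pt_msi_disable and pt_msix_disable) could then use the same name.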

> +    PT_LOG("done\n");
> +
> +    return msi_size;
> +}
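
For reference, the size arithmetic above can be checked in isolation. This is a minimal sketch, assuming the usual pci_regs.h values for the two flag bits (they are not shown in this patch); msi_cap_size is a hypothetical name.

```c
#include <stdint.h>

/* Assumed values, as found in the Linux pci_regs.h header. */
#define PCI_MSI_FLAGS_64BIT   0x0080  /* 64-bit address capable */
#define PCI_MSI_FLAGS_MASKBIT 0x0100  /* per-vector masking capable */

/* Hypothetical helper mirroring the computation in pt_msi_size_init. */
static uint8_t msi_cap_size(uint16_t msg_ctrl)
{
    uint8_t size = 0x0a;  /* ID, next ptr, control, 32-bit address, data */

    if (msg_ctrl & PCI_MSI_FLAGS_64BIT) {
        size += 4;        /* upper 32 bits of the message address */
    }
    if (msg_ctrl & PCI_MSI_FLAGS_MASKBIT) {
        size += 10;       /* 2 reserved bytes, mask bits, pending bits */
    }
    return size;
}
```

The four combinations give 10, 14, 20 and 24 bytes, which matches the capability layouts I would expect.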
> +/* get MSI-X Capability Structure register group size */
> +static uint8_t pt_msix_size_init(XenPCIPassthroughState *s,
> +                                 const XenPTRegGroupInfo *grp_reg,
> +                                 uint32_t base_offset)
> +{
> +    int ret = 0;
> +
> +    ret = pt_msix_init(s, base_offset);
> +
> +    if (ret == -1) {
> +        hw_error("Internal error: Invalid pt_msix_init return value[%d]. "
> +                 "I/O emulator exit.\n", ret);
> +    }
> +
> +    return grp_reg->grp_size;
> +}
> +
>  
>  static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
>      /* Header Type0 reg group */
> @@ -1704,6 +2162,14 @@ static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
>          .grp_size   = 0x04,
>          .size_init  = pt_reg_grp_size_init,
>      },
> +    /* MSI Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_MSI,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_msi_size_init,
> +        .emu_reg_tbl = pt_emu_reg_msi_tbl,
> +    },
>      /* PCI-X Capabilities List Item reg group */
>      {
>          .grp_id     = PCI_CAP_ID_PCIX,
> @@ -1748,6 +2214,14 @@ static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
>          .size_init   = pt_pcie_size_init,
>          .emu_reg_tbl = pt_emu_reg_pcie_tbl,
>      },
> +    /* MSI-X Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_MSIX,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x0C,
> +        .size_init   = pt_msix_size_init,
> +        .emu_reg_tbl = pt_emu_reg_msix_tbl,
> +    },
>      {
>          .grp_size = 0,
>      },
> @@ -1908,8 +2382,11 @@ static int pt_init_pci_config(XenPCIPassthroughState *s)
>      /* reinitialize all emulate register */
>      pt_config_reinit(s);
>  
> +    /* setup MSI-INTx translation if support */
> +    ret = pt_enable_msi_translate(s);
> +
>      /* rebind machine_irq to device */
> -    if (s->machine_irq != 0) {
> +    if (ret < 0 && s->machine_irq != 0) {

So can machine_irq be -1? Or is it only pirq that can be -1?


>          uint8_t e_device = PCI_SLOT(s->dev.devfn);
>          uint8_t e_intx = pci_intx(s);
>  
> @@ -2043,6 +2520,14 @@ void pt_config_delete(XenPCIPassthroughState *s)
>      struct XenPTRegGroup *reg_group, *next_grp;
>      struct XenPTReg *reg, *next_reg;
>  
> +    /* free MSI/MSI-X info table */
> +    if (s->msix) {
> +        pt_msix_delete(s);
> +    }
> +    if (s->msi) {
> +        g_free(s->msi);
> +    }
> +
>      /* free Power Management info table */
>      if (s->pm_state) {
>          if (s->pm_state->pm_timer) {
> diff --git a/hw/xen_pci_passthrough_msi.c b/hw/xen_pci_passthrough_msi.c
> new file mode 100644
> index 0000000..533aef4
> --- /dev/null
> +++ b/hw/xen_pci_passthrough_msi.c
> @@ -0,0 +1,667 @@
> +/*
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Jiang Yunhong <yunhong.jiang@intel.com>
> + *
> + * This file implements direct PCI assignment to a HVM guest
> + */
> +
> +#include <sys/mman.h>
> +
> +#include "xen_backend.h"
> +#include "xen_pci_passthrough.h"
> +#include "apic-msidef.h"
> +
> +
> +#define AUTO_ASSIGN -1
> +
> +/* shift count for gflags */
> +#define GFLAGS_SHIFT_DEST_ID        0
> +#define GFLAGS_SHIFT_RH             8
> +#define GFLAGS_SHIFT_DM             9
> +#define GLFAGS_SHIFT_DELIV_MODE     12
> +#define GLFAGS_SHIFT_TRG_MODE       15
> +
> +
> +void pt_msi_set_enable(XenPCIPassthroughState *s, int en)
> +{
> +    uint16_t val = 0;
> +    uint32_t address = 0;
> +    PT_LOG("enable: %i\n", en);
> +
> +    if (!s->msi) {
> +        return;
> +    }
> +
> +    address = s->msi->ctrl_offset;
> +    if (!address) {
> +        return;
> +    }
> +
> +    val = host_pci_get_word(s->real_device, address);
> +    val &= ~PCI_MSI_FLAGS_ENABLE;
> +    val |= en & PCI_MSI_FLAGS_ENABLE;
> +    host_pci_set_word(s->real_device, address, val);
> +
> +    PT_LOG("done, address: %#x, val: %#x\n", address, val);
> +}
> +
> +static void msix_set_enable(XenPCIPassthroughState *s, int en)
> +{
> +    uint16_t val = 0;
> +    uint32_t address = 0;
> +
> +    if (!s->msix) {
> +        return;
> +    }
> +
> +    address = s->msix->ctrl_offset;
> +    if (!address) {
> +        return;
> +    }
> +
> +    val = host_pci_get_word(s->real_device, address);
> +    val &= ~PCI_MSIX_FLAGS_ENABLE;
> +    if (en) {
> +        val |= PCI_MSIX_FLAGS_ENABLE;
> +    }
> +    host_pci_set_word(s->real_device, address, val);
> +}
> +
> +/*********************************/
> +/* MSI virtuailization functions */


virtualization
> +
> +/*
> + * setup physical msi, but didn't enable it

but don't

> + */
> +int pt_msi_setup(XenPCIPassthroughState *s)
> +{
> +    int pirq = -1;
> +    uint8_t gvec = 0;
> +
> +    if (!(s->msi->flags & PT_MSI_FLAG_UNINIT)) {
> +        PT_LOG("Error: setup physical after initialized??\n");

I am not sure what that says.

> +        return -1;
> +    }
> +
> +    gvec = s->msi->data & 0xFF;
> +    if (!gvec) {
> +        /* if gvec is 0, the guest is asking for a particular pirq that
> +         * is passed as dest_id */
> +        pirq = (s->msi->addr_hi & 0xffffff00) |
> +               ((s->msi->addr_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> +        if (!pirq) {
> +            /* this probably identifies an misconfiguration of the guest,
> +             * try the emulated path */
> +            pirq = -1;
> +        } else {
> +            PT_LOG("pt_msi_setup requested pirq = %d\n", pirq);
> +        }
> +    }
> +
> +    if (xc_physdev_map_pirq_msi(xen_xc, xen_domid, AUTO_ASSIGN, &pirq,
> +                                PCI_DEVFN(s->real_device->dev,
> +                                          s->real_device->func),
> +                                s->real_device->bus, 0, 0)) {
> +        PT_LOG("Error: Mapping of MSI failed.\n");

Give more details, as in which device failed. Perhaps even the return code?

> +        return -1;
> +    }
> +
> +    if (pirq < 0) {
> +        PT_LOG("Error: Invalid pirq number\n");
> +        return -1;
> +    }
> +
> +    s->msi->pirq = pirq;
> +    PT_LOG("msi mapped with pirq %x\n", pirq);
> +
> +    return 0;
> +}
> +
> +static uint32_t __get_msi_gflags(uint32_t data, uint64_t addr)
> +{
> +    uint32_t result = 0;
> +    int rh, dm, dest_id, deliv_mode, trig_mode;
> +
> +    rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
> +    dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
> +    dest_id = (addr >> MSI_ADDR_DEST_ID_SHIFT) & 0xff;
> +    deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
> +    trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> +
> +    result = dest_id | (rh << GFLAGS_SHIFT_RH) | (dm << GFLAGS_SHIFT_DM) |
> +             (deliv_mode << GLFAGS_SHIFT_DELIV_MODE) |
> +             (trig_mode << GLFAGS_SHIFT_TRG_MODE);
> +
> +    return result;
> +}
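
A worked example may help when reading the gflags packing. The sketch below is standalone: the address shifts are the ones in the apic-msidef.h hunk of this patch, while the two data shifts (8 and 15) are assumed from the existing header, and msi_gflags is a hypothetical name for the same computation.

```c
#include <stdint.h>

#define MSI_ADDR_DEST_MODE_SHIFT      2
#define MSI_ADDR_REDIRECTION_SHIFT    3
#define MSI_ADDR_DEST_ID_SHIFT        12
#define MSI_DATA_DELIVERY_MODE_SHIFT  8   /* assumed */
#define MSI_DATA_TRIGGER_SHIFT        15  /* assumed */

/* Bit positions inside gflags, as defined at the top of the new file. */
#define GFLAGS_SHIFT_RH          8
#define GFLAGS_SHIFT_DM          9
#define GFLAGS_SHIFT_DELIV_MODE  12
#define GFLAGS_SHIFT_TRG_MODE    15

/* Pack the MSI address/data fields into the gflags word passed to Xen. */
static uint32_t msi_gflags(uint32_t data, uint64_t addr)
{
    uint32_t rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
    uint32_t dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
    uint32_t dest_id = (addr >> MSI_ADDR_DEST_ID_SHIFT) & 0xff;
    uint32_t deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
    uint32_t trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;

    return dest_id | (rh << GFLAGS_SHIFT_RH) | (dm << GFLAGS_SHIFT_DM) |
           (deliv_mode << GFLAGS_SHIFT_DELIV_MODE) |
           (trig_mode << GFLAGS_SHIFT_TRG_MODE);
}
```

For addr 0xFEE0500C and data 0x8141 this gives dest_id 0x05, rh 1, dm 1, delivery mode 1 and trigger mode 1, so gflags is 0x9305. The vector (0x41) is passed separately as gvec.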
> +
> +int pt_msi_update(XenPCIPassthroughState *s)
> +{
> +    uint8_t gvec = 0;
> +    uint32_t gflags = 0;
> +    uint64_t addr = 0;
> +    int ret = 0;
> +
> +    /* get vector, address, flags info, etc. */
> +    gvec = s->msi->data & 0xFF;
> +    addr = (uint64_t)s->msi->addr_hi << 32 | s->msi->addr_lo;
> +    gflags = __get_msi_gflags(s->msi->data, addr);
> +
> +    PT_LOG("Update msi with pirq %x gvec %x gflags %x\n",
> +           s->msi->pirq, gvec, gflags);

And the details for the device?

> +
> +    ret = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
> +                                   s->msi->pirq, gflags, 0);
> +
> +    if (ret) {
> +        PT_LOG("Error: Binding of MSI failed.\n");
> +
> +        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
> +            PT_LOG("Error: Unmapping of MSI failed.\n");
> +        }
> +        s->msi->pirq = -1;
> +        return ret;
> +    }
> +    return 0;
> +}
> +
> +void pt_msi_disable(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint8_t gvec = 0;
> +    uint32_t gflags = 0;
> +    uint64_t addr = 0;
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    pt_msi_set_enable(s, 0);
> +
> +    e_device = PCI_SLOT(d->devfn);
> +    e_intx = pci_intx(s);
> +
> +    if (s->msi_trans_en) {
> +        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
> +                                    PT_IRQ_TYPE_MSI_TRANSLATE, 0,
> +                                    e_device, e_intx, 0)) {
> +            PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!\n");
> +            goto out;
> +        }
> +    } else if (!(s->msi->flags & PT_MSI_FLAG_UNINIT)) {
> +        /* get vector, address, flags info, etc. */
> +        gvec = s->msi->data & 0xFF;
> +        addr = (uint64_t)s->msi->addr_hi << 32 | s->msi->addr_lo;
> +        gflags = __get_msi_gflags(s->msi->data, addr);
> +
> +        PT_LOG("Unbind msi with pirq %x, gvec %x\n",
> +                s->msi->pirq, gvec);
> +
> +        if (xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
> +                                        s->msi->pirq, gflags)) {
> +            PT_LOG("Error: Unbinding of MSI failed. [%02x:%02x.%x]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                   PCI_FUNC(d->devfn));
> +            goto out;
> +        }
> +    }
> +
> +    if (s->msi->pirq != -1) {
> +        PT_LOG("Unmap msi with pirq %x\n", s->msi->pirq);
> +
> +        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
> +            PT_LOG("Error: Unmapping of MSI failed. [%02x:%02x.%x]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                   PCI_FUNC(d->devfn));
> +            goto out;
> +        }
> +    }
> +
> +out:
> +    /* clear msi info */
> +    s->msi->flags = 0;
> +    s->msi->pirq = -1;
> +    s->msi_trans_en = 0;
> +}
> +
> +/* MSI-INTx translation virtulization functions */

virtualization

> +int pt_enable_msi_translate(XenPCIPassthroughState *s)
> +{
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    if (!(s->msi && s->msi_trans_cap)) {
> +        return -1;
> +    }
> +
> +    pt_msi_set_enable(s, 0);
> +    s->msi_trans_en = 0;
> +
> +    if (pt_msi_setup(s)) {
> +        PT_LOG("Error: MSI-INTx translation MSI setup failed, fallback\n");
> +        return -1;
> +    }
> +
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    /* fix virtual interrupt pin to INTA# */
> +    e_intx = pci_intx(s);
> +
> +    if (xc_domain_bind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
> +                              PT_IRQ_TYPE_MSI_TRANSLATE, 0,
> +                              e_device, e_intx, 0)) {
> +        PT_LOG("Error: MSI-INTx translation bind failed, fallback\n");
> +
> +        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
> +            PT_LOG("Error: Unmapping of MSI failed.\n");
> +        }
> +        s->msi->pirq = -1;
> +        return -1;
> +    }
> +
> +    pt_msi_set_enable(s, 1);
> +    s->msi_trans_en = 1;
> +
> +    return 0;
> +}
> +
> +void pt_disable_msi_translate(XenPCIPassthroughState *s)
> +{
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    /* MSI_ENABLE bit should be disabed until the new handler is set */
> +    pt_msi_set_enable(s, 0);
> +
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    e_intx = pci_intx(s);
> +
> +    if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
> +                                 PT_IRQ_TYPE_MSI_TRANSLATE, 0,
> +                                 e_device, e_intx, 0)) {
> +        PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!\n");
> +    }
> +
> +    if (s->machine_irq) {
> +        if (xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq,
> +                                       0, e_device, e_intx)) {
> +            PT_LOG("Error: Rebinding of interrupt failed!\n");
> +        }
> +    }
> +
> +    s->msi_trans_en = 0;
> +}
> +
> +/*********************************/
> +/* MSI-X virtulization functions */


virtu...

> +
> +static void mask_physical_msix_entry(XenPCIPassthroughState *s,
> +                                     int entry_nr, int mask)
> +{
> +    void *phys_off;
> +
> +    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
> +    *(uint32_t *)phys_off = mask;
> +}
> +
> +static int pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr)
> +{
> +    XenMSIXEntry *entry = &s->msix->msix_entry[entry_nr];
> +    int pirq = entry->pirq;
> +    int gvec = entry->io_mem[2] & 0xff;
> +    uint64_t gaddr = *(uint64_t *)&entry->io_mem[0];
> +    uint32_t gflags = __get_msi_gflags(entry->io_mem[2], gaddr);
> +    int ret;
> +
> +    if (!entry->flags) {
> +        return 0;
> +    }
> +
> +    if (!gvec) {
> +        /* if gvec is 0, the guest is asking for a particular pirq that
> +         * is passed as dest_id */
> +        pirq = ((gaddr >> 32) & 0xffffff00) |
> +               (((gaddr & 0xffffffff) >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> +        if (!pirq) {
> +            /* this probably identifies an misconfiguration of the guest,
> +             * try the emulated path */
> +            pirq = -1;
> +        } else {
> +            PT_LOG("pt_msix_update_one requested pirq = %d\n", pirq);

This is the same code as in the MSI case. Could it be coalesced?
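
Roughly what I mean, as an untested sketch. pt_msi_requested_pirq is a hypothetical name, and the helper only covers the address decoding; each caller would still decide what to do when it returns -1 (the MSI path starts from -1, the MSI-X path from entry->pirq).

```c
#include <stdint.h>

#define MSI_ADDR_DEST_ID_SHIFT 12  /* as in hw/apic-msidef.h */

/*
 * Hypothetical shared helper for the MSI and MSI-X paths.
 * Returns the pirq the guest asked for, or -1 when the guest supplied
 * a real vector or the encoded pirq is 0 (probably a misconfigured
 * guest, so the caller falls back to the emulated path).
 */
static int pt_msi_requested_pirq(uint32_t addr_hi, uint32_t addr_lo,
                                 uint32_t data)
{
    uint8_t gvec = data & 0xFF;
    int pirq;

    if (gvec) {
        return -1;  /* a non-zero vector means no specific pirq request */
    }

    /* the pirq is passed as dest_id, extended by the upper address bits */
    pirq = (int)((addr_hi & 0xffffff00) |
                 ((addr_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff));

    return pirq ? pirq : -1;
}
```

pt_msi_setup would call it with s->msi->addr_hi, s->msi->addr_lo and s->msi->data, and pt_msix_update_one with the two halves of gaddr and entry->io_mem[2].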

> +        }
> +    }
> +
> +    /* Check if this entry is already mapped */
> +    if (entry->pirq == -1) {
> +        ret = xc_physdev_map_pirq_msi(xen_xc, xen_domid, AUTO_ASSIGN, &pirq,
> +                                      PCI_DEVFN(s->real_device->dev,
> +                                                s->real_device->func),
> +                                      s->real_device->bus, entry_nr,
> +                                      s->msix->table_base);
> +        if (ret) {
> +            PT_LOG("Error: Mapping msix entry %x\n", entry_nr);

Oh boy. So here the error is %x, but later on it is %d. Should it
be %d or 0x%x?


> +            return ret;
> +        }
> +        entry->pirq = pirq;
> +    }
> +
> +    PT_LOG("Update msix entry %x with pirq %x gvec %x\n",
> +            entry_nr, pirq, gvec);
> +
> +    ret = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags,
> +                                   s->msix->mmio_base_addr);
> +    if (ret) {
> +        PT_LOG("Error: Updating msix irq info for entry %d\n", entry_nr);
> +
> +        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, entry->pirq)) {
> +            PT_LOG("Error: Unmapping of MSI-X failed.\n");
> +        }
> +        entry->pirq = -1;
> +        return ret;
> +    }
> +
> +    entry->flags = 0;
> +
> +    return 0;
> +}
> +
> +int pt_msix_update(XenPCIPassthroughState *s)
> +{
> +    XenPTMSIX *msix = s->msix;
> +    int i;
> +
> +    for (i = 0; i < msix->total_entries; i++) {
> +        pt_msix_update_one(s, i);
> +    }
> +
> +    return 0;
> +}
> +
> +void pt_msix_disable(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint8_t gvec = 0;
> +    uint32_t gflags = 0;
> +    uint64_t addr = 0;
> +    int i = 0;
> +    XenMSIXEntry *entry = NULL;
> +
> +    msix_set_enable(s, 0);
> +
> +    for (i = 0; i < s->msix->total_entries; i++) {
> +        entry = &s->msix->msix_entry[i];
> +
> +        if (entry->pirq == -1) {
> +            continue;
> +        }
> +
> +        gvec = entry->io_mem[2] & 0xff;
> +        addr = *(uint64_t *)&entry->io_mem[0];
> +        gflags = __get_msi_gflags(entry->io_mem[2], addr);
> +
> +        PT_LOG("Unbind msix with pirq %x, gvec %x\n",
> +                entry->pirq, gvec);
> +
> +        if (xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
> +                                        entry->pirq, gflags)) {
> +            PT_LOG("Error: Unbinding of MSI-X failed. [%02x:%02x.%x]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                   PCI_FUNC(d->devfn));
> +        } else {
> +            PT_LOG("Unmap msix with pirq %x\n", entry->pirq);
> +
> +            if (xc_physdev_unmap_pirq(xen_xc, xen_domid, entry->pirq)) {
> +                PT_LOG("Error: Unmapping of MSI-X failed. [%02x:%02x.%x]\n",
> +                       pci_bus_num(d->bus),
> +                       PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));

There are a lot of these error reports where pci_bus_num, PCI_SLOT, etc.
are used. Perhaps this should be in a function?
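
For example, a small formatter like the one below, as a sketch only. pt_format_bdf is a hypothetical name, and PCI_SLOT/PCI_FUNC are redefined here just so the snippet stands alone.

```c
#include <stdint.h>
#include <stdio.h>

/* Same meaning as the QEMU macros, repeated to keep this self-contained. */
#define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn) ((devfn) & 0x07)

/*
 * Hypothetical helper: write "bus:slot.func" into buf.
 * Returns the number of characters snprintf would have written.
 */
static int pt_format_bdf(char *buf, size_t len, uint8_t bus, uint8_t devfn)
{
    return snprintf(buf, len, "%02x:%02x.%x",
                    (unsigned)bus, (unsigned)PCI_SLOT(devfn),
                    (unsigned)PCI_FUNC(devfn));
}
```

Each error path could then log the formatted string instead of repeating the three arguments, and the messages that currently carry no device information could use it as well.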

> +            }
> +        }
> +        /* clear msi-x info */
> +        entry->pirq = -1;
> +        entry->flags = 0;
> +    }
> +}
> +
> +int pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index)
> +{
> +    XenMSIXEntry *entry;
> +    int i, ret;
> +
> +    if (!(s->msix && s->msix->bar_index == bar_index)) {
> +        return 0;
> +    }
> +
> +    for (i = 0; i < s->msix->total_entries; i++) {
> +        entry = &s->msix->msix_entry[i];
> +        if (entry->pirq != -1) {
> +            ret = xc_domain_unbind_pt_irq(xen_xc, xen_domid, entry->pirq,
> +                                          PT_IRQ_TYPE_MSI, 0, 0, 0, 0);
> +            if (ret) {
> +                PT_LOG("Error: unbind MSI-X entry %d failed\n", entry->pirq);
> +            }
> +            entry->flags = 1;
> +        }
> +    }
> +    pt_msix_update(s);
> +
> +    return 0;
> +}
> +
> +static void pci_msix_invalid_write(void *opaque, target_phys_addr_t addr,
> +                                   uint32_t val)
> +{
> +    PT_LOG("Error: Invalid write to MSI-X table,"
> +           " only dword access is allowed.\n");
> +}
> +
> +static void pci_msix_writel(void *opaque, target_phys_addr_t addr,
> +                            uint32_t val)
> +{
> +    XenPCIPassthroughState *s = (XenPCIPassthroughState *)opaque;
> +    XenPTMSIX *msix = s->msix;
> +    XenMSIXEntry *entry;
> +    int entry_nr, offset;
> +    void *phys_off;
> +    uint32_t vec_ctrl;
> +
> +    if (addr % 4) {
> +        PT_LOG("Error: Unaligned dword access to MSI-X table, "
> +                "addr %016"PRIx64"\n", addr);
> +        return;
> +    }
> +
> +    PT_LOG("addr: "TARGET_FMT_plx", val: %#x\n", addr, val);

Huh?

> +
> +    entry_nr = addr / 16;
> +    entry = &msix->msix_entry[entry_nr];
> +    offset = (addr % 16) / 4;
> +
> +    /*
> +     * If Xen intercepts the mask bit access, io_mem[3] may not be
> +     * up-to-date. Read from hardware directly.
> +     */
> +    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
> +    vec_ctrl = *(uint32_t *)phys_off;
> +
> +    if (offset != 3 && msix->enabled && !(vec_ctrl & 0x1)) {
> +        PT_LOG("Error: Can't update msix entry %d since MSI-X is already "
> +                "function.\n", entry_nr);

already function? already on? active?


> +        return;
> +    }
> +
> +    if (offset != 3 && entry->io_mem[offset] != val) {
> +        entry->flags = 1;
> +    }
> +    entry->io_mem[offset] = val;
> +
> +    if (offset == 3) {
> +        if (msix->enabled && !(val & 0x1)) {
> +            pt_msix_update_one(s, entry_nr);
> +        }
> +        mask_physical_msix_entry(s, entry_nr, entry->io_mem[3] & 0x1);
> +    }
> +}
> +
> +static CPUWriteMemoryFunc *pci_msix_write[] = {
> +    pci_msix_invalid_write,
> +    pci_msix_invalid_write,
> +    pci_msix_writel
> +};
> +
> +static uint32_t pci_msix_invalid_read(void *opaque, target_phys_addr_t addr)
> +{
> +    PT_LOG("Error: Invalid read to MSI-X table,"
> +           " only dword access is allowed.\n");
> +    return 0;
> +}
> +
> +static uint32_t pci_msix_readl(void *opaque, target_phys_addr_t addr)
> +{
> +    XenPCIPassthroughState *s = (XenPCIPassthroughState *)opaque;
> +    XenPTMSIX *msix = s->msix;
> +    int entry_nr, offset;
> +
> +    if (addr % 4) {
> +        PT_LOG("Error: Unaligned dword access to MSI-X table, "
> +                "addr %016"PRIx64"\n", addr);
> +        return 0;
> +    }
> +
> +    PT_LOG("addr: "TARGET_FMT_plx"\n", addr);
> +
> +    entry_nr = addr / 16;
> +    offset = (addr % 16) / 4;
> +
> +    return msix->msix_entry[entry_nr].io_mem[offset];
> +}
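
The read and write handlers both turn a byte offset into an entry number and a dword index using the 16-byte entry size. A minimal sketch of that decoding, with hypothetical names (struct msix_slot, msix_decode), might look like this:

```c
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16  /* addr lo, addr hi, data, vector control */

/* Hypothetical result type: which entry, and which dword inside it. */
struct msix_slot {
    int entry_nr;
    int dword;  /* 0 = addr lo, 1 = addr hi, 2 = data, 3 = vector control */
};

/* Returns 0 on success, -1 for an access that is not dword aligned. */
static int msix_decode(uint64_t addr, struct msix_slot *out)
{
    if (addr % 4) {
        return -1;
    }
    out->entry_nr = (int)(addr / MSIX_ENTRY_SIZE);
    out->dword = (int)((addr % MSIX_ENTRY_SIZE) / 4);
    return 0;
}
```

An access at offset 0x2C therefore lands on entry 2, dword 3, which is the vector control word whose bit 0 is the per-vector mask.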
> +
> +static CPUReadMemoryFunc *pci_msix_read[] = {
> +    pci_msix_invalid_read,
> +    pci_msix_invalid_read,
> +    pci_msix_readl
> +};
> +
> +int pt_add_msix_mapping(XenPCIPassthroughState *s, int bar_index)
> +{
> +    if (!(s->msix && s->msix->bar_index == bar_index)) {
> +        return 0;
> +    }
> +
> +    return xc_domain_memory_mapping(xen_xc, xen_domid,
> +         s->msix->mmio_base_addr >> XC_PAGE_SHIFT,
> +         (s->bases[bar_index].access.maddr + s->msix->table_off)
> +             >> XC_PAGE_SHIFT,
> +         (s->msix->total_entries * 16 + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> +         DPCI_ADD_MAPPING);
> +}
> +
> +int pt_remove_msix_mapping(XenPCIPassthroughState *s, int bar_index)
> +{
> +    if (!(s->msix && s->msix->bar_index == bar_index)) {
> +        return 0;
> +    }
> +
> +    s->msix->mmio_base_addr = s->bases[bar_index].e_physbase
> +        + s->msix->table_off;
> +
> +    cpu_register_physical_memory(s->msix->mmio_base_addr,
> +                                 s->msix->total_entries * 16,
> +                                 s->msix->mmio_index);
> +
> +    return xc_domain_memory_mapping(xen_xc, xen_domid,
> +         s->msix->mmio_base_addr >> XC_PAGE_SHIFT,
> +         (s->bases[bar_index].access.maddr + s->msix->table_off)
> +             >> XC_PAGE_SHIFT,
> +         (s->msix->total_entries * 16 + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> +         DPCI_REMOVE_MAPPING);
> +}
> +
> +int pt_msix_init(XenPCIPassthroughState *s, int base)
> +{
> +    uint8_t id;
> +    uint16_t control;
> +    int i, total_entries, table_off, bar_index;
> +    HostPCIDevice *d = s->real_device;
> +    int fd;
> +
> +    id = host_pci_get_byte(d, base + PCI_CAP_LIST_ID);
> +
> +    if (id != PCI_CAP_ID_MSIX) {
> +        PT_LOG("Error: Invalid id %#x base %#x\n", id, base);
> +        return -1;
> +    }
> +
> +    control = host_pci_get_word(d, base + 2);
> +    total_entries = control & 0x7ff;
> +    total_entries += 1;
> +
> +    s->msix = g_malloc0(sizeof (XenPTMSIX)
> +                        + total_entries * sizeof (XenMSIXEntry));
> +
> +    s->msix->total_entries = total_entries;
> +    for (i = 0; i < total_entries; i++) {
> +        s->msix->msix_entry[i].pirq = -1;
> +    }
> +
> +    s->msix->mmio_index =
> +        cpu_register_io_memory(pci_msix_read, pci_msix_write,
> +                               s, DEVICE_NATIVE_ENDIAN);
> +
> +    table_off = host_pci_get_long(d, base + PCI_MSIX_TABLE);
> +    bar_index = s->msix->bar_index = table_off & PCI_MSIX_FLAGS_BIRMASK;
> +    table_off = s->msix->table_off = table_off & ~PCI_MSIX_FLAGS_BIRMASK;
> +    s->msix->table_base = s->real_device->io_regions[bar_index].base_addr;
> +    PT_LOG("get MSI-X table bar base %#"PRIx64"\n", s->msix->table_base);
> +
> +    fd = open("/dev/mem", O_RDWR);
> +    if (fd == -1) {
> +        PT_LOG("Error: Can't open /dev/mem: %s\n", strerror(errno));
> +        goto error_out;
> +    }
> +    PT_LOG("table_off = %#x, total_entries = %d\n", table_off, total_entries);
> +    s->msix->table_offset_adjust = table_off & 0x0fff;
> +    s->msix->phys_iomem_base =
> +        mmap(0,
> +             total_entries * 16 + s->msix->table_offset_adjust,
> +             PROT_WRITE | PROT_READ,
> +             MAP_SHARED | MAP_LOCKED,
> +             fd,
> +             s->msix->table_base + table_off - s->msix->table_offset_adjust);
> +
> +    if (s->msix->phys_iomem_base == MAP_FAILED) {
> +        PT_LOG("Error: Can't map physical MSI-X table: %s\n", strerror(errno));
> +        close(fd);
> +        goto error_out;
> +    }
> +    s->msix->phys_iomem_base = (char *)s->msix->phys_iomem_base
> +        + s->msix->table_offset_adjust;
> +
> +    close(fd);
> +
> +    PT_LOG("mapping physical MSI-X table to %p\n", s->msix->phys_iomem_base);
> +    return 0;
> +
> +error_out:
> +    g_free(s->msix);
> +    s->msix = NULL;
> +    return -1;
> +}
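
The register decoding and mmap arithmetic in pt_msix_init can be checked on their own. The sketch below is standalone and makes some assumptions: the table size field is the low 11 bits of Message Control, the BAR indicator is the low 3 bits of the Table Offset register, pages are 4 KiB, and the BAR base is page aligned. struct msix_layout and msix_compute_layout are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MSIX_TABLE_SIZE_MASK 0x07ff  /* encoded as N - 1 */
#define MSIX_BIR_MASK        0x7     /* assumed value of PCI_MSIX_FLAGS_BIRMASK */
#define PAGE_OFFSET_MASK     0x0fff  /* assumes 4 KiB pages */

struct msix_layout {
    int total_entries;
    int bar_index;
    uint32_t table_off;    /* table offset inside the BAR */
    uint32_t page_adjust;  /* table offset inside its page */
    uint64_t map_offset;   /* page-aligned offset handed to mmap */
    size_t map_len;        /* length handed to mmap */
};

static struct msix_layout msix_compute_layout(uint16_t control,
                                              uint32_t table_reg,
                                              uint64_t bar_base)
{
    struct msix_layout l;

    l.total_entries = (control & MSIX_TABLE_SIZE_MASK) + 1;
    l.bar_index = table_reg & MSIX_BIR_MASK;
    l.table_off = table_reg & ~(uint32_t)MSIX_BIR_MASK;
    l.page_adjust = l.table_off & PAGE_OFFSET_MASK;
    /* mmap needs a page-aligned file offset, so map from the page start */
    l.map_offset = bar_base + l.table_off - l.page_adjust;
    l.map_len = (size_t)l.total_entries * 16 + l.page_adjust;
    return l;
}
```

With control 0x0003, a Table Offset register of 0x00002804 and a BAR base of 0xF0000000, this yields 4 entries in BAR 4, a table offset of 0x2800, a page adjustment of 0x800, an mmap offset of 0xF0002000 and a length of 2112 bytes.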
> +
> +void pt_msix_delete(XenPCIPassthroughState *s)
> +{
> +    /* unmap the MSI-X memory mapped register area */
> +    if (s->msix->phys_iomem_base) {
> +        PT_LOG("unmapping physical MSI-X table from %lx\n",
> +           (unsigned long)s->msix->phys_iomem_base);
> +        munmap(s->msix->phys_iomem_base, s->msix->total_entries * 16 +
> +           s->msix->table_offset_adjust);
> +    }
> +
> +    if (s->msix->mmio_index > 0) {
> +        cpu_unregister_io_memory(s->msix->mmio_index);
> +    }
> +
> +    g_free(s->msix);
> +    s->msix = NULL;
> +}
> -- 
> Anthony PERARD


* Re: [Xen-devel] [PATCH V3 10/10] Introduce Xen PCI Passthrough, MSI (3/3)
@ 2011-11-10 22:10     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-10 22:10 UTC (permalink / raw)
  To: Anthony PERARD; +Cc: Xen Devel, Shan Haitao, QEMU-devel, Stefano Stabellini

On Fri, Oct 28, 2011 at 04:07:36PM +0100, Anthony PERARD wrote:
> From: Jiang Yunhong <yunhong.jiang@intel.com>
> 
> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
> Signed-off-by: Shan Haitao <haitao.shan@intel.com>
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
>  Makefile.target                      |    1 +
>  hw/apic-msidef.h                     |    2 +
>  hw/xen_pci_passthrough.c             |   27 ++-
>  hw/xen_pci_passthrough.h             |   55 +++
>  hw/xen_pci_passthrough_config_init.c |  495 +++++++++++++++++++++++++-
>  hw/xen_pci_passthrough_msi.c         |  667 ++++++++++++++++++++++++++++++++++
>  6 files changed, 1240 insertions(+), 7 deletions(-)
>  create mode 100644 hw/xen_pci_passthrough_msi.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index c32c688..17b8857 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -220,6 +220,7 @@ obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += host-pci-device.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_helpers.o
>  obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_config_init.o
> +obj-i386-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pci_passthrough_msi.o
>  
>  # Inter-VM PCI shared memory
>  CONFIG_IVSHMEM =
> diff --git a/hw/apic-msidef.h b/hw/apic-msidef.h
> index 3182f0b..6e2eb71 100644
> --- a/hw/apic-msidef.h
> +++ b/hw/apic-msidef.h
> @@ -22,6 +22,8 @@
>  
>  #define MSI_ADDR_DEST_MODE_SHIFT        2
>  
> +#define MSI_ADDR_REDIRECTION_SHIFT      3
> +
>  #define MSI_ADDR_DEST_ID_SHIFT          12
>  #define  MSI_ADDR_DEST_ID_MASK          0x00ffff0
>  
> diff --git a/hw/xen_pci_passthrough.c b/hw/xen_pci_passthrough.c
> index b97c5b6..4b9eb74 100644
> --- a/hw/xen_pci_passthrough.c
> +++ b/hw/xen_pci_passthrough.c
> @@ -417,6 +417,7 @@ static void pt_iomem_map(XenPCIPassthroughState *s, int i,
>      }
>  
>      if (!first_map && old_ebase != -1) {
> +        pt_add_msix_mapping(s, i);
>          /* Remove old mapping */
>          ret = xc_domain_memory_mapping(xen_xc, xen_domid,
>                                 old_ebase >> XC_PAGE_SHIFT,
> @@ -441,6 +442,15 @@ static void pt_iomem_map(XenPCIPassthroughState *s, int i,
>          if (ret != 0) {
>              PT_LOG("Error: create new mapping failed!\n");
>          }
> +
> +        ret = pt_remove_msix_mapping(s, i);
> +        if (ret != 0) {
> +            PT_LOG("Error: remove MSI-X mmio mapping failed!\n");
> +        }
> +
> +        if (old_ebase != e_phys && old_ebase != -1) {
> +            pt_msix_update_remap(s, i);
> +        }
>      }
>  }
>  
> @@ -737,6 +747,9 @@ static int pt_initfn(PCIDevice *pcidev)
>          mapped_machine_irq[machine_irq]++;
>      }
>  
> +    /* setup MSI-INTx translation if support */
> +    rc = pt_enable_msi_translate(s);
> +
>      /* bind machine_irq to device */
>      if (rc < 0 && machine_irq != 0) {
>          uint8_t e_device = PCI_SLOT(s->dev.devfn);
> @@ -765,7 +778,8 @@ static int pt_initfn(PCIDevice *pcidev)
>  
>  out:
>      PT_LOG("Real physical device %02x:%02x.%x registered successfuly!\n"
> -           "IRQ type = %s\n", bus, slot, func, "INTx");
> +           "IRQ type = %s\n", bus, slot, func,
> +           s->msi_trans_en ? "MSI-INTx" : "INTx");
>  
>      return 0;
>  }
> @@ -782,7 +796,7 @@ static int pt_unregister_device(PCIDevice *pcidev)
>      e_intx = pci_intx(s);
>      machine_irq = s->machine_irq;
>  
> -    if (machine_irq) {
> +    if (s->msi_trans_en == 0 && machine_irq) {
>          rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
>                                       PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0);
>          if (rc < 0) {
> @@ -790,6 +804,13 @@ static int pt_unregister_device(PCIDevice *pcidev)
>          }
>      }
>  
> +    if (s->msi) {
> +        pt_msi_disable(s);
> +    }
> +    if (s->msix) {
> +        pt_msix_disable(s);
> +    }
> +
>      if (machine_irq) {
>          mapped_machine_irq[machine_irq]--;
>  
> @@ -824,6 +845,8 @@ static PCIDeviceInfo xen_pci_passthrough = {
>      .is_express = 0,
>      .qdev.props = (Property[]) {
>          DEFINE_PROP_STRING("hostaddr", XenPCIPassthroughState, hostaddr),
> +        DEFINE_PROP_BIT("msitranslate", XenPCIPassthroughState, msi_trans_cap,
> +                        0, true),
>          DEFINE_PROP_BIT("power-mgmt", XenPCIPassthroughState, power_mgmt,
>                          0, false),
>          DEFINE_PROP_END_OF_LIST(),
> diff --git a/hw/xen_pci_passthrough.h b/hw/xen_pci_passthrough.h
> index ebc04fd..5f404b0 100644
> --- a/hw/xen_pci_passthrough.h
> +++ b/hw/xen_pci_passthrough.h
> @@ -63,6 +63,10 @@ typedef int (*conf_byte_restore)
>  
>  #define PT_BAR_ALLF        0xFFFFFFFF  /* BAR ALLF value */
>  
> +/* MSI-X */
> +#define PT_MSI_FLAG_UNINIT 0x1000
> +#define PT_MSI_FLAG_MAPPED 0x2000
> +
>  
>  typedef enum {
>      GRP_TYPE_HARDWIRED = 0,                     /* 0 Hardwired reg group */
> @@ -166,6 +170,34 @@ typedef struct XenPTRegGroup {
>  } XenPTRegGroup;
>  
>  
> +typedef struct XenPTMSI {
> +    uint32_t flags;
> +    uint32_t ctrl_offset; /* saved control offset */
> +    int pirq;          /* guest pirq corresponding */
> +    uint32_t addr_lo;  /* guest message address */
> +    uint32_t addr_hi;  /* guest message upper address */
> +    uint16_t data;     /* guest message data */
> +} XenPTMSI;
> +
> +typedef struct XenMSIXEntry {
> +    int pirq;        /* -1 means unmapped */
> +    int flags;       /* flags indicting whether MSI ADDR or DATA is updated */
> +    uint32_t io_mem[4];
> +} XenMSIXEntry;
> +typedef struct XenPTMSIX {
> +    uint32_t ctrl_offset;
> +    int enabled;
> +    int total_entries;
> +    int bar_index;
> +    uint64_t table_base;
> +    uint32_t table_off;
> +    uint32_t table_offset_adjust; /* page align mmap */
> +    uint64_t mmio_base_addr;
> +    int mmio_index;
> +    void *phys_iomem_base;
> +    XenMSIXEntry msix_entry[0];
> +} XenPTMSIX;
> +
>  typedef struct XenPTPM {
>      QEMUTimer *pm_timer;  /* QEMUTimer struct */
>      int no_soft_reset;    /* No Soft Reset flags */
> @@ -189,6 +221,13 @@ struct XenPCIPassthroughState {
>  
>      uint32_t machine_irq;
>  
> +    XenPTMSI *msi;
> +    XenPTMSIX *msix;
> +
> +    /* Physical MSI to guest INTx translation when possible */
> +    uint32_t msi_trans_cap;
> +    bool msi_trans_en;
> +
>      uint32_t power_mgmt;
>      XenPTPM *pm_state;
>  
> @@ -222,4 +261,20 @@ static inline uint8_t pci_read_intx(XenPCIPassthroughState *s)
>  }
>  uint8_t pci_intx(XenPCIPassthroughState *ptdev);
>  
> +/* MSI/MSI-X */
> +void pt_msi_set_enable(XenPCIPassthroughState *s, int en);
> +int pt_msi_setup(XenPCIPassthroughState *s);
> +int pt_msi_update(XenPCIPassthroughState *d);
> +void pt_msi_disable(XenPCIPassthroughState *s);
> +int pt_enable_msi_translate(XenPCIPassthroughState *s);
> +void pt_disable_msi_translate(XenPCIPassthroughState *s);
> +
> +int pt_msix_init(XenPCIPassthroughState *s, int pos);
> +void pt_msix_delete(XenPCIPassthroughState *s);
> +int pt_msix_update(XenPCIPassthroughState *s);
> +int pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index);
> +void pt_msix_disable(XenPCIPassthroughState *s);
> +int pt_add_msix_mapping(XenPCIPassthroughState *s, int bar_index);
> +int pt_remove_msix_mapping(XenPCIPassthroughState *s, int bar_index);
> +
>  #endif /* !QEMU_HW_XEN_PCI_PASSTHROUGH_H */
> diff --git a/hw/xen_pci_passthrough_config_init.c b/hw/xen_pci_passthrough_config_init.c
> index 4103b59..b4238ee 100644
> --- a/hw/xen_pci_passthrough_config_init.c
> +++ b/hw/xen_pci_passthrough_config_init.c
> @@ -375,11 +375,19 @@ static int pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
>      throughable_mask = ~emu_mask & valid_mask;
>  
>      if (*value & PCI_COMMAND_INTX_DISABLE) {
> -        throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> -    } else {
> -        if (s->machine_irq) {
> +        if (s->msi_trans_en) {
> +            pt_msi_set_enable(s, 0);
> +        } else {
>              throughable_mask |= PCI_COMMAND_INTX_DISABLE;
>          }
> +    } else {
> +        if (s->msi_trans_en) {
> +            pt_msi_set_enable(s, 1);
> +        } else {
> +            if (s->machine_irq) {
> +                throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> +            }
> +        }
>      }
>  
>      *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> @@ -1248,13 +1256,21 @@ static void pt_reset_interrupt_and_io_mapping(XenPCIPassthroughState *s)
>      e_device = PCI_SLOT(s->dev.devfn);
>      e_intx = pci_intx(s);
>  
> -    if (s->machine_irq) {
> +    if (s->msi_trans_en == 0 && s->machine_irq) {
>          if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->machine_irq,
>                                      PT_IRQ_TYPE_PCI, 0, e_device, e_intx, 0)) {
>              PT_LOG("Error: Unbinding of interrupt failed!\n");
>          }
>      }
>  
> +    /* disable MSI/MSI-X and MSI-INTx translation */
> +    if (s->msi) {
> +        pt_msi_disable(s);
> +    }
> +    if (s->msix) {
> +        pt_msix_disable(s);
> +    }
> +
>      /* clear all virtual region address */
>      for (i = 0; i < PCI_NUM_REGIONS; i++) {
>          r = &d->io_regions[i];
> @@ -1501,6 +1517,406 @@ static XenPTRegInfo pt_emu_reg_pm_tbl[] = {
>      },
>  };
>  
> +/********************************
> + * MSI Capability
> + */
> +
> +/* Message Control register */
> +static uint32_t pt_msgctrl_reg_init(XenPCIPassthroughState *s,
> +                                    XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t reg_field = 0;
> +
> +    /* use I/O device register's value as initial value */
> +    reg_field = pci_get_word(d->config + real_offset);
> +
> +    if (reg_field & PCI_MSI_FLAGS_ENABLE) {
> +        PT_LOG("MSI enabled already, disable first\n");
> +        host_pci_set_word(s->real_device, real_offset,
> +                          reg_field & ~PCI_MSI_FLAGS_ENABLE);
> +    }
> +    s->msi->flags |= reg_field | PT_MSI_FLAG_UNINIT;
> +    s->msi->ctrl_offset = real_offset;
> +
> +    return reg->init_val;
> +}
> +static int pt_msgctrl_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                                uint16_t *value, uint16_t dev_value,
> +                                uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    PCIDevice *pd = (PCIDevice *)s;
> +    uint16_t val;
> +
> +    /* Currently no support for multi-vector */
> +    if (*value & PCI_MSI_FLAGS_QSIZE) {
> +        PT_LOG("Warning: try to set more than 1 vector ctrl %x\n", *value);
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +    /* update the msi_info too */
> +    s->msi->flags |= cfg_entry->data &
> +        ~(PT_MSI_FLAG_UNINIT | PT_MSI_FLAG_MAPPED | PCI_MSI_FLAGS_ENABLE);
> +
> +    /* create value for writing to I/O device register */
> +    val = *value;
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI */
> +    if (val & PCI_MSI_FLAGS_ENABLE) {
> +        /* setup MSI pirq for the first time */
> +        if (s->msi->flags & PT_MSI_FLAG_UNINIT) {
> +            if (s->msi_trans_en) {
> +                PT_LOG("guest enabling MSI, disable MSI-INTx translation\n");
> +                pt_disable_msi_translate(s);
> +            } else {
> +                /* Init physical one */
> +                PT_LOG("setup msi for dev %x\n", pd->devfn);
> +                if (pt_msi_setup(s)) {
> +                    /* We do not broadcast the error to the framework code, so
> +                     * that MSI errors are contained in MSI emulation code and
> +                     * QEMU can go on running.
> +                     * Guest MSI would be actually not working.
> +                     */
> +                    *value &= ~PCI_MSI_FLAGS_ENABLE;
> +                    PT_LOG("Warning: Can not map MSI for dev %x\n", pd->devfn);
> +                    return 0;
> +                }
> +            }
> +            if (pt_msi_update(s)) {
> +                *value &= ~PCI_MSI_FLAGS_ENABLE;
> +                PT_LOG("Warning: Can not bind MSI for dev %x\n", pd->devfn);
> +                return 0;
> +            }
> +            s->msi->flags &= ~PT_MSI_FLAG_UNINIT;
> +            s->msi->flags |= PT_MSI_FLAG_MAPPED;
> +        }
> +        s->msi->flags |= PCI_MSI_FLAGS_ENABLE;
> +    } else {
> +        s->msi->flags &= ~PCI_MSI_FLAGS_ENABLE;
> +    }
> +
> +    /* pass through MSI_ENABLE bit when no MSI-INTx translation */
> +    if (!s->msi_trans_en) {
> +        *value &= ~PCI_MSI_FLAGS_ENABLE;
> +        *value |= val & PCI_MSI_FLAGS_ENABLE;
> +    }
> +
> +    return 0;
> +}
> +
> +/* initialize Message Upper Address register */
> +static uint32_t pt_msgaddr64_reg_init(XenPCIPassthroughState *ptdev,
> +                                      XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    /* no need to initialize in case of 32 bit type */
> +    if (!(ptdev->msi->flags & PCI_MSI_FLAGS_64BIT)) {
> +        return PT_INVALID_REG;
> +    }
> +
> +    return reg->init_val;
> +}
> +/* this function will be called twice (for 32 bit and 64 bit type) */
> +/* initialize Message Data register */
> +static uint32_t pt_msgdata_reg_init(XenPCIPassthroughState *ptdev,
> +                                    XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    uint32_t flags = ptdev->msi->flags;
> +    uint32_t offset = reg->offset;
> +
> +    /* check the offset whether matches the type or not */
> +    if (((offset == PCI_MSI_DATA_64) &&  (flags & PCI_MSI_FLAGS_64BIT)) ||
> +        ((offset == PCI_MSI_DATA_32) && !(flags & PCI_MSI_FLAGS_64BIT))) {
> +        return reg->init_val;
> +    } else {
> +        return PT_INVALID_REG;
> +    }
> +}
> +
> +/* write Message Address register */
> +static int pt_msgaddr32_reg_write(XenPCIPassthroughState *s,
> +                                  XenPTReg *cfg_entry, uint32_t *value,
> +                                  uint32_t dev_value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    uint32_t old_addr = cfg_entry->data;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +    /* update the msi_info too */
> +    s->msi->addr_lo = cfg_entry->data;
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI */
> +    if (cfg_entry->data != old_addr) {
> +        if (s->msi->flags & PT_MSI_FLAG_MAPPED) {
> +            pt_msi_update(s);
> +        }
> +    }
> +
> +    return 0;
> +}
> +/* write Message Upper Address register */
> +static int pt_msgaddr64_reg_write(XenPCIPassthroughState *s,
> +                                  XenPTReg *cfg_entry, uint32_t *value,
> +                                  uint32_t dev_value, uint32_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint32_t writable_mask = 0;
> +    uint32_t throughable_mask = 0;
> +    uint32_t old_addr = cfg_entry->data;
> +
> +    /* check whether the type is 64 bit or not */
> +    if (!(s->msi->flags & PCI_MSI_FLAGS_64BIT)) {
> +        /* exit I/O emulator */
> +        PT_LOG("Error: why comes to Upper Address without 64 bit support??\n");

Um, not sure what that means.

> +        return -1;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +    /* update the msi_info too */
> +    s->msi->addr_hi = cfg_entry->data;
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI */
> +    if (cfg_entry->data != old_addr) {
> +        if (s->msi->flags & PT_MSI_FLAG_MAPPED) {
> +            pt_msi_update(s);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +
> +/* this function will be called twice (for 32 bit and 64 bit type) */
> +/* write Message Data register */
> +static int pt_msgdata_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> +                                uint16_t *value, uint16_t dev_value,
> +                                uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +    uint16_t old_data = cfg_entry->data;
> +    uint32_t flags = s->msi->flags;
> +    uint32_t offset = reg->offset;
> +
> +    /* check the offset whether matches the type or not */
> +    if (!((offset == PCI_MSI_DATA_64) &&  (flags & PCI_MSI_FLAGS_64BIT)) &&
> +        !((offset == PCI_MSI_DATA_32) && !(flags & PCI_MSI_FLAGS_64BIT))) {
> +        /* exit I/O emulator */
> +        PT_LOG("Error: the offset is not match with the 32/64 bit type!!\n");

I think it means: "The offset does not match the 32/64 bit type"

> +        return -1;
> +    }
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +    /* update the msi_info too */
> +    s->msi->data = cfg_entry->data;
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI */
> +    if (cfg_entry->data != old_data) {
> +        if (flags & PT_MSI_FLAG_MAPPED) {
> +            pt_msi_update(s);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +/* MSI Capability Structure reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_msi_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Message Control reg */
> +    {
> +        .offset     = PCI_MSI_FLAGS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0xFF8E,
> +        .emu_mask   = 0x007F,
> +        .init       = pt_msgctrl_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_msgctrl_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Message Address reg */
> +    {
> +        .offset     = PCI_MSI_ADDRESS_LO,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x00000003,
> +        .emu_mask   = 0xFFFFFFFF,
> +        .no_wb      = 1,
> +        .init       = pt_common_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_msgaddr32_reg_write,
> +        .u.dw.restore = NULL,
> +    },
> +    /* Message Upper Address reg (if PCI_MSI_FLAGS_64BIT set) */
> +    {
> +        .offset     = PCI_MSI_ADDRESS_HI,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .ro_mask    = 0x00000000,
> +        .emu_mask   = 0xFFFFFFFF,
> +        .no_wb      = 1,
> +        .init       = pt_msgaddr64_reg_init,
> +        .u.dw.read  = pt_long_reg_read,
> +        .u.dw.write = pt_msgaddr64_reg_write,
> +        .u.dw.restore = NULL,
> +    },
> +    /* Message Data reg (16 bits of data for 32-bit devices) */
> +    {
> +        .offset     = PCI_MSI_DATA_32,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x0000,
> +        .emu_mask   = 0xFFFF,
> +        .no_wb      = 1,
> +        .init       = pt_msgdata_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_msgdata_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    /* Message Data reg (16 bits of data for 64-bit devices) */
> +    {
> +        .offset     = PCI_MSI_DATA_64,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x0000,
> +        .emu_mask   = 0xFFFF,
> +        .no_wb      = 1,
> +        .init       = pt_msgdata_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_msgdata_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +
> +/**************************************
> + * MSI-X Capability
> + */
> +
> +/* Message Control register for MSI-X */
> +static uint32_t pt_msixctrl_reg_init(XenPCIPassthroughState *s,
> +                                     XenPTRegInfo *reg, uint32_t real_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t reg_field = 0;
> +
> +    /* use I/O device register's value as initial value */
> +    reg_field = pci_get_word(d->config + real_offset);
> +
> +    if (reg_field & PCI_MSIX_FLAGS_ENABLE) {
> +        PT_LOG("MSIX enabled already, disable first\n");
> +        host_pci_set_word(s->real_device, real_offset,
> +                          reg_field & ~PCI_MSIX_FLAGS_ENABLE);
> +    }
> +
> +    s->msix->ctrl_offset = real_offset;
> +
> +    return reg->init_val;
> +}
> +static int pt_msixctrl_reg_write(XenPCIPassthroughState *s,
> +                                 XenPTReg *cfg_entry, uint16_t *value,
> +                                 uint16_t dev_value, uint16_t valid_mask)
> +{
> +    XenPTRegInfo *reg = cfg_entry->reg;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = 0;
> +
> +    /* modify emulate register */
> +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    throughable_mask = ~reg->emu_mask & valid_mask;
> +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> +
> +    /* update MSI-X */
> +    if ((*value & PCI_MSIX_FLAGS_ENABLE)
> +        && !(*value & PCI_MSIX_FLAGS_MASKALL)) {
> +        if (s->msi_trans_en) {
> +            PT_LOG("guest enabling MSI-X, disable MSI-INTx translation\n");
> +            pt_disable_msi_translate(s);
> +        }
> +        pt_msix_update(s);
> +    }
> +
> +    s->msix->enabled = !!(*value & PCI_MSIX_FLAGS_ENABLE);
> +
> +    return 0;
> +}
> +
> +/* MSI-X Capability Structure reg static infomation table */
> +static XenPTRegInfo pt_emu_reg_msix_tbl[] = {
> +    /* Next Pointer reg */
> +    {
> +        .offset     = PCI_CAP_LIST_NEXT,
> +        .size       = 1,
> +        .init_val   = 0x00,
> +        .ro_mask    = 0xFF,
> +        .emu_mask   = 0xFF,
> +        .init       = pt_ptr_reg_init,
> +        .u.b.read   = pt_byte_reg_read,
> +        .u.b.write  = pt_byte_reg_write,
> +        .u.b.restore  = NULL,
> +    },
> +    /* Message Control reg */
> +    {
> +        .offset     = PCI_MSI_FLAGS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .ro_mask    = 0x3FFF,
> +        .emu_mask   = 0x0000,
> +        .init       = pt_msixctrl_reg_init,
> +        .u.w.read   = pt_word_reg_read,
> +        .u.w.write  = pt_msixctrl_reg_write,
> +        .u.w.restore  = NULL,
> +    },
> +    {
> +        .size = 0,
> +    },
> +};
> +
>  
>  /****************************
>   * Capabilities
> @@ -1664,6 +2080,48 @@ static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
>  
>      return pcie_size;
>  }
> +/* get MSI Capability Structure register group size */
> +static uint8_t pt_msi_size_init(XenPCIPassthroughState *s,
> +                                const XenPTRegGroupInfo *grp_reg,
> +                                uint32_t base_offset)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint16_t msg_ctrl = 0;
> +    uint8_t msi_size = 0xa;
> +
> +    msg_ctrl = pci_get_word(d->config + (base_offset + PCI_MSI_FLAGS));
> +
> +    /* check 64 bit address capable & Per-vector masking capable */

ehh?
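
If the comment is meant to say that the two capabilities are checked one after the other, maybe spell that out. For what it is worth, the sizes work out as in the sketch below. It is standalone: the two flag values are what I believe pci_regs.h uses, redefined here under hypothetical DEMO_ names so it builds on its own.

```c
#include <stdint.h>

/* Assumed to match PCI_MSI_FLAGS_64BIT / PCI_MSI_FLAGS_MASKBIT. */
#define DEMO_MSI_FLAGS_64BIT   0x0080  /* 64-bit address capable */
#define DEMO_MSI_FLAGS_MASKBIT 0x0100  /* per-vector masking capable */

/* Hypothetical helper: size in bytes of the MSI capability structure. */
static uint8_t demo_msi_cap_size(uint16_t msg_ctrl)
{
    /* cap ID + next pointer + message control + 32-bit address + data */
    uint8_t size = 0x0a;

    if (msg_ctrl & DEMO_MSI_FLAGS_64BIT) {
        size += 4;   /* upper 32 bits of the message address */
    }
    if (msg_ctrl & DEMO_MSI_FLAGS_MASKBIT) {
        size += 10;  /* 2 reserved bytes + mask bits + pending bits */
    }
    return size;
}
```

So the four combinations give 0x0a, 0x0e, 0x14 and 0x18 bytes.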


> +    if (msg_ctrl & PCI_MSI_FLAGS_64BIT) {
> +        msi_size += 4;
> +    }
> +    if (msg_ctrl & PCI_MSI_FLAGS_MASKBIT) {
> +        msi_size += 10;
> +    }
> +
> +    s->msi = g_malloc0(sizeof (XenPTMSI));
> +    s->msi->pirq = -1;

Is there a define for this -1?
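
Something along these lines would make the intent clearer. This is only a sketch: PT_UNASSIGNED_PIRQ and the two helpers are hypothetical names, not something in the tree, and DemoMSI is a cut-down stand-in for XenPTMSI.

```c
#include <stdbool.h>

/* Hypothetical name for "no pirq has been mapped yet". */
#define PT_UNASSIGNED_PIRQ (-1)

/* Cut-down stand-in for XenPTMSI, only the field needed here. */
typedef struct DemoMSI {
    int pirq;
} DemoMSI;

/* Hypothetical helper: mark the MSI state as having no pirq. */
static inline void demo_msi_reset_pirq(DemoMSI *msi)
{
    msi->pirq = PT_UNASSIGNED_PIRQ;
}

/* Hypothetical helper: true once a pirq has been mapped. */
static inline bool demo_msi_has_pirq(const DemoMSI *msi)
{
    return msi->pirq != PT_UNASSIGNED_PIRQ;
}
```

The same define could then replace the bare -1 in the "pirq != -1" checks and in the XenMSIXEntry comment.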

> +    PT_LOG("done\n");
> +
> +    return msi_size;
> +}
> +/* get MSI-X Capability Structure register group size */
> +static uint8_t pt_msix_size_init(XenPCIPassthroughState *s,
> +                                 const XenPTRegGroupInfo *grp_reg,
> +                                 uint32_t base_offset)
> +{
> +    int ret = 0;
> +
> +    ret = pt_msix_init(s, base_offset);
> +
> +    if (ret == -1) {
> +        hw_error("Internal error: Invalid pt_msix_init return value[%d]. "
> +                 "I/O emulator exit.\n", ret);
> +    }
> +
> +    return grp_reg->grp_size;
> +}
> +
>  
>  static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
>      /* Header Type0 reg group */
> @@ -1704,6 +2162,14 @@ static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
>          .grp_size   = 0x04,
>          .size_init  = pt_reg_grp_size_init,
>      },
> +    /* MSI Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_MSI,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0xFF,
> +        .size_init   = pt_msi_size_init,
> +        .emu_reg_tbl = pt_emu_reg_msi_tbl,
> +    },
>      /* PCI-X Capabilities List Item reg group */
>      {
>          .grp_id     = PCI_CAP_ID_PCIX,
> @@ -1748,6 +2214,14 @@ static const XenPTRegGroupInfo pt_emu_reg_grp_tbl[] = {
>          .size_init   = pt_pcie_size_init,
>          .emu_reg_tbl = pt_emu_reg_pcie_tbl,
>      },
> +    /* MSI-X Capability Structure reg group */
> +    {
> +        .grp_id      = PCI_CAP_ID_MSIX,
> +        .grp_type    = GRP_TYPE_EMU,
> +        .grp_size    = 0x0C,
> +        .size_init   = pt_msix_size_init,
> +        .emu_reg_tbl = pt_emu_reg_msix_tbl,
> +    },
>      {
>          .grp_size = 0,
>      },
> @@ -1908,8 +2382,11 @@ static int pt_init_pci_config(XenPCIPassthroughState *s)
>      /* reinitialize all emulate register */
>      pt_config_reinit(s);
>  
> +    /* setup MSI-INTx translation if support */
> +    ret = pt_enable_msi_translate(s);
> +
>      /* rebind machine_irq to device */
> -    if (s->machine_irq != 0) {
> +    if (ret < 0 && s->machine_irq != 0) {

So can machine_irq be -1? Or is it only pirq that can be -1?


>          uint8_t e_device = PCI_SLOT(s->dev.devfn);
>          uint8_t e_intx = pci_intx(s);
>  
> @@ -2043,6 +2520,14 @@ void pt_config_delete(XenPCIPassthroughState *s)
>      struct XenPTRegGroup *reg_group, *next_grp;
>      struct XenPTReg *reg, *next_reg;
>  
> +    /* free MSI/MSI-X info table */
> +    if (s->msix) {
> +        pt_msix_delete(s);
> +    }
> +    if (s->msi) {
> +        g_free(s->msi);
> +    }
> +
>      /* free Power Management info table */
>      if (s->pm_state) {
>          if (s->pm_state->pm_timer) {
> diff --git a/hw/xen_pci_passthrough_msi.c b/hw/xen_pci_passthrough_msi.c
> new file mode 100644
> index 0000000..533aef4
> --- /dev/null
> +++ b/hw/xen_pci_passthrough_msi.c
> @@ -0,0 +1,667 @@
> +/*
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Jiang Yunhong <yunhong.jiang@intel.com>
> + *
> + * This file implements direct PCI assignment to a HVM guest
> + */
> +
> +#include <sys/mman.h>
> +
> +#include "xen_backend.h"
> +#include "xen_pci_passthrough.h"
> +#include "apic-msidef.h"
> +
> +
> +#define AUTO_ASSIGN -1
> +
> +/* shift count for gflags */
> +#define GFLAGS_SHIFT_DEST_ID        0
> +#define GFLAGS_SHIFT_RH             8
> +#define GFLAGS_SHIFT_DM             9
> +#define GLFAGS_SHIFT_DELIV_MODE     12
> +#define GLFAGS_SHIFT_TRG_MODE       15
> +
> +
> +void pt_msi_set_enable(XenPCIPassthroughState *s, int en)
> +{
> +    uint16_t val = 0;
> +    uint32_t address = 0;
> +    PT_LOG("enable: %i\n", en);
> +
> +    if (!s->msi) {
> +        return;
> +    }
> +
> +    address = s->msi->ctrl_offset;
> +    if (!address) {
> +        return;
> +    }
> +
> +    val = host_pci_get_word(s->real_device, address);
> +    val &= ~PCI_MSI_FLAGS_ENABLE;
> +    val |= en & PCI_MSI_FLAGS_ENABLE;
> +    host_pci_set_word(s->real_device, address, val);
> +
> +    PT_LOG("done, address: %#x, val: %#x\n", address, val);
> +}
> +
> +static void msix_set_enable(XenPCIPassthroughState *s, int en)
> +{
> +    uint16_t val = 0;
> +    uint32_t address = 0;
> +
> +    if (!s->msix) {
> +        return;
> +    }
> +
> +    address = s->msix->ctrl_offset;
> +    if (!address) {
> +        return;
> +    }
> +
> +    val = host_pci_get_word(s->real_device, address);
> +    val &= ~PCI_MSIX_FLAGS_ENABLE;
> +    if (en) {
> +        val |= PCI_MSIX_FLAGS_ENABLE;
> +    }
> +    host_pci_set_word(s->real_device, address, val);
> +}
> +
> +/*********************************/
> +/* MSI virtuailization functions */


s/virtuailization/virtualization/

> +
> +/*
> + * setup physical msi, but didn't enable it

s/but didn't enable it/but don't enable it/

> + */
> +int pt_msi_setup(XenPCIPassthroughState *s)
> +{
> +    int pirq = -1;
> +    uint8_t gvec = 0;
> +
> +    if (!(s->msi->flags & PT_MSI_FLAG_UNINIT)) {
> +        PT_LOG("Error: setup physical after initialized??\n");

I am not sure what that says.

> +        return -1;
> +    }
> +
> +    gvec = s->msi->data & 0xFF;
> +    if (!gvec) {
> +        /* if gvec is 0, the guest is asking for a particular pirq that
> +         * is passed as dest_id */
> +        pirq = (s->msi->addr_hi & 0xffffff00) |
> +               ((s->msi->addr_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> +        if (!pirq) {
> +            /* this probably identifies an misconfiguration of the guest,
> +             * try the emulated path */
> +            pirq = -1;
> +        } else {
> +            PT_LOG("pt_msi_setup requested pirq = %d\n", pirq);
> +        }
> +    }
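
To check my reading of the gvec == 0 case: the requested pirq is put together from the upper address word and the dest_id field of the lower one. A standalone sketch, with a hypothetical helper name and the shift value copied from apic-msidef.h as quoted above:

```c
#include <stdint.h>

/* Same value as MSI_ADDR_DEST_ID_SHIFT in the quoted apic-msidef.h. */
#define DEMO_ADDR_DEST_ID_SHIFT 12

/* Hypothetical helper: decode the pirq the guest asked for, or -1 when
 * the encoded value is 0 and the emulated path should be tried. */
static int demo_requested_pirq(uint32_t addr_hi, uint32_t addr_lo)
{
    int pirq = (int)((addr_hi & 0xffffff00u) |
                     ((addr_lo >> DEMO_ADDR_DEST_ID_SHIFT) & 0xffu));

    return pirq ? pirq : -1;
}
```

With addr_hi = 0x100 and addr_lo = 0xFEE23000 this gives pirq 0x123, and with both dest_id and the upper bits zero it falls back to -1.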
> +
> +    if (xc_physdev_map_pirq_msi(xen_xc, xen_domid, AUTO_ASSIGN, &pirq,
> +                                PCI_DEVFN(s->real_device->dev,
> +                                          s->real_device->func),
> +                                s->real_device->bus, 0, 0)) {
> +        PT_LOG("Error: Mapping of MSI failed.\n");

Give more details, as in which device failed. Perhaps even the return code?
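
For example, reusing the [bus:dev.func] format that pt_msi_disable already prints further down. A standalone sketch; the helper name is hypothetical and in the patch this would simply be the PT_LOG call with the extra arguments:

```c
#include <stdio.h>

/* Hypothetical helper: build the error text with the host BDF and the
 * return code, so the log says which device failed and why. */
static int demo_format_msi_map_error(char *buf, size_t len,
                                     unsigned bus, unsigned dev,
                                     unsigned func, int rc)
{
    return snprintf(buf, len,
                    "Error: Mapping of MSI failed. [%02x:%02x.%x] (rc: %d)",
                    bus, dev, func, rc);
}
```

The bus, dev and func values would come from s->real_device, and rc from the xc_physdev_map_pirq_msi call.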

> +        return -1;
> +    }
> +
> +    if (pirq < 0) {
> +        PT_LOG("Error: Invalid pirq number\n");
> +        return -1;
> +    }
> +
> +    s->msi->pirq = pirq;
> +    PT_LOG("msi mapped with pirq %x\n", pirq);
> +
> +    return 0;
> +}
> +
> +static uint32_t __get_msi_gflags(uint32_t data, uint64_t addr)
> +{
> +    uint32_t result = 0;
> +    int rh, dm, dest_id, deliv_mode, trig_mode;
> +
> +    rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
> +    dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
> +    dest_id = (addr >> MSI_ADDR_DEST_ID_SHIFT) & 0xff;
> +    deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
> +    trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> +
> +    result = dest_id | (rh << GFLAGS_SHIFT_RH) | (dm << GFLAGS_SHIFT_DM) |
> +             (deliv_mode << GLFAGS_SHIFT_DELIV_MODE) |
> +             (trig_mode << GLFAGS_SHIFT_TRG_MODE);
> +
> +    return result;
> +}
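
A short comment on the gflags layout would help here. As I read it, the fields are packed as in the standalone sketch below. The DEMO_ names are hypothetical; the address shifts are the ones quoted from apic-msidef.h, while the two data shifts (8 and 15) are my assumption about MSI_DATA_DELIVERY_MODE_SHIFT and MSI_DATA_TRIGGER_SHIFT. I spelled GFLAGS_ consistently; two of the defines in the patch read GLFAGS_.

```c
#include <stdint.h>

/* Address/data field positions (data shifts are assumptions). */
#define DEMO_ADDR_DEST_MODE_SHIFT    2
#define DEMO_ADDR_REDIRECTION_SHIFT  3
#define DEMO_ADDR_DEST_ID_SHIFT      12
#define DEMO_DATA_DELIVERY_SHIFT     8
#define DEMO_DATA_TRIGGER_SHIFT      15

/* gflags field positions, as in the patch. */
#define DEMO_GFLAGS_SHIFT_DEST_ID    0
#define DEMO_GFLAGS_SHIFT_RH         8
#define DEMO_GFLAGS_SHIFT_DM         9
#define DEMO_GFLAGS_SHIFT_DELIV_MODE 12
#define DEMO_GFLAGS_SHIFT_TRG_MODE   15

/* Repack the guest MSI address/data fields into the gflags word. */
static uint32_t demo_msi_gflags(uint32_t data, uint64_t addr)
{
    uint32_t rh = (addr >> DEMO_ADDR_REDIRECTION_SHIFT) & 0x1;
    uint32_t dm = (addr >> DEMO_ADDR_DEST_MODE_SHIFT) & 0x1;
    uint32_t dest_id = (addr >> DEMO_ADDR_DEST_ID_SHIFT) & 0xff;
    uint32_t deliv_mode = (data >> DEMO_DATA_DELIVERY_SHIFT) & 0x7;
    uint32_t trig_mode = (data >> DEMO_DATA_TRIGGER_SHIFT) & 0x1;

    return (dest_id << DEMO_GFLAGS_SHIFT_DEST_ID) |
           (rh << DEMO_GFLAGS_SHIFT_RH) |
           (dm << DEMO_GFLAGS_SHIFT_DM) |
           (deliv_mode << DEMO_GFLAGS_SHIFT_DELIV_MODE) |
           (trig_mode << DEMO_GFLAGS_SHIFT_TRG_MODE);
}
```

With addr = 0xFEE0100C and data = 0x8131, every field except dest_id is 1 and dest_id is 0x01, so the result is 0x9301.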
> +
> +int pt_msi_update(XenPCIPassthroughState *s)
> +{
> +    uint8_t gvec = 0;
> +    uint32_t gflags = 0;
> +    uint64_t addr = 0;
> +    int ret = 0;
> +
> +    /* get vector, address, flags info, etc. */
> +    gvec = s->msi->data & 0xFF;
> +    addr = (uint64_t)s->msi->addr_hi << 32 | s->msi->addr_lo;
> +    gflags = __get_msi_gflags(s->msi->data, addr);
> +
> +    PT_LOG("Update msi with pirq %x gvec %x gflags %x\n",
> +           s->msi->pirq, gvec, gflags);

And the details for the device?

> +
> +    ret = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec,
> +                                   s->msi->pirq, gflags, 0);
> +
> +    if (ret) {
> +        PT_LOG("Error: Binding of MSI failed.\n");
> +
> +        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
> +            PT_LOG("Error: Unmapping of MSI failed.\n");
> +        }
> +        s->msi->pirq = -1;
> +        return ret;
> +    }
> +    return 0;
> +}
> +
> +void pt_msi_disable(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint8_t gvec = 0;
> +    uint32_t gflags = 0;
> +    uint64_t addr = 0;
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    pt_msi_set_enable(s, 0);
> +
> +    e_device = PCI_SLOT(d->devfn);
> +    e_intx = pci_intx(s);
> +
> +    if (s->msi_trans_en) {
> +        if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
> +                                    PT_IRQ_TYPE_MSI_TRANSLATE, 0,
> +                                    e_device, e_intx, 0)) {
> +            PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!\n");
> +            goto out;
> +        }
> +    } else if (!(s->msi->flags & PT_MSI_FLAG_UNINIT)) {
> +        /* get vector, address, flags info, etc. */
> +        gvec = s->msi->data & 0xFF;
> +        addr = (uint64_t)s->msi->addr_hi << 32 | s->msi->addr_lo;
> +        gflags = __get_msi_gflags(s->msi->data, addr);
> +
> +        PT_LOG("Unbind msi with pirq %x, gvec %x\n",
> +                s->msi->pirq, gvec);
> +
> +        if (xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
> +                                        s->msi->pirq, gflags)) {
> +            PT_LOG("Error: Unbinding of MSI failed. [%02x:%02x.%x]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                   PCI_FUNC(d->devfn));
> +            goto out;
> +        }
> +    }
> +
> +    if (s->msi->pirq != -1) {
> +        PT_LOG("Unmap msi with pirq %x\n", s->msi->pirq);
> +
> +        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
> +            PT_LOG("Error: Unmapping of MSI failed. [%02x:%02x.%x]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                   PCI_FUNC(d->devfn));
> +            goto out;
> +        }
> +    }
> +
> +out:
> +    /* clear msi info */
> +    s->msi->flags = 0;
> +    s->msi->pirq = -1;
> +    s->msi_trans_en = 0;
> +}
> +
> +/* MSI-INTx translation virtulization functions */

virtualization

> +int pt_enable_msi_translate(XenPCIPassthroughState *s)
> +{
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    if (!(s->msi && s->msi_trans_cap)) {
> +        return -1;
> +    }
> +
> +    pt_msi_set_enable(s, 0);
> +    s->msi_trans_en = 0;
> +
> +    if (pt_msi_setup(s)) {
> +        PT_LOG("Error: MSI-INTx translation MSI setup failed, fallback\n");
> +        return -1;
> +    }
> +
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    /* fix virtual interrupt pin to INTA# */
> +    e_intx = pci_intx(s);
> +
> +    if (xc_domain_bind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
> +                              PT_IRQ_TYPE_MSI_TRANSLATE, 0,
> +                              e_device, e_intx, 0)) {
> +        PT_LOG("Error: MSI-INTx translation bind failed, fallback\n");
> +
> +        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, s->msi->pirq)) {
> +            PT_LOG("Error: Unmapping of MSI failed.\n");
> +        }
> +        s->msi->pirq = -1;
> +        return -1;
> +    }
> +
> +    pt_msi_set_enable(s, 1);
> +    s->msi_trans_en = 1;
> +
> +    return 0;
> +}
> +
> +void pt_disable_msi_translate(XenPCIPassthroughState *s)
> +{
> +    uint8_t e_device = 0;
> +    uint8_t e_intx = 0;
> +
> +    /* MSI_ENABLE bit should be disabed until the new handler is set */
> +    pt_msi_set_enable(s, 0);
> +
> +    e_device = PCI_SLOT(s->dev.devfn);
> +    e_intx = pci_intx(s);
> +
> +    if (xc_domain_unbind_pt_irq(xen_xc, xen_domid, s->msi->pirq,
> +                                 PT_IRQ_TYPE_MSI_TRANSLATE, 0,
> +                                 e_device, e_intx, 0)) {
> +        PT_LOG("Error: Unbinding pt irq for MSI-INTx failed!\n");
> +    }
> +
> +    if (s->machine_irq) {
> +        if (xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, s->machine_irq,
> +                                       0, e_device, e_intx)) {
> +            PT_LOG("Error: Rebinding of interrupt failed!\n");
> +        }
> +    }
> +
> +    s->msi_trans_en = 0;
> +}
> +
> +/*********************************/
> +/* MSI-X virtulization functions */


virtu...

> +
> +static void mask_physical_msix_entry(XenPCIPassthroughState *s,
> +                                     int entry_nr, int mask)
> +{
> +    void *phys_off;
> +
> +    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
> +    *(uint32_t *)phys_off = mask;
> +}
> +
> +static int pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr)
> +{
> +    XenMSIXEntry *entry = &s->msix->msix_entry[entry_nr];
> +    int pirq = entry->pirq;
> +    int gvec = entry->io_mem[2] & 0xff;
> +    uint64_t gaddr = *(uint64_t *)&entry->io_mem[0];
> +    uint32_t gflags = __get_msi_gflags(entry->io_mem[2], gaddr);
> +    int ret;
> +
> +    if (!entry->flags) {
> +        return 0;
> +    }
> +
> +    if (!gvec) {
> +        /* if gvec is 0, the guest is asking for a particular pirq that
> +         * is passed as dest_id */
> +        pirq = ((gaddr >> 32) & 0xffffff00) |
> +               (((gaddr & 0xffffffff) >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> +        if (!pirq) {
> +            /* this probably identifies an misconfiguration of the guest,
> +             * try the emulated path */
> +            pirq = -1;
> +        } else {
> +            PT_LOG("pt_msix_update_one requested pirq = %d\n", pirq);

This is the same code as in the MSI case. Could it be coalesced?
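
One way the shared part could look, as a rough sketch only. It assumes the
MSI path derives the requested pirq from the address exactly as the MSI-X
code quoted above does. The helper name pt_requested_pirq and the value
given to MSI_ADDR_DEST_ID_SHIFT here are assumptions for illustration, not
taken from the patch.

```c
#include <stdint.h>

/* Assumed value for illustration; the real definition comes from the
 * MSI headers used by the patch. */
#define MSI_ADDR_DEST_ID_SHIFT 12

/*
 * Hypothetical helper: extract the pirq a guest requested through the
 * destination ID of an MSI/MSI-X address (the gvec == 0 case).
 * Returns -1 when no pirq is encoded, so the caller can fall back to
 * the emulated path.
 */
static int pt_requested_pirq(uint64_t gaddr)
{
    uint32_t hi = (uint32_t)(gaddr >> 32) & 0xffffff00;
    uint32_t lo = ((uint32_t)gaddr >> MSI_ADDR_DEST_ID_SHIFT) & 0xff;
    uint32_t pirq = hi | lo;

    return pirq ? (int)pirq : -1;
}
```

Both callers would then only need to check gvec and call the helper, keeping
the logging next to the call site.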

> +        }
> +    }
> +
> +    /* Check if this entry is already mapped */
> +    if (entry->pirq == -1) {
> +        ret = xc_physdev_map_pirq_msi(xen_xc, xen_domid, AUTO_ASSIGN, &pirq,
> +                                      PCI_DEVFN(s->real_device->dev,
> +                                                s->real_device->func),
> +                                      s->real_device->bus, entry_nr,
> +                                      s->msix->table_base);
> +        if (ret) {
> +            PT_LOG("Error: Mapping msix entry %x\n", entry_nr);

Oh boy. So here the error message prints the entry with %x, but later on
with %d. Should it be %d or 0x%x?


> +            return ret;
> +        }
> +        entry->pirq = pirq;
> +    }
> +
> +    PT_LOG("Update msix entry %x with pirq %x gvec %x\n",
> +            entry_nr, pirq, gvec);
> +
> +    ret = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags,
> +                                   s->msix->mmio_base_addr);
> +    if (ret) {
> +        PT_LOG("Error: Updating msix irq info for entry %d\n", entry_nr);
> +
> +        if (xc_physdev_unmap_pirq(xen_xc, xen_domid, entry->pirq)) {
> +            PT_LOG("Error: Unmapping of MSI-X failed.\n");
> +        }
> +        entry->pirq = -1;
> +        return ret;
> +    }
> +
> +    entry->flags = 0;
> +
> +    return 0;
> +}
> +
> +int pt_msix_update(XenPCIPassthroughState *s)
> +{
> +    XenPTMSIX *msix = s->msix;
> +    int i;
> +
> +    for (i = 0; i < msix->total_entries; i++) {
> +        pt_msix_update_one(s, i);
> +    }
> +
> +    return 0;
> +}
> +
> +void pt_msix_disable(XenPCIPassthroughState *s)
> +{
> +    PCIDevice *d = &s->dev;
> +    uint8_t gvec = 0;
> +    uint32_t gflags = 0;
> +    uint64_t addr = 0;
> +    int i = 0;
> +    XenMSIXEntry *entry = NULL;
> +
> +    msix_set_enable(s, 0);
> +
> +    for (i = 0; i < s->msix->total_entries; i++) {
> +        entry = &s->msix->msix_entry[i];
> +
> +        if (entry->pirq == -1) {
> +            continue;
> +        }
> +
> +        gvec = entry->io_mem[2] & 0xff;
> +        addr = *(uint64_t *)&entry->io_mem[0];
> +        gflags = __get_msi_gflags(entry->io_mem[2], addr);
> +
> +        PT_LOG("Unbind msix with pirq %x, gvec %x\n",
> +                entry->pirq, gvec);
> +
> +        if (xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
> +                                        entry->pirq, gflags)) {
> +            PT_LOG("Error: Unbinding of MSI-X failed. [%02x:%02x.%x]\n",
> +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> +                   PCI_FUNC(d->devfn));
> +        } else {
> +            PT_LOG("Unmap msix with pirq %x\n", entry->pirq);
> +
> +            if (xc_physdev_unmap_pirq(xen_xc, xen_domid, entry->pirq)) {
> +                PT_LOG("Error: Unmapping of MSI-X failed. [%02x:%02x.%x]\n",
> +                       pci_bus_num(d->bus),
> +                       PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));

There are a lot of these error reports where pci_bus_num, PCI_SLOT, etc.
are used. Perhaps this should be in a function?
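
A minimal sketch of such a function, assuming a plain "bus:slot.func"
string is all the callers need. The name pt_format_bdf is hypothetical, and
the EX_PCI_SLOT/EX_PCI_FUNC macros are local copies of the usual devfn
layout (5-bit slot, 3-bit function) so that the fragment stands alone.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-ins for PCI_SLOT/PCI_FUNC: devfn packs a 5-bit slot
 * number above a 3-bit function number. */
#define EX_PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define EX_PCI_FUNC(devfn) ((devfn) & 0x07)

/*
 * Hypothetical helper: format "bb:ss.f" into buf and return buf, so the
 * result can be passed straight to a "%s" in a log message.
 */
static const char *pt_format_bdf(char *buf, size_t len,
                                 uint8_t bus, uint8_t devfn)
{
    snprintf(buf, len, "%02x:%02x.%x",
             (unsigned)bus,
             (unsigned)EX_PCI_SLOT(devfn),
             (unsigned)EX_PCI_FUNC(devfn));
    return buf;
}
```

An error path would then print one "[%s]" instead of repeating the three
arguments at every call site.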

> +            }
> +        }
> +        /* clear msi-x info */
> +        entry->pirq = -1;
> +        entry->flags = 0;
> +    }
> +}
> +
> +int pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index)
> +{
> +    XenMSIXEntry *entry;
> +    int i, ret;
> +
> +    if (!(s->msix && s->msix->bar_index == bar_index)) {
> +        return 0;
> +    }
> +
> +    for (i = 0; i < s->msix->total_entries; i++) {
> +        entry = &s->msix->msix_entry[i];
> +        if (entry->pirq != -1) {
> +            ret = xc_domain_unbind_pt_irq(xen_xc, xen_domid, entry->pirq,
> +                                          PT_IRQ_TYPE_MSI, 0, 0, 0, 0);
> +            if (ret) {
> +                PT_LOG("Error: unbind MSI-X entry %d failed\n", entry->pirq);
> +            }
> +            entry->flags = 1;
> +        }
> +    }
> +    pt_msix_update(s);
> +
> +    return 0;
> +}
> +
> +static void pci_msix_invalid_write(void *opaque, target_phys_addr_t addr,
> +                                   uint32_t val)
> +{
> +    PT_LOG("Error: Invalid write to MSI-X table,"
> +           " only dword access is allowed.\n");
> +}
> +
> +static void pci_msix_writel(void *opaque, target_phys_addr_t addr,
> +                            uint32_t val)
> +{
> +    XenPCIPassthroughState *s = (XenPCIPassthroughState *)opaque;
> +    XenPTMSIX *msix = s->msix;
> +    XenMSIXEntry *entry;
> +    int entry_nr, offset;
> +    void *phys_off;
> +    uint32_t vec_ctrl;
> +
> +    if (addr % 4) {
> +        PT_LOG("Error: Unaligned dword access to MSI-X table, "
> +                "addr %016"PRIx64"\n", addr);
> +        return;
> +    }
> +
> +    PT_LOG("addr: "TARGET_FMT_plx", val: %#x\n", addr, val);

Huh?

> +
> +    entry_nr = addr / 16;
> +    entry = &msix->msix_entry[entry_nr];
> +    offset = (addr % 16) / 4;
> +
> +    /*
> +     * If Xen intercepts the mask bit access, io_mem[3] may not be
> +     * up-to-date. Read from hardware directly.
> +     */
> +    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
> +    vec_ctrl = *(uint32_t *)phys_off;
> +
> +    if (offset != 3 && msix->enabled && !(vec_ctrl & 0x1)) {
> +        PT_LOG("Error: Can't update msix entry %d since MSI-X is already "
> +                "function.\n", entry_nr);

already function? already on? active?


> +        return;
> +    }
> +
> +    if (offset != 3 && entry->io_mem[offset] != val) {
> +        entry->flags = 1;
> +    }
> +    entry->io_mem[offset] = val;
> +
> +    if (offset == 3) {
> +        if (msix->enabled && !(val & 0x1)) {
> +            pt_msix_update_one(s, entry_nr);
> +        }
> +        mask_physical_msix_entry(s, entry_nr, entry->io_mem[3] & 0x1);
> +    }
> +}
> +
> +static CPUWriteMemoryFunc *pci_msix_write[] = {
> +    pci_msix_invalid_write,
> +    pci_msix_invalid_write,
> +    pci_msix_writel
> +};
> +
> +static uint32_t pci_msix_invalid_read(void *opaque, target_phys_addr_t addr)
> +{
> +    PT_LOG("Error: Invalid read to MSI-X table,"
> +           " only dword access is allowed.\n");
> +    return 0;
> +}
> +
> +static uint32_t pci_msix_readl(void *opaque, target_phys_addr_t addr)
> +{
> +    XenPCIPassthroughState *s = (XenPCIPassthroughState *)opaque;
> +    XenPTMSIX *msix = s->msix;
> +    int entry_nr, offset;
> +
> +    if (addr % 4) {
> +        PT_LOG("Error: Unaligned dword access to MSI-X table, "
> +                "addr %016"PRIx64"\n", addr);
> +        return 0;
> +    }
> +
> +    PT_LOG("addr: "TARGET_FMT_plx"\n", addr);
> +
> +    entry_nr = addr / 16;
> +    offset = (addr % 16) / 4;
> +
> +    return msix->msix_entry[entry_nr].io_mem[offset];
> +}
> +
> +static CPUReadMemoryFunc *pci_msix_read[] = {
> +    pci_msix_invalid_read,
> +    pci_msix_invalid_read,
> +    pci_msix_readl
> +};
> +
> +int pt_add_msix_mapping(XenPCIPassthroughState *s, int bar_index)
> +{
> +    if (!(s->msix && s->msix->bar_index == bar_index)) {
> +        return 0;
> +    }
> +
> +    return xc_domain_memory_mapping(xen_xc, xen_domid,
> +         s->msix->mmio_base_addr >> XC_PAGE_SHIFT,
> +         (s->bases[bar_index].access.maddr + s->msix->table_off)
> +             >> XC_PAGE_SHIFT,
> +         (s->msix->total_entries * 16 + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> +         DPCI_ADD_MAPPING);
> +}
> +
> +int pt_remove_msix_mapping(XenPCIPassthroughState *s, int bar_index)
> +{
> +    if (!(s->msix && s->msix->bar_index == bar_index)) {
> +        return 0;
> +    }
> +
> +    s->msix->mmio_base_addr = s->bases[bar_index].e_physbase
> +        + s->msix->table_off;
> +
> +    cpu_register_physical_memory(s->msix->mmio_base_addr,
> +                                 s->msix->total_entries * 16,
> +                                 s->msix->mmio_index);
> +
> +    return xc_domain_memory_mapping(xen_xc, xen_domid,
> +         s->msix->mmio_base_addr >> XC_PAGE_SHIFT,
> +         (s->bases[bar_index].access.maddr + s->msix->table_off)
> +             >> XC_PAGE_SHIFT,
> +         (s->msix->total_entries * 16 + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> +         DPCI_REMOVE_MAPPING);
> +}
> +
> +int pt_msix_init(XenPCIPassthroughState *s, int base)
> +{
> +    uint8_t id;
> +    uint16_t control;
> +    int i, total_entries, table_off, bar_index;
> +    HostPCIDevice *d = s->real_device;
> +    int fd;
> +
> +    id = host_pci_get_byte(d, base + PCI_CAP_LIST_ID);
> +
> +    if (id != PCI_CAP_ID_MSIX) {
> +        PT_LOG("Error: Invalid id %#x base %#x\n", id, base);
> +        return -1;
> +    }
> +
> +    control = host_pci_get_word(d, base + 2);
> +    total_entries = control & 0x7ff;
> +    total_entries += 1;
> +
> +    s->msix = g_malloc0(sizeof (XenPTMSIX)
> +                        + total_entries * sizeof (XenMSIXEntry));
> +
> +    s->msix->total_entries = total_entries;
> +    for (i = 0; i < total_entries; i++) {
> +        s->msix->msix_entry[i].pirq = -1;
> +    }
> +
> +    s->msix->mmio_index =
> +        cpu_register_io_memory(pci_msix_read, pci_msix_write,
> +                               s, DEVICE_NATIVE_ENDIAN);
> +
> +    table_off = host_pci_get_long(d, base + PCI_MSIX_TABLE);
> +    bar_index = s->msix->bar_index = table_off & PCI_MSIX_FLAGS_BIRMASK;
> +    table_off = s->msix->table_off = table_off & ~PCI_MSIX_FLAGS_BIRMASK;
> +    s->msix->table_base = s->real_device->io_regions[bar_index].base_addr;
> +    PT_LOG("get MSI-X table bar base %#"PRIx64"\n", s->msix->table_base);
> +
> +    fd = open("/dev/mem", O_RDWR);
> +    if (fd == -1) {
> +        PT_LOG("Error: Can't open /dev/mem: %s\n", strerror(errno));
> +        goto error_out;
> +    }
> +    PT_LOG("table_off = %#x, total_entries = %d\n", table_off, total_entries);
> +    s->msix->table_offset_adjust = table_off & 0x0fff;
> +    s->msix->phys_iomem_base =
> +        mmap(0,
> +             total_entries * 16 + s->msix->table_offset_adjust,
> +             PROT_WRITE | PROT_READ,
> +             MAP_SHARED | MAP_LOCKED,
> +             fd,
> +             s->msix->table_base + table_off - s->msix->table_offset_adjust);
> +
> +    if (s->msix->phys_iomem_base == MAP_FAILED) {
> +        PT_LOG("Error: Can't map physical MSI-X table: %s\n", strerror(errno));
> +        close(fd);
> +        goto error_out;
> +    }
> +    s->msix->phys_iomem_base = (char *)s->msix->phys_iomem_base
> +        + s->msix->table_offset_adjust;
> +
> +    close(fd);
> +
> +    PT_LOG("mapping physical MSI-X table to %p\n", s->msix->phys_iomem_base);
> +    return 0;
> +
> +error_out:
> +    g_free(s->msix);
> +    s->msix = NULL;
> +    return -1;
> +}
> +
> +void pt_msix_delete(XenPCIPassthroughState *s)
> +{
> +    /* unmap the MSI-X memory mapped register area */
> +    if (s->msix->phys_iomem_base) {
> +        PT_LOG("unmapping physical MSI-X table from %lx\n",
> +           (unsigned long)s->msix->phys_iomem_base);
> +        munmap(s->msix->phys_iomem_base, s->msix->total_entries * 16 +
> +           s->msix->table_offset_adjust);
> +    }
> +
> +    if (s->msix->mmio_index > 0) {
> +        cpu_unregister_io_memory(s->msix->mmio_index);
> +    }
> +
> +    g_free(s->msix);
> +    s->msix = NULL;
> +}
> -- 
> Anthony PERARD
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel


* Re: [Qemu-devel] [Xen-devel] [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
  2011-11-10 21:28     ` Konrad Rzeszutek Wilk
@ 2011-11-11 16:27       ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-11 16:27 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Thu, 10 Nov 2011, Konrad Rzeszutek Wilk wrote:

> On Fri, Oct 28, 2011 at 04:07:33PM +0100, Anthony PERARD wrote:
> > From: Allen Kay <allen.m.kay@intel.com>
> >
>
> This is going to be a bit of a lame review..
>

[...]

> > +        return;
> > +    }
> > +
> > +    /* find register group entry */
> > +    reg_grp_entry = pt_find_reg_grp(s, address);
> > +    if (reg_grp_entry) {
> > +        /* check 0 Hardwired register group */
> > +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> > +            /* ignore silently */
> > +            PT_LOG("Warning: Access to 0 Hardwired register. "
> > +                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> > +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> > +                   address, len);
> > +            return;
> > +        }
> > +    }
> > +
> > +    /* read I/O device register value */
> > +    rc = host_pci_get_block(s->real_device, address,
> > +                             (uint8_t *)&read_val, len);
> > +    if (!rc) {
> > +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
>
> There isn't a PT_ERR? Hm, looking at the code there is only PT_LOG. Perhaps
> declaring PT_ERR and PT_WARN might be a good idea? In case in the future
> one wants different levels of this? Or do we really not care much about that?

I will add these two macros.
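
A rough sketch of what the two macros could look like, not the final
definition. It assumes the messages go to stderr; the pt_log function and
the "xen-pt:" prefix are made up for the example, and the real PT_LOG may be
defined differently.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical backend for the level macros: prints a level prefix
 * followed by the formatted message, and returns the number of
 * characters written.
 */
static int pt_log(const char *level, const char *fmt, ...)
{
    va_list ap;
    int n;

    n = fprintf(stderr, "xen-pt: %s: ", level);
    va_start(ap, fmt);
    n += vfprintf(stderr, fmt, ap);
    va_end(ap);
    return n;
}

/* The format string is passed as the first variadic argument. */
#define PT_ERR(...)  pt_log("Error", __VA_ARGS__)
#define PT_WARN(...) pt_log("Warning", __VA_ARGS__)
```

With this, the existing PT_LOG("Error: ...") calls would become
PT_ERR("...") and the "Error:" text would come from the macro.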

> > +        memset(&read_val, 0xff, len);
> > +    }
> > +
> > +    /* pass directly to libpci for passthrough type register group */
>
> Um, is the libpci requirement a certain thing?

:(, it's just an old comment. libpci is not used anymore and has been
replaced by host-pci-device. I will replace libpci in the comment with
"the real device".


[...]

> > +            switch (reg->size) {
> > +            case 1:
> > +                if (reg->u.b.write) {
> > +                    rc = reg->u.b.write(s, reg_entry, ptr_val,
> > +                                        read_val >> ((real_offset & 3) << 3),
> > +                                        valid_mask);
> > +                }
> > +                break;
> > +            case 2:
> > +                if (reg->u.w.write) {
> > +                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
> > +                                        (read_val >> ((real_offset & 3) << 3)),
> > +                                        valid_mask);
> > +                }
> > +                break;
> > +            case 4:
> > +                if (reg->u.dw.write) {
> > +                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
> > +                                         (read_val >> ((real_offset & 3) << 3)),
> > +                                         valid_mask);
> > +                }
> > +                break;
> > +            }
> > +
> > +            if (rc < 0) {
> > +                hw_error("Internal error: Invalid write emulation "
> > +                         "return value[%d]. I/O emulator exit.\n", rc);
>
> Oh. I hadn't realized this, but you are using hw_error. Which is
> calling 'abort'! Yikes. Is there no way to recover from this? Say return 0xfffff?

In qemu-xen-traditional, it was an exit(1). I do not know the
consequences of a bad write, and I cannot return anything. So I suppose
that the guest would know that something is wrong only on the next read.

Instead of abort(), I can just do nothing and return. Or we could unplug
the device from QEMU.

Any preference?

> > +            }
> > +
> > +            /* calculate next address to find */
> > +            emul_len -= reg->size;
> > +            if (emul_len > 0) {
> > +                find_addr = real_offset + reg->size;
> > +            }
> > +        } else {
> > +            /* nothing to do with passthrough type register,
> > +             * continue to find next byte */
> > +            emul_len--;
> > +            find_addr++;
> > +        }
> > +    }
> > +
> > +    /* need to shift back before passing them to libpci */
> > +    val >>= (address & 3) << 3;
> > +
> > +out:
> > +    if (!(reg && reg->no_wb)) {
> > +        /* unknown regs are passed through */
> > +        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
> > +
> > +        if (!rc) {
> > +            PT_LOG("Error: pci_write_block failed. return value[%d].\n", rc);
> > +        }
> > +    }
> > +
> > +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> > +        qemu_mod_timer(s->pm_state->pm_timer,
> > +                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
> > +    }
> > +}
> > +
> > +/* ioport/iomem space*/
> > +static void pt_iomem_map(XenPCIPassthroughState *s, int i,
> > +                         pcibus_t e_phys, pcibus_t e_size, int type)
> > +{
> > +    uint32_t old_ebase = s->bases[i].e_physbase;
> > +    bool first_map = s->bases[i].e_size == 0;
> > +    int ret = 0;
> > +
> > +    s->bases[i].e_physbase = e_phys;
> > +    s->bases[i].e_size = e_size;
> > +
> > +    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%%d"
> > +           " len=%#"PRIx64" index=%d first_map=%d\n",
> > +           e_phys, s->bases[i].access.maddr, /*type,*/
> > +           e_size, i, first_map);
> > +
> > +    if (e_size == 0) {
> > +        return;
> > +    }
> > +
> > +    if (!first_map && old_ebase != -1) {
>
> old_ebase != PCI_BAR_UNMAPPED ?

:(, no. Because old_ebase is a uint32_t and PCI_BAR_UNMAPPED is
pcibus_t (uint64_t in the Xen case).

I'm not sure that it is a good idea to change the type of old_ebase, as
xc_domain_memory_mapping below takes only uint32_t.

But, if I can replace a -1 with PCI_BAR_UNMAPPED, I will.
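
To illustrate the type problem, a small sketch. It assumes pcibus_t is
uint64_t and that PCI_BAR_UNMAPPED is the all-ones pcibus_t value; both are
redefined locally here, and the two function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local assumptions, redefined so the fragment stands alone. */
typedef uint64_t pcibus_t;
#define PCI_BAR_UNMAPPED (~(pcibus_t)0)

/*
 * Direct comparison: the 32-bit value is zero-extended to 64 bits, so
 * it can never equal the 64-bit all-ones constant.
 */
static bool ebase32_is_unmapped_naive(uint32_t old_ebase)
{
    return old_ebase == PCI_BAR_UNMAPPED;
}

/*
 * Truncating the constant to 32 bits first gives the same result as
 * the current "old_ebase != -1" test, but with the named constant.
 */
static bool ebase32_is_unmapped(uint32_t old_ebase)
{
    return old_ebase == (uint32_t)PCI_BAR_UNMAPPED;
}
```

So a plain "old_ebase != PCI_BAR_UNMAPPED" would always be true for a
uint32_t, while the cast version keeps today's behaviour.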

> > +        /* Remove old mapping */
> > +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> > +                               old_ebase >> XC_PAGE_SHIFT,
> > +                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> > +                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> > +                               DPCI_REMOVE_MAPPING);
> > +        if (ret != 0) {
> > +            PT_LOG("Error: remove old mapping failed!\n");
> > +            return;
> > +        }
> > +    }
> > +
> > +    /* map only valid guest address */
> > +    if (e_phys != -1) {
>
> PCI_BAR_UNMAPPED
>
> > +        /* Create new mapping */
> > +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> > +                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> > +                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> > +                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> > +                                   DPCI_ADD_MAPPING);
> > +
> > +        if (ret != 0) {
> > +            PT_LOG("Error: create new mapping failed!\n");
> > +        }
> > +    }
> > +}

[...]

> > +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> > +        pt_bar_mapping_one(s, i, io_enable, mem_enable);
> > +    }
> > +}
> > +
> > +/* register regions */
> > +static int pt_register_regions(XenPCIPassthroughState *s)
> > +{
> > +    int i = 0;
> > +    uint32_t bar_data = 0;
> > +    HostPCIDevice *d = s->real_device;
> > +
> > +    /* Register PIO/MMIO BARs */
> > +    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
> > +        HostPCIIORegion *r = &d->io_regions[i];
> > +
> > +        if (r->base_addr) {
>
> So should you check for PCI_BAR_UNMAPPED or is that not really
> required here as the pci_register_bar would do it?

Actually, this value comes from the real device (the value in
sysfs/resource). So, I think it's just 0 if it's not mapped.

Here, it's probably better to check the size instead, to know if
there is actually a BAR.
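
Something along these lines, as a sketch only. ExampleIORegion is a
simplified stand-in for HostPCIIORegion (base_addr, size, flags), and
region_is_present is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for HostPCIIORegion, for illustration only. */
typedef struct ExampleIORegion {
    uint64_t base_addr;
    uint64_t size;
    uint64_t flags;
} ExampleIORegion;

/*
 * Hypothetical check: treat a region as an implemented BAR when its
 * size is non-zero, independently of the base address value.
 */
static bool region_is_present(const ExampleIORegion *r)
{
    return r->size != 0;
}
```

The loop in pt_register_regions would then test the size rather than
r->base_addr before registering the region.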

> > +            s->bases[i].e_physbase = r->base_addr;
> > +            s->bases[i].access.u = r->base_addr;
> > +
> > +            /* Register current region */
> > +            if (r->flags & IORESOURCE_IO) {
> > +                memory_region_init_io(&s->bar[i], NULL, NULL,
> > +                                      "xen-pci-pt-bar", r->size);
>
> You can make the "xen-pci-pt-bar" be a #define somewhere and reuse that.
>
> > +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
> > +                                 &s->bar[i]);
> > +            } else if (r->flags & IORESOURCE_PREFETCH) {
> > +                memory_region_init_io(&s->bar[i], NULL, NULL,
> > +                                      "xen-pci-pt-bar", r->size);
> > +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
> > +                                 &s->bar[i]);
> > +            } else {
> > +                memory_region_init_io(&s->bar[i], NULL, NULL,
> > +                                      "xen-pci-pt-bar", r->size);
> > +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
> > +                                 &s->bar[i]);
> > +            }
> > +
> > +            PT_LOG("IO region registered (size=0x%08"PRIx64
> > +                   " base_addr=0x%08"PRIx64")\n",
> > +                   r->size, r->base_addr);
> > +        }
> > +    }
> > +
> > +    /* Register expansion ROM address */
> > +    if (d->rom.base_addr && d->rom.size) {
> > +        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
> > +        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
> > +        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
> > +            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
> > +            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
> > +        }
> > +
> > +        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
> > +        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
> > +
> > +        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
> > +                                      "xen-pci-pt-rom", d->rom.size);
> > +        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
> > +                         &s->rom);
> > +
> > +        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
> > +               " base_addr=0x%08"PRIx64")\n",
> > +               d->rom.size, d->rom.base_addr);
> > +    }
> > +
> > +    return 0;
> > +}
> > +

[...]

> > +static int pt_initfn(PCIDevice *pcidev)
> > +{
> > +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
> > +    int dom, bus;
> > +    unsigned slot, func;
> > +    int rc = 0;
> > +    uint32_t machine_irq;
> > +    int pirq = -1;
> > +
> > +    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
> > +        fprintf(stderr, "error parse bdf: %s\n", s->hostaddr);
> > +        return -1;
> > +    }
> > +
> > +    /* register real device */
> > +    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
> > +           bus, slot, func, s->dev.devfn);
> > +
> > +    s->real_device = host_pci_device_get(bus, slot, func);
> > +    if (!s->real_device) {
> > +        return -1;
> > +    }
> > +
> > +    s->is_virtfn = s->real_device->is_virtfn;
> > +    if (s->is_virtfn) {
> > +        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
> > +               s->real_device->domain, bus, slot, func);
> > +    }
> > +
> > +    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
> > +    if (host_pci_get_block(s->real_device, 0, pcidev->config,
> > +                           PCI_CONFIG_SPACE_SIZE) == -1) {
> > +        return -1;
> > +    }
> > +
> > +    /* Handle real device's MMIO/PIO BARs */
> > +    pt_register_regions(s);
> > +
> > +    /* reinitialize each config register to be emulated */
> > +    pt_config_init(s);
> > +
> > +    /* Bind interrupt */
> > +    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
> > +        PT_LOG("no pin interrupt\n");
>
> Perhaps include some details of which device failed?

There are already details about the device at the beginning of the
function. Is that not enough?

> > +        goto out;
> > +    }
> > +
> > +    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
> > +    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
> > +
> > +    if (rc) {
> > +        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
>
> Can you also include the IRQ it tried to map (both machine and pirq).

Yep.

> > +
> > +        /* Disable PCI intx assertion (turn on bit10 of devctl) */
> > +        host_pci_set_word(s->real_device,
> > +                          PCI_COMMAND,
> > +                          pci_get_word(s->dev.config + PCI_COMMAND)
> > +                          | PCI_COMMAND_INTX_DISABLE);
> > +        machine_irq = 0;
> > +        s->machine_irq = 0;
> > +    } else {
> > +        machine_irq = pirq;
> > +        s->machine_irq = pirq;
> > +        mapped_machine_irq[machine_irq]++;
> > +    }
> > +
> > +    /* bind machine_irq to device */
> > +    if (rc < 0 && machine_irq != 0) {
> > +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> > +        uint8_t e_intx = pci_intx(s);
> > +
> > +        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
> > +                                       e_device, e_intx);
> > +        if (rc < 0) {
> > +            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);
>
> A bit more detail: name of the device, the IRQ, ...
>
> > +
> > +            /* Disable PCI intx assertion (turn on bit10 of devctl) */
> > +            host_pci_set_word(s->real_device, PCI_COMMAND,
> > +                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
> > +                              | PCI_COMMAND_INTX_DISABLE);
> > +            mapped_machine_irq[machine_irq]--;
> > +
> > +            if (mapped_machine_irq[machine_irq] == 0) {
> > +                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
> > +                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
> > +                           rc);
>
> And here too. It would be beneficial to have on the error paths lots of
> nice details so that in the field it will be easier to find out what
> went wrong (and match up PIRQ with the GSI).

Yes, I will try to improve the messages.

It's also probably good to always print the errors.


Thanks,

-- 
Anthony PERARD


* Re: [Xen-devel] [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
@ 2011-11-11 16:27       ` Anthony PERARD
  0 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-11 16:27 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Thu, 10 Nov 2011, Konrad Rzeszutek Wilk wrote:

> On Fri, Oct 28, 2011 at 04:07:33PM +0100, Anthony PERARD wrote:
> > From: Allen Kay <allen.m.kay@intel.com>
> >
>
> This is going to be a bit of a lame review..
>

[...]

> > +        return;
> > +    }
> > +
> > +    /* find register group entry */
> > +    reg_grp_entry = pt_find_reg_grp(s, address);
> > +    if (reg_grp_entry) {
> > +        /* check 0 Hardwired register group */
> > +        if (reg_grp_entry->reg_grp->grp_type == GRP_TYPE_HARDWIRED) {
> > +            /* ignore silently */
> > +            PT_LOG("Warning: Access to 0 Hardwired register. "
> > +                   "[%02x:%02x.%x][Offset:%02xh][Length:%d]\n",
> > +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn), PCI_FUNC(d->devfn),
> > +                   address, len);
> > +            return;
> > +        }
> > +    }
> > +
> > +    /* read I/O device register value */
> > +    rc = host_pci_get_block(s->real_device, address,
> > +                             (uint8_t *)&read_val, len);
> > +    if (!rc) {
> > +        PT_LOG("Error: pci_read_block failed. return value[%d].\n", rc);
>
> There isn't a PT_ERR? Hm, looking at the code there is only PT_LOG. Perhaps
> declaring PT_ERR and PT_WARN might be a good idea? In case in the future
> one wants different levels of this? Or do we really not care much about that?

I will add these two macros.
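
As a rough idea of what such macros could look like, here is a minimal
sketch. It is not the code from the patch series: pt_format(), PT_EMIT,
the pt_log_level enum and the "[xen-pt ...]" prefix are all names made up
for the illustration. The only point is that PT_ERR and PT_WARN share one
formatter and always print, whatever is decided for PT_LOG.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical severity levels; the names are illustrative only. */
enum pt_log_level { PT_LEVEL_ERR, PT_LEVEL_WARN, PT_LEVEL_INFO };

/* Format one message into buf as "[xen-pt <level>] <function>: <text>".
 * Returns the number of characters the full message needs, in the same
 * way snprintf does, or a negative value on a formatting error. */
static int pt_format(char *buf, size_t size, enum pt_log_level level,
                     const char *func, const char *fmt, ...)
{
    static const char *const prefix[] = { "error", "warning", "info" };
    va_list ap;
    int n, m;

    n = snprintf(buf, size, "[xen-pt %s] %s: ", prefix[level], func);
    if (n < 0 || (size_t)n >= size) {
        return n;
    }
    va_start(ap, fmt);
    m = vsnprintf(buf + n, size - n, fmt, ap);
    va_end(ap);
    return m < 0 ? m : n + m;
}

/* Errors and warnings go to stderr unconditionally. */
#define PT_EMIT(level, ...)                                        \
    do {                                                           \
        char pt_buf_[256];                                         \
        pt_format(pt_buf_, sizeof(pt_buf_), level, __func__,       \
                  __VA_ARGS__);                                    \
        fputs(pt_buf_, stderr);                                    \
    } while (0)

#define PT_ERR(...)  PT_EMIT(PT_LEVEL_ERR, __VA_ARGS__)
#define PT_WARN(...) PT_EMIT(PT_LEVEL_WARN, __VA_ARGS__)
```

A call such as PT_ERR("Mapping irq failed, rc = %d\n", rc) would then
carry both the severity and the calling function in the message.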

> > +        memset(&read_val, 0xff, len);
> > +    }
> > +
> > +    /* pass directly to libpci for passthrough type register group */
>
> Um, is the libpci requirement a certain thing?

:(, it's just an old comment. libpci is not used anymore and has been
replaced by host-pci-device. I will replace libpci in the comment with
"the real device".


[...]

> > +            switch (reg->size) {
> > +            case 1:
> > +                if (reg->u.b.write) {
> > +                    rc = reg->u.b.write(s, reg_entry, ptr_val,
> > +                                        read_val >> ((real_offset & 3) << 3),
> > +                                        valid_mask);
> > +                }
> > +                break;
> > +            case 2:
> > +                if (reg->u.w.write) {
> > +                    rc = reg->u.w.write(s, reg_entry, (uint16_t *)ptr_val,
> > +                                        (read_val >> ((real_offset & 3) << 3)),
> > +                                        valid_mask);
> > +                }
> > +                break;
> > +            case 4:
> > +                if (reg->u.dw.write) {
> > +                    rc = reg->u.dw.write(s, reg_entry, (uint32_t *)ptr_val,
> > +                                         (read_val >> ((real_offset & 3) << 3)),
> > +                                         valid_mask);
> > +                }
> > +                break;
> > +            }
> > +
> > +            if (rc < 0) {
> > +                hw_error("Internal error: Invalid write emulation "
> > +                         "return value[%d]. I/O emulator exit.\n", rc);
>
> Oh. I hadn't realized this, but you are using hw_error. Which is
> calling 'abort'! Yikes. Is there no way to recover from this? Say return 0xfffff?

In qemu-xen-traditional, it was an exit(1). I do not know the
consequences of a bad write, and I cannot return anything. So I suppose
that the guest would know that something is wrong only on the next read.

Instead of calling abort(), I can just do nothing and return. Or we could
unplug the device from QEMU.

Any preference?

> > +            }
> > +
> > +            /* calculate next address to find */
> > +            emul_len -= reg->size;
> > +            if (emul_len > 0) {
> > +                find_addr = real_offset + reg->size;
> > +            }
> > +        } else {
> > +            /* nothing to do with passthrough type register,
> > +             * continue to find next byte */
> > +            emul_len--;
> > +            find_addr++;
> > +        }
> > +    }
> > +
> > +    /* need to shift back before passing them to libpci */
> > +    val >>= (address & 3) << 3;
> > +
> > +out:
> > +    if (!(reg && reg->no_wb)) {
> > +        /* unknown regs are passed through */
> > +        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
> > +
> > +        if (!rc) {
> > +            PT_LOG("Error: pci_write_block failed. return value[%d].\n", rc);
> > +        }
> > +    }
> > +
> > +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> > +        qemu_mod_timer(s->pm_state->pm_timer,
> > +                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
> > +    }
> > +}
> > +
> > +/* ioport/iomem space*/
> > +static void pt_iomem_map(XenPCIPassthroughState *s, int i,
> > +                         pcibus_t e_phys, pcibus_t e_size, int type)
> > +{
> > +    uint32_t old_ebase = s->bases[i].e_physbase;
> > +    bool first_map = s->bases[i].e_size == 0;
> > +    int ret = 0;
> > +
> > +    s->bases[i].e_physbase = e_phys;
> > +    s->bases[i].e_size = e_size;
> > +
> > +    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%%d"
> > +           " len=%#"PRIx64" index=%d first_map=%d\n",
> > +           e_phys, s->bases[i].access.maddr, /*type,*/
> > +           e_size, i, first_map);
> > +
> > +    if (e_size == 0) {
> > +        return;
> > +    }
> > +
> > +    if (!first_map && old_ebase != -1) {
>
> old_ebase != PCI_BAR_UNMAPPED ?

:(, no. Because old_ebase is a uint32_t and PCI_BAR_UNMAPPED is a
pcibus_t (uint64_t in the Xen case).

I'm not sure that it's a good idea to change the type of old_ebase, as
xc_domain_memory_mapping below takes only uint32_t.

But, if I can replace a -1 by PCI_BAR_UNMAPPED, I will.
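
To make the width problem concrete, here is a small sketch. It assumes
that PCI_BAR_UNMAPPED is an all-ones value of a 64-bit pcibus_t;
BAR_UNMAPPED_64 and both helper functions are hypothetical names used
only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stands in for PCI_BAR_UNMAPPED with a 64-bit pcibus_t. */
#define BAR_UNMAPPED_64 (~(uint64_t)0)

/* Comparing a 32-bit value against the 64-bit sentinel: the 32-bit value
 * is zero-extended, so 0xffffffff never equals 0xffffffffffffffff. */
static bool matches_wide_sentinel(uint32_t base)
{
    return base == BAR_UNMAPPED_64;
}

/* Truncating the sentinel to the width of the field first makes the
 * comparison behave like the existing "!= -1" test. */
static bool matches_narrow_sentinel(uint32_t base)
{
    return base == (uint32_t)BAR_UNMAPPED_64;
}
```

So a plain "old_ebase != PCI_BAR_UNMAPPED" would always be true for a
uint32_t old_ebase; either the field has to be widened or the sentinel
has to be cast down explicitly.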

> > +        /* Remove old mapping */
> > +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> > +                               old_ebase >> XC_PAGE_SHIFT,
> > +                               s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> > +                               (e_size + XC_PAGE_SIZE - 1) >> XC_PAGE_SHIFT,
> > +                               DPCI_REMOVE_MAPPING);
> > +        if (ret != 0) {
> > +            PT_LOG("Error: remove old mapping failed!\n");
> > +            return;
> > +        }
> > +    }
> > +
> > +    /* map only valid guest address */
> > +    if (e_phys != -1) {
>
> PCI_BAR_UNMAPPED
>
> > +        /* Create new mapping */
> > +        ret = xc_domain_memory_mapping(xen_xc, xen_domid,
> > +                                   s->bases[i].e_physbase >> XC_PAGE_SHIFT,
> > +                                   s->bases[i].access.maddr >> XC_PAGE_SHIFT,
> > +                                   (e_size+XC_PAGE_SIZE-1) >> XC_PAGE_SHIFT,
> > +                                   DPCI_ADD_MAPPING);
> > +
> > +        if (ret != 0) {
> > +            PT_LOG("Error: create new mapping failed!\n");
> > +        }
> > +    }
> > +}

[...]

> > +void pt_bar_mapping(XenPCIPassthroughState *s, int io_enable, int mem_enable)
> > +{
> > +    int i;
> > +
> > +    for (i = 0; i < PCI_NUM_REGIONS; i++) {
> > +        pt_bar_mapping_one(s, i, io_enable, mem_enable);
> > +    }
> > +}
> > +
> > +/* register regions */
> > +static int pt_register_regions(XenPCIPassthroughState *s)
> > +{
> > +    int i = 0;
> > +    uint32_t bar_data = 0;
> > +    HostPCIDevice *d = s->real_device;
> > +
> > +    /* Register PIO/MMIO BARs */
> > +    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
> > +        HostPCIIORegion *r = &d->io_regions[i];
> > +
> > +        if (r->base_addr) {
>
> So should you check for PCI_BAR_UNMAPPED or is that not really
> required here as the pci_register_bar would do it?

Actually, this value comes from the real device (the value in
sysfs/resource). So, I think it's just 0 if it's not mapped.

Here, it's probably better to check for the size instead, to know if
there is actually a BAR.

> > +            s->bases[i].e_physbase = r->base_addr;
> > +            s->bases[i].access.u = r->base_addr;
> > +
> > +            /* Register current region */
> > +            if (r->flags & IORESOURCE_IO) {
> > +                memory_region_init_io(&s->bar[i], NULL, NULL,
> > +                                      "xen-pci-pt-bar", r->size);
>
> You can make the "xen-pci-pt-bar" be a #define somewhere and reuse that.
>
> > +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_IO,
> > +                                 &s->bar[i]);
> > +            } else if (r->flags & IORESOURCE_PREFETCH) {
> > +                memory_region_init_io(&s->bar[i], NULL, NULL,
> > +                                      "xen-pci-pt-bar", r->size);
> > +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_MEM_PREFETCH,
> > +                                 &s->bar[i]);
> > +            } else {
> > +                memory_region_init_io(&s->bar[i], NULL, NULL,
> > +                                      "xen-pci-pt-bar", r->size);
> > +                pci_register_bar(&s->dev, i, PCI_BASE_ADDRESS_SPACE_MEMORY,
> > +                                 &s->bar[i]);
> > +            }
> > +
> > +            PT_LOG("IO region registered (size=0x%08"PRIx64
> > +                   " base_addr=0x%08"PRIx64")\n",
> > +                   r->size, r->base_addr);
> > +        }
> > +    }
> > +
> > +    /* Register expansion ROM address */
> > +    if (d->rom.base_addr && d->rom.size) {
> > +        /* Re-set BAR reported by OS, otherwise ROM can't be read. */
> > +        bar_data = host_pci_get_long(d, PCI_ROM_ADDRESS);
> > +        if ((bar_data & PCI_ROM_ADDRESS_MASK) == 0) {
> > +            bar_data |= d->rom.base_addr & PCI_ROM_ADDRESS_MASK;
> > +            host_pci_set_long(d, PCI_ROM_ADDRESS, bar_data);
> > +        }
> > +
> > +        s->bases[PCI_ROM_SLOT].e_physbase = d->rom.base_addr;
> > +        s->bases[PCI_ROM_SLOT].access.maddr = d->rom.base_addr;
> > +
> > +        memory_region_init_rom_device(&s->rom, NULL, NULL, &s->dev.qdev,
> > +                                      "xen-pci-pt-rom", d->rom.size);
> > +        pci_register_bar(&s->dev, PCI_ROM_SLOT, PCI_BASE_ADDRESS_MEM_PREFETCH,
> > +                         &s->rom);
> > +
> > +        PT_LOG("Expansion ROM registered (size=0x%08"PRIx64
> > +               " base_addr=0x%08"PRIx64")\n",
> > +               d->rom.size, d->rom.base_addr);
> > +    }
> > +
> > +    return 0;
> > +}
> > +

[...]

> > +static int pt_initfn(PCIDevice *pcidev)
> > +{
> > +    XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, pcidev);
> > +    int dom, bus;
> > +    unsigned slot, func;
> > +    int rc = 0;
> > +    uint32_t machine_irq;
> > +    int pirq = -1;
> > +
> > +    if (pci_parse_devaddr(s->hostaddr, &dom, &bus, &slot, &func) < 0) {
> > +        fprintf(stderr, "error parse bdf: %s\n", s->hostaddr);
> > +        return -1;
> > +    }
> > +
> > +    /* register real device */
> > +    PT_LOG("Assigning real physical device %02x:%02x.%x to devfn %i ...\n",
> > +           bus, slot, func, s->dev.devfn);
> > +
> > +    s->real_device = host_pci_device_get(bus, slot, func);
> > +    if (!s->real_device) {
> > +        return -1;
> > +    }
> > +
> > +    s->is_virtfn = s->real_device->is_virtfn;
> > +    if (s->is_virtfn) {
> > +        PT_LOG("%04x:%02x:%02x.%x is a SR-IOV Virtual Function\n",
> > +               s->real_device->domain, bus, slot, func);
> > +    }
> > +
> > +    /* Initialize virtualized PCI configuration (Extended 256 Bytes) */
> > +    if (host_pci_get_block(s->real_device, 0, pcidev->config,
> > +                           PCI_CONFIG_SPACE_SIZE) == -1) {
> > +        return -1;
> > +    }
> > +
> > +    /* Handle real device's MMIO/PIO BARs */
> > +    pt_register_regions(s);
> > +
> > +    /* reinitialize each config register to be emulated */
> > +    pt_config_init(s);
> > +
> > +    /* Bind interrupt */
> > +    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
> > +        PT_LOG("no pin interrupt\n");
>
> Perhaps include some details of which device failed?

There are already details about the device at the beginning of the
function. Is that not enough?

> > +        goto out;
> > +    }
> > +
> > +    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
> > +    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
> > +
> > +    if (rc) {
> > +        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
>
> Can you also include the IRQ it tried to map (both machine and pirq).

Yep.

> > +
> > +        /* Disable PCI intx assertion (turn on bit10 of devctl) */
> > +        host_pci_set_word(s->real_device,
> > +                          PCI_COMMAND,
> > +                          pci_get_word(s->dev.config + PCI_COMMAND)
> > +                          | PCI_COMMAND_INTX_DISABLE);
> > +        machine_irq = 0;
> > +        s->machine_irq = 0;
> > +    } else {
> > +        machine_irq = pirq;
> > +        s->machine_irq = pirq;
> > +        mapped_machine_irq[machine_irq]++;
> > +    }
> > +
> > +    /* bind machine_irq to device */
> > +    if (rc < 0 && machine_irq != 0) {
> > +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> > +        uint8_t e_intx = pci_intx(s);
> > +
> > +        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
> > +                                       e_device, e_intx);
> > +        if (rc < 0) {
> > +            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);
>
> A bit details - name of the device, the IRQ,..
>
> > +
> > +            /* Disable PCI intx assertion (turn on bit10 of devctl) */
> > +            host_pci_set_word(s->real_device, PCI_COMMAND,
> > +                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
> > +                              | PCI_COMMAND_INTX_DISABLE);
> > +            mapped_machine_irq[machine_irq]--;
> > +
> > +            if (mapped_machine_irq[machine_irq] == 0) {
> > +                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
> > +                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
> > +                           rc);
>
> And here too. It would be beneficial to have on the error paths lots of
> nice details so that in the field it will be easier to find out what
> went wrong (and match up PIRQ with the GSI).

Yes, I will try to improve the messages.

It's also probably good to always print the errors.


Thanks,

-- 
Anthony PERARD

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
  2011-11-10 21:53     ` Konrad Rzeszutek Wilk
@ 2011-11-11 17:40       ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-11 17:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Thu, 10 Nov 2011, Konrad Rzeszutek Wilk wrote:

> On Fri, Oct 28, 2011 at 04:07:34PM +0100, Anthony PERARD wrote:
> > From: Allen Kay <allen.m.kay@intel.com>
> >
> > Signed-off-by: Allen Kay <allen.m.kay@intel.com>
> > Signed-off-by: Guy Zana <guy@neocleus.com>
> > Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> > ---
> >  Makefile.target                      |    1 +
> >  hw/xen_pci_passthrough.h             |    2 +
> >  hw/xen_pci_passthrough_config_init.c | 2068 ++++++++++++++++++++++++++++++++++
> >  3 files changed, 2071 insertions(+), 0 deletions(-)
> >  create mode 100644 hw/xen_pci_passthrough_config_init.c
> >

[...]

> > +/* A return value of 1 means the capability should NOT be exposed to guest. */
> > +static int pt_hide_dev_cap(const HostPCIDevice *d, uint8_t grp_id)
> > +{
> > +    switch (grp_id) {
> > +    case PCI_CAP_ID_EXP:
> > +        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
> > +         * Controller looks trivial, e.g., the PCI Express Capabilities
> > +         * Register is 0. We should not try to expose it to guest.
>
> Why not?

Because (an old commit):

passthrough: support the assignment of the VF of Intel 82599 10GbE Controller

The datasheet is available at
http://download.intel.com/design/network/datashts/82599_datasheet.pdf

See 'Table 9.7. VF PCIe Configuration Space' of the datasheet, the PCI
Express Capability Structure of the VF of Intel 82599 10GbE Controller looks
trivial, e.g., the PCI Express Capabilities Register is 0, so the Capability
Version is 0 and pt_pcie_size_init() would fail.

We should not try to expose the PCIe cap of the device to guest.

a patch from Dexuan Cui <dexuan.cui@intel.com>

> > +         */
> > +        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
> > +                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
> > +            return 1;
> > +        }
> > +        break;
> > +    }
> > +    return 0;
> > +}
> > +
> > +/*   find emulate register group entry */
> > +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
> > +{
> > +    XenPTRegGroup *entry = NULL;
> > +
> > +    /* find register group entry */
> > +    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
> > +        /* check address */
> > +        if ((entry->base_offset <= address)
> > +            && ((entry->base_offset + entry->size) > address)) {
> > +            return entry;
> > +        }
> > +    }
> > +
> > +    /* group entry not found */
> > +    return NULL;
> > +}
> > +
> > +/* find emulate register entry */
> > +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
> > +{
> > +    XenPTReg *reg_entry = NULL;
> > +    XenPTRegInfo *reg = NULL;
> > +    uint32_t real_offset = 0;
> > +
> > +    /* find register entry */
> > +    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
> > +        reg = reg_entry->reg;
> > +        real_offset = reg_grp->base_offset + reg->offset;
> > +        /* check address */
> > +        if ((real_offset <= address)
> > +            && ((real_offset + reg->size) > address)) {
> > +            return reg_entry;
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +/* parse BAR */
> > +static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
> > +{
> > +    PCIDevice *d = &s->dev;
> > +    XenPTRegion *region = NULL;
> > +    PCIIORegion *r;
> > +    int index = 0;
> > +
> > +    /* check 64bit BAR */
> > +    index = pt_bar_offset_to_index(reg->offset);
> > +    if ((0 < index) && (index < PCI_ROM_SLOT)) {
>
> This is  a bit confusing. Can you make the index be on the same
> side, like
>
> if ((0 < index) && (PCI_ROM_SLOT > index)
>
> or better:
>
> if ((index < 0) && (index < PCI_ROM_SLOT))
>
> um, which looks wrong. Should it be 'index > 0' ?

Every other form is a bit confusing to me. I'd like to write
0 < index < ROM_SLOT, so I know that index is between 0 and ROM_SLOT.
But it's C and not math, so I wrote it the closest way I can.

> > +        int flags = s->real_device->io_regions[index - 1].flags;
>
> Do we want to check the index - 1 to make sure it is not negative?

We have:
  0 < index < ROM_SLOT
so (index - 1) gives us:
  0 <= index - 1 < ROM_SLOT - 1

So (index - 1) can be 0, but not under 0.
;)
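
The same argument can be checked mechanically. This is only a sketch:
ROM_SLOT is assumed to be 6 (the value of PCI_ROM_SLOT in QEMU), and both
function names are made up for the example.

```c
#include <stdbool.h>

/* Assumption: PCI_ROM_SLOT is 6, as in QEMU's PCI header. */
#define ROM_SLOT 6

/* True when a BAR index has a lower neighbour that could hold the low
 * half of a 64-bit BAR, i.e. 0 < index < ROM_SLOT. */
static bool bar_has_lower_neighbour(int index)
{
    return 0 < index && index < ROM_SLOT;
}

/* Count the indexes for which the check passes but (index - 1) would
 * fall outside [0, ROM_SLOT - 1). */
static int count_bad_neighbours(void)
{
    int bad = 0;

    for (int index = -2; index <= ROM_SLOT + 2; index++) {
        if (bar_has_lower_neighbour(index)
            && (index - 1 < 0 || index - 1 >= ROM_SLOT - 1)) {
            bad++;
        }
    }
    return bad;
}
```

count_bad_neighbours() stays at 0, so io_regions[index - 1] is never
read with a negative index once the range check has passed.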


[...]

> > +/********************
> > + * Header Type0
> > + */
> > +
> > +static uint32_t pt_vendor_reg_init(XenPCIPassthroughState *s,
> > +                                   XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    return s->real_device->vendor_id;
> > +}
> > +static uint32_t pt_device_reg_init(XenPCIPassthroughState *s,
> > +                                   XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    return s->real_device->device_id;
> > +}
> > +static uint32_t pt_status_reg_init(XenPCIPassthroughState *s,
> > +                                   XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    XenPTRegGroup *reg_grp_entry = NULL;
> > +    XenPTReg *reg_entry = NULL;
> > +    int reg_field = 0;
> > +
> > +    /* find Header register group */
> > +    reg_grp_entry = pt_find_reg_grp(s, PCI_CAPABILITY_LIST);
> > +    if (reg_grp_entry) {
> > +        /* find Capabilities Pointer register */
> > +        reg_entry = pt_find_reg(reg_grp_entry, PCI_CAPABILITY_LIST);
> > +        if (reg_entry) {
> > +            /* check Capabilities Pointer register */
> > +            if (reg_entry->data) {
> > +                reg_field |= PCI_STATUS_CAP_LIST;
> > +            } else {
> > +                reg_field &= ~PCI_STATUS_CAP_LIST;
> > +            }
> > +        } else {
> > +            hw_error("Internal error: Couldn't find pt_reg_tbl for "
> > +                     "Capabilities Pointer register. I/O emulator exit.\n");
>
> Yikes. abort here? Um, can we just return a fault code instead?

This should probably not happen, I suppose.

> > +        }
> > +    } else {
> > +        hw_error("Internal error: Couldn't find pt_reg_grp_tbl for Header. "
> > +                 "I/O emulator exit.\n");
> > +    }
> > +
> > +    return reg_field;
> > +}

[...]

> > +static uint32_t pt_bar_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> > +                                uint32_t real_offset)
> > +{
> > +    int reg_field = 0;
> > +    int index;
> > +
> > +    /* get BAR index */
> > +    index = pt_bar_offset_to_index(reg->offset);
> > +    if (index < 0) {
> > +        hw_error("Internal error: Invalid BAR index[%d]. "
> > +                 "I/O emulator exit.\n", index);
> > +    }
> > +
> > +    /* set initial guest physical base address to -1 */
> > +    s->bases[index].e_physbase = -1;
>
> Um, use that define PCI_.. something macro.

:(, e_physbase is uint32, and PCI_BAR_UNMAPPED is uint64.

> > +
> > +    /* set BAR flag */
> > +    s->bases[index].bar_flag = pt_bar_reg_parse(s, reg);
> > +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED) {
> > +        reg_field = PT_INVALID_REG;
> > +    }
> > +
> > +    return reg_field;
> > +}

[...]

> > +
> > +
> > +/*****************************
> > + * PCI Express Capability
> > + */
> > +
> > +/* initialize Link Control register */
> > +static uint32_t pt_linkctrl_reg_init(XenPCIPassthroughState *s,
> > +                                     XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    uint8_t cap_ver = 0;
> > +    uint8_t dev_type = 0;
> > +
> > +    /* TODO maybe better to use fonction from hw/pcie.c */
>
> function
> > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                           + PCI_EXP_FLAGS)
> > +        & PCI_EXP_FLAGS_VERS;
> > +    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                             + PCI_EXP_FLAGS)
> > +                & PCI_EXP_FLAGS_TYPE) >> 4;
> > +
> > +    /* no need to initialize in case of Root Complex Integrated Endpoint
> > +     * with cap_ver 1.x
>
> Why?

Who knows? I don't. And `git log` does not give me more information.

> > +     */
> > +    if ((dev_type == PCI_EXP_TYPE_RC_END) && (cap_ver == 1)) {
> > +        return PT_INVALID_REG;
> > +    }
> > +
> > +    return reg->init_val;
> > +}
> > +/* initialize Device Control 2 register */
> > +static uint32_t pt_devctrl2_reg_init(XenPCIPassthroughState *s,
> > +                                     XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    uint8_t cap_ver = 0;
> > +
> > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                           + PCI_EXP_FLAGS)
> > +        & PCI_EXP_FLAGS_VERS;
> > +
> > +    /* no need to initialize in case of cap_ver 1.x */
> > +    if (cap_ver == 1) {
> > +        return PT_INVALID_REG;
> > +    }
> > +
> > +    return reg->init_val;
> > +}
> > +/* initialize Link Control 2 register */
> > +static uint32_t pt_linkctrl2_reg_init(XenPCIPassthroughState *s,
> > +                                      XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    int reg_field = 0;
> > +    uint8_t cap_ver = 0;
> > +
> > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                           + PCI_EXP_FLAGS)
> > +        & PCI_EXP_FLAGS_VERS;
>
> This looks like a weird tab issue, but it might be just my mailer.

Nope, there is no tab.

Maybe writing it like this:
> cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
>                        + PCI_EXP_FLAGS);
> cap_ver &= PCI_EXP_FLAGS_VERS;
would be better.
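
Pulled out into a standalone helper, the two-step form could look like
the sketch below. pcie_cap_version() is a hypothetical name, and a plain
array read stands in for pci_get_byte(); the two constants use the values
from Linux's pci_regs.h.

```c
#include <stdint.h>

/* Values as defined in Linux's pci_regs.h. */
#define PCI_EXP_FLAGS      2       /* Capabilities register offset */
#define PCI_EXP_FLAGS_VERS 0x000f  /* Capability version mask */

/* cap_base is the offset of the PCI Express capability in config space;
 * in the patch it is computed as real_offset - reg->offset. */
static uint8_t pcie_cap_version(const uint8_t *config, uint32_t cap_base)
{
    /* Step 1: read the low byte of the Capabilities register. */
    uint8_t cap_ver = config[cap_base + PCI_EXP_FLAGS];

    /* Step 2: keep only the version field. */
    cap_ver &= PCI_EXP_FLAGS_VERS;
    return cap_ver;
}
```

A helper like this would also avoid repeating the same expression in the
linkctrl, devctrl2 and linkctrl2 init functions.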

> > +
> > +    /* no need to initialize in case of cap_ver 1.x */
> > +    if (cap_ver == 1) {
> > +        return PT_INVALID_REG;
> > +    }
> > +
> > +    /* set Supported Link Speed */
> > +    reg_field |= PCI_EXP_LNKCAP_SLS &
> > +        pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                     + PCI_EXP_LNKCAP);
> > +
> > +    return reg_field;
> > +}
> > +

[...]

> > +/*********************************
> > + * Power Management Capability
> > + */
> > +
> > +/* initialize Power Management Capabilities register */
> > +static uint32_t pt_pmc_reg_init(XenPCIPassthroughState *s,
> > +                                XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    PCIDevice *d = &s->dev;
> > +
> > +    if (!s->power_mgmt) {
> > +        return reg->init_val;
> > +    }
> > +
> > +    /* set Power Management Capabilities register */
> > +    s->pm_state->pmc_field = pci_get_word(d->config + real_offset);
> > +
> > +    return reg->init_val;
> > +}
> > +/* initialize PCI Power Management Control/Status register */
> > +static uint32_t pt_pmcsr_reg_init(XenPCIPassthroughState *s,
> > +                                  XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    PCIDevice *d = &s->dev;
> > +    uint16_t cap_ver  = 0;
> > +
> > +    if (!s->power_mgmt) {
> > +        return reg->init_val;
> > +    }
> > +
> > +    /* check PCI Power Management support version */
> > +    cap_ver = s->pm_state->pmc_field & PCI_PM_CAP_VER_MASK;
> > +
> > +    if (cap_ver > 2) {
> > +        /* set No Soft Reset */
> > +        s->pm_state->no_soft_reset =
> > +            pci_get_byte(d->config + real_offset) & PCI_PM_CTRL_NO_SOFT_RESET;
> > +    }
> > +
> > +    /* wake up real physical device */
> > +    switch (host_pci_get_word(s->real_device, real_offset)
> > +            & PCI_PM_CTRL_STATE_MASK) {
> > +    case 0:
> > +        break;
> > +    case 1:
> > +        PT_LOG("Power state transition D1 -> D0active\n");
> > +        host_pci_set_word(s->real_device, real_offset, 0);
> > +        break;
> > +    case 2:
> > +        PT_LOG("Power state transition D2 -> D0active\n");
> > +        host_pci_set_word(s->real_device, real_offset, 0);
> > +        usleep(200);
>
> Heheh..

I don't know if I can remove it safely, or not.

> > +        break;
> > +    case 3:
> > +        PT_LOG("Power state transition D3hot -> D0active\n");
> > +        host_pci_set_word(s->real_device, real_offset, 0);
> > +        usleep(10 * 1000);

Same for this one.

> > +        pt_init_pci_config(s);
> > +        break;
> > +    }
> > +
> > +    return reg->init_val;
> > +}

-- 
Anthony PERARD

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Xen-devel] [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
@ 2011-11-11 17:40       ` Anthony PERARD
  0 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-11 17:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

On Thu, 10 Nov 2011, Konrad Rzeszutek Wilk wrote:

> On Fri, Oct 28, 2011 at 04:07:34PM +0100, Anthony PERARD wrote:
> > From: Allen Kay <allen.m.kay@intel.com>
> >
> > Signed-off-by: Allen Kay <allen.m.kay@intel.com>
> > Signed-off-by: Guy Zana <guy@neocleus.com>
> > Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> > ---
> >  Makefile.target                      |    1 +
> >  hw/xen_pci_passthrough.h             |    2 +
> >  hw/xen_pci_passthrough_config_init.c | 2068 ++++++++++++++++++++++++++++++++++
> >  3 files changed, 2071 insertions(+), 0 deletions(-)
> >  create mode 100644 hw/xen_pci_passthrough_config_init.c
> >

[...]

> > +/* A return value of 1 means the capability should NOT be exposed to guest. */
> > +static int pt_hide_dev_cap(const HostPCIDevice *d, uint8_t grp_id)
> > +{
> > +    switch (grp_id) {
> > +    case PCI_CAP_ID_EXP:
> > +        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
> > +         * Controller looks trivial, e.g., the PCI Express Capabilities
> > +         * Register is 0. We should not try to expose it to guest.
>
> Why not?

Because (an old commit):

passthrough: support the assignment of the VF of Intel 82599 10GbE Controller

The datasheet is available at
http://download.intel.com/design/network/datashts/82599_datasheet.pdf

See 'Table 9.7. VF PCIe Configuration Space' of the datasheet, the PCI
Express Capability Structure of the VF of Intel 82599 10GbE Controller looks
trivial, e.g., the PCI Express Capabilities Register is 0, so the Capability
Version is 0 and pt_pcie_size_init() would fail.

We should not try to expose the PCIe cap of the device to guest.

a patch from Dexuan Cui <dexuan.cui@intel.com>

> > +         */
> > +        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
> > +                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
> > +            return 1;
> > +        }
> > +        break;
> > +    }
> > +    return 0;
> > +}
> > +
> > +/*   find emulate register group entry */
> > +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
> > +{
> > +    XenPTRegGroup *entry = NULL;
> > +
> > +    /* find register group entry */
> > +    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
> > +        /* check address */
> > +        if ((entry->base_offset <= address)
> > +            && ((entry->base_offset + entry->size) > address)) {
> > +            return entry;
> > +        }
> > +    }
> > +
> > +    /* group entry not found */
> > +    return NULL;
> > +}
> > +
> > +/* find emulate register entry */
> > +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
> > +{
> > +    XenPTReg *reg_entry = NULL;
> > +    XenPTRegInfo *reg = NULL;
> > +    uint32_t real_offset = 0;
> > +
> > +    /* find register entry */
> > +    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
> > +        reg = reg_entry->reg;
> > +        real_offset = reg_grp->base_offset + reg->offset;
> > +        /* check address */
> > +        if ((real_offset <= address)
> > +            && ((real_offset + reg->size) > address)) {
> > +            return reg_entry;
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +/* parse BAR */
> > +static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
> > +{
> > +    PCIDevice *d = &s->dev;
> > +    XenPTRegion *region = NULL;
> > +    PCIIORegion *r;
> > +    int index = 0;
> > +
> > +    /* check 64bit BAR */
> > +    index = pt_bar_offset_to_index(reg->offset);
> > +    if ((0 < index) && (index < PCI_ROM_SLOT)) {
>
> This is  a bit confusing. Can you make the index be on the same
> side, like
>
> if ((0 < index) && (PCI_ROM_SLOT > index)
>
> or better:
>
> if ((index < 0) && (index < PCI_ROM_SLOT))
>
> um, which looks wrong. Should it be 'index > 0' ?

Every other form is a bit confusing to me. I'd like to write
0 < index < ROM_SLOT, so I know that index is between 0 and ROM_SLOT.
But it's C and not math, so I wrote it the closest way I can.

> > +        int flags = s->real_device->io_regions[index - 1].flags;
>
> Do we want to check the index - 1 to make sure it is not negative?

We have:
  0 < index < ROM_SLOT
so (index - 1) gives us:
  0 <= index - 1 < ROM_SLOT - 1

So (index - 1) can be 0, but not under 0.
;)


[...]

> > +/********************
> > + * Header Type0
> > + */
> > +
> > +static uint32_t pt_vendor_reg_init(XenPCIPassthroughState *s,
> > +                                   XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    return s->real_device->vendor_id;
> > +}
> > +static uint32_t pt_device_reg_init(XenPCIPassthroughState *s,
> > +                                   XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    return s->real_device->device_id;
> > +}
> > +static uint32_t pt_status_reg_init(XenPCIPassthroughState *s,
> > +                                   XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    XenPTRegGroup *reg_grp_entry = NULL;
> > +    XenPTReg *reg_entry = NULL;
> > +    int reg_field = 0;
> > +
> > +    /* find Header register group */
> > +    reg_grp_entry = pt_find_reg_grp(s, PCI_CAPABILITY_LIST);
> > +    if (reg_grp_entry) {
> > +        /* find Capabilities Pointer register */
> > +        reg_entry = pt_find_reg(reg_grp_entry, PCI_CAPABILITY_LIST);
> > +        if (reg_entry) {
> > +            /* check Capabilities Pointer register */
> > +            if (reg_entry->data) {
> > +                reg_field |= PCI_STATUS_CAP_LIST;
> > +            } else {
> > +                reg_field &= ~PCI_STATUS_CAP_LIST;
> > +            }
> > +        } else {
> > +            hw_error("Internal error: Couldn't find pt_reg_tbl for "
> > +                     "Capabilities Pointer register. I/O emulator exit.\n");
>
> Yikes. abort here? Um, can we just return a fault code instead?

This should probably not happen, I suppose.

> > +        }
> > +    } else {
> > +        hw_error("Internal error: Couldn't find pt_reg_grp_tbl for Header. "
> > +                 "I/O emulator exit.\n");
> > +    }
> > +
> > +    return reg_field;
> > +}

[...]

> > +static uint32_t pt_bar_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
> > +                                uint32_t real_offset)
> > +{
> > +    int reg_field = 0;
> > +    int index;
> > +
> > +    /* get BAR index */
> > +    index = pt_bar_offset_to_index(reg->offset);
> > +    if (index < 0) {
> > +        hw_error("Internal error: Invalid BAR index[%d]. "
> > +                 "I/O emulator exit.\n", index);
> > +    }
> > +
> > +    /* set initial guest physical base address to -1 */
> > +    s->bases[index].e_physbase = -1;
>
> Um, use that define PCI_.. something macro.

:(, e_physbase is uint32, and PCI_BAR_UNMAPPED is uint64.
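
The width mismatch can be illustrated in isolation. A sketch only, assuming PCI_BAR_UNMAPPED is an all-ones value of a 64-bit pcibus_t (which is how QEMU appears to define it); the type and macro names below are stand-ins, and BAR_UNMAPPED_32 is a hypothetical 32-bit sentinel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for a 64-bit pcibus_t and PCI_BAR_UNMAPPED (assumed definition). */
typedef uint64_t bus_addr_t;
#define BAR_UNMAPPED_64 (~(bus_addr_t)0)

/* Hypothetical 32-bit sentinel: the value that "e_physbase = -1" stores. */
#define BAR_UNMAPPED_32 UINT32_MAX

/* Naive check: the 32-bit value is zero-extended to 64 bits before the
 * comparison, so it can never equal the all-ones 64-bit sentinel. */
static bool is_unmapped_naive(uint32_t e_physbase)
{
    return e_physbase == BAR_UNMAPPED_64;
}

/* Width-matched check: compare against a sentinel of the same type. */
static bool is_unmapped(uint32_t e_physbase)
{
    return e_physbase == BAR_UNMAPPED_32;
}
```

This is why a plain -1 works for a uint32_t field while the 64-bit macro does not: the sentinel has to have the same width as the field it is compared with.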

> > +
> > +    /* set BAR flag */
> > +    s->bases[index].bar_flag = pt_bar_reg_parse(s, reg);
> > +    if (s->bases[index].bar_flag == PT_BAR_FLAG_UNUSED) {
> > +        reg_field = PT_INVALID_REG;
> > +    }
> > +
> > +    return reg_field;
> > +}

[...]

> > +
> > +
> > +/*****************************
> > + * PCI Express Capability
> > + */
> > +
> > +/* initialize Link Control register */
> > +static uint32_t pt_linkctrl_reg_init(XenPCIPassthroughState *s,
> > +                                     XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    uint8_t cap_ver = 0;
> > +    uint8_t dev_type = 0;
> > +
> > +    /* TODO maybe better to use fonction from hw/pcie.c */
>
> function
> > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                           + PCI_EXP_FLAGS)
> > +        & PCI_EXP_FLAGS_VERS;
> > +    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                             + PCI_EXP_FLAGS)
> > +                & PCI_EXP_FLAGS_TYPE) >> 4;
> > +
> > +    /* no need to initialize in case of Root Complex Integrated Endpoint
> > +     * with cap_ver 1.x
>
> Why?

Who knows? I don't. And `git log` does not give me more information.

> > +     */
> > +    if ((dev_type == PCI_EXP_TYPE_RC_END) && (cap_ver == 1)) {
> > +        return PT_INVALID_REG;
> > +    }
> > +
> > +    return reg->init_val;
> > +}
> > +/* initialize Device Control 2 register */
> > +static uint32_t pt_devctrl2_reg_init(XenPCIPassthroughState *s,
> > +                                     XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    uint8_t cap_ver = 0;
> > +
> > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                           + PCI_EXP_FLAGS)
> > +        & PCI_EXP_FLAGS_VERS;
> > +
> > +    /* no need to initialize in case of cap_ver 1.x */
> > +    if (cap_ver == 1) {
> > +        return PT_INVALID_REG;
> > +    }
> > +
> > +    return reg->init_val;
> > +}
> > +/* initialize Link Control 2 register */
> > +static uint32_t pt_linkctrl2_reg_init(XenPCIPassthroughState *s,
> > +                                      XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    int reg_field = 0;
> > +    uint8_t cap_ver = 0;
> > +
> > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                           + PCI_EXP_FLAGS)
> > +        & PCI_EXP_FLAGS_VERS;
>
> This looks like a weird tab issue, but it might be just my mailer.

Nope, there is no tab.

Maybe writing it like this:
> cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
>                        + PCI_EXP_FLAGS);
> cap_ver &= PCI_EXP_FLAGS_VERS;
would be better.

> > +
> > +    /* no need to initialize in case of cap_ver 1.x */
> > +    if (cap_ver == 1) {
> > +        return PT_INVALID_REG;
> > +    }
> > +
> > +    /* set Supported Link Speed */
> > +    reg_field |= PCI_EXP_LNKCAP_SLS &
> > +        pci_get_byte(s->dev.config + real_offset - reg->offset
> > +                     + PCI_EXP_LNKCAP);
> > +
> > +    return reg_field;
> > +}
> > +

[...]

> > +/*********************************
> > + * Power Management Capability
> > + */
> > +
> > +/* initialize Power Management Capabilities register */
> > +static uint32_t pt_pmc_reg_init(XenPCIPassthroughState *s,
> > +                                XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    PCIDevice *d = &s->dev;
> > +
> > +    if (!s->power_mgmt) {
> > +        return reg->init_val;
> > +    }
> > +
> > +    /* set Power Management Capabilities register */
> > +    s->pm_state->pmc_field = pci_get_word(d->config + real_offset);
> > +
> > +    return reg->init_val;
> > +}
> > +/* initialize PCI Power Management Control/Status register */
> > +static uint32_t pt_pmcsr_reg_init(XenPCIPassthroughState *s,
> > +                                  XenPTRegInfo *reg, uint32_t real_offset)
> > +{
> > +    PCIDevice *d = &s->dev;
> > +    uint16_t cap_ver  = 0;
> > +
> > +    if (!s->power_mgmt) {
> > +        return reg->init_val;
> > +    }
> > +
> > +    /* check PCI Power Management support version */
> > +    cap_ver = s->pm_state->pmc_field & PCI_PM_CAP_VER_MASK;
> > +
> > +    if (cap_ver > 2) {
> > +        /* set No Soft Reset */
> > +        s->pm_state->no_soft_reset =
> > +            pci_get_byte(d->config + real_offset) & PCI_PM_CTRL_NO_SOFT_RESET;
> > +    }
> > +
> > +    /* wake up real physical device */
> > +    switch (host_pci_get_word(s->real_device, real_offset)
> > +            & PCI_PM_CTRL_STATE_MASK) {
> > +    case 0:
> > +        break;
> > +    case 1:
> > +        PT_LOG("Power state transition D1 -> D0active\n");
> > +        host_pci_set_word(s->real_device, real_offset, 0);
> > +        break;
> > +    case 2:
> > +        PT_LOG("Power state transition D2 -> D0active\n");
> > +        host_pci_set_word(s->real_device, real_offset, 0);
> > +        usleep(200);
>
> Heheh..

I don't know if I can remove it safely, or not.

> > +        break;
> > +    case 3:
> > +        PT_LOG("Power state transition D3hot -> D0active\n");
> > +        host_pci_set_word(s->real_device, real_offset, 0);
> > +        usleep(10 * 1000);

Same for this one.

> > +        pt_init_pci_config(s);
> > +        break;
> > +    }
> > +
> > +    return reg->init_val;
> > +}

-- 
Anthony PERARD

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
  2011-11-11 16:27       ` Anthony PERARD
@ 2011-11-11 18:05         ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-11 18:05 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

> > > +                hw_error("Internal error: Invalid write emulation "
> > > +                         "return value[%d]. I/O emulator exit.\n", rc);
> >
> > Oh. I hadn't realized this, but you are using hw_error. Which is
> > calling 'abort'! Yikes. Is there no way to recover from this? Say return 0xfffff?
> 
> In qemu-xen-traditional, it was an exit(1). I do not know the
> consequence of a bad write, and I cannot return anything. So I suppose
> that the guest would know that something is wrong only on the next read.
> 
> Instead of abort();, I can just do nothing and return. Or we could unplug
> the device from QEMU.
> 
> Any preference?

I think this calls for an experiment. If Linux still functions when you completely
unplug the device, then I would say unplug it (because in all likelihood the reason
you can't write is that the host has unplugged the device).

> 
> > > +            }
> > > +
> > > +            /* calculate next address to find */
> > > +            emul_len -= reg->size;
> > > +            if (emul_len > 0) {
> > > +                find_addr = real_offset + reg->size;
> > > +            }
> > > +        } else {
> > > +            /* nothing to do with passthrough type register,
> > > +             * continue to find next byte */
> > > +            emul_len--;
> > > +            find_addr++;
> > > +        }
> > > +    }
> > > +
> > > +    /* need to shift back before passing them to libpci */
> > > +    val >>= (address & 3) << 3;
> > > +
> > > +out:
> > > +    if (!(reg && reg->no_wb)) {
> > > +        /* unknown regs are passed through */
> > > +        rc = host_pci_set_block(s->real_device, address, (uint8_t *)&val, len);
> > > +
> > > +        if (!rc) {
> > > +            PT_LOG("Error: pci_write_block failed. return value[%d].\n", rc);
> > > +        }
> > > +    }
> > > +
> > > +    if (s->pm_state != NULL && s->pm_state->flags & PT_FLAG_TRANSITING) {
> > > +        qemu_mod_timer(s->pm_state->pm_timer,
> > > +                       qemu_get_clock_ms(rt_clock) + s->pm_state->pm_delay);
> > > +    }
> > > +}
> > > +
> > > +/* ioport/iomem space*/
> > > +static void pt_iomem_map(XenPCIPassthroughState *s, int i,
> > > +                         pcibus_t e_phys, pcibus_t e_size, int type)
> > > +{
> > > +    uint32_t old_ebase = s->bases[i].e_physbase;
> > > +    bool first_map = s->bases[i].e_size == 0;
> > > +    int ret = 0;
> > > +
> > > +    s->bases[i].e_physbase = e_phys;
> > > +    s->bases[i].e_size = e_size;
> > > +
> > > +    PT_LOG("e_phys=%#"PRIx64" maddr=%#"PRIx64" type=%%d"
> > > +           " len=%#"PRIx64" index=%d first_map=%d\n",
> > > +           e_phys, s->bases[i].access.maddr, /*type,*/
> > > +           e_size, i, first_map);
> > > +
> > > +    if (e_size == 0) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (!first_map && old_ebase != -1) {
> >
> > old_ebase != PCI_BAR_UNMAPPED ?
> 
> :(, no. Because old_ebase is a uint32_t and PCI_BAR_UNMAPPED is
> pcibus_t (uint64_t in Xen case).

I somehow thought it was defined as -1.. but 
> 
> I'm not sure that it's a good idea to change the type of old_ebase, as
> xc_domain_memory_mapping below takes only uint32_t.
> 
> But, if I can replace a -1 by PCI_BAR_UNMAPPED, I will.

.. or something close to it. _PCI_BAR_UNMAPPED?
.. snip..

> > > +    /* Register PIO/MMIO BARs */
> > > +    for (i = 0; i < PCI_BAR_ENTRIES; i++) {
> > > +        HostPCIIORegion *r = &d->io_regions[i];
> > > +
> > > +        if (r->base_addr) {
> >
> > So should you check for PCI_BAR_UNMAPPED or is that not really
> > required here as the pci_register_bar would do it?
> 
> Actually, this value comes from the real device (the value in
> sysfs/resource). So, I think it's just 0 if it's not mapped.

Ah! Right.
> 
> Here, it's probably better to check for the size instead, to know if
> there is actually a BAR.

<nods>
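
The size-based check agreed on above might look like the following. This is a sketch under assumptions: HostRegionSketch is a simplified stand-in for the patch's HostPCIIORegion (only base_addr, size and flags, as listed in the cover letter), and region_is_present is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for HostPCIIORegion; field types are assumed. */
typedef struct {
    uint64_t base_addr;
    uint64_t size;
    uint64_t flags;
} HostRegionSketch;

/* A BAR is considered present when the host reports a non-zero size,
 * regardless of whether it currently has a base address assigned. */
static bool region_is_present(const HostRegionSketch *r)
{
    return r->size != 0;
}
```

Testing the size rather than the address keeps a BAR that exists but is not yet assigned an address from being skipped.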
> 
> > > +            s->bases[i].e_physbase = r->base_addr;
> > > +            s->bases[i].access.u = r->base_addr;
> > > +
> > > +            /* Register current region */
> > > +            if (r->flags & IORESOURCE_IO) {
> > > +                memory_region_init_io(&s->bar[i], NULL, NULL,
> > > +                                      "xen-pci-pt-bar", r->size);
> >
> > You can make the "xen-pci-pt-bar" be a #define somewhere and reuse that.

.. snip ..
> > > +    if (!s->dev.config[PCI_INTERRUPT_PIN]) {
> > > +        PT_LOG("no pin interrupt\n");
> >
> > Perhaps include some details of which device failed?
> 
> There are already details about the device at the beginning of the
> function. Is that not enough?

I was thinking of parallel operations. There could be multiple PCI requests
in flight, and you might not know which device's pin is wrong.

> 
> > > +        goto out;
> > > +    }
> > > +
> > > +    machine_irq = host_pci_get_byte(s->real_device, PCI_INTERRUPT_LINE);
> > > +    rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
> > > +
> > > +    if (rc) {
> > > +        PT_LOG("Error: Mapping irq failed, rc = %d\n", rc);
> >
> > Can you also include the IRQ it tried to map (both machine and pirq).
> 
> Yep.
> 
> > > +
> > > +        /* Disable PCI intx assertion (turn on bit10 of devctl) */
> > > +        host_pci_set_word(s->real_device,
> > > +                          PCI_COMMAND,
> > > +                          pci_get_word(s->dev.config + PCI_COMMAND)
> > > +                          | PCI_COMMAND_INTX_DISABLE);
> > > +        machine_irq = 0;
> > > +        s->machine_irq = 0;
> > > +    } else {
> > > +        machine_irq = pirq;
> > > +        s->machine_irq = pirq;
> > > +        mapped_machine_irq[machine_irq]++;
> > > +    }
> > > +
> > > +    /* bind machine_irq to device */
> > > +    if (rc < 0 && machine_irq != 0) {
> > > +        uint8_t e_device = PCI_SLOT(s->dev.devfn);
> > > +        uint8_t e_intx = pci_intx(s);
> > > +
> > > +        rc = xc_domain_bind_pt_pci_irq(xen_xc, xen_domid, machine_irq, 0,
> > > +                                       e_device, e_intx);
> > > +        if (rc < 0) {
> > > +            PT_LOG("Error: Binding of interrupt failed! rc=%d\n", rc);
> >
> > A few more details: name of the device, the IRQ, ...
> >
> > > +
> > > +            /* Disable PCI intx assertion (turn on bit10 of devctl) */
> > > +            host_pci_set_word(s->real_device, PCI_COMMAND,
> > > +                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
> > > +                              | PCI_COMMAND_INTX_DISABLE);
> > > +            mapped_machine_irq[machine_irq]--;
> > > +
> > > +            if (mapped_machine_irq[machine_irq] == 0) {
> > > +                if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
> > > +                    PT_LOG("Error: Unmapping of interrupt failed! rc=%d\n",
> > > +                           rc);
> >
> > And here too. It would be beneficial to have on the error paths lots of
> > nice details so that in the field it will be easier to find out what
> > went wrong (and match up PIRQ with the GSI).
> 
> Yes, I will try to improve the messages.
> 
> It's also probably good to always print the errors.

<nods> Thanks.
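
One possible shape for the more detailed error messages discussed above. This is a sketch, not the patch's code: pt_format_irq_error is a hypothetical helper, and the caller is assumed to know the host device's domain, bus, slot and function.

```c
#include <stdio.h>

/* Hypothetical helper: formats an interrupt error with the host device
 * address and both IRQ numbers so PIRQ and GSI can be matched up later.
 * Returns what snprintf returns (the length the full message needs). */
static int pt_format_irq_error(char *buf, size_t len, const char *what,
                               unsigned domain, unsigned bus,
                               unsigned slot, unsigned func,
                               int machine_irq, int pirq, int rc)
{
    return snprintf(buf, len,
                    "Error: %s failed for %04x:%02x:%02x.%x "
                    "(machine_irq=%d, pirq=%d, rc=%d)",
                    what, domain, bus, slot, func, machine_irq, pirq, rc);
}
```

Putting the device address in every message also covers the case of several passthrough devices being set up in parallel.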

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
  2011-11-11 17:40       ` Anthony PERARD
@ 2011-11-11 18:11         ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-11 18:11 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

> > > +    case PCI_CAP_ID_EXP:
> > > +        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
> > > +         * Controller looks trivial, e.g., the PCI Express Capabilities
> > > +         * Register is 0. We should not try to expose it to guest.
> >
> > Why not?
> 
> Because (an old commit):
> 
> passthrough: support the assignment of the VF of Intel 82599 10GbE Controller
> 
> The datasheet is available at
> http://download.intel.com/design/network/datashts/82599_datasheet.pdf
> 
> See 'Table 9.7. VF PCIe Configuration Space' of the datasheet, the PCI
> Express Capability Structure of the VF of Intel 82599 10GbE Controller looks
> trivial, e.g., the PCI Express Capabilities Register is 0, so the Capability
> Version is 0 and pt_pcie_size_init() would fail.
> 
> We should not try to expose the PCIe cap of the device to guest.
> 
> a patch from Dexuan Cui <dexuan.cui@intel.com>

Let's include that in the description here.
> 
> > > +         */
> > > +        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
> > > +                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
> > > +            return 1;
> > > +        }
> > > +        break;
> > > +    }
> > > +    return 0;
> > > +}
> > > +
> > > +/*   find emulate register group entry */
> > > +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
> > > +{
> > > +    XenPTRegGroup *entry = NULL;
> > > +
> > > +    /* find register group entry */
> > > +    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
> > > +        /* check address */
> > > +        if ((entry->base_offset <= address)
> > > +            && ((entry->base_offset + entry->size) > address)) {
> > > +            return entry;
> > > +        }
> > > +    }
> > > +
> > > +    /* group entry not found */
> > > +    return NULL;
> > > +}
> > > +
> > > +/* find emulate register entry */
> > > +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
> > > +{
> > > +    XenPTReg *reg_entry = NULL;
> > > +    XenPTRegInfo *reg = NULL;
> > > +    uint32_t real_offset = 0;
> > > +
> > > +    /* find register entry */
> > > +    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
> > > +        reg = reg_entry->reg;
> > > +        real_offset = reg_grp->base_offset + reg->offset;
> > > +        /* check address */
> > > +        if ((real_offset <= address)
> > > +            && ((real_offset + reg->size) > address)) {
> > > +            return reg_entry;
> > > +        }
> > > +    }
> > > +
> > > +    return NULL;
> > > +}
> > > +
> > > +/* parse BAR */
> > > +static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
> > > +{
> > > +    PCIDevice *d = &s->dev;
> > > +    XenPTRegion *region = NULL;
> > > +    PCIIORegion *r;
> > > +    int index = 0;
> > > +
> > > +    /* check 64bit BAR */
> > > +    index = pt_bar_offset_to_index(reg->offset);
> > > +    if ((0 < index) && (index < PCI_ROM_SLOT)) {
> >
> > This is  a bit confusing. Can you make the index be on the same
> > side, like
> >
> > if ((0 < index) && (PCI_ROM_SLOT > index)
> >
> > or better:
> >
> > if ((index < 0) && (index < PCI_ROM_SLOT))
> >
> > um, which looks wrong. Should it be 'index > 0' ?
> 
> Every other form is a bit confusing to me. I'd like to write
> 0 < index < ROM_SLOT, so I know that index is between 0 and ROM_SLOT.
> But, it's C and not math, so I wrote the closest way I can.
> 
> > > +        int flags = s->real_device->io_regions[index - 1].flags;
> >
> > Do we want to check the index - 1 to make sure it is not negative?
> 
> We have:
>   0 < index < ROM_SLOT
> so (index - 1) gives us:
>   0 <= index - 1 < ROM_SLOT - 1
> 
> So (index - 1) can be 0, but never below 0.
> ;)

Right! Ok, then please ignore my comment.

.. snip..
> > > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > > +                           + PCI_EXP_FLAGS)
> > > +        & PCI_EXP_FLAGS_VERS;
> > > +    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
> > > +                             + PCI_EXP_FLAGS)
> > > +                & PCI_EXP_FLAGS_TYPE) >> 4;
> > > +
> > > +    /* no need to initialize in case of Root Complex Integrated Endpoint
> > > +     * with cap_ver 1.x
> >
> > Why?
> 
> Who knows? I don't. And `git log` does not give me more information.

<laughs> OK, could the earlier author provide some ideas? Or perhaps
there is something akin in the Linux code.

.. snip..
> > > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > > +                           + PCI_EXP_FLAGS)
> > > +        & PCI_EXP_FLAGS_VERS;
> >
> > This looks like a weird tab issue, but it might be just my mailer.
> 
> Nop, there is no tab.
> 
> Maybe writing it like that:
> > cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> >                        + PCI_EXP_FLAGS);
> > cap_ver &= PCI_EXP_FLAGS_VERS;
> would be better.

OK.
> 
> > > +
> > > +    /* no need to initialize in case of cap_ver 1.x */
> > > +    if (cap_ver == 1) {
> > > +        return PT_INVALID_REG;

.. snip..
> > > +    case 2:
> > > +        PT_LOG("Power state transition D2 -> D0active\n");
> > > +        host_pci_set_word(s->real_device, real_offset, 0);
> > > +        usleep(200);
> >
> > Heheh..
> 
> I don't know if I can remove it safely, or not.

Probably not. One of my machines regularly gets confused when the SSD disk
returns VPD information way too fast and it ends up using the name of a
previous disk. So the usleep is probably very much required (and in
all likelihood defined in the PCI spec).
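
The two delays in the quoted code (200 microseconds after D2, 10 milliseconds after D3hot) appear to match the transition delays the PCI power management specification mandates, which the Linux kernel also encodes. A sketch of naming them instead of using raw numbers; all constant and function names here are hypothetical.

```c
#include <stdint.h>

/* Hypothetical names; values taken from the quoted patch and believed to
 * match the PCI PM spec's mandatory transition delays. */
#define PT_PM_D2_TO_D0_DELAY_US     200u
#define PT_PM_D3HOT_TO_D0_DELAY_US  (10u * 1000u)

/* PowerState field of the PMCSR register (bits 1:0). */
#define PT_PM_CTRL_STATE_MASK 0x0003u

/* Returns how long to wait, in microseconds, after writing D0 to PMCSR,
 * given the PMCSR value read before the write. */
static unsigned pt_pm_wakeup_delay_us(uint16_t pmcsr)
{
    switch (pmcsr & PT_PM_CTRL_STATE_MASK) {
    case 2:  /* D2 -> D0 */
        return PT_PM_D2_TO_D0_DELAY_US;
    case 3:  /* D3hot -> D0 */
        return PT_PM_D3HOT_TO_D0_DELAY_US;
    default: /* already in D0, or D1, for which the quoted code does not sleep */
        return 0;
    }
}
```

A caller would pass the result to usleep() after the host_pci_set_word() call, which keeps the magic numbers in one documented place.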

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Xen-devel] [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
@ 2011-11-11 18:11         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-11 18:11 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Guy Zana, Xen Devel, Allen Kay, QEMU-devel, Stefano Stabellini

> > > +    case PCI_CAP_ID_EXP:
> > > +        /* The PCI Express Capability Structure of the VF of Intel 82599 10GbE
> > > +         * Controller looks trivial, e.g., the PCI Express Capabilities
> > > +         * Register is 0. We should not try to expose it to guest.
> >
> > Why not?
> 
> Because (an old commit):
> 
> passthrough: support the assignment of the VF of Intel 82599 10GbE Controller
> 
> The datasheet is available at
> http://download.intel.com/design/network/datashts/82599_datasheet.pdf
> 
> See 'Table 9.7. VF PCIe Configuration Space' of the datasheet, the PCI
> Express Capability Structure of the VF of Intel 82599 10GbE Controller looks
> trivial, e.g., the PCI Express Capabilities Register is 0, so the Capability
> Version is 0 and pt_pcie_size_init() would fail.
> 
> We should not try to expose the PCIe cap of the device to guest.
> 
> a patch from Dexuan Cui <dexuan.cui@intel.com>

Lets inlude that in the description here..
> 
> > > +         */
> > > +        if (d->vendor_id == PCI_VENDOR_ID_INTEL &&
> > > +                d->device_id == PCI_DEVICE_ID_INTEL_82599_VF) {
> > > +            return 1;
> > > +        }
> > > +        break;
> > > +    }
> > > +    return 0;
> > > +}
> > > +
> > > +/*   find emulate register group entry */
> > > +XenPTRegGroup *pt_find_reg_grp(XenPCIPassthroughState *s, uint32_t address)
> > > +{
> > > +    XenPTRegGroup *entry = NULL;
> > > +
> > > +    /* find register group entry */
> > > +    QLIST_FOREACH(entry, &s->reg_grp_tbl, entries) {
> > > +        /* check address */
> > > +        if ((entry->base_offset <= address)
> > > +            && ((entry->base_offset + entry->size) > address)) {
> > > +            return entry;
> > > +        }
> > > +    }
> > > +
> > > +    /* group entry not found */
> > > +    return NULL;
> > > +}
> > > +
> > > +/* find emulate register entry */
> > > +XenPTReg *pt_find_reg(XenPTRegGroup *reg_grp, uint32_t address)
> > > +{
> > > +    XenPTReg *reg_entry = NULL;
> > > +    XenPTRegInfo *reg = NULL;
> > > +    uint32_t real_offset = 0;
> > > +
> > > +    /* find register entry */
> > > +    QLIST_FOREACH(reg_entry, &reg_grp->reg_tbl_list, entries) {
> > > +        reg = reg_entry->reg;
> > > +        real_offset = reg_grp->base_offset + reg->offset;
> > > +        /* check address */
> > > +        if ((real_offset <= address)
> > > +            && ((real_offset + reg->size) > address)) {
> > > +            return reg_entry;
> > > +        }
> > > +    }
> > > +
> > > +    return NULL;
> > > +}
> > > +
> > > +/* parse BAR */
> > > +static PTBarFlag pt_bar_reg_parse(XenPCIPassthroughState *s, XenPTRegInfo *reg)
> > > +{
> > > +    PCIDevice *d = &s->dev;
> > > +    XenPTRegion *region = NULL;
> > > +    PCIIORegion *r;
> > > +    int index = 0;
> > > +
> > > +    /* check 64bit BAR */
> > > +    index = pt_bar_offset_to_index(reg->offset);
> > > +    if ((0 < index) && (index < PCI_ROM_SLOT)) {
> >
> > This is  a bit confusing. Can you make the index be on the same
> > side, like
> >
> > if ((0 < index) && (PCI_ROM_SLOT > index)
> >
> > or better:
> >
> > if ((index < 0) && (index < PCI_ROM_SLOT))
> >
> > um, which looks wrong. Should it be 'index > 0' ?
> 
> Every other form is a bit confusing to me. I'd like to write
> 0 < index < ROM_SLOT, so I know that index is between 0 and ROM_SLOT.
> But, it's C and not math, so I wrote the closest way I can.
> 
> > > +        int flags = s->real_device->io_regions[index - 1].flags;
> >
> > Do we want to check the index - 1 to make sure it is not negative?
> 
> We have:
>   0 < index < ROM_SLOT
> so (index - 1) give us:
>   0 <= index - 1 < ROM_SLOT - 1
> 
> So (index - 1) can be 0, but under 0.
> ;)

Right! Ok, then please ignore my comment.

.. snip..
> > > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > > +                           + PCI_EXP_FLAGS)
> > > +        & PCI_EXP_FLAGS_VERS;
> > > +    dev_type = (pci_get_byte(s->dev.config + real_offset - reg->offset
> > > +                             + PCI_EXP_FLAGS)
> > > +                & PCI_EXP_FLAGS_TYPE) >> 4;
> > > +
> > > +    /* no need to initialize in case of Root Complex Integrated Endpoint
> > > +     * with cap_ver 1.x
> >
> > Why?
> 
> Who knows? I don't. And `git log` does not give me more information.

<laughs> OK, could the earlier author provide some ideas? Or perhaps
there is something akin in the Linux code.

.. snip..
> > > +    cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> > > +                           + PCI_EXP_FLAGS)
> > > +        & PCI_EXP_FLAGS_VERS;
> >
> > This looks like a weird tab issue, but it might be just my mailer.
> 
> Nope, there is no tab.
> 
> Maybe writing it like this:
> > cap_ver = pci_get_byte(s->dev.config + real_offset - reg->offset
> >                        + PCI_EXP_FLAGS);
> > cap_ver &= PCI_EXP_FLAGS_VERS;
> would be better.

OK.
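
A small sketch of the two field extractions, kept as separate steps as
suggested above. The mask values are the ones I know from Linux's
pci_regs.h; the function names are hypothetical and only for illustration.

```c
#include <stdint.h>

/* Values as found in Linux's pci_regs.h (PCI Express Capabilities reg). */
#define PCI_EXP_FLAGS_VERS 0x000f /* capability version, bits 3:0 */
#define PCI_EXP_FLAGS_TYPE 0x00f0 /* device/port type, bits 7:4 */

/* Hypothetical helper: capability version from the low byte of the flags. */
static uint8_t pcie_cap_version(uint8_t flags_lo)
{
    uint8_t cap_ver = flags_lo;

    cap_ver &= PCI_EXP_FLAGS_VERS;
    return cap_ver;
}

/* Hypothetical helper: device/port type from the low byte of the flags. */
static uint8_t pcie_dev_type(uint8_t flags_lo)
{
    uint8_t dev_type = flags_lo;

    dev_type &= PCI_EXP_FLAGS_TYPE;
    return dev_type >> 4;
}
```

Both fields live in the low byte of the register, which is why a single
pci_get_byte() read is enough for each of them.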
> 
> > > +
> > > +    /* no need to initialize in case of cap_ver 1.x */
> > > +    if (cap_ver == 1) {
> > > +        return PT_INVALID_REG;

.. snip..
> > > +    case 2:
> > > +        PT_LOG("Power state transition D2 -> D0active\n");
> > > +        host_pci_set_word(s->real_device, real_offset, 0);
> > > +        usleep(200);
> >
> > Heheh..
> 
> I don't know if I can remove it safely, or not.

Probably not. One of my machines regularly gets confused when the SSD disk
returns VPD information way too fast and it ends up using the name of a
previous disk. So the usleep is probably very much required (and in
all likelihood defined in the PCI spec).
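
As a rough sketch of where the 200 comes from: the delays below are an
assumption on my side, taken from what I remember of Linux's
PCI_PM_D2_DELAY (200 us) and PCI_PM_D3_WAIT (10 ms). They should be checked
against the PCI power management specification before anyone relies on
them. The enum and function names are hypothetical.

```c
/* Hypothetical names; not part of the patch. */
enum pm_state { PM_D0 = 0, PM_D1 = 1, PM_D2 = 2, PM_D3HOT = 3 };

/*
 * Assumed recovery delay, in microseconds, to wait after moving a
 * function from `from` back to D0. Values mirror the Linux constants
 * named above and are not taken from the patch itself.
 */
static unsigned int pm_wakeup_delay_us(enum pm_state from)
{
    switch (from) {
    case PM_D2:
        return 200;     /* matches the usleep(200) in the patch */
    case PM_D3HOT:
        return 10000;   /* 10 ms */
    default:
        return 0;       /* D0 and D1: no delay assumed */
    }
}
```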

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 10/10] Introduce Xen PCI Passthrough, MSI (3/3)
  2011-11-10 22:10     ` Konrad Rzeszutek Wilk
@ 2011-11-11 19:18       ` Anthony PERARD
  -1 siblings, 0 replies; 60+ messages in thread
From: Anthony PERARD @ 2011-11-11 19:18 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Xen Devel, Shan Haitao, QEMU-devel, Stefano Stabellini

On Thu, 10 Nov 2011, Konrad Rzeszutek Wilk wrote:

> On Fri, Oct 28, 2011 at 04:07:36PM +0100, Anthony PERARD wrote:
> > From: Jiang Yunhong <yunhong.jiang@intel.com>
> >
> > Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
> > Signed-off-by: Shan Haitao <haitao.shan@intel.com>
> > Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> > ---
> >  Makefile.target                      |    1 +
> >  hw/apic-msidef.h                     |    2 +
> >  hw/xen_pci_passthrough.c             |   27 ++-
> >  hw/xen_pci_passthrough.h             |   55 +++
> >  hw/xen_pci_passthrough_config_init.c |  495 +++++++++++++++++++++++++-
> >  hw/xen_pci_passthrough_msi.c         |  667 ++++++++++++++++++++++++++++++++++
> >  6 files changed, 1240 insertions(+), 7 deletions(-)
> >  create mode 100644 hw/xen_pci_passthrough_msi.c
> >

[...]

> > +/* write Message Upper Address register */
> > +static int pt_msgaddr64_reg_write(XenPCIPassthroughState *s,
> > +                                  XenPTReg *cfg_entry, uint32_t *value,
> > +                                  uint32_t dev_value, uint32_t valid_mask)
> > +{
> > +    XenPTRegInfo *reg = cfg_entry->reg;
> > +    uint32_t writable_mask = 0;
> > +    uint32_t throughable_mask = 0;
> > +    uint32_t old_addr = cfg_entry->data;
> > +
> > +    /* check whether the type is 64 bit or not */
> > +    if (!(s->msi->flags & PCI_MSI_FLAGS_64BIT)) {
> > +        /* exit I/O emulator */
> > +        PT_LOG("Error: why comes to Upper Address without 64 bit support??\n");
>
> Um, not sure what that means.

This is probably improbable.

I'll change the message to "write to the Upper Address without 64 bit
support".

> > +        return -1;
> > +    }
> > +
> > +    /* modify emulate register */
> > +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> > +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> > +    /* update the msi_info too */
> > +    s->msi->addr_hi = cfg_entry->data;
> > +
> > +    /* create value for writing to I/O device register */
> > +    throughable_mask = ~reg->emu_mask & valid_mask;
> > +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> > +
> > +    /* update MSI */
> > +    if (cfg_entry->data != old_addr) {
> > +        if (s->msi->flags & PT_MSI_FLAG_MAPPED) {
> > +            pt_msi_update(s);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
> > +/* this function will be called twice (for 32 bit and 64 bit type) */
> > +/* write Message Data register */
> > +static int pt_msgdata_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> > +                                uint16_t *value, uint16_t dev_value,
> > +                                uint16_t valid_mask)
> > +{
> > +    XenPTRegInfo *reg = cfg_entry->reg;
> > +    uint16_t writable_mask = 0;
> > +    uint16_t throughable_mask = 0;
> > +    uint16_t old_data = cfg_entry->data;
> > +    uint32_t flags = s->msi->flags;
> > +    uint32_t offset = reg->offset;
> > +
> > +    /* check the offset whether matches the type or not */
> > +    if (!((offset == PCI_MSI_DATA_64) &&  (flags & PCI_MSI_FLAGS_64BIT)) &&
> > +        !((offset == PCI_MSI_DATA_32) && !(flags & PCI_MSI_FLAGS_64BIT))) {
> > +        /* exit I/O emulator */
> > +        PT_LOG("Error: the offset is not match with the 32/64 bit type!!\n");
>
> I think it means: "The offset does not match the 32/64 bit type"
>
> > +        return -1;
> > +    }
> > +
> > +    /* modify emulate register */
> > +    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
> > +    cfg_entry->data = PT_MERGE_VALUE(*value, cfg_entry->data, writable_mask);
> > +    /* update the msi_info too */
> > +    s->msi->data = cfg_entry->data;
> > +
> > +    /* create value for writing to I/O device register */
> > +    throughable_mask = ~reg->emu_mask & valid_mask;
> > +    *value = PT_MERGE_VALUE(*value, dev_value, throughable_mask);
> > +
> > +    /* update MSI */
> > +    if (cfg_entry->data != old_data) {
> > +        if (flags & PT_MSI_FLAG_MAPPED) {
> > +            pt_msi_update(s);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
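
To make the two masks in these write handlers easier to follow, here is a
standalone sketch. It assumes PT_MERGE_VALUE takes the bits selected by the
mask from its first argument and keeps the remaining bits from the second,
which is how the handlers above use it. struct merge_result and
merge_write() are hypothetical names.

```c
#include <stdint.h>

/* Assumption: bits set in val_mask come from `value`, others from `data`. */
#define PT_MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

struct merge_result {
    uint16_t emulated;   /* new content of the emulated register */
    uint16_t to_device;  /* value forwarded to the real device */
};

/* Hypothetical helper mirroring the two merge steps of the handlers. */
static struct merge_result merge_write(uint16_t guest_value, uint16_t emu_data,
                                       uint16_t dev_value, uint16_t emu_mask,
                                       uint16_t ro_mask, uint16_t valid_mask)
{
    struct merge_result r;
    /* emulated, not read-only, and covered by this access */
    uint16_t writable_mask = emu_mask & ~ro_mask & valid_mask;
    /* not emulated, so the guest value goes straight to the device */
    uint16_t throughable_mask = ~emu_mask & valid_mask;

    r.emulated = PT_MERGE_VALUE(guest_value, emu_data, writable_mask);
    r.to_device = PT_MERGE_VALUE(guest_value, dev_value, throughable_mask);
    return r;
}
```

For example, with emu_mask 0xFF00 and ro_mask 0xF000, only bits 11:8 of the
emulated copy follow the guest, and only the low byte reaches the device.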

[...]

> >  /****************************
> >   * Capabilities
> > @@ -1664,6 +2080,48 @@ static uint8_t pt_pcie_size_init(XenPCIPassthroughState *s,
> >
> >      return pcie_size;
> >  }
> > +/* get MSI Capability Structure register group size */
> > +static uint8_t pt_msi_size_init(XenPCIPassthroughState *s,
> > +                                const XenPTRegGroupInfo *grp_reg,
> > +                                uint32_t base_offset)
> > +{
> > +    PCIDevice *d = &s->dev;
> > +    uint16_t msg_ctrl = 0;
> > +    uint8_t msi_size = 0xa;
> > +
> > +    msg_ctrl = pci_get_word(d->config + (base_offset + PCI_MSI_FLAGS));
> > +
> > +    /* check 64 bit address capable & Per-vector masking capable */
>
> ehh?

Precisely!

> > +    if (msg_ctrl & PCI_MSI_FLAGS_64BIT) {
> > +        msi_size += 4;
> > +    }
> > +    if (msg_ctrl & PCI_MSI_FLAGS_MASKBIT) {
> > +        msi_size += 10;
> > +    }
> > +
> > +    s->msi = g_malloc0(sizeof (XenPTMSI));
> > +    s->msi->pirq = -1;
>
> Is there a define for this -1?

Probably not.

What about PT_UNASSIGNED_MSI_PIRQ ?

> > +    PT_LOG("done\n");
> > +
> > +    return msi_size;
> > +}
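
A sketch of the size computation with the proposed define, outside of any
QEMU state. The two flag values are the ones from Linux's pci_regs.h;
msi_cap_size() is a hypothetical name, and PT_UNASSIGNED_MSI_PIRQ is only
the name suggested above with the -1 the patch already uses.

```c
#include <stdint.h>

/* Values as found in Linux's pci_regs.h (MSI Message Control register). */
#define PCI_MSI_FLAGS_64BIT   0x0080 /* 64-bit addresses allowed */
#define PCI_MSI_FLAGS_MASKBIT 0x0100 /* per-vector masking capable */

/* Name proposed in this thread; value is the -1 used by the patch. */
#define PT_UNASSIGNED_MSI_PIRQ (-1)

/* Hypothetical helper: size in bytes of the MSI capability structure. */
static uint8_t msi_cap_size(uint16_t msg_ctrl)
{
    uint8_t msi_size = 0xa;        /* 32-bit address, no masking */

    if (msg_ctrl & PCI_MSI_FLAGS_64BIT) {
        msi_size += 4;             /* Message Upper Address register */
    }
    if (msg_ctrl & PCI_MSI_FLAGS_MASKBIT) {
        msi_size += 10;            /* Mask Bits and Pending Bits */
    }
    return msi_size;
}
```

The two checks are independent, so the four possible sizes are 10, 14, 20
and 24 bytes.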

[...]

> > @@ -1908,8 +2382,11 @@ static int pt_init_pci_config(XenPCIPassthroughState *s)
> >      /* reinitialize all emulate register */
> >      pt_config_reinit(s);
> >
> > +    /* setup MSI-INTx translation if support */
> > +    ret = pt_enable_msi_translate(s);
> > +
> >      /* rebind machine_irq to device */
> > -    if (s->machine_irq != 0) {
> > +    if (ret < 0 && s->machine_irq != 0) {
>
> So can machine_irq be -1? Or is it only pirq that can be -1?

I think only pirq can be -1. And the default value of machine_irq is 0.

At least, the comment on top of xen_pci_passthrough.c says the same
thing.

>
> >          uint8_t e_device = PCI_SLOT(s->dev.devfn);
> >          uint8_t e_intx = pci_intx(s);
> >
> > @@ -2043,6 +2520,14 @@ void pt_config_delete(XenPCIPassthroughState *s)
> >      struct XenPTRegGroup *reg_group, *next_grp;
> >      struct XenPTReg *reg, *next_reg;
> >
> > +    /* free MSI/MSI-X info table */
> > +    if (s->msix) {
> > +        pt_msix_delete(s);
> > +    }
> > +    if (s->msi) {
> > +        g_free(s->msi);
> > +    }
> > +
> >      /* free Power Management info table */
> >      if (s->pm_state) {
> >          if (s->pm_state->pm_timer) {

[...]

> > +/*********************************/
> > +/* MSI virtuailization functions */
>
>
> virtualization
> > +
> > +/*
> > + * setup physical msi, but didn't enable it
>
> but don't
>
> > + */
> > +int pt_msi_setup(XenPCIPassthroughState *s)
> > +{
> > +    int pirq = -1;
> > +    uint8_t gvec = 0;
> > +
> > +    if (!(s->msi->flags & PT_MSI_FLAG_UNINIT)) {
> > +        PT_LOG("Error: setup physical after initialized??\n");
>
> I am not sure what that says.

Someone ate some words :(.

I think the message relates to this function: pt_msgctrl_reg_write.
pt_msgctrl_reg_write does the setup on the emulation side, calls
pt_msi_setup, and then unsets PT_MSI_FLAG_UNINIT. (This flag is only
internal to the emulator.)

I suppose this prevents the function from being called too many times
(probably by the guest).

So, maybe "setup physical MSI when it's already initialized" would be a
better log.

> > +        return -1;
> > +    }
> > +
> > +    gvec = s->msi->data & 0xFF;
> > +    if (!gvec) {
> > +        /* if gvec is 0, the guest is asking for a particular pirq that
> > +         * is passed as dest_id */
> > +        pirq = (s->msi->addr_hi & 0xffffff00) |
> > +               ((s->msi->addr_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> > +        if (!pirq) {
> > +            /* this probably identifies an misconfiguration of the guest,
> > +             * try the emulated path */
> > +            pirq = -1;
> > +        } else {
> > +            PT_LOG("pt_msi_setup requested pirq = %d\n", pirq);
> > +        }
> > +    }
> > +
> > +    if (xc_physdev_map_pirq_msi(xen_xc, xen_domid, AUTO_ASSIGN, &pirq,
> > +                                PCI_DEVFN(s->real_device->dev,
> > +                                          s->real_device->func),
> > +                                s->real_device->bus, 0, 0)) {
> > +        PT_LOG("Error: Mapping of MSI failed.\n");
>
> Give more details. As in what device failed. Perhaps even the return code?

Ok.

> > +        return -1;
> > +    }
> > +
> > +    if (pirq < 0) {
> > +        PT_LOG("Error: Invalid pirq number\n");
> > +        return -1;
> > +    }
> > +
> > +    s->msi->pirq = pirq;
> > +    PT_LOG("msi mapped with pirq %x\n", pirq);
> > +
> > +    return 0;
> > +}
> > +

[...]

> > +/*********************************/
> > +/* MSI-X virtulization functions */
>
>
> virtu...
>
> > +
> > +static void mask_physical_msix_entry(XenPCIPassthroughState *s,
> > +                                     int entry_nr, int mask)
> > +{
> > +    void *phys_off;
> > +
> > +    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
> > +    *(uint32_t *)phys_off = mask;
> > +}
> > +
> > +static int pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr)
> > +{
> > +    XenMSIXEntry *entry = &s->msix->msix_entry[entry_nr];
> > +    int pirq = entry->pirq;
> > +    int gvec = entry->io_mem[2] & 0xff;
> > +    uint64_t gaddr = *(uint64_t *)&entry->io_mem[0];
> > +    uint32_t gflags = __get_msi_gflags(entry->io_mem[2], gaddr);
> > +    int ret;
> > +
> > +    if (!entry->flags) {
> > +        return 0;
> > +    }
> > +
> > +    if (!gvec) {
> > +        /* if gvec is 0, the guest is asking for a particular pirq that
> > +         * is passed as dest_id */
> > +        pirq = ((gaddr >> 32) & 0xffffff00) |
> > +               (((gaddr & 0xffffffff) >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> > +        if (!pirq) {
> > +            /* this probably identifies an misconfiguration of the guest,
> > +             * try the emulated path */
> > +            pirq = -1;
> > +        } else {
> > +            PT_LOG("pt_msix_update_one requested pirq = %d\n", pirq);
>
> This is the same code as in the MSI case. Could it be coalesced ?

I can try.
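
One possible shape for the shared code, as a sketch only.
msi_requested_pirq() is a hypothetical name, and MSI_ADDR_DEST_ID_SHIFT is
assumed to be 12 as in hw/apic-msidef.h.

```c
#include <stdint.h>

/* Assumption: same value as in hw/apic-msidef.h. */
#define MSI_ADDR_DEST_ID_SHIFT 12

/*
 * Hypothetical helper: when gvec is 0 the guest passes the pirq it wants
 * through the destination ID bits. Returns that pirq, or -1 when it is 0
 * (probably a guest misconfiguration, so use the emulated path).
 */
static int msi_requested_pirq(uint32_t addr_hi, uint32_t addr_lo)
{
    uint32_t pirq = (addr_hi & 0xffffff00) |
                    ((addr_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);

    return pirq ? (int)pirq : -1;
}
```

The MSI path would pass s->msi->addr_hi and s->msi->addr_lo, and the MSI-X
path would pass (gaddr >> 32) and (gaddr & 0xffffffff).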


[...]

> > +void pt_msix_disable(XenPCIPassthroughState *s)
> > +{
> > +    PCIDevice *d = &s->dev;
> > +    uint8_t gvec = 0;
> > +    uint32_t gflags = 0;
> > +    uint64_t addr = 0;
> > +    int i = 0;
> > +    XenMSIXEntry *entry = NULL;
> > +
> > +    msix_set_enable(s, 0);
> > +
> > +    for (i = 0; i < s->msix->total_entries; i++) {
> > +        entry = &s->msix->msix_entry[i];
> > +
> > +        if (entry->pirq == -1) {
> > +            continue;
> > +        }
> > +
> > +        gvec = entry->io_mem[2] & 0xff;
> > +        addr = *(uint64_t *)&entry->io_mem[0];
> > +        gflags = __get_msi_gflags(entry->io_mem[2], addr);
> > +
> > +        PT_LOG("Unbind msix with pirq %x, gvec %x\n",
> > +                entry->pirq, gvec);
> > +
> > +        if (xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec,
> > +                                        entry->pirq, gflags)) {
> > +            PT_LOG("Error: Unbinding of MSI-X failed. [%02x:%02x.%x]\n",
> > +                   pci_bus_num(d->bus), PCI_SLOT(d->devfn),
> > +                   PCI_FUNC(d->devfn));
> > +        } else {
> > +            PT_LOG("Unmap msix with pirq %x\n", entry->pirq);
> > +
> > +            if (xc_physdev_unmap_pirq(xen_xc, xen_domid, entry->pirq)) {
> > +                PT_LOG("Error: Unmapping of MSI-X failed. [%02x:%02x.%x]\n",
> > +                       pci_bus_num(d->bus),
> > +                       PCI_SLOT(d->devfn), PCI_FUNC(d->devfn));
>
> There are a lot of those error reports where pci_bus_num, PCI_SLOT, etc.
> are used. Perhaps this should be in a function?

Yes, that will help to get better reporting.
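
A sketch of what such a function could look like. pt_format_bdf() is a
hypothetical name, and PCI_SLOT/PCI_FUNC are redefined here only to keep
the example standalone.

```c
#include <stdint.h>
#include <stdio.h>

/* Same definitions as the usual PCI_SLOT/PCI_FUNC macros. */
#define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn) ((devfn) & 0x07)

/*
 * Hypothetical helper: write "bb:dd.f" for the given bus and devfn into
 * buf. Returns what snprintf() returns, i.e. the length that was needed.
 */
static int pt_format_bdf(char *buf, size_t len, uint8_t bus, uint8_t devfn)
{
    return snprintf(buf, len, "%02x:%02x.%x", (unsigned)bus,
                    (unsigned)PCI_SLOT(devfn), (unsigned)PCI_FUNC(devfn));
}
```

Every error message could then print the same string, instead of each call
site repeating the three arguments.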

> > +            }
> > +        }
> > +        /* clear msi-x info */
> > +        entry->pirq = -1;
> > +        entry->flags = 0;
> > +    }
> > +}
> > +
> > +int pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index)
> > +{
> > +    XenMSIXEntry *entry;
> > +    int i, ret;
> > +
> > +    if (!(s->msix && s->msix->bar_index == bar_index)) {
> > +        return 0;
> > +    }
> > +
> > +    for (i = 0; i < s->msix->total_entries; i++) {
> > +        entry = &s->msix->msix_entry[i];
> > +        if (entry->pirq != -1) {
> > +            ret = xc_domain_unbind_pt_irq(xen_xc, xen_domid, entry->pirq,
> > +                                          PT_IRQ_TYPE_MSI, 0, 0, 0, 0);
> > +            if (ret) {
> > +                PT_LOG("Error: unbind MSI-X entry %d failed\n", entry->pirq);
> > +            }
> > +            entry->flags = 1;
> > +        }
> > +    }
> > +    pt_msix_update(s);
> > +
> > +    return 0;
> > +}
> > +
> > +static void pci_msix_invalid_write(void *opaque, target_phys_addr_t addr,
> > +                                   uint32_t val)
> > +{
> > +    PT_LOG("Error: Invalid write to MSI-X table,"
> > +           " only dword access is allowed.\n");
> > +}
> > +
> > +static void pci_msix_writel(void *opaque, target_phys_addr_t addr,
> > +                            uint32_t val)
> > +{
> > +    XenPCIPassthroughState *s = (XenPCIPassthroughState *)opaque;
> > +    XenPTMSIX *msix = s->msix;
> > +    XenMSIXEntry *entry;
> > +    int entry_nr, offset;
> > +    void *phys_off;
> > +    uint32_t vec_ctrl;
> > +
> > +    if (addr % 4) {
> > +        PT_LOG("Error: Unaligned dword access to MSI-X table, "
> > +                "addr %016"PRIx64"\n", addr);
> > +        return;
> > +    }
> > +
> > +    PT_LOG("addr: "TARGET_FMT_plx", val: %#x\n", addr, val);
>
> Huh?

I will remove this one.

> > +
> > +    entry_nr = addr / 16;
> > +    entry = &msix->msix_entry[entry_nr];
> > +    offset = (addr % 16) / 4;
> > +
> > +    /*
> > +     * If Xen intercepts the mask bit access, io_mem[3] may not be
> > +     * up-to-date. Read from hardware directly.
> > +     */
> > +    phys_off = s->msix->phys_iomem_base + 16 * entry_nr + 12;
> > +    vec_ctrl = *(uint32_t *)phys_off;
> > +
> > +    if (offset != 3 && msix->enabled && !(vec_ctrl & 0x1)) {
> > +        PT_LOG("Error: Can't update msix entry %d since MSI-X is already "
> > +                "function.\n", entry_nr);
>
> already function? already on? active?

Probably.

But I don't know what is being checked here.

> > +        return;
> > +    }
> > +
> > +    if (offset != 3 && entry->io_mem[offset] != val) {
> > +        entry->flags = 1;
> > +    }
> > +    entry->io_mem[offset] = val;
> > +
> > +    if (offset == 3) {
> > +        if (msix->enabled && !(val & 0x1)) {
> > +            pt_msix_update_one(s, entry_nr);
> > +        }
> > +        mask_physical_msix_entry(s, entry_nr, entry->io_mem[3] & 0x1);
> > +    }
> > +}
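
For readers trying to follow the arithmetic in pci_msix_writel, here is a
sketch of the address decoding on its own. It assumes the usual MSI-X table
layout of 16 bytes per entry (address low, address high, data, vector
control). struct msix_access and msix_decode() are hypothetical names.

```c
#include <stdint.h>

#define MSIX_ENTRY_SIZE        16 /* bytes per MSI-X table entry */
#define MSIX_VECTOR_CTRL_DWORD 3  /* dword 3 holds the mask bit (bit 0) */

struct msix_access {
    int entry_nr; /* which table entry is touched */
    int dword;    /* 0: addr low, 1: addr high, 2: data, 3: vector ctrl */
};

/* Hypothetical helper: split a dword-aligned table offset into its parts. */
static struct msix_access msix_decode(uint64_t addr)
{
    struct msix_access a;

    a.entry_nr = (int)(addr / MSIX_ENTRY_SIZE);
    a.dword = (int)((addr % MSIX_ENTRY_SIZE) / 4);
    return a;
}
```

With this split, the check quoted above reads as: a write to dword 0, 1 or
2 is refused while MSI-X is enabled and the entry is unmasked in hardware.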

Thanks,

-- 
Anthony PERARD

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 08/10] Introduce Xen PCI Passthrough, PCI config space helpers (2/3)
  2011-11-11 17:40       ` Anthony PERARD
@ 2011-11-11 20:37         ` Ian Campbell
  -1 siblings, 0 replies; 60+ messages in thread
From: Ian Campbell @ 2011-11-11 20:37 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: Xen Devel, Stefano Stabellini, Konrad Rzeszutek Wilk, Allen Kay,
	QEMU-devel, Guy Zana

On Fri, 2011-11-11 at 17:40 +0000, Anthony PERARD wrote:
> 
> > if ((index < 0) && (index < PCI_ROM_SLOT))
> >
> > um, which looks wrong. Should it be 'index > 0' ?
> 
> Every other form is a bit confusing to me. I'd like to write
> 0 < index < ROM_SLOT, so I know that index is between 0 and ROM_SLOT.
> But, it's C and not math, so I wrote the closest way I can.

"0 < index < ROM_SLOT" ==> "0 < index && index < ROM_SLOT"
but you have "index < 0 && ..." which is backwards.
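
A small sketch of the difference, in case it helps. EXAMPLE_ROM_SLOT is
an illustrative stand-in for PCI_ROM_SLOT (assumed to be 6 here), and the
function names are made up for this example. Note also that C parses
"0 < index < ROM_SLOT" as "(0 < index) < ROM_SLOT", which compares 0 or 1
against ROM_SLOT, so the chained form does not work either.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for PCI_ROM_SLOT; the value 6 is an assumption. */
#define EXAMPLE_ROM_SLOT 6

/* The intended test: index strictly between 0 and EXAMPLE_ROM_SLOT. */
static bool index_in_range(int index)
{
    return 0 < index && index < EXAMPLE_ROM_SLOT;
}

/* The test as posted: only true for negative indexes. */
static bool index_as_posted(int index)
{
    return (index < 0) && (index < EXAMPLE_ROM_SLOT);
}

/* Quick self-check showing the two forms disagree on a valid BAR index. */
static void check_range_examples(void)
{
    assert(index_in_range(3));
    assert(!index_as_posted(3));
}
```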

Ian.

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
  2011-11-11 18:05         ` Konrad Rzeszutek Wilk
@ 2011-11-14 11:09           ` Stefano Stabellini
  -1 siblings, 0 replies; 60+ messages in thread
From: Stefano Stabellini @ 2011-11-14 11:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Xen Devel, Stefano Stabellini, Allen Kay, QEMU-devel, Guy Zana,
	Anthony Perard

On Fri, 11 Nov 2011, Konrad Rzeszutek Wilk wrote:
> > > > +                hw_error("Internal error: Invalid write emulation "
> > > > +                         "return value[%d]. I/O emulator exit.\n", rc);
> > >
> > > Oh. I hadn't realized this, but you are using hw_error. Which is
> > > calling 'abort'! Yikes. Is there no way to recover from this? Say return 0xfffff?
> > 
> > In qemu-xen-traditional, it was an exit(1). I do not know the
> > consequences of a bad write, and I cannot return anything. So I suppose
> > that the guest would only learn that something is wrong on the next read.
> > 
> > Instead of abort();, I can just do nothing and return. Or we could unplug
> > the device from QEMU.
> > 
> > Any preference?
> 
> I think this calls for an experiment. If Linux still functions after you completely
> unplug the device, then I would say unplug it (b/c in all likelihood the reason
> you can't write is b/c the host has unplugged the device).

It would make sense to try to PCI hot-unplug the device, however
considering that it requires guest support, it cannot be used to safely
handle an error like this one. Also it requires some interactions that
might not be possible anymore at this point.
I would destroy the domain instead, using a graceful shutdown if
possible. Something similar to libxl_domain_shutdown.

* Re: [Qemu-devel] [Xen-devel] [PATCH V3 07/10] Introduce Xen PCI Passthrough, qdevice (1/3)
  2011-11-14 11:09           ` Stefano Stabellini
@ 2011-11-14 18:11             ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 60+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-11-14 18:11 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Anthony Perard, Guy Zana, Xen Devel, Allen Kay, QEMU-devel

On Mon, Nov 14, 2011 at 11:09:31AM +0000, Stefano Stabellini wrote:
> On Fri, 11 Nov 2011, Konrad Rzeszutek Wilk wrote:
> > > > > +                hw_error("Internal error: Invalid write emulation "
> > > > > +                         "return value[%d]. I/O emulator exit.\n", rc);
> > > >
> > > > Oh. I hadn't realized this, but you are using hw_error. Which is
> > > > calling 'abort'! Yikes. Is there no way to recover from this? Say return 0xfffff?
> > > 
> > > In qemu-xen-traditional, it was an exit(1). I do not know the
> > > consequences of a bad write, and I cannot return anything. So I suppose
> > > that the guest would only learn that something is wrong on the next read.
> > > 
> > > Instead of abort();, I can just do nothing and return. Or we could unplug
> > > the device from QEMU.
> > > 
> > > Any preference?
> > 
> > I think this calls for an experiment. If Linux still functions after you completely
> > unplug the device, then I would say unplug it (b/c in all likelihood the reason
> > you can't write is b/c the host has unplugged the device).
> 
> It would make sense to try to PCI hot-unplug the device, however
> considering that it requires guest support, it cannot be used to safely
> handle an error like this one. Also it requires some interactions that
> might not be possible anymore at this point.
> I would destroy the domain instead, using a graceful shutdown if
> possible. Something similar to libxl_domain_shutdown.

Sounds good. We should also print a clear message to the log explaining
_why_ we just killed the guest.
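
A minimal sketch of that idea, assuming a negative rc is what signals the
failed write emulation. The shutdown_request_fn hook and the function
names are hypothetical; they stand in for whatever QEMU or libxl call
ends up being used and are not existing APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical hook standing in for the real "shut the domain down" call. */
typedef void (*shutdown_request_fn)(void);

/* Example hook that only counts requests, so the sketch is self-contained. */
static int shutdown_requests;

static void count_shutdown_request(void)
{
    shutdown_requests++;
}

/*
 * Instead of aborting, log why the guest is being stopped and then ask
 * for a shutdown. Returns true if the write emulation result was fine.
 */
static bool handle_write_emulation_rc(int rc, FILE *log,
                                      shutdown_request_fn request_shutdown)
{
    assert(log != NULL && request_shutdown != NULL);
    if (rc >= 0) {
        return true;
    }
    fprintf(log, "xen-pt: invalid write emulation return value %d, "
                 "requesting guest shutdown\n", rc);
    request_shutdown();
    return false;
}
```

The caller would pass the real shutdown request in place of
count_shutdown_request and stop processing the access when false is
returned.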
