* [PATCH v11 00/12] vpci: PCI config space emulation
@ 2018-03-20 15:15 Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 01/12] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
                   ` (11 more replies)
  0 siblings, 12 replies; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel; +Cc: Boris Ostrovsky, Roger Pau Monne

Hello,

The following series contains an implementation of handlers for the PCI
configuration space inside of Xen. This allows Xen to detect accesses
to the PCI configuration space and react accordingly.

Why is this needed? IMHO, there are two main reasons for doing all this
emulation inside of Xen. The first one is to avoid adding a bunch of
duplicated Xen PV specific code to each OS we want to support in PVH
mode. Doing so just promotes Xen code duplication amongst OSes, which
leads to a higher maintenance burden.

The second reason is that this code (or its functionality, to be
more precise) already exists in QEMU (and pciback to a degree), and
it's code that we already support and maintain. By moving it into the
hypervisor itself every guest type can make use of it, and it can be
shared between them all. I know that the code in this series is not
yet suitable for DomU HVM guests in its current state, but it should
be in due time.

As usual, each patch contains a changeset summary between versions,
so I'm not going to copy the list of changes here.

The branch containing the patches can be found at:

git://xenbits.xen.org/people/royger/xen.git vpci_v11

Note that this is only safe to use for the hardware domain (which is
trusted). Any non-trusted domain will need a lot more handlers before it
can freely access the PCI configuration space.

Roger Pau Monne (12):
  vpci: introduce basic handlers to trap accesses to the PCI config
    space
  x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  pci: split code to size BARs from pci_add_device
  pci: add support to size ROM BARs to pci_size_mem_bar
  xen: introduce rangeset_consume_ranges
  vpci: add header handlers
  x86/pt: mask MSI vectors on unbind
  vpci/msi: add MSI handlers
  vpci: add a priority parameter to the vPCI register initializer
  vpci/msix: add MSI-X handlers
  vpci: do not expose unneeded functions to the user-space test harness

 .gitignore                        |   3 +
 tools/libxl/libxl_x86.c           |   2 +-
 tools/tests/Makefile              |   1 +
 tools/tests/vpci/Makefile         |  33 +++
 tools/tests/vpci/emul.h           | 134 +++++++++
 tools/tests/vpci/main.c           | 309 +++++++++++++++++++++
 xen/arch/arm/xen.lds.S            |  14 +
 xen/arch/x86/Kconfig              |   1 +
 xen/arch/x86/domain.c             |   6 +-
 xen/arch/x86/hvm/dom0_build.c     |  23 +-
 xen/arch/x86/hvm/hvm.c            |   7 +
 xen/arch/x86/hvm/hypercall.c      |   5 +
 xen/arch/x86/hvm/io.c             | 293 ++++++++++++++++++++
 xen/arch/x86/hvm/ioreq.c          |   4 +
 xen/arch/x86/hvm/vmsi.c           | 246 +++++++++++++++++
 xen/arch/x86/msi.c                |   3 +
 xen/arch/x86/physdev.c            |  11 +
 xen/arch/x86/setup.c              |   2 +-
 xen/arch/x86/x86_64/mmconfig.h    |   4 -
 xen/arch/x86/xen.lds.S            |  14 +
 xen/common/rangeset.c             |  28 ++
 xen/drivers/Kconfig               |   3 +
 xen/drivers/Makefile              |   1 +
 xen/drivers/passthrough/io.c      |  15 +
 xen/drivers/passthrough/pci.c     | 107 ++++---
 xen/drivers/vpci/Makefile         |   1 +
 xen/drivers/vpci/header.c         | 567 ++++++++++++++++++++++++++++++++++++++
 xen/drivers/vpci/msi.c            | 349 +++++++++++++++++++++++
 xen/drivers/vpci/msix.c           | 458 ++++++++++++++++++++++++++++++
 xen/drivers/vpci/vpci.c           | 482 ++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h      |   1 +
 xen/include/asm-x86/hvm/domain.h  |   7 +
 xen/include/asm-x86/hvm/io.h      |  20 ++
 xen/include/asm-x86/msi.h         |   3 +
 xen/include/asm-x86/pci.h         |   6 +
 xen/include/public/arch-x86/xen.h |   5 +-
 xen/include/xen/irq.h             |   1 +
 xen/include/xen/pci.h             |   9 +
 xen/include/xen/pci_regs.h        |   8 +
 xen/include/xen/rangeset.h        |  10 +
 xen/include/xen/sched.h           |   4 +
 xen/include/xen/vpci.h            | 225 +++++++++++++++
 42 files changed, 3379 insertions(+), 46 deletions(-)
 create mode 100644 tools/tests/vpci/Makefile
 create mode 100644 tools/tests/vpci/emul.h
 create mode 100644 tools/tests/vpci/main.c
 create mode 100644 xen/drivers/vpci/Makefile
 create mode 100644 xen/drivers/vpci/header.c
 create mode 100644 xen/drivers/vpci/msi.c
 create mode 100644 xen/drivers/vpci/msix.c
 create mode 100644 xen/drivers/vpci/vpci.c
 create mode 100644 xen/include/xen/vpci.h

-- 
2.16.2



* [PATCH v11 01/12] vpci: introduce basic handlers to trap accesses to the PCI config space
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-21  4:39   ` Julien Grall
  2018-03-20 15:15 ` [PATCH v11 02/12] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Paul Durrant, Jan Beulich,
	Boris Ostrovsky, Roger Pau Monne

This functionality is going to reside in vpci.c (and the corresponding
vpci.h header), and should be arch-agnostic. The handlers introduced
in this patch set up the basic functionality required in order to trap
accesses to the PCI config space, and allow decoding the address and
finding the corresponding handler that should handle the access
(although no handlers are implemented).

Note that the traps to the PCI IO port registers (0xcf8/0xcfc) are
set up inside of an x86 HVM file, since that's not shared with other
arches.
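
For readers not familiar with the 0xcf8/0xcfc mechanism, the decoding
can be sketched roughly as below. This is only an illustration that
follows the standard PCI configuration mechanism #1 bit layout; the
struct and function names are hypothetical and are not the ones used
by this patch, and extended (AMD-style) register bits are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded address (not a type from the patch). */
struct cf8_addr {
    bool enabled;        /* bit 31: config access enable */
    uint8_t bus;         /* bits 23:16 */
    uint8_t dev;         /* bits 15:11 */
    uint8_t func;        /* bits 10:8 */
    unsigned int reg;    /* byte offset into the 256-byte config space */
};

/*
 * Decode an access to data port 'port' (0xcfc-0xcff), given the value
 * last written to the address port 0xcf8.
 */
struct cf8_addr cf8_decode(uint32_t cf8, unsigned int port)
{
    struct cf8_addr a;

    assert(port >= 0xcfc && port <= 0xcff);

    a.enabled = (cf8 >> 31) & 1;
    a.bus = (cf8 >> 16) & 0xff;
    a.dev = (cf8 >> 11) & 0x1f;
    a.func = (cf8 >> 8) & 0x7;
    /* Bits 7:2 select the dword, the low port bits select the byte in it. */
    a.reg = (cf8 & 0xfc) | (port & 3);

    return a;
}
```

With 0x80032a10 latched in 0xcf8, a byte access to port 0xcfe decodes
to bus 3, device 5, function 2, register 0x12.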

A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal to
Xen whether a domain should use the newly introduced vPCI handlers. This
is only enabled for PVH Dom0 at the moment.

A very simple user-space test is also provided, so that the basic
functionality of the vPCI traps can be asserted. This has proven
quite helpful during development, since the logic to handle partial
accesses or accesses that span multiple registers is not
trivial.

The handlers for the registers are added to a linked list that's kept
sorted at all times. Both the read and write handlers support accesses
that span multiple emulated registers and contain gaps that are not
emulated.
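
As a rough illustration of the idea (not the code in vpci.c), a read
over a sorted list with gaps could look like the sketch below. The
struct layout and the reg_add, cfg_read and demo_read names are
hypothetical; merge_result is a name used by the patch, but the body
here is only an assumed approximation. Gaps are filled with all 1s,
which is what the dummy hardware helpers in the test harness return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified register descriptor (not the struct in vpci.c). */
struct reg {
    unsigned int offset, size;  /* size is 1, 2 or 4 bytes */
    uint32_t value;             /* emulated register contents */
    struct reg *next;
};

/* Insert r keeping the list sorted by offset; fail if it overlaps. */
int reg_add(struct reg **head, struct reg *r)
{
    struct reg **pp = head;

    /* Skip every register that ends at or before the start of r. */
    while ( *pp && (*pp)->offset + (*pp)->size <= r->offset )
        pp = &(*pp)->next;

    if ( *pp && (*pp)->offset < r->offset + r->size )
        return -1;

    r->next = *pp;
    *pp = r;

    return 0;
}

/* Place the low 'size' bytes of 'val' into 'data' at byte 'offset'. */
uint32_t merge_result(uint32_t data, uint32_t val, unsigned int size,
                      unsigned int offset)
{
    uint32_t mask;

    assert(size >= 1 && offset + size <= 4);
    mask = 0xffffffffu >> (32 - 8 * size);

    return (data & ~(mask << (offset * 8))) | ((val & mask) << (offset * 8));
}

/* Read 'size' bytes at 'addr'; bytes without a handler read as all 1s. */
uint32_t cfg_read(const struct reg *head, unsigned int addr,
                  unsigned int size)
{
    uint32_t data;
    const struct reg *r;

    assert(size >= 1 && size <= 4);
    data = 0xffffffffu >> (32 - 8 * size);

    for ( r = head; r; r = r->next )
    {
        unsigned int start, end;

        if ( r->offset + r->size <= addr )
            continue;   /* entirely below the access */
        if ( r->offset >= addr + size )
            break;      /* sorted list: nothing further can match */

        start = r->offset > addr ? r->offset : addr;
        end = r->offset + r->size < addr + size ? r->offset + r->size
                                                : addr + size;
        data = merge_result(data, r->value >> ((start - r->offset) * 8),
                            end - start, start - addr);
    }

    return data;
}

/* Build a layout like the one in the test harness and read from it. */
uint32_t demo_read(unsigned int addr, unsigned int size)
{
    struct reg regs[] = {
        { .offset = 6, .size = 1, .value = 0xba },
        { .offset = 0, .size = 4, .value = 0x1a1b1c1d },
        { .offset = 7, .size = 1, .value = 0xbd },
        { .offset = 5, .size = 1, .value = 0xba },
    };
    struct reg overlap = { .offset = 4, .size = 4 };
    struct reg *head = NULL;
    unsigned int i;

    for ( i = 0; i < sizeof(regs) / sizeof(regs[0]); i++ )
        if ( reg_add(&head, &regs[i]) )
            abort();

    /* A 4-byte register at offset 4 overlaps r5-r7 and must be rejected. */
    if ( reg_add(&head, &overlap) == 0 )
        abort();

    return cfg_read(head, addr, size);
}
```

With this layout demo_read(4, 4) yields 0xbdbabaff: three 1-byte
registers are merged into the result and the unhandled byte at offset 4
reads as 0xff, matching the corresponding check in the test harness.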

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
[IO parts]
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v9:
 - Remove vpci/Kconfig and use drivers/Kconfig instead.
 - Remove depends on HAS_PCI.

Changes since v8:
 - Introduce HAS_VPCI Kconfig option.
 - Drop Jan and Wei's RB (keep Paul's since the HAS_VPCI addition
   doesn't change IO code).
 - Rebase on top of XSA-256.

Changes since v7:
 - Constify d in vpci_portio_read.
 - ASSERT the correctness of the address in the read/write handlers.
 - Add newlines between non-fallthrough case statements.

Changes since v6:
 - Align the vpci handlers in the linker script.
 - Switch add/remove register functions to take a vpci parameter
   instead of a pci_dev.
 - Expand comment of merge_result.
 - Return X86EMUL_UNHANDLEABLE if accessing cfc and cf8 is disabled.

Changes since v5:
 - Use a spinlock per pci device.
 - Use the recently introduced pci_sbdf_t type.
 - Fix test harness to use the right handler type and the newly
   introduced lock.
 - Move the position of the vpci sections in the linker scripts.
 - Constify domain and pci_dev in vpci_{read/write}.
 - Fix typos in comments.
 - Use _XEN_VPCI_H_ as header guard.

Changes since v4:
* User-space test harness:
 - Do not redirect the output of the test.
 - Add main.c and emul.h as dependencies of the Makefile target.
 - Use the same rule to modify the vpci and list headers.
 - Remove underscores from local macro variables.
 - Add _check suffix to the test harness multiread function.
 - Change the value written by every different size in the multiwrite
   test.
 - Use { } to initialize the r16 and r20 arrays (instead of { 0 }).
 - Perform some of the read checks with the local variable directly.
 - Expand some comments.
 - Implement a dummy rwlock.
* Hypervisor code:
 - Guard the linker script changes with CONFIG_HAS_PCI.
 - Rename vpci_access_check to vpci_access_allowed and make it return
   bool.
 - Make hvm_pci_decode_addr return the register as return value.
 - Use ~3 instead of 0xfffc to remove the register offset when
   checking accesses to IO ports.
 - s/head/prev in vpci_add_register.
 - Add parentheses around & in vpci_add_register.
 - Fix register removal.
 - Change the BUGs in vpci_{read/write}_hw helpers to
   ASSERT_UNREACHABLE.
 - Make merge_result static and change the computation of the mask to
   avoid using a uint64_t.
 - Modify vpci_read to only read from hardware the not-emulated gaps.
 - Remove the vpci_val union and use a uint32_t instead.
 - Change handler read type to return a uint32_t instead of modifying
   a variable passed by reference.
 - Constify the data opaque parameter of read handlers.
 - Change the size parameter of the vpci_{read/write} functions to
   unsigned int.
 - Place the array of initialization handlers in init.rodata or
   .rodata depending on whether late-hwdom is enabled.
 - Remove the pci_devs lock, assume the Dom0 is well behaved and won't
   remove the device while trying to access it.
 - Change the recursive spinlock into a rw lock for performance
   reasons.

Changes since v3:
* User-space test harness:
 - Fix spaces in container_of macro.
 - Implement dummy locking functions.
 - Remove the 'current' macro and make current a pointer to the
   statically allocated vcpu.
 - Remove unneeded parentheses in the pci_conf_readX macros.
 - Fix the name of the write test macro.
 - Remove the dummy EXPORT_SYMBOL macro (this was needed by the RB
   code only).
 - Import the max macro.
 - Test all possible read/write size combinations with all possible
   emulated register sizes.
 - Introduce a test for register removal.
* Hypervisor code:
 - Use a sorted list in order to store the config space handlers.
 - Remove some unneeded 'else' branches.
 - Make the IO port handlers always return X86EMUL_OKAY, and set the
   data to all 1's in case of read failure (writes are simply ignored).
 - In hvm_select_ioreq_server reuse local variables when calling
   XEN_DMOP_PCI_SBDF.
 - Store the pointers to the initialization functions in the .rodata
   section.
 - Do not ignore the return value of xen_vpci_add_handlers in
   setup_one_hwdom_device.
 - Remove the vpci_init macro.
 - Do not hide the pointers inside of the vpci_{read/write}_t
   typedefs.
 - Rename priv_data to private in vpci_register.
 - Simplify checking for register overlap in vpci_register_cmp.
 - Check that the offset and the length match before removing a
   register in xen_vpci_remove_register.
 - Make vpci_read_hw return a value rather than storing it in a
   pointer passed by parameter.
 - Handler dispatcher functions vpci_{read/write} no longer return an
   error code, errors on reads/writes should be treated like hardware
   (writes ignored, reads return all 1's or garbage).
 - Make sure pcidevs is locked before calling pci_get_pdev_by_domain.
 - Use a recursive spinlock for the vpci lock, so that spin_is_locked
   checks that the current CPU is holding the lock.
 - Make the code less error-chatty by removing some of the printk's.
 - Pass the slot and the function as separate parameters to the
   handler dispatchers (instead of passing devfn).
 - Allow handlers to be registered with either a read or write
   function only, the missing handler will be replaced by a dummy
   handler (writes ignored, reads return 1's).
 - Introduce PCI_CFG_SPACE_* defines from Linux.
 - Simplify the handler dispatchers by removing the recursion, now the
   dispatchers iterate over the list of sorted handlers and call them
   in order.
 - Remove the GENMASK_BYTES, SHIFT_RIGHT_BYTES and ADD_RESULT macros,
   and instead provide a merge_result function in order to merge a
   register output into a partial result.
 - Rename the fields of the vpci_val union to u8/u16/u32.
 - Remove the return values from the read/write handlers, errors
   should be handled internally and signaled as would be done on
   native hardware.
 - Remove the usage of the GENMASK macro.

Changes since v2:
 - Generalize the PCI address decoding and use it for IOREQ code also.

Changes since v1:
 - Allow access to cross a word-boundary.
 - Add locking.
 - Add cleanup to xen_vpci_add_handlers in case of failure.
---
 .gitignore                        |   3 +
 tools/libxl/libxl_x86.c           |   2 +-
 tools/tests/Makefile              |   1 +
 tools/tests/vpci/Makefile         |  37 +++
 tools/tests/vpci/emul.h           | 133 +++++++++++
 tools/tests/vpci/main.c           | 309 +++++++++++++++++++++++++
 xen/arch/arm/xen.lds.S            |  14 ++
 xen/arch/x86/Kconfig              |   1 +
 xen/arch/x86/domain.c             |   6 +-
 xen/arch/x86/hvm/hvm.c            |   2 +
 xen/arch/x86/hvm/io.c             | 105 +++++++++
 xen/arch/x86/setup.c              |   2 +-
 xen/arch/x86/xen.lds.S            |  14 ++
 xen/drivers/Kconfig               |   3 +
 xen/drivers/Makefile              |   1 +
 xen/drivers/passthrough/pci.c     |  10 +-
 xen/drivers/vpci/Makefile         |   1 +
 xen/drivers/vpci/vpci.c           | 459 ++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h      |   1 +
 xen/include/asm-x86/hvm/io.h      |   3 +
 xen/include/public/arch-x86/xen.h |   5 +-
 xen/include/xen/pci.h             |   3 +
 xen/include/xen/pci_regs.h        |   8 +
 xen/include/xen/vpci.h            |  53 +++++
 24 files changed, 1169 insertions(+), 7 deletions(-)
 create mode 100644 tools/tests/vpci/Makefile
 create mode 100644 tools/tests/vpci/emul.h
 create mode 100644 tools/tests/vpci/main.c
 create mode 100644 xen/drivers/vpci/Makefile
 create mode 100644 xen/drivers/vpci/vpci.c
 create mode 100644 xen/include/xen/vpci.h

diff --git a/.gitignore b/.gitignore
index 7820abb756..cd57530cba 100644
--- a/.gitignore
+++ b/.gitignore
@@ -254,6 +254,9 @@ tools/tests/regression/build/*
 tools/tests/regression/downloads/*
 tools/tests/mem-sharing/memshrtool
 tools/tests/mce-test/tools/xen-mceinj
+tools/tests/vpci/list.h
+tools/tests/vpci/vpci.[hc]
+tools/tests/vpci/test_vpci
 tools/xcutils/lsevtchn
 tools/xcutils/readnotes
 tools/xenbackendd/_paths.h
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 4ea1249925..1e9f98961b 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -9,7 +9,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
 {
     switch(d_config->c_info.type) {
     case LIBXL_DOMAIN_TYPE_HVM:
-        xc_config->emulation_flags = XEN_X86_EMU_ALL;
+        xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
         break;
     case LIBXL_DOMAIN_TYPE_PVH:
         xc_config->emulation_flags = XEN_X86_EMU_LAPIC;
diff --git a/tools/tests/Makefile b/tools/tests/Makefile
index 7162945121..f6942a93fb 100644
--- a/tools/tests/Makefile
+++ b/tools/tests/Makefile
@@ -13,6 +13,7 @@ endif
 SUBDIRS-$(CONFIG_X86) += x86_emulator
 SUBDIRS-y += xen-access
 SUBDIRS-y += xenstore
+SUBDIRS-$(CONFIG_HAS_PCI) += vpci
 
 .PHONY: all clean install distclean uninstall
 all clean distclean: %: subdirs-%
diff --git a/tools/tests/vpci/Makefile b/tools/tests/vpci/Makefile
new file mode 100644
index 0000000000..e45fcb5cd9
--- /dev/null
+++ b/tools/tests/vpci/Makefile
@@ -0,0 +1,37 @@
+XEN_ROOT=$(CURDIR)/../../..
+include $(XEN_ROOT)/tools/Rules.mk
+
+TARGET := test_vpci
+
+.PHONY: all
+all: $(TARGET)
+
+.PHONY: run
+run: $(TARGET)
+	./$(TARGET)
+
+$(TARGET): vpci.c vpci.h list.h main.c emul.h
+	$(HOSTCC) -g -o $@ vpci.c main.c
+
+.PHONY: clean
+clean:
+	rm -rf $(TARGET) *.o *~ vpci.h vpci.c list.h
+
+.PHONY: distclean
+distclean: clean
+
+.PHONY: install
+install:
+
+vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
+	# Trick the compiler so it doesn't complain about missing symbols
+	sed -e '/#include/d' \
+	    -e '1s;^;#include "emul.h"\
+	             vpci_register_init_t *const __start_vpci_array[1]\;\
+	             vpci_register_init_t *const __end_vpci_array[1]\;\
+	             ;' <$< >$@
+
+list.h: $(XEN_ROOT)/xen/include/xen/list.h
+vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
+list.h vpci.h:
+	sed -e '/#include/d' <$< >$@
diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h
new file mode 100644
index 0000000000..fd0317995a
--- /dev/null
+++ b/tools/tests/vpci/emul.h
@@ -0,0 +1,133 @@
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _TEST_VPCI_
+#define _TEST_VPCI_
+
+#include <assert.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#define container_of(ptr, type, member) ({                      \
+        typeof(((type *)0)->member) *mptr = (ptr);              \
+                                                                \
+        (type *)((char *)mptr - offsetof(type, member));        \
+})
+
+#define smp_wmb()
+#define prefetch(x) __builtin_prefetch(x)
+#define ASSERT(x) assert(x)
+#define __must_check __attribute__((__warn_unused_result__))
+
+#include "list.h"
+
+struct domain {
+};
+
+struct pci_dev {
+    struct vpci *vpci;
+};
+
+struct vcpu
+{
+    const struct domain *domain;
+};
+
+extern const struct vcpu *current;
+extern const struct pci_dev test_pdev;
+
+typedef bool spinlock_t;
+#define spin_lock_init(l) (*(l) = false)
+#define spin_lock(l) (*(l) = true)
+#define spin_unlock(l) (*(l) = false)
+
+typedef union {
+    uint32_t sbdf;
+    struct {
+        union {
+            uint16_t bdf;
+            struct {
+                union {
+                    struct {
+                        uint8_t func : 3,
+                                dev  : 5;
+                    };
+                    uint8_t     extfunc;
+                };
+                uint8_t         bus;
+            };
+        };
+        uint16_t                seg;
+    };
+} pci_sbdf_t;
+
+#include "vpci.h"
+
+#define __hwdom_init
+
+#define has_vpci(d) true
+
+#define xzalloc(type) ((type *)calloc(1, sizeof(type)))
+#define xmalloc(type) ((type *)malloc(sizeof(type)))
+#define xfree(p) free(p)
+
+#define pci_get_pdev_by_domain(...) &test_pdev
+
+/* Dummy native helpers. Writes are ignored, reads return 1's. */
+#define pci_conf_read8(...)     0xff
+#define pci_conf_read16(...)    0xffff
+#define pci_conf_read32(...)    0xffffffff
+#define pci_conf_write8(...)
+#define pci_conf_write16(...)
+#define pci_conf_write32(...)
+
+#define PCI_CFG_SPACE_EXP_SIZE 4096
+
+#define BUG() assert(0)
+#define ASSERT_UNREACHABLE() assert(0)
+
+#define min(x, y) ({                    \
+        const typeof(x) tx = (x);       \
+        const typeof(y) ty = (y);       \
+                                        \
+        (void) (&tx == &ty);            \
+        tx < ty ? tx : ty;              \
+})
+
+#define max(x, y) ({                    \
+        const typeof(x) tx = (x);       \
+        const typeof(y) ty = (y);       \
+                                        \
+        (void) (&tx == &ty);            \
+        tx > ty ? tx : ty;              \
+})
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/tests/vpci/main.c b/tools/tests/vpci/main.c
new file mode 100644
index 0000000000..b9a0a6006b
--- /dev/null
+++ b/tools/tests/vpci/main.c
@@ -0,0 +1,309 @@
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "emul.h"
+
+/* Single vcpu (current), and single domain with a single PCI device. */
+static struct vpci vpci;
+
+const static struct domain d;
+
+const struct pci_dev test_pdev = {
+    .vpci = &vpci,
+};
+
+const static struct vcpu v = {
+    .domain = &d
+};
+
+const struct vcpu *current = &v;
+
+/* Dummy hooks, write stores data, read fetches it. */
+static uint32_t vpci_read8(const struct pci_dev *pdev, unsigned int reg,
+                           void *data)
+{
+    return *(uint8_t *)data;
+}
+
+static void vpci_write8(const struct pci_dev *pdev, unsigned int reg,
+                        uint32_t val, void *data)
+{
+    *(uint8_t *)data = val;
+}
+
+static uint32_t vpci_read16(const struct pci_dev *pdev, unsigned int reg,
+                            void *data)
+{
+    return *(uint16_t *)data;
+}
+
+static void vpci_write16(const struct pci_dev *pdev, unsigned int reg,
+                         uint32_t val, void *data)
+{
+    *(uint16_t *)data = val;
+}
+
+static uint32_t vpci_read32(const struct pci_dev *pdev, unsigned int reg,
+                            void *data)
+{
+    return *(uint32_t *)data;
+}
+
+static void vpci_write32(const struct pci_dev *pdev, unsigned int reg,
+                         uint32_t val, void *data)
+{
+    *(uint32_t *)data = val;
+}
+
+#define VPCI_READ(reg, size, data) ({                           \
+    data = vpci_read((pci_sbdf_t){ .sbdf = 0 }, reg, size);     \
+})
+
+#define VPCI_READ_CHECK(reg, size, expected) ({                 \
+    uint32_t rd;                                                \
+                                                                \
+    VPCI_READ(reg, size, rd);                                   \
+    assert(rd == (expected));                                   \
+})
+
+#define VPCI_WRITE(reg, size, data) ({                          \
+    vpci_write((pci_sbdf_t){ .sbdf = 0 }, reg, size, data);     \
+})
+
+#define VPCI_WRITE_CHECK(reg, size, data) ({                    \
+    VPCI_WRITE(reg, size, data);                                \
+    VPCI_READ_CHECK(reg, size, data);                           \
+})
+
+#define VPCI_ADD_REG(fread, fwrite, off, size, store)                       \
+    assert(!vpci_add_register(test_pdev.vpci, fread, fwrite, off, size,     \
+                              &store))
+
+#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \
+    assert(vpci_add_register(test_pdev.vpci, fread, fwrite, off, size, NULL))
+
+#define VPCI_REMOVE_REG(off, size)                                          \
+    assert(!vpci_remove_register(test_pdev.vpci, off, size))
+
+#define VPCI_REMOVE_INVALID_REG(off, size)                                  \
+    assert(vpci_remove_register(test_pdev.vpci, off, size))
+
+/* Read a 32b register using all possible sizes. */
+void multiread4_check(unsigned int reg, uint32_t val)
+{
+    unsigned int i;
+
+    /* Read using bytes. */
+    for ( i = 0; i < 4; i++ )
+        VPCI_READ_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX);
+
+    /* Read using 2bytes. */
+    for ( i = 0; i < 2; i++ )
+        VPCI_READ_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX);
+
+    VPCI_READ_CHECK(reg, 4, val);
+}
+
+void multiwrite4_check(unsigned int reg)
+{
+    unsigned int i;
+    uint32_t val = 0xa2f51732;
+
+    /* Write using bytes. */
+    for ( i = 0; i < 4; i++ )
+        VPCI_WRITE_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX);
+    multiread4_check(reg, val);
+
+    /* Change the value each time to be sure writes work fine. */
+    val = 0x2b836fda;
+    /* Write using 2bytes. */
+    for ( i = 0; i < 2; i++ )
+        VPCI_WRITE_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX);
+    multiread4_check(reg, val);
+
+    val = 0xc4693beb;
+    VPCI_WRITE_CHECK(reg, 4, val);
+    multiread4_check(reg, val);
+}
+
+int
+main(int argc, char **argv)
+{
+    /* Index storage by offset. */
+    uint32_t r0 = 0xdeadbeef;
+    uint8_t r5 = 0xef;
+    uint8_t r6 = 0xbe;
+    uint8_t r7 = 0xef;
+    uint16_t r12 = 0x8696;
+    uint8_t r16[4] = { };
+    uint16_t r20[2] = { };
+    uint32_t r24 = 0;
+    uint8_t r28, r30;
+    unsigned int i;
+    int rc;
+
+    INIT_LIST_HEAD(&vpci.handlers);
+    spin_lock_init(&vpci.lock);
+
+    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
+    VPCI_READ_CHECK(0, 4, r0);
+    VPCI_WRITE_CHECK(0, 4, 0xbcbcbcbc);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
+    VPCI_READ_CHECK(5, 1, r5);
+    VPCI_WRITE_CHECK(5, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
+    VPCI_READ_CHECK(6, 1, r6);
+    VPCI_WRITE_CHECK(6, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
+    VPCI_READ_CHECK(7, 1, r7);
+    VPCI_WRITE_CHECK(7, 1, 0xbd);
+
+    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
+    VPCI_READ_CHECK(12, 2, r12);
+    VPCI_READ_CHECK(12, 4, 0xffff8696);
+
+    /*
+     * At this point we have the following layout:
+     *
+     * Note that this refers to the position of the variables,
+     * but the value has already changed from the one given at
+     * initialization time because write tests have been performed.
+     *
+     * 32    24    16     8     0
+     *  +-----+-----+-----+-----+
+     *  |          r0           | 0
+     *  +-----+-----+-----+-----+
+     *  | r7  |  r6 |  r5 |/////| 32
+     *  +-----+-----+-----+-----|
+     *  |///////////////////////| 64
+     *  +-----------+-----------+
+     *  |///////////|    r12    | 96
+     *  +-----------+-----------+
+     *             ...
+     *  / = unhandled.
+     */
+
+    /* Try to add an overlapping register handler. */
+    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
+
+    /* Try to add a non-aligned register. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
+
+    /* Try to add a register with wrong size. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
+
+    /* Try to add a register with missing handlers. */
+    VPCI_ADD_INVALID_REG(NULL, NULL, 8, 2);
+
+    /* Read/write of unset register. */
+    VPCI_READ_CHECK(8, 4, 0xffffffff);
+    VPCI_READ_CHECK(8, 2, 0xffff);
+    VPCI_READ_CHECK(8, 1, 0xff);
+    VPCI_WRITE(10, 2, 0xbeef);
+    VPCI_READ_CHECK(10, 2, 0xffff);
+
+    /* Read of multiple registers */
+    VPCI_WRITE_CHECK(7, 1, 0xbd);
+    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
+
+    /* Partial read of a register. */
+    VPCI_WRITE_CHECK(0, 4, 0x1a1b1c1d);
+    VPCI_READ_CHECK(2, 1, 0x1b);
+    VPCI_READ_CHECK(6, 2, 0xbdba);
+
+    /* Write of multiple registers. */
+    VPCI_WRITE_CHECK(4, 4, 0xaabbccff);
+
+    /* Partial write of a register. */
+    VPCI_WRITE_CHECK(2, 1, 0xfe);
+    VPCI_WRITE_CHECK(6, 2, 0xfebc);
+
+    /*
+     * Test all possible read/write size combinations.
+     *
+     * Place 4 1B registers at 128bits (16B), 2 2B registers at 160bits
+     * (20B) and finally 1 4B register at 192bits (24B).
+     *
+     * Then perform all possible write and read sizes on each of them.
+     *
+     *               ...
+     * 32     24     16      8      0
+     *  +------+------+------+------+
+     *  |r16[3]|r16[2]|r16[1]|r16[0]| 16
+     *  +------+------+------+------+
+     *  |    r20[1]   |    r20[0]   | 20
+     *  +-------------+-------------|
+     *  |            r24            | 24
+     *  +-------------+-------------+
+     *
+     */
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 16, 1, r16[0]);
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 17, 1, r16[1]);
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 18, 1, r16[2]);
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 19, 1, r16[3]);
+
+    VPCI_ADD_REG(vpci_read16, vpci_write16, 20, 2, r20[0]);
+    VPCI_ADD_REG(vpci_read16, vpci_write16, 22, 2, r20[1]);
+
+    VPCI_ADD_REG(vpci_read32, vpci_write32, 24, 4, r24);
+
+    /* Check the initial value is 0. */
+    multiread4_check(16, 0);
+    multiread4_check(20, 0);
+    multiread4_check(24, 0);
+
+    multiwrite4_check(16);
+    multiwrite4_check(20);
+    multiwrite4_check(24);
+
+    /*
+     * Check multiple non-consecutive gaps on the same read/write:
+     *
+     * 32     24     16      8      0
+     *  +------+------+------+------+
+     *  |//////|  r30 |//////|  r28 | 28
+     *  +------+------+------+------+
+     *
+     */
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 28, 1, r28);
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 30, 1, r30);
+    VPCI_WRITE_CHECK(28, 4, 0xffacffdc);
+
+    /* Finally try to remove a couple of registers. */
+    VPCI_REMOVE_REG(28, 1);
+    VPCI_REMOVE_REG(24, 4);
+    VPCI_REMOVE_REG(12, 2);
+
+    VPCI_REMOVE_INVALID_REG(20, 1);
+    VPCI_REMOVE_INVALID_REG(16, 2);
+    VPCI_REMOVE_INVALID_REG(30, 2);
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
index b0390180b4..49cae2af71 100644
--- a/xen/arch/arm/xen.lds.S
+++ b/xen/arch/arm/xen.lds.S
@@ -65,6 +65,13 @@ SECTIONS
        __param_start = .;
        *(.data.param)
        __param_end = .;
+
+#if defined(CONFIG_HAS_VPCI) && defined(CONFIG_LATE_HWDOM)
+       . = ALIGN(POINTER_ALIGN);
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
+#endif
   } :text
 
 #if defined(BUILD_ID)
@@ -171,6 +178,13 @@ SECTIONS
        *(.init_array)
        *(SORT(.init_array.*))
        __ctors_end = .;
+
+#if defined(CONFIG_HAS_VPCI) && !defined(CONFIG_LATE_HWDOM)
+       . = ALIGN(POINTER_ALIGN);
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
+#endif
   } :text
   __init_end_efi = .;
   . = ALIGN(STACK_SIZE);
diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index f621e799ed..c405c4bf4f 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -23,6 +23,7 @@ config X86
 	select HAS_PCI
 	select HAS_PDX
 	select HAS_UBSAN
+	select HAS_VPCI
 	select NUMA
 
 config ARCH_DEFCONFIG
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 4cac8906ea..fbb320da9c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -411,10 +411,12 @@ static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
     if ( is_hvm_domain(d) )
     {
         if ( is_hardware_domain(d) &&
-             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
+             emflags != (XEN_X86_EMU_VPCI | XEN_X86_EMU_LAPIC |
+                         XEN_X86_EMU_IOAPIC) )
             return false;
         if ( !is_hardware_domain(d) &&
-             emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC )
+             emflags != (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI) &&
+             emflags != XEN_X86_EMU_LAPIC )
             return false;
     }
     else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT )
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index b3a6e1f740..8b773fdcf9 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -36,6 +36,7 @@
 #include <xen/rangeset.h>
 #include <xen/monitor.h>
 #include <xen/warning.h>
+#include <xen/vpci.h>
 #include <asm/shadow.h>
 #include <asm/hap.h>
 #include <asm/current.h>
@@ -632,6 +633,7 @@ int hvm_domain_initialise(struct domain *d)
         d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
 
     register_g2m_portio_handler(d);
+    register_vpci_portio_handler(d);
 
     hvm_ioreq_init(d);
 
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 77f4c2ad41..6914bd6834 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -25,6 +25,7 @@
 #include <xen/trace.h>
 #include <xen/event.h>
 #include <xen/hypercall.h>
+#include <xen/vpci.h>
 #include <asm/current.h>
 #include <asm/cpufeature.h>
 #include <asm/processor.h>
@@ -278,6 +279,110 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
     return CF8_ADDR_LO(cf8) | (addr & 3);
 }
 
+/* Do some sanity checks. */
+static bool vpci_access_allowed(unsigned int reg, unsigned int len)
+{
+    /* Check access size. */
+    if ( len != 1 && len != 2 && len != 4 )
+        return false;
+
+    /* Check that access is size aligned. */
+    if ( (reg & (len - 1)) )
+        return false;
+
+    return true;
+}
+
+/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
+static bool vpci_portio_accept(const struct hvm_io_handler *handler,
+                               const ioreq_t *p)
+{
+    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & ~3) == 0xcfc;
+}
+
+static int vpci_portio_read(const struct hvm_io_handler *handler,
+                            uint64_t addr, uint32_t size, uint64_t *data)
+{
+    const struct domain *d = current->domain;
+    unsigned int reg;
+    pci_sbdf_t sbdf;
+    uint32_t cf8;
+
+    *data = ~(uint64_t)0;
+
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        *data = d->arch.hvm_domain.pci_cf8;
+        return X86EMUL_OKAY;
+    }
+
+    ASSERT((addr & ~3) == 0xcfc);
+    cf8 = ACCESS_ONCE(d->arch.hvm_domain.pci_cf8);
+    if ( !CF8_ENABLED(cf8) )
+        return X86EMUL_UNHANDLEABLE;
+
+    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
+
+    if ( !vpci_access_allowed(reg, size) )
+        return X86EMUL_OKAY;
+
+    *data = vpci_read(sbdf, reg, size);
+
+    return X86EMUL_OKAY;
+}
+
+static int vpci_portio_write(const struct hvm_io_handler *handler,
+                             uint64_t addr, uint32_t size, uint64_t data)
+{
+    struct domain *d = current->domain;
+    unsigned int reg;
+    pci_sbdf_t sbdf;
+    uint32_t cf8;
+
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        d->arch.hvm_domain.pci_cf8 = data;
+        return X86EMUL_OKAY;
+    }
+
+    ASSERT((addr & ~3) == 0xcfc);
+    cf8 = ACCESS_ONCE(d->arch.hvm_domain.pci_cf8);
+    if ( !CF8_ENABLED(cf8) )
+        return X86EMUL_UNHANDLEABLE;
+
+    reg = hvm_pci_decode_addr(cf8, addr, &sbdf);
+
+    if ( !vpci_access_allowed(reg, size) )
+        return X86EMUL_OKAY;
+
+    vpci_write(sbdf, reg, size, data);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_io_ops vpci_portio_ops = {
+    .accept = vpci_portio_accept,
+    .read = vpci_portio_read,
+    .write = vpci_portio_write,
+};
+
+void register_vpci_portio_handler(struct domain *d)
+{
+    struct hvm_io_handler *handler;
+
+    if ( !has_vpci(d) )
+        return;
+
+    handler = hvm_next_io_handler(d);
+    if ( !handler )
+        return;
+
+    handler->type = IOREQ_TYPE_PIO;
+    handler->ops = &vpci_portio_ops;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 3f6ecf4c32..c0b97a748a 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1639,7 +1639,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
                             XEN_DOMCTL_CDF_hap : 0));
 
         dom0_cfg.config.emulation_flags |=
-            XEN_X86_EMU_LAPIC | XEN_X86_EMU_IOAPIC;
+            XEN_X86_EMU_LAPIC | XEN_X86_EMU_IOAPIC | XEN_X86_EMU_VPCI;
     }
 
     /* Create initial domain 0. */
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index e9f2ecd9fb..7bd6fb51c3 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -135,6 +135,13 @@ SECTIONS
        __param_start = .;
        *(.data.param)
        __param_end = .;
+
+#if defined(CONFIG_HAS_VPCI) && defined(CONFIG_LATE_HWDOM)
+       . = ALIGN(POINTER_ALIGN);
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
+#endif
   } :text
 
 #if defined(CONFIG_PVH_GUEST) && !defined(EFI)
@@ -235,6 +242,13 @@ SECTIONS
        *(.init_array)
        *(SORT(.init_array.*))
        __ctors_end = .;
+
+#if defined(CONFIG_HAS_VPCI) && !defined(CONFIG_LATE_HWDOM)
+       . = ALIGN(POINTER_ALIGN);
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
+#endif
   } :text
 
   . = ALIGN(SECTION_ALIGN);
diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index bc3a54f0ea..db94393f47 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -12,4 +12,7 @@ source "drivers/pci/Kconfig"
 
 source "drivers/video/Kconfig"
 
+config HAS_VPCI
+	bool
+
 endmenu
diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile
index 19391802a8..30bab3cfdb 100644
--- a/xen/drivers/Makefile
+++ b/xen/drivers/Makefile
@@ -1,6 +1,7 @@
 subdir-y += char
 subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq
 subdir-$(CONFIG_HAS_PCI) += pci
+subdir-$(CONFIG_HAS_VPCI) += vpci
 subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough
 subdir-$(CONFIG_ACPI) += acpi
 subdir-$(CONFIG_VIDEO) += video
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 2b976ade62..e65c7faa6f 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -31,6 +31,7 @@
 #include <xen/radix-tree.h>
 #include <xen/softirq.h>
 #include <xen/tasklet.h>
+#include <xen/vpci.h>
 #include <xsm/xsm.h>
 #include <asm/msi.h>
 #include "ats.h"
@@ -1050,10 +1051,10 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
                                                 struct pci_dev *pdev)
 {
     u8 devfn = pdev->devfn;
+    int err;
 
     do {
-        int err = ctxt->handler(devfn, pdev);
-
+        err = ctxt->handler(devfn, pdev);
         if ( err )
         {
             printk(XENLOG_ERR "setup %04x:%02x:%02x.%u for d%d failed (%d)\n",
@@ -1065,6 +1066,11 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
         devfn += pdev->phantom_stride;
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
+
+    err = vpci_add_handlers(pdev);
+    if ( err )
+        printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
+               ctxt->d->domain_id, err);
 }
 
 static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg)
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
new file mode 100644
index 0000000000..840a906470
--- /dev/null
+++ b/xen/drivers/vpci/Makefile
@@ -0,0 +1 @@
+obj-y += vpci.o
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
new file mode 100644
index 0000000000..4740d02edf
--- /dev/null
+++ b/xen/drivers/vpci/vpci.c
@@ -0,0 +1,459 @@
+/*
+ * Generic functionality for handling accesses to the PCI configuration space
+ * from guests.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+
+extern vpci_register_init_t *const __start_vpci_array[];
+extern vpci_register_init_t *const __end_vpci_array[];
+#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
+
+/* Internal struct to store the emulated PCI registers. */
+struct vpci_register {
+    vpci_read_t *read;
+    vpci_write_t *write;
+    unsigned int size;
+    unsigned int offset;
+    void *private;
+    struct list_head node;
+};
+
+int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
+{
+    unsigned int i;
+    int rc = 0;
+
+    if ( !has_vpci(pdev->domain) )
+        return 0;
+
+    pdev->vpci = xzalloc(struct vpci);
+    if ( !pdev->vpci )
+        return -ENOMEM;
+
+    INIT_LIST_HEAD(&pdev->vpci->handlers);
+    spin_lock_init(&pdev->vpci->lock);
+
+    for ( i = 0; i < NUM_VPCI_INIT; i++ )
+    {
+        rc = __start_vpci_array[i](pdev);
+        if ( rc )
+            break;
+    }
+
+    if ( rc )
+    {
+        while ( !list_empty(&pdev->vpci->handlers) )
+        {
+            struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
+                                                       struct vpci_register,
+                                                       node);
+
+            list_del(&r->node);
+            xfree(r);
+        }
+        xfree(pdev->vpci);
+        pdev->vpci = NULL;
+    }
+
+    return rc;
+}
+
+static int vpci_register_cmp(const struct vpci_register *r1,
+                             const struct vpci_register *r2)
+{
+    /* Return 0 if registers overlap. */
+    if ( r1->offset < r2->offset + r2->size &&
+         r2->offset < r1->offset + r1->size )
+        return 0;
+    if ( r1->offset < r2->offset )
+        return -1;
+    if ( r1->offset > r2->offset )
+        return 1;
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
+/* Dummy hooks, writes are ignored, reads return 1's */
+static uint32_t vpci_ignored_read(const struct pci_dev *pdev, unsigned int reg,
+                                  void *data)
+{
+    return ~(uint32_t)0;
+}
+
+static void vpci_ignored_write(const struct pci_dev *pdev, unsigned int reg,
+                               uint32_t val, void *data)
+{
+}
+
+int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
+                      vpci_write_t *write_handler, unsigned int offset,
+                      unsigned int size, void *data)
+{
+    struct list_head *prev;
+    struct vpci_register *r;
+
+    /* Some sanity checks. */
+    if ( (size != 1 && size != 2 && size != 4) ||
+         offset >= PCI_CFG_SPACE_EXP_SIZE || (offset & (size - 1)) ||
+         (!read_handler && !write_handler) )
+        return -EINVAL;
+
+    r = xmalloc(struct vpci_register);
+    if ( !r )
+        return -ENOMEM;
+
+    r->read = read_handler ?: vpci_ignored_read;
+    r->write = write_handler ?: vpci_ignored_write;
+    r->size = size;
+    r->offset = offset;
+    r->private = data;
+
+    spin_lock(&vpci->lock);
+
+    /* The list of handlers must be kept sorted at all times. */
+    list_for_each ( prev, &vpci->handlers )
+    {
+        const struct vpci_register *this =
+            list_entry(prev, const struct vpci_register, node);
+        int cmp = vpci_register_cmp(r, this);
+
+        if ( cmp < 0 )
+            break;
+        if ( cmp == 0 )
+        {
+            spin_unlock(&vpci->lock);
+            xfree(r);
+            return -EEXIST;
+        }
+    }
+
+    list_add_tail(&r->node, prev);
+    spin_unlock(&vpci->lock);
+
+    return 0;
+}
+
+int vpci_remove_register(struct vpci *vpci, unsigned int offset,
+                         unsigned int size)
+{
+    const struct vpci_register r = { .offset = offset, .size = size };
+    struct vpci_register *rm;
+
+    spin_lock(&vpci->lock);
+    list_for_each_entry ( rm, &vpci->handlers, node )
+    {
+        int cmp = vpci_register_cmp(&r, rm);
+
+        /*
+         * NB: do not use a switch so that we can use break to
+         * get out of the list loop earlier if required.
+         */
+        if ( !cmp && rm->offset == offset && rm->size == size )
+        {
+            list_del(&rm->node);
+            spin_unlock(&vpci->lock);
+            xfree(rm);
+            return 0;
+        }
+        if ( cmp <= 0 )
+            break;
+    }
+    spin_unlock(&vpci->lock);
+
+    return -ENOENT;
+}
+
+/* Wrappers for performing reads/writes to the underlying hardware. */
+static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
+                             unsigned int size)
+{
+    uint32_t data;
+
+    switch ( size )
+    {
+    case 4:
+        data = pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg);
+        break;
+
+    case 3:
+        /*
+         * This is possible because a 4byte read can have 1byte trapped and
+         * the rest passed-through.
+         */
+        if ( reg & 1 )
+        {
+            data = pci_conf_read8(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func,
+                                  reg);
+            data |= pci_conf_read16(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func,
+                                    reg + 1) << 8;
+        }
+        else
+        {
+            data = pci_conf_read16(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func,
+                                   reg);
+            data |= pci_conf_read8(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func,
+                                   reg + 2) << 16;
+        }
+        break;
+
+    case 2:
+        data = pci_conf_read16(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg);
+        break;
+
+    case 1:
+        data = pci_conf_read8(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg);
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+        data = ~(uint32_t)0;
+        break;
+    }
+
+    return data;
+}
+
+static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
+                          uint32_t data)
+{
+    switch ( size )
+    {
+    case 4:
+        pci_conf_write32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg, data);
+        break;
+
+    case 3:
+        /*
+         * This is possible because a 4byte write can have 1byte trapped and
+         * the rest passed-through.
+         */
+        if ( reg & 1 )
+        {
+            pci_conf_write8(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg,
+                            data);
+            pci_conf_write16(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg + 1,
+                             data >> 8);
+        }
+        else
+        {
+            pci_conf_write16(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg,
+                             data);
+            pci_conf_write8(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg + 2,
+                            data >> 16);
+        }
+        break;
+
+    case 2:
+        pci_conf_write16(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg, data);
+        break;
+
+    case 1:
+        pci_conf_write8(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, reg, data);
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
+/*
+ * Merge new data into a partial result.
+ *
+ * Copy the value found in 'new' from [0, size) left shifted by
+ * 'offset' into 'data'. Note that both 'size' and 'offset' are
+ * in byte units.
+ */
+static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
+                             unsigned int offset)
+{
+    uint32_t mask = 0xffffffff >> (32 - 8 * size);
+
+    return (data & ~(mask << (offset * 8))) | ((new & mask) << (offset * 8));
+}
+
+uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
+{
+    const struct domain *d = current->domain;
+    const struct pci_dev *pdev;
+    const struct vpci_register *r;
+    unsigned int data_offset = 0;
+    uint32_t data = ~(uint32_t)0;
+
+    /* Find the PCI dev matching the address. */
+    pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.extfunc);
+    if ( !pdev )
+        return vpci_read_hw(sbdf, reg, size);
+
+    spin_lock(&pdev->vpci->lock);
+
+    /* Read from the hardware or the emulated register handlers. */
+    list_for_each_entry ( r, &pdev->vpci->handlers, node )
+    {
+        const struct vpci_register emu = {
+            .offset = reg + data_offset,
+            .size = size - data_offset
+        };
+        int cmp = vpci_register_cmp(&emu, r);
+        uint32_t val;
+        unsigned int read_size;
+
+        if ( cmp < 0 )
+            break;
+        if ( cmp > 0 )
+            continue;
+
+        if ( emu.offset < r->offset )
+        {
+            /* Heading gap, read partial content from hardware. */
+            read_size = r->offset - emu.offset;
+            val = vpci_read_hw(sbdf, emu.offset, read_size);
+            data = merge_result(data, val, read_size, data_offset);
+            data_offset += read_size;
+        }
+
+        val = r->read(pdev, r->offset, r->private);
+
+        /* Check if the read is in the middle of a register. */
+        if ( r->offset < emu.offset )
+            val >>= (emu.offset - r->offset) * 8;
+
+        /* Find the intersection size between the two sets. */
+        read_size = min(emu.offset + emu.size, r->offset + r->size) -
+                    max(emu.offset, r->offset);
+        /* Merge the emulated data into the native read value. */
+        data = merge_result(data, val, read_size, data_offset);
+        data_offset += read_size;
+        if ( data_offset == size )
+            break;
+        ASSERT(data_offset < size);
+    }
+
+    if ( data_offset < size )
+    {
+        /* Tailing gap, read the remaining. */
+        uint32_t tmp_data = vpci_read_hw(sbdf, reg + data_offset,
+                                         size - data_offset);
+
+        data = merge_result(data, tmp_data, size - data_offset, data_offset);
+    }
+    spin_unlock(&pdev->vpci->lock);
+
+    return data & (0xffffffff >> (32 - 8 * size));
+}
+
+/*
+ * Perform a maybe partial write to a register.
+ *
+ * Note that this will only work for simple registers, if Xen needs to
+ * trap accesses to rw1c registers (like the status PCI header register)
+ * the logic in vpci_write will have to be expanded in order to correctly
+ * deal with them.
+ */
+static void vpci_write_helper(const struct pci_dev *pdev,
+                              const struct vpci_register *r, unsigned int size,
+                              unsigned int offset, uint32_t data)
+{
+    ASSERT(size <= r->size);
+
+    if ( size != r->size )
+    {
+        uint32_t val;
+
+        val = r->read(pdev, r->offset, r->private);
+        data = merge_result(val, data, size, offset);
+    }
+
+    r->write(pdev, r->offset, data & (0xffffffff >> (32 - 8 * r->size)),
+             r->private);
+}
+
+void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
+                uint32_t data)
+{
+    const struct domain *d = current->domain;
+    const struct pci_dev *pdev;
+    const struct vpci_register *r;
+    unsigned int data_offset = 0;
+
+    /*
+     * Find the PCI dev matching the address.
+     * Passthrough everything that's not trapped.
+     */
+    pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.extfunc);
+    if ( !pdev )
+    {
+        vpci_write_hw(sbdf, reg, size, data);
+        return;
+    }
+
+    spin_lock(&pdev->vpci->lock);
+
+    /* Write the value to the hardware or emulated registers. */
+    list_for_each_entry ( r, &pdev->vpci->handlers, node )
+    {
+        const struct vpci_register emu = {
+            .offset = reg + data_offset,
+            .size = size - data_offset
+        };
+        int cmp = vpci_register_cmp(&emu, r);
+        unsigned int write_size;
+
+        if ( cmp < 0 )
+            break;
+        if ( cmp > 0 )
+            continue;
+
+        if ( emu.offset < r->offset )
+        {
+            /* Heading gap, write partial content to hardware. */
+            vpci_write_hw(sbdf, emu.offset, r->offset - emu.offset,
+                          data >> (data_offset * 8));
+            data_offset += r->offset - emu.offset;
+        }
+
+        /* Find the intersection size between the two sets. */
+        write_size = min(emu.offset + emu.size, r->offset + r->size) -
+                     max(emu.offset, r->offset);
+        vpci_write_helper(pdev, r, write_size, reg + data_offset - r->offset,
+                          data >> (data_offset * 8));
+        data_offset += write_size;
+        if ( data_offset == size )
+            break;
+        ASSERT(data_offset < size);
+    }
+
+    if ( data_offset < size )
+        /* Tailing gap, write the remaining. */
+        vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
+                      data >> (data_offset * 8));
+
+    spin_unlock(&pdev->vpci->lock);
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 47aadc2600..a12ae47f1b 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -434,6 +434,7 @@ struct arch_domain
 #define has_vpit(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_PIT))
 #define has_pirq(d)        (!!((d)->arch.emulation_flags & \
                             XEN_X86_EMU_USE_PIRQ))
+#define has_vpci(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_VPCI))
 
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
 
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 707665fbba..ff0bea5d53 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -160,6 +160,9 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
  */
 void register_g2m_portio_handler(struct domain *d);
 
+/* HVM port IO handler for vPCI accesses. */
+void register_vpci_portio_handler(struct domain *d);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index 3b0b1d6073..69ee4bc40d 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -294,12 +294,15 @@ struct xen_arch_domainconfig {
 #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
 #define _XEN_X86_EMU_USE_PIRQ       9
 #define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
+#define _XEN_X86_EMU_VPCI           10
+#define XEN_X86_EMU_VPCI            (1U<<_XEN_X86_EMU_VPCI)
 
 #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
                                      XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
                                      XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
                                      XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
-                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
+                                     XEN_X86_EMU_VPCI)
     uint32_t emulation_flags;
 };
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index dd5ec43a70..b7a6abfc53 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -112,6 +112,9 @@ struct pci_dev {
 #define PT_FAULT_THRESHOLD 10
     } fault;
     u64 vf_rlen[6];
+
+    /* Data for vPCI. */
+    struct vpci *vpci;
 };
 
 #define for_each_pdev(domain, pdev) \
diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h
index ecd6124d91..cc4ee3b83e 100644
--- a/xen/include/xen/pci_regs.h
+++ b/xen/include/xen/pci_regs.h
@@ -22,6 +22,14 @@
 #ifndef LINUX_PCI_REGS_H
 #define LINUX_PCI_REGS_H
 
+/*
+ * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
+ * configuration space.  PCI-X Mode 2 and PCIe devices have 4096 bytes of
+ * configuration space.
+ */
+#define PCI_CFG_SPACE_SIZE	256
+#define PCI_CFG_SPACE_EXP_SIZE	4096
+
 /*
  * Under PCI, each device has 256 bytes of configuration address space,
  * of which the first 64 bytes are standardized as follows:
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
new file mode 100644
index 0000000000..9f2864fb0c
--- /dev/null
+++ b/xen/include/xen/vpci.h
@@ -0,0 +1,53 @@
+#ifndef _XEN_VPCI_H_
+#define _XEN_VPCI_H_
+
+#include <xen/pci.h>
+#include <xen/types.h>
+#include <xen/list.h>
+
+typedef uint32_t vpci_read_t(const struct pci_dev *pdev, unsigned int reg,
+                             void *data);
+
+typedef void vpci_write_t(const struct pci_dev *pdev, unsigned int reg,
+                          uint32_t val, void *data);
+
+typedef int vpci_register_init_t(struct pci_dev *dev);
+
+#define REGISTER_VPCI_INIT(x)                   \
+  static vpci_register_init_t *const x##_entry  \
+               __used_section(".data.vpci") = x
+
+/* Add vPCI handlers to device. */
+int __must_check vpci_add_handlers(struct pci_dev *dev);
+
+/* Add/remove a register handler. */
+int __must_check vpci_add_register(struct vpci *vpci,
+                                   vpci_read_t *read_handler,
+                                   vpci_write_t *write_handler,
+                                   unsigned int offset, unsigned int size,
+                                   void *data);
+int __must_check vpci_remove_register(struct vpci *vpci, unsigned int offset,
+                                      unsigned int size);
+
+/* Generic read/write handlers for the PCI config space. */
+uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
+void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
+                uint32_t data);
+
+struct vpci {
+    /* List of vPCI handlers for a device. */
+    struct list_head handlers;
+    spinlock_t lock;
+};
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.16.2
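
For readers following the gap handling in vpci_read() and vpci_write()
above, the byte-merge step can be tried in isolation. The sketch below
copies merge_result() from xen/drivers/vpci/vpci.c in this patch into a
standalone file; the assert() on 'size' is an addition for illustration
only and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Copy the low 'size' bytes of 'new' into 'data', starting at byte
 * 'offset'. Both 'size' and 'offset' are in byte units.
 */
static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
                             unsigned int offset)
{
    uint32_t mask;

    /* Illustrative guard: a size of 0 would shift by 32 bits below. */
    assert(size >= 1 && size <= 4 && offset + size <= 4);

    mask = 0xffffffff >> (32 - 8 * size);

    return (data & ~(mask << (offset * 8))) | ((new & mask) << (offset * 8));
}
```

For instance, merging a one-byte value 0xab at byte offset 2 into an
all-ones dword gives 0xffabffff, which is how a single trapped byte ends
up inside an otherwise passed-through 4-byte read.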



* [PATCH v11 02/12] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 01/12] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 03/12] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, Boris Ostrovsky,
	Roger Pau Monne

Introduce a set of handlers for accesses to the MMCFG areas. Those
areas are set up based on the contents of the hardware MMCFG tables,
and the list of handled MMCFG areas is stored inside the hvm_domain
struct.

Reads and writes are forwarded to the generic vpci handlers once the
address has been decoded to obtain the device and register the guest
is trying to access.
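
As a rough illustration of the decoding step, the sketch below splits an
offset into an MMCFG (ECAM) window into bus, device, function and
register, following the standard PCIe layout (bits 27:20 bus, 19:15
device, 14:12 function, 11:0 register). The names ecam_target and
ecam_decode are hypothetical and do not appear in this series, and the
sketch assumes the offset is relative to a window whose first bus is
start_bus.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type, used only for this illustration. */
struct ecam_target {
    uint8_t bus;
    uint8_t dev;   /* 0..31 */
    uint8_t func;  /* 0..7 */
    uint16_t reg;  /* 0..4095 */
};

/*
 * Decode 'offset' (bytes from the start of an MMCFG window) into the
 * device and register being addressed. 'start_bus' is the first bus
 * number covered by the window (an assumption of this sketch).
 */
static struct ecam_target ecam_decode(uint64_t offset, uint8_t start_bus)
{
    struct ecam_target t;

    /* A window covers at most 256 buses of 1MiB each. */
    assert(offset < (UINT64_C(256) << 20));

    t.bus = (uint8_t)(((offset >> 20) & 0xff) + start_bus);
    t.dev = (offset >> 15) & 0x1f;
    t.func = (offset >> 12) & 0x7;
    t.reg = offset & 0xfff;

    return t;
}
```

With start_bus 0, offset 0x10a004 decodes to bus 1, device 1, function 2,
register 4.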

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v7:
 - Add check for end_bus >= start_bus to register_vpci_mmcfg_handler.
 - Protect destroy_vpci_mmcfg with the mmcfg_lock.

Changes since v6:
 - Move allocation of mmcfg outside of the locked region.
 - Do proper overlap checks when adding mmcfg regions.
 - Return _RETRY if the mcfg region cannot be found in the read/write
   handlers. This means the mcfg area has been removed between the
   accept and the read/write calls.

Changes since v5:
 - Switch to use pci_sbdf_t.
 - Switch to the new per vpci locks.
 - Move the mmcfg related external definitions to asm-x86/pci.h.

Changes since v4:
 - Change the attribute of pvh_setup_mmcfg to __hwdom_init.
 - Try to add as many MMCFG regions as possible, even if one fails to
   add.
 - Change some fields of the hvm_mmcfg struct: turn size into a
   unsigned int, segment into uint16_t and bus into uint8_t.
 - Convert some address parameters from unsigned long to paddr_t for
   consistency.
 - Make vpci_mmcfg_decode_addr return the decoded register in the
   return of the function.
 - Introduce a new macro to convert a MMCFG address into a BDF, and
   use it in vpci_mmcfg_decode_addr to clarify the logic.
 - In vpci_mmcfg_{read/write} unify the logic for 8B accesses and
   smaller ones.
 - Add the __hwdom_init attribute to register_vpci_mmcfg_handler.
 - Test that reg + size doesn't cross a device boundary.

Changes since v3:
 - Propagate changes from previous patches: drop xen_ prefix for vpci
   functions, pass slot and func instead of devfn and fix the error
   paths of the MMCFG handlers.
 - s/ecam/mmcfg/.
 - Move the destroy code to a separate function, so the hvm_mmcfg
   struct can be private to hvm/io.c.
 - Constify the return of vpci_mmcfg_find.
 - Use d instead of v->domain in vpci_mmcfg_accept.
 - Allow 8byte accesses to the mmcfg.

Changes since v1:
 - Added locking.
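
The bus range and overlap checks mentioned in the v7 and v6 notes above
can be sketched as follows. This is not the code from the patch: the
helper names bus_range_valid and regions_overlap are hypothetical, and
the overlap test simply reuses the half-open interval comparison that
vpci_register_cmp() applies to registers in patch 01.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: a bus range must be ordered and fit in 8 bits. */
static bool bus_range_valid(unsigned int start_bus, unsigned int end_bus)
{
    return start_bus <= end_bus && end_bus <= 255;
}

/*
 * Hypothetical helper: true if [a_start, a_start + a_size) and
 * [b_start, b_start + b_size) share at least one byte.
 */
static bool regions_overlap(uint64_t a_start, uint64_t a_size,
                            uint64_t b_start, uint64_t b_size)
{
    assert(a_size && b_size);

    return a_start < b_start + b_size && b_start < a_start + a_size;
}
```

Two adjacent 1MiB regions do not overlap under this test, while a region
nested inside another does.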
---
 xen/arch/x86/hvm/dom0_build.c    |  21 +++++
 xen/arch/x86/hvm/hvm.c           |   4 +
 xen/arch/x86/hvm/io.c            | 184 ++++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/x86_64/mmconfig.h   |   4 -
 xen/include/asm-x86/hvm/domain.h |   4 +
 xen/include/asm-x86/hvm/io.h     |   7 ++
 xen/include/asm-x86/pci.h        |   6 ++
 7 files changed, 225 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 1c70416af4..259814d95d 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -22,6 +22,7 @@
 #include <xen/init.h>
 #include <xen/libelf.h>
 #include <xen/multiboot.h>
+#include <xen/pci.h>
 #include <xen/softirq.h>
 
 #include <acpi/actables.h>
@@ -1055,6 +1056,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
     return 0;
 }
 
+static void __hwdom_init pvh_setup_mmcfg(struct domain *d)
+{
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i < pci_mmcfg_config_num; i++ )
+    {
+        rc = register_vpci_mmcfg_handler(d, pci_mmcfg_config[i].address,
+                                         pci_mmcfg_config[i].start_bus_number,
+                                         pci_mmcfg_config[i].end_bus_number,
+                                         pci_mmcfg_config[i].pci_segment);
+        if ( rc )
+            printk("Unable to setup MMCFG handler at %#lx for segment %u\n",
+                   pci_mmcfg_config[i].address,
+                   pci_mmcfg_config[i].pci_segment);
+    }
+}
+
 int __init dom0_construct_pvh(struct domain *d, const module_t *image,
                               unsigned long image_headroom,
                               module_t *initrd,
@@ -1096,6 +1115,8 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
         return rc;
     }
 
+    pvh_setup_mmcfg(d);
+
     panic("Building a PVHv2 Dom0 is not yet supported.");
     return 0;
 }
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 8b773fdcf9..0afb651b7f 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -584,8 +584,10 @@ int hvm_domain_initialise(struct domain *d)
     spin_lock_init(&d->arch.hvm_domain.irq_lock);
     spin_lock_init(&d->arch.hvm_domain.uc_lock);
     spin_lock_init(&d->arch.hvm_domain.write_map.lock);
+    rwlock_init(&d->arch.hvm_domain.mmcfg_lock);
     INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
+    INIT_LIST_HEAD(&d->arch.hvm_domain.mmcfg_regions);
 
     rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
     if ( rc )
@@ -731,6 +733,8 @@ void hvm_domain_destroy(struct domain *d)
         list_del(&ioport->list);
         xfree(ioport);
     }
+
+    destroy_vpci_mmcfg(d);
 }
 
 static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h)
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 6914bd6834..04425c064b 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -283,7 +283,7 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
 static bool vpci_access_allowed(unsigned int reg, unsigned int len)
 {
     /* Check access size. */
-    if ( len != 1 && len != 2 && len != 4 )
+    if ( len != 1 && len != 2 && len != 4 && len != 8 )
         return false;
 
     /* Check that access is size aligned. */
@@ -383,6 +383,188 @@ void register_vpci_portio_handler(struct domain *d)
     handler->ops = &vpci_portio_ops;
 }
 
+struct hvm_mmcfg {
+    struct list_head next;
+    paddr_t addr;
+    unsigned int size;
+    uint16_t segment;
+    uint8_t start_bus;
+};
+
+/* Handlers to trap PCI MMCFG config accesses. */
+static const struct hvm_mmcfg *vpci_mmcfg_find(const struct domain *d,
+                                               paddr_t addr)
+{
+    const struct hvm_mmcfg *mmcfg;
+
+    list_for_each_entry ( mmcfg, &d->arch.hvm_domain.mmcfg_regions, next )
+        if ( addr >= mmcfg->addr && addr < mmcfg->addr + mmcfg->size )
+            return mmcfg;
+
+    return NULL;
+}
+
+static unsigned int vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
+                                           paddr_t addr, pci_sbdf_t *sbdf)
+{
+    addr -= mmcfg->addr;
+    sbdf->bdf = MMCFG_BDF(addr);
+    sbdf->bus += mmcfg->start_bus;
+    sbdf->seg = mmcfg->segment;
+
+    return addr & (PCI_CFG_SPACE_EXP_SIZE - 1);
+}
+
+static int vpci_mmcfg_accept(struct vcpu *v, unsigned long addr)
+{
+    struct domain *d = v->domain;
+    bool found;
+
+    read_lock(&d->arch.hvm_domain.mmcfg_lock);
+    found = vpci_mmcfg_find(d, addr);
+    read_unlock(&d->arch.hvm_domain.mmcfg_lock);
+
+    return found;
+}
+
+static int vpci_mmcfg_read(struct vcpu *v, unsigned long addr,
+                           unsigned int len, unsigned long *data)
+{
+    struct domain *d = v->domain;
+    const struct hvm_mmcfg *mmcfg;
+    unsigned int reg;
+    pci_sbdf_t sbdf;
+
+    *data = ~0ul;
+
+    read_lock(&d->arch.hvm_domain.mmcfg_lock);
+    mmcfg = vpci_mmcfg_find(d, addr);
+    if ( !mmcfg )
+    {
+        read_unlock(&d->arch.hvm_domain.mmcfg_lock);
+        return X86EMUL_RETRY;
+    }
+
+    reg = vpci_mmcfg_decode_addr(mmcfg, addr, &sbdf);
+    read_unlock(&d->arch.hvm_domain.mmcfg_lock);
+
+    if ( !vpci_access_allowed(reg, len) ||
+         (reg + len) > PCI_CFG_SPACE_EXP_SIZE )
+        return X86EMUL_OKAY;
+
+    /*
+     * According to the PCIe 3.1A specification:
+     *  - Configuration Reads and Writes must usually be DWORD or smaller
+     *    in size.
+     *  - Because Root Complex implementations are not required to support
+     *    accesses to a RCRB that cross DW boundaries [...] software
+     *    should take care not to cause the generation of such accesses
+     *    when accessing a RCRB unless the Root Complex will support the
+     *    access.
+     *  Xen however supports 8byte accesses by splitting them into two
+     *  4byte accesses.
+     */
+    *data = vpci_read(sbdf, reg, min(4u, len));
+    if ( len == 8 )
+        *data |= (uint64_t)vpci_read(sbdf, reg + 4, 4) << 32;
+
+    return X86EMUL_OKAY;
+}
+
+static int vpci_mmcfg_write(struct vcpu *v, unsigned long addr,
+                            unsigned int len, unsigned long data)
+{
+    struct domain *d = v->domain;
+    const struct hvm_mmcfg *mmcfg;
+    unsigned int reg;
+    pci_sbdf_t sbdf;
+
+    read_lock(&d->arch.hvm_domain.mmcfg_lock);
+    mmcfg = vpci_mmcfg_find(d, addr);
+    if ( !mmcfg )
+    {
+        read_unlock(&d->arch.hvm_domain.mmcfg_lock);
+        return X86EMUL_RETRY;
+    }
+
+    reg = vpci_mmcfg_decode_addr(mmcfg, addr, &sbdf);
+    read_unlock(&d->arch.hvm_domain.mmcfg_lock);
+
+    if ( !vpci_access_allowed(reg, len) ||
+         (reg + len) > PCI_CFG_SPACE_EXP_SIZE )
+        return X86EMUL_OKAY;
+
+    vpci_write(sbdf, reg, min(4u, len), data);
+    if ( len == 8 )
+        vpci_write(sbdf, reg + 4, 4, data >> 32);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops vpci_mmcfg_ops = {
+    .check = vpci_mmcfg_accept,
+    .read = vpci_mmcfg_read,
+    .write = vpci_mmcfg_write,
+};
+
+int __hwdom_init register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
+                                             unsigned int start_bus,
+                                             unsigned int end_bus,
+                                             unsigned int seg)
+{
+    struct hvm_mmcfg *mmcfg, *new = xmalloc(struct hvm_mmcfg);
+
+    ASSERT(is_hardware_domain(d));
+
+    if ( !new )
+        return -ENOMEM;
+
+    if ( start_bus > end_bus )
+    {
+        xfree(new);
+        return -EINVAL;
+    }
+
+    new->addr = addr + (start_bus << 20);
+    new->start_bus = start_bus;
+    new->segment = seg;
+    new->size = (end_bus - start_bus + 1) << 20;
+
+    write_lock(&d->arch.hvm_domain.mmcfg_lock);
+    list_for_each_entry ( mmcfg, &d->arch.hvm_domain.mmcfg_regions, next )
+        if ( new->addr < mmcfg->addr + mmcfg->size &&
+             mmcfg->addr < new->addr + new->size )
+        {
+            write_unlock(&d->arch.hvm_domain.mmcfg_lock);
+            xfree(new);
+            return -EEXIST;
+        }
+
+    if ( list_empty(&d->arch.hvm_domain.mmcfg_regions) )
+        register_mmio_handler(d, &vpci_mmcfg_ops);
+
+    list_add(&new->next, &d->arch.hvm_domain.mmcfg_regions);
+    write_unlock(&d->arch.hvm_domain.mmcfg_lock);
+
+    return 0;
+}
+
+void destroy_vpci_mmcfg(struct domain *d)
+{
+    struct list_head *mmcfg_regions = &d->arch.hvm_domain.mmcfg_regions;
+
+    write_lock(&d->arch.hvm_domain.mmcfg_lock);
+    while ( !list_empty(mmcfg_regions) )
+    {
+        struct hvm_mmcfg *mmcfg = list_first_entry(mmcfg_regions,
+                                                   struct hvm_mmcfg, next);
+
+        list_del(&mmcfg->next);
+        xfree(mmcfg);
+    }
+    write_unlock(&d->arch.hvm_domain.mmcfg_lock);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/x86_64/mmconfig.h b/xen/arch/x86/x86_64/mmconfig.h
index 7537519414..2e836848ad 100644
--- a/xen/arch/x86/x86_64/mmconfig.h
+++ b/xen/arch/x86/x86_64/mmconfig.h
@@ -74,10 +74,6 @@ static inline void mmio_config_writel(void __iomem *pos, u32 val)
     asm volatile("movl %%eax,(%1)" :: "a" (val), "r" (pos) : "memory");
 }
 
-/* external variable defines */
-extern int pci_mmcfg_config_num;
-extern struct acpi_mcfg_allocation *pci_mmcfg_config;
-
 /* function prototypes */
 int acpi_parse_mcfg(struct acpi_table_header *header);
 int pci_mmcfg_reserved(uint64_t address, unsigned int segment,
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index 7f128c05ff..d1d933d791 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -184,6 +184,10 @@ struct hvm_domain {
     /* List of guest to machine IO ports mapping. */
     struct list_head g2m_ioport_list;
 
+    /* List of MMCFG regions trapped by Xen. */
+    struct list_head mmcfg_regions;
+    rwlock_t mmcfg_lock;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index ff0bea5d53..16465ceb30 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -163,6 +163,13 @@ void register_g2m_portio_handler(struct domain *d);
 /* HVM port IO handler for vPCI accesses. */
 void register_vpci_portio_handler(struct domain *d);
 
+/* HVM MMIO handler for PCI MMCFG accesses. */
+int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
+                                unsigned int start_bus, unsigned int end_bus,
+                                unsigned int seg);
+/* Destroy tracked MMCFG areas. */
+void destroy_vpci_mmcfg(struct domain *d);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/include/asm-x86/pci.h b/xen/include/asm-x86/pci.h
index 36801d317b..cc05045e9c 100644
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -6,6 +6,8 @@
 #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
 #define CF8_ENABLED(cf8) (!!((cf8) & 0x80000000))
 
+#define MMCFG_BDF(addr)  ( ((addr) & 0x0ffff000) >> 12)
+
 #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \
                         || id == 0x01268086 || id == 0x01028086 \
                         || id == 0x01128086 || id == 0x01228086 \
@@ -26,4 +28,8 @@ bool_t pci_mmcfg_decode(unsigned long mfn, unsigned int *seg,
 bool_t pci_ro_mmcfg_decode(unsigned long mfn, unsigned int *seg,
                            unsigned int *bdf);
 
+/* MMCFG external variable defines */
+extern int pci_mmcfg_config_num;
+extern struct acpi_mcfg_allocation *pci_mmcfg_config;
+
 #endif /* __X86_PCI_H__ */
-- 
2.16.2


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* [PATCH v11 03/12] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 01/12] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 02/12] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 04/12] pci: split code to size BARs from pci_add_device Roger Pau Monne
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, Boris Ostrovsky,
	Roger Pau Monne

This allows the hardware domain to add, at run time, MMCFG regions that
are not present in the MCFG ACPI table.
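
The registration check behind this can be sketched as follows. This is a simplified standalone model, not the Xen code itself: `struct region` and `mmcfg_check` are hypothetical names, and the return value 1 for "no conflict" is an assumption made for the sketch. An identical region is accepted silently, while a partial overlap is rejected with -EEXIST.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, trimmed-down version of a tracked MMCFG region. */
struct region {
    uint64_t addr;
    uint32_t size;
    uint16_t segment;
    uint8_t start_bus;
};

/*
 * Compare a new region against an already tracked one:
 *   1       -> disjoint, keep scanning (sketch-only value)
 *   0       -> exact same region already tracked, nothing to do
 *   -EEXIST -> regions overlap but differ
 */
static int mmcfg_check(const struct region *cur, const struct region *new)
{
    /* Half-open intervals overlap when each starts before the other ends. */
    if ( !(new->addr < cur->addr + cur->size &&
           cur->addr < new->addr + new->size) )
        return 1;

    if ( new->addr == cur->addr && new->size == cur->size &&
         new->segment == cur->segment && new->start_bus == cur->start_bus )
        return 0;

    return -EEXIST;
}
```

Treating the identical case as success lets the hardware domain report a region that Xen already knows about from the MCFG table without getting an error back.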

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v7:
 - Add newline in hvm_physdev_op for non-fallthrough case.

Changes since v6:
 - Do not return EEXIST if the same exact region is already tracked by
   Xen.

Changes since v5:
 - Check for has_vpci before calling register_vpci_mmcfg_handler
   instead of checking for is_hvm_domain.

Changes since v4:
 - Change the hardware_domain check in hvm_physdev_op to a vpci check.
 - Only register the MMCFG area, but don't scan it.

Changes since v3:
 - New in this version.
---
 xen/arch/x86/hvm/hypercall.c |  5 +++++
 xen/arch/x86/hvm/io.c        | 16 +++++++++++-----
 xen/arch/x86/physdev.c       | 11 +++++++++++
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index 5742dd1797..85eacd7d33 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -89,6 +89,11 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( !has_pirq(curr->domain) )
             return -ENOSYS;
         break;
+
+    case PHYSDEVOP_pci_mmcfg_reserved:
+        if ( !has_vpci(curr->domain) )
+            return -ENOSYS;
+        break;
     }
 
     if ( !curr->hcall_compat )
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 04425c064b..556810c126 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -507,10 +507,9 @@ static const struct hvm_mmio_ops vpci_mmcfg_ops = {
     .write = vpci_mmcfg_write,
 };
 
-int __hwdom_init register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
-                                             unsigned int start_bus,
-                                             unsigned int end_bus,
-                                             unsigned int seg)
+int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
+                                unsigned int start_bus, unsigned int end_bus,
+                                unsigned int seg)
 {
     struct hvm_mmcfg *mmcfg, *new = xmalloc(struct hvm_mmcfg);
 
@@ -535,9 +534,16 @@ int __hwdom_init register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
         if ( new->addr < mmcfg->addr + mmcfg->size &&
              mmcfg->addr < new->addr + new->size )
         {
+            int ret = -EEXIST;
+
+            if ( new->addr == mmcfg->addr &&
+                 new->start_bus == mmcfg->start_bus &&
+                 new->segment == mmcfg->segment &&
+                 new->size == mmcfg->size )
+                ret = 0;
             write_unlock(&d->arch.hvm_domain.mmcfg_lock);
             xfree(new);
-            return -EEXIST;
+            return ret;
         }
 
     if ( list_empty(&d->arch.hvm_domain.mmcfg_regions) )
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 380d36f6b9..984491c3dc 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -557,6 +557,17 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         ret = pci_mmcfg_reserved(info.address, info.segment,
                                  info.start_bus, info.end_bus, info.flags);
+        if ( !ret && has_vpci(currd) )
+        {
+            /*
+             * For HVM (PVH) domains try to add the newly found MMCFG to the
+             * domain.
+             */
+            ret = register_vpci_mmcfg_handler(currd, info.address,
+                                              info.start_bus, info.end_bus,
+                                              info.segment);
+        }
+
         break;
     }
 
-- 
2.16.2



* [PATCH v11 04/12] pci: split code to size BARs from pci_add_device
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
                   ` (2 preceding siblings ...)
  2018-03-20 15:15 ` [PATCH v11 03/12] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-22 10:15   ` Jan Beulich
  2018-03-20 15:15 ` [PATCH v11 05/12] pci: add support to size ROM BARs to pci_size_mem_bar Roger Pau Monne
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Jan Beulich,
	Boris Ostrovsky, Roger Pau Monne

Split the BAR sizing code into a separate function, pci_size_mem_bar, so that
it can be called from outside pci_add_device in order to get the size of
regular PCI BARs. This will be required in order to map the BARs of PCI
devices into the PVH Dom0 p2m.
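
The arithmetic used to size a memory BAR can be sketched in isolation. The function below is a hypothetical helper, not part of the patch: it takes the values read back from the BAR after all ones have been written to it and derives the size the same way pci_size_mem_bar does. `BAR_MEM_MASK` is a local constant assumed to match the usual memory BAR address mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low 4 bits of a memory BAR hold type/prefetch flags, not address bits. */
#define BAR_MEM_MASK (~0x0fu)

/*
 * lo/hi are the register values read back after writing ~0 to the BAR.
 * Bits the device hardwires to zero encode the size, so two's complement
 * negation of the masked value yields the size in bytes.
 */
static uint64_t bar_size(uint32_t lo, uint32_t hi, bool is64)
{
    uint64_t size = lo & BAR_MEM_MASK;

    if ( is64 )
        size |= (uint64_t)hi << 32;
    else if ( size )
        size |= (uint64_t)~0 << 32; /* Sign-extend a 32-bit BAR. */

    return -size;
}
```

For example, a 32-bit BAR that reads back 0xfff00000 decodes 1MiB, and an unimplemented BAR that reads back 0 yields a size of 0.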

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v7:
 - Do not return error from pci_size_mem_bar in order to keep previous
   behavior.

Changes since v6:
 - Remove the vf and addr local variables.
 - Change the way flags are declared.
 - Move the last bool parameter to the flags field.

Changes since v5:
 - Introduce a flags field for pci_size_mem_bar.
 - Use pci_sbdf_t.

Changes since v4:
 - Restore printing whether the BAR is from a vf.
 - Make the psize pointer parameter not optional.
 - s/u64/uint64_t.
 - Remove some unneeded parentheses.
 - Assert the return value is never 0.
 - Use the newly introduced pci_sbdf_t type.

Changes since v3:
 - Rename function to size BARs to pci_size_mem_bar.
 - Change the parameters passed to the function. Pass the position and
   whether the BAR is the last one, instead of the (base, max_bars,
   *index) tuple.
 - Make the function return the number of BARs consumed (1 for 32b, 2
   for 64b BARs).
 - Change the dprintk back to printk.
 - Do not log another error message in pci_add_device in case
   pci_size_mem_bar fails.
---
 xen/drivers/passthrough/pci.c | 97 ++++++++++++++++++++++++++++---------------
 xen/include/xen/pci.h         |  5 +++
 2 files changed, 68 insertions(+), 34 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index e65c7faa6f..190515b3c6 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -603,6 +603,56 @@ static int iommu_add_device(struct pci_dev *pdev);
 static int iommu_enable_device(struct pci_dev *pdev);
 static int iommu_remove_device(struct pci_dev *pdev);
 
+unsigned int pci_size_mem_bar(pci_sbdf_t sbdf, unsigned int pos,
+                              uint64_t *paddr, uint64_t *psize,
+                              unsigned int flags)
+{
+    uint32_t hi = 0, bar = pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev,
+                                           sbdf.func, pos);
+    uint64_t size;
+
+    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
+    pci_conf_write32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos, ~0);
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        if ( flags & PCI_BAR_LAST )
+        {
+            printk(XENLOG_WARNING
+                   "%sdevice %04x:%02x:%02x.%u with 64-bit %sBAR in last slot\n",
+                   (flags & PCI_BAR_VF) ? "SR-IOV " : "", sbdf.seg, sbdf.bus,
+                   sbdf.dev, sbdf.func, (flags & PCI_BAR_VF) ? "vf " : "");
+            *psize = 0;
+            return 1;
+        }
+        hi = pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos + 4);
+        pci_conf_write32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos + 4, ~0);
+    }
+    size = pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos) &
+           PCI_BASE_ADDRESS_MEM_MASK;
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        size |= (uint64_t)pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev,
+                                          sbdf.func, pos + 4) << 32;
+        pci_conf_write32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos + 4, hi);
+    }
+    else if ( size )
+        size |= (uint64_t)~0 << 32;
+    pci_conf_write32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos, bar);
+    size = -size;
+
+    if ( paddr )
+        *paddr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((uint64_t)hi << 32);
+    *psize = size;
+
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+        return 2;
+
+    return 1;
+}
+
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *info, nodeid_t node)
 {
@@ -672,11 +722,16 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
             unsigned int i;
 
             BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
-            for ( i = 0; i < PCI_SRIOV_NUM_BARS; ++i )
+            for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
             {
                 unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
                 u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
-                u32 hi = 0;
+                pci_sbdf_t sbdf = {
+                    .seg = seg,
+                    .bus = bus,
+                    .dev = slot,
+                    .func = func,
+                };
 
                 if ( (bar & PCI_BASE_ADDRESS_SPACE) ==
                      PCI_BASE_ADDRESS_SPACE_IO )
@@ -687,38 +742,12 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
                            seg, bus, slot, func, i);
                     continue;
                 }
-                pci_conf_write32(seg, bus, slot, func, idx, ~0);
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    if ( i >= PCI_SRIOV_NUM_BARS )
-                    {
-                        printk(XENLOG_WARNING
-                               "SR-IOV device %04x:%02x:%02x.%u with 64-bit"
-                               " vf BAR in last slot\n",
-                               seg, bus, slot, func);
-                        break;
-                    }
-                    hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
-                }
-                pdev->vf_rlen[i] = pci_conf_read32(seg, bus, slot, func, idx) &
-                                   PCI_BASE_ADDRESS_MEM_MASK;
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    pdev->vf_rlen[i] |= (u64)pci_conf_read32(seg, bus,
-                                                             slot, func,
-                                                             idx + 4) << 32;
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
-                }
-                else if ( pdev->vf_rlen[i] )
-                    pdev->vf_rlen[i] |= (u64)~0 << 32;
-                pci_conf_write32(seg, bus, slot, func, idx, bar);
-                pdev->vf_rlen[i] = -pdev->vf_rlen[i];
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                    ++i;
+                ret = pci_size_mem_bar(sbdf, idx, NULL, &pdev->vf_rlen[i],
+                                       PCI_BAR_VF |
+                                       ((i == PCI_SRIOV_NUM_BARS - 1) ?
+                                        PCI_BAR_LAST : 0));
+                ASSERT(ret);
+                i += ret;
             }
         }
         else
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index b7a6abfc53..2f171a8dcc 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -189,6 +189,11 @@ const char *parse_pci(const char *, unsigned int *seg, unsigned int *bus,
 const char *parse_pci_seg(const char *, unsigned int *seg, unsigned int *bus,
                           unsigned int *dev, unsigned int *func, bool *def_seg);
 
+#define PCI_BAR_VF      (1u << 0)
+#define PCI_BAR_LAST    (1u << 1)
+unsigned int pci_size_mem_bar(pci_sbdf_t sbdf, unsigned int pos,
+                              uint64_t *paddr, uint64_t *psize,
+                              unsigned int flags);
 
 bool_t pcie_aer_get_firmware_first(const struct pci_dev *);
 
-- 
2.16.2



* [PATCH v11 05/12] pci: add support to size ROM BARs to pci_size_mem_bar
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
                   ` (3 preceding siblings ...)
  2018-03-20 15:15 ` [PATCH v11 04/12] pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 06/12] xen: introduce rangeset_consume_ranges Roger Pau Monne
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Jan Beulich,
	Boris Ostrovsky, Roger Pau Monne

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v6:
 - Remove the rom local variable.

Changes since v5:
 - Use the flags field.
 - Introduce a mask local variable.
 - Simplify return.

Changes since v4:
 - New in this version.
---
 xen/drivers/passthrough/pci.c | 28 ++++++++++++++--------------
 xen/include/xen/pci.h         |  1 +
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 190515b3c6..1751c66e34 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -610,11 +610,16 @@ unsigned int pci_size_mem_bar(pci_sbdf_t sbdf, unsigned int pos,
     uint32_t hi = 0, bar = pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev,
                                            sbdf.func, pos);
     uint64_t size;
-
-    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
+    bool is64bits = !(flags & PCI_BAR_ROM) &&
+        (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) == PCI_BASE_ADDRESS_MEM_TYPE_64;
+    uint32_t mask = (flags & PCI_BAR_ROM) ? (uint32_t)PCI_ROM_ADDRESS_MASK
+                                          : (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
+
+    ASSERT(!((flags & PCI_BAR_VF) && (flags & PCI_BAR_ROM)));
+    ASSERT((flags & PCI_BAR_ROM) ||
+           (bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
     pci_conf_write32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos, ~0);
-    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    if ( is64bits )
     {
         if ( flags & PCI_BAR_LAST )
         {
@@ -628,10 +633,9 @@ unsigned int pci_size_mem_bar(pci_sbdf_t sbdf, unsigned int pos,
         hi = pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos + 4);
         pci_conf_write32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos + 4, ~0);
     }
-    size = pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func, pos) &
-           PCI_BASE_ADDRESS_MEM_MASK;
-    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    size = pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev, sbdf.func,
+                           pos) & mask;
+    if ( is64bits )
     {
         size |= (uint64_t)pci_conf_read32(sbdf.seg, sbdf.bus, sbdf.dev,
                                           sbdf.func, pos + 4) << 32;
@@ -643,14 +647,10 @@ unsigned int pci_size_mem_bar(pci_sbdf_t sbdf, unsigned int pos,
     size = -size;
 
     if ( paddr )
-        *paddr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((uint64_t)hi << 32);
+        *paddr = (bar & mask) | ((uint64_t)hi << 32);
     *psize = size;
 
-    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-         PCI_BASE_ADDRESS_MEM_TYPE_64 )
-        return 2;
-
-    return 1;
+    return is64bits ? 2 : 1;
 }
 
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 2f171a8dcc..4cfa774615 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -191,6 +191,7 @@ const char *parse_pci_seg(const char *, unsigned int *seg, unsigned int *bus,
 
 #define PCI_BAR_VF      (1u << 0)
 #define PCI_BAR_LAST    (1u << 1)
+#define PCI_BAR_ROM     (1u << 2)
 unsigned int pci_size_mem_bar(pci_sbdf_t sbdf, unsigned int pos,
                               uint64_t *paddr, uint64_t *psize,
                               unsigned int flags);
-- 
2.16.2



* [PATCH v11 06/12] xen: introduce rangeset_consume_ranges
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
                   ` (4 preceding siblings ...)
  2018-03-20 15:15 ` [PATCH v11 05/12] pci: add support to size ROM BARs to pci_size_mem_bar Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 07/12] vpci: add header handlers Roger Pau Monne
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Jan Beulich,
	Boris Ostrovsky, Roger Pau Monne

This function allows iterating over a rangeset while removing the
processed regions.

This will be used in order to split processing of large memory areas
when mapping them into the guest p2m.
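
The intended usage pattern can be sketched with a simplified standalone model. Everything here is an assumption for illustration: `struct rangeset_model` is a fixed array rather than Xen's locked list, `SKETCH_ERESTART` stands in for ERESTART, and `map_some` is a hypothetical callback that processes a bounded amount of work per invocation.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ERESTART 85 /* Stand-in value for ERESTART. */

struct range { unsigned long s, e; }; /* Inclusive bounds. */
struct rangeset_model { struct range r[4]; size_t n; };

typedef int (*consume_cb)(unsigned long s, unsigned long e, void *ctxt,
                          unsigned long *c);

/* Call cb on the first range, trim what it consumed, stop on error. */
static int consume_ranges(struct rangeset_model *rs, consume_cb cb, void *ctxt)
{
    int rc = 0;

    while ( rs->n )
    {
        unsigned long consumed = 0;
        struct range *x = &rs->r[0];

        rc = cb(x->s, x->e, ctxt, &consumed);
        assert(consumed <= x->e - x->s + 1);
        x->s += consumed;
        if ( x->s > x->e )
        {
            /* Range fully processed: drop it from the set. */
            for ( size_t i = 1; i < rs->n; i++ )
                rs->r[i - 1] = rs->r[i];
            rs->n--;
        }
        if ( rc )
            break;
    }

    return rc;
}

struct map_ctx { unsigned long budget, mapped; };

/* Hypothetical callback: handle at most 'budget' units, then ask to restart. */
static int map_some(unsigned long s, unsigned long e, void *ctxt,
                    unsigned long *c)
{
    struct map_ctx *m = ctxt;
    unsigned long len = e - s + 1;
    unsigned long n = len < m->budget ? len : m->budget;

    m->mapped += n;
    m->budget -= n;
    *c = n;

    return n < len ? -SKETCH_ERESTART : 0;
}
```

Because partially processed ranges are trimmed in place, a caller that receives the restart error can simply invoke the function again later and continue where it left off.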

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v6:
 - Expand commit message.
 - Add a comment to describe the expected function behavior.
 - Fix indentation.

Changes since v5:
 - New in this version.
---
 xen/common/rangeset.c      | 28 ++++++++++++++++++++++++++++
 xen/include/xen/rangeset.h | 10 ++++++++++
 2 files changed, 38 insertions(+)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index ade34f6a50..bb68ce62e4 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -350,6 +350,34 @@ int rangeset_claim_range(struct rangeset *r, unsigned long size,
     return 0;
 }
 
+int rangeset_consume_ranges(struct rangeset *r,
+                            int (*cb)(unsigned long s, unsigned long e, void *,
+                                      unsigned long *c),
+                            void *ctxt)
+{
+    int rc = 0;
+
+    write_lock(&r->lock);
+    while ( !rangeset_is_empty(r) )
+    {
+        unsigned long consumed = 0;
+        struct range *x = first_range(r);
+
+        rc = cb(x->s, x->e, ctxt, &consumed);
+
+        ASSERT(consumed <= x->e - x->s + 1);
+        x->s += consumed;
+        if ( x->s > x->e )
+            destroy_range(r, x);
+
+        if ( rc )
+            break;
+    }
+    write_unlock(&r->lock);
+
+    return rc;
+}
+
 int rangeset_add_singleton(
     struct rangeset *r, unsigned long s)
 {
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index 1f83b1f44b..583b72bb0c 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -70,6 +70,16 @@ int rangeset_report_ranges(
     struct rangeset *r, unsigned long s, unsigned long e,
     int (*cb)(unsigned long s, unsigned long e, void *), void *ctxt);
 
+/*
+ * Note that the consume function can return an error value apart from
+ * -ERESTART, and that no cleanup is performed (ie: the user should call
+ * rangeset_destroy if needed).
+ */
+int rangeset_consume_ranges(struct rangeset *r,
+                            int (*cb)(unsigned long s, unsigned long e,
+                                      void *, unsigned long *c),
+                            void *ctxt);
+
 /* Add/remove/query a single number. */
 int __must_check rangeset_add_singleton(
     struct rangeset *r, unsigned long s);
-- 
2.16.2




* [PATCH v11 07/12] vpci: add header handlers
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
                   ` (5 preceding siblings ...)
  2018-03-20 15:15 ` [PATCH v11 06/12] xen: introduce rangeset_consume_ranges Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-20 16:19   ` Jan Beulich
  2018-03-21 12:31   ` Paul Durrant
  2018-03-20 15:15 ` [PATCH v11 08/12] x86/pt: mask MSI vectors on unbind Roger Pau Monne
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Paul Durrant, Jan Beulich,
	Boris Ostrovsky, Roger Pau Monne

Introduce a set of handlers that trap accesses to the PCI BARs and the
command register, in order to snoop BAR sizing and BAR relocation.

The command handler is used to detect changes to bit 1 (memory space
enable, PCI_COMMAND_MEMORY), and maps/unmaps the BARs of the device into
the guest p2m. A rangeset is used in order to figure out which memory
to map/unmap. This makes it easier to keep track of the possible
overlaps with other BARs, and will also simplify MSI-X support, where
certain regions of a BAR might be used for the MSI-X table or PBA.
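
A minimal sketch of the two calculations involved, assuming 4 KiB pages:
detecting that a command register write flips the memory decoding bit,
and turning a BAR's address and size into the inclusive frame range that
is added to the rangeset. decoding_toggled, bar_frames and the SKETCH_*
constants are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT     12   /* assumption: 4 KiB pages */
#define SKETCH_COMMAND_MEMORY 0x2  /* memory space enable, command register */

/* True when the write changes the state of the memory decoding bit. */
static bool decoding_toggled(uint16_t current_cmd, uint16_t new_cmd)
{
    return (current_cmd ^ new_cmd) & SKETCH_COMMAND_MEMORY;
}

/* Inclusive range of page frames covered by a BAR at addr of size bytes. */
static void bar_frames(uint64_t addr, uint64_t size,
                       unsigned long *start, unsigned long *end)
{
    *start = (unsigned long)(addr >> SKETCH_PAGE_SHIFT);
    *end = (unsigned long)((addr + size - 1) >> SKETCH_PAGE_SHIFT);
}
```

The end frame is derived from the last byte of the BAR (addr + size - 1)
since rangesets are inclusive on both ends.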

The BAR register handlers are used to detect attempts by the guest to
size or relocate the BARs.
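
For illustration, the cached BAR address can be updated from 32-bit
writes roughly as sketched below. update_bar_addr and SKETCH_MEM_MASK are
hypothetical names; the sketch only assumes that the low four bits of a
memory BAR hold flags rather than address bits, and that the high half of
a 64-bit BAR is written through the following register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low four bits of a memory BAR are flags, not address bits. */
#define SKETCH_MEM_MASK (~(uint32_t)0x0f)

/* Merge a 32-bit write into the low or high half of a 64-bit address. */
static uint64_t update_bar_addr(uint64_t addr, uint32_t val, bool hi)
{
    unsigned int shift = hi ? 32 : 0;

    /* Only the low register carries flag bits that must be dropped. */
    if ( !hi )
        val &= SKETCH_MEM_MASK;

    addr &= ~(0xffffffffull << shift);
    addr |= (uint64_t)val << shift;

    return addr;
}
```

A sizing write of all ones is handled the same way as a relocation: the
value is simply stored, and the guest reads the result back from the
hardware.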

Note that the long-running BAR mapping and unmapping operations are
deferred to be performed by hvm_io_pending, so that they can be safely
preempted.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v10:
 - Fix indirect function call in map_range.
 - Use rom->addr instead of fetching it from the ROM BAR register in
   modify_decoding.
 - Remove ternary operator from modify_decoding.
 - Simplify apply_map to have a single return.
 - Constify pci_dev parameter of apply_map.
 - Remove references to maybe_defer_map.
 - Use pdev (const) or dev (non-const) consistently in modify_bars.
 - Invert part of the logic in rom_write to remove one indentation
   level.
 - Add comments in rom_write to clarify why rom->addr is updated in
   two different places.
 - Use lx to print frame numbers in modify_bars.
 - Add start/end local variables in the first modify_bars loop.

Changes since v9:
 - Expand comments to clarify the code.
 - Rename rom to rom_only in the vpci_cpu struct.
 - Change definition style of dummy vpci_cpu.
 - Replace incorrect usage of PFN_UP.
 - Use system_state in order to check if the mapping functions are
   being called from Dom0 builder context.
 - Split the maybe_defer_map into two functions and place the Dom0
   builder one in the init section.

Changes since v8:
 - Do not pretend to support ARM in the map_range function. Explain
   the required changes in the comment.
 - Introduce PCI_HEADER_{NORMAL/BRIDGE}_NR_BARS defines.
 - Rename 'rom' boolean variable to 'rom_only', which is more
   descriptive of its meaning.
 - Introduce vpci_remove_device which removes all handlers for a
   device.
 - Simplify error handling when modifying BARs mapping. Any error will
   cause the device to be unplugged (by calling vpci_remove_device).
 - Return an error code in modify_bars. Add comments describing why
   the error is sometimes ignored.

Changes since v7:
 - Order includes.
 - Add newline between switch cases.
 - Fix typo in comment (hopping).
 - Wrap ternary conditional in parentheses.
 - Remove CONFIG_HAS_PCI guard from sched.h vpci_vcpu usage.
 - Add comment regarding vpci_vcpu usage.
 - Move rom_enabled from BAR struct to header.
 - Do not protect vpci_vcpu with __XEN__ guards.

Changes since v6:
 - s/vpci_check_pending/vpci_process_pending/.
 - Improve error handling in vpci_process_pending.
 - Add a comment that explains how vpci_check_bar_overlap works.
 - Add error messages to vpci_modify_bars and vpci_modify_rom.
 - Introduce vpci_hw_read16/32, in order to passthrough reads to
   the underlying hw.
 - Print BAR number on error in vpci_bar_write.
 - Place the CONFIG_HAS_PCI guards inside the vpci.h header and
   provide an empty vpci_vcpu structure for the !CONFIG_HAS_PCI case.
 - Define CONFIG_HAS_PCI in the test harness emul.h header before
   including vpci.h
 - Add ARM TODOs and an ARM-specific bodge to vpci_map_range due to
   the lack of preemption in {un}map_mmio_regions.
 - Make vpci_maybe_defer_map void.
 - Set rom_enabled in vpci_init_bars.
 - Defer enabling/disabling the memory decoding (or the ROM enable
   bit) until the memory has been mapped/unmapped.
 - Remove vpci_ prefix from static functions.
 - Use the same code in order to map the general BARs and the ROM
   BARs.
 - Remove the seg/bus local variables and use pdev->{seg,bus} instead.
 - Convert the bools in the BAR related structs into bool bitfields.
 - Add the must_check attribute to vpci_process_pending.
 - Open code check_bar_overlap inside modify_bars, which was its only
   user.

Changes since v5:
 - Switch to the new handler type.
 - Use pci_sbdf_t to size the BARs.
 - Use a single return for vpci_modify_bar.
 - Do not return an error code from vpci_modify_bars, just log the
   failure.
 - Remove the 'sizing' parameter. Instead just let the guest write
   directly to the BAR, and read the value back. This simplifies the
   BAR register handlers, especially the read one.
 - Ignore ROM BAR writes with memory decoding enabled and ROM enabled.
 - Do not propagate failures to setup the ROM BAR in vpci_init_bars.
 - Add preemption support to the BAR mapping/unmapping operations.

Changes since v4:
 - Expand commit message to mention the reason behind the usage of
   rangesets.
 - Fix comment related to the inclusiveness of rangesets.
 - Fix off-by-one error in the calculation of the end of memory
   regions.
 - Store the state of the BAR (mapped/unmapped) in the vpci_bar
   enabled field, which was previously only used by ROMs.
 - Fix double negation of return code.
 - Modify vpci_cmd_write so it has a single call to pci_conf_write16.
 - Print a warning when trying to write to the BAR with memory
   decoding enabled (and ignore the write).
 - Remove header_type local variable, it's used only once.
 - Move the read of the command register.
 - Restore previous command register value in the exit paths.
 - Only set address to INVALID_PADDR if the initial BAR value matches
   ~0 & PCI_BASE_ADDRESS_MEM_MASK.
 - Don't disable the enabled bit in the expansion ROM register, memory
   decoding is already disabled and takes precedence.
 - Don't use INVALID_PADDR, just set the initial BAR address to the
   value found in the hardware.
 - Introduce rom_enabled to store the status of the
   PCI_ROM_ADDRESS_ENABLE bit.
 - Reorder fields of the structure to prevent holes.

Changes since v3:
 - Propagate previous changes: drop xen_ prefix and use u8/u16/u32
   instead of the previous half_word/word/double_word.
 - Constify some of the parameters.
 - s/VPCI_BAR_MEM/VPCI_BAR_MEM32/.
 - Simplify the number of fields stored for each BAR, a single address
   field is stored and contains the address of the BAR both on Xen and
   in the guest.
 - Allow the guest to move the BARs around in the physical memory map.
 - Add support for expansion ROM BARs.
 - Do not cache the value of the command register.
 - Remove a label used in vpci_cmd_write.
 - Fix the calculation of the sizing mask in vpci_bar_write.
 - Check the memory decode bit in order to decide if a BAR is
   positioned or not.
 - Disable memory decoding before sizing the BARs in Xen.
 - When mapping/unmapping BARs check if there's overlap between BARs,
   in order to avoid unmapping memory required by another BAR.
 - Introduce a macro to check whether a BAR is mappable or not.
 - Add a comment regarding the lack of support for SR-IOV.
 - Remove the usage of the GENMASK macro.

Changes since v2:
 - Detect unset BARs and allow the hardware domain to position them.
---
 tools/tests/vpci/emul.h   |   1 +
 xen/arch/x86/hvm/ioreq.c  |   4 +
 xen/drivers/vpci/Makefile |   2 +-
 xen/drivers/vpci/header.c | 548 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/drivers/vpci/vpci.c   |  45 ++--
 xen/include/xen/sched.h   |   4 +
 xen/include/xen/vpci.h    |  61 ++++++
 7 files changed, 651 insertions(+), 14 deletions(-)
 create mode 100644 xen/drivers/vpci/header.c

diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h
index fd0317995a..5d47544bf7 100644
--- a/tools/tests/vpci/emul.h
+++ b/tools/tests/vpci/emul.h
@@ -80,6 +80,7 @@ typedef union {
     };
 } pci_sbdf_t;
 
+#define CONFIG_HAS_VPCI
 #include "vpci.h"
 
 #define __hwdom_init
diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index 7e66965bcd..90c9e3cd59 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -26,6 +26,7 @@
 #include <xen/domain.h>
 #include <xen/event.h>
 #include <xen/paging.h>
+#include <xen/vpci.h>
 
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/ioreq.h>
@@ -48,6 +49,9 @@ bool hvm_io_pending(struct vcpu *v)
     struct domain *d = v->domain;
     struct hvm_ioreq_server *s;
 
+    if ( has_vpci(d) && vpci_process_pending(v) )
+        return true;
+
     list_for_each_entry ( s,
                           &d->arch.hvm_domain.ioreq_server.list,
                           list_entry )
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 840a906470..241467212f 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o
+obj-y += vpci.o header.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
new file mode 100644
index 0000000000..d7c220a452
--- /dev/null
+++ b/xen/drivers/vpci/header.c
@@ -0,0 +1,548 @@
+/*
+ * Generic functionality for handling accesses to the PCI header from the
+ * configuration space.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/p2m-common.h>
+#include <xen/sched.h>
+#include <xen/softirq.h>
+#include <xen/vpci.h>
+
+#include <asm/event.h>
+
+#define MAPPABLE_BAR(x)                                                 \
+    ((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||   \
+     (x)->type == VPCI_BAR_ROM)
+
+struct map_data {
+    struct domain *d;
+    bool map;
+};
+
+static int map_range(unsigned long s, unsigned long e, void *data,
+                     unsigned long *c)
+{
+    const struct map_data *map = data;
+    int rc;
+
+    for ( ; ; )
+    {
+        unsigned long size = e - s + 1;
+
+        /*
+         * ARM TODOs:
+         * - On ARM whether the memory is prefetchable or not should be passed
+         *   to map_mmio_regions in order to decide which memory attributes
+         *   should be used.
+         *
+         * - {un}map_mmio_regions doesn't support preemption.
+         */
+
+        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
+                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        if ( rc == 0 )
+        {
+            *c += size;
+            break;
+        }
+        if ( rc < 0 )
+        {
+            printk(XENLOG_G_WARNING
+                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
+                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+            break;
+        }
+        ASSERT(rc < size);
+        *c += rc;
+        s += rc;
+        if ( general_preempt_check() )
+            return -ERESTART;
+    }
+
+    return rc;
+}
+
+/*
+ * The rom_only parameter is used to signal the map/unmap helpers that the ROM
+ * BAR's enable bit has changed with the memory decoding bit already enabled.
+ * If rom_only is not set then it's the memory decoding bit that changed.
+ */
+static void modify_decoding(const struct pci_dev *pdev, bool map, bool rom_only)
+{
+    struct vpci_header *header = &pdev->vpci->header;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t cmd;
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        if ( !MAPPABLE_BAR(&header->bars[i]) )
+            continue;
+
+        if ( rom_only && header->bars[i].type == VPCI_BAR_ROM )
+        {
+            unsigned int rom_pos = (i == PCI_HEADER_NORMAL_NR_BARS)
+                                   ? PCI_ROM_ADDRESS : PCI_ROM_ADDRESS1;
+            uint32_t val = header->bars[i].addr |
+                           (map ? PCI_ROM_ADDRESS_ENABLE : 0);
+
+            header->bars[i].enabled = header->rom_enabled = map;
+            pci_conf_write32(pdev->seg, pdev->bus, slot, func, rom_pos, val);
+            return;
+        }
+
+        if ( !rom_only &&
+             (header->bars[i].type != VPCI_BAR_ROM || header->rom_enabled) )
+            header->bars[i].enabled = map;
+    }
+
+    ASSERT(!rom_only);
+    cmd = pci_conf_read16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND);
+    cmd &= ~PCI_COMMAND_MEMORY;
+    cmd |= map ? PCI_COMMAND_MEMORY : 0;
+    pci_conf_write16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND,
+                     cmd);
+}
+
+bool vpci_process_pending(struct vcpu *v)
+{
+    if ( v->vpci.mem )
+    {
+        struct map_data data = {
+            .d = v->domain,
+            .map = v->vpci.map,
+        };
+        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
+
+        if ( rc == -ERESTART )
+            return true;
+
+        spin_lock(&v->vpci.pdev->vpci->lock);
+        /* Disable memory decoding unconditionally on failure. */
+        modify_decoding(v->vpci.pdev, !rc && v->vpci.map,
+                        !rc && v->vpci.rom_only);
+        spin_unlock(&v->vpci.pdev->vpci->lock);
+
+        rangeset_destroy(v->vpci.mem);
+        v->vpci.mem = NULL;
+        if ( rc )
+            /*
+             * FIXME: in case of failure remove the device from the domain.
+             * Note that there might still be leftover mappings. While this is
+             * safe for Dom0, for DomUs the domain will likely need to be
+             * killed in order to avoid leaking stale p2m mappings on
+             * failure.
+             */
+            vpci_remove_device(v->vpci.pdev);
+    }
+
+    return false;
+}
+
+static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
+                            struct rangeset *mem)
+{
+    struct map_data data = { .d = d, .map = true };
+    int rc;
+
+    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+        process_pending_softirqs();
+    rangeset_destroy(mem);
+    if ( !rc )
+        modify_decoding(pdev, true, false);
+
+    return rc;
+}
+
+static void defer_map(struct domain *d, struct pci_dev *pdev,
+                      struct rangeset *mem, bool map, bool rom_only)
+{
+    struct vcpu *curr = current;
+
+    /*
+     * FIXME: when deferring the {un}map the state of the device should not
+     * be trusted. For example the enable bit is toggled after the device
+     * is mapped. This can lead to parallel mapping operations being
+     * started for the same device if the domain is not well-behaved.
+     */
+    curr->vpci.pdev = pdev;
+    curr->vpci.mem = mem;
+    curr->vpci.map = map;
+    curr->vpci.rom_only = rom_only;
+}
+
+static int modify_bars(const struct pci_dev *pdev, bool map, bool rom_only)
+{
+    struct vpci_header *header = &pdev->vpci->header;
+    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
+    struct pci_dev *tmp, *dev = NULL;
+    unsigned int i;
+    int rc;
+
+    if ( !mem )
+        return -ENOMEM;
+
+    /*
+     * Create a rangeset that represents the current device BARs memory region
+     * and compare it against all the currently active BAR memory regions. If
+     * an overlap is found, subtract it from the region to be mapped/unmapped.
+     *
+     * First fill the rangeset with all the BARs of this device or with the ROM
+     * BAR only, depending on whether the guest is toggling the memory decode
+     * bit of the command register, or the enable bit of the ROM BAR register.
+     */
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        const struct vpci_bar *bar = &header->bars[i];
+        unsigned long start = PFN_DOWN(bar->addr);
+        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+
+        if ( !MAPPABLE_BAR(bar) ||
+             (rom_only ? bar->type != VPCI_BAR_ROM
+                       : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) )
+            continue;
+
+        rc = rangeset_add_range(mem, start, end);
+        if ( rc )
+        {
+            printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
+                   start, end, rc);
+            rangeset_destroy(mem);
+            return rc;
+        }
+    }
+
+    /*
+     * Check for overlaps with other BARs. Note that only BARs that are
+     * currently mapped (enabled) are checked for overlaps.
+     */
+    list_for_each_entry(tmp, &pdev->domain->arch.pdev_list, domain_list)
+    {
+        if ( tmp == pdev )
+        {
+            /*
+             * Need to store the device so it's not constified and defer_map
+             * can modify it in case of error.
+             */
+            dev = tmp;
+            if ( !rom_only )
+                /*
+                 * If memory decoding is toggled avoid checking against the
+                 * same device, or else all regions will be removed from the
+                 * memory map in the unmap case.
+                 */
+                continue;
+        }
+
+        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
+        {
+            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
+            unsigned long start = PFN_DOWN(bar->addr);
+            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+
+            if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
+                 /*
+                  * If only the ROM enable bit is toggled check against other
+                  * BARs in the same device for overlaps, but not against the
+                  * same ROM BAR.
+                  */
+                 (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
+                continue;
+
+            rc = rangeset_remove_range(mem, start, end);
+            if ( rc )
+            {
+                printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
+                       start, end, rc);
+                rangeset_destroy(mem);
+                return rc;
+            }
+        }
+    }
+
+    ASSERT(dev);
+
+    if ( system_state < SYS_STATE_active )
+    {
+        /*
+         * Mappings might be created when building Dom0 if the memory decoding
+         * bit of PCI devices is enabled. In that case it's not possible to
+         * defer the operation, so call apply_map in order to create the
+         * mappings right away. Note that at build time this function will only
+         * be called iff the memory decoding bit is enabled, thus the operation
+         * will always be to establish mappings and process all the BARs.
+         */
+        ASSERT(map && !rom_only);
+        return apply_map(pdev->domain, pdev, mem);
+    }
+
+    defer_map(dev->domain, dev, mem, map, rom_only);
+
+    return 0;
+}
+
+static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
+                      uint32_t cmd, void *data)
+{
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t current_cmd = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
+                                           reg);
+
+    /*
+     * Let Dom0 play with all the bits directly except for the memory
+     * decoding one.
+     */
+    if ( (cmd ^ current_cmd) & PCI_COMMAND_MEMORY )
+        /*
+         * Ignore the error. No memory has been added or removed from the p2m
+         * (because the actual p2m changes are deferred in defer_map) and the
+         * memory decoding bit has not been changed, so leave everything as-is,
+         * hoping the guest will realize and try again.
+         */
+        modify_bars(pdev, cmd & PCI_COMMAND_MEMORY, false);
+    else
+        pci_conf_write16(pdev->seg, pdev->bus, slot, func, reg, cmd);
+}
+
+static void bar_write(const struct pci_dev *pdev, unsigned int reg,
+                      uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    bool hi = false;
+
+    if ( pci_conf_read16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND) &
+         PCI_COMMAND_MEMORY )
+    {
+        gprintk(XENLOG_WARNING,
+                "%04x:%02x:%02x.%u: ignored BAR %lu write with memory decoding enabled\n",
+                pdev->seg, pdev->bus, slot, func,
+                bar - pdev->vpci->header.bars);
+        return;
+    }
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+
+    /*
+     * Update the cached address, so that when memory decoding is enabled
+     * Xen can map the BAR into the guest p2m.
+     */
+    bar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
+    bar->addr |= (uint64_t)val << (hi ? 32 : 0);
+
+    /* Make sure Xen writes back the same value for the BAR RO bits. */
+    if ( !hi )
+    {
+        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+
+    pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), reg, val);
+}
+
+static void rom_write(const struct pci_dev *pdev, unsigned int reg,
+                      uint32_t val, void *data)
+{
+    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_bar *rom = data;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
+                                   PCI_COMMAND);
+    bool new_enabled = val & PCI_ROM_ADDRESS_ENABLE;
+
+    if ( (cmd & PCI_COMMAND_MEMORY) && header->rom_enabled && new_enabled )
+    {
+        gprintk(XENLOG_WARNING,
+                "%04x:%02x:%02x.%u: ignored ROM BAR write with memory decoding enabled\n",
+                pdev->seg, pdev->bus, slot, func);
+        return;
+    }
+
+    if ( !header->rom_enabled )
+        /*
+         * If the ROM BAR is not enabled update the address field so the
+         * correct address is mapped into the p2m.
+         */
+        rom->addr = val & PCI_ROM_ADDRESS_MASK;
+
+    if ( !(cmd & PCI_COMMAND_MEMORY) || header->rom_enabled == new_enabled )
+    {
+        /* Just update the ROM BAR field. */
+        header->rom_enabled = new_enabled;
+        pci_conf_write32(pdev->seg, pdev->bus, slot, func, reg, val);
+    }
+    else if ( modify_bars(pdev, new_enabled, true) )
+        /*
+         * No memory has been added or removed from the p2m (because the actual
+         * p2m changes are deferred in defer_map) and the ROM enable bit has
+         * not been changed, so leave everything as-is, hoping the guest will
+         * realize and try again. It's important to not update rom->addr in the
+         * unmap case if modify_bars has failed, or future attempts would
+         * attempt to unmap the wrong address.
+         */
+        return;
+
+    if ( !new_enabled )
+        rom->addr = val & PCI_ROM_ADDRESS_MASK;
+}
+
+static int init_bars(struct pci_dev *pdev)
+{
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t cmd;
+    uint64_t addr, size;
+    unsigned int i, num_bars, rom_reg;
+    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_bar *bars = header->bars;
+    pci_sbdf_t sbdf = {
+        .seg = pdev->seg,
+        .bus = pdev->bus,
+        .dev = slot,
+        .func = func,
+    };
+    int rc;
+
+    switch ( pci_conf_read8(pdev->seg, pdev->bus, slot, func, PCI_HEADER_TYPE)
+             & 0x7f )
+    {
+    case PCI_HEADER_TYPE_NORMAL:
+        num_bars = PCI_HEADER_NORMAL_NR_BARS;
+        rom_reg = PCI_ROM_ADDRESS;
+        break;
+
+    case PCI_HEADER_TYPE_BRIDGE:
+        num_bars = PCI_HEADER_BRIDGE_NR_BARS;
+        rom_reg = PCI_ROM_ADDRESS1;
+        break;
+
+    default:
+        return -EOPNOTSUPP;
+    }
+
+    /* Setup a handler for the command register. */
+    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
+                           2, header);
+    if ( rc )
+        return rc;
+
+    /* Disable memory decoding before sizing. */
+    cmd = pci_conf_read16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND);
+    if ( cmd & PCI_COMMAND_MEMORY )
+        pci_conf_write16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND,
+                         cmd & ~PCI_COMMAND_MEMORY);
+
+    for ( i = 0; i < num_bars; i++ )
+    {
+        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
+        uint32_t val;
+
+        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
+        {
+            bars[i].type = VPCI_BAR_MEM64_HI;
+            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
+                                   4, &bars[i]);
+            if ( rc )
+            {
+                pci_conf_write16(pdev->seg, pdev->bus, slot, func,
+                                 PCI_COMMAND, cmd);
+                return rc;
+            }
+
+            continue;
+        }
+
+        val = pci_conf_read32(pdev->seg, pdev->bus, slot, func, reg);
+        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
+        {
+            bars[i].type = VPCI_BAR_IO;
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+             PCI_BASE_ADDRESS_MEM_TYPE_64 )
+            bars[i].type = VPCI_BAR_MEM64_LO;
+        else
+            bars[i].type = VPCI_BAR_MEM32;
+
+        rc = pci_size_mem_bar(sbdf, reg, &addr, &size,
+                              (i == num_bars - 1) ? PCI_BAR_LAST : 0);
+        if ( rc < 0 )
+        {
+            pci_conf_write16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND,
+                             cmd);
+            return rc;
+        }
+
+        if ( size == 0 )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            continue;
+        }
+
+        bars[i].addr = addr;
+        bars[i].size = size;
+        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
+                               &bars[i]);
+        if ( rc )
+        {
+            pci_conf_write16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND,
+                             cmd);
+            return rc;
+        }
+    }
+
+    /* Check expansion ROM. */
+    rc = pci_size_mem_bar(sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
+    if ( rc > 0 && size )
+    {
+        struct vpci_bar *rom = &header->bars[num_bars];
+
+        rom->type = VPCI_BAR_ROM;
+        rom->size = size;
+        rom->addr = addr;
+        header->rom_enabled = pci_conf_read32(pdev->seg, pdev->bus, slot, func,
+                                              rom_reg) & PCI_ROM_ADDRESS_ENABLE;
+
+        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
+                               4, rom);
+        if ( rc )
+            rom->type = VPCI_BAR_EMPTY;
+    }
+
+    return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, true, false) : 0;
+}
+REGISTER_VPCI_INIT(init_bars);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 4740d02edf..e5b49b9d82 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -34,6 +34,23 @@ struct vpci_register {
     struct list_head node;
 };
 
+void vpci_remove_device(struct pci_dev *pdev)
+{
+    spin_lock(&pdev->vpci->lock);
+    while ( !list_empty(&pdev->vpci->handlers) )
+    {
+        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
+                                                   struct vpci_register,
+                                                   node);
+
+        list_del(&r->node);
+        xfree(r);
+    }
+    spin_unlock(&pdev->vpci->lock);
+    xfree(pdev->vpci);
+    pdev->vpci = NULL;
+}
+
 int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
 {
     unsigned int i;
@@ -57,19 +74,7 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
     }
 
     if ( rc )
-    {
-        while ( !list_empty(&pdev->vpci->handlers) )
-        {
-            struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
-                                                       struct vpci_register,
-                                                       node);
-
-            list_del(&r->node);
-            xfree(r);
-        }
-        xfree(pdev->vpci);
-        pdev->vpci = NULL;
-    }
+        vpci_remove_device(pdev);
 
     return rc;
 }
@@ -102,6 +107,20 @@ static void vpci_ignored_write(const struct pci_dev *pdev, unsigned int reg,
 {
 }
 
+uint32_t vpci_hw_read16(const struct pci_dev *pdev, unsigned int reg,
+                        void *data)
+{
+    return pci_conf_read16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                           PCI_FUNC(pdev->devfn), reg);
+}
+
+uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
+                        void *data)
+{
+    return pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                           PCI_FUNC(pdev->devfn), reg);
+}
+
 int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                       vpci_write_t *write_handler, unsigned int offset,
                       unsigned int size, void *data)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index f89896e59b..57bb142c02 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -20,6 +20,7 @@
 #include <xen/smp.h>
 #include <xen/perfc.h>
 #include <asm/atomic.h>
+#include <xen/vpci.h>
 #include <xen/wait.h>
 #include <public/xen.h>
 #include <public/domctl.h>
@@ -264,6 +265,9 @@ struct vcpu
 
     struct evtchn_fifo_vcpu *evtchn_fifo;
 
+    /* vPCI per-vCPU area, used to store data for long running operations. */
+    struct vpci_vcpu vpci;
+
     struct arch_vcpu arch;
 };
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 9f2864fb0c..6bf8b22b4f 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -1,6 +1,8 @@
 #ifndef _XEN_VPCI_H_
 #define _XEN_VPCI_H_
 
+#ifdef CONFIG_HAS_VPCI
+
 #include <xen/pci.h>
 #include <xen/types.h>
 #include <xen/list.h>
@@ -20,6 +22,9 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 /* Add vPCI handlers to device. */
 int __must_check vpci_add_handlers(struct pci_dev *dev);
 
+/* Remove all handlers and free vpci related structures. */
+void vpci_remove_device(struct pci_dev *pdev);
+
 /* Add/remove a register handler. */
 int __must_check vpci_add_register(struct vpci *vpci,
                                    vpci_read_t *read_handler,
@@ -34,12 +39,68 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data);
 
+/* Passthrough handlers. */
+uint32_t vpci_hw_read16(const struct pci_dev *pdev, unsigned int reg,
+                        void *data);
+uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
+                        void *data);
+
+/*
+ * Check for pending vPCI operations on this vcpu. Returns true if the vcpu
+ * should not run.
+ */
+bool __must_check vpci_process_pending(struct vcpu *v);
+
 struct vpci {
     /* List of vPCI handlers for a device. */
     struct list_head handlers;
     spinlock_t lock;
+
+#ifdef __XEN__
+    /* Hide the rest of the vpci struct from the user-space test harness. */
+    struct vpci_header {
+        /* Information about the PCI BARs of this device. */
+        struct vpci_bar {
+            uint64_t addr;
+            uint64_t size;
+            enum {
+                VPCI_BAR_EMPTY,
+                VPCI_BAR_IO,
+                VPCI_BAR_MEM32,
+                VPCI_BAR_MEM64_LO,
+                VPCI_BAR_MEM64_HI,
+                VPCI_BAR_ROM,
+            } type;
+            bool prefetchable : 1;
+            /* Store whether the BAR is mapped into guest p2m. */
+            bool enabled      : 1;
+#define PCI_HEADER_NORMAL_NR_BARS        6
+#define PCI_HEADER_BRIDGE_NR_BARS        2
+        } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
+        /* At most 6 BARS + 1 expansion ROM BAR. */
+
+        /*
+         * Store whether the ROM enable bit is set (doesn't imply ROM BAR
+         * is mapped into guest p2m) if there's a ROM BAR on the device.
+         */
+        bool rom_enabled      : 1;
+        /* FIXME: currently there's no support for SR-IOV. */
+    } header;
+#endif
+};
+
+struct vpci_vcpu {
+    /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
+    struct rangeset *mem;
+    struct pci_dev *pdev;
+    bool map      : 1;
+    bool rom_only : 1;
 };
 
+#else /* !CONFIG_HAS_VPCI */
+struct vpci_vcpu {};
+#endif
+
 #endif
 
 /*
-- 
2.16.2



* [PATCH v11 08/12] x86/pt: mask MSI vectors on unbind
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
                   ` (6 preceding siblings ...)
  2018-03-20 15:15 ` [PATCH v11 07/12] vpci: add header handlers Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 09/12] vpci/msi: add MSI handlers Roger Pau Monne
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel; +Cc: Boris Ostrovsky, Roger Pau Monne, Jan Beulich

When an MSI device with per-vector masking capabilities is detected or
added to Xen, all the vectors are masked when initializing it. This
implies that the first time the interrupt is bound to a domain it's
masked.

This however only applies to the first time the interrupt is bound,
because neither the unbind nor the pirq unmap will mask the vector
again. To fix this, re-mask the interrupt when unbinding it from a
guest. This makes sure that pairs of bind/unbind will always get the
same masking state.

Note that no issues have been reported regarding this behavior because
QEMU always uses the newly introduced XEN_PT_GFLAGSSHIFT_UNMASKED when
binding interrupts, so it's always unmasked.
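
As a rough illustration of the invariant described above, here is a toy
model in plain C. It is not Xen code: toy_msi, toy_init, toy_bind and
toy_unbind are hypothetical names, and the model only tracks the mask
state of a single vector. The remask flag selects the behaviour with
(true) or without (false) the re-masking on unbind.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a single MSI vector; not a Xen structure. */
struct toy_msi {
    bool bound;
    bool masked;
};

/* Device detection: vectors with per-vector masking start out masked. */
static void toy_init(struct toy_msi *m)
{
    m->bound = false;
    m->masked = true;
}

/* Binding keeps the current mask state unless "unmasked" is requested. */
static void toy_bind(struct toy_msi *m, bool unmasked)
{
    assert(!m->bound);
    m->bound = true;
    if ( unmasked )
        m->masked = false;
}

/* Unbinding; remask == true models re-masking the vector on unbind. */
static void toy_unbind(struct toy_msi *m, bool remask)
{
    assert(m->bound);
    m->bound = false;
    if ( remask )
        m->masked = true;
}
```

In this model, a bind that does not request the unmasked state only
finds the vector masked on a second bind if the preceding unbind
re-masked it.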

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
---
Changes since v7:
 - New in this version.
---
 xen/drivers/passthrough/io.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 8f16e6c0a5..bab3aa349a 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -645,7 +645,22 @@ int pt_irq_destroy_bind(
         }
         break;
     case PT_IRQ_TYPE_MSI:
+    {
+        unsigned long flags;
+        struct irq_desc *desc = domain_spin_lock_irq_desc(d, machine_gsi,
+                                                          &flags);
+
+        if ( !desc )
+            return -EINVAL;
+        /*
+         * Leave the MSI masked, so that the state when calling
+         * pt_irq_create_bind is consistent across bind/unbinds.
+         */
+        guest_mask_msi_irq(desc, true);
+        spin_unlock_irqrestore(&desc->lock, flags);
         break;
+    }
+
     default:
         return -EOPNOTSUPP;
     }
-- 
2.16.2



* [PATCH v11 09/12] vpci/msi: add MSI handlers
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
                   ` (7 preceding siblings ...)
  2018-03-20 15:15 ` [PATCH v11 08/12] x86/pt: mask MSI vectors on unbind Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-21 12:34   ` Paul Durrant
  2018-03-20 15:15 ` [PATCH v11 10/12] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Paul Durrant, Jan Beulich,
	Boris Ostrovsky, Roger Pau Monne

Add handlers for the MSI control, address, data and mask fields in
order to detect accesses to them and set up the interrupts as requested
by the guest.

Note that the pending register is not trapped, and the guest can
freely read/write to it.
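
For readers not familiar with the MSI capability layout, the following
standalone sketch shows where the trapped fields (and the untrapped
pending register) sit relative to the capability position. The offsets
follow the layout in the PCI specification; the toy_* names are
hypothetical and only mirror what the msi_*_reg macros compute, they
are not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Offsets relative to the start of the MSI capability. */
enum {
    TOY_MSI_CONTROL    = 0x02, /* 16-bit message control */
    TOY_MSI_ADDRESS_LO = 0x04, /* 32-bit address, low part */
    TOY_MSI_ADDRESS_HI = 0x08, /* only present with 64-bit addressing */
};

/* Data follows the address, so 64-bit addressing shifts it by 4 bytes. */
static unsigned int toy_msi_data_reg(unsigned int pos, bool is64)
{
    /* Capabilities live in the device specific area and are dword aligned. */
    assert(pos >= 0x40 && !(pos & 3));

    return pos + (is64 ? 0x0c : 0x08);
}

/* Mask bits register, only present if per-vector masking is supported. */
static unsigned int toy_msi_mask_reg(unsigned int pos, bool is64)
{
    assert(pos >= 0x40 && !(pos & 3));

    return pos + (is64 ? 0x10 : 0x0c);
}

/* Pending bits register, placed right after the mask bits register. */
static unsigned int toy_msi_pending_reg(unsigned int pos, bool is64)
{
    return toy_msi_mask_reg(pos, is64) + 4;
}
```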

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v8:
 - Add a FIXME about the lack of testing and a comment regarding the
   lack of cleaning done in the init_msi error path.
 - Free msi struct when cleaning up if an init function failed.
 - Remove the 'error' label of init_msi, the caller will already
   perform the cleaning.

Changes since v7:
 - Don't store pci segment/bus on local variables.
 - Add an error label to init_msi.
 - Don't trap accesses to the PBA.
 - Fix msi_pending_bits_reg macro so it matches coding style.
 - Move the position of vectors in the vpci_msi struct.
 - Add a comment to clarify the expected state of vectors after
   pt_irq_create_bind and use XEN_DOMCTL_VMSI_X86_UNMASKED.

Changes since v6:
 - Use domain_spin_lock_irq_desc instead of open coding it.
 - Reduce the size of printed debug messages.
 - Constify domain in vpci_dump_msi.
 - Lock domlist_read_lock before iterating over the list of domains.
 - Make max_vectors and vectors uint8_t.
 - Drop the vpci_ prefix from the static functions in msi.c.
 - Turn the booleans in vpci_msi into bitfields.
 - Apply the mask bits to all vectors when enabling msi.
 - Remove the pos field.
 - Remove the usage of __msi_set_{enable/disable}.
 - Update the bindings when the message or data fields are updated.
 - Make vpci_msi_arch_disable return void, it wasn't returning any
   error.
 - Prevent the guest from writing to the pending bits field, it's read
   only as defined in the spec.
 - Add the must_check attribute to vpci_msi_arch_enable.

Changes since v5:
 - Update to new lock usage.
 - Change handlers to match the new type.
 - s/msi_flags/msi_gflags/, remove the local variables and use the new
   DOMCTL_VMSI_* defines.
 - Change the MSI arch function to take a vpci_msi instead of a
   vpci_arch_msi as parameter.
 - Fix the calculation of the guest vector for MSI injection to take
   into account the number of bits that can be modified.
 - Use INVALID_PIRQ everywhere.
 - Simplify exit path of vpci_msi_disable.
 - Remove the conditional when setting address64 and masking fields.
 - Add a process_pending_softirqs to the MSI dump loop.
 - Place the prototypes for the MSI arch-specific functions in
   xen/vpci.h.
 - Add parentheses around the INVALID_PIRQ definition.
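
The guest vector calculation mentioned in the list above can be
sketched in isolation as follows. This is a simplified standalone
version, assuming the number of enabled vectors is a power of two (as
MSI requires), in which case the mask of modifiable bits is simply
nr_vectors - 1. toy_guest_vector is a hypothetical helper, not a
function from this series.

```c
#include <assert.h>
#include <stdint.h>

/* Guest vector for message i out of nr_vectors enabled messages. */
static uint8_t toy_guest_vector(uint8_t base, unsigned int nr_vectors,
                                unsigned int i)
{
    uint8_t low_mask;

    /* MSI allows 1, 2, 4, 8, 16 or 32 enabled messages. */
    assert(nr_vectors && !(nr_vectors & (nr_vectors - 1)) &&
           nr_vectors <= 32);
    assert(i < nr_vectors);

    /* With 2^n messages enabled only the n low-order bits may change. */
    low_mask = (uint8_t)(nr_vectors - 1);

    /* Keep the high bits of the base vector, wrap the low bits. */
    return (uint8_t)((base & ~low_mask) | ((base + i) & low_mask));
}
```

For example, with a base vector of 0x43 and 4 enabled messages, message
1 maps to 0x40: the two low-order bits wrap around while the upper bits
stay untouched.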

Changes since v4:
 - Fix commit message.
 - Change the ASSERTs in vpci_msi_arch_mask into ifs.
 - Introduce INVALID_PIRQ.
 - Destroy the partially created bindings in case of failure in
   vpci_msi_arch_enable.
 - Just take the pcidevs lock once in vpci_msi_arch_disable.
 - Print an error message in case of failure of pt_irq_destroy_bind.
 - Make vpci_msi_arch_init return void.
 - Constify the arch parameter of vpci_msi_arch_print.
 - Use fixed instead of cpu for msi redirection.
 - Separate the header includes in vpci/msi.c between xen and asm.
 - Store the number of configured vectors even if MSI is not enabled
   and always return it in vpci_msi_control_read.
 - Fix/add comments in vpci_msi_control_write to clarify intended
   behavior.
 - Simplify usage of masks in vpci_msi_address_{upper_}write.
 - Add comment to vpci_msi_mask_{read/write}.
 - Don't use MASK_EXTR in vpci_msi_mask_write.
 - s/msi_offset/pos/ in vpci_init_msi.
 - Move control variable setup closer to its usage.
 - Use d%d in vpci_dump_msi.
 - Fix printing of bitfield mask in vpci_dump_msi.
 - Fix definition of MSI_ADDR_REDIRECTION_MASK.
 - Shuffle the layout of vpci_msi to minimize gaps.
 - Remove the error label in vpci_init_msi.

Changes since v3:
 - Propagate changes from previous versions: drop xen_ prefix, drop
   return value from handlers, use the new vpci_val fields.
 - Use MASK_EXTR.
 - Remove the usage of GENMASK.
 - Add GFLAGS_SHIFT_DEST_ID and use it in msi_flags.
 - Add "arch" to the MSI arch specific functions.
 - Move the dumping of vPCI MSI information to dump_msi (key 'M').
 - Remove the guest_vectors field.
 - Allow the guest to change the number of active vectors without
   having to disable and enable MSI.
 - Check the number of active vectors when parsing the disable
   mask.
 - Remove the debug messages from vpci_init_msi.
 - Move the arch-specific part of the dump handler to x86/hvm/vmsi.c.
 - Use trylock in the dump handler to get the vpci lock.
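
Regarding the check of the number of active vectors when parsing the
mask: the idea is that only bits that changed and that belong to an
active vector need any action. A minimal standalone sketch, using a
plain loop instead of ffs() and a hypothetical toy_changed_vectors
helper that returns the affected vectors as a bitmap instead of calling
into the arch code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return a bitmap of the vectors whose mask state needs updating when the
 * mask register goes from old_mask to new_mask with nr_vectors active.
 */
static uint32_t toy_changed_vectors(uint32_t old_mask, uint32_t new_mask,
                                    unsigned int nr_vectors)
{
    uint32_t diff = old_mask ^ new_mask;
    uint32_t touched = 0;
    unsigned int i;

    assert(nr_vectors >= 1 && nr_vectors <= 32);

    for ( i = 0; i < nr_vectors && diff; i++ )
    {
        uint32_t bit = UINT32_C(1) << i;

        if ( diff & bit )
        {
            /* A real handler would mask or unmask vector i here. */
            touched |= bit;
            diff &= ~bit;
        }
    }

    return touched;
}
```

Bits above the number of active vectors are stored but trigger no
action, e.g. changing bits 0, 2 and 4 with 4 active vectors only
touches vectors 0 and 2.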

Changes since v2:
 - Add an arch-specific abstraction layer. Note that this is only implemented
   for x86 currently.
 - Add a wrapper to detect MSI enabling for vPCI.
---
NB: I've only been able to test this with devices using a single MSI
interrupt and no mask register. I will try to find hardware that
supports the mask register and more than one vector, but I cannot make
any promises.

If there are doubts about the untested parts we could always force Xen
to report no per-vector masking support and only 1 available vector,
but I would rather avoid doing it.
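
To make the fallback above concrete, this is roughly what filtering the
control word would look like. It's only a sketch of the idea and NOT
what this patch does: toy_max_vectors and toy_restrict_control are
hypothetical helpers, and the bit definitions are local copies of the
message control layout from the PCI specification.

```c
#include <assert.h>
#include <stdint.h>

/* Message control bits as laid out by the PCI specification. */
#define TOY_MSI_FLAGS_QMASK   0x000e /* multiple message capable (log2) */
#define TOY_MSI_FLAGS_MASKBIT 0x0100 /* per-vector masking capable */

/* Number of vectors advertised in a control word. */
static unsigned int toy_max_vectors(uint16_t control)
{
    return 1u << ((control & TOY_MSI_FLAGS_QMASK) >> 1);
}

/* Hypothetical fallback: advertise a single vector and no masking. */
static uint16_t toy_restrict_control(uint16_t control)
{
    uint16_t restricted = (uint16_t)(control & ~(TOY_MSI_FLAGS_QMASK |
                                                 TOY_MSI_FLAGS_MASKBIT));

    assert(toy_max_vectors(restricted) == 1);

    return restricted;
}
```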
---
 xen/arch/x86/hvm/vmsi.c      | 142 +++++++++++++++++++
 xen/arch/x86/msi.c           |   3 +
 xen/drivers/vpci/Makefile    |   2 +-
 xen/drivers/vpci/msi.c       | 324 +++++++++++++++++++++++++++++++++++++++++++
 xen/drivers/vpci/vpci.c      |   1 +
 xen/include/asm-x86/hvm/io.h |   5 +
 xen/include/asm-x86/msi.h    |   3 +
 xen/include/xen/irq.h        |   1 +
 xen/include/xen/vpci.h       |  38 +++++
 9 files changed, 518 insertions(+), 1 deletion(-)
 create mode 100644 xen/drivers/vpci/msi.c

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 7126de7841..be59c56d43 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -31,6 +31,7 @@
 #include <xen/errno.h>
 #include <xen/sched.h>
 #include <xen/irq.h>
+#include <xen/vpci.h>
 #include <public/hvm/ioreq.h>
 #include <asm/hvm/io.h>
 #include <asm/hvm/vpic.h>
@@ -621,3 +622,144 @@ void msix_write_completion(struct vcpu *v)
     if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
         gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
 }
+
+static unsigned int msi_gflags(uint16_t data, uint64_t addr, bool masked)
+{
+    /*
+     * We need to use the DOMCTL constants here because the output of this
+     * function is used as input to pt_irq_create_bind, which also takes the
+     * input from the DOMCTL itself.
+     */
+    return MASK_INSR(MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
+                     XEN_DOMCTL_VMSI_X86_DEST_ID_MASK) |
+           MASK_INSR(MASK_EXTR(addr, MSI_ADDR_REDIRECTION_MASK),
+                     XEN_DOMCTL_VMSI_X86_RH_MASK) |
+           MASK_INSR(MASK_EXTR(addr, MSI_ADDR_DESTMODE_MASK),
+                     XEN_DOMCTL_VMSI_X86_DM_MASK) |
+           MASK_INSR(MASK_EXTR(data, MSI_DATA_DELIVERY_MODE_MASK),
+                     XEN_DOMCTL_VMSI_X86_DELIV_MASK) |
+           MASK_INSR(MASK_EXTR(data, MSI_DATA_TRIGGER_MASK),
+                     XEN_DOMCTL_VMSI_X86_TRIG_MASK) |
+           /* NB: by default MSI vectors are bound masked. */
+           (masked ? 0 : XEN_DOMCTL_VMSI_X86_UNMASKED);
+}
+
+void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
+                        unsigned int entry, bool mask)
+{
+    unsigned long flags;
+    struct irq_desc *desc = domain_spin_lock_irq_desc(pdev->domain,
+                                                      msi->arch.pirq + entry,
+                                                      &flags);
+
+    if ( !desc )
+        return;
+    guest_mask_msi_irq(desc, mask);
+    spin_unlock_irqrestore(&desc->lock, flags);
+}
+
+int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
+                         unsigned int vectors)
+{
+    struct msi_info msi_info = {
+        .seg = pdev->seg,
+        .bus = pdev->bus,
+        .devfn = pdev->devfn,
+        .entry_nr = vectors,
+    };
+    unsigned int i;
+    int rc;
+
+    ASSERT(msi->arch.pirq == INVALID_PIRQ);
+
+    /* Get a PIRQ. */
+    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &msi->arch.pirq,
+                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
+    if ( rc )
+    {
+        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
+                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                 PCI_FUNC(pdev->devfn), rc);
+        return rc;
+    }
+
+    for ( i = 0; i < vectors; i++ )
+    {
+        uint8_t vector = MASK_EXTR(msi->data, MSI_DATA_VECTOR_MASK);
+        uint8_t vector_mask = 0xff >> (8 - fls(msi->vectors) + 1);
+        struct xen_domctl_bind_pt_irq bind = {
+            .machine_irq = msi->arch.pirq + i,
+            .irq_type = PT_IRQ_TYPE_MSI,
+            .u.msi.gvec = (vector & ~vector_mask) |
+                          ((vector + i) & vector_mask),
+            .u.msi.gflags = msi_gflags(msi->data, msi->address,
+                                       (msi->mask >> i) & 1),
+        };
+
+        pcidevs_lock();
+        rc = pt_irq_create_bind(pdev->domain, &bind);
+        if ( rc )
+        {
+            gdprintk(XENLOG_ERR,
+                     "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
+                     pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), msi->arch.pirq + i, rc);
+            while ( bind.machine_irq-- > msi->arch.pirq )
+                pt_irq_destroy_bind(pdev->domain, &bind);
+            spin_lock(&pdev->domain->event_lock);
+            unmap_domain_pirq(pdev->domain, msi->arch.pirq);
+            spin_unlock(&pdev->domain->event_lock);
+            pcidevs_unlock();
+            msi->arch.pirq = INVALID_PIRQ;
+            return rc;
+        }
+        pcidevs_unlock();
+    }
+
+    return 0;
+}
+
+void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
+{
+    unsigned int i;
+
+    ASSERT(msi->arch.pirq != INVALID_PIRQ);
+
+    pcidevs_lock();
+    for ( i = 0; i < msi->vectors; i++ )
+    {
+        struct xen_domctl_bind_pt_irq bind = {
+            .machine_irq = msi->arch.pirq + i,
+            .irq_type = PT_IRQ_TYPE_MSI,
+        };
+        int rc;
+
+        rc = pt_irq_destroy_bind(pdev->domain, &bind);
+        ASSERT(!rc);
+    }
+
+    spin_lock(&pdev->domain->event_lock);
+    unmap_domain_pirq(pdev->domain, msi->arch.pirq);
+    spin_unlock(&pdev->domain->event_lock);
+    pcidevs_unlock();
+
+    msi->arch.pirq = INVALID_PIRQ;
+}
+
+void vpci_msi_arch_init(struct vpci_msi *msi)
+{
+    msi->arch.pirq = INVALID_PIRQ;
+}
+
+void vpci_msi_arch_print(const struct vpci_msi *msi)
+{
+    printk("vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu pirq: %d\n",
+           MASK_EXTR(msi->data, MSI_DATA_VECTOR_MASK),
+           msi->data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
+           msi->data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
+           msi->data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
+           msi->address & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
+           msi->address & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "fixed",
+           MASK_EXTR(msi->address, MSI_ADDR_DEST_ID_MASK),
+           msi->arch.pirq);
+}
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 8c89f072a8..5567990fbd 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -30,6 +30,7 @@
 #include <public/physdev.h>
 #include <xen/iommu.h>
 #include <xsm/xsm.h>
+#include <xen/vpci.h>
 
 static s8 __read_mostly use_msi = -1;
 boolean_param("msi", use_msi);
@@ -1527,6 +1528,8 @@ static void dump_msi(unsigned char key)
                attr.guest_masked ? 'G' : ' ',
                mask);
     }
+
+    vpci_dump_msi();
 }
 
 static int __init msi_setup_keyhandler(void)
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 241467212f..62cec9e82b 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o header.o
+obj-y += vpci.o header.o msi.o
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
new file mode 100644
index 0000000000..c3c69ec453
--- /dev/null
+++ b/xen/drivers/vpci/msi.c
@@ -0,0 +1,324 @@
+/*
+ * Handlers for accesses to the MSI capability structure.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/softirq.h>
+#include <xen/vpci.h>
+
+#include <asm/msi.h>
+
+static uint32_t control_read(const struct pci_dev *pdev, unsigned int reg,
+                             void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    return MASK_INSR(fls(msi->max_vectors) - 1, PCI_MSI_FLAGS_QMASK) |
+           MASK_INSR(fls(msi->vectors) - 1, PCI_MSI_FLAGS_QSIZE) |
+           (msi->enabled ? PCI_MSI_FLAGS_ENABLE : 0) |
+           (msi->masking ? PCI_MSI_FLAGS_MASKBIT : 0) |
+           (msi->address64 ? PCI_MSI_FLAGS_64BIT : 0);
+}
+
+static void control_write(const struct pci_dev *pdev, unsigned int reg,
+                          uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+    unsigned int vectors = min_t(uint8_t,
+                                 1u << MASK_EXTR(val, PCI_MSI_FLAGS_QSIZE),
+                                 msi->max_vectors);
+    bool new_enabled = val & PCI_MSI_FLAGS_ENABLE;
+
+    /*
+     * No change if the enable field and the number of vectors is
+     * the same or the device is not enabled, in which case the
+     * vectors field can be updated directly.
+     */
+    if ( new_enabled == msi->enabled &&
+         (vectors == msi->vectors || !msi->enabled) )
+    {
+        msi->vectors = vectors;
+        return;
+    }
+
+    if ( new_enabled )
+    {
+        /*
+         * If the device is already enabled it means the number of
+         * enabled messages has changed. Disable and re-enable the
+         * device in order to apply the change.
+         */
+        if ( msi->enabled )
+        {
+            vpci_msi_arch_disable(msi, pdev);
+            msi->enabled = false;
+        }
+
+        if ( vpci_msi_arch_enable(msi, pdev, vectors) )
+            return;
+    }
+    else
+        vpci_msi_arch_disable(msi, pdev);
+
+    msi->vectors = vectors;
+    msi->enabled = new_enabled;
+
+    pci_conf_write16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), reg,
+                     control_read(pdev, reg, data));
+}
+
+static void update_msi(const struct pci_dev *pdev, struct vpci_msi *msi)
+{
+    if ( !msi->enabled )
+        return;
+
+    vpci_msi_arch_disable(msi, pdev);
+    if ( vpci_msi_arch_enable(msi, pdev, msi->vectors) )
+        msi->enabled = false;
+}
+
+/* Handlers for the address field (32bit or low part of a 64bit address). */
+static uint32_t address_read(const struct pci_dev *pdev, unsigned int reg,
+                             void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    return msi->address;
+}
+
+static void address_write(const struct pci_dev *pdev, unsigned int reg,
+                          uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    /* Clear low part. */
+    msi->address &= ~0xffffffffull;
+    msi->address |= val;
+
+    update_msi(pdev, msi);
+}
+
+/* Handlers for the high part of a 64bit address field. */
+static uint32_t address_hi_read(const struct pci_dev *pdev, unsigned int reg,
+                                void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    return msi->address >> 32;
+}
+
+static void address_hi_write(const struct pci_dev *pdev, unsigned int reg,
+                             uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    /* Clear and update high part. */
+    msi->address &= 0xffffffff;
+    msi->address |= (uint64_t)val << 32;
+
+    update_msi(pdev, msi);
+}
+
+/* Handlers for the data field. */
+static uint32_t data_read(const struct pci_dev *pdev, unsigned int reg,
+                          void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    return msi->data;
+}
+
+static void data_write(const struct pci_dev *pdev, unsigned int reg,
+                       uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    msi->data = val;
+
+    update_msi(pdev, msi);
+}
+
+/* Handlers for the MSI mask bits. */
+static uint32_t mask_read(const struct pci_dev *pdev, unsigned int reg,
+                          void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    return msi->mask;
+}
+
+static void mask_write(const struct pci_dev *pdev, unsigned int reg,
+                       uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+    uint32_t dmask = msi->mask ^ val;
+
+    if ( !dmask )
+        return;
+
+    if ( msi->enabled )
+    {
+        unsigned int i;
+
+        for ( i = ffs(dmask) - 1; dmask && i < msi->vectors;
+              i = ffs(dmask) - 1 )
+        {
+            vpci_msi_arch_mask(msi, pdev, i, (val >> i) & 1);
+            __clear_bit(i, &dmask);
+        }
+    }
+
+    msi->mask = val;
+}
+
+static int init_msi(struct pci_dev *pdev)
+{
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    unsigned int pos = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
+                                           PCI_CAP_ID_MSI);
+    uint16_t control;
+    int ret;
+
+    if ( !pos )
+        return 0;
+
+    pdev->vpci->msi = xzalloc(struct vpci_msi);
+    if ( !pdev->vpci->msi )
+        return -ENOMEM;
+
+    ret = vpci_add_register(pdev->vpci, control_read, control_write,
+                            msi_control_reg(pos), 2, pdev->vpci->msi);
+    if ( ret )
+        /*
+         * NB: there's no need to free the msi struct or remove the register
+         * handlers from the config space, the caller will take care of the
+         * cleanup.
+         */
+        return ret;
+
+    /* Get the maximum number of vectors the device supports. */
+    control = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
+                              msi_control_reg(pos));
+
+    /*
+     * FIXME: I've only been able to test this code with devices using a single
+     * MSI interrupt and no mask register.
+     */
+    pdev->vpci->msi->max_vectors = multi_msi_capable(control);
+    ASSERT(pdev->vpci->msi->max_vectors <= 32);
+
+    /* The multiple message enable is 0 after reset (1 message enabled). */
+    pdev->vpci->msi->vectors = 1;
+
+    /* No PIRQ bound yet. */
+    vpci_msi_arch_init(pdev->vpci->msi);
+
+    pdev->vpci->msi->address64 = is_64bit_address(control);
+    pdev->vpci->msi->masking = is_mask_bit_support(control);
+
+    ret = vpci_add_register(pdev->vpci, address_read, address_write,
+                            msi_lower_address_reg(pos), 4, pdev->vpci->msi);
+    if ( ret )
+        return ret;
+
+    ret = vpci_add_register(pdev->vpci, data_read, data_write,
+                            msi_data_reg(pos, pdev->vpci->msi->address64), 2,
+                            pdev->vpci->msi);
+    if ( ret )
+        return ret;
+
+    if ( pdev->vpci->msi->address64 )
+    {
+        ret = vpci_add_register(pdev->vpci, address_hi_read, address_hi_write,
+                                msi_upper_address_reg(pos), 4, pdev->vpci->msi);
+        if ( ret )
+            return ret;
+    }
+
+    if ( pdev->vpci->msi->masking )
+    {
+        ret = vpci_add_register(pdev->vpci, mask_read, mask_write,
+                                msi_mask_bits_reg(pos,
+                                                  pdev->vpci->msi->address64),
+                                4, pdev->vpci->msi);
+        if ( ret )
+            return ret;
+        /*
+         * FIXME: do not add any handler for the pending bits for the hardware
+         * domain, which means direct access. This will be revisited when
+         * adding unprivileged domain support.
+         */
+    }
+
+    return 0;
+}
+REGISTER_VPCI_INIT(init_msi);
+
+void vpci_dump_msi(void)
+{
+    const struct domain *d;
+
+    rcu_read_lock(&domlist_read_lock);
+    for_each_domain ( d )
+    {
+        const struct pci_dev *pdev;
+
+        if ( !has_vpci(d) )
+            continue;
+
+        printk("vPCI MSI d%d\n", d->domain_id);
+
+        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
+        {
+            const struct vpci_msi *msi;
+
+            if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
+                continue;
+
+            msi = pdev->vpci->msi;
+            if ( msi && msi->enabled )
+            {
+                printk("%04x:%02x:%02x.%u MSI\n", pdev->seg, pdev->bus,
+                       PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+                printk("  enabled: %d 64-bit: %d",
+                       msi->enabled, msi->address64);
+                if ( msi->masking )
+                    printk(" mask=%08x", msi->mask);
+                printk(" vectors max: %u enabled: %u\n",
+                       msi->max_vectors, msi->vectors);
+
+                vpci_msi_arch_print(msi);
+            }
+
+            spin_unlock(&pdev->vpci->lock);
+            process_pending_softirqs();
+        }
+    }
+    rcu_read_unlock(&domlist_read_lock);
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index e5b49b9d82..3012b30013 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -47,6 +47,7 @@ void vpci_remove_device(struct pci_dev *pdev)
         xfree(r);
     }
     spin_unlock(&pdev->vpci->lock);
+    xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
     pdev->vpci = NULL;
 }
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 16465ceb30..0fedb3473c 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -127,6 +127,11 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_irq,
 void msix_write_completion(struct vcpu *);
 void msixtbl_init(struct domain *d);
 
+/* Arch-specific MSI data for vPCI. */
+struct vpci_arch_msi {
+    int pirq;
+};
+
 enum stdvga_cache_state {
     STDVGA_CACHE_UNINITIALIZED,
     STDVGA_CACHE_ENABLED,
diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
index 37d37b820e..10387dce2e 100644
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -48,6 +48,7 @@
 #define MSI_ADDR_REDIRECTION_SHIFT  3
 #define MSI_ADDR_REDIRECTION_CPU    (0 << MSI_ADDR_REDIRECTION_SHIFT)
 #define MSI_ADDR_REDIRECTION_LOWPRI (1 << MSI_ADDR_REDIRECTION_SHIFT)
+#define MSI_ADDR_REDIRECTION_MASK   (1 << MSI_ADDR_REDIRECTION_SHIFT)
 
 #define MSI_ADDR_DEST_ID_SHIFT		12
 #define	 MSI_ADDR_DEST_ID_MASK		0x00ff000
@@ -152,6 +153,8 @@ int msi_free_irq(struct msi_desc *entry);
 	( (is64bit == 1) ? base+PCI_MSI_DATA_64 : base+PCI_MSI_DATA_32 )
 #define msi_mask_bits_reg(base, is64bit) \
 	( (is64bit == 1) ? base+PCI_MSI_MASK_BIT : base+PCI_MSI_MASK_BIT-4)
+#define msi_pending_bits_reg(base, is64bit) \
+	((base) + PCI_MSI_MASK_BIT + ((is64bit) ? 4 : 0))
 #define msi_disable(control)		control &= ~PCI_MSI_FLAGS_ENABLE
 #define multi_msi_capable(control) \
 	(1 << ((control & PCI_MSI_FLAGS_QMASK) >> 1))
diff --git a/xen/include/xen/irq.h b/xen/include/xen/irq.h
index 0aa817e266..586b78393a 100644
--- a/xen/include/xen/irq.h
+++ b/xen/include/xen/irq.h
@@ -133,6 +133,7 @@ struct pirq {
     struct arch_pirq arch;
 };
 
+#define INVALID_PIRQ (-1)
 #define pirq_info(d, p) ((struct pirq *)radix_tree_lookup(&(d)->pirq_tree, p))
 
 /* Use this instead of pirq_info() if the structure may need allocating. */
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 6bf8b22b4f..116b93f519 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -87,6 +87,30 @@ struct vpci {
         /* FIXME: currently there's no support for SR-IOV. */
     } header;
 #endif
+
+    /* MSI data. */
+    struct vpci_msi {
+#ifdef __XEN__
+        /* Address. */
+        uint64_t address;
+        /* Mask bitfield. */
+        uint32_t mask;
+        /* Data. */
+        uint16_t data;
+        /* Maximum number of vectors supported by the device. */
+        uint8_t max_vectors : 5;
+        /* Enabled? */
+        bool enabled        : 1;
+        /* Supports per-vector masking? */
+        bool masking        : 1;
+        /* 64-bit address capable? */
+        bool address64      : 1;
+        /* Number of vectors configured. */
+        uint8_t vectors     : 5;
+        /* Arch-specific data. */
+        struct vpci_arch_msi arch;
+#endif
+    } *msi;
 };
 
 struct vpci_vcpu {
@@ -97,6 +121,20 @@ struct vpci_vcpu {
     bool rom_only : 1;
 };
 
+#ifdef __XEN__
+void vpci_dump_msi(void);
+
+/* Arch-specific vPCI MSI helpers. */
+void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
+                        unsigned int entry, bool mask);
+int __must_check vpci_msi_arch_enable(struct vpci_msi *msi,
+                                      const struct pci_dev *pdev,
+                                      unsigned int vectors);
+void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev);
+void vpci_msi_arch_init(struct vpci_msi *msi);
+void vpci_msi_arch_print(const struct vpci_msi *msi);
+#endif /* __XEN__ */
+
 #else /* !CONFIG_HAS_VPCI */
 struct vpci_vcpu {};
 #endif
-- 
2.16.2


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* [PATCH v11 10/12] vpci: add a priority parameter to the vPCI register initializer
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
                   ` (8 preceding siblings ...)
  2018-03-20 15:15 ` [PATCH v11 09/12] vpci/msi: add MSI handlers Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-22  9:57   ` Jan Beulich
  2018-03-20 15:15 ` [PATCH v11 11/12] vpci/msix: add MSI-X handlers Roger Pau Monne
  2018-03-20 15:15 ` [PATCH v11 12/12] vpci: do not expose unneeded functions to the user-space test harness Roger Pau Monne
  11 siblings, 1 reply; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Jan Beulich,
	Boris Ostrovsky, Roger Pau Monne

This is needed for MSI-X, which has to be initialized before the BARs
are parsed: the header BAR handlers must know about the MSI-X related
holes and leave them unmapped, otherwise the trap handlers cannot work
properly.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v4:
 - Add a middle priority and add the PCI header to it.

Changes since v3:
 - Add a numerical suffix to the section used to store the pointer to
   each initializer function, and sort them at link time.
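
For readers unfamiliar with the technique, here is a small user-space
sketch of the ordering. It does not involve a linker script: it sorts a
table by section name with strcmp(), which only approximates what
SORT(.data.vpci.*) does with the per-priority input sections. struct
init_entry, INIT_ENTRY and sort_inits are hypothetical names made up for
this example, and putting init_msix at high priority is an assumption
based on the commit message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Priorities are strings so they can be pasted into the section name. */
#define VPCI_PRIORITY_HIGH   "1"
#define VPCI_PRIORITY_MIDDLE "5"
#define VPCI_PRIORITY_LOW    "9"

/* Hypothetical stand-in for one pointer emitted by REGISTER_VPCI_INIT(). */
struct init_entry {
    const char *section; /* e.g. ".data.vpci.5" */
    const char *name;    /* initializer name, for illustration only */
};

/* Adjacent string literals are concatenated by the compiler. */
#define INIT_ENTRY(fn, p) { ".data.vpci." p, #fn }

static int cmp_section(const void *a, const void *b)
{
    const struct init_entry *x = a, *y = b;

    return strcmp(x->section, y->section);
}

/* Order the entries the way a name-sorted section list would be laid out. */
static void sort_inits(struct init_entry *inits, size_t n)
{
    assert(inits != NULL);
    qsort(inits, n, sizeof(*inits), cmp_section);
}
```

Sorting entries registered as LOW, MIDDLE and HIGH puts the HIGH one
first. Single-digit strings keep the lexical order equal to the numeric
one; multi-digit priorities would need zero padding.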
---
 xen/arch/arm/xen.lds.S    | 4 ++--
 xen/arch/x86/xen.lds.S    | 4 ++--
 xen/drivers/vpci/header.c | 2 +-
 xen/drivers/vpci/msi.c    | 2 +-
 xen/include/xen/vpci.h    | 8 ++++++--
 5 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
index 49cae2af71..245a0e0e85 100644
--- a/xen/arch/arm/xen.lds.S
+++ b/xen/arch/arm/xen.lds.S
@@ -69,7 +69,7 @@ SECTIONS
 #if defined(CONFIG_HAS_VPCI) && defined(CONFIG_LATE_HWDOM)
        . = ALIGN(POINTER_ALIGN);
        __start_vpci_array = .;
-       *(.data.vpci)
+       *(SORT(.data.vpci.*))
        __end_vpci_array = .;
 #endif
   } :text
@@ -182,7 +182,7 @@ SECTIONS
 #if defined(CONFIG_HAS_VPCI) && !defined(CONFIG_LATE_HWDOM)
        . = ALIGN(POINTER_ALIGN);
        __start_vpci_array = .;
-       *(.data.vpci)
+       *(SORT(.data.vpci.*))
        __end_vpci_array = .;
 #endif
   } :text
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 7bd6fb51c3..70afedd31d 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -139,7 +139,7 @@ SECTIONS
 #if defined(CONFIG_HAS_VPCI) && defined(CONFIG_LATE_HWDOM)
        . = ALIGN(POINTER_ALIGN);
        __start_vpci_array = .;
-       *(.data.vpci)
+       *(SORT(.data.vpci.*))
        __end_vpci_array = .;
 #endif
   } :text
@@ -246,7 +246,7 @@ SECTIONS
 #if defined(CONFIG_HAS_VPCI) && !defined(CONFIG_LATE_HWDOM)
        . = ALIGN(POINTER_ALIGN);
        __start_vpci_array = .;
-       *(.data.vpci)
+       *(SORT(.data.vpci.*))
        __end_vpci_array = .;
 #endif
   } :text
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index d7c220a452..8d9d6f43f3 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -535,7 +535,7 @@ static int init_bars(struct pci_dev *pdev)
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, true, false) : 0;
 }
-REGISTER_VPCI_INIT(init_bars);
+REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
 
 /*
  * Local variables:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index c3c69ec453..de4ddf562e 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -267,7 +267,7 @@ static int init_msi(struct pci_dev *pdev)
 
     return 0;
 }
-REGISTER_VPCI_INIT(init_msi);
+REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 116b93f519..7266c17679 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -15,9 +15,13 @@ typedef void vpci_write_t(const struct pci_dev *pdev, unsigned int reg,
 
 typedef int vpci_register_init_t(struct pci_dev *dev);
 
-#define REGISTER_VPCI_INIT(x)                   \
+#define VPCI_PRIORITY_HIGH      "1"
+#define VPCI_PRIORITY_MIDDLE    "5"
+#define VPCI_PRIORITY_LOW       "9"
+
+#define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
-               __used_section(".data.vpci") = x
+               __used_section(".data.vpci." p) = x
 
 /* Add vPCI handlers to device. */
 int __must_check vpci_add_handlers(struct pci_dev *dev);
-- 
2.16.2



* [PATCH v11 11/12] vpci/msix: add MSI-X handlers
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
                   ` (9 preceding siblings ...)
  2018-03-20 15:15 ` [PATCH v11 10/12] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-21 12:36   ` Paul Durrant
  2018-03-20 15:15 ` [PATCH v11 12/12] vpci: do not expose unneeded functions to the user-space test harness Roger Pau Monne
  11 siblings, 1 reply; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Paul Durrant, Jan Beulich,
	Boris Ostrovsky, Roger Pau Monne

Add handlers for accesses to the MSI-X message control field on the
PCI configuration space, and traps for accesses to the memory region
that contains the MSI-X table and PBA. These traps detect attempts from
the guest to configure MSI-X interrupts and properly set them up.

Note that accesses to the Table Offset, Table BIR, PBA Offset and PBA
BIR are not trapped by Xen at the moment.

Finally, turn the panic in the Dom0 PVH builder into a warning.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v10:
 - Do not continue to print msix entries if the MSIX struct has
   changed its address while processing softirqs.
 - Use unsigned long to store the frame numbers in modify_bars.
 - Use lu to print frame values in modify_bars.

Changes since v9:
 - Unlock/lock when calling process_pending_softirqs.
 - Change vpci_msix_arch_print to return int in order to signal
   failure to continue after having processed softirqs.
 - Use a power of 2 to do the modulo.
 - Use PFN_DOWN in order to calculate the end of the MSI-X memory
   areas for the rangeset.
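
The two arithmetic points in this list can be sketched in isolation as
below. This is only an illustration: PAGE_SHIFT is assumed to be 12
(4 KiB pages), and msix_frames, should_yield and SOFTIRQ_BATCH are
hypothetical names, not identifiers from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT    12                  /* assumption: 4 KiB pages */
#define PFN_DOWN(x)   ((x) >> PAGE_SHIFT)
#define SOFTIRQ_BATCH 64u                 /* power of 2, as in the dump loop */

/* Inclusive frame range covering the bytes [addr, addr + size). */
static void msix_frames(uint64_t addr, uint64_t size,
                        uint64_t *start, uint64_t *end)
{
    assert(size > 0);
    *start = PFN_DOWN(addr);
    /* Subtract 1 first so an area ending on a page boundary stays inside. */
    *end = PFN_DOWN(addr + size - 1);
}

/* True once every SOFTIRQ_BATCH entries, but not for entry 0. */
static bool should_yield(unsigned int i)
{
    return i && !(i % SOFTIRQ_BATCH);
}
```

Because the rangeset takes an inclusive end, rounding the last byte down
gives the right frame even when the area stops exactly at a page
boundary. With a power-of-2 divisor the compiler can turn the modulo
into a simple mask.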

Changes since v8:
 - Call process_pending_softirqs between printing MSI-X entries.
 - Free msix struct in vpci_add_handlers.
 - Print only MSI or MSI-X if they are enabled.
 - Fix comment in update_entry.

Changes since v7:
 - Switch vpci.h macros to inline functions.
 - Change vpci_msix_arch_print_entry into vpci_msix_arch_print and
   make it print all the entries.
 - Add a log message if rangeset_remove_range fails to remove the BAR
   MSI-related range.
 - Introduce a new update_entry to disable and enable a MSIX entry in
   order to either update or set it up. This removes open coding it in
   two different places.
 - Unify access checks in access_allowed.
 - Add newlines between switch cases.
 - Expand max_entries to 12 bits.

Changes since v6:
 - Reduce the output of the debug keys.
 - Fix comments and code to match in vpci_msix_control_write.
 - Optimize size of the MSIX structure.
 - Convert 'tables[]' to a uint32_t in order to reduce the size of
   vpci_msix. Introduce some macros to make it easier to get the MSIX
   tables related data.
 - Limit size of the bool fields to 1 bit.
 - Remove the 'nr' field of vpci_msix_entry. The position can be
   calculated from the base of the entries array.
 - Drop the 'vpci_' prefix from the functions in msix.c, they are all
   static.
 - Remove the val local variable in control_read.
 - Initialize new_masked and new_enabled at declaration.
 - Recalculate the msix control value before writing it.
 - Remove the seg and bus local variables and use pdev->seg and
   pdev->bus instead.
 - Initialize msix at declaration in msix_{write/read}.
 - Add the must_check attribute to
   vpci_msix_arch_{enable/disable}_entry.
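
A rough sketch of the size optimizations mentioned above: the entry
number is recovered by pointer arithmetic instead of being stored, and
each table register is kept as a raw 32-bit value from which the BIR
and the offset are masked out on demand. The structure layout and all
names here (struct msix, msix_alloc, entry_nr, table_bir, table_offset)
are simplified assumptions for illustration, not the definitions added
to vpci.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MSIX_BIRMASK 0x7u  /* low 3 bits of the Table/PBA register: BIR */

struct msix_entry {
    uint64_t addr;
    uint32_t data;
    bool masked;
};

struct msix {
    uint32_t tables[2];          /* raw Table and PBA offset/BIR registers */
    unsigned int max_entries;
    struct msix_entry entries[]; /* flexible array, one slot per entry */
};

/* Allocate a zeroed structure with room for nr entries. */
static struct msix *msix_alloc(unsigned int nr)
{
    struct msix *m = calloc(1, sizeof(*m) + nr * sizeof(m->entries[0]));

    if ( m )
        m->max_entries = nr;

    return m;
}

/* Position of an entry, derived from the base of the entries array. */
static unsigned int entry_nr(const struct msix *m, const struct msix_entry *e)
{
    assert(e >= m->entries && e < m->entries + m->max_entries);
    return (unsigned int)(e - m->entries);
}

static unsigned int table_bir(uint32_t reg)
{
    return reg & MSIX_BIRMASK;
}

static uint32_t table_offset(uint32_t reg)
{
    return reg & ~MSIX_BIRMASK;
}
```

For a register value of 0x2003 this gives BAR 3 and offset 0x2000.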

Changes since v5:
 - Update lock usage.
 - Unbind/unmap PIRQs when MSIX is disabled.
 - Share the arch-specific MSIX code with the MSI functions.
 - Do not reference the MSIX memory areas from the PCI BARs fields,
   instead fetch the BIR and offset each time needed.
 - Add the '_entry' suffix to the MSIX arch functions.
 - Prefix the vMSIX macros with 'V'.
 - s/gdprintk/gprintk/ in msix.c
 - Make vpci_msix_access_check return bool, and change its name to
   vpci_msix_access_allowed.
 - Join the first two ifs in vpci_msix_{read/write} into a single one.
 - Allow Dom0 to write to the PBA area.
 - Add a note that reads from the PBA area will need to be translated
   if the PBA is not identity mapped.

Changes since v4:
 - Remove parentheses around offsetof.
 - Add "being" to MSI-X enabling comment.
 - Use INVALID_PIRQ.
 - Add a simple sanity check to vpci_msix_arch_enable in order to
   detect wrong MSI-X entries more quickly.
 - Constify vpci_msix_arch_print entry argument.
 - s/cpu/fixed/ in vpci_msix_arch_print.
 - Dump the MSI-X info together with the MSI info.
 - Fix vpci_msix_control_write to take into account changes to the
   address and data fields when switching the function mask bit.
 - Only disable/enable the entries if the address or data fields have
   been updated.
 - Use the BAR enable field to check if a BAR is mapped or not
   (instead of reading the command register for each device).
 - Fix error path in vpci_msix_read to set the return data to ~0.
 - Simplify mask usage in vpci_msix_write.
 - Cast data to uint64_t when shifting it 32 bits.
 - Fix writes to the table entry control register to take into account
   if the mask-all bit is set.
 - Add some comments to clarify the intended behavior of the code.
 - Align the PBA size to 64-bits.
 - Remove the error label in vpci_init_msix.
 - Try to compact the layout of the vpci_msix structure.
 - Remove the local table_bar and pba_bar variables from
   vpci_init_msix, they are used only once.

Changes since v3:
 - Propagate changes from previous versions: remove xen_ prefix, use
   the new fields in vpci_val and remove the return value from
   handlers.
 - Remove the usage of GENMASK.
 - Move the arch-specific parts of the dump routine to the
   x86/hvm/vmsi.c dump handler.
 - Chain the MSI-X dump handler to the 'M' debug key.
 - Fix the header BAR mappings so that the MSI-X regions inside of
   BARs are unmapped from the domain p2m in order for the handlers to
   work properly.
 - Unconditionally trap and forward accesses to the PBA MSI-X area.
 - Simplify the conditionals in vpci_msix_control_write.
 - Fix vpci_msix_accept to use a bool type.
 - Allow all supported accesses as described in the spec to the MSI-X
   table.
 - Truncate the returned address when the access is a 32b read.
 - Always return X86EMUL_OKAY from the handlers, returning ~0 in the
   read case if the access is not supported, or ignoring writes.
 - Do not check that max_entries is != 0 in the init handler.
 - Use trylock in the dump handler.
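
The read-side behaviour described in this list (only aligned 32-bit and
64-bit accesses, truncation of the address on 32-bit reads, ~0 for
anything unsupported) could look roughly like the sketch below. It is
not the handler from msix.c: entry_read, struct msix_entry and the
ENTRY_* constants are hypothetical, and the offsets follow the standard
16-byte MSI-X table entry layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte offsets inside one 16-byte MSI-X table entry. */
#define ENTRY_LOWER_ADDR   0
#define ENTRY_UPPER_ADDR   4
#define ENTRY_DATA         8
#define ENTRY_VECTOR_CTRL  12
#define ENTRY_SIZE         16
#define ENTRY_CTRL_MASKBIT 1u

struct msix_entry {
    uint64_t addr;
    uint32_t data;
    bool masked;
};

/* Emulated read of len bytes at offset within a single table entry. */
static uint64_t entry_read(const struct msix_entry *e, unsigned int offset,
                           unsigned int len)
{
    uint32_t ctrl = e->masked ? ENTRY_CTRL_MASKBIT : 0;

    /* Only aligned 32-bit and 64-bit accesses are accepted. */
    if ( (len != 4 && len != 8) || (offset % len) ||
         offset + len > ENTRY_SIZE )
        return ~(uint64_t)0;

    switch ( offset )
    {
    case ENTRY_LOWER_ADDR:
        /* A 32-bit read only sees the low half of the address. */
        return len == 8 ? e->addr : (uint32_t)e->addr;

    case ENTRY_UPPER_ADDR:
        return e->addr >> 32;

    case ENTRY_DATA:
        return len == 8 ? e->data | ((uint64_t)ctrl << 32) : e->data;

    case ENTRY_VECTOR_CTRL:
        return ctrl;
    }

    return ~(uint64_t)0;
}
```

Returning all ones instead of an error mirrors what a read from
unimplemented hardware would typically produce, so the caller can
always report success to the emulator.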

Changes since v2:
 - Split out arch-specific code.

This patch has been tested with devices using both a single MSI-X
entry and multiple ones.
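
As a quick reference for the message control handling, the sketch below
shows how the emulated register value is composed and when a write
triggers a refresh of the entries. The bit positions are the standard
MSI-X message control layout (table size in bits 10:0 encoded as N-1,
function mask in bit 14, enable in bit 15); msix_control and
needs_refresh are hypothetical helper names, and the condition in
needs_refresh is copied from control_write in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSIX_FLAGS_QSIZE   0x07ffu /* table size, encoded as N - 1 */
#define MSIX_FLAGS_MASKALL 0x4000u /* function mask */
#define MSIX_FLAGS_ENABLE  0x8000u /* MSI-X enable */

/* Value the guest reads back from the message control register. */
static uint16_t msix_control(unsigned int max_entries, bool enabled,
                             bool masked)
{
    assert(max_entries >= 1 && max_entries <= MSIX_FLAGS_QSIZE + 1);

    return (uint16_t)((max_entries - 1) |
                      (enabled ? MSIX_FLAGS_ENABLE : 0) |
                      (masked ? MSIX_FLAGS_MASKALL : 0));
}

/*
 * Entries are re-evaluated when MSI-X ends up enabled and unmasked after
 * having been either disabled or masked.
 */
static bool needs_refresh(bool old_enabled, bool old_masked,
                          bool new_enabled, bool new_masked)
{
    return new_enabled && !new_masked && (!old_enabled || old_masked);
}
```

For example, a device with 8 entries that is enabled and not masked
reads back 0x8007.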
---
 xen/arch/x86/hvm/dom0_build.c    |   2 +-
 xen/arch/x86/hvm/hvm.c           |   1 +
 xen/arch/x86/hvm/vmsi.c          | 160 +++++++++++---
 xen/drivers/vpci/Makefile        |   2 +-
 xen/drivers/vpci/header.c        |  19 ++
 xen/drivers/vpci/msi.c           |  27 ++-
 xen/drivers/vpci/msix.c          | 458 +++++++++++++++++++++++++++++++++++++++
 xen/drivers/vpci/vpci.c          |   1 +
 xen/include/asm-x86/hvm/domain.h |   3 +
 xen/include/asm-x86/hvm/io.h     |   5 +
 xen/include/xen/vpci.h           |  73 +++++++
 11 files changed, 720 insertions(+), 31 deletions(-)
 create mode 100644 xen/drivers/vpci/msix.c

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 259814d95d..d3f65eadbe 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -1117,7 +1117,7 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
 
     pvh_setup_mmcfg(d);
 
-    panic("Building a PVHv2 Dom0 is not yet supported.");
+    printk("WARNING: PVH is an experimental mode with limited functionality\n");
     return 0;
 }
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 0afb651b7f..7660ea704a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -588,6 +588,7 @@ int hvm_domain_initialise(struct domain *d)
     INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.mmcfg_regions);
+    INIT_LIST_HEAD(&d->arch.hvm_domain.msix_tables);
 
     rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
     if ( rc )
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index be59c56d43..c31d27c389 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -30,6 +30,7 @@
 #include <xen/lib.h>
 #include <xen/errno.h>
 #include <xen/sched.h>
+#include <xen/softirq.h>
 #include <xen/irq.h>
 #include <xen/vpci.h>
 #include <public/hvm/ioreq.h>
@@ -644,13 +645,10 @@ static unsigned int msi_gflags(uint16_t data, uint64_t addr, bool masked)
            (masked ? 0 : XEN_DOMCTL_VMSI_X86_UNMASKED);
 }
 
-void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
-                        unsigned int entry, bool mask)
+static void vpci_mask_pirq(struct domain *d, int pirq, bool mask)
 {
     unsigned long flags;
-    struct irq_desc *desc = domain_spin_lock_irq_desc(pdev->domain,
-                                                      msi->arch.pirq + entry,
-                                                      &flags);
+    struct irq_desc *desc = domain_spin_lock_irq_desc(d, pirq, &flags);
 
     if ( !desc )
         return;
@@ -658,23 +656,31 @@ void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
     spin_unlock_irqrestore(&desc->lock, flags);
 }
 
-int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
-                         unsigned int vectors)
+void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
+                        unsigned int entry, bool mask)
+{
+    vpci_mask_pirq(pdev->domain, msi->arch.pirq + entry, mask);
+}
+
+static int vpci_msi_enable(const struct pci_dev *pdev, uint32_t data,
+                           uint64_t address, unsigned int nr,
+                           paddr_t table_base, uint32_t mask)
 {
     struct msi_info msi_info = {
         .seg = pdev->seg,
         .bus = pdev->bus,
         .devfn = pdev->devfn,
-        .entry_nr = vectors,
+        .table_base = table_base,
+        .entry_nr = nr,
     };
-    unsigned int i;
-    int rc;
-
-    ASSERT(msi->arch.pirq == INVALID_PIRQ);
+    unsigned int i, vectors = table_base ? 1 : nr;
+    int rc, pirq = INVALID_PIRQ;
 
     /* Get a PIRQ. */
-    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &msi->arch.pirq,
-                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
+    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &pirq,
+                                   table_base ? MAP_PIRQ_TYPE_MSI
+                                              : MAP_PIRQ_TYPE_MULTI_MSI,
+                                   &msi_info);
     if ( rc )
     {
         gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
@@ -685,15 +691,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
 
     for ( i = 0; i < vectors; i++ )
     {
-        uint8_t vector = MASK_EXTR(msi->data, MSI_DATA_VECTOR_MASK);
-        uint8_t vector_mask = 0xff >> (8 - fls(msi->vectors) + 1);
+        uint8_t vector = MASK_EXTR(data, MSI_DATA_VECTOR_MASK);
+        uint8_t vector_mask = 0xff >> (8 - fls(vectors) + 1);
         struct xen_domctl_bind_pt_irq bind = {
-            .machine_irq = msi->arch.pirq + i,
+            .machine_irq = pirq + i,
             .irq_type = PT_IRQ_TYPE_MSI,
             .u.msi.gvec = (vector & ~vector_mask) |
                           ((vector + i) & vector_mask),
-            .u.msi.gflags = msi_gflags(msi->data, msi->address,
-                                       (msi->mask >> i) & 1),
+            .u.msi.gflags = msi_gflags(data, address, (mask >> i) & 1),
         };
 
         pcidevs_lock();
@@ -703,33 +708,49 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
             gdprintk(XENLOG_ERR,
                      "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
                      pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
-                     PCI_FUNC(pdev->devfn), msi->arch.pirq + i, rc);
+                     PCI_FUNC(pdev->devfn), pirq + i, rc);
             while ( bind.machine_irq-- )
                 pt_irq_destroy_bind(pdev->domain, &bind);
             spin_lock(&pdev->domain->event_lock);
-            unmap_domain_pirq(pdev->domain, msi->arch.pirq);
+            unmap_domain_pirq(pdev->domain, pirq);
             spin_unlock(&pdev->domain->event_lock);
             pcidevs_unlock();
-            msi->arch.pirq = INVALID_PIRQ;
             return rc;
         }
         pcidevs_unlock();
     }
 
-    return 0;
+    return pirq;
 }
 
-void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
+int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
+                         unsigned int vectors)
+{
+    int rc;
+
+    ASSERT(msi->arch.pirq == INVALID_PIRQ);
+    rc = vpci_msi_enable(pdev, msi->data, msi->address, vectors, 0, msi->mask);
+    if ( rc >= 0 )
+    {
+        msi->arch.pirq = rc;
+        rc = 0;
+    }
+
+    return rc;
+}
+
+static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
+                             unsigned int nr)
 {
     unsigned int i;
 
-    ASSERT(msi->arch.pirq != INVALID_PIRQ);
+    ASSERT(pirq != INVALID_PIRQ);
 
     pcidevs_lock();
-    for ( i = 0; i < msi->vectors; i++ )
+    for ( i = 0; i < nr; i++ )
     {
         struct xen_domctl_bind_pt_irq bind = {
-            .machine_irq = msi->arch.pirq + i,
+            .machine_irq = pirq + i,
             .irq_type = PT_IRQ_TYPE_MSI,
         };
         int rc;
@@ -739,10 +760,14 @@ void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
     }
 
     spin_lock(&pdev->domain->event_lock);
-    unmap_domain_pirq(pdev->domain, msi->arch.pirq);
+    unmap_domain_pirq(pdev->domain, pirq);
     spin_unlock(&pdev->domain->event_lock);
     pcidevs_unlock();
+}
 
+void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
+{
+    vpci_msi_disable(pdev, msi->arch.pirq, msi->vectors);
     msi->arch.pirq = INVALID_PIRQ;
 }
 
@@ -763,3 +788,82 @@ void vpci_msi_arch_print(const struct vpci_msi *msi)
            MASK_EXTR(msi->address, MSI_ADDR_DEST_ID_MASK),
            msi->arch.pirq);
 }
+
+void vpci_msix_arch_mask_entry(struct vpci_msix_entry *entry,
+                               const struct pci_dev *pdev, bool mask)
+{
+    ASSERT(entry->arch.pirq != INVALID_PIRQ);
+    vpci_mask_pirq(pdev->domain, entry->arch.pirq, mask);
+}
+
+int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
+                                const struct pci_dev *pdev, paddr_t table_base)
+{
+    int rc;
+
+    ASSERT(entry->arch.pirq == INVALID_PIRQ);
+    rc = vpci_msi_enable(pdev, entry->data, entry->addr,
+                         vmsix_entry_nr(pdev->vpci->msix, entry),
+                         table_base, entry->masked);
+    if ( rc >= 0 )
+    {
+        entry->arch.pirq = rc;
+        rc = 0;
+    }
+
+    return rc;
+}
+
+int vpci_msix_arch_disable_entry(struct vpci_msix_entry *entry,
+                                 const struct pci_dev *pdev)
+{
+    if ( entry->arch.pirq == INVALID_PIRQ )
+        return -ENOENT;
+
+    vpci_msi_disable(pdev, entry->arch.pirq, 1);
+    entry->arch.pirq = INVALID_PIRQ;
+
+    return 0;
+}
+
+void vpci_msix_arch_init_entry(struct vpci_msix_entry *entry)
+{
+    entry->arch.pirq = INVALID_PIRQ;
+}
+
+int vpci_msix_arch_print(const struct vpci_msix *msix)
+{
+    unsigned int i;
+
+    for ( i = 0; i < msix->max_entries; i++ )
+    {
+        const struct vpci_msix_entry *entry = &msix->entries[i];
+
+        printk("%6u vec=%02x%7s%6s%3sassert%5s%7s dest_id=%lu mask=%u pirq: %d\n",
+               i, MASK_EXTR(entry->data, MSI_DATA_VECTOR_MASK),
+               entry->data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
+               entry->data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
+               entry->data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
+               entry->addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
+               entry->addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "fixed",
+               MASK_EXTR(entry->addr, MSI_ADDR_DEST_ID_MASK),
+               entry->masked, entry->arch.pirq);
+        if ( i && !(i % 64) )
+        {
+            struct pci_dev *pdev = msix->pdev;
+
+            spin_unlock(&msix->pdev->vpci->lock);
+            process_pending_softirqs();
+            /* NB: we assume that pdev cannot go away for an alive domain. */
+            if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
+                return -EBUSY;
+            if ( pdev->vpci->msix != msix )
+            {
+                spin_unlock(&pdev->vpci->lock);
+                return -EAGAIN;
+            }
+        }
+    }
+
+    return 0;
+}
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 62cec9e82b..55d1bdfda0 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o header.o msi.o
+obj-y += vpci.o header.o msi.o msix.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 8d9d6f43f3..271e4667dc 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -190,6 +190,7 @@ static int modify_bars(const struct pci_dev *pdev, bool map, bool rom_only)
     struct vpci_header *header = &pdev->vpci->header;
     struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
+    const struct vpci_msix *msix = pdev->vpci->msix;
     unsigned int i;
     int rc;
 
@@ -226,6 +227,24 @@ static int modify_bars(const struct pci_dev *pdev, bool map, bool rom_only)
         }
     }
 
+    /* Remove any MSIX regions if present. */
+    for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
+    {
+        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
+        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
+                                     vmsix_table_size(pdev->vpci, i) - 1);
+
+        rc = rangeset_remove_range(mem, start, end);
+        if ( rc )
+        {
+            printk(XENLOG_G_WARNING
+                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
+                   start, end, rc);
+            rangeset_destroy(mem);
+            return rc;
+        }
+    }
+
     /*
      * Check for overlaps with other BARs. Note that only BARs that are
      * currently mapped (enabled) are checked for overlaps.
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index de4ddf562e..ad26c38a92 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -281,11 +281,12 @@ void vpci_dump_msi(void)
         if ( !has_vpci(d) )
             continue;
 
-        printk("vPCI MSI d%d\n", d->domain_id);
+        printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
         list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
         {
             const struct vpci_msi *msi;
+            const struct vpci_msix *msix;
 
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 continue;
@@ -306,6 +307,30 @@ void vpci_dump_msi(void)
                 vpci_msi_arch_print(msi);
             }
 
+            msix = pdev->vpci->msix;
+            if ( msix && msix->enabled )
+            {
+                int rc;
+
+                printk("%04x:%02x:%02x.%u MSI-X\n", pdev->seg, pdev->bus,
+                       PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+                printk("  entries: %u maskall: %d enabled: %d\n",
+                       msix->max_entries, msix->masked, msix->enabled);
+
+                rc = vpci_msix_arch_print(msix);
+                if ( rc )
+                {
+                    /*
+                     * On error vpci_msix_arch_print will always return without
+                     * holding the lock.
+                     */
+                    printk("unable to print all MSI-X entries: %d\n", rc);
+                    process_pending_softirqs();
+                    continue;
+                }
+            }
+
             spin_unlock(&pdev->vpci->lock);
             process_pending_softirqs();
         }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
new file mode 100644
index 0000000000..3b378c2e51
--- /dev/null
+++ b/xen/drivers/vpci/msix.c
@@ -0,0 +1,458 @@
+/*
+ * Handlers for accesses to the MSI-X capability structure and the memory
+ * region.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+
+#include <asm/msi.h>
+
+#define VMSIX_SIZE(num) offsetof(struct vpci_msix, entries[num])
+
+#define VMSIX_ADDR_IN_RANGE(addr, vpci, nr)                               \
+    ((addr) >= vmsix_table_addr(vpci, nr) &&                              \
+     (addr) < vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr))
+
+static uint32_t control_read(const struct pci_dev *pdev, unsigned int reg,
+                             void *data)
+{
+    const struct vpci_msix *msix = data;
+
+    return (msix->max_entries - 1) |
+           (msix->enabled ? PCI_MSIX_FLAGS_ENABLE : 0) |
+           (msix->masked ? PCI_MSIX_FLAGS_MASKALL : 0);
+}
+
+static int update_entry(struct vpci_msix_entry *entry,
+                        const struct pci_dev *pdev, unsigned int nr)
+{
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    int rc = vpci_msix_arch_disable_entry(entry, pdev);
+
+    /* Ignore ENOENT, it means the entry wasn't setup. */
+    if ( rc && rc != -ENOENT )
+    {
+        gprintk(XENLOG_WARNING,
+                "%04x:%02x:%02x.%u: unable to disable entry %u for update: %d\n",
+                pdev->seg, pdev->bus, slot, func, nr, rc);
+        return rc;
+    }
+
+    rc = vpci_msix_arch_enable_entry(entry, pdev,
+                                     vmsix_table_base(pdev->vpci,
+                                                      VPCI_MSIX_TABLE));
+    if ( rc )
+    {
+        gprintk(XENLOG_WARNING,
+                "%04x:%02x:%02x.%u: unable to enable entry %u: %d\n",
+                pdev->seg, pdev->bus, slot, func, nr, rc);
+        /* Entry is likely not properly configured. */
+        return rc;
+    }
+
+    return 0;
+}
+
+static void control_write(const struct pci_dev *pdev, unsigned int reg,
+                          uint32_t val, void *data)
+{
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    struct vpci_msix *msix = data;
+    bool new_masked = val & PCI_MSIX_FLAGS_MASKALL;
+    bool new_enabled = val & PCI_MSIX_FLAGS_ENABLE;
+    unsigned int i;
+    int rc;
+
+    if ( new_masked == msix->masked && new_enabled == msix->enabled )
+        return;
+
+    /*
+     * According to the PCI 3.0 specification, switching the enable bit to 1
+     * or the function mask bit to 0 should cause all the cached addresses
+     * and data fields to be recalculated.
+     *
+     * In order to avoid the overhead of disabling and enabling all the
+     * entries every time the guest sets the maskall bit, Xen will only
+     * perform the disable and enable sequence when the guest has written to
+     * the entry.
+     */
+    if ( new_enabled && !new_masked && (!msix->enabled || msix->masked) )
+    {
+        for ( i = 0; i < msix->max_entries; i++ )
+        {
+            if ( msix->entries[i].masked || !msix->entries[i].updated ||
+                 update_entry(&msix->entries[i], pdev, i) )
+                continue;
+
+            msix->entries[i].updated = false;
+        }
+    }
+    else if ( !new_enabled && msix->enabled )
+    {
+        /* Guest has disabled MSIX, disable all entries. */
+        for ( i = 0; i < msix->max_entries; i++ )
+        {
+            /*
+             * NB: vpci_msix_arch_disable_entry can be called for entries that
+             * are not setup, it will return -ENOENT in that case.
+             */
+            rc = vpci_msix_arch_disable_entry(&msix->entries[i], pdev);
+            switch ( rc )
+            {
+            case 0:
+                /*
+                 * Mark the entry successfully disabled as updated, so that on
+                 * the next enable the entry is properly set up. This is done
+                 * so that the following flow works correctly:
+                 *
+                 * mask entry -> disable MSIX -> enable MSIX -> unmask entry
+                 *
+                 * Without setting 'updated', the 'unmask entry' step will fail
+                 * because the entry has not been updated, so it would not be
+                 * mapped/bound at all.
+                 */
+                msix->entries[i].updated = true;
+                break;
+            case -ENOENT:
+                /* Ignore non-present entry. */
+                break;
+            default:
+                gprintk(XENLOG_WARNING,
+                        "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
+                        pdev->seg, pdev->bus, slot, func, i, rc);
+                return;
+            }
+        }
+    }
+
+    msix->masked = new_masked;
+    msix->enabled = new_enabled;
+
+    val = control_read(pdev, reg, data);
+    if ( pci_msi_conf_write_intercept(msix->pdev, reg, 2, &val) >= 0 )
+        pci_conf_write16(pdev->seg, pdev->bus, slot, func, reg, val);
+}
+
+static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
+{
+    struct vpci_msix *msix;
+
+    list_for_each_entry ( msix, &d->arch.hvm_domain.msix_tables, next )
+    {
+        const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
+        unsigned int i;
+
+        for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
+            if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
+                 VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
+                return msix;
+    }
+
+    return NULL;
+}
+
+static int msix_accept(struct vcpu *v, unsigned long addr)
+{
+    return !!msix_find(v->domain, addr);
+}
+
+static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
+                           unsigned int len)
+{
+    /* Only allow aligned 32/64b accesses. */
+    if ( (len == 4 || len == 8) && !(addr & (len - 1)) )
+        return true;
+
+    gprintk(XENLOG_WARNING,
+            "%04x:%02x:%02x.%u: unaligned or invalid size MSI-X table access\n",
+            pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+
+    return false;
+}
+
+static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
+                                         paddr_t addr)
+{
+    paddr_t start = vmsix_table_addr(msix->pdev->vpci, VPCI_MSIX_TABLE);
+
+    return &msix->entries[(addr - start) / PCI_MSIX_ENTRY_SIZE];
+}
+
+static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
+                     unsigned long *data)
+{
+    const struct domain *d = v->domain;
+    struct vpci_msix *msix = msix_find(d, addr);
+    const struct vpci_msix_entry *entry;
+    unsigned int offset;
+
+    *data = ~0ul;
+
+    if ( !msix )
+        return X86EMUL_RETRY;
+
+    if ( !access_allowed(msix->pdev, addr, len) )
+        return X86EMUL_OKAY;
+
+    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
+    {
+        /*
+         * Access to PBA.
+         *
+         * TODO: note that this relies on having the PBA identity mapped to the
+         * guest address space. If this changes the address will need to be
+         * translated.
+         */
+        switch ( len )
+        {
+        case 4:
+            *data = readl(addr);
+            break;
+
+        case 8:
+            *data = readq(addr);
+            break;
+
+        default:
+            ASSERT_UNREACHABLE();
+            break;
+        }
+
+        return X86EMUL_OKAY;
+    }
+
+    spin_lock(&msix->pdev->vpci->lock);
+    entry = get_entry(msix, addr);
+    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
+
+    switch ( offset )
+    {
+    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
+        *data = entry->addr;
+        break;
+
+    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
+        *data = entry->addr >> 32;
+        break;
+
+    case PCI_MSIX_ENTRY_DATA_OFFSET:
+        *data = entry->data;
+        if ( len == 8 )
+            *data |=
+                (uint64_t)(entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0) << 32;
+        break;
+
+    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
+        *data = entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0;
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+    spin_unlock(&msix->pdev->vpci->lock);
+
+    return X86EMUL_OKAY;
+}
+
+static int msix_write(struct vcpu *v, unsigned long addr, unsigned int len,
+                      unsigned long data)
+{
+    const struct domain *d = v->domain;
+    struct vpci_msix *msix = msix_find(d, addr);
+    struct vpci_msix_entry *entry;
+    unsigned int offset;
+
+    if ( !msix )
+        return X86EMUL_RETRY;
+
+    if ( !access_allowed(msix->pdev, addr, len) )
+        return X86EMUL_OKAY;
+
+    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
+    {
+        /* Ignore writes to PBA for DomUs, its behavior is undefined. */
+        if ( is_hardware_domain(d) )
+        {
+            switch ( len )
+            {
+            case 4:
+                writel(data, addr);
+                break;
+
+            case 8:
+                writeq(data, addr);
+                break;
+
+            default:
+                ASSERT_UNREACHABLE();
+                break;
+            }
+        }
+
+        return X86EMUL_OKAY;
+    }
+
+    spin_lock(&msix->pdev->vpci->lock);
+    entry = get_entry(msix, addr);
+    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
+
+    /*
+     * NB: Xen allows writes to the data/address registers with the entry
+     * unmasked. The specification says this is undefined behavior, and Xen
+     * implements it as storing the written value, which will be made effective
+     * in the next mask/unmask cycle. This also mimics the implementation in
+     * QEMU.
+     */
+    switch ( offset )
+    {
+    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
+        entry->updated = true;
+        if ( len == 8 )
+        {
+            entry->addr = data;
+            break;
+        }
+        entry->addr &= ~0xffffffffull;
+        entry->addr |= data;
+        break;
+
+    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
+        entry->updated = true;
+        entry->addr &= 0xffffffff;
+        entry->addr |= (uint64_t)data << 32;
+        break;
+
+    case PCI_MSIX_ENTRY_DATA_OFFSET:
+        entry->updated = true;
+        entry->data = data;
+
+        if ( len == 4 )
+            break;
+
+        data >>= 32;
+        /* fallthrough */
+    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
+    {
+        bool new_masked = data & PCI_MSIX_VECTOR_BITMASK;
+        const struct pci_dev *pdev = msix->pdev;
+
+        if ( entry->masked == new_masked )
+            /* No change in the mask bit, nothing to do. */
+            break;
+
+        /*
+         * Update the masked state before calling vpci_msix_arch_enable_entry,
+         * so that it picks the new state.
+         */
+        entry->masked = new_masked;
+        if ( !new_masked && msix->enabled && !msix->masked && entry->updated )
+        {
+            /*
+             * If MSI-X is enabled, the function mask is not active, the entry
+             * is being unmasked and there have been changes to the address or
+             * data fields, Xen needs to disable and enable the entry in order
+             * to pick up the changes.
+             */
+            if ( update_entry(entry, pdev, vmsix_entry_nr(msix, entry)) )
+                break;
+
+            entry->updated = false;
+        }
+        else
+            vpci_msix_arch_mask_entry(entry, pdev, entry->masked);
+
+        break;
+    }
+
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+    spin_unlock(&msix->pdev->vpci->lock);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops vpci_msix_table_ops = {
+    .check = msix_accept,
+    .read = msix_read,
+    .write = msix_write,
+};
+
+static int init_msix(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    unsigned int msix_offset, i, max_entries;
+    uint16_t control;
+    int rc;
+
+    msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
+                                      PCI_CAP_ID_MSIX);
+    if ( !msix_offset )
+        return 0;
+
+    control = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
+                              msix_control_reg(msix_offset));
+
+    max_entries = msix_table_size(control);
+
+    pdev->vpci->msix = xzalloc_bytes(VMSIX_SIZE(max_entries));
+    if ( !pdev->vpci->msix )
+        return -ENOMEM;
+
+    pdev->vpci->msix->max_entries = max_entries;
+    pdev->vpci->msix->pdev = pdev;
+
+    pdev->vpci->msix->tables[VPCI_MSIX_TABLE] =
+        pci_conf_read32(pdev->seg, pdev->bus, slot, func,
+                        msix_table_offset_reg(msix_offset));
+    pdev->vpci->msix->tables[VPCI_MSIX_PBA] =
+        pci_conf_read32(pdev->seg, pdev->bus, slot, func,
+                        msix_pba_offset_reg(msix_offset));
+
+    for ( i = 0; i < pdev->vpci->msix->max_entries; i++ )
+    {
+        pdev->vpci->msix->entries[i].masked = true;
+        vpci_msix_arch_init_entry(&pdev->vpci->msix->entries[i]);
+    }
+
+    rc = vpci_add_register(pdev->vpci, control_read, control_write,
+                           msix_control_reg(msix_offset), 2, pdev->vpci->msix);
+    if ( rc )
+        return rc;
+
+    if ( list_empty(&d->arch.hvm_domain.msix_tables) )
+        register_mmio_handler(d, &vpci_msix_table_ops);
+
+    list_add(&pdev->vpci->msix->next, &d->arch.hvm_domain.msix_tables);
+
+    return 0;
+}
+REGISTER_VPCI_INIT(init_msix, VPCI_PRIORITY_HIGH);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 3012b30013..8ec9c916ea 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -47,6 +47,7 @@ void vpci_remove_device(struct pci_dev *pdev)
         xfree(r);
     }
     spin_unlock(&pdev->vpci->lock);
+    xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
     pdev->vpci = NULL;
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index d1d933d791..020ceacd81 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -188,6 +188,9 @@ struct hvm_domain {
     struct list_head mmcfg_regions;
     rwlock_t mmcfg_lock;
 
+    /* List of MSI-X tables. */
+    struct list_head msix_tables;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 0fedb3473c..e6b6ed0b92 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -132,6 +132,11 @@ struct vpci_arch_msi {
     int pirq;
 };
 
+/* Arch-specific MSI-X entry data for vPCI. */
+struct vpci_arch_msix_entry {
+    int pirq;
+};
+
 enum stdvga_cache_state {
     STDVGA_CACHE_UNINITIALIZED,
     STDVGA_CACHE_ENABLED,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 7266c17679..fc47163ba6 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -115,6 +115,34 @@ struct vpci {
         struct vpci_arch_msi arch;
 #endif
     } *msi;
+
+    /* MSI-X data. */
+    struct vpci_msix {
+#ifdef __XEN__
+        struct pci_dev *pdev;
+        /* List link. */
+        struct list_head next;
+        /* Table information. */
+#define VPCI_MSIX_TABLE     0
+#define VPCI_MSIX_PBA       1
+#define VPCI_MSIX_MEM_NUM   2
+        uint32_t tables[VPCI_MSIX_MEM_NUM];
+        /* Maximum number of vectors supported by the device. */
+        uint16_t max_entries : 12;
+        /* MSI-X enabled? */
+        bool enabled         : 1;
+        /* Masked? */
+        bool masked          : 1;
+        /* Entries. */
+        struct vpci_msix_entry {
+            uint64_t addr;
+            uint32_t data;
+            bool masked  : 1;
+            bool updated : 1;
+            struct vpci_arch_msix_entry arch;
+        } entries[];
+#endif
+    } *msix;
 };
 
 struct vpci_vcpu {
@@ -137,6 +165,51 @@ int __must_check vpci_msi_arch_enable(struct vpci_msi *msi,
 void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev);
 void vpci_msi_arch_init(struct vpci_msi *msi);
 void vpci_msi_arch_print(const struct vpci_msi *msi);
+
+/* Arch-specific vPCI MSI-X helpers. */
+void vpci_msix_arch_mask_entry(struct vpci_msix_entry *entry,
+                               const struct pci_dev *pdev, bool mask);
+int __must_check vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
+                                             const struct pci_dev *pdev,
+                                             paddr_t table_base);
+int __must_check vpci_msix_arch_disable_entry(struct vpci_msix_entry *entry,
+                                              const struct pci_dev *pdev);
+void vpci_msix_arch_init_entry(struct vpci_msix_entry *entry);
+int vpci_msix_arch_print(const struct vpci_msix *msix);
+
+/*
+ * Helper functions to fetch MSIX related data. They are used by both the
+ * emulated MSIX code and the BAR handlers.
+ */
+static inline paddr_t vmsix_table_base(const struct vpci *vpci, unsigned int nr)
+{
+    return vpci->header.bars[vpci->msix->tables[nr] & PCI_MSIX_BIRMASK].addr;
+}
+
+static inline paddr_t vmsix_table_addr(const struct vpci *vpci, unsigned int nr)
+{
+    return vmsix_table_base(vpci, nr) +
+           (vpci->msix->tables[nr] & ~PCI_MSIX_BIRMASK);
+}
+
+/*
+ * Note regarding the size calculation of the PBA: the spec mentions "The last
+ * QWORD will not necessarily be fully populated", so it implies that the PBA
+ * size is 64-bit aligned.
+ */
+static inline size_t vmsix_table_size(const struct vpci *vpci, unsigned int nr)
+{
+    return
+        (nr == VPCI_MSIX_TABLE) ? vpci->msix->max_entries * PCI_MSIX_ENTRY_SIZE
+                                : ROUNDUP(DIV_ROUND_UP(vpci->msix->max_entries,
+                                                       8), 8);
+}
+
+static inline unsigned int vmsix_entry_nr(const struct vpci_msix *msix,
+                                          const struct vpci_msix_entry *entry)
+{
+    return entry - msix->entries;
+}
 #endif /* __XEN__ */
 
 #else /* !CONFIG_HAS_VPCI */
-- 
2.16.2


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 23+ messages in thread
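
The MSI-X table access handling in the patch above (access_allowed, get_entry, vmsix_table_size) can be illustrated outside of Xen. What follows is only a minimal user-space sketch and not code from the series: every sketch_* name and SKETCH_* constant is hypothetical, the 16-byte entry size is the value the PCI specification gives for an MSI-X table entry, and the field offset is taken relative to the table start, which matches the patch's `addr & (PCI_MSIX_ENTRY_SIZE - 1)` only if the table start is 16-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 16-byte MSI-X table entries, as defined by the PCI spec. */
#define SKETCH_MSIX_ENTRY_SIZE 16u

/* Only naturally aligned 4- or 8-byte accesses are accepted. */
static bool sketch_access_allowed(uint64_t addr, unsigned int len)
{
    return (len == 4 || len == 8) && !(addr & (len - 1));
}

/* Index of the table entry that contains addr. */
static unsigned int sketch_entry_index(uint64_t table_start, uint64_t addr)
{
    assert(addr >= table_start);
    return (unsigned int)((addr - table_start) / SKETCH_MSIX_ENTRY_SIZE);
}

/* Byte offset of addr inside its entry (0, 4, 8 or 12 for valid accesses). */
static unsigned int sketch_entry_offset(uint64_t table_start, uint64_t addr)
{
    assert(addr >= table_start);
    return (unsigned int)((addr - table_start) & (SKETCH_MSIX_ENTRY_SIZE - 1));
}

/* Size in bytes of the vector table for a given number of entries. */
static size_t sketch_table_size(unsigned int max_entries)
{
    return (size_t)max_entries * SKETCH_MSIX_ENTRY_SIZE;
}

/*
 * Size in bytes of the PBA: one pending bit per entry, rounded up to whole
 * bytes and then to a multiple of 8 bytes (one QWORD).
 */
static size_t sketch_pba_size(unsigned int max_entries)
{
    size_t bytes = ((size_t)max_entries + 7) / 8;

    return (bytes + 7) & ~(size_t)7;
}
```

With a hypothetical table starting at 0x1000, an 8-byte access at 0x1028 passes the alignment check and decodes to entry 2 at offset 8, i.e. the data field together with the vector control field.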

* [PATCH v11 12/12] vpci: do not expose unneeded functions to the user-space test harness
  2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
                   ` (10 preceding siblings ...)
  2018-03-20 15:15 ` [PATCH v11 11/12] vpci/msix: add MSI-X handlers Roger Pau Monne
@ 2018-03-20 15:15 ` Roger Pau Monne
  2018-03-20 16:20   ` Jan Beulich
  11 siblings, 1 reply; 23+ messages in thread
From: Roger Pau Monne @ 2018-03-20 15:15 UTC (permalink / raw)
  To: xen-devel; +Cc: Boris Ostrovsky, Roger Pau Monne

Some functions in vpci.c (vpci_remove_device and vpci_add_handlers)
are not used by the user-space test harness, so guard them with
__XEN__ in order to avoid exposing them to the user-space test
harness.

Requested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 tools/tests/vpci/Makefile |  8 ++------
 xen/drivers/vpci/vpci.c   | 10 ++++++----
 xen/include/xen/vpci.h    |  6 +-----
 3 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/tools/tests/vpci/Makefile b/tools/tests/vpci/Makefile
index e45fcb5cd9..5075bc2be2 100644
--- a/tools/tests/vpci/Makefile
+++ b/tools/tests/vpci/Makefile
@@ -24,12 +24,8 @@ distclean: clean
 install:
 
 vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
-	# Trick the compiler so it doesn't complain about missing symbols
-	sed -e '/#include/d' \
-	    -e '1s;^;#include "emul.h"\
-	             vpci_register_init_t *const __start_vpci_array[1]\;\
-	             vpci_register_init_t *const __end_vpci_array[1]\;\
-	             ;' <$< >$@
+	# Remove includes and add the test harness header
+	sed -e '/#include/d' -e '1s/^/#include "emul.h"/' <$< >$@
 
 list.h: $(XEN_ROOT)/xen/include/xen/list.h
 vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 8ec9c916ea..2913b56500 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -20,10 +20,6 @@
 #include <xen/sched.h>
 #include <xen/vpci.h>
 
-extern vpci_register_init_t *const __start_vpci_array[];
-extern vpci_register_init_t *const __end_vpci_array[];
-#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
-
 /* Internal struct to store the emulated PCI registers. */
 struct vpci_register {
     vpci_read_t *read;
@@ -34,6 +30,11 @@ struct vpci_register {
     struct list_head node;
 };
 
+#ifdef __XEN__
+extern vpci_register_init_t *const __start_vpci_array[];
+extern vpci_register_init_t *const __end_vpci_array[];
+#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
+
 void vpci_remove_device(struct pci_dev *pdev)
 {
     spin_lock(&pdev->vpci->lock);
@@ -80,6 +81,7 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
 
     return rc;
 }
+#endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
                              const struct vpci_register *r2)
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index fc47163ba6..cb39e0ebea 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -90,11 +90,9 @@ struct vpci {
         bool rom_enabled      : 1;
         /* FIXME: currently there's no support for SR-IOV. */
     } header;
-#endif
 
     /* MSI data. */
     struct vpci_msi {
-#ifdef __XEN__
       /* Address. */
         uint64_t address;
         /* Mask bitfield. */
@@ -113,12 +111,10 @@ struct vpci {
         uint8_t vectors     : 5;
         /* Arch-specific data. */
         struct vpci_arch_msi arch;
-#endif
     } *msi;
 
     /* MSI-X data. */
     struct vpci_msix {
-#ifdef __XEN__
         struct pci_dev *pdev;
         /* List link. */
         struct list_head next;
@@ -141,8 +137,8 @@ struct vpci {
             bool updated : 1;
             struct vpci_arch_msix_entry arch;
         } entries[];
-#endif
     } *msix;
+#endif
 };
 
 struct vpci_vcpu {
-- 
2.16.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v11 07/12] vpci: add header handlers
  2018-03-20 15:15 ` [PATCH v11 07/12] vpci: add header handlers Roger Pau Monne
@ 2018-03-20 16:19   ` Jan Beulich
  2018-03-21 12:31   ` Paul Durrant
  1 sibling, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2018-03-20 16:19 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, Paul Durrant, xen-devel,
	Boris Ostrovsky

>>> On 20.03.18 at 16:15, <roger.pau@citrix.com> wrote:
> Introduce a set of handlers that trap accesses to the PCI BARs and the
> command register, in order to snoop BAR sizing and BAR relocation.
> 
> The command handler is used to detect changes to bit 2 (response to
> memory space accesses), and maps/unmaps the BARs of the device into
> the guest p2m. A rangeset is used in order to figure out which memory
> to map/unmap. This makes it easier to keep track of the possible
> overlaps with other BARs, and will also simplify MSI-X support, where
> certain regions of a BAR might be used for the MSI-X table or PBA.
> 
> The BAR register handlers are used to detect attempts by the guest to
> size or relocate the BARs.
> 
> Note that the long running BAR mapping and unmapping operations are
> deferred to be performed by hvm_io_pending, so that they can be safely
> preempted.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 23+ messages in thread
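
The two checks described in the commit message quoted above (noticing a change of the memory decoding bit in the command register, and snooping BAR sizing) come down to a little bit arithmetic. The following is a minimal user-space sketch and not the code in header.c: the sketch_* helpers are hypothetical, the constants are assumed to match the standard PCI register layout (memory space enable as mask 0x2 in the command register, low four bits of a memory BAR reserved for flags), and only 32-bit memory BARs are handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed standard PCI values; named SKETCH_* to mark them as illustrative. */
#define SKETCH_PCI_COMMAND_MEMORY 0x2u     /* memory space enable */
#define SKETCH_BAR_MEM_MASK       (~0xfu)  /* address bits of a memory BAR */

/* True if a command register write toggles memory decoding. */
static bool sketch_mem_decoding_changed(uint16_t old_cmd, uint16_t new_cmd)
{
    return ((old_cmd ^ new_cmd) & SKETCH_PCI_COMMAND_MEMORY) != 0;
}

/*
 * Size of a 32-bit memory BAR, given the value read back after writing all
 * ones to it. Returns 0 for an unimplemented BAR (reads back as 0).
 */
static uint32_t sketch_bar32_size(uint32_t readback)
{
    uint32_t masked = readback & SKETCH_BAR_MEM_MASK;

    return masked ? ~masked + 1 : 0;
}
```

For instance, a readback of 0xfffff008 (prefetchable flag set) corresponds to a 4KiB BAR under these assumptions.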

* Re: [PATCH v11 12/12] vpci: do not expose unneeded functions to the user-space test harness
  2018-03-20 15:15 ` [PATCH v11 12/12] vpci: do not expose unneeded functions to the user-space test harness Roger Pau Monne
@ 2018-03-20 16:20   ` Jan Beulich
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2018-03-20 16:20 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, Boris Ostrovsky

>>> On 20.03.18 at 16:15, <roger.pau@citrix.com> wrote:
> Some functions in vpci.c (vpci_remove_device and vpci_add_handlers)
> are not used by the user-space test harness, so guard them with
> __XEN__ in order to avoid exposing them to the user-space test
> harness.
> 
> Requested-by: Jan Beulich <JBeulich@suse.com>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v11 01/12] vpci: introduce basic handlers to trap accesses to the PCI config space
  2018-03-20 15:15 ` [PATCH v11 01/12] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
@ 2018-03-21  4:39   ` Julien Grall
  0 siblings, 0 replies; 23+ messages in thread
From: Julien Grall @ 2018-03-21  4:39 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Paul Durrant, Jan Beulich,
	Boris Ostrovsky

Hi,

On 03/20/2018 03:15 PM, Roger Pau Monne wrote:
> This functionality is going to reside in vpci.c (and the corresponding
> vpci.h header), and should be arch-agnostic. The handlers introduced
> in this patch set up the basic functionality required in order to trap
> accesses to the PCI config space, and allow decoding the address and
> finding the corresponding handler that should handle the access
> (although no handlers are implemented).
> 
> Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are
> set up inside an x86 HVM file, since that's not shared with other
> arches.
> 
> A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen
> whether a domain should use the newly introduced vPCI handlers, this
> is only enabled for PVH Dom0 at the moment.
> 
> A very simple user-space test is also provided, so that the basic
> functionality of the vPCI traps can be asserted. This has proven
> quite helpful during development, since the logic to handle partial
> accesses or accesses that expand across multiple registers is not
> trivial.
> 
> The handlers for the registers are added to a linked list that's kept
> sorted at all times. Both the read and write handlers support accesses
> that expand across multiple emulated registers and contain gaps not
> emulated.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> [IO parts]
> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

Acked-by: Julien Grall <julien.grall@arm.com>

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 23+ messages in thread
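
The commit message quoted above mentions reads that span several emulated registers with non-emulated gaps in between. A minimal user-space sketch of that idea follows. It is not the vpci.c implementation: struct sketch_reg and sketch_read are hypothetical, a sorted array stands in for the sorted linked list, and gap bytes are simply filled with 0xff, which is an assumption made for illustration rather than the behaviour of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One emulated register: where it lives, how wide it is, what it holds. */
struct sketch_reg {
    unsigned int offset;  /* start offset in config space */
    unsigned int size;    /* width in bytes: 1, 2 or 4 */
    uint32_t value;       /* emulated content, little endian */
};

/*
 * Read len (1 to 4) bytes starting at offset. regs must be sorted by offset
 * and must not overlap. Bytes not covered by any register read as 0xff.
 */
static uint32_t sketch_read(const struct sketch_reg *regs, size_t nr,
                            unsigned int offset, unsigned int len)
{
    uint32_t data = 0;
    unsigned int i;

    assert(len >= 1 && len <= 4);

    for ( i = 0; i < len; i++ )
    {
        unsigned int pos = offset + i;
        uint8_t byte = 0xff;  /* assumed value for non-emulated gaps */
        size_t r;

        /* The sort order allows stopping at the first register past pos. */
        for ( r = 0; r < nr && regs[r].offset <= pos; r++ )
            if ( pos < regs[r].offset + regs[r].size )
            {
                byte = (uint8_t)(regs[r].value >>
                                 (8 * (pos - regs[r].offset)));
                break;
            }

        data |= (uint32_t)byte << (8 * i);
    }

    return data;
}
```

Given a 2-byte register at offset 0 and a 1-byte register at offset 3, a 4-byte read at offset 0 combines both registers with one gap byte at offset 2.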

* Re: [PATCH v11 07/12] vpci: add header handlers
  2018-03-20 15:15 ` [PATCH v11 07/12] vpci: add header handlers Roger Pau Monne
  2018-03-20 16:19   ` Jan Beulich
@ 2018-03-21 12:31   ` Paul Durrant
  1 sibling, 0 replies; 23+ messages in thread
From: Paul Durrant @ 2018-03-21 12:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim (Xen.org),
	George Dunlap, Julien Grall, Jan Beulich, Ian Jackson,
	Boris Ostrovsky, Roger Pau Monne

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 20 March 2018 15:16
> To: xen-devel@lists.xenproject.org
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>; Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com>; Roger Pau Monne <roger.pau@citrix.com>; Ian
> Jackson <Ian.Jackson@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Andrew
> Cooper <Andrew.Cooper3@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>; Jan Beulich <jbeulich@suse.com>; Julien Grall
> <julien.grall@arm.com>; Stefano Stabellini <sstabellini@kernel.org>; Tim
> (Xen.org) <tim@xen.org>; Paul Durrant <Paul.Durrant@citrix.com>
> Subject: [PATCH v11 07/12] vpci: add header handlers
> 
> Introduce a set of handlers that trap accesses to the PCI BARs and the
> command register, in order to snoop BAR sizing and BAR relocation.
> 
> The command handler is used to detect changes to bit 2 (response to
> memory space accesses), and maps/unmaps the BARs of the device into
> the guest p2m. A rangeset is used in order to figure out which memory
> to map/unmap. This makes it easier to keep track of the possible
> overlaps with other BARs, and will also simplify MSI-X support, where
> certain regions of a BAR might be used for the MSI-X table or PBA.
> 
> The BAR register handlers are used to detect attempts by the guest to
> size or relocate the BARs.
> 
> Note that the long running BAR mapping and unmapping operations are
> deferred to be performed by hvm_io_pending, so that they can be safely
> preempted.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

ioreq part

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Julien Grall <julien.grall@arm.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Tim Deegan <tim@xen.org>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
> Changes since v10:
>  - Fix indirect function call in map_range.
>  - Use rom->addr instead of fetching it from the ROM BAR register in
>    modify_decoding.
>  - Remove ternary operator from modify_decoding.
>  - Simplify apply_map to have a single return.
>  - Constify pci_dev parameter of apply_map.
>  - Remove references to maybe_defer_map.
>  - Use pdev (const) or dev (non-const) consistently in modify_bars.
>  - Invert part of the logic in rom_write to remove one indentation
>    level.
>  - Add comments in rom_write to clarify why rom->addr is updated in
>    two different places.
>  - Use lx to print frame numbers in modify_bars.
>  - Add start/end local variables in the first modify_bars loop.
> 
> Changes since v9:
>  - Expand comments to clarify the code.
>  - Rename rom to rom_only in the vpci_cpu struct.
>  - Change definition style of dummy vpci_cpu.
>  - Replace incorrect usage of PFN_UP.
>  - Use system_state in order to check if the mapping functions are
>    being called from Dom0 builder context.
>  - Split the maybe_defer_map into two functions and place the Dom0
>    builder one in the init section.
> 
> Changes since v8:
>  - Do not pretend to support ARM in the map_range function. Explain
>    the required changes in the comment.
>  - Introduce PCI_HEADER_{NORMAL/BRIDGE}_NR_BARS defines.
>  - Rename 'rom' boolean variable to 'rom_only', which is more
>    descriptive of its meaning.
>  - Introduce vpci_remove_device which removes all handlers for a
>    device.
>  - Simplify error handling when modifying BARs mapping. Any error will
>    cause the device to be unplugged (by calling vpci_remove_device).
>  - Return an error code in modify_bars. Add comments describing why
>    the error is sometimes ignored.
> 
> Changes since v7:
>  - Order includes.
>  - Add newline between switch cases.
>  - Fix typo in comment (hopping).
>  - Wrap ternary conditional in parentheses.
>  - Remove CONFIG_HAS_PCI guard from sched.h vpci_vcpu usage.
>  - Add comment regarding vpci_vcpu usage.
>  - Move rom_enabled from BAR struct to header.
>  - Do not protect vpci_vcpu with __XEN__ guards.
> 
> Changes since v6:
>  - s/vpci_check_pending/vpci_process_pending/.
>  - Improve error handling in vpci_process_pending.
>  - Add a comment that explains how vpci_check_bar_overlap works.
>  - Add error messages to vpci_modify_bars and vpci_modify_rom.
>  - Introduce vpci_hw_read16/32, in order to passthrough reads to
>    the underlying hw.
>  - Print BAR number on error in vpci_bar_write.
>  - Place the CONFIG_HAS_PCI guards inside the vpci.h header and
>    provide an empty vpci_vcpu structure for the !CONFIG_HAS_PCI case.
>  - Define CONFIG_HAS_PCI in the test harness emul.h header before
>    including vpci.h
>  - Add ARM TODOs and an ARM-specific bodge to vpci_map_range due to
>    the lack of preemption in {un}map_mmio_regions.
>  - Make vpci_maybe_defer_map void.
>  - Set rom_enabled in vpci_init_bars.
>  - Defer enabling/disabling the memory decoding (or the ROM enable
>    bit) until the memory has been mapped/unmapped.
>  - Remove vpci_ prefix from static functions.
>  - Use the same code in order to map the general BARs and the ROM
>    BARs.
>  - Remove the seg/bus local variables and use pdev->{seg,bus} instead.
>  - Convert the bools in the BAR related structs into bool bitfields.
>  - Add the must_check attribute to vpci_process_pending.
>  - Open code check_bar_overlap inside modify_bars, which was its only
>    user.
> 
> Changes since v5:
>  - Switch to the new handler type.
>  - Use pci_sbdf_t to size the BARs.
>  - Use a single return for vpci_modify_bar.
>  - Do not return an error code from vpci_modify_bars, just log the
>    failure.
>  - Remove the 'sizing' parameter. Instead just let the guest write
>    directly to the BAR, and read the value back. This simplifies the
>    BAR register handlers, especially the read one.
>  - Ignore ROM BAR writes with memory decoding enabled and ROM enabled.
>  - Do not propagate failures to setup the ROM BAR in vpci_init_bars.
>  - Add preemption support to the BAR mapping/unmapping operations.
> 
> Changes since v4:
>  - Expand commit message to mention the reason behind the usage of
>    rangesets.
>  - Fix comment related to the inclusiveness of rangesets.
>  - Fix off-by-one error in the calculation of the end of memory
>    regions.
>  - Store the state of the BAR (mapped/unmapped) in the vpci_bar
>    enabled field, previously was only used by ROMs.
>  - Fix double negation of return code.
>  - Modify vpci_cmd_write so it has a single call to pci_conf_write16.
>  - Print a warning when trying to write to the BAR with memory
>    decoding enabled (and ignore the write).
>  - Remove header_type local variable, it's used only once.
>  - Move the read of the command register.
>  - Restore previous command register value in the exit paths.
>  - Only set address to INVALID_PADDR if the initial BAR value matches
>     ~0 & PCI_BASE_ADDRESS_MEM_MASK.
>  - Don't disable the enabled bit in the expansion ROM register, memory
>    decoding is already disabled and takes precedence.
>  - Don't use INVALID_PADDR, just set the initial BAR address to the
>    value found in the hardware.
>  - Introduce rom_enabled to store the status of the
>    PCI_ROM_ADDRESS_ENABLE bit.
>  - Reorder fields of the structure to prevent holes.
> 
> Changes since v3:
>  - Propagate previous changes: drop xen_ prefix and use u8/u16/u32
>    instead of the previous half_word/word/double_word.
>  - Constify some of the parameters.
>  - s/VPCI_BAR_MEM/VPCI_BAR_MEM32/.
>  - Simplify the number of fields stored for each BAR, a single address
>    field is stored and contains the address of the BAR both on Xen and
>    in the guest.
>  - Allow the guest to move the BARs around in the physical memory map.
>  - Add support for expansion ROM BARs.
>  - Do not cache the value of the command register.
>  - Remove a label used in vpci_cmd_write.
>  - Fix the calculation of the sizing mask in vpci_bar_write.
>  - Check the memory decode bit in order to decide if a BAR is
>    positioned or not.
>  - Disable memory decoding before sizing the BARs in Xen.
>  - When mapping/unmapping BARs check if there's overlap between BARs,
>    in order to avoid unmapping memory required by another BAR.
>  - Introduce a macro to check whether a BAR is mappable or not.
>  - Add a comment regarding the lack of support for SR-IOV.
>  - Remove the usage of the GENMASK macro.
> 
> Changes since v2:
>  - Detect unset BARs and allow the hardware domain to position them.
> ---
>  tools/tests/vpci/emul.h   |   1 +
>  xen/arch/x86/hvm/ioreq.c  |   4 +
>  xen/drivers/vpci/Makefile |   2 +-
>  xen/drivers/vpci/header.c | 548 ++++++++++++++++++++++++++++++++++++++++++++++
>  xen/drivers/vpci/vpci.c   |  45 ++--
>  xen/include/xen/sched.h   |   4 +
>  xen/include/xen/vpci.h    |  61 ++++++
>  7 files changed, 651 insertions(+), 14 deletions(-)
>  create mode 100644 xen/drivers/vpci/header.c
> 
> diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h
> index fd0317995a..5d47544bf7 100644
> --- a/tools/tests/vpci/emul.h
> +++ b/tools/tests/vpci/emul.h
> @@ -80,6 +80,7 @@ typedef union {
>      };
>  } pci_sbdf_t;
> 
> +#define CONFIG_HAS_VPCI
>  #include "vpci.h"
> 
>  #define __hwdom_init
> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
> index 7e66965bcd..90c9e3cd59 100644
> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -26,6 +26,7 @@
>  #include <xen/domain.h>
>  #include <xen/event.h>
>  #include <xen/paging.h>
> +#include <xen/vpci.h>
> 
>  #include <asm/hvm/hvm.h>
>  #include <asm/hvm/ioreq.h>
> @@ -48,6 +49,9 @@ bool hvm_io_pending(struct vcpu *v)
>      struct domain *d = v->domain;
>      struct hvm_ioreq_server *s;
> 
> +    if ( has_vpci(d) && vpci_process_pending(v) )
> +        return true;
> +
>      list_for_each_entry ( s,
>                            &d->arch.hvm_domain.ioreq_server.list,
>                            list_entry )
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> index 840a906470..241467212f 100644
> --- a/xen/drivers/vpci/Makefile
> +++ b/xen/drivers/vpci/Makefile
> @@ -1 +1 @@
> -obj-y += vpci.o
> +obj-y += vpci.o header.o
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> new file mode 100644
> index 0000000000..d7c220a452
> --- /dev/null
> +++ b/xen/drivers/vpci/header.c
> @@ -0,0 +1,548 @@
> +/*
> + * Generic functionality for handling accesses to the PCI header from the
> + * configuration space.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/p2m-common.h>
> +#include <xen/sched.h>
> +#include <xen/softirq.h>
> +#include <xen/vpci.h>
> +
> +#include <asm/event.h>
> +
> +#define MAPPABLE_BAR(x)                                                 \
> +    ((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||   \
> +     (x)->type == VPCI_BAR_ROM)
> +
> +struct map_data {
> +    struct domain *d;
> +    bool map;
> +};
> +
> +static int map_range(unsigned long s, unsigned long e, void *data,
> +                     unsigned long *c)
> +{
> +    const struct map_data *map = data;
> +    int rc;
> +
> +    for ( ; ; )
> +    {
> +        unsigned long size = e - s + 1;
> +
> +        /*
> +         * ARM TODOs:
> +         * - On ARM whether the memory is prefetchable or not should be passed
> +         *   to map_mmio_regions in order to decide which memory attributes
> +         *   should be used.
> +         *
> +         * - {un}map_mmio_regions doesn't support preemption.
> +         */
> +
> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> +                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> +        if ( rc == 0 )
> +        {
> +            *c += size;
> +            break;
> +        }
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_G_WARNING
> +                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
> +                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
> +            break;
> +        }
> +        ASSERT(rc < size);
> +        *c += rc;
> +        s += rc;
> +        if ( general_preempt_check() )
> +            return -ERESTART;
> +    }
> +
> +    return rc;
> +}
> +
> +/*
> + * The rom_only parameter is used to signal the map/unmap helpers that the ROM
> + * BAR's enable bit has changed with the memory decoding bit already enabled.
> + * If rom_only is not set then it's the memory decoding bit that changed.
> + */
> +static void modify_decoding(const struct pci_dev *pdev, bool map, bool rom_only)
> +{
> +    struct vpci_header *header = &pdev->vpci->header;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t cmd;
> +    unsigned int i;
> +
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +    {
> +        if ( !MAPPABLE_BAR(&header->bars[i]) )
> +            continue;
> +
> +        if ( rom_only && header->bars[i].type == VPCI_BAR_ROM )
> +        {
> +            unsigned int rom_pos = (i == PCI_HEADER_NORMAL_NR_BARS)
> +                                   ? PCI_ROM_ADDRESS : PCI_ROM_ADDRESS1;
> +            uint32_t val = header->bars[i].addr |
> +                           (map ? PCI_ROM_ADDRESS_ENABLE : 0);
> +
> +            header->bars[i].enabled = header->rom_enabled = map;
> +            pci_conf_write32(pdev->seg, pdev->bus, slot, func, rom_pos, val);
> +            return;
> +        }
> +
> +        if ( !rom_only &&
> +             (header->bars[i].type != VPCI_BAR_ROM || header->rom_enabled) )
> +            header->bars[i].enabled = map;
> +    }
> +
> +    ASSERT(!rom_only);
> +    cmd = pci_conf_read16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND);
> +    cmd &= ~PCI_COMMAND_MEMORY;
> +    cmd |= map ? PCI_COMMAND_MEMORY : 0;
> +    pci_conf_write16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND,
> +                     cmd);
> +}
> +
> +bool vpci_process_pending(struct vcpu *v)
> +{
> +    if ( v->vpci.mem )
> +    {
> +        struct map_data data = {
> +            .d = v->domain,
> +            .map = v->vpci.map,
> +        };
> +        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
> +
> +        if ( rc == -ERESTART )
> +            return true;
> +
> +        spin_lock(&v->vpci.pdev->vpci->lock);
> +        /* Disable memory decoding unconditionally on failure. */
> +        modify_decoding(v->vpci.pdev, !rc && v->vpci.map,
> +                        !rc && v->vpci.rom_only);
> +        spin_unlock(&v->vpci.pdev->vpci->lock);
> +
> +        rangeset_destroy(v->vpci.mem);
> +        v->vpci.mem = NULL;
> +        if ( rc )
> +            /*
> +             * FIXME: in case of failure remove the device from the domain.
> +             * Note that there might still be leftover mappings. While this is
> +             * safe for Dom0, for DomUs the domain will likely need to be
> +             * killed in order to avoid leaking stale p2m mappings on
> +             * failure.
> +             */
> +            vpci_remove_device(v->vpci.pdev);
> +    }
> +
> +    return false;
> +}
> +
> +static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> +                            struct rangeset *mem)
> +{
> +    struct map_data data = { .d = d, .map = true };
> +    int rc;
> +
> +    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +        process_pending_softirqs();
> +    rangeset_destroy(mem);
> +    if ( !rc )
> +        modify_decoding(pdev, true, false);
> +
> +    return rc;
> +}
> +
> +static void defer_map(struct domain *d, struct pci_dev *pdev,
> +                      struct rangeset *mem, bool map, bool rom_only)
> +{
> +    struct vcpu *curr = current;
> +
> +    /*
> +     * FIXME: when deferring the {un}map the state of the device should not
> +     * be trusted. For example the enable bit is toggled after the device
> +     * is mapped. This can lead to parallel mapping operations being
> +     * started for the same device if the domain is not well-behaved.
> +     */
> +    curr->vpci.pdev = pdev;
> +    curr->vpci.mem = mem;
> +    curr->vpci.map = map;
> +    curr->vpci.rom_only = rom_only;
> +}
> +
> +static int modify_bars(const struct pci_dev *pdev, bool map, bool rom_only)
> +{
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> +    struct pci_dev *tmp, *dev = NULL;
> +    unsigned int i;
> +    int rc;
> +
> +    if ( !mem )
> +        return -ENOMEM;
> +
> +    /*
> +     * Create a rangeset that represents the current device BARs memory region
> +     * and compare it against all the currently active BAR memory regions. If
> +     * an overlap is found, subtract it from the region to be mapped/unmapped.
> +     *
> +     * First fill the rangeset with all the BARs of this device or with the ROM
> +     * BAR only, depending on whether the guest is toggling the memory decode
> +     * bit of the command register, or the enable bit of the ROM BAR register.
> +     */
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +    {
> +        const struct vpci_bar *bar = &header->bars[i];
> +        unsigned long start = PFN_DOWN(bar->addr);
> +        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> +
> +        if ( !MAPPABLE_BAR(bar) ||
> +             (rom_only ? bar->type != VPCI_BAR_ROM
> +                       : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) )
> +            continue;
> +
> +        rc = rangeset_add_range(mem, start, end);
> +        if ( rc )
> +        {
> +            printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
> +                   start, end, rc);
> +            rangeset_destroy(mem);
> +            return rc;
> +        }
> +    }
> +
> +    /*
> +     * Check for overlaps with other BARs. Note that only BARs that are
> +     * currently mapped (enabled) are checked for overlaps.
> +     */
> +    list_for_each_entry(tmp, &pdev->domain->arch.pdev_list, domain_list)
> +    {
> +        if ( tmp == pdev )
> +        {
> +            /*
> +             * Need to store the device so it's not constified and defer_map
> +             * can modify it in case of error.
> +             */
> +            dev = tmp;
> +            if ( !rom_only )
> +                /*
> +                 * If memory decoding is toggled avoid checking against the
> +                 * same device, or else all regions will be removed from the
> +                 * memory map in the unmap case.
> +                 */
> +                continue;
> +        }
> +
> +        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> +        {
> +            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> +            unsigned long start = PFN_DOWN(bar->addr);
> +            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> +
> +            if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
> +                 /*
> +                  * If only the ROM enable bit is toggled check against other
> +                  * BARs in the same device for overlaps, but not against the
> +                  * same ROM BAR.
> +                  */
> +                 (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
> +                continue;
> +
> +            rc = rangeset_remove_range(mem, start, end);
> +            if ( rc )
> +            {
> +                printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
> +                       start, end, rc);
> +                rangeset_destroy(mem);
> +                return rc;
> +            }
> +        }
> +    }
> +
> +    ASSERT(dev);
> +
> +    if ( system_state < SYS_STATE_active )
> +    {
> +        /*
> +         * Mappings might be created when building Dom0 if the memory decoding
> +         * bit of PCI devices is enabled. In that case it's not possible to
> +         * defer the operation, so call apply_map in order to create the
> +         * mappings right away. Note that at build time this function will only
> +         * be called iff the memory decoding bit is enabled, thus the operation
> +         * will always be to establish mappings and process all the BARs.
> +         */
> +        ASSERT(map && !rom_only);
> +        return apply_map(pdev->domain, pdev, mem);
> +    }
> +
> +    defer_map(dev->domain, dev, mem, map, rom_only);
> +
> +    return 0;
> +}
> +
> +static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
> +                      uint32_t cmd, void *data)
> +{
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t current_cmd = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
> +                                           reg);
> +
> +    /*
> +     * Let Dom0 play with all the bits directly except for the memory
> +     * decoding one.
> +     */
> +    if ( (cmd ^ current_cmd) & PCI_COMMAND_MEMORY )
> +        /*
> +         * Ignore the error. No memory has been added or removed from the p2m
> +         * (because the actual p2m changes are deferred in defer_map) and the
> +         * memory decoding bit has not been changed, so leave everything as-is,
> +         * hoping the guest will realize and try again.
> +         */
> +        modify_bars(pdev, cmd & PCI_COMMAND_MEMORY, false);
> +    else
> +        pci_conf_write16(pdev->seg, pdev->bus, slot, func, reg, cmd);
> +}
> +
> +static void bar_write(const struct pci_dev *pdev, unsigned int reg,
> +                      uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    bool hi = false;
> +
> +    if ( pci_conf_read16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY )
> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%04x:%02x:%02x.%u: ignored BAR %lu write with memory decoding enabled\n",
> +                pdev->seg, pdev->bus, slot, func,
> +                bar - pdev->vpci->header.bars);
> +        return;
> +    }
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +
> +    /*
> +     * Update the cached address, so that when memory decoding is enabled
> +     * Xen can map the BAR into the guest p2m.
> +     */
> +    bar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
> +    bar->addr |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    /* Make sure Xen writes back the same value for the BAR RO bits. */
> +    if ( !hi )
> +    {
> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +    }
> +
> +    pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), reg, val);
> +}
> +
> +static void rom_write(const struct pci_dev *pdev, unsigned int reg,
> +                      uint32_t val, void *data)
> +{
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *rom = data;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
> +                                   PCI_COMMAND);
> +    bool new_enabled = val & PCI_ROM_ADDRESS_ENABLE;
> +
> +    if ( (cmd & PCI_COMMAND_MEMORY) && header->rom_enabled && new_enabled )
> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%04x:%02x:%02x.%u: ignored ROM BAR write with memory decoding enabled\n",
> +                pdev->seg, pdev->bus, slot, func);
> +        return;
> +    }
> +
> +    if ( !header->rom_enabled )
> +        /*
> +         * If the ROM BAR is not enabled update the address field so the
> +         * correct address is mapped into the p2m.
> +         */
> +        rom->addr = val & PCI_ROM_ADDRESS_MASK;
> +
> +    if ( !(cmd & PCI_COMMAND_MEMORY) || header->rom_enabled == new_enabled )
> +    {
> +        /* Just update the ROM BAR field. */
> +        header->rom_enabled = new_enabled;
> +        pci_conf_write32(pdev->seg, pdev->bus, slot, func, reg, val);
> +    }
> +    else if ( modify_bars(pdev, new_enabled, true) )
> +        /*
> +         * No memory has been added or removed from the p2m (because the actual
> +         * p2m changes are deferred in defer_map) and the ROM enable bit has
> +         * not been changed, so leave everything as-is, hoping the guest will
> +         * realize and try again. It's important to not update rom->addr in the
> +         * unmap case if modify_bars has failed, or future attempts would
> +         * attempt to unmap the wrong address.
> +         */
> +        return;
> +
> +    if ( !new_enabled )
> +        rom->addr = val & PCI_ROM_ADDRESS_MASK;
> +}
> +
> +static int init_bars(struct pci_dev *pdev)
> +{
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t cmd;
> +    uint64_t addr, size;
> +    unsigned int i, num_bars, rom_reg;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *bars = header->bars;
> +    pci_sbdf_t sbdf = {
> +        .seg = pdev->seg,
> +        .bus = pdev->bus,
> +        .dev = slot,
> +        .func = func,
> +    };
> +    int rc;
> +
> +    switch ( pci_conf_read8(pdev->seg, pdev->bus, slot, func, PCI_HEADER_TYPE)
> +             & 0x7f )
> +    {
> +    case PCI_HEADER_TYPE_NORMAL:
> +        num_bars = PCI_HEADER_NORMAL_NR_BARS;
> +        rom_reg = PCI_ROM_ADDRESS;
> +        break;
> +
> +    case PCI_HEADER_TYPE_BRIDGE:
> +        num_bars = PCI_HEADER_BRIDGE_NR_BARS;
> +        rom_reg = PCI_ROM_ADDRESS1;
> +        break;
> +
> +    default:
> +        return -EOPNOTSUPP;
> +    }
> +
> +    /* Setup a handler for the command register. */
> +    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
> +                           2, header);
> +    if ( rc )
> +        return rc;
> +
> +    /* Disable memory decoding before sizing. */
> +    cmd = pci_conf_read16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND);
> +    if ( cmd & PCI_COMMAND_MEMORY )
> +        pci_conf_write16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND,
> +                         cmd & ~PCI_COMMAND_MEMORY);
> +
> +    for ( i = 0; i < num_bars; i++ )
> +    {
> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +        uint32_t val;
> +
> +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> +        {
> +            bars[i].type = VPCI_BAR_MEM64_HI;
> +            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> +                                   4, &bars[i]);
> +            if ( rc )
> +            {
> +                pci_conf_write16(pdev->seg, pdev->bus, slot, func,
> +                                 PCI_COMMAND, cmd);
> +                return rc;
> +            }
> +
> +            continue;
> +        }
> +
> +        val = pci_conf_read32(pdev->seg, pdev->bus, slot, func, reg);
> +        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> +        {
> +            bars[i].type = VPCI_BAR_IO;
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +             PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +            bars[i].type = VPCI_BAR_MEM64_LO;
> +        else
> +            bars[i].type = VPCI_BAR_MEM32;
> +
> +        rc = pci_size_mem_bar(sbdf, reg, &addr, &size,
> +                              (i == num_bars - 1) ? PCI_BAR_LAST : 0);
> +        if ( rc < 0 )
> +        {
> +            pci_conf_write16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND,
> +                             cmd);
> +            return rc;
> +        }
> +
> +        if ( size == 0 )
> +        {
> +            bars[i].type = VPCI_BAR_EMPTY;
> +            continue;
> +        }
> +
> +        bars[i].addr = addr;
> +        bars[i].size = size;
> +        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> +        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> +                               &bars[i]);
> +        if ( rc )
> +        {
> +            pci_conf_write16(pdev->seg, pdev->bus, slot, func, PCI_COMMAND,
> +                             cmd);
> +            return rc;
> +        }
> +    }
> +
> +    /* Check expansion ROM. */
> +    rc = pci_size_mem_bar(sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
> +    if ( rc > 0 && size )
> +    {
> +        struct vpci_bar *rom = &header->bars[num_bars];
> +
> +        rom->type = VPCI_BAR_ROM;
> +        rom->size = size;
> +        rom->addr = addr;
> +        header->rom_enabled = pci_conf_read32(pdev->seg, pdev->bus, slot, func,
> +                                              rom_reg) & PCI_ROM_ADDRESS_ENABLE;
> +
> +        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
> +                               4, rom);
> +        if ( rc )
> +            rom->type = VPCI_BAR_EMPTY;
> +    }
> +
> +    return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, true, false) : 0;
> +}
> +REGISTER_VPCI_INIT(init_bars);
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 4740d02edf..e5b49b9d82 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -34,6 +34,23 @@ struct vpci_register {
>      struct list_head node;
>  };
> 
> +void vpci_remove_device(struct pci_dev *pdev)
> +{
> +    spin_lock(&pdev->vpci->lock);
> +    while ( !list_empty(&pdev->vpci->handlers) )
> +    {
> +        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> +                                                   struct vpci_register,
> +                                                   node);
> +
> +        list_del(&r->node);
> +        xfree(r);
> +    }
> +    spin_unlock(&pdev->vpci->lock);
> +    xfree(pdev->vpci);
> +    pdev->vpci = NULL;
> +}
> +
>  int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
>  {
>      unsigned int i;
> @@ -57,19 +74,7 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
>      }
> 
>      if ( rc )
> -    {
> -        while ( !list_empty(&pdev->vpci->handlers) )
> -        {
> -            struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> -                                                       struct vpci_register,
> -                                                       node);
> -
> -            list_del(&r->node);
> -            xfree(r);
> -        }
> -        xfree(pdev->vpci);
> -        pdev->vpci = NULL;
> -    }
> +        vpci_remove_device(pdev);
> 
>      return rc;
>  }
> @@ -102,6 +107,20 @@ static void vpci_ignored_write(const struct pci_dev *pdev, unsigned int reg,
>  {
>  }
> 
> +uint32_t vpci_hw_read16(const struct pci_dev *pdev, unsigned int reg,
> +                        void *data)
> +{
> +    return pci_conf_read16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                           PCI_FUNC(pdev->devfn), reg);
> +}
> +
> +uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
> +                        void *data)
> +{
> +    return pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                           PCI_FUNC(pdev->devfn), reg);
> +}
> +
>  int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>                        vpci_write_t *write_handler, unsigned int offset,
>                        unsigned int size, void *data)
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index f89896e59b..57bb142c02 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -20,6 +20,7 @@
>  #include <xen/smp.h>
>  #include <xen/perfc.h>
>  #include <asm/atomic.h>
> +#include <xen/vpci.h>
>  #include <xen/wait.h>
>  #include <public/xen.h>
>  #include <public/domctl.h>
> @@ -264,6 +265,9 @@ struct vcpu
> 
>      struct evtchn_fifo_vcpu *evtchn_fifo;
> 
> +    /* vPCI per-vCPU area, used to store data for long running operations. */
> +    struct vpci_vcpu vpci;
> +
>      struct arch_vcpu arch;
>  };
> 
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 9f2864fb0c..6bf8b22b4f 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -1,6 +1,8 @@
>  #ifndef _XEN_VPCI_H_
>  #define _XEN_VPCI_H_
> 
> +#ifdef CONFIG_HAS_VPCI
> +
>  #include <xen/pci.h>
>  #include <xen/types.h>
>  #include <xen/list.h>
> @@ -20,6 +22,9 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
>  /* Add vPCI handlers to device. */
>  int __must_check vpci_add_handlers(struct pci_dev *dev);
> 
> +/* Remove all handlers and free vpci related structures. */
> +void vpci_remove_device(struct pci_dev *pdev);
> +
>  /* Add/remove a register handler. */
>  int __must_check vpci_add_register(struct vpci *vpci,
>                                     vpci_read_t *read_handler,
> @@ -34,12 +39,68 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data);
> 
> +/* Passthrough handlers. */
> +uint32_t vpci_hw_read16(const struct pci_dev *pdev, unsigned int reg,
> +                        void *data);
> +uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
> +                        void *data);
> +
> +/*
> + * Check for pending vPCI operations on this vcpu. Returns true if the vcpu
> + * should not run.
> + */
> +bool __must_check vpci_process_pending(struct vcpu *v);
> +
>  struct vpci {
>      /* List of vPCI handlers for a device. */
>      struct list_head handlers;
>      spinlock_t lock;
> +
> +#ifdef __XEN__
> +    /* Hide the rest of the vpci struct from the user-space test harness. */
> +    struct vpci_header {
> +        /* Information about the PCI BARs of this device. */
> +        struct vpci_bar {
> +            uint64_t addr;
> +            uint64_t size;
> +            enum {
> +                VPCI_BAR_EMPTY,
> +                VPCI_BAR_IO,
> +                VPCI_BAR_MEM32,
> +                VPCI_BAR_MEM64_LO,
> +                VPCI_BAR_MEM64_HI,
> +                VPCI_BAR_ROM,
> +            } type;
> +            bool prefetchable : 1;
> +            /* Store whether the BAR is mapped into guest p2m. */
> +            bool enabled      : 1;
> +#define PCI_HEADER_NORMAL_NR_BARS        6
> +#define PCI_HEADER_BRIDGE_NR_BARS        2
> +        } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
> +        /* At most 6 BARS + 1 expansion ROM BAR. */
> +
> +        /*
> +         * Store whether the ROM enable bit is set (doesn't imply ROM BAR
> +         * is mapped into guest p2m) if there's a ROM BAR on the device.
> +         */
> +        bool rom_enabled      : 1;
> +        /* FIXME: currently there's no support for SR-IOV. */
> +    } header;
> +#endif
> +};
> +
> +struct vpci_vcpu {
> +    /* Per-vcpu structure to store state while {un}mapping PCI BARs. */
> +    struct rangeset *mem;
> +    struct pci_dev *pdev;
> +    bool map      : 1;
> +    bool rom_only : 1;
>  };
> 
> +#else /* !CONFIG_HAS_VPCI */
> +struct vpci_vcpu {};
> +#endif
> +
>  #endif
> 
>  /*
> --
> 2.16.2
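
For readers following the BAR handling in the patch above: bar_write keeps a
single cached 64-bit address per BAR and merges each 32-bit guest write into
either the low or the high half. The following is a minimal standalone sketch
of that merge only, not the Xen code itself. The names sketch_update_bar_addr
and SKETCH_MEM_MASK are hypothetical; the mask assumes the usual layout where
the low four bits of a memory BAR hold the read-only type/prefetch flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PCI_BASE_ADDRESS_MEM_MASK: low 4 bits are flags. */
#define SKETCH_MEM_MASK (~UINT32_C(0x0f))

/*
 * Merge a 32-bit guest write into a cached 64-bit BAR address.
 * 'hi' selects the upper half, i.e. a write to the MEM64_HI register.
 */
static uint64_t sketch_update_bar_addr(uint64_t addr, uint32_t val, bool hi)
{
    unsigned int shift = hi ? 32 : 0;

    if ( !hi )
        val &= SKETCH_MEM_MASK;  /* drop the read-only type/prefetch bits */

    /* Clear the half being written, then install the new value. */
    addr &= ~(UINT64_C(0xffffffff) << shift);
    addr |= (uint64_t)val << shift;

    return addr;
}
```

Writing the low half and then the high half of a 64-bit BAR therefore ends up
with one complete address in the cache, which is what gets mapped once the
guest re-enables memory decoding.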

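The changelog above mentions an off-by-one fix in the calculation of the end
of memory regions, and modify_bars computes inclusive page-frame ranges with
PFN_DOWN(addr) and PFN_DOWN(addr + size - 1). As a rough illustration of why
the "- 1" matters, here is a small standalone sketch. The helper names and the
4 KiB page size are assumptions made for the example, and the overlap test is
a plain interval check rather than the rangeset API used by the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Inclusive range of page frames, in the spirit of [start, end] rangesets. */
struct sketch_range {
    unsigned long start, end;
};

/* Frames covered by a BAR at 'addr' of 'size' bytes (size must be nonzero). */
static struct sketch_range sketch_bar_frames(uint64_t addr, uint64_t size)
{
    struct sketch_range r;

    assert(size != 0);
    r.start = (unsigned long)(addr >> SKETCH_PAGE_SHIFT);
    /* Last byte is addr + size - 1; without the -1 an extra frame is claimed. */
    r.end = (unsigned long)((addr + size - 1) >> SKETCH_PAGE_SHIFT);

    return r;
}

/* Two inclusive ranges overlap if each one starts before the other ends. */
static int sketch_ranges_overlap(struct sketch_range a, struct sketch_range b)
{
    return a.start <= b.end && b.start <= a.end;
}
```

With these definitions a 16 KiB BAR at 0xfe000000 covers frames 0xfe000 to
0xfe003, so a neighbouring BAR starting at 0xfe004000 is not reported as
overlapping, which it would be if the end were computed from addr + size.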
^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v11 09/12] vpci/msi: add MSI handlers
  2018-03-20 15:15 ` [PATCH v11 09/12] vpci/msi: add MSI handlers Roger Pau Monne
@ 2018-03-21 12:34   ` Paul Durrant
  0 siblings, 0 replies; 23+ messages in thread
From: Paul Durrant @ 2018-03-21 12:34 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim (Xen.org),
	George Dunlap, Julien Grall, Jan Beulich, Ian Jackson,
	Boris Ostrovsky, Roger Pau Monne

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 20 March 2018 15:16
> To: xen-devel@lists.xenproject.org
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>; Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com>; Roger Pau Monne <roger.pau@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Julien
> Grall <julien.grall@arm.com>; Stefano Stabellini <sstabellini@kernel.org>;
> Tim (Xen.org) <tim@xen.org>; Wei Liu <wei.liu2@citrix.com>; Paul Durrant
> <Paul.Durrant@citrix.com>
> Subject: [PATCH v11 09/12] vpci/msi: add MSI handlers
> 
> Add handlers for the MSI control, address, data and mask fields in
> order to detect accesses to them and setup the interrupts as requested
> by the guest.
> 
> Note that the pending register is not trapped, and the guest can
> freely read/write to it.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

io header changes...

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Julien Grall <julien.grall@arm.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Tim Deegan <tim@xen.org>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
> Changes since v8:
>  - Add a FIXME about the lack of testing and a comment regarding the
>    lack of cleaning done in the init_msi error path.
>  - Free msi struct when cleaning up if an init function failed.
>  - Remove the 'error' label of init_msi, the caller will already
>    perform the cleaning.
> 
> Changes since v7:
>  - Don't store PCI segment/bus in local variables.
>  - Add an error label to init_msi.
>  - Don't trap accesses to the PBA.
>  - Fix msi_pending_bits_reg macro so it matches coding style.
>  - Move the position of vectors in the vpci_msi struct.
>  - Add a comment to clarify the expected state of vectors after
>    pt_irq_create_bind and use XEN_DOMCTL_VMSI_X86_UNMASKED.
> 
> Changes since v6:
>  - Use domain_spin_lock_irq_desc instead of open coding it.
>  - Reduce the size of printed debug messages.
>  - Constify domain in vpci_dump_msi.
>  - Lock domlist_read_lock before iterating over the list of domains.
>  - Make max_vectors and vectors uint8_t.
>  - Drop the vpci_ prefix from the static functions in msi.c.
>  - Turn the booleans in vpci_msi into bitfields.
>  - Apply the mask bits to all vectors when enabling msi.
>  - Remove the pos field.
>  - Remove the usage of __msi_set_{enable/disable}.
>  - Update the bindings when the message or data fields are updated.
>  - Make vpci_msi_arch_disable return void, it wasn't returning any
>    error.
>  - Prevent the guest from writing to the pending bits field, it's read
>    only as defined in the spec.
>  - Add the must_check attribute to vpci_msi_arch_enable.
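
The mask handling described in the v6 notes can be illustrated with a small
standalone sketch: only the bits that differ between the old and the new mask
value are acted upon, and bits beyond the number of configured vectors are
ignored. The function below is a hypothetical stand-in that records what
would be passed to vpci_msi_arch_mask; it is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: walk the changed mask bits for the configured
 * vectors. Returns the number of per-vector updates and reports which
 * vectors would be masked or unmasked.
 */
static unsigned int mask_write_changes(uint32_t old, uint32_t val,
                                       unsigned int vectors,
                                       uint32_t *masked, uint32_t *unmasked)
{
    uint32_t dmask = old ^ val;        /* bits that changed */
    unsigned int i, calls = 0;

    assert(vectors >= 1 && vectors <= 32);

    *masked = *unmasked = 0;
    for ( i = 0; i < vectors; i++ )
    {
        uint32_t bit = UINT32_C(1) << i;

        if ( !(dmask & bit) )
            continue;                  /* unchanged: nothing to do */

        if ( val & bit )
            *masked |= bit;
        else
            *unmasked |= bit;
        calls++;
    }

    return calls;
}
```

The stored mask is still updated with the full value written by the guest,
so bits for vectors that are not currently configured are remembered and
applied when MSI is next enabled.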
> 
> Changes since v5:
>  - Update to new lock usage.
>  - Change handlers to match the new type.
>  - s/msi_flags/msi_gflags/, remove the local variables and use the new
>    DOMCTL_VMSI_* defines.
>  - Change the MSI arch function to take a vpci_msi instead of a
>    vpci_arch_msi as parameter.
>  - Fix the calculation of the guest vector for MSI injection to take
>    into account the number of bits that can be modified.
>  - Use INVALID_PIRQ everywhere.
>  - Simplify exit path of vpci_msi_disable.
>  - Remove the conditional when setting address64 and masking fields.
>  - Add a process_pending_softirqs to the MSI dump loop.
>  - Place the prototypes for the MSI arch-specific functions in
>    xen/vpci.h.
>  - Add parentheses around the INVALID_PIRQ definition.
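
To make the guest vector calculation mentioned in the v5 notes concrete, here
is a standalone sketch of the arithmetic used in vpci_msi_arch_enable. With N
vectors enabled only the low log2(N) bits of the vector may vary, so the
per-entry vector wraps inside that window instead of carrying into the upper
bits. fls_u and guest_vector are local, hypothetical names; fls_u stands in
for Xen's fls().

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for 0. */
static unsigned int fls_u(unsigned int x)
{
    unsigned int r = 0;

    while ( x )
    {
        r++;
        x >>= 1;
    }

    return r;
}

/* Guest vector for entry i when 'vectors' (a power of two) are enabled. */
static uint8_t guest_vector(uint8_t base, unsigned int vectors,
                            unsigned int i)
{
    uint8_t mask;

    assert(vectors && !(vectors & (vectors - 1)) && vectors <= 32);
    assert(i < vectors);

    /* Bits the device is allowed to modify: log2(vectors) low bits. */
    mask = 0xff >> (8 - fls_u(vectors) + 1);

    return (base & ~mask) | ((base + i) & mask);
}
```

For example, with 4 vectors and a base vector of 0x32 the entries map to
0x32, 0x33, 0x30 and 0x31, and with a single vector the mask is 0 so the
base vector is used unchanged.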
> 
> Changes since v4:
>  - Fix commit message.
>  - Change the ASSERTs in vpci_msi_arch_mask into ifs.
>  - Introduce INVALID_PIRQ.
>  - Destroy the partially created bindings in case of failure in
>    vpci_msi_arch_enable.
>  - Just take the pcidevs lock once in vpci_msi_arch_disable.
>  - Print an error message in case of failure of pt_irq_destroy_bind.
>  - Make vpci_msi_arch_init return void.
>  - Constify the arch parameter of vpci_msi_arch_print.
>  - Use fixed instead of cpu for msi redirection.
>  - Separate the header includes in vpci/msi.c between xen and asm.
>  - Store the number of configured vectors even if MSI is not enabled
>    and always return it in vpci_msi_control_read.
>  - Fix/add comments in vpci_msi_control_write to clarify intended
>    behavior.
>  - Simplify usage of masks in vpci_msi_address_{upper_}write.
>  - Add comment to vpci_msi_mask_{read/write}.
>  - Don't use MASK_EXTR in vpci_msi_mask_write.
>  - s/msi_offset/pos/ in vpci_init_msi.
>  - Move control variable setup closer to its usage.
>  - Use d%d in vpci_dump_msi.
>  - Fix printing of bitfield mask in vpci_dump_msi.
>  - Fix definition of MSI_ADDR_REDIRECTION_MASK.
>  - Shuffle the layout of vpci_msi to minimize gaps.
>  - Remove the error label in vpci_init_msi.
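
As an illustration of the control register handling described in the v4
notes, the sketch below shows how the message control word can be built from
the cached state and how the requested vector count can be decoded from a
guest write. The flag values are the standard MSI message control bits; the
macros are local reimplementations and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSI_FLAGS_ENABLE  0x0001   /* MSI enable */
#define MSI_FLAGS_QMASK   0x000e   /* Multiple Message Capable (log2) */
#define MSI_FLAGS_QSIZE   0x0070   /* Multiple Message Enable (log2) */
#define MSI_FLAGS_64BIT   0x0080   /* 64-bit address capable */
#define MSI_FLAGS_MASKBIT 0x0100   /* per-vector masking capable */

/* Extract/insert a field given its mask (lowest set bit is the shift). */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/* log2 of a power of two. */
static unsigned int order(unsigned int n)
{
    unsigned int r = 0;

    assert(n && !(n & (n - 1)));
    while ( n >>= 1 )
        r++;

    return r;
}

static uint16_t control_encode(unsigned int max_vectors, unsigned int vectors,
                               bool enabled, bool masking, bool address64)
{
    return MASK_INSR(order(max_vectors), MSI_FLAGS_QMASK) |
           MASK_INSR(order(vectors), MSI_FLAGS_QSIZE) |
           (enabled ? MSI_FLAGS_ENABLE : 0) |
           (masking ? MSI_FLAGS_MASKBIT : 0) |
           (address64 ? MSI_FLAGS_64BIT : 0);
}

/* Vector count requested by a write, clamped to what the device supports. */
static unsigned int control_requested_vectors(uint16_t val,
                                              unsigned int max_vectors)
{
    unsigned int req = 1u << MASK_EXTR(val, MSI_FLAGS_QSIZE);

    return req < max_vectors ? req : max_vectors;
}
```

Because the configured count is kept even while MSI is disabled, a read of
the control register always reflects the last value the guest wrote to the
Multiple Message Enable field, clamped to the device maximum.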
> 
> Changes since v3:
>  - Propagate changes from previous versions: drop xen_ prefix, drop
>    return value from handlers, use the new vpci_val fields.
>  - Use MASK_EXTR.
>  - Remove the usage of GENMASK.
>  - Add GFLAGS_SHIFT_DEST_ID and use it in msi_flags.
>  - Add "arch" to the MSI arch specific functions.
>  - Move the dumping of vPCI MSI information to dump_msi (key 'M').
>  - Remove the guest_vectors field.
>  - Allow the guest to change the number of active vectors without
>    having to disable and enable MSI.
>  - Check the number of active vectors when parsing the disable
>    mask.
>  - Remove the debug messages from vpci_init_msi.
>  - Move the arch-specific part of the dump handler to x86/hvm/vmsi.c.
>  - Use trylock in the dump handler to get the vpci lock.
> 
> Changes since v2:
>  - Add an arch-specific abstraction layer. Note that this is only implemented
>    for x86 currently.
>  - Add a wrapper to detect MSI enabling for vPCI.
> ---
> NB: I've only been able to test this with devices using a single MSI
> interrupt and no mask register. I will try to find hardware that
> supports the mask register and more than one vector, but I cannot make
> any promises.
> 
> If there are doubts about the untested parts we could always force Xen
> to report no per-vector masking support and only 1 available vector,
> but I would rather avoid doing it.
> ---
>  xen/arch/x86/hvm/vmsi.c      | 142 +++++++++++++++++++
>  xen/arch/x86/msi.c           |   3 +
>  xen/drivers/vpci/Makefile    |   2 +-
>  xen/drivers/vpci/msi.c       | 324 +++++++++++++++++++++++++++++++++++++++++++
>  xen/drivers/vpci/vpci.c      |   1 +
>  xen/include/asm-x86/hvm/io.h |   5 +
>  xen/include/asm-x86/msi.h    |   3 +
>  xen/include/xen/irq.h        |   1 +
>  xen/include/xen/vpci.h       |  38 +++++
>  9 files changed, 518 insertions(+), 1 deletion(-)
>  create mode 100644 xen/drivers/vpci/msi.c
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 7126de7841..be59c56d43 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -31,6 +31,7 @@
>  #include <xen/errno.h>
>  #include <xen/sched.h>
>  #include <xen/irq.h>
> +#include <xen/vpci.h>
>  #include <public/hvm/ioreq.h>
>  #include <asm/hvm/io.h>
>  #include <asm/hvm/vpic.h>
> @@ -621,3 +622,144 @@ void msix_write_completion(struct vcpu *v)
>      if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
>          gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
>  }
> +
> +static unsigned int msi_gflags(uint16_t data, uint64_t addr, bool masked)
> +{
> +    /*
> +     * We need to use the DOMCTL constants here because the output of this
> +     * function is used as input to pt_irq_create_bind, which also takes the
> +     * input from the DOMCTL itself.
> +     */
> +    return MASK_INSR(MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
> +                     XEN_DOMCTL_VMSI_X86_DEST_ID_MASK) |
> +           MASK_INSR(MASK_EXTR(addr, MSI_ADDR_REDIRECTION_MASK),
> +                     XEN_DOMCTL_VMSI_X86_RH_MASK) |
> +           MASK_INSR(MASK_EXTR(addr, MSI_ADDR_DESTMODE_MASK),
> +                     XEN_DOMCTL_VMSI_X86_DM_MASK) |
> +           MASK_INSR(MASK_EXTR(data, MSI_DATA_DELIVERY_MODE_MASK),
> +                     XEN_DOMCTL_VMSI_X86_DELIV_MASK) |
> +           MASK_INSR(MASK_EXTR(data, MSI_DATA_TRIGGER_MASK),
> +                     XEN_DOMCTL_VMSI_X86_TRIG_MASK) |
> +           /* NB: by default MSI vectors are bound masked. */
> +           (masked ? 0 : XEN_DOMCTL_VMSI_X86_UNMASKED);
> +}
> +
> +void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
> +                        unsigned int entry, bool mask)
> +{
> +    unsigned long flags;
> +    struct irq_desc *desc = domain_spin_lock_irq_desc(pdev->domain,
> +                                                      msi->arch.pirq + entry,
> +                                                      &flags);
> +
> +    if ( !desc )
> +        return;
> +    guest_mask_msi_irq(desc, mask);
> +    spin_unlock_irqrestore(&desc->lock, flags);
> +}
> +
> +int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
> +                         unsigned int vectors)
> +{
> +    struct msi_info msi_info = {
> +        .seg = pdev->seg,
> +        .bus = pdev->bus,
> +        .devfn = pdev->devfn,
> +        .entry_nr = vectors,
> +    };
> +    unsigned int i;
> +    int rc;
> +
> +    ASSERT(msi->arch.pirq == INVALID_PIRQ);
> +
> +    /* Get a PIRQ. */
> +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &msi->arch.pirq,
> +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> +    if ( rc )
> +    {
> +        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                 PCI_FUNC(pdev->devfn), rc);
> +        return rc;
> +    }
> +
> +    for ( i = 0; i < vectors; i++ )
> +    {
> +        uint8_t vector = MASK_EXTR(msi->data, MSI_DATA_VECTOR_MASK);
> +        uint8_t vector_mask = 0xff >> (8 - fls(msi->vectors) + 1);
> +        struct xen_domctl_bind_pt_irq bind = {
> +            .machine_irq = msi->arch.pirq + i,
> +            .irq_type = PT_IRQ_TYPE_MSI,
> +            .u.msi.gvec = (vector & ~vector_mask) |
> +                          ((vector + i) & vector_mask),
> +            .u.msi.gflags = msi_gflags(msi->data, msi->address,
> +                                       (msi->mask >> i) & 1),
> +        };
> +
> +        pcidevs_lock();
> +        rc = pt_irq_create_bind(pdev->domain, &bind);
> +        if ( rc )
> +        {
> +            gdprintk(XENLOG_ERR,
> +                     "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
> +                     pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), msi->arch.pirq + i, rc);
> +            while ( bind.machine_irq-- )
> +                pt_irq_destroy_bind(pdev->domain, &bind);
> +            spin_lock(&pdev->domain->event_lock);
> +            unmap_domain_pirq(pdev->domain, msi->arch.pirq);
> +            spin_unlock(&pdev->domain->event_lock);
> +            pcidevs_unlock();
> +            msi->arch.pirq = INVALID_PIRQ;
> +            return rc;
> +        }
> +        pcidevs_unlock();
> +    }
> +
> +    return 0;
> +}
> +
> +void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
> +{
> +    unsigned int i;
> +
> +    ASSERT(msi->arch.pirq != INVALID_PIRQ);
> +
> +    pcidevs_lock();
> +    for ( i = 0; i < msi->vectors; i++ )
> +    {
> +        struct xen_domctl_bind_pt_irq bind = {
> +            .machine_irq = msi->arch.pirq + i,
> +            .irq_type = PT_IRQ_TYPE_MSI,
> +        };
> +        int rc;
> +
> +        rc = pt_irq_destroy_bind(pdev->domain, &bind);
> +        ASSERT(!rc);
> +    }
> +
> +    spin_lock(&pdev->domain->event_lock);
> +    unmap_domain_pirq(pdev->domain, msi->arch.pirq);
> +    spin_unlock(&pdev->domain->event_lock);
> +    pcidevs_unlock();
> +
> +    msi->arch.pirq = INVALID_PIRQ;
> +}
> +
> +void vpci_msi_arch_init(struct vpci_msi *msi)
> +{
> +    msi->arch.pirq = INVALID_PIRQ;
> +}
> +
> +void vpci_msi_arch_print(const struct vpci_msi *msi)
> +{
> +    printk("vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu pirq: %d\n",
> +           MASK_EXTR(msi->data, MSI_DATA_VECTOR_MASK),
> +           msi->data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
> +           msi->data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
> +           msi->data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
> +           msi->address & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
> +           msi->address & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "fixed",
> +           MASK_EXTR(msi->address, MSI_ADDR_DEST_ID_MASK),
> +           msi->arch.pirq);
> +}
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 8c89f072a8..5567990fbd 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -30,6 +30,7 @@
>  #include <public/physdev.h>
>  #include <xen/iommu.h>
>  #include <xsm/xsm.h>
> +#include <xen/vpci.h>
> 
>  static s8 __read_mostly use_msi = -1;
>  boolean_param("msi", use_msi);
> @@ -1527,6 +1528,8 @@ static void dump_msi(unsigned char key)
>                 attr.guest_masked ? 'G' : ' ',
>                 mask);
>      }
> +
> +    vpci_dump_msi();
>  }
> 
>  static int __init msi_setup_keyhandler(void)
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> index 241467212f..62cec9e82b 100644
> --- a/xen/drivers/vpci/Makefile
> +++ b/xen/drivers/vpci/Makefile
> @@ -1 +1 @@
> -obj-y += vpci.o header.o
> +obj-y += vpci.o header.o msi.o
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> new file mode 100644
> index 0000000000..c3c69ec453
> --- /dev/null
> +++ b/xen/drivers/vpci/msi.c
> @@ -0,0 +1,324 @@
> +/*
> + * Handlers for accesses to the MSI capability structure.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/softirq.h>
> +#include <xen/vpci.h>
> +
> +#include <asm/msi.h>
> +
> +static uint32_t control_read(const struct pci_dev *pdev, unsigned int reg,
> +                             void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    return MASK_INSR(fls(msi->max_vectors) - 1, PCI_MSI_FLAGS_QMASK) |
> +           MASK_INSR(fls(msi->vectors) - 1, PCI_MSI_FLAGS_QSIZE) |
> +           (msi->enabled ? PCI_MSI_FLAGS_ENABLE : 0) |
> +           (msi->masking ? PCI_MSI_FLAGS_MASKBIT : 0) |
> +           (msi->address64 ? PCI_MSI_FLAGS_64BIT : 0);
> +}
> +
> +static void control_write(const struct pci_dev *pdev, unsigned int reg,
> +                          uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +    unsigned int vectors = min_t(uint8_t,
> +                                 1u << MASK_EXTR(val, PCI_MSI_FLAGS_QSIZE),
> +                                 msi->max_vectors);
> +    bool new_enabled = val & PCI_MSI_FLAGS_ENABLE;
> +
> +    /*
> +     * No change if the enable field and the number of vectors is
> +     * the same or the device is not enabled, in which case the
> +     * vectors field can be updated directly.
> +     */
> +    if ( new_enabled == msi->enabled &&
> +         (vectors == msi->vectors || !msi->enabled) )
> +    {
> +        msi->vectors = vectors;
> +        return;
> +    }
> +
> +    if ( new_enabled )
> +    {
> +        /*
> +         * If the device is already enabled it means the number of
> +         * enabled messages has changed. Disable and re-enable the
> +         * device in order to apply the change.
> +         */
> +        if ( msi->enabled )
> +        {
> +            vpci_msi_arch_disable(msi, pdev);
> +            msi->enabled = false;
> +        }
> +
> +        if ( vpci_msi_arch_enable(msi, pdev, vectors) )
> +            return;
> +    }
> +    else
> +        vpci_msi_arch_disable(msi, pdev);
> +
> +    msi->vectors = vectors;
> +    msi->enabled = new_enabled;
> +
> +    pci_conf_write16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), reg,
> +                     control_read(pdev, reg, data));
> +}
> +
> +static void update_msi(const struct pci_dev *pdev, struct vpci_msi *msi)
> +{
> +    if ( !msi->enabled )
> +        return;
> +
> +    vpci_msi_arch_disable(msi, pdev);
> +    if ( vpci_msi_arch_enable(msi, pdev, msi->vectors) )
> +        msi->enabled = false;
> +}
> +
> +/* Handlers for the address field (32bit or low part of a 64bit address). */
> +static uint32_t address_read(const struct pci_dev *pdev, unsigned int reg,
> +                             void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    return msi->address;
> +}
> +
> +static void address_write(const struct pci_dev *pdev, unsigned int reg,
> +                          uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    /* Clear low part. */
> +    msi->address &= ~0xffffffffull;
> +    msi->address |= val;
> +
> +    update_msi(pdev, msi);
> +}
> +
> +/* Handlers for the high part of a 64bit address field. */
> +static uint32_t address_hi_read(const struct pci_dev *pdev, unsigned int reg,
> +                                void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    return msi->address >> 32;
> +}
> +
> +static void address_hi_write(const struct pci_dev *pdev, unsigned int reg,
> +                             uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    /* Clear and update high part. */
> +    msi->address &= 0xffffffff;
> +    msi->address |= (uint64_t)val << 32;
> +
> +    update_msi(pdev, msi);
> +}
> +
> +/* Handlers for the data field. */
> +static uint32_t data_read(const struct pci_dev *pdev, unsigned int reg,
> +                          void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    return msi->data;
> +}
> +
> +static void data_write(const struct pci_dev *pdev, unsigned int reg,
> +                       uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    msi->data = val;
> +
> +    update_msi(pdev, msi);
> +}
> +
> +/* Handlers for the MSI mask bits. */
> +static uint32_t mask_read(const struct pci_dev *pdev, unsigned int reg,
> +                          void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    return msi->mask;
> +}
> +
> +static void mask_write(const struct pci_dev *pdev, unsigned int reg,
> +                       uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +    uint32_t dmask = msi->mask ^ val;
> +
> +    if ( !dmask )
> +        return;
> +
> +    if ( msi->enabled )
> +    {
> +        unsigned int i;
> +
> +        for ( i = ffs(dmask) - 1; dmask && i < msi->vectors;
> +              i = ffs(dmask) - 1 )
> +        {
> +            vpci_msi_arch_mask(msi, pdev, i, (val >> i) & 1);
> +            __clear_bit(i, &dmask);
> +        }
> +    }
> +
> +    msi->mask = val;
> +}
> +
> +static int init_msi(struct pci_dev *pdev)
> +{
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    unsigned int pos = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
> +                                           PCI_CAP_ID_MSI);
> +    uint16_t control;
> +    int ret;
> +
> +    if ( !pos )
> +        return 0;
> +
> +    pdev->vpci->msi = xzalloc(struct vpci_msi);
> +    if ( !pdev->vpci->msi )
> +        return -ENOMEM;
> +
> +    ret = vpci_add_register(pdev->vpci, control_read, control_write,
> +                            msi_control_reg(pos), 2, pdev->vpci->msi);
> +    if ( ret )
> +        /*
> +         * NB: there's no need to free the msi struct or remove the register
> +         * handlers from the config space, the caller will take care of the
> +         * cleanup.
> +         */
> +        return ret;
> +
> +    /* Get the maximum number of vectors the device supports. */
> +    control = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
> +                              msi_control_reg(pos));
> +
> +    /*
> +     * FIXME: I've only been able to test this code with devices using a single
> +     * MSI interrupt and no mask register.
> +     */
> +    pdev->vpci->msi->max_vectors = multi_msi_capable(control);
> +    ASSERT(pdev->vpci->msi->max_vectors <= 32);
> +
> +    /* The multiple message enable is 0 after reset (1 message enabled). */
> +    pdev->vpci->msi->vectors = 1;
> +
> +    /* No PIRQ bound yet. */
> +    vpci_msi_arch_init(pdev->vpci->msi);
> +
> +    pdev->vpci->msi->address64 = is_64bit_address(control);
> +    pdev->vpci->msi->masking = is_mask_bit_support(control);
> +
> +    ret = vpci_add_register(pdev->vpci, address_read, address_write,
> +                            msi_lower_address_reg(pos), 4, pdev->vpci->msi);
> +    if ( ret )
> +        return ret;
> +
> +    ret = vpci_add_register(pdev->vpci, data_read, data_write,
> +                            msi_data_reg(pos, pdev->vpci->msi->address64), 2,
> +                            pdev->vpci->msi);
> +    if ( ret )
> +        return ret;
> +
> +    if ( pdev->vpci->msi->address64 )
> +    {
> +        ret = vpci_add_register(pdev->vpci, address_hi_read, address_hi_write,
> +                                msi_upper_address_reg(pos), 4, pdev->vpci->msi);
> +        if ( ret )
> +            return ret;
> +    }
> +
> +    if ( pdev->vpci->msi->masking )
> +    {
> +        ret = vpci_add_register(pdev->vpci, mask_read, mask_write,
> +                                msi_mask_bits_reg(pos,
> +                                                  pdev->vpci->msi->address64),
> +                                4, pdev->vpci->msi);
> +        if ( ret )
> +            return ret;
> +        /*
> +         * FIXME: do not add any handler for the pending bits for the hardware
> +         * domain, which means direct access. This will be revisited when
> +         * adding unprivileged domain support.
> +         */
> +    }
> +
> +    return 0;
> +}
> +REGISTER_VPCI_INIT(init_msi);
> +
> +void vpci_dump_msi(void)
> +{
> +    const struct domain *d;
> +
> +    rcu_read_lock(&domlist_read_lock);
> +    for_each_domain ( d )
> +    {
> +        const struct pci_dev *pdev;
> +
> +        if ( !has_vpci(d) )
> +            continue;
> +
> +        printk("vPCI MSI d%d\n", d->domain_id);
> +
> +        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
> +        {
> +            const struct vpci_msi *msi;
> +
> +            if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
> +                continue;
> +
> +            msi = pdev->vpci->msi;
> +            if ( msi && msi->enabled )
> +            {
> +                printk("%04x:%02x:%02x.%u MSI\n", pdev->seg, pdev->bus,
> +                       PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +
> +                printk("  enabled: %d 64-bit: %d",
> +                       msi->enabled, msi->address64);
> +                if ( msi->masking )
> +                    printk(" mask=%08x", msi->mask);
> +                printk(" vectors max: %u enabled: %u\n",
> +                       msi->max_vectors, msi->vectors);
> +
> +                vpci_msi_arch_print(msi);
> +            }
> +
> +            spin_unlock(&pdev->vpci->lock);
> +            process_pending_softirqs();
> +        }
> +    }
> +    rcu_read_unlock(&domlist_read_lock);
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index e5b49b9d82..3012b30013 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -47,6 +47,7 @@ void vpci_remove_device(struct pci_dev *pdev)
>          xfree(r);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    xfree(pdev->vpci->msi);
>      xfree(pdev->vpci);
>      pdev->vpci = NULL;
>  }
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 16465ceb30..0fedb3473c 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -127,6 +127,11 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_irq,
>  void msix_write_completion(struct vcpu *);
>  void msixtbl_init(struct domain *d);
> 
> +/* Arch-specific MSI data for vPCI. */
> +struct vpci_arch_msi {
> +    int pirq;
> +};
> +
>  enum stdvga_cache_state {
>      STDVGA_CACHE_UNINITIALIZED,
>      STDVGA_CACHE_ENABLED,
> diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
> index 37d37b820e..10387dce2e 100644
> --- a/xen/include/asm-x86/msi.h
> +++ b/xen/include/asm-x86/msi.h
> @@ -48,6 +48,7 @@
>  #define MSI_ADDR_REDIRECTION_SHIFT  3
>  #define MSI_ADDR_REDIRECTION_CPU    (0 << MSI_ADDR_REDIRECTION_SHIFT)
>  #define MSI_ADDR_REDIRECTION_LOWPRI (1 << MSI_ADDR_REDIRECTION_SHIFT)
> +#define MSI_ADDR_REDIRECTION_MASK   (1 << MSI_ADDR_REDIRECTION_SHIFT)
> 
>  #define MSI_ADDR_DEST_ID_SHIFT		12
>  #define	 MSI_ADDR_DEST_ID_MASK		0x00ff000
> @@ -152,6 +153,8 @@ int msi_free_irq(struct msi_desc *entry);
>  	( (is64bit == 1) ? base+PCI_MSI_DATA_64 : base+PCI_MSI_DATA_32 )
>  #define msi_mask_bits_reg(base, is64bit) \
>  	( (is64bit == 1) ? base+PCI_MSI_MASK_BIT : base+PCI_MSI_MASK_BIT-4)
> +#define msi_pending_bits_reg(base, is64bit) \
> +	((base) + PCI_MSI_MASK_BIT + ((is64bit) ? 4 : 0))
>  #define msi_disable(control)		control &= ~PCI_MSI_FLAGS_ENABLE
>  #define multi_msi_capable(control) \
>  	(1 << ((control & PCI_MSI_FLAGS_QMASK) >> 1))
> diff --git a/xen/include/xen/irq.h b/xen/include/xen/irq.h
> index 0aa817e266..586b78393a 100644
> --- a/xen/include/xen/irq.h
> +++ b/xen/include/xen/irq.h
> @@ -133,6 +133,7 @@ struct pirq {
>      struct arch_pirq arch;
>  };
> 
> +#define INVALID_PIRQ (-1)
>  #define pirq_info(d, p) ((struct pirq *)radix_tree_lookup(&(d)->pirq_tree, p))
> 
>  /* Use this instead of pirq_info() if the structure may need allocating. */
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 6bf8b22b4f..116b93f519 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -87,6 +87,30 @@ struct vpci {
>          /* FIXME: currently there's no support for SR-IOV. */
>      } header;
>  #endif
> +
> +    /* MSI data. */
> +    struct vpci_msi {
> +#ifdef __XEN__
> +      /* Address. */
> +        uint64_t address;
> +        /* Mask bitfield. */
> +        uint32_t mask;
> +        /* Data. */
> +        uint16_t data;
> +        /* Maximum number of vectors supported by the device. */
> +        uint8_t max_vectors : 5;
> +        /* Enabled? */
> +        bool enabled        : 1;
> +        /* Supports per-vector masking? */
> +        bool masking        : 1;
> +        /* 64-bit address capable? */
> +        bool address64      : 1;
> +        /* Number of vectors configured. */
> +        uint8_t vectors     : 5;
> +        /* Arch-specific data. */
> +        struct vpci_arch_msi arch;
> +#endif
> +    } *msi;
>  };
> 
>  struct vpci_vcpu {
> @@ -97,6 +121,20 @@ struct vpci_vcpu {
>      bool rom_only : 1;
>  };
> 
> +#ifdef __XEN__
> +void vpci_dump_msi(void);
> +
> +/* Arch-specific vPCI MSI helpers. */
> +void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
> +                        unsigned int entry, bool mask);
> +int __must_check vpci_msi_arch_enable(struct vpci_msi *msi,
> +                                      const struct pci_dev *pdev,
> +                                      unsigned int vectors);
> +void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev);
> +void vpci_msi_arch_init(struct vpci_msi *msi);
> +void vpci_msi_arch_print(const struct vpci_msi *msi);
> +#endif /* __XEN__ */
> +
>  #else /* !CONFIG_HAS_VPCI */
>  struct vpci_vcpu {};
>  #endif
> --
> 2.16.2


* Re: [PATCH v11 11/12] vpci/msix: add MSI-X handlers
  2018-03-20 15:15 ` [PATCH v11 11/12] vpci/msix: add MSI-X handlers Roger Pau Monne
@ 2018-03-21 12:36   ` Paul Durrant
  0 siblings, 0 replies; 23+ messages in thread
From: Paul Durrant @ 2018-03-21 12:36 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim (Xen.org),
	George Dunlap, Julien Grall, Jan Beulich, Ian Jackson,
	Boris Ostrovsky, Roger Pau Monne

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 20 March 2018 15:16
> To: xen-devel@lists.xenproject.org
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>; Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com>; Roger Pau Monne <roger.pau@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Julien
> Grall <julien.grall@arm.com>; Stefano Stabellini <sstabellini@kernel.org>;
> Tim (Xen.org) <tim@xen.org>; Wei Liu <wei.liu2@citrix.com>; Paul Durrant
> <Paul.Durrant@citrix.com>
> Subject: [PATCH v11 11/12] vpci/msix: add MSI-X handlers
> 
> Add handlers for accesses to the MSI-X message control field on the
> PCI configuration space, and traps for accesses to the memory region
> that contains the MSI-X table and PBA. These traps detect attempts from
> the guest to configure MSI-X interrupts and properly set them up.
> 
> Note that accesses to the Table Offset, Table BIR, PBA Offset and PBA
> BIR are not trapped by Xen at the moment.
> 
> Finally, turn the panic in the Dom0 PVH builder into a warning.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

io header changes...

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Julien Grall <julien.grall@arm.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Tim Deegan <tim@xen.org>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
> Changes since v10:
>  - Do not continue to print msix entries if the MSIX struct has
>    changed its address while processing softirqs.
>  - Use unsigned long to store the frame numbers in modify_bars.
>  - Use lu to print frame values in modify_bars.
> 
> Changes since v9:
>  - Unlock/lock when calling process_pending_softirqs.
>  - Change vpci_msix_arch_print to return int in order to signal
>    failure to continue after having processed softirqs.
>  - Use a power of 2 to do the modulo.
>  - Use PFN_DOWN in order to calculate the end of the MSI-X memory
>    areas for the rangeset.
> 
> Changes since v8:
>  - Call process_pending_softirqs between printing MSI-X entries.
>  - Free msix struct in vpci_add_handlers.
>  - Print only MSI or MSI-X if they are enabled.
>  - Fix comment in update_entry.
> 
> Changes since v7:
>  - Switch vpci.h macros to inline functions.
>  - Change vpci_msix_arch_print_entry into vpci_msix_arch_print and
>    make it print all the entries.
>  - Add a log message if rangeset_remove_range fails to remove the BAR
>    MSI-related range.
>  - Introduce a new update_entry to disable and enable a MSIX entry in
>    order to either update or set it up. This removes open coding it in
>    two different places.
>  - Unify access checks in access_allowed.
>  - Add newlines between switch cases.
>  - Expand max_entries to 12 bits.
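
A short sketch of why 12 bits are needed for max_entries: the MSI-X Table
Size field is 11 bits wide and encodes N - 1, so a device can expose up to
2048 entries, which does not fit in 11 bits once decoded. The PBA sizing
helper below assumes one pending bit per entry rounded up to whole 64-bit
words, as the PCI specification lays out the PBA; the rounding used by the
patch itself is not shown in this message. Both function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_FLAGS_QSIZE 0x07ff    /* Table Size field, encoded as N - 1 */

/* Number of table entries advertised by the message control word. */
static unsigned int msix_entries(uint16_t control)
{
    return (control & MSIX_FLAGS_QSIZE) + 1;
}

/* PBA size in bytes: one bit per entry, in whole 64-bit words. */
static unsigned int msix_pba_bytes(unsigned int entries)
{
    assert(entries >= 1 && entries <= 2048);

    return ((entries + 63) / 64) * 8;
}
```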
> 
> Changes since v6:
>  - Reduce the output of the debug keys.
>  - Fix comments and code to match in vpci_msix_control_write.
>  - Optimize size of the MSIX structure.
>  - Convert 'tables[]' to a uint32_t in order to reduce the size of
>    vpci_msix. Introduce some macros to make it easier to get the MSIX
>    tables related data.
>  - Limit size of the bool fields to 1 bit.
>  - Remove the 'nr' field of vpci_msix_entry. The position can be
>    calculated from the base of the entries array.
>  - Drop the 'vpci_' prefix from the functions in msix.c, they are all
>    static.
>  - Remove the val local variable in control_read.
>  - Initialize new_masked and new_enabled at declaration.
>  - Recalculate the msix control value before writing it.
>  - Remove the seg and bus local variables and use pdev->seg and
>    pdev->bus instead.
>  - Initialize msix at declaration in msix_{write/read}.
>  - Add the must_check attribute to
>    vpci_msix_arch_{enable/disable}_entry.
> 
> Changes since v5:
>  - Update lock usage.
>  - Unbind/unmap PIRQs when MSIX is disabled.
>  - Share the arch-specific MSIX code with the MSI functions.
>  - Do not reference the MSIX memory areas from the PCI BARs fields,
>    instead fetch the BIR and offset each time needed.
>  - Add the '_entry' suffix to the MSIX arch functions.
>  - Prefix the vMSIX macros with 'V'.
>  - s/gdprintk/gprintk/ in msix.c
>  - Make vpci_msix_access_check return bool, and change its name to
>    vpci_msix_access_allowed.
>  - Join the first two ifs in vpci_msix_{read/write} into a single one.
>  - Allow Dom0 to write to the PBA area.
>  - Add a note that reads from the PBA area will need to be translated
>    if the PBA is not identity mapped.
> 
> Changes since v4:
>  - Remove parentheses around offsetof.
>  - Add "being" to MSI-X enabling comment.
>  - Use INVALID_PIRQ.
>  - Add a simple sanity check to vpci_msix_arch_enable in order to
>    detect wrong MSI-X entries more quickly.
>  - Constify vpci_msix_arch_print entry argument.
>  - s/cpu/fixed/ in vpci_msix_arch_print.
>  - Dump the MSI-X info together with the MSI info.
>  - Fix vpci_msix_control_write to take into account changes to the
>    address and data fields when switching the function mask bit.
>  - Only disable/enable the entries if the address or data fields have
>    been updated.
>  - Use the BAR enable field to check if a BAR is mapped or not
>    (instead of reading the command register for each device).
>  - Fix error path in vpci_msix_read to set the return data to ~0.
>  - Simplify mask usage in vpci_msix_write.
>  - Cast data to uint64_t when shifting it 32 bits.
>  - Fix writes to the table entry control register to take into account
>    if the mask-all bit is set.
>  - Add some comments to clarify the intended behavior of the code.
>  - Align the PBA size to 64-bits.
>  - Remove the error label in vpci_init_msix.
>  - Try to compact the layout of the vpci_msix structure.
>  - Remove the local table_bar and pba_bar variables from
>    vpci_init_msix, they are used only once.
> 
> Changes since v3:
>  - Propagate changes from previous versions: remove xen_ prefix, use
>    the new fields in vpci_val and remove the return value from
>    handlers.
>  - Remove the usage of GENMASK.
>  - Move the arch-specific parts of the dump routine to the
>    x86/hvm/vmsi.c dump handler.
>  - Chain the MSI-X dump handler to the 'M' debug key.
>  - Fix the header BAR mappings so that the MSI-X regions inside of
>    BARs are unmapped from the domain p2m in order for the handlers to
>    work properly.
>  - Unconditionally trap and forward accesses to the PBA MSI-X area.
>  - Simplify the conditionals in vpci_msix_control_write.
>  - Fix vpci_msix_accept to use a bool type.
>  - Allow all supported accesses as described in the spec to the MSI-X
>    table.
>  - Truncate the returned address when the access is a 32b read.
>  - Always return X86EMUL_OKAY from the handlers, returning ~0 in the
>    read case if the access is not supported, or ignoring writes.
>  - Do not check that max_entries is != 0 in the init handler.
>  - Use trylock in the dump handler.
> 
> Changes since v2:
>  - Split out arch-specific code.
> 
> This patch has been tested with devices using both a single MSI-X
> entry and multiple ones.
> ---
>  xen/arch/x86/hvm/dom0_build.c    |   2 +-
>  xen/arch/x86/hvm/hvm.c           |   1 +
>  xen/arch/x86/hvm/vmsi.c          | 160 +++++++++++---
>  xen/drivers/vpci/Makefile        |   2 +-
>  xen/drivers/vpci/header.c        |  19 ++
>  xen/drivers/vpci/msi.c           |  27 ++-
>  xen/drivers/vpci/msix.c          | 458 +++++++++++++++++++++++++++++++++++++++
>  xen/drivers/vpci/vpci.c          |   1 +
>  xen/include/asm-x86/hvm/domain.h |   3 +
>  xen/include/asm-x86/hvm/io.h     |   5 +
>  xen/include/xen/vpci.h           |  73 +++++++
>  11 files changed, 720 insertions(+), 31 deletions(-)
>  create mode 100644 xen/drivers/vpci/msix.c
> 
> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index 259814d95d..d3f65eadbe 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -1117,7 +1117,7 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
> 
>      pvh_setup_mmcfg(d);
> 
> -    panic("Building a PVHv2 Dom0 is not yet supported.");
> +    printk("WARNING: PVH is an experimental mode with limited functionality\n");
>      return 0;
>  }
> 
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 0afb651b7f..7660ea704a 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -588,6 +588,7 @@ int hvm_domain_initialise(struct domain *d)
>      INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
>      INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
>      INIT_LIST_HEAD(&d->arch.hvm_domain.mmcfg_regions);
> +    INIT_LIST_HEAD(&d->arch.hvm_domain.msix_tables);
> 
>      rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
>      if ( rc )
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index be59c56d43..c31d27c389 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -30,6 +30,7 @@
>  #include <xen/lib.h>
>  #include <xen/errno.h>
>  #include <xen/sched.h>
> +#include <xen/softirq.h>
>  #include <xen/irq.h>
>  #include <xen/vpci.h>
>  #include <public/hvm/ioreq.h>
> @@ -644,13 +645,10 @@ static unsigned int msi_gflags(uint16_t data, uint64_t addr, bool masked)
>             (masked ? 0 : XEN_DOMCTL_VMSI_X86_UNMASKED);
>  }
> 
> -void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
> -                        unsigned int entry, bool mask)
> +static void vpci_mask_pirq(struct domain *d, int pirq, bool mask)
>  {
>      unsigned long flags;
> -    struct irq_desc *desc = domain_spin_lock_irq_desc(pdev->domain,
> -                                                      msi->arch.pirq + entry,
> -                                                      &flags);
> +    struct irq_desc *desc = domain_spin_lock_irq_desc(d, pirq, &flags);
> 
>      if ( !desc )
>          return;
> @@ -658,23 +656,31 @@ void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
>      spin_unlock_irqrestore(&desc->lock, flags);
>  }
> 
> -int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
> -                         unsigned int vectors)
> +void vpci_msi_arch_mask(struct vpci_msi *msi, const struct pci_dev *pdev,
> +                        unsigned int entry, bool mask)
> +{
> +    vpci_mask_pirq(pdev->domain, msi->arch.pirq + entry, mask);
> +}
> +
> +static int vpci_msi_enable(const struct pci_dev *pdev, uint32_t data,
> +                           uint64_t address, unsigned int nr,
> +                           paddr_t table_base, uint32_t mask)
>  {
>      struct msi_info msi_info = {
>          .seg = pdev->seg,
>          .bus = pdev->bus,
>          .devfn = pdev->devfn,
> -        .entry_nr = vectors,
> +        .table_base = table_base,
> +        .entry_nr = nr,
>      };
> -    unsigned int i;
> -    int rc;
> -
> -    ASSERT(msi->arch.pirq == INVALID_PIRQ);
> +    unsigned int i, vectors = table_base ? 1 : nr;
> +    int rc, pirq = INVALID_PIRQ;
> 
>      /* Get a PIRQ. */
> -    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &msi->arch.pirq,
> -                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &pirq,
> +                                   table_base ? MAP_PIRQ_TYPE_MSI
> +                                              : MAP_PIRQ_TYPE_MULTI_MSI,
> +                                   &msi_info);
>      if ( rc )
>      {
>          gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> @@ -685,15 +691,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
> 
>      for ( i = 0; i < vectors; i++ )
>      {
> -        uint8_t vector = MASK_EXTR(msi->data, MSI_DATA_VECTOR_MASK);
> -        uint8_t vector_mask = 0xff >> (8 - fls(msi->vectors) + 1);
> +        uint8_t vector = MASK_EXTR(data, MSI_DATA_VECTOR_MASK);
> +        uint8_t vector_mask = 0xff >> (8 - fls(vectors) + 1);
>          struct xen_domctl_bind_pt_irq bind = {
> -            .machine_irq = msi->arch.pirq + i,
> +            .machine_irq = pirq + i,
>              .irq_type = PT_IRQ_TYPE_MSI,
>              .u.msi.gvec = (vector & ~vector_mask) |
>                            ((vector + i) & vector_mask),
> -            .u.msi.gflags = msi_gflags(msi->data, msi->address,
> -                                       (msi->mask >> i) & 1),
> +            .u.msi.gflags = msi_gflags(data, address, (mask >> i) & 1),
>          };
> 
>          pcidevs_lock();
> @@ -703,33 +708,49 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
>              gdprintk(XENLOG_ERR,
>                       "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
>                       pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> -                     PCI_FUNC(pdev->devfn), msi->arch.pirq + i, rc);
> +                     PCI_FUNC(pdev->devfn), pirq + i, rc);
>              while ( bind.machine_irq-- )
>                  pt_irq_destroy_bind(pdev->domain, &bind);
>              spin_lock(&pdev->domain->event_lock);
> -            unmap_domain_pirq(pdev->domain, msi->arch.pirq);
> +            unmap_domain_pirq(pdev->domain, pirq);
>              spin_unlock(&pdev->domain->event_lock);
>              pcidevs_unlock();
> -            msi->arch.pirq = INVALID_PIRQ;
>              return rc;
>          }
>          pcidevs_unlock();
>      }
> 
> -    return 0;
> +    return pirq;
>  }
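
For readers following the guest vector calculation in the loop above, here is a minimal standalone sketch of the same arithmetic. The names fls_sketch and guest_vector are hypothetical and exist only for this illustration; fls_sketch stands in for Xen's fls(), and the sketch assumes a power-of-two vector count as multi-vector MSI requires.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for x == 0. */
static unsigned int fls_sketch(unsigned int x)
{
    unsigned int r = 0;

    while ( x )
    {
        r++;
        x >>= 1;
    }

    return r;
}

/*
 * Guest vector for the i-th message of a multi-vector MSI group: only the
 * low log2(vectors) bits of the base vector are replaced by the message
 * number, the upper bits are kept as programmed by the guest.
 */
static uint8_t guest_vector(uint8_t vector, unsigned int vectors,
                            unsigned int i)
{
    uint8_t mask;

    /* Assumption: 1..32 vectors, power of two, and i within the group. */
    assert(vectors >= 1 && vectors <= 32 && !(vectors & (vectors - 1)));
    assert(i < vectors);

    mask = 0xff >> (8 - fls_sketch(vectors) + 1);

    return (uint8_t)((vector & ~mask) | ((vector + i) & mask));
}
```

With a single vector the mask is 0, so the programmed vector is used unchanged, which matches the MSI-X case where vectors is forced to 1.
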
> 
> -void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
> +int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
> +                         unsigned int vectors)
> +{
> +    int rc;
> +
> +    ASSERT(msi->arch.pirq == INVALID_PIRQ);
> +    rc = vpci_msi_enable(pdev, msi->data, msi->address, vectors, 0, msi->mask);
> +    if ( rc >= 0 )
> +    {
> +        msi->arch.pirq = rc;
> +        rc = 0;
> +    }
> +
> +    return rc;
> +}
> +
> +static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
> +                             unsigned int nr)
>  {
>      unsigned int i;
> 
> -    ASSERT(msi->arch.pirq != INVALID_PIRQ);
> +    ASSERT(pirq != INVALID_PIRQ);
> 
>      pcidevs_lock();
> -    for ( i = 0; i < msi->vectors; i++ )
> +    for ( i = 0; i < nr; i++ )
>      {
>          struct xen_domctl_bind_pt_irq bind = {
> -            .machine_irq = msi->arch.pirq + i,
> +            .machine_irq = pirq + i,
>              .irq_type = PT_IRQ_TYPE_MSI,
>          };
>          int rc;
> @@ -739,10 +760,14 @@ void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
>      }
> 
>      spin_lock(&pdev->domain->event_lock);
> -    unmap_domain_pirq(pdev->domain, msi->arch.pirq);
> +    unmap_domain_pirq(pdev->domain, pirq);
>      spin_unlock(&pdev->domain->event_lock);
>      pcidevs_unlock();
> +}
> 
> +void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
> +{
> +    vpci_msi_disable(pdev, msi->arch.pirq, msi->vectors);
>      msi->arch.pirq = INVALID_PIRQ;
>  }
> 
> @@ -763,3 +788,82 @@ void vpci_msi_arch_print(const struct vpci_msi *msi)
>             MASK_EXTR(msi->address, MSI_ADDR_DEST_ID_MASK),
>             msi->arch.pirq);
>  }
> +
> +void vpci_msix_arch_mask_entry(struct vpci_msix_entry *entry,
> +                               const struct pci_dev *pdev, bool mask)
> +{
> +    ASSERT(entry->arch.pirq != INVALID_PIRQ);
> +    vpci_mask_pirq(pdev->domain, entry->arch.pirq, mask);
> +}
> +
> +int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
> +                                const struct pci_dev *pdev, paddr_t table_base)
> +{
> +    int rc;
> +
> +    ASSERT(entry->arch.pirq == INVALID_PIRQ);
> +    rc = vpci_msi_enable(pdev, entry->data, entry->addr,
> +                         vmsix_entry_nr(pdev->vpci->msix, entry),
> +                         table_base, entry->masked);
> +    if ( rc >= 0 )
> +    {
> +        entry->arch.pirq = rc;
> +        rc = 0;
> +    }
> +
> +    return rc;
> +}
> +
> +int vpci_msix_arch_disable_entry(struct vpci_msix_entry *entry,
> +                                 const struct pci_dev *pdev)
> +{
> +    if ( entry->arch.pirq == INVALID_PIRQ )
> +        return -ENOENT;
> +
> +    vpci_msi_disable(pdev, entry->arch.pirq, 1);
> +    entry->arch.pirq = INVALID_PIRQ;
> +
> +    return 0;
> +}
> +
> +void vpci_msix_arch_init_entry(struct vpci_msix_entry *entry)
> +{
> +    entry->arch.pirq = INVALID_PIRQ;
> +}
> +
> +int vpci_msix_arch_print(const struct vpci_msix *msix)
> +{
> +    unsigned int i;
> +
> +    for ( i = 0; i < msix->max_entries; i++ )
> +    {
> +        const struct vpci_msix_entry *entry = &msix->entries[i];
> +
> +        printk("%6u vec=%02x%7s%6s%3sassert%5s%7s dest_id=%lu mask=%u pirq: %d\n",
> +               i, MASK_EXTR(entry->data, MSI_DATA_VECTOR_MASK),
> +               entry->data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
> +               entry->data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
> +               entry->data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
> +               entry->addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
> +               entry->addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "fixed",
> +               MASK_EXTR(entry->addr, MSI_ADDR_DEST_ID_MASK),
> +               entry->masked, entry->arch.pirq);
> +        if ( i && !(i % 64) )
> +        {
> +            struct pci_dev *pdev = msix->pdev;
> +
> +            spin_unlock(&msix->pdev->vpci->lock);
> +            process_pending_softirqs();
> +            /* NB: we assume that pdev cannot go away for an alive domain. */
> +            if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
> +                return -EBUSY;
> +            if ( pdev->vpci->msix != msix )
> +            {
> +                spin_unlock(&pdev->vpci->lock);
> +                return -EAGAIN;
> +            }
> +        }
> +    }
> +
> +    return 0;
> +}
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> index 62cec9e82b..55d1bdfda0 100644
> --- a/xen/drivers/vpci/Makefile
> +++ b/xen/drivers/vpci/Makefile
> @@ -1 +1 @@
> -obj-y += vpci.o header.o msi.o
> +obj-y += vpci.o header.o msi.o msix.o
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 8d9d6f43f3..271e4667dc 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -190,6 +190,7 @@ static int modify_bars(const struct pci_dev *pdev, bool map, bool rom_only)
>      struct vpci_header *header = &pdev->vpci->header;
>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>      struct pci_dev *tmp, *dev = NULL;
> +    const struct vpci_msix *msix = pdev->vpci->msix;
>      unsigned int i;
>      int rc;
> 
> @@ -226,6 +227,24 @@ static int modify_bars(const struct pci_dev *pdev, bool map, bool rom_only)
>          }
>      }
> 
> +    /* Remove any MSIX regions if present. */
> +    for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
> +    {
> +        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
> +        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
> +                                     vmsix_table_size(pdev->vpci, i) - 1);
> +
> +        rc = rangeset_remove_range(mem, start, end);
> +        if ( rc )
> +        {
> +            printk(XENLOG_G_WARNING
> +                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
> +                   start, end, rc);
> +            rangeset_destroy(mem);
> +            return rc;
> +        }
> +    }
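
The page frame range punched out of the BAR rangeset above can be sketched in isolation as follows. This assumes 4K pages; SKETCH_PAGE_SHIFT, struct pfn_range and region_to_pfns are hypothetical names used only for this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4K pages */

struct pfn_range {
    uint64_t start; /* first page frame touched by the region */
    uint64_t end;   /* last page frame touched by the region (inclusive) */
};

/* Convert a byte region [addr, addr + size) into an inclusive PFN range. */
static struct pfn_range region_to_pfns(uint64_t addr, uint64_t size)
{
    struct pfn_range r;

    assert(size > 0);

    r.start = addr >> SKETCH_PAGE_SHIFT;
    r.end = (addr + size - 1) >> SKETCH_PAGE_SHIFT;

    return r;
}
```

Note that the range is page granular: a small table that shares a page with other BAR contents takes the whole page out of the p2m, so accesses to the rest of that page also end up trapping.
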
> +
>      /*
>       * Check for overlaps with other BARs. Note that only BARs that are
>       * currently mapped (enabled) are checked for overlaps.
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index de4ddf562e..ad26c38a92 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -281,11 +281,12 @@ void vpci_dump_msi(void)
>          if ( !has_vpci(d) )
>              continue;
> 
> -        printk("vPCI MSI d%d\n", d->domain_id);
> +        printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
> 
>          list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
>          {
>              const struct vpci_msi *msi;
> +            const struct vpci_msix *msix;
> 
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  continue;
> @@ -306,6 +307,30 @@ void vpci_dump_msi(void)
>                  vpci_msi_arch_print(msi);
>              }
> 
> +            msix = pdev->vpci->msix;
> +            if ( msix && msix->enabled )
> +            {
> +                int rc;
> +
> +                printk("%04x:%02x:%02x.%u MSI-X\n", pdev->seg, pdev->bus,
> +                       PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +
> +                printk("  entries: %u maskall: %d enabled: %d\n",
> +                       msix->max_entries, msix->masked, msix->enabled);
> +
> +                rc = vpci_msix_arch_print(msix);
> +                if ( rc )
> +                {
> +                    /*
> +                     * On error vpci_msix_arch_print will always return without
> +                     * holding the lock.
> +                     */
> +                    printk("unable to print all MSI-X entries: %d\n", rc);
> +                    process_pending_softirqs();
> +                    continue;
> +                }
> +            }
> +
>              spin_unlock(&pdev->vpci->lock);
>              process_pending_softirqs();
>          }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> new file mode 100644
> index 0000000000..3b378c2e51
> --- /dev/null
> +++ b/xen/drivers/vpci/msix.c
> @@ -0,0 +1,458 @@
> +/*
> + * Handlers for accesses to the MSI-X capability structure and the memory
> + * region.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +
> +#include <asm/msi.h>
> +
> +#define VMSIX_SIZE(num) offsetof(struct vpci_msix, entries[num])
> +
> +#define VMSIX_ADDR_IN_RANGE(addr, vpci, nr)                               \
> +    ((addr) >= vmsix_table_addr(vpci, nr) &&                              \
> +     (addr) < vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr))
> +
> +static uint32_t control_read(const struct pci_dev *pdev, unsigned int reg,
> +                             void *data)
> +{
> +    const struct vpci_msix *msix = data;
> +
> +    return (msix->max_entries - 1) |
> +           (msix->enabled ? PCI_MSIX_FLAGS_ENABLE : 0) |
> +           (msix->masked ? PCI_MSIX_FLAGS_MASKALL : 0);
> +}
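
As a reference for the value control_read builds, here is a small sketch of the MSI-X Message Control layout: table size minus one in bits 10:0, function mask in bit 14 and enable in bit 15. The SKETCH_* constants and helper names are made up for this example; they mirror PCI_MSIX_FLAGS_QSIZE, PCI_MSIX_FLAGS_MASKALL and PCI_MSIX_FLAGS_ENABLE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MSIX_QSIZE   0x07ffu /* bits 10:0: table size, encoded N - 1 */
#define SKETCH_MSIX_MASKALL 0x4000u /* bit 14: function mask */
#define SKETCH_MSIX_ENABLE  0x8000u /* bit 15: MSI-X enable */

/* Build a Message Control value from the emulated state. */
static uint16_t msix_control_encode(unsigned int entries, bool enabled,
                                    bool masked)
{
    /* The 11-bit size field allows between 1 and 2048 entries. */
    assert(entries >= 1 && entries <= 2048);

    return (uint16_t)(((entries - 1) & SKETCH_MSIX_QSIZE) |
                      (enabled ? SKETCH_MSIX_ENABLE : 0) |
                      (masked ? SKETCH_MSIX_MASKALL : 0));
}

/* Recover the number of table entries from a Message Control value. */
static unsigned int msix_control_entries(uint16_t control)
{
    return (control & SKETCH_MSIX_QSIZE) + 1;
}
```
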
> +
> +static int update_entry(struct vpci_msix_entry *entry,
> +                        const struct pci_dev *pdev, unsigned int nr)
> +{
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    int rc = vpci_msix_arch_disable_entry(entry, pdev);
> +
> +    /* Ignore ENOENT, it means the entry wasn't setup. */
> +    if ( rc && rc != -ENOENT )
> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%04x:%02x:%02x.%u: unable to disable entry %u for update: %d\n",
> +                pdev->seg, pdev->bus, slot, func, nr, rc);
> +        return rc;
> +    }
> +
> +    rc = vpci_msix_arch_enable_entry(entry, pdev,
> +                                     vmsix_table_base(pdev->vpci,
> +                                                      VPCI_MSIX_TABLE));
> +    if ( rc )
> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%04x:%02x:%02x.%u: unable to enable entry %u: %d\n",
> +                pdev->seg, pdev->bus, slot, func, nr, rc);
> +        /* Entry is likely not properly configured. */
> +        return rc;
> +    }
> +
> +    return 0;
> +}
> +
> +static void control_write(const struct pci_dev *pdev, unsigned int reg,
> +                          uint32_t val, void *data)
> +{
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    struct vpci_msix *msix = data;
> +    bool new_masked = val & PCI_MSIX_FLAGS_MASKALL;
> +    bool new_enabled = val & PCI_MSIX_FLAGS_ENABLE;
> +    unsigned int i;
> +    int rc;
> +
> +    if ( new_masked == msix->masked && new_enabled == msix->enabled )
> +        return;
> +
> +    /*
> +     * According to the PCI 3.0 specification, switching the enable bit to 1
> +     * or the function mask bit to 0 should cause all the cached addresses
> +     * and data fields to be recalculated.
> +     *
> +     * In order to avoid the overhead of disabling and enabling all the
> +     * entries every time the guest sets the maskall bit, Xen will only
> +     * perform the disable and enable sequence when the guest has written to
> +     * the entry.
> +     */
> +    if ( new_enabled && !new_masked && (!msix->enabled || msix->masked) )
> +    {
> +        for ( i = 0; i < msix->max_entries; i++ )
> +        {
> +            if ( msix->entries[i].masked || !msix->entries[i].updated ||
> +                 update_entry(&msix->entries[i], pdev, i) )
> +                continue;
> +
> +            msix->entries[i].updated = false;
> +        }
> +    }
> +    else if ( !new_enabled && msix->enabled )
> +    {
> +        /* Guest has disabled MSIX, disable all entries. */
> +        for ( i = 0; i < msix->max_entries; i++ )
> +        {
> +            /*
> +             * NB: vpci_msix_arch_disable_entry can be called for entries
> +             * that are not setup, it will return -ENOENT in that case.
> +             */
> +            rc = vpci_msix_arch_disable_entry(&msix->entries[i], pdev);
> +            switch ( rc )
> +            {
> +            case 0:
> +                /*
> +                 * Mark the entry successfully disabled as updated, so that on
> +                 * the next enable the entry is properly setup. This is done
> +                 * so that the following flow works correctly:
> +                 *
> +                 * mask entry -> disable MSIX -> enable MSIX -> unmask entry
> +                 *
> +                 * Without setting 'updated', the 'unmask entry' step will fail
> +                 * because the entry has not been updated, so it would not be
> +                 * mapped/bound at all.
> +                 */
> +                msix->entries[i].updated = true;
> +                break;
> +            case -ENOENT:
> +                /* Ignore non-present entry. */
> +                break;
> +            default:
> +                gprintk(XENLOG_WARNING,
> +                        "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
> +                        pdev->seg, pdev->bus, slot, func, i, rc);
> +                return;
> +            }
> +        }
> +    }
> +
> +    msix->masked = new_masked;
> +    msix->enabled = new_enabled;
> +
> +    val = control_read(pdev, reg, data);
> +    if ( pci_msi_conf_write_intercept(msix->pdev, reg, 2, &val) >= 0 )
> +        pci_conf_write16(pdev->seg, pdev->bus, slot, func, reg, val);
> +}
> +
> +static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
> +{
> +    struct vpci_msix *msix;
> +
> +    list_for_each_entry ( msix, &d->arch.hvm_domain.msix_tables, next )
> +    {
> +        const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> +        unsigned int i;
> +
> +        for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
> +            if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
> +                 VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
> +                return msix;
> +    }
> +
> +    return NULL;
> +}
> +
> +static int msix_accept(struct vcpu *v, unsigned long addr)
> +{
> +    return !!msix_find(v->domain, addr);
> +}
> +
> +static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> +                           unsigned int len)
> +{
> +    /* Only allow aligned 32/64b accesses. */
> +    if ( (len == 4 || len == 8) && !(addr & (len - 1)) )
> +        return true;
> +
> +    gprintk(XENLOG_WARNING,
> +            "%04x:%02x:%02x.%u: unaligned or invalid size MSI-X table access\n",
> +            pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +
> +    return false;
> +}
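
The size and alignment rule enforced by access_allowed can be tried out in isolation with the sketch below. The helper name access_ok is hypothetical and only used for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Accept only naturally aligned 32-bit or 64-bit accesses. For a
 * power-of-two len, (addr & (len - 1)) is non-zero exactly when addr is
 * not a multiple of len.
 */
static bool access_ok(uint64_t addr, unsigned int len)
{
    return (len == 4 || len == 8) && !(addr & (len - 1));
}
```

A 64-bit access at offset 4 of an entry is rejected by this rule, so an accepted access never straddles two 8-byte halves of a table entry.
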
> +
> +static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
> +                                         paddr_t addr)
> +{
> +    paddr_t start = vmsix_table_addr(msix->pdev->vpci, VPCI_MSIX_TABLE);
> +
> +    return &msix->entries[(addr - start) / PCI_MSIX_ENTRY_SIZE];
> +}
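
A minimal sketch of how a trapped address maps to a table entry and a register inside it, assuming the standard 16-byte entry layout (lower address at 0, upper address at 4, data at 8, vector control at 12). struct entry_ref and decode_table_addr are hypothetical names. Unlike the handlers in the patch, which use addr & (PCI_MSIX_ENTRY_SIZE - 1), the sketch takes the register offset relative to the table start; the two agree whenever the table start is 16-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_ENTRY_SIZE 16u /* bytes per MSI-X table entry */

struct entry_ref {
    unsigned int index;  /* which table entry the address falls in */
    unsigned int offset; /* byte offset of the register inside the entry */
};

/* Split a table address into entry index and in-entry register offset. */
static struct entry_ref decode_table_addr(uint64_t table_start, uint64_t addr)
{
    struct entry_ref r;

    assert(addr >= table_start);

    r.index = (unsigned int)((addr - table_start) / SKETCH_ENTRY_SIZE);
    r.offset = (unsigned int)((addr - table_start) % SKETCH_ENTRY_SIZE);

    return r;
}
```
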
> +
> +static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
> +                     unsigned long *data)
> +{
> +    const struct domain *d = v->domain;
> +    struct vpci_msix *msix = msix_find(d, addr);
> +    const struct vpci_msix_entry *entry;
> +    unsigned int offset;
> +
> +    *data = ~0ul;
> +
> +    if ( !msix )
> +        return X86EMUL_RETRY;
> +
> +    if ( !access_allowed(msix->pdev, addr, len) )
> +        return X86EMUL_OKAY;
> +
> +    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
> +    {
> +        /*
> +         * Access to PBA.
> +         *
> +         * TODO: note that this relies on having the PBA identity mapped to the
> +         * guest address space. If this changes the address will need to be
> +         * translated.
> +         */
> +        switch ( len )
> +        {
> +        case 4:
> +            *data = readl(addr);
> +            break;
> +
> +        case 8:
> +            *data = readq(addr);
> +            break;
> +
> +        default:
> +            ASSERT_UNREACHABLE();
> +            break;
> +        }
> +
> +        return X86EMUL_OKAY;
> +    }
> +
> +    spin_lock(&msix->pdev->vpci->lock);
> +    entry = get_entry(msix, addr);
> +    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
> +
> +    switch ( offset )
> +    {
> +    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
> +        *data = entry->addr;
> +        break;
> +
> +    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
> +        *data = entry->addr >> 32;
> +        break;
> +
> +    case PCI_MSIX_ENTRY_DATA_OFFSET:
> +        *data = entry->data;
> +        if ( len == 8 )
> +            *data |=
> +                (uint64_t)(entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0) << 32;
> +        break;
> +
> +    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
> +        *data = entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0;
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +        break;
> +    }
> +    spin_unlock(&msix->pdev->vpci->lock);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int msix_write(struct vcpu *v, unsigned long addr, unsigned int len,
> +                      unsigned long data)
> +{
> +    const struct domain *d = v->domain;
> +    struct vpci_msix *msix = msix_find(d, addr);
> +    struct vpci_msix_entry *entry;
> +    unsigned int offset;
> +
> +    if ( !msix )
> +        return X86EMUL_RETRY;
> +
> +    if ( !access_allowed(msix->pdev, addr, len) )
> +        return X86EMUL_OKAY;
> +
> +    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
> +    {
> +        /* Ignore writes to PBA for DomUs, its behavior is undefined. */
> +        if ( is_hardware_domain(d) )
> +        {
> +            switch ( len )
> +            {
> +            case 4:
> +                writel(data, addr);
> +                break;
> +
> +            case 8:
> +                writeq(data, addr);
> +                break;
> +
> +            default:
> +                ASSERT_UNREACHABLE();
> +                break;
> +            }
> +        }
> +
> +        return X86EMUL_OKAY;
> +    }
> +
> +    spin_lock(&msix->pdev->vpci->lock);
> +    entry = get_entry(msix, addr);
> +    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
> +
> +    /*
> +     * NB: Xen allows writes to the data/address registers with the entry
> +     * unmasked. The specification says this is undefined behavior, and Xen
> +     * implements it as storing the written value, which will be made effective
> +     * in the next mask/unmask cycle. This also mimics the implementation in
> +     * QEMU.
> +     */
> +    switch ( offset )
> +    {
> +    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
> +        entry->updated = true;
> +        if ( len == 8 )
> +        {
> +            entry->addr = data;
> +            break;
> +        }
> +        entry->addr &= ~0xffffffffull;
> +        entry->addr |= data;
> +        break;
> +
> +    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
> +        entry->updated = true;
> +        entry->addr &= 0xffffffff;
> +        entry->addr |= (uint64_t)data << 32;
> +        break;
> +
> +    case PCI_MSIX_ENTRY_DATA_OFFSET:
> +        entry->updated = true;
> +        entry->data = data;
> +
> +        if ( len == 4 )
> +            break;
> +
> +        data >>= 32;
> +        /* fallthrough */
> +    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
> +    {
> +        bool new_masked = data & PCI_MSIX_VECTOR_BITMASK;
> +        const struct pci_dev *pdev = msix->pdev;
> +
> +        if ( entry->masked == new_masked )
> +            /* No change in the mask bit, nothing to do. */
> +            break;
> +
> +        /*
> +         * Update the masked state before calling vpci_msix_arch_enable_entry,
> +         * so that it picks the new state.
> +         */
> +        entry->masked = new_masked;
> +        if ( !new_masked && msix->enabled && !msix->masked && entry->updated )
> +        {
> +            /*
> +             * If MSI-X is enabled, the function mask is not active, the entry
> +             * is being unmasked and there have been changes to the address or
> +             * data fields Xen needs to disable and enable the entry in order
> +             * to pick up the changes.
> +             */
> +            if ( update_entry(entry, pdev, vmsix_entry_nr(msix, entry)) )
> +                break;
> +
> +            entry->updated = false;
> +        }
> +        else
> +            vpci_msix_arch_mask_entry(entry, pdev, entry->masked);
> +
> +        break;
> +    }
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +        break;
> +    }
> +    spin_unlock(&msix->pdev->vpci->lock);
> +
> +    return X86EMUL_OKAY;
> +}
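
The 32-bit writes to the lower and upper address registers each replace one half of the cached 64-bit address. A standalone sketch of that merge follows; write_lower and write_upper are hypothetical helper names. One detail worth keeping in mind is that the mask constant needs a 64-bit type: on a platform with 32-bit unsigned int, ~0xffffffff evaluates to 0, which would clear the whole address instead of only its lower half.

```c
#include <stdint.h>

/* Replace the low 32 bits of a 64-bit message address. */
static uint64_t write_lower(uint64_t addr, uint32_t val)
{
    /* The ULL suffix keeps the inverted mask 64 bits wide. */
    return (addr & ~0xffffffffULL) | val;
}

/* Replace the high 32 bits of a 64-bit message address. */
static uint64_t write_upper(uint64_t addr, uint32_t val)
{
    /* Widen before shifting so the value is not truncated. */
    return (addr & 0xffffffffULL) | ((uint64_t)val << 32);
}
```
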
> +
> +static const struct hvm_mmio_ops vpci_msix_table_ops = {
> +    .check = msix_accept,
> +    .read = msix_read,
> +    .write = msix_write,
> +};
> +
> +static int init_msix(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    unsigned int msix_offset, i, max_entries;
> +    uint16_t control;
> +    int rc;
> +
> +    msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
> +                                      PCI_CAP_ID_MSIX);
> +    if ( !msix_offset )
> +        return 0;
> +
> +    control = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
> +                              msix_control_reg(msix_offset));
> +
> +    max_entries = msix_table_size(control);
> +
> +    pdev->vpci->msix = xzalloc_bytes(VMSIX_SIZE(max_entries));
> +    if ( !pdev->vpci->msix )
> +        return -ENOMEM;
> +
> +    pdev->vpci->msix->max_entries = max_entries;
> +    pdev->vpci->msix->pdev = pdev;
> +
> +    pdev->vpci->msix->tables[VPCI_MSIX_TABLE] =
> +        pci_conf_read32(pdev->seg, pdev->bus, slot, func,
> +                        msix_table_offset_reg(msix_offset));
> +    pdev->vpci->msix->tables[VPCI_MSIX_PBA] =
> +        pci_conf_read32(pdev->seg, pdev->bus, slot, func,
> +                        msix_pba_offset_reg(msix_offset));
> +
> +    for ( i = 0; i < pdev->vpci->msix->max_entries; i++)
> +    {
> +        pdev->vpci->msix->entries[i].masked = true;
> +        vpci_msix_arch_init_entry(&pdev->vpci->msix->entries[i]);
> +    }
> +
> +    rc = vpci_add_register(pdev->vpci, control_read, control_write,
> +                           msix_control_reg(msix_offset), 2, pdev->vpci->msix);
> +    if ( rc )
> +        return rc;
> +
> +    if ( list_empty(&d->arch.hvm_domain.msix_tables) )
> +        register_mmio_handler(d, &vpci_msix_table_ops);
> +
> +    list_add(&pdev->vpci->msix->next, &d->arch.hvm_domain.msix_tables);
> +
> +    return 0;
> +}
> +REGISTER_VPCI_INIT(init_msix, VPCI_PRIORITY_HIGH);
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 3012b30013..8ec9c916ea 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -47,6 +47,7 @@ void vpci_remove_device(struct pci_dev *pdev)
>          xfree(r);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    xfree(pdev->vpci->msix);
>      xfree(pdev->vpci->msi);
>      xfree(pdev->vpci);
>      pdev->vpci = NULL;
> diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
> index d1d933d791..020ceacd81 100644
> --- a/xen/include/asm-x86/hvm/domain.h
> +++ b/xen/include/asm-x86/hvm/domain.h
> @@ -188,6 +188,9 @@ struct hvm_domain {
>      struct list_head mmcfg_regions;
>      rwlock_t mmcfg_lock;
> 
> +    /* List of MSI-X tables. */
> +    struct list_head msix_tables;
> +
>      /* List of permanently write-mapped pages. */
>      struct {
>          spinlock_t lock;
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 0fedb3473c..e6b6ed0b92 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -132,6 +132,11 @@ struct vpci_arch_msi {
>      int pirq;
>  };
> 
> +/* Arch-specific MSI-X entry data for vPCI. */
> +struct vpci_arch_msix_entry {
> +    int pirq;
> +};
> +
>  enum stdvga_cache_state {
>      STDVGA_CACHE_UNINITIALIZED,
>      STDVGA_CACHE_ENABLED,
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 7266c17679..fc47163ba6 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -115,6 +115,34 @@ struct vpci {
>          struct vpci_arch_msi arch;
>  #endif
>      } *msi;
> +
> +    /* MSI-X data. */
> +    struct vpci_msix {
> +#ifdef __XEN__
> +        struct pci_dev *pdev;
> +        /* List link. */
> +        struct list_head next;
> +        /* Table information. */
> +#define VPCI_MSIX_TABLE     0
> +#define VPCI_MSIX_PBA       1
> +#define VPCI_MSIX_MEM_NUM   2
> +        uint32_t tables[VPCI_MSIX_MEM_NUM];
> +        /* Maximum number of vectors supported by the device. */
> +        uint16_t max_entries : 12;
> +        /* MSI-X enabled? */
> +        bool enabled         : 1;
> +        /* Masked? */
> +        bool masked          : 1;
> +        /* Entries. */
> +        struct vpci_msix_entry {
> +            uint64_t addr;
> +            uint32_t data;
> +            bool masked  : 1;
> +            bool updated : 1;
> +            struct vpci_arch_msix_entry arch;
> +        } entries[];
> +#endif
> +    } *msix;
>  };
> 
>  struct vpci_vcpu {
> @@ -137,6 +165,51 @@ int __must_check vpci_msi_arch_enable(struct vpci_msi *msi,
>  void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev);
>  void vpci_msi_arch_init(struct vpci_msi *msi);
>  void vpci_msi_arch_print(const struct vpci_msi *msi);
> +
> +/* Arch-specific vPCI MSI-X helpers. */
> +void vpci_msix_arch_mask_entry(struct vpci_msix_entry *entry,
> +                               const struct pci_dev *pdev, bool mask);
> +int __must_check vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
> +                                             const struct pci_dev *pdev,
> +                                             paddr_t table_base);
> +int __must_check vpci_msix_arch_disable_entry(struct vpci_msix_entry *entry,
> +                                              const struct pci_dev *pdev);
> +void vpci_msix_arch_init_entry(struct vpci_msix_entry *entry);
> +int vpci_msix_arch_print(const struct vpci_msix *msix);
> +
> +/*
> + * Helper functions to fetch MSIX related data. They are used by both the
> + * emulated MSIX code and the BAR handlers.
> + */
> +static inline paddr_t vmsix_table_base(const struct vpci *vpci, unsigned int nr)
> +{
> +    return vpci->header.bars[vpci->msix->tables[nr] & PCI_MSIX_BIRMASK].addr;
> +}
> +
> +static inline paddr_t vmsix_table_addr(const struct vpci *vpci, unsigned int nr)
> +{
> +    return vmsix_table_base(vpci, nr) +
> +           (vpci->msix->tables[nr] & ~PCI_MSIX_BIRMASK);
> +}
> +
> +/*
> + * Note regarding the size calculation of the PBA: the spec mentions "The last
> + * QWORD will not necessarily be fully populated", so it implies that the PBA
> + * size is 64-bit aligned.
> + */
> +static inline size_t vmsix_table_size(const struct vpci *vpci, unsigned int nr)
> +{
> +    return
> +        (nr == VPCI_MSIX_TABLE) ? vpci->msix->max_entries * PCI_MSIX_ENTRY_SIZE
> +                                : ROUNDUP(DIV_ROUND_UP(vpci->msix->max_entries,
> +                                                       8), 8);
> +}
> +
> +static inline unsigned int vmsix_entry_nr(const struct vpci_msix *msix,
> +                                          const struct vpci_msix_entry *entry)
> +{
> +    return entry - msix->entries;
> +}
>  #endif /* __XEN__ */
> 
>  #else /* !CONFIG_HAS_VPCI */
> --
> 2.16.2
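
The helpers quoted above decode the MSI-X Table and PBA registers into a
BAR index, an offset within that BAR, and a region size. What follows is
a minimal standalone sketch of that arithmetic, not the code from the
patch: every sketch_* name is hypothetical, and the constants (3-bit BIR
mask, 16-byte table entries, 11-bit table size field holding N - 1) are
assumed from the PCI MSI-X capability layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values from the MSI-X capability layout; names are local to this sketch. */
#define SKETCH_MSIX_BIRMASK     0x7u   /* low 3 bits: BAR indicator (BIR) */
#define SKETCH_MSIX_ENTRY_SIZE  16u    /* bytes per MSI-X table entry */
#define SKETCH_MSIX_TABLE       0
#define SKETCH_MSIX_PBA         1

/* Number of entries: the low 11 bits of Message Control hold N - 1. */
static unsigned int sketch_table_entries(uint16_t control)
{
    return (control & 0x7ffu) + 1;
}

/* BAR index encoded in a Table Offset/BIR or PBA Offset/BIR register. */
static unsigned int sketch_bir(uint32_t reg)
{
    return reg & SKETCH_MSIX_BIRMASK;
}

/* Byte offset into that BAR: the register value with the BIR bits cleared. */
static uint32_t sketch_offset(uint32_t reg)
{
    return reg & ~SKETCH_MSIX_BIRMASK;
}

/* Size in bytes of the MSI-X table (nr == 0) or of the PBA (nr == 1). */
static size_t sketch_region_size(unsigned int nr, unsigned int entries)
{
    if ( nr == SKETCH_MSIX_TABLE )
        return (size_t)entries * SKETCH_MSIX_ENTRY_SIZE;

    /* One pending bit per entry, rounded up to bytes, then to 8-byte QWORDs. */
    return (((size_t)entries + 7) / 8 + 7) / 8 * 8;
}
```

Under these assumptions a register value of 0x2003 selects BAR 3 at
offset 0x2000, and a device with 65 entries needs a 16-byte PBA, since
the 9 bytes of pending bits are rounded up to two QWORDs.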

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [PATCH v11 10/12] vpci: add a priority parameter to the vPCI register initializer
  2018-03-20 15:15 ` [PATCH v11 10/12] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
@ 2018-03-22  9:57   ` Jan Beulich
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2018-03-22  9:57 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, xen-devel,
	Boris Ostrovsky

>>> On 20.03.18 at 16:15, <roger.pau@citrix.com> wrote:
> This is needed for MSI-X, since MSI-X will need to be initialized
> before parsing the BARs, so that the header BAR handlers are aware of
> the MSI-X related holes and make sure they are not mapped in order for
> the trap handlers to work properly.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Julien Grall <julien.grall@arm.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Tim Deegan <tim@xen.org>
> Cc: Wei Liu <wei.liu2@citrix.com>
> ---
> Changes since v4:
>  - Add a middle priority and add the PCI header to it.
> 
> Changes since v3:
>  - Add a numerical suffix to the section used to store the pointer to
>    each initializer function, and sort them at link time.
> ---
>  xen/arch/arm/xen.lds.S    | 4 ++--

Julien, Stefano?

Thanks, Jan
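
The ordering requirement in the quoted commit message (MSI-X has to be
initialized before the BARs are parsed) can be illustrated with a small
sketch. The patch itself sorts the initializer pointers at link time via
numeric section suffixes; the code below deliberately swaps in a runtime
qsort() over a plain array so that it stays self-contained. All sketch_*
names and the three priority levels are assumptions made for
illustration only.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical priority levels; a lower value runs earlier. */
enum sketch_priority { SKETCH_PRIO_HIGH, SKETCH_PRIO_MIDDLE, SKETCH_PRIO_LOW };

/* One registered initializer: its priority and a label for inspection. */
struct sketch_init {
    enum sketch_priority prio;
    const char *name;
};

/* qsort() comparator ordering initializers by ascending priority value. */
static int sketch_cmp(const void *a, const void *b)
{
    const struct sketch_init *x = a, *y = b;

    return (x->prio > y->prio) - (x->prio < y->prio);
}

/* Runtime stand-in for the link-time sort of the initializer section. */
static void sketch_sort_inits(struct sketch_init *inits, size_t n)
{
    qsort(inits, n, sizeof(*inits), sketch_cmp);
}
```

With entries registered in the order header (middle), MSI-X (high), MSI
(low), sorting puts MSI-X first, so a header handler that runs afterwards
could already know about the MSI-X holes.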


* Re: [PATCH v11 04/12] pci: split code to size BARs from pci_add_device
  2018-03-20 15:15 ` [PATCH v11 04/12] pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2018-03-22 10:15   ` Jan Beulich
  2018-03-22 10:31     ` Roger Pau Monné
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2018-03-22 10:15 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, xen-devel,
	Boris Ostrovsky

>>> On 20.03.18 at 16:15, <roger.pau@citrix.com> wrote:
> @@ -672,11 +722,16 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>              unsigned int i;
>  
>              BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
> -            for ( i = 0; i < PCI_SRIOV_NUM_BARS; ++i )
> +            for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
>              {
>                  unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
>                  u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
> -                u32 hi = 0;
> +                pci_sbdf_t sbdf = {
> +                    .seg = seg,
> +                    .bus = bus,
> +                    .dev = slot,
> +                    .func = func,
> +                };

So I've had everything up to patch 9 applied and ready for pushing,
when I did my usual secondary compile test on an old system: This
fails to compile with gcc 4.3 (due to there being an unnamed
sub-structure). A similar issue exists at least in patch 7. Since the
structure gets introduced in patch 1 (and hence may need changing
there, depending on how this is to be addressed), I'm not going to
push any part of this series.

Jan
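
For readers unfamiliar with the failure mode described above: as far as
I know, older GCC releases reject designated initializers that name
members of an unnamed struct or union, while direct member assignment is
accepted. The sketch below shows both forms on a hypothetical type;
sketch_sbdf_t is a simplified stand-in and is not claimed to match the
layout of pci_sbdf_t.

```c
#include <stdint.h>

/* Hypothetical stand-in for a type containing an unnamed sub-structure. */
typedef union {
    uint32_t raw;
    struct {                 /* unnamed (anonymous) struct member */
        uint8_t  func : 3,
                 dev  : 5;
        uint8_t  bus;
        uint16_t seg;
    };
} sketch_sbdf_t;

/* Form used in the quoted hunk: designated initializers on anonymous members. */
static sketch_sbdf_t sketch_make_designated(uint16_t seg, uint8_t bus,
                                            uint8_t dev, uint8_t func)
{
    sketch_sbdf_t sbdf = {
        .seg = seg,
        .bus = bus,
        .dev = dev,
        .func = func,
    };

    return sbdf;
}

/* Alternative for old compilers: zero the object, then assign each field. */
static sketch_sbdf_t sketch_make_assigned(uint16_t seg, uint8_t bus,
                                          uint8_t dev, uint8_t func)
{
    sketch_sbdf_t sbdf = { 0 };

    sbdf.seg = seg;
    sbdf.bus = bus;
    sbdf.dev = dev;
    sbdf.func = func;

    return sbdf;
}
```

Both helpers should produce the same value on a current compiler; only
the second form avoids the initializer syntax that the old toolchain
trips over.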



* Re: [PATCH v11 04/12] pci: split code to size BARs from pci_add_device
  2018-03-22 10:15   ` Jan Beulich
@ 2018-03-22 10:31     ` Roger Pau Monné
  2018-03-22 10:33       ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Roger Pau Monné @ 2018-03-22 10:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, xen-devel,
	Boris Ostrovsky

On Thu, Mar 22, 2018 at 04:15:06AM -0600, Jan Beulich wrote:
> >>> On 20.03.18 at 16:15, <roger.pau@citrix.com> wrote:
> > @@ -672,11 +722,16 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
> >              unsigned int i;
> >  
> >              BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
> > -            for ( i = 0; i < PCI_SRIOV_NUM_BARS; ++i )
> > +            for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
> >              {
> >                  unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
> >                  u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
> > -                u32 hi = 0;
> > +                pci_sbdf_t sbdf = {
> > +                    .seg = seg,
> > +                    .bus = bus,
> > +                    .dev = slot,
> > +                    .func = func,
> > +                };
> 
> So I've had everything up to patch 9 applied and ready for pushing,
> when I did my usual secondary compile test on an old system: This
> > fails to compile with gcc 4.3 (due to there being an unnamed
> > sub-structure). A similar issue exists at least in patch 7. Since the
> structure gets introduced in patch 1 (and hence may need changing

pci_sbdf_t is already in the source tree, it was introduced by
514f58d4468a40b5dd418a5ea1742681930c3f2d back in December.

> there, depending on how this is to be addressed), I'm not going to
> push any part of this series.

No patch in the series changes pci_sbdf_t at all, so in any case this
should be a pre-patch or a post-patch, but not really part of patch 1.

Thanks, Roger.


* Re: [PATCH v11 04/12] pci: split code to size BARs from pci_add_device
  2018-03-22 10:31     ` Roger Pau Monné
@ 2018-03-22 10:33       ` Jan Beulich
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2018-03-22 10:33 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Julien Grall, xen-devel,
	Boris Ostrovsky

>>> On 22.03.18 at 11:31, <roger.pau@citrix.com> wrote:
> On Thu, Mar 22, 2018 at 04:15:06AM -0600, Jan Beulich wrote:
>> >>> On 20.03.18 at 16:15, <roger.pau@citrix.com> wrote:
>> > @@ -672,11 +722,16 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>> >              unsigned int i;
>> >  
>> >              BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
>> > -            for ( i = 0; i < PCI_SRIOV_NUM_BARS; ++i )
>> > +            for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
>> >              {
>> >                  unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
>> >                  u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
>> > -                u32 hi = 0;
>> > +                pci_sbdf_t sbdf = {
>> > +                    .seg = seg,
>> > +                    .bus = bus,
>> > +                    .dev = slot,
>> > +                    .func = func,
>> > +                };
>> 
>> So I've had everything up to patch 9 applied and ready for pushing,
>> when I did my usual secondary compile test on an old system: This
>> fails to compile with gcc 4.3 (due to there being an unnamed
>> sub-structure). A similar issue exists at least in patch 7. Since the
>> structure gets introduced in patch 1 (and hence may need changing
> 
> pci_sbdf_t is already in the source tree, it was introduced by
> 514f58d4468a40b5dd418a5ea1742681930c3f2d back in December.

Oh, I guess it's the test harness instance that I've mistakenly seen
in the grep output here.

Jan



end of thread, other threads:[~2018-03-22 10:33 UTC | newest]

Thread overview: 23+ messages
2018-03-20 15:15 [PATCH v11 00/12] vpci: PCI config space emulation Roger Pau Monne
2018-03-20 15:15 ` [PATCH v11 01/12] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
2018-03-21  4:39   ` Julien Grall
2018-03-20 15:15 ` [PATCH v11 02/12] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
2018-03-20 15:15 ` [PATCH v11 03/12] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
2018-03-20 15:15 ` [PATCH v11 04/12] pci: split code to size BARs from pci_add_device Roger Pau Monne
2018-03-22 10:15   ` Jan Beulich
2018-03-22 10:31     ` Roger Pau Monné
2018-03-22 10:33       ` Jan Beulich
2018-03-20 15:15 ` [PATCH v11 05/12] pci: add support to size ROM BARs to pci_size_mem_bar Roger Pau Monne
2018-03-20 15:15 ` [PATCH v11 06/12] xen: introduce rangeset_consume_ranges Roger Pau Monne
2018-03-20 15:15 ` [PATCH v11 07/12] vpci: add header handlers Roger Pau Monne
2018-03-20 16:19   ` Jan Beulich
2018-03-21 12:31   ` Paul Durrant
2018-03-20 15:15 ` [PATCH v11 08/12] x86/pt: mask MSI vectors on unbind Roger Pau Monne
2018-03-20 15:15 ` [PATCH v11 09/12] vpci/msi: add MSI handlers Roger Pau Monne
2018-03-21 12:34   ` Paul Durrant
2018-03-20 15:15 ` [PATCH v11 10/12] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
2018-03-22  9:57   ` Jan Beulich
2018-03-20 15:15 ` [PATCH v11 11/12] vpci/msix: add MSI-X handlers Roger Pau Monne
2018-03-21 12:36   ` Paul Durrant
2018-03-20 15:15 ` [PATCH v11 12/12] vpci: do not expose unneeded functions to the user-space test harness Roger Pau Monne
2018-03-20 16:20   ` Jan Beulich
