All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 00/11] vpci: PCI config space emulation
@ 2017-08-14 14:28 Roger Pau Monne
  2017-08-14 14:28 ` [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr Roger Pau Monne
                   ` (10 more replies)
  0 siblings, 11 replies; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky

Hello,

The following series contain an implementation of handlers for the PCI
configuration space inside of Xen. This allows Xen to detect accesses
to the PCI configuration space and react accordingly.

Why is this needed? IMHO, there are two main points of doing all this
emulation inside of Xen, the first one is to prevent adding a bunch of
duplicated Xen PV specific code to each OS we want to support in PVH
mode. This just promotes Xen code duplication amongst OSes, which
leads to a higher maintainership burden.

The second reason would be that this code (or it's functionality to be
more precise) already exists in QEMU (and pciback to a degree), and
it's code that we already support and maintain. By moving it into the
hypervisor itself every guest type can make use of it, and should be
shared between them all. I know that the code in this series is not
yet suitable for DomU HVM guests in it's current state, but it should
be in due time.

As usual, each patch contains a changeset summary between versions,
I'm not going to copy the list of changes here.

Patch 1 introduces a function to decode a PCI IO port access into bdf
and register (which is shared with the ioreq code). Patch 2 implements
the generic handlers for accesses to the PCI configuration space
together with a minimal user-space test harness that I've used during
development. Currently a per-device linked list is used in order to
store the list of handlers, and they are sorted based on their offset
inside of the configuration space. Patch 2 also adds the x86 port IO
traps and wires them into the newly introduced vPCI dispatchers. Patch
3 and 4 adds handlers for the MMCFG areas (as found on the MMCFG ACPI
table). Patches 5, 6 and 7 are mostly code moment/refactoring in order
to implement support for BAR mapping in patch 8. Finally patches 9 and
11 add support for trapping accesses to the MSI and MSI-X capabilities
respectively, so that interrupts are properly setup on behalf of
Dom0.

The branch containing the patches can be found at:

git://xenbits.xen.org/people/royger/xen.git vpci_v5

Note that this is only safe to use for the hardware domain (that's
trusted), any non-trusted domain will need a lot more of traps before
it can freely access the PCI configuration space.

This series is based on top of my "x86/pvh: implement
iommu_inclusive_mapping for PVH Dom0" series.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-08-22 11:24   ` Paul Durrant
  2017-08-24 15:46   ` Jan Beulich
  2017-08-14 14:28 ` [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

And use it in the ioreq code to decode accesses to the PCI IO ports
into bus, slot, function and register values.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - New in this version.
---
 xen/arch/x86/hvm/io.c        | 19 +++++++++++++++++++
 xen/arch/x86/hvm/ioreq.c     | 12 +++++-------
 xen/include/asm-x86/hvm/io.h |  5 +++++
 3 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 214ab307c4..074cba89da 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -256,6 +256,25 @@ void register_g2m_portio_handler(struct domain *d)
     handler->ops = &g2m_portio_ops;
 }
 
+unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
+                                 unsigned int *bus, unsigned int *slot,
+                                 unsigned int *func)
+{
+    unsigned long bdf;
+
+    ASSERT(CF8_ENABLED(cf8));
+
+    bdf = CF8_BDF(cf8);
+    *bus = PCI_BUS(bdf);
+    *slot = PCI_SLOT(bdf);
+    *func = PCI_FUNC(bdf);
+    /*
+     * NB: the lower 2 bits of the register address are fetched from the
+     * offset into the 0xcfc register when reading/writing to it.
+     */
+    return CF8_ADDR_LO(cf8) | (addr & 3);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index b2a8b0e986..752976d16d 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -1178,18 +1178,16 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
          CF8_ENABLED(cf8) )
     {
         uint32_t sbdf, x86_fam;
+        unsigned int bus, slot, func, reg;
+
+        reg = hvm_pci_decode_addr(cf8, p->addr, &bus, &slot, &func);
 
         /* PCI config data cycle */
 
-        sbdf = XEN_DMOP_PCI_SBDF(0,
-                                 PCI_BUS(CF8_BDF(cf8)),
-                                 PCI_SLOT(CF8_BDF(cf8)),
-                                 PCI_FUNC(CF8_BDF(cf8)));
+        sbdf = XEN_DMOP_PCI_SBDF(0, bus, slot, func);
 
         type = XEN_DMOP_IO_RANGE_PCI;
-        addr = ((uint64_t)sbdf << 32) |
-               CF8_ADDR_LO(cf8) |
-               (p->addr & 3);
+        addr = ((uint64_t)sbdf << 32) | reg;
         /* AMD extended configuration space access? */
         if ( CF8_ADDR_HI(cf8) &&
              d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 2484eb1c75..51659b6c7f 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -149,6 +149,11 @@ void stdvga_deinit(struct domain *d);
 
 extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
 
+/* Decode a PCI port IO access into a bus/slot/func/reg. */
+unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
+                                 unsigned int *bus, unsigned int *slot,
+                                 unsigned int *func);
+
 /*
  * HVM port IO handler that performs forwarding of guest IO ports into machine
  * IO ports.
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
  2017-08-14 14:28 ` [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-08-22 12:05   ` Paul Durrant
                     ` (2 more replies)
  2017-08-14 14:28 ` [PATCH v5 03/11] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
                   ` (8 subsequent siblings)
  10 siblings, 3 replies; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

This functionality is going to reside in vpci.c (and the corresponding
vpci.h header), and should be arch-agnostic. The handlers introduced
in this patch setup the basic functionality required in order to trap
accesses to the PCI config space, and allow decoding the address and
finding the corresponding handler that should handle the access
(although no handlers are implemented).

Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are
setup inside of a x86 HVM file, since that's not shared with other
arches.

A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen
whether a domain should use the newly introduced vPCI handlers, this
is only enabled for PVH Dom0 at the moment.

A very simple user-space test is also provided, so that the basic
functionality of the vPCI traps can be asserted. This has been proven
quite helpful during development, since the logic to handle partial
accesses or accesses that expand across multiple registers is not
trivial.

The handlers for the registers are added to a linked list that's keep
sorted at all times. Both the read and write handlers support accesses
that expand across multiple emulated registers and contain gaps not
emulated.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v4:
* User-space test harness:
 - Do not redirect the output of the test.
 - Add main.c and emul.h as dependencies of the Makefile target.
 - Use the same rule to modify the vpci and list headers.
 - Remove underscores from local macro variables.
 - Add _check suffix to the test harness multiread function.
 - Change the value written by every different size in the multiwrite
   test.
 - Use { } to initialize the r16 and r20 arrays (instead of { 0 }).
 - Perform some of the read checks with the local variable directly.
 - Expand some comments.
 - Implement a dummy rwlock.
* Hypervisor code:
 - Guard the linker script changes with CONFIG_HAS_PCI.
 - Rename vpci_access_check to vpci_access_allowed and make it return
   bool.
 - Make hvm_pci_decode_addr return the register as return value.
 - Use ~3 instead of 0xfffc to remove the register offset when
   checking accesses to IO ports.
 - s/head/prev in vpci_add_register.
 - Add parentheses around & in vpci_add_register.
 - Fix register removal.
 - Change the BUGs in vpci_{read/write}_hw helpers to
   ASSERT_UNREACHABLE.
 - Make merge_result static and change the computation of the mask to
   avoid using a uint64_t.
 - Modify vpci_read to only read from hardware the not-emulated gaps.
 - Remove the vpci_val union and use a uint32_t instead.
 - Change handler read type to return a uint32_t instead of modifying
   a variable passed by reference.
 - Constify the data opaque parameter of read handlers.
 - Change the size parameter of the vpci_{read/write} functions to
   unsigned int.
 - Place the array of initialization handlers in init.rodata or
   .rodata depending on whether late-hwdom is enabled.
 - Remove the pci_devs lock, assume the Dom0 is well behaved and won't
   remove the device while trying to access it.
 - Change the recursive spinlock into a rw lock for performance
   reasons.

Changes since v3:
* User-space test harness:
 - Fix spaces in container_of macro.
 - Implement a dummy locking functions.
 - Remove 'current' macro make current a pointer to the statically
   allocated vpcu.
 - Remove unneeded parentheses in the pci_conf_readX macros.
 - Fix the name of the write test macro.
 - Remove the dummy EXPORT_SYMBOL macro (this was needed by the RB
   code only).
 - Import the max macro.
 - Test all possible read/write size combinations with all possible
   emulated register sizes.
 - Introduce a test for register removal.
* Hypervisor code:
 - Use a sorted list in order to store the config space handlers.
 - Remove some unneeded 'else' branches.
 - Make the IO port handlers always return X86EMUL_OKAY, and set the
   data to all 1's in case of read failure (write are simply ignored).
 - In hvm_select_ioreq_server reuse local variables when calling
   XEN_DMOP_PCI_SBDF.
 - Store the pointers to the initialization functions in the .rodata
   section.
 - Do not ignore the return value of xen_vpci_add_handlers in
   setup_one_hwdom_device.
 - Remove the vpci_init macro.
 - Do not hide the pointers inside of the vpci_{read/write}_t
   typedefs.
 - Rename priv_data to private in vpci_register.
 - Simplify checking for register overlap in vpci_register_cmp.
 - Check that the offset and the length match before removing a
   register in xen_vpci_remove_register.
 - Make vpci_read_hw return a value rather than storing it in a
   pointer passed by parameter.
 - Handler dispatcher functions vpci_{read/write} no longer return an
   error code, errors on reads/writes should be treated like hardware
   (writes ignored, reads return all 1's or garbage).
 - Make sure pcidevs is locked before calling pci_get_pdev_by_domain.
 - Use a recursive spinlock for the vpci lock, so that spin_is_locked
   checks that the current CPU is holding the lock.
 - Make the code less error-chatty by removing some of the printk's.
 - Pass the slot and the function as separate parameters to the
   handler dispatchers (instead of passing devfn).
 - Allow handlers to be registered with either a read or write
   function only, the missing handler will be replaced by a dummy
   handler (writes ignored, reads return 1's).
 - Introduce PCI_CFG_SPACE_* defines from Linux.
 - Simplify the handler dispatchers by removing the recursion, now the
   dispatchers iterate over the list of sorted handlers and call them
   in order.
 - Remove the GENMASK_BYTES, SHIFT_RIGHT_BYTES and ADD_RESULT macros,
   and instead provide a merge_result function in order to merge a
   register output into a partial result.
 - Rename the fields of the vpci_val union to u8/u16/u32.
 - Remove the return values from the read/write handlers, errors
   should be handled internally and signaled as would be done on
   native hardware.
 - Remove the usage of the GENMASK macro.

Changes since v2:
 - Generalize the PCI address decoding and use it for IOREQ code also.

Changes since v1:
 - Allow access to cross a word-boundary.
 - Add locking.
 - Add cleanup to xen_vpci_add_handlers in case of failure.
---
 .gitignore                        |   3 +
 tools/libxl/libxl_x86.c           |   2 +-
 tools/tests/Makefile              |   1 +
 tools/tests/vpci/Makefile         |  37 ++++
 tools/tests/vpci/emul.h           | 128 +++++++++++
 tools/tests/vpci/main.c           | 314 +++++++++++++++++++++++++++
 xen/arch/arm/xen.lds.S            |  10 +
 xen/arch/x86/domain.c             |  18 +-
 xen/arch/x86/hvm/hvm.c            |   2 +
 xen/arch/x86/hvm/io.c             | 118 +++++++++-
 xen/arch/x86/setup.c              |   3 +-
 xen/arch/x86/xen.lds.S            |  10 +
 xen/drivers/Makefile              |   2 +-
 xen/drivers/passthrough/pci.c     |   9 +-
 xen/drivers/vpci/Makefile         |   1 +
 xen/drivers/vpci/vpci.c           | 443 ++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h      |   1 +
 xen/include/asm-x86/hvm/domain.h  |   3 +
 xen/include/asm-x86/hvm/io.h      |   3 +
 xen/include/public/arch-x86/xen.h |   5 +-
 xen/include/xen/pci.h             |   3 +
 xen/include/xen/pci_regs.h        |   8 +
 xen/include/xen/vpci.h            |  80 +++++++
 23 files changed, 1194 insertions(+), 10 deletions(-)
 create mode 100644 tools/tests/vpci/Makefile
 create mode 100644 tools/tests/vpci/emul.h
 create mode 100644 tools/tests/vpci/main.c
 create mode 100644 xen/drivers/vpci/Makefile
 create mode 100644 xen/drivers/vpci/vpci.c
 create mode 100644 xen/include/xen/vpci.h

diff --git a/.gitignore b/.gitignore
index 594ffd9a7f..7f24ce72b1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -237,6 +237,9 @@ tools/tests/regression/build/*
 tools/tests/regression/downloads/*
 tools/tests/mem-sharing/memshrtool
 tools/tests/mce-test/tools/xen-mceinj
+tools/tests/vpci/list.h
+tools/tests/vpci/vpci.[hc]
+tools/tests/vpci/test_vpci
 tools/xcutils/lsevtchn
 tools/xcutils/readnotes
 tools/xenbackendd/_paths.h
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 455f6f0bed..dd7fc78a99 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -11,7 +11,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
     if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
         if (d_config->b_info.device_model_version !=
             LIBXL_DEVICE_MODEL_VERSION_NONE) {
-            xc_config->emulation_flags = XEN_X86_EMU_ALL;
+            xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
         } else if (libxl_defbool_val(d_config->b_info.u.hvm.apic)) {
             /*
              * HVM guests without device model may want
diff --git a/tools/tests/Makefile b/tools/tests/Makefile
index 7162945121..f6942a93fb 100644
--- a/tools/tests/Makefile
+++ b/tools/tests/Makefile
@@ -13,6 +13,7 @@ endif
 SUBDIRS-$(CONFIG_X86) += x86_emulator
 SUBDIRS-y += xen-access
 SUBDIRS-y += xenstore
+SUBDIRS-$(CONFIG_HAS_PCI) += vpci
 
 .PHONY: all clean install distclean uninstall
 all clean distclean: %: subdirs-%
diff --git a/tools/tests/vpci/Makefile b/tools/tests/vpci/Makefile
new file mode 100644
index 0000000000..e45fcb5cd9
--- /dev/null
+++ b/tools/tests/vpci/Makefile
@@ -0,0 +1,37 @@
+XEN_ROOT=$(CURDIR)/../../..
+include $(XEN_ROOT)/tools/Rules.mk
+
+TARGET := test_vpci
+
+.PHONY: all
+all: $(TARGET)
+
+.PHONY: run
+run: $(TARGET)
+	./$(TARGET)
+
+$(TARGET): vpci.c vpci.h list.h main.c emul.h
+	$(HOSTCC) -g -o $@ vpci.c main.c
+
+.PHONY: clean
+clean:
+	rm -rf $(TARGET) *.o *~ vpci.h vpci.c list.h
+
+.PHONY: distclean
+distclean: clean
+
+.PHONY: install
+install:
+
+vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
+	# Trick the compiler so it doesn't complain about missing symbols
+	sed -e '/#include/d' \
+	    -e '1s;^;#include "emul.h"\
+	             vpci_register_init_t *const __start_vpci_array[1]\;\
+	             vpci_register_init_t *const __end_vpci_array[1]\;\
+	             ;' <$< >$@
+
+list.h: $(XEN_ROOT)/xen/include/xen/list.h
+vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
+list.h vpci.h:
+	sed -e '/#include/d' <$< >$@
diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h
new file mode 100644
index 0000000000..69f08ba1e7
--- /dev/null
+++ b/tools/tests/vpci/emul.h
@@ -0,0 +1,128 @@
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _TEST_VPCI_
+#define _TEST_VPCI_
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <assert.h>
+
+#define container_of(ptr, type, member) ({                      \
+        typeof(((type *)0)->member) *mptr = (ptr);              \
+                                                                \
+        (type *)((char *)mptr - offsetof(type, member));        \
+})
+
+#define smp_wmb()
+#define prefetch(x) __builtin_prefetch(x)
+#define ASSERT(x) assert(x)
+#define __must_check __attribute__((__warn_unused_result__))
+
+#include "list.h"
+
+struct pci_dev {
+    struct domain *domain;
+    struct vpci *vpci;
+};
+
+struct domain {
+    enum {
+        UNLOCKED,
+        RLOCKED,
+        WLOCKED,
+    } lock;
+};
+
+struct vcpu
+{
+    struct domain *domain;
+};
+
+extern struct vcpu *current;
+extern struct pci_dev test_pdev;
+
+#include "vpci.h"
+
+#define __hwdom_init
+
+#define has_vpci(d) true
+
+/* Define our own locks. */
+#undef vpci_rlock
+#undef vpci_wlock
+#undef vpci_runlock
+#undef vpci_wunlock
+#undef vpci_rlocked
+#undef vpci_wlocked
+#define vpci_rlock(d) ((d)->lock = RLOCKED)
+#define vpci_wlock(d) ((d)->lock = WLOCKED)
+#define vpci_runlock(d) ((d)->lock = UNLOCKED)
+#define vpci_wunlock(d) ((d)->lock = UNLOCKED)
+#define vpci_rlocked(d) ((d)->lock == RLOCKED)
+#define vpci_wlocked(d) ((d)->lock == WLOCKED)
+
+#define xzalloc(type) ((type *)calloc(1, sizeof(type)))
+#define xmalloc(type) ((type *)malloc(sizeof(type)))
+#define xfree(p) free(p)
+
+#define pci_get_pdev_by_domain(...) &test_pdev
+
+/* Dummy native helpers. Writes are ignored, reads return 1's. */
+#define pci_conf_read8(...)     0xff
+#define pci_conf_read16(...)    0xffff
+#define pci_conf_read32(...)    0xffffffff
+#define pci_conf_write8(...)
+#define pci_conf_write16(...)
+#define pci_conf_write32(...)
+
+#define PCI_CFG_SPACE_EXP_SIZE 4096
+
+#define BUG() assert(0)
+#define ASSERT_UNREACHABLE() assert(0)
+
+#define min(x, y) ({                    \
+        const typeof(x) tx = (x);       \
+        const typeof(y) ty = (y);       \
+                                        \
+        (void) (&tx == &ty);            \
+        tx < ty ? tx : ty;              \
+})
+
+#define max(x, y) ({                    \
+        const typeof(x) tx = (x);       \
+        const typeof(y) ty = (y);       \
+                                        \
+        (void) (&tx == &ty);            \
+        tx > ty ? tx : ty;              \
+})
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/tests/vpci/main.c b/tools/tests/vpci/main.c
new file mode 100644
index 0000000000..33736937bf
--- /dev/null
+++ b/tools/tests/vpci/main.c
@@ -0,0 +1,314 @@
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "emul.h"
+
+/* Single vcpu (current), and single domain with a single PCI device. */
+static struct vpci vpci;
+
+static struct domain d = {
+    .lock = false,
+};
+
+struct pci_dev test_pdev = {
+    .domain = &d,
+    .vpci = &vpci,
+};
+
+static struct vcpu v = {
+    .domain = &d
+};
+
+struct vcpu *current = &v;
+
+/* Dummy hooks, write stores data, read fetches it. */
+static uint32_t vpci_read8(struct pci_dev *pdev, unsigned int reg,
+                           const void *data)
+{
+    return *(const uint8_t *)data;
+}
+
+static void vpci_write8(struct pci_dev *pdev, unsigned int reg,
+                        uint32_t val, void *data)
+{
+    *(uint8_t *)data = val;
+}
+
+static uint32_t vpci_read16(struct pci_dev *pdev, unsigned int reg,
+                            const void *data)
+{
+    return *(const uint16_t *)data;
+}
+
+static void vpci_write16(struct pci_dev *pdev, unsigned int reg,
+                         uint32_t val, void *data)
+{
+    *(uint16_t *)data = val;
+}
+
+static uint32_t vpci_read32(struct pci_dev *pdev, unsigned int reg,
+                            const void *data)
+{
+    return *(const uint32_t *)data;
+}
+
+static void vpci_write32(struct pci_dev *pdev, unsigned int reg,
+                         uint32_t val, void *data)
+{
+    *(uint32_t *)data = val;
+}
+
+#define VPCI_READ(reg, size, data) ({                   \
+    vpci_rlock(&d);                                     \
+    data = vpci_read(0, 0, 0, 0, reg, size);            \
+    vpci_runlock(&d);                                   \
+})
+
+#define VPCI_READ_CHECK(reg, size, expected) ({         \
+    uint32_t rd;                                        \
+                                                        \
+    VPCI_READ(reg, size, rd);                           \
+    assert(rd == (expected));                           \
+})
+
+#define VPCI_WRITE(reg, size, data) ({                  \
+    vpci_wlock(&d);                                     \
+    vpci_write(0, 0, 0, 0, reg, size, data);            \
+    vpci_wunlock(&d);                                   \
+})
+
+#define VPCI_WRITE_CHECK(reg, size, data) ({            \
+    VPCI_WRITE(reg, size, data);                        \
+    VPCI_READ_CHECK(reg, size, data);                   \
+})
+
+#define VPCI_ADD_REG(fread, fwrite, off, size, store)                       \
+    assert(!vpci_add_register(&test_pdev, fread, fwrite, off, size, &store))
+
+#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \
+    assert(vpci_add_register(&test_pdev, fread, fwrite, off, size, NULL))
+
+#define VPCI_REMOVE_REG(off, size)                                          \
+    assert(!vpci_remove_register(&test_pdev, off, size))
+
+#define VPCI_REMOVE_INVALID_REG(off, size)                                  \
+    assert(vpci_remove_register(&test_pdev, off, size))
+
+/* Read a 32b register using all possible sizes. */
+void multiread4_check(unsigned int reg, uint32_t val)
+{
+    unsigned int i;
+
+    /* Read using bytes. */
+    for ( i = 0; i < 4; i++ )
+        VPCI_READ_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX);
+
+    /* Read using 2bytes. */
+    for ( i = 0; i < 2; i++ )
+        VPCI_READ_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX);
+
+    VPCI_READ_CHECK(reg, 4, val);
+}
+
+void multiwrite4_check(unsigned int reg)
+{
+    unsigned int i;
+    uint32_t val = 0xa2f51732;
+
+    /* Write using bytes. */
+    for ( i = 0; i < 4; i++ )
+        VPCI_WRITE_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX);
+    multiread4_check(reg, val);
+
+    /* Change the value each time to be sure writes work fine. */
+    val = 0x2b836fda;
+    /* Write using 2bytes. */
+    for ( i = 0; i < 2; i++ )
+        VPCI_WRITE_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX);
+    multiread4_check(reg, val);
+
+    val = 0xc4693beb;
+    VPCI_WRITE_CHECK(reg, 4, val);
+    multiread4_check(reg, val);
+}
+
+int
+main(int argc, char **argv)
+{
+    /* Index storage by offset. */
+    uint32_t r0 = 0xdeadbeef;
+    uint8_t r5 = 0xef;
+    uint8_t r6 = 0xbe;
+    uint8_t r7 = 0xef;
+    uint16_t r12 = 0x8696;
+    uint8_t r16[4] = { };
+    uint16_t r20[2] = { };
+    uint32_t r24 = 0;
+    uint8_t r28, r30;
+    unsigned int i;
+    int rc;
+
+    INIT_LIST_HEAD(&vpci.handlers);
+
+    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
+    VPCI_READ_CHECK(0, 4, r0);
+    VPCI_WRITE_CHECK(0, 4, 0xbcbcbcbc);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
+    VPCI_READ_CHECK(5, 1, r5);
+    VPCI_WRITE_CHECK(5, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
+    VPCI_READ_CHECK(6, 1, r6);
+    VPCI_WRITE_CHECK(6, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
+    VPCI_READ_CHECK(7, 1, r7);
+    VPCI_WRITE_CHECK(7, 1, 0xbd);
+
+    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
+    VPCI_READ_CHECK(12, 2, r12);
+    VPCI_READ_CHECK(12, 4, 0xffff8696);
+
+    /*
+     * At this point we have the following layout:
+     *
+     * Note that this refers to the position of the variables,
+     * but the value has already changed from the one given at
+     * initialization time because write tests have been performed.
+     *
+     * 32    24    16     8     0
+     *  +-----+-----+-----+-----+
+     *  |          r0           | 0
+     *  +-----+-----+-----+-----+
+     *  | r7  |  r6 |  r5 |/////| 32
+     *  +-----+-----+-----+-----|
+     *  |///////////////////////| 64
+     *  +-----------+-----------+
+     *  |///////////|    r12    | 96
+     *  +-----------+-----------+
+     *             ...
+     *  / = empty.
+     */
+
+    /* Try to add an overlapping register handler. */
+    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
+
+    /* Try to add a non-aligned register. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
+
+    /* Try to add a register with wrong size. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
+
+    /* Try to add a register with missing handlers. */
+    VPCI_ADD_INVALID_REG(NULL, NULL, 8, 2);
+
+    /* Read/write of unset register. */
+    VPCI_READ_CHECK(8, 4, 0xffffffff);
+    VPCI_READ_CHECK(8, 2, 0xffff);
+    VPCI_READ_CHECK(8, 1, 0xff);
+    VPCI_WRITE(10, 2, 0xbeef);
+    VPCI_READ_CHECK(10, 2, 0xffff);
+
+    /* Read of multiple registers */
+    VPCI_WRITE_CHECK(7, 1, 0xbd);
+    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
+
+    /* Partial read of a register. */
+    VPCI_WRITE_CHECK(0, 4, 0x1a1b1c1d);
+    VPCI_READ_CHECK(2, 1, 0x1b);
+    VPCI_READ_CHECK(6, 2, 0xbdba);
+
+    /* Write of multiple registers. */
+    VPCI_WRITE_CHECK(4, 4, 0xaabbccff);
+
+    /* Partial write of a register. */
+    VPCI_WRITE_CHECK(2, 1, 0xfe);
+    VPCI_WRITE_CHECK(6, 2, 0xfebc);
+
+    /*
+     * Test all possible read/write size combinations.
+     *
+     * Place 4 1B registers at 128bits (16B), 2 2B registers at 160bits
+     * (20B) and finally 1 4B register at 192bits (24B).
+     *
+     * Then perform all possible write and read sizes on each of them.
+     *
+     *               ...
+     * 32     24     16      8      0
+     *  +------+------+------+------+
+     *  |r16[3]|r16[2]|r16[1]|r16[0]| 16
+     *  +------+------+------+------+
+     *  |    r20[1]   |    r20[0]   | 20
+     *  +-------------+-------------|
+     *  |            r24            | 24
+     *  +-------------+-------------+
+     *
+     */
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 16, 1, r16[0]);
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 17, 1, r16[1]);
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 18, 1, r16[2]);
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 19, 1, r16[3]);
+
+    VPCI_ADD_REG(vpci_read16, vpci_write16, 20, 2, r20[0]);
+    VPCI_ADD_REG(vpci_read16, vpci_write16, 22, 2, r20[1]);
+
+    VPCI_ADD_REG(vpci_read32, vpci_write32, 24, 4, r24);
+
+    /* Check the initial value is 0. */
+    multiread4_check(16, 0);
+    multiread4_check(20, 0);
+    multiread4_check(24, 0);
+
+    multiwrite4_check(16);
+    multiwrite4_check(20);
+    multiwrite4_check(24);
+
+    /*
+     * Check multiple non-consecutive gaps on the same read/write:
+     *
+     * 32     24     16      8      0
+     *  +------+------+------+------+
+     *  |//////|  r30 |//////|  r28 | 28
+     *  +------+------+------+------+
+     *
+     */
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 28, 1, r28);
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 30, 1, r30);
+    VPCI_WRITE_CHECK(28, 4, 0xffacffdc);
+
+    /* Finally try to remove a couple of registers. */
+    VPCI_REMOVE_REG(28, 1);
+    VPCI_REMOVE_REG(24, 4);
+    VPCI_REMOVE_REG(12, 2);
+
+    VPCI_REMOVE_INVALID_REG(20, 1);
+    VPCI_REMOVE_INVALID_REG(16, 2);
+    VPCI_REMOVE_INVALID_REG(30, 2);
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
index 2d54f224ec..6690516ff1 100644
--- a/xen/arch/arm/xen.lds.S
+++ b/xen/arch/arm/xen.lds.S
@@ -41,6 +41,11 @@ SECTIONS
 
   . = ALIGN(PAGE_SIZE);
   .rodata : {
+#if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM)
+       __start_vpci_array = .;
+       *(.rodata.vpci)
+       __end_vpci_array = .;
+#endif
         _srodata = .;          /* Read-only data */
         /* Bug frames table */
        __start_bug_frames = .;
@@ -131,6 +136,11 @@ SECTIONS
   } :text
   . = ALIGN(PAGE_SIZE);
   .init.data : {
+#if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM)
+       __start_vpci_array = .;
+       *(.init.rodata.vpci)
+       __end_vpci_array = .;
+#endif
        *(.init.rodata)
        *(.init.rodata.rel)
        *(.init.rodata.str*)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index baaf8151d2..7a862ea671 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -376,11 +376,21 @@ static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
     if ( is_hvm_domain(d) )
     {
         if ( is_hardware_domain(d) &&
-             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
-            return false;
-        if ( !is_hardware_domain(d) && emflags &&
-             emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC )
+             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
+                         XEN_X86_EMU_VPCI) )
             return false;
+        if ( !is_hardware_domain(d) )
+        {
+            switch ( emflags )
+            {
+            case XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI:
+            case XEN_X86_EMU_LAPIC:
+            case 0:
+                break;
+            default:
+                return false;
+            }
+        }
     }
     else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT )
     {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 6cb903def5..cc73df8dc7 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -36,6 +36,7 @@
 #include <xen/rangeset.h>
 #include <xen/monitor.h>
 #include <xen/warning.h>
+#include <xen/vpci.h>
 #include <asm/shadow.h>
 #include <asm/hap.h>
 #include <asm/current.h>
@@ -629,6 +630,7 @@ int hvm_domain_initialise(struct domain *d, unsigned long domcr_flags,
         d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
 
     register_g2m_portio_handler(d);
+    register_vpci_portio_handler(d);
 
     hvm_ioreq_init(d);
 
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 074cba89da..c3b68eb257 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -25,6 +25,7 @@
 #include <xen/trace.h>
 #include <xen/event.h>
 #include <xen/hypercall.h>
+#include <xen/vpci.h>
 #include <asm/current.h>
 #include <asm/cpufeature.h>
 #include <asm/processor.h>
@@ -260,7 +261,7 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
                                  unsigned int *bus, unsigned int *slot,
                                  unsigned int *func)
 {
-    unsigned long bdf;
+    unsigned int bdf;
 
     ASSERT(CF8_ENABLED(cf8));
 
@@ -275,6 +276,121 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
     return CF8_ADDR_LO(cf8) | (addr & 3);
 }
 
+/* Do some sanity checks. */
+static bool vpci_access_allowed(unsigned int reg, unsigned int len)
+{
+    /* Check access size. */
+    if ( len != 1 && len != 2 && len != 4 )
+        return false;
+
+    /* Check that access is size aligned. */
+    if ( (reg & (len - 1)) )
+        return false;
+
+    return true;
+}
+
+/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
+static bool vpci_portio_accept(const struct hvm_io_handler *handler,
+                               const ioreq_t *p)
+{
+    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & ~3) == 0xcfc;
+}
+
+static int vpci_portio_read(const struct hvm_io_handler *handler,
+                            uint64_t addr, uint32_t size, uint64_t *data)
+{
+    struct domain *d = current->domain;
+    unsigned int bus, slot, func, reg;
+
+    *data = ~(uint64_t)0;
+
+    vpci_rlock(d);
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        *data = d->arch.hvm_domain.pci_cf8;
+        vpci_runlock(d);
+        return X86EMUL_OKAY;
+    }
+    if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
+    {
+        vpci_runlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    reg = hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &slot,
+                              &func);
+
+    if ( !vpci_access_allowed(reg, size) )
+    {
+        vpci_runlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    *data = vpci_read(0, bus, slot, func, reg, size);
+    vpci_runlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static int vpci_portio_write(const struct hvm_io_handler *handler,
+                             uint64_t addr, uint32_t size, uint64_t data)
+{
+    struct domain *d = current->domain;
+    unsigned int bus, slot, func, reg;
+
+    vpci_wlock(d);
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        d->arch.hvm_domain.pci_cf8 = data;
+        vpci_wunlock(d);
+        return X86EMUL_OKAY;
+    }
+    if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
+    {
+        vpci_wunlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    reg = hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &slot,
+                              &func);
+
+    if ( !vpci_access_allowed(reg, size) )
+    {
+        vpci_wunlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    vpci_write(0, bus, slot, func, reg, size, data);
+    vpci_wunlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_io_ops vpci_portio_ops = {
+    .accept = vpci_portio_accept,
+    .read = vpci_portio_read,
+    .write = vpci_portio_write,
+};
+
+void register_vpci_portio_handler(struct domain *d)
+{
+    struct hvm_io_handler *handler;
+
+    if ( !has_vpci(d) )
+        return;
+
+    handler = hvm_next_io_handler(d);
+    if ( !handler )
+        return;
+
+    rwlock_init(&d->arch.hvm_domain.vpci_lock);
+    handler->type = IOREQ_TYPE_PIO;
+    handler->ops = &vpci_portio_ops;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index db5df6956d..5b2c0e3fc3 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1566,7 +1566,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
         domcr_flags |= DOMCRF_hvm |
                        ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
                          DOMCRF_hap : 0);
-        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
+        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
+                                 XEN_X86_EMU_VPCI;
     }
 
     /* Create initial domain 0. */
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index ff08bbe42a..af1b30cb2b 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -76,6 +76,11 @@ SECTIONS
 
   __2M_rodata_start = .;       /* Start of 2M superpages, mapped RO. */
   .rodata : {
+#if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM)
+       __start_vpci_array = .;
+       *(.rodata.vpci)
+       __end_vpci_array = .;
+#endif
        _srodata = .;
        /* Bug frames table */
        __start_bug_frames = .;
@@ -167,6 +172,11 @@ SECTIONS
        _einittext = .;
   } :text
   .init.data : {
+#if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM)
+       __start_vpci_array = .;
+       *(.init.rodata.vpci)
+       __end_vpci_array = .;
+#endif
        *(.init.rodata)
        *(.init.rodata.rel)
        *(.init.rodata.str*)
diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile
index 19391802a8..d51c766453 100644
--- a/xen/drivers/Makefile
+++ b/xen/drivers/Makefile
@@ -1,6 +1,6 @@
 subdir-y += char
 subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq
-subdir-$(CONFIG_HAS_PCI) += pci
+subdir-$(CONFIG_HAS_PCI) += pci vpci
 subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough
 subdir-$(CONFIG_ACPI) += acpi
 subdir-$(CONFIG_VIDEO) += video
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 27bdb7163c..54326cf0b8 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -30,6 +30,7 @@
 #include <xen/radix-tree.h>
 #include <xen/softirq.h>
 #include <xen/tasklet.h>
+#include <xen/vpci.h>
 #include <xsm/xsm.h>
 #include <asm/msi.h>
 #include "ats.h"
@@ -1030,9 +1031,10 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
                                                 struct pci_dev *pdev)
 {
     u8 devfn = pdev->devfn;
+    int err;
 
     do {
-        int err = ctxt->handler(devfn, pdev);
+        err = ctxt->handler(devfn, pdev);
 
         if ( err )
         {
@@ -1045,6 +1047,11 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
         devfn += pdev->phantom_stride;
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
+
+    err = vpci_add_handlers(pdev);
+    if ( err )
+        printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
+               ctxt->d->domain_id, err);
 }
 
 static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg)
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
new file mode 100644
index 0000000000..840a906470
--- /dev/null
+++ b/xen/drivers/vpci/Makefile
@@ -0,0 +1 @@
+obj-y += vpci.o
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
new file mode 100644
index 0000000000..f63de97e89
--- /dev/null
+++ b/xen/drivers/vpci/vpci.c
@@ -0,0 +1,443 @@
+/*
+ * Generic functionality for handling accesses to the PCI configuration space
+ * from guests.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+
+extern vpci_register_init_t *const __start_vpci_array[];
+extern vpci_register_init_t *const __end_vpci_array[];
+#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
+
+/* Internal struct to store the emulated PCI registers. */
+struct vpci_register {
+    vpci_read_t *read;
+    vpci_write_t *write;
+    unsigned int size;
+    unsigned int offset;
+    void *private;
+    struct list_head node;
+};
+
+int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
+{
+    unsigned int i;
+    int rc = 0;
+
+    if ( !has_vpci(pdev->domain) )
+        return 0;
+
+    pdev->vpci = xzalloc(struct vpci);
+    if ( !pdev->vpci )
+        return -ENOMEM;
+
+    INIT_LIST_HEAD(&pdev->vpci->handlers);
+
+    for ( i = 0; i < NUM_VPCI_INIT; i++ )
+    {
+        rc = __start_vpci_array[i](pdev);
+        if ( rc )
+            break;
+    }
+
+    if ( rc )
+    {
+        while ( !list_empty(&pdev->vpci->handlers) )
+        {
+            struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
+                                                       struct vpci_register,
+                                                       node);
+
+            list_del(&r->node);
+            xfree(r);
+        }
+        xfree(pdev->vpci);
+    }
+
+    return rc;
+}
+
+static int vpci_register_cmp(const struct vpci_register *r1,
+                             const struct vpci_register *r2)
+{
+    /* Return 0 if registers overlap. */
+    if ( r1->offset < r2->offset + r2->size &&
+         r2->offset < r1->offset + r1->size )
+        return 0;
+    if ( r1->offset < r2->offset )
+        return -1;
+    if ( r1->offset > r2->offset )
+        return 1;
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
+/* Dummy hooks, writes are ignored, reads return 1's */
+static uint32_t vpci_ignored_read(struct pci_dev *pdev, unsigned int reg,
+                                  const void *data)
+{
+    return ~(uint32_t)0;
+}
+
+static void vpci_ignored_write(struct pci_dev *pdev, unsigned int reg,
+                               uint32_t val, void *data)
+{
+}
+
+int vpci_add_register(const struct pci_dev *pdev, vpci_read_t *read_handler,
+                      vpci_write_t *write_handler, unsigned int offset,
+                      unsigned int size, void *data)
+{
+    struct list_head *prev;
+    struct vpci_register *r;
+
+    /* Some sanity checks. */
+    if ( (size != 1 && size != 2 && size != 4) ||
+         offset >= PCI_CFG_SPACE_EXP_SIZE || (offset & (size - 1)) ||
+         (!read_handler && !write_handler) )
+        return -EINVAL;
+
+    r = xmalloc(struct vpci_register);
+    if ( !r )
+        return -ENOMEM;
+
+    r->read = read_handler ?: vpci_ignored_read;
+    r->write = write_handler ?: vpci_ignored_write;
+    r->size = size;
+    r->offset = offset;
+    r->private = data;
+
+    vpci_wlock(pdev->domain);
+
+    /* The list of handlers must be keep sorted at all times. */
+    list_for_each ( prev, &pdev->vpci->handlers )
+    {
+        const struct vpci_register *this =
+            list_entry(prev, const struct vpci_register, node);
+        int cmp = vpci_register_cmp(r, this);
+
+        if ( cmp < 0 )
+            break;
+        if ( cmp == 0 )
+        {
+            vpci_wunlock(pdev->domain);
+            xfree(r);
+            return -EEXIST;
+        }
+    }
+
+    list_add_tail(&r->node, prev);
+    vpci_wunlock(pdev->domain);
+
+    return 0;
+}
+
+int vpci_remove_register(const struct pci_dev *pdev, unsigned int offset,
+                         unsigned int size)
+{
+    const struct vpci_register r = { .offset = offset, .size = size };
+    struct vpci_register *rm;
+
+    vpci_wlock(pdev->domain);
+    list_for_each_entry ( rm, &pdev->vpci->handlers, node )
+    {
+        int cmp = vpci_register_cmp(&r, rm);
+
+        /*
+         * NB: do not use a switch so that we can use break to
+         * get out of the list loop earlier if required.
+         */
+        if ( !cmp && rm->offset == offset && rm->size == size )
+        {
+            list_del(&rm->node);
+            vpci_wunlock(pdev->domain);
+            xfree(rm);
+            return 0;
+        }
+        if ( cmp <= 0 )
+            break;
+    }
+    vpci_wunlock(pdev->domain);
+
+    return -ENOENT;
+}
+
+/* Wrappers for performing reads/writes to the underlying hardware. */
+static uint32_t vpci_read_hw(unsigned int seg, unsigned int bus,
+                             unsigned int slot, unsigned int func,
+                             unsigned int reg, unsigned int size)
+{
+    uint32_t data;
+
+    switch ( size )
+    {
+    case 4:
+        data = pci_conf_read32(seg, bus, slot, func, reg);
+        break;
+    case 3:
+        /*
+         * This is possible because a 4byte read can have 1byte trapped and
+         * the rest passed-through.
+         */
+        if ( reg & 1 )
+        {
+            data = pci_conf_read8(seg, bus, slot, func, reg);
+            data |= pci_conf_read16(seg, bus, slot, func, reg + 1) << 8;
+        }
+        else
+        {
+            data = pci_conf_read16(seg, bus, slot, func, reg);
+            data |= pci_conf_read8(seg, bus, slot, func, reg + 2) << 16;
+        }
+        break;
+    case 2:
+        data = pci_conf_read16(seg, bus, slot, func, reg);
+        break;
+    case 1:
+        data = pci_conf_read8(seg, bus, slot, func, reg);
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        data = ~(uint32_t)0;
+        break;
+    }
+
+    return data;
+}
+
+static void vpci_write_hw(unsigned int seg, unsigned int bus,
+                          unsigned int slot, unsigned int func,
+                          unsigned int reg, unsigned int size, uint32_t data)
+{
+    switch ( size )
+    {
+    case 4:
+        pci_conf_write32(seg, bus, slot, func, reg, data);
+        break;
+    case 3:
+        /*
+         * This is possible because a 4byte write can have 1byte trapped and
+         * the rest passed-through.
+         */
+        if ( reg & 1 )
+        {
+            pci_conf_write8(seg, bus, slot, func, reg, data);
+            pci_conf_write16(seg, bus, slot, func, reg + 1, data >> 8);
+        }
+        else
+        {
+            pci_conf_write16(seg, bus, slot, func, reg, data);
+            pci_conf_write8(seg, bus, slot, func, reg + 2, data >> 16);
+        }
+        break;
+    case 2:
+        pci_conf_write16(seg, bus, slot, func, reg, data);
+        break;
+    case 1:
+        pci_conf_write8(seg, bus, slot, func, reg, data);
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
+/*
+ * Merge new data into a partial result.
+ *
+ * Zero the bytes of 'data' from [offset, offset + size), and
+ * merge the value found in 'new' from [0, offset) left shifted
+ * by 'offset'.
+ */
+static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
+                             unsigned int offset)
+{
+    uint32_t mask = 0xffffffff >> (32 - 8 * size);
+
+    return (data & ~(mask << (offset * 8))) | ((new & mask) << (offset * 8));
+}
+
+uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
+                   unsigned int func, unsigned int reg, unsigned int size)
+{
+    struct domain *d = current->domain;
+    struct pci_dev *pdev;
+    const struct vpci_register *r;
+    unsigned int data_offset = 0;
+    uint32_t data = ~(uint32_t)0;
+
+    ASSERT(vpci_rlocked(d));
+
+    /* Find the PCI dev matching the address. */
+    pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func));
+    if ( !pdev )
+        return vpci_read_hw(seg, bus, slot, func, reg, size);
+
+    /* Read from the hardware or the emulated register handlers. */
+    list_for_each_entry ( r, &pdev->vpci->handlers, node )
+    {
+        const struct vpci_register emu = {
+            .offset = reg + data_offset,
+            .size = size - data_offset
+        };
+        int cmp = vpci_register_cmp(&emu, r);
+        uint32_t val;
+        unsigned int read_size;
+
+        if ( cmp < 0 )
+            break;
+        if ( cmp > 0 )
+            continue;
+
+        if ( emu.offset < r->offset )
+        {
+            /* Heading gap, read partial content from hardware. */
+            read_size = r->offset - emu.offset;
+            val = vpci_read_hw(seg, bus, slot, func, emu.offset, read_size);
+            data = merge_result(data, val, read_size, data_offset);
+            data_offset += read_size;
+        }
+
+        val = r->read(pdev, r->offset, r->private);
+
+        /* Check if the read is in the middle of a register. */
+        if ( r->offset < emu.offset )
+            val >>= (emu.offset - r->offset) * 8;
+
+        /* Find the intersection size between the two sets. */
+        read_size = min(emu.offset + emu.size, r->offset + r->size) -
+                    max(emu.offset, r->offset);
+        /* Merge the emulated data into the native read value. */
+        data = merge_result(data, val, read_size, data_offset);
+        data_offset += read_size;
+        if ( data_offset == size )
+            break;
+        ASSERT(data_offset < size);
+    }
+
+    if ( data_offset < size )
+    {
+        /* Tailing gap, read the remaining. */
+        uint32_t tmp_data = vpci_read_hw(seg, bus, slot, func,
+                                         reg + data_offset,
+                                         size - data_offset);
+
+        data = merge_result(data, tmp_data, size - data_offset, data_offset);
+    }
+
+    return data & (0xffffffff >> (32 - 8 * size));
+}
+
+/*
+ * Perform a maybe partial write to a register.
+ *
+ * Note that this will only work for simple registers, if Xen needs to
+ * trap accesses to rw1c registers (like the status PCI header register)
+ * the logic in vpci_write will have to be expanded in order to correctly
+ * deal with them.
+ */
+static void vpci_write_helper(struct pci_dev *pdev,
+                              const struct vpci_register *r, unsigned int size,
+                              unsigned int offset, uint32_t data)
+{
+    ASSERT(size <= r->size);
+
+    if ( size != r->size )
+    {
+        uint32_t val;
+
+        val = r->read(pdev, r->offset, r->private);
+        data = merge_result(val, data, size, offset);
+    }
+
+    r->write(pdev, r->offset, data & (0xffffffff >> (32 - 8 * r->size)),
+             r->private);
+}
+
+void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
+                unsigned int func, unsigned int reg, unsigned int size,
+                uint32_t data)
+{
+    struct domain *d = current->domain;
+    struct pci_dev *pdev;
+    const struct vpci_register *r;
+    unsigned int data_offset = 0;
+
+    ASSERT(vpci_wlocked(d));
+
+    /*
+     * Find the PCI dev matching the address.
+     * Passthrough everything that's not trapped.
+     */
+    pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func));
+    if ( !pdev )
+    {
+        vpci_write_hw(seg, bus, slot, func, reg, size, data);
+        return;
+    }
+
+    /* Write the value to the hardware or emulated registers. */
+    list_for_each_entry ( r, &pdev->vpci->handlers, node )
+    {
+        const struct vpci_register emu = {
+            .offset = reg + data_offset,
+            .size = size - data_offset
+        };
+        int cmp = vpci_register_cmp(&emu, r);
+        unsigned int write_size;
+
+        if ( cmp < 0 )
+            break;
+        if ( cmp > 0 )
+            continue;
+
+        if ( emu.offset < r->offset )
+        {
+            /* Heading gap, write partial content to hardware. */
+            vpci_write_hw(seg, bus, slot, func, emu.offset,
+                          r->offset - emu.offset, data >> (data_offset * 8));
+            data_offset += r->offset - emu.offset;
+        }
+
+        /* Find the intersection size between the two sets. */
+        write_size = min(emu.offset + emu.size, r->offset + r->size) -
+                     max(emu.offset, r->offset);
+        vpci_write_helper(pdev, r, write_size, reg + data_offset - r->offset,
+                          data >> (data_offset * 8));
+        data_offset += write_size;
+        if ( data_offset == size )
+            break;
+        ASSERT(data_offset < size);
+    }
+
+    if ( data_offset < size )
+        /* Tailing gap, write the remaining. */
+        vpci_write_hw(seg, bus, slot, func, reg + data_offset,
+                      size - data_offset, data >> (data_offset * 8));
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index c10522b7f5..ec14343b27 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -427,6 +427,7 @@ struct arch_domain
 #define has_vpit(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_PIT))
 #define has_pirq(d)        (!!((d)->arch.emulation_flags & \
                             XEN_X86_EMU_USE_PIRQ))
+#define has_vpci(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_VPCI))
 
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
 
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index d2899c9bb2..3a54d50606 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -184,6 +184,9 @@ struct hvm_domain {
     /* List of guest to machine IO ports mapping. */
     struct list_head g2m_ioport_list;
 
+    /* Lock for the PCI emulation layer (vPCI). */
+    rwlock_t vpci_lock;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 51659b6c7f..01322a2e21 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -160,6 +160,9 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
  */
 void register_g2m_portio_handler(struct domain *d);
 
+/* HVM port IO handler for PCI accesses. */
+void register_vpci_portio_handler(struct domain *d);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index f21332e897..86a1a09a8d 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -295,12 +295,15 @@ struct xen_arch_domainconfig {
 #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
 #define _XEN_X86_EMU_USE_PIRQ       9
 #define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
+#define _XEN_X86_EMU_VPCI           10
+#define XEN_X86_EMU_VPCI            (1U<<_XEN_X86_EMU_VPCI)
 
 #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
                                      XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
                                      XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
                                      XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
-                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
+                                     XEN_X86_EMU_VPCI)
     uint32_t emulation_flags;
 };
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index ea6a66b248..ad5d3ca031 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -88,6 +88,9 @@ struct pci_dev {
 #define PT_FAULT_THRESHOLD 10
     } fault;
     u64 vf_rlen[6];
+
+    /* Data for vPCI. */
+    struct vpci *vpci;
 };
 
 #define for_each_pdev(domain, pdev) \
diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h
index ecd6124d91..cc4ee3b83e 100644
--- a/xen/include/xen/pci_regs.h
+++ b/xen/include/xen/pci_regs.h
@@ -23,6 +23,14 @@
 #define LINUX_PCI_REGS_H
 
 /*
+ * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
+ * configuration space.  PCI-X Mode 2 and PCIe devices have 4096 bytes of
+ * configuration space.
+ */
+#define PCI_CFG_SPACE_SIZE	256
+#define PCI_CFG_SPACE_EXP_SIZE	4096
+
+/*
  * Under PCI, each device has 256 bytes of configuration address space,
  * of which the first 64 bytes are standardized as follows:
  */
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
new file mode 100644
index 0000000000..12f7287d7b
--- /dev/null
+++ b/xen/include/xen/vpci.h
@@ -0,0 +1,80 @@
+#ifndef _VPCI_
+#define _VPCI_
+
+#include <xen/pci.h>
+#include <xen/types.h>
+#include <xen/list.h>
+
+/*
+ * Helpers for locking/unlocking.
+ *
+ * NB: the recursive variants are used so that spin_is_locked
+ * returns whether the lock is hold by the current CPU (instead
+ * of just returning whether the lock is hold by any CPU).
+ */
+#define vpci_rlock(d) read_lock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_wlock(d) write_lock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_runlock(d) read_unlock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_wunlock(d) write_unlock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_rlocked(d) rw_is_locked(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_wlocked(d) rw_is_write_locked(&(d)->arch.hvm_domain.vpci_lock)
+
+/*
+ * The vPCI handlers will never be called concurrently for the same domain, it
+ * is guaranteed that the vpci domain lock will always be locked when calling
+ * any handler.
+ */
+typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg,
+                             const void *data);
+
+typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg,
+                          uint32_t val, void *data);
+
+typedef int vpci_register_init_t(struct pci_dev *dev);
+
+#ifdef CONFIG_LATE_HWDOM
+#define VPCI_SECTION ".rodata.vpci"
+#else
+#define VPCI_SECTION ".init.rodata.vpci"
+#endif
+
+#define REGISTER_VPCI_INIT(x)                   \
+  static vpci_register_init_t *const x##_entry  \
+               __used_section(VPCI_SECTION) = x
+
+/* Add vPCI handlers to device. */
+int __must_check vpci_add_handlers(struct pci_dev *dev);
+
+/* Add/remove a register handler. */
+int __must_check vpci_add_register(const struct pci_dev *pdev,
+                                   vpci_read_t *read_handler,
+                                   vpci_write_t *write_handler,
+                                   unsigned int offset, unsigned int size,
+                                   void *data);
+int __must_check vpci_remove_register(const struct pci_dev *pdev,
+                                      unsigned int offset,
+                                      unsigned int size);
+
+/* Generic read/write handlers for the PCI config space. */
+uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
+                   unsigned int func, unsigned int reg, unsigned int size);
+void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
+                unsigned int func, unsigned int reg, unsigned int size,
+                uint32_t data);
+
+struct vpci {
+    /* List of vPCI handlers for a device. */
+    struct list_head handlers;
+};
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 03/11] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
  2017-08-14 14:28 ` [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr Roger Pau Monne
  2017-08-14 14:28 ` [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-08-22 12:11   ` Paul Durrant
  2017-09-04 15:58   ` Jan Beulich
  2017-08-14 14:28 ` [PATCH v5 04/11] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Introduce a set of handlers for the accesses to the MMCFG areas. Those
areas are setup based on the contents of the hardware MMCFG tables,
and the list of handled MMCFG areas is stored inside of the hvm_domain
struct.

The read/writes are forwarded to the generic vpci handlers once the
address is decoded in order to obtain the device and register the
guest is trying to access.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v4:
 - Change the attribute of pvh_setup_mmcfg to __hwdom_init.
 - Try to add as many MMCFG regions as possible, even if one fails to
   add.
 - Change some fields of the hvm_mmcfg struct: turn size into a
   unsigned int, segment into uint16_t and bus into uint8_t.
 - Convert some address parameters from unsigned long to paddr_t for
   consistency.
 - Make vpci_mmcfg_decode_addr return the decoded register in the
   return of the function.
 - Introduce a new macro to convert a MMCFG address into a BDF, and
   use it in vpci_mmcfg_decode_addr to clarify the logic.
 - In vpci_mmcfg_{read/write} unify the logic for 8B accesses and
   smaller ones.
 - Add the __hwdom_init attribute to register_vpci_mmcfg_handler.
 - Test that reg + size doesn't cross a device boundary.

Changes since v3:
 - Propagate changes from previous patches: drop xen_ prefix for vpci
   functions, pass slot and func instead of devfn and fix the error
   paths of the MMCFG handlers.
 - s/ecam/mmcfg/.
 - Move the destroy code to a separate function, so the hvm_mmcfg
   struct can be private to hvm/io.c.
 - Constify the return of vpci_mmcfg_find.
 - Use d instead of v->domain in vpci_mmcfg_accept.
 - Allow 8byte accesses to the mmcfg.

Changes since v1:
 - Added locking.
---
 xen/arch/x86/hvm/dom0_build.c    |  22 +++++
 xen/arch/x86/hvm/hvm.c           |   3 +
 xen/arch/x86/hvm/io.c            | 183 ++++++++++++++++++++++++++++++++++++++-
 xen/include/asm-x86/hvm/domain.h |   3 +
 xen/include/asm-x86/hvm/io.h     |   7 ++
 xen/include/asm-x86/pci.h        |   2 +
 6 files changed, 219 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 0e7d06be95..04a8682d33 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -38,6 +38,8 @@
 #include <public/hvm/hvm_info_table.h>
 #include <public/hvm/hvm_vcpu.h>
 
+#include "../x86_64/mmconfig.h"
+
 /*
  * Have the TSS cover the ISA port range, which makes it
  * - 104 bytes base structure
@@ -1041,6 +1043,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
     return 0;
 }
 
+static void __hwdom_init pvh_setup_mmcfg(struct domain *d)
+{
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i < pci_mmcfg_config_num; i++ )
+    {
+        rc = register_vpci_mmcfg_handler(d, pci_mmcfg_config[i].address,
+                                         pci_mmcfg_config[i].start_bus_number,
+                                         pci_mmcfg_config[i].end_bus_number,
+                                         pci_mmcfg_config[i].pci_segment);
+        if ( rc )
+            printk("Unable to setup MMCFG handler at %#lx for segment %u\n",
+                   pci_mmcfg_config[i].address,
+                   pci_mmcfg_config[i].pci_segment);
+    }
+}
+
 int __init dom0_construct_pvh(struct domain *d, const module_t *image,
                               unsigned long image_headroom,
                               module_t *initrd,
@@ -1090,6 +1110,8 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
         return rc;
     }
 
+    pvh_setup_mmcfg(d);
+
     panic("Building a PVHv2 Dom0 is not yet supported.");
     return 0;
 }
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index cc73df8dc7..3168973820 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -583,6 +583,7 @@ int hvm_domain_initialise(struct domain *d, unsigned long domcr_flags,
     spin_lock_init(&d->arch.hvm_domain.write_map.lock);
     INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
+    INIT_LIST_HEAD(&d->arch.hvm_domain.mmcfg_regions);
 
     rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
     if ( rc )
@@ -728,6 +729,8 @@ void hvm_domain_destroy(struct domain *d)
         list_del(&ioport->list);
         xfree(ioport);
     }
+
+    destroy_vpci_mmcfg(&d->arch.hvm_domain.mmcfg_regions);
 }
 
 static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h)
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index c3b68eb257..2845dc5b48 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -280,7 +280,7 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
 static bool vpci_access_allowed(unsigned int reg, unsigned int len)
 {
     /* Check access size. */
-    if ( len != 1 && len != 2 && len != 4 )
+    if ( len != 1 && len != 2 && len != 4 && len != 8 )
         return false;
 
     /* Check that access is size aligned. */
@@ -391,6 +391,187 @@ void register_vpci_portio_handler(struct domain *d)
     handler->ops = &vpci_portio_ops;
 }
 
+struct hvm_mmcfg {
+    struct list_head next;
+    paddr_t addr;
+    unsigned int size;
+    uint16_t segment;
+    int8_t bus;
+};
+
+/* Handlers to trap PCI MMCFG config accesses. */
+static const struct hvm_mmcfg *vpci_mmcfg_find(struct domain *d, paddr_t addr)
+{
+    const struct hvm_mmcfg *mmcfg;
+
+    list_for_each_entry ( mmcfg, &d->arch.hvm_domain.mmcfg_regions, next )
+        if ( addr >= mmcfg->addr && addr < mmcfg->addr + mmcfg->size )
+            return mmcfg;
+
+    return NULL;
+}
+
+static unsigned int vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
+                                           paddr_t addr, unsigned int *bus,
+                                           unsigned int *slot,
+                                           unsigned int *func)
+{
+    unsigned int bdf;
+
+    addr -= mmcfg->addr;
+    bdf = MMCFG_BDF(addr);
+    *bus = PCI_BUS(bdf) + mmcfg->bus;
+    *slot = PCI_SLOT(bdf);
+    *func = PCI_FUNC(bdf);
+
+    return addr & (PCI_CFG_SPACE_EXP_SIZE - 1);
+}
+
+static int vpci_mmcfg_accept(struct vcpu *v, unsigned long addr)
+{
+    struct domain *d = v->domain;
+    bool found;
+
+    vpci_rlock(d);
+    found = vpci_mmcfg_find(d, addr);
+    vpci_runlock(d);
+
+    return found;
+}
+
+static int vpci_mmcfg_read(struct vcpu *v, unsigned long addr,
+                           unsigned int len, unsigned long *data)
+{
+    struct domain *d = v->domain;
+    const struct hvm_mmcfg *mmcfg;
+    unsigned int bus, slot, func, reg;
+
+    *data = ~0ul;
+
+    vpci_rlock(d);
+    mmcfg = vpci_mmcfg_find(d, addr);
+    if ( !mmcfg )
+    {
+        vpci_runlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    reg = vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func);
+
+    if ( !vpci_access_allowed(reg, len) ||
+         (reg + len) > PCI_CFG_SPACE_EXP_SIZE )
+    {
+        vpci_runlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    /*
+     * According to the PCIe 3.1A specification:
+     *  - Configuration Reads and Writes must usually be DWORD or smaller
+     *    in size.
+     *  - Because Root Complex implementations are not required to support
+     *    accesses to a RCRB that cross DW boundaries [...] software
+     *    should take care not to cause the generation of such accesses
+     *    when accessing a RCRB unless the Root Complex will support the
+     *    access.
+     *  Xen however supports 8byte accesses by splitting them into two
+     *  4byte accesses.
+     */
+    *data = vpci_read(mmcfg->segment, bus, slot, func, reg, min(4u, len));
+    if ( len == 8 )
+        *data |= (uint64_t)vpci_read(mmcfg->segment, bus, slot, func,
+                                     reg + 4, 4) << 32;
+    vpci_runlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static int vpci_mmcfg_write(struct vcpu *v, unsigned long addr,
+                            unsigned int len, unsigned long data)
+{
+    struct domain *d = v->domain;
+    const struct hvm_mmcfg *mmcfg;
+    unsigned int bus, slot, func, reg;
+
+    vpci_wlock(d);
+    mmcfg = vpci_mmcfg_find(d, addr);
+    if ( !mmcfg )
+    {
+        vpci_wunlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    reg = vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func);
+
+    if ( !vpci_access_allowed(reg, len) ||
+         (reg + len) > PCI_CFG_SPACE_EXP_SIZE )
+    {
+        vpci_wunlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    vpci_write(mmcfg->segment, bus, slot, func, reg, min(4u, len), data);
+    if ( len == 8 )
+        vpci_write(mmcfg->segment, bus, slot, func, reg + 4, 4, data >> 32);
+    vpci_wunlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops vpci_mmcfg_ops = {
+    .check = vpci_mmcfg_accept,
+    .read = vpci_mmcfg_read,
+    .write = vpci_mmcfg_write,
+};
+
+int __hwdom_init register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
+                                             unsigned int start_bus,
+                                             unsigned int end_bus,
+                                             unsigned int seg)
+{
+    struct hvm_mmcfg *mmcfg;
+
+    ASSERT(is_hardware_domain(d));
+
+    vpci_wlock(d);
+    if ( vpci_mmcfg_find(d, addr) )
+    {
+        vpci_wunlock(d);
+        return -EEXIST;
+    }
+
+    mmcfg = xmalloc(struct hvm_mmcfg);
+    if ( !mmcfg )
+    {
+        vpci_wunlock(d);
+        return -ENOMEM;
+    }
+
+    if ( list_empty(&d->arch.hvm_domain.mmcfg_regions) )
+        register_mmio_handler(d, &vpci_mmcfg_ops);
+
+    mmcfg->addr = addr + (start_bus << 20);
+    mmcfg->bus = start_bus;
+    mmcfg->segment = seg;
+    mmcfg->size = (end_bus - start_bus + 1) << 20;
+    list_add(&mmcfg->next, &d->arch.hvm_domain.mmcfg_regions);
+    vpci_wunlock(d);
+
+    return 0;
+}
+
+void destroy_vpci_mmcfg(struct list_head *domain_mmcfg)
+{
+    while ( !list_empty(domain_mmcfg) )
+    {
+        struct hvm_mmcfg *mmcfg = list_first_entry(domain_mmcfg,
+                                                   struct hvm_mmcfg, next);
+
+        list_del(&mmcfg->next);
+        xfree(mmcfg);
+    }
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index 3a54d50606..e8dc01bc3e 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -187,6 +187,9 @@ struct hvm_domain {
     /* Lock for the PCI emulation layer (vPCI). */
     rwlock_t vpci_lock;
 
+    /* List of MMCFG regions trapped by Xen. */
+    struct list_head mmcfg_regions;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 01322a2e21..837046026c 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -163,6 +163,13 @@ void register_g2m_portio_handler(struct domain *d);
 /* HVM port IO handler for PCI accesses. */
 void register_vpci_portio_handler(struct domain *d);
 
+/* HVM MMIO handler for PCI MMCFG accesses. */
+int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
+                                unsigned int start_bus, unsigned int end_bus,
+                                unsigned int seg);
+/* Destroy tracked MMCFG areas. */
+void destroy_vpci_mmcfg(struct list_head *domain_mmcfg);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/include/asm-x86/pci.h b/xen/include/asm-x86/pci.h
index 36801d317b..ac16c8fd5d 100644
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -6,6 +6,8 @@
 #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
 #define CF8_ENABLED(cf8) (!!((cf8) & 0x80000000))
 
+#define MMCFG_BDF(addr)  ( ((addr) & 0x0ffff000) >> 12)
+
 #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \
                         || id == 0x01268086 || id == 0x01028086 \
                         || id == 0x01128086 || id == 0x01228086 \
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 04/11] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
                   ` (2 preceding siblings ...)
  2017-08-14 14:28 ` [PATCH v5 03/11] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-09-05 14:57   ` Jan Beulich
  2017-08-14 14:28 ` [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

So that MMCFG regions not present in the MCFG ACPI table can be added
at run time by the hardware domain.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - Change the hardware_domain check in hvm_physdev_op to a vpci check.
 - Only register the MMCFG area, but don't scan it.

Changes since v3:
 - New in this version.
---
 xen/arch/x86/hvm/hypercall.c | 4 ++++
 xen/arch/x86/hvm/io.c        | 7 +++----
 xen/arch/x86/physdev.c       | 9 +++++++++
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index e7238ce293..369abfd262 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -89,6 +89,10 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( !has_pirq(curr->domain) )
             return -ENOSYS;
         break;
+    case PHYSDEVOP_pci_mmcfg_reserved:
+        if ( !has_vpci(curr->domain) )
+            return -ENOSYS;
+        break;
     }
 
     if ( !curr->hcall_compat )
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 2845dc5b48..93c3b21a47 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -524,10 +524,9 @@ static const struct hvm_mmio_ops vpci_mmcfg_ops = {
     .write = vpci_mmcfg_write,
 };
 
-int __hwdom_init register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
-                                             unsigned int start_bus,
-                                             unsigned int end_bus,
-                                             unsigned int seg)
+int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
+                                unsigned int start_bus, unsigned int end_bus,
+                                unsigned int seg)
 {
     struct hvm_mmcfg *mmcfg;
 
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 0eb409758f..10e0a1ad79 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -559,6 +559,15 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         ret = pci_mmcfg_reserved(info.address, info.segment,
                                  info.start_bus, info.end_bus, info.flags);
+        if ( ret || !is_hvm_domain(currd) )
+            break;
+
+        /*
+         * For HVM (PVH) domains try to add the newly found MMCFG to the
+         * domain.
+         */
+        ret = register_vpci_mmcfg_handler(currd, info.address, info.start_bus,
+                                          info.end_bus, info.segment);
         break;
     }
 
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
                   ` (3 preceding siblings ...)
  2017-08-14 14:28 ` [PATCH v5 04/11] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-09-05 15:01   ` Jan Beulich
  2017-09-12 10:53   ` Julien Grall
  2017-08-14 14:28 ` [PATCH v5 06/11] pci: split code to size BARs from pci_add_device Roger Pau Monne
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

And also allow it to do non-identity mappings by adding a new
parameter.

This function will be needed in order to map the BARs from PCI devices
into the Dom0 p2m (and is also used by the x86 Dom0 builder). While
there fix the function to use gfn_t and mfn_t instead of unsigned long
for memory addresses.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - Guard the function with CONFIG_HAS_PCI only.
 - s/non-trival/non-negligible in the comment.
 - Change XENLOG_G_WARNING to XENLOG_WARNING like the original
   function.

Changes since v3:
 - Remove the dummy modify_identity_mmio helper in dom0_build.c
 - Try to make the comment in modify MMIO less scary.
 - Clarify commit message.
 - Only build the function for x86 or if there's PCI support.

Changes since v2:
 - Use mfn_t and gfn_t.
 - Remove stray newline.
---
 xen/arch/x86/hvm/dom0_build.c | 30 ++----------------------------
 xen/common/memory.c           | 40 ++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/p2m-common.h  |  3 +++
 3 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 04a8682d33..c65eb8503f 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -61,32 +61,6 @@ static struct acpi_madt_interrupt_override __initdata *intsrcovr;
 static unsigned int __initdata acpi_nmi_sources;
 static struct acpi_madt_nmi_source __initdata *nmisrc;
 
-static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
-                                       unsigned long nr_pages, const bool map)
-{
-    int rc;
-
-    for ( ; ; )
-    {
-        rc = (map ? map_mmio_regions : unmap_mmio_regions)
-             (d, _gfn(pfn), nr_pages, _mfn(pfn));
-        if ( rc == 0 )
-            break;
-        if ( rc < 0 )
-        {
-            printk(XENLOG_WARNING
-                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
-                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
-            break;
-        }
-        nr_pages -= rc;
-        pfn += rc;
-        process_pending_softirqs();
-    }
-
-    return rc;
-}
-
 /* Populate a HVM memory range using the biggest possible order. */
 static int __init pvh_populate_memory_range(struct domain *d,
                                             unsigned long start,
@@ -397,7 +371,7 @@ static int __init pvh_setup_p2m(struct domain *d)
      * Memory below 1MB is identity mapped.
      * NB: this only makes sense when booted from legacy BIOS.
      */
-    rc = modify_identity_mmio(d, 0, MB1_PAGES, true);
+    rc = modify_mmio(d, _gfn(0), _mfn(0), MB1_PAGES, true);
     if ( rc )
     {
         printk("Failed to identity map low 1MB: %d\n", rc);
@@ -964,7 +938,7 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
         nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
                           d->arch.e820[i].size);
 
-        rc = modify_identity_mmio(d, pfn, nr_pages, true);
+        rc = modify_mmio(d, _gfn(pfn), _mfn(pfn), nr_pages, true);
         if ( rc )
         {
             printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
diff --git a/xen/common/memory.c b/xen/common/memory.c
index b2066db07e..86824edb09 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1465,6 +1465,46 @@ int prepare_ring_for_helper(
     return 0;
 }
 
+#if defined(CONFIG_HAS_PCI)
+int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
+                bool map)
+{
+    int rc;
+
+    /*
+     * ATM this function should only be used by the hardware domain
+     * because it doesn't support preemption/continuation, and as such
+     * can take a non-negligible amount of time. Note that it periodically
+     * calls process_pending_softirqs in order to avoid stalling the system.
+     */
+    ASSERT(is_hardware_domain(d));
+
+    for ( ; ; )
+    {
+        rc = (map ? map_mmio_regions : unmap_mmio_regions)
+             (d, gfn, nr_pages, mfn);
+        if ( rc == 0 )
+            break;
+        if ( rc < 0 )
+        {
+            printk(XENLOG_WARNING
+                   "Failed to %smap [%" PRI_gfn ", %" PRI_gfn ") -> "
+                   "[%" PRI_mfn ", %" PRI_mfn ") for d%d: %d\n",
+                   map ? "" : "un", gfn_x(gfn), gfn_x(gfn_add(gfn, nr_pages)),
+                   mfn_x(mfn), mfn_x(mfn_add(mfn, nr_pages)), d->domain_id,
+                   rc);
+            break;
+        }
+        nr_pages -= rc;
+        mfn = mfn_add(mfn, rc);
+        gfn = gfn_add(gfn, rc);
+        process_pending_softirqs();
+    }
+
+    return rc;
+}
+#endif
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/xen/p2m-common.h b/xen/include/xen/p2m-common.h
index 2b5696cf33..c2f9015ad8 100644
--- a/xen/include/xen/p2m-common.h
+++ b/xen/include/xen/p2m-common.h
@@ -20,4 +20,7 @@ int unmap_mmio_regions(struct domain *d,
                        unsigned long nr,
                        mfn_t mfn);
 
+int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
+                const bool map);
+
 #endif /* _XEN_P2M_COMMON_H */
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 06/11] pci: split code to size BARs from pci_add_device
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
                   ` (4 preceding siblings ...)
  2017-08-14 14:28 ` [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-09-05 15:05   ` Jan Beulich
  2017-08-14 14:28 ` [PATCH v5 07/11] pci: add support to size ROM BARs to pci_size_mem_bar Roger Pau Monne
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky, Roger Pau Monne, Jan Beulich

So that it can be called from outside in order to get the size of regular PCI
BARs. This will be required in order to map the BARs from PCI devices into PVH
Dom0 p2m.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
---
Changes since v4:
 - Restore printing whether the BAR is from a vf.
 - Make the psize pointer parameter not optional.
 - s/u64/uint64_t.
 - Remove some unneeded parentheses.
 - Assert the return value is never 0.

Changes since v3:
 - Rename function to size BARs to pci_size_mem_bar.
 - Change the parameters passed to the function. Pass the position and
   whether the BAR is the last one, instead of the (base, max_bars,
   *index) tuple.
 - Make the function return the number of BARs consumed (1 for 32b, 2
   for 64b BARs).
 - Change the dprintk back to printk.
 - Do not log another error message in pci_add_device in case
   pci_size_mem_bar fails.
---
 xen/drivers/passthrough/pci.c | 91 +++++++++++++++++++++++++++----------------
 xen/include/xen/pci.h         |  3 ++
 2 files changed, 60 insertions(+), 34 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 54326cf0b8..948c227e01 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -592,6 +592,54 @@ static int iommu_add_device(struct pci_dev *pdev);
 static int iommu_enable_device(struct pci_dev *pdev);
 static int iommu_remove_device(struct pci_dev *pdev);
 
+int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
+                     unsigned int func, unsigned int pos, bool last,
+                     uint64_t *paddr, uint64_t *psize, bool vf)
+{
+    uint32_t hi = 0, bar = pci_conf_read32(seg, bus, slot, func, pos);
+    uint64_t addr, size;
+
+    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
+    pci_conf_write32(seg, bus, slot, func, pos, ~0);
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        if ( last )
+        {
+            printk(XENLOG_WARNING
+                   "%sdevice %04x:%02x:%02x.%u with 64-bit %sBAR in last slot\n",
+                   vf ? "SR-IOV " : "", seg, bus, slot, func,
+                   vf ? "vf " : "");
+            return -EINVAL;
+        }
+        hi = pci_conf_read32(seg, bus, slot, func, pos + 4);
+        pci_conf_write32(seg, bus, slot, func, pos + 4, ~0);
+    }
+    size = pci_conf_read32(seg, bus, slot, func, pos) &
+           PCI_BASE_ADDRESS_MEM_MASK;
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        size |= (uint64_t)pci_conf_read32(seg, bus, slot, func, pos + 4) << 32;
+        pci_conf_write32(seg, bus, slot, func, pos + 4, hi);
+    }
+    else if ( size )
+        size |= (uint64_t)~0 << 32;
+    pci_conf_write32(seg, bus, slot, func, pos, bar);
+    size = -size;
+    addr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((uint64_t)hi << 32);
+
+    if ( paddr )
+        *paddr = addr;
+    *psize = size;
+
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+        return 2;
+
+    return 1;
+}
+
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *info, nodeid_t node)
 {
@@ -652,11 +700,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
             unsigned int i;
 
             BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
-            for ( i = 0; i < PCI_SRIOV_NUM_BARS; ++i )
+            for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
             {
                 unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
                 u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
-                u32 hi = 0;
 
                 if ( (bar & PCI_BASE_ADDRESS_SPACE) ==
                      PCI_BASE_ADDRESS_SPACE_IO )
@@ -667,38 +714,14 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
                            seg, bus, slot, func, i);
                     continue;
                 }
-                pci_conf_write32(seg, bus, slot, func, idx, ~0);
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    if ( i >= PCI_SRIOV_NUM_BARS )
-                    {
-                        printk(XENLOG_WARNING
-                               "SR-IOV device %04x:%02x:%02x.%u with 64-bit"
-                               " vf BAR in last slot\n",
-                               seg, bus, slot, func);
-                        break;
-                    }
-                    hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
-                }
-                pdev->vf_rlen[i] = pci_conf_read32(seg, bus, slot, func, idx) &
-                                   PCI_BASE_ADDRESS_MEM_MASK;
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    pdev->vf_rlen[i] |= (u64)pci_conf_read32(seg, bus,
-                                                             slot, func,
-                                                             idx + 4) << 32;
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
-                }
-                else if ( pdev->vf_rlen[i] )
-                    pdev->vf_rlen[i] |= (u64)~0 << 32;
-                pci_conf_write32(seg, bus, slot, func, idx, bar);
-                pdev->vf_rlen[i] = -pdev->vf_rlen[i];
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                    ++i;
+                ret = pci_size_mem_bar(seg, bus, slot, func, idx,
+                                       i == PCI_SRIOV_NUM_BARS - 1, NULL,
+                                       &pdev->vf_rlen[i], true);
+                if ( ret < 0 )
+                    break;
+
+                ASSERT(ret);
+                i += ret;
             }
         }
         else
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index ad5d3ca031..b85e4fa8ad 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -164,6 +164,9 @@ const char *parse_pci(const char *, unsigned int *seg, unsigned int *bus,
                       unsigned int *dev, unsigned int *func);
 const char *parse_pci_seg(const char *, unsigned int *seg, unsigned int *bus,
                           unsigned int *dev, unsigned int *func, bool *def_seg);
+int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
+                     unsigned int func, unsigned int pos, bool last,
+                     uint64_t *addr, uint64_t *size, bool vf);
 
 
 bool_t pcie_aer_get_firmware_first(const struct pci_dev *);
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 07/11] pci: add support to size ROM BARs to pci_size_mem_bar
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
                   ` (5 preceding siblings ...)
  2017-08-14 14:28 ` [PATCH v5 06/11] pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-09-05 15:12   ` Jan Beulich
  2017-08-14 14:28 ` [PATCH v5 08/11] vpci/bars: add handlers to map the BARs Roger Pau Monne
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky, Roger Pau Monne, Jan Beulich

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
---
Changes since v4:
 - New in this version.
---
 xen/drivers/passthrough/pci.c | 24 +++++++++++++-----------
 xen/include/xen/pci.h         |  2 +-
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 948c227e01..33cb774d29 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -594,15 +594,18 @@ static int iommu_remove_device(struct pci_dev *pdev);
 
 int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
                      unsigned int func, unsigned int pos, bool last,
-                     uint64_t *paddr, uint64_t *psize, bool vf)
+                     uint64_t *paddr, uint64_t *psize, bool vf, bool rom)
 {
     uint32_t hi = 0, bar = pci_conf_read32(seg, bus, slot, func, pos);
     uint64_t addr, size;
+    bool is64bits = !rom && (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+                    PCI_BASE_ADDRESS_MEM_TYPE_64;
 
-    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
+    ASSERT(!(rom && vf));
+    ASSERT(rom ||
+           (bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
     pci_conf_write32(seg, bus, slot, func, pos, ~0);
-    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    if ( is64bits )
     {
         if ( last )
         {
@@ -616,9 +619,8 @@ int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
         pci_conf_write32(seg, bus, slot, func, pos + 4, ~0);
     }
     size = pci_conf_read32(seg, bus, slot, func, pos) &
-           PCI_BASE_ADDRESS_MEM_MASK;
-    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+           (rom ? PCI_ROM_ADDRESS_MASK : PCI_BASE_ADDRESS_MEM_MASK);
+    if ( is64bits )
     {
         size |= (uint64_t)pci_conf_read32(seg, bus, slot, func, pos + 4) << 32;
         pci_conf_write32(seg, bus, slot, func, pos + 4, hi);
@@ -627,14 +629,14 @@ int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
         size |= (uint64_t)~0 << 32;
     pci_conf_write32(seg, bus, slot, func, pos, bar);
     size = -size;
-    addr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((uint64_t)hi << 32);
+    addr = (bar & (rom ? PCI_ROM_ADDRESS_MASK : PCI_BASE_ADDRESS_MEM_MASK)) |
+           ((uint64_t)hi << 32);
 
     if ( paddr )
         *paddr = addr;
     *psize = size;
 
-    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    if ( is64bits )
         return 2;
 
     return 1;
@@ -716,7 +718,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
                 }
                 ret = pci_size_mem_bar(seg, bus, slot, func, idx,
                                        i == PCI_SRIOV_NUM_BARS - 1, NULL,
-                                       &pdev->vf_rlen[i], true);
+                                       &pdev->vf_rlen[i], true, false);
                 if ( ret < 0 )
                     break;
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index b85e4fa8ad..72c901be66 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -166,7 +166,7 @@ const char *parse_pci_seg(const char *, unsigned int *seg, unsigned int *bus,
                           unsigned int *dev, unsigned int *func, bool *def_seg);
 int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
                      unsigned int func, unsigned int pos, bool last,
-                     uint64_t *addr, uint64_t *size, bool vf);
+                     uint64_t *addr, uint64_t *size, bool vf, bool rom);
 
 
 bool_t pcie_aer_get_firmware_first(const struct pci_dev *);
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 08/11] vpci/bars: add handlers to map the BARs
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
                   ` (6 preceding siblings ...)
  2017-08-14 14:28 ` [PATCH v5 07/11] pci: add support to size ROM BARs to pci_size_mem_bar Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-09-07  9:53   ` Jan Beulich
  2017-08-14 14:28 ` [PATCH v5 09/11] vpci/msi: add MSI handlers Roger Pau Monne
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Introduce a set of handlers that trap accesses to the PCI BARs and the command
register, in order to snoop BAR sizing and BAR relocation.

The command handler is used to detect changes to bit 2 (response to
memory space accesses), and maps/unmaps the BARs of the device into
the guest p2m. A rangeset is used in order to figure out which memory
to map/unmap. This makes it easier to keep track of the possible
overlaps with other BARs, and will also simplify MSI-X support, where
certain regions of a BAR might be used for the MSI-X table or PBA.

The BAR register handlers are used to detect attempts by the guest to size or
relocate the BARs.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v4:
 - Expand commit message to mention the reason behind the usage of
   rangesets.
 - Fix comment related to the inclusiveness of rangesets.
 - Fix off-by-one error in the calculation of the end of memory
   regions.
 - Store the state of the BAR (mapped/unmapped) in the vpci_bar
   enabled field, previously was only used by ROMs.
 - Fix double negation of return code.
 - Modify vpci_cmd_write so it has a single call to pci_conf_write16.
 - Print a warning when trying to write to the BAR with memory
   decoding enabled (and ignore the write).
 - Remove header_type local variable, it's used only once.
 - Move the read of the command register.
 - Restore previous command register value in the exit paths.
 - Only set address to INVALID_PADDR if the initial BAR value matches
    ~0 & PCI_BASE_ADDRESS_MEM_MASK.
 - Don't disable the enabled bit in the expansion ROM register, memory
   decoding is already disabled and takes precedence.
 - Don't use INVALID_PADDR, just set the initial BAR address to the
   value found in the hardware.
 - Introduce rom_enabled to store the status of the
   PCI_ROM_ADDRESS_ENABLE bit.
 - Reorder fields of the structure to prevent holes.

Changes since v3:
 - Propagate previous changes: drop xen_ prefix and use u8/u16/u32
   instead of the previous half_word/word/double_word.
 - Constify some of the paramerters.
 - s/VPCI_BAR_MEM/VPCI_BAR_MEM32/.
 - Simplify the number of fields stored for each BAR, a single address
   field is stored and contains the address of the BAR both on Xen and
   in the guest.
 - Allow the guest to move the BARs around in the physical memory map.
 - Add support for expansion ROM BARs.
 - Do not cache the value of the command register.
 - Remove a label used in vpci_cmd_write.
 - Fix the calculation of the sizing mask in vpci_bar_write.
 - Check the memory decode bit in order to decide if a BAR is
   positioned or not.
 - Disable memory decoding before sizing the BARs in Xen.
 - When mapping/unmapping BARs check if there's overlap between BARs,
   in order to avoid unmapping memory required by another BAR.
 - Introduce a macro to check whether a BAR is mappable or not.
 - Add a comment regarding the lack of support for SR-IOV.
 - Remove the usage of the GENMASK macro.

Changes since v2:
 - Detect unset BARs and allow the hardware domain to position them.
---
 xen/drivers/vpci/Makefile |   2 +-
 xen/drivers/vpci/header.c | 453 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/vpci.h    |  29 +++
 3 files changed, 483 insertions(+), 1 deletion(-)
 create mode 100644 xen/drivers/vpci/header.c

diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 840a906470..241467212f 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o
+obj-y += vpci.o header.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
new file mode 100644
index 0000000000..9b44c8441a
--- /dev/null
+++ b/xen/drivers/vpci/header.c
@@ -0,0 +1,453 @@
+/*
+ * Generic functionality for handling accesses to the PCI header from the
+ * configuration space.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <xen/p2m-common.h>
+
+#define MAPPABLE_BAR(x)                                                 \
+    ((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||   \
+     (x)->type == VPCI_BAR_ROM)
+
+static struct rangeset *vpci_get_bar_memory(const struct domain *d,
+                                            const struct vpci_bar *map)
+{
+    const struct pci_dev *pdev;
+    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
+    int rc;
+
+    if ( !mem )
+        return ERR_PTR(-ENOMEM);
+
+    /*
+     * Create a rangeset that represents the current BAR memory region
+     * and compare it against all the currently active BAR memory regions.
+     * If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
+     *
+     * NB: the rangeset uses inclusive frame numbers.
+     */
+    rc = rangeset_add_range(mem, PFN_DOWN(map->addr),
+                            PFN_DOWN(map->addr + map->size - 1));
+    if ( rc )
+    {
+        rangeset_destroy(mem);
+        return ERR_PTR(rc);
+    }
+
+    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
+    {
+        unsigned int i;
+
+        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
+        {
+            const struct vpci_bar *bar = &pdev->vpci->header.bars[i];
+            unsigned long start = PFN_DOWN(bar->addr);
+            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+
+            if ( bar == map || !bar->enabled || !MAPPABLE_BAR(bar) ||
+                 !rangeset_overlaps_range(mem, start, end) )
+                continue;
+
+            rc = rangeset_remove_range(mem, start, end);
+            if ( rc )
+            {
+                rangeset_destroy(mem);
+                return ERR_PTR(rc);
+            }
+        }
+    }
+
+    return mem;
+}
+
+struct map_data {
+    struct domain *d;
+    bool map;
+};
+
+static int vpci_map_range(unsigned long s, unsigned long e, void *data)
+{
+    const struct map_data *map = data;
+
+    return modify_mmio(map->d, _gfn(s), _mfn(s), e - s + 1, map->map);
+}
+
+static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
+                           bool map)
+{
+    struct rangeset *mem;
+    struct map_data data = { .d = d, .map = map };
+    int rc;
+
+    ASSERT(MAPPABLE_BAR(bar));
+
+    mem = vpci_get_bar_memory(d, bar);
+    if ( IS_ERR(mem) )
+        return PTR_ERR(mem);
+
+    rc = rangeset_report_ranges(mem, 0, ~0ul, vpci_map_range, &data);
+    rangeset_destroy(mem);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+static int vpci_modify_bars(const struct pci_dev *pdev, const bool map)
+{
+    struct vpci_header *header = &pdev->vpci->header;
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
+        int rc;
+
+        if ( !MAPPABLE_BAR(bar) ||
+             (bar->type == VPCI_BAR_ROM && !bar->rom_enabled) )
+            continue;
+
+        rc = vpci_modify_bar(pdev->domain, bar, map);
+        if ( rc )
+            return rc;
+
+        bar->enabled = map;
+    }
+
+    return 0;
+}
+
+static uint32_t vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
+                              const void *data)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+
+    return pci_conf_read16(seg, bus, slot, func, reg);
+}
+
+static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
+                           uint32_t cmd, void *data)
+{
+    uint16_t current_cmd;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+
+    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
+
+    /*
+     * Let the guest play with all the bits directly except for the
+     * memory decoding one.
+     */
+    if ( (cmd ^ current_cmd) & PCI_COMMAND_MEMORY )
+    {
+        /* Memory space access change. */
+        int rc = vpci_modify_bars(pdev, cmd & PCI_COMMAND_MEMORY);
+
+        if ( rc )
+        {
+            gdprintk(XENLOG_ERR,
+                     "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
+                     seg, bus, slot, func,
+                     cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);
+            return;
+        }
+    }
+
+    pci_conf_write16(seg, bus, slot, func, reg, cmd);
+}
+
+static uint32_t vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
+                              const void *data)
+{
+    const struct vpci_bar *bar = data;
+    uint32_t val;
+    bool hi = false;
+
+    ASSERT(bar->type == VPCI_BAR_MEM32 || bar->type == VPCI_BAR_MEM64_LO ||
+           bar->type == VPCI_BAR_MEM64_HI);
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    if ( bar->sizing )
+        val = ~(bar->size - 1) >> (hi ? 32 : 0);
+    else
+        val = bar->addr >> (hi ? 32 : 0);
+
+    if ( !hi )
+    {
+        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+
+    return val;
+}
+
+static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
+                           uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    bool hi = false;
+
+    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
+         PCI_COMMAND_MEMORY )
+    {
+         gdprintk(XENLOG_WARNING,
+                  "%04x:%02x:%02x.%u: ignored BAR write with memory decoding enabled\n",
+                  seg, bus, slot, func);
+        return;
+    }
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    if ( !hi )
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+
+    /*
+     * The PCI Local Bus Specification suggests writing ~0 to both the high
+     * and the low part of the BAR registers before attempting to read back
+     * the size.
+     *
+     * However real device BARs registers (at least the ones I've tried)
+     * will return the size of the BAR just by having written ~0 to one half
+     * of it, independently of the value of the other half of the register.
+     * Hence here Xen will switch to returning the size as soon as one half
+     * of the BAR register has been written with ~0.
+     */
+    if ( val == (hi ? 0xffffffff : (uint32_t)PCI_BASE_ADDRESS_MEM_MASK) )
+    {
+        bar->sizing = true;
+        return;
+    }
+    bar->sizing = false;
+
+    /* Update the relevant part of the BAR address. */
+    bar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
+    bar->addr |= (uint64_t)val << (hi ? 32 : 0);
+
+    /* Make sure Xen writes back the same value for the BAR RO bits. */
+    if ( !hi )
+        val |= pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                               PCI_FUNC(pdev->devfn), reg) &
+                               ~PCI_BASE_ADDRESS_MEM_MASK;
+    pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), reg, val);
+}
+
+static uint32_t vpci_rom_read(struct pci_dev *pdev, unsigned int reg,
+                              const void *data)
+{
+    const struct vpci_bar *rom = data;
+    uint32_t val;
+
+    val = rom->sizing ? ~(rom->size - 1) : rom->addr;
+    val |= rom->rom_enabled ? PCI_ROM_ADDRESS_ENABLE : 0;
+
+    return val;
+}
+
+static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
+                           uint32_t val, void *data)
+{
+    struct vpci_bar *rom = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+    uint32_t addr = val & PCI_ROM_ADDRESS_MASK;
+
+    if ( addr == (uint32_t)PCI_ROM_ADDRESS_MASK )
+    {
+        rom->sizing = true;
+        return;
+    }
+    rom->sizing = false;
+
+    rom->addr = addr;
+
+    /* Check if ROM BAR should be mapped. */
+    if ( (cmd & PCI_COMMAND_MEMORY) &&
+         rom->enabled != !!(val & PCI_ROM_ADDRESS_ENABLE) &&
+         vpci_modify_bar(pdev->domain, rom, val & PCI_ROM_ADDRESS_ENABLE) )
+        return;
+
+    rom->rom_enabled = val & PCI_ROM_ADDRESS_ENABLE;
+    pci_conf_write32(pdev->seg, pdev->bus, slot, func, reg, val);
+}
+
+static int vpci_init_bars(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t cmd;
+    uint64_t addr, size;
+    unsigned int i, num_bars, rom_reg;
+    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_bar *bars = header->bars;
+    int rc;
+
+    switch ( pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f )
+    {
+    case PCI_HEADER_TYPE_NORMAL:
+        num_bars = 6;
+        rom_reg = PCI_ROM_ADDRESS;
+        break;
+    case PCI_HEADER_TYPE_BRIDGE:
+        num_bars = 2;
+        rom_reg = PCI_ROM_ADDRESS1;
+        break;
+    default:
+        return -EOPNOTSUPP;
+    }
+
+    /* Setup a handler for the command register. */
+    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
+                           2, header);
+    if ( rc )
+        return rc;
+
+    /* Disable memory decoding before sizing. */
+    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+    if ( cmd & PCI_COMMAND_MEMORY )
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
+                         cmd & ~PCI_COMMAND_MEMORY);
+
+    for ( i = 0; i < num_bars; i++ )
+    {
+        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
+        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
+
+        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
+        {
+            bars[i].type = VPCI_BAR_MEM64_HI;
+            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                                   &bars[i]);
+            if ( rc )
+            {
+                pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+                return rc;
+            }
+
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
+        {
+            bars[i].type = VPCI_BAR_IO;
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+             PCI_BASE_ADDRESS_MEM_TYPE_64 )
+            bars[i].type = VPCI_BAR_MEM64_LO;
+        else
+            bars[i].type = VPCI_BAR_MEM32;
+
+        /* Size the BAR and map it. */
+        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
+                              &addr, &size, false, false);
+        if ( rc < 0 )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+
+        if ( size == 0 )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            continue;
+        }
+
+        bars[i].addr = addr;
+        bars[i].size = size;
+        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                               &bars[i]);
+        if ( rc )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+    }
+
+    /* Check expansion ROM. */
+    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size,
+                          false, true);
+    if ( rc < 0 )
+    {
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+        return rc;
+    }
+
+    if ( size )
+    {
+        struct vpci_bar *rom = &header->bars[num_bars];
+
+        rom->type = VPCI_BAR_ROM;
+        rom->size = size;
+        rom->addr = addr;
+
+        rc = vpci_add_register(pdev, vpci_rom_read, vpci_rom_write, rom_reg, 4,
+                               rom);
+        if ( rc )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+    }
+
+    if ( cmd & PCI_COMMAND_MEMORY )
+    {
+        rc = vpci_modify_bars(pdev, true);
+        if ( rc )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+
+        /* Enable memory decoding. */
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+    }
+
+    return 0;
+}
+
+REGISTER_VPCI_INIT(vpci_init_bars);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 12f7287d7b..3c6beaaf4a 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -65,6 +65,35 @@ void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
 struct vpci {
     /* List of vPCI handlers for a device. */
     struct list_head handlers;
+
+#ifdef __XEN__
+    /* Hide the rest of the vpci struct from the user-space test harness. */
+    struct vpci_header {
+        /* Information about the PCI BARs of this device. */
+        struct vpci_bar {
+            paddr_t addr;
+            uint64_t size;
+            enum {
+                VPCI_BAR_EMPTY,
+                VPCI_BAR_IO,
+                VPCI_BAR_MEM32,
+                VPCI_BAR_MEM64_LO,
+                VPCI_BAR_MEM64_HI,
+                VPCI_BAR_ROM,
+            } type;
+            bool prefetchable;
+            bool sizing;
+            /* Store whether the BAR is mapped into guest p2m. */
+            bool enabled;
+            /*
+             * Store whether the ROM enable bit is set (doesn't imply ROM BAR
+             * is mapped into guest p2m). Only used for type VPCI_BAR_ROM.
+             */
+            bool rom_enabled;
+        } bars[7]; /* At most 6 BARS + 1 expansion ROM BAR. */
+        /* FIXME: currently there's no support for SR-IOV. */
+    } header;
+#endif
 };
 
 #endif
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 09/11] vpci/msi: add MSI handlers
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
                   ` (7 preceding siblings ...)
  2017-08-14 14:28 ` [PATCH v5 08/11] vpci/bars: add handlers to map the BARs Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-08-22 12:20   ` Paul Durrant
  2017-09-07 15:29   ` Jan Beulich
  2017-08-14 14:28 ` [PATCH v5 10/11] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
  2017-08-14 14:28 ` [PATCH v5 11/11] vpci/msix: add MSI-X handlers Roger Pau Monne
  10 siblings, 2 replies; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Add handlers for the MSI control, address, data and mask fields in
order to detect accesses to them and setup the interrupts as requested
by the guest.

Note that the pending register is not trapped, and the guest can
freely read/write to it.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v4:
 - Fix commit message.
 - Change the ASSERTs in vpci_msi_arch_mask into ifs.
 - Introduce INVALID_PIRQ.
 - Destroy the partially created bindings in case of failure in
   vpci_msi_arch_enable.
 - Just take the pcidevs lock once in vpci_msi_arch_disable.
 - Print an error message in case of failure of pt_irq_destroy_bind.
 - Make vpci_msi_arch_init return void.
 - Constify the arch parameter of vpci_msi_arch_print.
 - Use fixed instead of cpu for msi redirection.
 - Separate the header includes in vpci/msi.c between xen and asm.
 - Store the number of configured vectors even if MSI is not enabled
   and always return it in vpci_msi_control_read.
 - Fix/add comments in vpci_msi_control_write to clarify intended
   behavior.
 - Simplify usage of masks in vpci_msi_address_{upper_}write.
 - Add comment to vpci_msi_mask_{read/write}.
 - Don't use MASK_EXTR in vpci_msi_mask_write.
 - s/msi_offset/pos/ in vpci_init_msi.
 - Move control variable setup closer to it's usage.
 - Use d%d in vpci_dump_msi.
 - Fix printing of bitfield mask in vpci_dump_msi.
 - Fix definition of MSI_ADDR_REDIRECTION_MASK.
 - Shuffle the layout of vpci_msi to minimize gaps.
 - Remove the error label in vpci_init_msi.

Changes since v3:
 - Propagate changes from previous versions: drop xen_ prefix, drop
   return value from handlers, use the new vpci_val fields.
 - Use MASK_EXTR.
 - Remove the usage of GENMASK.
 - Add GFLAGS_SHIFT_DEST_ID and use it in msi_flags.
 - Add "arch" to the MSI arch specific functions.
 - Move the dumping of vPCI MSI information to dump_msi (key 'M').
 - Remove the guest_vectors field.
 - Allow the guest to change the number of active vectors without
   having to disable and enable MSI.
 - Check the number of active vectors when parsing the disable
   mask.
 - Remove the debug messages from vpci_init_msi.
 - Move the arch-specific part of the dump handler to x86/hvm/vmsi.c.
 - Use trylock in the dump handler to get the vpci lock.

Changes since v2:
 - Add an arch-specific abstraction layer. Note that this is only implemented
   for x86 currently.
 - Add a wrapper to detect MSI enabling for vPCI.

NB: I've only been able to test this with devices using a single MSI interrupt
and no mask register. I will try to find hardware that supports the mask
register and more than one vector, but I cannot make any promises.

If there are doubts about the untested parts we could always force Xen to
report no per-vector masking support and only 1 available vector, but I would
rather avoid doing it.
---
 xen/arch/x86/hvm/vmsi.c      | 156 ++++++++++++++++++
 xen/arch/x86/msi.c           |   3 +
 xen/drivers/vpci/Makefile    |   2 +-
 xen/drivers/vpci/msi.c       | 368 +++++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/io.h |  18 +++
 xen/include/asm-x86/msi.h    |   1 +
 xen/include/xen/hvm/irq.h    |   2 +
 xen/include/xen/irq.h        |   1 +
 xen/include/xen/vpci.h       |  27 ++++
 9 files changed, 577 insertions(+), 1 deletion(-)
 create mode 100644 xen/drivers/vpci/msi.c

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index a36692c313..aea088e290 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -622,3 +622,159 @@ void msix_write_completion(struct vcpu *v)
     if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
         gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
 }
+
+static unsigned int msi_vector(uint16_t data)
+{
+    return MASK_EXTR(data, MSI_DATA_VECTOR_MASK);
+}
+
+static unsigned int msi_flags(uint16_t data, uint64_t addr)
+{
+    unsigned int rh, dm, dest_id, deliv_mode, trig_mode;
+
+    rh = MASK_EXTR(addr, MSI_ADDR_REDIRECTION_MASK);
+    dm = MASK_EXTR(addr, MSI_ADDR_DESTMODE_MASK);
+    dest_id = MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK);
+    deliv_mode = MASK_EXTR(data, MSI_DATA_DELIVERY_MODE_MASK);
+    trig_mode = MASK_EXTR(data, MSI_DATA_TRIGGER_MASK);
+
+    return (dest_id << GFLAGS_SHIFT_DEST_ID) | (rh << GFLAGS_SHIFT_RH) |
+           (dm << GFLAGS_SHIFT_DM) | (deliv_mode << GFLAGS_SHIFT_DELIV_MODE) |
+           (trig_mode << GFLAGS_SHIFT_TRG_MODE);
+}
+
+void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                        unsigned int entry, bool mask)
+{
+    struct domain *d = pdev->domain;
+    const struct pirq *pinfo;
+    struct irq_desc *desc;
+    unsigned long flags;
+    int irq;
+
+    ASSERT(arch->pirq >= 0);
+    pinfo = pirq_info(d, arch->pirq + entry);
+    if ( !pinfo )
+        return;
+
+    irq = pinfo->arch.irq;
+    if ( irq >= nr_irqs || irq < 0)
+        return;
+
+    desc = irq_to_desc(irq);
+    if ( !desc )
+        return;
+
+    spin_lock_irqsave(&desc->lock, flags);
+    guest_mask_msi_irq(desc, mask);
+    spin_unlock_irqrestore(&desc->lock, flags);
+}
+
+int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                         uint64_t address, uint32_t data, unsigned int vectors)
+{
+    struct msi_info msi_info = {
+        .seg = pdev->seg,
+        .bus = pdev->bus,
+        .devfn = pdev->devfn,
+        .entry_nr = vectors,
+    };
+    unsigned int i;
+    int rc;
+
+    ASSERT(arch->pirq == INVALID_PIRQ);
+
+    /* Get a PIRQ. */
+    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
+                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
+    if ( rc )
+    {
+        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
+                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                 PCI_FUNC(pdev->devfn), rc);
+        return rc;
+    }
+
+    for ( i = 0; i < vectors; i++ )
+    {
+        xen_domctl_bind_pt_irq_t bind = {
+            .machine_irq = arch->pirq + i,
+            .irq_type = PT_IRQ_TYPE_MSI,
+            .u.msi.gvec = msi_vector(data) + i,
+            .u.msi.gflags = msi_flags(data, address),
+        };
+
+        pcidevs_lock();
+        rc = pt_irq_create_bind(pdev->domain, &bind);
+        if ( rc )
+        {
+            gdprintk(XENLOG_ERR,
+                     "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
+                     pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
+            while ( bind.machine_irq-- )
+                pt_irq_destroy_bind(pdev->domain, &bind);
+            spin_lock(&pdev->domain->event_lock);
+            unmap_domain_pirq(pdev->domain, arch->pirq);
+            spin_unlock(&pdev->domain->event_lock);
+            pcidevs_unlock();
+            arch->pirq = -1;
+            return rc;
+        }
+        pcidevs_unlock();
+    }
+
+    return 0;
+}
+
+int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                          unsigned int vectors)
+{
+    unsigned int i;
+
+    ASSERT(arch->pirq != INVALID_PIRQ);
+
+    pcidevs_lock();
+    for ( i = 0; i < vectors; i++ )
+    {
+        xen_domctl_bind_pt_irq_t bind = {
+            .machine_irq = arch->pirq + i,
+            .irq_type = PT_IRQ_TYPE_MSI,
+        };
+        int rc;
+
+        rc = pt_irq_destroy_bind(pdev->domain, &bind);
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: failed to unbind PIRQ %u: %d\n",
+                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                 PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
+    }
+
+    spin_lock(&pdev->domain->event_lock);
+    unmap_domain_pirq(pdev->domain, arch->pirq);
+    spin_unlock(&pdev->domain->event_lock);
+    pcidevs_unlock();
+
+    arch->pirq = INVALID_PIRQ;
+
+    return 0;
+}
+
+void vpci_msi_arch_init(struct vpci_arch_msi *arch)
+{
+    arch->pirq = INVALID_PIRQ;
+}
+
+void vpci_msi_arch_print(const struct vpci_arch_msi *arch, uint16_t data,
+                         uint64_t addr)
+{
+    printk("vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu pirq: %d\n",
+           MASK_EXTR(data, MSI_DATA_VECTOR_MASK),
+           data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
+           data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
+           data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
+           addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
+           addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "fixed",
+           MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
+           arch->pirq);
+}
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 77998f4fb3..63769153f1 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -30,6 +30,7 @@
 #include <public/physdev.h>
 #include <xen/iommu.h>
 #include <xsm/xsm.h>
+#include <xen/vpci.h>
 
 static s8 __read_mostly use_msi = -1;
 boolean_param("msi", use_msi);
@@ -1536,6 +1537,8 @@ static void dump_msi(unsigned char key)
                attr.guest_masked ? 'G' : ' ',
                mask);
     }
+
+    vpci_dump_msi();
 }
 
 static int __init msi_setup_keyhandler(void)
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 241467212f..62cec9e82b 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o header.o
+obj-y += vpci.o header.o msi.o
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
new file mode 100644
index 0000000000..1e36b9779a
--- /dev/null
+++ b/xen/drivers/vpci/msi.c
@@ -0,0 +1,368 @@
+/*
+ * Handlers for accesses to the MSI capability structure.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+
+#include <asm/msi.h>
+
+/* Handlers for the MSI control field (PCI_MSI_FLAGS). */
+static uint32_t vpci_msi_control_read(struct pci_dev *pdev, unsigned int reg,
+                                      const void *data)
+{
+    const struct vpci_msi *msi = data;
+    uint16_t val;
+
+    /* Set the number of supported/configured messages. */
+    val = MASK_INSR(fls(msi->max_vectors) - 1, PCI_MSI_FLAGS_QMASK);
+    val |= MASK_INSR(fls(msi->vectors) - 1, PCI_MSI_FLAGS_QSIZE);
+
+    val |= msi->enabled ? PCI_MSI_FLAGS_ENABLE : 0;
+    val |= msi->masking ? PCI_MSI_FLAGS_MASKBIT : 0;
+    val |= msi->address64 ? PCI_MSI_FLAGS_64BIT : 0;
+
+    return val;
+}
+
+static void vpci_msi_enable(struct pci_dev *pdev, struct vpci_msi *msi,
+                            unsigned int vectors)
+{
+    int ret;
+
+    ASSERT(!msi->enabled);
+    ret = vpci_msi_arch_enable(&msi->arch, pdev, msi->address, msi->data,
+                               vectors);
+    if ( ret )
+        return;
+
+    /* Apply the mask bits. */
+    if ( msi->masking )
+    {
+        unsigned int i;
+        uint32_t mask = msi->mask;
+
+        for ( i = ffs(mask) - 1; mask && i < vectors; i = ffs(mask) - 1 )
+        {
+            vpci_msi_arch_mask(&msi->arch, pdev, i, true);
+            __clear_bit(i, &mask);
+        }
+    }
+
+    __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), msi->pos, 1);
+
+    msi->enabled = true;
+}
+
+static int vpci_msi_disable(struct pci_dev *pdev, struct vpci_msi *msi)
+{
+    int ret;
+
+    ASSERT(msi->enabled);
+    __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), msi->pos, 0);
+
+    ret = vpci_msi_arch_disable(&msi->arch, pdev, msi->vectors);
+    if ( ret )
+        return ret;
+
+    msi->enabled = false;
+
+    return 0;
+}
+
+static void vpci_msi_control_write(struct pci_dev *pdev, unsigned int reg,
+                                   uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+    unsigned int vectors = 1 << MASK_EXTR(val, PCI_MSI_FLAGS_QSIZE);
+    bool new_enabled = val & PCI_MSI_FLAGS_ENABLE;
+
+    if ( vectors > msi->max_vectors )
+        vectors = msi->max_vectors;
+
+    /*
+     * No change in the enable field and the number of vectors is
+     * the same or the device is not enabled, in which case the
+     * vectors field can be updated directly.
+     */
+    if ( new_enabled == msi->enabled &&
+         (vectors == msi->vectors || !msi->enabled) )
+    {
+        msi->vectors = vectors;
+        return;
+    }
+
+    if ( new_enabled )
+    {
+        /*
+         * If the device is already enabled it means the number of
+         * enabled messages has changed. Disable and re-enable the
+         * device in order to apply the change.
+         */
+        if ( msi->enabled && vpci_msi_disable(pdev, msi) )
+            /*
+             * Somehow Xen has not been able to disable the
+             * configured MSI messages, leave the device state as-is,
+             * so that the guest can try to disable MSI again.
+             */
+            return;
+
+        vpci_msi_enable(pdev, msi, vectors);
+    }
+    else
+        vpci_msi_disable(pdev, msi);
+
+    msi->vectors = vectors;
+}
+
+/* Handlers for the address field (32bit or low part of a 64bit address). */
+static uint32_t vpci_msi_address_read(struct pci_dev *pdev, unsigned int reg,
+                                      const void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    return msi->address;
+}
+
+static void vpci_msi_address_write(struct pci_dev *pdev, unsigned int reg,
+                                   uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    /* Clear low part. */
+    msi->address &= ~0xffffffffull;
+    msi->address |= val;
+}
+
+/* Handlers for the high part of a 64bit address field. */
+static uint32_t vpci_msi_address_upper_read(struct pci_dev *pdev,
+                                            unsigned int reg,
+                                            const void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    return msi->address >> 32;
+}
+
+static void vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned int reg,
+                                         uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    /* Clear high part. */
+    msi->address &= 0xffffffff;
+    msi->address |= (uint64_t)val << 32;
+}
+
+/* Handlers for the data field. */
+static uint32_t vpci_msi_data_read(struct pci_dev *pdev, unsigned int reg,
+                                   const void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    return msi->data;
+}
+
+static void vpci_msi_data_write(struct pci_dev *pdev, unsigned int reg,
+                                uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    msi->data = val;
+}
+
+/* Handlers for the MSI mask bits. */
+static uint32_t vpci_msi_mask_read(struct pci_dev *pdev, unsigned int reg,
+                                   const void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    return msi->mask;
+}
+
+static void vpci_msi_mask_write(struct pci_dev *pdev, unsigned int reg,
+                                uint32_t val, void *data)
+{
+    struct vpci_msi *msi = data;
+    uint32_t dmask;
+
+    dmask = msi->mask ^ val;
+
+    if ( !dmask )
+        return;
+
+    if ( msi->enabled )
+    {
+        unsigned int i;
+
+        for ( i = ffs(dmask) - 1; dmask && i < msi->vectors;
+              i = ffs(dmask) - 1 )
+        {
+            vpci_msi_arch_mask(&msi->arch, pdev, i, (val >> i) & 1);
+            __clear_bit(i, &dmask);
+        }
+    }
+
+    msi->mask = val;
+}
+
+static int vpci_init_msi(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    struct vpci_msi *msi;
+    unsigned int pos;
+    uint16_t control;
+    int ret;
+
+    pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
+    if ( !pos )
+        return 0;
+
+    msi = xzalloc(struct vpci_msi);
+    if ( !msi )
+        return -ENOMEM;
+
+    msi->pos = pos;
+
+    ret = vpci_add_register(pdev, vpci_msi_control_read,
+                            vpci_msi_control_write,
+                            msi_control_reg(pos), 2, msi);
+    if ( ret )
+    {
+        xfree(msi);
+        return ret;
+    }
+
+    /* Get the maximum number of vectors the device supports. */
+    control = pci_conf_read16(seg, bus, slot, func, msi_control_reg(pos));
+    msi->max_vectors = multi_msi_capable(control);
+    ASSERT(msi->max_vectors <= 32);
+
+    /* The multiple message enable is 0 after reset (1 message enabled). */
+    msi->vectors = 1;
+
+    /* No PIRQ bound yet. */
+    vpci_msi_arch_init(&msi->arch);
+
+    msi->address64 = is_64bit_address(control) ? true : false;
+    msi->masking = is_mask_bit_support(control) ? true : false;
+
+    ret = vpci_add_register(pdev, vpci_msi_address_read,
+                            vpci_msi_address_write,
+                            msi_lower_address_reg(pos), 4, msi);
+    if ( ret )
+    {
+        xfree(msi);
+        return ret;
+    }
+
+    ret = vpci_add_register(pdev, vpci_msi_data_read, vpci_msi_data_write,
+                            msi_data_reg(pos, msi->address64), 2,
+                            msi);
+    if ( ret )
+    {
+        xfree(msi);
+        return ret;
+    }
+
+    if ( msi->address64 )
+    {
+        ret = vpci_add_register(pdev, vpci_msi_address_upper_read,
+                                vpci_msi_address_upper_write,
+                                msi_upper_address_reg(pos), 4, msi);
+        if ( ret )
+        {
+            xfree(msi);
+            return ret;
+        }
+    }
+
+    if ( msi->masking )
+    {
+        ret = vpci_add_register(pdev, vpci_msi_mask_read, vpci_msi_mask_write,
+                                msi_mask_bits_reg(pos, msi->address64), 4,
+                                msi);
+        if ( ret )
+        {
+            xfree(msi);
+            return ret;
+        }
+    }
+
+    pdev->vpci->msi = msi;
+
+    return 0;
+}
+
+REGISTER_VPCI_INIT(vpci_init_msi);
+
+void vpci_dump_msi(void)
+{
+    struct domain *d;
+
+    for_each_domain ( d )
+    {
+        const struct pci_dev *pdev;
+
+        if ( !has_vpci(d) )
+            continue;
+
+        printk("vPCI MSI information for d%d\n", d->domain_id);
+
+        if ( !vpci_tryrlock(d) )
+        {
+            printk("Unable to get vPCI lock, skipping\n");
+            continue;
+        }
+
+        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
+        {
+            uint8_t seg = pdev->seg, bus = pdev->bus;
+            uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+            const struct vpci_msi *msi = pdev->vpci->msi;
+
+            if ( msi )
+            {
+                printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
+
+                printk("  Enabled: %u Supports masking: %u 64-bit addresses: %u\n",
+                       msi->enabled, msi->masking, msi->address64);
+                printk("  Max vectors: %u enabled vectors: %u\n",
+                       msi->max_vectors, msi->vectors);
+
+                vpci_msi_arch_print(&msi->arch, msi->data, msi->address);
+
+                if ( msi->masking )
+                    printk("  mask=%08x\n", msi->mask);
+            }
+        }
+        vpci_runlock(d);
+    }
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 837046026c..b6c5e30b6a 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -20,6 +20,7 @@
 #define __ASM_X86_HVM_IO_H__
 
 #include <xen/mm.h>
+#include <xen/pci.h>
 #include <asm/hvm/vpic.h>
 #include <asm/hvm/vioapic.h>
 #include <public/hvm/ioreq.h>
@@ -126,6 +127,23 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_irq,
 void msix_write_completion(struct vcpu *);
 void msixtbl_init(struct domain *d);
 
+/* Arch-specific MSI data for vPCI. */
+struct vpci_arch_msi {
+    int pirq;
+};
+
+/* Arch-specific vPCI MSI helpers. */
+void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                        unsigned int entry, bool mask);
+int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                         uint64_t address, uint32_t data,
+                         unsigned int vectors);
+int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                          unsigned int vectors);
+void vpci_msi_arch_init(struct vpci_arch_msi *arch);
+void vpci_msi_arch_print(const struct vpci_arch_msi *arch, uint16_t data,
+                         uint64_t addr);
+
 enum stdvga_cache_state {
     STDVGA_CACHE_UNINITIALIZED,
     STDVGA_CACHE_ENABLED,
diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
index 37d37b820e..43ab5c6bc6 100644
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -48,6 +48,7 @@
 #define MSI_ADDR_REDIRECTION_SHIFT  3
 #define MSI_ADDR_REDIRECTION_CPU    (0 << MSI_ADDR_REDIRECTION_SHIFT)
 #define MSI_ADDR_REDIRECTION_LOWPRI (1 << MSI_ADDR_REDIRECTION_SHIFT)
+#define MSI_ADDR_REDIRECTION_MASK   (1 << MSI_ADDR_REDIRECTION_SHIFT)
 
 #define MSI_ADDR_DEST_ID_SHIFT		12
 #define	 MSI_ADDR_DEST_ID_MASK		0x00ff000
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 0d2c72c109..d07185a479 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -57,7 +57,9 @@ struct dev_intx_gsi_link {
 #define VMSI_DELIV_MASK   0x7000
 #define VMSI_TRIG_MODE    0x8000
 
+#define GFLAGS_SHIFT_DEST_ID        0
 #define GFLAGS_SHIFT_RH             8
+#define GFLAGS_SHIFT_DM             9
 #define GFLAGS_SHIFT_DELIV_MODE     12
 #define GFLAGS_SHIFT_TRG_MODE       15
 
diff --git a/xen/include/xen/irq.h b/xen/include/xen/irq.h
index 0aa817e266..9b10ffa4c3 100644
--- a/xen/include/xen/irq.h
+++ b/xen/include/xen/irq.h
@@ -133,6 +133,7 @@ struct pirq {
     struct arch_pirq arch;
 };
 
+#define INVALID_PIRQ -1
 #define pirq_info(d, p) ((struct pirq *)radix_tree_lookup(&(d)->pirq_tree, p))
 
 /* Use this instead of pirq_info() if the structure may need allocating. */
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 3c6beaaf4a..21da73df16 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -13,6 +13,7 @@
  * of just returning whether the lock is hold by any CPU).
  */
 #define vpci_rlock(d) read_lock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_tryrlock(d) read_trylock(&(d)->arch.hvm_domain.vpci_lock)
 #define vpci_wlock(d) write_lock(&(d)->arch.hvm_domain.vpci_lock)
 #define vpci_runlock(d) read_unlock(&(d)->arch.hvm_domain.vpci_lock)
 #define vpci_wunlock(d) write_unlock(&(d)->arch.hvm_domain.vpci_lock)
@@ -93,9 +94,35 @@ struct vpci {
         } bars[7]; /* At most 6 BARS + 1 expansion ROM BAR. */
         /* FIXME: currently there's no support for SR-IOV. */
     } header;
+
+    /* MSI data. */
+    struct vpci_msi {
+        /* Arch-specific data. */
+        struct vpci_arch_msi arch;
+        /* Address. */
+        uint64_t address;
+        /* Offset of the capability in the config space. */
+        unsigned int pos;
+        /* Maximum number of vectors supported by the device. */
+        unsigned int max_vectors;
+        /* Number of vectors configured. */
+        unsigned int vectors;
+        /* Mask bitfield. */
+        uint32_t mask;
+        /* Data. */
+        uint16_t data;
+        /* Enabled? */
+        bool enabled;
+        /* Supports per-vector masking? */
+        bool masking;
+        /* 64-bit address capable? */
+        bool address64;
+    } *msi;
 #endif
 };
 
+void vpci_dump_msi(void);
+
 #endif
 
 /*
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 10/11] vpci: add a priority parameter to the vPCI register initializer
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
                   ` (8 preceding siblings ...)
  2017-08-14 14:28 ` [PATCH v5 09/11] vpci/msi: add MSI handlers Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-09-07 15:32   ` Jan Beulich
  2017-08-14 14:28 ` [PATCH v5 11/11] vpci/msix: add MSI-X handlers Roger Pau Monne
  10 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

This is needed for MSI-X, since MSI-X will need to be initialized
before parsing the BARs, so that the header BAR handlers are aware of
the MSI-X related holes and make sure they are not mapped in order for
the trap handlers to work properly.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - Add a middle priority and add the PCI header to it.

Changes since v3:
 - Add a numerial suffix to the section used to store the pointer to
   each initializer function, and sort them at link time.
---
 xen/arch/arm/xen.lds.S    |  4 ++--
 xen/arch/x86/xen.lds.S    |  4 ++--
 xen/drivers/vpci/header.c |  2 +-
 xen/drivers/vpci/msi.c    |  2 +-
 xen/include/xen/vpci.h    | 12 ++++++++----
 5 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
index 6690516ff1..781ea39ecc 100644
--- a/xen/arch/arm/xen.lds.S
+++ b/xen/arch/arm/xen.lds.S
@@ -43,7 +43,7 @@ SECTIONS
   .rodata : {
 #if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM)
        __start_vpci_array = .;
-       *(.rodata.vpci)
+       *(SORT(.rodata.vpci.*))
        __end_vpci_array = .;
 #endif
         _srodata = .;          /* Read-only data */
@@ -138,7 +138,7 @@ SECTIONS
   .init.data : {
 #if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM)
        __start_vpci_array = .;
-       *(.init.rodata.vpci)
+       *(SORT(.init.rodata.vpci.*))
        __end_vpci_array = .;
 #endif
        *(.init.rodata)
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index af1b30cb2b..4ed1fe5aff 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -78,7 +78,7 @@ SECTIONS
   .rodata : {
 #if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM)
        __start_vpci_array = .;
-       *(.rodata.vpci)
+       *(SORT(.rodata.vpci.*))
        __end_vpci_array = .;
 #endif
        _srodata = .;
@@ -174,7 +174,7 @@ SECTIONS
   .init.data : {
 #if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM)
        __start_vpci_array = .;
-       *(.init.rodata.vpci)
+       *(SORT(.init.rodata.vpci.*))
        __end_vpci_array = .;
 #endif
        *(.init.rodata)
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 9b44c8441a..1533c36470 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -440,7 +440,7 @@ static int vpci_init_bars(struct pci_dev *pdev)
     return 0;
 }
 
-REGISTER_VPCI_INIT(vpci_init_bars);
+REGISTER_VPCI_INIT(vpci_init_bars, VPCI_PRIORITY_MIDDLE);
 
 /*
  * Local variables:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 1e36b9779a..181599241a 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -311,7 +311,7 @@ static int vpci_init_msi(struct pci_dev *pdev)
     return 0;
 }
 
-REGISTER_VPCI_INIT(vpci_init_msi);
+REGISTER_VPCI_INIT(vpci_init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 21da73df16..66d8ae8b5f 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -34,14 +34,18 @@ typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg,
 typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #ifdef CONFIG_LATE_HWDOM
-#define VPCI_SECTION ".rodata.vpci"
+#define VPCI_SECTION ".rodata.vpci."
 #else
-#define VPCI_SECTION ".init.rodata.vpci"
+#define VPCI_SECTION ".init.rodata.vpci."
 #endif
 
-#define REGISTER_VPCI_INIT(x)                   \
+#define VPCI_PRIORITY_HIGH      "1"
+#define VPCI_PRIORITY_MIDDLE    "5"
+#define VPCI_PRIORITY_LOW       "9"
+
+#define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
-               __used_section(VPCI_SECTION) = x
+               __used_section(VPCI_SECTION p) = x
 
 /* Add vPCI handlers to device. */
 int __must_check vpci_add_handlers(struct pci_dev *dev);
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v5 11/11] vpci/msix: add MSI-X handlers
  2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
                   ` (9 preceding siblings ...)
  2017-08-14 14:28 ` [PATCH v5 10/11] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
@ 2017-08-14 14:28 ` Roger Pau Monne
  2017-09-07 16:11   ` Roger Pau Monné
  2017-09-07 16:12   ` Jan Beulich
  10 siblings, 2 replies; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-14 14:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

Add handlers for accesses to the MSI-X message control field on the
PCI configuration space, and traps for accesses to the memory region
that contains the MSI-X table and PBA. This traps detect attempts from
the guest to configure MSI-X interrupts and properly sets them up.

Note that accesses to the Table Offset, Table BIR, PBA Offset and PBA
BIR are not trapped by Xen at the moment.

Finally, turn the panic in the Dom0 PVH builder into a warning.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - Remove parentheses around offsetof.
 - Add "being" to MSI-X enabling comment.
 - Use INVALID_PIRQ.
 - Add a simple sanity check to vpci_msix_arch_enable in order to
   detect wrong MSI-X entries more quickly.
 - Constify vpci_msix_arch_print entry argument.
 - s/cpu/fixed/ in vpci_msix_arch_print.
 - Dump the MSI-X info together with the MSI info.
 - Fix vpci_msix_control_write to take into account changes to the
   address and data fields when switching the function mask bit.
 - Only disable/enable the entries if the address or data fields have
   been updated.
 - Usew the BAR enable field to check if a BAR is mapped or not
   (instead of reading the command register for each device).
 - Fix error path in vpci_msix_read to set the return data to ~0.
 - Simplify mask usage in vpci_msix_write.
 - Cast data to uint64_t when shifting it 32 bits.
 - Fix writes to the table entry control register to take into account
   if the mask-all bit is set.
 - Add some comments to clarify the intended behavior of the code.
 - Align the PBA size to 64-bits.
 - Remove the error label in vpci_init_msix.
 - Try to compact the layout of the vpci_msix structure.
 - Remove the local table_bar and pba_bar variables from
   vpci_init_msix, they are used only once.

Changes since v3:
 - Propagate changes from previous versions: remove xen_ prefix, use
   the new fields in vpci_val and remove the return value from
   handlers.
 - Remove the usage of GENMASK.
 - Mave the arch-specific parts of the dump routine to the
   x86/hvm/vmsi.c dump handler.
 - Chain the MSI-X dump handler to the 'M' debug key.
 - Fix the header BAR mappings so that the MSI-X regions inside of
   BARs are unmapped from the domain p2m in order for the handlers to
   work properly.
 - Unconditionally trap and forward accesses to the PBA MSI-X area.
 - Simplify the conditionals in vpci_msix_control_write.
 - Fix vpci_msix_accept to use a bool type.
 - Allow all supported accesses as described in the spec to the MSI-X
   table.
 - Truncate the returned address when the access is a 32b read.
 - Always return X86EMUL_OKAY from the handlers, returning ~0 in the
   read case if the access is not supported, or ignoring writes.
 - Do not check that max_entries is != 0 in the init handler.
 - Use trylock in the dump handler.

Changes since v2:
 - Split out arch-specific code.

This patch has been tested with devices using both a single MSI-X
entry and multiple ones.
---
 xen/arch/x86/hvm/dom0_build.c    |   2 +-
 xen/arch/x86/hvm/hvm.c           |   1 +
 xen/arch/x86/hvm/vmsi.c          | 138 ++++++++++-
 xen/drivers/vpci/Makefile        |   2 +-
 xen/drivers/vpci/header.c        |  81 +++++++
 xen/drivers/vpci/msi.c           |  25 +-
 xen/drivers/vpci/msix.c          | 478 +++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/domain.h |   3 +
 xen/include/asm-x86/hvm/io.h     |  19 ++
 xen/include/xen/vpci.h           |  39 ++++
 10 files changed, 779 insertions(+), 9 deletions(-)
 create mode 100644 xen/drivers/vpci/msix.c

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index c65eb8503f..3deae7cb4a 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -1086,7 +1086,7 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
 
     pvh_setup_mmcfg(d);
 
-    panic("Building a PVHv2 Dom0 is not yet supported.");
+    printk("WARNING: PVH is an experimental mode with limited functionality\n");
     return 0;
 }
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 3168973820..e27e86a514 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -584,6 +584,7 @@ int hvm_domain_initialise(struct domain *d, unsigned long domcr_flags,
     INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.mmcfg_regions);
+    INIT_LIST_HEAD(&d->arch.hvm_domain.msix_tables);
 
     rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
     if ( rc )
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index aea088e290..5c41eea0fe 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -643,17 +643,15 @@ static unsigned int msi_flags(uint16_t data, uint64_t addr)
            (trig_mode << GFLAGS_SHIFT_TRG_MODE);
 }
 
-void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
-                        unsigned int entry, bool mask)
+static void vpci_mask_pirq(struct domain *d, int pirq, bool mask)
 {
-    struct domain *d = pdev->domain;
     const struct pirq *pinfo;
     struct irq_desc *desc;
     unsigned long flags;
     int irq;
 
-    ASSERT(arch->pirq >= 0);
-    pinfo = pirq_info(d, arch->pirq + entry);
+    ASSERT(pirq >= 0);
+    pinfo = pirq_info(d, pirq);
     if ( !pinfo )
         return;
 
@@ -670,6 +668,12 @@ void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
     spin_unlock_irqrestore(&desc->lock, flags);
 }
 
+void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                        unsigned int entry, bool mask)
+{
+    vpci_mask_pirq(pdev->domain, arch->pirq + entry, mask);
+}
+
 int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
                          uint64_t address, uint32_t data, unsigned int vectors)
 {
@@ -778,3 +782,127 @@ void vpci_msi_arch_print(const struct vpci_arch_msi *arch, uint16_t data,
            MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
            arch->pirq);
 }
+
+void vpci_msix_arch_mask(struct vpci_arch_msix_entry *arch,
+                         struct pci_dev *pdev, bool mask)
+{
+    if ( arch->pirq == INVALID_PIRQ )
+        return;
+
+    vpci_mask_pirq(pdev->domain, arch->pirq, mask);
+}
+
+int vpci_msix_arch_enable(struct vpci_arch_msix_entry *arch,
+                          struct pci_dev *pdev, uint64_t address,
+                          uint32_t data, unsigned int entry_nr,
+                          paddr_t table_base)
+{
+    struct domain *d = pdev->domain;
+    struct msi_info msi_info = {
+        .seg = pdev->seg,
+        .bus = pdev->bus,
+        .devfn = pdev->devfn,
+        .table_base = table_base,
+        .entry_nr = entry_nr,
+    };
+    xen_domctl_bind_pt_irq_t bind = {
+        .irq_type = PT_IRQ_TYPE_MSI,
+        .u.msi.gvec = msi_vector(data),
+        .u.msi.gflags = msi_flags(data, address),
+    };
+    int rc;
+
+    ASSERT(arch->pirq == INVALID_PIRQ);
+
+    /*
+     * Simple sanity check before trying to setup the interrupt.
+     * According to the Intel SDM, bits [31, 20] must contain the
+     * value 0xfee. This avoids needlessly setting up pirqs for entries
+     * the guest has not actually configured.
+     */
+    if ( (address & 0xfff00000) != MSI_ADDR_HEADER )
+        return -EINVAL;
+
+    rc = allocate_and_map_msi_pirq(d, -1, &arch->pirq,
+                                   MAP_PIRQ_TYPE_MSI, &msi_info);
+    if ( rc )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: unable to map MSI-X PIRQ entry %u: %d\n",
+                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                 PCI_FUNC(pdev->devfn), entry_nr, rc);
+        return rc;
+    }
+
+    bind.machine_irq = arch->pirq;
+    pcidevs_lock();
+    rc = pt_irq_create_bind(d, &bind);
+    if ( rc )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: unable to create MSI-X bind %u: %d\n",
+                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                 PCI_FUNC(pdev->devfn), entry_nr, rc);
+        spin_lock(&d->event_lock);
+        unmap_domain_pirq(d, arch->pirq);
+        spin_unlock(&d->event_lock);
+        pcidevs_unlock();
+        arch->pirq = INVALID_PIRQ;
+        return rc;
+    }
+    pcidevs_unlock();
+
+    return 0;
+}
+
+int vpci_msix_arch_disable(struct vpci_arch_msix_entry *arch,
+                           struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    xen_domctl_bind_pt_irq_t bind = {
+        .irq_type = PT_IRQ_TYPE_MSI,
+        .machine_irq = arch->pirq,
+    };
+    int rc;
+
+    if ( arch->pirq == INVALID_PIRQ )
+        return 0;
+
+    pcidevs_lock();
+    rc = pt_irq_destroy_bind(d, &bind);
+    if ( rc )
+    {
+        pcidevs_unlock();
+        return rc;
+    }
+
+    spin_lock(&d->event_lock);
+    unmap_domain_pirq(d, arch->pirq);
+    spin_unlock(&d->event_lock);
+    pcidevs_unlock();
+
+    arch->pirq = INVALID_PIRQ;
+
+    return 0;
+}
+
+int vpci_msix_arch_init(struct vpci_arch_msix_entry *arch)
+{
+    arch->pirq = -1;
+    return 0;
+}
+
+void vpci_msix_arch_print(const struct vpci_arch_msix_entry *entry,
+                          uint32_t data, uint64_t addr, bool masked,
+                          unsigned int pos)
+{
+    printk("%4u vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu mask=%u pirq: %d\n",
+           pos, MASK_EXTR(data, MSI_DATA_VECTOR_MASK),
+           data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
+           data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
+           data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
+           addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
+           addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "fixed",
+           MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
+           masked, entry->pirq);
+}
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 62cec9e82b..55d1bdfda0 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o header.o msi.o
+obj-y += vpci.o header.o msi.o msix.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 1533c36470..effa830714 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -20,6 +20,7 @@
 #include <xen/sched.h>
 #include <xen/vpci.h>
 #include <xen/p2m-common.h>
+#include <asm/p2m.h>
 
 #define MAPPABLE_BAR(x)                                                 \
     ((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||   \
@@ -89,11 +90,45 @@ static int vpci_map_range(unsigned long s, unsigned long e, void *data)
     return modify_mmio(map->d, _gfn(s), _mfn(s), e - s + 1, map->map);
 }
 
+static int vpci_unmap_msix(struct domain *d, struct vpci_msix_mem *msix)
+{
+    unsigned long gfn;
+
+    for ( gfn = PFN_DOWN(msix->addr); gfn <= PFN_UP(msix->addr + msix->size);
+          gfn++ )
+    {
+        p2m_type_t t;
+        mfn_t mfn = get_gfn(d, gfn, &t);
+        int rc;
+
+        if ( mfn_eq(mfn, INVALID_MFN) )
+        {
+            /* Nothing to do, this is already a hole. */
+            put_gfn(d, gfn);
+            continue;
+        }
+
+        if ( !p2m_is_mmio(t) )
+        {
+            put_gfn(d, gfn);
+            return -EINVAL;
+        }
+
+        rc = modify_mmio(d, _gfn(gfn), mfn, 1, false);
+        put_gfn(d, gfn);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
 static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
                            bool map)
 {
     struct rangeset *mem;
     struct map_data data = { .d = d, .map = map };
+    unsigned int i;
     int rc;
 
     ASSERT(MAPPABLE_BAR(bar));
@@ -102,6 +137,42 @@ static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
     if ( IS_ERR(mem) )
         return PTR_ERR(mem);
 
+    for ( i = 0; i < ARRAY_SIZE(bar->msix); i++ )
+    {
+        struct vpci_msix_mem *msix = bar->msix[i];
+
+        if ( !msix || msix->addr == INVALID_PADDR )
+            continue;
+
+        if ( map )
+        {
+            /*
+             * Make sure the MSI-X regions of the BAR are not mapped into the
+             * domain p2m, or else the MSI-X handlers are useless. Only do this
+             * when mapping, since that's when the memory decoding on the
+             * device is enabled.
+             *
+             * This is required because iommu_inclusive_mapping might have
+             * mapped MSI-X regions into the guest p2m.
+             */
+            rc = vpci_unmap_msix(d, msix);
+            if ( rc )
+            {
+                rangeset_destroy(mem);
+                return rc;
+            }
+        }
+
+        rc = rangeset_remove_range(mem, PFN_DOWN(msix->addr),
+                                   PFN_DOWN(msix->addr + msix->size));
+        if ( rc )
+        {
+            rangeset_destroy(mem);
+            return rc;
+        }
+
+    }
+
     rc = rangeset_report_ranges(mem, 0, ~0ul, vpci_map_range, &data);
     rangeset_destroy(mem);
     if ( rc )
@@ -212,6 +283,7 @@ static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
     struct vpci_bar *bar = data;
     uint8_t seg = pdev->seg, bus = pdev->bus;
     uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    unsigned int i;
     bool hi = false;
 
     if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
@@ -255,6 +327,11 @@ static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
     bar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
     bar->addr |= (uint64_t)val << (hi ? 32 : 0);
 
+    /* Update any MSI-X areas contained in this BAR. */
+    for ( i = 0; i < ARRAY_SIZE(bar->msix); i++ )
+        if ( bar->msix[i] )
+            bar->msix[i]->addr = bar->addr + bar->msix[i]->offset;
+
     /* Make sure Xen writes back the same value for the BAR RO bits. */
     if ( !hi )
         val |= pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
@@ -345,6 +422,7 @@ static int vpci_init_bars(struct pci_dev *pdev)
     {
         uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
         uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
+        unsigned int j;
 
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
@@ -386,6 +464,9 @@ static int vpci_init_bars(struct pci_dev *pdev)
         }
 
         bars[i].addr = addr;
+        for ( j = 0; j < ARRAY_SIZE(bars[i].msix); j++ )
+            if ( bars[i].msix[j] )
+                bars[i].msix[j]->addr = addr + bars[i].msix[j]->offset;
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 181599241a..66e1bab6e8 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -324,7 +324,7 @@ void vpci_dump_msi(void)
         if ( !has_vpci(d) )
             continue;
 
-        printk("vPCI MSI information for d%d\n", d->domain_id);
+        printk("vPCI MSI/MSI-X information for d%d\n", d->domain_id);
 
         if ( !vpci_tryrlock(d) )
         {
@@ -337,10 +337,14 @@ void vpci_dump_msi(void)
             uint8_t seg = pdev->seg, bus = pdev->bus;
             uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
             const struct vpci_msi *msi = pdev->vpci->msi;
+            const struct vpci_msix *msix = pdev->vpci->msix;
+
+            if ( msi || msix )
+                printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
 
             if ( msi )
             {
-                printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
+                printk(" MSI\n");
 
                 printk("  Enabled: %u Supports masking: %u 64-bit addresses: %u\n",
                        msi->enabled, msi->masking, msi->address64);
@@ -352,6 +356,23 @@ void vpci_dump_msi(void)
                 if ( msi->masking )
                     printk("  mask=%08x\n", msi->mask);
             }
+
+            if ( msix )
+            {
+                unsigned int i;
+
+                printk(" MSI-X\n");
+
+                printk("  Max entries: %u maskall: %u enabled: %u\n",
+                       msix->max_entries, msix->masked, msix->enabled);
+
+                printk("  Table entries:\n");
+                for ( i = 0; i < msix->max_entries; i++ )
+                    vpci_msix_arch_print(&msix->entries[i].arch,
+                                         msix->entries[i].data,
+                                         msix->entries[i].addr,
+                                         msix->entries[i].masked, i);
+            }
         }
         vpci_runlock(d);
     }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
new file mode 100644
index 0000000000..4035bea421
--- /dev/null
+++ b/xen/drivers/vpci/msix.c
@@ -0,0 +1,478 @@
+/*
+ * Handlers for accesses to the MSI-X capability structure and the memory
+ * region.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <asm/msi.h>
+#include <xen/p2m-common.h>
+#include <xen/keyhandler.h>
+
+#define MSIX_SIZE(num) offsetof(struct vpci_msix, entries[num])
+#define MSIX_ADDR_IN_RANGE(a, table)                                    \
+    ((table)->addr != INVALID_PADDR && (a) >= (table)->addr &&          \
+     (a) < (table)->addr + (table)->size)
+
+static uint32_t vpci_msix_control_read(struct pci_dev *pdev, unsigned int reg,
+                                       const void *data)
+{
+    const struct vpci_msix *msix = data;
+    uint16_t val;
+
+    val = (msix->max_entries - 1) & PCI_MSIX_FLAGS_QSIZE;
+    val |= msix->enabled ? PCI_MSIX_FLAGS_ENABLE : 0;
+    val |= msix->masked ? PCI_MSIX_FLAGS_MASKALL : 0;
+
+    return val;
+}
+
+static void vpci_msix_control_write(struct pci_dev *pdev, unsigned int reg,
+                                    uint32_t val, void *data)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    struct vpci_msix *msix = data;
+    bool new_masked, new_enabled;
+
+    new_masked = val & PCI_MSIX_FLAGS_MASKALL;
+    new_enabled = val & PCI_MSIX_FLAGS_ENABLE;
+
+    /*
+     * According to the PCI 3.0 specification, switching the enable bit
+     * to 1 or the function mask bit to 0 should cause all the cached
+     * addresses and data fields to be recalculated. Xen implements this
+     * as disabling and enabling the entries.
+     *
+     * Note that the disable/enable sequence is only performed when the
+     * guest has written to the entry (ie: updated field set).
+     */
+    if ( new_enabled && !new_masked && (!msix->enabled || msix->masked) )
+    {
+        paddr_t table_base = pdev->vpci->header.bars[msix->table.bir].addr;
+        unsigned int i;
+        int rc;
+
+        for ( i = 0; i < msix->max_entries; i++ )
+        {
+            if ( msix->entries[i].masked || !msix->entries[i].updated )
+                continue;
+
+            rc = vpci_msix_arch_disable(&msix->entries[i].arch, pdev);
+            if ( rc )
+            {
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
+                         seg, bus, slot, func, msix->entries[i].nr, rc);
+                return;
+            }
+
+            rc = vpci_msix_arch_enable(&msix->entries[i].arch, pdev,
+                                       msix->entries[i].addr,
+                                       msix->entries[i].data,
+                                       msix->entries[i].nr, table_base);
+            if ( rc )
+            {
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to enable entry %u: %d\n",
+                         seg, bus, slot, func, msix->entries[i].nr, rc);
+                /* Entry is likely not configured, skip it. */
+                continue;
+            }
+
+            /*
+             * At this point the PIRQ is still masked. Unmask it, or else the
+             * guest won't receive interrupts. This is due to the
+             * disable/enable sequence performed above.
+             */
+            vpci_msix_arch_mask(&msix->entries[i].arch, pdev, false);
+
+            msix->entries[i].updated = false;
+        }
+    }
+
+    if ( (new_enabled != msix->enabled || new_masked != msix->masked) &&
+         pci_msi_conf_write_intercept(pdev, reg, 2, &val) >= 0 )
+        pci_conf_write16(seg, bus, slot, func, reg, val);
+
+    msix->masked = new_masked;
+    msix->enabled = new_enabled;
+}
+
+static struct vpci_msix *vpci_msix_find(struct domain *d, unsigned long addr)
+{
+    struct vpci_msix *msix;
+
+    list_for_each_entry ( msix, &d->arch.hvm_domain.msix_tables, next )
+    {
+        const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
+
+        if ( (bars[msix->table.bir].enabled &&
+              MSIX_ADDR_IN_RANGE(addr, &msix->table)) ||
+             (bars[msix->pba.bir].enabled &&
+              MSIX_ADDR_IN_RANGE(addr, &msix->pba)) )
+            return msix;
+    }
+
+    return NULL;
+}
+
+static int vpci_msix_accept(struct vcpu *v, unsigned long addr)
+{
+    bool found;
+
+    vpci_rlock(v->domain);
+    found = vpci_msix_find(v->domain, addr);
+    vpci_runlock(v->domain);
+
+    return found;
+}
+
+static int vpci_msix_access_check(struct pci_dev *pdev, unsigned long addr,
+                                  unsigned int len)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+
+    /* Only allow 32/64b accesses. */
+    if ( len != 4 && len != 8 )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: invalid MSI-X table access size: %u\n",
+                 seg, bus, slot, func, len);
+        return -EINVAL;
+    }
+
+    /* Only allow aligned accesses. */
+    if ( (addr & (len - 1)) != 0 )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: MSI-X only allows aligned accesses\n",
+                 seg, bus, slot, func);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static struct vpci_msix_entry *vpci_msix_get_entry(struct vpci_msix *msix,
+                                                   unsigned long addr)
+{
+    return &msix->entries[(addr - msix->table.addr) / PCI_MSIX_ENTRY_SIZE];
+}
+
+static int vpci_msix_read(struct vcpu *v, unsigned long addr,
+                          unsigned int len, unsigned long *data)
+{
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
+    const struct vpci_msix_entry *entry;
+    unsigned int offset;
+
+    vpci_rlock(d);
+    msix = vpci_msix_find(d, addr);
+    if ( !msix )
+    {
+        vpci_runlock(d);
+        *data = ~0ul;
+        return X86EMUL_OKAY;
+    }
+
+    if ( vpci_msix_access_check(msix->pdev, addr, len) )
+    {
+        vpci_runlock(d);
+        *data = ~0ul;
+        return X86EMUL_OKAY;
+    }
+
+    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
+    {
+        /* Access to PBA. */
+        switch ( len )
+        {
+        case 4:
+            *data = readl(addr);
+            break;
+        case 8:
+            *data = readq(addr);
+            break;
+        default:
+            ASSERT_UNREACHABLE();
+            *data = ~0ul;
+            break;
+        }
+
+        vpci_runlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    entry = vpci_msix_get_entry(msix, addr);
+    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
+
+    switch ( offset )
+    {
+    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
+        /*
+         * NB: do explicit truncation to the size of the access. This shouldn't
+         * be required here, since the caller of the handler should already
+         * take the appropriate measures to truncate the value before returning
+         * to the guest, but better be safe than sorry.
+         */
+        *data = len == 8 ? entry->addr : (uint32_t)entry->addr;
+        break;
+    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
+        *data = entry->addr >> 32;
+        break;
+    case PCI_MSIX_ENTRY_DATA_OFFSET:
+        *data = entry->data;
+        if ( len == 8 )
+            *data |=
+                (uint64_t)(entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0) << 32;
+        break;
+    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
+        *data = entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        *data = ~0ul;
+        break;
+    }
+    vpci_runlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static int vpci_msix_write(struct vcpu *v, unsigned long addr,
+                           unsigned int len, unsigned long data)
+{
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
+    struct vpci_msix_entry *entry;
+    unsigned int offset;
+
+    vpci_wlock(d);
+    msix = vpci_msix_find(d, addr);
+    if ( !msix )
+    {
+        vpci_wunlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
+    {
+        /* Ignore writes to PBA, it's behavior is undefined. */
+        vpci_wunlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    if ( vpci_msix_access_check(msix->pdev, addr, len) )
+    {
+        vpci_wunlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    entry = vpci_msix_get_entry(msix, addr);
+    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
+
+    /*
+     * NB: Xen allows writes to the data/address registers with the entry
+     * unmasked. The specification says this is undefined behavior, and Xen
+     * implements it as storing the written value, which will be made effective
+     * in the next mask/unmask cycle. This also mimics the implementation in
+     * QEMU.
+     */
+    switch ( offset )
+    {
+    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
+        entry->updated = true;
+        if ( len == 8 )
+        {
+            entry->addr = data;
+            break;
+        }
+        entry->addr &= ~0xffffffff;
+        entry->addr |= data;
+        break;
+    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
+        entry->updated = true;
+        entry->addr &= 0xffffffff;
+        entry->addr |= (uint64_t)data << 32;
+        break;
+    case PCI_MSIX_ENTRY_DATA_OFFSET:
+        entry->updated = true;
+        entry->data = data;
+
+        if ( len == 4 )
+            break;
+
+        data >>= 32;
+        /* fallthrough */
+    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
+    {
+        bool new_masked = data & PCI_MSIX_VECTOR_BITMASK;
+        struct pci_dev *pdev = msix->pdev;
+        paddr_t table_base = pdev->vpci->header.bars[msix->table.bir].addr;
+        int rc;
+
+        if ( entry->masked == new_masked )
+            /* No change in the mask bit, nothing to do. */
+            break;
+
+        if ( !new_masked && msix->enabled && !msix->masked && entry->updated )
+        {
+            /*
+             * If MSI-X is enabled, the function mask is not active, the entry
+             * is being unmasked and there have been changes to the address or
+             * data fields Xen needs to disable and enable the entry in order
+             * to pick up the changes.
+             */
+            rc = vpci_msix_arch_disable(&entry->arch, pdev);
+            if ( rc )
+            {
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
+                         pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                         PCI_FUNC(pdev->devfn), entry->nr, rc);
+                break;
+            }
+
+            rc = vpci_msix_arch_enable(&entry->arch, pdev, entry->addr,
+                                       entry->data, entry->nr, table_base);
+            if ( rc )
+            {
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to enable entry %u: %d\n",
+                         pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                         PCI_FUNC(pdev->devfn), entry->nr, rc);
+                break;
+            }
+            entry->updated = false;
+        }
+
+        vpci_msix_arch_mask(&entry->arch, pdev, new_masked);
+        entry->masked = new_masked;
+
+        break;
+    }
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+    vpci_wunlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops vpci_msix_table_ops = {
+    .check = vpci_msix_accept,
+    .read = vpci_msix_read,
+    .write = vpci_msix_write,
+};
+
+static int vpci_init_msix(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    struct vpci_msix *msix;
+    unsigned int msix_offset, i, max_entries;
+    uint16_t control;
+    int rc;
+
+    msix_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
+    if ( !msix_offset )
+        return 0;
+
+    control = pci_conf_read16(seg, bus, slot, func,
+                              msix_control_reg(msix_offset));
+
+    max_entries = msix_table_size(control);
+
+    msix = xzalloc_bytes(MSIX_SIZE(max_entries));
+    if ( !msix )
+        return -ENOMEM;
+
+    msix->max_entries = max_entries;
+    msix->pdev = pdev;
+
+    /* Find the MSI-X table address. */
+    msix->table.offset = pci_conf_read32(seg, bus, slot, func,
+                                         msix_table_offset_reg(msix_offset));
+    msix->table.bir = msix->table.offset & PCI_MSIX_BIRMASK;
+    msix->table.offset &= ~PCI_MSIX_BIRMASK;
+    msix->table.size = msix->max_entries * PCI_MSIX_ENTRY_SIZE;
+    /*
+     * The PCI header initialization code will take care of setting the address
+     * of both the table and pba memory regions once the BARs have been
+     * sized.
+     */
+    msix->table.addr = INVALID_PADDR;
+
+    /* Find the MSI-X pba address. */
+    msix->pba.offset = pci_conf_read32(seg, bus, slot, func,
+                                       msix_pba_offset_reg(msix_offset));
+    msix->pba.bir = msix->pba.offset & PCI_MSIX_BIRMASK;
+    msix->pba.offset &= ~PCI_MSIX_BIRMASK;
+    /*
+     * The spec mentions regarding to the PBA that "The last QWORD will not
+     * necessarily be fully populated", so it implies that the PBA size is
+     * 64-bit aligned.
+     */
+    msix->pba.size = ROUNDUP(DIV_ROUND_UP(msix->max_entries, 8), 8);
+    msix->pba.addr = INVALID_PADDR;
+
+    for ( i = 0; i < msix->max_entries; i++)
+    {
+        msix->entries[i].masked = true;
+        msix->entries[i].nr = i;
+        vpci_msix_arch_init(&msix->entries[i].arch);
+    }
+
+    if ( list_empty(&d->arch.hvm_domain.msix_tables) )
+        register_mmio_handler(d, &vpci_msix_table_ops);
+
+    list_add(&msix->next, &d->arch.hvm_domain.msix_tables);
+
+    rc = vpci_add_register(pdev, vpci_msix_control_read,
+                           vpci_msix_control_write,
+                           msix_control_reg(msix_offset), 2, msix);
+    if ( rc )
+    {
+        xfree(msix);
+        return rc;
+    }
+
+    pdev->vpci->header.bars[msix->table.bir].msix[VPCI_BAR_MSIX_TABLE] =
+        &msix->table;
+    pdev->vpci->header.bars[msix->pba.bir].msix[VPCI_BAR_MSIX_PBA] =
+        &msix->pba;
+    pdev->vpci->msix = msix;
+
+    return 0;
+}
+
+REGISTER_VPCI_INIT(vpci_init_msix, VPCI_PRIORITY_HIGH);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index e8dc01bc3e..3be62f3589 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -190,6 +190,9 @@ struct hvm_domain {
     /* List of MMCFG regions trapped by Xen. */
     struct list_head mmcfg_regions;
 
+    /* List of MSI-X tables. */
+    struct list_head msix_tables;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index b6c5e30b6a..de2e451b4f 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -144,6 +144,25 @@ void vpci_msi_arch_init(struct vpci_arch_msi *arch);
 void vpci_msi_arch_print(const struct vpci_arch_msi *arch, uint16_t data,
                          uint64_t addr);
 
+/* Arch-specific MSI-X entry data for vPCI. */
+struct vpci_arch_msix_entry {
+    int pirq;
+};
+
+/* Arch-specific vPCI MSI-X helpers. */
+void vpci_msix_arch_mask(struct vpci_arch_msix_entry *arch,
+                         struct pci_dev *pdev, bool mask);
+int vpci_msix_arch_enable(struct vpci_arch_msix_entry *arch,
+                          struct pci_dev *pdev, uint64_t address,
+                          uint32_t data, unsigned int entry_nr,
+                          paddr_t table_base);
+int vpci_msix_arch_disable(struct vpci_arch_msix_entry *arch,
+                           struct pci_dev *pdev);
+int vpci_msix_arch_init(struct vpci_arch_msix_entry *arch);
+void vpci_msix_arch_print(const struct vpci_arch_msix_entry *entry,
+                          uint32_t data, uint64_t addr, bool masked,
+                          unsigned int pos);
+
 enum stdvga_cache_state {
     STDVGA_CACHE_UNINITIALIZED,
     STDVGA_CACHE_ENABLED,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 66d8ae8b5f..5d053b12fd 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -78,6 +78,10 @@ struct vpci {
         struct vpci_bar {
             paddr_t addr;
             uint64_t size;
+#define VPCI_BAR_MSIX_TABLE     0
+#define VPCI_BAR_MSIX_PBA       1
+#define VPCI_BAR_MSIX_NUM       2
+            struct vpci_msix_mem *msix[VPCI_BAR_MSIX_NUM];
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -122,6 +126,41 @@ struct vpci {
         /* 64-bit address capable? */
         bool address64;
     } *msi;
+
+    /* MSI-X data. */
+    struct vpci_msix {
+        struct pci_dev *pdev;
+        /* List link. */
+        struct list_head next;
+        /* Table information. */
+        struct vpci_msix_mem {
+            /* MSI-X table offset. */
+            unsigned int offset;
+            /* MSI-X table BIR. */
+            unsigned int bir;
+            /* Table addr. */
+            paddr_t addr;
+            /* Table size. */
+            unsigned int size;
+        } table;
+        /* PBA */
+        struct vpci_msix_mem pba;
+        /* Maximum number of vectors supported by the device. */
+        unsigned int max_entries;
+        /* MSI-X enabled? */
+        bool enabled;
+        /* Masked? */
+        bool masked;
+        /* Entries. */
+        struct vpci_msix_entry {
+            uint64_t addr;
+            uint32_t data;
+            unsigned int nr;
+            struct vpci_arch_msix_entry arch;
+            bool masked;
+            bool updated;
+        } entries[];
+    } *msix;
 #endif
 };
 
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr
  2017-08-14 14:28 ` [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr Roger Pau Monne
@ 2017-08-22 11:24   ` Paul Durrant
  2017-08-24 15:46   ` Jan Beulich
  1 sibling, 0 replies; 60+ messages in thread
From: Paul Durrant @ 2017-08-22 11:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 14 August 2017 15:29
> To: xen-devel@lists.xenproject.org
> Cc: boris.ostrovsky@oracle.com; konrad.wilk@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr
> 
> And use it in the ioreq code to decode accesses to the PCI IO ports
> into bus, slot, function and register values.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> Cc: Paul Durrant <paul.durrant@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> Changes since v4:
>  - New in this version.
> ---
>  xen/arch/x86/hvm/io.c        | 19 +++++++++++++++++++
>  xen/arch/x86/hvm/ioreq.c     | 12 +++++-------
>  xen/include/asm-x86/hvm/io.h |  5 +++++
>  3 files changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 214ab307c4..074cba89da 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -256,6 +256,25 @@ void register_g2m_portio_handler(struct domain
> *d)
>      handler->ops = &g2m_portio_ops;
>  }
> 
> +unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> +                                 unsigned int *bus, unsigned int *slot,
> +                                 unsigned int *func)
> +{
> +    unsigned long bdf;
> +
> +    ASSERT(CF8_ENABLED(cf8));
> +
> +    bdf = CF8_BDF(cf8);
> +    *bus = PCI_BUS(bdf);
> +    *slot = PCI_SLOT(bdf);
> +    *func = PCI_FUNC(bdf);
> +    /*
> +     * NB: the lower 2 bits of the register address are fetched from the
> +     * offset into the 0xcfc register when reading/writing to it.
> +     */
> +    return CF8_ADDR_LO(cf8) | (addr & 3);
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
> index b2a8b0e986..752976d16d 100644
> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -1178,18 +1178,16 @@ struct hvm_ioreq_server
> *hvm_select_ioreq_server(struct domain *d,
>           CF8_ENABLED(cf8) )
>      {
>          uint32_t sbdf, x86_fam;
> +        unsigned int bus, slot, func, reg;
> +
> +        reg = hvm_pci_decode_addr(cf8, p->addr, &bus, &slot, &func);
> 
>          /* PCI config data cycle */
> 
> -        sbdf = XEN_DMOP_PCI_SBDF(0,
> -                                 PCI_BUS(CF8_BDF(cf8)),
> -                                 PCI_SLOT(CF8_BDF(cf8)),
> -                                 PCI_FUNC(CF8_BDF(cf8)));
> +        sbdf = XEN_DMOP_PCI_SBDF(0, bus, slot, func);
> 
>          type = XEN_DMOP_IO_RANGE_PCI;
> -        addr = ((uint64_t)sbdf << 32) |
> -               CF8_ADDR_LO(cf8) |
> -               (p->addr & 3);
> +        addr = ((uint64_t)sbdf << 32) | reg;
>          /* AMD extended configuration space access? */
>          if ( CF8_ADDR_HI(cf8) &&
>               d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 2484eb1c75..51659b6c7f 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -149,6 +149,11 @@ void stdvga_deinit(struct domain *d);
> 
>  extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
> 
> +/* Decode a PCI port IO access into a bus/slot/func/reg. */
> +unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> +                                 unsigned int *bus, unsigned int *slot,
> +                                 unsigned int *func);
> +
>  /*
>   * HVM port IO handler that performs forwarding of guest IO ports into
> machine
>   * IO ports.
> --
> 2.11.0 (Apple Git-81)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-08-14 14:28 ` [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
@ 2017-08-22 12:05   ` Paul Durrant
  2017-09-04 15:38   ` Jan Beulich
  2017-09-12 10:42   ` Julien Grall
  2 siblings, 0 replies; 60+ messages in thread
From: Paul Durrant @ 2017-08-22 12:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, Andrew Cooper, Jan Beulich, Ian Jackson,
	boris.ostrovsky, Roger Pau Monne

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 14 August 2017 15:29
> To: xen-devel@lists.xenproject.org
> Cc: boris.ostrovsky@oracle.com; konrad.wilk@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> Subject: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to
> the PCI config space
> 
> This functionality is going to reside in vpci.c (and the corresponding
> vpci.h header), and should be arch-agnostic. The handlers introduced
> in this patch setup the basic functionality required in order to trap
> accesses to the PCI config space, and allow decoding the address and
> finding the corresponding handler that should handle the access
> (although no handlers are implemented).
> 
> Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are
> setup inside of a x86 HVM file, since that's not shared with other
> arches.
> 
> A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen
> whether a domain should use the newly introduced vPCI handlers, this
> is only enabled for PVH Dom0 at the moment.
> 
> A very simple user-space test is also provided, so that the basic
> functionality of the vPCI traps can be asserted. This has been proven
> quite helpful during development, since the logic to handle partial
> accesses or accesses that expand across multiple registers is not
> trivial.
> 
> The handlers for the registers are added to a linked list that's keep
> sorted at all times. Both the read and write handlers support accesses
> that expand across multiple emulated registers and contain gaps not
> emulated.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
> Changes since v4:
> * User-space test harness:
>  - Do not redirect the output of the test.
>  - Add main.c and emul.h as dependencies of the Makefile target.
>  - Use the same rule to modify the vpci and list headers.
>  - Remove underscores from local macro variables.
>  - Add _check suffix to the test harness multiread function.
>  - Change the value written by every different size in the multiwrite
>    test.
>  - Use { } to initialize the r16 and r20 arrays (instead of { 0 }).
>  - Perform some of the read checks with the local variable directly.
>  - Expand some comments.
>  - Implement a dummy rwlock.
> * Hypervisor code:
>  - Guard the linker script changes with CONFIG_HAS_PCI.
>  - Rename vpci_access_check to vpci_access_allowed and make it return
>    bool.
>  - Make hvm_pci_decode_addr return the register as return value.
>  - Use ~3 instead of 0xfffc to remove the register offset when
>    checking accesses to IO ports.
>  - s/head/prev in vpci_add_register.
>  - Add parentheses around & in vpci_add_register.
>  - Fix register removal.
>  - Change the BUGs in vpci_{read/write}_hw helpers to
>    ASSERT_UNREACHABLE.
>  - Make merge_result static and change the computation of the mask to
>    avoid using a uint64_t.
>  - Modify vpci_read to only read from hardware the not-emulated gaps.
>  - Remove the vpci_val union and use a uint32_t instead.
>  - Change handler read type to return a uint32_t instead of modifying
>    a variable passed by reference.
>  - Constify the data opaque parameter of read handlers.
>  - Change the size parameter of the vpci_{read/write} functions to
>    unsigned int.
>  - Place the array of initialization handlers in init.rodata or
>    .rodata depending on whether late-hwdom is enabled.
>  - Remove the pci_devs lock, assume the Dom0 is well behaved and won't
>    remove the device while trying to access it.
>  - Change the recursive spinlock into a rw lock for performance
>    reasons.
> 
> Changes since v3:
> * User-space test harness:
>  - Fix spaces in container_of macro.
>  - Implement a dummy locking functions.
>  - Remove 'current' macro make current a pointer to the statically
>    allocated vpcu.
>  - Remove unneeded parentheses in the pci_conf_readX macros.
>  - Fix the name of the write test macro.
>  - Remove the dummy EXPORT_SYMBOL macro (this was needed by the RB
>    code only).
>  - Import the max macro.
>  - Test all possible read/write size combinations with all possible
>    emulated register sizes.
>  - Introduce a test for register removal.
> * Hypervisor code:
>  - Use a sorted list in order to store the config space handlers.
>  - Remove some unneeded 'else' branches.
>  - Make the IO port handlers always return X86EMUL_OKAY, and set the
>    data to all 1's in case of read failure (write are simply ignored).
>  - In hvm_select_ioreq_server reuse local variables when calling
>    XEN_DMOP_PCI_SBDF.
>  - Store the pointers to the initialization functions in the .rodata
>    section.
>  - Do not ignore the return value of xen_vpci_add_handlers in
>    setup_one_hwdom_device.
>  - Remove the vpci_init macro.
>  - Do not hide the pointers inside of the vpci_{read/write}_t
>    typedefs.
>  - Rename priv_data to private in vpci_register.
>  - Simplify checking for register overlap in vpci_register_cmp.
>  - Check that the offset and the length match before removing a
>    register in xen_vpci_remove_register.
>  - Make vpci_read_hw return a value rather than storing it in a
>    pointer passed by parameter.
>  - Handler dispatcher functions vpci_{read/write} no longer return an
>    error code, errors on reads/writes should be treated like hardware
>    (writes ignored, reads return all 1's or garbage).
>  - Make sure pcidevs is locked before calling pci_get_pdev_by_domain.
>  - Use a recursive spinlock for the vpci lock, so that spin_is_locked
>    checks that the current CPU is holding the lock.
>  - Make the code less error-chatty by removing some of the printk's.
>  - Pass the slot and the function as separate parameters to the
>    handler dispatchers (instead of passing devfn).
>  - Allow handlers to be registered with either a read or write
>    function only, the missing handler will be replaced by a dummy
>    handler (writes ignored, reads return 1's).
>  - Introduce PCI_CFG_SPACE_* defines from Linux.
>  - Simplify the handler dispatchers by removing the recursion, now the
>    dispatchers iterate over the list of sorted handlers and call them
>    in order.
>  - Remove the GENMASK_BYTES, SHIFT_RIGHT_BYTES and ADD_RESULT
> macros,
>    and instead provide a merge_result function in order to merge a
>    register output into a partial result.
>  - Rename the fields of the vpci_val union to u8/u16/u32.
>  - Remove the return values from the read/write handlers, errors
>    should be handled internally and signaled as would be done on
>    native hardware.
>  - Remove the usage of the GENMASK macro.
> 
> Changes since v2:
>  - Generalize the PCI address decoding and use it for IOREQ code also.
> 
> Changes since v1:
>  - Allow access to cross a word-boundary.
>  - Add locking.
>  - Add cleanup to xen_vpci_add_handlers in case of failure.
> ---
>  .gitignore                        |   3 +
>  tools/libxl/libxl_x86.c           |   2 +-
>  tools/tests/Makefile              |   1 +
>  tools/tests/vpci/Makefile         |  37 ++++
>  tools/tests/vpci/emul.h           | 128 +++++++++++
>  tools/tests/vpci/main.c           | 314 +++++++++++++++++++++++++++
>  xen/arch/arm/xen.lds.S            |  10 +
>  xen/arch/x86/domain.c             |  18 +-
>  xen/arch/x86/hvm/hvm.c            |   2 +
>  xen/arch/x86/hvm/io.c             | 118 +++++++++-
>  xen/arch/x86/setup.c              |   3 +-
>  xen/arch/x86/xen.lds.S            |  10 +
>  xen/drivers/Makefile              |   2 +-
>  xen/drivers/passthrough/pci.c     |   9 +-
>  xen/drivers/vpci/Makefile         |   1 +
>  xen/drivers/vpci/vpci.c           | 443
> ++++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/domain.h      |   1 +
>  xen/include/asm-x86/hvm/domain.h  |   3 +
>  xen/include/asm-x86/hvm/io.h      |   3 +
>  xen/include/public/arch-x86/xen.h |   5 +-
>  xen/include/xen/pci.h             |   3 +
>  xen/include/xen/pci_regs.h        |   8 +
>  xen/include/xen/vpci.h            |  80 +++++++
>  23 files changed, 1194 insertions(+), 10 deletions(-)
>  create mode 100644 tools/tests/vpci/Makefile
>  create mode 100644 tools/tests/vpci/emul.h
>  create mode 100644 tools/tests/vpci/main.c
>  create mode 100644 xen/drivers/vpci/Makefile
>  create mode 100644 xen/drivers/vpci/vpci.c
>  create mode 100644 xen/include/xen/vpci.h
>
[snip]
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 6cb903def5..cc73df8dc7 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -36,6 +36,7 @@
>  #include <xen/rangeset.h>
>  #include <xen/monitor.h>
>  #include <xen/warning.h>
> +#include <xen/vpci.h>
>  #include <asm/shadow.h>
>  #include <asm/hap.h>
>  #include <asm/current.h>
> @@ -629,6 +630,7 @@ int hvm_domain_initialise(struct domain *d, unsigned
> long domcr_flags,
>          d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
> 
>      register_g2m_portio_handler(d);
> +    register_vpci_portio_handler(d);
> 
>      hvm_ioreq_init(d);
> 
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 074cba89da..c3b68eb257 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -25,6 +25,7 @@
>  #include <xen/trace.h>
>  #include <xen/event.h>
>  #include <xen/hypercall.h>
> +#include <xen/vpci.h>
>  #include <asm/current.h>
>  #include <asm/cpufeature.h>
>  #include <asm/processor.h>
> @@ -260,7 +261,7 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8,
> unsigned int addr,
>                                   unsigned int *bus, unsigned int *slot,
>                                   unsigned int *func)
>  {
> -    unsigned long bdf;
> +    unsigned int bdf;

Shouldn't this be folded into the previous patch where you introduce this function?

> 
>      ASSERT(CF8_ENABLED(cf8));
> 
> @@ -275,6 +276,121 @@ unsigned int hvm_pci_decode_addr(unsigned int
> cf8, unsigned int addr,
>      return CF8_ADDR_LO(cf8) | (addr & 3);
>  }
> 
> +/* Do some sanity checks. */
> +static bool vpci_access_allowed(unsigned int reg, unsigned int len)
> +{
> +    /* Check access size. */
> +    if ( len != 1 && len != 2 && len != 4 )
> +        return false;
> +
> +    /* Check that access is size aligned. */
> +    if ( (reg & (len - 1)) )
> +        return false;
> +
> +    return true;
> +}
> +
> +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> +static bool vpci_portio_accept(const struct hvm_io_handler *handler,
> +                               const ioreq_t *p)
> +{
> +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & ~3) == 0xcfc;
> +}
> +
> +static int vpci_portio_read(const struct hvm_io_handler *handler,
> +                            uint64_t addr, uint32_t size, uint64_t *data)
> +{
> +    struct domain *d = current->domain;
> +    unsigned int bus, slot, func, reg;
> +
> +    *data = ~(uint64_t)0;
> +
> +    vpci_rlock(d);
> +    if ( addr == 0xcf8 )
> +    {
> +        ASSERT(size == 4);
> +        *data = d->arch.hvm_domain.pci_cf8;
> +        vpci_runlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +    if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
> +    {
> +        vpci_runlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    reg = hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus,
> &slot,
> +                              &func);
> +
> +    if ( !vpci_access_allowed(reg, size) )
> +    {
> +        vpci_runlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    *data = vpci_read(0, bus, slot, func, reg, size);
> +    vpci_runlock(d);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int vpci_portio_write(const struct hvm_io_handler *handler,
> +                             uint64_t addr, uint32_t size, uint64_t data)
> +{
> +    struct domain *d = current->domain;
> +    unsigned int bus, slot, func, reg;
> +
> +    vpci_wlock(d);
> +    if ( addr == 0xcf8 )
> +    {
> +        ASSERT(size == 4);
> +        d->arch.hvm_domain.pci_cf8 = data;
> +        vpci_wunlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +    if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
> +    {
> +        vpci_wunlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    reg = hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus,
> &slot,
> +                              &func);
> +
> +    if ( !vpci_access_allowed(reg, size) )
> +    {
> +        vpci_wunlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    vpci_write(0, bus, slot, func, reg, size, data);
> +    vpci_wunlock(d);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static const struct hvm_io_ops vpci_portio_ops = {
> +    .accept = vpci_portio_accept,
> +    .read = vpci_portio_read,
> +    .write = vpci_portio_write,
> +};
> +
> +void register_vpci_portio_handler(struct domain *d)
> +{
> +    struct hvm_io_handler *handler;
> +
> +    if ( !has_vpci(d) )
> +        return;
> +
> +    handler = hvm_next_io_handler(d);
> +    if ( !handler )
> +        return;
> +
> +    rwlock_init(&d->arch.hvm_domain.vpci_lock);
> +    handler->type = IOREQ_TYPE_PIO;
> +    handler->ops = &vpci_portio_ops;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index db5df6956d..5b2c0e3fc3 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1566,7 +1566,8 @@ void __init noreturn __start_xen(unsigned long
> mbi_p)
>          domcr_flags |= DOMCRF_hvm |
>                         ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
>                           DOMCRF_hap : 0);
> -        config.emulation_flags =
> XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
> +        config.emulation_flags =
> XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
> +                                 XEN_X86_EMU_VPCI;
>      }
> 
>      /* Create initial domain 0. */
> diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
> index ff08bbe42a..af1b30cb2b 100644
> --- a/xen/arch/x86/xen.lds.S
> +++ b/xen/arch/x86/xen.lds.S
> @@ -76,6 +76,11 @@ SECTIONS
> 
>    __2M_rodata_start = .;       /* Start of 2M superpages, mapped RO. */
>    .rodata : {
> +#if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM)
> +       __start_vpci_array = .;
> +       *(.rodata.vpci)
> +       __end_vpci_array = .;
> +#endif
>         _srodata = .;
>         /* Bug frames table */
>         __start_bug_frames = .;
> @@ -167,6 +172,11 @@ SECTIONS
>         _einittext = .;
>    } :text
>    .init.data : {
> +#if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM)
> +       __start_vpci_array = .;
> +       *(.init.rodata.vpci)
> +       __end_vpci_array = .;
> +#endif
>         *(.init.rodata)
>         *(.init.rodata.rel)
>         *(.init.rodata.str*)
> diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile
> index 19391802a8..d51c766453 100644
> --- a/xen/drivers/Makefile
> +++ b/xen/drivers/Makefile
> @@ -1,6 +1,6 @@
>  subdir-y += char
>  subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq
> -subdir-$(CONFIG_HAS_PCI) += pci
> +subdir-$(CONFIG_HAS_PCI) += pci vpci
>  subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough
>  subdir-$(CONFIG_ACPI) += acpi
>  subdir-$(CONFIG_VIDEO) += video
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 27bdb7163c..54326cf0b8 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -30,6 +30,7 @@
>  #include <xen/radix-tree.h>
>  #include <xen/softirq.h>
>  #include <xen/tasklet.h>
> +#include <xen/vpci.h>
>  #include <xsm/xsm.h>
>  #include <asm/msi.h>
>  #include "ats.h"
> @@ -1030,9 +1031,10 @@ static void __hwdom_init
> setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>                                                  struct pci_dev *pdev)
>  {
>      u8 devfn = pdev->devfn;
> +    int err;
> 
>      do {
> -        int err = ctxt->handler(devfn, pdev);
> +        err = ctxt->handler(devfn, pdev);
> 
>          if ( err )
>          {
> @@ -1045,6 +1047,11 @@ static void __hwdom_init
> setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>          devfn += pdev->phantom_stride;
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
> +
> +    err = vpci_add_handlers(pdev);
> +    if ( err )
> +        printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
> +               ctxt->d->domain_id, err);
>  }
> 
>  static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg,
> void *arg)
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> new file mode 100644
> index 0000000000..840a906470
> --- /dev/null
> +++ b/xen/drivers/vpci/Makefile
> @@ -0,0 +1 @@
> +obj-y += vpci.o
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> new file mode 100644
> index 0000000000..f63de97e89
> --- /dev/null
> +++ b/xen/drivers/vpci/vpci.c
> @@ -0,0 +1,443 @@
> +/*
> + * Generic functionality for handling accesses to the PCI configuration space
> + * from guests.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see
> <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +
> +extern vpci_register_init_t *const __start_vpci_array[];
> +extern vpci_register_init_t *const __end_vpci_array[];
> +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> +
> +/* Internal struct to store the emulated PCI registers. */
> +struct vpci_register {
> +    vpci_read_t *read;
> +    vpci_write_t *write;
> +    unsigned int size;
> +    unsigned int offset;
> +    void *private;
> +    struct list_head node;
> +};
> +
> +int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
> +{
> +    unsigned int i;
> +    int rc = 0;
> +
> +    if ( !has_vpci(pdev->domain) )
> +        return 0;
> +
> +    pdev->vpci = xzalloc(struct vpci);
> +    if ( !pdev->vpci )
> +        return -ENOMEM;
> +
> +    INIT_LIST_HEAD(&pdev->vpci->handlers);
> +
> +    for ( i = 0; i < NUM_VPCI_INIT; i++ )
> +    {
> +        rc = __start_vpci_array[i](pdev);
> +        if ( rc )
> +            break;
> +    }
> +
> +    if ( rc )
> +    {
> +        while ( !list_empty(&pdev->vpci->handlers) )
> +        {
> +            struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> +                                                       struct vpci_register,
> +                                                       node);
> +
> +            list_del(&r->node);
> +            xfree(r);
> +        }
> +        xfree(pdev->vpci);
> +    }
> +
> +    return rc;
> +}
> +
> +static int vpci_register_cmp(const struct vpci_register *r1,
> +                             const struct vpci_register *r2)
> +{
> +    /* Return 0 if registers overlap. */
> +    if ( r1->offset < r2->offset + r2->size &&
> +         r2->offset < r1->offset + r1->size )
> +        return 0;
> +    if ( r1->offset < r2->offset )
> +        return -1;
> +    if ( r1->offset > r2->offset )
> +        return 1;
> +
> +    ASSERT_UNREACHABLE();
> +    return 0;
> +}
> +
> +/* Dummy hooks, writes are ignored, reads return 1's */
> +static uint32_t vpci_ignored_read(struct pci_dev *pdev, unsigned int reg,
> +                                  const void *data)
> +{
> +    return ~(uint32_t)0;
> +}
> +
> +static void vpci_ignored_write(struct pci_dev *pdev, unsigned int reg,
> +                               uint32_t val, void *data)
> +{
> +}
> +
> +int vpci_add_register(const struct pci_dev *pdev, vpci_read_t
> *read_handler,
> +                      vpci_write_t *write_handler, unsigned int offset,
> +                      unsigned int size, void *data)
> +{
> +    struct list_head *prev;
> +    struct vpci_register *r;
> +
> +    /* Some sanity checks. */
> +    if ( (size != 1 && size != 2 && size != 4) ||
> +         offset >= PCI_CFG_SPACE_EXP_SIZE || (offset & (size - 1)) ||
> +         (!read_handler && !write_handler) )
> +        return -EINVAL;
> +
> +    r = xmalloc(struct vpci_register);
> +    if ( !r )
> +        return -ENOMEM;
> +
> +    r->read = read_handler ?: vpci_ignored_read;
> +    r->write = write_handler ?: vpci_ignored_write;
> +    r->size = size;
> +    r->offset = offset;
> +    r->private = data;
> +
> +    vpci_wlock(pdev->domain);
> +
> +    /* The list of handlers must be keep sorted at all times. */
> +    list_for_each ( prev, &pdev->vpci->handlers )
> +    {
> +        const struct vpci_register *this =
> +            list_entry(prev, const struct vpci_register, node);
> +        int cmp = vpci_register_cmp(r, this);
> +
> +        if ( cmp < 0 )
> +            break;
> +        if ( cmp == 0 )
> +        {
> +            vpci_wunlock(pdev->domain);
> +            xfree(r);
> +            return -EEXIST;
> +        }
> +    }
> +
> +    list_add_tail(&r->node, prev);
> +    vpci_wunlock(pdev->domain);
> +
> +    return 0;
> +}
> +
> +int vpci_remove_register(const struct pci_dev *pdev, unsigned int offset,
> +                         unsigned int size)
> +{
> +    const struct vpci_register r = { .offset = offset, .size = size };
> +    struct vpci_register *rm;
> +
> +    vpci_wlock(pdev->domain);
> +    list_for_each_entry ( rm, &pdev->vpci->handlers, node )
> +    {
> +        int cmp = vpci_register_cmp(&r, rm);
> +
> +        /*
> +         * NB: do not use a switch so that we can use break to
> +         * get out of the list loop earlier if required.
> +         */
> +        if ( !cmp && rm->offset == offset && rm->size == size )
> +        {
> +            list_del(&rm->node);
> +            vpci_wunlock(pdev->domain);
> +            xfree(rm);
> +            return 0;
> +        }
> +        if ( cmp <= 0 )
> +            break;
> +    }
> +    vpci_wunlock(pdev->domain);
> +
> +    return -ENOENT;
> +}
> +
> +/* Wrappers for performing reads/writes to the underlying hardware. */
> +static uint32_t vpci_read_hw(unsigned int seg, unsigned int bus,
> +                             unsigned int slot, unsigned int func,
> +                             unsigned int reg, unsigned int size)
> +{
> +    uint32_t data;
> +
> +    switch ( size )
> +    {
> +    case 4:
> +        data = pci_conf_read32(seg, bus, slot, func, reg);
> +        break;
> +    case 3:
> +        /*
> +         * This is possible because a 4byte read can have 1byte trapped and
> +         * the rest passed-through.
> +         */
> +        if ( reg & 1 )
> +        {
> +            data = pci_conf_read8(seg, bus, slot, func, reg);
> +            data |= pci_conf_read16(seg, bus, slot, func, reg + 1) << 8;
> +        }
> +        else
> +        {
> +            data = pci_conf_read16(seg, bus, slot, func, reg);
> +            data |= pci_conf_read8(seg, bus, slot, func, reg + 2) << 16;
> +        }
> +        break;
> +    case 2:
> +        data = pci_conf_read16(seg, bus, slot, func, reg);
> +        break;
> +    case 1:
> +        data = pci_conf_read8(seg, bus, slot, func, reg);
> +        break;
> +    default:
> +        ASSERT_UNREACHABLE();
> +        data = ~(uint32_t)0;
> +        break;
> +    }
> +
> +    return data;
> +}
> +
> +static void vpci_write_hw(unsigned int seg, unsigned int bus,
> +                          unsigned int slot, unsigned int func,
> +                          unsigned int reg, unsigned int size, uint32_t data)
> +{
> +    switch ( size )
> +    {
> +    case 4:
> +        pci_conf_write32(seg, bus, slot, func, reg, data);
> +        break;
> +    case 3:
> +        /*
> +         * This is possible because a 4byte write can have 1byte trapped and
> +         * the rest passed-through.
> +         */
> +        if ( reg & 1 )
> +        {
> +            pci_conf_write8(seg, bus, slot, func, reg, data);
> +            pci_conf_write16(seg, bus, slot, func, reg + 1, data >> 8);
> +        }
> +        else
> +        {
> +            pci_conf_write16(seg, bus, slot, func, reg, data);
> +            pci_conf_write8(seg, bus, slot, func, reg + 2, data >> 16);
> +        }
> +        break;
> +    case 2:
> +        pci_conf_write16(seg, bus, slot, func, reg, data);
> +        break;
> +    case 1:
> +        pci_conf_write8(seg, bus, slot, func, reg, data);
> +        break;
> +    default:
> +        ASSERT_UNREACHABLE();
> +        break;
> +    }
> +}
> +
> +/*
> + * Merge new data into a partial result.
> + *
> + * Zero the bytes of 'data' from [offset, offset + size), and
> + * merge the value found in 'new' from [0, offset) left shifted
> + * by 'offset'.
> + */
> +static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
> +                             unsigned int offset)
> +{
> +    uint32_t mask = 0xffffffff >> (32 - 8 * size);
> +
> +    return (data & ~(mask << (offset * 8))) | ((new & mask) << (offset * 8));
> +}
> +
> +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
> +                   unsigned int func, unsigned int reg, unsigned int size)
> +{
> +    struct domain *d = current->domain;
> +    struct pci_dev *pdev;
> +    const struct vpci_register *r;
> +    unsigned int data_offset = 0;
> +    uint32_t data = ~(uint32_t)0;
> +
> +    ASSERT(vpci_rlocked(d));
> +
> +    /* Find the PCI dev matching the address. */
> +    pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func));
> +    if ( !pdev )
> +        return vpci_read_hw(seg, bus, slot, func, reg, size);
> +
> +    /* Read from the hardware or the emulated register handlers. */
> +    list_for_each_entry ( r, &pdev->vpci->handlers, node )
> +    {
> +        const struct vpci_register emu = {
> +            .offset = reg + data_offset,
> +            .size = size - data_offset
> +        };
> +        int cmp = vpci_register_cmp(&emu, r);
> +        uint32_t val;
> +        unsigned int read_size;
> +
> +        if ( cmp < 0 )
> +            break;
> +        if ( cmp > 0 )
> +            continue;
> +
> +        if ( emu.offset < r->offset )
> +        {
> +            /* Heading gap, read partial content from hardware. */
> +            read_size = r->offset - emu.offset;
> +            val = vpci_read_hw(seg, bus, slot, func, emu.offset, read_size);
> +            data = merge_result(data, val, read_size, data_offset);
> +            data_offset += read_size;
> +        }
> +
> +        val = r->read(pdev, r->offset, r->private);
> +
> +        /* Check if the read is in the middle of a register. */
> +        if ( r->offset < emu.offset )
> +            val >>= (emu.offset - r->offset) * 8;
> +
> +        /* Find the intersection size between the two sets. */
> +        read_size = min(emu.offset + emu.size, r->offset + r->size) -
> +                    max(emu.offset, r->offset);
> +        /* Merge the emulated data into the native read value. */
> +        data = merge_result(data, val, read_size, data_offset);
> +        data_offset += read_size;
> +        if ( data_offset == size )
> +            break;
> +        ASSERT(data_offset < size);
> +    }
> +
> +    if ( data_offset < size )
> +    {
> +        /* Tailing gap, read the remaining. */
> +        uint32_t tmp_data = vpci_read_hw(seg, bus, slot, func,
> +                                         reg + data_offset,
> +                                         size - data_offset);
> +
> +        data = merge_result(data, tmp_data, size - data_offset, data_offset);
> +    }
> +
> +    return data & (0xffffffff >> (32 - 8 * size));
> +}
> +
> +/*
> + * Perform a maybe partial write to a register.
> + *
> + * Note that this will only work for simple registers, if Xen needs to
> + * trap accesses to rw1c registers (like the status PCI header register)
> + * the logic in vpci_write will have to be expanded in order to correctly
> + * deal with them.
> + */
> +static void vpci_write_helper(struct pci_dev *pdev,
> +                              const struct vpci_register *r, unsigned int size,
> +                              unsigned int offset, uint32_t data)
> +{
> +    ASSERT(size <= r->size);
> +
> +    if ( size != r->size )
> +    {
> +        uint32_t val;
> +
> +        val = r->read(pdev, r->offset, r->private);
> +        data = merge_result(val, data, size, offset);
> +    }
> +
> +    r->write(pdev, r->offset, data & (0xffffffff >> (32 - 8 * r->size)),
> +             r->private);
> +}
> +
> +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
> +                unsigned int func, unsigned int reg, unsigned int size,
> +                uint32_t data)
> +{
> +    struct domain *d = current->domain;
> +    struct pci_dev *pdev;
> +    const struct vpci_register *r;
> +    unsigned int data_offset = 0;
> +
> +    ASSERT(vpci_wlocked(d));
> +
> +    /*
> +     * Find the PCI dev matching the address.
> +     * Passthrough everything that's not trapped.
> +     */
> +    pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func));
> +    if ( !pdev )
> +    {
> +        vpci_write_hw(seg, bus, slot, func, reg, size, data);
> +        return;
> +    }
> +
> +    /* Write the value to the hardware or emulated registers. */
> +    list_for_each_entry ( r, &pdev->vpci->handlers, node )
> +    {
> +        const struct vpci_register emu = {
> +            .offset = reg + data_offset,
> +            .size = size - data_offset
> +        };
> +        int cmp = vpci_register_cmp(&emu, r);
> +        unsigned int write_size;
> +
> +        if ( cmp < 0 )
> +            break;
> +        if ( cmp > 0 )
> +            continue;
> +
> +        if ( emu.offset < r->offset )
> +        {
> +            /* Heading gap, write partial content to hardware. */
> +            vpci_write_hw(seg, bus, slot, func, emu.offset,
> +                          r->offset - emu.offset, data >> (data_offset * 8));
> +            data_offset += r->offset - emu.offset;
> +        }
> +
> +        /* Find the intersection size between the two sets. */
> +        write_size = min(emu.offset + emu.size, r->offset + r->size) -
> +                     max(emu.offset, r->offset);
> +        vpci_write_helper(pdev, r, write_size, reg + data_offset - r->offset,
> +                          data >> (data_offset * 8));
> +        data_offset += write_size;
> +        if ( data_offset == size )
> +            break;
> +        ASSERT(data_offset < size);
> +    }
> +
> +    if ( data_offset < size )
> +        /* Tailing gap, write the remaining. */
> +        vpci_write_hw(seg, bus, slot, func, reg + data_offset,
> +                      size - data_offset, data >> (data_offset * 8));
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
> index c10522b7f5..ec14343b27 100644
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -427,6 +427,7 @@ struct arch_domain
>  #define has_vpit(d)        (!!((d)->arch.emulation_flags &
> XEN_X86_EMU_PIT))
>  #define has_pirq(d)        (!!((d)->arch.emulation_flags & \
>                              XEN_X86_EMU_USE_PIRQ))
> +#define has_vpci(d)        (!!((d)->arch.emulation_flags &
> XEN_X86_EMU_VPCI))
> 
>  #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
> 
> diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-
> x86/hvm/domain.h
> index d2899c9bb2..3a54d50606 100644
> --- a/xen/include/asm-x86/hvm/domain.h
> +++ b/xen/include/asm-x86/hvm/domain.h
> @@ -184,6 +184,9 @@ struct hvm_domain {
>      /* List of guest to machine IO ports mapping. */
>      struct list_head g2m_ioport_list;
> 
> +    /* Lock for the PCI emulation layer (vPCI). */
> +    rwlock_t vpci_lock;
> +
>      /* List of permanently write-mapped pages. */
>      struct {
>          spinlock_t lock;
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 51659b6c7f..01322a2e21 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -160,6 +160,9 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8,
> unsigned int addr,
>   */
>  void register_g2m_portio_handler(struct domain *d);
> 
> +/* HVM port IO handler for PCI accesses. */
> +void register_vpci_portio_handler(struct domain *d);
> +
>  #endif /* __ASM_X86_HVM_IO_H__ */
> 
> 
> diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-
> x86/xen.h
> index f21332e897..86a1a09a8d 100644
> --- a/xen/include/public/arch-x86/xen.h
> +++ b/xen/include/public/arch-x86/xen.h
> @@ -295,12 +295,15 @@ struct xen_arch_domainconfig {
>  #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
>  #define _XEN_X86_EMU_USE_PIRQ       9
>  #define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
> +#define _XEN_X86_EMU_VPCI           10
> +#define XEN_X86_EMU_VPCI            (1U<<_XEN_X86_EMU_VPCI)
> 
>  #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC |
> XEN_X86_EMU_HPET |  \
>                                       XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
>                                       XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
>                                       XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
> -                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
> +                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
> +                                     XEN_X86_EMU_VPCI)
>      uint32_t emulation_flags;
>  };
> 
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index ea6a66b248..ad5d3ca031 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -88,6 +88,9 @@ struct pci_dev {
>  #define PT_FAULT_THRESHOLD 10
>      } fault;
>      u64 vf_rlen[6];
> +
> +    /* Data for vPCI. */
> +    struct vpci *vpci;
>  };
> 
>  #define for_each_pdev(domain, pdev) \
> diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h
> index ecd6124d91..cc4ee3b83e 100644
> --- a/xen/include/xen/pci_regs.h
> +++ b/xen/include/xen/pci_regs.h
> @@ -23,6 +23,14 @@
>  #define LINUX_PCI_REGS_H
> 
>  /*
> + * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
> + * configuration space.  PCI-X Mode 2 and PCIe devices have 4096 bytes of
> + * configuration space.
> + */
> +#define PCI_CFG_SPACE_SIZE	256
> +#define PCI_CFG_SPACE_EXP_SIZE	4096
> +
> +/*
>   * Under PCI, each device has 256 bytes of configuration address space,
>   * of which the first 64 bytes are standardized as follows:
>   */
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> new file mode 100644
> index 0000000000..12f7287d7b
> --- /dev/null
> +++ b/xen/include/xen/vpci.h
> @@ -0,0 +1,80 @@
> +#ifndef _VPCI_
> +#define _VPCI_
> +
> +#include <xen/pci.h>
> +#include <xen/types.h>
> +#include <xen/list.h>
> +
> +/*
> + * Helpers for locking/unlocking.
> + *
> + * NB: the recursive variants are used so that spin_is_locked
> + * returns whether the lock is hold by the current CPU (instead
> + * of just returning whether the lock is hold by any CPU).
> + */

The comment doesn't seem to match the use of read-write locks below.

> +#define vpci_rlock(d) read_lock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_wlock(d) write_lock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_runlock(d) read_unlock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_wunlock(d) write_unlock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_rlocked(d) rw_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_wlocked(d) rw_is_write_locked(&(d)-
> >arch.hvm_domain.vpci_lock)
> +
> +/*
> + * The vPCI handlers will never be called concurrently for the same domain,
> it
> + * is guaranteed that the vpci domain lock will always be locked when calling
> + * any handler.
> + */
> +typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg,
> +                             const void *data);
> +
> +typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg,
> +                          uint32_t val, void *data);
> +
> +typedef int vpci_register_init_t(struct pci_dev *dev);
> +
> +#ifdef CONFIG_LATE_HWDOM
> +#define VPCI_SECTION ".rodata.vpci"
> +#else
> +#define VPCI_SECTION ".init.rodata.vpci"
> +#endif
> +
> +#define REGISTER_VPCI_INIT(x)                   \
> +  static vpci_register_init_t *const x##_entry  \
> +               __used_section(VPCI_SECTION) = x
> +
> +/* Add vPCI handlers to device. */
> +int __must_check vpci_add_handlers(struct pci_dev *dev);
> +
> +/* Add/remove a register handler. */
> +int __must_check vpci_add_register(const struct pci_dev *pdev,
> +                                   vpci_read_t *read_handler,
> +                                   vpci_write_t *write_handler,
> +                                   unsigned int offset, unsigned int size,
> +                                   void *data);
> +int __must_check vpci_remove_register(const struct pci_dev *pdev,
> +                                      unsigned int offset,
> +                                      unsigned int size);
> +
> +/* Generic read/write handlers for the PCI config space. */
> +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
> +                   unsigned int func, unsigned int reg, unsigned int size);
> +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
> +                unsigned int func, unsigned int reg, unsigned int size,
> +                uint32_t data);
> +
> +struct vpci {
> +    /* List of vPCI handlers for a device. */
> +    struct list_head handlers;
> +};
> +
> +#endif
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> --
> 2.11.0 (Apple Git-81)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 03/11] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  2017-08-14 14:28 ` [PATCH v5 03/11] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
@ 2017-08-22 12:11   ` Paul Durrant
  2017-09-04 15:58   ` Jan Beulich
  1 sibling, 0 replies; 60+ messages in thread
From: Paul Durrant @ 2017-08-22 12:11 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 14 August 2017 15:29
> To: xen-devel@lists.xenproject.org
> Cc: boris.ostrovsky@oracle.com; konrad.wilk@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> Subject: [PATCH v5 03/11] x86/mmcfg: add handlers for the PVH Dom0
> MMCFG areas
> 
> Introduce a set of handlers for the accesses to the MMCFG areas. Those
> areas are setup based on the contents of the hardware MMCFG tables,
> and the list of handled MMCFG areas is stored inside of the hvm_domain
> struct.
> 
> The read/writes are forwarded to the generic vpci handlers once the
> address is decoded in order to obtain the device and register the
> guest is trying to access.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
> Changes since v4:
>  - Change the attribute of pvh_setup_mmcfg to __hwdom_init.
>  - Try to add as many MMCFG regions as possible, even if one fails to
>    add.
>  - Change some fields of the hvm_mmcfg struct: turn size into a
>    unsigned int, segment into uint16_t and bus into uint8_t.
>  - Convert some address parameters from unsigned long to paddr_t for
>    consistency.
>  - Make vpci_mmcfg_decode_addr return the decoded register in the
>    return of the function.
>  - Introduce a new macro to convert a MMCFG address into a BDF, and
>    use it in vpci_mmcfg_decode_addr to clarify the logic.
>  - In vpci_mmcfg_{read/write} unify the logic for 8B accesses and
>    smaller ones.
>  - Add the __hwdom_init attribute to register_vpci_mmcfg_handler.
>  - Test that reg + size doesn't cross a device boundary.
> 
> Changes since v3:
>  - Propagate changes from previous patches: drop xen_ prefix for vpci
>    functions, pass slot and func instead of devfn and fix the error
>    paths of the MMCFG handlers.
>  - s/ecam/mmcfg/.
>  - Move the destroy code to a separate function, so the hvm_mmcfg
>    struct can be private to hvm/io.c.
>  - Constify the return of vpci_mmcfg_find.
>  - Use d instead of v->domain in vpci_mmcfg_accept.
>  - Allow 8byte accesses to the mmcfg.
> 
> Changes since v1:
>  - Added locking.
> ---
>  xen/arch/x86/hvm/dom0_build.c    |  22 +++++
>  xen/arch/x86/hvm/hvm.c           |   3 +
>  xen/arch/x86/hvm/io.c            | 183
> ++++++++++++++++++++++++++++++++++++++-
>  xen/include/asm-x86/hvm/domain.h |   3 +
>  xen/include/asm-x86/hvm/io.h     |   7 ++
>  xen/include/asm-x86/pci.h        |   2 +
>  6 files changed, 219 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/hvm/dom0_build.c
> b/xen/arch/x86/hvm/dom0_build.c
> index 0e7d06be95..04a8682d33 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -38,6 +38,8 @@
>  #include <public/hvm/hvm_info_table.h>
>  #include <public/hvm/hvm_vcpu.h>
> 
> +#include "../x86_64/mmconfig.h"
> +
>  /*
>   * Have the TSS cover the ISA port range, which makes it
>   * - 104 bytes base structure
> @@ -1041,6 +1043,24 @@ static int __init pvh_setup_acpi(struct domain *d,
> paddr_t start_info)
>      return 0;
>  }
> 
> +static void __hwdom_init pvh_setup_mmcfg(struct domain *d)
> +{
> +    unsigned int i;
> +    int rc;
> +
> +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
> +    {
> +        rc = register_vpci_mmcfg_handler(d, pci_mmcfg_config[i].address,
> +                                         pci_mmcfg_config[i].start_bus_number,
> +                                         pci_mmcfg_config[i].end_bus_number,
> +                                         pci_mmcfg_config[i].pci_segment);
> +        if ( rc )
> +            printk("Unable to setup MMCFG handler at %#lx for segment %u\n",
> +                   pci_mmcfg_config[i].address,
> +                   pci_mmcfg_config[i].pci_segment);
> +    }
> +}
> +
>  int __init dom0_construct_pvh(struct domain *d, const module_t *image,
>                                unsigned long image_headroom,
>                                module_t *initrd,
> @@ -1090,6 +1110,8 @@ int __init dom0_construct_pvh(struct domain *d,
> const module_t *image,
>          return rc;
>      }
> 
> +    pvh_setup_mmcfg(d);
> +
>      panic("Building a PVHv2 Dom0 is not yet supported.");
>      return 0;
>  }
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index cc73df8dc7..3168973820 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -583,6 +583,7 @@ int hvm_domain_initialise(struct domain *d, unsigned
> long domcr_flags,
>      spin_lock_init(&d->arch.hvm_domain.write_map.lock);
>      INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
>      INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
> +    INIT_LIST_HEAD(&d->arch.hvm_domain.mmcfg_regions);
> 
>      rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL,
> NULL);
>      if ( rc )
> @@ -728,6 +729,8 @@ void hvm_domain_destroy(struct domain *d)
>          list_del(&ioport->list);
>          xfree(ioport);
>      }
> +
> +    destroy_vpci_mmcfg(&d->arch.hvm_domain.mmcfg_regions);
>  }
> 
>  static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t
> *h)
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index c3b68eb257..2845dc5b48 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -280,7 +280,7 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8,
> unsigned int addr,
>  static bool vpci_access_allowed(unsigned int reg, unsigned int len)
>  {
>      /* Check access size. */
> -    if ( len != 1 && len != 2 && len != 4 )
> +    if ( len != 1 && len != 2 && len != 4 && len != 8 )
>          return false;
> 
>      /* Check that access is size aligned. */
> @@ -391,6 +391,187 @@ void register_vpci_portio_handler(struct domain
> *d)
>      handler->ops = &vpci_portio_ops;
>  }
> 
> +struct hvm_mmcfg {
> +    struct list_head next;
> +    paddr_t addr;
> +    unsigned int size;
> +    uint16_t segment;
> +    int8_t bus;
> +};
> +
> +/* Handlers to trap PCI MMCFG config accesses. */
> +static const struct hvm_mmcfg *vpci_mmcfg_find(struct domain *d,
> paddr_t addr)
> +{
> +    const struct hvm_mmcfg *mmcfg;
> +
> +    list_for_each_entry ( mmcfg, &d->arch.hvm_domain.mmcfg_regions,
> next )
> +        if ( addr >= mmcfg->addr && addr < mmcfg->addr + mmcfg->size )
> +            return mmcfg;
> +
> +    return NULL;
> +}
> +
> +static unsigned int vpci_mmcfg_decode_addr(const struct hvm_mmcfg
> *mmcfg,
> +                                           paddr_t addr, unsigned int *bus,
> +                                           unsigned int *slot,
> +                                           unsigned int *func)
> +{
> +    unsigned int bdf;
> +
> +    addr -= mmcfg->addr;
> +    bdf = MMCFG_BDF(addr);
> +    *bus = PCI_BUS(bdf) + mmcfg->bus;
> +    *slot = PCI_SLOT(bdf);
> +    *func = PCI_FUNC(bdf);
> +
> +    return addr & (PCI_CFG_SPACE_EXP_SIZE - 1);
> +}
> +
> +static int vpci_mmcfg_accept(struct vcpu *v, unsigned long addr)
> +{
> +    struct domain *d = v->domain;
> +    bool found;
> +
> +    vpci_rlock(d);
> +    found = vpci_mmcfg_find(d, addr);
> +    vpci_runlock(d);
> +
> +    return found;
> +}
> +
> +static int vpci_mmcfg_read(struct vcpu *v, unsigned long addr,
> +                           unsigned int len, unsigned long *data)
> +{
> +    struct domain *d = v->domain;
> +    const struct hvm_mmcfg *mmcfg;
> +    unsigned int bus, slot, func, reg;
> +
> +    *data = ~0ul;
> +
> +    vpci_rlock(d);
> +    mmcfg = vpci_mmcfg_find(d, addr);
> +    if ( !mmcfg )
> +    {
> +        vpci_runlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    reg = vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func);
> +
> +    if ( !vpci_access_allowed(reg, len) ||
> +         (reg + len) > PCI_CFG_SPACE_EXP_SIZE )
> +    {
> +        vpci_runlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    /*
> +     * According to the PCIe 3.1A specification:
> +     *  - Configuration Reads and Writes must usually be DWORD or smaller
> +     *    in size.
> +     *  - Because Root Complex implementations are not required to support
> +     *    accesses to a RCRB that cross DW boundaries [...] software
> +     *    should take care not to cause the generation of such accesses
> +     *    when accessing a RCRB unless the Root Complex will support the
> +     *    access.
> +     *  Xen however supports 8byte accesses by splitting them into two
> +     *  4byte accesses.
> +     */
> +    *data = vpci_read(mmcfg->segment, bus, slot, func, reg, min(4u, len));
> +    if ( len == 8 )
> +        *data |= (uint64_t)vpci_read(mmcfg->segment, bus, slot, func,
> +                                     reg + 4, 4) << 32;
> +    vpci_runlock(d);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int vpci_mmcfg_write(struct vcpu *v, unsigned long addr,
> +                            unsigned int len, unsigned long data)
> +{
> +    struct domain *d = v->domain;
> +    const struct hvm_mmcfg *mmcfg;
> +    unsigned int bus, slot, func, reg;
> +
> +    vpci_wlock(d);
> +    mmcfg = vpci_mmcfg_find(d, addr);
> +    if ( !mmcfg )
> +    {
> +        vpci_wunlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    reg = vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func);
> +
> +    if ( !vpci_access_allowed(reg, len) ||
> +         (reg + len) > PCI_CFG_SPACE_EXP_SIZE )
> +    {
> +        vpci_wunlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    vpci_write(mmcfg->segment, bus, slot, func, reg, min(4u, len), data);
> +    if ( len == 8 )
> +        vpci_write(mmcfg->segment, bus, slot, func, reg + 4, 4, data >> 32);
> +    vpci_wunlock(d);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static const struct hvm_mmio_ops vpci_mmcfg_ops = {
> +    .check = vpci_mmcfg_accept,
> +    .read = vpci_mmcfg_read,
> +    .write = vpci_mmcfg_write,
> +};
> +
> +int __hwdom_init register_vpci_mmcfg_handler(struct domain *d, paddr_t
> addr,
> +                                             unsigned int start_bus,
> +                                             unsigned int end_bus,
> +                                             unsigned int seg)
> +{
> +    struct hvm_mmcfg *mmcfg;
> +
> +    ASSERT(is_hardware_domain(d));
> +
> +    vpci_wlock(d);
> +    if ( vpci_mmcfg_find(d, addr) )
> +    {
> +        vpci_wunlock(d);
> +        return -EEXIST;
> +    }
> +
> +    mmcfg = xmalloc(struct hvm_mmcfg);
> +    if ( !mmcfg )
> +    {
> +        vpci_wunlock(d);
> +        return -ENOMEM;
> +    }
> +
> +    if ( list_empty(&d->arch.hvm_domain.mmcfg_regions) )
> +        register_mmio_handler(d, &vpci_mmcfg_ops);
> +
> +    mmcfg->addr = addr + (start_bus << 20);
> +    mmcfg->bus = start_bus;
> +    mmcfg->segment = seg;
> +    mmcfg->size = (end_bus - start_bus + 1) << 20;
> +    list_add(&mmcfg->next, &d->arch.hvm_domain.mmcfg_regions);
> +    vpci_wunlock(d);
> +
> +    return 0;
> +}
> +
> +void destroy_vpci_mmcfg(struct list_head *domain_mmcfg)
> +{
> +    while ( !list_empty(domain_mmcfg) )
> +    {
> +        struct hvm_mmcfg *mmcfg = list_first_entry(domain_mmcfg,
> +                                                   struct hvm_mmcfg, next);
> +
> +        list_del(&mmcfg->next);
> +        xfree(mmcfg);
> +    }
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-
> x86/hvm/domain.h
> index 3a54d50606..e8dc01bc3e 100644
> --- a/xen/include/asm-x86/hvm/domain.h
> +++ b/xen/include/asm-x86/hvm/domain.h
> @@ -187,6 +187,9 @@ struct hvm_domain {
>      /* Lock for the PCI emulation layer (vPCI). */
>      rwlock_t vpci_lock;
> 
> +    /* List of MMCFG regions trapped by Xen. */
> +    struct list_head mmcfg_regions;
> +
>      /* List of permanently write-mapped pages. */
>      struct {
>          spinlock_t lock;
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 01322a2e21..837046026c 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -163,6 +163,13 @@ void register_g2m_portio_handler(struct domain
> *d);
>  /* HVM port IO handler for PCI accesses. */
>  void register_vpci_portio_handler(struct domain *d);
> 
> +/* HVM MMIO handler for PCI MMCFG accesses. */
> +int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
> +                                unsigned int start_bus, unsigned int end_bus,
> +                                unsigned int seg);
> +/* Destroy tracked MMCFG areas. */
> +void destroy_vpci_mmcfg(struct list_head *domain_mmcfg);
> +
>  #endif /* __ASM_X86_HVM_IO_H__ */
> 
> 
> diff --git a/xen/include/asm-x86/pci.h b/xen/include/asm-x86/pci.h
> index 36801d317b..ac16c8fd5d 100644
> --- a/xen/include/asm-x86/pci.h
> +++ b/xen/include/asm-x86/pci.h
> @@ -6,6 +6,8 @@
>  #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
>  #define CF8_ENABLED(cf8) (!!((cf8) & 0x80000000))
> 
> +#define MMCFG_BDF(addr)  ( ((addr) & 0x0ffff000) >> 12)
> +
>  #define IS_SNB_GFX(id) (id == 0x01068086 || id == 0x01168086 \
>                          || id == 0x01268086 || id == 0x01028086 \
>                          || id == 0x01128086 || id == 0x01228086 \
> --
> 2.11.0 (Apple Git-81)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 09/11] vpci/msi: add MSI handlers
  2017-08-14 14:28 ` [PATCH v5 09/11] vpci/msi: add MSI handlers Roger Pau Monne
@ 2017-08-22 12:20   ` Paul Durrant
  2017-09-07 15:29   ` Jan Beulich
  1 sibling, 0 replies; 60+ messages in thread
From: Paul Durrant @ 2017-08-22 12:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 14 August 2017 15:29
> To: xen-devel@lists.xenproject.org
> Cc: boris.ostrovsky@oracle.com; konrad.wilk@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> Subject: [PATCH v5 09/11] vpci/msi: add MSI handlers
> 
> Add handlers for the MSI control, address, data and mask fields in
> order to detect accesses to them and setup the interrupts as requested
> by the guest.
> 
> Note that the pending register is not trapped, and the guest can
> freely read/write to it.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
> Changes since v4:
>  - Fix commit message.
>  - Change the ASSERTs in vpci_msi_arch_mask into ifs.
>  - Introduce INVALID_PIRQ.
>  - Destroy the partially created bindings in case of failure in
>    vpci_msi_arch_enable.
>  - Just take the pcidevs lock once in vpci_msi_arch_disable.
>  - Print an error message in case of failure of pt_irq_destroy_bind.
>  - Make vpci_msi_arch_init return void.
>  - Constify the arch parameter of vpci_msi_arch_print.
>  - Use fixed instead of cpu for msi redirection.
>  - Separate the header includes in vpci/msi.c between xen and asm.
>  - Store the number of configured vectors even if MSI is not enabled
>    and always return it in vpci_msi_control_read.
>  - Fix/add comments in vpci_msi_control_write to clarify intended
>    behavior.
>  - Simplify usage of masks in vpci_msi_address_{upper_}write.
>  - Add comment to vpci_msi_mask_{read/write}.
>  - Don't use MASK_EXTR in vpci_msi_mask_write.
>  - s/msi_offset/pos/ in vpci_init_msi.
>  - Move control variable setup closer to it's usage.
>  - Use d%d in vpci_dump_msi.
>  - Fix printing of bitfield mask in vpci_dump_msi.
>  - Fix definition of MSI_ADDR_REDIRECTION_MASK.
>  - Shuffle the layout of vpci_msi to minimize gaps.
>  - Remove the error label in vpci_init_msi.
> 
> Changes since v3:
>  - Propagate changes from previous versions: drop xen_ prefix, drop
>    return value from handlers, use the new vpci_val fields.
>  - Use MASK_EXTR.
>  - Remove the usage of GENMASK.
>  - Add GFLAGS_SHIFT_DEST_ID and use it in msi_flags.
>  - Add "arch" to the MSI arch specific functions.
>  - Move the dumping of vPCI MSI information to dump_msi (key 'M').
>  - Remove the guest_vectors field.
>  - Allow the guest to change the number of active vectors without
>    having to disable and enable MSI.
>  - Check the number of active vectors when parsing the disable
>    mask.
>  - Remove the debug messages from vpci_init_msi.
>  - Move the arch-specific part of the dump handler to x86/hvm/vmsi.c.
>  - Use trylock in the dump handler to get the vpci lock.
> 
> Changes since v2:
>  - Add an arch-specific abstraction layer. Note that this is only implemented
>    for x86 currently.
>  - Add a wrapper to detect MSI enabling for vPCI.
> 
> NB: I've only been able to test this with devices using a single MSI interrupt
> and no mask register. I will try to find hardware that supports the mask
> register and more than one vector, but I cannot make any promises.
> 
> If there are doubts about the untested parts we could always force Xen to
> report no per-vector masking support and only 1 available vector, but I would
> rather avoid doing it.
> ---
>  xen/arch/x86/hvm/vmsi.c      | 156 ++++++++++++++++++
>  xen/arch/x86/msi.c           |   3 +
>  xen/drivers/vpci/Makefile    |   2 +-
>  xen/drivers/vpci/msi.c       | 368
> +++++++++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/hvm/io.h |  18 +++
>  xen/include/asm-x86/msi.h    |   1 +
>  xen/include/xen/hvm/irq.h    |   2 +
>  xen/include/xen/irq.h        |   1 +
>  xen/include/xen/vpci.h       |  27 ++++
>  9 files changed, 577 insertions(+), 1 deletion(-)
>  create mode 100644 xen/drivers/vpci/msi.c
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index a36692c313..aea088e290 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -622,3 +622,159 @@ void msix_write_completion(struct vcpu *v)
>      if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
>          gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
>  }
> +
> +static unsigned int msi_vector(uint16_t data)
> +{
> +    return MASK_EXTR(data, MSI_DATA_VECTOR_MASK);
> +}
> +
> +static unsigned int msi_flags(uint16_t data, uint64_t addr)
> +{
> +    unsigned int rh, dm, dest_id, deliv_mode, trig_mode;
> +
> +    rh = MASK_EXTR(addr, MSI_ADDR_REDIRECTION_MASK);
> +    dm = MASK_EXTR(addr, MSI_ADDR_DESTMODE_MASK);
> +    dest_id = MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK);
> +    deliv_mode = MASK_EXTR(data, MSI_DATA_DELIVERY_MODE_MASK);
> +    trig_mode = MASK_EXTR(data, MSI_DATA_TRIGGER_MASK);
> +
> +    return (dest_id << GFLAGS_SHIFT_DEST_ID) | (rh << GFLAGS_SHIFT_RH)
> |
> +           (dm << GFLAGS_SHIFT_DM) | (deliv_mode <<
> GFLAGS_SHIFT_DELIV_MODE) |
> +           (trig_mode << GFLAGS_SHIFT_TRG_MODE);
> +}
> +
> +void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                        unsigned int entry, bool mask)
> +{
> +    struct domain *d = pdev->domain;
> +    const struct pirq *pinfo;
> +    struct irq_desc *desc;
> +    unsigned long flags;
> +    int irq;
> +
> +    ASSERT(arch->pirq >= 0);
> +    pinfo = pirq_info(d, arch->pirq + entry);
> +    if ( !pinfo )
> +        return;
> +
> +    irq = pinfo->arch.irq;
> +    if ( irq >= nr_irqs || irq < 0)
> +        return;
> +
> +    desc = irq_to_desc(irq);
> +    if ( !desc )
> +        return;
> +
> +    spin_lock_irqsave(&desc->lock, flags);
> +    guest_mask_msi_irq(desc, mask);
> +    spin_unlock_irqrestore(&desc->lock, flags);
> +}
> +
> +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                         uint64_t address, uint32_t data, unsigned int vectors)
> +{
> +    struct msi_info msi_info = {
> +        .seg = pdev->seg,
> +        .bus = pdev->bus,
> +        .devfn = pdev->devfn,
> +        .entry_nr = vectors,
> +    };
> +    unsigned int i;
> +    int rc;
> +
> +    ASSERT(arch->pirq == INVALID_PIRQ);
> +
> +    /* Get a PIRQ. */
> +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
> +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> +    if ( rc )
> +    {
> +        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ:
> %d\n",
> +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                 PCI_FUNC(pdev->devfn), rc);
> +        return rc;
> +    }
> +
> +    for ( i = 0; i < vectors; i++ )
> +    {
> +        xen_domctl_bind_pt_irq_t bind = {
> +            .machine_irq = arch->pirq + i,
> +            .irq_type = PT_IRQ_TYPE_MSI,
> +            .u.msi.gvec = msi_vector(data) + i,
> +            .u.msi.gflags = msi_flags(data, address),
> +        };
> +
> +        pcidevs_lock();
> +        rc = pt_irq_create_bind(pdev->domain, &bind);
> +        if ( rc )
> +        {
> +            gdprintk(XENLOG_ERR,
> +                     "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
> +                     pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
> +            while ( bind.machine_irq-- )
> +                pt_irq_destroy_bind(pdev->domain, &bind);
> +            spin_lock(&pdev->domain->event_lock);
> +            unmap_domain_pirq(pdev->domain, arch->pirq);
> +            spin_unlock(&pdev->domain->event_lock);
> +            pcidevs_unlock();
> +            arch->pirq = -1;
> +            return rc;
> +        }
> +        pcidevs_unlock();
> +    }
> +
> +    return 0;
> +}
> +
> +int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev
> *pdev,
> +                          unsigned int vectors)
> +{
> +    unsigned int i;
> +
> +    ASSERT(arch->pirq != INVALID_PIRQ);
> +
> +    pcidevs_lock();
> +    for ( i = 0; i < vectors; i++ )
> +    {
> +        xen_domctl_bind_pt_irq_t bind = {
> +            .machine_irq = arch->pirq + i,
> +            .irq_type = PT_IRQ_TYPE_MSI,
> +        };
> +        int rc;
> +
> +        rc = pt_irq_destroy_bind(pdev->domain, &bind);
> +        gdprintk(XENLOG_ERR,
> +                 "%04x:%02x:%02x.%u: failed to unbind PIRQ %u: %d\n",
> +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                 PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
> +    }
> +
> +    spin_lock(&pdev->domain->event_lock);
> +    unmap_domain_pirq(pdev->domain, arch->pirq);
> +    spin_unlock(&pdev->domain->event_lock);
> +    pcidevs_unlock();
> +
> +    arch->pirq = INVALID_PIRQ;
> +
> +    return 0;
> +}
> +
> +void vpci_msi_arch_init(struct vpci_arch_msi *arch)
> +{
> +    arch->pirq = INVALID_PIRQ;
> +}
> +
> +void vpci_msi_arch_print(const struct vpci_arch_msi *arch, uint16_t data,
> +                         uint64_t addr)
> +{
> +    printk("vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu pirq: %d\n",
> +           MASK_EXTR(data, MSI_DATA_VECTOR_MASK),
> +           data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
> +           data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
> +           data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
> +           addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
> +           addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "fixed",
> +           MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
> +           arch->pirq);
> +}
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 77998f4fb3..63769153f1 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -30,6 +30,7 @@
>  #include <public/physdev.h>
>  #include <xen/iommu.h>
>  #include <xsm/xsm.h>
> +#include <xen/vpci.h>
> 
>  static s8 __read_mostly use_msi = -1;
>  boolean_param("msi", use_msi);
> @@ -1536,6 +1537,8 @@ static void dump_msi(unsigned char key)
>                 attr.guest_masked ? 'G' : ' ',
>                 mask);
>      }
> +
> +    vpci_dump_msi();
>  }
> 
>  static int __init msi_setup_keyhandler(void)
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> index 241467212f..62cec9e82b 100644
> --- a/xen/drivers/vpci/Makefile
> +++ b/xen/drivers/vpci/Makefile
> @@ -1 +1 @@
> -obj-y += vpci.o header.o
> +obj-y += vpci.o header.o msi.o
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> new file mode 100644
> index 0000000000..1e36b9779a
> --- /dev/null
> +++ b/xen/drivers/vpci/msi.c
> @@ -0,0 +1,368 @@
> +/*
> + * Handlers for accesses to the MSI capability structure.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see
> <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +
> +#include <asm/msi.h>
> +
> +/* Handlers for the MSI control field (PCI_MSI_FLAGS). */
> +static uint32_t vpci_msi_control_read(struct pci_dev *pdev, unsigned int
> reg,
> +                                      const void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +    uint16_t val;
> +
> +    /* Set the number of supported/configured messages. */
> +    val = MASK_INSR(fls(msi->max_vectors) - 1, PCI_MSI_FLAGS_QMASK);
> +    val |= MASK_INSR(fls(msi->vectors) - 1, PCI_MSI_FLAGS_QSIZE);
> +
> +    val |= msi->enabled ? PCI_MSI_FLAGS_ENABLE : 0;
> +    val |= msi->masking ? PCI_MSI_FLAGS_MASKBIT : 0;
> +    val |= msi->address64 ? PCI_MSI_FLAGS_64BIT : 0;
> +
> +    return val;
> +}
> +
> +static void vpci_msi_enable(struct pci_dev *pdev, struct vpci_msi *msi,
> +                            unsigned int vectors)
> +{
> +    int ret;
> +
> +    ASSERT(!msi->enabled);
> +    ret = vpci_msi_arch_enable(&msi->arch, pdev, msi->address, msi->data,
> +                               vectors);
> +    if ( ret )
> +        return;
> +
> +    /* Apply the mask bits. */
> +    if ( msi->masking )
> +    {
> +        unsigned int i;
> +        uint32_t mask = msi->mask;
> +
> +        for ( i = ffs(mask) - 1; mask && i < vectors; i = ffs(mask) - 1 )
> +        {
> +            vpci_msi_arch_mask(&msi->arch, pdev, i, true);
> +            __clear_bit(i, &mask);
> +        }
> +    }
> +
> +    __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), msi->pos, 1);
> +
> +    msi->enabled = true;
> +}
> +
> +static int vpci_msi_disable(struct pci_dev *pdev, struct vpci_msi *msi)
> +{
> +    int ret;
> +
> +    ASSERT(msi->enabled);
> +    __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), msi->pos, 0);
> +
> +    ret = vpci_msi_arch_disable(&msi->arch, pdev, msi->vectors);
> +    if ( ret )
> +        return ret;
> +
> +    msi->enabled = false;
> +
> +    return 0;
> +}
> +
> +static void vpci_msi_control_write(struct pci_dev *pdev, unsigned int reg,
> +                                   uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +    unsigned int vectors = 1 << MASK_EXTR(val, PCI_MSI_FLAGS_QSIZE);
> +    bool new_enabled = val & PCI_MSI_FLAGS_ENABLE;
> +
> +    if ( vectors > msi->max_vectors )
> +        vectors = msi->max_vectors;
> +
> +    /*
> +     * No change in the enable field and the number of vectors is
> +     * the same or the device is not enabled, in which case the
> +     * vectors field can be updated directly.
> +     */
> +    if ( new_enabled == msi->enabled &&
> +         (vectors == msi->vectors || !msi->enabled) )
> +    {
> +        msi->vectors = vectors;
> +        return;
> +    }
> +
> +    if ( new_enabled )
> +    {
> +        /*
> +         * If the device is already enabled it means the number of
> +         * enabled messages has changed. Disable and re-enable the
> +         * device in order to apply the change.
> +         */
> +        if ( msi->enabled && vpci_msi_disable(pdev, msi) )
> +            /*
> +             * Somehow Xen has not been able to disable the
> +             * configured MSI messages, leave the device state as-is,
> +             * so that the guest can try to disable MSI again.
> +             */
> +            return;
> +
> +        vpci_msi_enable(pdev, msi, vectors);
> +    }
> +    else
> +        vpci_msi_disable(pdev, msi);
> +
> +    msi->vectors = vectors;
> +}
> +
> +/* Handlers for the address field (32bit or low part of a 64bit address). */
> +static uint32_t vpci_msi_address_read(struct pci_dev *pdev, unsigned int
> reg,
> +                                      const void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    return msi->address;
> +}
> +
> +static void vpci_msi_address_write(struct pci_dev *pdev, unsigned int reg,
> +                                   uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    /* Clear low part. */
> +    msi->address &= ~0xffffffffull;
> +    msi->address |= val;
> +}
> +
> +/* Handlers for the high part of a 64bit address field. */
> +static uint32_t vpci_msi_address_upper_read(struct pci_dev *pdev,
> +                                            unsigned int reg,
> +                                            const void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    return msi->address >> 32;
> +}
> +
> +static void vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned
> int reg,
> +                                         uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    /* Clear high part. */
> +    msi->address &= 0xffffffff;
> +    msi->address |= (uint64_t)val << 32;
> +}
> +
> +/* Handlers for the data field. */
> +static uint32_t vpci_msi_data_read(struct pci_dev *pdev, unsigned int reg,
> +                                   const void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    return msi->data;
> +}
> +
> +static void vpci_msi_data_write(struct pci_dev *pdev, unsigned int reg,
> +                                uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    msi->data = val;
> +}
> +
> +/* Handlers for the MSI mask bits. */
> +static uint32_t vpci_msi_mask_read(struct pci_dev *pdev, unsigned int reg,
> +                                   const void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    return msi->mask;
> +}
> +
> +static void vpci_msi_mask_write(struct pci_dev *pdev, unsigned int reg,
> +                                uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +    uint32_t dmask;
> +
> +    dmask = msi->mask ^ val;
> +
> +    if ( !dmask )
> +        return;
> +
> +    if ( msi->enabled )
> +    {
> +        unsigned int i;
> +
> +        for ( i = ffs(dmask) - 1; dmask && i < msi->vectors;
> +              i = ffs(dmask) - 1 )
> +        {
> +            vpci_msi_arch_mask(&msi->arch, pdev, i, (val >> i) & 1);
> +            __clear_bit(i, &dmask);
> +        }
> +    }
> +
> +    msi->mask = val;
> +}
> +
> +static int vpci_init_msi(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    struct vpci_msi *msi;
> +    unsigned int pos;
> +    uint16_t control;
> +    int ret;
> +
> +    pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
> +    if ( !pos )
> +        return 0;
> +
> +    msi = xzalloc(struct vpci_msi);
> +    if ( !msi )
> +        return -ENOMEM;
> +
> +    msi->pos = pos;
> +
> +    ret = vpci_add_register(pdev, vpci_msi_control_read,
> +                            vpci_msi_control_write,
> +                            msi_control_reg(pos), 2, msi);
> +    if ( ret )
> +    {
> +        xfree(msi);
> +        return ret;
> +    }
> +
> +    /* Get the maximum number of vectors the device supports. */
> +    control = pci_conf_read16(seg, bus, slot, func, msi_control_reg(pos));
> +    msi->max_vectors = multi_msi_capable(control);
> +    ASSERT(msi->max_vectors <= 32);
> +
> +    /* The multiple message enable is 0 after reset (1 message enabled). */
> +    msi->vectors = 1;
> +
> +    /* No PIRQ bound yet. */
> +    vpci_msi_arch_init(&msi->arch);
> +
> +    msi->address64 = is_64bit_address(control) ? true : false;
> +    msi->masking = is_mask_bit_support(control) ? true : false;
> +
> +    ret = vpci_add_register(pdev, vpci_msi_address_read,
> +                            vpci_msi_address_write,
> +                            msi_lower_address_reg(pos), 4, msi);
> +    if ( ret )
> +    {
> +        xfree(msi);
> +        return ret;
> +    }
> +
> +    ret = vpci_add_register(pdev, vpci_msi_data_read, vpci_msi_data_write,
> +                            msi_data_reg(pos, msi->address64), 2,
> +                            msi);
> +    if ( ret )
> +    {
> +        xfree(msi);
> +        return ret;
> +    }
> +
> +    if ( msi->address64 )
> +    {
> +        ret = vpci_add_register(pdev, vpci_msi_address_upper_read,
> +                                vpci_msi_address_upper_write,
> +                                msi_upper_address_reg(pos), 4, msi);
> +        if ( ret )
> +        {
> +            xfree(msi);
> +            return ret;
> +        }
> +    }
> +
> +    if ( msi->masking )
> +    {
> +        ret = vpci_add_register(pdev, vpci_msi_mask_read,
> vpci_msi_mask_write,
> +                                msi_mask_bits_reg(pos, msi->address64), 4,
> +                                msi);
> +        if ( ret )
> +        {
> +            xfree(msi);
> +            return ret;
> +        }
> +    }
> +
> +    pdev->vpci->msi = msi;
> +
> +    return 0;
> +}
> +
> +REGISTER_VPCI_INIT(vpci_init_msi);
> +
> +void vpci_dump_msi(void)
> +{
> +    struct domain *d;
> +
> +    for_each_domain ( d )
> +    {
> +        const struct pci_dev *pdev;
> +
> +        if ( !has_vpci(d) )
> +            continue;
> +
> +        printk("vPCI MSI information for d%d\n", d->domain_id);
> +
> +        if ( !vpci_tryrlock(d) )
> +        {
> +            printk("Unable to get vPCI lock, skipping\n");
> +            continue;
> +        }
> +
> +        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
> +        {
> +            uint8_t seg = pdev->seg, bus = pdev->bus;
> +            uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev-
> >devfn);
> +            const struct vpci_msi *msi = pdev->vpci->msi;
> +
> +            if ( msi )
> +            {
> +                printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
> +
> +                printk("  Enabled: %u Supports masking: %u 64-bit addresses:
> %u\n",
> +                       msi->enabled, msi->masking, msi->address64);
> +                printk("  Max vectors: %u enabled vectors: %u\n",
> +                       msi->max_vectors, msi->vectors);
> +
> +                vpci_msi_arch_print(&msi->arch, msi->data, msi->address);
> +
> +                if ( msi->masking )
> +                    printk("  mask=%08x\n", msi->mask);
> +            }
> +        }
> +        vpci_runlock(d);
> +    }
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 837046026c..b6c5e30b6a 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -20,6 +20,7 @@
>  #define __ASM_X86_HVM_IO_H__
> 
>  #include <xen/mm.h>
> +#include <xen/pci.h>
>  #include <asm/hvm/vpic.h>
>  #include <asm/hvm/vioapic.h>
>  #include <public/hvm/ioreq.h>
> @@ -126,6 +127,23 @@ void hvm_dpci_eoi(struct domain *d, unsigned int
> guest_irq,
>  void msix_write_completion(struct vcpu *);
>  void msixtbl_init(struct domain *d);
> 
> +/* Arch-specific MSI data for vPCI. */
> +struct vpci_arch_msi {
> +    int pirq;
> +};
> +
> +/* Arch-specific vPCI MSI helpers. */
> +void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                        unsigned int entry, bool mask);
> +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                         uint64_t address, uint32_t data,
> +                         unsigned int vectors);
> +int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev
> *pdev,
> +                          unsigned int vectors);
> +void vpci_msi_arch_init(struct vpci_arch_msi *arch);
> +void vpci_msi_arch_print(const struct vpci_arch_msi *arch, uint16_t data,
> +                         uint64_t addr);
> +
>  enum stdvga_cache_state {
>      STDVGA_CACHE_UNINITIALIZED,
>      STDVGA_CACHE_ENABLED,
> diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
> index 37d37b820e..43ab5c6bc6 100644
> --- a/xen/include/asm-x86/msi.h
> +++ b/xen/include/asm-x86/msi.h
> @@ -48,6 +48,7 @@
>  #define MSI_ADDR_REDIRECTION_SHIFT  3
>  #define MSI_ADDR_REDIRECTION_CPU    (0 <<
> MSI_ADDR_REDIRECTION_SHIFT)
>  #define MSI_ADDR_REDIRECTION_LOWPRI (1 <<
> MSI_ADDR_REDIRECTION_SHIFT)
> +#define MSI_ADDR_REDIRECTION_MASK   (1 <<
> MSI_ADDR_REDIRECTION_SHIFT)
> 
>  #define MSI_ADDR_DEST_ID_SHIFT		12
>  #define	 MSI_ADDR_DEST_ID_MASK		0x00ff000
> diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
> index 0d2c72c109..d07185a479 100644
> --- a/xen/include/xen/hvm/irq.h
> +++ b/xen/include/xen/hvm/irq.h
> @@ -57,7 +57,9 @@ struct dev_intx_gsi_link {
>  #define VMSI_DELIV_MASK   0x7000
>  #define VMSI_TRIG_MODE    0x8000
> 
> +#define GFLAGS_SHIFT_DEST_ID        0
>  #define GFLAGS_SHIFT_RH             8
> +#define GFLAGS_SHIFT_DM             9
>  #define GFLAGS_SHIFT_DELIV_MODE     12
>  #define GFLAGS_SHIFT_TRG_MODE       15
> 
> diff --git a/xen/include/xen/irq.h b/xen/include/xen/irq.h
> index 0aa817e266..9b10ffa4c3 100644
> --- a/xen/include/xen/irq.h
> +++ b/xen/include/xen/irq.h
> @@ -133,6 +133,7 @@ struct pirq {
>      struct arch_pirq arch;
>  };
> 
> +#define INVALID_PIRQ -1
>  #define pirq_info(d, p) ((struct pirq *)radix_tree_lookup(&(d)->pirq_tree,
> p))
> 
>  /* Use this instead of pirq_info() if the structure may need allocating. */
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 3c6beaaf4a..21da73df16 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -13,6 +13,7 @@
>   * of just returning whether the lock is hold by any CPU).
>   */
>  #define vpci_rlock(d) read_lock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_tryrlock(d) read_trylock(&(d)->arch.hvm_domain.vpci_lock)
>  #define vpci_wlock(d) write_lock(&(d)->arch.hvm_domain.vpci_lock)
>  #define vpci_runlock(d) read_unlock(&(d)->arch.hvm_domain.vpci_lock)
>  #define vpci_wunlock(d) write_unlock(&(d)->arch.hvm_domain.vpci_lock)
> @@ -93,9 +94,35 @@ struct vpci {
>          } bars[7]; /* At most 6 BARS + 1 expansion ROM BAR. */
>          /* FIXME: currently there's no support for SR-IOV. */
>      } header;
> +
> +    /* MSI data. */
> +    struct vpci_msi {
> +        /* Arch-specific data. */
> +        struct vpci_arch_msi arch;
> +        /* Address. */
> +        uint64_t address;
> +        /* Offset of the capability in the config space. */
> +        unsigned int pos;
> +        /* Maximum number of vectors supported by the device. */
> +        unsigned int max_vectors;
> +        /* Number of vectors configured. */
> +        unsigned int vectors;
> +        /* Mask bitfield. */
> +        uint32_t mask;
> +        /* Data. */
> +        uint16_t data;
> +        /* Enabled? */
> +        bool enabled;
> +        /* Supports per-vector masking? */
> +        bool masking;
> +        /* 64-bit address capable? */
> +        bool address64;
> +    } *msi;
>  #endif
>  };
> 
> +void vpci_dump_msi(void);
> +
>  #endif
> 
>  /*
> --
> 2.11.0 (Apple Git-81)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr
  2017-08-14 14:28 ` [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr Roger Pau Monne
  2017-08-22 11:24   ` Paul Durrant
@ 2017-08-24 15:46   ` Jan Beulich
  2017-08-25  8:29     ` Roger Pau Monne
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-08-24 15:46 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -256,6 +256,25 @@ void register_g2m_portio_handler(struct domain *d)
>      handler->ops = &g2m_portio_ops;
>  }
>  
> +unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> +                                 unsigned int *bus, unsigned int *slot,
> +                                 unsigned int *func)
> +{
> +    unsigned long bdf;

Is there a need for this being unsigned long instead of unsigned int?
If not, I'd be fine changing this while committing.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr
  2017-08-24 15:46   ` Jan Beulich
@ 2017-08-25  8:29     ` Roger Pau Monne
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monne @ 2017-08-25  8:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

On Thu, Aug 24, 2017 at 09:46:16AM -0600, Jan Beulich wrote:
> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> > --- a/xen/arch/x86/hvm/io.c
> > +++ b/xen/arch/x86/hvm/io.c
> > @@ -256,6 +256,25 @@ void register_g2m_portio_handler(struct domain *d)
> >      handler->ops = &g2m_portio_ops;
> >  }
> >  
> > +unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> > +                                 unsigned int *bus, unsigned int *slot,
> > +                                 unsigned int *func)
> > +{
> > +    unsigned long bdf;
> 
> Is there a need for this being unsigned long instead of unsigned int?
> If not, I'd be fine changing this while committing.

It's my mistake it certainly needs to be unsigned int. I've wrongly
fixed this in patch 2.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-08-14 14:28 ` [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
  2017-08-22 12:05   ` Paul Durrant
@ 2017-09-04 15:38   ` Jan Beulich
  2017-09-06 15:40     ` Roger Pau Monné
  2017-09-08 14:41     ` Roger Pau Monné
  2017-09-12 10:42   ` Julien Grall
  2 siblings, 2 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-04 15:38 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, xen-devel,
	boris.ostrovsky

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> Changes since v4:
>[...]
> * Hypervisor code:
>[...]
>  - Constify the data opaque parameter of read handlers.

Is that a good idea? Such callbacks should generally be allowed to
modify their state even if the operation is just a read - think of a
private lock or statistics/debugging data to be updated.

> --- /dev/null
> +++ b/tools/tests/vpci/main.c
> @@ -0,0 +1,314 @@
> +/*
> + * Unit tests for the generic vPCI handler code.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "emul.h"
> +
> +/* Single vcpu (current), and single domain with a single PCI device. */
> +static struct vpci vpci;
> +
> +static struct domain d = {
> +    .lock = false,

UNLOCKED ?

> +int
> +main(int argc, char **argv)
> +{
> +    /* Index storage by offset. */
> +    uint32_t r0 = 0xdeadbeef;
> +    uint8_t r5 = 0xef;
> +    uint8_t r6 = 0xbe;
> +    uint8_t r7 = 0xef;
> +    uint16_t r12 = 0x8696;
> +    uint8_t r16[4] = { };
> +    uint16_t r20[2] = { };
> +    uint32_t r24 = 0;
> +    uint8_t r28, r30;
> +    unsigned int i;
> +    int rc;
> +
> +    INIT_LIST_HEAD(&vpci.handlers);
> +
> +    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
> +    VPCI_READ_CHECK(0, 4, r0);
> +    VPCI_WRITE_CHECK(0, 4, 0xbcbcbcbc);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
> +    VPCI_READ_CHECK(5, 1, r5);
> +    VPCI_WRITE_CHECK(5, 1, 0xba);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
> +    VPCI_READ_CHECK(6, 1, r6);
> +    VPCI_WRITE_CHECK(6, 1, 0xba);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
> +    VPCI_READ_CHECK(7, 1, r7);
> +    VPCI_WRITE_CHECK(7, 1, 0xbd);
> +
> +    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
> +    VPCI_READ_CHECK(12, 2, r12);
> +    VPCI_READ_CHECK(12, 4, 0xffff8696);
> +
> +    /*
> +     * At this point we have the following layout:
> +     *
> +     * Note that this refers to the position of the variables,
> +     * but the value has already changed from the one given at
> +     * initialization time because write tests have been performed.
> +     *
> +     * 32    24    16     8     0
> +     *  +-----+-----+-----+-----+
> +     *  |          r0           | 0
> +     *  +-----+-----+-----+-----+
> +     *  | r7  |  r6 |  r5 |/////| 32
> +     *  +-----+-----+-----+-----|
> +     *  |///////////////////////| 64
> +     *  +-----------+-----------+
> +     *  |///////////|    r12    | 96
> +     *  +-----------+-----------+
> +     *             ...
> +     *  / = empty.

Maybe better "unwritten"?

> --- a/xen/arch/x86/xen.lds.S
> +++ b/xen/arch/x86/xen.lds.S
> @@ -76,6 +76,11 @@ SECTIONS
>  
>    __2M_rodata_start = .;       /* Start of 2M superpages, mapped RO. */
>    .rodata : {
> +#if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM)
> +       __start_vpci_array = .;
> +       *(.rodata.vpci)
> +       __end_vpci_array = .;
> +#endif
>         _srodata = .;

Please don't put anything ahead of this label. And I'd prefer if
ordinary .rodata contributions came first, i.e. please don't follow
the bad example ...

>         /* Bug frames table */
>         __start_bug_frames = .;

... this gives (there are plenty of good examples further down in
this section).

> @@ -167,6 +172,11 @@ SECTIONS
>         _einittext = .;
>    } :text
>    .init.data : {
> +#if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM)
> +       __start_vpci_array = .;
> +       *(.init.rodata.vpci)
> +       __end_vpci_array = .;
> +#endif
>         *(.init.rodata)

Same here then.

> +int vpci_add_register(const struct pci_dev *pdev, vpci_read_t *read_handler,
> +                      vpci_write_t *write_handler, unsigned int offset,
> +                      unsigned int size, void *data)
> +{
> +    struct list_head *prev;
> +    struct vpci_register *r;
> +
> +    /* Some sanity checks. */
> +    if ( (size != 1 && size != 2 && size != 4) ||
> +         offset >= PCI_CFG_SPACE_EXP_SIZE || (offset & (size - 1)) ||
> +         (!read_handler && !write_handler) )
> +        return -EINVAL;
> +
> +    r = xmalloc(struct vpci_register);
> +    if ( !r )
> +        return -ENOMEM;
> +
> +    r->read = read_handler ?: vpci_ignored_read;
> +    r->write = write_handler ?: vpci_ignored_write;
> +    r->size = size;
> +    r->offset = offset;
> +    r->private = data;
> +
> +    vpci_wlock(pdev->domain);
> +
> +    /* The list of handlers must be keep sorted at all times. */

kept

> +/*
> + * Merge new data into a partial result.
> + *
> + * Zero the bytes of 'data' from [offset, offset + size), and
> + * merge the value found in 'new' from [0, offset) left shifted

DYM [0, size) here? I also have to admit that I find it strange that
you talk of zeroing something here - the net effect of the function
is not producing any zeros anywhere afaict. Such a pre-function
comment is normally describing the effect of the function as seen
to the caller rather than its inner workings.

> --- /dev/null
> +++ b/xen/include/xen/vpci.h
> @@ -0,0 +1,80 @@
> +#ifndef _VPCI_
> +#define _VPCI_

This is a little short (and unusual) for an inclusion guard.

> +
> +#include <xen/pci.h>
> +#include <xen/types.h>
> +#include <xen/list.h>
> +
> +/*
> + * Helpers for locking/unlocking.
> + *
> + * NB: the recursive variants are used so that spin_is_locked
> + * returns whether the lock is hold by the current CPU (instead
> + * of just returning whether the lock is hold by any CPU).

Stale comment?

> + */
> +#define vpci_rlock(d) read_lock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_wlock(d) write_lock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_runlock(d) read_unlock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_wunlock(d) write_unlock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_rlocked(d) rw_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_wlocked(d) rw_is_write_locked(&(d)->arch.hvm_domain.vpci_lock)
> +
> +/*
> + * The vPCI handlers will never be called concurrently for the same domain, it
> + * is guaranteed that the vpci domain lock will always be locked when calling
> + * any handler.

One more?

> + */
> +typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg,
> +                             const void *data);
> +
> +typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg,
> +                          uint32_t val, void *data);

Do these two really need access to the struct pci_dev, rather than
just struct vpci? And if they do, then are they really permitted to
alter that struct pci_dev?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 03/11] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  2017-08-14 14:28 ` [PATCH v5 03/11] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
  2017-08-22 12:11   ` Paul Durrant
@ 2017-09-04 15:58   ` Jan Beulich
  1 sibling, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-04 15:58 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -38,6 +38,8 @@
>  #include <public/hvm/hvm_info_table.h>
>  #include <public/hvm/hvm_vcpu.h>
>  
> +#include "../x86_64/mmconfig.h"

Please try to avoid such includes - move whatever is needed to a
header in asm-x86/ instead.

> @@ -391,6 +391,187 @@ void register_vpci_portio_handler(struct domain *d)
>      handler->ops = &vpci_portio_ops;
>  }
>  
> +struct hvm_mmcfg {
> +    struct list_head next;
> +    paddr_t addr;
> +    unsigned int size;
> +    uint16_t segment;
> +    int8_t bus;

uint8_t

> +};
> +
> +/* Handlers to trap PCI MMCFG config accesses. */
> +static const struct hvm_mmcfg *vpci_mmcfg_find(struct domain *d, paddr_t addr)

const struct domain *

> +static int vpci_mmcfg_accept(struct vcpu *v, unsigned long addr)
> +{
> +    struct domain *d = v->domain;

const?

> +int __hwdom_init register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
> +                                             unsigned int start_bus,
> +                                             unsigned int end_bus,
> +                                             unsigned int seg)
> +{
> +    struct hvm_mmcfg *mmcfg;
> +
> +    ASSERT(is_hardware_domain(d));
> +
> +    vpci_wlock(d);
> +    if ( vpci_mmcfg_find(d, addr) )
> +    {
> +        vpci_wunlock(d);
> +        return -EEXIST;
> +    }
> +
> +    mmcfg = xmalloc(struct hvm_mmcfg);
> +    if ( !mmcfg )
> +    {
> +        vpci_wunlock(d);
> +        return -ENOMEM;
> +    }
> +
> +    if ( list_empty(&d->arch.hvm_domain.mmcfg_regions) )
> +        register_mmio_handler(d, &vpci_mmcfg_ops);
> +
> +    mmcfg->addr = addr + (start_bus << 20);
> +    mmcfg->bus = start_bus;

I think your structure field would better be named start_bus too.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 04/11] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  2017-08-14 14:28 ` [PATCH v5 04/11] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
@ 2017-09-05 14:57   ` Jan Beulich
  2017-09-13 15:55     ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-05 14:57 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -559,6 +559,15 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  
>          ret = pci_mmcfg_reserved(info.address, info.segment,
>                                   info.start_bus, info.end_bus, info.flags);
> +        if ( ret || !is_hvm_domain(currd) )
> +            break;

Don't you also want to check has_vpci() here?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init
  2017-08-14 14:28 ` [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
@ 2017-09-05 15:01   ` Jan Beulich
  2017-09-12  7:49     ` Roger Pau Monné
  2017-09-12 10:53   ` Julien Grall
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-05 15:01 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1465,6 +1465,46 @@ int prepare_ring_for_helper(
>      return 0;
>  }
>  
> +#if defined(CONFIG_HAS_PCI)

#ifdef please.

> +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> +                bool map)
> +{
> +    int rc;
> +
> +    /*
> +     * ATM this function should only be used by the hardware domain
> +     * because it doesn't support preemption/continuation, and as such
> +     * can take a non-negligible amount of time. Note that it periodically
> +     * calls process_pending_softirqs in order to avoid stalling the system.
> +     */
> +    ASSERT(is_hardware_domain(d));
> +
> +    for ( ; ; )
> +    {
> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> +             (d, gfn, nr_pages, mfn);
> +        if ( rc == 0 )
> +            break;
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_WARNING
> +                   "Failed to %smap [%" PRI_gfn ", %" PRI_gfn ") -> "
> +                   "[%" PRI_mfn ", %" PRI_mfn ") for d%d: %d\n",
> +                   map ? "" : "un", gfn_x(gfn), gfn_x(gfn_add(gfn, nr_pages)),
> +                   mfn_x(mfn), mfn_x(mfn_add(mfn, nr_pages)), d->domain_id,
> +                   rc);
> +            break;
> +        }
> +        nr_pages -= rc;
> +        mfn = mfn_add(mfn, rc);
> +        gfn = gfn_add(gfn, rc);
> +        process_pending_softirqs();

With the __init dropped, this become questionable: We shouldn't
do this arbitrarily; runtime use should instead force a hypercall
continuation (assuming that's the context it's going to be used in).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 06/11] pci: split code to size BARs from pci_add_device
  2017-08-14 14:28 ` [PATCH v5 06/11] pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2017-09-05 15:05   ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-05 15:05 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> So that it can be called from outside in order to get the size of regular PCI
> BARs. This will be required in order to map the BARs from PCI devices into PVH
> Dom0 p2m.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 07/11] pci: add support to size ROM BARs to pci_size_mem_bar
  2017-08-14 14:28 ` [PATCH v5 07/11] pci: add support to size ROM BARs to pci_size_mem_bar Roger Pau Monne
@ 2017-09-05 15:12   ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-05 15:12 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -594,15 +594,18 @@ static int iommu_remove_device(struct pci_dev *pdev);
>  
>  int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
>                       unsigned int func, unsigned int pos, bool last,
> -                     uint64_t *paddr, uint64_t *psize, bool vf)
> +                     uint64_t *paddr, uint64_t *psize, bool vf, bool rom)
>  {
>      uint32_t hi = 0, bar = pci_conf_read32(seg, bus, slot, func, pos);
>      uint64_t addr, size;
> +    bool is64bits = !rom && (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +                    PCI_BASE_ADDRESS_MEM_TYPE_64;
>  
> -    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
> +    ASSERT(!(rom && vf));

Things like this as well as there now being three bools among the
function parameters is imo a good indication that you want an
"unsigned int flags" parameter instead. That'll also help seeing
what the true-s and false-s are representing at the call sites. And
perhaps that would then better already be done in the patch
adding "vf".

> @@ -616,9 +619,8 @@ int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
>          pci_conf_write32(seg, bus, slot, func, pos + 4, ~0);
>      }
>      size = pci_conf_read32(seg, bus, slot, func, pos) &
> -           PCI_BASE_ADDRESS_MEM_MASK;
> -    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> -         PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +           (rom ? PCI_ROM_ADDRESS_MASK : PCI_BASE_ADDRESS_MEM_MASK);

To aid readability and because it repeats ...

> @@ -627,14 +629,14 @@ int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
>          size |= (uint64_t)~0 << 32;
>      pci_conf_write32(seg, bus, slot, func, pos, bar);
>      size = -size;
> -    addr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((uint64_t)hi << 32);
> +    addr = (bar & (rom ? PCI_ROM_ADDRESS_MASK : PCI_BASE_ADDRESS_MEM_MASK)) |

... here, perhaps worth using a local variable just like you did e.g.
with is64bits?

> +    if ( is64bits )
>          return 2;
>  
>      return 1;

Mind folding these into a single return statement now that the
result is going to be reasonably short?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-09-04 15:38   ` Jan Beulich
@ 2017-09-06 15:40     ` Roger Pau Monné
  2017-09-07  9:06       ` Jan Beulich
  2017-09-08 14:41     ` Roger Pau Monné
  1 sibling, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-06 15:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, xen-devel,
	boris.ostrovsky

On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote:
> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> > Changes since v4:
> >[...]
> > * Hypervisor code:
> >[...]
> >  - Constify the data opaque parameter of read handlers.
> 
> Is that a good idea? Such callbacks should generally be allowed to
> modify their state even if the operation is just a read - think of a
> private lock or statistics/debugging data to be updated.

Right now the consistency of the opaque data is kept by the global
vpci lock, which I like because it makes the code simpler. If the
opaque data is not constified for the read handlers then I would be
against using a read/write lock.

Statistic/debug data IMHO should be kept in a separate structure with
it's own lock, that's referenced by the opaque data. This would allow
Xen to not allocate this for non-debug builds, reducing memory
footprint and lock contention in production.

> > +/*
> > + * Merge new data into a partial result.
> > + *
> > + * Zero the bytes of 'data' from [offset, offset + size), and
> > + * merge the value found in 'new' from [0, offset) left shifted
> 
> DYM [0, size) here?

Yes, will fix.

> I also have to admit that I find it strange that
> you talk of zeroing something here - the net effect of the function
> is not producing any zeros anywhere afaict. Such a pre-function
> comment is normally describing the effect of the function as seen
> to the caller rather than its inner workings.

OK, would it be better to write it as:

/*
 * Merge new data into a partial result.
 *
 * Copy the value found in 'new' from [0, size) left shifted by
 * 'offset' into 'data'.
 */

> > + */
> > +typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg,
> > +                             const void *data);
> > +
> > +typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg,
> > +                          uint32_t val, void *data);
> 
> Do these two really need access to the struct pci_dev, rather than
> just struct vpci? And if they do, then are they really permitted to
> alter that struct pci_dev?

I'm leaning towards pdev because it already contains vpci. AFAICT it
should be fine to constify it.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-09-06 15:40     ` Roger Pau Monné
@ 2017-09-07  9:06       ` Jan Beulich
  2017-09-07 11:30         ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-07  9:06 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, xen-devel,
	boris.ostrovsky

>>> On 06.09.17 at 17:40, <roger.pau@citrix.com> wrote:
> On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote:
>> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> > Changes since v4:
>> >[...]
>> > * Hypervisor code:
>> >[...]
>> >  - Constify the data opaque parameter of read handlers.
>> 
>> Is that a good idea? Such callbacks should generally be allowed to
>> modify their state even if the operation is just a read - think of a
>> private lock or statistics/debugging data to be updated.
> 
> Right now the consistency of the opaque data is kept by the global
> vpci lock, which I like because it makes the code simpler. If the
> opaque data is not constified for the read handlers then I would be
> against using a read/write lock.
> 
> Statistic/debug data IMHO should be kept in a separate structure with
> it's own lock, that's referenced by the opaque data. This would allow
> Xen to not allocate this for non-debug builds, reducing memory
> footprint and lock contention in production.

I don't like this, as it makes adding such transiently needlessly
hard (as one would need to drop all the const-s or cast them away).
What was the reason for switching to the rwlock anyway? Did you
measure any performance problems? Are there Dom0 kernels not
serializing PCI config space accesses anyway? Would it be an
alternative to make the (spin) lock per-device rather than per-
domain? That might also be a good idea for pass-through (as there
Dom0 as well as the owning DomU fundamentally have access to
the config space of a device, and they'd better be synchronized).

>> I also have to admit that I find it strange that
>> you talk of zeroing something here - the net effect of the function
>> is not producing any zeros anywhere afaict. Such a pre-function
>> comment is normally describing the effect of the function as seen
>> to the caller rather than its inner workings.
> 
> OK, would it be better to write it as:
> 
> /*
>  * Merge new data into a partial result.
>  *
>  * Copy the value found in 'new' from [0, size) left shifted by
>  * 'offset' into 'data'.
>  */

Yes.

>> > + */
>> > +typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg,
>> > +                             const void *data);
>> > +
>> > +typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg,
>> > +                          uint32_t val, void *data);
>> 
>> Do these two really need access to the struct pci_dev, rather than
>> just struct vpci? And if they do, then are they really permitted to
>> alter that struct pci_dev?
> 
> I'm leaning towards pdev because it already contains vpci.

Well, I certainly guessed that to be the reason for the way things
are now. But what data in the physical device structure would such
a handler possibly require? Oh, looking at later patches, you take
SBDF from there. That's certainly a good enough reason then.

> AFAICT it should be fine to constify it.

In which case please do.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 08/11] vpci/bars: add handlers to map the BARs
  2017-08-14 14:28 ` [PATCH v5 08/11] vpci/bars: add handlers to map the BARs Roger Pau Monne
@ 2017-09-07  9:53   ` Jan Beulich
  2017-09-12  9:54     ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-07  9:53 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: StefanoStabellini, Wei Liu, GeorgeDunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> +static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
> +                           bool map)
> +{
> +    struct rangeset *mem;
> +    struct map_data data = { .d = d, .map = map };
> +    int rc;
> +
> +    ASSERT(MAPPABLE_BAR(bar));
> +
> +    mem = vpci_get_bar_memory(d, bar);
> +    if ( IS_ERR(mem) )
> +        return PTR_ERR(mem);
> +
> +    rc = rangeset_report_ranges(mem, 0, ~0ul, vpci_map_range, &data);
> +    rangeset_destroy(mem);
> +    if ( rc )
> +        return rc;
> +
> +    return 0;
> +}

Please make this simply "return rc".

> +static int vpci_modify_bars(const struct pci_dev *pdev, const bool map)

We don't normally constify non-pointing-to types, and even less so
in function parameters.

> +static uint32_t vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
> +                              const void *data)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +
> +    return pci_conf_read16(seg, bus, slot, func, reg);
> +}

I can see why you may want to have the slot and func local
variables, but at least seg and bus look pretty pointless to have
here.

> +static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> +                           uint32_t cmd, void *data)
> +{
> +    uint16_t current_cmd;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +
> +    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);

Would you mind making this the initializer of the variable?

> +    /*
> +     * Let the guest play with all the bits directly except for the
> +     * memory decoding one.
> +     */
> +    if ( (cmd ^ current_cmd) & PCI_COMMAND_MEMORY )
> +    {
> +        /* Memory space access change. */
> +        int rc = vpci_modify_bars(pdev, cmd & PCI_COMMAND_MEMORY);
> +
> +        if ( rc )
> +        {
> +            gdprintk(XENLOG_ERR,
> +                     "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
> +                     seg, bus, slot, func,
> +                     cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);

Perhaps gprintk(), and perhaps XENLOG_WARNING (depending on
the device it may not be that big of a problem)?

> +            return;

I think you should not bail here when it is an unmap that failed -
you'd better disable memory decode in that case.

> +static uint32_t vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
> +                              const void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    uint32_t val;
> +    bool hi = false;
> +
> +    ASSERT(bar->type == VPCI_BAR_MEM32 || bar->type == VPCI_BAR_MEM64_LO ||
> +           bar->type == VPCI_BAR_MEM64_HI);
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    if ( bar->sizing )
> +        val = ~(bar->size - 1) >> (hi ? 32 : 0);

I'm not going to insist on it, but "-bar->size" is certainly easier to
read, and may also produce shorter code (unless the compiler
does the transformation itself anyway).

> +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> +                           uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    bool hi = false;
> +
> +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY )
> +    {
> +         gdprintk(XENLOG_WARNING,
> +                  "%04x:%02x:%02x.%u: ignored BAR write with memory decoding enabled\n",
> +                  seg, bus, slot, func);

I'd again suggest gprintk().

> +        return;
> +    }
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    if ( !hi )
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;

This could be the else to the earlier if().

> +
> +    /*
> +     * The PCI Local Bus Specification suggests writing ~0 to both the high
> +     * and the low part of the BAR registers before attempting to read back
> +     * the size.
> +     *
> +     * However real device BARs registers (at least the ones I've tried)
> +     * will return the size of the BAR just by having written ~0 to one half
> +     * of it, independently of the value of the other half of the register.
> +     * Hence here Xen will switch to returning the size as soon as one half
> +     * of the BAR register has been written with ~0.
> +     */

I don't believe this is correct behavior (but I'd have to play with
some hardware to see whether I can confirm the behavior you
describe): How would you place a BAR at, say, 0x1ffffff0?

> +    if ( val == (hi ? 0xffffffff : (uint32_t)PCI_BASE_ADDRESS_MEM_MASK) )
> +    {
> +        bar->sizing = true;
> +        return;
> +    }
> +    bar->sizing = false;
> +
> +    /* Update the relevant part of the BAR address. */
> +    bar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
> +    bar->addr |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    /* Make sure Xen writes back the same value for the BAR RO bits. */
> +    if ( !hi )
> +        val |= pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                               PCI_FUNC(pdev->devfn), reg) &
> +                               ~PCI_BASE_ADDRESS_MEM_MASK;

Why don't you break out the logic from vpci_bar_read() into a
helper and use it here too, saving the (slow) config space access.

> +static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
> +                           uint32_t val, void *data)
> +{
> +    struct vpci_bar *rom = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
> +    uint32_t addr = val & PCI_ROM_ADDRESS_MASK;
> +
> +    if ( addr == (uint32_t)PCI_ROM_ADDRESS_MASK )
> +    {
> +        rom->sizing = true;
> +        return;
> +    }
> +    rom->sizing = false;
> +
> +    rom->addr = addr;
> +
> +    /* Check if ROM BAR should be mapped. */
> +    if ( (cmd & PCI_COMMAND_MEMORY) &&

Why don't you disallow writes here when memory decoding is enabled
and the ROM BAR is enabled the same way (including a log message)
you do in vpci_bar_write()?

> +static int vpci_init_bars(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t cmd;
> +    uint64_t addr, size;
> +    unsigned int i, num_bars, rom_reg;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *bars = header->bars;
> +    int rc;
> +
> +    switch ( pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f )
> +    {
> +    case PCI_HEADER_TYPE_NORMAL:
> +        num_bars = 6;
> +        rom_reg = PCI_ROM_ADDRESS;
> +        break;
> +    case PCI_HEADER_TYPE_BRIDGE:
> +        num_bars = 2;
> +        rom_reg = PCI_ROM_ADDRESS1;
> +        break;
> +    default:
> +        return -EOPNOTSUPP;
> +    }
> +
> +    /* Setup a handler for the command register. */
> +    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
> +                           2, header);
> +    if ( rc )
> +        return rc;
> +
> +    /* Disable memory decoding before sizing. */
> +    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
> +    if ( cmd & PCI_COMMAND_MEMORY )
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
> +                         cmd & ~PCI_COMMAND_MEMORY);
> +
> +    for ( i = 0; i < num_bars; i++ )
> +    {
> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> +
> +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> +        {
> +            bars[i].type = VPCI_BAR_MEM64_HI;
> +            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                                   &bars[i]);
> +            if ( rc )
> +            {
> +                pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +                return rc;
> +            }
> +
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> +        {
> +            bars[i].type = VPCI_BAR_IO;
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +             PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +            bars[i].type = VPCI_BAR_MEM64_LO;
> +        else
> +            bars[i].type = VPCI_BAR_MEM32;
> +
> +        /* Size the BAR and map it. */
> +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> +                              &addr, &size, false, false);
> +        if ( rc < 0 )
> +        {
> +            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +            return rc;
> +        }
> +
> +        if ( size == 0 )
> +        {
> +            bars[i].type = VPCI_BAR_EMPTY;
> +            continue;
> +        }
> +
> +        bars[i].addr = addr;
> +        bars[i].size = size;
> +        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> +        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                               &bars[i]);
> +        if ( rc )
> +        {
> +            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +            return rc;
> +        }
> +    }
> +
> +    /* Check expansion ROM. */
> +    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size,
> +                          false, true);
> +    if ( rc < 0 )
> +    {
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +        return rc;
> +    }
> +
> +    if ( size )
> +    {
> +        struct vpci_bar *rom = &header->bars[num_bars];
> +
> +        rom->type = VPCI_BAR_ROM;
> +        rom->size = size;
> +        rom->addr = addr;
> +
> +        rc = vpci_add_register(pdev, vpci_rom_read, vpci_rom_write, rom_reg, 4,
> +                               rom);
> +        if ( rc )
> +        {
> +            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +            return rc;
> +        }
> +    }
> +
> +    if ( cmd & PCI_COMMAND_MEMORY )
> +    {
> +        rc = vpci_modify_bars(pdev, true);
> +        if ( rc )
> +        {
> +            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +            return rc;
> +        }
> +
> +        /* Enable memory decoding. */
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);

No need for two pci_conf_write16() invocations here - you can
simply arrange to ...

> +    }
> +
> +    return 0;
> +}

... "return rc" here.

> +REGISTER_VPCI_INIT(vpci_init_bars);

Considering that ultimately an error returned from this function will
lead to nothing but a log message, I wonder whether you wouldn't
better do best effort here, i.e. only record the first error, but
continue rather than bailing. Especially for the ROM it may well be
that it's not really needed for proper operation of the device.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-09-07  9:06       ` Jan Beulich
@ 2017-09-07 11:30         ` Roger Pau Monné
  2017-09-07 11:38           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-07 11:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, xen-devel,
	boris.ostrovsky

On Thu, Sep 07, 2017 at 03:06:50AM -0600, Jan Beulich wrote:
> >>> On 06.09.17 at 17:40, <roger.pau@citrix.com> wrote:
> > On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote:
> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> >> > Changes since v4:
> >> >[...]
> >> > * Hypervisor code:
> >> >[...]
> >> >  - Constify the data opaque parameter of read handlers.
> >> 
> >> Is that a good idea? Such callbacks should generally be allowed to
> >> modify their state even if the operation is just a read - think of a
> >> private lock or statistics/debugging data to be updated.
> > 
> > Right now the consistency of the opaque data is kept by the global
> > vpci lock, which I like because it makes the code simpler. If the
> > opaque data is not constified for the read handlers then I would be
> > against using a read/write lock.
> > 
> > Statistic/debug data IMHO should be kept in a separate structure with
> > it's own lock, that's referenced by the opaque data. This would allow
> > Xen to not allocate this for non-debug builds, reducing memory
> > footprint and lock contention in production.
> 
> I don't like this, as it makes adding such transiently needlessly
> hard (as one would need to drop all the const-s or cast them away).

Hm, I'm not sure I follow. I was thinking of something along the lines
of:

struct vpci_msi_stats {...};

struct vpci_msi {
    [...]
    struct vpci_msi_stats *stats;
};

Then you can easily have a handler like:

static void vpci_msi_reg(..., const void *data)
{
    const struct vpci_msi *msi = data;
    struct vpci_msi_stats *stats = msi->stats;
    [...]
}

That should work AFAICT.

> What was the reason for switching to the rwlock anyway? Did you
> measure any performance problems? Are there Dom0 kernels not
> serializing PCI config space accesses anyway?

Not that I know of, but the PCIe spec doesn't seem to require non
concurrent accesses. PCI of course must not be concurrent.

> Would it be an
> alternative to make the (spin) lock per-device rather than per-
> domain? That might also be a good idea for pass-through (as there
> Dom0 as well as the owning DomU fundamentally have access to
> the config space of a device, and they'd better be synchronized).

That would also work, then I agree it should be a spin lock and the
const from the read handlers can be dropped. Unless you say otherwise
I'm going to implement this.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-09-07 11:30         ` Roger Pau Monné
@ 2017-09-07 11:38           ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-07 11:38 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, xen-devel,
	boris.ostrovsky

>>> On 07.09.17 at 13:30, <roger.pau@citrix.com> wrote:
> On Thu, Sep 07, 2017 at 03:06:50AM -0600, Jan Beulich wrote:
>> >>> On 06.09.17 at 17:40, <roger.pau@citrix.com> wrote:
>> > On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote:
>> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> >> > Changes since v4:
>> >> >[...]
>> >> > * Hypervisor code:
>> >> >[...]
>> >> >  - Constify the data opaque parameter of read handlers.
>> >> 
>> >> Is that a good idea? Such callbacks should generally be allowed to
>> >> modify their state even if the operation is just a read - think of a
>> >> private lock or statistics/debugging data to be updated.
>> > 
>> > Right now the consistency of the opaque data is kept by the global
>> > vpci lock, which I like because it makes the code simpler. If the
>> > opaque data is not constified for the read handlers then I would be
>> > against using a read/write lock.
>> > 
>> > Statistic/debug data IMHO should be kept in a separate structure with
>> > it's own lock, that's referenced by the opaque data. This would allow
>> > Xen to not allocate this for non-debug builds, reducing memory
>> > footprint and lock contention in production.
>> 
>> I don't like this, as it makes adding such transiently needlessly
>> hard (as one would need to drop all the const-s or cast them away).
> 
> Hm, I'm not sure I follow. I was thinking of something along the lines
> of:
> 
> struct vpci_msi_stats {...};
> 
> struct vpci_msi {
>     [...]
>     struct vpci_msi_stats *stats;
> };
> 
> Then you can easily have a handler like:
> 
> static void vpci_msi_reg(..., const void *data)
> {
>     const struct vpci_msi *msi = data;
>     struct vpci_msi_stats *stats = msi->stats;
>     [...]
> }
> 
> That should work AFAICT.

Sure, but the structure needs allocating. Which again is something
I wouldn't want to have to worry about when adding _temporary_
debugging or statistics code. But this is all moot anyway with ...

>> Would it be an
>> alternative to make the (spin) lock per-device rather than per-
>> domain? That might also be a good idea for pass-through (as there
>> Dom0 as well as the owning DomU fundamentally have access to
>> the config space of a device, and they'd better be synchronized).
> 
> That would also work, then I agree it should be a spin lock and the
> const from the read handlers can be dropped. Unless you say otherwise
> I'm going to implement this.

... this.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 09/11] vpci/msi: add MSI handlers
  2017-08-14 14:28 ` [PATCH v5 09/11] vpci/msi: add MSI handlers Roger Pau Monne
  2017-08-22 12:20   ` Paul Durrant
@ 2017-09-07 15:29   ` Jan Beulich
  2017-09-14 10:08     ` Roger Pau Monné
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-07 15:29 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> +static unsigned int msi_flags(uint16_t data, uint64_t addr)
> +{
> +    unsigned int rh, dm, dest_id, deliv_mode, trig_mode;
> +
> +    rh = MASK_EXTR(addr, MSI_ADDR_REDIRECTION_MASK);
> +    dm = MASK_EXTR(addr, MSI_ADDR_DESTMODE_MASK);
> +    dest_id = MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK);
> +    deliv_mode = MASK_EXTR(data, MSI_DATA_DELIVERY_MODE_MASK);
> +    trig_mode = MASK_EXTR(data, MSI_DATA_TRIGGER_MASK);
> +
> +    return (dest_id << GFLAGS_SHIFT_DEST_ID) | (rh << GFLAGS_SHIFT_RH) |
> +           (dm << GFLAGS_SHIFT_DM) | (deliv_mode << GFLAGS_SHIFT_DELIV_MODE) |
> +           (trig_mode << GFLAGS_SHIFT_TRG_MODE);

This would need re-basing over the removal of those GFLAGS_*
values anyway, but do you really mean those? There's no domctl
involved here I think, and hence I think you rather mean the x86
architecture mandated macros found in asm-x86/msi.h. Or wait,
no, I've just found the caller - you instead want to name the
function msi_gflags() and perhaps add comment on why you
need to use the domctl constants here.

> +void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,

Presumably constification of pdev here and elsewhere will come as
a result of doing so at the callback layer. If not, helper functions
not needing to alter it should have it const nevertheless.

> +                        unsigned int entry, bool mask)
> +{
> +    struct domain *d = pdev->domain;

This one probably can be const too.

> +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                         uint64_t address, uint32_t data, unsigned int vectors)
> +{
> +    struct msi_info msi_info = {
> +        .seg = pdev->seg,
> +        .bus = pdev->bus,
> +        .devfn = pdev->devfn,
> +        .entry_nr = vectors,
> +    };
> +    unsigned int i;
> +    int rc;
> +
> +    ASSERT(arch->pirq == INVALID_PIRQ);
> +
> +    /* Get a PIRQ. */
> +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
> +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> +    if ( rc )
> +    {
> +        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                 PCI_FUNC(pdev->devfn), rc);
> +        return rc;
> +    }
> +
> +    for ( i = 0; i < vectors; i++ )
> +    {
> +        xen_domctl_bind_pt_irq_t bind = {
> +            .machine_irq = arch->pirq + i,
> +            .irq_type = PT_IRQ_TYPE_MSI,
> +            .u.msi.gvec = msi_vector(data) + i,

Isn't that rather msi_vector(data + i), i.e. wouldn't you better
increment data together with i?

> +            .u.msi.gflags = msi_flags(data, address),
> +        };
> +
> +        pcidevs_lock();
> +        rc = pt_irq_create_bind(pdev->domain, &bind);
> +        if ( rc )
> +        {
> +            gdprintk(XENLOG_ERR,
> +                     "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
> +                     pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
> +            while ( bind.machine_irq-- )
> +                pt_irq_destroy_bind(pdev->domain, &bind);
> +            spin_lock(&pdev->domain->event_lock);
> +            unmap_domain_pirq(pdev->domain, arch->pirq);
> +            spin_unlock(&pdev->domain->event_lock);
> +            pcidevs_unlock();
> +            arch->pirq = -1;

INVALID_PIRQ ?

> +int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                          unsigned int vectors)
> +{
> +    unsigned int i;
> +
> +    ASSERT(arch->pirq != INVALID_PIRQ);
> +
> +    pcidevs_lock();
> +    for ( i = 0; i < vectors; i++ )
> +    {
> +        xen_domctl_bind_pt_irq_t bind = {
> +            .machine_irq = arch->pirq + i,
> +            .irq_type = PT_IRQ_TYPE_MSI,
> +        };
> +        int rc;
> +
> +        rc = pt_irq_destroy_bind(pdev->domain, &bind);
> +        gdprintk(XENLOG_ERR,
> +                 "%04x:%02x:%02x.%u: failed to unbind PIRQ %u: %d\n",
> +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                 PCI_FUNC(pdev->devfn), arch->pirq + i, rc);

Couldn't this be ASSERT(!rc)? Afaics all error paths in that function
are due to passing in invalid state or information.

> +static int vpci_msi_disable(struct pci_dev *pdev, struct vpci_msi *msi)
> +{
> +    int ret;
> +
> +    ASSERT(msi->enabled);
> +    __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), msi->pos, 0);
> +
> +    ret = vpci_msi_arch_disable(&msi->arch, pdev, msi->vectors);
> +    if ( ret )
> +        return ret;
> +
> +    msi->enabled = false;
> +
> +    return 0;
> +}

Please invert the if() condition and have both cases reach the final
return.

> +static void vpci_msi_control_write(struct pci_dev *pdev, unsigned int reg,
> +                                   uint32_t val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +    unsigned int vectors = 1 << MASK_EXTR(val, PCI_MSI_FLAGS_QSIZE);
> +    bool new_enabled = val & PCI_MSI_FLAGS_ENABLE;
> +
> +    if ( vectors > msi->max_vectors )
> +        vectors = msi->max_vectors;
> +
> +    /*
> +     * No change in the enable field and the number of vectors is

s/ in / if / ?

> +static int vpci_init_msi(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    struct vpci_msi *msi;
> +    unsigned int pos;
> +    uint16_t control;
> +    int ret;
> +
> +    pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
> +    if ( !pos )
> +        return 0;
> +
> +    msi = xzalloc(struct vpci_msi);
> +    if ( !msi )
> +        return -ENOMEM;
> +
> +    msi->pos = pos;
> +
> +    ret = vpci_add_register(pdev, vpci_msi_control_read,
> +                            vpci_msi_control_write,
> +                            msi_control_reg(pos), 2, msi);
> +    if ( ret )
> +    {
> +        xfree(msi);
> +        return ret;
> +    }
> +
> +    /* Get the maximum number of vectors the device supports. */
> +    control = pci_conf_read16(seg, bus, slot, func, msi_control_reg(pos));
> +    msi->max_vectors = multi_msi_capable(control);
> +    ASSERT(msi->max_vectors <= 32);
> +
> +    /* The multiple message enable is 0 after reset (1 message enabled). */
> +    msi->vectors = 1;
> +
> +    /* No PIRQ bound yet. */
> +    vpci_msi_arch_init(&msi->arch);

I think it is generally a good idea to hand the entire thing to the
arch hoot, not just the ->arch portion. The hook may want to
read other fields in order to establish the per-arch ones. Just
think of something depending on the maximum vector count.

> +    msi->address64 = is_64bit_address(control) ? true : false;
> +    msi->masking = is_mask_bit_support(control) ? true : false;

What are these conditional operators good for?

> +}
> +
> +REGISTER_VPCI_INIT(vpci_init_msi);

Btw, it's certainly a matter of taste, but would you mind leaving out
the blank line between the function and such registration macro
invocations? We commonly (but not fully uniformly) do so e.g. for
custom_param() invocations, too.

> +void vpci_dump_msi(void)
> +{
> +    struct domain *d;
> +
> +    for_each_domain ( d )
> +    {
> +        const struct pci_dev *pdev;
> +
> +        if ( !has_vpci(d) )
> +            continue;
> +
> +        printk("vPCI MSI information for d%d\n", d->domain_id);
> +
> +        if ( !vpci_tryrlock(d) )
> +        {
> +            printk("Unable to get vPCI lock, skipping\n");
> +            continue;
> +        }
> +
> +        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
> +        {
> +            uint8_t seg = pdev->seg, bus = pdev->bus;
> +            uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +            const struct vpci_msi *msi = pdev->vpci->msi;
> +
> +            if ( msi )
> +            {
> +                printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
> +
> +                printk("  Enabled: %u Supports masking: %u 64-bit addresses: %u\n",
> +                       msi->enabled, msi->masking, msi->address64);
> +                printk("  Max vectors: %u enabled vectors: %u\n",
> +                       msi->max_vectors, msi->vectors);
> +
> +                vpci_msi_arch_print(&msi->arch, msi->data, msi->address);
> +
> +                if ( msi->masking )
> +                    printk("  mask=%08x\n", msi->mask);
> +            }
> +        }
> +        vpci_runlock(d);
> +    }
> +}

There no process_pending_softirqs() in here, despite the nested
loops potentially being long lasting.

> --- a/xen/include/xen/irq.h
> +++ b/xen/include/xen/irq.h
> @@ -133,6 +133,7 @@ struct pirq {
>      struct arch_pirq arch;
>  };
>  
> +#define INVALID_PIRQ -1

This needs parentheses.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 10/11] vpci: add a priority parameter to the vPCI register initializer
  2017-08-14 14:28 ` [PATCH v5 10/11] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
@ 2017-09-07 15:32   ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-07 15:32 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> This is needed for MSI-X, since MSI-X will need to be initialized
> before parsing the BARs, so that the header BAR handlers are aware of
> the MSI-X related holes and make sure they are not mapped in order for
> the trap handlers to work properly.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 11/11] vpci/msix: add MSI-X handlers
  2017-08-14 14:28 ` [PATCH v5 11/11] vpci/msix: add MSI-X handlers Roger Pau Monne
@ 2017-09-07 16:11   ` Roger Pau Monné
  2017-09-07 16:12   ` Jan Beulich
  1 sibling, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-07 16:11 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk, Jan Beulich, Andrew Cooper

On Mon, Aug 14, 2017 at 03:28:50PM +0100, Roger Pau Monne wrote:
[...]
> +static void vpci_msix_control_write(struct pci_dev *pdev, unsigned int reg,
> +                                    uint32_t val, void *data)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    struct vpci_msix *msix = data;
> +    bool new_masked, new_enabled;
> +
> +    new_masked = val & PCI_MSIX_FLAGS_MASKALL;
> +    new_enabled = val & PCI_MSIX_FLAGS_ENABLE;
> +
> +    /*
> +     * According to the PCI 3.0 specification, switching the enable bit
> +     * to 1 or the function mask bit to 0 should cause all the cached
> +     * addresses and data fields to be recalculated. Xen implements this
> +     * as disabling and enabling the entries.
> +     *
> +     * Note that the disable/enable sequence is only performed when the
> +     * guest has written to the entry (ie: updated field set).
> +     */
> +    if ( new_enabled && !new_masked && (!msix->enabled || msix->masked) )
> +    {
> +        paddr_t table_base = pdev->vpci->header.bars[msix->table.bir].addr;
> +        unsigned int i;
> +        int rc;
> +
> +        for ( i = 0; i < msix->max_entries; i++ )
> +        {
> +            if ( msix->entries[i].masked || !msix->entries[i].updated )
> +                continue;
> +
> +            rc = vpci_msix_arch_disable(&msix->entries[i].arch, pdev);
> +            if ( rc )
> +            {
> +                gdprintk(XENLOG_ERR,
> +                         "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
> +                         seg, bus, slot, func, msix->entries[i].nr, rc);
> +                return;
> +            }
> +
> +            rc = vpci_msix_arch_enable(&msix->entries[i].arch, pdev,
> +                                       msix->entries[i].addr,
> +                                       msix->entries[i].data,
> +                                       msix->entries[i].nr, table_base);
> +            if ( rc )
> +            {
> +                gdprintk(XENLOG_ERR,
> +                         "%04x:%02x:%02x.%u: unable to enable entry %u: %d\n",
> +                         seg, bus, slot, func, msix->entries[i].nr, rc);
> +                /* Entry is likely not configured, skip it. */
> +                continue;
> +            }
> +
> +            /*
> +             * At this point the PIRQ is still masked. Unmask it, or else the
> +             * guest won't receive interrupts. This is due to the
> +             * disable/enable sequence performed above.
> +             */
> +            vpci_msix_arch_mask(&msix->entries[i].arch, pdev, false);
> +
> +            msix->entries[i].updated = false;
> +        }
> +    }

I've realized that this function is missing the unmapping/unbinding of
PIRQs when the guest disables MSIX. I've added this now and will be
part of the next iteration.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 11/11] vpci/msix: add MSI-X handlers
  2017-08-14 14:28 ` [PATCH v5 11/11] vpci/msix: add MSI-X handlers Roger Pau Monne
  2017-09-07 16:11   ` Roger Pau Monné
@ 2017-09-07 16:12   ` Jan Beulich
  2017-09-15 10:44     ` Roger Pau Monné
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-07 16:12 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> +void vpci_msix_arch_mask(struct vpci_arch_msix_entry *arch,
> +                         struct pci_dev *pdev, bool mask)
> +{
> +    if ( arch->pirq == INVALID_PIRQ )
> +        return;

How come no similar guard is needed in vpci_msi_arch_mask()?

> +int vpci_msix_arch_enable(struct vpci_arch_msix_entry *arch,

Considering the first parameter's type, is the name actually
appropriate? From the parameter (and the code) I would
conclude you enable a single entry here only, not the entire
device. This may apply to functions further below, too.

> +                          struct pci_dev *pdev, uint64_t address,
> +                          uint32_t data, unsigned int entry_nr,
> +                          paddr_t table_base)
> +{
> +    struct domain *d = pdev->domain;
> +    struct msi_info msi_info = {
> +        .seg = pdev->seg,
> +        .bus = pdev->bus,
> +        .devfn = pdev->devfn,
> +        .table_base = table_base,
> +        .entry_nr = entry_nr,
> +    };
> +    xen_domctl_bind_pt_irq_t bind = {
> +        .irq_type = PT_IRQ_TYPE_MSI,
> +        .u.msi.gvec = msi_vector(data),
> +        .u.msi.gflags = msi_flags(data, address),
> +    };
> +    int rc;
> +
> +    ASSERT(arch->pirq == INVALID_PIRQ);
> +
> +    /*
> +     * Simple sanity check before trying to setup the interrupt.
> +     * According to the Intel SDM, bits [31, 20] must contain the
> +     * value 0xfee. This avoids needlessly setting up pirqs for entries
> +     * the guest has not actually configured.
> +     */
> +    if ( (address & 0xfff00000) != MSI_ADDR_HEADER )
> +        return -EINVAL;

What about the upper half of the address? That needs to be zero,
I think.

> +    rc = allocate_and_map_msi_pirq(d, -1, &arch->pirq,
> +                                   MAP_PIRQ_TYPE_MSI, &msi_info);
> +    if ( rc )
> +    {
> +        gdprintk(XENLOG_ERR,
> +                 "%04x:%02x:%02x.%u: unable to map MSI-X PIRQ entry %u: %d\n",
> +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                 PCI_FUNC(pdev->devfn), entry_nr, rc);
> +        return rc;
> +    }
> +
> +    bind.machine_irq = arch->pirq;
> +    pcidevs_lock();
> +    rc = pt_irq_create_bind(d, &bind);
> +    if ( rc )
> +    {
> +        gdprintk(XENLOG_ERR,
> +                 "%04x:%02x:%02x.%u: unable to create MSI-X bind %u: %d\n",
> +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                 PCI_FUNC(pdev->devfn), entry_nr, rc);
> +        spin_lock(&d->event_lock);
> +        unmap_domain_pirq(d, arch->pirq);
> +        spin_unlock(&d->event_lock);
> +        pcidevs_unlock();
> +        arch->pirq = INVALID_PIRQ;
> +        return rc;
> +    }
> +    pcidevs_unlock();
> +
> +    return 0;
> +}

Much of this looks pretty similar to the MSI counterpart. Can any
reasonable portion of this be factored out?

> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -20,6 +20,7 @@
>  #include <xen/sched.h>
>  #include <xen/vpci.h>
>  #include <xen/p2m-common.h>
> +#include <asm/p2m.h>

No need to include xen/p2m-common.h then anymore, although for
a common source file it's probably not entirely right to include this
header at all.

> @@ -89,11 +90,45 @@ static int vpci_map_range(unsigned long s, unsigned long e, void *data)
>      return modify_mmio(map->d, _gfn(s), _mfn(s), e - s + 1, map->map);
>  }
>  
> +static int vpci_unmap_msix(struct domain *d, struct vpci_msix_mem *msix)
> +{
> +    unsigned long gfn;
> +
> +    for ( gfn = PFN_DOWN(msix->addr); gfn <= PFN_UP(msix->addr + msix->size);

Missing a subtraction of 1 in the PFN_UP() invocation? And why <= ?

> @@ -102,6 +137,42 @@ static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
>      if ( IS_ERR(mem) )
>          return PTR_ERR(mem);
>  
> +    for ( i = 0; i < ARRAY_SIZE(bar->msix); i++ )
> +    {
> +        struct vpci_msix_mem *msix = bar->msix[i];
> +
> +        if ( !msix || msix->addr == INVALID_PADDR )
> +            continue;
> +
> +        if ( map )
> +        {
> +            /*
> +             * Make sure the MSI-X regions of the BAR are not mapped into the
> +             * domain p2m, or else the MSI-X handlers are useless. Only do this
> +             * when mapping, since that's when the memory decoding on the
> +             * device is enabled.
> +             *
> +             * This is required because iommu_inclusive_mapping might have
> +             * mapped MSI-X regions into the guest p2m.
> +             */
> +            rc = vpci_unmap_msix(d, msix);
> +            if ( rc )
> +            {
> +                rangeset_destroy(mem);
> +                return rc;
> +            }
> +        }
> +
> +        rc = rangeset_remove_range(mem, PFN_DOWN(msix->addr),
> +                                   PFN_DOWN(msix->addr + msix->size));
> +        if ( rc )
> +        {
> +            rangeset_destroy(mem);
> +            return rc;
> +        }
> +
> +    }

Why do you do this for the PBA regardless of whether it's shared
with a table page?

> @@ -255,6 +327,11 @@ static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
>      bar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>      bar->addr |= (uint64_t)val << (hi ? 32 : 0);
>  
> +    /* Update any MSI-X areas contained in this BAR. */
> +    for ( i = 0; i < ARRAY_SIZE(bar->msix); i++ )
> +        if ( bar->msix[i] )
> +            bar->msix[i]->addr = bar->addr + bar->msix[i]->offset;

Is it really necessary to store this information separately? It
can be re-calculated easily from BIR.

> +#define MSIX_SIZE(num) offsetof(struct vpci_msix, entries[num])

Wouldn't this better be VMSIX_SIZE()?

> +#define MSIX_ADDR_IN_RANGE(a, table)                                    \
> +    ((table)->addr != INVALID_PADDR && (a) >= (table)->addr &&          \
> +     (a) < (table)->addr + (table)->size)
> +
> +static uint32_t vpci_msix_control_read(struct pci_dev *pdev, unsigned int reg,
> +                                       const void *data)
> +{
> +    const struct vpci_msix *msix = data;
> +    uint16_t val;
> +
> +    val = (msix->max_entries - 1) & PCI_MSIX_FLAGS_QSIZE;

Is the explicit masking really necessary? And if so, it would look to
be more correct to use MASK_INSR().

> +static struct vpci_msix *vpci_msix_find(struct domain *d, unsigned long addr)

d can be const as it looks.

> +static int vpci_msix_access_check(struct pci_dev *pdev, unsigned long addr,
> +                                  unsigned int len)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +
> +    /* Only allow 32/64b accesses. */
> +    if ( len != 4 && len != 8 )
> +    {
> +        gdprintk(XENLOG_ERR,
> +                 "%04x:%02x:%02x.%u: invalid MSI-X table access size: %u\n",
> +                 seg, bus, slot, func, len);
> +        return -EINVAL;
> +    }
> +
> +    /* Only allow aligned accesses. */
> +    if ( (addr & (len - 1)) != 0 )
> +    {
> +        gdprintk(XENLOG_ERR,
> +                 "%04x:%02x:%02x.%u: MSI-X only allows aligned accesses\n",
> +                 seg, bus, slot, func);
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}

Wouldn't this function better return bool?

> +static int vpci_msix_read(struct vcpu *v, unsigned long addr,
> +                          unsigned int len, unsigned long *data)
> +{
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
> +    const struct vpci_msix_entry *entry;
> +    unsigned int offset;
> +
> +    vpci_rlock(d);
> +    msix = vpci_msix_find(d, addr);
> +    if ( !msix )
> +    {
> +        vpci_runlock(d);
> +        *data = ~0ul;
> +        return X86EMUL_OKAY;
> +    }
> +
> +    if ( vpci_msix_access_check(msix->pdev, addr, len) )
> +    {
> +        vpci_runlock(d);
> +        *data = ~0ul;
> +        return X86EMUL_OKAY;
> +    }

Please fold the two if()s.

> +    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
> +    {
> +        /* Access to PBA. */
> +        switch ( len )
> +        {
> +        case 4:
> +            *data = readl(addr);
> +            break;
> +        case 8:
> +            *data = readq(addr);
> +            break;

This is strictly only valid for Dom0, so perhaps worth a comment.

> +        default:
> +            ASSERT_UNREACHABLE();
> +            *data = ~0ul;
> +            break;
> +        }
> +
> +        vpci_runlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    entry = vpci_msix_get_entry(msix, addr);
> +    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
> +
> +    switch ( offset )
> +    {
> +    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
> +        /*
> +         * NB: do explicit truncation to the size of the access. This shouldn't
> +         * be required here, since the caller of the handler should already
> +         * take the appropriate measures to truncate the value before returning
> +         * to the guest, but better be safe than sorry.
> +         */
> +        *data = len == 8 ? entry->addr : (uint32_t)entry->addr;

I don't see the need for it - if there's really a bug up the call stack,
it'll be better to fix it there. There's no security implication here
afaics.

> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -144,6 +144,25 @@ void vpci_msi_arch_init(struct vpci_arch_msi *arch);
>  void vpci_msi_arch_print(const struct vpci_arch_msi *arch, uint16_t data,
>                           uint64_t addr);
>  
> +/* Arch-specific MSI-X entry data for vPCI. */
> +struct vpci_arch_msix_entry {
> +    int pirq;
> +};
> +
> +/* Arch-specific vPCI MSI-X helpers. */
> +void vpci_msix_arch_mask(struct vpci_arch_msix_entry *arch,
> +                         struct pci_dev *pdev, bool mask);
> +int vpci_msix_arch_enable(struct vpci_arch_msix_entry *arch,
> +                          struct pci_dev *pdev, uint64_t address,
> +                          uint32_t data, unsigned int entry_nr,
> +                          paddr_t table_base);
> +int vpci_msix_arch_disable(struct vpci_arch_msix_entry *arch,
> +                           struct pci_dev *pdev);
> +int vpci_msix_arch_init(struct vpci_arch_msix_entry *arch);
> +void vpci_msix_arch_print(const struct vpci_arch_msix_entry *entry,
> +                          uint32_t data, uint64_t addr, bool masked,
> +                          unsigned int pos);

Actually such helpers, when they are supposed to be called from
common code, and when it is not expected for them to be inlined,
would better be declared in a common header, so they won't need
repeating (and later updating) in multiple places. Obviously this
applies to the earlier MSI ones too.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-09-04 15:38   ` Jan Beulich
  2017-09-06 15:40     ` Roger Pau Monné
@ 2017-09-08 14:41     ` Roger Pau Monné
  2017-09-08 15:56       ` Jan Beulich
  1 sibling, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-08 14:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, xen-devel,
	boris.ostrovsky

On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote:
> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> > +    /*
> > +     * At this point we have the following layout:
> > +     *
> > +     * Note that this refers to the position of the variables,
> > +     * but the value has already changed from the one given at
> > +     * initialization time because write tests have been performed.
> > +     *
> > +     * 32    24    16     8     0
> > +     *  +-----+-----+-----+-----+
> > +     *  |          r0           | 0
> > +     *  +-----+-----+-----+-----+
> > +     *  | r7  |  r6 |  r5 |/////| 32
> > +     *  +-----+-----+-----+-----|
> > +     *  |///////////////////////| 64
> > +     *  +-----------+-----------+
> > +     *  |///////////|    r12    | 96
> > +     *  +-----------+-----------+
> > +     *             ...
> > +     *  / = empty.
> 
> Maybe better "unwritten"?

I've been thinking about this, and I'm not sure unwritten is better,
in fact the test will write to this registers, it's just that there's
no backing handlers so writes will be discarded and reads will return
~0.

So I think "empty" or maybe "unhandled" is more descriptive.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-09-08 14:41     ` Roger Pau Monné
@ 2017-09-08 15:56       ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-08 15:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, xen-devel,
	boris.ostrovsky

>>> On 08.09.17 at 16:41, <roger.pau@citrix.com> wrote:
> On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote:
>> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> > +    /*
>> > +     * At this point we have the following layout:
>> > +     *
>> > +     * Note that this refers to the position of the variables,
>> > +     * but the value has already changed from the one given at
>> > +     * initialization time because write tests have been performed.
>> > +     *
>> > +     * 32    24    16     8     0
>> > +     *  +-----+-----+-----+-----+
>> > +     *  |          r0           | 0
>> > +     *  +-----+-----+-----+-----+
>> > +     *  | r7  |  r6 |  r5 |/////| 32
>> > +     *  +-----+-----+-----+-----|
>> > +     *  |///////////////////////| 64
>> > +     *  +-----------+-----------+
>> > +     *  |///////////|    r12    | 96
>> > +     *  +-----------+-----------+
>> > +     *             ...
>> > +     *  / = empty.
>> 
>> Maybe better "unwritten"?
> 
> I've been thinking about this, and I'm not sure unwritten is better,
> in fact the test will write to this registers, it's just that there's
> no backing handlers so writes will be discarded and reads will return
> ~0.
> 
> So I think "empty" or maybe "unhandled" is more descriptive.

"unhandled" then please - registers can't possibly be empty imo.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init
  2017-09-05 15:01   ` Jan Beulich
@ 2017-09-12  7:49     ` Roger Pau Monné
  2017-09-12  9:04       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-12  7:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Tue, Sep 05, 2017 at 09:01:57AM -0600, Jan Beulich wrote:
> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> > +                bool map)
> > +{
> > +    int rc;
> > +
> > +    /*
> > +     * ATM this function should only be used by the hardware domain
> > +     * because it doesn't support preemption/continuation, and as such
> > +     * can take a non-negligible amount of time. Note that it periodically
> > +     * calls process_pending_softirqs in order to avoid stalling the system.
> > +     */
> > +    ASSERT(is_hardware_domain(d));
> > +
> > +    for ( ; ; )
> > +    {
> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> > +             (d, gfn, nr_pages, mfn);
> > +        if ( rc == 0 )
> > +            break;
> > +        if ( rc < 0 )
> > +        {
> > +            printk(XENLOG_WARNING
> > +                   "Failed to %smap [%" PRI_gfn ", %" PRI_gfn ") -> "
> > +                   "[%" PRI_mfn ", %" PRI_mfn ") for d%d: %d\n",
> > +                   map ? "" : "un", gfn_x(gfn), gfn_x(gfn_add(gfn, nr_pages)),
> > +                   mfn_x(mfn), mfn_x(mfn_add(mfn, nr_pages)), d->domain_id,
> > +                   rc);
> > +            break;
> > +        }
> > +        nr_pages -= rc;
> > +        mfn = mfn_add(mfn, rc);
> > +        gfn = gfn_add(gfn, rc);
> > +        process_pending_softirqs();
> 
> With the __init dropped, this become questionable: We shouldn't
> do this arbitrarily; runtime use should instead force a hypercall
> continuation (assuming that's the context it's going to be used in).

This will be used by the PCI emulation code, which is a vmexit but not
an hypercall.

I have a plan to add continuations, but I would rather do it as part
of using the PCI emulation for DomUs.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init
  2017-09-12  7:49     ` Roger Pau Monné
@ 2017-09-12  9:04       ` Jan Beulich
  2017-09-12 11:27         ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-12  9:04 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 12.09.17 at 09:49, <roger.pau@citrix.com> wrote:
> On Tue, Sep 05, 2017 at 09:01:57AM -0600, Jan Beulich wrote:
>> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long 
> nr_pages,
>> > +                bool map)
>> > +{
>> > +    int rc;
>> > +
>> > +    /*
>> > +     * ATM this function should only be used by the hardware domain
>> > +     * because it doesn't support preemption/continuation, and as such
>> > +     * can take a non-negligible amount of time. Note that it periodically
>> > +     * calls process_pending_softirqs in order to avoid stalling the 
> system.
>> > +     */
>> > +    ASSERT(is_hardware_domain(d));
>> > +
>> > +    for ( ; ; )
>> > +    {
>> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
>> > +             (d, gfn, nr_pages, mfn);
>> > +        if ( rc == 0 )
>> > +            break;
>> > +        if ( rc < 0 )
>> > +        {
>> > +            printk(XENLOG_WARNING
>> > +                   "Failed to %smap [%" PRI_gfn ", %" PRI_gfn ") -> "
>> > +                   "[%" PRI_mfn ", %" PRI_mfn ") for d%d: %d\n",
>> > +                   map ? "" : "un", gfn_x(gfn), gfn_x(gfn_add(gfn, nr_pages)),
>> > +                   mfn_x(mfn), mfn_x(mfn_add(mfn, nr_pages)), d->domain_id,
>> > +                   rc);
>> > +            break;
>> > +        }
>> > +        nr_pages -= rc;
>> > +        mfn = mfn_add(mfn, rc);
>> > +        gfn = gfn_add(gfn, rc);
>> > +        process_pending_softirqs();
>> 
>> With the __init dropped, this become questionable: We shouldn't
>> do this arbitrarily; runtime use should instead force a hypercall
>> continuation (assuming that's the context it's going to be used in).
> 
> This will be used by the PCI emulation code, which is a vmexit but not
> an hypercall.
> 
> I have a plan to add continuations, but I would rather do it as part
> of using the PCI emulation for DomUs.

In which case please retain the __init while moving the function,
so there's no latent bug here in case someone else wants to
call this function in other than boot time context. The __init
should be dropped only together with making the softirq
processing here conditional, using some suitable other mechanism
post-boot.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 08/11] vpci/bars: add handlers to map the BARs
  2017-09-07  9:53   ` Jan Beulich
@ 2017-09-12  9:54     ` Roger Pau Monné
  2017-09-12 10:06       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-12  9:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: StefanoStabellini, Wei Liu, GeorgeDunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, boris.ostrovsky

On Thu, Sep 07, 2017 at 03:53:07AM -0600, Jan Beulich wrote:
> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> > +
> > +    /*
> > +     * The PCI Local Bus Specification suggests writing ~0 to both the high
> > +     * and the low part of the BAR registers before attempting to read back
> > +     * the size.
> > +     *
> > +     * However real device BARs registers (at least the ones I've tried)
> > +     * will return the size of the BAR just by having written ~0 to one half
> > +     * of it, independently of the value of the other half of the register.
> > +     * Hence here Xen will switch to returning the size as soon as one half
> > +     * of the BAR register has been written with ~0.
> > +     */
> 
> I don't believe this is correct behavior (but I'd have to play with
> some hardware to see whether I can confirm the behavior you
> describe): How would you place a BAR at, say, 0x1ffffff0?

I don't think it's 'correct' behavior either, but FreeBSD has been
sizing BARs like that, and nobody noticed any issues. I just fixed it
recently:

https://svnweb.freebsd.org/base/head/sys/dev/pci/pci.c?r1=312250&r2=321863

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 08/11] vpci/bars: add handlers to map the BARs
  2017-09-12  9:54     ` Roger Pau Monné
@ 2017-09-12 10:06       ` Jan Beulich
  2017-09-12 11:48         ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-12 10:06 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: StefanoStabellini, Wei Liu, GeorgeDunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 12.09.17 at 11:54, <roger.pau@citrix.com> wrote:
> On Thu, Sep 07, 2017 at 03:53:07AM -0600, Jan Beulich wrote:
>> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> > +
>> > +    /*
>> > +     * The PCI Local Bus Specification suggests writing ~0 to both the high
>> > +     * and the low part of the BAR registers before attempting to read back
>> > +     * the size.
>> > +     *
>> > +     * However real device BARs registers (at least the ones I've tried)
>> > +     * will return the size of the BAR just by having written ~0 to one half
>> > +     * of it, independently of the value of the other half of the register.
>> > +     * Hence here Xen will switch to returning the size as soon as one half
>> > +     * of the BAR register has been written with ~0.
>> > +     */
>> 
>> I don't believe this is correct behavior (but I'd have to play with
>> some hardware to see whether I can confirm the behavior you
>> describe): How would you place a BAR at, say, 0x1ffffff0?
> 
> I don't think it's 'correct' behavior either, but FreeBSD has been
> sizing BARs like that, and nobody noticed any issues. I just fixed it
> recently:
> 
> https://svnweb.freebsd.org/base/head/sys/dev/pci/pci.c?r1=312250&r2=321863 

Oh, no, that old code was fine afaict. You have to view the two
halves of a 64-bit BAR as completely distinct registers, and
sizing of the full BAR can be done either by writing both, then
reading both, or reading/writing each half. The problem with the
code in the patch here is that you don't treat the two halves as
fully separate registers, implying that one half being written with
all ones _also_ makes the other half return the sizing value
instead of the last written address.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-08-14 14:28 ` [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
  2017-08-22 12:05   ` Paul Durrant
  2017-09-04 15:38   ` Jan Beulich
@ 2017-09-12 10:42   ` Julien Grall
  2017-09-12 10:58     ` Roger Pau Monné
  2 siblings, 1 reply; 60+ messages in thread
From: Julien Grall @ 2017-09-12 10:42 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, Jan Beulich,
	boris.ostrovsky

Hi Roger,

On 14/08/17 15:28, Roger Pau Monne wrote:
> This functionality is going to reside in vpci.c (and the corresponding
> vpci.h header), and should be arch-agnostic. The handlers introduced
> in this patch setup the basic functionality required in order to trap
> accesses to the PCI config space, and allow decoding the address and
> finding the corresponding handler that should handle the access
> (although no handlers are implemented).

If I understand correctly this patch, the virtual BDF will always 
correspond to the physical BDF. Is that right? If so, would you mind to 
explain why such restriction?

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init
  2017-08-14 14:28 ` [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
  2017-09-05 15:01   ` Jan Beulich
@ 2017-09-12 10:53   ` Julien Grall
  2017-09-12 11:38     ` Roger Pau Monné
  1 sibling, 1 reply; 60+ messages in thread
From: Julien Grall @ 2017-09-12 10:53 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Jan Beulich

Hi Roger,

On 14/08/17 15:28, Roger Pau Monne wrote:
> And also allow it to do non-identity mappings by adding a new
> parameter.
> 
> This function will be needed in order to map the BARs from PCI devices
> into the Dom0 p2m (and is also used by the x86 Dom0 builder). While
> there fix the function to use gfn_t and mfn_t instead of unsigned long
> for memory addresses. >
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> Changes since v4:
>   - Guard the function with CONFIG_HAS_PCI only.
>   - s/non-trival/non-negligible in the comment.
>   - Change XENLOG_G_WARNING to XENLOG_WARNING like the original
>     function.
> 
> Changes since v3:
>   - Remove the dummy modify_identity_mmio helper in dom0_build.c
>   - Try to make the comment in modify MMIO less scary.
>   - Clarify commit message.
>   - Only build the function for x86 or if there's PCI support.
> 
> Changes since v2:
>   - Use mfn_t and gfn_t.
>   - Remove stray newline.
> ---
>   xen/arch/x86/hvm/dom0_build.c | 30 ++----------------------------
>   xen/common/memory.c           | 40 ++++++++++++++++++++++++++++++++++++++++
>   xen/include/xen/p2m-common.h  |  3 +++
>   3 files changed, 45 insertions(+), 28 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index 04a8682d33..c65eb8503f 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -61,32 +61,6 @@ static struct acpi_madt_interrupt_override __initdata *intsrcovr;
>   static unsigned int __initdata acpi_nmi_sources;
>   static struct acpi_madt_nmi_source __initdata *nmisrc;
>   
> -static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
> -                                       unsigned long nr_pages, const bool map)
> -{
> -    int rc;
> -
> -    for ( ; ; )
> -    {
> -        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> -             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> -        if ( rc == 0 )
> -            break;
> -        if ( rc < 0 )
> -        {
> -            printk(XENLOG_WARNING
> -                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
> -                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> -            break;
> -        }
> -        nr_pages -= rc;
> -        pfn += rc;
> -        process_pending_softirqs();
> -    }
> -
> -    return rc;
> -}
> -
>   /* Populate a HVM memory range using the biggest possible order. */
>   static int __init pvh_populate_memory_range(struct domain *d,
>                                               unsigned long start,
> @@ -397,7 +371,7 @@ static int __init pvh_setup_p2m(struct domain *d)
>        * Memory below 1MB is identity mapped.
>        * NB: this only makes sense when booted from legacy BIOS.
>        */
> -    rc = modify_identity_mmio(d, 0, MB1_PAGES, true);
> +    rc = modify_mmio(d, _gfn(0), _mfn(0), MB1_PAGES, true);
>       if ( rc )
>       {
>           printk("Failed to identity map low 1MB: %d\n", rc);
> @@ -964,7 +938,7 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
>           nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>                             d->arch.e820[i].size);
>   
> -        rc = modify_identity_mmio(d, pfn, nr_pages, true);
> +        rc = modify_mmio(d, _gfn(pfn), _mfn(pfn), nr_pages, true);
>           if ( rc )
>           {
>               printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index b2066db07e..86824edb09 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1465,6 +1465,46 @@ int prepare_ring_for_helper(
>       return 0;
>   }
>   
> +#if defined(CONFIG_HAS_PCI)
> +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> +                bool map)
> +{

I don't think this is correct to move this function in common code 
without making sure that all arch have *map_mmio_regions supporting 
preemption.

This is actually not the case on ARM and IHMO should be fixed before 
getting this function common. Otherwise you will expose a security issue 
the day vCPI will get supported on ARM.

> +    int rc;
> +
> +    /*
> +     * ATM this function should only be used by the hardware domain
> +     * because it doesn't support preemption/continuation, and as such
> +     * can take a non-negligible amount of time. Note that it periodically
> +     * calls process_pending_softirqs in order to avoid stalling the system.
> +     */
> +    ASSERT(is_hardware_domain(d));
> +
> +    for ( ; ; )
> +    {
> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)

As mentioned in an earlier version, on ARM map_mmio_regions will map the 
MMIO with very strict attribute (no-unaligned access, 
non-gatherable,...). This will not be correct for some BARs. So I think 
we should provide the attribute type in parameter of modify_mmio to know 
the memory attribute.

> +             (d, gfn, nr_pages, mfn);
> +        if ( rc == 0 )
> +            break;
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_WARNING
> +                   "Failed to %smap [%" PRI_gfn ", %" PRI_gfn ") -> "
> +                   "[%" PRI_mfn ", %" PRI_mfn ") for d%d: %d\n",
> +                   map ? "" : "un", gfn_x(gfn), gfn_x(gfn_add(gfn, nr_pages)),
> +                   mfn_x(mfn), mfn_x(mfn_add(mfn, nr_pages)), d->domain_id,
> +                   rc);
> +            break;
> +        }
> +        nr_pages -= rc;
> +        mfn = mfn_add(mfn, rc);
> +        gfn = gfn_add(gfn, rc);
> +        process_pending_softirqs();
> +    }
> +
> +    return rc;
> +}
> +#endif
> +
>   /*
>    * Local variables:
>    * mode: C
> diff --git a/xen/include/xen/p2m-common.h b/xen/include/xen/p2m-common.h
> index 2b5696cf33..c2f9015ad8 100644
> --- a/xen/include/xen/p2m-common.h
> +++ b/xen/include/xen/p2m-common.h
> @@ -20,4 +20,7 @@ int unmap_mmio_regions(struct domain *d,
>                          unsigned long nr,
>                          mfn_t mfn);
>   
> +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> +                const bool map);
> +
>   #endif /* _XEN_P2M_COMMON_H */
> 

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-09-12 10:42   ` Julien Grall
@ 2017-09-12 10:58     ` Roger Pau Monné
  2017-09-12 11:00       ` Julien Grall
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-12 10:58 UTC (permalink / raw)
  To: Julien Grall
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, Jan Beulich,
	xen-devel, boris.ostrovsky

On Tue, Sep 12, 2017 at 11:42:38AM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 14/08/17 15:28, Roger Pau Monne wrote:
> > This functionality is going to reside in vpci.c (and the corresponding
> > vpci.h header), and should be arch-agnostic. The handlers introduced
> > in this patch setup the basic functionality required in order to trap
> > accesses to the PCI config space, and allow decoding the address and
> > finding the corresponding handler that should handle the access
> > (although no handlers are implemented).
> 
> If I understand correctly this patch, the virtual BDF will always correspond
> to the physical BDF. Is that right? If so, would you mind to explain why
> such restriction?

Yes, this is not a limitation of this patch, but of the implementation
that follows. Likely this will be expanded when support for DomU is
added, but for Dom0 at least on x86 adding such a translation layer is
not needed, since I see no reason to present a different PCI topology
to Dom0.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-09-12 10:58     ` Roger Pau Monné
@ 2017-09-12 11:00       ` Julien Grall
  0 siblings, 0 replies; 60+ messages in thread
From: Julien Grall @ 2017-09-12 11:00 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Paul Durrant, Jan Beulich,
	xen-devel, boris.ostrovsky



On 12/09/17 11:58, Roger Pau Monné wrote:
> On Tue, Sep 12, 2017 at 11:42:38AM +0100, Julien Grall wrote:
>> Hi Roger,
>>
>> On 14/08/17 15:28, Roger Pau Monne wrote:
>>> This functionality is going to reside in vpci.c (and the corresponding
>>> vpci.h header), and should be arch-agnostic. The handlers introduced
>>> in this patch setup the basic functionality required in order to trap
>>> accesses to the PCI config space, and allow decoding the address and
>>> finding the corresponding handler that should handle the access
>>> (although no handlers are implemented).
>>
>> If I understand correctly this patch, the virtual BDF will always correspond
>> to the physical BDF. Is that right? If so, would you mind to explain why
>> such restriction?
> 
> Yes, this is not a limitation of this patch, but of the implementation
> that follows. Likely this will be expanded when support for DomU is
> added, but for Dom0 at least on x86 adding such a translation layer is
> not needed, since I see no reason to present a different PCI topology
> to Dom0.

I think this will be the same for ARM. Dom0 will always have pBDF == 
vBDF. For guest will likely want pBDF != vBDF.

This answer to my other question regarding the plan to support vBDF != 
pBDF :). Thanks.

> 
> Thanks, Roger.
> 

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init
  2017-09-12  9:04       ` Jan Beulich
@ 2017-09-12 11:27         ` Roger Pau Monné
  2017-09-12 12:53           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-12 11:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Tue, Sep 12, 2017 at 03:04:02AM -0600, Jan Beulich wrote:
> >>> On 12.09.17 at 09:49, <roger.pau@citrix.com> wrote:
> > On Tue, Sep 05, 2017 at 09:01:57AM -0600, Jan Beulich wrote:
> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> >> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long 
> > nr_pages,
> >> > +                bool map)
> >> > +{
> >> > +    int rc;
> >> > +
> >> > +    /*
> >> > +     * ATM this function should only be used by the hardware domain
> >> > +     * because it doesn't support preemption/continuation, and as such
> >> > +     * can take a non-negligible amount of time. Note that it periodically
> >> > +     * calls process_pending_softirqs in order to avoid stalling the 
> > system.
> >> > +     */
> >> > +    ASSERT(is_hardware_domain(d));
> >> > +
> >> > +    for ( ; ; )
> >> > +    {
> >> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> >> > +             (d, gfn, nr_pages, mfn);
> >> > +        if ( rc == 0 )
> >> > +            break;
> >> > +        if ( rc < 0 )
> >> > +        {
> >> > +            printk(XENLOG_WARNING
> >> > +                   "Failed to %smap [%" PRI_gfn ", %" PRI_gfn ") -> "
> >> > +                   "[%" PRI_mfn ", %" PRI_mfn ") for d%d: %d\n",
> >> > +                   map ? "" : "un", gfn_x(gfn), gfn_x(gfn_add(gfn, nr_pages)),
> >> > +                   mfn_x(mfn), mfn_x(mfn_add(mfn, nr_pages)), d->domain_id,
> >> > +                   rc);
> >> > +            break;
> >> > +        }
> >> > +        nr_pages -= rc;
> >> > +        mfn = mfn_add(mfn, rc);
> >> > +        gfn = gfn_add(gfn, rc);
> >> > +        process_pending_softirqs();
> >> 
> >> With the __init dropped, this become questionable: We shouldn't
> >> do this arbitrarily; runtime use should instead force a hypercall
> >> continuation (assuming that's the context it's going to be used in).
> > 
> > This will be used by the PCI emulation code, which is a vmexit but not
> > an hypercall.
> > 
> > I have a plan to add continuations, but I would rather do it as part
> > of using the PCI emulation for DomUs.
> 
> In which case please retain the __init while moving the function,
> so there's no latent bug here in case someone else wants to
> call this function in other than boot time context. The __init
> should be dropped only together with making the softirq
> processing here conditional, using some suitable other mechanism
> post-boot.

This will already be used in non-boot context with this series. From
the discussion that we had in v3 I though it was fine to use
process_pending_softirqs as long as it was limited to Dom0:

https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg02411.html

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init
  2017-09-12 10:53   ` Julien Grall
@ 2017-09-12 11:38     ` Roger Pau Monné
  2017-09-12 13:02       ` Julien Grall
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-12 11:38 UTC (permalink / raw)
  To: Julien Grall; +Cc: xen-devel, boris.ostrovsky, Jan Beulich, Andrew Cooper

On Tue, Sep 12, 2017 at 11:53:49AM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 14/08/17 15:28, Roger Pau Monne wrote:
> > And also allow it to do non-identity mappings by adding a new
> > parameter.
> > 
> > This function will be needed in order to map the BARs from PCI devices
> > into the Dom0 p2m (and is also used by the x86 Dom0 builder). While
> > there fix the function to use gfn_t and mfn_t instead of unsigned long
> > for memory addresses. >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > ---
> > Changes since v4:
> >   - Guard the function with CONFIG_HAS_PCI only.
> >   - s/non-trival/non-negligible in the comment.
> >   - Change XENLOG_G_WARNING to XENLOG_WARNING like the original
> >     function.
> > 
> > Changes since v3:
> >   - Remove the dummy modify_identity_mmio helper in dom0_build.c
> >   - Try to make the comment in modify MMIO less scary.
> >   - Clarify commit message.
> >   - Only build the function for x86 or if there's PCI support.
> > 
> > Changes since v2:
> >   - Use mfn_t and gfn_t.
> >   - Remove stray newline.
> > ---
> >   xen/arch/x86/hvm/dom0_build.c | 30 ++----------------------------
> >   xen/common/memory.c           | 40 ++++++++++++++++++++++++++++++++++++++++
> >   xen/include/xen/p2m-common.h  |  3 +++
> >   3 files changed, 45 insertions(+), 28 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> > index 04a8682d33..c65eb8503f 100644
> > --- a/xen/arch/x86/hvm/dom0_build.c
> > +++ b/xen/arch/x86/hvm/dom0_build.c
> > @@ -61,32 +61,6 @@ static struct acpi_madt_interrupt_override __initdata *intsrcovr;
> >   static unsigned int __initdata acpi_nmi_sources;
> >   static struct acpi_madt_nmi_source __initdata *nmisrc;
> > -static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
> > -                                       unsigned long nr_pages, const bool map)
> > -{
> > -    int rc;
> > -
> > -    for ( ; ; )
> > -    {
> > -        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> > -             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> > -        if ( rc == 0 )
> > -            break;
> > -        if ( rc < 0 )
> > -        {
> > -            printk(XENLOG_WARNING
> > -                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
> > -                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> > -            break;
> > -        }
> > -        nr_pages -= rc;
> > -        pfn += rc;
> > -        process_pending_softirqs();
> > -    }
> > -
> > -    return rc;
> > -}
> > -
> >   /* Populate a HVM memory range using the biggest possible order. */
> >   static int __init pvh_populate_memory_range(struct domain *d,
> >                                               unsigned long start,
> > @@ -397,7 +371,7 @@ static int __init pvh_setup_p2m(struct domain *d)
> >        * Memory below 1MB is identity mapped.
> >        * NB: this only makes sense when booted from legacy BIOS.
> >        */
> > -    rc = modify_identity_mmio(d, 0, MB1_PAGES, true);
> > +    rc = modify_mmio(d, _gfn(0), _mfn(0), MB1_PAGES, true);
> >       if ( rc )
> >       {
> >           printk("Failed to identity map low 1MB: %d\n", rc);
> > @@ -964,7 +938,7 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
> >           nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
> >                             d->arch.e820[i].size);
> > -        rc = modify_identity_mmio(d, pfn, nr_pages, true);
> > +        rc = modify_mmio(d, _gfn(pfn), _mfn(pfn), nr_pages, true);
> >           if ( rc )
> >           {
> >               printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
> > diff --git a/xen/common/memory.c b/xen/common/memory.c
> > index b2066db07e..86824edb09 100644
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -1465,6 +1465,46 @@ int prepare_ring_for_helper(
> >       return 0;
> >   }
> > +#if defined(CONFIG_HAS_PCI)
> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> > +                bool map)
> > +{
> 
> I don't think this is correct to move this function in common code without
> making sure that all arch have *map_mmio_regions supporting preemption.
> 
> This is actually not the case on ARM and IHMO should be fixed before getting
> this function common. Otherwise you will expose a security issue the day
> vCPI will get supported on ARM.

I could add something like:

#ifndef CONFIG_X86 /* XXX ARM!? */
    ret = -E2BIG;
    /* Must break hypercall up as this could take a while. */
    if ( nr_mfns > 64 )
        break;
#endif

That's what's done in XEN_DOMCTL_memory_mapping.

> > +    int rc;
> > +
> > +    /*
> > +     * ATM this function should only be used by the hardware domain
> > +     * because it doesn't support preemption/continuation, and as such
> > +     * can take a non-negligible amount of time. Note that it periodically
> > +     * calls process_pending_softirqs in order to avoid stalling the system.
> > +     */
> > +    ASSERT(is_hardware_domain(d));
> > +
> > +    for ( ; ; )
> > +    {
> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> 
> As mentioned in an earlier version, on ARM map_mmio_regions will map the
> MMIO with very strict attribute (no-unaligned access, non-gatherable,...).
> This will not be correct for some BARs. So I think we should provide the
> attribute type in parameter of modify_mmio to know the memory attribute.

As I understand it, current mapping attributes albeit slow will work
fine?

At least they are the same as the ones used by
XEN_DOMCTL_memory_mapping which is how MMIO is mapped to a DomU ATM,
hence I don't think this is a priority at such early state.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 08/11] vpci/bars: add handlers to map the BARs
  2017-09-12 10:06       ` Jan Beulich
@ 2017-09-12 11:48         ` Roger Pau Monné
  2017-09-12 12:56           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-12 11:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: StefanoStabellini, Wei Liu, GeorgeDunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, boris.ostrovsky

On Tue, Sep 12, 2017 at 04:06:13AM -0600, Jan Beulich wrote:
> >>> On 12.09.17 at 11:54, <roger.pau@citrix.com> wrote:
> > On Thu, Sep 07, 2017 at 03:53:07AM -0600, Jan Beulich wrote:
> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> >> > +
> >> > +    /*
> >> > +     * The PCI Local Bus Specification suggests writing ~0 to both the high
> >> > +     * and the low part of the BAR registers before attempting to read back
> >> > +     * the size.
> >> > +     *
> >> > +     * However real device BARs registers (at least the ones I've tried)
> >> > +     * will return the size of the BAR just by having written ~0 to one half
> >> > +     * of it, independently of the value of the other half of the register.
> >> > +     * Hence here Xen will switch to returning the size as soon as one half
> >> > +     * of the BAR register has been written with ~0.
> >> > +     */
> >> 
> >> I don't believe this is correct behavior (but I'd have to play with
> >> some hardware to see whether I can confirm the behavior you
> >> describe): How would you place a BAR at, say, 0x1ffffff0?
> > 
> > I don't think it's 'correct' behavior either, but FreeBSD has been
> > sizing BARs like that, and nobody noticed any issues. I just fixed it
> > recently:
> > 
> > https://svnweb.freebsd.org/base/head/sys/dev/pci/pci.c?r1=312250&r2=321863 
> 
> Oh, no, that old code was fine afaict. You have to view the two
> halves of a 64-bit BAR as completely distinct registers, and
> sizing of the full BAR can be done either by writing both, then
> reading both, or reading/writing each half.

OK, the example in the specification seems to suggest that you should
first write to both registers, and then read back the values.

> The problem with the
> code in the patch here is that you don't treat the two halves as
> fully separate registers, implying that one half being written with
> all ones _also_ makes the other half return the sizing value
> instead of the last written address.

Right, then I think adding a sizing_hi/sizing_lo field is the right
answer.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init
  2017-09-12 11:27         ` Roger Pau Monné
@ 2017-09-12 12:53           ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-12 12:53 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 12.09.17 at 13:27, <roger.pau@citrix.com> wrote:
> On Tue, Sep 12, 2017 at 03:04:02AM -0600, Jan Beulich wrote:
>> >>> On 12.09.17 at 09:49, <roger.pau@citrix.com> wrote:
>> > On Tue, Sep 05, 2017 at 09:01:57AM -0600, Jan Beulich wrote:
>> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> >> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long 
>> > nr_pages,
>> >> > +                bool map)
>> >> > +{
>> >> > +    int rc;
>> >> > +
>> >> > +    /*
>> >> > +     * ATM this function should only be used by the hardware domain
>> >> > +     * because it doesn't support preemption/continuation, and as such
>> >> > +     * can take a non-negligible amount of time. Note that it periodically
>> >> > +     * calls process_pending_softirqs in order to avoid stalling the 
>> > system.
>> >> > +     */
>> >> > +    ASSERT(is_hardware_domain(d));
>> >> > +
>> >> > +    for ( ; ; )
>> >> > +    {
>> >> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
>> >> > +             (d, gfn, nr_pages, mfn);
>> >> > +        if ( rc == 0 )
>> >> > +            break;
>> >> > +        if ( rc < 0 )
>> >> > +        {
>> >> > +            printk(XENLOG_WARNING
>> >> > +                   "Failed to %smap [%" PRI_gfn ", %" PRI_gfn ") -> "
>> >> > +                   "[%" PRI_mfn ", %" PRI_mfn ") for d%d: %d\n",
>> >> > +                   map ? "" : "un", gfn_x(gfn), gfn_x(gfn_add(gfn, 
> nr_pages)),
>> >> > +                   mfn_x(mfn), mfn_x(mfn_add(mfn, nr_pages)), d->domain_id,
>> >> > +                   rc);
>> >> > +            break;
>> >> > +        }
>> >> > +        nr_pages -= rc;
>> >> > +        mfn = mfn_add(mfn, rc);
>> >> > +        gfn = gfn_add(gfn, rc);
>> >> > +        process_pending_softirqs();
>> >> 
>> >> With the __init dropped, this become questionable: We shouldn't
>> >> do this arbitrarily; runtime use should instead force a hypercall
>> >> continuation (assuming that's the context it's going to be used in).
>> > 
>> > This will be used by the PCI emulation code, which is a vmexit but not
>> > an hypercall.
>> > 
>> > I have a plan to add continuations, but I would rather do it as part
>> > of using the PCI emulation for DomUs.
>> 
>> In which case please retain the __init while moving the function,
>> so there's no latent bug here in case someone else wants to
>> call this function in other than boot time context. The __init
>> should be dropped only together with making the softirq
>> processing here conditional, using some suitable other mechanism
>> post-boot.
> 
> This will already be used in non-boot context with this series. From
> the discussion that we had in v3 I though it was fine to use
> process_pending_softirqs as long as it was limited to Dom0:
> 
> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg02411.html 

I don't think it was a good idea to agree - we shouldn't special
case Dom0 in this regard.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 08/11] vpci/bars: add handlers to map the BARs
  2017-09-12 11:48         ` Roger Pau Monné
@ 2017-09-12 12:56           ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-12 12:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: StefanoStabellini, Wei Liu, GeorgeDunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 12.09.17 at 13:48, <roger.pau@citrix.com> wrote:
> On Tue, Sep 12, 2017 at 04:06:13AM -0600, Jan Beulich wrote:
>> >>> On 12.09.17 at 11:54, <roger.pau@citrix.com> wrote:
>> > On Thu, Sep 07, 2017 at 03:53:07AM -0600, Jan Beulich wrote:
>> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> >> > +
>> >> > +    /*
>> >> > +     * The PCI Local Bus Specification suggests writing ~0 to both the 
> high
>> >> > +     * and the low part of the BAR registers before attempting to read 
> back
>> >> > +     * the size.
>> >> > +     *
>> >> > +     * However real device BARs registers (at least the ones I've tried)
>> >> > +     * will return the size of the BAR just by having written ~0 to one 
> half
>> >> > +     * of it, independently of the value of the other half of the 
> register.
>> >> > +     * Hence here Xen will switch to returning the size as soon as one 
> half
>> >> > +     * of the BAR register has been written with ~0.
>> >> > +     */
>> >> 
>> >> I don't believe this is correct behavior (but I'd have to play with
>> >> some hardware to see whether I can confirm the behavior you
>> >> describe): How would you place a BAR at, say, 0x1ffffff0?
>> > 
>> > I don't think it's 'correct' behavior either, but FreeBSD has been
>> > sizing BARs like that, and nobody noticed any issues. I just fixed it
>> > recently:
>> > 
>> > https://svnweb.freebsd.org/base/head/sys/dev/pci/pci.c?r1=312250&r2=321863 
>> 
>> Oh, no, that old code was fine afaict. You have to view the two
>> halves of a 64-bit BAR as completely distinct registers, and
>> sizing of the full BAR can be done either by writing both, then
>> reading both, or reading/writing each half.
> 
> OK, the example in the specification seems to suggest that you should
> first write to both registers, and then read back the values.
> 
>> The problem with the
>> code in the patch here is that you don't treat the two halves as
>> fully separate registers, implying that one half being written with
>> all ones _also_ makes the other half return the sizing value
>> instead of the last written address.
> 
> Right, then I think adding a sizing_hi/sizing_lo field is the right
> answer.

That was my first thought too, but meanwhile I think these flags
are pointless and misleading, and hence should be dropped.
There's really nothing special about those sizing writes: They
simply write a special address, but to the handler this shouldn't
matter. All you need to make sure is that hardwired to zero bits
come back as zero on the following read(s), entirely independent
of what precise value was written (the r/o bits at the bottom of
the first half left aside here, of course).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init
  2017-09-12 11:38     ` Roger Pau Monné
@ 2017-09-12 13:02       ` Julien Grall
  0 siblings, 0 replies; 60+ messages in thread
From: Julien Grall @ 2017-09-12 13:02 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, boris.ostrovsky, Jan Beulich, Andrew Cooper

Hi,

On 12/09/17 12:38, Roger Pau Monné wrote:
> On Tue, Sep 12, 2017 at 11:53:49AM +0100, Julien Grall wrote:
>> Hi Roger,
>>
>> On 14/08/17 15:28, Roger Pau Monne wrote:
>>> And also allow it to do non-identity mappings by adding a new
>>> parameter.
>>>
>>> This function will be needed in order to map the BARs from PCI devices
>>> into the Dom0 p2m (and is also used by the x86 Dom0 builder). While
>>> there fix the function to use gfn_t and mfn_t instead of unsigned long
>>> for memory addresses. >
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>> Cc: Jan Beulich <jbeulich@suse.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>> Changes since v4:
>>>    - Guard the function with CONFIG_HAS_PCI only.
>>>    - s/non-trival/non-negligible in the comment.
>>>    - Change XENLOG_G_WARNING to XENLOG_WARNING like the original
>>>      function.
>>>
>>> Changes since v3:
>>>    - Remove the dummy modify_identity_mmio helper in dom0_build.c
>>>    - Try to make the comment in modify MMIO less scary.
>>>    - Clarify commit message.
>>>    - Only build the function for x86 or if there's PCI support.
>>>
>>> Changes since v2:
>>>    - Use mfn_t and gfn_t.
>>>    - Remove stray newline.
>>> ---
>>>    xen/arch/x86/hvm/dom0_build.c | 30 ++----------------------------
>>>    xen/common/memory.c           | 40 ++++++++++++++++++++++++++++++++++++++++
>>>    xen/include/xen/p2m-common.h  |  3 +++
>>>    3 files changed, 45 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
>>> index 04a8682d33..c65eb8503f 100644
>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>> @@ -61,32 +61,6 @@ static struct acpi_madt_interrupt_override __initdata *intsrcovr;
>>>    static unsigned int __initdata acpi_nmi_sources;
>>>    static struct acpi_madt_nmi_source __initdata *nmisrc;
>>> -static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
>>> -                                       unsigned long nr_pages, const bool map)
>>> -{
>>> -    int rc;
>>> -
>>> -    for ( ; ; )
>>> -    {
>>> -        rc = (map ? map_mmio_regions : unmap_mmio_regions)
>>> -             (d, _gfn(pfn), nr_pages, _mfn(pfn));
>>> -        if ( rc == 0 )
>>> -            break;
>>> -        if ( rc < 0 )
>>> -        {
>>> -            printk(XENLOG_WARNING
>>> -                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
>>> -                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
>>> -            break;
>>> -        }
>>> -        nr_pages -= rc;
>>> -        pfn += rc;
>>> -        process_pending_softirqs();
>>> -    }
>>> -
>>> -    return rc;
>>> -}
>>> -
>>>    /* Populate a HVM memory range using the biggest possible order. */
>>>    static int __init pvh_populate_memory_range(struct domain *d,
>>>                                                unsigned long start,
>>> @@ -397,7 +371,7 @@ static int __init pvh_setup_p2m(struct domain *d)
>>>         * Memory below 1MB is identity mapped.
>>>         * NB: this only makes sense when booted from legacy BIOS.
>>>         */
>>> -    rc = modify_identity_mmio(d, 0, MB1_PAGES, true);
>>> +    rc = modify_mmio(d, _gfn(0), _mfn(0), MB1_PAGES, true);
>>>        if ( rc )
>>>        {
>>>            printk("Failed to identity map low 1MB: %d\n", rc);
>>> @@ -964,7 +938,7 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
>>>            nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
>>>                              d->arch.e820[i].size);
>>> -        rc = modify_identity_mmio(d, pfn, nr_pages, true);
>>> +        rc = modify_mmio(d, _gfn(pfn), _mfn(pfn), nr_pages, true);
>>>            if ( rc )
>>>            {
>>>                printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
>>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>>> index b2066db07e..86824edb09 100644
>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -1465,6 +1465,46 @@ int prepare_ring_for_helper(
>>>        return 0;
>>>    }
>>> +#if defined(CONFIG_HAS_PCI)
>>> +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
>>> +                bool map)
>>> +{
>>
>> I don't think this is correct to move this function in common code without
>> making sure that all arch have *map_mmio_regions supporting preemption.
>>
>> This is actually not the case on ARM and IHMO should be fixed before getting
>> this function common. Otherwise you will expose a security issue the day
>> vCPI will get supported on ARM.
> 
> I could add something like:
> 
> #ifndef CONFIG_X86 /* XXX ARM!? */
>      ret = -E2BIG;
>      /* Must break hypercall up as this could take a while. */
>      if ( nr_mfns > 64 )
>          break;
> #endif
> 
> That's what's done in XEN_DOMCTL_memory_mapping.

Well in the case of XEN_DOMCTL_memory_mapping, the caller will take care 
of splitting the batch. Here you will not split the batch and just fail.

> 
>>> +    int rc;
>>> +
>>> +    /*
>>> +     * ATM this function should only be used by the hardware domain
>>> +     * because it doesn't support preemption/continuation, and as such
>>> +     * can take a non-negligible amount of time. Note that it periodically
>>> +     * calls process_pending_softirqs in order to avoid stalling the system.
>>> +     */
>>> +    ASSERT(is_hardware_domain(d));
>>> +
>>> +    for ( ; ; )
>>> +    {
>>> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
>>
>> As mentioned in an earlier version, on ARM map_mmio_regions will map the
>> MMIO with very strict attribute (no-unaligned access, non-gatherable,...).
>> This will not be correct for some BARs. So I think we should provide the
>> attribute type in parameter of modify_mmio to know the memory attribute.
> 
> As I understand it, current mapping attributes albeit slow will work
> fine?

No. The guest will receive an abort if it does unaligned access on that 
region (such as when memcpy is used).

> 
> At least they are the same as the ones used by
> XEN_DOMCTL_memory_mapping which is how MMIO is mapped to a DomU ATM,
> hence I don't think this is a priority at such early state.

Why do you speak about Domu? This is used for mapping Dom0 at the 
moment... So this is kind of priority in order to get PCI in Dom0.

For Dom0, we relaxed the mapping and rely on the OS to tighten the 
attribute (p2m_mmio_direct_c is used). This is something we could not 
possibly do by default on DomU at the moment.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 04/11] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  2017-09-05 14:57   ` Jan Beulich
@ 2017-09-13 15:55     ` Roger Pau Monné
  2017-09-14  9:53       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-13 15:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Tue, Sep 05, 2017 at 08:57:54AM -0600, Jan Beulich wrote:
> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> > --- a/xen/arch/x86/physdev.c
> > +++ b/xen/arch/x86/physdev.c
> > @@ -559,6 +559,15 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >  
> >          ret = pci_mmcfg_reserved(info.address, info.segment,
> >                                   info.start_bus, info.end_bus, info.flags);
> > +        if ( ret || !is_hvm_domain(currd) )
> > +            break;
> 
> Don't you also want to check has_vpci() here?

I don't think the also is needed here, just checking for has_vpci
should be fine (PV guests will not have the vpci flag set in any
case).

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 04/11] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  2017-09-13 15:55     ` Roger Pau Monné
@ 2017-09-14  9:53       ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-14  9:53 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 13.09.17 at 17:55, <roger.pau@citrix.com> wrote:
> On Tue, Sep 05, 2017 at 08:57:54AM -0600, Jan Beulich wrote:
>> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> > --- a/xen/arch/x86/physdev.c
>> > +++ b/xen/arch/x86/physdev.c
>> > @@ -559,6 +559,15 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>> >  
>> >          ret = pci_mmcfg_reserved(info.address, info.segment,
>> >                                   info.start_bus, info.end_bus, info.flags);
>> > +        if ( ret || !is_hvm_domain(currd) )
>> > +            break;
>> 
>> Don't you also want to check has_vpci() here?
> 
> I don't think the also is needed here, just checking for has_vpci
> should be fine (PV guests will not have the vpci flag set in any
> case).

Ah, right, emulation_flags is not in the HVM/PV union, but available
for all guests.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 09/11] vpci/msi: add MSI handlers
  2017-09-07 15:29   ` Jan Beulich
@ 2017-09-14 10:08     ` Roger Pau Monné
  2017-09-14 10:19       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-14 10:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

On Thu, Sep 07, 2017 at 09:29:41AM -0600, Jan Beulich wrote:
> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> > +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> > +                         uint64_t address, uint32_t data, unsigned int vectors)
> > +{
> > +    struct msi_info msi_info = {
> > +        .seg = pdev->seg,
> > +        .bus = pdev->bus,
> > +        .devfn = pdev->devfn,
> > +        .entry_nr = vectors,
> > +    };
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    ASSERT(arch->pirq == INVALID_PIRQ);
> > +
> > +    /* Get a PIRQ. */
> > +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
> > +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> > +    if ( rc )
> > +    {
> > +        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> > +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> > +                 PCI_FUNC(pdev->devfn), rc);
> > +        return rc;
> > +    }
> > +
> > +    for ( i = 0; i < vectors; i++ )
> > +    {
> > +        xen_domctl_bind_pt_irq_t bind = {
> > +            .machine_irq = arch->pirq + i,
> > +            .irq_type = PT_IRQ_TYPE_MSI,
> > +            .u.msi.gvec = msi_vector(data) + i,
> 
> Isn't that rather msi_vector(data + i), i.e. wouldn't you better
> increment data together with i?

That's true, because the vector is fetched from the last 8bits of the
data, but I find it more confusing (and it requires that the reader
knows this detail). IMHO I would prefer to leave it as-is.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 09/11] vpci/msi: add MSI handlers
  2017-09-14 10:08     ` Roger Pau Monné
@ 2017-09-14 10:19       ` Jan Beulich
  2017-09-14 10:42         ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-14 10:19 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 14.09.17 at 12:08, <roger.pau@citrix.com> wrote:
> On Thu, Sep 07, 2017 at 09:29:41AM -0600, Jan Beulich wrote:
>> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> > +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
>> > +                         uint64_t address, uint32_t data, unsigned int vectors)
>> > +{
>> > +    struct msi_info msi_info = {
>> > +        .seg = pdev->seg,
>> > +        .bus = pdev->bus,
>> > +        .devfn = pdev->devfn,
>> > +        .entry_nr = vectors,
>> > +    };
>> > +    unsigned int i;
>> > +    int rc;
>> > +
>> > +    ASSERT(arch->pirq == INVALID_PIRQ);
>> > +
>> > +    /* Get a PIRQ. */
>> > +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
>> > +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
>> > +    if ( rc )
>> > +    {
>> > +        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
>> > +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
>> > +                 PCI_FUNC(pdev->devfn), rc);
>> > +        return rc;
>> > +    }
>> > +
>> > +    for ( i = 0; i < vectors; i++ )
>> > +    {
>> > +        xen_domctl_bind_pt_irq_t bind = {
>> > +            .machine_irq = arch->pirq + i,
>> > +            .irq_type = PT_IRQ_TYPE_MSI,
>> > +            .u.msi.gvec = msi_vector(data) + i,
>> 
>> Isn't that rather msi_vector(data + i), i.e. wouldn't you better
>> increment data together with i?
> 
> That's true, because the vector is fetched from the last 8bits of the
> data, but I find it more confusing (and it requires that the reader
> knows this detail). IMHO I would prefer to leave it as-is.

No, the problem is the wrap-around case, which your code
doesn't handle. Iirc hardware behaves along the lines of what
I've suggested to change to, with potentially the vector
increment carrying into other parts of the value. Hence you
either need an early check for there not being any wrapping,
or other places may need similar adjustment (in which case it
might be better to really just increment "data" once in the
loop.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 09/11] vpci/msi: add MSI handlers
  2017-09-14 10:19       ` Jan Beulich
@ 2017-09-14 10:42         ` Roger Pau Monné
  2017-09-14 10:50           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-14 10:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

On Thu, Sep 14, 2017 at 04:19:44AM -0600, Jan Beulich wrote:
> >>> On 14.09.17 at 12:08, <roger.pau@citrix.com> wrote:
> > On Thu, Sep 07, 2017 at 09:29:41AM -0600, Jan Beulich wrote:
> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> >> > +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> >> > +                         uint64_t address, uint32_t data, unsigned int vectors)
> >> > +{
> >> > +    struct msi_info msi_info = {
> >> > +        .seg = pdev->seg,
> >> > +        .bus = pdev->bus,
> >> > +        .devfn = pdev->devfn,
> >> > +        .entry_nr = vectors,
> >> > +    };
> >> > +    unsigned int i;
> >> > +    int rc;
> >> > +
> >> > +    ASSERT(arch->pirq == INVALID_PIRQ);
> >> > +
> >> > +    /* Get a PIRQ. */
> >> > +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
> >> > +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> >> > +    if ( rc )
> >> > +    {
> >> > +        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> >> > +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> >> > +                 PCI_FUNC(pdev->devfn), rc);
> >> > +        return rc;
> >> > +    }
> >> > +
> >> > +    for ( i = 0; i < vectors; i++ )
> >> > +    {
> >> > +        xen_domctl_bind_pt_irq_t bind = {
> >> > +            .machine_irq = arch->pirq + i,
> >> > +            .irq_type = PT_IRQ_TYPE_MSI,
> >> > +            .u.msi.gvec = msi_vector(data) + i,
> >> 
> >> Isn't that rather msi_vector(data + i), i.e. wouldn't you better
> >> increment data together with i?
> > 
> > That's true, because the vector is fetched from the last 8bits of the
> > data, but I find it more confusing (and it requires that the reader
> > knows this detail). IMHO I would prefer to leave it as-is.
> 
> No, the problem is the wrap-around case, which your code
> doesn't handle. Iirc hardware behaves along the lines of what
> I've suggested to change to, with potentially the vector
> increment carrying into other parts of the value. Hence you
> either need an early check for there not being any wrapping,
> or other places may need similar adjustment (in which case it
> might be better to really just increment "data" once in the
> loop.

Oh, so the vector increment carries over to the delivery mode, then I
will switch it.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 09/11] vpci/msi: add MSI handlers
  2017-09-14 10:42         ` Roger Pau Monné
@ 2017-09-14 10:50           ` Jan Beulich
  2017-09-14 11:35             ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-14 10:50 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 14.09.17 at 12:42, <roger.pau@citrix.com> wrote:
> On Thu, Sep 14, 2017 at 04:19:44AM -0600, Jan Beulich wrote:
>> >>> On 14.09.17 at 12:08, <roger.pau@citrix.com> wrote:
>> > On Thu, Sep 07, 2017 at 09:29:41AM -0600, Jan Beulich wrote:
>> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> >> > +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev 
> *pdev,
>> >> > +                         uint64_t address, uint32_t data, unsigned int 
> vectors)
>> >> > +{
>> >> > +    struct msi_info msi_info = {
>> >> > +        .seg = pdev->seg,
>> >> > +        .bus = pdev->bus,
>> >> > +        .devfn = pdev->devfn,
>> >> > +        .entry_nr = vectors,
>> >> > +    };
>> >> > +    unsigned int i;
>> >> > +    int rc;
>> >> > +
>> >> > +    ASSERT(arch->pirq == INVALID_PIRQ);
>> >> > +
>> >> > +    /* Get a PIRQ. */
>> >> > +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
>> >> > +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
>> >> > +    if ( rc )
>> >> > +    {
>> >> > +        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: 
> %d\n",
>> >> > +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
>> >> > +                 PCI_FUNC(pdev->devfn), rc);
>> >> > +        return rc;
>> >> > +    }
>> >> > +
>> >> > +    for ( i = 0; i < vectors; i++ )
>> >> > +    {
>> >> > +        xen_domctl_bind_pt_irq_t bind = {
>> >> > +            .machine_irq = arch->pirq + i,
>> >> > +            .irq_type = PT_IRQ_TYPE_MSI,
>> >> > +            .u.msi.gvec = msi_vector(data) + i,
>> >> 
>> >> Isn't that rather msi_vector(data + i), i.e. wouldn't you better
>> >> increment data together with i?
>> > 
>> > That's true, because the vector is fetched from the last 8bits of the
>> > data, but I find it more confusing (and it requires that the reader
>> > knows this detail). IMHO I would prefer to leave it as-is.
>> 
>> No, the problem is the wrap-around case, which your code
>> doesn't handle. Iirc hardware behaves along the lines of what
>> I've suggested to change to, with potentially the vector
>> increment carrying into other parts of the value. Hence you
>> either need an early check for there not being any wrapping,
>> or other places may need similar adjustment (in which case it
>> might be better to really just increment "data" once in the
>> loop.
> 
> Oh, so the vector increment carries over to the delivery mode, then I
> will switch it.

But please double check first that I'm not mis-remembering.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 09/11] vpci/msi: add MSI handlers
  2017-09-14 10:50           ` Jan Beulich
@ 2017-09-14 11:35             ` Roger Pau Monné
  2017-09-14 12:09               ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-14 11:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

On Thu, Sep 14, 2017 at 04:50:10AM -0600, Jan Beulich wrote:
> >>> On 14.09.17 at 12:42, <roger.pau@citrix.com> wrote:
> > On Thu, Sep 14, 2017 at 04:19:44AM -0600, Jan Beulich wrote:
> >> >>> On 14.09.17 at 12:08, <roger.pau@citrix.com> wrote:
> >> > On Thu, Sep 07, 2017 at 09:29:41AM -0600, Jan Beulich wrote:
> >> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> >> >> > +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev 
> > *pdev,
> >> >> > +                         uint64_t address, uint32_t data, unsigned int 
> > vectors)
> >> >> > +{
> >> >> > +    struct msi_info msi_info = {
> >> >> > +        .seg = pdev->seg,
> >> >> > +        .bus = pdev->bus,
> >> >> > +        .devfn = pdev->devfn,
> >> >> > +        .entry_nr = vectors,
> >> >> > +    };
> >> >> > +    unsigned int i;
> >> >> > +    int rc;
> >> >> > +
> >> >> > +    ASSERT(arch->pirq == INVALID_PIRQ);
> >> >> > +
> >> >> > +    /* Get a PIRQ. */
> >> >> > +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
> >> >> > +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> >> >> > +    if ( rc )
> >> >> > +    {
> >> >> > +        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: 
> > %d\n",
> >> >> > +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> >> >> > +                 PCI_FUNC(pdev->devfn), rc);
> >> >> > +        return rc;
> >> >> > +    }
> >> >> > +
> >> >> > +    for ( i = 0; i < vectors; i++ )
> >> >> > +    {
> >> >> > +        xen_domctl_bind_pt_irq_t bind = {
> >> >> > +            .machine_irq = arch->pirq + i,
> >> >> > +            .irq_type = PT_IRQ_TYPE_MSI,
> >> >> > +            .u.msi.gvec = msi_vector(data) + i,
> >> >> 
> >> >> Isn't that rather msi_vector(data + i), i.e. wouldn't you better
> >> >> increment data together with i?
> >> > 
> >> > That's true, because the vector is fetched from the last 8bits of the
> >> > data, but I find it more confusing (and it requires that the reader
> >> > knows this detail). IMHO I would prefer to leave it as-is.
> >> 
> >> No, the problem is the wrap-around case, which your code
> >> doesn't handle. Iirc hardware behaves along the lines of what
> >> I've suggested to change to, with potentially the vector
> >> increment carrying into other parts of the value. Hence you
> >> either need an early check for there not being any wrapping,
> >> or other places may need similar adjustment (in which case it
> >> might be better to really just increment "data" once in the
> >> loop.
> > 
> > Oh, so the vector increment carries over to the delivery mode, then I
> > will switch it.
> 
> But please double check first that I'm not mis-remembering.

The PCI spec contains the following about the data field:

The Multiple Message Enable field (bits 6-4 of the Message Control
register) defines the number of low order message data bits the
function is permitted to modify to generate its system software
allocated vectors. For example, a Multiple Message Enable encoding of
“010” indicates the function has been allocated four vectors and is
permitted to modify message data bits 1 and 0 (a function modifies the
lower message data bits to generate the allocated number of vectors).
If the Multiple Message Enable field is “000”, the function is not
permitted to modify the message data.

So it seems like the overflow should be limited to the number of
enabled vectors, ie: maybe store the vector in a uint8_t and increase
it at every loop?

I don't seem to be able to find any specific mention to what happens
when the vector part of the data register overflows. From the except
above it seems like it should never modify any other bits apart from
the low 8bit vector ones.

The Intel SDM mentions that the vector must be in the range 0x10-0xfe,
in which case the resulting vector (whether we wrap or not) is not
going to be valid anyway.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 09/11] vpci/msi: add MSI handlers
  2017-09-14 11:35             ` Roger Pau Monné
@ 2017-09-14 12:09               ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2017-09-14 12:09 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 14.09.17 at 13:35, <roger.pau@citrix.com> wrote:
> On Thu, Sep 14, 2017 at 04:50:10AM -0600, Jan Beulich wrote:
>> >>> On 14.09.17 at 12:42, <roger.pau@citrix.com> wrote:
>> > On Thu, Sep 14, 2017 at 04:19:44AM -0600, Jan Beulich wrote:
>> >> >>> On 14.09.17 at 12:08, <roger.pau@citrix.com> wrote:
>> >> > On Thu, Sep 07, 2017 at 09:29:41AM -0600, Jan Beulich wrote:
>> >> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> >> >> > +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev 
>> > *pdev,
>> >> >> > +                         uint64_t address, uint32_t data, unsigned int 
>> > vectors)
>> >> >> > +{
>> >> >> > +    struct msi_info msi_info = {
>> >> >> > +        .seg = pdev->seg,
>> >> >> > +        .bus = pdev->bus,
>> >> >> > +        .devfn = pdev->devfn,
>> >> >> > +        .entry_nr = vectors,
>> >> >> > +    };
>> >> >> > +    unsigned int i;
>> >> >> > +    int rc;
>> >> >> > +
>> >> >> > +    ASSERT(arch->pirq == INVALID_PIRQ);
>> >> >> > +
>> >> >> > +    /* Get a PIRQ. */
>> >> >> > +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
>> >> >> > +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
>> >> >> > +    if ( rc )
>> >> >> > +    {
>> >> >> > +        gdprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: 
>> > %d\n",
>> >> >> > +                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
>> >> >> > +                 PCI_FUNC(pdev->devfn), rc);
>> >> >> > +        return rc;
>> >> >> > +    }
>> >> >> > +
>> >> >> > +    for ( i = 0; i < vectors; i++ )
>> >> >> > +    {
>> >> >> > +        xen_domctl_bind_pt_irq_t bind = {
>> >> >> > +            .machine_irq = arch->pirq + i,
>> >> >> > +            .irq_type = PT_IRQ_TYPE_MSI,
>> >> >> > +            .u.msi.gvec = msi_vector(data) + i,
>> >> >> 
>> >> >> Isn't that rather msi_vector(data + i), i.e. wouldn't you better
>> >> >> increment data together with i?
>> >> > 
>> >> > That's true, because the vector is fetched from the last 8bits of the
>> >> > data, but I find it more confusing (and it requires that the reader
>> >> > knows this detail). IMHO I would prefer to leave it as-is.
>> >> 
>> >> No, the problem is the wrap-around case, which your code
>> >> doesn't handle. Iirc hardware behaves along the lines of what
>> >> I've suggested to change to, with potentially the vector
>> >> increment carrying into other parts of the value. Hence you
>> >> either need an early check for there not being any wrapping,
>> >> or other places may need similar adjustment (in which case it
>> >> might be better to really just increment "data" once in the
>> >> loop.
>> > 
>> > Oh, so the vector increment carries over to the delivery mode, then I
>> > will switch it.
>> 
>> But please double check first that I'm not mis-remembering.
> 
> The PCI spec contains the following about the data field:
> 
> The Multiple Message Enable field (bits 6-4 of the Message Control
> register) defines the number of low order message data bits the
> function is permitted to modify to generate its system software
> allocated vectors. For example, a Multiple Message Enable encoding of
> “010” indicates the function has been allocated four vectors and is
> permitted to modify message data bits 1 and 0 (a function modifies the
> lower message data bits to generate the allocated number of vectors).
> If the Multiple Message Enable field is “000”, the function is not
> permitted to modify the message data.
> 
> So it seems like the overflow should be limited to the number of
> enabled vectors,

Ah, right. So no spilling over into the higher bits.

> ie: maybe store the vector in a uint8_t and increase
> it at every loop?

A uint8_t won't help: As the text says, you need to limit the change
to the number of bits that are permitted to be altered. I think there's
an implication for these bits to be all clear for the first of the vectors,
but that's not spelled out, so I'd prefer if we were flexible and
allowed to start from a non-zero value (resulting in e.g. a
0b00111101, 0b00111110, 0b00111111, 0b00111100 vector
sequence).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 11/11] vpci/msix: add MSI-X handlers
  2017-09-07 16:12   ` Jan Beulich
@ 2017-09-15 10:44     ` Roger Pau Monné
  2017-09-15 11:43       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-15 10:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Sep 07, 2017 at 10:12:59AM -0600, Jan Beulich wrote:
> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> > +void vpci_msix_arch_mask(struct vpci_arch_msix_entry *arch,
> > +                         struct pci_dev *pdev, bool mask)
> > +{
> > +    if ( arch->pirq == INVALID_PIRQ )
> > +        return;
> 
> How come no similar guard is needed in vpci_msi_arch_mask()?

That's right, this should be an ASSERT instead.

> > +    for ( i = 0; i < ARRAY_SIZE(bar->msix); i++ )
> > +    {
> > +        struct vpci_msix_mem *msix = bar->msix[i];
> > +
> > +        if ( !msix || msix->addr == INVALID_PADDR )
> > +            continue;
> > +
> > +        if ( map )
> > +        {
> > +            /*
> > +             * Make sure the MSI-X regions of the BAR are not mapped into the
> > +             * domain p2m, or else the MSI-X handlers are useless. Only do this
> > +             * when mapping, since that's when the memory decoding on the
> > +             * device is enabled.
> > +             *
> > +             * This is required because iommu_inclusive_mapping might have
> > +             * mapped MSI-X regions into the guest p2m.
> > +             */
> > +            rc = vpci_unmap_msix(d, msix);
> > +            if ( rc )
> > +            {
> > +                rangeset_destroy(mem);
> > +                return rc;
> > +            }
> > +        }
> > +
> > +        rc = rangeset_remove_range(mem, PFN_DOWN(msix->addr),
> > +                                   PFN_DOWN(msix->addr + msix->size));
> > +        if ( rc )
> > +        {
> > +            rangeset_destroy(mem);
> > +            return rc;
> > +        }
> > +
> > +    }
> 
> Why do you do this for the PBA regardless of whether it's shared
> with a table page?

Writes to the PBA area are described as undefined by the spec:

"If software writes to Pending Bits, the result is undefined."

I think it's better to simply not allow the guest to perform such
writes, and hence we need to trap this area unconditionally IMHO.

> > +    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
> > +    {
> > +        /* Access to PBA. */
> > +        switch ( len )
> > +        {
> > +        case 4:
> > +            *data = readl(addr);
> > +            break;
> > +        case 8:
> > +            *data = readq(addr);
> > +            break;
> 
> This is strictly only valid for Dom0, so perhaps worth a comment.

I'm not sure I follow, why should Xen disallow accesses to the PBA for
a DomU?

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 11/11] vpci/msix: add MSI-X handlers
  2017-09-15 10:44     ` Roger Pau Monné
@ 2017-09-15 11:43       ` Jan Beulich
  2017-09-15 12:44         ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2017-09-15 11:43 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 15.09.17 at 12:44, <roger.pau@citrix.com> wrote:
> On Thu, Sep 07, 2017 at 10:12:59AM -0600, Jan Beulich wrote:
>> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> > +    for ( i = 0; i < ARRAY_SIZE(bar->msix); i++ )
>> > +    {
>> > +        struct vpci_msix_mem *msix = bar->msix[i];
>> > +
>> > +        if ( !msix || msix->addr == INVALID_PADDR )
>> > +            continue;
>> > +
>> > +        if ( map )
>> > +        {
>> > +            /*
>> > +             * Make sure the MSI-X regions of the BAR are not mapped into the
>> > +             * domain p2m, or else the MSI-X handlers are useless. Only do this
>> > +             * when mapping, since that's when the memory decoding on the
>> > +             * device is enabled.
>> > +             *
>> > +             * This is required because iommu_inclusive_mapping might have
>> > +             * mapped MSI-X regions into the guest p2m.
>> > +             */
>> > +            rc = vpci_unmap_msix(d, msix);
>> > +            if ( rc )
>> > +            {
>> > +                rangeset_destroy(mem);
>> > +                return rc;
>> > +            }
>> > +        }
>> > +
>> > +        rc = rangeset_remove_range(mem, PFN_DOWN(msix->addr),
>> > +                                   PFN_DOWN(msix->addr + msix->size));
>> > +        if ( rc )
>> > +        {
>> > +            rangeset_destroy(mem);
>> > +            return rc;
>> > +        }
>> > +
>> > +    }
>> 
>> Why do you do this for the PBA regardless of whether it's shared
>> with a table page?
> 
> Writes to the PBA area are described as undefined by the spec:
> 
> "If software writes to Pending Bits, the result is undefined."
> 
> I think it's better to simply not allow the guest to perform such
> writes, and hence we need to trap this area unconditionally IMHO.

For DomU I would agree, but for Dom0 I think we better allow a
driver anything it would also be able to do without Xen.

>> > +    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
>> > +    {
>> > +        /* Access to PBA. */
>> > +        switch ( len )
>> > +        {
>> > +        case 4:
>> > +            *data = readl(addr);
>> > +            break;
>> > +        case 8:
>> > +            *data = readq(addr);
>> > +            break;
>> 
>> This is strictly only valid for Dom0, so perhaps worth a comment.
> 
> I'm not sure I follow, why should Xen disallow accesses to the PBA for
> a DomU?

That's not the point, but instead the readl() / readq() using an
untranslated address.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v5 11/11] vpci/msix: add MSI-X handlers
  2017-09-15 11:43       ` Jan Beulich
@ 2017-09-15 12:44         ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2017-09-15 12:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Sep 15, 2017 at 05:43:15AM -0600, Jan Beulich wrote:
> >>> On 15.09.17 at 12:44, <roger.pau@citrix.com> wrote:
> > On Thu, Sep 07, 2017 at 10:12:59AM -0600, Jan Beulich wrote:
> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> >> > +    for ( i = 0; i < ARRAY_SIZE(bar->msix); i++ )
> >> > +    {
> >> > +        struct vpci_msix_mem *msix = bar->msix[i];
> >> > +
> >> > +        if ( !msix || msix->addr == INVALID_PADDR )
> >> > +            continue;
> >> > +
> >> > +        if ( map )
> >> > +        {
> >> > +            /*
> >> > +             * Make sure the MSI-X regions of the BAR are not mapped into the
> >> > +             * domain p2m, or else the MSI-X handlers are useless. Only do this
> >> > +             * when mapping, since that's when the memory decoding on the
> >> > +             * device is enabled.
> >> > +             *
> >> > +             * This is required because iommu_inclusive_mapping might have
> >> > +             * mapped MSI-X regions into the guest p2m.
> >> > +             */
> >> > +            rc = vpci_unmap_msix(d, msix);
> >> > +            if ( rc )
> >> > +            {
> >> > +                rangeset_destroy(mem);
> >> > +                return rc;
> >> > +            }
> >> > +        }
> >> > +
> >> > +        rc = rangeset_remove_range(mem, PFN_DOWN(msix->addr),
> >> > +                                   PFN_DOWN(msix->addr + msix->size));
> >> > +        if ( rc )
> >> > +        {
> >> > +            rangeset_destroy(mem);
> >> > +            return rc;
> >> > +        }
> >> > +
> >> > +    }
> >> 
> >> Why do you do this for the PBA regardless of whether it's shared
> >> with a table page?
> > 
> > Writes to the PBA area are described as undefined by the spec:
> > 
> > "If software writes to Pending Bits, the result is undefined."
> > 
> > I think it's better to simply not allow the guest to perform such
> > writes, and hence we need to trap this area unconditionally IMHO.
> 
> For DomU I would agree, but for Dom0 I think we better allow a
> driver anything it would also be able to do without Xen.

OK. IMHO it's simpler to always trap access to the PBA, regardless of
whether is sharing a page with the MSIX table or not. I will add a
is_hardware_domain check here in order to forward the writes.

> >> > +    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
> >> > +    {
> >> > +        /* Access to PBA. */
> >> > +        switch ( len )
> >> > +        {
> >> > +        case 4:
> >> > +            *data = readl(addr);
> >> > +            break;
> >> > +        case 8:
> >> > +            *data = readq(addr);
> >> > +            break;
> >> 
> >> This is strictly only valid for Dom0, so perhaps worth a comment.
> > 
> > I'm not sure I follow, why should Xen disallow accesses to the PBA for
> > a DomU?
> 
> That's not the point, but instead the readl() / readq() using an
> untranslated address.

Right, ATM we identity map the PBA into the guest address space. If
that changes then this is certainly going to be wrong. I will add a
TODO item here so that I don't forget later.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2017-09-15 12:45 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-14 14:28 [PATCH v5 00/11] vpci: PCI config space emulation Roger Pau Monne
2017-08-14 14:28 ` [PATCH v5 01/11] x86/pci: introduce hvm_pci_decode_addr Roger Pau Monne
2017-08-22 11:24   ` Paul Durrant
2017-08-24 15:46   ` Jan Beulich
2017-08-25  8:29     ` Roger Pau Monne
2017-08-14 14:28 ` [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
2017-08-22 12:05   ` Paul Durrant
2017-09-04 15:38   ` Jan Beulich
2017-09-06 15:40     ` Roger Pau Monné
2017-09-07  9:06       ` Jan Beulich
2017-09-07 11:30         ` Roger Pau Monné
2017-09-07 11:38           ` Jan Beulich
2017-09-08 14:41     ` Roger Pau Monné
2017-09-08 15:56       ` Jan Beulich
2017-09-12 10:42   ` Julien Grall
2017-09-12 10:58     ` Roger Pau Monné
2017-09-12 11:00       ` Julien Grall
2017-08-14 14:28 ` [PATCH v5 03/11] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
2017-08-22 12:11   ` Paul Durrant
2017-09-04 15:58   ` Jan Beulich
2017-08-14 14:28 ` [PATCH v5 04/11] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
2017-09-05 14:57   ` Jan Beulich
2017-09-13 15:55     ` Roger Pau Monné
2017-09-14  9:53       ` Jan Beulich
2017-08-14 14:28 ` [PATCH v5 05/11] mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
2017-09-05 15:01   ` Jan Beulich
2017-09-12  7:49     ` Roger Pau Monné
2017-09-12  9:04       ` Jan Beulich
2017-09-12 11:27         ` Roger Pau Monné
2017-09-12 12:53           ` Jan Beulich
2017-09-12 10:53   ` Julien Grall
2017-09-12 11:38     ` Roger Pau Monné
2017-09-12 13:02       ` Julien Grall
2017-08-14 14:28 ` [PATCH v5 06/11] pci: split code to size BARs from pci_add_device Roger Pau Monne
2017-09-05 15:05   ` Jan Beulich
2017-08-14 14:28 ` [PATCH v5 07/11] pci: add support to size ROM BARs to pci_size_mem_bar Roger Pau Monne
2017-09-05 15:12   ` Jan Beulich
2017-08-14 14:28 ` [PATCH v5 08/11] vpci/bars: add handlers to map the BARs Roger Pau Monne
2017-09-07  9:53   ` Jan Beulich
2017-09-12  9:54     ` Roger Pau Monné
2017-09-12 10:06       ` Jan Beulich
2017-09-12 11:48         ` Roger Pau Monné
2017-09-12 12:56           ` Jan Beulich
2017-08-14 14:28 ` [PATCH v5 09/11] vpci/msi: add MSI handlers Roger Pau Monne
2017-08-22 12:20   ` Paul Durrant
2017-09-07 15:29   ` Jan Beulich
2017-09-14 10:08     ` Roger Pau Monné
2017-09-14 10:19       ` Jan Beulich
2017-09-14 10:42         ` Roger Pau Monné
2017-09-14 10:50           ` Jan Beulich
2017-09-14 11:35             ` Roger Pau Monné
2017-09-14 12:09               ` Jan Beulich
2017-08-14 14:28 ` [PATCH v5 10/11] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
2017-09-07 15:32   ` Jan Beulich
2017-08-14 14:28 ` [PATCH v5 11/11] vpci/msix: add MSI-X handlers Roger Pau Monne
2017-09-07 16:11   ` Roger Pau Monné
2017-09-07 16:12   ` Jan Beulich
2017-09-15 10:44     ` Roger Pau Monné
2017-09-15 11:43       ` Jan Beulich
2017-09-15 12:44         ` Roger Pau Monné

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.