* [PATCH v3 0/9] vpci: PCI config space emulation
@ 2017-04-27 14:35 Roger Pau Monne
  2017-04-27 14:35 ` [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
                   ` (9 more replies)
  0 siblings, 10 replies; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky, julien.grall

Hello,

The following series contains an implementation of handlers for the PCI
configuration space inside of Xen. This allows Xen to detect accesses to the
PCI configuration space and react accordingly.

Although there hasn't been a lot of review on the previous version, I'm sending
this new version because I will be away for > 1 week, and I would rather have
review on this version than the old one. As usual, each patch contains a
summary of the changes between versions.

Patch 1 implements the generic handlers for accesses to the PCI configuration
space together with a minimal user-space test harness that I've used during
development. Currently a per-device red-black tree is used in order to store the
list of handlers, and they are indexed based on their offset inside of the
configuration space. Patch 1 also adds the x86 port IO traps and wires them
into the newly introduced vPCI dispatchers. Patch 2 adds handlers for the ECAM
areas (as found on the MMCFG ACPI table). Patches 3 and 4 are mostly code
movement/refactoring in order to implement support for BAR mapping in patch 5.
Patch 6 allows Xen to mask certain PCI capabilities on-demand, which is used in
order to mask MSI and MSI-X.
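
For the ECAM areas mentioned for patch 2, the offset of an access inside the
area encodes the target device and register. A minimal sketch of that decoding,
assuming the standard PCIe ECAM layout (bus in bits 27:20, device in 19:15,
function in 14:12, register in 11:0); the names are illustrative and are not
taken from the series:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for a decoded ECAM access. */
struct ecam_addr {
    unsigned int bus;   /* bits 27:20 of the offset */
    unsigned int dev;   /* bits 19:15 */
    unsigned int func;  /* bits 14:12 */
    unsigned int reg;   /* bits 11:0, 4K of config space per function */
};

/* Decode an offset relative to the start of an ECAM area. */
static struct ecam_addr ecam_decode(uint64_t offset)
{
    struct ecam_addr a;

    /* A single ECAM area covers at most 256 buses of 1MB each. */
    assert(offset < (256ULL << 20));

    a.bus = (offset >> 20) & 0xff;
    a.dev = (offset >> 15) & 0x1f;
    a.func = (offset >> 12) & 0x7;
    a.reg = offset & 0xfff;

    return a;
}
```

Unlike the IO port mechanism, this exposes the full 4K extended configuration
space of each function.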

Finally, patches 8 and 9 add support for emulating the MSI/MSI-X capabilities
inside of Xen, so that the interrupts are transparently routed to the guest.

This series is based on top of my previous "x86/dpci: bind legacy PCI
interrupts to PVHv2 Dom0". The branch containing the patches can be found at:

git://xenbits.xen.org/people/royger/xen.git vpci_v3

Note that this is only safe to use for the hardware domain (which is trusted);
any non-trusted domain will need a lot more traps before it can freely access
the PCI configuration space.

Thanks, Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
@ 2017-04-27 14:35 ` Roger Pau Monne
  2017-05-19 11:35   ` Jan Beulich
  2017-04-27 14:35 ` [PATCH v3 2/9] x86/ecam: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, julien.grall, Paul Durrant,
	Jan Beulich, boris.ostrovsky, Roger Pau Monne

This functionality is going to reside in vpci.c (and the corresponding vpci.h
header), and should be arch-agnostic. The handlers introduced in this patch
set up the basic functionality required to trap accesses to the PCI config
space, decode the address and find the handler that should process the access
(although no register handlers are implemented yet).

Note that the traps for the PCI IO port registers (0xcf8/0xcfc) are set up
inside of an x86 HVM file, since that's not shared with other arches.
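
A minimal sketch of how an address latched in 0xcf8 can be decoded, assuming
the standard PCI configuration mechanism #1 layout (enable in bit 31, bus in
bits 23:16, device in 15:11, function in 10:8, register in 7:2). The names
below are illustrative and are not the macros used by the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for a decoded config space address. */
struct cfg_addr {
    unsigned int bus;   /* bits 23:16 of the 0xcf8 value */
    unsigned int dev;   /* bits 15:11 */
    unsigned int func;  /* bits 10:8 */
    unsigned int reg;   /* bits 7:2, plus the byte offset into 0xcfc */
};

/* Bit 31 tells whether accesses to 0xcfc are config space cycles. */
static bool cf8_enabled(uint32_t cf8)
{
    return cf8 & 0x80000000u;
}

/*
 * Decode the address latched in 0xcf8. 'port' is the IO port actually
 * accessed (0xcfc..0xcff); its low two bits select the byte inside the
 * double word.
 */
static struct cfg_addr cf8_decode(uint32_t cf8, unsigned int port)
{
    struct cfg_addr a;

    assert(cf8_enabled(cf8));

    a.bus = (cf8 >> 16) & 0xff;
    a.dev = (cf8 >> 11) & 0x1f;
    a.func = (cf8 >> 8) & 0x7;
    a.reg = (cf8 & 0xfc) | (port & 3);

    return a;
}
```

For example, a 2-byte access to port 0xcfe with 0x80010a04 latched in 0xcf8
targets register 6 of device 01:01.2.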

A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen whether
a domain should use the newly introduced vPCI handlers. This is only enabled
for PVH Dom0 at the moment.

A very simple user-space test is also provided, so that the basic functionality
of the vPCI traps can be checked. This has proven quite helpful during
development, since the logic to handle partial accesses or accesses that span
multiple registers is not trivial.
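
A minimal sketch of the byte merging such an access needs, assuming the result
starts out as the value read from hardware and each emulated register then
overwrites its own bytes. The helper name is hypothetical and this is not the
code in vpci.c:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Place the low 'size' bytes of 'val' at byte 'offset' inside 'data',
 * leaving the other bytes of 'data' untouched.
 */
static uint32_t merge_bytes(uint32_t data, uint32_t val, unsigned int size,
                            unsigned int offset)
{
    uint32_t mask;

    /* The merged chunk must fit inside the 32-bit result. */
    assert(size >= 1 && size <= 4 && offset + size <= 4);

    mask = (size == 4 ? 0xffffffffu : (1u << (size * 8)) - 1) << (offset * 8);

    return (data & ~mask) | ((val << (offset * 8)) & mask);
}
```

With the layout used by the test, a 4-byte read at offset 4 starts from the
hardware value 0xffffffff and merges the three 1-byte registers at offsets 5, 6
and 7 (holding 0xba, 0xba and 0xbd) to produce 0xbdbabaff.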

The handlers for the registers are added to a red-black tree that indexes them
based on their offset. Since Xen needs to handle partial accesses to the
registers and accesses that span multiple registers, the logic in
xen_vpci_{read/write} is somewhat convoluted. I've tried to comment it properly
in order to make it easier to understand.
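
A minimal sketch of a comparator that orders registers by offset and reports
any overlap as equality, which lets the same tree walk both reject overlapping
insertions and find the register covering an offset. The struct and function
names are hypothetical and the overlap test is a simplified interval check,
not the one in the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the register descriptor; only the range matters. */
struct reg_range {
    unsigned int offset;
    unsigned int size;
};

/* Two non-empty ranges overlap if each one starts before the other ends. */
static bool ranges_overlap(const struct reg_range *a, const struct reg_range *b)
{
    assert(a->size && b->size);

    return a->offset < b->offset + b->size && b->offset < a->offset + a->size;
}

/* Returns 0 on overlap, otherwise orders the ranges by offset. */
static int range_cmp(const struct reg_range *a, const struct reg_range *b)
{
    if ( ranges_overlap(a, b) )
        return 0;

    return a->offset < b->offset ? -1 : 1;
}
```

Treating overlap as equality means a 4-byte register at offset 4 compares equal
to an existing 1-byte register at offset 5, so the insertion is refused.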

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v2:
 - Generalize the PCI address decoding and use it for IOREQ code also.

Changes since v1:
 - Allow access to cross a word-boundary.
 - Add locking.
 - Add cleanup to xen_vpci_add_handlers in case of failure.
---
 .gitignore                        |   4 +
 tools/libxl/libxl_x86.c           |   2 +-
 tools/tests/Makefile              |   1 +
 tools/tests/vpci/Makefile         |  45 ++++
 tools/tests/vpci/emul.h           | 107 +++++++++
 tools/tests/vpci/main.c           | 206 +++++++++++++++++
 xen/arch/arm/xen.lds.S            |   3 +
 xen/arch/x86/domain.c             |  18 +-
 xen/arch/x86/hvm/hvm.c            |   2 +
 xen/arch/x86/hvm/io.c             | 147 ++++++++++++
 xen/arch/x86/hvm/ioreq.c          |   7 +-
 xen/arch/x86/setup.c              |   3 +-
 xen/arch/x86/xen.lds.S            |   3 +
 xen/drivers/Makefile              |   2 +-
 xen/drivers/passthrough/pci.c     |   3 +
 xen/drivers/vpci/Makefile         |   1 +
 xen/drivers/vpci/vpci.c           | 469 ++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h      |   1 +
 xen/include/asm-x86/hvm/domain.h  |   3 +
 xen/include/asm-x86/hvm/io.h      |   8 +
 xen/include/public/arch-x86/xen.h |   5 +-
 xen/include/xen/pci.h             |   4 +
 xen/include/xen/vpci.h            |  66 ++++++
 23 files changed, 1099 insertions(+), 11 deletions(-)
 create mode 100644 tools/tests/vpci/Makefile
 create mode 100644 tools/tests/vpci/emul.h
 create mode 100644 tools/tests/vpci/main.c
 create mode 100644 xen/drivers/vpci/Makefile
 create mode 100644 xen/drivers/vpci/vpci.c
 create mode 100644 xen/include/xen/vpci.h

diff --git a/.gitignore b/.gitignore
index 74747cb7e7..ebafba25b5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -236,6 +236,10 @@ tools/tests/regression/build/*
 tools/tests/regression/downloads/*
 tools/tests/mem-sharing/memshrtool
 tools/tests/mce-test/tools/xen-mceinj
+tools/tests/vpci/rbtree.[hc]
+tools/tests/vpci/vpci.[hc]
+tools/tests/vpci/test_vpci.out
+tools/tests/vpci/test_vpci
 tools/xcutils/lsevtchn
 tools/xcutils/readnotes
 tools/xenbackendd/_paths.h
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 455f6f0bed..dd7fc78a99 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -11,7 +11,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
     if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
         if (d_config->b_info.device_model_version !=
             LIBXL_DEVICE_MODEL_VERSION_NONE) {
-            xc_config->emulation_flags = XEN_X86_EMU_ALL;
+            xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
         } else if (libxl_defbool_val(d_config->b_info.u.hvm.apic)) {
             /*
              * HVM guests without device model may want
diff --git a/tools/tests/Makefile b/tools/tests/Makefile
index 639776130b..5cfe781e62 100644
--- a/tools/tests/Makefile
+++ b/tools/tests/Makefile
@@ -13,6 +13,7 @@ endif
 SUBDIRS-$(CONFIG_X86) += x86_emulator
 SUBDIRS-y += xen-access
 SUBDIRS-y += xenstore
+SUBDIRS-$(CONFIG_HAS_PCI) += vpci
 
 .PHONY: all clean install distclean
 all clean distclean: %: subdirs-%
diff --git a/tools/tests/vpci/Makefile b/tools/tests/vpci/Makefile
new file mode 100644
index 0000000000..7969fcbd82
--- /dev/null
+++ b/tools/tests/vpci/Makefile
@@ -0,0 +1,45 @@
+
+XEN_ROOT=$(CURDIR)/../../..
+include $(XEN_ROOT)/tools/Rules.mk
+
+TARGET := test_vpci
+
+.PHONY: all
+all: $(TARGET)
+
+.PHONY: run
+run: $(TARGET)
+	./$(TARGET) > $(TARGET).out
+
+$(TARGET): vpci.c vpci.h rbtree.c rbtree.h
+	$(HOSTCC) -g -o $@ vpci.c main.c rbtree.c
+
+.PHONY: clean
+clean:
+	rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c rbtree.c rbtree.h
+
+.PHONY: distclean
+distclean: clean
+
+.PHONY: install
+install:
+
+vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
+	sed -e '/#include/d' <$< >$@
+
+vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
+	# Trick the compiler so it doesn't complain about missing symbols
+	sed -e '/#include/d' \
+	    -e '1s;^;#include "emul.h"\
+	             const vpci_register_init_t __start_vpci_array[1]\;\
+	             const vpci_register_init_t __end_vpci_array[1]\;\
+	             ;' <$< >$@
+
+rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h
+	sed -e '/#include/d' <$< >$@
+
+rbtree.c: $(XEN_ROOT)/xen/common/rbtree.c
+	sed -e "/#include/d" \
+	    -e '1s;^;#include "emul.h"\
+	             ;' <$< >$@
+
diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h
new file mode 100644
index 0000000000..85897ed43b
--- /dev/null
+++ b/tools/tests/vpci/emul.h
@@ -0,0 +1,107 @@
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _TEST_VPCI_
+#define _TEST_VPCI_
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <assert.h>
+
+#define container_of(ptr, type, member) ({                      \
+        typeof( ((type *)0)->member ) *__mptr = (ptr);          \
+        (type *)( (char *)__mptr - offsetof(type,member) );})
+
+#include "rbtree.h"
+
+struct pci_dev {
+    struct domain *domain;
+    struct vpci *vpci;
+};
+
+struct domain {
+    struct pci_dev pdev;
+};
+
+struct vcpu
+{
+    struct domain *domain;
+};
+
+extern struct vcpu v;
+
+#define spin_lock(x)
+#define spin_unlock(x)
+#define spin_is_locked(x) true
+
+#define current (&v)
+
+#define has_vpci(d) true
+
+#include "vpci.h"
+
+#define xzalloc(type) (type *)calloc(1, sizeof(type))
+#define xfree(p) free(p)
+
+#define EXPORT_SYMBOL(x)
+
+#define pci_get_pdev_by_domain(d, ...) &(d)->pdev
+
+#define atomic_read(x) 1
+
+/* Dummy native helpers. Writes are ignored, reads return 1's. */
+#define pci_conf_read8(...) (0xff)
+#define pci_conf_read16(...) (0xffff)
+#define pci_conf_read32(...) (0xffffffff)
+#define pci_conf_write8(...)
+#define pci_conf_write16(...)
+#define pci_conf_write32(...)
+
+#define BUG() assert(0)
+#define ASSERT_UNREACHABLE() assert(0)
+#define ASSERT(x) assert(x)
+
+#ifdef _LP64
+#define BITS_PER_LONG 64
+#else
+#define BITS_PER_LONG 32
+#endif
+#define GENMASK(h, l) \
+    (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+
+#define min(x,y) ({ \
+        const typeof(x) _x = (x);       \
+        const typeof(y) _y = (y);       \
+        (void) (&_x == &_y);            \
+        _x < _y ? _x : _y; })
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/tools/tests/vpci/main.c b/tools/tests/vpci/main.c
new file mode 100644
index 0000000000..0fc63de038
--- /dev/null
+++ b/tools/tests/vpci/main.c
@@ -0,0 +1,206 @@
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "emul.h"
+
+/* Single vcpu (current), and single domain with a single PCI device. */
+static struct vpci vpci = {
+    .handlers = RB_ROOT,
+};
+
+static struct domain d = {
+    .pdev.domain = &d,
+    .pdev.vpci = &vpci,
+};
+
+struct vcpu v = { .domain = &d };
+
+/* Dummy hooks, write stores data, read fetches it. */
+static int vpci_read8(struct pci_dev *pdev, unsigned int reg,
+                      union vpci_val *val, void *data)
+{
+    uint8_t *priv = data;
+
+    val->half_word = *priv;
+    return 0;
+}
+
+static int vpci_write8(struct pci_dev *pdev, unsigned int reg,
+                       union vpci_val val, void *data)
+{
+    uint8_t *priv = data;
+
+    *priv = val.half_word;
+    return 0;
+}
+
+static int vpci_read16(struct pci_dev *pdev, unsigned int reg,
+                       union vpci_val *val, void *data)
+{
+    uint16_t *priv = data;
+
+    val->word = *priv;
+    return 0;
+}
+
+static int vpci_write16(struct pci_dev *pdev, unsigned int reg,
+                        union vpci_val val, void *data)
+{
+    uint16_t *priv = data;
+
+    *priv = val.word;
+    return 0;
+}
+
+static int vpci_read32(struct pci_dev *pdev, unsigned int reg,
+                       union vpci_val *val, void *data)
+{
+    uint32_t *priv = data;
+
+    val->double_word = *priv;
+    return 0;
+}
+
+static int vpci_write32(struct pci_dev *pdev, unsigned int reg,
+                        union vpci_val val, void *data)
+{
+    uint32_t *priv = data;
+
+    *priv = val.double_word;
+    return 0;
+}
+
+#define VPCI_READ(reg, size, data) \
+    assert(!xen_vpci_read(0, 0, 0, reg, size, data))
+
+#define VPCI_READ_CHECK(reg, size, expected) ({ \
+    uint32_t val;                               \
+    VPCI_READ(reg, size, &val);                 \
+    assert(val == expected);                    \
+    })
+
+#define VPCI_WRITE(reg, size, data) \
+    assert(!xen_vpci_write(0, 0, 0, reg, size, data))
+
+#define VPCI_CHECK_REG(reg, size, data) ({      \
+    VPCI_WRITE(reg, size, data);                \
+    VPCI_READ_CHECK(reg, size, data);           \
+    })
+
+#define VPCI_ADD_REG(fread, fwrite, off, size, store)                         \
+    assert(!xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, &store)) \
+
+#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \
+    assert(xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, NULL))  \
+
+int
+main(int argc, char **argv)
+{
+    /* Index storage by offset. */
+    uint32_t r0 = 0xdeadbeef;
+    uint8_t r5 = 0xef;
+    uint8_t r6 = 0xbe;
+    uint8_t r7 = 0xef;
+    uint16_t r12 = 0x8696;
+    int rc;
+
+    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
+    VPCI_READ_CHECK(0, 4, 0xdeadbeef);
+    VPCI_CHECK_REG(0, 4, 0xbcbcbcbc);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
+    VPCI_READ_CHECK(5, 1, 0xef);
+    VPCI_CHECK_REG(5, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
+    VPCI_READ_CHECK(6, 1, 0xbe);
+    VPCI_CHECK_REG(6, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
+    VPCI_READ_CHECK(7, 1, 0xef);
+    VPCI_CHECK_REG(7, 1, 0xbd);
+
+    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
+    VPCI_READ_CHECK(12, 2, 0x8696);
+    VPCI_READ_CHECK(12, 4, 0xffff8696);
+
+    /*
+     * At this point we have the following layout:
+     *
+     * 32    24    16     8     0
+     *  +-----+-----+-----+-----+
+     *  |          r0           | 0
+     *  +-----+-----+-----+-----+
+     *  | r7  |  r6 |  r5 |/////| 32
+     *  +-----+-----+-----+-----|
+     *  |///////////////////////| 64
+     *  +-----------+-----------+
+     *  |///////////|    r12    | 96
+     *  +-----------+-----------+
+     *             ...
+     *  / = empty.
+     */
+
+    /* Try to add an overlapping register handler. */
+    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
+
+    /* Try to add a non-aligned register. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
+
+    /* Try to add a register with wrong size. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
+
+    /* Try to add a register with missing handlers. */
+    VPCI_ADD_INVALID_REG(vpci_read16, NULL, 8, 2);
+    VPCI_ADD_INVALID_REG(NULL, vpci_write16, 8, 2);
+
+    /* Read/write of unset register. */
+    VPCI_READ_CHECK(8, 4, 0xffffffff);
+    VPCI_READ_CHECK(8, 2, 0xffff);
+    VPCI_READ_CHECK(8, 1, 0xff);
+    VPCI_WRITE(10, 2, 0xbeef);
+    VPCI_READ_CHECK(10, 2, 0xffff);
+
+    /* Read of multiple registers */
+    VPCI_CHECK_REG(7, 1, 0xbd);
+    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
+
+    /* Partial read of a register. */
+    VPCI_CHECK_REG(0, 4, 0x1a1b1c1d);
+    VPCI_READ_CHECK(2, 1, 0x1b);
+    VPCI_READ_CHECK(6, 2, 0xbdba);
+
+    /* Write of multiple registers. */
+    VPCI_CHECK_REG(4, 4, 0xaabbccff);
+
+    /* Partial write of a register. */
+    VPCI_CHECK_REG(2, 1, 0xfe);
+    VPCI_CHECK_REG(6, 2, 0xfebc);
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
index 44bd3bf0ce..41bf9dfaf3 100644
--- a/xen/arch/arm/xen.lds.S
+++ b/xen/arch/arm/xen.lds.S
@@ -79,6 +79,9 @@ SECTIONS
        __start_schedulers_array = .;
        *(.data.schedulers)
        __end_schedulers_array = .;
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
        *(.data.rel)
        *(.data.rel.*)
        CONSTRUCTORS
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 90e2b1f82a..f74020facc 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -500,11 +500,21 @@ static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
     if ( is_hvm_domain(d) )
     {
         if ( is_hardware_domain(d) &&
-             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
-            return false;
-        if ( !is_hardware_domain(d) && emflags &&
-             emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC )
+             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
+                         XEN_X86_EMU_VPCI) )
             return false;
+        if ( !is_hardware_domain(d) )
+        {
+            switch ( emflags )
+            {
+            case XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI:
+            case XEN_X86_EMU_LAPIC:
+            case 0:
+                break;
+            default:
+                return false;
+            }
+        }
     }
     else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT )
     {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index a441955322..7f3322ede6 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -37,6 +37,7 @@
 #include <xen/vm_event.h>
 #include <xen/monitor.h>
 #include <xen/warning.h>
+#include <xen/vpci.h>
 #include <asm/shadow.h>
 #include <asm/hap.h>
 #include <asm/current.h>
@@ -655,6 +656,7 @@ int hvm_domain_initialise(struct domain *d)
         d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
 
     register_g2m_portio_handler(d);
+    register_vpci_portio_handler(d);
 
     hvm_ioreq_init(d);
 
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 214ab307c4..80f842b092 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -25,6 +25,7 @@
 #include <xen/trace.h>
 #include <xen/event.h>
 #include <xen/hypercall.h>
+#include <xen/vpci.h>
 #include <asm/current.h>
 #include <asm/cpufeature.h>
 #include <asm/processor.h>
@@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
     handler->ops = &g2m_portio_ops;
 }
 
+/* Do some sanity checks. */
+static int vpci_access_check(unsigned int reg, unsigned int len)
+{
+    /* Check access size. */
+    if ( len != 1 && len != 2 && len != 4 )
+    {
+        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
+                 reg, len);
+        return -EINVAL;
+    }
+
+    /* Check if access crosses a double-word boundary. */
+    if ( (reg & 3) + len > 4 )
+    {
+        gdprintk(XENLOG_WARNING,
+                 "invalid access across double-word boundary (reg: %#x, len: %u)\n",
+                 reg, len);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+/* Helper to decode a PCI address. */
+void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
+                         unsigned int *bus, unsigned int *devfn,
+                         unsigned int *reg)
+{
+    unsigned long bdf;
+
+    ASSERT(CF8_ENABLED(cf8));
+
+    bdf = CF8_BDF(cf8);
+    *bus = PCI_BUS(bdf);
+    *devfn = PCI_DEVFN(PCI_SLOT(bdf), PCI_FUNC(bdf));
+    /*
+     * NB: the lower 2 bits of the register address are fetched from the
+     * offset into the 0xcfc register when reading/writing to it.
+     */
+    *reg = CF8_ADDR_LO(cf8) | (addr & 3);
+}
+
+/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
+static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
+                                 const ioreq_t *p)
+{
+    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;
+}
+
+static int vpci_portio_read(const struct hvm_io_handler *handler,
+                            uint64_t addr, uint32_t size, uint64_t *data)
+{
+    struct domain *d = current->domain;
+    unsigned int bus, devfn, reg;
+    uint32_t data32;
+    int rc;
+
+    vpci_lock(d);
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        *data = d->arch.hvm_domain.pci_cf8;
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+    else if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    /* Decode the PCI address. */
+    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
+
+    if ( vpci_access_check(reg, size) || reg >= 0xff )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    rc = xen_vpci_read(0, bus, devfn, reg, size, &data32);
+    if ( !rc )
+        *data = data32;
+    vpci_unlock(d);
+
+    return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
+}
+
+static int vpci_portio_write(const struct hvm_io_handler *handler,
+                             uint64_t addr, uint32_t size, uint64_t data)
+{
+    struct domain *d = current->domain;
+    unsigned int bus, devfn, reg;
+    int rc;
+
+    vpci_lock(d);
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        d->arch.hvm_domain.pci_cf8 = data;
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+    else if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    /* Decode the PCI address. */
+    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
+
+    if ( vpci_access_check(reg, size) || reg >= 0xff )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    rc = xen_vpci_write(0, bus, devfn, reg, size, data);
+    vpci_unlock(d);
+
+    return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
+}
+
+static const struct hvm_io_ops vpci_portio_ops = {
+    .accept = vpci_portio_accept,
+    .read = vpci_portio_read,
+    .write = vpci_portio_write,
+};
+
+void register_vpci_portio_handler(struct domain *d)
+{
+    struct hvm_io_handler *handler;
+
+    if ( !has_vpci(d) )
+        return;
+
+    handler = hvm_next_io_handler(d);
+    if ( !handler )
+        return;
+
+    spin_lock_init(&d->arch.hvm_domain.vpci_lock);
+    handler->type = IOREQ_TYPE_PIO;
+    handler->ops = &vpci_portio_ops;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index 07a6c2679b..70eabb809c 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -1177,6 +1177,9 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
          CF8_ENABLED(cf8) )
     {
         uint32_t sbdf, x86_fam;
+        unsigned int bus, devfn, reg;
+
+        hvm_pci_decode_addr(cf8, p->addr, &bus, &devfn, &reg);
 
         /* PCI config data cycle */
 
@@ -1186,9 +1189,7 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
                                  PCI_FUNC(CF8_BDF(cf8)));
 
         type = XEN_DMOP_IO_RANGE_PCI;
-        addr = ((uint64_t)sbdf << 32) |
-               CF8_ADDR_LO(cf8) |
-               (p->addr & 3);
+        addr = ((uint64_t)sbdf << 32) | reg;
         /* AMD extended configuration space access? */
         if ( CF8_ADDR_HI(cf8) &&
              d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index f7b927858c..4cf919f206 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1566,7 +1566,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
         domcr_flags |= DOMCRF_hvm |
                        ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
                          DOMCRF_hap : 0);
-        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
+        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
+                                 XEN_X86_EMU_VPCI;
     }
 
     /* Create initial domain 0. */
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 8289a1bf09..f5cc8e2b8d 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -224,6 +224,9 @@ SECTIONS
        __start_schedulers_array = .;
        *(.data.schedulers)
        __end_schedulers_array = .;
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
        *(.data.rel.ro)
        *(.data.rel.ro.*)
   } :text
diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile
index 19391802a8..d51c766453 100644
--- a/xen/drivers/Makefile
+++ b/xen/drivers/Makefile
@@ -1,6 +1,6 @@
 subdir-y += char
 subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq
-subdir-$(CONFIG_HAS_PCI) += pci
+subdir-$(CONFIG_HAS_PCI) += pci vpci
 subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough
 subdir-$(CONFIG_ACPI) += acpi
 subdir-$(CONFIG_VIDEO) += video
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index c8e2d2d9a9..2288cf8814 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -30,6 +30,7 @@
 #include <xen/radix-tree.h>
 #include <xen/softirq.h>
 #include <xen/tasklet.h>
+#include <xen/vpci.h>
 #include <xsm/xsm.h>
 #include <asm/msi.h>
 #include "ats.h"
@@ -1041,6 +1042,8 @@ static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
         devfn += pdev->phantom_stride;
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
+
+    xen_vpci_add_handlers(pdev);
 }
 
 static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg)
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
new file mode 100644
index 0000000000..840a906470
--- /dev/null
+++ b/xen/drivers/vpci/Makefile
@@ -0,0 +1 @@
+obj-y += vpci.o
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
new file mode 100644
index 0000000000..b159f0db80
--- /dev/null
+++ b/xen/drivers/vpci/vpci.c
@@ -0,0 +1,469 @@
+/*
+ * Generic functionality for handling accesses to the PCI configuration space
+ * from guests.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+
+extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
+#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
+#define vpci_init __start_vpci_array
+
+/* Internal struct to store the emulated PCI registers. */
+struct vpci_register {
+    vpci_read_t read;
+    vpci_write_t write;
+    unsigned int size;
+    unsigned int offset;
+    void *priv_data;
+    struct rb_node node;
+};
+
+int xen_vpci_add_handlers(struct pci_dev *pdev)
+{
+    int i, rc = 0;
+
+    if ( !has_vpci(pdev->domain) )
+        return 0;
+
+    pdev->vpci = xzalloc(struct vpci);
+    if ( !pdev->vpci )
+        return -ENOMEM;
+
+    pdev->vpci->handlers = RB_ROOT;
+
+    for ( i = 0; i < NUM_VPCI_INIT; i++ )
+    {
+        rc = vpci_init[i](pdev);
+        if ( rc )
+            break;
+    }
+
+    if ( rc )
+    {
+        struct rb_node *node = rb_first(&pdev->vpci->handlers);
+        struct vpci_register *r;
+
+        /* Iterate over the tree and cleanup. */
+        while ( node != NULL )
+        {
+            r = container_of(node, struct vpci_register, node);
+            node = rb_next(node);
+            rb_erase(&r->node, &pdev->vpci->handlers);
+            xfree(r);
+        }
+        xfree(pdev->vpci);
+    }
+
+    return rc;
+}
+
+static bool vpci_register_overlap(const struct vpci_register *r,
+                                  unsigned int offset)
+{
+    if ( offset >= r->offset && offset < r->offset + r->size )
+        return true;
+
+    return false;
+}
+
+
+static int vpci_register_cmp(const struct vpci_register *r1,
+                             const struct vpci_register *r2)
+{
+    /* Make sure there's no overlap between registers. */
+    if ( vpci_register_overlap(r1, r2->offset) ||
+         vpci_register_overlap(r1, r2->offset + r2->size - 1) ||
+         vpci_register_overlap(r2, r1->offset) ||
+         vpci_register_overlap(r2, r1->offset + r1->size - 1) )
+        return 0;
+
+    if ( r1->offset < r2->offset )
+        return -1;
+    else if ( r1->offset > r2->offset )
+        return 1;
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
+static struct vpci_register *vpci_find_register(const struct pci_dev *pdev,
+                                                const unsigned int reg,
+                                                const unsigned int size)
+{
+    struct rb_node *node;
+    struct vpci_register r = {
+        .offset = reg,
+        .size = size,
+    };
+
+    ASSERT(vpci_locked(pdev->domain));
+
+    node = pdev->vpci->handlers.rb_node;
+    while ( node )
+    {
+        struct vpci_register *t =
+            container_of(node, struct vpci_register, node);
+
+        switch ( vpci_register_cmp(&r, t) )
+        {
+        case -1:
+            node = node->rb_left;
+            break;
+        case 1:
+            node = node->rb_right;
+            break;
+        default:
+            return t;
+        }
+    }
+
+    return NULL;
+}
+
+int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
+                          vpci_write_t write_handler, unsigned int offset,
+                          unsigned int size, void *data)
+{
+    struct rb_node **new, *parent;
+    struct vpci_register *r;
+
+    /* Some sanity checks. */
+    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
+         offset & (size - 1) || read_handler == NULL || write_handler == NULL )
+        return -EINVAL;
+
+    r = xzalloc(struct vpci_register);
+    if ( !r )
+        return -ENOMEM;
+
+    r->read = read_handler;
+    r->write = write_handler;
+    r->size = size;
+    r->offset = offset;
+    r->priv_data = data;
+
+    vpci_lock(pdev->domain);
+    new = &pdev->vpci->handlers.rb_node;
+    parent = NULL;
+
+    while (*new) {
+        struct vpci_register *this =
+            container_of(*new, struct vpci_register, node);
+
+        parent = *new;
+        switch ( vpci_register_cmp(r, this) )
+        {
+        case -1:
+            new = &((*new)->rb_left);
+            break;
+        case 1:
+            new = &((*new)->rb_right);
+            break;
+        default:
+            xfree(r);
+            vpci_unlock(pdev->domain);
+            return -EEXIST;
+        }
+    }
+
+    rb_link_node(&r->node, parent, new);
+    rb_insert_color(&r->node, &pdev->vpci->handlers);
+    vpci_unlock(pdev->domain);
+
+    return 0;
+}
+
+int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
+{
+    struct vpci_register *r;
+
+    vpci_lock(pdev->domain);
+    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);
+    if ( !r )
+    {
+        vpci_unlock(pdev->domain);
+        return -ENOENT;
+    }
+
+    rb_erase(&r->node, &pdev->vpci->handlers);
+    xfree(r);
+    vpci_unlock(pdev->domain);
+
+    return 0;
+}
+
+/* Wrappers for performing reads/writes to the underlying hardware. */
+static void vpci_read_hw(unsigned int seg, unsigned int bus,
+                         unsigned int devfn, unsigned int reg, uint32_t size,
+                         uint32_t *data)
+{
+    switch ( size )
+    {
+    case 4:
+        *data = pci_conf_read32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                reg);
+        break;
+    case 3:
+        /*
+         * This is possible because a 4-byte read can have 1 byte trapped and
+         * the rest passed-through.
+         */
+        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                reg + 1) << 8;
+        *data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                               reg);
+        break;
+    case 2:
+        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                reg);
+        break;
+    case 1:
+        *data = pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                               reg);
+        break;
+    default:
+        BUG();
+    }
+}
+
+static void vpci_write_hw(unsigned int seg, unsigned int bus,
+                          unsigned int devfn, unsigned int reg, uint32_t size,
+                          uint32_t data)
+{
+    switch ( size )
+    {
+    case 4:
+        pci_conf_write32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,
+                         data);
+        break;
+    case 3:
+        /*
+         * This is possible because a 4-byte write can have 1 byte trapped and
+         * the rest passed-through.
+         */
+        pci_conf_write8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg, data);
+        pci_conf_write16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg + 1,
+                         data >> 8);
+        break;
+    case 2:
+        pci_conf_write16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,
+                         data);
+        break;
+    case 1:
+        pci_conf_write8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg, data);
+        break;
+    default:
+        BUG();
+    }
+}
+
+/* Helper macros for the read/write handlers. */
+#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
+#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
+#define ADD_RESULT(r, d, s, o) r |= ((d) & GENMASK_BYTES(s, 0)) << ((o) * 8)
+
+int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
+                  unsigned int reg, uint32_t size, uint32_t *data)
+{
+    struct domain *d = current->domain;
+    struct pci_dev *pdev;
+    const struct vpci_register *r;
+    union vpci_val val = { .double_word = 0 };
+    unsigned int data_rshift = 0, data_lshift = 0, data_size;
+    uint32_t tmp_data;
+    int rc;
+
+    ASSERT(vpci_locked(d));
+
+    *data = 0;
+
+    /* Find the PCI dev matching the address. */
+    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
+    if ( !pdev )
+        goto passthrough;
+
+    /* Find the vPCI register handler. */
+    r = vpci_find_register(pdev, reg, size);
+    if ( !r )
+        goto passthrough;
+
+    if ( r->offset > reg )
+    {
+        /*
+         * There's a leading gap before the emulated register.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_read(seg, bus, devfn, reg, r->offset - reg, &tmp_data);
+        if ( rc )
+            return rc;
+
+        /* Add the head read to the partial result. */
+        ADD_RESULT(*data, tmp_data, r->offset - reg, 0);
+        data_lshift = r->offset - reg;
+
+        /* Account for the read. */
+        size -= data_lshift;
+        reg += data_lshift;
+    }
+    else if ( r->offset < reg )
+        /* There's an offset into the emulated register */
+        data_rshift = reg - r->offset;
+
+    ASSERT(data_lshift == 0 || data_rshift == 0);
+    data_size = min(size, r->size - data_rshift);
+    ASSERT(data_size != 0);
+
+    /* Perform the read of the register. */
+    rc = r->read(pdev, r->offset, &val, r->priv_data);
+    if ( rc )
+        return rc;
+
+    val.double_word >>= data_rshift * 8;
+    ADD_RESULT(*data, val.double_word, data_size, data_lshift);
+
+    /* Account for the read */
+    size -= data_size;
+    reg += data_size;
+
+    /* Read the remaining, if any. */
+    if ( size > 0 )
+    {
+        /*
+         * Read trailing data.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_read(seg, bus, devfn, reg, size, &tmp_data);
+        if ( rc )
+            return rc;
+
+        /* Add the tail read to the partial result. */
+        ADD_RESULT(*data, tmp_data, size, data_size + data_lshift);
+    }
+
+    return 0;
+
+ passthrough:
+    vpci_read_hw(seg, bus, devfn, reg, size, data);
+    return 0;
+}
+
+/* Perform a maybe partial write to a register. */
+static int vpci_write_helper(struct pci_dev *pdev,
+                             const struct vpci_register *r, unsigned int size,
+                             unsigned int offset, uint32_t data)
+{
+    union vpci_val val = { .double_word = data };
+    int rc;
+
+    ASSERT(size <= r->size);
+    if ( size != r->size )
+    {
+        rc = r->read(pdev, r->offset, &val, r->priv_data);
+        if ( rc )
+            return rc;
+        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
+        data &= GENMASK_BYTES(size, 0);
+        val.double_word |= data << (offset * 8);
+    }
+
+    return r->write(pdev, r->offset, val, r->priv_data);
+}
+
+int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
+                   unsigned int reg, uint32_t size, uint32_t data)
+{
+    struct domain *d = current->domain;
+    struct pci_dev *pdev;
+    const struct vpci_register *r;
+    unsigned int data_size, data_offset = 0;
+    int rc;
+
+    ASSERT(vpci_locked(d));
+
+    /* Find the PCI dev matching the address. */
+    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
+    if ( !pdev )
+        goto passthrough;
+
+    /* Find the vPCI register handler. */
+    r = vpci_find_register(pdev, reg, size);
+    if ( !r )
+        goto passthrough;
+
+    else if ( r->offset > reg )
+    {
+        /*
+         * There's a leading gap before the emulated register found.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_write(seg, bus, devfn, reg, r->offset - reg, data);
+        if ( rc )
+            return rc;
+
+        /* Advance the data by the written size. */
+        SHIFT_RIGHT_BYTES(data, r->offset - reg);
+        size -= r->offset - reg;
+        reg += r->offset - reg;
+    }
+    else if ( r->offset < reg )
+        /* There's an offset into the emulated register. */
+        data_offset = reg - r->offset;
+
+    data_size = min(size, r->size - data_offset);
+
+    /* Perform the write of the register. */
+    ASSERT(data_size != 0);
+    rc = vpci_write_helper(pdev, r, data_size, data_offset, data);
+    if ( rc )
+        return rc;
+
+    /* Account for the write. */
+    size -= data_size;
+    reg += data_size;
+    SHIFT_RIGHT_BYTES(data, data_size);
+
+    /* Write the remaining, if any. */
+    if ( size > 0 )
+    {
+        /*
+         * Write trailing data.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_write(seg, bus, devfn, reg, size, data);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+
+ passthrough:
+    vpci_write_hw(seg, bus, devfn, reg, size, data);
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 6ab987f231..f0741917ed 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -426,6 +426,7 @@ struct arch_domain
 #define has_vpit(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_PIT))
 #define has_pirq(d)        (!!((d)->arch.emulation_flags & \
                             XEN_X86_EMU_USE_PIRQ))
+#define has_vpci(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_VPCI))
 
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
 
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index d2899c9bb2..cbf4170789 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -184,6 +184,9 @@ struct hvm_domain {
     /* List of guest to machine IO ports mapping. */
     struct list_head g2m_ioport_list;
 
+    /* Lock for the PCI emulation layer (vPCI). */
+    spinlock_t vpci_lock;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 2484eb1c75..9ed1bf2e06 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -149,12 +149,20 @@ void stdvga_deinit(struct domain *d);
 
 extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
 
+/* Decode a PCI port IO access into a bus/devfn/reg. */
+void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
+                         unsigned int *bus, unsigned int *devfn,
+                         unsigned int *reg);
+
 /*
  * HVM port IO handler that performs forwarding of guest IO ports into machine
  * IO ports.
  */
 void register_g2m_portio_handler(struct domain *d);
 
+/* HVM port IO handler for PCI accesses. */
+void register_vpci_portio_handler(struct domain *d);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index f21332e897..86a1a09a8d 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -295,12 +295,15 @@ struct xen_arch_domainconfig {
 #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
 #define _XEN_X86_EMU_USE_PIRQ       9
 #define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
+#define _XEN_X86_EMU_VPCI           10
+#define XEN_X86_EMU_VPCI            (1U<<_XEN_X86_EMU_VPCI)
 
 #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
                                      XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
                                      XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
                                      XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
-                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
+                                     XEN_X86_EMU_VPCI)
     uint32_t emulation_flags;
 };
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 59b6e8a81c..a83c4a1276 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -13,6 +13,7 @@
 #include <xen/irq.h>
 #include <xen/pci_regs.h>
 #include <xen/pfn.h>
+#include <xen/rbtree.h>
 #include <asm/device.h>
 #include <asm/numa.h>
 #include <asm/pci.h>
@@ -88,6 +89,9 @@ struct pci_dev {
 #define PT_FAULT_THRESHOLD 10
     } fault;
     u64 vf_rlen[6];
+
+    /* Data for vPCI. */
+    struct vpci *vpci;
 };
 
 #define for_each_pdev(domain, pdev) \
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
new file mode 100644
index 0000000000..56e8d1c35e
--- /dev/null
+++ b/xen/include/xen/vpci.h
@@ -0,0 +1,66 @@
+#ifndef _VPCI_
+#define _VPCI_
+
+#include <xen/pci.h>
+#include <xen/types.h>
+
+/* Helpers for locking/unlocking. */
+#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
+
+/* Value read or written by the handlers. */
+union vpci_val {
+    uint8_t half_word;
+    uint16_t word;
+    uint32_t double_word;
+};
+
+/*
+ * The vPCI handlers will never be called concurrently for the same domain; it
+ * is guaranteed that the vpci domain lock will always be locked when calling
+ * any handler.
+ */
+typedef int (*vpci_read_t)(struct pci_dev *pdev, unsigned int reg,
+                           union vpci_val *val, void *data);
+
+typedef int (*vpci_write_t)(struct pci_dev *pdev, unsigned int reg,
+                            union vpci_val val, void *data);
+
+typedef int (*vpci_register_init_t)(struct pci_dev *dev);
+
+#define REGISTER_VPCI_INIT(x) \
+  static const vpci_register_init_t x##_entry __used_section(".data.vpci") = x
+
+/* Add vPCI handlers to device. */
+int xen_vpci_add_handlers(struct pci_dev *dev);
+
+/* Add/remove a register handler. */
+int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
+                          vpci_write_t write_handler, unsigned int offset,
+                          unsigned int size, void *data);
+int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset);
+
+/* Generic read/write handlers for the PCI config space. */
+int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
+                  unsigned int reg, uint32_t size, uint32_t *data);
+int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
+                   unsigned int reg, uint32_t size, uint32_t data);
+
+struct vpci {
+    /* Root pointer for the tree of vPCI handlers. */
+    struct rb_root handlers;
+};
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* [PATCH v3 2/9] x86/ecam: add handlers for the PVH Dom0 MMCFG areas
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
  2017-04-27 14:35 ` [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
@ 2017-04-27 14:35 ` Roger Pau Monne
  2017-05-19 13:25   ` Jan Beulich
  2017-04-27 14:35 ` [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Paul Durrant, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

Introduce a set of handlers for accesses to the ECAM areas. Those areas are
set up based on the contents of the hardware MMCFG tables, and the list of
handled ECAM areas is stored inside the hvm_domain struct.

Reads and writes are forwarded to the generic vPCI handlers once the address
has been decoded to obtain the device and register the guest is trying to
access.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v1:
 - Added locking.
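
For readers who want to follow the address decoding step in isolation, here is
a minimal standalone sketch of it. It mirrors the arithmetic of
vpci_ecam_decode_addr in this patch, but struct ecam_region, struct ecam_bdf
and ecam_decode are hypothetical names made up for the example and are not
part of Xen.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct hvm_ecam; not the Xen definition. */
struct ecam_region {
    uint64_t addr;      /* guest physical address of the first bus covered */
    uint64_t size;      /* bytes covered: 1MiB per bus */
    unsigned int bus;   /* first bus number covered */
};

/* Hypothetical result type: the decoded bus, devfn and register. */
struct ecam_bdf {
    unsigned int bus, devfn, reg;
};

/*
 * Split an address inside the region into bus, devfn and register:
 * bits 20-27 select the bus, bits 12-19 the devfn and bits 0-11 the
 * register inside the 4KiB config space of that function.
 */
static struct ecam_bdf ecam_decode(const struct ecam_region *r, uint64_t addr)
{
    struct ecam_bdf out;

    assert(addr >= r->addr && addr - r->addr < r->size);

    addr -= r->addr;
    out.bus = (unsigned int)((addr >> 20) & 0xff) + r->bus;
    out.devfn = (unsigned int)((addr >> 12) & 0xff);
    out.reg = (unsigned int)(addr & 0xfff);

    return out;
}
```

With a region based at 0xe0000000 that starts at bus 0, the address 0xe0328010
decodes to bus 3, devfn 0x28 (slot 5, function 0) and register 0x10.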
---
 xen/arch/x86/hvm/dom0_build.c    |  27 ++++++++
 xen/arch/x86/hvm/hvm.c           |  10 +++
 xen/arch/x86/hvm/io.c            | 139 +++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/domain.h |  11 ++++
 xen/include/asm-x86/hvm/io.h     |   5 ++
 5 files changed, 192 insertions(+)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 020c355faf..b8999a053a 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -38,6 +38,8 @@
 #include <public/hvm/hvm_info_table.h>
 #include <public/hvm/hvm_vcpu.h>
 
+#include "../x86_64/mmconfig.h"
+
 /*
  * Have the TSS cover the ISA port range, which makes it
  * - 104 bytes base structure
@@ -1048,6 +1050,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
     return 0;
 }
 
+int __init pvh_setup_ecam(struct domain *d)
+{
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i < pci_mmcfg_config_num; i++ )
+    {
+        rc = register_vpci_ecam_handler(d, pci_mmcfg_config[i].address,
+                                        pci_mmcfg_config[i].start_bus_number,
+                                        pci_mmcfg_config[i].end_bus_number,
+                                        pci_mmcfg_config[i].pci_segment);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
 int __init dom0_construct_pvh(struct domain *d, const module_t *image,
                               unsigned long image_headroom,
                               module_t *initrd,
@@ -1090,6 +1110,13 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = pvh_setup_ecam(d);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 PCI ECAM areas: %d\n", rc);
+        return rc;
+    }
+
     panic("Building a PVHv2 Dom0 is not yet supported.");
     return 0;
 }
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 7f3322ede6..ef3ad2a615 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -613,6 +613,7 @@ int hvm_domain_initialise(struct domain *d)
     spin_lock_init(&d->arch.hvm_domain.write_map.lock);
     INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
+    INIT_LIST_HEAD(&d->arch.hvm_domain.ecam_regions);
 
     hvm_init_cacheattr_region_list(d);
 
@@ -725,6 +726,7 @@ void hvm_domain_destroy(struct domain *d)
 {
     struct list_head *ioport_list, *tmp;
     struct g2m_ioport *ioport;
+    struct hvm_ecam *ecam, *etmp;
 
     xfree(d->arch.hvm_domain.io_handler);
     d->arch.hvm_domain.io_handler = NULL;
@@ -752,6 +754,14 @@ void hvm_domain_destroy(struct domain *d)
         list_del(&ioport->list);
         xfree(ioport);
     }
+
+    list_for_each_entry_safe ( ecam, etmp, &d->arch.hvm_domain.ecam_regions,
+                               next )
+    {
+        list_del(&ecam->next);
+        xfree(ecam);
+    }
+
 }
 
 static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h)
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 80f842b092..1f138989ff 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -403,6 +403,145 @@ void register_vpci_portio_handler(struct domain *d)
     handler->ops = &vpci_portio_ops;
 }
 
+/* Handlers to trap PCI ECAM config accesses. */
+static struct hvm_ecam *vpci_ecam_find(struct domain *d, unsigned long addr)
+{
+    struct hvm_ecam *ecam = NULL;
+
+    ASSERT(vpci_locked(d));
+    list_for_each_entry ( ecam, &d->arch.hvm_domain.ecam_regions, next )
+        if ( addr >= ecam->addr && addr < ecam->addr + ecam->size )
+            return ecam;
+
+    return NULL;
+}
+
+static void vpci_ecam_decode_addr(struct hvm_ecam *ecam, unsigned long addr,
+                                  unsigned int *bus, unsigned int *devfn,
+                                  unsigned int *reg)
+{
+    addr -= ecam->addr;
+    *bus = ((addr >> 20) & 0xff) + ecam->bus;
+    *devfn = (addr >> 12) & 0xff;
+    *reg = addr & 0xfff;
+}
+
+static int vpci_ecam_accept(struct vcpu *v, unsigned long addr)
+{
+    struct domain *d = v->domain;
+    int found;
+
+    vpci_lock(d);
+    found = !!vpci_ecam_find(v->domain, addr);
+    vpci_unlock(d);
+
+    return found;
+}
+
+static int vpci_ecam_read(struct vcpu *v, unsigned long addr,
+                          unsigned int len, unsigned long *data)
+{
+    struct domain *d = v->domain;
+    struct hvm_ecam *ecam;
+    unsigned int bus, devfn, reg;
+    uint32_t data32;
+    int rc;
+
+    vpci_lock(d);
+    ecam = vpci_ecam_find(d, addr);
+    if ( !ecam )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    vpci_ecam_decode_addr(ecam, addr, &bus, &devfn, &reg);
+
+    if ( vpci_access_check(reg, len) || reg >= 0xfff )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    rc = xen_vpci_read(ecam->segment, bus, devfn, reg, len, &data32);
+    if ( !rc )
+        *data = data32;
+    vpci_unlock(d);
+
+    return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
+}
+
+static int vpci_ecam_write(struct vcpu *v, unsigned long addr,
+                           unsigned int len, unsigned long data)
+{
+    struct domain *d = v->domain;
+    struct hvm_ecam *ecam;
+    unsigned int bus, devfn, reg;
+    int rc;
+
+    vpci_lock(d);
+    ecam = vpci_ecam_find(d, addr);
+    if ( !ecam )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    vpci_ecam_decode_addr(ecam, addr, &bus, &devfn, &reg);
+
+    if ( vpci_access_check(reg, len) || reg >= 0xfff )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    rc = xen_vpci_write(ecam->segment, bus, devfn, reg, len, data);
+    vpci_unlock(d);
+
+    return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops vpci_ecam_ops = {
+    .check = vpci_ecam_accept,
+    .read = vpci_ecam_read,
+    .write = vpci_ecam_write,
+};
+
+int register_vpci_ecam_handler(struct domain *d, paddr_t addr,
+                               unsigned int start_bus, unsigned int end_bus,
+                               unsigned int seg)
+{
+    struct hvm_ecam *ecam;
+
+    ASSERT(is_hardware_domain(d));
+
+    vpci_lock(d);
+    if ( vpci_ecam_find(d, addr) )
+    {
+        vpci_unlock(d);
+        return -EEXIST;
+    }
+
+    ecam = xzalloc(struct hvm_ecam);
+    if ( !ecam )
+    {
+        vpci_unlock(d);
+        return -ENOMEM;
+    }
+
+    if ( list_empty(&d->arch.hvm_domain.ecam_regions) )
+        register_mmio_handler(d, &vpci_ecam_ops);
+
+    ecam->addr = addr + (start_bus << 20);
+    ecam->bus = start_bus;
+    ecam->segment = seg;
+    ecam->size = (end_bus - start_bus + 1) << 20;
+    list_add(&ecam->next, &d->arch.hvm_domain.ecam_regions);
+    vpci_unlock(d);
+
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index cbf4170789..b69c33df3c 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -100,6 +100,14 @@ struct hvm_pi_ops {
     void (*do_resume)(struct vcpu *v);
 };
 
+struct hvm_ecam {
+    paddr_t addr;
+    size_t size;
+    unsigned int bus;
+    unsigned int segment;
+    struct list_head next;
+};
+
 struct hvm_domain {
     /* Guest page range used for non-default ioreq servers */
     struct {
@@ -187,6 +195,9 @@ struct hvm_domain {
     /* Lock for the PCI emulation layer (vPCI). */
     spinlock_t vpci_lock;
 
+    /* List of ECAM (MMCFG) regions trapped by Xen. */
+    struct list_head ecam_regions;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 9ed1bf2e06..f9b9829eaf 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -163,6 +163,11 @@ void register_g2m_portio_handler(struct domain *d);
 /* HVM port IO handler for PCI accesses. */
 void register_vpci_portio_handler(struct domain *d);
 
+/* HVM MMIO handler for PCI ECAM accesses. */
+int register_vpci_ecam_handler(struct domain *d, paddr_t addr,
+                               unsigned int start_bus, unsigned int end_bus,
+                               unsigned int seg);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
-- 
2.11.0 (Apple Git-81)



* [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
  2017-04-27 14:35 ` [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
  2017-04-27 14:35 ` [PATCH v3 2/9] x86/ecam: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
@ 2017-04-27 14:35 ` Roger Pau Monne
  2017-05-19 13:35   ` Jan Beulich
  2017-04-27 14:35 ` [PATCH v3 4/9] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Also allow it to create non-identity mappings by adding a new parameter. The
function will be needed outside of the PVH Dom0 builder. While there, switch
the function to use gfn_t and mfn_t instead of unsigned long for memory
addresses.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Use mfn_t and gfn_t.
 - Remove stray newline.
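
The retry loop that modify_mmio wraps around the mapping call can be sketched
outside of Xen as follows. This is only an illustration under stated
assumptions: fake_map is a hypothetical mapper that handles at most 64 pages
per call, and it returns 0 when the range is finished or the number of pages
handled when more remain, which is the convention the loop in this patch
relies on. Plain integers stand in for gfn_t and mfn_t.

```c
#include <stdint.h>

/* Bookkeeping for the example only: call count and last range start seen. */
static unsigned long calls;
static uint64_t last_gfn, last_mfn;

/*
 * Hypothetical mapper, not a Xen function. Returns 0 once the whole range
 * has been handled, or the number of pages handled if more remain.
 */
static int fake_map(uint64_t gfn, unsigned long nr, uint64_t mfn)
{
    const unsigned long batch = 64;

    calls++;
    last_gfn = gfn;
    last_mfn = mfn;

    return nr <= batch ? 0 : (int)batch;
}

/* Same shape as the loop in modify_mmio: advance by rc and retry. */
static int map_range(uint64_t gfn, uint64_t mfn, unsigned long nr)
{
    int rc;

    for ( ; ; )
    {
        rc = fake_map(gfn, nr, mfn);
        if ( rc <= 0 )
            break;          /* 0: done, negative: error */
        nr -= rc;
        gfn += rc;
        mfn += rc;
        /* Xen calls process_pending_softirqs() at this point. */
    }

    return rc;
}
```

Because gfn and mfn are advanced separately, the same loop serves identity and
non-identity mappings alike.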
---
 xen/arch/x86/hvm/dom0_build.c | 22 +---------------------
 xen/common/memory.c           | 36 ++++++++++++++++++++++++++++++++++++
 xen/include/xen/p2m-common.h  |  3 +++
 3 files changed, 40 insertions(+), 21 deletions(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index b8999a053a..9efe4e571b 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -64,27 +64,7 @@ static struct acpi_madt_nmi_source __initdata *nmisrc;
 static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
                                        unsigned long nr_pages, const bool map)
 {
-    int rc;
-
-    for ( ; ; )
-    {
-        rc = (map ? map_mmio_regions : unmap_mmio_regions)
-             (d, _gfn(pfn), nr_pages, _mfn(pfn));
-        if ( rc == 0 )
-            break;
-        if ( rc < 0 )
-        {
-            printk(XENLOG_WARNING
-                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
-                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
-            break;
-        }
-        nr_pages -= rc;
-        pfn += rc;
-        process_pending_softirqs();
-    }
-
-    return rc;
+    return modify_mmio(d, _gfn(pfn), _mfn(pfn), nr_pages, map);
 }
 
 /* Populate a HVM memory range using the biggest possible order. */
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 52879e7438..888b3f2f4f 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1438,6 +1438,42 @@ int prepare_ring_for_helper(
     return 0;
 }
 
+int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
+                const bool map)
+{
+    int rc;
+
+    /*
+     * Make sure this function is only used by the hardware domain, because it
+     * can take an arbitrarily long time, and could DoS the whole system.
+     */
+    ASSERT(is_hardware_domain(d));
+
+    for ( ; ; )
+    {
+        rc = (map ? map_mmio_regions : unmap_mmio_regions)
+             (d, gfn, nr_pages, mfn);
+        if ( rc == 0 )
+            break;
+        if ( rc < 0 )
+        {
+            printk(XENLOG_G_WARNING
+                   "Failed to %smap [%" PRI_gfn ", %" PRI_gfn ") -> "
+                   "[%" PRI_mfn ", %" PRI_mfn ") for d%d: %d\n",
+                   map ? "" : "un", gfn_x(gfn), gfn_x(gfn_add(gfn, nr_pages)),
+                   mfn_x(mfn), mfn_x(mfn_add(mfn, nr_pages)), d->domain_id,
+                   rc);
+            break;
+        }
+        nr_pages -= rc;
+        mfn = mfn_add(mfn, rc);
+        gfn = gfn_add(gfn, rc);
+        process_pending_softirqs();
+    }
+
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/xen/p2m-common.h b/xen/include/xen/p2m-common.h
index 8cd5a6b503..4f398cb847 100644
--- a/xen/include/xen/p2m-common.h
+++ b/xen/include/xen/p2m-common.h
@@ -13,4 +13,7 @@ int unmap_mmio_regions(struct domain *d,
                        unsigned long nr,
                        mfn_t mfn);
 
+int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
+                const bool map);
+
 #endif /* _XEN_P2M_COMMON_H */
-- 
2.11.0 (Apple Git-81)



* [PATCH v3 4/9] xen/pci: split code to size BARs from pci_add_device
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (2 preceding siblings ...)
  2017-04-27 14:35 ` [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
@ 2017-04-27 14:35 ` Roger Pau Monne
  2017-05-19 13:56   ` Jan Beulich
  2017-04-27 14:35 ` [PATCH v3 5/9] xen/vpci: add handlers to map the BARs Roger Pau Monne
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky, Roger Pau Monne, julien.grall, Jan Beulich

Split the BAR sizing code out of pci_add_device so that it can be called from
elsewhere to get the size of regular PCI BARs. This is required in order to
map the BARs of PCI devices into the PVH Dom0 p2m.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
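
As an aside, the size computation performed by pci_size_bar can be tried in
isolation. The sketch below takes the values read back from a memory BAR after
all ones have been written to it and derives the size the same way the patch
does. bar_size and BAR_MEM_MASK are hypothetical names for this example; the
mask is assumed to match PCI_BASE_ADDRESS_MEM_MASK (low four bits cleared).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match PCI_BASE_ADDRESS_MEM_MASK: drops the four flag bits. */
#define BAR_MEM_MASK 0xfffffff0u

/*
 * lo/hi are the values read back after writing ~0 to the BAR register(s).
 * The device hardwires the low address bits to zero, so the two's
 * complement of the masked value is the size of the BAR.
 */
static uint64_t bar_size(uint32_t lo, uint32_t hi, bool is_64bit)
{
    uint64_t size = lo & BAR_MEM_MASK;

    if ( is_64bit )
        size |= (uint64_t)hi << 32;
    else if ( size )
        size |= ~UINT64_C(0) << 32;   /* sign-extend a 32-bit BAR */

    return ~size + 1;                 /* same as -size modulo 2^64 */
}
```

A 32-bit BAR reading back 0xfff00000 yields 0x100000 (1MiB), and an
unimplemented BAR that reads back 0 yields a size of 0.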
---
 xen/drivers/passthrough/pci.c | 86 ++++++++++++++++++++++++++-----------------
 xen/include/xen/pci.h         |  3 ++
 2 files changed, 56 insertions(+), 33 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 2288cf8814..7710c41533 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -588,6 +588,51 @@ static void pci_enable_acs(struct pci_dev *pdev)
     pci_conf_write16(seg, bus, dev, func, pos + PCI_ACS_CTRL, ctrl);
 }
 
+int pci_size_bar(unsigned int seg, unsigned int bus, unsigned int slot,
+                 unsigned int func, unsigned int base, unsigned int max_bars,
+                 unsigned int *index, uint64_t *addr, uint64_t *size)
+{
+    unsigned int idx = base + *index * 4;
+    u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
+    u32 hi = 0;
+
+    *addr = *size = 0;
+
+    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
+    pci_conf_write32(seg, bus, slot, func, idx, ~0);
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        if ( *index >= max_bars )
+        {
+            dprintk(XENLOG_WARNING,
+                    "device %04x:%02x:%02x.%u with 64-bit BAR in last slot\n",
+                    seg, bus, slot, func);
+            return -EINVAL;
+        }
+        hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
+        pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
+    }
+    *size = pci_conf_read32(seg, bus, slot, func, idx) &
+            PCI_BASE_ADDRESS_MEM_MASK;
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        *size |= (u64)pci_conf_read32(seg, bus, slot, func, idx + 4) << 32;
+        pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
+    }
+    else if ( *size )
+        *size |= (u64)~0 << 32;
+    pci_conf_write32(seg, bus, slot, func, idx, bar);
+    *size = -(*size);
+    *addr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((u64)hi << 32);
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+        ++*index;
+
+    return 0;
+}
+
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *info, nodeid_t node)
 {
@@ -652,7 +697,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
             {
                 unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
                 u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
-                u32 hi = 0;
+                uint64_t addr;
 
                 if ( (bar & PCI_BASE_ADDRESS_SPACE) ==
                      PCI_BASE_ADDRESS_SPACE_IO )
@@ -663,38 +708,13 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
                            seg, bus, slot, func, i);
                     continue;
                 }
-                pci_conf_write32(seg, bus, slot, func, idx, ~0);
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    if ( i >= PCI_SRIOV_NUM_BARS )
-                    {
-                        printk(XENLOG_WARNING
-                               "SR-IOV device %04x:%02x:%02x.%u with 64-bit"
-                               " vf BAR in last slot\n",
-                               seg, bus, slot, func);
-                        break;
-                    }
-                    hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
-                }
-                pdev->vf_rlen[i] = pci_conf_read32(seg, bus, slot, func, idx) &
-                                   PCI_BASE_ADDRESS_MEM_MASK;
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    pdev->vf_rlen[i] |= (u64)pci_conf_read32(seg, bus,
-                                                             slot, func,
-                                                             idx + 4) << 32;
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
-                }
-                else if ( pdev->vf_rlen[i] )
-                    pdev->vf_rlen[i] |= (u64)~0 << 32;
-                pci_conf_write32(seg, bus, slot, func, idx, bar);
-                pdev->vf_rlen[i] = -pdev->vf_rlen[i];
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                    ++i;
+                ret = pci_size_bar(seg, bus, slot, func, pos + PCI_SRIOV_BAR,
+                                   PCI_SRIOV_NUM_BARS, &i, &addr,
+                                   &pdev->vf_rlen[i]);
+                if ( ret )
+                    dprintk(XENLOG_WARNING,
+                            "%04x:%02x:%02x.%u: failed to size SR-IOV BAR%u\n",
+                            seg, bus, slot, func, i);
             }
         }
         else
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index a83c4a1276..3d3853fd6f 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -165,6 +165,9 @@ const char *parse_pci(const char *, unsigned int *seg, unsigned int *bus,
                       unsigned int *dev, unsigned int *func);
 const char *parse_pci_seg(const char *, unsigned int *seg, unsigned int *bus,
                           unsigned int *dev, unsigned int *func, bool *def_seg);
+int pci_size_bar(unsigned int seg, unsigned int bus, unsigned int slot,
+                 unsigned int func, unsigned int base, unsigned int max_bars,
+                 unsigned int *index, uint64_t *addr, uint64_t *size);
 
 
 bool_t pcie_aer_get_firmware_first(const struct pci_dev *);
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 5/9] xen/vpci: add handlers to map the BARs
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (3 preceding siblings ...)
  2017-04-27 14:35 ` [PATCH v3 4/9] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2017-04-27 14:35 ` Roger Pau Monne
  2017-05-19 15:21   ` Jan Beulich
  2017-04-27 14:35 ` [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities Roger Pau Monne
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, julien.grall, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

Introduce a set of handlers that trap accesses to the PCI BARs and the command
register, in order to emulate BAR sizing and BAR relocation.

The command handler is used to detect changes to bit 1 (PCI_COMMAND_MEMORY,
response to memory space accesses), and maps/unmaps the BARs of the device into
the guest p2m.
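
As a rough sketch of that logic (not the code in this patch), the handler only
has to act when the memory space enable bit differs between the cached and the
newly written command value. struct mock_dev and cmd_write() are hypothetical
names, and the "mapped" flag stands in for the actual p2m operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CMD_MEMORY 0x2u   /* same value as PCI_COMMAND_MEMORY */

/* Hypothetical per-device state. */
struct mock_dev {
    uint16_t command;   /* cached command register */
    bool mapped;        /* whether the BARs are in the guest p2m */
};

static void cmd_write(struct mock_dev *dev, uint16_t new_cmd)
{
    /* Invariant: BARs are mapped exactly when memory decoding is enabled. */
    assert(dev->mapped == !!(dev->command & CMD_MEMORY));

    /* Only a change of the memory decoding bit triggers a map or unmap. */
    if ( (dev->command ^ new_cmd) & CMD_MEMORY )
        dev->mapped = new_cmd & CMD_MEMORY;

    dev->command = new_cmd;
}
```

Writes that only touch other bits (bus mastering, for instance) leave the
mapping alone, which is why the XOR check comes first.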

The BAR register handlers are used to detect attempts by the guest to size or
relocate the BARs.
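
The sizing emulation can be illustrated with a simplified model of a single
32-bit memory BAR. This is a sketch under assumptions, not the handlers added
below: struct emu_bar and both functions are hypothetical, and unlike the patch
the write path aligns the address to the BAR size, as a real device would.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MEM_MASK 0xfffffff0u   /* address bits of a 32-bit memory BAR */

/* Hypothetical emulated BAR state. */
struct emu_bar {
    uint32_t gaddr;        /* address programmed by the guest */
    uint32_t size;         /* power of two, at least 16 bytes */
    uint32_t attributes;   /* low four read-only bits */
    bool sizing;           /* a sizing cycle is in progress */
};

static void emu_bar_write(struct emu_bar *bar, uint32_t val)
{
    assert(bar->size >= 16 && !(bar->size & (bar->size - 1)));

    if ( val == 0xffffffffu )
    {
        /* All ones: the next read returns the size mask. */
        bar->sizing = true;
        return;
    }

    /* Any other write ends the sizing cycle and relocates the BAR. */
    bar->sizing = false;
    bar->gaddr = val & MEM_MASK & ~(bar->size - 1);
}

static uint32_t emu_bar_read(const struct emu_bar *bar)
{
    uint32_t val = bar->sizing ? ~(bar->size - 1) : bar->gaddr;

    return (val & MEM_MASK) | bar->attributes;
}
```

The hardware register is never touched during a sizing cycle in this model;
the guest only ever sees the cached size and address.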

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v2:
 - Detect unset BARs and allow the hardware domain to position them.
---
 xen/drivers/vpci/Makefile |   2 +-
 xen/drivers/vpci/header.c | 292 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/vpci.h    |  28 +++++
 3 files changed, 321 insertions(+), 1 deletion(-)
 create mode 100644 xen/drivers/vpci/header.c

diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 840a906470..241467212f 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o
+obj-y += vpci.o header.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
new file mode 100644
index 0000000000..a401cf6915
--- /dev/null
+++ b/xen/drivers/vpci/header.c
@@ -0,0 +1,292 @@
+/*
+ * Generic functionality for handling accesses to the PCI header from the
+ * configuration space.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <xen/p2m-common.h>
+
+static int vpci_modify_bars(struct pci_dev *pdev, const bool map)
+{
+    struct vpci_header *header = &pdev->vpci->header;
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        paddr_t gaddr = map ? header->bars[i].gaddr
+                            : header->bars[i].mapped_addr;
+        paddr_t paddr = header->bars[i].paddr;
+
+        if ( header->bars[i].type != VPCI_BAR_MEM &&
+             header->bars[i].type != VPCI_BAR_MEM64_LO )
+            continue;
+
+        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(gaddr)),
+                         _mfn(PFN_DOWN(paddr)), PFN_UP(header->bars[i].size),
+                         map);
+        if ( rc )
+            break;
+
+        header->bars[i].mapped_addr = map ? gaddr : 0;
+    }
+
+    return rc;
+}
+
+static int vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
+                         union vpci_val *val, void *data)
+{
+    struct vpci_header *header = data;
+
+    val->word = header->command;
+
+    return 0;
+}
+
+static int vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
+                          union vpci_val val, void *data)
+{
+    struct vpci_header *header = data;
+    uint16_t new_cmd, saved_cmd;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    int rc;
+
+    new_cmd = val.word;
+    saved_cmd = header->command;
+
+    if ( !((new_cmd ^ saved_cmd) & PCI_COMMAND_MEMORY) )
+        goto out;
+
+    /* Memory space access change. */
+    rc = vpci_modify_bars(pdev, new_cmd & PCI_COMMAND_MEMORY);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u: unable to %smap BARs: %d\n",
+                seg, bus, slot, func,
+                new_cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);
+        return rc;
+    }
+
+ out:
+    pci_conf_write16(seg, bus, slot, func, reg, new_cmd);
+    header->command = pci_conf_read16(seg, bus, slot, func, reg);
+    return 0;
+}
+
+static int vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
+                         union vpci_val *val, void *data)
+{
+    struct vpci_bar *bar = data;
+    bool hi = false;
+
+    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
+           bar->type == VPCI_BAR_MEM64_HI);
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg - PCI_BASE_ADDRESS_0 > 0);
+        bar--;
+        hi = true;
+    }
+
+    if ( bar->sizing )
+        val->double_word = ~(bar->size - 1) >> (hi ? 32 : 0);
+    else
+        val->double_word = bar->gaddr >> (hi ? 32 : 0);
+
+    val->double_word |= hi ? 0 : bar->attributes;
+
+    return 0;
+}
+
+static int vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
+                          union vpci_val val, void *data)
+{
+    struct vpci_bar *bar = data;
+    uint32_t wdata = val.double_word;
+    bool hi = false, unset = false;
+
+    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
+           bar->type == VPCI_BAR_MEM64_HI);
+
+    if ( wdata == GENMASK(31, 0) )
+    {
+        /* Next reads from this register are going to return the BAR size. */
+        bar->sizing = true;
+        return 0;
+    }
+
+    /* End previous sizing cycle if any. */
+    bar->sizing = false;
+
+    unset = bar->unset;
+    if ( unset )
+        bar->unset = false;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg - PCI_BASE_ADDRESS_0 > 0);
+        bar--;
+        hi = true;
+    }
+
+    /* Update the relevant part of the BAR address. */
+    bar->gaddr &= hi ? ~GENMASK(63, 32) : ~GENMASK(31, 0);
+    wdata &= hi ? GENMASK(31, 0) : PCI_BASE_ADDRESS_MEM_MASK;
+    bar->gaddr |= (uint64_t)wdata << (hi ? 32 : 0);
+
+    if ( unset )
+    {
+        bar->paddr = bar->gaddr;
+        pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                         PCI_FUNC(pdev->devfn), reg, wdata);
+    }
+
+    ASSERT(IS_ALIGNED(bar->gaddr, PAGE_SIZE));
+
+    return 0;
+}
+
+static int vpci_init_bars(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint8_t header_type;
+    unsigned int i, num_bars;
+    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_bar *bars = header->bars;
+    int rc;
+
+    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
+    if ( header_type == PCI_HEADER_TYPE_NORMAL )
+        num_bars = 6;
+    else if ( header_type == PCI_HEADER_TYPE_BRIDGE )
+        num_bars = 2;
+    else
+        return -ENOSYS;
+
+    /* Set up a handler for the command register. */
+    header->command = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+    rc = xen_vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write,
+                               PCI_COMMAND, 2, header);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u: failed to add handler register %#x: %d\n",
+                seg, bus, slot, func, PCI_COMMAND, rc);
+        return rc;
+    }
+
+    for ( i = 0; i < num_bars; i++ )
+    {
+        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
+        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
+        uint64_t addr, size;
+        unsigned int index;
+
+        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
+        {
+            bars[i].type = VPCI_BAR_MEM64_HI;
+            bars[i].unset = bars[i - 1].unset;
+            continue;
+        }
+        else if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
+        {
+            bars[i].type = VPCI_BAR_IO;
+            continue;
+        }
+        else if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+                  PCI_BASE_ADDRESS_MEM_TYPE_64 )
+            bars[i].type = VPCI_BAR_MEM64_LO;
+        else
+            bars[i].type = VPCI_BAR_MEM;
+
+        /* Size the BAR and map it. */
+        index = i;
+        rc = pci_size_bar(seg, bus, slot, func, PCI_BASE_ADDRESS_0, num_bars,
+                          &index, &addr, &size);
+        if ( rc )
+        {
+            dprintk(XENLOG_ERR,
+                    "%04x:%02x:%02x.%u: unable to size BAR#%u: %d\n",
+                    seg, bus, slot, func, i, rc);
+            return rc;
+        }
+
+        if ( size == 0 )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            continue;
+        }
+
+        if ( (bars[i].type == VPCI_BAR_MEM && addr == GENMASK(31, 12)) ||
+             addr == GENMASK(63, 26) )
+        {
+            /* BAR is not positioned. */
+            bars[i].unset = true;
+            ASSERT(is_hardware_domain(pdev->domain));
+            ASSERT(!(header->command & PCI_COMMAND_MEMORY));
+        }
+
+        ASSERT(IS_ALIGNED(addr, PAGE_SIZE));
+
+        /* Initial guest address is the hardware one. */
+        bars[i].gaddr = bars[i].paddr = addr;
+        bars[i].size = size;
+        bars[i].attributes = val & ~PCI_BASE_ADDRESS_MEM_MASK;
+
+        rc = xen_vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg,
+                                   4, &bars[i]);
+        if ( rc )
+        {
+            dprintk(XENLOG_ERR,
+                    "%04x:%02x:%02x.%u: failed to add handler for BAR#%u: %d\n",
+                    seg, bus, slot, func, i, rc);
+            return rc;
+        }
+    }
+
+    if ( header->command & PCI_COMMAND_MEMORY )
+    {
+        rc = vpci_modify_bars(pdev, true);
+        if ( rc )
+        {
+            dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: unable to map BARs: %d\n",
+                    seg, bus, slot, func, rc);
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
+REGISTER_VPCI_INIT(vpci_init_bars);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 56e8d1c35e..235b4ebd1f 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -50,6 +50,34 @@ int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
 struct vpci {
     /* Root pointer for the tree of vPCI handlers. */
     struct rb_root handlers;
+
+#ifdef __XEN__
+    /* Hide the rest of the vpci struct from the user-space test harness. */
+    struct vpci_header {
+        /* Cached value of the command register. */
+        uint16_t command;
+        /* Information about the PCI BARs of this device. */
+        struct vpci_bar {
+            enum {
+                VPCI_BAR_EMPTY,
+                VPCI_BAR_IO,
+                VPCI_BAR_MEM,
+                VPCI_BAR_MEM64_LO,
+                VPCI_BAR_MEM64_HI,
+            } type;
+            /* Hardware address. */
+            paddr_t paddr;
+            /* Guest address where the BAR should be mapped. */
+            paddr_t gaddr;
+            /* Current guest address where the BAR is mapped. */
+            paddr_t mapped_addr;
+            size_t size;
+            unsigned int attributes:4;
+            bool sizing;
+            bool unset;
+        } bars[6];
+    } header;
+#endif
 };
 
 #endif
-- 
2.11.0 (Apple Git-81)


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (4 preceding siblings ...)
  2017-04-27 14:35 ` [PATCH v3 5/9] xen/vpci: add handlers to map the BARs Roger Pau Monne
@ 2017-04-27 14:35 ` Roger Pau Monne
  2017-05-23 12:49   ` Jan Beulich
  2017-05-29 13:32   ` Jan Beulich
  2017-04-27 14:35 ` [PATCH v3 7/9] vpci: add a priority field to the vPCI register initializer Roger Pau Monne
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Add traps to the PCI_CAP_LIST_NEXT field of each capability in order to mask
capabilities on request.

All capabilities from the device are fetched and stored in an internal list,
which is later used to return the next capability to the guest. Note that this
only removes the capability from the linked list as seen by the guest; the
actual capability structure can still be accessed by the guest, provided that
its position can be found using another mechanism. Finally, the MSI and MSI-X
capabilities are masked until Xen knows how to properly handle accesses to
them.
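
The effect on the guest-visible list can be shown with a small sketch. It uses
a plain array instead of the list_head based list in the patch, and struct
cap_entry and next_visible_cap() are hypothetical names chosen for this
example. Entry 0 models the root capability pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one capability found on the device. */
struct cap_entry {
    uint8_t offset;   /* config space offset of the capability */
    bool masked;      /* hidden from the guest */
};

/*
 * Value the guest reads from the "next" pointer belonging to caps[idx]:
 * the offset of the first later entry that is not masked, or 0 when no
 * visible capability follows.
 */
static uint8_t next_visible_cap(const struct cap_entry *caps, size_t n,
                                size_t idx)
{
    size_t i;

    assert(idx < n);

    for ( i = idx + 1; i < n; i++ )
        if ( !caps[i].masked )
            return caps[i].offset;

    return 0;
}
```

A masked entry is skipped when following the chain, but nothing prevents a
guest from reading its offset directly, which matches the caveat above.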

This should allow a PVH Dom0 to boot on some hardware, provided that the
hardware doesn't require MSI/MSI-X and that there are no SR-IOV devices in the
system, so the panic at the end of the PVH Dom0 build is replaced by a
warning.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
 - Add missing newline between cmd handlers.
 - Switch the handler to use list_for_each_entry_continue instead of a wrong
   open-coded version of it.
---
 xen/arch/x86/hvm/dom0_build.c   |   2 +-
 xen/drivers/vpci/Makefile       |   2 +-
 xen/drivers/vpci/capabilities.c | 159 ++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/vpci.h          |   3 +
 4 files changed, 164 insertions(+), 2 deletions(-)
 create mode 100644 xen/drivers/vpci/capabilities.c

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 9efe4e571b..7cf794447f 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -1097,7 +1097,7 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
         return rc;
     }
 
-    panic("Building a PVHv2 Dom0 is not yet supported.");
+    printk("WARNING: PVH is an experimental mode with limited functionality\n");
     return 0;
 }
 
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 241467212f..c3f3085c93 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o header.o
+obj-y += vpci.o header.o capabilities.o
diff --git a/xen/drivers/vpci/capabilities.c b/xen/drivers/vpci/capabilities.c
new file mode 100644
index 0000000000..b2a3326aa7
--- /dev/null
+++ b/xen/drivers/vpci/capabilities.c
@@ -0,0 +1,159 @@
+/*
+ * Generic functionality for handling accesses to the PCI capabilities from
+ * the configuration space.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+
+struct vpci_capability {
+    struct list_head next;
+    uint8_t offset;
+    bool masked;
+};
+
+static int vpci_cap_read(struct pci_dev *pdev, unsigned int reg,
+                         union vpci_val *val, void *data)
+{
+    struct vpci_capability *cap = data;
+
+    val->half_word = 0;
+
+    /* Return the position of the next non-masked capability. */
+    list_for_each_entry_continue ( cap, &pdev->vpci->cap_list, next )
+    {
+        if ( !cap->masked )
+        {
+            val->half_word = cap->offset;
+            break;
+        }
+    }
+
+    return 0;
+}
+
+static int vpci_cap_write(struct pci_dev *pdev, unsigned int reg,
+                          union vpci_val val, void *data)
+{
+    /* Ignored. */
+    return 0;
+}
+
+static int vpci_index_capabilities(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint8_t pos = PCI_CAPABILITY_LIST;
+    uint16_t status;
+    unsigned int max_cap = 48;
+    struct vpci_capability *cap;
+    int rc;
+
+    INIT_LIST_HEAD(&pdev->vpci->cap_list);
+
+    /* Check if device has capabilities. */
+    status = pci_conf_read16(seg, bus, slot, func, PCI_STATUS);
+    if ( !(status & PCI_STATUS_CAP_LIST) )
+        return 0;
+
+    /* Add the root capability pointer. */
+    cap = xzalloc(struct vpci_capability);
+    if ( !cap )
+        return -ENOMEM;
+
+    cap->offset = pos;
+    list_add_tail(&cap->next, &pdev->vpci->cap_list);
+    rc = xen_vpci_add_register(pdev, vpci_cap_read, vpci_cap_write, pos,
+                               1, cap);
+    if ( rc )
+        return rc;
+
+    /*
+     * Iterate over the list of capabilities present in the device, and
+     * add a handler for each register pointer to the next item
+     * (PCI_CAP_LIST_NEXT).
+     */
+    while ( max_cap-- )
+    {
+        pos = pci_conf_read8(seg, bus, slot, func, pos);
+        if ( pos < 0x40 )
+            break;
+
+        cap = xzalloc(struct vpci_capability);
+        if ( !cap )
+            return -ENOMEM;
+
+        cap->offset = pos;
+        list_add_tail(&cap->next, &pdev->vpci->cap_list);
+        pos += PCI_CAP_LIST_NEXT;
+        rc = xen_vpci_add_register(pdev, vpci_cap_read, vpci_cap_write, pos,
+                                   1, cap);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
+static void vpci_mask_capability(struct pci_dev *pdev, uint8_t cap_id)
+{
+    struct vpci_capability *cap;
+    uint8_t cap_offset;
+
+    cap_offset = pci_find_cap_offset(pdev->seg, pdev->bus,
+                                     PCI_SLOT(pdev->devfn),
+                                     PCI_FUNC(pdev->devfn), cap_id);
+    if ( !cap_offset )
+        return;
+
+    list_for_each_entry ( cap, &pdev->vpci->cap_list, next )
+    {
+        if ( cap->offset == cap_offset )
+        {
+            cap->masked = true;
+            break;
+        }
+    }
+}
+
+static int vpci_capabilities_init(struct pci_dev *pdev)
+{
+    int rc;
+
+    rc = vpci_index_capabilities(pdev);
+    if ( rc )
+        return rc;
+
+    /* Mask MSI and MSI-X capabilities until Xen handles them. */
+    vpci_mask_capability(pdev, PCI_CAP_ID_MSI);
+    vpci_mask_capability(pdev, PCI_CAP_ID_MSIX);
+
+    return 0;
+}
+
+REGISTER_VPCI_INIT(vpci_capabilities_init);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 235b4ebd1f..d41277f39b 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -77,6 +77,9 @@ struct vpci {
             bool unset;
         } bars[6];
     } header;
+
+    /* List of capabilities supported by the device. */
+    struct list_head cap_list;
 #endif
 };
 
-- 
2.11.0 (Apple Git-81)


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 7/9] vpci: add a priority field to the vPCI register initializer
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (5 preceding siblings ...)
  2017-04-27 14:35 ` [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities Roger Pau Monne
@ 2017-04-27 14:35 ` Roger Pau Monne
  2017-05-23 12:52   ` Jan Beulich
  2017-04-27 14:35 ` [PATCH v3 8/9] vpci/msi: add MSI handlers Roger Pau Monne
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Add a priority field to the vPCI register initializers, and mark the capability
and header initializers as high priority, so that they are run first.

This is needed for MSI-X, since MSI-X needs to know the position of the BARs in
order to perform its initialization, and in order to mask or enable the
MSI/MSI-X functionality on demand.
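
The ordering this gives can be sketched as two passes over the initializer
table. The snippet is an illustration only: struct init_entry, run_inits() and
the three example initializers are hypothetical, and the patch itself uses a
goto based loop over the linker-generated array instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counterpart of struct vpci_register_init. */
struct init_entry {
    int (*init)(void *ctx);
    bool priority;
};

/* Run all priority initializers first, then the remaining ones. */
static int run_inits(const struct init_entry *tbl, size_t n, void *ctx)
{
    int pass;
    size_t i;

    assert(tbl || !n);

    for ( pass = 0; pass < 2; pass++ )
        for ( i = 0; i < n; i++ )
        {
            int rc;

            if ( tbl[i].priority != (pass == 0) )
                continue;

            rc = tbl[i].init(ctx);
            if ( rc )
                return rc;   /* stop at the first failure */
        }

    return 0;
}

/* Example initializers that record the order in which they run. */
struct order_log {
    char buf[8];
    size_t len;
};

static int log_char(void *ctx, char c)
{
    struct order_log *log = ctx;

    if ( log->len >= sizeof(log->buf) )
        return -1;
    log->buf[log->len++] = c;

    return 0;
}

static int init_msix(void *ctx)   { return log_char(ctx, 'M'); }
static int init_header(void *ctx) { return log_char(ctx, 'H'); }
static int init_caps(void *ctx)   { return log_char(ctx, 'C'); }
```

With a table listing init_msix (low priority) ahead of init_header and
init_caps (both high priority), the recorded order is H, C, M: table order is
kept within each pass, and the priority entries always come first.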

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/tests/vpci/Makefile       |  4 ++--
 xen/drivers/vpci/capabilities.c |  2 +-
 xen/drivers/vpci/header.c       |  2 +-
 xen/drivers/vpci/vpci.c         | 14 ++++++++++++--
 xen/include/xen/vpci.h          | 13 +++++++++++--
 5 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/tools/tests/vpci/Makefile b/tools/tests/vpci/Makefile
index 7969fcbd82..e5edc4f512 100644
--- a/tools/tests/vpci/Makefile
+++ b/tools/tests/vpci/Makefile
@@ -31,8 +31,8 @@ vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
 	# Trick the compiler so it doesn't complain about missing symbols
 	sed -e '/#include/d' \
 	    -e '1s;^;#include "emul.h"\
-	             const vpci_register_init_t __start_vpci_array[1]\;\
-	             const vpci_register_init_t __end_vpci_array[1]\;\
+	             const struct vpci_register_init __start_vpci_array[1]\;\
+	             const struct vpci_register_init __end_vpci_array[1]\;\
 	             ;' <$< >$@
 
 rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h
diff --git a/xen/drivers/vpci/capabilities.c b/xen/drivers/vpci/capabilities.c
index b2a3326aa7..204355e673 100644
--- a/xen/drivers/vpci/capabilities.c
+++ b/xen/drivers/vpci/capabilities.c
@@ -145,7 +145,7 @@ static int vpci_capabilities_init(struct pci_dev *pdev)
     return 0;
 }
 
-REGISTER_VPCI_INIT(vpci_capabilities_init);
+REGISTER_VPCI_INIT(vpci_capabilities_init, true);
 
 /*
  * Local variables:
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index a401cf6915..3deec53efd 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -278,7 +278,7 @@ static int vpci_init_bars(struct pci_dev *pdev)
     return 0;
 }
 
-REGISTER_VPCI_INIT(vpci_init_bars);
+REGISTER_VPCI_INIT(vpci_init_bars, true);
 
 /*
  * Local variables:
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index b159f0db80..e6154b742e 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -20,7 +20,7 @@
 #include <xen/sched.h>
 #include <xen/vpci.h>
 
-extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
+extern const struct vpci_register_init __start_vpci_array[], __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 #define vpci_init __start_vpci_array
 
@@ -37,6 +37,7 @@ struct vpci_register {
 int xen_vpci_add_handlers(struct pci_dev *pdev)
 {
     int i, rc = 0;
+    bool priority = true;
 
     if ( !has_vpci(pdev->domain) )
         return 0;
@@ -47,9 +48,13 @@ int xen_vpci_add_handlers(struct pci_dev *pdev)
 
     pdev->vpci->handlers = RB_ROOT;
 
+ again:
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
-        rc = vpci_init[i](pdev);
+        if ( priority != vpci_init[i].priority )
+            continue;
+
+        rc = vpci_init[i].init(pdev);
         if ( rc )
             break;
     }
@@ -69,6 +74,11 @@ int xen_vpci_add_handlers(struct pci_dev *pdev)
         }
         xfree(pdev->vpci);
     }
+    else if ( priority )
+    {
+        priority = false;
+        goto again;
+    }
 
     return rc;
 }
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index d41277f39b..2bf61d6c15 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -29,8 +29,17 @@ typedef int (*vpci_write_t)(struct pci_dev *pdev, unsigned int reg,
 
 typedef int (*vpci_register_init_t)(struct pci_dev *dev);
 
-#define REGISTER_VPCI_INIT(x) \
-  static const vpci_register_init_t x##_entry __used_section(".data.vpci") = x
+struct vpci_register_init {
+    vpci_register_init_t init;
+    bool priority;
+};
+
+#define REGISTER_VPCI_INIT(f, p)                                        \
+  static const struct vpci_register_init                                \
+                      f##_entry __used_section(".data.vpci") = {        \
+    .init = f,                                                          \
+    .priority = p,                                                      \
+}
 
 /* Add vPCI handlers to device. */
 int xen_vpci_add_handlers(struct pci_dev *dev);
-- 
2.11.0 (Apple Git-81)


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 8/9] vpci/msi: add MSI handlers
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (6 preceding siblings ...)
  2017-04-27 14:35 ` [PATCH v3 7/9] vpci: add a priority field to the vPCI register initializer Roger Pau Monne
@ 2017-04-27 14:35 ` Roger Pau Monne
  2017-05-26 15:26   ` Jan Beulich
  2017-04-27 14:35 ` [PATCH v3 9/9] vpci/msix: add MSI-X handlers Roger Pau Monne
  2017-05-29 13:38 ` [PATCH v3 0/9] vpci: PCI config space emulation Jan Beulich
  9 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Paul Durrant, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

Add handlers for the MSI control, address, data and mask fields in order to
detect accesses to them and set up the interrupts as requested by the guest.

Note that the pending register is not trapped, and the guest can freely
read/write to it.

Whether Xen provides this functionality (MSI emulation) to Dom0 is controlled
by the "msi" sub-option of the dom0 command line option. When this sub-option
is disabled, Xen hides the MSI capability structure from Dom0.
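
To make the intended semantics of the sub-option concrete, here is a
stand-alone sketch of parsing a comma separated dom0= list with an optional
"no-" prefix. It is not the parser used by Xen: struct dom0_opts and
parse_dom0() are hypothetical, and only the three sub-options named in the
documentation hunk below are recognised.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result of parsing the dom0= list. */
struct dom0_opts {
    bool pvh, shadow, msi;
};

/* Returns 0 on success, -1 on an unknown or empty sub-option. */
static int parse_dom0(const char *s, struct dom0_opts *opts)
{
    assert(s && opts);

    while ( *s )
    {
        size_t len = strcspn(s, ",");   /* length of this sub-option */
        const char *name = s;
        size_t nlen = len;
        bool val = true;

        /* A "no-" prefix turns the sub-option off. */
        if ( nlen > 3 && !strncmp(name, "no-", 3) )
        {
            val = false;
            name += 3;
            nlen -= 3;
        }

        if ( nlen == 3 && !strncmp(name, "pvh", 3) )
            opts->pvh = val;
        else if ( nlen == 6 && !strncmp(name, "shadow", 6) )
            opts->shadow = val;
        else if ( nlen == 3 && !strncmp(name, "msi", 3) )
            opts->msi = val;
        else
            return -1;

        s += len;
        if ( *s == ',' )
            s++;
    }

    return 0;
}
```

Starting from the documented default of msi being enabled, parsing
"pvh,no-msi" leaves pvh set and msi cleared, which is the case where the MSI
capability would be hidden from Dom0.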

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v2:
 - Add an arch-specific abstraction layer. Note that this is only implemented
   for x86 currently.
 - Add a wrapper to detect MSI enabling for vPCI.

NB: I've only been able to test this with devices using a single MSI interrupt
and no mask register. I will try to find hardware that supports the mask
register and more than one vector, but I cannot make any promises.

If there are doubts about the untested parts we could always force Xen to
report no per-vector masking support and only 1 available vector, but I would
rather avoid doing it.
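
To make that fallback concrete, a rough sketch of what it could look like
follows. This is only an assumption about one possible approach, not something
this series implements: the helper names are made up, and the bit positions
are those of the MSI Message Control register.

```c
#include <assert.h>
#include <stdint.h>

#define MSI_CTRL_QMASK   0x000eu  /* Multiple Message Capable, bits 3:1 */
#define MSI_CTRL_MASKBIT 0x0100u  /* per-vector masking capable */

/* Number of vectors advertised by the device: 2^n, n taken from bits 3:1. */
static unsigned int msi_max_vectors(uint16_t control)
{
    unsigned int log2 = (control & MSI_CTRL_QMASK) >> 1;

    assert(log2 <= 5);  /* encodings 6 and 7 are reserved */

    return 1u << log2;
}

/* Control word as the guest would see it: 1 vector, no per-vector masking. */
static uint16_t msi_restrict_control(uint16_t control)
{
    return (uint16_t)(control & ~(MSI_CTRL_QMASK | MSI_CTRL_MASKBIT));
}
```

For example, a device reporting 0x018a (32 vectors, masking, 64-bit) would be
presented to the guest as 0x0080 (1 vector, no masking, 64-bit).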
---
 docs/misc/xen-command-line.markdown |   9 +-
 xen/arch/x86/dom0_build.c           |  12 +-
 xen/arch/x86/hvm/vmsi.c             | 141 +++++++++++++
 xen/drivers/vpci/Makefile           |   2 +-
 xen/drivers/vpci/capabilities.c     |   7 +-
 xen/drivers/vpci/msi.c              | 392 ++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/io.h        |  19 ++
 xen/include/xen/hvm/irq.h           |   1 +
 xen/include/xen/vpci.h              |  26 +++
 9 files changed, 601 insertions(+), 8 deletions(-)
 create mode 100644 xen/drivers/vpci/msi.c

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 44d99852aa..73013aea14 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -660,7 +660,7 @@ affinities to prefer but be not limited to the specified node(s).
 Pin dom0 vcpus to their respective pcpus
 
 ### dom0
-> `= List of [ pvh | shadow ]`
+> `= List of [ pvh | shadow | msi ]`
 
 > Sub-options:
 
@@ -677,6 +677,13 @@ Flag that makes a dom0 boot in PVHv2 mode.
 Flag that makes a dom0 use shadow paging. Only works when "pvh" is
 enabled.
 
+> `msi`
+
+> Default: `true`
+
+Enable or disable (using the `no-` prefix) the MSI emulation inside of
+Xen for a PVH Dom0. Note that this option has no effect on a PV Dom0.
+
 ### dtuart (ARM)
 > `= path [:options]`
 
diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
index cc8acad688..01afcf6215 100644
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -176,29 +176,37 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
 bool __initdata opt_dom0_shadow;
 #endif
 bool __initdata dom0_pvh;
+bool __initdata dom0_msi = true;
 
 /*
  * List of parameters that affect Dom0 creation:
  *
  *  - pvh               Create a PVHv2 Dom0.
  *  - shadow            Use shadow paging for Dom0.
+ *  - msi               MSI functionality.
  */
 static void __init parse_dom0_param(char *s)
 {
     char *ss;
+    bool enabled;
 
     do {
+        enabled = !!strncmp(s, "no-", 3);
+        if ( !enabled )
+            s += 3;
 
         ss = strchr(s, ',');
         if ( ss )
             *ss = '\0';
 
         if ( !strcmp(s, "pvh") )
-            dom0_pvh = true;
+            dom0_pvh = enabled;
 #ifdef CONFIG_SHADOW_PAGING
         else if ( !strcmp(s, "shadow") )
-            opt_dom0_shadow = true;
+            opt_dom0_shadow = enabled;
 #endif
+        else if ( !strcmp(s, "msi") )
+            dom0_msi = enabled;
 
         s = ss + 1;
     } while ( ss );
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index a36692c313..f23b2f43d6 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -622,3 +622,144 @@ void msix_write_completion(struct vcpu *v)
     if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
         gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
 }
+
+static unsigned int msi_vector(uint16_t data)
+{
+    return (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;
+}
+
+static unsigned int msi_flags(uint16_t data, uint64_t addr)
+{
+    unsigned int rh, dm, dest_id, deliv_mode, trig_mode;
+
+    rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
+    dm = (addr >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
+    dest_id = (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
+    deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
+    trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+
+    return dest_id | (rh << GFLAGS_SHIFT_RH) | (dm << GFLAGS_SHIFT_DM) |
+           (deliv_mode << GFLAGS_SHIFT_DELIV_MODE) |
+           (trig_mode << GFLAGS_SHIFT_TRG_MODE);
+}
+
+void vpci_msi_mask(struct vpci_arch_msi *arch, unsigned int entry, bool mask)
+{
+    struct pirq *pinfo;
+    struct irq_desc *desc;
+    unsigned long flags;
+    int irq;
+
+    ASSERT(arch->pirq != -1);
+    pinfo = pirq_info(current->domain, arch->pirq + entry);
+    ASSERT(pinfo);
+
+    irq = pinfo->arch.irq;
+    ASSERT(irq < nr_irqs);
+
+    desc = irq_to_desc(irq);
+    ASSERT(desc);
+
+    spin_lock_irqsave(&desc->lock, flags);
+    guest_mask_msi_irq(desc, mask);
+    spin_unlock_irqrestore(&desc->lock, flags);
+}
+
+int vpci_msi_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                    uint64_t address, uint32_t data, unsigned int vectors)
+{
+    struct msi_info msi_info = {
+        .seg = pdev->seg,
+        .bus = pdev->bus,
+        .devfn = pdev->devfn,
+        .entry_nr = vectors,
+    };
+    int index = -1, rc;
+    unsigned int i;
+
+    ASSERT(arch->pirq == -1);
+
+    /* Get a PIRQ. */
+    rc = allocate_and_map_msi_pirq(pdev->domain, &index, &arch->pirq,
+                                   &msi_info);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
+                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                PCI_FUNC(pdev->devfn), rc);
+        return rc;
+    }
+
+    for ( i = 0; i < vectors; i++ )
+    {
+        xen_domctl_bind_pt_irq_t bind = {
+            .hvm_domid = DOMID_SELF,
+            .machine_irq = arch->pirq + i,
+            .irq_type = PT_IRQ_TYPE_MSI,
+            .u.msi.gvec = msi_vector(data) + i,
+            .u.msi.gflags = msi_flags(data, address),
+        };
+
+        pcidevs_lock();
+        rc = pt_irq_create_bind(pdev->domain, &bind);
+        if ( rc )
+        {
+            dprintk(XENLOG_ERR,
+                    "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
+                    pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                    PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
+            spin_lock(&pdev->domain->event_lock);
+            unmap_domain_pirq(pdev->domain, arch->pirq);
+            spin_unlock(&pdev->domain->event_lock);
+            pcidevs_unlock();
+            arch->pirq = -1;
+            return rc;
+        }
+        pcidevs_unlock();
+    }
+
+    return 0;
+}
+
+int vpci_msi_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                     unsigned int vectors)
+{
+    unsigned int i;
+
+    ASSERT(arch->pirq != -1);
+
+    for ( i = 0; i < vectors; i++ )
+    {
+        xen_domctl_bind_pt_irq_t bind = {
+            .hvm_domid = DOMID_SELF,
+            .machine_irq = arch->pirq + i,
+            .irq_type = PT_IRQ_TYPE_MSI,
+        };
+
+        pcidevs_lock();
+        pt_irq_destroy_bind(pdev->domain, &bind);
+        pcidevs_unlock();
+    }
+
+    pcidevs_lock();
+    spin_lock(&pdev->domain->event_lock);
+    unmap_domain_pirq(pdev->domain, arch->pirq);
+    spin_unlock(&pdev->domain->event_lock);
+    pcidevs_unlock();
+
+    arch->pirq = -1;
+
+    return 0;
+}
+
+int vpci_msi_arch_init(struct vpci_arch_msi *arch)
+{
+    arch->pirq = -1;
+    return 0;
+}
+
+void vpci_msi_arch_print(struct vpci_arch_msi *arch)
+{
+    printk("PIRQ: %d\n", arch->pirq);
+}
+
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index c3f3085c93..ef4fc6caf3 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o header.o capabilities.o
+obj-y += vpci.o header.o capabilities.o msi.o
diff --git a/xen/drivers/vpci/capabilities.c b/xen/drivers/vpci/capabilities.c
index 204355e673..ad9f45c2e1 100644
--- a/xen/drivers/vpci/capabilities.c
+++ b/xen/drivers/vpci/capabilities.c
@@ -109,7 +109,7 @@ static int vpci_index_capabilities(struct pci_dev *pdev)
     return 0;
 }
 
-static void vpci_mask_capability(struct pci_dev *pdev, uint8_t cap_id)
+void xen_vpci_mask_capability(struct pci_dev *pdev, uint8_t cap_id)
 {
     struct vpci_capability *cap;
     uint8_t cap_offset;
@@ -138,9 +138,8 @@ static int vpci_capabilities_init(struct pci_dev *pdev)
     if ( rc )
         return rc;
 
-    /* Mask MSI and MSI-X capabilities until Xen handles them. */
-    vpci_mask_capability(pdev, PCI_CAP_ID_MSI);
-    vpci_mask_capability(pdev, PCI_CAP_ID_MSIX);
+    /* Mask MSI-X capability until Xen handles it. */
+    xen_vpci_mask_capability(pdev, PCI_CAP_ID_MSIX);
 
     return 0;
 }
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
new file mode 100644
index 0000000000..318dc0e626
--- /dev/null
+++ b/xen/drivers/vpci/msi.c
@@ -0,0 +1,392 @@
+/*
+ * Handlers for accesses to the MSI capability structure.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <asm/msi.h>
+#include <xen/keyhandler.h>
+
+/* Handlers for the MSI control field (PCI_MSI_FLAGS). */
+static int vpci_msi_control_read(struct pci_dev *pdev, unsigned int reg,
+                                 union vpci_val *val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    if ( msi->enabled )
+        val->word |= PCI_MSI_FLAGS_ENABLE;
+    if ( msi->masking )
+        val->word |= PCI_MSI_FLAGS_MASKBIT;
+    if ( msi->address64 )
+        val->word |= PCI_MSI_FLAGS_64BIT;
+
+    /* Set multiple message capable. */
+    val->word |= ((fls(msi->max_vectors) - 1) << 1) & PCI_MSI_FLAGS_QMASK;
+
+    /* Set current number of configured vectors. */
+    val->word |= ((fls(msi->guest_vectors) - 1) << 4) & PCI_MSI_FLAGS_QSIZE;
+
+    return 0;
+}
+
+static int vpci_msi_control_write(struct pci_dev *pdev, unsigned int reg,
+                                  union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+    unsigned int vectors = 1 << ((val.word & PCI_MSI_FLAGS_QSIZE) >> 4);
+    int rc;
+
+    if ( vectors > msi->max_vectors )
+        return -EINVAL;
+
+    msi->guest_vectors = vectors;
+
+    if ( !!(val.word & PCI_MSI_FLAGS_ENABLE) == msi->enabled )
+        return 0;
+
+    if ( val.word & PCI_MSI_FLAGS_ENABLE )
+    {
+        ASSERT(!msi->enabled && !msi->vectors);
+
+        rc = vpci_msi_enable(&msi->arch, pdev, msi->address, msi->data,
+                             vectors);
+        if ( rc )
+            return rc;
+
+        /* Apply the mask bits. */
+        if ( msi->masking )
+        {
+            uint32_t mask = msi->mask;
+
+            while ( mask )
+            {
+                unsigned int i = ffs(mask) - 1;
+
+                vpci_msi_mask(&msi->arch, i, true);
+                __clear_bit(i, &mask);
+            }
+        }
+
+        __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                         PCI_FUNC(pdev->devfn), reg - PCI_MSI_FLAGS, 1);
+
+        msi->vectors = vectors;
+        msi->enabled = true;
+    }
+    else
+    {
+        ASSERT(msi->enabled && msi->vectors);
+
+        __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                         PCI_FUNC(pdev->devfn), reg - PCI_MSI_FLAGS, 0);
+
+
+        rc = vpci_msi_disable(&msi->arch, pdev, msi->vectors);
+        if ( rc )
+            return rc;
+
+        msi->vectors = 0;
+        msi->enabled = false;
+    }
+
+    return rc;
+}
+
+/* Handlers for the address field (32bit or low part of a 64bit address). */
+static int vpci_msi_address_read(struct pci_dev *pdev, unsigned int reg,
+                                 union vpci_val *val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    val->double_word = msi->address;
+
+    return 0;
+}
+
+static int vpci_msi_address_write(struct pci_dev *pdev, unsigned int reg,
+                                  union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    /* Clear low part. */
+    msi->address &= ~GENMASK(31, 0);
+    msi->address |= val.double_word;
+
+    return 0;
+}
+
+/* Handlers for the high part of a 64bit address field. */
+static int vpci_msi_address_upper_read(struct pci_dev *pdev, unsigned int reg,
+                                       union vpci_val *val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    val->double_word = msi->address >> 32;
+
+    return 0;
+}
+
+static int vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned int reg,
+                                        union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    /* Clear high part. */
+    msi->address &= ~GENMASK(63, 32);
+    msi->address |= (uint64_t)val.double_word << 32;
+
+    return 0;
+}
+
+/* Handlers for the data field. */
+static int vpci_msi_data_read(struct pci_dev *pdev, unsigned int reg,
+                              union vpci_val *val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    val->word = msi->data;
+
+    return 0;
+}
+
+static int vpci_msi_data_write(struct pci_dev *pdev, unsigned int reg,
+                               union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    msi->data = val.word;
+
+    return 0;
+}
+
+static int vpci_msi_mask_read(struct pci_dev *pdev, unsigned int reg,
+                              union vpci_val *val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    val->double_word = msi->mask;
+
+    return 0;
+}
+
+static int vpci_msi_mask_write(struct pci_dev *pdev, unsigned int reg,
+                               union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+    uint32_t dmask;
+
+    dmask = msi->mask ^ val.double_word;
+
+    if ( !dmask )
+        return 0;
+
+    while ( dmask && msi->enabled )
+    {
+        unsigned int i = ffs(dmask) - 1;
+
+        vpci_msi_mask(&msi->arch, i, !test_bit(i, &msi->mask));
+        __clear_bit(i, &dmask);
+    }
+
+    msi->mask = val.double_word;
+    return 0;
+}
+
+static int vpci_init_msi(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    struct vpci_msi *msi = NULL;
+    unsigned int msi_offset;
+    uint16_t control;
+    int rc;
+
+    msi_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
+    if ( !msi_offset )
+        return 0;
+
+    if ( !vpci_msi_enabled(pdev->domain) )
+    {
+        xen_vpci_mask_capability(pdev, PCI_CAP_ID_MSI);
+        return 0;
+    }
+
+    msi = xzalloc(struct vpci_msi);
+    if ( !msi )
+        return -ENOMEM;
+
+    control = pci_conf_read16(seg, bus, slot, func,
+                              msi_control_reg(msi_offset));
+
+    rc = xen_vpci_add_register(pdev, vpci_msi_control_read,
+                               vpci_msi_control_write,
+                               msi_control_reg(msi_offset), 2, msi);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u: failed to add handler for MSI control: %d\n",
+                seg, bus, slot, func, rc);
+        goto error;
+    }
+
+    /* Get the maximum number of vectors the device supports. */
+    msi->max_vectors = multi_msi_capable(control);
+    ASSERT(msi->max_vectors <= 32);
+
+    /* Initial value after reset. */
+    msi->guest_vectors = 1;
+
+    /* No PIRQ bind yet. */
+    vpci_msi_arch_init(&msi->arch);
+
+    if ( is_64bit_address(control) )
+        msi->address64 = true;
+    if ( is_mask_bit_support(control) )
+        msi->masking = true;
+
+    rc = xen_vpci_add_register(pdev, vpci_msi_address_read,
+                               vpci_msi_address_write,
+                               msi_lower_address_reg(msi_offset), 4, msi);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u: failed to add handler for MSI address: %d\n",
+                seg, bus, slot, func, rc);
+        goto error;
+    }
+
+    rc = xen_vpci_add_register(pdev, vpci_msi_data_read, vpci_msi_data_write,
+                               msi_data_reg(msi_offset, msi->address64), 2,
+                               msi);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u: failed to add handler for MSI data: %d\n",
+                seg, bus, slot, func, rc);
+        goto error;
+    }
+
+    if ( msi->address64 )
+    {
+        rc = xen_vpci_add_register(pdev, vpci_msi_address_upper_read,
+                                   vpci_msi_address_upper_write,
+                                   msi_upper_address_reg(msi_offset), 4, msi);
+        if ( rc )
+        {
+            dprintk(XENLOG_ERR,
+                    "%04x:%02x:%02x.%u: failed to add handler for MSI address: %d\n",
+                    seg, bus, slot, func, rc);
+            goto error;
+        }
+    }
+
+    if ( msi->masking )
+    {
+        rc = xen_vpci_add_register(pdev, vpci_msi_mask_read,
+                                   vpci_msi_mask_write,
+                                   msi_mask_bits_reg(msi_offset,
+                                                     msi->address64), 4, msi);
+        if ( rc )
+        {
+            dprintk(XENLOG_ERR,
+                    "%04x:%02x:%02x.%u: failed to add handler for MSI mask: %d\n",
+                    seg, bus, slot, func, rc);
+            goto error;
+        }
+    }
+
+    pdev->vpci->msi = msi;
+
+    return 0;
+
+ error:
+    ASSERT(rc);
+    xfree(msi);
+    return rc;
+}
+
+REGISTER_VPCI_INIT(vpci_init_msi, false);
+
+static void vpci_dump_msi(unsigned char key)
+{
+    struct domain *d;
+    struct pci_dev *pdev;
+
+    printk("Guest MSI information:\n");
+
+    for_each_domain ( d )
+    {
+        if ( !has_vpci(d) )
+            continue;
+
+        vpci_lock(d);
+        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list)
+        {
+            uint8_t seg = pdev->seg, bus = pdev->bus;
+            uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+            struct vpci_msi *msi = pdev->vpci->msi;
+            uint16_t data;
+            uint64_t addr;
+
+            if ( !msi )
+                continue;
+
+            printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
+
+            printk("Enabled: %u Supports masking: %u 64-bit addresses: %u\n",
+                   msi->enabled, msi->masking, msi->address64);
+            printk("Max vectors: %u guest vectors: %u enabled vectors: %u\n",
+                   msi->max_vectors, msi->guest_vectors, msi->vectors);
+
+            vpci_msi_arch_print(&msi->arch);
+
+            data = msi->data;
+            addr = msi->address;
+            printk("vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu\n",
+                   (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT,
+                   data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
+                   data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
+                   data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
+                   addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
+                   addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "cpu",
+                   (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT);
+
+            if ( msi->masking )
+                printk("mask=%#032x\n", msi->mask);
+            printk("\n");
+        }
+        vpci_unlock(d);
+    }
+}
+
+static int __init vpci_msi_setup_keyhandler(void)
+{
+    register_keyhandler('Z', vpci_dump_msi, "dump guest MSI state", 1);
+    return 0;
+}
+__initcall(vpci_msi_setup_keyhandler);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index f9b9829eaf..ae3af43749 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -20,6 +20,7 @@
 #define __ASM_X86_HVM_IO_H__
 
 #include <xen/mm.h>
+#include <xen/pci.h>
 #include <asm/hvm/vpic.h>
 #include <asm/hvm/vioapic.h>
 #include <public/hvm/ioreq.h>
@@ -126,6 +127,24 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_irq,
 void msix_write_completion(struct vcpu *);
 void msixtbl_init(struct domain *d);
 
+/* Is emulated MSI enabled? */
+extern bool dom0_msi;
+#define vpci_msi_enabled(d) (!is_hardware_domain((d)) || dom0_msi)
+
+/* Arch-specific MSI data for vPCI. */
+struct vpci_arch_msi {
+    int pirq;
+};
+
+/* Arch-specific vPCI MSI helpers. */
+void vpci_msi_mask(struct vpci_arch_msi *arch, unsigned int entry, bool mask);
+int vpci_msi_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                    uint64_t address, uint32_t data, unsigned int vectors);
+int vpci_msi_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                     unsigned int vectors);
+int vpci_msi_arch_init(struct vpci_arch_msi *arch);
+void vpci_msi_arch_print(struct vpci_arch_msi *arch);
+
 enum stdvga_cache_state {
     STDVGA_CACHE_UNINITIALIZED,
     STDVGA_CACHE_ENABLED,
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 0d2c72c109..37dfb3b6c5 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -58,6 +58,7 @@ struct dev_intx_gsi_link {
 #define VMSI_TRIG_MODE    0x8000
 
 #define GFLAGS_SHIFT_RH             8
+#define GFLAGS_SHIFT_DM             9
 #define GFLAGS_SHIFT_DELIV_MODE     12
 #define GFLAGS_SHIFT_TRG_MODE       15
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 2bf61d6c15..373b8d6505 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -89,9 +89,35 @@ struct vpci {
 
     /* List of capabilities supported by the device. */
     struct list_head cap_list;
+
+    /* MSI data. */
+    struct vpci_msi {
+        /* Maximum number of vectors supported by the device. */
+        unsigned int max_vectors;
+        /* Current guest-written number of vectors. */
+        unsigned int guest_vectors;
+        /* Number of vectors configured. */
+        unsigned int vectors;
+        /* Address and data fields. */
+        uint64_t address;
+        uint16_t data;
+        /* Mask bitfield. */
+        uint32_t mask;
+        /* Enabled? */
+        bool enabled;
+        /* Supports per-vector masking? */
+        bool masking;
+        /* 64-bit address capable? */
+        bool address64;
+        /* Arch-specific data. */
+        struct vpci_arch_msi arch;
+    } *msi;
 #endif
 };
 
+/* Mask a PCI capability. */
+void xen_vpci_mask_capability(struct pci_dev *pdev, uint8_t cap_id);
+
 #endif
 
 /*
-- 
2.11.0 (Apple Git-81)



* [PATCH v3 9/9] vpci/msix: add MSI-X handlers
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (7 preceding siblings ...)
  2017-04-27 14:35 ` [PATCH v3 8/9] vpci/msi: add MSI handlers Roger Pau Monne
@ 2017-04-27 14:35 ` Roger Pau Monne
  2017-05-29 13:29   ` Jan Beulich
  2017-05-29 13:38 ` [PATCH v3 0/9] vpci: PCI config space emulation Jan Beulich
  9 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-04-27 14:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Add handlers for accesses to the MSI-X message control field in the PCI
configuration space, and traps for accesses to the memory region that contains
the MSI-X table. These traps detect attempts by the guest to configure MSI-X
interrupts and set them up accordingly.
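
As background for the table traps, each MSI-X table entry is 16 bytes: message
address (low and high dwords), message data and vector control. A trapped
access can therefore be decoded into an entry index and a field as sketched
below. This is a standalone illustration with hypothetical names
(msix_decode(), struct msix_access), not the handler added by this patch, and
it assumes dword-aligned accesses.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u

/* Fields of a table entry, in the order they appear in memory. */
enum msix_field {
    MSIX_ADDR_LO,      /* offset 0 */
    MSIX_ADDR_HI,      /* offset 4 */
    MSIX_DATA,         /* offset 8 */
    MSIX_VECTOR_CTRL,  /* offset 12, bit 0 is the per-entry mask */
};

struct msix_access {
    unsigned int entry;
    enum msix_field field;
};

/* table_base is the address of entry 0, addr the trapped address. */
static struct msix_access msix_decode(uint64_t table_base, uint64_t addr)
{
    uint64_t off = addr - table_base;
    struct msix_access a;

    assert(addr >= table_base && !(off & 3));

    a.entry = (unsigned int)(off / MSIX_ENTRY_SIZE);
    a.field = (enum msix_field)((off % MSIX_ENTRY_SIZE) / 4);

    return a;
}
```

An access at table_base + 0x2c thus targets the vector control dword of
entry 2.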

Note that accesses to the Table Offset, Table BIR, PBA Offset, PBA BIR and the
PBA memory region itself are not trapped by Xen at the moment.
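
For readers unfamiliar with those registers: the Table Offset/BIR dword holds
the BAR index in its low 3 bits and a qword-aligned offset in the remaining
bits, while the table size is stored as N - 1 in bits 10:0 of the message
control word. A minimal sketch of locating the table under these assumptions
follows. The helper names are hypothetical, and bars[] is assumed to already
contain the decoded base address for each BAR index.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_CTRL_TABLE_SIZE 0x07ffu      /* Message Control, bits 10:0 */
#define MSIX_TABLE_BIR       0x00000007u  /* Table Offset/BIR, bits 2:0 */

/* Number of table entries; the register stores N - 1. */
static unsigned int msix_table_entries(uint16_t control)
{
    return (control & MSIX_CTRL_TABLE_SIZE) + 1u;
}

/* Physical address of the first MSI-X table entry. */
static uint64_t msix_table_addr(const uint64_t bars[6], uint32_t table_reg)
{
    unsigned int bir = table_reg & MSIX_TABLE_BIR;
    uint32_t offset = table_reg & ~MSIX_TABLE_BIR;

    assert(bir < 6);  /* BIR values 6 and 7 are reserved */

    return bars[bir] + offset;
}
```

With BAR 2 at 0xfe000000 and a Table Offset/BIR value of 0x2002, the table
starts at 0xfe002000.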

Whether Xen provides this functionality (MSI-X emulation) to Dom0 is controlled
by the "msix" sub-option of the dom0 command line option. When this sub-option
is disabled, Xen hides the MSI-X capability structure from Dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Split out arch-specific code.

This patch has been tested with devices using both a single MSI-X entry and
multiple ones.
---
 docs/misc/xen-command-line.markdown |   7 +
 xen/arch/x86/dom0_build.c           |   4 +
 xen/arch/x86/hvm/hvm.c              |   1 +
 xen/arch/x86/hvm/vmsi.c             | 114 ++++++++-
 xen/drivers/vpci/Makefile           |   2 +-
 xen/drivers/vpci/capabilities.c     |  16 +-
 xen/drivers/vpci/header.c           |  41 ++-
 xen/drivers/vpci/msix.c             | 487 ++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/domain.h    |   3 +
 xen/include/asm-x86/hvm/io.h        |  18 ++
 xen/include/xen/vpci.h              |  27 ++
 11 files changed, 696 insertions(+), 24 deletions(-)
 create mode 100644 xen/drivers/vpci/msix.c

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 73013aea14..7c1684511c 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -684,6 +684,13 @@ enabled.
 Enable or disable (using the `no-` prefix) the MSI emulation inside of
 Xen for a PVH Dom0. Note that this option has no effect on a PV Dom0.
 
+> `msix`
+
+> Default: `true`
+
+Enable or disable (using the `no-` prefix) the MSI-X emulation inside of
+Xen for a PVH Dom0. Note that this option has no effect on a PV Dom0.
+
 ### dtuart (ARM)
 > `= path [:options]`
 
diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
index 01afcf6215..3996d9dd12 100644
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -177,6 +177,7 @@ bool __initdata opt_dom0_shadow;
 #endif
 bool __initdata dom0_pvh;
 bool __initdata dom0_msi = true;
+bool __initdata dom0_msix = true;
 
 /*
  * List of parameters that affect Dom0 creation:
@@ -184,6 +185,7 @@ bool __initdata dom0_msi = true;
  *  - pvh               Create a PVHv2 Dom0.
  *  - shadow            Use shadow paging for Dom0.
  *  - msi               MSI functionality.
+ *  - msix              MSI-X functionality.
  */
 static void __init parse_dom0_param(char *s)
 {
@@ -207,6 +209,8 @@ static void __init parse_dom0_param(char *s)
 #endif
         else if ( !strcmp(s, "msi") )
             dom0_msi = enabled;
+        else if ( !strcmp(s, "msix") )
+            dom0_msix = enabled;
 
         s = ss + 1;
     } while ( ss );
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index ef3ad2a615..3a3296ffe7 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -614,6 +614,7 @@ int hvm_domain_initialise(struct domain *d)
     INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.ecam_regions);
+    INIT_LIST_HEAD(&d->arch.hvm_domain.msix_tables);
 
     hvm_init_cacheattr_region_list(d);
 
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index f23b2f43d6..00a15b862f 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -643,15 +643,15 @@ static unsigned int msi_flags(uint16_t data, uint64_t addr)
            (trig_mode << GFLAGS_SHIFT_TRG_MODE);
 }
 
-void vpci_msi_mask(struct vpci_arch_msi *arch, unsigned int entry, bool mask)
+static void vpci_mask_pirq(int pirq, bool mask)
 {
     struct pirq *pinfo;
     struct irq_desc *desc;
     unsigned long flags;
     int irq;
 
-    ASSERT(arch->pirq != -1);
-    pinfo = pirq_info(current->domain, arch->pirq + entry);
+    ASSERT(pirq != -1);
+    pinfo = pirq_info(current->domain, pirq);
     ASSERT(pinfo);
 
     irq = pinfo->arch.irq;
@@ -665,6 +665,11 @@ void vpci_msi_mask(struct vpci_arch_msi *arch, unsigned int entry, bool mask)
     spin_unlock_irqrestore(&desc->lock, flags);
 }
 
+void vpci_msi_mask(struct vpci_arch_msi *arch, unsigned int entry, bool mask)
+{
+    vpci_mask_pirq(arch->pirq + entry, mask);
+}
+
 int vpci_msi_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
                     uint64_t address, uint32_t data, unsigned int vectors)
 {
@@ -763,3 +768,106 @@ void vpci_msi_arch_print(struct vpci_arch_msi *arch)
     printk("PIRQ: %d\n", arch->pirq);
 }
 
+void vpci_msix_mask(struct vpci_arch_msix_entry *arch, bool mask)
+{
+    vpci_mask_pirq(arch->pirq, mask);
+}
+
+int vpci_msix_enable(struct vpci_arch_msix_entry *arch, struct pci_dev *pdev,
+                     uint64_t address, uint32_t data, unsigned int entry_nr,
+                     paddr_t table_base)
+{
+    struct domain *d = pdev->domain;
+    xen_domctl_bind_pt_irq_t bind = {
+        .hvm_domid = DOMID_SELF,
+        .irq_type = PT_IRQ_TYPE_MSI,
+        .u.msi.gvec = msi_vector(data),
+        .u.msi.gflags = msi_flags(data, address),
+    };
+    int rc;
+
+    if ( arch->pirq == -1 )
+    {
+        struct msi_info msi_info = {
+            .seg = pdev->seg,
+            .bus = pdev->bus,
+            .devfn = pdev->devfn,
+            .table_base = table_base,
+            .entry_nr = entry_nr,
+        };
+        int index = -1;
+
+        /* Map PIRQ. */
+        rc = allocate_and_map_msi_pirq(pdev->domain, &index, &arch->pirq,
+                                       &msi_info);
+        if ( rc )
+        {
+            gdprintk(XENLOG_ERR,
+                     "%04x:%02x:%02x.%u: unable to map MSI-X PIRQ entry %u: %d\n",
+                     pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), entry_nr, rc);
+            return rc;
+        }
+    }
+
+    bind.machine_irq = arch->pirq;
+    pcidevs_lock();
+    rc = pt_irq_create_bind(d, &bind);
+    if ( rc )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: unable to create MSI-X bind %u: %d\n",
+                 pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                 PCI_FUNC(pdev->devfn), entry_nr, rc);
+        spin_lock(&pdev->domain->event_lock);
+        unmap_domain_pirq(pdev->domain, arch->pirq);
+        spin_unlock(&pdev->domain->event_lock);
+        arch->pirq = -1;
+    }
+    pcidevs_unlock();
+
+    return rc;
+}
+
+int vpci_msix_disable(struct vpci_arch_msix_entry *arch)
+{
+    xen_domctl_bind_pt_irq_t bind = {
+        .hvm_domid = DOMID_SELF,
+        .irq_type = PT_IRQ_TYPE_MSI,
+        .machine_irq = arch->pirq,
+    };
+    int rc;
+
+    if ( arch->pirq == -1 )
+        return 0;
+
+    pcidevs_lock();
+    rc = pt_irq_destroy_bind(current->domain, &bind);
+    if ( rc )
+    {
+        pcidevs_unlock();
+        return rc;
+    }
+
+    spin_lock(&current->domain->event_lock);
+    unmap_domain_pirq(current->domain, arch->pirq);
+    spin_unlock(&current->domain->event_lock);
+    pcidevs_unlock();
+
+    arch->pirq = -1;
+
+    return 0;
+}
+
+int vpci_msix_arch_init(struct vpci_arch_msix_entry *arch)
+{
+    arch->pirq = -1;
+    return 0;
+}
+
+void vpci_msix_arch_print(struct vpci_arch_msix_entry *arch)
+{
+    /* No newline, it will be added by the generic debug handler. */
+    printk("pirq: %d", arch->pirq);
+}
+
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index ef4fc6caf3..55398d4428 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o header.o capabilities.o msi.o
+obj-y += vpci.o header.o capabilities.o msi.o msix.o
diff --git a/xen/drivers/vpci/capabilities.c b/xen/drivers/vpci/capabilities.c
index ad9f45c2e1..7166ccb502 100644
--- a/xen/drivers/vpci/capabilities.c
+++ b/xen/drivers/vpci/capabilities.c
@@ -130,21 +130,7 @@ void xen_vpci_mask_capability(struct pci_dev *pdev, uint8_t cap_id)
     }
 }
 
-static int vpci_capabilities_init(struct pci_dev *pdev)
-{
-    int rc;
-
-    rc = vpci_index_capabilities(pdev);
-    if ( rc )
-        return rc;
-
-    /* Mask MSI-X capability until Xen handles it. */
-    xen_vpci_mask_capability(pdev, PCI_CAP_ID_MSIX);
-
-    return 0;
-}
-
-REGISTER_VPCI_INIT(vpci_capabilities_init, true);
+REGISTER_VPCI_INIT(vpci_index_capabilities, true);
 
 /*
  * Local variables:
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 3deec53efd..c03dcdd708 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -32,16 +32,47 @@ static int vpci_modify_bars(struct pci_dev *pdev, const bool map)
         paddr_t gaddr = map ? header->bars[i].gaddr
                             : header->bars[i].mapped_addr;
         paddr_t paddr = header->bars[i].paddr;
+        size_t size = header->bars[i].size;
 
         if ( header->bars[i].type != VPCI_BAR_MEM &&
              header->bars[i].type != VPCI_BAR_MEM64_LO )
             continue;
 
-        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(gaddr)),
-                         _mfn(PFN_DOWN(paddr)), PFN_UP(header->bars[i].size),
-                         map);
-        if ( rc )
-            break;
+        if ( pdev->vpci->msix != NULL && pdev->vpci->msix->bir == i )
+        {
+            /* There's an MSI-X table inside of this BAR. */
+            paddr_t msix_gaddr = gaddr + pdev->vpci->msix->offset;
+            paddr_t msix_paddr = paddr + pdev->vpci->msix->offset;
+            size_t msix_size = pdev->vpci->msix->max_entries *
+                               PCI_MSIX_ENTRY_SIZE;
+
+            ASSERT(IS_ALIGNED(msix_gaddr, PAGE_SIZE) &&
+                   IS_ALIGNED(msix_paddr, PAGE_SIZE));
+
+            rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(gaddr)),
+                             _mfn(PFN_DOWN(paddr)),
+                             PFN_DOWN(msix_paddr - paddr), map);
+            if ( rc )
+                break;
+
+            rc = modify_mmio(pdev->domain,
+                             _gfn(PFN_UP(msix_gaddr + msix_size)),
+                             _mfn(PFN_UP(msix_paddr + msix_size)),
+                             PFN_UP(paddr + size -
+                                    round_pgup(msix_paddr + msix_size)), map);
+            if ( rc )
+                break;
+
+            if ( map )
+                pdev->vpci->msix->addr = msix_gaddr;
+        }
+        else
+        {
+            rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(gaddr)),
+                             _mfn(PFN_DOWN(paddr)), PFN_UP(size), map);
+            if ( rc )
+                break;
+        }
 
         header->bars[i].mapped_addr = map ? gaddr : 0;
     }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
new file mode 100644
index 0000000000..df45dca917
--- /dev/null
+++ b/xen/drivers/vpci/msix.c
@@ -0,0 +1,487 @@
+/*
+ * Handlers for accesses to the MSI-X capability structure and the memory
+ * region.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <asm/msi.h>
+#include <xen/p2m-common.h>
+#include <xen/keyhandler.h>
+
+#define MSIX_SIZE(num) (offsetof(struct vpci_msix, entries[num]))
+
+static int vpci_msix_control_read(struct pci_dev *pdev, unsigned int reg,
+                                  union vpci_val *val, void *data)
+{
+    struct vpci_msix *msix = data;
+
+    val->word = (msix->max_entries - 1) & PCI_MSIX_FLAGS_QSIZE;
+    val->word |= msix->enabled ? PCI_MSIX_FLAGS_ENABLE : 0;
+    val->word |= msix->masked ? PCI_MSIX_FLAGS_MASKALL : 0;
+
+    return 0;
+}
+
+static int vpci_msix_control_write(struct pci_dev *pdev, unsigned int reg,
+                                   union vpci_val val, void *data)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    paddr_t table_base = pdev->vpci->header.bars[pdev->vpci->msix->bir].paddr;
+    struct vpci_msix *msix = data;
+    bool new_masked, new_enabled;
+    unsigned int i;
+    uint32_t data32;
+    int rc;
+
+    new_masked = val.word & PCI_MSIX_FLAGS_MASKALL;
+    new_enabled = val.word & PCI_MSIX_FLAGS_ENABLE;
+
+    if ( new_enabled != msix->enabled && new_enabled )
+    {
+        /* MSI-X enabled. */
+        for ( i = 0; i < msix->max_entries; i++ )
+        {
+            if ( msix->entries[i].masked )
+                continue;
+
+            rc = vpci_msix_enable(&msix->entries[i].arch, pdev,
+                                  msix->entries[i].addr, msix->entries[i].data,
+                                  msix->entries[i].nr, table_base);
+            if ( rc )
+            {
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to update entry %u: %d\n",
+                         seg, bus, slot, func, i, rc);
+                return rc;
+            }
+
+            vpci_msix_mask(&msix->entries[i].arch, false);
+        }
+    }
+    else if ( new_enabled != msix->enabled && !new_enabled )
+    {
+        /* MSI-X disabled. */
+        for ( i = 0; i < msix->max_entries; i++ )
+        {
+            rc = vpci_msix_disable(&msix->entries[i].arch);
+            if ( rc )
+            {
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
+                         seg, bus, slot, func, i, rc);
+                return rc;
+            }
+        }
+    }
+
+    data32 = val.word;
+    if ( (new_enabled != msix->enabled || new_masked != msix->masked) &&
+         pci_msi_conf_write_intercept(pdev, reg, 2, &data32) >= 0 )
+        pci_conf_write16(seg, bus, slot, func, reg, data32);
+
+    msix->masked = new_masked;
+    msix->enabled = new_enabled;
+
+    return 0;
+}
+
+static struct vpci_msix *vpci_msix_find(struct domain *d, unsigned long addr)
+{
+    struct vpci_msix *msix;
+
+    ASSERT(vpci_locked(d));
+    list_for_each_entry ( msix, &d->arch.hvm_domain.msix_tables, next )
+        if ( msix->pdev->vpci->header.command & PCI_COMMAND_MEMORY &&
+             addr >= msix->addr &&
+             addr < msix->addr + msix->max_entries * PCI_MSIX_ENTRY_SIZE )
+            return msix;
+
+    return NULL;
+}
+
+static int vpci_msix_table_accept(struct vcpu *v, unsigned long addr)
+{
+    int found;
+
+    vpci_lock(v->domain);
+    found = !!vpci_msix_find(v->domain, addr);
+    vpci_unlock(v->domain);
+
+    return found;
+}
+
+static int vpci_msix_access_check(struct pci_dev *pdev, unsigned long addr,
+                                  unsigned int len)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+
+
+    /* Only allow 32/64b accesses. */
+    if ( len != 4 && len != 8 )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: invalid MSI-X table access size: %u\n",
+                 seg, bus, slot, func, len);
+        return -EINVAL;
+    }
+
+    /* Do not allow accesses that span across multiple entries. */
+    if ( (addr & (PCI_MSIX_ENTRY_SIZE - 1)) + len > PCI_MSIX_ENTRY_SIZE )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: MSI-X access crosses entry boundary\n",
+                 seg, bus, slot, func);
+        return -EINVAL;
+    }
+
+    /*
+     * Only allow 64b accesses to the low message address field.
+     *
+     * NB: this is more restrictive than the specification, that allows 64b
+     * accesses to other fields under certain circumstances, so this check and
+     * the code will have to be fixed in order to fully comply with the
+     * specification.
+     */
+    if ( (addr & (PCI_MSIX_ENTRY_SIZE - 1)) != 0 && len != 4 )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: 64bit MSI-X table access to 32bit field"
+                 " (offset: %#lx len: %u)\n", seg, bus, slot, func,
+                 addr & (PCI_MSIX_ENTRY_SIZE - 1), len);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static struct vpci_msix_entry *vpci_msix_get_entry(struct vpci_msix *msix,
+                                                   unsigned long addr)
+{
+    return &msix->entries[(addr - msix->addr) / PCI_MSIX_ENTRY_SIZE];
+}
+
+static int vpci_msix_table_read(struct vcpu *v, unsigned long addr,
+                                unsigned int len, unsigned long *data)
+{
+    struct vpci_msix *msix;
+    struct vpci_msix_entry *entry;
+    unsigned int offset;
+
+    vpci_lock(v->domain);
+    msix = vpci_msix_find(v->domain, addr);
+    if ( !msix )
+    {
+        vpci_unlock(v->domain);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    if ( vpci_msix_access_check(msix->pdev, addr, len) )
+    {
+        vpci_unlock(v->domain);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    /* Get the table entry and offset. */
+    entry = vpci_msix_get_entry(msix, addr);
+    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
+
+    switch ( offset )
+    {
+    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
+        *data = entry->addr;
+        break;
+    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
+        *data = entry->addr >> 32;
+        break;
+    case PCI_MSIX_ENTRY_DATA_OFFSET:
+        *data = entry->data;
+        break;
+    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
+        *data = entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0;
+        break;
+    default:
+        BUG();
+    }
+    vpci_unlock(v->domain);
+
+    return X86EMUL_OKAY;
+}
+
+static int vpci_msix_table_write(struct vcpu *v, unsigned long addr,
+                                 unsigned int len, unsigned long data)
+{
+    struct vpci_msix *msix;
+    struct vpci_msix_entry *entry;
+    unsigned int offset;
+
+    vpci_lock(v->domain);
+    msix = vpci_msix_find(v->domain, addr);
+    if ( !msix )
+    {
+        vpci_unlock(v->domain);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    if ( vpci_msix_access_check(msix->pdev, addr, len) )
+    {
+        vpci_unlock(v->domain);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    /* Get the table entry and offset. */
+    entry = vpci_msix_get_entry(msix, addr);
+    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
+
+    switch ( offset )
+    {
+    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
+        if ( len == 8 )
+        {
+            entry->addr = data;
+            break;
+        }
+        entry->addr &= ~GENMASK(31, 0);
+        entry->addr |= data;
+        break;
+    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
+        entry->addr &= ~GENMASK(63, 32);
+        entry->addr |= data << 32;
+        break;
+    case PCI_MSIX_ENTRY_DATA_OFFSET:
+        entry->data = data;
+        break;
+    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
+    {
+        bool new_masked = data & PCI_MSIX_VECTOR_BITMASK;
+        struct pci_dev *pdev = msix->pdev;
+        paddr_t table_base =
+            pdev->vpci->header.bars[pdev->vpci->msix->bir].paddr;
+        int rc;
+
+        if ( !msix->enabled )
+        {
+            entry->masked = new_masked;
+            break;
+        }
+
+        if ( new_masked != entry->masked && !new_masked )
+        {
+            /* Unmasking an entry, update it. */
+            rc = vpci_msix_enable(&entry->arch, msix->pdev, entry->addr,
+                                  entry->data, entry->nr, table_base);
+            if ( rc )
+            {
+                vpci_unlock(v->domain);
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to update entry %u: %d\n",
+                         pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                         PCI_FUNC(pdev->devfn), entry->nr, rc);
+                return X86EMUL_UNHANDLEABLE;
+            }
+        }
+
+        vpci_msix_mask(&entry->arch, new_masked);
+        entry->masked = new_masked;
+
+        break;
+    }
+    default:
+        BUG();
+    }
+    vpci_unlock(v->domain);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops vpci_msix_table_ops = {
+    .check = vpci_msix_table_accept,
+    .read = vpci_msix_table_read,
+    .write = vpci_msix_table_write,
+};
+
+static int vpci_init_msix(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    struct vpci_msix *msix;
+    unsigned int msix_offset, i, max_entries;
+    paddr_t msix_paddr;
+    uint16_t control;
+    int rc;
+
+    msix_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
+    if ( !msix_offset )
+        return 0;
+
+    if ( !vpci_msix_enabled(pdev->domain) )
+    {
+        xen_vpci_mask_capability(pdev, PCI_CAP_ID_MSIX);
+        return 0;
+    }
+
+    control = pci_conf_read16(seg, bus, slot, func,
+                              msix_control_reg(msix_offset));
+
+    /* Get the maximum number of vectors the device supports. */
+    max_entries = msix_table_size(control);
+    if ( !max_entries )
+        return 0;
+
+    msix = xzalloc_bytes(MSIX_SIZE(max_entries));
+    if ( !msix )
+        return -ENOMEM;
+
+    msix->max_entries = max_entries;
+    msix->pdev = pdev;
+
+    /* Find the MSI-X table address. */
+    msix->offset = pci_conf_read32(seg, bus, slot, func,
+                                   msix_table_offset_reg(msix_offset));
+    msix->bir = msix->offset & PCI_MSIX_BIRMASK;
+    msix->offset &= ~PCI_MSIX_BIRMASK;
+
+    ASSERT(pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM ||
+           pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM64_LO);
+    msix->addr = pdev->vpci->header.bars[msix->bir].mapped_addr + msix->offset;
+    msix_paddr = pdev->vpci->header.bars[msix->bir].paddr + msix->offset;
+
+    for ( i = 0; i < msix->max_entries; i++ )
+    {
+        msix->entries[i].masked = true;
+        msix->entries[i].nr = i;
+        vpci_msix_arch_init(&msix->entries[i].arch);
+    }
+
+    if ( list_empty(&d->arch.hvm_domain.msix_tables) )
+        register_mmio_handler(d, &vpci_msix_table_ops);
+
+    list_add(&msix->next, &d->arch.hvm_domain.msix_tables);
+
+    rc = xen_vpci_add_register(pdev, vpci_msix_control_read,
+                               vpci_msix_control_write,
+                               msix_control_reg(msix_offset), 2, msix);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u: failed to add handler for MSI-X control: %d\n",
+                seg, bus, slot, func, rc);
+        goto error;
+    }
+
+    if ( pdev->vpci->header.command & PCI_COMMAND_MEMORY )
+    {
+        /* Unmap this memory from the guest. */
+        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(msix->addr)),
+                         _mfn(PFN_DOWN(msix_paddr)),
+                         PFN_UP(msix->max_entries * PCI_MSIX_ENTRY_SIZE),
+                         false);
+        if ( rc )
+        {
+            dprintk(XENLOG_ERR,
+                    "%04x:%02x:%02x.%u: unable to unmap MSI-X BAR region: %d\n",
+                    seg, bus, slot, func, rc);
+            goto error;
+        }
+    }
+
+    pdev->vpci->msix = msix;
+
+    return 0;
+
+ error:
+    ASSERT(rc);
+    xfree(msix);
+    return rc;
+}
+
+REGISTER_VPCI_INIT(vpci_init_msix, false);
+
+static void vpci_dump_msix(unsigned char key)
+{
+    struct domain *d;
+    struct pci_dev *pdev;
+
+    printk("Guest MSI-X information:\n");
+
+    for_each_domain ( d )
+    {
+        if ( !has_vpci(d) )
+            continue;
+
+        vpci_lock(d);
+        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
+        {
+            uint8_t seg = pdev->seg, bus = pdev->bus;
+            uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+            struct vpci_msix *msix = pdev->vpci->msix;
+            unsigned int i;
+
+            if ( !msix )
+                continue;
+
+            printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
+
+            printk("Max entries: %u maskall: %u enabled: %u\n",
+                   msix->max_entries, msix->masked, msix->enabled);
+
+            printk("Guest entries:\n");
+            for ( i = 0; i < msix->max_entries; i++ )
+            {
+                struct vpci_msix_entry *entry = &msix->entries[i];
+                uint32_t data = entry->data;
+                uint64_t addr = entry->addr;
+
+                printk("%4u vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu mask=%u ",
+                       i,
+                       (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT,
+                       data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
+                       data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
+                       data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
+                       addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
+                       addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "cpu",
+                       (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT,
+                       entry->masked);
+                vpci_msix_arch_print(&entry->arch);
+                printk("\n");
+            }
+            printk("\n");
+        }
+        vpci_unlock(d);
+    }
+}
+
+static int __init vpci_msix_setup_keyhandler(void)
+{
+    register_keyhandler('X', vpci_dump_msix, "dump guest MSI-X state", 1);
+    return 0;
+}
+__initcall(vpci_msix_setup_keyhandler);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index b69c33df3c..2bef170cc0 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -198,6 +198,9 @@ struct hvm_domain {
     /* List of ECAM (MMCFG) regions trapped by Xen. */
     struct list_head ecam_regions;
 
+    /* List of MSI-X tables. */
+    struct list_head msix_tables;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index ae3af43749..d7ecc2adbf 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -131,6 +131,10 @@ void msixtbl_init(struct domain *d);
 extern bool dom0_msi;
 #define vpci_msi_enabled(d) (!is_hardware_domain((d)) || dom0_msi)
 
+/* Is emulated MSI-X enabled? */
+extern bool dom0_msix;
+#define vpci_msix_enabled(d) (!is_hardware_domain((d)) || dom0_msix)
+
 /* Arch-specific MSI data for vPCI. */
 struct vpci_arch_msi {
     int pirq;
@@ -145,6 +149,20 @@ int vpci_msi_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
 int vpci_msi_arch_init(struct vpci_arch_msi *arch);
 void vpci_msi_arch_print(struct vpci_arch_msi *arch);
 
+/* Arch-specific MSI-X entry data for vPCI. */
+struct vpci_arch_msix_entry {
+    int pirq;
+};
+
+/* Arch-specific vPCI MSI-X helpers. */
+void vpci_msix_mask(struct vpci_arch_msix_entry *arch, bool mask);
+int vpci_msix_enable(struct vpci_arch_msix_entry *arch, struct pci_dev *pdev,
+                     uint64_t address, uint32_t data, unsigned int entry_nr,
+                     paddr_t table_base);
+int vpci_msix_disable(struct vpci_arch_msix_entry *arch);
+int vpci_msix_arch_init(struct vpci_arch_msix_entry *arch);
+void vpci_msix_arch_print(struct vpci_arch_msix_entry *arch);
+
 enum stdvga_cache_state {
     STDVGA_CACHE_UNINITIALIZED,
     STDVGA_CACHE_ENABLED,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 373b8d6505..8c3e56c1e4 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -112,6 +112,33 @@ struct vpci {
         /* Arch-specific data. */
         struct vpci_arch_msi arch;
     } *msi;
+
+    /* MSI-X data. */
+    struct vpci_msix {
+        struct pci_dev *pdev;
+        /* Maximum number of vectors supported by the device. */
+        unsigned int max_entries;
+        /* MSI-X table offset. */
+        unsigned int offset;
+        /* MSI-X table BIR. */
+        unsigned int bir;
+        /* Table addr. */
+        paddr_t addr;
+        /* MSI-X enabled? */
+        bool enabled;
+        /* Masked? */
+        bool masked;
+        /* List link. */
+        struct list_head next;
+        /* Entries. */
+        struct vpci_msix_entry {
+            unsigned int nr;
+            uint64_t addr;
+            uint32_t data;
+            bool masked;
+            struct vpci_arch_msix_entry arch;
+        } entries[];
+    } *msix;
 #endif
 };
 
-- 
2.11.0 (Apple Git-81)



* Re: [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-04-27 14:35 ` [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
@ 2017-05-19 11:35   ` Jan Beulich
  2017-05-29 12:57     ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-19 11:35 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, julien.grall, Paul Durrant,
	xen-devel, boris.ostrovsky

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> --- a/tools/libxl/libxl_x86.c
> +++ b/tools/libxl/libxl_x86.c
> @@ -11,7 +11,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
>      if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
>          if (d_config->b_info.device_model_version !=
>              LIBXL_DEVICE_MODEL_VERSION_NONE) {
> -            xc_config->emulation_flags = XEN_X86_EMU_ALL;
> +            xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);

I can see why you need this, but I'm not sure this is a good model.
Ideally for ordinary HVM guests you'd never have to change this
line. Therefore perhaps it might be a better idea to use a "negative"
flag here.

> --- /dev/null
> +++ b/tools/tests/vpci/Makefile
> @@ -0,0 +1,45 @@
> +
> +XEN_ROOT=$(CURDIR)/../../..
> +include $(XEN_ROOT)/tools/Rules.mk
> +
> +TARGET := test_vpci
> +
> +.PHONY: all
> +all: $(TARGET)
> +
> +.PHONY: run
> +run: $(TARGET)
> +	./$(TARGET) > $(TARGET).out
> +
> +$(TARGET): vpci.c vpci.h rbtree.c rbtree.h
> +	$(HOSTCC) -g -o $@ vpci.c main.c rbtree.c
> +
> +.PHONY: clean
> +clean:
> +	rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c rbtree.c rbtree.h
> +
> +.PHONY: distclean
> +distclean: clean
> +
> +.PHONY: install
> +install:
> +
> +vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
> +	sed -e '/#include/d' <$< >$@
> +
> +vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
> +	# Trick the compiler so it doesn't complain about missing symbols
> +	sed -e '/#include/d' \
> +	    -e '1s;^;#include "emul.h"\
> +	             const vpci_register_init_t __start_vpci_array[1]\;\
> +	             const vpci_register_init_t __end_vpci_array[1]\;\
> +	             ;' <$< >$@
> +
> +rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h
> +	sed -e '/#include/d' <$< >$@
> +
> +rbtree.c: $(XEN_ROOT)/xen/common/rbtree.c
> +	sed -e "/#include/d" \
> +	    -e '1s;^;#include "emul.h"\
> +	             ;' <$< >$@

Plain symlinking and __XEN__ conditionals in the files may be the
easier to follow variant. But I'm not heavily opposed to this one,
I'm merely afraid that further adjustments may end up becoming
necessary down the road, resulting in the rules here becoming
more convoluted.

> --- /dev/null
> +++ b/tools/tests/vpci/emul.h
> @@ -0,0 +1,107 @@
> +/*
> + * Unit tests for the generic vPCI handler code.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef _TEST_VPCI_
> +#define _TEST_VPCI_
> +
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <stdbool.h>
> +#include <errno.h>
> +#include <assert.h>
> +
> +#define container_of(ptr, type, member) ({                      \
> +        typeof( ((type *)0)->member ) *__mptr = (ptr);          \
> +        (type *)( (char *)__mptr - offsetof(type,member) );})

There are a couple of stray blanks (immediately inside parentheses)
here, and a missing one after the comma in offsetof().
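
For illustration, the macro with the whitespace adjusted might look like
the sketch below. struct outer and outer_from_inner() are made-up names
that only exercise the macro, and __typeof__ is used in place of typeof
purely so the sketch also builds with strict -std= settings. The macro
still depends on the GNU statement-expression extension.

```c
#include <assert.h>
#include <stddef.h>

/* No blanks immediately inside parentheses, one blank after the comma. */
#define container_of(ptr, type, member) ({                      \
        __typeof__(((type *)0)->member) *__mptr = (ptr);        \
        (type *)((char *)__mptr - offsetof(type, member)); })

/* Hypothetical structure, only here to exercise the macro. */
struct outer {
    int pad;
    int inner;
};

/* Map a pointer to the 'inner' member back to its containing structure. */
static struct outer *outer_from_inner(int *inner)
{
    struct outer *o = container_of(inner, struct outer, inner);

    assert(&o->inner == inner);

    return o;
}
```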

> +#include "rbtree.h"
> +
> +struct pci_dev {
> +    struct domain *domain;
> +    struct vpci *vpci;
> +};
> +
> +struct domain {
> +    struct pci_dev pdev;
> +};
> +
> +struct vcpu
> +{
> +    struct domain *domain;
> +};
> +
> +extern struct vcpu v;

This is odd. With ...

> +#define spin_lock(x)
> +#define spin_unlock(x)
> +#define spin_is_locked(x) true
> +
> +#define current (&v)

... this, why don't you simply have

extern struct vcpu *current;

keeping v (or however you mean to name it) static?
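
A minimal sketch of that arrangement, collapsed into one translation
unit. The split between emul.h and main.c is only indicated by comments,
and current_domain() is a hypothetical helper standing in for the code
under test.

```c
#include <assert.h>
#include <stddef.h>

struct domain;

struct vcpu {
    struct domain *domain;
};

/* emul.h would only carry the declaration ... */
extern struct vcpu *current;

/* ... while main.c keeps the backing object private to itself. */
static struct vcpu v;
struct vcpu *current = &v;

/* Code under test dereferences 'current' just as it would inside Xen. */
static struct domain *current_domain(void)
{
    assert(current != NULL);

    return current->domain;
}
```

This way no "#define current (&v)" is needed and v does not leak into
every file that includes the header.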

> +#define has_vpci(d) true
> +
> +#include "vpci.h"
> +
> +#define xzalloc(type) (type *)calloc(1, sizeof(type))

Missing an outer pair of parentheses.

> +#define xfree(p) free(p)
> +
> +#define EXPORT_SYMBOL(x)

I think we should rather get rid of them from rbtree.c.

> +#define pci_get_pdev_by_domain(d, ...) &(d)->pdev

Missing an outer pair of parentheses again, whereas ...

> +#define atomic_read(x) 1
> +
> +/* Dummy native helpers. Writes are ignored, reads return 1's. */
> +#define pci_conf_read8(...) (0xff)
> +#define pci_conf_read16(...) (0xffff)
> +#define pci_conf_read32(...) (0xffffffff)

... here they're pointless.
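
To spell out why the outer parentheses matter: a postfix operator
applied to the macro result binds more tightly than the cast or the
unary & inside the expansion. A sketch of the three cases, assuming a
trimmed-down struct pci_dev with a made-up "id" field, and with
first_pdev_id() and zeroed_id() as hypothetical users:

```c
#include <assert.h>
#include <stdlib.h>

struct pci_dev {
    int id;
};

struct domain {
    struct pci_dev pdev;
};

/*
 * Outer parentheses turn each expansion into a single primary expression,
 * so "->" applied to the result acts on the whole expression.
 */
#define xzalloc(type) ((type *)calloc(1, sizeof(type)))
#define pci_get_pdev_by_domain(d, ...) (&(d)->pdev)

/* A plain constant gains nothing from being parenthesized. */
#define pci_conf_read8(...) 0xff

/*
 * Without the outer parentheses this would expand to &(d)->pdev->id, i.e.
 * &((d)->pdev->id), which does not build because pdev is not a pointer.
 */
static int first_pdev_id(struct domain *d)
{
    assert(d != NULL);

    return pci_get_pdev_by_domain(d, 0, 0, 0)->id;
}

/* Allocate a zeroed device, read a field back and release it again. */
static int zeroed_id(void)
{
    struct pci_dev *pdev = xzalloc(struct pci_dev);
    int id;

    if ( !pdev )
        return -1;

    id = pdev->id;
    free(pdev);

    return id;
}
```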

> +/* Dummy hooks, write stores data, read fetches it. */
> +static int vpci_read8(struct pci_dev *pdev, unsigned int reg,
> +                      union vpci_val *val, void *data)
> +{
> +    uint8_t *priv = data;
> +
> +    val->half_word = *priv;

Half word? Half a word is at least 16 bits on any reasonable
architecture nowadays. Using it for a byte is simply confusing. I'd
suggest naming the fields what they are - u8, u16, and u32.
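
As a sketch of the suggested naming (only the field names change; the
hook signature is taken from the quoted code, and struct pci_dev is
left opaque here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pci_dev;

/* Fields named after their width instead of half_word, word and so on. */
union vpci_val {
    uint8_t u8;
    uint16_t u16;
    uint32_t u32;
};

/* Dummy read hook for a byte-sized register, as in the test harness. */
static int vpci_read8(struct pci_dev *pdev, unsigned int reg,
                      union vpci_val *val, void *data)
{
    const uint8_t *priv = data;

    (void)pdev;
    (void)reg;
    assert(priv != NULL);
    val->u8 = *priv;

    return 0;
}
```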

> +#define VPCI_READ(reg, size, data) \
> +    assert(!xen_vpci_read(0, 0, 0, reg, size, data))
> +
> +#define VPCI_READ_CHECK(reg, size, expected) ({ \
> +    uint32_t val;                               \
> +    VPCI_READ(reg, size, &val);                 \
> +    assert(val == expected);                    \
> +    })

Bad indentation - either the }) needs to move left, or the body needs
to move right.

> +#define VPCI_WRITE(reg, size, data) \
> +    assert(!xen_vpci_write(0, 0, 0, reg, size, data))

You're using a fixed SBDF here, ...

> +#define VPCI_CHECK_REG(reg, size, data) ({      \
> +    VPCI_WRITE(reg, size, data);                \
> +    VPCI_READ_CHECK(reg, size, data);           \
> +    })
> +
> +#define VPCI_ADD_REG(fread, fwrite, off, size, store)                         \
> +    assert(!xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, &store)) \

... why do you have this strange &d.pdev here? The assumption
that a (fake) domain has a single (fake) PCI device looks pretty odd
anyway - why can't you simply have a global (fake) PCI device?

> +#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \
> +    assert(xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, NULL))  \
> +
> +int
> +main(int argc, char **argv)
> +{
> +    /* Index storage by offset. */
> +    uint32_t r0 = 0xdeadbeef;
> +    uint8_t r5 = 0xef;
> +    uint8_t r6 = 0xbe;
> +    uint8_t r7 = 0xef;
> +    uint16_t r12 = 0x8696;
> +    int rc;
> +
> +    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
> +    VPCI_READ_CHECK(0, 4, 0xdeadbeef);
> +    VPCI_CHECK_REG(0, 4, 0xbcbcbcbc);

In the context here the macro name is pretty confusing: I'd expect
it to check the register holds the specified value, without also doing
a write. How about VPCI_WRITE_CHECK()?
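
A sketch of what the renamed macro could look like, with the closing })
moved left as asked for further up. fake_cfg, fake_read() and
fake_write() are hypothetical stand-ins for the emulated config space
and for xen_vpci_read()/xen_vpci_write(), and the memcpy() based
partial accesses assume a little-endian host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the config space of the fake device. */
static uint8_t fake_cfg[256];

/* Stand-ins for the vPCI accessors; both return 0 on success. */
static int fake_read(unsigned int reg, unsigned int size, uint32_t *data)
{
    *data = 0;
    memcpy(data, &fake_cfg[reg], size);

    return 0;
}

static int fake_write(unsigned int reg, unsigned int size, uint32_t data)
{
    memcpy(&fake_cfg[reg], &data, size);

    return 0;
}

#define VPCI_READ(reg, size, data) assert(!fake_read(reg, size, data))
#define VPCI_WRITE(reg, size, data) assert(!fake_write(reg, size, data))

#define VPCI_READ_CHECK(reg, size, expected) ({ \
    uint32_t val_;                              \
    VPCI_READ(reg, size, &val_);                \
    assert(val_ == (expected));                 \
})

/* Write a value, then verify that it reads back unchanged. */
#define VPCI_WRITE_CHECK(reg, size, data) ({    \
    VPCI_WRITE(reg, size, data);                \
    VPCI_READ_CHECK(reg, size, data);           \
})
```

As in the quoted harness, the accesses happen inside assert(), so this
must not be built with NDEBUG defined.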

> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
> +    VPCI_READ_CHECK(5, 1, 0xef);
> +    VPCI_CHECK_REG(5, 1, 0xba);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
> +    VPCI_READ_CHECK(6, 1, 0xbe);
> +    VPCI_CHECK_REG(6, 1, 0xba);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
> +    VPCI_READ_CHECK(7, 1, 0xef);
> +    VPCI_CHECK_REG(7, 1, 0xbd);
> +
> +    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
> +    VPCI_READ_CHECK(12, 2, 0x8696);
> +    VPCI_READ_CHECK(12, 4, 0xffff8696);
> +
> +    /*
> +     * At this point we have the following layout:
> +     *
> +     * 32    24    16     8     0
> +     *  +-----+-----+-----+-----+
> +     *  |          r0           | 0
> +     *  +-----+-----+-----+-----+
> +     *  | r7  |  r6 |  r5 |/////| 32
> +     *  +-----+-----+-----+-----|
> +     *  |///////////////////////| 64
> +     *  +-----------+-----------+
> +     *  |///////////|    r12    | 96
> +     *  +-----------+-----------+
> +     *             ...
> +     *  / = empty.
> +     */
> +
> +    /* Try to add an overlapping register handler. */
> +    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
> +
> +    /* Try to add a non-aligned register. */
> +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
> +
> +    /* Try to add a register with wrong size. */
> +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
> +
> +    /* Try to add a register with missing handlers. */
> +    VPCI_ADD_INVALID_REG(vpci_read16, NULL, 8, 2);
> +    VPCI_ADD_INVALID_REG(NULL, vpci_write16, 8, 2);

Is that something which really is wrong in all cases? What about e.g.
r/o registers?
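
One possible shape for this, purely as a sketch: accept a NULL write
handler and substitute one that discards the value. The handler types,
struct reg_handler and add_register() below are simplified,
hypothetical stand-ins and not the signatures used by the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified handler types. */
typedef int (*reg_read_t)(unsigned int reg, uint32_t *val, void *data);
typedef int (*reg_write_t)(unsigned int reg, uint32_t val, void *data);

struct reg_handler {
    reg_read_t read;
    reg_write_t write;
    void *data;
};

/* Default write handler for r/o registers: discard the value. */
static int write_ignore(unsigned int reg, uint32_t val, void *data)
{
    (void)reg;
    (void)val;
    (void)data;

    return 0;
}

/* Example read handler returning the 16-bit value stored behind 'data'. */
static int read_stored16(unsigned int reg, uint32_t *val, void *data)
{
    (void)reg;
    assert(data != NULL);
    *val = *(const uint16_t *)data;

    return 0;
}

/* Accept a missing write handler, but still reject a missing read one. */
static int add_register(struct reg_handler *h, reg_read_t read,
                        reg_write_t write, void *data)
{
    if ( !read )
        return -EINVAL;

    h->read = read;
    h->write = write ? write : write_ignore;
    h->data = data;

    return 0;
}
```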

> +    /* Read/write of unset register. */
> +    VPCI_READ_CHECK(8, 4, 0xffffffff);
> +    VPCI_READ_CHECK(8, 2, 0xffff);
> +    VPCI_READ_CHECK(8, 1, 0xff);
> +    VPCI_WRITE(10, 2, 0xbeef);
> +    VPCI_READ_CHECK(10, 2, 0xffff);
> +
> +    /* Read of multiple registers */
> +    VPCI_CHECK_REG(7, 1, 0xbd);
> +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);

I think a variant accessing mixed size registers would also be
desirable here. Perhaps it would be best to exhaustively test
all possible variations (there aren't that many after all). Same
for writes and partial accesses (below) then.

> @@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
>      handler->ops = &g2m_portio_ops;
>  }
>  
> +/* Do some sanity checks. */
> +static int vpci_access_check(unsigned int reg, unsigned int len)
> +{
> +    /* Check access size. */
> +    if ( len != 1 && len != 2 && len != 4 )
> +    {
> +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
> +                 reg, len);

I think many of such gdprintk()s want to go away before this series
gets committed.

> +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,

Plain bool please.

> +                                 const ioreq_t *p)
> +{
> +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;
> +}
> +
> +static int vpci_portio_read(const struct hvm_io_handler *handler,
> +                            uint64_t addr, uint32_t size, uint64_t *data)
> +{
> +    struct domain *d = current->domain;
> +    unsigned int bus, devfn, reg;
> +    uint32_t data32;
> +    int rc;
> +
> +    vpci_lock(d);
> +    if ( addr == 0xcf8 )
> +    {
> +        ASSERT(size == 4);
> +        *data = d->arch.hvm_domain.pci_cf8;
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +    else if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )

Pointless "else".

> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;

You need to write to *data here, or else you need to return
false from vpci_portio_accept() already in this case (but then
you'd need to follow the stdvga model and take the lock
there, releasing it in a .complete handler).

> +    }
> +
> +    /* Decode the PCI address. */
> +    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
> +
> +    if ( vpci_access_check(reg, size) || reg >= 0xff )

This should be > 0xff or >= 0x100, but the check is pointless as
hvm_pci_decode_addr() won't return larger values.

> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_UNHANDLEABLE;

I don't think this matches real hardware behavior. If this "fails"
at all, surely by returning all ones.
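
To illustrate what "returning all ones" could look like, here is a minimal
sketch. The helper names all_ones() and read_or_ones() are made up for this
example and are not part of the patch; the only assumption is that access
sizes are 1, 2 or 4 bytes.

```c
#include <stdint.h>

/*
 * All-ones value for an access of the given size in bytes (1, 2 or 4).
 * Hypothetical helper, not part of the patch.
 */
static uint32_t all_ones(unsigned int size)
{
    return size >= 4 ? UINT32_MAX : (1u << (size * 8)) - 1;
}

/*
 * Sketch of a read path that never reports failure to the emulator:
 * a failed lookup (rc != 0) simply reads as all ones.
 */
static uint32_t read_or_ones(int rc, uint32_t value, unsigned int size)
{
    return rc ? all_ones(size) : (value & all_ones(size));
}
```

With something along these lines the port I/O handler could always return
X86EMUL_OKAY.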

> +    }
> +
> +    rc = xen_vpci_read(0, bus, devfn, reg, size, &data32);
> +    if ( !rc )
> +        *data = data32;
> +    vpci_unlock(d);

Please set *data outside the locked region.

And since there's no best place to make this other remark - I'd
prefer if you either kept together SBDF in one value when passing
this as arguments to functions, or alternatively pass this as four
values rather than keeping devfn artificially together.
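
As a rough sketch of the first option (SBDF kept together in one value), the
following uses a made-up sbdf_t type and made-up accessor names; the bit
layout (segment in bits 31:16, bus in 15:8, device in 7:3, function in 2:0)
is the conventional one, but nothing here is taken from the patch.

```c
#include <stdint.h>

/* Hypothetical packed segment/bus/device/function value. */
typedef uint32_t sbdf_t;

static inline sbdf_t sbdf_pack(unsigned int seg, unsigned int bus,
                               unsigned int dev, unsigned int func)
{
    return ((seg & 0xffffu) << 16) | ((bus & 0xffu) << 8) |
           ((dev & 0x1fu) << 3) | (func & 0x7u);
}

/* Accessors for the individual fields. */
static inline unsigned int sbdf_seg(sbdf_t s)  { return s >> 16; }
static inline unsigned int sbdf_bus(sbdf_t s)  { return (s >> 8) & 0xff; }
static inline unsigned int sbdf_dev(sbdf_t s)  { return (s >> 3) & 0x1f; }
static inline unsigned int sbdf_func(sbdf_t s) { return s & 0x7; }
```

Functions would then take a single sbdf_t instead of the (seg, bus, devfn)
triplet.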

> +     return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
> +}

Again the question - what's the bare hardware equivalent of
returning X86EMUL_UNHANDLEABLE here?

> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -1177,6 +1177,9 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
>           CF8_ENABLED(cf8) )
>      {
>          uint32_t sbdf, x86_fam;
> +        unsigned int bus, devfn, reg;
> +
> +        hvm_pci_decode_addr(cf8, p->addr, &bus, &devfn, &reg);
>  
>          /* PCI config data cycle */
>  
> @@ -1186,9 +1189,7 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
>                                   PCI_FUNC(CF8_BDF(cf8)));

Any reason you don't use bus and devfn (really dev/slot and func)
in the expression the tail of which is visible here?

> --- a/xen/arch/x86/xen.lds.S
> +++ b/xen/arch/x86/xen.lds.S
> @@ -224,6 +224,9 @@ SECTIONS
>         __start_schedulers_array = .;
>         *(.data.schedulers)
>         __end_schedulers_array = .;
> +       __start_vpci_array = .;
> +       *(.data.vpci)
> +       __end_vpci_array = .;

With vpci.c declaring these const, they should go into .rodata.
With the type name further being vpci_register_init_t it may even
be next to .init.rodata where they belong.

> @@ -1041,6 +1042,8 @@ static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>          devfn += pdev->phantom_stride;
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
> +
> +    xen_vpci_add_handlers(pdev);
> }

You're losing an error code here.

> --- /dev/null
> +++ b/xen/drivers/vpci/Makefile
> @@ -0,0 +1 @@
> +obj-y += vpci.o

Without having seen further patches it's not clear whether this really
needs its own directory.

> --- /dev/null
> +++ b/xen/drivers/vpci/vpci.c
> @@ -0,0 +1,469 @@
> +/*
> + * Generic functionality for handling accesses to the PCI configuration space
> + * from guests.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +
> +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> +#define vpci_init __start_vpci_array

What is this last one good for?

> +/* Internal struct to store the emulated PCI registers. */
> +struct vpci_register {
> +    vpci_read_t read;
> +    vpci_write_t write;

These two are pointers - please change the typedefs so that they're
visibly pointers here. That'll then also allow the typedef to be used to
declare actual handlers, should any such declarations be needed (e.g.
if the same handler can be used by two different source files).
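
A minimal sketch of what I mean, with simplified and purely illustrative
names and signatures (vpci_read_fn, struct reg_handler and read_const are
assumptions, not the patch's actual types):

```c
#include <stdint.h>

/* Function type (not pointer-to-function) typedef. */
typedef int vpci_read_fn(unsigned int reg, uint32_t *val, void *data);

struct reg_handler {
    vpci_read_fn *read;   /* visibly a pointer at the point of use */
    void *data;
};

/* The same typedef can declare a handler, e.g. in a shared header. */
vpci_read_fn read_const;

/* The definition still has to spell out the parameters. */
int read_const(unsigned int reg, uint32_t *val, void *data)
{
    (void)reg;
    *val = *(const uint32_t *)data;
    return 0;
}
```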

> +    unsigned int size;
> +    unsigned int offset;
> +    void *priv_data;

"private" (shorter and hence easier to type)?

> +    struct rb_node node;
> +};
> +
> +int xen_vpci_add_handlers(struct pci_dev *pdev)

__hwdom_init (I notice setup_one_hwdom_device() wrongly isn't
annotated so).

> +{
> +    int i, rc = 0;

i wants to be unsigned.

> +    if ( !has_vpci(pdev->domain) )
> +        return 0;
> +
> +    pdev->vpci = xzalloc(struct vpci);
> +    if ( !pdev->vpci )
> +        return -ENOMEM;
> +
> +    pdev->vpci->handlers = RB_ROOT;
> +
> +    for ( i = 0; i < NUM_VPCI_INIT; i++ )
> +    {
> +        rc = vpci_init[i](pdev);
> +        if ( rc )
> +            break;
> +    }
> +
> +    if ( rc )
> +    {
> +        struct rb_node *node = rb_first(&pdev->vpci->handlers);
> +        struct vpci_register *r;

Please move this into the more narrow scope below.

> +        /* Iterate over the tree and cleanup. */
> +        while ( node != NULL )
> +        {
> +            r = container_of(node, struct vpci_register, node);
> +            node = rb_next(node);
> +            rb_erase(&r->node, &pdev->vpci->handlers);
> +            xfree(r);
> +        }
> +        xfree(pdev->vpci);
> +    }
> +
> +    return rc;
> +}
> +
> +static bool vpci_register_overlap(const struct vpci_register *r,
> +                                  unsigned int offset)
> +{
> +    if ( offset >= r->offset && offset < r->offset + r->size )
> +        return true;
> +
> +    return false;

This can be one single return statement.

> +}
> +
> +

Stray double blank lines.

> +static int vpci_register_cmp(const struct vpci_register *r1,
> +                             const struct vpci_register *r2)
> +{
> +    /* Make sure there's no overlap between registers. */
> +    if ( vpci_register_overlap(r1, r2->offset) ||
> +         vpci_register_overlap(r1, r2->offset + r2->size - 1) ||
> +         vpci_register_overlap(r2, r1->offset) ||
> +         vpci_register_overlap(r2, r1->offset + r1->size - 1) )

Overlap checks can generally be done with just two comparisons,
so I guess the parameters chosen for vpci_register_overlap()
aren't optimal. I guess you don't need the function at all, as you
could do all that's needed here:

    if ( r1->offset < r2->offset + r2->size &&
         r2->offset < r1->offset + r1->size )
        return 0;

The comment of course is somewhat misleading here too, as
returning zero isn't really an error indication.
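
For reference, a self-contained version of that two-comparison check, with a
simplified stand-in for struct vpci_register (struct reg_range and the
function names are illustrative only):

```c
#include <stdbool.h>

/* Illustrative stand-in for the patch's struct vpci_register. */
struct reg_range {
    unsigned int offset;
    unsigned int size;
};

/*
 * Two half-open ranges [offset, offset + size) overlap if and only if
 * each one starts before the other one ends.
 */
static bool ranges_overlap(const struct reg_range *r1,
                           const struct reg_range *r2)
{
    return r1->offset < r2->offset + r2->size &&
           r2->offset < r1->offset + r1->size;
}

/* Comparator: 0 for overlapping ranges, otherwise ordered by offset. */
static int range_cmp(const struct reg_range *r1, const struct reg_range *r2)
{
    if ( ranges_overlap(r1, r2) )
        return 0;

    return r1->offset < r2->offset ? -1 : 1;
}
```

Note that once overlap has been excluded the offsets can't be equal (assuming
non-zero sizes), so no third case is needed in the comparator.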

> +        return 0;
> +
> +    if (r1->offset < r2->offset)
> +        return -1;
> +    else if (r1->offset > r2->offset)
> +        return 1;

Coding style.

> +    ASSERT_UNREACHABLE();
> +    return 0;
> +}
> +
> +static struct vpci_register *vpci_find_register(const struct pci_dev *pdev,
> +                                                const unsigned int reg,
> +                                                const unsigned int size)
> +{
> +    struct rb_node *node;

const

> +    struct vpci_register r = {
> +        .offset = reg,
> +        .size = size,
> +    };
> +
> +    ASSERT(vpci_locked(pdev->domain));
> +
> +    node = pdev->vpci->handlers.rb_node;
> +    while ( node )
> +    {
> +        struct vpci_register *t =

const

> +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> +                          vpci_write_t write_handler, unsigned int offset,
> +                          unsigned int size, void *data)
> +{
> +    struct rb_node **new, *parent;
> +    struct vpci_register *r;
> +
> +    /* Some sanity checks. */
> +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||

Off by one again in the offset check.

> +         offset & (size - 1) || read_handler == NULL || write_handler == NULL )

As said, I'm not convinced either of the read or write handlers
being NULL is really a mistake. Both of them being NULL surely
is.

> +        return -EINVAL;
> +
> +    r = xzalloc(struct vpci_register);

Looks like xmalloc() would be fine here - you initialize all fields.

> +    if ( !r )
> +        return -ENOMEM;
> +
> +    r->read = read_handler;
> +    r->write = write_handler;
> +    r->size = size;
> +    r->offset = offset;
> +    r->priv_data = data;
> +
> +    vpci_lock(pdev->domain);
> +    new = &pdev->vpci->handlers.rb_node;
> +    parent = NULL;
> +
> +    while (*new) {

Coding style.

> +        struct vpci_register *this =

const

> +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
> +{
> +    struct vpci_register *r;
> +
> +    vpci_lock(pdev->domain);
> +    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);

I'm not sure about this - is there anything wrong with the caller,
knowing the size, also passing it? You could then even refuse
requests to remove a register where (offset,size) doesn't match
the recorded values (as vpci_find_register() will return any
overlapping one).

> +    if ( !r )
> +    {
> +        vpci_unlock(pdev->domain);
> +        return -ENOENT;
> +    }
> +
> +    rb_erase(&r->node, &pdev->vpci->handlers);
> +    xfree(r);
> +    vpci_unlock(pdev->domain);

Please swap xfree() and unlock.

> +static void vpci_read_hw(unsigned int seg, unsigned int bus,
> +                         unsigned int devfn, unsigned int reg, uint32_t size,
> +                         uint32_t *data)

Instead of passing a pointer to the result, please consider returning
the value, as the function doesn't return anything at present.

> +{
> +    switch ( size )
> +    {
> +    case 4:
> +        *data = pci_conf_read32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> +                                reg);
> +        break;
> +    case 3:
> +        /*
> +         * This is possible because a 4byte read can have 1byte trapped and
> +         * the rest passed-through.
> +         */
> +        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> +                                reg + 1) << 8;
> +        *data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> +                               reg);

Which of the two parts to read with read16() should depend on the
low bit of reg. Also for maximum compatibility I'd strongly suggest
reading the low part before the high one.
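
Roughly like this, as a sketch only: cfg[], conf_read8() and conf_read16()
are stand-ins for the real config space and pci_conf_read{8,16}(), and
conf_read24() is a made-up name.

```c
#include <stdint.h>

/* Fake 256-byte config space standing in for the real device. */
static uint8_t cfg[256];

static uint32_t conf_read8(unsigned int reg)
{
    return cfg[reg];
}

/* Little-endian 16-bit read; reg is expected to be 2-byte aligned. */
static uint32_t conf_read16(unsigned int reg)
{
    return cfg[reg] | ((uint32_t)cfg[reg + 1] << 8);
}

/* 3-byte read: low part first, 16-bit access on the aligned half. */
static uint32_t conf_read24(unsigned int reg)
{
    uint32_t data;

    if ( reg & 1 )
    {
        /* Odd offset: single byte first, aligned 16-bit part second. */
        data = conf_read8(reg);
        data |= conf_read16(reg + 1) << 8;
    }
    else
    {
        /* Even offset: aligned 16-bit part first, single byte second. */
        data = conf_read16(reg);
        data |= conf_read8(reg + 2) << 16;
    }

    return data;
}
```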

> +/* Helper macros for the read/write handlers. */
> +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)

What do e and s stand for here?

> +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8

And at least o here?

> +#define ADD_RESULT(r, d, s, o) r |= ((d) & GENMASK_BYTES(s, 0)) << ((o) * 8)

And d, s, and o here?

Also I can't see what addition you would want to perform below.
All you ought to do are ANDs and ORs.

> +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,

The function being other than void, same question as earlier:
What's the bare hardware equivalent of this returning other
than zero?

> +                  unsigned int reg, uint32_t size, uint32_t *data)
> +{
> +    struct domain *d = current->domain;
> +    struct pci_dev *pdev;
> +    const struct vpci_register *r;
> +    union vpci_val val = { .double_word = 0 };
> +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> +    uint32_t tmp_data;
> +    int rc;
> +
> +    ASSERT(vpci_locked(d));
> +
> +    *data = 0;
> +
> +    /* Find the PCI dev matching the address. */
> +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);

What about the global PCI devices lock here? While VT-d code,
perhaps wrongly, doesn't acquire that lock prior to calling the
function, all callers in passthrough/pci.c do or verify it is being
held.

> +    if ( !pdev )
> +        goto passthrough;
> +
> +    /* Find the vPCI register handler. */
> +    r = vpci_find_register(pdev, reg, size);

With the overlap handling in vpci_find_register() I can't see how
this would reliably return the correct (lowest) register when the
request spans multiple ones.
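
To make the requirement concrete, here is a sketch of a lookup that does
return the lowest overlapping register. It uses a sorted array instead of the
rb-tree, purely to keep the example short; struct reg and find_lowest() are
illustrative names, and the array is assumed to be sorted by offset with no
overlapping entries.

```c
#include <stddef.h>

/* Illustrative stand-in for struct vpci_register. */
struct reg {
    unsigned int offset;
    unsigned int size;
};

/*
 * Return the lowest register overlapping [reg, reg + size), or NULL.
 * regs[] must be sorted by offset and free of overlaps.
 */
static const struct reg *find_lowest(const struct reg *regs, size_t n,
                                     unsigned int reg, unsigned int size)
{
    size_t lo = 0, hi = n;

    /* Binary search for the first register ending after reg. */
    while ( lo < hi )
    {
        size_t mid = lo + (hi - lo) / 2;

        if ( regs[mid].offset + regs[mid].size <= reg )
            lo = mid + 1;
        else
            hi = mid;
    }

    /* It only counts if it also starts before the end of the access. */
    if ( lo < n && regs[lo].offset < reg + size )
        return &regs[lo];

    return NULL;
}
```

With the tree, the equivalent would be to keep descending to the left after
finding a match.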

> +    if ( !r )
> +        goto passthrough;
> +
> +    if ( r->offset > reg )
> +    {
> +        /*
> +         * There's a heading gap into the emulated register.
> +         * NB: it's possible for this recursive call to have a size of 3.
> +         */
> +        rc = xen_vpci_read(seg, bus, devfn, reg, r->offset - reg, &tmp_data);

I'm not particularly happy to see recursion being used here, even if
that's not going to be very deep. Both qemu and pciback get away
without, iirc, and while it's not the neatest code I find qemu's easier
to follow than the apparently written from scratch variant here. Is
there a particular reason you didn't at least take what is there as a
basis?

> +        if ( rc )
> +            return rc;
> +
> +        /* Add the head read to the partial result. */
> +        ADD_RESULT(*data, tmp_data, r->offset - reg, 0);
> +        data_lshift = r->offset - reg;
> +
> +        /* Account for the read. */
> +        size -= data_lshift;
> +        reg += data_lshift;
> +    }
> +    else if ( r->offset < reg )
> +        /* There's an offset into the emulated register */
> +        data_rshift = reg - r->offset;

This could be a plain else, avoiding another conditional branch.

> +    ASSERT(data_lshift == 0 || data_rshift == 0);
> +    data_size = min(size, r->size - data_rshift);
> +    ASSERT(data_size != 0);
> +
> +    /* Perform the read of the register. */
> +    rc = r->read(pdev, r->offset, &val, r->priv_data);
> +    if ( rc )
> +        return rc;
> +
> +    val.double_word >>= data_rshift * 8;
> +    ADD_RESULT(*data, val.double_word, data_size, data_lshift);
> +
> +    /* Account for the read */
> +    size -= data_size;
> +    reg += data_size;
> +
> +    /* Read the remaining, if any. */
> +    if ( size > 0 )
> +    {
> +        /*
> +         * Read tailing data.

trailing?

> +static int vpci_write_helper(struct pci_dev *pdev,
> +                             const struct vpci_register *r, unsigned int size,
> +                             unsigned int offset, uint32_t data)
> +{
> +    union vpci_val val = { .double_word = data };
> +    int rc;
> +
> +    ASSERT(size <= r->size);
> +    if ( size != r->size )
> +    {
> +        rc = r->read(pdev, r->offset, &val, r->priv_data);
> +        if ( rc )
> +            return rc;
> +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
> +        data &= GENMASK_BYTES(size, 0);
> +        val.double_word |= data << (offset * 8);
> +    }
> +
> +    return r->write(pdev, r->offset, val, r->priv_data);
> +}

I'm not sure that writing back the value read is correct in all cases
(think of write-only or rw1c registers or even offsets where reads
and writes access different registers altogether). I think the write
handlers will need to be made capable of dealing with partial writes.
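
A small sketch of the rw1c problem, assuming a hypothetical 16-bit register
in which every bit is write-1-to-clear. All names are made up; the first
function mimics the read-merge-write done by vpci_write_helper(), the second
one a handler that is told which bytes are actually being written.

```c
#include <stdint.h>

/* Byte-lane mask for a size-byte access at byte offset (size 1 or 2). */
static uint16_t lane_mask(unsigned int offset, unsigned int size)
{
    return (uint16_t)(((1u << (size * 8)) - 1) << (offset * 8));
}

/* Read-merge-write: unwritten bytes are filled with the value read back. */
static uint16_t rw1c_write_merged(uint16_t reg, unsigned int offset,
                                  unsigned int size, uint32_t data)
{
    uint16_t mask = lane_mask(offset, size);
    uint16_t val = (reg & ~mask) | ((data << (offset * 8)) & mask);

    /* Write-1-to-clear: every 1 written clears the corresponding bit. */
    return reg & ~val;
}

/* Size-aware handler: only the bytes actually written take effect. */
static uint16_t rw1c_write_partial(uint16_t reg, unsigned int offset,
                                   unsigned int size, uint32_t data)
{
    uint16_t val = (data << (offset * 8)) & lane_mask(offset, size);

    return reg & ~val;
}
```

Starting from 0x8101, a one-byte write of 0x01 at offset 0 ought to leave
0x8100. The merged variant instead clears the whole register, because the
set bit in the upper byte gets written back as a 1.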

> +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> +                   unsigned int reg, uint32_t size, uint32_t data)
> +{
> +    struct domain *d = current->domain;
> +    struct pci_dev *pdev;
> +    const struct vpci_register *r;
> +    unsigned int data_size, data_offset = 0;
> +    int rc;
> +
> +    ASSERT(vpci_locked(d));
> +
> +    /* Find the PCI dev matching the address. */
> +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> +    if ( !pdev )
> +        goto passthrough;
> +
> +    /* Find the vPCI register handler. */
> +    r = vpci_find_register(pdev, reg, size);
> +    if ( !r )
> +        goto passthrough;
> +
> +    else if ( r->offset > reg )

Pointless "else" again, even more so with the blank line in between.

> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -13,6 +13,7 @@
>  #include <xen/irq.h>
>  #include <xen/pci_regs.h>
>  #include <xen/pfn.h>
> +#include <xen/rbtree.h>

Why? All you add to this file is ...

> @@ -88,6 +89,9 @@ struct pci_dev {
>  #define PT_FAULT_THRESHOLD 10
>      } fault;
>      u64 vf_rlen[6];
> +
> +    /* Data for vPCI. */
> +    struct vpci *vpci;

... this. I guess you really want to add the #include ...

> --- /dev/null
> +++ b/xen/include/xen/vpci.h
> @@ -0,0 +1,66 @@
> +#ifndef _VPCI_
> +#define _VPCI_
> +
> +#include <xen/pci.h>
> +#include <xen/types.h>

... here.

> +/* Helpers for locking/unlocking. */
> +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)

While for the code layering you don't need recursive locks, did you
consider using them nevertheless so that spin_is_locked() return
values are actually meaningful for your purposes?

> +#define REGISTER_VPCI_INIT(x) \
> +  static const vpci_register_init_t x##_entry __used_section(".data.vpci") = x

To match up with the type name and assuming "REGISTER" here
means the PCI register rather than "registration", I think this
would better be VPCI_REGISTER() (I don't really mind the _INIT
suffix, but I think it's relatively pointless).

> +/* Add vPCI handlers to device. */
> +int xen_vpci_add_handlers(struct pci_dev *dev);
> +
> +/* Add/remove a register handler. */
> +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> +                          vpci_write_t write_handler, unsigned int offset,
> +                          unsigned int size, void *data);
> +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset);
> +
> +/* Generic read/write handlers for the PCI config space. */
> +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> +                  unsigned int reg, uint32_t size, uint32_t *data);
> +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> +                   unsigned int reg, uint32_t size, uint32_t data);

Along the lines of what I've said in a few places about return values,
please carefully consider where they're needed. Once you decide
they are really needed, the respective functions would likely want to
become __must_check.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH v3 2/9] x86/ecam: add handlers for the PVH Dom0 MMCFG areas
  2017-04-27 14:35 ` [PATCH v3 2/9] x86/ecam: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
@ 2017-05-19 13:25   ` Jan Beulich
  2017-06-20 11:56     ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-19 13:25 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, julien.grall, Paul Durrant, xen-devel, boris.ostrovsky

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> @@ -1048,6 +1050,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
>      return 0;
>  }
>  
> +int __init pvh_setup_ecam(struct domain *d)

While I won't object to the term ecam in title and description,
please use mmcfg uniformly in code - that's the way we name
the thing everywhere else.

> +{
> +    unsigned int i;
> +    int rc;
> +
> +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
> +    {
> +        rc = register_vpci_ecam_handler(d, pci_mmcfg_config[i].address,
> +                                        pci_mmcfg_config[i].start_bus_number,
> +                                        pci_mmcfg_config[i].end_bus_number,
> +                                        pci_mmcfg_config[i].pci_segment);
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    return 0;
> +}

What about regions becoming available only post-boot?

> @@ -752,6 +754,14 @@ void hvm_domain_destroy(struct domain *d)
>          list_del(&ioport->list);
>          xfree(ioport);
>      }
> +
> +    list_for_each_entry_safe ( ecam, etmp, &d->arch.hvm_domain.ecam_regions,
> +                               next )
> +    {
> +        list_del(&ecam->next);
> +        xfree(ecam);
> +    }
> +
>  }

Stray blank line. Of course the addition is of questionable use
anyway as long as all of this is Dom0 only.

> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -403,6 +403,145 @@ void register_vpci_portio_handler(struct domain *d)
>      handler->ops = &vpci_portio_ops;
>  }
>  
> +/* Handlers to trap PCI ECAM config accesses. */
> +static struct hvm_ecam *vpci_ecam_find(struct domain *d, unsigned long addr)

Logically d should be a pointer to const, and I think no caller really
needs you to return a pointer to non-const.

> +{
> +    struct hvm_ecam *ecam = NULL;

Pointless initializer.

> +static void vpci_ecam_decode_addr(struct hvm_ecam *ecam, unsigned long addr,

const

> +static int vpci_ecam_accept(struct vcpu *v, unsigned long addr)
> +{
> +    struct domain *d = v->domain;
> +    int found;
> +
> +    vpci_lock(d);
> +    found = !!vpci_ecam_find(v->domain, addr);

Please use the local variable consistently.

> +static int vpci_ecam_read(struct vcpu *v, unsigned long addr,

Did I overlook this in patch 1? Why is this a vcpu instead of a
domain parameter? All of PCI is (virtual) machine wide...

> +                          unsigned int len, unsigned long *data)
> +{
> +    struct domain *d = v->domain;
> +    struct hvm_ecam *ecam;
> +    unsigned int bus, devfn, reg;
> +    uint32_t data32;
> +    int rc;
> +
> +    vpci_lock(d);
> +    ecam = vpci_ecam_find(d, addr);
> +    if ( !ecam )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> +
> +    vpci_ecam_decode_addr(ecam, addr, &bus, &devfn, &reg);
> +
> +    if ( vpci_access_check(reg, len) || reg >= 0xfff )

So this function iirc allows only 1-, 2-, and 4-byte accesses. Unlike
port I/O, MMCFG allows wider ones, and once again I don't think
hardware would raise any kind of fault in such a case. The general
expectation is for the fabric to split such accesses.

Also the reg check is once again off by one.
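
As a sketch of both points (splitting wider accesses, and the bounds check),
under the assumption that accesses are 1, 2, 4 or 8 bytes and naturally
aligned. cfg[], cfg_read(), mmcfg_read() and mmcfg_reg_valid() are made-up
stand-ins, not functions from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fake 4K extended config space standing in for the real device. */
static uint8_t cfg[0x1000];

/* Valid register offsets are 0 ... 0xfff inclusive. */
static bool mmcfg_reg_valid(unsigned int reg)
{
    return reg <= 0xfff;
}

/* Stand-in for a 1-, 2- or 4-byte little-endian config space read. */
static uint32_t cfg_read(unsigned int reg, unsigned int len)
{
    uint32_t val = 0;
    unsigned int i;

    for ( i = 0; i < len; i++ )
        val |= (uint32_t)cfg[reg + i] << (i * 8);

    return val;
}

/* Read of up to 8 bytes, split into chunks of at most 4 bytes. */
static uint64_t mmcfg_read(unsigned int reg, unsigned int len)
{
    uint64_t data = 0;
    unsigned int done = 0;

    while ( done < len )
    {
        unsigned int chunk = len - done > 4 ? 4 : len - done;

        data |= (uint64_t)cfg_read(reg + done, chunk) << (done * 8);
        done += chunk;
    }

    return data;
}
```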

> +int register_vpci_ecam_handler(struct domain *d, paddr_t addr,
> +                               unsigned int start_bus, unsigned int end_bus,
> +                               unsigned int seg)
> +{
> +    struct hvm_ecam *ecam;
> +
> +    ASSERT(is_hardware_domain(d));
> +
> +    vpci_lock(d);
> +    if ( vpci_ecam_find(d, addr) )
> +    {
> +        vpci_unlock(d);
> +        return -EEXIST;
> +    }
> +
> +    ecam = xzalloc(struct hvm_ecam);

xmalloc() would again suffice afaict.

> --- a/xen/include/asm-x86/hvm/domain.h
> +++ b/xen/include/asm-x86/hvm/domain.h
> @@ -100,6 +100,14 @@ struct hvm_pi_ops {
>      void (*do_resume)(struct vcpu *v);
>  };
>  
> +struct hvm_ecam {
> +    paddr_t addr;
> +    size_t size;
> +    unsigned int bus;
> +    unsigned int segment;
> +    struct list_head next;
> +};

If you moved the addition to hvm_domain_destroy() into a function
in hvm/io.c, this type could be private to that latter file afaict.

Jan

* Re: [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-04-27 14:35 ` [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
@ 2017-05-19 13:35   ` Jan Beulich
  2017-06-21 11:11     ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-19 13:35 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> And also allow it to do non-identity mappings by adding a new parameter. This
> function will be needed in other parts apart from PVH Dom0 build. While there
> fix the function to use gfn_t and mfn_t instead of unsigned long for memory
> addresses.

I'm afraid both title and description don't (or no longer) properly reflect
what the patch does. I'm also afraid the reasons for the new parameter as
well as for the placement in common/memory.c aren't sufficiently explained.
For example, what use is the function going to be without
CONFIG_HAS_PCI?

> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -64,27 +64,7 @@ static struct acpi_madt_nmi_source __initdata *nmisrc;
>  static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
>                                         unsigned long nr_pages, const bool map)
>  {
> -    int rc;
> -
> -    for ( ; ; )
> -    {
> -        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> -             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> -        if ( rc == 0 )
> -            break;
> -        if ( rc < 0 )
> -        {
> -            printk(XENLOG_WARNING
> -                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
> -                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> -            break;
> -        }
> -        nr_pages -= rc;
> -        pfn += rc;
> -        process_pending_softirqs();
> -    }
> -
> -    return rc;
> +    return modify_mmio(d, _gfn(pfn), _mfn(pfn), nr_pages, map);
>  }

I don't see the value of retaining this wrapper.

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1438,6 +1438,42 @@ int prepare_ring_for_helper(
>      return 0;
>  }
>  
> +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> +                const bool map)
> +{
> +    int rc;
> +
> +    /*
> +     * Make sure this function is only used by the hardware domain, because it
> +     * can take an arbitrary long time, and could DoS the whole system.
> +     */
> +    ASSERT(is_hardware_domain(d));

If that can happen arbitrarily at run time (rather than just at boot,
as suggested by the removal of __init), it definitely can't remain as
is and will instead need to make use of continuations. I'm therefore
unconvinced you really want to move this code instead of simply
calling {,un}map_mmio_regions() while taking care of preemption
needs.

Jan


* Re: [PATCH v3 4/9] xen/pci: split code to size BARs from pci_add_device
  2017-04-27 14:35 ` [PATCH v3 4/9] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2017-05-19 13:56   ` Jan Beulich
  2017-06-21 15:16     ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-19 13:56 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, julien.grall, boris.ostrovsky

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -588,6 +588,51 @@ static void pci_enable_acs(struct pci_dev *pdev)
>      pci_conf_write16(seg, bus, dev, func, pos + PCI_ACS_CTRL, ctrl);
>  }
>  
> +int pci_size_bar(unsigned int seg, unsigned int bus, unsigned int slot,
> +                 unsigned int func, unsigned int base, unsigned int max_bars,
> +                 unsigned int *index, uint64_t *addr, uint64_t *size)
> +{
> +    unsigned int idx = base + *index * 4;

The parameter and variable naming looks confusing to me. Generally
we call what we pass to pci_conf_read*() "pos". I think it would be
better to have the caller pass pos and a boolean indicator whether
another BAR is following, reducing the (base,max_bars,index)
triplet to a pair, and the function returning a negative error or the
(positive) number of BARs to increment by (all the more so since you
leave half of the incrementing to the caller anyway).
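
To sketch the return convention I have in mind, without any config space
access: mem_bar_slots() and mem_bar_size() are made-up names, and the
constants are local copies of the usual PCI_BASE_ADDRESS_MEM_* values so that
the example is self-contained. The size calculation mirrors what
pci_add_device() does today with the values read back after writing all ones.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the PCI_BASE_ADDRESS_MEM_* values from pci_regs.h. */
#define BAR_MEM_TYPE_MASK 0x06u
#define BAR_MEM_TYPE_64   0x04u
#define BAR_MEM_MASK      (~0x0fu)

/*
 * Number of BAR slots a memory BAR occupies (1 or 2), or -EINVAL for a
 * 64-bit BAR in the last slot (last == true means no BAR follows).
 */
static int mem_bar_slots(uint32_t bar, bool last)
{
    if ( (bar & BAR_MEM_TYPE_MASK) != BAR_MEM_TYPE_64 )
        return 1;

    return last ? -EINVAL : 2;
}

/* Size from the values read back after writing all ones to the BAR. */
static uint64_t mem_bar_size(uint32_t lo, uint32_t hi, bool is64)
{
    uint64_t size = lo & BAR_MEM_MASK;

    if ( is64 )
        size |= (uint64_t)hi << 32;
    else if ( size )
        size |= ~(uint64_t)0 << 32;

    /* Two's complement negation yields the size of the decoded range. */
    return ~size + 1;
}
```

The caller's loop would then simply do "i += ret" on success.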

> +    u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
> +    u32 hi = 0;
> +
> +    *addr = *size = 0;

With addr not needed by the current only caller, please allow passing
NULL there. I'm also unconvinced these initializations are actually
needed.

> +    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);

With this, the function name should be more like pci_size_mem_bar().

> +    pci_conf_write32(seg, bus, slot, func, idx, ~0);
> +    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +         PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +    {
> +        if ( *index >= max_bars )
> +        {
> +            dprintk(XENLOG_WARNING,
> +                    "device %04x:%02x:%02x.%u with 64-bit BAR in last slot\n",
> +                    seg, bus, slot, func);

This was a normal printk() originally.

> @@ -663,38 +708,13 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>                             seg, bus, slot, func, i);
>                      continue;
>                  }
> -                pci_conf_write32(seg, bus, slot, func, idx, ~0);
> -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> -                {
> -                    if ( i >= PCI_SRIOV_NUM_BARS )
> -                    {
> -                        printk(XENLOG_WARNING
> -                               "SR-IOV device %04x:%02x:%02x.%u with 64-bit"
> -                               " vf BAR in last slot\n",
> -                               seg, bus, slot, func);
> -                        break;
> -                    }
> -                    hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
> -                    pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
> -                }
> -                pdev->vf_rlen[i] = pci_conf_read32(seg, bus, slot, func, idx) &
> -                                   PCI_BASE_ADDRESS_MEM_MASK;
> -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> -                {
> -                    pdev->vf_rlen[i] |= (u64)pci_conf_read32(seg, bus,
> -                                                             slot, func,
> -                                                             idx + 4) << 32;
> -                    pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
> -                }
> -                else if ( pdev->vf_rlen[i] )
> -                    pdev->vf_rlen[i] |= (u64)~0 << 32;
> -                pci_conf_write32(seg, bus, slot, func, idx, bar);
> -                pdev->vf_rlen[i] = -pdev->vf_rlen[i];
> -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> -                    ++i;
> +                ret = pci_size_bar(seg, bus, slot, func, pos + PCI_SRIOV_BAR,
> +                                   PCI_SRIOV_NUM_BARS, &i, &addr,
> +                                   &pdev->vf_rlen[i]);
> +                if ( ret )
> +                    dprintk(XENLOG_WARNING,
> +                            "%04x:%02x:%02x.%u: failed to size SR-IOV BAR%u\n",
> +                            seg, bus, slot, func, i);

You shouldn't log two messages for the same problem (the called
function already logs one).

A final, more general remark: Since you intend to call this function
from contexts other than pci_add_device(), some further care will
likely be needed. For example, are all the callers to be added such
that your accesses to config space won't interfere with what Dom0
does? Are you sure you can get away without disabling memory decode
while fiddling with the BARs?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH v3 5/9] xen/vpci: add handlers to map the BARs
  2017-04-27 14:35 ` [PATCH v3 5/9] xen/vpci: add handlers to map the BARs Roger Pau Monne
@ 2017-05-19 15:21   ` Jan Beulich
  2017-05-22 11:38     ` Julien Grall
  2017-06-22 17:13     ` Roger Pau Monne
  0 siblings, 2 replies; 49+ messages in thread
From: Jan Beulich @ 2017-05-19 15:21 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, julien.grall, xen-devel, boris.ostrovsky

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> +static int vpci_modify_bars(struct pci_dev *pdev, const bool map)
> +{
> +    struct vpci_header *header = &pdev->vpci->header;
> +    unsigned int i;
> +    int rc = 0;
> +
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +    {
> +        paddr_t gaddr = map ? header->bars[i].gaddr
> +                            : header->bars[i].mapped_addr;
> +        paddr_t paddr = header->bars[i].paddr;
> +
> +        if ( header->bars[i].type != VPCI_BAR_MEM &&
> +             header->bars[i].type != VPCI_BAR_MEM64_LO )
> +            continue;
> +
> +        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(gaddr)),
> +                         _mfn(PFN_DOWN(paddr)), PFN_UP(header->bars[i].size),

The PFN_UP() indicates a problem: For sub-page BARs you can't
blindly map/unmap them without taking into consideration other
devices sharing the same page.
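
As an illustration of the concern, here is a minimal sketch of the check
a mapper would need before it may treat a BAR's pages as its own. This is
not Xen code: the helper name and the fixed 4k PAGE_SIZE are assumptions
made for the example only.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumption: 4k pages */

/*
 * True if the BAR covers whole pages only, i.e. no other device's
 * registers can live in the first or last page it touches.
 * (Hypothetical helper, named for illustration.)
 */
static bool bar_is_page_granular(uint64_t addr, uint64_t size)
{
    return size && !(addr & (PAGE_SIZE - 1)) && !(size & (PAGE_SIZE - 1));
}
```

A BAR failing this check shares at least one page with something else,
so mapping or unmapping PFN_UP(size) pages for it would affect that
neighbour as well.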

> +                         map);
> +        if ( rc )
> +            break;
> +
> +        header->bars[i].mapped_addr = map ? gaddr : 0;
> +    }
> +
> +    return rc;
> +}

Shouldn't this function somewhere honor the unset flags?

> +static int vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
> +                         union vpci_val *val, void *data)
> +{
> +    struct vpci_header *header = data;
> +
> +    val->word = header->command;

Rather than reading back and storing the value in the write handler,
I'd recommend doing an actual read here.

> +static int vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> +                          union vpci_val val, void *data)
> +{
> +    struct vpci_header *header = data;
> +    uint16_t new_cmd, saved_cmd;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    int rc;
> +
> +    new_cmd = val.word;
> +    saved_cmd = header->command;
> +
> +    if ( !((new_cmd ^ saved_cmd) & PCI_COMMAND_MEMORY) )
> +        goto out;
> +
> +    /* Memory space access change. */
> +    rc = vpci_modify_bars(pdev, new_cmd & PCI_COMMAND_MEMORY);
> +    if ( rc )
> +    {
> +        dprintk(XENLOG_ERR,
> +                "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
> +                seg, bus, slot, func,
> +                new_cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);
> +        return rc;

I guess you can guess the question already: What is the bare
hardware equivalent of this failure return?

> +    }
> +
> + out:

Please try to avoid goto-s and labels for other than error handling
(and even then only when code would otherwise end up pretty
convoluted).

> +static int vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
> +                         union vpci_val *val, void *data)
> +{
> +    struct vpci_bar *bar = data;

const

> +    bool hi = false;
> +
> +    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
> +           bar->type == VPCI_BAR_MEM64_HI);
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg - PCI_BASE_ADDRESS_0 > 0);

reg > PCI_BASE_ADDRESS_0

> +        bar--;
> +        hi = true;
> +    }
> +
> +    if ( bar->sizing )
> +        val->double_word = ~(bar->size - 1) >> (hi ? 32 : 0);

There's also a comment further down - this is producing undefined
behavior on 32-bit arches.
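
The undefined behavior comes from shifting by 32 a value whose type
(size_t in the patch) may itself be only 32 bits wide. A minimal sketch
of one way around it, computing the mask in a fixed 64-bit type; the
function name is hypothetical, and the read-only attribute bits of the
low dword are deliberately left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Value a BAR register returns while it is being sized.  'size' is the
 * BAR size in bytes (a power of two); 'hi' selects the upper dword of a
 * 64-bit BAR.  The mask is computed in a 64-bit type, so the shift by 32
 * is well defined regardless of the width of long or size_t.
 */
static uint32_t bar_sizing_value(uint64_t size, bool hi)
{
    uint64_t mask;

    assert(size && !(size & (size - 1)));
    mask = ~(size - 1);

    return (uint32_t)(hi ? mask >> 32 : mask);
}
```

For a 4k BAR this yields 0xfffff000 in the low half and all ones in the
high half; for an 8G BAR the low half reads as zero and the high half as
0xfffffffe.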

> +static int vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> +                          union vpci_val val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    uint32_t wdata = val.double_word;
> +    bool hi = false, unset = false;
> +
> +    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
> +           bar->type == VPCI_BAR_MEM64_HI);
> +
> +    if ( wdata == GENMASK(31, 0) )

I'm afraid this again doesn't match real hardware behavior: As the
low bits are r/o, writes with them having any value, but all other
bits being 1, should have the same effect. I notice that while I had
fixed this for the ROM BAR in Linux's pciback, I should have also
fixed this for ordinary ones.
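
A sketch of a check along these lines, which ignores the read-only low
bits of a memory BAR when deciding whether a write starts a sizing
cycle. The helper and the local mask name are assumptions for the
example; the mask value mirrors PCI_BASE_ADDRESS_MEM_MASK (~0x0f):

```c
#include <stdbool.h>
#include <stdint.h>

/* Low four bits of a memory BAR are read-only type/prefetch bits. */
#define BAR_MEM_ADDR_MASK 0xfffffff0u

/*
 * True if 'wdata' has all writable address bits set.  The upper half of
 * a 64-bit BAR has no attribute bits, so there all 32 bits count.
 */
static bool is_sizing_write(uint32_t wdata, bool hi)
{
    uint32_t mask = hi ? 0xffffffffu : BAR_MEM_ADDR_MASK;

    return (wdata & mask) == mask;
}
```

With this, writes of 0xffffffff, 0xfffffff0 and 0xfffffff4 to the low
half are all treated as sizing writes.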

> +    {
> +        /* Next reads from this register are going to return the BAR size. */
> +        bar->sizing = true;
> +        return 0;
> +    }
> +
> +    /* End previous sizing cycle if any. */
> +    bar->sizing = false;
> +
> +    unset = bar->unset;
> +    if ( unset )
> +        bar->unset = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg - PCI_BASE_ADDRESS_0 > 0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    /* Update the relevant part of the BAR address. */
> +    bar->gaddr &= hi ? ~GENMASK(63, 32) : ~GENMASK(31, 0);
> +    wdata &= hi ? GENMASK(31, 0) : PCI_BASE_ADDRESS_MEM_MASK;

Perhaps easier to grok as

    if ( !hi )
        wdata &= PCI_BASE_ADDRESS_MEM_MASK;

However, considering the dual use below, I'd prefer if you wrote
back the value you read to the low 4 bits. They're _supposed_ to
be r/o, yes, but anyway.

> +    bar->gaddr |= (uint64_t)wdata << (hi ? 32 : 0);
> +
> +    if ( unset )
> +    {
> +        bar->paddr = bar->gaddr;

So this deals with first time setting of the BAR by Dom0. If Dom0
later decides to move BARs around, how do you guarantee things
to continue to work fine if you allow paddr and gaddr to go out of
sync? Often the reason to do re-assignments is because the OS
recognized address conflicts. Or it needs to make room for SR-IOV
BARs.

> +        pci_conf_write16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                         PCI_FUNC(pdev->devfn), reg, wdata);

pci_conf_write32()

> +    }
> +
> +    ASSERT(IS_ALIGNED(bar->gaddr, PAGE_SIZE));

Urgh.

> +static int vpci_init_bars(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint8_t header_type;
> +    unsigned int i, num_bars;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *bars = header->bars;
> +    int rc;
> +
> +    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
> +    if ( header_type == PCI_HEADER_TYPE_NORMAL )
> +        num_bars = 6;
> +    else if ( header_type == PCI_HEADER_TYPE_BRIDGE )
> +        num_bars = 2;
> +    else
> +        return -ENOSYS;

-EOPNOTSUPP

> +    /* Setup a handler for the control register. */
> +    header->command = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);

As the code says, the register is the Command Register, so your
comment shouldn't say "control".

> +    rc = xen_vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write,
> +                               PCI_COMMAND, 2, header);
> +    if ( rc )
> +    {
> +        dprintk(XENLOG_ERR,
> +                "%04x:%02x:%02x.%u: failed to add handler register %#x: %d\n",
> +                seg, bus, slot, func, PCI_COMMAND, rc);
> +        return rc;
> +    }
> +
> +    for ( i = 0; i < num_bars; i++ )
> +    {
> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> +        uint64_t addr, size;
> +        unsigned int index;
> +
> +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> +        {
> +            bars[i].type = VPCI_BAR_MEM64_HI;
> +            bars[i].unset = bars[i - 1].unset;
> +            continue;

Neither here nor below do you install a handler for this upper half.

> +        }
> +        else if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> +        {
> +            bars[i].type = VPCI_BAR_IO;
> +            continue;
> +        }
> +        else if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==

Pointless "else" (twice).

> +                  PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +            bars[i].type = VPCI_BAR_MEM64_LO;
> +        else
> +            bars[i].type = VPCI_BAR_MEM;
> +
> +        /* Size the BAR and map it. */
> +        index = i;
> +        rc = pci_size_bar(seg, bus, slot, func, PCI_BASE_ADDRESS_0, num_bars,
> +                          &index, &addr, &size);
> +        if ( rc )
> +        {
> +            dprintk(XENLOG_ERR,
> +                    "%04x:%02x:%02x.%u: unable to size BAR#%u: %d\n",
> +                    seg, bus, slot, func, i, rc);
> +            return rc;
> +        }
> +
> +        if ( size == 0 )
> +        {
> +            bars[i].type = VPCI_BAR_EMPTY;
> +            continue;
> +        }
> +
> +        if ( (bars[i].type == VPCI_BAR_MEM && addr == GENMASK(31, 12)) ||
> +             addr == GENMASK(63, 26) )

Where is this 26 coming from?

Perhaps

    if ( addr == GENMASK(bars[i].type == VPCI_BAR_MEM ? 31 : 63, 12) )

? Albeit I'm unconvinced GENMASK() is useful to be used here anyway
(see also below).

> +        {
> +            /* BAR is not positioned. */

I can't find anything in the standard saying that all-ones upper
address bits indicate an unassigned BAR. As long as the memory
decode bit is off, all BARs are to be considered unassigned afaik.
Furthermore you can't possibly read e.g. 0xfffff000 from a
32-bit BAR covering more than 4k.

> +            bars[i].unset = true;
> +            ASSERT(is_hardware_domain(pdev->domain));
> +            ASSERT(!(header->command & PCI_COMMAND_MEMORY));

You're asserting guest controlled state here (even if it's Dom0).

> +        }
> +
> +        ASSERT(IS_ALIGNED(addr, PAGE_SIZE));

Urgh (again).

> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -50,6 +50,34 @@ int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
>  struct vpci {
>      /* Root pointer for the tree of vPCI handlers. */
>      struct rb_root handlers;
> +
> +#ifdef __XEN__
> +    /* Hide the rest of the vpci struct from the user-space test harness. */
> +    struct vpci_header {
> +        /* Cached value of the command register. */
> +        uint16_t command;
> +        /* Information about the PCI BARs of this device. */
> +        struct vpci_bar {
> +            enum {
> +                VPCI_BAR_EMPTY,
> +                VPCI_BAR_IO,
> +                VPCI_BAR_MEM,

MEM32?

> +                VPCI_BAR_MEM64_LO,
> +                VPCI_BAR_MEM64_HI,
> +            } type;
> +            /* Hardware address. */
> +            paddr_t paddr;
> +            /* Guest address where the BAR should be mapped. */
> +            paddr_t gaddr;
> +            /* Current guest address where the BAR is mapped. */
> +            paddr_t mapped_addr;

Why do you need to track both "should be" and "is" addresses? Also
I think all three would more naturally be frame numbers.

> +            size_t size;

Is this enough for e.g. ARM32 (remember this is a common
header)?

> +            unsigned int attributes:4;

???

> +            bool sizing;
> +            bool unset;

Isn't this redundant with e.g. gaddr (or as per above gfn) being
INVALID_PADDR (INVALID_GFN)?

> +        } bars[6];

What about the ROM and SR-IOV ones?

Jan

* Re: [PATCH v3 5/9] xen/vpci: add handlers to map the BARs
  2017-05-19 15:21   ` Jan Beulich
@ 2017-05-22 11:38     ` Julien Grall
  2017-06-22 17:13     ` Roger Pau Monne
  1 sibling, 0 replies; 49+ messages in thread
From: Julien Grall @ 2017-05-22 11:38 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Punit Agrawal, Ian Jackson, Tim Deegan, xen-devel,
	boris.ostrovsky

Hi,

On 19/05/17 16:21, Jan Beulich wrote:
>>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> +                VPCI_BAR_MEM64_LO,
>> +                VPCI_BAR_MEM64_HI,
>> +            } type;
>> +            /* Hardware address. */
>> +            paddr_t paddr;
>> +            /* Guest address where the BAR should be mapped. */
>> +            paddr_t gaddr;
>> +            /* Current guest address where the BAR is mapped. */
>> +            paddr_t mapped_addr;
>
> Why do you need to track both "should be" and "is" addresses? Also
> I think all three would more naturally be frame numbers.
>
>> +            size_t size;
>
> Is this enough for e.g. ARM32 (remember this is a common
> header)?

ARM32 supports up to a 40-bit address space, so theoretically it would
be possible to have a BAR bigger than 4GB.

Also, I have seen quite a few uses of GENMASK(63...), which is not going
to work on arm32.
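
For illustration, a minimal sketch of a mask macro built on a fixed
64-bit type, similar in spirit to Linux's GENMASK_ULL(), so that it does
not depend on the width of long. GENMASK_U64 and bar_update_half are
hypothetical names used only in this example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits h..l (inclusive) set, computed in 64 bits on every architecture. */
#define GENMASK_U64(h, l) \
    ((~UINT64_C(0) << (l)) & (~UINT64_C(0) >> (63 - (h))))

/* Replace the low or high 32 bits of a BAR address with 'wdata'. */
static uint64_t bar_update_half(uint64_t gaddr, uint32_t wdata, bool hi)
{
    gaddr &= hi ? ~GENMASK_U64(63, 32) : ~GENMASK_U64(31, 0);
    gaddr |= (uint64_t)wdata << (hi ? 32 : 0);

    return gaddr;
}
```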

Cheers,

-- 
Julien Grall

* Re: [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities
  2017-04-27 14:35 ` [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities Roger Pau Monne
@ 2017-05-23 12:49   ` Jan Beulich
  2017-06-26 11:50     ` Roger Pau Monne
  2017-05-29 13:32   ` Jan Beulich
  1 sibling, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-23 12:49 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> Add traps to each capability PCI_CAP_LIST_NEXT field in order to mask them on
> request.
> 
> All capabilities from the device are fetched and stored in an internal list,
> that's later used in order to return the next capability to the guest. Note
> that this only removes the capability from the linked list as seen by the
> guest, but the actual capability structure could still be accessed by the
> guest, provided that its position can be found using another mechanism.

Which is a problem. Drivers tied to a single device or a narrow set
of devices are known to do this. In fact, in the past Intel has given
us workaround outlines for some of their chipset issues which directed
us to fixed offsets instead of using the capability chains.

> Finally the MSI and MSI-X capabilities are masked until Xen knows how to
> properly handle accesses to them.
> 
> This should allow a PVH Dom0 to boot on some hardware, provided that the
> hardware doesn't require MSI/MSI-X and that there are no SR-IOV devices in the
> system, so the panic at the end of the PVH Dom0 build is replaced by a
> warning.

While this is certainly nice for development / debugging purposes,
what's the longer term intention with the functionality being added
here? We had no need to mask capabilities for PV Dom0, so I would
have hoped to get away without for PVH too.

Assuming there is a reason other than to temporarily hide MSI/MSI-X,
I'll give some comments on the patch itself anyway.

> --- /dev/null
> +++ b/xen/drivers/vpci/capabilities.c
> @@ -0,0 +1,159 @@
> +/*
> + * Generic functionality for handling accesses to the PCI capabilities from
> + * the configuration space.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +
> +struct vpci_capability {
> +    struct list_head next;
> +    uint8_t offset;
> +    bool masked;

I think I'd prefer "hidden" or "suppressed".

> +};
> +
> +static int vpci_cap_read(struct pci_dev *pdev, unsigned int reg,
> +                         union vpci_val *val, void *data)
> +{
> +    struct vpci_capability *cap = data;
> +
> +    val->half_word = 0;

Instead of doing such (and, like here, leaving part of what val
points to uninitialized), wouldn't it be better to do this in the code
calling these helpers?

> +static int vpci_cap_write(struct pci_dev *pdev, unsigned int reg,
> +                          union vpci_val val, void *data)
> +{
> +    /* Ignored. */
> +    return 0;
> +}

One possible example of what a NULL write handler might mean.

> +static int vpci_index_capabilities(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint8_t pos = PCI_CAPABILITY_LIST;
> +    uint16_t status;
> +    unsigned int max_cap = 48;

I think it's high time to introduce a #define for this, which is now
at least the 3rd instance.

> +    struct vpci_capability *cap;
> +    int rc;
> +
> +    INIT_LIST_HEAD(&pdev->vpci->cap_list);
> +
> +    /* Check if device has capabilities. */
> +    status = pci_conf_read16(seg, bus, slot, func, PCI_STATUS);
> +    if ( !(status & PCI_STATUS_CAP_LIST) )
> +        return 0;
> +
> +    /* Add the root capability pointer. */
> +    cap = xzalloc(struct vpci_capability);
> +    if ( !cap )
> +        return -ENOMEM;
> +
> +    cap->offset = pos;

Please be consistent with the naming of field and variable.

> +    list_add_tail(&cap->next, &pdev->vpci->cap_list);
> +    rc = xen_vpci_add_register(pdev, vpci_cap_read, vpci_cap_write, pos,
> +                               1, cap);
> +    if ( rc )
> +        return rc;
> +
> +    /*
> +     * Iterate over the list of capabilities present in the device, and
> +     * add a handler for each register pointer to the next item
> +     * (PCI_CAP_LIST_NEXT).
> +     */
> +    while ( max_cap-- )
> +    {
> +        pos = pci_conf_read8(seg, bus, slot, func, pos);
> +        if ( pos < 0x40 )
> +            break;
> +
> +        cap = xzalloc(struct vpci_capability);
> +        if ( !cap )
> +            return -ENOMEM;
> +
> +        cap->offset = pos;

Pre-existing code clears the low two bits from the value read and
also checks whether the capability ID is 0xff.

> +static int vpci_capabilities_init(struct pci_dev *pdev)
> +{
> +    int rc;
> +
> +    rc = vpci_index_capabilities(pdev);
> +    if ( rc )
> +        return rc;
> +
> +    /* Mask MSI and MSI-X capabilities until Xen handles them. */
> +    vpci_mask_capability(pdev, PCI_CAP_ID_MSI);
> +    vpci_mask_capability(pdev, PCI_CAP_ID_MSIX);
> +
> +    return 0;
> +}

What about extended capabilities?

Jan

* Re: [PATCH v3 7/9] vpci: add a priority field to the vPCI register initializer
  2017-04-27 14:35 ` [PATCH v3 7/9] vpci: add a priority field to the vPCI register initializer Roger Pau Monne
@ 2017-05-23 12:52   ` Jan Beulich
  2017-06-26 14:41     ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-23 12:52 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> +#define REGISTER_VPCI_INIT(f, p)                                        \
> +  static const struct vpci_register_init                                \
> +                      x##_entry __used_section(".data.vpci") = {        \
> +    .init = f,                                                          \
> +    .priority = p,                                                      \
> +}

I think I'd rather see this done by ordering the entries in
.data.vpci suitably, e.g. by adding numeric tags and using SORT()
or some such in the linker script. Iirc upstream Linux did change to
such a model for some of their initialization, so you may be able to
glean something there.

Jan


* Re: [PATCH v3 8/9] vpci/msi: add MSI handlers
  2017-04-27 14:35 ` [PATCH v3 8/9] vpci/msi: add MSI handlers Roger Pau Monne
@ 2017-05-26 15:26   ` Jan Beulich
  2017-06-27 10:22     ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-26 15:26 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, julien.grall, Paul Durrant, xen-devel, boris.ostrovsky

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> Add handlers for the MSI control, address, data and mask fields in order to
> detect accesses to them and setup the interrupts as requested by the guest.
> 
> Note that the pending register is not trapped, and the guest can freely
> read/write to it.
> 
> Whether Xen is going to provide this functionality to Dom0 (MSI emulation) is
> controlled by the "msi" option in the dom0 field. When disabling this option
> Xen will hide the MSI capability structure from Dom0.

Unless there's an actual reason behind this, I'd view this as a
development only thing, which shouldn't be in a non-RFC patch.

> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -622,3 +622,144 @@ void msix_write_completion(struct vcpu *v)
>      if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
>          gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
>  }
> +
> +static unsigned int msi_vector(uint16_t data)
> +{
> +    return (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;

I know code elsewhere does it this way, but I'm intending to switch
that to use MASK_EXTR(), and I'd like to ask to use that construct
right away in new code.
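
For reference, a minimal sketch of what that could look like. The macro
body below is believed to match the shape of Xen's MASK_EXTR() but is
reproduced from memory, and the delivery mode and trigger mask names are
hypothetical (the patch only has the corresponding shifts):

```c
#include <stdint.h>

/* Extract a field by dividing by the lowest set bit of its mask. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))

#define MSI_DATA_VECTOR_MASK        0x00ffu
#define MSI_DATA_DELIVERY_MODE_MASK 0x0700u  /* hypothetical name */
#define MSI_DATA_TRIGGER_MASK       0x8000u  /* hypothetical name */

static unsigned int msi_vector(uint16_t data)
{
    return MASK_EXTR(data, MSI_DATA_VECTOR_MASK);
}

static unsigned int msi_delivery_mode(uint16_t data)
{
    return MASK_EXTR(data, MSI_DATA_DELIVERY_MODE_MASK);
}

static unsigned int msi_trigger_mode(uint16_t data)
{
    return MASK_EXTR(data, MSI_DATA_TRIGGER_MASK);
}
```

Since the shift is derived from the mask, there is no separate _SHIFT
constant to keep in sync and no bare 0x1 or 0x7 at the use site.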

> +static unsigned int msi_flags(uint16_t data, uint64_t addr)
> +{
> +    unsigned int rh, dm, dest_id, deliv_mode, trig_mode;
> +
> +    rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
> +    dm = (addr >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
> +    dest_id = (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
> +    deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
> +    trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;

I'm sure you've simply copied code from elsewhere, which I agree
should generally be fine. However, on top of what I did say above
I'd also like to request to drop such stray 0x prefixes, plus I think
the 7 wants a #define.

> +    return dest_id | (rh << GFLAGS_SHIFT_RH) | (dm << GFLAGS_SHIFT_DM) |
> +           (deliv_mode << GFLAGS_SHIFT_DELIV_MODE) |
> +           (trig_mode << GFLAGS_SHIFT_TRG_MODE);

How come dest_id has no shift? Please let's not assume the shift
is zero now and forever.

> +void vpci_msi_mask(struct vpci_arch_msi *arch, unsigned int entry, bool mask)
> +{
> +    struct pirq *pinfo;
> +    struct irq_desc *desc;
> +    unsigned long flags;
> +    int irq;
> +
> +    ASSERT(arch->pirq != -1);

Perhaps better ">= 0"?

> +    pinfo = pirq_info(current->domain, arch->pirq + entry);
> +    ASSERT(pinfo);
> +
> +    irq = pinfo->arch.irq;
> +    ASSERT(irq < nr_irqs);
> +
> +    desc = irq_to_desc(irq);
> +    ASSERT(desc);

It's not entirely clear to me where all the checks are which allow
the checks here to be ASSERT()s.

> +int vpci_msi_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                    uint64_t address, uint32_t data, unsigned int vectors)
> +{
> +    struct msi_info msi_info = {
> +        .seg = pdev->seg,
> +        .bus = pdev->bus,
> +        .devfn = pdev->devfn,
> +        .entry_nr = vectors,
> +    };
> +    int index = -1, rc;
> +    unsigned int i;
> +
> +    ASSERT(arch->pirq == -1);
> +
> +    /* Get a PIRQ. */
> +    rc = allocate_and_map_msi_pirq(pdev->domain, &index, &arch->pirq,
> +                                   &msi_info);
> +    if ( rc )
> +    {
> +        dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> +                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                PCI_FUNC(pdev->devfn), rc);
> +        return rc;
> +    }
> +
> +    for ( i = 0; i < vectors; i++ )
> +    {
> +        xen_domctl_bind_pt_irq_t bind = {
> +            .hvm_domid = DOMID_SELF,
> +            .machine_irq = arch->pirq + i,
> +            .irq_type = PT_IRQ_TYPE_MSI,
> +            .u.msi.gvec = msi_vector(data) + i,
> +            .u.msi.gflags = msi_flags(data, address),
> +        };
> +
> +        pcidevs_lock();
> +        rc = pt_irq_create_bind(pdev->domain, &bind);
> +        if ( rc )

I don't think you need to hold the lock for the if() and its body. While
I see unmap_domain_pirq(), I don't really see why it does, so perhaps
there's some cleanup potential up front.

> --- a/xen/drivers/vpci/capabilities.c
> +++ b/xen/drivers/vpci/capabilities.c
> @@ -109,7 +109,7 @@ static int vpci_index_capabilities(struct pci_dev *pdev)
>      return 0;
>  }
>  
> -static void vpci_mask_capability(struct pci_dev *pdev, uint8_t cap_id)
> +void xen_vpci_mask_capability(struct pci_dev *pdev, uint8_t cap_id)

What's the xen_ prefix good for?

> +static int vpci_msi_control_read(struct pci_dev *pdev, unsigned int reg,
> +                                 union vpci_val *val, void *data)
> +{
> +    struct vpci_msi *msi = data;

const

> +    if ( msi->enabled )
> +        val->word |= PCI_MSI_FLAGS_ENABLE;
> +    if ( msi->masking )
> +        val->word |= PCI_MSI_FLAGS_MASKBIT;
> +    if ( msi->address64 )
> +        val->word |= PCI_MSI_FLAGS_64BIT;
> +
> +    /* Set multiple message capable. */
> +    val->word |= ((fls(msi->max_vectors) - 1) << 1) & PCI_MSI_FLAGS_QMASK;
> +
> +    /* Set current number of configured vectors. */
> +    val->word |= ((fls(msi->guest_vectors) - 1) << 4) & PCI_MSI_FLAGS_QSIZE;

The 1 and 4 here clearly need #define-s or the use of MASK_INSR().
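
A sketch of the MASK_INSR() variant, assuming the macro bodies below
match Xen's (they are reproduced from memory). The helper names are
hypothetical, and order() merely stands in for fls(n) - 1 on a power of
two so that the example is self-contained:

```c
#include <stdbool.h>
#include <stdint.h>

#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/* MSI Message Control bits. */
#define PCI_MSI_FLAGS_ENABLE 0x0001u
#define PCI_MSI_FLAGS_QMASK  0x000eu  /* multiple message capable */
#define PCI_MSI_FLAGS_QSIZE  0x0070u  /* multiple message enable */

/* log2 of a power of two, standing in for fls(n) - 1. */
static unsigned int order(unsigned int n)
{
    unsigned int r = 0;

    while ( n >>= 1 )
        r++;

    return r;
}

/* Build the Message Control value reported to the guest. */
static uint16_t msi_control(unsigned int max_vectors,
                            unsigned int guest_vectors, bool enabled)
{
    uint16_t val = enabled ? PCI_MSI_FLAGS_ENABLE : 0;

    val |= MASK_INSR(order(max_vectors), PCI_MSI_FLAGS_QMASK);
    val |= MASK_INSR(order(guest_vectors), PCI_MSI_FLAGS_QSIZE);

    return val;
}

/* Number of vectors the guest asked for in a Message Control write. */
static unsigned int msi_control_vectors(uint16_t val)
{
    return 1u << MASK_EXTR(val, PCI_MSI_FLAGS_QSIZE);
}
```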

> +static int vpci_msi_control_write(struct pci_dev *pdev, unsigned int reg,
> +                                  union vpci_val val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +    unsigned int vectors = 1 << ((val.word & PCI_MSI_FLAGS_QSIZE) >> 4);
> +    int rc;
> +
> +    if ( vectors > msi->max_vectors )
> +        return -EINVAL;
> +
> +    msi->guest_vectors = vectors;
> +
> +    if ( !!(val.word & PCI_MSI_FLAGS_ENABLE) == msi->enabled )
> +        return 0;
> +
> +    if ( val.word & PCI_MSI_FLAGS_ENABLE )
> +    {
> +        ASSERT(!msi->enabled && !msi->vectors);

I can see the value of the right side, but the left (with the immediately
prior if())?

> +        rc = vpci_msi_enable(&msi->arch, pdev, msi->address, msi->data,
> +                             vectors);
> +        if ( rc )
> +            return rc;
> +
> +        /* Apply the mask bits. */
> +        if ( msi->masking )
> +        {
> +            uint32_t mask = msi->mask;
> +
> +            while ( mask )
> +            {
> +                unsigned int i = ffs(mask);

ffs(), just like fls(), returns 1-based values, so this looks to be off by
one.
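
A sketch of the loop with the adjustment applied, also limiting the walk
to a given number of vectors. It uses the POSIX ffs() from <strings.h>,
which is 1-based like the one in Xen; the function name and the output
array are assumptions made so the example can stand alone:

```c
#include <stdint.h>
#include <strings.h>   /* POSIX ffs() */

/*
 * Store the 0-based index of every set bit of 'mask' below 'limit' in
 * 'out' (which must have room for 32 entries).  Returns the count.
 */
static unsigned int set_bit_indices(uint32_t mask, unsigned int limit,
                                    unsigned int *out)
{
    unsigned int n = 0;

    if ( limit < 32 )
        mask &= (1u << limit) - 1;

    while ( mask )
    {
        unsigned int i = ffs((int)mask) - 1;   /* ffs() is 1-based */

        out[n++] = i;
        mask &= mask - 1;                      /* clear lowest set bit */
    }

    return n;
}
```

Without the "- 1", a mask of 0x5 would yield entries 1 and 3 instead of
0 and 2.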

> +                vpci_msi_mask(&msi->arch, i, true);
> +                __clear_bit(i, &mask);
> +            }
> +        }
> +
> +        __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                         PCI_FUNC(pdev->devfn), reg - PCI_MSI_FLAGS, 1);

Seems like you'll never come through msi_capability_init(); I can't
see how it can be a good idea to bypass lots of stuff.

> +static int vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned int reg,
> +                                        union vpci_val val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    /* Clear high part. */
> +    msi->address &= ~GENMASK(63, 32);
> +    msi->address |= (uint64_t)val.double_word << 32;

Is the value written here actually being used for anything other than
for reading back (also applicable to the high bits of the low half of the
address)?

> +static int vpci_msi_mask_read(struct pci_dev *pdev, unsigned int reg,
> +                              union vpci_val *val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    val->double_word = msi->mask;
> +
> +    return 0;
> +}
> +
> +static int vpci_msi_mask_write(struct pci_dev *pdev, unsigned int reg,
> +                               union vpci_val val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +    uint32_t dmask;
> +
> +    dmask = msi->mask ^ val.double_word;
> +
> +    if ( !dmask )
> +        return 0;
> +
> +    while ( dmask && msi->enabled )
> +    {
> +        unsigned int i = ffs(dmask);
> +
> +        vpci_msi_mask(&msi->arch, i, !test_bit(i, &msi->mask));
> +        __clear_bit(i, &dmask);
> +    }

I think this loop should be limited to the number of enabled vectors
(and the same likely applies then to vpci_msi_control_write()).

> +static int vpci_init_msi(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    struct vpci_msi *msi = NULL;

Pointless initializer.

> +    unsigned int msi_offset;
> +    uint16_t control;
> +    int rc;
> +
> +    msi_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
> +    if ( !msi_offset )
> +        return 0;
> +
> +    if ( !vpci_msi_enabled(pdev->domain) )
> +    {
> +        xen_vpci_mask_capability(pdev, PCI_CAP_ID_MSI);
> +        return 0;
> +    }
> +
> +    msi = xzalloc(struct vpci_msi);
> +    if ( !msi )
> +        return -ENOMEM;
> +
> +    control = pci_conf_read16(seg, bus, slot, func,
> +                              msi_control_reg(msi_offset));
> +
> +    rc = xen_vpci_add_register(pdev, vpci_msi_control_read,
> +                               vpci_msi_control_write,
> +                               msi_control_reg(msi_offset), 2, msi);
> +    if ( rc )
> +    {
> +        dprintk(XENLOG_ERR,
> +                "%04x:%02x:%02x.%u: failed to add handler for MSI control: %d\n",
> +                seg, bus, slot, func, rc);
> +        goto error;
> +    }
> +
> +    /* Get the maximum number of vectors the device supports. */
> +    msi->max_vectors = multi_msi_capable(control);
> +    ASSERT(msi->max_vectors <= 32);
> +
> +    /* Initial value after reset. */
> +    msi->guest_vectors = 1;
> +
> +    /* No PIRQ bind yet. */
> +    vpci_msi_arch_init(&msi->arch);
> +
> +    if ( is_64bit_address(control) )
> +        msi->address64 = true;
> +    if ( is_mask_bit_support(control) )
> +        msi->masking = true;
> +
> +    rc = xen_vpci_add_register(pdev, vpci_msi_address_read,
> +                               vpci_msi_address_write,
> +                               msi_lower_address_reg(msi_offset), 4, msi);
> +    if ( rc )
> +    {
> +        dprintk(XENLOG_ERR,
> +                "%04x:%02x:%02x.%u: failed to add handler for MSI address: %d\n",
> +                seg, bus, slot, func, rc);
> +        goto error;
> +    }
> +
> +    rc = xen_vpci_add_register(pdev, vpci_msi_data_read, vpci_msi_data_write,
> +                               msi_data_reg(msi_offset, msi->address64), 2,
> +                               msi);
> +    if ( rc )
> +    {
> +        dprintk(XENLOG_ERR,
> +                "%04x:%02x:%02x.%u: failed to add handler for MSI address: %d\n",
> +                seg, bus, slot, func, rc);

Twice the same message text is unhelpful (and actually there's a third
one below). But iirc I did indicate anyway that I think most of them
should go away. Note also how much they contribute to the function's
size.

> +static int __init vpci_msi_setup_keyhandler(void)
> +{
> +    register_keyhandler('Z', vpci_dump_msi, "dump guest MSI state", 1);

Please let's avoid using new (and even non-intuitive) keys if at all
possible. This is Dom0 only, so can easily be chained onto e.g. the
'M' handler.

> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -89,9 +89,35 @@ struct vpci {
>  
>      /* List of capabilities supported by the device. */
>      struct list_head cap_list;
> +
> +    /* MSI data. */
> +    struct vpci_msi {
> +        /* Maximum number of vectors supported by the device. */
> +        unsigned int max_vectors;
> +        /* Current guest-written number of vectors. */
> +        unsigned int guest_vectors;
> +        /* Number of vectors configured. */
> +        unsigned int vectors;

So coming here I still don't really see what the difference between
these last two fields is (and hence why you need two).

Jan


* Re: [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-05-19 11:35   ` Jan Beulich
@ 2017-05-29 12:57     ` Roger Pau Monne
  2017-05-29 14:16       ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-05-29 12:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, julien.grall, Paul Durrant,
	xen-devel, boris.ostrovsky

On Fri, May 19, 2017 at 05:35:47AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > --- a/tools/libxl/libxl_x86.c
> > +++ b/tools/libxl/libxl_x86.c
> > @@ -11,7 +11,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
> >      if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
> >          if (d_config->b_info.device_model_version !=
> >              LIBXL_DEVICE_MODEL_VERSION_NONE) {
> > -            xc_config->emulation_flags = XEN_X86_EMU_ALL;
> > +            xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
> 
> I can see why you need this, but I'm not sure this is a good model.
> Ideally for ordinary HVM guests you'd never have to change this
> line. Therefore perhaps it might be a better idea to use a "negative"
> flag here.

I would expect that at some point HVM guests are also going to use the
Xen internal PCI emulation for IOREQs (XEN_DMOP_IO_RANGE_PCI), and for
PCI passthrough, so having VPCI disabled is only temporary (long term
HVM guests should again use XEN_X86_EMU_ALL).

> > --- /dev/null
> > +++ b/tools/tests/vpci/Makefile
> > @@ -0,0 +1,45 @@
> > +
> > +XEN_ROOT=$(CURDIR)/../../..
> > +include $(XEN_ROOT)/tools/Rules.mk
> > +
> > +TARGET := test_vpci
> > +
> > +.PHONY: all
> > +all: $(TARGET)
> > +
> > +.PHONY: run
> > +run: $(TARGET)
> > +	./$(TARGET) > $(TARGET).out
> > +
> > +$(TARGET): vpci.c vpci.h rbtree.c rbtree.h
> > +	$(HOSTCC) -g -o $@ vpci.c main.c rbtree.c
> > +
> > +.PHONY: clean
> > +clean:
> > +	rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c rbtree.c rbtree.h
> > +
> > +.PHONY: distclean
> > +distclean: clean
> > +
> > +.PHONY: install
> > +install:
> > +
> > +vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
> > +	sed -e '/#include/d' <$< >$@
> > +
> > +vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
> > +	# Trick the compiler so it doesn't complain about missing symbols
> > +	sed -e '/#include/d' \
> > +	    -e '1s;^;#include "emul.h"\
> > +	             const vpci_register_init_t __start_vpci_array[1]\;\
> > +	             const vpci_register_init_t __end_vpci_array[1]\;\
> > +	             ;' <$< >$@
> > +
> > +rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h
> > +	sed -e '/#include/d' <$< >$@
> > +
> > +rbtree.c: $(XEN_ROOT)/xen/common/rbtree.c
> > +	sed -e "/#include/d" \
> > +	    -e '1s;^;#include "emul.h"\
> > +	             ;' <$< >$@
> 
> Plain symlinking and __XEN__ conditionals in the files may be the
> easier to follow variant. But I'm not heavily opposed to this one,
> I'm merely afraid that further adjustments may end up becoming
> necessary down the road, resulting in the rules here to become
> more convoluted.

Yes, I'm not opposed to doing that, I just think the code is cleaner
using this rather than adding __XEN__ conditionals, but I agree that
if this becomes too convoluted at some point I would be in favor of
using conditionals instead.

> > --- /dev/null
> > +++ b/tools/tests/vpci/emul.h
> > @@ -0,0 +1,107 @@
> > +/*
> > + * Unit tests for the generic vPCI handler code.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see 
> > <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef _TEST_VPCI_
> > +#define _TEST_VPCI_
> > +
> > +#include <stdlib.h>
> > +#include <stdio.h>
> > +#include <stddef.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +#include <errno.h>
> > +#include <assert.h>
> > +
> > +#define container_of(ptr, type, member) ({                      \
> > +        typeof( ((type *)0)->member ) *__mptr = (ptr);          \
> > +        (type *)( (char *)__mptr - offsetof(type,member) );})
> 
> There are a couple of stray blanks (immediately inside parentheses)
> here, and a missing one after the comma in offsetof().

Sorry, I've picked this as-is from the Xen headers (kernel.h). I've
changed it to:

#define container_of(ptr, type, member) ({                      \
        typeof(((type *)0)->member) *__mptr = (ptr);            \
        (type *)((char *)__mptr - offsetof(type, member));})

> > +#include "rbtree.h"
> > +
> > +struct pci_dev {
> > +    struct domain *domain;
> > +    struct vpci *vpci;
> > +};
> > +
> > +struct domain {
> > +    struct pci_dev pdev;
> > +};
> > +
> > +struct vcpu
> > +{
> > +    struct domain *domain;
> > +};
> > +
> > +extern struct vcpu v;
> 
> This is odd. With ...
> 
> > +#define spin_lock(x)
> > +#define spin_unlock(x)
> > +#define spin_is_locked(x) true
> > +
> > +#define current (&v)
> 
> ... this, why don't you simply have
> 
> extern struct vcpu *current;
> 
> keeping v (or however you mean to name it) static?

Done.

> > +#define has_vpci(d) true
> > +
> > +#include "vpci.h"
> > +
> > +#define xzalloc(type) (type *)calloc(1, sizeof(type))
> 
> Missing an outer pair of parentheses.

Done.

> > +#define xfree(p) free(p)
> > +
> > +#define EXPORT_SYMBOL(x)
> 
> I think we should rather get rid of them from rbtree.c.
> 
> > +#define pci_get_pdev_by_domain(d, ...) &(d)->pdev
> 
> Missing an outer pair of parentheses again, whereas ...

OK, I'm adding a pre-patch then to get rid of them.

> > +#define atomic_read(x) 1
> > +
> > +/* Dummy native helpers. Writes are ignored, reads return 1's. */
> > +#define pci_conf_read8(...) (0xff)
> > +#define pci_conf_read16(...) (0xffff)
> > +#define pci_conf_read32(...) (0xffffffff)
> 
> ... here they're pointless.

Thanks, removed.

> > +/* Dummy hooks, write stores data, read fetches it. */
> > +static int vpci_read8(struct pci_dev *pdev, unsigned int reg,
> > +                      union vpci_val *val, void *data)
> > +{
> > +    uint8_t *priv = data;
> > +
> > +    val->half_word = *priv;
> 
> Half word? Half a word is at least 16 bits on any reasonable
> architecture nowadays. Using it for a byte is simply confusing. I'd
> suggest naming the fields what they are - u8, u16, and u32.

Right, I can do that. I was using this nomenclature because that's
what the PCI specification uses:

half word: 8b
word: 16b
double word: 32b

Will change it to u8, u16 and u32.

> > +#define VPCI_READ(reg, size, data) \
> > +    assert(!xen_vpci_read(0, 0, 0, reg, size, data))
> > +
> > +#define VPCI_READ_CHECK(reg, size, expected) ({ \
> > +    uint32_t val;                               \
> > +    VPCI_READ(reg, size, &val);                 \
> > +    assert(val == expected);                    \
> > +    })
> 
> Bad indentation - either the }) needs to move left, or the body needs
> to move right.

I will move the }) left.

> > +#define VPCI_WRITE(reg, size, data) \
> > +    assert(!xen_vpci_write(0, 0, 0, reg, size, data))
> 
> You using fixed SBDF here, ...
> 
> > +#define VPCI_CHECK_REG(reg, size, data) ({      \
> > +    VPCI_WRITE(reg, size, data);                \
> > +    VPCI_READ_CHECK(reg, size, data);           \
> > +    })
> > +
> > +#define VPCI_ADD_REG(fread, fwrite, off, size, store)                         \
> > +    assert(!xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, &store)) \
> 
> ... why do you have this strange &d.pdev here? The assumption
> that a (fake) domain has a single (fake) PCI device looks pretty odd
> anyway - why can't you simply have a global (fake) PCI device?

Yes, I can do that. Now the root is the pci_dev itself instead of the
domain.

> > +#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \
> > +    assert(xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, NULL))  \
> > +
> > +int
> > +main(int argc, char **argv)
> > +{
> > +    /* Index storage by offset. */
> > +    uint32_t r0 = 0xdeadbeef;
> > +    uint8_t r5 = 0xef;
> > +    uint8_t r6 = 0xbe;
> > +    uint8_t r7 = 0xef;
> > +    uint16_t r12 = 0x8696;
> > +    int rc;
> > +
> > +    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
> > +    VPCI_READ_CHECK(0, 4, 0xdeadbeef);
> > +    VPCI_CHECK_REG(0, 4, 0xbcbcbcbc);
> 
> In the context here the macro name is pretty confusing: I'd expect
> it to check the register holds the specified value, without also doing
> a write. How about VPCI_WRITE_CHECK()?

Completely agree, the read macro is already VPCI_READ_CHECK, so it
makes sense for the write one to be VPCI_WRITE_CHECK IMHO.

> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
> > +    VPCI_READ_CHECK(5, 1, 0xef);
> > +    VPCI_CHECK_REG(5, 1, 0xba);
> > +
> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
> > +    VPCI_READ_CHECK(6, 1, 0xbe);
> > +    VPCI_CHECK_REG(6, 1, 0xba);
> > +
> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
> > +    VPCI_READ_CHECK(7, 1, 0xef);
> > +    VPCI_CHECK_REG(7, 1, 0xbd);
> > +
> > +    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
> > +    VPCI_READ_CHECK(12, 2, 0x8696);
> > +    VPCI_READ_CHECK(12, 4, 0xffff8696);
> > +
> > +    /*
> > +     * At this point we have the following layout:
> > +     *
> > +     * 32    24    16     8     0
> > +     *  +-----+-----+-----+-----+
> > +     *  |          r0           | 0
> > +     *  +-----+-----+-----+-----+
> > +     *  | r7  |  r6 |  r5 |/////| 32
> > +     *  +-----+-----+-----+-----|
> > +     *  |///////////////////////| 64
> > +     *  +-----------+-----------+
> > +     *  |///////////|    r12    | 96
> > +     *  +-----------+-----------+
> > +     *             ...
> > +     *  / = empty.
> > +     */
> > +
> > +    /* Try to add an overlapping register handler. */
> > +    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
> > +
> > +    /* Try to add a non-aligned register. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
> > +
> > +    /* Try to add a register with wrong size. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
> > +
> > +    /* Try to add a register with missing handlers. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, NULL, 8, 2);
> > +    VPCI_ADD_INVALID_REG(NULL, vpci_write16, 8, 2);
> 
> Is that something which really is wrong in all cases? What about e.g.
> r/o registers?

My initial thought was that r/o registers should register something
like a noop write handler, that could be shared between all of them. I
can certainly allow registration of handlers without a read or write
function (but not both), and make it a noop (writes will be ignored,
reads will return 1's).
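
A minimal sketch of that registration rule, assuming simplified handler
signatures: the names reg_init, read_ones and write_ignore are
hypothetical and not part of the series. Only the case where both
handlers are missing is rejected; a missing handler is replaced by a
shared default.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified handler types (the series passes more context). */
typedef uint32_t (*reg_read_t)(void *priv);
typedef void (*reg_write_t)(void *priv, uint32_t val);

struct reg {
    reg_read_t read;
    reg_write_t write;
    void *priv;
};

/* Shared defaults: reads return all 1's, writes are dropped. */
static uint32_t read_ones(void *priv)
{
    (void)priv;
    return ~(uint32_t)0;
}

static void write_ignore(void *priv, uint32_t val)
{
    (void)priv;
    (void)val;
}

/* Reject only the case where both handlers are missing. */
static int reg_init(struct reg *r, reg_read_t read, reg_write_t write,
                    void *priv)
{
    if ( !read && !write )
        return -EINVAL;

    r->read = read ? read : read_ones;
    r->write = write ? write : write_ignore;
    r->priv = priv;

    return 0;
}

/* Example read handler backed by a plain variable. */
static uint32_t read_stored(void *priv)
{
    return *(uint32_t *)priv;
}
```

With this shape a r/o register only supplies a read handler, and every
such register shares the same write_ignore default.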

> > +    /* Read/write of unset register. */
> > +    VPCI_READ_CHECK(8, 4, 0xffffffff);
> > +    VPCI_READ_CHECK(8, 2, 0xffff);
> > +    VPCI_READ_CHECK(8, 1, 0xff);
> > +    VPCI_WRITE(10, 2, 0xbeef);
> > +    VPCI_READ_CHECK(10, 2, 0xffff);
> > +
> > +    /* Read of multiple registers */
> > +    VPCI_CHECK_REG(7, 1, 0xbd);
> > +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
> 
> I think a variant accessing mixed size registers would also be
> desirable here. Perhaps it would be best to exhaustively test
> all possible variations (there aren't that many after all). Same
> for writes and partial accesses (below) then.

So you mean to scan the whole space (from 0 to 128 on this test) using
all possible register sizes for both read and write? That would indeed
be feasible.
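
Such a scan could look roughly like the sketch below. It assumes a
byte-wise reference model of the 128 byte space and a reader passed in
as a function pointer; model, scan_reads and broken_read are
hypothetical names, and the reader under test stands in for whatever
wrapper the harness puts around xen_vpci_read. Only naturally aligned
accesses are tried here.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_SIZE 128 /* Size of the emulated space in the test, in bytes. */

/* Reference model: what each byte of the config space should read as. */
static uint8_t model[CFG_SIZE];

typedef uint32_t (*read_fn_t)(unsigned int reg, unsigned int size);

/* Expected value of an access, assembled byte by byte (little endian). */
static uint32_t model_read(unsigned int reg, unsigned int size)
{
    uint32_t val = 0;
    unsigned int i;

    for ( i = 0; i < size; i++ )
        val |= (uint32_t)model[reg + i] << (i * 8);

    return val;
}

/* Try every naturally aligned 1, 2 and 4 byte read; return the mismatches. */
static unsigned int scan_reads(read_fn_t read)
{
    static const unsigned int sizes[] = { 1, 2, 4 };
    unsigned int s, reg, bad = 0;

    for ( s = 0; s < 3; s++ )
        for ( reg = 0; reg + sizes[s] <= CFG_SIZE; reg += sizes[s] )
            if ( read(reg, sizes[s]) != model_read(reg, sizes[s]) )
                bad++;

    return bad;
}

/* Deliberately broken reader, used to show that the scan detects errors. */
static uint32_t broken_read(unsigned int reg, unsigned int size)
{
    return model_read(reg, size) ^ (reg == 12);
}
```

A write scan would follow the same pattern: write through the code
under test, update the model, then rerun scan_reads.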

> > @@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
> >      handler->ops = &g2m_portio_ops;
> >  }
> >  
> > +/* Do some sanity checks. */
> > +static int vpci_access_check(unsigned int reg, unsigned int len)
> > +{
> > +    /* Check access size. */
> > +    if ( len != 1 && len != 2 && len != 4 )
> > +    {
> > +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
> > +                 reg, len);
> 
> I think many of such gdprintk()s want to go away before this series
> gets committed.

OK, I've found them useful while developing, but I guess they are not
really useful outside of that context. I guess there's no way to
leave them in place, maybe a Kconfig option?

> > +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> > +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
> 
> Plain bool please.

Sadly struct hvm_io_ops (which is where this function is used) expects
a bool_t as return.

> 
> > +                                 const ioreq_t *p)
> > +{
> > +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;
> > +}
> > +
> > +static int vpci_portio_read(const struct hvm_io_handler *handler,
> > +                            uint64_t addr, uint32_t size, uint64_t *data)
> > +{
> > +    struct domain *d = current->domain;
> > +    unsigned int bus, devfn, reg;
> > +    uint32_t data32;
> > +    int rc;
> > +
> > +    vpci_lock(d);
> > +    if ( addr == 0xcf8 )
> > +    {
> > +        ASSERT(size == 4);
> > +        *data = d->arch.hvm_domain.pci_cf8;
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> > +    }
> > +    else if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
> 
> Pointless "else".
> 
> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> 
> You need to write to *data here, or else you need to return
> false from vpci_portio_accept() already in this case (but then
> you'd need to follow the stdvga model and take the lock
> there, releasing it in a .complete handler).

Right, this should return 1's in this case then.

> > +    }
> > +
> > +    /* Decode the PCI address. */
> > +    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
> > +
> > +    if ( vpci_access_check(reg, size) || reg >= 0xff )
> 
> > 0xff or >= 0x100, but the check is pointless as
> hvm_pci_decode_addr() won't return larger values.

Right, this check is pointless.

> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_UNHANDLEABLE;
> 
> I don't think this matches real hardware behavior. If this "fails"
> at all, surely by returning all ones.

Yes, I've changed this to return 1's and X86EMUL_OKAY.

> > +    }
> > +
> > +    rc = xen_vpci_read(0, bus, devfn, reg, size, &data32);
> > +    if ( !rc )
> > +        *data = data32;
> > +    vpci_unlock(d);
> 
> Please set *data outside the locked region.
> 
> And since there's no best place to make this other remark - I'd
> prefer if you either kept together SBDF in one value when passing
> this as arguments to functions, or alternatively pass this as four
> values rather than keeping devfn artificially together.

I guess I prefer the 4 values, that was my first approach until I
realized pci_dev internally stores devfn in a single variable.

> > +     return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
> > +}
> 
> Again the question - what's the bare hardware equivalent of
> returning X86EMUL_UNHANDLEABLE here?

All 1's I assume (or other random garbage). Would you be OK with me
adding a "fail" label that sets data to all 1's and returns X86EMUL_OKAY?
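
A minimal sketch of that failure path, assuming a hypothetical helper
that folds any error from the lower level read into an all 1's result.
EMUL_OKAY stands in for X86EMUL_OKAY; all_ones and finish_read are
illustrative names only, and size is assumed to be 1, 2 or 4.

```c
#include <assert.h>
#include <stdint.h>

/* All 1's for an access of the given size in bytes (1, 2 or 4). */
static uint64_t all_ones(unsigned int size)
{
    return ~(uint64_t)0 >> (64 - size * 8);
}

/* Hypothetical stand-ins for the Xen return codes. */
enum { EMUL_OKAY, EMUL_UNHANDLEABLE };

/*
 * Hypothetical tail of a port IO read handler: rc is the result of the
 * lower level read and val the value it produced. Failures are reported
 * to the guest as an all 1's read rather than as an emulation error.
 */
static int finish_read(int rc, uint32_t val, unsigned int size, uint64_t *data)
{
    *data = rc ? all_ones(size) : val;
    return EMUL_OKAY;
}
```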

> > --- a/xen/arch/x86/hvm/ioreq.c
> > +++ b/xen/arch/x86/hvm/ioreq.c
> > @@ -1177,6 +1177,9 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
> >           CF8_ENABLED(cf8) )
> >      {
> >          uint32_t sbdf, x86_fam;
> > +        unsigned int bus, devfn, reg;
> > +
> > +        hvm_pci_decode_addr(cf8, p->addr, &bus, &devfn, &reg);
> >  
> >          /* PCI config data cycle */
> >  
> > @@ -1186,9 +1189,7 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
> >                                   PCI_FUNC(CF8_BDF(cf8)));
> 
> Any reason you don't use bus and devfn (really dev/slot and func)
> in the expression the tail of which is visible here?

Good catch, thanks.

> > --- a/xen/arch/x86/xen.lds.S
> > +++ b/xen/arch/x86/xen.lds.S
> > @@ -224,6 +224,9 @@ SECTIONS
> >         __start_schedulers_array = .;
> >         *(.data.schedulers)
> >         __end_schedulers_array = .;
> > +       __start_vpci_array = .;
> > +       *(.data.vpci)
> > +       __end_vpci_array = .;
> 
> With vpci.c declaring these const, they should go into .rodata.
> With the type name further being vpci_register_init_t it may even
> be next to .init.rodata where they belong.

Done, I've placed them in .rodata now.

I don't think it makes sense to place them in .init.rodata, the _init_
tag means these functions are supposed to be used to initialize the
vPCI handlers for devices, but they could be run after Xen has
finished the initialization, for example if MMCFG areas are discovered
by Dom0 with new devices (I know this is not yet implemented).

> > @@ -1041,6 +1042,8 @@ static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
> >          devfn += pdev->phantom_stride;
> >      } while ( devfn != pdev->devfn &&
> >                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
> > +
> > +    xen_vpci_add_handlers(pdev);
> > }
> 
> You're losing an error code here.

Fixed, thanks.

> > --- /dev/null
> > +++ b/xen/drivers/vpci/Makefile
> > @@ -0,0 +1 @@
> > +obj-y += vpci.o
> 
> Without having seen further patches it's not clear whether this really
> needs its own directory.

vPCI emulation handlers (for PCI headers, capabilities, MSI...) will
also go into this folder.

> > --- /dev/null
> > +++ b/xen/drivers/vpci/vpci.c
> > @@ -0,0 +1,469 @@
> > +/*
> > + * Generic functionality for handling accesses to the PCI configuration space
> > + * from guests.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <xen/sched.h>
> > +#include <xen/vpci.h>
> > +
> > +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> > +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> > +#define vpci_init __start_vpci_array
> 
> What is this last one good for?

It's used by xen_vpci_add_handlers in order to call the init
functions, I can drop it and call __start_vpci_array[i](...) if that's
better.

> > +/* Internal struct to store the emulated PCI registers. */
> > +struct vpci_register {
> > +    vpci_read_t read;
> > +    vpci_write_t write;
> 
> These two are pointers - please change the typedefs so that they're
> visibly pointers here. That'll then also allow the typedef to be used to
> declare actual handlers, should any such declarations be needed (e.g.
> if the same handler can be used by two different source files).
> 
> > +    unsigned int size;
> > +    unsigned int offset;
> > +    void *priv_data;
> 
> "private" (shorter and hence easier to type)?

Done to both the above comments.

> > +    struct rb_node node;
> > +};
> > +
> > +int xen_vpci_add_handlers(struct pci_dev *pdev)
> 
> __hwdom_init (I notice setup_one_hwdom_device() wrongly isn't
> annotated so).

OK, so you really want the init handlers to be inside of the
.init.rodata section then.

> > +{
> > +    int i, rc = 0;
> 
> i wants to be unsigned.

Rightfully.

> > +    if ( !has_vpci(pdev->domain) )
> > +        return 0;
> > +
> > +    pdev->vpci = xzalloc(struct vpci);
> > +    if ( !pdev->vpci )
> > +        return -ENOMEM;
> > +
> > +    pdev->vpci->handlers = RB_ROOT;
> > +
> > +    for ( i = 0; i < NUM_VPCI_INIT; i++ )
> > +    {
> > +        rc = vpci_init[i](pdev);
> > +        if ( rc )
> > +            break;
> > +    }
> > +
> > +    if ( rc )
> > +    {
> > +        struct rb_node *node = rb_first(&pdev->vpci->handlers);
> > +        struct vpci_register *r;
> 
> Please move this into the more narrow scope below.

Done.

> > +        /* Iterate over the tree and cleanup. */
> > +        while ( node != NULL )
> > +        {
> > +            r = container_of(node, struct vpci_register, node);
> > +            node = rb_next(node);
> > +            rb_erase(&r->node, &pdev->vpci->handlers);
> > +            xfree(r);
> > +        }
> > +        xfree(pdev->vpci);
> > +    }
> > +
> > +    return rc;
> > +}
> > +
> > +static bool vpci_register_overlap(const struct vpci_register *r,
> > +                                  unsigned int offset)
> > +{
> > +    if ( offset >= r->offset && offset < r->offset + r->size )
> > +        return true;
> > +
> > +    return false;
> 
> This can be one single return statement.
>
> > +}
> > +
> > +
> 
> Stray double blank lines.

Fixed both.

> > +static int vpci_register_cmp(const struct vpci_register *r1,
> > +                             const struct vpci_register *r2)
> > +{
> > +    /* Make sure there's no overlap between registers. */
> > +    if ( vpci_register_overlap(r1, r2->offset) ||
> > +         vpci_register_overlap(r1, r2->offset + r2->size - 1) ||
> > +         vpci_register_overlap(r2, r1->offset) ||
> > +         vpci_register_overlap(r2, r1->offset + r1->size - 1) )
> 
> Overlap checks can generally be done with just two comparisons,
> so I guess the parameters chosen for vpci_register_overlap()
> aren't optimal. I guess you don't need the function at all, as you
> could do all that's needed here:
> 
>     if ( r1->offset < r2->offset + r2->size &&
>          r2->offset < r1->offset + r1->size )
>         return 0;
> 
> The comment of course is somewhat misleading here too, as
> returning zero isn't really an error indication.

I've fixed this and changed the comment to "Return 0 if registers
overlap".
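
For reference, a standalone sketch of the comparison with the two
comparison overlap test suggested above. struct reg is a cut-down,
hypothetical stand-in for struct vpci_register that keeps only the
fields the comparison needs.

```c
#include <assert.h>

/* Cut-down stand-in for struct vpci_register. */
struct reg {
    unsigned int offset; /* Start of the register, in bytes. */
    unsigned int size;   /* Length of the register, in bytes. */
};

static int reg_cmp(const struct reg *r1, const struct reg *r2)
{
    /* Return 0 if registers overlap. */
    if ( r1->offset < r2->offset + r2->size &&
         r2->offset < r1->offset + r1->size )
        return 0;

    if ( r1->offset < r2->offset )
        return -1;

    /* No overlap and not below, so r1 must start above r2. */
    return 1;
}
```

Two ranges [a, a + n) and [b, b + m) intersect exactly when each one
starts before the other ends, which is why two comparisons suffice.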

> > +        return 0;
> > +
> > +    if (r1->offset < r2->offset)
> > +        return -1;
> > +    else if (r1->offset > r2->offset)
> > +        return 1;
> 
> Coding style.

Fixed (and removed the pointless else).

> > +    ASSERT_UNREACHABLE();
> > +    return 0;
> > +}
> > +
> > +static struct vpci_register *vpci_find_register(const struct pci_dev *pdev,
> > +                                                const unsigned int reg,
> > +                                                const unsigned int size)
> > +{
> > +    struct rb_node *node;
> 
> const
> 
> > +    struct vpci_register r = {
> > +        .offset = reg,
> > +        .size = size,
> > +    };
> > +
> > +    ASSERT(vpci_locked(pdev->domain));
> > +
> > +    node = pdev->vpci->handlers.rb_node;
> > +    while ( node )
> > +    {
> > +        struct vpci_register *t =
> 
> const

Making both of them const means the return must also be const, and
that's not suitable by some of the consumers (ie:
xen_vpci_remove_register for example).

> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> > +                          vpci_write_t write_handler, unsigned int offset,
> > +                          unsigned int size, void *data)
> > +{
> > +    struct rb_node **new, *parent;
> > +    struct vpci_register *r;
> > +
> > +    /* Some sanity checks. */
> > +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
> 
> Off by one again in the offset check.

Fixed to be > 0xfff. Should this maybe be added to pci_regs.h as
PCI_MAX_REGISTER? 

> > +         offset & (size - 1) || read_handler == NULL || write_handler == NULL )
> 
> As said, I'm not convinced either of the read or write handlers
> being NULL is really a mistake. Both of them being NULL surely
> is.

Right (see above reply).

> > +        return -EINVAL;
> > +
> > +    r = xzalloc(struct vpci_register);
> 
> Looks like xmalloc() would be fine here - you initialize all fields.

Yes.

> > +    if ( !r )
> > +        return -ENOMEM;
> > +
> > +    r->read = read_handler;
> > +    r->write = write_handler;
> > +    r->size = size;
> > +    r->offset = offset;
> > +    r->priv_data = data;
> > +
> > +    vpci_lock(pdev->domain);
> > +    new = &pdev->vpci->handlers.rb_node;
> > +    parent = NULL;
> > +
> > +    while (*new) {
> 
> Coding style.
> 
> > +        struct vpci_register *this =
> 
> const

Done for both.

> > +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
> > +{
> > +    struct vpci_register *r;
> > +
> > +    vpci_lock(pdev->domain);
> > +    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);
> 
> I'm not sure about this - is there anything wrong with the caller,
> knowing the size, also passing it? You could then even refuse
> requests to remove a register where (offset,size) doesn't match
> the recorded values (as vpci_find_register() will return any
> overlapping one).

Well, I think the important bit is to check that what
vpci_find_register returns matches what the caller expects to
remove.

> > +    if ( !r )
> > +    {
> > +        vpci_unlock(pdev->domain);
> > +        return -ENOENT;
> > +    }
> > +
> > +    rb_erase(&r->node, &pdev->vpci->handlers);
> > +    xfree(r);
> > +    vpci_unlock(pdev->domain);
> 
> Please swap xfree() and unlock.
> 
> > +static void vpci_read_hw(unsigned int seg, unsigned int bus,
> > +                         unsigned int devfn, unsigned int reg, uint32_t size,
> > +                         uint32_t *data)
> 
> Instead of passing a pointer to the result, please consider returning
> the value, as the function doesn't return anything at present.
> 
> > +{
> > +    switch ( size )
> > +    {
> > +    case 4:
> > +        *data = pci_conf_read32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> > +                                reg);
> > +        break;
> > +    case 3:
> > +        /*
> > +         * This is possible because a 4byte read can have 1byte trapped and
> > +         * the rest passed-through.
> > +         */
> > +        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> > +                                reg + 1) << 8;
> > +        *data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> > +                               reg);
> 
> Which of the two parts to read with read16() should depend on the
> low bit of reg. Also for maximum compatibility I'd strongly suggest
> reading the low part before the high one.

Right. Changed it to:

if ( reg & 1 )
{
    data = pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                          reg);
    data |= pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                            reg + 1) << 8;
}
else
{
    data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                           reg);
    data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                           reg + 2) << 16;
}
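
The same logic can be exercised outside of Xen with a fake config
space. In the sketch below cfg, conf_read8, conf_read16 and read3 are
hypothetical stand-ins for the hardware accessors and the size 3 case;
the assert in conf_read16 checks that the 16-bit access is always
naturally aligned, and the low part is read first in both branches.

```c
#include <assert.h>
#include <stdint.h>

/* Fake config space standing in for the hardware. */
static const uint8_t cfg[8] = {
    0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88
};

static uint32_t conf_read8(unsigned int reg)
{
    return cfg[reg];
}

static uint32_t conf_read16(unsigned int reg)
{
    assert(!(reg & 1)); /* 16-bit accesses must be naturally aligned. */
    return cfg[reg] | ((uint32_t)cfg[reg + 1] << 8);
}

/* 3 byte read: pick the split so that the 16-bit part stays aligned. */
static uint32_t read3(unsigned int reg)
{
    uint32_t data;

    if ( reg & 1 )
    {
        data = conf_read8(reg);
        data |= conf_read16(reg + 1) << 8;
    }
    else
    {
        data = conf_read16(reg);
        data |= conf_read8(reg + 2) << 16;
    }

    return data;
}
```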

> > +/* Helper macros for the read/write handlers. */
> > +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
> 
> What do e and s stand for here?

e = end, s = start (in bytes).

> > +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
> 
> And at least o here?

o = offset (in bytes)

> > +#define ADD_RESULT(r, d, s, o) r |= ((d) & GENMASK_BYTES(s, 0)) << ((o) * 8)
> 
> And d, s, and o here?
> 
> Also I can't see what addition you would want to perform below.
> All you ought to do are ANDs and ORs.

It's adding new data to a partial output (ie: because the output might
be split across several registers).

r = result to update
d = new data to add
s = size of the new data to add
o = offset of the newly added data in the partial result.

I will add all this as a comment to the macros:

/*
 * Helper macros for the read/write handlers (all input values are in bytes).
 *
 * GENMASK_BYTES: generate a bitmask from the byte pointed by s to the byte
 * pointed by e.
 *
 * SHIFT_RIGHT_BYTES: shift a value pointed by d the number of bytes
 * specified in o.
 *
 * ADD_RESULT: append a result to another variable pointed by r. d is the
 * variable that contains the data to be added, s contains the size of the
 * data to add, and finally o is the offset of such data in the destination
 * (r).
 */

I can rename ADD_RESULT to APPEND_RESULT or something more descriptive
if you think it's going to make it easier to understand.
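
As an illustration of what the macros are meant to compute, here is a
sketch written as functions instead of macros. mask_bytes and
merge_result are hypothetical names, the mask is derived from a size in
bytes rather than from GENMASK, and only ANDs, ORs and shifts are used.

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering the low `size` bytes of a 32-bit value (size in 1..4). */
static uint32_t mask_bytes(unsigned int size)
{
    return size >= 4 ? ~(uint32_t)0 : (((uint32_t)1 << (size * 8)) - 1);
}

/*
 * Place the low `size` bytes of `data` at byte `offset` of `result`.
 * The caller must make sure offset + size does not exceed 4.
 */
static uint32_t merge_result(uint32_t result, uint32_t data,
                             unsigned int size, unsigned int offset)
{
    return result | ((data & mask_bytes(size)) << (offset * 8));
}
```

Building the 4 byte read at offset 4 from the test above then amounts
to merging 0xff, 0xba, 0xba and 0xbd at byte offsets 0 to 3, which
gives 0xbdbabaff.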

> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> 
> The function being other than void, same question as earlier:
> What's the bare hardware equivalent of this returning other
> than zero?

I thought it would be useful to have some more fine-grained error
reporting if that's ever needed, although as you say, from a hardware
point of view any errors will be simply reported as the value obtained
being all 1's.

The question is, should this already return all 1's, or should the
caller translate failures into all 1's?

> > +                  unsigned int reg, uint32_t size, uint32_t *data)
> > +{
> > +    struct domain *d = current->domain;
> > +    struct pci_dev *pdev;
> > +    const struct vpci_register *r;
> > +    union vpci_val val = { .double_word = 0 };
> > +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> > +    uint32_t tmp_data;
> > +    int rc;
> > +
> > +    ASSERT(vpci_locked(d));
> > +
> > +    *data = 0;
> > +
> > +    /* Find the PCI dev matching the address. */
> > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> 
> What about the global PCI devices lock here? While VT-d code,
> perhaps wrongly, doesn't acquire that lock prior to calling the
> function, all callers in passthrough/pci.c do or verify it is being
> held.

Right, I will add an assert then.

> > +    if ( !pdev )
> > +        goto passthrough;
> > +
> > +    /* Find the vPCI register handler. */
> > +    r = vpci_find_register(pdev, reg, size);
> 
> With the overlap handling in vpci_find_register() I can't see how
> this would reliably return the correct (lowest) register when the
> request spans multiple ones.

It doesn't need to, if there's a lower register it will be found by
the recursive call to xen_vpci_read done below (before calling into
the handler pointed by r).

> > +    if ( !r )
> > +        goto passthrough;
> > +
> > +    if ( r->offset > reg )
> > +    {
> > +        /*
> > +         * There's a heading gap into the emulated register.
> > +         * NB: it's possible for this recursive call to have a size of 3.
> > +         */
> > +        rc = xen_vpci_read(seg, bus, devfn, reg, r->offset - reg, &tmp_data);
> 
> I'm not particularly happy to see recursion being used here, even if
> that's not going to be very deep. Both qemu and pciback get away
> without, iirc, and while it's not the neatest code I find qemu's easier
> to follow than the apparently written from scratch variant here. Is
> there a particular reason you didn't at least take what is there as a
> basis?

I've been thinking about this. I used an RB tree because I wanted the
searches to be fast, but due to the RB tree properties, if we want to
emulate an access that extends across multiple registers Xen needs to
issue several vpci_find_register calls, each of which starts from the
root of the RB tree, making this case slower.

Using a sorted linked list will allow Xen to only perform the search
once, and then continue from the first register until all registers
needed in order to complete the emulation have been found.

In summary, I think that using a sorted linked list to store the
registers here would indeed be better, and it would also get rid of
the recursion.
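
A minimal sketch of what a single pass over a sorted list could look like,
without recursion. Everything here is hypothetical: struct reg is a
stand-in rather than struct vpci_register, the stored value replaces the call
into a read handler, and uncovered bytes are assumed to read as 0xff.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register descriptor, kept sorted by ascending offset. */
struct reg {
    unsigned int offset;    /* start offset in the config space */
    unsigned int size;      /* size in bytes (1, 2 or 4) */
    uint32_t value;         /* stand-in for the read handler result */
    struct reg *next;
};

/* Read `size` bytes (1..4) at `reg`; gaps between registers read as 0xff. */
static uint32_t list_read(const struct reg *head, unsigned int reg,
                          unsigned int size)
{
    uint32_t data = (size >= 4) ? UINT32_MAX
                                : ((UINT32_C(1) << (size * 8)) - 1);
    const struct reg *r;

    for ( r = head; r != NULL; r = r->next )
    {
        unsigned int start, end, i;

        if ( r->offset + r->size <= reg )
            continue;           /* entirely below the access */
        if ( r->offset >= reg + size )
            break;              /* sorted: nothing further can overlap */

        start = r->offset > reg ? r->offset : reg;
        end = r->offset + r->size < reg + size ? r->offset + r->size
                                               : reg + size;

        /* Copy the overlapping bytes into their place in the result. */
        for ( i = start; i < end; i++ )
        {
            uint32_t byte = (r->value >> ((i - r->offset) * 8)) & 0xff;
            unsigned int shift = (i - reg) * 8;

            data = (data & ~(UINT32_C(0xff) << shift)) | (byte << shift);
        }
    }

    return data;
}
```

The walk stops at the first register that starts past the end of the
access, so each access costs one traversal regardless of how many registers
it spans.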

> > +        if ( rc )
> > +            return rc;
> > +
> > +        /* Add the head read to the partial result. */
> > +        ADD_RESULT(*data, tmp_data, r->offset - reg, 0);
> > +        data_lshift = r->offset - reg;
> > +
> > +        /* Account for the read. */
> > +        size -= data_lshift;
> > +        reg += data_lshift;
> > +    }
> > +    else if ( r->offset < reg )
> > +        /* There's an offset into the emulated register */
> > +        data_rshift = reg - r->offset;
> 
> This could be a plain else, avoiding another conditional branch.

This is likely going to change in any case...

> > +    ASSERT(data_lshift == 0 || data_rshift == 0);
> > +    data_size = min(size, r->size - data_rshift);
> > +    ASSERT(data_size != 0);
> > +
> > +    /* Perform the read of the register. */
> > +    rc = r->read(pdev, r->offset, &val, r->priv_data);
> > +    if ( rc )
> > +        return rc;
> > +
> > +    val.double_word >>= data_rshift * 8;
> > +    ADD_RESULT(*data, val.double_word, data_size, data_lshift);
> > +
> > +    /* Account for the read */
> > +    size -= data_size;
> > +    reg += data_size;
> > +
> > +    /* Read the remaining, if any. */
> > +    if ( size > 0 )
> > +    {
> > +        /*
> > +         * Read tailing data.
> 
> trailing?

Yes.

> > +static int vpci_write_helper(struct pci_dev *pdev,
> > +                             const struct vpci_register *r, unsigned int size,
> > +                             unsigned int offset, uint32_t data)
> > +{
> > +    union vpci_val val = { .double_word = data };
> > +    int rc;
> > +
> > +    ASSERT(size <= r->size);
> > +    if ( size != r->size )
> > +    {
> > +        rc = r->read(pdev, r->offset, &val, r->priv_data);
> > +        if ( rc )
> > +            return rc;
> > +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
> > +        data &= GENMASK_BYTES(size, 0);
> > +        val.double_word |= data << (offset * 8);
> > +    }
> > +
> > +    return r->write(pdev, r->offset, val, r->priv_data);
> > +}
> 
> I'm not sure that writing back the value read is correct in all cases
> (think of write-only or rw1c registers or even offsets where reads
> and writes access different registers altogether). I think the write
> handlers will need to be made capable of dealing with partial writes.

That seems to be what pciback does for writes: read, merge value,
write back (drivers/xen/xen-pciback/conf_space.c
xen_pcibk_config_write):

err = conf_space_read(dev, cfg_entry, field_start,
		      &tmp_val);
if (err)
	break;

tmp_val = merge_value(tmp_val, value, get_mask(size),
		      offset - field_start);

err = conf_space_write(dev, cfg_entry, field_start,
		       tmp_val);

I would prefer to do it this way in order to avoid adding more
complexity to the handlers themselves. So far I haven't found any such
registers (rw1c) in the PCI config space, do you have references to
any of them?
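
For reference, the read/merge/write step being discussed boils down to
something like the sketch below. merge_partial_write is a made-up name, not
the pciback or vPCI function, and it assumes offset + size <= 4.

```c
#include <stdint.h>

/*
 * Merge `size` bytes of `data` into `current` at byte `offset`, leaving the
 * remaining bytes of `current` untouched. Assumes offset + size <= 4.
 */
static uint32_t merge_partial_write(uint32_t current, uint32_t data,
                                    unsigned int size, unsigned int offset)
{
    uint32_t mask = (size >= 4) ? UINT32_MAX
                                : ((UINT32_C(1) << (size * 8)) - 1);

    mask <<= offset * 8;

    return (current & ~mask) | ((data << (offset * 8)) & mask);
}
```

The bytes outside the written range are passed to the write handler with
the value that was just read. That is harmless for plain read/write fields,
but for a write-1-to-clear field it means any bit that read back as 1 gets
written as 1, which is the case raised in the review.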

> > +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                   unsigned int reg, uint32_t size, uint32_t data)
> > +{
> > +    struct domain *d = current->domain;
> > +    struct pci_dev *pdev;
> > +    const struct vpci_register *r;
> > +    unsigned int data_size, data_offset = 0;
> > +    int rc;
> > +
> > +    ASSERT(vpci_locked(d));
> > +
> > +    /* Find the PCI dev matching the address. */
> > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> > +    if ( !pdev )
> > +        goto passthrough;
> > +
> > +    /* Find the vPCI register handler. */
> > +    r = vpci_find_register(pdev, reg, size);
> > +    if ( !r )
> > +        goto passthrough;
> > +
> > +    else if ( r->offset > reg )
> 
> Pointless "else" again, even more so with the blank line in between.
> 
> > --- a/xen/include/xen/pci.h
> > +++ b/xen/include/xen/pci.h
> > @@ -13,6 +13,7 @@
> >  #include <xen/irq.h>
> >  #include <xen/pci_regs.h>
> >  #include <xen/pfn.h>
> > +#include <xen/rbtree.h>
> 
> Why? All you add to this file is ...
> 
> > @@ -88,6 +89,9 @@ struct pci_dev {
> >  #define PT_FAULT_THRESHOLD 10
> >      } fault;
> >      u64 vf_rlen[6];
> > +
> > +    /* Data for vPCI. */
> > +    struct vpci *vpci;
> 
> ... this. I guess you really want to add the #include ...
> 
> > --- /dev/null
> > +++ b/xen/include/xen/vpci.h
> > @@ -0,0 +1,66 @@
> > +#ifndef _VPCI_
> > +#define _VPCI_
> > +
> > +#include <xen/pci.h>
> > +#include <xen/types.h>
> 
> ... here.

Right, the RB is going to be replaced with a linked-list anyway, but
the list header should be included here.

> > +/* Helpers for locking/unlocking. */
> > +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
> > +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> 
> While for the code layering you don't need recursive locks, did you
> consider using them nevertheless so that spin_is_locked() return
> values are actually meaningful for your purposes?

I'm not sure I follow, spin_is_locked already returns meaningful
values for my purpose AFAICT.

> > +#define REGISTER_VPCI_INIT(x) \
> > +  static const vpci_register_init_t x##_entry __used_section(".data.vpci") = x
> 
> To match up with the type name and assuming "REGISTER" here
> means the PCI register rather than "registration", I think this
> would better be VPCI_REGISTER() (I don't really mind the _INIT
> suffix, but I think it's relatively pointless).

OK, that's fine.

> > +/* Add vPCI handlers to device. */
> > +int xen_vpci_add_handlers(struct pci_dev *dev);
> > +
> > +/* Add/remove a register handler. */
> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> > +                          vpci_write_t write_handler, unsigned int offset,
> > +                          unsigned int size, void *data);
> > +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset);
> > +
> > +/* Generic read/write handlers for the PCI config space. */
> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                  unsigned int reg, uint32_t size, uint32_t *data);
> > +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                   unsigned int reg, uint32_t size, uint32_t data);
> 
> Along the lines of what I've said in a few places about return values,
> please carefully consider where they're needed. Once you decide
> they are really needed, the respective functions would likely want to
> become __must_check.

From an emulation PoV all those errors are going to be reported to the
VM as ignored writes or reads returning all 1's.

The question is where/who should do this translation. I thought it
would be helpful to do this in the trap handlers themselves, and let
the vpci code return more meaningful error values. If you think it's
not going to be helpful I can remove them.
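
As a rough illustration of doing the translation in the trap handler, the
sketch below wraps a handler that returns an error code and converts any
failure into all 1's of the access size. The handler type and the two sample
handlers are hypothetical and only exist for this example.

```c
#include <stdint.h>

/* Hypothetical handler type: 0 on success, a negative value on error. */
typedef int (*cfg_read_fn)(unsigned int reg, unsigned int size,
                           uint32_t *data);

/* Stand-in handlers, for illustration only. */
static int read_ok(unsigned int reg, unsigned int size, uint32_t *data)
{
    (void)reg;
    (void)size;
    *data = 0x12345678;
    return 0;
}

static int read_fail(unsigned int reg, unsigned int size, uint32_t *data)
{
    (void)reg;
    (void)size;
    (void)data;
    return -1;
}

/* Trap-side wrapper: the guest never sees the error code, only all 1's. */
static uint32_t trap_cfg_read(cfg_read_fn handler, unsigned int reg,
                              unsigned int size)
{
    uint32_t mask = (size >= 4) ? UINT32_MAX
                                : ((UINT32_C(1) << (size * 8)) - 1);
    uint32_t data = 0;

    if ( handler(reg, size, &data) != 0 )
        return mask;

    return data & mask;
}
```

This keeps the error codes available to callers inside the hypervisor while
the guest-visible behaviour stays the same as on bare hardware.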

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH v3 9/9] vpci/msix: add MSI-X handlers
  2017-04-27 14:35 ` [PATCH v3 9/9] vpci/msix: add MSI-X handlers Roger Pau Monne
@ 2017-05-29 13:29   ` Jan Beulich
  2017-06-28 15:29     ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-29 13:29 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> +static int vpci_msix_control_write(struct pci_dev *pdev, unsigned int reg,
> +                                   union vpci_val val, void *data)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    paddr_t table_base = pdev->vpci->header.bars[pdev->vpci->msix->bir].paddr;
> +    struct vpci_msix *msix = data;

Wouldn't you better use this also to obtain the array index one line
earlier?

> +    bool new_masked, new_enabled;
> +    unsigned int i;
> +    uint32_t data32;
> +    int rc;
> +
> +    new_masked = val.word & PCI_MSIX_FLAGS_MASKALL;
> +    new_enabled = val.word & PCI_MSIX_FLAGS_ENABLE;
> +
> +    if ( new_enabled != msix->enabled && new_enabled )

    if ( !msix->enabled && new_enabled )

would likely be easier to read (similar for the "else if" below).

> +    {
> +        /* MSI-X enabled. */
> +        for ( i = 0; i < msix->max_entries; i++ )
> +        {
> +            if ( msix->entries[i].masked )
> +                continue;
> +
> +            rc = vpci_msix_enable(&msix->entries[i].arch, pdev,
> +                                  msix->entries[i].addr, msix->entries[i].data,
> +                                  msix->entries[i].nr, table_base);
> +            if ( rc )
> +            {
> +                gdprintk(XENLOG_ERR,
> +                         "%04x:%02x:%02x.%u: unable to update entry %u: %d\n",
> +                         seg, bus, slot, func, i, rc);
> +                return rc;
> +            }
> +
> +            vpci_msix_mask(&msix->entries[i].arch, false);
> +        }
> +    }
> +    else if ( new_enabled != msix->enabled && !new_enabled )
> +    {
> +        /* MSI-X disabled. */
> +        for ( i = 0; i < msix->max_entries; i++ )
> +        {
> +            rc = vpci_msix_disable(&msix->entries[i].arch);
> +            if ( rc )
> +            {
> +                gdprintk(XENLOG_ERR,
> +                         "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
> +                         seg, bus, slot, func, i, rc);
> +                return rc;
> +            }
> +        }
> +    }
> +
> +    data32 = val.word;
> +    if ( (new_enabled != msix->enabled || new_masked != msix->masked) &&
> +         pci_msi_conf_write_intercept(pdev, reg, 2, &data32) >= 0 )
> +        pci_conf_write16(seg, bus, slot, func, reg, data32);

What's the intermediate variable "data32" good for here? Afaict you
could use val.word in its stead.

> +static struct vpci_msix *vpci_msix_find(struct domain *d, unsigned long addr)
> +{
> +    struct vpci_msix *msix;
> +
> +    ASSERT(vpci_locked(d));
> +    list_for_each_entry ( msix,  &d->arch.hvm_domain.msix_tables, next )
> +        if ( msix->pdev->vpci->header.command & PCI_COMMAND_MEMORY &&

Please parenthesize & within &&.

> +             addr >= msix->addr &&
> +             addr < msix->addr + msix->max_entries * PCI_MSIX_ENTRY_SIZE )
> +            return msix;
> +
> +    return NULL;
> +}

Looking ahead I'm getting the impression that you only allow
accesses to the MSI-X table entries, yet in vpci_modify_bars() you
(correctly) prevent mapping entire pages. While most other
registers are disallowed from sharing a page with the table, the PBA
is specifically named as an exception. Hence you need to support
at least reads from the entire range.

> +static int vpci_msix_table_accept(struct vcpu *v, unsigned long addr)
> +{
> +    int found;
> +
> +    vpci_lock(v->domain);
> +    found = !!vpci_msix_find(v->domain, addr);

At the risk of repeating a comment I gave on an earlier patch: Using
"bool" for "found" allows you to avoid the !! .

> +static int vpci_msix_access_check(struct pci_dev *pdev, unsigned long addr,
> +                                  unsigned int len)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +
> +

No double blank lines please.

> +    /* Only allow 32/64b accesses. */
> +    if ( len != 4 && len != 8 )
> +    {
> +        gdprintk(XENLOG_ERR,
> +                 "%04x:%02x:%02x.%u: invalid MSI-X table access size: %u\n",
> +                 seg, bus, slot, func, len);
> +        return -EINVAL;
> +    }
> +
> +    /* Do no allow accesses that span across multiple entries. */
> +    if ( (addr & (PCI_MSIX_ENTRY_SIZE - 1)) + len > PCI_MSIX_ENTRY_SIZE )
> +    {
> +        gdprintk(XENLOG_ERR,
> +                 "%04x:%02x:%02x.%u: MSI-X access crosses entry boundary\n",
> +                 seg, bus, slot, func);
> +        return -EINVAL;
> +    }
> +
> +    /*
> +     * Only allow 64b accesses to the low message address field.
> +     *
> +     * NB: this is more restrictive than the specification, that allows 64b
> +     * accesses to other fields under certain circumstances, so this check and
> +     * the code will have to be fixed in order to fully comply with the
> +     * specification.
> +     */
> +    if ( (addr & (PCI_MSIX_ENTRY_SIZE - 1)) != 0 && len != 4 )
> +    {
> +        gdprintk(XENLOG_ERR,
> +                 "%04x:%02x:%02x.%u: 64bit MSI-X table access to 32bit field"
> +                 " (offset: %#lx len: %u)\n", seg, bus, slot, func,
> +                 addr & (PCI_MSIX_ENTRY_SIZE - 1), len);
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}

So you allow mis-aligned accesses, but you disallow 8-byte ones
to the upper half of an entry? I think both aspects need to be got
right from the beginning, the more that you BUG() in the switch()es
further down in such cases.
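
A sketch of a stricter check, assuming only naturally aligned 4- and
8-byte accesses are to be accepted. msix_access_ok is a made-up name; with
16-byte table entries, natural alignment already guarantees an access cannot
cross an entry boundary, and it permits 8-byte accesses to the upper half.

```c
#include <stdbool.h>
#include <stdint.h>

/* Accept only naturally aligned dword or qword accesses. */
static bool msix_access_ok(uint64_t addr, unsigned int len)
{
    if ( len != 4 && len != 8 )
        return false;

    /* len is a power of two, so len - 1 masks the misaligned bits. */
    return (addr & (len - 1)) == 0;
}
```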

> +static struct vpci_msix_entry *vpci_msix_get_entry(struct vpci_msix *msix,
> +                                                   unsigned long addr)
> +{
> +    return &msix->entries[(addr - msix->addr) / PCI_MSIX_ENTRY_SIZE];
> +}
> +
> +static int vpci_msix_table_read(struct vcpu *v, unsigned long addr,
> +                                unsigned int len, unsigned long *data)
> +{
> +    struct vpci_msix *msix;
> +    struct vpci_msix_entry *entry;
> +    unsigned int offset;
> +
> +    vpci_lock(v->domain);
> +    msix = vpci_msix_find(v->domain, addr);
> +    if ( !msix )
> +    {
> +        vpci_unlock(v->domain);
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> +
> +    if ( vpci_msix_access_check(msix->pdev, addr, len) )
> +    {
> +        vpci_unlock(v->domain);
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> +
> +    /* Get the table entry and offset. */
> +    entry = vpci_msix_get_entry(msix, addr);
> +    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
> +
> +    switch ( offset )
> +    {
> +    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
> +        *data = entry->addr;

You're not clipping off the upper 32 bits here - is that reliably
happening elsewhere?
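
One way to make the truncation explicit rather than relying on the caller,
as a sketch only (read_lower_addr is a hypothetical helper, not part of the
patch):

```c
#include <stdint.h>

/* Return the full address for qword reads, the low half for dword reads. */
static uint64_t read_lower_addr(uint64_t entry_addr, unsigned int len)
{
    return (len == 8) ? entry_addr : (entry_addr & UINT32_C(0xffffffff));
}
```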

> +        break;
> +    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
> +        *data = entry->addr >> 32;
> +        break;
> +    case PCI_MSIX_ENTRY_DATA_OFFSET:
> +        *data = entry->data;
> +        break;
> +    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
> +        *data = entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0;

What about the other 31 bits?

> +static int vpci_msix_table_write(struct vcpu *v, unsigned long addr,
> +                                 unsigned int len, unsigned long data)
> +{
> +    struct vpci_msix *msix;
> +    struct vpci_msix_entry *entry;
> +    unsigned int offset;
> +
> +    vpci_lock(v->domain);
> +    msix = vpci_msix_find(v->domain, addr);
> +    if ( !msix )
> +    {
> +        vpci_unlock(v->domain);
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> +
> +    if ( vpci_msix_access_check(msix->pdev, addr, len) )
> +    {
> +        vpci_unlock(v->domain);
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> +
> +    /* Get the table entry and offset. */
> +    entry = vpci_msix_get_entry(msix, addr);
> +    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
> +
> +    switch ( offset )
> +    {
> +    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
> +        if ( len == 8 )
> +        {
> +            entry->addr = data;
> +            break;
> +        }
> +        entry->addr &= ~GENMASK(31, 0);
> +        entry->addr |= data;
> +        break;
> +    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
> +        entry->addr &= ~GENMASK(63, 32);
> +        entry->addr |= data << 32;
> +        break;
> +    case PCI_MSIX_ENTRY_DATA_OFFSET:
> +        entry->data = data;
> +        break;
> +    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
> +    {
> +        bool new_masked = data & PCI_MSIX_VECTOR_BITMASK;
> +        struct pci_dev *pdev = msix->pdev;
> +        paddr_t table_base =
> +            pdev->vpci->header.bars[pdev->vpci->msix->bir].paddr;

Again simply "msix->bir"?

> +        int rc;
> +
> +        if ( !msix->enabled )
> +        {
> +            entry->masked = new_masked;
> +            break;
> +        }
> +
> +        if ( new_masked != entry->masked && !new_masked )
> +        {
> +            /* Unmasking an entry, update it. */
> +            rc = vpci_msix_enable(&entry->arch, msix->pdev, entry->addr,

And simply "pdev" here?

> +static int vpci_init_msix(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    struct vpci_msix *msix;
> +    unsigned int msix_offset, i, max_entries;
> +    paddr_t msix_paddr;
> +    uint16_t control;
> +    int rc;
> +
> +    msix_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
> +    if ( !msix_offset )
> +        return 0;
> +
> +    if ( !vpci_msix_enabled(pdev->domain) )

This is a non-__init function, so it can't use dom0_msix (I'm saying
this just in case there really is a need to retain those command line
options).

> +    {
> +        xen_vpci_mask_capability(pdev, PCI_CAP_ID_MSIX);
> +        return 0;
> +    }
> +
> +    control = pci_conf_read16(seg, bus, slot, func,
> +                              msix_control_reg(msix_offset));
> +
> +    /* Get the maximum number of vectors the device supports. */
> +    max_entries = msix_table_size(control);
> +    if ( !max_entries )
> +        return 0;

This if() is never going to be true.

> +    msix = xzalloc_bytes(MSIX_SIZE(max_entries));
> +    if ( !msix )
> +        return -ENOMEM;
> +
> +    msix->max_entries = max_entries;
> +    msix->pdev = pdev;
> +
> +    /* Find the MSI-X table address. */
> +    msix->offset = pci_conf_read32(seg, bus, slot, func,
> +                                   msix_table_offset_reg(msix_offset));
> +    msix->bir = msix->offset & PCI_MSIX_BIRMASK;
> +    msix->offset &= ~PCI_MSIX_BIRMASK;
> +
> +    ASSERT(pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM ||
> +           pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM64_LO);
> +    msix->addr = pdev->vpci->header.bars[msix->bir].mapped_addr + msix->offset;
> +    msix_paddr = pdev->vpci->header.bars[msix->bir].paddr + msix->offset;

I can't seem to find where these addresses get updated in case
the BARs are being relocated by the Dom0 kernel.

> +    for ( i = 0; i < msix->max_entries; i++)
> +    {
> +        msix->entries[i].masked = true;
> +        msix->entries[i].nr = i;
> +        vpci_msix_arch_init(&msix->entries[i].arch);
> +    }
> +
> +    if ( list_empty(&d->arch.hvm_domain.msix_tables) )
> +        register_mmio_handler(d, &vpci_msix_table_ops);
> +
> +    list_add(&msix->next, &d->arch.hvm_domain.msix_tables);
> +
> +    rc = xen_vpci_add_register(pdev, vpci_msix_control_read,
> +                               vpci_msix_control_write,
> +                               msix_control_reg(msix_offset), 2, msix);
> +    if ( rc )
> +    {
> +        dprintk(XENLOG_ERR,
> +                "%04x:%02x:%02x.%u: failed to add handler for MSI-X control: %d\n",
> +                seg, bus, slot, func, rc);
> +        goto error;
> +    }
> +
> +    if ( pdev->vpci->header.command & PCI_COMMAND_MEMORY )
> +    {
> +        /* Unmap this memory from the guest. */
> +        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(msix->addr)),
> +                         _mfn(PFN_DOWN(msix_paddr)),
> +                         PFN_UP(msix->max_entries * PCI_MSIX_ENTRY_SIZE),
> +                         false);
> +        if ( rc )
> +        {
> +            dprintk(XENLOG_ERR,
> +                    "%04x:%02x:%02x.%u: unable to unmap MSI-X BAR region: %d\n",
> +                    seg, bus, slot, func, rc);
> +            goto error;
> +        }
> +    }

Why is this unmapping conditional upon PCI_COMMAND_MEMORY?

> +static void vpci_dump_msix(unsigned char key)
> +{
> +    struct domain *d;
> +    struct pci_dev *pdev;
> +
> +    printk("Guest MSI-X information:\n");
> +
> +    for_each_domain ( d )
> +    {
> +        if ( !has_vpci(d) )
> +            continue;
> +
> +        vpci_lock(d);

Dump handlers, even if there are existing examples to the contrary,
should only try-lock any locks they mean to hold (and not dump
anything if they can't get hold of the lock).
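
The pattern meant here, sketched outside of Xen with a C11 atomic_flag
standing in for the per-domain spinlock. struct dom, dom_trylock and
dump_all are hypothetical names used only for this illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-domain state with a simple test-and-set lock. */
struct dom {
    atomic_flag lock;
    unsigned int id;
};

static bool dom_trylock(struct dom *d)
{
    return !atomic_flag_test_and_set(&d->lock);
}

static void dom_unlock(struct dom *d)
{
    atomic_flag_clear(&d->lock);
}

/* Dump every domain whose lock is free; returns how many were dumped. */
static unsigned int dump_all(struct dom *doms, unsigned int nr)
{
    unsigned int i, dumped = 0;

    for ( i = 0; i < nr; i++ )
    {
        if ( !dom_trylock(&doms[i]) )
        {
            /* Never spin in a debug handler: skip and move on. */
            printf("domain %u: lock busy, skipping\n", doms[i].id);
            continue;
        }

        printf("domain %u: MSI-X state would be printed here\n", doms[i].id);
        dumped++;
        dom_unlock(&doms[i]);
    }

    return dumped;
}
```

The point is that a debug key handler can run while the lock holder is
wedged, so blocking on the lock could hang the very path used to diagnose
the problem.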

> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -112,6 +112,33 @@ struct vpci {
>          /* Arch-specific data. */
>          struct vpci_arch_msi arch;
>      } *msi;
> +
> +    /* MSI-X data. */
> +    struct vpci_msix {
> +        struct pci_dev *pdev;
> +        /* Maximum number of vectors supported by the device. */
> +        unsigned int max_entries;
> +        /* MSI-X table offset. */
> +        unsigned int offset;
> +        /* MSI-X table BIR. */
> +        unsigned int bir;
> +        /* Table addr. */
> +        paddr_t addr;
> +        /* MSI-X enabled? */
> +        bool enabled;
> +        /* Masked? */
> +        bool masked;
> +        /* List link. */
> +        struct list_head next;
> +        /* Entries. */
> +        struct vpci_msix_entry {
> +                unsigned int nr;
> +                uint64_t addr;
> +                uint32_t data;
> +                bool masked;
> +                struct vpci_arch_msix_entry arch;

Indentation.

Jan

* Re: [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities
  2017-04-27 14:35 ` [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities Roger Pau Monne
  2017-05-23 12:49   ` Jan Beulich
@ 2017-05-29 13:32   ` Jan Beulich
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2017-05-29 13:32 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> +static int vpci_index_capabilities(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint8_t pos = PCI_CAPABILITY_LIST;
> +    uint16_t status;
> +    unsigned int max_cap = 48;
> +    struct vpci_capability *cap;
> +    int rc;
> +
> +    INIT_LIST_HEAD(&pdev->vpci->cap_list);
> +
> +    /* Check if device has capabilities. */
> +    status = pci_conf_read16(seg, bus, slot, func, PCI_STATUS);
> +    if ( !(status & PCI_STATUS_CAP_LIST) )
> +        return 0;
> +
> +    /* Add the root capability pointer. */
> +    cap = xzalloc(struct vpci_capability);
> +    if ( !cap )
> +        return -ENOMEM;
> +
> +    cap->offset = pos;
> +    list_add_tail(&cap->next, &pdev->vpci->cap_list);
> +    rc = xen_vpci_add_register(pdev, vpci_cap_read, vpci_cap_write, pos,
> +                               1, cap);
> +    if ( rc )
> +        return rc;
> +
> +    /*
> +     * Iterate over the list of capabilities present in the device, and
> +     * add a handler for each register pointer to the next item
> +     * (PCI_CAP_LIST_NEXT).
> +     */
> +    while ( max_cap-- )
> +    {
> +        pos = pci_conf_read8(seg, bus, slot, func, pos);
> +        if ( pos < 0x40 )
> +            break;
> +
> +        cap = xzalloc(struct vpci_capability);
> +        if ( !cap )
> +            return -ENOMEM;
> +
> +        cap->offset = pos;
> +        list_add_tail(&cap->next, &pdev->vpci->cap_list);
> +        pos += PCI_CAP_LIST_NEXT;
> +        rc = xen_vpci_add_register(pdev, vpci_cap_read, vpci_cap_write, pos,
> +                                   1, cap);
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    return 0;
> +}

Btw., instead of duplicating some of what pci_find_cap_offset()
and pci_find_next_cap() do, perhaps worth making those two
functions capable of dealing with a wildcard ID (0xff) as input.
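
Roughly what is meant, sketched against a plain byte array instead of real
config space accessors. find_next_cap and CAP_ID_ANY are hypothetical, and
the sketch assumes no real capability uses ID 0xff.

```c
#include <stdint.h>

#define CAP_LIST_ID    0     /* ID byte within a capability */
#define CAP_LIST_NEXT  1     /* next-pointer byte within a capability */
#define CAP_ID_ANY     0xff  /* hypothetical wildcard: match any ID */

/*
 * Follow the pointer stored at `ptr_reg` and return the offset of the first
 * capability matching `id` (or any capability for CAP_ID_ANY), 0 if none.
 * `cfg` is a 256-byte snapshot of the configuration space.
 */
static uint8_t find_next_cap(const uint8_t *cfg, uint8_t ptr_reg, uint8_t id)
{
    unsigned int ttl = 48;      /* bound the walk against looping lists */
    uint8_t pos = cfg[ptr_reg];

    while ( ttl-- && pos >= 0x40 )
    {
        pos &= 0xfc;            /* capabilities are dword aligned */
        if ( id == CAP_ID_ANY || cfg[pos + CAP_LIST_ID] == id )
            return pos;
        pos = cfg[pos + CAP_LIST_NEXT];
    }

    return 0;
}
```

Calling it with the wildcard and feeding each returned offset plus
CAP_LIST_NEXT back in as ptr_reg enumerates the whole list, which is what
the indexing loop needs.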

Jan


* Re: [PATCH v3 0/9] vpci: PCI config space emulation
  2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (8 preceding siblings ...)
  2017-04-27 14:35 ` [PATCH v3 9/9] vpci/msix: add MSI-X handlers Roger Pau Monne
@ 2017-05-29 13:38 ` Jan Beulich
  2017-05-29 14:14   ` Roger Pau Monne
  9 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-29 13:38 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, julien.grall, boris.ostrovsky

>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> The following series contain an implementation of handlers for the PCI
> configuration space inside of Xen. This allows Xen to detect accesses to the
> PCI configuration space and react accordingly.
> 
> Although there hasn't been a lot of review on the previous version, I send this
> new version because I will be away for > 1 week, and I would rather have review
> on this version than the old one. As usual, each patch contains a changeset
> summary between versions.
> 
> Patch 1 implements the generic handlers for accesses to the PCI configuration
> space together with a minimal user-space test harness that I've used during
> development. Currently a per-device red-back tree is used in order to store the
> list of handlers, and they are indexed based on their offset inside of the
> configuration space. Patch 1 also adds the x86 port IO traps and wires them
> into the newly introduced vPCI dispatchers. Patch 2 adds handlers for the ECAM
> areas (as found on the MMCFG ACPI table). Patches 3 and 4 are mostly code
> moment/refactoring in order to implement support for BAR mapping in patch 5.
> Patch 6 allows Xen to mask certain PCI capabilities on-demand, which is used in
> order to mask MSI and MSI-X.
> 
> Finally patches 8 and 9 implement support in order to emulate the MSI/MSI-X
> capabilities inside of Xen, so that the interrupts are transparently routed to
> the guest.

While the code looks reasonable for this early stage, it's still quite
large a chunk of new logic. Therefore I think that if already there
was no prior design discussion, some reasoning behind the decisions
you've taken should be provided here. In particular I'm quite
unhappy about this huge amount of intercepting and emulation,
none of which we require for PV Dom0. IOW I continue to be
unconvinced that putting all the burden on Xen while not para-
virtualizing any of this in the Dom0 kernel is the right choice. It
certainly would be if we were talking about HVM Dom0, but this is
PVH, and the "PV" part is first there for a reason.

Jan


* Re: [PATCH v3 0/9] vpci: PCI config space emulation
  2017-05-29 13:38 ` [PATCH v3 0/9] vpci: PCI config space emulation Jan Beulich
@ 2017-05-29 14:14   ` Roger Pau Monne
  0 siblings, 0 replies; 49+ messages in thread
From: Roger Pau Monne @ 2017-05-29 14:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, julien.grall, boris.ostrovsky

On Mon, May 29, 2017 at 07:38:14AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > The following series contain an implementation of handlers for the PCI
> > configuration space inside of Xen. This allows Xen to detect accesses to the
> > PCI configuration space and react accordingly.
> > 
> > Although there hasn't been a lot of review on the previous version, I send this
> > new version because I will be away for > 1 week, and I would rather have review
> > on this version than the old one. As usual, each patch contains a changeset
> > summary between versions.
> > 
> > Patch 1 implements the generic handlers for accesses to the PCI configuration
> > space together with a minimal user-space test harness that I've used during
> > development. Currently a per-device red-back tree is used in order to store the
> > list of handlers, and they are indexed based on their offset inside of the
> > configuration space. Patch 1 also adds the x86 port IO traps and wires them
> > into the newly introduced vPCI dispatchers. Patch 2 adds handlers for the ECAM
> > areas (as found on the MMCFG ACPI table). Patches 3 and 4 are mostly code
> > moment/refactoring in order to implement support for BAR mapping in patch 5.
> > Patch 6 allows Xen to mask certain PCI capabilities on-demand, which is used in
> > order to mask MSI and MSI-X.
> > 
> > Finally patches 8 and 9 implement support in order to emulate the MSI/MSI-X
> > capabilities inside of Xen, so that the interrupts are transparently routed to
> > the guest.
> 
> While the code looks reasonable for this early stage, it's still quite
> large a chunk of new logic.

Thanks for the detailed review, it's greatly appreciated (it's a very
large amount of code so I assume this is not trivial for you at
all).

I'm still going over the comments, I hope I will be able to send a new
version before the end of the week.

> Therefore I think that if already there
> was no prior design discussion, some reasoning behind the decisions
> you've taken should be provided here. In particular I'm quite
> unhappy about this huge amount of intercepting and emulation,
> none of which we require for PV Dom0. IOW I continue to be
> unconvinced that putting all the burden on Xen while not para-
> virtualizing any of this in the Dom0 kernel is the right choice.

IMHO, there are two main reasons for doing all this emulation inside of
Xen. The first one is to avoid adding a bunch of duplicated Xen PV
specific code to each OS we want to support in PVH mode. That just
promotes Xen code duplication amongst OSes, which leads to a higher
maintenance burden.

The second reason is that this code (or its functionality to be
more precise) already exists in QEMU (and pciback to a degree), and
it's code that we already support and maintain. By moving it into the
hypervisor itself every guest type can make use of it, and it can be
shared between them all (I know that the code in this series is not
yet suitable for DomU HVM guests).

> It
> certainly would be if we were talking about HVM Dom0, but this is
> PVH, and the "PV" part is first there for a reason.

Well, I've been mostly forced into using the PVH name for historical
reasons, but when I started working on this I called it HVMlite,
because I think it's more similar to HVM than PV by a long shot (and
the PVH Dom0 builder functions were using the "hvm" prefix in the
first iterations of that series).

Since PVH was never finished, the PVH name was reused in order to
spare us the shame of announcing something that was never finished,
and to avoid adding more confusion for users.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-05-29 12:57     ` Roger Pau Monne
@ 2017-05-29 14:16       ` Jan Beulich
  2017-05-29 15:05         ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-05-29 14:16 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, julien.grall, Paul Durrant,
	xen-devel, boris.ostrovsky

>>> On 29.05.17 at 14:57, <roger.pau@citrix.com> wrote:
> On Fri, May 19, 2017 at 05:35:47AM -0600, Jan Beulich wrote:
>> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> > +    /* Read/write of unset register. */
>> > +    VPCI_READ_CHECK(8, 4, 0xffffffff);
>> > +    VPCI_READ_CHECK(8, 2, 0xffff);
>> > +    VPCI_READ_CHECK(8, 1, 0xff);
>> > +    VPCI_WRITE(10, 2, 0xbeef);
>> > +    VPCI_READ_CHECK(10, 2, 0xffff);
>> > +
>> > +    /* Read of multiple registers */
>> > +    VPCI_CHECK_REG(7, 1, 0xbd);
>> > +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
>> 
>> I think a variant accessing mixed size registers would also be
>> desirable here. Perhaps it would be best to exhaustively test
>> all possible variations (there aren't that many after all). Same
>> for writes and partial accesses (below) then.
> 
> So you mean to scan the whole space (from 0 to 128 on this test) using
> all possible register sizes for both read and write? That would indeed
> be feasible.

Not sure what the "from 0 to 128" is meant to apply to. What I mean
is to test all combinations of (mix of) register sizes and access sizes.
I.e. all combinations making up a single 4-byte field ((1,1,1,1),
(1,1,2), (2,1,1), (2,2), (4)) and all four 1-byte accesses, both 2-byte
ones, and the sole possible 4-byte one.
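
To make the enumeration concrete, here is a minimal, self-contained
sketch of such an exhaustive read check. It is only a model, not code
from the series: layouts, accesses, byte_mask, model_read and check_all
are hypothetical names, and the registers are backed by a plain 32-bit
value instead of real vPCI handlers.

```c
#include <stdint.h>

/* Register sizes (in bytes) making up one 4-byte field; 0 terminates. */
static const unsigned int layouts[][4] = {
    { 1, 1, 1, 1 }, { 1, 1, 2, 0 }, { 2, 1, 1, 0 }, { 2, 2, 0, 0 }, { 4, 0, 0, 0 },
};

/* The seven naturally aligned accesses to a 4-byte field. */
static const struct { unsigned int off, len; } accesses[] = {
    { 0, 1 }, { 1, 1 }, { 2, 1 }, { 3, 1 }, { 0, 2 }, { 2, 2 }, { 0, 4 },
};

/* All-ones mask covering 'bytes' bytes (bytes must be 1, 2 or 4). */
static uint32_t byte_mask(unsigned int bytes)
{
    return 0xffffffffu >> (32 - bytes * 8);
}

/* Read len bytes at off by merging every register overlapping the access. */
static uint32_t model_read(const unsigned int *layout, uint32_t backing,
                           unsigned int off, unsigned int len)
{
    uint32_t data = 0;
    unsigned int reg_off = 0;

    for ( unsigned int i = 0; i < 4 && layout[i]; i++ )
    {
        unsigned int reg_end = reg_off + layout[i];
        /* What this register's own read handler would return. */
        uint32_t reg_val = (backing >> (reg_off * 8)) & byte_mask(layout[i]);

        /* Copy only the bytes of this register that the access covers. */
        for ( unsigned int b = reg_off; b < reg_end; b++ )
            if ( b >= off && b < off + len )
                data |= ((reg_val >> ((b - reg_off) * 8)) & 0xff)
                        << ((b - off) * 8);

        reg_off = reg_end;
    }

    return data;
}

/* Run every layout against every access; returns the number of mismatches. */
static unsigned int check_all(uint32_t backing)
{
    unsigned int failures = 0;

    for ( unsigned int l = 0; l < sizeof(layouts) / sizeof(layouts[0]); l++ )
        for ( unsigned int a = 0; a < sizeof(accesses) / sizeof(accesses[0]); a++ )
        {
            unsigned int off = accesses[a].off, len = accesses[a].len;
            uint32_t expected = (backing >> (off * 8)) & byte_mask(len);

            if ( model_read(layouts[l], backing, off, len) != expected )
                failures++;
        }

    return failures;
}
```

That is 5 layouts times 7 accesses, i.e. 35 read cases; the write side
would presumably follow the same pattern.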

>> > @@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
>> >      handler->ops = &g2m_portio_ops;
>> >  }
>> >  
>> > +/* Do some sanity checks. */
>> > +static int vpci_access_check(unsigned int reg, unsigned int len)
>> > +{
>> > +    /* Check access size. */
>> > +    if ( len != 1 && len != 2 && len != 4 )
>> > +    {
>> > +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
>> > +                 reg, len);
>> 
>> I think many of such gdprintk()s want to go away before this series
>> gets committed.
> 
> OK, I've found them useful while developing, but I guess they are not
> really useful outside from that context. I guess there's no way to
> leave them in place, maybe a Kconfig option?

That seems overkill. I wouldn't so much mind the messages (they
get compiled out for non-debug builds anyway), but the clutter
they introduce to code: In some cases half of your functions are
the invocation of gdprintk() ...

>> > +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
>> > +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
>> 
>> Plain bool please.
> 
> Sadly struct hvm_io_ops (which is where this function is used) expects
> a bool_t as return.

I don't follow - bool_t is simply a typedef of bool nowadays, and
typedefs are all equivalent as far as the C type system goes.

>> Again the question - what's the bare hardware equivalent of
>> returning X86EMUL_UNHANDLEABLE here?
> 
> All 1's I assume (or other random garbage). Would you be OK with me
> adding a "fail" label that sets data to all 1's and returns X86EMUL_OKAY?

That would probably be okay (despite my dislike of labels and goto-s).

>> > +#include <xen/sched.h>
>> > +#include <xen/vpci.h>
>> > +
>> > +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
>> > +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>> > +#define vpci_init __start_vpci_array
>> 
>> What is this last one good for?
> 
> It's used by xen_vpci_add_handlers in order to call the init
> functions, I can drop it and call __start_vpci_array[i](...) if that's
> better.

Well, if there were several users, I could see the benefit of an
abbreviating #define. But for a single user the #define adds
more code / clutter than is being saved on the use site.

>> > +    struct rb_node node;
>> > +};
>> > +
>> > +int xen_vpci_add_handlers(struct pci_dev *pdev)
>> 
>> __hwdom_init (I notice setup_one_hwdom_device() wrongly isn't
>> annotated so).
> 
> OK, so you really want the init handlers to be inside of the
> .init.rodata section then.

Only if that's correct, and it is correct as long as all possible call trees
root in a __hwdom_init function. (To avoid misunderstanding: This
clearly can't be .init.rodata uniformly, as __hwdom_init isn't always
an alias of __init).

>> > +    struct vpci_register r = {
>> > +        .offset = reg,
>> > +        .size = size,
>> > +    };
>> > +
>> > +    ASSERT(vpci_locked(pdev->domain));
>> > +
>> > +    node = pdev->vpci->handlers.rb_node;
>> > +    while ( node )
>> > +    {
>> > +        struct vpci_register *t =
>> 
>> const
> 
> Making both of them const means the return must also be const, and
> that's not suitable by some of the consumers (ie:
> xen_vpci_remove_register for example).

In that case there's no choice then, okay.

>> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,

Btw., I only now notice this further strange xen_ prefix here.

>> > +                          vpci_write_t write_handler, unsigned int offset,
>> > +                          unsigned int size, void *data)
>> > +{
>> > +    struct rb_node **new, *parent;
>> > +    struct vpci_register *r;
>> > +
>> > +    /* Some sanity checks. */
>> > +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
>> 
>> Off by one again in the offset check.
> 
> Fixed to be > 0xfff. Should this maybe be added to pci_regs.h as
> PCI_MAX_REGISTER? 

Could be done, but then we need one for base and one for
extended config space. May want to check whether Linux has
invented some good names for these by now.

>> > +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
>> > +{
>> > +    struct vpci_register *r;
>> > +
>> > +    vpci_lock(pdev->domain);
>> > +    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);
>> 
>> I'm not sure about this - is there anything wrong with the caller,
>> knowing the size, also passing it? You could then even refuse
>> requests to remove a register where (offset,size) doesn't match
>> the recorded values (as vpci_find_register() will return any
>> overlapping one).
> 
> Well, I think the important bit is to check that what
> vpci_find_register returns matches what the caller expects to
> remove.

Correct. Debuggability would call for checking both offset and size.

>> > +/* Helper macros for the read/write handlers. */
>> > +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
>> 
>> What do e and s stand for here?
> 
> e = end, s = start (in bytes).

And where do you start counting? Having seen the rest of the
series I'm actually rather unconvinced that using these macros results
in better code - to me, plain hex numbers are quite a bit easier
to read as long as the number of digits doesn't go meaningfully
beyond 10 or so.

>> > +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
>> 
>> And at least o here?
> 
> o = offset (in bytes)

I think simply writing the shift expression is once again more
clear to the reader than using a macro which is longer to read
and type and which has semantics which aren't immediately
clear from its name.

> I can rename ADD_RESULT to APPEND_RESULT or something more descriptive
> if you think it's going to make it easier to understand.

I'd prefer if the name "merge" appeared in that name - I don't see
this being usable strictly only to append to either side of a value.

>> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
>> 
>> The function being other than void, same question as earlier:
>> What's the bare hardware equivalent of this returning other
>> than zero?
> 
> I thought it would be useful to have some more fine-grained error
> reporting if that's ever needed, although as you say, from a hardware
> point of view any errors will be simply reported as the value obtained
> being all 1's.
> 
> The question is, should this already return all 1's, or should the
> caller translate failures into all 1's?

If you leave this to the callers, overall code would likely become
less readable, so I'd prefer this to be done in a central place.

>> > +    /* Find the vPCI register handler. */
>> > +    r = vpci_find_register(pdev, reg, size);
>> 
>> With the overlap handling in vpci_find_register() I can't see how
>> this would reliably return the correct (lowest) register when the
>> request spans multiple ones.
> 
> It doesn't need to, if there's a lower register it will be found by
> the recursive call to xen_vpci_read done below (before calling into
> the handler pointed by r).

Since that's quite non-obvious, to me this is another argument
against using recursion here.

>> > +static int vpci_write_helper(struct pci_dev *pdev,
>> > +                             const struct vpci_register *r, unsigned int size,
>> > +                             unsigned int offset, uint32_t data)
>> > +{
>> > +    union vpci_val val = { .double_word = data };
>> > +    int rc;
>> > +
>> > +    ASSERT(size <= r->size);
>> > +    if ( size != r->size )
>> > +    {
>> > +        rc = r->read(pdev, r->offset, &val, r->priv_data);
>> > +        if ( rc )
>> > +            return rc;
>> > +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
>> > +        data &= GENMASK_BYTES(size, 0);
>> > +        val.double_word |= data << (offset * 8);
>> > +    }
>> > +
>> > +    return r->write(pdev, r->offset, val, r->priv_data);
>> > +}
>> 
>> I'm not sure that writing back the value read is correct in all cases
>> (think of write-only or rw1c registers or even offsets where reads
>> and writes access different registers altogether). I think the write
>> handlers will need to be made capable of dealing with partial writes.
> 
> That seems to be what pciback does for writes: read, merge value,
> write back (drivers/xen/xen-pciback/conf_space.c
> xen_pcibk_config_write):
> 
> err = conf_space_read(dev, cfg_entry, field_start,
> 		      &tmp_val);
> if (err)
> 	break;
> 
> tmp_val = merge_value(tmp_val, value, get_mask(size),
> 		      offset - field_start);
> 
> err = conf_space_write(dev, cfg_entry, field_start,
> 		       tmp_val);
> 
> I would prefer to do it this way in order to avoid adding more
> complexity to the handlers themselves. So far I haven't found any such
> registers (rw1c) in the PCI config space, do you have references to
> any of them?

The status register.

>> > +/* Helpers for locking/unlocking. */
>> > +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
>> > +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
>> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
>> 
>> While for the code layering you don't need recursive locks, did you
>> consider using them nevertheless so that spin_is_locked() return
>> values are actually meaningful for your purposes?
> 
> I'm not sure I follow, spin_is_locked already return meaningful values
> for my purpose AFAICT.

For non-recursive locks this tells you whether _any_ CPU holds
the lock, yet normally you want to know whether the CPU you
run on does.

Jan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-05-29 14:16       ` Jan Beulich
@ 2017-05-29 15:05         ` Roger Pau Monne
  2017-05-29 15:26           ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-05-29 15:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, julien.grall, Paul Durrant,
	xen-devel, boris.ostrovsky

On Mon, May 29, 2017 at 08:16:41AM -0600, Jan Beulich wrote:
> >>> On 29.05.17 at 14:57, <roger.pau@citrix.com> wrote:
> > On Fri, May 19, 2017 at 05:35:47AM -0600, Jan Beulich wrote:
> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> >> > +    /* Read/write of unset register. */
> >> > +    VPCI_READ_CHECK(8, 4, 0xffffffff);
> >> > +    VPCI_READ_CHECK(8, 2, 0xffff);
> >> > +    VPCI_READ_CHECK(8, 1, 0xff);
> >> > +    VPCI_WRITE(10, 2, 0xbeef);
> >> > +    VPCI_READ_CHECK(10, 2, 0xffff);
> >> > +
> >> > +    /* Read of multiple registers */
> >> > +    VPCI_CHECK_REG(7, 1, 0xbd);
> >> > +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
> >> 
> >> I think a variant accessing mixed size registers would also be
> >> desirable here. Perhaps it would be best to exhaustively test
> >> all possible variations (there aren't that many after all). Same
> >> for writes and partial accesses (below) then.
> > 
> > So you mean to scan the whole space (from 0 to 128 on this test) using
> > all possible register sizes for both read and write? That would indeed
> > be feasible.
> 
> Not sure what the "from 0 to 128" is meant to apply to. What I mean
> is to test all combinations of (mix of) register sizes and access sizes.
> I.e. all combinations making up a single 4-byte field ((1,1,1,1),
> (1,1,2), (2,1,1), (2,2), (4)) and all four 1-byte accesses, both 2-byte
> ones, and the sole possible 4-byte one.

OK, thanks for the clarification, now I get it.

> >> > @@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
> >> >      handler->ops = &g2m_portio_ops;
> >> >  }
> >> >  
> >> > +/* Do some sanity checks. */
> >> > +static int vpci_access_check(unsigned int reg, unsigned int len)
> >> > +{
> >> > +    /* Check access size. */
> >> > +    if ( len != 1 && len != 2 && len != 4 )
> >> > +    {
> >> > +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
> >> > +                 reg, len);
> >> 
> >> I think many of such gdprintk()s want to go away before this series
> >> gets committed.
> > 
> > OK, I've found them useful while developing, but I guess they are not
> > really useful outside from that context. I guess there's no way to
> > leave them in place, maybe a Kconfig option?
> 
> That seems overkill. I wouldn't so much mind the messages (they
> get compiled out for non-debug builds anyway), but the clutter
> they introduce to code: In some cases half of your functions are
> the invocation of gdprintk() ...

Let me try to prune some of them.

> >> > +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> >> > +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
> >> 
> >> Plain bool please.
> > 
> > Sadly struct hvm_io_ops (which is where this function is used) expects
> > a bool_t as return.
> 
> I don't follow - bool_t is simply a typedef of bool nowadays, and
> typedefs are all equivalent as far as the C type system goes.

Clearly my bad, I assumed they were actually different types.

> >> Again the question - what's the bare hardware equivalent of
> >> returning X86EMUL_UNHANDLEABLE here?
> > 
> > All 1's I assume (or other random garbage). Would you be OK with me
> > adding a "fail" label that sets data to all 1's and returns X86EMUL_OKAY?
> 
> That would probably be okay (despite my dislike of labels and goto-s).

Maybe the goto is not needed after all if vpci_read returns the data
filled with 1's and no error code.

> >> > +#include <xen/sched.h>
> >> > +#include <xen/vpci.h>
> >> > +
> >> > +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> >> > +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> >> > +#define vpci_init __start_vpci_array
> >> 
> >> What is this last one good for?
> > 
> > It's used by xen_vpci_add_handlers in order to call the init
> > functions, I can drop it and call __start_vpci_array[i](...) if that's
> > better.
> 
> Well, if there were several users, I could see the benefit of an
> abbreviating #define. But for a single user the #define adds
> more code / clutter than is being saved on the use site.

Ack.

> >> > +    struct rb_node node;
> >> > +};
> >> > +
> >> > +int xen_vpci_add_handlers(struct pci_dev *pdev)
> >> 
> >> __hwdom_init (I notice setup_one_hwdom_device() wrongly isn't
> >> annotated so).
> > 
> > OK, so you really want the init handlers to be inside of the
> > .init.rodata section then.
> 
> Only if that's correct, and it is correct as long as all possible call trees
> root in a __hwdom_init function. (To avoid misunderstanding: This
> clearly can't be .init.rodata uniformly, as __hwdom_init isn't always
> an alias of __init).

OK.

> >> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> 
> Btw., I only now notice this further strange xen_ prefix here.

I assume this should be vpci_*, (dropping the xen_ prefix uniformly).

> >> > +                          vpci_write_t write_handler, unsigned int offset,
> >> > +                          unsigned int size, void *data)
> >> > +{
> >> > +    struct rb_node **new, *parent;
> >> > +    struct vpci_register *r;
> >> > +
> >> > +    /* Some sanity checks. */
> >> > +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
> >> 
> >> Off by one again in the offset check.
> > 
> > Fixed to be > 0xfff. Should this maybe be added to pci_regs.h as
> > PCI_MAX_REGISTER? 
> 
> Could be done, but then we need one for base and one for
> extended config space. May want to check whether Linux has
> invented some good names for these by now.

pci_regs.h from Linux now has:

/*
 * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
 * configuration space.  PCI-X Mode 2 and PCIe devices have 4096 bytes of
 * configuration space.
 */
#define PCI_CFG_SPACE_SIZE	256
#define PCI_CFG_SPACE_EXP_SIZE	4096

At the top. Do you think those defines are fine for importing? (this
was introduced by cc10385b6fde3, but I don't think importing this in a
more formal way makes sense).

> >> > +/* Helper macros for the read/write handlers. */
> >> > +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
> >> 
> >> What do e and s stand for here?
> > 
> > e = end, s = start (in bytes).
> 
> And where do you start counting? Having seen the rest of the
> series I'm actually rather unconvinced that using these macros results
> in better code - to me, plain hex numbers are quite a bit easier
> to read as long as the number of digits doesn't go meaningfully
> beyond 10 or so.
> 
> >> > +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
> >> 
> >> And at least o here?
> > 
> > o = offset (in bytes)
> 
> I think simply writing the shift expression is once again more
> clear to the reader than using a macro which is longer to read
> and type and which has semantics which aren't immediately
> clear from its name.
> 
> > I can rename ADD_RESULT to APPEND_RESULT or something more descriptive
> > if you think it's going to make it easier to understand.
> 
> I'd prefer if the name "merge" appeared in that name - I don't see
> this being usable strictly only to append to either side of a value.

OK MERGE_RESULT or MERGE_REGISTER maybe? (and the rest removed).

> >> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> >> 
> >> The function being other than void, same question as earlier:
> >> What's the bare hardware equivalent of this returning other
> >> than zero?
> > 
> > I thought it would be useful to have some more fine-grained error
> > reporting if that's ever needed, although as you say, from a hardware
> > point of view any errors will be simply reported as the value obtained
> > being all 1's.
> > 
> > The question is, should this already return all 1's, or should the
> > caller translate failures into all 1's?
> 
> If you leave this to the callers, overall code would likely become
> less readable, so I'd prefer this to be done in a central place.

I will change the prototypes/code of vpci_{read/write} so no error
code is returned (and in the read case the value on error is going to
be 1's).
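
A rough sketch of what I have in mind, only to illustrate the idea (this
is not the actual v4 code): cfg_read_fn, read_fails, read_fixed and
read_or_ones are made-up names, and the bounds check assumes the
PCI_CFG_SPACE_EXP_SIZE define quoted above gets imported.

```c
#include <stdint.h>

#define PCI_CFG_SPACE_EXP_SIZE 4096 /* value from Linux's pci_regs.h */

/* Hypothetical handler type: returns 0 on success, non-zero on failure. */
typedef int (*cfg_read_fn)(unsigned int reg, unsigned int size, uint32_t *val);

/* Illustrative handlers, only here to exercise the wrapper. */
static int read_fails(unsigned int reg, unsigned int size, uint32_t *val)
{
    (void)reg; (void)size; (void)val;
    return -1;
}

static int read_fixed(unsigned int reg, unsigned int size, uint32_t *val)
{
    (void)reg; (void)size;
    *val = 0x12345678u;
    return 0;
}

/* Central place that turns every failure into all 1's of the access size. */
static uint32_t read_or_ones(cfg_read_fn handler, unsigned int reg,
                             unsigned int size)
{
    uint32_t val, ones;

    /* Invalid access size: nothing sensible to truncate to. */
    if ( size != 1 && size != 2 && size != 4 )
        return 0xffffffffu;

    ones = 0xffffffffu >> (32 - size * 8);

    /* The last valid byte is at offset PCI_CFG_SPACE_EXP_SIZE - 1. */
    if ( reg >= PCI_CFG_SPACE_EXP_SIZE || size > PCI_CFG_SPACE_EXP_SIZE - reg )
        return ones;

    if ( handler(reg, size, &val) )
        return ones;

    return val & ones;
}
```

With this shape the callers (port IO and MMCFG traps) would not need a
"fail" label at all, since they never see an error code.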

> >> > +    /* Find the vPCI register handler. */
> >> > +    r = vpci_find_register(pdev, reg, size);
> >> 
> >> With the overlap handling in vpci_find_register() I can't see how
> >> this would reliably return the correct (lowest) register when the
> >> request spans multiple ones.
> > 
> > It doesn't need to, if there's a lower register it will be found by
> > the recursive call to xen_vpci_read done below (before calling into
> > the handler pointed by r).
> 
> Since that's quite non-obvious, to me this is another argument
> against using recursion here.

Yes, will switch to a sorted list and no recursion.

> >> > +static int vpci_write_helper(struct pci_dev *pdev,
> >> > +                             const struct vpci_register *r, unsigned int size,
> >> > +                             unsigned int offset, uint32_t data)
> >> > +{
> >> > +    union vpci_val val = { .double_word = data };
> >> > +    int rc;
> >> > +
> >> > +    ASSERT(size <= r->size);
> >> > +    if ( size != r->size )
> >> > +    {
> >> > +        rc = r->read(pdev, r->offset, &val, r->priv_data);
> >> > +        if ( rc )
> >> > +            return rc;
> >> > +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
> >> > +        data &= GENMASK_BYTES(size, 0);
> >> > +        val.double_word |= data << (offset * 8);
> >> > +    }
> >> > +
> >> > +    return r->write(pdev, r->offset, val, r->priv_data);
> >> > +}
> >> 
> >> I'm not sure that writing back the value read is correct in all cases
> >> (think of write-only or rw1c registers or even offsets where reads
> >> and writes access different registers altogether). I think the write
> >> handlers will need to be made capable of dealing with partial writes.
> > 
> > That seems to be what pciback does for writes: read, merge value,
> > write back (drivers/xen/xen-pciback/conf_space.c
> > xen_pcibk_config_write):
> > 
> > err = conf_space_read(dev, cfg_entry, field_start,
> > 		      &tmp_val);
> > if (err)
> > 	break;
> > 
> > tmp_val = merge_value(tmp_val, value, get_mask(size),
> > 		      offset - field_start);
> > 
> > err = conf_space_write(dev, cfg_entry, field_start,
> > 		       tmp_val);
> > 
> > I would prefer to do it this way in order to avoid adding more
> > complexity to the handlers themselves. So far I haven't found any such
> > registers (rw1c) in the PCI config space, do you have references to
> > any of them?
> 
> The status register.

But the status register is not going to be trapped, not by Dom0 or
DomUs? None of the registers that I've emulated for the header, or the
capabilities behave this way, and adding such an offset would
greatly increase the complexity of each handler IMHO.

Maybe it would be easier to add a flag to mark rw1c registers as such,
if that's ever needed? (and avoid the read in that case)

> >> > +/* Helpers for locking/unlocking. */
> >> > +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
> >> > +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
> >> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> >> 
> >> While for the code layering you don't need recursive locks, did you
> >> consider using them nevertheless so that spin_is_locked() return
> >> values are actually meaningful for your purposes?
> > 
> > I'm not sure I follow, spin_is_locked already return meaningful values
> > for my purpose AFAICT.
> 
> For non-recursive locks this tells you whether _any_ CPU holds
> the lock, yet normally you want to know whether the CPU you
> run on does.

Indeed, so if I make the lock recursive spin_is_locked is going to
return whether the current CPU holds the lock. That's kind of
counter-intuitive.

Roger.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-05-29 15:05         ` Roger Pau Monne
@ 2017-05-29 15:26           ` Jan Beulich
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2017-05-29 15:26 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, julien.grall, Paul Durrant,
	xen-devel, boris.ostrovsky

>>> On 29.05.17 at 17:05, <roger.pau@citrix.com> wrote:
> On Mon, May 29, 2017 at 08:16:41AM -0600, Jan Beulich wrote:
>> >>> On 29.05.17 at 14:57, <roger.pau@citrix.com> wrote:
>> > On Fri, May 19, 2017 at 05:35:47AM -0600, Jan Beulich wrote:
>> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> >> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
>> 
>> Btw., I only now notice this further strange xen_ prefix here.
> 
> I assume this should be vpci_*, (dropping the xen_ prefix uniformly).

Yes please.

>> >> > +                          vpci_write_t write_handler, unsigned int offset,
>> >> > +                          unsigned int size, void *data)
>> >> > +{
>> >> > +    struct rb_node **new, *parent;
>> >> > +    struct vpci_register *r;
>> >> > +
>> >> > +    /* Some sanity checks. */
>> >> > +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
>> >> 
>> >> Off by one again in the offset check.
>> > 
>> > Fixed to be > 0xfff. Should this maybe be added to pci_regs.h as
>> > PCI_MAX_REGISTER? 
>> 
>> Could be done, but then we need one for base and one for
>> extended config space. May want to check whether Linux has
>> invented some good names for these by now.
> 
> pci_regs.h from Linux now has:
> 
> /*
>  * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
>  * configuration space.  PCI-X Mode 2 and PCIe devices have 4096 bytes of
>  * configuration space.
>  */
> #define PCI_CFG_SPACE_SIZE	256
> #define PCI_CFG_SPACE_EXP_SIZE	4096
> 
> At the top. Do you think those defines are fine for importing?

Sure.

> (this
> was introduced by cc10385b6fde3, but I don't think importing this in a
> more formal way makes sense).

Agreed.

>> >> > +/* Helper macros for the read/write handlers. */
>> >> > +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
>> >> 
>> >> What do e and s stand for here?
>> > 
>> > e = end, s = start (in bytes).
>> 
>> And where do you start counting? Having seen the rest of the
>> series I'm actually rather unconvinced that using these macros results
>> in better code - to me, plain hex numbers are quite a bit easier
>> to read as long as the number of digits doesn't go meaningfully
>> beyond 10 or so.
>> 
>> >> > +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
>> >> 
>> >> And at least o here?
>> > 
>> > o = offset (in bytes)
>> 
>> I think simply writing the shift expression is once again more
>> clear to the reader than using a macro which is longer to read
>> and type and which has semantics which aren't immediately
>> clear from its name.
>> 
>> > I can rename ADD_RESULT to APPEND_RESULT or something more descriptive
>> > if you think it's going to make it easier to understand.
>> 
>> I'd prefer if the name "merge" appeared in that name - I don't see
>> this being usable strictly only to append to either side of a value.
> 
> OK MERGE_RESULT or MERGE_REGISTER maybe? (and the rest removed).

Either name is fine with me, with a slight preference to the former.

>> >> > +static int vpci_write_helper(struct pci_dev *pdev,
>> >> > +                             const struct vpci_register *r, unsigned int size,
>> >> > +                             unsigned int offset, uint32_t data)
>> >> > +{
>> >> > +    union vpci_val val = { .double_word = data };
>> >> > +    int rc;
>> >> > +
>> >> > +    ASSERT(size <= r->size);
>> >> > +    if ( size != r->size )
>> >> > +    {
>> >> > +        rc = r->read(pdev, r->offset, &val, r->priv_data);
>> >> > +        if ( rc )
>> >> > +            return rc;
>> >> > +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
>> >> > +        data &= GENMASK_BYTES(size, 0);
>> >> > +        val.double_word |= data << (offset * 8);
>> >> > +    }
>> >> > +
>> >> > +    return r->write(pdev, r->offset, val, r->priv_data);
>> >> > +}
>> >> 
>> >> I'm not sure that writing back the value read is correct in all cases
>> >> (think of write-only or rw1c registers or even offsets where reads
>> >> and writes access different registers altogether). I think the write
>> >> handlers will need to be made capable of dealing with partial writes.
>> > 
>> > That seems to be what pciback does for writes: read, merge value,
>> > write back (drivers/xen/xen-pciback/conf_space.c
>> > xen_pcibk_config_write):
>> > 
>> > err = conf_space_read(dev, cfg_entry, field_start,
>> > 		      &tmp_val);
>> > if (err)
>> > 	break;
>> > 
>> > tmp_val = merge_value(tmp_val, value, get_mask(size),
>> > 		      offset - field_start);
>> > 
>> > err = conf_space_write(dev, cfg_entry, field_start,
>> > 		       tmp_val);
>> > 
>> > I would prefer to do it this way in order to avoid adding more
>> > complexity to the handlers themselves. So far I haven't found any such
>> > registers (rw1c) in the PCI config space, do you have references to
>> > any of them?
>> 
>> The status register.
> 
> But the status register is not going to be trapped, not by Dom0 or
> DomUs? None of the registers that I've emulated for the header, or the
> capabilities behave this way, and adding such an offset would
> greatly increase the complexity of each handler IMHO.
> 
> Maybe it would be easier to add a flag to mark rw1c registers as such,
> if that's ever needed? (and avoid the read in that case)

Well, yes, that's how qemu does it. Of course with the caveat that
you need to mark individual bits (again as qemu does). Of course,
as long as no such register is being emulated, leaving a prominent
comment would probably be an acceptable alternative to coding
all this right away.
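
For illustration, a per-bit rw1c mask could be folded into the
read/merge/write path roughly like this. This is a hedged sketch, not
qemu's or the series' code: merge_partial_write and EXAMPLE_RW1C_MASK
are hypothetical, and the mask value is an assumption that should be
checked against the PCI specification before being relied upon.

```c
#include <stdint.h>

/*
 * Illustrative mask only (an assumption): treats the write-1-to-clear
 * bits of the status register as the upper half of the dword at offset 4.
 */
#define EXAMPLE_RW1C_MASK 0xf9000000u

/*
 * Merge a partial write of 'size' bytes at byte 'offset' into the current
 * register value. size must be 1, 2 or 4 and offset + size must be <= 4.
 */
static uint32_t merge_partial_write(uint32_t current, uint32_t data,
                                    unsigned int offset, unsigned int size,
                                    uint32_t rw1c_mask)
{
    /* Bits the guest is actually writing. */
    uint32_t write_mask = (0xffffffffu >> (32 - size * 8)) << (offset * 8);
    uint32_t merged = (current & ~write_mask) |
                      ((data << (offset * 8)) & write_mask);

    /*
     * rw1c bits outside the written bytes must go out as 0, otherwise
     * writing back the value just read would clear them behind the
     * guest's back.
     */
    merged &= ~(rw1c_mask & ~write_mask);

    return merged;
}
```

E.g. a 1-byte write of 0x07 at offset 0 on top of a current value of
0x80000006 gives 0x00000007 with the mask applied, whereas the plain
read/merge/write would produce 0x80000007 and thereby clear bit 31.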

>> >> > +/* Helpers for locking/unlocking. */
>> >> > +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
>> >> > +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
>> >> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
>> >> 
>> >> While for the code layering you don't need recursive locks, did you
>> >> consider using them nevertheless so that spin_is_locked() return
>> >> values are actually meaningful for your purposes?
>> > 
>> > I'm not sure I follow, spin_is_locked already return meaningful values
>> > for my purpose AFAICT.
>> 
>> For non-recursive locks this tells you whether _any_ CPU holds
>> the lock, yet normally you want to know whether the CPU you
>> run on does.
> 
> Indeed, so if I make the lock recursive spin_is_locked is going to
> return whether the current CPU holds the lock. That's kind of
> counter-intuitive.

Counter-intuitive or not, it's a result of non-recursive locks being
more slim (and hence slightly faster). And as long as
spin_is_locked() is being used in ASSERT()s only, it's better than
no sanity checking at all.
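
The difference between the two answers can be shown with a toy model. This
is a single-threaded sketch with made-up names, not Xen's spinlock
implementation: the plain check only knows that somebody holds the lock,
while a lock that records its owner can answer for a specific CPU.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Toy single-threaded model; a real lock needs atomic operations. */
struct toy_lock {
    bool locked;
    int owner;           /* only meaningful while locked */
    unsigned int depth;  /* recursion depth */
};

static void toy_lock_init(struct toy_lock *l)
{
    l->locked = false;
    l->owner = NO_OWNER;
    l->depth = 0;
}

/* "Does anybody hold the lock?" */
static bool toy_is_locked(const struct toy_lock *l)
{
    return l->locked;
}

/* "Does this CPU hold the lock?" Needs the recorded owner. */
static bool toy_is_locked_by(const struct toy_lock *l, int cpu)
{
    return l->locked && l->owner == cpu;
}

static void toy_lock_recursive(struct toy_lock *l, int cpu)
{
    assert(!l->locked || l->owner == cpu);  /* no contention in this model */
    l->locked = true;
    l->owner = cpu;
    l->depth++;
}

static void toy_unlock_recursive(struct toy_lock *l, int cpu)
{
    assert(toy_is_locked_by(l, cpu));
    if ( --l->depth == 0 )
    {
        l->locked = false;
        l->owner = NO_OWNER;
    }
}
```

An ASSERT() built on the first check passes whenever any CPU holds the
lock, so it can miss a caller that forgot to take it.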

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/9] x86/ecam: add handlers for the PVH Dom0 MMCFG areas
  2017-05-19 13:25   ` Jan Beulich
@ 2017-06-20 11:56     ` Roger Pau Monne
  2017-06-20 13:14       ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-20 11:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, julien.grall, Paul Durrant, xen-devel, boris.ostrovsky

On Fri, May 19, 2017 at 07:25:22AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > @@ -1048,6 +1050,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
> >      return 0;
> >  }
> >  
> > +int __init pvh_setup_ecam(struct domain *d)
> 
> While I won't object to the term ecam in title and description,
> please use mmcfg uniformly in code - that's the way we name
> the thing everywhere else.

OK.

> > +{
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
> > +    {
> > +        rc = register_vpci_ecam_handler(d, pci_mmcfg_config[i].address,
> > +                                        pci_mmcfg_config[i].start_bus_number,
> > +                                        pci_mmcfg_config[i].end_bus_number,
> > +                                        pci_mmcfg_config[i].pci_segment);
> > +        if ( rc )
> > +            return rc;
> > +    }
> > +
> > +    return 0;
> > +}
> 
> What about regions becoming available only post-boot?

This is not yet supported. It needs to be implemented using the
PHYSDEVOP_pci_mmcfg_reserved hypercall.

> > @@ -752,6 +754,14 @@ void hvm_domain_destroy(struct domain *d)
> >          list_del(&ioport->list);
> >          xfree(ioport);
> >      }
> > +
> > +    list_for_each_entry_safe ( ecam, etmp, &d->arch.hvm_domain.ecam_regions,
> > +                               next )
> > +    {
> > +        list_del(&ecam->next);
> > +        xfree(ecam);
> > +    }
> > +
> >  }
> 
> Stray blank line. Of course the addition is of questionable use
> anyway as long as all of this is Dom0 only.

Right, I just felt it would be better to do proper cleanup since it's
just a couple of lines.

> > --- a/xen/arch/x86/hvm/io.c
> > +++ b/xen/arch/x86/hvm/io.c
> > @@ -403,6 +403,145 @@ void register_vpci_portio_handler(struct domain *d)
> >      handler->ops = &vpci_portio_ops;
> >  }
> >  
> > +/* Handlers to trap PCI ECAM config accesses. */
> > +static struct hvm_ecam *vpci_ecam_find(struct domain *d, unsigned long addr)
> 
> Logically d should be a pointer to const, and I think no caller really
> needs you to return a pointer to non-const.
> 
> > +{
> > +    struct hvm_ecam *ecam = NULL;
> 
> Pointless initializer.
> 
> > +static void vpci_ecam_decode_addr(struct hvm_ecam *ecam, unsigned long addr,
> 
> const

Fixed all the above.

> > +static int vpci_ecam_accept(struct vcpu *v, unsigned long addr)
> > +{
> > +    struct domain *d = v->domain;
> > +    int found;
> > +
> > +    vpci_lock(d);
> > +    found = !!vpci_ecam_find(v->domain, addr);
> 
> Please use the local variable consistently.
> 
> > +static int vpci_ecam_read(struct vcpu *v, unsigned long addr,
> 
> Did I overlook this in patch 1? Why is this a vcpu instead of a
> domain parameter? All of PCI is (virtual) machine wide...

That's what the hvm_mmio_ops struct expects (vcpu instead of domain),
which is where this function is used.

> > +                          unsigned int len, unsigned long *data)
> > +{
> > +    struct domain *d = v->domain;
> > +    struct hvm_ecam *ecam;
> > +    unsigned int bus, devfn, reg;
> > +    uint32_t data32;
> > +    int rc;
> > +
> > +    vpci_lock(d);
> > +    ecam = vpci_ecam_find(d, addr);
> > +    if ( !ecam )
> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_UNHANDLEABLE;
> > +    }
> > +
> > +    vpci_ecam_decode_addr(ecam, addr, &bus, &devfn, &reg);
> > +
> > +    if ( vpci_access_check(reg, len) || reg >= 0xfff )
> 
> So this function iirc allows only 1-, 2-, and 4-byte accesses. Other
> than with port I/O, MMCFG allows wider ones, and once again I
> don't think hardware would raise any kind of fault in such a case.
> The general expectation is for the fabric to split such accesses.

Hm, the PCIe spec is not authoritative in this regard, it states that
supporting 8B accesses is not mandatory. Xen/Linux/FreeBSD will never
attempt any access > 4B, hence I haven't coded this case.

Would you be fine with leaving this for later, or would you rather
have it implemented as part of this series?

> Also the reg check is once again off by one.

This is now gone, since reg cannot be > 0xfff in any case.
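
As a rough sketch of why the bound check becomes redundant: with the
standard ECAM layout (bus in bits 27:20 of the offset, device/function in
bits 19:12, register in bits 11:0) the register is obtained by masking, so
it can never exceed 0xfff. The names below are hypothetical, and
region_start is assumed to be the address at which the config space of
start_bus begins.

```c
#include <assert.h>
#include <stdint.h>

struct ecam_bdf_reg {
    unsigned int bus;
    unsigned int devfn;
    unsigned int reg;
};

/*
 * Decode an address inside an ECAM window. 'region_start' is assumed to be
 * the address of the first bus covered by the window ('start_bus').
 */
static struct ecam_bdf_reg ecam_decode(uint64_t region_start,
                                       unsigned int start_bus, uint64_t addr)
{
    struct ecam_bdf_reg r;
    uint64_t off;

    assert(addr >= region_start);
    off = addr - region_start;

    r.bus = start_bus + (unsigned int)((off >> 20) & 0xff);
    r.devfn = (unsigned int)((off >> 12) & 0xff);
    r.reg = (unsigned int)(off & 0xfff);  /* 4K of config space per function */

    return r;
}
```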

> > +int register_vpci_ecam_handler(struct domain *d, paddr_t addr,
> > +                               unsigned int start_bus, unsigned int end_bus,
> > +                               unsigned int seg)
> > +{
> > +    struct hvm_ecam *ecam;
> > +
> > +    ASSERT(is_hardware_domain(d));
> > +
> > +    vpci_lock(d);
> > +    if ( vpci_ecam_find(d, addr) )
> > +    {
> > +        vpci_unlock(d);
> > +        return -EEXIST;
> > +    }
> > +
> > +    ecam = xzalloc(struct hvm_ecam);
> 
> xmalloc() would again suffice afaict.

Right.

> > --- a/xen/include/asm-x86/hvm/domain.h
> > +++ b/xen/include/asm-x86/hvm/domain.h
> > @@ -100,6 +100,14 @@ struct hvm_pi_ops {
> >      void (*do_resume)(struct vcpu *v);
> >  };
> >  
> > +struct hvm_ecam {
> > +    paddr_t addr;
> > +    size_t size;
> > +    unsigned int bus;
> > +    unsigned int segment;
> > +    struct list_head next;
> > +};
> 
> If you moved the addition to hvm_domain_destroy() into a function
> in hvm/io.c, this type could be private to that latter file afaict.

Yes, I've now done that and named the function destroy_vpci_mmcfg.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/9] x86/ecam: add handlers for the PVH Dom0 MMCFG areas
  2017-06-20 11:56     ` Roger Pau Monne
@ 2017-06-20 13:14       ` Jan Beulich
  2017-06-20 15:04         ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-06-20 13:14 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, julien.grall, Paul Durrant, xen-devel, boris.ostrovsky

>>> On 20.06.17 at 13:56, <roger.pau@citrix.com> wrote:
> On Fri, May 19, 2017 at 07:25:22AM -0600, Jan Beulich wrote:
>> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> > +{
>> > +    unsigned int i;
>> > +    int rc;
>> > +
>> > +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
>> > +    {
>> > +        rc = register_vpci_ecam_handler(d, pci_mmcfg_config[i].address,
>> > +                                        pci_mmcfg_config[i].start_bus_number,
>> > +                                        pci_mmcfg_config[i].end_bus_number,
>> > +                                        pci_mmcfg_config[i].pci_segment);
>> > +        if ( rc )
>> > +            return rc;
>> > +    }
>> > +
>> > +    return 0;
>> > +}
>> 
>> What about regions becoming available only post-boot?
> 
> This is not yet supported. It needs to be implemented using the
> PHYSDEVOP_pci_mmcfg_reserved hypercall.

But then the patch here is incomplete.

>> > +                          unsigned int len, unsigned long *data)
>> > +{
>> > +    struct domain *d = v->domain;
>> > +    struct hvm_ecam *ecam;
>> > +    unsigned int bus, devfn, reg;
>> > +    uint32_t data32;
>> > +    int rc;
>> > +
>> > +    vpci_lock(d);
>> > +    ecam = vpci_ecam_find(d, addr);
>> > +    if ( !ecam )
>> > +    {
>> > +        vpci_unlock(d);
>> > +        return X86EMUL_UNHANDLEABLE;
>> > +    }
>> > +
>> > +    vpci_ecam_decode_addr(ecam, addr, &bus, &devfn, &reg);
>> > +
>> > +    if ( vpci_access_check(reg, len) || reg >= 0xfff )
>> 
>> So this function iirc allows only 1-, 2-, and 4-byte accesses. Other
>> than with port I/O, MMCFG allows wider ones, and once again I
>> don't think hardware would raise any kind of fault in such a case.
>> The general expectation is for the fabric to split such accesses.
> 
> Hm, the PCIe spec is not authoritative in this regard, it states that
> supporting 8B accesses is not mandatory. Xen/Linux/FreeBSD will never
> attempt any access > 4B, hence I haven't coded this case.
> 
> Would you be fine with leaving this for later, or would you rather
> have it implemented as part of this series?

Since it shouldn't be meaningfully more code, I'd prefer if it was
done right away. Otherwise I'd have to ask for a "fixme" comment,
and I'd rather avoid such considering the PVHv1 history.

Jan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/9] x86/ecam: add handlers for the PVH Dom0 MMCFG areas
  2017-06-20 13:14       ` Jan Beulich
@ 2017-06-20 15:04         ` Roger Pau Monne
  0 siblings, 0 replies; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-20 15:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, julien.grall, Paul Durrant, xen-devel, boris.ostrovsky

On Tue, Jun 20, 2017 at 07:14:07AM -0600, Jan Beulich wrote:
> >>> On 20.06.17 at 13:56, <roger.pau@citrix.com> wrote:
> > On Fri, May 19, 2017 at 07:25:22AM -0600, Jan Beulich wrote:
> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> >> > +{
> >> > +    unsigned int i;
> >> > +    int rc;
> >> > +
> >> > +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
> >> > +    {
> >> > +        rc = register_vpci_ecam_handler(d, pci_mmcfg_config[i].address,
> >> > +                                        pci_mmcfg_config[i].start_bus_number,
> >> > +                                        pci_mmcfg_config[i].end_bus_number,
> >> > +                                        pci_mmcfg_config[i].pci_segment);
> >> > +        if ( rc )
> >> > +            return rc;
> >> > +    }
> >> > +
> >> > +    return 0;
> >> > +}
> >> 
> >> What about regions becoming available only post-boot?
> > 
> > This is not yet supported. It needs to be implemented using the
> > PHYSDEVOP_pci_mmcfg_reserved hypercall.
> 
> But then the patch here is incomplete.

OK, I don't think it's going to be a lot of code, it's just
registering extra MMCFG regions.

> >> > +                          unsigned int len, unsigned long *data)
> >> > +{
> >> > +    struct domain *d = v->domain;
> >> > +    struct hvm_ecam *ecam;
> >> > +    unsigned int bus, devfn, reg;
> >> > +    uint32_t data32;
> >> > +    int rc;
> >> > +
> >> > +    vpci_lock(d);
> >> > +    ecam = vpci_ecam_find(d, addr);
> >> > +    if ( !ecam )
> >> > +    {
> >> > +        vpci_unlock(d);
> >> > +        return X86EMUL_UNHANDLEABLE;
> >> > +    }
> >> > +
> >> > +    vpci_ecam_decode_addr(ecam, addr, &bus, &devfn, &reg);
> >> > +
> >> > +    if ( vpci_access_check(reg, len) || reg >= 0xfff )
> >> 
> >> So this function iirc allows only 1-, 2-, and 4-byte accesses. Other
> >> than with port I/O, MMCFG allows wider ones, and once again I
> >> don't think hardware would raise any kind of fault in such a case.
> >> The general expectation is for the fabric to split such accesses.
> > 
> > > Hm, the PCIe spec is not authoritative in this regard, it states that
> > supporting 8B accesses is not mandatory. Xen/Linux/FreeBSD will never
> > attempt any access > 4B, hence I haven't coded this case.
> > 
> > Would you be fine with leaving this for later, or would you rather
> > have it implemented as part of this series?
> 
> Since it shouldn't be meaningfully more code, I'd prefer if it was
> done right away. Otherwise I'd have to ask for a "fixme" comment,
> and I'd rather avoid such considering the PVHv1 history.

NP, I've just added it. I have however implemented it by splitting the
access into two 4 byte accesses, and performing two calls to
vpci_{read/write}.
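
A minimal sketch of the read side of that split, assuming a 4-byte read
primitive with the hypothetical signature below (the actual vpci_read
interface is not reproduced here). The low dword comes from the lower
register offset, matching little-endian x86.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 4-byte config space read; signature is an assumption. */
typedef uint32_t (*read32_fn)(void *ctx, unsigned int reg);

/* Serve an 8-byte read as two 4-byte reads, low dword first. */
static uint64_t read64_split(read32_fn rd, void *ctx, unsigned int reg)
{
    uint64_t lo = rd(ctx, reg);
    uint64_t hi = rd(ctx, reg + 4);

    return lo | (hi << 32);
}

/* Fake backing store, used only to exercise the helper. */
static uint32_t fake_read32(void *ctx, unsigned int reg)
{
    const uint32_t *space = ctx;

    return space[reg / 4];
}
```

A write would be split the same way, passing the low and high halves of the
value to two 4-byte writes.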

Thanks, Roger.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-05-19 13:35   ` Jan Beulich
@ 2017-06-21 11:11     ` Roger Pau Monne
  2017-06-21 11:57       ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-21 11:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

On Fri, May 19, 2017 at 07:35:39AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > And also allow it to do non-identity mappings by adding a new parameter. This
> > function will be needed in other parts apart from PVH Dom0 build. While there
> > fix the function to use gfn_t and mfn_t instead of unsigned long for memory
> > addresses.
> 
> I'm afraid both title and description don't (or no longer) properly reflect
> what the patch does. I'm also afraid the reason the new parameter as
> well as the placement in common/memory.c aren't sufficiently explained.
> For example, what use is the function going to be without
> CONFIG_HAS_PCI?

It will still be needed in order to map the low 1MB for a PVH Dom0,
but anyway, see below.

> > --- a/xen/arch/x86/hvm/dom0_build.c
> > +++ b/xen/arch/x86/hvm/dom0_build.c
> > @@ -64,27 +64,7 @@ static struct acpi_madt_nmi_source __initdata *nmisrc;
> >  static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
> >                                         unsigned long nr_pages, const bool map)
> >  {
> > -    int rc;
> > -
> > -    for ( ; ; )
> > -    {
> > -        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> > -             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> > -        if ( rc == 0 )
> > -            break;
> > -        if ( rc < 0 )
> > -        {
> > -            printk(XENLOG_WARNING
> > -                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
> > -                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> > -            break;
> > -        }
> > -        nr_pages -= rc;
> > -        pfn += rc;
> > -        process_pending_softirqs();
> > -    }
> > -
> > -    return rc;
> > +    return modify_mmio(d, _gfn(pfn), _mfn(pfn), nr_pages, map);
> >  }
> 
> I don't see the value of retaining this wrapper.
> 
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -1438,6 +1438,42 @@ int prepare_ring_for_helper(
> >      return 0;
> >  }
> >  
> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> > +                const bool map)
> > +{
> > +    int rc;
> > +
> > +    /*
> > +     * Make sure this function is only used by the hardware domain, because it
> > +     * can take an arbitrary long time, and could DoS the whole system.
> > +     */
> > +    ASSERT(is_hardware_domain(d));
> 
> If that can happen arbitrarily at run time (rather than just at boot,
> as suggested by the removal of __init), it definitely can't remain as
> is and will instead need to make use of continuations. I'm therefore
> unconvinced you really want to move this code instead of simply
> calling {,un}map_mmio_regions() while taking care of preemption
> needs.

I'm not sure I know how to use continuations with non-hypercall
vmexits. Do you have any recommendations about how to do this? pause
the domain and run the mmio changes inside of a tasklet?

Roger.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-06-21 11:11     ` Roger Pau Monne
@ 2017-06-21 11:57       ` Jan Beulich
  2017-06-21 12:43         ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-06-21 11:57 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 21.06.17 at 13:11, <roger.pau@citrix.com> wrote:
> On Fri, May 19, 2017 at 07:35:39AM -0600, Jan Beulich wrote:
>> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> > And also allow it to do non-identity mappings by adding a new parameter. This
>> > function will be needed in other parts apart from PVH Dom0 build. While there
>> > fix the function to use gfn_t and mfn_t instead of unsigned long for memory
>> > addresses.
>> 
>> I'm afraid both title and description don't (or no longer) properly reflect
>> what the patch does. I'm also afraid the reason the new parameter as
>> well as the placement in common/memory.c aren't sufficiently explained.
>> For example, what use is the function going to be without
>> CONFIG_HAS_PCI?
> 
> It will still be needed in order to map the low 1MB for a PVH Dom0,
> but anyway, see below.

That's still implying CONFIG_HAS_PCI, as that's still x86. The
question was with ARM in mind.

>> > --- a/xen/common/memory.c
>> > +++ b/xen/common/memory.c
>> > @@ -1438,6 +1438,42 @@ int prepare_ring_for_helper(
>> >      return 0;
>> >  }
>> >  
>> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
>> > +                const bool map)
>> > +{
>> > +    int rc;
>> > +
>> > +    /*
>> > +     * Make sure this function is only used by the hardware domain, because it
>> > +     * can take an arbitrary long time, and could DoS the whole system.
>> > +     */
>> > +    ASSERT(is_hardware_domain(d));
>> 
>> If that can happen arbitrarily at run time (rather than just at boot,
>> as suggested by the removal of __init), it definitely can't remain as
>> is and will instead need to make use of continuations. I'm therefore
>> unconvinced you really want to move this code instead of simply
>> calling {,un}map_mmio_regions() while taking care of preemption
>> needs.
> 
> I'm not sure I know how to use continuations with non-hypercall
> vmexits. Do you have any recommendations about how to do this? pause
> the domain and run the mmio changes inside of a tasklet?

That would be one option. Or you could derive from the approach
used for waiting for a response from the device model. Even exiting
back to the guest without updating rIP may be possible, provided
you have a means to store the continuation information such that
when coming back you won't start from the beginning again.
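
To make the last option more concrete, the stored continuation information
could be as small as "where to resume and how much is left". The sketch
below uses hypothetical names and a fixed page budget per entry; where such
state would live (per-vCPU, per-device) is left open.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical saved state for a partially completed mapping request. */
struct map_cont {
    unsigned long next_pfn;   /* first page not yet processed */
    unsigned long remaining;  /* pages still to process */
};

/*
 * Process at most 'budget' pages, then return. Returns true while work is
 * still pending, in which case the caller re-enters later with the same
 * 'c' instead of starting from the beginning.
 */
static bool map_step(struct map_cont *c, unsigned long budget)
{
    unsigned long n = c->remaining < budget ? c->remaining : budget;

    /* The actual per-page mapping work would go here. */
    c->next_pfn += n;
    c->remaining -= n;

    return c->remaining != 0;
}
```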

Jan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-06-21 11:57       ` Jan Beulich
@ 2017-06-21 12:43         ` Roger Pau Monne
  2017-06-21 12:51           ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-21 12:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

On Wed, Jun 21, 2017 at 05:57:19AM -0600, Jan Beulich wrote:
> >>> On 21.06.17 at 13:11, <roger.pau@citrix.com> wrote:
> > On Fri, May 19, 2017 at 07:35:39AM -0600, Jan Beulich wrote:
> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> >> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> >> > +                const bool map)
> >> > +{
> >> > +    int rc;
> >> > +
> >> > +    /*
> >> > +     * Make sure this function is only used by the hardware domain, because it
> >> > +     * can take an arbitrary long time, and could DoS the whole system.
> >> > +     */
> >> > +    ASSERT(is_hardware_domain(d));
> >> 
> >> If that can happen arbitrarily at run time (rather than just at boot,
> >> as suggested by the removal of __init), it definitely can't remain as
> >> is and will instead need to make use of continuations. I'm therefore
> >> unconvinced you really want to move this code instead of simply
> >> calling {,un}map_mmio_regions() while taking care of preemption
> >> needs.
> > 
> > I'm not sure I know how to use continuations with non-hypercall
> > vmexits. Do you have any recommendations about how to do this? pause
> > the domain and run the mmio changes inside of a tasklet?
> 
> That would be one option. Or you could derive from the approach
> used for waiting for a response from the device model.

AFAICT the ioreq code pauses the domain and waits for a reply from the
dm, but in that case I would still need the tasklet in order to perform
the work (since there's no dm here).

> Even exiting
> back to the guest without updating rIP may be possible, provided
> you have a means to store the continuation information such that
> when coming back you won't start from the beginning again.

I don't really fancy this since it would mean wasting a lot of time in
vmexits/vmenters.

Roger.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-06-21 12:43         ` Roger Pau Monne
@ 2017-06-21 12:51           ` Jan Beulich
  2017-06-21 13:10             ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-06-21 12:51 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 21.06.17 at 14:43, <roger.pau@citrix.com> wrote:
> On Wed, Jun 21, 2017 at 05:57:19AM -0600, Jan Beulich wrote:
>> >>> On 21.06.17 at 13:11, <roger.pau@citrix.com> wrote:
>> > On Fri, May 19, 2017 at 07:35:39AM -0600, Jan Beulich wrote:
>> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> >> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
>> >> > +                const bool map)
>> >> > +{
>> >> > +    int rc;
>> >> > +
>> >> > +    /*
>> >> > +     * Make sure this function is only used by the hardware domain, because it
>> >> > +     * can take an arbitrary long time, and could DoS the whole system.
>> >> > +     */
>> >> > +    ASSERT(is_hardware_domain(d));
>> >> 
>> >> If that can happen arbitrarily at run time (rather than just at boot,
>> >> as suggested by the removal of __init), it definitely can't remain as
>> >> is and will instead need to make use of continuations. I'm therefore
>> >> unconvinced you really want to move this code instead of simply
>> >> calling {,un}map_mmio_regions() while taking care of preemption
>> >> needs.
>> > 
>> > I'm not sure I know how to use continuations with non-hypercall
>> > vmexits. Do you have any recommendations about how to do this? pause
>> > the domain and run the mmio changes inside of a tasklet?
>> 
>> That would be one option. Or you could derive from the approach
>> used for waiting for a response from the device model.
> 
> AFAICT the ioreq code pauses the domain and waits for a reply from the
> dm, but in that case I would still need the tasklet in order to perform
> the work (since there's no dm here).

Well, that's kind of pausing (it's not an explicit domain_pause(),
and you really would mean to pause just the vCPU here). Otoh
to prevent hangs we simply call process_pending_softirqs()
every once in a while in a few other cases, so maybe doing that
would already suffice here.

Jan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-06-21 12:51           ` Jan Beulich
@ 2017-06-21 13:10             ` Roger Pau Monne
  2017-06-21 13:29               ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-21 13:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

On Wed, Jun 21, 2017 at 06:51:32AM -0600, Jan Beulich wrote:
> >>> On 21.06.17 at 14:43, <roger.pau@citrix.com> wrote:
> > On Wed, Jun 21, 2017 at 05:57:19AM -0600, Jan Beulich wrote:
> >> >>> On 21.06.17 at 13:11, <roger.pau@citrix.com> wrote:
> >> > On Fri, May 19, 2017 at 07:35:39AM -0600, Jan Beulich wrote:
> >> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> >> >> > +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> >> >> > +                const bool map)
> >> >> > +{
> >> >> > +    int rc;
> >> >> > +
> >> >> > +    /*
> >> >> > +     * Make sure this function is only used by the hardware domain, because it
> >> >> > +     * can take an arbitrary long time, and could DoS the whole system.
> >> >> > +     */
> >> >> > +    ASSERT(is_hardware_domain(d));
> >> >> 
> >> >> If that can happen arbitrarily at run time (rather than just at boot,
> >> >> as suggested by the removal of __init), it definitely can't remain as
> >> >> is and will instead need to make use of continuations. I'm therefore
> >> >> unconvinced you really want to move this code instead of simply
> >> >> calling {,un}map_mmio_regions() while taking care of preemption
> >> >> needs.
> >> > 
> >> > I'm not sure I know how to use continuations with non-hypercall
> >> > vmexits. Do you have any recommendations about how to do this? pause
> >> > the domain and run the mmio changes inside of a tasklet?
> >> 
> >> That would be one option. Or you could derive from the approach
> >> used for waiting for a response from the device model.
> > 
> > AFAICT the ioreq code pauses the domain and waits for a reply from the
> > dm, but in that case I would still need the tasklet in order to perform
> > the work (since there's no dm here).
> 
> Well, that's kind of pausing (it's not an explicit domain_pause(),
> and you really would mean to pause just the vCPU here).

Right, so vcpu_pause would do it.

> Otoh
> to prevent hangs we simply call process_pending_softirqs()
> every once in a while in a few other cases, so maybe doing that
> would already suffice here.

That's what I was doing here in modify_mmio, calling
process_pending_softirqs between calls to {map/unmap}_mmio_regions.

I could leave modify_identity_mmio as-is and simply call
{map/unmap}_mmio_regions from the vPCI header handlers, calling
process_pending_softirqs in between. I just moved the helper because
it avoids open-coding this again.
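
The pattern in question, reduced to a self-contained sketch: the fake
mapper and the yield counter are stand-ins invented for this example, but
the return convention follows the loop quoted earlier in the thread (0 when
finished, a positive page count when the mapper stopped early, negative on
error).

```c
#include <assert.h>

/* Counts how often the loop yielded; stands in for softirq processing. */
static unsigned int yields;

static void yield_cpu(void)
{
    yields++;
}

/*
 * Fake mapper that handles at most 64 pages per call: returns 0 when the
 * whole range fits, otherwise the number of pages it processed.
 */
static int fake_map(unsigned long pfn, unsigned long nr)
{
    (void)pfn;
    return nr <= 64 ? 0 : 64;
}

/* Keep calling the mapper, yielding between calls, until done or failed. */
static int map_range(unsigned long pfn, unsigned long nr)
{
    int rc;

    while ( (rc = fake_map(pfn, nr)) > 0 )
    {
        nr -= rc;
        pfn += rc;
        yield_cpu();
    }

    return rc;
}
```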

Roger.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-06-21 13:10             ` Roger Pau Monne
@ 2017-06-21 13:29               ` Jan Beulich
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2017-06-21 13:29 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 21.06.17 at 15:10, <roger.pau@citrix.com> wrote:
> On Wed, Jun 21, 2017 at 06:51:32AM -0600, Jan Beulich wrote:
>> Otoh
>> to prevent hangs we simply call process_pending_softirqs()
>> every once in a while in a few other cases, so maybe doing that
>> would already suffice here.
> 
> That's what I was doing here in modify_mmio, calling
> process_pending_softirqs between calls to {map/unmap}_mmio_regions.
> 
> I could leave modify_identity_mmio as-is and simply call
> {map/unmap}_mmio_regions from the vPCI header handlers, calling
> process_pending_softirqs in between. I just moved the helper because
> it avoids open-coding this again.

Oh, indeed. In that case I'd suggest wording the comment in a
less scary way.

Jan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 4/9] xen/pci: split code to size BARs from pci_add_device
  2017-05-19 13:56   ` Jan Beulich
@ 2017-06-21 15:16     ` Roger Pau Monne
  0 siblings, 0 replies; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-21 15:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, julien.grall, boris.ostrovsky

On Fri, May 19, 2017 at 07:56:55AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > @@ -663,38 +708,13 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
> >                             seg, bus, slot, func, i);
> >                      continue;
> >                  }
> > -                pci_conf_write32(seg, bus, slot, func, idx, ~0);
> > -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > -                {
> > -                    if ( i >= PCI_SRIOV_NUM_BARS )
> > -                    {
> > -                        printk(XENLOG_WARNING
> > -                               "SR-IOV device %04x:%02x:%02x.%u with 64-bit"
> > -                               " vf BAR in last slot\n",
> > -                               seg, bus, slot, func);
> > -                        break;
> > -                    }
> > -                    hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
> > -                    pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
> > -                }
> > -                pdev->vf_rlen[i] = pci_conf_read32(seg, bus, slot, func, idx) &
> > -                                   PCI_BASE_ADDRESS_MEM_MASK;
> > -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > -                {
> > -                    pdev->vf_rlen[i] |= (u64)pci_conf_read32(seg, bus,
> > -                                                             slot, func,
> > -                                                             idx + 4) << 32;
> > -                    pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
> > -                }
> > -                else if ( pdev->vf_rlen[i] )
> > -                    pdev->vf_rlen[i] |= (u64)~0 << 32;
> > -                pci_conf_write32(seg, bus, slot, func, idx, bar);
> > -                pdev->vf_rlen[i] = -pdev->vf_rlen[i];
> > -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > -                    ++i;
> > +                ret = pci_size_bar(seg, bus, slot, func, pos + PCI_SRIOV_BAR,
> > +                                   PCI_SRIOV_NUM_BARS, &i, &addr,
> > +                                   &pdev->vf_rlen[i]);
> > +                if ( ret )
> > +                    dprintk(XENLOG_WARNING,
> > +                            "%04x:%02x:%02x.%u: failed to size SR-IOV BAR%u\n",
> > +                            seg, bus, slot, func, i);
> 
> You shouldn't log two messages for the same problem (the called
> function already logs one).
> 
> A final more general remark: With you intending to call this function
> from other than pci_add_device() context, some further care may /
> will be needed. For example, are all to be added callers such that
> you playing with config space won't interfere with what Dom0 does?
> Are you sure you can get away without disabling memory decode
> while fiddling with the BARs?

So far I've been able to get away, but you are right that callers
should disable memory decode before trying to size the BARs. I will do
this in the callers however.
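
For reference, the sizing arithmetic from the removed hunk can be isolated
as below. This sketch only covers the computation on the values read back
after writing all-ones; the config space accesses, restoring the original
BAR contents and disabling memory decode are assumed to happen in the
caller, and the names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BAR_MEM_MASK UINT32_C(0xfffffff0)  /* address bits of a memory BAR */

/*
 * 'lo' and 'hi' are the values read back after writing all-ones to the BAR
 * and, for a 64-bit BAR, to the following dword.
 */
static uint64_t bar_size(uint32_t lo, uint32_t hi, bool is64)
{
    uint64_t v = lo & BAR_MEM_MASK;

    if ( is64 )
        v |= (uint64_t)hi << 32;
    else if ( v )
        v |= ~UINT64_C(0) << 32;  /* extend with ones so negation works */

    return ~v + 1;                /* two's complement negation gives size */
}
```

An unimplemented BAR reads back as zero and yields a size of zero.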

Roger.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/9] xen/vpci: add handlers to map the BARs
  2017-05-19 15:21   ` Jan Beulich
  2017-05-22 11:38     ` Julien Grall
@ 2017-06-22 17:13     ` Roger Pau Monne
  2017-06-23  8:58       ` Jan Beulich
  1 sibling, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-22 17:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, julien.grall, xen-devel, boris.ostrovsky

On Fri, May 19, 2017 at 09:21:56AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > +static int vpci_modify_bars(struct pci_dev *pdev, const bool map)
> > +{
> > +    struct vpci_header *header = &pdev->vpci->header;
> > +    unsigned int i;
> > +    int rc = 0;
> > +
> > +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> > +    {
> > +        paddr_t gaddr = map ? header->bars[i].gaddr
> > +                            : header->bars[i].mapped_addr;
> > +        paddr_t paddr = header->bars[i].paddr;
> > +
> > +        if ( header->bars[i].type != VPCI_BAR_MEM &&
> > +             header->bars[i].type != VPCI_BAR_MEM64_LO )
> > +            continue;
> > +
> > +        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(gaddr)),
> > +                         _mfn(PFN_DOWN(paddr)), PFN_UP(header->bars[i].size),
> 
> The PFN_UP() indicates a problem: For sub-page BARs you can't
> blindly map/unmap them without taking into consideration other
> devices sharing the same page.

I'm not sure I follow, the start address of BARs is always aligned to
a 4KB boundary, so there's no chance of the same page being used by
two different BARs at the same time.

The size is indeed not aligned to 4KB, but I don't see how this can
cause collisions with other BARs unless the domain is actively trying
to make the BARs overlap, in which case there's not much Xen can do.

> > +                         map);
> > +        if ( rc )
> > +            break;
> > +
> > +        header->bars[i].mapped_addr = map ? gaddr : 0;
> > +    }
> > +
> > +    return rc;
> > +}
> 
> Shouldn't this function somewhere honor the unset flags?

Right, I've added a check to make sure the BAR is positioned before
trying to map it into the domain p2m.

> > +static int vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
> > +                         union vpci_val *val, void *data)
> > +{
> > +    struct vpci_header *header = data;
> > +
> > +    val->word = header->command;
> 
> Rather than reading back and storing the value in the write handler,
> I'd recommend doing an actual read here.

OK.

> > +static int vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> > +                          union vpci_val val, void *data)
> > +{
> > +    struct vpci_header *header = data;
> > +    uint16_t new_cmd, saved_cmd;
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    int rc;
> > +
> > +    new_cmd = val.word;
> > +    saved_cmd = header->command;
> > +
> > +    if ( !((new_cmd ^ saved_cmd) & PCI_COMMAND_MEMORY) )
> > +        goto out;
> > +
> > +    /* Memory space access change. */
> > +    rc = vpci_modify_bars(pdev, new_cmd & PCI_COMMAND_MEMORY);
> > +    if ( rc )
> > +    {
> > +        dprintk(XENLOG_ERR,
> > +                "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
> > +                seg, bus, slot, func,
> > +                new_cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);
> > +        return rc;
> 
> I guess you can guess the question already: What is the bare
> hardware equivalent of this failure return?

Yes, this is already fixed since write handlers simply return void.
The hw equivalent would be to ignore the write AFAICT (ie: memory
decoding will not be enabled).

Are you fine with the dprintk or would you also like me to remove
that? (IMHO it's helpful for debugging).

> > +    }
> > +
> > + out:
> 
> Please try to avoid goto-s and labels for other than error handling
> (and even then only when code would otherwise end up pretty
> convoluted).

Done.

> > +static int vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
> > +                         union vpci_val *val, void *data)
> > +{
> > +    struct vpci_bar *bar = data;
> 
> const
> 
> > +    bool hi = false;
> > +
> > +    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
> > +           bar->type == VPCI_BAR_MEM64_HI);
> > +
> > +    if ( bar->type == VPCI_BAR_MEM64_HI )
> > +    {
> > +        ASSERT(reg - PCI_BASE_ADDRESS_0 > 0);
> 
> reg > PCI_BASE_ADDRESS_0

Fixed.

> > +        bar--;
> > +        hi = true;
> > +    }
> > +
> > +    if ( bar->sizing )
> > +        val->double_word = ~(bar->size - 1) >> (hi ? 32 : 0);
> 
> There's also a comment further down - this is producing undefined
> behavior on 32-bits arches.

I've changed size to be a uint64_t.
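
To illustrate why the width of size matters here, the following is a
minimal sketch of the sizing read, not the code from the patch. The
helper name bar_sizing_value() is hypothetical, the size is assumed to
be a power of two, and the read-only low type bits of the register are
ignored for brevity. With a 64-bit size the shift by 32 is well defined
on every architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Value returned by a BAR register read during a sizing cycle.
 * size is the BAR size in bytes (assumed to be a power of two); hi
 * selects the upper 32 bits of a 64-bit BAR.
 */
static uint32_t bar_sizing_value(uint64_t size, bool hi)
{
    /* All address bits above the size are set, the rest are clear. */
    uint64_t mask = ~(size - 1);

    /* The operand is 64 bits wide, so shifting by 32 is not undefined. */
    return (uint32_t)(mask >> (hi ? 32 : 0));
}
```

For a 4K BAR the low half reads back as 0xfffff000 and the high half as
0xffffffff. For an 8G BAR the low half is 0 and the high half is
0xfffffffe.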

> > +static int vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> > +                          union vpci_val val, void *data)
> > +{
> > +    struct vpci_bar *bar = data;
> > +    uint32_t wdata = val.double_word;
> > +    bool hi = false, unset = false;
> > +
> > +    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
> > +           bar->type == VPCI_BAR_MEM64_HI);
> > +
> > +    if ( wdata == GENMASK(31, 0) )
> 
> I'm afraid this again doesn't match real hardware behavior: As the
> low bits are r/o, writes with them having any value, but all other
> bits being 1 should have the same effect. I notice that while I had
> fixed this for the ROM BAR in Linux'es pciback, I should have also
> fixed this for ordinary ones.

I've changed this to:

    switch ( bar->type )
    {
    case VPCI_BAR_MEM:
        size_mask = GENMASK(31, 12);
        break;
    case VPCI_BAR_MEM64_LO:
        size_mask = GENMASK(31, 26);
        break;
    case VPCI_BAR_MEM64_HI:
        size_mask = GENMASK(31, 0);
        break;
    default:
        ASSERT_UNREACHABLE();
        break;
    }

    if ( (wdata & size_mask) == size_mask )
    {
        ...

And removed the ASSERT just above (since it's now handled in the
switch itself).

> > +    {
> > +        /* Next reads from this register are going to return the BAR size. */
> > +        bar->sizing = true;
> > +        return 0;
> > +    }
> > +
> > +    /* End previous sizing cycle if any. */
> > +    bar->sizing = false;
> > +
> > +    unset = bar->unset;
> > +    if ( unset )
> > +        bar->unset = false;
> > +
> > +    if ( bar->type == VPCI_BAR_MEM64_HI )
> > +    {
> > +        ASSERT(reg - PCI_BASE_ADDRESS_0 > 0);
> > +        bar--;
> > +        hi = true;
> > +    }
> > +
> > +    /* Update the relevant part of the BAR address. */
> > +    bar->gaddr &= hi ? ~GENMASK(63, 32) : ~GENMASK(31, 0);
> > +    wdata &= hi ? GENMASK(31, 0) : PCI_BASE_ADDRESS_MEM_MASK;
> 
> Perhaps easier to grok as
> 
>     if ( hi )
>         wdata &= PCI_BASE_ADDRESS_MEM_MASK;

I've done that (with the condition reversed).
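
A minimal sketch of what the update looks like once the mask is only
applied to the low half. This is an illustration rather than the code
in the series: bar_update_addr() and MEM_BAR_ADDR_MASK are hypothetical
names, the latter mirroring the value of PCI_BASE_ADDRESS_MEM_MASK, and
the sketch simply drops the low 4 bits instead of writing back the
value read from hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Address bits of a memory BAR register; the low 4 bits are read-only. */
#define MEM_BAR_ADDR_MASK UINT32_C(0xfffffff0)

/*
 * Return addr with one 32-bit half replaced by wdata. The low half has
 * its type/prefetch bits masked off; the high half is taken verbatim.
 */
static uint64_t bar_update_addr(uint64_t addr, uint32_t wdata, bool hi)
{
    if ( hi )
        return (addr & UINT64_C(0x00000000ffffffff)) |
               ((uint64_t)wdata << 32);

    return (addr & UINT64_C(0xffffffff00000000)) |
           (wdata & MEM_BAR_ADDR_MASK);
}
```

Writing 0xc801400c to the low half of an empty BAR yields 0xc8014000,
and a subsequent write of 1 to the high half yields 0x1c8014000.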

> However, considering the dual use below, I'd prefer if you wrote
> back the value you read to the low 4 bits. They're _supposed_ to
> be r/o, yes, but anyway.

Done.

> 
> > +    bar->gaddr |= (uint64_t)wdata << (hi ? 32 : 0);
> > +
> > +    if ( unset )
> > +    {
> > +        bar->paddr = bar->gaddr;
> 
> So this deals with first time setting of the BAR by Dom0. If Dom0
> later decides to move BARs around, how do you guarantee things
> to continue to work fine if you allow paddr and gaddr to go out of
> sync? Often the reason to do re-assignments is because the OS
> recognized address conflicts. Or it needs to make room for SR-IOV
> BARs.

I've removed the unset check, so that every BAR position change done
by Dom0 is also applied to the hardware, instead of just changing
Dom0's p2m.

> > +        pci_conf_write16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> > +                         PCI_FUNC(pdev->devfn), reg, wdata);
> 
> pci_conf_write32()

Oops, thanks.

> 
> > +    }
> > +
> > +    ASSERT(IS_ALIGNED(bar->gaddr, PAGE_SIZE));
> 
> Urgh.

Removed.

> > +static int vpci_init_bars(struct pci_dev *pdev)
> > +{
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    uint8_t header_type;
> > +    unsigned int i, num_bars;
> > +    struct vpci_header *header = &pdev->vpci->header;
> > +    struct vpci_bar *bars = header->bars;
> > +    int rc;
> > +
> > +    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
> > +    if ( header_type == PCI_HEADER_TYPE_NORMAL )
> > +        num_bars = 6;
> > +    else if ( header_type == PCI_HEADER_TYPE_BRIDGE )
> > +        num_bars = 2;
> > +    else
> > +        return -ENOSYS;
> 
> -EOPNOTSUPP
> 
> > +    /* Setup a handler for the control register. */
> > +    header->command = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
> 
> As the code says, the register is the Command Register, so your
> comment shouldn't say "control".

My mistake.

> > +    rc = xen_vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write,
> > +                               PCI_COMMAND, 2, header);
> > +    if ( rc )
> > +    {
> > +        dprintk(XENLOG_ERR,
> > +                "%04x:%02x:%02x.%u: failed to add handler register %#x: %d\n",
> > +                seg, bus, slot, func, PCI_COMMAND, rc);
> > +        return rc;
> > +    }
> > +
> > +    for ( i = 0; i < num_bars; i++ )
> > +    {
> > +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> > +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> > +        uint64_t addr, size;
> > +        unsigned int index;
> > +
> > +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> > +        {
> > +            bars[i].type = VPCI_BAR_MEM64_HI;
> > +            bars[i].unset = bars[i - 1].unset;
> > +            continue;
> 
> Neither here nor below you install a handler for this upper half.

Ugh, good catch.

> > +        }
> > +        else if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> > +        {
> > +            bars[i].type = VPCI_BAR_IO;
> > +            continue;
> > +        }
> > +        else if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> 
> Pointless "else" (twice).

Removed.

> > +                  PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > +            bars[i].type = VPCI_BAR_MEM64_LO;
> > +        else
> > +            bars[i].type = VPCI_BAR_MEM;
> > +
> > +        /* Size the BAR and map it. */
> > +        index = i;
> > +        rc = pci_size_bar(seg, bus, slot, func, PCI_BASE_ADDRESS_0, num_bars,
> > +                          &index, &addr, &size);
> > +        if ( rc )
> > +        {
> > +            dprintk(XENLOG_ERR,
> > +                    "%04x:%02x:%02x.%u: unable to size BAR#%u: %d\n",
> > +                    seg, bus, slot, func, i, rc);
> > +            return rc;
> > +        }
> > +
> > +        if ( size == 0 )
> > +        {
> > +            bars[i].type = VPCI_BAR_EMPTY;
> > +            continue;
> > +        }
> > +
> > +        if ( (bars[i].type == VPCI_BAR_MEM && addr == GENMASK(31, 12)) ||
> > +             addr == GENMASK(63, 26) )
> 
> Where is this 26 coming from?
> 
> Perhaps
> 
>     if ( addr == GENMASK(bars[i].type == VPCI_BAR_MEM ? 31 : 63, 12) )

I'm checking the memory decode bit here instead in order to figure out
if the BAR is not positioned.

> ? Albeit I'm unconvinced GENMASK() is useful to be used here anyway
> (see also below).

Right, regardless of the specific usage above, what would you
recommend regarding the usage of GENMASK?

Julien suggested introducing GENMASK_ULL. Should I go that route, or
introduce something locally for vPCI?

> > +        {
> > +            /* BAR is not positioned. */
> 
> I can't find anything in the standard saying that all-ones upper
> address bits indicate an unassigned BAR. As long as the memory
> decode bit is off, all BARs are to be considered unassigned afaik.
> Furthermore you can't possibly read e.g. 0xfffff000 from a
> 32-bit BAR covering more than 4k.

OK, so I've now changed this to mark the BAR as unset if the memory
decode bit in the command register is not set.

> > +            bars[i].unset = true;
> > +            ASSERT(is_hardware_domain(pdev->domain));
> > +            ASSERT(!(header->command & PCI_COMMAND_MEMORY));
> 
> You're asserting guest controlled state here (even if it's Dom0).
> 
> > +        }
> > +
> > +        ASSERT(IS_ALIGNED(addr, PAGE_SIZE));
> 
> Urgh (again).

Removed both of the above.

> > --- a/xen/include/xen/vpci.h
> > +++ b/xen/include/xen/vpci.h
> > @@ -50,6 +50,34 @@ int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> >  struct vpci {
> >      /* Root pointer for the tree of vPCI handlers. */
> >      struct rb_root handlers;
> > +
> > +#ifdef __XEN__
> > +    /* Hide the rest of the vpci struct from the user-space test harness. */
> > +    struct vpci_header {
> > +        /* Cached value of the command register. */
> > +        uint16_t command;
> > +        /* Information about the PCI BARs of this device. */
> > +        struct vpci_bar {
> > +            enum {
> > +                VPCI_BAR_EMPTY,
> > +                VPCI_BAR_IO,
> > +                VPCI_BAR_MEM,
> 
> MEM32?

Changed.

> > +                VPCI_BAR_MEM64_LO,
> > +                VPCI_BAR_MEM64_HI,
> > +            } type;
> > +            /* Hardware address. */
> > +            paddr_t paddr;
> > +            /* Guest address where the BAR should be mapped. */
> > +            paddr_t gaddr;
> > +            /* Current guest address where the BAR is mapped. */
> > +            paddr_t mapped_addr;
> 
> Why do you need to track both "should be" and "is" addresses? Also
> I think all three would more naturally be frame numbers.

I think I can use a single field to store the address.

> > +            size_t size;
> 
> Is this enough for e.g. ARM32 (remember this is a common
> header)?

No, I've changed it to uint64_t.

> > +            unsigned int attributes:4;
> 
> ???

Changed this to "bool prefetchable" instead.

> > +            bool sizing;
> > +            bool unset;
> 
> Isn't this redundant with e.g. gaddr (or as per above gfn) being
> INVALID_PADDR (INVALID_GFN)?

Yes, now removed.

> > +        } bars[6];
> 
> What about the ROM and SR-IOV ones?

I've implemented support for the expansion ROM BAR (which I still need
to figure out how to test), but I would like to defer SR-IOV for later
because it involves a non-trivial amount of work, and with this series
one can already boot a PVH Dom0 (minus SR-IOV of course).

Thanks, Roger.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/9] xen/vpci: add handlers to map the BARs
  2017-06-22 17:13     ` Roger Pau Monne
@ 2017-06-23  8:58       ` Jan Beulich
  2017-06-23 10:55         ` Roger Pau Monne
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-06-23  8:58 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, julien.grall, xen-devel, boris.ostrovsky

>>> On 22.06.17 at 19:13, <roger.pau@citrix.com> wrote:
> On Fri, May 19, 2017 at 09:21:56AM -0600, Jan Beulich wrote:
>> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> > +static int vpci_modify_bars(struct pci_dev *pdev, const bool map)
>> > +{
>> > +    struct vpci_header *header = &pdev->vpci->header;
>> > +    unsigned int i;
>> > +    int rc = 0;
>> > +
>> > +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>> > +    {
>> > +        paddr_t gaddr = map ? header->bars[i].gaddr
>> > +                            : header->bars[i].mapped_addr;
>> > +        paddr_t paddr = header->bars[i].paddr;
>> > +
>> > +        if ( header->bars[i].type != VPCI_BAR_MEM &&
>> > +             header->bars[i].type != VPCI_BAR_MEM64_LO )
>> > +            continue;
>> > +
>> > +        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(gaddr)),
>> > +                         _mfn(PFN_DOWN(paddr)), PFN_UP(header->bars[i].size),
>> 
>> The PFN_UP() indicates a problem: For sub-page BARs you can't
>> blindly map/unmap them without taking into consideration other
>> devices sharing the same page.
> 
> I'm not sure I follow, the start address of BARs is always aligned to
> a 4KB boundary, so there's no chance of the same page being used by
> two different BARs at the same time.

I'm not sure where you're taking this from. Modern BIOSes may
aim at doing so, but for one I'm sure I've seen smaller alignment
quite often on older machines, and then my most modern AMD
one has these three devices, for example:

00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (prog-if 01 [AHCI 1.0])
	Subsystem: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64
	Interrupt: pin A routed to IRQ 22
	Region 0: I/O ports at 2430 [size=8]
	Region 1: I/O ports at 2424 [size=4]
	Region 2: I/O ports at 2428 [size=8]
	Region 3: I/O ports at 2420 [size=4]
	Region 4: I/O ports at 2400 [size=16]
	Region 5: Memory at c8014000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [60] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [70] SATA HBA v1.0 InCfgSpace
	Kernel driver in use: ahci
	Kernel modules: ahci

00:12.2 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller (prog-if 20 [EHCI])
	Subsystem: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64, Cache Line Size: 32 bytes
	Interrupt: pin B routed to IRQ 17
	Region 0: Memory at c8014400 (32-bit, non-prefetchable) [size=256]
	Capabilities: [c0] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
		Bridge: PM- B3+
	Capabilities: [e4] Debug port: BAR=1 offset=00e0
	Kernel driver in use: ehci_hcd
	Kernel modules: ehci-hcd

00:13.2 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller (prog-if 20 [EHCI])
	Subsystem: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64, Cache Line Size: 32 bytes
	Interrupt: pin B routed to IRQ 19
	Region 0: Memory at c8014800 (32-bit, non-prefetchable) [size=256]
	Capabilities: [c0] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
		Bridge: PM- B3+
	Capabilities: [e4] Debug port: BAR=1 offset=00e0
	Kernel driver in use: ehci_hcd
	Kernel modules: ehci-hcd

> The size is indeed not aligned to 4KB, but I don't see how this can
> cause collisions with other BARs unless the domain is actively trying
> to make the BARs overlap, in which case there's not much Xen can do.

The above is not what Dom0 did, but how the system boots up.
And this "there's not much Xen can do" is what I've been trying
to get at with my comment: A solution is needed here for your
approach to vPCI handling to be viable.
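
The arithmetic behind this can be sketched as follows, assuming 4K
pages. The helpers pfn_down() and pfn_up() are stand-ins written for
this illustration that follow the rounding of the PFN_DOWN() and
PFN_UP() macros used in the patch.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/* Round an address down to its frame number, like PFN_DOWN(). */
static uint64_t pfn_down(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Round a byte count up to whole frames, like PFN_UP(). */
static uint64_t pfn_up(uint64_t bytes)
{
    return (bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
```

For the three memory BARs listed above (c8014000, c8014400 and
c8014800) pfn_down() gives frame 0xc8014 in every case, and pfn_up() of
1K or 256 bytes is one frame. Mapping or unmapping any one of them
therefore affects the page that holds the other two.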

>> > +static int vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
>> > +                          union vpci_val val, void *data)
>> > +{
>> > +    struct vpci_header *header = data;
>> > +    uint16_t new_cmd, saved_cmd;
>> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
>> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>> > +    int rc;
>> > +
>> > +    new_cmd = val.word;
>> > +    saved_cmd = header->command;
>> > +
>> > +    if ( !((new_cmd ^ saved_cmd) & PCI_COMMAND_MEMORY) )
>> > +        goto out;
>> > +
>> > +    /* Memory space access change. */
>> > +    rc = vpci_modify_bars(pdev, new_cmd & PCI_COMMAND_MEMORY);
>> > +    if ( rc )
>> > +    {
>> > +        dprintk(XENLOG_ERR,
>> > +                "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
>> > +                seg, bus, slot, func,
>> > +                new_cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);
>> > +        return rc;
>> 
>> I guess you can guess the question already: What is the bare
>> hardware equivalent of this failure return?
> 
> Yes, this is already fixed since write handlers simply return void.
> The hw equivalent would be to ignore the write AFAICT (ie: memory
> decoding will not be enabled).
> 
> Are you fine with the dprintk or would you also like me to remove
> that? (IMHO it's helpful for debugging).

I think it can stay there for the initial phase. Later (before
declaring PVHv2 fully supported) we may want to re-consider
which of such messages are useful to keep.

>> > +static int vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
>> > +                          union vpci_val val, void *data)
>> > +{
>> > +    struct vpci_bar *bar = data;
>> > +    uint32_t wdata = val.double_word;
>> > +    bool hi = false, unset = false;
>> > +
>> > +    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
>> > +           bar->type == VPCI_BAR_MEM64_HI);
>> > +
>> > +    if ( wdata == GENMASK(31, 0) )
>> 
>> I'm afraid this again doesn't match real hardware behavior: As the
>> low bits are r/o, writes with them having any value, but all other
>> bits being 1 should have the same effect. I notice that while I had
>> fixed this for the ROM BAR in Linux'es pciback, I should have also
>> fixed this for ordinary ones.
> 
> I've changed this to:
> 
>     switch ( bar->type )
>     {
>     case VPCI_BAR_MEM:
>         size_mask = GENMASK(31, 12);

Relating to the comment further up - where's this 12 coming from?

>         break;
>     case VPCI_BAR_MEM64_LO:
>         size_mask = GENMASK(31, 26);

And this 26?

>         break;
>     case VPCI_BAR_MEM64_HI:
>         size_mask = GENMASK(31, 0);
>         break;
>     default:
>         ASSERT_UNREACHABLE();
>         break;

You want to return here.

>> > +    }
>> > +
>> > +    ASSERT(IS_ALIGNED(bar->gaddr, PAGE_SIZE));
>> 
>> Urgh.
> 
> Removed.

With your comment further up, you should have refused to do so
(i.e. I'm getting the impression you're not really sure about that
supposed 4k alignment).

>> > +        if ( (bars[i].type == VPCI_BAR_MEM && addr == GENMASK(31, 12)) ||
>> > +             addr == GENMASK(63, 26) )
>> 
>> Where is this 26 coming from?
>> 
>> Perhaps
>> 
>>     if ( addr == GENMASK(bars[i].type == VPCI_BAR_MEM ? 31 : 63, 12) )
> 
> I'm checking the memory decode bit here instead in order to figure out
> if the BAR is not positioned.
> 
>> ? Albeit I'm unconvinced GENMASK() is useful to be used here anyway
>> (see also below).
> 
> Right, regardless of the specific usage above, what would you
> recommend regarding the usage of GENMASK?
> 
> Julien suggested introducing GENMASK_ULL. Should I go that route, or
> introduce something locally for vPCI?

Back when GENMASK() was introduced to our code base I've
already indicated that I'm not really in favor of it. I don't think
it really helps readability all that much (to me, plain hex
numbers are easier to grok, albeit I admit ones extending
beyond 8 or 10 digits are less easy to digest; sadly the once
proposed [by Intel, I think, in the early ia64 days] language
extension to permit _ separators in numbers doesn't appear
to have made it anywhere).
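
For comparison, the two notations under discussion look roughly like
this. The macro is a sketch of what a GENMASK_ULL could look like, not
an existing definition in the tree; it is built on unsigned long long
so that masks above bit 31 do not depend on the width of long.

```c
/*
 * Hypothetical 64-bit variant of GENMASK(): a mask with bits h..l set
 * (inclusive, 0 <= l <= h <= 63).
 */
#define GENMASK_ULL(h, l) \
    (((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/*
 * Plain hex equivalents of some masks from this thread:
 *   GENMASK_ULL(31, 12) is 0xfffff000
 *   GENMASK_ULL(63, 32) is 0xffffffff00000000
 */
```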

>> > +        } bars[6];
>> 
>> What about the ROM and SR-IOV ones?
> 
> I've implemented support for the expansion ROM BAR (which I still need
> to figure out how to test),

There should be hardly any graphics card without a ROM. For
remote boot purposes also most NICs come with a ROM, albeit
many BIOSes allow turning it off. Most SCSI cards I've seen
have a (configuration) ROM too.

> but I would like to defer SR-IOV for later
> because it involves a non-trivial amount of work, and with this series
> one can already boot a PVH Dom0 (minus SR-IOV of course).

That's likely okay as long as there's a suitable, much beloved
"fixme" comment somewhere.

Jan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/9] xen/vpci: add handlers to map the BARs
  2017-06-23  8:58       ` Jan Beulich
@ 2017-06-23 10:55         ` Roger Pau Monne
  0 siblings, 0 replies; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-23 10:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, julien.grall, xen-devel, boris.ostrovsky

On Fri, Jun 23, 2017 at 02:58:28AM -0600, Jan Beulich wrote:
> >>> On 22.06.17 at 19:13, <roger.pau@citrix.com> wrote:
> > On Fri, May 19, 2017 at 09:21:56AM -0600, Jan Beulich wrote:
> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> >> > +static int vpci_modify_bars(struct pci_dev *pdev, const bool map)
> >> > +{
> >> > +    struct vpci_header *header = &pdev->vpci->header;
> >> > +    unsigned int i;
> >> > +    int rc = 0;
> >> > +
> >> > +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> >> > +    {
> >> > +        paddr_t gaddr = map ? header->bars[i].gaddr
> >> > +                            : header->bars[i].mapped_addr;
> >> > +        paddr_t paddr = header->bars[i].paddr;
> >> > +
> >> > +        if ( header->bars[i].type != VPCI_BAR_MEM &&
> >> > +             header->bars[i].type != VPCI_BAR_MEM64_LO )
> >> > +            continue;
> >> > +
> >> > +        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(gaddr)),
> >> > +                         _mfn(PFN_DOWN(paddr)), PFN_UP(header->bars[i].size),
> >> 
> >> The PFN_UP() indicates a problem: For sub-page BARs you can't
> >> blindly map/unmap them without taking into consideration other
> >> devices sharing the same page.
> > 
> > I'm not sure I follow, the start address of BARs is always aligned to
> > a 4KB boundary, so there's no chance of the same page being used by
> > two different BARs at the same time.
> 
> I'm not sure where you're taking this from. Modern BIOSes may
> aim at doing so, but for one I'm sure I've seen smaller alignment
> quite often on older machines, and then my most modern AMD
> one has these three devices, for example:

Right, I guess I will have to somehow check for overlapping regions,
how inconvenient.

> 00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (prog-if 01 [AHCI 1.0])
> 	Subsystem: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Latency: 64
> 	Interrupt: pin A routed to IRQ 22
> 	Region 0: I/O ports at 2430 [size=8]
> 	Region 1: I/O ports at 2424 [size=4]
> 	Region 2: I/O ports at 2428 [size=8]
> 	Region 3: I/O ports at 2420 [size=4]
> 	Region 4: I/O ports at 2400 [size=16]
> 	Region 5: Memory at c8014000 (32-bit, non-prefetchable) [size=1K]
> 	Capabilities: [60] Power Management version 2
> 		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> 		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> 	Capabilities: [70] SATA HBA v1.0 InCfgSpace
> 	Kernel driver in use: ahci
> 	Kernel modules: ahci
> 
> 00:12.2 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller (prog-if 20 [EHCI])
> 	Subsystem: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Latency: 64, Cache Line Size: 32 bytes
> 	Interrupt: pin B routed to IRQ 17
> 	Region 0: Memory at c8014400 (32-bit, non-prefetchable) [size=256]
> 	Capabilities: [c0] Power Management version 2
> 		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
> 		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> 		Bridge: PM- B3+
> 	Capabilities: [e4] Debug port: BAR=1 offset=00e0
> 	Kernel driver in use: ehci_hcd
> 	Kernel modules: ehci-hcd
> 
> 00:13.2 USB controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller (prog-if 20 [EHCI])
> 	Subsystem: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 USB EHCI Controller
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Latency: 64, Cache Line Size: 32 bytes
> 	Interrupt: pin B routed to IRQ 19
> 	Region 0: Memory at c8014800 (32-bit, non-prefetchable) [size=256]
> 	Capabilities: [c0] Power Management version 2
> 		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
> 		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> 		Bridge: PM- B3+
> 	Capabilities: [e4] Debug port: BAR=1 offset=00e0
> 	Kernel driver in use: ehci_hcd
> 	Kernel modules: ehci-hcd
> 
> > The size is indeed not aligned to 4KB, but I don't see how this can
> > cause collisions with other BARs unless the domain is actively trying
> > to make the BARs overlap, in which case there's not much Xen can do.
> 
> The above is not what Dom0 did, but how the system boots up.
> And this "there's not much Xen can do" is what I've been trying
> to get at with my comment: A solution is needed here for your
> approach to vPCI handling to be viable.

Checking for overlap seems to be the only sensible option here. Xen is
in no position to relocate BARs.
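
One possible shape for such a check, as a rough sketch only: the
struct bar_region descriptor and find_page_conflict() are hypothetical
and do not exist in the series, 4K pages are assumed, and how the list
of regions would be collected across devices is left open.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical descriptor for one memory BAR. */
struct bar_region {
    uint64_t addr; /* physical address */
    uint64_t size; /* in bytes, must be non-zero */
};

/*
 * Return the index of the first region other than regions[idx] that
 * shares at least one 4K page with it, or -1 if there is none.
 */
static int find_page_conflict(const struct bar_region *regions, size_t count,
                              size_t idx)
{
    uint64_t first = regions[idx].addr >> PAGE_SHIFT;
    uint64_t last = (regions[idx].addr + regions[idx].size - 1) >> PAGE_SHIFT;
    size_t i;

    for ( i = 0; i < count; i++ )
    {
        uint64_t other_first = regions[i].addr >> PAGE_SHIFT;
        uint64_t other_last =
            (regions[i].addr + regions[i].size - 1) >> PAGE_SHIFT;

        /* Two frame ranges overlap if each starts before the other ends. */
        if ( i != idx && first <= other_last && other_first <= last )
            return (int)i;
    }

    return -1;
}
```

With the three BARs from the lspci output quoted above, every region
reports a conflict, since all of them live in frame 0xc8014.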

> >> > +static int vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> >> > +                          union vpci_val val, void *data)
> >> > +{
> >> > +    struct vpci_bar *bar = data;
> >> > +    uint32_t wdata = val.double_word;
> >> > +    bool hi = false, unset = false;
> >> > +
> >> > +    ASSERT(bar->type == VPCI_BAR_MEM || bar->type == VPCI_BAR_MEM64_LO ||
> >> > +           bar->type == VPCI_BAR_MEM64_HI);
> >> > +
> >> > +    if ( wdata == GENMASK(31, 0) )
> >> 
> >> I'm afraid this again doesn't match real hardware behavior: As the
> >> low bits are r/o, writes with them having any value, but all other
> >> bits being 1 should have the same effect. I notice that while I had
> >> fixed this for the ROM BAR in Linux'es pciback, I should have also
> >> fixed this for ordinary ones.
> > 
> > I've changed this to:
> > 
> >     switch ( bar->type )
> >     {
> >     case VPCI_BAR_MEM:
> >         size_mask = GENMASK(31, 12);
> 
> Relating to the comment further up - where's this 12 coming from?

Hm, this is from the "PCI Express Technology" Mindshare book, which
states that for 32bit memory BARs bits [11,4] are hardcoded to 0 and
for 64bit BARs bits [25,4] are also hardcoded to 0. I've been
searching for such a statement in the PCI Local Bus specification
version 3.0, but I don't seem to be able to find any references to
this. I guess I will do:

switch ( bar->type )
{
case VPCI_BAR_MEM32:
case VPCI_BAR_MEM64_LO:
    size_mask = PCI_BASE_ADDRESS_MEM_MASK;
    break;
case VPCI_BAR_MEM64_HI:
    size_mask = ~0u;
    break;
default:
    ASSERT_UNREACHABLE();
    return;
}
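
To illustrate the point about read-only low bits, here is a small sketch of
how a sizing write could be detected with such a mask: only the writable
address bits are compared, so the value of the low four bits does not matter.
The enum, the macro and the function names are local stand-ins for this
example, not the identifiers used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for PCI_BASE_ADDRESS_MEM_MASK: bits 3:0 of a memory BAR are r/o. */
#define BAR_MEM_ADDR_MASK ((uint32_t)~UINT32_C(0xf))

enum bar_type { BAR_MEM32, BAR_MEM64_LO, BAR_MEM64_HI };

/* Writable address bits of the 32-bit register backing each BAR type. */
static uint32_t bar_size_mask(enum bar_type type)
{
    switch ( type )
    {
    case BAR_MEM32:
    case BAR_MEM64_LO:
        return BAR_MEM_ADDR_MASK;
    case BAR_MEM64_HI:
        return UINT32_MAX;
    }

    assert(!"unknown BAR type");
    return 0;
}

/* A write is a sizing request if every writable address bit is set. */
static bool bar_is_sizing_write(enum bar_type type, uint32_t wdata)
{
    uint32_t mask = bar_size_mask(type);

    return (wdata & mask) == mask;
}
```

With this, writes of 0xffffffff and 0xfffffff0 to a 32-bit memory BAR are
treated identically, which is what real hardware does.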

> >         break;
> >     case VPCI_BAR_MEM64_LO:
> >         size_mask = GENMASK(31, 26);
> 
> And this 26?
> 
> >         break;
> >     case VPCI_BAR_MEM64_HI:
> >         size_mask = GENMASK(31, 0);
> >         break;
> >     default:
> >         ASSERT_UNREACHABLE();
> >         break;
> 
> You want to return here.
> 
> >> > +    }
> >> > +
> >> > +    ASSERT(IS_ALIGNED(bar->gaddr, PAGE_SIZE));
> >> 
> >> Urgh.
> > 
> > Removed.
> 
> With your comment further up, you should have refused to do so
> (i.e. I'm getting the impression you're not really sure about that
> supposed 4k alignment).

No, it's not 4K aligned.

> >> > +        if ( (bars[i].type == VPCI_BAR_MEM && addr == GENMASK(31, 12)) ||
> >> > +             addr == GENMASK(63, 26) )
> >> 
> >> Where is this 26 coming from?
> >> Perhaps
> >> 
> >>     if ( addr == GENMASK(bars[i].type == VPCI_BAR_MEM ? 31 : 63, 12) )
> > 
> > I'm checking the memory decode bit here instead in order to figure out
> > if the BAR is not positioned.
> > 
> >> ? Albeit I'm unconvinced GENMASK() is useful to be used here anyway
> >> (see also below).
> > 
> > Right, regardless of the specific usage above, what would you
> > recommend regarding the usage of GENMASK?
> > 
> > Julien suggested introducing GENMASK_ULL. Should I go that route, or
> > introduce something locally for vPCI?
> 
> Back when GENMASK() was introduced to our code base I've
> already indicated that I'm not really in favor of it. I don't think
> it really helps readability all that much (to me, plain hex
> numbers are easier to grok, albeit I admit ones extending
> beyond 8 or 10 digits are less easy to digest; sadly the once
> proposed [by Intel, I think, in the early ia64 days] language
> extension to permit _ separators in numbers doesn't appear
> to have made it anywhere).

I could also switch GENMASK to use long long instead, but I'm not sure
if that's going to break existing callers. Let me try to see if I can
get away without using it (although I kind of liked it for coding
masks).
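
For reference, a 64-bit-safe variant could look roughly like the sketch
below. GENMASK_U64 is a hypothetical name chosen for this example (it is
neither the existing GENMASK nor the suggested GENMASK_ULL), and the plain
hex constants are included only to compare the two notations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit mask helper: bits h..l (inclusive) set, 0 <= l <= h <= 63. */
#define GENMASK_U64(h, l) \
    ((~UINT64_C(0) << (l)) & (~UINT64_C(0) >> (63 - (h))))

/* The same masks written as plain hex, for comparison. */
#define MEM32_ADDR_MASK UINT64_C(0x00000000fffffff0)
#define ADDR_HI_MASK    UINT64_C(0xffffffff00000000)

/* Runtime wrapper that rejects out-of-range bit positions. */
static uint64_t mask_bits(unsigned int h, unsigned int l)
{
    assert(l <= h && h <= 63);

    return GENMASK_U64(h, l);
}
```

Because both shifts operate on a 64-bit operand, the result does not depend
on the width of long on the target.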

> >> > +        } bars[6];
> >> 
> >> What about the ROM and SR-IOV ones?
> > 
> > I've implemented support for the expansion ROM BAR (which I still need
> > to figure out how to test),
> 
> There should be hardly any graphics card without a ROM. For
> remote boot purposes also most NICs come with a ROM, albeit
> many BIOSes allow turning it off. Most SCSI cards I've seen
> have a (configuration) ROM too.

OK, but testing NIC ROMs is going to be impossible from a PVH Dom0. I
guess graphics cards, although most of my boxes are headless.

> > but I would like to defer SR-IOV for later
> > because it involves a non-trivial amount of work, and with this series
> > one can already boot a PVH Dom0 (minus SR-IOV of course).
> 
> That's likely okay as long as there's a suitable, much beloved
> "fixme" comment somewhere.

OK, the vpci.h header seems like the best place to add such a fixme
comment.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities
  2017-05-23 12:49   ` Jan Beulich
@ 2017-06-26 11:50     ` Roger Pau Monne
  2017-06-27  6:44       ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-26 11:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

On Tue, May 23, 2017 at 06:49:50AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > Add traps to each capability PCI_CAP_LIST_NEXT field in order to mask them on
> > request.
> > 
> > All capabilities from the device are fetched and stored in an internal list,
> > that's later used in order to return the next capability to the guest. Note
> > that this only removes the capability from the linked list as seen by the
> > guest, but the actual capability structure could still be accessed by the
> > guest, provided that its position can be found using another mechanism.
> 
> Which is a problem. Drivers tied to a single device or a narrow set
> aren't unknown to do such. In fact in the past Intel has given us
> workaround outlines for some of their chipset issues which directed
> us to fixed offsets instead of using the capability chains.
> 
> > Finally the MSI and MSI-X capabilities are masked until Xen knows how to
> > properly handle accesses to them.
> > 
> > This should allow a PVH Dom0 to boot on some hardware, provided that the
> > hardware doesn't require MSI/MSI-X and that there are no SR-IOV devices in the
> > system, so the panic at the end of the PVH Dom0 build is replaced by a
> > warning.
> 
> While this is certainly nice for development / debugging purposes,
> what's the longer term intention with the functionality being added
> here? We had no need to mask capabilities for PV Dom0, so I would
> have hoped to get away without for PVH too.

Yes, this patch is mostly for development / debugging purposes, at
least in its current state.

I thought that maybe if users find issues with the MSI/MSI-X
implementations it would be easier to diagnose if there's an option to
disable those emulations.

Regarding what we would like to mask/hide from Dom0 I think the only
capability Xen must hide from Dom0 is ACS, because it's used by Xen
and Dom0 shouldn't poke at it at all (but that's an extended
capability anyway, which is not handled by this patch).

Of course for DomU Xen certainly wants to hide more capabilities, but
that's out of the picture ATM.
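
As an illustration of the list-based hiding described in the commit message,
the sketch below walks a capability chain in a copy of the configuration
space and skips entries whose ID is on a hidden list. It is a standalone
model with hypothetical names, not the code from the patch, and it assumes
the standard layout (capabilities pointer at 0x34, ID byte followed by the
next pointer).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE 256
#define CAP_PTR_REG    0x34 /* capabilities pointer in the header */
#define CAP_LIST_ID    0    /* offset of the ID byte inside a capability */
#define CAP_LIST_NEXT  1    /* offset of the next pointer */
#define CAP_MAX_WALK   48   /* bound the walk in case the list loops */

static bool cap_hidden(uint8_t id, const uint8_t *hidden, size_t nr_hidden)
{
    size_t i;

    for ( i = 0; i < nr_hidden; i++ )
        if ( hidden[i] == id )
            return true;

    return false;
}

/*
 * Follow the chain starting at raw pointer value 'ptr' and return the offset
 * of the first capability whose ID is not hidden, or 0 at the end of the list.
 */
static uint8_t next_visible_cap(const uint8_t *cfg, uint8_t ptr,
                                const uint8_t *hidden, size_t nr_hidden)
{
    unsigned int ttl;

    assert(cfg);

    for ( ttl = CAP_MAX_WALK; ttl; ttl-- )
    {
        uint8_t pos = ptr & 0xfc; /* low two bits are reserved */

        if ( pos < 0x40 ) /* capabilities live after the 64-byte header */
            return 0;
        if ( !cap_hidden(cfg[pos + CAP_LIST_ID], hidden, nr_hidden) )
            return pos;
        ptr = cfg[pos + CAP_LIST_NEXT];
    }

    return 0;
}
```

A read of a next pointer would return next_visible_cap() of the real value,
so a hidden capability drops out of the chain while its registers stay where
they are, which is exactly the limitation noted in the review.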

Let me know whether you consider having this patch to mask
MSI/MSI-X capabilities on user request for Dom0 helpful or not.

Thanks, Roger.

* Re: [PATCH v3 7/9] vpci: add a priority field to the vPCI register initializer
  2017-05-23 12:52   ` Jan Beulich
@ 2017-06-26 14:41     ` Roger Pau Monne
  0 siblings, 0 replies; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-26 14:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

On Tue, May 23, 2017 at 06:52:42AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > +#define REGISTER_VPCI_INIT(f, p)                                        \
> > +  static const struct vpci_register_init                                \
> > +                      x##_entry __used_section(".data.vpci") = {        \
> > +    .init = f,                                                          \
> > +    .priority = p,                                                      \
> > +}
> 
> I think I'd rather see this done by ordering the entries in
> .data.vpci suitably, e.g. by adding numeric tags and using SORT()
> or some such in the linker script. Iirc upstream Linux did change to
> such a model for some of their initialization, so you may be able to
> glean something there.

Thanks, I've now switched to using SORT and a numeric suffix to the
section used by each entry. This looks much better and requires fewer
code changes.

Roger.

* Re: [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities
  2017-06-26 11:50     ` Roger Pau Monne
@ 2017-06-27  6:44       ` Jan Beulich
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2017-06-27  6:44 UTC (permalink / raw)
  To: roger.pau; +Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

>>> Roger Pau Monne <roger.pau@citrix.com> 06/26/17 1:51 PM >>>
>Let me know whether you consider having this patch to mask
>MSI/MSI-X capabilities on user request for Dom0 helpful or not.

If the capability hiding was needed for anything else, I could see what
you're doing here as a potentially helpful by-product. But introducing
that logic just for this reason seems too much to me.

Jan


* Re: [PATCH v3 8/9] vpci/msi: add MSI handlers
  2017-05-26 15:26   ` Jan Beulich
@ 2017-06-27 10:22     ` Roger Pau Monne
  2017-06-27 11:44       ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-27 10:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, julien.grall, Paul Durrant, xen-devel, boris.ostrovsky

On Fri, May 26, 2017 at 09:26:03AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > Add handlers for the MSI control, address, data and mask fields in order to
> > detect accesses to them and setup the interrupts as requested by the guest.
> > 
> > Note that the pending register is not trapped, and the guest can freely
> > read/write to it.
> > 
> > Whether Xen is going to provide this functionality to Dom0 (MSI emulation) is
> > controlled by the "msi" option in the dom0 field. When disabling this option
> > Xen will hide the MSI capability structure from Dom0.
> 
> Unless there's an actual reason behind this, I'd view this as a
> development only thing, which shouldn't be in a non-RFC patch.

Yes, I've removed the patch to hide capabilities ATM.

> > --- a/xen/arch/x86/hvm/vmsi.c
> > +++ b/xen/arch/x86/hvm/vmsi.c
> > @@ -622,3 +622,144 @@ void msix_write_completion(struct vcpu *v)
> >      if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
> >          gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
> >  }
> > +
> > +static unsigned int msi_vector(uint16_t data)
> > +{
> > +    return (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;
> 
> I know code elsewhere does it this way, but I'm intending to switch
> that to use MASK_EXTR(), and I'd like to ask to use that construct
> right away in new code.
> 
> > +static unsigned int msi_flags(uint16_t data, uint64_t addr)
> > +{
> > +    unsigned int rh, dm, dest_id, deliv_mode, trig_mode;
> > +
> > +    rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
> > +    dm = (addr >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
> > +    dest_id = (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
> > +    deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
> > +    trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> 
> I'm sure you've simply copied code from elsewhere, which I agree
> should generally be fine. However, on top of what I did say above
> I'd also like to request to drop such stray 0x prefixes, plus I think
> the 7 wants a #define.

I've switched this to using the MASK_EXTR macro, so there's no longer
the need to add the 0x1.

> > +    return dest_id | (rh << GFLAGS_SHIFT_RH) | (dm << GFLAGS_SHIFT_DM) |
> > +           (deliv_mode << GFLAGS_SHIFT_DELIV_MODE) |
> > +           (trig_mode << GFLAGS_SHIFT_TRG_MODE);
> 
> How come dest_id has no shift? Please let's not assume the shift
> is zero now and forever.

I've added a define for GFLAGS_SHIFT_DEST_ID that sets it to 0.
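
A rough sketch of how the field extraction and the flags composition can be
written purely in terms of masks. The MASK_EXTR/MASK_INSR definitions are
local stand-ins written for this example, the MSI field masks are assumed
from the x86 MSI data/address layout, and the GFLAGS_*_MASK values are a
hypothetical layout chosen only so the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for MASK_EXTR()/MASK_INSR(): (m & -m) isolates the mask's lowest set bit. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/* x86 MSI data/address fields (assumed values for this sketch). */
#define MSI_DATA_VECTOR_MASK        0x000000ffu
#define MSI_DATA_DELIVERY_MODE_MASK 0x00000700u
#define MSI_DATA_TRIGGER_MASK       0x00008000u
#define MSI_ADDR_DESTMODE_MASK      0x00000004u
#define MSI_ADDR_REDIRECTION_MASK   0x00000008u
#define MSI_ADDR_DEST_ID_MASK       0x000ff000u

/* Hypothetical gflags layout, used only by this example. */
#define GFLAGS_DEST_ID_MASK    0x00ffu
#define GFLAGS_RH_MASK         0x0100u
#define GFLAGS_DM_MASK         0x0200u
#define GFLAGS_DELIV_MODE_MASK 0x7000u
#define GFLAGS_TRG_MODE_MASK   0x8000u

/* Extract a field; a zero mask would make MASK_EXTR() divide by zero. */
static unsigned int field(uint64_t value, uint32_t mask)
{
    assert(mask);

    return (unsigned int)MASK_EXTR(value, mask);
}

static unsigned int msi_vector(uint16_t data)
{
    return field(data, MSI_DATA_VECTOR_MASK);
}

static unsigned int msi_flags(uint16_t data, uint64_t addr)
{
    return MASK_INSR(field(addr, MSI_ADDR_DEST_ID_MASK), GFLAGS_DEST_ID_MASK) |
           MASK_INSR(field(addr, MSI_ADDR_REDIRECTION_MASK), GFLAGS_RH_MASK) |
           MASK_INSR(field(addr, MSI_ADDR_DESTMODE_MASK), GFLAGS_DM_MASK) |
           MASK_INSR(field(data, MSI_DATA_DELIVERY_MODE_MASK),
                     GFLAGS_DELIV_MODE_MASK) |
           MASK_INSR(field(data, MSI_DATA_TRIGGER_MASK), GFLAGS_TRG_MODE_MASK);
}
```

Since every field, including the destination ID, goes through a mask, no
shift is silently assumed to be zero and no literal 0x1 or 0x7 is needed.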

> > +void vpci_msi_mask(struct vpci_arch_msi *arch, unsigned int entry, bool mask)
> > +{
> > +    struct pirq *pinfo;
> > +    struct irq_desc *desc;
> > +    unsigned long flags;
> > +    int irq;
> > +
> > +    ASSERT(arch->pirq != -1);
> 
> Perhaps better ">= 0"?

OK.

> > +    pinfo = pirq_info(current->domain, arch->pirq + entry);
> > +    ASSERT(pinfo);
> > +
> > +    irq = pinfo->arch.irq;
> > +    ASSERT(irq < nr_irqs);
> > +
> > +    desc = irq_to_desc(irq);
> > +    ASSERT(desc);
> 
> It's not entirely clear to me where all the checks are which allow
> the checks here to be ASSERT()s.

Hm, this function is only called if the pirq is set (ie: if the
interrupt is bound to the domain). AFAICT if Xen cannot get the irq or
the desc related to this pirq it means that something/someone has
unbound or changed the pirq under Xen's feet, and thus the expected
state is no longer valid.

I could add something like:

if ( irq >= nr_irqs || irq < 0 )
{
    ASSERT_UNREACHABLE();
    return;
}

And the same for the desc check if that seems more sensible.

> > +int vpci_msi_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> > +                    uint64_t address, uint32_t data, unsigned int vectors)
> > +{
> > +    struct msi_info msi_info = {
> > +        .seg = pdev->seg,
> > +        .bus = pdev->bus,
> > +        .devfn = pdev->devfn,
> > +        .entry_nr = vectors,
> > +    };
> > +    int index = -1, rc;
> > +    unsigned int i;
> > +
> > +    ASSERT(arch->pirq == -1);
> > +
> > +    /* Get a PIRQ. */
> > +    rc = allocate_and_map_msi_pirq(pdev->domain, &index, &arch->pirq,
> > +                                   &msi_info);
> > +    if ( rc )
> > +    {
> > +        dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> > +                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> > +                PCI_FUNC(pdev->devfn), rc);
> > +        return rc;
> > +    }
> > +
> > +    for ( i = 0; i < vectors; i++ )
> > +    {
> > +        xen_domctl_bind_pt_irq_t bind = {
> > +            .hvm_domid = DOMID_SELF,
> > +            .machine_irq = arch->pirq + i,
> > +            .irq_type = PT_IRQ_TYPE_MSI,
> > +            .u.msi.gvec = msi_vector(data) + i,
> > +            .u.msi.gflags = msi_flags(data, address),
> > +        };
> > +
> > +        pcidevs_lock();
> > +        rc = pt_irq_create_bind(pdev->domain, &bind);
> > +        if ( rc )
> 
> I don't think you need to hold the lock for the if() and its body. While
> I see unmap_domain_pirq(), I don't really see why it does, so perhaps
> there's some cleanup potential up front.

unmap_domain_pirq might call into pci_disable_msi which I assume
requires the pci lock to be held (although it has no assert to that
effect).

I can send a pre-patch to remove the pci lock assert from
unmap_domain_pirq but I'm not that familiar with this code (TBH I
thought that anything dealing with PCI devices needed to hold the pci
lock).

> > --- a/xen/drivers/vpci/capabilities.c
> > +++ b/xen/drivers/vpci/capabilities.c
> > @@ -109,7 +109,7 @@ static int vpci_index_capabilities(struct pci_dev *pdev)
> >      return 0;
> >  }
> >  
> > -static void vpci_mask_capability(struct pci_dev *pdev, uint8_t cap_id)
> > +void xen_vpci_mask_capability(struct pci_dev *pdev, uint8_t cap_id)
> 
> What's the xen_ prefix good for?

Nah, nothing, in any case this code is now gone.

> > +static int vpci_msi_control_read(struct pci_dev *pdev, unsigned int reg,
> > +                                 union vpci_val *val, void *data)
> > +{
> > +    struct vpci_msi *msi = data;
> 
> const
> 
> > +    if ( msi->enabled )
> > +        val->word |= PCI_MSI_FLAGS_ENABLE;
> > +    if ( msi->masking )
> > +        val->word |= PCI_MSI_FLAGS_MASKBIT;
> > +    if ( msi->address64 )
> > +        val->word |= PCI_MSI_FLAGS_64BIT;
> > +
> > +    /* Set multiple message capable. */
> > +    val->word |= ((fls(msi->max_vectors) - 1) << 1) & PCI_MSI_FLAGS_QMASK;
> > +
> > +    /* Set current number of configured vectors. */
> > +    val->word |= ((fls(msi->guest_vectors) - 1) << 4) & PCI_MSI_FLAGS_QSIZE;
> 
> The 1 and 4 here clearly need #define-s or the use of MASK_INSR().

I think using MASK_INSR is better and more in line with the previous
changes that you requested.
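
For illustration, the control word could then be composed and decoded along
these lines. The macro definitions are local stand-ins, the PCI_MSI_FLAGS_*
values are the usual MSI message control bit assignments as I remember them,
and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for MASK_EXTR()/MASK_INSR(): (m & -m) isolates the mask's lowest set bit. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

#define PCI_MSI_FLAGS_ENABLE 0x0001u /* MSI enable */
#define PCI_MSI_FLAGS_QMASK  0x000eu /* multiple message capable, log2 encoded */
#define PCI_MSI_FLAGS_QSIZE  0x0070u /* multiple message enable, log2 encoded */

/* log2 of a power of two, i.e. what fls(x) - 1 yields for such values. */
static unsigned int log2_pow2(unsigned int x)
{
    unsigned int order = 0;

    assert(x && !(x & (x - 1)));

    while ( x >>= 1 )
        order++;

    return order;
}

/* Build the guest-visible control word from the emulated state. */
static uint16_t msi_control_encode(unsigned int max_vectors,
                                   unsigned int vectors, bool enabled)
{
    unsigned int word = enabled ? PCI_MSI_FLAGS_ENABLE : 0;

    word |= MASK_INSR(log2_pow2(max_vectors), PCI_MSI_FLAGS_QMASK);
    word |= MASK_INSR(log2_pow2(vectors), PCI_MSI_FLAGS_QSIZE);

    return (uint16_t)word;
}

/* Number of vectors requested by a guest write to the control word. */
static unsigned int msi_control_vectors(uint16_t control)
{
    return 1u << MASK_EXTR(control, PCI_MSI_FLAGS_QSIZE);
}
```

The shift amounts (1 and 4 in the original patch) are derived from the masks
themselves, so they cannot get out of sync with the field definitions.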

> > +static int vpci_msi_control_write(struct pci_dev *pdev, unsigned int reg,
> > +                                  union vpci_val val, void *data)
> > +{
> > +    struct vpci_msi *msi = data;
> > +    unsigned int vectors = 1 << ((val.word & PCI_MSI_FLAGS_QSIZE) >> 4);
> > +    int rc;
> > +
> > +    if ( vectors > msi->max_vectors )
> > +        return -EINVAL;
> > +
> > +    msi->guest_vectors = vectors;
> > +
> > +    if ( !!(val.word & PCI_MSI_FLAGS_ENABLE) == msi->enabled )
> > +        return 0;
> > +
> > +    if ( val.word & PCI_MSI_FLAGS_ENABLE )
> > +    {
> > +        ASSERT(!msi->enabled && !msi->vectors);
> 
> I can see the value of the right side, but the left (with the immediately
> prior if())?

Right, this is redundant given the checks above.

> > +        rc = vpci_msi_enable(&msi->arch, pdev, msi->address, msi->data,
> > +                             vectors);
> > +        if ( rc )
> > +            return rc;
> > +
> > +        /* Apply the mask bits. */
> > +        if ( msi->masking )
> > +        {
> > +            uint32_t mask = msi->mask;
> > +
> > +            while ( mask )
> > +            {
> > +                unsigned int i = ffs(mask);
> 
> ffs(), just like fls(), returns 1-based values, so this looks to be off by
> one.

Thanks, good catch.

> > +                vpci_msi_mask(&msi->arch, i, true);
> > +                __clear_bit(i, &mask);
> > +            }
> > +        }
> > +
> > +        __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> > +                         PCI_FUNC(pdev->devfn), reg - PCI_MSI_FLAGS, 1);
> 
> Seems like you'll never come through msi_capability_init(); I can't
> see how it can be a good idea to bypass lots of stuff.

AFAICT this is done as part of the vpci_msi_enable call just above:

vpci_msi_enable -> allocate_and_map_msi_pirq -> map_domain_pirq ->
pci_enable_msi -> __pci_enable_msi -> msi_capability_init.

> > +static int vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned int reg,
> > +                                        union vpci_val val, void *data)
> > +{
> > +    struct vpci_msi *msi = data;
> > +
> > +    /* Clear high part. */
> > +    msi->address &= ~GENMASK(63, 32);
> > +    msi->address |= (uint64_t)val.double_word << 32;
> 
> Is the value written here actually being used for anything other than
> for reading back (also applicable to the high bits of the low half of the
> address)?

It's used in an arch-specific way. But Xen needs to store it anyway, so
the guest can read back whatever it writes. I have no idea what ARM
might store here.

> > +static int vpci_msi_mask_read(struct pci_dev *pdev, unsigned int reg,
> > +                              union vpci_val *val, void *data)
> > +{
> > +    struct vpci_msi *msi = data;
> > +
> > +    val->double_word = msi->mask;
> > +
> > +    return 0;
> > +}
> > +
> > +static int vpci_msi_mask_write(struct pci_dev *pdev, unsigned int reg,
> > +                               union vpci_val val, void *data)
> > +{
> > +    struct vpci_msi *msi = data;
> > +    uint32_t dmask;
> > +
> > +    dmask = msi->mask ^ val.double_word;
> > +
> > +    if ( !dmask )
> > +        return 0;
> > +
> > +    while ( dmask && msi->enabled )
> > +    {
> > +        unsigned int i = ffs(dmask);
> > +
> > +        vpci_msi_mask(&msi->arch, i, !test_bit(i, &msi->mask));
> > +        __clear_bit(i, &dmask);
> > +    }
> 
> I think this loop should be limited to the number of enabled vectors
> (and the same likely applies then to vpci_msi_control_write()).

Done, I've changed it to:

for ( i = ffs(mask) - 1; mask && i < msi->vectors; i = ffs(mask) - 1 )
{
    vpci_msi_mask(&msi->arch, i, MASK_EXTR(val.u32, 1 << i));
    __clear_bit(i, &mask);
}
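
The 1-based result of ffs() and the limit on the number of vectors can be
modelled in isolation as below. This is a self-contained sketch: ffs32() is
a portable stand-in with ffs() semantics, and instead of calling
vpci_msi_mask() the loop records which vectors would be updated. All names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in with ffs() semantics: 1-based index of the lowest set bit, 0 if none. */
static unsigned int ffs32(uint32_t x)
{
    unsigned int pos = 1;

    if ( !x )
        return 0;

    while ( !(x & 1) )
    {
        x >>= 1;
        pos++;
    }

    return pos;
}

struct mask_update {
    unsigned int vector; /* 0-based vector index */
    bool masked;         /* new state requested by the guest */
};

/*
 * Record one update per mask bit that changed, ignoring bits at or above
 * nr_vectors. 'out' must have room for nr_vectors entries.
 */
static unsigned int msi_mask_updates(uint32_t old_mask, uint32_t new_mask,
                                     unsigned int nr_vectors,
                                     struct mask_update *out)
{
    uint32_t diff = old_mask ^ new_mask;
    unsigned int n = 0;

    assert(nr_vectors <= 32);

    while ( diff )
    {
        unsigned int i = ffs32(diff) - 1; /* convert to a 0-based index */

        if ( i >= nr_vectors )
            break;

        out[n].vector = i;
        out[n].masked = (new_mask >> i) & 1;
        n++;
        diff &= diff - 1; /* clear the lowest set bit */
    }

    return n;
}
```

Bits are visited from lowest to highest, so stopping at the first index
beyond nr_vectors is enough to skip all remaining ones.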

> > +static int vpci_init_msi(struct pci_dev *pdev)
> > +{
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    struct vpci_msi *msi = NULL;
> 
> Pointless initializer.

Removed.

> > +    unsigned int msi_offset;
> > +    uint16_t control;
> > +    int rc;
> > +
> > +    msi_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
> > +    if ( !msi_offset )
> > +        return 0;
> > +
> > +    if ( !vpci_msi_enabled(pdev->domain) )
> > +    {
> > +        xen_vpci_mask_capability(pdev, PCI_CAP_ID_MSI);
> > +        return 0;
> > +    }
> > +
> > +    msi = xzalloc(struct vpci_msi);
> > +    if ( !msi )
> > +        return -ENOMEM;
> > +
> > +    control = pci_conf_read16(seg, bus, slot, func,
> > +                              msi_control_reg(msi_offset));
> > +
> > +    rc = xen_vpci_add_register(pdev, vpci_msi_control_read,
> > +                               vpci_msi_control_write,
> > +                               msi_control_reg(msi_offset), 2, msi);
> > +    if ( rc )
> > +    {
> > +        dprintk(XENLOG_ERR,
> > +                "%04x:%02x:%02x.%u: failed to add handler for MSI control: %d\n",
> > +                seg, bus, slot, func, rc);
> > +        goto error;
> > +    }
> > +
> > +    /* Get the maximum number of vectors the device supports. */
> > +    msi->max_vectors = multi_msi_capable(control);
> > +    ASSERT(msi->max_vectors <= 32);
> > +
> > +    /* Initial value after reset. */
> > +    msi->guest_vectors = 1;
> > +
> > +    /* No PIRQ bind yet. */
> > +    vpci_msi_arch_init(&msi->arch);
> > +
> > +    if ( is_64bit_address(control) )
> > +        msi->address64 = true;
> > +    if ( is_mask_bit_support(control) )
> > +        msi->masking = true;
> > +
> > +    rc = xen_vpci_add_register(pdev, vpci_msi_address_read,
> > +                               vpci_msi_address_write,
> > +                               msi_lower_address_reg(msi_offset), 4, msi);
> > +    if ( rc )
> > +    {
> > +        dprintk(XENLOG_ERR,
> > +                "%04x:%02x:%02x.%u: failed to add handler for MSI address: %d\n",
> > +                seg, bus, slot, func, rc);
> > +        goto error;
> > +    }
> > +
> > +    rc = xen_vpci_add_register(pdev, vpci_msi_data_read, vpci_msi_data_write,
> > +                               msi_data_reg(msi_offset, msi->address64), 2,
> > +                               msi);
> > +    if ( rc )
> > +    {
> > +        dprintk(XENLOG_ERR,
> > +                "%04x:%02x:%02x.%u: failed to add handler for MSI address: %d\n",
> > +                seg, bus, slot, func, rc);
> 
> Twice the same message text is unhelpful (and actually there's a third
> one below). But iirc I did indicate anyway that I think most of them
> should go away. Note also how much they contribute to the function's
> size.

OK, I will remove those then.

> > +static int __init vpci_msi_setup_keyhandler(void)
> > +{
> > +    register_keyhandler('Z', vpci_dump_msi, "dump guest MSI state", 1);
> 
> Please let's avoid using new (and even non-intuitive) keys if at all
> possible. This is Dom0 only, so can easily be chained onto e.g. the
> 'M' handler.

I assumed none of the debug keys were actually intuitive. I've wired
it to the 'M' handler, we can always add its own key if the need
arises.

> > --- a/xen/include/xen/vpci.h
> > +++ b/xen/include/xen/vpci.h
> > @@ -89,9 +89,35 @@ struct vpci {
> >  
> >      /* List of capabilities supported by the device. */
> >      struct list_head cap_list;
> > +
> > +    /* MSI data. */
> > +    struct vpci_msi {
> > +        /* Maximum number of vectors supported by the device. */
> > +        unsigned int max_vectors;
> > +        /* Current guest-written number of vectors. */
> > +        unsigned int guest_vectors;
> > +        /* Number of vectors configured. */
> > +        unsigned int vectors;
> 
> So coming here I still don't really see what the difference between
> these last two fields is (and hence why you need two).

Right, there's no need for having both of them, so I just removed
guest_vectors.

Thanks for the detailed review, Roger.

* Re: [PATCH v3 8/9] vpci/msi: add MSI handlers
  2017-06-27 10:22     ` Roger Pau Monne
@ 2017-06-27 11:44       ` Jan Beulich
  2017-06-27 12:44         ` Roger Pau Monné
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-06-27 11:44 UTC (permalink / raw)
  To: roger.pau
  Cc: andrew.cooper3, julien.grall, paul.durrant, xen-devel, boris.ostrovsky

>>> Roger Pau Monne <roger.pau@citrix.com> 06/27/17 12:23 PM >>>
>On Fri, May 26, 2017 at 09:26:03AM -0600, Jan Beulich wrote:
>> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> > +    pinfo = pirq_info(current->domain, arch->pirq + entry);
>> > +    ASSERT(pinfo);
>> > +
>> > +    irq = pinfo->arch.irq;
>> > +    ASSERT(irq < nr_irqs);
>> > +
>> > +    desc = irq_to_desc(irq);
>> > +    ASSERT(desc);
>> 
>> It's not entirely clear to me where all the checks are which allow
>> the checks here to be ASSERT()s.
>
>Hm, this function is only called if the pirq is set (ie: if the
>interrupt is bound to the domain). AFAICT if Xen cannot get the irq or
>the desc related to this pirq it means that something/someone has
>unbound or changed the pirq under Xen's feet, and thus the expected
>state is no longer valid.
>
>I could add something like:
>
>if ( irq >= nr_irqs || irq < 0 )
>{
>    ASSERT_UNREACHABLE();
>    return;
>}
>
>And the same for the desc check if that seems more sensible.

Well, if the function indeed is being called only when everything has already
been set up (and can't be torn down in a racy way), then I'm not overly
concerned which of the two forms you use. 

>> > +        pcidevs_lock();
>> > +        rc = pt_irq_create_bind(pdev->domain, &bind);
>> > +        if ( rc )
>> 
>> I don't think you need to hold the lock for the if() and its body. While
>> I see unmap_domain_pirq(), I don't really see why it does, so perhaps
>> there's some cleanup potential up front.
>
>unmap_domain_pirq might call into pci_disable_msi which I assume
>requires the pci lock to be held (although it has no assert to that
>effect).

Yeah, maybe it's indeed better to keep it. List access strictly requires
holding the lock, while I think we don't consistently hold the lock for
mere use of individual devices. If there was any real issue with that, I
think we'd rather need to refcount the devices.

>> > +static int vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned int reg,
>> > +                                        union vpci_val val, void *data)
>> > +{
>> > +    struct vpci_msi *msi = data;
>> > +
>> > +    /* Clear high part. */
>> > +    msi->address &= ~GENMASK(63, 32);
>> > +    msi->address |= (uint64_t)val.double_word << 32;
>> 
>> Is the value written here actually being used for anything other than
>> for reading back (also applicable to the high bits of the low half of the
>> address)?
>
>It's used in an arch-specific way. But Xen needs to store it anyway, so
>the guest can read back whatever it writes. I have no idea what ARM
>might store here.

Hmm, I'm concerned you may introduce incorrect behavior here if the value
written is other than expected. But perhaps not much of a problem as long
as all this is Dom0 only.

Jan


* Re: [PATCH v3 8/9] vpci/msi: add MSI handlers
  2017-06-27 11:44       ` Jan Beulich
@ 2017-06-27 12:44         ` Roger Pau Monné
  0 siblings, 0 replies; 49+ messages in thread
From: Roger Pau Monné @ 2017-06-27 12:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, julien.grall, paul.durrant, xen-devel, boris.ostrovsky

On Tue, Jun 27, 2017 at 05:44:23AM -0600, Jan Beulich wrote:
> >>> Roger Pau Monne <roger.pau@citrix.com> 06/27/17 12:23 PM >>>
> >On Fri, May 26, 2017 at 09:26:03AM -0600, Jan Beulich wrote:
> >> > +static int vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned int reg,
> >> > +                                        union vpci_val val, void *data)
> >> > +{
> >> > +    struct vpci_msi *msi = data;
> >> > +
> >> > +    /* Clear high part. */
> >> > +    msi->address &= ~GENMASK(63, 32);
> >> > +    msi->address |= (uint64_t)val.double_word << 32;
> >> 
> >> Is the value written here actually being used for anything other than
> >> for reading back (also applicable to the high bits of the low half of the
> >> address)?
> >
> >It's used in an arch-specific way. But Xen needs to store it anyway, so
> >the guest can read back whatever it writes. I have no idea what ARM
> >might store here.
> 
> Hmm, I'm concerned you may introduce incorrect behavior here if the value
> written is other than expected. But perhaps not much of a problem as long
> as all this is Dom0 only.

IMHO if Xen needs to check this value for sanity it should be done in
the arch specific handler (vpci_msi_arch_enable), when the domain
tries to enable MSI. In any case, we can cross this bridge when we get
to enabling all this for DomUs.

Thanks, Roger.

* Re: [PATCH v3 9/9] vpci/msix: add MSI-X handlers
  2017-05-29 13:29   ` Jan Beulich
@ 2017-06-28 15:29     ` Roger Pau Monne
  2017-06-29  6:19       ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Roger Pau Monne @ 2017-06-28 15:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

On Mon, May 29, 2017 at 07:29:29AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > +static int vpci_msix_control_write(struct pci_dev *pdev, unsigned int reg,
> > +                                   union vpci_val val, void *data)
> > +{
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    paddr_t table_base = pdev->vpci->header.bars[pdev->vpci->msix->bir].paddr;
> > +    struct vpci_msix *msix = data;
> 
> Wouldn't you better use this also to obtain the array index one line
> earlier?

Yes.

> > +    bool new_masked, new_enabled;
> > +    unsigned int i;
> > +    uint32_t data32;
> > +    int rc;
> > +
> > +    new_masked = val.word & PCI_MSIX_FLAGS_MASKALL;
> > +    new_enabled = val.word & PCI_MSIX_FLAGS_ENABLE;
> > +
> > +    if ( new_enabled != msix->enabled && new_enabled )
> 
>     if ( !msix->enabled && new_enabled )
> 
> would likely be easier to read (similar for the "else if" below).

Ack.

> > +    {
> > +        /* MSI-X enabled. */
> > +        for ( i = 0; i < msix->max_entries; i++ )
> > +        {
> > +            if ( msix->entries[i].masked )
> > +                continue;
> > +
> > +            rc = vpci_msix_enable(&msix->entries[i].arch, pdev,
> > +                                  msix->entries[i].addr, msix->entries[i].data,
> > +                                  msix->entries[i].nr, table_base);
> > +            if ( rc )
> > +            {
> > +                gdprintk(XENLOG_ERR,
> > +                         "%04x:%02x:%02x.%u: unable to update entry %u: %d\n",
> > +                         seg, bus, slot, func, i, rc);
> > +                return rc;
> > +            }
> > +
> > +            vpci_msix_mask(&msix->entries[i].arch, false);
> > +        }
> > +    }
> > +    else if ( new_enabled != msix->enabled && !new_enabled )
> > +    {
> > +        /* MSI-X disabled. */
> > +        for ( i = 0; i < msix->max_entries; i++ )
> > +        {
> > +            rc = vpci_msix_disable(&msix->entries[i].arch);
> > +            if ( rc )
> > +            {
> > +                gdprintk(XENLOG_ERR,
> > +                         "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
> > +                         seg, bus, slot, func, i, rc);
> > +                return rc;
> > +            }
> > +        }
> > +    }
> > +
> > +    data32 = val.word;
> > +    if ( (new_enabled != msix->enabled || new_masked != msix->masked) &&
> > +         pci_msi_conf_write_intercept(pdev, reg, 2, &data32) >= 0 )
> > +        pci_conf_write16(seg, bus, slot, func, reg, data32);
> 
> What's the intermediate variable "data32" good for here? Afaict you
> could use val.word in its stead.

Yes, that seems better.

> > +static struct vpci_msix *vpci_msix_find(struct domain *d, unsigned long addr)
> > +{
> > +    struct vpci_msix *msix;
> > +
> > +    ASSERT(vpci_locked(d));
> > +    list_for_each_entry ( msix,  &d->arch.hvm_domain.msix_tables, next )
> > +        if ( msix->pdev->vpci->header.command & PCI_COMMAND_MEMORY &&
> 
> Please parenthesize & within &&.

Done.

> > +             addr >= msix->addr &&
> > +             addr < msix->addr + msix->max_entries * PCI_MSIX_ENTRY_SIZE )
> > +            return msix;
> > +
> > +    return NULL;
> > +}
> 
> Looking ahead I'm getting the impression that you only allow
> accesses to the MSI-X table entries, yet in vpci_modify_bars() you
> (correctly) prevent mapping entire pages. While most other
> registers are disallowed from sharing a page with the table, the PBA
> is specifically named as an exception. Hence you need to support
> at least reads from the entire range.

That's right, I've also added support for the PBA: reads are forwarded
directly to the real PBA in hardware. To simplify this, in the new version
I'm always trapping accesses to the PBA (I don't think it's worth
checking whether it shares a page with the vector table or not).
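
A rough sketch of what trapping both regions could look like, assuming a
layout structure that records the table and PBA addresses. All names here
(demo_msix_layout, demo_classify, ...) are hypothetical and not taken from
the series; the only spec facts relied on are the 16-byte table entry and
the PBA holding one pending bit per vector in 64-bit words:

```c
#include <stdint.h>

#define DEMO_ENTRY_SIZE 16u /* bytes per MSI-X table entry */

enum demo_region { DEMO_NONE, DEMO_TABLE, DEMO_PBA };

/* Hypothetical layout description, for illustration only. */
struct demo_msix_layout {
    uint64_t table_addr;      /* guest address of the vector table */
    uint64_t pba_addr;        /* guest address of the PBA */
    unsigned int max_entries; /* number of vectors */
};

/* The PBA holds one pending bit per vector, packed into 64-bit words. */
static uint64_t demo_pba_size(unsigned int max_entries)
{
    return ((uint64_t)max_entries + 63) / 64 * 8;
}

/* Decide which trapped region, if any, an address falls into. */
static enum demo_region demo_classify(const struct demo_msix_layout *m,
                                      uint64_t addr)
{
    uint64_t table_size = (uint64_t)m->max_entries * DEMO_ENTRY_SIZE;

    if ( addr >= m->table_addr && addr < m->table_addr + table_size )
        return DEMO_TABLE;
    if ( addr >= m->pba_addr &&
         addr < m->pba_addr + demo_pba_size(m->max_entries) )
        return DEMO_PBA;
    return DEMO_NONE;
}
```

A read classified as DEMO_PBA would then be forwarded to the hardware PBA,
while DEMO_TABLE accesses go through the emulated entries.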

> > +static int vpci_msix_table_accept(struct vcpu *v, unsigned long addr)
> > +{
> > +    int found;
> > +
> > +    vpci_lock(v->domain);
> > +    found = !!vpci_msix_find(v->domain, addr);
> 
> At the risk of repeating a comment I gave on an earlier patch: Using
> "bool" for "found" allows you to avoid the !! .

Done.

> > +static int vpci_msix_access_check(struct pci_dev *pdev, unsigned long addr,
> > +                                  unsigned int len)
> > +{
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +
> > +
> 
> No double blank lines please.

Done.

> > +    /* Only allow 32/64b accesses. */
> > +    if ( len != 4 && len != 8 )
> > +    {
> > +        gdprintk(XENLOG_ERR,
> > +                 "%04x:%02x:%02x.%u: invalid MSI-X table access size: %u\n",
> > +                 seg, bus, slot, func, len);
> > +        return -EINVAL;
> > +    }
> > +
> > +    /* Do no allow accesses that span across multiple entries. */
> > +    if ( (addr & (PCI_MSIX_ENTRY_SIZE - 1)) + len > PCI_MSIX_ENTRY_SIZE )
> > +    {
> > +        gdprintk(XENLOG_ERR,
> > +                 "%04x:%02x:%02x.%u: MSI-X access crosses entry boundary\n",
> > +                 seg, bus, slot, func);
> > +        return -EINVAL;
> > +    }
> > +
> > +    /*
> > +     * Only allow 64b accesses to the low message address field.
> > +     *
> > +     * NB: this is more restrictive than the specification, that allows 64b
> > +     * accesses to other fields under certain circumstances, so this check and
> > +     * the code will have to be fixed in order to fully comply with the
> > +     * specification.
> > +     */
> > +    if ( (addr & (PCI_MSIX_ENTRY_SIZE - 1)) != 0 && len != 4 )
> > +    {
> > +        gdprintk(XENLOG_ERR,
> > +                 "%04x:%02x:%02x.%u: 64bit MSI-X table access to 32bit field"
> > +                 " (offset: %#lx len: %u)\n", seg, bus, slot, func,
> > +                 addr & (PCI_MSIX_ENTRY_SIZE - 1), len);
> > +        return -EINVAL;
> > +    }
> > +
> > +    return 0;
> > +}
> 
> So you allow mis-aligned accesses, but you disallow 8-byte ones
> to the upper half of an entry? I think both aspects need to be got
> right from the beginning, the more that you BUG() in the switch()es
> further down in such cases.

I've now fixed this to comply with the spec, which only allows aligned
32/64b accesses.

I'm allowing 32/64b read accesses to any fields, and write accesses to
any fields except for 64b write accesses to the message data and
vector control fields, unless the vector is masked (or MSI-X is not
globally enabled). That matches the restrictions detailed in the PCI
3.0 spec AFAICT.
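
The rules above can be sketched as a single predicate. This is only an
illustration of the description in this mail, not the code from the new
version; the helper name and its parameters are hypothetical, and the
field offsets assume the standard 16-byte entry layout (address low/high
at 0 and 4, data at 8, vector control at 12):

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRY_SIZE      16u /* bytes per MSI-X table entry */
#define ENTRY_DATA_OFF   8u /* message data field */

/* Hypothetical helper, not taken from the patch. */
static bool msix_access_allowed(uint64_t addr, unsigned int len,
                                bool is_write, bool msix_enabled,
                                bool entry_masked)
{
    unsigned int offset = addr & (ENTRY_SIZE - 1);

    /* Only naturally aligned 32-bit and 64-bit accesses. */
    if ( (len != 4 && len != 8) || (addr & (len - 1)) )
        return false;

    /* Reads may target any field. */
    if ( !is_write )
        return true;

    /*
     * A 64-bit write at the data offset also covers vector control; only
     * allow it while the vector is masked or MSI-X is globally disabled.
     */
    if ( len == 8 && offset == ENTRY_DATA_OFF )
        return entry_masked || !msix_enabled;

    return true;
}
```

Because every accepted access is aligned and at most 8 bytes wide, it can
never cross an entry boundary, so no separate boundary check is needed in
this sketch.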

> > +static struct vpci_msix_entry *vpci_msix_get_entry(struct vpci_msix *msix,
> > +                                                   unsigned long addr)
> > +{
> > +    return &msix->entries[(addr - msix->addr) / PCI_MSIX_ENTRY_SIZE];
> > +}
> > +
> > +static int vpci_msix_table_read(struct vcpu *v, unsigned long addr,
> > +                                unsigned int len, unsigned long *data)
> > +{
> > +    struct vpci_msix *msix;
> > +    struct vpci_msix_entry *entry;
> > +    unsigned int offset;
> > +
> > +    vpci_lock(v->domain);
> > +    msix = vpci_msix_find(v->domain, addr);
> > +    if ( !msix )
> > +    {
> > +        vpci_unlock(v->domain);
> > +        return X86EMUL_UNHANDLEABLE;
> > +    }
> > +
> > +    if ( vpci_msix_access_check(msix->pdev, addr, len) )
> > +    {
> > +        vpci_unlock(v->domain);
> > +        return X86EMUL_UNHANDLEABLE;
> > +    }
> > +
> > +    /* Get the table entry and offset. */
> > +    entry = vpci_msix_get_entry(msix, addr);
> > +    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
> > +
> > +    switch ( offset )
> > +    {
> > +    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
> > +        *data = entry->addr;
> 
> You're not clipping off the upper 32 bits here - is that reliably
> happening elsewhere?

This could be a 64b access, so I thought there was no need to
differentiate between 32/64b in this case, since the underlying
handlers will already clip it when needed. I could switch it to:

if ( len == 8 )
    *data = entry->addr;
else
    *data = (uint32_t)entry->addr;

I don't think it's happening elsewhere, but I will try to check. Is
that really an issue?
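
A size-generic variant of the explicit truncation could look like the
sketch below. The helper name is made up for illustration and is not part
of the patch; it only assumes that the access length is given in bytes:

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low `len` bytes of `value`. */
static uint64_t clip_to_access_size(uint64_t value, unsigned int len)
{
    /* A shift by 64 is undefined in C, so handle full width separately. */
    if ( len >= sizeof(value) )
        return value;

    return value & ((UINT64_C(1) << (len * 8)) - 1);
}
```

With such a helper the read path would not need a per-field if/else; the
returned data could be clipped once before leaving the handler.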

> > +        break;
> > +    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
> > +        *data = entry->addr >> 32;
> > +        break;
> > +    case PCI_MSIX_ENTRY_DATA_OFFSET:
> > +        *data = entry->data;
> > +        break;
> > +    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
> > +        *data = entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0;
> 
> What about the other 31 bits?

They are all marked as "reserved" in my copy of the PCI spec.

> > +static int vpci_msix_table_write(struct vcpu *v, unsigned long addr,
> > +                                 unsigned int len, unsigned long data)
> > +{
> > +    struct vpci_msix *msix;
> > +    struct vpci_msix_entry *entry;
> > +    unsigned int offset;
> > +
> > +    vpci_lock(v->domain);
> > +    msix = vpci_msix_find(v->domain, addr);
> > +    if ( !msix )
> > +    {
> > +        vpci_unlock(v->domain);
> > +        return X86EMUL_UNHANDLEABLE;
> > +    }
> > +
> > +    if ( vpci_msix_access_check(msix->pdev, addr, len) )
> > +    {
> > +        vpci_unlock(v->domain);
> > +        return X86EMUL_UNHANDLEABLE;
> > +    }
> > +
> > +    /* Get the table entry and offset. */
> > +    entry = vpci_msix_get_entry(msix, addr);
> > +    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
> > +
> > +    switch ( offset )
> > +    {
> > +    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
> > +        if ( len == 8 )
> > +        {
> > +            entry->addr = data;
> > +            break;
> > +        }
> > +        entry->addr &= ~GENMASK(31, 0);
> > +        entry->addr |= data;
> > +        break;
> > +    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
> > +        entry->addr &= ~GENMASK(63, 32);
> > +        entry->addr |= data << 32;
> > +        break;
> > +    case PCI_MSIX_ENTRY_DATA_OFFSET:
> > +        entry->data = data;
> > +        break;
> > +    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
> > +    {
> > +        bool new_masked = data & PCI_MSIX_VECTOR_BITMASK;
> > +        struct pci_dev *pdev = msix->pdev;
> > +        paddr_t table_base =
> > +            pdev->vpci->header.bars[pdev->vpci->msix->bir].paddr;
> 
> Again simply "msix->bir"?

Yes.

> > +        int rc;
> > +
> > +        if ( !msix->enabled )
> > +        {
> > +            entry->masked = new_masked;
> > +            break;
> > +        }
> > +
> > +        if ( new_masked != entry->masked && !new_masked )
> > +        {
> > +            /* Unmasking an entry, update it. */
> > +            rc = vpci_msix_enable(&entry->arch, msix->pdev, entry->addr,
> 
> And simply "pdev" here?

Done.

> > +static int vpci_init_msix(struct pci_dev *pdev)
> > +{
> > +    struct domain *d = pdev->domain;
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    struct vpci_msix *msix;
> > +    unsigned int msix_offset, i, max_entries;
> > +    paddr_t msix_paddr;
> > +    uint16_t control;
> > +    int rc;
> > +
> > +    msix_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
> > +    if ( !msix_offset )
> > +        return 0;
> > +
> > +    if ( !vpci_msix_enabled(pdev->domain) )
> 
> This is a non-__init function, so it can't use dom0_msix (I'm saying
> this just in case there really is a need to retain those command line
> options).

No, I've just removed them, and obviously this call is now gone.

> > +    {
> > +        xen_vpci_mask_capability(pdev, PCI_CAP_ID_MSIX);
> > +        return 0;
> > +    }
> > +
> > +    control = pci_conf_read16(seg, bus, slot, func,
> > +                              msix_control_reg(msix_offset));
> > +
> > +    /* Get the maximum number of vectors the device supports. */
> > +    max_entries = msix_table_size(control);
> > +    if ( !max_entries )
> > +        return 0;
> 
> This if() is never going to be true.

Right. 0 means 1 vector.
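
For reference, the Table Size field in the MSI-X Message Control register
is encoded as N-1, which is why the result can never be zero. A standalone
sketch of the decoding (the macro and function names below are local to
this example, not the ones used in Xen):

```c
#include <stdint.h>

/* Table Size field: bits 10:0 of the MSI-X Message Control register. */
#define DEMO_MSIX_QSIZE_MASK 0x07ffu

/* The field stores N-1, so a raw value of 0 means one vector. */
static unsigned int demo_msix_table_size(uint16_t control)
{
    return (control & DEMO_MSIX_QSIZE_MASK) + 1u;
}
```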

> > +    msix = xzalloc_bytes(MSIX_SIZE(max_entries));
> > +    if ( !msix )
> > +        return -ENOMEM;
> > +
> > +    msix->max_entries = max_entries;
> > +    msix->pdev = pdev;
> > +
> > +    /* Find the MSI-X table address. */
> > +    msix->offset = pci_conf_read32(seg, bus, slot, func,
> > +                                   msix_table_offset_reg(msix_offset));
> > +    msix->bir = msix->offset & PCI_MSIX_BIRMASK;
> > +    msix->offset &= ~PCI_MSIX_BIRMASK;
> > +
> > +    ASSERT(pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM ||
> > +           pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM64_LO);
> > +    msix->addr = pdev->vpci->header.bars[msix->bir].mapped_addr + msix->offset;
> > +    msix_paddr = pdev->vpci->header.bars[msix->bir].paddr + msix->offset;
> 
> I can't seem to find where these addresses get updated in case
> the BARs are being relocated by the Dom0 kernel.

Is that something that Xen should support? I got the impression that
the current MSI-X code in Xen didn't support relocating the BARs that
contain the MSI-X table (but maybe I got it wrong).

> > +    for ( i = 0; i < msix->max_entries; i++)
> > +    {
> > +        msix->entries[i].masked = true;
> > +        msix->entries[i].nr = i;
> > +        vpci_msix_arch_init(&msix->entries[i].arch);
> > +    }
> > +
> > +    if ( list_empty(&d->arch.hvm_domain.msix_tables) )
> > +        register_mmio_handler(d, &vpci_msix_table_ops);
> > +
> > +    list_add(&msix->next, &d->arch.hvm_domain.msix_tables);
> > +
> > +    rc = xen_vpci_add_register(pdev, vpci_msix_control_read,
> > +                               vpci_msix_control_write,
> > +                               msix_control_reg(msix_offset), 2, msix);
> > +    if ( rc )
> > +    {
> > +        dprintk(XENLOG_ERR,
> > +                "%04x:%02x:%02x.%u: failed to add handler for MSI-X control: %d\n",
> > +                seg, bus, slot, func, rc);
> > +        goto error;
> > +    }
> > +
> > +    if ( pdev->vpci->header.command & PCI_COMMAND_MEMORY )
> > +    {
> > +        /* Unmap this memory from the guest. */
> > +        rc = modify_mmio(pdev->domain, _gfn(PFN_DOWN(msix->addr)),
> > +                         _mfn(PFN_DOWN(msix_paddr)),
> > +                         PFN_UP(msix->max_entries * PCI_MSIX_ENTRY_SIZE),
> > +                         false);
> > +        if ( rc )
> > +        {
> > +            dprintk(XENLOG_ERR,
> > +                    "%04x:%02x:%02x.%u: unable to unmap MSI-X BAR region: %d\n",
> > +                    seg, bus, slot, func, rc);
> > +            goto error;
> > +        }
> > +    }
> 
> Why is this unmapping conditional upon PCI_COMMAND_MEMORY?

Well, if memory decoding is not enabled the BAR is not mapped in the
first place. I've now reworked this a little bit in the newer version,
so it's handled in the header code instead.

> > +static void vpci_dump_msix(unsigned char key)
> > +{
> > +    struct domain *d;
> > +    struct pci_dev *pdev;
> > +
> > +    printk("Guest MSI-X information:\n");
> > +
> > +    for_each_domain ( d )
> > +    {
> > +        if ( !has_vpci(d) )
> > +            continue;
> > +
> > +        vpci_lock(d);
> 
> Dump handlers, even if there are existing examples to the contrary,
> should only try-lock any locks they mean to hold (and not dump
> anything if they can't get hold of the lock).

OK, will have to change the MSI dump also. Since you commented on the
MSI dump handler, I guess this output should also be appended to the
'M' key?
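
The try-lock pattern requested for dump handlers can be sketched in plain
C11 as follows. The lock type and function names are stand-ins invented
for this example; in Xen the handler would use the try-lock variant of
whatever primitive backs vpci_lock():

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a per-domain lock; not Xen's vpci_lock(). */
struct demo_lock {
    atomic_flag held;
};

static bool demo_trylock(struct demo_lock *l)
{
    /* test_and_set returns the previous value: false means we own it now. */
    return !atomic_flag_test_and_set(&l->held);
}

static void demo_unlock(struct demo_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Returns true if the state was dumped, false if the lock was busy. */
static bool demo_dump(struct demo_lock *l, unsigned int domid)
{
    if ( !demo_trylock(l) )
    {
        /* Never block in a debug key handler: skip this domain. */
        printf("d%u: lock busy, skipping dump\n", domid);
        return false;
    }

    printf("d%u: MSI-X state would be printed here\n", domid);
    demo_unlock(l);
    return true;
}
```

The point of the pattern is that a dump triggered while the lock is held
elsewhere (possibly on a wedged CPU) cannot deadlock the console handler.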

> > --- a/xen/include/xen/vpci.h
> > +++ b/xen/include/xen/vpci.h
> > @@ -112,6 +112,33 @@ struct vpci {
> >          /* Arch-specific data. */
> >          struct vpci_arch_msi arch;
> >      } *msi;
> > +
> > +    /* MSI-X data. */
> > +    struct vpci_msix {
> > +        struct pci_dev *pdev;
> > +        /* Maximum number of vectors supported by the device. */
> > +        unsigned int max_entries;
> > +        /* MSI-X table offset. */
> > +        unsigned int offset;
> > +        /* MSI-X table BIR. */
> > +        unsigned int bir;
> > +        /* Table addr. */
> > +        paddr_t addr;
> > +        /* MSI-X enabled? */
> > +        bool enabled;
> > +        /* Masked? */
> > +        bool masked;
> > +        /* List link. */
> > +        struct list_head next;
> > +        /* Entries. */
> > +        struct vpci_msix_entry {
> > +                unsigned int nr;
> > +                uint64_t addr;
> > +                uint32_t data;
> > +                bool masked;
> > +                struct vpci_arch_msix_entry arch;
> 
> Indentation.

Fixed.

As usual, thank you very much for the detailed review.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

* Re: [PATCH v3 9/9] vpci/msix: add MSI-X handlers
  2017-06-28 15:29     ` Roger Pau Monne
@ 2017-06-29  6:19       ` Jan Beulich
  2017-06-29  8:25         ` Roger Pau Monné
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2017-06-29  6:19 UTC (permalink / raw)
  To: roger.pau; +Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

>>> Roger Pau Monne <roger.pau@citrix.com> 06/28/17 5:37 PM >>>
>On Mon, May 29, 2017 at 07:29:29AM -0600, Jan Beulich wrote:
>> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> > +    switch ( offset )
>> > +    {
>> > +    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
>> > +        *data = entry->addr;
>> 
>> You're not clipping off the upper 32 bits here - is that reliably
>> happening elsewhere?
>
>This could be a 64b access, so I thought there was no need to
>differentiate between 32/64b in this case, since the underlying
>handlers will already clip it when needed. I could switch it to:
>
>if ( len == 8 )
>    *data = entry->addr;
>else
>    *data = (uint32_t)entry->addr;
>
>I don't think it's happening elsewhere, but I will try to check. Is
>that really an issue?

I would hope it isn't, but I'm not 100% certain, hence my request to at
least check. I agree that it would be nice to not have to do it here. All other
similar read routines I've looked at appear to leave the upper half as zero,
albeit that's always a result of the way the code is written instead of an
explicit truncation as would be required here.

>> > +        break;
>> > +    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
>> > +        *data = entry->addr >> 32;
>> > +        break;
>> > +    case PCI_MSIX_ENTRY_DATA_OFFSET:
>> > +        *data = entry->data;
>> > +        break;
>> > +    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
>> > +        *data = entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0;
>> 
>> What about the other 31 bits?
>
>They are all marked as "reserved" in my copy of the PCI spec.

Indeed, but it's at least worth considering to pass through the values (as it's
Dom0 we're talking about here), and having a comment giving a brief explanation
for the choice.

>> > +    /* Find the MSI-X table address. */
>> > +    msix->offset = pci_conf_read32(seg, bus, slot, func,
>> > +                                   msix_table_offset_reg(msix_offset));
>> > +    msix->bir = msix->offset & PCI_MSIX_BIRMASK;
>> > +    msix->offset &= ~PCI_MSIX_BIRMASK;
>> > +
>> > +    ASSERT(pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM ||
>> > +           pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM64_LO);
>> > +    msix->addr = pdev->vpci->header.bars[msix->bir].mapped_addr + msix->offset;
>> > +    msix_paddr = pdev->vpci->header.bars[msix->bir].paddr + msix->offset;
>> 
>> I can't seem to find where these addresses get updated in case
>> the BARs are being relocated by the Dom0 kernel.
>
>Is that something that Xen should support? I got the impression that
>the current MSI-X code in Xen didn't support relocating the BARs that
>contain the MSI-X table (but maybe I got it wrong).

Well, the current expectation is that Dom0 would do BAR relocation prior to any MSI-X
setup. But since you aim at maximum transparency from PVH Dom0's perspective, I'm
not certain latching the addresses here once and for all is sufficient.

>> > +static void vpci_dump_msix(unsigned char key)
>> > +{
>> > +    struct domain *d;
>> > +    struct pci_dev *pdev;
>> > +
>> > +    printk("Guest MSI-X information:\n");
>> > +
>> > +    for_each_domain ( d )
>> > +    {
>> > +        if ( !has_vpci(d) )
>> > +            continue;
>> > +
>> > +        vpci_lock(d);
>> 
>> Dump handlers, even if there are existing examples to the contrary,
>> should only try-lock any locks they mean to hold (and not dump
>> anything if they can't get hold of the lock).
>
>OK, will have to change the MSI dump also. Since you commented on the
>MSI dump handler, I guess this output should also be appended to the
>'M' key?

Yes, that indeed was the implication.

Jan


* Re: [PATCH v3 9/9] vpci/msix: add MSI-X handlers
  2017-06-29  6:19       ` Jan Beulich
@ 2017-06-29  8:25         ` Roger Pau Monné
  0 siblings, 0 replies; 49+ messages in thread
From: Roger Pau Monné @ 2017-06-29  8:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

On Thu, Jun 29, 2017 at 12:19:39AM -0600, Jan Beulich wrote:
> >>> Roger Pau Monne <roger.pau@citrix.com> 06/28/17 5:37 PM >>>
> >On Mon, May 29, 2017 at 07:29:29AM -0600, Jan Beulich wrote:
> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> >> > +    switch ( offset )
> >> > +    {
> >> > +    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
> >> > +        *data = entry->addr;
> >> 
> >> You're not clipping off the upper 32 bits here - is that reliably
> >> happening elsewhere?
> >
> >This could be a 64b access, so I thought there was no need to
> >differentiate between 32/64b in this case, since the underlying
> >handlers will already clip it when needed. I could switch it to:
> >
> >if ( len == 8 )
> >    *data = entry->addr;
> >else
> >    *data = (uint32_t)entry->addr;
> >
> >I don't think it's happening elsewhere, but I will try to check. Is
> >that really an issue?
> 
> I would hope it isn't, but I'm not 100% certain, hence my request to at
> least check. I agree that it would be nice to not have to do it here. All other
> similar read routines I've looked at appear to leave the upper half as zero,
> albeit that's always a result of the way the code is written instead of an
> explicit truncation as would be required here.

Ack, I will add a comment to clarify this.

> >> > +        break;
> >> > +    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
> >> > +        *data = entry->addr >> 32;
> >> > +        break;
> >> > +    case PCI_MSIX_ENTRY_DATA_OFFSET:
> >> > +        *data = entry->data;
> >> > +        break;
> >> > +    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
> >> > +        *data = entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0;
> >> 
> >> What about the other 31 bits?
> >
> >They are all marked as "reserved" in my copy of the PCI spec.
> 
> Indeed, but it's at least worth considering to pass through the values (as it's
> Dom0 we're talking about here), and having a comment giving a brief explanation
> for the choice.

I can certainly do that, but I don't think we should pass them through
in the write handler. I'm worried that Dom0 would then see an
incoherent value if it attempts to modify some of the bits marked as
reserved.

IMHO it seems better to simply hide them until Xen knows how to deal
with them.
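
A minimal sketch of the "hide the reserved bits" option, assuming only
bit 0 (the per-vector mask bit) of Vector Control is emulated. The struct
and function names are hypothetical and do not come from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_VECTOR_CTRL_MASKBIT 0x1u /* bit 0: per-vector mask */

/* Hypothetical emulated entry state, reduced to the mask bit. */
struct demo_entry {
    bool masked;
};

/* Reads expose only the mask bit; reserved bits always read as zero. */
static uint32_t demo_vector_ctrl_read(const struct demo_entry *e)
{
    return e->masked ? DEMO_VECTOR_CTRL_MASKBIT : 0;
}

/* Writes latch only the mask bit; reserved bits are silently dropped. */
static void demo_vector_ctrl_write(struct demo_entry *e, uint32_t val)
{
    e->masked = val & DEMO_VECTOR_CTRL_MASKBIT;
}
```

This keeps reads and writes coherent from the guest's point of view: a
value written is read back with the reserved bits cleared, regardless of
what the hardware register contains.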

> >> > +    /* Find the MSI-X table address. */
> >> > +    msix->offset = pci_conf_read32(seg, bus, slot, func,
> >> > +                                   msix_table_offset_reg(msix_offset));
> >> > +    msix->bir = msix->offset & PCI_MSIX_BIRMASK;
> >> > +    msix->offset &= ~PCI_MSIX_BIRMASK;
> >> > +
> >> > +    ASSERT(pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM ||
> >> > +           pdev->vpci->header.bars[msix->bir].type == VPCI_BAR_MEM64_LO);
> >> > +    msix->addr = pdev->vpci->header.bars[msix->bir].mapped_addr + msix->offset;
> >> > +    msix_paddr = pdev->vpci->header.bars[msix->bir].paddr + msix->offset;
> >> 
> >> I can't seem to find where these addresses get updated in case
> >> the BARs are being relocated by the Dom0 kernel.
> >
> >Is that something that Xen should support? I got the impression that
> >the current MSI-X code in Xen didn't support relocating the BARs that
> >contain the MSI-X table (but maybe I got it wrong).
> 
> Well, the current expectation is that Dom0 would do BAR relocation prior to any MSI-X
> setup. But since you aim at maximum transparency from PVH Dom0's perspective, I'm
> not certain latching the addresses here once and for all is sufficient.

OK, I can implement something a little bit more flexible.
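
One possible shape for this, sketched under the assumption that only the
BIR and the table offset are latched at init time and the table address
is derived from the current BAR address whenever it is needed. The types
and helpers are invented for this example and are not what the next
version necessarily does:

```c
#include <stdint.h>

#define DEMO_BIR_MASK 0x7u /* low 3 bits of the Table Offset/BIR register */

/* Hypothetical BAR bookkeeping: just the current base address. */
struct demo_bar {
    uint64_t addr;
};

struct demo_msix {
    unsigned int bir; /* which BAR holds the table */
    uint32_t offset;  /* table offset inside that BAR */
};

/* Split the Table Offset/BIR register into its two fields. */
static struct demo_msix demo_msix_decode(uint32_t table_reg)
{
    struct demo_msix m = {
        .bir = table_reg & DEMO_BIR_MASK,
        .offset = table_reg & ~DEMO_BIR_MASK,
    };

    return m;
}

/* Derive the table address from the BAR's current position. */
static uint64_t demo_msix_table_addr(const struct demo_bar *bars,
                                     const struct demo_msix *m)
{
    return bars[m->bir].addr + m->offset;
}
```

Since nothing address-dependent is cached, a BAR write by Dom0 is picked
up automatically the next time the table address is computed.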

Thanks, Roger.

Thread overview: 49+ messages
2017-04-27 14:35 [PATCH v3 0/9] vpci: PCI config space emulation Roger Pau Monne
2017-04-27 14:35 ` [PATCH v3 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space Roger Pau Monne
2017-05-19 11:35   ` Jan Beulich
2017-05-29 12:57     ` Roger Pau Monne
2017-05-29 14:16       ` Jan Beulich
2017-05-29 15:05         ` Roger Pau Monne
2017-05-29 15:26           ` Jan Beulich
2017-04-27 14:35 ` [PATCH v3 2/9] x86/ecam: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
2017-05-19 13:25   ` Jan Beulich
2017-06-20 11:56     ` Roger Pau Monne
2017-06-20 13:14       ` Jan Beulich
2017-06-20 15:04         ` Roger Pau Monne
2017-04-27 14:35 ` [PATCH v3 3/9] xen/mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
2017-05-19 13:35   ` Jan Beulich
2017-06-21 11:11     ` Roger Pau Monne
2017-06-21 11:57       ` Jan Beulich
2017-06-21 12:43         ` Roger Pau Monne
2017-06-21 12:51           ` Jan Beulich
2017-06-21 13:10             ` Roger Pau Monne
2017-06-21 13:29               ` Jan Beulich
2017-04-27 14:35 ` [PATCH v3 4/9] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
2017-05-19 13:56   ` Jan Beulich
2017-06-21 15:16     ` Roger Pau Monne
2017-04-27 14:35 ` [PATCH v3 5/9] xen/vpci: add handlers to map the BARs Roger Pau Monne
2017-05-19 15:21   ` Jan Beulich
2017-05-22 11:38     ` Julien Grall
2017-06-22 17:13     ` Roger Pau Monne
2017-06-23  8:58       ` Jan Beulich
2017-06-23 10:55         ` Roger Pau Monne
2017-04-27 14:35 ` [PATCH v3 6/9] xen/vpci: trap access to the list of PCI capabilities Roger Pau Monne
2017-05-23 12:49   ` Jan Beulich
2017-06-26 11:50     ` Roger Pau Monne
2017-06-27  6:44       ` Jan Beulich
2017-05-29 13:32   ` Jan Beulich
2017-04-27 14:35 ` [PATCH v3 7/9] vpci: add a priority field to the vPCI register initializer Roger Pau Monne
2017-05-23 12:52   ` Jan Beulich
2017-06-26 14:41     ` Roger Pau Monne
2017-04-27 14:35 ` [PATCH v3 8/9] vpci/msi: add MSI handlers Roger Pau Monne
2017-05-26 15:26   ` Jan Beulich
2017-06-27 10:22     ` Roger Pau Monne
2017-06-27 11:44       ` Jan Beulich
2017-06-27 12:44         ` Roger Pau Monné
2017-04-27 14:35 ` [PATCH v3 9/9] vpci/msix: add MSI-X handlers Roger Pau Monne
2017-05-29 13:29   ` Jan Beulich
2017-06-28 15:29     ` Roger Pau Monne
2017-06-29  6:19       ` Jan Beulich
2017-06-29  8:25         ` Roger Pau Monné
2017-05-29 13:38 ` [PATCH v3 0/9] vpci: PCI config space emulation Jan Beulich
2017-05-29 14:14   ` Roger Pau Monne
