All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI
@ 2020-11-09 12:50 Oleksandr Andrushchenko
  2020-11-09 12:50 ` [PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM Oleksandr Andrushchenko
                   ` (9 more replies)
  0 siblings, 10 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Hello, all!

This is an RFC and an attempt to understand the best way to progress with ARM
PCI passthrough configuration. This includes, but not limited to, configuration
of assignable PCI devices, (legacy IRQs, MSI/MSI-X are not yet supported), MMIO
etc.

This is based on the original RFC from ARM [1] and bits of the possible
configuration approaches were discussed before [2]. So, I tried to implement
something, so we can discuss it in more detail. (Rahul, Bertrand: if you are
interested we can discuss all this in detail, so we can use this as a part of
ARM PCI passthrough solution).

This is all work in progress, so without having some other minor patches one
won’t be able to run that, but still the patches show the direction, which
should be fine for the RFC. Those interested in the full working example I have
created a branch [3], but please note that this was fully tested on R-Car Gen3
platform only which has a non-ECAM PCI host controller and only partially it
was tested on QEMU (without running guest domains, Domain-0 only).

In this RFC I only submit some patches which I would like to get community's
view on.  I will highlight some of them below and the rest is documented in
their commit messages:

1. [WORKAROUND] xen/arm: Update hwdom's p2m to trap ECAM space

This is a workaround to be able to trap ECAM address space in Domain-0 which is
normally not possible because PCI host bridge is mapped into Domain-0, so there
is a need in some special handling of such devices. I have discussed this with
Julien on IRC, but haven't implemented anything of production quality yet.

2. arm/pci: Maintain PCI assignable list

This patch needs a decision on pci-back use. As of now what is not covered is
the assignment of legacy IRQs (MMIOs and MSIs are handled by Xen without the
toolstack's help - am I right here?). MMIOs are assigned by vPCI code.  We
discussed [2] a possibility to run a "limited" version of the pci-back driver
for ARM, but I’d like to bring back that discussion here as it seems that only
some bits of the pci-back may now be used, so profit of having pci-back
in the picture is not clear.

3. vpci: Make every domain handle its own BARs

This is a big change in vPCI code which allows non-identity mappings for guest
domains. This also handles MMIO configuration for the guests without using the
toolstack which does the same via reading PCI bus sysfs entries in Domain-0.
(Thank you Roger for making lots of thing clear for me.) This implements PCI
headers per domain.

4. vpci/arm: Allow updating BAR's header for non-ECAM bridges

This allows non-ECAM bridges, which are not trapped by vPCI for Domain-0/hwdom,
to update vPCI's view of the real values of the BARs. The assumption here is
that Domain-0/hwdom won't relocate BARs which is usually the case.


5. Some code is for R-Car Gen3 which is not ECAM compatible. It is good for the
demostration of where generic ARM PCI framwork should be changed to support
such controllers.

Thank you,
Oleksandr Andrushchenko

P.S. I would like to thank Roger, Juilien and Jan for their attention
and time.

[1] https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg77422.html
[2] https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg84452.html
[3] https://github.com/andr2000/xen/tree/vpci_rfc

Oleksandr Andrushchenko (10):
  pci/pvh: Allow PCI toolstack code run with PVH domains on ARM
  arm/pci: Maintain PCI assignable list
  xen/arm: Setup MMIO range trap handlers for hardware domain
  [WORKAROUND] xen/arm: Update hwdom's p2m to trap ECAM space
  xen/arm: Process pending vPCI map/unmap operations
  vpci: Make every domain handle its own BARs
  xen/arm: Do not hardcode phycial PCI device addresses
  vpci/arm: Allow updating BAR's header for non-ECAM bridges
  vpci/rcar: Implement vPCI.update_bar_header callback
  [HACK] vpci/rcar: Make vPCI know DomD is hardware domain

 tools/libxc/include/xenctrl.h         |   9 +
 tools/libxc/xc_domain.c               |   1 +
 tools/libxc/xc_misc.c                 |  46 ++++
 tools/libxl/Makefile                  |   8 +
 tools/libxl/libxl_pci.c               | 109 +++++++++-
 xen/arch/arm/domain_build.c           |  10 +-
 xen/arch/arm/pci/pci-host-common.c    |  44 ++++
 xen/arch/arm/pci/pci-host-generic.c   |  43 +++-
 xen/arch/arm/pci/pci-host-rcar-gen3.c |  69 ++++++
 xen/arch/arm/sysctl.c                 |  66 +++++-
 xen/arch/arm/traps.c                  |   6 +
 xen/arch/arm/vpci.c                   |  16 +-
 xen/drivers/passthrough/pci.c         |  93 +++++++++
 xen/drivers/vpci/header.c             | 289 +++++++++++++++++++++++---
 xen/drivers/vpci/vpci.c               |   1 +
 xen/include/asm-arm/pci.h             |  17 ++
 xen/include/public/arch-arm.h         |   9 +-
 xen/include/public/sysctl.h           |  40 ++++
 xen/include/xen/pci.h                 |  12 ++
 xen/include/xen/vpci.h                |  24 ++-
 20 files changed, 857 insertions(+), 55 deletions(-)

-- 
2.17.1



^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  2020-11-11 12:31   ` [SUSPECTED SPAM][PATCH " Roger Pau Monné
  2020-11-09 12:50 ` [PATCH 02/10] arm/pci: Maintain PCI assignable list Oleksandr Andrushchenko
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

According to https://wiki.xenproject.org/wiki/Linux_PVH:

Items not supported by PVH
 - PCI pass through (as of Xen 4.10)

Allow running PCI remove code on ARM and do not assert for PVH domains.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 tools/libxl/Makefile    | 4 ++++
 tools/libxl/libxl_pci.c | 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 241da7fff6f4..f3806aafcb4e 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -130,6 +130,10 @@ endif
 
 LIBXL_LIBS += -lyajl
 
+ifeq ($(CONFIG_ARM),y)
+CFALGS += -DCONFIG_ARM
+endif
+
 LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
 			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
 			libxl_internal.o libxl_utils.o libxl_uuid.o \
diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
index bc5843b13701..b93cf976642b 100644
--- a/tools/libxl/libxl_pci.c
+++ b/tools/libxl/libxl_pci.c
@@ -1915,8 +1915,10 @@ static void do_pci_remove(libxl__egc *egc, uint32_t domid,
             goto out_fail;
         }
     } else {
+        /* PCI passthrough can also run on ARM PVH */
+#ifndef CONFIG_ARM
         assert(type == LIBXL_DOMAIN_TYPE_PV);
-
+#endif
         char *sysfs_path = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF"/resource", pcidev->domain,
                                      pcidev->bus, pcidev->dev, pcidev->func);
         FILE *f = fopen(sysfs_path, "r");
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 02/10] arm/pci: Maintain PCI assignable list
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
  2020-11-09 12:50 ` [PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  2020-11-11 13:53   ` Roger Pau Monné
  2020-11-11 14:54   ` Jan Beulich
  2020-11-09 12:50 ` [PATCH 03/10] xen/arm: Setup MMIO range trap handlers for hardware domain Oleksandr Andrushchenko
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

The original code depends on pciback to manage assignable device list.
The functionality which is implemented by the pciback and the toolstack
and which is relevant/missing/needed for ARM:

1. pciback is used as a database for assignable PCI devices, e.g. xl
   pci-assignable-{add|remove|list} manipulates that list. So, whenever the
   toolstack needs to know which PCI devices can be passed through it reads
   that from the relevant sysfs entries of the pciback.

2. pciback is used to hold the unbound PCI devices, e.g. when passing through
   a PCI device it needs to be unbound from the relevant device driver and bound
   to pciback (strictly speaking it is not required that the device is bound to
   pciback, but pciback is again used as a database of the passed through PCI
   devices, so we can re-bind the devices back to their original drivers when
   guest domain shuts down)

1. As ARM doesn't use pciback implement the above with additional sysctls:
 - XEN_SYSCTL_pci_device_set_assigned
 - XEN_SYSCTL_pci_device_get_assigned
 - XEN_SYSCTL_pci_device_enum_assigned
2. Extend struct pci_dev to hold assignment state.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 tools/libxc/include/xenctrl.h |   9 +++
 tools/libxc/xc_domain.c       |   1 +
 tools/libxc/xc_misc.c         |  46 +++++++++++++++
 tools/libxl/Makefile          |   4 ++
 tools/libxl/libxl_pci.c       | 105 ++++++++++++++++++++++++++++++++--
 xen/arch/arm/sysctl.c         |  66 ++++++++++++++++++++-
 xen/drivers/passthrough/pci.c |  93 ++++++++++++++++++++++++++++++
 xen/include/public/sysctl.h   |  40 +++++++++++++
 xen/include/xen/pci.h         |  12 ++++
 9 files changed, 370 insertions(+), 6 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 4c89b7294c4f..77029013da7d 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2652,6 +2652,15 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout, uint32
 int xc_domain_cacheflush(xc_interface *xch, uint32_t domid,
                          xen_pfn_t start_pfn, xen_pfn_t nr_pfns);
 
+typedef xen_sysctl_pci_device_enum_assigned_t xc_pci_device_enum_assigned_t;
+
+int xc_pci_device_set_assigned(xc_interface *xch, uint32_t machine_sbdf,
+                               bool assigned);
+int xc_pci_device_get_assigned(xc_interface *xch, uint32_t machine_sbdf);
+
+int xc_pci_device_enum_assigned(xc_interface *xch,
+                                xc_pci_device_enum_assigned_t *e);
+
 /* Compat shims */
 #include "xenctrl_compat.h"
 
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 71829c2bce3e..d515191e9243 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -2321,6 +2321,7 @@ int xc_domain_soft_reset(xc_interface *xch,
     domctl.domain = domid;
     return do_domctl(xch, &domctl);
 }
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index 3820394413a9..d439c4ba1019 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -988,6 +988,52 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout, uint32
     return _xc_livepatch_action(xch, name, LIVEPATCH_ACTION_REPLACE, timeout, flags);
 }
 
+int xc_pci_device_set_assigned(
+    xc_interface *xch,
+    uint32_t machine_sbdf,
+    bool assigned)
+{
+    DECLARE_SYSCTL;
+
+    sysctl.cmd = XEN_SYSCTL_pci_device_set_assigned;
+    sysctl.u.pci_set_assigned.machine_sbdf = machine_sbdf;
+    sysctl.u.pci_set_assigned.assigned = assigned;
+
+    return do_sysctl(xch, &sysctl);
+}
+
+int xc_pci_device_get_assigned(
+    xc_interface *xch,
+    uint32_t machine_sbdf)
+{
+    DECLARE_SYSCTL;
+
+    sysctl.cmd = XEN_SYSCTL_pci_device_get_assigned;
+    sysctl.u.pci_get_assigned.machine_sbdf = machine_sbdf;
+
+    return do_sysctl(xch, &sysctl);
+}
+
+int xc_pci_device_enum_assigned(xc_interface *xch,
+                                xc_pci_device_enum_assigned_t *e)
+{
+    int ret;
+    DECLARE_SYSCTL;
+
+    sysctl.cmd = XEN_SYSCTL_pci_device_enum_assigned;
+    sysctl.u.pci_enum_assigned.idx = e->idx;
+    sysctl.u.pci_enum_assigned.report_not_assigned = e->report_not_assigned;
+    ret = do_sysctl(xch, &sysctl);
+    if ( ret )
+        errno = EINVAL;
+    else
+    {
+        e->domain = sysctl.u.pci_enum_assigned.domain;
+        e->machine_sbdf = sysctl.u.pci_enum_assigned.machine_sbdf;
+    }
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index f3806aafcb4e..6f76ba35aec7 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -130,6 +130,10 @@ endif
 
 LIBXL_LIBS += -lyajl
 
+ifeq ($(CONFIG_X86),y)
+CFALGS += -DCONFIG_PCIBACK
+endif
+
 ifeq ($(CONFIG_ARM),y)
 CFALGS += -DCONFIG_ARM
 endif
diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
index b93cf976642b..41f89b8aae10 100644
--- a/tools/libxl/libxl_pci.c
+++ b/tools/libxl/libxl_pci.c
@@ -319,6 +319,7 @@ retry_transaction2:
 
 static int get_all_assigned_devices(libxl__gc *gc, libxl_device_pci **list, int *num)
 {
+#ifdef CONFIG_PCIBACK
     char **domlist;
     unsigned int nd = 0, i;
 
@@ -356,6 +357,33 @@ static int get_all_assigned_devices(libxl__gc *gc, libxl_device_pci **list, int
             }
         }
     }
+#else
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+    int ret;
+    xc_pci_device_enum_assigned_t e;
+
+    *list = NULL;
+    *num = 0;
+
+    memset(&e, 0, sizeof(e));
+    do {
+        ret = xc_pci_device_enum_assigned(ctx->xch, &e);
+        if ( ret && errno == EINVAL )
+            break;
+        *list = realloc(*list, sizeof(libxl_device_pci) * (e.idx + 1));
+        if (*list == NULL)
+            return ERROR_NOMEM;
+
+        pcidev_struct_fill(*list + e.idx,
+                           e.domain,
+                           e.machine_sbdf >> 8 & 0xff,
+                           PCI_SLOT(e.machine_sbdf),
+                           PCI_FUNC(e.machine_sbdf),
+                           0 /*vdevfn*/);
+        e.idx++;
+    } while (!ret);
+    *num = e.idx;
+#endif
     libxl__ptr_add(gc, *list);
 
     return 0;
@@ -411,13 +439,20 @@ static int sysfs_write_bdf(libxl__gc *gc, const char * sysfs_path,
 libxl_device_pci *libxl_device_pci_assignable_list(libxl_ctx *ctx, int *num)
 {
     GC_INIT(ctx);
-    libxl_device_pci *pcidevs = NULL, *new, *assigned;
+    libxl_device_pci *pcidevs = NULL, *new;
+    int r;
+#ifdef CONFIG_PCIBACK
+    libxl_device_pci *assigned;
+    int num_assigned;
     struct dirent *de;
     DIR *dir;
-    int r, num_assigned;
+#else
+    xc_pci_device_enum_assigned_t e;
+#endif
 
     *num = 0;
 
+#ifdef CONFIG_PCIBACK
     r = get_all_assigned_devices(gc, &assigned, &num_assigned);
     if (r) goto out;
 
@@ -453,6 +488,32 @@ libxl_device_pci *libxl_device_pci_assignable_list(libxl_ctx *ctx, int *num)
 
     closedir(dir);
 out:
+#else
+    memset(&e, 0, sizeof(e));
+    e.report_not_assigned = 1;
+    do {
+        r = xc_pci_device_enum_assigned(ctx->xch, &e);
+        if ( r && errno == EINVAL )
+            break;
+        new = realloc(pcidevs, (e.idx + 1) * sizeof(*new));
+        if (NULL == new)
+            continue;
+
+        pcidevs = new;
+        new = pcidevs + e.idx;
+
+        memset(new, 0, sizeof(*new));
+
+        pcidev_struct_fill(new,
+                           e.domain,
+                           e.machine_sbdf >> 8 & 0xff,
+                           PCI_SLOT(e.machine_sbdf),
+                           PCI_FUNC(e.machine_sbdf),
+                           0 /*vdevfn*/);
+        e.idx++;
+    } while (!r);
+    *num = e.idx;
+#endif
     GC_FREE;
     return pcidevs;
 }
@@ -606,6 +667,7 @@ bool libxl__is_igd_vga_passthru(libxl__gc *gc,
     return false;
 }
 
+#ifdef CONFIG_PCIBACK
 /*
  * A brief comment about slots.  I don't know what slots are for; however,
  * I have by experimentation determined:
@@ -648,11 +710,13 @@ out:
     fclose(f);
     return rc;
 }
+#endif
 
 static int pciback_dev_is_assigned(libxl__gc *gc, libxl_device_pci *pcidev)
 {
-    char * spath;
     int rc;
+#ifdef CONFIG_PCIBACK
+    char * spath;
     struct stat st;
 
     if ( access(SYSFS_PCIBACK_DRIVER, F_OK) < 0 ) {
@@ -663,22 +727,27 @@ static int pciback_dev_is_assigned(libxl__gc *gc, libxl_device_pci *pcidev)
         }
         return -1;
     }
-
     spath = GCSPRINTF(SYSFS_PCIBACK_DRIVER"/"PCI_BDF,
                       pcidev->domain, pcidev->bus,
                       pcidev->dev, pcidev->func);
     rc = lstat(spath, &st);
-
     if( rc == 0 )
         return 1;
     if ( rc < 0 && errno == ENOENT )
         return 0;
     LOGE(ERROR, "Accessing %s", spath);
     return -1;
+#else
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+
+    rc = xc_pci_device_get_assigned(ctx->xch, pcidev_encode_bdf(pcidev));
+    return rc == 0 ? 1 : 0;
+#endif
 }
 
 static int pciback_dev_assign(libxl__gc *gc, libxl_device_pci *pcidev)
 {
+#ifdef CONFIG_PCIBACK
     int rc;
 
     if ( (rc=pciback_dev_has_slot(gc, pcidev)) < 0 ) {
@@ -697,10 +766,17 @@ static int pciback_dev_assign(libxl__gc *gc, libxl_device_pci *pcidev)
         return ERROR_FAIL;
     }
     return 0;
+#else
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+
+    return xc_pci_device_set_assigned(ctx->xch, pcidev_encode_bdf(pcidev),
+                                      true);
+#endif
 }
 
 static int pciback_dev_unassign(libxl__gc *gc, libxl_device_pci *pcidev)
 {
+#ifdef CONFIG_PCIBACK
     /* Remove from pciback */
     if ( sysfs_dev_unbind(gc, pcidev, NULL) < 0 ) {
         LOG(ERROR, "Couldn't unbind device!");
@@ -716,6 +792,12 @@ static int pciback_dev_unassign(libxl__gc *gc, libxl_device_pci *pcidev)
         }
     }
     return 0;
+#else
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+
+    return xc_pci_device_set_assigned(ctx->xch, pcidev_encode_bdf(pcidev),
+                                      false);
+#endif
 }
 
 #define PCIBACK_INFO_PATH "/libxl/pciback"
@@ -780,10 +862,15 @@ static int libxl__device_pci_assignable_add(libxl__gc *gc,
 
     /* See if the device exists */
     spath = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF, dom, bus, dev, func);
+#ifdef CONFIG_PCI_SYSFS_DOM0
     if ( lstat(spath, &st) ) {
         LOGE(ERROR, "Couldn't lstat %s", spath);
         return ERROR_FAIL;
     }
+#else
+    (void)st;
+    printf("IMPLEMENT_ME: %s lstat %s\n", __func__, spath);
+#endif
 
     /* Check to see if it's already assigned to pciback */
     rc = pciback_dev_is_assigned(gc, pcidev);
@@ -1350,8 +1437,12 @@ static void pci_add_dm_done(libxl__egc *egc,
 
     if (f == NULL) {
         LOGED(ERROR, domainid, "Couldn't open %s", sysfs_path);
+#ifdef CONFIG_PCI_SYSFS_DOM0
         rc = ERROR_FAIL;
         goto out;
+#else
+        goto out_no_irq;
+#endif
     }
     for (i = 0; i < PROC_PCI_NUM_RESOURCES; i++) {
         if (fscanf(f, "0x%llx 0x%llx 0x%llx\n", &start, &end, &flags) != 3)
@@ -1522,7 +1613,11 @@ static int libxl_pcidev_assignable(libxl_ctx *ctx, libxl_device_pci *pcidev)
             break;
     }
     free(pcidevs);
+#ifdef CONFIG_PCIBACK
     return i != num;
+#else
+    return 1;
+#endif
 }
 
 static void device_pci_add_stubdom_wait(libxl__egc *egc,
diff --git a/xen/arch/arm/sysctl.c b/xen/arch/arm/sysctl.c
index f87944e8473c..84e933b2eb45 100644
--- a/xen/arch/arm/sysctl.c
+++ b/xen/arch/arm/sysctl.c
@@ -10,6 +10,7 @@
 #include <xen/lib.h>
 #include <xen/errno.h>
 #include <xen/hypercall.h>
+#include <xen/guest_access.h>
 #include <public/sysctl.h>
 
 void arch_do_physinfo(struct xen_sysctl_physinfo *pi)
@@ -20,7 +21,70 @@ void arch_do_physinfo(struct xen_sysctl_physinfo *pi)
 long arch_do_sysctl(struct xen_sysctl *sysctl,
                     XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
 {
-    return -ENOSYS;
+    long ret = 0;
+    bool copyback = 0;
+
+    switch ( sysctl->cmd )
+    {
+    case XEN_SYSCTL_pci_device_set_assigned:
+    {
+        u16 seg;
+        u8 bus, devfn;
+        uint32_t machine_sbdf;
+
+        machine_sbdf = sysctl->u.pci_set_assigned.machine_sbdf;
+
+#if 0
+        ret = xsm_pci_device_set_assigned(XSM_HOOK, d);
+        if ( ret )
+            break;
+#endif
+
+        seg = machine_sbdf >> 16;
+        bus = PCI_BUS(machine_sbdf);
+        devfn = PCI_DEVFN2(machine_sbdf);
+
+        pcidevs_lock();
+        ret = pci_device_set_assigned(seg, bus, devfn,
+                                      !!sysctl->u.pci_set_assigned.assigned);
+        pcidevs_unlock();
+        break;
+    }
+    case XEN_SYSCTL_pci_device_get_assigned:
+    {
+        u16 seg;
+        u8 bus, devfn;
+        uint32_t machine_sbdf;
+
+        machine_sbdf = sysctl->u.pci_set_assigned.machine_sbdf;
+
+        seg = machine_sbdf >> 16;
+        bus = PCI_BUS(machine_sbdf);
+        devfn = PCI_DEVFN2(machine_sbdf);
+
+        pcidevs_lock();
+        ret = pci_device_get_assigned(seg, bus, devfn);
+        pcidevs_unlock();
+        break;
+    }
+    case XEN_SYSCTL_pci_device_enum_assigned:
+    {
+        ret = pci_device_enum_assigned(sysctl->u.pci_enum_assigned.report_not_assigned,
+                                       sysctl->u.pci_enum_assigned.idx,
+                                       &sysctl->u.pci_enum_assigned.domain,
+                                       &sysctl->u.pci_enum_assigned.machine_sbdf);
+        copyback = 1;
+        break;
+    }
+    default:
+        ret = -ENOSYS;
+        break;
+    }
+    if ( copyback && (!ret || copyback > 0) &&
+         __copy_to_guest(u_sysctl, sysctl, 1) )
+        ret = -EFAULT;
+
+    return ret;
 }
 
 /*
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 98e8a2fade60..49b4279c63bd 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -879,6 +879,43 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     return ret;
 }
 
+#ifdef CONFIG_ARM
+int pci_device_set_assigned(u16 seg, u8 bus, u8 devfn, bool assigned)
+{
+    struct pci_dev *pdev;
+
+    pdev = pci_get_pdev(seg, bus, devfn);
+    if ( !pdev )
+    {
+        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
+               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
+        return -ENODEV;
+    }
+
+    pdev->assigned = assigned;
+    printk(XENLOG_ERR "pciback %sassign PCI device %04x:%02x:%02x.%u\n",
+           assigned ? "" : "de-",
+           seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
+
+    return 0;
+}
+
+int pci_device_get_assigned(u16 seg, u8 bus, u8 devfn)
+{
+    struct pci_dev *pdev;
+
+    pdev = pci_get_pdev(seg, bus, devfn);
+    if ( !pdev )
+    {
+        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
+               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
+        return -ENODEV;
+    }
+
+    return pdev->assigned ? 0 : -ENODEV;
+}
+#endif
+
 #ifndef CONFIG_ARM
 /*TODO :Implement MSI support for ARM  */
 static int pci_clean_dpci_irq(struct domain *d,
@@ -1821,6 +1858,62 @@ int iommu_do_pci_domctl(
     return ret;
 }
 
+#ifdef CONFIG_ARM
+struct list_assigned {
+    uint32_t cur_idx;
+    uint32_t from_idx;
+    bool assigned;
+    domid_t *domain;
+    uint32_t *machine_sbdf;
+};
+
+static int _enum_assigned_pci_devices(struct pci_seg *pseg, void *arg)
+{
+    struct list_assigned *ctxt = arg;
+    struct pci_dev *pdev;
+
+    list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
+    {
+        if ( pdev->assigned == ctxt->assigned )
+        {
+            if ( ctxt->cur_idx == ctxt->from_idx )
+            {
+                *ctxt->domain = pdev->domain->domain_id;
+                *ctxt->machine_sbdf = pdev->sbdf.sbdf;
+                return 1;
+            }
+            ctxt->cur_idx++;
+        }
+    }
+    return 0;
+}
+
+int pci_device_enum_assigned(bool report_not_assigned,
+                             uint32_t from_idx, domid_t *domain,
+                             uint32_t *machine_sbdf)
+{
+    struct list_assigned ctxt = {
+        .assigned = !report_not_assigned,
+        .cur_idx = 0,
+        .from_idx = from_idx,
+        .domain = domain,
+        .machine_sbdf = machine_sbdf,
+    };
+    int ret;
+
+    pcidevs_lock();
+    ret = pci_segments_iterate(_enum_assigned_pci_devices, &ctxt);
+    pcidevs_unlock();
+    /*
+     * If not found then report as EINVAL to mark
+     * enumeration process finished.
+     */
+    if ( !ret )
+        return -EINVAL;
+    return 0;
+}
+#endif
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index a07364711794..5ca73c538688 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -1062,6 +1062,40 @@ typedef struct xen_sysctl_cpu_policy xen_sysctl_cpu_policy_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpu_policy_t);
 #endif
 
+/*
+ * These are to emulate pciback device (de-)assignment used by the tools
+ * to track current device assignments: all the PCI devices that can
+ * be passed through must be assigned to the pciback to mark them
+ * as such. As on ARM we do not run pci{back|front} and are emulating
+ * PCI host bridge in Xen, so we need to maintain the assignments on our
+ * own in Xen itself.
+ *
+ * Note on xen_sysctl_pci_device_get_assigned: ENOENT is used to report
+ * that there are no assigned devices left.
+ */
+struct xen_sysctl_pci_device_set_assigned {
+    /* IN */
+    /* FIXME: is this really a machine SBDF or as Domain-0 sees it? */
+    uint32_t machine_sbdf;
+    uint8_t assigned;
+};
+
+struct xen_sysctl_pci_device_get_assigned {
+    /* IN */
+    uint32_t machine_sbdf;
+};
+
+struct xen_sysctl_pci_device_enum_assigned {
+    /* IN */
+    uint32_t idx;
+    uint8_t report_not_assigned;
+    /* OUT */
+    domid_t domain;
+    uint32_t machine_sbdf;
+};
+typedef struct xen_sysctl_pci_device_enum_assigned xen_sysctl_pci_device_enum_assigned_t;
+DEFINE_XEN_GUEST_HANDLE(xen_sysctl_pci_device_enum_assigned_t);
+
 struct xen_sysctl {
     uint32_t cmd;
 #define XEN_SYSCTL_readconsole                    1
@@ -1092,6 +1126,9 @@ struct xen_sysctl {
 #define XEN_SYSCTL_livepatch_op                  27
 /* #define XEN_SYSCTL_set_parameter              28 */
 #define XEN_SYSCTL_get_cpu_policy                29
+#define XEN_SYSCTL_pci_device_set_assigned       30
+#define XEN_SYSCTL_pci_device_get_assigned       31
+#define XEN_SYSCTL_pci_device_enum_assigned      32
     uint32_t interface_version; /* XEN_SYSCTL_INTERFACE_VERSION */
     union {
         struct xen_sysctl_readconsole       readconsole;
@@ -1122,6 +1159,9 @@ struct xen_sysctl {
 #if defined(__i386__) || defined(__x86_64__)
         struct xen_sysctl_cpu_policy        cpu_policy;
 #endif
+        struct xen_sysctl_pci_device_set_assigned pci_set_assigned;
+        struct xen_sysctl_pci_device_get_assigned pci_get_assigned;
+        struct xen_sysctl_pci_device_enum_assigned pci_enum_assigned;
         uint8_t                             pad[128];
     } u;
 };
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 2bc4aaf4530c..7bf439de4de0 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -132,6 +132,13 @@ struct pci_dev {
 
     /* Data for vPCI. */
     struct vpci *vpci;
+#ifdef CONFIG_ARM
+    /*
+     * Set if this PCI device is eligible for pass through,
+     * e.g. just like it was assigned to pciback driver.
+     */
+    bool assigned;
+#endif
 };
 
 #define for_each_pdev(domain, pdev) \
@@ -168,6 +175,11 @@ const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *, nodeid_t node);
 int pci_remove_device(u16 seg, u8 bus, u8 devfn);
+int pci_device_set_assigned(u16 seg, u8 bus, u8 devfn, bool assigned);
+int pci_device_get_assigned(u16 seg, u8 bus, u8 devfn);
+int pci_device_enum_assigned(bool report_not_assigned,
+                             uint32_t from_idx, domid_t *domain,
+                             uint32_t *machine_sbdf);
 int pci_ro_device(int seg, int bus, int devfn);
 int pci_hide_device(unsigned int seg, unsigned int bus, unsigned int devfn);
 struct pci_dev *pci_get_pdev(int seg, int bus, int devfn);
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 03/10] xen/arm: Setup MMIO range trap handlers for hardware domain
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
  2020-11-09 12:50 ` [PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM Oleksandr Andrushchenko
  2020-11-09 12:50 ` [PATCH 02/10] arm/pci: Maintain PCI assignable list Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  2020-11-11 14:39   ` Roger Pau Monné
  2020-11-09 12:50 ` [PATCH 04/10] [WORKAROUND] xen/arm: Update hwdom's p2m to trap ECAM space Oleksandr Andrushchenko
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

In order vPCI to work it needs all access to PCI configuration space
access to be synchronized among all entities, e.g. hardware domain and
guests. For that implement PCI host bridge specific callbacks to
propelry setup those ranges depending on host bridge implementation.

This callback is optional and may not be used by non-ECAM host bridges.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/arch/arm/pci/pci-host-common.c  | 16 ++++++++++++++++
 xen/arch/arm/pci/pci-host-generic.c | 15 +++++++++++++--
 xen/arch/arm/vpci.c                 | 16 +++++++++++++++-
 xen/include/asm-arm/pci.h           |  7 +++++++
 4 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/xen/arch/arm/pci/pci-host-common.c b/xen/arch/arm/pci/pci-host-common.c
index b011c7eff3c8..b81184d34980 100644
--- a/xen/arch/arm/pci/pci-host-common.c
+++ b/xen/arch/arm/pci/pci-host-common.c
@@ -219,6 +219,22 @@ struct device *pci_find_host_bridge_device(struct device *dev)
     }
     return dt_to_dev(bridge->dt_node);
 }
+
+int pci_host_iterate_bridges(struct domain *d,
+                             int (*clb)(struct domain *d,
+                                        struct pci_host_bridge *bridge))
+{
+    struct pci_host_bridge *bridge;
+    int err;
+
+    list_for_each_entry( bridge, &pci_host_bridges, node )
+    {
+        err = clb(d, bridge);
+        if ( err )
+            return err;
+    }
+    return 0;
+}
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/arm/pci/pci-host-generic.c b/xen/arch/arm/pci/pci-host-generic.c
index 54dd123e95c7..469df3da0116 100644
--- a/xen/arch/arm/pci/pci-host-generic.c
+++ b/xen/arch/arm/pci/pci-host-generic.c
@@ -85,12 +85,23 @@ int pci_ecam_config_read(struct pci_host_bridge *bridge, uint32_t sbdf,
     return 0;
 }
 
+static int pci_ecam_register_mmio_handler(struct domain *d,
+                                          struct pci_host_bridge *bridge,
+                                          const struct mmio_handler_ops *ops)
+{
+    struct pci_config_window *cfg = bridge->sysdata;
+
+    register_mmio_handler(d, ops, cfg->phys_addr, cfg->size, NULL);
+    return 0;
+}
+
 /* ECAM ops */
 struct pci_ecam_ops pci_generic_ecam_ops = {
     .bus_shift  = 20,
     .pci_ops    = {
-        .read       = pci_ecam_config_read,
-        .write      = pci_ecam_config_write,
+        .read                  = pci_ecam_config_read,
+        .write                 = pci_ecam_config_write,
+        .register_mmio_handler = pci_ecam_register_mmio_handler,
     }
 };
 
diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 49e473ab0d10..2b9bf34c8fe6 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -80,11 +80,25 @@ static const struct mmio_handler_ops vpci_mmio_handler = {
     .write = vpci_mmio_write,
 };
 
+static int vpci_setup_mmio_handler(struct domain *d,
+                                   struct pci_host_bridge *bridge)
+{
+    if ( bridge->ops->register_mmio_handler )
+        return bridge->ops->register_mmio_handler(d, bridge,
+                                                  &vpci_mmio_handler);
+    return 0;
+}
+
+
 int domain_vpci_init(struct domain *d)
 {
-    if ( !has_vpci(d) || is_hardware_domain(d) )
+    if ( !has_vpci(d) )
         return 0;
 
+    if ( is_hardware_domain(d) )
+        return pci_host_iterate_bridges(d, vpci_setup_mmio_handler);
+
+    /* Guest domains use what is programmed in their device tree. */
     register_mmio_handler(d, &vpci_mmio_handler,
             GUEST_VPCI_ECAM_BASE,GUEST_VPCI_ECAM_SIZE,NULL);
 
diff --git a/xen/include/asm-arm/pci.h b/xen/include/asm-arm/pci.h
index ba23178f67ab..e3a02429b8d4 100644
--- a/xen/include/asm-arm/pci.h
+++ b/xen/include/asm-arm/pci.h
@@ -27,6 +27,7 @@
 #include <xen/pci.h>
 #include <xen/device_tree.h>
 #include <asm/device.h>
+#include <asm/mmio.h>
 
 #ifdef CONFIG_ARM_PCI
 
@@ -64,6 +65,9 @@ struct pci_ops {
                     uint32_t sbdf, int where, int size, u32 *val);
     int (*write)(struct pci_host_bridge *bridge,
                     uint32_t sbdf, int where, int size, u32 val);
+    int (*register_mmio_handler)(struct domain *d,
+                                 struct pci_host_bridge *bridge,
+                                 const struct mmio_handler_ops *ops);
 };
 
 /*
@@ -101,6 +105,9 @@ void pci_init(void);
 bool dt_pci_parse_bus_range(struct dt_device_node *dev,
                             struct pci_config_window *cfg);
 
+int pci_host_iterate_bridges(struct domain *d,
+                             int (*clb)(struct domain *d,
+                                        struct pci_host_bridge *bridge));
 #else   /*!CONFIG_ARM_PCI*/
 struct arch_pci_dev { };
 static inline void  pci_init(void) { }
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 04/10] [WORKAROUND] xen/arm: Update hwdom's p2m to trap ECAM space
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
                   ` (2 preceding siblings ...)
  2020-11-09 12:50 ` [PATCH 03/10] xen/arm: Setup MMIO range trap handlers for hardware domain Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  2020-11-11 14:44   ` Roger Pau Monné
  2020-11-09 12:50 ` [PATCH 05/10] xen/arm: Process pending vPCI map/unmap operations Oleksandr Andrushchenko
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Host bridge controller's ECAM space is mapped into Domain-0's p2m,
thus it is not possible to trap the same for vPCI via MMIO handlers.
For this to work we need to unmap those mappings in p2m.

TODO (Julien): It would be best if we avoid the map/unmap operation.
So, maybe we want to introduce another way to avoid the mapping.
Maybe by changing the type of the controller to "PCI_HOSTCONTROLLER"
and checking if this is a PCI hostcontroller avoid the mapping.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/arch/arm/domain_build.c         | 10 +++++++++-
 xen/arch/arm/pci/pci-host-common.c  | 15 +++++++++++++++
 xen/arch/arm/pci/pci-host-generic.c | 28 ++++++++++++++++++++++++++++
 xen/include/asm-arm/pci.h           |  2 ++
 4 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 1f83f9048146..3f696d2a6672 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -2566,7 +2566,15 @@ int __init construct_dom0(struct domain *d)
     if ( rc < 0 )
         return rc;
 
-    return construct_domain(d, &kinfo);
+    rc = construct_domain(d, &kinfo);
+    if ( rc < 0 )
+        return rc;
+
+#ifdef CONFIG_HAS_PCI
+    if ( has_vpci(d) )
+        rc = pci_host_bridge_update_mappings(d);
+#endif
+    return rc;
 }
 
 /*
diff --git a/xen/arch/arm/pci/pci-host-common.c b/xen/arch/arm/pci/pci-host-common.c
index b81184d34980..b6c4d7b636b1 100644
--- a/xen/arch/arm/pci/pci-host-common.c
+++ b/xen/arch/arm/pci/pci-host-common.c
@@ -235,6 +235,21 @@ int pci_host_iterate_bridges(struct domain *d,
     }
     return 0;
 }
+
+static int pci_host_bridge_update_mapping(struct domain *d,
+                                          struct pci_host_bridge *bridge)
+{
+    if ( !bridge->ops->update_mappings )
+        return 0;
+
+    return bridge->ops->update_mappings(d, bridge);
+}
+
+int pci_host_bridge_update_mappings(struct domain *d)
+{
+    return pci_host_iterate_bridges(d, pci_host_bridge_update_mapping);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/arm/pci/pci-host-generic.c b/xen/arch/arm/pci/pci-host-generic.c
index 469df3da0116..772c53c881bc 100644
--- a/xen/arch/arm/pci/pci-host-generic.c
+++ b/xen/arch/arm/pci/pci-host-generic.c
@@ -21,6 +21,8 @@
 #include <asm/device.h>
 #include <asm/io.h>
 #include <xen/pci.h>
+#include <xen/sched.h>
+#include <asm/p2m.h>
 #include <asm/pci.h>
 
 /*
@@ -85,6 +87,31 @@ int pci_ecam_config_read(struct pci_host_bridge *bridge, uint32_t sbdf,
     return 0;
 }
 
+/*
+ * TODO: This is called late on domain creation to mangle p2m if needed:
+ * for ECAM host controller for mmio region traps to work for Domain-0
+ * we need to unmap those mappings in p2m.
+ * This is WIP:
+ * julieng: I think it would be best if we avoid the map/unmap operation.
+ * So maybe we want to introduce another way to avoid the mapping.
+ * Maybe by changing the type of the controller to "PCI_HOSTCONTROLLER"
+ * and check if this is a PCI hostcontroller avoid the mapping.
+ */
+static int pci_ecam_update_mappings(struct domain *d,
+                                    struct pci_host_bridge *bridge)
+{
+    struct pci_config_window *cfg = bridge->sysdata;
+    int ret;
+
+    /* Only for control domain which owns this PCI host bridge. */
+    if ( !is_control_domain(d) )
+        return 0;
+
+    ret = unmap_regions_p2mt(d, gaddr_to_gfn(cfg->phys_addr),
+                             cfg->size >> PAGE_SHIFT, INVALID_MFN);
+    return ret;
+}
+
 static int pci_ecam_register_mmio_handler(struct domain *d,
                                           struct pci_host_bridge *bridge,
                                           const struct mmio_handler_ops *ops)
@@ -101,6 +128,7 @@ struct pci_ecam_ops pci_generic_ecam_ops = {
     .pci_ops    = {
         .read                  = pci_ecam_config_read,
         .write                 = pci_ecam_config_write,
+        .update_mappings       = pci_ecam_update_mappings,
         .register_mmio_handler = pci_ecam_register_mmio_handler,
     }
 };
diff --git a/xen/include/asm-arm/pci.h b/xen/include/asm-arm/pci.h
index e3a02429b8d4..d94e8a6628de 100644
--- a/xen/include/asm-arm/pci.h
+++ b/xen/include/asm-arm/pci.h
@@ -65,6 +65,7 @@ struct pci_ops {
                     uint32_t sbdf, int where, int size, u32 *val);
     int (*write)(struct pci_host_bridge *bridge,
                     uint32_t sbdf, int where, int size, u32 val);
+    int (*update_mappings)(struct domain *d, struct pci_host_bridge *bridge);
     int (*register_mmio_handler)(struct domain *d,
                                  struct pci_host_bridge *bridge,
                                  const struct mmio_handler_ops *ops);
@@ -108,6 +109,7 @@ bool dt_pci_parse_bus_range(struct dt_device_node *dev,
 int pci_host_iterate_bridges(struct domain *d,
                              int (*clb)(struct domain *d,
                                         struct pci_host_bridge *bridge));
+int pci_host_bridge_update_mappings(struct domain *d);
 #else   /*!CONFIG_ARM_PCI*/
 struct arch_pci_dev { };
 static inline void  pci_init(void) { }
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 05/10] xen/arm: Process pending vPCI map/unmap operations
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
                   ` (3 preceding siblings ...)
  2020-11-09 12:50 ` [PATCH 04/10] [WORKAROUND] xen/arm: Update hwdom's p2m to trap ECAM space Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  2020-11-09 12:50 ` [PATCH 06/10] vpci: Make every domain handle its own BARs Oleksandr Andrushchenko
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

vPCI may map and unmap PCI device memory (BARs) being passed through which
may take a lot of time. For this those operations may be deferred to be
performed later, so that they can be safely preempted.
Run the corresponding vPCI code while switching a vCPU.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/arch/arm/traps.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 8f40d0e0b6b1..1c54dc0cdd51 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -33,6 +33,7 @@
 #include <xen/symbols.h>
 #include <xen/version.h>
 #include <xen/virtual_region.h>
+#include <xen/vpci.h>
 
 #include <public/sched.h>
 #include <public/xen.h>
@@ -2253,6 +2254,11 @@ static void check_for_vcpu_work(void)
 {
     struct vcpu *v = current;
 
+    local_irq_enable();
+    if ( has_vpci(v->domain) && vpci_process_pending(v) )
+        raise_softirq(SCHEDULE_SOFTIRQ);
+    local_irq_disable();
+
     if ( likely(!v->arch.need_flush_to_ram) )
         return;
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
                   ` (4 preceding siblings ...)
  2020-11-09 12:50 ` [PATCH 05/10] xen/arm: Process pending vPCI map/unmap operations Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  2020-11-12  9:40   ` Roger Pau Monné
  2020-11-09 12:50 ` [PATCH 07/10] xen/arm: Do not hardcode phycial PCI device addresses Oleksandr Andrushchenko
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

At the moment there is an identity mapping between how a guest sees its
BARs and how they are programmed into guest domain's p2m. This is not
going to work as guest domains have their own view on the BARs.
Extend existing vPCI BAR handling to allow every domain to have its own
view of the BARs: only hardware domain sees physical memory addresses in
this case and for the rest those are emulated, including logic required
for the guests to detect memory sizes and properties.

While emulating BAR access for the guests create a link between the
virtual BAR address and physical one: use full memory address while
creating range sets used to map/unmap corresponding address spaces and
exploit the fact that PCI BAR value doesn't use 8 lower bits of the
memory address. Use those bits to pass physical BAR's index, so we can
build/remove proper p2m mappings.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/header.c | 276 ++++++++++++++++++++++++++++++++++----
 xen/drivers/vpci/vpci.c   |   1 +
 xen/include/xen/vpci.h    |  24 ++--
 3 files changed, 265 insertions(+), 36 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index f74f728884c0..7dc7c70e24f2 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -31,14 +31,87 @@
 struct map_data {
     struct domain *d;
     bool map;
+    struct pci_dev *pdev;
 };
 
+static struct vpci_header *get_vpci_header(struct domain *d,
+                                           const struct pci_dev *pdev);
+
+static struct vpci_header *get_hwdom_vpci_header(const struct pci_dev *pdev)
+{
+    if ( unlikely(list_empty(&pdev->vpci->headers)) )
+        return get_vpci_header(hardware_domain, pdev);
+
+    /* hwdom's header is always the very first entry. */
+    return list_first_entry(&pdev->vpci->headers, struct vpci_header, node);
+}
+
+static struct vpci_header *get_vpci_header(struct domain *d,
+                                           const struct pci_dev *pdev)
+{
+    struct list_head *prev;
+    struct vpci_header *header;
+    struct vpci *vpci = pdev->vpci;
+
+    list_for_each( prev, &vpci->headers )
+    {
+        struct vpci_header *this = list_entry(prev, struct vpci_header, node);
+
+        if ( this->domain_id == d->domain_id )
+            return this;
+    }
+    printk(XENLOG_DEBUG "--------------------------------------" \
+           "Adding new vPCI BAR headers for domain %d: " PRI_pci" \n",
+           d->domain_id, pdev->sbdf.seg, pdev->sbdf.bus,
+           pdev->sbdf.dev, pdev->sbdf.fn);
+    header = xzalloc(struct vpci_header);
+    if ( !header )
+    {
+        printk(XENLOG_ERR
+               "Failed to add new vPCI BAR headers for domain %d: " PRI_pci" \n",
+               d->domain_id, pdev->sbdf.seg, pdev->sbdf.bus,
+               pdev->sbdf.dev, pdev->sbdf.fn);
+        return NULL;
+    }
+
+    if ( !is_hardware_domain(d) )
+    {
+        struct vpci_header *hwdom_header = get_hwdom_vpci_header(pdev);
+
+        /* Make a copy of the hwdom's BARs as the initial state for vBARs. */
+        memcpy(header, hwdom_header, sizeof(*header));
+    }
+
+    header->domain_id = d->domain_id;
+    list_add_tail(&header->node, &vpci->headers);
+    return header;
+}
+
+static struct vpci_bar *get_vpci_bar(struct domain *d,
+                                     const struct pci_dev *pdev,
+                                     int bar_idx)
+{
+    struct vpci_header *vheader;
+
+    vheader = get_vpci_header(d, pdev);
+    if ( !vheader )
+        return NULL;
+
+    return &vheader->bars[bar_idx];
+}
+
 static int map_range(unsigned long s, unsigned long e, void *data,
                      unsigned long *c)
 {
     const struct map_data *map = data;
-    int rc;
-
+    unsigned long mfn;
+    int rc, bar_idx;
+    struct vpci_header *header = get_hwdom_vpci_header(map->pdev);
+
+    bar_idx = s & ~PCI_BASE_ADDRESS_MEM_MASK;
+    s = PFN_DOWN(s);
+    e = PFN_DOWN(e);
+    mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr));
     for ( ; ; )
     {
         unsigned long size = e - s + 1;
@@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long e, void *data,
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn)
+                      : unmap_mmio_regions(map->d, _gfn(s), size, mfn);
         if ( rc == 0 )
         {
-            *c += size;
+            /*
+             * Range set is not expressed in frame numbers and the size
+             * is the number of frames, so update accordingly.
+             */
+            *c += size << PAGE_SHIFT;
             break;
         }
         if ( rc < 0 )
@@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, void *data,
             break;
         }
         ASSERT(rc < size);
-        *c += rc;
+        *c += rc << PAGE_SHIFT;
         s += rc;
+        mfn += rc;
         if ( general_preempt_check() )
                 return -ERESTART;
     }
@@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, void *data,
 static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
                             bool rom_only)
 {
-    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_header *header = get_hwdom_vpci_header(pdev);
     bool map = cmd & PCI_COMMAND_MEMORY;
     unsigned int i;
 
@@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v)
         struct map_data data = {
             .d = v->domain,
             .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+            .pdev = v->vpci.pdev,
         };
         int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
 
@@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v)
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
                             struct rangeset *mem, uint16_t cmd)
 {
-    struct map_data data = { .d = d, .map = true };
+    struct map_data data = { .d = d, .map = true,
+        .pdev = (struct pci_dev *)pdev };
     int rc;
 
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
@@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
 
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
-    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_header *header;
     struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
 #ifdef CONFIG_X86
@@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     if ( !mem )
         return -ENOMEM;
 
+    if ( is_hardware_domain(current->domain) )
+        header = get_hwdom_vpci_header(pdev);
+    else
+        header = get_vpci_header(current->domain, pdev);
+
     /*
      * Create a rangeset that represents the current device BARs memory region
      * and compare it against all the currently active BAR memory regions. If
@@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * First fill the rangeset with all the BARs of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
+     *
+     * Use the PCI reserved bits of the BAR to pass BAR's index.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         const struct vpci_bar *bar = &header->bars[i];
-        unsigned long start = PFN_DOWN(bar->addr);
-        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+        unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
+        unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) +
+            bar->size - 1;
 
         if ( !MAPPABLE_BAR(bar) ||
              (rom_only ? bar->type != VPCI_BAR_ROM
@@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     /* Remove any MSIX regions if present. */
     for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
     {
-        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
-        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
-                                     vmsix_table_size(pdev->vpci, i) - 1);
+        unsigned long start = (vmsix_table_addr(pdev->vpci, i) &
+                               PCI_BASE_ADDRESS_MEM_MASK) | i;
+        unsigned long end = (vmsix_table_addr(pdev->vpci, i) &
+                             PCI_BASE_ADDRESS_MEM_MASK ) +
+                             vmsix_table_size(pdev->vpci, i) - 1;
 
         rc = rangeset_remove_range(mem, start, end);
         if ( rc )
@@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      */
     for_each_pdev ( pdev->domain, tmp )
     {
+        struct vpci_header *header;
+
         if ( tmp == pdev )
         {
             /*
@@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                 continue;
         }
 
-        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
+        header = get_vpci_header(tmp->domain, pdev);
+
+        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
         {
-            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
-            unsigned long start = PFN_DOWN(bar->addr);
-            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+            const struct vpci_bar *bar = &header->bars[i];
+            unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
+            unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
+                + bar->size - 1;
 
             if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
                  /*
@@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
-static void bar_write(const struct pci_dev *pdev, unsigned int reg,
+static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg,
                       uint32_t val, void *data)
 {
     struct vpci_bar *bar = data;
@@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
     {
         /* If the value written is the current one avoid printing a warning. */
         if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
+        {
+            struct vpci_header *header = get_hwdom_vpci_header(pdev);
+
             gprintk(XENLOG_WARNING,
                     "%04x:%02x:%02x.%u: ignored BAR %lu write with memory decoding enabled\n",
                     pdev->seg, pdev->bus, slot, func,
-                    bar - pdev->vpci->header.bars + hi);
+                    bar - header->bars + hi);
+        }
         return;
     }
 
-
     /*
      * Update the cached address, so that when memory decoding is enabled
      * Xen can map the BAR into the guest p2m.
@@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    return vpci_hw_read32(pdev, reg, data);
+}
+
+static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t val, void *data)
+{
+    struct vpci_bar *vbar = data;
+    bool hi = false;
+
+    if ( vbar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        vbar--;
+        hi = true;
+    }
+    vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
+    vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
+}
+
+static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    struct vpci_bar *vbar = data;
+    uint32_t val;
+    bool hi = false;
+
+    if ( vbar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        vbar--;
+        hi = true;
+    }
+
+    if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == VPCI_BAR_MEM64_HI )
+    {
+        if ( hi )
+            val = vbar->addr >> 32;
+        else
+            val = vbar->addr & 0xffffffff;
+        if ( val == ~0 )
+        {
+            /* Guests detects BAR's properties and sizes. */
+            if ( !hi )
+            {
+                val = 0xffffffff & ~(vbar->size - 1);
+                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
+                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+            }
+            else
+                val = vbar->size >> 32;
+            vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
+            vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
+        }
+    }
+    else if ( vbar->type == VPCI_BAR_MEM32 )
+    {
+        val = vbar->addr;
+        if ( val == ~0 )
+        {
+            if ( !hi )
+            {
+                val = 0xffffffff & ~(vbar->size - 1);
+                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
+                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+            }
+        }
+    }
+    else
+    {
+        val = vbar->addr;
+    }
+    return val;
+}
+
 static void rom_write(const struct pci_dev *pdev, unsigned int reg,
                       uint32_t val, void *data)
 {
-    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_header *header = get_hwdom_vpci_header(pdev);
     struct vpci_bar *rom = data;
     uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
     uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
@@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
+                                  void *data)
+{
+    struct vpci_bar *vbar, *bar = data;
+
+    if ( is_hardware_domain(current->domain) )
+        return bar_read_hwdom(pdev, reg, data);
+
+    vbar = get_vpci_bar(current->domain, pdev, bar->index);
+    if ( !vbar )
+        return ~0;
+
+    return bar_read_guest(pdev, reg, vbar);
+}
+
+static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
+                               uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+
+    if ( is_hardware_domain(current->domain) )
+        bar_write_hwdom(pdev, reg, val, data);
+    else
+    {
+        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
+
+        if ( !vbar )
+            return;
+        bar_write_guest(pdev, reg, val, vbar);
+    }
+}
+
+/*
+ * FIXME: This is called early while adding vPCI handlers which is done
+ * by and for hwdom.
+ */
 static int init_bars(struct pci_dev *pdev)
 {
     uint16_t cmd;
     uint64_t addr, size;
     unsigned int i, num_bars, rom_reg;
-    struct vpci_header *header = &pdev->vpci->header;
-    struct vpci_bar *bars = header->bars;
+    struct vpci_header *header;
+    struct vpci_bar *bars;
     int rc;
 
+    header = get_hwdom_vpci_header(pdev);
+    if ( !header )
+        return -ENOMEM;
+    bars = header->bars;
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
@@ -496,11 +714,12 @@ static int init_bars(struct pci_dev *pdev)
         uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
         uint32_t val;
 
+        bars[i].index = i;
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            rc = vpci_add_register(pdev->vpci, bar_read_dispatch,
+                                   bar_write_dispatch, reg, 4, &bars[i]);
             if ( rc )
             {
                 pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
@@ -540,8 +759,8 @@ static int init_bars(struct pci_dev *pdev)
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
+        rc = vpci_add_register(pdev->vpci, bar_read_dispatch,
+                               bar_write_dispatch, reg, 4, &bars[i]);
         if ( rc )
         {
             pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
@@ -558,6 +777,7 @@ static int init_bars(struct pci_dev *pdev)
         rom->type = VPCI_BAR_ROM;
         rom->size = size;
         rom->addr = addr;
+        rom->index = num_bars;
         header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
                               PCI_ROM_ADDRESS_ENABLE;
 
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index a5293521a36a..728029da3e9c 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -69,6 +69,7 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
         return -ENOMEM;
 
     INIT_LIST_HEAD(&pdev->vpci->handlers);
+    INIT_LIST_HEAD(&pdev->vpci->headers);
     spin_lock_init(&pdev->vpci->lock);
 
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index c3501e9ec010..54423bc6556d 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -55,16 +55,14 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
  */
 bool __must_check vpci_process_pending(struct vcpu *v);
 
-struct vpci {
-    /* List of vPCI handlers for a device. */
-    struct list_head handlers;
-    spinlock_t lock;
-
 #ifdef __XEN__
-    /* Hide the rest of the vpci struct from the user-space test harness. */
     struct vpci_header {
+    struct list_head node;
+    /* Domain that owns this view of the BARs. */
+    domid_t domain_id;
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            int index;
             uint64_t addr;
             uint64_t size;
             enum {
@@ -88,8 +86,18 @@ struct vpci {
          * is mapped into guest p2m) if there's a ROM BAR on the device.
          */
         bool rom_enabled      : 1;
-        /* FIXME: currently there's no support for SR-IOV. */
-    } header;
+};
+#endif
+
+struct vpci {
+    /* List of vPCI handlers for a device. */
+    struct list_head handlers;
+    spinlock_t lock;
+
+#ifdef __XEN__
+    /* Hide the rest of the vpci struct from the user-space test harness. */
+    /* List of vPCI headers for all domains. */
+    struct list_head headers;
 
 #ifdef CONFIG_X86
     /* MSI data. */
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 07/10] xen/arm: Do not hardcode phycial PCI device addresses
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
                   ` (5 preceding siblings ...)
  2020-11-09 12:50 ` [PATCH 06/10] vpci: Make every domain handle its own BARs Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  2020-11-09 12:50 ` [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges Oleksandr Andrushchenko
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

As vPCI now takes care of the proper p2m mappings for PCI devices there
is no more need to hardcode anything.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/include/public/arch-arm.h | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/xen/include/public/arch-arm.h b/xen/include/public/arch-arm.h
index 2411ac9f7b0a..59baf1014fe3 100644
--- a/xen/include/public/arch-arm.h
+++ b/xen/include/public/arch-arm.h
@@ -444,15 +444,8 @@ typedef uint64_t xen_callback_t;
 #define GUEST_VPCI_MEM_CPU_ADDR           xen_mk_ullong(0x04020000)
 #define GUEST_VPCI_IO_CPU_ADDR            xen_mk_ullong(0xC0200800)
 
-/*
- * This is hardcoded values for the real PCI physical addresses.
- * This will be removed once we read the real PCI-PCIe physical
- * addresses form the config space and map to the guest memory map
- * when assigning the device to guest via VPCI.
- *
- */
 #define GUEST_VPCI_PREFETCH_MEM_PCI_ADDR  xen_mk_ullong(0x4000000000)
-#define GUEST_VPCI_MEM_PCI_ADDR           xen_mk_ullong(0x50000000)
+#define GUEST_VPCI_MEM_PCI_ADDR           xen_mk_ullong(0x04020000)
 #define GUEST_VPCI_IO_PCI_ADDR            xen_mk_ullong(0x00000000)
 
 #define GUEST_VPCI_PREFETCH_MEM_SIZE      xen_mk_ullong(0x100000000)
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
                   ` (6 preceding siblings ...)
  2020-11-09 12:50 ` [PATCH 07/10] xen/arm: Do not hardcode phycial PCI device addresses Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  2020-11-12  9:56   ` Roger Pau Monné
  2020-11-13 10:29   ` Jan Beulich
  2020-11-09 12:50 ` [PATCH 09/10] vpci/rcar: Implement vPCI.update_bar_header callback Oleksandr Andrushchenko
  2020-11-09 12:50 ` [PATCH 10/10] [HACK] vpci/rcar: Make vPCI know DomD is hardware domain Oleksandr Andrushchenko
  9 siblings, 2 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Non-ECAM host bridges in hwdom go directly to PCI config space,
not through vpci (they use their specific method for accessing PCI
configuration, e.g. dedicated registers etc.). Thus hwdom's vpci BARs are
never updated via vPCI MMIO handlers, so implement a dedicated method
for a PCI host bridge, so it has a chance to update the initial state of
the device BARs.

Note, we rely on the fact that control/hardware domain will not update
physical BAR locations for the given devices.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/arch/arm/pci/pci-host-common.c | 13 +++++++++++++
 xen/drivers/vpci/header.c          |  9 ++++++++-
 xen/include/asm-arm/pci.h          |  8 ++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/pci/pci-host-common.c b/xen/arch/arm/pci/pci-host-common.c
index b6c4d7b636b1..5f4239afa41f 100644
--- a/xen/arch/arm/pci/pci-host-common.c
+++ b/xen/arch/arm/pci/pci-host-common.c
@@ -250,6 +250,19 @@ int pci_host_bridge_update_mappings(struct domain *d)
     return pci_host_iterate_bridges(d, pci_host_bridge_update_mapping);
 }
 
+void pci_host_bridge_update_bar_header(const struct pci_dev *pdev,
+                                       struct vpci_header *header)
+{
+    struct pci_host_bridge *bridge;
+
+    bridge = pci_find_host_bridge(pdev->seg, pdev->bus);
+    if ( unlikely(!bridge) )
+        return;
+
+    if ( bridge->ops->update_bar_header )
+        bridge->ops->update_bar_header(pdev, header);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 7dc7c70e24f2..1f326c894d16 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -77,7 +77,14 @@ static struct vpci_header *get_vpci_header(struct domain *d,
     if ( !is_hardware_domain(d) )
     {
         struct vpci_header *hwdom_header = get_hwdom_vpci_header(pdev);
-
+#ifdef CONFIG_ARM
+        /*
+         * Non-ECAM host bridges in hwdom go directly to PCI
+         * config space, not through vpci. Thus hwdom's vpci BARs are
+         * never updated.
+         */
+        pci_host_bridge_update_bar_header(pdev, hwdom_header);
+#endif
         /* Make a copy of the hwdom's BARs as the initial state for vBARs. */
         memcpy(header, hwdom_header, sizeof(*header));
     }
diff --git a/xen/include/asm-arm/pci.h b/xen/include/asm-arm/pci.h
index d94e8a6628de..723b2a99b6e1 100644
--- a/xen/include/asm-arm/pci.h
+++ b/xen/include/asm-arm/pci.h
@@ -60,6 +60,9 @@ struct pci_config_window {
 /* Forward declaration as pci_host_bridge and pci_ops depend on each other. */
 struct pci_host_bridge;
 
+struct pci_dev;
+struct vpci_header;
+
 struct pci_ops {
     int (*read)(struct pci_host_bridge *bridge,
                     uint32_t sbdf, int where, int size, u32 *val);
@@ -69,6 +72,8 @@ struct pci_ops {
     int (*register_mmio_handler)(struct domain *d,
                                  struct pci_host_bridge *bridge,
                                  const struct mmio_handler_ops *ops);
+    void (*update_bar_header)(const struct pci_dev *pdev,
+                              struct vpci_header *header);
 };
 
 /*
@@ -110,6 +115,9 @@ int pci_host_iterate_bridges(struct domain *d,
                              int (*clb)(struct domain *d,
                                         struct pci_host_bridge *bridge));
 int pci_host_bridge_update_mappings(struct domain *d);
+void pci_host_bridge_update_bar_header(const struct pci_dev *pdev,
+                                       struct vpci_header *header);
+
 #else   /*!CONFIG_ARM_PCI*/
 struct arch_pci_dev { };
 static inline void  pci_init(void) { }
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 09/10] vpci/rcar: Implement vPCI.update_bar_header callback
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
                   ` (7 preceding siblings ...)
  2020-11-09 12:50 ` [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  2020-11-12 10:00   ` Roger Pau Monné
  2020-11-09 12:50 ` [PATCH 10/10] [HACK] vpci/rcar: Make vPCI know DomD is hardware domain Oleksandr Andrushchenko
  9 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Update hardware domain's BAR header as R-Car Gen3 is a non-ECAM host
controller, so vPCI MMIO handlers do not work for it in hwdom.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/arch/arm/pci/pci-host-rcar-gen3.c | 69 +++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/xen/arch/arm/pci/pci-host-rcar-gen3.c b/xen/arch/arm/pci/pci-host-rcar-gen3.c
index ec14bb29a38b..353ac2bfd6e6 100644
--- a/xen/arch/arm/pci/pci-host-rcar-gen3.c
+++ b/xen/arch/arm/pci/pci-host-rcar-gen3.c
@@ -23,6 +23,7 @@
 #include <xen/pci.h>
 #include <asm/pci.h>
 #include <xen/vmap.h>
+#include <xen/vpci.h>
 
 /* Error values that may be returned by PCI functions */
 #define PCIBIOS_SUCCESSFUL		0x00
@@ -307,12 +308,80 @@ int pci_rcar_gen3_config_write(struct pci_host_bridge *bridge, uint32_t _sbdf,
     return ret;
 }
 
+static void pci_rcar_gen3_hwbar_init(const struct pci_dev *pdev,
+                                     struct vpci_header *header)
+
+{
+    static bool once = true;
+    struct vpci_bar *bars = header->bars;
+    unsigned int num_bars;
+    int i;
+
+    /* Run only once. */
+    if (!once)
+        return;
+    once = false;
+
+    printk("\n\n ------------------------ %s -------------------\n", __func__);
+    switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
+    {
+    case PCI_HEADER_TYPE_NORMAL:
+        num_bars = PCI_HEADER_NORMAL_NR_BARS;
+        break;
+
+    case PCI_HEADER_TYPE_BRIDGE:
+        num_bars = PCI_HEADER_BRIDGE_NR_BARS;
+        break;
+
+    default:
+        return;
+    }
+
+    for ( i = 0; i < num_bars; i++ )
+    {
+        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
+
+        if ( bars[i].type == VPCI_BAR_MEM64_HI )
+        {
+            /*
+             * Skip hi part of the 64-bit register: it is read
+             * together with the lower part.
+             */
+            continue;
+        }
+
+        if ( bars[i].type == VPCI_BAR_IO )
+        {
+            /* Skip IO. */
+            continue;
+        }
+
+        if ( bars[i].type == VPCI_BAR_MEM64_LO )
+        {
+            /* Read both hi and lo parts of the 64-bit BAR. */
+            bars[i].addr =
+                (uint64_t)pci_conf_read32(pdev->sbdf, reg + 4) << 32 |
+                pci_conf_read32(pdev->sbdf, reg);
+        }
+        else if ( bars[i].type == VPCI_BAR_MEM32 )
+        {
+            bars[i].addr = pci_conf_read32(pdev->sbdf, reg);
+        }
+        else
+        {
+            /* Expansion ROM? */
+            continue;
+        }
+    }
+}
+
 /* R-Car Gen3 ops */
 static struct pci_ecam_ops pci_rcar_gen3_ops = {
     .bus_shift  = 20, /* FIXME: this is not used by RCar */
     .pci_ops    = {
         .read       = pci_rcar_gen3_config_read,
         .write      = pci_rcar_gen3_config_write,
+        .update_bar_header = pci_rcar_gen3_hwbar_init,
     }
 };
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 10/10] [HACK] vpci/rcar: Make vPCI know DomD is hardware domain
  2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
                   ` (8 preceding siblings ...)
  2020-11-09 12:50 ` [PATCH 09/10] vpci/rcar: Implement vPCI.update_bar_header callback Oleksandr Andrushchenko
@ 2020-11-09 12:50 ` Oleksandr Andrushchenko
  9 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-09 12:50 UTC (permalink / raw)
  To: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich, roger.pau,
	sstabellini, xen-devel
  Cc: iwj, wl, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/header.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 1f326c894d16..d5738ecca93d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -34,13 +34,19 @@ struct map_data {
     struct pci_dev *pdev;
 };
 
+static bool is_hardware_domain_DomD(const struct domain *d)
+{
+    return d->domain_id == 1;
+}
+
 static struct vpci_header *get_vpci_header(struct domain *d,
                                            const struct pci_dev *pdev);
 
 static struct vpci_header *get_hwdom_vpci_header(const struct pci_dev *pdev)
 {
+    /* TODO: this should be for the hardware_domain, not current->domain. */
     if ( unlikely(list_empty(&pdev->vpci->headers)) )
-        return get_vpci_header(hardware_domain, pdev);
+        return get_vpci_header(current->domain, pdev);
 
     /* hwdom's header is always the very first entry. */
     return list_first_entry(&pdev->vpci->headers, struct vpci_header, node);
@@ -74,7 +80,7 @@ static struct vpci_header *get_vpci_header(struct domain *d,
         return NULL;
     }
 
-    if ( !is_hardware_domain(d) )
+    if ( !is_hardware_domain_DomD(d) )
     {
         struct vpci_header *hwdom_header = get_hwdom_vpci_header(pdev);
 #ifdef CONFIG_ARM
@@ -304,7 +310,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     if ( !mem )
         return -ENOMEM;
 
-    if ( is_hardware_domain(current->domain) )
+    if ( is_hardware_domain_DomD(current->domain) )
         header = get_hwdom_vpci_header(pdev);
     else
         header = get_vpci_header(current->domain, pdev);
@@ -641,7 +647,7 @@ static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
 {
     struct vpci_bar *vbar, *bar = data;
 
-    if ( is_hardware_domain(current->domain) )
+    if ( is_hardware_domain_DomD(current->domain) )
         return bar_read_hwdom(pdev, reg, data);
 
     vbar = get_vpci_bar(current->domain, pdev, bar->index);
@@ -656,7 +662,7 @@ static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
 {
     struct vpci_bar *bar = data;
 
-    if ( is_hardware_domain(current->domain) )
+    if ( is_hardware_domain_DomD(current->domain) )
         bar_write_hwdom(pdev, reg, val, data);
     else
     {
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [SUSPECTED SPAM][PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM
  2020-11-09 12:50 ` [PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM Oleksandr Andrushchenko
@ 2020-11-11 12:31   ` Roger Pau Monné
  2020-11-11 13:10     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-11 12:31 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko

On Mon, Nov 09, 2020 at 02:50:22PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> According to https://wiki.xenproject.org/wiki/Linux_PVH:
> 
> Items not supported by PVH
>  - PCI pass through (as of Xen 4.10)
> 
> Allow running PCI remove code on ARM and do not assert for PVH domains.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
>  tools/libxl/Makefile    | 4 ++++
>  tools/libxl/libxl_pci.c | 4 +++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> index 241da7fff6f4..f3806aafcb4e 100644
> --- a/tools/libxl/Makefile
> +++ b/tools/libxl/Makefile
> @@ -130,6 +130,10 @@ endif
>  
>  LIBXL_LIBS += -lyajl
>  
> +ifeq ($(CONFIG_ARM),y)
> +CFALGS += -DCONFIG_ARM
> +endif
> +
>  LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
>  			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
>  			libxl_internal.o libxl_utils.o libxl_uuid.o \
> diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
> index bc5843b13701..b93cf976642b 100644
> --- a/tools/libxl/libxl_pci.c
> +++ b/tools/libxl/libxl_pci.c
> @@ -1915,8 +1915,10 @@ static void do_pci_remove(libxl__egc *egc, uint32_t domid,
>              goto out_fail;
>          }
>      } else {
> +        /* PCI passthrough can also run on ARM PVH */
> +#ifndef CONFIG_ARM
>          assert(type == LIBXL_DOMAIN_TYPE_PV);
> -
> +#endif

I would just remove the assert now if this is to be used by Arm and
you don't need to fork the file for Arm.

Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [SUSPECTED SPAM][PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM
  2020-11-11 12:31   ` [SUSPECTED SPAM][PATCH " Roger Pau Monné
@ 2020-11-11 13:10     ` Oleksandr Andrushchenko
  2020-11-11 13:55       ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-11 13:10 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko


On 11/11/20 2:31 PM, Roger Pau Monné wrote:
> On Mon, Nov 09, 2020 at 02:50:22PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> According to https://wiki.xenproject.org/wiki/Linux_PVH:
>>
>> Items not supported by PVH
>>   - PCI pass through (as of Xen 4.10)
>>
>> Allow running PCI remove code on ARM and do not assert for PVH domains.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>>   tools/libxl/Makefile    | 4 ++++
>>   tools/libxl/libxl_pci.c | 4 +++-
>>   2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
>> index 241da7fff6f4..f3806aafcb4e 100644
>> --- a/tools/libxl/Makefile
>> +++ b/tools/libxl/Makefile
>> @@ -130,6 +130,10 @@ endif
>>   
>>   LIBXL_LIBS += -lyajl
>>   
>> +ifeq ($(CONFIG_ARM),y)
>> +CFALGS += -DCONFIG_ARM
>> +endif
>> +
>>   LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
>>   			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
>>   			libxl_internal.o libxl_utils.o libxl_uuid.o \
>> diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
>> index bc5843b13701..b93cf976642b 100644
>> --- a/tools/libxl/libxl_pci.c
>> +++ b/tools/libxl/libxl_pci.c
>> @@ -1915,8 +1915,10 @@ static void do_pci_remove(libxl__egc *egc, uint32_t domid,
>>               goto out_fail;
>>           }
>>       } else {
>> +        /* PCI passthrough can also run on ARM PVH */
>> +#ifndef CONFIG_ARM
>>           assert(type == LIBXL_DOMAIN_TYPE_PV);
>> -
>> +#endif
> I would just remove the assert now if this is to be used by Arm and
> you don't need to fork the file for Arm.

Sounds good, I will drop then

But what would be the right explanation then? I mean why there was an ASSERT

and now it is safe (for x86) to remove that?

>
> Roger.

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] arm/pci: Maintain PCI assignable list
  2020-11-09 12:50 ` [PATCH 02/10] arm/pci: Maintain PCI assignable list Oleksandr Andrushchenko
@ 2020-11-11 13:53   ` Roger Pau Monné
  2020-11-11 14:38     ` Oleksandr Andrushchenko
  2020-11-11 14:54   ` Jan Beulich
  1 sibling, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-11 13:53 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko

On Mon, Nov 09, 2020 at 02:50:23PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> The original code depends on pciback to manage assignable device list.
> The functionality which is implemented by the pciback and the toolstack
> and which is relevant/missing/needed for ARM:
> 
> 1. pciback is used as a database for assignable PCI devices, e.g. xl
>    pci-assignable-{add|remove|list} manipulates that list. So, whenever the
>    toolstack needs to know which PCI devices can be passed through it reads
>    that from the relevant sysfs entries of the pciback.
> 
> 2. pciback is used to hold the unbound PCI devices, e.g. when passing through
>    a PCI device it needs to be unbound from the relevant device driver and bound
>    to pciback (strictly speaking it is not required that the device is bound to
>    pciback, but pciback is again used as a database of the passed through PCI
>    devices, so we can re-bind the devices back to their original drivers when
>    guest domain shuts down)
> 
> 1. As ARM doesn't use pciback implement the above with additional sysctls:
>  - XEN_SYSCTL_pci_device_set_assigned

I don't see the point in having this sysfs, Xen already knows when a
device is assigned because the XEN_DOMCTL_assign_device hypercall is
used.

>  - XEN_SYSCTL_pci_device_get_assigned
>  - XEN_SYSCTL_pci_device_enum_assigned
> 2. Extend struct pci_dev to hold assignment state.

I'm not really found of this, the hypervisor is no place to store a
database like this, unless it's strictly needed.

IMO the right implementation here would be to split Linux pciback into
two different drivers:

 - The pv-pci backend for doing passthrough to classic PV guests.
 - The rest of pciback: device reset, hand-holding driver for devices
   to be assigned and database.

I think there must be something similar in KVM that performs the tasks
of my last point, maybe we could piggyback on it?

If we want to go the route proposed by this patch, ie: Xen performing
the functions of pciback you would also have to move the PCI reset
code to Xen, so that you can fully manage the PCI devices from Xen.

> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
>  tools/libxc/include/xenctrl.h |   9 +++
>  tools/libxc/xc_domain.c       |   1 +
>  tools/libxc/xc_misc.c         |  46 +++++++++++++++
>  tools/libxl/Makefile          |   4 ++
>  tools/libxl/libxl_pci.c       | 105 ++++++++++++++++++++++++++++++++--
>  xen/arch/arm/sysctl.c         |  66 ++++++++++++++++++++-
>  xen/drivers/passthrough/pci.c |  93 ++++++++++++++++++++++++++++++
>  xen/include/public/sysctl.h   |  40 +++++++++++++
>  xen/include/xen/pci.h         |  12 ++++
>  9 files changed, 370 insertions(+), 6 deletions(-)

I've done some light review below given my questions above.

> diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
> index 4c89b7294c4f..77029013da7d 100644
> --- a/tools/libxc/include/xenctrl.h
> +++ b/tools/libxc/include/xenctrl.h
> @@ -2652,6 +2652,15 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout, uint32
>  int xc_domain_cacheflush(xc_interface *xch, uint32_t domid,
>                           xen_pfn_t start_pfn, xen_pfn_t nr_pfns);
>  
> +typedef xen_sysctl_pci_device_enum_assigned_t xc_pci_device_enum_assigned_t;
> +
> +int xc_pci_device_set_assigned(xc_interface *xch, uint32_t machine_sbdf,
> +                               bool assigned);
> +int xc_pci_device_get_assigned(xc_interface *xch, uint32_t machine_sbdf);
> +
> +int xc_pci_device_enum_assigned(xc_interface *xch,
> +                                xc_pci_device_enum_assigned_t *e);
> +
>  /* Compat shims */
>  #include "xenctrl_compat.h"
>  
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> index 71829c2bce3e..d515191e9243 100644
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -2321,6 +2321,7 @@ int xc_domain_soft_reset(xc_interface *xch,
>      domctl.domain = domid;
>      return do_domctl(xch, &domctl);
>  }
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
> index 3820394413a9..d439c4ba1019 100644
> --- a/tools/libxc/xc_misc.c
> +++ b/tools/libxc/xc_misc.c
> @@ -988,6 +988,52 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout, uint32
>      return _xc_livepatch_action(xch, name, LIVEPATCH_ACTION_REPLACE, timeout, flags);
>  }
>  
> +int xc_pci_device_set_assigned(
> +    xc_interface *xch,
> +    uint32_t machine_sbdf,
> +    bool assigned)
> +{
> +    DECLARE_SYSCTL;
> +
> +    sysctl.cmd = XEN_SYSCTL_pci_device_set_assigned;
> +    sysctl.u.pci_set_assigned.machine_sbdf = machine_sbdf;
> +    sysctl.u.pci_set_assigned.assigned = assigned;
> +
> +    return do_sysctl(xch, &sysctl);
> +}
> +
> +int xc_pci_device_get_assigned(
> +    xc_interface *xch,
> +    uint32_t machine_sbdf)
> +{
> +    DECLARE_SYSCTL;
> +
> +    sysctl.cmd = XEN_SYSCTL_pci_device_get_assigned;
> +    sysctl.u.pci_get_assigned.machine_sbdf = machine_sbdf;
> +
> +    return do_sysctl(xch, &sysctl);
> +}
> +
> +int xc_pci_device_enum_assigned(xc_interface *xch,
> +                                xc_pci_device_enum_assigned_t *e)
> +{
> +    int ret;
> +    DECLARE_SYSCTL;
> +
> +    sysctl.cmd = XEN_SYSCTL_pci_device_enum_assigned;
> +    sysctl.u.pci_enum_assigned.idx = e->idx;
> +    sysctl.u.pci_enum_assigned.report_not_assigned = e->report_not_assigned;
> +    ret = do_sysctl(xch, &sysctl);
> +    if ( ret )
> +        errno = EINVAL;
> +    else
> +    {
> +        e->domain = sysctl.u.pci_enum_assigned.domain;
> +        e->machine_sbdf = sysctl.u.pci_enum_assigned.machine_sbdf;
> +    }
> +    return ret;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> index f3806aafcb4e..6f76ba35aec7 100644
> --- a/tools/libxl/Makefile
> +++ b/tools/libxl/Makefile
> @@ -130,6 +130,10 @@ endif
>  
>  LIBXL_LIBS += -lyajl
>  
> +ifeq ($(CONFIG_X86),y)
> +CFALGS += -DCONFIG_PCIBACK
> +endif
> +
>  ifeq ($(CONFIG_ARM),y)
>  CFALGS += -DCONFIG_ARM
>  endif
> diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
> index b93cf976642b..41f89b8aae10 100644
> --- a/tools/libxl/libxl_pci.c
> +++ b/tools/libxl/libxl_pci.c
> @@ -319,6 +319,7 @@ retry_transaction2:
>  
>  static int get_all_assigned_devices(libxl__gc *gc, libxl_device_pci **list, int *num)
>  {
> +#ifdef CONFIG_PCIBACK
>      char **domlist;
>      unsigned int nd = 0, i;
>  
> @@ -356,6 +357,33 @@ static int get_all_assigned_devices(libxl__gc *gc, libxl_device_pci **list, int
>              }
>          }
>      }
> +#else
> +    libxl_ctx *ctx = libxl__gc_owner(gc);
> +    int ret;
> +    xc_pci_device_enum_assigned_t e;
> +
> +    *list = NULL;
> +    *num = 0;
> +
> +    memset(&e, 0, sizeof(e));
> +    do {
> +        ret = xc_pci_device_enum_assigned(ctx->xch, &e);
> +        if ( ret && errno == EINVAL )
> +            break;
> +        *list = realloc(*list, sizeof(libxl_device_pci) * (e.idx + 1));
> +        if (*list == NULL)
> +            return ERROR_NOMEM;
> +
> +        pcidev_struct_fill(*list + e.idx,
> +                           e.domain,
> +                           e.machine_sbdf >> 8 & 0xff,
> +                           PCI_SLOT(e.machine_sbdf),
> +                           PCI_FUNC(e.machine_sbdf),
> +                           0 /*vdevfn*/);
> +        e.idx++;
> +    } while (!ret);
> +    *num = e.idx;
> +#endif

I don't think the amount of ifdefs added to this file is acceptable.
If we have to go that route this needs to be split into a different
file, and maybe some of the common bits abstracted together to prevent
code repetition.

>      libxl__ptr_add(gc, *list);
>  
>      return 0;
> @@ -411,13 +439,20 @@ static int sysfs_write_bdf(libxl__gc *gc, const char * sysfs_path,
>  libxl_device_pci *libxl_device_pci_assignable_list(libxl_ctx *ctx, int *num)
>  {
>      GC_INIT(ctx);
> -    libxl_device_pci *pcidevs = NULL, *new, *assigned;
> +    libxl_device_pci *pcidevs = NULL, *new;
> +    int r;
> +#ifdef CONFIG_PCIBACK
> +    libxl_device_pci *assigned;
> +    int num_assigned;
>      struct dirent *de;
>      DIR *dir;
> -    int r, num_assigned;
> +#else
> +    xc_pci_device_enum_assigned_t e;
> +#endif
>  
>      *num = 0;
>  
> +#ifdef CONFIG_PCIBACK
>      r = get_all_assigned_devices(gc, &assigned, &num_assigned);
>      if (r) goto out;
>  
> @@ -453,6 +488,32 @@ libxl_device_pci *libxl_device_pci_assignable_list(libxl_ctx *ctx, int *num)
>  
>      closedir(dir);
>  out:
> +#else
> +    memset(&e, 0, sizeof(e));
> +    e.report_not_assigned = 1;
> +    do {
> +        r = xc_pci_device_enum_assigned(ctx->xch, &e);
> +        if ( r && errno == EINVAL )
> +            break;
> +        new = realloc(pcidevs, (e.idx + 1) * sizeof(*new));
> +        if (NULL == new)
> +            continue;
> +
> +        pcidevs = new;
> +        new = pcidevs + e.idx;
> +
> +        memset(new, 0, sizeof(*new));
> +
> +        pcidev_struct_fill(new,
> +                           e.domain,
> +                           e.machine_sbdf >> 8 & 0xff,
> +                           PCI_SLOT(e.machine_sbdf),
> +                           PCI_FUNC(e.machine_sbdf),
> +                           0 /*vdevfn*/);
> +        e.idx++;
> +    } while (!r);
> +    *num = e.idx;
> +#endif
>      GC_FREE;
>      return pcidevs;
>  }
> @@ -606,6 +667,7 @@ bool libxl__is_igd_vga_passthru(libxl__gc *gc,
>      return false;
>  }
>  
> +#ifdef CONFIG_PCIBACK
>  /*
>   * A brief comment about slots.  I don't know what slots are for; however,
>   * I have by experimentation determined:
> @@ -648,11 +710,13 @@ out:
>      fclose(f);
>      return rc;
>  }
> +#endif
>  
>  static int pciback_dev_is_assigned(libxl__gc *gc, libxl_device_pci *pcidev)
>  {
> -    char * spath;
>      int rc;
> +#ifdef CONFIG_PCIBACK
> +    char * spath;
>      struct stat st;
>  
>      if ( access(SYSFS_PCIBACK_DRIVER, F_OK) < 0 ) {
> @@ -663,22 +727,27 @@ static int pciback_dev_is_assigned(libxl__gc *gc, libxl_device_pci *pcidev)
>          }
>          return -1;
>      }
> -
>      spath = GCSPRINTF(SYSFS_PCIBACK_DRIVER"/"PCI_BDF,
>                        pcidev->domain, pcidev->bus,
>                        pcidev->dev, pcidev->func);
>      rc = lstat(spath, &st);
> -
>      if( rc == 0 )
>          return 1;
>      if ( rc < 0 && errno == ENOENT )
>          return 0;
>      LOGE(ERROR, "Accessing %s", spath);
>      return -1;
> +#else
> +    libxl_ctx *ctx = libxl__gc_owner(gc);
> +
> +    rc = xc_pci_device_get_assigned(ctx->xch, pcidev_encode_bdf(pcidev));
> +    return rc == 0 ? 1 : 0;
> +#endif
>  }
>  
>  static int pciback_dev_assign(libxl__gc *gc, libxl_device_pci *pcidev)
>  {
> +#ifdef CONFIG_PCIBACK
>      int rc;
>  
>      if ( (rc=pciback_dev_has_slot(gc, pcidev)) < 0 ) {
> @@ -697,10 +766,17 @@ static int pciback_dev_assign(libxl__gc *gc, libxl_device_pci *pcidev)
>          return ERROR_FAIL;
>      }
>      return 0;
> +#else
> +    libxl_ctx *ctx = libxl__gc_owner(gc);
> +
> +    return xc_pci_device_set_assigned(ctx->xch, pcidev_encode_bdf(pcidev),
> +                                      true);
> +#endif
>  }
>  
>  static int pciback_dev_unassign(libxl__gc *gc, libxl_device_pci *pcidev)
>  {
> +#ifdef CONFIG_PCIBACK
>      /* Remove from pciback */
>      if ( sysfs_dev_unbind(gc, pcidev, NULL) < 0 ) {
>          LOG(ERROR, "Couldn't unbind device!");
> @@ -716,6 +792,12 @@ static int pciback_dev_unassign(libxl__gc *gc, libxl_device_pci *pcidev)
>          }
>      }
>      return 0;
> +#else
> +    libxl_ctx *ctx = libxl__gc_owner(gc);
> +
> +    return xc_pci_device_set_assigned(ctx->xch, pcidev_encode_bdf(pcidev),
> +                                      false);
> +#endif
>  }
>  
>  #define PCIBACK_INFO_PATH "/libxl/pciback"
> @@ -780,10 +862,15 @@ static int libxl__device_pci_assignable_add(libxl__gc *gc,
>  
>      /* See if the device exists */
>      spath = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF, dom, bus, dev, func);
> +#ifdef CONFIG_PCI_SYSFS_DOM0
>      if ( lstat(spath, &st) ) {
>          LOGE(ERROR, "Couldn't lstat %s", spath);
>          return ERROR_FAIL;
>      }
> +#else
> +    (void)st;
> +    printf("IMPLEMENT_ME: %s lstat %s\n", __func__, spath);
> +#endif
>  
>      /* Check to see if it's already assigned to pciback */
>      rc = pciback_dev_is_assigned(gc, pcidev);
> @@ -1350,8 +1437,12 @@ static void pci_add_dm_done(libxl__egc *egc,
>  
>      if (f == NULL) {
>          LOGED(ERROR, domainid, "Couldn't open %s", sysfs_path);
> +#ifdef CONFIG_PCI_SYSFS_DOM0
>          rc = ERROR_FAIL;
>          goto out;
> +#else
> +        goto out_no_irq;
> +#endif
>      }
>      for (i = 0; i < PROC_PCI_NUM_RESOURCES; i++) {
>          if (fscanf(f, "0x%llx 0x%llx 0x%llx\n", &start, &end, &flags) != 3)
> @@ -1522,7 +1613,11 @@ static int libxl_pcidev_assignable(libxl_ctx *ctx, libxl_device_pci *pcidev)
>              break;
>      }
>      free(pcidevs);
> +#ifdef CONFIG_PCIBACK
>      return i != num;
> +#else
> +    return 1;
> +#endif
>  }
>  
>  static void device_pci_add_stubdom_wait(libxl__egc *egc,
> diff --git a/xen/arch/arm/sysctl.c b/xen/arch/arm/sysctl.c
> index f87944e8473c..84e933b2eb45 100644
> --- a/xen/arch/arm/sysctl.c
> +++ b/xen/arch/arm/sysctl.c
> @@ -10,6 +10,7 @@
>  #include <xen/lib.h>
>  #include <xen/errno.h>
>  #include <xen/hypercall.h>
> +#include <xen/guest_access.h>
>  #include <public/sysctl.h>
>  
>  void arch_do_physinfo(struct xen_sysctl_physinfo *pi)
> @@ -20,7 +21,70 @@ void arch_do_physinfo(struct xen_sysctl_physinfo *pi)
>  long arch_do_sysctl(struct xen_sysctl *sysctl,
>                      XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>  {
> -    return -ENOSYS;
> +    long ret = 0;
> +    bool copyback = 0;
> +
> +    switch ( sysctl->cmd )
> +    {
> +    case XEN_SYSCTL_pci_device_set_assigned:
> +    {
> +        u16 seg;
> +        u8 bus, devfn;
> +        uint32_t machine_sbdf;
> +
> +        machine_sbdf = sysctl->u.pci_set_assigned.machine_sbdf;
> +
> +#if 0
> +        ret = xsm_pci_device_set_assigned(XSM_HOOK, d);
> +        if ( ret )
> +            break;
> +#endif
> +
> +        seg = machine_sbdf >> 16;
> +        bus = PCI_BUS(machine_sbdf);
> +        devfn = PCI_DEVFN2(machine_sbdf);
> +
> +        pcidevs_lock();
> +        ret = pci_device_set_assigned(seg, bus, devfn,
> +                                      !!sysctl->u.pci_set_assigned.assigned);
> +        pcidevs_unlock();
> +        break;
> +    }
> +    case XEN_SYSCTL_pci_device_get_assigned:
> +    {
> +        u16 seg;
> +        u8 bus, devfn;
> +        uint32_t machine_sbdf;
> +
> +        machine_sbdf = sysctl->u.pci_set_assigned.machine_sbdf;
> +
> +        seg = machine_sbdf >> 16;
> +        bus = PCI_BUS(machine_sbdf);
> +        devfn = PCI_DEVFN2(machine_sbdf);
> +
> +        pcidevs_lock();
> +        ret = pci_device_get_assigned(seg, bus, devfn);
> +        pcidevs_unlock();
> +        break;
> +    }
> +    case XEN_SYSCTL_pci_device_enum_assigned:
> +    {
> +        ret = pci_device_enum_assigned(sysctl->u.pci_enum_assigned.report_not_assigned,
> +                                       sysctl->u.pci_enum_assigned.idx,
> +                                       &sysctl->u.pci_enum_assigned.domain,
> +                                       &sysctl->u.pci_enum_assigned.machine_sbdf);
> +        copyback = 1;
> +        break;
> +    }
> +    default:
> +        ret = -ENOSYS;
> +        break;
> +    }
> +    if ( copyback && (!ret || copyback > 0) &&
> +         __copy_to_guest(u_sysctl, sysctl, 1) )
> +        ret = -EFAULT;
> +
> +    return ret;
>  }
>  
>  /*
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 98e8a2fade60..49b4279c63bd 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -879,6 +879,43 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      return ret;
>  }
>  
> +#ifdef CONFIG_ARM
> +int pci_device_set_assigned(u16 seg, u8 bus, u8 devfn, bool assigned)
> +{
> +    struct pci_dev *pdev;
> +
> +    pdev = pci_get_pdev(seg, bus, devfn);
> +    if ( !pdev )
> +    {
> +        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));

Take a look at pci_sbdf_t, you should use it as the parameter and in
order to print the SBDF (%pp).

> +        return -ENODEV;
> +    }
> +
> +    pdev->assigned = assigned;
> +    printk(XENLOG_ERR "pciback %sassign PCI device %04x:%02x:%02x.%u\n",
> +           assigned ? "" : "de-",
> +           seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> +
> +    return 0;
> +}
> +
> +int pci_device_get_assigned(u16 seg, u8 bus, u8 devfn)
> +{
> +    struct pci_dev *pdev;
> +
> +    pdev = pci_get_pdev(seg, bus, devfn);
> +    if ( !pdev )
> +    {
> +        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> +        return -ENODEV;
> +    }
> +
> +    return pdev->assigned ? 0 : -ENODEV;
> +}
> +#endif
> +
>  #ifndef CONFIG_ARM
>  /*TODO :Implement MSI support for ARM  */
>  static int pci_clean_dpci_irq(struct domain *d,
> @@ -1821,6 +1858,62 @@ int iommu_do_pci_domctl(
>      return ret;
>  }
>  
> +#ifdef CONFIG_ARM
> +struct list_assigned {
> +    uint32_t cur_idx;
> +    uint32_t from_idx;
> +    bool assigned;
> +    domid_t *domain;
> +    uint32_t *machine_sbdf;
> +};
> +
> +static int _enum_assigned_pci_devices(struct pci_seg *pseg, void *arg)
> +{
> +    struct list_assigned *ctxt = arg;
> +    struct pci_dev *pdev;
> +
> +    list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
> +    {
> +        if ( pdev->assigned == ctxt->assigned )
> +        {
> +            if ( ctxt->cur_idx == ctxt->from_idx )
> +            {
> +                *ctxt->domain = pdev->domain->domain_id;
> +                *ctxt->machine_sbdf = pdev->sbdf.sbdf;
> +                return 1;
> +            }
> +            ctxt->cur_idx++;
> +        }
> +    }
> +    return 0;
> +}
> +
> +int pci_device_enum_assigned(bool report_not_assigned,
> +                             uint32_t from_idx, domid_t *domain,
> +                             uint32_t *machine_sbdf)
> +{
> +    struct list_assigned ctxt = {
> +        .assigned = !report_not_assigned,
> +        .cur_idx = 0,
> +        .from_idx = from_idx,
> +        .domain = domain,
> +        .machine_sbdf = machine_sbdf,
> +    };
> +    int ret;
> +
> +    pcidevs_lock();
> +    ret = pci_segments_iterate(_enum_assigned_pci_devices, &ctxt);
> +    pcidevs_unlock();
> +    /*
> +     * If not found then report as EINVAL to mark
> +     * enumeration process finished.
> +     */
> +    if ( !ret )
> +        return -EINVAL;
> +    return 0;
> +}
> +#endif
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
> index a07364711794..5ca73c538688 100644
> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -1062,6 +1062,40 @@ typedef struct xen_sysctl_cpu_policy xen_sysctl_cpu_policy_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpu_policy_t);
>  #endif
>  
> +/*
> + * These are to emulate pciback device (de-)assignment used by the tools
> + * to track current device assignments: all the PCI devices that can
> + * be passed through must be assigned to the pciback to mark them
> + * as such. As on ARM we do not run pci{back|front} and are emulating
> + * PCI host bridge in Xen, so we need to maintain the assignments on our
> + * own in Xen itself.
> + *
> + * Note on xen_sysctl_pci_device_get_assigned: ENOENT is used to report
> + * that there are no assigned devices left.
> + */
> +struct xen_sysctl_pci_device_set_assigned {
> +    /* IN */
> +    /* FIXME: is this really a machine SBDF or as Domain-0 sees it? */
> +    uint32_t machine_sbdf;

I think you need to make it clear that when running on Xen dom0 (or
the hardware domain) should _never_ change the enumeration of devices,
or else none of this will work.

> +    uint8_t assigned;
> +};
> +
> +struct xen_sysctl_pci_device_get_assigned {
> +    /* IN */
> +    uint32_t machine_sbdf;
> +};
> +
> +struct xen_sysctl_pci_device_enum_assigned {
> +    /* IN */
> +    uint32_t idx;
> +    uint8_t report_not_assigned;
> +    /* OUT */
> +    domid_t domain;
> +    uint32_t machine_sbdf;
> +};
> +typedef struct xen_sysctl_pci_device_enum_assigned xen_sysctl_pci_device_enum_assigned_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_sysctl_pci_device_enum_assigned_t);
> +
>  struct xen_sysctl {
>      uint32_t cmd;
>  #define XEN_SYSCTL_readconsole                    1
> @@ -1092,6 +1126,9 @@ struct xen_sysctl {
>  #define XEN_SYSCTL_livepatch_op                  27
>  /* #define XEN_SYSCTL_set_parameter              28 */
>  #define XEN_SYSCTL_get_cpu_policy                29
> +#define XEN_SYSCTL_pci_device_set_assigned       30
> +#define XEN_SYSCTL_pci_device_get_assigned       31
> +#define XEN_SYSCTL_pci_device_enum_assigned      32
>      uint32_t interface_version; /* XEN_SYSCTL_INTERFACE_VERSION */
>      union {
>          struct xen_sysctl_readconsole       readconsole;
> @@ -1122,6 +1159,9 @@ struct xen_sysctl {
>  #if defined(__i386__) || defined(__x86_64__)
>          struct xen_sysctl_cpu_policy        cpu_policy;
>  #endif
> +        struct xen_sysctl_pci_device_set_assigned pci_set_assigned;
> +        struct xen_sysctl_pci_device_get_assigned pci_get_assigned;
> +        struct xen_sysctl_pci_device_enum_assigned pci_enum_assigned;
>          uint8_t                             pad[128];
>      } u;
>  };
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index 2bc4aaf4530c..7bf439de4de0 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -132,6 +132,13 @@ struct pci_dev {
>  
>      /* Data for vPCI. */
>      struct vpci *vpci;
> +#ifdef CONFIG_ARM
> +    /*
> +     * Set if this PCI device is eligible for pass through,
> +     * e.g. just like it was assigned to pciback driver.
> +     */
> +    bool assigned;

You can see whether a device is assigned or not by looking at the
domain field AFAICT.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [SUSPECTED SPAM][PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM
  2020-11-11 13:10     ` Oleksandr Andrushchenko
@ 2020-11-11 13:55       ` Roger Pau Monné
  2020-11-11 14:12         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-11 13:55 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko

On Wed, Nov 11, 2020 at 01:10:01PM +0000, Oleksandr Andrushchenko wrote:
> 
> On 11/11/20 2:31 PM, Roger Pau Monné wrote:
> > On Mon, Nov 09, 2020 at 02:50:22PM +0200, Oleksandr Andrushchenko wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>
> >> According to https://wiki.xenproject.org/wiki/Linux_PVH:
> >>
> >> Items not supported by PVH
> >>   - PCI pass through (as of Xen 4.10)
> >>
> >> Allow running PCI remove code on ARM and do not assert for PVH domains.
> >>
> >> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> ---
> >>   tools/libxl/Makefile    | 4 ++++
> >>   tools/libxl/libxl_pci.c | 4 +++-
> >>   2 files changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> >> index 241da7fff6f4..f3806aafcb4e 100644
> >> --- a/tools/libxl/Makefile
> >> +++ b/tools/libxl/Makefile
> >> @@ -130,6 +130,10 @@ endif
> >>   
> >>   LIBXL_LIBS += -lyajl
> >>   
> >> +ifeq ($(CONFIG_ARM),y)
> >> +CFALGS += -DCONFIG_ARM
> >> +endif
> >> +
> >>   LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
> >>   			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
> >>   			libxl_internal.o libxl_utils.o libxl_uuid.o \
> >> diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
> >> index bc5843b13701..b93cf976642b 100644
> >> --- a/tools/libxl/libxl_pci.c
> >> +++ b/tools/libxl/libxl_pci.c
> >> @@ -1915,8 +1915,10 @@ static void do_pci_remove(libxl__egc *egc, uint32_t domid,
> >>               goto out_fail;
> >>           }
> >>       } else {
> >> +        /* PCI passthrough can also run on ARM PVH */
> >> +#ifndef CONFIG_ARM
> >>           assert(type == LIBXL_DOMAIN_TYPE_PV);
> >> -
> >> +#endif
> > I would just remove the assert now if this is to be used by Arm and
> > you don't need to fork the file for Arm.
> 
> Sounds good, I will drop then
> 
> But what would be the right explanation then? I mean why there was an ASSERT
> 
> and now it is safe (for x86) to remove that?

An assert is just a safe belt, the expectation is that it's never hit
by actual code. Given that this path will now also be used by PVH
(even if only on Arm) I don't see the point in keeping the assert, and
making it conditional to != Arm seems worse than just dropping it.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [SUSPECTED SPAM][PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM
  2020-11-11 13:55       ` Roger Pau Monné
@ 2020-11-11 14:12         ` Oleksandr Andrushchenko
  2020-11-11 14:21           ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-11 14:12 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko


On 11/11/20 3:55 PM, Roger Pau Monné wrote:
> On Wed, Nov 11, 2020 at 01:10:01PM +0000, Oleksandr Andrushchenko wrote:
>> On 11/11/20 2:31 PM, Roger Pau Monné wrote:
>>> On Mon, Nov 09, 2020 at 02:50:22PM +0200, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> According to https://urldefense.com/v3/__https://wiki.xenproject.org/wiki/Linux_PVH__;!!GF_29dbcQIUBPA!nEHd6eivmqtdJxtrhO-3x2Mz9F50JsKUoV7WTEJd_D1N01DrBOJXzGW1QAqwshZ9AMxywbUhOA$ [wiki[.]xenproject[.]org]:
>>>>
>>>> Items not supported by PVH
>>>>    - PCI pass through (as of Xen 4.10)
>>>>
>>>> Allow running PCI remove code on ARM and do not assert for PVH domains.
>>>>
>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>> ---
>>>>    tools/libxl/Makefile    | 4 ++++
>>>>    tools/libxl/libxl_pci.c | 4 +++-
>>>>    2 files changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
>>>> index 241da7fff6f4..f3806aafcb4e 100644
>>>> --- a/tools/libxl/Makefile
>>>> +++ b/tools/libxl/Makefile
>>>> @@ -130,6 +130,10 @@ endif
>>>>    
>>>>    LIBXL_LIBS += -lyajl
>>>>    
>>>> +ifeq ($(CONFIG_ARM),y)
>>>> +CFALGS += -DCONFIG_ARM
>>>> +endif
>>>> +
>>>>    LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
>>>>    			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
>>>>    			libxl_internal.o libxl_utils.o libxl_uuid.o \
>>>> diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
>>>> index bc5843b13701..b93cf976642b 100644
>>>> --- a/tools/libxl/libxl_pci.c
>>>> +++ b/tools/libxl/libxl_pci.c
>>>> @@ -1915,8 +1915,10 @@ static void do_pci_remove(libxl__egc *egc, uint32_t domid,
>>>>                goto out_fail;
>>>>            }
>>>>        } else {
>>>> +        /* PCI passthrough can also run on ARM PVH */
>>>> +#ifndef CONFIG_ARM
>>>>            assert(type == LIBXL_DOMAIN_TYPE_PV);
>>>> -
>>>> +#endif
>>> I would just remove the assert now if this is to be used by Arm and
>>> you don't need to fork the file for Arm.
>> Sounds good, I will drop then
>>
>> But what would be the right explanation then? I mean why there was an ASSERT
>>
>> and now it is safe (for x86) to remove that?
> An assert is just a safe belt, the expectation is that it's never hit
> by actual code. Given that this path will now also be used by PVH
> (even if only on Arm) I don't see the point in keeping the assert, and
> making it conditional to != Arm seems worse than just dropping it.

Ok, so I can write in the patch description something like:

"this path is now used by PVH, so the assert is no longer valid"

Does it sound ok?

> Thanks, Roger.

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [SUSPECTED SPAM][PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM
  2020-11-11 14:12         ` Oleksandr Andrushchenko
@ 2020-11-11 14:21           ` Roger Pau Monné
  0 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-11 14:21 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko

On Wed, Nov 11, 2020 at 02:12:56PM +0000, Oleksandr Andrushchenko wrote:
> 
> On 11/11/20 3:55 PM, Roger Pau Monné wrote:
> > On Wed, Nov 11, 2020 at 01:10:01PM +0000, Oleksandr Andrushchenko wrote:
> >> On 11/11/20 2:31 PM, Roger Pau Monné wrote:
> >>> On Mon, Nov 09, 2020 at 02:50:22PM +0200, Oleksandr Andrushchenko wrote:
> >>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>>>
> >>>> According to https://urldefense.com/v3/__https://wiki.xenproject.org/wiki/Linux_PVH__;!!GF_29dbcQIUBPA!nEHd6eivmqtdJxtrhO-3x2Mz9F50JsKUoV7WTEJd_D1N01DrBOJXzGW1QAqwshZ9AMxywbUhOA$ [wiki[.]xenproject[.]org]:
> >>>>
> >>>> Items not supported by PVH
> >>>>    - PCI pass through (as of Xen 4.10)
> >>>>
> >>>> Allow running PCI remove code on ARM and do not assert for PVH domains.
> >>>>
> >>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>>> ---
> >>>>    tools/libxl/Makefile    | 4 ++++
> >>>>    tools/libxl/libxl_pci.c | 4 +++-
> >>>>    2 files changed, 7 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> >>>> index 241da7fff6f4..f3806aafcb4e 100644
> >>>> --- a/tools/libxl/Makefile
> >>>> +++ b/tools/libxl/Makefile
> >>>> @@ -130,6 +130,10 @@ endif
> >>>>    
> >>>>    LIBXL_LIBS += -lyajl
> >>>>    
> >>>> +ifeq ($(CONFIG_ARM),y)
> >>>> +CFALGS += -DCONFIG_ARM
> >>>> +endif
> >>>> +
> >>>>    LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
> >>>>    			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
> >>>>    			libxl_internal.o libxl_utils.o libxl_uuid.o \
> >>>> diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
> >>>> index bc5843b13701..b93cf976642b 100644
> >>>> --- a/tools/libxl/libxl_pci.c
> >>>> +++ b/tools/libxl/libxl_pci.c
> >>>> @@ -1915,8 +1915,10 @@ static void do_pci_remove(libxl__egc *egc, uint32_t domid,
> >>>>                goto out_fail;
> >>>>            }
> >>>>        } else {
> >>>> +        /* PCI passthrough can also run on ARM PVH */
> >>>> +#ifndef CONFIG_ARM
> >>>>            assert(type == LIBXL_DOMAIN_TYPE_PV);
> >>>> -
> >>>> +#endif
> >>> I would just remove the assert now if this is to be used by Arm and
> >>> you don't need to fork the file for Arm.
> >> Sounds good, I will drop then
> >>
> >> But what would be the right explanation then? I mean why there was an ASSERT
> >>
> >> and now it is safe (for x86) to remove that?
> > An assert is just a safe belt, the expectation is that it's never hit
> > by actual code. Given that this path will now also be used by PVH
> > (even if only on Arm) I don't see the point in keeping the assert, and
> > making it conditional to != Arm seems worse than just dropping it.
> 
> Ok, so I can write in the patch description something like:
> 
> "this path is now used by PVH, so the assert is no longer valid"
> 
> Does it sound ok?

LGTM.

Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] arm/pci: Maintain PCI assignable list
  2020-11-11 13:53   ` Roger Pau Monné
@ 2020-11-11 14:38     ` Oleksandr Andrushchenko
  2020-11-11 15:03       ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-11 14:38 UTC (permalink / raw)
  To: Roger Pau Monné, Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl

On 11/11/20 3:53 PM, Roger Pau Monné wrote:
> On Mon, Nov 09, 2020 at 02:50:23PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> The original code depends on pciback to manage assignable device list.
>> The functionality which is implemented by the pciback and the toolstack
>> and which is relevant/missing/needed for ARM:
>>
>> 1. pciback is used as a database for assignable PCI devices, e.g. xl
>>     pci-assignable-{add|remove|list} manipulates that list. So, whenever the
>>     toolstack needs to know which PCI devices can be passed through it reads
>>     that from the relevant sysfs entries of the pciback.
>>
>> 2. pciback is used to hold the unbound PCI devices, e.g. when passing through
>>     a PCI device it needs to be unbound from the relevant device driver and bound
>>     to pciback (strictly speaking it is not required that the device is bound to
>>     pciback, but pciback is again used as a database of the passed through PCI
>>     devices, so we can re-bind the devices back to their original drivers when
>>     guest domain shuts down)
>>
>> 1. As ARM doesn't use pciback implement the above with additional sysctls:
>>   - XEN_SYSCTL_pci_device_set_assigned
> I don't see the point in having this sysfs, Xen already knows when a
> device is assigned because the XEN_DOMCTL_assign_device hypercall is
> used.

But how does the toolstack know about that? When the toolstack needs to

list/know all assigned devices it queries pciback's sysfs entries. So, with

XEN_DOMCTL_assign_device we make that knowledge available to Xen,

but there are no means for the toolstack to get it back.

>
>>   - XEN_SYSCTL_pci_device_get_assigned
>>   - XEN_SYSCTL_pci_device_enum_assigned
>> 2. Extend struct pci_dev to hold assignment state.
> I'm not really found of this, the hypervisor is no place to store a
> database like this, unless it's strictly needed.
I do agree and it was previously discussed a bit
>
> IMO the right implementation here would be to split Linux pciback into
> two different drivers:
>
>   - The pv-pci backend for doing passthrough to classic PV guests.
Ok
>   - The rest of pciback: device reset, hand-holding driver for devices
>     to be assigned and database.

These and assigned devices list seem to be the complete set which

is needed by the toolstack on ARM. All other functionality provided by

pciback is not needed for ARM.

Jan was saying [1] that we might still use pciback as is, but simply use only

the functionality we need.

>
> I think there must be something similar in KVM that performs the tasks
> of my last point, maybe we could piggyback on it?
I promised to look at it. I owe this
>
> If we want to go the route proposed by this patch, ie: Xen performing
> the functions of pciback you would also have to move the PCI reset
> code to Xen, so that you can fully manage the PCI devices from Xen.
In case of dom0less this would be the case: no pciback, no Domain-0
>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>>   tools/libxc/include/xenctrl.h |   9 +++
>>   tools/libxc/xc_domain.c       |   1 +
>>   tools/libxc/xc_misc.c         |  46 +++++++++++++++
>>   tools/libxl/Makefile          |   4 ++
>>   tools/libxl/libxl_pci.c       | 105 ++++++++++++++++++++++++++++++++--
>>   xen/arch/arm/sysctl.c         |  66 ++++++++++++++++++++-
>>   xen/drivers/passthrough/pci.c |  93 ++++++++++++++++++++++++++++++
>>   xen/include/public/sysctl.h   |  40 +++++++++++++
>>   xen/include/xen/pci.h         |  12 ++++
>>   9 files changed, 370 insertions(+), 6 deletions(-)
> I've done some light review below given my questions above.

This is more than I expected for an RFC series

Thank you!

>
>> diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
>> index 4c89b7294c4f..77029013da7d 100644
>> --- a/tools/libxc/include/xenctrl.h
>> +++ b/tools/libxc/include/xenctrl.h
>> @@ -2652,6 +2652,15 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout, uint32
>>   int xc_domain_cacheflush(xc_interface *xch, uint32_t domid,
>>                            xen_pfn_t start_pfn, xen_pfn_t nr_pfns);
>>   
>> +typedef xen_sysctl_pci_device_enum_assigned_t xc_pci_device_enum_assigned_t;
>> +
>> +int xc_pci_device_set_assigned(xc_interface *xch, uint32_t machine_sbdf,
>> +                               bool assigned);
>> +int xc_pci_device_get_assigned(xc_interface *xch, uint32_t machine_sbdf);
>> +
>> +int xc_pci_device_enum_assigned(xc_interface *xch,
>> +                                xc_pci_device_enum_assigned_t *e);
>> +
>>   /* Compat shims */
>>   #include "xenctrl_compat.h"
>>   
>> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
>> index 71829c2bce3e..d515191e9243 100644
>> --- a/tools/libxc/xc_domain.c
>> +++ b/tools/libxc/xc_domain.c
>> @@ -2321,6 +2321,7 @@ int xc_domain_soft_reset(xc_interface *xch,
>>       domctl.domain = domid;
>>       return do_domctl(xch, &domctl);
>>   }
>> +
>>   /*
>>    * Local variables:
>>    * mode: C
>> diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
>> index 3820394413a9..d439c4ba1019 100644
>> --- a/tools/libxc/xc_misc.c
>> +++ b/tools/libxc/xc_misc.c
>> @@ -988,6 +988,52 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout, uint32
>>       return _xc_livepatch_action(xch, name, LIVEPATCH_ACTION_REPLACE, timeout, flags);
>>   }
>>   
>> +int xc_pci_device_set_assigned(
>> +    xc_interface *xch,
>> +    uint32_t machine_sbdf,
>> +    bool assigned)
>> +{
>> +    DECLARE_SYSCTL;
>> +
>> +    sysctl.cmd = XEN_SYSCTL_pci_device_set_assigned;
>> +    sysctl.u.pci_set_assigned.machine_sbdf = machine_sbdf;
>> +    sysctl.u.pci_set_assigned.assigned = assigned;
>> +
>> +    return do_sysctl(xch, &sysctl);
>> +}
>> +
>> +int xc_pci_device_get_assigned(
>> +    xc_interface *xch,
>> +    uint32_t machine_sbdf)
>> +{
>> +    DECLARE_SYSCTL;
>> +
>> +    sysctl.cmd = XEN_SYSCTL_pci_device_get_assigned;
>> +    sysctl.u.pci_get_assigned.machine_sbdf = machine_sbdf;
>> +
>> +    return do_sysctl(xch, &sysctl);
>> +}
>> +
>> +int xc_pci_device_enum_assigned(xc_interface *xch,
>> +                                xc_pci_device_enum_assigned_t *e)
>> +{
>> +    int ret;
>> +    DECLARE_SYSCTL;
>> +
>> +    sysctl.cmd = XEN_SYSCTL_pci_device_enum_assigned;
>> +    sysctl.u.pci_enum_assigned.idx = e->idx;
>> +    sysctl.u.pci_enum_assigned.report_not_assigned = e->report_not_assigned;
>> +    ret = do_sysctl(xch, &sysctl);
>> +    if ( ret )
>> +        errno = EINVAL;
>> +    else
>> +    {
>> +        e->domain = sysctl.u.pci_enum_assigned.domain;
>> +        e->machine_sbdf = sysctl.u.pci_enum_assigned.machine_sbdf;
>> +    }
>> +    return ret;
>> +}
>> +
>>   /*
>>    * Local variables:
>>    * mode: C
>> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
>> index f3806aafcb4e..6f76ba35aec7 100644
>> --- a/tools/libxl/Makefile
>> +++ b/tools/libxl/Makefile
>> @@ -130,6 +130,10 @@ endif
>>   
>>   LIBXL_LIBS += -lyajl
>>   
>> +ifeq ($(CONFIG_X86),y)
>> +CFALGS += -DCONFIG_PCIBACK
>> +endif
>> +
>>   ifeq ($(CONFIG_ARM),y)
>>   CFALGS += -DCONFIG_ARM
>>   endif
>> diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
>> index b93cf976642b..41f89b8aae10 100644
>> --- a/tools/libxl/libxl_pci.c
>> +++ b/tools/libxl/libxl_pci.c
>> @@ -319,6 +319,7 @@ retry_transaction2:
>>   
>>   static int get_all_assigned_devices(libxl__gc *gc, libxl_device_pci **list, int *num)
>>   {
>> +#ifdef CONFIG_PCIBACK
>>       char **domlist;
>>       unsigned int nd = 0, i;
>>   
>> @@ -356,6 +357,33 @@ static int get_all_assigned_devices(libxl__gc *gc, libxl_device_pci **list, int
>>               }
>>           }
>>       }
>> +#else
>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>> +    int ret;
>> +    xc_pci_device_enum_assigned_t e;
>> +
>> +    *list = NULL;
>> +    *num = 0;
>> +
>> +    memset(&e, 0, sizeof(e));
>> +    do {
>> +        ret = xc_pci_device_enum_assigned(ctx->xch, &e);
>> +        if ( ret && errno == EINVAL )
>> +            break;
>> +        *list = realloc(*list, sizeof(libxl_device_pci) * (e.idx + 1));
>> +        if (*list == NULL)
>> +            return ERROR_NOMEM;
>> +
>> +        pcidev_struct_fill(*list + e.idx,
>> +                           e.domain,
>> +                           e.machine_sbdf >> 8 & 0xff,
>> +                           PCI_SLOT(e.machine_sbdf),
>> +                           PCI_FUNC(e.machine_sbdf),
>> +                           0 /*vdevfn*/);
>> +        e.idx++;
>> +    } while (!ret);
>> +    *num = e.idx;
>> +#endif
> I don't think the amount of ifdefs added to this file is acceptable.
> If we have to go that route this needs to be split into a different
> file, and maybe some of the common bits abstracted together to prevent
> code repetition.

We also briefly discussed that and were talking about if the arch specific

files should be someting like libxl_pci_x86_linux.c etc.

>
>>       libxl__ptr_add(gc, *list);
>>   
>>       return 0;
>> @@ -411,13 +439,20 @@ static int sysfs_write_bdf(libxl__gc *gc, const char * sysfs_path,
>>   libxl_device_pci *libxl_device_pci_assignable_list(libxl_ctx *ctx, int *num)
>>   {
>>       GC_INIT(ctx);
>> -    libxl_device_pci *pcidevs = NULL, *new, *assigned;
>> +    libxl_device_pci *pcidevs = NULL, *new;
>> +    int r;
>> +#ifdef CONFIG_PCIBACK
>> +    libxl_device_pci *assigned;
>> +    int num_assigned;
>>       struct dirent *de;
>>       DIR *dir;
>> -    int r, num_assigned;
>> +#else
>> +    xc_pci_device_enum_assigned_t e;
>> +#endif
>>   
>>       *num = 0;
>>   
>> +#ifdef CONFIG_PCIBACK
>>       r = get_all_assigned_devices(gc, &assigned, &num_assigned);
>>       if (r) goto out;
>>   
>> @@ -453,6 +488,32 @@ libxl_device_pci *libxl_device_pci_assignable_list(libxl_ctx *ctx, int *num)
>>   
>>       closedir(dir);
>>   out:
>> +#else
>> +    memset(&e, 0, sizeof(e));
>> +    e.report_not_assigned = 1;
>> +    do {
>> +        r = xc_pci_device_enum_assigned(ctx->xch, &e);
>> +        if ( r && errno == EINVAL )
>> +            break;
>> +        new = realloc(pcidevs, (e.idx + 1) * sizeof(*new));
>> +        if (NULL == new)
>> +            continue;
>> +
>> +        pcidevs = new;
>> +        new = pcidevs + e.idx;
>> +
>> +        memset(new, 0, sizeof(*new));
>> +
>> +        pcidev_struct_fill(new,
>> +                           e.domain,
>> +                           e.machine_sbdf >> 8 & 0xff,
>> +                           PCI_SLOT(e.machine_sbdf),
>> +                           PCI_FUNC(e.machine_sbdf),
>> +                           0 /*vdevfn*/);
>> +        e.idx++;
>> +    } while (!r);
>> +    *num = e.idx;
>> +#endif
>>       GC_FREE;
>>       return pcidevs;
>>   }
>> @@ -606,6 +667,7 @@ bool libxl__is_igd_vga_passthru(libxl__gc *gc,
>>       return false;
>>   }
>>   
>> +#ifdef CONFIG_PCIBACK
>>   /*
>>    * A brief comment about slots.  I don't know what slots are for; however,
>>    * I have by experimentation determined:
>> @@ -648,11 +710,13 @@ out:
>>       fclose(f);
>>       return rc;
>>   }
>> +#endif
>>   
>>   static int pciback_dev_is_assigned(libxl__gc *gc, libxl_device_pci *pcidev)
>>   {
>> -    char * spath;
>>       int rc;
>> +#ifdef CONFIG_PCIBACK
>> +    char * spath;
>>       struct stat st;
>>   
>>       if ( access(SYSFS_PCIBACK_DRIVER, F_OK) < 0 ) {
>> @@ -663,22 +727,27 @@ static int pciback_dev_is_assigned(libxl__gc *gc, libxl_device_pci *pcidev)
>>           }
>>           return -1;
>>       }
>> -
>>       spath = GCSPRINTF(SYSFS_PCIBACK_DRIVER"/"PCI_BDF,
>>                         pcidev->domain, pcidev->bus,
>>                         pcidev->dev, pcidev->func);
>>       rc = lstat(spath, &st);
>> -
>>       if( rc == 0 )
>>           return 1;
>>       if ( rc < 0 && errno == ENOENT )
>>           return 0;
>>       LOGE(ERROR, "Accessing %s", spath);
>>       return -1;
>> +#else
>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>> +
>> +    rc = xc_pci_device_get_assigned(ctx->xch, pcidev_encode_bdf(pcidev));
>> +    return rc == 0 ? 1 : 0;
>> +#endif
>>   }
>>   
>>   static int pciback_dev_assign(libxl__gc *gc, libxl_device_pci *pcidev)
>>   {
>> +#ifdef CONFIG_PCIBACK
>>       int rc;
>>   
>>       if ( (rc=pciback_dev_has_slot(gc, pcidev)) < 0 ) {
>> @@ -697,10 +766,17 @@ static int pciback_dev_assign(libxl__gc *gc, libxl_device_pci *pcidev)
>>           return ERROR_FAIL;
>>       }
>>       return 0;
>> +#else
>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>> +
>> +    return xc_pci_device_set_assigned(ctx->xch, pcidev_encode_bdf(pcidev),
>> +                                      true);
>> +#endif
>>   }
>>   
>>   static int pciback_dev_unassign(libxl__gc *gc, libxl_device_pci *pcidev)
>>   {
>> +#ifdef CONFIG_PCIBACK
>>       /* Remove from pciback */
>>       if ( sysfs_dev_unbind(gc, pcidev, NULL) < 0 ) {
>>           LOG(ERROR, "Couldn't unbind device!");
>> @@ -716,6 +792,12 @@ static int pciback_dev_unassign(libxl__gc *gc, libxl_device_pci *pcidev)
>>           }
>>       }
>>       return 0;
>> +#else
>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>> +
>> +    return xc_pci_device_set_assigned(ctx->xch, pcidev_encode_bdf(pcidev),
>> +                                      false);
>> +#endif
>>   }
>>   
>>   #define PCIBACK_INFO_PATH "/libxl/pciback"
>> @@ -780,10 +862,15 @@ static int libxl__device_pci_assignable_add(libxl__gc *gc,
>>   
>>       /* See if the device exists */
>>       spath = GCSPRINTF(SYSFS_PCI_DEV"/"PCI_BDF, dom, bus, dev, func);
>> +#ifdef CONFIG_PCI_SYSFS_DOM0
>>       if ( lstat(spath, &st) ) {
>>           LOGE(ERROR, "Couldn't lstat %s", spath);
>>           return ERROR_FAIL;
>>       }
>> +#else
>> +    (void)st;
>> +    printf("IMPLEMENT_ME: %s lstat %s\n", __func__, spath);
>> +#endif
>>   
>>       /* Check to see if it's already assigned to pciback */
>>       rc = pciback_dev_is_assigned(gc, pcidev);
>> @@ -1350,8 +1437,12 @@ static void pci_add_dm_done(libxl__egc *egc,
>>   
>>       if (f == NULL) {
>>           LOGED(ERROR, domainid, "Couldn't open %s", sysfs_path);
>> +#ifdef CONFIG_PCI_SYSFS_DOM0
>>           rc = ERROR_FAIL;
>>           goto out;
>> +#else
>> +        goto out_no_irq;
>> +#endif
>>       }
>>       for (i = 0; i < PROC_PCI_NUM_RESOURCES; i++) {
>>           if (fscanf(f, "0x%llx 0x%llx 0x%llx\n", &start, &end, &flags) != 3)
>> @@ -1522,7 +1613,11 @@ static int libxl_pcidev_assignable(libxl_ctx *ctx, libxl_device_pci *pcidev)
>>               break;
>>       }
>>       free(pcidevs);
>> +#ifdef CONFIG_PCIBACK
>>       return i != num;
>> +#else
>> +    return 1;
>> +#endif
>>   }
>>   
>>   static void device_pci_add_stubdom_wait(libxl__egc *egc,
>> diff --git a/xen/arch/arm/sysctl.c b/xen/arch/arm/sysctl.c
>> index f87944e8473c..84e933b2eb45 100644
>> --- a/xen/arch/arm/sysctl.c
>> +++ b/xen/arch/arm/sysctl.c
>> @@ -10,6 +10,7 @@
>>   #include <xen/lib.h>
>>   #include <xen/errno.h>
>>   #include <xen/hypercall.h>
>> +#include <xen/guest_access.h>
>>   #include <public/sysctl.h>
>>   
>>   void arch_do_physinfo(struct xen_sysctl_physinfo *pi)
>> @@ -20,7 +21,70 @@ void arch_do_physinfo(struct xen_sysctl_physinfo *pi)
>>   long arch_do_sysctl(struct xen_sysctl *sysctl,
>>                       XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>>   {
>> -    return -ENOSYS;
>> +    long ret = 0;
>> +    bool copyback = 0;
>> +
>> +    switch ( sysctl->cmd )
>> +    {
>> +    case XEN_SYSCTL_pci_device_set_assigned:
>> +    {
>> +        u16 seg;
>> +        u8 bus, devfn;
>> +        uint32_t machine_sbdf;
>> +
>> +        machine_sbdf = sysctl->u.pci_set_assigned.machine_sbdf;
>> +
>> +#if 0
>> +        ret = xsm_pci_device_set_assigned(XSM_HOOK, d);
>> +        if ( ret )
>> +            break;
>> +#endif
>> +
>> +        seg = machine_sbdf >> 16;
>> +        bus = PCI_BUS(machine_sbdf);
>> +        devfn = PCI_DEVFN2(machine_sbdf);
>> +
>> +        pcidevs_lock();
>> +        ret = pci_device_set_assigned(seg, bus, devfn,
>> +                                      !!sysctl->u.pci_set_assigned.assigned);
>> +        pcidevs_unlock();
>> +        break;
>> +    }
>> +    case XEN_SYSCTL_pci_device_get_assigned:
>> +    {
>> +        u16 seg;
>> +        u8 bus, devfn;
>> +        uint32_t machine_sbdf;
>> +
>> +        machine_sbdf = sysctl->u.pci_set_assigned.machine_sbdf;
>> +
>> +        seg = machine_sbdf >> 16;
>> +        bus = PCI_BUS(machine_sbdf);
>> +        devfn = PCI_DEVFN2(machine_sbdf);
>> +
>> +        pcidevs_lock();
>> +        ret = pci_device_get_assigned(seg, bus, devfn);
>> +        pcidevs_unlock();
>> +        break;
>> +    }
>> +    case XEN_SYSCTL_pci_device_enum_assigned:
>> +    {
>> +        ret = pci_device_enum_assigned(sysctl->u.pci_enum_assigned.report_not_assigned,
>> +                                       sysctl->u.pci_enum_assigned.idx,
>> +                                       &sysctl->u.pci_enum_assigned.domain,
>> +                                       &sysctl->u.pci_enum_assigned.machine_sbdf);
>> +        copyback = 1;
>> +        break;
>> +    }
>> +    default:
>> +        ret = -ENOSYS;
>> +        break;
>> +    }
>> +    if ( copyback && (!ret || copyback > 0) &&
>> +         __copy_to_guest(u_sysctl, sysctl, 1) )
>> +        ret = -EFAULT;
>> +
>> +    return ret;
>>   }
>>   
>>   /*
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 98e8a2fade60..49b4279c63bd 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -879,6 +879,43 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>       return ret;
>>   }
>>   
>> +#ifdef CONFIG_ARM
>> +int pci_device_set_assigned(u16 seg, u8 bus, u8 devfn, bool assigned)
>> +{
>> +    struct pci_dev *pdev;
>> +
>> +    pdev = pci_get_pdev(seg, bus, devfn);
>> +    if ( !pdev )
>> +    {
>> +        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> Take a look at pci_sbdf_t, you should use it as the parameter and in
> order to print the SBDF (%pp).
I will, thank you
>
>> +        return -ENODEV;
>> +    }
>> +
>> +    pdev->assigned = assigned;
>> +    printk(XENLOG_ERR "pciback %sassign PCI device %04x:%02x:%02x.%u\n",
>> +           assigned ? "" : "de-",
>> +           seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>> +
>> +    return 0;
>> +}
>> +
>> +int pci_device_get_assigned(u16 seg, u8 bus, u8 devfn)
>> +{
>> +    struct pci_dev *pdev;
>> +
>> +    pdev = pci_get_pdev(seg, bus, devfn);
>> +    if ( !pdev )
>> +    {
>> +        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>> +        return -ENODEV;
>> +    }
>> +
>> +    return pdev->assigned ? 0 : -ENODEV;
>> +}
>> +#endif
>> +
>>   #ifndef CONFIG_ARM
>>   /*TODO :Implement MSI support for ARM  */
>>   static int pci_clean_dpci_irq(struct domain *d,
>> @@ -1821,6 +1858,62 @@ int iommu_do_pci_domctl(
>>       return ret;
>>   }
>>   
>> +#ifdef CONFIG_ARM
>> +struct list_assigned {
>> +    uint32_t cur_idx;
>> +    uint32_t from_idx;
>> +    bool assigned;
>> +    domid_t *domain;
>> +    uint32_t *machine_sbdf;
>> +};
>> +
>> +static int _enum_assigned_pci_devices(struct pci_seg *pseg, void *arg)
>> +{
>> +    struct list_assigned *ctxt = arg;
>> +    struct pci_dev *pdev;
>> +
>> +    list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>> +    {
>> +        if ( pdev->assigned == ctxt->assigned )
>> +        {
>> +            if ( ctxt->cur_idx == ctxt->from_idx )
>> +            {
>> +                *ctxt->domain = pdev->domain->domain_id;
>> +                *ctxt->machine_sbdf = pdev->sbdf.sbdf;
>> +                return 1;
>> +            }
>> +            ctxt->cur_idx++;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +int pci_device_enum_assigned(bool report_not_assigned,
>> +                             uint32_t from_idx, domid_t *domain,
>> +                             uint32_t *machine_sbdf)
>> +{
>> +    struct list_assigned ctxt = {
>> +        .assigned = !report_not_assigned,
>> +        .cur_idx = 0,
>> +        .from_idx = from_idx,
>> +        .domain = domain,
>> +        .machine_sbdf = machine_sbdf,
>> +    };
>> +    int ret;
>> +
>> +    pcidevs_lock();
>> +    ret = pci_segments_iterate(_enum_assigned_pci_devices, &ctxt);
>> +    pcidevs_unlock();
>> +    /*
>> +     * If not found then report as EINVAL to mark
>> +     * enumeration process finished.
>> +     */
>> +    if ( !ret )
>> +        return -EINVAL;
>> +    return 0;
>> +}
>> +#endif
>> +
>>   /*
>>    * Local variables:
>>    * mode: C
>> diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
>> index a07364711794..5ca73c538688 100644
>> --- a/xen/include/public/sysctl.h
>> +++ b/xen/include/public/sysctl.h
>> @@ -1062,6 +1062,40 @@ typedef struct xen_sysctl_cpu_policy xen_sysctl_cpu_policy_t;
>>   DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpu_policy_t);
>>   #endif
>>   
>> +/*
>> + * These are to emulate pciback device (de-)assignment used by the tools
>> + * to track current device assignments: all the PCI devices that can
>> + * be passed through must be assigned to the pciback to mark them
>> + * as such. As on ARM we do not run pci{back|front} and are emulating
>> + * PCI host bridge in Xen, so we need to maintain the assignments on our
>> + * own in Xen itself.
>> + *
>> + * Note on xen_sysctl_pci_device_get_assigned: ENOENT is used to report
>> + * that there are no assigned devices left.
>> + */
>> +struct xen_sysctl_pci_device_set_assigned {
>> +    /* IN */
>> +    /* FIXME: is this really a machine SBDF or as Domain-0 sees it? */
>> +    uint32_t machine_sbdf;
> I think you need to make it clear that when running on Xen dom0 (or
> the hardware domain) should _never_ change the enumeration of devices,
> or else none of this will work.
I will
>
>> +    uint8_t assigned;
>> +};
>> +
>> +struct xen_sysctl_pci_device_get_assigned {
>> +    /* IN */
>> +    uint32_t machine_sbdf;
>> +};
>> +
>> +struct xen_sysctl_pci_device_enum_assigned {
>> +    /* IN */
>> +    uint32_t idx;
>> +    uint8_t report_not_assigned;
>> +    /* OUT */
>> +    domid_t domain;
>> +    uint32_t machine_sbdf;
>> +};
>> +typedef struct xen_sysctl_pci_device_enum_assigned xen_sysctl_pci_device_enum_assigned_t;
>> +DEFINE_XEN_GUEST_HANDLE(xen_sysctl_pci_device_enum_assigned_t);
>> +
>>   struct xen_sysctl {
>>       uint32_t cmd;
>>   #define XEN_SYSCTL_readconsole                    1
>> @@ -1092,6 +1126,9 @@ struct xen_sysctl {
>>   #define XEN_SYSCTL_livepatch_op                  27
>>   /* #define XEN_SYSCTL_set_parameter              28 */
>>   #define XEN_SYSCTL_get_cpu_policy                29
>> +#define XEN_SYSCTL_pci_device_set_assigned       30
>> +#define XEN_SYSCTL_pci_device_get_assigned       31
>> +#define XEN_SYSCTL_pci_device_enum_assigned      32
>>       uint32_t interface_version; /* XEN_SYSCTL_INTERFACE_VERSION */
>>       union {
>>           struct xen_sysctl_readconsole       readconsole;
>> @@ -1122,6 +1159,9 @@ struct xen_sysctl {
>>   #if defined(__i386__) || defined(__x86_64__)
>>           struct xen_sysctl_cpu_policy        cpu_policy;
>>   #endif
>> +        struct xen_sysctl_pci_device_set_assigned pci_set_assigned;
>> +        struct xen_sysctl_pci_device_get_assigned pci_get_assigned;
>> +        struct xen_sysctl_pci_device_enum_assigned pci_enum_assigned;
>>           uint8_t                             pad[128];
>>       } u;
>>   };
>> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
>> index 2bc4aaf4530c..7bf439de4de0 100644
>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -132,6 +132,13 @@ struct pci_dev {
>>   
>>       /* Data for vPCI. */
>>       struct vpci *vpci;
>> +#ifdef CONFIG_ARM
>> +    /*
>> +     * Set if this PCI device is eligible for pass through,
>> +     * e.g. just like it was assigned to pciback driver.
>> +     */
>> +    bool assigned;
> You can see whether a device is assigned or not by looking at the
> domain field AFAICT.

Hm, domain field could be dom_io, so we need to put an extra logic here

to understand the device is really assigned to the domain

>
> Thanks, Roger.

Thank you!

Oleksandr

[1] https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg77422.html

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] xen/arm: Setup MMIO range trap handlers for hardware domain
  2020-11-09 12:50 ` [PATCH 03/10] xen/arm: Setup MMIO range trap handlers for hardware domain Oleksandr Andrushchenko
@ 2020-11-11 14:39   ` Roger Pau Monné
  2020-11-11 14:42     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-11 14:39 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko

On Mon, Nov 09, 2020 at 02:50:24PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> In order vPCI to work it needs all access to PCI configuration space
> access to be synchronized among all entities, e.g. hardware domain and
> guests. For that implement PCI host bridge specific callbacks to
> propelry setup those ranges depending on host bridge implementation.
> 
> This callback is optional and may not be used by non-ECAM host bridges.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
>  xen/arch/arm/pci/pci-host-common.c  | 16 ++++++++++++++++
>  xen/arch/arm/pci/pci-host-generic.c | 15 +++++++++++++--
>  xen/arch/arm/vpci.c                 | 16 +++++++++++++++-

So this is based on top of another series, maybe it would make sense
to post those together, or else it's hard to get the right context.

>  xen/include/asm-arm/pci.h           |  7 +++++++
>  4 files changed, 51 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/arm/pci/pci-host-common.c b/xen/arch/arm/pci/pci-host-common.c
> index b011c7eff3c8..b81184d34980 100644
> --- a/xen/arch/arm/pci/pci-host-common.c
> +++ b/xen/arch/arm/pci/pci-host-common.c
> @@ -219,6 +219,22 @@ struct device *pci_find_host_bridge_device(struct device *dev)
>      }
>      return dt_to_dev(bridge->dt_node);
>  }
> +
> +int pci_host_iterate_bridges(struct domain *d,
> +                             int (*clb)(struct domain *d,
> +                                        struct pci_host_bridge *bridge))
> +{
> +    struct pci_host_bridge *bridge;
> +    int err;
> +
> +    list_for_each_entry( bridge, &pci_host_bridges, node )
> +    {
> +        err = clb(d, bridge);
> +        if ( err )
> +            return err;
> +    }
> +    return 0;
> +}
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/arm/pci/pci-host-generic.c b/xen/arch/arm/pci/pci-host-generic.c
> index 54dd123e95c7..469df3da0116 100644
> --- a/xen/arch/arm/pci/pci-host-generic.c
> +++ b/xen/arch/arm/pci/pci-host-generic.c
> @@ -85,12 +85,23 @@ int pci_ecam_config_read(struct pci_host_bridge *bridge, uint32_t sbdf,
>      return 0;
>  }
>  
> +static int pci_ecam_register_mmio_handler(struct domain *d,
> +                                          struct pci_host_bridge *bridge,

I think you can also constify bridge here.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] xen/arm: Setup MMIO range trap handlers for hardware domain
  2020-11-11 14:39   ` Roger Pau Monné
@ 2020-11-11 14:42     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-11 14:42 UTC (permalink / raw)
  To: Roger Pau Monné, Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl


On 11/11/20 4:39 PM, Roger Pau Monné wrote:
> On Mon, Nov 09, 2020 at 02:50:24PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> In order vPCI to work it needs all access to PCI configuration space
>> access to be synchronized among all entities, e.g. hardware domain and
>> guests. For that implement PCI host bridge specific callbacks to
>> propelry setup those ranges depending on host bridge implementation.
>>
>> This callback is optional and may not be used by non-ECAM host bridges.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>>   xen/arch/arm/pci/pci-host-common.c  | 16 ++++++++++++++++
>>   xen/arch/arm/pci/pci-host-generic.c | 15 +++++++++++++--
>>   xen/arch/arm/vpci.c                 | 16 +++++++++++++++-
> So this is based on top of another series, maybe it would make sense
> to post those together, or else it's hard to get the right context.

This is based on ARM's PCI passthrough RFC series [1]

You can also see the whole picture at [2]

>
>>   xen/include/asm-arm/pci.h           |  7 +++++++
>>   4 files changed, 51 insertions(+), 3 deletions(-)
>>
>> diff --git a/xen/arch/arm/pci/pci-host-common.c b/xen/arch/arm/pci/pci-host-common.c
>> index b011c7eff3c8..b81184d34980 100644
>> --- a/xen/arch/arm/pci/pci-host-common.c
>> +++ b/xen/arch/arm/pci/pci-host-common.c
>> @@ -219,6 +219,22 @@ struct device *pci_find_host_bridge_device(struct device *dev)
>>       }
>>       return dt_to_dev(bridge->dt_node);
>>   }
>> +
>> +int pci_host_iterate_bridges(struct domain *d,
>> +                             int (*clb)(struct domain *d,
>> +                                        struct pci_host_bridge *bridge))
>> +{
>> +    struct pci_host_bridge *bridge;
>> +    int err;
>> +
>> +    list_for_each_entry( bridge, &pci_host_bridges, node )
>> +    {
>> +        err = clb(d, bridge);
>> +        if ( err )
>> +            return err;
>> +    }
>> +    return 0;
>> +}
>>   /*
>>    * Local variables:
>>    * mode: C
>> diff --git a/xen/arch/arm/pci/pci-host-generic.c b/xen/arch/arm/pci/pci-host-generic.c
>> index 54dd123e95c7..469df3da0116 100644
>> --- a/xen/arch/arm/pci/pci-host-generic.c
>> +++ b/xen/arch/arm/pci/pci-host-generic.c
>> @@ -85,12 +85,23 @@ int pci_ecam_config_read(struct pci_host_bridge *bridge, uint32_t sbdf,
>>       return 0;
>>   }
>>   
>> +static int pci_ecam_register_mmio_handler(struct domain *d,
>> +                                          struct pci_host_bridge *bridge,
> I think you can also constify bridge here.
Makes sense
>
> Thanks, Roger.

Thank you,

Oleksandr

[1] https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg84452.html

[2] https://github.com/andr2000/xen/tree/vpci_rfc

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] [WORKAROUND] xen/arm: Update hwdom's p2m to trap ECAM space
  2020-11-09 12:50 ` [PATCH 04/10] [WORKAROUND] xen/arm: Update hwdom's p2m to trap ECAM space Oleksandr Andrushchenko
@ 2020-11-11 14:44   ` Roger Pau Monné
  2020-11-12 12:54     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-11 14:44 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko

On Mon, Nov 09, 2020 at 02:50:25PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Host bridge controller's ECAM space is mapped into Domain-0's p2m,
> thus it is not possible to trap the same for vPCI via MMIO handlers.
> For this to work we need to unmap those mappings in p2m.
> 
> TODO (Julien): It would be best if we avoid the map/unmap operation.
> So, maybe we want to introduce another way to avoid the mapping.
> Maybe by changing the type of the controller to "PCI_HOSTCONTROLLER"
> and checking if this is a PCI hostcontroller avoid the mapping.

I know very little about Arm to be able to provide meaningful comments
here. I agree that creating the maps just to remove them afterwards is
not the right approach, we should instead avoid those mappings from
being created in the first place.

Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] arm/pci: Maintain PCI assignable list
  2020-11-09 12:50 ` [PATCH 02/10] arm/pci: Maintain PCI assignable list Oleksandr Andrushchenko
  2020-11-11 13:53   ` Roger Pau Monné
@ 2020-11-11 14:54   ` Jan Beulich
  2020-11-12 12:53     ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-11 14:54 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: iwj, wl, Oleksandr Andrushchenko, xen-devel, Bertrand.Marquis,
	julien.grall, sstabellini, roger.pau, Rahul.Singh

On 09.11.2020 13:50, Oleksandr Andrushchenko wrote:
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -879,6 +879,43 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      return ret;
>  }
>  
> +#ifdef CONFIG_ARM
> +int pci_device_set_assigned(u16 seg, u8 bus, u8 devfn, bool assigned)
> +{
> +    struct pci_dev *pdev;
> +
> +    pdev = pci_get_pdev(seg, bus, devfn);
> +    if ( !pdev )
> +    {
> +        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> +        return -ENODEV;
> +    }
> +
> +    pdev->assigned = assigned;
> +    printk(XENLOG_ERR "pciback %sassign PCI device %04x:%02x:%02x.%u\n",
> +           assigned ? "" : "de-",
> +           seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> +
> +    return 0;
> +}
> +
> +int pci_device_get_assigned(u16 seg, u8 bus, u8 devfn)
> +{
> +    struct pci_dev *pdev;
> +
> +    pdev = pci_get_pdev(seg, bus, devfn);
> +    if ( !pdev )
> +    {
> +        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> +        return -ENODEV;
> +    }
> +
> +    return pdev->assigned ? 0 : -ENODEV;
> +}
> +#endif
> +
>  #ifndef CONFIG_ARM
>  /*TODO :Implement MSI support for ARM  */
>  static int pci_clean_dpci_irq(struct domain *d,
> @@ -1821,6 +1858,62 @@ int iommu_do_pci_domctl(
>      return ret;
>  }
>  
> +#ifdef CONFIG_ARM
> +struct list_assigned {
> +    uint32_t cur_idx;
> +    uint32_t from_idx;
> +    bool assigned;
> +    domid_t *domain;
> +    uint32_t *machine_sbdf;
> +};
> +
> +static int _enum_assigned_pci_devices(struct pci_seg *pseg, void *arg)
> +{
> +    struct list_assigned *ctxt = arg;
> +    struct pci_dev *pdev;
> +
> +    list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
> +    {
> +        if ( pdev->assigned == ctxt->assigned )
> +        {
> +            if ( ctxt->cur_idx == ctxt->from_idx )
> +            {
> +                *ctxt->domain = pdev->domain->domain_id;
> +                *ctxt->machine_sbdf = pdev->sbdf.sbdf;
> +                return 1;
> +            }
> +            ctxt->cur_idx++;
> +        }
> +    }
> +    return 0;
> +}
> +
> +int pci_device_enum_assigned(bool report_not_assigned,
> +                             uint32_t from_idx, domid_t *domain,
> +                             uint32_t *machine_sbdf)
> +{
> +    struct list_assigned ctxt = {
> +        .assigned = !report_not_assigned,
> +        .cur_idx = 0,
> +        .from_idx = from_idx,
> +        .domain = domain,
> +        .machine_sbdf = machine_sbdf,
> +    };
> +    int ret;
> +
> +    pcidevs_lock();
> +    ret = pci_segments_iterate(_enum_assigned_pci_devices, &ctxt);
> +    pcidevs_unlock();
> +    /*
> +     * If not found then report as EINVAL to mark
> +     * enumeration process finished.
> +     */
> +    if ( !ret )
> +        return -EINVAL;
> +    return 0;
> +}
> +#endif

Just in case the earlier comments you've got don't lead to removal
of this code - unless there's a real need for them to be put here,
under #ifdef, please add a new xen/drivers/passthrough/arm/pci.c
instead. Even if for just part of the code, this would then also
help with more clear maintainership of this Arm specific code.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] arm/pci: Maintain PCI assignable list
  2020-11-11 14:38     ` Oleksandr Andrushchenko
@ 2020-11-11 15:03       ` Roger Pau Monné
  2020-11-11 15:13         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-11 15:03 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, jbeulich, sstabellini, xen-devel, iwj, wl

On Wed, Nov 11, 2020 at 02:38:47PM +0000, Oleksandr Andrushchenko wrote:
> On 11/11/20 3:53 PM, Roger Pau Monné wrote:
> > On Mon, Nov 09, 2020 at 02:50:23PM +0200, Oleksandr Andrushchenko wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>
> >> The original code depends on pciback to manage assignable device list.
> >> The functionality which is implemented by the pciback and the toolstack
> >> and which is relevant/missing/needed for ARM:
> >>
> >> 1. pciback is used as a database for assignable PCI devices, e.g. xl
> >>     pci-assignable-{add|remove|list} manipulates that list. So, whenever the
> >>     toolstack needs to know which PCI devices can be passed through it reads
> >>     that from the relevant sysfs entries of the pciback.
> >>
> >> 2. pciback is used to hold the unbound PCI devices, e.g. when passing through
> >>     a PCI device it needs to be unbound from the relevant device driver and bound
> >>     to pciback (strictly speaking it is not required that the device is bound to
> >>     pciback, but pciback is again used as a database of the passed through PCI
> >>     devices, so we can re-bind the devices back to their original drivers when
> >>     guest domain shuts down)
> >>
> >> 1. As ARM doesn't use pciback implement the above with additional sysctls:
> >>   - XEN_SYSCTL_pci_device_set_assigned
> > I don't see the point in having this sysfs, Xen already knows when a
> > device is assigned because the XEN_DOMCTL_assign_device hypercall is
> > used.
> 
> But how does the toolstack know about that? When the toolstack needs to
> 
> list/know all assigned devices it queries pciback's sysfs entries. So, with
> 
> XEN_DOMCTL_assign_device we make that knowledge available to Xen,
> 
> but there are no means for the toolstack to get it back.

But the toolstack will figure out whether a device is assigned or
not by using
XEN_SYSCTL_pci_device_get_assigned/XEN_SYSCTL_pci_device_enum_assigned?

AFAICT XEN_SYSCTL_pci_device_set_assigned tells Xen a device has been
assigned, but Xen should already know it because
XEN_DOMCTL_assign_device would have been used to assign the device?

> >
> >>   - XEN_SYSCTL_pci_device_get_assigned
> >>   - XEN_SYSCTL_pci_device_enum_assigned
> >> 2. Extend struct pci_dev to hold assignment state.
> > I'm not really found of this, the hypervisor is no place to store a
> > database like this, unless it's strictly needed.
> I do agree and it was previously discussed a bit
> >
> > IMO the right implementation here would be to split Linux pciback into
> > two different drivers:
> >
> >   - The pv-pci backend for doing passthrough to classic PV guests.
> Ok
> >   - The rest of pciback: device reset, hand-holding driver for devices
> >     to be assigned and database.
> 
> These and assigned devices list seem to be the complete set which
> 
> is needed by the toolstack on ARM. All other functionality provided by
> 
> pciback is not needed for ARM.
> 
> Jan was saying [1] that we might still use pciback as is, but simply use only
> 
> the functionality we need.
> 
> >
> > I think there must be something similar in KVM that performs the tasks
> > of my last point, maybe we could piggyback on it?
> I promised to look at it. I owe this
> >
> > If we want to go the route proposed by this patch, ie: Xen performing
> > the functions of pciback you would also have to move the PCI reset
> > code to Xen, so that you can fully manage the PCI devices from Xen.
> In case of dom0less this would be the case: no pciback, no Domain-0

But for dom0less there's no need for any database of assignable
devices, nor the need to perform pci device resets, as it's all
assigned at boot time and then never modified?

Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] arm/pci: Maintain PCI assignable list
  2020-11-11 15:03       ` Roger Pau Monné
@ 2020-11-11 15:13         ` Oleksandr Andrushchenko
  2020-11-11 15:25           ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-11 15:13 UTC (permalink / raw)
  To: Roger Pau Monné, Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl

On 11/11/20 5:03 PM, Roger Pau Monné wrote:
> On Wed, Nov 11, 2020 at 02:38:47PM +0000, Oleksandr Andrushchenko wrote:
>> On 11/11/20 3:53 PM, Roger Pau Monné wrote:
>>> On Mon, Nov 09, 2020 at 02:50:23PM +0200, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> The original code depends on pciback to manage assignable device list.
>>>> The functionality which is implemented by the pciback and the toolstack
>>>> and which is relevant/missing/needed for ARM:
>>>>
>>>> 1. pciback is used as a database for assignable PCI devices, e.g. xl
>>>>      pci-assignable-{add|remove|list} manipulates that list. So, whenever the
>>>>      toolstack needs to know which PCI devices can be passed through it reads
>>>>      that from the relevant sysfs entries of the pciback.
>>>>
>>>> 2. pciback is used to hold the unbound PCI devices, e.g. when passing through
>>>>      a PCI device it needs to be unbound from the relevant device driver and bound
>>>>      to pciback (strictly speaking it is not required that the device is bound to
>>>>      pciback, but pciback is again used as a database of the passed through PCI
>>>>      devices, so we can re-bind the devices back to their original drivers when
>>>>      guest domain shuts down)
>>>>
>>>> 1. As ARM doesn't use pciback implement the above with additional sysctls:
>>>>    - XEN_SYSCTL_pci_device_set_assigned
>>> I don't see the point in having this sysfs, Xen already knows when a
>>> device is assigned because the XEN_DOMCTL_assign_device hypercall is
>>> used.
>> But how does the toolstack know about that? When the toolstack needs to
>>
>> list/know all assigned devices it queries pciback's sysfs entries. So, with
>>
>> XEN_DOMCTL_assign_device we make that knowledge available to Xen,
>>
>> but there are no means for the toolstack to get it back.
> But the toolstack will figure out whether a device is assigned or
> not by using
> XEN_SYSCTL_pci_device_get_assigned/XEN_SYSCTL_pci_device_enum_assigned?
>
> AFAICT XEN_SYSCTL_pci_device_set_assigned tells Xen a device has been
> assigned, but Xen should already know it because
> XEN_DOMCTL_assign_device would have been used to assign the device?

Ah, I misunderstood you then. So, we only want to drop XEN_DOMCTL_assign_device

and keep the rest.

>
>>>>    - XEN_SYSCTL_pci_device_get_assigned
>>>>    - XEN_SYSCTL_pci_device_enum_assigned
>>>> 2. Extend struct pci_dev to hold assignment state.
>>> I'm not really found of this, the hypervisor is no place to store a
>>> database like this, unless it's strictly needed.
>> I do agree and it was previously discussed a bit
>>> IMO the right implementation here would be to split Linux pciback into
>>> two different drivers:
>>>
>>>    - The pv-pci backend for doing passthrough to classic PV guests.
>> Ok
>>>    - The rest of pciback: device reset, hand-holding driver for devices
>>>      to be assigned and database.
>> These and assigned devices list seem to be the complete set which
>>
>> is needed by the toolstack on ARM. All other functionality provided by
>>
>> pciback is not needed for ARM.
>>
>> Jan was saying [1] that we might still use pciback as is, but simply use only
>>
>> the functionality we need.
>>
>>> I think there must be something similar in KVM that performs the tasks
>>> of my last point, maybe we could piggyback on it?
>> I promised to look at it. I owe this
>>> If we want to go the route proposed by this patch, ie: Xen performing
>>> the functions of pciback you would also have to move the PCI reset
>>> code to Xen, so that you can fully manage the PCI devices from Xen.
>> In case of dom0less this would be the case: no pciback, no Domain-0
> But for dom0less there's no need for any database of assignable
> devices, nor the need to perform pci device resets, as it's all
> assigned at boot time and then never modified?
You are right
>
> Roger.

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] arm/pci: Maintain PCI assignable list
  2020-11-11 15:13         ` Oleksandr Andrushchenko
@ 2020-11-11 15:25           ` Jan Beulich
  2020-11-11 15:28             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-11 15:25 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, sstabellini,
	xen-devel, iwj, wl, Roger Pau Monné

On 11.11.2020 16:13, Oleksandr Andrushchenko wrote:
> On 11/11/20 5:03 PM, Roger Pau Monné wrote:
>> On Wed, Nov 11, 2020 at 02:38:47PM +0000, Oleksandr Andrushchenko wrote:
>>> On 11/11/20 3:53 PM, Roger Pau Monné wrote:
>>>> On Mon, Nov 09, 2020 at 02:50:23PM +0200, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> The original code depends on pciback to manage assignable device list.
>>>>> The functionality which is implemented by the pciback and the toolstack
>>>>> and which is relevant/missing/needed for ARM:
>>>>>
>>>>> 1. pciback is used as a database for assignable PCI devices, e.g. xl
>>>>>      pci-assignable-{add|remove|list} manipulates that list. So, whenever the
>>>>>      toolstack needs to know which PCI devices can be passed through it reads
>>>>>      that from the relevant sysfs entries of the pciback.
>>>>>
>>>>> 2. pciback is used to hold the unbound PCI devices, e.g. when passing through
>>>>>      a PCI device it needs to be unbound from the relevant device driver and bound
>>>>>      to pciback (strictly speaking it is not required that the device is bound to
>>>>>      pciback, but pciback is again used as a database of the passed through PCI
>>>>>      devices, so we can re-bind the devices back to their original drivers when
>>>>>      guest domain shuts down)
>>>>>
>>>>> 1. As ARM doesn't use pciback implement the above with additional sysctls:
>>>>>    - XEN_SYSCTL_pci_device_set_assigned
>>>> I don't see the point in having this sysfs, Xen already knows when a
>>>> device is assigned because the XEN_DOMCTL_assign_device hypercall is
>>>> used.
>>> But how does the toolstack know about that? When the toolstack needs to
>>>
>>> list/know all assigned devices it queries pciback's sysfs entries. So, with
>>>
>>> XEN_DOMCTL_assign_device we make that knowledge available to Xen,
>>>
>>> but there are no means for the toolstack to get it back.
>> But the toolstack will figure out whether a device is assigned or
>> not by using
>> XEN_SYSCTL_pci_device_get_assigned/XEN_SYSCTL_pci_device_enum_assigned?
>>
>> AFAICT XEN_SYSCTL_pci_device_set_assigned tells Xen a device has been
>> assigned, but Xen should already know it because
>> XEN_DOMCTL_assign_device would have been used to assign the device?
> 
> Ah, I misunderstood you then. So, we only want to drop XEN_DOMCTL_assign_device
> 
> and keep the rest.

Was this a typo? Why would you want to drop XEN_DOMCTL_assign_device?

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] arm/pci: Maintain PCI assignable list
  2020-11-11 15:25           ` Jan Beulich
@ 2020-11-11 15:28             ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-11 15:28 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, sstabellini,
	xen-devel, iwj, wl, Roger Pau Monné


On 11/11/20 5:25 PM, Jan Beulich wrote:
> On 11.11.2020 16:13, Oleksandr Andrushchenko wrote:
>> On 11/11/20 5:03 PM, Roger Pau Monné wrote:
>>> On Wed, Nov 11, 2020 at 02:38:47PM +0000, Oleksandr Andrushchenko wrote:
>>>> On 11/11/20 3:53 PM, Roger Pau Monné wrote:
>>>>> On Mon, Nov 09, 2020 at 02:50:23PM +0200, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>
>>>>>> The original code depends on pciback to manage assignable device list.
>>>>>> The functionality which is implemented by the pciback and the toolstack
>>>>>> and which is relevant/missing/needed for ARM:
>>>>>>
>>>>>> 1. pciback is used as a database for assignable PCI devices, e.g. xl
>>>>>>       pci-assignable-{add|remove|list} manipulates that list. So, whenever the
>>>>>>       toolstack needs to know which PCI devices can be passed through it reads
>>>>>>       that from the relevant sysfs entries of the pciback.
>>>>>>
>>>>>> 2. pciback is used to hold the unbound PCI devices, e.g. when passing through
>>>>>>       a PCI device it needs to be unbound from the relevant device driver and bound
>>>>>>       to pciback (strictly speaking it is not required that the device is bound to
>>>>>>       pciback, but pciback is again used as a database of the passed through PCI
>>>>>>       devices, so we can re-bind the devices back to their original drivers when
>>>>>>       guest domain shuts down)
>>>>>>
>>>>>> 1. As ARM doesn't use pciback implement the above with additional sysctls:
>>>>>>     - XEN_SYSCTL_pci_device_set_assigned
>>>>> I don't see the point in having this sysfs, Xen already knows when a
>>>>> device is assigned because the XEN_DOMCTL_assign_device hypercall is
>>>>> used.
>>>> But how does the toolstack know about that? When the toolstack needs to
>>>>
>>>> list/know all assigned devices it queries pciback's sysfs entries. So, with
>>>>
>>>> XEN_DOMCTL_assign_device we make that knowledge available to Xen,
>>>>
>>>> but there are no means for the toolstack to get it back.
>>> But the toolstack will figure out whether a device is assigned or
>>> not by using
>>> XEN_SYSCTL_pci_device_get_assigned/XEN_SYSCTL_pci_device_enum_assigned?
>>>
>>> AFAICT XEN_SYSCTL_pci_device_set_assigned tells Xen a device has been
>>> assigned, but Xen should already know it because
>>> XEN_DOMCTL_assign_device would have been used to assign the device?
>> Ah, I misunderstood you then. So, we only want to drop XEN_DOMCTL_assign_device
>>
>> and keep the rest.
> Was this a typo? Why would you want to drop XEN_DOMCTL_assign_device?

Indeed it was: s/XEN_DOMCTL_assign_device/XEN_SYSCTL_pci_device_set_assigned

Sorry for confusion

>
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-09 12:50 ` [PATCH 06/10] vpci: Make every domain handle its own BARs Oleksandr Andrushchenko
@ 2020-11-12  9:40   ` Roger Pau Monné
  2020-11-12 13:16     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-12  9:40 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko

On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> At the moment there is an identity mapping between how a guest sees its
> BARs and how they are programmed into guest domain's p2m. This is not
> going to work as guest domains have their own view on the BARs.
> Extend existing vPCI BAR handling to allow every domain to have its own
> view of the BARs: only hardware domain sees physical memory addresses in
> this case and for the rest those are emulated, including logic required
> for the guests to detect memory sizes and properties.
> 
> While emulating BAR access for the guests create a link between the
> virtual BAR address and physical one: use full memory address while
> creating range sets used to map/unmap corresponding address spaces and
> exploit the fact that PCI BAR value doesn't use 8 lower bits of the

I think you mean the low 4bits rather than the low 8bits?

> memory address. Use those bits to pass physical BAR's index, so we can
> build/remove proper p2m mappings.

I find this quite hard to review, given it's a fairly big and
complicated patch. Do you think you could split into smaller chunks?

Maybe you could split into smaller patches that add bits towards the
end goal but still keep the identity mappings?

I've tried to do some review below, but I would appreciate if you
could split this.

> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
>  xen/drivers/vpci/header.c | 276 ++++++++++++++++++++++++++++++++++----
>  xen/drivers/vpci/vpci.c   |   1 +
>  xen/include/xen/vpci.h    |  24 ++--
>  3 files changed, 265 insertions(+), 36 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index f74f728884c0..7dc7c70e24f2 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -31,14 +31,87 @@
>  struct map_data {
>      struct domain *d;
>      bool map;
> +    struct pci_dev *pdev;

If the field is required please place it after the domain one.

>  };
>  
> +static struct vpci_header *get_vpci_header(struct domain *d,
> +                                           const struct pci_dev *pdev);
> +
> +static struct vpci_header *get_hwdom_vpci_header(const struct pci_dev *pdev)
> +{
> +    if ( unlikely(list_empty(&pdev->vpci->headers)) )
> +        return get_vpci_header(hardware_domain, pdev);

I'm not sure I understand why you need a list here: each device can
only be owned by a single guest, and thus there shouldn't be multiple
views of the BARs (or the header).

> +
> +    /* hwdom's header is always the very first entry. */
> +    return list_first_entry(&pdev->vpci->headers, struct vpci_header, node);
> +}
> +
> +static struct vpci_header *get_vpci_header(struct domain *d,
> +                                           const struct pci_dev *pdev)
> +{
> +    struct list_head *prev;
> +    struct vpci_header *header;
> +    struct vpci *vpci = pdev->vpci;
> +
> +    list_for_each( prev, &vpci->headers )
> +    {
> +        struct vpci_header *this = list_entry(prev, struct vpci_header, node);
> +
> +        if ( this->domain_id == d->domain_id )
> +            return this;
> +    }
> +    printk(XENLOG_DEBUG "--------------------------------------" \
> +           "Adding new vPCI BAR headers for domain %d: " PRI_pci" \n",
> +           d->domain_id, pdev->sbdf.seg, pdev->sbdf.bus,
> +           pdev->sbdf.dev, pdev->sbdf.fn);
> +    header = xzalloc(struct vpci_header);
> +    if ( !header )
> +    {
> +        printk(XENLOG_ERR
> +               "Failed to add new vPCI BAR headers for domain %d: " PRI_pci" \n",
> +               d->domain_id, pdev->sbdf.seg, pdev->sbdf.bus,
> +               pdev->sbdf.dev, pdev->sbdf.fn);
> +        return NULL;
> +    }
> +
> +    if ( !is_hardware_domain(d) )
> +    {
> +        struct vpci_header *hwdom_header = get_hwdom_vpci_header(pdev);
> +
> +        /* Make a copy of the hwdom's BARs as the initial state for vBARs. */
> +        memcpy(header, hwdom_header, sizeof(*header));
> +    }
> +
> +    header->domain_id = d->domain_id;
> +    list_add_tail(&header->node, &vpci->headers);

Same here, I think you want a single header, and then some fields
would be read-only for domUs (like the position of the BARs on the
physmap).

> +    return header;
> +}
> +
> +static struct vpci_bar *get_vpci_bar(struct domain *d,
> +                                     const struct pci_dev *pdev,
> +                                     int bar_idx)

unsigned

> +{
> +    struct vpci_header *vheader;
> +
> +    vheader = get_vpci_header(d, pdev);
> +    if ( !vheader )
> +        return NULL;
> +
> +    return &vheader->bars[bar_idx];
> +}
> +
>  static int map_range(unsigned long s, unsigned long e, void *data,
>                       unsigned long *c)
>  {
>      const struct map_data *map = data;
> -    int rc;
> -
> +    unsigned long mfn;
> +    int rc, bar_idx;
> +    struct vpci_header *header = get_hwdom_vpci_header(map->pdev);
> +
> +    bar_idx = s & ~PCI_BASE_ADDRESS_MEM_MASK;

I'm not sure it's fine to stash the BAR index in the low bits of the
address, what about a device having concatenated BARs?

The rangeset would normally join them into a single range, and then
you won't be able to notice whether a region in the rangeset belongs
to one BAR or another.

IMO it might be easier to just have a rangeset for each BAR and
structure the pending work as a linked list of BARs, that will contain
the physical addresses of each BAR (the real phymap one and the guest
physmap view) plus the rangeset of memory regions to map.

> +    s = PFN_DOWN(s);
> +    e = PFN_DOWN(e);

Changing the rangeset to store memory addresses instead of frames
could for example be split into a separate patch.

I think you are doing the calculation of the end pfn wrong here, you
should use PFN_UP instead in case the address is not aligned.

> +    mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr));
>      for ( ; ; )
>      {
>          unsigned long size = e - s + 1;
> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>           * - {un}map_mmio_regions doesn't support preemption.
>           */
>  
> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn)
> +                      : unmap_mmio_regions(map->d, _gfn(s), size, mfn);
>          if ( rc == 0 )
>          {
> -            *c += size;
> +            /*
> +             * Range set is not expressed in frame numbers and the size
> +             * is the number of frames, so update accordingly.
> +             */
> +            *c += size << PAGE_SHIFT;
>              break;
>          }
>          if ( rc < 0 )
> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>              break;
>          }
>          ASSERT(rc < size);
> -        *c += rc;
> +        *c += rc << PAGE_SHIFT;
>          s += rc;
> +        mfn += rc;
>          if ( general_preempt_check() )
>                  return -ERESTART;
>      }
> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>  static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>                              bool rom_only)
>  {
> -    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>      bool map = cmd & PCI_COMMAND_MEMORY;
>      unsigned int i;
>  
> @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v)
>          struct map_data data = {
>              .d = v->domain,
>              .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> +            .pdev = v->vpci.pdev,
>          };
>          int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
>  
> @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v)
>  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>                              struct rangeset *mem, uint16_t cmd)
>  {
> -    struct map_data data = { .d = d, .map = true };
> +    struct map_data data = { .d = d, .map = true,
> +        .pdev = (struct pci_dev *)pdev };

Dropping the const here is not fine. IT either needs to be dropped
from apply_map and further up, or this needs to also be made const.

>      int rc;
>  
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>  
>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>  {
> -    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_header *header;
>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>      struct pci_dev *tmp, *dev = NULL;
>  #ifdef CONFIG_X86
> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      if ( !mem )
>          return -ENOMEM;
>  
> +    if ( is_hardware_domain(current->domain) )
> +        header = get_hwdom_vpci_header(pdev);
> +    else
> +        header = get_vpci_header(current->domain, pdev);
> +
>      /*
>       * Create a rangeset that represents the current device BARs memory region
>       * and compare it against all the currently active BAR memory regions. If
> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>       * First fill the rangeset with all the BARs of this device or with the ROM
>       * BAR only, depending on whether the guest is toggling the memory decode
>       * bit of the command register, or the enable bit of the ROM BAR register.
> +     *
> +     * Use the PCI reserved bits of the BAR to pass BAR's index.
>       */
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          const struct vpci_bar *bar = &header->bars[i];
> -        unsigned long start = PFN_DOWN(bar->addr);
> -        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> +        unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
> +        unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) +
> +            bar->size - 1;

Will this work fine on Arm 32bits with LPAE? It's my understanding
that in that case unsigned long is 32bits, but the physical address
space is 44bits, in which case this won't work.

I think you need to keep the usage of frame numbers here.

>  
>          if ( !MAPPABLE_BAR(bar) ||
>               (rom_only ? bar->type != VPCI_BAR_ROM
> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      /* Remove any MSIX regions if present. */
>      for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
>      {
> -        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
> -        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
> -                                     vmsix_table_size(pdev->vpci, i) - 1);
> +        unsigned long start = (vmsix_table_addr(pdev->vpci, i) &
> +                               PCI_BASE_ADDRESS_MEM_MASK) | i;
> +        unsigned long end = (vmsix_table_addr(pdev->vpci, i) &
> +                             PCI_BASE_ADDRESS_MEM_MASK ) +
> +                             vmsix_table_size(pdev->vpci, i) - 1;
>  
>          rc = rangeset_remove_range(mem, start, end);
>          if ( rc )
> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>       */
>      for_each_pdev ( pdev->domain, tmp )
>      {
> +        struct vpci_header *header;
> +
>          if ( tmp == pdev )
>          {
>              /*
> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>                  continue;
>          }
>  
> -        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> +        header = get_vpci_header(tmp->domain, pdev);
> +
> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>          {
> -            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> -            unsigned long start = PFN_DOWN(bar->addr);
> -            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> +            const struct vpci_bar *bar = &header->bars[i];
> +            unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
> +            unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
> +                + bar->size - 1;
>  
>              if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
>                   /*
> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>          pci_conf_write16(pdev->sbdf, reg, cmd);
>  }
>  
> -static void bar_write(const struct pci_dev *pdev, unsigned int reg,
> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg,
>                        uint32_t val, void *data)
>  {
>      struct vpci_bar *bar = data;
> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>      {
>          /* If the value written is the current one avoid printing a warning. */
>          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
> +        {
> +            struct vpci_header *header = get_hwdom_vpci_header(pdev);
> +
>              gprintk(XENLOG_WARNING,
>                      "%04x:%02x:%02x.%u: ignored BAR %lu write with memory decoding enabled\n",
>                      pdev->seg, pdev->bus, slot, func,
> -                    bar - pdev->vpci->header.bars + hi);
> +                    bar - header->bars + hi);
> +        }
>          return;
>      }
>  
> -
>      /*
>       * Update the cached address, so that when memory decoding is enabled
>       * Xen can map the BAR into the guest p2m.
> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
>  
> +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    return vpci_hw_read32(pdev, reg, data);
> +}
> +
> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +    struct vpci_bar *vbar = data;
> +    bool hi = false;
> +
> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        vbar--;
> +        hi = true;
> +    }
> +    vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
> +    vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
> +}
> +
> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    struct vpci_bar *vbar = data;
> +    uint32_t val;
> +    bool hi = false;
> +
> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        vbar--;
> +        hi = true;
> +    }
> +
> +    if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == VPCI_BAR_MEM64_HI )

I think this would be clearer using a switch statement.

> +    {
> +        if ( hi )
> +            val = vbar->addr >> 32;
> +        else
> +            val = vbar->addr & 0xffffffff;
> +        if ( val == ~0 )

Strictly speaking I think you are not forced to write 1s to the
reserved 4 bits in the low register (and in the 32bit case).

> +        {
> +            /* Guests detects BAR's properties and sizes. */
> +            if ( !hi )
> +            {
> +                val = 0xffffffff & ~(vbar->size - 1);
> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +            }
> +            else
> +                val = vbar->size >> 32;
> +            vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
> +            vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
> +        }
> +    }
> +    else if ( vbar->type == VPCI_BAR_MEM32 )
> +    {
> +        val = vbar->addr;
> +        if ( val == ~0 )
> +        {
> +            if ( !hi )

There's no way hi can be true at this point AFAICT.

> +            {
> +                val = 0xffffffff & ~(vbar->size - 1);
> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +            }
> +        }
> +    }
> +    else
> +    {
> +        val = vbar->addr;
> +    }
> +    return val;
> +}
> +
>  static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>                        uint32_t val, void *data)
>  {
> -    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>      struct vpci_bar *rom = data;
>      uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>      uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>          rom->addr = val & PCI_ROM_ADDRESS_MASK;
>  }

Don't you need to also protect a domU from writing to the ROM BAR
register?

>  
> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
> +                                  void *data)
> +{
> +    struct vpci_bar *vbar, *bar = data;
> +
> +    if ( is_hardware_domain(current->domain) )
> +        return bar_read_hwdom(pdev, reg, data);
> +
> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
> +    if ( !vbar )
> +        return ~0;
> +
> +    return bar_read_guest(pdev, reg, vbar);
> +}
> +
> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
> +                               uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +
> +    if ( is_hardware_domain(current->domain) )
> +        bar_write_hwdom(pdev, reg, val, data);
> +    else
> +    {
> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
> +
> +        if ( !vbar )
> +            return;
> +        bar_write_guest(pdev, reg, val, vbar);
> +    }
> +}

You should assign different handlers based on whether the domain that
has the device assigned is a domU or the hardware domain, rather than
doing the selection here.

> +
> +/*
> + * FIXME: This is called early while adding vPCI handlers which is done
> + * by and for hwdom.
> + */
>  static int init_bars(struct pci_dev *pdev)
>  {
>      uint16_t cmd;
>      uint64_t addr, size;
>      unsigned int i, num_bars, rom_reg;
> -    struct vpci_header *header = &pdev->vpci->header;
> -    struct vpci_bar *bars = header->bars;
> +    struct vpci_header *header;
> +    struct vpci_bar *bars;
>      int rc;
>  
> +    header = get_hwdom_vpci_header(pdev);
> +    if ( !header )
> +        return -ENOMEM;
> +    bars = header->bars;
> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> @@ -496,11 +714,12 @@ static int init_bars(struct pci_dev *pdev)
>          uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
>          uint32_t val;
>  
> +        bars[i].index = i;
>          if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>          {
>              bars[i].type = VPCI_BAR_MEM64_HI;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            rc = vpci_add_register(pdev->vpci, bar_read_dispatch,
> +                                   bar_write_dispatch, reg, 4, &bars[i]);
>              if ( rc )
>              {
>                  pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -540,8 +759,8 @@ static int init_bars(struct pci_dev *pdev)
>          bars[i].size = size;
>          bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> -                               &bars[i]);
> +        rc = vpci_add_register(pdev->vpci, bar_read_dispatch,
> +                               bar_write_dispatch, reg, 4, &bars[i]);
>          if ( rc )
>          {
>              pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -558,6 +777,7 @@ static int init_bars(struct pci_dev *pdev)
>          rom->type = VPCI_BAR_ROM;
>          rom->size = size;
>          rom->addr = addr;
> +        rom->index = num_bars;
>          header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>                                PCI_ROM_ADDRESS_ENABLE;
>  
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index a5293521a36a..728029da3e9c 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -69,6 +69,7 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
>          return -ENOMEM;
>  
>      INIT_LIST_HEAD(&pdev->vpci->handlers);
> +    INIT_LIST_HEAD(&pdev->vpci->headers);
>      spin_lock_init(&pdev->vpci->lock);
>  
>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index c3501e9ec010..54423bc6556d 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -55,16 +55,14 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
>   */
>  bool __must_check vpci_process_pending(struct vcpu *v);
>  
> -struct vpci {
> -    /* List of vPCI handlers for a device. */
> -    struct list_head handlers;
> -    spinlock_t lock;
> -
>  #ifdef __XEN__
> -    /* Hide the rest of the vpci struct from the user-space test harness. */
>      struct vpci_header {
> +    struct list_head node;
> +    /* Domain that owns this view of the BARs. */
> +    domid_t domain_id;

Indentation seems screwed here.

>          /* Information about the PCI BARs of this device. */
>          struct vpci_bar {
> +            int index;

unsigned

>              uint64_t addr;
>              uint64_t size;
>              enum {
> @@ -88,8 +86,18 @@ struct vpci {
>           * is mapped into guest p2m) if there's a ROM BAR on the device.
>           */
>          bool rom_enabled      : 1;
> -        /* FIXME: currently there's no support for SR-IOV. */

Unless you are also adding support for SR-IOV, I would keep the
comment.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges
  2020-11-09 12:50 ` [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges Oleksandr Andrushchenko
@ 2020-11-12  9:56   ` Roger Pau Monné
  2020-11-13  6:46     ` Oleksandr Andrushchenko
  2020-11-13 10:29   ` Jan Beulich
  1 sibling, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-12  9:56 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko

On Mon, Nov 09, 2020 at 02:50:29PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Non-ECAM host bridges in hwdom go directly to PCI config space,
> not through vpci (they use their specific method for accessing PCI
> configuration, e.g. dedicated registers etc.). Thus hwdom's vpci BARs are
> never updated via vPCI MMIO handlers, so implement a dedicated method
> for a PCI host bridge, so it has a chance to update the initial state of
> the device BARs.
> 
> Note, we rely on the fact that control/hardware domain will not update
> physical BAR locations for the given devices.

This is quite ugly.

I'm looking at the commit that implements the hook for R-Car and I'm
having trouble seeing how that's different from the way we would
normally read the BAR addresses.

I think this should likely be paired with the actual implementation of
a hook, or else it's hard to tell whether it really needed or not.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] vpci/rcar: Implement vPCI.update_bar_header callback
  2020-11-09 12:50 ` [PATCH 09/10] vpci/rcar: Implement vPCI.update_bar_header callback Oleksandr Andrushchenko
@ 2020-11-12 10:00   ` Roger Pau Monné
  2020-11-13  6:50     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-12 10:00 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl, Oleksandr Andrushchenko

On Mon, Nov 09, 2020 at 02:50:30PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Update hardware domain's BAR header as R-Car Gen3 is a non-ECAM host
> controller, so vPCI MMIO handlers do not work for it in hwdom.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
>  xen/arch/arm/pci/pci-host-rcar-gen3.c | 69 +++++++++++++++++++++++++++
>  1 file changed, 69 insertions(+)
> 
> diff --git a/xen/arch/arm/pci/pci-host-rcar-gen3.c b/xen/arch/arm/pci/pci-host-rcar-gen3.c
> index ec14bb29a38b..353ac2bfd6e6 100644
> --- a/xen/arch/arm/pci/pci-host-rcar-gen3.c
> +++ b/xen/arch/arm/pci/pci-host-rcar-gen3.c
> @@ -23,6 +23,7 @@
>  #include <xen/pci.h>
>  #include <asm/pci.h>
>  #include <xen/vmap.h>
> +#include <xen/vpci.h>
>  
>  /* Error values that may be returned by PCI functions */
>  #define PCIBIOS_SUCCESSFUL		0x00
> @@ -307,12 +308,80 @@ int pci_rcar_gen3_config_write(struct pci_host_bridge *bridge, uint32_t _sbdf,
>      return ret;
>  }
>  
> +static void pci_rcar_gen3_hwbar_init(const struct pci_dev *pdev,
> +                                     struct vpci_header *header)
> +
> +{
> +    static bool once = true;
> +    struct vpci_bar *bars = header->bars;
> +    unsigned int num_bars;
> +    int i;

unsigned.

> +
> +    /* Run only once. */
> +    if (!once)

Missing spaces.

> +        return;
> +    once = false;
> +
> +    printk("\n\n ------------------------ %s -------------------\n", __func__);
> +    switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
> +    {
> +    case PCI_HEADER_TYPE_NORMAL:
> +        num_bars = PCI_HEADER_NORMAL_NR_BARS;
> +        break;
> +
> +    case PCI_HEADER_TYPE_BRIDGE:
> +        num_bars = PCI_HEADER_BRIDGE_NR_BARS;
> +        break;
> +
> +    default:
> +        return;
> +    }
> +
> +    for ( i = 0; i < num_bars; i++ )
> +    {
> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +
> +        if ( bars[i].type == VPCI_BAR_MEM64_HI )
> +        {
> +            /*
> +             * Skip hi part of the 64-bit register: it is read
> +             * together with the lower part.
> +             */
> +            continue;
> +        }
> +
> +        if ( bars[i].type == VPCI_BAR_IO )
> +        {
> +            /* Skip IO. */
> +            continue;
> +        }
> +
> +        if ( bars[i].type == VPCI_BAR_MEM64_LO )
> +        {
> +            /* Read both hi and lo parts of the 64-bit BAR. */
> +            bars[i].addr =
> +                (uint64_t)pci_conf_read32(pdev->sbdf, reg + 4) << 32 |
> +                pci_conf_read32(pdev->sbdf, reg);
> +        }
> +        else if ( bars[i].type == VPCI_BAR_MEM32 )
> +        {
> +            bars[i].addr = pci_conf_read32(pdev->sbdf, reg);
> +        }
> +        else
> +        {
> +            /* Expansion ROM? */
> +            continue;
> +        }

Wouldn't this be much simpler as:

bars[i].addr = 0;
switch ( bars[i].type )
{
case VPCI_BAR_MEM64_HI:
    bars[i].addr = (uint64_t)pci_conf_read32(pdev->sbdf, reg + 4) << 32;
    /+ fallthrough. */
case VPCI_BAR_MEM64_LO:
     bars[i].addr |= pci_conf_read32(pdev->sbdf, reg);
     break;

default:
    break;
}

I also wonder why you only care about the address but not the size of
the BAR.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] arm/pci: Maintain PCI assignable list
  2020-11-11 14:54   ` Jan Beulich
@ 2020-11-12 12:53     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-12 12:53 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: iwj, wl, xen-devel, Bertrand.Marquis, julien.grall, sstabellini,
	roger.pau, Rahul.Singh


On 11/11/20 4:54 PM, Jan Beulich wrote:
> On 09.11.2020 13:50, Oleksandr Andrushchenko wrote:
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -879,6 +879,43 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>       return ret;
>>   }
>>   
>> +#ifdef CONFIG_ARM
>> +int pci_device_set_assigned(u16 seg, u8 bus, u8 devfn, bool assigned)
>> +{
>> +    struct pci_dev *pdev;
>> +
>> +    pdev = pci_get_pdev(seg, bus, devfn);
>> +    if ( !pdev )
>> +    {
>> +        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>> +        return -ENODEV;
>> +    }
>> +
>> +    pdev->assigned = assigned;
>> +    printk(XENLOG_ERR "pciback %sassign PCI device %04x:%02x:%02x.%u\n",
>> +           assigned ? "" : "de-",
>> +           seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>> +
>> +    return 0;
>> +}
>> +
>> +int pci_device_get_assigned(u16 seg, u8 bus, u8 devfn)
>> +{
>> +    struct pci_dev *pdev;
>> +
>> +    pdev = pci_get_pdev(seg, bus, devfn);
>> +    if ( !pdev )
>> +    {
>> +        printk(XENLOG_ERR "Can't find PCI device %04x:%02x:%02x.%u\n",
>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>> +        return -ENODEV;
>> +    }
>> +
>> +    return pdev->assigned ? 0 : -ENODEV;
>> +}
>> +#endif
>> +
>>   #ifndef CONFIG_ARM
>>   /*TODO :Implement MSI support for ARM  */
>>   static int pci_clean_dpci_irq(struct domain *d,
>> @@ -1821,6 +1858,62 @@ int iommu_do_pci_domctl(
>>       return ret;
>>   }
>>   
>> +#ifdef CONFIG_ARM
>> +struct list_assigned {
>> +    uint32_t cur_idx;
>> +    uint32_t from_idx;
>> +    bool assigned;
>> +    domid_t *domain;
>> +    uint32_t *machine_sbdf;
>> +};
>> +
>> +static int _enum_assigned_pci_devices(struct pci_seg *pseg, void *arg)
>> +{
>> +    struct list_assigned *ctxt = arg;
>> +    struct pci_dev *pdev;
>> +
>> +    list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>> +    {
>> +        if ( pdev->assigned == ctxt->assigned )
>> +        {
>> +            if ( ctxt->cur_idx == ctxt->from_idx )
>> +            {
>> +                *ctxt->domain = pdev->domain->domain_id;
>> +                *ctxt->machine_sbdf = pdev->sbdf.sbdf;
>> +                return 1;
>> +            }
>> +            ctxt->cur_idx++;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +int pci_device_enum_assigned(bool report_not_assigned,
>> +                             uint32_t from_idx, domid_t *domain,
>> +                             uint32_t *machine_sbdf)
>> +{
>> +    struct list_assigned ctxt = {
>> +        .assigned = !report_not_assigned,
>> +        .cur_idx = 0,
>> +        .from_idx = from_idx,
>> +        .domain = domain,
>> +        .machine_sbdf = machine_sbdf,
>> +    };
>> +    int ret;
>> +
>> +    pcidevs_lock();
>> +    ret = pci_segments_iterate(_enum_assigned_pci_devices, &ctxt);
>> +    pcidevs_unlock();
>> +    /*
>> +     * If not found then report as EINVAL to mark
>> +     * enumeration process finished.
>> +     */
>> +    if ( !ret )
>> +        return -EINVAL;
>> +    return 0;
>> +}
>> +#endif
> Just in case the earlier comments you've got don't lead to removal
> of this code - unless there's a real need for them to be put here,
> under #ifdef, please add a new xen/drivers/passthrough/arm/pci.c
> instead. Even if for just part of the code, this would then also
> help with more clear maintainership of this Arm specific code.
Yes, does make sense to move all ARM specifics into a dedicated file
>
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] [WORKAROUND] xen/arm: Update hwdom's p2m to trap ECAM space
  2020-11-11 14:44   ` Roger Pau Monné
@ 2020-11-12 12:54     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-12 12:54 UTC (permalink / raw)
  To: Roger Pau Monné, Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl


On 11/11/20 4:44 PM, Roger Pau Monné wrote:
> On Mon, Nov 09, 2020 at 02:50:25PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Host bridge controller's ECAM space is mapped into Domain-0's p2m,
>> thus it is not possible to trap the same for vPCI via MMIO handlers.
>> For this to work we need to unmap those mappings in p2m.
>>
>> TODO (Julien): It would be best if we avoid the map/unmap operation.
>> So, maybe we want to introduce another way to avoid the mapping.
>> Maybe by changing the type of the controller to "PCI_HOSTCONTROLLER"
>> and checking if this is a PCI hostcontroller avoid the mapping.
> I know very little about Arm to be able to provide meaningful comments
> here. I agree that creating the maps just to remove them afterwards is
> not the right approach, we should instead avoid those mappings from
> being created in the first place.
Agreed, we'll need to find an acceptable way of doing so
> Roger.

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-12  9:40   ` Roger Pau Monné
@ 2020-11-12 13:16     ` Oleksandr Andrushchenko
  2020-11-12 14:46       ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-12 13:16 UTC (permalink / raw)
  To: Roger Pau Monné, Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl


On 11/12/20 11:40 AM, Roger Pau Monné wrote:
> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> At the moment there is an identity mapping between how a guest sees its
>> BARs and how they are programmed into guest domain's p2m. This is not
>> going to work as guest domains have their own view on the BARs.
>> Extend existing vPCI BAR handling to allow every domain to have its own
>> view of the BARs: only hardware domain sees physical memory addresses in
>> this case and for the rest those are emulated, including logic required
>> for the guests to detect memory sizes and properties.
>>
>> While emulating BAR access for the guests create a link between the
>> virtual BAR address and physical one: use full memory address while
>> creating range sets used to map/unmap corresponding address spaces and
>> exploit the fact that PCI BAR value doesn't use 8 lower bits of the
> I think you mean the low 4bits rather than the low 8bits?
Yes, you are right. Will fix that
>
>> memory address. Use those bits to pass physical BAR's index, so we can
>> build/remove proper p2m mappings.
> I find this quite hard to review, given it's a fairly big and
> complicated patch. Do you think you could split into smaller chunks?

I'll try. But at the moment this code isn't meant to be production quality yet

as what I'd like to achieve is to collect community's view on the subject

Once all questions are resolved I'll start working on the agreed solution

which I expect to be production quality then.

>
> Maybe you could split into smaller patches that add bits towards the
> end goal but still keep the identity mappings?
>
> I've tried to do some review below, but I would appreciate if you
> could split this.
Thank you so much for doing so!
>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>>   xen/drivers/vpci/header.c | 276 ++++++++++++++++++++++++++++++++++----
>>   xen/drivers/vpci/vpci.c   |   1 +
>>   xen/include/xen/vpci.h    |  24 ++--
>>   3 files changed, 265 insertions(+), 36 deletions(-)
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index f74f728884c0..7dc7c70e24f2 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -31,14 +31,87 @@
>>   struct map_data {
>>       struct domain *d;
>>       bool map;
>> +    struct pci_dev *pdev;
> If the field is required please place it after the domain one.
I will, but may I ask why?
>
>>   };
>>   
>> +static struct vpci_header *get_vpci_header(struct domain *d,
>> +                                           const struct pci_dev *pdev);
>> +
>> +static struct vpci_header *get_hwdom_vpci_header(const struct pci_dev *pdev)
>> +{
>> +    if ( unlikely(list_empty(&pdev->vpci->headers)) )
>> +        return get_vpci_header(hardware_domain, pdev);
> I'm not sure I understand why you need a list here: each device can
> only be owned by a single guest, and thus there shouldn't be multiple
> views of the BARs (or the header).

That is because of the over-engineering happening here: you are 100% right

and this all must be made way simpler without lists and all that. I just

blindly thought that we could have multiple guests, but I have overseen

the simple fact you mentioned: physical BARs are for hwdom and virtual are

for *a single* guest as we can't passthrough the same device to multiple

guests at a time.

>
>> +
>> +    /* hwdom's header is always the very first entry. */
>> +    return list_first_entry(&pdev->vpci->headers, struct vpci_header, node);
>> +}
>> +
>> +static struct vpci_header *get_vpci_header(struct domain *d,
>> +                                           const struct pci_dev *pdev)
>> +{
>> +    struct list_head *prev;
>> +    struct vpci_header *header;
>> +    struct vpci *vpci = pdev->vpci;
>> +
>> +    list_for_each( prev, &vpci->headers )
>> +    {
>> +        struct vpci_header *this = list_entry(prev, struct vpci_header, node);
>> +
>> +        if ( this->domain_id == d->domain_id )
>> +            return this;
>> +    }
>> +    printk(XENLOG_DEBUG "--------------------------------------" \
>> +           "Adding new vPCI BAR headers for domain %d: " PRI_pci" \n",
>> +           d->domain_id, pdev->sbdf.seg, pdev->sbdf.bus,
>> +           pdev->sbdf.dev, pdev->sbdf.fn);
>> +    header = xzalloc(struct vpci_header);
>> +    if ( !header )
>> +    {
>> +        printk(XENLOG_ERR
>> +               "Failed to add new vPCI BAR headers for domain %d: " PRI_pci" \n",
>> +               d->domain_id, pdev->sbdf.seg, pdev->sbdf.bus,
>> +               pdev->sbdf.dev, pdev->sbdf.fn);
>> +        return NULL;
>> +    }
>> +
>> +    if ( !is_hardware_domain(d) )
>> +    {
>> +        struct vpci_header *hwdom_header = get_hwdom_vpci_header(pdev);
>> +
>> +        /* Make a copy of the hwdom's BARs as the initial state for vBARs. */
>> +        memcpy(header, hwdom_header, sizeof(*header));
>> +    }
>> +
>> +    header->domain_id = d->domain_id;
>> +    list_add_tail(&header->node, &vpci->headers);
> Same here, I think you want a single header, and then some fields
> would be read-only for domUs (like the position of the BARs on the
> physmap).
ditto, will remove the list
>
>> +    return header;
>> +}
>> +
>> +static struct vpci_bar *get_vpci_bar(struct domain *d,
>> +                                     const struct pci_dev *pdev,
>> +                                     int bar_idx)
> unsigned
ok
>
>> +{
>> +    struct vpci_header *vheader;
>> +
>> +    vheader = get_vpci_header(d, pdev);
>> +    if ( !vheader )
>> +        return NULL;
>> +
>> +    return &vheader->bars[bar_idx];
>> +}
>> +
>>   static int map_range(unsigned long s, unsigned long e, void *data,
>>                        unsigned long *c)
>>   {
>>       const struct map_data *map = data;
>> -    int rc;
>> -
>> +    unsigned long mfn;
>> +    int rc, bar_idx;
>> +    struct vpci_header *header = get_hwdom_vpci_header(map->pdev);
>> +
>> +    bar_idx = s & ~PCI_BASE_ADDRESS_MEM_MASK;
> I'm not sure it's fine to stash the BAR index in the low bits of the
> address, what about a device having concatenated BARs?

Hm, I am not an expert in PCI, so didn't think about that.

Probably nothing stops a PCI device from splitting the same memory

region into multiple ones...

>
> The rangeset would normally join them into a single range, and then
> you won't be able to notice whether a region in the rangeset belongs
> to one BAR or another.
Yes, I see. Very good catch, thank you
>
> IMO it might be easier to just have a rangeset for each BAR and
> structure the pending work as a linked list of BARs, that will contain
> the physical addresses of each BAR (the real phymap one and the guest
> physmap view) plus the rangeset of memory regions to map.
I'll try to think how to do that, thank you
>
>> +    s = PFN_DOWN(s);
>> +    e = PFN_DOWN(e);
> Changing the rangeset to store memory addresses instead of frames
> could for example be split into a separate patch.
Ok
>
> I think you are doing the calculation of the end pfn wrong here, you
> should use PFN_UP instead in case the address is not aligned.

PFN_DOWN for the start seems to be ok if the address is not aligned

which is the case if I pass bar_index in the lower bits: PCI memory has

PAGE_SIZE granularity, so besides the fact that I use bar_index the address

must be page aligned.

The end address is expressed in (size - 1) form, again page aligned,

so to get the last page to be mapped PFN_DOWN also seems to be appropriate.

Do I miss something here?

>
>> +    mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr));
>>       for ( ; ; )
>>       {
>>           unsigned long size = e - s + 1;
>> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>            * - {un}map_mmio_regions doesn't support preemption.
>>            */
>>   
>> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
>> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
>> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn)
>> +                      : unmap_mmio_regions(map->d, _gfn(s), size, mfn);
>>           if ( rc == 0 )
>>           {
>> -            *c += size;
>> +            /*
>> +             * Range set is not expressed in frame numbers and the size
>> +             * is the number of frames, so update accordingly.
>> +             */
>> +            *c += size << PAGE_SHIFT;
>>               break;
>>           }
>>           if ( rc < 0 )
>> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>               break;
>>           }
>>           ASSERT(rc < size);
>> -        *c += rc;
>> +        *c += rc << PAGE_SHIFT;
>>           s += rc;
>> +        mfn += rc;
>>           if ( general_preempt_check() )
>>                   return -ERESTART;
>>       }
>> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>   static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>                               bool rom_only)
>>   {
>> -    struct vpci_header *header = &pdev->vpci->header;
>> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>       bool map = cmd & PCI_COMMAND_MEMORY;
>>       unsigned int i;
>>   
>> @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v)
>>           struct map_data data = {
>>               .d = v->domain,
>>               .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>> +            .pdev = v->vpci.pdev,
>>           };
>>           int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
>>   
>> @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v)
>>   static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>                               struct rangeset *mem, uint16_t cmd)
>>   {
>> -    struct map_data data = { .d = d, .map = true };
>> +    struct map_data data = { .d = d, .map = true,
>> +        .pdev = (struct pci_dev *)pdev };
> Dropping the const here is not fine. IT either needs to be dropped
> from apply_map and further up, or this needs to also be made const.
Ok, I'll try to keep it const
>
>>       int rc;
>>   
>>       while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>>   
>>   static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>   {
>> -    struct vpci_header *header = &pdev->vpci->header;
>> +    struct vpci_header *header;
>>       struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>       struct pci_dev *tmp, *dev = NULL;
>>   #ifdef CONFIG_X86
>> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>       if ( !mem )
>>           return -ENOMEM;
>>   
>> +    if ( is_hardware_domain(current->domain) )
>> +        header = get_hwdom_vpci_header(pdev);
>> +    else
>> +        header = get_vpci_header(current->domain, pdev);
>> +
>>       /*
>>        * Create a rangeset that represents the current device BARs memory region
>>        * and compare it against all the currently active BAR memory regions. If
>> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>        * First fill the rangeset with all the BARs of this device or with the ROM
>>        * BAR only, depending on whether the guest is toggling the memory decode
>>        * bit of the command register, or the enable bit of the ROM BAR register.
>> +     *
>> +     * Use the PCI reserved bits of the BAR to pass BAR's index.
>>        */
>>       for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>       {
>>           const struct vpci_bar *bar = &header->bars[i];
>> -        unsigned long start = PFN_DOWN(bar->addr);
>> -        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>> +        unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
>> +        unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) +
>> +            bar->size - 1;
> Will this work fine on Arm 32bits with LPAE? It's my understanding
> that in that case unsigned long is 32bits, but the physical address
> space is 44bits, in which case this won't work.
Hm, good question
>
> I think you need to keep the usage of frame numbers here.
If I re-work the gfn <-> mfn mapping then yes, I can use frame numbers here and elsewhere
>
>>   
>>           if ( !MAPPABLE_BAR(bar) ||
>>                (rom_only ? bar->type != VPCI_BAR_ROM
>> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>       /* Remove any MSIX regions if present. */
>>       for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
>>       {
>> -        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
>> -        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>> -                                     vmsix_table_size(pdev->vpci, i) - 1);
>> +        unsigned long start = (vmsix_table_addr(pdev->vpci, i) &
>> +                               PCI_BASE_ADDRESS_MEM_MASK) | i;
>> +        unsigned long end = (vmsix_table_addr(pdev->vpci, i) &
>> +                             PCI_BASE_ADDRESS_MEM_MASK ) +
>> +                             vmsix_table_size(pdev->vpci, i) - 1;
>>   
>>           rc = rangeset_remove_range(mem, start, end);
>>           if ( rc )
>> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>        */
>>       for_each_pdev ( pdev->domain, tmp )
>>       {
>> +        struct vpci_header *header;
>> +
>>           if ( tmp == pdev )
>>           {
>>               /*
>> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>                   continue;
>>           }
>>   
>> -        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>> +        header = get_vpci_header(tmp->domain, pdev);
>> +
>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>           {
>> -            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>> -            unsigned long start = PFN_DOWN(bar->addr);
>> -            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>> +            const struct vpci_bar *bar = &header->bars[i];
>> +            unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
>> +            unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
>> +                + bar->size - 1;
>>   
>>               if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
>>                    /*
>> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>           pci_conf_write16(pdev->sbdf, reg, cmd);
>>   }
>>   
>> -static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg,
>>                         uint32_t val, void *data)
>>   {
>>       struct vpci_bar *bar = data;
>> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>       {
>>           /* If the value written is the current one avoid printing a warning. */
>>           if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
>> +        {
>> +            struct vpci_header *header = get_hwdom_vpci_header(pdev);
>> +
>>               gprintk(XENLOG_WARNING,
>>                       "%04x:%02x:%02x.%u: ignored BAR %lu write with memory decoding enabled\n",
>>                       pdev->seg, pdev->bus, slot, func,
>> -                    bar - pdev->vpci->header.bars + hi);
>> +                    bar - header->bars + hi);
>> +        }
>>           return;
>>       }
>>   
>> -
>>       /*
>>        * Update the cached address, so that when memory decoding is enabled
>>        * Xen can map the BAR into the guest p2m.
>> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>       pci_conf_write32(pdev->sbdf, reg, val);
>>   }
>>   
>> +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    return vpci_hw_read32(pdev, reg, data);
>> +}
>> +
>> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg,
>> +                            uint32_t val, void *data)
>> +{
>> +    struct vpci_bar *vbar = data;
>> +    bool hi = false;
>> +
>> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        vbar--;
>> +        hi = true;
>> +    }
>> +    vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>> +    vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>> +}
>> +
>> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    struct vpci_bar *vbar = data;
>> +    uint32_t val;
>> +    bool hi = false;
>> +
>> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        vbar--;
>> +        hi = true;
>> +    }
>> +
>> +    if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == VPCI_BAR_MEM64_HI )
> I think this would be clearer using a switch statement.
I'll think about
>
>> +    {
>> +        if ( hi )
>> +            val = vbar->addr >> 32;
>> +        else
>> +            val = vbar->addr & 0xffffffff;
>> +        if ( val == ~0 )
> Strictly speaking I think you are not forced to write 1s to the
> reserved 4 bits in the low register (and in the 32bit case).

Ah, so Linux kernel, for instance, could have written 0xffffff0 while

I expect 0xffffffff?

>
>> +        {
>> +            /* Guests detects BAR's properties and sizes. */
>> +            if ( !hi )
>> +            {
>> +                val = 0xffffffff & ~(vbar->size - 1);
>> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
>> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>> +            }
>> +            else
>> +                val = vbar->size >> 32;
>> +            vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>> +            vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>> +        }
>> +    }
>> +    else if ( vbar->type == VPCI_BAR_MEM32 )
>> +    {
>> +        val = vbar->addr;
>> +        if ( val == ~0 )
>> +        {
>> +            if ( !hi )
> There's no way hi can be true at this point AFAICT.
Sure, thank you
>
>> +            {
>> +                val = 0xffffffff & ~(vbar->size - 1);
>> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
>> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>> +            }
>> +        }
>> +    }
>> +    else
>> +    {
>> +        val = vbar->addr;
>> +    }
>> +    return val;
>> +}
>> +
>>   static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>                         uint32_t val, void *data)
>>   {
>> -    struct vpci_header *header = &pdev->vpci->header;
>> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>       struct vpci_bar *rom = data;
>>       uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>>       uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
>> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>           rom->addr = val & PCI_ROM_ADDRESS_MASK;
>>   }
> Don't you need to also protect a domU from writing to the ROM BAR
> register?

ROM was not a target of this RFC as I have no HW to test that, but final code must

also handle ROM as well, you are right

>
>>   
>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>> +                                  void *data)
>> +{
>> +    struct vpci_bar *vbar, *bar = data;
>> +
>> +    if ( is_hardware_domain(current->domain) )
>> +        return bar_read_hwdom(pdev, reg, data);
>> +
>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>> +    if ( !vbar )
>> +        return ~0;
>> +
>> +    return bar_read_guest(pdev, reg, vbar);
>> +}
>> +
>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>> +                               uint32_t val, void *data)
>> +{
>> +    struct vpci_bar *bar = data;
>> +
>> +    if ( is_hardware_domain(current->domain) )
>> +        bar_write_hwdom(pdev, reg, val, data);
>> +    else
>> +    {
>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>> +
>> +        if ( !vbar )
>> +            return;
>> +        bar_write_guest(pdev, reg, val, vbar);
>> +    }
>> +}
> You should assign different handlers based on whether the domain that
> has the device assigned is a domU or the hardware domain, rather than
> doing the selection here.

Hm, handlers are assigned once in init_bars and this function is only called

for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher

>
>> +
>> +/*
>> + * FIXME: This is called early while adding vPCI handlers which is done
>> + * by and for hwdom.
>> + */
>>   static int init_bars(struct pci_dev *pdev)
>>   {
>>       uint16_t cmd;
>>       uint64_t addr, size;
>>       unsigned int i, num_bars, rom_reg;
>> -    struct vpci_header *header = &pdev->vpci->header;
>> -    struct vpci_bar *bars = header->bars;
>> +    struct vpci_header *header;
>> +    struct vpci_bar *bars;
>>       int rc;
>>   
>> +    header = get_hwdom_vpci_header(pdev);
>> +    if ( !header )
>> +        return -ENOMEM;
>> +    bars = header->bars;
>> +
>>       switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>       {
>>       case PCI_HEADER_TYPE_NORMAL:
>> @@ -496,11 +714,12 @@ static int init_bars(struct pci_dev *pdev)
>>           uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
>>           uint32_t val;
>>   
>> +        bars[i].index = i;
>>           if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>>           {
>>               bars[i].type = VPCI_BAR_MEM64_HI;
>> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
>> -                                   4, &bars[i]);
>> +            rc = vpci_add_register(pdev->vpci, bar_read_dispatch,
>> +                                   bar_write_dispatch, reg, 4, &bars[i]);
>>               if ( rc )
>>               {
>>                   pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> @@ -540,8 +759,8 @@ static int init_bars(struct pci_dev *pdev)
>>           bars[i].size = size;
>>           bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>>   
>> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
>> -                               &bars[i]);
>> +        rc = vpci_add_register(pdev->vpci, bar_read_dispatch,
>> +                               bar_write_dispatch, reg, 4, &bars[i]);
>>           if ( rc )
>>           {
>>               pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> @@ -558,6 +777,7 @@ static int init_bars(struct pci_dev *pdev)
>>           rom->type = VPCI_BAR_ROM;
>>           rom->size = size;
>>           rom->addr = addr;
>> +        rom->index = num_bars;
>>           header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>>                                 PCI_ROM_ADDRESS_ENABLE;
>>   
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index a5293521a36a..728029da3e9c 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -69,6 +69,7 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
>>           return -ENOMEM;
>>   
>>       INIT_LIST_HEAD(&pdev->vpci->handlers);
>> +    INIT_LIST_HEAD(&pdev->vpci->headers);
>>       spin_lock_init(&pdev->vpci->lock);
>>   
>>       for ( i = 0; i < NUM_VPCI_INIT; i++ )
>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>> index c3501e9ec010..54423bc6556d 100644
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -55,16 +55,14 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
>>    */
>>   bool __must_check vpci_process_pending(struct vcpu *v);
>>   
>> -struct vpci {
>> -    /* List of vPCI handlers for a device. */
>> -    struct list_head handlers;
>> -    spinlock_t lock;
>> -
>>   #ifdef __XEN__
>> -    /* Hide the rest of the vpci struct from the user-space test harness. */
>>       struct vpci_header {
>> +    struct list_head node;
>> +    /* Domain that owns this view of the BARs. */
>> +    domid_t domain_id;
> Indentation seems screwed here.
It did ;)
>
>>           /* Information about the PCI BARs of this device. */
>>           struct vpci_bar {
>> +            int index;
> unsigned
ok
>
>>               uint64_t addr;
>>               uint64_t size;
>>               enum {
>> @@ -88,8 +86,18 @@ struct vpci {
>>            * is mapped into guest p2m) if there's a ROM BAR on the device.
>>            */
>>           bool rom_enabled      : 1;
>> -        /* FIXME: currently there's no support for SR-IOV. */
> Unless you are also adding support for SR-IOV, I would keep the
> comment.

WRT SR-IOV I do need your series [1] ;) SR-IOV is one of our targets

> Thanks, Roger.

Thank you so much for reviewing this,

Oleksandr

[1] https://lists.xenproject.org/archives/html/xen-devel/2018-07/msg01494.html

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-12 13:16     ` Oleksandr Andrushchenko
@ 2020-11-12 14:46       ` Roger Pau Monné
  2020-11-13  6:32         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2020-11-12 14:46 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, jbeulich, sstabellini, xen-devel, iwj, wl

On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
> 
> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
> > On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> >> index f74f728884c0..7dc7c70e24f2 100644
> >> --- a/xen/drivers/vpci/header.c
> >> +++ b/xen/drivers/vpci/header.c
> >> @@ -31,14 +31,87 @@
> >>   struct map_data {
> >>       struct domain *d;
> >>       bool map;
> >> +    struct pci_dev *pdev;
> > If the field is required please place it after the domain one.
> I will, but may I ask why?

So that if we add further boolean fields we can do at the end of the
struct for layout reasons. If we do:

struct map_data {
    struct domain *d;
    bool map;
    struct pci_dev *pdev;
    bool foo;
}

We will end up with a bunch of padding that could be avoided by doing:

struct map_data {
    struct domain *d;
    struct pci_dev *pdev;
    bool map;
    bool foo;
}

> >> +    s = PFN_DOWN(s);
> >> +    e = PFN_DOWN(e);
> > Changing the rangeset to store memory addresses instead of frames
> > could for example be split into a separate patch.
> Ok
> >
> > I think you are doing the calculation of the end pfn wrong here, you
> > should use PFN_UP instead in case the address is not aligned.
> 
> PFN_DOWN for the start seems to be ok if the address is not aligned
> 
> which is the case if I pass bar_index in the lower bits: PCI memory has
> 
> PAGE_SIZE granularity, so besides the fact that I use bar_index the address

No, BARs don't need to be aligned to page boundaries, you can even
have different BARs inside the same physical page.

The spec recommends that the minimum size of a BAR should be 4KB, but
that's not a strict requirement in which case a BAR can be as small as
16bytes, and then you can have multiple ones inside the same page.

> must be page aligned.
> 
> The end address is expressed in (size - 1) form, again page aligned,
> 
> so to get the last page to be mapped PFN_DOWN also seems to be appropriate.
> 
> Do I miss something here?

I'm not aware of any  of those addresses or sizes being guaranteed to
be page aligned, so I think you need to account for that.

Some of the code here uses PFN_DOWN to calculate the end address
because the rangesets are used in an inclusive fashion, so the end
frame also gets mapped.

> >
> >> +    mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr));
> >>       for ( ; ; )
> >>       {
> >>           unsigned long size = e - s + 1;
> >> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long e, void *data,
> >>            * - {un}map_mmio_regions doesn't support preemption.
> >>            */
> >>   
> >> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> >> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> >> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn)
> >> +                      : unmap_mmio_regions(map->d, _gfn(s), size, mfn);
> >>           if ( rc == 0 )
> >>           {
> >> -            *c += size;
> >> +            /*
> >> +             * Range set is not expressed in frame numbers and the size
> >> +             * is the number of frames, so update accordingly.
> >> +             */
> >> +            *c += size << PAGE_SHIFT;
> >>               break;
> >>           }
> >>           if ( rc < 0 )
> >> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, void *data,
> >>               break;
> >>           }
> >>           ASSERT(rc < size);
> >> -        *c += rc;
> >> +        *c += rc << PAGE_SHIFT;
> >>           s += rc;
> >> +        mfn += rc;
> >>           if ( general_preempt_check() )
> >>                   return -ERESTART;
> >>       }
> >> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, void *data,
> >>   static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
> >>                               bool rom_only)
> >>   {
> >> -    struct vpci_header *header = &pdev->vpci->header;
> >> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
> >>       bool map = cmd & PCI_COMMAND_MEMORY;
> >>       unsigned int i;
> >>   
> >> @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v)
> >>           struct map_data data = {
> >>               .d = v->domain,
> >>               .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> >> +            .pdev = v->vpci.pdev,
> >>           };
> >>           int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
> >>   
> >> @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v)
> >>   static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> >>                               struct rangeset *mem, uint16_t cmd)
> >>   {
> >> -    struct map_data data = { .d = d, .map = true };
> >> +    struct map_data data = { .d = d, .map = true,
> >> +        .pdev = (struct pci_dev *)pdev };
> > Dropping the const here is not fine. IT either needs to be dropped
> > from apply_map and further up, or this needs to also be made const.
> Ok, I'll try to keep it const
> >
> >>       int rc;
> >>   
> >>       while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> >> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
> >>   
> >>   static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>   {
> >> -    struct vpci_header *header = &pdev->vpci->header;
> >> +    struct vpci_header *header;
> >>       struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> >>       struct pci_dev *tmp, *dev = NULL;
> >>   #ifdef CONFIG_X86
> >> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>       if ( !mem )
> >>           return -ENOMEM;
> >>   
> >> +    if ( is_hardware_domain(current->domain) )
> >> +        header = get_hwdom_vpci_header(pdev);
> >> +    else
> >> +        header = get_vpci_header(current->domain, pdev);
> >> +
> >>       /*
> >>        * Create a rangeset that represents the current device BARs memory region
> >>        * and compare it against all the currently active BAR memory regions. If
> >> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>        * First fill the rangeset with all the BARs of this device or with the ROM
> >>        * BAR only, depending on whether the guest is toggling the memory decode
> >>        * bit of the command register, or the enable bit of the ROM BAR register.
> >> +     *
> >> +     * Use the PCI reserved bits of the BAR to pass BAR's index.
> >>        */
> >>       for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> >>       {
> >>           const struct vpci_bar *bar = &header->bars[i];
> >> -        unsigned long start = PFN_DOWN(bar->addr);
> >> -        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> >> +        unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
> >> +        unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) +
> >> +            bar->size - 1;
> > Will this work fine on Arm 32bits with LPAE? It's my understanding
> > that in that case unsigned long is 32bits, but the physical address
> > space is 44bits, in which case this won't work.
> Hm, good question
> >
> > I think you need to keep the usage of frame numbers here.
> If I re-work the gfn <-> mfn mapping then yes, I can use frame numbers here and elsewhere
> >
> >>   
> >>           if ( !MAPPABLE_BAR(bar) ||
> >>                (rom_only ? bar->type != VPCI_BAR_ROM
> >> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>       /* Remove any MSIX regions if present. */
> >>       for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
> >>       {
> >> -        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
> >> -        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
> >> -                                     vmsix_table_size(pdev->vpci, i) - 1);
> >> +        unsigned long start = (vmsix_table_addr(pdev->vpci, i) &
> >> +                               PCI_BASE_ADDRESS_MEM_MASK) | i;
> >> +        unsigned long end = (vmsix_table_addr(pdev->vpci, i) &
> >> +                             PCI_BASE_ADDRESS_MEM_MASK ) +
> >> +                             vmsix_table_size(pdev->vpci, i) - 1;
> >>   
> >>           rc = rangeset_remove_range(mem, start, end);
> >>           if ( rc )
> >> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>        */
> >>       for_each_pdev ( pdev->domain, tmp )
> >>       {
> >> +        struct vpci_header *header;
> >> +
> >>           if ( tmp == pdev )
> >>           {
> >>               /*
> >> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>                   continue;
> >>           }
> >>   
> >> -        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> >> +        header = get_vpci_header(tmp->domain, pdev);
> >> +
> >> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> >>           {
> >> -            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> >> -            unsigned long start = PFN_DOWN(bar->addr);
> >> -            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> >> +            const struct vpci_bar *bar = &header->bars[i];
> >> +            unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
> >> +            unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
> >> +                + bar->size - 1;
> >>   
> >>               if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
> >>                    /*
> >> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
> >>           pci_conf_write16(pdev->sbdf, reg, cmd);
> >>   }
> >>   
> >> -static void bar_write(const struct pci_dev *pdev, unsigned int reg,
> >> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg,
> >>                         uint32_t val, void *data)
> >>   {
> >>       struct vpci_bar *bar = data;
> >> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
> >>       {
> >>           /* If the value written is the current one avoid printing a warning. */
> >>           if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
> >> +        {
> >> +            struct vpci_header *header = get_hwdom_vpci_header(pdev);
> >> +
> >>               gprintk(XENLOG_WARNING,
> >>                       "%04x:%02x:%02x.%u: ignored BAR %lu write with memory decoding enabled\n",
> >>                       pdev->seg, pdev->bus, slot, func,
> >> -                    bar - pdev->vpci->header.bars + hi);
> >> +                    bar - header->bars + hi);
> >> +        }
> >>           return;
> >>       }
> >>   
> >> -
> >>       /*
> >>        * Update the cached address, so that when memory decoding is enabled
> >>        * Xen can map the BAR into the guest p2m.
> >> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
> >>       pci_conf_write32(pdev->sbdf, reg, val);
> >>   }
> >>   
> >> +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int reg,
> >> +                               void *data)
> >> +{
> >> +    return vpci_hw_read32(pdev, reg, data);
> >> +}
> >> +
> >> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg,
> >> +                            uint32_t val, void *data)
> >> +{
> >> +    struct vpci_bar *vbar = data;
> >> +    bool hi = false;
> >> +
> >> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
> >> +    {
> >> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> >> +        vbar--;
> >> +        hi = true;
> >> +    }
> >> +    vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
> >> +    vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
> >> +}
> >> +
> >> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int reg,
> >> +                               void *data)
> >> +{
> >> +    struct vpci_bar *vbar = data;
> >> +    uint32_t val;
> >> +    bool hi = false;
> >> +
> >> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
> >> +    {
> >> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> >> +        vbar--;
> >> +        hi = true;
> >> +    }
> >> +
> >> +    if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == VPCI_BAR_MEM64_HI )
> > I think this would be clearer using a switch statement.
> I'll think about
> >
> >> +    {
> >> +        if ( hi )
> >> +            val = vbar->addr >> 32;
> >> +        else
> >> +            val = vbar->addr & 0xffffffff;
> >> +        if ( val == ~0 )
> > Strictly speaking I think you are not forced to write 1s to the
> > reserved 4 bits in the low register (and in the 32bit case).
> 
> Ah, so Linux kernel, for instance, could have written 0xffffff0 while
> 
> I expect 0xffffffff?

I think real hardware would return the size when written 1s to all
bits except the reserved ones.

> 
> >
> >> +        {
> >> +            /* Guests detects BAR's properties and sizes. */
> >> +            if ( !hi )
> >> +            {
> >> +                val = 0xffffffff & ~(vbar->size - 1);
> >> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> >> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
> >> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> >> +            }
> >> +            else
> >> +                val = vbar->size >> 32;
> >> +            vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
> >> +            vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
> >> +        }
> >> +    }
> >> +    else if ( vbar->type == VPCI_BAR_MEM32 )
> >> +    {
> >> +        val = vbar->addr;
> >> +        if ( val == ~0 )
> >> +        {
> >> +            if ( !hi )
> > There's no way hi can be true at this point AFAICT.
> Sure, thank you
> >
> >> +            {
> >> +                val = 0xffffffff & ~(vbar->size - 1);
> >> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> >> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
> >> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> >> +            }
> >> +        }
> >> +    }
> >> +    else
> >> +    {
> >> +        val = vbar->addr;
> >> +    }
> >> +    return val;
> >> +}
> >> +
> >>   static void rom_write(const struct pci_dev *pdev, unsigned int reg,
> >>                         uint32_t val, void *data)
> >>   {
> >> -    struct vpci_header *header = &pdev->vpci->header;
> >> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
> >>       struct vpci_bar *rom = data;
> >>       uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> >>       uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
> >> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
> >>           rom->addr = val & PCI_ROM_ADDRESS_MASK;
> >>   }
> > Don't you need to also protect a domU from writing to the ROM BAR
> > register?
> 
> ROM was not a target of this RFC as I have no HW to test that, but final code must
> 
> also handle ROM as well, you are right
> 
> >
> >>   
> >> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
> >> +                                  void *data)
> >> +{
> >> +    struct vpci_bar *vbar, *bar = data;
> >> +
> >> +    if ( is_hardware_domain(current->domain) )
> >> +        return bar_read_hwdom(pdev, reg, data);
> >> +
> >> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
> >> +    if ( !vbar )
> >> +        return ~0;
> >> +
> >> +    return bar_read_guest(pdev, reg, vbar);
> >> +}
> >> +
> >> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
> >> +                               uint32_t val, void *data)
> >> +{
> >> +    struct vpci_bar *bar = data;
> >> +
> >> +    if ( is_hardware_domain(current->domain) )
> >> +        bar_write_hwdom(pdev, reg, val, data);
> >> +    else
> >> +    {
> >> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
> >> +
> >> +        if ( !vbar )
> >> +            return;
> >> +        bar_write_guest(pdev, reg, val, vbar);
> >> +    }
> >> +}
> > You should assign different handlers based on whether the domain that
> > has the device assigned is a domU or the hardware domain, rather than
> > doing the selection here.
> 
> Hm, handlers are assigned once in init_bars and this function is only called
> 
> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher

I think we might want to reset the vPCI handlers when a devices gets
assigned and deassigned. In order to do passthrough to domUs safely
we will have to add more handlers than what's required for dom0, and
having is_hardware_domain sprinkled in all of them is not a suitable
solution.

Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-12 14:46       ` Roger Pau Monné
@ 2020-11-13  6:32         ` Oleksandr Andrushchenko
  2020-11-13  6:48           ` Oleksandr Andrushchenko
  2020-11-13 10:25           ` Jan Beulich
  0 siblings, 2 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13  6:32 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, jbeulich, sstabellini, xen-devel, iwj, wl


On 11/12/20 4:46 PM, Roger Pau Monné wrote:
> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>>>> index f74f728884c0..7dc7c70e24f2 100644
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -31,14 +31,87 @@
>>>>    struct map_data {
>>>>        struct domain *d;
>>>>        bool map;
>>>> +    struct pci_dev *pdev;
>>> If the field is required please place it after the domain one.
>> I will, but may I ask why?
> So that if we add further boolean fields we can do at the end of the
> struct for layout reasons. If we do:
>
> struct map_data {
>      struct domain *d;
>      bool map;
>      struct pci_dev *pdev;
>      bool foo;
> }
>
> We will end up with a bunch of padding that could be avoided by doing:
>
> struct map_data {
>      struct domain *d;
>      struct pci_dev *pdev;
>      bool map;
>      bool foo;
> }
Ah, so this is about padding. Got it
>
>>>> +    s = PFN_DOWN(s);
>>>> +    e = PFN_DOWN(e);
>>> Changing the rangeset to store memory addresses instead of frames
>>> could for example be split into a separate patch.
>> Ok
>>> I think you are doing the calculation of the end pfn wrong here, you
>>> should use PFN_UP instead in case the address is not aligned.
>> PFN_DOWN for the start seems to be ok if the address is not aligned
>>
>> which is the case if I pass bar_index in the lower bits: PCI memory has
>>
>> PAGE_SIZE granularity, so besides the fact that I use bar_index the address
> No, BARs don't need to be aligned to page boundaries, you can even
> have different BARs inside the same physical page.
>
> The spec recommends that the minimum size of a BAR should be 4KB, but
> that's not a strict requirement in which case a BAR can be as small as
> 16bytes, and then you can have multiple ones inside the same page.
Ok, will account on that
>
>> must be page aligned.
>>
>> The end address is expressed in (size - 1) form, again page aligned,
>>
>> so to get the last page to be mapped PFN_DOWN also seems to be appropriate.
>>
>> Do I miss something here?
> I'm not aware of any  of those addresses or sizes being guaranteed to
> be page aligned, so I think you need to account for that.
>
> Some of the code here uses PFN_DOWN to calculate the end address
> because the rangesets are used in an inclusive fashion, so the end
> frame also gets mapped.
Ok
>
>>>> +    mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr));
>>>>        for ( ; ; )
>>>>        {
>>>>            unsigned long size = e - s + 1;
>>>> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>>             * - {un}map_mmio_regions doesn't support preemption.
>>>>             */
>>>>    
>>>> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
>>>> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
>>>> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn)
>>>> +                      : unmap_mmio_regions(map->d, _gfn(s), size, mfn);
>>>>            if ( rc == 0 )
>>>>            {
>>>> -            *c += size;
>>>> +            /*
>>>> +             * Range set is not expressed in frame numbers and the size
>>>> +             * is the number of frames, so update accordingly.
>>>> +             */
>>>> +            *c += size << PAGE_SHIFT;
>>>>                break;
>>>>            }
>>>>            if ( rc < 0 )
>>>> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>>                break;
>>>>            }
>>>>            ASSERT(rc < size);
>>>> -        *c += rc;
>>>> +        *c += rc << PAGE_SHIFT;
>>>>            s += rc;
>>>> +        mfn += rc;
>>>>            if ( general_preempt_check() )
>>>>                    return -ERESTART;
>>>>        }
>>>> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>>    static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>>                                bool rom_only)
>>>>    {
>>>> -    struct vpci_header *header = &pdev->vpci->header;
>>>> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>>>        bool map = cmd & PCI_COMMAND_MEMORY;
>>>>        unsigned int i;
>>>>    
>>>> @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v)
>>>>            struct map_data data = {
>>>>                .d = v->domain,
>>>>                .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>>> +            .pdev = v->vpci.pdev,
>>>>            };
>>>>            int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
>>>>    
>>>> @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v)
>>>>    static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>>>                                struct rangeset *mem, uint16_t cmd)
>>>>    {
>>>> -    struct map_data data = { .d = d, .map = true };
>>>> +    struct map_data data = { .d = d, .map = true,
>>>> +        .pdev = (struct pci_dev *)pdev };
>>> Dropping the const here is not fine. IT either needs to be dropped
>>> from apply_map and further up, or this needs to also be made const.
>> Ok, I'll try to keep it const
>>>>        int rc;
>>>>    
>>>>        while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>>>> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>>>>    
>>>>    static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>    {
>>>> -    struct vpci_header *header = &pdev->vpci->header;
>>>> +    struct vpci_header *header;
>>>>        struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>>>        struct pci_dev *tmp, *dev = NULL;
>>>>    #ifdef CONFIG_X86
>>>> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>        if ( !mem )
>>>>            return -ENOMEM;
>>>>    
>>>> +    if ( is_hardware_domain(current->domain) )
>>>> +        header = get_hwdom_vpci_header(pdev);
>>>> +    else
>>>> +        header = get_vpci_header(current->domain, pdev);
>>>> +
>>>>        /*
>>>>         * Create a rangeset that represents the current device BARs memory region
>>>>         * and compare it against all the currently active BAR memory regions. If
>>>> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>         * First fill the rangeset with all the BARs of this device or with the ROM
>>>>         * BAR only, depending on whether the guest is toggling the memory decode
>>>>         * bit of the command register, or the enable bit of the ROM BAR register.
>>>> +     *
>>>> +     * Use the PCI reserved bits of the BAR to pass BAR's index.
>>>>         */
>>>>        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>        {
>>>>            const struct vpci_bar *bar = &header->bars[i];
>>>> -        unsigned long start = PFN_DOWN(bar->addr);
>>>> -        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>>>> +        unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
>>>> +        unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) +
>>>> +            bar->size - 1;
>>> Will this work fine on Arm 32bits with LPAE? It's my understanding
>>> that in that case unsigned long is 32bits, but the physical address
>>> space is 44bits, in which case this won't work.
>> Hm, good question
>>> I think you need to keep the usage of frame numbers here.
>> If I re-work the gfn <-> mfn mapping then yes, I can use frame numbers here and elsewhere
>>>>    
>>>>            if ( !MAPPABLE_BAR(bar) ||
>>>>                 (rom_only ? bar->type != VPCI_BAR_ROM
>>>> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>        /* Remove any MSIX regions if present. */
>>>>        for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
>>>>        {
>>>> -        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
>>>> -        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>>>> -                                     vmsix_table_size(pdev->vpci, i) - 1);
>>>> +        unsigned long start = (vmsix_table_addr(pdev->vpci, i) &
>>>> +                               PCI_BASE_ADDRESS_MEM_MASK) | i;
>>>> +        unsigned long end = (vmsix_table_addr(pdev->vpci, i) &
>>>> +                             PCI_BASE_ADDRESS_MEM_MASK ) +
>>>> +                             vmsix_table_size(pdev->vpci, i) - 1;
>>>>    
>>>>            rc = rangeset_remove_range(mem, start, end);
>>>>            if ( rc )
>>>> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>         */
>>>>        for_each_pdev ( pdev->domain, tmp )
>>>>        {
>>>> +        struct vpci_header *header;
>>>> +
>>>>            if ( tmp == pdev )
>>>>            {
>>>>                /*
>>>> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>                    continue;
>>>>            }
>>>>    
>>>> -        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>> +        header = get_vpci_header(tmp->domain, pdev);
>>>> +
>>>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>            {
>>>> -            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>> -            unsigned long start = PFN_DOWN(bar->addr);
>>>> -            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>>>> +            const struct vpci_bar *bar = &header->bars[i];
>>>> +            unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
>>>> +            unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
>>>> +                + bar->size - 1;
>>>>    
>>>>                if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
>>>>                     /*
>>>> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>            pci_conf_write16(pdev->sbdf, reg, cmd);
>>>>    }
>>>>    
>>>> -static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg,
>>>>                          uint32_t val, void *data)
>>>>    {
>>>>        struct vpci_bar *bar = data;
>>>> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>        {
>>>>            /* If the value written is the current one avoid printing a warning. */
>>>>            if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
>>>> +        {
>>>> +            struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>>> +
>>>>                gprintk(XENLOG_WARNING,
>>>>                        "%04x:%02x:%02x.%u: ignored BAR %lu write with memory decoding enabled\n",
>>>>                        pdev->seg, pdev->bus, slot, func,
>>>> -                    bar - pdev->vpci->header.bars + hi);
>>>> +                    bar - header->bars + hi);
>>>> +        }
>>>>            return;
>>>>        }
>>>>    
>>>> -
>>>>        /*
>>>>         * Update the cached address, so that when memory decoding is enabled
>>>>         * Xen can map the BAR into the guest p2m.
>>>> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>        pci_conf_write32(pdev->sbdf, reg, val);
>>>>    }
>>>>    
>>>> +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int reg,
>>>> +                               void *data)
>>>> +{
>>>> +    return vpci_hw_read32(pdev, reg, data);
>>>> +}
>>>> +
>>>> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg,
>>>> +                            uint32_t val, void *data)
>>>> +{
>>>> +    struct vpci_bar *vbar = data;
>>>> +    bool hi = false;
>>>> +
>>>> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
>>>> +    {
>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>> +        vbar--;
>>>> +        hi = true;
>>>> +    }
>>>> +    vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>> +    vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>>>> +}
>>>> +
>>>> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int reg,
>>>> +                               void *data)
>>>> +{
>>>> +    struct vpci_bar *vbar = data;
>>>> +    uint32_t val;
>>>> +    bool hi = false;
>>>> +
>>>> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
>>>> +    {
>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>> +        vbar--;
>>>> +        hi = true;
>>>> +    }
>>>> +
>>>> +    if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == VPCI_BAR_MEM64_HI )
>>> I think this would be clearer using a switch statement.
>> I'll think about
>>>> +    {
>>>> +        if ( hi )
>>>> +            val = vbar->addr >> 32;
>>>> +        else
>>>> +            val = vbar->addr & 0xffffffff;
>>>> +        if ( val == ~0 )
>>> Strictly speaking I think you are not forced to write 1s to the
>>> reserved 4 bits in the low register (and in the 32bit case).
>> Ah, so Linux kernel, for instance, could have written 0xffffff0 while
>>
>> I expect 0xffffffff?
> I think real hardware would return the size when written 1s to all
> bits except the reserved ones.
>
>>>> +        {
>>>> +            /* Guests detects BAR's properties and sizes. */
>>>> +            if ( !hi )
>>>> +            {
>>>> +                val = 0xffffffff & ~(vbar->size - 1);
>>>> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>> +            }
>>>> +            else
>>>> +                val = vbar->size >> 32;
>>>> +            vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>> +            vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>>>> +        }
>>>> +    }
>>>> +    else if ( vbar->type == VPCI_BAR_MEM32 )
>>>> +    {
>>>> +        val = vbar->addr;
>>>> +        if ( val == ~0 )
>>>> +        {
>>>> +            if ( !hi )
>>> There's no way hi can be true at this point AFAICT.
>> Sure, thank you
>>>> +            {
>>>> +                val = 0xffffffff & ~(vbar->size - 1);
>>>> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +    else
>>>> +    {
>>>> +        val = vbar->addr;
>>>> +    }
>>>> +    return val;
>>>> +}
>>>> +
>>>>    static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>>>                          uint32_t val, void *data)
>>>>    {
>>>> -    struct vpci_header *header = &pdev->vpci->header;
>>>> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>>>        struct vpci_bar *rom = data;
>>>>        uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>>>>        uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
>>>> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>>>            rom->addr = val & PCI_ROM_ADDRESS_MASK;
>>>>    }
>>> Don't you need to also protect a domU from writing to the ROM BAR
>>> register?
>> ROM was not a target of this RFC as I have no HW to test that, but final code must
>>
>> also handle ROM as well, you are right
>>
>>>>    
>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>> +                                  void *data)
>>>> +{
>>>> +    struct vpci_bar *vbar, *bar = data;
>>>> +
>>>> +    if ( is_hardware_domain(current->domain) )
>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>> +
>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>> +    if ( !vbar )
>>>> +        return ~0;
>>>> +
>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>> +}
>>>> +
>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>> +                               uint32_t val, void *data)
>>>> +{
>>>> +    struct vpci_bar *bar = data;
>>>> +
>>>> +    if ( is_hardware_domain(current->domain) )
>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>> +    else
>>>> +    {
>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>> +
>>>> +        if ( !vbar )
>>>> +            return;
>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>> +    }
>>>> +}
>>> You should assign different handlers based on whether the domain that
>>> has the device assigned is a domU or the hardware domain, rather than
>>> doing the selection here.
>> Hm, handlers are assigned once in init_bars and this function is only called
>>
>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
> I think we might want to reset the vPCI handlers when a devices gets
> assigned and deassigned.

In ARM case init_bars is called too early: PCI device assignment is currently

initiated by Domain-0' kernel and is done *before* PCI devices are given memory

ranges and BARs assigned:

[    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
[    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
[    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
[    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
[    0.429670] pci 0000:00:00.0: enabling Extended Tags
[    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE

< init_bars >

[    0.453793] pci 0000:00:00.0: -- IRQ 0
[    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
[    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE

< init_bars >

[    0.471821] pci 0000:01:00.0: -- IRQ 255
[    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!

< BAR assignments below >

[    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
[    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]

In case of x86 this is pretty much ok as BARs are already in place, but for ARM we

need to take care and re-setup vPCI BARs for hwdom. Things are getting even more

complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers

and trap hwdom's access to the config space to update BARs etc. This is why I have that

ugly hack for rcar_gen3 to read actual BARs for hwdom.


If we go further and take a look at SR-IOV then when the kernel assigns the device

(BUS_NOTIFY_ADD_DEVICE) then it already has BARs assigned for virtual functions

(need to double-check that).

>   In order to do passthrough to domUs safely
> we will have to add more handlers than what's required for dom0,
Can you please tell what are thinking about? What are the missing handlers?
>   and
> having is_hardware_domain sprinkled in all of them is not a suitable
> solution.

I'll try to replace is_hardware_domain with something like:

+/*
+ * Detect whether physical PCI devices in this segment belong
+ * to the domain given, e.g. on x86 all PCI devices live in hwdom,
+ * but in case of ARM this might not be the case: those may also
+ * live in driver domains or even Xen itself.
+ */
+bool pci_is_hardware_domain(struct domain *d, u16 seg)
+{
+#ifdef CONFIG_X86
+    return is_hardware_domain(d);
+#elif CONFIG_ARM
+    return pci_is_owner_domain(d, seg);
+#else
+#error "Unsupported architecture"
+#endif
+}
+
+/*
+ * Get domain which owns this segment: for x86 this is always hardware
+ * domain and for ARM this can be different.
+ */
+struct domain *pci_get_hardware_domain(u16 seg)
+{
+#ifdef CONFIG_X86
+    return hardware_domain;
+#elif CONFIG_ARM
+    return pci_get_owner_domain(seg);
+#else
+#error "Unsupported architecture"
+#endif
+}

This is what I use to properly detect the domain that really owns physical host bridge

>
> Roger.

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges
  2020-11-12  9:56   ` Roger Pau Monné
@ 2020-11-13  6:46     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13  6:46 UTC (permalink / raw)
  To: Roger Pau Monné, Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl


On 11/12/20 11:56 AM, Roger Pau Monné wrote:
> On Mon, Nov 09, 2020 at 02:50:29PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Non-ECAM host bridges in hwdom go directly to PCI config space,
>> not through vpci (they use their specific method for accessing PCI
>> configuration, e.g. dedicated registers etc.). Thus hwdom's vpci BARs are
>> never updated via vPCI MMIO handlers, so implement a dedicated method
>> for a PCI host bridge, so it has a chance to update the initial state of
>> the device BARs.
>>
>> Note, we rely on the fact that control/hardware domain will not update
>> physical BAR locations for the given devices.
> This is quite ugly.
It is
>
> I'm looking at the commit that implements the hook for R-Car and I'm
> having trouble seeing how that's different from the way we would
> normally read the BAR addresses.

Ok, please see my comment on patch [06/10]. In short:

when a PCI device is *added* we call init_bars and at that time BARs

are not assigned on ARM yet. But, if we move init_bars to the point

when a device *assigned* then it will work? And this code will go away

>
> I think this should likely be paired with the actual implementation of
> a hook, or else it's hard to tell whether it really needed or not.
Yes, if we move to device assign then it won't be needed: have to check that
>
> Thanks, Roger.

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13  6:32         ` Oleksandr Andrushchenko
@ 2020-11-13  6:48           ` Oleksandr Andrushchenko
  2020-11-13 10:25           ` Jan Beulich
  1 sibling, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13  6:48 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, jbeulich, sstabellini, xen-devel, iwj, wl


On 11/13/20 8:32 AM, Oleksandr Andrushchenko wrote:
>
> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>>>>> index f74f728884c0..7dc7c70e24f2 100644
>>>>> --- a/xen/drivers/vpci/header.c
>>>>> +++ b/xen/drivers/vpci/header.c
>>>>> @@ -31,14 +31,87 @@
>>>>>    struct map_data {
>>>>>        struct domain *d;
>>>>>        bool map;
>>>>> +    struct pci_dev *pdev;
>>>> If the field is required please place it after the domain one.
>>> I will, but may I ask why?
>> So that if we add further boolean fields we can do at the end of the
>> struct for layout reasons. If we do:
>>
>> struct map_data {
>>      struct domain *d;
>>      bool map;
>>      struct pci_dev *pdev;
>>      bool foo;
>> }
>>
>> We will end up with a bunch of padding that could be avoided by doing:
>>
>> struct map_data {
>>      struct domain *d;
>>      struct pci_dev *pdev;
>>      bool map;
>>      bool foo;
>> }
> Ah, so this is about padding. Got it
>>
>>>>> +    s = PFN_DOWN(s);
>>>>> +    e = PFN_DOWN(e);
>>>> Changing the rangeset to store memory addresses instead of frames
>>>> could for example be split into a separate patch.
>>> Ok
>>>> I think you are doing the calculation of the end pfn wrong here, you
>>>> should use PFN_UP instead in case the address is not aligned.
>>> PFN_DOWN for the start seems to be ok if the address is not aligned
>>>
>>> which is the case if I pass bar_index in the lower bits: PCI memory has
>>>
>>> PAGE_SIZE granularity, so besides the fact that I use bar_index the address
>> No, BARs don't need to be aligned to page boundaries, you can even
>> have different BARs inside the same physical page.
>>
>> The spec recommends that the minimum size of a BAR should be 4KB, but
>> that's not a strict requirement in which case a BAR can be as small as
>> 16bytes, and then you can have multiple ones inside the same page.
> Ok, will account on that
>>
>>> must be page aligned.
>>>
>>> The end address is expressed in (size - 1) form, again page aligned,
>>>
>>> so to get the last page to be mapped PFN_DOWN also seems to be appropriate.
>>>
>>> Do I miss something here?
>> I'm not aware of any  of those addresses or sizes being guaranteed to
>> be page aligned, so I think you need to account for that.
>>
>> Some of the code here uses PFN_DOWN to calculate the end address
>> because the rangesets are used in an inclusive fashion, so the end
>> frame also gets mapped.
> Ok
>>
>>>>> +    mfn = _mfn(PFN_DOWN(header->bars[bar_idx].addr));
>>>>>        for ( ; ; )
>>>>>        {
>>>>>            unsigned long size = e - s + 1;
>>>>> @@ -52,11 +125,15 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>>>             * - {un}map_mmio_regions doesn't support preemption.
>>>>>             */
>>>>>    -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
>>>>> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
>>>>> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, mfn)
>>>>> +                      : unmap_mmio_regions(map->d, _gfn(s), size, mfn);
>>>>>            if ( rc == 0 )
>>>>>            {
>>>>> -            *c += size;
>>>>> +            /*
>>>>> +             * Range set is not expressed in frame numbers and the size
>>>>> +             * is the number of frames, so update accordingly.
>>>>> +             */
>>>>> +            *c += size << PAGE_SHIFT;
>>>>>                break;
>>>>>            }
>>>>>            if ( rc < 0 )
>>>>> @@ -67,8 +144,9 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>>>                break;
>>>>>            }
>>>>>            ASSERT(rc < size);
>>>>> -        *c += rc;
>>>>> +        *c += rc << PAGE_SHIFT;
>>>>>            s += rc;
>>>>> +        mfn += rc;
>>>>>            if ( general_preempt_check() )
>>>>>                    return -ERESTART;
>>>>>        }
>>>>> @@ -84,7 +162,7 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>>>    static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>>>                                bool rom_only)
>>>>>    {
>>>>> -    struct vpci_header *header = &pdev->vpci->header;
>>>>> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>>>>        bool map = cmd & PCI_COMMAND_MEMORY;
>>>>>        unsigned int i;
>>>>>    @@ -136,6 +214,7 @@ bool vpci_process_pending(struct vcpu *v)
>>>>>            struct map_data data = {
>>>>>                .d = v->domain,
>>>>>                .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>>>> +            .pdev = v->vpci.pdev,
>>>>>            };
>>>>>            int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
>>>>>    @@ -168,7 +247,8 @@ bool vpci_process_pending(struct vcpu *v)
>>>>>    static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>>>>                                struct rangeset *mem, uint16_t cmd)
>>>>>    {
>>>>> -    struct map_data data = { .d = d, .map = true };
>>>>> +    struct map_data data = { .d = d, .map = true,
>>>>> +        .pdev = (struct pci_dev *)pdev };
>>>> Dropping the const here is not fine. IT either needs to be dropped
>>>> from apply_map and further up, or this needs to also be made const.
>>> Ok, I'll try to keep it const
>>>>>        int rc;
>>>>>           while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>>>>> @@ -205,7 +285,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>>>>>       static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>    {
>>>>> -    struct vpci_header *header = &pdev->vpci->header;
>>>>> +    struct vpci_header *header;
>>>>>        struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>>>>        struct pci_dev *tmp, *dev = NULL;
>>>>>    #ifdef CONFIG_X86
>>>>> @@ -217,6 +297,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>        if ( !mem )
>>>>>            return -ENOMEM;
>>>>>    +    if ( is_hardware_domain(current->domain) )
>>>>> +        header = get_hwdom_vpci_header(pdev);
>>>>> +    else
>>>>> +        header = get_vpci_header(current->domain, pdev);
>>>>> +
>>>>>        /*
>>>>>         * Create a rangeset that represents the current device BARs memory region
>>>>>         * and compare it against all the currently active BAR memory regions. If
>>>>> @@ -225,12 +310,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>         * First fill the rangeset with all the BARs of this device or with the ROM
>>>>>         * BAR only, depending on whether the guest is toggling the memory decode
>>>>>         * bit of the command register, or the enable bit of the ROM BAR register.
>>>>> +     *
>>>>> +     * Use the PCI reserved bits of the BAR to pass BAR's index.
>>>>>         */
>>>>>        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>        {
>>>>>            const struct vpci_bar *bar = &header->bars[i];
>>>>> -        unsigned long start = PFN_DOWN(bar->addr);
>>>>> -        unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>>>>> +        unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
>>>>> +        unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) +
>>>>> +            bar->size - 1;
>>>> Will this work fine on Arm 32bits with LPAE? It's my understanding
>>>> that in that case unsigned long is 32bits, but the physical address
>>>> space is 44bits, in which case this won't work.
>>> Hm, good question
>>>> I think you need to keep the usage of frame numbers here.
>>> If I re-work the gfn <-> mfn mapping then yes, I can use frame numbers here and elsewhere
>>>>>               if ( !MAPPABLE_BAR(bar) ||
>>>>>                 (rom_only ? bar->type != VPCI_BAR_ROM
>>>>> @@ -251,9 +339,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>        /* Remove any MSIX regions if present. */
>>>>>        for ( i = 0; msix && i < ARRAY_SIZE(msix->tables); i++ )
>>>>>        {
>>>>> -        unsigned long start = PFN_DOWN(vmsix_table_addr(pdev->vpci, i));
>>>>> -        unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>>>>> - vmsix_table_size(pdev->vpci, i) - 1);
>>>>> +        unsigned long start = (vmsix_table_addr(pdev->vpci, i) &
>>>>> +                               PCI_BASE_ADDRESS_MEM_MASK) | i;
>>>>> +        unsigned long end = (vmsix_table_addr(pdev->vpci, i) &
>>>>> +                             PCI_BASE_ADDRESS_MEM_MASK ) +
>>>>> + vmsix_table_size(pdev->vpci, i) - 1;
>>>>>               rc = rangeset_remove_range(mem, start, end);
>>>>>            if ( rc )
>>>>> @@ -273,6 +363,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>         */
>>>>>        for_each_pdev ( pdev->domain, tmp )
>>>>>        {
>>>>> +        struct vpci_header *header;
>>>>> +
>>>>>            if ( tmp == pdev )
>>>>>            {
>>>>>                /*
>>>>> @@ -289,11 +381,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>                    continue;
>>>>>            }
>>>>>    -        for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>> +        header = get_vpci_header(tmp->domain, pdev);
>>>>> +
>>>>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>            {
>>>>> -            const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>> -            unsigned long start = PFN_DOWN(bar->addr);
>>>>> -            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>>>>> +            const struct vpci_bar *bar = &header->bars[i];
>>>>> +            unsigned long start = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK) | i;
>>>>> +            unsigned long end = (bar->addr & PCI_BASE_ADDRESS_MEM_MASK)
>>>>> +                + bar->size - 1;
>>>>>                   if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
>>>>>                     /*
>>>>> @@ -357,7 +452,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>            pci_conf_write16(pdev->sbdf, reg, cmd);
>>>>>    }
>>>>>    -static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>> +static void bar_write_hwdom(const struct pci_dev *pdev, unsigned int reg,
>>>>>                          uint32_t val, void *data)
>>>>>    {
>>>>>        struct vpci_bar *bar = data;
>>>>> @@ -377,14 +472,17 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>        {
>>>>>            /* If the value written is the current one avoid printing a warning. */
>>>>>            if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
>>>>> +        {
>>>>> +            struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>>>> +
>>>>>                gprintk(XENLOG_WARNING,
>>>>>                        "%04x:%02x:%02x.%u: ignored BAR %lu write with memory decoding enabled\n",
>>>>>                        pdev->seg, pdev->bus, slot, func,
>>>>> -                    bar - pdev->vpci->header.bars + hi);
>>>>> +                    bar - header->bars + hi);
>>>>> +        }
>>>>>            return;
>>>>>        }
>>>>>    -
>>>>>        /*
>>>>>         * Update the cached address, so that when memory decoding is enabled
>>>>>         * Xen can map the BAR into the guest p2m.
>>>>> @@ -403,10 +501,89 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>        pci_conf_write32(pdev->sbdf, reg, val);
>>>>>    }
>>>>>    +static uint32_t bar_read_hwdom(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                               void *data)
>>>>> +{
>>>>> +    return vpci_hw_read32(pdev, reg, data);
>>>>> +}
>>>>> +
>>>>> +static void bar_write_guest(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                            uint32_t val, void *data)
>>>>> +{
>>>>> +    struct vpci_bar *vbar = data;
>>>>> +    bool hi = false;
>>>>> +
>>>>> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
>>>>> +    {
>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>> +        vbar--;
>>>>> +        hi = true;
>>>>> +    }
>>>>> +    vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>> +    vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>> +}
>>>>> +
>>>>> +static uint32_t bar_read_guest(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                               void *data)
>>>>> +{
>>>>> +    struct vpci_bar *vbar = data;
>>>>> +    uint32_t val;
>>>>> +    bool hi = false;
>>>>> +
>>>>> +    if ( vbar->type == VPCI_BAR_MEM64_HI )
>>>>> +    {
>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>> +        vbar--;
>>>>> +        hi = true;
>>>>> +    }
>>>>> +
>>>>> +    if ( vbar->type == VPCI_BAR_MEM64_LO || vbar->type == VPCI_BAR_MEM64_HI )
>>>> I think this would be clearer using a switch statement.
>>> I'll think about
>>>>> +    {
>>>>> +        if ( hi )
>>>>> +            val = vbar->addr >> 32;
>>>>> +        else
>>>>> +            val = vbar->addr & 0xffffffff;
>>>>> +        if ( val == ~0 )
>>>> Strictly speaking I think you are not forced to write 1s to the
>>>> reserved 4 bits in the low register (and in the 32bit case).
>>> Ah, so Linux kernel, for instance, could have written 0xffffff0 while
>>>
>>> I expect 0xffffffff?
>> I think real hardware would return the size when written 1s to all
>> bits except the reserved ones.
>>
>>>>> +        {
>>>>> +            /* Guests detects BAR's properties and sizes. */
>>>>> +            if ( !hi )
>>>>> +            {
>>>>> +                val = 0xffffffff & ~(vbar->size - 1);
>>>>> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>>> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>>> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>>> +            }
>>>>> +            else
>>>>> +                val = vbar->size >> 32;
>>>>> +            vbar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>> +            vbar->addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>> +        }
>>>>> +    }
>>>>> +    else if ( vbar->type == VPCI_BAR_MEM32 )
>>>>> +    {
>>>>> +        val = vbar->addr;
>>>>> +        if ( val == ~0 )
>>>>> +        {
>>>>> +            if ( !hi )
>>>> There's no way hi can be true at this point AFAICT.
>>> Sure, thank you
>>>>> +            {
>>>>> +                val = 0xffffffff & ~(vbar->size - 1);
>>>>> +                val |= vbar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>>> +                                                    : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>>> +                val |= vbar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>>> +            }
>>>>> +        }
>>>>> +    }
>>>>> +    else
>>>>> +    {
>>>>> +        val = vbar->addr;
>>>>> +    }
>>>>> +    return val;
>>>>> +}
>>>>> +
>>>>>    static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>                          uint32_t val, void *data)
>>>>>    {
>>>>> -    struct vpci_header *header = &pdev->vpci->header;
>>>>> +    struct vpci_header *header = get_hwdom_vpci_header(pdev);
>>>>>        struct vpci_bar *rom = data;
>>>>>        uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>>>>>        uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
>>>>> @@ -452,15 +629,56 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>            rom->addr = val & PCI_ROM_ADDRESS_MASK;
>>>>>    }
>>>> Don't you need to also protect a domU from writing to the ROM BAR
>>>> register?
>>> ROM was not a target of this RFC as I have no HW to test that, but final code must
>>>
>>> also handle ROM as well, you are right
>>>
>>>>>    +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                                  void *data)
>>>>> +{
>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>> +
>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>> +
>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>> +    if ( !vbar )
>>>>> +        return ~0;
>>>>> +
>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>> +}
>>>>> +
>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                               uint32_t val, void *data)
>>>>> +{
>>>>> +    struct vpci_bar *bar = data;
>>>>> +
>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>> +    else
>>>>> +    {
>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>> +
>>>>> +        if ( !vbar )
>>>>> +            return;
>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>> +    }
>>>>> +}
>>>> You should assign different handlers based on whether the domain that
>>>> has the device assigned is a domU or the hardware domain, rather than
>>>> doing the selection here.
>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>
>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>> I think we might want to reset the vPCI handlers when a devices gets
>> assigned and deassigned.
>
> In ARM case init_bars is called too early: PCI device assignment is currently
>
> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>
> ranges and BARs assigned:
>
> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>
> < init_bars >
>
> [    0.453793] pci 0000:00:00.0: -- IRQ 0
> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>
> < init_bars >
>
> [    0.471821] pci 0000:01:00.0: -- IRQ 255
> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>
> < BAR assignments below >
>
> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>
> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>
> need to take care and re-setup vPCI BARs for hwdom. Things are getting even more
>
> complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers
>
> and trap hwdom's access to the config space to update BARs etc. This is why I have that
>
> ugly hack for rcar_gen3 to read actual BARs for hwdom.
>
>
> If we go further and take a look at SR-IOV then when the kernel assigns the device
>
> (BUS_NOTIFY_ADD_DEVICE) then it already has BARs assigned for virtual functions
>
> (need to double-check that).

Hm, indeed. We just need to move init_bars from being called on PCI *device add* to

*device assign*. This way it won't (?) break x86 and allow ARM to properly initialize vPCI's

BARs...

>
>>   In order to do passthrough to domUs safely
>> we will have to add more handlers than what's required for dom0,
> Can you please tell what are thinking about? What are the missing handlers?
>>   and
>> having is_hardware_domain sprinkled in all of them is not a suitable
>> solution.
>
> I'll try to replace is_hardware_domain with something like:
>
> +/*
> + * Detect whether physical PCI devices in this segment belong
> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
> + * but in case of ARM this might not be the case: those may also
> + * live in driver domains or even Xen itself.
> + */
> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
> +{
> +#ifdef CONFIG_X86
> +    return is_hardware_domain(d);
> +#elif CONFIG_ARM
> +    return pci_is_owner_domain(d, seg);
> +#else
> +#error "Unsupported architecture"
> +#endif
> +}
> +
> +/*
> + * Get domain which owns this segment: for x86 this is always hardware
> + * domain and for ARM this can be different.
> + */
> +struct domain *pci_get_hardware_domain(u16 seg)
> +{
> +#ifdef CONFIG_X86
> +    return hardware_domain;
> +#elif CONFIG_ARM
> +    return pci_get_owner_domain(seg);
> +#else
> +#error "Unsupported architecture"
> +#endif
> +}
>
> This is what I use to properly detect the domain that really owns physical host bridge
>
>>
>> Roger.
>
> Thank you,
>
> Oleksandr
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] vpci/rcar: Implement vPCI.update_bar_header callback
  2020-11-12 10:00   ` Roger Pau Monné
@ 2020-11-13  6:50     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13  6:50 UTC (permalink / raw)
  To: Roger Pau Monné, Oleksandr Andrushchenko
  Cc: Rahul.Singh, Bertrand.Marquis, julien.grall, jbeulich,
	sstabellini, xen-devel, iwj, wl


On 11/12/20 12:00 PM, Roger Pau Monné wrote:
> On Mon, Nov 09, 2020 at 02:50:30PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Update hardware domain's BAR header as R-Car Gen3 is a non-ECAM host
>> controller, so vPCI MMIO handlers do not work for it in hwdom.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>>   xen/arch/arm/pci/pci-host-rcar-gen3.c | 69 +++++++++++++++++++++++++++
>>   1 file changed, 69 insertions(+)
>>
>> diff --git a/xen/arch/arm/pci/pci-host-rcar-gen3.c b/xen/arch/arm/pci/pci-host-rcar-gen3.c
>> index ec14bb29a38b..353ac2bfd6e6 100644
>> --- a/xen/arch/arm/pci/pci-host-rcar-gen3.c
>> +++ b/xen/arch/arm/pci/pci-host-rcar-gen3.c
>> @@ -23,6 +23,7 @@
>>   #include <xen/pci.h>
>>   #include <asm/pci.h>
>>   #include <xen/vmap.h>
>> +#include <xen/vpci.h>
>>   
>>   /* Error values that may be returned by PCI functions */
>>   #define PCIBIOS_SUCCESSFUL		0x00
>> @@ -307,12 +308,80 @@ int pci_rcar_gen3_config_write(struct pci_host_bridge *bridge, uint32_t _sbdf,
>>       return ret;
>>   }
>>   
>> +static void pci_rcar_gen3_hwbar_init(const struct pci_dev *pdev,
>> +                                     struct vpci_header *header)
>> +
>> +{
>> +    static bool once = true;
>> +    struct vpci_bar *bars = header->bars;
>> +    unsigned int num_bars;
>> +    int i;
> unsigned.
ok
>
>> +
>> +    /* Run only once. */
>> +    if (!once)
> Missing spaces.
ok
>
>> +        return;
>> +    once = false;
>> +
>> +    printk("\n\n ------------------------ %s -------------------\n", __func__);
>> +    switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>> +    {
>> +    case PCI_HEADER_TYPE_NORMAL:
>> +        num_bars = PCI_HEADER_NORMAL_NR_BARS;
>> +        break;
>> +
>> +    case PCI_HEADER_TYPE_BRIDGE:
>> +        num_bars = PCI_HEADER_BRIDGE_NR_BARS;
>> +        break;
>> +
>> +    default:
>> +        return;
>> +    }
>> +
>> +    for ( i = 0; i < num_bars; i++ )
>> +    {
>> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
>> +
>> +        if ( bars[i].type == VPCI_BAR_MEM64_HI )
>> +        {
>> +            /*
>> +             * Skip hi part of the 64-bit register: it is read
>> +             * together with the lower part.
>> +             */
>> +            continue;
>> +        }
>> +
>> +        if ( bars[i].type == VPCI_BAR_IO )
>> +        {
>> +            /* Skip IO. */
>> +            continue;
>> +        }
>> +
>> +        if ( bars[i].type == VPCI_BAR_MEM64_LO )
>> +        {
>> +            /* Read both hi and lo parts of the 64-bit BAR. */
>> +            bars[i].addr =
>> +                (uint64_t)pci_conf_read32(pdev->sbdf, reg + 4) << 32 |
>> +                pci_conf_read32(pdev->sbdf, reg);
>> +        }
>> +        else if ( bars[i].type == VPCI_BAR_MEM32 )
>> +        {
>> +            bars[i].addr = pci_conf_read32(pdev->sbdf, reg);
>> +        }
>> +        else
>> +        {
>> +            /* Expansion ROM? */
>> +            continue;
>> +        }
> Wouldn't this be much simpler as:
Yes, seems to be simpler, thank you
>
> bars[i].addr = 0;
> switch ( bars[i].type )
> {
> case VPCI_BAR_MEM64_HI:
>      bars[i].addr = (uint64_t)pci_conf_read32(pdev->sbdf, reg + 4) << 32;
>      /+ fallthrough. */
> case VPCI_BAR_MEM64_LO:
>       bars[i].addr |= pci_conf_read32(pdev->sbdf, reg);
>       break;
>
> default:
>      break;
> }
>
> I also wonder why you only care about the address but not the size of
> the BAR.
Yes, size needs to be updated as well, even for RFC
>
> Thanks, Roger.

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13  6:32         ` Oleksandr Andrushchenko
  2020-11-13  6:48           ` Oleksandr Andrushchenko
@ 2020-11-13 10:25           ` Jan Beulich
  2020-11-13 10:36             ` Julien Grall
  2020-11-13 10:46             ` Oleksandr Andrushchenko
  1 sibling, 2 replies; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 10:25 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                                  void *data)
>>>>> +{
>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>> +
>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>> +
>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>> +    if ( !vbar )
>>>>> +        return ~0;
>>>>> +
>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>> +}
>>>>> +
>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                               uint32_t val, void *data)
>>>>> +{
>>>>> +    struct vpci_bar *bar = data;
>>>>> +
>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>> +    else
>>>>> +    {
>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>> +
>>>>> +        if ( !vbar )
>>>>> +            return;
>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>> +    }
>>>>> +}
>>>> You should assign different handlers based on whether the domain that
>>>> has the device assigned is a domU or the hardware domain, rather than
>>>> doing the selection here.
>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>
>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>> I think we might want to reset the vPCI handlers when a devices gets
>> assigned and deassigned.
> 
> In ARM case init_bars is called too early: PCI device assignment is currently
> 
> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
> 
> ranges and BARs assigned:
> 
> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
> 
> < init_bars >
> 
> [    0.453793] pci 0000:00:00.0: -- IRQ 0
> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
> 
> < init_bars >
> 
> [    0.471821] pci 0000:01:00.0: -- IRQ 255
> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
> 
> < BAR assignments below >
> 
> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
> 
> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
> 
> need to take care and re-setup vPCI BARs for hwdom.

Even on x86 there's no guarantee that all devices have their BARs set
up by firmware.

In a subsequent reply you've suggested to move init_bars from "add" to
"assign", but I'm having trouble seeing what this would change: It's
not Dom0 controlling assignment (to itself), but Xen assigns the device
towards the end of pci_add_device().

> Things are getting even more
> 
> complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers
> 
> and trap hwdom's access to the config space to update BARs etc. This is why I have that
> 
> ugly hack for rcar_gen3 to read actual BARs for hwdom.

How to config space accesses work there? The latest for MSI/MSI-X it'll
be imperative that Xen be able to intercept config space writes.

>>   In order to do passthrough to domUs safely
>> we will have to add more handlers than what's required for dom0,
> Can you please tell what are thinking about? What are the missing handlers?
>>   and
>> having is_hardware_domain sprinkled in all of them is not a suitable
>> solution.
> 
> I'll try to replace is_hardware_domain with something like:
> 
> +/*
> + * Detect whether physical PCI devices in this segment belong
> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
> + * but in case of ARM this might not be the case: those may also
> + * live in driver domains or even Xen itself.
> + */
> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
> +{
> +#ifdef CONFIG_X86
> +    return is_hardware_domain(d);
> +#elif CONFIG_ARM
> +    return pci_is_owner_domain(d, seg);
> +#else
> +#error "Unsupported architecture"
> +#endif
> +}
> +
> +/*
> + * Get domain which owns this segment: for x86 this is always hardware
> + * domain and for ARM this can be different.
> + */
> +struct domain *pci_get_hardware_domain(u16 seg)
> +{
> +#ifdef CONFIG_X86
> +    return hardware_domain;
> +#elif CONFIG_ARM
> +    return pci_get_owner_domain(seg);
> +#else
> +#error "Unsupported architecture"
> +#endif
> +}
> 
> This is what I use to properly detect the domain that really owns physical host bridge

I consider this problematic. We should try to not let Arm's and x86'es
PCI implementations diverge too much, i.e. at least the underlying basic
model would better be similar. For example, if entire segments can be
assigned to a driver domain on Arm, why should the same not be possible
on x86?

Furthermore I'm suspicious about segments being the right granularity
here. On ia64 multiple host bridges could (and typically would) live
on segment 0. Iirc I had once seen output from an x86 system which was
apparently laid out similarly. Therefore, just like for MCFG, I think
the granularity wants to be bus ranges within a segment.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges
  2020-11-09 12:50 ` [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges Oleksandr Andrushchenko
  2020-11-12  9:56   ` Roger Pau Monné
@ 2020-11-13 10:29   ` Jan Beulich
  2020-11-13 10:39     ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 10:29 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: iwj, wl, Oleksandr Andrushchenko, xen-devel, julien.grall,
	Bertrand.Marquis, sstabellini, roger.pau, Rahul.Singh

On 09.11.2020 13:50, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Non-ECAM host bridges in hwdom go directly to PCI config space,
> not through vpci (they use their specific method for accessing PCI
> configuration, e.g. dedicated registers etc.).

And access to these dedicated registers can't be intercepted? It
would seem to me that if so, such a platform is not capable of
being virtualized (without cooperation by all the domains in
possession of devices).

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 10:25           ` Jan Beulich
@ 2020-11-13 10:36             ` Julien Grall
  2020-11-13 10:53               ` Jan Beulich
  2020-11-13 10:46             ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 64+ messages in thread
From: Julien Grall @ 2020-11-13 10:36 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné



On 13/11/2020 10:25, Jan Beulich wrote:
> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>> +                                  void *data)
>>>>>> +{
>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>> +
>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>> +
>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>> +    if ( !vbar )
>>>>>> +        return ~0;
>>>>>> +
>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>> +}
>>>>>> +
>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>> +                               uint32_t val, void *data)
>>>>>> +{
>>>>>> +    struct vpci_bar *bar = data;
>>>>>> +
>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>> +    else
>>>>>> +    {
>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>> +
>>>>>> +        if ( !vbar )
>>>>>> +            return;
>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>> +    }
>>>>>> +}
>>>>> You should assign different handlers based on whether the domain that
>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>> doing the selection here.
>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>
>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>> I think we might want to reset the vPCI handlers when a devices gets
>>> assigned and deassigned.
>>
>> In ARM case init_bars is called too early: PCI device assignment is currently
>>
>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>
>> ranges and BARs assigned:
>>
>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>
>> < init_bars >
>>
>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>
>> < init_bars >
>>
>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>
>> < BAR assignments below >
>>
>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>
>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>
>> need to take care and re-setup vPCI BARs for hwdom.
> 
> Even on x86 there's no guarantee that all devices have their BARs set
> up by firmware.
> 
> In a subsequent reply you've suggested to move init_bars from "add" to
> "assign", but I'm having trouble seeing what this would change: It's
> not Dom0 controlling assignment (to itself), but Xen assigns the device
> towards the end of pci_add_device().
> 
>> Things are getting even more
>>
>> complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers
>>
>> and trap hwdom's access to the config space to update BARs etc. This is why I have that
>>
>> ugly hack for rcar_gen3 to read actual BARs for hwdom.
> 
> How to config space accesses work there? The latest for MSI/MSI-X it'll
> be imperative that Xen be able to intercept config space writes.

I am not sure to understand your last sentence. Are you saying that we 
always need to trap access to MSI/MSI-X message in order to sanitize it?

If one is using the GICv3 ITS (I haven't investigated other MSI 
controller), then I don't believe you need to sanitize the MSI/MSI-X 
message in most of the situation.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges
  2020-11-13 10:29   ` Jan Beulich
@ 2020-11-13 10:39     ` Oleksandr Andrushchenko
  2020-11-13 10:47       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13 10:39 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: iwj, wl, xen-devel, julien.grall, Bertrand.Marquis, sstabellini,
	roger.pau, Rahul.Singh


On 11/13/20 12:29 PM, Jan Beulich wrote:
> On 09.11.2020 13:50, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Non-ECAM host bridges in hwdom go directly to PCI config space,
>> not through vpci (they use their specific method for accessing PCI
>> configuration, e.g. dedicated registers etc.).
> And access to these dedicated registers can't be intercepted?

It can. But then you have to fully emulate that bridge, e.g.

"if we write A to regB and after that write C to regZ then it

means we are accessing config space. If we write...."

I mean this would need lots of code in Xen to achieve that

>   It
> would seem to me that if so, such a platform is not capable of
> being virtualized (without cooperation by all the domains in
> possession of devices).

Guest domains always use an emulated ECAM bridge and are easily

trapped and emulated

>
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 10:25           ` Jan Beulich
  2020-11-13 10:36             ` Julien Grall
@ 2020-11-13 10:46             ` Oleksandr Andrushchenko
  2020-11-13 10:50               ` Jan Beulich
  1 sibling, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13 10:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné


On 11/13/20 12:25 PM, Jan Beulich wrote:
> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>> +                                  void *data)
>>>>>> +{
>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>> +
>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>> +
>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>> +    if ( !vbar )
>>>>>> +        return ~0;
>>>>>> +
>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>> +}
>>>>>> +
>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>> +                               uint32_t val, void *data)
>>>>>> +{
>>>>>> +    struct vpci_bar *bar = data;
>>>>>> +
>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>> +    else
>>>>>> +    {
>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>> +
>>>>>> +        if ( !vbar )
>>>>>> +            return;
>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>> +    }
>>>>>> +}
>>>>> You should assign different handlers based on whether the domain that
>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>> doing the selection here.
>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>
>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>> I think we might want to reset the vPCI handlers when a devices gets
>>> assigned and deassigned.
>> In ARM case init_bars is called too early: PCI device assignment is currently
>>
>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>
>> ranges and BARs assigned:
>>
>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>
>> < init_bars >
>>
>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>
>> < init_bars >
>>
>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>
>> < BAR assignments below >
>>
>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>
>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>
>> need to take care and re-setup vPCI BARs for hwdom.
> Even on x86 there's no guarantee that all devices have their BARs set
> up by firmware.

This is true. But there you could have config space trapped in "x86 generic way",

please correct me if I'm wrong here

>
> In a subsequent reply you've suggested to move init_bars from "add" to
> "assign", but I'm having trouble seeing what this would change: It's
> not Dom0 controlling assignment (to itself), but Xen assigns the device
> towards the end of pci_add_device().

PHYSDEVOP_pci_device_add vs XEN_DOMCTL_assign_device

Currently we initialize BARs during PHYSDEVOP_pci_device_add and

if we do that during XEN_DOMCTL_assign_device things seem to change

>
>> Things are getting even more
>>
>> complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers
>>
>> and trap hwdom's access to the config space to update BARs etc. This is why I have that
>>
>> ugly hack for rcar_gen3 to read actual BARs for hwdom.
> How to config space accesses work there? The latest for MSI/MSI-X it'll
> be imperative that Xen be able to intercept config space writes.
>
>>>    In order to do passthrough to domUs safely
>>> we will have to add more handlers than what's required for dom0,
>> Can you please tell what are thinking about? What are the missing handlers?
>>>    and
>>> having is_hardware_domain sprinkled in all of them is not a suitable
>>> solution.
>> I'll try to replace is_hardware_domain with something like:
>>
>> +/*
>> + * Detect whether physical PCI devices in this segment belong
>> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
>> + * but in case of ARM this might not be the case: those may also
>> + * live in driver domains or even Xen itself.
>> + */
>> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
>> +{
>> +#ifdef CONFIG_X86
>> +    return is_hardware_domain(d);
>> +#elif CONFIG_ARM
>> +    return pci_is_owner_domain(d, seg);
>> +#else
>> +#error "Unsupported architecture"
>> +#endif
>> +}
>> +
>> +/*
>> + * Get domain which owns this segment: for x86 this is always hardware
>> + * domain and for ARM this can be different.
>> + */
>> +struct domain *pci_get_hardware_domain(u16 seg)
>> +{
>> +#ifdef CONFIG_X86
>> +    return hardware_domain;
>> +#elif CONFIG_ARM
>> +    return pci_get_owner_domain(seg);
>> +#else
>> +#error "Unsupported architecture"
>> +#endif
>> +}
>>
>> This is what I use to properly detect the domain that really owns physical host bridge
> I consider this problematic. We should try to not let Arm's and x86'es
> PCI implementations diverge too much, i.e. at least the underlying basic
> model would better be similar. For example, if entire segments can be
> assigned to a driver domain on Arm, why should the same not be possible
> on x86?

Good question, probably in this case x86 == ARM and I can use

pci_is_owner_domain for both architectures instead of using is_hardware_domain for x86

>
> Furthermore I'm suspicious about segments being the right granularity
> here. On ia64 multiple host bridges could (and typically would) live
> on segment 0. Iirc I had once seen output from an x86 system which was
> apparently laid out similarly. Therefore, just like for MCFG, I think
> the granularity wants to be bus ranges within a segment.
Can you please suggest something we can use as a hint for such a detection logic?
>
> Jan

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges
  2020-11-13 10:39     ` Oleksandr Andrushchenko
@ 2020-11-13 10:47       ` Jan Beulich
  2020-11-13 10:55         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 10:47 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: iwj, wl, xen-devel, julien.grall, Bertrand.Marquis, sstabellini,
	roger.pau, Rahul.Singh

On 13.11.2020 11:39, Oleksandr Andrushchenko wrote:
> 
> On 11/13/20 12:29 PM, Jan Beulich wrote:
>> On 09.11.2020 13:50, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> Non-ECAM host bridges in hwdom go directly to PCI config space,
>>> not through vpci (they use their specific method for accessing PCI
>>> configuration, e.g. dedicated registers etc.).
>> And access to these dedicated registers can't be intercepted?
> 
> It can. But then you have to fully emulate that bridge, e.g.
> 
> "if we write A to regB and after that write C to regZ then it
> 
> means we are accessing config space. If we write...."

Sounds pretty much like the I/O port based access mechanism on
x86, which also has some sort of "enable". Of course, I/O port
accesses are particularly easy to intercept and handle...

> I mean this would need lots of code in Xen to achieve that

Possibly, but look at the amount of code we have in Xen on the
x86 side to handle MCFG writes by Dom0.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 10:46             ` Oleksandr Andrushchenko
@ 2020-11-13 10:50               ` Jan Beulich
  2020-11-13 11:02                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 10:50 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

On 13.11.2020 11:46, Oleksandr Andrushchenko wrote:
> On 11/13/20 12:25 PM, Jan Beulich wrote:
>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>> +                                  void *data)
>>>>>>> +{
>>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>>> +
>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>>> +
>>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>> +    if ( !vbar )
>>>>>>> +        return ~0;
>>>>>>> +
>>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>> +                               uint32_t val, void *data)
>>>>>>> +{
>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>> +
>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>>> +    else
>>>>>>> +    {
>>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>> +
>>>>>>> +        if ( !vbar )
>>>>>>> +            return;
>>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>>> +    }
>>>>>>> +}
>>>>>> You should assign different handlers based on whether the domain that
>>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>>> doing the selection here.
>>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>>
>>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>>> I think we might want to reset the vPCI handlers when a devices gets
>>>> assigned and deassigned.
>>> In ARM case init_bars is called too early: PCI device assignment is currently
>>>
>>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>>
>>> ranges and BARs assigned:
>>>
>>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>
>>> < init_bars >
>>>
>>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>
>>> < init_bars >
>>>
>>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>
>>> < BAR assignments below >
>>>
>>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>>
>>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>>
>>> need to take care and re-setup vPCI BARs for hwdom.
>> Even on x86 there's no guarantee that all devices have their BARs set
>> up by firmware.
> 
> This is true. But there you could have config space trapped in "x86 generic way",
> 
> please correct me if I'm wrong here
> 
>>
>> In a subsequent reply you've suggested to move init_bars from "add" to
>> "assign", but I'm having trouble seeing what this would change: It's
>> not Dom0 controlling assignment (to itself), but Xen assigns the device
>> towards the end of pci_add_device().
> 
> PHYSDEVOP_pci_device_add vs XEN_DOMCTL_assign_device
> 
> Currently we initialize BARs during PHYSDEVOP_pci_device_add and
> 
> if we do that during XEN_DOMCTL_assign_device things seem to change

But there can't possibly be any XEN_DOMCTL_assign_device involved in
booting of Dom0. 

>>>>    In order to do passthrough to domUs safely
>>>> we will have to add more handlers than what's required for dom0,
>>> Can you please tell what are thinking about? What are the missing handlers?
>>>>    and
>>>> having is_hardware_domain sprinkled in all of them is not a suitable
>>>> solution.
>>> I'll try to replace is_hardware_domain with something like:
>>>
>>> +/*
>>> + * Detect whether physical PCI devices in this segment belong
>>> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
>>> + * but in case of ARM this might not be the case: those may also
>>> + * live in driver domains or even Xen itself.
>>> + */
>>> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
>>> +{
>>> +#ifdef CONFIG_X86
>>> +    return is_hardware_domain(d);
>>> +#elif CONFIG_ARM
>>> +    return pci_is_owner_domain(d, seg);
>>> +#else
>>> +#error "Unsupported architecture"
>>> +#endif
>>> +}
>>> +
>>> +/*
>>> + * Get domain which owns this segment: for x86 this is always hardware
>>> + * domain and for ARM this can be different.
>>> + */
>>> +struct domain *pci_get_hardware_domain(u16 seg)
>>> +{
>>> +#ifdef CONFIG_X86
>>> +    return hardware_domain;
>>> +#elif CONFIG_ARM
>>> +    return pci_get_owner_domain(seg);
>>> +#else
>>> +#error "Unsupported architecture"
>>> +#endif
>>> +}
>>>
>>> This is what I use to properly detect the domain that really owns physical host bridge
>> I consider this problematic. We should try to not let Arm's and x86'es
>> PCI implementations diverge too much, i.e. at least the underlying basic
>> model would better be similar. For example, if entire segments can be
>> assigned to a driver domain on Arm, why should the same not be possible
>> on x86?
> 
> Good question, probably in this case x86 == ARM and I can use
> 
> pci_is_owner_domain for both architectures instead of using is_hardware_domain for x86
> 
>>
>> Furthermore I'm suspicious about segments being the right granularity
>> here. On ia64 multiple host bridges could (and typically would) live
>> on segment 0. Iirc I had once seen output from an x86 system which was
>> apparently laid out similarly. Therefore, just like for MCFG, I think
>> the granularity wants to be bus ranges within a segment.
> Can you please suggest something we can use as a hint for such a detection logic?

The underlying information comes from ACPI tables, iirc. I don't
recall the details, though - sorry.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 10:36             ` Julien Grall
@ 2020-11-13 10:53               ` Jan Beulich
  2020-11-13 11:06                 ` Julien Grall
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 10:53 UTC (permalink / raw)
  To: Julien Grall
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné,
	Oleksandr Andrushchenko

On 13.11.2020 11:36, Julien Grall wrote:
> On 13/11/2020 10:25, Jan Beulich wrote:
>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>> +                                  void *data)
>>>>>>> +{
>>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>>> +
>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>>> +
>>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>> +    if ( !vbar )
>>>>>>> +        return ~0;
>>>>>>> +
>>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>> +                               uint32_t val, void *data)
>>>>>>> +{
>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>> +
>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>>> +    else
>>>>>>> +    {
>>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>> +
>>>>>>> +        if ( !vbar )
>>>>>>> +            return;
>>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>>> +    }
>>>>>>> +}
>>>>>> You should assign different handlers based on whether the domain that
>>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>>> doing the selection here.
>>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>>
>>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>>> I think we might want to reset the vPCI handlers when a devices gets
>>>> assigned and deassigned.
>>>
>>> In ARM case init_bars is called too early: PCI device assignment is currently
>>>
>>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>>
>>> ranges and BARs assigned:
>>>
>>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>
>>> < init_bars >
>>>
>>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>
>>> < init_bars >
>>>
>>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>
>>> < BAR assignments below >
>>>
>>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>>
>>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>>
>>> need to take care and re-setup vPCI BARs for hwdom.
>>
>> Even on x86 there's no guarantee that all devices have their BARs set
>> up by firmware.
>>
>> In a subsequent reply you've suggested to move init_bars from "add" to
>> "assign", but I'm having trouble seeing what this would change: It's
>> not Dom0 controlling assignment (to itself), but Xen assigns the device
>> towards the end of pci_add_device().
>>
>>> Things are getting even more
>>>
>>> complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers
>>>
>>> and trap hwdom's access to the config space to update BARs etc. This is why I have that
>>>
>>> ugly hack for rcar_gen3 to read actual BARs for hwdom.
>>
>> How to config space accesses work there? The latest for MSI/MSI-X it'll
>> be imperative that Xen be able to intercept config space writes.
> 
> I am not sure to understand your last sentence. Are you saying that we 
> always need to trap access to MSI/MSI-X message in order to sanitize it?
> 
> If one is using the GICv3 ITS (I haven't investigated other MSI 
> controller), then I don't believe you need to sanitize the MSI/MSI-X 
> message in most of the situation.

Well, if it's fine for the guest to write arbitrary values to message
address and message data, _and_ to arbitrarily enable/disable MSI / MSI-X,
then yes, no interception would be needed.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges
  2020-11-13 10:47       ` Jan Beulich
@ 2020-11-13 10:55         ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13 10:55 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: iwj, wl, xen-devel, julien.grall, Bertrand.Marquis, sstabellini,
	roger.pau, Rahul.Singh


On 11/13/20 12:47 PM, Jan Beulich wrote:
> On 13.11.2020 11:39, Oleksandr Andrushchenko wrote:
>> On 11/13/20 12:29 PM, Jan Beulich wrote:
>>> On 09.11.2020 13:50, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> Non-ECAM host bridges in hwdom go directly to PCI config space,
>>>> not through vpci (they use their specific method for accessing PCI
>>>> configuration, e.g. dedicated registers etc.).
>>> And access to these dedicated registers can't be intercepted?
>> It can. But then you have to fully emulate that bridge, e.g.
>>
>> "if we write A to regB and after that write C to regZ then it
>>
>> means we are accessing config space. If we write...."
> Sounds pretty much like the I/O port based access mechanism on
> x86, which also has some sort of "enable". Of course, I/O port
> accesses are particularly easy to intercept and handle...
Yes, it has somewhat similar idea
>
>> I mean this would need lots of code in Xen to achieve that
> Possibly, but look at the amount of code we have in Xen on the
> x86 side to handle MCFG writes by Dom0.

But MCFG is handled the same way for all x86 machines, right?

And here I'll have to have a SoC specific code, e.g. a specific driver

>
> Jan

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 10:50               ` Jan Beulich
@ 2020-11-13 11:02                 ` Oleksandr Andrushchenko
  2020-11-13 11:35                   ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13 11:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné


On 11/13/20 12:50 PM, Jan Beulich wrote:
> On 13.11.2020 11:46, Oleksandr Andrushchenko wrote:
>> On 11/13/20 12:25 PM, Jan Beulich wrote:
>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>> +                                  void *data)
>>>>>>>> +{
>>>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>>>> +
>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>>>> +
>>>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>> +    if ( !vbar )
>>>>>>>> +        return ~0;
>>>>>>>> +
>>>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>> +                               uint32_t val, void *data)
>>>>>>>> +{
>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>> +
>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>>>> +    else
>>>>>>>> +    {
>>>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>> +
>>>>>>>> +        if ( !vbar )
>>>>>>>> +            return;
>>>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>>>> +    }
>>>>>>>> +}
>>>>>>> You should assign different handlers based on whether the domain that
>>>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>>>> doing the selection here.
>>>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>>>
>>>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>>>> I think we might want to reset the vPCI handlers when a devices gets
>>>>> assigned and deassigned.
>>>> In ARM case init_bars is called too early: PCI device assignment is currently
>>>>
>>>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>>>
>>>> ranges and BARs assigned:
>>>>
>>>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>>>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>>>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>>>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>>>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>>>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>>>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>
>>>> < init_bars >
>>>>
>>>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>>>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>
>>>> < init_bars >
>>>>
>>>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>>>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>
>>>> < BAR assignments below >
>>>>
>>>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>>>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>>>
>>>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>>>
>>>> need to take care and re-setup vPCI BARs for hwdom.
>>> Even on x86 there's no guarantee that all devices have their BARs set
>>> up by firmware.
>> This is true. But there you could have config space trapped in "x86 generic way",
>>
>> please correct me if I'm wrong here
>>
>>> In a subsequent reply you've suggested to move init_bars from "add" to
>>> "assign", but I'm having trouble seeing what this would change: It's
>>> not Dom0 controlling assignment (to itself), but Xen assigns the device
>>> towards the end of pci_add_device().
>> PHYSDEVOP_pci_device_add vs XEN_DOMCTL_assign_device
>>
>> Currently we initialize BARs during PHYSDEVOP_pci_device_add and
>>
>> if we do that during XEN_DOMCTL_assign_device things seem to change
> But there can't possibly be any XEN_DOMCTL_assign_device involved in
> booting of Dom0.

Indeed. So, do you have an idea when we should call init_bars suitable

for both ARM and x86?

Another question is: what happens bad if x86 and ARM won't call init_bars

until the moment we really assign a PCI device to the first guest?

>
>>>>>     In order to do passthrough to domUs safely
>>>>> we will have to add more handlers than what's required for dom0,
>>>> Can you please tell what are thinking about? What are the missing handlers?
>>>>>     and
>>>>> having is_hardware_domain sprinkled in all of them is not a suitable
>>>>> solution.
>>>> I'll try to replace is_hardware_domain with something like:
>>>>
>>>> +/*
>>>> + * Detect whether physical PCI devices in this segment belong
>>>> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
>>>> + * but in case of ARM this might not be the case: those may also
>>>> + * live in driver domains or even Xen itself.
>>>> + */
>>>> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
>>>> +{
>>>> +#ifdef CONFIG_X86
>>>> +    return is_hardware_domain(d);
>>>> +#elif CONFIG_ARM
>>>> +    return pci_is_owner_domain(d, seg);
>>>> +#else
>>>> +#error "Unsupported architecture"
>>>> +#endif
>>>> +}
>>>> +
>>>> +/*
>>>> + * Get domain which owns this segment: for x86 this is always hardware
>>>> + * domain and for ARM this can be different.
>>>> + */
>>>> +struct domain *pci_get_hardware_domain(u16 seg)
>>>> +{
>>>> +#ifdef CONFIG_X86
>>>> +    return hardware_domain;
>>>> +#elif CONFIG_ARM
>>>> +    return pci_get_owner_domain(seg);
>>>> +#else
>>>> +#error "Unsupported architecture"
>>>> +#endif
>>>> +}
>>>>
>>>> This is what I use to properly detect the domain that really owns physical host bridge
>>> I consider this problematic. We should try to not let Arm's and x86'es
>>> PCI implementations diverge too much, i.e. at least the underlying basic
>>> model would better be similar. For example, if entire segments can be
>>> assigned to a driver domain on Arm, why should the same not be possible
>>> on x86?
>> Good question, probably in this case x86 == ARM and I can use
>>
>> pci_is_owner_domain for both architectures instead of using is_hardware_domain for x86
>>
>>> Furthermore I'm suspicious about segments being the right granularity
>>> here. On ia64 multiple host bridges could (and typically would) live
>>> on segment 0. Iirc I had once seen output from an x86 system which was
>>> apparently laid out similarly. Therefore, just like for MCFG, I think
>>> the granularity wants to be bus ranges within a segment.
>> Can you please suggest something we can use as a hint for such a detection logic?
> The underlying information comes from ACPI tables, iirc. I don't
> recall the details, though - sorry.

Ok, so seg + bus should be enough for both ARM and Xen then, right?

pci_get_hardware_domain(u16 seg, u8 bus)

>
> Jan

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 10:53               ` Jan Beulich
@ 2020-11-13 11:06                 ` Julien Grall
  2020-11-13 11:26                   ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Julien Grall @ 2020-11-13 11:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné,
	Oleksandr Andrushchenko

Hi Jan,

On 13/11/2020 10:53, Jan Beulich wrote:
> On 13.11.2020 11:36, Julien Grall wrote:
>> On 13/11/2020 10:25, Jan Beulich wrote:
>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>> +                                  void *data)
>>>>>>>> +{
>>>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>>>> +
>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>>>> +
>>>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>> +    if ( !vbar )
>>>>>>>> +        return ~0;
>>>>>>>> +
>>>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>> +                               uint32_t val, void *data)
>>>>>>>> +{
>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>> +
>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>>>> +    else
>>>>>>>> +    {
>>>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>> +
>>>>>>>> +        if ( !vbar )
>>>>>>>> +            return;
>>>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>>>> +    }
>>>>>>>> +}
>>>>>>> You should assign different handlers based on whether the domain that
>>>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>>>> doing the selection here.
>>>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>>>
>>>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>>>> I think we might want to reset the vPCI handlers when a devices gets
>>>>> assigned and deassigned.
>>>>
>>>> In ARM case init_bars is called too early: PCI device assignment is currently
>>>>
>>>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>>>
>>>> ranges and BARs assigned:
>>>>
>>>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>>>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>>>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>>>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>>>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>>>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>>>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>
>>>> < init_bars >
>>>>
>>>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>>>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>
>>>> < init_bars >
>>>>
>>>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>>>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>
>>>> < BAR assignments below >
>>>>
>>>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>>>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>>>
>>>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>>>
>>>> need to take care and re-setup vPCI BARs for hwdom.
>>>
>>> Even on x86 there's no guarantee that all devices have their BARs set
>>> up by firmware.
>>>
>>> In a subsequent reply you've suggested to move init_bars from "add" to
>>> "assign", but I'm having trouble seeing what this would change: It's
>>> not Dom0 controlling assignment (to itself), but Xen assigns the device
>>> towards the end of pci_add_device().
>>>
>>>> Things are getting even more
>>>>
>>>> complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers
>>>>
>>>> and trap hwdom's access to the config space to update BARs etc. This is why I have that
>>>>
>>>> ugly hack for rcar_gen3 to read actual BARs for hwdom.
>>>
>>> How to config space accesses work there? The latest for MSI/MSI-X it'll
>>> be imperative that Xen be able to intercept config space writes.
>>
>> I am not sure to understand your last sentence. Are you saying that we
>> always need to trap access to MSI/MSI-X message in order to sanitize it?
>>
>> If one is using the GICv3 ITS (I haven't investigated other MSI
>> controller), then I don't believe you need to sanitize the MSI/MSI-X
>> message in most of the situation.
> 
> Well, if it's fine for the guest to write arbitrary values to message
> address and message data,

The message address would be the doorbell of the ITS that is usually 
going through the IOMMU page-tables. Although, I am aware of a couple of 
platforms where the doorbell access (among other address ranges 
including P2P transaction) bypass the IOMMU. In this situation, we would 
need a lot more work than just trapping the access.

Regarding the message data, for the ITS this is an event ID. The HW will 
then tag each message with the device ID (this prevents spoofing). The 
tupple (device ID, event ID) is used by the ITS to decide where to 
inject the event.

Whether other MSI controllers (e.g. GICv2m) have similar isolation 
feature will be on the case by case basis.

> _and_ to arbitrarily enable/disable MSI / MSI-X,
> then yes, no interception would be needed.
The device would be owned by the guest, so I am not sure to understand 
the exact problem of letting it enabling/disabling MSI/MSI-X. Do you 
mind expanding your thoughts?

Furthermore, you can also control which event is enabled/disabled at the 
ITS level.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 11:06                 ` Julien Grall
@ 2020-11-13 11:26                   ` Jan Beulich
  2020-11-13 11:53                     ` Julien Grall
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 11:26 UTC (permalink / raw)
  To: Julien Grall
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné,
	Oleksandr Andrushchenko

On 13.11.2020 12:06, Julien Grall wrote:
> Hi Jan,
> 
> On 13/11/2020 10:53, Jan Beulich wrote:
>> On 13.11.2020 11:36, Julien Grall wrote:
>>> On 13/11/2020 10:25, Jan Beulich wrote:
>>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>>>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>> +                                  void *data)
>>>>>>>>> +{
>>>>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>>>>> +
>>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>>>>> +
>>>>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>>> +    if ( !vbar )
>>>>>>>>> +        return ~0;
>>>>>>>>> +
>>>>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>> +                               uint32_t val, void *data)
>>>>>>>>> +{
>>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>>> +
>>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>>>>> +    else
>>>>>>>>> +    {
>>>>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>>> +
>>>>>>>>> +        if ( !vbar )
>>>>>>>>> +            return;
>>>>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>>>>> +    }
>>>>>>>>> +}
>>>>>>>> You should assign different handlers based on whether the domain that
>>>>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>>>>> doing the selection here.
>>>>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>>>>
>>>>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>>>>> I think we might want to reset the vPCI handlers when a devices gets
>>>>>> assigned and deassigned.
>>>>>
>>>>> In ARM case init_bars is called too early: PCI device assignment is currently
>>>>>
>>>>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>>>>
>>>>> ranges and BARs assigned:
>>>>>
>>>>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>>>>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>>>>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>>>>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>>>>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>>>>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>>>>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>>
>>>>> < init_bars >
>>>>>
>>>>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>>>>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>>
>>>>> < init_bars >
>>>>>
>>>>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>>>>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>>
>>>>> < BAR assignments below >
>>>>>
>>>>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>>>>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>>>>
>>>>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>>>>
>>>>> need to take care and re-setup vPCI BARs for hwdom.
>>>>
>>>> Even on x86 there's no guarantee that all devices have their BARs set
>>>> up by firmware.
>>>>
>>>> In a subsequent reply you've suggested to move init_bars from "add" to
>>>> "assign", but I'm having trouble seeing what this would change: It's
>>>> not Dom0 controlling assignment (to itself), but Xen assigns the device
>>>> towards the end of pci_add_device().
>>>>
>>>>> Things are getting even more
>>>>>
>>>>> complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers
>>>>>
>>>>> and trap hwdom's access to the config space to update BARs etc. This is why I have that
>>>>>
>>>>> ugly hack for rcar_gen3 to read actual BARs for hwdom.
>>>>
>>>> How to config space accesses work there? The latest for MSI/MSI-X it'll
>>>> be imperative that Xen be able to intercept config space writes.
>>>
>>> I am not sure to understand your last sentence. Are you saying that we
>>> always need to trap access to MSI/MSI-X message in order to sanitize it?
>>>
>>> If one is using the GICv3 ITS (I haven't investigated other MSI
>>> controller), then I don't believe you need to sanitize the MSI/MSI-X
>>> message in most of the situation.
>>
>> Well, if it's fine for the guest to write arbitrary values to message
>> address and message data,
> 
> The message address would be the doorbell of the ITS that is usually 
> going through the IOMMU page-tables. Although, I am aware of a couple of 
> platforms where the doorbell access (among other address ranges 
> including P2P transaction) bypass the IOMMU. In this situation, we would 
> need a lot more work than just trapping the access.

When you say "The message address would be the doorbell of the ITS" am
I right in understanding that's the designated address to be put there?
What if the guest puts some random different address there?

> Regarding the message data, for the ITS this is an event ID. The HW will 
> then tag each message with the device ID (this prevents spoofing). The 
> tupple (device ID, event ID) is used by the ITS to decide where to 
> inject the event.
> 
> Whether other MSI controllers (e.g. GICv2m) have similar isolation 
> feature will be on the case by case basis.
> 
>> _and_ to arbitrarily enable/disable MSI / MSI-X,
>> then yes, no interception would be needed.
> The device would be owned by the guest, so I am not sure to understand 
> the exact problem of letting it enabling/disabling MSI/MSI-X. Do you 
> mind expanding your thoughts?

Question is - is Xen involved in any way in the handling of interrupts
from such a device? If not, then I guess full control can indeed be
left with the guest.

> Furthermore, you can also control which event is enabled/disabled at the 
> ITS level.

And that's something Xen controls? On x86 we don't have a 2nd level
of controls, so we need to merge Xen's and the guest's intentions in
software to know what to store in hardware.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 11:02                 ` Oleksandr Andrushchenko
@ 2020-11-13 11:35                   ` Jan Beulich
  2020-11-13 12:41                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 11:35 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

On 13.11.2020 12:02, Oleksandr Andrushchenko wrote:
> 
> On 11/13/20 12:50 PM, Jan Beulich wrote:
>> On 13.11.2020 11:46, Oleksandr Andrushchenko wrote:
>>> On 11/13/20 12:25 PM, Jan Beulich wrote:
>>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>>>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>> +                                  void *data)
>>>>>>>>> +{
>>>>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>>>>> +
>>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>>>>> +
>>>>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>>> +    if ( !vbar )
>>>>>>>>> +        return ~0;
>>>>>>>>> +
>>>>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>> +                               uint32_t val, void *data)
>>>>>>>>> +{
>>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>>> +
>>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>>>>> +    else
>>>>>>>>> +    {
>>>>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>>> +
>>>>>>>>> +        if ( !vbar )
>>>>>>>>> +            return;
>>>>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>>>>> +    }
>>>>>>>>> +}
>>>>>>>> You should assign different handlers based on whether the domain that
>>>>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>>>>> doing the selection here.
>>>>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>>>>
>>>>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>>>>> I think we might want to reset the vPCI handlers when a devices gets
>>>>>> assigned and deassigned.
>>>>> In ARM case init_bars is called too early: PCI device assignment is currently
>>>>>
>>>>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>>>>
>>>>> ranges and BARs assigned:
>>>>>
>>>>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>>>>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>>>>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>>>>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>>>>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>>>>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>>>>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>>
>>>>> < init_bars >
>>>>>
>>>>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>>>>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>>
>>>>> < init_bars >
>>>>>
>>>>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>>>>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>>
>>>>> < BAR assignments below >
>>>>>
>>>>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>>>>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>>>>
>>>>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>>>>
>>>>> need to take care and re-setup vPCI BARs for hwdom.
>>>> Even on x86 there's no guarantee that all devices have their BARs set
>>>> up by firmware.
>>> This is true. But there you could have config space trapped in "x86 generic way",
>>>
>>> please correct me if I'm wrong here
>>>
>>>> In a subsequent reply you've suggested to move init_bars from "add" to
>>>> "assign", but I'm having trouble seeing what this would change: It's
>>>> not Dom0 controlling assignment (to itself), but Xen assigns the device
>>>> towards the end of pci_add_device().
>>> PHYSDEVOP_pci_device_add vs XEN_DOMCTL_assign_device
>>>
>>> Currently we initialize BARs during PHYSDEVOP_pci_device_add and
>>>
>>> if we do that during XEN_DOMCTL_assign_device things seem to change
>> But there can't possibly be any XEN_DOMCTL_assign_device involved in
>> booting of Dom0.
> 
> Indeed. So, do you have an idea when we should call init_bars suitable
> 
> for both ARM and x86?
> 
> Another question is: what happens bad if x86 and ARM won't call init_bars
> 
> until the moment we really assign a PCI device to the first guest?

I'd like to answer the latter question first: How would Dom0 use
the device prior to such an assignment? As an implication to the
presumed answer here, I guess init_bars could be deferred up until
the first time Dom0 (or more generally the possessing domain)
accesses any of them. Similarly, devices used by Xen itself could
have this done immediately before first use. This may require
tracking on a per-device basis whether initialization was done.

>>>>>>     In order to do passthrough to domUs safely
>>>>>> we will have to add more handlers than what's required for dom0,
>>>>> Can you please tell what are thinking about? What are the missing handlers?
>>>>>>     and
>>>>>> having is_hardware_domain sprinkled in all of them is not a suitable
>>>>>> solution.
>>>>> I'll try to replace is_hardware_domain with something like:
>>>>>
>>>>> +/*
>>>>> + * Detect whether physical PCI devices in this segment belong
>>>>> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
>>>>> + * but in case of ARM this might not be the case: those may also
>>>>> + * live in driver domains or even Xen itself.
>>>>> + */
>>>>> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
>>>>> +{
>>>>> +#ifdef CONFIG_X86
>>>>> +    return is_hardware_domain(d);
>>>>> +#elif CONFIG_ARM
>>>>> +    return pci_is_owner_domain(d, seg);
>>>>> +#else
>>>>> +#error "Unsupported architecture"
>>>>> +#endif
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Get domain which owns this segment: for x86 this is always hardware
>>>>> + * domain and for ARM this can be different.
>>>>> + */
>>>>> +struct domain *pci_get_hardware_domain(u16 seg)
>>>>> +{
>>>>> +#ifdef CONFIG_X86
>>>>> +    return hardware_domain;
>>>>> +#elif CONFIG_ARM
>>>>> +    return pci_get_owner_domain(seg);
>>>>> +#else
>>>>> +#error "Unsupported architecture"
>>>>> +#endif
>>>>> +}
>>>>>
>>>>> This is what I use to properly detect the domain that really owns physical host bridge
>>>> I consider this problematic. We should try to not let Arm's and x86'es
>>>> PCI implementations diverge too much, i.e. at least the underlying basic
>>>> model would better be similar. For example, if entire segments can be
>>>> assigned to a driver domain on Arm, why should the same not be possible
>>>> on x86?
>>> Good question, probably in this case x86 == ARM and I can use
>>>
>>> pci_is_owner_domain for both architectures instead of using is_hardware_domain for x86
>>>
>>>> Furthermore I'm suspicious about segments being the right granularity
>>>> here. On ia64 multiple host bridges could (and typically would) live
>>>> on segment 0. Iirc I had once seen output from an x86 system which was
>>>> apparently laid out similarly. Therefore, just like for MCFG, I think
>>>> the granularity wants to be bus ranges within a segment.
>>> Can you please suggest something we can use as a hint for such a detection logic?
>> The underlying information comes from ACPI tables, iirc. I don't
>> recall the details, though - sorry.
> 
> Ok, so seg + bus should be enough for both ARM and Xen then, right?
> 
> pci_get_hardware_domain(u16 seg, u8 bus)

Whether an individual bus number can suitable express things I can't
tell; I did say bus range, but of course if you care about just
individual devices, then a single bus number will of course do.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 11:26                   ` Jan Beulich
@ 2020-11-13 11:53                     ` Julien Grall
  0 siblings, 0 replies; 64+ messages in thread
From: Julien Grall @ 2020-11-13 11:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné,
	Oleksandr Andrushchenko



On 13/11/2020 11:26, Jan Beulich wrote:
> On 13.11.2020 12:06, Julien Grall wrote:
>> Hi Jan,
>>
>> On 13/11/2020 10:53, Jan Beulich wrote:
>>> On 13.11.2020 11:36, Julien Grall wrote:
>>>> On 13/11/2020 10:25, Jan Beulich wrote:
>>>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>>>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>>>>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>>>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>> +                                  void *data)
>>>>>>>>>> +{
>>>>>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>>>>>> +
>>>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>>>>>> +
>>>>>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>>>> +    if ( !vbar )
>>>>>>>>>> +        return ~0;
>>>>>>>>>> +
>>>>>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>> +                               uint32_t val, void *data)
>>>>>>>>>> +{
>>>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>>>> +
>>>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>>>>>> +    else
>>>>>>>>>> +    {
>>>>>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>>>> +
>>>>>>>>>> +        if ( !vbar )
>>>>>>>>>> +            return;
>>>>>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>>>>>> +    }
>>>>>>>>>> +}
>>>>>>>>> You should assign different handlers based on whether the domain that
>>>>>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>>>>>> doing the selection here.
>>>>>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>>>>>
>>>>>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>>>>>> I think we might want to reset the vPCI handlers when a devices gets
>>>>>>> assigned and deassigned.
>>>>>>
>>>>>> In ARM case init_bars is called too early: PCI device assignment is currently
>>>>>>
>>>>>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>>>>>
>>>>>> ranges and BARs assigned:
>>>>>>
>>>>>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>>>>>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>>>>>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>>>>>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>>>>>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>>>>>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>>>>>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>>>
>>>>>> < init_bars >
>>>>>>
>>>>>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>>>>>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>>>
>>>>>> < init_bars >
>>>>>>
>>>>>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>>>>>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>>>
>>>>>> < BAR assignments below >
>>>>>>
>>>>>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>>>>>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>>>>>
>>>>>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>>>>>
>>>>>> need to take care and re-setup vPCI BARs for hwdom.
>>>>>
>>>>> Even on x86 there's no guarantee that all devices have their BARs set
>>>>> up by firmware.
>>>>>
>>>>> In a subsequent reply you've suggested to move init_bars from "add" to
>>>>> "assign", but I'm having trouble seeing what this would change: It's
>>>>> not Dom0 controlling assignment (to itself), but Xen assigns the device
>>>>> towards the end of pci_add_device().
>>>>>
>>>>>> Things are getting even more
>>>>>>
>>>>>> complicated if the host PCI bridge is not ECAM like, so you cannot set mmio_handlers
>>>>>>
>>>>>> and trap hwdom's access to the config space to update BARs etc. This is why I have that
>>>>>>
>>>>>> ugly hack for rcar_gen3 to read actual BARs for hwdom.
>>>>>
>>>>> How to config space accesses work there? The latest for MSI/MSI-X it'll
>>>>> be imperative that Xen be able to intercept config space writes.
>>>>
>>>> I am not sure to understand your last sentence. Are you saying that we
>>>> always need to trap access to MSI/MSI-X message in order to sanitize it?
>>>>
>>>> If one is using the GICv3 ITS (I haven't investigated other MSI
>>>> controller), then I don't believe you need to sanitize the MSI/MSI-X
>>>> message in most of the situation.
>>>
>>> Well, if it's fine for the guest to write arbitrary values to message
>>> address and message data,
>>
>> The message address would be the doorbell of the ITS that is usually
>> going through the IOMMU page-tables. Although, I am aware of a couple of
>> platforms where the doorbell access (among other address ranges
>> including P2P transaction) bypass the IOMMU. In this situation, we would
>> need a lot more work than just trapping the access.
> 
> When you say "The message address would be the doorbell of the ITS" am
> I right in understanding that's the designated address to be put there?
> What if the guest puts some random different address there?

My point was that all the accesses from a PCI device should go through 
the IOMMU. Although, I know this may not be true for all the platforms.

In which case, sanitizing the MSI message address is not going to help 
because a PCI device can DMA into memory range that bypass the IOMMU.

> 
>> Regarding the message data, for the ITS this is an event ID. The HW will
>> then tag each message with the device ID (this prevents spoofing). The
>> tupple (device ID, event ID) is used by the ITS to decide where to
>> inject the event.
>>
>> Whether other MSI controllers (e.g. GICv2m) have similar isolation
>> feature will be on the case by case basis.
>>
>>> _and_ to arbitrarily enable/disable MSI / MSI-X,
>>> then yes, no interception would be needed.
>> The device would be owned by the guest, so I am not sure to understand
>> the exact problem of letting it enabling/disabling MSI/MSI-X. Do you
>> mind expanding your thoughts?
> 
> Question is - is Xen involved in any way in the handling of interrupts
> from such a device? If not, then I guess full control can indeed be
> left with the guest.

Xen will only forward the interrupts to the guest. This is not very 
different to how other interrupts (e.g. SPIs) are dealt with.

So I don't see the problem of giving full control to the guest.

> 
>> Furthermore, you can also control which event is enabled/disabled at the
>> ITS level.
> 
> And that's something Xen controls?

Yes. We only expose a virtual ITS to the guest.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 11:35                   ` Jan Beulich
@ 2020-11-13 12:41                     ` Oleksandr Andrushchenko
  2020-11-13 14:23                       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13 12:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné


On 11/13/20 1:35 PM, Jan Beulich wrote:
> On 13.11.2020 12:02, Oleksandr Andrushchenko wrote:
>> On 11/13/20 12:50 PM, Jan Beulich wrote:
>>> On 13.11.2020 11:46, Oleksandr Andrushchenko wrote:
>>>> On 11/13/20 12:25 PM, Jan Beulich wrote:
>>>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>>>> On 11/12/20 4:46 PM, Roger Pau Monné wrote:
>>>>>>> On Thu, Nov 12, 2020 at 01:16:10PM +0000, Oleksandr Andrushchenko wrote:
>>>>>>>> On 11/12/20 11:40 AM, Roger Pau Monné wrote:
>>>>>>>>> On Mon, Nov 09, 2020 at 02:50:27PM +0200, Oleksandr Andrushchenko wrote:
>>>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>>> +static uint32_t bar_read_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>> +                                  void *data)
>>>>>>>>>> +{
>>>>>>>>>> +    struct vpci_bar *vbar, *bar = data;
>>>>>>>>>> +
>>>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>>>> +        return bar_read_hwdom(pdev, reg, data);
>>>>>>>>>> +
>>>>>>>>>> +    vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>>>> +    if ( !vbar )
>>>>>>>>>> +        return ~0;
>>>>>>>>>> +
>>>>>>>>>> +    return bar_read_guest(pdev, reg, vbar);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static void bar_write_dispatch(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>> +                               uint32_t val, void *data)
>>>>>>>>>> +{
>>>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>>>> +
>>>>>>>>>> +    if ( is_hardware_domain(current->domain) )
>>>>>>>>>> +        bar_write_hwdom(pdev, reg, val, data);
>>>>>>>>>> +    else
>>>>>>>>>> +    {
>>>>>>>>>> +        struct vpci_bar *vbar = get_vpci_bar(current->domain, pdev, bar->index);
>>>>>>>>>> +
>>>>>>>>>> +        if ( !vbar )
>>>>>>>>>> +            return;
>>>>>>>>>> +        bar_write_guest(pdev, reg, val, vbar);
>>>>>>>>>> +    }
>>>>>>>>>> +}
>>>>>>>>> You should assign different handlers based on whether the domain that
>>>>>>>>> has the device assigned is a domU or the hardware domain, rather than
>>>>>>>>> doing the selection here.
>>>>>>>> Hm, handlers are assigned once in init_bars and this function is only called
>>>>>>>>
>>>>>>>> for hwdom, so there is no way I can do that for the guests. Hence, the dispatcher
>>>>>>> I think we might want to reset the vPCI handlers when a devices gets
>>>>>>> assigned and deassigned.
>>>>>> In ARM case init_bars is called too early: PCI device assignment is currently
>>>>>>
>>>>>> initiated by Domain-0' kernel and is done *before* PCI devices are given memory
>>>>>>
>>>>>> ranges and BARs assigned:
>>>>>>
>>>>>> [    0.429514] pci_bus 0000:00: root bus resource [bus 00-ff]
>>>>>> [    0.429532] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
>>>>>> [    0.429555] pci_bus 0000:00: root bus resource [mem 0xfe200000-0xfe3fffff]
>>>>>> [    0.429575] pci_bus 0000:00: root bus resource [mem 0x30000000-0x37ffffff]
>>>>>> [    0.429604] pci_bus 0000:00: root bus resource [mem 0x38000000-0x3fffffff pref]
>>>>>> [    0.429670] pci 0000:00:00.0: enabling Extended Tags
>>>>>> [    0.453764] pci 0000:00:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>>>
>>>>>> < init_bars >
>>>>>>
>>>>>> [    0.453793] pci 0000:00:00.0: -- IRQ 0
>>>>>> [    0.458825] pci 0000:00:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>>> [    0.471790] pci 0000:01:00.0: -------------------- BUS_NOTIFY_ADD_DEVICE
>>>>>>
>>>>>> < init_bars >
>>>>>>
>>>>>> [    0.471821] pci 0000:01:00.0: -- IRQ 255
>>>>>> [    0.476809] pci 0000:01:00.0: Failed to add - passthrough or MSI/MSI-X might fail!
>>>>>>
>>>>>> < BAR assignments below >
>>>>>>
>>>>>> [    0.488233] pci 0000:00:00.0: BAR 14: assigned [mem 0xfe200000-0xfe2fffff]
>>>>>> [    0.488265] pci 0000:00:00.0: BAR 15: assigned [mem 0x38000000-0x380fffff pref]
>>>>>>
>>>>>> In case of x86 this is pretty much ok as BARs are already in place, but for ARM we
>>>>>>
>>>>>> need to take care and re-setup vPCI BARs for hwdom.
>>>>> Even on x86 there's no guarantee that all devices have their BARs set
>>>>> up by firmware.
>>>> This is true. But there you could have config space trapped in "x86 generic way",
>>>>
>>>> please correct me if I'm wrong here
>>>>
>>>>> In a subsequent reply you've suggested to move init_bars from "add" to
>>>>> "assign", but I'm having trouble seeing what this would change: It's
>>>>> not Dom0 controlling assignment (to itself), but Xen assigns the device
>>>>> towards the end of pci_add_device().
>>>> PHYSDEVOP_pci_device_add vs XEN_DOMCTL_assign_device
>>>>
>>>> Currently we initialize BARs during PHYSDEVOP_pci_device_add and
>>>>
>>>> if we do that during XEN_DOMCTL_assign_device things seem to change
>>> But there can't possibly be any XEN_DOMCTL_assign_device involved in
>>> booting of Dom0.
>> Indeed. So, do you have an idea when we should call init_bars suitable
>>
>> for both ARM and x86?
>>
>> Another question is: what happens bad if x86 and ARM won't call init_bars
>>
>> until the moment we really assign a PCI device to the first guest?
> I'd like to answer the latter question first: How would Dom0 use
> the device prior to such an assignment? As an implication to the
> presumed answer here, I guess init_bars could be deferred up until
> the first time Dom0 (or more generally the possessing domain)
> accesses any of them. Similarly, devices used by Xen itself could
> have this done immediately before first use. This may require
> tracking on a per-device basis whether initialization was done.
Ok, I'll try to look into it
>
>>>>>>>      In order to do passthrough to domUs safely
>>>>>>> we will have to add more handlers than what's required for dom0,
>>>>>> Can you please tell what are thinking about? What are the missing handlers?
>>>>>>>      and
>>>>>>> having is_hardware_domain sprinkled in all of them is not a suitable
>>>>>>> solution.
>>>>>> I'll try to replace is_hardware_domain with something like:
>>>>>>
>>>>>> +/*
>>>>>> + * Detect whether physical PCI devices in this segment belong
>>>>>> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
>>>>>> + * but in case of ARM this might not be the case: those may also
>>>>>> + * live in driver domains or even Xen itself.
>>>>>> + */
>>>>>> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
>>>>>> +{
>>>>>> +#ifdef CONFIG_X86
>>>>>> +    return is_hardware_domain(d);
>>>>>> +#elif CONFIG_ARM
>>>>>> +    return pci_is_owner_domain(d, seg);
>>>>>> +#else
>>>>>> +#error "Unsupported architecture"
>>>>>> +#endif
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Get domain which owns this segment: for x86 this is always hardware
>>>>>> + * domain and for ARM this can be different.
>>>>>> + */
>>>>>> +struct domain *pci_get_hardware_domain(u16 seg)
>>>>>> +{
>>>>>> +#ifdef CONFIG_X86
>>>>>> +    return hardware_domain;
>>>>>> +#elif CONFIG_ARM
>>>>>> +    return pci_get_owner_domain(seg);
>>>>>> +#else
>>>>>> +#error "Unsupported architecture"
>>>>>> +#endif
>>>>>> +}
>>>>>>
>>>>>> This is what I use to properly detect the domain that really owns physical host bridge
>>>>> I consider this problematic. We should try to not let Arm's and x86'es
>>>>> PCI implementations diverge too much, i.e. at least the underlying basic
>>>>> model would better be similar. For example, if entire segments can be
>>>>> assigned to a driver domain on Arm, why should the same not be possible
>>>>> on x86?
>>>> Good question, probably in this case x86 == ARM and I can use
>>>>
>>>> pci_is_owner_domain for both architectures instead of using is_hardware_domain for x86
>>>>
>>>>> Furthermore I'm suspicious about segments being the right granularity
>>>>> here. On ia64 multiple host bridges could (and typically would) live
>>>>> on segment 0. Iirc I had once seen output from an x86 system which was
>>>>> apparently laid out similarly. Therefore, just like for MCFG, I think
>>>>> the granularity wants to be bus ranges within a segment.
>>>> Can you please suggest something we can use as a hint for such a detection logic?
>>> The underlying information comes from ACPI tables, iirc. I don't
>>> recall the details, though - sorry.
>> Ok, so seg + bus should be enough for both ARM and Xen then, right?
>>
>> pci_get_hardware_domain(u16 seg, u8 bus)
> Whether an individual bus number can suitable express things I can't
> tell; I did say bus range, but of course if you care about just
> individual devices, then a single bus number will of course do.

I can implement the lookup whether a PCI host bridge owned by a particular

domain with something like:

struct pci_host_bridge *bridge = pci_find_host_bridge(seg, bus);

return bridge->dt_node->used_by == d->domain_id;

Could you please give me a hint how this can be done on x86?

>
> Jan

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 12:41                     ` Oleksandr Andrushchenko
@ 2020-11-13 14:23                       ` Jan Beulich
  2020-11-13 14:32                         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 14:23 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

On 13.11.2020 13:41, Oleksandr Andrushchenko wrote:
> 
> On 11/13/20 1:35 PM, Jan Beulich wrote:
>> On 13.11.2020 12:02, Oleksandr Andrushchenko wrote:
>>> On 11/13/20 12:50 PM, Jan Beulich wrote:
>>>> On 13.11.2020 11:46, Oleksandr Andrushchenko wrote:
>>>>> On 11/13/20 12:25 PM, Jan Beulich wrote:
>>>>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>>>>> I'll try to replace is_hardware_domain with something like:
>>>>>>>
>>>>>>> +/*
>>>>>>> + * Detect whether physical PCI devices in this segment belong
>>>>>>> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
>>>>>>> + * but in case of ARM this might not be the case: those may also
>>>>>>> + * live in driver domains or even Xen itself.
>>>>>>> + */
>>>>>>> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
>>>>>>> +{
>>>>>>> +#ifdef CONFIG_X86
>>>>>>> +    return is_hardware_domain(d);
>>>>>>> +#elif CONFIG_ARM
>>>>>>> +    return pci_is_owner_domain(d, seg);
>>>>>>> +#else
>>>>>>> +#error "Unsupported architecture"
>>>>>>> +#endif
>>>>>>> +}
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * Get domain which owns this segment: for x86 this is always hardware
>>>>>>> + * domain and for ARM this can be different.
>>>>>>> + */
>>>>>>> +struct domain *pci_get_hardware_domain(u16 seg)
>>>>>>> +{
>>>>>>> +#ifdef CONFIG_X86
>>>>>>> +    return hardware_domain;
>>>>>>> +#elif CONFIG_ARM
>>>>>>> +    return pci_get_owner_domain(seg);
>>>>>>> +#else
>>>>>>> +#error "Unsupported architecture"
>>>>>>> +#endif
>>>>>>> +}
>>>>>>>
>>>>>>> This is what I use to properly detect the domain that really owns physical host bridge
>>>>>> I consider this problematic. We should try to not let Arm's and x86'es
>>>>>> PCI implementations diverge too much, i.e. at least the underlying basic
>>>>>> model would better be similar. For example, if entire segments can be
>>>>>> assigned to a driver domain on Arm, why should the same not be possible
>>>>>> on x86?
>>>>> Good question, probably in this case x86 == ARM and I can use
>>>>>
>>>>> pci_is_owner_domain for both architectures instead of using is_hardware_domain for x86
>>>>>
>>>>>> Furthermore I'm suspicious about segments being the right granularity
>>>>>> here. On ia64 multiple host bridges could (and typically would) live
>>>>>> on segment 0. Iirc I had once seen output from an x86 system which was
>>>>>> apparently laid out similarly. Therefore, just like for MCFG, I think
>>>>>> the granularity wants to be bus ranges within a segment.
>>>>> Can you please suggest something we can use as a hint for such a detection logic?
>>>> The underlying information comes from ACPI tables, iirc. I don't
>>>> recall the details, though - sorry.
>>> Ok, so seg + bus should be enough for both ARM and Xen then, right?
>>>
>>> pci_get_hardware_domain(u16 seg, u8 bus)
>> Whether an individual bus number can suitable express things I can't
>> tell; I did say bus range, but of course if you care about just
>> individual devices, then a single bus number will of course do.
> 
> I can implement the lookup whether a PCI host bridge owned by a particular
> 
> domain with something like:
> 
> struct pci_host_bridge *bridge = pci_find_host_bridge(seg, bus);
> 
> return bridge->dt_node->used_by == d->domain_id;
> 
> Could you please give me a hint how this can be done on x86?

Bridges can't be assigned to other than the hardware domain right
now. Earlier on I didn't say you should get this to work, only
that I think the general logic around what you add shouldn't make
things more arch specific than they really should be. That said,
something similar to the above should still be doable on x86,
utilizing struct pci_seg's bus2bridge[]. There ought to be
DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
(provided by the CPUs themselves rather than the chipset) aren't
really host bridges for the purposes you're after.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 14:23                       ` Jan Beulich
@ 2020-11-13 14:32                         ` Oleksandr Andrushchenko
  2020-11-13 14:38                           ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13 14:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné


On 11/13/20 4:23 PM, Jan Beulich wrote:
> On 13.11.2020 13:41, Oleksandr Andrushchenko wrote:
>> On 11/13/20 1:35 PM, Jan Beulich wrote:
>>> On 13.11.2020 12:02, Oleksandr Andrushchenko wrote:
>>>> On 11/13/20 12:50 PM, Jan Beulich wrote:
>>>>> On 13.11.2020 11:46, Oleksandr Andrushchenko wrote:
>>>>>> On 11/13/20 12:25 PM, Jan Beulich wrote:
>>>>>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>>>>>> I'll try to replace is_hardware_domain with something like:
>>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * Detect whether physical PCI devices in this segment belong
>>>>>>>> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
>>>>>>>> + * but in case of ARM this might not be the case: those may also
>>>>>>>> + * live in driver domains or even Xen itself.
>>>>>>>> + */
>>>>>>>> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
>>>>>>>> +{
>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>> +    return is_hardware_domain(d);
>>>>>>>> +#elif CONFIG_ARM
>>>>>>>> +    return pci_is_owner_domain(d, seg);
>>>>>>>> +#else
>>>>>>>> +#error "Unsupported architecture"
>>>>>>>> +#endif
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * Get domain which owns this segment: for x86 this is always hardware
>>>>>>>> + * domain and for ARM this can be different.
>>>>>>>> + */
>>>>>>>> +struct domain *pci_get_hardware_domain(u16 seg)
>>>>>>>> +{
>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>> +    return hardware_domain;
>>>>>>>> +#elif CONFIG_ARM
>>>>>>>> +    return pci_get_owner_domain(seg);
>>>>>>>> +#else
>>>>>>>> +#error "Unsupported architecture"
>>>>>>>> +#endif
>>>>>>>> +}
>>>>>>>>
>>>>>>>> This is what I use to properly detect the domain that really owns physical host bridge
>>>>>>> I consider this problematic. We should try to not let Arm's and x86'es
>>>>>>> PCI implementations diverge too much, i.e. at least the underlying basic
>>>>>>> model would better be similar. For example, if entire segments can be
>>>>>>> assigned to a driver domain on Arm, why should the same not be possible
>>>>>>> on x86?
>>>>>> Good question, probably in this case x86 == ARM and I can use
>>>>>>
>>>>>> pci_is_owner_domain for both architectures instead of using is_hardware_domain for x86
>>>>>>
>>>>>>> Furthermore I'm suspicious about segments being the right granularity
>>>>>>> here. On ia64 multiple host bridges could (and typically would) live
>>>>>>> on segment 0. Iirc I had once seen output from an x86 system which was
>>>>>>> apparently laid out similarly. Therefore, just like for MCFG, I think
>>>>>>> the granularity wants to be bus ranges within a segment.
>>>>>> Can you please suggest something we can use as a hint for such a detection logic?
>>>>> The underlying information comes from ACPI tables, iirc. I don't
>>>>> recall the details, though - sorry.
>>>> Ok, so seg + bus should be enough for both ARM and Xen then, right?
>>>>
>>>> pci_get_hardware_domain(u16 seg, u8 bus)
>>> Whether an individual bus number can suitable express things I can't
>>> tell; I did say bus range, but of course if you care about just
>>> individual devices, then a single bus number will of course do.
>> I can implement the lookup whether a PCI host bridge owned by a particular
>>
>> domain with something like:
>>
>> struct pci_host_bridge *bridge = pci_find_host_bridge(seg, bus);
>>
>> return bridge->dt_node->used_by == d->domain_id;
>>
>> Could you please give me a hint how this can be done on x86?
> Bridges can't be assigned to other than the hardware domain right
> now.

So, I can probably then have pci_get_hardware_domain implemented

by both ARM and x86 in their arch specific code. And for x86 for now

it can simply be a wrapper for is_hardware_domain?

>   Earlier on I didn't say you should get this to work, only
> that I think the general logic around what you add shouldn't make
> things more arch specific than they really should be. That said,
> something similar to the above should still be doable on x86,
> utilizing struct pci_seg's bus2bridge[]. There ought to be
> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
> (provided by the CPUs themselves rather than the chipset) aren't
> really host bridges for the purposes you're after.

You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker

while trying to detect what I need?

>
> Jan

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 14:32                         ` Oleksandr Andrushchenko
@ 2020-11-13 14:38                           ` Jan Beulich
  2020-11-13 14:44                             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 14:38 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

On 13.11.2020 15:32, Oleksandr Andrushchenko wrote:
> 
> On 11/13/20 4:23 PM, Jan Beulich wrote:
>> On 13.11.2020 13:41, Oleksandr Andrushchenko wrote:
>>> On 11/13/20 1:35 PM, Jan Beulich wrote:
>>>> On 13.11.2020 12:02, Oleksandr Andrushchenko wrote:
>>>>> On 11/13/20 12:50 PM, Jan Beulich wrote:
>>>>>> On 13.11.2020 11:46, Oleksandr Andrushchenko wrote:
>>>>>>> On 11/13/20 12:25 PM, Jan Beulich wrote:
>>>>>>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>>>>>>> I'll try to replace is_hardware_domain with something like:
>>>>>>>>>
>>>>>>>>> +/*
>>>>>>>>> + * Detect whether physical PCI devices in this segment belong
>>>>>>>>> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
>>>>>>>>> + * but in case of ARM this might not be the case: those may also
>>>>>>>>> + * live in driver domains or even Xen itself.
>>>>>>>>> + */
>>>>>>>>> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
>>>>>>>>> +{
>>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>>> +    return is_hardware_domain(d);
>>>>>>>>> +#elif CONFIG_ARM
>>>>>>>>> +    return pci_is_owner_domain(d, seg);
>>>>>>>>> +#else
>>>>>>>>> +#error "Unsupported architecture"
>>>>>>>>> +#endif
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/*
>>>>>>>>> + * Get domain which owns this segment: for x86 this is always hardware
>>>>>>>>> + * domain and for ARM this can be different.
>>>>>>>>> + */
>>>>>>>>> +struct domain *pci_get_hardware_domain(u16 seg)
>>>>>>>>> +{
>>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>>> +    return hardware_domain;
>>>>>>>>> +#elif CONFIG_ARM
>>>>>>>>> +    return pci_get_owner_domain(seg);
>>>>>>>>> +#else
>>>>>>>>> +#error "Unsupported architecture"
>>>>>>>>> +#endif
>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> This is what I use to properly detect the domain that really owns physical host bridge
>>>>>>>> I consider this problematic. We should try to not let Arm's and x86'es
>>>>>>>> PCI implementations diverge too much, i.e. at least the underlying basic
>>>>>>>> model would better be similar. For example, if entire segments can be
>>>>>>>> assigned to a driver domain on Arm, why should the same not be possible
>>>>>>>> on x86?
>>>>>>> Good question, probably in this case x86 == ARM and I can use
>>>>>>>
>>>>>>> pci_is_owner_domain for both architectures instead of using is_hardware_domain for x86
>>>>>>>
>>>>>>>> Furthermore I'm suspicious about segments being the right granularity
>>>>>>>> here. On ia64 multiple host bridges could (and typically would) live
>>>>>>>> on segment 0. Iirc I had once seen output from an x86 system which was
>>>>>>>> apparently laid out similarly. Therefore, just like for MCFG, I think
>>>>>>>> the granularity wants to be bus ranges within a segment.
>>>>>>> Can you please suggest something we can use as a hint for such a detection logic?
>>>>>> The underlying information comes from ACPI tables, iirc. I don't
>>>>>> recall the details, though - sorry.
>>>>> Ok, so seg + bus should be enough for both ARM and Xen then, right?
>>>>>
>>>>> pci_get_hardware_domain(u16 seg, u8 bus)
>>>> Whether an individual bus number can suitable express things I can't
>>>> tell; I did say bus range, but of course if you care about just
>>>> individual devices, then a single bus number will of course do.
>>> I can implement the lookup whether a PCI host bridge owned by a particular
>>>
>>> domain with something like:
>>>
>>> struct pci_host_bridge *bridge = pci_find_host_bridge(seg, bus);
>>>
>>> return bridge->dt_node->used_by == d->domain_id;
>>>
>>> Could you please give me a hint how this can be done on x86?
>> Bridges can't be assigned to other than the hardware domain right
>> now.
> 
> So, I can probably then have pci_get_hardware_domain implemented
> 
> by both ARM and x86 in their arch specific code. And for x86 for now
> 
> it can simply be a wrapper for is_hardware_domain?

"get" can't be a wrapper for "is", but I think I get what you're
saying. But no, preferably I would not see you do this (hence my
earlier comment). I still think you could use the owner of the
respective device; I may be mistaken, but iirc we do assign
bridges to Dom0, so deriving from that would be better than
hardwiring to is_hardware_domain().

>>   Earlier on I didn't say you should get this to work, only
>> that I think the general logic around what you add shouldn't make
>> things more arch specific than they really should be. That said,
>> something similar to the above should still be doable on x86,
>> utilizing struct pci_seg's bus2bridge[]. There ought to be
>> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
>> (provided by the CPUs themselves rather than the chipset) aren't
>> really host bridges for the purposes you're after.
> 
> You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker
> 
> while trying to detect what I need?

I'm afraid I don't understand what marker you're thinking about
here.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 14:38                           ` Jan Beulich
@ 2020-11-13 14:44                             ` Oleksandr Andrushchenko
  2020-11-13 14:51                               ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13 14:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné


On 11/13/20 4:38 PM, Jan Beulich wrote:
> On 13.11.2020 15:32, Oleksandr Andrushchenko wrote:
>> On 11/13/20 4:23 PM, Jan Beulich wrote:
>>> On 13.11.2020 13:41, Oleksandr Andrushchenko wrote:
>>>> On 11/13/20 1:35 PM, Jan Beulich wrote:
>>>>> On 13.11.2020 12:02, Oleksandr Andrushchenko wrote:
>>>>>> On 11/13/20 12:50 PM, Jan Beulich wrote:
>>>>>>> On 13.11.2020 11:46, Oleksandr Andrushchenko wrote:
>>>>>>>> On 11/13/20 12:25 PM, Jan Beulich wrote:
>>>>>>>>> On 13.11.2020 07:32, Oleksandr Andrushchenko wrote:
>>>>>>>>>> I'll try to replace is_hardware_domain with something like:
>>>>>>>>>>
>>>>>>>>>> +/*
>>>>>>>>>> + * Detect whether physical PCI devices in this segment belong
>>>>>>>>>> + * to the domain given, e.g. on x86 all PCI devices live in hwdom,
>>>>>>>>>> + * but in case of ARM this might not be the case: those may also
>>>>>>>>>> + * live in driver domains or even Xen itself.
>>>>>>>>>> + */
>>>>>>>>>> +bool pci_is_hardware_domain(struct domain *d, u16 seg)
>>>>>>>>>> +{
>>>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>>>> +    return is_hardware_domain(d);
>>>>>>>>>> +#elif CONFIG_ARM
>>>>>>>>>> +    return pci_is_owner_domain(d, seg);
>>>>>>>>>> +#else
>>>>>>>>>> +#error "Unsupported architecture"
>>>>>>>>>> +#endif
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * Get domain which owns this segment: for x86 this is always hardware
>>>>>>>>>> + * domain and for ARM this can be different.
>>>>>>>>>> + */
>>>>>>>>>> +struct domain *pci_get_hardware_domain(u16 seg)
>>>>>>>>>> +{
>>>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>>>> +    return hardware_domain;
>>>>>>>>>> +#elif CONFIG_ARM
>>>>>>>>>> +    return pci_get_owner_domain(seg);
>>>>>>>>>> +#else
>>>>>>>>>> +#error "Unsupported architecture"
>>>>>>>>>> +#endif
>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> This is what I use to properly detect the domain that really owns physical host bridge
>>>>>>>>> I consider this problematic. We should try to not let Arm's and x86'es
>>>>>>>>> PCI implementations diverge too much, i.e. at least the underlying basic
>>>>>>>>> model would better be similar. For example, if entire segments can be
>>>>>>>>> assigned to a driver domain on Arm, why should the same not be possible
>>>>>>>>> on x86?
>>>>>>>> Good question, probably in this case x86 == ARM and I can use
>>>>>>>>
>>>>>>>> pci_is_owner_domain for both architectures instead of using is_hardware_domain for x86
>>>>>>>>
>>>>>>>>> Furthermore I'm suspicious about segments being the right granularity
>>>>>>>>> here. On ia64 multiple host bridges could (and typically would) live
>>>>>>>>> on segment 0. Iirc I had once seen output from an x86 system which was
>>>>>>>>> apparently laid out similarly. Therefore, just like for MCFG, I think
>>>>>>>>> the granularity wants to be bus ranges within a segment.
>>>>>>>> Can you please suggest something we can use as a hint for such a detection logic?
>>>>>>> The underlying information comes from ACPI tables, iirc. I don't
>>>>>>> recall the details, though - sorry.
>>>>>> Ok, so seg + bus should be enough for both ARM and Xen then, right?
>>>>>>
>>>>>> pci_get_hardware_domain(u16 seg, u8 bus)
>>>>> Whether an individual bus number can suitable express things I can't
>>>>> tell; I did say bus range, but of course if you care about just
>>>>> individual devices, then a single bus number will of course do.
>>>> I can implement the lookup whether a PCI host bridge owned by a particular
>>>>
>>>> domain with something like:
>>>>
>>>> struct pci_host_bridge *bridge = pci_find_host_bridge(seg, bus);
>>>>
>>>> return bridge->dt_node->used_by == d->domain_id;
>>>>
>>>> Could you please give me a hint how this can be done on x86?
>>> Bridges can't be assigned to other than the hardware domain right
>>> now.
>> So, I can probably then have pci_get_hardware_domain implemented
>>
>> by both ARM and x86 in their arch specific code. And for x86 for now
>>
>> it can simply be a wrapper for is_hardware_domain?
> "get" can't be a wrapper for "is"
Sure, s/get/is
> , but I think I get what you're
> saying. But no, preferably I would not see you do this (hence my
> earlier comment). I still think you could use the owner of the
> respective device; I may be mistaken, but iirc we do assign
> bridges to Dom0, so deriving from that would be better than
> hardwiring to is_hardware_domain().
Ok, I'll try to figure out how to do that
>
>>>    Earlier on I didn't say you should get this to work, only
>>> that I think the general logic around what you add shouldn't make
>>> things more arch specific than they really should be. That said,
>>> something similar to the above should still be doable on x86,
>>> utilizing struct pci_seg's bus2bridge[]. There ought to be
>>> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
>>> (provided by the CPUs themselves rather than the chipset) aren't
>>> really host bridges for the purposes you're after.
>> You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker
>>
>> while trying to detect what I need?
> I'm afraid I don't understand what marker you're thinking about
> here.

I mean that when I go over bus2bridge entries, should I only work with

those who have DEV_TYPE_PCI_HOST_BRIDGE?

>
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 14:44                             ` Oleksandr Andrushchenko
@ 2020-11-13 14:51                               ` Jan Beulich
  2020-11-13 14:52                                 ` Oleksandr Andrushchenko
  2020-12-04 14:38                                 ` Oleksandr Andrushchenko
  0 siblings, 2 replies; 64+ messages in thread
From: Jan Beulich @ 2020-11-13 14:51 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

On 13.11.2020 15:44, Oleksandr Andrushchenko wrote:
> 
> On 11/13/20 4:38 PM, Jan Beulich wrote:
>> On 13.11.2020 15:32, Oleksandr Andrushchenko wrote:
>>> On 11/13/20 4:23 PM, Jan Beulich wrote:
>>>>    Earlier on I didn't say you should get this to work, only
>>>> that I think the general logic around what you add shouldn't make
>>>> things more arch specific than they really should be. That said,
>>>> something similar to the above should still be doable on x86,
>>>> utilizing struct pci_seg's bus2bridge[]. There ought to be
>>>> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
>>>> (provided by the CPUs themselves rather than the chipset) aren't
>>>> really host bridges for the purposes you're after.
>>> You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker
>>>
>>> while trying to detect what I need?
>> I'm afraid I don't understand what marker you're thinking about
>> here.
> 
> I mean that when I go over bus2bridge entries, should I only work with
> 
> those who have DEV_TYPE_PCI_HOST_BRIDGE?

Well, if you're after host bridges - yes.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 14:51                               ` Jan Beulich
@ 2020-11-13 14:52                                 ` Oleksandr Andrushchenko
  2020-12-04 14:38                                 ` Oleksandr Andrushchenko
  1 sibling, 0 replies; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-11-13 14:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné


On 11/13/20 4:51 PM, Jan Beulich wrote:
> On 13.11.2020 15:44, Oleksandr Andrushchenko wrote:
>> On 11/13/20 4:38 PM, Jan Beulich wrote:
>>> On 13.11.2020 15:32, Oleksandr Andrushchenko wrote:
>>>> On 11/13/20 4:23 PM, Jan Beulich wrote:
>>>>>     Earlier on I didn't say you should get this to work, only
>>>>> that I think the general logic around what you add shouldn't make
>>>>> things more arch specific than they really should be. That said,
>>>>> something similar to the above should still be doable on x86,
>>>>> utilizing struct pci_seg's bus2bridge[]. There ought to be
>>>>> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
>>>>> (provided by the CPUs themselves rather than the chipset) aren't
>>>>> really host bridges for the purposes you're after.
>>>> You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker
>>>>
>>>> while trying to detect what I need?
>>> I'm afraid I don't understand what marker you're thinking about
>>> here.
>> I mean that when I go over bus2bridge entries, should I only work with
>>
>> those who have DEV_TYPE_PCI_HOST_BRIDGE?
> Well, if you're after host bridges - yes.
Ok, I'll try to see what I can do about it.
>
> Jan

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-11-13 14:51                               ` Jan Beulich
  2020-11-13 14:52                                 ` Oleksandr Andrushchenko
@ 2020-12-04 14:38                                 ` Oleksandr Andrushchenko
  2020-12-07  8:48                                   ` Jan Beulich
  1 sibling, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-12-04 14:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

Hi, Jan!

On 11/13/20 4:51 PM, Jan Beulich wrote:
> On 13.11.2020 15:44, Oleksandr Andrushchenko wrote:
>> On 11/13/20 4:38 PM, Jan Beulich wrote:
>>> On 13.11.2020 15:32, Oleksandr Andrushchenko wrote:
>>>> On 11/13/20 4:23 PM, Jan Beulich wrote:
>>>>>     Earlier on I didn't say you should get this to work, only
>>>>> that I think the general logic around what you add shouldn't make
>>>>> things more arch specific than they really should be. That said,
>>>>> something similar to the above should still be doable on x86,
>>>>> utilizing struct pci_seg's bus2bridge[]. There ought to be
>>>>> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
>>>>> (provided by the CPUs themselves rather than the chipset) aren't
>>>>> really host bridges for the purposes you're after.
>>>> You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker
>>>>
>>>> while trying to detect what I need?
>>> I'm afraid I don't understand what marker you're thinking about
>>> here.
>> I mean that when I go over bus2bridge entries, should I only work with
>>
>> those who have DEV_TYPE_PCI_HOST_BRIDGE?
> Well, if you're after host bridges - yes.
>
> Jan

So, I started looking at the bus2bridge and if it can be used for both x86 (and possible ARM) and I have an

impression that the original purpose of this was to identify the devices which x86 IOMMU should

cover: e.g. I am looking at the find_upstream_bridge users and they are x86 IOMMU and VGA driver.

I tried to play with this on ARM, and for the HW platform I have and QEMU I got 0 entries in bus2bridge...

This is because of how xen/drivers/passthrough/pci.c:alloc_pdev is implemented (which lives in the

common code BTW, but seems to be x86 specific): so, it skips setting up bus2bridge entries for the bridges I have.

These are DEV_TYPE_PCIe_BRIDGE and DEV_TYPE_PCI_HOST_BRIDGE. So, the assumption I've made before

that I can go over bus2bridge and filter out the "owner" or parent bridge for a given seg:bus doesn't

seem to be possible now.

Even if I find the parent bridge with xen/drivers/passthrough/pci.c:find_upstream_bridge I am not sure

I can get any further in detecting which x86 domain owns this bridge: the whole idea in the x86 code is

that Domain-0 is the only possible one here (you mentioned that). So, I doubt if we can still live with

is_hardware_domain for now for x86?

Thank you in advance,

Oleksandr

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-12-04 14:38                                 ` Oleksandr Andrushchenko
@ 2020-12-07  8:48                                   ` Jan Beulich
  2020-12-07  9:11                                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-12-07  8:48 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

On 04.12.2020 15:38, Oleksandr Andrushchenko wrote:
> On 11/13/20 4:51 PM, Jan Beulich wrote:
>> On 13.11.2020 15:44, Oleksandr Andrushchenko wrote:
>>> On 11/13/20 4:38 PM, Jan Beulich wrote:
>>>> On 13.11.2020 15:32, Oleksandr Andrushchenko wrote:
>>>>> On 11/13/20 4:23 PM, Jan Beulich wrote:
>>>>>>     Earlier on I didn't say you should get this to work, only
>>>>>> that I think the general logic around what you add shouldn't make
>>>>>> things more arch specific than they really should be. That said,
>>>>>> something similar to the above should still be doable on x86,
>>>>>> utilizing struct pci_seg's bus2bridge[]. There ought to be
>>>>>> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
>>>>>> (provided by the CPUs themselves rather than the chipset) aren't
>>>>>> really host bridges for the purposes you're after.
>>>>> You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker
>>>>>
>>>>> while trying to detect what I need?
>>>> I'm afraid I don't understand what marker you're thinking about
>>>> here.
>>> I mean that when I go over bus2bridge entries, should I only work with
>>>
>>> those who have DEV_TYPE_PCI_HOST_BRIDGE?
>> Well, if you're after host bridges - yes.
>>
>> Jan
> 
> So, I started looking at the bus2bridge and if it can be used for both x86 (and possible ARM) and I have an
> 
> impression that the original purpose of this was to identify the devices which x86 IOMMU should
> 
> cover: e.g. I am looking at the find_upstream_bridge users and they are x86 IOMMU and VGA driver.
> 
> I tried to play with this on ARM, and for the HW platform I have and QEMU I got 0 entries in bus2bridge...
> 
> This is because of how xen/drivers/passthrough/pci.c:alloc_pdev is implemented (which lives in the
> 
> common code BTW, but seems to be x86 specific): so, it skips setting up bus2bridge entries for the bridges I have.

I'm curious to learn what's x86-specific here. I also can't deduce
why bus2bridge setup would be skipped.

> These are DEV_TYPE_PCIe_BRIDGE and DEV_TYPE_PCI_HOST_BRIDGE. So, the assumption I've made before
> 
> that I can go over bus2bridge and filter out the "owner" or parent bridge for a given seg:bus doesn't
> 
> seem to be possible now.
> 
> Even if I find the parent bridge with xen/drivers/passthrough/pci.c:find_upstream_bridge I am not sure
> 
> I can get any further in detecting which x86 domain owns this bridge: the whole idea in the x86 code is
> 
> that Domain-0 is the only possible one here (you mentioned that).

Right - your attempt to find the owner should _right now_ result in
getting back Dom0, on x86. But there shouldn't be any such assumption
in the new consumer of this data that you mean to introduce. I.e. I
only did suggest this to avoid ...

> So, I doubt if we can still live with
> 
> is_hardware_domain for now for x86?

... hard-coding information which can be properly established /
retrieved.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-12-07  8:48                                   ` Jan Beulich
@ 2020-12-07  9:11                                     ` Oleksandr Andrushchenko
  2020-12-07  9:28                                       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-12-07  9:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné


On 12/7/20 10:48 AM, Jan Beulich wrote:
> On 04.12.2020 15:38, Oleksandr Andrushchenko wrote:
>> On 11/13/20 4:51 PM, Jan Beulich wrote:
>>> On 13.11.2020 15:44, Oleksandr Andrushchenko wrote:
>>>> On 11/13/20 4:38 PM, Jan Beulich wrote:
>>>>> On 13.11.2020 15:32, Oleksandr Andrushchenko wrote:
>>>>>> On 11/13/20 4:23 PM, Jan Beulich wrote:
>>>>>>>      Earlier on I didn't say you should get this to work, only
>>>>>>> that I think the general logic around what you add shouldn't make
>>>>>>> things more arch specific than they really should be. That said,
>>>>>>> something similar to the above should still be doable on x86,
>>>>>>> utilizing struct pci_seg's bus2bridge[]. There ought to be
>>>>>>> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
>>>>>>> (provided by the CPUs themselves rather than the chipset) aren't
>>>>>>> really host bridges for the purposes you're after.
>>>>>> You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker
>>>>>>
>>>>>> while trying to detect what I need?
>>>>> I'm afraid I don't understand what marker you're thinking about
>>>>> here.
>>>> I mean that when I go over bus2bridge entries, should I only work with
>>>>
>>>> those who have DEV_TYPE_PCI_HOST_BRIDGE?
>>> Well, if you're after host bridges - yes.
>>>
>>> Jan
>> So, I started looking at the bus2bridge and if it can be used for both x86 (and possible ARM) and I have an
>>
>> impression that the original purpose of this was to identify the devices which x86 IOMMU should
>>
>> cover: e.g. I am looking at the find_upstream_bridge users and they are x86 IOMMU and VGA driver.
>>
>> I tried to play with this on ARM, and for the HW platform I have and QEMU I got 0 entries in bus2bridge...
>>
>> This is because of how xen/drivers/passthrough/pci.c:alloc_pdev is implemented (which lives in the
>>
>> common code BTW, but seems to be x86 specific): so, it skips setting up bus2bridge entries for the bridges I have.
> I'm curious to learn what's x86-specific here. I also can't deduce
> why bus2bridge setup would be skipped.

So, for example:

commit 0af438757d455f8eb6b5a6ae9a990ae245f230fd
Author: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Date:   Fri Sep 27 10:11:49 2013 +0200

     AMD IOMMU: fix Dom0 device setup failure for host bridges

     The host bridge device (i.e. 0x18 for AMD) does not require IOMMU, and
     therefore is not included in the IVRS. The current logic tries to map
     all PCI devices to an IOMMU. In this case, "xl dmesg" shows the
     following message on AMD sytem.

     (XEN) setup 0000:00:18.0 for d0 failed (-19)
     (XEN) setup 0000:00:18.1 for d0 failed (-19)
     (XEN) setup 0000:00:18.2 for d0 failed (-19)
     (XEN) setup 0000:00:18.3 for d0 failed (-19)
     (XEN) setup 0000:00:18.4 for d0 failed (-19)
     (XEN) setup 0000:00:18.5 for d0 failed (-19)

     This patch adds a new device type (i.e. DEV_TYPE_PCI_HOST_BRIDGE) which
     corresponds to PCI class code 0x06 and sub-class 0x00. Then, it uses
     this new type to filter when trying to map device to IOMMU.

One of my test systems has DEV_TYPE_PCI_HOST_BRIDGE, so bus2brdige setup is ignored

>
>> These are DEV_TYPE_PCIe_BRIDGE and DEV_TYPE_PCI_HOST_BRIDGE. So, the assumption I've made before
>>
>> that I can go over bus2bridge and filter out the "owner" or parent bridge for a given seg:bus doesn't
>>
>> seem to be possible now.
>>
>> Even if I find the parent bridge with xen/drivers/passthrough/pci.c:find_upstream_bridge I am not sure
>>
>> I can get any further in detecting which x86 domain owns this bridge: the whole idea in the x86 code is
>>
>> that Domain-0 is the only possible one here (you mentioned that).
> Right - your attempt to find the owner should _right now_ result in
> getting back Dom0, on x86. But there shouldn't be any such assumption
> in the new consumer of this data that you mean to introduce. I.e. I
> only did suggest this to avoid ...
>
>> So, I doubt if we can still live with
>>
>> is_hardware_domain for now for x86?
> ... hard-coding information which can be properly established /
> retrieved.
>
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-12-07  9:11                                     ` Oleksandr Andrushchenko
@ 2020-12-07  9:28                                       ` Jan Beulich
  2020-12-07  9:37                                         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2020-12-07  9:28 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

On 07.12.2020 10:11, Oleksandr Andrushchenko wrote:
> On 12/7/20 10:48 AM, Jan Beulich wrote:
>> On 04.12.2020 15:38, Oleksandr Andrushchenko wrote:
>>> On 11/13/20 4:51 PM, Jan Beulich wrote:
>>>> On 13.11.2020 15:44, Oleksandr Andrushchenko wrote:
>>>>> On 11/13/20 4:38 PM, Jan Beulich wrote:
>>>>>> On 13.11.2020 15:32, Oleksandr Andrushchenko wrote:
>>>>>>> On 11/13/20 4:23 PM, Jan Beulich wrote:
>>>>>>>>      Earlier on I didn't say you should get this to work, only
>>>>>>>> that I think the general logic around what you add shouldn't make
>>>>>>>> things more arch specific than they really should be. That said,
>>>>>>>> something similar to the above should still be doable on x86,
>>>>>>>> utilizing struct pci_seg's bus2bridge[]. There ought to be
>>>>>>>> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
>>>>>>>> (provided by the CPUs themselves rather than the chipset) aren't
>>>>>>>> really host bridges for the purposes you're after.
>>>>>>> You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker
>>>>>>>
>>>>>>> while trying to detect what I need?
>>>>>> I'm afraid I don't understand what marker you're thinking about
>>>>>> here.
>>>>> I mean that when I go over bus2bridge entries, should I only work with
>>>>>
>>>>> those who have DEV_TYPE_PCI_HOST_BRIDGE?
>>>> Well, if you're after host bridges - yes.
>>>>
>>>> Jan
>>> So, I started looking at the bus2bridge and if it can be used for both x86 (and possible ARM) and I have an
>>>
>>> impression that the original purpose of this was to identify the devices which x86 IOMMU should
>>>
>>> cover: e.g. I am looking at the find_upstream_bridge users and they are x86 IOMMU and VGA driver.
>>>
>>> I tried to play with this on ARM, and for the HW platform I have and QEMU I got 0 entries in bus2bridge...
>>>
>>> This is because of how xen/drivers/passthrough/pci.c:alloc_pdev is implemented (which lives in the
>>>
>>> common code BTW, but seems to be x86 specific): so, it skips setting up bus2bridge entries for the bridges I have.
>> I'm curious to learn what's x86-specific here. I also can't deduce
>> why bus2bridge setup would be skipped.
> 
> So, for example:
> 
> commit 0af438757d455f8eb6b5a6ae9a990ae245f230fd
> Author: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> Date:   Fri Sep 27 10:11:49 2013 +0200
> 
>      AMD IOMMU: fix Dom0 device setup failure for host bridges
> 
>      The host bridge device (i.e. 0x18 for AMD) does not require IOMMU, and
>      therefore is not included in the IVRS. The current logic tries to map
>      all PCI devices to an IOMMU. In this case, "xl dmesg" shows the
>      following message on AMD sytem.
> 
>      (XEN) setup 0000:00:18.0 for d0 failed (-19)
>      (XEN) setup 0000:00:18.1 for d0 failed (-19)
>      (XEN) setup 0000:00:18.2 for d0 failed (-19)
>      (XEN) setup 0000:00:18.3 for d0 failed (-19)
>      (XEN) setup 0000:00:18.4 for d0 failed (-19)
>      (XEN) setup 0000:00:18.5 for d0 failed (-19)
> 
>      This patch adds a new device type (i.e. DEV_TYPE_PCI_HOST_BRIDGE) which
>      corresponds to PCI class code 0x06 and sub-class 0x00. Then, it uses
>      this new type to filter when trying to map device to IOMMU.
> 
> One of my test systems has DEV_TYPE_PCI_HOST_BRIDGE, so bus2brdige setup is ignored

If there's data to be sensibly recorded for host bridges, I don't
see why the function couldn't be updated. I don't view this as
x86-specific; it may just be that on x86 we have no present use
for such data. It may in turn be the case that then x86-specific
call sites consuming this data need updating to not be mislead by
the change in what data gets recorded. But that's still all within
the scope of bringing intended-to-be-arch-independent code closer
to actually being arch-independent.

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-12-07  9:28                                       ` Jan Beulich
@ 2020-12-07  9:37                                         ` Oleksandr Andrushchenko
  2020-12-07  9:50                                           ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Oleksandr Andrushchenko @ 2020-12-07  9:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné


On 12/7/20 11:28 AM, Jan Beulich wrote:
> On 07.12.2020 10:11, Oleksandr Andrushchenko wrote:
>> On 12/7/20 10:48 AM, Jan Beulich wrote:
>>> On 04.12.2020 15:38, Oleksandr Andrushchenko wrote:
>>>> On 11/13/20 4:51 PM, Jan Beulich wrote:
>>>>> On 13.11.2020 15:44, Oleksandr Andrushchenko wrote:
>>>>>> On 11/13/20 4:38 PM, Jan Beulich wrote:
>>>>>>> On 13.11.2020 15:32, Oleksandr Andrushchenko wrote:
>>>>>>>> On 11/13/20 4:23 PM, Jan Beulich wrote:
>>>>>>>>>       Earlier on I didn't say you should get this to work, only
>>>>>>>>> that I think the general logic around what you add shouldn't make
>>>>>>>>> things more arch specific than they really should be. That said,
>>>>>>>>> something similar to the above should still be doable on x86,
>>>>>>>>> utilizing struct pci_seg's bus2bridge[]. There ought to be
>>>>>>>>> DEV_TYPE_PCI_HOST_BRIDGE entries there, albeit a number of them
>>>>>>>>> (provided by the CPUs themselves rather than the chipset) aren't
>>>>>>>>> really host bridges for the purposes you're after.
>>>>>>>> You mean I can still use DEV_TYPE_PCI_HOST_BRIDGE as a marker
>>>>>>>>
>>>>>>>> while trying to detect what I need?
>>>>>>> I'm afraid I don't understand what marker you're thinking about
>>>>>>> here.
>>>>>> I mean that when I go over bus2bridge entries, should I only work with
>>>>>>
>>>>>> those who have DEV_TYPE_PCI_HOST_BRIDGE?
>>>>> Well, if you're after host bridges - yes.
>>>>>
>>>>> Jan
>>>> So, I started looking at the bus2bridge and if it can be used for both x86 (and possible ARM) and I have an
>>>>
>>>> impression that the original purpose of this was to identify the devices which x86 IOMMU should
>>>>
>>>> cover: e.g. I am looking at the find_upstream_bridge users and they are x86 IOMMU and VGA driver.
>>>>
>>>> I tried to play with this on ARM, and for the HW platform I have and QEMU I got 0 entries in bus2bridge...
>>>>
>>>> This is because of how xen/drivers/passthrough/pci.c:alloc_pdev is implemented (which lives in the
>>>>
>>>> common code BTW, but seems to be x86 specific): so, it skips setting up bus2bridge entries for the bridges I have.
>>> I'm curious to learn what's x86-specific here. I also can't deduce
>>> why bus2bridge setup would be skipped.
>> So, for example:
>>
>> commit 0af438757d455f8eb6b5a6ae9a990ae245f230fd
>> Author: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
>> Date:   Fri Sep 27 10:11:49 2013 +0200
>>
>>       AMD IOMMU: fix Dom0 device setup failure for host bridges
>>
>>       The host bridge device (i.e. 0x18 for AMD) does not require IOMMU, and
>>       therefore is not included in the IVRS. The current logic tries to map
>>       all PCI devices to an IOMMU. In this case, "xl dmesg" shows the
>>       following message on AMD sytem.
>>
>>       (XEN) setup 0000:00:18.0 for d0 failed (-19)
>>       (XEN) setup 0000:00:18.1 for d0 failed (-19)
>>       (XEN) setup 0000:00:18.2 for d0 failed (-19)
>>       (XEN) setup 0000:00:18.3 for d0 failed (-19)
>>       (XEN) setup 0000:00:18.4 for d0 failed (-19)
>>       (XEN) setup 0000:00:18.5 for d0 failed (-19)
>>
>>       This patch adds a new device type (i.e. DEV_TYPE_PCI_HOST_BRIDGE) which
>>       corresponds to PCI class code 0x06 and sub-class 0x00. Then, it uses
>>       this new type to filter when trying to map device to IOMMU.
>>
>> One of my test systems has DEV_TYPE_PCI_HOST_BRIDGE, so bus2brdige setup is ignored
> If there's data to be sensibly recorded for host bridges, I don't
> see why the function couldn't be updated. I don't view this as
> x86-specific; it may just be that on x86 we have no present use
> for such data. It may in turn be the case that then x86-specific
> call sites consuming this data need updating to not be mislead by
> the change in what data gets recorded. But that's still all within
> the scope of bringing intended-to-be-arch-independent code closer
> to actually being arch-independent.

Well, the patch itself made me think that this is a workaround for x86

which made DEV_TYPE_PCI_HOST_BRIDGE a special case and it relies on that.

So, please correct me if I'm wrong here, but in order to make it really generic

I would need to introduce some specific knowledge for x86 about such a device

and make the IOMMU code rely on that instead of bus2bridge.

>
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] vpci: Make every domain handle its own BARs
  2020-12-07  9:37                                         ` Oleksandr Andrushchenko
@ 2020-12-07  9:50                                           ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2020-12-07  9:50 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, Rahul.Singh, Bertrand.Marquis,
	julien.grall, sstabellini, xen-devel, iwj, wl,
	Roger Pau Monné

On 07.12.2020 10:37, Oleksandr Andrushchenko wrote:
> On 12/7/20 11:28 AM, Jan Beulich wrote:
>> On 07.12.2020 10:11, Oleksandr Andrushchenko wrote:
>>> On 12/7/20 10:48 AM, Jan Beulich wrote:
>>>> On 04.12.2020 15:38, Oleksandr Andrushchenko wrote:
>>>>> So, I started looking at the bus2bridge and if it can be used for both x86 (and possible ARM) and I have an
>>>>>
>>>>> impression that the original purpose of this was to identify the devices which x86 IOMMU should
>>>>>
>>>>> cover: e.g. I am looking at the find_upstream_bridge users and they are x86 IOMMU and VGA driver.
>>>>>
>>>>> I tried to play with this on ARM, and for the HW platform I have and QEMU I got 0 entries in bus2bridge...
>>>>>
>>>>> This is because of how xen/drivers/passthrough/pci.c:alloc_pdev is implemented (which lives in the
>>>>>
>>>>> common code BTW, but seems to be x86 specific): so, it skips setting up bus2bridge entries for the bridges I have.
>>>> I'm curious to learn what's x86-specific here. I also can't deduce
>>>> why bus2bridge setup would be skipped.
>>> So, for example:
>>>
>>> commit 0af438757d455f8eb6b5a6ae9a990ae245f230fd
>>> Author: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
>>> Date:   Fri Sep 27 10:11:49 2013 +0200
>>>
>>>       AMD IOMMU: fix Dom0 device setup failure for host bridges
>>>
>>>       The host bridge device (i.e. 0x18 for AMD) does not require IOMMU, and
>>>       therefore is not included in the IVRS. The current logic tries to map
>>>       all PCI devices to an IOMMU. In this case, "xl dmesg" shows the
>>>       following message on AMD sytem.
>>>
>>>       (XEN) setup 0000:00:18.0 for d0 failed (-19)
>>>       (XEN) setup 0000:00:18.1 for d0 failed (-19)
>>>       (XEN) setup 0000:00:18.2 for d0 failed (-19)
>>>       (XEN) setup 0000:00:18.3 for d0 failed (-19)
>>>       (XEN) setup 0000:00:18.4 for d0 failed (-19)
>>>       (XEN) setup 0000:00:18.5 for d0 failed (-19)
>>>
>>>       This patch adds a new device type (i.e. DEV_TYPE_PCI_HOST_BRIDGE) which
>>>       corresponds to PCI class code 0x06 and sub-class 0x00. Then, it uses
>>>       this new type to filter when trying to map device to IOMMU.
>>>
>>> One of my test systems has DEV_TYPE_PCI_HOST_BRIDGE, so bus2brdige setup is ignored
>> If there's data to be sensibly recorded for host bridges, I don't
>> see why the function couldn't be updated. I don't view this as
>> x86-specific; it may just be that on x86 we have no present use
>> for such data. It may in turn be the case that then x86-specific
>> call sites consuming this data need updating to not be mislead by
>> the change in what data gets recorded. But that's still all within
>> the scope of bringing intended-to-be-arch-independent code closer
>> to actually being arch-independent.
> 
> Well, the patch itself made me think that this is a workaround for x86
> 
> which made DEV_TYPE_PCI_HOST_BRIDGE a special case and it relies on that.
> 
> So, please correct me if I'm wrong here, but in order to make it really generic
> 
> I would need to introduce some specific knowledge for x86 about such a device
> 
> and make the IOMMU code rely on that instead of bus2bridge.

I'm afraid this is too vague for me to respond with a clear "yes" or
"no". In particular I don't see the special casing of that type, not
the least because it's not clear to me what data you would see
recording for it (or more precisely, from where you'd take the to be
recorded data - the device's config space doesn't tell you the bus
range covered by the bridge afaict).

Jan


^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2020-12-07  9:50 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-09 12:50 [PATCH 00/10] [RFC] ARM PCI passthrough configuration and vPCI Oleksandr Andrushchenko
2020-11-09 12:50 ` [PATCH 01/10] pci/pvh: Allow PCI toolstack code run with PVH domains on ARM Oleksandr Andrushchenko
2020-11-11 12:31   ` [SUSPECTED SPAM][PATCH " Roger Pau Monné
2020-11-11 13:10     ` Oleksandr Andrushchenko
2020-11-11 13:55       ` Roger Pau Monné
2020-11-11 14:12         ` Oleksandr Andrushchenko
2020-11-11 14:21           ` Roger Pau Monné
2020-11-09 12:50 ` [PATCH 02/10] arm/pci: Maintain PCI assignable list Oleksandr Andrushchenko
2020-11-11 13:53   ` Roger Pau Monné
2020-11-11 14:38     ` Oleksandr Andrushchenko
2020-11-11 15:03       ` Roger Pau Monné
2020-11-11 15:13         ` Oleksandr Andrushchenko
2020-11-11 15:25           ` Jan Beulich
2020-11-11 15:28             ` Oleksandr Andrushchenko
2020-11-11 14:54   ` Jan Beulich
2020-11-12 12:53     ` Oleksandr Andrushchenko
2020-11-09 12:50 ` [PATCH 03/10] xen/arm: Setup MMIO range trap handlers for hardware domain Oleksandr Andrushchenko
2020-11-11 14:39   ` Roger Pau Monné
2020-11-11 14:42     ` Oleksandr Andrushchenko
2020-11-09 12:50 ` [PATCH 04/10] [WORKAROUND] xen/arm: Update hwdom's p2m to trap ECAM space Oleksandr Andrushchenko
2020-11-11 14:44   ` Roger Pau Monné
2020-11-12 12:54     ` Oleksandr Andrushchenko
2020-11-09 12:50 ` [PATCH 05/10] xen/arm: Process pending vPCI map/unmap operations Oleksandr Andrushchenko
2020-11-09 12:50 ` [PATCH 06/10] vpci: Make every domain handle its own BARs Oleksandr Andrushchenko
2020-11-12  9:40   ` Roger Pau Monné
2020-11-12 13:16     ` Oleksandr Andrushchenko
2020-11-12 14:46       ` Roger Pau Monné
2020-11-13  6:32         ` Oleksandr Andrushchenko
2020-11-13  6:48           ` Oleksandr Andrushchenko
2020-11-13 10:25           ` Jan Beulich
2020-11-13 10:36             ` Julien Grall
2020-11-13 10:53               ` Jan Beulich
2020-11-13 11:06                 ` Julien Grall
2020-11-13 11:26                   ` Jan Beulich
2020-11-13 11:53                     ` Julien Grall
2020-11-13 10:46             ` Oleksandr Andrushchenko
2020-11-13 10:50               ` Jan Beulich
2020-11-13 11:02                 ` Oleksandr Andrushchenko
2020-11-13 11:35                   ` Jan Beulich
2020-11-13 12:41                     ` Oleksandr Andrushchenko
2020-11-13 14:23                       ` Jan Beulich
2020-11-13 14:32                         ` Oleksandr Andrushchenko
2020-11-13 14:38                           ` Jan Beulich
2020-11-13 14:44                             ` Oleksandr Andrushchenko
2020-11-13 14:51                               ` Jan Beulich
2020-11-13 14:52                                 ` Oleksandr Andrushchenko
2020-12-04 14:38                                 ` Oleksandr Andrushchenko
2020-12-07  8:48                                   ` Jan Beulich
2020-12-07  9:11                                     ` Oleksandr Andrushchenko
2020-12-07  9:28                                       ` Jan Beulich
2020-12-07  9:37                                         ` Oleksandr Andrushchenko
2020-12-07  9:50                                           ` Jan Beulich
2020-11-09 12:50 ` [PATCH 07/10] xen/arm: Do not hardcode phycial PCI device addresses Oleksandr Andrushchenko
2020-11-09 12:50 ` [PATCH 08/10] vpci/arm: Allow updating BAR's header for non-ECAM bridges Oleksandr Andrushchenko
2020-11-12  9:56   ` Roger Pau Monné
2020-11-13  6:46     ` Oleksandr Andrushchenko
2020-11-13 10:29   ` Jan Beulich
2020-11-13 10:39     ` Oleksandr Andrushchenko
2020-11-13 10:47       ` Jan Beulich
2020-11-13 10:55         ` Oleksandr Andrushchenko
2020-11-09 12:50 ` [PATCH 09/10] vpci/rcar: Implement vPCI.update_bar_header callback Oleksandr Andrushchenko
2020-11-12 10:00   ` Roger Pau Monné
2020-11-13  6:50     ` Oleksandr Andrushchenko
2020-11-09 12:50 ` [PATCH 10/10] [HACK] vpci/rcar: Make vPCI know DomD is hardware domain Oleksandr Andrushchenko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.