* [PATCH v10 00/17] PCI devices passthrough on Arm, part 3
@ 2023-10-12 22:09 Volodymyr Babchuk
  2023-10-12 22:09 ` [PATCH v10 01/17] pci: msi: pass pdev to pci_enable_msi() function Volodymyr Babchuk
                   ` (16 more replies)
  0 siblings, 17 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Volodymyr Babchuk, Jan Beulich,
	Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Kevin Tian, Jun Nakajima, Bertrand Marquis,
	Volodymyr Babchuk

Hello all,

This is the next version of the vPCI rework. The aim of this series is
to prepare the ground for introducing PCI support on the Arm platform.

in v10:

 - Removed patch ("xen/arm: vpci: check guest range"), proper fix
   for the issue is part of ("vpci/header: emulate PCI_COMMAND
   register for guests")
 - Removed patch ("pci/header: reset the command register when adding
   devices")
 - Added patch ("rangeset: add rangeset_empty() function") because
   this function is needed in ("vpci/header: handle p2m range sets
   per BAR")
 - Added ("vpci/header: handle p2m range sets per BAR") which addressed
   an issue discovered by Andrii Chepurnyi during virtio integration
 - Added ("pci: msi: pass pdev to pci_enable_msi() function"), which is
   prereq for ("pci: introduce per-domain PCI rwlock")
 - Fixed "Since v9/v8/... " comments in changelogs to reduce confusion.
   I left "Since" entries for older versions, because they were added
   by original author of the patches.

in v9:

v9 addresses comments made on the previous version. It also
introduces a couple of patches from Stewart, related to vPCI use on
Arm. Patch "vpci/header: rework exit path in init_bars" was
factored out from "vpci/header: handle p2m range sets per BAR".

in v8:

The biggest change from the previous (mistakenly named v7) series is
how locking is implemented. Instead of d->vpci_rwlock we introduce
d->pci_lock, which has a broader scope, as it protects not only the
domain's vpci state, but the domain's list of PCI devices as well.

As discussed with Roger on IRC, it is not feasible to rework all
the existing code to use the new lock right away. It was agreed that
any write access to d->pdev_list will be protected by **both**
d->pci_lock in write mode and pcidevs_lock(). Read accesses, on the
other hand, should be protected by either d->pci_lock in read mode or
pcidevs_lock(). It is expected that existing code will keep using
pcidevs_lock() and that new users will use the new rwlock. Of course,
this does not mean that new users shall not use pcidevs_lock() when
it is appropriate.



Changes from previous versions are described in each separate patch.


Oleksandr Andrushchenko (11):
  vpci: use per-domain PCI lock to protect vpci structure
  vpci: restrict unhandled read/write operations for guests
  vpci: add hooks for PCI device assign/de-assign
  vpci/header: implement guest BAR register handlers
  rangeset: add RANGESETF_no_print flag
  vpci/header: handle p2m range sets per BAR
  vpci/header: program p2m with guest BAR view
  vpci/header: emulate PCI_COMMAND register for guests
  vpci: add initial support for virtual PCI bus topology
  xen/arm: translate virtual PCI bus topology for guests
  xen/arm: account IO handlers for emulated PCI MSI-X

Stewart Hildebrand (1):
  xen/arm: vpci: permit access to guest vpci space

Volodymyr Babchuk (5):
  pci: msi: pass pdev to pci_enable_msi() function
  pci: introduce per-domain PCI rwlock
  vpci/header: rework exit path in init_bars
  rangeset: add rangeset_empty() function
  arm/vpci: honor access size when returning an error

 xen/arch/arm/vpci.c                         |  78 +++-
 xen/arch/x86/hvm/vmsi.c                     |  26 +-
 xen/arch/x86/hvm/vmx/vmx.c                  |   2 -
 xen/arch/x86/include/asm/irq.h              |   3 +-
 xen/arch/x86/include/asm/msi.h              |   3 +-
 xen/arch/x86/irq.c                          |  14 +-
 xen/arch/x86/msi.c                          |  25 +-
 xen/arch/x86/physdev.c                      |   2 +-
 xen/common/domain.c                         |   5 +-
 xen/common/rangeset.c                       |  14 +-
 xen/drivers/Kconfig                         |   4 +
 xen/drivers/char/ns16550.c                  |   4 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c |   9 +-
 xen/drivers/passthrough/pci.c               |  94 +++-
 xen/drivers/passthrough/vtd/iommu.c         |   9 +-
 xen/drivers/vpci/header.c                   | 482 +++++++++++++++-----
 xen/drivers/vpci/msi.c                      |  34 +-
 xen/drivers/vpci/msix.c                     |  56 ++-
 xen/drivers/vpci/vpci.c                     | 157 ++++++-
 xen/include/xen/rangeset.h                  |   8 +-
 xen/include/xen/sched.h                     |   9 +
 xen/include/xen/vpci.h                      |  42 +-
 22 files changed, 875 insertions(+), 205 deletions(-)

-- 
2.42.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v10 01/17] pci: msi: pass pdev to pci_enable_msi() function
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-10-30 15:55   ` Jan Beulich
  2023-11-17 13:59   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Volodymyr Babchuk, Jan Beulich,
	Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Previously the pci_enable_msi() function obtained the pdev pointer by
itself, but taking into account upcoming changes to PCI locking, it is
better for the caller to pass an already acquired pdev pointer to the
function.

Note that the ns16550 driver does not check the validity of the
obtained pdev pointer, because pci_enable_msi() already does this.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---
Changes in v10:

 - New in v10. This is the result of discussion in "vpci: add initial
 support for virtual PCI bus topology"
---
 xen/arch/x86/include/asm/msi.h |  3 ++-
 xen/arch/x86/irq.c             |  2 +-
 xen/arch/x86/msi.c             | 19 ++++++++++---------
 xen/drivers/char/ns16550.c     |  4 +++-
 4 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index a53ade95c9..836c8cd4ba 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -81,7 +81,8 @@ struct irq_desc;
 struct hw_interrupt_type;
 struct msi_desc;
 /* Helper functions */
-extern int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc);
+extern int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
+			  struct pci_dev *pdev);
 extern void pci_disable_msi(struct msi_desc *desc);
 extern int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off);
 extern void pci_cleanup_msi(struct pci_dev *pdev);
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 6abfd81621..68b788c42e 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2167,7 +2167,7 @@ int map_domain_pirq(
         if ( !pdev )
             goto done;
 
-        ret = pci_enable_msi(msi, &msi_desc);
+        ret = pci_enable_msi(msi, &msi_desc, pdev);
         if ( ret )
         {
             if ( ret > 0 )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index a78367d7cf..20275260b3 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -983,13 +983,13 @@ static int msix_capability_init(struct pci_dev *dev,
  * irq or non-zero for otherwise.
  **/
 
-static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
+static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
+			    struct pci_dev *pdev)
 {
-    struct pci_dev *pdev;
     struct msi_desc *old_desc;
 
     ASSERT(pcidevs_locked());
-    pdev = pci_get_pdev(NULL, msi->sbdf);
+
     if ( !pdev )
         return -ENODEV;
 
@@ -1038,13 +1038,13 @@ static void __pci_disable_msi(struct msi_desc *entry)
  * of irqs available. Driver should use the returned value to re-send
  * its request.
  **/
-static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
+static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc,
+			     struct pci_dev *pdev)
 {
-    struct pci_dev *pdev;
     struct msi_desc *old_desc;
 
     ASSERT(pcidevs_locked());
-    pdev = pci_get_pdev(NULL, msi->sbdf);
+
     if ( !pdev || !pdev->msix )
         return -ENODEV;
 
@@ -1151,15 +1151,16 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
  * Notice: only construct the msi_desc
  * no change to irq_desc here, and the interrupt is masked
  */
-int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
+int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
+		   struct pci_dev *pdev)
 {
     ASSERT(pcidevs_locked());
 
     if ( !use_msi )
         return -EPERM;
 
-    return msi->table_base ? __pci_enable_msix(msi, desc) :
-                             __pci_enable_msi(msi, desc);
+    return msi->table_base ? __pci_enable_msix(msi, desc, pdev) :
+			     __pci_enable_msi(msi, desc, pdev);
 }
 
 /*
diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
index 28ddedd50d..1856b72e63 100644
--- a/xen/drivers/char/ns16550.c
+++ b/xen/drivers/char/ns16550.c
@@ -452,10 +452,12 @@ static void __init cf_check ns16550_init_postirq(struct serial_port *port)
             if ( rc > 0 )
             {
                 struct msi_desc *msi_desc = NULL;
+                struct pci_dev *pdev;
 
                 pcidevs_lock();
 
-                rc = pci_enable_msi(&msi, &msi_desc);
+                pdev = pci_get_pdev(NULL, msi.sbdf);
+                rc = pci_enable_msi(&msi, &msi_desc, pdev);
                 if ( !rc )
                 {
                     struct irq_desc *desc = irq_to_desc(msi.irq);
-- 
2.42.0



* [PATCH v10 02/17] pci: introduce per-domain PCI rwlock
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (2 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 05/17] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-11-17 14:33   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 04/17] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu, Paul Durrant, Roger Pau Monné,
	Kevin Tian

Add a per-domain d->pci_lock that protects access to d->pdev_list.
The purpose of this lock is to guarantee to the vPCI code that the
underlying pdev will not disappear from under its feet. This is a
rwlock, but this patch adds only write_lock()s. There will be
read_lock() users in the next patches.

This lock should be taken in write mode every time d->pdev_list is
altered. All write accesses should also be protected by
pcidevs_lock(). The idea is that any user that wants read access to
the list, or to the devices stored in the list, should use either the
new d->pci_lock or the old pcidevs_lock(). Using either of these two
locks ensures only that the pdev of interest will not disappear from
under our feet and that the pdev will still be assigned to the same
domain. Of course, any new users should use pcidevs_lock() when it is
appropriate (e.g. when accessing any other state that is protected by
the said lock). In case both the newly introduced per-domain rwlock
and the pcidevs lock are taken, the latter must be acquired first.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---

Changes in v10:
 - pdev->domain is assigned after removing from source domain but
   before adding to target domain in reassign_device() functions.

Changes in v9:
 - returned back "pdev->domain = target;" in AMD IOMMU code
 - used "source" instead of pdev->domain in IOMMU functions
 - added comment about lock ordering in the commit message
 - reduced locked regions
 - minor changes non-functional changes in various places

Changes in v8:
 - New patch

Changes in v8 vs RFC:
 - Removed all read_locks after discussion with Roger in #xendevel
 - pci_release_devices() now returns the first error code
 - extended commit message
 - added missing lock in pci_remove_device()
 - extended locked region in pci_add_device() to protect list_del() calls
---
 xen/common/domain.c                         |  1 +
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  9 ++-
 xen/drivers/passthrough/pci.c               | 71 +++++++++++++++++----
 xen/drivers/passthrough/vtd/iommu.c         |  9 ++-
 xen/include/xen/sched.h                     |  1 +
 5 files changed, 78 insertions(+), 13 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 8f9ab01c0c..785c69e48b 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -651,6 +651,7 @@ struct domain *domain_create(domid_t domid,
 
 #ifdef CONFIG_HAS_PCI
     INIT_LIST_HEAD(&d->pdev_list);
+    rwlock_init(&d->pci_lock);
 #endif
 
     /* All error paths can depend on the above setup. */
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 836c24b02e..36a617bed4 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -476,8 +476,15 @@ static int cf_check reassign_device(
 
     if ( devfn == pdev->devfn && pdev->domain != target )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
+        write_lock(&source->pci_lock);
+        list_del(&pdev->domain_list);
+        write_unlock(&source->pci_lock);
+
         pdev->domain = target;
+
+        write_lock(&target->pci_lock);
+        list_add(&pdev->domain_list, &target->pdev_list);
+        write_unlock(&target->pci_lock);
     }
 
     /*
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 04d00c7c37..b8ad4fa07c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -453,7 +453,9 @@ static void __init _pci_hide_device(struct pci_dev *pdev)
     if ( pdev->domain )
         return;
     pdev->domain = dom_xen;
+    write_lock(&dom_xen->pci_lock);
     list_add(&pdev->domain_list, &dom_xen->pdev_list);
+    write_unlock(&dom_xen->pci_lock);
 }
 
 int __init pci_hide_device(unsigned int seg, unsigned int bus,
@@ -746,7 +748,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
     if ( !pdev->domain )
     {
         pdev->domain = hardware_domain;
+        write_lock(&hardware_domain->pci_lock);
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
+        write_unlock(&hardware_domain->pci_lock);
 
         /*
          * For devices not discovered by Xen during boot, add vPCI handlers
@@ -756,7 +760,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
+            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
+            write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
             goto out;
         }
@@ -764,7 +770,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             vpci_remove_device(pdev);
+            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
+            write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
             goto out;
         }
@@ -814,7 +822,11 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
             pci_cleanup_msi(pdev);
             ret = iommu_remove_device(pdev);
             if ( pdev->domain )
+            {
+                write_lock(&pdev->domain->pci_lock);
                 list_del(&pdev->domain_list);
+                write_unlock(&pdev->domain->pci_lock);
+            }
             printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
             free_pdev(pseg, pdev);
             break;
@@ -885,26 +897,61 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
 
 int pci_release_devices(struct domain *d)
 {
-    struct pci_dev *pdev, *tmp;
-    u8 bus, devfn;
-    int ret;
+    int combined_ret;
+    LIST_HEAD(failed_pdevs);
 
     pcidevs_lock();
-    ret = arch_pci_clean_pirqs(d);
-    if ( ret )
+
+    combined_ret = arch_pci_clean_pirqs(d);
+    if ( combined_ret )
     {
         pcidevs_unlock();
-        return ret;
+        return combined_ret;
     }
-    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
+
+    write_lock(&d->pci_lock);
+
+    while ( !list_empty(&d->pdev_list) )
     {
-        bus = pdev->bus;
-        devfn = pdev->devfn;
-        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
+        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
+                                                struct pci_dev,
+                                                domain_list);
+        uint16_t seg = pdev->seg;
+        uint8_t bus = pdev->bus;
+        uint8_t devfn = pdev->devfn;
+        int ret;
+
+        write_unlock(&d->pci_lock);
+        ret = deassign_device(d, seg, bus, devfn);
+        write_lock(&d->pci_lock);
+        if ( ret )
+        {
+            const struct pci_dev *tmp;
+
+            /*
+             * We need to check if deassign_device() left our pdev in
+             * domain's list. As we dropped the lock, we can't be sure
+             * that list wasn't permuted in some random way, so we
+             * need to traverse the whole list.
+             */
+            for_each_pdev ( d, tmp )
+            {
+                if ( tmp == pdev )
+                {
+                    list_move_tail(&pdev->domain_list, &failed_pdevs);
+                    break;
+                }
+            }
+
+            combined_ret = combined_ret ?: ret;
+        }
     }
+
+    list_splice(&failed_pdevs, &d->pdev_list);
+    write_unlock(&d->pci_lock);
     pcidevs_unlock();
 
-    return ret;
+    return combined_ret;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
@@ -1124,7 +1171,9 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
             if ( !pdev->domain )
             {
                 pdev->domain = ctxt->d;
+                write_lock(&ctxt->d->pci_lock);
                 list_add(&pdev->domain_list, &ctxt->d->pdev_list);
+                write_unlock(&ctxt->d->pci_lock);
                 setup_one_hwdom_device(ctxt, pdev);
             }
             else if ( pdev->domain == dom_xen )
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index d34c98d9c7..cdb1225578 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2806,8 +2806,15 @@ static int cf_check reassign_device_ownership(
 
     if ( devfn == pdev->devfn && pdev->domain != target )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
+        write_lock(&source->pci_lock);
+        list_del(&pdev->domain_list);
+        write_unlock(&source->pci_lock);
+
         pdev->domain = target;
+
+        write_lock(&target->pci_lock);
+        list_add(&pdev->domain_list, &target->pdev_list);
+        write_unlock(&target->pci_lock);
     }
 
     if ( !has_arch_pdevs(source) )
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 3609ef88c4..57391e74b6 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -461,6 +461,7 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+    rwlock_t pci_lock;
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
-- 
2.42.0


* [PATCH v10 05/17] vpci: add hooks for PCI device assign/de-assign
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
  2023-10-12 22:09 ` [PATCH v10 01/17] pci: msi: pass pdev to pci_enable_msi() function Volodymyr Babchuk
  2023-10-12 22:09 ` [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-11-20 15:04   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 02/17] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Jan Beulich,
	Paul Durrant, Roger Pau Monné,
	Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a PCI device gets assigned/de-assigned we need to
initialize/de-initialize the vPCI state for the device.

Also, rename vpci_add_handlers() to vpci_assign_device() and
vpci_remove_device() to vpci_deassign_device() to better reflect the
role of these functions.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---

In v10:
- removed HAS_VPCI_GUEST_SUPPORT checks
- removed the HAS_VPCI_GUEST_SUPPORT config option (in Kconfig) as it
  is not used anywhere
In v9:
- removed the previous vpci_[de]assign_device functions and renamed
  the existing handlers
- dropped attempts to handle errors in assign_device() function
- do not call vpci_assign_device for dom_io
- use d instead of pdev->domain
- use IS_ENABLED macro
In v8:
- removed vpci_deassign_device
In v6:
- do not pass struct domain to vpci_{assign|deassign}_device as
  pdev->domain can be used
- do not leave the device assigned (pdev->domain == new domain) in case
  vpci_assign_device fails: try to de-assign and if this also fails, then
  crash the domain
In v5:
- do not split code into run_vpci_init
- do not check for is_system_domain in vpci_{de}assign_device
- do not use vpci_remove_device_handlers_locked and re-allocate
  pdev->vpci completely
- make vpci_deassign_device void
In v4:
 - de-assign vPCI from the previous domain on device assignment
 - do not remove handlers in vpci_assign_device as those must not
   exist at that point
In v3:
 - remove toolstack roll-back description from the commit message
   as error are to be handled with proper cleanup in Xen itself
 - remove __must_check
 - remove redundant rc check while assigning devices
 - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
 - use REGISTER_VPCI_INIT machinery to run required steps on device
   init/assign: add run_vpci_init helper
In v2:
- define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
  for x86
In v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - extended the commit message
---
 xen/drivers/passthrough/pci.c | 20 ++++++++++++++++----
 xen/drivers/vpci/header.c     |  2 +-
 xen/drivers/vpci/vpci.c       |  6 +++---
 xen/include/xen/vpci.h        | 10 +++++-----
 4 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 182da45acb..b7926a291c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -755,7 +755,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
          * For devices not discovered by Xen during boot, add vPCI handlers
          * when Dom0 first informs Xen about such devices.
          */
-        ret = vpci_add_handlers(pdev);
+        ret = vpci_assign_device(pdev);
         if ( ret )
         {
             list_del(&pdev->domain_list);
@@ -769,7 +769,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             write_lock(&hardware_domain->pci_lock);
-            vpci_remove_device(pdev);
+            vpci_deassign_device(pdev);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
@@ -817,7 +817,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
         {
-            vpci_remove_device(pdev);
+            vpci_deassign_device(pdev);
             pci_cleanup_msi(pdev);
             ret = iommu_remove_device(pdev);
             if ( pdev->domain )
@@ -875,6 +875,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
             goto out;
     }
 
+    write_lock(&d->pci_lock);
+    vpci_deassign_device(pdev);
+    write_unlock(&d->pci_lock);
+
     devfn = pdev->devfn;
     ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn,
                      pci_to_dev(pdev));
@@ -1146,7 +1150,7 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
     write_lock(&ctxt->d->pci_lock);
-    err = vpci_add_handlers(pdev);
+    err = vpci_assign_device(pdev);
     write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
@@ -1476,6 +1480,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
     if ( pdev->broken && d != hardware_domain && d != dom_io )
         goto done;
 
+    write_lock(&pdev->domain->pci_lock);
+    vpci_deassign_device(pdev);
+    write_unlock(&pdev->domain->pci_lock);
+
     rc = pdev_msix_assign(d, pdev);
     if ( rc )
         goto done;
@@ -1502,6 +1510,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
                         pci_to_dev(pdev), flag);
     }
 
+    write_lock(&d->pci_lock);
+    rc = vpci_assign_device(pdev);
+    write_unlock(&d->pci_lock);
+
  done:
     if ( rc )
         printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index a52e52db96..176fe16b9f 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -190,7 +190,7 @@ bool vpci_process_pending(struct vcpu *v)
              * killed in order to avoid leaking stale p2m mappings on
              * failure.
              */
-            vpci_remove_device(v->vpci.pdev);
+            vpci_deassign_device(v->vpci.pdev);
         write_unlock(&v->domain->pci_lock);
     }
 
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 724987e981..b20bee2b0b 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -36,7 +36,7 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
-void vpci_remove_device(struct pci_dev *pdev)
+void vpci_deassign_device(struct pci_dev *pdev)
 {
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
@@ -69,7 +69,7 @@ void vpci_remove_device(struct pci_dev *pdev)
     pdev->vpci = NULL;
 }
 
-int vpci_add_handlers(struct pci_dev *pdev)
+int vpci_assign_device(struct pci_dev *pdev)
 {
     unsigned int i;
     const unsigned long *ro_map;
@@ -103,7 +103,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
     }
 
     if ( rc )
-        vpci_remove_device(pdev);
+        vpci_deassign_device(pdev);
 
     return rc;
 }
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index d743d96a10..75cfb532ee 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -25,11 +25,11 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
 
-/* Add vPCI handlers to device. */
-int __must_check vpci_add_handlers(struct pci_dev *pdev);
+/* Assign vPCI to device by adding handlers to device. */
+int __must_check vpci_assign_device(struct pci_dev *pdev);
 
 /* Remove all handlers and free vpci related structures. */
-void vpci_remove_device(struct pci_dev *pdev);
+void vpci_deassign_device(struct pci_dev *pdev);
 
 /* Add/remove a register handler. */
 int __must_check vpci_add_register(struct vpci *vpci,
@@ -235,12 +235,12 @@ bool vpci_ecam_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
 #else /* !CONFIG_HAS_VPCI */
 struct vpci_vcpu {};
 
-static inline int vpci_add_handlers(struct pci_dev *pdev)
+static inline int vpci_assign_device(struct pci_dev *pdev)
 {
     return 0;
 }
 
-static inline void vpci_remove_device(struct pci_dev *pdev) { }
+static inline void vpci_deassign_device(struct pci_dev *pdev) { }
 
 static inline void vpci_dump_msi(void) { }
 
-- 
2.42.0



* [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
  2023-10-12 22:09 ` [PATCH v10 01/17] pci: msi: pass pdev to pci_enable_msi() function Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-11-03 15:39   ` Stewart Hildebrand
  2023-11-17 15:16   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 05/17] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
                   ` (14 subsequent siblings)
  16 siblings, 2 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Jan Beulich,
	Andrew Cooper, Roger Pau Monné,
	Wei Liu, Jun Nakajima, Kevin Tian, Paul Durrant,
	Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Use the previously introduced per-domain read/write lock to check
whether vpci is present, so we can be sure there are no accesses to
the contents of the vpci struct if it is not. This lock can be used
(and in a few cases is used right away) so that vpci removal can be
performed while holding the lock in write mode. Previously such
removal could race with vpci_read, for example.

When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
possible deadlock situations.

1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
from being removed.

2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for overlaps with existing BARs. The overlapping check, if
done under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade a read lock to a write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock instead. To prevent
deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
always lock hwdom first.

All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.

3. Drop const qualifier where the new rwlock is used and this is
appropriate.

4. Do not call process_pending_softirqs with any locks held. For that
unlock prior the call and re-acquire the locks after. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
 - in apply_map because of the context it is called (no race condition
   possible)
 - for MSI/MSI-X debug code because it is called at the end of
   pdev->vpci access and no further access to pdev->vpci is made

5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.

6. We are removing multiple ASSERT(pcidevs_locked()) instances because
they are too strict now: they should be corrected to
ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock)), but the
problem is that the mentioned instances do not have access to the
domain pointer, and it is not feasible to pass a domain pointer to a
function just for debugging purposes.
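
The relaxed requirement described in point 6 — either the global
pcidevs lock or the per-domain pci_lock is sufficient — can be written
down as a simple predicate. This is a standalone illustrative sketch
with made-up names, not the Xen ASSERT itself:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the relaxed locking requirement: the caller must hold
 * either the global pcidevs lock or the per-domain pci_lock (in any
 * mode) to safely access pdev->vpci. */
bool pdev_access_allowed(bool pcidevs_locked, bool pci_rw_locked)
{
    return pcidevs_locked || pci_rw_locked;
}
```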

There is a possible lock inversion in MSI code, as some parts of it
acquire pcidevs_lock() while already holding d->pci_lock.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---
Changes in v10:
 - Moved printk past the locked area
 - Returned back ASSERTs
 - Added a new parameter to allocate_and_map_msi_pirq() so it knows
   whether it should take the global pci lock
 - Added comment about possible improvement in vpci_write
 - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
   appropriate places
 - Renamed release_domain_locks() to release_domain_write_locks()
 - moved domain_done label in vpci_dump_msi() to correct place
Changes in v9:
 - extended locked region to protect vpci_remove_device and
   vpci_add_handlers() calls
 - vpci_write() takes lock in the write mode to protect
   potential call to modify_bars()
 - renamed lock releasing function
 - removed ASSERT()s from msi code
 - added trylock in vpci_dump_msi

Changes in v8:
 - changed d->vpci_lock to d->pci_lock
 - introducing d->pci_lock in a separate patch
 - extended locked region in vpci_process_pending
 - removed pcidevs_lock in vpci_dump_msi()
 - removed some changes as they are not needed with
   the new locking scheme
 - added handling for hwdom && dom_xen case
---
 xen/arch/x86/hvm/vmsi.c        | 26 ++++++++---------
 xen/arch/x86/hvm/vmx/vmx.c     |  2 --
 xen/arch/x86/include/asm/irq.h |  3 +-
 xen/arch/x86/irq.c             | 12 ++++----
 xen/arch/x86/msi.c             | 10 ++-----
 xen/arch/x86/physdev.c         |  2 +-
 xen/drivers/passthrough/pci.c  |  9 +++---
 xen/drivers/vpci/header.c      | 18 ++++++++++++
 xen/drivers/vpci/msi.c         | 28 ++++++++++++++++--
 xen/drivers/vpci/msix.c        | 52 +++++++++++++++++++++++++++++-----
 xen/drivers/vpci/vpci.c        | 51 +++++++++++++++++++++++++++++++--
 11 files changed, 166 insertions(+), 47 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f236362..6b33a80120 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
     struct msixtbl_entry *entry, *new_entry;
     int r = -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
     struct pci_dev *pdev;
     struct msixtbl_entry *entry;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
 {
     unsigned int i;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
     if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
     {
@@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
     int rc;
 
     ASSERT(msi->arch.pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
     {
         struct xen_domctl_bind_pt_irq unbind = {
@@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
 
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
                                        msi->vectors, msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 }
 
 static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -762,7 +761,7 @@ static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
     rc = allocate_and_map_msi_pirq(pdev->domain, -1, &pirq,
                                    table_base ? MAP_PIRQ_TYPE_MSI
                                               : MAP_PIRQ_TYPE_MULTI_MSI,
-                                   &msi_info);
+                                   &msi_info, false);
     if ( rc )
     {
         gdprintk(XENLOG_ERR, "%pp: failed to map PIRQ: %d\n", &pdev->sbdf, rc);
@@ -778,15 +777,13 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
     int rc;
 
     ASSERT(msi->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
     rc = vpci_msi_enable(pdev, vectors, 0);
     if ( rc < 0 )
         return rc;
     msi->arch.pirq = rc;
-
-    pcidevs_lock();
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
                                        msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 
     return 0;
 }
@@ -797,8 +794,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     unsigned int i;
 
     ASSERT(pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < nr && bound; i++ )
     {
         struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +811,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     write_lock(&pdev->domain->event_lock);
     unmap_domain_pirq(pdev->domain, pirq);
     write_unlock(&pdev->domain->event_lock);
-    pcidevs_unlock();
 }
 
 void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -854,6 +850,8 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
     int rc;
 
     ASSERT(entry->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
+
     rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
                          table_base);
     if ( rc < 0 )
@@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
 
     entry->arch.pirq = rc;
 
-    pcidevs_lock();
     rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
                          entry->masked);
     if ( rc )
@@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
         vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
         entry->arch.pirq = INVALID_PIRQ;
     }
-    pcidevs_unlock();
 
     return rc;
 }
@@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            read_unlock(&pdev->domain->pci_lock);
             process_pending_softirqs();
+            read_lock(&pdev->domain->pci_lock);
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 1edc7f1e91..545a27796e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -413,8 +413,6 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
 
     spin_unlock_irq(&desc->lock);
 
-    ASSERT(pcidevs_locked());
-
     return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
 
  unlock_out:
diff --git a/xen/arch/x86/include/asm/irq.h b/xen/arch/x86/include/asm/irq.h
index ad907fc97f..3d24f39ca6 100644
--- a/xen/arch/x86/include/asm/irq.h
+++ b/xen/arch/x86/include/asm/irq.h
@@ -213,6 +213,7 @@ static inline void arch_move_irqs(struct vcpu *v) { }
 struct msi_info;
 int allocate_and_map_gsi_pirq(struct domain *d, int index, int *pirq_p);
 int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
-                              int type, struct msi_info *msi);
+                              int type, struct msi_info *msi,
+                              bool use_pci_lock);
 
 #endif /* _ASM_HW_IRQ_H */
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 68b788c42e..970ba04aa0 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2157,7 +2157,7 @@ int map_domain_pirq(
         struct pci_dev *pdev;
         unsigned int nr = 0;
 
-        ASSERT(pcidevs_locked());
+        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
 
         ret = -ENODEV;
         if ( !cpu_has_apic )
@@ -2314,7 +2314,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
@@ -2875,7 +2875,7 @@ int allocate_and_map_gsi_pirq(struct domain *d, int index, int *pirq_p)
 }
 
 int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
-                              int type, struct msi_info *msi)
+                              int type, struct msi_info *msi, bool use_pci_lock)
 {
     int irq, pirq, ret;
 
@@ -2908,7 +2908,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
     msi->irq = irq;
 
-    pcidevs_lock();
+    if ( use_pci_lock )
+        pcidevs_lock();
     /* Verify or get pirq. */
     write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2924,7 +2925,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
  done:
     write_unlock(&d->event_lock);
-    pcidevs_unlock();
+    if ( use_pci_lock )
+        pcidevs_unlock();
     if ( ret )
     {
         switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 20275260b3..466725d8ca 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
     unsigned int i, mpos;
     uint16_t control;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
     pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
     if ( !pos )
         return -ENODEV;
@@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
     if ( !pos )
         return -ENODEV;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
 
     control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
     /*
@@ -988,8 +988,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
 {
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev )
         return -ENODEV;
 
@@ -1043,8 +1041,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc,
 {
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev || !pdev->msix )
         return -ENODEV;
 
@@ -1154,8 +1150,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
 int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
 		   struct pci_dev *pdev)
 {
-    ASSERT(pcidevs_locked());
-
     if ( !use_msi )
         return -EPERM;
 
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 2f1d955a96..7cbb5bc2c8 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -123,7 +123,7 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
 
     case MAP_PIRQ_TYPE_MSI:
     case MAP_PIRQ_TYPE_MULTI_MSI:
-        ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
+        ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi, true);
         break;
 
     default:
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index b8ad4fa07c..182da45acb 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         pdev->domain = hardware_domain;
         write_lock(&hardware_domain->pci_lock);
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
-        write_unlock(&hardware_domain->pci_lock);
 
         /*
          * For devices not discovered by Xen during boot, add vPCI handlers
@@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         ret = vpci_add_handlers(pdev);
         if ( ret )
         {
-            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
-            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
+            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
             goto out;
         }
+        write_unlock(&hardware_domain->pci_lock);
         ret = iommu_add_device(pdev);
         if ( ret )
         {
-            vpci_remove_device(pdev);
             write_lock(&hardware_domain->pci_lock);
+            vpci_remove_device(pdev);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
@@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
+    write_lock(&ctxt->d->pci_lock);
     err = vpci_add_handlers(pdev);
+    write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
                ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 767c1ba718..a52e52db96 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -172,6 +172,7 @@ bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        write_lock(&v->domain->pci_lock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
@@ -190,6 +191,7 @@ bool vpci_process_pending(struct vcpu *v)
              * failure.
              */
             vpci_remove_device(v->vpci.pdev);
+        write_unlock(&v->domain->pci_lock);
     }
 
     return false;
@@ -201,8 +203,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_write_locked(&d->pci_lock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking pdev disappearing because devices cannot be
+         * removed until the initial domain has been started.
+         */
+        read_unlock(&d->pci_lock);
         process_pending_softirqs();
+        read_lock(&d->pci_lock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -243,6 +257,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     unsigned int i;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !mem )
         return -ENOMEM;
 
@@ -522,6 +538,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_bar *bars = header->bars;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index a253ccbd7d..2faa54b7ce 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -275,6 +275,9 @@ void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->pci_lock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -316,14 +319,33 @@ void vpci_dump_msi(void)
                      * holding the lock.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
             spin_unlock(&pdev->vpci->lock);
+ pdev_done:
+            /*
+             * Drop the lock to process pending softirqs. This is
+             * potentially unsafe, as d->pdev_list can be changed in
+             * the meantime.
+             */
+            read_unlock(&d->pci_lock);
             process_pending_softirqs();
+            if ( !read_trylock(&d->pci_lock) )
+            {
+                printk("unable to access other devices for the domain\n");
+                goto domain_done;
+            }
         }
+        read_unlock(&d->pci_lock);
+    domain_done:
+        /*
+         * We need this label at the end of the loop, but some
+         * compilers might not be happy about a label at the end of a
+         * compound statement, so we add an empty statement here.
+         */
+        ;
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index d1126a417d..b6abab47ef 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->pci_lock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->pci_lock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0UL;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +426,7 @@ static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +616,7 @@ static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 3bec9a4153..112de56fb3 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
         return;
 
@@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
     const unsigned long *ro_map;
     int rc = 0;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) )
         return 0;
 
@@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     uint32_t data = ~(uint32_t)0;
+    rwlock_t *lock;
 
     if ( !size )
     {
@@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
      * Find the PCI dev matching the address, which for hwdom also requires
      * consulting DomXEN.  Passthrough everything that's not trapped.
      */
+    lock = &d->pci_lock;
+    read_lock(lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
+    {
+        read_unlock(lock);
+        lock = &dom_xen->pci_lock;
+        read_lock(lock);
         pdev = pci_get_pdev(dom_xen, sbdf);
+    }
     if ( !pdev || !pdev->vpci )
+    {
+        read_unlock(lock);
         return vpci_read_hw(sbdf, reg, size);
+    }
 
     spin_lock(&pdev->vpci->lock);
 
@@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(lock);
 
     if ( data_offset < size )
     {
@@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
              r->private);
 }
 
+/* Helper function to unlock locks taken by vpci_write in proper order */
+static void release_domain_write_locks(struct domain *d)
+{
+    ASSERT(rw_is_write_locked(&d->pci_lock));
+
+    if ( is_hardware_domain(d) )
+    {
+        ASSERT(rw_is_write_locked(&dom_xen->pci_lock));
+        write_unlock(&dom_xen->pci_lock);
+    }
+    write_unlock(&d->pci_lock);
+}
+
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -447,8 +476,20 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
-     * consulting DomXEN.  Passthrough everything that's not trapped.
+     * consulting DomXEN. Passthrough everything that's not trapped.
+     * If this is hwdom, we need to hold the locks of both domains in
+     * case modify_bars() is called.
+     */
+    /*
+     * TODO: We need to take the pci_locks in exclusive mode only if
+     * we are modifying BARs, so there is room for improvement.
      */
+    write_lock(&d->pci_lock);
+
+    /* dom_xen->pci_lock should always be taken second to prevent deadlock */
+    if ( is_hardware_domain(d) )
+        write_lock(&dom_xen->pci_lock);
+
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
@@ -457,8 +498,11 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         /* Ignore writes to read-only devices, which have no ->vpci. */
         const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
 
+        release_domain_write_locks(d);
+
         if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
             vpci_write_hw(sbdf, reg, size, data);
+
         return;
     }
 
@@ -498,6 +542,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    release_domain_write_locks(d);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
-- 
2.42.0

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v10 04/17] vpci: restrict unhandled read/write operations for guests
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (3 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 02/17] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-10-12 22:09 ` [PATCH v10 08/17] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

A guest is able to read and write those registers which are not
emulated and have no respective vPCI handlers, so it is possible for
it to access the hardware directly.
In order to prevent a guest from reading and writing the unhandled
registers, make sure only the hardware domain can access the hardware
directly and restrict guests from doing so.
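
The policy reduces to a small pure function: non-hardware domains see
all-ones on reads and have writes discarded. Below is a hypothetical
standalone model of that behaviour (the actual patch checks
is_hardware_domain(current->domain) inside vpci_{read,write}_hw):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the read path: guests never reach real hardware and get
 * the "no device" pattern (~0) instead. */
uint32_t model_read_hw(bool is_hwdom, uint32_t hw_val)
{
    return is_hwdom ? hw_val : ~(uint32_t)0;
}

/* Model of the write path: returns whether the write would reach
 * real hardware. */
bool model_write_hw(bool is_hwdom)
{
    return is_hwdom;
}
```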

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

---
Since v9:
- removed stray formatting change
- added Roger's R-b tag
Since v6:
- do not use is_hwdom parameter for vpci_{read|write}_hw and use
  current->domain internally
- update commit message
New in v6
---
 xen/drivers/vpci/vpci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 112de56fb3..724987e981 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -233,6 +233,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
 {
     uint32_t data;
 
+    /* Guest domains are not allowed to read real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return ~(uint32_t)0;
+
     switch ( size )
     {
     case 4:
@@ -276,6 +280,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
 static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                           uint32_t data)
 {
+    /* Guest domains are not allowed to write real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return;
+
     switch ( size )
     {
     case 4:
-- 
2.42.0


* [PATCH v10 07/17] vpci/header: implement guest BAR register handlers
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (6 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 06/17] vpci/header: rework exit path in init_bars Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-10-14 16:00   ` Stewart Hildebrand
  2023-11-20 16:06   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 11/17] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
                   ` (8 subsequent siblings)
  16 siblings, 2 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add relevant vpci register handlers when assigning PCI device to a domain
and remove those when de-assigning. This allows having different
handlers for different domains, e.g. hwdom and other guests.

Emulate guest BAR register values: this allows creating a guest view
of the registers and emulates size and properties probe as it is done
during PCI device enumeration by the guest.

All empty, IO and ROM BARs for guests are emulated by returning 0 on
reads and ignoring writes: these BARs are special in this respect, as
their lower bits have special meaning, so returning the default ~0 on
read may confuse the guest OS.
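
The sizing probe that the guest view has to emulate works by writing
all-ones to the BAR and reading back a value whose writable bits
encode the size. A standalone sketch of that arithmetic for a 32-bit
memory BAR (helper names are made up; the mask matches
PCI_BASE_ADDRESS_MEM_MASK):

```c
#include <assert.h>
#include <stdint.h>

#define BAR_MEM_MASK 0xfffffff0u /* writable bits of a 32-bit mem BAR */

/* After the guest writes 0xffffffff, the emulated BAR latches
 * ~(size - 1): reading it back yields the size mask. BAR sizes are
 * powers of two per the PCI spec. */
uint32_t bar_probe_readback(uint32_t size)
{
    return (uint32_t)~(size - 1) & BAR_MEM_MASK;
}

/* The guest recovers the size by inverting the masked read-back
 * value and adding one. */
uint32_t bar_size_from_readback(uint32_t readback)
{
    return ~(readback & BAR_MEM_MASK) + 1;
}
```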

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
In v10:
 - ull -> ULL to be MISRA-compatible
- Use PAGE_OFFSET() instead of combining with ~PAGE_MASK
- Set type of empty bars to VPCI_BAR_EMPTY
In v9:
- factored-out "fail" label introduction in init_bars()
- replaced #ifdef CONFIG_X86 with IS_ENABLED()
- do not pass bars[i] to empty_bar_read() handler
 - store the guest's BAR address instead of the guest's BAR register view
Since v6:
- unify the writing of the PCI_COMMAND register on the
  error path into a label
- do not introduce bar_ignore_access helper and open code
- s/guest_bar_ignore_read/empty_bar_read
- update error message in guest_bar_write
- only setup empty_bar_read for IO if !x86
Since v5:
- make sure that the guest set address has the same page offset
  as the physical address on the host
- remove guest_rom_{read|write} as those just implement the default
  behaviour of the registers not being handled
- adjusted comment for struct vpci.addr field
- add guest handlers for BARs which are not handled and will otherwise
  return ~0 on read and ignore writes. The BARs are special with this
  respect as their lower bits have special meaning, so returning ~0
  doesn't seem to be right
Since v4:
- updated commit message
- s/guest_addr/guest_reg
Since v3:
- squashed two patches: dynamic add/remove handlers and guest BAR
  handler implementation
- fix guest BAR read of the high part of a 64bit BAR (Roger)
- add error handling to vpci_assign_device
- s/dom%pd/%pd
- blank line before return
Since v2:
- remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
  has been eliminated from being built on x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - simplify some code
 - use gdprintk + error code instead of gprintk
 - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
   so these do not get compiled for x86
 - removed unneeded is_system_domain check
 - re-work guest read/write to be much simpler and do more work on write
   than read which is expected to be called more frequently
 - removed one too obvious comment
---
 xen/drivers/vpci/header.c | 137 +++++++++++++++++++++++++++++++++-----
 xen/include/xen/vpci.h    |   3 +
 2 files changed, 123 insertions(+), 17 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 33db58580c..40d1a07ada 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -477,6 +477,74 @@ static void cf_check bar_write(
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static void cf_check guest_bar_write(const struct pci_dev *pdev,
+                                     unsigned int reg, uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    bool hi = false;
+    uint64_t guest_addr = bar->guest_addr;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+    {
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+    }
+
+    guest_addr &= ~(0xffffffffULL << (hi ? 32 : 0));
+    guest_addr |= (uint64_t)val << (hi ? 32 : 0);
+
+    /* Allow guest to size BAR correctly */
+    guest_addr &= ~(bar->size - 1);
+
+    /*
+     * Make sure that the guest set address has the same page offset
+     * as the physical address on the host or otherwise things won't work as
+     * expected.
+     */
+    if ( guest_addr != ~(bar->size -1 )  &&
+         PAGE_OFFSET(guest_addr) != PAGE_OFFSET(bar->addr) )
+    {
+        gprintk(XENLOG_WARNING,
+                "%pp: ignored BAR %zu write attempting to change page offset\n",
+                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
+        return;
+    }
+
+    bar->guest_addr = guest_addr;
+}
+
+static uint32_t cf_check guest_bar_read(const struct pci_dev *pdev,
+                                        unsigned int reg, void *data)
+{
+    const struct vpci_bar *bar = data;
+    uint32_t reg_val;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        return bar->guest_addr >> 32;
+    }
+
+    reg_val = bar->guest_addr;
+    reg_val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32 :
+                                             PCI_BASE_ADDRESS_MEM_TYPE_64;
+    reg_val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+
+    return reg_val;
+}
+
+static uint32_t cf_check empty_bar_read(const struct pci_dev *pdev,
+                                        unsigned int reg, void *data)
+{
+    return 0;
+}
+
 static void cf_check rom_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -537,6 +605,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_header *header = &pdev->vpci->header;
     struct vpci_bar *bars = header->bars;
     int rc;
+    bool is_hwdom = is_hardware_domain(pdev->domain);
 
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
@@ -578,8 +647,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            rc = vpci_add_register(pdev->vpci,
+                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                                   is_hwdom ? bar_write : guest_bar_write,
+                                   reg, 4, &bars[i]);
             if ( rc )
                 goto fail;
             continue;
@@ -588,7 +659,17 @@ static int cf_check init_bars(struct pci_dev *pdev)
         val = pci_conf_read32(pdev->sbdf, reg);
         if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
         {
-            bars[i].type = VPCI_BAR_IO;
+            if ( !IS_ENABLED(CONFIG_X86) && !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                       reg, 4, NULL);
+                if ( rc )
+                {
+                    bars[i].type = VPCI_BAR_EMPTY;
+                    goto fail;
+                }
+            }
+
             continue;
         }
         if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
@@ -605,6 +686,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
         if ( size == 0 )
         {
             bars[i].type = VPCI_BAR_EMPTY;
+
+            if ( !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                       reg, 4, NULL);
+                if ( rc )
+                    goto fail;
+            }
+
             continue;
         }
 
@@ -612,28 +702,41 @@ static int cf_check init_bars(struct pci_dev *pdev)
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                               is_hwdom ? bar_write : guest_bar_write,
+                               reg, 4, &bars[i]);
         if ( rc )
             goto fail;
     }
 
-    /* Check expansion ROM. */
-    rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
-    if ( rc > 0 && size )
+    /* TODO: Check expansion ROM, we do not handle ROM for guests for now. */
+    if ( is_hwdom )
     {
-        struct vpci_bar *rom = &header->bars[num_bars];
+        rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
+        if ( rc > 0 && size )
+        {
+            struct vpci_bar *rom = &header->bars[num_bars];
 
-        rom->type = VPCI_BAR_ROM;
-        rom->size = size;
-        rom->addr = addr;
-        header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
-                              PCI_ROM_ADDRESS_ENABLE;
+            rom->type = VPCI_BAR_ROM;
+            rom->size = size;
+            rom->addr = addr;
+            header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
+                                  PCI_ROM_ADDRESS_ENABLE;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
-                               4, rom);
+            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
+                                   rom_reg, 4, rom);
+            if ( rc )
+                rom->type = VPCI_BAR_EMPTY;
+        }
+    }
+    else
+    {
+        header->bars[num_bars].type = VPCI_BAR_EMPTY;
+        rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                               rom_reg, 4, NULL);
         if ( rc )
-            rom->type = VPCI_BAR_EMPTY;
+            goto fail;
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 75cfb532ee..2028f2151f 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -67,7 +67,10 @@ struct vpci {
     struct vpci_header {
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            /* Physical (host) address. */
             uint64_t addr;
+            /* Guest address. */
+            uint64_t guest_addr;
             uint64_t size;
             enum {
                 VPCI_BAR_EMPTY,
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v10 06/17] vpci/header: rework exit path in init_bars
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (5 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 08/17] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-11-20 15:07   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 07/17] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel; +Cc: Stewart Hildebrand, Volodymyr Babchuk, Roger Pau Monné

Introduce a "fail" label in the init_bars() function to provide a
centralized error return path. This is a prerequisite for future
changes in this function.

This patch does not introduce functional changes.
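The resulting structure is the common goto-fail idiom: every error
branch jumps to a single label that undoes the earlier state change
before returning. A simplified, illustrative sketch (the function and
names below are not the actual init_bars() code):

```c
#include <assert.h>

/*
 * Simplified model of a centralized exit path: each failing step jumps
 * to the single "fail" label, which restores the saved state (here the
 * command-register stand-in *cmd) in one place before returning.
 */
int setup_registers(int *cmd, int fail_at)
{
    int saved = *cmd;
    int rc;

    *cmd = 0;                 /* state change that must be undone on failure */

    rc = (fail_at == 1) ? -1 : 0;
    if ( rc )
        goto fail;

    rc = (fail_at == 2) ? -1 : 0;
    if ( rc )
        goto fail;

    return 0;

 fail:
    *cmd = saved;             /* single place restoring the register */
    return rc;
}
```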

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
---
In v10:
- Added Roger's A-b tag.
In v9:
- New in v9
---
 xen/drivers/vpci/header.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 176fe16b9f..33db58580c 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -581,11 +581,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
             rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
                                    4, &bars[i]);
             if ( rc )
-            {
-                pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-                return rc;
-            }
-
+                goto fail;
             continue;
         }
 
@@ -604,10 +600,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
 
         if ( size == 0 )
         {
@@ -622,10 +615,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
         rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
                                &bars[i]);
         if ( rc )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
     }
 
     /* Check expansion ROM. */
@@ -647,6 +637,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
+
+ fail:
+    pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    return rc;
 }
 REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
 
-- 
2.42.0


* [PATCH v10 08/17] rangeset: add RANGESETF_no_print flag
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (4 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 04/17] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-10-12 22:09 ` [PATCH v10 06/17] vpci/header: rework exit path in init_bars Volodymyr Babchuk
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are rangesets which should not be printed, so introduce a flag
that allows marking them as such. Implement the relevant logic to skip
such entries while printing.

While at it also simplify the definition of the flags by directly
defining those without helpers.
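The simplified flag layout and the widened validity check can be
sketched as below (an illustrative model of the patch, not the exact
Xen code; the helper function name is made up):

```c
#include <assert.h>

/*
 * After this patch the flags are plain bit masks, defined directly
 * without the intermediate _RANGESETF_* shift helpers.
 */
#define RANGESETF_prettyprint_hex (1U << 0)
#define RANGESETF_no_print        (1U << 1)

/*
 * Mirrors the updated BUG_ON() in rangeset_new(): flags are valid only
 * if they contain no bits outside the known set.
 */
int rangeset_flags_valid(unsigned int flags)
{
    return !(flags & ~(RANGESETF_prettyprint_hex | RANGESETF_no_print));
}
```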

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Since v5:
- comment indentation (Jan)
Since v1:
- update BUG_ON with new flag
- simplify the definition of the flags
---
 xen/common/rangeset.c      | 5 ++++-
 xen/include/xen/rangeset.h | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index f3baf52ab6..35c3420885 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -433,7 +433,7 @@ struct rangeset *rangeset_new(
     INIT_LIST_HEAD(&r->range_list);
     r->nr_ranges = -1;
 
-    BUG_ON(flags & ~RANGESETF_prettyprint_hex);
+    BUG_ON(flags & ~(RANGESETF_prettyprint_hex | RANGESETF_no_print));
     r->flags = flags;
 
     safe_strcpy(r->name, name ?: "(no name)");
@@ -575,6 +575,9 @@ void rangeset_domain_printk(
 
     list_for_each_entry ( r, &d->rangesets, rangeset_list )
     {
+        if ( r->flags & RANGESETF_no_print )
+            continue;
+
         printk("    ");
         rangeset_printk(r);
         printk("\n");
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index 135f33f606..f7c69394d6 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -49,8 +49,9 @@ void rangeset_limit(
 
 /* Flags for passing to rangeset_new(). */
  /* Pretty-print range limits in hexadecimal. */
-#define _RANGESETF_prettyprint_hex 0
-#define RANGESETF_prettyprint_hex  (1U << _RANGESETF_prettyprint_hex)
+#define RANGESETF_prettyprint_hex   (1U << 0)
+ /* Do not print entries marked with this flag. */
+#define RANGESETF_no_print          (1U << 1)
 
 bool_t __must_check rangeset_is_empty(
     const struct rangeset *r);
-- 
2.42.0



* [PATCH v10 11/17] vpci/header: program p2m with guest BAR view
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (7 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 07/17] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-11-21 12:24   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 10/17] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné,
	Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take the guest's BAR view into account and program its p2m accordingly:
the gfn is the guest's view of the BAR and the mfn is the physical BAR
value. This way the hardware domain sees the physical BAR values and
the guest sees the emulated ones.

The hardware domain continues to get the BARs identity mapped, while
for domUs the BARs are mapped at the requested guest address without
modifying the BAR address in the device's PCI config space.
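The gfn-to-mfn arithmetic used when programming the p2m can be sketched
as follows (simplified, assuming 4 KiB pages; the helper name is
illustrative, while the computation mirrors map_mfn in map_range()):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PFN_DOWN(addr) ((addr) >> PAGE_SHIFT)

/*
 * Given a guest frame number inside a BAR, return the corresponding
 * host (machine) frame: the host BAR base frame plus the offset of the
 * guest frame from the guest BAR base frame. Ranges to be mapped don't
 * always start at the BAR start, so the offset must be carried over.
 */
uint64_t bar_guest_to_host_frame(uint64_t guest_addr, uint64_t host_addr,
                                 uint64_t gfn)
{
    uint64_t start_gfn = PFN_DOWN(guest_addr);
    uint64_t start_mfn = PFN_DOWN(host_addr);

    return start_mfn + (gfn - start_gfn);
}
```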

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
In v10:
- Moved GFN variable definition outside the loop in map_range()
- Updated printk error message in map_range()
- The BAR address is now always stored in bar->guest_addr, even for the
  HW domain; this removes a bunch of is_hwdom() checks in modify_bars()
- vmsix_table_base() now uses .guest_addr instead of .addr
In v9:
- Extended the commit message
- Use bar->guest_addr in modify_bars
- Extended printk error message in map_range
- Moved map_data initialization so .bar can be initialized during declaration
Since v5:
- remove debug print in map_range callback
- remove "identity" from the debug print
Since v4:
- moved start_{gfn|mfn} calculation into map_range
- pass vpci_bar in the map_data instead of start_{gfn|mfn}
- s/guest_addr/guest_reg
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 53 ++++++++++++++++++++++++++++-----------
 xen/include/xen/vpci.h    |  3 ++-
 2 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 5c056923ad..efce0bc2ae 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -33,6 +33,7 @@
 
 struct map_data {
     struct domain *d;
+    const struct vpci_bar *bar;
     bool map;
 };
 
@@ -40,11 +41,21 @@ static int cf_check map_range(
     unsigned long s, unsigned long e, void *data, unsigned long *c)
 {
     const struct map_data *map = data;
+    /* Start address of the BAR as seen by the guest. */
+    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
+    /* Physical start address of the BAR. */
+    mfn_t start_mfn = _mfn(PFN_DOWN(map->bar->addr));
     int rc;
 
     for ( ; ; )
     {
         unsigned long size = e - s + 1;
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        mfn_t map_mfn = mfn_add(start_mfn, s - start_gfn);
 
         if ( !iomem_access_permitted(map->d, s, e) )
         {
@@ -72,8 +83,8 @@ static int cf_check map_range(
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, map_mfn)
+                      : unmap_mmio_regions(map->d, _gfn(s), size, map_mfn);
         if ( rc == 0 )
         {
             *c += size;
@@ -82,8 +93,9 @@ static int cf_check map_range(
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to %smap [%lx %lx] -> [%lx %lx] for %pd: %d\n",
+                   map->map ? "" : "un", s, e, mfn_x(map_mfn),
+                   mfn_x(map_mfn) + size, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -162,10 +174,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 bool vpci_process_pending(struct vcpu *v)
 {
     struct pci_dev *pdev = v->vpci.pdev;
-    struct map_data data = {
-        .d = v->domain,
-        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-    };
     struct vpci_header *header = NULL;
     unsigned int i;
 
@@ -184,6 +192,11 @@ bool vpci_process_pending(struct vcpu *v)
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = {
+            .d = v->domain,
+            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+            .bar = bar,
+        };
         int rc;
 
         if ( rangeset_is_empty(bar->mem) )
@@ -234,7 +247,6 @@ bool vpci_process_pending(struct vcpu *v)
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
                             uint16_t cmd)
 {
-    struct map_data data = { .d = d, .map = true };
     struct vpci_header *header = &pdev->vpci->header;
     int rc = 0;
     unsigned int i;
@@ -244,6 +256,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = { .d = d, .map = true, .bar = bar };
 
         if ( rangeset_is_empty(bar->mem) )
             continue;
@@ -311,12 +324,16 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * First fill the rangesets with the BAR of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
+     *
+     * For non-hardware domain we use guest physical addresses.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+        unsigned long start_guest = PFN_DOWN(bar->guest_addr);
+        unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
 
         if ( !bar->mem )
             continue;
@@ -336,11 +353,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(bar->mem, start, end);
+        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
-                   start, end, rc);
+                   start_guest, end_guest, rc);
             return rc;
         }
 
@@ -357,7 +374,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             {
                 gprintk(XENLOG_WARNING,
                        "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
-                        &pdev->sbdf, start, end, rc);
+                        &pdev->sbdf, start_guest, end_guest, rc);
                 return rc;
             }
         }
@@ -425,8 +442,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
             {
                 const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
-                unsigned long start = PFN_DOWN(remote_bar->addr);
-                unsigned long end = PFN_DOWN(remote_bar->addr +
+                unsigned long start = PFN_DOWN(remote_bar->guest_addr);
+                unsigned long end = PFN_DOWN(remote_bar->guest_addr +
                                              remote_bar->size - 1);
 
                 if ( !remote_bar->enabled )
@@ -511,6 +528,8 @@ static void cf_check bar_write(
     struct vpci_bar *bar = data;
     bool hi = false;
 
+    ASSERT(is_hardware_domain(pdev->domain));
+
     if ( bar->type == VPCI_BAR_MEM64_HI )
     {
         ASSERT(reg > PCI_BASE_ADDRESS_0);
@@ -541,6 +560,10 @@ static void cf_check bar_write(
      */
     bar->addr &= ~(0xffffffffULL << (hi ? 32 : 0));
     bar->addr |= (uint64_t)val << (hi ? 32 : 0);
+    /*
+     * Update guest address as well, so hardware domain sees BAR identity mapped
+     */
+    bar->guest_addr = bar->addr;
 
     /* Make sure Xen writes back the same value for the BAR RO bits. */
     if ( !hi )
@@ -791,6 +814,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
         }
 
         bars[i].addr = addr;
+        bars[i].guest_addr = addr;
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
@@ -813,6 +837,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
             rom->type = VPCI_BAR_ROM;
             rom->size = size;
             rom->addr = addr;
+            rom->guest_addr = addr;
             header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
                                   PCI_ROM_ADDRESS_ENABLE;
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 18a0eca3da..c5301e284f 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -196,7 +196,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix);
  */
 static inline paddr_t vmsix_table_base(const struct vpci *vpci, unsigned int nr)
 {
-    return vpci->header.bars[vpci->msix->tables[nr] & PCI_MSIX_BIRMASK].addr;
+    return
+        vpci->header.bars[vpci->msix->tables[nr] & PCI_MSIX_BIRMASK].guest_addr;
 }
 
 static inline paddr_t vmsix_table_addr(const struct vpci *vpci, unsigned int nr)
-- 
2.42.0



* [PATCH v10 10/17] vpci/header: handle p2m range sets per BAR
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (8 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 11/17] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-11-20 17:29   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 09/17] rangeset: add rangeset_empty() function Volodymyr Babchuk
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Instead of handling a single rangeset that contains all the memory
regions of all the BARs and the ROM, have one rangeset per BAR. As the
rangesets are now created when a PCI device is added and destroyed when
it is removed, make them named and accounted.

Note that rangesets were chosen here even though each set holds only up
to 3 separate ranges (typically just 1): a rangeset per BAR was chosen
for ease of implementation and for re-using existing code.

This is in preparation for making non-identity mappings in the p2m for
the MMIOs.
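Since the per-BAR rangesets are named, each needs a distinguishable
name derived from the device and BAR index. A sketch of how such a name
could be formatted (the exact format string and helper name here are
illustrative, not necessarily what the patch uses):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a per-BAR rangeset name of the kind a named, accounted
 * rangeset would get: the device's bus/dev.fn plus the BAR index,
 * so each of the up-to-7 rangesets (6 BARs + ROM) is identifiable.
 */
void format_bar_rangeset_name(char *buf, size_t len,
                              unsigned int bus, unsigned int devfn,
                              unsigned int bar_idx)
{
    snprintf(buf, len, "PCI(%02x:%02x.%u)/BAR%u",
             bus, devfn >> 3, devfn & 7, bar_idx);
}
```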

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
In v10:
- Added additional checks to vpci_process_pending()
- vpci_process_pending() now clears rangeset in case of failure
- Fixed locks in vpci_process_pending()
- Fixed coding style issues
- Fixed error handling in init_bars
In v9:
- removed d->vpci.map_pending in favor of checking v->vpci.pdev !=
NULL
- printk -> gprintk
- renamed bar variable to fix shadowing
- fixed bug with iterating on remote device's BARs
- relaxed lock in vpci_process_pending
- removed stale comment
Since v6:
- update according to the new locking scheme
- remove odd fail label in modify_bars
Since v5:
- fix comments
- move rangeset allocation to init_bars and only allocate
  for MAPPABLE BARs
- check for overlap with the already setup BAR ranges
Since v4:
- use named range sets for BARs (Jan)
- changes required by the new locking scheme
- updated commit message (Jan)
Since v3:
- re-work vpci_cancel_pending accordingly to the per-BAR handling
- s/num_mem_ranges/map_pending and s/uint8_t/bool
- ASSERT(bar->mem) in modify_bars
- create and destroy the rangesets on add/remove
---
 xen/drivers/vpci/header.c | 256 ++++++++++++++++++++++++++------------
 xen/drivers/vpci/vpci.c   |   6 +
 xen/include/xen/vpci.h    |   2 +-
 3 files changed, 184 insertions(+), 80 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 40d1a07ada..5c056923ad 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -161,63 +161,106 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-    if ( v->vpci.mem )
+    struct pci_dev *pdev = v->vpci.pdev;
+    struct map_data data = {
+        .d = v->domain,
+        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+    };
+    struct vpci_header *header = NULL;
+    unsigned int i;
+
+    if ( !pdev )
+        return false;
+
+    read_lock(&v->domain->pci_lock);
+
+    if ( !pdev->vpci || (v->domain != pdev->domain) )
     {
-        struct map_data data = {
-            .d = v->domain,
-            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-        };
-        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
+        read_unlock(&v->domain->pci_lock);
+        return false;
+    }
+
+    header = &pdev->vpci->header;
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
+        int rc;
+
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
+
+        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
 
         if ( rc == -ERESTART )
+        {
+            read_unlock(&v->domain->pci_lock);
             return true;
+        }
 
-        write_lock(&v->domain->pci_lock);
-        spin_lock(&v->vpci.pdev->vpci->lock);
-        /* Disable memory decoding unconditionally on failure. */
-        modify_decoding(v->vpci.pdev,
-                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
-                        !rc && v->vpci.rom_only);
-        spin_unlock(&v->vpci.pdev->vpci->lock);
-
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
         if ( rc )
-            /*
-             * FIXME: in case of failure remove the device from the domain.
-             * Note that there might still be leftover mappings. While this is
-             * safe for Dom0, for DomUs the domain will likely need to be
-             * killed in order to avoid leaking stale p2m mappings on
-             * failure.
-             */
-            vpci_deassign_device(v->vpci.pdev);
-        write_unlock(&v->domain->pci_lock);
+        {
+            spin_lock(&pdev->vpci->lock);
+            /* Disable memory decoding unconditionally on failure. */
+            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
+                            false);
+            spin_unlock(&pdev->vpci->lock);
+
+            /* Clean all the rangesets */
+            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+                if ( !rangeset_is_empty(header->bars[i].mem) )
+                     rangeset_empty(header->bars[i].mem);
+
+            v->vpci.pdev = NULL;
+
+            read_unlock(&v->domain->pci_lock);
+
+            if ( !is_hardware_domain(v->domain) )
+                domain_crash(v->domain);
+
+            return false;
+        }
     }
+    v->vpci.pdev = NULL;
+
+    spin_lock(&pdev->vpci->lock);
+    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
+    spin_unlock(&pdev->vpci->lock);
+
+    read_unlock(&v->domain->pci_lock);
 
     return false;
 }
 
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
-                            struct rangeset *mem, uint16_t cmd)
+                            uint16_t cmd)
 {
     struct map_data data = { .d = d, .map = true };
-    int rc;
+    struct vpci_header *header = &pdev->vpci->header;
+    int rc = 0;
+    unsigned int i;
 
     ASSERT(rw_is_write_locked(&d->pci_lock));
 
-    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        /*
-         * It's safe to drop and reacquire the lock in this context
-         * without risking pdev disappearing because devices cannot be
-         * removed until the initial domain has been started.
-         */
-        read_unlock(&d->pci_lock);
-        process_pending_softirqs();
-        read_lock(&d->pci_lock);
-    }
+        struct vpci_bar *bar = &header->bars[i];
+
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
 
-    rangeset_destroy(mem);
+        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
+                                              &data)) == -ERESTART )
+        {
+            /*
+             * It's safe to drop and reacquire the lock in this context
+             * without risking pdev disappearing because devices cannot be
+             * removed until the initial domain has been started.
+             */
+            write_unlock(&d->pci_lock);
+            process_pending_softirqs();
+            write_lock(&d->pci_lock);
+        }
+    }
     if ( !rc )
         modify_decoding(pdev, cmd, false);
 
@@ -225,10 +268,12 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
 }
 
 static void defer_map(struct domain *d, struct pci_dev *pdev,
-                      struct rangeset *mem, uint16_t cmd, bool rom_only)
+                      uint16_t cmd, bool rom_only)
 {
     struct vcpu *curr = current;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     /*
      * FIXME: when deferring the {un}map the state of the device should not
      * be trusted. For example the enable bit is toggled after the device
@@ -236,7 +281,6 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
      * started for the same device if the domain is not well-behaved.
      */
     curr->vpci.pdev = pdev;
-    curr->vpci.mem = mem;
     curr->vpci.cmd = cmd;
     curr->vpci.rom_only = rom_only;
     /*
@@ -250,33 +294,33 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
-    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
     const struct domain *d;
     const struct vpci_msix *msix = pdev->vpci->msix;
-    unsigned int i;
+    unsigned int i, j;
     int rc;
 
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
-    if ( !mem )
-        return -ENOMEM;
-
     /*
-     * Create a rangeset that represents the current device BARs memory region
-     * and compare it against all the currently active BAR memory regions. If
-     * an overlap is found, subtract it from the region to be mapped/unmapped.
+     * Create a rangeset per BAR that represents the current device memory
+     * region and compare it against all the currently active BAR memory
+     * regions. If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
      *
-     * First fill the rangeset with all the BARs of this device or with the ROM
+     * First fill the rangesets with the BAR of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        const struct vpci_bar *bar = &header->bars[i];
+        struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
+        if ( !bar->mem )
+            continue;
+
         if ( !MAPPABLE_BAR(bar) ||
              (rom_only ? bar->type != VPCI_BAR_ROM
                        : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
@@ -292,14 +336,31 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(mem, start, end);
+        rc = rangeset_add_range(bar->mem, start, end);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
                    start, end, rc);
-            rangeset_destroy(mem);
             return rc;
         }
+
+        /* Check for overlap with the already setup BAR ranges. */
+        for ( j = 0; j < i; j++ )
+        {
+            struct vpci_bar *prev_bar = &header->bars[j];
+
+            if ( rangeset_is_empty(prev_bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(prev_bar->mem, start, end);
+            if ( rc )
+            {
+                gprintk(XENLOG_WARNING,
+                       "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
+                        &pdev->sbdf, start, end, rc);
+                return rc;
+            }
+        }
     }
 
     /* Remove any MSIX regions if present. */
@@ -309,14 +370,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
                                      vmsix_table_size(pdev->vpci, i) - 1);
 
-        rc = rangeset_remove_range(mem, start, end);
-        if ( rc )
+        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
         {
-            printk(XENLOG_G_WARNING
-                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
-                   start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            const struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                gprintk(XENLOG_WARNING,
+                       "%pp: failed to remove MSIX table [%lx, %lx]: %d\n",
+                        &pdev->sbdf, start, end, rc);
+                return rc;
+            }
         }
     }
 
@@ -356,27 +424,35 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 
             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
             {
-                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
-                unsigned long start = PFN_DOWN(bar->addr);
-                unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
-
-                if ( !bar->enabled ||
-                     !rangeset_overlaps_range(mem, start, end) ||
-                     /*
-                      * If only the ROM enable bit is toggled check against
-                      * other BARs in the same device for overlaps, but not
-                      * against the same ROM BAR.
-                      */
-                     (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
+                const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
+                unsigned long start = PFN_DOWN(remote_bar->addr);
+                unsigned long end = PFN_DOWN(remote_bar->addr +
+                                             remote_bar->size - 1);
+
+                if ( !remote_bar->enabled )
                     continue;
 
-                rc = rangeset_remove_range(mem, start, end);
-                if ( rc )
+                for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
                 {
-                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
-                           start, end, rc);
-                    rangeset_destroy(mem);
-                    return rc;
+                    const struct vpci_bar *bar = &header->bars[j];
+
+                    if ( !rangeset_overlaps_range(bar->mem, start, end) ||
+                         /*
+                          * If only the ROM enable bit is toggled check against
+                          * other BARs in the same device for overlaps, but not
+                          * against the same ROM BAR.
+                          */
+                         (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
+                        continue;
+
+                    rc = rangeset_remove_range(bar->mem, start, end);
+                    if ( rc )
+                    {
+                        gprintk(XENLOG_WARNING,
+                                "%pp: failed to remove [%lx, %lx]: %d\n",
+                                &pdev->sbdf, start, end, rc);
+                        return rc;
+                    }
                 }
             }
         }
@@ -400,10 +476,10 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
          * will always be to establish mappings and process all the BARs.
          */
         ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
-        return apply_map(pdev->domain, pdev, mem, cmd);
+        return apply_map(pdev->domain, pdev, cmd);
     }
 
-    defer_map(dev->domain, dev, mem, cmd, rom_only);
+    defer_map(dev->domain, dev, cmd, rom_only);
 
     return 0;
 }
@@ -597,6 +673,18 @@ static void cf_check rom_write(
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
+                            unsigned int i)
+{
+    char str[32];
+
+    snprintf(str, sizeof(str), "%pp:BAR%u", &pdev->sbdf, i);
+
+    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
+
+    return !bar->mem ? -ENOMEM : 0;
+}
+
 static int cf_check init_bars(struct pci_dev *pdev)
 {
     uint16_t cmd;
@@ -678,6 +766,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
         else
             bars[i].type = VPCI_BAR_MEM32;
 
+        rc = bar_add_rangeset(pdev, &bars[i], i);
+        if ( rc )
+            goto fail;
+
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
@@ -728,6 +820,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
                                    rom_reg, 4, rom);
             if ( rc )
                 rom->type = VPCI_BAR_EMPTY;
+            else
+            {
+                rc = bar_add_rangeset(pdev, rom, i);
+                if ( rc )
+                    goto fail;
+            }
         }
     }
     else
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index b20bee2b0b..5e34d0092a 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_deassign_device(struct pci_dev *pdev)
 {
+    unsigned int i;
+
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
@@ -63,6 +65,10 @@ void vpci_deassign_device(struct pci_dev *pdev)
             if ( pdev->vpci->msix->table[i] )
                 iounmap(pdev->vpci->msix->table[i]);
     }
+
+    for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
+        rangeset_destroy(pdev->vpci->header.bars[i].mem);
+
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 2028f2151f..18a0eca3da 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -72,6 +72,7 @@ struct vpci {
             /* Guest address. */
             uint64_t guest_addr;
             uint64_t size;
+            struct rangeset *mem;
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -156,7 +157,6 @@ struct vpci {
 
 struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
-    struct rangeset *mem;
     struct pci_dev *pdev;
     uint16_t cmd;
     bool rom_only : 1;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v10 09/17] rangeset: add rangeset_empty() function
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (9 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 10/17] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-10-13 17:54   ` Stewart Hildebrand
  2023-10-12 22:09 ` [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu

This function can be used when the user wants to remove all entries from
a rangeset but does not want to destroy the rangeset itself.
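The distinction between emptying and destroying can be sketched with a toy model. This is illustrative only; the real structures and helpers live in xen/common/rangeset.c and differ in detail (locking, sorted ranges, per-domain lists):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for Xen's struct rangeset: an unsorted singly-linked
 * list of ranges.  Illustrative only. */
struct range {
    unsigned long s, e;
    struct range *next;
};

struct rangeset {
    struct range *head;
};

/* rangeset_empty(): drop every range but keep the container usable. */
void rangeset_empty(struct rangeset *r)
{
    while (r->head) {
        struct range *x = r->head;
        r->head = x->next;
        free(x);
    }
}

/* rangeset_destroy(): empty the set, then free the container itself. */
void rangeset_destroy(struct rangeset *r)
{
    rangeset_empty(r);
    free(r);
}

int rangeset_add(struct rangeset *r, unsigned long s, unsigned long e)
{
    struct range *x = malloc(sizeof(*x));

    if (!x)
        return -1;
    x->s = s;
    x->e = e;
    x->next = r->head;
    r->head = x;
    return 0;
}
```

After rangeset_empty() the set can be refilled, which is exactly what the per-BAR rework needs when a BAR's ranges must be recomputed without reallocating the rangeset.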

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---

Changes in v10:

 - New in v10. The function is used in "vpci/header: handle p2m range sets per BAR"
---
 xen/common/rangeset.c      | 9 +++++++--
 xen/include/xen/rangeset.h | 3 ++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index 35c3420885..420275669e 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -448,8 +448,7 @@ struct rangeset *rangeset_new(
     return r;
 }
 
-void rangeset_destroy(
-    struct rangeset *r)
+void rangeset_empty(struct rangeset *r)
 {
     struct range *x;
 
@@ -465,6 +464,12 @@ void rangeset_destroy(
 
     while ( (x = first_range(r)) != NULL )
         destroy_range(r, x);
+}
+
+void rangeset_destroy(
+    struct rangeset *r)
+{
+    rangeset_empty(r);
 
     xfree(r);
 }
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index f7c69394d6..5eded7ffc5 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -56,7 +56,7 @@ void rangeset_limit(
 bool_t __must_check rangeset_is_empty(
     const struct rangeset *r);
 
-/* Add/claim/remove/query a numeric range. */
+/* Add/claim/remove/query/empty a numeric range. */
 int __must_check rangeset_add_range(
     struct rangeset *r, unsigned long s, unsigned long e);
 int __must_check rangeset_claim_range(struct rangeset *r, unsigned long size,
@@ -70,6 +70,7 @@ bool_t __must_check rangeset_overlaps_range(
 int rangeset_report_ranges(
     struct rangeset *r, unsigned long s, unsigned long e,
     int (*cb)(unsigned long s, unsigned long e, void *), void *ctxt);
+void rangeset_empty(struct rangeset *r);
 
 /*
  * Note that the consume function can return an error value apart from
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (10 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 09/17] rangeset: add rangeset_empty() function Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-10-13 21:53   ` Volodymyr Babchuk
  2023-11-21 14:17   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné,
	Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
to remain unaltered. The PCI_COMMAND_SERR bit is a good example: while the
guest's view of this will want to be zero initially, the host having set
it to 1 means it may not easily be overwritten with 0, or else we'd
effectively be giving the guest control of the bit. Thus, the PCI_COMMAND
register needs proper emulation in order to honor the host's settings.

According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
Device Control" the reset state of the command register is typically 0,
so when assigning a PCI device use 0 as the initial state for the guest's view
of the command register.

Here is the full list of command register bits with notes about
emulation, along with QEMU behavior in the same situation:

PCI_COMMAND_IO - QEMU does not allow a guest to change the value of this
bit in a real device. Instead it is always set to 1. A guest can write to
this bit, but writes are ignored.

PCI_COMMAND_MEMORY - QEMU behaves exactly as with PCI_COMMAND_IO. In the
Xen case, we handle writes to this bit by mapping/unmapping BAR
regions. For devices assigned to DomUs, memory decoding will be
disabled at initialization.

PCI_COMMAND_MASTER - Allow guest to control it. QEMU passes through
writes to this bit.

PCI_COMMAND_SPECIAL - A guest can generate special cycles only if it has
access to a host bridge that supports software generation of special
cycles. In our case the guest has no access to host bridges at all. The
value after reset is 0. QEMU passes through writes of this bit, so we
will do the same.

PCI_COMMAND_INVALIDATE - Allows "Memory Write and Invalidate" commands
to be generated. It requires additional configuration via Cacheline
Size register. We are not emulating this register right now and we
can't expect the guest to properly configure it. QEMU "emulates" access to
the Cacheline Size register by ignoring all writes to it. QEMU passes
through writes of the PCI_COMMAND_INVALIDATE bit, so we will do the same.

PCI_COMMAND_VGA_PALETTE - Enable VGA palette snooping. QEMU passes
through writes of this bit, we will do the same.

PCI_COMMAND_PARITY - Controls how the device responds to parity
errors. QEMU ignores writes to this bit, so we will do the same.

PCI_COMMAND_WAIT - Reserved. Should be 0, but QEMU passes
through writes of this bit, so we will do the same.

PCI_COMMAND_SERR - Controls whether the device can assert SERR. QEMU
ignores writes to this bit, so we will do the same.

PCI_COMMAND_FAST_BACK - Optional bit that allows fast back-to-back
transactions. It is configured by firmware, so we don't want guest to
control it. QEMU ignores writes to this bit, we will do the same.

PCI_COMMAND_INTX_DISABLE - Disables INTx signals. If MSI(X) is
enabled, the device is prohibited from asserting INTx as per the
specification. The value after reset is 0. QEMU checks whether INTx was
mapped for a device; if it was not, the guest can't control the
PCI_COMMAND_INTX_DISABLE bit. In our case, we prohibit the guest from
changing the value of this bit if MSI(X) is enabled.
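The pass-through versus protected-bit split described above can be sketched as follows. This is a minimal model of the masking logic, not the actual cmd_write() implementation (which also stores the guest's view and defers BAR mapping); the bit values are taken from the PCI 3.0 specification:

```c
#include <assert.h>
#include <stdint.h>

/* PCI_COMMAND bits referenced below (values per the PCI 3.0 spec). */
#define PCI_COMMAND_PARITY        0x0040
#define PCI_COMMAND_SERR          0x0100
#define PCI_COMMAND_FAST_BACK     0x0200
#define PCI_COMMAND_INTX_DISABLE  0x0400

/*
 * Sketch of the DomU write path: bits in 'excluded' keep their host
 * value; everything else passes through from the guest.  When MSI or
 * MSI-X is enabled, INTX_DISABLE is folded into the protected set so
 * the guest cannot re-enable INTx.
 */
uint16_t emulate_cmd_write(uint16_t host_cmd, uint16_t guest_val,
                           int msi_enabled)
{
    uint16_t excluded = PCI_COMMAND_PARITY | PCI_COMMAND_SERR |
                        PCI_COMMAND_FAST_BACK;

    if (msi_enabled)
        excluded |= PCI_COMMAND_INTX_DISABLE;

    return (guest_val & ~excluded) | (host_cmd & excluded);
}
```

The returned value is what would actually be written to the hardware register, while the guest's unmodified value is what a subsequent read handler would report back.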

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
In v10:
- Added cf_check attribute to guest_cmd_read
- Removed warning about non-zero cmd
- Updated comment MSI code regarding disabling INTX
- Used ternary operator in vpci_add_register() call
- Disable memory decoding for DomUs in init_bars()
In v9:
- Reworked guest_cmd_read
- Added handling for more bits
Since v6:
- fold guest's logic into cmd_write
- implement cmd_read, so we can report emulated INTx state to guests
- introduce header->guest_cmd to hold the emulated state of the
  PCI_COMMAND register for guests
Since v5:
- add additional check for MSI-X enabled while altering INTX bit
- make sure INTx disabled while guests enable MSI/MSI-X
Since v3:
- gate more code on CONFIG_HAS_MSI
- removed logic for the case when MSI/MSI-X not enabled
---
 xen/drivers/vpci/header.c | 44 +++++++++++++++++++++++++++++++++++----
 xen/drivers/vpci/msi.c    |  6 ++++++
 xen/drivers/vpci/msix.c   |  4 ++++
 xen/include/xen/vpci.h    |  3 +++
 4 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index efce0bc2ae..e8eed6a674 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -501,14 +501,32 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     return 0;
 }
 
+/* TODO: Add proper emulation for all bits of the command register. */
 static void cf_check cmd_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
 {
     struct vpci_header *header = data;
 
+    if ( !is_hardware_domain(pdev->domain) )
+    {
+        const struct vpci *vpci = pdev->vpci;
+        uint16_t excluded = PCI_COMMAND_PARITY | PCI_COMMAND_SERR |
+            PCI_COMMAND_FAST_BACK;
+
+        header->guest_cmd = cmd;
+
+        if ( (vpci->msi && vpci->msi->enabled) ||
+             (vpci->msix && vpci->msix->enabled) )
+            excluded |= PCI_COMMAND_INTX_DISABLE;
+
+        cmd &= ~excluded;
+        cmd |= pci_conf_read16(pdev->sbdf, reg) & excluded;
+    }
+
     /*
-     * Let Dom0 play with all the bits directly except for the memory
-     * decoding one.
+     * Let guest play with all the bits directly except for the memory
+     * decoding one. Bits that are not allowed for DomU are already
+     * handled above.
      */
     if ( header->bars_mapped != !!(cmd & PCI_COMMAND_MEMORY) )
         /*
@@ -522,6 +540,14 @@ static void cf_check cmd_write(
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
+static uint32_t cf_check guest_cmd_read(
+    const struct pci_dev *pdev, unsigned int reg, void *data)
+{
+    const struct vpci_header *header = data;
+
+    return header->guest_cmd;
+}
+
 static void cf_check bar_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -737,8 +763,9 @@ static int cf_check init_bars(struct pci_dev *pdev)
     }
 
     /* Setup a handler for the command register. */
-    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
-                           2, header);
+    rc = vpci_add_register(pdev->vpci,
+                           is_hwdom ? vpci_hw_read16 : guest_cmd_read,
+                           cmd_write, PCI_COMMAND, 2, header);
     if ( rc )
         return rc;
 
@@ -750,6 +777,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
     if ( cmd & PCI_COMMAND_MEMORY )
         pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
 
+    /*
+     * Clear PCI_COMMAND_MEMORY for DomUs, so that they always start with
+     * memory decoding disabled, and to ensure that modify_bars() is not
+     * called at the end of this function.
+     */
+    if ( !is_hwdom )
+        cmd &= ~PCI_COMMAND_MEMORY;
+    header->guest_cmd = cmd;
+
     for ( i = 0; i < num_bars; i++ )
     {
         uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 2faa54b7ce..0920bd071f 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -70,6 +70,12 @@ static void cf_check control_write(
 
         if ( vpci_msi_arch_enable(msi, pdev, vectors) )
             return;
+
+        /* Make sure guest doesn't enable INTx while enabling MSI. */
+        if ( !is_hardware_domain(pdev->domain) )
+            pci_intx(pdev, false);
     }
     else
         vpci_msi_arch_disable(msi, pdev);
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index b6abab47ef..9d0233d0e3 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -97,6 +97,10 @@ static void cf_check control_write(
         for ( i = 0; i < msix->max_entries; i++ )
             if ( !msix->entries[i].masked && msix->entries[i].updated )
                 update_entry(&msix->entries[i], pdev, i);
+
+        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
+        if ( !is_hardware_domain(pdev->domain) )
+            pci_intx(pdev, false);
     }
     else if ( !new_enabled && msix->enabled )
     {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index c5301e284f..60bdc10c13 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -87,6 +87,9 @@ struct vpci {
         } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
         /* At most 6 BARS + 1 expansion ROM BAR. */
 
+        /* Guest view of the PCI_COMMAND register. */
+        uint16_t guest_cmd;
+
         /*
          * Store whether the ROM enable bit is set (doesn't imply ROM BAR
          * is mapped into guest p2m) if there's a ROM BAR on the device.
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v10 14/17] xen/arm: translate virtual PCI bus topology for guests
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (12 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-11-21 15:11   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 16/17] xen/arm: vpci: permit access to guest vpci space Volodymyr Babchuk
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Stefano Stabellini,
	Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are three originators of PCI configuration space accesses:
1. The domain that owns the physical host bridge: MMIO handlers are
in place so we can update vPCI register handlers with the values
written by the hardware domain, e.g. the physical view of the registers
vs the guest's view of the configuration space.
2. Guest access to the passed through PCI devices: we need to properly
map the virtual bus topology to the physical one, e.g. pass the
configuration space access to the corresponding physical devices.
3. Emulated host PCI bridge access. It doesn't exist in the physical
topology, i.e. it can't be mapped to some physical host bridge.
So, all accesses to the host bridge itself need to be trapped and
emulated.
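For case 2, the ECAM address decode itself is mechanical: each function owns a 4 KiB configuration window, so bus/device/function come straight from bits 27..12 of the offset into the region, and the register offset from bits 11..0. A minimal sketch (the base address and names are illustrative, not Xen's actual GUEST_VPCI_ECAM_BASE or VPCI_ECAM_BDF definitions):

```c
#include <assert.h>
#include <stdint.h>

/* bus:8 | device:5 | function:3 packed into 16 bits, as in a BDF. */
#define ECAM_BDF(off)   ((uint16_t)(((off) >> 12) & 0xffff))
#define ECAM_REG(addr)  ((addr) & 0xfffU)

/*
 * Sketch of the first step of vpci_sbdf_from_gpa(): turn a guest
 * physical address inside the ECAM window into a (virtual) BDF, which
 * the translation pass then maps to the physical SBDF.
 */
uint16_t bdf_from_gpa(uint64_t gpa, uint64_t ecam_base)
{
    return ECAM_BDF(gpa - ecam_base);
}
```

With the virtual BDF in hand, vpci_translate_virtual_device() only has to scan the domain's device list for a matching guest_sbdf and substitute the physical one.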

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v9:
- Commend about required lock replaced with ASSERT()
- Style fixes
- call to vpci_translate_virtual_device folded into vpci_sbdf_from_gpa
Since v8:
- locks moved out of vpci_translate_virtual_device()
Since v6:
- add pcidevs locking to vpci_translate_virtual_device
- update wrt to the new locking scheme
Since v5:
- add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
  case to simplify ifdefery
- add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
- reset output register on failed virtual SBDF translation
Since v4:
- indentation fixes
- constify struct domain
- updated commit message
- updates to the new locking scheme (pdev->vpci_lock)
Since v3:
- revisit locking
- move code to vpci.c
Since v2:
 - pass struct domain instead of struct vcpu
 - constify arguments where possible
 - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/arch/arm/vpci.c     | 51 ++++++++++++++++++++++++++++++++---------
 xen/drivers/vpci/vpci.c | 25 +++++++++++++++++++-
 xen/include/xen/vpci.h  | 10 ++++++++
 3 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 3bc4bb5508..58e2a20135 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -7,31 +7,55 @@
 
 #include <asm/mmio.h>
 
-static pci_sbdf_t vpci_sbdf_from_gpa(const struct pci_host_bridge *bridge,
-                                     paddr_t gpa)
+static bool vpci_sbdf_from_gpa(struct domain *d,
+                               const struct pci_host_bridge *bridge,
+                               paddr_t gpa, pci_sbdf_t *sbdf)
 {
-    pci_sbdf_t sbdf;
+    ASSERT(sbdf);
 
     if ( bridge )
     {
-        sbdf.sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
-        sbdf.seg = bridge->segment;
-        sbdf.bus += bridge->cfg->busn_start;
+        sbdf->sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
+        sbdf->seg = bridge->segment;
+        sbdf->bus += bridge->cfg->busn_start;
     }
     else
-        sbdf.sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
-
-    return sbdf;
+    {
+        bool translated;
+
+        /*
+         * For the passed through devices we need to map their virtual SBDF
+         * to the physical PCI device being passed through.
+         */
+        sbdf->sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
+        read_lock(&d->pci_lock);
+        translated = vpci_translate_virtual_device(d, sbdf);
+        read_unlock(&d->pci_lock);
+
+        if ( !translated )
+            return false;
+    }
+
+    return true;
 }
 
 static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
                           register_t *r, void *p)
 {
     struct pci_host_bridge *bridge = p;
-    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+    pci_sbdf_t sbdf;
     /* data is needed to prevent a pointer cast on 32bit */
     unsigned long data;
 
+    ASSERT(!bridge == !is_hardware_domain(v->domain));
+
+    if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
+    {
+        *r = ~0ul;
+        return 1;
+    }
+
     if ( vpci_ecam_read(sbdf, ECAM_REG_OFFSET(info->gpa),
                         1U << info->dabt.size, &data) )
     {
@@ -48,7 +72,12 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
                            register_t r, void *p)
 {
     struct pci_host_bridge *bridge = p;
-    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+    pci_sbdf_t sbdf;
+
+    ASSERT(!bridge == !is_hardware_domain(v->domain));
+
+    if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
+        return 1;
 
     return vpci_ecam_write(sbdf, ECAM_REG_OFFSET(info->gpa),
                            1U << info->dabt.size, r);
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 7c46a2d3f4..0dee5118d6 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -80,6 +80,30 @@ static int add_virtual_device(struct pci_dev *pdev)
     return 0;
 }
 
+/*
+ * Find the physical device which is mapped to the virtual device
+ * and translate virtual SBDF to the physical one.
+ */
+bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf)
+{
+    const struct pci_dev *pdev;
+
+    ASSERT(!is_hardware_domain(d));
+    ASSERT(rw_is_locked(&d->pci_lock));
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev->vpci && (pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf) )
+        {
+            /* Replace guest SBDF with the physical one. */
+            *sbdf = pdev->sbdf;
+            return true;
+        }
+    }
+
+    return false;
+}
+
 #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
 
 void vpci_deassign_device(struct pci_dev *pdev)
@@ -175,7 +199,6 @@ int vpci_assign_device(struct pci_dev *pdev)
 
     return rc;
 }
-
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 4a53936447..e9269b37ac 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -282,6 +282,16 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
 }
 #endif
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf);
+#else
+static inline bool vpci_translate_virtual_device(const struct domain *d,
+                                                 pci_sbdf_t *sbdf)
+{
+    return false;
+}
+#endif
+
 #endif
 
 /*
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (11 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-11-16 16:06   ` Julien Grall
  2023-11-21 14:40   ` Roger Pau Monné
  2023-10-12 22:09 ` [PATCH v10 14/17] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
                   ` (3 subsequent siblings)
  16 siblings, 2 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Assign an SBDF to the PCI devices being passed through with bus 0.
In the resulting topology, PCIe devices reside on bus 0 of the root
complex itself (embedded endpoints).
This implementation is limited to 32 devices, which is the maximum
allowed on a single PCI bus.

Please note that at the moment only function 0 of a multifunction
device can be passed through.
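The slot assignment can be sketched as follows. This is a simplified model of the allocator (the real code uses DECLARE_BITMAP for vpci_dev_assigned_map plus find-first-zero helpers under the domain's pci_lock; the names below are illustrative):

```c
#include <stdint.h>

#define VPCI_MAX_VIRT_DEV 32  /* 32 device slots on the emulated bus 0 */

static uint32_t assigned_map;  /* bit n set => slot n in use */

/* Returns the allocated virtual BDF (bus 0, function 0), or -1 if the
 * emulated bus is full. */
int alloc_virtual_bdf(void)
{
    unsigned int slot;

    for (slot = 0; slot < VPCI_MAX_VIRT_DEV; slot++)
        if (!(assigned_map & (1u << slot))) {
            assigned_map |= 1u << slot;
            return slot << 3;  /* BDF: bus 0, device = slot, function 0 */
        }
    return -1;
}

/* Roll back on assignment failure or device removal. */
void free_virtual_bdf(int bdf)
{
    assigned_map &= ~(1u << ((bdf >> 3) & 0x1f));
}
```

Because function bits are always zero, only function 0 devices fit this scheme, matching the limitation stated above.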

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
In v10:
- Removed ASSERT(pcidevs_locked())
- Removed redundant code (local sbdf variable, clearing sbdf during
device removal, etc)
- Added __maybe_unused attribute to "out:" label
- Introduced HAS_VPCI_GUEST_SUPPORT Kconfig option, as this is the
  first patch where it is used (previously was in "vpci: add hooks for
  PCI device assign/de-assign")
In v9:
- Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)
In v8:
- Added write lock in add_virtual_device
Since v6:
- re-work wrt new locking scheme
- OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
Since v5:
- s/vpci_add_virtual_device/add_virtual_device and make it static
- call add_virtual_device from vpci_assign_device and do not use
  REGISTER_VPCI_INIT machinery
- add pcidevs_locked ASSERT
- use DECLARE_BITMAP for vpci_dev_assigned_map
Since v4:
- moved and re-worked guest sbdf initializers
- s/set_bit/__set_bit
- s/clear_bit/__clear_bit
- minor comment fix s/Virtual/Guest/
- added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
  later for counting the number of MMIO handlers required for a guest
  (Julien)
Since v3:
 - make use of VPCI_INIT
 - moved all new code to vpci.c which belongs to it
 - changed open-coded 31 to PCI_SLOT(~0)
 - added comments and code to reject multifunction devices with
   functions other than 0
 - updated comment about vpci_dev_next and made it unsigned int
 - implement roll back in case of error while assigning/deassigning devices
 - s/dom%pd/%pd
Since v2:
 - remove casts that are (a) malformed and (b) unnecessary
 - add new line for better readability
 - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
    functions are now completely gated with this config
 - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/drivers/Kconfig     |  4 +++
 xen/drivers/vpci/vpci.c | 63 +++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h |  8 ++++++
 xen/include/xen/vpci.h  | 11 +++++++
 4 files changed, 86 insertions(+)

diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index db94393f47..780490cf8e 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
 config HAS_VPCI
 	bool
 
+config HAS_VPCI_GUEST_SUPPORT
+	bool
+	depends on HAS_VPCI
+
 endmenu
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 5e34d0092a..7c46a2d3f4 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -36,6 +36,52 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+static int add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    unsigned long new_dev_number;
+
+    if ( is_hardware_domain(d) )
+        return 0;
+
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
+    /*
+     * Each PCI bus supports 32 devices/slots at max or up to 256 when
+     * there are multi-function ones which are not yet supported.
+     */
+    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
+                                         VPCI_MAX_VIRT_DEV);
+    if ( new_dev_number == VPCI_MAX_VIRT_DEV )
+    {
+        write_unlock(&pdev->domain->pci_lock);
+        return -ENOSPC;
+    }
+
+    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, e.g. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0);
+
+    return 0;
+}
+
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
 void vpci_deassign_device(struct pci_dev *pdev)
 {
     unsigned int i;
@@ -46,6 +92,13 @@ void vpci_deassign_device(struct pci_dev *pdev)
         return;
 
     spin_lock(&pdev->vpci->lock);
+
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
+        __clear_bit(pdev->vpci->guest_sbdf.dev,
+                    &pdev->domain->vpci_dev_assigned_map);
+#endif
+
     while ( !list_empty(&pdev->vpci->handlers) )
     {
         struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
@@ -101,6 +154,13 @@ int vpci_assign_device(struct pci_dev *pdev)
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+    rc = add_virtual_device(pdev);
+    if ( rc )
+        goto out;
+#endif
+
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
         rc = __start_vpci_array[i](pdev);
@@ -108,11 +168,14 @@ int vpci_assign_device(struct pci_dev *pdev)
             break;
     }
 
+ out:
+    __maybe_unused;
     if ( rc )
         vpci_deassign_device(pdev);
 
     return rc;
 }
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 57391e74b6..84e608f670 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,6 +462,14 @@ struct domain
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
     rwlock_t pci_lock;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * The bitmap which shows which device numbers are already used by the
+     * virtual PCI bus topology and is used to assign a unique SBDF to the
+     * next passed through virtual PCI device.
+     */
+    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 60bdc10c13..4a53936447 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
 
+/*
+ * Maximum number of devices supported by the virtual bus topology:
+ * each PCI bus supports 32 devices/slots at max or up to 256 when
+ * there are multi-function ones which are not yet supported.
+ */
+#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
+
 #define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
@@ -155,6 +162,10 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Guest SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };
 
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v10 16/17] xen/arm: vpci: permit access to guest vpci space
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (13 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 14/17] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-10-16 11:00   ` Jan Beulich
  2023-10-12 22:09 ` [PATCH v10 15/17] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
  2023-10-12 22:09 ` [PATCH v10 17/17] arm/vpci: honor access size when returning an error Volodymyr Babchuk
  16 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Wei Liu

From: Stewart Hildebrand <stewart.hildebrand@amd.com>

Move iomem_caps initialization earlier (before arch_domain_create()).

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Changes in v10:
* fix off-by-one
* also permit access to GUEST_VPCI_PREFETCH_MEM_ADDR

Changes in v9:
* new patch

This is sort of a follow-up to:

  baa6ea700386 ("vpci: add permission checks to map_range()")

I don't believe we need a fixes tag since this depends on the vPCI p2m BAR
patches.
---
 xen/arch/arm/vpci.c | 9 +++++++++
 xen/common/domain.c | 4 +++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 01b50d435e..3521d5bc2f 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -2,6 +2,7 @@
 /*
  * xen/arch/arm/vpci.c
  */
+#include <xen/iocap.h>
 #include <xen/sched.h>
 #include <xen/vpci.h>
 
@@ -119,8 +120,16 @@ int domain_vpci_init(struct domain *d)
             return ret;
     }
     else
+    {
         register_mmio_handler(d, &vpci_mmio_handler,
                               GUEST_VPCI_ECAM_BASE, GUEST_VPCI_ECAM_SIZE, NULL);
+        iomem_permit_access(d, paddr_to_pfn(GUEST_VPCI_MEM_ADDR),
+                            paddr_to_pfn(GUEST_VPCI_MEM_ADDR +
+                                         GUEST_VPCI_MEM_SIZE - 1));
+        iomem_permit_access(d, paddr_to_pfn(GUEST_VPCI_PREFETCH_MEM_ADDR),
+                            paddr_to_pfn(GUEST_VPCI_PREFETCH_MEM_ADDR +
+                                         GUEST_VPCI_PREFETCH_MEM_SIZE - 1));
+    }
 
     return 0;
 }
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 785c69e48b..bf63fab29b 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -695,6 +695,9 @@ struct domain *domain_create(domid_t domid,
         radix_tree_init(&d->pirq_tree);
     }
 
+    if ( !is_idle_domain(d) )
+        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
+
     if ( (err = arch_domain_create(d, config, flags)) != 0 )
         goto fail;
     init_status |= INIT_arch;
@@ -704,7 +707,6 @@ struct domain *domain_create(domid_t domid,
         watchdog_domain_init(d);
         init_status |= INIT_watchdog;
 
-        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
         d->irq_caps   = rangeset_new(d, "Interrupts", 0);
         if ( !d->iomem_caps || !d->irq_caps )
             goto fail;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v10 17/17] arm/vpci: honor access size when returning an error
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (15 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 15/17] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  16 siblings, 0 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Volodymyr Babchuk, Stefano Stabellini,
	Julien Grall, Bertrand Marquis, Volodymyr Babchuk

A guest can try to read the config space using different access sizes: 8,
16, 32, or 64 bits. We need to take this into account when returning an
error to the MMIO handler; otherwise it is possible to provide more data
than was requested, e.g. the guest issues an LDRB instruction to read one
byte, but we write 0xFFFFFFFFFFFFFFFF to the target
register.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
 xen/arch/arm/vpci.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 3521d5bc2f..f1e434a5db 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -46,6 +46,8 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
 {
     struct pci_host_bridge *bridge = p;
     pci_sbdf_t sbdf;
+    const uint8_t access_size = (1 << info->dabt.size) * 8;
+    const uint64_t access_mask = GENMASK_ULL(access_size - 1, 0);
     /* data is needed to prevent a pointer cast on 32bit */
     unsigned long data;
 
@@ -53,7 +55,7 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
 
     if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
     {
-        *r = ~0ul;
+        *r = access_mask;
         return 1;
     }
 
@@ -64,7 +66,7 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
         return 1;
     }
 
-    *r = ~0ul;
+    *r = access_mask;
 
     return 0;
 }
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v10 15/17] xen/arm: account IO handlers for emulated PCI MSI-X
  2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (14 preceding siblings ...)
  2023-10-12 22:09 ` [PATCH v10 16/17] xen/arm: vpci: permit access to guest vpci space Volodymyr Babchuk
@ 2023-10-12 22:09 ` Volodymyr Babchuk
  2023-10-13  8:34   ` Julien Grall
  2023-10-12 22:09 ` [PATCH v10 17/17] arm/vpci: honor access size when returning an error Volodymyr Babchuk
  16 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-12 22:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Bertrand Marquis,
	Volodymyr Babchuk, Julien Grall, Julien Grall,
	Stefano Stabellini

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

At the moment, we always allocate an extra 16 slots for IO handlers
(see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
MSI-X registers we need to explicitly tell that we have additional IO
handlers, so those are accounted.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Acked-by: Julien Grall <jgrall@amazon.com>
---
Cc: Julien Grall <julien@xen.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
---
This actually moved here from the part 2 of the prep work for PCI
passthrough on Arm as it seems to be the proper place for it.

Since v5:
- optimize with IS_ENABLED(CONFIG_HAS_PCI_MSI) since VPCI_MAX_VIRT_DEV is
  defined unconditionally
New in v5
---
 xen/arch/arm/vpci.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 58e2a20135..01b50d435e 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -134,6 +134,8 @@ static int vpci_get_num_handlers_cb(struct domain *d,
 
 unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
 {
+    unsigned int count;
+
     if ( !has_vpci(d) )
         return 0;
 
@@ -154,7 +156,17 @@ unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
      * For guests each host bridge requires one region to cover the
      * configuration space. At the moment, we only expose a single host bridge.
      */
-    return 1;
+    count = 1;
+
+    /*
+     * There's a single MSI-X MMIO handler that deals with both PBA
+     * and MSI-X tables per each PCI device being passed through.
+     * Maximum number of emulated virtual devices is VPCI_MAX_VIRT_DEV.
+     */
+    if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
+        count += VPCI_MAX_VIRT_DEV;
+
+    return count;
 }
 
 /*
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 15/17] xen/arm: account IO handlers for emulated PCI MSI-X
  2023-10-12 22:09 ` [PATCH v10 15/17] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
@ 2023-10-13  8:34   ` Julien Grall
  2023-10-13 13:06     ` Volodymyr Babchuk
  0 siblings, 1 reply; 65+ messages in thread
From: Julien Grall @ 2023-10-13  8:34 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Bertrand Marquis,
	Julien Grall, Stefano Stabellini

Hi Volodymyr,

On 12/10/2023 23:09, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> At the moment, we always allocate an extra 16 slots for IO handlers
> (see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
> MSI-X registers we need to explicitly tell that we have additional IO
> handlers, so those are accounted.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Some process remark. All the patches you send (even if they are 
unmodified) should also contain your signed-off-by. This is to comply 
with point (b) in the DCO certificate:

https://cert-manager.io/docs/contributing/sign-off/

Please check the other patches in this series.

> Acked-by: Julien Grall <jgrall@amazon.com>

Is this patch depends on the rest of the series? If not we can merge it 
in the for-4.19 branch Stefano created. This will reduce the number of 
patches you need to resend.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 15/17] xen/arm: account IO handlers for emulated PCI MSI-X
  2023-10-13  8:34   ` Julien Grall
@ 2023-10-13 13:06     ` Volodymyr Babchuk
  2023-10-13 16:46       ` Julien Grall
  0 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-13 13:06 UTC (permalink / raw)
  To: Julien Grall
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Bertrand Marquis, Julien Grall, Stefano Stabellini


Hi Julien,

Julien Grall <julien@xen.org> writes:

> Hi Volodymyr,
>
> On 12/10/2023 23:09, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> At the moment, we always allocate an extra 16 slots for IO handlers
>> (see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
>> MSI-X registers we need to explicitly tell that we have additional IO
>> handlers, so those are accounted.
>> Signed-off-by: Oleksandr Andrushchenko
>> <oleksandr_andrushchenko@epam.com>
>
> Some process remark. All the patches you send (even if they are
> unmodified) should also contain your signed-off-by. This is to comply
> with point (b) in the DCO certificate:

Oh, sorry. I assumed that it is enough to have signed-off-by tag of the
original author. I'll add my tags in the next version.

> https://cert-manager.io/docs/contributing/sign-off/
>
> Please check the other patches in this series.
>
>> Acked-by: Julien Grall <jgrall@amazon.com>
>
> Is this patch depends on the rest of the series? If not we can merge
> it in the for-4.19 branch Stefano created. This will reduce the number
> of patches you need to resend.

It uses VPCI_MAX_VIRT_DEV constant which was introduced in ("vpci: add
initial support for virtual PCI bus topology").

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 15/17] xen/arm: account IO handlers for emulated PCI MSI-X
  2023-10-13 13:06     ` Volodymyr Babchuk
@ 2023-10-13 16:46       ` Julien Grall
  2023-10-13 17:17         ` Volodymyr Babchuk
  0 siblings, 1 reply; 65+ messages in thread
From: Julien Grall @ 2023-10-13 16:46 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Bertrand Marquis, Julien Grall, Stefano Stabellini

Hi,

On 13/10/2023 14:06, Volodymyr Babchuk wrote:
> 
> Hi Julien,
> 
> Julien Grall <julien@xen.org> writes:
> 
>> Hi Volodymyr,
>>
>> On 12/10/2023 23:09, Volodymyr Babchuk wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>> At the moment, we always allocate an extra 16 slots for IO handlers
>>> (see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
>>> MSI-X registers we need to explicitly tell that we have additional IO
>>> handlers, so those are accounted.
>>> Signed-off-by: Oleksandr Andrushchenko
>>> <oleksandr_andrushchenko@epam.com>
>>
>> Some process remark. All the patches you send (even if they are
>> unmodified) should also contain your signed-off-by. This is to comply
>> with point (b) in the DCO certificate:
> 
> Oh, sorry. I assumed that it is enough to have signed-off-by tag of the
> original author. I'll add my tags in the next version.

Thanks!

> 
>> https://cert-manager.io/docs/contributing/sign-off/
>>
>> Please check the other patches in this series.
>>
>>> Acked-by: Julien Grall <jgrall@amazon.com>
>>
>> Is this patch depends on the rest of the series? If not we can merge
>> it in the for-4.19 branch Stefano created. This will reduce the number
>> of patches you need to resend.
> 
> It uses VPCI_MAX_VIRT_DEV constant which was introduced in ("vpci: add
> initial support for virtual PCI bus topology").

Ok. I will wait before committing. Please let me know if there are any 
Arm patches that can be already committed (or could potentially be 
reviewed independently).

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 15/17] xen/arm: account IO handlers for emulated PCI MSI-X
  2023-10-13 16:46       ` Julien Grall
@ 2023-10-13 17:17         ` Volodymyr Babchuk
  0 siblings, 0 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-13 17:17 UTC (permalink / raw)
  To: Julien Grall
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Bertrand Marquis, Julien Grall, Stefano Stabellini


Julien,

Julien Grall <julien@xen.org> writes:

> Hi,
>
> On 13/10/2023 14:06, Volodymyr Babchuk wrote:
>> Hi Julien,
>> Julien Grall <julien@xen.org> writes:
>> 
>>> Hi Volodymyr,
>>>
>>> On 12/10/2023 23:09, Volodymyr Babchuk wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>> At the moment, we always allocate an extra 16 slots for IO handlers
>>>> (see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
>>>> MSI-X registers we need to explicitly tell that we have additional IO
>>>> handlers, so those are accounted.
>>>> Signed-off-by: Oleksandr Andrushchenko
>>>> <oleksandr_andrushchenko@epam.com>
>>>
>>> Some process remark. All the patches you send (even if they are
>>> unmodified) should also contain your signed-off-by. This is to comply
>>> with point (b) in the DCO certificate:
>> Oh, sorry. I assumed that it is enough to have signed-off-by tag of
>> the
>> original author. I'll add my tags in the next version.
>
> Thanks!
>
>> 
>>> https://cert-manager.io/docs/contributing/sign-off/
>>>
>>> Please check the other patches in this series.
>>>
>>>> Acked-by: Julien Grall <jgrall@amazon.com>
>>>
>>> Is this patch depends on the rest of the series? If not we can merge
>>> it in the for-4.19 branch Stefano created. This will reduce the number
>>> of patches you need to resend.
>> It uses VPCI_MAX_VIRT_DEV constant which was introduced in ("vpci:
>> add
>> initial support for virtual PCI bus topology").
>
> Ok. I will wait before committing. Please let me know if there are any
> Arm patches that can be already committed (or could potentially be
> reviewed independently).

Well, there is ("arm/vpci: honor access size when returning an
error"), which can be partially applied to the current staging branch.

Maybe it is a good idea to publish it separately.

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 09/17] rangeset: add rangeset_empty() function
  2023-10-12 22:09 ` [PATCH v10 09/17] rangeset: add rangeset_empty() function Volodymyr Babchuk
@ 2023-10-13 17:54   ` Stewart Hildebrand
  2023-10-13 18:08     ` Volodymyr Babchuk
  0 siblings, 1 reply; 65+ messages in thread
From: Stewart Hildebrand @ 2023-10-13 17:54 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini, Wei Liu

On 10/12/23 18:09, Volodymyr Babchuk wrote:
> This function can be used when a user wants to remove all rangeset
> entries but does not want to destroy the rangeset itself.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> ---
> 
> Changes in v10:
> 
>  - New in v10. The function is used in "vpci/header: handle p2m range sets per BAR"
> ---
>  xen/common/rangeset.c      | 9 +++++++--
>  xen/include/xen/rangeset.h | 3 ++-
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
> index 35c3420885..420275669e 100644
> --- a/xen/common/rangeset.c
> +++ b/xen/common/rangeset.c
> @@ -448,8 +448,7 @@ struct rangeset *rangeset_new(
>      return r;
>  }
> 
> -void rangeset_destroy(
> -    struct rangeset *r)
> +void rangeset_empty(struct rangeset *r)
>  {
>      struct range *x;
> 
> @@ -465,6 +464,12 @@ void rangeset_destroy(
> 
>      while ( (x = first_range(r)) != NULL )
>          destroy_range(r, x);
> +}
> +
> +void rangeset_destroy(
> +    struct rangeset *r)
> +{
> +    rangeset_empty(r);
> 
>      xfree(r);
>  }

I think the list_del(&r->rangeset_list) operation (and associated locking and NULL check) shouldn't be moved to the new rangeset_empty() function, it should stay in rangeset_destroy().


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 09/17] rangeset: add rangeset_empty() function
  2023-10-13 17:54   ` Stewart Hildebrand
@ 2023-10-13 18:08     ` Volodymyr Babchuk
  0 siblings, 0 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-13 18:08 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu


Hi Stewart,

Stewart Hildebrand <stewart.hildebrand@amd.com> writes:

> On 10/12/23 18:09, Volodymyr Babchuk wrote:
>> This function can be used when a user wants to remove all rangeset
>> entries but does not want to destroy the rangeset itself.
>> 
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> 
>> ---
>> 
>> Changes in v10:
>> 
>>  - New in v10. The function is used in "vpci/header: handle p2m range sets per BAR"
>> ---
>>  xen/common/rangeset.c      | 9 +++++++--
>>  xen/include/xen/rangeset.h | 3 ++-
>>  2 files changed, 9 insertions(+), 3 deletions(-)
>> 
>> diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
>> index 35c3420885..420275669e 100644
>> --- a/xen/common/rangeset.c
>> +++ b/xen/common/rangeset.c
>> @@ -448,8 +448,7 @@ struct rangeset *rangeset_new(
>>      return r;
>>  }
>> 
>> -void rangeset_destroy(
>> -    struct rangeset *r)
>> +void rangeset_empty(struct rangeset *r)
>>  {
>>      struct range *x;
>> 
>> @@ -465,6 +464,12 @@ void rangeset_destroy(
>> 
>>      while ( (x = first_range(r)) != NULL )
>>          destroy_range(r, x);
>> +}
>> +
>> +void rangeset_destroy(
>> +    struct rangeset *r)
>> +{
>> +    rangeset_empty(r);
>> 
>>      xfree(r);
>>  }
>
> I think the list_del(&r->rangeset_list) operation (and associated
> locking and NULL check) shouldn't be moved to the new rangeset_empty()
> function, it should stay in rangeset_destroy().

Ahh, yes. It was a really stupid idea to move list_del(&r->rangeset_list); to
rangeset_empty().

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests
  2023-10-12 22:09 ` [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
@ 2023-10-13 21:53   ` Volodymyr Babchuk
  2023-11-21 14:17   ` Roger Pau Monné
  1 sibling, 0 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-10-13 21:53 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Roger Pau Monné


Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> writes:

> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>
> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
> to remain unaltered. The PCI_COMMAND_SERR bit is a good example: while the
> guest's view of this will want to be zero initially, a value of 1 set by
> the host may not simply be overwritten with 0, or else we'd effectively be
> giving the guest control of the bit. Thus, the PCI_COMMAND register needs
> proper emulation in order to honor the host's settings.
>
> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
> Device Control" the reset state of the command register is typically 0,
> so when assigning a PCI device use 0 as the initial state for the guest's view
> of the command register.
>
> Here is the full list of command register bits with notes about
> emulation, along with QEMU behavior in the same situation:
>
> PCI_COMMAND_IO - QEMU does not allow a guest to change value of this bit
> in real device. Instead it is always set to 1. A guest can write to this
> register, but writes are ignored.
>
> PCI_COMMAND_MEMORY - QEMU behaves exactly as with PCI_COMMAND_IO. In the
> Xen case, we handle writes to this bit by mapping/unmapping BAR
> regions. For devices assigned to DomUs, memory decoding will be
> disabled during initialization.
>
> PCI_COMMAND_MASTER - Allow guest to control it. QEMU passes through
> writes to this bit.
>
> PCI_COMMAND_SPECIAL - Guest can generate special cycles only if it has
> access to host bridge that supports software generation of special
> cycles. In our case guest has no access to host bridges at all. Value
> after reset is 0. QEMU passes through writes of this bit, we will do
> the same.
>
> PCI_COMMAND_INVALIDATE - Allows "Memory Write and Invalidate" commands
> to be generated. It requires additional configuration via Cacheline
> Size register. We are not emulating this register right now and we
> can't expect guest to properly configure it. QEMU "emulates" access to
> Cachline Size register by ignoring all writes to it. QEMU passes through
> writes of PCI_COMMAND_INVALIDATE bit, we will do the same.
>
> PCI_COMMAND_VGA_PALETTE - Enable VGA palette snooping. QEMU passes
> through writes of this bit, we will do the same.
>
> PCI_COMMAND_PARITY - Controls how device response to parity
> errors. QEMU ignores writes to this bit, we will do the same.
>
> PCI_COMMAND_WAIT - Reserved. Should be 0, but QEMU passes
> through writes of this bit, so we will do the same.
>
> PCI_COMMAND_SERR - Controls if device can assert SERR. QEMU ignores
> writes to this bit, we will do the same.
>
> PCI_COMMAND_FAST_BACK - Optional bit that allows fast back-to-back
> transactions. It is configured by firmware, so we don't want guest to
> control it. QEMU ignores writes to this bit, we will do the same.
>
> PCI_COMMAND_INTX_DISABLE - Disables INTx signals. If MSI(X) is
> enabled, device is prohibited from asserting INTx as per
> specification. Value after reset is 0. In the QEMU case, it checks if INTx
> was mapped for a device. If it is not, then guest can't control
> PCI_COMMAND_INTX_DISABLE bit. In our case, we prohibit a guest to
> change value of this bit if MSI(X) is enabled.
>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
> In v10:
> - Added cf_check attribute to guest_cmd_read
> - Removed warning about non-zero cmd
> - Updated comment MSI code regarding disabling INTX
> - Used ternary operator in vpci_add_register() call
> - Disable memory decoding for DomUs in init_bars()
> In v9:
> - Reworked guest_cmd_read
> - Added handling for more bits
> Since v6:
> - fold guest's logic into cmd_write
> - implement cmd_read, so we can report emulated INTx state to guests
> - introduce header->guest_cmd to hold the emulated state of the
>   PCI_COMMAND register for guests
> Since v5:
> - add additional check for MSI-X enabled while altering INTX bit
> - make sure INTx disabled while guests enable MSI/MSI-X
> Since v3:
> - gate more code on CONFIG_HAS_MSI
> - removed logic for the case when MSI/MSI-X not enabled
> ---
>  xen/drivers/vpci/header.c | 44 +++++++++++++++++++++++++++++++++++----
>  xen/drivers/vpci/msi.c    |  6 ++++++
>  xen/drivers/vpci/msix.c   |  4 ++++
>  xen/include/xen/vpci.h    |  3 +++
>  4 files changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index efce0bc2ae..e8eed6a674 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -501,14 +501,32 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      return 0;
>  }
>  
> +/* TODO: Add proper emulation for all bits of the command register. */
>  static void cf_check cmd_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
>  {
>      struct vpci_header *header = data;
>  
> +    if ( !is_hardware_domain(pdev->domain) )
> +    {
> +        const struct vpci *vpci = pdev->vpci;
> +        uint16_t excluded = PCI_COMMAND_PARITY | PCI_COMMAND_SERR |
> +            PCI_COMMAND_FAST_BACK;
> +
> +        header->guest_cmd = cmd;
> +
> +        if ( (vpci->msi && vpci->msi->enabled) ||
> +             (vpci->msix && vpci->msi->enabled) )

There is a nasty mistake. It should be
                (vpci->msix && vpci->msix->enabled)

> +            excluded |= PCI_COMMAND_INTX_DISABLE;
> +
> +        cmd &= ~excluded;
> +        cmd |= pci_conf_read16(pdev->sbdf, reg) & excluded;
> +    }
> +
>      /*
> -     * Let Dom0 play with all the bits directly except for the memory
> -     * decoding one.
> +     * Let guest play with all the bits directly except for the memory
> +     * decoding one. Bits that are not allowed for DomU are already
> +     * handled above.
>       */
>      if ( header->bars_mapped != !!(cmd & PCI_COMMAND_MEMORY) )
>          /*
> @@ -522,6 +540,14 @@ static void cf_check cmd_write(
>          pci_conf_write16(pdev->sbdf, reg, cmd);
>  }
>  
> +static uint32_t cf_check guest_cmd_read(
> +    const struct pci_dev *pdev, unsigned int reg, void *data)
> +{
> +    const struct vpci_header *header = data;
> +
> +    return header->guest_cmd;
> +}
> +
>  static void cf_check bar_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
>  {
> @@ -737,8 +763,9 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      }
>  
>      /* Setup a handler for the command register. */
> -    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
> -                           2, header);
> +    rc = vpci_add_register(pdev->vpci,
> +                           is_hwdom ? vpci_hw_read16 : guest_cmd_read,
> +                           cmd_write, PCI_COMMAND, 2, header);
>      if ( rc )
>          return rc;
>  
> @@ -750,6 +777,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      if ( cmd & PCI_COMMAND_MEMORY )
>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
>  
> +    /*
> +     * Clear PCI_COMMAND_MEMORY for DomUs, so they will always start with
> +     * memory decoding disabled and to ensure that we will not call modify_bars()
> +     * at the end of this function.
> +     */
> +    if ( !is_hwdom )
> +        cmd &= ~PCI_COMMAND_MEMORY;
> +    header->guest_cmd = cmd;
> +
>      for ( i = 0; i < num_bars; i++ )
>      {
>          uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 2faa54b7ce..0920bd071f 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -70,6 +70,12 @@ static void cf_check control_write(
>  
>          if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>              return;
> +
> +        /*
> +         * Make sure guest doesn't enable INTx while enabling MSI.
> +         */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);
>      }
>      else
>          vpci_msi_arch_disable(msi, pdev);
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index b6abab47ef..9d0233d0e3 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -97,6 +97,10 @@ static void cf_check control_write(
>          for ( i = 0; i < msix->max_entries; i++ )
>              if ( !msix->entries[i].masked && msix->entries[i].updated )
>                  update_entry(&msix->entries[i], pdev, i);
> +
> +        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);
>      }
>      else if ( !new_enabled && msix->enabled )
>      {
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index c5301e284f..60bdc10c13 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -87,6 +87,9 @@ struct vpci {
>          } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
>          /* At most 6 BARS + 1 expansion ROM BAR. */
>  
> +        /* Guest view of the PCI_COMMAND register. */
> +        uint16_t guest_cmd;
> +
>          /*
>           * Store whether the ROM enable bit is set (doesn't imply ROM BAR
>           * is mapped into guest p2m) if there's a ROM BAR on the device.


-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 07/17] vpci/header: implement guest BAR register handlers
  2023-10-12 22:09 ` [PATCH v10 07/17] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
@ 2023-10-14 16:00   ` Stewart Hildebrand
  2023-11-20 16:06   ` Roger Pau Monné
  1 sibling, 0 replies; 65+ messages in thread
From: Stewart Hildebrand @ 2023-10-14 16:00 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné

On 10/12/23 18:09, Volodymyr Babchuk wrote:
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 33db58580c..40d1a07ada 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -477,6 +477,74 @@ static void cf_check bar_write(
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
> 
> +static void cf_check guest_bar_write(const struct pci_dev *pdev,
> +                                     unsigned int reg, uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +    uint64_t guest_addr = bar->guest_addr;

This initialization is using the initial value of bar ...

> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;

... but here bar is decremented ...

> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +    }
> +

... so to ensure a consistent state, the initialization should be moved here.

> +    guest_addr &= ~(0xffffffffULL << (hi ? 32 : 0));
> +    guest_addr |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    /* Allow guest to size BAR correctly */
> +    guest_addr &= ~(bar->size - 1);
> +
> +    /*
> +     * Make sure that the guest set address has the same page offset
> +     * as the physical address on the host or otherwise things won't work as
> +     * expected.
> +     */
> +    if ( guest_addr != ~(bar->size -1 )  &&

Should this sizing check only apply to the lower 32 bits, or take "hi" into account?

For reference, it may be helpful to see an example sequence of a Linux domU sizing a 64-bit BAR. I instrumented guest_bar_write() to print the raw/initial val argument, and guest_bar_read() to print the final reg_val:
(XEN) drivers/vpci/header.c:guest_bar_read  d1 0000:01:00.0 reg 0x10 val 0xe0100004
(XEN) drivers/vpci/header.c:guest_bar_write d1 0000:01:00.0 reg 0x10 val 0xffffffff
(XEN) drivers/vpci/header.c:guest_bar_read  d1 0000:01:00.0 reg 0x10 val 0xffffc004
(XEN) drivers/vpci/header.c:guest_bar_write d1 0000:01:00.0 reg 0x10 val 0xe0100004
(XEN) drivers/vpci/header.c:guest_bar_read  d1 0000:01:00.0 reg 0x14 val 0x0 (hi)
(XEN) drivers/vpci/header.c:guest_bar_write d1 0000:01:00.0 reg 0x14 val 0xffffffff (hi)
(XEN) drivers/vpci/header.c:guest_bar_read  d1 0000:01:00.0 reg 0x14 val 0xffffffff (hi)
(XEN) drivers/vpci/header.c:guest_bar_write d1 0000:01:00.0 reg 0x14 val 0x0 (hi)
(XEN) drivers/vpci/header.c:guest_bar_write d1 0000:01:00.0 reg 0x10 val 0x23000004
(XEN) drivers/vpci/header.c:guest_bar_read  d1 0000:01:00.0 reg 0x10 val 0x23000004
(XEN) drivers/vpci/header.c:guest_bar_write d1 0000:01:00.0 reg 0x14 val 0x0 (hi)
(XEN) drivers/vpci/header.c:guest_bar_read  d1 0000:01:00.0 reg 0x14 val 0x0 (hi)

> +         PAGE_OFFSET(guest_addr) != PAGE_OFFSET(bar->addr) )
> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%pp: ignored BAR %zu write attempting to change page offset\n",
> +                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
> +        return;
> +    }
> +
> +    bar->guest_addr = guest_addr;
> +}



* Re: [PATCH v10 16/17] xen/arm: vpci: permit access to guest vpci space
  2023-10-12 22:09 ` [PATCH v10 16/17] xen/arm: vpci: permit access to guest vpci space Volodymyr Babchuk
@ 2023-10-16 11:00   ` Jan Beulich
  2023-10-24 19:44     ` Stewart Hildebrand
  0 siblings, 1 reply; 65+ messages in thread
From: Jan Beulich @ 2023-10-16 11:00 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stewart Hildebrand, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Andrew Cooper, George Dunlap, Wei Liu,
	xen-devel

On 13.10.2023 00:09, Volodymyr Babchuk wrote:
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -695,6 +695,9 @@ struct domain *domain_create(domid_t domid,
>          radix_tree_init(&d->pirq_tree);
>      }
>  
> +    if ( !is_idle_domain(d) )
> +        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
> +
>      if ( (err = arch_domain_create(d, config, flags)) != 0 )
>          goto fail;
>      init_status |= INIT_arch;
> @@ -704,7 +707,6 @@ struct domain *domain_create(domid_t domid,
>          watchdog_domain_init(d);
>          init_status |= INIT_watchdog;
>  
> -        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
>          d->irq_caps   = rangeset_new(d, "Interrupts", 0);
>          if ( !d->iomem_caps || !d->irq_caps )
>              goto fail;

It's not really logical to move one, not both. Plus you didn't move the
error check, so if the earlier initialization is really needed, you set
things up for a NULL deref.

Jan



* Re: [PATCH v10 16/17] xen/arm: vpci: permit access to guest vpci space
  2023-10-16 11:00   ` Jan Beulich
@ 2023-10-24 19:44     ` Stewart Hildebrand
  0 siblings, 0 replies; 65+ messages in thread
From: Stewart Hildebrand @ 2023-10-24 19:44 UTC (permalink / raw)
  To: Jan Beulich, Volodymyr Babchuk
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Andrew Cooper, George Dunlap, Wei Liu, xen-devel

On 10/16/23 07:00, Jan Beulich wrote:
> On 13.10.2023 00:09, Volodymyr Babchuk wrote:
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -695,6 +695,9 @@ struct domain *domain_create(domid_t domid,
>>          radix_tree_init(&d->pirq_tree);
>>      }
>>
>> +    if ( !is_idle_domain(d) )
>> +        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
>> +
>>      if ( (err = arch_domain_create(d, config, flags)) != 0 )
>>          goto fail;
>>      init_status |= INIT_arch;
>> @@ -704,7 +707,6 @@ struct domain *domain_create(domid_t domid,
>>          watchdog_domain_init(d);
>>          init_status |= INIT_watchdog;
>>
>> -        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
>>          d->irq_caps   = rangeset_new(d, "Interrupts", 0);
>>          if ( !d->iomem_caps || !d->irq_caps )
>>              goto fail;
> 
> It's not really logical to move one, not both. Plus you didn't move the
> error check, so if the earlier initialization is really needed, you set
> things up for a NULL deref.
> 
> Jan

Good catch, I'll move both along with the error check (and I'll coordinate the update with Volodymyr). Thanks.



* Re: [PATCH v10 01/17] pci: msi: pass pdev to pci_enable_msi() function
  2023-10-12 22:09 ` [PATCH v10 01/17] pci: msi: pass pdev to pci_enable_msi() function Volodymyr Babchuk
@ 2023-10-30 15:55   ` Jan Beulich
  2023-11-17 13:59   ` Roger Pau Monné
  1 sibling, 0 replies; 65+ messages in thread
From: Jan Beulich @ 2023-10-30 15:55 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stewart Hildebrand, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 13.10.2023 00:09, Volodymyr Babchuk wrote:
> Previously pci_enable_msi() function obtained pdev pointer by itself,
> but taking into account upcoming changes to PCI locking, it is better
> when caller passes already acquired pdev pointer to the function.

For the patch to be understandable on its own, the "is better" wants
explaining here.

> Note that ns16550 driver does not check validity of obtained pdev
> pointer because pci_enable_msi() already does this.

I'm not convinced of this model. I'd rather see the caller do the
check, and the callee - optionally - have a respective assertion.

> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -983,13 +983,13 @@ static int msix_capability_init(struct pci_dev *dev,
>   * irq or non-zero for otherwise.
>   **/
>  
> -static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
> +static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
> +			    struct pci_dev *pdev)

In line with msi_capability_init() and ...

> @@ -1038,13 +1038,13 @@ static void __pci_disable_msi(struct msi_desc *entry)
>   * of irqs available. Driver should use the returned value to re-send
>   * its request.
>   **/
> -static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
> +static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc,
> +			     struct pci_dev *pdev)

... msix_capability_init(), may I ask that the new parameter then
become the first one, not the last (and hence even past output
parameters)?

Jan



* Re: [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure
  2023-10-12 22:09 ` [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
@ 2023-11-03 15:39   ` Stewart Hildebrand
  2023-11-17 15:16   ` Roger Pau Monné
  1 sibling, 0 replies; 65+ messages in thread
From: Stewart Hildebrand @ 2023-11-03 15:39 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, Jun Nakajima, Kevin Tian, Paul Durrant

On 10/12/23 18:09, Volodymyr Babchuk wrote:
> 6. We are removing multiple ASSERT(pcidevs_locked()) instances because
> they are too strict now: they should be corrected to
> ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock)), but problem is
> that mentioned instances does not have access to the domain
> pointer and it is not feasible to pass a domain pointer to a function
> just for debugging purposes.

...

> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 20275260b3..466725d8ca 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -988,8 +988,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
>  {
>      struct msi_desc *old_desc;
> 
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev )
>          return -ENODEV;

I know it is mentioned in the commit message, so the ASSERT removal above may be okay. However, just to mention it: as we are passing pdev to this function now, we can access the domain pointer here. So the ASSERT can be turned into (after the !pdev check):

    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));

In which case pdev->domain != NULL might also want to be checked.

> 
> @@ -1043,8 +1041,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc,
>  {
>      struct msi_desc *old_desc;
> 
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev || !pdev->msix )
>          return -ENODEV;

Same here.

> 
> @@ -1154,8 +1150,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
>                    struct pci_dev *pdev)
>  {
> -    ASSERT(pcidevs_locked());
> -
>      if ( !use_msi )
>          return -EPERM;
> 



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-10-12 22:09 ` [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
@ 2023-11-16 16:06   ` Julien Grall
  2023-11-16 23:28     ` Stefano Stabellini
  2023-11-21 14:40   ` Roger Pau Monné
  1 sibling, 1 reply; 65+ messages in thread
From: Julien Grall @ 2023-11-16 16:06 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Jan Beulich, Stefano Stabellini, Wei Liu,
	Roger Pau Monné

Hi Volodymyr,

This patch was mentioned in another context about allocating the BDF. So 
I thought I would comment here as well.

On 12/10/2023 23:09, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Your signed-off-by should be added even if you are only sending the 
patch on behalf of Oleksandr. This is part of the DCO [1]

> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 5e34d0092a..7c46a2d3f4 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -36,6 +36,52 @@ extern vpci_register_init_t *const __start_vpci_array[];
>   extern vpci_register_init_t *const __end_vpci_array[];
>   #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>   
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    unsigned long new_dev_number;
> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> +                                         VPCI_MAX_VIRT_DEV);

IIUC, this means that Xen will allocate the BDF. I think this will 
become a problem quite quickly, as some PCI devices may need to be 
assigned a specific vBDF (I have the Intel graphics card in mind).

Also, xl allows you to specify the slot (e.g. <bdf>@<vslot>), which 
would not work with this approach.

For dom0less passthrough, I feel the virtual BDF should always be 
specified in device-tree. When a domain is created after boot, then I 
think you want to support <bdf>@<vslot> where <vslot> is optional.

[1] https://cert-manager.io/docs/contributing/sign-off/

-- 
Julien Grall



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-16 16:06   ` Julien Grall
@ 2023-11-16 23:28     ` Stefano Stabellini
  2023-11-17  0:06       ` Julien Grall
  2023-11-17  0:21       ` Volodymyr Babchuk
  0 siblings, 2 replies; 65+ messages in thread
From: Stefano Stabellini @ 2023-11-16 23:28 UTC (permalink / raw)
  To: Julien Grall
  Cc: Volodymyr Babchuk, xen-devel, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Stefano Stabellini, Wei Liu, Roger Pau Monné

On Thu, 16 Nov 2023, Julien Grall wrote:
> IIUC, this means that Xen will allocate the BDF. I think this will become a
> problem quite quickly as some of the PCI may need to be assigned at a specific
> vBDF (I have the intel graphic card in mind).
> 
> Also, xl allows you to specify the slot (e.g. <bdf>@<vslot>) which would not
> work with this approach.
> 
> For dom0less passthrough, I feel the virtual BDF should always be specified in
> device-tree. When a domain is created after boot, then I think you want to
> support <bdf>@<vslot> where <vslot> is optional.

Hi Julien,

I also think there should be a way to specify the virtual BDF, but if
possible (meaning: it is not super difficult to implement) I think it
would be very convenient if we could let Xen pick whatever virtual BDF
Xen wants when the user doesn't specify the virtual BDF. That's
because it would make it easier to specify the configuration for the
user. Typically the user doesn't care about the virtual BDF, only to
expose a specific host device to the VM. There are exceptions of course
and that's why I think we should also have a way for the user to
request a specific virtual BDF. One of these exceptions is integrated
GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if
the device shows up at a different virtual BDF compared to the host.

Thinking more about this, one way to simplify the problem would be to
always reuse the physical BDF as the virtual BDF for passthrough devices. I
think that would solve the problem and make it much less likely to
run into driver bugs.

And we allocate a "special" virtual BDF space for emulated devices, with
the Root Complex still emulated in Xen. For instance, we could reserve
ff:xx:xx and in case of clashes we could refuse to continue. Or we could
allocate the first free virtual BDF, after all the pasthrough devices.

Example:
- the user wants to assign physical 00:11.5 and b3:00.1 to the guest
- Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices
- Xen allocates the next virtual BDF for emulated devices: b4:xx.x
- If more virtual BDFs are needed for emulated devices, Xen allocates
  b5:xx.x

I still think, no matter the BDF allocation scheme, that we should try
as much as possible to avoid having two different PCI Root Complex
emulators. Ideally we would have only one PCI Root Complex emulated by
Xen. Having two PCI Root Complexes, both of them emulated by Xen, would be
tolerable but not ideal. The worst case I would like to avoid is having
two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-16 23:28     ` Stefano Stabellini
@ 2023-11-17  0:06       ` Julien Grall
  2023-11-17  0:51         ` Stefano Stabellini
  2023-11-17  0:21       ` Volodymyr Babchuk
  1 sibling, 1 reply; 65+ messages in thread
From: Julien Grall @ 2023-11-17  0:06 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Volodymyr Babchuk, xen-devel, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Wei Liu, Roger Pau Monné

Hi Stefano,

On 16/11/2023 23:28, Stefano Stabellini wrote:
> On Thu, 16 Nov 2023, Julien Grall wrote:
>> IIUC, this means that Xen will allocate the BDF. I think this will become a
>> problem quite quickly as some of the PCI may need to be assigned at a specific
>> vBDF (I have the intel graphic card in mind).
>>
>> Also, xl allows you to specify the slot (e.g. <bdf>@<vslot>) which would not
>> work with this approach.
>>
>> For dom0less passthrough, I feel the virtual BDF should always be specified in
>> device-tree. When a domain is created after boot, then I think you want to
>> support <bdf>@<vslot> where <vslot> is optional.
> 
> Hi Julien,
> 
> I also think there should be a way to specify the virtual BDF, but if
> possible (meaning: it is not super difficult to implement) I think it
> would be very convenient if we could let Xen pick whatever virtual BDF
> Xen wants when the user doesn't specify the virtual BDF. That's
> because it would make it easier to specify the configuration for the
> user. Typically the user doesn't care about the virtual BDF, only to
> expose a specific host device to the VM. There are exceptions of course
> and that's why I think we should also have a way for the user to
> request a specific virtual BDF. One of these exceptions are integrated
> GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if
> the device shows up at a different virtual BDF compared to the host.

If you let Xen allocate the vBDF, then wouldn't you need a way to tell 
the toolstack/Device Models which vBDF was allocated?

> 
> Thinking more about this, one way to simplify the problem would be if we
> always reuse the physical BDF as virtual BDF for passthrough devices. I
> think that would solve the problem and makes it much more unlikely to
> run into drivers bugs.

This works so long as you have only one physical segment (i.e. hostbridge). 
If you have multiple ones, then you either have to expose multiple 
hostbridges to the guest (which is not great) or need someone to allocate 
the vBDF.

> 
> And we allocate a "special" virtual BDF space for emulated devices, with
> the Root Complex still emulated in Xen. For instance, we could reserve
> ff:xx:xx.
Hmmm... Wouldn't this mean reserving ECAM space for 256 buses? 
Obviously, we could use 5 (just as a random number). Yet, it still 
requires reserving more memory than necessary.

 > and in case of clashes we could refuse to continue.

Urgh. And what would be the solution for users triggering this clash?

> Or we could
> allocate the first free virtual BDF, after all the pasthrough devices.

This only works if you don't want to support PCI hotplug. It may not 
be a thing for embedded, but it is used in the cloud. So you need a 
mechanism that works with hotplug as well.

> 
> Example:
> - the user wants to assign physical 00:11.5 and b3:00.1 to the guest
> - Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices
> - Xen allocates the next virtual BDF for emulated devices: b4:xx.x
> - If more virtual BDFs are needed for emulated devices, Xen allocates
>    b5:xx.x
>
> I still think, no matter the BDF allocation scheme, that we should try
> to avoid as much as possible to have two different PCI Root Complex
> emulators. Ideally we would have only one PCI Root Complex emulated by
> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> tolerable but not ideal. The worst case I would like to avoid is to have
> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.

So while I agree that one emulated hostbridge is the best solution, I 
don't think your proposal would work. As I wrote above, you may have a 
system with multiple physical hostbridges. It would not be possible to 
assign two PCI devices with the same BDF but from different segments.

I agree it is unlikely, but if we can avoid it then it would be best. There 
is one scheme which fits that:
   1. If the vBDF is not specified, then pick a free one.
   2. Otherwise check if the specified vBDF is free. If not return an error.

This scheme should be used for both virtual and physical devices. This is pretty 
much the algorithm used by QEMU today. It works, so what would be the 
benefit of doing something different?

Cheers,

-- 
Julien Grall



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-16 23:28     ` Stefano Stabellini
  2023-11-17  0:06       ` Julien Grall
@ 2023-11-17  0:21       ` Volodymyr Babchuk
  2023-11-17  0:58         ` Stefano Stabellini
  1 sibling, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-11-17  0:21 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, George Dunlap, Jan Beulich, Wei Liu,
	Roger Pau Monné,
	xen-devel


Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Thu, 16 Nov 2023, Julien Grall wrote:
>> IIUC, this means that Xen will allocate the BDF. I think this will become a
>> problem quite quickly as some of the PCI may need to be assigned at a specific
>> vBDF (I have the intel graphic card in mind).
>> 
>> Also, xl allows you to specify the slot (e.g. <bdf>@<vslot>) which would not
>> work with this approach.
>> 
>> For dom0less passthrough, I feel the virtual BDF should always be specified in
>> device-tree. When a domain is created after boot, then I think you want to
>> support <bdf>@<vslot> where <vslot> is optional.
>
> Hi Julien,
>
> I also think there should be a way to specify the virtual BDF, but if
> possible (meaning: it is not super difficult to implement) I think it
> would be very convenient if we could let Xen pick whatever virtual BDF
> Xen wants when the user doesn't specify the virtual BDF. That's
> because it would make it easier to specify the configuration for the
> user. Typically the user doesn't care about the virtual BDF, only to
> expose a specific host device to the VM. There are exceptions of course
> and that's why I think we should also have a way for the user to
> request a specific virtual BDF. One of these exceptions are integrated
> GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if
> the device shows up at a different virtual BDF compared to the host.
>
> Thinking more about this, one way to simplify the problem would be if we
> always reuse the physical BDF as virtual BDF for passthrough devices. I
> think that would solve the problem and makes it much more unlikely to
> run into drivers bugs.

I'm not sure that this is possible. AFAIK, if we have a device with B>0,
we need to have a bridge device for it. So, if I want to pass through
device 08:00.0, I need to provide a virtual bridge with BDF 0:NN.0. This
unnecessarily complicates things.

Also, there can be a funny situation with conflicting BDF numbers exposed
by different domains. I know that this is not your typical setup, but
imagine that Domain A acts as a driver domain for PCI controller A and
Domain B acts as a driver domain for PCI controller B. They may expose
devices with the same BDFs but with different segments.

> And we allocate a "special" virtual BDF space for emulated devices, with
> the Root Complex still emulated in Xen. For instance, we could reserve
> ff:xx:xx and in case of clashes we could refuse to continue. Or we could
> allocate the first free virtual BDF, after all the pasthrough devices.

Again, I may be wrong here, but we need an emulated PCI bridge device if we
want to use bus numbers > 0.

>
> Example:
> - the user wants to assign physical 00:11.5 and b3:00.1 to the guest
> - Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices
> - Xen allocates the next virtual BDF for emulated devices: b4:xx.x
> - If more virtual BDFs are needed for emulated devices, Xen allocates
>   b5:xx.x
>
> I still think, no matter the BDF allocation scheme, that we should try
> to avoid as much as possible to have two different PCI Root Complex
> emulators. Ideally we would have only one PCI Root Complex emulated by
> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> tolerable but not ideal.

But what exactly is wrong with this setup?

> The worst case I would like to avoid is to have
> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.

This is how our setup works right now.

I agree that we need some way to provide static vBDF numbers. But I am
wondering what is the best way to do this. We need some entity that
manages and assigns those vBDFs. It should reside in Xen, because there
is the dom0less use case. Probably we need to extend
xen_domctl_assign_device so we can either request a free vBDF or a
specific vBDF. And in the first case, Xen should return the assigned vBDF so
the toolstack can give it to a backend, if the PCI device is purely virtual.

-- 
WBR, Volodymyr


* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-17  0:06       ` Julien Grall
@ 2023-11-17  0:51         ` Stefano Stabellini
  0 siblings, 0 replies; 65+ messages in thread
From: Stefano Stabellini @ 2023-11-17  0:51 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, Volodymyr Babchuk, xen-devel,
	Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Jan Beulich, Wei Liu, Roger Pau Monné

On Fri, 17 Nov 2023, Julien Grall wrote:
> Hi Stefano,
> 
> On 16/11/2023 23:28, Stefano Stabellini wrote:
> > On Thu, 16 Nov 2023, Julien Grall wrote:
> > > IIUC, this means that Xen will allocate the BDF. I think this will
> > > become a problem quite quickly as some of the PCI devices may need
> > > to be assigned at a specific vBDF (I have the Intel graphics card
> > > in mind).
> > > 
> > > Also, xl allows you to specify the slot (e.g. <bdf>@<vslot>) which
> > > would not work with this approach.
> > > 
> > > For dom0less passthrough, I feel the virtual BDF should always be
> > > specified in the device tree. When a domain is created after boot,
> > > then I think you want to support <bdf>@<vslot> where <vslot> is
> > > optional.
> > 
> > Hi Julien,
> > 
> > I also think there should be a way to specify the virtual BDF, but if
> > possible (meaning: it is not super difficult to implement) I think it
> > would be very convenient if we could let Xen pick whatever virtual BDF
> > Xen wants when the user doesn't specify the virtual BDF. That's
> > because it would make it easier to specify the configuration for the
> > user. Typically the user doesn't care about the virtual BDF, only to
> > expose a specific host device to the VM. There are exceptions of course
> > and that's why I think we should also have a way for the user to
> > request a specific virtual BDF. One of these exceptions is integrated
> > GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if
> > the device shows up at a different virtual BDF compared to the host.
> 
> If you let Xen allocate the vBDF, then wouldn't you need a way to tell the
> toolstack/Device Models which vBDF was allocated?
> 
> > 
> > Thinking more about this, one way to simplify the problem would be if we
> > always reuse the physical BDF as virtual BDF for passthrough devices. I
> > think that would solve the problem and make it much more unlikely to
> > run into driver bugs.
> 
> This works as long as you have only one physical segment (i.e. hostbridge).
> If you have multiple ones, then you either have to expose multiple
> hostbridges to the guest (which is not great) or need someone to allocate
> the vBDF.
> 
> > 
> > And we allocate a "special" virtual BDF space for emulated devices, with
> > the Root Complex still emulated in Xen. For instance, we could reserve
> > ff:xx:xx.
> Hmmm... Wouldn't this mean reserving ECAM space for 256 buses? Obviously, we
> could use 5 (just as a random number). Yet, it still requires reserving more
> memory than necessary.
> 
> > and in case of clashes we could refuse to continue.
> 
> Urgh. And what would be the solution for users triggering this clash?
> 
> > Or we could
> > allocate the first free virtual BDF, after all the pasthrough devices.
> 
> This only works if you don't want to support PCI hotplug. It may not be a
> thing for embedded, but it is used by cloud. So you need a mechanism that
> works with hotplug as well.
> 
> > 
> > Example:
> > - the user wants to assign physical 00:11.5 and b3:00.1 to the guest
> > - Xen creates virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices
> > - Xen allocates the next virtual BDF for emulated devices: b4:xx.x
> > - If more virtual BDFs are needed for emulated devices, Xen allocates
> >    b5:xx.x
> >
> > I still think, no matter the BDF allocation scheme, that we should try
> > to avoid as much as possible to have two different PCI Root Complex
> > emulators. Ideally we would have only one PCI Root Complex emulated by
> > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> > tolerable but not ideal. The worst case I would like to avoid is to have
> > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> 
> So while I agree that one emulated hostbridge is the best solution, I don't
> think your proposal would work. As I wrote above, you may have a system with
> multiple physical hostbridges. It would not be possible to assign two PCI
> devices with the same BDF but from different segments.
> 
> I agree it is unlikely, but if we can avoid it then it would be best. There
> is one scheme which fits that:
>   1. If the vBDF is not specified, then pick a free one.
>   2. Otherwise check if the specified vBDF is free. If not return an error.
> 
> This scheme should be used for both virtual and physical. This is pretty much
> the algorithm used by QEMU today. It works, so what would be the benefit of
> doing something different?

I am OK with that. I was trying to find a way that could work without
user intervention in almost 100% of the cases. I think both 1. and 2.
you proposed are fine.
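For illustration, the two rules above can be sketched as a tiny allocator over a flat table of 256 vBDF slots. The table size, names, and the `VBDF_ANY` sentinel are illustrative only, not actual Xen code:

```c
#include <assert.h>
#include <stdbool.h>

#define VBDF_SLOTS 256
#define VBDF_ANY   (-1)   /* hypothetical sentinel: "let Xen pick" */

static bool vbdf_used[VBDF_SLOTS];

/*
 * Rule 1: if no vBDF is specified (VBDF_ANY), pick the first free one.
 * Rule 2: if a vBDF is specified, use it only if it is free, else fail.
 * Returns the assigned vBDF, or -1 on failure.
 */
static int vbdf_assign(int requested)
{
    if ( requested == VBDF_ANY )
    {
        for ( int i = 0; i < VBDF_SLOTS; i++ )
            if ( !vbdf_used[i] )
            {
                vbdf_used[i] = true;
                return i;   /* the choice must be reported back */
            }
        return -1;          /* vBDF space exhausted */
    }

    if ( requested < 0 || requested >= VBDF_SLOTS || vbdf_used[requested] )
        return -1;          /* out of range, or clash: return an error */

    vbdf_used[requested] = true;
    return requested;
}
```

In the `VBDF_ANY` case the caller would need the assigned value reported back, which is why the discussion elsewhere in the thread turns to returning the assigned vBDF to the toolstack.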


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-17  0:21       ` Volodymyr Babchuk
@ 2023-11-17  0:58         ` Stefano Stabellini
  2023-11-17 14:09           ` Volodymyr Babchuk
  0 siblings, 1 reply; 65+ messages in thread
From: Stefano Stabellini @ 2023-11-17  0:58 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stefano Stabellini, Julien Grall, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Wei Liu, Roger Pau Monné,
	xen-devel

On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > I still think, no matter the BDF allocation scheme, that we should try
> > to avoid as much as possible to have two different PCI Root Complex
> > emulators. Ideally we would have only one PCI Root Complex emulated by
> > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> > tolerable but not ideal.
> 
> But what exactly is wrong with this setup?

[...]

> > The worst case I would like to avoid is to have
> > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> 
> This is how our setup works right now.

If we have:
- a single PCI Root Complex emulated in Xen
- Xen is safety certified
- individual Virtio devices emulated by QEMU with grants for memory

We can go very far in terms of being able to use Virtio in safety
use-cases. We might even be able to use Virtio (frontends) in a SafeOS.

On the other hand if we put an additional Root Complex in QEMU:
- we pay a price in terms of complexity of the codebase
- we pay a price in terms of resource utilization
- we have one additional problem in terms of using this setup with a
  SafeOS (one more device emulated by a non-safe component)

Having 2 PCI Root Complexes both emulated in Xen is a middle ground
solution because:
- we still pay a price in terms of resource utilization
- the code complexity goes up a bit but hopefully not by much
- there is no impact on safety compared to the ideal scenario

This is why I wrote that it is tolerable.


> I agree that we need some way to provide static vBDF numbers. But I am
> wondering what the best way to do this is. We need some entity that
> manages and assigns those vBDFs. It should reside in Xen, because of
> the Dom0less use case. Probably we need to extend
> xen_domctl_assign_device so we can either request a free vBDF or a
> specific vBDF. In the first case, Xen should return the assigned vBDF
> so the toolstack can give it to a backend, if the PCI device is purely
> virtual.

I think that would be fine.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 01/17] pci: msi: pass pdev to pci_enable_msi() function
  2023-10-12 22:09 ` [PATCH v10 01/17] pci: msi: pass pdev to pci_enable_msi() function Volodymyr Babchuk
  2023-10-30 15:55   ` Jan Beulich
@ 2023-11-17 13:59   ` Roger Pau Monné
  1 sibling, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-17 13:59 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

On Thu, Oct 12, 2023 at 10:09:14PM +0000, Volodymyr Babchuk wrote:
> Previously the pci_enable_msi() function obtained the pdev pointer by
> itself, but in light of the upcoming changes to PCI locking it is
> better for the caller to pass an already acquired pdev pointer to the
> function.

A bit more detail into why this matters for the upcoming locking
change would be useful here.

> Note that the ns16550 driver does not check the validity of the obtained
> pdev pointer because pci_enable_msi() already does this.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> ---
> Changes in v10:
> 
>  - New in v10. This is the result of discussion in "vpci: add initial
>  support for virtual PCI bus topology"
> ---
>  xen/arch/x86/include/asm/msi.h |  3 ++-
>  xen/arch/x86/irq.c             |  2 +-
>  xen/arch/x86/msi.c             | 19 ++++++++++---------
>  xen/drivers/char/ns16550.c     |  4 +++-
>  4 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
> index a53ade95c9..836c8cd4ba 100644
> --- a/xen/arch/x86/include/asm/msi.h
> +++ b/xen/arch/x86/include/asm/msi.h
> @@ -81,7 +81,8 @@ struct irq_desc;
>  struct hw_interrupt_type;
>  struct msi_desc;
>  /* Helper functions */
> -extern int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc);
> +extern int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
> +			  struct pci_dev *pdev);

Hard tabs (here and below).

I agree with Jan that it might be better for pdev to be the first
parameter.

Otherwise seems fine if the pdev is already in the caller context, as
we avoid an extra list walk.
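The shape of that refactor — dropping the internal lookup in favour of a caller-supplied pointer, while keeping the validity check inside the callee — can be sketched with a toy device list. The names here are illustrative, not the real Xen pci_enable_msi():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pci_dev {
    uint32_t sbdf;
    struct pci_dev *next;
};

static struct pci_dev *pdev_list;

/* Old shape: the function walks the device list itself. */
static struct pci_dev *pdev_lookup(uint32_t sbdf)
{
    for ( struct pci_dev *p = pdev_list; p; p = p->next )
        if ( p->sbdf == sbdf )
            return p;
    return NULL;
}

static int enable_msi_old(uint32_t sbdf)
{
    struct pci_dev *pdev = pdev_lookup(sbdf);   /* extra list walk */

    return pdev ? 0 : -1;
}

/* New shape: the caller, which already holds pdev, passes it in. */
static int enable_msi_new(struct pci_dev *pdev)
{
    return pdev ? 0 : -1;   /* validity still checked in the callee */
}
```

Callers such as the ns16550 driver can then rely on the callee's NULL check instead of repeating it, which is the point made in the commit message.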

Thanks, Roger.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-17  0:58         ` Stefano Stabellini
@ 2023-11-17 14:09           ` Volodymyr Babchuk
  2023-11-17 18:30             ` Julien Grall
  0 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-11-17 14:09 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, George Dunlap, Jan Beulich, Wei Liu,
	Roger Pau Monné,
	xen-devel


Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> > I still think, no matter the BDF allocation scheme, that we should try
>> > to avoid as much as possible to have two different PCI Root Complex
>> > emulators. Ideally we would have only one PCI Root Complex emulated by
>> > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>> > tolerable but not ideal.
>> 
>> But what exactly is wrong with this setup?
>
> [...]
>
>> > The worst case I would like to avoid is to have
>> > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>> 
>> This is how our setup works right now.
>
> If we have:
> - a single PCI Root Complex emulated in Xen
> - Xen is safety certified
> - individual Virtio devices emulated by QEMU with grants for memory
>
> We can go very far in terms of being able to use Virtio in safety
> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>
> On the other hand if we put an additional Root Complex in QEMU:
> - we pay a price in terms of complexity of the codebase
> - we pay a price in terms of resource utilization
> - we have one additional problem in terms of using this setup with a
>   SafeOS (one more device emulated by a non-safe component)
>
> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> solution because:
> - we still pay a price in terms of resource utilization
> - the code complexity goes up a bit but hopefully not by much
> - there is no impact on safety compared to the ideal scenario
>
> This is why I wrote that it is tolerable.

Ah, I see now. Yes, I agree with this. I also want to add some more
points:

- There is ongoing work on implementing virtio backends as separate
  applications, written in Rust. Linaro is doing this part. Right now
  they are implementing only virtio-mmio, but if they want to provide
  virtio-pci as well, they will need a mechanism to plug in only
  virtio-pci, without a Root Complex. This is an argument for using a
  single Root Complex emulated in Xen.

- As far as I know (actually, Oleksandr told this to me), QEMU has no
  mechanism for exposing virtio-pci backends without exposing a PCI Root
  Complex as well. Architecturally, there should be a PCI bus to which
  virtio-pci devices are connected, or we need to make some changes to
  QEMU internals to be able to create virtio-pci backends that are not
  connected to any bus. Also, an added benefit is that the PCI Root
  Complex emulator in QEMU handles legacy PCI interrupts for us. This is
  an argument for a separate Root Complex for QEMU.

As right now the only virtio-pci backends we have are provided by QEMU
and this setup is already working, I propose to stick to this solution,
especially taking into account that it does not require any changes to
the hypervisor code.

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 02/17] pci: introduce per-domain PCI rwlock
  2023-10-12 22:09 ` [PATCH v10 02/17] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
@ 2023-11-17 14:33   ` Roger Pau Monné
  0 siblings, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-17 14:33 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu,
	Paul Durrant, Kevin Tian

On Thu, Oct 12, 2023 at 10:09:15PM +0000, Volodymyr Babchuk wrote:
> Add a per-domain d->pci_lock that protects access to d->pdev_list.
> The purpose of this lock is to guarantee to the vPCI code that the
> underlying pdev will not disappear out from under it. This is a
> rw-lock, but this patch adds only write_lock()s; there will be
> read_lock() users in the next patches.
> 
> This lock should be taken in write mode every time d->pdev_list is
> altered. All write accesses should also be protected by pcidevs_lock().
> The idea is that any user that wants read access to the list or to the
> devices stored in the list should use either this new d->pci_lock or
> the old pcidevs_lock(). Holding either of these two locks ensures only
> that the pdev of interest will not disappear from under our feet and
> that the pdev will still be assigned to the same domain. Of
> course, any new users should use pcidevs_lock() when it is
> appropriate (e.g. when accessing any other state that is protected by
> the said lock). In case both the newly introduced per-domain rwlock
> and the pcidevs lock is taken, the later must be acquired first.
                                     ^ latter

> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

I'm a bit concerned with the logic used in pci_release_devices(), but
I guess it's fine for now as long as the global pcidevs_lock is still
held.
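The locking discipline the commit message describes — write paths take the global pcidevs lock first, then the per-domain lock of each affected domain in write mode — can be modelled with toy locks that record whether they are held, so the ordering rule is asserted explicitly. This is a sketch of the invariant, not the Xen implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock models that only record their held state. */
static bool pcidevs_held;      /* stands in for the global pcidevs_lock() */

struct domain {
    bool pci_wlocked;          /* stands in for d->pci_lock, write mode */
    int pdev_count;            /* stands in for d->pdev_list */
};

static void pcidevs_acquire(void) { pcidevs_held = true; }
static void pcidevs_release(void) { pcidevs_held = false; }

static void pci_write_lock(struct domain *d)
{
    /*
     * Ordering rule from the commit message: write accesses are also
     * protected by pcidevs_lock(), and when both locks are taken the
     * pcidevs lock must be acquired first.
     */
    assert(pcidevs_held);
    d->pci_wlocked = true;
}

static void pci_write_unlock(struct domain *d) { d->pci_wlocked = false; }

/* Move one device between domains, honouring the lock order. */
static void reassign_one(struct domain *source, struct domain *target)
{
    pcidevs_acquire();

    pci_write_lock(source);
    source->pdev_count--;      /* list_del(&pdev->domain_list) */
    pci_write_unlock(source);

    /* pdev->domain = target; happens between the two list operations */

    pci_write_lock(target);
    target->pdev_count++;      /* list_add(&pdev->domain_list, ...) */
    pci_write_unlock(target);

    pcidevs_release();
}
```

Readers that only need a pdev to stay alive would instead take the per-domain lock in read mode, without the global lock, which is why the assertion above applies only to the write paths modelled here.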

> ---
> 
> Changes in v10:
>  - pdev->domain is assigned after removing from source domain but
>    before adding to target domain in reassign_device() functions.
> 
> Changes in v9:
>  - returned back "pdev->domain = target;" in AMD IOMMU code
>  - used "source" instead of pdev->domain in IOMMU functions
>  - added comment about lock ordering in the commit message
>  - reduced locked regions
>  - minor changes non-functional changes in various places
> 
> Changes in v8:
>  - New patch
> 
> Changes in v8 vs RFC:
>  - Removed all read_locks after discussion with Roger in #xendevel
>  - pci_release_devices() now returns the first error code
>  - extended commit message
>  - added missing lock in pci_remove_device()
>  - extended locked region in pci_add_device() to protect list_del() calls
> ---
>  xen/common/domain.c                         |  1 +
>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  9 ++-
>  xen/drivers/passthrough/pci.c               | 71 +++++++++++++++++----
>  xen/drivers/passthrough/vtd/iommu.c         |  9 ++-
>  xen/include/xen/sched.h                     |  1 +
>  5 files changed, 78 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 8f9ab01c0c..785c69e48b 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -651,6 +651,7 @@ struct domain *domain_create(domid_t domid,
>  
>  #ifdef CONFIG_HAS_PCI
>      INIT_LIST_HEAD(&d->pdev_list);
> +    rwlock_init(&d->pci_lock);
>  #endif
>  
>      /* All error paths can depend on the above setup. */
> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> index 836c24b02e..36a617bed4 100644
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -476,8 +476,15 @@ static int cf_check reassign_device(
>  
>      if ( devfn == pdev->devfn && pdev->domain != target )
>      {
> -        list_move(&pdev->domain_list, &target->pdev_list);
> +        write_lock(&source->pci_lock);
> +        list_del(&pdev->domain_list);
> +        write_unlock(&source->pci_lock);
> +
>          pdev->domain = target;
> +
> +        write_lock(&target->pci_lock);
> +        list_add(&pdev->domain_list, &target->pdev_list);
> +        write_unlock(&target->pci_lock);
>      }
>  
>      /*
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 04d00c7c37..b8ad4fa07c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -453,7 +453,9 @@ static void __init _pci_hide_device(struct pci_dev *pdev)
>      if ( pdev->domain )
>          return;
>      pdev->domain = dom_xen;
> +    write_lock(&dom_xen->pci_lock);
>      list_add(&pdev->domain_list, &dom_xen->pdev_list);
> +    write_unlock(&dom_xen->pci_lock);
>  }
>  
>  int __init pci_hide_device(unsigned int seg, unsigned int bus,
> @@ -746,7 +748,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>      if ( !pdev->domain )
>      {
>          pdev->domain = hardware_domain;
> +        write_lock(&hardware_domain->pci_lock);
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> +        write_unlock(&hardware_domain->pci_lock);
>  
>          /*
>           * For devices not discovered by Xen during boot, add vPCI handlers
> @@ -756,7 +760,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
> +            write_lock(&hardware_domain->pci_lock);
>              list_del(&pdev->domain_list);
> +            write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
>              goto out;
>          }
> @@ -764,7 +770,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              vpci_remove_device(pdev);
> +            write_lock(&hardware_domain->pci_lock);
>              list_del(&pdev->domain_list);
> +            write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
>              goto out;
>          }
> @@ -814,7 +822,11 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>              pci_cleanup_msi(pdev);
>              ret = iommu_remove_device(pdev);
>              if ( pdev->domain )
> +            {
> +                write_lock(&pdev->domain->pci_lock);
>                  list_del(&pdev->domain_list);
> +                write_unlock(&pdev->domain->pci_lock);
> +            }
>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>              free_pdev(pseg, pdev);
>              break;
> @@ -885,26 +897,61 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>  
>  int pci_release_devices(struct domain *d)
>  {
> -    struct pci_dev *pdev, *tmp;
> -    u8 bus, devfn;
> -    int ret;
> +    int combined_ret;
> +    LIST_HEAD(failed_pdevs);
>  
>      pcidevs_lock();
> -    ret = arch_pci_clean_pirqs(d);
> -    if ( ret )
> +
> +    combined_ret = arch_pci_clean_pirqs(d);
> +    if ( combined_ret )
>      {
>          pcidevs_unlock();
> -        return ret;
> +        return combined_ret;
>      }
> -    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
> +
> +    write_lock(&d->pci_lock);

Strictly speaking this could be a read_lock, since you are not
modifying the list here, just getting an element out of it.  I see
however that the latter half of the loop does require the lock in write
mode for altering the domain pdev list, and hence it might be clearer to
just use the lock in write mode all along.

> +
> +    while ( !list_empty(&d->pdev_list) )
>      {
> -        bus = pdev->bus;
> -        devfn = pdev->devfn;
> -        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
> +        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
> +                                                struct pci_dev,
> +                                                domain_list);
> +        uint16_t seg = pdev->seg;
> +        uint8_t bus = pdev->bus;
> +        uint8_t devfn = pdev->devfn;

What's the point of those local variables?  They are used only once,
and getting them is trivial.  Is this protection against 'pdev' being
removed since we no longer hold the per-domain lock?

I don't much like dropping the lock in the middle of a loop, as I
think it's dangerous, but I don't have a much better suggestion here.

One thing that we might look into is to move the whole device list to
a local variable under the per domain pci lock, and then iterate over
that list without requiring the per domain lock to be taken.
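That last idea — detach the whole list under the lock, then iterate over the local copy lock-free — could look roughly like this, with a toy doubly linked list standing in for Xen's list_head. It assumes no one re-adds entries concurrently while the local list is walked:

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }
static int  list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Move every entry of 'from' onto 'to', leaving 'from' empty.
 * This is the step that would run under d->pci_lock in write mode. */
static void list_move_all(struct list_head *from, struct list_head *to)
{
    if ( list_empty(from) )
    {
        list_init(to);
        return;
    }
    *to = *from;           /* 'to' adopts from's first and last entries */
    to->next->prev = to;   /* ...which now point back at 'to' */
    to->prev->next = to;
    list_init(from);
}

static int count(const struct list_head *h)
{
    int n = 0;

    for ( const struct list_head *p = h->next; p != h; p = p->next )
        n++;
    return n;
}
```

In pci_release_devices() the splice would happen with d->pci_lock write-locked; the deassign loop could then walk the local list with the lock dropped, re-adding any failed devices to the domain list at the end.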

Thanks, Roger.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure
  2023-10-12 22:09 ` [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
  2023-11-03 15:39   ` Stewart Hildebrand
@ 2023-11-17 15:16   ` Roger Pau Monné
  2023-11-28 22:24     ` Volodymyr Babchuk
  1 sibling, 1 reply; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-17 15:16 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Jan Beulich, Andrew Cooper, Wei Liu, Jun Nakajima, Kevin Tian,
	Paul Durrant

On Thu, Oct 12, 2023 at 10:09:15PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Use the previously introduced per-domain read/write lock to check
> whether vpci is present, so we can be sure there are no accesses to
> the contents of the vpci struct if it is not. This lock can be used
> (and in a few cases is used right away) so that vpci removal can be
> performed while holding the lock in write mode. Previously such
> removal could race with vpci_read(), for example.
> 
> When taking both d->pci_lock and pdev->vpci->lock, they should be
> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
> possible deadlock situations.
> 
> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock instead. To prevent
> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
> always lock hwdom first.
> 
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
> 
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
> 
> 4. Do not call process_pending_softirqs() with any locks held. To that
> end, unlock prior to the call and re-acquire the locks afterwards. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made
> 
> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> 6. We are removing multiple ASSERT(pcidevs_locked()) instances because
> they are too strict now: they should be corrected to
> ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock)), but the problem
> is that the mentioned instances do not have access to the domain
> pointer and it is not feasible to pass a domain pointer to a function
> just for debugging purposes.
> 
> There is a possible lock inversion in MSI code, as some parts of it
> acquire pcidevs_lock() while already holding d->pci_lock.

Is this going to be addressed in a further patch?

> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> ---
> Changes in v10:
>  - Moved printk past the locked area
>  - Returned back ASSERTs
>  - Added new parameter to allocate_and_map_msi_pirq() so it knows if
>  it should take the global pci lock
>  - Added comment about possible improvement in vpci_write
>  - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
>    appropriate places
>  - Renamed release_domain_locks() to release_domain_write_locks()
>  - moved domain_done label in vpci_dump_msi() to correct place
> Changes in v9:
>  - extended locked region to protect vpci_remove_device and
>    vpci_add_handlers() calls
>  - vpci_write() takes lock in the write mode to protect
>    potential call to modify_bars()
>  - renamed lock releasing function
>  - removed ASSERT()s from msi code
>  - added trylock in vpci_dump_msi
> 
> Changes in v8:
>  - changed d->vpci_lock to d->pci_lock
>  - introducing d->pci_lock in a separate patch
>  - extended locked region in vpci_process_pending
>  - removed pcidevs_lock() in vpci_dump_msi()
>  - removed some changes as they are not needed with
>    the new locking scheme
>  - added handling for hwdom && dom_xen case
> ---
>  xen/arch/x86/hvm/vmsi.c        | 26 ++++++++---------
>  xen/arch/x86/hvm/vmx/vmx.c     |  2 --
>  xen/arch/x86/include/asm/irq.h |  3 +-
>  xen/arch/x86/irq.c             | 12 ++++----
>  xen/arch/x86/msi.c             | 10 ++-----
>  xen/arch/x86/physdev.c         |  2 +-
>  xen/drivers/passthrough/pci.c  |  9 +++---
>  xen/drivers/vpci/header.c      | 18 ++++++++++++
>  xen/drivers/vpci/msi.c         | 28 ++++++++++++++++--
>  xen/drivers/vpci/msix.c        | 52 +++++++++++++++++++++++++++++-----
>  xen/drivers/vpci/vpci.c        | 51 +++++++++++++++++++++++++++++++--
>  11 files changed, 166 insertions(+), 47 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 128f236362..6b33a80120 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>      struct msixtbl_entry *entry, *new_entry;
>      int r = -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>      struct pci_dev *pdev;
>      struct msixtbl_entry *entry;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
>  {
>      unsigned int i;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
>      if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
>      {
> @@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>      int rc;
>  
>      ASSERT(msi->arch.pirq != INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> -    pcidevs_lock();
>      for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq unbind = {
> @@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>  
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
>                                         msi->vectors, msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  }
>  
>  static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
> @@ -762,7 +761,7 @@ static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
>      rc = allocate_and_map_msi_pirq(pdev->domain, -1, &pirq,
>                                     table_base ? MAP_PIRQ_TYPE_MSI
>                                                : MAP_PIRQ_TYPE_MULTI_MSI,
> -                                   &msi_info);
> +                                   &msi_info, false);
>      if ( rc )
>      {
>          gdprintk(XENLOG_ERR, "%pp: failed to map PIRQ: %d\n", &pdev->sbdf, rc);
> @@ -778,15 +777,13 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
>      int rc;
>  
>      ASSERT(msi->arch.pirq == INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>      rc = vpci_msi_enable(pdev, vectors, 0);
>      if ( rc < 0 )
>          return rc;
>      msi->arch.pirq = rc;
> -
> -    pcidevs_lock();
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
>                                         msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  
>      return 0;
>  }
> @@ -797,8 +794,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>      unsigned int i;
>  
>      ASSERT(pirq != INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> -    pcidevs_lock();
>      for ( i = 0; i < nr && bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq bind = {
> @@ -814,7 +811,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>      write_lock(&pdev->domain->event_lock);
>      unmap_domain_pirq(pdev->domain, pirq);
>      write_unlock(&pdev->domain->event_lock);
> -    pcidevs_unlock();
>  }
>  
>  void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
> @@ -854,6 +850,8 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>      int rc;
>  
>      ASSERT(entry->arch.pirq == INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> +
>      rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
>                           table_base);
>      if ( rc < 0 )
> @@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>  
>      entry->arch.pirq = rc;
>  
> -    pcidevs_lock();
>      rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
>                           entry->masked);
>      if ( rc )
> @@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>          vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
>          entry->arch.pirq = INVALID_PIRQ;
>      }
> -    pcidevs_unlock();
>  
>      return rc;
>  }
> @@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  {
>      unsigned int i;
>  
> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
> +
>      for ( i = 0; i < msix->max_entries; i++ )
>      {
>          const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>  
>              spin_unlock(&msix->pdev->vpci->lock);
> +            read_unlock(&pdev->domain->pci_lock);
>              process_pending_softirqs();
> +            read_lock(&pdev->domain->pci_lock);
>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 1edc7f1e91..545a27796e 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -413,8 +413,6 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>  
>      spin_unlock_irq(&desc->lock);
>  
> -    ASSERT(pcidevs_locked());
> -
>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>  
>   unlock_out:
> diff --git a/xen/arch/x86/include/asm/irq.h b/xen/arch/x86/include/asm/irq.h
> index ad907fc97f..3d24f39ca6 100644
> --- a/xen/arch/x86/include/asm/irq.h
> +++ b/xen/arch/x86/include/asm/irq.h
> @@ -213,6 +213,7 @@ static inline void arch_move_irqs(struct vcpu *v) { }
>  struct msi_info;
>  int allocate_and_map_gsi_pirq(struct domain *d, int index, int *pirq_p);
>  int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> -                              int type, struct msi_info *msi);
> +                              int type, struct msi_info *msi,
> +			      bool use_pci_lock);

Indentation using hard tabs.

>  
>  #endif /* _ASM_HW_IRQ_H */
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index 68b788c42e..970ba04aa0 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2157,7 +2157,7 @@ int map_domain_pirq(
>          struct pci_dev *pdev;
>          unsigned int nr = 0;
>  
> -        ASSERT(pcidevs_locked());
> +        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>  
>          ret = -ENODEV;
>          if ( !cpu_has_apic )
> @@ -2314,7 +2314,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
>      if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
>          return -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      info = pirq_info(d, pirq);
> @@ -2875,7 +2875,7 @@ int allocate_and_map_gsi_pirq(struct domain *d, int index, int *pirq_p)
>  }
>  
>  int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> -                              int type, struct msi_info *msi)
> +                              int type, struct msi_info *msi, bool use_pci_lock)
>  {
>      int irq, pirq, ret;
>  
> @@ -2908,7 +2908,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>      msi->irq = irq;
>  
> -    pcidevs_lock();
> +    if ( use_pci_lock )
> +        pcidevs_lock();

Instead of passing the flag it might be better if the caller takes
the lock, so as to avoid having to pass an extra parameter.

Then we should also assert that either pcidevs_lock or the
per-domain pci lock is taken?

>      /* Verify or get pirq. */
>      write_lock(&d->event_lock);
>      pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
> @@ -2924,7 +2925,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>   done:
>      write_unlock(&d->event_lock);
> -    pcidevs_unlock();
> +    if ( use_pci_lock )
> +        pcidevs_unlock();
>      if ( ret )
>      {
>          switch ( type )
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 20275260b3..466725d8ca 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
>      unsigned int i, mpos;
>      uint16_t control;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>      pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
>      if ( !pos )
>          return -ENODEV;
> @@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
>      if ( !pos )
>          return -ENODEV;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>  
>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>      /*
> @@ -988,8 +988,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
>  {
>      struct msi_desc *old_desc;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev )
>          return -ENODEV;
>  
> @@ -1043,8 +1041,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc,
>  {
>      struct msi_desc *old_desc;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev || !pdev->msix )
>          return -ENODEV;
>  
> @@ -1154,8 +1150,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
>  		   struct pci_dev *pdev)
>  {
> -    ASSERT(pcidevs_locked());
> -

If you have the pdev in all the above functions, you could expand the
assert to test for pdev->domain->pci_lock?

>      if ( !use_msi )
>          return -EPERM;
>  
> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
> index 2f1d955a96..7cbb5bc2c8 100644
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -123,7 +123,7 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>  
>      case MAP_PIRQ_TYPE_MSI:
>      case MAP_PIRQ_TYPE_MULTI_MSI:
> -        ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> +        ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi, true);
>          break;
>  
>      default:
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index b8ad4fa07c..182da45acb 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          pdev->domain = hardware_domain;
>          write_lock(&hardware_domain->pci_lock);
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> -        write_unlock(&hardware_domain->pci_lock);
>  
>          /*
>           * For devices not discovered by Xen during boot, add vPCI handlers
> @@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          ret = vpci_add_handlers(pdev);
>          if ( ret )
>          {
> -            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
> -            write_lock(&hardware_domain->pci_lock);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
> +            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>              goto out;
>          }
> +        write_unlock(&hardware_domain->pci_lock);
>          ret = iommu_add_device(pdev);
>          if ( ret )
>          {
> -            vpci_remove_device(pdev);
>              write_lock(&hardware_domain->pci_lock);
> +            vpci_remove_device(pdev);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
> @@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>  
> +    write_lock(&ctxt->d->pci_lock);
>      err = vpci_add_handlers(pdev);
> +    write_unlock(&ctxt->d->pci_lock);
>      if ( err )
>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>                 ctxt->d->domain_id, err);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 767c1ba718..a52e52db96 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -172,6 +172,7 @@ bool vpci_process_pending(struct vcpu *v)
>          if ( rc == -ERESTART )
>              return true;
>  
> +        write_lock(&v->domain->pci_lock);
>          spin_lock(&v->vpci.pdev->vpci->lock);
>          /* Disable memory decoding unconditionally on failure. */
>          modify_decoding(v->vpci.pdev,
> @@ -190,6 +191,7 @@ bool vpci_process_pending(struct vcpu *v)
>               * failure.
>               */
>              vpci_remove_device(v->vpci.pdev);
> +        write_unlock(&v->domain->pci_lock);
>      }
>  
>      return false;
> @@ -201,8 +203,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      struct map_data data = { .d = d, .map = true };
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&d->pci_lock));
> +
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    {
> +        /*
> +         * It's safe to drop and reacquire the lock in this context
> +         * without risking pdev disappearing because devices cannot be
> +         * removed until the initial domain has been started.
> +         */
> +        read_unlock(&d->pci_lock);
>          process_pending_softirqs();
> +        read_lock(&d->pci_lock);

You are asserting the lock is taken in write mode just above the usage
of read_{un,}lock().  Either the assert is wrong, or the usage of
read_{un,}lock() is wrong.

> +    }
> +
>      rangeset_destroy(mem);
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
> @@ -243,6 +257,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      unsigned int i;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !mem )
>          return -ENOMEM;
>  
> @@ -522,6 +538,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_bar *bars = header->bars;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index a253ccbd7d..2faa54b7ce 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>  
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>  
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -275,6 +275,9 @@ void vpci_dump_msi(void)
>  
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>  
> +        if ( !read_trylock(&d->pci_lock) )
> +            continue;
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -316,14 +319,33 @@ void vpci_dump_msi(void)
>                       * holding the lock.
>                       */
>                      printk("unable to print all MSI-X entries: %d\n", rc);
> -                    process_pending_softirqs();
> -                    continue;
> +                    goto pdev_done;
>                  }
>              }
>  
>              spin_unlock(&pdev->vpci->lock);
> + pdev_done:
> +            /*
> +             * Unlock lock to process pending softirqs. This is
> +             * potentially unsafe, as d->pdev_list can be changed in
> +             * meantime.
> +             */
> +            read_unlock(&d->pci_lock);
>              process_pending_softirqs();
> +            if ( !read_trylock(&d->pci_lock) )
> +            {
> +                printk("unable to access other devices for the domain\n");
> +                goto domain_done;
> +            }
>          }
> +        read_unlock(&d->pci_lock);
> +    domain_done:

Weird label indentation.

> +        /*
> +         * We need this label at the end of the loop, but some
> +         * compilers might not be happy about label at the end of the
> +         * compound statement so we adding an empty statement here.
> +         */
> +        ;
>      }
>      rcu_read_unlock(&domlist_read_lock);
>  }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index d1126a417d..b6abab47ef 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>      {
>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  
>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    int rc;
> +
> +    read_lock(&v->domain->pci_lock);
> +    rc = !!msix_find(v->domain, addr);
> +    read_unlock(&v->domain->pci_lock);
> +
> +    return rc;
>  }
>  
>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_read(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
>      *data = ~0UL;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_read(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_read(d, msix, addr, len, data);
> +
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -404,6 +426,7 @@ static int cf_check msix_read(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_write(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_write(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_write(d, msix, addr, len, data);
> +
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -579,6 +616,7 @@ static int cf_check msix_write(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 3bec9a4153..112de56fb3 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>  
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>          return;
>  
> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      const unsigned long *ro_map;
>      int rc = 0;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>  
> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
>      uint32_t data = ~(uint32_t)0;
> +    rwlock_t *lock;
>  
>      if ( !size )
>      {
> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>       * Find the PCI dev matching the address, which for hwdom also requires
>       * consulting DomXEN.  Passthrough everything that's not trapped.
>       */
> +    lock = &d->pci_lock;
> +    read_lock(lock);
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
> +    {
> +        read_unlock(lock);
> +        lock = &dom_xen->pci_lock;
> +        read_lock(lock);
>          pdev = pci_get_pdev(dom_xen, sbdf);

I'm unsure whether devices assigned to dom_xen can change ownership
after boot, so maybe there's no need for all this lock dance, as the
device cannot disappear?

Maybe just taking the hardware domain lock is enough to prevent
concurrent accesses in that case, as the hardware domain is the only
one allowed to access devices owned by dom_xen.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-17 14:09           ` Volodymyr Babchuk
@ 2023-11-17 18:30             ` Julien Grall
  2023-11-17 20:08               ` Volodymyr Babchuk
  0 siblings, 1 reply; 65+ messages in thread
From: Julien Grall @ 2023-11-17 18:30 UTC (permalink / raw)
  To: Volodymyr Babchuk, Stefano Stabellini
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Jan Beulich, Wei Liu, Roger Pau Monné,
	xen-devel

Hi Volodymyr,

On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> 
> Hi Stefano,
> 
> Stefano Stabellini <sstabellini@kernel.org> writes:
> 
>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>>>> I still think, no matter the BDF allocation scheme, that we should try
>>>> to avoid as much as possible to have two different PCI Root Complex
>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>>>> tolerable but not ideal.
>>>
>>> But what is exactly wrong with this setup?
>>
>> [...]
>>
>>>> The worst case I would like to avoid is to have
>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>>>
>>> This is how our setup works right now.
>>
>> If we have:
>> - a single PCI Root Complex emulated in Xen
>> - Xen is safety certified
>> - individual Virtio devices emulated by QEMU with grants for memory
>>
>> We can go very far in terms of being able to use Virtio in safety
>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>>
>> On the other hand if we put an additional Root Complex in QEMU:
>> - we pay a price in terms of complexity of the codebase
>> - we pay a price in terms of resource utilization
>> - we have one additional problem in terms of using this setup with a
>>    SafeOS (one more device emulated by a non-safe component)
>>
>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
>> solution because:
>> - we still pay a price in terms of resource utilization
>> - the code complexity goes up a bit but hopefully not by much
>> - there is no impact on safety compared to the ideal scenario
>>
>> This is why I wrote that it is tolerable.
> 
> Ah, I see now. Yes, I agree with this. I also want to add some more
> points:
> 
> - There is ongoing work on implementing virtio backends as a separate
>    applications, written in Rust. Linaro are doing this part. Right now
>    they are implementing only virtio-mmio, but if they want to provide
>    virtio-pci as well, they will need a mechanism to plug only
>    virtio-pci, without Root Complex. This is argument for using single Root
>    Complex emulated in Xen.
> 
> - As far as I know (actually, Oleksandr told this to me), QEMU has no
>    mechanism for exposing virtio-pci backends without exposing PCI root
>    complex as well. Architecturally, there should be a PCI bus to which
>    virtio-pci devices are connected. Or we need to make some changes to
>    QEMU internals to be able to create virtio-pci backends that are not
>    connected to any bus. Also, added benefit that PCI Root Complex
>    emulator in QEMU handles legacy PCI interrupts for us. This is
>    argument for separate Root Complex for QEMU.
> 
> As right now we have only virtio-pci backends provided by QEMU and this
> setup is already working, I propose to stick to this
> solution. Especially, taking into account that it does not require any
> changes to hypervisor code.

I am not against two hostbridges as a temporary solution as long as this 
is not a one-way-door decision. I am not concerned about the hypervisor 
itself, I am more concerned about the interface exposed by the toolstack 
and QEMU.

To clarify, I don't particularly want to have to maintain the two 
hostbridges solution once we can use a single hostbridge. So we need to 
be able to get rid of it without impacting the interface too much.

Cheers,

-- 
Julien Grall



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-17 18:30             ` Julien Grall
@ 2023-11-17 20:08               ` Volodymyr Babchuk
  2023-11-17 21:43                 ` Stefano Stabellini
  0 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-11-17 20:08 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, George Dunlap, Jan Beulich, Wei Liu,
	Roger Pau Monné,
	xen-devel


Hi Julien,

Julien Grall <julien@xen.org> writes:

> Hi Volodymyr,
>
> On 17/11/2023 14:09, Volodymyr Babchuk wrote:
>> Hi Stefano,
>> Stefano Stabellini <sstabellini@kernel.org> writes:
>> 
>>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>>>>> I still think, no matter the BDF allocation scheme, that we should try
>>>>> to avoid as much as possible to have two different PCI Root Complex
>>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
>>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>>>>> tolerable but not ideal.
>>>>
>>>> But what is exactly wrong with this setup?
>>>
>>> [...]
>>>
>>>>> The worst case I would like to avoid is to have
>>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>>>>
>>>> This is how our setup works right now.
>>>
>>> If we have:
>>> - a single PCI Root Complex emulated in Xen
>>> - Xen is safety certified
>>> - individual Virtio devices emulated by QEMU with grants for memory
>>>
>>> We can go very far in terms of being able to use Virtio in safety
>>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>>>
>>> On the other hand if we put an additional Root Complex in QEMU:
>>> - we pay a price in terms of complexity of the codebase
>>> - we pay a price in terms of resource utilization
>>> - we have one additional problem in terms of using this setup with a
>>>    SafeOS (one more device emulated by a non-safe component)
>>>
>>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
>>> solution because:
>>> - we still pay a price in terms of resource utilization
>>> - the code complexity goes up a bit but hopefully not by much
>>> - there is no impact on safety compared to the ideal scenario
>>>
>>> This is why I wrote that it is tolerable.
>> Ah, I see now. Yes, I am agree with this. Also I want to add some
>> more
>> points:
>> - There is ongoing work on implementing virtio backends as a
>> separate
>>    applications, written in Rust. Linaro are doing this part. Right now
>>    they are implementing only virtio-mmio, but if they want to provide
>>    virtio-pci as well, they will need a mechanism to plug only
>>    virtio-pci, without Root Complex. This is argument for using single Root
>>    Complex emulated in Xen.
>> - As far as I know (actually, Oleksandr told this to me), QEMU has
>> no
>>    mechanism for exposing virtio-pci backends without exposing PCI root
>>    complex as well. Architecturally, there should be a PCI bus to which
>>    virtio-pci devices are connected. Or we need to make some changes to
>>    QEMU internals to be able to create virtio-pci backends that are not
>>    connected to any bus. Also, added benefit that PCI Root Complex
>>    emulator in QEMU handles legacy PCI interrupts for us. This is
>>    argument for separate Root Complex for QEMU.
>> As right now we have only virtio-pci backends provided by QEMU and
>> this
>> setup is already working, I propose to stick to this
>> solution. Especially, taking into account that it does not require any
>> changes to hypervisor code.
>
> I am not against two hostbridge as a temporary solution as long as
> this is not a one way door decision. I am not concerned about the
> hypervisor itself, I am more concerned about the interface exposed by
> the toolstack and QEMU.
>
> To clarify, I don't particular want to have to maintain the two
> hostbridges solution once we can use a single hostbridge. So we need
> to be able to get rid of it without impacting the interface too much.

This depends on virtio-pci backend availability. AFAIK, right now the
only option is to use QEMU, and QEMU provides its own host bridge. So if
we want to get rid of the second host bridge we need either another
virtio-pci backend, or we need to alter the QEMU code so it can live
without a host bridge.

As for interfaces, it appears that the QEMU case does not require any
changes to the hypervisor itself; it just boils down to writing a couple
of xenstore entries and spawning QEMU with the correct command line
arguments.

From the user perspective, all this is configured via an xl.conf entry like

virtio=[
'backend=DomD, type=virtio,device, transport=pci, bdf=0000:00:01.0, grant_usage=1, backend_type=qemu',
]

In the future we can add backend_type=standalone for non-QEMU-based
backends. If there are no QEMU-based backends, there will be no second
host bridge.

-- 
WBR, Volodymyr


* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-17 20:08               ` Volodymyr Babchuk
@ 2023-11-17 21:43                 ` Stefano Stabellini
  2023-11-17 22:22                   ` Volodymyr Babchuk
  0 siblings, 1 reply; 65+ messages in thread
From: Stefano Stabellini @ 2023-11-17 21:43 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Julien Grall, Stefano Stabellini, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Wei Liu, Roger Pau Monné,
	xen-devel

On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> Hi Julien,
> 
> Julien Grall <julien@xen.org> writes:
> 
> > Hi Volodymyr,
> >
> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> >> Hi Stefano,
> >> Stefano Stabellini <sstabellini@kernel.org> writes:
> >> 
> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >>>>> I still think, no matter the BDF allocation scheme, that we should try
> >>>>> to avoid as much as possible to have two different PCI Root Complex
> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> >>>>> tolerable but not ideal.
> >>>>
> >>>> But what is exactly wrong with this setup?
> >>>
> >>> [...]
> >>>
> >>>>> The worst case I would like to avoid is to have
> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> >>>>
> >>>> This is how our setup works right now.
> >>>
> >>> If we have:
> >>> - a single PCI Root Complex emulated in Xen
> >>> - Xen is safety certified
> >>> - individual Virtio devices emulated by QEMU with grants for memory
> >>>
> >>> We can go very far in terms of being able to use Virtio in safety
> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
> >>>
> >>> On the other hand if we put an additional Root Complex in QEMU:
> >>> - we pay a price in terms of complexity of the codebase
> >>> - we pay a price in terms of resource utilization
> >>> - we have one additional problem in terms of using this setup with a
> >>>    SafeOS (one more device emulated by a non-safe component)
> >>>
> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> >>> solution because:
> >>> - we still pay a price in terms of resource utilization
> >>> - the code complexity goes up a bit but hopefully not by much
> >>> - there is no impact on safety compared to the ideal scenario
> >>>
> >>> This is why I wrote that it is tolerable.
> >> Ah, I see now. Yes, I am agree with this. Also I want to add some
> >> more
> >> points:
> >> - There is ongoing work on implementing virtio backends as a
> >> separate
> >>    applications, written in Rust. Linaro are doing this part. Right now
> >>    they are implementing only virtio-mmio, but if they want to provide
> >>    virtio-pci as well, they will need a mechanism to plug only
> >>    virtio-pci, without Root Complex. This is argument for using single Root
> >>    Complex emulated in Xen.
> >> - As far as I know (actually, Oleksandr told this to me), QEMU has
> >> no
> >>    mechanism for exposing virtio-pci backends without exposing PCI root
> >>    complex as well. Architecturally, there should be a PCI bus to which
> >>    virtio-pci devices are connected. Or we need to make some changes to
> >>    QEMU internals to be able to create virtio-pci backends that are not
> >>    connected to any bus. Also, added benefit that PCI Root Complex
> >>    emulator in QEMU handles legacy PCI interrupts for us. This is
> >>    argument for separate Root Complex for QEMU.
> >> As right now we have only virtio-pci backends provided by QEMU and
> >> this
> >> setup is already working, I propose to stick to this
> >> solution. Especially, taking into account that it does not require any
> >> changes to hypervisor code.
> >
> > I am not against two hostbridge as a temporary solution as long as
> > this is not a one way door decision. I am not concerned about the
> > hypervisor itself, I am more concerned about the interface exposed by
> > the toolstack and QEMU.

I agree with this...


> > To clarify, I don't particular want to have to maintain the two
> > hostbridges solution once we can use a single hostbridge. So we need
> > to be able to get rid of it without impacting the interface too much.

...and this


> This depends on virtio-pci backends availability. AFAIK, now only one
> option is to use QEMU and QEMU provides own host bridge. So if we want
> get rid of the second host bridge we need either another virtio-pci
> backend or we need to alter QEMU code so it can live without host
> bridge.
> 
> As for interfaces, it appears that QEMU case does not require any changes
> into hypervisor itself, it just boils down to writing couple of xenstore
> entries and spawning QEMU with correct command line arguments.

One important thing Stewart wrote in his reply: it doesn't matter if
QEMU thinks it is emulating a PCI Root Complex, because that's required
from QEMU's point of view to emulate an individual PCI device.

If we can arrange it so the QEMU PCI Root Complex is not registered
against Xen as part of the ioreq interface, then QEMU's emulated PCI
Root Complex is going to be left unused. I think that would be great
because we still have a clean QEMU-Xen-tools interface and the only
downside is some extra unused emulation in QEMU. It would be a
fantastic starting point.



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-17 21:43                 ` Stefano Stabellini
@ 2023-11-17 22:22                   ` Volodymyr Babchuk
  2023-11-18  0:45                     ` Stefano Stabellini
  0 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-11-17 22:22 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, George Dunlap, Jan Beulich, Wei Liu,
	Roger Pau Monné,
	xen-devel


Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> Hi Julien,
>> 
>> Julien Grall <julien@xen.org> writes:
>> 
>> > Hi Volodymyr,
>> >
>> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
>> >> Hi Stefano,
>> >> Stefano Stabellini <sstabellini@kernel.org> writes:
>> >> 
>> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> >>>>> I still think, no matter the BDF allocation scheme, that we should try
>> >>>>> to avoid as much as possible to have two different PCI Root Complex
>> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
>> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>> >>>>> tolerable but not ideal.
>> >>>>
>> >>>> But what is exactly wrong with this setup?
>> >>>
>> >>> [...]
>> >>>
>> >>>>> The worst case I would like to avoid is to have
>> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>> >>>>
>> >>>> This is how our setup works right now.
>> >>>
>> >>> If we have:
>> >>> - a single PCI Root Complex emulated in Xen
>> >>> - Xen is safety certified
>> >>> - individual Virtio devices emulated by QEMU with grants for memory
>> >>>
>> >>> We can go very far in terms of being able to use Virtio in safety
>> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>> >>>
>> >>> On the other hand if we put an additional Root Complex in QEMU:
>> >>> - we pay a price in terms of complexity of the codebase
>> >>> - we pay a price in terms of resource utilization
>> >>> - we have one additional problem in terms of using this setup with a
>> >>>    SafeOS (one more device emulated by a non-safe component)
>> >>>
>> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
>> >>> solution because:
>> >>> - we still pay a price in terms of resource utilization
>> >>> - the code complexity goes up a bit but hopefully not by much
>> >>> - there is no impact on safety compared to the ideal scenario
>> >>>
>> >>> This is why I wrote that it is tolerable.
>> >> Ah, I see now. Yes, I am agree with this. Also I want to add some
>> >> more
>> >> points:
>> >> - There is ongoing work on implementing virtio backends as a
>> >> separate
>> >>    applications, written in Rust. Linaro are doing this part. Right now
>> >>    they are implementing only virtio-mmio, but if they want to provide
>> >>    virtio-pci as well, they will need a mechanism to plug only
>> >>    virtio-pci, without Root Complex. This is argument for using single Root
>> >>    Complex emulated in Xen.
>> >> - As far as I know (actually, Oleksandr told this to me), QEMU has
>> >> no
>> >>    mechanism for exposing virtio-pci backends without exposing PCI root
>> >>    complex as well. Architecturally, there should be a PCI bus to which
>> >>    virtio-pci devices are connected. Or we need to make some changes to
>> >>    QEMU internals to be able to create virtio-pci backends that are not
>> >>    connected to any bus. Also, added benefit that PCI Root Complex
>> >>    emulator in QEMU handles legacy PCI interrupts for us. This is
>> >>    argument for separate Root Complex for QEMU.
>> >> As right now we have only virtio-pci backends provided by QEMU and
>> >> this
>> >> setup is already working, I propose to stick to this
>> >> solution. Especially, taking into account that it does not require any
>> >> changes to hypervisor code.
>> >
>> > I am not against two hostbridge as a temporary solution as long as
>> > this is not a one way door decision. I am not concerned about the
>> > hypervisor itself, I am more concerned about the interface exposed by
>> > the toolstack and QEMU.
>
> I agree with this...
>
>
>> > To clarify, I don't particular want to have to maintain the two
>> > hostbridges solution once we can use a single hostbridge. So we need
>> > to be able to get rid of it without impacting the interface too much.
>
> ...and this
>
>
>> This depends on virtio-pci backends availability. AFAIK, now only one
>> option is to use QEMU and QEMU provides own host bridge. So if we want
>> get rid of the second host bridge we need either another virtio-pci
>> backend or we need to alter QEMU code so it can live without host
>> bridge.
>> 
>> As for interfaces, it appears that QEMU case does not require any changes
>> into hypervisor itself, it just boils down to writing couple of xenstore
>> entries and spawning QEMU with correct command line arguments.
>
> One thing that Stewart wrote in his reply that is important: it doesn't
> matter if QEMU thinks it is emulating a PCI Root Complex because that's
> required from QEMU's point of view to emulate an individual PCI device.
>
> If we can arrange it so the QEMU PCI Root Complex is not registered
> against Xen as part of the ioreq interface, then QEMU's emulated PCI
> Root Complex is going to be left unused. I think that would be great
> because we still have a clean QEMU-Xen-tools interface and the only
> downside is some extra unused emulation in QEMU. It would be a
> fantastic starting point.

I believe that in this case we need to install manual ioreq handlers, like
what was done in the patch "xen/arm: Intercept vPCI config accesses and
forward them to emulator", because we need to route ECAM accesses either
to a virtio-pci backend or to a real PCI device. We also need to tell
QEMU not to install its own ioreq handlers for the ECAM space.

Another point is PCI legacy interrupts, which should be emulated on the
Xen side. And unless I am missing something, we will need some new
mechanism to signal those interrupts from QEMU or another backend. I am
not sure whether we can use the already existing IRQ signaling mechanism,
because PCI interrupts are ORed across all devices on a bridge and are
level-sensitive.

Of course, we will need all of this anyway if we want to support
standalone virtio-pci backends, but to me that sounds more like a
"finish point" :)

-- 
WBR, Volodymyr


* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-17 22:22                   ` Volodymyr Babchuk
@ 2023-11-18  0:45                     ` Stefano Stabellini
  2023-11-21  0:42                       ` Volodymyr Babchuk
  0 siblings, 1 reply; 65+ messages in thread
From: Stefano Stabellini @ 2023-11-18  0:45 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stefano Stabellini, Julien Grall, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Wei Liu, Roger Pau Monné,
	xen-devel

On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> Hi Julien,
> >> 
> >> Julien Grall <julien@xen.org> writes:
> >> 
> >> > Hi Volodymyr,
> >> >
> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> >> >> Hi Stefano,
> >> >> Stefano Stabellini <sstabellini@kernel.org> writes:
> >> >> 
> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try
> >> >>>>> to avoid as much as possible to have two different PCI Root Complex
> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> >> >>>>> tolerable but not ideal.
> >> >>>>
> >> >>>> But what is exactly wrong with this setup?
> >> >>>
> >> >>> [...]
> >> >>>
> >> >>>>> The worst case I would like to avoid is to have
> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> >> >>>>
> >> >>>> This is how our setup works right now.
> >> >>>
> >> >>> If we have:
> >> >>> - a single PCI Root Complex emulated in Xen
> >> >>> - Xen is safety certified
> >> >>> - individual Virtio devices emulated by QEMU with grants for memory
> >> >>>
> >> >>> We can go very far in terms of being able to use Virtio in safety
> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
> >> >>>
> >> >>> On the other hand if we put an additional Root Complex in QEMU:
> >> >>> - we pay a price in terms of complexity of the codebase
> >> >>> - we pay a price in terms of resource utilization
> >> >>> - we have one additional problem in terms of using this setup with a
> >> >>>    SafeOS (one more device emulated by a non-safe component)
> >> >>>
> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> >> >>> solution because:
> >> >>> - we still pay a price in terms of resource utilization
> >> >>> - the code complexity goes up a bit but hopefully not by much
> >> >>> - there is no impact on safety compared to the ideal scenario
> >> >>>
> >> >>> This is why I wrote that it is tolerable.
> >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some
> >> >> more
> >> >> points:
> >> >> - There is ongoing work on implementing virtio backends as a
> >> >> separate
> >> >>    applications, written in Rust. Linaro are doing this part. Right now
> >> >>    they are implementing only virtio-mmio, but if they want to provide
> >> >>    virtio-pci as well, they will need a mechanism to plug only
> >> >>    virtio-pci, without Root Complex. This is argument for using single Root
> >> >>    Complex emulated in Xen.
> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has
> >> >> no
> >> >>    mechanism for exposing virtio-pci backends without exposing PCI root
> >> >>    complex as well. Architecturally, there should be a PCI bus to which
> >> >>    virtio-pci devices are connected. Or we need to make some changes to
> >> >>    QEMU internals to be able to create virtio-pci backends that are not
> >> >>    connected to any bus. Also, added benefit that PCI Root Complex
> >> >>    emulator in QEMU handles legacy PCI interrupts for us. This is
> >> >>    argument for separate Root Complex for QEMU.
> >> >> As right now we have only virtio-pci backends provided by QEMU and
> >> >> this
> >> >> setup is already working, I propose to stick to this
> >> >> solution. Especially, taking into account that it does not require any
> >> >> changes to hypervisor code.
> >> >
> >> > I am not against two hostbridge as a temporary solution as long as
> >> > this is not a one way door decision. I am not concerned about the
> >> > hypervisor itself, I am more concerned about the interface exposed by
> >> > the toolstack and QEMU.
> >
> > I agree with this...
> >
> >
> >> > To clarify, I don't particular want to have to maintain the two
> >> > hostbridges solution once we can use a single hostbridge. So we need
> >> > to be able to get rid of it without impacting the interface too much.
> >
> > ...and this
> >
> >
> >> This depends on virtio-pci backends availability. AFAIK, now only one
> >> option is to use QEMU and QEMU provides own host bridge. So if we want
> >> get rid of the second host bridge we need either another virtio-pci
> >> backend or we need to alter QEMU code so it can live without host
> >> bridge.
> >> 
> >> As for interfaces, it appears that QEMU case does not require any changes
> >> into hypervisor itself, it just boils down to writing couple of xenstore
> >> entries and spawning QEMU with correct command line arguments.
> >
> > One thing that Stewart wrote in his reply that is important: it doesn't
> > matter if QEMU thinks it is emulating a PCI Root Complex because that's
> > required from QEMU's point of view to emulate an individual PCI device.
> >
> > If we can arrange it so the QEMU PCI Root Complex is not registered
> > against Xen as part of the ioreq interface, then QEMU's emulated PCI
> > Root Complex is going to be left unused. I think that would be great
> > because we still have a clean QEMU-Xen-tools interface and the only
> > downside is some extra unused emulation in QEMU. It would be a
> > fantastic starting point.
> 
> I believe, that in this case we need to set manual ioreq handlers, like
> what was done in patch "xen/arm: Intercept vPCI config accesses and
> forward them to emulator", because we need to route ECAM accesses
> either to a virtio-pci backend or to a real PCI device. Also we need
> to tell QEMU to not install own ioreq handles for ECAM space.

I was imagining that the interface would look like this: QEMU registers
a PCI BDF and Xen automatically starts forwarding to QEMU the ECAM
read/write requests for the PCI config space of that BDF only. It
would not be the entire ECAM space, but only the individual PCI config
reads/writes for that BDF.


> Another point is PCI legacy interrupts, which should be emulated on Xen
> side. And unless I miss something, we will need some new mechanism to
> signal those interrupts from QEMU/other backend. I am not sure if we can
> use already existing IRQ signaling mechanism, because PCI interrupts are
> ORed for all devices on a bridge and are level-sensitive.

I hope we can reuse xc_hvm_set_pci_intx_level or another XEN_DMOP
hypercall


> Of course, we will need all of this anyways, if we want to support
> standalone virtio-pci backends, but for me it sounds more like "finish
> point" :)




* Re: [PATCH v10 05/17] vpci: add hooks for PCI device assign/de-assign
  2023-10-12 22:09 ` [PATCH v10 05/17] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
@ 2023-11-20 15:04   ` Roger Pau Monné
  0 siblings, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-20 15:04 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Jan Beulich, Paul Durrant

On Thu, Oct 12, 2023 at 10:09:15PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a PCI device gets assigned/de-assigned we need to
> initialize/de-initialize vPCI state for the device.
> 
> Also, rename vpci_add_handlers() to vpci_assign_device() and
> vpci_remove_device() to vpci_deassign_device() to better reflect role
> of the functions.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
> 
> In v10:
> - removed HAS_VPCI_GUEST_SUPPORT checks
> - HAS_VPCI_GUEST_SUPPORT config option (in Kconfig) as it is not used
>   anywhere
> In v9:
> - removed previous  vpci_[de]assign_device function and renamed
>   existing handlers
> - dropped attempts to handle errors in assign_device() function
> - do not call vpci_assign_device for dom_io
> - use d instead of pdev->domain
> - use IS_ENABLED macro
> In v8:
> - removed vpci_deassign_device
> In v6:
> - do not pass struct domain to vpci_{assign|deassign}_device as
>   pdev->domain can be used
> - do not leave the device assigned (pdev->domain == new domain) in case
>   vpci_assign_device fails: try to de-assign and if this also fails, then
>   crash the domain
> In v5:
> - do not split code into run_vpci_init
> - do not check for is_system_domain in vpci_{de}assign_device
> - do not use vpci_remove_device_handlers_locked and re-allocate
>   pdev->vpci completely
> - make vpci_deassign_device void
> In v4:
>  - de-assign vPCI from the previous domain on device assignment
>  - do not remove handlers in vpci_assign_device as those must not
>    exist at that point
> In v3:
>  - remove toolstack roll-back description from the commit message
>    as error are to be handled with proper cleanup in Xen itself
>  - remove __must_check
>  - remove redundant rc check while assigning devices
>  - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
>  - use REGISTER_VPCI_INIT machinery to run required steps on device
>    init/assign: add run_vpci_init helper
> In v2:
> - define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
>   for x86
> In v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - extended the commit message
> ---
>  xen/drivers/passthrough/pci.c | 20 ++++++++++++++++----
>  xen/drivers/vpci/header.c     |  2 +-
>  xen/drivers/vpci/vpci.c       |  6 +++---
>  xen/include/xen/vpci.h        | 10 +++++-----
>  4 files changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 182da45acb..b7926a291c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -755,7 +755,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>           * For devices not discovered by Xen during boot, add vPCI handlers
>           * when Dom0 first informs Xen about such devices.
>           */
> -        ret = vpci_add_handlers(pdev);
> +        ret = vpci_assign_device(pdev);
>          if ( ret )
>          {
>              list_del(&pdev->domain_list);
> @@ -769,7 +769,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              write_lock(&hardware_domain->pci_lock);
> -            vpci_remove_device(pdev);
> +            vpci_deassign_device(pdev);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
> @@ -817,7 +817,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
>          {
> -            vpci_remove_device(pdev);
> +            vpci_deassign_device(pdev);
>              pci_cleanup_msi(pdev);
>              ret = iommu_remove_device(pdev);
>              if ( pdev->domain )
> @@ -875,6 +875,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>              goto out;
>      }
>  
> +    write_lock(&d->pci_lock);
> +    vpci_deassign_device(pdev);
> +    write_unlock(&d->pci_lock);
> +
>      devfn = pdev->devfn;
>      ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn,
>                       pci_to_dev(pdev));

In deassign_device() you are missing a call to vpci_assign_device() in
order to set up the vPCI handlers for the target domain (not for
dom_io, but possibly for hardware_domain if it's PVH-like).

If the call to reassign_device is successful you need to call
vpci_assign_device().

The rest LGTM.

Thanks, Roger.



* Re: [PATCH v10 06/17] vpci/header: rework exit path in init_bars
  2023-10-12 22:09 ` [PATCH v10 06/17] vpci/header: rework exit path in init_bars Volodymyr Babchuk
@ 2023-11-20 15:07   ` Roger Pau Monné
  0 siblings, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-20 15:07 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand

On Thu, Oct 12, 2023 at 10:09:16PM +0000, Volodymyr Babchuk wrote:
> Introduce "fail" label in init_bars() function to have the centralized
> error return path. This is the pre-requirement for the future changes
> in this function.
> 
> This patch does not introduce functional changes.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> --
> In v10:
> - Added Roger's A-b tag.
> In v9:
> - New in v9
> ---
>  xen/drivers/vpci/header.c | 20 +++++++-------------
>  1 file changed, 7 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 176fe16b9f..33db58580c 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -581,11 +581,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
>              rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
>                                     4, &bars[i]);
>              if ( rc )
> -            {
> -                pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> -                return rc;
> -            }
> -
> +                goto fail;

One nit that IMO can be fixed at commit time: could you please avoid
removing the empty line between goto fail; and continue;?

Thanks, Roger.



* Re: [PATCH v10 07/17] vpci/header: implement guest BAR register handlers
  2023-10-12 22:09 ` [PATCH v10 07/17] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
  2023-10-14 16:00   ` Stewart Hildebrand
@ 2023-11-20 16:06   ` Roger Pau Monné
  1 sibling, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-20 16:06 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko

On Thu, Oct 12, 2023 at 10:09:16PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Emulate guest BAR register values: this allows creating a guest view
> of the registers and emulates size and properties probe as it is done
> during PCI device enumeration by the guest.
> 
> All empty, IO and ROM BARs for guests are emulated by returning 0 on
> reads and ignoring writes: this BARs are special with this respect as
> their lower bits have special meaning, so returning default ~0 on read
> may confuse guest OS.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> In v10:
> - ull -> ULL to be MISRA-compatbile
> - Use PAGE_OFFSET() instead of combining with ~PAGE_MASK
> - Set type of empty bars to VPCI_BAR_EMPTY
> In v9:
> - factored-out "fail" label introduction in init_bars()
> - replaced #ifdef CONFIG_X86 with IS_ENABLED()
> - do not pass bars[i] to empty_bar_read() handler
> - store guest's BAR address instead of guests BAR register view
> Since v6:
> - unify the writing of the PCI_COMMAND register on the
>   error path into a label
> - do not introduce bar_ignore_access helper and open code
> - s/guest_bar_ignore_read/empty_bar_read
> - update error message in guest_bar_write
> - only setup empty_bar_read for IO if !x86
> Since v5:
> - make sure that the guest set address has the same page offset
>   as the physical address on the host
> - remove guest_rom_{read|write} as those just implement the default
>   behaviour of the registers not being handled
> - adjusted comment for struct vpci.addr field
> - add guest handlers for BARs which are not handled and will otherwise
>   return ~0 on read and ignore writes. The BARs are special with this
>   respect as their lower bits have special meaning, so returning ~0
>   doesn't seem to be right
> Since v4:
> - updated commit message
> - s/guest_addr/guest_reg
> Since v3:
> - squashed two patches: dynamic add/remove handlers and guest BAR
>   handler implementation
> - fix guest BAR read of the high part of a 64bit BAR (Roger)
> - add error handling to vpci_assign_device
> - s/dom%pd/%pd
> - blank line before return
> Since v2:
> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>   has been eliminated from being built on x86
> Since v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - simplify some code3. simplify
>  - use gdprintk + error code instead of gprintk
>  - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>    so these do not get compiled for x86
>  - removed unneeded is_system_domain check
>  - re-work guest read/write to be much simpler and do more work on write
>    than read which is expected to be called more frequently
>  - removed one too obvious comment
> ---
>  xen/drivers/vpci/header.c | 137 +++++++++++++++++++++++++++++++++-----
>  xen/include/xen/vpci.h    |   3 +
>  2 files changed, 123 insertions(+), 17 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 33db58580c..40d1a07ada 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -477,6 +477,74 @@ static void cf_check bar_write(
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
>  
> +static void cf_check guest_bar_write(const struct pci_dev *pdev,
> +                                     unsigned int reg, uint32_t val, void *data)

In case we want to add handlers for IO BARs in the future, it might be
better to name this guest_mem_bar_write() (or some similar name that
makes it clear it applies to memory BARs only).

> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +    uint64_t guest_addr = bar->guest_addr;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +    }

Shouldn't writes to the BAR be refused when bar->enabled is set?
Otherwise the bar->guest_addr address would get out of sync with the
mappings created on the p2m, and memory decoding disabling (BAR
unmapping) won't work as expected.

> +
> +    guest_addr &= ~(0xffffffffULL << (hi ? 32 : 0));
> +    guest_addr |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    /* Allow guest to size BAR correctly */
> +    guest_addr &= ~(bar->size - 1);
> +
> +    /*
> +     * Make sure that the guest set address has the same page offset
> +     * as the physical address on the host or otherwise things won't work as
> +     * expected.
> +     */
> +    if ( guest_addr != ~(bar->size -1 )  &&
                                      ^ nit: missing space
> +         PAGE_OFFSET(guest_addr) != PAGE_OFFSET(bar->addr) )
> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%pp: ignored BAR %zu write attempting to change page offset\n",
> +                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
> +        return;

I think this will trigger on valid sizing attempts if, in the case of
64-bit BARs, the guest writes the low part of the address before having
updated the high one?

Reading the PCI Local Bus Spec 3.0, I don't see any restriction on the
order in which the writes of ~0 should be performed in order to size
the BARs (see the Implementation Note in section 6.2.5.1.)

So I think a wrong offset can only be detected when the BAR mapping is
attempted to be established in the p2m, in modify_bars().

> +    }
> +
> +    bar->guest_addr = guest_addr;
> +}
> +
> +static uint32_t cf_check guest_bar_read(const struct pci_dev *pdev,
> +                                        unsigned int reg, void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    uint32_t reg_val;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        return bar->guest_addr >> 32;
> +    }
> +
> +    reg_val = bar->guest_addr;
> +    reg_val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32 :
> +                                             PCI_BASE_ADDRESS_MEM_TYPE_64;
> +    reg_val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +
> +    return reg_val;
> +}
> +
> +static uint32_t cf_check empty_bar_read(const struct pci_dev *pdev,
> +                                        unsigned int reg, void *data)
> +{
> +    return 0;
> +}
> +
>  static void cf_check rom_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
>  {
> @@ -537,6 +605,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_header *header = &pdev->vpci->header;
>      struct vpci_bar *bars = header->bars;
>      int rc;
> +    bool is_hwdom = is_hardware_domain(pdev->domain);
>  
>      ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>  
> @@ -578,8 +647,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>          {
>              bars[i].type = VPCI_BAR_MEM64_HI;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            rc = vpci_add_register(pdev->vpci,
> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                                   is_hwdom ? bar_write : guest_bar_write,
> +                                   reg, 4, &bars[i]);
>              if ( rc )
>                  goto fail;
>              continue;
> @@ -588,7 +659,17 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          val = pci_conf_read32(pdev->sbdf, reg);
>          if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>          {
> -            bars[i].type = VPCI_BAR_IO;

Why are you removing this assignment?

This would leave the BAR with type VPCI_BAR_EMPTY on x86, which I
don't think will cause issues right now, but it's not correct either.

> +            if ( !IS_ENABLED(CONFIG_X86) && !is_hwdom )
> +            {
> +                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                                       reg, 4, NULL);
> +                if ( rc )
> +                {
> +                    bars[i].type = VPCI_BAR_EMPTY;
> +                    goto fail;

I'm confused as to why it matters to set the BAR type on the failure
path.

> +                }
> +            }
> +
>              continue;
>          }
>          if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> @@ -605,6 +686,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          if ( size == 0 )
>          {
>              bars[i].type = VPCI_BAR_EMPTY;
> +
> +            if ( !is_hwdom )
> +            {
> +                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                                       reg, 4, NULL);
> +                if ( rc )
> +                    goto fail;
> +            }
> +
>              continue;
>          }
>  
> @@ -612,28 +702,41 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          bars[i].size = size;
>          bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> -                               &bars[i]);
> +        rc = vpci_add_register(pdev->vpci,
> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                               is_hwdom ? bar_write : guest_bar_write,
> +                               reg, 4, &bars[i]);
>          if ( rc )
>              goto fail;
>      }
>  
> -    /* Check expansion ROM. */
> -    rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
> -    if ( rc > 0 && size )
> +    /* TODO: Check expansion ROM, we do not handle ROM for guests for now. */

Nit: it would be better to place the TODO comment in the else branch
that applies to the !is_hwdom case.

Thanks, Roger.



* Re: [PATCH v10 10/17] vpci/header: handle p2m range sets per BAR
  2023-10-12 22:09 ` [PATCH v10 10/17] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
@ 2023-11-20 17:29   ` Roger Pau Monné
  0 siblings, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-20 17:29 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko

On Thu, Oct 12, 2023 at 10:09:17PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Instead of handling a single range set, that contains all the memory
> regions of all the BARs and ROM, have them per BAR.
> As the range sets are now created when a PCI device is added and destroyed
> when it is removed, make them named and accounted.
> 
> Note that rangesets were chosen here despite there being only up to
> 3 separate ranges in each set (typically just 1). A rangeset per BAR
> was chosen for ease of implementation and existing code re-usability.
> 
> This is in preparation of making non-identity mappings in p2m for the MMIOs.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

I have some minor comments below, but overall it looks good; with the
comments below addressed:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

I think it's worth mentioning in the commit message that the error
handling of vpci_process_pending() is slightly modified, and that vPCI
handlers are no longer removed if the creation of the mappings in
vpci_process_pending() fails, as that's unlikely to lead to a
functional device in any case.

> 
> ---
> In v10:
> - Added additional checks to vpci_process_pending()
> - vpci_process_pending() now clears rangeset in case of failure
> - Fixed locks in vpci_process_pending()
> - Fixed coding style issues
> - Fixed error handling in init_bars
> In v9:
> - removed d->vpci.map_pending in favor of checking v->vpci.pdev !=
> NULL
> - printk -> gprintk
> - renamed bar variable to fix shadowing
> - fixed bug with iterating on remote device's BARs
> - relaxed lock in vpci_process_pending
> - removed stale comment
> Since v6:
> - update according to the new locking scheme
> - remove odd fail label in modify_bars
> Since v5:
> - fix comments
> - move rangeset allocation to init_bars and only allocate
>   for MAPPABLE BARs
> - check for overlap with the already setup BAR ranges
> Since v4:
> - use named range sets for BARs (Jan)
> - changes required by the new locking scheme
> - updated commit message (Jan)
> Since v3:
> - re-work vpci_cancel_pending accordingly to the per-BAR handling
> - s/num_mem_ranges/map_pending and s/uint8_t/bool
> - ASSERT(bar->mem) in modify_bars
> - create and destroy the rangesets on add/remove
> ---
>  xen/drivers/vpci/header.c | 256 ++++++++++++++++++++++++++------------
>  xen/drivers/vpci/vpci.c   |   6 +
>  xen/include/xen/vpci.h    |   2 +-
>  3 files changed, 184 insertions(+), 80 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 40d1a07ada..5c056923ad 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -161,63 +161,106 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>  
>  bool vpci_process_pending(struct vcpu *v)
>  {
> -    if ( v->vpci.mem )
> +    struct pci_dev *pdev = v->vpci.pdev;
> +    struct map_data data = {
> +        .d = v->domain,
> +        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> +    };
> +    struct vpci_header *header = NULL;
> +    unsigned int i;
> +
> +    if ( !pdev )
> +        return false;
> +
> +    read_lock(&v->domain->pci_lock);
> +
> +    if ( !pdev->vpci || (v->domain != pdev->domain) )
>      {
> -        struct map_data data = {
> -            .d = v->domain,
> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> -        };
> -        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
> +        read_unlock(&v->domain->pci_lock);

I think you want to clear v->vpci.pdev here in order to avoid having
to take the lock again and exit early from vpci_process_pending() if
possible.

> +        return false;
> +    }
> +
> +    header = &pdev->vpci->header;
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +    {
> +        struct vpci_bar *bar = &header->bars[i];
> +        int rc;
> +
> +        if ( rangeset_is_empty(bar->mem) )
> +            continue;
> +
> +        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
>  
>          if ( rc == -ERESTART )
> +        {
> +            read_unlock(&v->domain->pci_lock);
>              return true;
> +        }
>  
> -        write_lock(&v->domain->pci_lock);
> -        spin_lock(&v->vpci.pdev->vpci->lock);
> -        /* Disable memory decoding unconditionally on failure. */
> -        modify_decoding(v->vpci.pdev,
> -                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
> -                        !rc && v->vpci.rom_only);
> -        spin_unlock(&v->vpci.pdev->vpci->lock);
> -
> -        rangeset_destroy(v->vpci.mem);
> -        v->vpci.mem = NULL;
>          if ( rc )
> -            /*
> -             * FIXME: in case of failure remove the device from the domain.
> -             * Note that there might still be leftover mappings. While this is
> -             * safe for Dom0, for DomUs the domain will likely need to be
> -             * killed in order to avoid leaking stale p2m mappings on
> -             * failure.
> -             */
> -            vpci_deassign_device(v->vpci.pdev);
> -        write_unlock(&v->domain->pci_lock);
> +        {
> +            spin_lock(&pdev->vpci->lock);
> +            /* Disable memory decoding unconditionally on failure. */
> +            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
> +                            false);
> +            spin_unlock(&pdev->vpci->lock);
> +
> +            /* Clean all the rangesets */
> +            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +                if ( !rangeset_is_empty(header->bars[i].mem) )
> +                     rangeset_empty(header->bars[i].mem);
> +
> +            v->vpci.pdev = NULL;
> +
> +            read_unlock(&v->domain->pci_lock);
> +
> +            if ( !is_hardware_domain(v->domain) )
> +                domain_crash(v->domain);
> +
> +            return false;
> +        }
>      }
> +    v->vpci.pdev = NULL;
> +
> +    spin_lock(&pdev->vpci->lock);
> +    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
> +    spin_unlock(&pdev->vpci->lock);
> +
> +    read_unlock(&v->domain->pci_lock);
>  
>      return false;
>  }
>  
>  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> -                            struct rangeset *mem, uint16_t cmd)
> +                            uint16_t cmd)
>  {
>      struct map_data data = { .d = d, .map = true };
> -    int rc;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    int rc = 0;
> +    unsigned int i;
>  
>      ASSERT(rw_is_write_locked(&d->pci_lock));
>  
> -    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        /*
> -         * It's safe to drop and reacquire the lock in this context
> -         * without risking pdev disappearing because devices cannot be
> -         * removed until the initial domain has been started.
> -         */
> -        read_unlock(&d->pci_lock);
> -        process_pending_softirqs();
> -        read_lock(&d->pci_lock);
> -    }
> +        struct vpci_bar *bar = &header->bars[i];
> +
> +        if ( rangeset_is_empty(bar->mem) )
> +            continue;
>  
> -    rangeset_destroy(mem);
> +        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
> +                                              &data)) == -ERESTART )
> +        {
> +            /*
> +             * It's safe to drop and reacquire the lock in this context
> +             * without risking pdev disappearing because devices cannot be
> +             * removed until the initial domain has been started.
> +             */
> +            write_unlock(&d->pci_lock);
> +            process_pending_softirqs();
> +            write_lock(&d->pci_lock);
> +        }
> +    }
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
>  
> @@ -225,10 +268,12 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>  }
>  
>  static void defer_map(struct domain *d, struct pci_dev *pdev,
> -                      struct rangeset *mem, uint16_t cmd, bool rom_only)
> +                      uint16_t cmd, bool rom_only)
>  {
>      struct vcpu *curr = current;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));

Shouldn't this be part of the previous commit that introduces the
usage of d->pci_lock?

> +
>      /*
>       * FIXME: when deferring the {un}map the state of the device should not
>       * be trusted. For example the enable bit is toggled after the device
> @@ -236,7 +281,6 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>       * started for the same device if the domain is not well-behaved.
>       */
>      curr->vpci.pdev = pdev;
> -    curr->vpci.mem = mem;
>      curr->vpci.cmd = cmd;
>      curr->vpci.rom_only = rom_only;
>      /*
> @@ -250,33 +294,33 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>  {
>      struct vpci_header *header = &pdev->vpci->header;
> -    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>      struct pci_dev *tmp, *dev = NULL;
>      const struct domain *d;
>      const struct vpci_msix *msix = pdev->vpci->msix;
> -    unsigned int i;
> +    unsigned int i, j;
>      int rc;
>  
>      ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>  
> -    if ( !mem )
> -        return -ENOMEM;
> -
>      /*
> -     * Create a rangeset that represents the current device BARs memory region
> -     * and compare it against all the currently active BAR memory regions. If
> -     * an overlap is found, subtract it from the region to be mapped/unmapped.
> +     * Create a rangeset per BAR that represents the current device memory
> +     * region and compare it against all the currently active BAR memory
> +     * regions. If an overlap is found, subtract it from the region to be
> +     * mapped/unmapped.
>       *
> -     * First fill the rangeset with all the BARs of this device or with the ROM
> +     * First fill the rangesets with the BAR of this device or with the ROM
>       * BAR only, depending on whether the guest is toggling the memory decode
>       * bit of the command register, or the enable bit of the ROM BAR register.
>       */
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        const struct vpci_bar *bar = &header->bars[i];
> +        struct vpci_bar *bar = &header->bars[i];
>          unsigned long start = PFN_DOWN(bar->addr);
>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>  
> +        if ( !bar->mem )
> +            continue;
> +
>          if ( !MAPPABLE_BAR(bar) ||
>               (rom_only ? bar->type != VPCI_BAR_ROM
>                         : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
> @@ -292,14 +336,31 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              continue;
>          }
>  
> -        rc = rangeset_add_range(mem, start, end);
> +        rc = rangeset_add_range(bar->mem, start, end);
>          if ( rc )
>          {
>              printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
>                     start, end, rc);
> -            rangeset_destroy(mem);
>              return rc;
>          }
> +
> +        /* Check for overlap with the already setup BAR ranges. */
> +        for ( j = 0; j < i; j++ )
> +        {
> +            struct vpci_bar *prev_bar = &header->bars[j];
> +
> +            if ( rangeset_is_empty(prev_bar->mem) )
> +                continue;
> +
> +            rc = rangeset_remove_range(prev_bar->mem, start, end);
> +            if ( rc )
> +            {
> +                gprintk(XENLOG_WARNING,
> +                       "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
> +                        &pdev->sbdf, start, end, rc);
> +                return rc;
> +            }
> +        }
>      }
>  
>      /* Remove any MSIX regions if present. */
> @@ -309,14 +370,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>          unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>                                       vmsix_table_size(pdev->vpci, i) - 1);
>  
> -        rc = rangeset_remove_range(mem, start, end);
> -        if ( rc )
> +        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
>          {
> -            printk(XENLOG_G_WARNING
> -                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
> -                   start, end, rc);
> -            rangeset_destroy(mem);
> -            return rc;
> +            const struct vpci_bar *bar = &header->bars[j];
> +
> +            if ( rangeset_is_empty(bar->mem) )
> +                continue;
> +
> +            rc = rangeset_remove_range(bar->mem, start, end);
> +            if ( rc )
> +            {
> +                gprintk(XENLOG_WARNING,
> +                       "%pp: failed to remove MSIX table [%lx, %lx]: %d\n",
> +                        &pdev->sbdf, start, end, rc);
> +                return rc;
> +            }
>          }
>      }
>  
> @@ -356,27 +424,35 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>  
>              for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>              {
> -                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> -                unsigned long start = PFN_DOWN(bar->addr);
> -                unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> -
> -                if ( !bar->enabled ||
> -                     !rangeset_overlaps_range(mem, start, end) ||
> -                     /*
> -                      * If only the ROM enable bit is toggled check against
> -                      * other BARs in the same device for overlaps, but not
> -                      * against the same ROM BAR.
> -                      */
> -                     (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
> +                const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
> +                unsigned long start = PFN_DOWN(remote_bar->addr);
> +                unsigned long end = PFN_DOWN(remote_bar->addr +
> +                                             remote_bar->size - 1);
> +
> +                if ( !remote_bar->enabled )
>                      continue;
>  
> -                rc = rangeset_remove_range(mem, start, end);
> -                if ( rc )
> +                for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
>                  {
> -                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
> -                           start, end, rc);
> -                    rangeset_destroy(mem);
> -                    return rc;
> +                    const struct vpci_bar *bar = &header->bars[j];
> +
> +                    if ( !rangeset_overlaps_range(bar->mem, start, end) ||
> +                         /*
> +                          * If only the ROM enable bit is toggled check against
> +                          * other BARs in the same device for overlaps, but not
> +                          * against the same ROM BAR.
> +                          */
> +                         (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )

Is this line slightly too long?

Thanks, Roger.



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-18  0:45                     ` Stefano Stabellini
@ 2023-11-21  0:42                       ` Volodymyr Babchuk
  2023-11-22  1:12                         ` Stefano Stabellini
  0 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-11-21  0:42 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, George Dunlap, Jan Beulich, Wei Liu,
	Roger Pau Monné,
	xen-devel


Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> >> Hi Julien,
>> >> 
>> >> Julien Grall <julien@xen.org> writes:
>> >> 
>> >> > Hi Volodymyr,
>> >> >
>> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
>> >> >> Hi Stefano,
>> >> >> Stefano Stabellini <sstabellini@kernel.org> writes:
>> >> >> 
>> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try
>> >> >>>>> to avoid as much as possible to have two different PCI Root Complex
>> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
>> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>> >> >>>>> tolerable but not ideal.
>> >> >>>>
>> >> >>>> But what is exactly wrong with this setup?
>> >> >>>
>> >> >>> [...]
>> >> >>>
>> >> >>>>> The worst case I would like to avoid is to have
>> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>> >> >>>>
>> >> >>>> This is how our setup works right now.
>> >> >>>
>> >> >>> If we have:
>> >> >>> - a single PCI Root Complex emulated in Xen
>> >> >>> - Xen is safety certified
>> >> >>> - individual Virtio devices emulated by QEMU with grants for memory
>> >> >>>
>> >> >>> We can go very far in terms of being able to use Virtio in safety
>> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>> >> >>>
>> >> >>> On the other hand if we put an additional Root Complex in QEMU:
>> >> >>> - we pay a price in terms of complexity of the codebase
>> >> >>> - we pay a price in terms of resource utilization
>> >> >>> - we have one additional problem in terms of using this setup with a
>> >> >>>    SafeOS (one more device emulated by a non-safe component)
>> >> >>>
>> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
>> >> >>> solution because:
>> >> >>> - we still pay a price in terms of resource utilization
>> >> >>> - the code complexity goes up a bit but hopefully not by much
>> >> >>> - there is no impact on safety compared to the ideal scenario
>> >> >>>
>> >> >>> This is why I wrote that it is tolerable.
>> >> >> Ah, I see now. Yes, I agree with this. Also I want to add some
>> >> >> more
>> >> >> points:
>> >> >> - There is ongoing work on implementing virtio backends as
>> >> >>    separate applications, written in Rust. Linaro are doing this
>> >> >>    part. Right now they are implementing only virtio-mmio, but if
>> >> >>    they want to provide virtio-pci as well, they will need a
>> >> >>    mechanism to plug in only virtio-pci, without a Root Complex.
>> >> >>    This is an argument for using a single Root Complex emulated
>> >> >>    in Xen.
>> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU
>> >> >>    has no mechanism for exposing virtio-pci backends without
>> >> >>    exposing a PCI root complex as well. Architecturally, there
>> >> >>    should be a PCI bus to which virtio-pci devices are connected,
>> >> >>    or we need to make some changes to QEMU internals to be able to
>> >> >>    create virtio-pci backends that are not connected to any bus.
>> >> >>    An added benefit is that the PCI Root Complex emulator in QEMU
>> >> >>    handles legacy PCI interrupts for us. This is an argument for a
>> >> >>    separate Root Complex for QEMU.
>> >> >> As right now the only virtio-pci backends we have are provided by
>> >> >> QEMU, and this setup is already working, I propose to stick to
>> >> >> this solution, especially taking into account that it does not
>> >> >> require any changes to hypervisor code.
>> >> >
>> >> > I am not against two hostbridges as a temporary solution as long as
>> >> > this is not a one-way-door decision. I am not concerned about the
>> >> > hypervisor itself, I am more concerned about the interface exposed by
>> >> > the toolstack and QEMU.
>> >
>> > I agree with this...
>> >
>> >
>> >> > To clarify, I don't particular want to have to maintain the two
>> >> > hostbridges solution once we can use a single hostbridge. So we need
>> >> > to be able to get rid of it without impacting the interface too much.
>> >
>> > ...and this
>> >
>> >
>> >> This depends on virtio-pci backend availability. AFAIK, right now
>> >> the only option is to use QEMU, and QEMU provides its own host
>> >> bridge. So if we want to get rid of the second host bridge, we need
>> >> either another virtio-pci backend, or we need to alter the QEMU code
>> >> so it can live without a host bridge.
>> >> 
>> >> As for interfaces, it appears that the QEMU case does not require
>> >> any changes to the hypervisor itself; it just boils down to writing a
>> >> couple of xenstore entries and spawning QEMU with the correct command
>> >> line arguments.
>> >
>> > One thing that Stewart wrote in his reply that is important: it doesn't
>> > matter if QEMU thinks it is emulating a PCI Root Complex because that's
>> > required from QEMU's point of view to emulate an individual PCI device.
>> >
>> > If we can arrange it so the QEMU PCI Root Complex is not registered
>> > against Xen as part of the ioreq interface, then QEMU's emulated PCI
>> > Root Complex is going to be left unused. I think that would be great
>> > because we still have a clean QEMU-Xen-tools interface and the only
>> > downside is some extra unused emulation in QEMU. It would be a
>> > fantastic starting point.
>> 
>> I believe that in this case we need to set up manual ioreq handlers,
>> like what was done in patch "xen/arm: Intercept vPCI config accesses
>> and forward them to emulator", because we need to route ECAM accesses
>> either to a virtio-pci backend or to a real PCI device. We also need
>> to tell QEMU not to install its own ioreq handlers for the ECAM space.
>
> I was imagining that the interface would look like this: QEMU registers
> a PCI BDF and Xen automatically starts forwarding to QEMU the ECAM
> read/write requests for the PCI config space of that BDF only. It
> would not be the entire ECAM space but only individual PCI conf
> reads/writes for that BDF.
>

Okay, I see that there is the
xendevicemodel_map_pcidev_to_ioreq_server() function and the
corresponding IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to
use to register a PCI BDF?

I see that xen-hvm-common.c in QEMU is able to handle only the standard
256-byte configuration space, but I hope that will be an easy fix.

-- 
WBR, Volodymyr


* Re: [PATCH v10 11/17] vpci/header: program p2m with guest BAR view
  2023-10-12 22:09 ` [PATCH v10 11/17] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
@ 2023-11-21 12:24   ` Roger Pau Monné
  0 siblings, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-21 12:24 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko

On Thu, Oct 12, 2023 at 10:09:17PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Take into account guest's BAR view and program its p2m accordingly:
> gfn is guest's view of the BAR and mfn is the physical BAR value.
> This way hardware domain sees physical BAR values and guest sees
> emulated ones.
> 
> Hardware domain continues getting the BARs identity mapped, while for
> domUs the BARs are mapped at the requested guest address without
> modifying the BAR address in the device PCI config space.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
> In v10:
> - Moved GFN variable definition outside the loop in map_range()
> - Updated printk error message in map_range()
> - Now BAR address is always stored in bar->guest_addr, even for
>   HW dom, this removes bunch of ugly is_hwdom() checks in modify_bars()
> - vmsix_table_base() now uses .guest_addr instead of .addr
> In v9:
> - Extended the commit message
> - Use bar->guest_addr in modify_bars
> - Extended printk error message in map_range
> - Moved map_data initialization so .bar can be initialized during declaration
> Since v5:
> - remove debug print in map_range callback
> - remove "identity" from the debug print
> Since v4:
> - moved start_{gfn|mfn} calculation into map_range
> - pass vpci_bar in the map_data instead of start_{gfn|mfn}
> - s/guest_addr/guest_reg
> Since v3:
> - updated comment (Roger)
> - removed gfn_add(map->start_gfn, rc); which is wrong
> - use v->domain instead of v->vpci.pdev->domain
> - removed odd e.g. in comment
> - s/d%d/%pd in altered code
> - use gdprintk for map/unmap logs
> Since v2:
> - improve readability for data.start_gfn and restructure ?: construct
> Since v1:
>  - s/MSI/MSI-X in comments
> ---
>  xen/drivers/vpci/header.c | 53 ++++++++++++++++++++++++++++-----------
>  xen/include/xen/vpci.h    |  3 ++-
>  2 files changed, 41 insertions(+), 15 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 5c056923ad..efce0bc2ae 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -33,6 +33,7 @@
>  
>  struct map_data {
>      struct domain *d;
> +    const struct vpci_bar *bar;
>      bool map;
>  };
>  
> @@ -40,11 +41,21 @@ static int cf_check map_range(
>      unsigned long s, unsigned long e, void *data, unsigned long *c)
>  {
>      const struct map_data *map = data;
> +    /* Start address of the BAR as seen by the guest. */
> +    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
> +    /* Physical start address of the BAR. */
> +    mfn_t start_mfn = _mfn(PFN_DOWN(map->bar->addr));
>      int rc;
>  
>      for ( ; ; )
>      {
>          unsigned long size = e - s + 1;
> +        /*
> +         * Ranges to be mapped don't always start at the BAR start address, as
> +         * there can be holes or partially consumed ranges. Account for the
> +         * offset of the current address from the BAR start.
> +         */
> +        mfn_t map_mfn = mfn_add(start_mfn, s - start_gfn);
>  
>          if ( !iomem_access_permitted(map->d, s, e) )

This check must be switched to use host physical addresses (mfns), not
the guest ones, same for the xsm_iomem_mapping() check just below.

>          {
> @@ -72,8 +83,8 @@ static int cf_check map_range(
>           * - {un}map_mmio_regions doesn't support preemption.
>           */
>  
> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, map_mfn)
> +                      : unmap_mmio_regions(map->d, _gfn(s), size, map_mfn);
>          if ( rc == 0 )
>          {
>              *c += size;
> @@ -82,8 +93,9 @@ static int cf_check map_range(
>          if ( rc < 0 )
>          {
>              printk(XENLOG_G_WARNING
> -                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
> -                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
> +                   "Failed to %smap [%lx %lx] -> [%lx %lx] for %pd: %d\n",
> +                   map->map ? "" : "un", s, e, mfn_x(map_mfn),
> +                   mfn_x(map_mfn) + size, map->d, rc);
>              break;
>          }
>          ASSERT(rc < size);
> @@ -162,10 +174,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>  bool vpci_process_pending(struct vcpu *v)
>  {
>      struct pci_dev *pdev = v->vpci.pdev;
> -    struct map_data data = {
> -        .d = v->domain,
> -        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> -    };
>      struct vpci_header *header = NULL;
>      unsigned int i;
>  
> @@ -184,6 +192,11 @@ bool vpci_process_pending(struct vcpu *v)
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct map_data data = {
> +            .d = v->domain,
> +            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> +            .bar = bar,
> +        };
>          int rc;
>  
>          if ( rangeset_is_empty(bar->mem) )
> @@ -234,7 +247,6 @@ bool vpci_process_pending(struct vcpu *v)
>  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>                              uint16_t cmd)
>  {
> -    struct map_data data = { .d = d, .map = true };
>      struct vpci_header *header = &pdev->vpci->header;
>      int rc = 0;
>      unsigned int i;
> @@ -244,6 +256,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct map_data data = { .d = d, .map = true, .bar = bar };
>  
>          if ( rangeset_is_empty(bar->mem) )
>              continue;
> @@ -311,12 +324,16 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>       * First fill the rangesets with the BAR of this device or with the ROM
>       * BAR only, depending on whether the guest is toggling the memory decode
>       * bit of the command register, or the enable bit of the ROM BAR register.
> +     *
> +     * For non-hardware domain we use guest physical addresses.
>       */
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
>          unsigned long start = PFN_DOWN(bar->addr);
>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> +        unsigned long start_guest = PFN_DOWN(bar->guest_addr);
> +        unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
>  
>          if ( !bar->mem )
>              continue;
> @@ -336,11 +353,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              continue;
>          }
>  
> -        rc = rangeset_add_range(bar->mem, start, end);
> +        rc = rangeset_add_range(bar->mem, start_guest, end_guest);

I think you will have to add a check here to ensure that guest and
host address use the same page offset, as it's AFAICT not possible to
do so from the BAR write handler like you have it in patch 10.

>          if ( rc )
>          {
>              printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
> -                   start, end, rc);
> +                   start_guest, end_guest, rc);
>              return rc;
>          }
>  
> @@ -357,7 +374,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              {
>                  gprintk(XENLOG_WARNING,
>                         "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
> -                        &pdev->sbdf, start, end, rc);
> +                        &pdev->sbdf, start_guest, end_guest, rc);

Don't you need to also adjust the call to rangeset_remove_range()
above this error message to use {start,end}_guest instead of
{start,end}?

Thanks, Roger.



* Re: [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests
  2023-10-12 22:09 ` [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
  2023-10-13 21:53   ` Volodymyr Babchuk
@ 2023-11-21 14:17   ` Roger Pau Monné
  2023-12-01  2:05     ` Volodymyr Babchuk
  2023-12-21 22:58     ` Stewart Hildebrand
  1 sibling, 2 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-21 14:17 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko

On Thu, Oct 12, 2023 at 10:09:18PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
> guest's view of this will want to be zero initially, the host having set
> it to 1 may not easily be overwritten with 0, or else we'd effectively
> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
> proper emulation in order to honor host's settings.
> 
> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
> Device Control" the reset state of the command register is typically 0,
> so when assigning a PCI device use 0 as the initial state for the guest's view
> of the command register.
> 
> Here is the full list of command register bits with notes about
> emulation, along with QEMU behavior in the same situation:
> 
> PCI_COMMAND_IO - QEMU does not allow a guest to change value of this bit
> in real device. Instead it is always set to 1. A guest can write to this
> register, but writes are ignored.
> 
> PCI_COMMAND_MEMORY - QEMU behaves exactly as with PCI_COMMAND_IO. In
> Xen case, we handle writes to this bit by mapping/unmapping BAR
> regions. For devices assigned to DomUs, memory decoding will be
> disabled during initialization.
> 
> PCI_COMMAND_MASTER - Allow guest to control it. QEMU passes through
> writes to this bit.
> 
> PCI_COMMAND_SPECIAL - Guest can generate special cycles only if it has
> access to host bridge that supports software generation of special
> cycles. In our case guest has no access to host bridges at all. Value
> after reset is 0. QEMU passes through writes of this bit, we will do
> the same.
> 
> PCI_COMMAND_INVALIDATE - Allows "Memory Write and Invalidate" commands
> to be generated. It requires additional configuration via Cacheline
> Size register. We are not emulating this register right now and we
> can't expect guest to properly configure it. QEMU "emulates" access to
> Cachline Size register by ignoring all writes to it. QEMU passes through
> writes of PCI_COMMAND_INVALIDATE bit, we will do the same.
> 
> PCI_COMMAND_VGA_PALETTE - Enable VGA palette snooping. QEMU passes
> through writes of this bit, we will do the same.
> 
> PCI_COMMAND_PARITY - Controls how the device responds to parity
> errors. QEMU ignores writes to this bit, we will do the same.
> 
> PCI_COMMAND_WAIT - Reserved. Should be 0, but QEMU passes
> through writes of this bit, so we will do the same.
> 
> PCI_COMMAND_SERR - Controls if device can assert SERR. QEMU ignores
> writes to this bit, we will do the same.
> 
> PCI_COMMAND_FAST_BACK - Optional bit that allows fast back-to-back
> transactions. It is configured by firmware, so we don't want guest to
> control it. QEMU ignores writes to this bit, we will do the same.
> 
> PCI_COMMAND_INTX_DISABLE - Disables INTx signals. If MSI(X) is
> enabled, device is prohibited from asserting INTx as per
> specification. Value after reset is 0. In QEMU case, it checks if INTx
> was mapped for a device. If it is not, then the guest can't control the
> PCI_COMMAND_INTX_DISABLE bit. In our case, we prohibit a guest from
> changing the value of this bit if MSI(X) is enabled.

FWIW, bits 11-15 are RsvdP, so we will need to add support for them
after the series from Stewart that adds support for register masks is
accepted.

> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
> In v10:
> - Added cf_check attribute to guest_cmd_read
> - Removed warning about non-zero cmd
> - Updated comment MSI code regarding disabling INTX
> - Used ternary operator in vpci_add_register() call
> - Disable memory decoding for DomUs in init_bars()
> In v9:
> - Reworked guest_cmd_read
> - Added handling for more bits
> Since v6:
> - fold guest's logic into cmd_write
> - implement cmd_read, so we can report emulated INTx state to guests
> - introduce header->guest_cmd to hold the emulated state of the
>   PCI_COMMAND register for guests
> Since v5:
> - add additional check for MSI-X enabled while altering INTX bit
> - make sure INTx disabled while guests enable MSI/MSI-X
> Since v3:
> - gate more code on CONFIG_HAS_MSI
> - removed logic for the case when MSI/MSI-X not enabled
> ---
>  xen/drivers/vpci/header.c | 44 +++++++++++++++++++++++++++++++++++----
>  xen/drivers/vpci/msi.c    |  6 ++++++
>  xen/drivers/vpci/msix.c   |  4 ++++
>  xen/include/xen/vpci.h    |  3 +++
>  4 files changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index efce0bc2ae..e8eed6a674 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -501,14 +501,32 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      return 0;
>  }
>  
> +/* TODO: Add proper emulation for all bits of the command register. */
>  static void cf_check cmd_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
>  {
>      struct vpci_header *header = data;
>  
> +    if ( !is_hardware_domain(pdev->domain) )
> +    {
> +        const struct vpci *vpci = pdev->vpci;
> +        uint16_t excluded = PCI_COMMAND_PARITY | PCI_COMMAND_SERR |
> +            PCI_COMMAND_FAST_BACK;

You could implement those bits using the RsvdP mask also.

You seem to miss PCI_COMMAND_IO?  In the commit message you note that
writes are ignored, yet here you seem to pass through writes to the
underlying device (which might be OK, but needs to be coherent with
the commit message).

> +
> +        header->guest_cmd = cmd;

I'm kind of unsure whether we want to fake the guest view by returning
what the guest writes.

> +
> +        if ( (vpci->msi && vpci->msi->enabled) ||
> +             (vpci->msix && vpci->msi->enabled) )

The typo that you mentioned about msi vs msix.

> +            excluded |= PCI_COMMAND_INTX_DISABLE;
> +
> +        cmd &= ~excluded;
> +        cmd |= pci_conf_read16(pdev->sbdf, reg) & excluded;
> +    }
> +
>      /*
> -     * Let Dom0 play with all the bits directly except for the memory
> -     * decoding one.
> +     * Let guest play with all the bits directly except for the memory
> +     * decoding one. Bits that are not allowed for DomU are already
> +     * handled above.

I think this should be: "Let Dom0 play with all the bits directly ..."
as you mention both Dom0 and DomU.

>       */
>      if ( header->bars_mapped != !!(cmd & PCI_COMMAND_MEMORY) )
>          /*
> @@ -522,6 +540,14 @@ static void cf_check cmd_write(
>          pci_conf_write16(pdev->sbdf, reg, cmd);
>  }
>  
> +static uint32_t cf_check guest_cmd_read(
> +    const struct pci_dev *pdev, unsigned int reg, void *data)
> +{
> +    const struct vpci_header *header = data;
> +
> +    return header->guest_cmd;
> +}
> +
>  static void cf_check bar_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
>  {
> @@ -737,8 +763,9 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      }
>  
>      /* Setup a handler for the command register. */
> -    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
> -                           2, header);
> +    rc = vpci_add_register(pdev->vpci,
> +                           is_hwdom ? vpci_hw_read16 : guest_cmd_read,
> +                           cmd_write, PCI_COMMAND, 2, header);
>      if ( rc )
>          return rc;
>  
> @@ -750,6 +777,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      if ( cmd & PCI_COMMAND_MEMORY )
>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
>  
> +    /*
> +     * Clear PCI_COMMAND_MEMORY for DomUs, so they will always start with
> +     * memory decoding disabled and to ensure that we will not call modify_bars()
> +     * at the end of this function.
> +     */
> +    if ( !is_hwdom )
> +        cmd &= ~PCI_COMMAND_MEMORY;

Just for symmetry I would also disable PCI_COMMAND_IO.

I do wonder in which state SeaBIOS or OVMF expects to find the
devices.

> +    header->guest_cmd = cmd;
> +
>      for ( i = 0; i < num_bars; i++ )
>      {
>          uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 2faa54b7ce..0920bd071f 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -70,6 +70,12 @@ static void cf_check control_write(
>  
>          if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>              return;
> +
> +        /*
> +         * Make sure guest doesn't enable INTx while enabling MSI.
> +         */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);
>      }
>      else
>          vpci_msi_arch_disable(msi, pdev);
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index b6abab47ef..9d0233d0e3 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -97,6 +97,10 @@ static void cf_check control_write(
>          for ( i = 0; i < msix->max_entries; i++ )
>              if ( !msix->entries[i].masked && msix->entries[i].updated )
>                  update_entry(&msix->entries[i], pdev, i);
> +
> +        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);

Note that if both new_enabled and new_masked are set, you won't get
inside of this condition, and that could lead to MSIX being enabled
with INTx set in the command register (albeit with the maskall bit
set).

You might have to add a new check before the pci_conf_write16() that
disables INTx if `new_enabled && !msix->enabled`.

>      }
>      else if ( !new_enabled && msix->enabled )
>      {
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index c5301e284f..60bdc10c13 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -87,6 +87,9 @@ struct vpci {
>          } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
>          /* At most 6 BARS + 1 expansion ROM BAR. */
>  
> +        /* Guest view of the PCI_COMMAND register. */

Maybe we want to add '(domU only)' to the comment.

Thanks, Roger.



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-10-12 22:09 ` [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
  2023-11-16 16:06   ` Julien Grall
@ 2023-11-21 14:40   ` Roger Pau Monné
  1 sibling, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-21 14:40 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini, Wei Liu

On Thu, Oct 12, 2023 at 10:09:18PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> In v10:
> - Removed ASSERT(pcidevs_locked())
> - Removed redundant code (local sbdf variable, clearing sbdf during
> device removal, etc)
> - Added __maybe_unused attribute to "out:" label
> - Introduced HAS_VPCI_GUEST_SUPPORT Kconfig option, as this is the
>   first patch where it is used (previously was in "vpci: add hooks for
>   PCI device assign/de-assign")
> In v9:
> - Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)
> In v8:
> - Added write lock in add_virtual_device
> Since v6:
> - re-work wrt new locking scheme
> - OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
> Since v5:
> - s/vpci_add_virtual_device/add_virtual_device and make it static
> - call add_virtual_device from vpci_assign_device and do not use
>   REGISTER_VPCI_INIT machinery
> - add pcidevs_locked ASSERT
> - use DECLARE_BITMAP for vpci_dev_assigned_map
> Since v4:
> - moved and re-worked guest sbdf initializers
> - s/set_bit/__set_bit
> - s/clear_bit/__clear_bit
> - minor comment fix s/Virtual/Guest/
> - added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
>   later for counting the number of MMIO handlers required for a guest
>   (Julien)
> Since v3:
>  - make use of VPCI_INIT
>  - moved all new code to vpci.c which belongs to it
>  - changed open-coded 31 to PCI_SLOT(~0)
>  - added comments and code to reject multifunction devices with
>    functions other than 0
>  - updated comment about vpci_dev_next and made it unsigned int
>  - implement roll back in case of error while assigning/deassigning devices
>  - s/dom%pd/%pd
> Since v2:
>  - remove casts that are (a) malformed and (b) unnecessary
>  - add new line for better readability
>  - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
>     functions are now completely gated with this config
>  - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/drivers/Kconfig     |  4 +++
>  xen/drivers/vpci/vpci.c | 63 +++++++++++++++++++++++++++++++++++++++++
>  xen/include/xen/sched.h |  8 ++++++
>  xen/include/xen/vpci.h  | 11 +++++++
>  4 files changed, 86 insertions(+)
> 
> diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
> index db94393f47..780490cf8e 100644
> --- a/xen/drivers/Kconfig
> +++ b/xen/drivers/Kconfig
> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>  config HAS_VPCI
>  	bool
>  
> +config HAS_VPCI_GUEST_SUPPORT
> +	bool
> +	depends on HAS_VPCI
> +
>  endmenu
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 5e34d0092a..7c46a2d3f4 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -36,6 +36,52 @@ extern vpci_register_init_t *const __start_vpci_array[];
>  extern vpci_register_init_t *const __end_vpci_array[];
>  #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    unsigned long new_dev_number;

Why unsigned long?  unsigned int seems more than enough to account for
all possible dev numbers [0, 31].

> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> +                                         VPCI_MAX_VIRT_DEV);
> +    if ( new_dev_number == VPCI_MAX_VIRT_DEV )
> +    {
> +        write_unlock(&pdev->domain->pci_lock);

This write_unlock() looks bogus, as the lock is not taken by this
function.  Won't this create an unlock imbalance when the caller of
vpci_assign_device() also attempts to write-unlock d->pci_lock?

> +        return -ENOSPC;
> +    }
> +
> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
> +
> +    /*
> +     * Both segment and bus number are 0:
> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
> +     *  - with bus 0 the virtual devices are seen as embedded
> +     *    endpoints behind the root complex
> +     *
> +     * TODO: add support for multi-function devices.
> +     */
> +    pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0);
> +
> +    return 0;
> +}
> +
> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
> +
>  void vpci_deassign_device(struct pci_dev *pdev)
>  {
>      unsigned int i;
> @@ -46,6 +92,13 @@ void vpci_deassign_device(struct pci_dev *pdev)
>          return;
>  
>      spin_lock(&pdev->vpci->lock);
> +
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
> +                    &pdev->domain->vpci_dev_assigned_map);
> +#endif

This chunk could in principle be outside of the vpci->lock region
AFAICT.

> +
>      while ( !list_empty(&pdev->vpci->handlers) )
>      {
>          struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> @@ -101,6 +154,13 @@ int vpci_assign_device(struct pci_dev *pdev)
>      INIT_LIST_HEAD(&pdev->vpci->handlers);
>      spin_lock_init(&pdev->vpci->lock);
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    pdev->vpci->guest_sbdf.sbdf = ~0;
> +    rc = add_virtual_device(pdev);
> +    if ( rc )
> +        goto out;
> +#endif
> +
>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
>      {
>          rc = __start_vpci_array[i](pdev);
> @@ -108,11 +168,14 @@ int vpci_assign_device(struct pci_dev *pdev)
>              break;
>      }
>  
> + out:
> +    __maybe_unused;

Can you place it in the same line as the out: label please?

>      if ( rc )
>          vpci_deassign_device(pdev);
>  
>      return rc;
>  }
> +

Stray newline?

Thanks, Roger.



* Re: [PATCH v10 14/17] xen/arm: translate virtual PCI bus topology for guests
  2023-10-12 22:09 ` [PATCH v10 14/17] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
@ 2023-11-21 15:11   ` Roger Pau Monné
  0 siblings, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-21 15:11 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Stefano Stabellini, Julien Grall, Bertrand Marquis

On Thu, Oct 12, 2023 at 10:09:18PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> There are three originators for the PCI configuration space access:
> 1. The domain that owns physical host bridge: MMIO handlers are
> there so we can update vPCI register handlers with the values
> written by the hardware domain, e.g. physical view of the registers
> vs guest's view on the configuration space.
> 2. Guest access to the passed through PCI devices: we need to properly
> map virtual bus topology to the physical one, e.g. pass the configuration
> space access to the corresponding physical devices.
> 3. Emulated host PCI bridge access. It doesn't exist in the physical
> topology, e.g. it can't be mapped to some physical host bridge.
> So, all access to the host bridge itself needs to be trapped and
> emulated.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v9:
> - Commend about required lock replaced with ASSERT()
> - Style fixes
> - call to vpci_translate_virtual_device folded into vpci_sbdf_from_gpa
> Since v8:
> - locks moved out of vpci_translate_virtual_device()
> Since v6:
> - add pcidevs locking to vpci_translate_virtual_device
> - update wrt to the new locking scheme
> Since v5:
> - add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
>   case to simplify ifdefery
> - add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
> - reset output register on failed virtual SBDF translation
> Since v4:
> - indentation fixes
> - constify struct domain
> - updated commit message
> - updates to the new locking scheme (pdev->vpci_lock)
> Since v3:
> - revisit locking
> - move code to vpci.c
> Since v2:
>  - pass struct domain instead of struct vcpu
>  - constify arguments where possible
>  - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/arch/arm/vpci.c     | 51 ++++++++++++++++++++++++++++++++---------
>  xen/drivers/vpci/vpci.c | 25 +++++++++++++++++++-
>  xen/include/xen/vpci.h  | 10 ++++++++
>  3 files changed, 74 insertions(+), 12 deletions(-)
> 
> diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
> index 3bc4bb5508..58e2a20135 100644
> --- a/xen/arch/arm/vpci.c
> +++ b/xen/arch/arm/vpci.c
> @@ -7,31 +7,55 @@
>  
>  #include <asm/mmio.h>
>  
> -static pci_sbdf_t vpci_sbdf_from_gpa(const struct pci_host_bridge *bridge,
> -                                     paddr_t gpa)
> +static bool_t vpci_sbdf_from_gpa(struct domain *d,

s/bool_t/bool/

> +                                 const struct pci_host_bridge *bridge,
> +                                 paddr_t gpa, pci_sbdf_t *sbdf)
>  {
> -    pci_sbdf_t sbdf;
> +    ASSERT(sbdf);
>  
>      if ( bridge )
>      {
> -        sbdf.sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
> -        sbdf.seg = bridge->segment;
> -        sbdf.bus += bridge->cfg->busn_start;
> +        sbdf->sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
> +        sbdf->seg = bridge->segment;
> +        sbdf->bus += bridge->cfg->busn_start;
>      }
>      else
> -        sbdf.sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
> -
> -    return sbdf;
> +    {
> +        bool translated;
> +
> +        /*
> +         * For the passed through devices we need to map their virtual SBDF
> +         * to the physical PCI device being passed through.
> +         */
> +        sbdf->sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
> +        read_lock(&d->pci_lock);
> +        translated = vpci_translate_virtual_device(d, sbdf);
> +        read_unlock(&d->pci_lock);
> +
> +        if ( !translated )
> +        {
> +            return false;
> +        }
> +    }
> +    return true;
>  }

I would make translated a top-level variable:

{
    bool translated = true;

    if ( bridge )
        ...
    else
        ...
        translated = vpci_translate_virtual_device(d, sbdf);
        ....

    return translated;
}

As that IMO makes the logic easier to follow.

>  
>  static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
>                            register_t *r, void *p)
>  {
>      struct pci_host_bridge *bridge = p;
> -    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> +    pci_sbdf_t sbdf;
>      /* data is needed to prevent a pointer cast on 32bit */
>      unsigned long data;
>  
> +    ASSERT(!bridge == !is_hardware_domain(v->domain));
> +
> +    if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
> +    {
> +        *r = ~0ul;

Uppercase suffixes.

> +        return 1;
> +    }
> +
>      if ( vpci_ecam_read(sbdf, ECAM_REG_OFFSET(info->gpa),
>                          1U << info->dabt.size, &data) )
>      {
> @@ -48,7 +72,12 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
>                             register_t r, void *p)
>  {
>      struct pci_host_bridge *bridge = p;
> -    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> +    pci_sbdf_t sbdf;
> +
> +    ASSERT(!bridge == !is_hardware_domain(v->domain));
> +
> +    if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
> +        return 1;
>  
>      return vpci_ecam_write(sbdf, ECAM_REG_OFFSET(info->gpa),
>                             1U << info->dabt.size, r);
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 7c46a2d3f4..0dee5118d6 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -80,6 +80,30 @@ static int add_virtual_device(struct pci_dev *pdev)
>      return 0;
>  }
>  
> +/*
> + * Find the physical device which is mapped to the virtual device
> + * and translate virtual SBDF to the physical one.
> + */
> +bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf)
> +{
> +    const struct pci_dev *pdev;
> +
> +    ASSERT(!is_hardware_domain(d));
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
> +    for_each_pdev ( d, pdev )
> +    {
> +        if ( pdev->vpci && (pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf) )
> +        {
> +            /* Replace guest SBDF with the physical one. */
> +            *sbdf = pdev->sbdf;
> +            return true;
> +        }
> +    }
> +
> +    return false;
> +}
> +
>  #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
>  
>  void vpci_deassign_device(struct pci_dev *pdev)
> @@ -175,7 +199,6 @@ int vpci_assign_device(struct pci_dev *pdev)
>  
>      return rc;
>  }
> -
>  #endif /* __XEN__ */
>  
>  static int vpci_register_cmp(const struct vpci_register *r1,
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 4a53936447..e9269b37ac 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -282,6 +282,16 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
>  }
>  #endif
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf);
> +#else
> +static inline bool vpci_translate_virtual_device(const struct domain *d,
> +                                                 pci_sbdf_t *sbdf)
> +{

I think you want to add an ASSERT_UNREACHABLE() here, as I would
expect that without vPCI guest support built in we would never get
here?

Thanks, Roger.



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-21  0:42                       ` Volodymyr Babchuk
@ 2023-11-22  1:12                         ` Stefano Stabellini
  2023-11-22 11:53                           ` Roger Pau Monné
  0 siblings, 1 reply; 65+ messages in thread
From: Stefano Stabellini @ 2023-11-22  1:12 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stefano Stabellini, Julien Grall, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Wei Liu, Roger Pau Monné,
	xen-devel

On Tue, 20 Nov 2023, Volodymyr Babchuk wrote:
> Stefano Stabellini <sstabellini@kernel.org> writes:
> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> >> Hi Julien,
> >> >> 
> >> >> Julien Grall <julien@xen.org> writes:
> >> >> 
> >> >> > Hi Volodymyr,
> >> >> >
> >> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> >> >> >> Hi Stefano,
> >> >> >> Stefano Stabellini <sstabellini@kernel.org> writes:
> >> >> >> 
> >> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try
> >> >> >>>>> to avoid as much as possible to have two different PCI Root Complex
> >> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
> >> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> >> >> >>>>> tolerable but not ideal.
> >> >> >>>>
> >> >> >>>> But what is exactly wrong with this setup?
> >> >> >>>
> >> >> >>> [...]
> >> >> >>>
> >> >> >>>>> The worst case I would like to avoid is to have
> >> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> >> >> >>>>
> >> >> >>>> This is how our setup works right now.
> >> >> >>>
> >> >> >>> If we have:
> >> >> >>> - a single PCI Root Complex emulated in Xen
> >> >> >>> - Xen is safety certified
> >> >> >>> - individual Virtio devices emulated by QEMU with grants for memory
> >> >> >>>
> >> >> >>> We can go very far in terms of being able to use Virtio in safety
> >> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
> >> >> >>>
> >> >> >>> On the other hand if we put an additional Root Complex in QEMU:
> >> >> >>> - we pay a price in terms of complexity of the codebase
> >> >> >>> - we pay a price in terms of resource utilization
> >> >> >>> - we have one additional problem in terms of using this setup with a
> >> >> >>>    SafeOS (one more device emulated by a non-safe component)
> >> >> >>>
> >> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> >> >> >>> solution because:
> >> >> >>> - we still pay a price in terms of resource utilization
> >> >> >>> - the code complexity goes up a bit but hopefully not by much
> >> >> >>> - there is no impact on safety compared to the ideal scenario
> >> >> >>>
> >> >> >>> This is why I wrote that it is tolerable.
> >> >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some
> >> >> >> more
> >> >> >> points:
> >> >> >> - There is ongoing work on implementing virtio backends as a
> >> >> >> separate
> >> >> >>    applications, written in Rust. Linaro are doing this part. Right now
> >> >> >>    they are implementing only virtio-mmio, but if they want to provide
> >> >> >>    virtio-pci as well, they will need a mechanism to plug only
> >> >> >>    virtio-pci, without Root Complex. This is argument for using single Root
> >> >> >>    Complex emulated in Xen.
> >> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has
> >> >> >> no
> >> >> >>    mechanism for exposing virtio-pci backends without exposing PCI root
> >> >> >>    complex as well. Architecturally, there should be a PCI bus to which
> >> >> >>    virtio-pci devices are connected. Or we need to make some changes to
> >> >> >>    QEMU internals to be able to create virtio-pci backends that are not
> >> >> >>    connected to any bus. Also, added benefit that PCI Root Complex
> >> >> >>    emulator in QEMU handles legacy PCI interrupts for us. This is
> >> >> >>    argument for separate Root Complex for QEMU.
> >> >> >> As right now we have only virtio-pci backends provided by QEMU and
> >> >> >> this
> >> >> >> setup is already working, I propose to stick to this
> >> >> >> solution. Especially, taking into account that it does not require any
> >> >> >> changes to hypervisor code.
> >> >> >
> >> >> > I am not against two hostbridge as a temporary solution as long as
> >> >> > this is not a one way door decision. I am not concerned about the
> >> >> > hypervisor itself, I am more concerned about the interface exposed by
> >> >> > the toolstack and QEMU.
> >> >
> >> > I agree with this...
> >> >
> >> >
> >> >> > To clarify, I don't particular want to have to maintain the two
> >> >> > hostbridges solution once we can use a single hostbridge. So we need
> >> >> > to be able to get rid of it without impacting the interface too much.
> >> >
> >> > ...and this
> >> >
> >> >
> >> >> This depends on virtio-pci backends availability. AFAIK, now only one
> >> >> option is to use QEMU and QEMU provides own host bridge. So if we want
> >> >> get rid of the second host bridge we need either another virtio-pci
> >> >> backend or we need to alter QEMU code so it can live without host
> >> >> bridge.
> >> >> 
> >> >> As for interfaces, it appears that QEMU case does not require any changes
> >> >> into hypervisor itself, it just boils down to writing couple of xenstore
> >> >> entries and spawning QEMU with correct command line arguments.
> >> >
> >> > One thing that Stewart wrote in his reply that is important: it doesn't
> >> > matter if QEMU thinks it is emulating a PCI Root Complex because that's
> >> > required from QEMU's point of view to emulate an individual PCI device.
> >> >
> >> > If we can arrange it so the QEMU PCI Root Complex is not registered
> >> > against Xen as part of the ioreq interface, then QEMU's emulated PCI
> >> > Root Complex is going to be left unused. I think that would be great
> >> > because we still have a clean QEMU-Xen-tools interface and the only
> >> > downside is some extra unused emulation in QEMU. It would be a
> >> > fantastic starting point.
> >> 
> >> I believe, that in this case we need to set manual ioreq handlers, like
> >> what was done in patch "xen/arm: Intercept vPCI config accesses and
> >> forward them to emulator", because we need to route ECAM accesses
> >> either to a virtio-pci backend or to a real PCI device. Also we need
> >> to tell QEMU to not install own ioreq handles for ECAM space.
> >
> > I was imagining that the interface would look like this: QEMU registers
> > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM
> > reads/writes requests for the PCI config space of that BDF only. It
> > would not be the entire ECAM space but only individual PCI conf
> > reads/writes that the BDF only.
> >
> 
> Okay, I see that there is the
> xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding
> IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register
> PCI BDF?

Yes, I think that's best.

Let me expand on this. Like I wrote above, I think it is important that
Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
QEMU implementation easier, it is OK if QEMU emulates an unneeded and
unused PCI Root Complex. From Xen point of view, it doesn't exist.

In terms of ioreq registration, QEMU calls
xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
emulate. That way, Xen vPCI knows exactly what PCI config space
reads/writes to forward to QEMU.

Let's say that:
- 00:02.0 is a PCI passthrough device
- 00:03.0 is a PCI emulated device

QEMU would register 00:03.0 and vPCI would know to forward anything
related to 00:03.0 to QEMU, but not 00:02.0.



> I see that xen-hvm-common.c in QEMU is able to handle only the standard
> 256-byte configuration space, but I hope that it will be an easy fix.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-22  1:12                         ` Stefano Stabellini
@ 2023-11-22 11:53                           ` Roger Pau Monné
  2023-11-22 21:18                             ` Stefano Stabellini
  0 siblings, 1 reply; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-22 11:53 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Volodymyr Babchuk, Julien Grall, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Wei Liu, xen-devel

On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
> On Tue, 20 Nov 2023, Volodymyr Babchuk wrote:
> > Stefano Stabellini <sstabellini@kernel.org> writes:
> > > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > >> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > >> >> Hi Julien,
> > >> >> 
> > >> >> Julien Grall <julien@xen.org> writes:
> > >> >> 
> > >> >> > Hi Volodymyr,
> > >> >> >
> > >> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> > >> >> >> Hi Stefano,
> > >> >> >> Stefano Stabellini <sstabellini@kernel.org> writes:
> > >> >> >> 
> > >> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > >> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try
> > >> >> >>>>> to avoid as much as possible to have two different PCI Root Complex
> > >> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
> > >> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> > >> >> >>>>> tolerable but not ideal.
> > >> >> >>>>
> > >> >> >>>> But what is exactly wrong with this setup?
> > >> >> >>>
> > >> >> >>> [...]
> > >> >> >>>
> > >> >> >>>>> The worst case I would like to avoid is to have
> > >> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> > >> >> >>>>
> > >> >> >>>> This is how our setup works right now.
> > >> >> >>>
> > >> >> >>> If we have:
> > >> >> >>> - a single PCI Root Complex emulated in Xen
> > >> >> >>> - Xen is safety certified
> > >> >> >>> - individual Virtio devices emulated by QEMU with grants for memory
> > >> >> >>>
> > >> >> >>> We can go very far in terms of being able to use Virtio in safety
> > >> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
> > >> >> >>>
> > >> >> >>> On the other hand if we put an additional Root Complex in QEMU:
> > >> >> >>> - we pay a price in terms of complexity of the codebase
> > >> >> >>> - we pay a price in terms of resource utilization
> > >> >> >>> - we have one additional problem in terms of using this setup with a
> > >> >> >>>    SafeOS (one more device emulated by a non-safe component)
> > >> >> >>>
> > >> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> > >> >> >>> solution because:
> > >> >> >>> - we still pay a price in terms of resource utilization
> > >> >> >>> - the code complexity goes up a bit but hopefully not by much
> > >> >> >>> - there is no impact on safety compared to the ideal scenario
> > >> >> >>>
> > >> >> >>> This is why I wrote that it is tolerable.
> > >> >> >> Ah, I see now. Yes, I agree with this. Also I want to add some
> > >> >> >> more points:
> > >> >> >> - There is ongoing work on implementing virtio backends as
> > >> >> >>    separate applications, written in Rust. Linaro are doing this part.
> > >> >> >>    Right now they are implementing only virtio-mmio, but if they want to
> > >> >> >>    provide virtio-pci as well, they will need a mechanism to plug in only
> > >> >> >>    virtio-pci, without a Root Complex. This is an argument for using a
> > >> >> >>    single Root Complex emulated in Xen.
> > >> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has no
> > >> >> >>    mechanism for exposing virtio-pci backends without exposing a PCI root
> > >> >> >>    complex as well. Architecturally, there should be a PCI bus to which
> > >> >> >>    virtio-pci devices are connected. Or we need to make some changes to
> > >> >> >>    QEMU internals to be able to create virtio-pci backends that are not
> > >> >> >>    connected to any bus. Also, an added benefit is that the PCI Root
> > >> >> >>    Complex emulator in QEMU handles legacy PCI interrupts for us. This
> > >> >> >>    is an argument for a separate Root Complex for QEMU.
> > >> >> >> As right now we have only the virtio-pci backends provided by QEMU,
> > >> >> >> and this setup is already working, I propose to stick to this solution,
> > >> >> >> especially taking into account that it does not require any changes to
> > >> >> >> hypervisor code.
> > >> >> >
> > >> >> > I am not against two hostbridges as a temporary solution as long as
> > >> >> > this is not a one-way door decision. I am not concerned about the
> > >> >> > hypervisor itself, I am more concerned about the interface exposed by
> > >> >> > the toolstack and QEMU.
> > >> >
> > >> > I agree with this...
> > >> >
> > >> >
> > >> >> > To clarify, I don't particularly want to have to maintain the two
> > >> >> > hostbridges solution once we can use a single hostbridge. So we need
> > >> >> > to be able to get rid of it without impacting the interface too much.
> > >> >
> > >> > ...and this
> > >> >
> > >> >
> > >> >> This depends on virtio-pci backend availability. AFAIK, right now the
> > >> >> only option is to use QEMU, and QEMU provides its own host bridge. So if
> > >> >> we want to get rid of the second host bridge we need either another
> > >> >> virtio-pci backend or we need to alter the QEMU code so it can live
> > >> >> without a host bridge.
> > >> >> 
> > >> >> As for interfaces, it appears that the QEMU case does not require any
> > >> >> changes to the hypervisor itself; it just boils down to writing a couple
> > >> >> of xenstore entries and spawning QEMU with the correct command line
> > >> >> arguments.
> > >> >
> > >> > One thing that Stewart wrote in his reply that is important: it doesn't
> > >> > matter if QEMU thinks it is emulating a PCI Root Complex because that's
> > >> > required from QEMU's point of view to emulate an individual PCI device.
> > >> >
> > >> > If we can arrange it so the QEMU PCI Root Complex is not registered
> > >> > against Xen as part of the ioreq interface, then QEMU's emulated PCI
> > >> > Root Complex is going to be left unused. I think that would be great
> > >> > because we still have a clean QEMU-Xen-tools interface and the only
> > >> > downside is some extra unused emulation in QEMU. It would be a
> > >> > fantastic starting point.
> > >> 
> > >> I believe that in this case we need to set up manual ioreq handlers, like
> > >> what was done in patch "xen/arm: Intercept vPCI config accesses and
> > >> forward them to emulator", because we need to route ECAM accesses
> > >> either to a virtio-pci backend or to a real PCI device. Also we need
> > >> to tell QEMU not to install its own ioreq handlers for the ECAM space.
> > >
> > > I was imagining that the interface would look like this: QEMU registers
> > > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM
> > > read/write requests for the PCI config space of that BDF only. It
> > > would not be the entire ECAM space, but only the individual PCI config
> > > reads/writes for that BDF.
> > >
> > 
> > Okay, I see that there is the
> > xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding
> > IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register
> > PCI BDF?
> 
> Yes, I think that's best.
> 
> Let me expand on this. Like I wrote above, I think it is important that
> Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
> QEMU implementation easier, it is OK if QEMU emulates an unneeded and
> > unused PCI Root Complex. From Xen's point of view, it doesn't exist.
> 
> > In terms of ioreq registration, QEMU calls
> xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
> emulate. That way, Xen vPCI knows exactly what PCI config space
> reads/writes to forward to QEMU.
> 
> Let's say that:
> > - 00:02.0 is a PCI passthrough device
> - 00:03.0 is a PCI emulated device
> 
> QEMU would register 00:03.0 and vPCI would know to forward anything
> related to 00:03.0 to QEMU, but not 00:02.0.

I think there's some work here so that we have a proper hierarchy
inside of Xen.  Right now both ioreq and vpci expect to decode the
accesses to the PCI config space, and set up (MM)IO handlers to trap
ECAM, see vpci_ecam_{read,write}().

I think we want to move to a model where vPCI doesn't set up MMIO traps
itself, and instead relies on ioreq to do the decoding and forwarding
of accesses.  We need some work in order to represent an internal
ioreq handler, but that shouldn't be too complicated.  IOW: vpci
should register devices it's handling with ioreq, much like QEMU does.

Thanks, Roger.



* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-22 11:53                           ` Roger Pau Monné
@ 2023-11-22 21:18                             ` Stefano Stabellini
  2023-11-23  8:29                               ` Roger Pau Monné
  0 siblings, 1 reply; 65+ messages in thread
From: Stefano Stabellini @ 2023-11-22 21:18 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Volodymyr Babchuk, Julien Grall,
	Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Jan Beulich, Wei Liu, xen-devel

On Wed, 22 Nov 2023, Roger Pau Monné wrote:
> On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
> > On Tue, 20 Nov 2023, Volodymyr Babchuk wrote:
> > > Stefano Stabellini <sstabellini@kernel.org> writes:
> > > > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > > >> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > > >> >> Hi Julien,
> > > >> >> 
> > > >> >> Julien Grall <julien@xen.org> writes:
> > > >> >> 
> > > >> >> > Hi Volodymyr,
> > > >> >> >
> > > >> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> > > >> >> >> Hi Stefano,
> > > >> >> >> Stefano Stabellini <sstabellini@kernel.org> writes:
> > > >> >> >> 
> > > >> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > > >> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try
> > > >> >> >>>>> to avoid as much as possible to have two different PCI Root Complex
> > > >> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
> > > >> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> > > >> >> >>>>> tolerable but not ideal.
> > > >> >> >>>>
> > > >> >> >>>> But what is exactly wrong with this setup?
> > > >> >> >>>
> > > >> >> >>> [...]
> > > >> >> >>>
> > > >> >> >>>>> The worst case I would like to avoid is to have
> > > >> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> > > >> >> >>>>
> > > >> >> >>>> This is how our setup works right now.
> > > >> >> >>>
> > > >> >> >>> If we have:
> > > >> >> >>> - a single PCI Root Complex emulated in Xen
> > > >> >> >>> - Xen is safety certified
> > > >> >> >>> - individual Virtio devices emulated by QEMU with grants for memory
> > > >> >> >>>
> > > >> >> >>> We can go very far in terms of being able to use Virtio in safety
> > > >> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
> > > >> >> >>>
> > > >> >> >>> On the other hand if we put an additional Root Complex in QEMU:
> > > >> >> >>> - we pay a price in terms of complexity of the codebase
> > > >> >> >>> - we pay a price in terms of resource utilization
> > > >> >> >>> - we have one additional problem in terms of using this setup with a
> > > >> >> >>>    SafeOS (one more device emulated by a non-safe component)
> > > >> >> >>>
> > > >> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> > > >> >> >>> solution because:
> > > >> >> >>> - we still pay a price in terms of resource utilization
> > > >> >> >>> - the code complexity goes up a bit but hopefully not by much
> > > >> >> >>> - there is no impact on safety compared to the ideal scenario
> > > >> >> >>>
> > > >> >> >>> This is why I wrote that it is tolerable.
> > > >> >> >> Ah, I see now. Yes, I agree with this. Also I want to add some
> > > >> >> >> more points:
> > > >> >> >> - There is ongoing work on implementing virtio backends as
> > > >> >> >>    separate applications, written in Rust. Linaro are doing this part.
> > > >> >> >>    Right now they are implementing only virtio-mmio, but if they want to
> > > >> >> >>    provide virtio-pci as well, they will need a mechanism to plug in only
> > > >> >> >>    virtio-pci, without a Root Complex. This is an argument for using a
> > > >> >> >>    single Root Complex emulated in Xen.
> > > >> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has no
> > > >> >> >>    mechanism for exposing virtio-pci backends without exposing a PCI root
> > > >> >> >>    complex as well. Architecturally, there should be a PCI bus to which
> > > >> >> >>    virtio-pci devices are connected. Or we need to make some changes to
> > > >> >> >>    QEMU internals to be able to create virtio-pci backends that are not
> > > >> >> >>    connected to any bus. Also, an added benefit is that the PCI Root
> > > >> >> >>    Complex emulator in QEMU handles legacy PCI interrupts for us. This
> > > >> >> >>    is an argument for a separate Root Complex for QEMU.
> > > >> >> >> As right now we have only the virtio-pci backends provided by QEMU,
> > > >> >> >> and this setup is already working, I propose to stick to this solution,
> > > >> >> >> especially taking into account that it does not require any changes to
> > > >> >> >> hypervisor code.
> > > >> >> >
> > > >> >> > I am not against two hostbridges as a temporary solution as long as
> > > >> >> > this is not a one-way door decision. I am not concerned about the
> > > >> >> > hypervisor itself, I am more concerned about the interface exposed by
> > > >> >> > the toolstack and QEMU.
> > > >> >
> > > >> > I agree with this...
> > > >> >
> > > >> >
> > > >> >> > To clarify, I don't particularly want to have to maintain the two
> > > >> >> > hostbridges solution once we can use a single hostbridge. So we need
> > > >> >> > to be able to get rid of it without impacting the interface too much.
> > > >> >
> > > >> > ...and this
> > > >> >
> > > >> >
> > > >> >> This depends on virtio-pci backend availability. AFAIK, right now the
> > > >> >> only option is to use QEMU, and QEMU provides its own host bridge. So if
> > > >> >> we want to get rid of the second host bridge we need either another
> > > >> >> virtio-pci backend or we need to alter the QEMU code so it can live
> > > >> >> without a host bridge.
> > > >> >> 
> > > >> >> As for interfaces, it appears that the QEMU case does not require any
> > > >> >> changes to the hypervisor itself; it just boils down to writing a couple
> > > >> >> of xenstore entries and spawning QEMU with the correct command line
> > > >> >> arguments.
> > > >> >
> > > >> > One thing that Stewart wrote in his reply that is important: it doesn't
> > > >> > matter if QEMU thinks it is emulating a PCI Root Complex because that's
> > > >> > required from QEMU's point of view to emulate an individual PCI device.
> > > >> >
> > > >> > If we can arrange it so the QEMU PCI Root Complex is not registered
> > > >> > against Xen as part of the ioreq interface, then QEMU's emulated PCI
> > > >> > Root Complex is going to be left unused. I think that would be great
> > > >> > because we still have a clean QEMU-Xen-tools interface and the only
> > > >> > downside is some extra unused emulation in QEMU. It would be a
> > > >> > fantastic starting point.
> > > >> 
> > > >> I believe that in this case we need to set up manual ioreq handlers, like
> > > >> what was done in patch "xen/arm: Intercept vPCI config accesses and
> > > >> forward them to emulator", because we need to route ECAM accesses
> > > >> either to a virtio-pci backend or to a real PCI device. Also we need
> > > >> to tell QEMU not to install its own ioreq handlers for the ECAM space.
> > > >
> > > > I was imagining that the interface would look like this: QEMU registers
> > > > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM
> > > > read/write requests for the PCI config space of that BDF only. It
> > > > would not be the entire ECAM space, but only the individual PCI config
> > > > reads/writes for that BDF.
> > > >
> > > 
> > > Okay, I see that there is the
> > > xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding
> > > IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register
> > > PCI BDF?
> > 
> > Yes, I think that's best.
> > 
> > Let me expand on this. Like I wrote above, I think it is important that
> > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
> > QEMU implementation easier, it is OK if QEMU emulates an unneeded and
> > > unused PCI Root Complex. From Xen's point of view, it doesn't exist.
> > 
> > > In terms of ioreq registration, QEMU calls
> > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
> > emulate. That way, Xen vPCI knows exactly what PCI config space
> > reads/writes to forward to QEMU.
> > 
> > Let's say that:
> > - 00:02.0 is PCI passthrough device
> > - 00:03.0 is a PCI emulated device
> > 
> > QEMU would register 00:03.0 and vPCI would know to forward anything
> > related to 00:03.0 to QEMU, but not 00:02.0.
> 
> I think there's some work here so that we have a proper hierarchy
> inside of Xen.  Right now both ioreq and vpci expect to decode the
> accesses to the PCI config space, and setup (MM)IO handlers to trap
> ECAM, see vpci_ecam_{read,write}().
> 
> I think we want to move to a model where vPCI doesn't setup MMIO traps
> itself, and instead relies on ioreq to do the decoding and forwarding
> of accesses.  We need some work in order to represent an internal
> ioreq handler, but that shouldn't be too complicated.  IOW: vpci
> should register devices it's handling with ioreq, much like QEMU does.

I think this could be a good idea.

This would be the very first IOREQ handler implemented in Xen itself,
rather than outside of Xen. Some code refactoring might be required,
which worries me given that vPCI is at v10 and has been pending for
years. I think it could make sense as a follow-up series, not v11.

I think this idea would be beneficial if, in the example above, vPCI
doesn't really need to know about device 00:03.0. vPCI registers via
IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers
00:03.0, and everything works. vPCI is not involved at all in PCI config
space reads and writes for 00:03.0. If this is the case, then moving
vPCI to IOREQ could be good.

On the other hand if vPCI actually needs to know that 00:03.0 exists,
perhaps because it changes something in the PCI Root Complex emulation
or vPCI needs to take some action when PCI config space registers of
00:03.0 are written to, then I think this model doesn't work well. If
this is the case, then I think it would be best to keep vPCI as MMIO
handler and let vPCI forward to IOREQ when appropriate.

I haven't run any experiments, but my gut feeling tells me that we'll
have to follow the second approach because the first is too limiting.


* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-22 21:18                             ` Stefano Stabellini
@ 2023-11-23  8:29                               ` Roger Pau Monné
  2023-11-28 23:45                                 ` Volodymyr Babchuk
  0 siblings, 1 reply; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-23  8:29 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Volodymyr Babchuk, Julien Grall, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Wei Liu, xen-devel

On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote:
> On Wed, 22 Nov 2023, Roger Pau Monné wrote:
> > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
> > > Let me expand on this. Like I wrote above, I think it is important that
> > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
> > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and
> > > unused PCI Root Complex. From Xen's point of view, it doesn't exist.
> > > 
> > > In terms of ioreq registration, QEMU calls
> > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
> > > emulate. That way, Xen vPCI knows exactly what PCI config space
> > > reads/writes to forward to QEMU.
> > > 
> > > Let's say that:
> > > - 00:02.0 is a PCI passthrough device
> > > - 00:03.0 is a PCI emulated device
> > > 
> > > QEMU would register 00:03.0 and vPCI would know to forward anything
> > > related to 00:03.0 to QEMU, but not 00:02.0.
> > 
> > I think there's some work here so that we have a proper hierarchy
> > inside of Xen.  Right now both ioreq and vpci expect to decode the
> > accesses to the PCI config space, and setup (MM)IO handlers to trap
> > ECAM, see vpci_ecam_{read,write}().
> > 
> > I think we want to move to a model where vPCI doesn't setup MMIO traps
> > itself, and instead relies on ioreq to do the decoding and forwarding
> > of accesses.  We need some work in order to represent an internal
> > ioreq handler, but that shouldn't be too complicated.  IOW: vpci
> > should register devices it's handling with ioreq, much like QEMU does.
> 
> I think this could be a good idea.
> 
> This would be the very first IOREQ handler implemented in Xen itself,
> rather than outside of Xen. Some code refactoring might be required,
> which worries me given that vPCI is at v10 and has been pending for
> years. I think it could make sense as a follow-up series, not v11.

That's perfectly fine for me; most of the series here just deals with
the logic to intercept guest accesses to the config space and is
completely agnostic as to how the accesses are intercepted.

> I think this idea would be beneficial if, in the example above, vPCI
> doesn't really need to know about device 00:03.0. vPCI registers via
> IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers
> 00:03.0, and everything works. vPCI is not involved at all in PCI config
> space reads and writes for 00:03.0. If this is the case, then moving
> vPCI to IOREQ could be good.

Given your description above, with the root complex implemented in
vPCI, we would need to mandate vPCI together with ioreqs even if no
passthrough devices are using vPCI itself (just for the emulation of
the root complex).  Which is fine, just wanted to mention the
dependency.

> On the other hand if vPCI actually needs to know that 00:03.0 exists,
> perhaps because it changes something in the PCI Root Complex emulation
> or vPCI needs to take some action when PCI config space registers of
> 00:03.0 are written to, then I think this model doesn't work well. If
> this is the case, then I think it would be best to keep vPCI as MMIO
> handler and let vPCI forward to IOREQ when appropriate.

At first approximation I don't think we would have such interactions,
otherwise the whole premise of ioreq being able to register individual
PCI devices would be broken.

XenServer already has scenarios with two different user-space emulators
(ie: two different ioreq servers) handling accesses to different
devices in the same PCI bus, and there's no interaction with the root
complex required.

> I haven't run any experiments, but my gut feeling tells me that we'll
> have to follow the second approach because the first is too limiting.

Iff there's some case where a change to a downstream PCI device needs
to be propagated to the root complex, then such a mechanism should be
implemented as part of the ioreq interface, otherwise the whole model
is broken.

Thanks, Roger.



* Re: [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure
  2023-11-17 15:16   ` Roger Pau Monné
@ 2023-11-28 22:24     ` Volodymyr Babchuk
  0 siblings, 0 replies; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-11-28 22:24 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Jan Beulich, Andrew Cooper, Wei Liu, Jun Nakajima, Kevin Tian,
	Paul Durrant

Hi Roger

Thank you for the review.

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Oct 12, 2023 at 10:09:15PM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> Use a previously introduced per-domain read/write lock to check
>> whether vpci is present, so we are sure there are no accesses to the
>> contents of the vpci struct if not. This lock can be used (and in a
>> few cases is used right away) so that vpci removal can be performed
>> while holding the lock in write mode. Previously such removal could
>> race with vpci_read for example.
>> 
>> When taking both d->pci_lock and pdev->vpci->lock, they should be
>> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
>> possible deadlock situations.
>> 
>> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
>> from being removed.
>> 
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if
>> done under the read lock, requires vpci->lock to be acquired on both
>> devices being compared, which may produce a deadlock. It is not
>> possible to upgrade read lock to write lock in such a case. So, in
>> order to prevent the deadlock, use d->pci_lock instead. To prevent
>> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
>> always lock hwdom first.
>> 
>> All other code, which doesn't lead to pdev->vpci destruction and does
>> not access multiple pdevs at the same time, can still use a
>> combination of the read lock and pdev->vpci->lock.
>> 
>> 3. Drop const qualifier where the new rwlock is used and this is
>> appropriate.
>> 
>> 4. Do not call process_pending_softirqs with any locks held. For that
>> unlock prior the call and re-acquire the locks after. After
>> re-acquiring the lock there is no need to check if pdev->vpci exists:
>>  - in apply_map because of the context it is called (no race condition
>>    possible)
>>  - for MSI/MSI-X debug code because it is called at the end of
>>    pdev->vpci access and no further access to pdev->vpci is made
>> 
>> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>> 
>> 6. We are removing multiple ASSERT(pcidevs_locked()) instances because
>> they are too strict now: they should be corrected to
>> ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock)), but the problem is
>> that the mentioned instances do not have access to the domain
>> pointer, and it is not feasible to pass a domain pointer to a function
>> just for debugging purposes.
>> 
>> There is a possible lock inversion in MSI code, as some parts of it
>> acquire pcidevs_lock() while already holding d->pci_lock.
>
> Is this going to be addressed in a further patch?
>

It is actually addressed in this patch, in v10. I just forgot to
remove this sentence from the patch description. My bad.

This was fixed by adding a parameter to allocate_and_map_msi_pirq(),
as it is called from two places: from vpci_msi_enable(), where we
already hold d->pci_lock, and from physdev_map_pirq(), where no
locks are taken.

[...]

>> @@ -2908,7 +2908,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>  
>>      msi->irq = irq;
>>  
>> -    pcidevs_lock();
>> +    if ( use_pci_lock )
>> +        pcidevs_lock();
>
> Instead of passing the flag it might be better if the caller can take
> the lock, as to avoid having to pass an extra parameter.
>
> Then we should also assert that either the pcidev_lock or the
> per-domain pci lock is taken?
>

This is a good idea. I'll add the lock in physdev_map_pirq(), as it is
the only place where we call allocate_and_map_msi_pirq() without
holding any lock.

>>      /* Verify or get pirq. */
>>      write_lock(&d->event_lock);
>>      pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
>> @@ -2924,7 +2925,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>  
>>   done:
>>      write_unlock(&d->event_lock);
>> -    pcidevs_unlock();
>> +    if ( use_pci_lock )
>> +        pcidevs_unlock();
>>      if ( ret )
>>      {
>>          switch ( type )
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index 20275260b3..466725d8ca 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
>>      unsigned int i, mpos;
>>      uint16_t control;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>      pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
>>      if ( !pos )
>>          return -ENODEV;
>> @@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>      if ( !pos )
>>          return -ENODEV;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>  
>>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>>      /*
>> @@ -988,8 +988,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
>>  {
>>      struct msi_desc *old_desc;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev )
>>          return -ENODEV;
>>  
>> @@ -1043,8 +1041,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc,
>>  {
>>      struct msi_desc *old_desc;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev || !pdev->msix )
>>          return -ENODEV;
>>  
>> @@ -1154,8 +1150,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc,
>>  		   struct pci_dev *pdev)
>>  {
>> -    ASSERT(pcidevs_locked());
>> -
>
> If you have the pdev in all the above function, you could expand the
> assert to test for pdev->domain->pci_lock?
>

Yes, you are right. This is a leftover from the times when
pci_enable_msi() acquired the pdev pointer by itself.

[...]

>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index 767c1ba718..a52e52db96 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -172,6 +172,7 @@ bool vpci_process_pending(struct vcpu *v)
>>          if ( rc == -ERESTART )
>>              return true;
>>  
>> +        write_lock(&v->domain->pci_lock);
>>          spin_lock(&v->vpci.pdev->vpci->lock);
>>          /* Disable memory decoding unconditionally on failure. */
>>          modify_decoding(v->vpci.pdev,
>> @@ -190,6 +191,7 @@ bool vpci_process_pending(struct vcpu *v)
>>               * failure.
>>               */
>>              vpci_remove_device(v->vpci.pdev);
>> +        write_unlock(&v->domain->pci_lock);
>>      }
>>  
>>      return false;
>> @@ -201,8 +203,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>      struct map_data data = { .d = d, .map = true };
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&d->pci_lock));
>> +
>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> +    {
>> +        /*
>> +         * It's safe to drop and reacquire the lock in this context
>> +         * without risking pdev disappearing because devices cannot be
>> +         * removed until the initial domain has been started.
>> +         */
>> +        read_unlock(&d->pci_lock);
>>          process_pending_softirqs();
>> +        read_lock(&d->pci_lock);
>
> You are asserting the lock is taken in write mode just above the usage
> of read_{un,}lock().  Either the assert is wrong, or the usage of
> read_{un,}lock() is wrong.

Oops, it looks like this is a rebasing artifact. The final version of the
code uses write locks, of course. Patch "vpci/header: handle p2m range
sets per BAR" changes this piece of code too and replaces the read lock
with the write lock. I'll move that change into this patch.

[...]

>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 3bec9a4153..112de56fb3 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>  
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>>          return;
>>  
>> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>      const unsigned long *ro_map;
>>      int rc = 0;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) )
>>          return 0;
>>  
>> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>  
>>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>>      uint32_t data = ~(uint32_t)0;
>> +    rwlock_t *lock;
>>  
>>      if ( !size )
>>      {
>> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>       * Find the PCI dev matching the address, which for hwdom also requires
>>       * consulting DomXEN.  Passthrough everything that's not trapped.
>>       */
>> +    lock = &d->pci_lock;
>> +    read_lock(lock);
>>      pdev = pci_get_pdev(d, sbdf);
>>      if ( !pdev && is_hardware_domain(d) )
>> +    {
>> +        read_unlock(lock);
>> +        lock = &dom_xen->pci_lock;
>> +        read_lock(lock);
>>          pdev = pci_get_pdev(dom_xen, sbdf);
>
> I'm unsure whether devices assigned to dom_xen can change ownership
> after boot, so maybe there's no need for all this lock dance, as the
> device cannot disappear?
>
> Maybe just taking the hardware domain lock is enough to prevent
> concurrent accesses in that case, as the hardware domain is the only
> allowed to access devices owned by dom_xen.

This sounds correct. If there are no objections, I can remove the extra
locks.

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-23  8:29                               ` Roger Pau Monné
@ 2023-11-28 23:45                                 ` Volodymyr Babchuk
  2023-11-29  8:33                                   ` Roger Pau Monné
  0 siblings, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-11-28 23:45 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Julien Grall, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Wei Liu, xen-devel

Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote:
>> On Wed, 22 Nov 2023, Roger Pau Monné wrote:
>> > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
>> > > Let me expand on this. Like I wrote above, I think it is important that
>> > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
>> > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and
>> > > unused PCI Root Complex. From Xen point of view, it doesn't exist.
>> > > 
>> > > In terms if ioreq registration, QEMU calls
>> > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
>> > > emulate. That way, Xen vPCI knows exactly what PCI config space
>> > > reads/writes to forward to QEMU.
>> > > 
>> > > Let's say that:
>> > > - 00:02.0 is PCI passthrough device
>> > > - 00:03.0 is a PCI emulated device
>> > > 
>> > > QEMU would register 00:03.0 and vPCI would know to forward anything
>> > > related to 00:03.0 to QEMU, but not 00:02.0.
>> > 
>> > I think there's some work here so that we have a proper hierarchy
>> > inside of Xen.  Right now both ioreq and vpci expect to decode the
>> > accesses to the PCI config space, and setup (MM)IO handlers to trap
>> > ECAM, see vpci_ecam_{read,write}().
>> > 
>> > I think we want to move to a model where vPCI doesn't setup MMIO traps
>> > itself, and instead relies on ioreq to do the decoding and forwarding
>> > of accesses.  We need some work in order to represent an internal
>> > ioreq handler, but that shouldn't be too complicated.  IOW: vpci
>> > should register devices it's handling with ioreq, much like QEMU does.
>> 
>> I think this could be a good idea.
>> 
>> This would be the very first IOREQ handler implemented in Xen itself,
>> rather than outside of Xen. Some code refactoring might be required,
>> which worries me given that vPCI is at v10 and has been pending for
>> years. I think it could make sense as a follow-up series, not v11.
>
> That's perfectly fine for me, most of the series here just deal with
> the logic to intercept guest access to the config space and is
> completely agnostic as to how the accesses are intercepted.
>
>> I think this idea would be beneficial if, in the example above, vPCI
>> doesn't really need to know about device 00:03.0. vPCI registers via
>> IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers
>> 00:03.0, and everything works. vPCI is not involved at all in PCI config
>> space reads and writes for 00:03.0. If this is the case, then moving
>> vPCI to IOREQ could be good.
>
> Given your description above, with the root complex implemented in
> vPCI, we would need to mandate vPCI together with ioreqs even if no
> passthrough devices are using vPCI itself (just for the emulation of
> the root complex).  Which is fine, just wanted to mention the
> dependency.
>
>> On the other hand if vPCI actually needs to know that 00:03.0 exists,
>> perhaps because it changes something in the PCI Root Complex emulation
>> or vPCI needs to take some action when PCI config space registers of
>> 00:03.0 are written to, then I think this model doesn't work well. If
>> this is the case, then I think it would be best to keep vPCI as MMIO
>> handler and let vPCI forward to IOREQ when appropriate.
>
> At first approximation I don't think we would have such interactions,
> otherwise the whole premise of ioreq being able to register individual
> PCI devices would be broken.
>
> XenSever already has scenarios with two different user-space emulators
> (ie: two different ioreq servers) handling accesses to different
> devices in the same PCI bus, and there's no interaction with the root
> complex required.
>

Out of curiosity: how are legacy PCI interrupts handled in this case? In
my understanding, it is the Root Complex's responsibility to propagate
correct IRQ levels to the interrupt controller?

[...]

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-28 23:45                                 ` Volodymyr Babchuk
@ 2023-11-29  8:33                                   ` Roger Pau Monné
  2023-11-30  2:28                                     ` Stefano Stabellini
  0 siblings, 1 reply; 65+ messages in thread
From: Roger Pau Monné @ 2023-11-29  8:33 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stefano Stabellini, Julien Grall, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Wei Liu, xen-devel

On Tue, Nov 28, 2023 at 11:45:34PM +0000, Volodymyr Babchuk wrote:
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote:
> >> On Wed, 22 Nov 2023, Roger Pau Monné wrote:
> >> > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
> >> > > Let me expand on this. Like I wrote above, I think it is important that
> >> > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
> >> > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and
> >> > > unused PCI Root Complex. From Xen point of view, it doesn't exist.
> >> > > 
> >> > > In terms if ioreq registration, QEMU calls
> >> > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
> >> > > emulate. That way, Xen vPCI knows exactly what PCI config space
> >> > > reads/writes to forward to QEMU.
> >> > > 
> >> > > Let's say that:
> >> > > - 00:02.0 is PCI passthrough device
> >> > > - 00:03.0 is a PCI emulated device
> >> > > 
> >> > > QEMU would register 00:03.0 and vPCI would know to forward anything
> >> > > related to 00:03.0 to QEMU, but not 00:02.0.
> >> > 
> >> > I think there's some work here so that we have a proper hierarchy
> >> > inside of Xen.  Right now both ioreq and vpci expect to decode the
> >> > accesses to the PCI config space, and setup (MM)IO handlers to trap
> >> > ECAM, see vpci_ecam_{read,write}().
> >> > 
> >> > I think we want to move to a model where vPCI doesn't setup MMIO traps
> >> > itself, and instead relies on ioreq to do the decoding and forwarding
> >> > of accesses.  We need some work in order to represent an internal
> >> > ioreq handler, but that shouldn't be too complicated.  IOW: vpci
> >> > should register devices it's handling with ioreq, much like QEMU does.
> >> 
> >> I think this could be a good idea.
> >> 
> >> This would be the very first IOREQ handler implemented in Xen itself,
> >> rather than outside of Xen. Some code refactoring might be required,
> >> which worries me given that vPCI is at v10 and has been pending for
> >> years. I think it could make sense as a follow-up series, not v11.
> >
> > That's perfectly fine for me, most of the series here just deal with
> > the logic to intercept guest access to the config space and is
> > completely agnostic as to how the accesses are intercepted.
> >
> >> I think this idea would be beneficial if, in the example above, vPCI
> >> doesn't really need to know about device 00:03.0. vPCI registers via
> >> IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers
> >> 00:03.0, and everything works. vPCI is not involved at all in PCI config
> >> space reads and writes for 00:03.0. If this is the case, then moving
> >> vPCI to IOREQ could be good.
> >
> > Given your description above, with the root complex implemented in
> > vPCI, we would need to mandate vPCI together with ioreqs even if no
> > passthrough devices are using vPCI itself (just for the emulation of
> > the root complex).  Which is fine, just wanted to mention the
> > dependency.
> >
> >> On the other hand if vPCI actually needs to know that 00:03.0 exists,
> >> perhaps because it changes something in the PCI Root Complex emulation
> >> or vPCI needs to take some action when PCI config space registers of
> >> 00:03.0 are written to, then I think this model doesn't work well. If
> >> this is the case, then I think it would be best to keep vPCI as MMIO
> >> handler and let vPCI forward to IOREQ when appropriate.
> >
> > At first approximation I don't think we would have such interactions,
> > otherwise the whole premise of ioreq being able to register individual
> > PCI devices would be broken.
> >
> > XenServer already has scenarios with two different user-space emulators
> > (ie: two different ioreq servers) handling accesses to different
> > devices in the same PCI bus, and there's no interaction with the root
> > complex required.
> >
> 
> Out of curiosity: how are legacy PCI interrupts handled in this case? In
> my understanding, it is the Root Complex's responsibility to propagate
> correct IRQ levels to the interrupt controller?

I'm unsure whether my understanding of the question is correct, so my
reply might not be what you are asking for, sorry.

Legacy IRQs (GSI on x86) are setup directly by the toolstack when the
device is assigned to the guest, using PHYSDEVOP_map_pirq +
XEN_DOMCTL_bind_pt_irq.  Those hypercalls bind together a host IO-APIC
pin to a guest IO-APIC pin, so that interrupts originating from that
host IO-APIC pin are always forwarded to the guest and injected as
originating from the guest IO-APIC pin.

Note that the device will always use the same IO-APIC pin, this is not
configured by the OS.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology
  2023-11-29  8:33                                   ` Roger Pau Monné
@ 2023-11-30  2:28                                     ` Stefano Stabellini
  0 siblings, 0 replies; 65+ messages in thread
From: Stefano Stabellini @ 2023-11-30  2:28 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Volodymyr Babchuk, Stefano Stabellini, Julien Grall,
	Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Jan Beulich, Wei Liu, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5397 bytes --]

On Wed, 29 Nov 2023, Roger Pau Monné wrote:
> On Tue, Nov 28, 2023 at 11:45:34PM +0000, Volodymyr Babchuk wrote:
> > Hi Roger,
> > 
> > Roger Pau Monné <roger.pau@citrix.com> writes:
> > 
> > > On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote:
> > >> On Wed, 22 Nov 2023, Roger Pau Monné wrote:
> > >> > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
> > >> > > Let me expand on this. Like I wrote above, I think it is important that
> > >> > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
> > >> > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and
> > >> > > unused PCI Root Complex. From Xen point of view, it doesn't exist.
> > >> > > 
> > >> > > In terms if ioreq registration, QEMU calls
> > >> > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
> > >> > > emulate. That way, Xen vPCI knows exactly what PCI config space
> > >> > > reads/writes to forward to QEMU.
> > >> > > 
> > >> > > Let's say that:
> > >> > > - 00:02.0 is PCI passthrough device
> > >> > > - 00:03.0 is a PCI emulated device
> > >> > > 
> > >> > > QEMU would register 00:03.0 and vPCI would know to forward anything
> > >> > > related to 00:03.0 to QEMU, but not 00:02.0.
> > >> > 
> > >> > I think there's some work here so that we have a proper hierarchy
> > >> > inside of Xen.  Right now both ioreq and vpci expect to decode the
> > >> > accesses to the PCI config space, and setup (MM)IO handlers to trap
> > >> > ECAM, see vpci_ecam_{read,write}().
> > >> > 
> > >> > I think we want to move to a model where vPCI doesn't setup MMIO traps
> > >> > itself, and instead relies on ioreq to do the decoding and forwarding
> > >> > of accesses.  We need some work in order to represent an internal
> > >> > ioreq handler, but that shouldn't be too complicated.  IOW: vpci
> > >> > should register devices it's handling with ioreq, much like QEMU does.
> > >> 
> > >> I think this could be a good idea.
> > >> 
> > >> This would be the very first IOREQ handler implemented in Xen itself,
> > >> rather than outside of Xen. Some code refactoring might be required,
> > >> which worries me given that vPCI is at v10 and has been pending for
> > >> years. I think it could make sense as a follow-up series, not v11.
> > >
> > > That's perfectly fine for me, most of the series here just deal with
> > > the logic to intercept guest access to the config space and is
> > > completely agnostic as to how the accesses are intercepted.
> > >
> > >> I think this idea would be beneficial if, in the example above, vPCI
> > >> doesn't really need to know about device 00:03.0. vPCI registers via
> > >> IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers
> > >> 00:03.0, and everything works. vPCI is not involved at all in PCI config
> > >> space reads and writes for 00:03.0. If this is the case, then moving
> > >> vPCI to IOREQ could be good.
> > >
> > > Given your description above, with the root complex implemented in
> > > vPCI, we would need to mandate vPCI together with ioreqs even if no
> > > passthrough devices are using vPCI itself (just for the emulation of
> > > the root complex).  Which is fine, just wanted to mention the
> > > dependency.
> > >
> > >> On the other hand if vPCI actually needs to know that 00:03.0 exists,
> > >> perhaps because it changes something in the PCI Root Complex emulation
> > >> or vPCI needs to take some action when PCI config space registers of
> > >> 00:03.0 are written to, then I think this model doesn't work well. If
> > >> this is the case, then I think it would be best to keep vPCI as MMIO
> > >> handler and let vPCI forward to IOREQ when appropriate.
> > >
> > > At first approximation I don't think we would have such interactions,
> > > otherwise the whole premise of ioreq being able to register individual
> > > PCI devices would be broken.
> > >
> > > XenServer already has scenarios with two different user-space emulators
> > > (ie: two different ioreq servers) handling accesses to different
> > > devices in the same PCI bus, and there's no interaction with the root
> > > complex required.

Good to hear.

 
> > Out of curiosity: how are legacy PCI interrupts handled in this case? In
> > my understanding, it is the Root Complex's responsibility to propagate
> > correct IRQ levels to the interrupt controller?
> 
> I'm unsure whether my understanding of the question is correct, so my
> reply might not be what you are asking for, sorry.
> 
> Legacy IRQs (GSI on x86) are setup directly by the toolstack when the
> device is assigned to the guest, using PHYSDEVOP_map_pirq +
> XEN_DOMCTL_bind_pt_irq.  Those hypercalls bind together a host IO-APIC
> pin to a guest IO-APIC pin, so that interrupts originating from that
> host IO-APIC pin are always forwarded to the guest and injected as
> originating from the guest IO-APIC pin.
> 
> Note that the device will always use the same IO-APIC pin, this is not
> configured by the OS.

QEMU calls xen_set_pci_intx_level which is implemented by
xendevicemodel_set_pci_intx_level, which is XEN_DMOP_set_pci_intx_level,
which does set_pci_intx_level. Eventually it calls hvm_pci_intx_assert
and hvm_pci_intx_deassert.

I don't think any of this goes via the Root Complex; otherwise, as
Roger pointed out, it wouldn't be possible to emulate individual PCI
devices in separate IOREQ servers.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests
  2023-11-21 14:17   ` Roger Pau Monné
@ 2023-12-01  2:05     ` Volodymyr Babchuk
  2023-12-01  9:04       ` Roger Pau Monné
  2023-12-21 22:58     ` Stewart Hildebrand
  1 sibling, 1 reply; 65+ messages in thread
From: Volodymyr Babchuk @ 2023-12-01  2:05 UTC (permalink / raw)
  To: Roger Pau Monné, Stewart Hildebrand
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko


Hi Roger, Stewart,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Oct 12, 2023 at 10:09:18PM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
>> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
>> guest's view of this will want to be zero initially, the host having set
>> it to 1 may not easily be overwritten with 0, or else we'd effectively
>> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
>> proper emulation in order to honor host's settings.
>> 
>> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
>> Device Control" the reset state of the command register is typically 0,
>> so when assigning a PCI device use 0 as the initial state for the guest's view
>> of the command register.
>> 
>> Here is the full list of command register bits with notes about
>> emulation, along with QEMU behavior in the same situation:
>> 
>> PCI_COMMAND_IO - QEMU does not allow a guest to change value of this bit
>> in real device. Instead it is always set to 1. A guest can write to this
>> register, but writes are ignored.
>> 
>> PCI_COMMAND_MEMORY - QEMU behaves exactly as with PCI_COMMAND_IO. In
>> Xen case, we handle writes to this bit by mapping/unmapping BAR
>> regions. For devices assigned to DomUs, memory decoding will be
>> disabled and the initialization.
>> 
>> PCI_COMMAND_MASTER - Allow guest to control it. QEMU passes through
>> writes to this bit.
>> 
>> PCI_COMMAND_SPECIAL - Guest can generate special cycles only if it has
>> access to host bridge that supports software generation of special
>> cycles. In our case guest has no access to host bridges at all. Value
>> after reset is 0. QEMU passes through writes of this bit, we will do
>> the same.
>> 
>> PCI_COMMAND_INVALIDATE - Allows "Memory Write and Invalidate" commands
>> to be generated. It requires additional configuration via Cacheline
>> Size register. We are not emulating this register right now and we
>> can't expect guest to properly configure it. QEMU "emulates" access to
>> Cachline Size register by ignoring all writes to it. QEMU passes through
>> writes of PCI_COMMAND_INVALIDATE bit, we will do the same.
>> 
>> PCI_COMMAND_VGA_PALETTE - Enable VGA palette snooping. QEMU passes
>> through writes of this bit, we will do the same.
>> 
>> PCI_COMMAND_PARITY - Controls how device response to parity
>> errors. QEMU ignores writes to this bit, we will do the same.
>> 
>> PCI_COMMAND_WAIT - Reserved. Should be 0, but QEMU passes
>> through writes of this bit, so we will do the same.
>> 
>> PCI_COMMAND_SERR - Controls if device can assert SERR. QEMU ignores
>> writes to this bit, we will do the same.
>> 
>> PCI_COMMAND_FAST_BACK - Optional bit that allows fast back-to-back
>> transactions. It is configured by firmware, so we don't want guest to
>> control it. QEMU ignores writes to this bit, we will do the same.
>> 
>> PCI_COMMAND_INTX_DISABLE - Disables INTx signals. If MSI(X) is
>> enabled, device is prohibited from asserting INTx as per
>> specification. Value after reset is 0. In QEMU case, it checks of INTx
>> was mapped for a device. If it is not, then guest can't control
>> PCI_COMMAND_INTX_DISABLE bit. In our case, we prohibit a guest to
>> change value of this bit if MSI(X) is enabled.
>
> FWIW, bits 11-15 are RsvdP, so we will need to add support for them
> after the series from Stewart that adds support for register masks is
> accepted.

Stewart's series implements much better register handling than this
patch. What about dropping this change altogether in favor of Stewart's
implementation? I'd be 100% okay with that.

[...]

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests
  2023-12-01  2:05     ` Volodymyr Babchuk
@ 2023-12-01  9:04       ` Roger Pau Monné
  0 siblings, 0 replies; 65+ messages in thread
From: Roger Pau Monné @ 2023-12-01  9:04 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: Stewart Hildebrand, xen-devel, Oleksandr Andrushchenko

On Fri, Dec 01, 2023 at 02:05:54AM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger, Stewart,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Thu, Oct 12, 2023 at 10:09:18PM +0000, Volodymyr Babchuk wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> 
> >> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
> >> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
> >> guest's view of this will want to be zero initially, the host having set
> >> it to 1 may not easily be overwritten with 0, or else we'd effectively
> >> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
> >> proper emulation in order to honor host's settings.
> >> 
> >> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
> >> Device Control" the reset state of the command register is typically 0,
> >> so when assigning a PCI device use 0 as the initial state for the guest's view
> >> of the command register.
> >> 
> >> Here is the full list of command register bits with notes about
> >> emulation, along with QEMU behavior in the same situation:
> >> 
> >> PCI_COMMAND_IO - QEMU does not allow a guest to change value of this bit
> >> in real device. Instead it is always set to 1. A guest can write to this
> >> register, but writes are ignored.
> >> 
> >> PCI_COMMAND_MEMORY - QEMU behaves exactly as with PCI_COMMAND_IO. In
> >> Xen case, we handle writes to this bit by mapping/unmapping BAR
> >> regions. For devices assigned to DomUs, memory decoding will be
> >> disabled and the initialization.
> >> 
> >> PCI_COMMAND_MASTER - Allow guest to control it. QEMU passes through
> >> writes to this bit.
> >> 
> >> PCI_COMMAND_SPECIAL - Guest can generate special cycles only if it has
> >> access to host bridge that supports software generation of special
> >> cycles. In our case guest has no access to host bridges at all. Value
> >> after reset is 0. QEMU passes through writes of this bit, we will do
> >> the same.
> >> 
> >> PCI_COMMAND_INVALIDATE - Allows "Memory Write and Invalidate" commands
> >> to be generated. It requires additional configuration via Cacheline
> >> Size register. We are not emulating this register right now and we
> >> can't expect guest to properly configure it. QEMU "emulates" access to
> >> Cachline Size register by ignoring all writes to it. QEMU passes through
> >> writes of PCI_COMMAND_INVALIDATE bit, we will do the same.
> >> 
> >> PCI_COMMAND_VGA_PALETTE - Enable VGA palette snooping. QEMU passes
> >> through writes of this bit, we will do the same.
> >> 
> >> PCI_COMMAND_PARITY - Controls how device response to parity
> >> errors. QEMU ignores writes to this bit, we will do the same.
> >> 
> >> PCI_COMMAND_WAIT - Reserved. Should be 0, but QEMU passes
> >> through writes of this bit, so we will do the same.
> >> 
> >> PCI_COMMAND_SERR - Controls if device can assert SERR. QEMU ignores
> >> writes to this bit, we will do the same.
> >> 
> >> PCI_COMMAND_FAST_BACK - Optional bit that allows fast back-to-back
> >> transactions. It is configured by firmware, so we don't want guest to
> >> control it. QEMU ignores writes to this bit, we will do the same.
> >> 
> >> PCI_COMMAND_INTX_DISABLE - Disables INTx signals. If MSI(X) is
> >> enabled, device is prohibited from asserting INTx as per
> >> specification. Value after reset is 0. In QEMU case, it checks of INTx
> >> was mapped for a device. If it is not, then guest can't control
> >> PCI_COMMAND_INTX_DISABLE bit. In our case, we prohibit a guest to
> >> change value of this bit if MSI(X) is enabled.
> >
> > FWIW, bits 11-15 are RsvdP, so we will need to add support for them
> > after the series from Stewart that adds support for register masks is
> > accepted.
> 
> Stewart's series implements much better register handling than this
> patch. What about dropping this change altogether in favor of Stewart's
> implementation? I'd be 100% okay with that.

That's fine.  I expect Stewart's series will go in quite soon, and
then we can likely rework this commit on top of them?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests
  2023-11-21 14:17   ` Roger Pau Monné
  2023-12-01  2:05     ` Volodymyr Babchuk
@ 2023-12-21 22:58     ` Stewart Hildebrand
  1 sibling, 0 replies; 65+ messages in thread
From: Stewart Hildebrand @ 2023-12-21 22:58 UTC (permalink / raw)
  To: Roger Pau Monné, Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko

On 11/21/23 09:17, Roger Pau Monné wrote:
> On Thu, Oct 12, 2023 at 10:09:18PM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
>> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
>> guest's view of this will want to be zero initially, the host having set
>> it to 1 may not easily be overwritten with 0, or else we'd effectively
>> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
>> proper emulation in order to honor host's settings.
>>
>> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
>> Device Control" the reset state of the command register is typically 0,
>> so when assigning a PCI device use 0 as the initial state for the guest's view
>> of the command register.
>>
>> Here is the full list of command register bits with notes about
>> emulation, along with QEMU behavior in the same situation:
>>
>> PCI_COMMAND_IO - QEMU does not allow a guest to change value of this bit
>> in real device. Instead it is always set to 1.

As far as I can tell QEMU only sets this bit to 1 (in hardware) if it exposes an I/O BAR to the guest.

>> A guest can write to this
>> register, but writes are ignored.

In Xen, I think we should use rsvdp_mask for domUs for now since we don't (yet) support I/O BARs for domUs in vPCI.

>>
>> PCI_COMMAND_MEMORY - QEMU behaves exactly as with PCI_COMMAND_IO. In
>> Xen case, we handle writes to this bit by mapping/unmapping BAR
>> regions. For devices assigned to DomUs, memory decoding will be
>> disabled and the initialization.
>>
>> PCI_COMMAND_MASTER - Allow guest to control it. QEMU passes through
>> writes to this bit.
>>
>> PCI_COMMAND_SPECIAL - Guest can generate special cycles only if it has
>> access to host bridge that supports software generation of special
>> cycles. In our case guest has no access to host bridges at all. Value
>> after reset is 0. QEMU passes through writes of this bit, we will do
>> the same.
>>
>> PCI_COMMAND_INVALIDATE - Allows "Memory Write and Invalidate" commands
>> to be generated. It requires additional configuration via Cacheline
>> Size register. We are not emulating this register right now and we
>> can't expect guest to properly configure it. QEMU "emulates" access to
>> Cachline Size register by ignoring all writes to it. QEMU passes through
>> writes of PCI_COMMAND_INVALIDATE bit, we will do the same.
>>
>> PCI_COMMAND_VGA_PALETTE - Enable VGA palette snooping. QEMU passes
>> through writes of this bit, we will do the same.
>>
>> PCI_COMMAND_PARITY - Controls how device response to parity
>> errors. QEMU ignores writes to this bit, we will do the same.
>>
>> PCI_COMMAND_WAIT - Reserved. Should be 0, but QEMU passes
>> through writes of this bit, so we will do the same.

Actually, PCI_COMMAND_WAIT bit is in qemu's res_mask, meaning it only passes through the writes in permissive mode. By default qemu does not pass through writes. PCI LB 3.0 and PCIe 6.1 specifications say devices should hardwire this bit to 0, but that some devices may have implemented it as RW. So I think we should use rsvdp_mask in Xen for this bit.

>>
>> PCI_COMMAND_SERR - Controls if device can assert SERR. QEMU ignores
>> writes to this bit, we will do the same.
>>
>> PCI_COMMAND_FAST_BACK - Optional bit that allows fast back-to-back
>> transactions. It is configured by firmware, so we don't want guest to
>> control it. QEMU ignores writes to this bit, we will do the same.
>>
>> PCI_COMMAND_INTX_DISABLE - Disables INTx signals. If MSI(X) is
>> enabled, the device is prohibited from asserting INTx as per the
>> specification. Value after reset is 0. In QEMU's case, it checks if
>> INTx was mapped for a device. If it is not, then the guest can't
>> control the PCI_COMMAND_INTX_DISABLE bit. In our case, we prohibit the
>> guest from changing the value of this bit if MSI(X) is enabled.
> 
> FWIW, bits 11-15 are RsvdP, so we will need to add support for them
> after the series from Stewart that adds support for register masks is
> accepted.

I am working on this.

> 
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> ---
>> In v10:
>> - Added cf_check attribute to guest_cmd_read
>> - Removed warning about non-zero cmd
>> - Updated comment MSI code regarding disabling INTX
>> - Used ternary operator in vpci_add_register() call
>> - Disable memory decoding for DomUs in init_bars()
>> In v9:
>> - Reworked guest_cmd_read
>> - Added handling for more bits
>> Since v6:
>> - fold guest's logic into cmd_write
>> - implement cmd_read, so we can report emulated INTx state to guests
>> - introduce header->guest_cmd to hold the emulated state of the
>>   PCI_COMMAND register for guests
>> Since v5:
>> - add additional check for MSI-X enabled while altering INTX bit
>> - make sure INTx disabled while guests enable MSI/MSI-X
>> Since v3:
>> - gate more code on CONFIG_HAS_MSI
>> - removed logic for the case when MSI/MSI-X not enabled
>> ---
>>  xen/drivers/vpci/header.c | 44 +++++++++++++++++++++++++++++++++++----
>>  xen/drivers/vpci/msi.c    |  6 ++++++
>>  xen/drivers/vpci/msix.c   |  4 ++++
>>  xen/include/xen/vpci.h    |  3 +++
>>  4 files changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index efce0bc2ae..e8eed6a674 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -501,14 +501,32 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>      return 0;
>>  }
>>  
>> +/* TODO: Add proper emulation for all bits of the command register. */
>>  static void cf_check cmd_write(
>>      const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
>>  {
>>      struct vpci_header *header = data;
>>  
>> +    if ( !is_hardware_domain(pdev->domain) )
>> +    {
>> +        const struct vpci *vpci = pdev->vpci;
>> +        uint16_t excluded = PCI_COMMAND_PARITY | PCI_COMMAND_SERR |
>> +            PCI_COMMAND_FAST_BACK;
> 
> You could implement those bits using the RsvdP mask also.

Will do. The rsvdp_mask (in the write path) has already been applied before reaching this handler, so the guest's write value won't propagate to the header->guest_cmd variable. This is okay as long as ...

> 
> You seem to miss PCI_COMMAND_IO?  In the commit message you note that
> writes are ignored, yet here you seem to pass through writes to the
> underlying device (which might be OK, but needs to be coherent with
> the commit message).
> 
>> +
>> +        header->guest_cmd = cmd;
> 
> I'm kind of unsure whether we want to fake the guest view by returning
> what the guest writes.

... we don't provide an emulated view of the additional bits that we're treating as RsvdP. So let's not provide such an emulated/fake view for these bits (for now, at least).

Side note: qemu does provide such an emulated view, using a combination of emu_mask and emulated register variables. Looking at the qemu history, it looks like there may be other registers where we'd likely want to have such an emulated/fake view. So, regardless of whether we want to emulate certain bits in the command register, having the flexibility of an emulated mask/register in vPCI could (eventually) be beneficial (but not as part of this series). I'll make an in-code comment that we may want to re-visit this in the future.

> 
>> +
>> +        if ( (vpci->msi && vpci->msi->enabled) ||
>> +             (vpci->msix && vpci->msi->enabled) )
> 
> The typo that you mentioned about msi vs msix.
> 
>> +            excluded |= PCI_COMMAND_INTX_DISABLE;
>> +
>> +        cmd &= ~excluded;
>> +        cmd |= pci_conf_read16(pdev->sbdf, reg) & excluded;
>> +    }
>> +
>>      /*
>> -     * Let Dom0 play with all the bits directly except for the memory
>> -     * decoding one.
>> +     * Let guest play with all the bits directly except for the memory
>> +     * decoding one. Bits that are not allowed for DomU are already
>> +     * handled above.
> 
> I think this should be: "Let Dom0 play with all the bits directly ..."
> as you mention both Dom0 and DomU.
> 
>>       */
>>      if ( header->bars_mapped != !!(cmd & PCI_COMMAND_MEMORY) )
>>          /*
>> @@ -522,6 +540,14 @@ static void cf_check cmd_write(
>>          pci_conf_write16(pdev->sbdf, reg, cmd);
>>  }
>>  
>> +static uint32_t cf_check guest_cmd_read(
>> +    const struct pci_dev *pdev, unsigned int reg, void *data)
>> +{
>> +    const struct vpci_header *header = data;
>> +
>> +    return header->guest_cmd;
>> +}
>> +
>>  static void cf_check bar_write(
>>      const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
>>  {
>> @@ -737,8 +763,9 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>      }
>>  
>>      /* Setup a handler for the command register. */
>> -    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
>> -                           2, header);
>> +    rc = vpci_add_register(pdev->vpci,
>> +                           is_hwdom ? vpci_hw_read16 : guest_cmd_read,
>> +                           cmd_write, PCI_COMMAND, 2, header);
>>      if ( rc )
>>          return rc;
>>  
>> @@ -750,6 +777,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>      if ( cmd & PCI_COMMAND_MEMORY )
>>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
>>  
>> +    /*
>> +     * Clear PCI_COMMAND_MEMORY for DomUs, so they will always start with
>> +     * memory decoding disabled and to ensure that we will not call modify_bars()
>> +     * at the end of this function.
>> +     */
>> +    if ( !is_hwdom )
>> +        cmd &= ~PCI_COMMAND_MEMORY;
> 
> Just for symmetry I would also disable PCI_COMMAND_IO.
> 
> I do wonder in which states does SeaBIOS or OVMF expects to find the
> devices.
> 
>> +    header->guest_cmd = cmd;
>> +
>>      for ( i = 0; i < num_bars; i++ )
>>      {
>>          uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 2faa54b7ce..0920bd071f 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -70,6 +70,12 @@ static void cf_check control_write(
>>  
>>          if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>>              return;
>> +
>> +        /*
>> +         * Make sure guest doesn't enable INTx while enabling MSI.
>> +         */
>> +        if ( !is_hardware_domain(pdev->domain) )
>> +            pci_intx(pdev, false);
>>      }
>>      else
>>          vpci_msi_arch_disable(msi, pdev);
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index b6abab47ef..9d0233d0e3 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -97,6 +97,10 @@ static void cf_check control_write(
>>          for ( i = 0; i < msix->max_entries; i++ )
>>              if ( !msix->entries[i].masked && msix->entries[i].updated )
>>                  update_entry(&msix->entries[i], pdev, i);
>> +
>> +        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
>> +        if ( !is_hardware_domain(pdev->domain) )
>> +            pci_intx(pdev, false);
> 
> Note that if both new_enabled and new_masked are set, you won't get
> inside of this condition, and that could lead to MSIX being enabled
> with INTx set in the command register (albeit with the maskall bit
> set).
> 
> You might have to add a new check before the pci_conf_write16() that
> disables INTx if `new_enabled && !msix->enabled`.
> 
>>      }
>>      else if ( !new_enabled && msix->enabled )
>>      {
>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>> index c5301e284f..60bdc10c13 100644
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -87,6 +87,9 @@ struct vpci {
>>          } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
>>          /* At most 6 BARS + 1 expansion ROM BAR. */
>>  
>> +        /* Guest view of the PCI_COMMAND register. */
> 
> Maybe we want to add '(domU only)' to the comment.
> 
> Thanks, Roger.


Thread overview: 65+ messages
2023-10-12 22:09 [PATCH v10 00/17] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
2023-10-12 22:09 ` [PATCH v10 01/17] pci: msi: pass pdev to pci_enable_msi() function Volodymyr Babchuk
2023-10-30 15:55   ` Jan Beulich
2023-11-17 13:59   ` Roger Pau Monné
2023-10-12 22:09 ` [PATCH v10 03/17] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
2023-11-03 15:39   ` Stewart Hildebrand
2023-11-17 15:16   ` Roger Pau Monné
2023-11-28 22:24     ` Volodymyr Babchuk
2023-10-12 22:09 ` [PATCH v10 05/17] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
2023-11-20 15:04   ` Roger Pau Monné
2023-10-12 22:09 ` [PATCH v10 02/17] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
2023-11-17 14:33   ` Roger Pau Monné
2023-10-12 22:09 ` [PATCH v10 04/17] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
2023-10-12 22:09 ` [PATCH v10 08/17] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
2023-10-12 22:09 ` [PATCH v10 06/17] vpci/header: rework exit path in init_bars Volodymyr Babchuk
2023-11-20 15:07   ` Roger Pau Monné
2023-10-12 22:09 ` [PATCH v10 07/17] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
2023-10-14 16:00   ` Stewart Hildebrand
2023-11-20 16:06   ` Roger Pau Monné
2023-10-12 22:09 ` [PATCH v10 11/17] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
2023-11-21 12:24   ` Roger Pau Monné
2023-10-12 22:09 ` [PATCH v10 10/17] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
2023-11-20 17:29   ` Roger Pau Monné
2023-10-12 22:09 ` [PATCH v10 09/17] rangeset: add rangeset_empty() function Volodymyr Babchuk
2023-10-13 17:54   ` Stewart Hildebrand
2023-10-13 18:08     ` Volodymyr Babchuk
2023-10-12 22:09 ` [PATCH v10 12/17] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
2023-10-13 21:53   ` Volodymyr Babchuk
2023-11-21 14:17   ` Roger Pau Monné
2023-12-01  2:05     ` Volodymyr Babchuk
2023-12-01  9:04       ` Roger Pau Monné
2023-12-21 22:58     ` Stewart Hildebrand
2023-10-12 22:09 ` [PATCH v10 13/17] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
2023-11-16 16:06   ` Julien Grall
2023-11-16 23:28     ` Stefano Stabellini
2023-11-17  0:06       ` Julien Grall
2023-11-17  0:51         ` Stefano Stabellini
2023-11-17  0:21       ` Volodymyr Babchuk
2023-11-17  0:58         ` Stefano Stabellini
2023-11-17 14:09           ` Volodymyr Babchuk
2023-11-17 18:30             ` Julien Grall
2023-11-17 20:08               ` Volodymyr Babchuk
2023-11-17 21:43                 ` Stefano Stabellini
2023-11-17 22:22                   ` Volodymyr Babchuk
2023-11-18  0:45                     ` Stefano Stabellini
2023-11-21  0:42                       ` Volodymyr Babchuk
2023-11-22  1:12                         ` Stefano Stabellini
2023-11-22 11:53                           ` Roger Pau Monné
2023-11-22 21:18                             ` Stefano Stabellini
2023-11-23  8:29                               ` Roger Pau Monné
2023-11-28 23:45                                 ` Volodymyr Babchuk
2023-11-29  8:33                                   ` Roger Pau Monné
2023-11-30  2:28                                     ` Stefano Stabellini
2023-11-21 14:40   ` Roger Pau Monné
2023-10-12 22:09 ` [PATCH v10 14/17] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
2023-11-21 15:11   ` Roger Pau Monné
2023-10-12 22:09 ` [PATCH v10 16/17] xen/arm: vpci: permit access to guest vpci space Volodymyr Babchuk
2023-10-16 11:00   ` Jan Beulich
2023-10-24 19:44     ` Stewart Hildebrand
2023-10-12 22:09 ` [PATCH v10 15/17] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
2023-10-13  8:34   ` Julien Grall
2023-10-13 13:06     ` Volodymyr Babchuk
2023-10-13 16:46       ` Julien Grall
2023-10-13 17:17         ` Volodymyr Babchuk
2023-10-12 22:09 ` [PATCH v10 17/17] arm/vpci: honor access size when returning an error Volodymyr Babchuk