* [PATCH v9 00/16] PCI devices passthrough on Arm, part 3
@ 2023-08-29 23:19 Volodymyr Babchuk
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu, Paul Durrant, Roger Pau Monné,
	Kevin Tian, Jun Nakajima, Bertrand Marquis, Volodymyr Babchuk

Hello all,

This is the next version of the vPCI rework. The aim of this series is
to prepare the ground for introducing PCI support on the ARM platform.

This version addresses comments from the previous one. It also
introduces a couple of patches from Stewart. These patches are related
to vPCI use on ARM. Patch "vpci/header: rework exit path in init_bars"
was factored out from "vpci/header: handle p2m range sets per BAR".

Changes from previous versions are described in each separate patch.

Oleksandr Andrushchenko (12):
  vpci: use per-domain PCI lock to protect vpci structure
  vpci: restrict unhandled read/write operations for guests
  vpci: add hooks for PCI device assign/de-assign
  vpci/header: implement guest BAR register handlers
  rangeset: add RANGESETF_no_print flag
  vpci/header: handle p2m range sets per BAR
  vpci/header: program p2m with guest BAR view
  vpci/header: emulate PCI_COMMAND register for guests
  vpci/header: reset the command register when adding devices
  vpci: add initial support for virtual PCI bus topology
  xen/arm: translate virtual PCI bus topology for guests
  xen/arm: account IO handlers for emulated PCI MSI-X

Stewart Hildebrand (2):
  xen/arm: vpci: check guest range
  xen/arm: vpci: permit access to guest vpci space

Volodymyr Babchuk (2):
  pci: introduce per-domain PCI rwlock
  vpci/header: rework exit path in init_bars

 xen/arch/arm/vpci.c                         |  71 ++-
 xen/arch/x86/hvm/vmsi.c                     |  24 +-
 xen/arch/x86/hvm/vmx/vmx.c                  |   2 -
 xen/arch/x86/irq.c                          |  15 +-
 xen/arch/x86/msi.c                          |   8 +-
 xen/common/domain.c                         |   5 +-
 xen/common/rangeset.c                       |   5 +-
 xen/drivers/Kconfig                         |   4 +
 xen/drivers/passthrough/amd/pci_amd_iommu.c |   9 +-
 xen/drivers/passthrough/pci.c               | 103 +++-
 xen/drivers/passthrough/vtd/iommu.c         |   9 +-
 xen/drivers/vpci/header.c                   | 497 ++++++++++++++++----
 xen/drivers/vpci/msi.c                      |  32 +-
 xen/drivers/vpci/msix.c                     |  56 ++-
 xen/drivers/vpci/vpci.c                     | 158 ++++++-
 xen/include/xen/rangeset.h                  |   5 +-
 xen/include/xen/sched.h                     |   9 +
 xen/include/xen/vpci.h                      |  39 +-
 18 files changed, 868 insertions(+), 183 deletions(-)

-- 
2.41.0


* [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Jan Beulich,
	Andrew Cooper, Roger Pau Monné,
	Wei Liu, Jun Nakajima, Kevin Tian, Paul Durrant,
	Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Use a previously introduced per-domain read/write lock to check
whether vpci is present, so we are sure there are no accesses to the
contents of the vpci struct if not. This lock can be used (and in a
few cases is used right away) so that vpci removal can be performed
while holding the lock in write mode. Previously such removal could
race with vpci_read for example.

When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock, to avoid
possible deadlock situations.

1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
from being removed.

2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if
done under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade read lock to write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock instead. To prevent
deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
always lock hwdom first.

All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.

3. Drop const qualifier where the new rwlock is used and this is
appropriate.

4. Do not call process_pending_softirqs with any locks held. To that
end, unlock prior to the call and re-acquire the locks afterwards. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
 - in apply_map because of the context in which it is called (no race
   condition possible)
 - for MSI/MSI-X debug code because it is called at the end of
   pdev->vpci access and no further access to pdev->vpci is made

5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.

There is a possible lock inversion in MSI code, as some parts of it
acquire pcidevs_lock() while already holding d->pci_lock.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---
Changes in v9:
 - extended locked region to protect vpci_remove_device and
   vpci_add_handlers() calls
 - vpci_write() takes lock in the write mode to protect
   potential call to modify_bars()
 - renamed lock releasing function
 - removed ASSERT()s from msi code
 - added trylock in vpci_dump_msi

Changes in v8:
 - changed d->vpci_lock to d->pci_lock
 - introducing d->pci_lock in a separate patch
 - extended locked region in vpci_process_pending
 - removed pcidevs_lock in vpci_dump_msi()
 - removed some changes as they are not needed with
   the new locking scheme
 - added handling for hwdom && dom_xen case
---
 xen/arch/x86/hvm/vmsi.c       | 24 ++++++++--------
 xen/arch/x86/hvm/vmx/vmx.c    |  2 --
 xen/arch/x86/irq.c            | 15 +++++++---
 xen/arch/x86/msi.c            |  8 ++----
 xen/drivers/passthrough/pci.c |  7 +++--
 xen/drivers/vpci/header.c     | 18 ++++++++++++
 xen/drivers/vpci/msi.c        | 22 +++++++++++++--
 xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
 xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
 9 files changed, 154 insertions(+), 40 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f236362..fde76cc6b4 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
     struct msixtbl_entry *entry, *new_entry;
     int r = -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
     struct pci_dev *pdev;
     struct msixtbl_entry *entry;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
 {
     unsigned int i;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
     if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
     {
@@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
     int rc;
 
     ASSERT(msi->arch.pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
     {
         struct xen_domctl_bind_pt_irq unbind = {
@@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
 
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
                                        msi->vectors, msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 }
 
 static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -778,15 +777,13 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
     int rc;
 
     ASSERT(msi->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
     rc = vpci_msi_enable(pdev, vectors, 0);
     if ( rc < 0 )
         return rc;
     msi->arch.pirq = rc;
-
-    pcidevs_lock();
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
                                        msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 
     return 0;
 }
@@ -797,8 +794,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     unsigned int i;
 
     ASSERT(pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < nr && bound; i++ )
     {
         struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +811,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     write_lock(&pdev->domain->event_lock);
     unmap_domain_pirq(pdev->domain, pirq);
     write_unlock(&pdev->domain->event_lock);
-    pcidevs_unlock();
 }
 
 void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -854,6 +850,8 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
     int rc;
 
     ASSERT(entry->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
+
     rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
                          table_base);
     if ( rc < 0 )
@@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
 
     entry->arch.pirq = rc;
 
-    pcidevs_lock();
     rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
                          entry->masked);
     if ( rc )
@@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
         vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
         entry->arch.pirq = INVALID_PIRQ;
     }
-    pcidevs_unlock();
 
     return rc;
 }
@@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            read_unlock(&pdev->domain->pci_lock);
             process_pending_softirqs();
+            read_lock(&pdev->domain->pci_lock);
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 1edc7f1e91..545a27796e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -413,8 +413,6 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
 
     spin_unlock_irq(&desc->lock);
 
-    ASSERT(pcidevs_locked());
-
     return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
 
  unlock_out:
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 6abfd81621..cb99ae5392 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2157,7 +2157,7 @@ int map_domain_pirq(
         struct pci_dev *pdev;
         unsigned int nr = 0;
 
-        ASSERT(pcidevs_locked());
+        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
 
         ret = -ENODEV;
         if ( !cpu_has_apic )
@@ -2314,7 +2314,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
@@ -2908,7 +2908,13 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
     msi->irq = irq;
 
-    pcidevs_lock();
+    /*
+     * If we are called via vPCI->vMSI path, we already are holding
+     * d->pci_lock so there is no need to take pcidevs_lock, as it
+     * will cause lock inversion.
+     */
+    if ( !rw_is_locked(&d->pci_lock) )
+        pcidevs_lock();
     /* Verify or get pirq. */
     write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2924,7 +2930,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
  done:
     write_unlock(&d->event_lock);
-    pcidevs_unlock();
+    if ( !rw_is_locked(&d->pci_lock) )
+        pcidevs_unlock();
     if ( ret )
     {
         switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d0bf63df1d..ba2963b7d2 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -613,7 +613,7 @@ static int msi_capability_init(struct pci_dev *dev,
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
     pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
     if ( !pos )
         return -ENODEV;
@@ -783,7 +783,7 @@ static int msix_capability_init(struct pci_dev *dev,
     if ( !pos )
         return -ENODEV;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
 
     control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
     /*
@@ -1000,7 +1000,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
     struct pci_dev *pdev;
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
     pdev = pci_get_pdev(NULL, msi->sbdf);
     if ( !pdev )
         return -ENODEV;
@@ -1055,7 +1054,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
     struct pci_dev *pdev;
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
     pdev = pci_get_pdev(NULL, msi->sbdf);
     if ( !pdev || !pdev->msix )
         return -ENODEV;
@@ -1170,8 +1168,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
  */
 int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
 {
-    ASSERT(pcidevs_locked());
-
     if ( !use_msi )
         return -EPERM;
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 79ca928672..4f18293900 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -752,7 +752,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         pdev->domain = hardware_domain;
         write_lock(&hardware_domain->pci_lock);
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
-        write_unlock(&hardware_domain->pci_lock);
 
         /*
          * For devices not discovered by Xen during boot, add vPCI handlers
@@ -762,17 +761,17 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
-            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
             goto out;
         }
+        write_unlock(&hardware_domain->pci_lock);
         ret = iommu_add_device(pdev);
         if ( ret )
         {
-            vpci_remove_device(pdev);
             write_lock(&hardware_domain->pci_lock);
+            vpci_remove_device(pdev);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
@@ -1147,7 +1146,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
+    write_lock(&ctxt->d->pci_lock);
     err = vpci_add_handlers(pdev);
+    write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
                ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 60f7049e34..177a6b57a5 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -172,6 +172,7 @@ bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        write_lock(&v->domain->pci_lock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
@@ -190,6 +191,7 @@ bool vpci_process_pending(struct vcpu *v)
              * failure.
              */
             vpci_remove_device(v->vpci.pdev);
+        write_unlock(&v->domain->pci_lock);
     }
 
     return false;
@@ -201,8 +203,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking pdev disappearing because devices cannot be
+         * removed until the initial domain has been started.
+         */
+        read_unlock(&d->pci_lock);
         process_pending_softirqs();
+        read_lock(&d->pci_lock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -243,6 +257,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     unsigned int i;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !mem )
         return -ENOMEM;
 
@@ -522,6 +538,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_bar *bars = header->bars;
     int rc;
 
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 8f2b59e61a..a0733bb2cb 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -265,7 +265,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -277,6 +277,9 @@ void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->pci_lock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -318,15 +321,28 @@ void vpci_dump_msi(void)
                      * holding the lock.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
             spin_unlock(&pdev->vpci->lock);
+ pdev_done:
+            /*
+             * Unlock lock to process pending softirqs. This is
+             * potentially unsafe, as d->pdev_list can be changed in
+             * meantime.
+             */
+            read_unlock(&d->pci_lock);
             process_pending_softirqs();
+            if ( !read_trylock(&d->pci_lock) )
+            {
+                printk("unable to access other devices for the domain\n");
+                goto domain_done;
+            }
         }
+        read_unlock(&d->pci_lock);
     }
+ domain_done:
     rcu_read_unlock(&domlist_read_lock);
 }
 
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index f9df506f29..f8c5bd393b 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->pci_lock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->pci_lock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0ul;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +426,7 @@ static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +616,7 @@ static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index d73fa76302..34fff2ef2d 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
         return;
 
@@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
     const unsigned long *ro_map;
     int rc = 0;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) )
         return 0;
 
@@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     uint32_t data = ~(uint32_t)0;
+    rwlock_t *lock;
 
     if ( !size )
     {
@@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
      * Find the PCI dev matching the address, which for hwdom also requires
      * consulting DomXEN.  Passthrough everything that's not trapped.
      */
+    lock = &d->pci_lock;
+    read_lock(lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
+    {
+        read_unlock(lock);
+        lock = &dom_xen->pci_lock;
+        read_lock(lock);
         pdev = pci_get_pdev(dom_xen, sbdf);
+    }
     if ( !pdev || !pdev->vpci )
+    {
+        read_unlock(lock);
         return vpci_read_hw(sbdf, reg, size);
+    }
 
     spin_lock(&pdev->vpci->lock);
 
@@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(lock);
 
     if ( data_offset < size )
     {
@@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
              r->private);
 }
 
+/* Helper function to unlock locks taken by vpci_write in proper order */
+static void release_domain_locks(struct domain *d)
+{
+    ASSERT(rw_is_write_locked(&d->pci_lock));
+
+    if ( is_hardware_domain(d) )
+    {
+        ASSERT(rw_is_write_locked(&dom_xen->pci_lock));
+        write_unlock(&dom_xen->pci_lock);
+    }
+    write_unlock(&d->pci_lock);
+}
+
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
-     * consulting DomXEN.  Passthrough everything that's not trapped.
+     * consulting DomXEN. Passthrough everything that's not trapped.
+     * If this is hwdom, we need to hold locks for both domain in case if
+     * modify_bars() is called
      */
+    write_lock(&d->pci_lock);
+
+    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
+    if ( is_hardware_domain(d) )
+        write_lock(&dom_xen->pci_lock);
+
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
@@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
         if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
             vpci_write_hw(sbdf, reg, size, data);
+
+        release_domain_locks(d);
         return;
     }
 
@@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    release_domain_locks(d);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
-- 
2.41.0


* [PATCH v9 03/16] vpci: restrict unhandled read/write operations for guests
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

A guest would be able to read and write those registers which are not
emulated and have no respective vPCI handlers, so it would be possible
for it to access the hardware directly.
To prevent a guest from reading and writing unhandled registers, make
sure only the hardware domain can access the hardware directly, and
restrict guests from doing so.
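
The intended behaviour can be sketched in isolation as follows. This
is a simplified model, not the patch itself: the demo_* names and the
fake register are hypothetical, and the is_hwdom parameter replaces
the current->domain lookup that the real code does internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fake register standing in for real config space. */
static uint32_t demo_hw_reg = 0xcafe0001;

/* Unhandled read: guests see all ones, only hwdom reaches the "hardware". */
uint32_t demo_read_hw(bool is_hwdom, unsigned int size)
{
    uint32_t mask;

    assert(size == 1 || size == 2 || size == 4);

    if ( !is_hwdom )
        return ~(uint32_t)0;

    mask = (size == 4) ? ~(uint32_t)0 : (((uint32_t)1 << (size * 8)) - 1);

    return demo_hw_reg & mask;
}

/* Unhandled write: silently dropped for guests. */
void demo_write_hw(bool is_hwdom, uint32_t data)
{
    if ( !is_hwdom )
        return;

    demo_hw_reg = data;
}
```

Returning all ones matches what a read from a non-existent PCI device
yields, so a guest cannot tell a blocked register from absent hardware.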

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

---
Since v9:
- removed stray formatting change
- added Roger's R-b tag
Since v6:
- do not use is_hwdom parameter for vpci_{read|write}_hw and use
  current->domain internally
- update commit message
New in v6
---
 xen/drivers/vpci/vpci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 34fff2ef2d..cb45904114 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -233,6 +233,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
 {
     uint32_t data;
 
+    /* Guest domains are not allowed to read real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return ~(uint32_t)0;
+
     switch ( size )
     {
     case 4:
@@ -276,6 +280,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
 static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                           uint32_t data)
 {
+    /* Guest domains are not allowed to write real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return;
+
     switch ( size )
     {
     case 4:
-- 
2.41.0


* [PATCH v9 01/16] pci: introduce per-domain PCI rwlock
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu, Paul Durrant, Roger Pau Monné,
	Kevin Tian

Add a per-domain d->pci_lock that protects access to
d->pdev_list. The purpose of this lock is to guarantee to vPCI code
that the underlying pdev will not disappear from under it. This is an
rw-lock, but this patch adds only write_lock()s. read_lock() users
will be added in subsequent patches.

This lock should be taken in write mode every time d->pdev_list is
altered. This covers both accesses to d->pdev_list and accesses to
pdev->domain_list fields. All write accesses should also be protected
by pcidevs_lock(). The idea is that any user that wants read
access to the list or to the devices stored in the list should use
either this new d->pci_lock or the old pcidevs_lock(). Holding either
of these two locks ensures only that the pdev of interest will not
disappear from under the user and that the pdev will still be assigned
to the same domain. Of course, any new users should use pcidevs_lock()
when it is appropriate (e.g. when accessing any other state that is
protected by the said lock). If both the newly introduced
per-domain rwlock and the pcidevs lock are taken, the latter must be
acquired first.

Any write access to pdev->domain_list should be protected by both
pcidevs_lock() and d->pci_lock in write mode.
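
As an illustration only (not part of the patch), the ordering rule above
can be modelled with a tiny single-threaded checker. All toy_* names are
hypothetical, and the model does not reproduce Xen's actual lock
implementation (for instance, it ignores that pcidevs_lock() is
recursive); it only encodes "pcidevs lock first, then d->pci_lock".

```c
#include <assert.h>

/* Hypothetical stand-ins for pcidevs_lock() and d->pci_lock. */
enum toy_lock {
    TOY_PCIDEVS    = 1u << 0,
    TOY_DOMAIN_PCI = 1u << 1,
};

/* Bitmask of the locks the current (single) context holds. */
struct toy_ctx {
    unsigned int held;
};

/* Returns 0 on success, -1 if taking the lock would violate the ordering. */
static int toy_acquire(struct toy_ctx *c, enum toy_lock l)
{
    assert(!(c->held & (unsigned int)l));  /* no recursion in this toy */

    /* The global lock must never be taken while the per-domain one is held. */
    if ( l == TOY_PCIDEVS && (c->held & TOY_DOMAIN_PCI) )
        return -1;

    c->held |= (unsigned int)l;
    return 0;
}

static void toy_release(struct toy_ctx *c, enum toy_lock l)
{
    assert(c->held & (unsigned int)l);     /* must hold what we release */
    c->held &= ~(unsigned int)l;
}
```

Taking TOY_PCIDEVS and then TOY_DOMAIN_PCI succeeds in this model, while
the reverse order is rejected, which mirrors the rule stated above.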

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---

Changes in v9:
 - restored "pdev->domain = target;" in AMD IOMMU code
 - used "source" instead of pdev->domain in IOMMU functions
 - added comment about lock ordering in the commit message
 - reduced locked regions
 - minor non-functional changes in various places

Changes in v8:
 - New patch

Changes in v8 vs RFC:
 - Removed all read_locks after discussion with Roger in #xendevel
 - pci_release_devices() now returns the first error code
 - extended commit message
 - added missing lock in pci_remove_device()
 - extended locked region in pci_add_device() to protect list_del() calls
---
 xen/common/domain.c                         |  1 +
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  9 ++-
 xen/drivers/passthrough/pci.c               | 71 +++++++++++++++++----
 xen/drivers/passthrough/vtd/iommu.c         |  9 ++-
 xen/include/xen/sched.h                     |  1 +
 5 files changed, 78 insertions(+), 13 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 304aa04fa6..9b04a20160 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -651,6 +651,7 @@ struct domain *domain_create(domid_t domid,
 
 #ifdef CONFIG_HAS_PCI
     INIT_LIST_HEAD(&d->pdev_list);
+    rwlock_init(&d->pci_lock);
 #endif
 
     /* All error paths can depend on the above setup. */
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index bea70db4b7..d219bd9453 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -476,7 +476,14 @@ static int cf_check reassign_device(
 
     if ( devfn == pdev->devfn && pdev->domain != target )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
+        write_lock(&source->pci_lock);
+        list_del(&pdev->domain_list);
+        write_unlock(&source->pci_lock);
+
+        write_lock(&target->pci_lock);
+        list_add(&pdev->domain_list, &target->pdev_list);
+        write_unlock(&target->pci_lock);
+
         pdev->domain = target;
     }
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 33452791a8..79ca928672 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -454,7 +454,9 @@ static void __init _pci_hide_device(struct pci_dev *pdev)
     if ( pdev->domain )
         return;
     pdev->domain = dom_xen;
+    write_lock(&dom_xen->pci_lock);
     list_add(&pdev->domain_list, &dom_xen->pdev_list);
+    write_unlock(&dom_xen->pci_lock);
 }
 
 int __init pci_hide_device(unsigned int seg, unsigned int bus,
@@ -748,7 +750,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
     if ( !pdev->domain )
     {
         pdev->domain = hardware_domain;
+        write_lock(&hardware_domain->pci_lock);
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
+        write_unlock(&hardware_domain->pci_lock);
 
         /*
          * For devices not discovered by Xen during boot, add vPCI handlers
@@ -758,7 +762,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
+            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
+            write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
             goto out;
         }
@@ -766,7 +772,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             vpci_remove_device(pdev);
+            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
+            write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
             goto out;
         }
@@ -816,7 +824,11 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
             pci_cleanup_msi(pdev);
             ret = iommu_remove_device(pdev);
             if ( pdev->domain )
+            {
+                write_lock(&pdev->domain->pci_lock);
                 list_del(&pdev->domain_list);
+                write_unlock(&pdev->domain->pci_lock);
+            }
             printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
             free_pdev(pseg, pdev);
             break;
@@ -887,26 +899,61 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
 
 int pci_release_devices(struct domain *d)
 {
-    struct pci_dev *pdev, *tmp;
-    u8 bus, devfn;
-    int ret;
+    int combined_ret;
+    LIST_HEAD(failed_pdevs);
 
     pcidevs_lock();
-    ret = arch_pci_clean_pirqs(d);
-    if ( ret )
+
+    combined_ret = arch_pci_clean_pirqs(d);
+    if ( combined_ret )
     {
         pcidevs_unlock();
-        return ret;
+        return combined_ret;
     }
-    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
+
+    write_lock(&d->pci_lock);
+
+    while ( !list_empty(&d->pdev_list) )
     {
-        bus = pdev->bus;
-        devfn = pdev->devfn;
-        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
+        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
+                                                struct pci_dev,
+                                                domain_list);
+        uint16_t seg = pdev->seg;
+        uint8_t bus = pdev->bus;
+        uint8_t devfn = pdev->devfn;
+        int ret;
+
+        write_unlock(&d->pci_lock);
+        ret = deassign_device(d, seg, bus, devfn);
+        write_lock(&d->pci_lock);
+        if ( ret )
+        {
+            const struct pci_dev *tmp;
+
+            /*
+             * We need to check if deassign_device() left our pdev in
+             * domain's list. As we dropped the lock, we can't be sure
+             * that list wasn't permutated in some random way, so we
+             * need to traverse the whole list.
+             */
+            for_each_pdev ( d, tmp )
+            {
+                if ( tmp == pdev )
+                {
+                    list_move_tail(&pdev->domain_list, &failed_pdevs);
+                    break;
+                }
+            }
+
+            combined_ret = combined_ret ?: ret;
+        }
     }
+
+    list_splice(&failed_pdevs, &d->pdev_list);
+    write_unlock(&d->pci_lock);
     pcidevs_unlock();
 
-    return ret;
+    return combined_ret;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
@@ -1125,7 +1172,9 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
             if ( !pdev->domain )
             {
                 pdev->domain = ctxt->d;
+                write_lock(&ctxt->d->pci_lock);
                 list_add(&pdev->domain_list, &ctxt->d->pdev_list);
+                write_unlock(&ctxt->d->pci_lock);
                 setup_one_hwdom_device(ctxt, pdev);
             }
             else if ( pdev->domain == dom_xen )
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 0e3062c820..3228900c97 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2806,7 +2806,14 @@ static int cf_check reassign_device_ownership(
 
     if ( devfn == pdev->devfn && pdev->domain != target )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
+        write_lock(&source->pci_lock);
+        list_del(&pdev->domain_list);
+        write_unlock(&source->pci_lock);
+
+        write_lock(&target->pci_lock);
+        list_add(&pdev->domain_list, &target->pdev_list);
+        write_unlock(&target->pci_lock);
+
         pdev->domain = target;
     }
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index b4f43cd410..535a81fe90 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -460,6 +460,7 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+    rwlock_t pci_lock;
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
-- 
2.41.0

* [PATCH v9 05/16] vpci/header: rework exit path in init_bars
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (4 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 06/16] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-20  8:49   ` Roger Pau Monné
  2023-08-29 23:19 ` [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel; +Cc: Stewart Hildebrand, Volodymyr Babchuk, Roger Pau Monné

Introduce a "fail" label in the init_bars() function to provide a
centralized error return path. This is a prerequisite for future changes
to this function.

This patch does not introduce functional changes.
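
For readers unfamiliar with the idiom, a minimal sketch of a centralized
error path follows. It is not the real init_bars(): the toy_* names, the
single command register and the fail_at knob are assumptions made for
illustration, and the success path is simplified (the real function hands
over to modify_bars()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CMD_MEMORY 0x2u

/* Hypothetical device model: one 16-bit command register. */
struct toy_dev {
    uint16_t command;
    int fail_at;    /* index of the step that should fail, or -1 for none */
};

static int toy_step(const struct toy_dev *dev, int step)
{
    return dev->fail_at == step ? -1 : 0;
}

static int toy_init_bars(struct toy_dev *dev)
{
    uint16_t cmd;
    int rc;

    assert(dev != NULL);

    cmd = dev->command;                               /* value to restore */
    dev->command = (uint16_t)(cmd & ~TOY_CMD_MEMORY); /* disable decoding */

    for ( int i = 0; i < 3; i++ )
    {
        rc = toy_step(dev, i);
        if ( rc )
            goto fail;          /* every error leaves through one label */
    }

    dev->command = cmd;
    return 0;

 fail:
    dev->command = cmd;         /* single place that undoes the side effect */
    return rc;
}
```

Whichever step fails, the command register is restored exactly once, so
adding further steps later does not require duplicating the cleanup.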

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
---
Since v9:
- New in v9
---
 xen/drivers/vpci/header.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 3b797df82f..e58bbdf68d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -581,11 +581,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
             rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
                                    4, &bars[i]);
             if ( rc )
-            {
-                pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-                return rc;
-            }
-
+                goto fail;
             continue;
         }
 
@@ -604,10 +600,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
 
         if ( size == 0 )
         {
@@ -622,10 +615,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
         rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
                                &bars[i]);
         if ( rc )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
     }
 
     /* Check expansion ROM. */
@@ -647,6 +637,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
+
+ fail:
+    pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    return rc;
 }
 REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
 
-- 
2.41.0

* [PATCH v9 07/16] rangeset: add RANGESETF_no_print flag
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (2 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 01/16] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-08-29 23:19 ` [PATCH v9 06/16] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are range sets which should not be printed, so introduce a flag
which allows marking those as such. Implement relevant logic to skip
such entries while printing.

While at it also simplify the definition of the flags by directly
defining those without helpers.
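
A rough, self-contained sketch of the flag handling described above is
given below. The TOY_* flags mirror the bit positions used by the patch,
but the structure and helper names are hypothetical, and the helper counts
entries instead of printing them.

```c
#include <assert.h>
#include <stddef.h>

/* Flags defined directly as bit masks, without _FLAG/FLAG helper pairs. */
#define TOY_RANGESETF_prettyprint_hex (1U << 0)
#define TOY_RANGESETF_no_print        (1U << 1)
#define TOY_RANGESETF_all \
    (TOY_RANGESETF_prettyprint_hex | TOY_RANGESETF_no_print)

/* Hypothetical, heavily reduced range set. */
struct toy_rangeset {
    const char *name;
    unsigned int flags;
};

/* Non-zero when no unknown flag bits are set. */
static int toy_flags_valid(unsigned int flags)
{
    return !(flags & ~TOY_RANGESETF_all);
}

/* Count the sets that a dump of the domain's range sets would print. */
static size_t toy_count_printable(const struct toy_rangeset *sets, size_t n)
{
    size_t printed = 0;

    for ( size_t i = 0; i < n; i++ )
    {
        assert(toy_flags_valid(sets[i].flags));

        if ( sets[i].flags & TOY_RANGESETF_no_print )
            continue;           /* marked as not printable: skip */

        printed++;
    }

    return printed;
}
```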

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Since v5:
- comment indentation (Jan)
Since v1:
- update BUG_ON with new flag
- simplify the definition of the flags
---
 xen/common/rangeset.c      | 5 ++++-
 xen/include/xen/rangeset.h | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index f3baf52ab6..35c3420885 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -433,7 +433,7 @@ struct rangeset *rangeset_new(
     INIT_LIST_HEAD(&r->range_list);
     r->nr_ranges = -1;
 
-    BUG_ON(flags & ~RANGESETF_prettyprint_hex);
+    BUG_ON(flags & ~(RANGESETF_prettyprint_hex | RANGESETF_no_print));
     r->flags = flags;
 
     safe_strcpy(r->name, name ?: "(no name)");
@@ -575,6 +575,9 @@ void rangeset_domain_printk(
 
     list_for_each_entry ( r, &d->rangesets, rangeset_list )
     {
+        if ( r->flags & RANGESETF_no_print )
+            continue;
+
         printk("    ");
         rangeset_printk(r);
         printk("\n");
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index 135f33f606..f7c69394d6 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -49,8 +49,9 @@ void rangeset_limit(
 
 /* Flags for passing to rangeset_new(). */
  /* Pretty-print range limits in hexadecimal. */
-#define _RANGESETF_prettyprint_hex 0
-#define RANGESETF_prettyprint_hex  (1U << _RANGESETF_prettyprint_hex)
+#define RANGESETF_prettyprint_hex   (1U << 0)
+ /* Do not print entries marked with this flag. */
+#define RANGESETF_no_print          (1U << 1)
 
 bool_t __must_check rangeset_is_empty(
     const struct rangeset *r);
-- 
2.41.0


* [PATCH v9 06/16] vpci/header: implement guest BAR register handlers
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (3 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 07/16] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-01  5:25   ` Stewart Hildebrand
  2023-09-20  9:49   ` Roger Pau Monné
  2023-08-29 23:19 ` [PATCH v9 05/16] vpci/header: rework exit path in init_bars Volodymyr Babchuk
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add relevant vpci register handlers when assigning a PCI device to a
domain and remove those when de-assigning. This allows having different
handlers for different domains, e.g. hwdom and other guests.

Emulate guest BAR register values: this allows creating a guest view
of the registers and emulates the size and properties probe as it is
done during PCI device enumeration by the guest.

All empty, IO and ROM BARs for guests are emulated by returning 0 on
reads and ignoring writes: these BARs are special in this respect as
their lower bits have special meaning, so returning the default ~0 on
read may confuse the guest OS.
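
To make the size probe concrete, here is a simplified, stand-alone model
of an emulated 64-bit memory BAR. It is a sketch under stated
assumptions: the toy_* names and TOY_* constants are hypothetical, the
low and high halves live in one structure instead of two adjacent
vpci_bar entries, and the page-offset check done by the real handler is
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_BAR_MEM_MASK  (~0x0fULL)  /* address bits of a memory BAR */
#define TOY_BAR_TYPE_64   0x04u
#define TOY_BAR_PREFETCH  0x08u

struct toy_bar {
    uint64_t guest_addr;   /* address as seen by the guest */
    uint64_t size;         /* power of two */
    bool prefetchable;
};

/* Guest write to the low (hi == false) or high (hi == true) register. */
static void toy_bar_write(struct toy_bar *bar, bool hi, uint32_t val)
{
    uint64_t addr = bar->guest_addr;
    unsigned int shift = hi ? 32 : 0;

    assert(bar->size && !(bar->size & (bar->size - 1)));

    if ( !hi )
        val &= (uint32_t)TOY_BAR_MEM_MASK;  /* low 4 bits are read-only */

    addr &= ~(0xffffffffULL << shift);      /* replace the written half */
    addr |= (uint64_t)val << shift;
    addr &= ~(bar->size - 1);               /* keep size alignment */

    bar->guest_addr = addr;
}

/* Guest read: address bits plus the read-only type/prefetch bits. */
static uint32_t toy_bar_read(const struct toy_bar *bar, bool hi)
{
    uint32_t val;

    if ( hi )
        return (uint32_t)(bar->guest_addr >> 32);

    val = (uint32_t)bar->guest_addr | TOY_BAR_TYPE_64;
    if ( bar->prefetchable )
        val |= TOY_BAR_PREFETCH;

    return val;
}
```

A guest sizing the BAR writes all ones to both halves and reads them
back; because the bits below the BAR size are forced to zero, inverting
the returned address bits and adding one yields the size.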

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v9:
- factored-out "fail" label introduction in init_bars()
- replaced #ifdef CONFIG_X86 with IS_ENABLED()
- do not pass bars[i] to empty_bar_read() handler
- store guest's BAR address instead of guest's BAR register view
Since v6:
- unify the writing of the PCI_COMMAND register on the
  error path into a label
- do not introduce bar_ignore_access helper and open code
- s/guest_bar_ignore_read/empty_bar_read
- update error message in guest_bar_write
- only setup empty_bar_read for IO if !x86
Since v5:
- make sure that the guest set address has the same page offset
  as the physical address on the host
- remove guest_rom_{read|write} as those just implement the default
  behaviour of the registers not being handled
- adjusted comment for struct vpci.addr field
- add guest handlers for BARs which are not handled and will otherwise
  return ~0 on read and ignore writes. The BARs are special with this
  respect as their lower bits have special meaning, so returning ~0
  doesn't seem to be right
Since v4:
- updated commit message
- s/guest_addr/guest_reg
Since v3:
- squashed two patches: dynamic add/remove handlers and guest BAR
  handler implementation
- fix guest BAR read of the high part of a 64bit BAR (Roger)
- add error handling to vpci_assign_device
- s/dom%pd/%pd
- blank line before return
Since v2:
- remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
  has been eliminated from being built on x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - simplify some code
 - use gdprintk + error code instead of gprintk
 - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
   so these do not get compiled for x86
 - removed unneeded is_system_domain check
 - re-work guest read/write to be much simpler and do more work on write
   than read which is expected to be called more frequently
 - removed one too obvious comment
---
 xen/drivers/vpci/header.c | 131 +++++++++++++++++++++++++++++++++-----
 xen/include/xen/vpci.h    |   3 +
 2 files changed, 118 insertions(+), 16 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index e58bbdf68d..e96d7b2b37 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -477,6 +477,72 @@ static void cf_check bar_write(
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static void cf_check guest_bar_write(const struct pci_dev *pdev,
+                                     unsigned int reg, uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    bool hi = false;
+    uint64_t guest_addr = bar->guest_addr;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+    {
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+    }
+
+    guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
+    guest_addr |= (uint64_t)val << (hi ? 32 : 0);
+
+    guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
+
+    /*
+     * Make sure that the guest set address has the same page offset
+     * as the physical address on the host or otherwise things won't work as
+     * expected.
+     */
+    if ( (guest_addr & (~PAGE_MASK)) != (bar->addr & ~PAGE_MASK) )
+    {
+        gprintk(XENLOG_WARNING,
+                "%pp: ignored BAR %zu write attempting to change page offset\n",
+                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
+        return;
+    }
+
+    bar->guest_addr = guest_addr;
+}
+
+static uint32_t cf_check guest_bar_read(const struct pci_dev *pdev,
+                                        unsigned int reg, void *data)
+{
+    const struct vpci_bar *bar = data;
+    uint32_t reg_val;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        return bar->guest_addr >> 32;
+    }
+
+    reg_val = bar->guest_addr;
+    reg_val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32 :
+                                             PCI_BASE_ADDRESS_MEM_TYPE_64;
+    reg_val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+
+    return reg_val;
+}
+
+static uint32_t cf_check empty_bar_read(const struct pci_dev *pdev,
+                                        unsigned int reg, void *data)
+{
+    return 0;
+}
+
 static void cf_check rom_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -537,6 +603,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_header *header = &pdev->vpci->header;
     struct vpci_bar *bars = header->bars;
     int rc;
+    bool is_hwdom = is_hardware_domain(pdev->domain);
 
     ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
@@ -578,8 +645,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            rc = vpci_add_register(pdev->vpci,
+                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                                   is_hwdom ? bar_write : guest_bar_write,
+                                   reg, 4, &bars[i]);
             if ( rc )
                 goto fail;
             continue;
@@ -589,6 +658,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
         if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
         {
             bars[i].type = VPCI_BAR_IO;
+
+            if ( !IS_ENABLED(CONFIG_X86) && !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                       reg, 4, NULL);
+                if ( rc )
+                    goto fail;
+            }
+
             continue;
         }
         if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
@@ -605,6 +683,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
         if ( size == 0 )
         {
             bars[i].type = VPCI_BAR_EMPTY;
+
+            if ( !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                       reg, 4, NULL);
+                if ( rc )
+                    goto fail;
+            }
+
             continue;
         }
 
@@ -612,28 +699,40 @@ static int cf_check init_bars(struct pci_dev *pdev)
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                               is_hwdom ? bar_write : guest_bar_write,
+                               reg, 4, &bars[i]);
         if ( rc )
             goto fail;
     }
 
-    /* Check expansion ROM. */
-    rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
-    if ( rc > 0 && size )
+    /* TODO: Check expansion ROM, we do not handle ROM for guests for now. */
+    if ( is_hwdom )
     {
-        struct vpci_bar *rom = &header->bars[num_bars];
+        rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
+        if ( rc > 0 && size )
+        {
+            struct vpci_bar *rom = &header->bars[num_bars];
 
-        rom->type = VPCI_BAR_ROM;
-        rom->size = size;
-        rom->addr = addr;
-        header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
-                              PCI_ROM_ADDRESS_ENABLE;
+            rom->type = VPCI_BAR_ROM;
+            rom->size = size;
+            rom->addr = addr;
+            header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
+                                  PCI_ROM_ADDRESS_ENABLE;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
-                               4, rom);
+            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
+                                   rom_reg, 4, rom);
+            if ( rc )
+                rom->type = VPCI_BAR_EMPTY;
+        }
+    }
+    else
+    {
+        rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                               rom_reg, 4, NULL);
         if ( rc )
-            rom->type = VPCI_BAR_EMPTY;
+            goto fail;
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 2a0ae34500..89f1e27f4f 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -67,7 +67,10 @@ struct vpci {
     struct vpci_header {
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            /* Physical (host) address. */
             uint64_t addr;
+            /* Guest address. */
+            uint64_t guest_addr;
             uint64_t size;
             enum {
                 VPCI_BAR_EMPTY,
-- 
2.41.0


* [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (5 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 05/16] vpci/header: rework exit path in init_bars Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-12  9:37   ` Jan Beulich
  2023-09-20  8:39   ` Roger Pau Monné
  2023-08-29 23:19 ` [PATCH v9 08/16] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu, Paul Durrant, Roger Pau Monné,
	Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a PCI device gets assigned/de-assigned we need to
initialize/de-initialize vPCI state for the device.

Also, rename vpci_add_handlers() to vpci_assign_device() and
vpci_remove_device() to vpci_deassign_device() to better reflect the
role of the functions.
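
The intended life cycle can be sketched as follows. This is an
illustrative model only: the toy_* types and functions are hypothetical,
locking is omitted, and the per-device state is reduced to a single
field recording which domain it was built for.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_domain {
    int id;
};

/* Stand-in for the per-device vPCI state (handlers, BAR info, ...). */
struct toy_vpci {
    int owner_id;
};

struct toy_pdev {
    struct toy_domain *domain;
    struct toy_vpci *vpci;
};

/* Tear down vPCI state; safe to call when none exists. */
static void toy_vpci_deassign(struct toy_pdev *pdev)
{
    free(pdev->vpci);           /* free(NULL) is a no-op */
    pdev->vpci = NULL;
}

/* Build fresh vPCI state for the device's current owner. */
static int toy_vpci_assign(struct toy_pdev *pdev)
{
    assert(pdev->vpci == NULL); /* stale state must be gone by now */

    pdev->vpci = calloc(1, sizeof(*pdev->vpci));
    if ( !pdev->vpci )
        return -1;

    pdev->vpci->owner_id = pdev->domain->id;
    return 0;
}

/* Move a device: drop old state, change owner, set up new state. */
static int toy_reassign(struct toy_pdev *pdev, struct toy_domain *target)
{
    toy_vpci_deassign(pdev);
    pdev->domain = target;
    return toy_vpci_assign(pdev);
}
```

The point of the sketch is the ordering: state belonging to the previous
owner is dropped before the owner changes, and new state is created only
once the device belongs to the target domain.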

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
Since v9:
- removed previous vpci_[de]assign_device functions and renamed
  existing handlers
- dropped attempts to handle errors in assign_device() function
- do not call vpci_assign_device for dom_io
- use d instead of pdev->domain
- use IS_ENABLED macro
Since v8:
- removed vpci_deassign_device
Since v6:
- do not pass struct domain to vpci_{assign|deassign}_device as
  pdev->domain can be used
- do not leave the device assigned (pdev->domain == new domain) in case
  vpci_assign_device fails: try to de-assign and if this also fails, then
  crash the domain
Since v5:
- do not split code into run_vpci_init
- do not check for is_system_domain in vpci_{de}assign_device
- do not use vpci_remove_device_handlers_locked and re-allocate
  pdev->vpci completely
- make vpci_deassign_device void
Since v4:
 - de-assign vPCI from the previous domain on device assignment
 - do not remove handlers in vpci_assign_device as those must not
   exist at that point
Since v3:
 - remove toolstack roll-back description from the commit message
   as errors are to be handled with proper cleanup in Xen itself
 - remove __must_check
 - remove redundant rc check while assigning devices
 - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
 - use REGISTER_VPCI_INIT machinery to run required steps on device
   init/assign: add run_vpci_init helper
Since v2:
- define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
  for x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - extended the commit message
---
 xen/drivers/Kconfig           |  4 ++++
 xen/drivers/passthrough/pci.c | 31 +++++++++++++++++++++++++++----
 xen/drivers/vpci/header.c     |  2 +-
 xen/drivers/vpci/vpci.c       |  6 +++---
 xen/include/xen/vpci.h        | 10 +++++-----
 5 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index db94393f47..780490cf8e 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
 config HAS_VPCI
 	bool
 
+config HAS_VPCI_GUEST_SUPPORT
+	bool
+	depends on HAS_VPCI
+
 endmenu
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 4f18293900..64281f2d5e 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -757,7 +757,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
          * For devices not discovered by Xen during boot, add vPCI handlers
          * when Dom0 first informs Xen about such devices.
          */
-        ret = vpci_add_handlers(pdev);
+        ret = vpci_assign_device(pdev);
         if ( ret )
         {
             printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
@@ -771,7 +771,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             write_lock(&hardware_domain->pci_lock);
-            vpci_remove_device(pdev);
+            vpci_deassign_device(pdev);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
@@ -819,7 +819,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
         {
-            vpci_remove_device(pdev);
+            vpci_deassign_device(pdev);
             pci_cleanup_msi(pdev);
             ret = iommu_remove_device(pdev);
             if ( pdev->domain )
@@ -877,6 +877,13 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
             goto out;
     }
 
+    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
+    {
+        write_lock(&d->pci_lock);
+        vpci_deassign_device(pdev);
+        write_unlock(&d->pci_lock);
+    }
+
     devfn = pdev->devfn;
     ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn,
                      pci_to_dev(pdev));
@@ -1147,7 +1154,7 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
     write_lock(&ctxt->d->pci_lock);
-    err = vpci_add_handlers(pdev);
+    err = vpci_assign_device(pdev);
     write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
@@ -1481,6 +1488,13 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
     if ( pdev->broken && d != hardware_domain && d != dom_io )
         goto done;
 
+    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
+    {
+        write_lock(&pdev->domain->pci_lock);
+        vpci_deassign_device(pdev);
+        write_unlock(&pdev->domain->pci_lock);
+    }
+
     rc = pdev_msix_assign(d, pdev);
     if ( rc )
         goto done;
@@ -1506,6 +1520,15 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
         rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
                         pci_to_dev(pdev), flag);
     }
+    if ( rc )
+        goto done;
+
+    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) && d != dom_io)
+    {
+        write_lock(&d->pci_lock);
+        rc = vpci_assign_device(pdev);
+        write_unlock(&d->pci_lock);
+    }
 
  done:
     if ( rc )
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 177a6b57a5..3b797df82f 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -190,7 +190,7 @@ bool vpci_process_pending(struct vcpu *v)
              * killed in order to avoid leaking stale p2m mappings on
              * failure.
              */
-            vpci_remove_device(v->vpci.pdev);
+            vpci_deassign_device(v->vpci.pdev);
         write_unlock(&v->domain->pci_lock);
     }
 
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index cb45904114..135d390218 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -36,7 +36,7 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
-void vpci_remove_device(struct pci_dev *pdev)
+void vpci_deassign_device(struct pci_dev *pdev)
 {
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
@@ -69,7 +69,7 @@ void vpci_remove_device(struct pci_dev *pdev)
     pdev->vpci = NULL;
 }
 
-int vpci_add_handlers(struct pci_dev *pdev)
+int vpci_assign_device(struct pci_dev *pdev)
 {
     unsigned int i;
     const unsigned long *ro_map;
@@ -103,7 +103,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
     }
 
     if ( rc )
-        vpci_remove_device(pdev);
+        vpci_deassign_device(pdev);
 
     return rc;
 }
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 0b8a2a3c74..2a0ae34500 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -25,11 +25,11 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
 
-/* Add vPCI handlers to device. */
-int __must_check vpci_add_handlers(struct pci_dev *dev);
+/* Assign vPCI to device by adding handlers to device. */
+int __must_check vpci_assign_device(struct pci_dev *dev);
 
 /* Remove all handlers and free vpci related structures. */
-void vpci_remove_device(struct pci_dev *pdev);
+void vpci_deassign_device(struct pci_dev *pdev);
 
 /* Add/remove a register handler. */
 int __must_check vpci_add_register(struct vpci *vpci,
@@ -235,12 +235,12 @@ bool vpci_ecam_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
 #else /* !CONFIG_HAS_VPCI */
 struct vpci_vcpu {};
 
-static inline int vpci_add_handlers(struct pci_dev *pdev)
+static inline int vpci_assign_device(struct pci_dev *pdev)
 {
     return 0;
 }
 
-static inline void vpci_remove_device(struct pci_dev *pdev) { }
+static inline void vpci_deassign_device(struct pci_dev *pdev) { }
 
 static inline void vpci_dump_msi(void) { }
 
-- 
2.41.0



* [PATCH v9 10/16] vpci/header: emulate PCI_COMMAND register for guests
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (8 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 09/16] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-01  5:23   ` Stewart Hildebrand
  2023-09-21 13:18   ` Roger Pau Monné
  2023-08-29 23:19 ` [PATCH v9 11/16] vpci/header: reset the command register when adding devices Volodymyr Babchuk
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné,
	Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
to remain unaltered. The PCI_COMMAND_SERR bit is a good example: while the
guest's view of this bit should be zero initially, a value of 1 set by the
host must not simply be overwritten with 0, or else we would effectively
give the guest control of the bit. Thus, the PCI_COMMAND register needs
proper emulation in order to honor the host's settings.

According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
Device Control" the reset state of the command register is typically 0,
so when assigning a PCI device use 0 as the initial state for the guest's view
of the command register.

Here is the full list of command register bits with notes about
emulation:

PCI_COMMAND_IO - Allow guest to control it.
PCI_COMMAND_MEMORY - Already handled.
PCI_COMMAND_MASTER - Allow guest to control it.
PCI_COMMAND_SPECIAL - Guest can generate special cycles only if it has
access to host bridge that supports software generation of special
cycles. In our case guest has no access to host bridges at all. Value
after reset is 0.
PCI_COMMAND_INVALIDATE - Allows "Memory Write and Invalidate" commands
to be generated. It requires additional configuration via Cacheline
Size register. We are not emulating this register right now and we
can't expect guest to properly configure it.
PCI_COMMAND_VGA_PALETTE - Enable VGA palette snooping. This bit is set
by firmware and we want to leave it as is.
PCI_COMMAND_PARITY - Controls how the device responds to parity
errors. We want this bit to be set by a hardware domain.
PCI_COMMAND_WAIT - Reserved. Should be 0.
PCI_COMMAND_SERR - Controls if device can assert SERR.
The same as for PCI_COMMAND_PARITY.
PCI_COMMAND_FAST_BACK - Optional bit that allows fast back-to-back
transactions. It is configured by firmware, so we don't want guest to
control it.
PCI_COMMAND_INTX_DISABLE - Disables INTx signals. If MSI(X) is
enabled, device is prohibited from asserting INTx. Value after reset
is 0. Guest can control it freely.
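
The bit handling described above can be sketched in isolation as follows.
This is only an illustration, not the code of this patch: the helper name
merge_guest_cmd and the choice of preserved bits (the three bits that
cmd_write excludes) are assumptions made for the example. The bit values
are the standard ones from the PCI specification.

```c
#include <stdint.h>

/* Standard PCI command register bits used in this sketch. */
#define PCI_COMMAND_INVALIDATE 0x0010
#define PCI_COMMAND_PARITY     0x0040
#define PCI_COMMAND_SERR       0x0100

/*
 * Hypothetical helper: compute the value to write to the hardware
 * register. Bits owned by the host are taken from the current hardware
 * value; every other bit is taken from what the guest wrote.
 */
static uint16_t merge_guest_cmd(uint16_t guest_cmd, uint16_t hw_cmd)
{
    const uint16_t host_owned = PCI_COMMAND_INVALIDATE |
                                PCI_COMMAND_PARITY | PCI_COMMAND_SERR;

    return (uint16_t)((guest_cmd & ~host_owned) | (hw_cmd & host_owned));
}
```

With this split the guest can flip bits such as PCI_COMMAND_MASTER, while
an attempt to set or clear PCI_COMMAND_SERR has no effect on the hardware.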

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
Since v9:
- Reworked guest_cmd_read
- Added handling for more bits
Since v6:
- fold guest's logic into cmd_write
- implement cmd_read, so we can report emulated INTx state to guests
- introduce header->guest_cmd to hold the emulated state of the
  PCI_COMMAND register for guests
Since v5:
- add additional check for MSI-X enabled while altering INTX bit
- make sure INTx disabled while guests enable MSI/MSI-X
Since v3:
- gate more code on CONFIG_HAS_MSI
- removed logic for the case when MSI/MSI-X not enabled
---
 xen/drivers/vpci/header.c | 54 ++++++++++++++++++++++++++++++++++++---
 xen/drivers/vpci/msi.c    | 10 ++++++++
 xen/drivers/vpci/msix.c   |  4 +++
 xen/include/xen/vpci.h    |  3 +++
 4 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 1e82217200..e351db4620 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -502,14 +502,37 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     return 0;
 }
 
+/* TODO: Add proper emulation for all bits of the command register. */
 static void cf_check cmd_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
 {
     struct vpci_header *header = data;
 
+    if ( !is_hardware_domain(pdev->domain) )
+    {
+        if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
+        {
+            /* Tell guest that device does not support this */
+            cmd &= ~PCI_COMMAND_FAST_BACK;
+        }
+
+        header->guest_cmd = cmd;
+
+        if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
+        {
+            /* Do not touch INVALIDATE, PARITY and SERR */
+            const uint16_t excluded = PCI_COMMAND_INVALIDATE |
+                PCI_COMMAND_PARITY | PCI_COMMAND_SERR;
+
+            cmd &= ~excluded;
+            cmd |= pci_conf_read16(pdev->sbdf, reg) & excluded;
+        }
+    }
+
     /*
-     * Let Dom0 play with all the bits directly except for the memory
-     * decoding one.
+     * Let guest play with all the bits directly except for the memory
+     * decoding one. Bits that are not allowed for DomU are already
+     * handled above.
      */
     if ( header->bars_mapped != !!(cmd & PCI_COMMAND_MEMORY) )
         /*
@@ -523,6 +546,14 @@ static void cf_check cmd_write(
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
+static uint32_t guest_cmd_read(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    const struct vpci_header *header = data;
+
+    return header->guest_cmd;
+}
+
 static void cf_check bar_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -732,8 +763,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
     }
 
     /* Setup a handler for the command register. */
-    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
-                           2, header);
+    if ( is_hwdom )
+        rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
+                               2, header);
+    else
+        rc = vpci_add_register(pdev->vpci, guest_cmd_read, cmd_write, PCI_COMMAND,
+                               2, header);
     if ( rc )
         return rc;
 
@@ -745,6 +780,17 @@ static int cf_check init_bars(struct pci_dev *pdev)
     if ( cmd & PCI_COMMAND_MEMORY )
         pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
 
+    header->guest_cmd = cmd & ~PCI_COMMAND_MEMORY;
+
+    /*
+     * According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
+     * Device Control" the reset state of the command register is
+     * typically all 0's, so this is used as initial value for the guests.
+     */
+    if ( header->guest_cmd != 0 )
+        gprintk(XENLOG_WARNING, "%pp: CMD is not zero: %x\n", &pdev->sbdf,
+                header->guest_cmd);
+
     for ( i = 0; i < num_bars; i++ )
     {
         uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index a0733bb2cb..df0f0199b8 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -70,6 +70,16 @@ static void cf_check control_write(
 
         if ( vpci_msi_arch_enable(msi, pdev, vectors) )
             return;
+
+        /*
+         * Make sure guest doesn't enable INTx while enabling MSI.
+         * Opposite action (enabling INTx) will be performed in
+         * vpci_msi_arch_disable call path.
+         */
+        if ( !is_hardware_domain(pdev->domain) )
+        {
+            pci_intx(pdev, false);
+        }
     }
     else
         vpci_msi_arch_disable(msi, pdev);
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index f8c5bd393b..300c671384 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -97,6 +97,10 @@ static void cf_check control_write(
         for ( i = 0; i < msix->max_entries; i++ )
             if ( !msix->entries[i].masked && msix->entries[i].updated )
                 update_entry(&msix->entries[i], pdev, i);
+
+        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
+        if ( !is_hardware_domain(pdev->domain) )
+            pci_intx(pdev, false);
     }
     else if ( !new_enabled && msix->enabled )
     {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index d77a6f9506..f67d848616 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -87,6 +87,9 @@ struct vpci {
         } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
         /* At most 6 BARS + 1 expansion ROM BAR. */
 
+        /* Guest view of the PCI_COMMAND register. */
+        uint16_t guest_cmd;
+
         /*
          * Store whether the ROM enable bit is set (doesn't imply ROM BAR
          * is mapped into guest p2m) if there's a ROM BAR on the device.
-- 
2.41.0



* [PATCH v9 08/16] vpci/header: handle p2m range sets per BAR
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (6 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-20 11:35   ` Roger Pau Monné
  2023-09-27 18:18   ` Stewart Hildebrand
  2023-08-29 23:19 ` [PATCH v9 09/16] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Instead of handling a single range set that contains all the memory
regions of all the BARs and ROM, have one range set per BAR.
As the range sets are now created when a PCI device is added and destroyed
when it is removed, make them named and accounted.

Note that each set holds only up to 3 separate ranges (typically just 1).
A rangeset per BAR was nevertheless chosen for ease of implementation and
re-use of existing code.

This is in preparation for making non-identity mappings in p2m for the MMIOs.
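
As a rough illustration of what each per-BAR set holds, the sketch below
computes the inclusive page frame range of a BAR the same way modify_bars
does (PFN_DOWN of the first and last byte) and checks two such ranges for
overlap. The struct and function names are hypothetical, and a 4K page size
is assumed; the real code uses Xen's rangeset API instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4K pages. */
#define PAGE_SHIFT 12
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Hypothetical stand-in for one range of a rangeset (inclusive bounds). */
struct pfn_range {
    uint64_t start;
    uint64_t end;
};

/* Page frames covered by a BAR at addr spanning size bytes (size > 0). */
static struct pfn_range bar_pfn_range(uint64_t addr, uint64_t size)
{
    struct pfn_range r = { PFN_DOWN(addr), PFN_DOWN(addr + size - 1) };

    return r;
}

/* Two inclusive ranges overlap if each starts before the other ends. */
static bool ranges_overlap(struct pfn_range a, struct pfn_range b)
{
    return a.start <= b.end && b.start <= a.end;
}
```

Note that two sub-page BARs sharing one page produce overlapping ranges,
which is the case the overlap check between BARs of the same device has to
deal with.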

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Since v9:
- removed d->vpci.map_pending in favor of checking v->vpci.pdev !=
NULL
- printk -> gprintk
- renamed bar variable to fix shadowing
- fixed bug with iterating on remote device's BARs
- relaxed lock in vpci_process_pending
- removed stale comment
Since v6:
- update according to the new locking scheme
- remove odd fail label in modify_bars
Since v5:
- fix comments
- move rangeset allocation to init_bars and only allocate
  for MAPPABLE BARs
- check for overlap with the already setup BAR ranges
Since v4:
- use named range sets for BARs (Jan)
- changes required by the new locking scheme
- updated commit message (Jan)
Since v3:
- re-work vpci_cancel_pending according to the per-BAR handling
- s/num_mem_ranges/map_pending and s/uint8_t/bool
- ASSERT(bar->mem) in modify_bars
- create and destroy the rangesets on add/remove
---
 xen/drivers/vpci/header.c | 252 ++++++++++++++++++++++++++------------
 xen/drivers/vpci/vpci.c   |   6 +
 xen/include/xen/vpci.h    |   2 +-
 3 files changed, 180 insertions(+), 80 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index e96d7b2b37..3cc6a96849 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -161,63 +161,101 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-    if ( v->vpci.mem )
+    struct pci_dev *pdev = v->vpci.pdev;
+    struct map_data data = {
+        .d = v->domain,
+        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+    };
+    struct vpci_header *header = NULL;
+    unsigned int i;
+
+    if ( !pdev )
+        return false;
+
+    read_lock(&v->domain->pci_lock);
+    header = &pdev->vpci->header;
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        struct map_data data = {
-            .d = v->domain,
-            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-        };
-        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
+        struct vpci_bar *bar = &header->bars[i];
+        int rc;
+
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
+
+        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
 
         if ( rc == -ERESTART )
+        {
+            read_unlock(&v->domain->pci_lock);
             return true;
+        }
 
-        write_lock(&v->domain->pci_lock);
-        spin_lock(&v->vpci.pdev->vpci->lock);
-        /* Disable memory decoding unconditionally on failure. */
-        modify_decoding(v->vpci.pdev,
-                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
-                        !rc && v->vpci.rom_only);
-        spin_unlock(&v->vpci.pdev->vpci->lock);
-
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
         if ( rc )
-            /*
-             * FIXME: in case of failure remove the device from the domain.
-             * Note that there might still be leftover mappings. While this is
-             * safe for Dom0, for DomUs the domain will likely need to be
-             * killed in order to avoid leaking stale p2m mappings on
-             * failure.
-             */
-            vpci_deassign_device(v->vpci.pdev);
-        write_unlock(&v->domain->pci_lock);
+        {
+            spin_lock(&pdev->vpci->lock);
+            /* Disable memory decoding unconditionally on failure. */
+            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
+                            false);
+            spin_unlock(&pdev->vpci->lock);
+
+            v->vpci.pdev = NULL;
+
+            read_unlock(&v->domain->pci_lock);
+
+            if ( is_hardware_domain(v->domain) )
+            {
+                write_lock(&v->domain->pci_lock);
+                vpci_deassign_device(pdev);
+                write_unlock(&v->domain->pci_lock);
+            }
+            else
+            {
+                domain_crash(v->domain);
+            }
+            return false;
+        }
     }
+    read_unlock(&v->domain->pci_lock);
+
+    v->vpci.pdev = NULL;
+
+    spin_lock(&pdev->vpci->lock);
+    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
+    spin_unlock(&pdev->vpci->lock);
 
     return false;
 }
 
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
-                            struct rangeset *mem, uint16_t cmd)
+                            uint16_t cmd)
 {
     struct map_data data = { .d = d, .map = true };
-    int rc;
+    struct vpci_header *header = &pdev->vpci->header;
+    int rc = 0;
+    unsigned int i;
 
     ASSERT(rw_is_locked(&d->pci_lock));
 
-    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        /*
-         * It's safe to drop and reacquire the lock in this context
-         * without risking pdev disappearing because devices cannot be
-         * removed until the initial domain has been started.
-         */
-        read_unlock(&d->pci_lock);
-        process_pending_softirqs();
-        read_lock(&d->pci_lock);
-    }
+        struct vpci_bar *bar = &header->bars[i];
 
-    rangeset_destroy(mem);
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
+
+        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
+                                              &data)) == -ERESTART )
+        {
+            /*
+             * It's safe to drop and reacquire the lock in this context
+             * without risking pdev disappearing because devices cannot be
+             * removed until the initial domain has been started.
+             */
+            write_unlock(&d->pci_lock);
+            process_pending_softirqs();
+            write_lock(&d->pci_lock);
+        }
+    }
     if ( !rc )
         modify_decoding(pdev, cmd, false);
 
@@ -225,10 +263,12 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
 }
 
 static void defer_map(struct domain *d, struct pci_dev *pdev,
-                      struct rangeset *mem, uint16_t cmd, bool rom_only)
+                      uint16_t cmd, bool rom_only)
 {
     struct vcpu *curr = current;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     /*
      * FIXME: when deferring the {un}map the state of the device should not
      * be trusted. For example the enable bit is toggled after the device
@@ -236,7 +276,6 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
      * started for the same device if the domain is not well-behaved.
      */
     curr->vpci.pdev = pdev;
-    curr->vpci.mem = mem;
     curr->vpci.cmd = cmd;
     curr->vpci.rom_only = rom_only;
     /*
@@ -250,33 +289,33 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
-    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
     const struct domain *d;
     const struct vpci_msix *msix = pdev->vpci->msix;
-    unsigned int i;
+    unsigned int i, j;
     int rc;
 
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
-    if ( !mem )
-        return -ENOMEM;
-
     /*
-     * Create a rangeset that represents the current device BARs memory region
-     * and compare it against all the currently active BAR memory regions. If
-     * an overlap is found, subtract it from the region to be mapped/unmapped.
+     * Create a rangeset per BAR that represents the current device memory
+     * region and compare it against all the currently active BAR memory
+     * regions. If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
      *
-     * First fill the rangeset with all the BARs of this device or with the ROM
+     * First fill the rangesets with the BAR of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        const struct vpci_bar *bar = &header->bars[i];
+        struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
+        if ( !bar->mem )
+            continue;
+
         if ( !MAPPABLE_BAR(bar) ||
              (rom_only ? bar->type != VPCI_BAR_ROM
                        : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
@@ -292,14 +331,31 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(mem, start, end);
+        rc = rangeset_add_range(bar->mem, start, end);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
                    start, end, rc);
-            rangeset_destroy(mem);
             return rc;
         }
+
+        /* Check for overlap with the already setup BAR ranges. */
+        for ( j = 0; j < i; j++ )
+        {
+            struct vpci_bar *prev_bar = &header->bars[j];
+
+            if ( rangeset_is_empty(prev_bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(prev_bar->mem, start, end);
+            if ( rc )
+            {
+                gprintk(XENLOG_WARNING,
+                       "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
+                        &pdev->sbdf, start, end, rc);
+                return rc;
+            }
+        }
     }
 
     /* Remove any MSIX regions if present. */
@@ -309,14 +365,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
                                      vmsix_table_size(pdev->vpci, i) - 1);
 
-        rc = rangeset_remove_range(mem, start, end);
-        if ( rc )
+        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
         {
-            printk(XENLOG_G_WARNING
-                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
-                   start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            const struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                gprintk(XENLOG_WARNING,
+                       "%pp: failed to remove MSIX table [%lx, %lx]: %d\n",
+                        &pdev->sbdf, start, end, rc);
+                return rc;
+            }
         }
     }
 
@@ -356,27 +419,34 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 
             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
             {
-                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
-                unsigned long start = PFN_DOWN(bar->addr);
-                unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
-
-                if ( !bar->enabled ||
-                     !rangeset_overlaps_range(mem, start, end) ||
-                     /*
-                      * If only the ROM enable bit is toggled check against
-                      * other BARs in the same device for overlaps, but not
-                      * against the same ROM BAR.
-                      */
-                     (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
+                const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
+                unsigned long start = PFN_DOWN(remote_bar->addr);
+                unsigned long end = PFN_DOWN(remote_bar->addr +
+                                             remote_bar->size - 1);
+
+                if ( !remote_bar->enabled )
                     continue;
 
-                rc = rangeset_remove_range(mem, start, end);
-                if ( rc )
+                for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
                 {
-                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
-                           start, end, rc);
-                    rangeset_destroy(mem);
-                    return rc;
+                    const struct vpci_bar *bar = &header->bars[j];
+                    if ( !rangeset_overlaps_range(bar->mem, start, end) ||
+                         /*
+                          * If only the ROM enable bit is toggled check against
+                          * other BARs in the same device for overlaps, but not
+                          * against the same ROM BAR.
+                          */
+                         (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
+                        continue;
+
+                    rc = rangeset_remove_range(bar->mem, start, end);
+                    if ( rc )
+                    {
+                        gprintk(XENLOG_WARNING,
+                                "%pp: failed to remove [%lx, %lx]: %d\n",
+                                &pdev->sbdf, start, end, rc);
+                        return rc;
+                    }
                 }
             }
         }
@@ -400,10 +470,10 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
          * will always be to establish mappings and process all the BARs.
          */
         ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
-        return apply_map(pdev->domain, pdev, mem, cmd);
+        return apply_map(pdev->domain, pdev, cmd);
     }
 
-    defer_map(dev->domain, dev, mem, cmd, rom_only);
+    defer_map(dev->domain, dev, cmd, rom_only);
 
     return 0;
 }
@@ -595,6 +665,20 @@ static void cf_check rom_write(
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
+                            unsigned int i)
+{
+    char str[32];
+
+    snprintf(str, sizeof(str), "%pp:BAR%d", &pdev->sbdf, i);
+
+    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
+    if ( !bar->mem )
+        return -ENOMEM;
+
+    return 0;
+}
+
 static int cf_check init_bars(struct pci_dev *pdev)
 {
     uint16_t cmd;
@@ -675,6 +759,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
         else
             bars[i].type = VPCI_BAR_MEM32;
 
+        rc = bar_add_rangeset(pdev, &bars[i], i);
+        if ( rc )
+            return rc;
+
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
@@ -725,6 +813,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
                                    rom_reg, 4, rom);
             if ( rc )
                 rom->type = VPCI_BAR_EMPTY;
+            else
+            {
+                rc = bar_add_rangeset(pdev, rom, i);
+                if ( rc )
+                    return rc;
+            }
         }
     }
     else
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 135d390218..412685f41d 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_deassign_device(struct pci_dev *pdev)
 {
+    unsigned int i;
+
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
@@ -63,6 +65,10 @@ void vpci_deassign_device(struct pci_dev *pdev)
             if ( pdev->vpci->msix->table[i] )
                 iounmap(pdev->vpci->msix->table[i]);
     }
+
+    for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
+        rangeset_destroy(pdev->vpci->header.bars[i].mem);
+
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 89f1e27f4f..d77a6f9506 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -72,6 +72,7 @@ struct vpci {
             /* Guest address. */
             uint64_t guest_addr;
             uint64_t size;
+            struct rangeset *mem;
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -156,7 +157,6 @@ struct vpci {
 
 struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
-    struct rangeset *mem;
     struct pci_dev *pdev;
     uint16_t cmd;
     bool rom_only : 1;
-- 
2.41.0



* [PATCH v9 09/16] vpci/header: program p2m with guest BAR view
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (7 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 08/16] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-21 10:34   ` Roger Pau Monné
  2023-08-29 23:19 ` [PATCH v9 10/16] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné,
	Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account guest's BAR view and program its p2m accordingly:
gfn is guest's view of the BAR and mfn is the physical BAR value.
This way hardware domain sees physical BAR values and guest sees
emulated ones.

Hardware domain continues getting the BARs identity mapped, while for
domUs the BARs are mapped at the requested guest address without
modifying the BAR address in the device PCI config space.
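
The translation can be sketched as plain arithmetic: a guest frame inside
the BAR maps to the machine frame at the same offset from the physical BAR
start. The function name below is hypothetical and a 4K page size is
assumed; map_range in this patch does the equivalent with gfn_t/mfn_t.

```c
#include <stdint.h>

/* Assumption for this sketch: 4K pages. */
#define PAGE_SHIFT 12
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/*
 * Hypothetical helper: translate a guest frame number that lies inside
 * a BAR into the machine frame backing it.
 *   bar_addr   - physical BAR address (as programmed in the device)
 *   guest_addr - BAR address as seen by the guest
 *   gfn        - guest frame to translate, at or above the BAR start
 */
static uint64_t bar_gfn_to_mfn(uint64_t bar_addr, uint64_t guest_addr,
                               uint64_t gfn)
{
    uint64_t start_gfn = PFN_DOWN(guest_addr);
    uint64_t start_mfn = PFN_DOWN(bar_addr);

    return start_mfn + (gfn - start_gfn);
}
```

For the hardware domain guest_addr equals bar_addr, so the function
degenerates to the identity mapping.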

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
Since v9:
- Extended the commit message
- Use bar->guest_addr in modify_bars
- Extended printk error message in map_range
- Moved map_data initialization so .bar can be initialized during declaration
Since v5:
- remove debug print in map_range callback
- remove "identity" from the debug print
Since v4:
- moved start_{gfn|mfn} calculation into map_range
- pass vpci_bar in the map_data instead of start_{gfn|mfn}
- s/guest_addr/guest_reg
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 52 ++++++++++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 3cc6a96849..1e82217200 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -33,6 +33,7 @@
 
 struct map_data {
     struct domain *d;
+    const struct vpci_bar *bar;
     bool map;
 };
 
@@ -44,6 +45,12 @@ static int cf_check map_range(
 
     for ( ; ; )
     {
+        /* Start address of the BAR as seen by the guest. */
+        gfn_t start_gfn = _gfn(PFN_DOWN(is_hardware_domain(map->d)
+                                        ? map->bar->addr
+                                        : map->bar->guest_addr));
+        /* Physical start address of the BAR. */
+        mfn_t start_mfn = _mfn(PFN_DOWN(map->bar->addr));
         unsigned long size = e - s + 1;
 
         if ( !iomem_access_permitted(map->d, s, e) )
@@ -63,6 +70,13 @@ static int cf_check map_range(
             return rc;
         }
 
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        start_mfn = mfn_add(start_mfn, s - gfn_x(start_gfn));
+
         /*
          * ARM TODOs:
          * - On ARM whether the memory is prefetchable or not should be passed
@@ -72,8 +86,8 @@ static int cf_check map_range(
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, start_mfn)
+                      : unmap_mmio_regions(map->d, _gfn(s), size, start_mfn);
         if ( rc == 0 )
         {
             *c += size;
@@ -82,8 +96,9 @@ static int cf_check map_range(
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to %smap [%lx (%lx), %lx (%lx)] for %pd: %d\n",
+                   map->map ? "" : "un", s, mfn_x(start_mfn), e,
+                   mfn_x(start_mfn) + size, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -162,10 +177,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 bool vpci_process_pending(struct vcpu *v)
 {
     struct pci_dev *pdev = v->vpci.pdev;
-    struct map_data data = {
-        .d = v->domain,
-        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-    };
     struct vpci_header *header = NULL;
     unsigned int i;
 
@@ -177,6 +188,11 @@ bool vpci_process_pending(struct vcpu *v)
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = {
+            .d = v->domain,
+            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+            .bar = bar,
+        };
         int rc;
 
         if ( rangeset_is_empty(bar->mem) )
@@ -229,7 +245,6 @@ bool vpci_process_pending(struct vcpu *v)
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
                             uint16_t cmd)
 {
-    struct map_data data = { .d = d, .map = true };
     struct vpci_header *header = &pdev->vpci->header;
     int rc = 0;
     unsigned int i;
@@ -239,6 +254,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = { .d = d, .map = true, .bar = bar };
 
         if ( rangeset_is_empty(bar->mem) )
             continue;
@@ -306,12 +322,18 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * First fill the rangesets with the BAR of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
+     *
+     * For non-hardware domain we use guest physical addresses.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+        unsigned long start_guest = PFN_DOWN(is_hardware_domain(pdev->domain) ?
+                                             bar->addr : bar->guest_addr);
+        unsigned long end_guest = PFN_DOWN((is_hardware_domain(pdev->domain) ?
+                                  bar->addr : bar->guest_addr) + bar->size - 1);
 
         if ( !bar->mem )
             continue;
@@ -331,11 +353,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(bar->mem, start, end);
+        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
-                   start, end, rc);
+                   start_guest, end_guest, rc);
             return rc;
         }
 
@@ -352,7 +374,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             {
                 gprintk(XENLOG_WARNING,
                        "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
-                        &pdev->sbdf, start, end, rc);
+                        &pdev->sbdf, start_guest, end_guest, rc);
                 return rc;
             }
         }
@@ -420,8 +442,10 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
             {
                 const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
-                unsigned long start = PFN_DOWN(remote_bar->addr);
-                unsigned long end = PFN_DOWN(remote_bar->addr +
+                unsigned long start = PFN_DOWN(is_hardware_domain(pdev->domain) ?
+                                      remote_bar->addr : remote_bar->guest_addr);
+                unsigned long end = PFN_DOWN((is_hardware_domain(pdev->domain) ?
+                                    remote_bar->addr : remote_bar->guest_addr) +
                                              remote_bar->size - 1);
 
                 if ( !remote_bar->enabled )
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v9 11/16] vpci/header: reset the command register when adding devices
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (9 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 10/16] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-21 13:30   ` Roger Pau Monné
  2023-08-29 23:19 ` [PATCH v9 14/16] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné,
	Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Reset the command register when assigning a PCI device to a guest:
according to the PCI spec the PCI_COMMAND register is typically all 0's
after reset, but this might not be true for the guest as it needs
to respect the host's settings.
For that reason, do not write 0 to the PCI_COMMAND register directly,
but go through the corresponding emulation layer (cmd_write), which
will take care of the actual bits written. Also, honor the value of
the PCI_COMMAND_VGA_PALETTE bit, which is set by firmware.
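
As a rough sketch of the value that ends up being passed to the
emulation layer (the helper name is hypothetical; the bit values follow
the standard PCI command register layout):

```c
#include <stdint.h>

/* Bit positions from the standard PCI command register layout. */
#define PCI_COMMAND_MEMORY      0x0002
#define PCI_COMMAND_MASTER      0x0004
#define PCI_COMMAND_VGA_PALETTE 0x0020

/*
 * Hypothetical helper: compute the value handed to the command register
 * emulation when a device is assigned to a guest. Everything is cleared
 * except the VGA palette snoop bit, which firmware may have configured.
 */
static uint16_t guest_reset_cmd(uint16_t host_cmd)
{
    return host_cmd & PCI_COMMAND_VGA_PALETTE;
}
```

So a host value with memory decoding, bus mastering and palette snooping
enabled (0x0026) would be reduced to 0x0020 before the emulation layer
decides which bits actually reach the hardware.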

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
Since v9:
- Honor PCI_COMMAND_VGA_PALETTE bit
Since v6:
- use cmd_write directly without introducing emulate_cmd_reg
- update commit message with more description on all 0's in PCI_COMMAND
Since v5:
- updated commit message
Since v1:
 - do not write 0 to the command register, but respect host settings.
---
 xen/drivers/vpci/header.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index e351db4620..1d243eeaf9 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -762,6 +762,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
         return -EOPNOTSUPP;
     }
 
+    /* Reset the command register for guests. We want to preserve only
+     * PCI_COMMAND_VGA_PALETTE as it is configured by firmware */
+    cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
+    if ( !is_hwdom )
+        cmd_write(pdev, PCI_COMMAND, cmd & PCI_COMMAND_VGA_PALETTE, header);
+
     /* Setup a handler for the command register. */
     if ( is_hwdom )
         rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
@@ -776,7 +782,6 @@ static int cf_check init_bars(struct pci_dev *pdev)
         return 0;
 
     /* Disable memory decoding before sizing. */
-    cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
     if ( cmd & PCI_COMMAND_MEMORY )
         pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v9 13/16] xen/arm: translate virtual PCI bus topology for guests
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (11 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 14/16] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-22  8:32   ` Roger Pau Monné
  2023-08-29 23:19 ` [PATCH v9 12/16] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Stefano Stabellini,
	Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are three originators of PCI configuration space accesses:
1. The domain that owns the physical host bridge: MMIO handlers are
there so we can update vPCI register handlers with the values
written by the hardware domain, i.e. the physical view of the registers
vs the guest's view of the configuration space.
2. Guest access to passed-through PCI devices: we need to properly
map the virtual bus topology to the physical one, i.e. pass the
configuration space access to the corresponding physical devices.
3. Emulated host PCI bridge access. It doesn't exist in the physical
topology, i.e. it can't be mapped to some physical host bridge.
So, all access to the host bridge itself needs to be trapped and
emulated.
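
To make case 2 concrete, here is a minimal standalone sketch of the
translation step. It is not the code from this patch: the table type and
function name are hypothetical, and only the ECAM bit layout is taken
from VPCI_ECAM_BDF() in xen/include/xen/vpci.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit layout as VPCI_ECAM_BDF(): bits 27:12 of the offset hold BDF. */
#define ECAM_BDF(addr)  (((addr) & 0x0ffff000) >> 12)

/* Hypothetical table entry pairing a guest BDF with a physical SBDF. */
struct vdev {
    uint16_t guest_bdf;
    uint32_t phys_sbdf;
};

/*
 * Decode the BDF from an offset into the guest ECAM window and replace
 * it with the physical SBDF of the matching passed-through device.
 * Returns false if the guest addressed a device it does not own.
 */
static bool translate_guest_access(const struct vdev *devs, size_t n,
                                   uint64_t ecam_off, uint32_t *sbdf)
{
    uint16_t bdf = ECAM_BDF(ecam_off);

    for ( size_t i = 0; i < n; i++ )
    {
        if ( devs[i].guest_bdf == bdf )
        {
            *sbdf = devs[i].phys_sbdf;
            return true;
        }
    }

    return false;
}
```

When the lookup fails, the MMIO handlers in the patch complete the
access anyway: reads return all ones and writes are dropped.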

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v9:
- Comment about required lock replaced with ASSERT()
- Style fixes
- call to vpci_translate_virtual_device folded into vpci_sbdf_from_gpa
Since v8:
- locks moved out of vpci_translate_virtual_device()
Since v6:
- add pcidevs locking to vpci_translate_virtual_device
- update wrt to the new locking scheme
Since v5:
- add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
  case to simplify ifdefery
- add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
- reset output register on failed virtual SBDF translation
Since v4:
- indentation fixes
- constify struct domain
- updated commit message
- updates to the new locking scheme (pdev->vpci_lock)
Since v3:
- revisit locking
- move code to vpci.c
Since v2:
 - pass struct domain instead of struct vcpu
 - constify arguments where possible
 - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/arch/arm/vpci.c     | 51 ++++++++++++++++++++++++++++++++---------
 xen/drivers/vpci/vpci.c | 25 +++++++++++++++++++-
 xen/include/xen/vpci.h  | 10 ++++++++
 3 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 3bc4bb5508..58e2a20135 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -7,31 +7,55 @@
 
 #include <asm/mmio.h>
 
-static pci_sbdf_t vpci_sbdf_from_gpa(const struct pci_host_bridge *bridge,
-                                     paddr_t gpa)
+static bool vpci_sbdf_from_gpa(struct domain *d,
+                               const struct pci_host_bridge *bridge,
+                               paddr_t gpa, pci_sbdf_t *sbdf)
 {
-    pci_sbdf_t sbdf;
+    ASSERT(sbdf);
 
     if ( bridge )
     {
-        sbdf.sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
-        sbdf.seg = bridge->segment;
-        sbdf.bus += bridge->cfg->busn_start;
+        sbdf->sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
+        sbdf->seg = bridge->segment;
+        sbdf->bus += bridge->cfg->busn_start;
     }
     else
-        sbdf.sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
-
-    return sbdf;
+    {
+        bool translated;
+
+        /*
+         * For the passed through devices we need to map their virtual SBDF
+         * to the physical PCI device being passed through.
+         */
+        sbdf->sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
+        read_lock(&d->pci_lock);
+        translated = vpci_translate_virtual_device(d, sbdf);
+        read_unlock(&d->pci_lock);
+
+        if ( !translated )
+        {
+            return false;
+        }
+    }
+    return true;
 }
 
 static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
                           register_t *r, void *p)
 {
     struct pci_host_bridge *bridge = p;
-    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+    pci_sbdf_t sbdf;
     /* data is needed to prevent a pointer cast on 32bit */
     unsigned long data;
 
+    ASSERT(!bridge == !is_hardware_domain(v->domain));
+
+    if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
+    {
+        *r = ~0ul;
+        return 1;
+    }
+
     if ( vpci_ecam_read(sbdf, ECAM_REG_OFFSET(info->gpa),
                         1U << info->dabt.size, &data) )
     {
@@ -48,7 +72,12 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
                            register_t r, void *p)
 {
     struct pci_host_bridge *bridge = p;
-    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+    pci_sbdf_t sbdf;
+
+    ASSERT(!bridge == !is_hardware_domain(v->domain));
+
+    if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
+        return 1;
 
     return vpci_ecam_write(sbdf, ECAM_REG_OFFSET(info->gpa),
                            1U << info->dabt.size, r);
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index b284f95e05..b8df8e3265 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -82,6 +82,30 @@ static int add_virtual_device(struct pci_dev *pdev)
     return 0;
 }
 
+/*
+ * Find the physical device which is mapped to the virtual device
+ * and translate virtual SBDF to the physical one.
+ */
+bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf)
+{
+    const struct pci_dev *pdev;
+
+    ASSERT(!is_hardware_domain(d));
+    ASSERT(rw_is_locked(&d->pci_lock));
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev->vpci && (pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf) )
+        {
+            /* Replace guest SBDF with the physical one. */
+            *sbdf = pdev->sbdf;
+            return true;
+        }
+    }
+
+    return false;
+}
+
 #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
 
 void vpci_deassign_device(struct pci_dev *pdev)
@@ -181,7 +205,6 @@ int vpci_assign_device(struct pci_dev *pdev)
 
     return rc;
 }
-
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 58304523ab..e278fc8b69 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -281,6 +281,16 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
 }
 #endif
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf);
+#else
+static inline bool vpci_translate_virtual_device(const struct domain *d,
+                                                 pci_sbdf_t *sbdf)
+{
+    return false;
+}
+#endif
+
 #endif
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v9 14/16] xen/arm: account IO handlers for emulated PCI MSI-X
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (10 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 11/16] vpci/header: reset the command register when adding devices Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-08-29 23:19 ` [PATCH v9 13/16] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Bertrand Marquis,
	Volodymyr Babchuk, Julien Grall, Julien Grall,
	Stefano Stabellini

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

At the moment, we always allocate an extra 16 slots for IO handlers
(see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
MSI-X registers we need to explicitly report the additional IO
handlers, so that they are accounted for.
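
The resulting arithmetic for guests, as a standalone sketch (the
function and parameter names are hypothetical; 32 corresponds to
VPCI_MAX_VIRT_DEV as defined earlier in the series):

```c
#include <stdbool.h>

#define MAX_VIRT_DEV 32u /* mirrors VPCI_MAX_VIRT_DEV: slots on one bus */

/* Hypothetical model of the guest branch of the handler accounting. */
static unsigned int guest_mmio_handlers(bool has_vpci, bool has_msi)
{
    unsigned int count;

    if ( !has_vpci )
        return 0;

    /* One region covers the config space of the single emulated bridge. */
    count = 1;

    /* One MSI-X handler (table + PBA) per possible virtual device. */
    if ( has_msi )
        count += MAX_VIRT_DEV;

    return count;
}
```

That is, a guest with vPCI and MSI support would need 33 handlers in
addition to the default allocation.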

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Acked-by: Julien Grall <jgrall@amazon.com>
---
Cc: Julien Grall <julien@xen.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
---
This was actually moved here from part 2 of the prep work for PCI
passthrough on Arm, as this seems to be the proper place for it.

Since v5:
- optimize with IS_ENABLED(CONFIG_HAS_PCI_MSI) since VPCI_MAX_VIRT_DEV is
  defined unconditionally
New in v5
---
 xen/arch/arm/vpci.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 58e2a20135..01b50d435e 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -134,6 +134,8 @@ static int vpci_get_num_handlers_cb(struct domain *d,
 
 unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
 {
+    unsigned int count;
+
     if ( !has_vpci(d) )
         return 0;
 
@@ -154,7 +156,17 @@ unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
      * For guests each host bridge requires one region to cover the
      * configuration space. At the moment, we only expose a single host bridge.
      */
-    return 1;
+    count = 1;
+
+    /*
+     * There's a single MSI-X MMIO handler that deals with both PBA
+     * and MSI-X tables per each PCI device being passed through.
+     * Maximum number of emulated virtual devices is VPCI_MAX_VIRT_DEV.
+     */
+    if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
+        count += VPCI_MAX_VIRT_DEV;
+
+    return count;
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v9 12/16] vpci: add initial support for virtual PCI bus topology
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (12 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 13/16] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-08-30  7:37   ` Jan Beulich
  2023-09-21 16:03   ` Roger Pau Monné
  2023-08-29 23:19 ` [PATCH v9 16/16] xen/arm: vpci: permit access to guest vpci space Volodymyr Babchuk
  2023-08-29 23:19 ` [PATCH v9 15/16] xen/arm: vpci: check guest range Volodymyr Babchuk
  15 siblings, 2 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné,
	Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini, Wei Liu

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Assign an SBDF on bus 0 to each PCI device being passed through.
In the resulting topology the PCIe devices reside on bus 0 of the
root complex itself (embedded endpoints).
This implementation is limited to the 32 devices which are allowed on
a single PCI bus.

Please note that at the moment only function 0 of a multifunction
device can be passed through.
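
The slot allocation scheme can be sketched outside of Xen as follows.
This is an illustration under simplifying assumptions: a plain 32-bit
word replaces DECLARE_BITMAP(), and the helper names are hypothetical.

```c
#include <stdint.h>

#define MAX_VIRT_DEV 32u /* slots available on a single PCI bus */
#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))

/*
 * Claim the lowest free slot in 'map' and return its devfn (function 0
 * only), or -1 if all 32 slots are already taken.
 */
static int alloc_virtual_devfn(uint32_t *map)
{
    for ( unsigned int slot = 0; slot < MAX_VIRT_DEV; slot++ )
    {
        if ( !(*map & (UINT32_C(1) << slot)) )
        {
            *map |= UINT32_C(1) << slot;
            return PCI_DEVFN(slot, 0);
        }
    }

    return -1;
}

/* Release the slot encoded in 'devfn' so it can be handed out again. */
static void free_virtual_devfn(uint32_t *map, unsigned int devfn)
{
    *map &= ~(UINT32_C(1) << ((devfn >> 3) & 0x1f));
}
```

Segment and bus are both 0 in the guest view, so the returned devfn is
the whole guest SBDF: the first two devices assigned would appear to
the guest as 0000:00:00.0 and 0000:00:01.0.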

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v9:
- Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)
Since v8:
- Added write lock in add_virtual_device
Since v6:
- re-work wrt new locking scheme
- OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
Since v5:
- s/vpci_add_virtual_device/add_virtual_device and make it static
- call add_virtual_device from vpci_assign_device and do not use
  REGISTER_VPCI_INIT machinery
- add pcidevs_locked ASSERT
- use DECLARE_BITMAP for vpci_dev_assigned_map
Since v4:
- moved and re-worked guest sbdf initializers
- s/set_bit/__set_bit
- s/clear_bit/__clear_bit
- minor comment fix s/Virtual/Guest/
- added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
  later for counting the number of MMIO handlers required for a guest
  (Julien)
Since v3:
 - make use of VPCI_INIT
 - moved all new code to vpci.c which belongs to it
 - changed open-coded 31 to PCI_SLOT(~0)
 - added comments and code to reject multifunction devices with
   functions other than 0
 - updated comment about vpci_dev_next and made it unsigned int
 - implement roll back in case of error while assigning/deassigning devices
 - s/dom%pd/%pd
Since v2:
 - remove casts that are (a) malformed and (b) unnecessary
 - add new line for better readability
 - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
    functions are now completely gated with this config
 - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/drivers/vpci/vpci.c | 69 +++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h |  8 +++++
 xen/include/xen/vpci.h  | 11 +++++++
 3 files changed, 88 insertions(+)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 412685f41d..b284f95e05 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -36,6 +36,54 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+static int add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    pci_sbdf_t sbdf = { 0 };
+    unsigned long new_dev_number;
+
+    if ( is_hardware_domain(d) )
+        return 0;
+
+    ASSERT(pcidevs_locked() && rw_is_write_locked(&pdev->domain->pci_lock));
+
+    /*
+     * Each PCI bus supports 32 devices/slots at max or up to 256 when
+     * there are multi-function ones which are not yet supported.
+     */
+    if ( pdev->info.is_extfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
+                                         VPCI_MAX_VIRT_DEV);
+    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
+    {
+        /* The caller holds d->pci_lock; it must not be dropped here. */
+        return -ENOSPC;
+    }
+
+    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, e.g. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
+    pdev->vpci->guest_sbdf = sbdf;
+
+    return 0;
+}
+
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
 void vpci_deassign_device(struct pci_dev *pdev)
 {
     unsigned int i;
@@ -46,6 +94,16 @@ void vpci_deassign_device(struct pci_dev *pdev)
         return;
 
     spin_lock(&pdev->vpci->lock);
+
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
+    {
+        __clear_bit(pdev->vpci->guest_sbdf.dev,
+                    &pdev->domain->vpci_dev_assigned_map);
+        pdev->vpci->guest_sbdf.sbdf = ~0;
+    }
+#endif
+
     while ( !list_empty(&pdev->vpci->handlers) )
     {
         struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
@@ -101,6 +159,13 @@ int vpci_assign_device(struct pci_dev *pdev)
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+    rc = add_virtual_device(pdev);
+    if ( rc )
+        goto out;
+#endif
+
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
         rc = __start_vpci_array[i](pdev);
@@ -108,11 +173,15 @@ int vpci_assign_device(struct pci_dev *pdev)
             break;
     }
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+ out:
+#endif
     if ( rc )
         vpci_deassign_device(pdev);
 
     return rc;
 }
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 535a81fe90..0aafe19a51 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -461,6 +461,14 @@ struct domain
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
     rwlock_t pci_lock;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * The bitmap which shows which device numbers are already used by the
+     * virtual PCI bus topology and is used to assign a unique SBDF to the
+     * next passed through virtual PCI device.
+     */
+    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index f67d848616..58304523ab 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
 
+/*
+ * Maximum number of devices supported by the virtual bus topology:
+ * each PCI bus supports 32 devices/slots at max or up to 256 when
+ * there are multi-function ones which are not yet supported.
+ */
+#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
+
 #define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
@@ -155,6 +162,10 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Guest SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v9 16/16] xen/arm: vpci: permit access to guest vpci space
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (13 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 12/16] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-26  0:12   ` Stewart Hildebrand
  2023-08-29 23:19 ` [PATCH v9 15/16] xen/arm: vpci: check guest range Volodymyr Babchuk
  15 siblings, 1 reply; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Wei Liu

From: Stewart Hildebrand <stewart.hildebrand@amd.com>

Move iomem_caps initialization earlier (before arch_domain_create()).

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
This is sort of a follow-up to:

  baa6ea700386 ("vpci: add permission checks to map_range()")

I don't believe we need a fixes tag since this depends on the vPCI p2m BAR
patches.
---
 xen/arch/arm/vpci.c | 6 ++++++
 xen/common/domain.c | 4 +++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 01b50d435e..fb5361276f 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -2,6 +2,7 @@
 /*
  * xen/arch/arm/vpci.c
  */
+#include <xen/iocap.h>
 #include <xen/sched.h>
 #include <xen/vpci.h>
 
@@ -119,8 +120,13 @@ int domain_vpci_init(struct domain *d)
             return ret;
     }
     else
+    {
         register_mmio_handler(d, &vpci_mmio_handler,
                               GUEST_VPCI_ECAM_BASE, GUEST_VPCI_ECAM_SIZE, NULL);
+        iomem_permit_access(d, paddr_to_pfn(GUEST_VPCI_MEM_ADDR),
+                            paddr_to_pfn(PAGE_ALIGN(GUEST_VPCI_MEM_ADDR +
+                                                    GUEST_VPCI_MEM_SIZE - 1)));
+    }
 
     return 0;
 }
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 9b04a20160..11a48ba7e4 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -695,6 +695,9 @@ struct domain *domain_create(domid_t domid,
         radix_tree_init(&d->pirq_tree);
     }
 
+    if ( !is_idle_domain(d) )
+        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
+
     if ( (err = arch_domain_create(d, config, flags)) != 0 )
         goto fail;
     init_status |= INIT_arch;
@@ -704,7 +707,6 @@ struct domain *domain_create(domid_t domid,
         watchdog_domain_init(d);
         init_status |= INIT_watchdog;
 
-        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
         d->irq_caps   = rangeset_new(d, "Interrupts", 0);
         if ( !d->iomem_caps || !d->irq_caps )
             goto fail;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (14 preceding siblings ...)
  2023-08-29 23:19 ` [PATCH v9 16/16] xen/arm: vpci: permit access to guest vpci space Volodymyr Babchuk
@ 2023-08-29 23:19 ` Volodymyr Babchuk
  2023-09-22  8:44   ` Roger Pau Monné
  15 siblings, 1 reply; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-29 23:19 UTC (permalink / raw)
  To: xen-devel; +Cc: Stewart Hildebrand, Roger Pau Monné

From: Stewart Hildebrand <stewart.hildebrand@amd.com>

Skip mapping the BAR if it is not in a valid range.
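
A standalone sketch of the check follows. The window base and size are
placeholder values chosen for the example and should not be read as the
authoritative GUEST_VPCI_MEM_ADDR/GUEST_VPCI_MEM_SIZE definitions; the
function name is hypothetical and 4K pages are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                    /* assumption: 4K pages */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Placeholder guest MMIO window, for illustration only. */
#define WIN_ADDR UINT64_C(0x23000000)
#define WIN_SIZE UINT64_C(0x10000000)

/*
 * Return true if the inclusive frame range [start_pfn, end_pfn] lies
 * entirely inside the guest vPCI memory window.
 */
static bool bar_in_guest_window(uint64_t start_pfn, uint64_t end_pfn)
{
    return start_pfn >= PFN_DOWN(WIN_ADDR) &&
           end_pfn < PFN_DOWN(WIN_ADDR + WIN_SIZE);
}
```

A BAR that the guest has not programmed yet, or has programmed outside
the window, fails this test and is skipped rather than mapped.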

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
 xen/drivers/vpci/header.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 1d243eeaf9..dbabdcbed2 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
              bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
             continue;
 
+#ifdef CONFIG_ARM
+        if ( !is_hardware_domain(pdev->domain) )
+        {
+            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
+                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
+                continue;
+        }
+#endif
+
         if ( !pci_check_bar(pdev, _mfn(start), _mfn(end)) )
         {
             printk(XENLOG_G_WARNING
-- 
2.41.0
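
For illustration only, the range test added above can be restated as a
standalone predicate. This is a sketch, not part of the patch: PAGE_SHIFT,
the two GUEST_VPCI_MEM_* values and the helper name bar_in_guest_window()
are placeholders chosen for the example, and the real constants come from
Xen's public Arm headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* assumed 4K pages */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Placeholder window, chosen for the example only. */
#define GUEST_VPCI_MEM_ADDR 0x23000000ULL
#define GUEST_VPCI_MEM_SIZE 0x10000000ULL

/*
 * Return true when the inclusive guest frame range
 * [start_guest, end_guest] lies entirely inside the guest vPCI memory
 * window. The patch skips mapping the BAR when this would be false.
 */
static bool bar_in_guest_window(uint64_t start_guest, uint64_t end_guest)
{
    assert(start_guest <= end_guest);

    return start_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR) &&
           end_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE);
}
```

With the placeholder values the window covers frames 0x23000 to 0x32fff,
so a BAR ending at frame 0x33000 would be skipped.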



* Re: [PATCH v9 12/16] vpci: add initial support for virtual PCI bus topology
  2023-08-29 23:19 ` [PATCH v9 12/16] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
@ 2023-08-30  7:37   ` Jan Beulich
  2023-08-31 21:12     ` Volodymyr Babchuk
  2023-09-21 16:03   ` Roger Pau Monné
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2023-08-30  7:37 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, xen-devel

On 30.08.2023 01:19, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v9:
> - Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)

Also peeking at a few other patches where similar change remarks exist,
I'm slightly confused by them: Is this submission v9 or v10?

Jan



* Re: [PATCH v9 12/16] vpci: add initial support for virtual PCI bus topology
  2023-08-30  7:37   ` Jan Beulich
@ 2023-08-31 21:12     ` Volodymyr Babchuk
  0 siblings, 0 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-08-31 21:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Roger Pau Monné,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, xen-devel


Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 30.08.2023 01:19, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> Assign SBDF to the PCI devices being passed through with bus 0.
>> The resulting topology is where PCIe devices reside on the bus 0 of the
>> root complex itself (embedded endpoints).
>> This implementation is limited to 32 devices which are allowed on
>> a single PCI bus.
>> 
>> Please note, that at the moment only function 0 of a multifunction
>> device can be passed through.
>> 
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>> Since v9:
>> - Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)
>
> Also peeking at a few other patches where similar change remarks exist,
> I'm slightly confused by them: Is this submission v9 or v10?

Sorry, it looks like I used the wrong wording. This is submission
v9. By "Since v9" I meant "in v9 and later".

-- 
WBR, Volodymyr


* Re: [PATCH v9 10/16] vpci/header: emulate PCI_COMMAND register for guests
  2023-08-29 23:19 ` [PATCH v9 10/16] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
@ 2023-09-01  5:23   ` Stewart Hildebrand
  2023-09-21 13:18   ` Roger Pau Monné
  1 sibling, 0 replies; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-01  5:23 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné

On 8/29/23 19:19, Volodymyr Babchuk wrote:
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 1e82217200..e351db4620 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -523,6 +546,14 @@ static void cf_check cmd_write(
>          pci_conf_write16(pdev->sbdf, reg, cmd);
>  }
> 
> +static uint32_t guest_cmd_read(const struct pci_dev *pdev, unsigned int reg,

As this function is called indirectly, it needs a cf_check attribute.

> +                               void *data)
> +{
> +    const struct vpci_header *header = data;
> +
> +    return header->guest_cmd;
> +}
> +
>  static void cf_check bar_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
>  {
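
A minimal standalone sketch of what the annotated handler could look like.
Assumptions made for the example: cf_check is defined away as an empty
macro (in Xen it is expected to expand to a compiler attribute on
control-flow-integrity builds), guest_cmd is taken to be a 16-bit field,
and dispatch_read() is a hypothetical caller standing in for the vPCI
register dispatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in so the sketch builds outside Xen. */
#define cf_check

struct pci_dev;                  /* opaque for this sketch */

struct vpci_header {
    uint16_t guest_cmd;          /* assumed width of the guest view */
};

/* Handler type: handlers are only ever reached through a pointer. */
typedef uint32_t vpci_read_t(const struct pci_dev *pdev, unsigned int reg,
                             void *data);

static uint32_t cf_check guest_cmd_read(const struct pci_dev *pdev,
                                        unsigned int reg, void *data)
{
    const struct vpci_header *header = data;

    (void)pdev;
    (void)reg;

    return header->guest_cmd;
}

/* Hypothetical dispatcher: the indirect call that makes cf_check necessary. */
static uint32_t dispatch_read(vpci_read_t *handler,
                              const struct pci_dev *pdev,
                              unsigned int reg, void *data)
{
    assert(handler != NULL);

    return handler(pdev, reg, data);
}
```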



* Re: [PATCH v9 06/16] vpci/header: implement guest BAR register handlers
  2023-08-29 23:19 ` [PATCH v9 06/16] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
@ 2023-09-01  5:25   ` Stewart Hildebrand
  2023-09-20  9:49   ` Roger Pau Monné
  1 sibling, 0 replies; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-01  5:25 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné

On 8/29/23 19:19, Volodymyr Babchuk wrote:
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index e58bbdf68d..e96d7b2b37 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -477,6 +477,72 @@ static void cf_check bar_write(
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
> 
> +static void cf_check guest_bar_write(const struct pci_dev *pdev,
> +                                     unsigned int reg, uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +    uint64_t guest_addr = bar->guest_addr;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +    }
> +
> +    guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));

Use an uppercase ULL suffix on the constant to avoid a MISRA violation.
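
A minimal sketch of the masking step with the uppercase suffix. Only the
masking line comes from the patch; the helper name update_guest_addr(),
the final OR that merges val, and the example function are assumptions
made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Replace either the low or the high 32 bits of a 64-bit guest BAR
 * address with val, leaving the other half untouched.
 */
static uint64_t update_guest_addr(uint64_t guest_addr, uint32_t val, bool hi)
{
    unsigned int shift = hi ? 32 : 0;

    /* Clear the half being written (uppercase ULL suffix). */
    guest_addr &= ~(0xffffffffULL << shift);
    /* Assumed follow-up step: merge in the newly written value. */
    guest_addr |= (uint64_t)val << shift;

    return guest_addr;
}

/* Usage: each write only affects its own half. */
void update_guest_addr_example(void)
{
    assert(update_guest_addr(0x1122334455667788ULL, 0xaabbccddU, false) ==
           0x11223344aabbccddULL);
    assert(update_guest_addr(0x1122334455667788ULL, 0xaabbccddU, true) ==
           0xaabbccdd55667788ULL);
}
```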



* Re: [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign
  2023-08-29 23:19 ` [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
@ 2023-09-12  9:37   ` Jan Beulich
  2023-09-12 23:41     ` Volodymyr Babchuk
  2023-09-20  8:39   ` Roger Pau Monné
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2023-09-12  9:37 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu,
	Paul Durrant, Roger Pau Monné,
	xen-devel

On 30.08.2023 01:19, Volodymyr Babchuk wrote:
> @@ -1481,6 +1488,13 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>          goto done;
>  
> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
> +    {
> +        write_lock(&pdev->domain->pci_lock);
> +        vpci_deassign_device(pdev);
> +        write_unlock(&pdev->domain->pci_lock);
> +    }

Why is the DomIO special case ...

> @@ -1506,6 +1520,15 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>          rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>                          pci_to_dev(pdev), flag);
>      }
> +    if ( rc )
> +        goto done;
> +
> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) && d != dom_io)
> +    {
> +        write_lock(&d->pci_lock);
> +        rc = vpci_assign_device(pdev);
> +        write_unlock(&d->pci_lock);
> +    }

... relevant only here?

Jan



* Re: [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign
  2023-09-12  9:37   ` Jan Beulich
@ 2023-09-12 23:41     ` Volodymyr Babchuk
  2023-09-13  5:58       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-09-12 23:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu,
	Paul Durrant, Roger Pau Monné,
	xen-devel


Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 30.08.2023 01:19, Volodymyr Babchuk wrote:
>> @@ -1481,6 +1488,13 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>>          goto done;
>>  
>> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
>> +    {
>> +        write_lock(&pdev->domain->pci_lock);
>> +        vpci_deassign_device(pdev);
>> +        write_unlock(&pdev->domain->pci_lock);
>> +    }
>
> Why is the DomIO special case ...

vpci_deassign_device() does nothing if vPCI was not initialized for a
domain. So it is not wrong to call this function even if pdev belongs to dom_io.

>> @@ -1506,6 +1520,15 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>          rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>>                          pci_to_dev(pdev), flag);
>>      }
>> +    if ( rc )
>> +        goto done;
>> +
>> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) && d != dom_io)
>> +    {
>> +        write_lock(&d->pci_lock);
>> +        rc = vpci_assign_device(pdev);
>> +        write_unlock(&d->pci_lock);
>> +    }
>
> ... relevant only here?
>

There is no sense in initializing vPCI for dom_io.


-- 
WBR, Volodymyr


* Re: [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign
  2023-09-12 23:41     ` Volodymyr Babchuk
@ 2023-09-13  5:58       ` Jan Beulich
  2023-09-13 23:53         ` Volodymyr Babchuk
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2023-09-13  5:58 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu,
	Paul Durrant, Roger Pau Monné,
	xen-devel

On 13.09.2023 01:41, Volodymyr Babchuk wrote:
> Jan Beulich <jbeulich@suse.com> writes:
>> On 30.08.2023 01:19, Volodymyr Babchuk wrote:
>>> @@ -1481,6 +1488,13 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>>>          goto done;
>>>  
>>> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
>>> +    {
>>> +        write_lock(&pdev->domain->pci_lock);
>>> +        vpci_deassign_device(pdev);
>>> +        write_unlock(&pdev->domain->pci_lock);
>>> +    }
>>
>> Why is the DomIO special case ...
> 
> vpci_deassign_device() does nothing if vPCI was not initialized for a
> domain. So it is not wrong to call this function even if pdev belongs to dom_io.

Well, okay, but then you acquire a lock just to do nothing (apart
from the apparent asymmetry).

>>> @@ -1506,6 +1520,15 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>          rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>>>                          pci_to_dev(pdev), flag);
>>>      }
>>> +    if ( rc )
>>> +        goto done;
>>> +
>>> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) && d != dom_io)
>>> +    {
>>> +        write_lock(&d->pci_lock);
>>> +        rc = vpci_assign_device(pdev);
>>> +        write_unlock(&d->pci_lock);
>>> +    }
>>
>> ... relevant only here?
>>
> 
> There is no sense to initialize vPCI for dom_io.

Of course.

Jan



* Re: [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign
  2023-09-13  5:58       ` Jan Beulich
@ 2023-09-13 23:53         ` Volodymyr Babchuk
  2023-09-20  8:41           ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-09-13 23:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu,
	Paul Durrant, Roger Pau Monné,
	xen-devel


Hi,

Jan Beulich <jbeulich@suse.com> writes:

> On 13.09.2023 01:41, Volodymyr Babchuk wrote:
>> Jan Beulich <jbeulich@suse.com> writes:
>>> On 30.08.2023 01:19, Volodymyr Babchuk wrote:
>>>> @@ -1481,6 +1488,13 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>>>>          goto done;
>>>>  
>>>> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
>>>> +    {
>>>> +        write_lock(&pdev->domain->pci_lock);
>>>> +        vpci_deassign_device(pdev);
>>>> +        write_unlock(&pdev->domain->pci_lock);
>>>> +    }
>>>
>>> Why is the DomIO special case ...
>> 
>> vpci_deassign_device() does nothing if vPCI was not initialized for a
>> domain. So it is not wrong to call this function even if pdev belongs to dom_io.
>
> Well, okay, but then you acquire a lock just to do nothing (apart
> from the apparent asymmetry).

Yes, I agree. I'll add the same check as below. Thanks for the review.


-- 
WBR, Volodymyr


* Re: [PATCH v9 01/16] pci: introduce per-domain PCI rwlock
  2023-08-29 23:19 ` [PATCH v9 01/16] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
@ 2023-09-19 14:09   ` Roger Pau Monné
  2023-09-25 22:44     ` Volodymyr Babchuk
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-19 14:09 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu,
	Paul Durrant, Kevin Tian

On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> Add per-domain d->pci_lock that protects access to
> d->pdev_list. Purpose of this lock is to give guarantees to VPCI code
> that underlying pdev will not disappear under feet. This is a rw-lock,
> but this patch adds only write_lock()s. There will be read_lock()
> users in the next patches.
> 
> This lock should be taken in write mode every time d->pdev_list is
> altered. This covers both accesses to d->pdev_list and accesses to
> pdev->domain_list fields.

Why do you mention pdev->domain_list here?  I don't think the lock
covers accesses to pdev->domain_list, unless that domain_list field
happens to be part of the linked list in d->pdev_list.  I find it kind
of odd to mention here.

> All write accesses also should be protected
> by pcidevs_lock() as well. Idea is that any user that wants read
> access to the list or to the devices stored in the list should use
> either this new d->pci_lock or old pcidevs_lock(). Usage of any of
> this two locks will ensure only that pdev of interest will not
> disappear from under feet and that the pdev still will be assigned to
> the same domain. Of course, any new users should use pcidevs_lock()
> when it is appropriate (e.g. when accessing any other state that is
> protected by the said lock). In case both the newly introduced
> per-domain rwlock and the pcidevs lock is taken, the later must be
> acquired first.
> 
> Any write access to pdev->domain_list should be protected by both
> pcidevs_lock() and d->pci_lock in the write mode.
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> ---
> 
> Changes in v9:
>  - returned back "pdev->domain = target;" in AMD IOMMU code
>  - used "source" instead of pdev->domain in IOMMU functions
>  - added comment about lock ordering in the commit message
>  - reduced locked regions
>  - minor changes non-functional changes in various places
> 
> Changes in v8:
>  - New patch
> 
> Changes in v8 vs RFC:
>  - Removed all read_locks after discussion with Roger in #xendevel
>  - pci_release_devices() now returns the first error code
>  - extended commit message
>  - added missing lock in pci_remove_device()
>  - extended locked region in pci_add_device() to protect list_del() calls
> ---
>  xen/common/domain.c                         |  1 +
>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  9 ++-
>  xen/drivers/passthrough/pci.c               | 71 +++++++++++++++++----
>  xen/drivers/passthrough/vtd/iommu.c         |  9 ++-
>  xen/include/xen/sched.h                     |  1 +
>  5 files changed, 78 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 304aa04fa6..9b04a20160 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -651,6 +651,7 @@ struct domain *domain_create(domid_t domid,
>  
>  #ifdef CONFIG_HAS_PCI
>      INIT_LIST_HEAD(&d->pdev_list);
> +    rwlock_init(&d->pci_lock);
>  #endif
>  
>      /* All error paths can depend on the above setup. */
> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> index bea70db4b7..d219bd9453 100644
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -476,7 +476,14 @@ static int cf_check reassign_device(
>  
>      if ( devfn == pdev->devfn && pdev->domain != target )
>      {
> -        list_move(&pdev->domain_list, &target->pdev_list);
> +        write_lock(&source->pci_lock);
> +        list_del(&pdev->domain_list);
> +        write_unlock(&source->pci_lock);
> +
> +        write_lock(&target->pci_lock);
> +        list_add(&pdev->domain_list, &target->pdev_list);
> +        write_unlock(&target->pci_lock);
> +
>          pdev->domain = target;

While I don't think this is strictly an issue right now, it would be
better to set pdev->domain before the device is added to domain_list.
A pattern like:

read_lock(&d->pci_lock);
for_each_pdev(d, pdev)
    foo(pdev->domain);
read_unlock(&d->pci_lock);

wouldn't work currently if the pdev is added to domain_list before the
pdev->domain field is updated to reflect the new owner.
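
A simplified, single-threaded model of the suggested ordering. Everything
here is a stand-in for illustration: the lock is a plain flag rather than
Xen's rwlock, the list is a singly linked list rather than struct
list_head, and reassign() is a hypothetical helper. In Xen the caller
would additionally be expected to hold pcidevs_lock().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct domain;

struct pci_dev {
    struct domain *domain;       /* current owner */
    struct pci_dev *next;        /* stand-in for domain_list */
};

struct domain {
    bool write_locked;           /* stand-in for d->pci_lock */
    struct pci_dev *pdev_list;   /* stand-in for d->pdev_list */
};

static void write_lock(struct domain *d)
{
    assert(!d->write_locked);
    d->write_locked = true;
}

static void write_unlock(struct domain *d)
{
    assert(d->write_locked);
    d->write_locked = false;
}

static void list_del(struct domain *d, struct pci_dev *pdev)
{
    struct pci_dev **pp;

    assert(d->write_locked);
    for ( pp = &d->pdev_list; *pp; pp = &(*pp)->next )
    {
        if ( *pp == pdev )
        {
            *pp = pdev->next;
            pdev->next = NULL;
            return;
        }
    }
}

static void list_add(struct domain *d, struct pci_dev *pdev)
{
    assert(d->write_locked);
    /* The invariant a reader iterating d's list relies on. */
    assert(pdev->domain == d);
    pdev->next = d->pdev_list;
    d->pdev_list = pdev;
}

static void reassign(struct pci_dev *pdev, struct domain *source,
                     struct domain *target)
{
    write_lock(source);
    list_del(source, pdev);
    write_unlock(source);

    /* Update the owner before the device becomes visible on the list. */
    pdev->domain = target;

    write_lock(target);
    list_add(target, pdev);
    write_unlock(target);
}
```

With this ordering a reader that walks target's list under the lock never
sees a pdev whose domain field still points at the previous owner.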

>      }
>  
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 33452791a8..79ca928672 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -454,7 +454,9 @@ static void __init _pci_hide_device(struct pci_dev *pdev)
>      if ( pdev->domain )
>          return;
>      pdev->domain = dom_xen;
> +    write_lock(&dom_xen->pci_lock);
>      list_add(&pdev->domain_list, &dom_xen->pdev_list);
> +    write_unlock(&dom_xen->pci_lock);
>  }
>  
>  int __init pci_hide_device(unsigned int seg, unsigned int bus,
> @@ -748,7 +750,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>      if ( !pdev->domain )
>      {
>          pdev->domain = hardware_domain;
> +        write_lock(&hardware_domain->pci_lock);
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> +        write_unlock(&hardware_domain->pci_lock);
>  
>          /*
>           * For devices not discovered by Xen during boot, add vPCI handlers
> @@ -758,7 +762,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
> +            write_lock(&hardware_domain->pci_lock);
>              list_del(&pdev->domain_list);
> +            write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
>              goto out;
>          }
> @@ -766,7 +772,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              vpci_remove_device(pdev);
> +            write_lock(&hardware_domain->pci_lock);
>              list_del(&pdev->domain_list);
> +            write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
>              goto out;
>          }
> @@ -816,7 +824,11 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>              pci_cleanup_msi(pdev);
>              ret = iommu_remove_device(pdev);
>              if ( pdev->domain )
> +            {
> +                write_lock(&pdev->domain->pci_lock);
>                  list_del(&pdev->domain_list);
> +                write_unlock(&pdev->domain->pci_lock);
> +            }
>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>              free_pdev(pseg, pdev);
>              break;
> @@ -887,26 +899,61 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>  
>  int pci_release_devices(struct domain *d)
>  {
> -    struct pci_dev *pdev, *tmp;
> -    u8 bus, devfn;
> -    int ret;
> +    int combined_ret;
> +    LIST_HEAD(failed_pdevs);
>  
>      pcidevs_lock();
> -    ret = arch_pci_clean_pirqs(d);
> -    if ( ret )
> +
> +    combined_ret = arch_pci_clean_pirqs(d);
> +    if ( combined_ret )
>      {
>          pcidevs_unlock();
> -        return ret;
> +        return combined_ret;
>      }
> -    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
> +
> +    write_lock(&d->pci_lock);
> +
> +    while ( !list_empty(&d->pdev_list) )
>      {
> -        bus = pdev->bus;
> -        devfn = pdev->devfn;
> -        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
> +        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
> +                                                struct pci_dev,
> +                                                domain_list);
> +        uint16_t seg = pdev->seg;
> +        uint8_t bus = pdev->bus;
> +        uint8_t devfn = pdev->devfn;
> +        int ret;
> +
> +        write_unlock(&d->pci_lock);
> +        ret = deassign_device(d, seg, bus, devfn);
> +        write_lock(&d->pci_lock);
> +        if ( ret )
> +        {
> +            const struct pci_dev *tmp;
> +
> +            /*
> +             * We need to check if deassign_device() left our pdev in
> +             * domain's list. As we dropped the lock, we can't be sure
> +             * that list wasn't permutated in some random way, so we
> +             * need to traverse the whole list.
> +             */
> +            for_each_pdev ( d, tmp )
> +            {
> +                if ( tmp == pdev )
> +                {
> +                    list_move_tail(&pdev->domain_list, &failed_pdevs);
> +                    break;
> +                }
> +            }
> +
> +            combined_ret = combined_ret ?: ret;
> +        }
>      }
> +
> +    list_splice(&failed_pdevs, &d->pdev_list);
> +    write_unlock(&d->pci_lock);
>      pcidevs_unlock();
>  
> -    return ret;
> +    return combined_ret;
>  }
>  
>  #define PCI_CLASS_BRIDGE_HOST    0x0600
> @@ -1125,7 +1172,9 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
>              if ( !pdev->domain )
>              {
>                  pdev->domain = ctxt->d;
> +                write_lock(&ctxt->d->pci_lock);
>                  list_add(&pdev->domain_list, &ctxt->d->pdev_list);
> +                write_unlock(&ctxt->d->pci_lock);
>                  setup_one_hwdom_device(ctxt, pdev);
>              }
>              else if ( pdev->domain == dom_xen )
> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
> index 0e3062c820..3228900c97 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2806,7 +2806,14 @@ static int cf_check reassign_device_ownership(
>  
>      if ( devfn == pdev->devfn && pdev->domain != target )
>      {
> -        list_move(&pdev->domain_list, &target->pdev_list);
> +        write_lock(&source->pci_lock);
> +        list_del(&pdev->domain_list);
> +        write_unlock(&source->pci_lock);
> +
> +        write_lock(&target->pci_lock);
> +        list_add(&pdev->domain_list, &target->pdev_list);
> +        write_unlock(&target->pci_lock);
> +
>          pdev->domain = target;

Same comment as in reassign_device() above.

Thanks, Roger.



* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-08-29 23:19 ` [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
@ 2023-09-19 15:39   ` Roger Pau Monné
  2023-09-19 15:55     ` Jan Beulich
                       ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-19 15:39 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Jan Beulich, Andrew Cooper, Wei Liu, Jun Nakajima, Kevin Tian,
	Paul Durrant

On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Use a previously introduced per-domain read/write lock to check
> whether vpci is present, so we are sure there are no accesses to the
> contents of the vpci struct if not. This lock can be used (and in a
> few cases is used right away) so that vpci removal can be performed
> while holding the lock in write mode. Previously such removal could
> race with vpci_read for example.
> 
> When taking both d->pci_lock and pdev->vpci->lock they are should be

When taking both d->pci_lock and pdev->vpci->lock the order should be
...

> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
> possible deadlock situations.
> 
> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock instead. To prevent
> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
> always lock hwdom first.
> 
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
> 
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
> 
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made
> 
> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> There is a possible lock inversion in MSI code, as some parts of it
> acquire pcidevs_lock() while already holding d->pci_lock.

Those would at a minimum need to be pointed out with TODO comments of
some kind, so that we are aware of them.

> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> ---
> Changes in v9:
>  - extended locked region to protect vpci_remove_device and
>    vpci_add_handlers() calls
>  - vpci_write() takes lock in the write mode to protect
>    potential call to modify_bars()
>  - renamed lock releasing function
>  - removed ASSERT()s from msi code
>  - added trylock in vpci_dump_msi
> 
> Changes in v8:
>  - changed d->vpci_lock to d->pci_lock
>  - introducing d->pci_lock in a separate patch
>  - extended locked region in vpci_process_pending
>  - removed pcidevs_lockis vpci_dump_msi()
>  - removed some changes as they are not needed with
>    the new locking scheme
>  - added handling for hwdom && dom_xen case
> ---
>  xen/arch/x86/hvm/vmsi.c       | 24 ++++++++--------
>  xen/arch/x86/hvm/vmx/vmx.c    |  2 --
>  xen/arch/x86/irq.c            | 15 +++++++---
>  xen/arch/x86/msi.c            |  8 ++----
>  xen/drivers/passthrough/pci.c |  7 +++--
>  xen/drivers/vpci/header.c     | 18 ++++++++++++
>  xen/drivers/vpci/msi.c        | 22 +++++++++++++--
>  xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
>  xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
>  9 files changed, 154 insertions(+), 40 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 128f236362..fde76cc6b4 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>      struct msixtbl_entry *entry, *new_entry;
>      int r = -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>      struct pci_dev *pdev;
>      struct msixtbl_entry *entry;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
>  {
>      unsigned int i;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
>      if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
>      {
> @@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>      int rc;
>  
>      ASSERT(msi->arch.pirq != INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> -    pcidevs_lock();
>      for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq unbind = {
> @@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>  
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
>                                         msi->vectors, msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  }
>  
>  static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
> @@ -778,15 +777,13 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
>      int rc;
>  
>      ASSERT(msi->arch.pirq == INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>      rc = vpci_msi_enable(pdev, vectors, 0);
>      if ( rc < 0 )
>          return rc;
>      msi->arch.pirq = rc;
> -
> -    pcidevs_lock();
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
>                                         msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  
>      return 0;
>  }
> @@ -797,8 +794,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>      unsigned int i;
>  
>      ASSERT(pirq != INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> -    pcidevs_lock();
>      for ( i = 0; i < nr && bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq bind = {
> @@ -814,7 +811,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>      write_lock(&pdev->domain->event_lock);
>      unmap_domain_pirq(pdev->domain, pirq);
>      write_unlock(&pdev->domain->event_lock);
> -    pcidevs_unlock();
>  }
>  
>  void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
> @@ -854,6 +850,8 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>      int rc;
>  
>      ASSERT(entry->arch.pirq == INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> +
>      rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
>                           table_base);
>      if ( rc < 0 )
> @@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>  
>      entry->arch.pirq = rc;
>  
> -    pcidevs_lock();
>      rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
>                           entry->masked);
>      if ( rc )
> @@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>          vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
>          entry->arch.pirq = INVALID_PIRQ;
>      }
> -    pcidevs_unlock();
>  
>      return rc;
>  }
> @@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  {
>      unsigned int i;
>  
> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
> +
>      for ( i = 0; i < msix->max_entries; i++ )
>      {
>          const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>  
>              spin_unlock(&msix->pdev->vpci->lock);
> +            read_unlock(&pdev->domain->pci_lock);
>              process_pending_softirqs();
> +            read_lock(&pdev->domain->pci_lock);
>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 1edc7f1e91..545a27796e 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -413,8 +413,6 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>  
>      spin_unlock_irq(&desc->lock);
>  
> -    ASSERT(pcidevs_locked());
> -

Hm, this removal seems dubious, as do some of the removals below.
And I don't see any comment in the log message as to why removing the
asserts here and in __pci_enable_msi{,x}() and pci_prepare_msix() is
safe.

>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>  
>   unlock_out:
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index 6abfd81621..cb99ae5392 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2157,7 +2157,7 @@ int map_domain_pirq(
>          struct pci_dev *pdev;
>          unsigned int nr = 0;
>  
> -        ASSERT(pcidevs_locked());
> +        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>  
>          ret = -ENODEV;
>          if ( !cpu_has_apic )
> @@ -2314,7 +2314,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
>      if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
>          return -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      info = pirq_info(d, pirq);
> @@ -2908,7 +2908,13 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>      msi->irq = irq;
>  
> -    pcidevs_lock();
> +    /*
> +     * If we are called via vPCI->vMSI path, we already are holding
> +     * d->pci_lock so there is no need to take pcidevs_lock, as it
> +     * will cause lock inversion.
> +     */
> +    if ( !rw_is_locked(&d->pci_lock) )
> +        pcidevs_lock();

This is not a safe expression to use, rw_is_locked() just returns
whether the lock is taken, but not if it's taken by the current CPU.
This is fine to use in assertions and debug code, but not in order to
take lock ordering decisions I'm afraid.

You will likely need to move the locking to the callers of the
function.
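
To illustrate the concern, here is a minimal single-threaded sketch.
The toy_lock type, its owner_cpu field and all helper names are
hypothetical and are not Xen's rwlock API; they only model the
difference between "the lock is held" and "the lock is held by me".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock (hypothetical): records that it is held and by which CPU. */
struct toy_lock {
    bool held;
    int owner_cpu;   /* only meaningful while held */
};

/* State-only predicate: true if anyone holds the lock. */
static bool toy_is_locked(const struct toy_lock *l)
{
    return l->held;
}

/* Ownership predicate: true only if the given CPU holds the lock. */
static bool toy_is_locked_by(const struct toy_lock *l, int cpu)
{
    return l->held && l->owner_cpu == cpu;
}

/*
 * Mirrors the questioned pattern: the decision to skip taking the
 * outer lock is based on the state-only predicate.
 */
static bool would_skip_outer_lock(const struct toy_lock *pci_lock)
{
    return toy_is_locked(pci_lock);
}
```

If CPU 1 holds the lock, would_skip_outer_lock() also returns true
for a caller on CPU 0, even though toy_is_locked_by(l, 0) is false,
so that caller would proceed without holding either lock.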

>      /* Verify or get pirq. */
>      write_lock(&d->event_lock);
>      pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
> @@ -2924,7 +2930,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>   done:
>      write_unlock(&d->event_lock);
> -    pcidevs_unlock();
> +    if ( !rw_is_locked(&d->pci_lock) )
> +        pcidevs_unlock();
>      if ( ret )
>      {
>          switch ( type )
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index d0bf63df1d..ba2963b7d2 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -613,7 +613,7 @@ static int msi_capability_init(struct pci_dev *dev,
>      u8 slot = PCI_SLOT(dev->devfn);
>      u8 func = PCI_FUNC(dev->devfn);
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>      pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
>      if ( !pos )
>          return -ENODEV;
> @@ -783,7 +783,7 @@ static int msix_capability_init(struct pci_dev *dev,
>      if ( !pos )
>          return -ENODEV;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>  
>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>      /*
> @@ -1000,7 +1000,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>      struct pci_dev *pdev;
>      struct msi_desc *old_desc;
>  
> -    ASSERT(pcidevs_locked());
>      pdev = pci_get_pdev(NULL, msi->sbdf);
>      if ( !pdev )
>          return -ENODEV;
> @@ -1055,7 +1054,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
>      struct pci_dev *pdev;
>      struct msi_desc *old_desc;
>  
> -    ASSERT(pcidevs_locked());
>      pdev = pci_get_pdev(NULL, msi->sbdf);
>      if ( !pdev || !pdev->msix )
>          return -ENODEV;
> @@ -1170,8 +1168,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>   */
>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>  {
> -    ASSERT(pcidevs_locked());
> -
>      if ( !use_msi )
>          return -EPERM;
>  
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 79ca928672..4f18293900 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -752,7 +752,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          pdev->domain = hardware_domain;
>          write_lock(&hardware_domain->pci_lock);
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> -        write_unlock(&hardware_domain->pci_lock);
>  
>          /*
>           * For devices not discovered by Xen during boot, add vPCI handlers
> @@ -762,17 +761,17 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);

You likely want to move the printk after the unlock now.

> -            write_lock(&hardware_domain->pci_lock);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
>              goto out;
>          }
> +        write_unlock(&hardware_domain->pci_lock);
>          ret = iommu_add_device(pdev);
>          if ( ret )
>          {
> -            vpci_remove_device(pdev);
>              write_lock(&hardware_domain->pci_lock);
> +            vpci_remove_device(pdev);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
> @@ -1147,7 +1146,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>  
> +    write_lock(&ctxt->d->pci_lock);
>      err = vpci_add_handlers(pdev);
> +    write_unlock(&ctxt->d->pci_lock);
>      if ( err )
>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>                 ctxt->d->domain_id, err);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 60f7049e34..177a6b57a5 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -172,6 +172,7 @@ bool vpci_process_pending(struct vcpu *v)
>          if ( rc == -ERESTART )
>              return true;
>  
> +        write_lock(&v->domain->pci_lock);
>          spin_lock(&v->vpci.pdev->vpci->lock);
>          /* Disable memory decoding unconditionally on failure. */
>          modify_decoding(v->vpci.pdev,
> @@ -190,6 +191,7 @@ bool vpci_process_pending(struct vcpu *v)
>               * failure.
>               */
>              vpci_remove_device(v->vpci.pdev);
> +        write_unlock(&v->domain->pci_lock);

vpci_process_pending() is problematic wrt vpci_remove_device(), as the
removal of a device with pending map operations would render such
operations stale, effectively leaking the mappings to a device MMIO
area that's no longer owned by the domain.

In the same sense vpci_remove_device() should take care of removing
any MMIO mappings created, which is not currently the case.  I guess
such a problem warrants at least some kind of comment in
vpci_process_pending() and/or vpci_remove_device().

>      }
>  
>      return false;
> @@ -201,8 +203,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      struct map_data data = { .d = d, .map = true };
>      int rc;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));

You want rw_is_write_locked(), as that checks for exclusive ownership
of the lock (like you have in modify_bars()).

> +
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    {
> +        /*
> +         * It's safe to drop and reacquire the lock in this context
> +         * without risking pdev disappearing because devices cannot be
> +         * removed until the initial domain has been started.
> +         */
> +        read_unlock(&d->pci_lock);
>          process_pending_softirqs();
> +        read_lock(&d->pci_lock);
> +    }
> +
>      rangeset_destroy(mem);
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
> @@ -243,6 +257,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      unsigned int i;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !mem )
>          return -ENOMEM;
>  
> @@ -522,6 +538,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_bar *bars = header->bars;
>      int rc;
>  
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));

Same here, initialization should be done with the lock exclusively
held.

> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 8f2b59e61a..a0733bb2cb 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -265,7 +265,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>  
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>  
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -277,6 +277,9 @@ void vpci_dump_msi(void)
>  
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>  
> +        if ( !read_trylock(&d->pci_lock) )
> +            continue;
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
>                       * holding the lock.
>                       */
>                      printk("unable to print all MSI-X entries: %d\n", rc);
> -                    process_pending_softirqs();
> -                    continue;
> +                    goto pdev_done;
>                  }
>              }
>  
>              spin_unlock(&pdev->vpci->lock);
> + pdev_done:
> +            /*
> +             * Unlock lock to process pending softirqs. This is
> +             * potentially unsafe, as d->pdev_list can be changed in
> +             * meantime.
> +             */
> +            read_unlock(&d->pci_lock);
>              process_pending_softirqs();
> +            if ( !read_trylock(&d->pci_lock) )
> +            {
> +                printk("unable to access other devices for the domain\n");
> +                goto domain_done;

Shouldn't the domain_done label be after the read_unlock(), so that we
can proceed to try to dump the devices for the next domain?  With the
proposed code a failure to acquire one of the domains pci_lock
terminates the dump.

> +            }
>          }
> +        read_unlock(&d->pci_lock);
>      }
> + domain_done:
>      rcu_read_unlock(&domlist_read_lock);
>  }
>  
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index f9df506f29..f8c5bd393b 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>      {
>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  
>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    int rc;
> +
> +    read_lock(&v->domain->pci_lock);
> +    rc = !!msix_find(v->domain, addr);
> +    read_unlock(&v->domain->pci_lock);
> +
> +    return rc;
>  }
>  
>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_read(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
>      *data = ~0ul;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_read(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_read(d, msix, addr, len, data);
> +
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -404,6 +426,7 @@ static int cf_check msix_read(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_write(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_write(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_write(d, msix, addr, len, data);
> +
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -579,6 +616,7 @@ static int cf_check msix_write(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index d73fa76302..34fff2ef2d 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>  
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>          return;
>  
> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      const unsigned long *ro_map;
>      int rc = 0;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>  
> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
>      uint32_t data = ~(uint32_t)0;
> +    rwlock_t *lock;
>  
>      if ( !size )
>      {
> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>       * Find the PCI dev matching the address, which for hwdom also requires
>       * consulting DomXEN.  Passthrough everything that's not trapped.
>       */
> +    lock = &d->pci_lock;
> +    read_lock(lock);
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
> +    {
> +        read_unlock(lock);
> +        lock = &dom_xen->pci_lock;
> +        read_lock(lock);
>          pdev = pci_get_pdev(dom_xen, sbdf);
> +    }
>      if ( !pdev || !pdev->vpci )
> +    {
> +        read_unlock(lock);
>          return vpci_read_hw(sbdf, reg, size);
> +    }
>  
>      spin_lock(&pdev->vpci->lock);
>  
> @@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    read_unlock(lock);
>  
>      if ( data_offset < size )
>      {
> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>               r->private);
>  }
>  
> +/* Helper function to unlock locks taken by vpci_write in proper order */
> +static void release_domain_locks(struct domain *d)

release_domain_write_locks() might be more descriptive in case we ever
need a similar helper for reads also.

> +{
> +    ASSERT(rw_is_write_locked(&d->pci_lock));
> +
> +    if ( is_hardware_domain(d) )
> +    {
> +        ASSERT(rw_is_write_locked(&dom_xen->pci_lock));
> +        write_unlock(&dom_xen->pci_lock);
> +    }
> +    write_unlock(&d->pci_lock);
> +}
> +
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>      /*
>       * Find the PCI dev matching the address, which for hwdom also requires
> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> +     * consulting DomXEN. Passthrough everything that's not trapped.
> +     * If this is hwdom, we need to hold locks for both domain in case if
> +     * modify_bars() is called
>       */
> +    write_lock(&d->pci_lock);
> +
> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> +    if ( is_hardware_domain(d) )
> +        write_lock(&dom_xen->pci_lock);

Strictly speaking we only need the pci_lock in exclusive mode when
enabling/disabling the BARs AFAICT?  For the rest of the operations
the per-device vPCI lock already protects against concurrent accesses.
It might be worth mentioning that the write lock is only required for
those accesses, but that such improvement is left as a TODO.

> +
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
>          pdev = pci_get_pdev(dom_xen, sbdf);
> @@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>          if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
>              vpci_write_hw(sbdf, reg, size, data);
> +
> +        release_domain_locks(d);

You can release the lock before the vpci_write_hw() call.

Thanks, Roger.


* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-19 15:39   ` Roger Pau Monné
@ 2023-09-19 15:55     ` Jan Beulich
  2023-09-20  8:12       ` Roger Pau Monné
  2023-09-19 16:20     ` Stewart Hildebrand
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2023-09-19 15:55 UTC (permalink / raw)
  To: Roger Pau Monné, Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, Wei Liu, Jun Nakajima, Kevin Tian, Paul Durrant

On 19.09.2023 17:39, Roger Pau Monné wrote:
> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>> @@ -2908,7 +2908,13 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>  
>>      msi->irq = irq;
>>  
>> -    pcidevs_lock();
>> +    /*
>> +     * If we are called via vPCI->vMSI path, we already are holding
>> +     * d->pci_lock so there is no need to take pcidevs_lock, as it
>> +     * will cause lock inversion.
>> +     */
>> +    if ( !rw_is_locked(&d->pci_lock) )
>> +        pcidevs_lock();
> 
> This is not a safe expression to use, rw_is_locked() just returns
> whether the lock is taken, but not if it's taken by the current CPU.
> This is fine to use in assertions and debug code, but not in order to
> take lock ordering decisions I'm afraid.
> 
> You will likely need to move the locking to the callers of the
> function.

Along the lines of a later comment, I think it would be rw_is_write_locked()
here anyway. Noting that xen/rwlock.h already has an internal
_is_write_locked_by_me(), it would in principle be possible to construct
something along the lines of what the comment says. But it would certainly
be better if that could be avoided.

As to the comment: A lock inversion cannot be used to justify not
acquiring a necessary lock. The wording therefore wants adjusting (if the
logic was to stay).

Jan


* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-19 15:39   ` Roger Pau Monné
  2023-09-19 15:55     ` Jan Beulich
@ 2023-09-19 16:20     ` Stewart Hildebrand
  2023-09-20  8:09       ` Roger Pau Monné
  2023-09-20 19:16     ` Stewart Hildebrand
  2023-09-25 23:03     ` Volodymyr Babchuk
  3 siblings, 1 reply; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-19 16:20 UTC (permalink / raw)
  To: Roger Pau Monné, Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, Jun Nakajima, Kevin Tian, Paul Durrant

On 9/19/23 11:39, Roger Pau Monné wrote:
> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 8f2b59e61a..a0733bb2cb 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */
>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>
>>              spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> +            /*
>> +             * Unlock lock to process pending softirqs. This is
>> +             * potentially unsafe, as d->pdev_list can be changed in
>> +             * meantime.
>> +             */
>> +            read_unlock(&d->pci_lock);
>>              process_pending_softirqs();
>> +            if ( !read_trylock(&d->pci_lock) )
>> +            {
>> +                printk("unable to access other devices for the domain\n");
>> +                goto domain_done;
> 
> Shouldn't the domain_done label be after the read_unlock(), so that we
> can proceed to try to dump the devices for the next domain?  With the
> proposed code a failure to acquire one of the domains pci_lock
> terminates the dump.
> 
>> +            }
>>          }
>> +        read_unlock(&d->pci_lock);
>>      }
>> + domain_done:
>>      rcu_read_unlock(&domlist_read_lock);
>>  }
>>

With the label moved, a no-op expression after the label is needed to make the compiler happy:

            }
        }
        read_unlock(&d->pci_lock);
 domain_done:
        (void)0;
    }
    rcu_read_unlock(&domlist_read_lock);
}


If the no-op is omitted, the compiler may complain (gcc 9.4.0):

drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
drivers/vpci/msi.c:351:2: error: label at end of compound statement
  351 |  domain_done:
      |  ^~~~~~~~~~~


* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-19 16:20     ` Stewart Hildebrand
@ 2023-09-20  8:09       ` Roger Pau Monné
  2023-09-20 13:56         ` Stewart Hildebrand
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-20  8:09 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Volodymyr Babchuk, xen-devel, Oleksandr Andrushchenko,
	Jan Beulich, Andrew Cooper, Wei Liu, Jun Nakajima, Kevin Tian,
	Paul Durrant

On Tue, Sep 19, 2023 at 12:20:39PM -0400, Stewart Hildebrand wrote:
> On 9/19/23 11:39, Roger Pau Monné wrote:
> > On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> >> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> >> index 8f2b59e61a..a0733bb2cb 100644
> >> --- a/xen/drivers/vpci/msi.c
> >> +++ b/xen/drivers/vpci/msi.c
> >> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
> >>                       * holding the lock.
> >>                       */
> >>                      printk("unable to print all MSI-X entries: %d\n", rc);
> >> -                    process_pending_softirqs();
> >> -                    continue;
> >> +                    goto pdev_done;
> >>                  }
> >>              }
> >>
> >>              spin_unlock(&pdev->vpci->lock);
> >> + pdev_done:
> >> +            /*
> >> +             * Unlock lock to process pending softirqs. This is
> >> +             * potentially unsafe, as d->pdev_list can be changed in
> >> +             * meantime.
> >> +             */
> >> +            read_unlock(&d->pci_lock);
> >>              process_pending_softirqs();
> >> +            if ( !read_trylock(&d->pci_lock) )
> >> +            {
> >> +                printk("unable to access other devices for the domain\n");
> >> +                goto domain_done;
> > 
> > Shouldn't the domain_done label be after the read_unlock(), so that we
> > can proceed to try to dump the devices for the next domain?  With the
> > proposed code a failure to acquire one of the domains pci_lock
> > terminates the dump.
> > 
> >> +            }
> >>          }
> >> +        read_unlock(&d->pci_lock);
> >>      }
> >> + domain_done:
> >>      rcu_read_unlock(&domlist_read_lock);
> >>  }
> >>
> 
> With the label moved, a no-op expression after the label is needed to make the compiler happy:
> 
>             }
>         }
>         read_unlock(&d->pci_lock);
>  domain_done:
>         (void)0;
>     }
>     rcu_read_unlock(&domlist_read_lock);
> }
> 
> 
> If the no-op is omitted, the compiler may complain (gcc 9.4.0):
> 
> drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
> drivers/vpci/msi.c:351:2: error: label at end of compound statement
>   351 |  domain_done:
>       |  ^~~~~~~~~~~


Might be better to place the label at the start of the loop, and
likely rename to next_domain.

Thanks, Roger.


* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-19 15:55     ` Jan Beulich
@ 2023-09-20  8:12       ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-20  8:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Volodymyr Babchuk, xen-devel, Stewart Hildebrand,
	Oleksandr Andrushchenko, Andrew Cooper, Wei Liu, Jun Nakajima,
	Kevin Tian, Paul Durrant

On Tue, Sep 19, 2023 at 05:55:42PM +0200, Jan Beulich wrote:
> On 19.09.2023 17:39, Roger Pau Monné wrote:
> > On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> >> @@ -2908,7 +2908,13 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>  
> >>      msi->irq = irq;
> >>  
> >> -    pcidevs_lock();
> >> +    /*
> >> +     * If we are called via vPCI->vMSI path, we already are holding
> >> +     * d->pci_lock so there is no need to take pcidevs_lock, as it
> >> +     * will cause lock inversion.
> >> +     */
> >> +    if ( !rw_is_locked(&d->pci_lock) )
> >> +        pcidevs_lock();
> > 
> > This is not a safe expression to use, rw_is_locked() just returns
> > whether the lock is taken, but not if it's taken by the current CPU.
> > This is fine to use in assertions and debug code, but not in order to
> > take lock ordering decisions I'm afraid.
> > 
> > You will likely need to move the locking to the callers of the
> > function.
> 
> > Along the lines of a later comment, I think it would be rw_is_write_locked()
> here anyway. Noting that xen/rwlock.h already has an internal
> _is_write_locked_by_me(), it would in principle be possible to construct
> something along the lines of what the comment says. But it would certainly
> be better if that could be avoided.

I personally don't like constructs like the above; they are fragile and
should be avoided.

It might be better to introduce some wrappers around
allocate_and_map_msi_pirq() for the different locking contexts of
callers if possible.
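
As a rough sketch of what such wrappers could look like, using toy
boolean flags instead of real locks: every name below (global_lock(),
map_pirq_core(), map_pirq(), map_pirq_domain_locked()) is hypothetical
and only illustrates the shape, not the actual Xen functions. The lock
state is consulted in assertions only, never to decide whether to
lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for lock state; hypothetical, not Xen's real locks. */
static bool global_held;        /* plays the role of pcidevs_lock */
static bool domain_held;        /* plays the role of d->pci_lock */
static int global_acquisitions; /* how often the global lock was taken */

static void global_lock(void)
{
    assert(!global_held);
    global_held = true;
    global_acquisitions++;
}

static void global_unlock(void)
{
    assert(global_held);
    global_held = false;
}

/* Core work; expects the caller to hold one of the two locks. */
static int map_pirq_core(int index)
{
    assert(global_held || domain_held);
    return index + 1; /* placeholder result */
}

/* Wrapper for callers that hold no PCI lock yet. */
static int map_pirq(int index)
{
    int rc;

    global_lock();
    rc = map_pirq_core(index);
    global_unlock();

    return rc;
}

/* Wrapper for callers (e.g. a vPCI path) already holding the domain lock. */
static int map_pirq_domain_locked(int index)
{
    assert(domain_held); /* state check used only as an assertion */
    return map_pirq_core(index);
}
```

Each caller then picks the wrapper matching its own locking context,
so the core function never has to guess who holds what.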

Regards, Roger.


* Re: [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign
  2023-08-29 23:19 ` [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
  2023-09-12  9:37   ` Jan Beulich
@ 2023-09-20  8:39   ` Roger Pau Monné
  1 sibling, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-20  8:39 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini, Wei Liu, Paul Durrant

On Tue, Aug 29, 2023 at 11:19:43PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a PCI device gets assigned/de-assigned we need to
> initialize/de-initialize vPCI state for the device.
> 
> Also, rename vpci_add_handlers() to vpci_assign_device() and
> vpci_remove_device() to vpci_deassign_device() to better reflect role
> of the functions.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
> Since v9:
> - removed previous  vpci_[de]assign_device function and renamed
>   existing handlers
> - dropped attempts to handle errors in assign_device() function
> - do not call vpci_assign_device for dom_io
> - use d instead of pdev->domain
> - use IS_ENABLED macro
> Since v8:
> - removed vpci_deassign_device
> Since v6:
> - do not pass struct domain to vpci_{assign|deassign}_device as
>   pdev->domain can be used
> - do not leave the device assigned (pdev->domain == new domain) in case
>   vpci_assign_device fails: try to de-assign and if this also fails, then
>   crash the domain
> Since v5:
> - do not split code into run_vpci_init
> - do not check for is_system_domain in vpci_{de}assign_device
> - do not use vpci_remove_device_handlers_locked and re-allocate
>   pdev->vpci completely
> - make vpci_deassign_device void
> Since v4:
>  - de-assign vPCI from the previous domain on device assignment
>  - do not remove handlers in vpci_assign_device as those must not
>    exist at that point
> Since v3:
>  - remove toolstack roll-back description from the commit message
>    as error are to be handled with proper cleanup in Xen itself
>  - remove __must_check
>  - remove redundant rc check while assigning devices
>  - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
>  - use REGISTER_VPCI_INIT machinery to run required steps on device
>    init/assign: add run_vpci_init helper
> Since v2:
> - define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
>   for x86
> Since v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - extended the commit message
> ---
>  xen/drivers/Kconfig           |  4 ++++
>  xen/drivers/passthrough/pci.c | 31 +++++++++++++++++++++++++++----
>  xen/drivers/vpci/header.c     |  2 +-
>  xen/drivers/vpci/vpci.c       |  6 +++---
>  xen/include/xen/vpci.h        | 10 +++++-----
>  5 files changed, 40 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
> index db94393f47..780490cf8e 100644
> --- a/xen/drivers/Kconfig
> +++ b/xen/drivers/Kconfig
> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>  config HAS_VPCI
>  	bool
>  
> +config HAS_VPCI_GUEST_SUPPORT
> +	bool
> +	depends on HAS_VPCI
> +
>  endmenu
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 4f18293900..64281f2d5e 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -757,7 +757,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>           * For devices not discovered by Xen during boot, add vPCI handlers
>           * when Dom0 first informs Xen about such devices.
>           */
> -        ret = vpci_add_handlers(pdev);
> +        ret = vpci_assign_device(pdev);
>          if ( ret )
>          {
>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
> @@ -771,7 +771,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              write_lock(&hardware_domain->pci_lock);
> -            vpci_remove_device(pdev);
> +            vpci_deassign_device(pdev);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
> @@ -819,7 +819,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
>          {
> -            vpci_remove_device(pdev);
> +            vpci_deassign_device(pdev);
>              pci_cleanup_msi(pdev);
>              ret = iommu_remove_device(pdev);
>              if ( pdev->domain )
> @@ -877,6 +877,13 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>              goto out;
>      }
>  
> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
> +    {
> +        write_lock(&d->pci_lock);
> +        vpci_deassign_device(pdev);
> +        write_unlock(&d->pci_lock);
> +    }

I'm confused by this one, shouldn't the code rely on has_vpci()
instead?  (which is already checked for in vpci_deassign_device()).

If you have a system without CONFIG_HAS_VPCI_GUEST_SUPPORT but vPCI is
used by dom0, you likely still need the hooks in {,de}assign_device()
so that vPCI state is properly handled for dom0 as the devices get
deassigned from dom0 and assigned to a guest? (and maybe moved back to
dom0 at a later point).

> +
>      devfn = pdev->devfn;
>      ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn,
>                       pci_to_dev(pdev));
> @@ -1147,7 +1154,7 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>  
>      write_lock(&ctxt->d->pci_lock);
> -    err = vpci_add_handlers(pdev);
> +    err = vpci_assign_device(pdev);
>      write_unlock(&ctxt->d->pci_lock);
>      if ( err )
>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
> @@ -1481,6 +1488,13 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>          goto done;
>  
> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
> +    {
> +        write_lock(&pdev->domain->pci_lock);
> +        vpci_deassign_device(pdev);
> +        write_unlock(&pdev->domain->pci_lock);
> +    }
> +
>      rc = pdev_msix_assign(d, pdev);
>      if ( rc )
>          goto done;
> @@ -1506,6 +1520,15 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>          rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>                          pci_to_dev(pdev), flag);
>      }
> +    if ( rc )
> +        goto done;

rc can't be != 0 here, as the increment statement in the for loop
above will zero rc at each iteration.
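
To illustrate the point, here is a minimal standalone sketch.  It
assumes the loop has the shape `for ( ; cond; rc = 0 )`; fake_assign()
and assign_phantoms() are made-up names for this example and not Xen
functions.  Even though one call in the middle fails, the value seen
after the loop is always 0:

```c
/* Hypothetical stand-in for the per-function IOMMU call; not Xen's API. */
static int fake_assign(unsigned int devfn)
{
    return (devfn == 2) ? -5 : 0;   /* pretend that devfn 2 fails */
}

/*
 * Assumed loop shape: the third clause of the for statement resets rc
 * after every completed iteration, and the only other way out is the
 * break, which is taken before rc is written again.
 */
static int assign_phantoms(unsigned int first, unsigned int stride,
                           unsigned int limit)
{
    unsigned int devfn = first;
    int rc = 0;

    for ( ; stride; rc = 0 )
    {
        devfn += stride;
        if ( devfn >= limit )
            break;              /* rc was already zeroed by the increment */
        rc = fake_assign(devfn);
    }

    return rc;                  /* hence always 0 at this point */
}
```

So an error check placed right after such a loop never triggers; the
error would have to be handled inside the loop body instead.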

> +
> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) && d != dom_io)
> +    {
> +        write_lock(&d->pci_lock);
> +        rc = vpci_assign_device(pdev);
> +        write_unlock(&d->pci_lock);
> +    }

Why do you need the extra d != dom_io check here?  has_vpci() will
fail for dom_io, no need to special case it here.

Thanks, Roger.



* Re: [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign
  2023-09-13 23:53         ` Volodymyr Babchuk
@ 2023-09-20  8:41           ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-20  8:41 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Jan Beulich, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Paul Durrant, xen-devel

On Wed, Sep 13, 2023 at 11:53:56PM +0000, Volodymyr Babchuk wrote:
> 
> Hi,
> 
> Jan Beulich <jbeulich@suse.com> writes:
> 
> > On 13.09.2023 01:41, Volodymyr Babchuk wrote:
> >> Jan Beulich <jbeulich@suse.com> writes:
> >>> On 30.08.2023 01:19, Volodymyr Babchuk wrote:
> >>>> @@ -1481,6 +1488,13 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
> >>>>      if ( pdev->broken && d != hardware_domain && d != dom_io )
> >>>>          goto done;
> >>>>  
> >>>> +    if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) )
> >>>> +    {
> >>>> +        write_lock(&pdev->domain->pci_lock);
> >>>> +        vpci_deassign_device(pdev);
> >>>> +        write_unlock(&pdev->domain->pci_lock);
> >>>> +    }
> >>>
> >>> Why is the DomIO special case ...
> >> 
> >> vpci_deassign_device() does nothing if vPCI was not initialized for a
> >> domain. So it is not wrong to call this function even if pdev belongs to dom_io.
> >
> > Well, okay, but then you acquire a lock just to do nothing (apart
> > from the apparent asymmetry).
> 
> Yes, I agree. I'll add the same check as below. Thanks for the review.

Hm, no, I would rather rely on the has_vpci check inside of
vpci_{,de}assign_device() than open coding dom_io checks elsewhere.

This is not a hot path; the cost of taking the extra lock is likely
negligible compared to the cost of the assign operations.
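
As a rough sketch of what I mean (simplified, hypothetical structures
and names, not the real Xen ones; the real function returns void and
also frees the state), the guard lives in the callee, so callers can
invoke it for any domain, dom_io included:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not Xen's structures. */
struct domain { bool has_vpci; };
struct vpci { int dummy; };
struct pci_dev { struct domain *domain; struct vpci *vpci; };

/*
 * Returns true only if vPCI state was actually torn down.  The bool
 * return value exists purely to make the sketch easy to check.
 */
static bool deassign_sketch(struct pci_dev *pdev)
{
    assert(pdev && pdev->domain);

    /* Single place for the check: nothing to do without vPCI state. */
    if ( !pdev->domain->has_vpci || !pdev->vpci )
        return false;

    pdev->vpci = NULL;
    return true;
}
```

With that in place a caller only takes the lock and calls the
function; no per-call-site special casing of domains is needed.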

Thanks, Roger.



* Re: [PATCH v9 05/16] vpci/header: rework exit path in init_bars
  2023-08-29 23:19 ` [PATCH v9 05/16] vpci/header: rework exit path in init_bars Volodymyr Babchuk
@ 2023-09-20  8:49   ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-20  8:49 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand

On Tue, Aug 29, 2023 at 11:19:43PM +0000, Volodymyr Babchuk wrote:
> Introduce a "fail" label in the init_bars() function to have a centralized
> error return path. This is a prerequisite for future changes
> in this function.
> 
> This patch does not introduce functional changes.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.



* Re: [PATCH v9 06/16] vpci/header: implement guest BAR register handlers
  2023-08-29 23:19 ` [PATCH v9 06/16] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
  2023-09-01  5:25   ` Stewart Hildebrand
@ 2023-09-20  9:49   ` Roger Pau Monné
  2023-09-20 14:18     ` Stewart Hildebrand
  1 sibling, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-20  9:49 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko

On Tue, Aug 29, 2023 at 11:19:43PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Emulate guest BAR register values: this allows creating a guest view
> of the registers and emulates size and properties probe as it is done
> during PCI device enumeration by the guest.
> 
> All empty, IO and ROM BARs for guests are emulated by returning 0 on
> reads and ignoring writes: these BARs are special in this respect as
> their lower bits have special meaning, so returning the default ~0 on
> read may confuse the guest OS.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v9:
> - factored-out "fail" label introduction in init_bars()
> - replaced #ifdef CONFIG_X86 with IS_ENABLED()
> - do not pass bars[i] to empty_bar_read() handler
> - store guest's BAR address instead of guests BAR register view
> Since v6:
> - unify the writing of the PCI_COMMAND register on the
>   error path into a label
> - do not introduce bar_ignore_access helper and open code
> - s/guest_bar_ignore_read/empty_bar_read
> - update error message in guest_bar_write
> - only setup empty_bar_read for IO if !x86
> Since v5:
> - make sure that the guest set address has the same page offset
>   as the physical address on the host
> - remove guest_rom_{read|write} as those just implement the default
>   behaviour of the registers not being handled
> - adjusted comment for struct vpci.addr field
> - add guest handlers for BARs which are not handled and will otherwise
>   return ~0 on read and ignore writes. The BARs are special with this
>   respect as their lower bits have special meaning, so returning ~0
>   doesn't seem to be right
> Since v4:
> - updated commit message
> - s/guest_addr/guest_reg
> Since v3:
> - squashed two patches: dynamic add/remove handlers and guest BAR
>   handler implementation
> - fix guest BAR read of the high part of a 64bit BAR (Roger)
> - add error handling to vpci_assign_device
> - s/dom%pd/%pd
> - blank line before return
> Since v2:
> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>   has been eliminated from being built on x86
> Since v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - simplify some code
>  - use gdprintk + error code instead of gprintk
>  - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>    so these do not get compiled for x86
>  - removed unneeded is_system_domain check
>  - re-work guest read/write to be much simpler and do more work on write
>    than read which is expected to be called more frequently
>  - removed one too obvious comment
> ---
>  xen/drivers/vpci/header.c | 131 +++++++++++++++++++++++++++++++++-----
>  xen/include/xen/vpci.h    |   3 +
>  2 files changed, 118 insertions(+), 16 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index e58bbdf68d..e96d7b2b37 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -477,6 +477,72 @@ static void cf_check bar_write(
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
>  
> +static void cf_check guest_bar_write(const struct pci_dev *pdev,
> +                                     unsigned int reg, uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +    uint64_t guest_addr = bar->guest_addr;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +    }
> +
> +    guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
> +    guest_addr |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;

I don't think you need to mask out PCI_BASE_ADDRESS_MEM_MASK here, as
you already do it if bar->type != VPCI_BAR_MEM64_HI for val.

> +
> +    /*
> +     * Make sure that the guest set address has the same page offset
> +     * as the physical address on the host or otherwise things won't work as
> +     * expected.
> +     */
> +    if ( (guest_addr & (~PAGE_MASK)) != (bar->addr & ~PAGE_MASK) )

PAGE_OFFSET() would be easier to read.
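
For reference, a standalone sketch of the comparison written that way.
The SKETCH_* macros are defined here only for illustration and assume
4 KiB pages; in the hypervisor the existing PAGE_MASK/PAGE_OFFSET()
definitions would be used instead, and same_page_offset() is a made-up
helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustration only: assumed 4 KiB pages, not Xen's own definitions. */
#define SKETCH_PAGE_SHIFT     12
#define SKETCH_PAGE_SIZE      (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK      (~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_PAGE_OFFSET(a) ((uint64_t)(a) & ~SKETCH_PAGE_MASK)

/*
 * A guest BAR address is acceptable only if it keeps the same offset
 * within the page as the physical BAR address on the host.
 */
static bool same_page_offset(uint64_t guest_addr, uint64_t host_addr)
{
    return SKETCH_PAGE_OFFSET(guest_addr) == SKETCH_PAGE_OFFSET(host_addr);
}
```

The condition in the patch then reads as a plain comparison of two
offsets rather than two open-coded mask expressions.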

> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%pp: ignored BAR %zu write attempting to change page offset\n",
> +                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
> +        return;
> +    }
> +
> +    bar->guest_addr = guest_addr;
> +}
> +
> +static uint32_t cf_check guest_bar_read(const struct pci_dev *pdev,
> +                                        unsigned int reg, void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    uint32_t reg_val;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        return bar->guest_addr >> 32;
> +    }
> +
> +    reg_val = bar->guest_addr;
> +    reg_val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32 :
> +                                             PCI_BASE_ADDRESS_MEM_TYPE_64;
> +    reg_val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +
> +    return reg_val;
> +}
> +
> +static uint32_t cf_check empty_bar_read(const struct pci_dev *pdev,
> +                                        unsigned int reg, void *data)
> +{
> +    return 0;
> +}

If we are going to gain a lot of helpers that return a fixed value it
might be worthwhile to introduce a helper that returns what gets
passed as 'data'.  Let's leave it as you propose for now.
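
Something along these lines is what I have in mind; the name
read_fixed_val() is made up for this sketch, and struct pci_dev is left
opaque since the helper never looks at it:

```c
#include <stdint.h>

struct pci_dev;   /* opaque here; only passed through */

/*
 * Generic read handler: the fixed value to return is smuggled in via
 * the 'data' pointer supplied at registration time, so one helper can
 * serve every register that reads as a constant.
 */
static uint32_t read_fixed_val(const struct pci_dev *pdev,
                               unsigned int reg, void *data)
{
    (void)pdev;
    (void)reg;

    return (uint32_t)(uintptr_t)data;
}
```

Registering it with a NULL data pointer would give the same behaviour
as empty_bar_read(), and other constants only need a different data
argument.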

> +
>  static void cf_check rom_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
>  {
> @@ -537,6 +603,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_header *header = &pdev->vpci->header;
>      struct vpci_bar *bars = header->bars;
>      int rc;
> +    bool is_hwdom = is_hardware_domain(pdev->domain);
>  
>      ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> @@ -578,8 +645,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>          {
>              bars[i].type = VPCI_BAR_MEM64_HI;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            rc = vpci_add_register(pdev->vpci,
> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                                   is_hwdom ? bar_write : guest_bar_write,
> +                                   reg, 4, &bars[i]);
>              if ( rc )
>                  goto fail;
>              continue;
> @@ -589,6 +658,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>          {
>              bars[i].type = VPCI_BAR_IO;
> +
> +            if ( !IS_ENABLED(CONFIG_X86) && !is_hwdom )
> +            {
> +                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                                       reg, 4, NULL);
> +                if ( rc )
> +                    goto fail;

For consistency you should also set bars[i].type = VPCI_BAR_EMPTY
here.

> +            }
> +
>              continue;
>          }
>          if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> @@ -605,6 +683,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          if ( size == 0 )
>          {
>              bars[i].type = VPCI_BAR_EMPTY;
> +
> +            if ( !is_hwdom )
> +            {
> +                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                                       reg, 4, NULL);
> +                if ( rc )
> +                    goto fail;
> +            }
> +
>              continue;
>          }
>  
> @@ -612,28 +699,40 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          bars[i].size = size;
>          bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> -                               &bars[i]);
> +        rc = vpci_add_register(pdev->vpci,
> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                               is_hwdom ? bar_write : guest_bar_write,
> +                               reg, 4, &bars[i]);
>          if ( rc )
>              goto fail;
>      }
>  
> -    /* Check expansion ROM. */
> -    rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
> -    if ( rc > 0 && size )
> +    /* TODO: Check expansion ROM, we do not handle ROM for guests for now. */
> +    if ( is_hwdom )
>      {
> -        struct vpci_bar *rom = &header->bars[num_bars];
> +        rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
> +        if ( rc > 0 && size )
> +        {
> +            struct vpci_bar *rom = &header->bars[num_bars];
>  
> -        rom->type = VPCI_BAR_ROM;
> -        rom->size = size;
> -        rom->addr = addr;
> -        header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
> -                              PCI_ROM_ADDRESS_ENABLE;
> +            rom->type = VPCI_BAR_ROM;
> +            rom->size = size;
> +            rom->addr = addr;
> +            header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
> +                                  PCI_ROM_ADDRESS_ENABLE;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
> -                               4, rom);
> +            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
> +                                   rom_reg, 4, rom);
> +            if ( rc )
> +                rom->type = VPCI_BAR_EMPTY;
> +        }
> +    }
> +    else
> +    {
> +        rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                               rom_reg, 4, NULL);

You should set the BAR type to VPCI_BAR_EMPTY here.

Thanks, Roger.



* Re: [PATCH v9 08/16] vpci/header: handle p2m range sets per BAR
  2023-08-29 23:19 ` [PATCH v9 08/16] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
@ 2023-09-20 11:35   ` Roger Pau Monné
  2023-09-27 18:18   ` Stewart Hildebrand
  1 sibling, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-20 11:35 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko

On Tue, Aug 29, 2023 at 11:19:44PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Instead of handling a single range set, that contains all the memory
> regions of all the BARs and ROM, have them per BAR.
> As the range sets are now created when a PCI device is added and destroyed
> when it is removed, make them named and accounted.
> 
> Note that rangesets were chosen here despite there being only up to
> 3 separate ranges in each set (typically just 1). But rangeset per BAR
> was chosen for the ease of implementation and existing code re-usability.
> 
> This is in preparation of making non-identity mappings in p2m for the MMIOs.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> ---
> Since v9:
> - removed d->vpci.map_pending in favor of checking v->vpci.pdev !=
> NULL
> - printk -> gprintk
> - renamed bar variable to fix shadowing
> - fixed bug with iterating on remote device's BARs
> - relaxed lock in vpci_process_pending
> - removed stale comment
> Since v6:
> - update according to the new locking scheme
> - remove odd fail label in modify_bars
> Since v5:
> - fix comments
> - move rangeset allocation to init_bars and only allocate
>   for MAPPABLE BARs
> - check for overlap with the already setup BAR ranges
> Since v4:
> - use named range sets for BARs (Jan)
> - changes required by the new locking scheme
> - updated commit message (Jan)
> Since v3:
> - re-work vpci_cancel_pending accordingly to the per-BAR handling
> - s/num_mem_ranges/map_pending and s/uint8_t/bool
> - ASSERT(bar->mem) in modify_bars
> - create and destroy the rangesets on add/remove
> ---
>  xen/drivers/vpci/header.c | 252 ++++++++++++++++++++++++++------------
>  xen/drivers/vpci/vpci.c   |   6 +
>  xen/include/xen/vpci.h    |   2 +-
>  3 files changed, 180 insertions(+), 80 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index e96d7b2b37..3cc6a96849 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -161,63 +161,101 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>  
>  bool vpci_process_pending(struct vcpu *v)
>  {
> -    if ( v->vpci.mem )
> +    struct pci_dev *pdev = v->vpci.pdev;
> +    struct map_data data = {
> +        .d = v->domain,
> +        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> +    };
> +    struct vpci_header *header = NULL;
> +    unsigned int i;
> +
> +    if ( !pdev )
> +        return false;
> +
> +    read_lock(&v->domain->pci_lock);
> +    header = &pdev->vpci->header;

You should likely check that pdev->vpci != NULL before accessing it,
and that the device is still assigned to the domain, v->domain ==
pdev->domain.

> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        struct map_data data = {
> -            .d = v->domain,
> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> -        };
> -        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
> +        struct vpci_bar *bar = &header->bars[i];
> +        int rc;
> +
> +        if ( rangeset_is_empty(bar->mem) )
> +            continue;
> +
> +        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
>  
>          if ( rc == -ERESTART )
> +        {
> +            read_unlock(&v->domain->pci_lock);
>              return true;
> +        }
>  
> -        write_lock(&v->domain->pci_lock);
> -        spin_lock(&v->vpci.pdev->vpci->lock);
> -        /* Disable memory decoding unconditionally on failure. */
> -        modify_decoding(v->vpci.pdev,
> -                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
> -                        !rc && v->vpci.rom_only);
> -        spin_unlock(&v->vpci.pdev->vpci->lock);
> -
> -        rangeset_destroy(v->vpci.mem);
> -        v->vpci.mem = NULL;
>          if ( rc )
> -            /*
> -             * FIXME: in case of failure remove the device from the domain.
> -             * Note that there might still be leftover mappings. While this is
> -             * safe for Dom0, for DomUs the domain will likely need to be
> -             * killed in order to avoid leaking stale p2m mappings on
> -             * failure.
> -             */
> -            vpci_deassign_device(v->vpci.pdev);
> -        write_unlock(&v->domain->pci_lock);
> +        {
> +            spin_lock(&pdev->vpci->lock);
> +            /* Disable memory decoding unconditionally on failure. */
> +            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
> +                            false);
> +            spin_unlock(&pdev->vpci->lock);
> +
> +            v->vpci.pdev = NULL;
> +
> +            read_unlock(&v->domain->pci_lock);
> +
> +            if ( is_hardware_domain(v->domain) )
> +            {
> +                write_lock(&v->domain->pci_lock);

This unlock/lock dance is racy, and I'm not sure there's much point in
removing the vPCI handlers for the device, it's not likely to be
helpful to dom0.  It might be better to just unconditionally disable
memory decoding and empty all the rangesets.  Not sure whether there's
more cached state that would need dealing with in pdev->vpci.

Maybe as a bodge you could leave the current vpci_deassign_device()
call and check that pdev->domain == v->domain after having taken the
pci_lock.

> +                vpci_deassign_device(v->vpci.pdev);
> +                write_unlock(&v->domain->pci_lock);
> +            }
> +            else
> +            {
> +                domain_crash(v->domain);
> +            }
> +            return false;
> +        }
>      }
> +    read_unlock(&v->domain->pci_lock);
> +
> +    v->vpci.pdev = NULL;
> +
> +    spin_lock(&pdev->vpci->lock);
> +    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
> +    spin_unlock(&pdev->vpci->lock);

Why do you drop the pci_lock before calling modify_decoding()?  It
needs to stay locked until operations on pdev have finished, iow:
after modify_decoding(), or else accessing the contents of pdev->vpci
is not safe, and the device could be deassigned in the meantime.

>  
>      return false;
>  }
>  
>  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> -                            struct rangeset *mem, uint16_t cmd)
> +                            uint16_t cmd)
>  {
>      struct map_data data = { .d = d, .map = true };
> -    int rc;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    int rc = 0;
> +    unsigned int i;
>  
>      ASSERT(rw_is_locked(&d->pci_lock));
>  
> -    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        /*
> -         * It's safe to drop and reacquire the lock in this context
> -         * without risking pdev disappearing because devices cannot be
> -         * removed until the initial domain has been started.
> -         */
> -        read_unlock(&d->pci_lock);
> -        process_pending_softirqs();
> -        read_lock(&d->pci_lock);
> -    }
> +        struct vpci_bar *bar = &header->bars[i];
>  
> -    rangeset_destroy(mem);
> +        if ( rangeset_is_empty(bar->mem) )
> +            continue;
> +
> +        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
> +                                              &data)) == -ERESTART )
> +        {
> +            /*
> +             * It's safe to drop and reacquire the lock in this context
> +             * without risking pdev disappearing because devices cannot be
> +             * removed until the initial domain has been started.
> +             */
> +            write_unlock(&d->pci_lock);
> +            process_pending_softirqs();
> +            write_lock(&d->pci_lock);
> +        }
> +    }
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
>  
> @@ -225,10 +263,12 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>  }
>  
>  static void defer_map(struct domain *d, struct pci_dev *pdev,
> -                      struct rangeset *mem, uint16_t cmd, bool rom_only)
> +                      uint16_t cmd, bool rom_only)
>  {
>      struct vcpu *curr = current;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      /*
>       * FIXME: when deferring the {un}map the state of the device should not
>       * be trusted. For example the enable bit is toggled after the device
> @@ -236,7 +276,6 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>       * started for the same device if the domain is not well-behaved.
>       */
>      curr->vpci.pdev = pdev;
> -    curr->vpci.mem = mem;
>      curr->vpci.cmd = cmd;
>      curr->vpci.rom_only = rom_only;
>      /*
> @@ -250,33 +289,33 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>  {
>      struct vpci_header *header = &pdev->vpci->header;
> -    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>      struct pci_dev *tmp, *dev = NULL;
>      const struct domain *d;
>      const struct vpci_msix *msix = pdev->vpci->msix;
> -    unsigned int i;
> +    unsigned int i, j;
>      int rc;
>  
>      ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>  
> -    if ( !mem )
> -        return -ENOMEM;
> -
>      /*
> -     * Create a rangeset that represents the current device BARs memory region
> -     * and compare it against all the currently active BAR memory regions. If
> -     * an overlap is found, subtract it from the region to be mapped/unmapped.
> +     * Create a rangeset per BAR that represents the current device memory
> +     * region and compare it against all the currently active BAR memory
> +     * regions. If an overlap is found, subtract it from the region to be
> +     * mapped/unmapped.
>       *
> -     * First fill the rangeset with all the BARs of this device or with the ROM
> +     * First fill the rangesets with the BAR of this device or with the ROM
>       * BAR only, depending on whether the guest is toggling the memory decode
>       * bit of the command register, or the enable bit of the ROM BAR register.
>       */
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        const struct vpci_bar *bar = &header->bars[i];
> +        struct vpci_bar *bar = &header->bars[i];
>          unsigned long start = PFN_DOWN(bar->addr);
>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>  
> +        if ( !bar->mem )
> +            continue;
> +
>          if ( !MAPPABLE_BAR(bar) ||
>               (rom_only ? bar->type != VPCI_BAR_ROM
>                         : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
> @@ -292,14 +331,31 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              continue;
>          }
>  
> -        rc = rangeset_add_range(mem, start, end);
> +        rc = rangeset_add_range(bar->mem, start, end);
>          if ( rc )
>          {
>              printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
>                     start, end, rc);
> -            rangeset_destroy(mem);
>              return rc;
>          }
> +
> +        /* Check for overlap with the already setup BAR ranges. */
> +        for ( j = 0; j < i; j++ )
> +        {
> +            struct vpci_bar *prev_bar = &header->bars[j];
> +
> +            if ( rangeset_is_empty(prev_bar->mem) )
> +                continue;
> +
> +            rc = rangeset_remove_range(prev_bar->mem, start, end);
> +            if ( rc )
> +            {
> +                gprintk(XENLOG_WARNING,
> +                       "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
> +                        &pdev->sbdf, start, end, rc);
> +                return rc;
> +            }
> +        }
>      }
>  
>      /* Remove any MSIX regions if present. */
> @@ -309,14 +365,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>          unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>                                       vmsix_table_size(pdev->vpci, i) - 1);
>  
> -        rc = rangeset_remove_range(mem, start, end);
> -        if ( rc )
> +        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
>          {
> -            printk(XENLOG_G_WARNING
> -                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
> -                   start, end, rc);
> -            rangeset_destroy(mem);
> -            return rc;
> +            const struct vpci_bar *bar = &header->bars[j];
> +
> +            if ( rangeset_is_empty(bar->mem) )
> +                continue;
> +
> +            rc = rangeset_remove_range(bar->mem, start, end);
> +            if ( rc )
> +            {
> +                gprintk(XENLOG_WARNING,
> +                       "%pp: failed to remove MSIX table [%lx, %lx]: %d\n",
> +                        &pdev->sbdf, start, end, rc);
> +                return rc;
> +            }
>          }
>      }
>  
> @@ -356,27 +419,34 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>  
>              for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>              {
> -                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> -                unsigned long start = PFN_DOWN(bar->addr);
> -                unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> -
> -                if ( !bar->enabled ||
> -                     !rangeset_overlaps_range(mem, start, end) ||
> -                     /*
> -                      * If only the ROM enable bit is toggled check against
> -                      * other BARs in the same device for overlaps, but not
> -                      * against the same ROM BAR.
> -                      */
> -                     (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
> +                const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
> +                unsigned long start = PFN_DOWN(remote_bar->addr);
> +                unsigned long end = PFN_DOWN(remote_bar->addr +
> +                                             remote_bar->size - 1);
> +
> +                if ( !remote_bar->enabled )
>                      continue;
>  
> -                rc = rangeset_remove_range(mem, start, end);
> -                if ( rc )
> +                for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
>                  {
> -                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
> -                           start, end, rc);
> -                    rangeset_destroy(mem);
> -                    return rc;
> +                    const struct vpci_bar *bar = &header->bars[j];
> +                    if ( !rangeset_overlaps_range(bar->mem, start, end) ||

Missing newline between local variable definition and code.

> +                         /*
> +                          * If only the ROM enable bit is toggled check against
> +                          * other BARs in the same device for overlaps, but not
> +                          * against the same ROM BAR.
> +                          */
> +                         (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
> +                        continue;
> +
> +                    rc = rangeset_remove_range(bar->mem, start, end);
> +                    if ( rc )
> +                    {
> +                        gprintk(XENLOG_WARNING,
> +                                "%pp: failed to remove [%lx, %lx]: %d\n",
> +                                &pdev->sbdf, start, end, rc);
> +                        return rc;
> +                    }
>                  }
>              }
>          }
> @@ -400,10 +470,10 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>           * will always be to establish mappings and process all the BARs.
>           */
>          ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
> -        return apply_map(pdev->domain, pdev, mem, cmd);
> +        return apply_map(pdev->domain, pdev, cmd);
>      }
>  
> -    defer_map(dev->domain, dev, mem, cmd, rom_only);
> +    defer_map(dev->domain, dev, cmd, rom_only);
>  
>      return 0;
>  }
> @@ -595,6 +665,20 @@ static void cf_check rom_write(
>          rom->addr = val & PCI_ROM_ADDRESS_MASK;
>  }
>  
> +static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
> +                            unsigned int i)
> +{
> +    char str[32];
> +
> +    snprintf(str, sizeof(str), "%pp:BAR%d", &pdev->sbdf, i);

%u for i.
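
To illustrate the specifier point outside of Xen, here is a minimal
standalone sketch. Standard C has no %pp, so the SBDF is formatted by hand,
and format_bar_name() and its parameters are hypothetical names, not part
of the patch:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: builds a rangeset-style name for BAR number i. */
static int format_bar_name(char *buf, size_t len, unsigned int seg,
                           unsigned int bus, unsigned int dev,
                           unsigned int fn, unsigned int i)
{
    assert(buf && len);

    /* i is unsigned int, so %u is the matching conversion (not %d). */
    return snprintf(buf, len, "%04x:%02x:%02x.%u:BAR%u",
                    seg, bus, dev, fn, i);
}
```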

> +
> +    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
> +    if ( !bar->mem )
> +        return -ENOMEM;
> +
> +    return 0;

Could be simplified as:

return !bar->mem ? -ENOMEM : 0;

But I don't have a strong opinion, I understand some people might find
this obscure.

> +}
> +
>  static int cf_check init_bars(struct pci_dev *pdev)
>  {
>      uint16_t cmd;
> @@ -675,6 +759,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          else
>              bars[i].type = VPCI_BAR_MEM32;
>  
> +        rc = bar_add_rangeset(pdev, &bars[i], i);
> +        if ( rc )
> +            return rc;

Don't you need to use the fail label in order to restore the previous
command register value on the device? (here and below)
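
As a rough sketch of the concern (not the actual init_bars() code), the
following standalone example models the command register with a plain
variable. fake_cmd_reg, fake_bar_setup() and init_bars_sketch() are
hypothetical names, and the success path is simplified to share the same
restore:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2

/* Hypothetical stand-in for the device's PCI_COMMAND register. */
static uint16_t fake_cmd_reg;

/* Hypothetical per-BAR setup step; fails with -ENOMEM at index fail_at. */
static int fake_bar_setup(unsigned int i, unsigned int fail_at)
{
    return i == fail_at ? -ENOMEM : 0;
}

static int init_bars_sketch(unsigned int nr_bars, unsigned int fail_at)
{
    uint16_t cmd = fake_cmd_reg; /* Value to restore on every exit path. */
    unsigned int i;
    int rc = 0;

    /* Memory decoding is disabled while the BARs are being set up. */
    fake_cmd_reg = (uint16_t)(cmd & ~PCI_COMMAND_MEMORY);

    for ( i = 0; i < nr_bars; i++ )
    {
        rc = fake_bar_setup(i, fail_at);
        if ( rc )
            goto fail; /* A bare "return rc" would leave decoding off. */
    }

    assert(rc == 0);

 fail:
    fake_cmd_reg = cmd;
    return rc;
}
```

With a bare return in the loop, the register would keep the value with
memory decoding cleared after a failed allocation.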

Thanks, Roger.



* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-20  8:09       ` Roger Pau Monné
@ 2023-09-20 13:56         ` Stewart Hildebrand
  2023-09-21  7:42           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-20 13:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Volodymyr Babchuk, xen-devel, Oleksandr Andrushchenko,
	Jan Beulich, Andrew Cooper, Wei Liu, Jun Nakajima, Kevin Tian,
	Paul Durrant

On 9/20/23 04:09, Roger Pau Monné wrote:
> On Tue, Sep 19, 2023 at 12:20:39PM -0400, Stewart Hildebrand wrote:
>> On 9/19/23 11:39, Roger Pau Monné wrote:
>>> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>>>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>>>> index 8f2b59e61a..a0733bb2cb 100644
>>>> --- a/xen/drivers/vpci/msi.c
>>>> +++ b/xen/drivers/vpci/msi.c
>>>> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
>>>>                       * holding the lock.
>>>>                       */
>>>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>>>> -                    process_pending_softirqs();
>>>> -                    continue;
>>>> +                    goto pdev_done;
>>>>                  }
>>>>              }
>>>>
>>>>              spin_unlock(&pdev->vpci->lock);
>>>> + pdev_done:
>>>> +            /*
>>>> +             * Unlock lock to process pending softirqs. This is
>>>> +             * potentially unsafe, as d->pdev_list can be changed in
>>>> +             * meantime.
>>>> +             */
>>>> +            read_unlock(&d->pci_lock);
>>>>              process_pending_softirqs();
>>>> +            if ( !read_trylock(&d->pci_lock) )
>>>> +            {
>>>> +                printk("unable to access other devices for the domain\n");
>>>> +                goto domain_done;
>>>
>>> Shouldn't the domain_done label be after the read_unlock(), so that we
>>> can proceed to try to dump the devices for the next domain?  With the
>>> proposed code a failure to acquire one of the domains pci_lock
>>> terminates the dump.
>>>
>>>> +            }
>>>>          }
>>>> +        read_unlock(&d->pci_lock);
>>>>      }
>>>> + domain_done:
>>>>      rcu_read_unlock(&domlist_read_lock);
>>>>  }
>>>>
>>
>> With the label moved, a no-op expression after the label is needed to make the compiler happy:
>>
>>             }
>>         }
>>         read_unlock(&d->pci_lock);
>>  domain_done:
>>         (void)0;
>>     }
>>     rcu_read_unlock(&domlist_read_lock);
>> }
>>
>>
>> If the no-op is omitted, the compiler may complain (gcc 9.4.0):
>>
>> drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
>> drivers/vpci/msi.c:351:2: error: label at end of compound statement
>>   351 |  domain_done:
>>       |  ^~~~~~~~~~~
> 
> 
> Might be better to place the label at the start of the loop, and
> likely rename to next_domain.

That would bypass the loop condition and increment statements.
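
For illustration, a minimal standalone sketch of the pattern under
discussion: a label at the end of the loop body, followed by a null
statement, keeps the for loop's increment and condition in play. ISO C
before C23 requires a statement after a label, which is what the compiler
error above is about. sum_rows() and its data are hypothetical:

```c
#include <assert.h>

/*
 * Hypothetical example: sum the entries of each row until a negative value
 * is met, then move on to the next row.
 */
static int sum_rows(const int rows[][3], unsigned int nr_rows)
{
    unsigned int r, c;
    int total = 0;

    assert(rows);

    for ( r = 0; r < nr_rows; r++ )
    {
        for ( c = 0; c < 3; c++ )
        {
            if ( rows[r][c] < 0 )
                goto next_row;
            total += rows[r][c];
        }
        total += 100; /* Only reached when the whole row was consumed. */
 next_row:
        ; /* Null statement: the label needs a statement to attach to. */
    }

    return total;
}
```

Jumping to a label placed before the first statement of the body would
instead re-enter the body without running r++ or re-checking r < nr_rows.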



* Re: [PATCH v9 06/16] vpci/header: implement guest BAR register handlers
  2023-09-20  9:49   ` Roger Pau Monné
@ 2023-09-20 14:18     ` Stewart Hildebrand
  0 siblings, 0 replies; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-20 14:18 UTC (permalink / raw)
  To: Roger Pau Monné, Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko

On 9/20/23 05:49, Roger Pau Monné wrote:
> On Tue, Aug 29, 2023 at 11:19:43PM +0000, Volodymyr Babchuk wrote:
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index e58bbdf68d..e96d7b2b37 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> +static uint32_t cf_check empty_bar_read(const struct pci_dev *pdev,
>> +                                        unsigned int reg, void *data)
>> +{
>> +    return 0;
>> +}
> 
> If we are going to gain a lot of helpers that return a fixed value it
> might be worthwhile to introduce a helper that returns what gets
> passed as 'data'.  Let's leave it as you propose for now.

For future reference, I introduce such a helper in the vPCI capabilities filtering series [1]. If that series happens to get committed before this one, it could be worthwhile making the switch. But since the helper is not upstream yet, +1 for leaving as is for now.

[1] https://lists.xenproject.org/archives/html/xen-devel/2023-09/msg00796.html



* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-19 15:39   ` Roger Pau Monné
  2023-09-19 15:55     ` Jan Beulich
  2023-09-19 16:20     ` Stewart Hildebrand
@ 2023-09-20 19:16     ` Stewart Hildebrand
  2023-09-21  9:41       ` Roger Pau Monné
  2023-09-25 23:03     ` Volodymyr Babchuk
  3 siblings, 1 reply; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-20 19:16 UTC (permalink / raw)
  To: Roger Pau Monné, Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, Jun Nakajima, Kevin Tian, Paul Durrant

On 9/19/23 11:39, Roger Pau Monné wrote:
> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index 1edc7f1e91..545a27796e 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -413,8 +413,6 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>>
>>      spin_unlock_irq(&desc->lock);
>>
>> -    ASSERT(pcidevs_locked());
>> -
> 
> Hm, this removal seems dubious, same with some of the removal below.
> And I don't see any comment in the log message as to why removing the
> asserts here and in __pci_enable_msi{,x}(), pci_prepare_msix() is
> safe.
> 

I suspect we may want:

    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));

However, we don't have d. Using v->domain here is tricky because v may be NULL. How about passing struct domain *d as an arg to {hvm,vmx}_pi_update_irte()? Or ensuring that all callers pass a valid v?

>>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>>
>>   unlock_out:
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index d0bf63df1d..ba2963b7d2 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -613,7 +613,7 @@ static int msi_capability_init(struct pci_dev *dev,
>>      u8 slot = PCI_SLOT(dev->devfn);
>>      u8 func = PCI_FUNC(dev->devfn);
>>
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>      pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
>>      if ( !pos )
>>          return -ENODEV;
>> @@ -783,7 +783,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>      if ( !pos )
>>          return -ENODEV;
>>
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>
>>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>>      /*
>> @@ -1000,7 +1000,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>>      struct pci_dev *pdev;
>>      struct msi_desc *old_desc;
>>
>> -    ASSERT(pcidevs_locked());
>>      pdev = pci_get_pdev(NULL, msi->sbdf);
>>      if ( !pdev )
>>          return -ENODEV;

I think we can move the ASSERT here, after we obtain the pdev. Then we can add the pdev->domain->pci_lock check into the mix:

    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));

>> @@ -1055,7 +1054,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
>>      struct pci_dev *pdev;
>>      struct msi_desc *old_desc;
>>
>> -    ASSERT(pcidevs_locked());
>>      pdev = pci_get_pdev(NULL, msi->sbdf);
>>      if ( !pdev || !pdev->msix )
>>          return -ENODEV;

Same here

>> @@ -1170,8 +1168,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>>   */
>>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>>  {
>> -    ASSERT(pcidevs_locked());
>> -

This removal inside pci_enable_msi() may be okay if both __pci_enable_msi() and __pci_enable_msix() have an appropriate ASSERT.

>>      if ( !use_msi )
>>          return -EPERM;
>>

Related: in xen/drivers/passthrough/pci.c:pci_get_pdev() I run into an ASSERT with a PVH dom0:

(XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:534
(XEN) ----[ Xen-4.18-unstable  x86_64  debug=y  Tainted:   C    ]----
...
(XEN) Xen call trace:
(XEN)    [<ffff82d040285a3b>] R pci_get_pdev+0x4c/0xab
(XEN)    [<ffff82d04034742e>] F arch/x86/msi.c#__pci_enable_msi+0x1d/0xb4
(XEN)    [<ffff82d0403477b5>] F pci_enable_msi+0x20/0x28
(XEN)    [<ffff82d04034cfa4>] F map_domain_pirq+0x2b0/0x718
(XEN)    [<ffff82d04034e37c>] F allocate_and_map_msi_pirq+0xff/0x26b
(XEN)    [<ffff82d0402e088b>] F arch/x86/hvm/vmsi.c#vpci_msi_enable+0x53/0x9d
(XEN)    [<ffff82d0402e19d5>] F vpci_msi_arch_enable+0x36/0x6c
(XEN)    [<ffff82d04026f49d>] F drivers/vpci/msi.c#control_write+0x71/0x114
(XEN)    [<ffff82d04026d050>] F drivers/vpci/vpci.c#vpci_write_helper+0x6f/0x7c
(XEN)    [<ffff82d04026de39>] F vpci_write+0x249/0x2f9
...

With the patch applied, it's valid to call pci_get_pdev() with only d->pci_lock held, so the ASSERT in pci_get_pdev() needs to be reworked too. Inside pci_get_pdev(), d may be null, so we can't easily add || rw_is_locked(&d->pci_lock) into the ASSERT. Instead I propose something like the following, which resolves the observed assertion failure:

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 572643abe412..2b4ad804510c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -531,8 +531,6 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
 {
     struct pci_dev *pdev;

-    ASSERT(d || pcidevs_locked());
-
     /*
      * The hardware domain owns the majority of the devices in the system.
      * When there are multiple segments, traversing the per-segment list is
@@ -549,12 +547,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
         list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
             if ( pdev->sbdf.bdf == sbdf.bdf &&
                  (!d || pdev->domain == d) )
+            {
+                ASSERT(d || pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
                 return pdev;
+            }
     }
     else
         list_for_each_entry ( pdev, &d->pdev_list, domain_list )
             if ( pdev->sbdf.sbdf == sbdf.sbdf )
+            {
+                ASSERT(d || pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
                 return pdev;
+            }

     return NULL;
 }



* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-20 13:56         ` Stewart Hildebrand
@ 2023-09-21  7:42           ` Jan Beulich
  2023-09-21  9:00             ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2023-09-21  7:42 UTC (permalink / raw)
  To: Stewart Hildebrand, Roger Pau Monné
  Cc: Volodymyr Babchuk, xen-devel, Oleksandr Andrushchenko,
	Andrew Cooper, Wei Liu, Jun Nakajima, Kevin Tian, Paul Durrant

On 20.09.2023 15:56, Stewart Hildebrand wrote:
> On 9/20/23 04:09, Roger Pau Monné wrote:
>> On Tue, Sep 19, 2023 at 12:20:39PM -0400, Stewart Hildebrand wrote:
>>> On 9/19/23 11:39, Roger Pau Monné wrote:
>>>> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>>>>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>>>>> index 8f2b59e61a..a0733bb2cb 100644
>>>>> --- a/xen/drivers/vpci/msi.c
>>>>> +++ b/xen/drivers/vpci/msi.c
>>>>> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
>>>>>                       * holding the lock.
>>>>>                       */
>>>>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>>>>> -                    process_pending_softirqs();
>>>>> -                    continue;
>>>>> +                    goto pdev_done;
>>>>>                  }
>>>>>              }
>>>>>
>>>>>              spin_unlock(&pdev->vpci->lock);
>>>>> + pdev_done:
>>>>> +            /*
>>>>> +             * Unlock lock to process pending softirqs. This is
>>>>> +             * potentially unsafe, as d->pdev_list can be changed in
>>>>> +             * meantime.
>>>>> +             */
>>>>> +            read_unlock(&d->pci_lock);
>>>>>              process_pending_softirqs();
>>>>> +            if ( !read_trylock(&d->pci_lock) )
>>>>> +            {
>>>>> +                printk("unable to access other devices for the domain\n");
>>>>> +                goto domain_done;
>>>>
>>>> Shouldn't the domain_done label be after the read_unlock(), so that we
>>>> can proceed to try to dump the devices for the next domain?  With the
>>>> proposed code a failure to acquire one of the domains pci_lock
>>>> terminates the dump.
>>>>
>>>>> +            }
>>>>>          }
>>>>> +        read_unlock(&d->pci_lock);
>>>>>      }
>>>>> + domain_done:
>>>>>      rcu_read_unlock(&domlist_read_lock);
>>>>>  }
>>>>>
>>>
>>> With the label moved, a no-op expression after the label is needed to make the compiler happy:
>>>
>>>             }
>>>         }
>>>         read_unlock(&d->pci_lock);
>>>  domain_done:
>>>         (void)0;
>>>     }
>>>     rcu_read_unlock(&domlist_read_lock);
>>> }
>>>
>>>
>>> If the no-op is omitted, the compiler may complain (gcc 9.4.0):
>>>
>>> drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
>>> drivers/vpci/msi.c:351:2: error: label at end of compound statement
>>>   351 |  domain_done:
>>>       |  ^~~~~~~~~~~
>>
>>
>> Might be better to place the label at the start of the loop, and
>> likely rename to next_domain.
> 
> That would bypass the loop condition and increment statements.

Right, such a label would be bogus even without that; instead of "goto"
the use site then simply should use "continue".

Jan



* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-21  7:42           ` Jan Beulich
@ 2023-09-21  9:00             ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-21  9:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Volodymyr Babchuk, xen-devel,
	Oleksandr Andrushchenko, Andrew Cooper, Wei Liu, Jun Nakajima,
	Kevin Tian, Paul Durrant

On Thu, Sep 21, 2023 at 09:42:08AM +0200, Jan Beulich wrote:
> On 20.09.2023 15:56, Stewart Hildebrand wrote:
> > On 9/20/23 04:09, Roger Pau Monné wrote:
> >> On Tue, Sep 19, 2023 at 12:20:39PM -0400, Stewart Hildebrand wrote:
> >>> On 9/19/23 11:39, Roger Pau Monné wrote:
> >>>> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> >>>>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> >>>>> index 8f2b59e61a..a0733bb2cb 100644
> >>>>> --- a/xen/drivers/vpci/msi.c
> >>>>> +++ b/xen/drivers/vpci/msi.c
> >>>>> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
> >>>>>                       * holding the lock.
> >>>>>                       */
> >>>>>                      printk("unable to print all MSI-X entries: %d\n", rc);
> >>>>> -                    process_pending_softirqs();
> >>>>> -                    continue;
> >>>>> +                    goto pdev_done;
> >>>>>                  }
> >>>>>              }
> >>>>>
> >>>>>              spin_unlock(&pdev->vpci->lock);
> >>>>> + pdev_done:
> >>>>> +            /*
> >>>>> +             * Unlock lock to process pending softirqs. This is
> >>>>> +             * potentially unsafe, as d->pdev_list can be changed in
> >>>>> +             * meantime.
> >>>>> +             */
> >>>>> +            read_unlock(&d->pci_lock);
> >>>>>              process_pending_softirqs();
> >>>>> +            if ( !read_trylock(&d->pci_lock) )
> >>>>> +            {
> >>>>> +                printk("unable to access other devices for the domain\n");
> >>>>> +                goto domain_done;
> >>>>
> >>>> Shouldn't the domain_done label be after the read_unlock(), so that we
> >>>> can proceed to try to dump the devices for the next domain?  With the
> >>>> proposed code a failure to acquire one of the domains pci_lock
> >>>> terminates the dump.
> >>>>
> >>>>> +            }
> >>>>>          }
> >>>>> +        read_unlock(&d->pci_lock);
> >>>>>      }
> >>>>> + domain_done:
> >>>>>      rcu_read_unlock(&domlist_read_lock);
> >>>>>  }
> >>>>>
> >>>
> >>> With the label moved, a no-op expression after the label is needed to make the compiler happy:
> >>>
> >>>             }
> >>>         }
> >>>         read_unlock(&d->pci_lock);
> >>>  domain_done:
> >>>         (void)0;
> >>>     }
> >>>     rcu_read_unlock(&domlist_read_lock);
> >>> }
> >>>
> >>>
> >>> If the no-op is omitted, the compiler may complain (gcc 9.4.0):
> >>>
> >>> drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
> >>> drivers/vpci/msi.c:351:2: error: label at end of compound statement
> >>>   351 |  domain_done:
> >>>       |  ^~~~~~~~~~~
> >>
> >>
> >> Might be better to place the label at the start of the loop, and
> >> likely rename to next_domain.
> > 
> > That would bypass the loop condition and increment statements.
> 
> Right, such a label would be bogus even without that; instead of "goto"
> the use site then simply should use "continue".

IIRC continue is not suitable because the code would reach the
read_unlock() without having the lock held.
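
To make the locking imbalance concrete, here is a standalone sketch of one
possible way to keep iterating over domains: track whether the lock is
still held and only unlock in that case. This is an illustration under
assumptions, not the submitter's solution. The lock is modelled as a plain
counter, and struct fake_domain, the fake_read_*() helpers and dump_all()
are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a domain with a per-domain read lock. */
struct fake_domain {
    int readers;          /* Number of read holds currently taken. */
    bool trylock_fails;   /* Makes fake_read_trylock() fail, for testing. */
    unsigned int nr_devs; /* Devices to walk for this domain. */
};

static void fake_read_lock(struct fake_domain *d)
{
    d->readers++;
}

static void fake_read_unlock(struct fake_domain *d)
{
    assert(d->readers > 0); /* Unlocking without holding the lock is a bug. */
    d->readers--;
}

static bool fake_read_trylock(struct fake_domain *d)
{
    if ( d->trylock_fails )
        return false;
    d->readers++;
    return true;
}

/* Returns the number of devices visited across all domains. */
static unsigned int dump_all(struct fake_domain *doms, unsigned int nr)
{
    unsigned int i, j, visited = 0;

    for ( i = 0; i < nr; i++ )
    {
        struct fake_domain *d = &doms[i];
        bool locked = true;

        fake_read_lock(d);
        for ( j = 0; j < d->nr_devs; j++ )
        {
            visited++;
            /* Drop the lock, as done around process_pending_softirqs(). */
            fake_read_unlock(d);
            if ( !fake_read_trylock(d) )
            {
                locked = false;
                break; /* Give up on this domain only. */
            }
        }
        if ( locked )
            fake_read_unlock(d);
    }

    return visited;
}
```

A plain continue in the inner loop's failure branch would only advance to
the next device and eventually reach the unlock with no lock held, which
the assertion in fake_read_unlock() would catch.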

Anyway, I would leave it to the submitter to find a suitable way to
continue the domain iteration.

Thanks, Roger.



* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-20 19:16     ` Stewart Hildebrand
@ 2023-09-21  9:41       ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-21  9:41 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Volodymyr Babchuk, xen-devel, Oleksandr Andrushchenko,
	Jan Beulich, Andrew Cooper, Wei Liu, Jun Nakajima, Kevin Tian,
	Paul Durrant

On Wed, Sep 20, 2023 at 03:16:00PM -0400, Stewart Hildebrand wrote:
> On 9/19/23 11:39, Roger Pau Monné wrote:
> > On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> >> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> >> index 1edc7f1e91..545a27796e 100644
> >> --- a/xen/arch/x86/hvm/vmx/vmx.c
> >> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >> @@ -413,8 +413,6 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
> >>
> >>      spin_unlock_irq(&desc->lock);
> >>
> >> -    ASSERT(pcidevs_locked());
> >> -
> > 
> > Hm, this removal seems dubious, same with some of the removal below.
> > And I don't see any comment in the log message as to why removing the
> > asserts here and in __pci_enable_msi{,x}(), pci_prepare_msix() is
> > safe.
> > 
> 
> I suspect we may want:
> 
>     ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> 
> However, we don't have d. Using v->domain here is tricky because v may be NULL. How about passing struct domain *d as an arg to {hvm,vmx}_pi_update_irte()? Or ensuring that all callers pass a valid v?

I guess there was a reason to expect a path with v == NULL, but we would
need to go through the call paths that lead here.

Another option might be to use:

ASSERT(pcidevs_locked() || (v && rw_is_locked(&v->domain->pci_lock)));

But we would need some understanding of the call site of
vmx_pi_update_irte().
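
As a standalone illustration of why the (v && ...) form tolerates a NULL
v, the sketch below uses heavily reduced, hypothetical stand-ins for the
Xen structures and lock predicates (struct fake_vcpu, struct fake_domain,
fake_pcidevs_locked, pci_locking_ok(), update_irte_sketch()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the Xen structures. */
struct fake_domain { bool pci_lock_held; };
struct fake_vcpu { struct fake_domain *domain; };

/* Hypothetical stand-in for the state behind pcidevs_locked(). */
static bool fake_pcidevs_locked;

/* True when either the global lock or the vCPU's domain lock is held. */
static bool pci_locking_ok(const struct fake_vcpu *v)
{
    /* && short-circuits, so v->domain is never read when v is NULL. */
    return fake_pcidevs_locked || (v && v->domain->pci_lock_held);
}

/* Hypothetical caller showing where the check would sit. */
static int update_irte_sketch(const struct fake_vcpu *v)
{
    assert(pci_locking_ok(v));
    return 0;
}
```

With v == NULL the check degrades to requiring the global lock, which
matches the behaviour of the assertion before the patch.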

> 
> >>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
> >>
> >>   unlock_out:
> >> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> >> index d0bf63df1d..ba2963b7d2 100644
> >> --- a/xen/arch/x86/msi.c
> >> +++ b/xen/arch/x86/msi.c
> >> @@ -613,7 +613,7 @@ static int msi_capability_init(struct pci_dev *dev,
> >>      u8 slot = PCI_SLOT(dev->devfn);
> >>      u8 func = PCI_FUNC(dev->devfn);
> >>
> >> -    ASSERT(pcidevs_locked());
> >> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
> >>      pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
> >>      if ( !pos )
> >>          return -ENODEV;
> >> @@ -783,7 +783,7 @@ static int msix_capability_init(struct pci_dev *dev,
> >>      if ( !pos )
> >>          return -ENODEV;
> >>
> >> -    ASSERT(pcidevs_locked());
> >> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
> >>
> >>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
> >>      /*
> >> @@ -1000,7 +1000,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
> >>      struct pci_dev *pdev;
> >>      struct msi_desc *old_desc;
> >>
> >> -    ASSERT(pcidevs_locked());
> >>      pdev = pci_get_pdev(NULL, msi->sbdf);
> >>      if ( !pdev )
> >>          return -ENODEV;
> 
> I think we can move the ASSERT here, after we obtain the pdev. Then we can add the pdev->domain->pci_lock check into the mix:
> 
>     ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));

Hm, it would be better to perform the ASSERT before possibly accessing
the pdev list without holding any locks, but it's just an assert so
that might be the best option.

> 
> >> @@ -1055,7 +1054,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
> >>      struct pci_dev *pdev;
> >>      struct msi_desc *old_desc;
> >>
> >> -    ASSERT(pcidevs_locked());
> >>      pdev = pci_get_pdev(NULL, msi->sbdf);
> >>      if ( !pdev || !pdev->msix )
> >>          return -ENODEV;
> 
> Same here
> 
> >> @@ -1170,8 +1168,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
> >>   */
> >>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
> >>  {
> >> -    ASSERT(pcidevs_locked());
> >> -
> 
> This removal inside pci_enable_msi() may be okay if both __pci_enable_msi() and __pci_enable_msix() have an appropriate ASSERT.

Hm, yes, that's likely fine, but would want a small mention in the
commit message.

> >>      if ( !use_msi )
> >>          return -EPERM;
> >>
> 
> Related: in xen/drivers/passthrough/pci.c:pci_get_pdev() I run into an ASSERT with a PVH dom0:
> 
> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:534
> (XEN) ----[ Xen-4.18-unstable  x86_64  debug=y  Tainted:   C    ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040285a3b>] R pci_get_pdev+0x4c/0xab
> (XEN)    [<ffff82d04034742e>] F arch/x86/msi.c#__pci_enable_msi+0x1d/0xb4
> (XEN)    [<ffff82d0403477b5>] F pci_enable_msi+0x20/0x28
> (XEN)    [<ffff82d04034cfa4>] F map_domain_pirq+0x2b0/0x718
> (XEN)    [<ffff82d04034e37c>] F allocate_and_map_msi_pirq+0xff/0x26b
> (XEN)    [<ffff82d0402e088b>] F arch/x86/hvm/vmsi.c#vpci_msi_enable+0x53/0x9d
> (XEN)    [<ffff82d0402e19d5>] F vpci_msi_arch_enable+0x36/0x6c
> (XEN)    [<ffff82d04026f49d>] F drivers/vpci/msi.c#control_write+0x71/0x114
> (XEN)    [<ffff82d04026d050>] F drivers/vpci/vpci.c#vpci_write_helper+0x6f/0x7c
> (XEN)    [<ffff82d04026de39>] F vpci_write+0x249/0x2f9
> ...
> 
> With the patch applied, it's valid to call pci_get_pdev() with only d->pci_lock held, so the ASSERT in pci_get_pdev() needs to be reworked too. Inside pci_get_pdev(), d may be null, so we can't easily add || rw_is_locked(&d->pci_lock) into the ASSERT. Instead I propose something like the following, which resolves the observed assertion failure:
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 572643abe412..2b4ad804510c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -531,8 +531,6 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
>  {
>      struct pci_dev *pdev;
> 
> -    ASSERT(d || pcidevs_locked());
> -
>      /*
>       * The hardware domain owns the majority of the devices in the system.
>       * When there are multiple segments, traversing the per-segment list is
> @@ -549,12 +547,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
>          list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>              if ( pdev->sbdf.bdf == sbdf.bdf &&
>                   (!d || pdev->domain == d) )
> +            {
> +                ASSERT(d || pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));

Hm, strictly speaking iterating over the pseg list while just holding
the d->pci_lock is not safe, we should instead iterate over d->pdev_list.

We might have to slightly modify pci_enable_msi() to take a pdev so
that the search can be done by the caller (holding the right lock).

Thanks, Roger.



* Re: [PATCH v9 09/16] vpci/header: program p2m with guest BAR view
  2023-08-29 23:19 ` [PATCH v9 09/16] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
@ 2023-09-21 10:34   ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-21 10:34 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko

On Tue, Aug 29, 2023 at 11:19:44PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Take into account guest's BAR view and program its p2m accordingly:
> gfn is guest's view of the BAR and mfn is the physical BAR value.
> This way hardware domain sees physical BAR values and guest sees
> emulated ones.
> 
> Hardware domain continues getting the BARs identity mapped, while for
> domUs the BARs are mapped at the requested guest address without
> modifying the BAR address in the device PCI config space.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
> Since v9:
> - Extended the commit message
> - Use bar->guest_addr in modify_bars
> - Extended printk error message in map_range
> - Moved map_data initialization so .bar can be initialized during declaration
> Since v5:
> - remove debug print in map_range callback
> - remove "identity" from the debug print
> Since v4:
> - moved start_{gfn|mfn} calculation into map_range
> - pass vpci_bar in the map_data instead of start_{gfn|mfn}
> - s/guest_addr/guest_reg
> Since v3:
> - updated comment (Roger)
> - removed gfn_add(map->start_gfn, rc); which is wrong
> - use v->domain instead of v->vpci.pdev->domain
> - removed odd e.g. in comment
> - s/d%d/%pd in altered code
> - use gdprintk for map/unmap logs
> Since v2:
> - improve readability for data.start_gfn and restructure ?: construct
> Since v1:
>  - s/MSI/MSI-X in comments
> ---
>  xen/drivers/vpci/header.c | 52 ++++++++++++++++++++++++++++-----------
>  1 file changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 3cc6a96849..1e82217200 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -33,6 +33,7 @@
>  
>  struct map_data {
>      struct domain *d;
> +    const struct vpci_bar *bar;
>      bool map;
>  };
>  
> @@ -44,6 +45,12 @@ static int cf_check map_range(
>  
>      for ( ; ; )
>      {
> +        /* Start address of the BAR as seen by the guest. */
> +        gfn_t start_gfn = _gfn(PFN_DOWN(is_hardware_domain(map->d)
> +                                        ? map->bar->addr
> +                                        : map->bar->guest_addr));
> +        /* Physical start address of the BAR. */
> +        mfn_t start_mfn = _mfn(PFN_DOWN(map->bar->addr));

Both of those should be declared outside of the loop, as there's no
need to (re)calculate them at each iteration.

Also start_gfn likely wants to be unsigned long?  All the usages of it
in the patch convert it to integer by using gfn_x().

>          unsigned long size = e - s + 1;
>  
>          if ( !iomem_access_permitted(map->d, s, e) )
> @@ -63,6 +70,13 @@ static int cf_check map_range(
>              return rc;
>          }
>  
> +        /*
> +         * Ranges to be mapped don't always start at the BAR start address, as
> +         * there can be holes or partially consumed ranges. Account for the
> +         * offset of the current address from the BAR start.
> +         */
> +        start_mfn = mfn_add(start_mfn, s - gfn_x(start_gfn));

This should then be a local loop variable with a different name.
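
As an illustration only (not part of the patch), the two remarks above
can be sketched in standalone C, assuming plain unsigned long frame
numbers instead of gfn_t/mfn_t. struct bar_model and range_to_mfn() are
hypothetical names made up for this sketch:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define PFN_DOWN(x) ((unsigned long)((x) >> PAGE_SHIFT))

/* Hypothetical stand-in for the two struct vpci_bar fields used here. */
struct bar_model {
    unsigned long long addr;       /* host physical address of the BAR */
    unsigned long long guest_addr; /* address programmed by the guest */
};

/*
 * Return the machine frame backing guest frame s.  Both BAR start frames
 * are loop invariant, so a caller iterating over ranges would compute
 * them once and keep the per-iteration result in a separate variable.
 */
static unsigned long range_to_mfn(const struct bar_model *bar, bool hwdom,
                                  unsigned long s)
{
    unsigned long start_gfn = PFN_DOWN(hwdom ? bar->addr : bar->guest_addr);
    unsigned long start_mfn = PFN_DOWN(bar->addr);

    /* A range handed to the callback never starts below the BAR start. */
    assert(s >= start_gfn);

    return start_mfn + (s - start_gfn);
}
```

With start_gfn as a plain unsigned long the offset is a simple
subtraction and no gfn_x() conversion is needed.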

> +
>          /*
>           * ARM TODOs:
>           * - On ARM whether the memory is prefetchable or not should be passed
> @@ -72,8 +86,8 @@ static int cf_check map_range(
>           * - {un}map_mmio_regions doesn't support preemption.
>           */
>  
> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, start_mfn)
> +                      : unmap_mmio_regions(map->d, _gfn(s), size, start_mfn);
>          if ( rc == 0 )
>          {
>              *c += size;
> @@ -82,8 +96,9 @@ static int cf_check map_range(
>          if ( rc < 0 )
>          {
>              printk(XENLOG_G_WARNING
> -                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
> -                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
> +                   "Failed to %smap [%lx (%lx), %lx (%lx)] for %pd: %d\n",

I think we would usually write such mapping messages as:

[start gfn, end gfn] -> [start mfn, end mfn]

So:

"Failed to %smap [%lx, %lx] -> [%lx, %lx] for %pd: %d\n"

> +                   map->map ? "" : "un", s,  mfn_x(start_mfn), e,
> +                   mfn_x(start_mfn) + size, map->d, rc);
>              break;
>          }
>          ASSERT(rc < size);
> @@ -162,10 +177,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>  bool vpci_process_pending(struct vcpu *v)
>  {
>      struct pci_dev *pdev = v->vpci.pdev;
> -    struct map_data data = {
> -        .d = v->domain,
> -        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> -    };
>      struct vpci_header *header = NULL;
>      unsigned int i;
>  
> @@ -177,6 +188,11 @@ bool vpci_process_pending(struct vcpu *v)
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct map_data data = {
> +            .d = v->domain,
> +            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> +            .bar = bar,
> +        };
>          int rc;
>  
>          if ( rangeset_is_empty(bar->mem) )
> @@ -229,7 +245,6 @@ bool vpci_process_pending(struct vcpu *v)
>  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>                              uint16_t cmd)
>  {
> -    struct map_data data = { .d = d, .map = true };
>      struct vpci_header *header = &pdev->vpci->header;
>      int rc = 0;
>      unsigned int i;
> @@ -239,6 +254,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct map_data data = { .d = d, .map = true, .bar = bar };
>  
>          if ( rangeset_is_empty(bar->mem) )
>              continue;
> @@ -306,12 +322,18 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>       * First fill the rangesets with the BAR of this device or with the ROM
>       * BAR only, depending on whether the guest is toggling the memory decode
>       * bit of the command register, or the enable bit of the ROM BAR register.
> +     *
> +     * For non-hardware domain we use guest physical addresses.
>       */
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
>          unsigned long start = PFN_DOWN(bar->addr);
>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> +        unsigned long start_guest = PFN_DOWN(is_hardware_domain(pdev->domain) ?
> +                                             bar->addr : bar->guest_addr);
> +        unsigned long end_guest = PFN_DOWN((is_hardware_domain(pdev->domain) ?
> +                                  bar->addr : bar->guest_addr) + bar->size - 1);
>  
>          if ( !bar->mem )
>              continue;
> @@ -331,11 +353,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              continue;
>          }
>  
> -        rc = rangeset_add_range(bar->mem, start, end);
> +        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
>          if ( rc )
>          {
>              printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
> -                   start, end, rc);
> +                   start_guest, end_guest, rc);
>              return rc;
>          }
>  
> @@ -352,7 +374,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              {
>                  gprintk(XENLOG_WARNING,
>                         "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
> -                        &pdev->sbdf, start, end, rc);
> +                        &pdev->sbdf, start_guest, end_guest, rc);
>                  return rc;
>              }
>          }

I think you are missing a change to adjust vmsix_table_base() to also
return the MSI-X table position in guest address space for domUs, or
else the MSI-X overlapping range checks for domUs are wrong.
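
A rough standalone sketch of the idea, under the assumption that the
table position is simply the BAR base plus the table offset.  This is
not the real vmsix_table_base() signature; struct bar_model and
msix_table_base() are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct vpci_bar fields used here. */
struct bar_model {
    uint64_t addr;       /* host physical address of the BAR */
    uint64_t guest_addr; /* address programmed by the guest */
};

/*
 * Position of the MSI-X table in the address space the rangeset uses:
 * host addresses for the hardware domain, guest addresses for domUs.
 */
static uint64_t msix_table_base(const struct bar_model *bar,
                                uint32_t table_offset, bool hwdom)
{
    return (hwdom ? bar->addr : bar->guest_addr) + table_offset;
}
```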

Thanks, Roger.



* Re: [PATCH v9 10/16] vpci/header: emulate PCI_COMMAND register for guests
  2023-08-29 23:19 ` [PATCH v9 10/16] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
  2023-09-01  5:23   ` Stewart Hildebrand
@ 2023-09-21 13:18   ` Roger Pau Monné
  1 sibling, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-21 13:18 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko

On Tue, Aug 29, 2023 at 11:19:44PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
> guest's view of this will want to be zero initially, the host having set
> it to 1 may not easily be overwritten with 0, or else we'd effectively
> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
> proper emulation in order to honor host's settings.
> 
> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
> Device Control" the reset state of the command register is typically 0,
> so when assigning a PCI device use 0 as the initial state for the guest's view
> of the command register.
> 
> Here is the full list of command register bits with notes about
> emulation:
> 
> PCI_COMMAND_IO - Allow guest to control it.
> PCI_COMMAND_MEMORY - Already handled.
> PCI_COMMAND_MASTER - Allow guest to control it.
> PCI_COMMAND_SPECIAL - Guest can generate special cycles only if it has
> access to host bridge that supports software generation of special
> cycles. In our case guest has no access to host bridges at all. Value
> after reset is 0.
> PCI_COMMAND_INVALIDATE - Allows "Memory Write and Invalidate" commands
> to be generated. It requires additional configuration via Cacheline
> Size register. We are not emulating this register right now and we
> can't expect guest to properly configure it.
> PCI_COMMAND_VGA_PALETTE - Enable VGA palette snooping. This bit is set
> by firmware and we want to leave it as is.
> PCI_COMMAND_PARITY - Controls how device response to parity
> errors. We want this bit to be set by a hardware domain.
> PCI_COMMAND_WAIT - Reserved. Should be 0.
> PCI_COMMAND_SERR - Controls if device can assert SERR.
> The same as for COMMAND_PARITY.
> PCI_COMMAND_FAST_BACK - Optional bit that allows fast back-to-back
> transactions. It is configured by firmware, so we don't want guest to
> control it.
> PCI_COMMAND_INTX_DISABLE - Disables INTx signals. If MSI(X) is
> enabled, device is prohibited from asserting INTx. Value after reset
> is 0. Guest can control it freely.

I'm kind of confused by the text above: does "Guest can control it
freely" imply that the guest is able to modify the bit in the device
command register, as opposed to the emulated view that we provide?  If
so, direct guest modification of INTX_DISABLE should not be allowed.

I'm thinking that we might want to allow guest access to the first 3
bits only, while the rest of the values won't be propagated to
hardware, iow: guest will get a fake view of them.

Have you checked how QEMU handles those bits?  That's our current
passthrough reference implementation, and we should aim to handle
them similarly in vPCI.
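
As an illustration only, the "first 3 bits" idea could look like the
standalone sketch below.  The bit values are the standard PCI command
register ones; GUEST_HW_MASK and cmd_to_hw() are hypothetical names, and
the sketch ignores the deferred handling that vPCI applies to the
memory decoding bit:

```c
#include <assert.h>
#include <stdint.h>

/* Standard PCI command register bits. */
#define PCI_COMMAND_IO      0x0001
#define PCI_COMMAND_MEMORY  0x0002
#define PCI_COMMAND_MASTER  0x0004

/* Hypothetical mask of the bits a guest write may change in hardware. */
#define GUEST_HW_MASK \
    (PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER)

/*
 * Merge a guest write into the value to be written to the device:
 * guest-controlled bits come from the guest, everything else keeps the
 * current hardware setting.  The full guest value would be cached
 * separately and returned on reads as the emulated view.
 */
static uint16_t cmd_to_hw(uint16_t hw_cmd, uint16_t guest_cmd)
{
    return (uint16_t)((hw_cmd & ~GUEST_HW_MASK) |
                      (guest_cmd & GUEST_HW_MASK));
}
```

With this shape the decision does not depend on any build time option,
only on the mask.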

> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
> Since v9:
> - Reworked guest_cmd_read
> - Added handling for more bits
> Since v6:
> - fold guest's logic into cmd_write
> - implement cmd_read, so we can report emulated INTx state to guests
> - introduce header->guest_cmd to hold the emulated state of the
>   PCI_COMMAND register for guests
> Since v5:
> - add additional check for MSI-X enabled while altering INTX bit
> - make sure INTx disabled while guests enable MSI/MSI-X
> Since v3:
> - gate more code on CONFIG_HAS_MSI
> - removed logic for the case when MSI/MSI-X not enabled
> ---
>  xen/drivers/vpci/header.c | 54 ++++++++++++++++++++++++++++++++++++---
>  xen/drivers/vpci/msi.c    | 10 ++++++++
>  xen/drivers/vpci/msix.c   |  4 +++
>  xen/include/xen/vpci.h    |  3 +++
>  4 files changed, 67 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 1e82217200..e351db4620 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -502,14 +502,37 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      return 0;
>  }
>  
> +/* TODO: Add proper emulation for all bits of the command register. */
>  static void cf_check cmd_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
>  {
>      struct vpci_header *header = data;
>  
> +    if ( !is_hardware_domain(pdev->domain) )
> +    {
> +        if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
> +        {
> +            /* Tell guest that device does not support this */
> +            cmd &= ~PCI_COMMAND_FAST_BACK;
> +        }
> +
> +        header->guest_cmd = cmd;
> +
> +        if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
> +        {
> +            /* Do not touch INVALIDATE, PARITY and SERR */
> +            const uint16_t excluded = PCI_COMMAND_INVALIDATE |
> +                PCI_COMMAND_PARITY | PCI_COMMAND_SERR;
> +
> +            cmd &= ~excluded;
> +            cmd |= pci_conf_read16(pdev->sbdf, reg) & excluded;
> +        }

I'm not following why allowing guest setting of those bits is
conditional on HAS_PCI_MSI being enabled at build time.

Isn't it equally good or bad to let the guest play with certain bits
regardless of Xen build time configuration?

As said above, I would look at QEMU and see how bits are handled
there.

> +    }
> +
>      /*
> -     * Let Dom0 play with all the bits directly except for the memory
> -     * decoding one.
> +     * Let guest play with all the bits directly except for the memory
> +     * decoding one. Bits that are not allowed for DomU are already
> +     * handled above.
>       */
>      if ( header->bars_mapped != !!(cmd & PCI_COMMAND_MEMORY) )
>          /*
> @@ -523,6 +546,14 @@ static void cf_check cmd_write(
>          pci_conf_write16(pdev->sbdf, reg, cmd);
>  }
>  
> +static uint32_t guest_cmd_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    const struct vpci_header *header = data;
> +
> +    return header->guest_cmd;
> +}
> +
>  static void cf_check bar_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
>  {
> @@ -732,8 +763,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      }
>  
>      /* Setup a handler for the command register. */
> -    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
> -                           2, header);
> +    if ( is_hwdom )
> +        rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
> +                               2, header);
> +    else
> +        rc = vpci_add_register(pdev->vpci, guest_cmd_read, cmd_write, PCI_COMMAND,
> +                               2, header);

You have used the ternary operator in other places, I would recommend
to also do it here to avoid line duplication.

>      if ( rc )
>          return rc;
>  
> @@ -745,6 +780,17 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      if ( cmd & PCI_COMMAND_MEMORY )
>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
>  
> +    header->guest_cmd = cmd & ~PCI_COMMAND_MEMORY;

Memory Enable is cleared from the guest view, yet at the end of
init_bars() mappings will be established if the bit is enabled in cmd.
Won't this create a mismatch between the guest view and the contents
of the physmap?

> +
> +    /*
> +     * According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
> +     * Device Control" the reset state of the command register is
> +     * typically all 0's, so this is used as initial value for the guests.
> +     */
> +    if ( header->guest_cmd != 0 )
> +        gprintk(XENLOG_WARNING, "%pp: CMD is not zero: %x", &pdev->sbdf,
> +                header->guest_cmd);

I think it's unlikely for the command register to be zeroed out.  I
haven't looked, but I would assume that after a device reset by
pciback some initial state is likely to be set.

> +
>      for ( i = 0; i < num_bars; i++ )
>      {
>          uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index a0733bb2cb..df0f0199b8 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -70,6 +70,16 @@ static void cf_check control_write(
>  
>          if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>              return;
> +
> +        /*
> +         * Make sure guest doesn't enable INTx while enabling MSI.
> +         * Opposite action (enabling INTx) will be performed in
> +         * vpci_msi_arch_disable call path.

I'm not seeing such code in vpci_msi_arch_disable().  However the
updating of the INTX field should be done after MSI(X) has been
disabled, and hence can only be done after the pci_conf_write16() in
control_write().

I would be fine if you want to leave out forcing INTx back to enabled
once MSI has been disabled, as any sane guest will do that itself, but
the comment needs updating.

Thanks, Roger.



* Re: [PATCH v9 11/16] vpci/header: reset the command register when adding devices
  2023-08-29 23:19 ` [PATCH v9 11/16] vpci/header: reset the command register when adding devices Volodymyr Babchuk
@ 2023-09-21 13:30   ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-21 13:30 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko

On Tue, Aug 29, 2023 at 11:19:45PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Reset the command register when assigning a PCI device to a guest:
> according to the PCI spec the PCI_COMMAND register is typically all 0's
> after reset, but this might not be true for the guest as it needs
> to respect host's settings.
> For that reason, do not write 0 to the PCI_COMMAND register directly,
> but go through the corresponding emulation layer (cmd_write), which
> will take care about the actual bits written. Also, honor value of
> PCI_COMMAND_VGA_PALETTE value, which is set by firmware.

I think this is likely dangerous.  It would be better IMO to simply
make sure the value presented to the guest is all zeros, and that the
vPCI cached state is consistent with that.

> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
> Since v9:
> - Honor PCI_COMMAND_VGA_PALETTE bit
> Since v6:
> - use cmd_write directly without introducing emulate_cmd_reg
> - update commit message with more description on all 0's in PCI_COMMAND
> Since v5:
> - updated commit message
> Since v1:
>  - do not write 0 to the command register, but respect host settings.
> ---
>  xen/drivers/vpci/header.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index e351db4620..1d243eeaf9 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -762,6 +762,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          return -EOPNOTSUPP;
>      }
>  
> +    /* Reset the command register for guests. We want to preserve only
> +     * PCI_COMMAND_VGA_PALETTE as it is configured by firmware */

Wrong comment style, and isn't PCI_COMMAND_VGA_PALETTE likely to be
gone anyway after we perform a FLR of the device?

> +    cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
> +    if ( !is_hwdom )
> +        cmd_write(pdev, PCI_COMMAND, cmd & PCI_COMMAND_VGA_PALETTE, header);

Such cmd_write() call might trigger an attempt to change the guest
physmap if you are toggling the Memory Enabled bit from 1 -> 0, and
that would fail because the guest doesn't have BAR p2m mappings setup
yet, those are done at the end of the function by the call to
modify_bars().

Thanks, Roger.



* Re: [PATCH v9 12/16] vpci: add initial support for virtual PCI bus topology
  2023-08-29 23:19 ` [PATCH v9 12/16] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
  2023-08-30  7:37   ` Jan Beulich
@ 2023-09-21 16:03   ` Roger Pau Monné
  1 sibling, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-21 16:03 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini, Wei Liu

On Tue, Aug 29, 2023 at 11:19:46PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v9:
> - Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)
> Since v8:
> - Added write lock in add_virtual_device
> Since v6:
> - re-work wrt new locking scheme
> - OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
> Since v5:
> - s/vpci_add_virtual_device/add_virtual_device and make it static
> - call add_virtual_device from vpci_assign_device and do not use
>   REGISTER_VPCI_INIT machinery
> - add pcidevs_locked ASSERT
> - use DECLARE_BITMAP for vpci_dev_assigned_map
> Since v4:
> - moved and re-worked guest sbdf initializers
> - s/set_bit/__set_bit
> - s/clear_bit/__clear_bit
> - minor comment fix s/Virtual/Guest/
> - added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
>   later for counting the number of MMIO handlers required for a guest
>   (Julien)
> Since v3:
>  - make use of VPCI_INIT
>  - moved all new code to vpci.c which belongs to it
>  - changed open-coded 31 to PCI_SLOT(~0)
>  - added comments and code to reject multifunction devices with
>    functions other than 0
>  - updated comment about vpci_dev_next and made it unsigned int
>  - implement roll back in case of error while assigning/deassigning devices
>  - s/dom%pd/%pd
> Since v2:
>  - remove casts that are (a) malformed and (b) unnecessary
>  - add new line for better readability
>  - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
>     functions are now completely gated with this config
>  - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/drivers/vpci/vpci.c | 69 +++++++++++++++++++++++++++++++++++++++++
>  xen/include/xen/sched.h |  8 +++++
>  xen/include/xen/vpci.h  | 11 +++++++
>  3 files changed, 88 insertions(+)
> 
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 412685f41d..b284f95e05 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -36,6 +36,54 @@ extern vpci_register_init_t *const __start_vpci_array[];
>  extern vpci_register_init_t *const __end_vpci_array[];
>  #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    pci_sbdf_t sbdf = { 0 };
> +    unsigned long new_dev_number;
> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(pcidevs_locked() && rw_is_write_locked(&pdev->domain->pci_lock));


Do you need to check for pcidevs here?  I would think d->pci_lock
would be enough to protect the virtual allocation device bitmap.

> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn )

I think you are missing a !pdev->info.is_virtfn check, as is_extfn &&
is_virtfn means the PF is an extended function, not the VF we are
trying to pass through.

> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> +                                         VPCI_MAX_VIRT_DEV);
> +    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )

The > is not required, as find_first_zero_bit() will return
VPCI_MAX_VIRT_DEV if the bitmap is all set.
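
For illustration, a simplified standalone model of that behaviour,
assuming a bitmap that fits in a single word.  first_zero_bit() and
alloc_slot() are hypothetical helpers, not the Xen implementation:

```c
#include <assert.h>

#define MAX_VIRT_DEV 32u

/*
 * Simplified model of find_first_zero_bit(): return the index of the
 * first clear bit below size, or size itself if every bit is set.
 */
static unsigned int first_zero_bit(unsigned long map, unsigned int size)
{
    unsigned int i;

    for ( i = 0; i < size; i++ )
        if ( !(map & (1UL << i)) )
            return i;

    return size;
}

/* Allocate the lowest free slot, or return -1 if the bus is full. */
static int alloc_slot(unsigned long *map)
{
    unsigned int slot = first_zero_bit(*map, MAX_VIRT_DEV);

    /* The search never returns more than the size, so == is enough. */
    if ( slot == MAX_VIRT_DEV )
        return -1;

    *map |= 1UL << slot;

    return (int)slot;
}
```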

> +    {
> +        write_unlock(&pdev->domain->pci_lock);
> +        return -ENOSPC;
> +    }
> +
> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
> +
> +    /*
> +     * Both segment and bus number are 0:
> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
> +     *  - with bus 0 the virtual devices are seen as embedded
> +     *    endpoints behind the root complex
> +     *
> +     * TODO: add support for multi-function devices.
> +     */
> +    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
> +    pdev->vpci->guest_sbdf = sbdf;

You could avoid the local sbdf variable and just use PCI_SBDF(0, 0,
new_dev_number, 0);
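
As a reminder of the encoding involved, here is a standalone sketch that
packs segment, bus, device and function into a 32-bit value the usual
way (segment in bits 31-16, bus in 15-8, device in 7-3, function in
2-0).  DEVFN() and make_sbdf() are hypothetical names that only model
what PCI_DEVFN()/PCI_SBDF() compute:

```c
#include <assert.h>
#include <stdint.h>

/* Device in bits 7-3, function in bits 2-0. */
#define DEVFN(d, f) ((((d) & 0x1fu) << 3) | ((f) & 0x07u))

/* Pack segment, bus, device and function into one 32-bit SBDF value. */
static uint32_t make_sbdf(unsigned int seg, unsigned int bus,
                          unsigned int dev, unsigned int fn)
{
    return ((uint32_t)(seg & 0xffffu) << 16) |
           ((uint32_t)(bus & 0xffu) << 8) |
           DEVFN(dev, fn);
}
```

With segment and bus both 0, the guest SBDF is just the devfn, so the
value can be built in a single expression.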

> +
> +    return 0;
> +}
> +
> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
> +
>  void vpci_deassign_device(struct pci_dev *pdev)
>  {
>      unsigned int i;
> @@ -46,6 +94,16 @@ void vpci_deassign_device(struct pci_dev *pdev)
>          return;
>  
>      spin_lock(&pdev->vpci->lock);
> +
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
> +    {
> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
> +                    &pdev->domain->vpci_dev_assigned_map);
> +        pdev->vpci->guest_sbdf.sbdf = ~0;
> +    }
> +#endif

There's no need to set sbdf = ~0 as vpci is just about to be freed.

> +
>      while ( !list_empty(&pdev->vpci->handlers) )
>      {
>          struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> @@ -101,6 +159,13 @@ int vpci_assign_device(struct pci_dev *pdev)
>      INIT_LIST_HEAD(&pdev->vpci->handlers);
>      spin_lock_init(&pdev->vpci->lock);
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    pdev->vpci->guest_sbdf.sbdf = ~0;
> +    rc = add_virtual_device(pdev);
> +    if ( rc )
> +        goto out;
> +#endif
> +
>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
>      {
>          rc = __start_vpci_array[i](pdev);
> @@ -108,11 +173,15 @@ int vpci_assign_device(struct pci_dev *pdev)
>              break;
>      }
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> + out:
> +#endif

That's ugly, can you use the __maybe_unused attribute with a label?
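
A standalone sketch of the label attribute, assuming GCC or Clang.
MAYBE_UNUSED stands in for __maybe_unused, and HAVE_GUEST_SUPPORT and
assign() are hypothetical names used only for this example:

```c
#include <assert.h>

/* Hypothetical stand-in for __maybe_unused. */
#define MAYBE_UNUSED __attribute__((__unused__))

/*
 * The only jump to "out" sits inside an #ifdef.  Marking the label as
 * possibly unused avoids wrapping the label itself in a second #ifdef.
 */
static int assign(int early_rc, int init_rc)
{
    int rc = 0;

#ifdef HAVE_GUEST_SUPPORT
    rc = early_rc;
    if ( rc )
        goto out;
#else
    (void)early_rc;
#endif

    rc = init_rc;

 out: MAYBE_UNUSED;
    return rc;
}
```

The semicolon after the attribute is an empty statement, which keeps
the label valid in front of whatever follows.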

>      if ( rc )
>          vpci_deassign_device(pdev);
>  
>      return rc;
>  }
> +

Spurious change.

Thanks, Roger.



* Re: [PATCH v9 13/16] xen/arm: translate virtual PCI bus topology for guests
  2023-08-29 23:19 ` [PATCH v9 13/16] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
@ 2023-09-22  8:32   ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-22  8:32 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Stefano Stabellini, Julien Grall, Bertrand Marquis

On Tue, Aug 29, 2023 at 11:19:46PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> There are three  originators for the PCI configuration space access:
> 1. The domain that owns physical host bridge: MMIO handlers are
> there so we can update vPCI register handlers with the values
> written by the hardware domain, e.g. physical view of the registers
> vs guest's view on the configuration space.
> 2. Guest access to the passed through PCI devices: we need to properly
> map virtual bus topology to the physical one, e.g. pass the configuration
> space access to the corresponding physical devices.
> 3. Emulated host PCI bridge access. It doesn't exist in the physical
> topology, e.g. it can't be mapped to some physical host bridge.
> So, all access to the host bridge itself needs to be trapped and
> emulated.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v9:
> - Commend about required lock replaced with ASSERT()
> - Style fixes
> - call to vpci_translate_virtual_device folded into vpci_sbdf_from_gpa
> Since v8:
> - locks moved out of vpci_translate_virtual_device()
> Since v6:
> - add pcidevs locking to vpci_translate_virtual_device
> - update wrt to the new locking scheme
> Since v5:
> - add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
>   case to simplify ifdefery
> - add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
> - reset output register on failed virtual SBDF translation
> Since v4:
> - indentation fixes
> - constify struct domain
> - updated commit message
> - updates to the new locking scheme (pdev->vpci_lock)
> Since v3:
> - revisit locking
> - move code to vpci.c
> Since v2:
>  - pass struct domain instead of struct vcpu
>  - constify arguments where possible
>  - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/arch/arm/vpci.c     | 51 ++++++++++++++++++++++++++++++++---------
>  xen/drivers/vpci/vpci.c | 25 +++++++++++++++++++-
>  xen/include/xen/vpci.h  | 10 ++++++++
>  3 files changed, 74 insertions(+), 12 deletions(-)
> 
> diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
> index 3bc4bb5508..58e2a20135 100644
> --- a/xen/arch/arm/vpci.c
> +++ b/xen/arch/arm/vpci.c
> @@ -7,31 +7,55 @@
>  
>  #include <asm/mmio.h>
>  
> -static pci_sbdf_t vpci_sbdf_from_gpa(const struct pci_host_bridge *bridge,
> -                                     paddr_t gpa)
> +static bool_t vpci_sbdf_from_gpa(struct domain *d,

Plain bool please.

> +                                 const struct pci_host_bridge *bridge,
> +                                 paddr_t gpa, pci_sbdf_t *sbdf)
>  {
> -    pci_sbdf_t sbdf;
> +    ASSERT(sbdf);
>  
>      if ( bridge )
>      {
> -        sbdf.sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
> -        sbdf.seg = bridge->segment;
> -        sbdf.bus += bridge->cfg->busn_start;
> +        sbdf->sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
> +        sbdf->seg = bridge->segment;
> +        sbdf->bus += bridge->cfg->busn_start;
>      }
>      else
> -        sbdf.sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
> -
> -    return sbdf;
> +    {
> +        bool translated;
> +
> +        /*
> +         * For the passed through devices we need to map their virtual SBDF
> +         * to the physical PCI device being passed through.
> +         */
> +        sbdf->sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
> +        read_lock(&d->pci_lock);
> +        translated = vpci_translate_virtual_device(d, sbdf);
> +        read_unlock(&d->pci_lock);
> +
> +        if ( !translated )
> +        {
> +            return false;
> +        }
> +    }
> +    return true;

You could define translated = true at the top level of the function
and then set it to `translated = vpci_translate_virtual_device(d,
sbdf);` and have a single return in the function that returns
`translated`:

bool translated = true;

if ( bridge )
{
    ...
}
else
{
    ...
    translated = vpci_translate_virtual_device(d, sbdf);
    ...
}
return translated;

Thanks, Roger.



* Re: [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-08-29 23:19 ` [PATCH v9 15/16] xen/arm: vpci: check guest range Volodymyr Babchuk
@ 2023-09-22  8:44   ` Roger Pau Monné
  2023-09-25 21:49     ` Stewart Hildebrand
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-22  8:44 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Stewart Hildebrand

On Tue, Aug 29, 2023 at 11:19:47PM +0000, Volodymyr Babchuk wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> 
> Skip mapping the BAR if it is not in a valid range.
> 
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
>  xen/drivers/vpci/header.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 1d243eeaf9..dbabdcbed2 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>               bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
>              continue;
>  
> +#ifdef CONFIG_ARM
> +        if ( !is_hardware_domain(pdev->domain) )
> +        {
> +            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
> +                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
> +                continue;
> +        }
> +#endif

Hm, I think this should be in a hook similar to pci_check_bar() that
can be implemented per-arch.

IIRC at least on x86 we allow the guest to place the BARs wherever it
wants, would such placement cause issues to the hypervisor on Arm?
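
For illustration, the body of such a hook could be a plain window check
like the standalone sketch below.  guest_bar_in_window() is a
hypothetical name, and the WINDOW_* values are placeholders rather than
the real GUEST_VPCI_MEM_* definitions:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define PFN_DOWN(x) ((unsigned long)((x) >> PAGE_SHIFT))

/* Placeholder window, for illustration only. */
#define WINDOW_ADDR 0x23000000ULL
#define WINDOW_SIZE 0x10000000ULL

/*
 * Return true if the inclusive frame range [start_pfn, end_pfn] lies
 * entirely inside the guest MMIO window.
 */
static bool guest_bar_in_window(unsigned long start_pfn,
                                unsigned long end_pfn)
{
    return start_pfn <= end_pfn &&
           start_pfn >= PFN_DOWN(WINDOW_ADDR) &&
           end_pfn < PFN_DOWN(WINDOW_ADDR + WINDOW_SIZE);
}
```

An arch without such a restriction would simply return true.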

Thanks, Roger.



* Re: [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-09-22  8:44   ` Roger Pau Monné
@ 2023-09-25 21:49     ` Stewart Hildebrand
  2023-09-26  8:07       ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-25 21:49 UTC (permalink / raw)
  To: Roger Pau Monné, Volodymyr Babchuk; +Cc: xen-devel

On 9/22/23 04:44, Roger Pau Monné wrote:
> On Tue, Aug 29, 2023 at 11:19:47PM +0000, Volodymyr Babchuk wrote:
>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>
>> Skip mapping the BAR if it is not in a valid range.
>>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> ---
>>  xen/drivers/vpci/header.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index 1d243eeaf9..dbabdcbed2 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>               bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
>>              continue;
>>
>> +#ifdef CONFIG_ARM
>> +        if ( !is_hardware_domain(pdev->domain) )
>> +        {
>> +            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
>> +                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
>> +                continue;
>> +        }
>> +#endif
> 
> Hm, I think this should be in a hook similar to pci_check_bar() that
> can be implemented per-arch.
> 
> IIRC at least on x86 we allow the guest to place the BARs whenever it
> wants, would such placement cause issues to the hypervisor on Arm?

Hm. I wrote this patch in a hurry to make v9 of this series work on ARM. In my haste I also forgot about the prefetchable range starting at GUEST_VPCI_PREFETCH_MEM_ADDR, but that won't matter as we can probably throw this patch out.

Now that I've had some more time to investigate, I believe the check in this patch is more or less redundant with the existing check in map_range() added in baa6ea700386 ("vpci: add permission checks to map_range()").

The issue is that during initialization bar->guest_addr is zeroed, and this initial value of bar->guest_addr will fail the permissions check in map_range() and crash the domain. When the guest writes a new valid BAR, the old invalid address remains in the rangeset to be mapped. If we simply remove the old invalid BAR from the rangeset, that seems to fix the issue. So something like this:

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index d4629a14f26b..732be26f0d2d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -638,6 +638,16 @@ static void cf_check guest_bar_write(const struct pci_dev *pdev,
         return;
     }

+    if ( (val != 0xfffffff0U) &&
+         (bar->guest_addr != (0xfffffff0ULL & ~(bar->size - 1))) &&
+         (bar->guest_addr != (0xfffffffffffffff0ULL & ~(bar->size - 1))) )
+    {
+        if ( rangeset_remove_range(bar->mem, PFN_DOWN(bar->guest_addr),
+                                   PFN_DOWN(bar->guest_addr + bar->size) - 1) )
+            gprintk(XENLOG_WARNING, "%pp failed to remove old BAR range\n",
+                    &pdev->sbdf);
+    }
+
     bar->guest_addr = guest_addr;
 }
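
The arithmetic in the hunk above can be sketched in isolation. This is a
standalone illustration, not the vPCI code: bar_pfn_range() and
is_sizing_readback() are hypothetical helper names, and a 4K page size
is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumed 4K pages */
#define PFN_DOWN(x) ((uint64_t)(x) >> PAGE_SHIFT)

struct pfn_range { uint64_t first, last; };

/* Inclusive page-frame range covered by a BAR of `size` bytes at guest_addr. */
static struct pfn_range bar_pfn_range(uint64_t guest_addr, uint64_t size)
{
    struct pfn_range r = {
        .first = PFN_DOWN(guest_addr),
        .last = PFN_DOWN(guest_addr + size) - 1,
    };
    return r;
}

/* True if guest_addr is just the all-ones sizing pattern masked by the size. */
static bool is_sizing_readback(uint64_t guest_addr, uint64_t size)
{
    uint64_t mask = ~(size - 1);

    return guest_addr == (0xfffffff0ULL & mask) ||
           guest_addr == (0xfffffffffffffff0ULL & mask);
}
```

With a 64K BAR, a zeroed guest_addr gives the range { 0-f } and
0x23000000 gives { 23000-2300f }.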



* Re: [PATCH v9 01/16] pci: introduce per-domain PCI rwlock
  2023-09-19 14:09   ` Roger Pau Monné
@ 2023-09-25 22:44     ` Volodymyr Babchuk
  0 siblings, 0 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-09-25 22:44 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Stewart Hildebrand, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu,
	Paul Durrant, Kevin Tian


Hello Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>> Add per-domain d->pci_lock that protects access to
>> d->pdev_list. Purpose of this lock is to give guarantees to VPCI code
>> that underlying pdev will not disappear under feet. This is a rw-lock,
>> but this patch adds only write_lock()s. There will be read_lock()
>> users in the next patches.
>> 
>> This lock should be taken in write mode every time d->pdev_list is
>> altered. This covers both accesses to d->pdev_list and accesses to
>> pdev->domain_list fields.
>
> Why do you mention pdev->domain_list here?  I don't think the lock
> covers accesses to pdev->domain_list, unless that domain_list field
> happens to be part of the linked list in d->pdev_list.  I find it kind
> of odd to mention here.

You are correct. I was referring to a very specific case in the
reassign_device() IOMMU functions. It seemed important to me when I
wrote this, but there is no need to mention pdev->domain_list explicitly.

>
>> All write accesses also should be protected
>> by pcidevs_lock() as well. Idea is that any user that wants read
>> access to the list or to the devices stored in the list should use
>> either this new d->pci_lock or old pcidevs_lock(). Usage of any of
>> this two locks will ensure only that pdev of interest will not
>> disappear from under feet and that the pdev still will be assigned to
>> the same domain. Of course, any new users should use pcidevs_lock()
>> when it is appropriate (e.g. when accessing any other state that is
>> protected by the said lock). In case both the newly introduced
>> per-domain rwlock and the pcidevs lock is taken, the later must be
>> acquired first.
>> 
>> Any write access to pdev->domain_list should be protected by both
>> pcidevs_lock() and d->pci_lock in the write mode.
>> 
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> 
>> ---
>> 
>> Changes in v9:
>>  - returned back "pdev->domain = target;" in AMD IOMMU code
>>  - used "source" instead of pdev->domain in IOMMU functions
>>  - added comment about lock ordering in the commit message
>>  - reduced locked regions
>>  - minor changes non-functional changes in various places
>> 
>> Changes in v8:
>>  - New patch
>> 
>> Changes in v8 vs RFC:
>>  - Removed all read_locks after discussion with Roger in #xendevel
>>  - pci_release_devices() now returns the first error code
>>  - extended commit message
>>  - added missing lock in pci_remove_device()
>>  - extended locked region in pci_add_device() to protect list_del() calls
>> ---
>>  xen/common/domain.c                         |  1 +
>>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  9 ++-
>>  xen/drivers/passthrough/pci.c               | 71 +++++++++++++++++----
>>  xen/drivers/passthrough/vtd/iommu.c         |  9 ++-
>>  xen/include/xen/sched.h                     |  1 +
>>  5 files changed, 78 insertions(+), 13 deletions(-)
>> 
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index 304aa04fa6..9b04a20160 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -651,6 +651,7 @@ struct domain *domain_create(domid_t domid,
>>  
>>  #ifdef CONFIG_HAS_PCI
>>      INIT_LIST_HEAD(&d->pdev_list);
>> +    rwlock_init(&d->pci_lock);
>>  #endif
>>  
>>      /* All error paths can depend on the above setup. */
>> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> index bea70db4b7..d219bd9453 100644
>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -476,7 +476,14 @@ static int cf_check reassign_device(
>>  
>>      if ( devfn == pdev->devfn && pdev->domain != target )
>>      {
>> -        list_move(&pdev->domain_list, &target->pdev_list);
>> +        write_lock(&source->pci_lock);
>> +        list_del(&pdev->domain_list);
>> +        write_unlock(&source->pci_lock);
>> +
>> +        write_lock(&target->pci_lock);
>> +        list_add(&pdev->domain_list, &target->pdev_list);
>> +        write_unlock(&target->pci_lock);
>> +
>>          pdev->domain = target;
>
> While I don't think this is strictly an issue right now, it would be
> better to set pdev->domain before the device is added to domain_list.
> A pattern like:
>
> read_lock(d->pci_lock);
> for_each_pdev(d, pdev)
>     foo(pdev->domain);
> read_unlock(d->pci_lock);
>
> Wouldn't work currently if the pdev is added to domain_list before the
> pdev->domain field is updated to reflect the new owner.

Agreed. I moved `pdev->domain = target` so it sits between the list_del()
and list_add() calls.
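
The resulting ordering can be sketched with toy types. This is only an
illustration under simplifying assumptions: a singly linked list stands
in for Xen's list_head, the locks appear as comments, and all helper
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct domain;

struct pdev {
    struct domain *domain; /* current owner */
    struct pdev *next;     /* toy stand-in for pdev->domain_list */
};

struct domain {
    struct pdev *pdev_list;
};

static void toy_list_add(struct domain *d, struct pdev *p)
{
    p->next = d->pdev_list;
    d->pdev_list = p;
}

static void toy_list_del(struct domain *d, struct pdev *p)
{
    struct pdev **pp = &d->pdev_list;

    while (*pp && *pp != p)
        pp = &(*pp)->next;
    if (*pp) {
        *pp = p->next;
        p->next = NULL;
    }
}

static void toy_reassign(struct domain *source, struct domain *target,
                         struct pdev *p)
{
    /* write_lock(&source->pci_lock) would be held around this call */
    toy_list_del(source, p);

    /* Owner is updated while the pdev is on no list at all. */
    p->domain = target;

    /* write_lock(&target->pci_lock) would be held around this call */
    toy_list_add(target, p);
}

/* Number of devices on d's list whose owner field does not point at d. */
static int count_mismatched(const struct domain *d)
{
    int n = 0;

    for (const struct pdev *p = d->pdev_list; p; p = p->next)
        if (p->domain != d)
            n++;
    return n;
}
```

A reader iterating the target list under its read lock then never sees a
pdev whose domain field still points at the old owner.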


-- 
WBR, Volodymyr


* Re: [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure
  2023-09-19 15:39   ` Roger Pau Monné
                       ` (2 preceding siblings ...)
  2023-09-20 19:16     ` Stewart Hildebrand
@ 2023-09-25 23:03     ` Volodymyr Babchuk
  3 siblings, 0 replies; 60+ messages in thread
From: Volodymyr Babchuk @ 2023-09-25 23:03 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Stewart Hildebrand, Oleksandr Andrushchenko,
	Jan Beulich, Andrew Cooper, Wei Liu, Jun Nakajima, Kevin Tian,
	Paul Durrant

Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> Use a previously introduced per-domain read/write lock to check
>> whether vpci is present, so we are sure there are no accesses to the
>> contents of the vpci struct if not. This lock can be used (and in a
>> few cases is used right away) so that vpci removal can be performed
>> while holding the lock in write mode. Previously such removal could
>> race with vpci_read for example.
>> 
>> When taking both d->pci_lock and pdev->vpci->lock they are should be
>
> When taking both d->pci_lock and pdev->vpci->lock the order should be
> ...
>

>> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
>> possible deadlock situations.
>>

Would it be better to write it like this:

"When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
possible deadlock situations."

?

I am asking because your suggestion leads to "When taking both
d->pci_lock and pdev->vpci->lock the order should be taken in this exact
order: ... "
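
As a side illustration of the ordering rule itself (d->pci_lock first,
then pdev->vpci->lock), here is a small rank-based checker. It is a toy
model with hypothetical names, not Xen's locking code:

```c
#include <assert.h>
#include <stdbool.h>

/* A higher rank must always be taken after a lower one. */
enum lock_rank {
    RANK_NONE = 0,
    RANK_PCI_LOCK = 1,  /* stands for d->pci_lock */
    RANK_VPCI_LOCK = 2, /* stands for pdev->vpci->lock */
};

struct lock_tracker {
    enum lock_rank highest_held;
};

/* Returns true and records the lock if taking `rank` respects the order. */
static bool try_acquire(struct lock_tracker *t, enum lock_rank rank)
{
    if (rank <= t->highest_held)
        return false; /* would invert the documented order */
    t->highest_held = rank;
    return true;
}

static void release_all(struct lock_tracker *t)
{
    t->highest_held = RANK_NONE;
}
```

Taking RANK_PCI_LOCK and then RANK_VPCI_LOCK succeeds; the reverse order
is rejected, which is the inversion the commit message warns about.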

[...]

As for other comments, I am taking into account your, Jan's and Stewart's
comments and reworking this patch.

-- 
WBR, Volodymyr


* Re: [PATCH v9 16/16] xen/arm: vpci: permit access to guest vpci space
  2023-08-29 23:19 ` [PATCH v9 16/16] xen/arm: vpci: permit access to guest vpci space Volodymyr Babchuk
@ 2023-09-26  0:12   ` Stewart Hildebrand
  0 siblings, 0 replies; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-26  0:12 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Andrew Cooper, George Dunlap, Jan Beulich, Wei Liu

On 8/29/23 19:19, Volodymyr Babchuk wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> 
> Move iomem_caps initialization earlier (before arch_domain_create()).
> 
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> This is sort of a follow-up to:
> 
>   baa6ea700386 ("vpci: add permission checks to map_range()")
> 
> I don't believe we need a fixes tag since this depends on the vPCI p2m BAR
> patches.
> ---
>  xen/arch/arm/vpci.c | 6 ++++++
>  xen/common/domain.c | 4 +++-
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
> index 01b50d435e..fb5361276f 100644
> --- a/xen/arch/arm/vpci.c
> +++ b/xen/arch/arm/vpci.c
> @@ -2,6 +2,7 @@
>  /*
>   * xen/arch/arm/vpci.c
>   */
> +#include <xen/iocap.h>
>  #include <xen/sched.h>
>  #include <xen/vpci.h>
> 
> @@ -119,8 +120,13 @@ int domain_vpci_init(struct domain *d)
>              return ret;
>      }
>      else
> +    {
>          register_mmio_handler(d, &vpci_mmio_handler,
>                                GUEST_VPCI_ECAM_BASE, GUEST_VPCI_ECAM_SIZE, NULL);
> +        iomem_permit_access(d, paddr_to_pfn(GUEST_VPCI_MEM_ADDR),
> +                            paddr_to_pfn(PAGE_ALIGN(GUEST_VPCI_MEM_ADDR +
> +                                                    GUEST_VPCI_MEM_SIZE - 1)));

We should also permit access to the prefetchable range starting at GUEST_VPCI_PREFETCH_MEM_ADDR.
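
A rough sketch of the page-frame ranges involved for both windows. The
constant values below, and GUEST_VPCI_PREFETCH_MEM_SIZE in particular,
are assumptions for illustration and should be checked against the
public Arm headers; window_pfns() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumed 4K pages */

/* Assumed values, for illustration only. */
#define GUEST_VPCI_MEM_ADDR          0x23000000ULL
#define GUEST_VPCI_MEM_SIZE          0x10000000ULL
#define GUEST_VPCI_PREFETCH_MEM_ADDR 0x900000000ULL
#define GUEST_VPCI_PREFETCH_MEM_SIZE 0x2000000000ULL

struct pfn_range { uint64_t first, last; };

/* Inclusive first/last page frame of a window of `size` bytes at `addr`. */
static struct pfn_range window_pfns(uint64_t addr, uint64_t size)
{
    struct pfn_range r = {
        .first = addr >> PAGE_SHIFT,
        .last = (addr + size - 1) >> PAGE_SHIFT,
    };
    return r;
}
```

Each window would then need its own iomem_permit_access() call with the
corresponding first and last frame.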

> +    }
> 
>      return 0;
>  }
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 9b04a20160..11a48ba7e4 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -695,6 +695,9 @@ struct domain *domain_create(domid_t domid,
>          radix_tree_init(&d->pirq_tree);
>      }
> 
> +    if ( !is_idle_domain(d) )
> +        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
> +
>      if ( (err = arch_domain_create(d, config, flags)) != 0 )
>          goto fail;
>      init_status |= INIT_arch;
> @@ -704,7 +707,6 @@ struct domain *domain_create(domid_t domid,
>          watchdog_domain_init(d);
>          init_status |= INIT_watchdog;
> 
> -        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
>          d->irq_caps   = rangeset_new(d, "Interrupts", 0);
>          if ( !d->iomem_caps || !d->irq_caps )
>              goto fail;
> --
> 2.41.0



* Re: [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-09-25 21:49     ` Stewart Hildebrand
@ 2023-09-26  8:07       ` Roger Pau Monné
  2023-09-26 15:27         ` Stewart Hildebrand
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-26  8:07 UTC (permalink / raw)
  To: Stewart Hildebrand; +Cc: Volodymyr Babchuk, xen-devel

On Mon, Sep 25, 2023 at 05:49:00PM -0400, Stewart Hildebrand wrote:
> On 9/22/23 04:44, Roger Pau Monné wrote:
> > On Tue, Aug 29, 2023 at 11:19:47PM +0000, Volodymyr Babchuk wrote:
> >> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>
> >> Skip mapping the BAR if it is not in a valid range.
> >>
> >> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >> ---
> >>  xen/drivers/vpci/header.c | 9 +++++++++
> >>  1 file changed, 9 insertions(+)
> >>
> >> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> >> index 1d243eeaf9..dbabdcbed2 100644
> >> --- a/xen/drivers/vpci/header.c
> >> +++ b/xen/drivers/vpci/header.c
> >> @@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>               bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
> >>              continue;
> >>
> >> +#ifdef CONFIG_ARM
> >> +        if ( !is_hardware_domain(pdev->domain) )
> >> +        {
> >> +            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
> >> +                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
> >> +                continue;
> >> +        }
> >> +#endif
> > 
> > Hm, I think this should be in a hook similar to pci_check_bar() that
> > can be implemented per-arch.
> > 
> > IIRC at least on x86 we allow the guest to place the BARs whenever it
> > wants, would such placement cause issues to the hypervisor on Arm?
> 
> Hm. I wrote this patch in a hurry to make v9 of this series work on ARM. In my haste I also forgot about the prefetchable range starting at GUEST_VPCI_PREFETCH_MEM_ADDR, but that won't matter as we can probably throw this patch out.
> 
> Now that I've had some more time to investigate, I believe the check in this patch is more or less redundant to the existing check in map_range() added in baa6ea700386 ("vpci: add permission checks to map_range()").
> 
> The issue is that during initialization bar->guest_addr is zeroed, and this initial value of bar->guest_addr will fail the permissions check in map_range() and crash the domain. When the guest writes a new valid BAR, the old invalid address remains in the rangeset to be mapped. If we simply remove the old invalid BAR from the rangeset, that seems to fix the issue. So something like this:

It does seem to me we are missing a proper cleanup of the rangeset
contents in some paths then.  In the above paragraph you mention "the
old invalid address remains in the rangeset to be mapped", how does it
get in there in the first place, and why is the rangeset not emptied
if the mapping failed?

Thanks, Roger.



* Re: [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-09-26  8:07       ` Roger Pau Monné
@ 2023-09-26 15:27         ` Stewart Hildebrand
  2023-09-26 15:48           ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-26 15:27 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Volodymyr Babchuk, xen-devel

On 9/26/23 04:07, Roger Pau Monné wrote:
> On Mon, Sep 25, 2023 at 05:49:00PM -0400, Stewart Hildebrand wrote:
>> On 9/22/23 04:44, Roger Pau Monné wrote:
>>> On Tue, Aug 29, 2023 at 11:19:47PM +0000, Volodymyr Babchuk wrote:
>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>
>>>> Skip mapping the BAR if it is not in a valid range.
>>>>
>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>> ---
>>>>  xen/drivers/vpci/header.c | 9 +++++++++
>>>>  1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>>>> index 1d243eeaf9..dbabdcbed2 100644
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>               bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
>>>>              continue;
>>>>
>>>> +#ifdef CONFIG_ARM
>>>> +        if ( !is_hardware_domain(pdev->domain) )
>>>> +        {
>>>> +            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
>>>> +                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
>>>> +                continue;
>>>> +        }
>>>> +#endif
>>>
>>> Hm, I think this should be in a hook similar to pci_check_bar() that
>>> can be implemented per-arch.
>>>
>>> IIRC at least on x86 we allow the guest to place the BARs whenever it
>>> wants, would such placement cause issues to the hypervisor on Arm?
>>
>> Hm. I wrote this patch in a hurry to make v9 of this series work on ARM. In my haste I also forgot about the prefetchable range starting at GUEST_VPCI_PREFETCH_MEM_ADDR, but that won't matter as we can probably throw this patch out.
>>
>> Now that I've had some more time to investigate, I believe the check in this patch is more or less redundant to the existing check in map_range() added in baa6ea700386 ("vpci: add permission checks to map_range()").
>>
>> The issue is that during initialization bar->guest_addr is zeroed, and this initial value of bar->guest_addr will fail the permissions check in map_range() and crash the domain. When the guest writes a new valid BAR, the old invalid address remains in the rangeset to be mapped. If we simply remove the old invalid BAR from the rangeset, that seems to fix the issue. So something like this:
> 
> It does seem to me we are missing a proper cleanup of the rangeset
> contents in some paths then.  In the above paragraph you mention "the
> old invalid address remains in the rangeset to be mapped", how does it
> get in there in the first place, and why is the rangeset not emptied
> if the mapping failed?

Back in ("vpci/header: handle p2m range sets per BAR") I added a v->domain == pdev->domain check near the top of vpci_process_pending() as you appropriately suggested.

+    if ( v->domain != pdev->domain )
+    {
+        read_unlock(&v->domain->pci_lock);
+        return false;
+    }

I have also reverted this patch ("xen/arm: vpci: check guest range").

The sequence of events leading to the old value remaining in the rangeset is:

# xl pci-assignable-add 01:00.0
drivers/vpci/vpci.c:vpci_deassign_device()
    deassign 0000:01:00.0 from d0
# grep pci domu.cfg
pci = [ "01:00.0" ]
# xl create domu.cfg
drivers/vpci/vpci.c:vpci_deassign_device()
    deassign 0000:01:00.0 from d[IO]
drivers/vpci/vpci.c:vpci_assign_device()
    assign 0000:01:00.0 to d1
    bar->guest_addr is initialized to zero because of the line: pdev->vpci = xzalloc(struct vpci);
drivers/vpci/header.c:init_bars()
drivers/vpci/header.c:modify_bars()
    BAR0: start 0xe0000, end 0xe000f, start_guest 0x0, end_guest 0xf
    The range { 0-f } is added to the BAR0 rangeset for d1
drivers/vpci/header.c:defer_map()
    raise_softirq(SCHEDULE_SOFTIRQ);
drivers/vpci/header.c:vpci_process_pending()
    vpci_process_pending() returns because v->domain != pdev->domain (i.e. d0 != d1)
    BAR0 rangeset still contains { 0-f }
xl create finishes

Then during domU boot, guest initializes BAR0:

drivers/vpci/header.c:guest_bar_write()
    bar->guest_addr = 0x23000000
drivers/vpci/header.c:modify_bars()
    BAR0: start 0xe0000, end 0xe000f, start_guest 0x23000, end_guest 0x2300f
    The d1 BAR0 rangeset now contains both { 0-f } and { 23000-2300f }
drivers/vpci/header.c:defer_map()
    raise_softirq(SCHEDULE_SOFTIRQ);
drivers/vpci/header.c:vpci_process_pending()
    rangeset_consume_ranges(bar->mem, map_range, &data);
drivers/vpci/header.c:map_range()
    The range { 0-f } fails the permissions check and we crash the domU (back in vpci_process_pending)
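
The trace above can be replayed with a toy model. This is a sketch under
simplifying assumptions: the toy rangeset is a plain array that does not
merge ranges, permitted() stands in for the map_range() permission check
with an assumed guest window, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 4

struct range { uint64_t s, e; };

struct toy_rangeset {
    struct range r[MAX_RANGES];
    size_t n;
};

static void toy_add(struct toy_rangeset *rs, uint64_t s, uint64_t e)
{
    if (rs->n < MAX_RANGES) {
        rs->r[rs->n].s = s;
        rs->r[rs->n].e = e;
        rs->n++;
    }
}

/* Stand-in for the permission check: only the assumed guest window passes. */
static bool permitted(uint64_t s, uint64_t e)
{
    return s >= 0x23000 && e <= 0x32fff;
}

/*
 * Stand-in for vpci_process_pending(): return early when the vCPU's domain
 * does not own the device, otherwise consume every queued range.
 * Returns false if any range failed the check (the domain would be crashed).
 */
static bool toy_process_pending(struct toy_rangeset *rs, int vcpu_domid,
                                int owner_domid)
{
    bool ok = true;

    if (vcpu_domid != owner_domid)
        return true; /* nothing consumed, ranges stay queued */

    for (size_t i = 0; i < rs->n; i++)
        if (!permitted(rs->r[i].s, rs->r[i].e))
            ok = false;
    rs->n = 0;
    return ok;
}
```

Queuing { 0-f } during assignment, processing from d0, then queuing
{ 23000-2300f } and processing from d1 reproduces the failure: the stale
first range is still present and fails the check.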



* Re: [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-09-26 15:27         ` Stewart Hildebrand
@ 2023-09-26 15:48           ` Roger Pau Monné
  2023-09-27 18:03             ` Stewart Hildebrand
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-26 15:48 UTC (permalink / raw)
  To: Stewart Hildebrand; +Cc: Volodymyr Babchuk, xen-devel

On Tue, Sep 26, 2023 at 11:27:48AM -0400, Stewart Hildebrand wrote:
> On 9/26/23 04:07, Roger Pau Monné wrote:
> > On Mon, Sep 25, 2023 at 05:49:00PM -0400, Stewart Hildebrand wrote:
> >> On 9/22/23 04:44, Roger Pau Monné wrote:
> >>> On Tue, Aug 29, 2023 at 11:19:47PM +0000, Volodymyr Babchuk wrote:
> >>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>>>
> >>>> Skip mapping the BAR if it is not in a valid range.
> >>>>
> >>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>>> ---
> >>>>  xen/drivers/vpci/header.c | 9 +++++++++
> >>>>  1 file changed, 9 insertions(+)
> >>>>
> >>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> >>>> index 1d243eeaf9..dbabdcbed2 100644
> >>>> --- a/xen/drivers/vpci/header.c
> >>>> +++ b/xen/drivers/vpci/header.c
> >>>> @@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>               bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
> >>>>              continue;
> >>>>
> >>>> +#ifdef CONFIG_ARM
> >>>> +        if ( !is_hardware_domain(pdev->domain) )
> >>>> +        {
> >>>> +            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
> >>>> +                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
> >>>> +                continue;
> >>>> +        }
> >>>> +#endif
> >>>
> >>> Hm, I think this should be in a hook similar to pci_check_bar() that
> >>> can be implemented per-arch.
> >>>
> >>> IIRC at least on x86 we allow the guest to place the BARs whenever it
> >>> wants, would such placement cause issues to the hypervisor on Arm?
> >>
> >> Hm. I wrote this patch in a hurry to make v9 of this series work on ARM. In my haste I also forgot about the prefetchable range starting at GUEST_VPCI_PREFETCH_MEM_ADDR, but that won't matter as we can probably throw this patch out.
> >>
> >> Now that I've had some more time to investigate, I believe the check in this patch is more or less redundant to the existing check in map_range() added in baa6ea700386 ("vpci: add permission checks to map_range()").
> >>
> >> The issue is that during initialization bar->guest_addr is zeroed, and this initial value of bar->guest_addr will fail the permissions check in map_range() and crash the domain. When the guest writes a new valid BAR, the old invalid address remains in the rangeset to be mapped. If we simply remove the old invalid BAR from the rangeset, that seems to fix the issue. So something like this:
> > 
> > It does seem to me we are missing a proper cleanup of the rangeset
> > contents in some paths then.  In the above paragraph you mention "the
> > old invalid address remains in the rangeset to be mapped", how does it
> > get in there in the first place, and why is the rangeset not emptied
> > if the mapping failed?
> 
> Back in ("vpci/header: handle p2m range sets per BAR") I added a v->domain == pdev->domain check near the top of vpci_process_pending() as you appropriately suggested.
> 
> +    if ( v->domain != pdev->domain )
> +    {
> +        read_unlock(&v->domain->pci_lock);
> +        return false;
> +    }
> 
> I have also reverted this patch ("xen/arm: vpci: check guest range").
> 
> The sequence of events leading to the old value remaining in the rangeset are:
> 
> # xl pci-assignable-add 01:00.0
> drivers/vpci/vpci.c:vpci_deassign_device()
>     deassign 0000:01:00.0 from d0
> # grep pci domu.cfg
> pci = [ "01:00.0" ]
> # xl create domu.cfg
> drivers/vpci/vpci.c:vpci_deassign_device()
>     deassign 0000:01:00.0 from d[IO]
> drivers/vpci/vpci.c:vpci_assign_device()
>     assign 0000:01:00.0 to d1
>     bar->guest_addr is initialized to zero because of the line: pdev->vpci = xzalloc(struct vpci);
> drivers/vpci/header.c:init_bars()
> drivers/vpci/header.c:modify_bars()

I think I've commented on this in another patch, but why is the device
added with memory decoding enabled?  I would expect the FLR performed
before assigning would leave the device with memory decoding disabled?

Otherwise we might have to force init_bars() to assume memory decoding
to be disabled, IOW: memory decoding would be set as disabled in the
guest cmd view, and leave the physical device cmd as-is.  We might
also consider switching memory decoding off unconditionally for domUs
on the physical device.

>     BAR0: start 0xe0000, end 0xe000f, start_guest 0x0, end_guest 0xf
>     The range { 0-f } is added to the BAR0 rangeset for d1
> drivers/vpci/header.c:defer_map()
>     raise_softirq(SCHEDULE_SOFTIRQ);
> drivers/vpci/header.c:vpci_process_pending()
>     vpci_process_pending() returns because v->domain != pdev->domain (i.e. d0 != d1)

I don't think we can easily handle BAR mappings during device
assignment with vPCI, because that would require adding some kind of
continuation support which we don't have ATM.  Might be better to just
switch memory decoding unconditionally off at init_bars() for domUs as
that's the easier solution right now that would allow us to move
forward.

Thanks, Roger.



* Re: [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-09-26 15:48           ` Roger Pau Monné
@ 2023-09-27 18:03             ` Stewart Hildebrand
  2023-09-28  8:28               ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-27 18:03 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Volodymyr Babchuk, xen-devel

On 9/26/23 11:48, Roger Pau Monné wrote:
> On Tue, Sep 26, 2023 at 11:27:48AM -0400, Stewart Hildebrand wrote:
>> On 9/26/23 04:07, Roger Pau Monné wrote:
>>> On Mon, Sep 25, 2023 at 05:49:00PM -0400, Stewart Hildebrand wrote:
>>>> On 9/22/23 04:44, Roger Pau Monné wrote:
>>>>> On Tue, Aug 29, 2023 at 11:19:47PM +0000, Volodymyr Babchuk wrote:
>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>
>>>>>> Skip mapping the BAR if it is not in a valid range.
>>>>>>
>>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>> ---
>>>>>>  xen/drivers/vpci/header.c | 9 +++++++++
>>>>>>  1 file changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>>>>>> index 1d243eeaf9..dbabdcbed2 100644
>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>> @@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>               bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
>>>>>>              continue;
>>>>>>
>>>>>> +#ifdef CONFIG_ARM
>>>>>> +        if ( !is_hardware_domain(pdev->domain) )
>>>>>> +        {
>>>>>> +            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
>>>>>> +                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
>>>>>> +                continue;
>>>>>> +        }
>>>>>> +#endif
>>>>>
>>>>> Hm, I think this should be in a hook similar to pci_check_bar() that
>>>>> can be implemented per-arch.
>>>>>
>>>>> IIRC at least on x86 we allow the guest to place the BARs whenever it
>>>>> wants, would such placement cause issues to the hypervisor on Arm?
>>>>
>>>> Hm. I wrote this patch in a hurry to make v9 of this series work on ARM. In my haste I also forgot about the prefetchable range starting at GUEST_VPCI_PREFETCH_MEM_ADDR, but that won't matter as we can probably throw this patch out.
>>>>
>>>> Now that I've had some more time to investigate, I believe the check in this patch is more or less redundant to the existing check in map_range() added in baa6ea700386 ("vpci: add permission checks to map_range()").
>>>>
>>>> The issue is that during initialization bar->guest_addr is zeroed, and this initial value of bar->guest_addr will fail the permissions check in map_range() and crash the domain. When the guest writes a new valid BAR, the old invalid address remains in the rangeset to be mapped. If we simply remove the old invalid BAR from the rangeset, that seems to fix the issue. So something like this:
>>>
>>> It does seem to me we are missing a proper cleanup of the rangeset
>>> contents in some paths then.  In the above paragraph you mention "the
>>> old invalid address remains in the rangeset to be mapped", how does it
>>> get in there in the first place, and why is the rangeset not emptied
>>> if the mapping failed?
>>
>> Back in ("vpci/header: handle p2m range sets per BAR") I added a v->domain == pdev->domain check near the top of vpci_process_pending() as you appropriately suggested.
>>
>> +    if ( v->domain != pdev->domain )
>> +    {
>> +        read_unlock(&v->domain->pci_lock);
>> +        return false;
>> +    }
>>
>> I have also reverted this patch ("xen/arm: vpci: check guest range").
>>
>> The sequence of events leading to the old value remaining in the rangeset are:
>>
>> # xl pci-assignable-add 01:00.0
>> drivers/vpci/vpci.c:vpci_deassign_device()
>>     deassign 0000:01:00.0 from d0
>> # grep pci domu.cfg
>> pci = [ "01:00.0" ]
>> # xl create domu.cfg
>> drivers/vpci/vpci.c:vpci_deassign_device()
>>     deassign 0000:01:00.0 from d[IO]
>> drivers/vpci/vpci.c:vpci_assign_device()
>>     assign 0000:01:00.0 to d1
>>     bar->guest_addr is initialized to zero because of the line: pdev->vpci = xzalloc(struct vpci);
>> drivers/vpci/header.c:init_bars()
>> drivers/vpci/header.c:modify_bars()
> 
> I think I've commented this on another patch, but why is the device
> added with memory decoding enabled?  I would expect the FLR performed
> before assigning would leave the device with memory decoding disabled?

It seems the device is indeed being assigned to the domU with memory decoding enabled, but I'm not entirely sure why. The device I'm testing with doesn't support FLR, but it does support pm bus reset:
# cat /sys/bus/pci/devices/0000\:01\:00.0/reset_method
pm bus

As I understand it, libxl__device_pci_reset() should still be able to issue a reset in this case.

> Otherwise we might have to force init_bars() to assume memory decoding
> to be disabled, IOW: memory decoding would be set as disabled in the
> guest cmd view, and leave the physical device cmd as-is.  We might
> also consider switching memory decoding off unconditionally for domUs
> on the physical device.

I did a quick test and it works as expected with my apparently quirky test case:

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index de29e5322d34..0ad0ad947759 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -790,7 +790,12 @@ static int cf_check init_bars(struct pci_dev *pdev)

     /* Disable memory decoding before sizing. */
     if ( cmd & PCI_COMMAND_MEMORY )
+    {
+        if ( !is_hwdom )
+            cmd &= ~PCI_COMMAND_MEMORY;
+
         pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
+    }

     header->guest_cmd = cmd & ~PCI_COMMAND_MEMORY;

> 
>>     BAR0: start 0xe0000, end 0xe000f, start_guest 0x0, end_guest 0xf
>>     The range { 0-f } is added to the BAR0 rangeset for d1
>> drivers/vpci/header.c:defer_map()
>>     raise_softirq(SCHEDULE_SOFTIRQ);
>> drivers/vpci/header.c:vpci_process_pending()
>>     vpci_process_pending() returns because v->domain != pdev->domain (i.e. d0 != d1)
> 
> I don't think we can easily handle BAR mappings during device
> assignment with vPCI, because that would require adding some kind of
> continuation support which we don't have ATM.  Might be better to just
> switch memory decoding unconditionally off at init_bars() for domUs as
> that's the easier solution right now that would allow us to move
> forward.

+1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v9 08/16] vpci/header: handle p2m range sets per BAR
  2023-08-29 23:19 ` [PATCH v9 08/16] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
  2023-09-20 11:35   ` Roger Pau Monné
@ 2023-09-27 18:18   ` Stewart Hildebrand
  1 sibling, 0 replies; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-27 18:18 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné

On 8/29/23 19:19, Volodymyr Babchuk wrote:
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index e96d7b2b37..3cc6a96849 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -161,63 +161,101 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
> 
>  bool vpci_process_pending(struct vcpu *v)
>  {
> -    if ( v->vpci.mem )
> +    struct pci_dev *pdev = v->vpci.pdev;
> +    struct map_data data = {
> +        .d = v->domain,
> +        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> +    };
> +    struct vpci_header *header = NULL;
> +    unsigned int i;
> +
> +    if ( !pdev )
> +        return false;
> +
> +    read_lock(&v->domain->pci_lock);
> +    header = &pdev->vpci->header;
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        struct map_data data = {
> -            .d = v->domain,
> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> -        };
> -        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
> +        struct vpci_bar *bar = &header->bars[i];
> +        int rc;
> +
> +        if ( rangeset_is_empty(bar->mem) )
> +            continue;
> +
> +        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
> 
>          if ( rc == -ERESTART )
> +        {
> +            read_unlock(&v->domain->pci_lock);
>              return true;
> +        }
> 
> -        write_lock(&v->domain->pci_lock);
> -        spin_lock(&v->vpci.pdev->vpci->lock);
> -        /* Disable memory decoding unconditionally on failure. */
> -        modify_decoding(v->vpci.pdev,
> -                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
> -                        !rc && v->vpci.rom_only);
> -        spin_unlock(&v->vpci.pdev->vpci->lock);
> -
> -        rangeset_destroy(v->vpci.mem);
> -        v->vpci.mem = NULL;
>          if ( rc )
> -            /*
> -             * FIXME: in case of failure remove the device from the domain.
> -             * Note that there might still be leftover mappings. While this is
> -             * safe for Dom0, for DomUs the domain will likely need to be
> -             * killed in order to avoid leaking stale p2m mappings on
> -             * failure.
> -             */
> -            vpci_deassign_device(v->vpci.pdev);
> -        write_unlock(&v->domain->pci_lock);
> +        {
> +            spin_lock(&pdev->vpci->lock);
> +            /* Disable memory decoding unconditionally on failure. */
> +            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
> +                            false);
> +            spin_unlock(&pdev->vpci->lock);
> +
> +            v->vpci.pdev = NULL;
> +
> +            read_unlock(&v->domain->pci_lock);
> +
> +            if ( is_hardware_domain(v->domain) )
> +            {
> +                write_lock(&v->domain->pci_lock);
> +                vpci_deassign_device(v->vpci.pdev);

s/v->vpci.pdev/pdev/ since v->vpci.pdev was assigned NULL a few lines earlier.
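
To illustrate the ordering problem outside of Xen, here is a minimal standalone sketch. The struct definitions and the deassign() helper are hypothetical stand-ins for the real types and for vpci_deassign_device(); the helper tolerates NULL only so that the difference between the two orderings can be observed.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the Xen types; only the fields used here. */
struct pci_dev { int deassigned; };
struct vcpu_vpci { struct pci_dev *pdev; };

/* Hypothetical stand-in for vpci_deassign_device(); NULL-tolerant on purpose. */
int deassign(struct pci_dev *pdev)
{
    if ( !pdev )
        return -1;

    pdev->deassigned = 1;
    return 0;
}

/* Ordering as in the quoted hunk: the field is cleared, then read again. */
int fail_path_buggy(struct vcpu_vpci *v)
{
    v->pdev = NULL;
    return deassign(v->pdev);   /* always passes NULL */
}

/* Suggested ordering: use the local copy taken before the field is cleared. */
int fail_path_fixed(struct vcpu_vpci *v)
{
    struct pci_dev *pdev = v->pdev;

    v->pdev = NULL;
    return deassign(pdev);
}
```

In the real function the local pdev is already initialized at the top, so the fix is only the substitution above.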


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-09-27 18:03             ` Stewart Hildebrand
@ 2023-09-28  8:28               ` Roger Pau Monné
  2023-09-28 18:28                 ` Stewart Hildebrand
  0 siblings, 1 reply; 60+ messages in thread
From: Roger Pau Monné @ 2023-09-28  8:28 UTC (permalink / raw)
  To: Stewart Hildebrand; +Cc: Volodymyr Babchuk, xen-devel

On Wed, Sep 27, 2023 at 02:03:30PM -0400, Stewart Hildebrand wrote:
> On 9/26/23 11:48, Roger Pau Monné wrote:
> > On Tue, Sep 26, 2023 at 11:27:48AM -0400, Stewart Hildebrand wrote:
> >> On 9/26/23 04:07, Roger Pau Monné wrote:
> >>> On Mon, Sep 25, 2023 at 05:49:00PM -0400, Stewart Hildebrand wrote:
> >>>> On 9/22/23 04:44, Roger Pau Monné wrote:
> >>>>> On Tue, Aug 29, 2023 at 11:19:47PM +0000, Volodymyr Babchuk wrote:
> >>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>>>>>
> >>>>>> Skip mapping the BAR if it is not in a valid range.
> >>>>>>
> >>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>>>>> ---
> >>>>>>  xen/drivers/vpci/header.c | 9 +++++++++
> >>>>>>  1 file changed, 9 insertions(+)
> >>>>>>
> >>>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> >>>>>> index 1d243eeaf9..dbabdcbed2 100644
> >>>>>> --- a/xen/drivers/vpci/header.c
> >>>>>> +++ b/xen/drivers/vpci/header.c
> >>>>>> @@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>               bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
> >>>>>>              continue;
> >>>>>>
> >>>>>> +#ifdef CONFIG_ARM
> >>>>>> +        if ( !is_hardware_domain(pdev->domain) )
> >>>>>> +        {
> >>>>>> +            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
> >>>>>> +                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
> >>>>>> +                continue;
> >>>>>> +        }
> >>>>>> +#endif
> >>>>>
> >>>>> Hm, I think this should be in a hook similar to pci_check_bar() that
> >>>>> can be implemented per-arch.
> >>>>>
> >>>>> IIRC at least on x86 we allow the guest to place the BARs whenever it
> >>>>> wants, would such placement cause issues to the hypervisor on Arm?
> >>>>
> >>>> Hm. I wrote this patch in a hurry to make v9 of this series work on ARM. In my haste I also forgot about the prefetchable range starting at GUEST_VPCI_PREFETCH_MEM_ADDR, but that won't matter as we can probably throw this patch out.
> >>>>
> >>>> Now that I've had some more time to investigate, I believe the check in this patch is more or less redundant to the existing check in map_range() added in baa6ea700386 ("vpci: add permission checks to map_range()").
> >>>>
> >>>> The issue is that during initialization bar->guest_addr is zeroed, and this initial value of bar->guest_addr will fail the permissions check in map_range() and crash the domain. When the guest writes a new valid BAR, the old invalid address remains in the rangeset to be mapped. If we simply remove the old invalid BAR from the rangeset, that seems to fix the issue. So something like this:
> >>>
> >>> It does seem to me we are missing a proper cleanup of the rangeset
> >>> contents in some paths then.  In the above paragraph you mention "the
> >>> old invalid address remains in the rangeset to be mapped", how does it
> >>> get in there in the first place, and why is the rangeset not emptied
> >>> if the mapping failed?
> >>
> >> Back in ("vpci/header: handle p2m range sets per BAR") I added a v->domain == pdev->domain check near the top of vpci_process_pending() as you appropriately suggested.
> >>
> >> +    if ( v->domain != pdev->domain )
> >> +    {
> >> +        read_unlock(&v->domain->pci_lock);
> >> +        return false;
> >> +    }
> >>
> >> I have also reverted this patch ("xen/arm: vpci: check guest range").
> >>
> >> The sequence of events leading to the old value remaining in the rangeset are:
> >>
> >> # xl pci-assignable-add 01:00.0
> >> drivers/vpci/vpci.c:vpci_deassign_device()
> >>     deassign 0000:01:00.0 from d0
> >> # grep pci domu.cfg
> >> pci = [ "01:00.0" ]
> >> # xl create domu.cfg
> >> drivers/vpci/vpci.c:vpci_deassign_device()
> >>     deassign 0000:01:00.0 from d[IO]
> >> drivers/vpci/vpci.c:vpci_assign_device()
> >>     assign 0000:01:00.0 to d1
> >>     bar->guest_addr is initialized to zero because of the line: pdev->vpci = xzalloc(struct vpci);
> >> drivers/vpci/header.c:init_bars()
> >> drivers/vpci/header.c:modify_bars()
> > 
> > I think I've commented this on another patch, but why is the device
> > added with memory decoding enabled?  I would expect the FLR performed
> > before assigning would leave the device with memory decoding disabled?
> 
> It seems the device is indeed being assigned to the domU with memory decoding enabled, but I'm not entirely sure why. The device I'm testing with doesn't support FLR, but it does support pm bus reset:
> # cat /sys/bus/pci/devices/0000\:01\:00.0/reset_method
> pm bus
> 
> As I understand it, libxl__device_pci_reset() should still be able to issue a reset in this case.

Maybe pciback is somehow restoring part of the previous state?  I
have no insight into what state we expect the device to be in when
handed over by pciback, but this needs investigation in order to know
what to expect.

Can you paste the full contents of the command register for this
device?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-09-28  8:28               ` Roger Pau Monné
@ 2023-09-28 18:28                 ` Stewart Hildebrand
  2023-10-02 11:49                   ` Roger Pau Monné
  0 siblings, 1 reply; 60+ messages in thread
From: Stewart Hildebrand @ 2023-09-28 18:28 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Volodymyr Babchuk, xen-devel



On 9/28/23 04:28, Roger Pau Monné wrote:
> On Wed, Sep 27, 2023 at 02:03:30PM -0400, Stewart Hildebrand wrote:
>> On 9/26/23 11:48, Roger Pau Monné wrote:
>>> On Tue, Sep 26, 2023 at 11:27:48AM -0400, Stewart Hildebrand wrote:
>>>> On 9/26/23 04:07, Roger Pau Monné wrote:
>>>>> On Mon, Sep 25, 2023 at 05:49:00PM -0400, Stewart Hildebrand wrote:
>>>>>> On 9/22/23 04:44, Roger Pau Monné wrote:
>>>>>>> On Tue, Aug 29, 2023 at 11:19:47PM +0000, Volodymyr Babchuk wrote:
>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>>>
>>>>>>>> Skip mapping the BAR if it is not in a valid range.
>>>>>>>>
>>>>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>>> ---
>>>>>>>>  xen/drivers/vpci/header.c | 9 +++++++++
>>>>>>>>  1 file changed, 9 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>>>>>>>> index 1d243eeaf9..dbabdcbed2 100644
>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>> @@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>               bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
>>>>>>>>              continue;
>>>>>>>>
>>>>>>>> +#ifdef CONFIG_ARM
>>>>>>>> +        if ( !is_hardware_domain(pdev->domain) )
>>>>>>>> +        {
>>>>>>>> +            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
>>>>>>>> +                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
>>>>>>>> +                continue;
>>>>>>>> +        }
>>>>>>>> +#endif
>>>>>>>
>>>>>>> Hm, I think this should be in a hook similar to pci_check_bar() that
>>>>>>> can be implemented per-arch.
>>>>>>>
>>>>>>> IIRC at least on x86 we allow the guest to place the BARs whenever it
>>>>>>> wants, would such placement cause issues to the hypervisor on Arm?
>>>>>>
>>>>>> Hm. I wrote this patch in a hurry to make v9 of this series work on ARM. In my haste I also forgot about the prefetchable range starting at GUEST_VPCI_PREFETCH_MEM_ADDR, but that won't matter as we can probably throw this patch out.
>>>>>>
>>>>>> Now that I've had some more time to investigate, I believe the check in this patch is more or less redundant to the existing check in map_range() added in baa6ea700386 ("vpci: add permission checks to map_range()").
>>>>>>
>>>>>> The issue is that during initialization bar->guest_addr is zeroed, and this initial value of bar->guest_addr will fail the permissions check in map_range() and crash the domain. When the guest writes a new valid BAR, the old invalid address remains in the rangeset to be mapped. If we simply remove the old invalid BAR from the rangeset, that seems to fix the issue. So something like this:
>>>>>
>>>>> It does seem to me we are missing a proper cleanup of the rangeset
>>>>> contents in some paths then.  In the above paragraph you mention "the
>>>>> old invalid address remains in the rangeset to be mapped", how does it
>>>>> get in there in the first place, and why is the rangeset not emptied
>>>>> if the mapping failed?
>>>>
>>>> Back in ("vpci/header: handle p2m range sets per BAR") I added a v->domain == pdev->domain check near the top of vpci_process_pending() as you appropriately suggested.
>>>>
>>>> +    if ( v->domain != pdev->domain )
>>>> +    {
>>>> +        read_unlock(&v->domain->pci_lock);
>>>> +        return false;
>>>> +    }
>>>>
>>>> I have also reverted this patch ("xen/arm: vpci: check guest range").
>>>>
>>>> The sequence of events leading to the old value remaining in the rangeset are:
>>>>
>>>> # xl pci-assignable-add 01:00.0
>>>> drivers/vpci/vpci.c:vpci_deassign_device()
>>>>     deassign 0000:01:00.0 from d0
>>>> # grep pci domu.cfg
>>>> pci = [ "01:00.0" ]
>>>> # xl create domu.cfg
>>>> drivers/vpci/vpci.c:vpci_deassign_device()
>>>>     deassign 0000:01:00.0 from d[IO]
>>>> drivers/vpci/vpci.c:vpci_assign_device()
>>>>     assign 0000:01:00.0 to d1
>>>>     bar->guest_addr is initialized to zero because of the line: pdev->vpci = xzalloc(struct vpci);
>>>> drivers/vpci/header.c:init_bars()
>>>> drivers/vpci/header.c:modify_bars()
>>>
>>> I think I've commented this on another patch, but why is the device
>>> added with memory decoding enabled?  I would expect the FLR performed
>>> before assigning would leave the device with memory decoding disabled?
>>
>> It seems the device is indeed being assigned to the domU with memory decoding enabled, but I'm not entirely sure why. The device I'm testing with doesn't support FLR, but it does support pm bus reset:
>> # cat /sys/bus/pci/devices/0000\:01\:00.0/reset_method
>> pm bus
>>
>> As I understand it, libxl__device_pci_reset() should still be able to issue a reset in this case.
> 
> Maybe pciback is somehow restoring part of the previous state?  I
> have no insight in what state we expect the device to be handled by
> pciback, but this needs investigation in order to know what to expect.

Yep, during "xl pci-assignable-add ..." pciback resets the device and restores the state, including whether memory decoding is enabled.

drivers/xen/xen-pciback/pci_stub.c:pcistub_init_device():

	/* We need the device active to save the state. */
	dev_dbg(&dev->dev, "save state of device\n");
	pci_save_state(dev);
	dev_data->pci_saved_state = pci_store_saved_state(dev);
	if (!dev_data->pci_saved_state)
		dev_err(&dev->dev, "Could not store PCI conf saved state!\n");
	else {
		dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
		__pci_reset_function_locked(dev);
		pci_restore_state(dev);
	}
	/* Now disable the device (this also ensures some private device
	 * data is setup before we export)
	 */
	dev_dbg(&dev->dev, "reset device\n");
	xen_pcibk_reset_device(dev);

That last function, xen_pcibk_reset_device(), clears the bus master enable bit in the command register for devices with PCI_HEADER_TYPE_NORMAL (so, contrary to its name, it does not perform a reset).
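
As a rough model of the sequence described above (not the pciback code itself), the effect on the command register can be sketched as follows. struct model_dev and model_pcistub_init() are hypothetical names, and the model assumes that the reset leaves the command register at zero and that the final step only clears the bus master bit.

```c
#include <stdint.h>

/* Command register bits, as defined by the PCI specification. */
#define PCI_COMMAND_IO      0x1
#define PCI_COMMAND_MEMORY  0x2
#define PCI_COMMAND_MASTER  0x4

/* Hypothetical model of a device: only the command register is tracked. */
struct model_dev { uint16_t cmd; };

/* Model of the save / reset / restore / clear-bus-master sequence. */
uint16_t model_pcistub_init(struct model_dev *dev)
{
    uint16_t saved = dev->cmd;                  /* pci_save_state() */

    dev->cmd = 0;                               /* assumed effect of the reset */
    dev->cmd = saved;                           /* pci_restore_state() */
    dev->cmd &= (uint16_t)~PCI_COMMAND_MASTER;  /* xen_pcibk_reset_device() */

    return dev->cmd;
}
```

Under these assumptions a device that starts with memory decoding and bus mastering enabled (0x0006) ends up with memory decoding still enabled (0x0002).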

xl create should reset the device again, but, similarly, this also seems to restore the state.

> Can you paste the full contents of the command register for this
> device?
Start of day (PCIe controller and bridge initialized, no device BARs or anything have been programmed yet): 0x0000
After dom0 boot, device is in use: 0x0006
After pci-assignable-add: 0x0002
After echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset: 0x0002
After xl create, domU booted: 0x0006
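
For reading the values above, a small sketch that decodes the relevant bits may help. The bit definitions follow the PCI specification; the two helper names are hypothetical.

```c
#include <stdint.h>

/* Command register bits, as defined by the PCI specification. */
#define PCI_COMMAND_IO      0x1
#define PCI_COMMAND_MEMORY  0x2
#define PCI_COMMAND_MASTER  0x4

/* Hypothetical helper: is memory decoding enabled in this command value? */
int mem_decoding_enabled(uint16_t cmd)
{
    return !!(cmd & PCI_COMMAND_MEMORY);
}

/* Hypothetical helper: is bus mastering enabled in this command value? */
int bus_master_enabled(uint16_t cmd)
{
    return !!(cmd & PCI_COMMAND_MASTER);
}
```

So 0x0006 means memory decoding and bus mastering are both enabled, while 0x0002 means memory decoding is enabled with bus mastering disabled.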

Should mapping BARs be conditional on PCI_COMMAND_MASTER, not PCI_COMMAND_MEMORY? E.g.:

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 9cf701b3c464..9ce1793d64b8 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -1162,7 +1162,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
             goto fail;
     }

-    return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
+    return (cmd & PCI_COMMAND_MASTER) ? modify_bars(pdev, cmd, false) : 0;

  fail:
     pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v9 15/16] xen/arm: vpci: check guest range
  2023-09-28 18:28                 ` Stewart Hildebrand
@ 2023-10-02 11:49                   ` Roger Pau Monné
  0 siblings, 0 replies; 60+ messages in thread
From: Roger Pau Monné @ 2023-10-02 11:49 UTC (permalink / raw)
  To: Stewart Hildebrand; +Cc: Volodymyr Babchuk, xen-devel

On Thu, Sep 28, 2023 at 02:28:11PM -0400, Stewart Hildebrand wrote:
> 
> 
> On 9/28/23 04:28, Roger Pau Monné wrote:
> > On Wed, Sep 27, 2023 at 02:03:30PM -0400, Stewart Hildebrand wrote:
> >> On 9/26/23 11:48, Roger Pau Monné wrote:
> >>> On Tue, Sep 26, 2023 at 11:27:48AM -0400, Stewart Hildebrand wrote:
> >>>> On 9/26/23 04:07, Roger Pau Monné wrote:
> >>>>> On Mon, Sep 25, 2023 at 05:49:00PM -0400, Stewart Hildebrand wrote:
> >>>>>> On 9/22/23 04:44, Roger Pau Monné wrote:
> >>>>>>> On Tue, Aug 29, 2023 at 11:19:47PM +0000, Volodymyr Babchuk wrote:
> >>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>>>>>>>
> >>>>>>>> Skip mapping the BAR if it is not in a valid range.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>>>>>>> ---
> >>>>>>>>  xen/drivers/vpci/header.c | 9 +++++++++
> >>>>>>>>  1 file changed, 9 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> >>>>>>>> index 1d243eeaf9..dbabdcbed2 100644
> >>>>>>>> --- a/xen/drivers/vpci/header.c
> >>>>>>>> +++ b/xen/drivers/vpci/header.c
> >>>>>>>> @@ -345,6 +345,15 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>>>               bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
> >>>>>>>>              continue;
> >>>>>>>>
> >>>>>>>> +#ifdef CONFIG_ARM
> >>>>>>>> +        if ( !is_hardware_domain(pdev->domain) )
> >>>>>>>> +        {
> >>>>>>>> +            if ( (start_guest < PFN_DOWN(GUEST_VPCI_MEM_ADDR)) ||
> >>>>>>>> +                 (end_guest >= PFN_DOWN(GUEST_VPCI_MEM_ADDR + GUEST_VPCI_MEM_SIZE)) )
> >>>>>>>> +                continue;
> >>>>>>>> +        }
> >>>>>>>> +#endif
> >>>>>>>
> >>>>>>> Hm, I think this should be in a hook similar to pci_check_bar() that
> >>>>>>> can be implemented per-arch.
> >>>>>>>
> >>>>>>> IIRC at least on x86 we allow the guest to place the BARs whenever it
> >>>>>>> wants, would such placement cause issues to the hypervisor on Arm?
> >>>>>>
> >>>>>> Hm. I wrote this patch in a hurry to make v9 of this series work on ARM. In my haste I also forgot about the prefetchable range starting at GUEST_VPCI_PREFETCH_MEM_ADDR, but that won't matter as we can probably throw this patch out.
> >>>>>>
> >>>>>> Now that I've had some more time to investigate, I believe the check in this patch is more or less redundant to the existing check in map_range() added in baa6ea700386 ("vpci: add permission checks to map_range()").
> >>>>>>
> >>>>>> The issue is that during initialization bar->guest_addr is zeroed, and this initial value of bar->guest_addr will fail the permissions check in map_range() and crash the domain. When the guest writes a new valid BAR, the old invalid address remains in the rangeset to be mapped. If we simply remove the old invalid BAR from the rangeset, that seems to fix the issue. So something like this:
> >>>>>
> >>>>> It does seem to me we are missing a proper cleanup of the rangeset
> >>>>> contents in some paths then.  In the above paragraph you mention "the
> >>>>> old invalid address remains in the rangeset to be mapped", how does it
> >>>>> get in there in the first place, and why is the rangeset not emptied
> >>>>> if the mapping failed?
> >>>>
> >>>> Back in ("vpci/header: handle p2m range sets per BAR") I added a v->domain == pdev->domain check near the top of vpci_process_pending() as you appropriately suggested.
> >>>>
> >>>> +    if ( v->domain != pdev->domain )
> >>>> +    {
> >>>> +        read_unlock(&v->domain->pci_lock);
> >>>> +        return false;
> >>>> +    }
> >>>>
> >>>> I have also reverted this patch ("xen/arm: vpci: check guest range").
> >>>>
> >>>> The sequence of events leading to the old value remaining in the rangeset are:
> >>>>
> >>>> # xl pci-assignable-add 01:00.0
> >>>> drivers/vpci/vpci.c:vpci_deassign_device()
> >>>>     deassign 0000:01:00.0 from d0
> >>>> # grep pci domu.cfg
> >>>> pci = [ "01:00.0" ]
> >>>> # xl create domu.cfg
> >>>> drivers/vpci/vpci.c:vpci_deassign_device()
> >>>>     deassign 0000:01:00.0 from d[IO]
> >>>> drivers/vpci/vpci.c:vpci_assign_device()
> >>>>     assign 0000:01:00.0 to d1
> >>>>     bar->guest_addr is initialized to zero because of the line: pdev->vpci = xzalloc(struct vpci);
> >>>> drivers/vpci/header.c:init_bars()
> >>>> drivers/vpci/header.c:modify_bars()
> >>>
> >>> I think I've commented this on another patch, but why is the device
> >>> added with memory decoding enabled?  I would expect the FLR performed
> >>> before assigning would leave the device with memory decoding disabled?
> >>
> >> It seems the device is indeed being assigned to the domU with memory decoding enabled, but I'm not entirely sure why. The device I'm testing with doesn't support FLR, but it does support pm bus reset:
> >> # cat /sys/bus/pci/devices/0000\:01\:00.0/reset_method
> >> pm bus
> >>
> >> As I understand it, libxl__device_pci_reset() should still be able to issue a reset in this case.
> > 
> > Maybe pciback is somehow restoring part of the previous state?  I
> > have no insight in what state we expect the device to be handled by
> > pciback, but this needs investigation in order to know what to expect.
> 
> Yep, during "xl pci-assignable-add ..." pciback resets the device and restores the state, including whether memory decoding is enabled.
> 
> drivers/xen/xen-pciback/pci_stub.c:pcistub_init_device():
> 
> 	/* We need the device active to save the state. */
> 	dev_dbg(&dev->dev, "save state of device\n");
> 	pci_save_state(dev);
> 	dev_data->pci_saved_state = pci_store_saved_state(dev);
> 	if (!dev_data->pci_saved_state)
> 		dev_err(&dev->dev, "Could not store PCI conf saved state!\n");
> 	else {
> 		dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
> 		__pci_reset_function_locked(dev);
> 		pci_restore_state(dev);
> 	}
> 	/* Now disable the device (this also ensures some private device
> 	 * data is setup before we export)
> 	 */
> 	dev_dbg(&dev->dev, "reset device\n");
> 	xen_pcibk_reset_device(dev);
> 
> That last function, xen_pcibk_reset_device(), clears the bus master enable bit in the command register for devices with PCI_HEADER_TYPE_NORMAL (not a reset contrary to the function name).
> 
> xl create should reset the device again, but, similarly, this also seems to restore the state.
> 
> > Can you paste the full contents of the command register for this
> > device?
> Start of day (PCIe controller and bridge initialized, no device BARs or anything have been programmed yet): 0x0000
> After dom0 boot, device is in use: 0x0006
> After pci-assignable-add: 0x0002
> After echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset: 0x0002
> After xl create, domU booted: 0x0006
> 
> Should mapping BARs be conditional on PCI_COMMAND_MASTER, not PCI_COMMAND_MEMORY? E.g.:

No, I don't think so, as then Xen state would get out of sync with the
hardware state.  I think just disabling memory and IO decoding at
init_bars() for devices assigned to domUs should be fine for the time
being.
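
A minimal standalone sketch of that idea, assuming a hypothetical helper initial_cmd() that init_bars() could apply to the command value it reads from hardware (this is not the actual Xen change, and is_hwdom is passed in rather than derived from the device):

```c
#include <stdbool.h>
#include <stdint.h>

/* Command register bits, as defined by the PCI specification. */
#define PCI_COMMAND_IO      0x1
#define PCI_COMMAND_MEMORY  0x2

/*
 * Hypothetical helper: return the command value to start from at init
 * time.  For the hardware domain the hardware value is kept; for domUs
 * both IO and memory decoding are forced off.
 */
uint16_t initial_cmd(uint16_t hw_cmd, bool is_hwdom)
{
    if ( is_hwdom )
        return hw_cmd;

    return hw_cmd & (uint16_t)~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
}
```

With the decoding bits clear, the (cmd & PCI_COMMAND_MEMORY) test at the end of init_bars() shown in the quoted diff would skip modify_bars() at assignment time, provided the same value is also written to the physical device so that Xen and hardware state stay in sync.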

Thanks, Roger.


^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2023-10-02 11:51 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-29 23:19 [PATCH v9 00/16] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
2023-08-29 23:19 ` [PATCH v9 03/16] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
2023-08-29 23:19 ` [PATCH v9 02/16] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
2023-09-19 15:39   ` Roger Pau Monné
2023-09-19 15:55     ` Jan Beulich
2023-09-20  8:12       ` Roger Pau Monné
2023-09-19 16:20     ` Stewart Hildebrand
2023-09-20  8:09       ` Roger Pau Monné
2023-09-20 13:56         ` Stewart Hildebrand
2023-09-21  7:42           ` Jan Beulich
2023-09-21  9:00             ` Roger Pau Monné
2023-09-20 19:16     ` Stewart Hildebrand
2023-09-21  9:41       ` Roger Pau Monné
2023-09-25 23:03     ` Volodymyr Babchuk
2023-08-29 23:19 ` [PATCH v9 01/16] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
2023-09-19 14:09   ` Roger Pau Monné
2023-09-25 22:44     ` Volodymyr Babchuk
2023-08-29 23:19 ` [PATCH v9 07/16] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
2023-08-29 23:19 ` [PATCH v9 06/16] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
2023-09-01  5:25   ` Stewart Hildebrand
2023-09-20  9:49   ` Roger Pau Monné
2023-09-20 14:18     ` Stewart Hildebrand
2023-08-29 23:19 ` [PATCH v9 05/16] vpci/header: rework exit path in init_bars Volodymyr Babchuk
2023-09-20  8:49   ` Roger Pau Monné
2023-08-29 23:19 ` [PATCH v9 04/16] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
2023-09-12  9:37   ` Jan Beulich
2023-09-12 23:41     ` Volodymyr Babchuk
2023-09-13  5:58       ` Jan Beulich
2023-09-13 23:53         ` Volodymyr Babchuk
2023-09-20  8:41           ` Roger Pau Monné
2023-09-20  8:39   ` Roger Pau Monné
2023-08-29 23:19 ` [PATCH v9 08/16] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
2023-09-20 11:35   ` Roger Pau Monné
2023-09-27 18:18   ` Stewart Hildebrand
2023-08-29 23:19 ` [PATCH v9 09/16] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
2023-09-21 10:34   ` Roger Pau Monné
2023-08-29 23:19 ` [PATCH v9 10/16] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
2023-09-01  5:23   ` Stewart Hildebrand
2023-09-21 13:18   ` Roger Pau Monné
2023-08-29 23:19 ` [PATCH v9 11/16] vpci/header: reset the command register when adding devices Volodymyr Babchuk
2023-09-21 13:30   ` Roger Pau Monné
2023-08-29 23:19 ` [PATCH v9 14/16] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
2023-08-29 23:19 ` [PATCH v9 13/16] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
2023-09-22  8:32   ` Roger Pau Monné
2023-08-29 23:19 ` [PATCH v9 12/16] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
2023-08-30  7:37   ` Jan Beulich
2023-08-31 21:12     ` Volodymyr Babchuk
2023-09-21 16:03   ` Roger Pau Monné
2023-08-29 23:19 ` [PATCH v9 16/16] xen/arm: vpci: permit access to guest vpci space Volodymyr Babchuk
2023-09-26  0:12   ` Stewart Hildebrand
2023-08-29 23:19 ` [PATCH v9 15/16] xen/arm: vpci: check guest range Volodymyr Babchuk
2023-09-22  8:44   ` Roger Pau Monné
2023-09-25 21:49     ` Stewart Hildebrand
2023-09-26  8:07       ` Roger Pau Monné
2023-09-26 15:27         ` Stewart Hildebrand
2023-09-26 15:48           ` Roger Pau Monné
2023-09-27 18:03             ` Stewart Hildebrand
2023-09-28  8:28               ` Roger Pau Monné
2023-09-28 18:28                 ` Stewart Hildebrand
2023-10-02 11:49                   ` Roger Pau Monné
