* [PATCH v8 00/13] PCI devices passthrough on Arm, part 3
@ 2023-07-20  0:32 Volodymyr Babchuk
  2023-07-20  0:32 ` [PATCH v8 01/13] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
                   ` (13 more replies)
  0 siblings, 14 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Volodymyr Babchuk

Hello,

This is the next version of the vPCI rework. The aim of this series is
to prepare the ground for introducing PCI support on the ARM platform.

The biggest change from the previous (mistakenly named) v7 series is
how locking is implemented. Instead of d->vpci_rwlock we introduce
d->pci_lock, which has a broader scope: it protects not only the
domain's vPCI state, but the domain's list of PCI devices as well.

As discussed with Roger on IRC, it is not feasible to rework all the
existing code to use the new lock right away. It was agreed that any
write access to d->pdev_list will be protected by **both**
d->pci_lock in write mode and pcidevs_lock(). Read access, on the
other hand, should be protected by either d->pci_lock in read mode or
pcidevs_lock(). It is expected that existing code will keep using
pcidevs_lock() and that new users will use the new rwlock. Of course,
this does not mean that new users shall not use pcidevs_lock() when
it is appropriate.

Apart from the locking scheme rework, there are a number of smaller
fixes in some patches, based on the review comments.

Oleksandr Andrushchenko (12):
  vpci: use per-domain PCI lock to protect vpci structure
  vpci: restrict unhandled read/write operations for guests
  vpci: add hooks for PCI device assign/de-assign
  vpci/header: implement guest BAR register handlers
  rangeset: add RANGESETF_no_print flag
  vpci/header: handle p2m range sets per BAR
  vpci/header: program p2m with guest BAR view
  vpci/header: emulate PCI_COMMAND register for guests
  vpci/header: reset the command register when adding devices
  vpci: add initial support for virtual PCI bus topology
  xen/arm: translate virtual PCI bus topology for guests
  xen/arm: account IO handlers for emulated PCI MSI-X

Volodymyr Babchuk (1):
  pci: introduce per-domain PCI rwlock

 xen/arch/arm/vpci.c                         |  61 ++-
 xen/arch/x86/hvm/vmsi.c                     |   4 +
 xen/common/domain.c                         |   1 +
 xen/common/rangeset.c                       |   5 +-
 xen/drivers/Kconfig                         |   4 +
 xen/drivers/passthrough/amd/pci_amd_iommu.c |   9 +-
 xen/drivers/passthrough/pci.c               |  96 ++++-
 xen/drivers/passthrough/vtd/iommu.c         |   9 +-
 xen/drivers/vpci/header.c                   | 453 ++++++++++++++++----
 xen/drivers/vpci/msi.c                      |  18 +-
 xen/drivers/vpci/msix.c                     |  56 ++-
 xen/drivers/vpci/vpci.c                     | 176 +++++++-
 xen/include/xen/pci.h                       |   1 +
 xen/include/xen/rangeset.h                  |   5 +-
 xen/include/xen/sched.h                     |   9 +
 xen/include/xen/vpci.h                      |  42 +-
 16 files changed, 828 insertions(+), 121 deletions(-)

-- 
2.41.0

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (2 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-20 12:36   ` Roger Pau Monné
  2023-07-24  9:41   ` Jan Beulich
  2023-07-20  0:32 ` [PATCH v8 06/13] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a PCI device gets assigned/de-assigned, some work needs to be done
on the vPCI side for that device. Introduce a pair of hooks so vPCI can
handle that.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v8:
- removed vpci_deassign_device
Since v6:
- do not pass struct domain to vpci_{assign|deassign}_device as
  pdev->domain can be used
- do not leave the device assigned (pdev->domain == new domain) in case
  vpci_assign_device fails: try to de-assign and if this also fails, then
  crash the domain
Since v5:
- do not split code into run_vpci_init
- do not check for is_system_domain in vpci_{de}assign_device
- do not use vpci_remove_device_handlers_locked and re-allocate
  pdev->vpci completely
- make vpci_deassign_device void
Since v4:
 - de-assign vPCI from the previous domain on device assignment
 - do not remove handlers in vpci_assign_device as those must not
   exist at that point
Since v3:
 - remove toolstack roll-back description from the commit message
   as errors are to be handled with proper cleanup in Xen itself
 - remove __must_check
 - remove redundant rc check while assigning devices
 - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
 - use REGISTER_VPCI_INIT machinery to run required steps on device
   init/assign: add run_vpci_init helper
Since v2:
- define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
  for x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - extended the commit message
---
 xen/drivers/Kconfig           |  4 ++++
 xen/drivers/passthrough/pci.c | 21 +++++++++++++++++++++
 xen/drivers/vpci/vpci.c       | 18 ++++++++++++++++++
 xen/include/xen/vpci.h        | 15 +++++++++++++++
 4 files changed, 58 insertions(+)

diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index db94393f47..780490cf8e 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
 config HAS_VPCI
 	bool
 
+config HAS_VPCI_GUEST_SUPPORT
+	bool
+	depends on HAS_VPCI
+
 endmenu
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 6f8692cd9c..265d359704 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -885,6 +885,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
     if ( ret )
         goto out;
 
+    write_lock(&pdev->domain->pci_lock);
+    vpci_deassign_device(pdev);
+    write_unlock(&pdev->domain->pci_lock);
+
     if ( pdev->domain == hardware_domain  )
         pdev->quarantine = false;
 
@@ -1484,6 +1488,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
     if ( pdev->broken && d != hardware_domain && d != dom_io )
         goto done;
 
+    write_lock(&pdev->domain->pci_lock);
+    vpci_deassign_device(pdev);
+    write_unlock(&pdev->domain->pci_lock);
+
     rc = pdev_msix_assign(d, pdev);
     if ( rc )
         goto done;
@@ -1509,6 +1517,19 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
         rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
                         pci_to_dev(pdev), flag);
     }
+    if ( rc )
+        goto done;
+
+    devfn = pdev->devfn;
+    write_lock(&pdev->domain->pci_lock);
+    rc = vpci_assign_device(pdev);
+    write_unlock(&pdev->domain->pci_lock);
+    if ( rc && deassign_device(d, seg, bus, devfn) )
+    {
+        printk(XENLOG_ERR "%pd: %pp was left partially assigned\n",
+               d, &PCI_SBDF(seg, bus, devfn));
+        domain_crash(d);
+    }
 
  done:
     if ( rc )
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index a6d2cf8660..a97710a806 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -107,6 +107,24 @@ int vpci_add_handlers(struct pci_dev *pdev)
 
     return rc;
 }
+
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+/* Notify vPCI that device is assigned to guest. */
+int vpci_assign_device(struct pci_dev *pdev)
+{
+    int rc;
+
+    if ( !has_vpci(pdev->domain) )
+        return 0;
+
+    rc = vpci_add_handlers(pdev);
+    if ( rc )
+        vpci_deassign_device(pdev);
+
+    return rc;
+}
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 0b8a2a3c74..44296623e1 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -264,6 +264,21 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
 }
 #endif
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+/* Notify vPCI that device is assigned/de-assigned to/from guest. */
+int vpci_assign_device(struct pci_dev *pdev);
+#define vpci_deassign_device vpci_remove_device
+#else
+static inline int vpci_assign_device(struct pci_dev *pdev)
+{
+    return 0;
+};
+
+static inline void vpci_deassign_device(struct pci_dev *pdev)
+{
+};
+#endif
+
 #endif
 
 /*
-- 
2.41.0



* [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
  2023-07-20  0:32 ` [PATCH v8 01/13] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
  2023-07-20  0:32 ` [PATCH v8 03/13] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-20 11:20   ` Roger Pau Monné
                     ` (2 more replies)
  2023-07-20  0:32 ` [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
                   ` (10 subsequent siblings)
  13 siblings, 3 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné,
	Jan Beulich, Volodymyr Babchuk

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Use the previously introduced per-domain read/write lock to check
whether vpci is present, so we are sure there are no accesses to the
contents of the vpci struct if not. This lock can be used (and in a
few cases is used right away) so that vpci removal can be performed
while holding the lock in write mode. Previously such removal could
race with vpci_read, for example.

1. The per-domain d->pci_lock is used to protect the pdev->vpci
structure from being removed.

2. Writing the command register and the ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for overlaps with existing BARs. The overlap check, if done
under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock; it is not
possible to upgrade a read lock to a write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock instead. To prevent
deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
always lock hwdom's first.

All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.

3. Drop const qualifier where the new rwlock is used and this is
appropriate.

4. Do not call process_pending_softirqs with any locks held. For that,
unlock prior to the call and re-acquire the locks afterwards. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
 - in apply_map, because of the context it is called in (no race
   condition possible)
 - for the MSI/MSI-X debug code, because it is called at the end of
   the pdev->vpci access and no further access to pdev->vpci is made

5. Introduce pcidevs_trylock, so there is a way to attempt to acquire
the pcidevs lock without blocking.

6. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---

Changes in v8:
 - changed d->vpci_lock to d->pci_lock
 - introducing d->pci_lock in a separate patch
 - extended locked region in vpci_process_pending
 - removed pcidevs_lockis vpci_dump_msi()
 - removed some changes as they are not needed with
   the new locking scheme
 - added handling for hwdom && dom_xen case
---
 xen/arch/x86/hvm/vmsi.c       |  4 +++
 xen/drivers/passthrough/pci.c |  7 +++++
 xen/drivers/vpci/header.c     | 18 ++++++++++++
 xen/drivers/vpci/msi.c        | 14 ++++++++--
 xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
 xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
 xen/include/xen/pci.h         |  1 +
 7 files changed, 129 insertions(+), 13 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 3cd4923060..8c1bd66b9c 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -895,6 +895,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +915,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            read_unlock(&pdev->domain->pci_lock);
             process_pending_softirqs();
+            read_lock(&pdev->domain->pci_lock);
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 5b4632ead2..6f8692cd9c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -57,6 +57,11 @@ void pcidevs_lock(void)
     spin_lock_recursive(&_pcidevs_lock);
 }
 
+int pcidevs_trylock(void)
+{
+    return spin_trylock_recursive(&_pcidevs_lock);
+}
+
 void pcidevs_unlock(void)
 {
     spin_unlock_recursive(&_pcidevs_lock);
@@ -1144,7 +1149,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
+    write_lock(&ctxt->d->pci_lock);
     err = vpci_add_handlers(pdev);
+    write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
                ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index b41556d007..2780fcae72 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -152,6 +152,7 @@ bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        write_lock(&v->domain->pci_lock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
@@ -170,6 +171,7 @@ bool vpci_process_pending(struct vcpu *v)
              * failure.
              */
             vpci_remove_device(v->vpci.pdev);
+        write_unlock(&v->domain->pci_lock);
     }
 
     return false;
@@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking pdev disappearing because devices cannot be
+         * removed until the initial domain has been started.
+         */
+        read_unlock(&d->pci_lock);
         process_pending_softirqs();
+        read_lock(&d->pci_lock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -223,6 +237,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     unsigned int i;
     int rc;
 
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
+
     if ( !mem )
         return -ENOMEM;
 
@@ -502,6 +518,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_bar *bars = header->bars;
     int rc;
 
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 8f2b59e61a..e63152c224 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
     uint16_t control;
     int ret;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !pos )
         return 0;
 
@@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -277,6 +279,9 @@ void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->pci_lock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -318,14 +323,17 @@ void vpci_dump_msi(void)
                      * holding the lock.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
             spin_unlock(&pdev->vpci->lock);
+ pdev_done:
+            read_unlock(&d->pci_lock);
             process_pending_softirqs();
+            read_lock(&d->pci_lock);
         }
+        read_unlock(&d->pci_lock);
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index 25bde77586..9481274579 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->pci_lock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->pci_lock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0ul;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +425,7 @@ static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +513,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +614,7 @@ static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -665,6 +701,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
     struct vpci_msix *msix;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
                                       PCI_CAP_ID_MSIX);
     if ( !msix_offset )
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index d73fa76302..f22cbf2112 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
         return;
 
@@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
     const unsigned long *ro_map;
     int rc = 0;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) )
         return 0;
 
@@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     uint32_t data = ~(uint32_t)0;
+    rwlock_t *lock;
 
     if ( !size )
     {
@@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
      * Find the PCI dev matching the address, which for hwdom also requires
      * consulting DomXEN.  Passthrough everything that's not trapped.
      */
+    lock = &d->pci_lock;
+    read_lock(lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
+    {
+        read_unlock(lock);
+        lock = &dom_xen->pci_lock;
+        read_lock(lock);
         pdev = pci_get_pdev(dom_xen, sbdf);
+    }
     if ( !pdev || !pdev->vpci )
+    {
+        read_unlock(lock);
         return vpci_read_hw(sbdf, reg, size);
+    }
 
     spin_lock(&pdev->vpci->lock);
 
@@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(lock);
 
     if ( data_offset < size )
     {
@@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
              r->private);
 }
 
+/* Helper function to release the locks taken by vpci_write in the proper order */
+static void unlock_locks(struct domain *d)
+{
+    ASSERT(rw_is_locked(&d->pci_lock));
+
+    if ( is_hardware_domain(d) )
+    {
+        ASSERT(rw_is_locked(&d->pci_lock));
+        read_unlock(&dom_xen->pci_lock);
+    }
+    read_unlock(&d->pci_lock);
+}
+
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
-     * consulting DomXEN.  Passthrough everything that's not trapped.
+     * consulting DomXEN. Passthrough everything that's not trapped.
     * If this is the hardware domain, we need to hold the locks of both
     * domains in case modify_bars() gets called.
      */
+    read_lock(&d->pci_lock);
+
+    /* dom_xen->pci_lock should always be taken second to prevent deadlock */
+    if ( is_hardware_domain(d) )
+        read_lock(&dom_xen->pci_lock);
+
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
@@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
         if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
             vpci_write_hw(sbdf, reg, size, data);
+
+        unlock_locks(d);
         return;
     }
 
@@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    unlock_locks(d);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 5975ca2f30..4512910dca 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -157,6 +157,7 @@ struct pci_dev {
  */
 
 void pcidevs_lock(void);
+int pcidevs_trylock(void);
 void pcidevs_unlock(void);
 bool_t __must_check pcidevs_locked(void);
 
-- 
2.41.0


* [PATCH v8 01/13] pci: introduce per-domain PCI rwlock
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-20  9:45   ` Roger Pau Monné
  2023-07-20 15:40   ` Jan Beulich
  2023-07-20  0:32 ` [PATCH v8 03/13] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Volodymyr Babchuk, Roger Pau Monné, Jan Beulich

Add a per-domain d->pci_lock that protects access to d->pdev_list. The
purpose of this lock is to guarantee to the vPCI code that the
underlying pdev will not disappear from under its feet. This is a
rw-lock, but this patch adds only write_lock()s; read_lock() users
will appear in subsequent patches.

This lock should be taken in write mode every time d->pdev_list is
altered. This covers both accesses to d->pdev_list and accesses to the
pdev->domain_list fields. All write accesses should also be protected
by pcidevs_lock(). The idea is that any user that wants read access to
the list, or to the devices stored in the list, should use either the
new d->pci_lock or the old pcidevs_lock(). Using either of these two
locks ensures only that the pdev of interest will not disappear from
under our feet and that it remains assigned to the same domain. Of
course, any new users should use pcidevs_lock() when it is appropriate
(e.g. when accessing any other state that is protected by the said
lock).

Any write access to pdev->domain_list should be protected by both
pcidevs_lock() and d->pci_lock in write mode.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---

Changes in v8:
 - New patch

Changes in v8 vs RFC:
 - Removed all read_locks after discussion with Roger in #xendevel
 - pci_release_devices() now returns the first error code
 - extended commit message
 - added missing lock in pci_remove_device()
 - extended locked region in pci_add_device() to protect list_del() calls
---
 xen/common/domain.c                         |  1 +
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  9 ++-
 xen/drivers/passthrough/pci.c               | 68 +++++++++++++++++----
 xen/drivers/passthrough/vtd/iommu.c         |  9 ++-
 xen/include/xen/sched.h                     |  1 +
 5 files changed, 74 insertions(+), 14 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index caaa402637..5d8a8836da 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -645,6 +645,7 @@ struct domain *domain_create(domid_t domid,
 
 #ifdef CONFIG_HAS_PCI
     INIT_LIST_HEAD(&d->pdev_list);
+    rwlock_init(&d->pci_lock);
 #endif
 
     /* All error paths can depend on the above setup. */
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 94e3775506..e2f2e2e950 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -476,8 +476,13 @@ static int cf_check reassign_device(
 
     if ( devfn == pdev->devfn && pdev->domain != target )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
+        write_lock(&pdev->domain->pci_lock);
+        list_del(&pdev->domain_list);
+        write_unlock(&pdev->domain->pci_lock);
+
+        write_lock(&target->pci_lock);
+        list_add(&pdev->domain_list, &target->pdev_list);
+        write_unlock(&target->pci_lock);
     }
 
     /*
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 95846e84f2..5b4632ead2 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -454,7 +454,9 @@ static void __init _pci_hide_device(struct pci_dev *pdev)
     if ( pdev->domain )
         return;
     pdev->domain = dom_xen;
+    write_lock(&dom_xen->pci_lock);
     list_add(&pdev->domain_list, &dom_xen->pdev_list);
+    write_unlock(&dom_xen->pci_lock);
 }
 
 int __init pci_hide_device(unsigned int seg, unsigned int bus,
@@ -747,6 +749,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
     ret = 0;
     if ( !pdev->domain )
     {
+        write_lock(&hardware_domain->pci_lock);
         pdev->domain = hardware_domain;
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
 
@@ -760,6 +763,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
             printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
             list_del(&pdev->domain_list);
             pdev->domain = NULL;
+            write_unlock(&hardware_domain->pci_lock);
             goto out;
         }
         ret = iommu_add_device(pdev);
@@ -768,8 +772,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
             vpci_remove_device(pdev);
             list_del(&pdev->domain_list);
             pdev->domain = NULL;
+            write_unlock(&hardware_domain->pci_lock);
             goto out;
         }
+        write_unlock(&hardware_domain->pci_lock);
     }
     else
         iommu_enable_device(pdev);
@@ -812,11 +818,13 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
         {
+            write_lock(&pdev->domain->pci_lock);
             vpci_remove_device(pdev);
             pci_cleanup_msi(pdev);
             ret = iommu_remove_device(pdev);
             if ( pdev->domain )
                 list_del(&pdev->domain_list);
+            write_unlock(&pdev->domain->pci_lock);
             printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
             free_pdev(pseg, pdev);
             break;
@@ -887,26 +895,62 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
 
 int pci_release_devices(struct domain *d)
 {
-    struct pci_dev *pdev, *tmp;
-    u8 bus, devfn;
-    int ret;
+    int combined_ret;
+    LIST_HEAD(failed_pdevs);
 
     pcidevs_lock();
-    ret = arch_pci_clean_pirqs(d);
-    if ( ret )
+    write_lock(&d->pci_lock);
+    combined_ret = arch_pci_clean_pirqs(d);
+    if ( combined_ret )
     {
         pcidevs_unlock();
-        return ret;
+        write_unlock(&d->pci_lock);
+        return combined_ret;
     }
-    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
+
+    while ( !list_empty(&d->pdev_list) )
     {
-        bus = pdev->bus;
-        devfn = pdev->devfn;
-        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
+        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
+                                                struct pci_dev,
+                                                domain_list);
+        uint16_t seg = pdev->seg;
+        uint8_t bus = pdev->bus;
+        uint8_t devfn = pdev->devfn;
+        int ret;
+
+        write_unlock(&d->pci_lock);
+        ret = deassign_device(d, seg, bus, devfn);
+        write_lock(&d->pci_lock);
+        if ( ret )
+        {
+            bool still_present = false;
+            const struct pci_dev *tmp;
+
+            /*
+             * We need to check if deassign_device() left our pdev in
+             * the domain's list. As we dropped the lock, we can't be
+             * sure the list wasn't permuted in some random way, so we
+             * need to traverse the whole list.
+             */
+            for_each_pdev ( d, tmp )
+            {
+                if ( tmp == pdev )
+                {
+                    still_present = true;
+                    break;
+                }
+            }
+            if ( still_present )
+                list_move(&pdev->domain_list, &failed_pdevs);
+            combined_ret = combined_ret ?: ret;
+        }
     }
+
+    list_splice(&failed_pdevs, &d->pdev_list);
+    write_unlock(&d->pci_lock);
     pcidevs_unlock();
 
-    return ret;
+    return combined_ret;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
@@ -1125,7 +1169,9 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
             if ( !pdev->domain )
             {
                 pdev->domain = ctxt->d;
+                write_lock(&ctxt->d->pci_lock);
                 list_add(&pdev->domain_list, &ctxt->d->pdev_list);
+                write_unlock(&ctxt->d->pci_lock);
                 setup_one_hwdom_device(ctxt, pdev);
             }
             else if ( pdev->domain == dom_xen )
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 0e3062c820..55ee3f110d 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2806,7 +2806,14 @@ static int cf_check reassign_device_ownership(
 
     if ( devfn == pdev->devfn && pdev->domain != target )
     {
-        list_move(&pdev->domain_list, &target->pdev_list);
+        write_lock(&pdev->domain->pci_lock);
+        list_del(&pdev->domain_list);
+        write_unlock(&pdev->domain->pci_lock);
+
+        write_lock(&target->pci_lock);
+        list_add(&pdev->domain_list, &target->pdev_list);
+        write_unlock(&target->pci_lock);
+
         pdev->domain = target;
     }
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 85242a73d3..80dd150bbf 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -460,6 +460,7 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+    rwlock_t pci_lock;
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
-- 
2.41.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v8 03/13] vpci: restrict unhandled read/write operations for guests
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
  2023-07-20  0:32 ` [PATCH v8 01/13] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-20 11:32   ` Roger Pau Monné
  2023-07-20  0:32 ` [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

A guest would be able to read and write those registers which are not
emulated and have no respective vPCI handlers, and would thus be able
to access the hardware directly.
To prevent a guest from reading from and writing to the unhandled
registers, make sure only the hardware domain can access the hardware
directly and restrict guests from doing so.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Since v6:
- do not use is_hwdom parameter for vpci_{read|write}_hw and use
  current->domain internally
- update commit message
New in v6
---
 xen/drivers/vpci/vpci.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index f22cbf2112..a6d2cf8660 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -233,6 +233,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
 {
     uint32_t data;
 
+    /* Guest domains are not allowed to read real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return ~(uint32_t)0;
+
     switch ( size )
     {
     case 4:
@@ -273,9 +277,13 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
     return data;
 }
 
-static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
-                          uint32_t data)
+static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg,
+                          unsigned int size, uint32_t data)
 {
+    /* Guest domains are not allowed to write real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return;
+
     switch ( size )
     {
     case 4:
-- 
2.41.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v8 06/13] rangeset: add RANGESETF_no_print flag
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (3 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-20  0:32 ` [PATCH v8 07/13] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko, Jan Beulich

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are range sets which should not be printed, so introduce a flag
which allows marking those as such. Implement relevant logic to skip
such entries while printing.

While at it also simplify the definition of the flags by directly
defining those without helpers.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Since v5:
- comment indentation (Jan)
Since v1:
- update BUG_ON with new flag
- simplify the definition of the flags
---
 xen/common/rangeset.c      | 5 ++++-
 xen/include/xen/rangeset.h | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index a6ef264046..f8b909d016 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -433,7 +433,7 @@ struct rangeset *rangeset_new(
     INIT_LIST_HEAD(&r->range_list);
     r->nr_ranges = -1;
 
-    BUG_ON(flags & ~RANGESETF_prettyprint_hex);
+    BUG_ON(flags & ~(RANGESETF_prettyprint_hex | RANGESETF_no_print));
     r->flags = flags;
 
     safe_strcpy(r->name, name ?: "(no name)");
@@ -575,6 +575,9 @@ void rangeset_domain_printk(
 
     list_for_each_entry ( r, &d->rangesets, rangeset_list )
     {
+        if ( r->flags & RANGESETF_no_print )
+            continue;
+
         printk("    ");
         rangeset_printk(r);
         printk("\n");
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index 135f33f606..f7c69394d6 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -49,8 +49,9 @@ void rangeset_limit(
 
 /* Flags for passing to rangeset_new(). */
  /* Pretty-print range limits in hexadecimal. */
-#define _RANGESETF_prettyprint_hex 0
-#define RANGESETF_prettyprint_hex  (1U << _RANGESETF_prettyprint_hex)
+#define RANGESETF_prettyprint_hex   (1U << 0)
+ /* Do not print entries marked with this flag. */
+#define RANGESETF_no_print          (1U << 1)
 
 bool_t __must_check rangeset_is_empty(
     const struct rangeset *r);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v8 05/13] vpci/header: implement guest BAR register handlers
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (5 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 07/13] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-20 16:01   ` Roger Pau Monné
  2023-07-21 10:36   ` Rahul Singh
  2023-07-20  0:32 ` [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add relevant vpci register handlers when assigning PCI device to a domain
and remove those when de-assigning. This allows having different
handlers for different domains, e.g. hwdom and other guests.

Emulate guest BAR register values: this allows creating a guest view
of the registers and emulates size and properties probe as it is done
during PCI device enumeration by the guest.

All empty, IO and ROM BARs for guests are emulated by returning 0 on
reads and ignoring writes: these BARs are special in this respect, as
their lower bits have special meaning, so returning the default ~0 on
read may confuse the guest OS.

Memory decoding is initially disabled when used by guests in order to
prevent the BAR from being placed on top of a RAM region.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---

Since v6:
- unify the writing of the PCI_COMMAND register on the
  error path into a label
- do not introduce bar_ignore_access helper and open code
- s/guest_bar_ignore_read/empty_bar_read
- update error message in guest_bar_write
- only setup empty_bar_read for IO if !x86
Since v5:
- make sure that the guest set address has the same page offset
  as the physical address on the host
- remove guest_rom_{read|write} as those just implement the default
  behaviour of the registers not being handled
- adjusted comment for struct vpci.addr field
- add guest handlers for BARs which are not handled and will otherwise
  return ~0 on read and ignore writes. The BARs are special in this
  respect, as their lower bits have special meaning, so returning ~0
  doesn't seem to be right
Since v4:
- updated commit message
- s/guest_addr/guest_reg
Since v3:
- squashed two patches: dynamic add/remove handlers and guest BAR
  handler implementation
- fix guest BAR read of the high part of a 64bit BAR (Roger)
- add error handling to vpci_assign_device
- s/dom%pd/%pd
- blank line before return
Since v2:
- remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
  has been eliminated from being built on x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - simplify some code
 - use gdprintk + error code instead of gprintk
 - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
   so these do not get compiled for x86
 - removed unneeded is_system_domain check
 - re-work guest read/write to be much simpler and do more work on write
   than read which is expected to be called more frequently
 - removed one too obvious comment
---
 xen/drivers/vpci/header.c | 156 +++++++++++++++++++++++++++++++-------
 xen/include/xen/vpci.h    |   3 +
 2 files changed, 130 insertions(+), 29 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 2780fcae72..5dc9b5338b 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -457,6 +457,71 @@ static void cf_check bar_write(
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static void cf_check guest_bar_write(const struct pci_dev *pdev,
+                                     unsigned int reg, uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    bool hi = false;
+    uint64_t guest_reg = bar->guest_reg;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+    {
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+
+    guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
+    guest_reg |= (uint64_t)val << (hi ? 32 : 0);
+
+    guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
+
+    /*
+     * Make sure that the guest set address has the same page offset
+     * as the physical address on the host or otherwise things won't work as
+     * expected.
+     */
+    if ( (guest_reg & (~PAGE_MASK & PCI_BASE_ADDRESS_MEM_MASK)) !=
+         (bar->addr & ~PAGE_MASK) )
+    {
+        gprintk(XENLOG_WARNING,
+                "%pp: ignored BAR %zu write attempting to change page offset\n",
+                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
+        return;
+    }
+
+    bar->guest_reg = guest_reg;
+}
+
+static uint32_t cf_check guest_bar_read(const struct pci_dev *pdev,
+                                        unsigned int reg, void *data)
+{
+    const struct vpci_bar *bar = data;
+    bool hi = false;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    return bar->guest_reg >> (hi ? 32 : 0);
+}
+
+static uint32_t cf_check empty_bar_read(const struct pci_dev *pdev,
+                                        unsigned int reg, void *data)
+{
+    return 0;
+}
+
 static void cf_check rom_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -517,6 +582,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_header *header = &pdev->vpci->header;
     struct vpci_bar *bars = header->bars;
     int rc;
+    bool is_hwdom = is_hardware_domain(pdev->domain);
 
     ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
@@ -558,13 +624,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            rc = vpci_add_register(pdev->vpci,
+                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                                   is_hwdom ? bar_write : guest_bar_write,
+                                   reg, 4, &bars[i]);
             if ( rc )
-            {
-                pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-                return rc;
-            }
+                goto fail;
 
             continue;
         }
@@ -573,6 +638,17 @@ static int cf_check init_bars(struct pci_dev *pdev)
         if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
         {
             bars[i].type = VPCI_BAR_IO;
+
+#ifndef CONFIG_X86
+            if ( !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                       reg, 4, &bars[i]);
+                if ( rc )
+                    goto fail;
+            }
+#endif
+
             continue;
         }
         if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
@@ -584,14 +660,20 @@ static int cf_check init_bars(struct pci_dev *pdev)
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
 
         if ( size == 0 )
         {
             bars[i].type = VPCI_BAR_EMPTY;
+
+            if ( !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                       reg, 4, &bars[i]);
+                if ( rc )
+                    goto fail;
+            }
+
             continue;
         }
 
@@ -599,34 +681,50 @@ static int cf_check init_bars(struct pci_dev *pdev)
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                               is_hwdom ? bar_write : guest_bar_write,
+                               reg, 4, &bars[i]);
         if ( rc )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
     }
 
-    /* Check expansion ROM. */
-    rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
-    if ( rc > 0 && size )
+    /* Check expansion ROM: we do not handle ROM for guests. */
+    if ( is_hwdom )
     {
-        struct vpci_bar *rom = &header->bars[num_bars];
+        rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
+        if ( rc > 0 && size )
+        {
+            struct vpci_bar *rom = &header->bars[num_bars];
 
-        rom->type = VPCI_BAR_ROM;
-        rom->size = size;
-        rom->addr = addr;
-        header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
-                              PCI_ROM_ADDRESS_ENABLE;
+            rom->type = VPCI_BAR_ROM;
+            rom->size = size;
+            rom->addr = addr;
+            header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
+                                  PCI_ROM_ADDRESS_ENABLE;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
-                               4, rom);
-        if ( rc )
-            rom->type = VPCI_BAR_EMPTY;
+            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
+                                   rom_reg, 4, rom);
+            if ( rc )
+                rom->type = VPCI_BAR_EMPTY;
+        }
+    }
+    else
+    {
+        if ( !is_hwdom )
+        {
+            rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                   rom_reg, 4, &header->bars[num_bars]);
+            if ( rc )
+                goto fail;
+        }
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
+
+ fail:
+    pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    return rc;
 }
 REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 44296623e1..486a655e8d 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -67,7 +67,10 @@ struct vpci {
     struct vpci_header {
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            /* Physical (host) address. */
             uint64_t addr;
+            /* Guest view of the BAR: address and lower bits. */
+            uint64_t guest_reg;
             uint64_t size;
             enum {
                 VPCI_BAR_EMPTY,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v8 07/13] vpci/header: handle p2m range sets per BAR
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (4 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 06/13] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-21 11:49   ` Roger Pau Monné
  2023-07-20  0:32 ` [PATCH v8 05/13] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Instead of handling a single range set that contains all the memory
regions of all the BARs and ROM, have one range set per BAR.
As the range sets are now created when a PCI device is added and destroyed
when it is removed, make them named and accounted.

Note that rangesets were chosen here despite there being only up to
3 separate ranges in each set (typically just 1): a rangeset per BAR
was chosen for ease of implementation and re-usability of existing code.

This is in preparation for making non-identity mappings in p2m for the MMIOs.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Since v6:
- update according to the new locking scheme
- remove odd fail label in modify_bars
Since v5:
- fix comments
- move rangeset allocation to init_bars and only allocate
  for MAPPABLE BARs
- check for overlap with the already setup BAR ranges
Since v4:
- use named range sets for BARs (Jan)
- changes required by the new locking scheme
- updated commit message (Jan)
Since v3:
- re-work vpci_cancel_pending accordingly to the per-BAR handling
- s/num_mem_ranges/map_pending and s/uint8_t/bool
- ASSERT(bar->mem) in modify_bars
- create and destroy the rangesets on add/remove
---
 xen/drivers/vpci/header.c | 235 ++++++++++++++++++++++++++++----------
 xen/drivers/vpci/vpci.c   |   6 +
 xen/include/xen/vpci.h    |   3 +-
 3 files changed, 181 insertions(+), 63 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 5dc9b5338b..eb07fa0bb2 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -141,63 +141,106 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-    if ( v->vpci.mem )
+    struct pci_dev *pdev = v->vpci.pdev;
+
+    if ( !pdev )
+        return false;
+
+    if ( v->vpci.map_pending )
     {
         struct map_data data = {
             .d = v->domain,
             .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
         };
-        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
-
-        if ( rc == -ERESTART )
-            return true;
+        struct vpci_header *header = &pdev->vpci->header;
+        unsigned int i;
 
         write_lock(&v->domain->pci_lock);
-        spin_lock(&v->vpci.pdev->vpci->lock);
-        /* Disable memory decoding unconditionally on failure. */
-        modify_decoding(v->vpci.pdev,
-                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
-                        !rc && v->vpci.rom_only);
-        spin_unlock(&v->vpci.pdev->vpci->lock);
-
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
-        if ( rc )
-            /*
-             * FIXME: in case of failure remove the device from the domain.
-             * Note that there might still be leftover mappings. While this is
-             * safe for Dom0, for DomUs the domain will likely need to be
-             * killed in order to avoid leaking stale p2m mappings on
-             * failure.
-             */
-            vpci_remove_device(v->vpci.pdev);
+
+        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        {
+            struct vpci_bar *bar = &header->bars[i];
+            int rc;
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_consume_ranges(bar->mem, map_range, &data);
+
+            if ( rc == -ERESTART )
+            {
+                write_unlock(&v->domain->pci_lock);
+                return true;
+            }
+
+            spin_lock(&pdev->vpci->lock);
+            /* Disable memory decoding unconditionally on failure. */
+            modify_decoding(pdev, rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY :
+                                       v->vpci.cmd, !rc && v->vpci.rom_only);
+            spin_unlock(&pdev->vpci->lock);
+
+            if ( rc )
+            {
+                /*
+                 * FIXME: in case of failure remove the device from the domain.
+                 * Note that there might still be leftover mappings. While this
+                 * is safe for Dom0, for DomUs the domain needs to be killed in
+                 * order to avoid leaking stale p2m mappings on failure.
+                 */
+                v->vpci.map_pending = false;
+
+                if ( is_hardware_domain(v->domain) )
+                {
+                    vpci_remove_device(pdev);
+                    write_unlock(&v->domain->pci_lock);
+                }
+                else
+                {
+                    write_unlock(&v->domain->pci_lock);
+                    domain_crash(v->domain);
+                }
+                return false;
+            }
+        }
         write_unlock(&v->domain->pci_lock);
+
+        v->vpci.map_pending = false;
     }
 
+
     return false;
 }
 
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
-                            struct rangeset *mem, uint16_t cmd)
+                            uint16_t cmd)
 {
     struct map_data data = { .d = d, .map = true };
-    int rc;
+    struct vpci_header *header = &pdev->vpci->header;
+    int rc = 0;
+    unsigned int i;
 
     ASSERT(rw_is_locked(&d->pci_lock));
 
-    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        /*
-         * It's safe to drop and reacquire the lock in this context
-         * without risking pdev disappearing because devices cannot be
-         * removed until the initial domain has been started.
-         */
-        read_unlock(&d->pci_lock);
-        process_pending_softirqs();
-        read_lock(&d->pci_lock);
-    }
+        struct vpci_bar *bar = &header->bars[i];
 
-    rangeset_destroy(mem);
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
+
+        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
+                                              &data)) == -ERESTART )
+        {
+            /*
+             * It's safe to drop and reacquire the lock in this context
+             * without risking pdev disappearing because devices cannot be
+             * removed until the initial domain has been started.
+             */
+            write_unlock(&d->pci_lock);
+            process_pending_softirqs();
+            write_lock(&d->pci_lock);
+        }
+    }
     if ( !rc )
         modify_decoding(pdev, cmd, false);
 
@@ -205,10 +248,12 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
 }
 
 static void defer_map(struct domain *d, struct pci_dev *pdev,
-                      struct rangeset *mem, uint16_t cmd, bool rom_only)
+                      uint16_t cmd, bool rom_only)
 {
     struct vcpu *curr = current;
 
+    ASSERT(!!rw_is_write_locked(&pdev->domain->pci_lock));
+
     /*
      * FIXME: when deferring the {un}map the state of the device should not
      * be trusted. For example the enable bit is toggled after the device
@@ -216,7 +261,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
      * started for the same device if the domain is not well-behaved.
      */
     curr->vpci.pdev = pdev;
-    curr->vpci.mem = mem;
+    curr->vpci.map_pending = true;
     curr->vpci.cmd = cmd;
     curr->vpci.rom_only = rom_only;
     /*
@@ -230,33 +275,34 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
-    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
     const struct domain *d;
     const struct vpci_msix *msix = pdev->vpci->msix;
-    unsigned int i;
+    unsigned int i, j;
     int rc;
+    bool map_pending;
 
     ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    if ( !mem )
-        return -ENOMEM;
-
     /*
-     * Create a rangeset that represents the current device BARs memory region
-     * and compare it against all the currently active BAR memory regions. If
-     * an overlap is found, subtract it from the region to be mapped/unmapped.
+     * Create a rangeset per BAR that represents the current device memory
+     * region and compare it against all the currently active BAR memory
+     * regions. If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
      *
-     * First fill the rangeset with all the BARs of this device or with the ROM
+     * First fill the rangesets with the BARs of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        const struct vpci_bar *bar = &header->bars[i];
+        struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
+        if ( !bar->mem )
+            continue;
+
         if ( !MAPPABLE_BAR(bar) ||
              (rom_only ? bar->type != VPCI_BAR_ROM
                        : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
@@ -272,14 +318,31 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(mem, start, end);
+        rc = rangeset_add_range(bar->mem, start, end);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
                    start, end, rc);
-            rangeset_destroy(mem);
             return rc;
         }
+
+        /* Check for overlap with the already setup BAR ranges. */
+        for ( j = 0; j < i; j++ )
+        {
+            struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                printk(XENLOG_G_WARNING
+                       "Failed to remove overlapping range [%lx, %lx]: %d\n",
+                       start, end, rc);
+                return rc;
+            }
+        }
     }
 
     /* Remove any MSIX regions if present. */
@@ -289,14 +352,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
                                      vmsix_table_size(pdev->vpci, i) - 1);
 
-        rc = rangeset_remove_range(mem, start, end);
-        if ( rc )
+        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
         {
-            printk(XENLOG_G_WARNING
-                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
-                   start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            const struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                printk(XENLOG_G_WARNING
+                       "Failed to remove MSIX table [%lx, %lx]: %d\n",
+                       start, end, rc);
+                return rc;
+            }
         }
     }
 
@@ -341,7 +411,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                 unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
                 if ( !bar->enabled ||
-                     !rangeset_overlaps_range(mem, start, end) ||
+                     !rangeset_overlaps_range(bar->mem, start, end) ||
                      /*
                       * If only the ROM enable bit is toggled check against
                       * other BARs in the same device for overlaps, but not
@@ -350,12 +420,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                      (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
                     continue;
 
-                rc = rangeset_remove_range(mem, start, end);
+                rc = rangeset_remove_range(bar->mem, start, end);
                 if ( rc )
                 {
                     printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
                            start, end, rc);
-                    rangeset_destroy(mem);
                     return rc;
                 }
             }
@@ -380,10 +449,23 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
          * will always be to establish mappings and process all the BARs.
          */
         ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
-        return apply_map(pdev->domain, pdev, mem, cmd);
+        return apply_map(pdev->domain, pdev, cmd);
     }
 
-    defer_map(dev->domain, dev, mem, cmd, rom_only);
+    /* Find out whether any memory ranges are left after MSI-X and overlaps. */
+    map_pending = false;
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        if ( !rangeset_is_empty(header->bars[i].mem) )
+        {
+            map_pending = true;
+            break;
+        }
+
+    /* If there's no mapping work write the command register now. */
+    if ( !map_pending )
+        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    else
+        defer_map(dev->domain, dev, cmd, rom_only);
 
     return 0;
 }
@@ -574,6 +656,19 @@ static void cf_check rom_write(
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static int bar_add_rangeset(struct pci_dev *pdev, struct vpci_bar *bar, int i)
+{
+    char str[32];
+
+    snprintf(str, sizeof(str), "%pp:BAR%d", &pdev->sbdf, i);
+
+    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
+    if ( !bar->mem )
+        return -ENOMEM;
+
+    return 0;
+}
+
 static int cf_check init_bars(struct pci_dev *pdev)
 {
     uint16_t cmd;
@@ -657,6 +752,13 @@ static int cf_check init_bars(struct pci_dev *pdev)
         else
             bars[i].type = VPCI_BAR_MEM32;
 
+        rc = bar_add_rangeset(pdev, &bars[i], i);
+        if ( rc )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            return rc;
+        }
+
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
@@ -707,6 +809,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
                                    rom_reg, 4, rom);
             if ( rc )
                 rom->type = VPCI_BAR_EMPTY;
+            else
+            {
+                rc = bar_add_rangeset(pdev, rom, i);
+                if ( rc )
+                {
+                    rom->type = VPCI_BAR_EMPTY;
+                    return rc;
+                }
+            }
         }
     }
     else
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index a97710a806..ca3505ecb7 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    unsigned int i;
+
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
@@ -63,6 +65,10 @@ void vpci_remove_device(struct pci_dev *pdev)
             if ( pdev->vpci->msix->table[i] )
                 iounmap(pdev->vpci->msix->table[i]);
     }
+
+    for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
+        rangeset_destroy(pdev->vpci->header.bars[i].mem);
+
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 486a655e8d..b78dd6512b 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -72,6 +72,7 @@ struct vpci {
             /* Guest view of the BAR: address and lower bits. */
             uint64_t guest_reg;
             uint64_t size;
+            struct rangeset *mem;
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -156,9 +157,9 @@ struct vpci {
 
 struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
-    struct rangeset *mem;
     struct pci_dev *pdev;
     uint16_t cmd;
+    bool map_pending : 1;
     bool rom_only : 1;
 };
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v8 08/13] vpci/header: program p2m with guest BAR view
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (7 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-21 13:05   ` Roger Pau Monné
  2023-07-24 10:43   ` Jan Beulich
  2023-07-20  0:32 ` [PATCH v8 10/13] vpci/header: reset the command register when adding devices Volodymyr Babchuk
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account the guest's BAR view and program its p2m accordingly:
the gfn is the guest's view of the BAR and the mfn is the physical BAR
value as set up by the PCI bus driver in the hardware domain.
This way the hardware domain sees the physical BAR values and the guest
sees emulated ones.
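
The gfn computation this patch adds to map_range() can be sketched as a
small standalone model (PFN_DOWN is simplified to a plain shift and
guest_frame is a hypothetical name; the real code uses gfn_t/mfn_t
typesafe wrappers):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PFN_DOWN(x) ((uint64_t)(x) >> PAGE_SHIFT)

/*
 * Toy model of the gfn computation in map_range(): the range being
 * consumed starts at machine frame s, which may lie past the BAR start
 * (holes, partially consumed ranges), so the guest frame is the guest
 * BAR base frame plus the offset of s from the physical BAR base frame.
 */
static uint64_t guest_frame(uint64_t guest_reg, uint64_t bar_addr, uint64_t s)
{
    uint64_t start_gfn = PFN_DOWN(guest_reg);
    uint64_t start_mfn = PFN_DOWN(bar_addr);

    return start_gfn + (s - start_mfn);
}
```

For the hardware domain guest_reg and bar_addr coincide, so the mapping
degenerates to the identity mapping used before this patch.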

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- remove debug print in map_range callback
- remove "identity" from the debug print
Since v4:
- moved start_{gfn|mfn} calculation into map_range
- pass vpci_bar in the map_data instead of start_{gfn|mfn}
- s/guest_addr/guest_reg
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index eb07fa0bb2..e1a448b674 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -30,6 +30,7 @@
 
 struct map_data {
     struct domain *d;
+    const struct vpci_bar *bar;
     bool map;
 };
 
@@ -41,8 +42,21 @@ static int cf_check map_range(
 
     for ( ; ; )
     {
+        /* Start address of the BAR as seen by the guest. */
+        gfn_t start_gfn = _gfn(PFN_DOWN(is_hardware_domain(map->d)
+                                        ? map->bar->addr
+                                        : map->bar->guest_reg));
+        /* Physical start address of the BAR. */
+        mfn_t start_mfn = _mfn(PFN_DOWN(map->bar->addr));
         unsigned long size = e - s + 1;
 
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        start_gfn = gfn_add(start_gfn, s - mfn_x(start_mfn));
+
         /*
          * ARM TODOs:
          * - On ARM whether the memory is prefetchable or not should be passed
@@ -52,8 +66,8 @@ static int cf_check map_range(
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, start_gfn, size, _mfn(s))
+                      : unmap_mmio_regions(map->d, start_gfn, size, _mfn(s));
         if ( rc == 0 )
         {
             *c += size;
@@ -62,8 +76,8 @@ static int cf_check map_range(
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to %smap [%lx, %lx] for %pd: %d\n",
+                   map->map ? "" : "un", s, e, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -165,6 +179,7 @@ bool vpci_process_pending(struct vcpu *v)
             if ( rangeset_is_empty(bar->mem) )
                 continue;
 
+            data.bar = bar;
             rc = rangeset_consume_ranges(bar->mem, map_range, &data);
 
             if ( rc == -ERESTART )
@@ -228,6 +243,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
         if ( rangeset_is_empty(bar->mem) )
             continue;
 
+        data.bar = bar;
         while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
                                               &data)) == -ERESTART )
         {
-- 
2.41.0



* [PATCH v8 10/13] vpci/header: reset the command register when adding devices
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (8 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 08/13] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-21 13:37   ` Roger Pau Monné
  2023-07-20  0:32 ` [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Reset the command register when assigning a PCI device to a guest:
according to the PCI spec the PCI_COMMAND register is typically all 0's
after reset, but this might not be true for the guest as it needs
to respect the host's settings.
For that reason, do not write 0 to the PCI_COMMAND register directly,
but go through the corresponding emulation layer (cmd_write), which
will take care of the actual bits written.
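
The effect of routing the guest's write of 0 through the emulation layer
can be illustrated with a toy model (merge_cmd and GUEST_RW_BITS are
hypothetical names, and the actual set of guest-controllable bits in
cmd_write() differs and evolves over the series):

```c
#include <assert.h>
#include <stdint.h>

/* Toy set of bits the guest may change: IO, memory, bus master. */
#define GUEST_RW_BITS 0x0007u

/*
 * Hypothetical sketch: a guest write is merged with the host-owned bits,
 * so a guest write of 0 leaves bits the host has set (e.g.
 * PCI_COMMAND_SERR) intact instead of clobbering them.
 */
static uint16_t merge_cmd(uint16_t host_val, uint16_t guest_write)
{
    return (host_val & (uint16_t)~GUEST_RW_BITS) | (guest_write & GUEST_RW_BITS);
}
```

This is why a direct pci_conf_write16(pdev->sbdf, PCI_COMMAND, 0) would
be wrong here, while cmd_write(pdev, PCI_COMMAND, 0, header) is safe.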

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v6:
- use cmd_write directly without introducing emulate_cmd_reg
- update commit message with more description on all 0's in PCI_COMMAND
Since v5:
- updated commit message
Since v1:
 - do not write 0 to the command register, but respect host settings.
---
 xen/drivers/vpci/header.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index ae05d242a5..44a9940fb9 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -749,6 +749,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
      */
     ASSERT(header->guest_cmd == 0);
 
+    /* Reset the command register for guests. */
+    if ( !is_hwdom )
+        cmd_write(pdev, PCI_COMMAND, 0, header);
+
     /* Setup a handler for the command register. */
     rc = vpci_add_register(pdev->vpci, cmd_read, cmd_write, PCI_COMMAND,
                            2, header);
-- 
2.41.0



* [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (9 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 10/13] vpci/header: reset the command register when adding devices Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-21 13:32   ` Roger Pau Monné
  2023-07-24 11:03   ` Jan Beulich
  2023-07-20  0:32 ` [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology " Volodymyr Babchuk
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
guest's view of this will want to be zero initially, the host having set
it to 1 may not easily be overwritten with 0, or else we'd effectively
imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
proper emulation in order to honor host's settings.

There are examples of emulators [1], [2] which already deal with PCI_COMMAND
register emulation, and it seems the most they care about is the INTx bit
(besides the IO/memory enable and bus master bits, which are write-through).
This could be because properly emulating the PCI_COMMAND register requires
knowledge of the whole PCI topology, e.g. whether a setting in a device's
command register is consistent with the upstream port, etc.
Presumably others ignore full emulation because of this complexity, and I
don't think it can easily be done in Xen's case either.

According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
Device Control" the reset state of the command register is typically 0,
so when assigning a PCI device use 0 as the initial state for the guest's view
of the command register.

For now our emulation only makes sure INTx is set according to the host
requirements, i.e. depending on MSI/MSI-X enabled state.

This implementation and the decision to only emulate INTx bit for now
is based on the previous discussion at [3].

[1] https://github.com/qemu/qemu/blob/master/hw/xen/xen_pt_config_init.c#L310
[2] https://github.com/projectacrn/acrn-hypervisor/blob/master/hypervisor/hw/pci.c#L336
[3] https://patchwork.kernel.org/project/xen-devel/patch/20210903100831.177748-9-andr2000@gmail.com/
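
The INTx-only policy described above can be sketched as a standalone
model (effective_cmd is a hypothetical name; the real logic lives in
cmd_write(), which also stores the unmodified value as guest_cmd for
cmd_read() to return):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400

/*
 * Hypothetical model: the value reaching the physical register is the
 * guest's value, except that INTx stays disabled while MSI or MSI-X is
 * enabled; the guest's stored view is the unmodified written value.
 */
static uint16_t effective_cmd(uint16_t guest_val, bool msi_enabled,
                              bool msix_enabled)
{
    uint16_t cmd = guest_val;

    if ( msi_enabled || msix_enabled )
        cmd |= PCI_COMMAND_INTX_DISABLE;

    return cmd;
}
```

So a guest clearing PCI_COMMAND_INTX_DISABLE while MSI is enabled sees
the bit cleared in its own view but cannot actually enable INTx on the
hardware.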

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---

Since v6:
- fold guest's logic into cmd_write
- implement cmd_read, so we can report emulated INTx state to guests
- introduce header->guest_cmd to hold the emulated state of the
  PCI_COMMAND register for guests
Since v5:
- add additional check for MSI-X enabled while altering INTX bit
- make sure INTx disabled while guests enable MSI/MSI-X
Since v3:
- gate more code on CONFIG_HAS_MSI
- removed logic for the case when MSI/MSI-X not enabled
---
 xen/drivers/vpci/header.c | 38 +++++++++++++++++++++++++++++++++++++-
 xen/drivers/vpci/msi.c    |  4 ++++
 xen/drivers/vpci/msix.c   |  4 ++++
 xen/include/xen/vpci.h    |  3 +++
 4 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index e1a448b674..ae05d242a5 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -486,11 +486,27 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     return 0;
 }
 
+/* TODO: Add proper emulation for all bits of the command register. */
 static void cf_check cmd_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
 {
     struct vpci_header *header = data;
 
+    if ( !is_hardware_domain(pdev->domain) )
+    {
+        struct vpci_header *header = data;
+
+        header->guest_cmd = cmd;
+#ifdef CONFIG_HAS_PCI_MSI
+        if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
+            /*
+             * Guest wants to enable INTx, but it can't be enabled
+             * if MSI/MSI-X enabled.
+             */
+            cmd |= PCI_COMMAND_INTX_DISABLE;
+#endif
+    }
+
     /*
      * Let Dom0 play with all the bits directly except for the memory
      * decoding one.
@@ -507,6 +523,19 @@ static void cf_check cmd_write(
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
+static uint32_t cmd_read(const struct pci_dev *pdev, unsigned int reg,
+                         void *data)
+{
+    if ( !is_hardware_domain(pdev->domain) )
+    {
+        struct vpci_header *header = data;
+
+        return header->guest_cmd;
+    }
+
+    return pci_conf_read16(pdev->sbdf, reg);
+}
+
 static void cf_check bar_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -713,8 +742,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
         return -EOPNOTSUPP;
     }
 
+    /*
+     * According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
+     * Device Control" the reset state of the command register is
+     * typically all 0's, so this is used as initial value for the guests.
+     */
+    ASSERT(header->guest_cmd == 0);
+
     /* Setup a handler for the command register. */
-    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
+    rc = vpci_add_register(pdev->vpci, cmd_read, cmd_write, PCI_COMMAND,
                            2, header);
     if ( rc )
         return rc;
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index e63152c224..c37845a949 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -70,6 +70,10 @@ static void cf_check control_write(
 
         if ( vpci_msi_arch_enable(msi, pdev, vectors) )
             return;
+
+        /* Make sure guest doesn't enable INTx while enabling MSI. */
+        if ( !is_hardware_domain(pdev->domain) )
+            pci_intx(pdev, false);
     }
     else
         vpci_msi_arch_disable(msi, pdev);
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index 9481274579..eab1661b87 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -97,6 +97,10 @@ static void cf_check control_write(
         for ( i = 0; i < msix->max_entries; i++ )
             if ( !msix->entries[i].masked && msix->entries[i].updated )
                 update_entry(&msix->entries[i], pdev, i);
+
+        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
+        if ( !is_hardware_domain(pdev->domain) )
+            pci_intx(pdev, false);
     }
     else if ( !new_enabled && msix->enabled )
     {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index b78dd6512b..6099d2141d 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -87,6 +87,9 @@ struct vpci {
         } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
         /* At most 6 BARS + 1 expansion ROM BAR. */
 
+        /* Guest view of the PCI_COMMAND register. */
+        uint16_t guest_cmd;
+
         /*
          * Store whether the ROM enable bit is set (doesn't imply ROM BAR
          * is mapped into guest p2m) if there's a ROM BAR on the device.
-- 
2.41.0



* [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (6 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 05/13] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-20  6:50   ` Jan Beulich
                     ` (3 more replies)
  2023-07-20  0:32 ` [PATCH v8 08/13] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
                   ` (5 subsequent siblings)
  13 siblings, 4 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Assign an SBDF with bus 0 to the PCI devices being passed through.
In the resulting topology the PCIe devices reside on bus 0 of the root
complex itself (embedded endpoints).
This implementation is limited to the 32 devices allowed on a single
PCI bus.

Please note, that at the moment only function 0 of a multifunction
device can be passed through.
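
The slot allocation scheme can be modelled with a plain 32-bit bitmap
(alloc_virt_slot is a hypothetical name; the real add_virtual_device()
uses find_first_zero_bit()/__set_bit() on vpci_dev_assigned_map under
the domain's pci_lock):

```c
#include <assert.h>
#include <stdint.h>

#define VPCI_MAX_VIRT_DEV 32  /* PCI_SLOT(~0) + 1 */

/*
 * Toy model of the virtual device number allocator: find the first free
 * slot on the single virtual bus, mark it used, and return it; -1 stands
 * in for -ENOSPC. The guest devfn is then PCI_DEVFN(slot, 0), with
 * segment and bus both 0.
 */
static int alloc_virt_slot(uint32_t *assigned_map)
{
    for ( int slot = 0; slot < VPCI_MAX_VIRT_DEV; slot++ )
        if ( !(*assigned_map & (1u << slot)) )
        {
            *assigned_map |= 1u << slot;
            return slot;
        }

    return -1;
}
```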

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v8:
- Added write lock in add_virtual_device
Since v6:
- re-work wrt new locking scheme
- OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
Since v5:
- s/vpci_add_virtual_device/add_virtual_device and make it static
- call add_virtual_device from vpci_assign_device and do not use
  REGISTER_VPCI_INIT machinery
- add pcidevs_locked ASSERT
- use DECLARE_BITMAP for vpci_dev_assigned_map
Since v4:
- moved and re-worked guest sbdf initializers
- s/set_bit/__set_bit
- s/clear_bit/__clear_bit
- minor comment fix s/Virtual/Guest/
- added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
  later for counting the number of MMIO handlers required for a guest
  (Julien)
Since v3:
 - make use of VPCI_INIT
 - moved all new code to vpci.c which belongs to it
 - changed open-coded 31 to PCI_SLOT(~0)
 - added comments and code to reject multifunction devices with
   functions other than 0
 - updated comment about vpci_dev_next and made it unsigned int
 - implement roll back in case of error while assigning/deassigning devices
 - s/dom%pd/%pd
Since v2:
 - remove casts that are (a) malformed and (b) unnecessary
 - add new line for better readability
 - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
    functions are now completely gated with this config
 - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/drivers/vpci/vpci.c | 72 ++++++++++++++++++++++++++++++++++++++++-
 xen/include/xen/sched.h |  8 +++++
 xen/include/xen/vpci.h  | 11 +++++++
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index ca3505ecb7..baaafe4a2a 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -46,6 +46,16 @@ void vpci_remove_device(struct pci_dev *pdev)
         return;
 
     spin_lock(&pdev->vpci->lock);
+
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
+    {
+        __clear_bit(pdev->vpci->guest_sbdf.dev,
+                    &pdev->domain->vpci_dev_assigned_map);
+        pdev->vpci->guest_sbdf.sbdf = ~0;
+    }
+#endif
+
     while ( !list_empty(&pdev->vpci->handlers) )
     {
         struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
@@ -101,6 +111,10 @@ int vpci_add_handlers(struct pci_dev *pdev)
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+#endif
+
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
         rc = __start_vpci_array[i](pdev);
@@ -115,6 +129,54 @@ int vpci_add_handlers(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+static int add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    pci_sbdf_t sbdf = { 0 };
+    unsigned long new_dev_number;
+
+    if ( is_hardware_domain(d) )
+        return 0;
+
+    ASSERT(pcidevs_locked());
+
+    /*
+     * Each PCI bus supports 32 devices/slots at max or up to 256 when
+     * there are multi-function ones which are not yet supported.
+     */
+    if ( pdev->info.is_extfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+
+    write_lock(&pdev->domain->pci_lock);
+    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
+                                         VPCI_MAX_VIRT_DEV);
+    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
+    {
+        write_unlock(&pdev->domain->pci_lock);
+        return -ENOSPC;
+    }
+
+    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, e.g. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
+    pdev->vpci->guest_sbdf = sbdf;
+    write_unlock(&pdev->domain->pci_lock);
+
+    return 0;
+}
+
 /* Notify vPCI that device is assigned to guest. */
 int vpci_assign_device(struct pci_dev *pdev)
 {
@@ -125,8 +187,16 @@ int vpci_assign_device(struct pci_dev *pdev)
 
     rc = vpci_add_handlers(pdev);
     if ( rc )
-        vpci_deassign_device(pdev);
+        goto fail;
+
+    rc = add_virtual_device(pdev);
+    if ( rc )
+        goto fail;
+
+    return 0;
 
+ fail:
+    vpci_deassign_device(pdev);
     return rc;
 }
 #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 80dd150bbf..478bd21f3e 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -461,6 +461,14 @@ struct domain
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
     rwlock_t pci_lock;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * The bitmap which shows which device numbers are already used by the
+     * virtual PCI bus topology and is used to assign a unique SBDF to the
+     * next passed through virtual PCI device.
+     */
+    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 6099d2141d..c55c45f7a1 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
 
+/*
+ * Maximum number of devices supported by the virtual bus topology:
+ * each PCI bus supports 32 devices/slots at max or up to 256 when
+ * there are multi-function ones which are not yet supported.
+ */
+#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
+
 #define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
@@ -155,6 +162,10 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Guest SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };
 
-- 
2.41.0



* [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology for guests
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (10 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-20  6:54   ` Jan Beulich
                     ` (2 more replies)
  2023-07-20  0:32 ` [PATCH v8 13/13] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
  2023-07-20  0:41 ` [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
  13 siblings, 3 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are three originators of PCI configuration space accesses:
1. The domain that owns the physical host bridge: MMIO handlers are
installed so we can update the vPCI register handlers with the values
written by the hardware domain, i.e. the physical view of the registers
vs the guest's view of the configuration space.
2. Guest accesses to the passed through PCI devices: we need to properly
map the virtual bus topology to the physical one, i.e. pass the
configuration space accesses on to the corresponding physical devices.
3. Accesses to the emulated host PCI bridge. It doesn't exist in the
physical topology, i.e. it can't be mapped to any physical host bridge.
So all accesses to the host bridge itself need to be trapped and
emulated.
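
Case 2 boils down to a lookup over the domain's device list; a
standalone sketch with a stand-in structure (toy_pdev and translate are
hypothetical names; the real vpci_translate_virtual_device() walks
d->pdev_list with d->pci_lock held in read mode):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_pdev {
    uint32_t guest_sbdf;  /* SBDF as seen by the guest */
    uint32_t phys_sbdf;   /* SBDF of the physical device */
};

/*
 * Replace a guest SBDF with the matching physical one, mirroring what
 * vpci_translate_virtual_device() does over the real pdev list.
 */
static bool translate(const struct toy_pdev *devs, size_t n, uint32_t *sbdf)
{
    for ( size_t i = 0; i < n; i++ )
        if ( devs[i].guest_sbdf == *sbdf )
        {
            *sbdf = devs[i].phys_sbdf;
            return true;
        }

    return false;  /* the MMIO handler then returns ~0 for reads */
}
```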

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v8:
- locks moved out of vpci_translate_virtual_device()
Since v6:
- add pcidevs locking to vpci_translate_virtual_device
- update wrt to the new locking scheme
Since v5:
- add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
  case to simplify ifdefery
- add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
- reset output register on failed virtual SBDF translation
Since v4:
- indentation fixes
- constify struct domain
- updated commit message
- updates to the new locking scheme (pdev->vpci_lock)
Since v3:
- revisit locking
- move code to vpci.c
Since v2:
 - pass struct domain instead of struct vcpu
 - constify arguments where possible
 - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/arch/arm/vpci.c     | 47 +++++++++++++++++++++++++++++++++++++++--
 xen/drivers/vpci/vpci.c | 24 +++++++++++++++++++++
 xen/include/xen/vpci.h  |  7 ++++++
 3 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 3bc4bb5508..66701465af 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -28,10 +28,33 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
                           register_t *r, void *p)
 {
     struct pci_host_bridge *bridge = p;
-    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+    pci_sbdf_t sbdf;
     /* data is needed to prevent a pointer cast on 32bit */
     unsigned long data;
 
+    ASSERT(!bridge == !is_hardware_domain(v->domain));
+
+    sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+
+    /*
+     * For the passed through devices we need to map their virtual SBDF
+     * to the physical PCI device being passed through.
+     */
+    if ( !bridge )
+    {
+        bool translated;
+
+        read_lock(&v->domain->pci_lock);
+        translated = vpci_translate_virtual_device(v->domain, &sbdf);
+        read_unlock(&v->domain->pci_lock);
+
+        if ( !translated )
+        {
+            *r = ~0ul;
+            return 1;
+        }
+    }
+
     if ( vpci_ecam_read(sbdf, ECAM_REG_OFFSET(info->gpa),
                         1U << info->dabt.size, &data) )
     {
@@ -48,7 +71,27 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
                            register_t r, void *p)
 {
     struct pci_host_bridge *bridge = p;
-    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+    pci_sbdf_t sbdf;
+
+    ASSERT(!bridge == !is_hardware_domain(v->domain));
+
+    sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+
+    /*
+     * For the passed through devices we need to map their virtual SBDF
+     * to the physical PCI device being passed through.
+     */
+    if ( !bridge )
+    {
+        bool translated;
+
+        read_lock(&v->domain->pci_lock);
+        translated = vpci_translate_virtual_device(v->domain, &sbdf);
+        read_unlock(&v->domain->pci_lock);
+
+        if ( !translated )
+            return 1;
+    }
 
     return vpci_ecam_write(sbdf, ECAM_REG_OFFSET(info->gpa),
                            1U << info->dabt.size, r);
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index baaafe4a2a..2ce36e811d 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -177,6 +177,30 @@ static int add_virtual_device(struct pci_dev *pdev)
     return 0;
 }
 
+/*
+ * Find the physical device which is mapped to the virtual device
+ * and translate virtual SBDF to the physical one.
+ * This must hold domain's pci_lock in read mode.
+ */
+bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
+{
+    struct pci_dev *pdev;
+
+    ASSERT(!is_hardware_domain(d));
+
+    for_each_pdev( d, pdev )
+    {
+        if ( pdev->vpci && (pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf) )
+        {
+            /* Replace guest SBDF with the physical one. */
+            *sbdf = pdev->sbdf;
+            return true;
+        }
+    }
+
+    return false;
+}
+
 /* Notify vPCI that device is assigned to guest. */
 int vpci_assign_device(struct pci_dev *pdev)
 {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index c55c45f7a1..7d30fbdd28 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -286,6 +286,7 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
 /* Notify vPCI that device is assigned/de-assigned to/from guest. */
 int vpci_assign_device(struct pci_dev *pdev);
 #define vpci_deassign_device vpci_remove_device
+bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf);
 #else
 static inline int vpci_assign_device(struct pci_dev *pdev)
 {
@@ -295,6 +296,12 @@ static inline int vpci_assign_device(struct pci_dev *pdev)
 static inline void vpci_deassign_device(struct pci_dev *pdev)
 {
 };
+
+static inline bool vpci_translate_virtual_device(struct domain *d,
+                                                 pci_sbdf_t *sbdf)
+{
+    return false;
+}
 #endif
 
 #endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread
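
The lookup in the patch above boils down to a list scan keyed on the guest SBDF. Here is a standalone sketch of the same translate-or-fail flow, using simplified stand-in types rather than Xen's pci_sbdf_t/struct pci_dev (a failed translation leaves it to the caller to return all-ones read data):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for Xen's pci_sbdf_t / struct pci_dev. */
typedef struct { uint32_t sbdf; } pci_sbdf_t;

struct fake_pdev {
    pci_sbdf_t sbdf;        /* physical SBDF */
    pci_sbdf_t guest_sbdf;  /* virtual SBDF seen by the guest */
};

/*
 * Mirror of vpci_translate_virtual_device(): scan the domain's device
 * list for a matching guest SBDF and substitute the physical one.
 */
static bool translate_virtual_device(const struct fake_pdev *list, size_t n,
                                     pci_sbdf_t *sbdf)
{
    for ( size_t i = 0; i < n; i++ )
        if ( list[i].guest_sbdf.sbdf == sbdf->sbdf )
        {
            /* Replace guest SBDF with the physical one. */
            *sbdf = list[i].sbdf;
            return true;
        }

    return false;
}
```

In the real code the scan runs under the domain's pci_lock in read mode, so the device cannot disappear mid-walk.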

* [PATCH v8 13/13] xen/arm: account IO handlers for emulated PCI MSI-X
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (11 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology " Volodymyr Babchuk
@ 2023-07-20  0:32 ` Volodymyr Babchuk
  2023-07-20  0:41 ` [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
  13 siblings, 0 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Julien Grall, Julien Grall, Stefano Stabellini

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

At the moment, we always allocate an extra 16 slots for IO handlers
(see MAX_IO_HANDLER). So, while adding IO trap handlers for the emulated
MSI-X registers, we need to explicitly state that we have additional IO
handlers so that they are accounted for.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Acked-by: Julien Grall <jgrall@amazon.com>
---
Cc: Julien Grall <julien@xen.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
---
This actually moved here from part 2 of the prep work for PCI
passthrough on Arm, as this seems to be the proper place for it.

Since v5:
- optimize with IS_ENABLED(CONFIG_HAS_PCI_MSI) since VPCI_MAX_VIRT_DEV is
  defined unconditionally
New in v5
---
 xen/arch/arm/vpci.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 66701465af..cd9f5d0757 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -148,6 +148,8 @@ static int vpci_get_num_handlers_cb(struct domain *d,
 
 unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
 {
+    unsigned int count;
+
     if ( !has_vpci(d) )
         return 0;
 
@@ -168,7 +170,17 @@ unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
      * For guests each host bridge requires one region to cover the
      * configuration space. At the moment, we only expose a single host bridge.
      */
-    return 1;
+    count = 1;
+
+    /*
+     * There's a single MSI-X MMIO handler that deals with both PBA
+     * and MSI-X tables per each PCI device being passed through.
+     * Maximum number of emulated virtual devices is VPCI_MAX_VIRT_DEV.
+     */
+    if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
+        count += VPCI_MAX_VIRT_DEV;
+
+    return count;
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread
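
The accounting in domain_vpci_get_num_mmio_handlers() is plain arithmetic; a minimal sketch follows, with VPCI_MAX_VIRT_DEV hard-coded to an assumed value of 32 purely for illustration (the real value comes from the Xen headers):

```c
#include <assert.h>
#include <stdbool.h>

#define VPCI_MAX_VIRT_DEV 32  /* assumed value, for illustration only */

/* Mirrors domain_vpci_get_num_mmio_handlers() for a guest domain. */
static unsigned int num_mmio_handlers(bool has_vpci, bool has_pci_msi)
{
    unsigned int count;

    if ( !has_vpci )
        return 0;

    /* One region covers the single emulated host bridge's config space. */
    count = 1;

    /*
     * One MSI-X MMIO handler (PBA + table) per potentially
     * passed-through device, bounded by the virtual topology size.
     */
    if ( has_pci_msi )
        count += VPCI_MAX_VIRT_DEV;

    return count;
}
```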

* Re: [PATCH v8 00/13] PCI devices passthrough on Arm, part 3
  2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (12 preceding siblings ...)
  2023-07-20  0:32 ` [PATCH v8 13/13] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
@ 2023-07-20  0:41 ` Volodymyr Babchuk
  13 siblings, 0 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20  0:41 UTC (permalink / raw)
  To: xen-devel
  Cc: Volodymyr Babchuk, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Paul Durrant,
	Roger Pau Monné,
	Kevin Tian, Bertrand Marquis, Volodymyr Babchuk


Volodymyr Babchuk <volodymyr_babchuk@epam.com> writes:

Hello again,

It looks like I messed up with the add_maintainers script and sent this
series without proper CCs. So I am CCing all interested persons in this
cover letter only.

Sorry for the noise.

> Hello,
>
> This is the next version of the vPCI rework. The aim of this series is
> to prepare the ground for introducing PCI support on the ARM platform.
>
> The biggest change from the previous, mistakenly named, v7 series is how
> locking is implemented. Instead of d->vpci_rwlock we introduce
> d->pci_lock, which has a broader scope, as it protects not only the
> domain's vpci state but also the domain's list of PCI devices.
>
> As we discussed in IRC with Roger, it is not feasible to rework all
> the existing code to use the new lock right away. It was agreed that
> any write access to d->pdev_list will be protected by **both**
> d->pci_lock in write mode and pcidevs_lock(). Read access on other
> hand should be protected by either d->pci_lock in read mode or
> pcidevs_lock(). It is expected that existing code will use
> pcidevs_lock() and new users will use new rw lock. Of course, this
> does not mean that new users shall not use pcidevs_lock() when it is
> appropriate.
>
> Apart from the locking scheme rework, there are smaller fixes in some
> patches, based on the review comments.
>
> Oleksandr Andrushchenko (12):
>   vpci: use per-domain PCI lock to protect vpci structure
>   vpci: restrict unhandled read/write operations for guests
>   vpci: add hooks for PCI device assign/de-assign
>   vpci/header: implement guest BAR register handlers
>   rangeset: add RANGESETF_no_print flag
>   vpci/header: handle p2m range sets per BAR
>   vpci/header: program p2m with guest BAR view
>   vpci/header: emulate PCI_COMMAND register for guests
>   vpci/header: reset the command register when adding devices
>   vpci: add initial support for virtual PCI bus topology
>   xen/arm: translate virtual PCI bus topology for guests
>   xen/arm: account IO handlers for emulated PCI MSI-X
>
> Volodymyr Babchuk (1):
>   pci: introduce per-domain PCI rwlock
>
>  xen/arch/arm/vpci.c                         |  61 ++-
>  xen/arch/x86/hvm/vmsi.c                     |   4 +
>  xen/common/domain.c                         |   1 +
>  xen/common/rangeset.c                       |   5 +-
>  xen/drivers/Kconfig                         |   4 +
>  xen/drivers/passthrough/amd/pci_amd_iommu.c |   9 +-
>  xen/drivers/passthrough/pci.c               |  96 ++++-
>  xen/drivers/passthrough/vtd/iommu.c         |   9 +-
>  xen/drivers/vpci/header.c                   | 453 ++++++++++++++++----
>  xen/drivers/vpci/msi.c                      |  18 +-
>  xen/drivers/vpci/msix.c                     |  56 ++-
>  xen/drivers/vpci/vpci.c                     | 176 +++++++-
>  xen/include/xen/pci.h                       |   1 +
>  xen/include/xen/rangeset.h                  |   5 +-
>  xen/include/xen/sched.h                     |   9 +
>  xen/include/xen/vpci.h                      |  42 +-
>  16 files changed, 828 insertions(+), 121 deletions(-)


-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology
  2023-07-20  0:32 ` [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
@ 2023-07-20  6:50   ` Jan Beulich
  2023-07-21  0:43     ` Volodymyr Babchuk
  2023-07-21 13:53   ` Roger Pau Monné
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-20  6:50 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: Oleksandr Andrushchenko, xen-devel

On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -46,6 +46,16 @@ void vpci_remove_device(struct pci_dev *pdev)
>          return;
>  
>      spin_lock(&pdev->vpci->lock);
> +
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
> +    {
> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
> +                    &pdev->domain->vpci_dev_assigned_map);
> +        pdev->vpci->guest_sbdf.sbdf = ~0;
> +    }
> +#endif

The lock acquired above is not ...

> @@ -115,6 +129,54 @@ int vpci_add_handlers(struct pci_dev *pdev)
>  }
>  
>  #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    pci_sbdf_t sbdf = { 0 };
> +    unsigned long new_dev_number;
> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(pcidevs_locked());
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +
> +    write_lock(&pdev->domain->pci_lock);
> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> +                                         VPCI_MAX_VIRT_DEV);
> +    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
> +    {
> +        write_unlock(&pdev->domain->pci_lock);
> +        return -ENOSPC;
> +    }
> +
> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);

... the same as the one held here, so the bitmap still isn't properly
protected afaics, unless the intention is to continue to rely on
the global PCI lock (assuming that one's held in both cases, which I
didn't check it is). Conversely it looks like the vPCI lock isn't
held here. Both aspects may be intentional, but the locks being
acquired differing requires suitable code comments imo.

I've also briefly looked at patch 1, and I'm afraid that still lacks
commentary about intended lock nesting. That might be relevant here
in case locking visible from patch / patch context isn't providing
the full picture.

> +    /*
> +     * Both segment and bus number are 0:
> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
> +     *  - with bus 0 the virtual devices are seen as embedded
> +     *    endpoints behind the root complex
> +     *
> +     * TODO: add support for multi-function devices.
> +     */
> +    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
> +    pdev->vpci->guest_sbdf = sbdf;
> +    write_unlock(&pdev->domain->pci_lock);

With the above I also wonder whether this lock can't (and hence
should) be dropped a little earlier (right after fiddling with the
bitmap).

Jan


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology for guests
  2023-07-20  0:32 ` [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology " Volodymyr Babchuk
@ 2023-07-20  6:54   ` Jan Beulich
  2023-07-21 14:09   ` Roger Pau Monné
  2023-07-24  8:02   ` Roger Pau Monné
  2 siblings, 0 replies; 73+ messages in thread
From: Jan Beulich @ 2023-07-20  6:54 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: Oleksandr Andrushchenko, xen-devel

On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -177,6 +177,30 @@ static int add_virtual_device(struct pci_dev *pdev)
>      return 0;
>  }
>  
> +/*
> + * Find the physical device which is mapped to the virtual device
> + * and translate virtual SBDF to the physical one.
> + * This must hold domain's pci_lock in read mode.

How about an assertion to that effect?

> + */
> +bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
> +{
> +    struct pci_dev *pdev;
> +
> +    ASSERT(!is_hardware_domain(d));
> +
> +    for_each_pdev( d, pdev )

Nit: Style (either you consider for_each_pdev a [pseudo-]keyword or you
don't; depending on that there's either a blank missing or there are two
too many).

Jan


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 01/13] pci: introduce per-domain PCI rwlock
  2023-07-20  0:32 ` [PATCH v8 01/13] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
@ 2023-07-20  9:45   ` Roger Pau Monné
  2023-07-20 22:57     ` Volodymyr Babchuk
  2023-07-20 15:40   ` Jan Beulich
  1 sibling, 1 reply; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-20  9:45 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Jan Beulich

On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> Add per-domain d->pci_lock that protects access to
> d->pdev_list. Purpose of this lock is to give guarantees to VPCI code
> that underlying pdev will not disappear under feet. This is a rw-lock,
> but this patch adds only write_lock()s. There will be read_lock()
> users in the next patches.
> 
> This lock should be taken in write mode every time d->pdev_list is
> altered. This covers both accesses to d->pdev_list and accesses to
> pdev->domain_list fields. All write accesses also should be protected
> by pcidevs_lock() as well. Idea is that any user that wants read
> access to the list or to the devices stored in the list should use
> either this new d->pci_lock or old pcidevs_lock(). Usage of any of
> this two locks will ensure only that pdev of interest will not
> disappear from under feet and that the pdev still will be assigned to
> the same domain. Of course, any new users should use pcidevs_lock()
> when it is appropriate (e.g. when accessing any other state that is
> protected by the said lock).

I think this needs a note about the ordering:

"In case both the newly introduced per-domain rwlock and the pcidevs
lock are taken, the latter must be acquired first."

> 
> Any write access to pdev->domain_list should be protected by both
> pcidevs_lock() and d->pci_lock in the write mode.

You also protect calls to vpci_remove_device() with the per-domain
pci_lock it seems, and that will need some explanation as it's not
obvious.

> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> ---
> 
> Changes in v8:
>  - New patch
> 
> Changes in v8 vs RFC:
>  - Removed all read_locks after discussion with Roger in #xendevel
>  - pci_release_devices() now returns the first error code
>  - extended commit message
>  - added missing lock in pci_remove_device()
>  - extended locked region in pci_add_device() to protect list_del() calls
> ---
>  xen/common/domain.c                         |  1 +
>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  9 ++-
>  xen/drivers/passthrough/pci.c               | 68 +++++++++++++++++----
>  xen/drivers/passthrough/vtd/iommu.c         |  9 ++-
>  xen/include/xen/sched.h                     |  1 +
>  5 files changed, 74 insertions(+), 14 deletions(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index caaa402637..5d8a8836da 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -645,6 +645,7 @@ struct domain *domain_create(domid_t domid,
>  
>  #ifdef CONFIG_HAS_PCI
>      INIT_LIST_HEAD(&d->pdev_list);
> +    rwlock_init(&d->pci_lock);
>  #endif
>  
>      /* All error paths can depend on the above setup. */
> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> index 94e3775506..e2f2e2e950 100644
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -476,8 +476,13 @@ static int cf_check reassign_device(
>  
>      if ( devfn == pdev->devfn && pdev->domain != target )
>      {
> -        list_move(&pdev->domain_list, &target->pdev_list);
> -        pdev->domain = target;

You seem to have inadvertently dropped the above line? (and so devices
would keep the previous pdev->domain value)

> +        write_lock(&pdev->domain->pci_lock);
> +        list_del(&pdev->domain_list);
> +        write_unlock(&pdev->domain->pci_lock);
> +
> +        write_lock(&target->pci_lock);
> +        list_add(&pdev->domain_list, &target->pdev_list);
> +        write_unlock(&target->pci_lock);
>      }
>  
>      /*
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 95846e84f2..5b4632ead2 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -454,7 +454,9 @@ static void __init _pci_hide_device(struct pci_dev *pdev)
>      if ( pdev->domain )
>          return;
>      pdev->domain = dom_xen;
> +    write_lock(&dom_xen->pci_lock);
>      list_add(&pdev->domain_list, &dom_xen->pdev_list);
> +    write_unlock(&dom_xen->pci_lock);
>  }
>  
>  int __init pci_hide_device(unsigned int seg, unsigned int bus,
> @@ -747,6 +749,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>      ret = 0;
>      if ( !pdev->domain )
>      {
> +        write_lock(&hardware_domain->pci_lock);
>          pdev->domain = hardware_domain;
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
>  
> @@ -760,6 +763,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>              list_del(&pdev->domain_list);
>              pdev->domain = NULL;
> +            write_unlock(&hardware_domain->pci_lock);

Strictly speaking, this could move one line earlier, as accesses to
pdev->domain are not protected by the d->pci_lock?  Same in other
instances (above and below), as you seem to introduce a pattern to
perform accesses to pdev->domain with the rwlock taken.

>              goto out;
>          }
>          ret = iommu_add_device(pdev);
> @@ -768,8 +772,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>              vpci_remove_device(pdev);
>              list_del(&pdev->domain_list);
>              pdev->domain = NULL;
> +            write_unlock(&hardware_domain->pci_lock);
>              goto out;
>          }
> +        write_unlock(&hardware_domain->pci_lock);
>      }
>      else
>          iommu_enable_device(pdev);
> @@ -812,11 +818,13 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
>          {
> +            write_lock(&pdev->domain->pci_lock);
>              vpci_remove_device(pdev);
>              pci_cleanup_msi(pdev);
>              ret = iommu_remove_device(pdev);
>              if ( pdev->domain )
>                  list_del(&pdev->domain_list);
> +            write_unlock(&pdev->domain->pci_lock);

Here you seem to protect more than strictly required, I would think
only the list_del() would need to be done holding the rwlock?

>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>              free_pdev(pseg, pdev);
>              break;
> @@ -887,26 +895,62 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>  
>  int pci_release_devices(struct domain *d)
>  {
> -    struct pci_dev *pdev, *tmp;
> -    u8 bus, devfn;
> -    int ret;
> +    int combined_ret;
> +    LIST_HEAD(failed_pdevs);
>  
>      pcidevs_lock();
> -    ret = arch_pci_clean_pirqs(d);
> -    if ( ret )
> +    write_lock(&d->pci_lock);
> +    combined_ret = arch_pci_clean_pirqs(d);

Why do you need the per-domain rwlock for arch_pci_clean_pirqs()?
That function doesn't modify the per-domain pdev list.

> +    if ( combined_ret )
>      {
>          pcidevs_unlock();
> -        return ret;
> +        write_unlock(&d->pci_lock);
> +        return combined_ret;

Ideally we would like to keep the same order on unlock, so the rwlock
should be released before the pcidevs lock (unless there's a reason
not to).

>      }
> -    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
> +
> +    while ( !list_empty(&d->pdev_list) )
>      {
> -        bus = pdev->bus;
> -        devfn = pdev->devfn;
> -        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
> +        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
> +                                                struct pci_dev,
> +                                                domain_list);
> +        uint16_t seg = pdev->seg;
> +        uint8_t bus = pdev->bus;
> +        uint8_t devfn = pdev->devfn;
> +        int ret;
> +
> +        write_unlock(&d->pci_lock);
> +        ret = deassign_device(d, seg, bus, devfn);
> +        write_lock(&d->pci_lock);
> +        if ( ret )
> +        {
> +            bool still_present = false;
> +            const struct pci_dev *tmp;
> +
> +            /*
> +             * We need to check if deassign_device() left our pdev in
> +             * domain's list. As we dropped the lock, we can't be sure
> +             * that list wasn't permutated in some random way, so we
> +             * need to traverse the whole list.
> +             */
> +            for_each_pdev ( d, tmp )
> +            {
> +                if ( tmp == pdev )
> +                {
> +                    still_present = true;
> +                    break;
> +                }
> +            }
> +            if ( still_present )
> +                list_move(&pdev->domain_list, &failed_pdevs);

You can get rid of the still_present variable, and just do:

for_each_pdev ( d, tmp )
    if ( tmp == pdev )
    {
        list_move(&pdev->domain_list, &failed_pdevs);
        break;
    }


> +            combined_ret = combined_ret?:ret;

Nit: missing spaces around the ternary operator.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20  0:32 ` [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
@ 2023-07-20 11:20   ` Roger Pau Monné
  2023-07-20 13:27     ` Jan Beulich
                       ` (2 more replies)
  2023-07-20 16:03   ` Jan Beulich
  2023-07-20 16:09   ` Jan Beulich
  2 siblings, 3 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-20 11:20 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich

On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Use a previously introduced per-domain read/write lock to check
> whether vpci is present, so we are sure there are no accesses to the
> contents of the vpci struct if not. This lock can be used (and in a
> few cases is used right away) so that vpci removal can be performed
> while holding the lock in write mode. Previously such removal could
> race with vpci_read for example.

This I think needs to state the locking order of the per-domain
pci_lock wrt the vpci->lock.  AFAICT that's d->pci_lock first, then
vpci->lock.

> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock instead. To prevent
> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
> always lock hwdom first.
> 
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
> 
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
> 
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made

I assume that's vpci_msix_arch_print(): there are further accesses to
pdev->vpci, but those use the msix local variable, which holds a copy
of the pointer in pdev->vpci->msix, so that last sentence is not true
I'm afraid.

However the code already tries to cater for the pdev going away, and
hence it's IMO fine.  IOW: your change doesn't make this any better or
worse.

> 
> 5. Introduce pcidevs_trylock, so there is a possibility to try locking
> the pcidev's lock.

I'm confused by this addition, the more so as it's not used anywhere.  Can
you defer the addition until the patch that makes use of it?

> 
> 6. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> ---
> 
> Changes in v8:
>  - changed d->vpci_lock to d->pci_lock
>  - introducing d->pci_lock in a separate patch
>  - extended locked region in vpci_process_pending
>  - removed pcidevs_lock() in vpci_dump_msi()
>  - removed some changes as they are not needed with
>    the new locking scheme
>  - added handling for hwdom && dom_xen case
> ---
>  xen/arch/x86/hvm/vmsi.c       |  4 +++
>  xen/drivers/passthrough/pci.c |  7 +++++
>  xen/drivers/vpci/header.c     | 18 ++++++++++++
>  xen/drivers/vpci/msi.c        | 14 ++++++++--
>  xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
>  xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
>  xen/include/xen/pci.h         |  1 +
>  7 files changed, 129 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 3cd4923060..8c1bd66b9c 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -895,6 +895,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  {
>      unsigned int i;
>  
> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
> +
>      for ( i = 0; i < msix->max_entries; i++ )
>      {
>          const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +915,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>  
>              spin_unlock(&msix->pdev->vpci->lock);
> +            read_unlock(&pdev->domain->pci_lock);
>              process_pending_softirqs();
> +            read_lock(&pdev->domain->pci_lock);

This should be a read_trylock(), much like the spin_trylock() below.

>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 5b4632ead2..6f8692cd9c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>      spin_lock_recursive(&_pcidevs_lock);
>  }
>  
> +int pcidevs_trylock(void)
> +{
> +    return spin_trylock_recursive(&_pcidevs_lock);
> +}
> +
>  void pcidevs_unlock(void)
>  {
>      spin_unlock_recursive(&_pcidevs_lock);
> @@ -1144,7 +1149,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>  
> +    write_lock(&ctxt->d->pci_lock);
>      err = vpci_add_handlers(pdev);
> +    write_unlock(&ctxt->d->pci_lock);
>      if ( err )
>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>                 ctxt->d->domain_id, err);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index b41556d007..2780fcae72 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -152,6 +152,7 @@ bool vpci_process_pending(struct vcpu *v)
>          if ( rc == -ERESTART )
>              return true;
>  
> +        write_lock(&v->domain->pci_lock);
>          spin_lock(&v->vpci.pdev->vpci->lock);
>          /* Disable memory decoding unconditionally on failure. */
>          modify_decoding(v->vpci.pdev,
> @@ -170,6 +171,7 @@ bool vpci_process_pending(struct vcpu *v)
>               * failure.
>               */
>              vpci_remove_device(v->vpci.pdev);
> +        write_unlock(&v->domain->pci_lock);
>      }

The handling in vpci_process_pending() wrt vpci_remove_device() is
racy and will need some thinking to get it solved.  Your change
doesn't make it any worse, but I would also be fine with adding a note
in the commit message that vpci_process_pending() is not adjusted to
use the new lock because it needs to be reworked first in order to be
safe against a concurrent vpci_remove_device() call.

>  
>      return false;
> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      struct map_data data = { .d = d, .map = true };
>      int rc;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    {
> +        /*
> +         * It's safe to drop and reacquire the lock in this context
> +         * without risking pdev disappearing because devices cannot be
> +         * removed until the initial domain has been started.
> +         */
> +        read_unlock(&d->pci_lock);
>          process_pending_softirqs();
> +        read_lock(&d->pci_lock);
> +    }

Since this is init only code you could likely forego the usage of the
locks, but I guess that's more churn than just using them.  In any
case, as this gets called from modify_bars() the locks need to be
dropped/taken in write mode (see comment below).

>      rangeset_destroy(mem);
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
> @@ -223,6 +237,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      unsigned int i;
>      int rc;
>  
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));

The lock here needs to be taken in write mode I think, so the code can
safely iterate over the contents of each pdev->vpci assigned to the
domain.

> +
>      if ( !mem )
>          return -ENOMEM;
>  
> @@ -502,6 +518,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_bar *bars = header->bars;
>      int rc;
>  
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 8f2b59e61a..e63152c224 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>      uint16_t control;
>      int ret;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));

I'm confused by the difference in lock requirements between
init_bars() and init_msi().  In the former you assert for the lock
being taken in read mode, while the latter asserts for write mode.

We want to do initialization in write mode, so that modify_bars()
called by init_bars() has exclusive access to the contents of
pdev->vpci.

> +
>      if ( !pos )
>          return 0;
>  
> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>  
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>  
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -277,6 +279,9 @@ void vpci_dump_msi(void)
>  
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>  
> +        if ( !read_trylock(&d->pci_lock) )
> +            continue;
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -318,14 +323,17 @@ void vpci_dump_msi(void)
>                       * holding the lock.
>                       */
>                      printk("unable to print all MSI-X entries: %d\n", rc);
> -                    process_pending_softirqs();
> -                    continue;
> +                    goto pdev_done;
>                  }
>              }
>  
>              spin_unlock(&pdev->vpci->lock);
> + pdev_done:
> +            read_unlock(&d->pci_lock);
>              process_pending_softirqs();
> +            read_lock(&d->pci_lock);

read_trylock().

This is not very safe, as the list could be modified while the lock is
dropped, but it's a debug key handler so I'm not very concerned.
However, we should at least add a comment that this relies on the list
not being altered while the lock is dropped.
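Something along these lines would do (wording is just a suggestion):

```c
/*
 * NB: this relies on d->pdev_list not being altered while the lock is
 * dropped; this is a debug key handler, so entries being skipped or
 * visited twice as a consequence is tolerable.
 */
```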

>          }
> +        read_unlock(&d->pci_lock);
>      }
>      rcu_read_unlock(&domlist_read_lock);
>  }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index 25bde77586..9481274579 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));

Hm, here you are iterating over pdev->vpci->header.bars for multiple
devices, so I think in addition to the pci_lock in read mode we should
also take the vpci->lock for each pdev.

I think I would like to rework msix_find() so it's msix_get() and
returns with the appropriate vpci->lock taken.  Anyway, that's for a
different patch, the usage of the lock in read mode seems correct,
albeit I might want to move the read_lock() call inside of msix_get()
in the future.
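For reference, what I have in mind is roughly the following (names and
details are hypothetical, just to illustrate the idea):

```c
/*
 * Sketch: like msix_find(), but on success returns with the
 * corresponding pdev->vpci->lock held, so callers don't need to
 * re-fetch and lock the pdev themselves.
 */
static struct vpci_msix *msix_get(const struct domain *d, unsigned long addr)
{
    struct vpci_msix *msix;

    ASSERT(rw_is_locked(&d->pci_lock));

    list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
        /* addr_in_msix_region() stands in for the checks in msix_find(). */
        if ( addr_in_msix_region(msix, addr) )
        {
            spin_lock(&msix->pdev->vpci->lock);
            return msix;
        }

    return NULL;
}
```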

> +
>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>      {
>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  
>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    int rc;
> +
> +    read_lock(&v->domain->pci_lock);
> +    rc = !!msix_find(v->domain, addr);
> +    read_unlock(&v->domain->pci_lock);
> +
> +    return rc;
>  }
>  
>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +366,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_read(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
>      *data = ~0ul;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_read(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_read(d, msix, addr, len, data);

Nit: missing newline (here and below).

> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -404,6 +425,7 @@ static int cf_check msix_read(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -491,19 +513,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_write(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_write(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_write(d, msix, addr, len, data);
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -579,6 +614,7 @@ static int cf_check msix_write(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -665,6 +701,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>      struct vpci_msix *msix;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>                                        PCI_CAP_ID_MSIX);
>      if ( !msix_offset )
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index d73fa76302..f22cbf2112 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>  
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>          return;
>  
> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      const unsigned long *ro_map;
>      int rc = 0;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>  
> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
>      uint32_t data = ~(uint32_t)0;
> +    rwlock_t *lock;
>  
>      if ( !size )
>      {
> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>       * Find the PCI dev matching the address, which for hwdom also requires
>       * consulting DomXEN.  Passthrough everything that's not trapped.
>       */
> +    lock = &d->pci_lock;
> +    read_lock(lock);
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
> +    {
> +        read_unlock(lock);
> +        lock = &dom_xen->pci_lock;
> +        read_lock(lock);
>          pdev = pci_get_pdev(dom_xen, sbdf);
> +    }
>      if ( !pdev || !pdev->vpci )
> +    {
> +        read_unlock(lock);
>          return vpci_read_hw(sbdf, reg, size);
> +    }
>  
>      spin_lock(&pdev->vpci->lock);
>  
> @@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    read_unlock(lock);
>  
>      if ( data_offset < size )
>      {
> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>               r->private);
>  }
>  
> +/* Helper function to unlock locks taken by vpci_write in proper order */
> +static void unlock_locks(struct domain *d)
> +{
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
> +    if ( is_hardware_domain(d) )
> +    {
> +        ASSERT(rw_is_locked(&d->pci_lock));
> +        read_unlock(&dom_xen->pci_lock);
> +    }
> +    read_unlock(&d->pci_lock);
> +}
> +
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>      /*
>       * Find the PCI dev matching the address, which for hwdom also requires
> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> +     * consulting DomXEN. Passthrough everything that's not trapped.
> +     * If this is hwdom, we need to hold locks for both domain in case if
> +     * modify_bars is called()

Typo: the () wants to be at the end of modify_bars().

>       */
> +    read_lock(&d->pci_lock);
> +
> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> +    if ( is_hardware_domain(d) )
> +        read_lock(&dom_xen->pci_lock);

For modify_bars() we also want the locks to be in write mode (at least
the hw one), so that the position of the BARs can't be changed while
modify_bars() is iterating over them.

Is this something that will be done in a followup change?

> +
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
>          pdev = pci_get_pdev(dom_xen, sbdf);
> @@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>          if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
>              vpci_write_hw(sbdf, reg, size, data);
> +
> +        unlock_locks(d);
>          return;
>      }
>  
> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    unlock_locks(d);

There's one issue here, some handlers will call pcidevs_lock(), which
will result in a lock order inversion, as in the previous patch we
agreed that the locking order was pcidevs_lock first, d->pci_lock
after.

For example the MSI control_write() handler will call
vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
have to look into using a dedicated lock for MSI related handling, as
that's the only place where I think we have this pattern of taking the
pcidevs_lock after the d->pci_lock.
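To spell out the inversion (call chains abbreviated):

```
CPU1: vpci_write()                 CPU2: assign_device()
  read_lock(&d->pci_lock);           pcidevs_lock();
  control_write()                    ...
    vpci_msi_arch_enable()           write_lock(&d->pci_lock);  /* blocks */
      pcidevs_lock();  /* blocks -> deadlock */
```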

Thanks, Roger.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 03/13] vpci: restrict unhandled read/write operations for guests
  2023-07-20  0:32 ` [PATCH v8 03/13] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
@ 2023-07-20 11:32   ` Roger Pau Monné
  0 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-20 11:32 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> A guest would be able to read and write those registers which are not
> emulated and have no respective vPCI handlers, so it will be possible
> for it to access the hardware directly.
> In order to prevent a guest from reads and writes from/to the unhandled
                                                            ^ extra 'the'
> registers make sure only hardware domain can access the hardware directly
> and restrict guests from doing so.
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

With the stray change below removed.

> 
> ---
> Since v6:
> - do not use is_hwdom parameter for vpci_{read|write}_hw and use
>   current->domain internally
> - update commit message
> New in v6
> ---
>  xen/drivers/vpci/vpci.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index f22cbf2112..a6d2cf8660 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -233,6 +233,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
>  {
>      uint32_t data;
>  
> +    /* Guest domains are not allowed to read real hardware. */
> +    if ( !is_hardware_domain(current->domain) )
> +        return ~(uint32_t)0;
> +
>      switch ( size )
>      {
>      case 4:
> @@ -273,9 +277,13 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
>      return data;
>  }
>  
> -static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> -                          uint32_t data)
> +static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg,
> +                          unsigned int size, uint32_t data)

Unrelated change?

Thanks, Roger.



* Re: [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign
  2023-07-20  0:32 ` [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
@ 2023-07-20 12:36   ` Roger Pau Monné
  2023-07-26  1:38     ` Volodymyr Babchuk
  2023-07-24  9:41   ` Jan Beulich
  1 sibling, 1 reply; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-20 12:36 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a PCI device gets assigned/de-assigned some work on vPCI side needs
> to be done for that device. Introduce a pair of hooks so vPCI can handle
> that.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v8:
> - removed vpci_deassign_device
> Since v6:
> - do not pass struct domain to vpci_{assign|deassign}_device as
>   pdev->domain can be used
> - do not leave the device assigned (pdev->domain == new domain) in case
>   vpci_assign_device fails: try to de-assign and if this also fails, then
>   crash the domain
> Since v5:
> - do not split code into run_vpci_init
> - do not check for is_system_domain in vpci_{de}assign_device
> - do not use vpci_remove_device_handlers_locked and re-allocate
>   pdev->vpci completely
> - make vpci_deassign_device void
> Since v4:
>  - de-assign vPCI from the previous domain on device assignment
>  - do not remove handlers in vpci_assign_device as those must not
>    exist at that point
> Since v3:
>  - remove toolstack roll-back description from the commit message
>    as error are to be handled with proper cleanup in Xen itself
>  - remove __must_check
>  - remove redundant rc check while assigning devices
>  - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
>  - use REGISTER_VPCI_INIT machinery to run required steps on device
>    init/assign: add run_vpci_init helper
> Since v2:
> - define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
>   for x86
> Since v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - extended the commit message
> ---
>  xen/drivers/Kconfig           |  4 ++++
>  xen/drivers/passthrough/pci.c | 21 +++++++++++++++++++++
>  xen/drivers/vpci/vpci.c       | 18 ++++++++++++++++++
>  xen/include/xen/vpci.h        | 15 +++++++++++++++
>  4 files changed, 58 insertions(+)
> 
> diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
> index db94393f47..780490cf8e 100644
> --- a/xen/drivers/Kconfig
> +++ b/xen/drivers/Kconfig
> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>  config HAS_VPCI
>  	bool
>  
> +config HAS_VPCI_GUEST_SUPPORT
> +	bool
> +	depends on HAS_VPCI
> +
>  endmenu
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 6f8692cd9c..265d359704 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -885,6 +885,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>      if ( ret )
>          goto out;
>  
> +    write_lock(&pdev->domain->pci_lock);
> +    vpci_deassign_device(pdev);
> +    write_unlock(&pdev->domain->pci_lock);
> +
>      if ( pdev->domain == hardware_domain  )
>          pdev->quarantine = false;
>  
> @@ -1484,6 +1488,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>          goto done;
>  
> +    write_lock(&pdev->domain->pci_lock);
> +    vpci_deassign_device(pdev);
> +    write_unlock(&pdev->domain->pci_lock);
> +
>      rc = pdev_msix_assign(d, pdev);
>      if ( rc )
>          goto done;
> @@ -1509,6 +1517,19 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>          rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>                          pci_to_dev(pdev), flag);
>      }
> +    if ( rc )
> +        goto done;
> +
> +    devfn = pdev->devfn;
> +    write_lock(&pdev->domain->pci_lock);
> +    rc = vpci_assign_device(pdev);
> +    write_unlock(&pdev->domain->pci_lock);
> +    if ( rc && deassign_device(d, seg, bus, devfn) )
> +    {
> +        printk(XENLOG_ERR "%pd: %pp was left partially assigned\n",
> +               d, &PCI_SBDF(seg, bus, devfn));

&pdev->sbdf?  Then you can get rid of the devfn usage above.

> +        domain_crash(d);

This seems a bit different from the other error paths in the
function; isn't it fine to return an error and let the caller handle
the deassign?

Also, if we really need to call deassign_device() we must do so for
all possible phantom devices; see the above loop around
iommu_call(..., assign_device, ...);

> +    }
>  
>   done:
>      if ( rc )
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index a6d2cf8660..a97710a806 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -107,6 +107,24 @@ int vpci_add_handlers(struct pci_dev *pdev)
>  
>      return rc;
>  }
> +
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +/* Notify vPCI that device is assigned to guest. */
> +int vpci_assign_device(struct pci_dev *pdev)
> +{
> +    int rc;
> +
> +    if ( !has_vpci(pdev->domain) )
> +        return 0;
> +
> +    rc = vpci_add_handlers(pdev);
> +    if ( rc )
> +        vpci_deassign_device(pdev);

Why do you need this handler, vpci_add_handlers() when failing will
already call vpci_remove_device(), which is what
vpci_deassign_device() does.

> +
> +    return rc;
> +}
> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
> +
>  #endif /* __XEN__ */
>  
>  static int vpci_register_cmp(const struct vpci_register *r1,
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 0b8a2a3c74..44296623e1 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -264,6 +264,21 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
>  }
>  #endif
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +/* Notify vPCI that device is assigned/de-assigned to/from guest. */
> +int vpci_assign_device(struct pci_dev *pdev);
> +#define vpci_deassign_device vpci_remove_device
> +#else
> +static inline int vpci_assign_device(struct pci_dev *pdev)
> +{
> +    return 0;
> +};
> +
> +static inline void vpci_deassign_device(struct pci_dev *pdev)
> +{
> +};
> +#endif

I don't think there's much point in introducing new functions, see
above.  I'm fine if the current ones want to be renamed to
vpci_{,de}assign_device(), but adding defines like the above just
makes the code harder to follow.

Thanks, Roger.



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20 11:20   ` Roger Pau Monné
@ 2023-07-20 13:27     ` Jan Beulich
  2023-07-20 13:50       ` Roger Pau Monné
  2023-07-20 15:53     ` Jan Beulich
  2023-07-26  1:17     ` Volodymyr Babchuk
  2 siblings, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-20 13:27 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Volodymyr Babchuk

On 20.07.2023 13:20, Roger Pau Monné wrote:
> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>  
>>      /*
>>       * Find the PCI dev matching the address, which for hwdom also requires
>> -     * consulting DomXEN.  Passthrough everything that's not trapped.
>> +     * consulting DomXEN. Passthrough everything that's not trapped.
>> +     * If this is hwdom, we need to hold locks for both domain in case if
>> +     * modify_bars is called()
> 
> Typo: the () wants to be at the end of modify_bars().
> 
>>       */
>> +    read_lock(&d->pci_lock);
>> +
>> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
>> +    if ( is_hardware_domain(d) )
>> +        read_lock(&dom_xen->pci_lock);
> 
> For modify_bars() we also want the locks to be in write mode (at least
> the hw one), so that the position of the BARs can't be changed while
> modify_bars() is iterating over them.

Isn't changing of the BARs happening under the vpci lock? Or else I guess
I haven't understood the description correctly: My reading so far was
that it is only the presence (allocation status / pointer validity) that
is protected by this new lock.

Jan



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20 13:27     ` Jan Beulich
@ 2023-07-20 13:50       ` Roger Pau Monné
  2023-07-24  0:07         ` Volodymyr Babchuk
  0 siblings, 1 reply; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-20 13:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Oleksandr Andrushchenko, Volodymyr Babchuk

On Thu, Jul 20, 2023 at 03:27:29PM +0200, Jan Beulich wrote:
> On 20.07.2023 13:20, Roger Pau Monné wrote:
> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>  
> >>      /*
> >>       * Find the PCI dev matching the address, which for hwdom also requires
> >> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> >> +     * consulting DomXEN. Passthrough everything that's not trapped.
> >> +     * If this is hwdom, we need to hold locks for both domain in case if
> >> +     * modify_bars is called()
> > 
> > Typo: the () wants to be at the end of modify_bars().
> > 
> >>       */
> >> +    read_lock(&d->pci_lock);
> >> +
> >> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> >> +    if ( is_hardware_domain(d) )
> >> +        read_lock(&dom_xen->pci_lock);
> > 
> > For modify_bars() we also want the locks to be in write mode (at least
> > the hw one), so that the position of the BARs can't be changed while
> > modify_bars() is iterating over them.
> 
> Isn't changing of the BARs happening under the vpci lock?

It is.

> Or else I guess
> I haven't understood the description correctly: My reading so far was
> that it is only the presence (allocation status / pointer validity) that
> is protected by this new lock.

Hm, I see, yes.  I guess it was a previous patch version that also
took care of the modify_bars() issue by taking the lock in exclusive
mode here.

We can always do that later, so forget about that comment (for now).

Thanks, Roger.



* Re: [PATCH v8 01/13] pci: introduce per-domain PCI rwlock
  2023-07-20  0:32 ` [PATCH v8 01/13] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
  2023-07-20  9:45   ` Roger Pau Monné
@ 2023-07-20 15:40   ` Jan Beulich
  2023-07-20 23:37     ` Volodymyr Babchuk
  1 sibling, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-20 15:40 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: Roger Pau Monné, xen-devel

On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -476,8 +476,13 @@ static int cf_check reassign_device(
>  
>      if ( devfn == pdev->devfn && pdev->domain != target )
>      {
> -        list_move(&pdev->domain_list, &target->pdev_list);
> -        pdev->domain = target;
> +        write_lock(&pdev->domain->pci_lock);
> +        list_del(&pdev->domain_list);
> +        write_unlock(&pdev->domain->pci_lock);

As mentioned on an earlier version, perhaps better (cheaper) to use
"source" here? (Same in VT-d code then.)

> @@ -747,6 +749,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>      ret = 0;
>      if ( !pdev->domain )
>      {
> +        write_lock(&hardware_domain->pci_lock);
>          pdev->domain = hardware_domain;
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
>  
> @@ -760,6 +763,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>              list_del(&pdev->domain_list);
>              pdev->domain = NULL;
> +            write_unlock(&hardware_domain->pci_lock);
>              goto out;

In addition to Roger's comments about locking scope: In a case like this
one it would probably also be good to move the printk() out of the locked
area. It can be slow, after all.

Question is why you have this wide a locked area here in the first place:
Don't you need to hold the lock just across the two list operations (but
not in between)?
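I.e. something along the lines of (sketch only, untested):

```c
write_lock(&hardware_domain->pci_lock);
pdev->domain = hardware_domain;
list_add(&pdev->domain_list, &hardware_domain->pdev_list);
write_unlock(&hardware_domain->pci_lock);

ret = vpci_add_handlers(pdev);
if ( ret )
{
    /* printk() kept outside the locked region, as it can be slow. */
    printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
    write_lock(&hardware_domain->pci_lock);
    list_del(&pdev->domain_list);
    write_unlock(&hardware_domain->pci_lock);
    pdev->domain = NULL;
    goto out;
}
```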

> @@ -887,26 +895,62 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>  
>  int pci_release_devices(struct domain *d)
>  {
> -    struct pci_dev *pdev, *tmp;
> -    u8 bus, devfn;
> -    int ret;
> +    int combined_ret;
> +    LIST_HEAD(failed_pdevs);
>  
>      pcidevs_lock();
> -    ret = arch_pci_clean_pirqs(d);
> -    if ( ret )
> +    write_lock(&d->pci_lock);
> +    combined_ret = arch_pci_clean_pirqs(d);
> +    if ( combined_ret )
>      {
>          pcidevs_unlock();
> -        return ret;
> +        write_unlock(&d->pci_lock);
> +        return combined_ret;
>      }
> -    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
> +
> +    while ( !list_empty(&d->pdev_list) )
>      {
> -        bus = pdev->bus;
> -        devfn = pdev->devfn;
> -        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
> +        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
> +                                                struct pci_dev,
> +                                                domain_list);
> +        uint16_t seg = pdev->seg;
> +        uint8_t bus = pdev->bus;
> +        uint8_t devfn = pdev->devfn;
> +        int ret;
> +
> +        write_unlock(&d->pci_lock);
> +        ret = deassign_device(d, seg, bus, devfn);
> +        write_lock(&d->pci_lock);
> +        if ( ret )
> +        {
> +            bool still_present = false;
> +            const struct pci_dev *tmp;
> +
> +            /*
> +             * We need to check if deassign_device() left our pdev in
> +             * domain's list. As we dropped the lock, we can't be sure
> +             * that list wasn't permutated in some random way, so we
> +             * need to traverse the whole list.
> +             */
> +            for_each_pdev ( d, tmp )
> +            {
> +                if ( tmp == pdev )
> +                {
> +                    still_present = true;
> +                    break;
> +                }
> +            }
> +            if ( still_present )
> +                list_move(&pdev->domain_list, &failed_pdevs);

In order to retain original ordering on the resulting list, perhaps better
list_move_tail()?

Jan



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20 11:20   ` Roger Pau Monné
  2023-07-20 13:27     ` Jan Beulich
@ 2023-07-20 15:53     ` Jan Beulich
  2023-07-26  1:17     ` Volodymyr Babchuk
  2 siblings, 0 replies; 73+ messages in thread
From: Jan Beulich @ 2023-07-20 15:53 UTC (permalink / raw)
  To: Roger Pau Monné, Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko

On 20.07.2023 13:20, Roger Pau Monné wrote:
> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> @@ -318,14 +323,17 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */

Note the comment here.

>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>  
>>              spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> +            read_unlock(&d->pci_lock);
>>              process_pending_softirqs();
>> +            read_lock(&d->pci_lock);
> 
> read_trylock().

Plus the same scheme as with the spin lock wants following imo:
vpci_msix_arch_print() returns an error only with (now) both locks
dropped. This then wants reflecting in the comment pointed out
above.

Jan



* Re: [PATCH v8 05/13] vpci/header: implement guest BAR register handlers
  2023-07-20  0:32 ` [PATCH v8 05/13] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
@ 2023-07-20 16:01   ` Roger Pau Monné
  2023-07-21 10:36   ` Rahul Singh
  1 sibling, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-20 16:01 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:32AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Emulate guest BAR register values: this allows creating a guest view
> of the registers and emulates size and properties probe as it is done
> during PCI device enumeration by the guest.
> 
> All empty, IO and ROM BARs for guests are emulated by returning 0 on
> reads and ignoring writes: this BARs are special with this respect as
> their lower bits have special meaning, so returning default ~0 on read
> may confuse guest OS.
> 
> Memory decoding is initially disabled when used by guests in order to
> prevent the BAR being placed on top of a RAM region.

I'm kind of lost on this last sentence, as I don't see the patch
explicitly disabling PCI_COMMAND_MEMORY from the command register.  Is
that more of an expectation on the initial device state?

Maybe there should be some checking in that case then?
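For reference, explicitly disabling memory decoding is a single-bit operation on the command register; a hypothetical helper (bit value per the PCI specification, helper name invented here) might be:

```c
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2   /* bit 1: memory space decoding */

/* Hypothetical helper: return the command value with decoding off. */
static uint16_t disable_memory_decoding(uint16_t cmd)
{
    return cmd & (uint16_t)~PCI_COMMAND_MEMORY;
}
```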

> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> 
> Since v6:
> - unify the writing of the PCI_COMMAND register on the
>   error path into a label
> - do not introduce bar_ignore_access helper and open code
> - s/guest_bar_ignore_read/empty_bar_read
> - update error message in guest_bar_write
> - only setup empty_bar_read for IO if !x86
> Since v5:
> - make sure that the guest set address has the same page offset
>   as the physical address on the host
> - remove guest_rom_{read|write} as those just implement the default
>   behaviour of the registers not being handled
> - adjusted comment for struct vpci.addr field
> - add guest handlers for BARs which are not handled and will otherwise
>   return ~0 on read and ignore writes. The BARs are special with this
>   respect as their lower bits have special meaning, so returning ~0
>   doesn't seem to be right
> Since v4:
> - updated commit message
> - s/guest_addr/guest_reg
> Since v3:
> - squashed two patches: dynamic add/remove handlers and guest BAR
>   handler implementation
> - fix guest BAR read of the high part of a 64bit BAR (Roger)
> - add error handling to vpci_assign_device
> - s/dom%pd/%pd
> - blank line before return
> Since v2:
> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>   has been eliminated from being built on x86
> Since v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - simplify some code
>  - use gdprintk + error code instead of gprintk
>  - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>    so these do not get compiled for x86
>  - removed unneeded is_system_domain check
>  - re-work guest read/write to be much simpler and do more work on write
>    than read which is expected to be called more frequently
>  - removed one too obvious comment
> ---
>  xen/drivers/vpci/header.c | 156 +++++++++++++++++++++++++++++++-------
>  xen/include/xen/vpci.h    |   3 +
>  2 files changed, 130 insertions(+), 29 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 2780fcae72..5dc9b5338b 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -457,6 +457,71 @@ static void cf_check bar_write(
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
>  
> +static void cf_check guest_bar_write(const struct pci_dev *pdev,
> +                                     unsigned int reg, uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +    uint64_t guest_reg = bar->guest_reg;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +    }
> +
> +    guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
> +    guest_reg |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
> +
> +    /*
> +     * Make sure that the guest set address has the same page offset
> +     * as the physical address on the host or otherwise things won't work as
> +     * expected.
> +     */
> +    if ( (guest_reg & (~PAGE_MASK & PCI_BASE_ADDRESS_MEM_MASK)) !=
> +         (bar->addr & ~PAGE_MASK) )
> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%pp: ignored BAR %zu write attempting to change page offset\n",
> +                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
> +        return;
> +    }
> +
> +    bar->guest_reg = guest_reg;
> +}
> +
> +static uint32_t cf_check guest_bar_read(const struct pci_dev *pdev,
> +                                        unsigned int reg, void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    return bar->guest_reg >> (hi ? 32 : 0);
> +}
> +
> +static uint32_t cf_check empty_bar_read(const struct pci_dev *pdev,
> +                                        unsigned int reg, void *data)
> +{
> +    return 0;
> +}
> +
>  static void cf_check rom_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
>  {
> @@ -517,6 +582,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_header *header = &pdev->vpci->header;
>      struct vpci_bar *bars = header->bars;
>      int rc;
> +    bool is_hwdom = is_hardware_domain(pdev->domain);
>  
>      ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> @@ -558,13 +624,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>          {
>              bars[i].type = VPCI_BAR_MEM64_HI;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            rc = vpci_add_register(pdev->vpci,
> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                                   is_hwdom ? bar_write : guest_bar_write,
> +                                   reg, 4, &bars[i]);
>              if ( rc )
> -            {
> -                pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> -                return rc;
> -            }
> +                goto fail;
>  
>              continue;
>          }
> @@ -573,6 +638,17 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>          {
>              bars[i].type = VPCI_BAR_IO;
> +
> +#ifndef CONFIG_X86
> +            if ( !is_hwdom )
> +            {
> +                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                                       reg, 4, &bars[i]);

For an empty BAR there's no need to pass &bars[i] around? (Same for
all callers that set up empty_bar_read() handlers.)

> +                if ( rc )
> +                    goto fail;
> +            }
> +#endif

This might be better done as an IS_ENABLED() check in the introduced
if condition.  Need a bit of a description as to why IO space BARs are
handled as empty BARs for domUs.

> +
>              continue;
>          }
>          if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> @@ -584,14 +660,20 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
>                                (i == num_bars - 1) ? PCI_BAR_LAST : 0);
>          if ( rc < 0 )
> -        {
> -            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> -            return rc;
> -        }
> +            goto fail;
>  
>          if ( size == 0 )
>          {
>              bars[i].type = VPCI_BAR_EMPTY;
> +
> +            if ( !is_hwdom )
> +            {
> +                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                                       reg, 4, &bars[i]);
> +                if ( rc )
> +                    goto fail;
> +            }
> +
>              continue;
>          }
>  
> @@ -599,34 +681,50 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          bars[i].size = size;
>          bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> -                               &bars[i]);
> +        rc = vpci_add_register(pdev->vpci,
> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                               is_hwdom ? bar_write : guest_bar_write,
> +                               reg, 4, &bars[i]);
>          if ( rc )
> -        {
> -            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> -            return rc;
> -        }
> +            goto fail;
>      }
>  
> -    /* Check expansion ROM. */
> -    rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
> -    if ( rc > 0 && size )
> +    /* Check expansion ROM: we do not handle ROM for guests. */

Is there any specific reason for not handling ROM BAR for guests?

> +    if ( is_hwdom )
>      {
> -        struct vpci_bar *rom = &header->bars[num_bars];
> +        rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
> +        if ( rc > 0 && size )
> +        {
> +            struct vpci_bar *rom = &header->bars[num_bars];
>  
> -        rom->type = VPCI_BAR_ROM;
> -        rom->size = size;
> -        rom->addr = addr;
> -        header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
> -                              PCI_ROM_ADDRESS_ENABLE;
> +            rom->type = VPCI_BAR_ROM;
> +            rom->size = size;
> +            rom->addr = addr;
> +            header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
> +                                  PCI_ROM_ADDRESS_ENABLE;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
> -                               4, rom);
> -        if ( rc )
> -            rom->type = VPCI_BAR_EMPTY;
> +            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
> +                                   rom_reg, 4, rom);
> +            if ( rc )
> +                rom->type = VPCI_BAR_EMPTY;
> +        }
> +    }
> +    else
> +    {
> +        if ( !is_hwdom )

Extra !is_hwdom?  The condition on the outer if is already is_hwdom,
and this is the else branch.

> +        {
> +            rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                                   rom_reg, 4, &header->bars[num_bars]);
> +            if ( rc )
> +                goto fail;
> +        }
>      }
>  
>      return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
> +
> + fail:
> +    pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> +    return rc;

It might have been better for the usage of the fail label to be
introduced in a pre-patch, as there would then be less changes here
(and the pre-patch would be a non-functional change).

Thanks, Roger.
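For readers following the guest_bar_write() logic quoted above, here is a self-contained sketch of the masking and page-offset check it performs (simplified standalone constants and a hypothetical helper name; not the actual Xen code):

```c
#include <stdint.h>

#define PAGE_MASK                 (~0xfffULL)
#define PCI_BASE_ADDRESS_MEM_MASK (~0xfULL)

/*
 * Merge a 32-bit guest write into the 64-bit shadow BAR value, align
 * the address to the BAR size, and reject writes that would change the
 * offset within the page relative to the host address.
 */
static int update_guest_bar(uint64_t *guest_reg, uint32_t val, int hi,
                            uint64_t size, uint64_t host_addr)
{
    uint64_t reg = *guest_reg;

    /* Merge the 32-bit write into the 64-bit shadow value. */
    reg &= ~(0xffffffffULL << (hi ? 32 : 0));
    reg |= (uint64_t)val << (hi ? 32 : 0);

    /* Align the address down to the BAR size, keeping the low flag bits. */
    reg &= ~(size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;

    /* The guest-set page offset must match the host's. */
    if ( (reg & (~PAGE_MASK & PCI_BASE_ADDRESS_MEM_MASK)) !=
         (host_addr & ~PAGE_MASK) )
        return -1; /* ignored write */

    *guest_reg = reg;
    return 0;
}
```

The page-offset restriction exists because the second-stage translation operates at page granularity, so guest and host BARs must share the in-page offset.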


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20  0:32 ` [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
  2023-07-20 11:20   ` Roger Pau Monné
@ 2023-07-20 16:03   ` Jan Beulich
  2023-07-20 16:14     ` Roger Pau Monné
  2023-07-20 16:09   ` Jan Beulich
  2 siblings, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-20 16:03 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, xen-devel

On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>      uint16_t control;
>      int ret;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));

I'm afraid I have to ask the opposite question, compared to Roger's:
Why do you need the lock held for write here (and in init_msix())?
Neither the list of devices nor the pdev->vpci pointer is being altered.

Jan


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20  0:32 ` [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
  2023-07-20 11:20   ` Roger Pau Monné
  2023-07-20 16:03   ` Jan Beulich
@ 2023-07-20 16:09   ` Jan Beulich
  2 siblings, 0 replies; 73+ messages in thread
From: Jan Beulich @ 2023-07-20 16:09 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, xen-devel

On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>               r->private);
>  }
>  
> +/* Helper function to unlock locks taken by vpci_write in proper order */
> +static void unlock_locks(struct domain *d)
> +{
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
> +    if ( is_hardware_domain(d) )
> +    {
> +        ASSERT(rw_is_locked(&d->pci_lock));

Copy-and-paste mistake? You've asserted this same condition already above.

> +        read_unlock(&dom_xen->pci_lock);
> +    }
> +    read_unlock(&d->pci_lock);
> +}
> +
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>      /*
>       * Find the PCI dev matching the address, which for hwdom also requires
> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> +     * consulting DomXEN. Passthrough everything that's not trapped.
> +     * If this is hwdom, we need to hold locks for both domains in case
> +     * modify_bars() is called.
>       */
> +    read_lock(&d->pci_lock);
> +
> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> +    if ( is_hardware_domain(d) )
> +        read_lock(&dom_xen->pci_lock);

But I wonder anyway - can we perhaps get away without acquiring dom_xen's
lock here? Its list isn't altered anymore post-boot, iirc.

> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    unlock_locks(d);

In this context the question arises whether the function wouldn't better
be named more specific to its purpose: It's obvious here that it doesn't
unlock all the locks involved.

Jan


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20 16:03   ` Jan Beulich
@ 2023-07-20 16:14     ` Roger Pau Monné
  2023-07-21  6:02       ` Jan Beulich
  0 siblings, 1 reply; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-20 16:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Volodymyr Babchuk, Oleksandr Andrushchenko, xen-devel

On Thu, Jul 20, 2023 at 06:03:49PM +0200, Jan Beulich wrote:
> On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> > --- a/xen/drivers/vpci/msi.c
> > +++ b/xen/drivers/vpci/msi.c
> > @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
> >      uint16_t control;
> >      int ret;
> >  
> > +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> 
> I'm afraid I have to ask the opposite question, compared to Roger's:
> Why do you need the lock held for write here (and in init_msix())?
> > Neither the list of devices nor the pdev->vpci pointer is being
> > altered.

This is called from vpci_add_handlers(), which will acquire the lock
(or requires being called) with it held in write mode in order to set
pdev->vpci, I would assume.  Strictly speaking, however, the init
handlers don't require the lock in write mode, unless we use such
locking to get exclusive access to the BAR arrays of all the devices
assigned to the domain, for modify_bars().

Thanks, Roger.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 01/13] pci: introduce per-domain PCI rwlock
  2023-07-20  9:45   ` Roger Pau Monné
@ 2023-07-20 22:57     ` Volodymyr Babchuk
  0 siblings, 0 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20 22:57 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Jan Beulich


Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> Add per-domain d->pci_lock that protects access to
>> d->pdev_list. Purpose of this lock is to give guarantees to VPCI code
>> that underlying pdev will not disappear under feet. This is a rw-lock,
>> but this patch adds only write_lock()s. There will be read_lock()
>> users in the next patches.
>> 
>> This lock should be taken in write mode every time d->pdev_list is
>> altered. This covers both accesses to d->pdev_list and accesses to
>> pdev->domain_list fields. All write accesses also should be protected
>> by pcidevs_lock() as well. Idea is that any user that wants read
>> access to the list or to the devices stored in the list should use
>> either this new d->pci_lock or old pcidevs_lock(). Usage of any of
>> these two locks will ensure only that the pdev of interest will not
>> disappear from under feet and that the pdev still will be assigned to
>> the same domain. Of course, any new users should use pcidevs_lock()
>> when it is appropriate (e.g. when accessing any other state that is
>> protected by the said lock).
>
> I think this needs a note about the ordering:
>
> "In case both the newly introduced per-domain rwlock and the pcidevs
> lock are taken, the latter must be acquired first."

Thanks. Added.

>> 
>> Any write access to pdev->domain_list should be protected by both
>> pcidevs_lock() and d->pci_lock in the write mode.
>
> You also protect calls to vpci_remove_device() with the per-domain
> pci_lock it seems, and that will need some explanation as it's not
> obvious.

Well, strictly speaking, it is not required in this patch, but it is
needed in the next one. I could lock only "list_del(&pdev->domain_list);"
and then extend the locked area in the next patch. On the other hand,
this patch already protects the vpci_add_handlers() call in
pci_add_device() due to the code layout, so it may be natural to protect
vpci_remove_device() as well. What is your opinion?

>> 
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> 
>> ---
>> 
>> Changes in v8:
>>  - New patch
>> 
>> Changes in v8 vs RFC:
>>  - Removed all read_locks after discussion with Roger in #xendevel
>>  - pci_release_devices() now returns the first error code
>>  - extended commit message
>>  - added missing lock in pci_remove_device()
>>  - extended locked region in pci_add_device() to protect list_del() calls
>> ---
>>  xen/common/domain.c                         |  1 +
>>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  9 ++-
>>  xen/drivers/passthrough/pci.c               | 68 +++++++++++++++++----
>>  xen/drivers/passthrough/vtd/iommu.c         |  9 ++-
>>  xen/include/xen/sched.h                     |  1 +
>>  5 files changed, 74 insertions(+), 14 deletions(-)
>> 
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index caaa402637..5d8a8836da 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -645,6 +645,7 @@ struct domain *domain_create(domid_t domid,
>>  
>>  #ifdef CONFIG_HAS_PCI
>>      INIT_LIST_HEAD(&d->pdev_list);
>> +    rwlock_init(&d->pci_lock);
>>  #endif
>>  
>>      /* All error paths can depend on the above setup. */
>> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> index 94e3775506..e2f2e2e950 100644
>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -476,8 +476,13 @@ static int cf_check reassign_device(
>>  
>>      if ( devfn == pdev->devfn && pdev->domain != target )
>>      {
>> -        list_move(&pdev->domain_list, &target->pdev_list);
>> -        pdev->domain = target;
>
> You seem to have inadvertently dropped the above line? (and so devices
> would keep the previous pdev->domain value)
>

Oops, yes. Thank you. I was testing those patches on an Intel machine,
so the AMD part was left unverified.

>> +        write_lock(&pdev->domain->pci_lock);
>> +        list_del(&pdev->domain_list);
>> +        write_unlock(&pdev->domain->pci_lock);
>> +
>> +        write_lock(&target->pci_lock);
>> +        list_add(&pdev->domain_list, &target->pdev_list);
>> +        write_unlock(&target->pci_lock);
>>      }
>>  
>>      /*
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 95846e84f2..5b4632ead2 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -454,7 +454,9 @@ static void __init _pci_hide_device(struct pci_dev *pdev)
>>      if ( pdev->domain )
>>          return;
>>      pdev->domain = dom_xen;
>> +    write_lock(&dom_xen->pci_lock);
>>      list_add(&pdev->domain_list, &dom_xen->pdev_list);
>> +    write_unlock(&dom_xen->pci_lock);
>>  }
>>  
>>  int __init pci_hide_device(unsigned int seg, unsigned int bus,
>> @@ -747,6 +749,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>      ret = 0;
>>      if ( !pdev->domain )
>>      {
>> +        write_lock(&hardware_domain->pci_lock);
>>          pdev->domain = hardware_domain;
>>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
>>  
>> @@ -760,6 +763,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>>              list_del(&pdev->domain_list);
>>              pdev->domain = NULL;
>> +            write_unlock(&hardware_domain->pci_lock);
>
> Strictly speaking, this could move one line earlier, as accesses to
> pdev->domain are not protected by the d->pci_lock?  Same in other
> instances (above and below), as you seem to introduce a pattern to
> perform accesses to pdev->domain with the rwlock taken.
>

Yes, you are right. I'll move the unlock() call.

>>              goto out;
>>          }
>>          ret = iommu_add_device(pdev);
>> @@ -768,8 +772,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>              vpci_remove_device(pdev);
>>              list_del(&pdev->domain_list);
>>              pdev->domain = NULL;
>> +            write_unlock(&hardware_domain->pci_lock);
>>              goto out;
>>          }
>> +        write_unlock(&hardware_domain->pci_lock);
>>      }
>>      else
>>          iommu_enable_device(pdev);
>> @@ -812,11 +818,13 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>>          if ( pdev->bus == bus && pdev->devfn == devfn )
>>          {
>> +            write_lock(&pdev->domain->pci_lock);
>>              vpci_remove_device(pdev);
>>              pci_cleanup_msi(pdev);
>>              ret = iommu_remove_device(pdev);
>>              if ( pdev->domain )
>>                  list_del(&pdev->domain_list);
>> +            write_unlock(&pdev->domain->pci_lock);
>
> Here you seem to protect more than strictly required, I would think
> only the list_del() would need to be done holding the rwlock?
>

Yes, I believe this is spill-over from the next patch. Originally all
those changes were introduced in "vpci: use per-domain PCI lock to
protect vpci structure", but then I decided to split the changes into
two patches.

>>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>>              free_pdev(pseg, pdev);
>>              break;
>> @@ -887,26 +895,62 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>>  
>>  int pci_release_devices(struct domain *d)
>>  {
>> -    struct pci_dev *pdev, *tmp;
>> -    u8 bus, devfn;
>> -    int ret;
>> +    int combined_ret;
>> +    LIST_HEAD(failed_pdevs);
>>  
>>      pcidevs_lock();
>> -    ret = arch_pci_clean_pirqs(d);
>> -    if ( ret )
>> +    write_lock(&d->pci_lock);
>> +    combined_ret = arch_pci_clean_pirqs(d);
>
> Why do you need the per-domain rwlock for arch_pci_clean_pirqs()?
> That function doesn't modify the per-domain pdev list.

You are right, I will correct this in the next version.

>
>> +    if ( combined_ret )
>>      {
>>          pcidevs_unlock();
>> -        return ret;
>> +        write_unlock(&d->pci_lock);
>> +        return combined_ret;
>
> Ideally we would like to keep the same order on unlock, so the rwlock
> should be released before the pcidevs lock (unless there's a reason
> not to).

I'll move write_lock() further below, so this will be fixed automatically.

>
>>      }
>> -    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
>> +
>> +    while ( !list_empty(&d->pdev_list) )
>>      {
>> -        bus = pdev->bus;
>> -        devfn = pdev->devfn;
>> -        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>> +        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
>> +                                                struct pci_dev,
>> +                                                domain_list);
>> +        uint16_t seg = pdev->seg;
>> +        uint8_t bus = pdev->bus;
>> +        uint8_t devfn = pdev->devfn;
>> +        int ret;
>> +
>> +        write_unlock(&d->pci_lock);
>> +        ret = deassign_device(d, seg, bus, devfn);
>> +        write_lock(&d->pci_lock);
>> +        if ( ret )
>> +        {
>> +            bool still_present = false;
>> +            const struct pci_dev *tmp;
>> +
>> +            /*
>> +             * We need to check if deassign_device() left our pdev in
>> +             * domain's list. As we dropped the lock, we can't be sure
>> +             * that the list wasn't permuted in some random way, so we
>> +             * need to traverse the whole list.
>> +             */
>> +            for_each_pdev ( d, tmp )
>> +            {
>> +                if ( tmp == pdev )
>> +                {
>> +                    still_present = true;
>> +                    break;
>> +                }
>> +            }
>> +            if ( still_present )
>> +                list_move(&pdev->domain_list, &failed_pdevs);
>
> You can get rid of the still_present variable, and just do:
>
> for_each_pdev ( d, tmp )
>     if ( tmp == pdev )
>     {
>         list_move(&pdev->domain_list, &failed_pdevs);
> 	break;
>     }
>
>

Yep, thanks.


-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 01/13] pci: introduce per-domain PCI rwlock
  2023-07-20 15:40   ` Jan Beulich
@ 2023-07-20 23:37     ` Volodymyr Babchuk
  0 siblings, 0 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-20 23:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, xen-devel


Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 20.07.2023 02:32, Volodymyr Babchuk wrote:
>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -476,8 +476,13 @@ static int cf_check reassign_device(
>>  
>>      if ( devfn == pdev->devfn && pdev->domain != target )
>>      {
>> -        list_move(&pdev->domain_list, &target->pdev_list);
>> -        pdev->domain = target;
>> +        write_lock(&pdev->domain->pci_lock);
>> +        list_del(&pdev->domain_list);
>> +        write_unlock(&pdev->domain->pci_lock);
>
> As mentioned on an earlier version, perhaps better (cheaper) to use
> "source" here? (Same in VT-d code then.)

Sorry, I saw your comment on the previous version but failed to include
this change. It will be done in the next version.

>> @@ -747,6 +749,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>      ret = 0;
>>      if ( !pdev->domain )
>>      {
>> +        write_lock(&hardware_domain->pci_lock);
>>          pdev->domain = hardware_domain;
>>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
>>  
>> @@ -760,6 +763,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>>              list_del(&pdev->domain_list);
>>              pdev->domain = NULL;
>> +            write_unlock(&hardware_domain->pci_lock);
>>              goto out;
>
> In addition to Roger's comments about locking scope: In a case like this
> one it would probably also be good to move the printk() out of the locked
> area. It can be slow, after all.
>
> Question is why you have this wide a locked area here in the first place:
> Don't you need to hold the lock just across the two list operations (but
> not in between)?

Strictly speaking, yes: for now we need to hold the lock only when
operating on the list. The next patch will use the same lock to protect
the vPCI (de)allocation, so the locked region will be extended anyway.

I think I'll shrink the locked area in this patch and extend it in the
next one; that seems most logical.


>> @@ -887,26 +895,62 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>>  
>>  int pci_release_devices(struct domain *d)
>>  {
>> -    struct pci_dev *pdev, *tmp;
>> -    u8 bus, devfn;
>> -    int ret;
>> +    int combined_ret;
>> +    LIST_HEAD(failed_pdevs);
>>  
>>      pcidevs_lock();
>> -    ret = arch_pci_clean_pirqs(d);
>> -    if ( ret )
>> +    write_lock(&d->pci_lock);
>> +    combined_ret = arch_pci_clean_pirqs(d);
>> +    if ( combined_ret )
>>      {
>>          pcidevs_unlock();
>> -        return ret;
>> +        write_unlock(&d->pci_lock);
>> +        return combined_ret;
>>      }
>> -    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
>> +
>> +    while ( !list_empty(&d->pdev_list) )
>>      {
>> -        bus = pdev->bus;
>> -        devfn = pdev->devfn;
>> -        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>> +        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
>> +                                                struct pci_dev,
>> +                                                domain_list);
>> +        uint16_t seg = pdev->seg;
>> +        uint8_t bus = pdev->bus;
>> +        uint8_t devfn = pdev->devfn;
>> +        int ret;
>> +
>> +        write_unlock(&d->pci_lock);
>> +        ret = deassign_device(d, seg, bus, devfn);
>> +        write_lock(&d->pci_lock);
>> +        if ( ret )
>> +        {
>> +            bool still_present = false;
>> +            const struct pci_dev *tmp;
>> +
>> +            /*
>> +             * We need to check if deassign_device() left our pdev in
>> +             * domain's list. As we dropped the lock, we can't be sure
>> +             * that the list wasn't permuted in some random way, so we
>> +             * need to traverse the whole list.
>> +             */
>> +            for_each_pdev ( d, tmp )
>> +            {
>> +                if ( tmp == pdev )
>> +                {
>> +                    still_present = true;
>> +                    break;
>> +                }
>> +            }
>> +            if ( still_present )
>> +                list_move(&pdev->domain_list, &failed_pdevs);
>
> In order to retain original ordering on the resulting list, perhaps better
> list_move_tail()?

Yes, thanks.
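The point of list_move_tail() here is that devices failing deassignment end up on the failed list in their original relative order. In array form (purely illustrative, hypothetical names):

```c
/*
 * Append each device that fails deassignment to the tail of the failed
 * list, preserving the original relative order — the effect of using
 * list_move_tail() instead of list_move() (which prepends).
 */
static int collect_failed(const int *devs, int n,
                          int (*deassign)(int dev), int *failed)
{
    int nfail = 0;

    for ( int i = 0; i < n; i++ )
        if ( deassign(devs[i]) != 0 )
            failed[nfail++] = devs[i];

    return nfail;
}

/* Example predicate: deassignment fails for odd device numbers. */
static int deassign_stub(int dev)
{
    return (dev & 1) ? -1 : 0;
}
```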


-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology
  2023-07-20  6:50   ` Jan Beulich
@ 2023-07-21  0:43     ` Volodymyr Babchuk
  0 siblings, 0 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-21  0:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Oleksandr Andrushchenko, xen-devel


Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 20.07.2023 02:32, Volodymyr Babchuk wrote:
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -46,6 +46,16 @@ void vpci_remove_device(struct pci_dev *pdev)
>>          return;
>>  
>>      spin_lock(&pdev->vpci->lock);
>> +
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
>> +    {
>> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
>> +                    &pdev->domain->vpci_dev_assigned_map);
>> +        pdev->vpci->guest_sbdf.sbdf = ~0;
>> +    }
>> +#endif
>
> The lock acquired above is not ...

vpci_remove_device() is called when d->pci_lock is already held.

But I'll move this hunk before the spin_lock(&pdev->vpci->lock); we
don't need to hold it while cleaning vpci_dev_assigned_map.

>> @@ -115,6 +129,54 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>  }
>>  
>>  #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +static int add_virtual_device(struct pci_dev *pdev)
>> +{
>> +    struct domain *d = pdev->domain;
>> +    pci_sbdf_t sbdf = { 0 };
>> +    unsigned long new_dev_number;
>> +
>> +    if ( is_hardware_domain(d) )
>> +        return 0;
>> +
>> +    ASSERT(pcidevs_locked());
>> +
>> +    /*
>> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
>> +     * there are multi-function ones which are not yet supported.
>> +     */
>> +    if ( pdev->info.is_extfn )
>> +    {
>> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
>> +                 &pdev->sbdf);
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    write_lock(&pdev->domain->pci_lock);
>> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
>> +                                         VPCI_MAX_VIRT_DEV);
>> +    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
>> +    {
>> +        write_unlock(&pdev->domain->pci_lock);
>> +        return -ENOSPC;
>> +    }
>> +
>> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
>
> ... the same as the one held here, so the bitmap still isn't properly
> protected afaics, unless the intention is to continue to rely on
> the global PCI lock (assuming that one's held in both cases, which I
> didn't check it is). Conversely it looks like the vPCI lock isn't
> held here. Both aspects may be intentional, but the locks being
> acquired differing requires suitable code comments imo.

As I stated above, vpci_remove_device() is called when d->pci_lock is
already held.


> I've also briefly looked at patch 1, and I'm afraid that still lacks
> commentary about intended lock nesting. That might be relevant here
> in case locking visible from patch / patch context isn't providing
> the full picture.
>

There is
    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
at the beginning of vpci_remove_device(), which is added by
"vpci: use per-domain PCI lock to protect vpci structure".

I believe it would be more beneficial to review the series from the
beginning.

>> +    /*
>> +     * Both segment and bus number are 0:
>> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
>> +     *  - with bus 0 the virtual devices are seen as embedded
>> +     *    endpoints behind the root complex
>> +     *
>> +     * TODO: add support for multi-function devices.
>> +     */
>> +    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
>> +    pdev->vpci->guest_sbdf = sbdf;
>> +    write_unlock(&pdev->domain->pci_lock);
>
> With the above I also wonder whether this lock can't (and hence
> should) be dropped a little earlier (right after fiddling with the
> bitmap).

This is a good observation, thanks.

-- 
WBR, Volodymyr


* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20 16:14     ` Roger Pau Monné
@ 2023-07-21  6:02       ` Jan Beulich
  2023-07-21  7:43         ` Roger Pau Monné
  0 siblings, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-21  6:02 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Volodymyr Babchuk, Oleksandr Andrushchenko, xen-devel

On 20.07.2023 18:14, Roger Pau Monné wrote:
> On Thu, Jul 20, 2023 at 06:03:49PM +0200, Jan Beulich wrote:
>> On 20.07.2023 02:32, Volodymyr Babchuk wrote:
>>> --- a/xen/drivers/vpci/msi.c
>>> +++ b/xen/drivers/vpci/msi.c
>>> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>>>      uint16_t control;
>>>      int ret;
>>>  
>>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>>
>> I'm afraid I have to ask the opposite question, compared to Roger's:
>> Why do you need the lock held for write here (and in init_msix())?
>> Neither list of devices nor the pdev->vpci pointer are being
>> altered.
> 
> This is called from vpci_add_handlers() which will acquire (or
> requires being called) with the lock in write mode in order to set
> pdev->vpci I would assume.

Right.

>  Strictly speaking however the init
> handlers don't require the lock in write mode unless we use such
> locking to get exclusive access to all the devices assigned to the
> domain BARs array for modify_bars().

Aiui in the present model modify_bars() has to use the vpci lock for
protection. Therefore imo in any of the init functions the assertions
should either express the real requirements of those functions, or be
omitted on the basis that they're all called out of add-handlers
anyway.

Jan



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-21  6:02       ` Jan Beulich
@ 2023-07-21  7:43         ` Roger Pau Monné
  2023-07-21  8:48           ` Jan Beulich
  0 siblings, 1 reply; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21  7:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Volodymyr Babchuk, Oleksandr Andrushchenko, xen-devel

On Fri, Jul 21, 2023 at 08:02:33AM +0200, Jan Beulich wrote:
> On 20.07.2023 18:14, Roger Pau Monné wrote:
> >  Strictly speaking however the init
> > handlers don't require the lock in write mode unless we use such
> > locking to get exclusive access to all the devices assigned to the
> > domain BARs array for modify_bars().
> 
> Aiui in the present model modify_bars() has to use the vpci lock for
> protection.

But the current protection is insufficient, as we only hold the vpci
lock of the current device, but we don't hold the vpci locks of the
other devices when we iterate over them in order to find overlapping
BARs (or else it would be an ABBA deadlock situation).

So my suggestion (which can be done later) is to take the newly
introduced per-domain rwlock in exclusive mode for modify_bars() in
order to assert there are no changes to the other devices vpci bar
fields.

> Therefore imo in any of the init functions the assertions
> should either express the real requirements of those functions, or be
> omitted on the basis that they're all called out of add-handlers
> anyway.

I'm happy to omit for the time being.  Iff we agree that modify_bars()
requires the rwlock in exclusive mode then we could add the assertion
there, but not in the init functions themselves.

Thanks, Roger.



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-21  7:43         ` Roger Pau Monné
@ 2023-07-21  8:48           ` Jan Beulich
  0 siblings, 0 replies; 73+ messages in thread
From: Jan Beulich @ 2023-07-21  8:48 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Volodymyr Babchuk, Oleksandr Andrushchenko, xen-devel

On 21.07.2023 09:43, Roger Pau Monné wrote:
> On Fri, Jul 21, 2023 at 08:02:33AM +0200, Jan Beulich wrote:
>> On 20.07.2023 18:14, Roger Pau Monné wrote:
>>>  Strictly speaking however the init
>>> handlers don't require the lock in write mode unless we use such
>>> locking to get exclusive access to all the devices assigned to the
>>> domain BARs array for modify_bars().
>>
>> Aiui in the present model modify_bars() has to use the vpci lock for
>> protection.
> 
> But the current protection is insufficient, as we only hold the vpci
> lock of the current device, but we don't hold the vpci locks of the
> other devices when we iterate over them in order to find overlapping
> BARs (or else it would be an ABBA deadlock situation).
> 
> So my suggestion (which can be done later) is to take the newly
> introduced per-domain rwlock in exclusive mode for modify_bars() in
> order to assert there are no changes to the other devices vpci bar
> fields.

I think this makes sense, just that it doesn't belong in this patch.

Jan

>> Therefore imo in any of the init functions the assertions
>> should either express the real requirements of those functions, or be
>> omitted on the basis that they're all called out of add-handlers
>> anyway.
> 
> I'm happy to omit for the time being.  Iff we agree that modify_bars()
> requires the rwlock in exclusive mode then we could add the assertion
> there, but not in the init functions themselves.
> 
> Thanks, Roger.




* Re: [PATCH v8 05/13] vpci/header: implement guest BAR register handlers
  2023-07-20  0:32 ` [PATCH v8 05/13] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
  2023-07-20 16:01   ` Roger Pau Monné
@ 2023-07-21 10:36   ` Rahul Singh
  2023-07-21 10:50     ` Jan Beulich
  1 sibling, 1 reply; 73+ messages in thread
From: Rahul Singh @ 2023-07-21 10:36 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

Hi Volodymyr,

> On 20 Jul 2023, at 1:32 am, Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> wrote:
> 
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Emulate guest BAR register values: this allows creating a guest view
> of the registers and emulates size and properties probe as it is done
> during PCI device enumeration by the guest.
> 
> All empty, IO and ROM BARs for guests are emulated by returning 0 on
> reads and ignoring writes: these BARs are special in this respect as
> their lower bits have special meaning, so returning the default ~0 on
> read may confuse the guest OS.
> 
> Memory decoding is initially disabled when used by guests in order to
> prevent the BAR being placed on top of a RAM region.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> 
> Since v6:
> - unify the writing of the PCI_COMMAND register on the
>  error path into a label
> - do not introduce bar_ignore_access helper and open code
> - s/guest_bar_ignore_read/empty_bar_read
> - update error message in guest_bar_write
> - only setup empty_bar_read for IO if !x86
> Since v5:
> - make sure that the guest set address has the same page offset
>  as the physical address on the host
> - remove guest_rom_{read|write} as those just implement the default
>  behaviour of the registers not being handled
> - adjusted comment for struct vpci.addr field
> - add guest handlers for BARs which are not handled and will otherwise
>  return ~0 on read and ignore writes. The BARs are special in this
>  respect as their lower bits have special meaning, so returning ~0
>  doesn't seem to be right
> Since v4:
> - updated commit message
> - s/guest_addr/guest_reg
> Since v3:
> - squashed two patches: dynamic add/remove handlers and guest BAR
>  handler implementation
> - fix guest BAR read of the high part of a 64bit BAR (Roger)
> - add error handling to vpci_assign_device
> - s/dom%pd/%pd
> - blank line before return
> Since v2:
> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>  has been eliminated from being built on x86
> Since v1:
> - constify struct pci_dev where possible
> - do not open code is_system_domain()
> - simplify some code
> - use gdprintk + error code instead of gprintk
> - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>   so these do not get compiled for x86
> - removed unneeded is_system_domain check
> - re-work guest read/write to be much simpler and do more work on write
>   than read which is expected to be called more frequently
> - removed one too obvious comment
> ---
> xen/drivers/vpci/header.c | 156 +++++++++++++++++++++++++++++++-------
> xen/include/xen/vpci.h    |   3 +
> 2 files changed, 130 insertions(+), 29 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 2780fcae72..5dc9b5338b 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -457,6 +457,71 @@ static void cf_check bar_write(
>     pci_conf_write32(pdev->sbdf, reg, val);
> }
> 
> +static void cf_check guest_bar_write(const struct pci_dev *pdev,
> +                                     unsigned int reg, uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +    uint64_t guest_reg = bar->guest_reg;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +    }
> +
> +    guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
> +    guest_reg |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
> +
> +    /*
> +     * Make sure that the guest set address has the same page offset
> +     * as the physical address on the host or otherwise things won't work as
> +     * expected.
> +     */
> +    if ( (guest_reg & (~PAGE_MASK & PCI_BASE_ADDRESS_MEM_MASK)) !=
> +         (bar->addr & ~PAGE_MASK) )
> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%pp: ignored BAR %zu write attempting to change page offset\n",
> +                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
> +        return;
> +    }
> +
> +    bar->guest_reg = guest_reg;
> +}
> +
> +static uint32_t cf_check guest_bar_read(const struct pci_dev *pdev,
> +                                        unsigned int reg, void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    return bar->guest_reg >> (hi ? 32 : 0);
> +}
> +
> +static uint32_t cf_check empty_bar_read(const struct pci_dev *pdev,
> +                                        unsigned int reg, void *data)
> +{
> +    return 0;
> +}
> +
> static void cf_check rom_write(
>     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
> {
> @@ -517,6 +582,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
>     struct vpci_header *header = &pdev->vpci->header;
>     struct vpci_bar *bars = header->bars;
>     int rc;
> +    bool is_hwdom = is_hardware_domain(pdev->domain);
> 
>     ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> 
> @@ -558,13 +624,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
>         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>         {
>             bars[i].type = VPCI_BAR_MEM64_HI;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            rc = vpci_add_register(pdev->vpci,
> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                                   is_hwdom ? bar_write : guest_bar_write,
> +                                   reg, 4, &bars[i]);
>             if ( rc )
> -            {
> -                pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> -                return rc;
> -            }
> +                goto fail;
> 
>             continue;
>         }
> @@ -573,6 +638,17 @@ static int cf_check init_bars(struct pci_dev *pdev)
>         if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>         {
>             bars[i].type = VPCI_BAR_IO;
> +
> +#ifndef CONFIG_X86
> +            if ( !is_hwdom )
> +            {
> +                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                                       reg, 4, &bars[i]);
> +                if ( rc )
> +                    goto fail;
> +            }
> +#endif
> +
>             continue;
>         }
>         if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> @@ -584,14 +660,20 @@ static int cf_check init_bars(struct pci_dev *pdev)
>         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
>                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
>         if ( rc < 0 )
> -        {
> -            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> -            return rc;
> -        }
> +            goto fail;
> 
>         if ( size == 0 )
>         {
>             bars[i].type = VPCI_BAR_EMPTY;
> +
> +            if ( !is_hwdom )
> +            {
> +                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
> +                                       reg, 4, &bars[i]);
> +                if ( rc )
> +                    goto fail;
> +            }
> +
>             continue;
>         }
> 
> @@ -599,34 +681,50 @@ static int cf_check init_bars(struct pci_dev *pdev)
>         bars[i].size = size;
>         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;

I think there is a need to also set the BAR mem type and prefetchable
bits in guest_reg, to avoid a mismatch when the guest kernel initially
reads the BARs.

if ( !is_hwdom )
{
    bars[i].guest_reg |= bars[i].type == VPCI_BAR_MEM32 ?
                         PCI_BASE_ADDRESS_MEM_TYPE_32 :
                         PCI_BASE_ADDRESS_MEM_TYPE_64;
    bars[i].guest_reg |= bars[i].prefetchable ?
                         PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
}

Regards,
Rahul


* Re: [PATCH v8 05/13] vpci/header: implement guest BAR register handlers
  2023-07-21 10:36   ` Rahul Singh
@ 2023-07-21 10:50     ` Jan Beulich
  2023-07-21 11:52       ` Roger Pau Monné
  0 siblings, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-21 10:50 UTC (permalink / raw)
  To: Rahul Singh, Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Roger Pau Monné

On 21.07.2023 12:36, Rahul Singh wrote:
>> On 20 Jul 2023, at 1:32 am, Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> wrote:
>> @@ -599,34 +681,50 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>         bars[i].size = size;
>>         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> 
> I think there is a need to set the BAR mem type and prefetchable bit to the 
> guest_reg also to avoid mismatch when Guest kernel initially read the BAR’s.

Perhaps more generally: Shouldn't r/o bits be handed through in almost
all cases?

Jan



* Re: [PATCH v8 07/13] vpci/header: handle p2m range sets per BAR
  2023-07-20  0:32 ` [PATCH v8 07/13] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
@ 2023-07-21 11:49   ` Roger Pau Monné
  0 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21 11:49 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:32AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Instead of handling a single range set, that contains all the memory
> regions of all the BARs and ROM, have them per BAR.
> As the range sets are now created when a PCI device is added and destroyed
> when it is removed so make them named and accounted.
> 
> Note that rangesets were chosen here despite there being only up to
> 3 separate ranges in each set (typically just 1). But rangeset per BAR
> was chosen for the ease of implementation and existing code re-usability.
> 
> This is in preparation of making non-identity mappings in p2m for the MMIOs.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> ---
> Since v6:
> - update according to the new locking scheme
> - remove odd fail label in modify_bars
> Since v5:
> - fix comments
> - move rangeset allocation to init_bars and only allocate
>   for MAPPABLE BARs
> - check for overlap with the already setup BAR ranges
> Since v4:
> - use named range sets for BARs (Jan)
> - changes required by the new locking scheme
> - updated commit message (Jan)
> Since v3:
> - re-work vpci_cancel_pending accordingly to the per-BAR handling
> - s/num_mem_ranges/map_pending and s/uint8_t/bool
> - ASSERT(bar->mem) in modify_bars
> - create and destroy the rangesets on add/remove
> ---
>  xen/drivers/vpci/header.c | 235 ++++++++++++++++++++++++++++----------
>  xen/drivers/vpci/vpci.c   |   6 +
>  xen/include/xen/vpci.h    |   3 +-
>  3 files changed, 181 insertions(+), 63 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 5dc9b5338b..eb07fa0bb2 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -141,63 +141,106 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>  
>  bool vpci_process_pending(struct vcpu *v)
>  {
> -    if ( v->vpci.mem )
> +    struct pci_dev *pdev = v->vpci.pdev;
> +    if ( !pdev )
> +        return false;

I think this check is kind of inverted: you should check for
vpci.map_pending first, and then check that the rest of the fields are
also set (or complain otherwise, as something clearly went wrong).

> +
> +    if ( v->vpci.map_pending )
>      {
>          struct map_data data = {
>              .d = v->domain,
>              .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>          };
> -        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
> -
> -        if ( rc == -ERESTART )
> -            return true;
> +        struct vpci_header *header = &pdev->vpci->header;

You need to hold the per-domain rwlock in order to access
pdev->vpci.

> +        unsigned int i;
>  
>          write_lock(&v->domain->pci_lock);

Holding the lock in write mode for the duration of the mapping is
quite aggressive, as the mapping operation could be a long running
one.

Is this only locked in exclusive mode in order to have the right
locking for the vpci_remove_device() call below?

If so we might consider using a different error handling in order to
avoid taking the lock in exclusive mode.

> -        spin_lock(&v->vpci.pdev->vpci->lock);
> -        /* Disable memory decoding unconditionally on failure. */
> -        modify_decoding(v->vpci.pdev,
> -                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
> -                        !rc && v->vpci.rom_only);
> -        spin_unlock(&v->vpci.pdev->vpci->lock);
> -
> -        rangeset_destroy(v->vpci.mem);
> -        v->vpci.mem = NULL;
> -        if ( rc )
> -            /*
> -             * FIXME: in case of failure remove the device from the domain.
> -             * Note that there might still be leftover mappings. While this is
> -             * safe for Dom0, for DomUs the domain will likely need to be
> -             * killed in order to avoid leaking stale p2m mappings on
> -             * failure.
> -             */
> -            vpci_remove_device(v->vpci.pdev);
> +
> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +        {
> +            struct vpci_bar *bar = &header->bars[i];
> +            int rc;
> +
> +            if ( rangeset_is_empty(bar->mem) )
> +                continue;
> +
> +            rc = rangeset_consume_ranges(bar->mem, map_range, &data);
> +
> +            if ( rc == -ERESTART )
> +            {
> +                write_unlock(&v->domain->pci_lock);
> +                return true;
> +            }
> +
> +            spin_lock(&pdev->vpci->lock);
> +            /* Disable memory decoding unconditionally on failure. */
> +            modify_decoding(pdev, rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY :
> +                                       v->vpci.cmd, !rc && v->vpci.rom_only);

This need to also be moved out of the loop, or else you would be
toggling the memory decoding bit every time a BAR is mapped or
unmapped.  This must be done once all BARs are {un,}mapped (so outside
of the for loop).

You will likely need to keep a call here that disables memory decoding
only if the mapping has failed.

> +            spin_unlock(&pdev->vpci->lock);
> +
> +            if ( rc )
> +            {
> +                /*
> +                 * FIXME: in case of failure remove the device from the domain.
> +                 * Note that there might still be leftover mappings. While this
> +                 * is safe for Dom0, for DomUs the domain needs to be killed in
> +                 * order to avoid leaking stale p2m mappings on failure.
> +                 */

You are already handling the domU case, so the comment needs to be
adjusted, as it's no longer a FIXME.  We might consider just removing
the comment altogether.

> +                v->vpci.map_pending = false;
> +
> +                if ( is_hardware_domain(v->domain) )
> +                {
> +                    vpci_remove_device(pdev);
> +                    write_unlock(&v->domain->pci_lock);
> +                }
> +                else
> +                {
> +                    write_unlock(&v->domain->pci_lock);
> +                    domain_crash(v->domain);
> +                }
> +                return false;
> +            }
> +        }
>          write_unlock(&v->domain->pci_lock);
> +
> +        v->vpci.map_pending = false;
>      }
>  
> +
>      return false;
>  }
>  
>  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> -                            struct rangeset *mem, uint16_t cmd)
> +                            uint16_t cmd)
>  {
>      struct map_data data = { .d = d, .map = true };
> -    int rc;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    int rc = 0;
> +    unsigned int i;
>  
>      ASSERT(rw_is_locked(&d->pci_lock));
>  
> -    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        /*
> -         * It's safe to drop and reacquire the lock in this context
> -         * without risking pdev disappearing because devices cannot be
> -         * removed until the initial domain has been started.
> -         */
> -        read_unlock(&d->pci_lock);
> -        process_pending_softirqs();
> -        read_lock(&d->pci_lock);
> -    }
> +        struct vpci_bar *bar = &header->bars[i];
>  
> -    rangeset_destroy(mem);
> +        if ( rangeset_is_empty(bar->mem) )
> +            continue;
> +
> +        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
> +                                              &data)) == -ERESTART )
> +        {
> +            /*
> +             * It's safe to drop and reacquire the lock in this context
> +             * without risking pdev disappearing because devices cannot be
> +             * removed until the initial domain has been started.
> +             */
> +            write_unlock(&d->pci_lock);
> +            process_pending_softirqs();
> +            write_lock(&d->pci_lock);
> +        }
> +    }
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
>  
> @@ -205,10 +248,12 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>  }
>  
>  static void defer_map(struct domain *d, struct pci_dev *pdev,
> -                      struct rangeset *mem, uint16_t cmd, bool rom_only)
> +                      uint16_t cmd, bool rom_only)
>  {
>      struct vcpu *curr = current;
>  
> +    ASSERT(!!rw_is_write_locked(&pdev->domain->pci_lock));

No need for the double !!.

> +
>      /*
>       * FIXME: when deferring the {un}map the state of the device should not
>       * be trusted. For example the enable bit is toggled after the device
> @@ -216,7 +261,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>       * started for the same device if the domain is not well-behaved.
>       */
>      curr->vpci.pdev = pdev;
> -    curr->vpci.mem = mem;
> +    curr->vpci.map_pending = true;
>      curr->vpci.cmd = cmd;
>      curr->vpci.rom_only = rom_only;
>      /*
> @@ -230,33 +275,34 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>  {
>      struct vpci_header *header = &pdev->vpci->header;
> -    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>      struct pci_dev *tmp, *dev = NULL;
>      const struct domain *d;
>      const struct vpci_msix *msix = pdev->vpci->msix;
> -    unsigned int i;
> +    unsigned int i, j;
>      int rc;
> +    bool map_pending;
>  
>      ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> -    if ( !mem )
> -        return -ENOMEM;
> -
>      /*
> -     * Create a rangeset that represents the current device BARs memory region
> -     * and compare it against all the currently active BAR memory regions. If
> -     * an overlap is found, subtract it from the region to be mapped/unmapped.
> +     * Create a rangeset per BAR that represents the current device memory
> +     * region and compare it against all the currently active BAR memory
> +     * regions. If an overlap is found, subtract it from the region to be
> +     * mapped/unmapped.
>       *
> -     * First fill the rangeset with all the BARs of this device or with the ROM
> +     * First fill the rangesets with the BARs of this device or with the ROM

I think you need to drop the 's' from BARs also.

>       * BAR only, depending on whether the guest is toggling the memory decode
>       * bit of the command register, or the enable bit of the ROM BAR register.
>       */
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
> -        const struct vpci_bar *bar = &header->bars[i];
> +        struct vpci_bar *bar = &header->bars[i];
>          unsigned long start = PFN_DOWN(bar->addr);
>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>  
> +        if ( !bar->mem )
> +            continue;
> +
>          if ( !MAPPABLE_BAR(bar) ||
>               (rom_only ? bar->type != VPCI_BAR_ROM
>                         : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
> @@ -272,14 +318,31 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              continue;
>          }
>  
> -        rc = rangeset_add_range(mem, start, end);
> +        rc = rangeset_add_range(bar->mem, start, end);
>          if ( rc )
>          {
>              printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
>                     start, end, rc);
> -            rangeset_destroy(mem);
>              return rc;
>          }
> +
> +        /* Check for overlap with the already setup BAR ranges. */
> +        for ( j = 0; j < i; j++ )
> +        {
> +            struct vpci_bar *bar = &header->bars[j];

This is kind of confusing, as you are defining an inner 'bar' variable
that shadows the outside one.  Might be better to name it as prev_bar
or some such, to avoid the shadowing.

> +
> +            if ( rangeset_is_empty(bar->mem) )
> +                continue;
> +
> +            rc = rangeset_remove_range(bar->mem, start, end);
> +            if ( rc )
> +            {
> +                printk(XENLOG_G_WARNING
> +                       "Failed to remove overlapping range [%lx, %lx]: %d\n",
> +                       start, end, rc);

Might as well print the SBDF of the device while at it (same below).

You could also consider using gprintk instead of plain printk, and
avoid the _G_ tag in the log level.

> +                return rc;
> +            }
> +        }
>      }
>  
>      /* Remove any MSIX regions if present. */
> @@ -289,14 +352,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>          unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>                                       vmsix_table_size(pdev->vpci, i) - 1);
>  
> -        rc = rangeset_remove_range(mem, start, end);
> -        if ( rc )
> +        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
>          {
> -            printk(XENLOG_G_WARNING
> -                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
> -                   start, end, rc);
> -            rangeset_destroy(mem);
> -            return rc;
> +            const struct vpci_bar *bar = &header->bars[j];
> +
> +            if ( rangeset_is_empty(bar->mem) )
> +                continue;
> +
> +            rc = rangeset_remove_range(bar->mem, start, end);
> +            if ( rc )
> +            {
> +                printk(XENLOG_G_WARNING
> +                       "Failed to remove MSIX table [%lx, %lx]: %d\n",
> +                       start, end, rc);
> +                return rc;
> +            }
>          }
>      }
>  
> @@ -341,7 +411,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>                  unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>  
>                  if ( !bar->enabled ||
> -                     !rangeset_overlaps_range(mem, start, end) ||
> +                     !rangeset_overlaps_range(bar->mem, start, end) ||
>                       /*
>                        * If only the ROM enable bit is toggled check against
>                        * other BARs in the same device for overlaps, but not
> @@ -350,12 +420,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>                       (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
>                      continue;
>  
> -                rc = rangeset_remove_range(mem, start, end);
> +                rc = rangeset_remove_range(bar->mem, start, end);

Urg, isn't 'bar' here pointing to the remote device BAR, not the BARs
that we want to map?

You need an inner loop that iterates over header->bars, much like you
do to handle the MSI-X table overlaps.
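Something mirroring the MSI-X handling could look like this (sketch against this patch; `start`/`end` cover the enabled BAR of the remote device `tmp`, while `header` belongs to the device whose BARs are being mapped):

```c
                for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
                {
                    const struct vpci_bar *this_bar = &header->bars[j];

                    if ( rangeset_is_empty(this_bar->mem) )
                        continue;

                    rc = rangeset_remove_range(this_bar->mem, start, end);
                    if ( rc )
                    {
                        printk(XENLOG_G_WARNING
                               "Failed to remove [%lx, %lx]: %d\n",
                               start, end, rc);
                        return rc;
                    }
                }
```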

>                  if ( rc )
>                  {
>                      printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>                             start, end, rc);
> -                    rangeset_destroy(mem);
>                      return rc;
>                  }
>              }
> @@ -380,10 +449,23 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>           * will always be to establish mappings and process all the BARs.
>           */
>          ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
> -        return apply_map(pdev->domain, pdev, mem, cmd);
> +        return apply_map(pdev->domain, pdev, cmd);
>      }
>  
> -    defer_map(dev->domain, dev, mem, cmd, rom_only);
> +    /* Find out how many memory ranges has left after MSI and overlaps. */
                                          ^ are (I think).
> +    map_pending = false;
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +        if ( !rangeset_is_empty(header->bars[i].mem) )
> +        {
> +            map_pending = true;
> +            break;
> +        }
> +
> +    /* If there's no mapping work write the command register now. */
> +    if ( !map_pending )
> +        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> +    else
> +        defer_map(dev->domain, dev, cmd, rom_only);

This is not strictly required, and differs from the current approach,
where defer_map() gets called regardless of whether the rangesets are
all empty.


Could be moved to a separate commit.

>  
>      return 0;
>  }
> @@ -574,6 +656,19 @@ static void cf_check rom_write(
>          rom->addr = val & PCI_ROM_ADDRESS_MASK;
>  }
>  
> +static int bar_add_rangeset(struct pci_dev *pdev, struct vpci_bar *bar, int i)

pci_dev should be const, and i unsigned.

> +{
> +    char str[32];
> +
> +    snprintf(str, sizeof(str), "%pp:BAR%d", &pdev->sbdf, i);
> +
> +    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
> +    if ( !bar->mem )
> +        return -ENOMEM;
> +
> +    return 0;
> +}
> +
>  static int cf_check init_bars(struct pci_dev *pdev)
>  {
>      uint16_t cmd;
> @@ -657,6 +752,13 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          else
>              bars[i].type = VPCI_BAR_MEM32;
>  
> +        rc = bar_add_rangeset(pdev, &bars[i], i);
> +        if ( rc )
> +        {
> +            bars[i].type = VPCI_BAR_EMPTY;
> +            return rc;
> +        }
> +
>          rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
>                                (i == num_bars - 1) ? PCI_BAR_LAST : 0);
>          if ( rc < 0 )
> @@ -707,6 +809,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
>                                     rom_reg, 4, rom);
>              if ( rc )
>                  rom->type = VPCI_BAR_EMPTY;
> +            else
> +            {
> +                rc = bar_add_rangeset(pdev, rom, i);
> +                if ( rc )
> +                {
> +                    rom->type = VPCI_BAR_EMPTY;
> +                    return rc;
> +                }

For both of the above: I don't think you need to set the BAR to EMPTY
if you are already returning an error, as the whole vPCI handling will
fail initialization.  Setting to empty only makes sense if we can try
to continue with normal operations.

> +            }
>          }
>      }
>      else
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index a97710a806..ca3505ecb7 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>  
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> +    unsigned int i;
> +
>      ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>  
>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
> @@ -63,6 +65,10 @@ void vpci_remove_device(struct pci_dev *pdev)
>              if ( pdev->vpci->msix->table[i] )
>                  iounmap(pdev->vpci->msix->table[i]);
>      }
> +
> +    for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
> +        rangeset_destroy(pdev->vpci->header.bars[i].mem);
> +
>      xfree(pdev->vpci->msix);
>      xfree(pdev->vpci->msi);
>      xfree(pdev->vpci);
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 486a655e8d..b78dd6512b 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -72,6 +72,7 @@ struct vpci {
>              /* Guest view of the BAR: address and lower bits. */
>              uint64_t guest_reg;
>              uint64_t size;
> +            struct rangeset *mem;
>              enum {
>                  VPCI_BAR_EMPTY,
>                  VPCI_BAR_IO,
> @@ -156,9 +157,9 @@ struct vpci {
>  
>  struct vpci_vcpu {
>      /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
> -    struct rangeset *mem;
>      struct pci_dev *pdev;
>      uint16_t cmd;
> +    bool map_pending : 1;

I do wonder whether we really need the map_pending boolean field,
couldn't pdev != NULL be used as a way to signal a pending mapping
operation?
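A sketch of what relying on the pdev pointer could look like in vpci_process_pending() (assumed shape, not code from the patch):

```c
    struct vpci_vcpu *vpci = &v->vpci;

    if ( !vpci->pdev )
        return false;              /* no mapping operation pending */

    /* ... walk the per-BAR rangesets and map/unmap as needed ... */

    vpci->pdev = NULL;             /* done: clear the pending marker */
```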

Thanks, Roger.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 05/13] vpci/header: implement guest BAR register handlers
  2023-07-21 10:50     ` Jan Beulich
@ 2023-07-21 11:52       ` Roger Pau Monné
  0 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21 11:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Rahul Singh, Volodymyr Babchuk, xen-devel, Oleksandr Andrushchenko

On Fri, Jul 21, 2023 at 12:50:23PM +0200, Jan Beulich wrote:
> On 21.07.2023 12:36, Rahul Singh wrote:
> >> On 20 Jul 2023, at 1:32 am, Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> wrote:
> >> @@ -599,34 +681,50 @@ static int cf_check init_bars(struct pci_dev *pdev)
> >>         bars[i].size = size;
> >>         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> > 
> > I think there is a need to set the BAR mem type and prefetchable bit to the 
> > guest_reg also to avoid mismatch when Guest kernel initially read the BAR’s.
> 
> Perhaps more generally: Shouldn't r/o bits be handed through in almost
> all cases?

I remember in an earlier version suggesting to store the guest
address, instead of the guest BAR register value.  Then the flags
would be unconditionally added in guest_bar_read() and we wouldn't
need to worry about initializing the register.
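That earlier suggestion could look roughly like this (a sketch under the assumption that a `guest_addr` field replaces `guest_reg`; `guest_bar_read` and the exact bit handling are illustrative, not taken from the patch):

```c
static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
                               void *data)
{
    const struct vpci_bar *bar = data;
    uint32_t val = bar->guest_addr;

    /* Unconditionally fold in the read-only bits from the physical BAR. */
    if ( bar->type == VPCI_BAR_MEM64_LO )
        val |= PCI_BASE_ADDRESS_MEM_TYPE_64;
    if ( bar->prefetchable )
        val |= PCI_BASE_ADDRESS_MEM_PREFETCH;

    return val;
}
```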

Thanks, Roger.



* Re: [PATCH v8 08/13] vpci/header: program p2m with guest BAR view
  2023-07-20  0:32 ` [PATCH v8 08/13] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
@ 2023-07-21 13:05   ` Roger Pau Monné
  2023-07-24 10:30     ` Jan Beulich
  2023-07-24 10:43   ` Jan Beulich
  1 sibling, 1 reply; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21 13:05 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:33AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Take into account guest's BAR view and program its p2m accordingly:
> gfn is guest's view of the BAR and mfn is the physical BAR value as set
> up by the PCI bus driver in the hardware domain.

Who sets that value should be left out of the commit message.  On x86
PCI BARs are positioned by the firmware usually.

> This way hardware domain sees physical BAR values and guest sees
> emulated ones.

This last sentence is kind of confusing, I would maybe write:

"Hardware domain continues getting the BARs identity mapped, while for
domUs the BARs are mapped at the requested guest address without
modifying the BAR address in the device PCI config space."

I'm afraid you are missing changes in modify_bars():  the overlaps for
domU should be checked against the guest address of the BAR, not the
host one.  So you need to adjust the code in modify_bars() to use the
newly introduced guest_reg when checking for overlaps in the domU case
(and when populating the rangesets).

> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v5:
> - remove debug print in map_range callback
> - remove "identity" from the debug print
> Since v4:
> - moved start_{gfn|mfn} calculation into map_range
> - pass vpci_bar in the map_data instead of start_{gfn|mfn}
> - s/guest_addr/guest_reg
> Since v3:
> - updated comment (Roger)
> - removed gfn_add(map->start_gfn, rc); which is wrong
> - use v->domain instead of v->vpci.pdev->domain
> - removed odd e.g. in comment
> - s/d%d/%pd in altered code
> - use gdprintk for map/unmap logs
> Since v2:
> - improve readability for data.start_gfn and restructure ?: construct
> Since v1:
>  - s/MSI/MSI-X in comments
> ---
>  xen/drivers/vpci/header.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index eb07fa0bb2..e1a448b674 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -30,6 +30,7 @@
>  
>  struct map_data {
>      struct domain *d;
> +    const struct vpci_bar *bar;
>      bool map;
>  };
>  
> @@ -41,8 +42,21 @@ static int cf_check map_range(
>  
>      for ( ; ; )
>      {
> +        /* Start address of the BAR as seen by the guest. */
> +        gfn_t start_gfn = _gfn(PFN_DOWN(is_hardware_domain(map->d)
> +                                        ? map->bar->addr
> +                                        : map->bar->guest_reg));
> +        /* Physical start address of the BAR. */
> +        mfn_t start_mfn = _mfn(PFN_DOWN(map->bar->addr));
>          unsigned long size = e - s + 1;
>  
> +        /*
> +         * Ranges to be mapped don't always start at the BAR start address, as
> +         * there can be holes or partially consumed ranges. Account for the
> +         * offset of the current address from the BAR start.
> +         */
> +        start_gfn = gfn_add(start_gfn, s - mfn_x(start_mfn));

The rangeset for guests should contain the guest address,
not the physical position of the BAR, so the logic here will be
slightly different (as you will need to adjust the mfn parameter of
{,un}map_mmio_regions() instead).

That's so you can do overlap checking in the guest address space, as
it's where the mappings will be created.
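In that arrangement map_range() would translate in the opposite direction, e.g. (sketch, assuming bar->mem now holds guest frame numbers):

```c
        /* 's' is a guest frame; translate to the matching host frame. */
        gfn_t start_gfn = _gfn(s);
        mfn_t start_mfn = _mfn(PFN_DOWN(map->bar->addr) +
                               (s - PFN_DOWN(map->bar->guest_reg)));

        rc = map->map ? map_mmio_regions(map->d, start_gfn, size, start_mfn)
                      : unmap_mmio_regions(map->d, start_gfn, size, start_mfn);
```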

> +
>          /*
>           * ARM TODOs:
>           * - On ARM whether the memory is prefetchable or not should be passed
> @@ -52,8 +66,8 @@ static int cf_check map_range(
>           * - {un}map_mmio_regions doesn't support preemption.
>           */
>  
> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> +        rc = map->map ? map_mmio_regions(map->d, start_gfn, size, _mfn(s))
> +                      : unmap_mmio_regions(map->d, start_gfn, size, _mfn(s));
>          if ( rc == 0 )
>          {
>              *c += size;
> @@ -62,8 +76,8 @@ static int cf_check map_range(
>          if ( rc < 0 )
>          {
>              printk(XENLOG_G_WARNING
> -                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
> -                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
> +                   "Failed to %smap [%lx, %lx] for %pd: %d\n",
> +                   map->map ? "" : "un", s, e, map->d, rc);

I would also print the gfn -> mfn values if it's no longer an identity
map.

>              break;
>          }
>          ASSERT(rc < size);
> @@ -165,6 +179,7 @@ bool vpci_process_pending(struct vcpu *v)
>              if ( rangeset_is_empty(bar->mem) )
>                  continue;
>  
> +            data.bar = bar;

Please init the .bar field at declaration, like it's done for the rest
of the fields.  It doesn't matter if the BAR turns out to be empty
(same below).

Thanks, Roger.



* Re: [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2023-07-20  0:32 ` [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
@ 2023-07-21 13:32   ` Roger Pau Monné
  2023-07-21 13:40     ` Roger Pau Monné
  2023-07-24 11:06     ` Jan Beulich
  2023-07-24 11:03   ` Jan Beulich
  1 sibling, 2 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21 13:32 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:33AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
> guest's view of this will want to be zero initially, the host having set
> it to 1 may not easily be overwritten with 0, or else we'd effectively
> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
> proper emulation in order to honor host's settings.

You speak about SERR here, yet in the code all bits are togglable by
domUs.

> There are examples of emulators [1], [2] which already deal with PCI_COMMAND
> register emulation and it seems that at most they care about is the only INTx
                                                                      ^ stray?
> bit (besides IO/memory enable and bus master which are write through).
> It could be because in order to properly emulate the PCI_COMMAND register
> we need to know about the whole PCI topology, e.g. if any setting in device's
> command register is aligned with the upstream port etc.
> 
> This makes me think that because of this complexity others just ignore that.
> Neither I think this can easily be done in Xen case.
> 
> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
> Device Control" the reset state of the command register is typically 0,
> so when assigning a PCI device use 0 as the initial state for the guest's view
> of the command register.
> 
> For now our emulation only makes sure INTx is set according to the host
> requirements, i.e. depending on MSI/MSI-X enabled state.
> 
> This implementation and the decision to only emulate INTx bit for now
> is based on the previous discussion at [3].
> 
> [1] https://github.com/qemu/qemu/blob/master/hw/xen/xen_pt_config_init.c#L310
> [2] https://github.com/projectacrn/acrn-hypervisor/blob/master/hypervisor/hw/pci.c#L336
> [3] https://patchwork.kernel.org/project/xen-devel/patch/20210903100831.177748-9-andr2000@gmail.com/
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> 
> Since v6:
> - fold guest's logic into cmd_write
> - implement cmd_read, so we can report emulated INTx state to guests
> - introduce header->guest_cmd to hold the emulated state of the
>   PCI_COMMAND register for guests
> Since v5:
> - add additional check for MSI-X enabled while altering INTX bit
> - make sure INTx disabled while guests enable MSI/MSI-X
> Since v3:
> - gate more code on CONFIG_HAS_MSI
> - removed logic for the case when MSI/MSI-X not enabled
> ---
>  xen/drivers/vpci/header.c | 38 +++++++++++++++++++++++++++++++++++++-
>  xen/drivers/vpci/msi.c    |  4 ++++
>  xen/drivers/vpci/msix.c   |  4 ++++
>  xen/include/xen/vpci.h    |  3 +++
>  4 files changed, 48 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index e1a448b674..ae05d242a5 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -486,11 +486,27 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      return 0;
>  }
>  
> +/* TODO: Add proper emulation for all bits of the command register. */
>  static void cf_check cmd_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
>  {
>      struct vpci_header *header = data;
>  
> +    if ( !is_hardware_domain(pdev->domain) )
> +    {
> +        struct vpci_header *header = data;

Why do you need this variable?  You already have 'header' in the outer
scope you can use here.

> +
> +        header->guest_cmd = cmd;
> +#ifdef CONFIG_HAS_PCI_MSI
> +        if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
> +            /*
> +             * Guest wants to enable INTx, but it can't be enabled
> +             * if MSI/MSI-X enabled.
> +             */
> +            cmd |= PCI_COMMAND_INTX_DISABLE;
> +#endif
> +    }
> +
>      /*
>       * Let Dom0 play with all the bits directly except for the memory
>       * decoding one.

This comment likely needs updating, to reflect that bits not allowed
to domU are already masked.

> @@ -507,6 +523,19 @@ static void cf_check cmd_write(
>          pci_conf_write16(pdev->sbdf, reg, cmd);
>  }
>  
> +static uint32_t cmd_read(const struct pci_dev *pdev, unsigned int reg,
> +                         void *data)
> +{
> +    if ( !is_hardware_domain(pdev->domain) )
> +    {
> +        struct vpci_header *header = data;
> +
> +        return header->guest_cmd;
> +    }
> +
> +    return pci_conf_read16(pdev->sbdf, reg);

Would IMO be simpler as:

const struct vpci_header *header = data;

return is_hardware_domain(pdev->domain) ? pci_conf_read16(pdev->sbdf, reg)
                                        : header->guest_cmd;

In fact I wonder why not make this handler domU specific so that the
hardware domain can continue to use vpci_hw_read16.
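That would only require selecting the read handler at registration time, e.g. (sketch; `guest_cmd_read` would be the domU-only variant of the handler added by this patch):

```c
    rc = vpci_add_register(pdev->vpci,
                           is_hwdom ? vpci_hw_read16 : guest_cmd_read,
                           cmd_write, PCI_COMMAND, 2, header);
```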

> +}
> +
>  static void cf_check bar_write(
>      const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
>  {
> @@ -713,8 +742,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          return -EOPNOTSUPP;
>      }
>  
> +    /*
> +     * According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
> +     * Device Control" the reset state of the command register is
> +     * typically all 0's, so this is used as initial value for the guests.
> +     */
> +    ASSERT(header->guest_cmd == 0);

Hm, while that would be the expectation, shouldn't the command register
reflect the current state of the hardware?

I think you want to check 'cmd' so it's sane, and complain otherwise but
propagate the value to the guest view.

> +
>      /* Setup a handler for the command register. */
> -    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
> +    rc = vpci_add_register(pdev->vpci, cmd_read, cmd_write, PCI_COMMAND,
>                             2, header);

See comment above about keeping the hw domain using vpci_hw_read16.

>      if ( rc )
>          return rc;
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index e63152c224..c37845a949 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -70,6 +70,10 @@ static void cf_check control_write(
>  
>          if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>              return;
> +
> +        /* Make sure guest doesn't enable INTx while enabling MSI. */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);
>      }
>      else
>          vpci_msi_arch_disable(msi, pdev);
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index 9481274579..eab1661b87 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -97,6 +97,10 @@ static void cf_check control_write(
>          for ( i = 0; i < msix->max_entries; i++ )
>              if ( !msix->entries[i].masked && msix->entries[i].updated )
>                  update_entry(&msix->entries[i], pdev, i);
> +
> +        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);

I think here and in the MSI case you want to update the guest view of
the command register if you unconditionally disable INTx.

Maybe just use cmd_write() and let the logic there cache the new
value?
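A sketch of that idea (it assumes cmd_write() is made reachable from msi.c/msix.c, which it currently isn't, being static to header.c):

```c
        /* Disable INTx via the emulation layer so guest_cmd stays in sync:
         * with MSI/MSI-X now enabled, cmd_write() itself forces
         * PCI_COMMAND_INTX_DISABLE into the value written to hardware. */
        if ( !is_hardware_domain(pdev->domain) )
            cmd_write(pdev, PCI_COMMAND, pdev->vpci->header.guest_cmd,
                      &pdev->vpci->header);
```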

Thanks, Roger.



* Re: [PATCH v8 10/13] vpci/header: reset the command register when adding devices
  2023-07-20  0:32 ` [PATCH v8 10/13] vpci/header: reset the command register when adding devices Volodymyr Babchuk
@ 2023-07-21 13:37   ` Roger Pau Monné
  0 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21 13:37 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:33AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Reset the command register when assigning a PCI device to a guest:
> according to the PCI spec the PCI_COMMAND register is typically all 0's
> after reset, but this might not be true for the guest as it needs
> to respect host's settings.
> For that reason, do not write 0 to the PCI_COMMAND register directly,
> but go through the corresponding emulation layer (cmd_write), which
> will take care about the actual bits written.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v6:
> - use cmd_write directly without introducing emulate_cmd_reg
> - update commit message with more description on all 0's in PCI_COMMAND
> Since v5:
> - updated commit message
> Since v1:
>  - do not write 0 to the command register, but respect host settings.
> ---
>  xen/drivers/vpci/header.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index ae05d242a5..44a9940fb9 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -749,6 +749,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
>       */
>      ASSERT(header->guest_cmd == 0);
>  
> +    /* Reset the command register for guests. */
> +    if ( !is_hwdom )
> +        cmd_write(pdev, PCI_COMMAND, 0, header);

So the assert just above is no longer needed? (and could be removed
from the previous patch).

As requested on the previous patch, should some message be logged if
the command register is not as expected (0 in this case?)

Thanks, Roger.



* Re: [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2023-07-21 13:32   ` Roger Pau Monné
@ 2023-07-21 13:40     ` Roger Pau Monné
  2023-07-24 11:06     ` Jan Beulich
  1 sibling, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21 13:40 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Fri, Jul 21, 2023 at 03:32:27PM +0200, Roger Pau Monné wrote:
> On Thu, Jul 20, 2023 at 12:32:33AM +0000, Volodymyr Babchuk wrote:
> > From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> > +    /*
> > +     * According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
> > +     * Device Control" the reset state of the command register is
> > +     * typically all 0's, so this is used as initial value for the guests.
> > +     */
> > +    ASSERT(header->guest_cmd == 0);
> 
> Hm, while that would be the expectation, shouldn't the command register
> reflect the current state of the hardware?
> 
> I think you want to check 'cmd' so it's sane, and complain otherwise but
> propagate the value to the guest view.

In fact asserting that header->guest_cmd == 0 is pointless, as the
structure has just been allocated and zeroed.  We do not assert that
the other fields are also zeroed.

Thanks, Roger.



* Re: [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology
  2023-07-20  0:32 ` [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
  2023-07-20  6:50   ` Jan Beulich
@ 2023-07-21 13:53   ` Roger Pau Monné
  2023-07-21 14:00   ` Roger Pau Monné
  2023-07-26 21:35   ` Stewart Hildebrand
  3 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21 13:53 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:33AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 80dd150bbf..478bd21f3e 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -461,6 +461,14 @@ struct domain
>  #ifdef CONFIG_HAS_PCI
>      struct list_head pdev_list;
>      rwlock_t pci_lock;
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    /*
> +     * The bitmap which shows which device numbers are already used by the
> +     * virtual PCI bus topology and is used to assign a unique SBDF to the
> +     * next passed through virtual PCI device.
> +     */
> +    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
> +#endif

I think it would be helpful to state that vpci_dev_assigned_map is
protected by pci_lock (as I understand it's the intention).
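E.g. a sketch of the suggested comment:

```c
#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
    /*
     * Bitmap of device numbers in use on the guest's virtual PCI bus 0,
     * used to assign a unique virtual SBDF to the next passed through
     * device.  Protected by d->pci_lock.
     */
    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
#endif
```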

Thanks, Roger.



* Re: [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology
  2023-07-20  0:32 ` [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
  2023-07-20  6:50   ` Jan Beulich
  2023-07-21 13:53   ` Roger Pau Monné
@ 2023-07-21 14:00   ` Roger Pau Monné
  2023-07-26 21:35   ` Stewart Hildebrand
  3 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21 14:00 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:33AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.

I do wonder how this will work with ioreqs, iow: shouldn't it be the
toolstack that selects the virtual slot of the PCI device (in the
guest bus).  Otherwise I see a hard time reconciling how ioreqs and
vPCI can work together if vPCI has its own (private) view of the bus,
and thinks it has exclusive ownership of it.

It might be something to deal with afterwards, but would likely need a TODO
tag in order to realize it needs to be improved.

Thanks, Roger.



* Re: [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology for guests
  2023-07-20  0:32 ` [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology " Volodymyr Babchuk
  2023-07-20  6:54   ` Jan Beulich
@ 2023-07-21 14:09   ` Roger Pau Monné
  2023-07-24  8:02   ` Roger Pau Monné
  2 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-21 14:09 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:34AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> There are three  originators for the PCI configuration space access:
> 1. The domain that owns physical host bridge: MMIO handlers are
> there so we can update vPCI register handlers with the values
> written by the hardware domain, e.g. physical view of the registers
> vs guest's view on the configuration space.
> 2. Guest access to the passed through PCI devices: we need to properly
> map virtual bus topology to the physical one, e.g. pass the configuration
> space access to the corresponding physical devices.
> 3. Emulated host PCI bridge access. It doesn't exist in the physical
> topology, e.g. it can't be mapped to some physical host bridge.
> So, all access to the host bridge itself needs to be trapped and
> emulated.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v8:
> - locks moved out of vpci_translate_virtual_device()
> Since v6:
> - add pcidevs locking to vpci_translate_virtual_device
> - update wrt to the new locking scheme
> Since v5:
> - add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
>   case to simplify ifdefery
> - add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
> - reset output register on failed virtual SBDF translation
> Since v4:
> - indentation fixes
> - constify struct domain
> - updated commit message
> - updates to the new locking scheme (pdev->vpci_lock)
> Since v3:
> - revisit locking
> - move code to vpci.c
> Since v2:
>  - pass struct domain instead of struct vcpu
>  - constify arguments where possible
>  - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/arch/arm/vpci.c     | 47 +++++++++++++++++++++++++++++++++++++++--
>  xen/drivers/vpci/vpci.c | 24 +++++++++++++++++++++
>  xen/include/xen/vpci.h  |  7 ++++++
>  3 files changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
> index 3bc4bb5508..66701465af 100644
> --- a/xen/arch/arm/vpci.c
> +++ b/xen/arch/arm/vpci.c
> @@ -28,10 +28,33 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
>                            register_t *r, void *p)
>  {
>      struct pci_host_bridge *bridge = p;
> -    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> +    pci_sbdf_t sbdf;
>      /* data is needed to prevent a pointer cast on 32bit */
>      unsigned long data;
>  
> +    ASSERT(!bridge == !is_hardware_domain(v->domain));
> +
> +    sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> +
> +    /*
> +     * For the passed through devices we need to map their virtual SBDF
> +     * to the physical PCI device being passed through.
> +     */
> +    if ( !bridge )
> +    {
> +        bool translated;
> +
> +        read_lock(&v->domain->pci_lock);
> +        translated = vpci_translate_virtual_device(v->domain, &sbdf);
> +        read_unlock(&v->domain->pci_lock);
> +
> +        if ( !translated )
> +        {
> +            *r = ~0ul;
> +            return 1;
> +        }
> +    }
> +
>      if ( vpci_ecam_read(sbdf, ECAM_REG_OFFSET(info->gpa),
>                          1U << info->dabt.size, &data) )
>      {
> @@ -48,7 +71,27 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
>                             register_t r, void *p)
>  {
>      struct pci_host_bridge *bridge = p;
> -    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> +    pci_sbdf_t sbdf;
> +
> +    ASSERT(!bridge == !is_hardware_domain(v->domain));
> +
> +    sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> +
> +    /*
> +     * For the passed through devices we need to map their virtual SBDF
> +     * to the physical PCI device being passed through.
> +     */
> +    if ( !bridge )
> +    {
> +        bool translated;
> +
> +        read_lock(&v->domain->pci_lock);
> +        translated = vpci_translate_virtual_device(v->domain, &sbdf);
> +        read_unlock(&v->domain->pci_lock);

You drop the lock here, so it's possible that the returned SBDF is
already stale by the time Xen scans the domain's pdev list again?

I guess it's a minor issue.

> +
> +        if ( !translated )
> +            return 1;
> +    }
>  
>      return vpci_ecam_write(sbdf, ECAM_REG_OFFSET(info->gpa),
>                             1U << info->dabt.size, r);
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index baaafe4a2a..2ce36e811d 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -177,6 +177,30 @@ static int add_virtual_device(struct pci_dev *pdev)
>      return 0;
>  }
>  
> +/*
> + * Find the physical device which is mapped to the virtual device
> + * and translate virtual SBDF to the physical one.
> + * This must hold domain's pci_lock in read mode.
> + */
> +bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
> +{
> +    struct pci_dev *pdev;

const for pdev and d.

> +
> +    ASSERT(!is_hardware_domain(d));
> +
> +    for_each_pdev( d, pdev )
> +    {
> +        if ( pdev->vpci && (pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf) )
> +        {
> +            /* Replace guest SBDF with the physical one. */
> +            *sbdf = pdev->sbdf;

Since you are already iterating over the domain pdev list, won't it be
more helpful to return the pdev?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20 13:50       ` Roger Pau Monné
@ 2023-07-24  0:07         ` Volodymyr Babchuk
  2023-07-24  7:59           ` Roger Pau Monné
  0 siblings, 1 reply; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-24  0:07 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Jan Beulich, xen-devel, Oleksandr Andrushchenko


Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jul 20, 2023 at 03:27:29PM +0200, Jan Beulich wrote:
>> On 20.07.2023 13:20, Roger Pau Monné wrote:
>> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> >> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> >>  
>> >>      /*
>> >>       * Find the PCI dev matching the address, which for hwdom also requires
>> >> -     * consulting DomXEN.  Passthrough everything that's not trapped.
>> >> +     * consulting DomXEN. Passthrough everything that's not trapped.
>> >> +     * If this is hwdom, we need to hold locks for both domain in case if
>> >> +     * modify_bars is called()
>> > 
>> > Typo: the () wants to be at the end of modify_bars().
>> > 
>> >>       */
>> >> +    read_lock(&d->pci_lock);
>> >> +
>> >> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
>> >> +    if ( is_hardware_domain(d) )
>> >> +        read_lock(&dom_xen->pci_lock);
>> > 
>> > For modify_bars() we also want the locks to be in write mode (at least
>> > the hw one), so that the position of the BARs can't be changed while
>> > modify_bars() is iterating over them.
>> 
>> Isn't changing of the BARs happening under the vpci lock?
>
> It is.
>
>> Or else I guess
>> I haven't understood the description correctly: My reading so far was
>> that it is only the presence (allocation status / pointer validity) that
>> is protected by this new lock.
>
> Hm, I see, yes.  I guess it was a previous patch version that also
> took care of the modify_bars() issue by taking the lock in exclusive
> mode here.
>
> We can always do that later, so forget about that comment (for now).

Are you sure? I'd rather rework the code to use the write lock in
modify_bars(). That is why we began this whole journey in the first place.

-- 
WBR, Volodymyr


* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-24  0:07         ` Volodymyr Babchuk
@ 2023-07-24  7:59           ` Roger Pau Monné
  0 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-24  7:59 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: Jan Beulich, xen-devel, Oleksandr Andrushchenko

On Mon, Jul 24, 2023 at 12:07:48AM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Thu, Jul 20, 2023 at 03:27:29PM +0200, Jan Beulich wrote:
> >> On 20.07.2023 13:20, Roger Pau Monné wrote:
> >> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> >> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >> >>  
> >> >>      /*
> >> >>       * Find the PCI dev matching the address, which for hwdom also requires
> >> >> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> >> >> +     * consulting DomXEN. Passthrough everything that's not trapped.
> >> >> +     * If this is hwdom, we need to hold locks for both domain in case if
> >> >> +     * modify_bars is called()
> >> > 
> >> > Typo: the () wants to be at the end of modify_bars().
> >> > 
> >> >>       */
> >> >> +    read_lock(&d->pci_lock);
> >> >> +
> >> >> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> >> >> +    if ( is_hardware_domain(d) )
> >> >> +        read_lock(&dom_xen->pci_lock);
> >> > 
> >> > For modify_bars() we also want the locks to be in write mode (at least
> >> > the hw one), so that the position of the BARs can't be changed while
> >> > modify_bars() is iterating over them.
> >> 
> >> Isn't changing of the BARs happening under the vpci lock?
> >
> > It is.
> >
> >> Or else I guess
> >> I haven't understood the description correctly: My reading so far was
> >> that it is only the presence (allocation status / pointer validity) that
> >> is protected by this new lock.
> >
> > Hm, I see, yes.  I guess it was a previous patch version that also
> > took care of the modify_bars() issue by taking the lock in exclusive
> > mode here.
> >
> > We can always do that later, so forget about that comment (for now).
> 
> Are you sure? I'd rather rework the code to use write lock in the
> modify_bars(). This is why we began all this journey in the first place.

Well, I was just saying that it doesn't need to be done in this same
patch, it can be done as a followup if that's preferred, but one way
or another we need to deal with it.

I'm fine if you want to adjust the commit message and do the change in
this same patch.

Thanks, Roger.



* Re: [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology for guests
  2023-07-20  0:32 ` [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology " Volodymyr Babchuk
  2023-07-20  6:54   ` Jan Beulich
  2023-07-21 14:09   ` Roger Pau Monné
@ 2023-07-24  8:02   ` Roger Pau Monné
  2 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-24  8:02 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Thu, Jul 20, 2023 at 12:32:34AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> There are three  originators for the PCI configuration space access:
> 1. The domain that owns physical host bridge: MMIO handlers are
> there so we can update vPCI register handlers with the values
> written by the hardware domain, e.g. physical view of the registers
> vs guest's view on the configuration space.
> 2. Guest access to the passed through PCI devices: we need to properly
> map virtual bus topology to the physical one, e.g. pass the configuration
> space access to the corresponding physical devices.
> 3. Emulated host PCI bridge access. It doesn't exist in the physical
> topology, e.g. it can't be mapped to some physical host bridge.
> So, all access to the host bridge itself needs to be trapped and
> emulated.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v8:
> - locks moved out of vpci_translate_virtual_device()
> Since v6:
> - add pcidevs locking to vpci_translate_virtual_device
> - update wrt to the new locking scheme
> Since v5:
> - add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
>   case to simplify ifdefery
> - add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
> - reset output register on failed virtual SBDF translation
> Since v4:
> - indentation fixes
> - constify struct domain
> - updated commit message
> - updates to the new locking scheme (pdev->vpci_lock)
> Since v3:
> - revisit locking
> - move code to vpci.c
> Since v2:
>  - pass struct domain instead of struct vcpu
>  - constify arguments where possible
>  - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/arch/arm/vpci.c     | 47 +++++++++++++++++++++++++++++++++++++++--
>  xen/drivers/vpci/vpci.c | 24 +++++++++++++++++++++
>  xen/include/xen/vpci.h  |  7 ++++++
>  3 files changed, 76 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
> index 3bc4bb5508..66701465af 100644
> --- a/xen/arch/arm/vpci.c
> +++ b/xen/arch/arm/vpci.c
> @@ -28,10 +28,33 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
>                            register_t *r, void *p)
>  {
>      struct pci_host_bridge *bridge = p;
> -    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> +    pci_sbdf_t sbdf;
>      /* data is needed to prevent a pointer cast on 32bit */
>      unsigned long data;
>  
> +    ASSERT(!bridge == !is_hardware_domain(v->domain));
> +
> +    sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> +
> +    /*
> +     * For the passed through devices we need to map their virtual SBDF
> +     * to the physical PCI device being passed through.
> +     */
> +    if ( !bridge )
> +    {
> +        bool translated;
> +
> +        read_lock(&v->domain->pci_lock);
> +        translated = vpci_translate_virtual_device(v->domain, &sbdf);
> +        read_unlock(&v->domain->pci_lock);
> +
> +        if ( !translated )
> +        {
> +            *r = ~0ul;
> +            return 1;
> +        }
> +    }

I've been thinking about this: is there any reason not to place this
logic inside vpci_sbdf_from_gpa()?

I'm not sure you need to expose vpci_translate_virtual_device().

Thanks, Roger.



* Re: [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign
  2023-07-20  0:32 ` [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
  2023-07-20 12:36   ` Roger Pau Monné
@ 2023-07-24  9:41   ` Jan Beulich
  1 sibling, 0 replies; 73+ messages in thread
From: Jan Beulich @ 2023-07-24  9:41 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: Oleksandr Andrushchenko, xen-devel

On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a PCI device gets assigned/de-assigned some work on vPCI side needs
> to be done for that device. Introduce a pair of hooks so vPCI can handle
> that.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

A couple more mechanical comments in addition to what Roger said:

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -885,6 +885,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>      if ( ret )
>          goto out;
>  
> +    write_lock(&pdev->domain->pci_lock);
> +    vpci_deassign_device(pdev);
> +    write_unlock(&pdev->domain->pci_lock);

Can't it be just d here?

> @@ -1484,6 +1488,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>          goto done;
>  
> +    write_lock(&pdev->domain->pci_lock);
> +    vpci_deassign_device(pdev);
> +    write_unlock(&pdev->domain->pci_lock);

Is this meaningful (and okay to call at all) when pdev->domain == dom_io?

> @@ -1509,6 +1517,19 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>          rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>                          pci_to_dev(pdev), flag);
>      }
> +    if ( rc )
> +        goto done;
> +
> +    devfn = pdev->devfn;
> +    write_lock(&pdev->domain->pci_lock);
> +    rc = vpci_assign_device(pdev);
> +    write_unlock(&pdev->domain->pci_lock);

Just d again here?

Jan



* Re: [PATCH v8 08/13] vpci/header: program p2m with guest BAR view
  2023-07-21 13:05   ` Roger Pau Monné
@ 2023-07-24 10:30     ` Jan Beulich
  0 siblings, 0 replies; 73+ messages in thread
From: Jan Beulich @ 2023-07-24 10:30 UTC (permalink / raw)
  To: Roger Pau Monné, Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko

On 21.07.2023 15:05, Roger Pau Monné wrote:
> On Thu, Jul 20, 2023 at 12:32:33AM +0000, Volodymyr Babchuk wrote:
>> @@ -62,8 +76,8 @@ static int cf_check map_range(
>>          if ( rc < 0 )
>>          {
>>              printk(XENLOG_G_WARNING
>> -                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
>> -                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
>> +                   "Failed to %smap [%lx, %lx] for %pd: %d\n",
>> +                   map->map ? "" : "un", s, e, map->d, rc);
> 
> I would also print the gfn -> mfn values if it's no longer an identity
> map.

And also the actual range - it's not [s,e] anymore.

Jan



* Re: [PATCH v8 08/13] vpci/header: program p2m with guest BAR view
  2023-07-20  0:32 ` [PATCH v8 08/13] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
  2023-07-21 13:05   ` Roger Pau Monné
@ 2023-07-24 10:43   ` Jan Beulich
  2023-07-24 13:16     ` Roger Pau Monné
  1 sibling, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-24 10:43 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, xen-devel, Roger Pau Monné

On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> @@ -52,8 +66,8 @@ static int cf_check map_range(
>           * - {un}map_mmio_regions doesn't support preemption.
>           */
>  
> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> +        rc = map->map ? map_mmio_regions(map->d, start_gfn, size, _mfn(s))
> +                      : unmap_mmio_regions(map->d, start_gfn, size, _mfn(s));

Aiui this is the first direct exposure of these functions to DomU-s;
so far all calls were Xen-internal or from a domctl. There are a
couple of Arm TODOs listed in the comment ahead, but I'm not sure
that's all what is lacking here, and it's unclear whether this can
sensibly be left as a follow-on activity (at the very least known
open issues need mentioning as TODOs).

For example the x86 function truncates an unsigned long local
variable to (signed) int in its main return statement. This may for
the moment still be only a theoretical issue, but will need dealing
with sooner or later, I think.

Furthermore this yet again allows DomU-s to fiddle with their p2m.
To a degree this is unavoidable, I suppose. But some thought may
need putting into this anyway. Aiui on real hardware if a BAR is
placed over RAM, behavior is simply undefined. Once the BAR is
moved away though, behavior will become defined again: The RAM will
"reappear" in case the earlier undefined-ness made it disappear. I
don't know how the Arm variants of the functions behave, but on x86
the RAM pages will disappear from the guest's p2m upon putting a
BAR there, but they won't reappear upon unmapping of the BAR.

Luckily at least preemption looks to be handled in a satisfactory
manner already.

Jan



* Re: [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2023-07-20  0:32 ` [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
  2023-07-21 13:32   ` Roger Pau Monné
@ 2023-07-24 11:03   ` Jan Beulich
  1 sibling, 0 replies; 73+ messages in thread
From: Jan Beulich @ 2023-07-24 11:03 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: Oleksandr Andrushchenko, xen-devel

On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -70,6 +70,10 @@ static void cf_check control_write(
>  
>          if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>              return;
> +
> +        /* Make sure guest doesn't enable INTx while enabling MSI. */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);

Neither this nor ...

> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -97,6 +97,10 @@ static void cf_check control_write(
>          for ( i = 0; i < msix->max_entries; i++ )
>              if ( !msix->entries[i].masked && msix->entries[i].updated )
>                  update_entry(&msix->entries[i], pdev, i);
> +
> +        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);

... this has a counterpart passing true, to restore pin-based IRQs.
While it looks like we have a pre-existing issue here as well
(see __pci_disable_msi() vs __pci_disable_msix()), could you clarify
how this is meant to work for DomU-s?

Jan



* Re: [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2023-07-21 13:32   ` Roger Pau Monné
  2023-07-21 13:40     ` Roger Pau Monné
@ 2023-07-24 11:06     ` Jan Beulich
  1 sibling, 0 replies; 73+ messages in thread
From: Jan Beulich @ 2023-07-24 11:06 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Volodymyr Babchuk

On 21.07.2023 15:32, Roger Pau Monné wrote:
> On Thu, Jul 20, 2023 at 12:32:33AM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
>> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
>> guest's view of this will want to be zero initially, the host having set
>> it to 1 may not easily be overwritten with 0, or else we'd effectively
>> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
>> proper emulation in order to honor host's settings.
> 
> You speak about SERR here, yet in the code all bits are togglable by
> domUs.

I think this paragraph is meant to describe only what would need doing,
as per what's said ...

>> There are examples of emulators [1], [2] which already deal with PCI_COMMAND
>> register emulation and it seems that at most they care about is the only INTx
>                                                                       ^ stray?
>> bit (besides IO/memory enable and bus master which are write through).
>> It could be because in order to properly emulate the PCI_COMMAND register
>> we need to know about the whole PCI topology, e.g. if any setting in device's
>> command register is aligned with the upstream port etc.
>>
>> This makes me think that because of this complexity others just ignore that.
>> Neither I think this can easily be done in Xen case.
>>
>> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
>> Device Control" the reset state of the command register is typically 0,
>> so when assigning a PCI device use 0 as the initial state for the guest's view
>> of the command register.
>>
>> For now our emulation only makes sure INTx is set according to the host
>> requirements, i.e. depending on MSI/MSI-X enabled state.
>>
>> This implementation and the decision to only emulate INTx bit for now
>> is based on the previous discussion at [3].

... through to down here. Yet I agree the title suggests otherwise, and
hence that initial paragraph is further misleading.

>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -486,11 +486,27 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>      return 0;
>>  }
>>  
>> +/* TODO: Add proper emulation for all bits of the command register. */
>>  static void cf_check cmd_write(
>>      const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
>>  {

Note also the TODO being added here. Which course will need resolving
before any of this can become supported.

Jan



* Re: [PATCH v8 08/13] vpci/header: program p2m with guest BAR view
  2023-07-24 10:43   ` Jan Beulich
@ 2023-07-24 13:16     ` Roger Pau Monné
  2023-07-24 13:31       ` Jan Beulich
  0 siblings, 1 reply; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-24 13:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Volodymyr Babchuk, Oleksandr Andrushchenko, xen-devel

On Mon, Jul 24, 2023 at 12:43:26PM +0200, Jan Beulich wrote:
> On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> > @@ -52,8 +66,8 @@ static int cf_check map_range(
> >           * - {un}map_mmio_regions doesn't support preemption.
> >           */
> >  
> > -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> > -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> > +        rc = map->map ? map_mmio_regions(map->d, start_gfn, size, _mfn(s))
> > +                      : unmap_mmio_regions(map->d, start_gfn, size, _mfn(s));
> 
> Aiui this is the first direct exposure of these functions to DomU-s;

I guess it depends on how direct you consider exposure from the
XEN_DOMCTL_memory_mapping hypercall, as that's what gets called by
QEMU as well in order to set up BAR mappings.

> so far all calls were Xen-internal or from a domctl. There are a
> couple of Arm TODOs listed in the comment ahead, but I'm not sure
> that's all what is lacking here, and it's unclear whether this can
> sensibly be left as a follow-on activity (at the very least known
> open issues need mentioning as TODOs).
> 
> For example the x86 function truncates an unsigned long local
> variable to (signed) int in its main return statement. This may for
> the moment still be only a theoretical issue, but will need dealing
> with sooner or later, I think.

One bit that we need to add is the iomem_access_permitted() plus the
xsm_iomem_mapping() checks to map_range().

> Furthermore this yet again allows DomU-s to fiddle with their p2m.
> To a degree this is unavoidable, I suppose. But some thought may
> need putting into this anyway. Aiui on real hardware if a BAR is
> placed over RAM, behavior is simply undefined. Once the BAR is
> moved away though, behavior will become defined again: The RAM will
> "reappear" in case the earlier undefined-ness made it disappear. I
> don't know how the Arm variants of the functions behave, but on x86
> the RAM pages will disappear from the guest's p2m upon putting a
> BAR there, but they won't reappear upon unmapping of the BAR.

Yeah, that's unfortunate, but I'm afraid it's the same behavior when
using QEMU, so I wouldn't consider it strictly a regression from the
current handling that we do for BARs when doing PCI passthrough.

Furthermore, I don't see any easy way to deal with this so that RAM
can be re-added when the BAR is repositioned so it no longer overlaps
a RAM region.

> Luckily at least preemption looks to be handled in a satisfactory
> manner already.

I spent quite a lot of time trying to make sure this was at least
attempted to be solved properly.

Thanks, Roger.



* Re: [PATCH v8 08/13] vpci/header: program p2m with guest BAR view
  2023-07-24 13:16     ` Roger Pau Monné
@ 2023-07-24 13:31       ` Jan Beulich
  2023-07-24 13:42         ` Roger Pau Monné
  0 siblings, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-24 13:31 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Volodymyr Babchuk, Oleksandr Andrushchenko, xen-devel

On 24.07.2023 15:16, Roger Pau Monné wrote:
> On Mon, Jul 24, 2023 at 12:43:26PM +0200, Jan Beulich wrote:
>> On 20.07.2023 02:32, Volodymyr Babchuk wrote:
>>> @@ -52,8 +66,8 @@ static int cf_check map_range(
>>>           * - {un}map_mmio_regions doesn't support preemption.
>>>           */
>>>  
>>> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
>>> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
>>> +        rc = map->map ? map_mmio_regions(map->d, start_gfn, size, _mfn(s))
>>> +                      : unmap_mmio_regions(map->d, start_gfn, size, _mfn(s));
>>
>> Aiui this is the first direct exposure of these functions to DomU-s;
> 
> I guess it depends on how direct you consider exposure from
> XEN_DOMCTL_memory_mapping hypercall, as that's what gets called by
> QEMU also in order to set up BAR mappings.

Fair point - it is one of the few domctls not covered by XSA-77.

>> so far all calls were Xen-internal or from a domctl. There are a
>> couple of Arm TODOs listed in the comment ahead, but I'm not sure
>> that's all what is lacking here, and it's unclear whether this can
>> sensibly be left as a follow-on activity (at the very least known
>> open issues need mentioning as TODOs).
>>
>> For example the x86 function truncates an unsigned long local
>> variable to (signed) int in its main return statement. This may for
>> the moment still be only a theoretical issue, but will need dealing
>> with sooner or later, I think.
> 
> One bit that we need to add is the iomem_access_permitted() plus the
> xsm_iomem_mapping() checks to map_range().

The former would just be reassurance, wouldn't it? Assigning a PCI
device surely implies granting access to all its BARs (minus the
MSI-X page(s), if any). The latter would of course be more
"interesting", as XSM could in principle interject.

Jan



* Re: [PATCH v8 08/13] vpci/header: program p2m with guest BAR view
  2023-07-24 13:31       ` Jan Beulich
@ 2023-07-24 13:42         ` Roger Pau Monné
  0 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-24 13:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Volodymyr Babchuk, Oleksandr Andrushchenko, xen-devel

On Mon, Jul 24, 2023 at 03:31:56PM +0200, Jan Beulich wrote:
> On 24.07.2023 15:16, Roger Pau Monné wrote:
> > On Mon, Jul 24, 2023 at 12:43:26PM +0200, Jan Beulich wrote:
> >> On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> >>> @@ -52,8 +66,8 @@ static int cf_check map_range(
> >>>           * - {un}map_mmio_regions doesn't support preemption.
> >>>           */
> >>>  
> >>> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> >>> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> >>> +        rc = map->map ? map_mmio_regions(map->d, start_gfn, size, _mfn(s))
> >>> +                      : unmap_mmio_regions(map->d, start_gfn, size, _mfn(s));
> >>
> >> Aiui this is the first direct exposure of these functions to DomU-s;
> > 
> > I guess it depends on how direct you consider exposure from
> > XEN_DOMCTL_memory_mapping hypercall, as that's what gets called by
> > QEMU also in order to set up BAR mappings.
> 
> Fair point - it is one of the few domctls not covered by XSA-77.
> 
> >> so far all calls were Xen-internal or from a domctl. There are a
> >> couple of Arm TODOs listed in the comment ahead, but I'm not sure
> >> that's all what is lacking here, and it's unclear whether this can
> >> sensibly be left as a follow-on activity (at the very least known
> >> open issues need mentioning as TODOs).
> >>
> >> For example the x86 function truncates an unsigned long local
> >> variable to (signed) int in its main return statement. This may for
> >> the moment still be only a theoretical issue, but will need dealing
> >> with sooner or later, I think.
> > 
> > One bit that we need to add is the iomem_access_permitted() plus the
> > xsm_iomem_mapping() checks to map_range().
> 
> The former would just be reassurance, wouldn't it? Assigning a PCI
> device surely implies granting access to all its BARs (minus the
> MSI-X page(s), if any).

Indeed.  But for consistency we need to match the same checks that are
done in XEN_DOMCTL_memory_mapping.

> The latter would of course be more
> "interesting", as XSM could in principle interject.

Yes, we need both.  In fact I'm just writing a patch to add them
straight away.

Thanks, Roger.



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-20 11:20   ` Roger Pau Monné
  2023-07-20 13:27     ` Jan Beulich
  2023-07-20 15:53     ` Jan Beulich
@ 2023-07-26  1:17     ` Volodymyr Babchuk
  2023-07-26  6:39       ` Jan Beulich
  2023-07-26  9:35       ` Roger Pau Monné
  2 siblings, 2 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-26  1:17 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich


Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> Use a previously introduced per-domain read/write lock to check
>> whether vpci is present, so we are sure there are no accesses to the
>> contents of the vpci struct if not. This lock can be used (and in a
>> few cases is used right away) so that vpci removal can be performed
>> while holding the lock in write mode. Previously such removal could
>> race with vpci_read for example.
>
> This I think needs to state the locking order of the per-domain
> pci_lock wrt the vpci->lock.  AFAICT that's d->pci_lock first, then
> vpci->lock.

Will add, thanks.

>> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
>> from being removed.
>> 
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if
>> done under the read lock, requires vpci->lock to be acquired on both
>> devices being compared, which may produce a deadlock. It is not
>> possible to upgrade read lock to write lock in such a case. So, in
>> order to prevent the deadlock, use d->pci_lock instead. To prevent
>> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
>> always lock hwdom first.
>> 
>> All other code, which doesn't lead to pdev->vpci destruction and does
>> not access multiple pdevs at the same time, can still use a
>> combination of the read lock and pdev->vpci->lock.
>> 
>> 3. Drop const qualifier where the new rwlock is used and this is
>> appropriate.
>> 
>> 4. Do not call process_pending_softirqs with any locks held. For that
>> unlock prior the call and re-acquire the locks after. After
>> re-acquiring the lock there is no need to check if pdev->vpci exists:
>>  - in apply_map because of the context it is called (no race condition
>>    possible)
>>  - for MSI/MSI-X debug code because it is called at the end of
>>    pdev->vpci access and no further access to pdev->vpci is made
>
> I assume that's vpci_msix_arch_print(): there are further accesses to
> pdev->vpci, but those use the msix local variable, which holds a copy
> of the pointer in pdev->vpci->msix, so that last sentence is not true
> I'm afraid.

Yes, I see. I am wondering if we can remember the sbdf and call
pci_get_pdev() after re-acquiring the lock. Of course, there is a
slight chance that we will get another pdev with the same sbdf...
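
Something like this, perhaps (untested sketch, with the caveat above
that the look-up may find a new device that reused the same sbdf):

    struct domain *d = pdev->domain;
    pci_sbdf_t sbdf = pdev->sbdf;

    spin_unlock(&pdev->vpci->lock);
    read_unlock(&d->pci_lock);
    process_pending_softirqs();
    read_lock(&d->pci_lock);

    /* The original pdev may be gone by now. */
    pdev = pci_get_pdev(d, sbdf);
    if ( !pdev || !pdev->vpci )
        return -EBUSY;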

> However the code already try to cater for the pdev going away, and
> hence it's IMO fine.  IOW: your change doesn't make this any better or
> worse.
>
>> 
>> 5. Introduce pcidevs_trylock, so there is a possibility to try locking
>> the pcidev's lock.
>
> I'm confused by this addition, the more that's no used anywhere.  Can
> you defer the addition until the patch that makes use of it?
>

Yup. This is another rebasing artifact. There were users for this
function in the previous version of the patch.

>> 
>> 6. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>> 
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> 
>> ---
>> 
>> Changes in v8:
>>  - changed d->vpci_lock to d->pci_lock
>>  - introducing d->pci_lock in a separate patch
>>  - extended locked region in vpci_process_pending
>>  - removed pcidevs_lock in vpci_dump_msi()
>>  - removed some changes as they are not needed with
>>    the new locking scheme
>>  - added handling for hwdom && dom_xen case
>> ---
>>  xen/arch/x86/hvm/vmsi.c       |  4 +++
>>  xen/drivers/passthrough/pci.c |  7 +++++
>>  xen/drivers/vpci/header.c     | 18 ++++++++++++
>>  xen/drivers/vpci/msi.c        | 14 ++++++++--
>>  xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
>>  xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
>>  xen/include/xen/pci.h         |  1 +
>>  7 files changed, 129 insertions(+), 13 deletions(-)
>> 
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 3cd4923060..8c1bd66b9c 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -895,6 +895,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>  {
>>      unsigned int i;
>>  
>> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
>> +
>>      for ( i = 0; i < msix->max_entries; i++ )
>>      {
>>          const struct vpci_msix_entry *entry = &msix->entries[i];
>> @@ -913,7 +915,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>              struct pci_dev *pdev = msix->pdev;
>>  
>>              spin_unlock(&msix->pdev->vpci->lock);
>> +            read_unlock(&pdev->domain->pci_lock);
>>              process_pending_softirqs();
>> +            read_lock(&pdev->domain->pci_lock);
>
> This should be a read_trylock(), much like the spin_trylock() below.

vpci_dump_msi() expects that vpci_msix_arch_print() will return holding
this lock. I can rework both functions, of course. But then we will be
in a situation where we need to know the exact behavior of
vpci_dump_msi() wrt locks in the calling code...
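
If both functions are reworked, the re-acquire path could look
something like this (untested sketch): return with both locks dropped
on failure, so the caller can tell the lock state from the return
value:

    spin_unlock(&msix->pdev->vpci->lock);
    read_unlock(&pdev->domain->pci_lock);
    process_pending_softirqs();
    if ( !read_trylock(&pdev->domain->pci_lock) )
        return -EBUSY;
    if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
    {
        read_unlock(&pdev->domain->pci_lock);
        return -EBUSY;
    }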

>
>>              /* NB: we assume that pdev cannot go away for an alive domain. */
>>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>>                  return -EBUSY;
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 5b4632ead2..6f8692cd9c 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>>      spin_lock_recursive(&_pcidevs_lock);
>>  }
>>  
>> +int pcidevs_trylock(void)
>> +{
>> +    return spin_trylock_recursive(&_pcidevs_lock);
>> +}
>> +
>>  void pcidevs_unlock(void)
>>  {
>>      spin_unlock_recursive(&_pcidevs_lock);
>> @@ -1144,7 +1149,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>>      } while ( devfn != pdev->devfn &&
>>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>>  
>> +    write_lock(&ctxt->d->pci_lock);
>>      err = vpci_add_handlers(pdev);
>> +    write_unlock(&ctxt->d->pci_lock);
>>      if ( err )
>>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>>                 ctxt->d->domain_id, err);
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index b41556d007..2780fcae72 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -152,6 +152,7 @@ bool vpci_process_pending(struct vcpu *v)
>>          if ( rc == -ERESTART )
>>              return true;
>>  
>> +        write_lock(&v->domain->pci_lock);
>>          spin_lock(&v->vpci.pdev->vpci->lock);
>>          /* Disable memory decoding unconditionally on failure. */
>>          modify_decoding(v->vpci.pdev,
>> @@ -170,6 +171,7 @@ bool vpci_process_pending(struct vcpu *v)
>>               * failure.
>>               */
>>              vpci_remove_device(v->vpci.pdev);
>> +        write_unlock(&v->domain->pci_lock);
>>      }
>
> The handling in vpci_process_pending() wrt vpci_remove_device() is
> racy and will need some thinking to get it solved.  Your change
> doesn't make it any worse, but I would also be fine with adding a note
> in the commit message that vpci_process_pending() is not adjusted to
> use the new lock because it needs to be reworked first in order to be
> safe against a concurrent vpci_remove_device() call.

It is racy because we are accessing v->vpci.pdev->vpci, I see. At least
we can check that it is not NULL...

But the problem is broader, of course, as the vpci struct could be
destroyed and created anew. This raises the question of whether we
should delay vpci_process_pending() in the first place.
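
E.g. an untested sketch of such a check (it would not address the
destroy/re-create problem, only the plain NULL dereference):

    write_lock(&v->domain->pci_lock);
    if ( v->vpci.pdev->vpci )
    {
        spin_lock(&v->vpci.pdev->vpci->lock);
        /* existing modify_decoding() / cleanup path goes here */
        spin_unlock(&v->vpci.pdev->vpci->lock);
    }
    write_unlock(&v->domain->pci_lock);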

>>  
>>      return false;
>> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>      struct map_data data = { .d = d, .map = true };
>>      int rc;
>>  
>> +    ASSERT(rw_is_locked(&d->pci_lock));
>> +
>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> +    {
>> +        /*
>> +         * It's safe to drop and reacquire the lock in this context
>> +         * without risking pdev disappearing because devices cannot be
>> +         * removed until the initial domain has been started.
>> +         */
>> +        read_unlock(&d->pci_lock);
>>          process_pending_softirqs();
>> +        read_lock(&d->pci_lock);
>> +    }
>
> Since this is init only code you could likely forego the usage of the
> locks, but I guess that's more churn than just using them.  In any
> case, as this gets called from modify_bars() the locks need to be
> dropped/taken in write mode (see comment below).
>
>>      rangeset_destroy(mem);
>>      if ( !rc )
>>          modify_decoding(pdev, cmd, false);
>> @@ -223,6 +237,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>      unsigned int i;
>>      int rc;
>>  
>> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>
> The lock here needs to be taken in write mode I think, so the code can
> safely iterate over the contents of each pdev->vpci assigned to the
> domain.
>

Yep, reworked.

>> +
>>      if ( !mem )
>>          return -ENOMEM;
>>  
>> @@ -502,6 +518,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>      struct vpci_bar *bars = header->bars;
>>      int rc;
>>  
>> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>> +
>>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>      {
>>      case PCI_HEADER_TYPE_NORMAL:
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 8f2b59e61a..e63152c224 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>>      uint16_t control;
>>      int ret;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>
> I'm confused by the difference in lock requirements between
> init_bars() and init_msi().  In the former you assert for the lock
> being taken in read mode, while the later asserts for write mode.
>
> We want to do initialization in write mode, so that modify_bars()
> called by init_bars() has exclusive access to the contents of
> pdev->vpci.
>

Taking into account your discussion with Jan, I removed this ASSERT altogether.

>> +
>>      if ( !pos )
>>          return 0;
>>  
>> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>  
>>  void vpci_dump_msi(void)
>>  {
>> -    const struct domain *d;
>> +    struct domain *d;
>>  
>>      rcu_read_lock(&domlist_read_lock);
>>      for_each_domain ( d )
>> @@ -277,6 +279,9 @@ void vpci_dump_msi(void)
>>  
>>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>  
>> +        if ( !read_trylock(&d->pci_lock) )
>> +            continue;
>> +
>>          for_each_pdev ( d, pdev )
>>          {
>>              const struct vpci_msi *msi;
>> @@ -318,14 +323,17 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */
>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>  
>>              spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> +            read_unlock(&d->pci_lock);
>>              process_pending_softirqs();
>> +            read_lock(&d->pci_lock);
>
> read_trylock().
>
> This is not very safe, as the list could be modified while the lock is
> dropped, but it's a debug key handler so I'm not very concerned.
> However we should at least add a comment that this relies on the list
> not being altered while the lock is dropped.

Added, thanks.

>
>>          }
>> +        read_unlock(&d->pci_lock);
>>      }
>>      rcu_read_unlock(&domlist_read_lock);
>>  }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index 25bde77586..9481274579 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  {
>>      struct vpci_msix *msix;
>>  
>> +    ASSERT(rw_is_locked(&d->pci_lock));
>
> Hm, here you are iterating over pdev->vpci->header.bars for multiple
> devices, so I think in addition to the pci_lock in read mode we should
> also take the vpci->lock for each pdev.
>
> I think I would like to rework msix_find() so it's msix_get() and
> returns with the appropriate vpci->lock taken.  Anyway, that's for a
> different patch, the usage of the lock in read mode seems correct,
> albeit I might want to move the read_lock() call inside of msix_get()
> in the future.
>
>> +
>>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>>      {
>>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
>> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  
>>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>>  {
>> -    return !!msix_find(v->domain, addr);
>> +    int rc;
>> +
>> +    read_lock(&v->domain->pci_lock);
>> +    rc = !!msix_find(v->domain, addr);
>> +    read_unlock(&v->domain->pci_lock);
>> +
>> +    return rc;
>>  }
>>  
>>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> @@ -358,21 +366,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_read(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      const struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>>      *data = ~0ul;
>>  
>> +    read_lock(&d->pci_lock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_read(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_read(d, msix, addr, len, data);
>
> Nit: missing newline (here and below).
>
>> +        read_unlock(&d->pci_lock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -404,6 +425,7 @@ static int cf_check msix_read(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->pci_lock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -491,19 +513,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_write(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>> +    read_lock(&d->pci_lock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_write(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_write(d, msix, addr, len, data);
>> +        read_unlock(&d->pci_lock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -579,6 +614,7 @@ static int cf_check msix_write(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->pci_lock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -665,6 +701,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>>      struct vpci_msix *msix;
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>>                                        PCI_CAP_ID_MSIX);
>>      if ( !msix_offset )
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index d73fa76302..f22cbf2112 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>  
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>>          return;
>>  
>> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>      const unsigned long *ro_map;
>>      int rc = 0;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) )
>>          return 0;
>>  
>> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>  
>>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>>      uint32_t data = ~(uint32_t)0;
>> +    rwlock_t *lock;
>>  
>>      if ( !size )
>>      {
>> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>       * Find the PCI dev matching the address, which for hwdom also requires
>>       * consulting DomXEN.  Passthrough everything that's not trapped.
>>       */
>> +    lock = &d->pci_lock;
>> +    read_lock(lock);
>>      pdev = pci_get_pdev(d, sbdf);
>>      if ( !pdev && is_hardware_domain(d) )
>> +    {
>> +        read_unlock(lock);
>> +        lock = &dom_xen->pci_lock;
>> +        read_lock(lock);
>>          pdev = pci_get_pdev(dom_xen, sbdf);
>> +    }
>>      if ( !pdev || !pdev->vpci )
>> +    {
>> +        read_unlock(lock);
>>          return vpci_read_hw(sbdf, reg, size);
>> +    }
>>  
>>      spin_lock(&pdev->vpci->lock);
>>  
>> @@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>          ASSERT(data_offset < size);
>>      }
>>      spin_unlock(&pdev->vpci->lock);
>> +    read_unlock(lock);
>>  
>>      if ( data_offset < size )
>>      {
>> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>>               r->private);
>>  }
>>  
>> +/* Helper function to unlock locks taken by vpci_write in proper order */
>> +static void unlock_locks(struct domain *d)
>> +{
>> +    ASSERT(rw_is_locked(&d->pci_lock));
>> +
>> +    if ( is_hardware_domain(d) )
>> +    {
>> +        ASSERT(rw_is_locked(&d->pci_lock));
>> +        read_unlock(&dom_xen->pci_lock);
>> +    }
>> +    read_unlock(&d->pci_lock);
>> +}
>> +
>>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>                  uint32_t data)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>  
>>      /*
>>       * Find the PCI dev matching the address, which for hwdom also requires
>> -     * consulting DomXEN.  Passthrough everything that's not trapped.
>> +     * consulting DomXEN. Passthrough everything that's not trapped.
>> +     * If this is hwdom, we need to hold locks for both domain in case if
>> +     * modify_bars is called()
>
> Typo: the () wants to be at the end of modify_bars().
>
>>       */
>> +    read_lock(&d->pci_lock);
>> +
>> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
>> +    if ( is_hardware_domain(d) )
>> +        read_lock(&dom_xen->pci_lock);
>
> For modify_bars() we also want the locks to be in write mode (at least
> the hw one), so that the position of the BARs can't be changed while
> modify_bars() is iterating over them.
>
> Is this something that will be done in a followup change?

I'll do it in this change.
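
I.e. something along these lines in vpci_write() (sketch only; hwdom's
own lock is always taken first to preserve the ordering, and the exact
mode for the non-hwdom path still needs working out):

    if ( is_hardware_domain(d) )
    {
        write_lock(&d->pci_lock);
        write_lock(&dom_xen->pci_lock);
    }
    else
        read_lock(&d->pci_lock);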

>
>> +
>>      pdev = pci_get_pdev(d, sbdf);
>>      if ( !pdev && is_hardware_domain(d) )
>>          pdev = pci_get_pdev(dom_xen, sbdf);
>> @@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>  
>>          if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
>>              vpci_write_hw(sbdf, reg, size, data);
>> +
>> +        unlock_locks(d);
>>          return;
>>      }
>>  
>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>          ASSERT(data_offset < size);
>>      }
>>      spin_unlock(&pdev->vpci->lock);
>> +    unlock_locks(d);
>
> There's one issue here, some handlers will cal pcidevs_lock(), which
> will result in a lock over inversion, as in the previous patch we
> agreed that the locking order was pcidevs_lock first, d->pci_lock
> after.
>
> For example the MSI control_write() handler will call
> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
> have to look into using a dedicated lock for MSI related handling, as
> that's the only place where I think we have this pattern of taking the
> pcidevs_lock after the d->pci_lock.

I'll mention this in the commit message. Is there something else that I
should do right now?

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign
  2023-07-20 12:36   ` Roger Pau Monné
@ 2023-07-26  1:38     ` Volodymyr Babchuk
  2023-07-26  8:42       ` Roger Pau Monné
  0 siblings, 1 reply; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-26  1:38 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Oleksandr Andrushchenko


Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> When a PCI device gets assigned/de-assigned some work on vPCI side needs
>> to be done for that device. Introduce a pair of hooks so vPCI can handle
>> that.
>> 
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>> Since v8:
>> - removed vpci_deassign_device
>> Since v6:
>> - do not pass struct domain to vpci_{assign|deassign}_device as
>>   pdev->domain can be used
>> - do not leave the device assigned (pdev->domain == new domain) in case
>>   vpci_assign_device fails: try to de-assign and if this also fails, then
>>   crash the domain
>> Since v5:
>> - do not split code into run_vpci_init
>> - do not check for is_system_domain in vpci_{de}assign_device
>> - do not use vpci_remove_device_handlers_locked and re-allocate
>>   pdev->vpci completely
>> - make vpci_deassign_device void
>> Since v4:
>>  - de-assign vPCI from the previous domain on device assignment
>>  - do not remove handlers in vpci_assign_device as those must not
>>    exist at that point
>> Since v3:
>>  - remove toolstack roll-back description from the commit message
>>    as error are to be handled with proper cleanup in Xen itself
>>  - remove __must_check
>>  - remove redundant rc check while assigning devices
>>  - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
>>  - use REGISTER_VPCI_INIT machinery to run required steps on device
>>    init/assign: add run_vpci_init helper
>> Since v2:
>> - define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
>>   for x86
>> Since v1:
>>  - constify struct pci_dev where possible
>>  - do not open code is_system_domain()
>>  - extended the commit message
>> ---
>>  xen/drivers/Kconfig           |  4 ++++
>>  xen/drivers/passthrough/pci.c | 21 +++++++++++++++++++++
>>  xen/drivers/vpci/vpci.c       | 18 ++++++++++++++++++
>>  xen/include/xen/vpci.h        | 15 +++++++++++++++
>>  4 files changed, 58 insertions(+)
>> 
>> diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
>> index db94393f47..780490cf8e 100644
>> --- a/xen/drivers/Kconfig
>> +++ b/xen/drivers/Kconfig
>> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>>  config HAS_VPCI
>>  	bool
>>  
>> +config HAS_VPCI_GUEST_SUPPORT
>> +	bool
>> +	depends on HAS_VPCI
>> +
>>  endmenu
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 6f8692cd9c..265d359704 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -885,6 +885,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>>      if ( ret )
>>          goto out;
>>  
>> +    write_lock(&pdev->domain->pci_lock);
>> +    vpci_deassign_device(pdev);
>> +    write_unlock(&pdev->domain->pci_lock);
>> +
>>      if ( pdev->domain == hardware_domain  )
>>          pdev->quarantine = false;
>>  
>> @@ -1484,6 +1488,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>>          goto done;
>>  
>> +    write_lock(&pdev->domain->pci_lock);
>> +    vpci_deassign_device(pdev);
>> +    write_unlock(&pdev->domain->pci_lock);
>> +
>>      rc = pdev_msix_assign(d, pdev);
>>      if ( rc )
>>          goto done;
>> @@ -1509,6 +1517,19 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>          rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>>                          pci_to_dev(pdev), flag);
>>      }
>> +    if ( rc )
>> +        goto done;
>> +
>> +    devfn = pdev->devfn;
>> +    write_lock(&pdev->domain->pci_lock);
>> +    rc = vpci_assign_device(pdev);
>> +    write_unlock(&pdev->domain->pci_lock);
>> +    if ( rc && deassign_device(d, seg, bus, devfn) )
>> +    {
>> +        printk(XENLOG_ERR "%pd: %pp was left partially assigned\n",
>> +               d, &PCI_SBDF(seg, bus, devfn));
>
> &pdev->sbdf?  Then you can get of the devfn usage above.

Yes, thanks.

>> +        domain_crash(d);
>
> This seems like a bit different from the other error paths in the
> function, isn't it fine to return an error and let the caller handle
> the deassign?

I believe the intention was to not leave the device in an unknown
state: we failed both the assign_device() and deassign_device() calls,
so what to do now? But yes, I think you are right and it is better to
let the caller decide what to do next.

>
> Also, if we really need to call deassign_device() we must do so for
> all possible phantom devices, see the above loop around
> iommu_call(..., assing_device, ...);

But deassign_device() already has the loop over all phantom devices
that does all the work. Unless I am missing something, of course.

>> +    }
>>  
>>   done:
>>      if ( rc )
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index a6d2cf8660..a97710a806 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -107,6 +107,24 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>  
>>      return rc;
>>  }
>> +
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +/* Notify vPCI that device is assigned to guest. */
>> +int vpci_assign_device(struct pci_dev *pdev)
>> +{
>> +    int rc;
>> +
>> +    if ( !has_vpci(pdev->domain) )
>> +        return 0;
>> +
>> +    rc = vpci_add_handlers(pdev);
>> +    if ( rc )
>> +        vpci_deassign_device(pdev);
>
> Why do you need this handler, vpci_add_handlers() when failing will
> already call vpci_remove_device(), which is what
> vpci_deassign_device() does.
>
>> +
>> +    return rc;
>> +}
>> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
>> +
>>  #endif /* __XEN__ */
>>  
>>  static int vpci_register_cmp(const struct vpci_register *r1,
>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>> index 0b8a2a3c74..44296623e1 100644
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -264,6 +264,21 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
>>  }
>>  #endif
>>  
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +/* Notify vPCI that device is assigned/de-assigned to/from guest. */
>> +int vpci_assign_device(struct pci_dev *pdev);
>> +#define vpci_deassign_device vpci_remove_device
>> +#else
>> +static inline int vpci_assign_device(struct pci_dev *pdev)
>> +{
>> +    return 0;
>> +};
>> +
>> +static inline void vpci_deassign_device(struct pci_dev *pdev)
>> +{
>> +};
>> +#endif
>
> I don't think there's much point in introducing new functions, see
> above.  I'm fine if the current ones want to be renamed to
> vpci_{,de}assign_device(), but adding defines like the above just
> makes the code harder to follow.

Good idea, thanks, I'll just rename the original functions.

-- 
WBR, Volodymyr


* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-26  1:17     ` Volodymyr Babchuk
@ 2023-07-26  6:39       ` Jan Beulich
  2023-07-26  9:35       ` Roger Pau Monné
  1 sibling, 0 replies; 73+ messages in thread
From: Jan Beulich @ 2023-07-26  6:39 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Roger Pau Monné

On 26.07.2023 03:17, Volodymyr Babchuk wrote:
> Roger Pau Monné <roger.pau@citrix.com> writes:
>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>> --- a/xen/arch/x86/hvm/vmsi.c
>>> +++ b/xen/arch/x86/hvm/vmsi.c
>>> @@ -895,6 +895,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>>  {
>>>      unsigned int i;
>>>  
>>> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
>>> +
>>>      for ( i = 0; i < msix->max_entries; i++ )
>>>      {
>>>          const struct vpci_msix_entry *entry = &msix->entries[i];
>>> @@ -913,7 +915,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>>              struct pci_dev *pdev = msix->pdev;
>>>  
>>>              spin_unlock(&msix->pdev->vpci->lock);
>>> +            read_unlock(&pdev->domain->pci_lock);
>>>              process_pending_softirqs();
>>> +            read_lock(&pdev->domain->pci_lock);
>>
>> This should be a read_trylock(), much like the spin_trylock() below.
> 
> vpci_dump_msi() expects that vpci_msix_arch_print() will return holding
> this lock. I can rework both functions, of course. But then we will in
> situation when we need to known exact behavior of vpci_dump_msi() wrt of
> locks in the calling code...

Your reply sounds as if you hadn't seen my earlier suggestion on this
matter (making the behavior match also for the now 2nd lock involved).

Jan



* Re: [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign
  2023-07-26  1:38     ` Volodymyr Babchuk
@ 2023-07-26  8:42       ` Roger Pau Monné
  0 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-26  8:42 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Wed, Jul 26, 2023 at 01:38:30AM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> @@ -1509,6 +1517,19 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
> >>          rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
> >>                          pci_to_dev(pdev), flag);
> >>      }
> >> +    if ( rc )
> >> +        goto done;
> >> +
> >> +    devfn = pdev->devfn;
> >> +    write_lock(&pdev->domain->pci_lock);
> >> +    rc = vpci_assign_device(pdev);
> >> +    write_unlock(&pdev->domain->pci_lock);
> >> +    if ( rc && deassign_device(d, seg, bus, devfn) )
> >> +    {
> >> +        printk(XENLOG_ERR "%pd: %pp was left partially assigned\n",
> >> +               d, &PCI_SBDF(seg, bus, devfn));
> >
> > &pdev->sbdf?  Then you can get of the devfn usage above.
> 
> Yes, thanks.
> 
> >> +        domain_crash(d);
> >
> > This seems like a bit different from the other error paths in the
> > function, isn't it fine to return an error and let the caller handle
> > the deassign?
> 
> I believe, intention was to not leave device in an unknown state: we
> failed both assign_device() and deassign_device() call, so what to do
> now? But yes, I think you are right and it is better to let caller to
> decide what to do next.

I don't think it would be a security risk to leave the device in that
state.  For domUs the guest won't get access to the device registers
anyway as we use an allow list approach.  Also deassign_device() is
not called when we fail to assign one of the phantom functions.

We don't seem to undo any of the work in assign_device() on error so
it should be fine to not do the call to deassign_device() on error to
initialize vPCI.

> >
> > Also, if we really need to call deassign_device() we must do so for
> > all possible phantom devices, see the above loop around
> > iommu_call(..., assing_device, ...);
> 
> But deassign_device() has the loop for all phantom devices that already
> does all the work. Unless I miss something, of course.

No, you are right, a single call is fine.

Thanks, Roger.



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-26  1:17     ` Volodymyr Babchuk
  2023-07-26  6:39       ` Jan Beulich
@ 2023-07-26  9:35       ` Roger Pau Monné
  2023-07-27  0:56         ` Volodymyr Babchuk
  1 sibling, 1 reply; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-26  9:35 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich

On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>          ASSERT(data_offset < size);
> >>      }
> >>      spin_unlock(&pdev->vpci->lock);
> >> +    unlock_locks(d);
> >
> > There's one issue here, some handlers will call pcidevs_lock(), which
> > will result in a lock order inversion, as in the previous patch we
> > agreed that the locking order was pcidevs_lock first, d->pci_lock
> > after.
> >
> > For example the MSI control_write() handler will call
> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
> > have to look into using a dedicated lock for MSI related handling, as
> > that's the only place where I think we have this pattern of taking the
> > pcidevs_lock after the d->pci_lock.
> 
> I'll mention this in the commit message. Is there something else that I
> should do right now?

Well, I don't think we want to commit this as-is with a known lock
inversion.

The functions that require the pcidevs lock are:

pt_irq_{create,destroy}_bind()
unmap_domain_pirq()

AFAICT those functions require the lock in order to assert that the
underlying device doesn't go away, as they do also use d->event_lock
in order to get exclusive access to the data fields.  Please double
check that I'm not mistaken.

If that's accurate you will have to check the call tree that spawns
from those functions in order to modify the asserts to check for
either the pcidevs or the per-domain pci_list lock being taken.

Thanks, Roger.



* Re: [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology
  2023-07-20  0:32 ` [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
                     ` (2 preceding siblings ...)
  2023-07-21 14:00   ` Roger Pau Monné
@ 2023-07-26 21:35   ` Stewart Hildebrand
  3 siblings, 0 replies; 73+ messages in thread
From: Stewart Hildebrand @ 2023-07-26 21:35 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Oleksandr Andrushchenko, Jan Beulich, Roger Pau Monné

On 7/19/23 20:32, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v8:
> - Added write lock in add_virtual_device
> Since v6:
> - re-work wrt new locking scheme
> - OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
> Since v5:
> - s/vpci_add_virtual_device/add_virtual_device and make it static
> - call add_virtual_device from vpci_assign_device and do not use
>   REGISTER_VPCI_INIT machinery
> - add pcidevs_locked ASSERT
> - use DECLARE_BITMAP for vpci_dev_assigned_map
> Since v4:
> - moved and re-worked guest sbdf initializers
> - s/set_bit/__set_bit
> - s/clear_bit/__clear_bit
> - minor comment fix s/Virtual/Guest/
> - added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
>   later for counting the number of MMIO handlers required for a guest
>   (Julien)
> Since v3:
>  - make use of VPCI_INIT
>  - moved all new code to vpci.c which belongs to it
>  - changed open-coded 31 to PCI_SLOT(~0)
>  - added comments and code to reject multifunction devices with
>    functions other than 0
>  - updated comment about vpci_dev_next and made it unsigned int
>  - implement roll back in case of error while assigning/deassigning devices
>  - s/dom%pd/%pd
> Since v2:
>  - remove casts that are (a) malformed and (b) unnecessary
>  - add new line for better readability
>  - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
>     functions are now completely gated with this config
>  - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/drivers/vpci/vpci.c | 72 ++++++++++++++++++++++++++++++++++++++++-
>  xen/include/xen/sched.h |  8 +++++
>  xen/include/xen/vpci.h  | 11 +++++++
>  3 files changed, 90 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index ca3505ecb7..baaafe4a2a 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -46,6 +46,16 @@ void vpci_remove_device(struct pci_dev *pdev)
>          return;
> 
>      spin_lock(&pdev->vpci->lock);
> +
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
> +    {
> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
> +                    &pdev->domain->vpci_dev_assigned_map);
> +        pdev->vpci->guest_sbdf.sbdf = ~0;
> +    }
> +#endif
> +
>      while ( !list_empty(&pdev->vpci->handlers) )
>      {
>          struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> @@ -101,6 +111,10 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      INIT_LIST_HEAD(&pdev->vpci->handlers);
>      spin_lock_init(&pdev->vpci->lock);
> 
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    pdev->vpci->guest_sbdf.sbdf = ~0;
> +#endif
> +
>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
>      {
>          rc = __start_vpci_array[i](pdev);
> @@ -115,6 +129,54 @@ int vpci_add_handlers(struct pci_dev *pdev)
>  }
> 
>  #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    pci_sbdf_t sbdf = { 0 };
> +    unsigned long new_dev_number;
> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(pcidevs_locked());
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +
> +    write_lock(&pdev->domain->pci_lock);

This should be replaced with an ASSERT, same as the one in vpci_add_handlers() above.

The lock is already acquired a few patches before this in the caller in
drivers/passthrough/pci.c:assign_device()

1524     write_lock(&pdev->domain->pci_lock);
1525     rc = vpci_assign_device(pdev);
1526     write_unlock(&pdev->domain->pci_lock);

> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> +                                         VPCI_MAX_VIRT_DEV);
> +    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
> +    {
> +        write_unlock(&pdev->domain->pci_lock);
> +        return -ENOSPC;
> +    }
> +
> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
> +
> +    /*
> +     * Both segment and bus number are 0:
> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
> +     *  - with bus 0 the virtual devices are seen as embedded
> +     *    endpoints behind the root complex
> +     *
> +     * TODO: add support for multi-function devices.
> +     */
> +    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
> +    pdev->vpci->guest_sbdf = sbdf;
> +    write_unlock(&pdev->domain->pci_lock);
> +
> +    return 0;
> +}
> +
>  /* Notify vPCI that device is assigned to guest. */
>  int vpci_assign_device(struct pci_dev *pdev)
>  {
> @@ -125,8 +187,16 @@ int vpci_assign_device(struct pci_dev *pdev)
> 
>      rc = vpci_add_handlers(pdev);
>      if ( rc )
> -        vpci_deassign_device(pdev);
> +        goto fail;
> +
> +    rc = add_virtual_device(pdev);
> +    if ( rc )
> +        goto fail;
> +
> +    return 0;
> 
> + fail:
> +    vpci_deassign_device(pdev);
>      return rc;
>  }
>  #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 80dd150bbf..478bd21f3e 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -461,6 +461,14 @@ struct domain
>  #ifdef CONFIG_HAS_PCI
>      struct list_head pdev_list;
>      rwlock_t pci_lock;
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    /*
> +     * The bitmap which shows which device numbers are already used by the
> +     * virtual PCI bus topology and is used to assign a unique SBDF to the
> +     * next passed through virtual PCI device.
> +     */
> +    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
> +#endif
>  #endif
> 
>  #ifdef CONFIG_HAS_PASSTHROUGH
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 6099d2141d..c55c45f7a1 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
> 
>  #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
> 
> +/*
> + * Maximum number of devices supported by the virtual bus topology:
> + * each PCI bus supports 32 devices/slots at max or up to 256 when
> + * there are multi-function ones which are not yet supported.
> + */
> +#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
> +
>  #define REGISTER_VPCI_INIT(x, p)                \
>    static vpci_register_init_t *const x##_entry  \
>                 __used_section(".data.vpci." p) = x
> @@ -155,6 +162,10 @@ struct vpci {
>              struct vpci_arch_msix_entry arch;
>          } entries[];
>      } *msix;
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    /* Guest SBDF of the device. */
> +    pci_sbdf_t guest_sbdf;
> +#endif
>  #endif
>  };
> 
> --
> 2.41.0
> 



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-26  9:35       ` Roger Pau Monné
@ 2023-07-27  0:56         ` Volodymyr Babchuk
  2023-07-27  7:41           ` Jan Beulich
  2023-07-27 12:42           ` Roger Pau Monné
  0 siblings, 2 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-27  0:56 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich

Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>> 
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> >>          ASSERT(data_offset < size);
>> >>      }
>> >>      spin_unlock(&pdev->vpci->lock);
>> >> +    unlock_locks(d);
>> >
>> > There's one issue here, some handlers will call pcidevs_lock(), which
>> > will result in a lock order inversion, as in the previous patch we
>> > agreed that the locking order was pcidevs_lock first, d->pci_lock
>> > after.
>> >
>> > For example the MSI control_write() handler will call
>> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>> > have to look into using a dedicated lock for MSI related handling, as
>> > that's the only place where I think we have this pattern of taking the
>> > pcidevs_lock after the d->pci_lock.
>> 
>> I'll mention this in the commit message. Is there something else that I
>> should do right now?
>
> Well, I don't think we want to commit this as-is with a known lock
> inversion.
>
> The functions that require the pcidevs lock are:
>
> pt_irq_{create,destroy}_bind()
> unmap_domain_pirq()
>
> AFAICT those functions require the lock in order to assert that the
> underlying device doesn't go away, as they do also use d->event_lock
> in order to get exclusive access to the data fields.  Please double
> check that I'm not mistaken.

You are right, all three functions do not access any PCI state
directly. However...

> If that's accurate you will have to check the call tree that spawns
> from those functions in order to modify the asserts to check for
> either the pcidevs or the per-domain pci_list lock being taken.

... I checked calls for the PT_IRQ_TYPE_MSI case, there is only one call that
bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
vmx_pi_update_irte():

amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().

Both functions read basic pdev fields like sbdf or type. I see no
problem there, as values of those fields are not supposed to be changed.
Also those functions use their own locks to protect shared state. But as IO-MMU
code is quite convoluted it is hard to be sure that it is safe to call
those functions without holding pcidevs_lock. All I can say is that those
functions and their callees have no ASSERT(pcidevs_locked()).

-- 
WBR, Volodymyr


* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-27  0:56         ` Volodymyr Babchuk
@ 2023-07-27  7:41           ` Jan Beulich
  2023-07-27 10:31             ` Volodymyr Babchuk
  2023-07-27 12:42           ` Roger Pau Monné
  1 sibling, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-27  7:41 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Roger Pau Monné

On 27.07.2023 02:56, Volodymyr Babchuk wrote:
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
>> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>>>
>>> Hi Roger,
>>>
>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>
>>>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>          ASSERT(data_offset < size);
>>>>>      }
>>>>>      spin_unlock(&pdev->vpci->lock);
>>>>> +    unlock_locks(d);
>>>>
>>>> There's one issue here, some handlers will call pcidevs_lock(), which
>>>> will result in a lock order inversion, as in the previous patch we
>>>> agreed that the locking order was pcidevs_lock first, d->pci_lock
>>>> after.
>>>>
>>>> For example the MSI control_write() handler will call
>>>> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>>>> have to look into using a dedicated lock for MSI related handling, as
>>>> that's the only place where I think we have this pattern of taking the
>>>> pcidevs_lock after the d->pci_lock.
>>>
>>> I'll mention this in the commit message. Is there something else that I
>>> should do right now?
>>
>> Well, I don't think we want to commit this as-is with a known lock
>> inversion.
>>
>> The functions that require the pcidevs lock are:
>>
>> pt_irq_{create,destroy}_bind()
>> unmap_domain_pirq()
>>
>> AFAICT those functions require the lock in order to assert that the
>> underlying device doesn't go away, as they do also use d->event_lock
>> in order to get exclusive access to the data fields.  Please double
>> check that I'm not mistaken.
> 
> You are right, all three functions do not access any PCI state
> directly. However...
> 
>> If that's accurate you will have to check the call tree that spawns
>> from those functions in order to modify the asserts to check for
>> either the pcidevs or the per-domain pci_list lock being taken.
> 
> ... I checked calls for the PT_IRQ_TYPE_MSI case, there is only one call that
> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
> vmx_pi_update_irte():
> 
> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
> 
> Both functions read basic pdev fields like sbdf or type. I see no
> problem there, as values of those fields are not supposed to be changed.

But whether fields are basic or will never change doesn't matter when
the pdev struct itself suddenly disappears.

Jan

> Also those functions use their own locks to protect shared state. But as IO-MMU
> code is quite convoluted it is hard to be sure that it is safe to call
> those functions without holding pcidevs_lock. All I can say is that those
> functions and their callees have no ASSERT(pcidevs_locked()).
> 




* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-27  7:41           ` Jan Beulich
@ 2023-07-27 10:31             ` Volodymyr Babchuk
  2023-07-27 11:37               ` Jan Beulich
  0 siblings, 1 reply; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-27 10:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Oleksandr Andrushchenko, Roger Pau Monné


Hi Jan

Jan Beulich <jbeulich@suse.com> writes:

> On 27.07.2023 02:56, Volodymyr Babchuk wrote:
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>>> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>>>>
>>>> Hi Roger,
>>>>
>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>
>>>>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>          ASSERT(data_offset < size);
>>>>>>      }
>>>>>>      spin_unlock(&pdev->vpci->lock);
>>>>>> +    unlock_locks(d);
>>>>>
>>>>> There's one issue here, some handlers will call pcidevs_lock(), which
>>>>> will result in a lock order inversion, as in the previous patch we
>>>>> agreed that the locking order was pcidevs_lock first, d->pci_lock
>>>>> after.
>>>>>
>>>>> For example the MSI control_write() handler will call
>>>>> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>>>>> have to look into using a dedicated lock for MSI related handling, as
>>>>> that's the only place where I think we have this pattern of taking the
>>>>> pcidevs_lock after the d->pci_lock.
>>>>
>>>> I'll mention this in the commit message. Is there something else that I
>>>> should do right now?
>>>
>>> Well, I don't think we want to commit this as-is with a known lock
>>> inversion.
>>>
>>> The functions that require the pcidevs lock are:
>>>
>>> pt_irq_{create,destroy}_bind()
>>> unmap_domain_pirq()
>>>
>>> AFAICT those functions require the lock in order to assert that the
>>> underlying device doesn't go away, as they do also use d->event_lock
>>> in order to get exclusive access to the data fields.  Please double
>>> check that I'm not mistaken.
>> 
>> You are right, all three functions do not access any PCI state
>> directly. However...
>> 
>>> If that's accurate you will have to check the call tree that spawns
>>> from those functions in order to modify the asserts to check for
>>> either the pcidevs or the per-domain pci_list lock being taken.
>> 
>> ... I checked calls for the PT_IRQ_TYPE_MSI case, there is only one call that
>> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
>> vmx_pi_update_irte():
>> 
>> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
>> 
>> Both functions read basic pdev fields like sbdf or type. I see no
>> problem there, as values of those fields are not supposed to be changed.
>
> But whether fields are basic or will never change doesn't matter when
> the pdev struct itself suddenly disappears.

This is not a problem, as it is expected that d->pci_lock is being held,
so pdev structure will not disappear. I am trying to answer another
question: is d->pci_lock enough, or is pcidevs_lock also required?

-- 
WBR, Volodymyr


* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-27 10:31             ` Volodymyr Babchuk
@ 2023-07-27 11:37               ` Jan Beulich
  2023-07-27 15:13                 ` Volodymyr Babchuk
  0 siblings, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-27 11:37 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Roger Pau Monné

On 27.07.2023 12:31, Volodymyr Babchuk wrote:
> 
> Hi Jan
> 
> Jan Beulich <jbeulich@suse.com> writes:
> 
>> On 27.07.2023 02:56, Volodymyr Babchuk wrote:
>>> Hi Roger,
>>>
>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>
>>>> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>>>>>
>>>>> Hi Roger,
>>>>>
>>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>>
>>>>>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>>          ASSERT(data_offset < size);
>>>>>>>      }
>>>>>>>      spin_unlock(&pdev->vpci->lock);
>>>>>>> +    unlock_locks(d);
>>>>>>
>>>>>> There's one issue here, some handlers will call pcidevs_lock(), which
>>>>>> will result in a lock order inversion, as in the previous patch we
>>>>>> agreed that the locking order was pcidevs_lock first, d->pci_lock
>>>>>> after.
>>>>>>
>>>>>> For example the MSI control_write() handler will call
>>>>>> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>>>>>> have to look into using a dedicated lock for MSI related handling, as
>>>>>> that's the only place where I think we have this pattern of taking the
>>>>>> pcidevs_lock after the d->pci_lock.
>>>>>
>>>>> I'll mention this in the commit message. Is there something else that I
>>>>> should do right now?
>>>>
>>>> Well, I don't think we want to commit this as-is with a known lock
>>>> inversion.
>>>>
>>>> The functions that require the pcidevs lock are:
>>>>
>>>> pt_irq_{create,destroy}_bind()
>>>> unmap_domain_pirq()
>>>>
>>>> AFAICT those functions require the lock in order to assert that the
>>>> underlying device doesn't go away, as they do also use d->event_lock
>>>> in order to get exclusive access to the data fields.  Please double
>>>> check that I'm not mistaken.
>>>
>>> You are right, all three functions do not access any PCI state
>>> directly. However...
>>>
>>>> If that's accurate you will have to check the call tree that spawns
>>>> from those functions in order to modify the asserts to check for
>>>> either the pcidevs or the per-domain pci_list lock being taken.
>>>
>>> ... I checked calls for the PT_IRQ_TYPE_MSI case, there is only one call that
>>> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
>>> vmx_pi_update_irte():
>>>
>>> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
>>>
>>> Both functions read basic pdev fields like sbdf or type. I see no
>>> problem there, as values of those fields are not supposed to be changed.
>>
>> But whether fields are basic or will never change doesn't matter when
>> the pdev struct itself suddenly disappears.
> 
> This is not a problem, as it is expected that d->pci_lock is being held,
> so pdev structure will not disappear. I am trying to answer another
> question: is d->pci_lock enough, or is pcidevs_lock also required?

To answer such questions, may I ask that you first firmly write down
(and submit) what each of the locks guards?

Jan



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-27  0:56         ` Volodymyr Babchuk
  2023-07-27  7:41           ` Jan Beulich
@ 2023-07-27 12:42           ` Roger Pau Monné
  2023-07-27 12:56             ` Jan Beulich
  2023-07-28  0:21             ` Volodymyr Babchuk
  1 sibling, 2 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-27 12:42 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich

On Thu, Jul 27, 2023 at 12:56:54AM +0000, Volodymyr Babchuk wrote:
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
> >> 
> >> Hi Roger,
> >> 
> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >> 
> >> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >> >>          ASSERT(data_offset < size);
> >> >>      }
> >> >>      spin_unlock(&pdev->vpci->lock);
> >> >> +    unlock_locks(d);
> >> >
> >> > There's one issue here, some handlers will call pcidevs_lock(), which
> >> > will result in a lock order inversion, as in the previous patch we
> >> > agreed that the locking order was pcidevs_lock first, d->pci_lock
> >> > after.
> >> >
> >> > For example the MSI control_write() handler will call
> >> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
> >> > have to look into using a dedicated lock for MSI related handling, as
> >> > that's the only place where I think we have this pattern of taking the
> >> > pcidevs_lock after the d->pci_lock.
> >> 
> >> I'll mention this in the commit message. Is there something else that I
> >> should do right now?
> >
> > Well, I don't think we want to commit this as-is with a known lock
> > inversion.
> >
> > The functions that require the pcidevs lock are:
> >
> > pt_irq_{create,destroy}_bind()
> > unmap_domain_pirq()
> >
> > AFAICT those functions require the lock in order to assert that the
> > underlying device doesn't go away, as they do also use d->event_lock
> > in order to get exclusive access to the data fields.  Please double
> > check that I'm not mistaken.
> 
> You are right, all three functions do not access any PCI state
> directly. However...
> 
> > If that's accurate you will have to check the call tree that spawns
> > from those functions in order to modify the asserts to check for
> > either the pcidevs or the per-domain pci_list lock being taken.
> 
> ... I checked calls for the PT_IRQ_TYPE_MSI case, there is only one call that
> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
> vmx_pi_update_irte():
> 
> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().

That path is only for VT-d, so strictly speaking you only need to worry
about msi_msg_write_remap_rte().

msi_msg_write_remap_rte() does take the IOMMU intremap lock.

There are also existing callers of iommu_update_ire_from_msi() that
call the functions without the pcidevs locked.  See
hpet_msi_set_affinity() for example.

Thanks, Roger.



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-27 12:42           ` Roger Pau Monné
@ 2023-07-27 12:56             ` Jan Beulich
  2023-07-27 14:43               ` Roger Pau Monné
  2023-07-28  0:21             ` Volodymyr Babchuk
  1 sibling, 1 reply; 73+ messages in thread
From: Jan Beulich @ 2023-07-27 12:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Volodymyr Babchuk

On 27.07.2023 14:42, Roger Pau Monné wrote:
> There are also existing callers of iommu_update_ire_from_msi() that
> call the functions without the pcidevs locked.  See
> hpet_msi_set_affinity() for example.

Ftaod first and foremost because there's no pdev in that case.

Jan



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-27 12:56             ` Jan Beulich
@ 2023-07-27 14:43               ` Roger Pau Monné
  0 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-27 14:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Oleksandr Andrushchenko, Volodymyr Babchuk

On Thu, Jul 27, 2023 at 02:56:18PM +0200, Jan Beulich wrote:
> On 27.07.2023 14:42, Roger Pau Monné wrote:
> > There are also existing callers of iommu_update_ire_from_msi() that
> > call the functions without the pcidevs locked.  See
> > hpet_msi_set_affinity() for example.
> 
> Ftaod first and foremost because there's no pdev in that case.

Likewise for (mostly?) the rest of the callers, as callers of the
.set_affinity hw_irq_controller hook don't have a PCI device at
hand.

Thanks, Roger.



* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-27 11:37               ` Jan Beulich
@ 2023-07-27 15:13                 ` Volodymyr Babchuk
  0 siblings, 0 replies; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-27 15:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Oleksandr Andrushchenko, Roger Pau Monné


Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 27.07.2023 12:31, Volodymyr Babchuk wrote:
>> 
>> Hi Jan
>> 
>> Jan Beulich <jbeulich@suse.com> writes:
>> 
>>> On 27.07.2023 02:56, Volodymyr Babchuk wrote:
>>>> Hi Roger,
>>>>
>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>
>>>>> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>>>>>>
>>>>>> Hi Roger,
>>>>>>
>>>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>>>
>>>>>>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>>>          ASSERT(data_offset < size);
>>>>>>>>      }
>>>>>>>>      spin_unlock(&pdev->vpci->lock);
>>>>>>>> +    unlock_locks(d);
>>>>>>>
>>>>>>> There's one issue here, some handlers will call pcidevs_lock(), which
>>>>>>> will result in a lock order inversion, as in the previous patch we
>>>>>>> agreed that the locking order was pcidevs_lock first, d->pci_lock
>>>>>>> after.
>>>>>>>
>>>>>>> For example the MSI control_write() handler will call
>>>>>>> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>>>>>>> have to look into using a dedicated lock for MSI related handling, as
>>>>>>> that's the only place where I think we have this pattern of taking the
>>>>>>> pcidevs_lock after the d->pci_lock.
>>>>>>
>>>>>> I'll mention this in the commit message. Is there something else that I
>>>>>> should do right now?
>>>>>
>>>>> Well, I don't think we want to commit this as-is with a known lock
>>>>> inversion.
>>>>>
>>>>> The functions that require the pcidevs lock are:
>>>>>
>>>>> pt_irq_{create,destroy}_bind()
>>>>> unmap_domain_pirq()
>>>>>
>>>>> AFAICT those functions require the lock in order to assert that the
>>>>> underlying device doesn't go away, as they do also use d->event_lock
>>>>> in order to get exclusive access to the data fields.  Please double
>>>>> check that I'm not mistaken.
>>>>
>>>> You are right, all three functions do not access any PCI state
>>>> directly. However...
>>>>
>>>>> If that's accurate you will have to check the call tree that spawns
>>>>> from those functions in order to modify the asserts to check for
>>>>> either the pcidevs or the per-domain pci_list lock being taken.
>>>>
>>>> ... I checked calls for the PT_IRQ_TYPE_MSI case, there is only one call that
>>>> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
>>>> vmx_pi_update_irte():
>>>>
>>>> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
>>>>
>>>> Both functions read basic pdev fields like sbdf or type. I see no
>>>> problem there, as values of those fields are not supposed to be changed.
>>>
>>> But whether fields are basic or will never change doesn't matter when
>>> the pdev struct itself suddenly disappears.
>> 
>> This is not a problem, as it is expected that d->pci_lock is being held,
>> so pdev structure will not disappear. I am trying to answer another
>> question: is d->pci_lock enough, or is pcidevs_lock also required?
>
> To answer such questions, may I ask that you first firmly write down
> (and submit) what each of the locks guards?

I can do this for the newly introduced lock. So domain->pci_lock guards:

1. domain->pdev_list. This means that PCI devices can't be added to or
removed from a domain while the lock is held in read mode. As a
byproduct, any pdev assigned to a domain can't be deleted, because we
need to deassign it first. To modify domain->pdev_list we need to hold
both d->pci_lock in write mode and pcidevs_lock.

2. Presence of the pdev->vpci struct for any pdev assigned to a domain.
The structure itself is protected by pdev->vpci->lock, but to add or
remove pdev->vpci itself we need to hold d->pci_lock in write mode.

As for pcidevs_lock, AFAIK there are no strictly written rules about
what exactly is protected by this lock.

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-27 12:42           ` Roger Pau Monné
  2023-07-27 12:56             ` Jan Beulich
@ 2023-07-28  0:21             ` Volodymyr Babchuk
  2023-07-28 13:55               ` Roger Pau Monné
  1 sibling, 1 reply; 73+ messages in thread
From: Volodymyr Babchuk @ 2023-07-28  0:21 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich


Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jul 27, 2023 at 12:56:54AM +0000, Volodymyr Babchuk wrote:
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>> >> 
>> >> Hi Roger,
>> >> 
>> >> Roger Pau Monné <roger.pau@citrix.com> writes:
>> >> 
>> >> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> >> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> >> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> >> >>          ASSERT(data_offset < size);
>> >> >>      }
>> >> >>      spin_unlock(&pdev->vpci->lock);
>> >> >> +    unlock_locks(d);
>> >> >
>> >> > There's one issue here, some handlers will call pcidevs_lock(), which
>> >> > will result in a lock order inversion, as in the previous patch we
>> >> > agreed that the locking order was pcidevs_lock first, d->pci_lock
>> >> > after.
>> >> >
>> >> > For example the MSI control_write() handler will call
>> >> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>> >> > have to look into using a dedicated lock for MSI related handling, as
>> >> > that's the only place where I think we have this pattern of taking the
>> >> > pcidevs_lock after the d->pci_lock.
>> >> 
>> >> I'll mention this in the commit message. Is there something else that I
>> >> should do right now?
>> >
>> > Well, I don't think we want to commit this as-is with a known lock
>> > inversion.
>> >
>> > The functions that require the pcidevs lock are:
>> >
>> > pt_irq_{create,destroy}_bind()
>> > unmap_domain_pirq()
>> >
>> > AFAICT those functions require the lock in order to assert that the
>> > underlying device doesn't go away, as they do also use d->event_lock
>> > in order to get exclusive access to the data fields.  Please double
>> > check that I'm not mistaken.
>> 
>> You are right, none of those three functions accesses any PCI state
>> directly. However...
>> 
>> > If that's accurate you will have to check the call tree that spawns
>> > from those functions in order to modify the asserts to check for
>> > either the pcidevs or the per-domain pci_list lock being taken.
>> 
>> ... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one
>> call that bothers me: hvm_pi_update_irte(), which calls IOMMU code via
>> vmx_pi_update_irte():
>> 
>> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
>
> That path is only for VT-d, so strictly speaking you only need to worry
> about msi_msg_write_remap_rte().
>
> msi_msg_write_remap_rte() does take the IOMMU intremap lock.
>
> There are also existing callers of iommu_update_ire_from_msi() that
> call the functions without the pcidevs locked.  See
> hpet_msi_set_affinity() for example.

Thank you for clarifying this.

I have found another call path which causes trouble:
__pci_enable_msi[x] is called from pci_enable_msi() via vMSI, via
physdev_map_irq() and also directly from the ns16550 driver.

__pci_enable_msi[x] accesses pdev fields, mostly pdev->msix or
pdev->msi_list, so it looks like we need pcidevs_lock(), as those pdev
fields are not protected by d->pci_lock...

-- 
WBR, Volodymyr


* Re: [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure
  2023-07-28  0:21             ` Volodymyr Babchuk
@ 2023-07-28 13:55               ` Roger Pau Monné
  0 siblings, 0 replies; 73+ messages in thread
From: Roger Pau Monné @ 2023-07-28 13:55 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich

On Fri, Jul 28, 2023 at 12:21:54AM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Thu, Jul 27, 2023 at 12:56:54AM +0000, Volodymyr Babchuk wrote:
> >> Hi Roger,
> >> 
> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >> 
> >> > On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
> >> >> 
> >> >> Hi Roger,
> >> >> 
> >> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >> >> 
> >> >> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> >> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> >> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >> >> >>          ASSERT(data_offset < size);
> >> >> >>      }
> >> >> >>      spin_unlock(&pdev->vpci->lock);
> >> >> >> +    unlock_locks(d);
> >> >> >
> >> >> > There's one issue here, some handlers will call pcidevs_lock(), which
> >> >> > will result in a lock order inversion, as in the previous patch we
> >> >> > agreed that the locking order was pcidevs_lock first, d->pci_lock
> >> >> > after.
> >> >> >
> >> >> > For example the MSI control_write() handler will call
> >> >> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
> >> >> > have to look into using a dedicated lock for MSI related handling, as
> >> >> > that's the only place where I think we have this pattern of taking the
> >> >> > pcidevs_lock after the d->pci_lock.
> >> >> 
> >> >> I'll mention this in the commit message. Is there something else that I
> >> >> should do right now?
> >> >
> >> > Well, I don't think we want to commit this as-is with a known lock
> >> > inversion.
> >> >
> >> > The functions that require the pcidevs lock are:
> >> >
> >> > pt_irq_{create,destroy}_bind()
> >> > unmap_domain_pirq()
> >> >
> >> > AFAICT those functions require the lock in order to assert that the
> >> > underlying device doesn't go away, as they do also use d->event_lock
> >> > in order to get exclusive access to the data fields.  Please double
> >> > check that I'm not mistaken.
> >> 
> >> You are right, none of those three functions accesses any PCI state
> >> directly. However...
> >> 
> >> > If that's accurate you will have to check the call tree that spawns
> >> > from those functions in order to modify the asserts to check for
> >> > either the pcidevs or the per-domain pci_list lock being taken.
> >> 
> >> ... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one
> >> call that bothers me: hvm_pi_update_irte(), which calls IOMMU code via
> >> vmx_pi_update_irte():
> >> 
> >> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
> >
> > That path is only for VT-d, so strictly speaking you only need to worry
> > about msi_msg_write_remap_rte().
> >
> > msi_msg_write_remap_rte() does take the IOMMU intremap lock.
> >
> > There are also existing callers of iommu_update_ire_from_msi() that
> > call the functions without the pcidevs locked.  See
> > hpet_msi_set_affinity() for example.
> 
> Thank you for clarifying this.
> 
> I have found another call path which causes troubles:
> __pci_enable_msi[x] is called from pci_enable_msi() via vMSI, via
> physdev_map_irq and also directly from ns16550 driver.

Both vPCI and physdev_map_irq() use the same path: map_domain_pirq(),
which gets called with d->event_lock taken in exclusive mode; that
should be enough, as a device cannot be assigned to multiple guests.

ns16550_init_postirq() is an init function, which means it won't be
executed after Xen has booted, so I think this is all fine, as
concurrent accesses from ns16550_init_postirq() and map_domain_pirq()
are impossible.
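
As a toy model of that funneling (illustrative only, not the actual
code): both entry points end up in one function running under a single
exclusive lock standing in for d->event_lock, so the MSI state they
touch is serialized:

```c
#include <assert.h>
#include <pthread.h>

/* Toy model: both callers funnel into the same function, which runs
 * under one exclusive lock, so the work it does is serialized even
 * though neither caller holds pcidevs_lock at this point. */
static pthread_mutex_t event_lock = PTHREAD_MUTEX_INITIALIZER;
static int msi_enable_count;

static void map_domain_pirq_model(void)
{
    pthread_mutex_lock(&event_lock);
    msi_enable_count++;          /* stands in for __pci_enable_msi() */
    pthread_mutex_unlock(&event_lock);
}

static void vpci_msi_path(void)        { map_domain_pirq_model(); }
static void physdev_map_irq_path(void) { map_domain_pirq_model(); }
```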

Thanks, Roger.



end of thread, other threads:[~2023-07-28 13:56 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-20  0:32 [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
2023-07-20  0:32 ` [PATCH v8 01/13] pci: introduce per-domain PCI rwlock Volodymyr Babchuk
2023-07-20  9:45   ` Roger Pau Monné
2023-07-20 22:57     ` Volodymyr Babchuk
2023-07-20 15:40   ` Jan Beulich
2023-07-20 23:37     ` Volodymyr Babchuk
2023-07-20  0:32 ` [PATCH v8 03/13] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
2023-07-20 11:32   ` Roger Pau Monné
2023-07-20  0:32 ` [PATCH v8 02/13] vpci: use per-domain PCI lock to protect vpci structure Volodymyr Babchuk
2023-07-20 11:20   ` Roger Pau Monné
2023-07-20 13:27     ` Jan Beulich
2023-07-20 13:50       ` Roger Pau Monné
2023-07-24  0:07         ` Volodymyr Babchuk
2023-07-24  7:59           ` Roger Pau Monné
2023-07-20 15:53     ` Jan Beulich
2023-07-26  1:17     ` Volodymyr Babchuk
2023-07-26  6:39       ` Jan Beulich
2023-07-26  9:35       ` Roger Pau Monné
2023-07-27  0:56         ` Volodymyr Babchuk
2023-07-27  7:41           ` Jan Beulich
2023-07-27 10:31             ` Volodymyr Babchuk
2023-07-27 11:37               ` Jan Beulich
2023-07-27 15:13                 ` Volodymyr Babchuk
2023-07-27 12:42           ` Roger Pau Monné
2023-07-27 12:56             ` Jan Beulich
2023-07-27 14:43               ` Roger Pau Monné
2023-07-28  0:21             ` Volodymyr Babchuk
2023-07-28 13:55               ` Roger Pau Monné
2023-07-20 16:03   ` Jan Beulich
2023-07-20 16:14     ` Roger Pau Monné
2023-07-21  6:02       ` Jan Beulich
2023-07-21  7:43         ` Roger Pau Monné
2023-07-21  8:48           ` Jan Beulich
2023-07-20 16:09   ` Jan Beulich
2023-07-20  0:32 ` [PATCH v8 04/13] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
2023-07-20 12:36   ` Roger Pau Monné
2023-07-26  1:38     ` Volodymyr Babchuk
2023-07-26  8:42       ` Roger Pau Monné
2023-07-24  9:41   ` Jan Beulich
2023-07-20  0:32 ` [PATCH v8 06/13] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
2023-07-20  0:32 ` [PATCH v8 07/13] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
2023-07-21 11:49   ` Roger Pau Monné
2023-07-20  0:32 ` [PATCH v8 05/13] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
2023-07-20 16:01   ` Roger Pau Monné
2023-07-21 10:36   ` Rahul Singh
2023-07-21 10:50     ` Jan Beulich
2023-07-21 11:52       ` Roger Pau Monné
2023-07-20  0:32 ` [PATCH v8 11/13] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
2023-07-20  6:50   ` Jan Beulich
2023-07-21  0:43     ` Volodymyr Babchuk
2023-07-21 13:53   ` Roger Pau Monné
2023-07-21 14:00   ` Roger Pau Monné
2023-07-26 21:35   ` Stewart Hildebrand
2023-07-20  0:32 ` [PATCH v8 08/13] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
2023-07-21 13:05   ` Roger Pau Monné
2023-07-24 10:30     ` Jan Beulich
2023-07-24 10:43   ` Jan Beulich
2023-07-24 13:16     ` Roger Pau Monné
2023-07-24 13:31       ` Jan Beulich
2023-07-24 13:42         ` Roger Pau Monné
2023-07-20  0:32 ` [PATCH v8 10/13] vpci/header: reset the command register when adding devices Volodymyr Babchuk
2023-07-21 13:37   ` Roger Pau Monné
2023-07-20  0:32 ` [PATCH v8 09/13] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
2023-07-21 13:32   ` Roger Pau Monné
2023-07-21 13:40     ` Roger Pau Monné
2023-07-24 11:06     ` Jan Beulich
2023-07-24 11:03   ` Jan Beulich
2023-07-20  0:32 ` [PATCH v8 12/13] xen/arm: translate virtual PCI bus topology " Volodymyr Babchuk
2023-07-20  6:54   ` Jan Beulich
2023-07-21 14:09   ` Roger Pau Monné
2023-07-24  8:02   ` Roger Pau Monné
2023-07-20  0:32 ` [PATCH v8 13/13] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
2023-07-20  0:41 ` [PATCH v8 00/13] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
