xen-devel.lists.xenproject.org archive mirror
* [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-16 16:30   ` Roger Pau Monné
  2023-06-20 20:17   ` Stewart Hildebrand
  2023-06-13 10:32 ` [PATCH v7 04/12] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Introduce a per-domain read/write lock guarding the presence of
pdev->vpci, so that no accesses to the contents of the vpci struct can
happen once it has been removed. This lock can be used (and in a few
cases is used right away) to perform vpci removal while holding the
lock in write mode. Previously such removal could race with vpci_read,
for example.

1. The per-domain vpci_rwlock is used to protect the pdev->vpci
structure from being removed while it is being accessed.

2. Writing the command register or the ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for overlaps with the existing BARs. The overlap check, if
done under the read lock, would require vpci->lock to be acquired on
both devices being compared, which may produce a deadlock, and it is
not possible to upgrade a read lock to a write lock in such a case. So,
in order to prevent the deadlock, check which registers are going to be
written and acquire the lock in the appropriate mode from the beginning
(illustrated by the sketch below).

All other code, which neither leads to pdev->vpci destruction nor
accesses multiple pdevs at the same time, can still use a combination
of the read lock and pdev->vpci->lock.

3. Optimize the detection of whether a ROM BAR write requires the write
lock by caching the offset of the ROM BAR register in
vpci->header->rom_reg, which depends on the header's type.

4. Reduce the locked region in vpci_remove_device, as it is now
possible to set pdev->vpci to NULL early, right after the write lock is
acquired.

5. Reduce the locked region in vpci_add_handlers, as it is possible to
initialize many more fields of the struct vpci before assigning it to
pdev->vpci.

6. vpci_{add|remove}_register are required to be called with the write
lock held, but it is not feasible to add an assert there, as that would
require passing the struct domain. So, add a comment about this
requirement to these and other functions with equivalent constraints.

7. Drop const qualifier where the new rwlock is used and this is appropriate.

8. Do not call process_pending_softirqs with any locks held. For that,
unlock prior to the call and re-acquire the locks afterwards. After
re-acquiring the lock there is no need to check whether pdev->vpci
still exists:
 - in apply_map, because of the context it is called in (no race
   condition possible);
 - for the MSI/MSI-X debug code, because it is called at the end of the
   pdev->vpci access and no further access to pdev->vpci is made.

9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock,
and if it is NULL, allow reading or writing the hardware register
directly. This is acceptable as we only deal with Dom0 as of now. Once
DomU support is added, writes will need to be ignored and reads will
need to return all 0's for the guests, while Dom0 can still access the
registers directly.

10. Introduce pcidevs_trylock, so there is a way to attempt to acquire
the pcidevs lock without blocking (see its use in vpci_dump_msi).

11. Use the pcidevs lock around for_each_pdev and
pci_get_pdev_by_domain when accessing pdevs in vpci code.

12. This is based on the discussion at [1].

[1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/
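
To illustrate the resulting protocol, below is a minimal sketch. It is
not part of the patch: vpci_demo_access is a hypothetical caller, while
vpci_header_need_write_lock, vpci_rwlock and vpci->lock are the names
introduced by this series.

    /* Illustrative only: intended lock usage for a config space access. */
    void vpci_demo_access(struct pci_dev *pdev, unsigned int reg,
                          unsigned int size, uint32_t data)
    {
        struct domain *d = pdev->domain;

        if ( vpci_header_need_write_lock(pdev, reg, size) )
        {
            /* May run modify_bars: exclusive access to all of d's vpci. */
            write_lock(&d->vpci_rwlock);
            /* ... may touch any pdev of d, no per-device lock needed ... */
            write_unlock(&d->vpci_rwlock);
        }
        else
        {
            /* Plain access: presence lock in read mode + per-device lock. */
            read_lock(&d->vpci_rwlock);
            spin_lock(&pdev->vpci->lock);
            /* ... touches this pdev's vpci only ... */
            spin_unlock(&pdev->vpci->lock);
            read_unlock(&d->vpci_rwlock);
        }
    }

vpci_remove_device takes the same rwlock in write mode before clearing
pdev->vpci, which is what makes the read-side !pdev->vpci checks in
vpci_{read|write} stable.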

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
This was checked on x86: with and without PVH Dom0.
---
 xen/arch/x86/hvm/vmsi.c       |   7 +++
 xen/common/domain.c           |   3 +
 xen/drivers/passthrough/pci.c |   5 ++
 xen/drivers/vpci/header.c     |  52 +++++++++++++++++
 xen/drivers/vpci/msi.c        |  25 +++++++-
 xen/drivers/vpci/msix.c       |  51 +++++++++++++---
 xen/drivers/vpci/vpci.c       | 115 ++++++++++++++++++++++++++--------
 xen/include/xen/pci.h         |   1 +
 xen/include/xen/sched.h       |   3 +
 xen/include/xen/vpci.h        |   6 ++
 10 files changed, 227 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 3cd4923060..f188450b1b 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -895,6 +895,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->vpci_rwlock));
+    ASSERT(pcidevs_locked());
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +916,11 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            pcidevs_unlock();
+            read_unlock(&pdev->domain->vpci_rwlock);
             process_pending_softirqs();
+            read_lock(&pdev->domain->vpci_rwlock);
+            pcidevs_lock();
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 6a440590fe..a4640acb62 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -622,6 +622,9 @@ struct domain *domain_create(domid_t domid,
 
 #ifdef CONFIG_HAS_PCI
     INIT_LIST_HEAD(&d->pdev_list);
+#ifdef CONFIG_HAS_VPCI
+    rwlock_init(&d->vpci_rwlock);
+#endif
 #endif
 
     /* All error paths can depend on the above setup. */
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 07d1986d33..0afcb59af0 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -57,6 +57,11 @@ void pcidevs_lock(void)
     spin_lock_recursive(&_pcidevs_lock);
 }
 
+int pcidevs_trylock(void)
+{
+    return spin_trylock_recursive(&_pcidevs_lock);
+}
+
 void pcidevs_unlock(void)
 {
     spin_unlock_recursive(&_pcidevs_lock);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index ec2e978a4e..23b5223adf 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -152,12 +152,14 @@ bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        read_lock(&v->domain->vpci_rwlock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
                         rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
                         !rc && v->vpci.rom_only);
         spin_unlock(&v->vpci.pdev->vpci->lock);
+        read_unlock(&v->domain->vpci_rwlock);
 
         rangeset_destroy(v->vpci.mem);
         v->vpci.mem = NULL;
@@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_write_locked(&d->vpci_rwlock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking pdev disappearing because devices cannot be
+         * removed until the initial domain has been started.
+         */
+        write_unlock(&d->vpci_rwlock);
         process_pending_softirqs();
+        write_lock(&d->vpci_rwlock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
     raise_softirq(SCHEDULE_SOFTIRQ);
 }
 
+/* The caller must hold the domain's vpci_rwlock in write mode. */
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
@@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * Check for overlaps with other BARs. Note that only BARs that are
      * currently mapped (enabled) are checked for overlaps.
      */
+    pcidevs_lock();
     for_each_pdev ( pdev->domain, tmp )
     {
         if ( tmp == pdev )
@@ -326,10 +342,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                 printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
                        start, end, rc);
                 rangeset_destroy(mem);
+                pcidevs_unlock();
                 return rc;
             }
         }
     }
+    pcidevs_unlock();
 
     ASSERT(dev);
 
@@ -482,6 +500,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_bar *bars = header->bars;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
@@ -570,6 +590,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
         }
     }
 
+    ASSERT(!header->rom_reg);
+
     /* Check expansion ROM. */
     rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
     if ( rc > 0 && size )
@@ -586,12 +608,42 @@ static int cf_check init_bars(struct pci_dev *pdev)
                                4, rom);
         if ( rc )
             rom->type = VPCI_BAR_EMPTY;
+
+        header->rom_reg = rom_reg;
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
 }
 REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
 
+static bool overlap(unsigned int r1_offset, unsigned int r1_size,
+                    unsigned int r2_offset, unsigned int r2_size)
+{
+    /* Return true if there is an overlap. */
+    return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size;
+}
+
+bool vpci_header_need_write_lock(const struct pci_dev *pdev,
+                                 unsigned int start, unsigned int size)
+{
+    /*
+     * Writing the command register and ROM BAR register may trigger
+     * modify_bars to run, which in turn may access multiple pdevs while
+     * checking for the existing BAR's overlap. The overlapping check, if done
+     * under the read lock, requires vpci->lock to be acquired on both devices
+     * being compared, which may produce a deadlock. At the same time it is not
+     * possible to upgrade read lock to write lock in such a case.
+     * Check which registers are going to be written and return true if lock
+     * needs to be acquired in write mode.
+     */
+    if ( overlap(start, size, PCI_COMMAND, 2) ||
+         (pdev->vpci->header.rom_reg &&
+          overlap(start, size, pdev->vpci->header.rom_reg, 4)) )
+        return true;
+
+    return false;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 8f2b59e61a..e7d1f916a0 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
     uint16_t control;
     int ret;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     if ( !pos )
         return 0;
 
@@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -277,6 +279,15 @@ void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->vpci_rwlock) )
+            continue;
+
+        if ( !pcidevs_trylock() )
+        {
+            read_unlock(&d->vpci_rwlock);
+            continue;
+        }
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -318,14 +329,22 @@ void vpci_dump_msi(void)
                      * holding the lock.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
             spin_unlock(&pdev->vpci->lock);
+ pdev_done:
+            pcidevs_unlock();
+            read_unlock(&d->vpci_rwlock);
+
             process_pending_softirqs();
+
+            read_lock(&d->vpci_rwlock);
+            pcidevs_lock();
         }
+        pcidevs_unlock();
+        read_unlock(&d->vpci_rwlock);
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index 25bde77586..b5a7dfdf9c 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -143,6 +143,7 @@ static void cf_check control_write(
         pci_conf_write16(pdev->sbdf, reg, val);
 }
 
+/* The caller must hold the domain's vpci_rwlock in at least read mode. */
 static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
@@ -163,7 +164,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->vpci_rwlock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->vpci_rwlock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +365,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0ul;
 
+    read_lock(&d->vpci_rwlock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+        read_unlock(&d->vpci_rwlock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +424,7 @@ static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->vpci_rwlock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +512,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->vpci_rwlock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+        read_unlock(&d->vpci_rwlock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +613,7 @@ static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->vpci_rwlock);
 
     return X86EMUL_OKAY;
 }
@@ -665,6 +700,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
     struct vpci_msix *msix;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
                                       PCI_CAP_ID_MSIX);
     if ( !msix_offset )
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 652807a4a4..1270174e78 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
-    if ( !has_vpci(pdev->domain) || !pdev->vpci )
+    struct vpci *vpci;
+
+    if ( !has_vpci(pdev->domain) )
         return;
 
-    spin_lock(&pdev->vpci->lock);
+    write_lock(&pdev->domain->vpci_rwlock);
+    if ( !pdev->vpci )
+    {
+        write_unlock(&pdev->domain->vpci_rwlock);
+        return;
+    }
+
+    vpci = pdev->vpci;
+    pdev->vpci = NULL;
+    write_unlock(&pdev->domain->vpci_rwlock);
+
-    while ( !list_empty(&pdev->vpci->handlers) )
+    while ( !list_empty(&vpci->handlers) )
     {
-        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
+        struct vpci_register *r = list_first_entry(&vpci->handlers,
                                                    struct vpci_register,
                                                    node);
 
         list_del(&r->node);
         xfree(r);
     }
-    spin_unlock(&pdev->vpci->lock);
+
-    if ( pdev->vpci->msix )
+    if ( vpci->msix )
     {
         unsigned int i;
@@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
-            if ( pdev->vpci->msix->table[i] )
-                iounmap(pdev->vpci->msix->table[i]);
+            if ( vpci->msix->table[i] )
+                iounmap(vpci->msix->table[i]);
     }
-    xfree(pdev->vpci->msix);
-    xfree(pdev->vpci->msi);
-    xfree(pdev->vpci);
-    pdev->vpci = NULL;
+    xfree(vpci->msix);
+    xfree(vpci->msi);
+    xfree(vpci);
 }
 
 int vpci_add_handlers(struct pci_dev *pdev)
 {
+    struct vpci *vpci;
     unsigned int i;
     int rc = 0;
 
     if ( !has_vpci(pdev->domain) )
         return 0;
 
+    vpci = xzalloc(struct vpci);
+    if ( !vpci )
+        return -ENOMEM;
+
+    INIT_LIST_HEAD(&vpci->handlers);
+    spin_lock_init(&vpci->lock);
+
+    write_lock(&pdev->domain->vpci_rwlock);
+
     /* We should not get here twice for the same device. */
     ASSERT(!pdev->vpci);
 
-    pdev->vpci = xzalloc(struct vpci);
-    if ( !pdev->vpci )
-        return -ENOMEM;
-
-    INIT_LIST_HEAD(&pdev->vpci->handlers);
-    spin_lock_init(&pdev->vpci->lock);
+    pdev->vpci = vpci;
 
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
@@ -91,6 +107,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
         if ( rc )
             break;
     }
+    write_unlock(&pdev->domain->vpci_rwlock);
 
     if ( rc )
         vpci_remove_device(pdev);
@@ -139,6 +156,7 @@ uint32_t cf_check vpci_hw_read32(
     return pci_conf_read32(pdev->sbdf, reg);
 }
 
+/* The caller must hold the domain's vpci_rwlock in write mode. */
 int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                       vpci_write_t *write_handler, unsigned int offset,
                       unsigned int size, void *data)
@@ -162,8 +180,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
     r->offset = offset;
     r->private = data;
 
-    spin_lock(&vpci->lock);
-
     /* The list of handlers must be kept sorted at all times. */
     list_for_each ( prev, &vpci->handlers )
     {
@@ -175,25 +191,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
             break;
         if ( cmp == 0 )
         {
-            spin_unlock(&vpci->lock);
             xfree(r);
             return -EEXIST;
         }
     }
 
     list_add_tail(&r->node, prev);
-    spin_unlock(&vpci->lock);
 
     return 0;
 }
 
+/* The caller must hold the domain's vpci_rwlock in write mode. */
 int vpci_remove_register(struct vpci *vpci, unsigned int offset,
                          unsigned int size)
 {
     const struct vpci_register r = { .offset = offset, .size = size };
     struct vpci_register *rm;
 
-    spin_lock(&vpci->lock);
     list_for_each_entry ( rm, &vpci->handlers, node )
     {
         int cmp = vpci_register_cmp(&r, rm);
@@ -205,14 +219,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
         if ( !cmp && rm->offset == offset && rm->size == size )
         {
             list_del(&rm->node);
-            spin_unlock(&vpci->lock);
             xfree(rm);
             return 0;
         }
         if ( cmp <= 0 )
             break;
     }
-    spin_unlock(&vpci->lock);
 
     return -ENOENT;
 }
@@ -320,7 +332,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -333,10 +345,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
     }
 
     /* Find the PCI dev matching the address. */
+    pcidevs_lock();
     pdev = pci_get_pdev(d, sbdf);
-    if ( !pdev || !pdev->vpci )
+    pcidevs_unlock();
+    if ( !pdev )
         return vpci_read_hw(sbdf, reg, size);
 
+    read_lock(&d->vpci_rwlock);
+    if ( !pdev->vpci )
+    {
+        read_unlock(&d->vpci_rwlock);
+        return vpci_read_hw(sbdf, reg, size);
+    }
     spin_lock(&pdev->vpci->lock);
 
     /* Read from the hardware or the emulated register handlers. */
@@ -381,6 +401,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(&d->vpci_rwlock);
 
     if ( data_offset < size )
     {
@@ -423,11 +444,12 @@ static void vpci_write_helper(const struct pci_dev *pdev,
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
+    bool write_locked = false;
 
     if ( !size )
     {
@@ -443,14 +465,38 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
      * Find the PCI dev matching the address.
      * Passthrough everything that's not trapped.
      */
+    pcidevs_lock();
     pdev = pci_get_pdev(d, sbdf);
-    if ( !pdev || !pdev->vpci )
+    pcidevs_unlock();
+    if ( !pdev )
     {
         vpci_write_hw(sbdf, reg, size, data);
         return;
     }
 
-    spin_lock(&pdev->vpci->lock);
+    if ( vpci_header_need_write_lock(pdev, reg, size) )
+    {
+        /* Gain exclusive access to the vpci of all the domain's pdevs. */
+        write_lock(&d->vpci_rwlock);
+        if ( !pdev->vpci )
+        {
+            write_unlock(&d->vpci_rwlock);
+            vpci_write_hw(sbdf, reg, size, data);
+            return;
+        }
+        write_locked = true;
+    }
+    else
+    {
+        read_lock(&d->vpci_rwlock);
+        if ( !pdev->vpci )
+        {
+            read_unlock(&d->vpci_rwlock);
+            vpci_write_hw(sbdf, reg, size, data);
+            return;
+        }
+        spin_lock(&pdev->vpci->lock);
+    }
 
     /* Write the value to the hardware or emulated registers. */
     list_for_each_entry ( r, &pdev->vpci->handlers, node )
@@ -485,7 +531,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
             break;
         ASSERT(data_offset < size);
     }
-    spin_unlock(&pdev->vpci->lock);
+
+    if ( write_locked )
+        write_unlock(&d->vpci_rwlock);
+    else
+    {
+        spin_unlock(&pdev->vpci->lock);
+        read_unlock(&d->vpci_rwlock);
+    }
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 5975ca2f30..4512910dca 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -157,6 +157,7 @@ struct pci_dev {
  */
 
 void pcidevs_lock(void);
+int pcidevs_trylock(void);
 void pcidevs_unlock(void);
 bool_t __must_check pcidevs_locked(void);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 85242a73d3..78227b7a1d 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -460,6 +460,9 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+#ifdef CONFIG_HAS_VPCI
+    rwlock_t vpci_rwlock;
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 0b8a2a3c74..3a7fcc4463 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -57,6 +57,9 @@ uint32_t cf_check vpci_hw_read32(
  */
 bool __must_check vpci_process_pending(struct vcpu *v);
 
+bool vpci_header_need_write_lock(const struct pci_dev *pdev,
+                                 unsigned int start, unsigned int size);
+
 struct vpci {
     /* List of vPCI handlers for a device. */
     struct list_head handlers;
@@ -83,6 +86,9 @@ struct vpci {
         } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
         /* At most 6 BARS + 1 expansion ROM BAR. */
 
+        /* Offset to the ROM BAR register if any. */
+        unsigned int rom_reg;
+
         /*
          * Store whether the ROM enable bit is set (doesn't imply ROM BAR
          * is mapped into guest p2m) if there's a ROM BAR on the device.
-- 
2.40.1


* [PATCH v7 00/12] PCI devices passthrough on Arm, part 3
@ 2023-06-13 10:32 Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure Volodymyr Babchuk
                   ` (12 more replies)
  0 siblings, 13 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Volodymyr Babchuk, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Bertrand Marquis, Volodymyr Babchuk

Hello,

This is another version of the vPCI rework (the previous one can be
found at [1]). The biggest change is how vPCI locking is done: this
series uses a per-domain vPCI rwlock.

Note that this series does not include my work on reference counting
for PCI devices, because this counting does not resolve the issues we
are having with vPCI. While it is (maybe) nice to have PCI refcounting,
it does not move us towards PCI on ARM.


[1] https://lore.kernel.org/all/20220204063459.680961-1-andr2000@gmail.com/

Oleksandr Andrushchenko (12):
  vpci: introduce per-domain lock to protect vpci structure
  vpci: restrict unhandled read/write operations for guests
  vpci: add hooks for PCI device assign/de-assign
  vpci/header: implement guest BAR register handlers
  rangeset: add RANGESETF_no_print flag
  vpci/header: handle p2m range sets per BAR
  vpci/header: program p2m with guest BAR view
  vpci/header: emulate PCI_COMMAND register for guests
  vpci/header: reset the command register when adding devices
  vpci: add initial support for virtual PCI bus topology
  xen/arm: translate virtual PCI bus topology for guests
  xen/arm: account IO handlers for emulated PCI MSI-X

 xen/arch/arm/vpci.c           |  31 ++-
 xen/arch/x86/hvm/vmsi.c       |   7 +
 xen/common/domain.c           |   3 +
 xen/common/rangeset.c         |   5 +-
 xen/drivers/Kconfig           |   4 +
 xen/drivers/passthrough/pci.c |  16 ++
 xen/drivers/vpci/header.c     | 485 ++++++++++++++++++++++++++++------
 xen/drivers/vpci/msi.c        |  29 +-
 xen/drivers/vpci/msix.c       |  55 +++-
 xen/drivers/vpci/vpci.c       | 257 ++++++++++++++++--
 xen/include/xen/pci.h         |   1 +
 xen/include/xen/rangeset.h    |   5 +-
 xen/include/xen/sched.h       |  11 +
 xen/include/xen/vpci.h        |  47 +++-
 14 files changed, 826 insertions(+), 130 deletions(-)

-- 
2.40.1


* [PATCH v7 02/12] vpci: restrict unhandled read/write operations for guests
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 04/12] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 03/12] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

A guest would be able to read and write those registers which are not
emulated and have no respective vPCI handlers, making it possible for
it to access the hardware directly.
In order to prevent a guest from reading from and writing to the
unhandled registers, make sure only the hardware domain can access the
hardware directly and restrict guests from doing so.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Since v6:
- do not use is_hwdom parameter for vpci_{read|write}_hw and use
  current->domain internally
- update commit message
New in v6
---
 xen/drivers/vpci/vpci.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 1270174e78..0b11d9c3f8 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -235,6 +235,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
 {
     uint32_t data;
 
+    /* Guest domains are not allowed to read real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return ~(uint32_t)0;
+
     switch ( size )
     {
     case 4:
@@ -275,9 +279,13 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
     return data;
 }
 
-static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
-                          uint32_t data)
+static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg,
+                          unsigned int size, uint32_t data)
 {
+    /* Guest domains are not allowed to write real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return;
+
     switch ( size )
     {
     case 4:
-- 
2.40.1


* [PATCH v7 03/12] vpci: add hooks for PCI device assign/de-assign
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (2 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 02/12] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 06/12] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu,
	Paul Durrant, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a PCI device gets assigned/de-assigned, some work on the vPCI side
needs to be done for that device. Introduce a pair of hooks so vPCI can
handle that.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v6:
- do not pass struct domain to vpci_{assign|deassign}_device as
  pdev->domain can be used
- do not leave the device assigned (pdev->domain == new domain) in case
  vpci_assign_device fails: try to de-assign and if this also fails, then
  crash the domain
Since v5:
- do not split code into run_vpci_init
- do not check for is_system_domain in vpci_{de}assign_device
- do not use vpci_remove_device_handlers_locked and re-allocate
  pdev->vpci completely
- make vpci_deassign_device void
Since v4:
 - de-assign vPCI from the previous domain on device assignment
 - do not remove handlers in vpci_assign_device as those must not
   exist at that point
Since v3:
 - remove toolstack roll-back description from the commit message
   as error are to be handled with proper cleanup in Xen itself
 - remove __must_check
 - remove redundant rc check while assigning devices
 - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
 - use REGISTER_VPCI_INIT machinery to run required steps on device
   init/assign: add run_vpci_init helper
Since v2:
- define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
  for x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - extended the commit message
---
 xen/drivers/Kconfig           |  4 ++++
 xen/drivers/passthrough/pci.c | 11 +++++++++++
 xen/drivers/vpci/vpci.c       | 27 +++++++++++++++++++++++++++
 xen/include/xen/vpci.h        | 15 +++++++++++++++
 4 files changed, 57 insertions(+)

diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index db94393f47..780490cf8e 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
 config HAS_VPCI
 	bool
 
+config HAS_VPCI_GUEST_SUPPORT
+	bool
+	depends on HAS_VPCI
+
 endmenu
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 0afcb59af0..f43b40508a 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -877,6 +877,8 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
     if ( ret )
         goto out;
 
+    vpci_deassign_device(pdev);
+
     if ( pdev->domain == hardware_domain  )
         pdev->quarantine = false;
 
@@ -1417,6 +1419,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
 {
     const struct domain_iommu *hd = dom_iommu(d);
     struct pci_dev *pdev;
+    uint8_t old_devfn;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
@@ -1436,6 +1439,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
     if ( pdev->broken && d != hardware_domain && d != dom_io )
         goto done;
 
+    vpci_deassign_device(pdev);
+
     rc = pdev_msix_assign(d, pdev);
     if ( rc )
         goto done;
@@ -1453,6 +1458,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
                           pci_to_dev(pdev), flag)) )
         goto done;
 
+    old_devfn = devfn;
+
     for ( ; pdev->phantom_stride; rc = 0 )
     {
         devfn += pdev->phantom_stride;
@@ -1462,6 +1469,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
                         pci_to_dev(pdev), flag);
     }
 
+    rc = vpci_assign_device(pdev);
+    if ( rc && deassign_device(d, seg, bus, old_devfn) )
+        domain_crash(d);
+
  done:
     if ( rc )
         printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 0b11d9c3f8..4182c65eaa 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -114,6 +114,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
 
     return rc;
 }
+
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+/* Notify vPCI that device is assigned to guest. */
+int vpci_assign_device(struct pci_dev *pdev)
+{
+    int rc;
+
+    if ( !has_vpci(pdev->domain) )
+        return 0;
+
+    rc = vpci_add_handlers(pdev);
+    if ( rc )
+        vpci_deassign_device(pdev);
+
+    return rc;
+}
+
+/* Notify vPCI that device is de-assigned from guest. */
+void vpci_deassign_device(struct pci_dev *pdev)
+{
+    if ( !has_vpci(pdev->domain) )
+        return;
+
+    vpci_remove_device(pdev);
+}
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 3a7fcc4463..5383ede556 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -270,6 +270,21 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
 }
 #endif
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+/* Notify vPCI that device is assigned/de-assigned to/from guest. */
+int vpci_assign_device(struct pci_dev *pdev);
+void vpci_deassign_device(struct pci_dev *pdev);
+#else
+static inline int vpci_assign_device(struct pci_dev *pdev)
+{
+    return 0;
+};
+
+static inline void vpci_deassign_device(struct pci_dev *pdev)
+{
+};
+#endif
+
 #endif
 
 /*
-- 
2.40.1



* [PATCH v7 04/12] vpci/header: implement guest BAR register handlers
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 02/12] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add the relevant vPCI register handlers when assigning a PCI device to
a domain and remove them when de-assigning. This allows having
different handlers for different domains, e.g. hwdom and other guests.

Emulate guest BAR register values: this allows creating a guest view of
the registers and emulates the size and properties probe as done during
PCI device enumeration by the guest.

All empty, IO and ROM BARs for guests are emulated by returning 0 on
reads and ignoring writes: these BARs are special in this respect, as
their lower bits have special meaning, so returning the default ~0 on
read may confuse the guest OS.

Memory decoding is initially disabled when used by guests in order to
prevent the BAR being placed on top of a RAM region.

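For context, a reminder of what is being emulated (illustrative sketch,
not part of the patch; probe_bar_size is hypothetical): a guest sizes a
memory BAR by writing all ones and reading back the resulting address
mask, and the guest_bar_{read|write} handlers below must reproduce this
behaviour from the cached bar->size instead of letting the probe reach
real hardware.

    /* Illustrative guest-side BAR sizing sequence (32-bit memory BAR). */
    static uint32_t probe_bar_size(pci_sbdf_t sbdf, unsigned int reg)
    {
        uint32_t orig = pci_conf_read32(sbdf, reg);
        uint32_t mask;

        pci_conf_write32(sbdf, reg, ~0u);          /* write all ones */
        mask = pci_conf_read32(sbdf, reg) & PCI_BASE_ADDRESS_MEM_MASK;
        pci_conf_write32(sbdf, reg, orig);         /* restore the BAR */

        /* The size is the two's complement of the returned mask. */
        return -mask;
    }
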
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---

Since v6:
- unify the writing of the PCI_COMMAND register on the
  error path into a label
- do not introduce bar_ignore_access helper and open code
- s/guest_bar_ignore_read/empty_bar_read
- update error message in guest_bar_write
- only setup empty_bar_read for IO if !x86
Since v5:
- make sure that the guest set address has the same page offset
  as the physical address on the host
- remove guest_rom_{read|write} as those just implement the default
  behaviour of the registers not being handled
- adjusted comment for struct vpci.addr field
- add guest handlers for BARs which are not handled and would otherwise
  return ~0 on read and ignore writes. These BARs are special in this
  respect, as their lower bits have special meaning, so returning ~0
  doesn't seem right
Since v4:
- updated commit message
- s/guest_addr/guest_reg
Since v3:
- squashed two patches: dynamic add/remove handlers and guest BAR
  handler implementation
- fix guest BAR read of the high part of a 64bit BAR (Roger)
- add error handling to vpci_assign_device
- s/dom%pd/%pd
- blank line before return
Since v2:
- remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
  has been eliminated from being built on x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - simplify some code
 - use gdprintk + error code instead of gprintk
 - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
   so these do not get compiled for x86
 - removed unneeded is_system_domain check
 - re-work guest read/write to be much simpler and do more work on write
   than read which is expected to be called more frequently
 - removed one too obvious comment
---
 xen/drivers/vpci/header.c | 159 ++++++++++++++++++++++++++++++--------
 xen/include/xen/vpci.h    |   3 +
 2 files changed, 131 insertions(+), 31 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 23b5223adf..8eebbf968b 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -439,6 +439,71 @@ static void cf_check bar_write(
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static void cf_check guest_bar_write(const struct pci_dev *pdev,
+                                     unsigned int reg, uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    bool hi = false;
+    uint64_t guest_reg = bar->guest_reg;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+    {
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+
+    guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
+    guest_reg |= (uint64_t)val << (hi ? 32 : 0);
+
+    guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
+
+    /*
+     * Make sure that the guest set address has the same page offset
+     * as the physical address on the host or otherwise things won't work as
+     * expected.
+     */
+    if ( (guest_reg & (~PAGE_MASK & PCI_BASE_ADDRESS_MEM_MASK)) !=
+         (bar->addr & ~PAGE_MASK) )
+    {
+        gprintk(XENLOG_WARNING,
+                "%pp: ignored BAR %zu write attempting to change page offset\n",
+                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
+        return;
+    }
+
+    bar->guest_reg = guest_reg;
+}
+
+static uint32_t cf_check guest_bar_read(const struct pci_dev *pdev,
+                                        unsigned int reg, void *data)
+{
+    const struct vpci_bar *bar = data;
+    bool hi = false;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    return bar->guest_reg >> (hi ? 32 : 0);
+}
+
+static uint32_t cf_check empty_bar_read(const struct pci_dev *pdev,
+                                        unsigned int reg, void *data)
+{
+    return 0;
+}
+
 static void cf_check rom_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -499,6 +564,7 @@ static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_header *header = &pdev->vpci->header;
     struct vpci_bar *bars = header->bars;
     int rc;
+    bool is_hwdom = is_hardware_domain(pdev->domain);
 
     ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
 
@@ -540,13 +606,12 @@ static int cf_check init_bars(struct pci_dev *pdev)
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            rc = vpci_add_register(pdev->vpci,
+                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                                   is_hwdom ? bar_write : guest_bar_write,
+                                   reg, 4, &bars[i]);
             if ( rc )
-            {
-                pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-                return rc;
-            }
+                goto fail;
 
             continue;
         }
@@ -555,6 +620,17 @@ static int cf_check init_bars(struct pci_dev *pdev)
         if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
         {
             bars[i].type = VPCI_BAR_IO;
+
+#ifndef CONFIG_X86
+            if ( !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                       reg, 4, &bars[i]);
+                if ( rc )
+                    goto fail;
+            }
+#endif
+
             continue;
         }
         if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
@@ -566,14 +642,20 @@ static int cf_check init_bars(struct pci_dev *pdev)
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
 
         if ( size == 0 )
         {
             bars[i].type = VPCI_BAR_EMPTY;
+
+            if ( !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                       reg, 4, &bars[i]);
+                if ( rc )
+                    goto fail;
+            }
+
             continue;
         }
 
@@ -581,38 +663,53 @@ static int cf_check init_bars(struct pci_dev *pdev)
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                               is_hwdom ? bar_write : guest_bar_write,
+                               reg, 4, &bars[i]);
         if ( rc )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
     }
 
     ASSERT(!header->rom_reg);
-
-    /* Check expansion ROM. */
-    rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
-    if ( rc > 0 && size )
+    /* Check expansion ROM: we do not handle ROM for guests. */
+    if ( is_hwdom )
     {
-        struct vpci_bar *rom = &header->bars[num_bars];
+        rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
+        if ( rc > 0 && size )
+        {
+            struct vpci_bar *rom = &header->bars[num_bars];
 
-        rom->type = VPCI_BAR_ROM;
-        rom->size = size;
-        rom->addr = addr;
-        header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
-                              PCI_ROM_ADDRESS_ENABLE;
+            rom->type = VPCI_BAR_ROM;
+            rom->size = size;
+            rom->addr = addr;
+            header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
+                                  PCI_ROM_ADDRESS_ENABLE;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
-                               4, rom);
-        if ( rc )
-            rom->type = VPCI_BAR_EMPTY;
+            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
+                                   rom_reg, 4, rom);
+            if ( rc )
+                rom->type = VPCI_BAR_EMPTY;
 
-        header->rom_reg = rom_reg;
+            header->rom_reg = rom_reg;
+        }
+    }
+    else
+    {
+        if ( !is_hwdom )
+        {
+            rc = vpci_add_register(pdev->vpci, empty_bar_read, NULL,
+                                   rom_reg, 4, &header->bars[num_bars]);
+            if ( rc )
+                goto fail;
+        }
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
+
+ fail:
+    pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    return rc;
 }
 REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 5383ede556..e9170cc8ca 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -70,7 +70,10 @@ struct vpci {
     struct vpci_header {
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            /* Physical (host) address. */
             uint64_t addr;
+            /* Guest view of the BAR: address and lower bits. */
+            uint64_t guest_reg;
             uint64_t size;
             enum {
                 VPCI_BAR_EMPTY,
-- 
2.40.1



* [PATCH v7 07/12] vpci/header: program p2m with guest BAR view
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (6 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 05/12] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 09/12] vpci/header: reset the command register when adding devices Volodymyr Babchuk
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account the guest's BAR view and program its p2m accordingly:
the gfn is the guest's view of the BAR and the mfn is the physical BAR
value as set up by the PCI bus driver in the hardware domain.
This way the hardware domain sees the physical BAR values and the guest
sees the emulated ones.

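To make the address math concrete, a worked example with made-up values
(not taken from the patch), following the start_gfn computation in
map_range below:

    gfn_t start_gfn;
    mfn_t start_mfn;

    /*
     * Host BAR at 0xe0100000 (bar->addr), guest view at 0x50100000
     * (bar->guest_reg).
     */
    start_mfn = _mfn(PFN_DOWN(0xe0100000));            /* 0xe0100 */
    start_gfn = _gfn(PFN_DOWN(0x50100000));            /* 0x50100 */

    /*
     * A chunk starting at host frame s = 0xe0180 sits 0x80 frames into
     * the BAR, so it must appear at the same offset in the guest view.
     */
    start_gfn = gfn_add(start_gfn, 0xe0180 - mfn_x(start_mfn));

    /* Result: the chunk is mapped at gfn 0x50180 onto mfn 0xe0180. */
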
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- remove debug print in map_range callback
- remove "identity" from the debug print
Since v4:
- moved start_{gfn|mfn} calculation into map_range
- pass vpci_bar in the map_data instead of start_{gfn|mfn}
- s/guest_addr/guest_reg
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 8e722857d6..0524fbd220 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -30,6 +30,7 @@
 
 struct map_data {
     struct domain *d;
+    const struct vpci_bar *bar;
     bool map;
 };
 
@@ -41,8 +42,21 @@ static int cf_check map_range(
 
     for ( ; ; )
     {
+        /* Start address of the BAR as seen by the guest. */
+        gfn_t start_gfn = _gfn(PFN_DOWN(is_hardware_domain(map->d)
+                                        ? map->bar->addr
+                                        : map->bar->guest_reg));
+        /* Physical start address of the BAR. */
+        mfn_t start_mfn = _mfn(PFN_DOWN(map->bar->addr));
         unsigned long size = e - s + 1;
 
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        start_gfn = gfn_add(start_gfn, s - mfn_x(start_mfn));
+
         /*
          * ARM TODOs:
          * - On ARM whether the memory is prefetchable or not should be passed
@@ -52,8 +66,8 @@ static int cf_check map_range(
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, start_gfn, size, _mfn(s))
+                      : unmap_mmio_regions(map->d, start_gfn, size, _mfn(s));
         if ( rc == 0 )
         {
             *c += size;
@@ -62,8 +76,8 @@ static int cf_check map_range(
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to %smap [%lx, %lx] for %pd: %d\n",
+                   map->map ? "" : "un", s, e, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -165,6 +179,7 @@ bool vpci_process_pending(struct vcpu *v)
             if ( rangeset_is_empty(bar->mem) )
                 continue;
 
+            data.bar = bar;
             rc = rangeset_consume_ranges(bar->mem, map_range, &data);
 
             if ( rc == -ERESTART )
@@ -224,6 +239,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
         if ( rangeset_is_empty(bar->mem) )
             continue;
 
+        data.bar = bar;
         while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
                                               &data)) == -ERESTART )
         {
-- 
2.40.1



* [PATCH v7 08/12] vpci/header: emulate PCI_COMMAND register for guests
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (4 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 06/12] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 05/12] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Xen and/or Dom0 may have put values in PCI_COMMAND which they expect to
remain unaltered. The PCI_COMMAND_SERR bit is a good example: while the
guest's view of this will want to be zero initially, if the host has
set it to 1 it must not simply be overwritten with 0, or else we would
effectively give the guest control of the bit. Thus, the PCI_COMMAND
register needs proper emulation in order to honor the host's settings.

There are examples of emulators [1], [2] which already deal with
PCI_COMMAND register emulation, and it seems that the most they care
about is the INTx bit (besides the IO/memory enable and bus master
bits, which are write-through).
It could be because, in order to properly emulate the PCI_COMMAND
register, we would need to know about the whole PCI topology, e.g.
whether a setting in a device's command register is aligned with the
upstream port etc.
This makes me think that because of this complexity others just ignore
it. Nor do I think this can easily be done in Xen's case.

According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
Device Control" the reset state of the command register is typically 0,
so when assigning a PCI device use 0 as the initial state for the guest's view
of the command register.

For now our emulation only makes sure INTx is set according to the host
requirements, i.e. depending on MSI/MSI-X enabled state.

This implementation and the decision to only emulate INTx bit for now
is based on the previous discussion at [3].

[1] https://github.com/qemu/qemu/blob/master/hw/xen/xen_pt_config_init.c#L310
[2] https://github.com/projectacrn/acrn-hypervisor/blob/master/hypervisor/hw/pci.c#L336
[3] https://patchwork.kernel.org/project/xen-devel/patch/20210903100831.177748-9-andr2000@gmail.com/
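
As an illustration of the resulting behaviour (effective_cmd is a
hypothetical helper, not part of the patch): the value the hardware
sees when a guest writes PCI_COMMAND differs from the guest's own view,
which cmd_read below reports from header->guest_cmd.

    /* Illustrative only: hardware view of a guest PCI_COMMAND write. */
    static uint16_t effective_cmd(uint16_t guest_cmd, bool msi_enabled)
    {
        uint16_t cmd = guest_cmd;

        if ( msi_enabled )
            cmd |= PCI_COMMAND_INTX_DISABLE;  /* INTx must stay disabled */

        return cmd;
    }

E.g. a guest write of 0x0006 (memory decoding + bus mastering, INTx
enabled) while MSI is active reaches the hardware as 0x0406, yet the
guest still reads back 0x0006.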

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---

lorc's rebase: check for the logic after rebase

Since v6:
- fold guest's logic into cmd_write
- implement cmd_read, so we can report emulated INTx state to guests
- introduce header->guest_cmd to hold the emulated state of the
  PCI_COMMAND register for guests
Since v5:
- add additional check for MSI-X enabled while altering INTX bit
- make sure INTx disabled while guests enable MSI/MSI-X
Since v3:
- gate more code on CONFIG_HAS_MSI
- removed logic for the case when MSI/MSI-X not enabled
---
 xen/drivers/vpci/header.c | 39 ++++++++++++++++++++++++++++++++++++++-
 xen/drivers/vpci/msi.c    |  4 ++++
 xen/drivers/vpci/msix.c   |  4 ++++
 xen/include/xen/vpci.h    |  2 ++
 4 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 0524fbd220..677b93226c 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -465,11 +465,28 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     return 0;
 }
 
+/* TODO: Add proper emulation for all bits of the command register. */
 static void cf_check cmd_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t cmd, void *data)
 {
     struct vpci_header *header = data;
 
+    if ( !is_hardware_domain(pdev->domain) )
+    {
+        struct vpci_header *header = data;
+
+        header->guest_cmd = cmd;
+#ifdef CONFIG_HAS_PCI_MSI
+        if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
+            /*
+             * Guest wants to enable INTx, but it can't be enabled
+             * if MSI/MSI-X enabled.
+             */
+            cmd |= PCI_COMMAND_INTX_DISABLE;
+#endif
+    }
+
     /*
      * Let Dom0 play with all the bits directly except for the memory
      * decoding one.
@@ -486,6 +502,19 @@ static void cf_check cmd_write(
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
+static uint32_t cmd_read(const struct pci_dev *pdev, unsigned int reg,
+                         void *data)
+{
+    if ( !is_hardware_domain(pdev->domain) )
+    {
+        struct vpci_header *header = data;
+
+        return header->guest_cmd;
+    }
+
+    return pci_conf_read16(pdev->sbdf, reg);
+}
+
 static void cf_check bar_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -692,8 +721,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
         return -EOPNOTSUPP;
     }
 
+    /*
+     * According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
+     * Device Control" the reset state of the command register is
+     * typically all 0's, so this is used as initial value for the guests.
+     */
+    ASSERT(header->guest_cmd == 0);
+
     /* Setup a handler for the command register. */
-    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
+    rc = vpci_add_register(pdev->vpci, cmd_read, cmd_write, PCI_COMMAND,
                            2, header);
     if ( rc )
         return rc;
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index e7d1f916a0..687aef54d1 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -70,6 +70,10 @@ static void cf_check control_write(
 
         if ( vpci_msi_arch_enable(msi, pdev, vectors) )
             return;
+
+        /* Make sure guest doesn't enable INTx while enabling MSI. */
+        if ( !is_hardware_domain(pdev->domain) )
+            pci_intx(pdev, false);
     }
     else
         vpci_msi_arch_disable(msi, pdev);
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index b5a7dfdf9c..e4b957d5f3 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -97,6 +97,10 @@ static void cf_check control_write(
         for ( i = 0; i < msix->max_entries; i++ )
             if ( !msix->entries[i].masked && msix->entries[i].updated )
                 update_entry(&msix->entries[i], pdev, i);
+
+        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
+        if ( !is_hardware_domain(pdev->domain) )
+            pci_intx(pdev, false);
     }
     else if ( !new_enabled && msix->enabled )
     {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 1e42a59c1d..fe100a8cb7 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -92,6 +92,8 @@ struct vpci {
 
         /* Offset to the ROM BAR register if any. */
         unsigned int rom_reg;
+        /* Guest view of the PCI_COMMAND register. */
+        uint16_t guest_cmd;
 
         /*
          * Store whether the ROM enable bit is set (doesn't imply ROM BAR
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v7 06/12] vpci/header: handle p2m range sets per BAR
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (3 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 03/12] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 08/12] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Instead of handling a single range set that contains all the memory
regions of all the BARs and ROM, have one range set per BAR.
The range sets are now created when a PCI device is added and destroyed
when it is removed, so make them named and accounted.

Note that rangesets were chosen here despite each set containing only
up to 3 separate ranges (typically just 1): a rangeset per BAR keeps
the implementation simple and re-uses existing code.

This is in preparation of making non-identity mappings in p2m for the MMIOs.
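
The resulting shape of the mapping code, simplified from the hunks
below (pdev, d and map_range as in the patch), becomes a loop over the
per-BAR range sets instead of one pass over a single combined set:

    struct vpci_header *header = &pdev->vpci->header;
    struct map_data data = { .d = d, .map = true };
    unsigned int i;
    int rc = 0;

    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
    {
        struct vpci_bar *bar = &header->bars[i];

        if ( rangeset_is_empty(bar->mem) )
            continue;           /* nothing to (un)map for this BAR */

        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
        if ( rc )
            break;              /* including -ERESTART for continuation */
    }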

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Since v6:
- update according to the new locking scheme
- remove odd fail label in modify_bars
Since v5:
- fix comments
- move rangeset allocation to init_bars and only allocate
  for MAPPABLE BARs
- check for overlap with the already setup BAR ranges
Since v4:
- use named range sets for BARs (Jan)
- changes required by the new locking scheme
- updated commit message (Jan)
Since v3:
- re-work vpci_cancel_pending accordingly to the per-BAR handling
- s/num_mem_ranges/map_pending and s/uint8_t/bool
- ASSERT(bar->mem) in modify_bars
- create and destroy the rangesets on add/remove
---
 xen/drivers/vpci/header.c | 238 +++++++++++++++++++++++++++-----------
 xen/drivers/vpci/vpci.c   |   5 +
 xen/include/xen/vpci.h    |   3 +-
 3 files changed, 180 insertions(+), 66 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 8eebbf968b..8e722857d6 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -141,63 +141,102 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-    if ( v->vpci.mem )
+    struct pci_dev *pdev = v->vpci.pdev;
+
+    if ( !pdev )
+        return false;
+
+    read_lock(&v->domain->vpci_rwlock);
+
+    if ( v->vpci.map_pending )
     {
         struct map_data data = {
             .d = v->domain,
             .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
         };
-        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
-
-        if ( rc == -ERESTART )
-            return true;
-
-        read_lock(&v->domain->vpci_rwlock);
-        spin_lock(&v->vpci.pdev->vpci->lock);
-        /* Disable memory decoding unconditionally on failure. */
-        modify_decoding(v->vpci.pdev,
-                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
-                        !rc && v->vpci.rom_only);
-        spin_unlock(&v->vpci.pdev->vpci->lock);
-        read_unlock(&v->domain->vpci_rwlock);
-
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
-        if ( rc )
-            /*
-             * FIXME: in case of failure remove the device from the domain.
-             * Note that there might still be leftover mappings. While this is
-             * safe for Dom0, for DomUs the domain will likely need to be
-             * killed in order to avoid leaking stale p2m mappings on
-             * failure.
-             */
-            vpci_remove_device(v->vpci.pdev);
+        struct vpci_header *header = &pdev->vpci->header;
+        unsigned int i;
+
+        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        {
+            struct vpci_bar *bar = &header->bars[i];
+            int rc;
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_consume_ranges(bar->mem, map_range, &data);
+
+            if ( rc == -ERESTART )
+            {
+                read_unlock(&v->domain->vpci_rwlock);
+                return true;
+            }
+
+            spin_lock(&pdev->vpci->lock);
+            /* Disable memory decoding unconditionally on failure. */
+            modify_decoding(pdev, rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY :
+                                       v->vpci.cmd, !rc && v->vpci.rom_only);
+            spin_unlock(&pdev->vpci->lock);
+
+            if ( rc )
+            {
+                /*
+                 * FIXME: in case of failure remove the device from the domain.
+                 * Note that there might still be leftover mappings. While this
+                 * is safe for Dom0, for DomUs the domain needs to be killed in
+                 * order to avoid leaking stale p2m mappings on failure.
+                 */
+                v->vpci.map_pending = false;
+                read_unlock(&v->domain->vpci_rwlock);
+
+                if ( is_hardware_domain(v->domain) )
+                    vpci_remove_device(pdev);
+                else
+                    domain_crash(v->domain);
+
+                return false;
+            }
+        }
+
+        v->vpci.map_pending = false;
     }
 
+    read_unlock(&v->domain->vpci_rwlock);
+
     return false;
 }
 
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
-                            struct rangeset *mem, uint16_t cmd)
+                            uint16_t cmd)
 {
     struct map_data data = { .d = d, .map = true };
-    int rc;
+    struct vpci_header *header = &pdev->vpci->header;
+    int rc = 0;
+    unsigned int i;
 
     ASSERT(rw_is_write_locked(&d->vpci_rwlock));
 
-    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        /*
-         * It's safe to drop and reacquire the lock in this context
-         * without risking pdev disappearing because devices cannot be
-         * removed until the initial domain has been started.
-         */
-        write_unlock(&d->vpci_rwlock);
-        process_pending_softirqs();
-        write_lock(&d->vpci_rwlock);
-    }
+        struct vpci_bar *bar = &header->bars[i];
+
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
 
-    rangeset_destroy(mem);
+        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
+                                              &data)) == -ERESTART )
+        {
+            /*
+             * It's safe to drop and reacquire the lock in this context
+             * without risking pdev disappearing because devices cannot be
+             * removed until the initial domain has been started.
+             */
+            write_unlock(&d->vpci_rwlock);
+            process_pending_softirqs();
+            write_lock(&d->vpci_rwlock);
+        }
+    }
     if ( !rc )
         modify_decoding(pdev, cmd, false);
 
@@ -205,10 +244,12 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
 }
 
 static void defer_map(struct domain *d, struct pci_dev *pdev,
-                      struct rangeset *mem, uint16_t cmd, bool rom_only)
+                      uint16_t cmd, bool rom_only)
 {
     struct vcpu *curr = current;
 
+    ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     /*
      * FIXME: when deferring the {un}map the state of the device should not
      * be trusted. For example the enable bit is toggled after the device
@@ -216,7 +257,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
      * started for the same device if the domain is not well-behaved.
      */
     curr->vpci.pdev = pdev;
-    curr->vpci.mem = mem;
+    curr->vpci.map_pending = true;
     curr->vpci.cmd = cmd;
     curr->vpci.rom_only = rom_only;
     /*
@@ -231,30 +272,31 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
-    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
     const struct vpci_msix *msix = pdev->vpci->msix;
-    unsigned int i;
+    unsigned int i, j;
     int rc;
-
-    if ( !mem )
-        return -ENOMEM;
+    bool map_pending;
 
     /*
-     * Create a rangeset that represents the current device BARs memory region
-     * and compare it against all the currently active BAR memory regions. If
-     * an overlap is found, subtract it from the region to be mapped/unmapped.
+     * Create a rangeset per BAR that represents the current device memory
+     * region and compare it against all the currently active BAR memory
+     * regions. If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
      *
-     * First fill the rangeset with all the BARs of this device or with the ROM
+     * First fill the rangesets with the BARs of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        const struct vpci_bar *bar = &header->bars[i];
+        struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
+        if ( !bar->mem )
+            continue;
+
         if ( !MAPPABLE_BAR(bar) ||
              (rom_only ? bar->type != VPCI_BAR_ROM
                        : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
@@ -270,14 +312,31 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(mem, start, end);
+        rc = rangeset_add_range(bar->mem, start, end);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
                    start, end, rc);
-            rangeset_destroy(mem);
             return rc;
         }
+
+        /* Check for overlap with the already setup BAR ranges. */
+        for ( j = 0; j < i; j++ )
+        {
+            struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                printk(XENLOG_G_WARNING
+                       "Failed to remove overlapping range [%lx, %lx]: %d\n",
+                       start, end, rc);
+                return rc;
+            }
+        }
     }
 
     /* Remove any MSIX regions if present. */
@@ -287,14 +346,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
                                      vmsix_table_size(pdev->vpci, i) - 1);
 
-        rc = rangeset_remove_range(mem, start, end);
-        if ( rc )
+        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
         {
-            printk(XENLOG_G_WARNING
-                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
-                   start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            const struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                printk(XENLOG_G_WARNING
+                       "Failed to remove MSIX table [%lx, %lx]: %d\n",
+                       start, end, rc);
+                return rc;
+            }
         }
     }
 
@@ -327,7 +393,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             unsigned long start = PFN_DOWN(bar->addr);
             unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
-            if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
+            if ( !bar->enabled ||
+                 !rangeset_overlaps_range(bar->mem, start, end) ||
                  /*
                   * If only the ROM enable bit is toggled check against other
                   * BARs in the same device for overlaps, but not against the
@@ -336,12 +403,11 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                  (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
                 continue;
 
-            rc = rangeset_remove_range(mem, start, end);
+            rc = rangeset_remove_range(bar->mem, start, end);
             if ( rc )
             {
                 printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
                        start, end, rc);
-                rangeset_destroy(mem);
                 pcidevs_unlock();
                 return rc;
             }
@@ -362,10 +428,23 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
          * will always be to establish mappings and process all the BARs.
          */
         ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
-        return apply_map(pdev->domain, pdev, mem, cmd);
+        return apply_map(pdev->domain, pdev, cmd);
     }
 
-    defer_map(dev->domain, dev, mem, cmd, rom_only);
+    /* Find out how many memory ranges are left after MSI-X and overlap removal. */
+    map_pending = false;
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        if ( !rangeset_is_empty(header->bars[i].mem) )
+        {
+            map_pending = true;
+            break;
+        }
+
+    /* If there's no mapping work, write the command register now. */
+    if ( !map_pending )
+        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    else
+        defer_map(dev->domain, dev, cmd, rom_only);
 
     return 0;
 }
@@ -556,6 +635,19 @@ static void cf_check rom_write(
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static int bar_add_rangeset(struct pci_dev *pdev, struct vpci_bar *bar, int i)
+{
+    char str[32];
+
+    snprintf(str, sizeof(str), "%pp:BAR%d", &pdev->sbdf, i);
+
+    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
+    if ( !bar->mem )
+        return -ENOMEM;
+
+    return 0;
+}
+
 static int cf_check init_bars(struct pci_dev *pdev)
 {
     uint16_t cmd;
@@ -639,6 +731,13 @@ static int cf_check init_bars(struct pci_dev *pdev)
         else
             bars[i].type = VPCI_BAR_MEM32;
 
+        rc = bar_add_rangeset(pdev, &bars[i], i);
+        if ( rc )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            return rc;
+        }
+
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
@@ -690,6 +789,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
                                    rom_reg, 4, rom);
             if ( rc )
                 rom->type = VPCI_BAR_EMPTY;
+            else
+            {
+                rc = bar_add_rangeset(pdev, rom, i);
+                if ( rc )
+                {
+                    rom->type = VPCI_BAR_EMPTY;
+                    return rc;
+                }
+            }
 
             header->rom_reg = rom_reg;
         }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 4182c65eaa..b542ddaf7b 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -39,6 +39,7 @@ extern vpci_register_init_t *const __end_vpci_array[];
 void vpci_remove_device(struct pci_dev *pdev)
 {
     struct vpci *vpci;
+    unsigned int i;
 
     if ( !has_vpci(pdev->domain) )
         return;
@@ -73,6 +74,10 @@ void vpci_remove_device(struct pci_dev *pdev)
             if ( pdev->vpci->msix->table[i] )
                 iounmap(pdev->vpci->msix->table[i]);
     }
+
+    for ( i = 0; i < ARRAY_SIZE(vpci->header.bars); i++ )
+        rangeset_destroy(vpci->header.bars[i].mem);
+
     xfree(vpci->msix);
     xfree(vpci->msi);
     xfree(vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index e9170cc8ca..1e42a59c1d 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -75,6 +75,7 @@ struct vpci {
             /* Guest view of the BAR: address and lower bits. */
             uint64_t guest_reg;
             uint64_t size;
+            struct rangeset *mem;
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -162,9 +163,9 @@ struct vpci {
 
 struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
-    struct rangeset *mem;
     struct pci_dev *pdev;
     uint16_t cmd;
+    bool map_pending : 1;
     bool rom_only : 1;
 };
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v7 05/12] rangeset: add RANGESETF_no_print flag
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (5 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 08/12] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 07/12] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are range sets which should not be printed, so introduce a flag
which allows marking those as such. Implement relevant logic to skip
such entries while printing.

While at it also simplify the definition of the flags by directly
defining those without helpers.
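
As a usage example, a (hypothetical) caller that wants its range set
excluded from rangeset_domain_printk() output would simply pass the
new flag at creation time:

    struct rangeset *r = rangeset_new(d, "example", RANGESETF_no_print);

    if ( !r )
        return -ENOMEM;
    /* r behaves as any other range set; it is merely skipped when the
     * domain's range sets are printed. */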

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Since v5:
- comment indentation (Jan)
Since v1:
- update BUG_ON with new flag
- simplify the definition of the flags
---
 xen/common/rangeset.c      | 5 ++++-
 xen/include/xen/rangeset.h | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index a6ef264046..f8b909d016 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -433,7 +433,7 @@ struct rangeset *rangeset_new(
     INIT_LIST_HEAD(&r->range_list);
     r->nr_ranges = -1;
 
-    BUG_ON(flags & ~RANGESETF_prettyprint_hex);
+    BUG_ON(flags & ~(RANGESETF_prettyprint_hex | RANGESETF_no_print));
     r->flags = flags;
 
     safe_strcpy(r->name, name ?: "(no name)");
@@ -575,6 +575,9 @@ void rangeset_domain_printk(
 
     list_for_each_entry ( r, &d->rangesets, rangeset_list )
     {
+        if ( r->flags & RANGESETF_no_print )
+            continue;
+
         printk("    ");
         rangeset_printk(r);
         printk("\n");
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index 135f33f606..f7c69394d6 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -49,8 +49,9 @@ void rangeset_limit(
 
 /* Flags for passing to rangeset_new(). */
  /* Pretty-print range limits in hexadecimal. */
-#define _RANGESETF_prettyprint_hex 0
-#define RANGESETF_prettyprint_hex  (1U << _RANGESETF_prettyprint_hex)
+#define RANGESETF_prettyprint_hex   (1U << 0)
+ /* Do not print entries marked with this flag. */
+#define RANGESETF_no_print          (1U << 1)
 
 bool_t __must_check rangeset_is_empty(
     const struct rangeset *r);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v7 11/12] xen/arm: translate virtual PCI bus topology for guests
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (8 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 09/12] vpci/header: reset the command register when adding devices Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-21 12:01   ` Jan Beulich
  2023-06-13 10:32 ` [PATCH v7 10/12] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Volodymyr Babchuk, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are three originators of PCI configuration space accesses:
1. The domain that owns the physical host bridge: MMIO handlers are
in place so we can update the vPCI register handlers with the values
written by the hardware domain, i.e. the physical view of the
registers vs the guest's view of the configuration space.
2. Guest accesses to passed through PCI devices: we need to properly
map the virtual bus topology to the physical one, i.e. forward the
configuration space accesses to the corresponding physical devices.
3. Emulated host PCI bridge accesses. The bridge doesn't exist in the
physical topology, i.e. it can't be mapped to some physical host
bridge, so all accesses to the host bridge itself need to be trapped
and emulated.
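
At its core the translation for case 2 is a lookup of the physical
device by the guest (virtual) SBDF, sketched here without the locking
that the patch below adds:

    /* Simplified: guest SBDF -> physical SBDF. */
    for_each_pdev ( d, pdev )
        if ( pdev->vpci && pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf )
        {
            *sbdf = pdev->sbdf;     /* replace with the physical SBDF */
            return true;
        }

    return false;                   /* no such virtual device */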

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v6:
- add pcidevs locking to vpci_translate_virtual_device
- update wrt to the new locking scheme
Since v5:
- add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
  case to simplify ifdefery
- add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
- reset output register on failed virtual SBDF translation
Since v4:
- indentation fixes
- constify struct domain
- updated commit message
- updates to the new locking scheme (pdev->vpci_lock)
Since v3:
- revisit locking
- move code to vpci.c
Since v2:
 - pass struct domain instead of struct vcpu
 - constify arguments where possible
 - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/arch/arm/vpci.c     | 17 +++++++++++++++++
 xen/drivers/vpci/vpci.c | 38 ++++++++++++++++++++++++++++++++++++++
 xen/include/xen/vpci.h  |  7 +++++++
 3 files changed, 62 insertions(+)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 3bc4bb5508..d60913e532 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -32,6 +32,16 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
     /* data is needed to prevent a pointer cast on 32bit */
     unsigned long data;
 
+    /*
+     * For the passed through devices we need to map their virtual SBDF
+     * to the physical PCI device being passed through.
+     */
+    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
+    {
+        *r = ~0ul;
+        return 1;
+    }
+
     if ( vpci_ecam_read(sbdf, ECAM_REG_OFFSET(info->gpa),
                         1U << info->dabt.size, &data) )
     {
@@ -50,6 +60,13 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
     struct pci_host_bridge *bridge = p;
     pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
 
+    /*
+     * For the passed through devices we need to map their virtual SBDF
+     * to the physical PCI device being passed through.
+     */
+    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
+        return 1;
+
     return vpci_ecam_write(sbdf, ECAM_REG_OFFSET(info->gpa),
                            1U << info->dabt.size, r);
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 9dacb70bf3..490e3b14c7 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -180,6 +180,44 @@ static void vpci_remove_virtual_device(const struct pci_dev *pdev)
     write_unlock(&pdev->domain->vpci_rwlock);
 }
 
+/*
+ * Find the physical device which is mapped to the virtual device
+ * and translate virtual SBDF to the physical one.
+ */
+bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
+{
+    struct pci_dev *pdev;
+
+    ASSERT(!is_hardware_domain(d));
+
+    read_lock(&d->vpci_rwlock);
+    pcidevs_lock();
+    for_each_pdev( d, pdev )
+    {
+        bool found;
+
+        if ( !pdev->vpci )
+            continue;
+
+        spin_lock(&pdev->vpci->lock);
+        found = pdev->vpci && (pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf);
+        spin_unlock(&pdev->vpci->lock);
+
+        if ( found )
+        {
+            /* Replace guest SBDF with the physical one. */
+            *sbdf = pdev->sbdf;
+            pcidevs_unlock();
+            read_unlock(&d->vpci_rwlock);
+            return true;
+        }
+    }
+
+    pcidevs_unlock();
+    read_unlock(&d->vpci_rwlock);
+    return false;
+}
+
 /* Notify vPCI that device is assigned to guest. */
 int vpci_assign_device(struct pci_dev *pdev)
 {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 3cb5d63e84..0d085971cc 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -291,6 +291,7 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
 /* Notify vPCI that device is assigned/de-assigned to/from guest. */
 int vpci_assign_device(struct pci_dev *pdev);
 void vpci_deassign_device(struct pci_dev *pdev);
+bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf);
 #else
 static inline int vpci_assign_device(struct pci_dev *pdev)
 {
@@ -300,6 +301,12 @@ static inline int vpci_assign_device(struct pci_dev *pdev)
 static inline void vpci_deassign_device(struct pci_dev *pdev)
 {
 };
+
+static inline bool vpci_translate_virtual_device(struct domain *d,
+                                                 pci_sbdf_t *sbdf)
+{
+    return false;
+}
 #endif
 
 #endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v7 09/12] vpci/header: reset the command register when adding devices
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (7 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 07/12] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-13 10:32 ` [PATCH v7 11/12] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Reset the command register when assigning a PCI device to a guest:
according to the PCI spec the PCI_COMMAND register is typically all 0's
after reset, but this might not be achievable for the guest, as the
host's settings need to be respected.
For that reason, do not write 0 to the PCI_COMMAND register directly,
but go through the corresponding emulation layer (cmd_write), which
will take care of the actual bits written.
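
In other words, the whole reset amounts to a single call through the
emulation path (see the hunk below), so any bit the host insists on,
e.g. PCI_COMMAND_SERR, is re-applied before the value reaches the
hardware:

    /* A guest-visible "write 0"; cmd_write() fixes the value up
     * according to the host's settings before writing it out. */
    cmd_write(pdev, PCI_COMMAND, 0, header);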

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v6:
- use cmd_write directly without introducing emulate_cmd_reg
- update commit message with more description on all 0's in PCI_COMMAND
Since v5:
- updated commit message
Since v1:
 - do not write 0 to the command register, but respect host settings.
---
 xen/drivers/vpci/header.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 677b93226c..1021a61ed6 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -728,6 +728,10 @@ static int cf_check init_bars(struct pci_dev *pdev)
      */
     ASSERT(header->guest_cmd == 0);
 
+    /* Reset the command register for guests. */
+    if ( !is_hwdom )
+        cmd_write(pdev, PCI_COMMAND, 0, header);
+
     /* Setup a handler for the command register. */
     rc = vpci_add_register(pdev->vpci, cmd_read, cmd_write, PCI_COMMAND,
                            2, header);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v7 10/12] vpci: add initial support for virtual PCI bus topology
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (9 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 11/12] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-21 12:06   ` Jan Beulich
  2023-06-13 10:32 ` [PATCH v7 12/12] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
  2023-06-15  2:01 ` [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
  12 siblings, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné,
	Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini, Wei Liu

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Assign an SBDF with bus 0 to the PCI devices being passed through.
In the resulting topology the PCIe devices reside on bus 0 of the
root complex itself (embedded endpoints).
This implementation is limited to the 32 devices that are allowed on
a single PCI bus.

Please note that at the moment only function 0 of a multifunction
device can be passed through.
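
The slot allocation is then a plain bitmap search, with the devfn
encoding following the usual PCI layout (sketch based on the patch
below; error handling trimmed):

    /* VPCI_MAX_VIRT_DEV = PCI_SLOT(~0) + 1 = 31 + 1 = 32 slots. */
    unsigned long slot = find_first_zero_bit(d->vpci_dev_assigned_map,
                                             VPCI_MAX_VIRT_DEV);

    if ( slot >= VPCI_MAX_VIRT_DEV )
        return -ENOSPC;             /* virtual bus 0 is full */

    __set_bit(slot, d->vpci_dev_assigned_map);

    /* Segment and bus stay 0; devfn is slot << 3 | function. */
    sbdf.devfn = PCI_DEVFN(slot, 0);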

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v6:
- re-work wrt new locking scheme
Since v5:
- s/vpci_add_virtual_device/add_virtual_device and make it static
- call add_virtual_device from vpci_assign_device and do not use
  REGISTER_VPCI_INIT machinery
- add pcidevs_locked ASSERT
- use DECLARE_BITMAP for vpci_dev_assigned_map
Since v4:
- moved and re-worked guest sbdf initializers
- s/set_bit/__set_bit
- s/clear_bit/__clear_bit
- minor comment fix s/Virtual/Guest/
- added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
  later for counting the number of MMIO handlers required for a guest
  (Julien)
Since v3:
 - make use of VPCI_INIT
 - moved all new code to vpci.c which belongs to it
 - changed open-coded 31 to PCI_SLOT(~0)
 - added comments and code to reject multifunction devices with
   functions other than 0
 - updated comment about vpci_dev_next and made it unsigned int
 - implement roll back in case of error while assigning/deassigning devices
 - s/dom%pd/%pd
Since v2:
 - remove casts that are (a) malformed and (b) unnecessary
 - add new line for better readability
 - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
    functions are now completely gated with this config
 - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/drivers/vpci/vpci.c | 70 ++++++++++++++++++++++++++++++++++++++++-
 xen/include/xen/sched.h |  8 +++++
 xen/include/xen/vpci.h  | 11 +++++++
 3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index b542ddaf7b..9dacb70bf3 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -98,6 +98,9 @@ int vpci_add_handlers(struct pci_dev *pdev)
 
     INIT_LIST_HEAD(&vpci->handlers);
     spin_lock_init(&vpci->lock);
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    vpci->guest_sbdf.sbdf = ~0;
+#endif
 
     write_lock(&pdev->domain->vpci_rwlock);
 
@@ -121,6 +124,62 @@ int vpci_add_handlers(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+static int add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    pci_sbdf_t sbdf = { 0 };
+    unsigned long new_dev_number;
+
+    if ( is_hardware_domain(d) )
+        return 0;
+
+    ASSERT(pcidevs_locked());
+
+    /*
+     * Each PCI bus supports at most 32 devices/slots, or up to 256
+     * functions with multi-function devices, which are not yet supported.
+     */
+    if ( pdev->info.is_extfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+
+    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
+                                         VPCI_MAX_VIRT_DEV);
+    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
+        return -ENOSPC;
+
+    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, e.g. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
+    pdev->vpci->guest_sbdf = sbdf;
+
+    return 0;
+
+}
+
+static void vpci_remove_virtual_device(const struct pci_dev *pdev)
+{
+    write_lock(&pdev->domain->vpci_rwlock);
+    if ( pdev->vpci )
+    {
+        __clear_bit(pdev->vpci->guest_sbdf.dev,
+                    &pdev->domain->vpci_dev_assigned_map);
+        pdev->vpci->guest_sbdf.sbdf = ~0;
+    }
+    write_unlock(&pdev->domain->vpci_rwlock);
+}
+
 /* Notify vPCI that device is assigned to guest. */
 int vpci_assign_device(struct pci_dev *pdev)
 {
@@ -131,8 +190,16 @@ int vpci_assign_device(struct pci_dev *pdev)
 
     rc = vpci_add_handlers(pdev);
     if ( rc )
-        vpci_deassign_device(pdev);
+        goto fail;
+
+    rc = add_virtual_device(pdev);
+    if ( rc )
+        goto fail;
+
+    return 0;
 
+ fail:
+    vpci_deassign_device(pdev);
     return rc;
 }
 
@@ -142,6 +209,7 @@ void vpci_deassign_device(struct pci_dev *pdev)
     if ( !has_vpci(pdev->domain) )
         return;
 
+    vpci_remove_virtual_device(pdev);
     vpci_remove_device(pdev);
 }
 #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 78227b7a1d..312a692b31 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -463,6 +463,14 @@ struct domain
 #ifdef CONFIG_HAS_VPCI
     rwlock_t vpci_rwlock;
 #endif
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * The bitmap which shows which device numbers are already used by the
+     * virtual PCI bus topology and is used to assign a unique SBDF to the
+     * next passed through virtual PCI device.
+     */
+    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index fe100a8cb7..3cb5d63e84 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
 
+/*
+ * Maximum number of devices supported by the virtual bus topology:
+ * each PCI bus supports at most 32 devices/slots, or up to 256
+ * functions with multi-function devices, which are not yet supported.
+ */
+#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
+
 #define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
@@ -160,6 +167,10 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Guest SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v7 12/12] xen/arm: account IO handlers for emulated PCI MSI-X
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (10 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 10/12] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
@ 2023-06-13 10:32 ` Volodymyr Babchuk
  2023-06-15  2:01 ` [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
  12 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-13 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Bertrand Marquis, Volodymyr Babchuk,
	Julien Grall, Julien Grall, Stefano Stabellini

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

At the moment, we always allocate an extra 16 slots for IO handlers
(see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
MSI-X registers we need to explicitly state that we have additional IO
handlers, so those are accounted for.
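
As a quick sanity check of the arithmetic: a guest now gets 1 handler
for the single emulated host bridge's configuration space plus
VPCI_MAX_VIRT_DEV = 32 MSI-X handlers, i.e. 33 in total, which the 16
spare MAX_IO_HANDLER slots alone could not cover.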

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Acked-by: Julien Grall <jgrall@amazon.com>
---
Cc: Julien Grall <julien@xen.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
---
This actually moved here from the part 2 of the prep work for PCI
passthrough on Arm as it seems to be the proper place for it.

Since v5:
- optimize with IS_ENABLED(CONFIG_HAS_PCI_MSI) since VPCI_MAX_VIRT_DEV is
  defined unconditionally
New in v5
---
 xen/arch/arm/vpci.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index d60913e532..7499a1c925 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -122,6 +122,8 @@ static int vpci_get_num_handlers_cb(struct domain *d,
 
 unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
 {
+    unsigned int count;
+
     if ( !has_vpci(d) )
         return 0;
 
@@ -142,7 +144,17 @@ unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
      * For guests each host bridge requires one region to cover the
      * configuration space. At the moment, we only expose a single host bridge.
      */
-    return 1;
+    count = 1;
+
+    /*
+     * There's a single MSI-X MMIO handler that deals with both the PBA
+     * and the MSI-X table for each PCI device being passed through.
+     * Maximum number of emulated virtual devices is VPCI_MAX_VIRT_DEV.
+     */
+    if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
+        count += VPCI_MAX_VIRT_DEV;
+
+    return count;
 }
 
 /*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 00/12] PCI devices passthrough on Arm, part 3
  2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
                   ` (11 preceding siblings ...)
  2023-06-13 10:32 ` [PATCH v7 12/12] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
@ 2023-06-15  2:01 ` Stewart Hildebrand
  2023-06-15  9:39   ` Volodymyr Babchuk
  12 siblings, 1 reply; 38+ messages in thread
From: Stewart Hildebrand @ 2023-06-15  2:01 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Bertrand Marquis

On 6/13/23 06:32, Volodymyr Babchuk wrote:
> Hello,
> 
> This is another version of vPCI rework (the previous one can be
> found at [1]). The biggest change is how vPCI locking is done. This
> series uses per-domain vPCI rwlock.
> 
> Note that this series does not include my work on reference counting
> for PCI devices because this counting does not resolve issues we are
> having for vPCI. While it is (maybe) nice to have PCI refcounting, it
> does not move us towards PCI on ARM.
> 
> 
> [1] https://lore.kernel.org/all/20220204063459.680961-1-andr2000@gmail.com/

Thanks for sending this!

Should this be v8? I see v7 at [2].

I had to rewind my xen.git back to 67c28bfc5245 for this series to apply cleanly (just before ee045f3a4a6d "vpci/header: cope with devices not having vpci allocated").

[2] https://lists.xenproject.org/archives/html/xen-devel/2022-07/msg01127.html


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 00/12] PCI devices passthrough on Arm, part 3
  2023-06-15  2:01 ` [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
@ 2023-06-15  9:39   ` Volodymyr Babchuk
  2023-06-15 12:16     ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-15  9:39 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Bertrand Marquis


Hi Stewart,

Stewart Hildebrand <stewart.hildebrand@amd.com> writes:

> On 6/13/23 06:32, Volodymyr Babchuk wrote:
>> Hello,
>> 
>> This is another version of vPCI rework (the previous one can be
>> found at [1]). The biggest change is how vPCI locking is done. This
>> series uses per-domain vPCI rwlock.
>> 
>> Note that this series does not include my work on reference counting
>> for PCI devices because this counting does not resolve issues we are
>> having for vPCI. While it is (maybe) nice to have PCI refcounting, it
>> does not move us towards PCI on ARM.
>> 
>> 
>> [1] https://lore.kernel.org/all/20220204063459.680961-1-andr2000@gmail.com/
>
> Thanks for sending this!
>
> Should this be v8? I see v7 at [2].

Oops, my bad. 

> I had to rewind my xen.git back to 67c28bfc5245 for this series to apply cleanly (just before ee045f3a4a6d "vpci/header: cope with devices not having vpci allocated").

I rebased this series onto staging about two weeks ago. Looks like
there were new changes to the PCI code after that.

Should I send a new, real v8 which is rebased onto current staging, or
should we wait for review of the current set of patches?


-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 00/12] PCI devices passthrough on Arm, part 3
  2023-06-15  9:39   ` Volodymyr Babchuk
@ 2023-06-15 12:16     ` Jan Beulich
  2023-06-21 22:11       ` Volodymyr Babchuk
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-06-15 12:16 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Bertrand Marquis, Stewart Hildebrand

On 15.06.2023 11:39, Volodymyr Babchuk wrote:
> Stewart Hildebrand <stewart.hildebrand@amd.com> writes:
>> On 6/13/23 06:32, Volodymyr Babchuk wrote:
>>> Hello,
>>>
>>> This is another version of vPCI rework (the previous one can be
>>> found at [1]). The biggest change is how vPCI locking is done. This
>>> series uses per-domain vPCI rwlock.
>>>
>>> Note that this series does not include my work on reference counting
>>> for PCI devices because this counting does not resolve issues we are
>>> having for vPCI. While it is (maybe) nice to have PCI refcounting, it
>>> does not move us towards PCI on ARM.
>>>
>>>
>>> [1] https://lore.kernel.org/all/20220204063459.680961-1-andr2000@gmail.com/
>>
>> Thanks for sending this!
>>
>> Should this be v8? I see v7 at [2].
> 
> Oops, my bad. 
> 
>> I had to rewind my xen.git back to 67c28bfc5245 for this series to apply cleanly (just before ee045f3a4a6d "vpci/header: cope with devices not having vpci allocated").
> 
> I rebased this series onto staging about two weeks ago. Looks like
> there were new changes to the PCI code after that.
> 
> Should I send a new, real v8 which is rebased onto current staging, or
> should we wait for review of the current set of patches?

Please send a version which, at least at the time of posting, actually
applies. Taking into account Stewart's observation on the version
number makes it even more desirable to have a re-post.

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-13 10:32 ` [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure Volodymyr Babchuk
@ 2023-06-16 16:30   ` Roger Pau Monné
  2023-06-21 22:07     ` Volodymyr Babchuk
  2023-06-20 20:17   ` Stewart Hildebrand
  1 sibling, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-16 16:30 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant

On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Introduce a per-domain read/write lock to check whether vpci is present,
> so we are sure there are no accesses to the contents of the vpci struct
> if not. This lock can be used (and in a few cases is used right away)
> so that vpci removal can be performed while holding the lock in write
> mode. Previously such removal could race with vpci_read for example.
> 
> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if done
> under the read lock, requires vpci->lock to be acquired on both devices
> being compared, which may produce a deadlock. It is not possible to
> upgrade read lock to write lock in such a case. So, in order to prevent
> the deadlock, check which registers are going to be written and acquire
> the lock in the appropriate mode from the beginning.
> 
> All other code, which doesn't lead to pdev->vpci destruction and does not
> access multiple pdevs at the same time, can still use a combination of the
> read lock and pdev->vpci->lock.
> 
> 3. Optimize if ROM BAR write lock required detection by caching offset
> of the ROM BAR register in vpci->header->rom_reg which depends on
> header's type.
> 
> 4. Reduce locked region in vpci_remove_device as it is now possible
> to set pdev->vpci to NULL early right after the write lock is acquired.
> 
> 5. Reduce locked region in vpci_add_handlers as it is possible to
> initialize many more fields of the struct vpci before assigning it to
> pdev->vpci.
> 
> 6. vpci_{add|remove}_register are required to be called with the write lock
> held, but it is not feasible to add an assert there as it requires
> struct domain to be passed for that. So, add a comment about this requirement
> to these and other functions with the equivalent constraints.
> 
> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
> 
> 8. Do not call process_pending_softirqs with any locks held. For that unlock
> prior the call and re-acquire the locks after. After re-acquiring the
> lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made
> 
> 9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock
> and if so, allow reading or writing the hardware register directly. This is
> acceptable as we only deal with Dom0 as of now. Once DomU support is
> added the write will need to be ignored and read return all 0's for the
> guests, while Dom0 can still access the registers directly.
> 
> 10. Introduce pcidevs_trylock, so there is a possibility to try locking
> the pcidevs lock.
> 
> 11. Use the pcidevs lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> 12. This is based on the discussion at [1].
> 
> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Thanks.

I haven't looked in full detail, but I'm afraid there's an ABBA
deadlock with the per-domain vpci lock and the pcidevs lock in
modify_bars() vs vpci_add_handlers() and vpci_remove_device().

I've made some comments below.

> ---
> This was checked on x86: with and without PVH Dom0.
> ---
>  xen/arch/x86/hvm/vmsi.c       |   7 +++
>  xen/common/domain.c           |   3 +
>  xen/drivers/passthrough/pci.c |   5 ++
>  xen/drivers/vpci/header.c     |  52 +++++++++++++++++
>  xen/drivers/vpci/msi.c        |  25 +++++++-
>  xen/drivers/vpci/msix.c       |  51 +++++++++++++---
>  xen/drivers/vpci/vpci.c       | 107 +++++++++++++++++++++++++---------
>  xen/include/xen/pci.h         |   1 +
>  xen/include/xen/sched.h       |   3 +
>  xen/include/xen/vpci.h        |   6 ++
>  10 files changed, 223 insertions(+), 37 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 3cd4923060..f188450b1b 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -895,6 +895,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  {
>      unsigned int i;
>  
> +    ASSERT(rw_is_locked(&msix->pdev->domain->vpci_rwlock));
> +    ASSERT(pcidevs_locked());
> +
>      for ( i = 0; i < msix->max_entries; i++ )
>      {
>          const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +916,11 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>  
>              spin_unlock(&msix->pdev->vpci->lock);
> +            pcidevs_unlock();
> +            read_unlock(&pdev->domain->vpci_rwlock);
>              process_pending_softirqs();
> +            read_lock(&pdev->domain->vpci_rwlock);
> +            pcidevs_lock();

Why do you need the pcidevs_lock here?  Isn't it enough to have the
per-domain rwlock taken in read mode in order to prevent devices from
being de-assigned from the domain?

>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 6a440590fe..a4640acb62 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -622,6 +622,9 @@ struct domain *domain_create(domid_t domid,
>  
>  #ifdef CONFIG_HAS_PCI
>      INIT_LIST_HEAD(&d->pdev_list);
> +#ifdef CONFIG_HAS_VPCI
> +    rwlock_init(&d->vpci_rwlock);
> +#endif
>  #endif
>  
>      /* All error paths can depend on the above setup. */
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 07d1986d33..0afcb59af0 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>      spin_lock_recursive(&_pcidevs_lock);
>  }
>  
> +int pcidevs_trylock(void)
> +{
> +    return spin_trylock_recursive(&_pcidevs_lock);
> +}
> +
>  void pcidevs_unlock(void)
>  {
>      spin_unlock_recursive(&_pcidevs_lock);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index ec2e978a4e..23b5223adf 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -152,12 +152,14 @@ bool vpci_process_pending(struct vcpu *v)
>          if ( rc == -ERESTART )
>              return true;
>  
> +        read_lock(&v->domain->vpci_rwlock);
>          spin_lock(&v->vpci.pdev->vpci->lock);
>          /* Disable memory decoding unconditionally on failure. */
>          modify_decoding(v->vpci.pdev,
>                          rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
>                          !rc && v->vpci.rom_only);
>          spin_unlock(&v->vpci.pdev->vpci->lock);
> +        read_unlock(&v->domain->vpci_rwlock);

I think you likely want to expand the scope of the domain vpci lock in
order to also protect the data in v->vpci, so that vPCI device removal
can clear this data if required without racing with
vpci_process_pending(). Otherwise you need to pause the domain when
removing vPCI.
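
Something along these lines, i.e. taking the read lock before the
first v->vpci dereference and dropping it only once done with that
state (just a sketch of the idea, not a tested change):

    bool vpci_process_pending(struct vcpu *v)
    {
        bool restart = false;

        read_lock(&v->domain->vpci_rwlock);

        if ( v->vpci.mem )
        {
            /* ... consume ranges, modify decoding, clear v->vpci ... */
        }

        read_unlock(&v->domain->vpci_rwlock);

        return restart;
    }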

>  
>          rangeset_destroy(v->vpci.mem);
>          v->vpci.mem = NULL;
> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      struct map_data data = { .d = d, .map = true };
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&d->vpci_rwlock));
> +
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    {
> +        /*
> +         * It's safe to drop and reacquire the lock in this context
> +         * without risking pdev disappearing because devices cannot be
> +         * removed until the initial domain has been started.
> +         */
> +        write_unlock(&d->vpci_rwlock);
>          process_pending_softirqs();
> +        write_lock(&d->vpci_rwlock);
> +    }
> +
>      rangeset_destroy(mem);
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>      raise_softirq(SCHEDULE_SOFTIRQ);
>  }
>  
> +/* This must hold domain's vpci_rwlock in write mode. */

Why must it be in write mode?

Is this to avoid ABBA deadlocks as not taking the per-domain lock in
write mode would then force us to take each pdev->vpci->lock in order
to prevent concurrent modifications of remote devices?

>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>  {
>      struct vpci_header *header = &pdev->vpci->header;
> @@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>       * Check for overlaps with other BARs. Note that only BARs that are
>       * currently mapped (enabled) are checked for overlaps.
>       */
> +    pcidevs_lock();

This can be problematic I'm afraid, as it's a guest-controlled path
that allows applying unrestricted contention on the pcidevs lock.  I'm
unsure whether multiple guests would be able to trigger the watchdog
if given enough devices/vcpus to perform enough simultaneous calls to
modify_bars().

Maybe you could expand the per-domain vpci lock to also protect
modifications of domain->pdev_list?

IOW: kind of a per-domain pdev_lock.
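
I.e. something like (sketch; pdev_lock being a new, hypothetical
per-domain rwlock, taken in write mode by device assignment and
de-assignment):

    /* In modify_bars(), instead of the global pcidevs lock: */
    read_lock(&pdev->domain->pdev_lock);
    for_each_pdev ( pdev->domain, tmp )
    {
        /* ... overlap checks as before ... */
    }
    read_unlock(&pdev->domain->pdev_lock);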

>      for_each_pdev ( pdev->domain, tmp )
>      {
>          if ( tmp == pdev )
> @@ -326,10 +342,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>                  printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>                         start, end, rc);
>                  rangeset_destroy(mem);
> +                pcidevs_unlock();
>                  return rc;
>              }
>          }
>      }
> +    pcidevs_unlock();
>  
>      ASSERT(dev);
>  
> @@ -482,6 +500,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_bar *bars = header->bars;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> @@ -570,6 +590,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          }
>      }
>  
> +    ASSERT(!header->rom_reg);
> +
>      /* Check expansion ROM. */
>      rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
>      if ( rc > 0 && size )
> @@ -586,12 +608,42 @@ static int cf_check init_bars(struct pci_dev *pdev)
>                                 4, rom);
>          if ( rc )
>              rom->type = VPCI_BAR_EMPTY;
> +
> +        header->rom_reg = rom_reg;
>      }
>  
>      return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
>  }
>  REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>  
> +static bool overlap(unsigned int r1_offset, unsigned int r1_size,
> +                    unsigned int r2_offset, unsigned int r2_size)
> +{
> +    /* Return true if there is an overlap. */
> +    return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size;

Hm, we already have a vpci_register_cmp(), which does a very similar
comparison.  I wonder if it would be possible to somehow use that
here.
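
It would need to stop being static to vpci.c, but then something like
the below should work (untested):

    static bool overlap(unsigned int r1_offset, unsigned int r1_size,
                        unsigned int r2_offset, unsigned int r2_size)
    {
        const struct vpci_register r1 = { .offset = r1_offset,
                                          .size = r1_size };
        const struct vpci_register r2 = { .offset = r2_offset,
                                          .size = r2_size };

        /* vpci_register_cmp() returns 0 iff the two regions overlap. */
        return !vpci_register_cmp(&r1, &r2);
    }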

> +}
> +
> +bool vpci_header_need_write_lock(const struct pci_dev *pdev,
> +                                 unsigned int start, unsigned int size)
> +{
> +    /*
> +     * Writing the command register and ROM BAR register may trigger
> +     * modify_bars to run, which in turn may access multiple pdevs while
> +     * checking for the existing BAR's overlap. The overlapping check, if done
> +     * under the read lock, requires vpci->lock to be acquired on both devices
> +     * being compared, which may produce a deadlock. At the same time it is not
> +     * possible to upgrade read lock to write lock in such a case.
> +     * Check which registers are going to be written and return true if lock
> +     * needs to be acquired in write mode.
> +     */
> +    if ( overlap(start, size, PCI_COMMAND, 2) ||
> +         (pdev->vpci->header.rom_reg &&
> +          overlap(start, size, pdev->vpci->header.rom_reg, 4)) )
> +        return true;
> +
> +    return false;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 8f2b59e61a..e7d1f916a0 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>      uint16_t control;
>      int ret;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
> +
>      if ( !pos )
>          return 0;
>  
> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>  
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>  
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -277,6 +279,15 @@ void vpci_dump_msi(void)
>  
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>  
> +        if ( !read_trylock(&d->vpci_rwlock) )
> +            continue;
> +
> +        if ( !pcidevs_trylock() )
> +        {
> +            read_unlock(&d->vpci_rwlock);
> +            continue;
> +        }
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -318,14 +329,22 @@ void vpci_dump_msi(void)
>                       * holding the lock.
>                       */
>                      printk("unable to print all MSI-X entries: %d\n", rc);
> -                    process_pending_softirqs();
> -                    continue;
> +                    goto pdev_done;
>                  }
>              }
>  
>              spin_unlock(&pdev->vpci->lock);
> + pdev_done:
> +            pcidevs_unlock();
> +            read_unlock(&d->vpci_rwlock);
> +
>              process_pending_softirqs();
> +
> +            read_lock(&d->vpci_rwlock);
> +            pcidevs_lock();
>          }
> +        pcidevs_unlock();
> +        read_unlock(&d->vpci_rwlock);
>      }
>      rcu_read_unlock(&domlist_read_lock);
>  }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index 25bde77586..b5a7dfdf9c 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -143,6 +143,7 @@ static void cf_check control_write(
>          pci_conf_write16(pdev->sbdf, reg, val);
>  }
>  
> +/* This must hold domain's vpci_rwlock in write mode. */
>  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
> @@ -163,7 +164,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  
>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    int rc;
> +
> +    read_lock(&v->domain->vpci_rwlock);
> +    rc = !!msix_find(v->domain, addr);
> +    read_unlock(&v->domain->vpci_rwlock);
> +
> +    return rc;
>  }
>  
>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +365,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_read(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
>      *data = ~0ul;
>  
> +    read_lock(&d->vpci_rwlock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->vpci_rwlock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_read(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_read(d, msix, addr, len, data);
> +        read_unlock(&d->vpci_rwlock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->vpci_rwlock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -404,6 +424,7 @@ static int cf_check msix_read(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->vpci_rwlock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -491,19 +512,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_write(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
> +    read_lock(&d->vpci_rwlock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->vpci_rwlock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_write(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_write(d, msix, addr, len, data);
> +        read_unlock(&d->vpci_rwlock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->vpci_rwlock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -579,6 +613,7 @@ static int cf_check msix_write(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->vpci_rwlock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -665,6 +700,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>      struct vpci_msix *msix;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
> +
>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>                                        PCI_CAP_ID_MSIX);
>      if ( !msix_offset )
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 652807a4a4..1270174e78 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
>  
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> -    if ( !has_vpci(pdev->domain) || !pdev->vpci )
> +    struct vpci *vpci;
> +
> +    if ( !has_vpci(pdev->domain) )
>          return;
>  
> -    spin_lock(&pdev->vpci->lock);
> +    write_lock(&pdev->domain->vpci_rwlock);
> +    if ( !pdev->vpci )
> +    {
> +        write_unlock(&pdev->domain->vpci_rwlock);
> +        return;
> +    }
> +
> +    vpci = pdev->vpci;
> +    pdev->vpci = NULL;
> +    write_unlock(&pdev->domain->vpci_rwlock);
> +
>      while ( !list_empty(&pdev->vpci->handlers) )
>      {
> -        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> +        struct vpci_register *r = list_first_entry(&vpci->handlers,
>                                                     struct vpci_register,
>                                                     node);
>  
>          list_del(&r->node);
>          xfree(r);
>      }
> -    spin_unlock(&pdev->vpci->lock);
> +
>      if ( pdev->vpci->msix )
>      {
>          unsigned int i;
> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>              if ( pdev->vpci->msix->table[i] )
>                  iounmap(pdev->vpci->msix->table[i]);
>      }
> -    xfree(pdev->vpci->msix);
> -    xfree(pdev->vpci->msi);
> -    xfree(pdev->vpci);
> -    pdev->vpci = NULL;
> +    xfree(vpci->msix);
> +    xfree(vpci->msi);
> +    xfree(vpci);
>  }
>  
>  int vpci_add_handlers(struct pci_dev *pdev)
>  {
> +    struct vpci *vpci;
>      unsigned int i;
>      int rc = 0;
>  
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>  
> +    vpci = xzalloc(struct vpci);
> +    if ( !vpci )
> +        return -ENOMEM;
> +
> +    INIT_LIST_HEAD(&vpci->handlers);
> +    spin_lock_init(&vpci->lock);
> +
> +    write_lock(&pdev->domain->vpci_rwlock);

I think the usage of the vpci per-domain lock here and in
vpci_remove_device() create an ABBA deadlock situation with the usage
of it in modify_bars().

Both vpci_add_handlers() and vpci_remove_device() get called with the
pcidevs lock taken, and take the per-domain vpci lock afterwards,
while modify_bars() does the inverse locking, gets called with the
per-domain vpci lock taken and then take the pcidevs lock inside the
function.
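
I.e. the two paths take the locks in opposite order:

    vpci_add_handlers() /
    vpci_remove_device():               modify_bars():
        pcidevs_lock();                     write_lock(&d->vpci_rwlock);
        write_lock(&d->vpci_rwlock);        pcidevs_lock();

which can deadlock as soon as both run concurrently for the same
domain.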

> +
>      /* We should not get here twice for the same device. */
>      ASSERT(!pdev->vpci);
>  
> -    pdev->vpci = xzalloc(struct vpci);
> -    if ( !pdev->vpci )
> -        return -ENOMEM;
> -
> -    INIT_LIST_HEAD(&pdev->vpci->handlers);
> -    spin_lock_init(&pdev->vpci->lock);
> +    pdev->vpci = vpci;
>  
>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
>      {
> @@ -91,6 +107,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
>          if ( rc )
>              break;
>      }
> +    write_unlock(&pdev->domain->vpci_rwlock);
>  
>      if ( rc )
>          vpci_remove_device(pdev);
> @@ -139,6 +156,7 @@ uint32_t cf_check vpci_hw_read32(
>      return pci_conf_read32(pdev->sbdf, reg);
>  }
>  
> +/* This must hold domain's vpci_rwlock in write mode. */
>  int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>                        vpci_write_t *write_handler, unsigned int offset,
>                        unsigned int size, void *data)
> @@ -162,8 +180,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>      r->offset = offset;
>      r->private = data;
>  
> -    spin_lock(&vpci->lock);

If you remove the lock here we need to assert that the per-domain vpci
lock is taken in write mode.
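
I.e. something like the below, which would however require a way to
get at the domain from here (e.g. passing the struct pci_dev or the
struct domain down):

    ASSERT(rw_is_write_locked(&d->vpci_rwlock));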

> -
>      /* The list of handlers must be kept sorted at all times. */
>      list_for_each ( prev, &vpci->handlers )
>      {
> @@ -175,25 +191,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>              break;
>          if ( cmp == 0 )
>          {
> -            spin_unlock(&vpci->lock);
>              xfree(r);
>              return -EEXIST;
>          }
>      }
>  
>      list_add_tail(&r->node, prev);
> -    spin_unlock(&vpci->lock);
>  
>      return 0;
>  }
>  
> +/* This must hold domain's vpci_rwlock in write mode. */
>  int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>                           unsigned int size)
>  {
>      const struct vpci_register r = { .offset = offset, .size = size };
>      struct vpci_register *rm;
>  
> -    spin_lock(&vpci->lock);
>      list_for_each_entry ( rm, &vpci->handlers, node )
>      {
>          int cmp = vpci_register_cmp(&r, rm);
> @@ -205,14 +219,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>          if ( !cmp && rm->offset == offset && rm->size == size )
>          {
>              list_del(&rm->node);
> -            spin_unlock(&vpci->lock);
>              xfree(rm);
>              return 0;
>          }
>          if ( cmp <= 0 )
>              break;
>      }
> -    spin_unlock(&vpci->lock);

Same here about the per-domain lock being taken.

>  
>      return -ENOENT;
>  }
> @@ -320,7 +332,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -333,10 +345,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>      }
>  
>      /* Find the PCI dev matching the address. */
> +    pcidevs_lock();
>      pdev = pci_get_pdev(d, sbdf);
> -    if ( !pdev || !pdev->vpci )
> +    pcidevs_unlock();

I think it's worth looking into expanding the per-domain vpci lock to
a pdevs lock or similar in order to also protect the pdev_list if
possible.  So that we can avoid taking the pcidevs lock here.
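
So that this could end up as (sketch, assuming the expanded lock also
protects d->pdev_list; iirc the hardware fallback here is
vpci_read_hw()):

    /* Find the PCI dev matching the address. */
    read_lock(&d->vpci_rwlock);
    pdev = pci_get_pdev(d, sbdf);
    if ( !pdev || !pdev->vpci )
    {
        read_unlock(&d->vpci_rwlock);
        return vpci_read_hw(sbdf, reg, size);
    }
    /* ... rest of the handler, with the lock held until done ... */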

Thanks, Roger.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-13 10:32 ` [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure Volodymyr Babchuk
  2023-06-16 16:30   ` Roger Pau Monné
@ 2023-06-20 20:17   ` Stewart Hildebrand
  2023-06-22 21:18     ` Volodymyr Babchuk
  1 sibling, 1 reply; 38+ messages in thread
From: Stewart Hildebrand @ 2023-06-20 20:17 UTC (permalink / raw)
  To: Volodymyr Babchuk, xen-devel
  Cc: Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant

On 6/13/23 06:32, Volodymyr Babchuk wrote:

...

> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 652807a4a4..1270174e78 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
> 
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> -    if ( !has_vpci(pdev->domain) || !pdev->vpci )
> +    struct vpci *vpci;
> +
> +    if ( !has_vpci(pdev->domain) )
>          return;
> 
> -    spin_lock(&pdev->vpci->lock);
> +    write_lock(&pdev->domain->vpci_rwlock);
> +    if ( !pdev->vpci )
> +    {
> +        write_unlock(&pdev->domain->vpci_rwlock);
> +        return;
> +    }
> +
> +    vpci = pdev->vpci;
> +    pdev->vpci = NULL;

Here we set pdev->vpci to NULL, yet there are still a few uses
remaining after this in vpci_remove_device. I suspect they were missed
during a s/pdev->vpci/vpci/ search and replace.

> +    write_unlock(&pdev->domain->vpci_rwlock);
> +
>      while ( !list_empty(&pdev->vpci->handlers) )

pdev->vpci dereferenced here

>      {
> -        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> +        struct vpci_register *r = list_first_entry(&vpci->handlers,
>                                                     struct vpci_register,
>                                                     node);
> 
>          list_del(&r->node);
>          xfree(r);
>      }
> -    spin_unlock(&pdev->vpci->lock);
> +
>      if ( pdev->vpci->msix )

pdev->vpci dereferenced here

>      {
>          unsigned int i;
> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>              if ( pdev->vpci->msix->table[i] )

pdev->vpci dereferenced here, and two more above not shown in the diff context

>                  iounmap(pdev->vpci->msix->table[i]);

pdev->vpci dereferenced here

>      }
> -    xfree(pdev->vpci->msix);
> -    xfree(pdev->vpci->msi);
> -    xfree(pdev->vpci);
> -    pdev->vpci = NULL;
> +    xfree(vpci->msix);
> +    xfree(vpci->msi);
> +    xfree(vpci);
>  }
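
FWIW, with those switched over the tail of the function would read
something like (sketch; context lines from memory):

    while ( !list_empty(&vpci->handlers) )
    {
        struct vpci_register *r = list_first_entry(&vpci->handlers,
                                                   struct vpci_register,
                                                   node);

        list_del(&r->node);
        xfree(r);
    }

    if ( vpci->msix )
    {
        unsigned int i;

        for ( i = 0; i < ARRAY_SIZE(vpci->msix->table); i++ )
            if ( vpci->msix->table[i] )
                iounmap(vpci->msix->table[i]);
    }
    xfree(vpci->msix);
    xfree(vpci->msi);
    xfree(vpci);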


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 11/12] xen/arm: translate virtual PCI bus topology for guests
  2023-06-13 10:32 ` [PATCH v7 11/12] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
@ 2023-06-21 12:01   ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-06-21 12:01 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Roger Pau Monné,
	xen-devel

On 13.06.2023 12:32, Volodymyr Babchuk wrote:
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -180,6 +180,44 @@ static void vpci_remove_virtual_device(const struct pci_dev *pdev)
>      write_unlock(&pdev->domain->vpci_rwlock);
>  }
>  
> +/*
> + * Find the physical device which is mapped to the virtual device
> + * and translate virtual SBDF to the physical one.
> + */
> +bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
> +{
> +    struct pci_dev *pdev;
> +
> +    ASSERT(!is_hardware_domain(d));
> +
> +    read_lock(&d->vpci_rwlock);
> +    pcidevs_lock();
> +    for_each_pdev( d, pdev )
> +    {
> +        bool found;
> +
> +        if ( !pdev->vpci )
> +            continue;
> +
> +        spin_lock(&pdev->vpci->lock);

Following the description of patch 1, why does this lock need acquiring here?
The r/w lock acquired above should already guard against modification.
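
I.e. (leaving aside the lock dropping issue mentioned below) the loop
body could afaict simply be:

    for_each_pdev ( d, pdev )
    {
        if ( pdev->vpci && pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf )
        {
            /* Replace guest SBDF with the physical one. */
            *sbdf = pdev->sbdf;
            pcidevs_unlock();
            read_unlock(&d->vpci_rwlock);
            return true;
        }
    }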

> +        found = pdev->vpci && (pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf);

The LHS of the && is pointless here - when acquiring the lock you have
already de-referenced pdev->vpci.

> +        spin_unlock(&pdev->vpci->lock);
> +
> +        if ( found )
> +        {
> +            /* Replace guest SBDF with the physical one. */
> +            *sbdf = pdev->sbdf;
> +            pcidevs_unlock();
> +            read_unlock(&d->vpci_rwlock);
> +            return true;

What use is the result to the caller with all locks dropped, and hence
state possibly having changed before the caller gets a chance to look
at the result? If there are reasons why this is okay, you want to (a)
spell out those reasons in the description and (b) document this
restriction in a comment, to warn people who may want to (re)use this
function. But really I think the caller needs to hold a suitable lock
until after the result from here was consumed.

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 10/12] vpci: add initial support for virtual PCI bus topology
  2023-06-13 10:32 ` [PATCH v7 10/12] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
@ 2023-06-21 12:06   ` Jan Beulich
  2023-07-07 16:45     ` Oleksandr Tyshchenko
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-06-21 12:06 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Roger Pau Monné,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, xen-devel

On 13.06.2023 12:32, Volodymyr Babchuk wrote:
> @@ -121,6 +124,62 @@ int vpci_add_handlers(struct pci_dev *pdev)
>  }
>  
>  #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    pci_sbdf_t sbdf = { 0 };
> +    unsigned long new_dev_number;
> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(pcidevs_locked());
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +
> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> +                                         VPCI_MAX_VIRT_DEV);
> +    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
> +        return -ENOSPC;
> +
> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);

Since the find-and-set can't easily be atomic, the lock used here
(asserted to be held above) needs to be the same as ...

> +    /*
> +     * Both segment and bus number are 0:
> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
> +     *  - with bus 0 the virtual devices are seen as embedded
> +     *    endpoints behind the root complex
> +     *
> +     * TODO: add support for multi-function devices.
> +     */
> +    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
> +    pdev->vpci->guest_sbdf = sbdf;
> +
> +    return 0;
> +
> +}
> +
> +static void vpci_remove_virtual_device(const struct pci_dev *pdev)
> +{
> +    write_lock(&pdev->domain->vpci_rwlock);
> +    if ( pdev->vpci )
> +    {
> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
> +                    &pdev->domain->vpci_dev_assigned_map);
> +        pdev->vpci->guest_sbdf.sbdf = ~0;
> +    }
> +    write_unlock(&pdev->domain->vpci_rwlock);

... the one used here.
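
I.e. whichever lock is chosen, it needs to be the same one on both
paths; e.g. with the per-domain vpci lock (sketch):

    /* add_virtual_device() */
    write_lock(&d->vpci_rwlock);
    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
                                         VPCI_MAX_VIRT_DEV);
    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
    {
        write_unlock(&d->vpci_rwlock);
        return -ENOSPC;
    }
    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
    /* ... */
    write_unlock(&d->vpci_rwlock);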

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-16 16:30   ` Roger Pau Monné
@ 2023-06-21 22:07     ` Volodymyr Babchuk
  2023-06-22  8:15       ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-21 22:07 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant


Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> Introduce a per-domain read/write lock to check whether vpci is present,
>> so we are sure there are no accesses to the contents of the vpci struct
>> if not. This lock can be used (and in a few cases is used right away)
>> so that vpci removal can be performed while holding the lock in write
>> mode. Previously such removal could race with vpci_read for example.
>> 
>> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure
>> from being removed.
>> 
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if done
>> under the read lock, requires vpci->lock to be acquired on both devices
>> being compared, which may produce a deadlock. It is not possible to
>> upgrade read lock to write lock in such a case. So, in order to prevent
>> the deadlock, check which registers are going to be written and acquire
>> the lock in the appropriate mode from the beginning.
>> 
>> All other code, which doesn't lead to pdev->vpci destruction and does not
>> access multiple pdevs at the same time, can still use a combination of the
>> read lock and pdev->vpci->lock.
>> 
>> 3. Optimize if ROM BAR write lock required detection by caching offset
>> of the ROM BAR register in vpci->header->rom_reg which depends on
>> header's type.
>> 
>> 4. Reduce locked region in vpci_remove_device as it is now possible
>> to set pdev->vpci to NULL early right after the write lock is acquired.
>> 
>> 5. Reduce locked region in vpci_add_handlers as it is possible to
>> initialize many more fields of the struct vpci before assigning it to
>> pdev->vpci.
>> 
>> 6. vpci_{add|remove}_register are required to be called with the write lock
>> held, but it is not feasible to add an assert there as it requires
>> struct domain to be passed for that. So, add a comment about this requirement
>> to these and other functions with the equivalent constraints.
>> 
>> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
>> 
>> 8. Do not call process_pending_softirqs with any locks held. For that unlock
>> prior the call and re-acquire the locks after. After re-acquiring the
>> lock there is no need to check if pdev->vpci exists:
>>  - in apply_map because of the context it is called (no race condition
>>    possible)
>>  - for MSI/MSI-X debug code because it is called at the end of
>>    pdev->vpci access and no further access to pdev->vpci is made
>> 
>> 9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock
>> and if so, allow reading or writing the hardware register directly. This is
>> acceptable as we only deal with Dom0 as of now. Once DomU support is
>> added the write will need to be ignored and read return all 0's for the
>> guests, while Dom0 can still access the registers directly.
>> 
>> 10. Introduce pcidevs_trylock, so there is a possibility to try locking
>> the pcidev's lock.
>> 
>> 11. Use pcidev's lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>> 
>> 12. This is based on the discussion at [1].
>> 
>> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/
>> 
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>
> Thanks.
>
> I haven't looked in full detail, but I'm afraid there's an ABBA
> deadlock with the per-domain vpci lock and the pcidevs lock in
> modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
>
> I've made some comments below.

Thank you for the review. I believe that it is a good idea to have a
per-domain pdev_list lock. See my answers below.


>> ---
>> This was checked on x86: with and without PVH Dom0.
>> ---
>>  xen/arch/x86/hvm/vmsi.c       |   7 +++
>>  xen/common/domain.c           |   3 +
>>  xen/drivers/passthrough/pci.c |   5 ++
>>  xen/drivers/vpci/header.c     |  52 +++++++++++++++++
>>  xen/drivers/vpci/msi.c        |  25 +++++++-
>>  xen/drivers/vpci/msix.c       |  51 +++++++++++++---
>>  xen/drivers/vpci/vpci.c       | 107 +++++++++++++++++++++++++---------
>>  xen/include/xen/pci.h         |   1 +
>>  xen/include/xen/sched.h       |   3 +
>>  xen/include/xen/vpci.h        |   6 ++
>>  10 files changed, 223 insertions(+), 37 deletions(-)
>> 
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 3cd4923060..f188450b1b 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -895,6 +895,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>  {
>>      unsigned int i;
>>  
>> +    ASSERT(rw_is_locked(&msix->pdev->domain->vpci_rwlock));
>> +    ASSERT(pcidevs_locked());
>> +
>>      for ( i = 0; i < msix->max_entries; i++ )
>>      {
>>          const struct vpci_msix_entry *entry = &msix->entries[i];
>> @@ -913,7 +916,11 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>              struct pci_dev *pdev = msix->pdev;
>>  
>>              spin_unlock(&msix->pdev->vpci->lock);
>> +            pcidevs_unlock();
>> +            read_unlock(&pdev->domain->vpci_rwlock);
>>              process_pending_softirqs();
>> +            read_lock(&pdev->domain->vpci_rwlock);
>> +            pcidevs_lock();
>
> Why do you need the pcidevs_lock here?  Isn't it enough to have the
> per-domain rwlock taken in read mode in order to prevent devices from
> being de-assigned from the domain?
>

Yes, looks like you are right. I'll remove pcidevs_lock().


>>              /* NB: we assume that pdev cannot go away for an alive domain. */
>>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>>                  return -EBUSY;
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index 6a440590fe..a4640acb62 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -622,6 +622,9 @@ struct domain *domain_create(domid_t domid,
>>  
>>  #ifdef CONFIG_HAS_PCI
>>      INIT_LIST_HEAD(&d->pdev_list);
>> +#ifdef CONFIG_HAS_VPCI
>> +    rwlock_init(&d->vpci_rwlock);
>> +#endif
>>  #endif
>>  
>>      /* All error paths can depend on the above setup. */
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 07d1986d33..0afcb59af0 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>>      spin_lock_recursive(&_pcidevs_lock);
>>  }
>>  
>> +int pcidevs_trylock(void)
>> +{
>> +    return spin_trylock_recursive(&_pcidevs_lock);
>> +}
>> +
>>  void pcidevs_unlock(void)
>>  {
>>      spin_unlock_recursive(&_pcidevs_lock);
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index ec2e978a4e..23b5223adf 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -152,12 +152,14 @@ bool vpci_process_pending(struct vcpu *v)
>>          if ( rc == -ERESTART )
>>              return true;
>>  
>> +        read_lock(&v->domain->vpci_rwlock);
>>          spin_lock(&v->vpci.pdev->vpci->lock);
>>          /* Disable memory decoding unconditionally on failure. */
>>          modify_decoding(v->vpci.pdev,
>>                          rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
>>                          !rc && v->vpci.rom_only);
>>          spin_unlock(&v->vpci.pdev->vpci->lock);
>> +        read_unlock(&v->domain->vpci_rwlock);
>
> I think you likely want to expand the scope of the domain vpci lock in
> order to also protect the data in v->vpci?  So that vPCI device
> removal can clear this data if required without racing with
> vpci_process_pending().  Otherwise you need to pause the domain when
> removing vPCI.

Yes, you are right.

>
>>  
>>          rangeset_destroy(v->vpci.mem);
>>          v->vpci.mem = NULL;
>> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>      struct map_data data = { .d = d, .map = true };
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&d->vpci_rwlock));
>> +
>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> +    {
>> +        /*
>> +         * It's safe to drop and reacquire the lock in this context
>> +         * without risking pdev disappearing because devices cannot be
>> +         * removed until the initial domain has been started.
>> +         */
>> +        write_unlock(&d->vpci_rwlock);
>>          process_pending_softirqs();
>> +        write_lock(&d->vpci_rwlock);
>> +    }
>> +
>>      rangeset_destroy(mem);
>>      if ( !rc )
>>          modify_decoding(pdev, cmd, false);
>> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>>      raise_softirq(SCHEDULE_SOFTIRQ);
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>
> Why must it be in write mode?
>
> Is this to avoid ABBA deadlocks as not taking the per-domain lock in
> write mode would then force us to take each pdev->vpci->lock in order
> to prevent concurrent modifications of remote devices?

Yes, exactly this. This is mentioned in the commit message:

    2. Writing the command register and ROM BAR register may trigger
    modify_bars to run, which in turn may access multiple pdevs while
    checking for the existing BAR's overlap. The overlapping check, if done
    under the read lock, requires vpci->lock to be acquired on both devices
    being compared, which may produce a deadlock. It is not possible to
    upgrade read lock to write lock in such a case. So, in order to prevent
    the deadlock, check which registers are going to be written and acquire
    the lock in the appropriate mode from the beginning.

>
>>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>  {
>>      struct vpci_header *header = &pdev->vpci->header;
>> @@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>       * Check for overlaps with other BARs. Note that only BARs that are
>>       * currently mapped (enabled) are checked for overlaps.
>>       */
>> +    pcidevs_lock();
>
> This can be problematic I'm afraid, as it's a guest-controlled path
> that allows applying unrestricted contention on the pcidevs lock.  I'm
> unsure whether multiple guests would be able to trigger the watchdog
> if given enough devices/vcpus to perform enough simultaneous calls to
> modify_bars().
>
> Maybe you could expand the per-domain vpci lock to also protect
> modifications of domain->pdev_list?
>
> IOW: kind of a per-domain pdev_lock.

This might work, but I am not sure if we need to expand the vpci lock.
Maybe it is better to add another lock that protects domain->pdev_list?
I'm afraid that a combined lock would be confusing, as it would protect
two semi-related entities.
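
I was thinking about something like this (sketch, naming to be
discussed):

    /* xen/include/xen/sched.h */
    #ifdef CONFIG_HAS_PCI
        struct list_head pdev_list;
        /*
         * Protects pdev_list. Device assignment/de-assignment takes
         * it in write mode, pdev_list walkers such as modify_bars()
         * take it in read mode.
         */
        rwlock_t pdev_list_lock;
    #endif

Then modify_bars() can take pdev_list_lock in read mode for the
for_each_pdev() loop instead of the global pcidevs lock.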

>
>>      for_each_pdev ( pdev->domain, tmp )
>>      {
>>          if ( tmp == pdev )
>> @@ -326,10 +342,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>                  printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>                         start, end, rc);
>>                  rangeset_destroy(mem);
>> +                pcidevs_unlock();
>>                  return rc;
>>              }
>>          }
>>      }
>> +    pcidevs_unlock();
>>  
>>      ASSERT(dev);
>>  
>> @@ -482,6 +500,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>      struct vpci_bar *bars = header->bars;
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>      {
>>      case PCI_HEADER_TYPE_NORMAL:
>> @@ -570,6 +590,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>          }
>>      }
>>  
>> +    ASSERT(!header->rom_reg);
>> +
>>      /* Check expansion ROM. */
>>      rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
>>      if ( rc > 0 && size )
>> @@ -586,12 +608,42 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>                                 4, rom);
>>          if ( rc )
>>              rom->type = VPCI_BAR_EMPTY;
>> +
>> +        header->rom_reg = rom_reg;
>>      }
>>  
>>      return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
>>  }
>>  REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>  
>> +static bool overlap(unsigned int r1_offset, unsigned int r1_size,
>> +                    unsigned int r2_offset, unsigned int r2_size)
>> +{
>> +    /* Return true if there is an overlap. */
>> +    return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size;
>
> Hm, we already have a vpci_register_cmp(), which does a very similar
> comparison.  I wonder if it would be possible to somehow use that
> here.
>

Yeah, I'll revisit this part.

>> +}
>> +
>> +bool vpci_header_need_write_lock(const struct pci_dev *pdev,
>> +                                 unsigned int start, unsigned int size)
>> +{
>> +    /*
>> +     * Writing the command register and ROM BAR register may trigger
>> +     * modify_bars to run, which in turn may access multiple pdevs while
>> +     * checking for the existing BAR's overlap. The overlapping check, if done
>> +     * under the read lock, requires vpci->lock to be acquired on both devices
>> +     * being compared, which may produce a deadlock. At the same time it is not
>> +     * possible to upgrade read lock to write lock in such a case.
>> +     * Check which registers are going to be written and return true if lock
>> +     * needs to be acquired in write mode.
>> +     */
>> +    if ( overlap(start, size, PCI_COMMAND, 2) ||
>> +         (pdev->vpci->header.rom_reg &&
>> +          overlap(start, size, pdev->vpci->header.rom_reg, 4)) )
>> +        return true;
>> +
>> +    return false;
>> +}
>> +
>>  /*
>>   * Local variables:
>>   * mode: C
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 8f2b59e61a..e7d1f916a0 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>>      uint16_t control;
>>      int ret;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>>      if ( !pos )
>>          return 0;
>>  
>> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>  
>>  void vpci_dump_msi(void)
>>  {
>> -    const struct domain *d;
>> +    struct domain *d;
>>  
>>      rcu_read_lock(&domlist_read_lock);
>>      for_each_domain ( d )
>> @@ -277,6 +279,15 @@ void vpci_dump_msi(void)
>>  
>>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>  
>> +        if ( !read_trylock(&d->vpci_rwlock) )
>> +            continue;
>> +
>> +        if ( !pcidevs_trylock() )
>> +        {
>> +            read_unlock(&d->vpci_rwlock);
>> +            continue;
>> +        }
>> +
>>          for_each_pdev ( d, pdev )
>>          {
>>              const struct vpci_msi *msi;
>> @@ -318,14 +329,22 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */
>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>  
>>              spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> +            pcidevs_unlock();
>> +            read_unlock(&d->vpci_rwlock);
>> +
>>              process_pending_softirqs();
>> +
>> +            read_lock(&d->vpci_rwlock);
>> +            pcidevs_lock();
>>          }
>> +        pcidevs_unlock();
>> +        read_unlock(&d->vpci_rwlock);
>>      }
>>      rcu_read_unlock(&domlist_read_lock);
>>  }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index 25bde77586..b5a7dfdf9c 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -143,6 +143,7 @@ static void cf_check control_write(
>>          pci_conf_write16(pdev->sbdf, reg, val);
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>>  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  {
>>      struct vpci_msix *msix;
>> @@ -163,7 +164,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  
>>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>>  {
>> -    return !!msix_find(v->domain, addr);
>> +    int rc;
>> +
>> +    read_lock(&v->domain->vpci_rwlock);
>> +    rc = !!msix_find(v->domain, addr);
>> +    read_unlock(&v->domain->vpci_rwlock);
>> +
>> +    return rc;
>>  }
>>  
>>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> @@ -358,21 +365,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_read(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      const struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>>      *data = ~0ul;
>>  
>> +    read_lock(&d->vpci_rwlock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_read(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_read(d, msix, addr, len, data);
>> +        read_unlock(&d->vpci_rwlock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -404,6 +424,7 @@ static int cf_check msix_read(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->vpci_rwlock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -491,19 +512,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_write(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>> +    read_lock(&d->vpci_rwlock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_write(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_write(d, msix, addr, len, data);
>> +        read_unlock(&d->vpci_rwlock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -579,6 +613,7 @@ static int cf_check msix_write(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->vpci_rwlock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -665,6 +700,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>>      struct vpci_msix *msix;
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>>                                        PCI_CAP_ID_MSIX);
>>      if ( !msix_offset )
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 652807a4a4..1270174e78 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>  
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> -    if ( !has_vpci(pdev->domain) || !pdev->vpci )
>> +    struct vpci *vpci;
>> +
>> +    if ( !has_vpci(pdev->domain) )
>>          return;
>>  
>> -    spin_lock(&pdev->vpci->lock);
>> +    write_lock(&pdev->domain->vpci_rwlock);
>> +    if ( !pdev->vpci )
>> +    {
>> +        write_unlock(&pdev->domain->vpci_rwlock);
>> +        return;
>> +    }
>> +
>> +    vpci = pdev->vpci;
>> +    pdev->vpci = NULL;
>> +    write_unlock(&pdev->domain->vpci_rwlock);
>> +
>>      while ( !list_empty(&pdev->vpci->handlers) )
>>      {
>> -        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
>> +        struct vpci_register *r = list_first_entry(&vpci->handlers,
>>                                                     struct vpci_register,
>>                                                     node);
>>  
>>          list_del(&r->node);
>>          xfree(r);
>>      }
>> -    spin_unlock(&pdev->vpci->lock);
>> +
>>      if ( pdev->vpci->msix )
>>      {
>>          unsigned int i;
>> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>>              if ( pdev->vpci->msix->table[i] )
>>                  iounmap(pdev->vpci->msix->table[i]);
>>      }
>> -    xfree(pdev->vpci->msix);
>> -    xfree(pdev->vpci->msi);
>> -    xfree(pdev->vpci);
>> -    pdev->vpci = NULL;
>> +    xfree(vpci->msix);
>> +    xfree(vpci->msi);
>> +    xfree(vpci);
>>  }
>>  
>>  int vpci_add_handlers(struct pci_dev *pdev)
>>  {
>> +    struct vpci *vpci;
>>      unsigned int i;
>>      int rc = 0;
>>  
>>      if ( !has_vpci(pdev->domain) )
>>          return 0;
>>  
>> +    vpci = xzalloc(struct vpci);
>> +    if ( !vpci )
>> +        return -ENOMEM;
>> +
>> +    INIT_LIST_HEAD(&vpci->handlers);
>> +    spin_lock_init(&vpci->lock);
>> +
>> +    write_lock(&pdev->domain->vpci_rwlock);
>
> I think the usage of the vpci per-domain lock here and in
> vpci_remove_device() create an ABBA deadlock situation with the usage
> of it in modify_bars().
>
> Both vpci_add_handlers() and vpci_remove_device() get called with the
> pcidevs lock taken, and take the per-domain vpci lock afterwards,
> while modify_bars() does the inverse locking, gets called with the
> per-domain vpci lock taken and then take the pcidevs lock inside the
> function.

You suggested using a separate per-domain pdev_list lock in
modify_bars. Looks like this can help with the ABBA deadlock.

>
>> +
>>      /* We should not get here twice for the same device. */
>>      ASSERT(!pdev->vpci);
>>  
>> -    pdev->vpci = xzalloc(struct vpci);
>> -    if ( !pdev->vpci )
>> -        return -ENOMEM;
>> -
>> -    INIT_LIST_HEAD(&pdev->vpci->handlers);
>> -    spin_lock_init(&pdev->vpci->lock);
>> +    pdev->vpci = vpci;
>>  
>>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
>>      {
>> @@ -91,6 +107,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>          if ( rc )
>>              break;
>>      }
>> +    write_unlock(&pdev->domain->vpci_rwlock);
>>  
>>      if ( rc )
>>          vpci_remove_device(pdev);
>> @@ -139,6 +156,7 @@ uint32_t cf_check vpci_hw_read32(
>>      return pci_conf_read32(pdev->sbdf, reg);
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>>  int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>                        vpci_write_t *write_handler, unsigned int offset,
>>                        unsigned int size, void *data)
>> @@ -162,8 +180,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>      r->offset = offset;
>>      r->private = data;
>>  
>> -    spin_lock(&vpci->lock);
>
> If you remove the lock here we need to assert that the per-domain vpci
> lock is taken in write mode.

Yep, an assert will be better than the comment above.

>
>> -
>>      /* The list of handlers must be kept sorted at all times. */
>>      list_for_each ( prev, &vpci->handlers )
>>      {
>> @@ -175,25 +191,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>              break;
>>          if ( cmp == 0 )
>>          {
>> -            spin_unlock(&vpci->lock);
>>              xfree(r);
>>              return -EEXIST;
>>          }
>>      }
>>  
>>      list_add_tail(&r->node, prev);
>> -    spin_unlock(&vpci->lock);
>>  
>>      return 0;
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>>  int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>>                           unsigned int size)
>>  {
>>      const struct vpci_register r = { .offset = offset, .size = size };
>>      struct vpci_register *rm;
>>  
>> -    spin_lock(&vpci->lock);
>>      list_for_each_entry ( rm, &vpci->handlers, node )
>>      {
>>          int cmp = vpci_register_cmp(&r, rm);
>> @@ -205,14 +219,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>>          if ( !cmp && rm->offset == offset && rm->size == size )
>>          {
>>              list_del(&rm->node);
>> -            spin_unlock(&vpci->lock);
>>              xfree(rm);
>>              return 0;
>>          }
>>          if ( cmp <= 0 )
>>              break;
>>      }
>> -    spin_unlock(&vpci->lock);
>
> Same here about the per-domain lock being taken.
>
>>  
>>      return -ENOENT;
>>  }
>> @@ -320,7 +332,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>  
>>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>> @@ -333,10 +345,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>      }
>>  
>>      /* Find the PCI dev matching the address. */
>> +    pcidevs_lock();
>>      pdev = pci_get_pdev(d, sbdf);
>> -    if ( !pdev || !pdev->vpci )
>> +    pcidevs_unlock();
>
> I think it's worth looking into expanding the per-domain vpci lock to
> a pdevs lock or similar in order to also protect the pdev_list if
> possible.  So that we can avoid taking the pcidevs lock here.

Agreed


-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 00/12] PCI devices passthrough on Arm, part 3
  2023-06-15 12:16     ` Jan Beulich
@ 2023-06-21 22:11       ` Volodymyr Babchuk
  2023-06-22  7:48         ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-21 22:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Bertrand Marquis, Stewart Hildebrand


Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 15.06.2023 11:39, Volodymyr Babchuk wrote:
>> Stewart Hildebrand <stewart.hildebrand@amd.com> writes:
>>> On 6/13/23 06:32, Volodymyr Babchuk wrote:
>>>> Hello,
>>>>
>>>> This is another version of vPCI rework (previous one can be
>>>> found at [1]). The biggest change is how vPCI locking is done. This
>>>> series uses per-domain vPCI rwlock.
>>>>
>>>> Note that this series does not include my work on reference counting
>>>> for PCI devices because this counting does not resolve issues we are
>>>> having for vPCI. While it is (maybe) nice to have PCI refcounting, it
>>>> does not move us towards PCI on ARM.
>>>>
>>>>
>>>> [1]
>>>> https://lore.kernel.org/all/20220204063459.680961-1-andr2000@gmail.com/
>>>
>>> Thanks for sending this!
>>>
>>> Should this be v8? I see v7 at [2].
>> 
>> Oops, my bad. 
>> 
>>> I had to rewind my xen.git back to 67c28bfc5245 for this series to apply cleanly (just before ee045f3a4a6d "vpci/header: cope with devices not having vpci allocated").
>> 
>> I rebased this series onto staging about two weeks ago. Looks like
>> there were new changes to the PCI code after that.
>> 
>> Should I send a new, real v8 which is rebased onto current staging, or
>> should we wait for review of the current set of patches?
>
> Please send a version which, at least at the time of posting, actually
> applies. Taking into account Stewart's observation on the version
> number makes it even more desirable to have a re-post.

I am terribly sorry about the version mishmash. But Roger has already
made valuable comments on the first patch.

So I'll post the updated version with an additional lock and other
fixes. Should it be v8 or v9 in that case?

-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 00/12] PCI devices passthrough on Arm, part 3
  2023-06-21 22:11       ` Volodymyr Babchuk
@ 2023-06-22  7:48         ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-06-22  7:48 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Bertrand Marquis, Stewart Hildebrand

On 22.06.2023 00:11, Volodymyr Babchuk wrote:
> Jan Beulich <jbeulich@suse.com> writes:
>> On 15.06.2023 11:39, Volodymyr Babchuk wrote:
>>> Stewart Hildebrand <stewart.hildebrand@amd.com> writes:
>>>> On 6/13/23 06:32, Volodymyr Babchuk wrote:
>>>>> Hello,
>>>>>
>>>>> This is another version of vPCI rework (previous one can be
>>>>> found at [1]). The biggest change is how vPCI locking is done. This
>>>>> series uses per-domain vPCI rwlock.
>>>>>
>>>>> Note that this series does not include my work on reference counting
>>>>> for PCI devices because this counting does not resolve issues we are
>>>>> having for vPCI. While it is (maybe) nice to have PCI refcounting, it
>>>>> does not move us towards PCI on ARM.
>>>>>
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/all/20220204063459.680961-1-andr2000@gmail.com/
>>>>
>>>> Thanks for sending this!
>>>>
>>>> Should this be v8? I see v7 at [2].
>>>
>>> Oops, my bad. 
>>>
>>>> I had to rewind my xen.git back to 67c28bfc5245 for this series to apply cleanly (just before ee045f3a4a6d "vpci/header: cope with devices not having vpci allocated").
>>>
>>> I rebased this series onto staging about two weeks ago. Looks like
>>> there were new changes to the PCI code after that.
>>>
>>> Should I send a new, real v8 which is rebased onto current staging, or
>>> should we wait for review of the current set of patches?
>>
>> Please send a version which, at least at the time of posting, actually
>> applies. Taking into account Stewart's observation on the version
>> number makes it even more desirable to have a re-post.
> 
> I am terribly sorry about the version mishmash. But Roger has already
> made valuable comments on the first patch.
> 
> So I'll post the updated version with an additional lock and other
> fixes. Should it be v8 or v9 in that case?

I don't think that matters, unless you have v8 changelog entries already
which were posted with the 2nd v7. Otherwise all that's relevant is that
the version be clearly distinguishable from earlier ones.

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-21 22:07     ` Volodymyr Babchuk
@ 2023-06-22  8:15       ` Roger Pau Monné
  2023-06-22 21:17         ` Volodymyr Babchuk
  0 siblings, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-22  8:15 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant

On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> 
> >> Introduce a per-domain read/write lock to check whether vpci is present,
> >> so we are sure there are no accesses to the contents of the vpci struct
> >> if not. This lock can be used (and in a few cases is used right away)
> >> so that vpci removal can be performed while holding the lock in write
> >> mode. Previously such removal could race with vpci_read for example.
> >> 
> >> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure
> >> from being removed.
> >> 
> >> 2. Writing the command register and ROM BAR register may trigger
> >> modify_bars to run, which in turn may access multiple pdevs while
> >> checking for the existing BAR's overlap. The overlapping check, if done
> >> under the read lock, requires vpci->lock to be acquired on both devices
> >> being compared, which may produce a deadlock. It is not possible to
> >> upgrade read lock to write lock in such a case. So, in order to prevent
> >> the deadlock, check which registers are going to be written and acquire
> >> the lock in the appropriate mode from the beginning.
> >> 
> >> All other code, which doesn't lead to pdev->vpci destruction and does not
> >> access multiple pdevs at the same time, can still use a combination of the
> >> read lock and pdev->vpci->lock.
> >> 
> >> 3. Optimize the detection of whether a ROM BAR write requires the write
> >> lock by caching the offset of the ROM BAR register in
> >> vpci->header->rom_reg, which depends on the header's type.
> >> 
> >> 4. Reduce locked region in vpci_remove_device as it is now possible
> >> to set pdev->vpci to NULL early right after the write lock is acquired.
> >> 
> >> 5. Reduce locked region in vpci_add_handlers as it is possible to
> >> initialize many more fields of the struct vpci before assigning it to
> >> pdev->vpci.
> >> 
> >> 6. vpci_{add|remove}_register are required to be called with the write lock
> >> held, but it is not feasible to add an assert there as it requires
> >> struct domain to be passed for that. So, add a comment about this requirement
> >> to these and other functions with the equivalent constraints.
> >> 
> >> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
> >> 
> >> 8. Do not call process_pending_softirqs with any locks held. For that,
> >> unlock prior to the call and re-acquire the locks after. After re-acquiring the
> >> lock there is no need to check if pdev->vpci exists:
> >>  - in apply_map because of the context it is called (no race condition
> >>    possible)
> >>  - for MSI/MSI-X debug code because it is called at the end of
> >>    pdev->vpci access and no further access to pdev->vpci is made
> >> 
> >> 9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock
> >> and if so, allow reading or writing the hardware register directly. This is
> >> acceptable as we only deal with Dom0 as of now. Once DomU support is
> >> added, writes will need to be ignored and reads will return all 0s for
> >> guests, while Dom0 can still access the registers directly.
> >> 
> >> 10. Introduce pcidevs_trylock, so there is a possibility to try locking
> >> the pcidevs lock.
> >> 
> >> 11. Use the pcidevs lock around for_each_pdev and pci_get_pdev_by_domain
> >> while accessing pdevs in vpci code.
> >> 
> >> 12. This is based on the discussion at [1].
> >> 
> >> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/
> >> 
> >> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> >> Suggested-by: Jan Beulich <jbeulich@suse.com>
> >> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >
> > Thanks.
> >
> > I haven't looked in full detail, but I'm afraid there's an ABBA
> > deadlock with the per-domain vpci lock and the pcidevs lock in
> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
> >
> > I've made some comments below.
> 
> Thank you for the review. I believe that it is a good idea to have a
> per-domain pdev_list lock. See my answers below.

I think it's important that the lock that protects domain->pdev_list
must be the same that also protects pdev->vpci, or else you might run
into similar ABBA deadlock situations.

The problem then could be that in vpci_{read,write} you will take the
per-domain pdev lock in read mode in order to get the pdev, and for
writes to the command register or the ROM BAR you would have to
upgrade such lock to a write lock without dropping it, and we don't
have such functionality for rw locks ATM.

Maybe just re-starting the function knowing that the lock must be
taken in write mode would be a good solution: writes to the command
register will already be slow since they are likely to involve
modifications to the p2m.

> >> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
> >>      raise_softirq(SCHEDULE_SOFTIRQ);
> >>  }
> >>  
> >> +/* This must hold domain's vpci_rwlock in write mode. */
> >
> > Why it must be in write mode?
> >
> > Is this to avoid ABBA deadlocks as not taking the per-domain lock in
> > write mode would then force us to take each pdev->vpci->lock in order
> > to prevent concurrent modifications of remote devices?
> 
> Yes, exactly this. This is mentioned in the commit message:
> 
>     2. Writing the command register and ROM BAR register may trigger
>     modify_bars to run, which in turn may access multiple pdevs while
>     checking for the existing BAR's overlap. The overlapping check, if done
>     under the read lock, requires vpci->lock to be acquired on both devices
>     being compared, which may produce a deadlock. It is not possible to
>     upgrade read lock to write lock in such a case. So, in order to prevent
>     the deadlock, check which registers are going to be written and acquire
>     the lock in the appropriate mode from the beginning.

Might be good to put part of the commit message in the code comment;
just saying that the lock must be taken in write mode is not very
informative.

/*
 * The <lock-name> per domain lock must be taken in write mode in
 * order to prevent changes to the vPCI data of devices assigned to
 * the domain, since the function parses such data.
 */

> 
> >
> >>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>  {
> >>      struct vpci_header *header = &pdev->vpci->header;
> >> @@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>       * Check for overlaps with other BARs. Note that only BARs that are
> >>       * currently mapped (enabled) are checked for overlaps.
> >>       */
> >> +    pcidevs_lock();
> >
> > This can be problematic I'm afraid, as it's a guest controlled path
> > that allows applying unrestricted contention on the pcidevs lock.  I'm
> > unsure whether multiple guests could be able to trigger the watchdog
> > if given enough devices/vcpus to be able to perform enough
> > simultaneous calls to modify_bars().
> >
> > Maybe you could expand the per-domain vpci lock to also protect
> > modifications of domain->pdev_list?
> >
> > IOW: kind of a per-domain pdev_lock.
> 
> This might work, but I am not sure if we need to expand the vpci lock.
> Maybe it is better to add another lock that protects domain->pdev_list?
> I am afraid that a combined lock will be confusing, as it protects two
> semi-related entities.

I think expanding is the best solution; otherwise you are likely to
run into the same ABBA deadlock situations.

Thanks, Roger.



* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-22  8:15       ` Roger Pau Monné
@ 2023-06-22 21:17         ` Volodymyr Babchuk
  2023-06-23  8:50           ` Roger Pau Monné
  0 siblings, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-22 21:17 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant


Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
>> 
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> >> [...]
>> >
>> > Thanks.
>> >
>> > I haven't looked in full detail, but I'm afraid there's an ABBA
>> > deadlock with the per-domain vpci lock and the pcidevs lock in
>> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
>> >
>> > I've made some comments below.
>> 
>> Thank you for the review. I believe that it is a good idea to have a
>> per-domain pdev_list lock. See my answers below.
>
> I think it's important that the lock that protects domain->pdev_list
> must be the same that also protects pdev->vpci, or else you might run
> into similar ABBA deadlock situations.
>
> The problem then could be that in vpci_{read,write} you will take the
> per-domain pdev lock in read mode in order to get the pdev, and for
> writes to the command register or the ROM BAR you would have to
> upgrade such lock to a write lock without dropping it, and we don't
> have such functionality for rw locks ATM.
>
> Maybe just re-starting the function knowing that the lock must be
> taken in write mode would be a good solution: writes to the command
> register will already be slow since they are likely to involve
> modifications to the p2m.

Looks like modify_bars() is the only cause for this extended lock. I
know that this was discussed earlier, but can we rework modify_bars to
not iterate over all the pdevs? We can store a copy of all enabled BARs in
a domain structure, protected by domain->vpci_lock. Something akin to

struct vpci_bar {
        struct list_head list;
        struct vpci *vpci;
        unsigned long start;
        unsigned long end;
        bool is_rom;
};
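
Then the overlap check in modify_bars() would walk this flat list
instead of iterating over all pdevs - a rough sketch, assuming a
(hypothetical) d->vpci_bars list holding the entries above, protected
by d->vpci_lock:

static bool bar_overlaps(const struct domain *d, const struct vpci *vpci,
                         unsigned long start, unsigned long end)
{
    const struct vpci_bar *bar;

    list_for_each_entry ( bar, &d->vpci_bars, list )
    {
        /* Skip BARs of the device currently being modified. */
        if ( bar->vpci == vpci )
            continue;
        if ( start <= bar->end && bar->start <= end )
            return true;
    }

    return false;
}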


>> >> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>> >>      raise_softirq(SCHEDULE_SOFTIRQ);
>> >>  }
>> >>  
>> >> +/* This must hold domain's vpci_rwlock in write mode. */
>> >
>> > Why it must be in write mode?
>> >
>> > Is this to avoid ABBA deadlocks as not taking the per-domain lock in
>> > write mode would then force us to take each pdev->vpci->lock in order
>> > to prevent concurrent modifications of remote devices?
>> 
>> Yes, exactly this. This is mentioned in the commit message:
>> 
>>     2. Writing the command register and ROM BAR register may trigger
>>     modify_bars to run, which in turn may access multiple pdevs while
>>     checking for the existing BAR's overlap. The overlapping check, if done
>>     under the read lock, requires vpci->lock to be acquired on both devices
>>     being compared, which may produce a deadlock. It is not possible to
>>     upgrade read lock to write lock in such a case. So, in order to prevent
>>     the deadlock, check which registers are going to be written and acquire
>>     the lock in the appropriate mode from the beginning.
>
> Might be good to put part of the commit message in the code comment,
> just saying that the lock must be taken in write mode is not very
> informative.
>
> /*
>  * The <lock-name> per domain lock must be taken in write mode in
>  * order to prevent changes to the vPCI data of devices assigned to
>  * the domain, since the function parses such data.
>  */
>

Agreed. Also, I'll add a corresponding ASSERT().
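
I.e. something like this at the top of modify_bars() (sketch, assuming
the lock keeps its current vpci_rwlock name):

    /* Catch callers not holding the per-domain lock in write mode. */
    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));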

-- 
WBR, Volodymyr


* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-20 20:17   ` Stewart Hildebrand
@ 2023-06-22 21:18     ` Volodymyr Babchuk
  0 siblings, 0 replies; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-22 21:18 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant


Hi Stewart,

Stewart Hildebrand <stewart.hildebrand@amd.com> writes:

> On 6/13/23 06:32, Volodymyr Babchuk wrote:
>
> ...
>
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 652807a4a4..1270174e78 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
>> 
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> -    if ( !has_vpci(pdev->domain) || !pdev->vpci )
>> +    struct vpci *vpci;
>> +
>> +    if ( !has_vpci(pdev->domain) )
>>          return;
>> 
>> -    spin_lock(&pdev->vpci->lock);
>> +    write_lock(&pdev->domain->vpci_rwlock);
>> +    if ( !pdev->vpci )
>> +    {
>> +        write_unlock(&pdev->domain->vpci_rwlock);
>> +        return;
>> +    }
>> +
>> +    vpci = pdev->vpci;
>> +    pdev->vpci = NULL;
>
> Here we set pdev->vpci to NULL, yet there are still a few uses
> remaining after this in vpci_remove_device. I suspect they were missed
> during a s/pdev->vpci/vpci/ search and replace.
>

Yes, you are right. Thank you, I'll fix this in the next version.
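
The tail of vpci_remove_device() will then consistently use the local
pointer, along these lines (sketch against the diff below, loop bounds
as in the existing code):

    while ( !list_empty(&vpci->handlers) )
    {
        struct vpci_register *r = list_first_entry(&vpci->handlers,
                                                   struct vpci_register,
                                                   node);

        list_del(&r->node);
        xfree(r);
    }

    if ( vpci->msix )
    {
        unsigned int i;

        for ( i = 0; i < ARRAY_SIZE(vpci->msix->table); i++ )
            if ( vpci->msix->table[i] )
                iounmap(vpci->msix->table[i]);
    }
    xfree(vpci->msix);
    xfree(vpci->msi);
    xfree(vpci);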

>> +    write_unlock(&pdev->domain->vpci_rwlock);
>> +
>>      while ( !list_empty(&pdev->vpci->handlers) )
>
> pdev->vpci dereferenced here
>
>>      {
>> -        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
>> +        struct vpci_register *r = list_first_entry(&vpci->handlers,
>>                                                     struct vpci_register,
>>                                                     node);
>> 
>>          list_del(&r->node);
>>          xfree(r);
>>      }
>> -    spin_unlock(&pdev->vpci->lock);
>> +
>>      if ( pdev->vpci->msix )
>
> pdev->vpci dereferenced here
>
>>      {
>>          unsigned int i;
>> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>>              if ( pdev->vpci->msix->table[i] )
>
> pdev->vpci dereferenced here, and two more above not shown in the diff context
>
>>                  iounmap(pdev->vpci->msix->table[i]);
>
> pdev->vpci dereferenced here
>
>>      }
>> -    xfree(pdev->vpci->msix);
>> -    xfree(pdev->vpci->msi);
>> -    xfree(pdev->vpci);
>> -    pdev->vpci = NULL;
>> +    xfree(vpci->msix);
>> +    xfree(vpci->msi);
>> +    xfree(vpci);
>>  }


-- 
WBR, Volodymyr


* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-22 21:17         ` Volodymyr Babchuk
@ 2023-06-23  8:50           ` Roger Pau Monné
  2023-06-23  9:26             ` Volodymyr Babchuk
  2023-07-04 21:03             ` Volodymyr Babchuk
  0 siblings, 2 replies; 38+ messages in thread
From: Roger Pau Monné @ 2023-06-23  8:50 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant

On Thu, Jun 22, 2023 at 09:17:36PM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
> >> 
> >> Hi Roger,
> >> 
> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >> 
> >> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
> >> >> [...]
> >> >
> >> > Thanks.
> >> >
> >> > I haven't looked in full detail, but I'm afraid there's an ABBA
> >> > deadlock with the per-domain vpci lock and the pcidevs lock in
> >> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
> >> >
> >> > I've made some comments below.
> >> 
> >> Thank you for the review. I believe that it is a good idea to have a
> >> per-domain pdev_list lock. See my answers below.
> >
> > I think it's important that the lock that protects domain->pdev_list
> > must be the same that also protects pdev->vpci, or else you might run
> > into similar ABBA deadlock situations.
> >
> > The problem then could be that in vpci_{read,write} you will take the
> > per-domain pdev lock in read mode in order to get the pdev, and for
> > writes to the command register or the ROM BAR you would have to
> > upgrade such lock to a write lock without dropping it, and we don't
> > have such functionality for rw locks ATM.
> >
> > Maybe just re-starting the function knowing that the lock must be
> > taken in write mode would be a good solution: writes to the command
> > register will already be slow since they are likely to involve
> > modifications to the p2m.
> 
> Looks like modify_bars() is the only cause for this extended lock. I
> know that this was discussed earlier, but can we rework modify_bars to
> not iterate over all the pdevs? We can store a copy of all enabled BARs in
> a domain structure, protected by domain->vpci_lock. Something akin to
> 
> struct vpci_bar {
>         struct list_head list;
>         struct vpci *vpci;
>         unsigned long start;
>         unsigned long end;
>         bool is_rom;
> };

This IMO makes the logic more complicated, as each time a BAR is
updated we would have to change the cached address and size in two
different places.  It's also duplicated data that takes up memory, and
there are systems with a non-trivial number of PCI devices and thus
BARs to track.

I think it's easier to just make the newly introduced per-domain
rwlock also protect the domain's pdev_list (unless I'm missing
something).  AFAICT it would also simplify locking, as such rwlock
protecting the domain->pdev_list will save you from having to take
the pcidevs lock in vpci_{read,write} in order to find the device the
access belongs to.
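
Roughly (sketch only; the hardware fallback mirrors what vpci_read()
does for Dom0 today, and the handler dispatch is elided):

uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
{
    struct domain *d = current->domain;
    const struct pci_dev *pdev;
    uint32_t data = 0;

    /* A single lock both finds the pdev and protects pdev->vpci. */
    read_lock(&d->vpci_rwlock);
    pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
    if ( !pdev || !pdev->vpci )
    {
        read_unlock(&d->vpci_rwlock);
        return vpci_read_hw(sbdf, reg, size);
    }

    spin_lock(&pdev->vpci->lock);
    /* ... walk the register handlers filling 'data', as done today ... */
    spin_unlock(&pdev->vpci->lock);

    read_unlock(&d->vpci_rwlock);

    return data;
}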

Thanks, Roger.



* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-23  8:50           ` Roger Pau Monné
@ 2023-06-23  9:26             ` Volodymyr Babchuk
  2023-06-23 17:09               ` Jan Beulich
  2023-07-04 21:03             ` Volodymyr Babchuk
  1 sibling, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-06-23  9:26 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant


Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jun 22, 2023 at 09:17:36PM +0000, Volodymyr Babchuk wrote:
>> 
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
>> >> 
>> >> Hi Roger,
>> >> 
>> >> Roger Pau Monné <roger.pau@citrix.com> writes:
>> >> 
>> >> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> >> >> [...]
>> >> >
>> >> > Thanks.
>> >> >
>> >> > I haven't looked in full detail, but I'm afraid there's an ABBA
>> >> > deadlock with the per-domain vpci lock and the pcidevs lock in
>> >> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
>> >> >
>> >> > I've made some comments below.
>> >> 
>> >> Thank you for the review. I believe that it is a good idea to have a
>> >> per-domain pdev_list lock. See my answers below.
>> >
>> > I think it's important that the lock that protects domain->pdev_list
>> > must be the same that also protects pdev->vpci, or else you might run
>> > into similar ABBA deadlock situations.
>> >
>> > The problem then could be that in vpci_{read,write} you will take the
>> > per-domain pdev lock in read mode in order to get the pdev, and for
>> > writes to the command register or the ROM BAR you would have to
>> > upgrade such lock to a write lock without dropping it, and we don't
>> > have such functionality for rw locks ATM.
>> >
>> > Maybe just re-starting the function knowing that the lock must be
>> > taken in write mode would be a good solution: writes to the command
>> > register will already be slow since they are likely to involve
>> > modifications to the p2m.
>> 
>> Looks like modify_bars() is the only cause for this extended lock. I
>> know that this was discussed earlier, but can we rework modify_bars to
>> not iterate over all the pdevs? We can store a copy of all enabled BARs in
>> a domain structure, protected by domain->vpci_lock. Something akin to
>> 
>> struct vpci_bar {
>>         struct list_head list;
>>         struct vpci *vpci;
>>         unsigned long start;
>>         unsigned long end;
>>         bool is_rom;
>> };
>
> This IMO makes the logic more complicated, as each time a BAR is
> updated we would have to change the cached address and size in two
> different places.  It's also duplicated data that takes up memory, and
> there are systems with a non-trivial number of PCI devices and thus
> BARs to track.
>
> I think it's easier to just make the newly introduced per-domain
> rwlock also protect the domain's pdev_list (unless I'm missing
> something).  AFAICT it would also simplify locking, as such rwlock
> protecting the domain->pdev_list will save you from having to take
> the pcidevs lock in vpci_{read,write} in order to find the device the
> access belongs to.

In my opinion, this will make the whole locking logic complex, because
we will have one rwlock that protects:

1. pdev->vpci
2. domain->pdev_list

These are two semi-related things. I feel that this will be hard to
understand for anyone who is not deeply familiar with the PCI
code. Anyway, I want this patch series to get going, so I defer to your
judgment here. How should I name this lock? Taking into account that
its scope will be broader, "domain->vpci_rwlock" does not seem like a
good name.

-- 
WBR, Volodymyr


* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-23  9:26             ` Volodymyr Babchuk
@ 2023-06-23 17:09               ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-06-23 17:09 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Wei Liu,
	George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant,
	Roger Pau Monné

On 23.06.2023 11:26, Volodymyr Babchuk wrote:
> Roger Pau Monné <roger.pau@citrix.com> writes:
>> On Thu, Jun 22, 2023 at 09:17:36PM +0000, Volodymyr Babchuk wrote:
>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>> On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
>>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>>> I've made some comments below.
>>>>>
>>>>> Thank you for the review. I believe that it is a good idea to have a
>>>>> per-domain pdev_list lock. See my answers below.
>>>>
>>>> I think it's important that the lock that protects domain->pdev_list
>>>> must be the same that also protects pdev->vpci, or else you might run
>>>> into similar ABBA deadlock situations.
>>>>
>>>> The problem then could be that in vpci_{read,write} you will take the
>>>> per-domain pdev lock in read mode in order to get the pdev, and for
>>>> writes to the command register or the ROM BAR you would have to
>>>> upgrade such lock to a write lock without dropping it, and we don't
>>>> have such functionality for rw locks ATM.
>>>>
>>>> Maybe just re-starting the function knowing that the lock must be
>>>> taken in write mode would be a good solution: writes to the command
>>>> register will already be slow since they are likely to involve
>>>> modifications to the p2m.
>>>
>>> Looks like modify_bars() is the only cause for this extended lock. I
>>> know that this was discussed earlier, but can we rework modify_bars to
>>> not iterate over all the pdevs? We can store a copy of all enabled BARs in
>>> a domain structure, protected by domain->vpci_lock. Something akin to
>>>
>>> struct vpci_bar {
>>>         struct list_head list;
>>>         struct vpci *vpci;
>>>         unsigned long start;
>>>         unsigned long end;
>>>         bool is_rom;
>>> };
>>
>> This IMO makes the logic more complicated, as each time a BAR is
>> updated we would have to change the cached address and size in two
>> different places.  It's also duplicated data that takes up memory, and
>> there are systems with a non-trivial number of PCI devices and thus
>> BARs to track.
>>
>> I think it's easier to just make the newly introduced per-domain
>> rwlock also protect the domain's pdev_list (unless I'm missing
>> something).  AFAICT it would also simplify locking, as such rwlock
>> protecting the domain->pdev_list will save you from having to take
>> the pcidevs lock in vpci_{read,write} in order to find the device the
>> access belongs to.
> 
> In my opinion, this will make the whole locking logic complex, because
> we will have one rwlock that protects:
> 
> 1. pdev->vpci
> 2. domain->pdev_list

The important thing at this stage is to get the granularity down from
the global pcidevs lock. What exactly is grouped together is
secondary for the moment; it needs writing down clearly in a code
comment, of course.

Just to be sure I recall things correctly: 1 above is covering
only the pointer, not the contents of the pointed-to struct? If
so, I would agree putting both under the same lock is a little odd,
but if that limits lock nesting issues, I'd say so be it. If not,
then this lock could simply be declared as covering everything
PCI-ish that a domain has in use. Especially in the latter case ...

> These are two semi-related things. I feel that this will be hard to
> understand for anyone who is not deeply familiar with the PCI
> code. Anyway, I want this patch series to get going, so I defer to your
> judgment here. How should I name this lock? Taking into account that
> its scope will be broader, "domain->vpci_rwlock" does not seem like a
> good name.

... d->pci_lock would seem quite appropriate.
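
E.g. (a sketch only; where exactly the field and its comment end up
living is secondary):

struct domain {
    /* ... */

    /*
     * Protects d->pdev_list as well as the pdev->vpci pointers of all
     * devices assigned to the domain, i.e. everything PCI-ish the
     * domain has in use.
     */
    rwlock_t pci_lock;
};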

Jan



* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-06-23  8:50           ` Roger Pau Monné
  2023-06-23  9:26             ` Volodymyr Babchuk
@ 2023-07-04 21:03             ` Volodymyr Babchuk
  2023-07-05  7:11               ` Jan Beulich
  1 sibling, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-07-04 21:03 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant


Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jun 22, 2023 at 09:17:36PM +0000, Volodymyr Babchuk wrote:
>> 
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
>> >> 
>> >> Hi Roger,
>> >> 
>> >> Roger Pau Monné <roger.pau@citrix.com> writes:
>> >> 
>> >> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> >> >> [...]
>> >> >
>> >> > Thanks.
>> >> >
>> >> > I haven't looked in full detail, but I'm afraid there's an ABBA
>> >> > deadlock with the per-domain vpci lock and the pcidevs lock in
>> >> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
>> >> >
>> >> > I've made some comments below.
>> >> 
>> >> Thank you for the review. I believe that it is a good idea to have a
>> >> per-domain pdev_list lock. See my answers below.
>> >
>> > I think it's important that the lock that protects domain->pdev_list
>> > must be the same that also protects pdev->vpci, or else you might run
>> > into similar ABBA deadlock situations.
>> >
>> > The problem then could be that in vpci_{read,write} you will take the
>> > per-domain pdev lock in read mode in order to get the pdev, and for
>> > writes to the command register or the ROM BAR you would have to
>> > upgrade such lock to a write lock without dropping it, and we don't
>> > have such functionality for rw locks ATM.
>> >
>> > Maybe just re-starting the function knowing that the lock must be
>> > taken in write mode would be a good solution: writes to the command
>> > register will already be slow since they are likely to involve
>> > modifications to the p2m.
>> 
>> Looks like modify_bars() is the only cause for this extended lock. I
>> know that this was discussed earlier, but can we rework modify_bars to
>> not iterate over all the pdevs? We can store a copy of all enabled BARs in
>> a domain structure, protected by domain->vpci_lock. Something akin to
>> 
>> struct vpci_bar {
>>         struct list_head list;
>>         struct vpci *vpci;
>>         unsigned long start;
>>         unsigned long end;
>>         bool is_rom;
>> };
>
> This IMO makes the logic more complicated, as each time a BAR is
> updated we would have to change the cached address and size in two
> different places.  It's also duplicated data that takes up memory, and
> there are systems with a non-trivial number of PCI devices and thus
> BARs to track.
>
> I think it's easier to just make the newly introduced per-domain
> rwlock also protect the domain's pdev_list (unless I'm missing
> something).  AFAICT it would also simplify locking, as such rwlock
> protecting the domain->pdev_list will save you from having to take
> the pcidevs lock in vpci_{read,write} in order to find the device the
> access belongs to.

I am currently implementing your proposal (along with Jan's
suggestions), but I am facing an ABBA deadlock with IOMMU's
reassign_device() call, which has this piece of code:

        list_move(&pdev->domain_list, &target->pdev_list);

My immediate change was:

        write_lock(&pdev->domain->pci_lock);
        list_del(&pdev->domain_list);
        write_unlock(&pdev->domain->pci_lock);

        write_lock(&target->pci_lock);
        list_add(&pdev->domain_list, &target->pdev_list);
        write_unlock(&target->pci_lock);

But this will not work because reassign_device is called from
pci_release_devices() which iterates over d->pdev_list, so we need to
take d->pci_lock early.

Any suggestions on how to fix this? My idea is to remove a device from a
list one at a time:

int pci_release_devices(struct domain *d)
{
    struct pci_dev *pdev;
    u8 bus, devfn;
    int ret;

    pcidevs_lock();
    write_lock(&d->pci_lock);
    ret = arch_pci_clean_pirqs(d);
    if ( ret )
    {
        pcidevs_unlock();
        write_unlock(&d->pci_lock);
        return ret;
    }

    while ( !list_empty(&d->pdev_list) )
    {
        pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
        bus = pdev->bus;
        devfn = pdev->devfn;
        list_del(&pdev->domain_list);
        write_unlock(&d->pci_lock);
        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
        write_lock(&d->pci_lock);
    }
    write_unlock(&d->pci_lock);
    pcidevs_unlock();

    return ret;
}

But it is somewhat ugly.


-- 
WBR, Volodymyr


* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-07-04 21:03             ` Volodymyr Babchuk
@ 2023-07-05  7:11               ` Jan Beulich
  2023-07-05  8:59                 ` Roger Pau Monné
  2023-07-09 22:41                 ` Volodymyr Babchuk
  0 siblings, 2 replies; 38+ messages in thread
From: Jan Beulich @ 2023-07-05  7:11 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Wei Liu,
	George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant,
	Roger Pau Monné

On 04.07.2023 23:03, Volodymyr Babchuk wrote:
> I am currently implementing your proposal (along with Jan's
> suggestions), but I am facing an ABBA deadlock with IOMMU's
> reassign_device() call, which has this piece of code:
> 
>         list_move(&pdev->domain_list, &target->pdev_list);
> 
> My immediate change was:
> 
>         write_lock(&pdev->domain->pci_lock);
>         list_del(&pdev->domain_list);
>         write_unlock(&pdev->domain->pci_lock);
> 
>         write_lock(&target->pci_lock);
>         list_add(&pdev->domain_list, &target->pdev_list);
>         write_unlock(&target->pci_lock);
> 
> But this will not work because reassign_device is called from
> pci_release_devices() which iterates over d->pdev_list, so we need to
> take d->pci_lock early.
> 
> Any suggestions on how to fix this? My idea is to remove a device from a
> list one at a time:
> 
> int pci_release_devices(struct domain *d)
> {
>     struct pci_dev *pdev;
>     u8 bus, devfn;
>     int ret;
> 
>     pcidevs_lock();
>     write_lock(&d->pci_lock);
>     ret = arch_pci_clean_pirqs(d);
>     if ( ret )
>     {
>         pcidevs_unlock();
>         write_unlock(&d->pci_lock);
>         return ret;
>     }
> 
>     while ( !list_empty(&d->pdev_list) )
>     {
>         pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>         bus = pdev->bus;
>         devfn = pdev->devfn;
>         list_del(&pdev->domain_list);
>         write_unlock(&d->pci_lock);
>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>         write_lock(&d->pci_lock);

I think it needs doing almost like this, but with two more tweaks and
no list_del() right here (first and foremost to avoid needing to
figure out whether removing early isn't going to subtly break anything;
see below for an error case that would end up with changed behavior):

    while ( !list_empty(&d->pdev_list) )
    {
        const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
        uint16_t seg = pdev->seg;
        uint8_t bus = pdev->bus;
        uint8_t devfn = pdev->devfn;

        write_unlock(&d->pci_lock);
        ret = deassign_device(d, seg, bus, devfn) ?: ret;
        write_lock(&d->pci_lock);
    }

One caveat though: the original list_for_each_entry_safe() guarantees
that the loop completes; your use of list_del() would guarantee that too,
but then the device wouldn't be on the list anymore if deassign_device()
failed. Therefore I guess you will need another local list where you
(temporarily) put all the devices which deassign_device() left on the
list, and which you would then move back to d->pdev_list after the loop
has finished. (Whether it is sufficient to inspect the list head to
determine whether the pdev is still on the list needs careful checking.)
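
I.e. something like this, building on the loop above (a sketch only,
using exactly that head inspection which - as said - needs careful
checking):

    LIST_HEAD(failed);

    while ( !list_empty(&d->pdev_list) )
    {
        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
                                                struct pci_dev,
                                                domain_list);
        uint16_t seg = pdev->seg;
        uint8_t bus = pdev->bus;
        uint8_t devfn = pdev->devfn;

        write_unlock(&d->pci_lock);
        ret = deassign_device(d, seg, bus, devfn) ?: ret;
        write_lock(&d->pci_lock);

        /* Still heading the list? Then deassign failed - park it. */
        if ( !list_empty(&d->pdev_list) &&
             list_first_entry(&d->pdev_list, struct pci_dev,
                              domain_list) == pdev )
            list_move(&pdev->domain_list, &failed);
    }

    list_splice(&failed, &d->pdev_list);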

Jan



* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-07-05  7:11               ` Jan Beulich
@ 2023-07-05  8:59                 ` Roger Pau Monné
  2023-07-05  9:13                   ` Jan Beulich
  2023-07-09 22:41                 ` Volodymyr Babchuk
  1 sibling, 1 reply; 38+ messages in thread
From: Roger Pau Monné @ 2023-07-05  8:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Volodymyr Babchuk, xen-devel, Oleksandr Andrushchenko,
	Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, Paul Durrant

On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
> > I am currently implementing your proposal (along with Jan's
> > suggestions), but I am facing an ABBA deadlock with IOMMU's
> > reassign_device() call, which has this piece of code:
> > 
> >         list_move(&pdev->domain_list, &target->pdev_list);
> > 
> > My immediate change was:
> > 
> >         write_lock(&pdev->domain->pci_lock);
> >         list_del(&pdev->domain_list);
> >         write_unlock(&pdev->domain->pci_lock);
> > 
> >         write_lock(&target->pci_lock);
> >         list_add(&pdev->domain_list, &target->pdev_list);
> >         write_unlock(&target->pci_lock);
> > 
> > But this will not work because reassign_device is called from
> > pci_release_devices() which iterates over d->pdev_list, so we need to
> > take d->pci_lock early.
> > 
> > Any suggestions on how to fix this? My idea is to remove a device from a
> > list one at a time:
> > 
> > [...]
> > 
> >     while ( !list_empty(&d->pdev_list) )
> >     {
> >         pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
> >         bus = pdev->bus;
> >         devfn = pdev->devfn;
> >         list_del(&pdev->domain_list);
> >         write_unlock(&d->pci_lock);
> >         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
> >         write_lock(&d->pci_lock);
> 
> I think it needs doing almost like this, but with two more tweaks and
> no list_del() right here (first and foremost to avoid needing to
> figure whether removing early isn't going to subtly break anything;
> see below for an error case that would end up with changed behavior):
> 
>     while ( !list_empty(&d->pdev_list) )
>     {
>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>         uint16_t seg = pdev->seg;
>         uint8_t bus = pdev->bus;
>         uint8_t devfn = pdev->devfn;
> 
>         write_unlock(&d->pci_lock);

I think you need to remove the device from the pdev_list before
dropping the lock, or else release could race with other operations.

That's unlikely, but still: if the lock is dropped and the routine
needs to operate on the device, it is better to remove the device from
the domain so other operations cannot get a reference to it.

Otherwise you could modify reassign_device() implementations so they
require the caller to hold the source->pci_lock when calling the
routine, but that's ugly because the lock would need to be dropped in
order to reassign the device from source to target domains.

Another option would be to move the whole d->pdev_list to a local
variable (so that d->pdev_list would be empty) and then iterate over
it without the d->pci_lock.  On failure you would take the lock and
add the failing device back into d->pdev_list.
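
For that last option, something like the following (sketch only;
whether deassign_device() copes with the device transiently being on
no domain list would need checking):

    LIST_HEAD(tmp);

    write_lock(&d->pci_lock);
    list_splice_init(&d->pdev_list, &tmp);
    write_unlock(&d->pci_lock);

    while ( !list_empty(&tmp) )
    {
        struct pci_dev *pdev = list_first_entry(&tmp, struct pci_dev,
                                                domain_list);
        int rc;

        list_del(&pdev->domain_list);
        rc = deassign_device(d, pdev->seg, pdev->bus, pdev->devfn);
        if ( rc )
        {
            /* Hand the failing device back to the domain. */
            write_lock(&d->pci_lock);
            list_add(&pdev->domain_list, &d->pdev_list);
            write_unlock(&d->pci_lock);
            ret = ret ?: rc;
        }
    }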

Thanks, Roger.



* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-07-05  8:59                 ` Roger Pau Monné
@ 2023-07-05  9:13                   ` Jan Beulich
  2023-07-07  2:02                     ` Volodymyr Babchuk
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2023-07-05  9:13 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Volodymyr Babchuk, xen-devel, Oleksandr Andrushchenko,
	Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, Paul Durrant

On 05.07.2023 10:59, Roger Pau Monné wrote:
> On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>> I am currently implementing your proposal (along with Jan's
>>> suggestions), but I am facing an ABBA deadlock with IOMMU's
>>> reassign_device() call, which has this piece of code:
>>>
>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>
>>> My immediate change was:
>>>
>>>         write_lock(&pdev->domain->pci_lock);
>>>         list_del(&pdev->domain_list);
>>>         write_unlock(&pdev->domain->pci_lock);
>>>
>>>         write_lock(&target->pci_lock);
>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>         write_unlock(&target->pci_lock);
>>>
>>> But this will not work because reassign_device is called from
>>> pci_release_devices() which iterates over d->pdev_list, so we need to
>>> take d->pci_lock early.
>>>
>>> Any suggestions on how to fix this? My idea is to remove a device from a
>>> list one at a time:
>>>
>>> [...]
>>
>> I think it needs doing almost like this, but with two more tweaks and
>> no list_del() right here (first and foremost to avoid needing to
>> figure whether removing early isn't going to subtly break anything;
>> see below for an error case that would end up with changed behavior):
>>
>>     while ( !list_empty(&d->pdev_list) )
>>     {
>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>         uint16_t seg = pdev->seg;
>>         uint8_t bus = pdev->bus;
>>         uint8_t devfn = pdev->devfn;
>>
>>         write_unlock(&d->pci_lock);
> 
> I think you need to remove the device from the pdev_list before
> dropping the lock, or else release could race with other operations.
> 
> That's unlikely, but still: if the lock is dropped and the routine
> needs to operate on the device, it is better to remove the device from
> the domain so other operations cannot get a reference to it.
> 
> Otherwise you could modify reassign_device() implementations so they
> require the caller to hold the source->pci_lock when calling the
> routine, but that's ugly because the lock would need to be dropped in
> order to reassign the device from source to target domains.
> 
> Another option would be to move the whole d->pdev_list to a local
> variable (so that d->pdev_list would be empty) and then iterate over
> it without the d->pci_lock.  On failure you would take the lock and
> add the failing device back into d->pdev_list.

Conceptually I like this last variant, but like the individual list_del()
it requires auditing code for no dependency on the device still being on
that list. In fact deassign_device()'s use of pci_get_pdev() does. The
function would then need changing to have struct pci_dev * passed in.
Yet who knows where else there are uses of pci_get_pdev() lurking.
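
I.e. roughly (merely a prototype sketch, with the SBDF-based lookup
then moving to the caller):

-int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, uint8_t devfn);
+int deassign_device(struct domain *d, struct pci_dev *pdev);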

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-07-05  9:13                   ` Jan Beulich
@ 2023-07-07  2:02                     ` Volodymyr Babchuk
  2023-07-07  6:32                       ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-07-07  2:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné,
	xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Wei Liu,
	George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant


Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 05.07.2023 10:59, Roger Pau Monné wrote:
>> On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
>>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>>> I am currently implementing your proposal (along with Jan's
>>>> suggestions), but I am facing ABBA deadlock with IOMMU's
>>>> reassign_device() call, which has this piece of code:
>>>>
>>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>>
>>>> My immediate change was:
>>>>
>>>>         write_lock(&pdev->domain->pci_lock);
>>>>         list_del(&pdev->domain_list);
>>>>         write_unlock(&pdev->domain->pci_lock);
>>>>
>>>>         write_lock(&target->pci_lock);
>>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>>         write_unlock(&target->pci_lock);
>>>>
>>>> But this will not work because reassign_device() is called from
>>>> pci_release_devices(), which iterates over d->pdev_list, so we need to
>>>> take d->pci_lock early.
>>>>
>>>> Any suggestions on how to fix this? My idea is to remove a device from
>>>> the list one at a time:
>>>>
>>>> int pci_release_devices(struct domain *d)
>>>> {
>>>>     struct pci_dev *pdev;
>>>>     u8 bus, devfn;
>>>>     int ret;
>>>>
>>>>     pcidevs_lock();
>>>>     write_lock(&d->pci_lock);
>>>>     ret = arch_pci_clean_pirqs(d);
>>>>     if ( ret )
>>>>     {
>>>>         pcidevs_unlock();
>>>>         write_unlock(&d->pci_lock);
>>>>         return ret;
>>>>     }
>>>>
>>>>     while ( !list_empty(&d->pdev_list) )
>>>>     {
>>>>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>>         bus = pdev->bus;
>>>>         devfn = pdev->devfn;
>>>>         list_del(&pdev->domain_list);
>>>>         write_unlock(&d->pci_lock);
>>>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>>>         write_lock(&d->pci_lock);
>>>
>>> I think it needs doing almost like this, but with two more tweaks and
>>> no list_del() right here (first and foremost to avoid needing to
>>> figure whether removing early isn't going to subtly break anything;
>>> see below for an error case that would end up with changed behavior):
>>>
>>>     while ( !list_empty(&d->pdev_list) )
>>>     {
>>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>         uint16_t seg = pdev->seg;
>>>         uint8_t bus = pdev->bus;
>>>         uint8_t devfn = pdev->devfn;
>>>
>>>         write_unlock(&d->pci_lock);
>> 
>> I think you need to remove the device from the pdev_list before
>> dropping the lock, or else release could race with other operations.
>> 
>> That's unlikely, but still if the lock is dropped and the routine
>> needs to operate on the device, it is better to remove the device from
>> the domain so other operations cannot get a reference to it.
>> 
>> Otherwise you could modify reassign_device() implementations so they
>> require the caller to hold the source->pci_lock when calling the
>> routine, but that's ugly because the lock would need to be dropped in
>> order to reassign the device from source to target domains.
>> 
>> Another option would be to move the whole d->pdev_list to a local
>> variable (so that d->pdev_list would be empty) and then iterate over
>> it without the d->pci_lock.  On failure you would take the lock and
>> add the failing device back into d->pdev_list.
>
> Conceptually I like this last variant, but like the individual list_del()
> it requires auditing code for no dependency on the device still being on
> that list. In fact deassign_device()'s use of pci_get_pdev() does. The
> function would then need changing to have struct pci_dev * passed in.
> Yet who knows where else there are uses of pci_get_pdev() lurking.

Okay, so I changed the deassign_device() signature and reworked the loop
in pci_release_devices() in the following way:

    INIT_LIST_HEAD(&tmp_list);
    /* Move all entries to tmp_list, so we can drop d->pci_lock */
    list_splice_init(&d->pdev_list, &tmp_list);
    write_unlock(&d->pci_lock);

    list_for_each_entry_safe ( pdev, tmp, &tmp_list, domain_list )
    {
        rc = deassign_device(d, pdev);
        if ( rc )
        {
            /* Return device back to the domain list */
            write_lock(&d->pci_lock);
            list_add(&pdev->domain_list, &d->pdev_list);
            write_unlock(&d->pci_lock);
            func_ret = rc;
        }
    }


Also, I checked all pci_get_pdev() calls and found that struct
domain (the first parameter) is passed in only a handful of places:

*** xen/drivers/vpci/vpci.c:
vpci_read[504]                 pdev = pci_get_pdev(d, sbdf);
vpci_read[506]                 pdev = pci_get_pdev(dom_xen, sbdf);
vpci_write[621]                pdev = pci_get_pdev(d, sbdf);
vpci_write[623]                pdev = pci_get_pdev(dom_xen, sbdf);

*** xen/arch/x86/irq.c:
map_domain_pirq[2166]          pdev = pci_get_pdev(d, msi->sbdf);

*** xen/drivers/passthrough/pci.c:
XEN_GUEST_HANDLE_PARAM[1712]   pdev = pci_get_pdev(d, machine_sbdf);

The last one is due to my change to the deassign_device() signature.

==============================

d->pdev_list can be accessed there:

*** xen/drivers/passthrough/amd/pci_amd_iommu.c:
reassign_device[489]           list_add(&pdev->domain_list, &target->pdev_list);

*** xen/drivers/passthrough/pci.c:
_pci_hide_device[463]          list_add(&pdev->domain_list, &dom_xen->pdev_list);
pci_get_pdev[561]              list_for_each_entry ( pdev, &d->pdev_list, domain_list )
pci_add_device[759]            list_add(&pdev->domain_list, &hardware_domain->pdev_list);
pci_release_devices[917]       list_splice_init(&d->pdev_list, &tmp_list);
pci_release_devices[922]       pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
pci_release_devices[928]       list_add(&pdev->domain_list, &d->pdev_list);
_setup_hwdom_pci_devices[1155] list_add(&pdev->domain_list, &ctxt->d->pdev_list);

*** xen/drivers/passthrough/vtd/iommu.c:
reassign_device_ownership[2819] list_add(&pdev->domain_list, &target->pdev_list);

*** xen/include/xen/pci.h:
for_each_pdev[149]             list_for_each_entry(pdev, &(domain)->pdev_list, domain_list)
has_arch_pdevs[151]            #define has_arch_pdevs(d) (!list_empty(&(d)->pdev_list))

==============================

And has_arch_pdevs() is used there:

*** xen/arch/x86/hvm/hvm.c:
hvm_set_cr0[2388]              has_arch_pdevs(d)) )

*** xen/arch/x86/hvm/vmx/vmcs.c:
vmx_do_resume[1892]            if ( has_arch_pdevs(v->domain) && !iommu_snoop

*** xen/arch/x86/mm.c:
l1_disallow_mask[172]          !has_arch_pdevs(d) && \

*** xen/arch/x86/mm/p2m-pod.c:
p2m_pod_set_mem_target[352]    if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
guest_physmap_mark_populate_on_demand[1404] if ( has_arch_pdevs(d) || cache_flush_permitted(d) )

*** xen/arch/x86/mm/paging.c:
paging_log_dirty_enable[208]   if ( has_arch_pdevs(d) )

*** xen/drivers/passthrough/vtd/iommu.c:
reassign_device_ownership[2773] if ( !has_arch_pdevs(target) )
reassign_device_ownership[2807] if ( !has_arch_pdevs(target) )
reassign_device_ownership[2825] if ( !has_arch_pdevs(source) )


has_arch_pdevs() bothers me most, actually, because it is not always
obvious how to add locking for the callers. I am planning to rework it
in the following way:

#define has_arch_pdevs_unlocked(d) (!list_empty(&(d)->pdev_list))

static inline bool has_arch_pdevs(struct domain *d)
{
    bool ret;

    read_lock(&d->pci_lock);
    ret = has_arch_pdevs_unlocked(d);
    read_unlock(&d->pci_lock);

    return ret;
}

And then use appropriate macro/function depending on circumstances.


-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-07-07  2:02                     ` Volodymyr Babchuk
@ 2023-07-07  6:32                       ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-07-07  6:32 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Roger Pau Monné,
	xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Wei Liu,
	George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant

On 07.07.2023 04:02, Volodymyr Babchuk wrote:
> 
> Hi Jan,
> 
> Jan Beulich <jbeulich@suse.com> writes:
> 
>> On 05.07.2023 10:59, Roger Pau Monné wrote:
>>> On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
>>>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>>>> I am currently implementing your proposal (along with Jan's
>>>>> suggestions), but I am facing ABBA deadlock with IOMMU's
>>>>> reassign_device() call, which has this piece of code:
>>>>>
>>>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>>>
>>>>> My immediate change was:
>>>>>
>>>>>         write_lock(&pdev->domain->pci_lock);
>>>>>         list_del(&pdev->domain_list);
>>>>>         write_unlock(&pdev->domain->pci_lock);
>>>>>
>>>>>         write_lock(&target->pci_lock);
>>>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>>>         write_unlock(&target->pci_lock);
>>>>>
>>>>> But this will not work because reassign_device() is called from
>>>>> pci_release_devices(), which iterates over d->pdev_list, so we need to
>>>>> take d->pci_lock early.
>>>>>
>>>>> Any suggestions on how to fix this? My idea is to remove a device from
>>>>> the list one at a time:
>>>>>
>>>>> int pci_release_devices(struct domain *d)
>>>>> {
>>>>>     struct pci_dev *pdev;
>>>>>     u8 bus, devfn;
>>>>>     int ret;
>>>>>
>>>>>     pcidevs_lock();
>>>>>     write_lock(&d->pci_lock);
>>>>>     ret = arch_pci_clean_pirqs(d);
>>>>>     if ( ret )
>>>>>     {
>>>>>         pcidevs_unlock();
>>>>>         write_unlock(&d->pci_lock);
>>>>>         return ret;
>>>>>     }
>>>>>
>>>>>     while ( !list_empty(&d->pdev_list) )
>>>>>     {
>>>>>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>>>         bus = pdev->bus;
>>>>>         devfn = pdev->devfn;
>>>>>         list_del(&pdev->domain_list);
>>>>>         write_unlock(&d->pci_lock);
>>>>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>>>>         write_lock(&d->pci_lock);
>>>>
>>>> I think it needs doing almost like this, but with two more tweaks and
>>>> no list_del() right here (first and foremost to avoid needing to
>>>> figure whether removing early isn't going to subtly break anything;
>>>> see below for an error case that would end up with changed behavior):
>>>>
>>>>     while ( !list_empty(&d->pdev_list) )
>>>>     {
>>>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>>         uint16_t seg = pdev->seg;
>>>>         uint8_t bus = pdev->bus;
>>>>         uint8_t devfn = pdev->devfn;
>>>>
>>>>         write_unlock(&d->pci_lock);
>>>
>>> I think you need to remove the device from the pdev_list before
>>> dropping the lock, or else release could race with other operations.
>>>
>>> That's unlikely, but still if the lock is dropped and the routine
>>> needs to operate on the device, it is better to remove the device from
>>> the domain so other operations cannot get a reference to it.
>>>
>>> Otherwise you could modify reassign_device() implementations so they
>>> require the caller to hold the source->pci_lock when calling the
>>> routine, but that's ugly because the lock would need to be dropped in
>>> order to reassign the device from source to target domains.
>>>
>>> Another option would be to move the whole d->pdev_list to a local
>>> variable (so that d->pdev_list would be empty) and then iterate over
>>> it without the d->pci_lock.  On failure you would take the lock and
>>> add the failing device back into d->pdev_list.
>>
>> Conceptually I like this last variant, but like the individual list_del()
>> it requires auditing code for no dependency on the device still being on
>> that list. In fact deassign_device()'s use of pci_get_pdev() does. The
>> function would then need changing to have struct pci_dev * passed in.
>> Yet who knows where else there are uses of pci_get_pdev() lurking.
> 
> Okay, so I changed the deassign_device() signature and reworked the loop
> in pci_release_devices() in the following way:
> 
>     INIT_LIST_HEAD(&tmp_list);
>     /* Move all entries to tmp_list, so we can drop d->pci_lock */
>     list_splice_init(&d->pdev_list, &tmp_list);
>     write_unlock(&d->pci_lock);
> 
>     list_for_each_entry_safe ( pdev, tmp, &tmp_list, domain_list )
>     {
>         rc = deassign_device(d, pdev);
>         if ( rc )
>         {
>             /* Return device back to the domain list */
>             write_lock(&d->pci_lock);
>             list_add(&pdev->domain_list, &d->pdev_list);
>             write_unlock(&d->pci_lock);
>             func_ret = rc;
>         }
>     }
> 
> 
> Also, I checked all pci_get_pdev() calls and found that struct
> domain (the first parameter) is passed in only a handful of places:
> 
> *** xen/drivers/vpci/vpci.c:
> vpci_read[504]                 pdev = pci_get_pdev(d, sbdf);
> vpci_read[506]                 pdev = pci_get_pdev(dom_xen, sbdf);
> vpci_write[621]                pdev = pci_get_pdev(d, sbdf);
> vpci_write[623]                pdev = pci_get_pdev(dom_xen, sbdf);
> 
> *** xen/arch/x86/irq.c:
> map_domain_pirq[2166]          pdev = pci_get_pdev(d, msi->sbdf);
> 
> *** xen/drivers/passthrough/pci.c:
> XEN_GUEST_HANDLE_PARAM[1712]   pdev = pci_get_pdev(d, machine_sbdf);
> 
> The last one is due to my change to the deassign_device() signature.

And which is going to continue to return NULL when earlier on you've
emptied the list. The purpose of passing in struct pci_dev * was to
eliminate this call. (Yet there may be further reasons why eliminating
this call actually isn't correct.)
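
To make the hazard concrete (sketch only):

    /* d->pdev_list was spliced onto a local list earlier ... */
    pdev = pci_get_pdev(d, machine_sbdf);
    /* ... so the walk over d->pdev_list finds nothing: pdev == NULL. */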

> ==============================
> 
> d->pdev_list can be accessed there:
> 
> *** xen/drivers/passthrough/amd/pci_amd_iommu.c:
> reassign_device[489]           list_add(&pdev->domain_list, &target->pdev_list);
> 
> *** xen/drivers/passthrough/pci.c:
> _pci_hide_device[463]          list_add(&pdev->domain_list, &dom_xen->pdev_list);
> pci_get_pdev[561]              list_for_each_entry ( pdev, &d->pdev_list, domain_list )
> pci_add_device[759]            list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> pci_release_devices[917]       list_splice_init(&d->pdev_list, &tmp_list);
> pci_release_devices[922]       pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
> pci_release_devices[928]       list_add(&pdev->domain_list, &d->pdev_list);
> _setup_hwdom_pci_devices[1155] list_add(&pdev->domain_list, &ctxt->d->pdev_list);
> 
> *** xen/drivers/passthrough/vtd/iommu.c:
> reassign_device_ownership[2819] list_add(&pdev->domain_list, &target->pdev_list);
> 
> *** xen/include/xen/pci.h:
> for_each_pdev[149]             list_for_each_entry(pdev, &(domain)->pdev_list, domain_list)
> has_arch_pdevs[151]            #define has_arch_pdevs(d) (!list_empty(&(d)->pdev_list))
> 
> ==============================
> 
> And has_arch_pdevs() is used there:
> 
> *** xen/arch/x86/hvm/hvm.c:
> hvm_set_cr0[2388]              has_arch_pdevs(d)) )
> 
> *** xen/arch/x86/hvm/vmx/vmcs.c:
> vmx_do_resume[1892]            if ( has_arch_pdevs(v->domain) && !iommu_snoop
> 
> *** xen/arch/x86/mm.c:
> l1_disallow_mask[172]          !has_arch_pdevs(d) && \
> 
> *** xen/arch/x86/mm/p2m-pod.c:
> p2m_pod_set_mem_target[352]    if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
> guest_physmap_mark_populate_on_demand[1404] if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
> 
> *** xen/arch/x86/mm/paging.c:
> paging_log_dirty_enable[208]   if ( has_arch_pdevs(d) )
> 
> *** xen/drivers/passthrough/vtd/iommu.c:
> reassign_device_ownership[2773] if ( !has_arch_pdevs(target) )
> reassign_device_ownership[2807] if ( !has_arch_pdevs(target) )
> reassign_device_ownership[2825] if ( !has_arch_pdevs(source) )
> 
> 
> has_arch_pdevs() bothers me most, actually, because it is not always
> obvious how to add locking for the callers. I am planning to rework it
> in the following way:

Locking is only one aspect. As above, another is whether the function
might wrongly return "false" when you prematurely empty the list of
devices.

> #define has_arch_pdevs_unlocked(d) (!list_empty(&(d)->pdev_list))
> 
> static inline bool has_arch_pdevs(struct domain *d)
> {
>     bool ret;
> 
>     read_lock(&d->pci_lock);
>     ret = has_arch_pdevs_unlocked(d);
>     read_unlock(&d->pci_lock);
> 
>     return ret;
> }

No, this follows a pattern that earlier was said isn't acceptable:
The result of such a check is meaningless to the caller without holding
the lock past actually consuming the result. It simply is stale by the
time you return to the caller. (There may be special cases where other
constraints eliminate the concern, like maybe during domain
construction or domain cleanup, but such need carefully considering in
each individual case. Which in particular means there shouldn't be any
common-use helper functions doing what is unsafe in the general case.)
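
To illustrate, with act_on_pdevs() being a made-up consumer:

    /* Stale: the list may change between the check and the use. */
    if ( has_arch_pdevs(d) )        /* lock taken and dropped inside */
        act_on_pdevs(d);            /* d->pdev_list may be empty again */

    /* Meaningful: hold the lock until the result has been consumed. */
    read_lock(&d->pci_lock);
    if ( !list_empty(&d->pdev_list) )
        act_on_pdevs(d);
    read_unlock(&d->pci_lock);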

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 10/12] vpci: add initial support for virtual PCI bus topology
  2023-06-21 12:06   ` Jan Beulich
@ 2023-07-07 16:45     ` Oleksandr Tyshchenko
  0 siblings, 0 replies; 38+ messages in thread
From: Oleksandr Tyshchenko @ 2023-07-07 16:45 UTC (permalink / raw)
  To: Jan Beulich, Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Roger Pau Monné,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, xen-devel



On 21.06.23 15:06, Jan Beulich wrote:

Hello all


> On 13.06.2023 12:32, Volodymyr Babchuk wrote:
>> @@ -121,6 +124,62 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>   }
>>   
>>   #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +static int add_virtual_device(struct pci_dev *pdev)
>> +{
>> +    struct domain *d = pdev->domain;
>> +    pci_sbdf_t sbdf = { 0 };
>> +    unsigned long new_dev_number;
>> +
>> +    if ( is_hardware_domain(d) )
>> +        return 0;
>> +
>> +    ASSERT(pcidevs_locked());
>> +
>> +    /*
>> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
>> +     * there are multi-function ones which are not yet supported.
>> +     */
>> +    if ( pdev->info.is_extfn )
>> +    {
>> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
>> +                 &pdev->sbdf);
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
>> +                                         VPCI_MAX_VIRT_DEV);
>> +    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
>> +        return -ENOSPC;
>> +
>> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
> 
> Since the find-and-set can't easily be atomic, the lock used here (
> asserted to be held above) needs to be the same as ...
> 
>> +    /*
>> +     * Both segment and bus number are 0:
>> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
>> +     *  - with bus 0 the virtual devices are seen as embedded
>> +     *    endpoints behind the root complex
>> +     *
>> +     * TODO: add support for multi-function devices.
>> +     */
>> +    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
>> +    pdev->vpci->guest_sbdf = sbdf;
>> +
>> +    return 0;
>> +
>> +}
>> +
>> +static void vpci_remove_virtual_device(const struct pci_dev *pdev)
>> +{
>> +    write_lock(&pdev->domain->vpci_rwlock);
>> +    if ( pdev->vpci )
>> +    {
>> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
>> +                    &pdev->domain->vpci_dev_assigned_map);
>> +        pdev->vpci->guest_sbdf.sbdf = ~0;
>> +    }
>> +    write_unlock(&pdev->domain->vpci_rwlock);
> 
> ... the one used here.


I think it makes sense, yes.

***

There is one more thing. As far as I remember, some requests were made
against the previous version (also v7) [1]. At least one of them, I
assume, is still applicable here: the request to consider moving the
cleanup of guest_sbdf / vpci_dev_assigned_map into vpci_remove_device()
here, and aliasing vpci_deassign_device() to vpci_remove_device() in
commit #03/12.

The diff below (to be applied on top of the current patch) is my
understanding of it (not even build tested):

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index a61282cc5b..c3e6c153bc 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -51,6 +51,15 @@ void vpci_remove_device(struct pci_dev *pdev)
          return;
      }

+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
+    {
+        __clear_bit(pdev->vpci->guest_sbdf.dev,
+                    &pdev->domain->vpci_dev_assigned_map);
+        pdev->vpci->guest_sbdf.sbdf = ~0;
+    }
+#endif
+
      vpci = pdev->vpci;
      pdev->vpci = NULL;
      write_unlock(&pdev->domain->vpci_rwlock);
@@ -152,10 +161,14 @@ static int add_virtual_device(struct pci_dev *pdev)
          return -EOPNOTSUPP;
      }

+    write_lock(&pdev->domain->vpci_rwlock);
      new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
                                           VPCI_MAX_VIRT_DEV);
      if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
+    {
+        write_unlock(&pdev->domain->vpci_rwlock);
          return -ENOSPC;
+    }

      __set_bit(new_dev_number, &d->vpci_dev_assigned_map);

@@ -169,23 +182,12 @@ static int add_virtual_device(struct pci_dev *pdev)
       */
      sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
      pdev->vpci->guest_sbdf = sbdf;
+    write_unlock(&pdev->domain->vpci_rwlock);

      return 0;

  }

-static void vpci_remove_virtual_device(const struct pci_dev *pdev)
-{
-    write_lock(&pdev->domain->vpci_rwlock);
-    if ( pdev->vpci )
-    {
-        __clear_bit(pdev->vpci->guest_sbdf.dev,
-                    &pdev->domain->vpci_dev_assigned_map);
-        pdev->vpci->guest_sbdf.sbdf = ~0;
-    }
-    write_unlock(&pdev->domain->vpci_rwlock);
-}
-
  /* Notify vPCI that device is assigned to guest. */
  int vpci_assign_device(struct pci_dev *pdev)
  {
@@ -215,7 +217,6 @@ void vpci_deassign_device(struct pci_dev *pdev)
      if ( !has_vpci(pdev->domain) )
          return;

-    vpci_remove_virtual_device(pdev);
      vpci_remove_device(pdev);
  }
  #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */



[1] 
https://lore.kernel.org/xen-devel/20220719174253.541965-10-olekstysh@gmail.com/
https://lore.kernel.org/xen-devel/20220719174253.541965-3-olekstysh@gmail.com/

> 
> Jan
> 


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-07-05  7:11               ` Jan Beulich
  2023-07-05  8:59                 ` Roger Pau Monné
@ 2023-07-09 22:41                 ` Volodymyr Babchuk
  2023-07-10  6:20                   ` Jan Beulich
  1 sibling, 1 reply; 38+ messages in thread
From: Volodymyr Babchuk @ 2023-07-09 22:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Wei Liu,
	George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant,
	Roger Pau Monné


Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>> I am currently implementing your proposal (along with Jan's
>> suggestions), but I am facing ABBA deadlock with IOMMU's
>> reassign_device() call, which has this piece of code:
>> 
>>         list_move(&pdev->domain_list, &target->pdev_list);
>> 
>> My immediate change was:
>> 
>>         write_lock(&pdev->domain->pci_lock);
>>         list_del(&pdev->domain_list);
>>         write_unlock(&pdev->domain->pci_lock);
>> 
>>         write_lock(&target->pci_lock);
>>         list_add(&pdev->domain_list, &target->pdev_list);
>>         write_unlock(&target->pci_lock);
>> 
>> But this will not work because reassign_device() is called from
>> pci_release_devices(), which iterates over d->pdev_list, so we need to
>> take d->pci_lock early.
>>
>> Any suggestions on how to fix this? My idea is to remove a device from
>> the list one at a time:
>> 
>> int pci_release_devices(struct domain *d)
>> {
>>     struct pci_dev *pdev;
>>     u8 bus, devfn;
>>     int ret;
>> 
>>     pcidevs_lock();
>>     write_lock(&d->pci_lock);
>>     ret = arch_pci_clean_pirqs(d);
>>     if ( ret )
>>     {
>>         pcidevs_unlock();
>>         write_unlock(&d->pci_lock);
>>         return ret;
>>     }
>> 
>>     while ( !list_empty(&d->pdev_list) )
>>     {
>>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>>         bus = pdev->bus;
>>         devfn = pdev->devfn;
>>         list_del(&pdev->domain_list);
>>         write_unlock(&d->pci_lock);
>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>         write_lock(&d->pci_lock);
>
> I think it needs doing almost like this, but with two more tweaks and
> no list_del() right here (first and foremost to avoid needing to
> figure whether removing early isn't going to subtly break anything;
> see below for an error case that would end up with changed behavior):
>
>     while ( !list_empty(&d->pdev_list) )
>     {
>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>         uint16_t seg = pdev->seg;
>         uint8_t bus = pdev->bus;
>         uint8_t devfn = pdev->devfn;
>
>         write_unlock(&d->pci_lock);
>         ret = deassign_device(d, seg, bus, devfn) ?: ret;
>         write_lock(&d->pci_lock);
>     }
>
> One caveat though: The original list_for_each_entry_safe() guarantees
> the loop to complete; your use of list_del() would guarantee that too,
> but then the device wouldn't be on the list anymore if deassign_device()
> failed. Therefore I guess you will need another local list where you
> (temporarily) put all the devices which deassign_device() left on the
> list, and which you would then move back to d->pdev_list after the loop
> has finished.

Okay, taking into account all that you wrote, I came up with this
implementation:

int pci_release_devices(struct domain *d)
{
    int combined_ret;
    LIST_HEAD(failed_pdevs);

    pcidevs_lock();
    write_lock(&d->pci_lock);
    combined_ret = arch_pci_clean_pirqs(d);
    if ( combined_ret )
    {
        pcidevs_unlock();
        write_unlock(&d->pci_lock);
        return combined_ret;
    }

    while ( !list_empty(&d->pdev_list) )
    {
        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
                                                struct pci_dev,
                                                domain_list);
        uint16_t seg = pdev->seg;
        uint8_t bus = pdev->bus;
        uint8_t devfn = pdev->devfn;
        int ret;

        write_unlock(&d->pci_lock);
        ret = deassign_device(d, seg, bus, devfn);
        write_lock(&d->pci_lock);
        if ( ret )
        {
            bool still_present = false;
            const struct pci_dev *tmp;

            for_each_pdev ( d, tmp )
            {
                if ( tmp == (const struct pci_dev*)pdev )
                {
                    still_present = true;
                    break;
                }
            }
            if ( still_present )
                list_move(&pdev->domain_list, &failed_pdevs);
            combined_ret = ret;
        }
    }

    list_splice(&failed_pdevs, &d->pdev_list);
    write_unlock(&d->pci_lock);
    pcidevs_unlock();

    return combined_ret;
}


> (Whether it is sufficient to inspect the list head to
> determine whether the pdev is still on the list needs careful checking.)

I am not sure that this is enough. We dropped d->pci_lock, so we have
to expect that the list may have been permuted in an arbitrary way.


-- 
WBR, Volodymyr

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure
  2023-07-09 22:41                 ` Volodymyr Babchuk
@ 2023-07-10  6:20                   ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2023-07-10  6:20 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Wei Liu,
	George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant,
	Roger Pau Monné

On 10.07.2023 00:41, Volodymyr Babchuk wrote:
> 
> Hi Jan,
> 
> Jan Beulich <jbeulich@suse.com> writes:
> 
>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>> I am currently implementing your proposal (along with Jan's
>>> suggestions), but I am facing ABBA deadlock with IOMMU's
>>> reassign_device() call, which has this piece of code:
>>>
>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>
>>> My immediate change was:
>>>
>>>         write_lock(&pdev->domain->pci_lock);
>>>         list_del(&pdev->domain_list);
>>>         write_unlock(&pdev->domain->pci_lock);
>>>
>>>         write_lock(&target->pci_lock);
>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>         write_unlock(&target->pci_lock);
>>>
>>> But this will not work because reassign_device() is called from
>>> pci_release_devices(), which iterates over d->pdev_list, so we need to
>>> take d->pci_lock early.
>>>
>>> Any suggestions on how to fix this? My idea is to remove a device from
>>> the list one at a time:
>>>
>>> int pci_release_devices(struct domain *d)
>>> {
>>>     struct pci_dev *pdev;
>>>     u8 bus, devfn;
>>>     int ret;
>>>
>>>     pcidevs_lock();
>>>     write_lock(&d->pci_lock);
>>>     ret = arch_pci_clean_pirqs(d);
>>>     if ( ret )
>>>     {
>>>         pcidevs_unlock();
>>>         write_unlock(&d->pci_lock);
>>>         return ret;
>>>     }
>>>
>>>     while ( !list_empty(&d->pdev_list) )
>>>     {
>>>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>         bus = pdev->bus;
>>>         devfn = pdev->devfn;
>>>         list_del(&pdev->domain_list);
>>>         write_unlock(&d->pci_lock);
>>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>>         write_lock(&d->pci_lock);
>>
>> I think it needs doing almost like this, but with two more tweaks and
>> no list_del() right here (first and foremost to avoid needing to
>> figure whether removing early isn't going to subtly break anything;
>> see below for an error case that would end up with changed behavior):
>>
>>     while ( !list_empty(&d->pdev_list) )
>>     {
>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>         uint16_t seg = pdev->seg;
>>         uint8_t bus = pdev->bus;
>>         uint8_t devfn = pdev->devfn;
>>
>>         write_unlock(&d->pci_lock);
>>         ret = deassign_device(d, seg, bus, devfn) ?: ret;
>>         write_lock(&d->pci_lock);
>>     }
>>
>> One caveat though: The original list_for_each_entry_safe() guarantees
>> the loop to complete; your use of list_del() would guarantee that too,
>> but then the device wouldn't be on the list anymore if deassign_device()
>> failed. Therefore I guess you will need another local list where you
>> (temporarily) put all the devices which deassign_device() left on the
>> list, and which you would then move back to d->pdev_list after the loop
>> has finished.
> 
> Okay, taking into account all that you wrote, I came up with this
> implementation:

Looks plausible at first glance, but will of course need looking at
in more detail in the context of the full patch. Just one (largely)
cosmetic remark right away:

> int pci_release_devices(struct domain *d)
> {
>     int combined_ret;
>     LIST_HEAD(failed_pdevs);
> 
>     pcidevs_lock();
>     write_lock(&d->pci_lock);
>     combined_ret = arch_pci_clean_pirqs(d);
>     if ( combined_ret )
>     {
>         pcidevs_unlock();
>         write_unlock(&d->pci_lock);
>         return combined_ret;
>     }
> 
>     while ( !list_empty(&d->pdev_list) )
>     {
>         struct pci_dev *pdev = list_first_entry(&d->pdev_list,
>                                                 struct pci_dev,
>                                                 domain_list);
>         uint16_t seg = pdev->seg;
>         uint8_t bus = pdev->bus;
>         uint8_t devfn = pdev->devfn;
>         int ret;
> 
>         write_unlock(&d->pci_lock);
>         ret = deassign_device(d, seg, bus, devfn);
>         write_lock(&d->pci_lock);
>         if ( ret )
>         {
>             bool still_present = false;
>             const struct pci_dev *tmp;
> 
>             for_each_pdev ( d, tmp )
>             {
>                 if ( tmp == (const struct pci_dev*)pdev )

As I'm sure I expressed earlier, casts should be avoided whenever
possible. I see no reason for one here: Comparing pointer-to-
const-<type> with pointer-to-<type> is permitted by the language.

>                 {
>                     still_present = true;
>                     break;
>                 }
>             }
>             if ( still_present )
>                 list_move(&pdev->domain_list, &failed_pdevs);
>             combined_ret = ret;
>         }
>     }
> 
>     list_splice(&failed_pdevs, &d->pdev_list);
>     write_unlock(&d->pci_lock);
>     pcidevs_unlock();
> 
>     return combined_ret;
> }
> 
> 
>> (Whether it is sufficient to inspect the list head to
>> determine whether the pdev is still on the list needs careful checking.)
> 
> I am not sure that this is enough. We dropped d->pci_lock, so we have
> to expect that the list may have been permuted in an arbitrary way.

Right, if that cannot be excluded for sure, then better stay on the safe
side for now. A code comment would be nice though ahead of the
for_each_pdev().
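
E.g. (merely a sketch of the wording):

            /*
             * We can't rely on the pdev still being first on the list,
             * or on the list at all: d->pci_lock was dropped above, so
             * the list may have been rearranged in the meantime.
             */
            for_each_pdev ( d, tmp )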

Jan


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread (newest: ~2023-07-10  6:20 UTC)

Thread overview: 38+ messages
2023-06-13 10:32 [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 01/12] vpci: introduce per-domain lock to protect vpci structure Volodymyr Babchuk
2023-06-16 16:30   ` Roger Pau Monné
2023-06-21 22:07     ` Volodymyr Babchuk
2023-06-22  8:15       ` Roger Pau Monné
2023-06-22 21:17         ` Volodymyr Babchuk
2023-06-23  8:50           ` Roger Pau Monné
2023-06-23  9:26             ` Volodymyr Babchuk
2023-06-23 17:09               ` Jan Beulich
2023-07-04 21:03             ` Volodymyr Babchuk
2023-07-05  7:11               ` Jan Beulich
2023-07-05  8:59                 ` Roger Pau Monné
2023-07-05  9:13                   ` Jan Beulich
2023-07-07  2:02                     ` Volodymyr Babchuk
2023-07-07  6:32                       ` Jan Beulich
2023-07-09 22:41                 ` Volodymyr Babchuk
2023-07-10  6:20                   ` Jan Beulich
2023-06-20 20:17   ` Stewart Hildebrand
2023-06-22 21:18     ` Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 04/12] vpci/header: implement guest BAR register handlers Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 02/12] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 03/12] vpci: add hooks for PCI device assign/de-assign Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 06/12] vpci/header: handle p2m range sets per BAR Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 08/12] vpci/header: emulate PCI_COMMAND register for guests Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 05/12] rangeset: add RANGESETF_no_print flag Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 07/12] vpci/header: program p2m with guest BAR view Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 09/12] vpci/header: reset the command register when adding devices Volodymyr Babchuk
2023-06-13 10:32 ` [PATCH v7 11/12] xen/arm: translate virtual PCI bus topology for guests Volodymyr Babchuk
2023-06-21 12:01   ` Jan Beulich
2023-06-13 10:32 ` [PATCH v7 10/12] vpci: add initial support for virtual PCI bus topology Volodymyr Babchuk
2023-06-21 12:06   ` Jan Beulich
2023-07-07 16:45     ` Oleksandr Tyshchenko
2023-06-13 10:32 ` [PATCH v7 12/12] xen/arm: account IO handlers for emulated PCI MSI-X Volodymyr Babchuk
2023-06-15  2:01 ` [PATCH v7 00/12] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
2023-06-15  9:39   ` Volodymyr Babchuk
2023-06-15 12:16     ` Jan Beulich
2023-06-21 22:11       ` Volodymyr Babchuk
2023-06-22  7:48         ` Jan Beulich
