* [RFC PATCH 00/10] Rework PCI locking
@ 2022-08-31 14:10 Volodymyr Babchuk
  2022-08-31 14:10 ` [RFC PATCH 01/10] xen: pci: add per-domain pci list lock Volodymyr Babchuk
                   ` (10 more replies)
  0 siblings, 11 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:10 UTC
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu, Paul Durrant, Roger Pau Monné,
	Kevin Tian

Hello,

This is yet another take on the PCI locking rework. This approach
was suggested by Jan Beulich, who proposed using a reference
counter to control the lifetime of pci_dev objects.

When I started adding reference counting, it quickly became clear
that this approach can provide more granular locking instead of the
huge pcidevs_lock() that is used right now. I studied how this
lock is used and what it protects, and found the following:

0. Comment in pci.h states the following:

    /*
     * The pcidevs_lock protect alldevs_list, and the assignment for the
     * devices, it also sync the access to the msi capability that is not
     * interrupt handling related (the mask bit register).
     */

But in reality it does much more. Here is what I found:

1. Lifetime of pci_dev struct

2. Access to pseg->alldevs_list

3. Access to domain->pdev_list

4. Access to iommu->ats_list

5. Access to MSI capability

6. Some obscure stuff in IOMMU drivers: there are places that
are guarded by pcidevs_lock(), but it seems that nothing
PCI-related happens there.

7. Something that I probably overlooked

Anyway, I tried to get rid of the mighty global pcidevs_lock() by
reworking items 1-5.

This patch series does exactly this: it adds a separate lock for each
of the lists and a lock for struct pci_dev itself, adds reference
counting, and then removes pcidevs_lock() entirely. I do understand
that I should not remove the lock while items 6-7 have no locking
fixes of their own, but this is why it is an RFC. I want to discuss
whether my approach is legitimate and get some guidance from
maintainers on what should be done in addition to the presented changes.
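
To illustrate how a per-list lock and a reference counter are meant to
work together, here is a small standalone sketch. It is not code from
this series: simple_lock_t, struct dev_list, dev_find_get() and
dev_put() are made-up names, and C11 atomics stand in for Xen's
spinlock_t and atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the Xen definitions. */
typedef struct { atomic_flag locked; } simple_lock_t;

static void simple_lock(simple_lock_t *l)
{
    while ( atomic_flag_test_and_set(&l->locked) )
        ;   /* spin until the current holder releases the lock */
}

static void simple_unlock(simple_lock_t *l)
{
    atomic_flag_clear(&l->locked);
}

struct dev {
    unsigned int bdf;
    atomic_int refs;        /* lifetime of this object */
    struct dev *next;       /* list linkage, guarded by the list lock */
};

struct dev_list {
    struct dev *head;
    simple_lock_t lock;     /* guards head and every ->next */
};

/*
 * Look a device up and hand back a counted reference.  The reference is
 * taken while the list lock is still held, so the entry cannot be
 * unlinked and released between the lookup and the increment.
 */
static struct dev *dev_find_get(struct dev_list *list, unsigned int bdf)
{
    struct dev *pdev;

    simple_lock(&list->lock);
    for ( pdev = list->head; pdev; pdev = pdev->next )
        if ( pdev->bdf == bdf )
        {
            atomic_fetch_add(&pdev->refs, 1);
            break;
        }
    simple_unlock(&list->lock);

    return pdev;   /* NULL when nothing matched */
}

/* Drop a reference; returns 1 when the caller released the last one. */
static int dev_put(struct dev *pdev)
{
    return atomic_fetch_sub(&pdev->refs, 1) == 1;
}

int find_get_demo(void)
{
    struct dev a = { .bdf = 0x08, .refs = 1, .next = NULL };
    struct dev_list list = { .head = &a, .lock = { ATOMIC_FLAG_INIT } };
    struct dev *pdev = dev_find_get(&list, 0x08);

    assert(pdev == &a);
    assert(dev_find_get(&list, 0x99) == NULL);

    /* One reference is owned by the list, one by the lookup above. */
    return atomic_load(&a.refs) * 10 + dev_put(pdev);
}
```

The point of the sketch is only the ordering: the list lock is held just
long enough to find the entry and pin it, after which the caller works
on the object without any list lock.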


Volodymyr Babchuk (10):
  xen: pci: add per-domain pci list lock
  xen: pci: add pci_seg->alldevs_lock
  xen: pci: introduce ats_list_lock
  xen: add reference counter support
  xen: pci: introduce reference counting for pdev
  xen: pci: print reference counter when dumping pci_devs
  xen: pci: add per-device locking
  xen: pci: remove pcidev_[un]lock[ed] calls
  [RFC only] xen: iommu: remove last  pcidevs_lock() calls in iommu
  [RFC only] xen: pci: remove pcidev_lock() function

 xen/arch/x86/domctl.c                       |   8 -
 xen/arch/x86/hvm/vioapic.c                  |   2 -
 xen/arch/x86/hvm/vmsi.c                     |  20 +-
 xen/arch/x86/irq.c                          |  11 +-
 xen/arch/x86/msi.c                          |  68 ++++-
 xen/arch/x86/pci.c                          |   8 +-
 xen/arch/x86/physdev.c                      |  24 +-
 xen/common/domain.c                         |   1 +
 xen/common/sysctl.c                         |   7 +-
 xen/drivers/char/ns16550.c                  |   4 -
 xen/drivers/passthrough/amd/iommu.h         |   1 +
 xen/drivers/passthrough/amd/iommu_cmd.c     |   4 +-
 xen/drivers/passthrough/amd/iommu_detect.c  |   1 +
 xen/drivers/passthrough/amd/iommu_init.c    |  19 +-
 xen/drivers/passthrough/amd/iommu_map.c     |  11 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  19 +-
 xen/drivers/passthrough/msi.c               |   8 +-
 xen/drivers/passthrough/pci.c               | 267 +++++++++++---------
 xen/drivers/passthrough/vtd/intremap.c      |   2 -
 xen/drivers/passthrough/vtd/iommu.c         |  33 +--
 xen/drivers/passthrough/vtd/iommu.h         |   1 +
 xen/drivers/passthrough/vtd/qinval.c        |   3 +
 xen/drivers/passthrough/vtd/quirks.c        |   2 +
 xen/drivers/passthrough/vtd/x86/ats.c       |   3 +
 xen/drivers/passthrough/x86/iommu.c         |   5 -
 xen/drivers/video/vga.c                     |  12 +-
 xen/drivers/vpci/header.c                   |   3 +
 xen/drivers/vpci/msi.c                      |   7 +-
 xen/drivers/vpci/vpci.c                     |  10 +-
 xen/include/xen/pci.h                       |  36 ++-
 xen/include/xen/refcnt.h                    |  28 ++
 xen/include/xen/sched.h                     |   1 +
 32 files changed, 380 insertions(+), 249 deletions(-)
 create mode 100644 xen/include/xen/refcnt.h

-- 
2.36.1



* [RFC PATCH 01/10] xen: pci: add per-domain pci list lock
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
@ 2022-08-31 14:10 ` Volodymyr Babchuk
  2023-01-26 23:18   ` Stefano Stabellini
  2022-08-31 14:10 ` [RFC PATCH 04/10] xen: add reference counter support Volodymyr Babchuk
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:10 UTC
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu, Paul Durrant, Roger Pau Monné,
	Kevin Tian

domain->pdevs_lock protects access to domain->pdev_list.
As such, it should be used when adding, removing, or enumerating
PCI devices assigned to a domain.

This enables more granular locking instead of one huge pcidevs_lock that
locks the entire PCI subsystem. Please note that pcidevs_lock() is still
used; we are going to remove it in subsequent patches.
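
As a rough standalone illustration of the rule above (not code from this
patch), the sketch below guards a per-domain device list with its own
lock for both insertion and enumeration. struct dom, struct dev,
dom_add_dev() and dom_has_dev() are hypothetical, the list is singly
linked for brevity, and the spinlock is a minimal C11 stand-in for
Xen's spinlock_t. Unlike the hunks below, which unlock before each
early return, the lookup here uses a single exit path; either form
keeps lock and unlock paired.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal spinlock stand-in; Xen's spinlock_t is richer than this. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    while ( atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire) )
        ;   /* busy-wait until the holder releases the lock */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical, simplified device and domain types. */
struct dev {
    unsigned int bdf;
    struct dev *next;
};

struct dom {
    struct dev *pdev_list;   /* singly linked for brevity */
    spinlock_t pdevs_lock;   /* protects pdev_list */
};

static void dom_add_dev(struct dom *d, struct dev *pdev)
{
    spin_lock(&d->pdevs_lock);
    pdev->next = d->pdev_list;
    d->pdev_list = pdev;
    spin_unlock(&d->pdevs_lock);
}

/* Enumerate under the lock; one exit path keeps lock/unlock paired. */
static bool dom_has_dev(struct dom *d, unsigned int bdf)
{
    const struct dev *pdev;
    bool found = false;

    spin_lock(&d->pdevs_lock);
    for ( pdev = d->pdev_list; pdev; pdev = pdev->next )
        if ( pdev->bdf == bdf )
        {
            found = true;
            break;
        }
    spin_unlock(&d->pdevs_lock);

    return found;
}

int pdevs_lock_demo(void)
{
    struct dom d = { .pdev_list = NULL, .pdevs_lock = { ATOMIC_FLAG_INIT } };
    struct dev a = { .bdf = 0x08 }, b = { .bdf = 0x10 };

    dom_add_dev(&d, &a);
    dom_add_dev(&d, &b);
    assert(d.pdev_list == &b);   /* newest entry sits at the head */

    /* Encode three lookups as bits: present, present, absent. */
    return (dom_has_dev(&d, 0x08) << 2) |
           (dom_has_dev(&d, 0x10) << 1) |
            dom_has_dev(&d, 0x99);
}
```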

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
 xen/common/domain.c                         |  1 +
 xen/drivers/passthrough/amd/iommu_cmd.c     |  4 ++-
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  7 ++++-
 xen/drivers/passthrough/pci.c               | 29 ++++++++++++++++++++-
 xen/drivers/passthrough/vtd/iommu.c         |  9 +++++--
 xen/drivers/vpci/header.c                   |  3 +++
 xen/drivers/vpci/msi.c                      |  7 ++++-
 xen/drivers/vpci/vpci.c                     |  4 +--
 xen/include/xen/pci.h                       |  2 +-
 xen/include/xen/sched.h                     |  1 +
 10 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 7062393e37..4611141b87 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -618,6 +618,7 @@ struct domain *domain_create(domid_t domid,
 
 #ifdef CONFIG_HAS_PCI
     INIT_LIST_HEAD(&d->pdev_list);
+    spin_lock_init(&d->pdevs_lock);
 #endif
 
     /* All error paths can depend on the above setup. */
diff --git a/xen/drivers/passthrough/amd/iommu_cmd.c b/xen/drivers/passthrough/amd/iommu_cmd.c
index 40ddf366bb..47c45398d4 100644
--- a/xen/drivers/passthrough/amd/iommu_cmd.c
+++ b/xen/drivers/passthrough/amd/iommu_cmd.c
@@ -308,11 +308,12 @@ void amd_iommu_flush_iotlb(u8 devfn, const struct pci_dev *pdev,
     flush_command_buffer(iommu, iommu_dev_iotlb_timeout);
 }
 
-static void amd_iommu_flush_all_iotlbs(const struct domain *d, daddr_t daddr,
+static void amd_iommu_flush_all_iotlbs(struct domain *d, daddr_t daddr,
                                        unsigned int order)
 {
     struct pci_dev *pdev;
 
+    spin_lock(&d->pdevs_lock);
     for_each_pdev( d, pdev )
     {
         u8 devfn = pdev->devfn;
@@ -323,6 +324,7 @@ static void amd_iommu_flush_all_iotlbs(const struct domain *d, daddr_t daddr,
         } while ( devfn != pdev->devfn &&
                   PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
     }
+    spin_unlock(&d->pdevs_lock);
 }
 
 /* Flush iommu cache after p2m changes. */
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 4ba8e764b2..64c016491d 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -96,20 +96,25 @@ static int __must_check allocate_domain_resources(struct domain *d)
     return rc;
 }
 
-static bool any_pdev_behind_iommu(const struct domain *d,
+static bool any_pdev_behind_iommu(struct domain *d,
                                   const struct pci_dev *exclude,
                                   const struct amd_iommu *iommu)
 {
     const struct pci_dev *pdev;
 
+    spin_lock(&d->pdevs_lock);
     for_each_pdev ( d, pdev )
     {
         if ( pdev == exclude )
             continue;
 
         if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu )
+        {
+            spin_unlock(&d->pdevs_lock);
             return true;
+        }
     }
+    spin_unlock(&d->pdevs_lock);
 
     return false;
 }
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index cdaf5c247f..4366f8f965 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -523,7 +523,9 @@ static void __init _pci_hide_device(struct pci_dev *pdev)
     if ( pdev->domain )
         return;
     pdev->domain = dom_xen;
+    spin_lock(&dom_xen->pdevs_lock);
     list_add(&pdev->domain_list, &dom_xen->pdev_list);
+    spin_unlock(&dom_xen->pdevs_lock);
 }
 
 int __init pci_hide_device(unsigned int seg, unsigned int bus,
@@ -595,7 +597,7 @@ struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf)
     return pdev;
 }
 
-struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
+struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
 {
     struct pci_dev *pdev;
 
@@ -620,9 +622,16 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
                 return pdev;
     }
     else
+    {
+        spin_lock(&d->pdevs_lock);
         list_for_each_entry ( pdev, &d->pdev_list, domain_list )
             if ( pdev->sbdf.bdf == sbdf.bdf )
+            {
+                spin_unlock(&d->pdevs_lock);
                 return pdev;
+            }
+        spin_unlock(&d->pdevs_lock);
+    }
 
     return NULL;
 }
@@ -817,7 +826,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
     if ( !pdev->domain )
     {
         pdev->domain = hardware_domain;
+        spin_lock(&hardware_domain->pdevs_lock);
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
+        spin_unlock(&hardware_domain->pdevs_lock);
 
         /*
          * For devices not discovered by Xen during boot, add vPCI handlers
@@ -827,7 +838,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
+            spin_lock(&pdev->domain->pdevs_lock);
             list_del(&pdev->domain_list);
+            spin_unlock(&pdev->domain->pdevs_lock);
             pdev->domain = NULL;
             goto out;
         }
@@ -835,7 +848,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             vpci_remove_device(pdev);
+            spin_lock(&pdev->domain->pdevs_lock);
             list_del(&pdev->domain_list);
+            spin_unlock(&pdev->domain->pdevs_lock);
             pdev->domain = NULL;
             goto out;
         }
@@ -885,7 +900,11 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
             pci_cleanup_msi(pdev);
             ret = iommu_remove_device(pdev);
             if ( pdev->domain )
+            {
+                spin_lock(&pdev->domain->pdevs_lock);
                 list_del(&pdev->domain_list);
+                spin_unlock(&pdev->domain->pdevs_lock);
+            }
             printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
             free_pdev(pseg, pdev);
             break;
@@ -967,12 +986,14 @@ int pci_release_devices(struct domain *d)
         pcidevs_unlock();
         return ret;
     }
+    spin_lock(&d->pdevs_lock);
     list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
     {
         bus = pdev->bus;
         devfn = pdev->devfn;
         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
     }
+    spin_unlock(&d->pdevs_lock);
     pcidevs_unlock();
 
     return ret;
@@ -1194,7 +1215,9 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
             if ( !pdev->domain )
             {
                 pdev->domain = ctxt->d;
+                spin_lock(&pdev->domain->pdevs_lock);
                 list_add(&pdev->domain_list, &ctxt->d->pdev_list);
+                spin_unlock(&pdev->domain->pdevs_lock);
                 setup_one_hwdom_device(ctxt, pdev);
             }
             else if ( pdev->domain == dom_xen )
@@ -1556,6 +1579,7 @@ static int iommu_get_device_group(
         return group_id;
 
     pcidevs_lock();
+    spin_lock(&d->pdevs_lock);
     for_each_pdev( d, pdev )
     {
         unsigned int b = pdev->bus;
@@ -1571,6 +1595,7 @@ static int iommu_get_device_group(
         if ( sdev_id < 0 )
         {
             pcidevs_unlock();
+            spin_unlock(&d->pdevs_lock);
             return sdev_id;
         }
 
@@ -1581,6 +1606,7 @@ static int iommu_get_device_group(
             if ( unlikely(copy_to_guest_offset(buf, i, &bdf, 1)) )
             {
                 pcidevs_unlock();
+                spin_unlock(&d->pdevs_lock);
                 return -EFAULT;
             }
             i++;
@@ -1588,6 +1614,7 @@ static int iommu_get_device_group(
     }
 
     pcidevs_unlock();
+    spin_unlock(&d->pdevs_lock);
 
     return i;
 }
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 62e143125d..fff1442265 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -183,12 +183,13 @@ static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
     }
 }
 
-static bool any_pdev_behind_iommu(const struct domain *d,
+static bool any_pdev_behind_iommu(struct domain *d,
                                   const struct pci_dev *exclude,
                                   const struct vtd_iommu *iommu)
 {
     const struct pci_dev *pdev;
 
+    spin_lock(&d->pdevs_lock);
     for_each_pdev ( d, pdev )
     {
         const struct acpi_drhd_unit *drhd;
@@ -198,8 +199,12 @@ static bool any_pdev_behind_iommu(const struct domain *d,
 
         drhd = acpi_find_matched_drhd_unit(pdev);
         if ( drhd && drhd->iommu == iommu )
+        {
+            spin_unlock(&d->pdevs_lock);
             return true;
+        }
     }
+    spin_unlock(&d->pdevs_lock);
 
     return false;
 }
@@ -208,7 +213,7 @@ static bool any_pdev_behind_iommu(const struct domain *d,
  * If no other devices under the same iommu owned by this domain,
  * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap.
  */
-static void check_cleanup_domid_map(const struct domain *d,
+static void check_cleanup_domid_map(struct domain *d,
                                     const struct pci_dev *exclude,
                                     struct vtd_iommu *iommu)
 {
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index a1c928a0d2..a59aa7ad0b 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -267,6 +267,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * Check for overlaps with other BARs. Note that only BARs that are
      * currently mapped (enabled) are checked for overlaps.
      */
+    spin_lock(&pdev->domain->pdevs_lock);
     for_each_pdev ( pdev->domain, tmp )
     {
         if ( tmp == pdev )
@@ -306,11 +307,13 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                 printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
                        start, end, rc);
                 rangeset_destroy(mem);
+                spin_unlock(&pdev->domain->pdevs_lock);
                 return rc;
             }
         }
     }
 
+    spin_unlock(&pdev->domain->pdevs_lock);
     ASSERT(dev);
 
     if ( system_state < SYS_STATE_active )
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 8f2b59e61a..8969c335b0 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -265,7 +265,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -277,6 +277,9 @@ void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !spin_trylock(&d->pdevs_lock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -326,6 +329,8 @@ void vpci_dump_msi(void)
             spin_unlock(&pdev->vpci->lock);
             process_pending_softirqs();
         }
+        spin_unlock(&d->pdevs_lock);
+
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 3467c0de86..7d1f9fd438 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -312,7 +312,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -415,7 +415,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 5975ca2f30..19047b4b20 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -177,7 +177,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
 int pci_remove_device(u16 seg, u8 bus, u8 devfn);
 int pci_ro_device(int seg, int bus, int devfn);
 int pci_hide_device(unsigned int seg, unsigned int bus, unsigned int devfn);
-struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf);
+struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf);
 struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf);
 void pci_check_disable_device(u16 seg, u8 bus, u8 devfn);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 1cf629e7ec..0775228ba9 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -457,6 +457,7 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+    spinlock_t pdevs_lock;
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
-- 
2.36.1



* [RFC PATCH 03/10] xen: pci: introduce ats_list_lock
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
  2022-08-31 14:10 ` [RFC PATCH 01/10] xen: pci: add per-domain pci list lock Volodymyr Babchuk
  2022-08-31 14:10 ` [RFC PATCH 04/10] xen: add reference counter support Volodymyr Babchuk
@ 2022-08-31 14:10 ` Volodymyr Babchuk
  2023-01-26 23:56   ` Stefano Stabellini
  2022-08-31 14:10 ` [RFC PATCH 02/10] xen: pci: add pci_seg->alldevs_lock Volodymyr Babchuk
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:10 UTC
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Jan Beulich,
	Andrew Cooper, Paul Durrant, Roger Pau Monné,
	Kevin Tian

The ATS subsystem has its own list of PCI devices. As we are going to
remove the global pcidevs_lock() in favor of more granular locking, we
need to ensure that this list is protected somehow. To do this, we add
an additional lock for each IOMMU, as the list to be protected is also
part of the IOMMU.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
 xen/drivers/passthrough/amd/iommu.h         |  1 +
 xen/drivers/passthrough/amd/iommu_detect.c  |  1 +
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  8 ++++++++
 xen/drivers/passthrough/pci.c               |  1 +
 xen/drivers/passthrough/vtd/iommu.c         | 11 +++++++++++
 xen/drivers/passthrough/vtd/iommu.h         |  1 +
 xen/drivers/passthrough/vtd/qinval.c        |  3 +++
 xen/drivers/passthrough/vtd/x86/ats.c       |  3 +++
 8 files changed, 29 insertions(+)

diff --git a/xen/drivers/passthrough/amd/iommu.h b/xen/drivers/passthrough/amd/iommu.h
index 8bc3c35b1b..edd6eb52b3 100644
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -106,6 +106,7 @@ struct amd_iommu {
     int enabled;
 
     struct list_head ats_devices;
+    spinlock_t ats_list_lock;
 };
 
 struct ivrs_unity_map {
diff --git a/xen/drivers/passthrough/amd/iommu_detect.c b/xen/drivers/passthrough/amd/iommu_detect.c
index 2317fa6a7d..1d6f4f2168 100644
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -160,6 +160,7 @@ int __init amd_iommu_detect_one_acpi(
     }
 
     spin_lock_init(&iommu->lock);
+    spin_lock_init(&iommu->ats_list_lock);
     INIT_LIST_HEAD(&iommu->ats_devices);
 
     iommu->seg = ivhd_block->pci_segment_group;
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 64c016491d..955f3af57a 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -276,7 +276,11 @@ static int __must_check amd_iommu_setup_domain_device(
          !pci_ats_enabled(iommu->seg, bus, pdev->devfn) )
     {
         if ( devfn == pdev->devfn )
+        {
+            spin_lock(&iommu->ats_list_lock);
             enable_ats_device(pdev, &iommu->ats_devices);
+            spin_unlock(&iommu->ats_list_lock);
+        }
 
         amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
     }
@@ -416,7 +420,11 @@ static void amd_iommu_disable_domain_device(const struct domain *domain,
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
          pci_ats_enabled(iommu->seg, bus, pdev->devfn) )
+    {
+        spin_lock(&iommu->ats_list_lock);
         disable_ats_device(pdev);
+        spin_unlock(&iommu->ats_list_lock);
+    }
 
     BUG_ON ( iommu->dev_table.buffer == NULL );
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF(bus, devfn));
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 2dfa1c2875..b5db5498a1 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1641,6 +1641,7 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
 {
     pcidevs_lock();
 
+    /* iommu->ats_list_lock is taken by the caller of this function */
     disable_ats_device(pdev);
 
     ASSERT(pdev->domain);
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index fff1442265..42661f22f4 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1281,6 +1281,7 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
     spin_lock_init(&iommu->lock);
     spin_lock_init(&iommu->register_lock);
     spin_lock_init(&iommu->intremap.lock);
+    spin_lock_init(&iommu->ats_list_lock);
 
     iommu->drhd = drhd;
     drhd->iommu = iommu;
@@ -1769,7 +1770,11 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
         if ( ret > 0 )
             ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
+        {
+            spin_lock(&drhd->iommu->ats_list_lock);
             enable_ats_device(pdev, &drhd->iommu->ats_devices);
+            spin_unlock(&drhd->iommu->ats_list_lock);
+        }
 
         break;
 
@@ -1977,7 +1982,11 @@ static const struct acpi_drhd_unit *domain_context_unmap(
                    domain, &PCI_SBDF(seg, bus, devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn);
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
+        {
+            spin_lock(&iommu->ats_list_lock);
             disable_ats_device(pdev);
+            spin_unlock(&iommu->ats_list_lock);
+        }
 
         break;
 
@@ -2374,7 +2383,9 @@ static int cf_check intel_iommu_enable_device(struct pci_dev *pdev)
     if ( ret <= 0 )
         return ret;
 
+    spin_lock(&drhd->iommu->ats_list_lock);
     ret = enable_ats_device(pdev, &drhd->iommu->ats_devices);
+    spin_unlock(&drhd->iommu->ats_list_lock);
 
     return ret >= 0 ? 0 : ret;
 }
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 78aa8a96f5..2a7a4c1b58 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -506,6 +506,7 @@ struct vtd_iommu {
     } flush;
 
     struct list_head ats_devices;
+    spinlock_t ats_list_lock;
     unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
     unsigned long *domid_bitmap;  /* domain id bitmap */
     domid_t *domid_map;           /* domain id mapping array */
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index 4f9ad136b9..6e876348db 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -238,7 +238,10 @@ static int __must_check dev_invalidate_sync(struct vtd_iommu *iommu,
         if ( d == NULL )
             return rc;
 
+        spin_lock(&iommu->ats_list_lock);
         iommu_dev_iotlb_flush_timeout(d, pdev);
+        spin_unlock(&iommu->ats_list_lock);
+
         rcu_unlock_domain(d);
     }
     else if ( rc == -ETIMEDOUT )
diff --git a/xen/drivers/passthrough/vtd/x86/ats.c b/xen/drivers/passthrough/vtd/x86/ats.c
index 04d702b1d6..55e991183b 100644
--- a/xen/drivers/passthrough/vtd/x86/ats.c
+++ b/xen/drivers/passthrough/vtd/x86/ats.c
@@ -117,6 +117,7 @@ int dev_invalidate_iotlb(struct vtd_iommu *iommu, u16 did,
     if ( !ecap_dev_iotlb(iommu->ecap) )
         return ret;
 
+    spin_lock(&iommu->ats_list_lock);
     list_for_each_entry_safe( pdev, temp, &iommu->ats_devices, ats.list )
     {
         bool_t sbit;
@@ -155,12 +156,14 @@ int dev_invalidate_iotlb(struct vtd_iommu *iommu, u16 did,
             break;
         default:
             dprintk(XENLOG_WARNING VTDPREFIX, "invalid vt-d flush type\n");
+            spin_unlock(&iommu->ats_list_lock);
             return -EOPNOTSUPP;
         }
 
         if ( !ret )
             ret = rc;
     }
+    spin_unlock(&iommu->ats_list_lock);
 
     return ret;
 }
-- 
2.36.1



* [RFC PATCH 04/10] xen: add reference counter support
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
  2022-08-31 14:10 ` [RFC PATCH 01/10] xen: pci: add per-domain pci list lock Volodymyr Babchuk
@ 2022-08-31 14:10 ` Volodymyr Babchuk
  2023-02-15 11:20   ` Jan Beulich
  2022-08-31 14:10 ` [RFC PATCH 03/10] xen: pci: introduce ats_list_lock Volodymyr Babchuk
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:10 UTC
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu

We can use a reference counter to ease object lifetime management.
This patch adds very basic support for reference counters. refcnt
should be used in the following way:

1. The protected structure should have a refcnt_t field.

2. This field should be initialized with refcnt_init() during object
construction.

3. If code holds a valid pointer to a structure/object, it can increase
the refcount with refcnt_get(). No additional locking is required.

4. Code should call refcnt_put() before dropping a pointer to a
protected structure. `destructor` is a callback function that should
destruct the object and free all resources, including the protected
structure itself. The destructor will be called when the reference
counter reaches zero.

5. If code does not hold a valid pointer to a protected structure, it
should use another locking mechanism to obtain a pointer. For example,
it should lock the list that holds the protected objects.
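
Rules 1-4 can be sketched in standalone C roughly as follows. This is
an illustration under assumptions, not the header added by this patch:
C11 atomic_int stands in for Xen's atomic_t, refcnt_get() is simplified
to a plain increment with an assertion (the patch uses
atomic_add_unless()), and struct widget, widget_create() and
widget_destroy() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the patch's refcnt_t, which is built on Xen's atomic_t. */
typedef atomic_int refcnt_t;

static void refcnt_init(refcnt_t *refcnt)
{
    atomic_init(refcnt, 1);   /* the creator owns the first reference */
}

static void refcnt_get(refcnt_t *refcnt)
{
    int old = atomic_fetch_add(refcnt, 1);

    /* A caller must already hold a reference, so the count was nonzero. */
    assert(old > 0);
}

static void refcnt_put(refcnt_t *refcnt, void (*destructor)(refcnt_t *refcnt))
{
    /* fetch_sub returns the old value: 1 means the last ref was dropped. */
    if ( atomic_fetch_sub(refcnt, 1) == 1 )
        destructor(refcnt);
}

/* Hypothetical protected object (rule 1: it embeds a refcnt_t field). */
struct widget {
    int id;
    refcnt_t refcnt;
};

static int widgets_destroyed;   /* only used to observe the destructor */

static void widget_destroy(refcnt_t *refcnt)
{
    /* Recover the enclosing object from the embedded counter. */
    struct widget *w =
        (struct widget *)((char *)refcnt - offsetof(struct widget, refcnt));

    widgets_destroyed++;
    free(w);
}

static struct widget *widget_create(int id)
{
    struct widget *w = calloc(1, sizeof(*w));

    if ( !w )
        return NULL;
    w->id = id;
    refcnt_init(&w->refcnt);   /* rule 2 */
    return w;
}

/* Encodes the destructor count after the first and after the last put. */
int refcnt_demo(void)
{
    struct widget *w = widget_create(7);
    int after_first_put;

    if ( !w )
        return -1;

    refcnt_get(&w->refcnt);                  /* rule 3: a second holder */
    refcnt_put(&w->refcnt, widget_destroy);  /* rule 4: first holder done */
    after_first_put = widgets_destroyed;     /* one reference still left */
    refcnt_put(&w->refcnt, widget_destroy);  /* last reference: destroyed */

    return after_first_put * 10 + widgets_destroyed;
}
```

Passing the counter itself to the destructor means the destructor has
to map it back to the enclosing structure, which is what the offsetof()
arithmetic does here.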

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
 xen/include/xen/refcnt.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 xen/include/xen/refcnt.h

diff --git a/xen/include/xen/refcnt.h b/xen/include/xen/refcnt.h
new file mode 100644
index 0000000000..7f5395a21c
--- /dev/null
+++ b/xen/include/xen/refcnt.h
@@ -0,0 +1,28 @@
+#ifndef __XEN_REFCNT_H__
+#define __XEN_REFCNT_H__
+
+#include <asm/atomic.h>
+
+typedef atomic_t refcnt_t;
+
+static inline void refcnt_init(refcnt_t *refcnt)
+{
+	atomic_set(refcnt, 1);
+}
+
+static inline void refcnt_get(refcnt_t *refcnt)
+{
+#ifndef NDEBUG
+	ASSERT(atomic_add_unless(refcnt, 1, 0) > 0);
+#else
+	atomic_add_unless(refcnt, 1, 0);
+#endif
+}
+
+static inline void refcnt_put(refcnt_t *refcnt, void (*destructor)(refcnt_t *refcnt))
+{
+	if ( atomic_dec_and_test(refcnt) )
+		destructor(refcnt);
+}
+
+#endif
-- 
2.36.1



* [RFC PATCH 02/10] xen: pci: add pci_seg->alldevs_lock
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
                   ` (2 preceding siblings ...)
  2022-08-31 14:10 ` [RFC PATCH 03/10] xen: pci: introduce ats_list_lock Volodymyr Babchuk
@ 2022-08-31 14:10 ` Volodymyr Babchuk
  2023-01-26 23:40   ` Stefano Stabellini
  2023-02-28 16:32   ` Jan Beulich
  2022-08-31 14:11 ` [RFC PATCH 05/10] xen: pci: introduce reference counting for pdev Volodymyr Babchuk
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:10 UTC
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Jan Beulich,
	Paul Durrant, Roger Pau Monné

This lock protects the alldevs_list of struct pci_seg. As such, it
should be used when adding, removing, or enumerating PCI devices
assigned to a PCI segment.

The radix tree that stores PCI segments has its own locking mechanism,
and pci_seg structures are only allocated and never freed, so we need no
additional locking to access pci_seg structures. But we do need a lock
that protects the alldevs_list field.

This enables more granular locking instead of one huge pcidevs_lock
that locks the entire PCI subsystem. Please note that pcidevs_lock() is
still used; we are going to remove it in subsequent patches.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
 xen/drivers/passthrough/pci.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 4366f8f965..2dfa1c2875 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -38,6 +38,7 @@
 
 struct pci_seg {
     struct list_head alldevs_list;
+    spinlock_t alldevs_lock;
     u16 nr;
     unsigned long *ro_map;
     /* bus2bridge_lock protects bus2bridge array */
@@ -93,6 +94,7 @@ static struct pci_seg *alloc_pseg(u16 seg)
     pseg->nr = seg;
     INIT_LIST_HEAD(&pseg->alldevs_list);
     spin_lock_init(&pseg->bus2bridge_lock);
+    spin_lock_init(&pseg->alldevs_lock);
 
     if ( radix_tree_insert(&pci_segments, seg, pseg) )
     {
@@ -385,9 +387,13 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
     unsigned int pos;
     int rc;
 
+    spin_lock(&pseg->alldevs_lock);
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
+        {
+            spin_unlock(&pseg->alldevs_lock);
             return pdev;
+        }
 
     pdev = xzalloc(struct pci_dev);
     if ( !pdev )
@@ -404,10 +410,12 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
     if ( rc )
     {
         xfree(pdev);
+        spin_unlock(&pseg->alldevs_lock);
         return NULL;
     }
 
     list_add(&pdev->alldevs_list, &pseg->alldevs_list);
+    spin_unlock(&pseg->alldevs_lock);
 
     /* update bus2bridge */
     switch ( pdev->type = pdev_type(pseg->nr, bus, devfn) )
@@ -611,15 +619,20 @@ struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
      */
     if ( !d || is_hardware_domain(d) )
     {
-        const struct pci_seg *pseg = get_pseg(sbdf.seg);
+        struct pci_seg *pseg = get_pseg(sbdf.seg);
 
         if ( !pseg )
             return NULL;
 
+        spin_lock(&pseg->alldevs_lock);
         list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
             if ( pdev->sbdf.bdf == sbdf.bdf &&
                  (!d || pdev->domain == d) )
+            {
+                spin_unlock(&pseg->alldevs_lock);
                 return pdev;
+            }
+        spin_unlock(&pseg->alldevs_lock);
     }
     else
     {
@@ -893,6 +906,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
         return -ENODEV;
 
     pcidevs_lock();
+    spin_lock(&pseg->alldevs_lock);
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
         {
@@ -907,10 +921,12 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
             }
             printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
             free_pdev(pseg, pdev);
+            list_del(&pdev->alldevs_list);
             break;
         }
 
     pcidevs_unlock();
+    spin_unlock(&pseg->alldevs_lock);
     return ret;
 }
 
@@ -1363,6 +1379,7 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
 
     printk("==== segment %04x ====\n", pseg->nr);
 
+    spin_lock(&pseg->alldevs_lock);
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
         printk("%pp - ", &pdev->sbdf);
@@ -1376,6 +1393,7 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
         pdev_dump_msi(pdev);
         printk("\n");
     }
+    spin_unlock(&pseg->alldevs_lock);
 
     return 0;
 }
-- 
2.36.1



* [RFC PATCH 06/10] xen: pci: print reference counter when dumping pci_devs
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
                   ` (4 preceding siblings ...)
  2022-08-31 14:11 ` [RFC PATCH 05/10] xen: pci: introduce reference counting for pdev Volodymyr Babchuk
@ 2022-08-31 14:11 ` Volodymyr Babchuk
  2022-08-31 14:11 ` [RFC PATCH 07/10] xen: pci: add per-device locking Volodymyr Babchuk
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Jan Beulich,
	Paul Durrant, Roger Pau Monné

This can be handy when evaluating the new reference counting approach.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
 xen/drivers/passthrough/pci.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index a6c6368769..c8da80b981 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1381,7 +1381,8 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
         else
 #endif
             printk("%pd", pdev->domain);
-        printk(" - node %-3d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
+        printk(" - node %-3d refcnt %d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1,
+               atomic_read(&pdev->refcnt));
         pdev_dump_msi(pdev);
         printk("\n");
     }
-- 
2.36.1



* [RFC PATCH 05/10] xen: pci: introduce reference counting for pdev
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
                   ` (3 preceding siblings ...)
  2022-08-31 14:10 ` [RFC PATCH 02/10] xen: pci: add pci_seg->alldevs_lock Volodymyr Babchuk
@ 2022-08-31 14:11 ` Volodymyr Babchuk
  2023-01-27  0:43   ` Stefano Stabellini
  2023-02-28 17:06   ` Jan Beulich
  2022-08-31 14:11 ` [RFC PATCH 06/10] xen: pci: print reference counter when dumping pci_devs Volodymyr Babchuk
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Jan Beulich,
	Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Kevin Tian

Prior to this change, the lifetime of pci_dev objects was protected by the
global pcidevs_lock(). We are going to get rid of this lock, so we need
some other mechanism to ensure that those objects do not disappear from
under the code that accesses them. Reference counting is a good choice, as
it provides an easy-to-comprehend way to control object lifetime with
better granularity than a global super lock.

This patch adds two new helper functions: pcidev_get() and
pcidev_put(). pcidev_get() increases the reference counter, while
pcidev_put() decreases it, destroying the object when the counter
reaches zero.
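
The get/put semantics described above can be sketched as follows. This
is a minimal stand-alone illustration using C11 atomics; it does not
reproduce Xen's refcnt_t API, all demo_* names are hypothetical, and it
assumes the counter starts at 1 with the creator owning that first
reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted object; not Xen's struct pci_dev or refcnt_t. */
struct demo_obj {
    atomic_int refcnt;
    int *released;      /* set to 1 by the release path, for demonstration */
};

static struct demo_obj *demo_obj_alloc(int *released)
{
    struct demo_obj *obj = malloc(sizeof(*obj));

    if ( !obj )
        return NULL;

    /* Assumption: the creator holds the first reference. */
    atomic_init(&obj->refcnt, 1);
    obj->released = released;

    return obj;
}

/* Take an additional reference; the caller must already hold a valid one. */
static void demo_obj_get(struct demo_obj *obj)
{
    int old = atomic_fetch_add(&obj->refcnt, 1);

    assert(old > 0);
    (void)old;
}

/* Drop a reference; whoever drops the last one destroys the object. */
static void demo_obj_put(struct demo_obj *obj)
{
    int old = atomic_fetch_sub(&obj->refcnt, 1);

    assert(old > 0);
    if ( old == 1 )
    {
        *obj->released = 1;
        free(obj);
    }
}
```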

pcidev_get() should be used only when you already have a valid pointer
to the object or you are holding the lock that protects one of the
lists (domain, pseg or ats) that store pci_dev structs.

pcidev_get() is rarely used directly, because there already are
functions that provide a valid pointer to a pci_dev struct:
pci_get_pdev() and pci_get_real_pdev(). They lock the appropriate
list, find the needed object and increase its reference counter before
returning to the caller.

Naturally, pcidev_put() should be called after finishing work with a
received object. This is the reason why this patch has so many
pcidev_put()s and so few pcidev_get()s: existing calls to
pci_get_*() functions now increase the reference counter
automatically, so we just need to decrease it again when we are finished.
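
The resulting calling convention can be sketched as below: the lookup
takes the reference while the list lock is still held, and the caller
drops it on every exit path that follows a successful lookup. This is an
illustrative stand-alone sketch; the demo_* names, the error values and
the atomic_flag spinlock are assumptions, not the Xen implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical device and list types, for illustration only. */
struct demo_pdev {
    struct demo_pdev *next;
    unsigned int bdf;
    int nr_entries;
    atomic_int refcnt;
};

struct demo_list {
    struct demo_pdev *head;
    atomic_flag lock;           /* protects the list */
};

static void demo_list_init(struct demo_list *list, struct demo_pdev *head)
{
    list->head = head;
    atomic_flag_clear(&list->lock);
}

static void demo_put(struct demo_pdev *pdev)
{
    int old = atomic_fetch_sub(&pdev->refcnt, 1);

    assert(old > 0);
    (void)old;
    /* A full implementation would release the object when old == 1. */
}

/* Lookup: the reference is taken before the list lock is dropped. */
static struct demo_pdev *demo_get_pdev(struct demo_list *list,
                                       unsigned int bdf)
{
    struct demo_pdev *pdev;

    while ( atomic_flag_test_and_set_explicit(&list->lock,
                                              memory_order_acquire) )
        ;
    for ( pdev = list->head; pdev; pdev = pdev->next )
        if ( pdev->bdf == bdf )
        {
            atomic_fetch_add(&pdev->refcnt, 1);
            break;
        }
    atomic_flag_clear_explicit(&list->lock, memory_order_release);

    return pdev;
}

/* Caller: one put for the one reference handed out by the lookup. */
static int demo_check_entry(struct demo_list *list, unsigned int bdf,
                            int entry)
{
    struct demo_pdev *pdev = demo_get_pdev(list, bdf);
    int ret = 0;

    if ( !pdev )
        return -1;      /* no reference was taken, so nothing to put */

    if ( entry >= pdev->nr_entries )
        ret = -2;       /* error path still falls through to the put */

    demo_put(pdev);

    return ret;
}
```

Funnelling the error path through a single put, rather than adding a put
before each early return, keeps the get/put pairing easy to audit.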

This patch removes the "const" qualifier from some pdev pointers because
pcidev_put() technically alters the contents of the pci_dev structure.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---

- Jan, can I add your Suggested-by tag?
---
 xen/arch/x86/hvm/vmsi.c                  |   2 +-
 xen/arch/x86/irq.c                       |   4 +
 xen/arch/x86/msi.c                       |  41 ++++++-
 xen/arch/x86/pci.c                       |   4 +-
 xen/arch/x86/physdev.c                   |  17 ++-
 xen/common/sysctl.c                      |   5 +-
 xen/drivers/passthrough/amd/iommu_init.c |  12 ++-
 xen/drivers/passthrough/amd/iommu_map.c  |   6 +-
 xen/drivers/passthrough/pci.c            | 131 +++++++++++++++--------
 xen/drivers/passthrough/vtd/quirks.c     |   2 +
 xen/drivers/video/vga.c                  |  10 +-
 xen/drivers/vpci/vpci.c                  |   6 +-
 xen/include/xen/pci.h                    |  18 ++++
 13 files changed, 201 insertions(+), 57 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 75f92885dc..7fb1075673 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -912,7 +912,7 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 
             spin_unlock(&msix->pdev->vpci->lock);
             process_pending_softirqs();
-            /* NB: we assume that pdev cannot go away for an alive domain. */
+
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
             if ( pdev->vpci->msix != msix )
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index cd0c8a30a8..d8672a03e1 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2174,6 +2174,7 @@ int map_domain_pirq(
                 msi->entry_nr = ret;
                 ret = -ENFILE;
             }
+	    pcidev_put(pdev);
             goto done;
         }
 
@@ -2188,6 +2189,7 @@ int map_domain_pirq(
             msi_desc->irq = -1;
             msi_free_irq(msi_desc);
             ret = -EBUSY;
+	    pcidev_put(pdev);
             goto done;
         }
 
@@ -2272,10 +2274,12 @@ int map_domain_pirq(
             }
             msi_desc->irq = -1;
             msi_free_irq(msi_desc);
+	    pcidev_put(pdev);
             goto done;
         }
 
         set_domain_irq_pirq(d, irq, info);
+	pcidev_put(pdev);
         spin_unlock_irqrestore(&desc->lock, flags);
     }
     else
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d0bf63df1d..bccaccb98b 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -572,6 +572,10 @@ int msi_free_irq(struct msi_desc *entry)
                         virt_to_fix((unsigned long)entry->mask_base));
 
     list_del(&entry->list);
+
+    /* Corresponds to pcidev_get() in msi[x]_capability_init()  */
+    pcidev_put(entry->dev);
+
     xfree(entry);
     return 0;
 }
@@ -644,6 +648,7 @@ static int msi_capability_init(struct pci_dev *dev,
             entry[i].msi.mpos = mpos;
         entry[i].msi.nvec = 0;
         entry[i].dev = dev;
+	pcidev_get(dev);
     }
     entry->msi.nvec = nvec;
     entry->irq = irq;
@@ -703,22 +708,36 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
              !num_vf || !offset || (num_vf > 1 && !stride) ||
              bir >= PCI_SRIOV_NUM_BARS ||
              !pdev->vf_rlen[bir] )
+        {
+            if ( pdev )
+                pcidev_put(pdev);
             return 0;
+        }
         base = pos + PCI_SRIOV_BAR;
         vf -= PCI_BDF(bus, slot, func) + offset;
         if ( vf < 0 )
+        {
+            pcidev_put(pdev);
             return 0;
+        }
         if ( stride )
         {
             if ( vf % stride )
+            {
+                pcidev_put(pdev);
                 return 0;
+            }
             vf /= stride;
         }
         if ( vf >= num_vf )
+        {
+            pcidev_put(pdev);
             return 0;
+        }
         BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
         disp = vf * pdev->vf_rlen[bir];
         limit = PCI_SRIOV_NUM_BARS;
+        pcidev_put(pdev);
     }
     else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func),
                                  PCI_HEADER_TYPE) & 0x7f )
@@ -925,6 +944,8 @@ static int msix_capability_init(struct pci_dev *dev,
         entry->dev = dev;
         entry->mask_base = base;
 
+	pcidev_get(dev);
+
         list_add_tail(&entry->list, &dev->msi_list);
         *desc = entry;
     }
@@ -999,6 +1020,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
 {
     struct pci_dev *pdev;
     struct msi_desc *old_desc;
+    int ret;
 
     ASSERT(pcidevs_locked());
     pdev = pci_get_pdev(NULL, msi->sbdf);
@@ -1010,6 +1032,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
     {
         printk(XENLOG_ERR "irq %d already mapped to MSI on %pp\n",
                msi->irq, &pdev->sbdf);
+	pcidev_put(pdev);
         return -EEXIST;
     }
 
@@ -1020,7 +1043,10 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
         __pci_disable_msix(old_desc);
     }
 
-    return msi_capability_init(pdev, msi->irq, desc, msi->entry_nr);
+    ret = msi_capability_init(pdev, msi->irq, desc, msi->entry_nr);
+    pcidev_put(pdev);
+
+    return ret;
 }
 
 static void __pci_disable_msi(struct msi_desc *entry)
@@ -1054,6 +1080,7 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
 {
     struct pci_dev *pdev;
     struct msi_desc *old_desc;
+    int ret;
 
     ASSERT(pcidevs_locked());
     pdev = pci_get_pdev(NULL, msi->sbdf);
@@ -1061,13 +1088,17 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
         return -ENODEV;
 
     if ( msi->entry_nr >= pdev->msix->nr_entries )
+    {
+	pcidev_put(pdev);
         return -EINVAL;
+    }
 
     old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSIX);
     if ( old_desc )
     {
         printk(XENLOG_ERR "irq %d already mapped to MSI-X on %pp\n",
                msi->irq, &pdev->sbdf);
+	pcidev_put(pdev);
         return -EEXIST;
     }
 
@@ -1078,7 +1109,11 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
         __pci_disable_msi(old_desc);
     }
 
-    return msix_capability_init(pdev, msi, desc);
+    ret = msix_capability_init(pdev, msi, desc);
+
+    pcidev_put(pdev);
+
+    return ret;
 }
 
 static void _pci_cleanup_msix(struct arch_msix *msix)
@@ -1161,6 +1196,8 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
         rc = msix_capability_init(pdev, NULL, NULL);
     pcidevs_unlock();
 
+    pcidev_put(pdev);
+
     return rc;
 }
 
diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
index 97b792e578..1d38f0df7c 100644
--- a/xen/arch/x86/pci.c
+++ b/xen/arch/x86/pci.c
@@ -91,8 +91,10 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
     pcidevs_lock();
 
     pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
-    if ( pdev )
+    if ( pdev ) {
         rc = pci_msi_conf_write_intercept(pdev, reg, size, data);
+	pcidev_put(pdev);
+    }
 
     pcidevs_unlock();
 
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 2f1d955a96..96214a3d40 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -533,7 +533,14 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         pcidevs_lock();
         pdev = pci_get_pdev(NULL,
                             PCI_SBDF(0, restore_msi.bus, restore_msi.devfn));
-        ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV;
+        if ( pdev )
+        {
+            ret = pci_restore_msi_state(pdev);
+            pcidev_put(pdev);
+        }
+        else
+            ret = -ENODEV;
+
         pcidevs_unlock();
         break;
     }
@@ -548,7 +555,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         pcidevs_lock();
         pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
-        ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV;
+        if ( pdev )
+        {
+            ret =  pci_restore_msi_state(pdev);
+            pcidev_put(pdev);
+        }
+        else
+            ret = -ENODEV;
         pcidevs_unlock();
         break;
     }
diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
index 02505ab044..0feef94cd2 100644
--- a/xen/common/sysctl.c
+++ b/xen/common/sysctl.c
@@ -438,7 +438,7 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
         {
             physdev_pci_device_t dev;
             uint32_t node;
-            const struct pci_dev *pdev;
+            struct pci_dev *pdev;
 
             if ( copy_from_guest_offset(&dev, ti->devs, i, 1) )
             {
@@ -456,6 +456,9 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
                 node = pdev->node;
             pcidevs_unlock();
 
+            if ( pdev )
+                pcidev_put(pdev);
+
             if ( copy_to_guest_offset(ti->nodes, i, &node, 1) )
             {
                 ret = -EFAULT;
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index 1f14aaf49e..7c1713a602 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -644,6 +644,7 @@ static void cf_check parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
 
     if ( pdev )
         guest_iommu_add_ppr_log(pdev->domain, entry);
+    pcidev_put(pdev);
 }
 
 static void iommu_check_ppr_log(struct amd_iommu *iommu)
@@ -747,6 +748,11 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu)
     }
 
     pcidevs_lock();
+    /*
+     * XXX: it is unclear if this device can be removed. Right now
+     * there is no code that clears msi.dev, so no one will decrease
+     * refcount on it.
+     */
     iommu->msi.dev = pci_get_pdev(NULL, PCI_SBDF(iommu->seg, iommu->bdf));
     pcidevs_unlock();
     if ( !iommu->msi.dev )
@@ -1272,7 +1278,7 @@ static int __init cf_check amd_iommu_setup_device_table(
     {
         if ( ivrs_mappings[bdf].valid )
         {
-            const struct pci_dev *pdev = NULL;
+            struct pci_dev *pdev = NULL;
 
             /* add device table entry */
             iommu_dte_add_device_entry(&dt[bdf], &ivrs_mappings[bdf]);
@@ -1297,7 +1303,10 @@ static int __init cf_check amd_iommu_setup_device_table(
                         pdev->msix ? pdev->msix->nr_entries
                                    : pdev->msi_maxvec);
                 if ( !ivrs_mappings[bdf].intremap_table )
+		{
+		    pcidev_put(pdev);
                     return -ENOMEM;
+		}
 
                 if ( pdev->phantom_stride )
                 {
@@ -1315,6 +1324,7 @@ static int __init cf_check amd_iommu_setup_device_table(
                             ivrs_mappings[bdf].intremap_inuse;
                     }
                 }
+		pcidev_put(pdev);
             }
 
             amd_iommu_set_intremap_table(
diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
index 993bac6f88..9d621e3d36 100644
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -724,14 +724,18 @@ int cf_check amd_iommu_get_reserved_device_memory(
         if ( !iommu )
         {
             /* May need to trigger the workaround in find_iommu_for_device(). */
-            const struct pci_dev *pdev;
+            struct pci_dev *pdev;
 
             pcidevs_lock();
             pdev = pci_get_pdev(NULL, sbdf);
             pcidevs_unlock();
 
             if ( pdev )
+            {
                 iommu = find_iommu_for_device(seg, bdf);
+                /* XXX: Should we hold pdev reference till end of the loop? */
+                pcidev_put(pdev);
+            }
             if ( !iommu )
                 continue;
         }
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index b5db5498a1..a6c6368769 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -403,6 +403,7 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
     *((u8*) &pdev->bus) = bus;
     *((u8*) &pdev->devfn) = devfn;
     pdev->domain = NULL;
+    refcnt_init(&pdev->refcnt);
 
     arch_pci_init_pdev(pdev);
 
@@ -499,33 +500,6 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
     return pdev;
 }
 
-static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
-{
-    /* update bus2bridge */
-    switch ( pdev->type )
-    {
-        unsigned int sec_bus, sub_bus;
-
-        case DEV_TYPE_PCIe2PCI_BRIDGE:
-        case DEV_TYPE_LEGACY_PCI_BRIDGE:
-            sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS);
-            sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS);
-
-            spin_lock(&pseg->bus2bridge_lock);
-            for ( ; sec_bus <= sub_bus; sec_bus++ )
-                pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
-            spin_unlock(&pseg->bus2bridge_lock);
-            break;
-
-        default:
-            break;
-    }
-
-    list_del(&pdev->alldevs_list);
-    pdev_msi_deinit(pdev);
-    xfree(pdev);
-}
-
 static void __init _pci_hide_device(struct pci_dev *pdev)
 {
     if ( pdev->domain )
@@ -596,10 +570,15 @@ struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf)
     {
         if ( !(sbdf.devfn & stride) )
             continue;
+
         sbdf.devfn &= ~stride;
+        pcidev_put(pdev);
         pdev = pci_get_pdev(NULL, sbdf);
         if ( pdev && stride != pdev->phantom_stride )
+        {
+            pcidev_put(pdev);
             pdev = NULL;
+        }
     }
 
     return pdev;
@@ -629,6 +608,7 @@ struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
             if ( pdev->sbdf.bdf == sbdf.bdf &&
                  (!d || pdev->domain == d) )
             {
+                pcidev_get(pdev);
                 spin_unlock(&pseg->alldevs_lock);
                 return pdev;
             }
@@ -640,6 +620,7 @@ struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
         list_for_each_entry ( pdev, &d->pdev_list, domain_list )
             if ( pdev->sbdf.bdf == sbdf.bdf )
             {
+                pcidev_get(pdev);
                 spin_unlock(&d->pdevs_lock);
                 return pdev;
             }
@@ -754,7 +735,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
                             PCI_SBDF(seg, info->physfn.bus,
                                      info->physfn.devfn));
         if ( pdev )
+        {
             pf_is_extfn = pdev->info.is_extfn;
+            pcidev_put(pdev);
+        }
         pcidevs_unlock();
         if ( !pdev )
             pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
@@ -920,8 +904,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
                 spin_unlock(&pdev->domain->pdevs_lock);
             }
             printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
-            free_pdev(pseg, pdev);
             list_del(&pdev->alldevs_list);
+            pdev_msi_deinit(pdev);
+            pcidev_put(pdev);
             break;
         }
 
@@ -952,7 +937,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
     {
         ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
         if ( ret )
-           return ret;
+            goto out;
 
         target = dom_io;
     }
@@ -982,6 +967,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
     pdev->fault.count = 0;
 
  out:
+    pcidev_put(pdev);
     if ( ret )
         printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n",
                d, &PCI_SBDF(seg, bus, devfn), ret);
@@ -1117,7 +1103,10 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
             pdev->fault.count >>= 1;
         pdev->fault.time = now;
         if ( ++pdev->fault.count < PT_FAULT_THRESHOLD )
+        {
+            pcidev_put(pdev);
             pdev = NULL;
+        }
     }
     pcidevs_unlock();
 
@@ -1128,6 +1117,8 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
      * control it for us. */
     cword = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
     pci_conf_write16(pdev->sbdf, PCI_COMMAND, cword & ~PCI_COMMAND_MASTER);
+
+    pcidev_put(pdev);
 }
 
 /*
@@ -1246,6 +1237,7 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
                 printk(XENLOG_WARNING "Dom%d owning %pp?\n",
                        pdev->domain->domain_id, &pdev->sbdf);
 
+            pcidev_put(pdev);
             if ( iommu_verbose )
             {
                 pcidevs_unlock();
@@ -1495,33 +1487,28 @@ static int iommu_remove_device(struct pci_dev *pdev)
     return iommu_call(hd->platform_ops, remove_device, devfn, pci_to_dev(pdev));
 }
 
-static int device_assigned(u16 seg, u8 bus, u8 devfn)
+static int device_assigned(struct pci_dev *pdev)
 {
-    struct pci_dev *pdev;
     int rc = 0;
 
     ASSERT(pcidevs_locked());
-    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
-
-    if ( !pdev )
-        rc = -ENODEV;
     /*
      * If the device exists and it is not owned by either the hardware
      * domain or dom_io then it must be assigned to a guest, or be
      * hidden (owned by dom_xen).
      */
-    else if ( pdev->domain != hardware_domain &&
-              pdev->domain != dom_io )
+    if ( pdev->domain != hardware_domain &&
+         pdev->domain != dom_io )
         rc = -EBUSY;
 
     return rc;
 }
 
 /* Caller should hold the pcidevs_lock */
-static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
+static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
 {
     const struct domain_iommu *hd = dom_iommu(d);
-    struct pci_dev *pdev;
+    uint8_t devfn;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
@@ -1532,10 +1519,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
 
     /* device_assigned() should already have cleared the device for assignment */
     ASSERT(pcidevs_locked());
-    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
     ASSERT(pdev && (pdev->domain == hardware_domain ||
                     pdev->domain == dom_io));
 
+    devfn = pdev->devfn;
+
     /* Do not allow broken devices to be assigned to guests. */
     rc = -EBADF;
     if ( pdev->broken && d != hardware_domain && d != dom_io )
@@ -1570,7 +1558,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
  done:
     if ( rc )
         printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
-               d, &PCI_SBDF(seg, bus, devfn), rc);
+               d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc);
     /* The device is assigned to dom_io so mark it as quarantined */
     else if ( d == dom_io )
         pdev->quarantine = true;
@@ -1710,6 +1698,9 @@ int iommu_do_pci_domctl(
         ASSERT(d);
         /* fall through */
     case XEN_DOMCTL_test_assign_device:
+    {
+        struct pci_dev *pdev;
+
         /* Don't support self-assignment of devices. */
         if ( d == current->domain )
         {
@@ -1737,26 +1728,36 @@ int iommu_do_pci_domctl(
         seg = machine_sbdf >> 16;
         bus = PCI_BUS(machine_sbdf);
         devfn = PCI_DEVFN(machine_sbdf);
+        pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
+        if ( !pdev )
+        {
+            printk(XENLOG_G_INFO "%pp non-existent\n",
+                   &PCI_SBDF(seg, bus, devfn));
+            ret = -EINVAL;
+            break;
+        }
 
         pcidevs_lock();
-        ret = device_assigned(seg, bus, devfn);
+        ret = device_assigned(pdev);
         if ( domctl->cmd == XEN_DOMCTL_test_assign_device )
         {
             if ( ret )
             {
-                printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n",
+                printk(XENLOG_G_INFO "%pp already assigned\n",
                        &PCI_SBDF(seg, bus, devfn));
                 ret = -EINVAL;
             }
         }
         else if ( !ret )
-            ret = assign_device(d, seg, bus, devfn, flags);
+            ret = assign_device(d, pdev, flags);
+
+        pcidev_put(pdev);
         pcidevs_unlock();
         if ( ret == -ERESTART )
             ret = hypercall_create_continuation(__HYPERVISOR_domctl,
                                                 "h", u_domctl);
         break;
-
+    }
     case XEN_DOMCTL_deassign_device:
         /* Don't support self-deassignment of devices. */
         if ( d == current->domain )
@@ -1796,6 +1797,46 @@ int iommu_do_pci_domctl(
     return ret;
 }
 
+static void release_pdev(refcnt_t *refcnt)
+{
+    struct pci_dev *pdev = container_of(refcnt, struct pci_dev, refcnt);
+    struct pci_seg *pseg = get_pseg(pdev->seg);
+
+    printk(XENLOG_DEBUG "PCI release device %pp\n", &pdev->sbdf);
+
+    /* update bus2bridge */
+    switch ( pdev->type )
+    {
+        unsigned int sec_bus, sub_bus;
+
+        case DEV_TYPE_PCIe2PCI_BRIDGE:
+        case DEV_TYPE_LEGACY_PCI_BRIDGE:
+            sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS);
+            sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS);
+
+            spin_lock(&pseg->bus2bridge_lock);
+            for ( ; sec_bus <= sub_bus; sec_bus++ )
+                pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
+            spin_unlock(&pseg->bus2bridge_lock);
+            break;
+
+        default:
+            break;
+    }
+
+    xfree(pdev);
+}
+
+void pcidev_get(struct pci_dev *pdev)
+{
+    refcnt_get(&pdev->refcnt);
+}
+
+void pcidev_put(struct pci_dev *pdev)
+{
+    refcnt_put(&pdev->refcnt, release_pdev);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index fcc8f73e8b..d240da0416 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -429,6 +429,8 @@ static int __must_check map_me_phantom_function(struct domain *domain,
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
                                       PCI_DEVFN(dev, 7));
 
+    pcidev_put(pdev);
+
     return rc;
 }
 
diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c
index 29a88e8241..1298f3a7b6 100644
--- a/xen/drivers/video/vga.c
+++ b/xen/drivers/video/vga.c
@@ -114,7 +114,7 @@ void __init video_endboot(void)
         for ( bus = 0; bus < 256; ++bus )
             for ( devfn = 0; devfn < 256; ++devfn )
             {
-                const struct pci_dev *pdev;
+                struct pci_dev *pdev;
                 u8 b = bus, df = devfn, sb;
 
                 pcidevs_lock();
@@ -126,7 +126,11 @@ void __init video_endboot(void)
                                      PCI_CLASS_DEVICE) != 0x0300 ||
                      !(pci_conf_read16(PCI_SBDF(0, bus, devfn), PCI_COMMAND) &
                        (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) )
+		{
+		    if (pdev)
+			pcidev_put(pdev);
                     continue;
+		}
 
                 while ( b )
                 {
@@ -144,7 +148,10 @@ void __init video_endboot(void)
                             if ( pci_conf_read16(PCI_SBDF(0, b, df),
                                                  PCI_BRIDGE_CONTROL) &
                                  PCI_BRIDGE_CTL_VGA )
+			    {
+				pcidev_put(pdev);
                                 continue;
+			    }
                             break;
                         }
                         break;
@@ -157,6 +164,7 @@ void __init video_endboot(void)
                            bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
                     pci_hide_device(0, bus, devfn);
                 }
+		pcidev_put(pdev);
             }
     }
 
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 7d1f9fd438..59dc55f498 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -313,7 +313,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
     struct domain *d = current->domain;
-    const struct pci_dev *pdev;
+    struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     uint32_t data = ~(uint32_t)0;
@@ -373,6 +373,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    pcidev_put(pdev);
 
     if ( data_offset < size )
     {
@@ -416,7 +417,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
     struct domain *d = current->domain;
-    const struct pci_dev *pdev;
+    struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
@@ -478,6 +479,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    pcidev_put(pdev);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 19047b4b20..e71a180ef3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -13,6 +13,7 @@
 #include <xen/irq.h>
 #include <xen/pci_regs.h>
 #include <xen/pfn.h>
+#include <xen/refcnt.h>
 #include <asm/device.h>
 #include <asm/numa.h>
 
@@ -116,6 +117,9 @@ struct pci_dev {
     /* Device misbehaving, prevent assigning it to guests. */
     bool broken;
 
+    /* Reference counter */
+    refcnt_t refcnt;
+
     enum pdev_type {
         DEV_TYPE_PCI_UNKNOWN,
         DEV_TYPE_PCIe_ENDPOINT,
@@ -160,6 +164,14 @@ void pcidevs_lock(void);
 void pcidevs_unlock(void);
 bool_t __must_check pcidevs_locked(void);
 
+/*
+ * Acquire and release reference to the given device. Holding
+ * reference ensures that device will not disappear under feet, but
+ * does not guarantee that code has exclusive access to the device.
+ */
+void pcidev_get(struct pci_dev *pdev);
+void pcidev_put(struct pci_dev *pdev);
+
 bool_t pci_known_segment(u16 seg);
 bool_t pci_device_detect(u16 seg, u8 bus, u8 dev, u8 func);
 int scan_pci_devices(void);
@@ -177,8 +189,14 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
 int pci_remove_device(u16 seg, u8 bus, u8 devfn);
 int pci_ro_device(int seg, int bus, int devfn);
 int pci_hide_device(unsigned int seg, unsigned int bus, unsigned int devfn);
+
+/*
+ * Next two functions will find a requested device and acquire
+ * reference to it. Use pcidev_put() to release the reference.
+ */
 struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf);
 struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf);
+
 void pci_check_disable_device(u16 seg, u8 bus, u8 devfn);
 
 uint8_t pci_conf_read8(pci_sbdf_t sbdf, unsigned int reg);
-- 
2.36.1



* [RFC PATCH 09/10] [RFC only] xen: iommu: remove last pcidevs_lock() calls in iommu
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
                   ` (6 preceding siblings ...)
  2022-08-31 14:11 ` [RFC PATCH 07/10] xen: pci: add per-device locking Volodymyr Babchuk
@ 2022-08-31 14:11 ` Volodymyr Babchuk
  2023-01-28  1:36   ` Stefano Stabellini
  2023-02-28 16:25   ` Jan Beulich
  2022-08-31 14:11 ` [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls Volodymyr Babchuk
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Kevin Tian,
	Jan Beulich, Paul Durrant, Roger Pau Monné

There are a number of cases where pcidevs_lock() is used to protect
something that is not related to PCI devices per se.

pcidevs_lock() in these places should probably be replaced with some
other lock.

This patch is not intended to be merged and is present only to discuss
this use of pcidevs_lock().

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
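For discussion only: a minimal sketch of what "some other lock" could
look like for state such as the domain ID bitmap, assuming a dedicated
lock that lives next to the data it protects. All demo_* names, the
pool size and the C11 atomic_flag spinlock are hypothetical stand-ins
and are not taken from the Xen tree.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_NR_IDS 8u

/* Hypothetical ID pool; the lock belongs to the pool, not to PCI code. */
struct demo_id_pool {
    atomic_flag lock;   /* dedicated lock (assumption, not Xen's) */
    uint32_t map;       /* bit n set => ID n is in use */
};

static void demo_pool_lock(struct demo_id_pool *p)
{
    while ( atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire) )
        ; /* spin until the holder clears the flag */
}

static void demo_pool_unlock(struct demo_id_pool *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

static void demo_pool_init(struct demo_id_pool *p)
{
    atomic_flag_clear(&p->lock);
    p->map = 0;
}

/* Returns the lowest free ID, or -1 when the pool is exhausted. */
static int demo_alloc_id(struct demo_id_pool *p)
{
    int id = -1;
    unsigned int i;

    demo_pool_lock(p);
    for ( i = 0; i < DEMO_NR_IDS; i++ )
        if ( !(p->map & (1u << i)) )
        {
            p->map |= 1u << i;
            id = (int)i;
            break;
        }
    demo_pool_unlock(p);

    return id;
}

static void demo_free_id(struct demo_id_pool *p, int id)
{
    assert(id >= 0 && (unsigned int)id < DEMO_NR_IDS);

    demo_pool_lock(p);
    p->map &= ~(1u << id);
    demo_pool_unlock(p);
}
```

With a lock of this kind the allocator no longer depends on callers
holding an unrelated global lock, which is the property the
ASSERT(pcidevs_locked()) checks removed below were relying on.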
 xen/drivers/passthrough/vtd/intremap.c | 2 --
 xen/drivers/passthrough/vtd/iommu.c    | 5 -----
 xen/drivers/passthrough/x86/iommu.c    | 5 -----
 3 files changed, 12 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 1512e4866b..44e3b72f91 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -893,8 +893,6 @@ int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
 
     spin_unlock_irq(&desc->lock);
 
-    ASSERT(pcidevs_locked());
-
     return msi_msg_write_remap_rte(msi_desc, &msi_desc->msg);
 
  unlock_out:
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 87868188b7..9d258d154d 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -127,8 +127,6 @@ static int context_set_domain_id(struct context_entry *context,
 {
     unsigned int i;
 
-    ASSERT(pcidevs_locked());
-
     if ( domid_mapping(iommu) )
     {
         unsigned int nr_dom = cap_ndoms(iommu->cap);
@@ -1882,7 +1880,6 @@ int domain_context_unmap_one(
     int iommu_domid, rc, ret;
     bool_t flush_dev_iotlb;
 
-    ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
 
     maddr = bus_to_context_maddr(iommu, bus);
@@ -2601,7 +2598,6 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
     u16 bdf;
     int ret, i;
 
-    pcidevs_lock();
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         /*
@@ -2616,7 +2612,6 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
             dprintk(XENLOG_ERR VTDPREFIX,
                      "IOMMU: mapping reserved region failed\n");
     }
-    pcidevs_unlock();
 }
 
 static struct iommu_state {
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index f671b0f2bb..4e94ad15df 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -207,7 +207,6 @@ int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
     struct identity_map *map;
     struct domain_iommu *hd = dom_iommu(d);
 
-    ASSERT(pcidevs_locked());
     ASSERT(base < end);
 
     /*
@@ -479,8 +478,6 @@ domid_t iommu_alloc_domid(unsigned long *map)
     static unsigned int start;
     unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
 
-    ASSERT(pcidevs_locked());
-
     if ( idx >= UINT16_MAX - DOMID_MASK )
         idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
     if ( idx >= UINT16_MAX - DOMID_MASK )
@@ -495,8 +492,6 @@ domid_t iommu_alloc_domid(unsigned long *map)
 
 void iommu_free_domid(domid_t domid, unsigned long *map)
 {
-    ASSERT(pcidevs_locked());
-
     if ( domid == DOMID_INVALID )
         return;
 
-- 
2.36.1



* [RFC PATCH 07/10] xen: pci: add per-device locking
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
                   ` (5 preceding siblings ...)
  2022-08-31 14:11 ` [RFC PATCH 06/10] xen: pci: print reference counter when dumping pci_devs Volodymyr Babchuk
@ 2022-08-31 14:11 ` Volodymyr Babchuk
  2023-01-28  0:56   ` Stefano Stabellini
  2023-02-28 16:46   ` Jan Beulich
  2022-08-31 14:11 ` [RFC PATCH 09/10] [RFC only] xen: iommu: remove last pcidevs_lock() calls in iommu Volodymyr Babchuk
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Jan Beulich,
	Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant

A spinlock in struct pci_dev will be used to protect access to the
device itself. Right now it is used mostly by the MSI code.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
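For illustration only: a minimal, self-contained sketch of the locking
pattern this patch applies to msi_list, namely taking the per-device
lock around every traversal or modification of the list and dropping it
on every exit path. The demo_* types and the C11 atomic_flag spinlock
are hypothetical stand-ins, not the Xen structures or spinlock API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real struct msi_desc / struct pci_dev. */
struct demo_msi_entry {
    int irq;
    struct demo_msi_entry *next;
};

struct demo_dev {
    atomic_flag lock;                 /* models the per-device spinlock */
    struct demo_msi_entry *msi_list;  /* list protected by that lock */
};

static void demo_dev_lock(struct demo_dev *dev)
{
    while ( atomic_flag_test_and_set_explicit(&dev->lock, memory_order_acquire) )
        ; /* spin until the holder clears the flag */
}

static void demo_dev_unlock(struct demo_dev *dev)
{
    atomic_flag_clear_explicit(&dev->lock, memory_order_release);
}

static void demo_dev_init(struct demo_dev *dev)
{
    atomic_flag_clear(&dev->lock);
    dev->msi_list = NULL;
}

/* Insert at the head of the list while holding the device lock. */
static void demo_add_entry(struct demo_dev *dev, struct demo_msi_entry *e)
{
    assert(e != NULL);

    demo_dev_lock(dev);
    e->next = dev->msi_list;
    dev->msi_list = e;
    demo_dev_unlock(dev);
}

/* Look up an entry by IRQ; a single unlock covers hit and miss. */
static struct demo_msi_entry *demo_find_entry(struct demo_dev *dev, int irq)
{
    struct demo_msi_entry *e;

    demo_dev_lock(dev);
    for ( e = dev->msi_list; e != NULL; e = e->next )
        if ( e->irq == irq )
            break;
    demo_dev_unlock(dev);

    return e; /* NULL when no entry matched */
}
```

Note that in this sketch, as in find_msi_entry() below, the lock covers
only the list walk. The returned pointer is used after the lock has been
dropped, so something else has to keep the entry alive for the caller.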
 xen/arch/x86/hvm/vmsi.c       |  6 +++++-
 xen/arch/x86/msi.c            | 16 ++++++++++++++++
 xen/drivers/passthrough/msi.c |  8 +++++++-
 xen/drivers/passthrough/pci.c |  2 ++
 xen/include/xen/pci.h         | 12 ++++++++++++
 5 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 7fb1075673..c9e5f279c5 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -203,10 +203,14 @@ static struct msi_desc *msixtbl_addr_to_desc(
 
     nr_entry = (addr - entry->gtable) / PCI_MSIX_ENTRY_SIZE;
 
+    pcidev_lock(entry->pdev);
     list_for_each_entry( desc, &entry->pdev->msi_list, list )
         if ( desc->msi_attrib.type == PCI_CAP_ID_MSIX &&
-             desc->msi_attrib.entry_nr == nr_entry )
+             desc->msi_attrib.entry_nr == nr_entry ) {
+            pcidev_unlock(entry->pdev);
             return desc;
+        }
+    pcidev_unlock(entry->pdev);
 
     return NULL;
 }
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index bccaccb98b..6b62c4f452 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -389,6 +389,7 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
     default:
         return 0;
     }
+
     entry->msi_attrib.host_masked = host;
     entry->msi_attrib.guest_masked = guest;
 
@@ -585,12 +586,17 @@ static struct msi_desc *find_msi_entry(struct pci_dev *dev,
 {
     struct msi_desc *entry;
 
+    pcidev_lock(dev);
     list_for_each_entry( entry, &dev->msi_list, list )
     {
         if ( entry->msi_attrib.type == cap_id &&
              (irq == -1 || entry->irq == irq) )
+        {
+            pcidev_unlock(dev);
             return entry;
+        }
     }
+    pcidev_unlock(dev);
 
     return NULL;
 }
@@ -661,7 +667,9 @@ static int msi_capability_init(struct pci_dev *dev,
         maskbits |= ~(uint32_t)0 >> (32 - dev->msi_maxvec);
         pci_conf_write32(dev->sbdf, mpos, maskbits);
     }
+    pcidev_lock(dev);
     list_add_tail(&entry->list, &dev->msi_list);
+    pcidev_unlock(dev);
 
     *desc = entry;
     /* Restore the original MSI enabled bits  */
@@ -946,7 +954,9 @@ static int msix_capability_init(struct pci_dev *dev,
 
 	pcidev_get(dev);
 
+        pcidev_lock(dev);
         list_add_tail(&entry->list, &dev->msi_list);
+        pcidev_unlock(dev);
         *desc = entry;
     }
 
@@ -1231,11 +1241,13 @@ static void msi_free_irqs(struct pci_dev* dev)
 {
     struct msi_desc *entry, *tmp;
 
+    pcidev_lock(dev);
     list_for_each_entry_safe( entry, tmp, &dev->msi_list, list )
     {
         pci_disable_msi(entry);
         msi_free_irq(entry);
     }
+    pcidev_unlock(dev);
 }
 
 void pci_cleanup_msi(struct pci_dev *pdev)
@@ -1354,6 +1366,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
     if ( ret )
         return ret;
 
+    pcidev_lock(pdev);
     list_for_each_entry_safe( entry, tmp, &pdev->msi_list, list )
     {
         unsigned int i = 0, nr = 1;
@@ -1371,6 +1384,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
             dprintk(XENLOG_ERR, "Restore MSI for %pp entry %u not set?\n",
                     &pdev->sbdf, i);
             spin_unlock_irqrestore(&desc->lock, flags);
+            pcidev_unlock(pdev);
             if ( type == PCI_CAP_ID_MSIX )
                 pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
                                  control & ~PCI_MSIX_FLAGS_ENABLE);
@@ -1393,6 +1407,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
             if ( unlikely(!memory_decoded(pdev)) )
             {
                 spin_unlock_irqrestore(&desc->lock, flags);
+                pcidev_unlock(pdev);
                 pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
                                  control & ~PCI_MSIX_FLAGS_ENABLE);
                 return -ENXIO;
@@ -1438,6 +1453,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
         pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
                          control | PCI_MSIX_FLAGS_ENABLE);
 
+    pcidev_unlock(pdev);
     return 0;
 }
 
diff --git a/xen/drivers/passthrough/msi.c b/xen/drivers/passthrough/msi.c
index ce1a450f6f..98f4d2721a 100644
--- a/xen/drivers/passthrough/msi.c
+++ b/xen/drivers/passthrough/msi.c
@@ -22,6 +22,7 @@ int pdev_msi_init(struct pci_dev *pdev)
 {
     unsigned int pos;
 
+    pcidev_lock(pdev);
     INIT_LIST_HEAD(&pdev->msi_list);
 
     pos = pci_find_cap_offset(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
@@ -41,7 +42,10 @@ int pdev_msi_init(struct pci_dev *pdev)
         uint16_t ctrl;
 
         if ( !msix )
-            return -ENOMEM;
+        {
+            pcidev_unlock(pdev);
+            return -ENOMEM;
+        }
 
         spin_lock_init(&msix->table_lock);
 
@@ -51,6 +55,8 @@ int pdev_msi_init(struct pci_dev *pdev)
         pdev->msix = msix;
     }
 
+    pcidev_unlock(pdev);
+
     return 0;
 }
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index c8da80b981..c83397211b 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1383,7 +1383,9 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
             printk("%pd", pdev->domain);
         printk(" - node %-3d refcnt %d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1,
                atomic_read(&pdev->refcnt));
+        pcidev_lock(pdev);
         pdev_dump_msi(pdev);
+        pcidev_unlock(pdev);
         printk("\n");
     }
     spin_unlock(&pseg->alldevs_lock);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index e71a180ef3..d0a7339d84 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -106,6 +106,8 @@ struct pci_dev {
     uint8_t msi_maxvec;
     uint8_t phantom_stride;
 
+    /* Device lock */
+    spinlock_t lock;
     nodeid_t node; /* NUMA node */
 
     /* Device to be quarantined, don't automatically re-assign to dom0 */
@@ -235,6 +237,16 @@ int msixtbl_pt_register(struct domain *, struct pirq *, uint64_t gtable);
 void msixtbl_pt_unregister(struct domain *, struct pirq *);
 void msixtbl_pt_cleanup(struct domain *d);
 
+static inline void pcidev_lock(struct pci_dev *pdev)
+{
+    spin_lock(&pdev->lock);
+}
+
+static inline void pcidev_unlock(struct pci_dev *pdev)
+{
+    spin_unlock(&pdev->lock);
+}
+
 #ifdef CONFIG_HVM
 int arch_pci_clean_pirqs(struct domain *d);
 #else
-- 
2.36.1



* [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
                   ` (7 preceding siblings ...)
  2022-08-31 14:11 ` [RFC PATCH 09/10] [RFC only] xen: iommu: remove last pcidevs_lock() calls in iommu Volodymyr Babchuk
@ 2022-08-31 14:11 ` Volodymyr Babchuk
  2023-01-28  1:32   ` Stefano Stabellini
  2022-08-31 14:11 ` [RFC PATCH 10/10] [RFC only] xen: pci: remove pcidev_lock() function Volodymyr Babchuk
  2022-09-06 10:32 ` [RFC PATCH 00/10] Rework PCI locking Jan Beulich
  10 siblings, 1 reply; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Jan Beulich,
	Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Kevin Tian

As PCI devices are refcounted now and all lists that store them are
protected by separate locks, we can safely drop the global
pcidevs_lock.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
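For illustration only: a minimal sketch of the lifetime rule that this
removal relies on, assuming a lookup hands out a counted reference and
the object is released only when the last reference is dropped. The
demo_* names and the freed flag are hypothetical, and the real refcnt_t
helpers added earlier in this series may behave differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted struct pci_dev. */
struct demo_pdev {
    atomic_int refcnt;
    bool *freed;        /* demo only: records that release happened */
};

/* Allocate an object holding one initial reference. */
static struct demo_pdev *demo_pdev_alloc(bool *freed)
{
    struct demo_pdev *p = malloc(sizeof(*p));

    if ( p == NULL )
        return NULL;

    atomic_init(&p->refcnt, 1);
    p->freed = freed;
    *freed = false;

    return p;
}

/* Take an extra reference; the caller must already hold one. */
static void demo_pdev_get(struct demo_pdev *p)
{
    int old = atomic_fetch_add(&p->refcnt, 1);

    assert(old > 0);
    (void)old;
}

/* Drop a reference; the last put releases the object. */
static void demo_pdev_put(struct demo_pdev *p)
{
    int old = atomic_fetch_sub(&p->refcnt, 1);

    assert(old > 0);
    if ( old == 1 )
    {
        *p->freed = true;
        free(p);
    }
}
```

A reference of this kind only guarantees that the object does not
disappear. It does not give exclusive access, which is why the lists and
the device itself still need their own locks.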
 xen/arch/x86/domctl.c                       |  8 ---
 xen/arch/x86/hvm/vioapic.c                  |  2 -
 xen/arch/x86/hvm/vmsi.c                     | 12 ----
 xen/arch/x86/irq.c                          |  7 ---
 xen/arch/x86/msi.c                          | 11 ----
 xen/arch/x86/pci.c                          |  4 --
 xen/arch/x86/physdev.c                      |  7 +--
 xen/common/sysctl.c                         |  2 -
 xen/drivers/char/ns16550.c                  |  4 --
 xen/drivers/passthrough/amd/iommu_init.c    |  7 ---
 xen/drivers/passthrough/amd/iommu_map.c     |  5 --
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  4 --
 xen/drivers/passthrough/pci.c               | 63 +--------------------
 xen/drivers/passthrough/vtd/iommu.c         |  8 ---
 xen/drivers/video/vga.c                     |  2 -
 15 files changed, 4 insertions(+), 142 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 020df615bd..9f4ca03385 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -537,11 +537,7 @@ long arch_do_domctl(
 
         ret = -ESRCH;
         if ( is_iommu_enabled(d) )
-        {
-            pcidevs_lock();
             ret = pt_irq_create_bind(d, bind);
-            pcidevs_unlock();
-        }
         if ( ret < 0 )
             printk(XENLOG_G_ERR "pt_irq_create_bind failed (%ld) for dom%d\n",
                    ret, d->domain_id);
@@ -566,11 +562,7 @@ long arch_do_domctl(
             break;
 
         if ( is_iommu_enabled(d) )
-        {
-            pcidevs_lock();
             ret = pt_irq_destroy_bind(d, bind);
-            pcidevs_unlock();
-        }
         if ( ret < 0 )
             printk(XENLOG_G_ERR "pt_irq_destroy_bind failed (%ld) for dom%d\n",
                    ret, d->domain_id);
diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
index cb7f440160..aa4e7766a3 100644
--- a/xen/arch/x86/hvm/vioapic.c
+++ b/xen/arch/x86/hvm/vioapic.c
@@ -197,7 +197,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
         return ret;
     }
 
-    pcidevs_lock();
     ret = pt_irq_create_bind(currd, &pt_irq_bind);
     if ( ret )
     {
@@ -207,7 +206,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
         unmap_domain_pirq(currd, pirq);
         write_unlock(&currd->event_lock);
     }
-    pcidevs_unlock();
 
     return ret;
 }
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index c9e5f279c5..344bbd646c 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -470,7 +470,6 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
     struct msixtbl_entry *entry, *new_entry;
     int r = -EINVAL;
 
-    ASSERT(pcidevs_locked());
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -540,7 +539,6 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
     struct pci_dev *pdev;
     struct msixtbl_entry *entry;
 
-    ASSERT(pcidevs_locked());
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -686,8 +684,6 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
 {
     unsigned int i;
 
-    ASSERT(pcidevs_locked());
-
     if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
     {
         gdprintk(XENLOG_ERR, "%pp: PIRQ %u: unsupported address %lx\n",
@@ -728,7 +724,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
 
     ASSERT(msi->arch.pirq != INVALID_PIRQ);
 
-    pcidevs_lock();
     for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
     {
         struct xen_domctl_bind_pt_irq unbind = {
@@ -747,7 +742,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
 
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
                                        msi->vectors, msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 }
 
 static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -785,10 +779,8 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
         return rc;
     msi->arch.pirq = rc;
 
-    pcidevs_lock();
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
                                        msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 
     return 0;
 }
@@ -800,7 +792,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
 
     ASSERT(pirq != INVALID_PIRQ);
 
-    pcidevs_lock();
     for ( i = 0; i < nr && bound; i++ )
     {
         struct xen_domctl_bind_pt_irq bind = {
@@ -816,7 +807,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     write_lock(&pdev->domain->event_lock);
     unmap_domain_pirq(pdev->domain, pirq);
     write_unlock(&pdev->domain->event_lock);
-    pcidevs_unlock();
 }
 
 void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -863,7 +853,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
 
     entry->arch.pirq = rc;
 
-    pcidevs_lock();
     rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
                          entry->masked);
     if ( rc )
@@ -871,7 +860,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
         vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
         entry->arch.pirq = INVALID_PIRQ;
     }
-    pcidevs_unlock();
 
     return rc;
 }
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index d8672a03e1..6a08830a55 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2156,8 +2156,6 @@ int map_domain_pirq(
         struct pci_dev *pdev;
         unsigned int nr = 0;
 
-        ASSERT(pcidevs_locked());
-
         ret = -ENODEV;
         if ( !cpu_has_apic )
             goto done;
@@ -2317,7 +2315,6 @@ int unmap_domain_pirq(struct domain *d, int pirq)
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
@@ -2423,7 +2420,6 @@ void free_domain_pirqs(struct domain *d)
 {
     int i;
 
-    pcidevs_lock();
     write_lock(&d->event_lock);
 
     for ( i = 0; i < d->nr_pirqs; i++ )
@@ -2431,7 +2427,6 @@ void free_domain_pirqs(struct domain *d)
             unmap_domain_pirq(d, i);
 
     write_unlock(&d->event_lock);
-    pcidevs_unlock();
 }
 
 static void cf_check dump_irqs(unsigned char key)
@@ -2911,7 +2906,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
     msi->irq = irq;
 
-    pcidevs_lock();
     /* Verify or get pirq. */
     write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2927,7 +2921,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
  done:
     write_unlock(&d->event_lock);
-    pcidevs_unlock();
     if ( ret )
     {
         switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 6b62c4f452..f04b90e235 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -623,7 +623,6 @@ static int msi_capability_init(struct pci_dev *dev,
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
 
-    ASSERT(pcidevs_locked());
     pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
     if ( !pos )
         return -ENODEV;
@@ -810,8 +809,6 @@ static int msix_capability_init(struct pci_dev *dev,
     if ( !pos )
         return -ENODEV;
 
-    ASSERT(pcidevs_locked());
-
     control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
     /*
      * Ensure MSI-X interrupts are masked during setup. Some devices require
@@ -1032,7 +1029,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
     struct msi_desc *old_desc;
     int ret;
 
-    ASSERT(pcidevs_locked());
     pdev = pci_get_pdev(NULL, msi->sbdf);
     if ( !pdev )
         return -ENODEV;
@@ -1092,7 +1088,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
     struct msi_desc *old_desc;
     int ret;
 
-    ASSERT(pcidevs_locked());
     pdev = pci_get_pdev(NULL, msi->sbdf);
     if ( !pdev || !pdev->msix )
         return -ENODEV;
@@ -1191,7 +1186,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
     if ( !use_msi )
         return 0;
 
-    pcidevs_lock();
     pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
     if ( !pdev )
         rc = -ENODEV;
@@ -1204,7 +1198,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
     }
     else
         rc = msix_capability_init(pdev, NULL, NULL);
-    pcidevs_unlock();
 
     pcidev_put(pdev);
 
@@ -1217,8 +1210,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
  */
 int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
 {
-    ASSERT(pcidevs_locked());
-
     if ( !use_msi )
         return -EPERM;
 
@@ -1355,8 +1346,6 @@ int pci_restore_msi_state(struct pci_dev *pdev)
     unsigned int type = 0, pos = 0;
     u16 control = 0;
 
-    ASSERT(pcidevs_locked());
-
     if ( !use_msi )
         return -EOPNOTSUPP;
 
diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
index 1d38f0df7c..4dcd6d96f3 100644
--- a/xen/arch/x86/pci.c
+++ b/xen/arch/x86/pci.c
@@ -88,15 +88,11 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
     if ( reg < 64 || reg >= 256 )
         return 0;
 
-    pcidevs_lock();
-
     pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
     if ( pdev ) {
         rc = pci_msi_conf_write_intercept(pdev, reg, size, data);
 	pcidev_put(pdev);
     }
 
-    pcidevs_unlock();
-
     return rc;
 }
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 96214a3d40..a41366b609 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -162,11 +162,9 @@ int physdev_unmap_pirq(domid_t domid, int pirq)
             goto free_domain;
     }
 
-    pcidevs_lock();
     write_lock(&d->event_lock);
     ret = unmap_domain_pirq(d, pirq);
     write_unlock(&d->event_lock);
-    pcidevs_unlock();
 
  free_domain:
     rcu_unlock_domain(d);
@@ -530,7 +528,6 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( copy_from_guest(&restore_msi, arg, 1) != 0 )
             break;
 
-        pcidevs_lock();
         pdev = pci_get_pdev(NULL,
                             PCI_SBDF(0, restore_msi.bus, restore_msi.devfn));
         if ( pdev )
@@ -541,7 +538,6 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         else
             ret = -ENODEV;
 
-        pcidevs_unlock();
         break;
     }
 
@@ -553,7 +549,6 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( copy_from_guest(&dev, arg, 1) != 0 )
             break;
 
-        pcidevs_lock();
         pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
         if ( pdev )
         {
@@ -562,7 +557,7 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         }
         else
             ret = -ENODEV;
-        pcidevs_unlock();
+
         break;
     }
 
diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
index 0feef94cd2..6bb8c5c295 100644
--- a/xen/common/sysctl.c
+++ b/xen/common/sysctl.c
@@ -446,7 +446,6 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
                 break;
             }
 
-            pcidevs_lock();
             pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
             if ( !pdev )
                 node = XEN_INVALID_DEV;
@@ -454,7 +453,6 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
                 node = XEN_INVALID_NODE_ID;
             else
                 node = pdev->node;
-            pcidevs_unlock();
 
             if ( pdev )
                 pcidev_put(pdev);
diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
index 01a05c9aa8..66c10b18e5 100644
--- a/xen/drivers/char/ns16550.c
+++ b/xen/drivers/char/ns16550.c
@@ -445,8 +445,6 @@ static void __init cf_check ns16550_init_postirq(struct serial_port *port)
             {
                 struct msi_desc *msi_desc = NULL;
 
-                pcidevs_lock();
-
                 rc = pci_enable_msi(&msi, &msi_desc);
                 if ( !rc )
                 {
@@ -460,8 +458,6 @@ static void __init cf_check ns16550_init_postirq(struct serial_port *port)
                         pci_disable_msi(msi_desc);
                 }
 
-                pcidevs_unlock();
-
                 if ( rc )
                 {
                     uart->irq = 0;
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index 7c1713a602..e42af65a40 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -638,10 +638,7 @@ static void cf_check parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
     uint16_t device_id = iommu_get_devid_from_cmd(entry[0]);
     struct pci_dev *pdev;
 
-    pcidevs_lock();
     pdev = pci_get_real_pdev(PCI_SBDF(iommu->seg, device_id));
-    pcidevs_unlock();
-
     if ( pdev )
         guest_iommu_add_ppr_log(pdev->domain, entry);
     pcidev_put(pdev);
@@ -747,14 +744,12 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu)
         return 0;
     }
 
-    pcidevs_lock();
     /*
      * XXX: it is unclear if this device can be removed. Right now
      * there is no code that clears msi.dev, so no one will decrease
      * refcount on it.
      */
     iommu->msi.dev = pci_get_pdev(NULL, PCI_SBDF(iommu->seg, iommu->bdf));
-    pcidevs_unlock();
     if ( !iommu->msi.dev )
     {
         AMD_IOMMU_WARN("no pdev for %pp\n",
@@ -1289,9 +1284,7 @@ static int __init cf_check amd_iommu_setup_device_table(
             {
                 if ( !pci_init )
                     continue;
-                pcidevs_lock();
                 pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
-                pcidevs_unlock();
             }
 
             if ( pdev && (pdev->msix || pdev->msi_maxvec) )
diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
index 9d621e3d36..d04aa37538 100644
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -726,9 +726,7 @@ int cf_check amd_iommu_get_reserved_device_memory(
             /* May need to trigger the workaround in find_iommu_for_device(). */
             struct pci_dev *pdev;
 
-            pcidevs_lock();
             pdev = pci_get_pdev(NULL, sbdf);
-            pcidevs_unlock();
 
             if ( pdev )
             {
@@ -848,7 +846,6 @@ int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
     const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int rc;
 
-    ASSERT(pcidevs_locked());
     ASSERT(!hd->arch.amd.root_table);
     ASSERT(page_list_empty(&hd->arch.pgtables.list));
 
@@ -903,8 +900,6 @@ void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
 {
     struct domain_iommu *hd = dom_iommu(dom_io);
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev->arch.amd.root_table )
         return;
 
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 955f3af57a..919e30129e 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -268,8 +268,6 @@ static int __must_check amd_iommu_setup_domain_device(
                     req_id, pdev->type, page_to_maddr(root_pg),
                     domid, hd->arch.amd.paging_mode);
 
-    ASSERT(pcidevs_locked());
-
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
          !ivrs_dev->block_ats &&
          iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) &&
@@ -416,8 +414,6 @@ static void amd_iommu_disable_domain_device(const struct domain *domain,
     if ( QUARANTINE_SKIP(domain, pdev) )
         return;
 
-    ASSERT(pcidevs_locked());
-
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
          pci_ats_enabled(iommu->seg, bus, pdev->devfn) )
     {
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index c83397211b..cc62a5aec4 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -517,7 +517,6 @@ int __init pci_hide_device(unsigned int seg, unsigned int bus,
     struct pci_seg *pseg;
     int rc = -ENOMEM;
 
-    pcidevs_lock();
     pseg = alloc_pseg(seg);
     if ( pseg )
     {
@@ -528,7 +527,6 @@ int __init pci_hide_device(unsigned int seg, unsigned int bus,
             rc = 0;
         }
     }
-    pcidevs_unlock();
 
     return rc;
 }
@@ -588,8 +586,6 @@ struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
 {
     struct pci_dev *pdev;
 
-    ASSERT(d || pcidevs_locked());
-
     /*
      * The hardware domain owns the majority of the devices in the system.
      * When there are multiple segments, traversing the per-segment list is
@@ -730,7 +726,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         pdev_type = "device";
     else if ( info->is_virtfn )
     {
-        pcidevs_lock();
         pdev = pci_get_pdev(NULL,
                             PCI_SBDF(seg, info->physfn.bus,
                                      info->physfn.devfn));
@@ -739,7 +734,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
             pf_is_extfn = pdev->info.is_extfn;
             pcidev_put(pdev);
         }
-        pcidevs_unlock();
         if ( !pdev )
             pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
                            NULL, node);
@@ -756,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
 
     ret = -ENOMEM;
 
-    pcidevs_lock();
     pseg = alloc_pseg(seg);
     if ( !pseg )
         goto out;
@@ -858,7 +851,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
     pci_enable_acs(pdev);
 
 out:
-    pcidevs_unlock();
     if ( !ret )
     {
         printk(XENLOG_DEBUG "PCI add %s %pp\n", pdev_type,  &pdev->sbdf);
@@ -889,7 +881,6 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     if ( !pseg )
         return -ENODEV;
 
-    pcidevs_lock();
     spin_lock(&pseg->alldevs_lock);
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
@@ -910,12 +901,10 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
             break;
         }
 
-    pcidevs_unlock();
     spin_unlock(&pseg->alldevs_lock);
     return ret;
 }
 
-/* Caller should hold the pcidevs_lock */
 static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
                            uint8_t devfn)
 {
@@ -927,7 +916,6 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
     if ( !is_iommu_enabled(d) )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
     pdev = pci_get_pdev(d, PCI_SBDF(seg, bus, devfn));
     if ( !pdev )
         return -ENODEV;
@@ -981,13 +969,10 @@ int pci_release_devices(struct domain *d)
     u8 bus, devfn;
     int ret;
 
-    pcidevs_lock();
     ret = arch_pci_clean_pirqs(d);
     if ( ret )
-    {
-        pcidevs_unlock();
         return ret;
-    }
+
     spin_lock(&d->pdevs_lock);
     list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
     {
@@ -996,7 +981,6 @@ int pci_release_devices(struct domain *d)
         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
     }
     spin_unlock(&d->pdevs_lock);
-    pcidevs_unlock();
 
     return ret;
 }
@@ -1094,7 +1078,6 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
     s_time_t now = NOW();
     u16 cword;
 
-    pcidevs_lock();
     pdev = pci_get_real_pdev(PCI_SBDF(seg, bus, devfn));
     if ( pdev )
     {
@@ -1108,7 +1091,6 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
             pdev = NULL;
         }
     }
-    pcidevs_unlock();
 
     if ( !pdev )
         return;
@@ -1164,13 +1146,7 @@ static int __init cf_check _scan_pci_devices(struct pci_seg *pseg, void *arg)
 
 int __init scan_pci_devices(void)
 {
-    int ret;
-
-    pcidevs_lock();
-    ret = pci_segments_iterate(_scan_pci_devices, NULL);
-    pcidevs_unlock();
-
-    return ret;
+    return pci_segments_iterate(_scan_pci_devices, NULL);
 }
 
 struct setup_hwdom {
@@ -1239,19 +1215,11 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
 
             pcidev_put(pdev);
             if ( iommu_verbose )
-            {
-                pcidevs_unlock();
                 process_pending_softirqs();
-                pcidevs_lock();
-            }
         }
 
         if ( !iommu_verbose )
-        {
-            pcidevs_unlock();
             process_pending_softirqs();
-            pcidevs_lock();
-        }
     }
 
     return 0;
@@ -1262,9 +1230,7 @@ void __hwdom_init setup_hwdom_pci_devices(
 {
     struct setup_hwdom ctxt = { .d = d, .handler = handler };
 
-    pcidevs_lock();
     pci_segments_iterate(_setup_hwdom_pci_devices, &ctxt);
-    pcidevs_unlock();
 }
 
 /* APEI not supported on ARM yet. */
@@ -1396,9 +1362,7 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
 static void cf_check dump_pci_devices(unsigned char ch)
 {
     printk("==== PCI devices ====\n");
-    pcidevs_lock();
     pci_segments_iterate(_dump_pci_devices, NULL);
-    pcidevs_unlock();
 }
 
 static int __init cf_check setup_dump_pcidevs(void)
@@ -1417,8 +1381,6 @@ static int iommu_add_device(struct pci_dev *pdev)
     if ( !pdev->domain )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
-
     hd = dom_iommu(pdev->domain);
     if ( !is_iommu_enabled(pdev->domain) )
         return 0;
@@ -1446,8 +1408,6 @@ static int iommu_enable_device(struct pci_dev *pdev)
     if ( !pdev->domain )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
-
     hd = dom_iommu(pdev->domain);
     if ( !is_iommu_enabled(pdev->domain) ||
          !hd->platform_ops->enable_device )
@@ -1494,7 +1454,6 @@ static int device_assigned(struct pci_dev *pdev)
 {
     int rc = 0;
 
-    ASSERT(pcidevs_locked());
     /*
      * If the device exists and it is not owned by either the hardware
      * domain or dom_io then it must be assigned to a guest, or be
@@ -1507,7 +1466,6 @@ static int device_assigned(struct pci_dev *pdev)
     return rc;
 }
 
-/* Caller should hold the pcidevs_lock */
 static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
 {
     const struct domain_iommu *hd = dom_iommu(d);
@@ -1521,7 +1479,6 @@ static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
         return -EXDEV;
 
     /* device_assigned() should already have cleared the device for assignment */
-    ASSERT(pcidevs_locked());
     ASSERT(pdev && (pdev->domain == hardware_domain ||
                     pdev->domain == dom_io));
 
@@ -1587,7 +1544,6 @@ static int iommu_get_device_group(
     if ( group_id < 0 )
         return group_id;
 
-    pcidevs_lock();
     spin_lock(&d->pdevs_lock);
     for_each_pdev( d, pdev )
     {
@@ -1603,7 +1559,6 @@ static int iommu_get_device_group(
         sdev_id = iommu_call(ops, get_device_group_id, seg, b, df);
         if ( sdev_id < 0 )
         {
-            pcidevs_unlock();
             spin_unlock(&d->pdevs_lock);
             return sdev_id;
         }
@@ -1614,7 +1569,6 @@ static int iommu_get_device_group(
 
             if ( unlikely(copy_to_guest_offset(buf, i, &bdf, 1)) )
             {
-                pcidevs_unlock();
                 spin_unlock(&d->pdevs_lock);
                 return -EFAULT;
             }
@@ -1622,7 +1576,6 @@ static int iommu_get_device_group(
         }
     }
 
-    pcidevs_unlock();
     spin_unlock(&d->pdevs_lock);
 
     return i;
@@ -1630,17 +1583,12 @@ static int iommu_get_device_group(
 
 void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
 {
-    pcidevs_lock();
-
     /* iommu->ats_list_lock is taken by the caller of this function */
     disable_ats_device(pdev);
 
     ASSERT(pdev->domain);
     if ( d != pdev->domain )
-    {
-        pcidevs_unlock();
         return;
-    }
 
     pdev->broken = true;
 
@@ -1649,8 +1597,6 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
                d->domain_id, &pdev->sbdf);
     if ( !is_hardware_domain(d) )
         domain_crash(d);
-
-    pcidevs_unlock();
 }
 
 int iommu_do_pci_domctl(
@@ -1740,7 +1686,6 @@ int iommu_do_pci_domctl(
             break;
         }
 
-        pcidevs_lock();
         ret = device_assigned(pdev);
         if ( domctl->cmd == XEN_DOMCTL_test_assign_device )
         {
@@ -1755,7 +1700,7 @@ int iommu_do_pci_domctl(
             ret = assign_device(d, pdev, flags);
 
         pcidev_put(pdev);
-        pcidevs_unlock();
+
         if ( ret == -ERESTART )
             ret = hypercall_create_continuation(__HYPERVISOR_domctl,
                                                 "h", u_domctl);
@@ -1787,9 +1732,7 @@ int iommu_do_pci_domctl(
         bus = PCI_BUS(machine_sbdf);
         devfn = PCI_DEVFN(machine_sbdf);
 
-        pcidevs_lock();
         ret = deassign_device(d, seg, bus, devfn);
-        pcidevs_unlock();
         break;
 
     default:
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 42661f22f4..87868188b7 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1490,7 +1490,6 @@ int domain_context_mapping_one(
     if ( QUARANTINE_SKIP(domain, pgd_maddr) )
         return 0;
 
-    ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
     maddr = bus_to_context_maddr(iommu, bus);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
@@ -1711,8 +1710,6 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
     if ( drhd && drhd->iommu->node != NUMA_NO_NODE )
         dom_iommu(domain)->node = drhd->iommu->node;
 
-    ASSERT(pcidevs_locked());
-
     for_each_rmrr_device( rmrr, bdf, i )
     {
         if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
@@ -2072,8 +2069,6 @@ static void quarantine_teardown(struct pci_dev *pdev,
 {
     struct domain_iommu *hd = dom_iommu(dom_io);
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev->arch.vtd.pgd_maddr )
         return;
 
@@ -2341,8 +2336,6 @@ static int cf_check intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
     u16 bdf;
     int ret, i;
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev->domain )
         return -EINVAL;
 
@@ -3176,7 +3169,6 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
     bool rmrr_found = false;
     int rc;
 
-    ASSERT(pcidevs_locked());
     ASSERT(!hd->arch.vtd.pgd_maddr);
     ASSERT(page_list_empty(&hd->arch.pgtables.list));
 
diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c
index 1298f3a7b6..1f7c496114 100644
--- a/xen/drivers/video/vga.c
+++ b/xen/drivers/video/vga.c
@@ -117,9 +117,7 @@ void __init video_endboot(void)
                 struct pci_dev *pdev;
                 u8 b = bus, df = devfn, sb;
 
-                pcidevs_lock();
                 pdev = pci_get_pdev(NULL, PCI_SBDF(0, bus, devfn));
-                pcidevs_unlock();
 
                 if ( !pdev ||
                      pci_conf_read16(PCI_SBDF(0, bus, devfn),
-- 
2.36.1



* [RFC PATCH 10/10] [RFC only] xen: pci: remove pcidev_lock() function
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
                   ` (8 preceding siblings ...)
  2022-08-31 14:11 ` [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls Volodymyr Babchuk
@ 2022-08-31 14:11 ` Volodymyr Babchuk
  2022-09-06 10:32 ` [RFC PATCH 00/10] Rework PCI locking Jan Beulich
  10 siblings, 0 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2022-08-31 14:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Volodymyr Babchuk, Jan Beulich,
	Paul Durrant, Roger Pau Monné,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu

This patch will be squashed into "xen: pci: remove pcidev_[un]lock[ed]
calls" after we resolve "[RFC only] xen: iommu: remove last
pcidevs_lock() calls in iommu".

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
---
 xen/drivers/passthrough/pci.c | 18 ------------------
 xen/include/xen/pci.h         | 10 ----------
 2 files changed, 28 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index cc62a5aec4..381eba3018 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -50,24 +50,6 @@ struct pci_seg {
         u8 devfn;
     } bus2bridge[MAX_BUSES];
 };
-
-static spinlock_t _pcidevs_lock = SPIN_LOCK_UNLOCKED;
-
-void pcidevs_lock(void)
-{
-    spin_lock_recursive(&_pcidevs_lock);
-}
-
-void pcidevs_unlock(void)
-{
-    spin_unlock_recursive(&_pcidevs_lock);
-}
-
-bool_t pcidevs_locked(void)
-{
-    return !!spin_is_locked(&_pcidevs_lock);
-}
-
 static struct radix_tree_root pci_segments;
 
 static inline struct pci_seg *get_pseg(u16 seg)
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index d0a7339d84..0abc54ea39 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -156,16 +156,6 @@ struct pci_dev {
 
 #define has_arch_pdevs(d) (!list_empty(&(d)->pdev_list))
 
-/*
- * The pcidevs_lock protect alldevs_list, and the assignment for the 
- * devices, it also sync the access to the msi capability that is not
- * interrupt handling related (the mask bit register).
- */
-
-void pcidevs_lock(void);
-void pcidevs_unlock(void);
-bool_t __must_check pcidevs_locked(void);
-
 /*
  * Acquire and release reference to the given device. Holding
  * reference ensures that device will not disappear under feet, but
-- 
2.36.1



* Re: [RFC PATCH 00/10] Rework PCI locking
  2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
                   ` (9 preceding siblings ...)
  2022-08-31 14:11 ` [RFC PATCH 10/10] [RFC only] xen: pci: remove pcidev_lock() function Volodymyr Babchuk
@ 2022-09-06 10:32 ` Jan Beulich
  2023-01-18 18:21   ` Julien Grall
  10 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2022-09-06 10:32 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, Paul Durrant,
	Roger Pau Monné,
	Kevin Tian, xen-devel

On 31.08.2022 16:10, Volodymyr Babchuk wrote:
> Hello,
> 
> This is yet another take on a PCI locking rework. This approach
> was suggested by Jan Beulich, who proposed using a reference
> counter to control the lifetime of pci_dev objects.
> 
> When I started adding reference counting, it quickly became clear
> that this approach can provide more granular locking instead of the
> huge pcidevs_lock() which is used right now. I studied how this
> lock is used and what it protects, and found the following:
> 
> 0. Comment in pci.h states the following:
> 
>  153 /*
>  154  * The pcidevs_lock protect alldevs_list, and the assignment for the
>  155  * devices, it also sync the access to the msi capability that is not
>  156  * interrupt handling related (the mask bit register).
>  157  */
> 
> But in reality it does much more. Here is what I found:
> 
> 1. Lifetime of pci_dev struct
> 
> 2. Access to pseg->alldevs_list
> 
> 3. Access to domain->pdev_list
> 
> 4. Access to iommu->ats_list
> 
> 5. Access to MSI capability
> 
> 6. Some obscure stuff in IOMMU drivers: there are places that
> are guarded by pcidevs_lock() but it seems that nothing
> PCI-related happens there.

Right - the lock being held was (ab)used in IOMMU code in a number of
places. This likely needs to change in the course of this re-work;
patch titles don't suggest this is currently part of the series.

> 7. Something that I probably overlooked

And this is the main risk here. The huge scope of the original lock
means that many things are serialized now but won't be anymore once
the lock is gone.

But yes - thanks for the work. To be honest I don't expect to be able
to look at this series in detail until after the Xen Summit. And even
then it may take a while ...

Jan



* Re: [RFC PATCH 00/10] Rework PCI locking
  2022-09-06 10:32 ` [RFC PATCH 00/10] Rework PCI locking Jan Beulich
@ 2023-01-18 18:21   ` Julien Grall
  2023-01-19  9:47     ` Jan Beulich
  0 siblings, 1 reply; 43+ messages in thread
From: Julien Grall @ 2023-01-18 18:21 UTC (permalink / raw)
  To: Jan Beulich, Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Stefano Stabellini, Wei Liu, Paul Durrant, Roger Pau Monné,
	Kevin Tian, xen-devel

Hi Jan,

On 06/09/2022 11:32, Jan Beulich wrote:
> On 31.08.2022 16:10, Volodymyr Babchuk wrote:
>> Hello,
>>
>> This is yet another take on a PCI locking rework. This approach
>> was suggested by Jan Beulich, who proposed using a reference
>> counter to control the lifetime of pci_dev objects.
>>
>> When I started adding reference counting, it quickly became clear
>> that this approach can provide more granular locking instead of the
>> huge pcidevs_lock() which is used right now. I studied how this
>> lock is used and what it protects, and found the following:
>>
>> 0. Comment in pci.h states the following:
>>
>>   153 /*
>>   154  * The pcidevs_lock protect alldevs_list, and the assignment for the
>>   155  * devices, it also sync the access to the msi capability that is not
>>   156  * interrupt handling related (the mask bit register).
>>   157  */
>>
>> But in reality it does much more. Here is what I found:
>>
>> 1. Lifetime of pci_dev struct
>>
>> 2. Access to pseg->alldevs_list
>>
>> 3. Access to domain->pdev_list
>>
>> 4. Access to iommu->ats_list
>>
>> 5. Access to MSI capability
>>
>> 6. Some obscure stuff in IOMMU drivers: there are places that
>> are guarded by pcidevs_lock() but it seems that nothing
>> PCI-related happens there.
> 
> Right - the lock being held was (ab)used in IOMMU code in a number of
> places. This likely needs to change in the course of this re-work;
> patch titles don't suggest this is currently part of the series.
> 
>> 7. Something that I probably overlooked
> 
> And this is the main risk here. The huge scope of the original lock
> means that many things are serialized now but won't be anymore once
> the lock is gone.
> 
> But yes - thanks for the work. To be honest I don't expect to be able
> to look at this series in detail until after the Xen Summit. And even
> then it may take a while ...

I was wondering if this is still on your list to review?

Cheers,

-- 
Julien Grall



* Re: [RFC PATCH 00/10] Rework PCI locking
  2023-01-18 18:21   ` Julien Grall
@ 2023-01-19  9:47     ` Jan Beulich
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-01-19  9:47 UTC (permalink / raw)
  To: Julien Grall
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Stefano Stabellini, Wei Liu, Paul Durrant, Roger Pau Monné,
	Kevin Tian, xen-devel, Volodymyr Babchuk

On 18.01.2023 19:21, Julien Grall wrote:
> On 06/09/2022 11:32, Jan Beulich wrote:
>> On 31.08.2022 16:10, Volodymyr Babchuk wrote:
>>> Hello,
>>>
>>> This is yet another take on a PCI locking rework. This approach
>>> was suggested by Jan Beulich, who proposed using a reference
>>> counter to control the lifetime of pci_dev objects.
>>>
>>> When I started adding reference counting, it quickly became clear
>>> that this approach can provide more granular locking instead of the
>>> huge pcidevs_lock() which is used right now. I studied how this
>>> lock is used and what it protects, and found the following:
>>>
>>> 0. Comment in pci.h states the following:
>>>
>>>   153 /*
>>>   154  * The pcidevs_lock protect alldevs_list, and the assignment for the
>>>   155  * devices, it also sync the access to the msi capability that is not
>>>   156  * interrupt handling related (the mask bit register).
>>>   157  */
>>>
>>> But in reality it does much more. Here is what I found:
>>>
>>> 1. Lifetime of pci_dev struct
>>>
>>> 2. Access to pseg->alldevs_list
>>>
>>> 3. Access to domain->pdev_list
>>>
>>> 4. Access to iommu->ats_list
>>>
>>> 5. Access to MSI capability
>>>
>>> 6. Some obscure stuff in IOMMU drivers: there are places that
>>> are guarded by pcidevs_lock() but it seems that nothing
>>> PCI-related happens there.
>>
>> Right - the lock being held was (ab)used in IOMMU code in a number of
>> places. This likely needs to change in the course of this re-work;
>> patch titles don't suggest this is currently part of the series.
>>
>>> 7. Something that I probably overlooked
>>
>> And this is the main risk here. The huge scope of the original lock
>> means that many things are serialized now but won't be anymore once
>> the lock is gone.
>>
>> But yes - thanks for the work. To be honest I don't expect to be able
>> to look at this series in detail until after the Xen Summit. And even
>> then it may take a while ...
> 
> I was wondering if this is still on your list to review?

Yes, it certainly is. But as before no predictions when I might get to it.

Jan



* Re: [RFC PATCH 01/10] xen: pci: add per-domain pci list lock
  2022-08-31 14:10 ` [RFC PATCH 01/10] xen: pci: add per-domain pci list lock Volodymyr Babchuk
@ 2023-01-26 23:18   ` Stefano Stabellini
  2023-01-27  8:01     ` Jan Beulich
  2023-02-14 23:38     ` Volodymyr Babchuk
  0 siblings, 2 replies; 43+ messages in thread
From: Stefano Stabellini @ 2023-01-26 23:18 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu,
	Paul Durrant, Roger Pau Monné,
	Kevin Tian, Stewart.Hildebrand

On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
> domain->pdevs_lock protects access to domain->pdev_list.
> As such, it should be used when we are adding, removing or enumerating
> PCI devices assigned to a domain.
> 
> This enables more granular locking instead of one huge pcidevs_lock that
> locks the entire PCI subsystem. Please note that pcidevs_lock() is still
> used; we are going to remove it in subsequent patches.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

I reviewed the patch, and made sure to pay extra attention to:
- error paths
- missing locks
- lock ordering
- interruptions

Here is what I found:


1) iommu.c:reassign_device_ownership and pci_amd_iommu.c:reassign_device
Both functions do the following without taking any pdevs_lock:
list_move(&pdev->domain_list, &target->pdev_list);

It seems to me this would need pdevs_lock. Maybe we need to change
list_move into list_del (protected by the pdevs_lock of the old domain)
and list_add (protected by the pdevs_lock of the new domain).
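
To make the suggestion concrete, here is a minimal user-space sketch of
the split. It is only an illustration under stated assumptions: the toy
spinlock, the list helpers (list_add_head, list_remove) and
reassign_pdev() are stand-ins written for this example, not Xen's
actual implementation or API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy spinlock standing in for Xen's spinlock_t (illustration only). */
typedef struct { atomic_flag held; } spinlock_t;

static void spin_lock_init(spinlock_t *l) { atomic_flag_clear(&l->held); }
static void spin_lock(spinlock_t *l)
{
    while ( atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire) )
        ; /* busy-wait until the holder releases the lock */
}
static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Minimal circular doubly linked list (hypothetical helpers). */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }
static int list_is_empty(const struct list_head *h) { return h->next == h; }
static void list_add_head(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}
static void list_remove(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

struct domain {
    struct list_head pdev_list;
    spinlock_t pdevs_lock;        /* protects pdev_list */
};

struct pci_dev {
    struct list_head domain_list; /* entry in domain->pdev_list */
    struct domain *domain;
};

static void domain_init(struct domain *d)
{
    list_init(&d->pdev_list);
    spin_lock_init(&d->pdevs_lock);
}

/*
 * Replacement for a bare list_move(): each list is only modified under
 * its own domain's lock, and the two locks are never held together.
 */
static void reassign_pdev(struct pci_dev *pdev, struct domain *target)
{
    struct domain *source = pdev->domain;

    assert(source != NULL);

    spin_lock(&source->pdevs_lock);
    list_remove(&pdev->domain_list);
    spin_unlock(&source->pdevs_lock);

    spin_lock(&target->pdevs_lock);
    list_add_head(&pdev->domain_list, &target->pdev_list);
    spin_unlock(&target->pdevs_lock);

    pdev->domain = target;
}
```

Because the two locks are never held at the same time, no ordering
between the per-domain locks has to be defined. The cost is a short
window in which the device is on neither list; whether that is
acceptable for the real callers would need to be checked.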


2) has_arch_pdevs
has_arch_pdevs is implemented as list_empty and needs locking as well,
however no domain->pdevs_lock calls are added to protect has_arch_pdevs
in this patch. I think we need pdevs_lock around has_arch_pdevs.
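
As a rough sketch of what a locked check could look like (user-space
stand-ins again; domain_has_pdevs() and the toy spinlock are
hypothetical names for this example and do not exist in Xen):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy spinlock standing in for Xen's spinlock_t (illustration only). */
typedef struct { atomic_flag held; } spinlock_t;

static void spin_lock_init(spinlock_t *l) { atomic_flag_clear(&l->held); }
static void spin_lock(spinlock_t *l)
{
    while ( atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire) )
        ; /* busy-wait until the holder releases the lock */
}
static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct list_head { struct list_head *prev, *next; };

struct domain {
    struct list_head pdev_list;
    spinlock_t pdevs_lock;   /* protects pdev_list */
};

static void domain_init(struct domain *d)
{
    d->pdev_list.prev = d->pdev_list.next = &d->pdev_list;
    spin_lock_init(&d->pdevs_lock);
}

/* Locked equivalent of a !list_empty(&d->pdev_list) check. */
static bool domain_has_pdevs(struct domain *d)
{
    bool ret;

    assert(d != NULL);

    spin_lock(&d->pdevs_lock);
    ret = d->pdev_list.next != &d->pdev_list;
    spin_unlock(&d->pdevs_lock);

    return ret;
}
```

Note that the answer can already be stale by the time the caller acts
on it, since the lock is dropped before returning. Callers that make a
decision based on the result may need to keep the lock held across
both the check and the action.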


Two more comments follow below, about lock inversion and taking the same
lock twice.



> ---
>  xen/common/domain.c                         |  1 +
>  xen/drivers/passthrough/amd/iommu_cmd.c     |  4 ++-
>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  7 ++++-
>  xen/drivers/passthrough/pci.c               | 29 ++++++++++++++++++++-
>  xen/drivers/passthrough/vtd/iommu.c         |  9 +++++--
>  xen/drivers/vpci/header.c                   |  3 +++
>  xen/drivers/vpci/msi.c                      |  7 ++++-
>  xen/drivers/vpci/vpci.c                     |  4 +--
>  xen/include/xen/pci.h                       |  2 +-
>  xen/include/xen/sched.h                     |  1 +
>  10 files changed, 58 insertions(+), 9 deletions(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 7062393e37..4611141b87 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -618,6 +618,7 @@ struct domain *domain_create(domid_t domid,
>  
>  #ifdef CONFIG_HAS_PCI
>      INIT_LIST_HEAD(&d->pdev_list);
> +    spin_lock_init(&d->pdevs_lock);
>  #endif
>  
>      /* All error paths can depend on the above setup. */
> diff --git a/xen/drivers/passthrough/amd/iommu_cmd.c b/xen/drivers/passthrough/amd/iommu_cmd.c
> index 40ddf366bb..47c45398d4 100644
> --- a/xen/drivers/passthrough/amd/iommu_cmd.c
> +++ b/xen/drivers/passthrough/amd/iommu_cmd.c
> @@ -308,11 +308,12 @@ void amd_iommu_flush_iotlb(u8 devfn, const struct pci_dev *pdev,
>      flush_command_buffer(iommu, iommu_dev_iotlb_timeout);
>  }
>  
> -static void amd_iommu_flush_all_iotlbs(const struct domain *d, daddr_t daddr,
> +static void amd_iommu_flush_all_iotlbs(struct domain *d, daddr_t daddr,
>                                         unsigned int order)
>  {
>      struct pci_dev *pdev;
>  
> +    spin_lock(&d->pdevs_lock);
>      for_each_pdev( d, pdev )
>      {
>          u8 devfn = pdev->devfn;
> @@ -323,6 +324,7 @@ static void amd_iommu_flush_all_iotlbs(const struct domain *d, daddr_t daddr,
>          } while ( devfn != pdev->devfn &&
>                    PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>      }
> +    spin_unlock(&d->pdevs_lock);
>  }
>  
>  /* Flush iommu cache after p2m changes. */
> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> index 4ba8e764b2..64c016491d 100644
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -96,20 +96,25 @@ static int __must_check allocate_domain_resources(struct domain *d)
>      return rc;
>  }
>  
> -static bool any_pdev_behind_iommu(const struct domain *d,
> +static bool any_pdev_behind_iommu(struct domain *d,
>                                    const struct pci_dev *exclude,
>                                    const struct amd_iommu *iommu)
>  {
>      const struct pci_dev *pdev;
>  
> +    spin_lock(&d->pdevs_lock);
>      for_each_pdev ( d, pdev )
>      {
>          if ( pdev == exclude )
>              continue;
>  
>          if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu )
> +	{
> +	    spin_unlock(&d->pdevs_lock);
>              return true;
> +	}

code style: tabs instead of spaces


>      }
> +    spin_unlock(&d->pdevs_lock);
>  
>      return false;
>  }
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index cdaf5c247f..4366f8f965 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -523,7 +523,9 @@ static void __init _pci_hide_device(struct pci_dev *pdev)
>      if ( pdev->domain )
>          return;
>      pdev->domain = dom_xen;
> +    spin_lock(&dom_xen->pdevs_lock);
>      list_add(&pdev->domain_list, &dom_xen->pdev_list);
> +    spin_unlock(&dom_xen->pdevs_lock);
>  }
>  
>  int __init pci_hide_device(unsigned int seg, unsigned int bus,
> @@ -595,7 +597,7 @@ struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf)
>      return pdev;
>  }
>  
> -struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
> +struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
>  {
>      struct pci_dev *pdev;
>  
> @@ -620,9 +622,16 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
>                  return pdev;
>      }
>      else
> +    {
> +        spin_lock(&d->pdevs_lock);
>          list_for_each_entry ( pdev, &d->pdev_list, domain_list )
>              if ( pdev->sbdf.bdf == sbdf.bdf )
> +            {
> +                spin_unlock(&d->pdevs_lock);
>                  return pdev;
> +            }
> +        spin_unlock(&d->pdevs_lock);
> +    }
>  
>      return NULL;
>  }
> @@ -817,7 +826,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>      if ( !pdev->domain )
>      {
>          pdev->domain = hardware_domain;
> +        spin_lock(&hardware_domain->pdevs_lock);
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> +        spin_unlock(&hardware_domain->pdevs_lock);
>  
>          /*
>           * For devices not discovered by Xen during boot, add vPCI handlers
> @@ -827,7 +838,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
> +            spin_lock(&pdev->domain->pdevs_lock);
>              list_del(&pdev->domain_list);
> +            spin_unlock(&pdev->domain->pdevs_lock);
>              pdev->domain = NULL;
>              goto out;
>          }
> @@ -835,7 +848,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              vpci_remove_device(pdev);
> +            spin_lock(&pdev->domain->pdevs_lock);
>              list_del(&pdev->domain_list);
> +            spin_unlock(&pdev->domain->pdevs_lock);
>              pdev->domain = NULL;
>              goto out;
>          }
> @@ -885,7 +900,11 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>              pci_cleanup_msi(pdev);
>              ret = iommu_remove_device(pdev);
>              if ( pdev->domain )
> +            {
> +                spin_lock(&pdev->domain->pdevs_lock);
>                  list_del(&pdev->domain_list);
> +                spin_unlock(&pdev->domain->pdevs_lock);
> +            }
>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>              free_pdev(pseg, pdev);
>              break;
> @@ -967,12 +986,14 @@ int pci_release_devices(struct domain *d)
>          pcidevs_unlock();
>          return ret;
>      }
> +    spin_lock(&d->pdevs_lock);
>      list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
>      {
>          bus = pdev->bus;
>          devfn = pdev->devfn;
>          ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;

This causes pdevs_lock to be taken twice. deassign_device also takes a
pdevs_lock.  Probably we need to change all the
spin_lock(&d->pdevs_lock) into spin_lock_recursive.
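
To illustrate why a plain spinlock deadlocks here while a recursive one
does not, below is a minimal user-space sketch of a recursive lock that
tracks an owner and a nesting depth. All names (struct rec_lock,
rec_trylock, release_all, deassign_one) are made up for this example,
and the holder id is passed explicitly; this is not how Xen's
spin_lock_recursive() is implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_OWNER (-1)

struct rec_lock {
    atomic_int owner;    /* id of the current holder, or NO_OWNER */
    unsigned int depth;  /* nesting count, only touched by the holder */
};

static void rec_lock_init(struct rec_lock *l)
{
    atomic_init(&l->owner, NO_OWNER);
    l->depth = 0;
}

/* Succeeds if the lock is free or already held by 'self'. */
static bool rec_trylock(struct rec_lock *l, int self)
{
    int expected = NO_OWNER;

    if ( atomic_load_explicit(&l->owner, memory_order_relaxed) != self &&
         !atomic_compare_exchange_strong_explicit(&l->owner, &expected, self,
                                                  memory_order_acquire,
                                                  memory_order_relaxed) )
        return false;

    l->depth++;
    return true;
}

static void rec_lock_acquire(struct rec_lock *l, int self)
{
    while ( !rec_trylock(l, self) )
        ; /* spin until the current holder fully releases the lock */
}

static void rec_unlock(struct rec_lock *l, int self)
{
    assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == self);
    assert(l->depth > 0);

    /* Only the outermost unlock actually releases the lock. */
    if ( --l->depth == 0 )
        atomic_store_explicit(&l->owner, NO_OWNER, memory_order_release);
}

/* Inner helper that takes the lock itself, like deassign_device() here. */
static int deassign_one(struct rec_lock *l, int self)
{
    rec_lock_acquire(l, self);
    /* ... per-device work would go here ... */
    rec_unlock(l, self);
    return 1;
}

/* Outer loop that already holds the lock, like pci_release_devices(). */
static int release_all(struct rec_lock *l, int self, int ndevs)
{
    int i, done = 0;

    rec_lock_acquire(l, self);
    for ( i = 0; i < ndevs; i++ )
        done += deassign_one(l, self);  /* nested acquire: no deadlock */
    rec_unlock(l, self);

    return done;
}
```

With a non-recursive lock, the nested acquire in deassign_one() would
spin forever. An alternative to making the lock recursive might be an
inner helper that expects the caller to hold the lock already, which
keeps the lock simple at the cost of two entry points.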



>      }
> +    spin_unlock(&d->pdevs_lock);
>      pcidevs_unlock();
>  
>      return ret;
> @@ -1194,7 +1215,9 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
>              if ( !pdev->domain )
>              {
>                  pdev->domain = ctxt->d;
> +                spin_lock(&pdev->domain->pdevs_lock);
>                  list_add(&pdev->domain_list, &ctxt->d->pdev_list);
> +                spin_unlock(&pdev->domain->pdevs_lock);
>                  setup_one_hwdom_device(ctxt, pdev);
>              }
>              else if ( pdev->domain == dom_xen )
> @@ -1556,6 +1579,7 @@ static int iommu_get_device_group(
>          return group_id;
>  
>      pcidevs_lock();
> +    spin_lock(&d->pdevs_lock);
>      for_each_pdev( d, pdev )
>      {
>          unsigned int b = pdev->bus;
> @@ -1571,6 +1595,7 @@ static int iommu_get_device_group(
>          if ( sdev_id < 0 )
>          {
>              pcidevs_unlock();
> +            spin_unlock(&d->pdevs_lock);

lock inversion: pcidevs_lock was taken before d->pdevs_lock, so
d->pdevs_lock should be released first


>              return sdev_id;
>          }
>  
> @@ -1581,6 +1606,7 @@ static int iommu_get_device_group(
>              if ( unlikely(copy_to_guest_offset(buf, i, &bdf, 1)) )
>              {
>                  pcidevs_unlock();
> +                spin_unlock(&d->pdevs_lock);

lock inversion


>                  return -EFAULT;
>              }
>              i++;
> @@ -1588,6 +1614,7 @@ static int iommu_get_device_group(
>      }
>  
>      pcidevs_unlock();
> +    spin_unlock(&d->pdevs_lock);

lock inversion


>      return i;
>  }
> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
> index 62e143125d..fff1442265 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -183,12 +183,13 @@ static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
>      }
>  }
>  
> -static bool any_pdev_behind_iommu(const struct domain *d,
> +static bool any_pdev_behind_iommu(struct domain *d,
>                                    const struct pci_dev *exclude,
>                                    const struct vtd_iommu *iommu)
>  {
>      const struct pci_dev *pdev;
>  
> +    spin_lock(&d->pdevs_lock);
>      for_each_pdev ( d, pdev )
>      {
>          const struct acpi_drhd_unit *drhd;
> @@ -198,8 +199,12 @@ static bool any_pdev_behind_iommu(const struct domain *d,
>  
>          drhd = acpi_find_matched_drhd_unit(pdev);
>          if ( drhd && drhd->iommu == iommu )
> +        {
> +            spin_unlock(&d->pdevs_lock);
>              return true;
> +        }
>      }
> +    spin_unlock(&d->pdevs_lock);
>  
>      return false;
>  }
> @@ -208,7 +213,7 @@ static bool any_pdev_behind_iommu(const struct domain *d,
>   * If no other devices under the same iommu owned by this domain,
>   * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap.
>   */
> -static void check_cleanup_domid_map(const struct domain *d,
> +static void check_cleanup_domid_map(struct domain *d,
>                                      const struct pci_dev *exclude,
>                                      struct vtd_iommu *iommu)
>  {
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index a1c928a0d2..a59aa7ad0b 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -267,6 +267,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>       * Check for overlaps with other BARs. Note that only BARs that are
>       * currently mapped (enabled) are checked for overlaps.
>       */
> +    spin_lock(&pdev->domain->pdevs_lock);
>      for_each_pdev ( pdev->domain, tmp )
>      {
>          if ( tmp == pdev )
> @@ -306,11 +307,13 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>                  printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>                         start, end, rc);
>                  rangeset_destroy(mem);
> +                spin_unlock( &pdev->domain->pdevs_lock);
>                  return rc;
>              }
>          }
>      }
>  
> +    spin_unlock( &pdev->domain->pdevs_lock);
>      ASSERT(dev);
>  
>      if ( system_state < SYS_STATE_active )
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 8f2b59e61a..8969c335b0 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -265,7 +265,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>  
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>  
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -277,6 +277,9 @@ void vpci_dump_msi(void)
>  
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>  
> +        if ( !spin_trylock(&d->pdevs_lock) )
> +            continue;
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -326,6 +329,8 @@ void vpci_dump_msi(void)
>              spin_unlock(&pdev->vpci->lock);
>              process_pending_softirqs();
>          }
> +        spin_unlock(&d->pdevs_lock);
> +
>      }
>      rcu_read_unlock(&domlist_read_lock);
>  }
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 3467c0de86..7d1f9fd438 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -312,7 +312,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -415,7 +415,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index 5975ca2f30..19047b4b20 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -177,7 +177,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>  int pci_remove_device(u16 seg, u8 bus, u8 devfn);
>  int pci_ro_device(int seg, int bus, int devfn);
>  int pci_hide_device(unsigned int seg, unsigned int bus, unsigned int devfn);
> -struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf);
> +struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf);
>  struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf);
>  void pci_check_disable_device(u16 seg, u8 bus, u8 devfn);
>  
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 1cf629e7ec..0775228ba9 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -457,6 +457,7 @@ struct domain
>  
>  #ifdef CONFIG_HAS_PCI
>      struct list_head pdev_list;
> +    spinlock_t pdevs_lock;

I think "pdev_lock" would be a better name, but OK either way.


>  #endif
>  
>  #ifdef CONFIG_HAS_PASSTHROUGH
> -- 
> 2.36.1
> 



* Re: [RFC PATCH 02/10] xen: pci: add pci_seg->alldevs_lock
  2022-08-31 14:10 ` [RFC PATCH 02/10] xen: pci: add pci_seg->alldevs_lock Volodymyr Babchuk
@ 2023-01-26 23:40   ` Stefano Stabellini
  2023-02-28 16:32   ` Jan Beulich
  1 sibling, 0 replies; 43+ messages in thread
From: Stefano Stabellini @ 2023-01-26 23:40 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Paul Durrant,
	Roger Pau Monné

On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
> This lock protects alldevs_list of struct pci_seg. As such, it should
> be used when we are adding, removing or enumerating PCI devices
> assigned to a PCI segment.
> 
> The radix tree that stores PCI segments has its own locking mechanism.
> Also, pci_seg structures are only allocated and never freed, so we need
> no additional locking to access pci_seg structures. But we need a lock
> that protects the alldevs_list field.
> 
> This enables more granular locking instead of one huge pcidevs_lock
> that locks entire PCI subsystem.  Please note that pcidevs_lock() is
> still used, we are going to remove it in subsequent patches.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
>  xen/drivers/passthrough/pci.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 4366f8f965..2dfa1c2875 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -38,6 +38,7 @@
>  
>  struct pci_seg {
>      struct list_head alldevs_list;
> +    spinlock_t alldevs_lock;
>      u16 nr;
>      unsigned long *ro_map;
>      /* bus2bridge_lock protects bus2bridge array */
> @@ -93,6 +94,7 @@ static struct pci_seg *alloc_pseg(u16 seg)
>      pseg->nr = seg;
>      INIT_LIST_HEAD(&pseg->alldevs_list);
>      spin_lock_init(&pseg->bus2bridge_lock);
> +    spin_lock_init(&pseg->alldevs_lock);
>  
>      if ( radix_tree_insert(&pci_segments, seg, pseg) )
>      {
> @@ -385,9 +387,13 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>      unsigned int pos;
>      int rc;
>  
> +    spin_lock(&pseg->alldevs_lock);
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
> +        {
> +            spin_unlock(&pseg->alldevs_lock);
>              return pdev;
> +        }
>  
>      pdev = xzalloc(struct pci_dev);
>      if ( !pdev )

There is a missing spin_unlock(&pseg->alldevs_lock) on this error path.
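
One way to avoid this class of bug is to funnel every exit through a
single unlock. The following is only a sketch: toy_acquire() and
toy_release() are hypothetical stand-ins for spin_lock()/spin_unlock(),
struct seg and struct dev are simplified, and the fail_alloc flag merely
simulates xzalloc() returning NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for spinlock_t: it only records whether the lock is held. */
struct toy_lock { bool held; };

static void toy_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
static void toy_release(struct toy_lock *l) { assert(l->held); l->held = false; }

struct dev {
    struct dev *next;
    unsigned int bus, devfn;
};

struct seg {
    struct toy_lock alldevs_lock;
    struct dev *alldevs;            /* singly linked list head */
};

/*
 * Look up (bus, devfn); allocate and link a new entry if it is absent.
 * fail_alloc simulates an allocation failure.
 */
static struct dev *lookup_or_alloc(struct seg *s, unsigned int bus,
                                   unsigned int devfn, bool fail_alloc)
{
    struct dev *d;

    toy_acquire(&s->alldevs_lock);

    for ( d = s->alldevs; d; d = d->next )
        if ( d->bus == bus && d->devfn == devfn )
            goto out;

    d = fail_alloc ? NULL : calloc(1, sizeof(*d));
    if ( !d )
        goto out;                   /* error path leaves via the same unlock */

    d->bus = bus;
    d->devfn = devfn;
    d->next = s->alldevs;
    s->alldevs = d;

 out:
    toy_release(&s->alldevs_lock);
    return d;
}
```

With a single exit label, adding another early return later cannot
silently skip the unlock.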


> @@ -404,10 +410,12 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>      if ( rc )
>      {
>          xfree(pdev);
> +        spin_unlock(&pseg->alldevs_lock);
>          return NULL;
>      }
>  
>      list_add(&pdev->alldevs_list, &pseg->alldevs_list);
> +    spin_unlock(&pseg->alldevs_lock);
>  
>      /* update bus2bridge */
>      switch ( pdev->type = pdev_type(pseg->nr, bus, devfn) )
> @@ -611,15 +619,20 @@ struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
>       */
>      if ( !d || is_hardware_domain(d) )
>      {
> -        const struct pci_seg *pseg = get_pseg(sbdf.seg);
> +        struct pci_seg *pseg = get_pseg(sbdf.seg);
>  
>          if ( !pseg )
>              return NULL;
>  
> +        spin_lock(&pseg->alldevs_lock);
>          list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>              if ( pdev->sbdf.bdf == sbdf.bdf &&
>                   (!d || pdev->domain == d) )
> +            {
> +                spin_unlock(&pseg->alldevs_lock);
>                  return pdev;
> +            }
> +        spin_unlock(&pseg->alldevs_lock);
>      }
>      else
>      {
> @@ -893,6 +906,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>          return -ENODEV;
>  
>      pcidevs_lock();
> +    spin_lock(&pseg->alldevs_lock);
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
>          {
> @@ -907,10 +921,12 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>              }
>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>              free_pdev(pseg, pdev);
> +            list_del(&pdev->alldevs_list);

use after free: free_pdev() has already freed pdev
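
The safe order is to unlink the node first and free it second. A
minimal sketch, assuming a simplified list implementation rather than
Xen's own list.h; remove_dev() is a hypothetical helper, not a function
from this series.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, for illustration only. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

struct dev {
    struct list_head alldevs_list;
    unsigned int devfn;
};

/* Unlink first, free second: after free() the node must not be touched. */
static void remove_dev(struct dev *d)
{
    list_del(&d->alldevs_list);
    free(d);
}
```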

>              break;
>          }
>  
>      pcidevs_unlock();
> +    spin_unlock(&pseg->alldevs_lock);

lock inversion: pcidevs_lock was taken first, so alldevs_lock should be
released first, before pcidevs_unlock()


>      return ret;
>  }
>  
> @@ -1363,6 +1379,7 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
>  
>      printk("==== segment %04x ====\n", pseg->nr);
>  
> +    spin_lock(&pseg->alldevs_lock);
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>      {
>          printk("%pp - ", &pdev->sbdf);
> @@ -1376,6 +1393,7 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
>          pdev_dump_msi(pdev);
>          printk("\n");
>      }
> +    spin_unlock(&pseg->alldevs_lock);
>  
>      return 0;
>  }
> -- 
> 2.36.1
> 



* Re: [RFC PATCH 03/10] xen: pci: introduce ats_list_lock
  2022-08-31 14:10 ` [RFC PATCH 03/10] xen: pci: introduce ats_list_lock Volodymyr Babchuk
@ 2023-01-26 23:56   ` Stefano Stabellini
  2023-01-27  8:13     ` Jan Beulich
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2023-01-26 23:56 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Paul Durrant, Roger Pau Monné,
	Kevin Tian

On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
> The ATS subsystem has its own list of PCI devices. As we are going to
> remove the global pcidevs_lock() in favor of more granular locking, we
> need to ensure that this list is protected somehow. To do this, we need
> to add an additional lock for each IOMMU, as the list to be protected is
> also part of the IOMMU.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> ---
>  xen/drivers/passthrough/amd/iommu.h         |  1 +
>  xen/drivers/passthrough/amd/iommu_detect.c  |  1 +
>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  8 ++++++++
>  xen/drivers/passthrough/pci.c               |  1 +
>  xen/drivers/passthrough/vtd/iommu.c         | 11 +++++++++++
>  xen/drivers/passthrough/vtd/iommu.h         |  1 +
>  xen/drivers/passthrough/vtd/qinval.c        |  3 +++
>  xen/drivers/passthrough/vtd/x86/ats.c       |  3 +++
>  8 files changed, 29 insertions(+)
> 
> diff --git a/xen/drivers/passthrough/amd/iommu.h b/xen/drivers/passthrough/amd/iommu.h
> index 8bc3c35b1b..edd6eb52b3 100644
> --- a/xen/drivers/passthrough/amd/iommu.h
> +++ b/xen/drivers/passthrough/amd/iommu.h
> @@ -106,6 +106,7 @@ struct amd_iommu {
>      int enabled;
>  
>      struct list_head ats_devices;
> +    spinlock_t ats_list_lock;
>  };
>  
>  struct ivrs_unity_map {
> diff --git a/xen/drivers/passthrough/amd/iommu_detect.c b/xen/drivers/passthrough/amd/iommu_detect.c
> index 2317fa6a7d..1d6f4f2168 100644
> --- a/xen/drivers/passthrough/amd/iommu_detect.c
> +++ b/xen/drivers/passthrough/amd/iommu_detect.c
> @@ -160,6 +160,7 @@ int __init amd_iommu_detect_one_acpi(
>      }
>  
>      spin_lock_init(&iommu->lock);
> +    spin_lock_init(&iommu->ats_list_lock);
>      INIT_LIST_HEAD(&iommu->ats_devices);
>  
>      iommu->seg = ivhd_block->pci_segment_group;
> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> index 64c016491d..955f3af57a 100644
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -276,7 +276,11 @@ static int __must_check amd_iommu_setup_domain_device(
>           !pci_ats_enabled(iommu->seg, bus, pdev->devfn) )
>      {
>          if ( devfn == pdev->devfn )
> +	{
> +	    spin_lock(&iommu->ats_list_lock);
>              enable_ats_device(pdev, &iommu->ats_devices);
> +	    spin_unlock(&iommu->ats_list_lock);

code style: tabs instead of spaces


> +	}
>  
>          amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
>      }
> @@ -416,7 +420,11 @@ static void amd_iommu_disable_domain_device(const struct domain *domain,
>  
>      if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
>           pci_ats_enabled(iommu->seg, bus, pdev->devfn) )
> +    {
> +	spin_lock(&iommu->ats_list_lock);
>          disable_ats_device(pdev);
> +	spin_unlock(&iommu->ats_list_lock);

code style: tabs instead of spaces


> +    }
>  
>      BUG_ON ( iommu->dev_table.buffer == NULL );
>      req_id = get_dma_requestor_id(iommu->seg, PCI_BDF(bus, devfn));
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 2dfa1c2875..b5db5498a1 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1641,6 +1641,7 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>  {
>      pcidevs_lock();
>  
> +    /* iommu->ats_list_lock is taken by the caller of this function */

This is a locking inversion. Everywhere else we take pcidevs_lock
first and ats_list_lock second. For instance, look at
xen/drivers/passthrough/pci.c:deassign_device, which is called with
pcidevs_lock held and then calls iommu_call(... reassign_device ...),
which ends up taking ats_list_lock.

This is the only exception. I think we need to move the
spin_lock(ats_list_lock) from qinval.c to here.
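
To make the ordering rule concrete, here is a single-threaded sketch of
a rank check: each lock carries a rank, and a lock may only be taken if
its rank is higher than any lock already held. Everything here is
hypothetical (struct ranked_lock, ranked_acquire(), the rank values);
it is not an existing Xen facility, just an illustration of "pcidevs
first, ats_list second".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock carrying a rank; lower ranks must be taken first. */
struct ranked_lock {
    int rank;        /* position in the agreed order, must be > 0 */
    int saved_rank;  /* value of cur_rank before this lock was taken */
    bool held;
};

/* Highest rank currently held by the (single) caller; 0 means none. */
static int cur_rank;

/* Returns false, without taking the lock, if the order would be violated. */
static bool ranked_acquire(struct ranked_lock *l)
{
    if ( l->held || l->rank <= cur_rank )
        return false;

    l->saved_rank = cur_rank;
    cur_rank = l->rank;
    l->held = true;

    return true;
}

/* Locks must be released in the reverse order of acquisition. */
static void ranked_release(struct ranked_lock *l)
{
    assert(l->held && cur_rank == l->rank);
    cur_rank = l->saved_rank;
    l->held = false;
}
```

With pcidevs at rank 1 and ats_list at rank 2, taking ats_list and then
pcidevs is refused at the second acquisition, which is the pattern the
hunk above would introduce.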



>      disable_ats_device(pdev);
>  
>      ASSERT(pdev->domain);
> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
> index fff1442265..42661f22f4 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1281,6 +1281,7 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
>      spin_lock_init(&iommu->lock);
>      spin_lock_init(&iommu->register_lock);
>      spin_lock_init(&iommu->intremap.lock);
> +    spin_lock_init(&iommu->ats_list_lock);
>  
>      iommu->drhd = drhd;
>      drhd->iommu = iommu;
> @@ -1769,7 +1770,11 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
>          if ( ret > 0 )
>              ret = 0;
>          if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
> +        {
> +            spin_lock(&drhd->iommu->ats_list_lock);
>              enable_ats_device(pdev, &drhd->iommu->ats_devices);
> +            spin_unlock(&drhd->iommu->ats_list_lock);
> +        }
>  
>          break;
>  
> @@ -1977,7 +1982,11 @@ static const struct acpi_drhd_unit *domain_context_unmap(
>                     domain, &PCI_SBDF(seg, bus, devfn));
>          ret = domain_context_unmap_one(domain, iommu, bus, devfn);
>          if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
> +        {
> +            spin_lock(&iommu->ats_list_lock);
>              disable_ats_device(pdev);
> +            spin_unlock(&iommu->ats_list_lock);
> +        }
>  
>          break;
>  
> @@ -2374,7 +2383,9 @@ static int cf_check intel_iommu_enable_device(struct pci_dev *pdev)
>      if ( ret <= 0 )
>          return ret;
>  
> +    spin_lock(&drhd->iommu->ats_list_lock);
>      ret = enable_ats_device(pdev, &drhd->iommu->ats_devices);
> +    spin_unlock(&drhd->iommu->ats_list_lock);
>  
>      return ret >= 0 ? 0 : ret;
>  }
> diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
> index 78aa8a96f5..2a7a4c1b58 100644
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -506,6 +506,7 @@ struct vtd_iommu {
>      } flush;
>  
>      struct list_head ats_devices;
> +    spinlock_t ats_list_lock;
>      unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
>      unsigned long *domid_bitmap;  /* domain id bitmap */
>      domid_t *domid_map;           /* domain id mapping array */
> diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
> index 4f9ad136b9..6e876348db 100644
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -238,7 +238,10 @@ static int __must_check dev_invalidate_sync(struct vtd_iommu *iommu,
>          if ( d == NULL )
>              return rc;
>  
> +	spin_lock(&iommu->ats_list_lock);
>          iommu_dev_iotlb_flush_timeout(d, pdev);
> +	spin_unlock(&iommu->ats_list_lock);

code style: tabs instead of spaces


>          rcu_unlock_domain(d);
>      }
>      else if ( rc == -ETIMEDOUT )
> diff --git a/xen/drivers/passthrough/vtd/x86/ats.c b/xen/drivers/passthrough/vtd/x86/ats.c
> index 04d702b1d6..55e991183b 100644
> --- a/xen/drivers/passthrough/vtd/x86/ats.c
> +++ b/xen/drivers/passthrough/vtd/x86/ats.c
> @@ -117,6 +117,7 @@ int dev_invalidate_iotlb(struct vtd_iommu *iommu, u16 did,
>      if ( !ecap_dev_iotlb(iommu->ecap) )
>          return ret;
>  
> +    spin_lock(&iommu->ats_list_lock);
>      list_for_each_entry_safe( pdev, temp, &iommu->ats_devices, ats.list )
>      {
>          bool_t sbit;
> @@ -155,12 +156,14 @@ int dev_invalidate_iotlb(struct vtd_iommu *iommu, u16 did,
>              break;
>          default:
>              dprintk(XENLOG_WARNING VTDPREFIX, "invalid vt-d flush type\n");
> +	    spin_unlock(&iommu->ats_list_lock);

code style: tabs instead of spaces


>              return -EOPNOTSUPP;
>          }
>  
>          if ( !ret )
>              ret = rc;
>      }
> +    spin_unlock(&iommu->ats_list_lock);
>  
>      return ret;
>  }
> -- 
> 2.36.1
> 



* Re: [RFC PATCH 05/10] xen: pci: introduce reference counting for pdev
  2022-08-31 14:11 ` [RFC PATCH 05/10] xen: pci: introduce reference counting for pdev Volodymyr Babchuk
@ 2023-01-27  0:43   ` Stefano Stabellini
  2023-02-20 22:00     ` Volodymyr Babchuk
  2023-02-28 17:06   ` Jan Beulich
  1 sibling, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2023-01-27  0:43 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Kevin Tian

On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
> Prior to this change, the lifetime of pci_dev objects was protected by
> the global pcidevs_lock(). We are going to get rid of this lock, so we
> need some other mechanism to ensure that those objects will not
> disappear from under code that accesses them. Reference counting is a
> good choice as it provides an easy-to-comprehend way to control object
> lifetime with better granularity than a global super lock.
> 
> This patch adds two new helper functions: pcidev_get() and
> pcidev_put(). pcidev_get() will increase reference counter, while
> pcidev_put() will decrease it, destroying object when counter reaches
> zero.
> 
> pcidev_get() should be used only when you already have a valid pointer
> to the object or you are holding lock that protects one of the
> lists (domain, pseg or ats) that store pci_dev structs.
> 
> pcidev_get() is rarely used directly, because there already are
> functions that will provide valid pointer to pci_dev struct:
> pci_get_pdev() and pci_get_real_pdev(). They will lock appropriate
> list, find needed object and increase its reference counter before
> returning to the caller.
> 
> Naturally, pcidev_put() should be called after finishing working with a
> received object. This is the reason why this patch has so many
> pcidev_put()s and so few pcidev_get()s: existing calls to
> pci_get_*() functions will now increase the reference counter
> automatically, so we just need to decrease it when we are finished.
> 
> This patch removes "const" qualifier from some pdev pointers because
> pcidev_put() technically alters the contents of pci_dev structure.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

tabs everywhere in this patch
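
For reference, the get/put contract described in the commit message can
be sketched as below. This is not the series' implementation: dev_get()
and dev_put() are hypothetical stand-ins for pcidev_get()/pcidev_put(),
C11 atomics replace whatever the refcnt helpers use internally, and the
dev_destroyed flag exists only so the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct dev {
    atomic_int refcnt;
    unsigned int devfn;
};

/* Set by dev_put() when the last reference goes away; demo only. */
static bool dev_destroyed;

static struct dev *dev_alloc(unsigned int devfn)
{
    struct dev *d = calloc(1, sizeof(*d));

    if ( !d )
        return NULL;

    atomic_init(&d->refcnt, 1);     /* the creator owns the first reference */
    d->devfn = devfn;

    return d;
}

/* Caller must already hold a reference, or a lock that keeps d on a list. */
static void dev_get(struct dev *d)
{
    int old = atomic_fetch_add(&d->refcnt, 1);

    assert(old > 0);                /* never resurrect a dead object */
    (void)old;
}

static void dev_put(struct dev *d)
{
    int old = atomic_fetch_sub(&d->refcnt, 1);

    assert(old > 0);
    if ( old == 1 )                 /* we dropped the last reference */
    {
        dev_destroyed = true;
        free(d);
    }
}
```

The object is destroyed only by the put that observes the count going
from 1 to 0, so every get must be paired with exactly one put.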


> ---
> 
> - Jan, can I add your Suggested-by tag?
> ---
>  xen/arch/x86/hvm/vmsi.c                  |   2 +-
>  xen/arch/x86/irq.c                       |   4 +
>  xen/arch/x86/msi.c                       |  41 ++++++-
>  xen/arch/x86/pci.c                       |   4 +-
>  xen/arch/x86/physdev.c                   |  17 ++-
>  xen/common/sysctl.c                      |   5 +-
>  xen/drivers/passthrough/amd/iommu_init.c |  12 ++-
>  xen/drivers/passthrough/amd/iommu_map.c  |   6 +-
>  xen/drivers/passthrough/pci.c            | 131 +++++++++++++++--------
>  xen/drivers/passthrough/vtd/quirks.c     |   2 +
>  xen/drivers/video/vga.c                  |  10 +-
>  xen/drivers/vpci/vpci.c                  |   6 +-
>  xen/include/xen/pci.h                    |  18 ++++
>  13 files changed, 201 insertions(+), 57 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 75f92885dc..7fb1075673 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -912,7 +912,7 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  
>              spin_unlock(&msix->pdev->vpci->lock);
>              process_pending_softirqs();
> -            /* NB: we assume that pdev cannot go away for an alive domain. */
> +
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
>              if ( pdev->vpci->msix != msix )
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index cd0c8a30a8..d8672a03e1 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2174,6 +2174,7 @@ int map_domain_pirq(
>                  msi->entry_nr = ret;
>                  ret = -ENFILE;
>              }
> +	    pcidev_put(pdev);

I think it would be better to move pcidev_put() to just after the
"done:" label, so that every exit path drops the reference exactly once:
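
Roughly the shape I have in mind, as a sketch only: map_irq() and its
step parameter are hypothetical and just model which stage fails, and a
plain int stands in for the real reference counter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dev { int refcnt; };

static void dev_put(struct dev *d)
{
    assert(d->refcnt > 0);
    d->refcnt--;
}

/*
 * step selects which stage fails (0 = none). Every exit after the
 * reference is taken funnels through "done", so there is one put.
 */
static int map_irq(struct dev *d, int step)
{
    int ret = 0;

    d->refcnt++;                    /* stands in for pci_get_pdev() */

    if ( step == 1 )
    {
        ret = -ENFILE;
        goto done;
    }

    if ( step == 2 )
    {
        ret = -EBUSY;
        goto done;
    }

 done:
    dev_put(d);
    return ret;
}
```

Paths that reach the label before the reference is taken would need
pdev initialised to NULL and a NULL check before the put.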


>              goto done;
>          }
>  
> @@ -2188,6 +2189,7 @@ int map_domain_pirq(
>              msi_desc->irq = -1;
>              msi_free_irq(msi_desc);
>              ret = -EBUSY;
> +	    pcidev_put(pdev);
>              goto done;
>          }
>  
> @@ -2272,10 +2274,12 @@ int map_domain_pirq(
>              }
>              msi_desc->irq = -1;
>              msi_free_irq(msi_desc);
> +	    pcidev_put(pdev);
>              goto done;
>          }
>  
>          set_domain_irq_pirq(d, irq, info);
> +	pcidev_put(pdev);
>          spin_unlock_irqrestore(&desc->lock, flags);
>      }
>      else
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index d0bf63df1d..bccaccb98b 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -572,6 +572,10 @@ int msi_free_irq(struct msi_desc *entry)
>                          virt_to_fix((unsigned long)entry->mask_base));
>  
>      list_del(&entry->list);
> +
> +    /* Corresponds to pcidev_get() in msi[x]_capability_init()  */
> +    pcidev_put(entry->dev);
> +
>      xfree(entry);
>      return 0;
>  }
> @@ -644,6 +648,7 @@ static int msi_capability_init(struct pci_dev *dev,
>              entry[i].msi.mpos = mpos;
>          entry[i].msi.nvec = 0;
>          entry[i].dev = dev;
> +	pcidev_get(dev);
>      }
>      entry->msi.nvec = nvec;
>      entry->irq = irq;
> @@ -703,22 +708,36 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>               bir >= PCI_SRIOV_NUM_BARS ||
>               !pdev->vf_rlen[bir] )
> +        {
> +            if ( pdev )
> +                pcidev_put(pdev);
>              return 0;
> +        }
>          base = pos + PCI_SRIOV_BAR;
>          vf -= PCI_BDF(bus, slot, func) + offset;
>          if ( vf < 0 )
> +        {
> +            pcidev_put(pdev);
>              return 0;
> +        }
>          if ( stride )
>          {
>              if ( vf % stride )
> +            {
> +                pcidev_put(pdev);
>                  return 0;
> +            }
>              vf /= stride;
>          }
>          if ( vf >= num_vf )
> +        {
> +            pcidev_put(pdev);
>              return 0;
> +        }
>          BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
>          disp = vf * pdev->vf_rlen[bir];
>          limit = PCI_SRIOV_NUM_BARS;
> +        pcidev_put(pdev);
>      }
>      else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func),
>                                   PCI_HEADER_TYPE) & 0x7f )
> @@ -925,6 +944,8 @@ static int msix_capability_init(struct pci_dev *dev,
>          entry->dev = dev;
>          entry->mask_base = base;
>  
> +	pcidev_get(dev);
> +
>          list_add_tail(&entry->list, &dev->msi_list);
>          *desc = entry;
>      }
> @@ -999,6 +1020,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>  {
>      struct pci_dev *pdev;
>      struct msi_desc *old_desc;
> +    int ret;
>  
>      ASSERT(pcidevs_locked());
>      pdev = pci_get_pdev(NULL, msi->sbdf);
> @@ -1010,6 +1032,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>      {
>          printk(XENLOG_ERR "irq %d already mapped to MSI on %pp\n",
>                 msi->irq, &pdev->sbdf);
> +	pcidev_put(pdev);
>          return -EEXIST;
>      }
>  
> @@ -1020,7 +1043,10 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>          __pci_disable_msix(old_desc);
>      }
>  
> -    return msi_capability_init(pdev, msi->irq, desc, msi->entry_nr);
> +    ret = msi_capability_init(pdev, msi->irq, desc, msi->entry_nr);
> +    pcidev_put(pdev);
> +
> +    return ret;
>  }
>  
>  static void __pci_disable_msi(struct msi_desc *entry)
> @@ -1054,6 +1080,7 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
>  {
>      struct pci_dev *pdev;
>      struct msi_desc *old_desc;
> +    int ret;
>  
>      ASSERT(pcidevs_locked());
>      pdev = pci_get_pdev(NULL, msi->sbdf);
> @@ -1061,13 +1088,17 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
>          return -ENODEV;

A pcidev_put() may be missing above for the case
pdev != NULL && pdev->msix == NULL.
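
Assuming the check above has the form "!pdev || !pdev->msix" (the
quoted hunk does not show it), splitting it would make the reference
handling explicit. A sketch with hypothetical names (dev_lookup(),
dev_put(), enable_msix()) and a plain int counter:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct msix { unsigned int nr_entries; };

struct dev {
    int refcnt;
    struct msix *msix;
};

/* Stands in for pci_get_pdev(): takes a reference when it finds a device. */
static struct dev *dev_lookup(struct dev *candidate)
{
    if ( candidate )
        candidate->refcnt++;

    return candidate;
}

static void dev_put(struct dev *d)
{
    assert(d->refcnt > 0);
    d->refcnt--;
}

static int enable_msix(struct dev *candidate)
{
    struct dev *d = dev_lookup(candidate);

    if ( !d )
        return -ENODEV;             /* nothing was taken, nothing to drop */

    if ( !d->msix )
    {
        dev_put(d);                 /* lookup succeeded, so drop the reference */
        return -ENODEV;
    }

    /* ... the real work would go here ... */

    dev_put(d);
    return 0;
}
```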


>      if ( msi->entry_nr >= pdev->msix->nr_entries )
> +    {
> +	pcidev_put(pdev);
>          return -EINVAL;
> +    }
>  
>      old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSIX);
>      if ( old_desc )
>      {
>          printk(XENLOG_ERR "irq %d already mapped to MSI-X on %pp\n",
>                 msi->irq, &pdev->sbdf);
> +	pcidev_put(pdev);
>          return -EEXIST;
>      }
>  
> @@ -1078,7 +1109,11 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
>          __pci_disable_msi(old_desc);
>      }
>  
> -    return msix_capability_init(pdev, msi, desc);
> +    ret = msix_capability_init(pdev, msi, desc);
> +
> +    pcidev_put(pdev);
> +
> +    return ret;
>  }
>  
>  static void _pci_cleanup_msix(struct arch_msix *msix)
> @@ -1161,6 +1196,8 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>          rc = msix_capability_init(pdev, NULL, NULL);
>      pcidevs_unlock();
>  
> +    pcidev_put(pdev);
> +
>      return rc;
>  }
>  
> diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
> index 97b792e578..1d38f0df7c 100644
> --- a/xen/arch/x86/pci.c
> +++ b/xen/arch/x86/pci.c
> @@ -91,8 +91,10 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
>      pcidevs_lock();
>  
>      pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
> -    if ( pdev )
> +    if ( pdev ) {
>          rc = pci_msi_conf_write_intercept(pdev, reg, size, data);
> +	pcidev_put(pdev);
> +    }
>  
>      pcidevs_unlock();
>  
> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
> index 2f1d955a96..96214a3d40 100644
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -533,7 +533,14 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          pcidevs_lock();
>          pdev = pci_get_pdev(NULL,
>                              PCI_SBDF(0, restore_msi.bus, restore_msi.devfn));
> -        ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV;
> +        if ( pdev )
> +        {
> +            ret = pci_restore_msi_state(pdev);
> +            pcidev_put(pdev);
> +        }
> +        else
> +            ret = -ENODEV;
> +
>          pcidevs_unlock();
>          break;
>      }
> @@ -548,7 +555,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  
>          pcidevs_lock();
>          pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
> -        ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV;
> +        if ( pdev )
> +        {
> +            ret =  pci_restore_msi_state(pdev);
> +            pcidev_put(pdev);
> +        }
> +        else
> +            ret = -ENODEV;
>          pcidevs_unlock();
>          break;
>      }
> diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
> index 02505ab044..0feef94cd2 100644
> --- a/xen/common/sysctl.c
> +++ b/xen/common/sysctl.c
> @@ -438,7 +438,7 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>          {
>              physdev_pci_device_t dev;
>              uint32_t node;
> -            const struct pci_dev *pdev;
> +            struct pci_dev *pdev;
>  
>              if ( copy_from_guest_offset(&dev, ti->devs, i, 1) )
>              {
> @@ -456,6 +456,9 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>                  node = pdev->node;
>              pcidevs_unlock();
>  
> +            if ( pdev )
> +                pcidev_put(pdev);
> +
>              if ( copy_to_guest_offset(ti->nodes, i, &node, 1) )
>              {
>                  ret = -EFAULT;
> diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
> index 1f14aaf49e..7c1713a602 100644
> --- a/xen/drivers/passthrough/amd/iommu_init.c
> +++ b/xen/drivers/passthrough/amd/iommu_init.c
> @@ -644,6 +644,7 @@ static void cf_check parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
>  
>      if ( pdev )
>          guest_iommu_add_ppr_log(pdev->domain, entry);
> +    pcidev_put(pdev);
>  }
>  
>  static void iommu_check_ppr_log(struct amd_iommu *iommu)
> @@ -747,6 +748,11 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu)
>      }
>  
>      pcidevs_lock();
> +    /*
> +     * XXX: it is unclear if this device can be removed. Right now
> +     * there is no code that clears msi.dev, so no one will decrease
> +     * refcount on it.
> +     */
>      iommu->msi.dev = pci_get_pdev(NULL, PCI_SBDF(iommu->seg, iommu->bdf));
>      pcidevs_unlock();
>      if ( !iommu->msi.dev )
> @@ -1272,7 +1278,7 @@ static int __init cf_check amd_iommu_setup_device_table(
>      {
>          if ( ivrs_mappings[bdf].valid )
>          {
> -            const struct pci_dev *pdev = NULL;
> +            struct pci_dev *pdev = NULL;
>  
>              /* add device table entry */
>              iommu_dte_add_device_entry(&dt[bdf], &ivrs_mappings[bdf]);
> @@ -1297,7 +1303,10 @@ static int __init cf_check amd_iommu_setup_device_table(
>                          pdev->msix ? pdev->msix->nr_entries
>                                     : pdev->msi_maxvec);
>                  if ( !ivrs_mappings[bdf].intremap_table )
> +		{
> +		    pcidev_put(pdev);
>                      return -ENOMEM;
> +		}
>  
>                  if ( pdev->phantom_stride )
>                  {
> @@ -1315,6 +1324,7 @@ static int __init cf_check amd_iommu_setup_device_table(
>                              ivrs_mappings[bdf].intremap_inuse;
>                      }
>                  }
> +		pcidev_put(pdev);
>              }
>  
>              amd_iommu_set_intremap_table(
> diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
> index 993bac6f88..9d621e3d36 100644
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -724,14 +724,18 @@ int cf_check amd_iommu_get_reserved_device_memory(
>          if ( !iommu )
>          {
>              /* May need to trigger the workaround in find_iommu_for_device(). */
> -            const struct pci_dev *pdev;
> +            struct pci_dev *pdev;
>  
>              pcidevs_lock();
>              pdev = pci_get_pdev(NULL, sbdf);
>              pcidevs_unlock();
>  
>              if ( pdev )
> +            {
>                  iommu = find_iommu_for_device(seg, bdf);
> +                /* XXX: Should we hold pdev reference till end of the loop? */
> +                pcidev_put(pdev);
> +            }
>              if ( !iommu )
>                  continue;
>          }
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index b5db5498a1..a6c6368769 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -403,6 +403,7 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>      *((u8*) &pdev->bus) = bus;
>      *((u8*) &pdev->devfn) = devfn;
>      pdev->domain = NULL;
> +    refcnt_init(&pdev->refcnt);
>  
>      arch_pci_init_pdev(pdev);
>  
> @@ -499,33 +500,6 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>      return pdev;
>  }
>  
> -static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
> -{
> -    /* update bus2bridge */
> -    switch ( pdev->type )
> -    {
> -        unsigned int sec_bus, sub_bus;
> -
> -        case DEV_TYPE_PCIe2PCI_BRIDGE:
> -        case DEV_TYPE_LEGACY_PCI_BRIDGE:
> -            sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS);
> -            sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS);
> -
> -            spin_lock(&pseg->bus2bridge_lock);
> -            for ( ; sec_bus <= sub_bus; sec_bus++ )
> -                pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
> -            spin_unlock(&pseg->bus2bridge_lock);
> -            break;
> -
> -        default:
> -            break;
> -    }
> -
> -    list_del(&pdev->alldevs_list);
> -    pdev_msi_deinit(pdev);
> -    xfree(pdev);
> -}
> -
>  static void __init _pci_hide_device(struct pci_dev *pdev)
>  {
>      if ( pdev->domain )
> @@ -596,10 +570,15 @@ struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf)
>      {
>          if ( !(sbdf.devfn & stride) )
>              continue;

We also need a pcidev_put before continue


> +
>          sbdf.devfn &= ~stride;
> +        pcidev_put(pdev);
>          pdev = pci_get_pdev(NULL, sbdf);
>          if ( pdev && stride != pdev->phantom_stride )
> +        {
> +            pcidev_put(pdev);
>              pdev = NULL;
> +        }
>      }
>  
>      return pdev;
> @@ -629,6 +608,7 @@ struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
>              if ( pdev->sbdf.bdf == sbdf.bdf &&
>                   (!d || pdev->domain == d) )
>              {
> +                pcidev_get(pdev);
>                  spin_unlock(&pseg->alldevs_lock);
>                  return pdev;
>              }
> @@ -640,6 +620,7 @@ struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
>          list_for_each_entry ( pdev, &d->pdev_list, domain_list )
>              if ( pdev->sbdf.bdf == sbdf.bdf )
>              {
> +                pcidev_get(pdev);
>                  spin_unlock(&d->pdevs_lock);
>                  return pdev;
>              }
> @@ -754,7 +735,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>                              PCI_SBDF(seg, info->physfn.bus,
>                                       info->physfn.devfn));
>          if ( pdev )
> +        {
>              pf_is_extfn = pdev->info.is_extfn;
> +            pcidev_put(pdev);
> +        }
>          pcidevs_unlock();
>          if ( !pdev )
>              pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
> @@ -920,8 +904,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>                  spin_unlock(&pdev->domain->pdevs_lock);
>              }
>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
> -            free_pdev(pseg, pdev);
>              list_del(&pdev->alldevs_list);
> +            pdev_msi_deinit(pdev);
> +            pcidev_put(pdev);
>              break;
>          }
>  
> @@ -952,7 +937,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>      {
>          ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
>          if ( ret )
> -           return ret;
> +            goto out;
>  
>          target = dom_io;
>      }
> @@ -982,6 +967,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>      pdev->fault.count = 0;
>  
>   out:
> +    pcidev_put(pdev);
>      if ( ret )
>          printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n",
>                 d, &PCI_SBDF(seg, bus, devfn), ret);
> @@ -1117,7 +1103,10 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
>              pdev->fault.count >>= 1;
>          pdev->fault.time = now;
>          if ( ++pdev->fault.count < PT_FAULT_THRESHOLD )
> +        {
> +            pcidev_put(pdev);
>              pdev = NULL;
> +        }
>      }
>      pcidevs_unlock();
>  
> @@ -1128,6 +1117,8 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
>       * control it for us. */
>      cword = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
>      pci_conf_write16(pdev->sbdf, PCI_COMMAND, cword & ~PCI_COMMAND_MASTER);
> +
> +    pcidev_put(pdev);
>  }
>  
>  /*
> @@ -1246,6 +1237,7 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
>                  printk(XENLOG_WARNING "Dom%d owning %pp?\n",
>                         pdev->domain->domain_id, &pdev->sbdf);
>  
> +            pcidev_put(pdev);
>              if ( iommu_verbose )
>              {
>                  pcidevs_unlock();
> @@ -1495,33 +1487,28 @@ static int iommu_remove_device(struct pci_dev *pdev)
>      return iommu_call(hd->platform_ops, remove_device, devfn, pci_to_dev(pdev));
>  }
>  
> -static int device_assigned(u16 seg, u8 bus, u8 devfn)
> +static int device_assigned(struct pci_dev *pdev)
>  {
> -    struct pci_dev *pdev;
>      int rc = 0;
>  
>      ASSERT(pcidevs_locked());
> -    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
> -
> -    if ( !pdev )
> -        rc = -ENODEV;
>      /*
>       * If the device exists and it is not owned by either the hardware
>       * domain or dom_io then it must be assigned to a guest, or be
>       * hidden (owned by dom_xen).
>       */
> -    else if ( pdev->domain != hardware_domain &&
> -              pdev->domain != dom_io )
> +    if ( pdev->domain != hardware_domain &&
> +         pdev->domain != dom_io )
>          rc = -EBUSY;
>  
>      return rc;
>  }
>  
>  /* Caller should hold the pcidevs_lock */
> -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
> +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
>  {
>      const struct domain_iommu *hd = dom_iommu(d);
> -    struct pci_dev *pdev;
> +    uint8_t devfn;
>      int rc = 0;
>  
>      if ( !is_iommu_enabled(d) )
> @@ -1532,10 +1519,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>  
>      /* device_assigned() should already have cleared the device for assignment */
>      ASSERT(pcidevs_locked());
> -    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
>      ASSERT(pdev && (pdev->domain == hardware_domain ||
>                      pdev->domain == dom_io));
>  
> +    devfn = pdev->devfn;
> +
>      /* Do not allow broken devices to be assigned to guests. */
>      rc = -EBADF;
>      if ( pdev->broken && d != hardware_domain && d != dom_io )
> @@ -1570,7 +1558,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>   done:
>      if ( rc )
>          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
> -               d, &PCI_SBDF(seg, bus, devfn), rc);
> +               d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc);
>      /* The device is assigned to dom_io so mark it as quarantined */
>      else if ( d == dom_io )
>          pdev->quarantine = true;
> @@ -1710,6 +1698,9 @@ int iommu_do_pci_domctl(
>          ASSERT(d);
>          /* fall through */
>      case XEN_DOMCTL_test_assign_device:
> +    {
> +        struct pci_dev *pdev;
> +
>          /* Don't support self-assignment of devices. */
>          if ( d == current->domain )
>          {
> @@ -1737,26 +1728,36 @@ int iommu_do_pci_domctl(
>          seg = machine_sbdf >> 16;
>          bus = PCI_BUS(machine_sbdf);
>          devfn = PCI_DEVFN(machine_sbdf);
> +        pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
> +        if ( !pdev )
> +        {
> +            printk(XENLOG_G_INFO "%pp non-existent\n",
> +                   &PCI_SBDF(seg, bus, devfn));
> +            ret = -EINVAL;
> +            break;
> +        }
>  
>          pcidevs_lock();
> -        ret = device_assigned(seg, bus, devfn);
> +        ret = device_assigned(pdev);
>          if ( domctl->cmd == XEN_DOMCTL_test_assign_device )
>          {
>              if ( ret )
>              {
> -                printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n",
> +                printk(XENLOG_G_INFO "%pp already assigned\n",
>                         &PCI_SBDF(seg, bus, devfn));
>                  ret = -EINVAL;
>              }
>          }
>          else if ( !ret )
> -            ret = assign_device(d, seg, bus, devfn, flags);
> +            ret = assign_device(d, pdev, flags);
> +
> +        pcidev_put(pdev);
>          pcidevs_unlock();
>          if ( ret == -ERESTART )
>              ret = hypercall_create_continuation(__HYPERVISOR_domctl,
>                                                  "h", u_domctl);
>          break;
> -
> +    }
>      case XEN_DOMCTL_deassign_device:
>          /* Don't support self-deassignment of devices. */
>          if ( d == current->domain )
> @@ -1796,6 +1797,46 @@ int iommu_do_pci_domctl(
>      return ret;
>  }
>  
> +static void release_pdev(refcnt_t *refcnt)
> +{
> +    struct pci_dev *pdev = container_of(refcnt, struct pci_dev, refcnt);
> +    struct pci_seg *pseg = get_pseg(pdev->seg);
> +
> +    printk(XENLOG_DEBUG "PCI release device %pp\n", &pdev->sbdf);
> +
> +    /* update bus2bridge */
> +    switch ( pdev->type )
> +    {
> +        unsigned int sec_bus, sub_bus;
> +
> +        case DEV_TYPE_PCIe2PCI_BRIDGE:
> +        case DEV_TYPE_LEGACY_PCI_BRIDGE:
> +            sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS);
> +            sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS);
> +
> +            spin_lock(&pseg->bus2bridge_lock);
> +            for ( ; sec_bus <= sub_bus; sec_bus++ )
> +                pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
> +            spin_unlock(&pseg->bus2bridge_lock);
> +            break;
> +
> +        default:
> +            break;
> +    }
> +
> +    xfree(pdev);
> +}
> +
> +void pcidev_get(struct pci_dev *pdev)
> +{
> +    refcnt_get(&pdev->refcnt);
> +}
> +
> +void pcidev_put(struct pci_dev *pdev)
> +{
> +    refcnt_put(&pdev->refcnt, release_pdev);
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
> index fcc8f73e8b..d240da0416 100644
> --- a/xen/drivers/passthrough/vtd/quirks.c
> +++ b/xen/drivers/passthrough/vtd/quirks.c
> @@ -429,6 +429,8 @@ static int __must_check map_me_phantom_function(struct domain *domain,
>          rc = domain_context_unmap_one(domain, drhd->iommu, 0,
>                                        PCI_DEVFN(dev, 7));
>  
> +    pcidev_put(pdev);
> +
>      return rc;
>  }
>  
> diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c
> index 29a88e8241..1298f3a7b6 100644
> --- a/xen/drivers/video/vga.c
> +++ b/xen/drivers/video/vga.c
> @@ -114,7 +114,7 @@ void __init video_endboot(void)
>          for ( bus = 0; bus < 256; ++bus )
>              for ( devfn = 0; devfn < 256; ++devfn )
>              {
> -                const struct pci_dev *pdev;
> +                struct pci_dev *pdev;
>                  u8 b = bus, df = devfn, sb;
>  
>                  pcidevs_lock();
> @@ -126,7 +126,11 @@ void __init video_endboot(void)
>                                       PCI_CLASS_DEVICE) != 0x0300 ||
>                       !(pci_conf_read16(PCI_SBDF(0, bus, devfn), PCI_COMMAND) &
>                         (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) )
> +		{
> +		    if (pdev)
> +			pcidev_put(pdev);
>                      continue;
> +		}
>  
>                  while ( b )
>                  {
> @@ -144,7 +148,10 @@ void __init video_endboot(void)
>                              if ( pci_conf_read16(PCI_SBDF(0, b, df),
>                                                   PCI_BRIDGE_CONTROL) &
>                                   PCI_BRIDGE_CTL_VGA )
> +			    {
> +				pcidev_put(pdev);
>                                  continue;
> +			    }

This is wrong: it is inside the inner while loop and unnecessary given
the pcidev_put below


>                              break;
>                          }
>                          break;
> @@ -157,6 +164,7 @@ void __init video_endboot(void)
>                             bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>                      pci_hide_device(0, bus, devfn);
>                  }
> +		pcidev_put(pdev);
>              }
>      }
>  
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 7d1f9fd438..59dc55f498 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -313,7 +313,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
>      struct domain *d = current->domain;
> -    const struct pci_dev *pdev;
> +    struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
>      uint32_t data = ~(uint32_t)0;
> @@ -373,6 +373,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    pcidev_put(pdev);

I think there is a missing pcidev_put above in the vpci_read function:

if ( !pdev || !pdev->vpci )
    return ...

for the case where pdev is non-NULL but pdev->vpci is NULL
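
A minimal sketch of the pattern in question, using a toy counter rather
than Xen's refcnt_t (toy_dev, toy_get, toy_put, toy_lookup and toy_read
are hypothetical names, not part of the patch): once the lookup has taken
a reference, every exit path has to drop it, including the early return
for a device that exists but has no vpci state.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct pci_dev; this is not the Xen definition. */
struct toy_dev {
    int refcnt;   /* plain int: single-threaded illustration only */
    void *vpci;   /* NULL when the device has no vPCI state */
};

static void toy_get(struct toy_dev *d)
{
    d->refcnt++;
}

static void toy_put(struct toy_dev *d)
{
    assert(d->refcnt > 0);
    d->refcnt--;
}

/* Hypothetical lookup: like pci_get_pdev(), returns with a reference held. */
static struct toy_dev *toy_lookup(struct toy_dev *d)
{
    if ( d )
        toy_get(d);
    return d;
}

/* Returns 0 on the early-exit path and 1 on the normal path. */
static int toy_read(struct toy_dev *candidate)
{
    struct toy_dev *pdev = toy_lookup(candidate);

    if ( !pdev || !pdev->vpci )
    {
        /* The device may exist without vpci: drop the reference here too. */
        if ( pdev )
            toy_put(pdev);
        return 0;
    }

    /* ... the register access would happen here ... */

    toy_put(pdev);
    return 1;
}
```

With this shape the count is the same before and after the call on both
paths, which is the property the missing pcidev_put would break.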


>      if ( data_offset < size )
>      {
> @@ -416,7 +417,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
>      struct domain *d = current->domain;
> -    const struct pci_dev *pdev;
> +    struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
>      const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
> @@ -478,6 +479,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    pcidev_put(pdev);

same here, missing pcidev_put above


>      if ( data_offset < size )
>          /* Tailing gap, write the remaining. */
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index 19047b4b20..e71a180ef3 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -13,6 +13,7 @@
>  #include <xen/irq.h>
>  #include <xen/pci_regs.h>
>  #include <xen/pfn.h>
> +#include <xen/refcnt.h>
>  #include <asm/device.h>
>  #include <asm/numa.h>
>  
> @@ -116,6 +117,9 @@ struct pci_dev {
>      /* Device misbehaving, prevent assigning it to guests. */
>      bool broken;
>  
> +    /* Reference counter */
> +    refcnt_t refcnt;
> +
>      enum pdev_type {
>          DEV_TYPE_PCI_UNKNOWN,
>          DEV_TYPE_PCIe_ENDPOINT,
> @@ -160,6 +164,14 @@ void pcidevs_lock(void);
>  void pcidevs_unlock(void);
>  bool_t __must_check pcidevs_locked(void);
>  
> +/*
> + * Acquire and release reference to the given device. Holding
> + * reference ensures that device will not disappear under feet, but
> + * does not guarantee that code has exclusive access to the device.
> + */
> +void pcidev_get(struct pci_dev *pdev);
> +void pcidev_put(struct pci_dev *pdev);
> +
>  bool_t pci_known_segment(u16 seg);
>  bool_t pci_device_detect(u16 seg, u8 bus, u8 dev, u8 func);
>  int scan_pci_devices(void);
> @@ -177,8 +189,14 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>  int pci_remove_device(u16 seg, u8 bus, u8 devfn);
>  int pci_ro_device(int seg, int bus, int devfn);
>  int pci_hide_device(unsigned int seg, unsigned int bus, unsigned int devfn);
> +
> +/*
> + * Next two functions will find a requested device and acquire
> + * reference to it. Use pcidev_put() to release the reference.
> + */
>  struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf);
>  struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf);
> +
>  void pci_check_disable_device(u16 seg, u8 bus, u8 devfn);
>  
>  uint8_t pci_conf_read8(pci_sbdf_t sbdf, unsigned int reg);
> -- 
> 2.36.1
> 



* Re: [RFC PATCH 01/10] xen: pci: add per-domain pci list lock
  2023-01-26 23:18   ` Stefano Stabellini
@ 2023-01-27  8:01     ` Jan Beulich
  2023-02-14 23:38     ` Volodymyr Babchuk
  1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-01-27  8:01 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Wei Liu, Paul Durrant, Roger Pau Monné,
	Kevin Tian, Stewart.Hildebrand, Volodymyr Babchuk

On 27.01.2023 00:18, Stefano Stabellini wrote:
> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>> @@ -1571,6 +1595,7 @@ static int iommu_get_device_group(
>>          if ( sdev_id < 0 )
>>          {
>>              pcidevs_unlock();
>> +            spin_unlock(&d->pdevs_lock);
> 
> lock inversion
> 
> 
>>              return sdev_id;
>>          }
>>  
>> @@ -1581,6 +1606,7 @@ static int iommu_get_device_group(
>>              if ( unlikely(copy_to_guest_offset(buf, i, &bdf, 1)) )
>>              {
>>                  pcidevs_unlock();
>> +                spin_unlock(&d->pdevs_lock);
> 
> lock inversion
> 
> 
>>                  return -EFAULT;
>>              }
>>              i++;
>> @@ -1588,6 +1614,7 @@ static int iommu_get_device_group(
>>      }
>>  
>>      pcidevs_unlock();
>> +    spin_unlock(&d->pdevs_lock);
> 
> lock inversion

While from a cosmetic perspective I of course agree that releasing locks
is better done in the opposite order of acquiring them whenever possible,
I'd like to point out that lock release alone is never subject to "lock
order" issues. We do have a couple of cases (I think) where we actually
do so because otherwise the respective code would end up uglier, or
because we want to limit locked regions as much as possible (I'm sorry, I
don't have an example to hand).
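
To illustrate the point about release order, here is a small sketch built
on a toy test-and-set lock from C11 <stdatomic.h>. It is not Xen's
spinlock_t, and all toy_* names are made up for the example. The only
place a CPU can wait is in the acquire path, so releasing in a
non-reversed order still leaves both locks free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy test-and-set lock; not Xen's spinlock_t. */
struct toy_lock {
    atomic_flag held;
};

static void toy_lock_init(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

static void toy_lock_acquire(struct toy_lock *l)
{
    while ( atomic_flag_test_and_set(&l->held) )
        ; /* spin: acquisition is the only operation that can wait */
}

/* Release never waits, so it cannot be part of a deadlock cycle. */
static void toy_lock_release(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Non-blocking probe, used here only to inspect the lock state. */
static bool toy_lock_try(struct toy_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

/*
 * Acquire outer, then inner; release outer first (not reversed).
 * Returns true if both locks are free afterwards.
 */
static bool toy_release_same_order(struct toy_lock *outer,
                                   struct toy_lock *inner)
{
    bool outer_free, inner_free;

    toy_lock_acquire(outer);
    toy_lock_acquire(inner);

    toy_lock_release(outer);
    toy_lock_release(inner);

    outer_free = toy_lock_try(outer);
    inner_free = toy_lock_try(inner);

    /* Undo the probes so the locks are left free for the caller. */
    if ( outer_free )
        toy_lock_release(outer);
    if ( inner_free )
        toy_lock_release(inner);

    return outer_free && inner_free;
}
```

What does matter is that every path acquires the two locks in the same
order; the sketch deliberately leaves that part unchanged.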

>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -457,6 +457,7 @@ struct domain
>>  
>>  #ifdef CONFIG_HAS_PCI
>>      struct list_head pdev_list;
>> +    spinlock_t pdevs_lock;
> 
> I think it would be better called "pdev_lock" but OK either way

I'm of two minds here: On one hand the lock is to guard all devices, so
plural looks applicable. Otoh the list head field uses singular as well.

Jan



* Re: [RFC PATCH 03/10] xen: pci: introduce ats_list_lock
  2023-01-26 23:56   ` Stefano Stabellini
@ 2023-01-27  8:13     ` Jan Beulich
  2023-02-17  1:20       ` Volodymyr Babchuk
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2023-01-27  8:13 UTC (permalink / raw)
  To: Stefano Stabellini, Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Paul Durrant,
	Roger Pau Monné,
	Kevin Tian

On 27.01.2023 00:56, Stefano Stabellini wrote:
> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -1641,6 +1641,7 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>>  {
>>      pcidevs_lock();
>>  
>> +    /* iommu->ats_list_lock is taken by the caller of this function */
> 
> This is a locking inversion. In all other places we take pcidevs_lock
> first, then ats_list_lock lock. For instance look at
> xen/drivers/passthrough/pci.c:deassign_device that is called with
> pcidevs_locked and then calls iommu_call(... reassign_device ...) which
> ends up taking ats_list_lock.
> 
> This is the only exception. I think we need to move the
> spin_lock(ats_list_lock) from qinval.c to here.

First question here is what the lock is meant to protect: Just the list,
or also the ATS state change (i.e. serializing enable and disable against
one another). In the latter case the lock also wants naming differently.

Second question is who is to acquire the lock. Why isn't this done _in_
{en,dis}able_ats_device() themselves? That would then allow the locked
range to be reduced further, because at least the
pci_find_ext_capability() call and the final logging can occur without
the lock held.
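
A rough sketch of that second option, assuming the lock only needs to
guard the list itself (which is exactly the open first question). The
types and toy_* helpers below are hypothetical and single-threaded; they
only show where the locked region would start and end if the enable
function took the lock on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Toy device: only what the example needs. */
struct toy_dev {
    unsigned int ats_cap_pos;   /* 0 means "no ATS capability" */
    struct toy_dev *next;       /* link in the toy ATS list */
};

/* Toy IOMMU with an int standing in for ats_list_lock. */
struct toy_iommu {
    int lock_held;
    struct toy_dev *ats_list;
};

static void toy_lock(struct toy_iommu *i)
{
    assert(!i->lock_held);
    i->lock_held = 1;
}

static void toy_unlock(struct toy_iommu *i)
{
    assert(i->lock_held);
    i->lock_held = 0;
}

/* Capability lookup: asserts that it runs outside the locked region. */
static unsigned int toy_find_ats_cap(const struct toy_iommu *i,
                                     const struct toy_dev *d)
{
    assert(!i->lock_held);
    return d->ats_cap_pos;
}

/* Returns 0 on success, -1 if the device has no ATS capability. */
static int toy_enable_ats(struct toy_iommu *iommu, struct toy_dev *dev)
{
    unsigned int pos = toy_find_ats_cap(iommu, dev);

    if ( !pos )
        return -1;

    /* Only the list update sits inside the lock. */
    toy_lock(iommu);
    dev->next = iommu->ats_list;
    iommu->ats_list = dev;
    toy_unlock(iommu);

    /* Any logging would go here, again without the lock held. */
    return 0;
}
```

If the lock is also meant to serialize enable against disable, the
locked region would have to grow to cover the state change as well.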

Jan



* Re: [RFC PATCH 07/10] xen: pci: add per-device locking
  2022-08-31 14:11 ` [RFC PATCH 07/10] xen: pci: add per-device locking Volodymyr Babchuk
@ 2023-01-28  0:56   ` Stefano Stabellini
  2023-02-20 22:29     ` Volodymyr Babchuk
  2023-02-28 16:46   ` Jan Beulich
  1 sibling, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2023-01-28  0:56 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant

On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
> Spinlock in struct pci_device will be used to protect access to device
> itself. Right now it is used mostly by MSI code.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

There are 2 instances of:

    BUG_ON(list_empty(&dev->msi_list));

in xen/arch/x86/msi.c:__pci_disable_msi and
xen/arch/x86/msi.c:__pci_disable_msix which are not protected by
pcidev_lock. However list_empty needs to be protected. (pci_disable_msi
can also be called from xen/arch/x86/irq.c where it is not surrounded by
pcidev_lock.)

Given that they are BUG_ON, I wonder if we could remove them instead of
adding locks there. It would make things simpler.


> ---
>  xen/arch/x86/hvm/vmsi.c       |  6 +++++-
>  xen/arch/x86/msi.c            | 16 ++++++++++++++++
>  xen/drivers/passthrough/msi.c |  8 +++++++-
>  xen/drivers/passthrough/pci.c |  2 ++
>  xen/include/xen/pci.h         | 12 ++++++++++++
>  5 files changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 7fb1075673..c9e5f279c5 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -203,10 +203,14 @@ static struct msi_desc *msixtbl_addr_to_desc(
>  
>      nr_entry = (addr - entry->gtable) / PCI_MSIX_ENTRY_SIZE;
>  
> +    pcidev_lock(entry->pdev);
>      list_for_each_entry( desc, &entry->pdev->msi_list, list )
>          if ( desc->msi_attrib.type == PCI_CAP_ID_MSIX &&
> -             desc->msi_attrib.entry_nr == nr_entry )
> +             desc->msi_attrib.entry_nr == nr_entry ) {
> +	    pcidev_unlock(entry->pdev);

code style


>              return desc;
> +	}
> +    pcidev_unlock(entry->pdev);
>  
>      return NULL;
>  }
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index bccaccb98b..6b62c4f452 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -389,6 +389,7 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
>      default:
>          return 0;
>      }
> +

spurious change


>      entry->msi_attrib.host_masked = host;
>      entry->msi_attrib.guest_masked = guest;
>  
> @@ -585,12 +586,17 @@ static struct msi_desc *find_msi_entry(struct pci_dev *dev,
>  {
>      struct msi_desc *entry;
>  
> +    pcidev_lock(dev);
>      list_for_each_entry( entry, &dev->msi_list, list )
>      {
>          if ( entry->msi_attrib.type == cap_id &&
>               (irq == -1 || entry->irq == irq) )
> +	{
> +	    pcidev_unlock(dev);
>              return entry;
> +	}
>      }
> +    pcidev_unlock(dev);
>  
>      return NULL;
>  }
> @@ -661,7 +667,9 @@ static int msi_capability_init(struct pci_dev *dev,
>          maskbits |= ~(uint32_t)0 >> (32 - dev->msi_maxvec);
>          pci_conf_write32(dev->sbdf, mpos, maskbits);
>      }
> +    pcidev_lock(dev);
>      list_add_tail(&entry->list, &dev->msi_list);
> +    pcidev_unlock(dev);
>  
>      *desc = entry;
>      /* Restore the original MSI enabled bits  */
> @@ -946,7 +954,9 @@ static int msix_capability_init(struct pci_dev *dev,
>  
>  	pcidev_get(dev);
>  
> +	pcidev_lock(dev);
>          list_add_tail(&entry->list, &dev->msi_list);
> +	pcidev_unlock(dev);
>          *desc = entry;
>      }
>  
> @@ -1231,11 +1241,13 @@ static void msi_free_irqs(struct pci_dev* dev)
>  {
>      struct msi_desc *entry, *tmp;
>  
> +    pcidev_lock(dev);
>      list_for_each_entry_safe( entry, tmp, &dev->msi_list, list )
>      {
>          pci_disable_msi(entry);
>          msi_free_irq(entry);
>      }
> +    pcidev_unlock(dev);
>  }
>  
>  void pci_cleanup_msi(struct pci_dev *pdev)
> @@ -1354,6 +1366,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>      if ( ret )
>          return ret;
>  
> +    pcidev_lock(pdev);
>      list_for_each_entry_safe( entry, tmp, &pdev->msi_list, list )
>      {
>          unsigned int i = 0, nr = 1;
> @@ -1371,6 +1384,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>              dprintk(XENLOG_ERR, "Restore MSI for %pp entry %u not set?\n",
>                      &pdev->sbdf, i);
>              spin_unlock_irqrestore(&desc->lock, flags);
> +	    pcidev_unlock(pdev);
>              if ( type == PCI_CAP_ID_MSIX )
>                  pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
>                                   control & ~PCI_MSIX_FLAGS_ENABLE);
> @@ -1393,6 +1407,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>              if ( unlikely(!memory_decoded(pdev)) )
>              {
>                  spin_unlock_irqrestore(&desc->lock, flags);
> +		pcidev_unlock(pdev);
>                  pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
>                                   control & ~PCI_MSIX_FLAGS_ENABLE);
>                  return -ENXIO;
> @@ -1438,6 +1453,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>          pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
>                           control | PCI_MSIX_FLAGS_ENABLE);
>  
> +    pcidev_unlock(pdev);
>      return 0;
>  }
>  
> diff --git a/xen/drivers/passthrough/msi.c b/xen/drivers/passthrough/msi.c
> index ce1a450f6f..98f4d2721a 100644
> --- a/xen/drivers/passthrough/msi.c
> +++ b/xen/drivers/passthrough/msi.c
> @@ -22,6 +22,7 @@ int pdev_msi_init(struct pci_dev *pdev)
>  {
>      unsigned int pos;
>  
> +    pcidev_lock(pdev);
>      INIT_LIST_HEAD(&pdev->msi_list);
>  
>      pos = pci_find_cap_offset(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> @@ -41,7 +42,10 @@ int pdev_msi_init(struct pci_dev *pdev)
>          uint16_t ctrl;
>  
>          if ( !msix )
> -            return -ENOMEM;
> +        {
> +             pcidev_unlock(pdev);
> +             return -ENOMEM;
> +        }
>  
>          spin_lock_init(&msix->table_lock);
>  
> @@ -51,6 +55,8 @@ int pdev_msi_init(struct pci_dev *pdev)
>          pdev->msix = msix;
>      }
>  
> +    pcidev_unlock(pdev);
> +
>      return 0;
>  }
>  
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index c8da80b981..c83397211b 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1383,7 +1383,9 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
>              printk("%pd", pdev->domain);
>          printk(" - node %-3d refcnt %d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1,
>                 atomic_read(&pdev->refcnt));
> +        pcidev_lock(pdev);
>          pdev_dump_msi(pdev);
> +        pcidev_unlock(pdev);
>          printk("\n");
>      }
>      spin_unlock(&pseg->alldevs_lock);
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index e71a180ef3..d0a7339d84 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -106,6 +106,8 @@ struct pci_dev {
>      uint8_t msi_maxvec;
>      uint8_t phantom_stride;
>  
> +    /* Device lock */
> +    spinlock_t lock;
>      nodeid_t node; /* NUMA node */
>  
>      /* Device to be quarantined, don't automatically re-assign to dom0 */
> @@ -235,6 +237,16 @@ int msixtbl_pt_register(struct domain *, struct pirq *, uint64_t gtable);
>  void msixtbl_pt_unregister(struct domain *, struct pirq *);
>  void msixtbl_pt_cleanup(struct domain *d);
>  
> +static inline void pcidev_lock(struct pci_dev *pdev)
> +{
> +    spin_lock(&pdev->lock);
> +}
> +
> +static inline void pcidev_unlock(struct pci_dev *pdev)
> +{
> +    spin_unlock(&pdev->lock);
> +}
> +
>  #ifdef CONFIG_HVM
>  int arch_pci_clean_pirqs(struct domain *d);
>  #else
> -- 
> 2.36.1
> 



* Re: [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls
  2022-08-31 14:11 ` [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls Volodymyr Babchuk
@ 2023-01-28  1:32   ` Stefano Stabellini
  2023-02-20 23:13     ` Volodymyr Babchuk
  2023-02-28 16:51     ` Jan Beulich
  0 siblings, 2 replies; 43+ messages in thread
From: Stefano Stabellini @ 2023-01-28  1:32 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Kevin Tian

On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
> As pci devices are refcounted now and all list that store them are
> protected by separate locks, we can safely drop global pcidevs_lock.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

Up to and including this patch, the series introduces:
- d->pdevs_lock to protect d->pdev_list
- pci_seg->alldevs_lock to protect pci_seg->alldevs_list
- iommu->ats_list_lock to protect iommu->ats_devices
- pdev refcounting to detect a pdev in-use and when to free it
- pdev->lock to protect pdev->msi_list

They cover a lot of ground.  Are they collectively covering everything
pcidevs_lock() was protecting?

deassign_device is not protected by pcidevs_lock anymore.
deassign_device accesses a number of pdev fields, including quarantine,
phantom_stride and fault.count.

deassign_device could run at the same time as assign_device, which sets
quarantine and other fields.

It looks like assign_device, deassign_device, and other functions
accessing/modifying pdev fields should be protected by pdev->lock.

In fact, I think it would be safer to make sure every place that
currently takes pcidevs_lock() gets a pdev->lock instead (unless it is
already covered by d->pdevs_lock, pci_seg->alldevs_lock,
iommu->ats_list_lock, or another lock)?
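
As a sketch of what that could look like, with toy types rather than the
real struct pci_dev (toy_pdev, TOY_DOM_IO and the toy_* functions are
hypothetical, and the int "lock" is single-threaded only): both paths
touch the per-device fields only between the lock and unlock calls, so
they could no longer interleave on the same device.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DOM_IO (-1)   /* made-up id standing in for dom_io */

/* Toy device holding the fields mentioned above. */
struct toy_pdev {
    int lock_held;            /* stands in for pdev->lock */
    int owner;                /* toy domain id */
    bool quarantine;
    unsigned int fault_count;
};

static void toy_pdev_lock(struct toy_pdev *p)
{
    assert(!p->lock_held);
    p->lock_held = 1;
}

static void toy_pdev_unlock(struct toy_pdev *p)
{
    assert(p->lock_held);
    p->lock_held = 0;
}

/* Assignment path: owner and quarantine change under the device lock. */
static void toy_assign(struct toy_pdev *p, int domid)
{
    toy_pdev_lock(p);
    p->owner = domid;
    if ( domid == TOY_DOM_IO )
        p->quarantine = true;
    toy_pdev_unlock(p);
}

/* Deassignment path: owner and fault count change under the same lock. */
static void toy_deassign(struct toy_pdev *p, int target)
{
    toy_pdev_lock(p);
    p->owner = target;
    p->fault_count = 0;
    toy_pdev_unlock(p);
}
```

Whether a single per-device lock is the right granularity for all of
these fields is a separate question; the sketch only shows the shape.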


> ---
>  xen/arch/x86/domctl.c                       |  8 ---
>  xen/arch/x86/hvm/vioapic.c                  |  2 -
>  xen/arch/x86/hvm/vmsi.c                     | 12 ----
>  xen/arch/x86/irq.c                          |  7 ---
>  xen/arch/x86/msi.c                          | 11 ----
>  xen/arch/x86/pci.c                          |  4 --
>  xen/arch/x86/physdev.c                      |  7 +--
>  xen/common/sysctl.c                         |  2 -
>  xen/drivers/char/ns16550.c                  |  4 --
>  xen/drivers/passthrough/amd/iommu_init.c    |  7 ---
>  xen/drivers/passthrough/amd/iommu_map.c     |  5 --
>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  4 --
>  xen/drivers/passthrough/pci.c               | 63 +--------------------
>  xen/drivers/passthrough/vtd/iommu.c         |  8 ---
>  xen/drivers/video/vga.c                     |  2 -
>  15 files changed, 4 insertions(+), 142 deletions(-)
> 
> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
> index 020df615bd..9f4ca03385 100644
> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -537,11 +537,7 @@ long arch_do_domctl(
>  
>          ret = -ESRCH;
>          if ( is_iommu_enabled(d) )
> -        {
> -            pcidevs_lock();
>              ret = pt_irq_create_bind(d, bind);
> -            pcidevs_unlock();
> -        }
>          if ( ret < 0 )
>              printk(XENLOG_G_ERR "pt_irq_create_bind failed (%ld) for dom%d\n",
>                     ret, d->domain_id);
> @@ -566,11 +562,7 @@ long arch_do_domctl(
>              break;
>  
>          if ( is_iommu_enabled(d) )
> -        {
> -            pcidevs_lock();
>              ret = pt_irq_destroy_bind(d, bind);
> -            pcidevs_unlock();
> -        }
>          if ( ret < 0 )
>              printk(XENLOG_G_ERR "pt_irq_destroy_bind failed (%ld) for dom%d\n",
>                     ret, d->domain_id);
> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> index cb7f440160..aa4e7766a3 100644
> --- a/xen/arch/x86/hvm/vioapic.c
> +++ b/xen/arch/x86/hvm/vioapic.c
> @@ -197,7 +197,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>          return ret;
>      }
>  
> -    pcidevs_lock();
>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
>      if ( ret )
>      {
> @@ -207,7 +206,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>          unmap_domain_pirq(currd, pirq);
>          write_unlock(&currd->event_lock);
>      }
> -    pcidevs_unlock();
>  
>      return ret;
>  }
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index c9e5f279c5..344bbd646c 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -470,7 +470,6 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>      struct msixtbl_entry *entry, *new_entry;
>      int r = -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -540,7 +539,6 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>      struct pci_dev *pdev;
>      struct msixtbl_entry *entry;
>  
> -    ASSERT(pcidevs_locked());
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -686,8 +684,6 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
>  {
>      unsigned int i;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
>      {
>          gdprintk(XENLOG_ERR, "%pp: PIRQ %u: unsupported address %lx\n",
> @@ -728,7 +724,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>  
>      ASSERT(msi->arch.pirq != INVALID_PIRQ);
>  
> -    pcidevs_lock();
>      for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq unbind = {
> @@ -747,7 +742,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>  
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
>                                         msi->vectors, msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  }
>  
>  static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
> @@ -785,10 +779,8 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
>          return rc;
>      msi->arch.pirq = rc;
>  
> -    pcidevs_lock();
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
>                                         msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  
>      return 0;
>  }
> @@ -800,7 +792,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>  
>      ASSERT(pirq != INVALID_PIRQ);
>  
> -    pcidevs_lock();
>      for ( i = 0; i < nr && bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq bind = {
> @@ -816,7 +807,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>      write_lock(&pdev->domain->event_lock);
>      unmap_domain_pirq(pdev->domain, pirq);
>      write_unlock(&pdev->domain->event_lock);
> -    pcidevs_unlock();
>  }
>  
>  void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
> @@ -863,7 +853,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>  
>      entry->arch.pirq = rc;
>  
> -    pcidevs_lock();
>      rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
>                           entry->masked);
>      if ( rc )
> @@ -871,7 +860,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>          vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
>          entry->arch.pirq = INVALID_PIRQ;
>      }
> -    pcidevs_unlock();
>  
>      return rc;
>  }
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index d8672a03e1..6a08830a55 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2156,8 +2156,6 @@ int map_domain_pirq(
>          struct pci_dev *pdev;
>          unsigned int nr = 0;
>  
> -        ASSERT(pcidevs_locked());
> -
>          ret = -ENODEV;
>          if ( !cpu_has_apic )
>              goto done;
> @@ -2317,7 +2315,6 @@ int unmap_domain_pirq(struct domain *d, int pirq)
>      if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
>          return -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      info = pirq_info(d, pirq);
> @@ -2423,7 +2420,6 @@ void free_domain_pirqs(struct domain *d)
>  {
>      int i;
>  
> -    pcidevs_lock();
>      write_lock(&d->event_lock);
>  
>      for ( i = 0; i < d->nr_pirqs; i++ )
> @@ -2431,7 +2427,6 @@ void free_domain_pirqs(struct domain *d)
>              unmap_domain_pirq(d, i);
>  
>      write_unlock(&d->event_lock);
> -    pcidevs_unlock();
>  }
>  
>  static void cf_check dump_irqs(unsigned char key)
> @@ -2911,7 +2906,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>      msi->irq = irq;
>  
> -    pcidevs_lock();
>      /* Verify or get pirq. */
>      write_lock(&d->event_lock);
>      pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
> @@ -2927,7 +2921,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>   done:
>      write_unlock(&d->event_lock);
> -    pcidevs_unlock();
>      if ( ret )
>      {
>          switch ( type )
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 6b62c4f452..f04b90e235 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -623,7 +623,6 @@ static int msi_capability_init(struct pci_dev *dev,
>      u8 slot = PCI_SLOT(dev->devfn);
>      u8 func = PCI_FUNC(dev->devfn);
>  
> -    ASSERT(pcidevs_locked());
>      pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
>      if ( !pos )
>          return -ENODEV;
> @@ -810,8 +809,6 @@ static int msix_capability_init(struct pci_dev *dev,
>      if ( !pos )
>          return -ENODEV;
>  
> -    ASSERT(pcidevs_locked());
> -
>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>      /*
>       * Ensure MSI-X interrupts are masked during setup. Some devices require
> @@ -1032,7 +1029,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>      struct msi_desc *old_desc;
>      int ret;
>  
> -    ASSERT(pcidevs_locked());
>      pdev = pci_get_pdev(NULL, msi->sbdf);
>      if ( !pdev )
>          return -ENODEV;
> @@ -1092,7 +1088,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
>      struct msi_desc *old_desc;
>      int ret;
>  
> -    ASSERT(pcidevs_locked());
>      pdev = pci_get_pdev(NULL, msi->sbdf);
>      if ( !pdev || !pdev->msix )
>          return -ENODEV;
> @@ -1191,7 +1186,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>      if ( !use_msi )
>          return 0;
>  
> -    pcidevs_lock();
>      pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
>      if ( !pdev )
>          rc = -ENODEV;
> @@ -1204,7 +1198,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>      }
>      else
>          rc = msix_capability_init(pdev, NULL, NULL);
> -    pcidevs_unlock();
>  
>      pcidev_put(pdev);
>  
> @@ -1217,8 +1210,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>   */
>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>  {
> -    ASSERT(pcidevs_locked());
> -
>      if ( !use_msi )
>          return -EPERM;
>  
> @@ -1355,8 +1346,6 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>      unsigned int type = 0, pos = 0;
>      u16 control = 0;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( !use_msi )
>          return -EOPNOTSUPP;
>  
> diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
> index 1d38f0df7c..4dcd6d96f3 100644
> --- a/xen/arch/x86/pci.c
> +++ b/xen/arch/x86/pci.c
> @@ -88,15 +88,11 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
>      if ( reg < 64 || reg >= 256 )
>          return 0;
>  
> -    pcidevs_lock();
> -
>      pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
>      if ( pdev ) {
>          rc = pci_msi_conf_write_intercept(pdev, reg, size, data);
>  	pcidev_put(pdev);
>      }
>  
> -    pcidevs_unlock();
> -
>      return rc;
>  }
> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
> index 96214a3d40..a41366b609 100644
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -162,11 +162,9 @@ int physdev_unmap_pirq(domid_t domid, int pirq)
>              goto free_domain;
>      }
>  
> -    pcidevs_lock();
>      write_lock(&d->event_lock);
>      ret = unmap_domain_pirq(d, pirq);
>      write_unlock(&d->event_lock);
> -    pcidevs_unlock();
>  
>   free_domain:
>      rcu_unlock_domain(d);
> @@ -530,7 +528,6 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          if ( copy_from_guest(&restore_msi, arg, 1) != 0 )
>              break;
>  
> -        pcidevs_lock();
>          pdev = pci_get_pdev(NULL,
>                              PCI_SBDF(0, restore_msi.bus, restore_msi.devfn));
>          if ( pdev )
> @@ -541,7 +538,6 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          else
>              ret = -ENODEV;
>  
> -        pcidevs_unlock();
>          break;
>      }
>  
> @@ -553,7 +549,6 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          if ( copy_from_guest(&dev, arg, 1) != 0 )
>              break;
>  
> -        pcidevs_lock();
>          pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
>          if ( pdev )
>          {
> @@ -562,7 +557,7 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          }
>          else
>              ret = -ENODEV;
> -        pcidevs_unlock();
> +
>          break;
>      }
>  
> diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
> index 0feef94cd2..6bb8c5c295 100644
> --- a/xen/common/sysctl.c
> +++ b/xen/common/sysctl.c
> @@ -446,7 +446,6 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>                  break;
>              }
>  
> -            pcidevs_lock();
>              pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
>              if ( !pdev )
>                  node = XEN_INVALID_DEV;
> @@ -454,7 +453,6 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>                  node = XEN_INVALID_NODE_ID;
>              else
>                  node = pdev->node;
> -            pcidevs_unlock();
>  
>              if ( pdev )
>                  pcidev_put(pdev);
> diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
> index 01a05c9aa8..66c10b18e5 100644
> --- a/xen/drivers/char/ns16550.c
> +++ b/xen/drivers/char/ns16550.c
> @@ -445,8 +445,6 @@ static void __init cf_check ns16550_init_postirq(struct serial_port *port)
>              {
>                  struct msi_desc *msi_desc = NULL;
>  
> -                pcidevs_lock();
> -
>                  rc = pci_enable_msi(&msi, &msi_desc);
>                  if ( !rc )
>                  {
> @@ -460,8 +458,6 @@ static void __init cf_check ns16550_init_postirq(struct serial_port *port)
>                          pci_disable_msi(msi_desc);
>                  }
>  
> -                pcidevs_unlock();
> -
>                  if ( rc )
>                  {
>                      uart->irq = 0;
> diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
> index 7c1713a602..e42af65a40 100644
> --- a/xen/drivers/passthrough/amd/iommu_init.c
> +++ b/xen/drivers/passthrough/amd/iommu_init.c
> @@ -638,10 +638,7 @@ static void cf_check parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
>      uint16_t device_id = iommu_get_devid_from_cmd(entry[0]);
>      struct pci_dev *pdev;
>  
> -    pcidevs_lock();
>      pdev = pci_get_real_pdev(PCI_SBDF(iommu->seg, device_id));
> -    pcidevs_unlock();
> -
>      if ( pdev )
>          guest_iommu_add_ppr_log(pdev->domain, entry);
>      pcidev_put(pdev);
> @@ -747,14 +744,12 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu)
>          return 0;
>      }
>  
> -    pcidevs_lock();
>      /*
>       * XXX: it is unclear if this device can be removed. Right now
>       * there is no code that clears msi.dev, so no one will decrease
>       * refcount on it.
>       */
>      iommu->msi.dev = pci_get_pdev(NULL, PCI_SBDF(iommu->seg, iommu->bdf));
> -    pcidevs_unlock();
>      if ( !iommu->msi.dev )
>      {
>          AMD_IOMMU_WARN("no pdev for %pp\n",
> @@ -1289,9 +1284,7 @@ static int __init cf_check amd_iommu_setup_device_table(
>              {
>                  if ( !pci_init )
>                      continue;
> -                pcidevs_lock();
>                  pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
> -                pcidevs_unlock();
>              }
>  
>              if ( pdev && (pdev->msix || pdev->msi_maxvec) )
> diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
> index 9d621e3d36..d04aa37538 100644
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -726,9 +726,7 @@ int cf_check amd_iommu_get_reserved_device_memory(
>              /* May need to trigger the workaround in find_iommu_for_device(). */
>              struct pci_dev *pdev;
>  
> -            pcidevs_lock();
>              pdev = pci_get_pdev(NULL, sbdf);
> -            pcidevs_unlock();
>  
>              if ( pdev )
>              {
> @@ -848,7 +846,6 @@ int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
>      const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
>      int rc;
>  
> -    ASSERT(pcidevs_locked());
>      ASSERT(!hd->arch.amd.root_table);
>      ASSERT(page_list_empty(&hd->arch.pgtables.list));
>  
> @@ -903,8 +900,6 @@ void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
>  {
>      struct domain_iommu *hd = dom_iommu(dom_io);
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev->arch.amd.root_table )
>          return;
>  
> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> index 955f3af57a..919e30129e 100644
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -268,8 +268,6 @@ static int __must_check amd_iommu_setup_domain_device(
>                      req_id, pdev->type, page_to_maddr(root_pg),
>                      domid, hd->arch.amd.paging_mode);
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
>           !ivrs_dev->block_ats &&
>           iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) &&
> @@ -416,8 +414,6 @@ static void amd_iommu_disable_domain_device(const struct domain *domain,
>      if ( QUARANTINE_SKIP(domain, pdev) )
>          return;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
>           pci_ats_enabled(iommu->seg, bus, pdev->devfn) )
>      {
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index c83397211b..cc62a5aec4 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -517,7 +517,6 @@ int __init pci_hide_device(unsigned int seg, unsigned int bus,
>      struct pci_seg *pseg;
>      int rc = -ENOMEM;
>  
> -    pcidevs_lock();
>      pseg = alloc_pseg(seg);
>      if ( pseg )
>      {
> @@ -528,7 +527,6 @@ int __init pci_hide_device(unsigned int seg, unsigned int bus,
>              rc = 0;
>          }
>      }
> -    pcidevs_unlock();
>  
>      return rc;
>  }
> @@ -588,8 +586,6 @@ struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
>  {
>      struct pci_dev *pdev;
>  
> -    ASSERT(d || pcidevs_locked());
> -
>      /*
>       * The hardware domain owns the majority of the devices in the system.
>       * When there are multiple segments, traversing the per-segment list is
> @@ -730,7 +726,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          pdev_type = "device";
>      else if ( info->is_virtfn )
>      {
> -        pcidevs_lock();
>          pdev = pci_get_pdev(NULL,
>                              PCI_SBDF(seg, info->physfn.bus,
>                                       info->physfn.devfn));
> @@ -739,7 +734,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>              pf_is_extfn = pdev->info.is_extfn;
>              pcidev_put(pdev);
>          }
> -        pcidevs_unlock();
>          if ( !pdev )
>              pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
>                             NULL, node);
> @@ -756,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>  
>      ret = -ENOMEM;
>  
> -    pcidevs_lock();
>      pseg = alloc_pseg(seg);
>      if ( !pseg )
>          goto out;
> @@ -858,7 +851,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>      pci_enable_acs(pdev);
>  
>  out:
> -    pcidevs_unlock();
>      if ( !ret )
>      {
>          printk(XENLOG_DEBUG "PCI add %s %pp\n", pdev_type,  &pdev->sbdf);
> @@ -889,7 +881,6 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      if ( !pseg )
>          return -ENODEV;
>  
> -    pcidevs_lock();
>      spin_lock(&pseg->alldevs_lock);
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
> @@ -910,12 +901,10 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>              break;
>          }
>  
> -    pcidevs_unlock();
>      spin_unlock(&pseg->alldevs_lock);
>      return ret;
>  }
>  
> -/* Caller should hold the pcidevs_lock */
>  static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>                             uint8_t devfn)
>  {
> @@ -927,7 +916,6 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>      if ( !is_iommu_enabled(d) )
>          return -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
>      pdev = pci_get_pdev(d, PCI_SBDF(seg, bus, devfn));
>      if ( !pdev )
>          return -ENODEV;
> @@ -981,13 +969,10 @@ int pci_release_devices(struct domain *d)
>      u8 bus, devfn;
>      int ret;
>  
> -    pcidevs_lock();
>      ret = arch_pci_clean_pirqs(d);
>      if ( ret )
> -    {
> -        pcidevs_unlock();
>          return ret;
> -    }
> +
>      spin_lock(&d->pdevs_lock);
>      list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
>      {
> @@ -996,7 +981,6 @@ int pci_release_devices(struct domain *d)
>          ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>      }
>      spin_unlock(&d->pdevs_lock);
> -    pcidevs_unlock();
>  
>      return ret;
>  }
> @@ -1094,7 +1078,6 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
>      s_time_t now = NOW();
>      u16 cword;
>  
> -    pcidevs_lock();
>      pdev = pci_get_real_pdev(PCI_SBDF(seg, bus, devfn));
>      if ( pdev )
>      {
> @@ -1108,7 +1091,6 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
>              pdev = NULL;
>          }
>      }
> -    pcidevs_unlock();
>  
>      if ( !pdev )
>          return;
> @@ -1164,13 +1146,7 @@ static int __init cf_check _scan_pci_devices(struct pci_seg *pseg, void *arg)
>  
>  int __init scan_pci_devices(void)
>  {
> -    int ret;
> -
> -    pcidevs_lock();
> -    ret = pci_segments_iterate(_scan_pci_devices, NULL);
> -    pcidevs_unlock();
> -
> -    return ret;
> +    return pci_segments_iterate(_scan_pci_devices, NULL);
>  }
>  
>  struct setup_hwdom {
> @@ -1239,19 +1215,11 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
>  
>              pcidev_put(pdev);
>              if ( iommu_verbose )
> -            {
> -                pcidevs_unlock();
>                  process_pending_softirqs();
> -                pcidevs_lock();
> -            }
>          }
>  
>          if ( !iommu_verbose )
> -        {
> -            pcidevs_unlock();
>              process_pending_softirqs();
> -            pcidevs_lock();
> -        }
>      }
>  
>      return 0;
> @@ -1262,9 +1230,7 @@ void __hwdom_init setup_hwdom_pci_devices(
>  {
>      struct setup_hwdom ctxt = { .d = d, .handler = handler };
>  
> -    pcidevs_lock();
>      pci_segments_iterate(_setup_hwdom_pci_devices, &ctxt);
> -    pcidevs_unlock();
>  }
>  
>  /* APEI not supported on ARM yet. */
> @@ -1396,9 +1362,7 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
>  static void cf_check dump_pci_devices(unsigned char ch)
>  {
>      printk("==== PCI devices ====\n");
> -    pcidevs_lock();
>      pci_segments_iterate(_dump_pci_devices, NULL);
> -    pcidevs_unlock();
>  }
>  
>  static int __init cf_check setup_dump_pcidevs(void)
> @@ -1417,8 +1381,6 @@ static int iommu_add_device(struct pci_dev *pdev)
>      if ( !pdev->domain )
>          return -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> -
>      hd = dom_iommu(pdev->domain);
>      if ( !is_iommu_enabled(pdev->domain) )
>          return 0;
> @@ -1446,8 +1408,6 @@ static int iommu_enable_device(struct pci_dev *pdev)
>      if ( !pdev->domain )
>          return -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> -
>      hd = dom_iommu(pdev->domain);
>      if ( !is_iommu_enabled(pdev->domain) ||
>           !hd->platform_ops->enable_device )
> @@ -1494,7 +1454,6 @@ static int device_assigned(struct pci_dev *pdev)
>  {
>      int rc = 0;
>  
> -    ASSERT(pcidevs_locked());
>      /*
>       * If the device exists and it is not owned by either the hardware
>       * domain or dom_io then it must be assigned to a guest, or be
> @@ -1507,7 +1466,6 @@ static int device_assigned(struct pci_dev *pdev)
>      return rc;
>  }
>  
> -/* Caller should hold the pcidevs_lock */
>  static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
>  {
>      const struct domain_iommu *hd = dom_iommu(d);
> @@ -1521,7 +1479,6 @@ static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
>          return -EXDEV;
>  
>      /* device_assigned() should already have cleared the device for assignment */
> -    ASSERT(pcidevs_locked());
>      ASSERT(pdev && (pdev->domain == hardware_domain ||
>                      pdev->domain == dom_io));
>  
> @@ -1587,7 +1544,6 @@ static int iommu_get_device_group(
>      if ( group_id < 0 )
>          return group_id;
>  
> -    pcidevs_lock();
>      spin_lock(&d->pdevs_lock);
>      for_each_pdev( d, pdev )
>      {
> @@ -1603,7 +1559,6 @@ static int iommu_get_device_group(
>          sdev_id = iommu_call(ops, get_device_group_id, seg, b, df);
>          if ( sdev_id < 0 )
>          {
> -            pcidevs_unlock();
>              spin_unlock(&d->pdevs_lock);
>              return sdev_id;
>          }
> @@ -1614,7 +1569,6 @@ static int iommu_get_device_group(
>  
>              if ( unlikely(copy_to_guest_offset(buf, i, &bdf, 1)) )
>              {
> -                pcidevs_unlock();
>                  spin_unlock(&d->pdevs_lock);
>                  return -EFAULT;
>              }
> @@ -1622,7 +1576,6 @@ static int iommu_get_device_group(
>          }
>      }
>  
> -    pcidevs_unlock();
>      spin_unlock(&d->pdevs_lock);
>  
>      return i;
> @@ -1630,17 +1583,12 @@ static int iommu_get_device_group(
>  
>  void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>  {
> -    pcidevs_lock();
> -
>      /* iommu->ats_list_lock is taken by the caller of this function */
>      disable_ats_device(pdev);
>  
>      ASSERT(pdev->domain);
>      if ( d != pdev->domain )
> -    {
> -        pcidevs_unlock();
>          return;
> -    }
>  
>      pdev->broken = true;
>  
> @@ -1649,8 +1597,6 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>                 d->domain_id, &pdev->sbdf);
>      if ( !is_hardware_domain(d) )
>          domain_crash(d);
> -
> -    pcidevs_unlock();
>  }
>  
>  int iommu_do_pci_domctl(
> @@ -1740,7 +1686,6 @@ int iommu_do_pci_domctl(
>              break;
>          }
>  
> -        pcidevs_lock();
>          ret = device_assigned(pdev);
>          if ( domctl->cmd == XEN_DOMCTL_test_assign_device )
>          {
> @@ -1755,7 +1700,7 @@ int iommu_do_pci_domctl(
>              ret = assign_device(d, pdev, flags);
>  
>          pcidev_put(pdev);
> -        pcidevs_unlock();
> +
>          if ( ret == -ERESTART )
>              ret = hypercall_create_continuation(__HYPERVISOR_domctl,
>                                                  "h", u_domctl);
> @@ -1787,9 +1732,7 @@ int iommu_do_pci_domctl(
>          bus = PCI_BUS(machine_sbdf);
>          devfn = PCI_DEVFN(machine_sbdf);
>  
> -        pcidevs_lock();
>          ret = deassign_device(d, seg, bus, devfn);
> -        pcidevs_unlock();
>          break;
>  
>      default:
> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
> index 42661f22f4..87868188b7 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1490,7 +1490,6 @@ int domain_context_mapping_one(
>      if ( QUARANTINE_SKIP(domain, pgd_maddr) )
>          return 0;
>  
> -    ASSERT(pcidevs_locked());
>      spin_lock(&iommu->lock);
>      maddr = bus_to_context_maddr(iommu, bus);
>      context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
> @@ -1711,8 +1710,6 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
>      if ( drhd && drhd->iommu->node != NUMA_NO_NODE )
>          dom_iommu(domain)->node = drhd->iommu->node;
>  
> -    ASSERT(pcidevs_locked());
> -
>      for_each_rmrr_device( rmrr, bdf, i )
>      {
>          if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
> @@ -2072,8 +2069,6 @@ static void quarantine_teardown(struct pci_dev *pdev,
>  {
>      struct domain_iommu *hd = dom_iommu(dom_io);
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev->arch.vtd.pgd_maddr )
>          return;
>  
> @@ -2341,8 +2336,6 @@ static int cf_check intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
>      u16 bdf;
>      int ret, i;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev->domain )
>          return -EINVAL;
>  
> @@ -3176,7 +3169,6 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
>      bool rmrr_found = false;
>      int rc;
>  
> -    ASSERT(pcidevs_locked());
>      ASSERT(!hd->arch.vtd.pgd_maddr);
>      ASSERT(page_list_empty(&hd->arch.pgtables.list));
>  
> diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c
> index 1298f3a7b6..1f7c496114 100644
> --- a/xen/drivers/video/vga.c
> +++ b/xen/drivers/video/vga.c
> @@ -117,9 +117,7 @@ void __init video_endboot(void)
>                  struct pci_dev *pdev;
>                  u8 b = bus, df = devfn, sb;
>  
> -                pcidevs_lock();
>                  pdev = pci_get_pdev(NULL, PCI_SBDF(0, bus, devfn));
> -                pcidevs_unlock();
>  
>                  if ( !pdev ||
>                       pci_conf_read16(PCI_SBDF(0, bus, devfn),
> -- 
> 2.36.1
> 



* Re: [RFC PATCH 09/10] [RFC only] xen: iommu: remove last pcidevs_lock() calls in iommu
  2022-08-31 14:11 ` [RFC PATCH 09/10] [RFC only] xen: iommu: remove last pcidevs_lock() calls in iommu Volodymyr Babchuk
@ 2023-01-28  1:36   ` Stefano Stabellini
  2023-02-20  0:41     ` Volodymyr Babchuk
  2023-02-28 16:25   ` Jan Beulich
  1 sibling, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2023-01-28  1:36 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Kevin Tian, Jan Beulich,
	Paul Durrant, Roger Pau Monné

On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
> There are a number of cases where pcidevs_lock() is used to protect
> something that is not related to PCI devices per se.
> 
> Probably pcidevs_lock() in these places should be replaced with some
> other lock.
> 
> This patch is not intended to be merged and is present only to discuss
> this use of pcidevs_lock().
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

I wonder if we are being too ambitious: is it necessary to get rid of
pcidevs_lock completely, without replacing all its occurrences with
something else?

It would be a lot easier to only replace pcidevs_lock with pdev->lock,
i.e. to replace the global lock with a per-device lock. That alone would
be a major improvement, and its correctness would be far easier to
verify.

This patch and the previous one, by contrast, together remove all
occurrences of pcidevs_lock without adding pdev->lock, which is
difficult to prove correct.
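
To make the per-device lock idea concrete, here is a minimal user-space
sketch. It is an illustration under stated assumptions, not code from
the series: sketch_lock_t, struct pdev_sketch and the pdev_sketch_*()
helpers are all hypothetical names, and a C11 atomic flag stands in for
Xen's spinlock_t.

```c
/*
 * User-space sketch only. All names are hypothetical; a C11 atomic
 * flag stands in for Xen's spinlock_t.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool held; } sketch_lock_t;

static void sketch_lock_init(sketch_lock_t *l)
{
    atomic_init(&l->held, false);
}

static void sketch_lock(sketch_lock_t *l)
{
    /* Spin until we are the one who flips the flag from false to true. */
    while ( atomic_exchange_explicit(&l->held, true, memory_order_acquire) )
        ;
}

static void sketch_unlock(sketch_lock_t *l)
{
    /* Releasing a lock that is not held would be a bug. */
    assert(atomic_load(&l->held));
    atomic_store_explicit(&l->held, false, memory_order_release);
}

/* Trimmed-down stand-in for struct pci_dev. */
struct pdev_sketch {
    sketch_lock_t lock;     /* per-device lock instead of the global one */
    bool msi_enabled;       /* example of per-device state it protects */
};

static void pdev_sketch_init(struct pdev_sketch *pdev)
{
    sketch_lock_init(&pdev->lock);
    pdev->msi_enabled = false;
}

/* A caller that used to wrap this in pcidevs_lock()/pcidevs_unlock(). */
static void pdev_sketch_set_msi(struct pdev_sketch *pdev, bool on)
{
    sketch_lock(&pdev->lock);
    pdev->msi_enabled = on;
    sketch_unlock(&pdev->lock);
}

static bool pdev_sketch_get_msi(struct pdev_sketch *pdev)
{
    bool on;

    sketch_lock(&pdev->lock);
    on = pdev->msi_enabled;
    sketch_unlock(&pdev->lock);

    return on;
}
```

With this shape, updates to two different devices no longer contend on
one lock, and each former pcidevs_lock() site maps onto exactly one
pdev->lock site, which is what would make the conversion easier to
review.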


> ---
>  xen/drivers/passthrough/vtd/intremap.c | 2 --
>  xen/drivers/passthrough/vtd/iommu.c    | 5 -----
>  xen/drivers/passthrough/x86/iommu.c    | 5 -----
>  3 files changed, 12 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
> index 1512e4866b..44e3b72f91 100644
> --- a/xen/drivers/passthrough/vtd/intremap.c
> +++ b/xen/drivers/passthrough/vtd/intremap.c
> @@ -893,8 +893,6 @@ int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
>  
>      spin_unlock_irq(&desc->lock);
>  
> -    ASSERT(pcidevs_locked());
> -
>      return msi_msg_write_remap_rte(msi_desc, &msi_desc->msg);
>  
>   unlock_out:
> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
> index 87868188b7..9d258d154d 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -127,8 +127,6 @@ static int context_set_domain_id(struct context_entry *context,
>  {
>      unsigned int i;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( domid_mapping(iommu) )
>      {
>          unsigned int nr_dom = cap_ndoms(iommu->cap);
> @@ -1882,7 +1880,6 @@ int domain_context_unmap_one(
>      int iommu_domid, rc, ret;
>      bool_t flush_dev_iotlb;
>  
> -    ASSERT(pcidevs_locked());
>      spin_lock(&iommu->lock);
>  
>      maddr = bus_to_context_maddr(iommu, bus);
> @@ -2601,7 +2598,6 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
>      u16 bdf;
>      int ret, i;
>  
> -    pcidevs_lock();
>      for_each_rmrr_device ( rmrr, bdf, i )
>      {
>          /*
> @@ -2616,7 +2612,6 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
>              dprintk(XENLOG_ERR VTDPREFIX,
>                       "IOMMU: mapping reserved region failed\n");
>      }
> -    pcidevs_unlock();
>  }
>  
>  static struct iommu_state {
> diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
> index f671b0f2bb..4e94ad15df 100644
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -207,7 +207,6 @@ int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
>      struct identity_map *map;
>      struct domain_iommu *hd = dom_iommu(d);
>  
> -    ASSERT(pcidevs_locked());
>      ASSERT(base < end);
>  
>      /*
> @@ -479,8 +478,6 @@ domid_t iommu_alloc_domid(unsigned long *map)
>      static unsigned int start;
>      unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( idx >= UINT16_MAX - DOMID_MASK )
>          idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
>      if ( idx >= UINT16_MAX - DOMID_MASK )
> @@ -495,8 +492,6 @@ domid_t iommu_alloc_domid(unsigned long *map)
>  
>  void iommu_free_domid(domid_t domid, unsigned long *map)
>  {
> -    ASSERT(pcidevs_locked());
> -
>      if ( domid == DOMID_INVALID )
>          return;
>  
> -- 
> 2.36.1
> 



* Re: [RFC PATCH 01/10] xen: pci: add per-domain pci list lock
  2023-01-26 23:18   ` Stefano Stabellini
  2023-01-27  8:01     ` Jan Beulich
@ 2023-02-14 23:38     ` Volodymyr Babchuk
  2023-02-15  9:06       ` Jan Beulich
  1 sibling, 1 reply; 43+ messages in thread
From: Volodymyr Babchuk @ 2023-02-14 23:38 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Wei Liu, Paul Durrant,
	Roger Pau Monné,
	Kevin Tian, Stewart.Hildebrand


Hello Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>> domain->pdevs_lock protects access to domain->pdev_list.
>> As such, it should be used when we are adding, removing or enumerating
>> PCI devices assigned to a domain.
>> 
>> This enables more granular locking instead of one huge pcidevs_lock that
>> locks the entire PCI subsystem. Please note that pcidevs_lock() is still
>> used; we are going to remove it in subsequent patches.
>> 
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>
> I reviewed the patch, and made sure to pay extra attention to:

Thank you for doing this.

> - error paths
> - missing locks
> - lock ordering
> - interruptions

> Here is what I found:
>
>
> 1) iommu.c:reassign_device_ownership and pci_amd_iommu.c:reassign_device
> Both functions do the following without taking any pdevs_lock:
> list_move(&pdev->domain_list, &target->pdev_list);
>
> It seems to me this would need pdevs_lock. Maybe we need to change
> list_move into list_del (protected by the pdevs_lock of the old domain)
> and list_add (protected by the pdevs_lock of the new domain).

Yes, I did as you suggested. But this leads to another potential
issue. I'll describe it below, in deassign_device() part.
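
As a rough standalone illustration of that split (toy spinlock and
stand-in types, not the actual Xen code; move_pdev() and domain_init()
are hypothetical names), the list_del/list_add pair never holds both
per-domain locks at once:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-ins for Xen's spinlock_t and list_head (illustration only). */
typedef struct { atomic_flag held; } spinlock_t;
static void spin_lock(spinlock_t *l) { while ( atomic_flag_test_and_set(&l->held) ) ; }
static void spin_unlock(spinlock_t *l) { atomic_flag_clear(&l->held); }

struct list_head { struct list_head *prev, *next; };
static void list_init(struct list_head *h) { h->prev = h->next = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

struct domain {
    struct list_head pdev_list;
    spinlock_t pdevs_lock;
};

struct pci_dev {
    struct list_head domain_list;
    struct domain *domain;
};

static void domain_init(struct domain *d)
{
    list_init(&d->pdev_list);
    atomic_flag_clear(&d->pdevs_lock.held);
}

/* Hypothetical helper: replaces a bare list_move() with two locked steps. */
static void move_pdev(struct pci_dev *pdev, struct domain *source,
                      struct domain *target)
{
    spin_lock(&source->pdevs_lock);
    list_del(&pdev->domain_list);
    spin_unlock(&source->pdevs_lock);

    /* Window: pdev is on neither list at this point. */

    spin_lock(&target->pdevs_lock);
    list_add(&pdev->domain_list, &target->pdev_list);
    pdev->domain = target;
    spin_unlock(&target->pdevs_lock);
}
```

Because only one lock is held at any time, two concurrent moves in
opposite directions cannot deadlock on each other, at the price of the
short window in which the device is on no list.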

[...]

>> +    spin_lock(&d->pdevs_lock);
>>      list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
>>      {
>>          bus = pdev->bus;
>>          devfn = pdev->devfn;
>>          ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>
> This causes pdevs_lock to be taken twice. deassign_device also takes a
> pdevs_lock.  Probably we need to change all the
> spin_lock(&d->pdevs_lock) into spin_lock_recursive.

You are right, I missed that deassign_device() ends up calling
iommu*_reassign_device(). But therein lies the issue: with recursive
locks, reassign_device() will not be able to release source->pdevs_lock,
yet it will also try to take target->pdevs_lock. This can lead to a
deadlock if another call to reassign_device() moves some other pdev in
the opposite direction at the same time. This is why I want to avoid
recursive spinlocks if possible.

So, I am thinking: why does IOMMU code move a pdev across domains in the
first place? We are making the IOMMU driver responsible for managing a
domain's pdevs, which does not look right, as this is the responsibility
of pci.c.

I want to propose another approach: implement a deassign_device() function
in the IOMMU drivers. Then, instead of calling reassign_device(), we might
do the following:

1. deassign_device()
2. remove pdev from source->pdev_list
3. add pdev to target->pdev_list
4. assign_device()


[...]

>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index 1cf629e7ec..0775228ba9 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -457,6 +457,7 @@ struct domain
>>  
>>  #ifdef CONFIG_HAS_PCI
>>      struct list_head pdev_list;
>> +    spinlock_t pdevs_lock;
>
> I think it would be better called "pdev_lock" but OK either way

As Jan pointed out, we are locking devices, in the plural. On the other
hand, I can rename pdev_list to pdevs_list to make this consistent.

>
>
>>  #endif
>>  
>>  #ifdef CONFIG_HAS_PASSTHROUGH
>> -- 
>> 2.36.1
>> 



* Re: [RFC PATCH 01/10] xen: pci: add per-domain pci list lock
  2023-02-14 23:38     ` Volodymyr Babchuk
@ 2023-02-15  9:06       ` Jan Beulich
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-02-15  9:06 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Wei Liu, Paul Durrant, Roger Pau Monné,
	Kevin Tian, Stewart.Hildebrand, Stefano Stabellini

On 15.02.2023 00:38, Volodymyr Babchuk wrote:
> Stefano Stabellini <sstabellini@kernel.org> writes:
>> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>> 1) iommu.c:reassign_device_ownership and pci_amd_iommu.c:reassign_device
>> Both functions without any pdevs_lock locking do:
>> list_move(&pdev->domain_list, &target->pdev_list);
>>
>> It seems to be it would need pdevs_lock. Maybe we need to change
>> list_move into list_del (protected by the pdevs_lock of the old domain)
>> and list_add (protected by the pdev_lock of the new domain).
> 
> Yes, I did as you suggested. But this leads to another potential
> issue. I'll describe it below, in deassign_device() part.
> 
> [...]
> 
>>> +    spin_lock(&d->pdevs_lock);
>>>      list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
>>>      {
>>>          bus = pdev->bus;
>>>          devfn = pdev->devfn;
>>>          ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>
>> This causes pdevs_lock to be taken twice. deassign_device also takes a
>> pdevs_lock.  Probably we need to change all the
>> spin_lock(&d->pdevs_lock) into spin_lock_recursive.
> 
> You are right, I missed that deassign_device() causes
> iommu*_reassign_device() call. But there lies the issue: with recursive
> locks, reassign_device() will not be able to unlock source->pdevs_lock,
> but will try to take target->pdevs_lock also. This potentially might
> lead to deadlock, if another call to reassign_device() moves some other
> pdev in the opposite way at the same time. This is why I want to avoid
> recursive spinlocks if possible.
> 
> So, I am thinking: why does IOMMU code move a pdev across domains in the
> first place? We are making IOMMU driver responsible of managing domain's
> pdevs, which does not look right, as this is the responsibility of pci.c

The boundary between what is PCI and what is IOMMU is pretty fuzzy, I'm
afraid. After all pass-through (with IOMMU) is all about PCI devices (on
x86 at least). Despite the filename being pci.c, much of what's there is
actually IOMMU code. Specifically deassign_device() is the vendor-
independent IOMMU part of the operation; moving that ...

> I want to propose another approach: implement deassign_device() function
> in IOMMU drivers.

... into vendor-specific code would mean code duplication, which ought
to be avoided as much as possible.

> Then, instead of calling to reassign_device() we might
> do the following:
> 
> 1. deassign_device()
> 2. remove pdev from source->pdev_list
> 3. add pdev to target->pdev_list
> 4. assign_device()

I'm not sure such ordering would end up correct. It may need to be

1. remove pdev from source->pdev_list
2. deassign_device()
3. assign_device()
4. add pdev to target->pdev_list (or back to source upon failure)

which still may be troublesome: The present placement of the moving is, in
particular, ensuring that ownership (expressed by pdev->domain) changes at
the same time as list membership. With things properly locked it _may_ be
okay to separate (in time) these two steps, but that'll require quite a bit
of care (read: justification that it's correct/safe).

And of course you could then also ask why it's low level driver code
changing pdev->domain. I don't see how you would move that to generic code,
as the field shouldn't be left stale for an extended period of time, nor
can it sensibly be transiently set to e.g. NULL.

Additionally deassign_device() is misnamed - it's really re-assigning
the device (as you can see from it invoking
hd->platform_ops->reassign_device()). You cannot split de-assign from
assign: The device always should be assigned _somewhere_ (DOM_IO if
nothing else; see XSA-400), except late enough during hot-unplug. But that
would be taken care of by the alternative ordering above (combining 2 and
3 into a single step).
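
For illustration only, the alternative ordering with steps 2 and 3
combined could be sketched as below. This is standalone C with a toy
spinlock and stand-in types; move_pdev(), reassign_ok() and
reassign_fail() are hypothetical names, and the callback merely stands
in for hd->platform_ops->reassign_device().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-ins for Xen's spinlock_t and list_head (illustration only). */
typedef struct { atomic_flag held; } spinlock_t;
static void spin_lock(spinlock_t *l) { while ( atomic_flag_test_and_set(&l->held) ) ; }
static void spin_unlock(spinlock_t *l) { atomic_flag_clear(&l->held); }

struct list_head { struct list_head *prev, *next; };
static void list_init(struct list_head *h) { h->prev = h->next = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

struct domain { struct list_head pdev_list; spinlock_t pdevs_lock; };
struct pci_dev { struct list_head domain_list; struct domain *domain; };

static void domain_init(struct domain *d)
{
    list_init(&d->pdev_list);
    atomic_flag_clear(&d->pdevs_lock.held);
}

/* Hypothetical driver hooks: on success the owner changes, on failure it does not. */
static int reassign_ok(struct pci_dev *pdev, struct domain *target)
{
    pdev->domain = target;
    return 0;
}

static int reassign_fail(struct pci_dev *pdev, struct domain *target)
{
    (void)pdev;
    (void)target;
    return -1;
}

static int move_pdev(struct pci_dev *pdev, struct domain *source,
                     struct domain *target,
                     int (*reassign)(struct pci_dev *, struct domain *))
{
    struct domain *owner;
    int rc;

    /* 1. Remove pdev from source->pdev_list. */
    spin_lock(&source->pdevs_lock);
    list_del(&pdev->domain_list);
    spin_unlock(&source->pdevs_lock);

    /* 2+3. Re-assign in one step; the device is never left unassigned. */
    rc = reassign(pdev, target);

    /* 4. Add to the target list, or back to the source list upon failure. */
    owner = rc ? source : target;
    spin_lock(&owner->pdevs_lock);
    list_add(&pdev->domain_list, &owner->pdev_list);
    spin_unlock(&owner->pdevs_lock);

    return rc;
}
```

Note that pdev->domain and list membership change at different times in
this sketch, which is exactly the separation that would need justifying.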

Jan



* Re: [RFC PATCH 04/10] xen: add reference counter support
  2022-08-31 14:10 ` [RFC PATCH 04/10] xen: add reference counter support Volodymyr Babchuk
@ 2023-02-15 11:20   ` Jan Beulich
  2023-02-17  1:56     ` Volodymyr Babchuk
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2023-02-15 11:20 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, xen-devel

On 31.08.2022 16:10, Volodymyr Babchuk wrote:
> --- /dev/null
> +++ b/xen/include/xen/refcnt.h
> @@ -0,0 +1,28 @@
> +#ifndef __XEN_REFCNT_H__
> +#define __XEN_REFCNT_H__
> +
> +#include <asm/atomic.h>
> +
> +typedef atomic_t refcnt_t;

Like Linux has it, I think this would better be a separate struct. At
least in debug builds, i.e. it could certainly use typesafe.h if that
ended up to be a good fit (which I'm not sure it would, so this is
merely a thought).

> +static inline void refcnt_init(refcnt_t *refcnt)
> +{
> +	atomic_set(refcnt, 1);
> +}
> +
> +static inline void refcnt_get(refcnt_t *refcnt)
> +{
> +#ifndef NDEBUG
> +	ASSERT(atomic_add_unless(refcnt, 1, 0) > 0);
> +#else
> +	atomic_add_unless(refcnt, 1, 0);
> +#endif
> +}

I think this wants doing without any #ifdef-ary, e.g.

static inline void refcnt_get(refcnt_t *refcnt)
{
    int ret = atomic_add_unless(refcnt, 1, 0);

    ASSERT(ret > 0);
}

I wonder though whether certain callers may not want to instead know
whether a refcount was successfully obtained, i.e. whether instead of
asserting here you don't want to return a boolean success indicator,
which callers then would deal with (either by asserting or by suitably
handling the case). See get_page() and page_get_owner_and_reference()
for similar behavior we have (and use) already.
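
For illustration only, a boolean-returning variant might look like the
standalone sketch below. It uses C11 atomics and a small struct as
stand-ins for Xen's atomic_t and the patch's refcnt_t, so the names and
layout here are assumptions rather than the proposed interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the patch's refcnt_t (assumption: a struct around an atomic int). */
typedef struct { atomic_int count; } refcnt_t;

static inline void refcnt_init(refcnt_t *refcnt)
{
    atomic_init(&refcnt->count, 1);
}

/*
 * Take a reference unless the count has already dropped to zero.
 * Returns true on success, so callers can either assert or handle failure.
 */
static inline bool refcnt_get(refcnt_t *refcnt)
{
    int old = atomic_load(&refcnt->count);

    do {
        if ( old <= 0 )
            return false;
        /* On failure the compare-exchange reloads 'old' and we retry. */
    } while ( !atomic_compare_exchange_weak(&refcnt->count, &old, old + 1) );

    return true;
}
```

A caller that cannot tolerate failure would then write
ASSERT(refcnt_get(...)) equivalents itself, much like users of
get_page() check its return value.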

> +static inline void refcnt_put(refcnt_t *refcnt, void (*destructor)(refcnt_t *refcnt))
> +{
> +	if ( atomic_dec_and_test(refcnt) )
> +		destructor(refcnt);
> +}

No assertion here as to the count being positive?
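
As a standalone sketch of such a check (C11 atomics and assert() stand
in for Xen's atomic_t and ASSERT(); struct obj and obj_destroy() are
hypothetical), the value seen before the decrement can be asserted to be
positive:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the patch's refcnt_t. */
typedef struct { atomic_int count; } refcnt_t;

static inline void refcnt_put(refcnt_t *refcnt,
                              void (*destructor)(refcnt_t *refcnt))
{
    /* atomic_fetch_sub() returns the value held before the decrement. */
    int old = atomic_fetch_sub(&refcnt->count, 1);

    assert(old > 0);          /* catches a put without a matching get */
    if ( old == 1 )
        destructor(refcnt);
}

/* Hypothetical reference-counted object embedding the counter. */
struct obj {
    int destroyed;
    refcnt_t ref;
};

static void obj_destroy(refcnt_t *refcnt)
{
    /* Recover the containing object from the embedded counter. */
    struct obj *o = (struct obj *)((char *)refcnt - offsetof(struct obj, ref));

    o->destroyed = 1;
}
```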

Also the entire file wants to use Xen's space indentation, not hard
tabs.

Jan



* Re: [RFC PATCH 03/10] xen: pci: introduce ats_list_lock
  2023-01-27  8:13     ` Jan Beulich
@ 2023-02-17  1:20       ` Volodymyr Babchuk
  2023-02-17  7:39         ` Jan Beulich
  0 siblings, 1 reply; 43+ messages in thread
From: Volodymyr Babchuk @ 2023-02-17  1:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, xen-devel, Oleksandr Andrushchenko,
	Andrew Cooper, Paul Durrant, Roger Pau Monné,
	Kevin Tian


Hello Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 27.01.2023 00:56, Stefano Stabellini wrote:
>> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>>> --- a/xen/drivers/passthrough/pci.c
>>> +++ b/xen/drivers/passthrough/pci.c
>>> @@ -1641,6 +1641,7 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>>>  {
>>>      pcidevs_lock();
>>>  
>>> +    /* iommu->ats_list_lock is taken by the caller of this function */
>> 
>> This is a locking inversion. In all other places we take pcidevs_lock
>> first, then ats_list_lock lock. For instance look at
>> xen/drivers/passthrough/pci.c:deassign_device that is called with
>> pcidevs_locked and then calls iommu_call(... reassign_device ...) which
>> ends up taking ats_list_lock.
>> 
>> This is the only exception. I think we need to move the
>> spin_lock(ats_list_lock) from qinval.c to here.
>
> First question here is what the lock is meant to protect: Just the list,
> or also the ATS state change (i.e. serializing enable and disable against
> one another). In the latter case the lock also wants naming differently.

My intention was to protect list only. But I believe you are right and
we should protect the whole state. I'll rename the lock to ats_lock.

> Second question is who is to acquire the lock. Why isn't this done _in_
> {en,dis}able_ats_device() themselves? That would then allow to further
> reduce the locked range, because at least the pci_find_ext_capability()
> call and the final logging can occur without the lock held.

You are right, I'll extend the {en,dis}able_ats_device() API to take a
pointer to the lock.



* Re: [RFC PATCH 04/10] xen: add reference counter support
  2023-02-15 11:20   ` Jan Beulich
@ 2023-02-17  1:56     ` Volodymyr Babchuk
  2023-02-17  7:53       ` Jan Beulich
  0 siblings, 1 reply; 43+ messages in thread
From: Volodymyr Babchuk @ 2023-02-17  1:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, xen-devel


Hello Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 31.08.2022 16:10, Volodymyr Babchuk wrote:
>> --- /dev/null
>> +++ b/xen/include/xen/refcnt.h
>> @@ -0,0 +1,28 @@
>> +#ifndef __XEN_REFCNT_H__
>> +#define __XEN_REFCNT_H__
>> +
>> +#include <asm/atomic.h>
>> +
>> +typedef atomic_t refcnt_t;
>
> Like Linux has it, I think this would better be a separate struct. At
> least in debug builds, i.e. it could certainly use typesafe.h if that
> ended up to be a good fit (which I'm not sure it would, so this is
> merely a thought).

Sadly, TYPE_SAFE does not support pointers, e.g. I can't get a pointer to
an encapsulated value which is itself passed by pointer. I can extend
TYPE_SAFE with $FOO_x_ptr():

    static inline _type *_name##_x_ptr(_name##_t *n) { return &n->_name; }

or make a custom encapsulation in refcnt.h. Which one do you prefer?

>> +static inline void refcnt_init(refcnt_t *refcnt)
>> +{
>> +	atomic_set(refcnt, 1);
>> +}
>> +
>> +static inline void refcnt_get(refcnt_t *refcnt)
>> +{
>> +#ifndef NDEBUG
>> +	ASSERT(atomic_add_unless(refcnt, 1, 0) > 0);
>> +#else
>> +	atomic_add_unless(refcnt, 1, 0);
>> +#endif
>> +}

> I think this wants doing without any #ifdef-ary, e.g.
>
> static inline void refcnt_get(refcnt_t *refcnt)
> {
>     int ret = atomic_add_unless(refcnt, 1, 0);
>
>     ASSERT(ret > 0);
> }
>

Thanks, did as you suggested. I was afraid that the compiler would
complain about an unused ret in non-debug builds.

> I wonder though whether certain callers may not want to instead know
> whether a refcount was successfully obtained, i.e. whether instead of
> asserting here you don't want to return a boolean success indicator,
> which callers then would deal with (either by asserting or by suitably
> handling the case). See get_page() and page_get_owner_and_reference()
> for similar behavior we have (and use) already.

For now there are no such callers, so I don't want to implement unused
functionality. But, if you prefer this way, I'll do this.

[...]



* Re: [RFC PATCH 03/10] xen: pci: introduce ats_list_lock
  2023-02-17  1:20       ` Volodymyr Babchuk
@ 2023-02-17  7:39         ` Jan Beulich
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-02-17  7:39 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stefano Stabellini, xen-devel, Oleksandr Andrushchenko,
	Andrew Cooper, Paul Durrant, Roger Pau Monné,
	Kevin Tian

On 17.02.2023 02:20, Volodymyr Babchuk wrote:
> Jan Beulich <jbeulich@suse.com> writes:
>> On 27.01.2023 00:56, Stefano Stabellini wrote:
>>> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>>>> --- a/xen/drivers/passthrough/pci.c
>>>> +++ b/xen/drivers/passthrough/pci.c
>>>> @@ -1641,6 +1641,7 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>>>>  {
>>>>      pcidevs_lock();
>>>>  
>>>> +    /* iommu->ats_list_lock is taken by the caller of this function */
>>>
>>> This is a locking inversion. In all other places we take pcidevs_lock
>>> first, then ats_list_lock lock. For instance look at
>>> xen/drivers/passthrough/pci.c:deassign_device that is called with
>>> pcidevs_locked and then calls iommu_call(... reassign_device ...) which
>>> ends up taking ats_list_lock.
>>>
>>> This is the only exception. I think we need to move the
>>> spin_lock(ats_list_lock) from qinval.c to here.
>>
>> First question here is what the lock is meant to protect: Just the list,
>> or also the ATS state change (i.e. serializing enable and disable against
>> one another). In the latter case the lock also wants naming differently.
> 
> My intention was to protect list only. But I believe you are right and
> we should protect the whole state. I'll rename the lock to ats_lock.
> 
>> Second question is who is to acquire the lock. Why isn't this done _in_
>> {en,dis}able_ats_device() themselves? That would then allow to further
>> reduce the locked range, because at least the pci_find_ext_capability()
>> call and the final logging can occur without the lock held.
> 
> You are right, I'll extended {en,dis}able_ats_device() API to pass
> pointer to the lock.

Hmm, that'll make for an odd interface. I was wondering in the past
already why we don't have a backref from the PCI dev to its controlling
IOMMU (might be ambiguous and hence better left unset for bridges,
especially host ones, but I think ATS is being fiddled with only for
leaf devices; would need double checking of course).
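
To make the idea concrete, here is a standalone sketch under the
assumption that such a backref existed. The pdev->iommu field, the
has_ats_cap flag (standing in for the pci_find_ext_capability() lookup)
and iommu_init() are all hypothetical, and the spinlock and list are toy
stand-ins for Xen's.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for Xen's spinlock_t and list_head (illustration only). */
typedef struct { atomic_flag held; } spinlock_t;
static void spin_lock(spinlock_t *l) { while ( atomic_flag_test_and_set(&l->held) ) ; }
static void spin_unlock(spinlock_t *l) { atomic_flag_clear(&l->held); }

struct list_head { struct list_head *prev, *next; };
static void list_init(struct list_head *h) { h->prev = h->next = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

struct iommu {
    spinlock_t ats_lock;            /* protects ats_devices and ATS state */
    struct list_head ats_devices;
};

struct pci_dev {
    struct iommu *iommu;            /* hypothetical backref to the controlling IOMMU */
    struct list_head ats_list;
    bool has_ats_cap;               /* stand-in for the capability lookup result */
    bool ats_enabled;
};

static void iommu_init(struct iommu *iommu)
{
    atomic_flag_clear(&iommu->ats_lock.held);
    list_init(&iommu->ats_devices);
}

static int enable_ats_device(struct pci_dev *pdev)
{
    struct iommu *iommu = pdev->iommu;

    /* Capability lookup needs no lock. */
    if ( !pdev->has_ats_cap )
        return -ENODEV;

    /* Only the state change and the list update are serialized. */
    spin_lock(&iommu->ats_lock);
    if ( !pdev->ats_enabled )
    {
        pdev->ats_enabled = true;
        list_add(&pdev->ats_list, &iommu->ats_devices);
    }
    spin_unlock(&iommu->ats_lock);

    /* Any final logging would go here, again without the lock held. */
    return 0;
}
```

With the backref the function finds the lock on its own, so callers
would not need to pass a lock pointer in.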

Jan



* Re: [RFC PATCH 04/10] xen: add reference counter support
  2023-02-17  1:56     ` Volodymyr Babchuk
@ 2023-02-17  7:53       ` Jan Beulich
  2023-02-19 22:34         ` Volodymyr Babchuk
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2023-02-17  7:53 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, xen-devel

On 17.02.2023 02:56, Volodymyr Babchuk wrote:
> Jan Beulich <jbeulich@suse.com> writes:
>> On 31.08.2022 16:10, Volodymyr Babchuk wrote:
>>> --- /dev/null
>>> +++ b/xen/include/xen/refcnt.h
>>> @@ -0,0 +1,28 @@
>>> +#ifndef __XEN_REFCNT_H__
>>> +#define __XEN_REFCNT_H__
>>> +
>>> +#include <asm/atomic.h>
>>> +
>>> +typedef atomic_t refcnt_t;
>>
>> Like Linux has it, I think this would better be a separate struct. At
>> least in debug builds, i.e. it could certainly use typesafe.h if that
>> ended up to be a good fit (which I'm not sure it would, so this is
>> merely a thought).
> 
> Sadly, TYPE_SAFE does not support pointers. e.g I can't get pointer to
> an encapsulated value which is also passed as a pointer. I can expand
> TYPE_SAFE with $FOO_x_ptr():
> 
>     static inline _type *_name##_x_ptr(_name##_t *n) { return &n->_name; }
> 
> or make custom encapsulation in refcnt.h. Which one you prefer?

First of all, as said - typesafe.h may not be a good fit. And then the
helper you suggest looks to be UB if the passed in pointer was to an
array rather than a singular object, so having something like that in
a very generic piece of infrastructure is inappropriate anyway.

>>> +static inline void refcnt_init(refcnt_t *refcnt)
>>> +{
>>> +	atomic_set(refcnt, 1);
>>> +}
>>> +
>>> +static inline void refcnt_get(refcnt_t *refcnt)
>>> +{
>>> +#ifndef NDEBUG
>>> +	ASSERT(atomic_add_unless(refcnt, 1, 0) > 0);
>>> +#else
>>> +	atomic_add_unless(refcnt, 1, 0);
>>> +#endif
>>> +}
> 
>> I think this wants doing without any #ifdef-ary, e.g.
>>
>> static inline void refcnt_get(refcnt_t *refcnt)
>> {
>>     int ret = atomic_add_unless(refcnt, 1, 0);
>>
>>     ASSERT(ret > 0);
>> }
>>
> 
> Thanks, did as you suggested. I was afraid that compiler would complain
> about unused ret in non-debug builds.
> 
>> I wonder though whether certain callers may not want to instead know
>> whether a refcount was successfully obtained, i.e. whether instead of
>> asserting here you don't want to return a boolean success indicator,
>> which callers then would deal with (either by asserting or by suitably
>> handling the case). See get_page() and page_get_owner_and_reference()
>> for similar behavior we have (and use) already.
> 
> For now there are no such callers, so I don't want to implement unused
> functionality. But, if you prefer this way, I'll do this.

Well, I can see your point about unused functionality. That needs to be
weighed against this being a pretty basic piece of infrastructure, which
may want using elsewhere as well. Such re-use would then better not
trigger touching all the code which already uses it (in principle the
domain ref counting might be able to re-use it, for example, but there's
that DOMAIN_DESTROYED special case which may require it to continue to
have a custom implementation).

What you may want to do is check Linux's equivalent. Depending on how
close ours is going to be, using the same naming may also want considering.

Jan



* Re: [RFC PATCH 04/10] xen: add reference counter support
  2023-02-17  7:53       ` Jan Beulich
@ 2023-02-19 22:34         ` Volodymyr Babchuk
  0 siblings, 0 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2023-02-19 22:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, xen-devel


Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 17.02.2023 02:56, Volodymyr Babchuk wrote:
>> Jan Beulich <jbeulich@suse.com> writes:
>>> On 31.08.2022 16:10, Volodymyr Babchuk wrote:
>>>> --- /dev/null
>>>> +++ b/xen/include/xen/refcnt.h
>>>> @@ -0,0 +1,28 @@
>>>> +#ifndef __XEN_REFCNT_H__
>>>> +#define __XEN_REFCNT_H__
>>>> +
>>>> +#include <asm/atomic.h>
>>>> +
>>>> +typedef atomic_t refcnt_t;
>>>
>>> Like Linux has it, I think this would better be a separate struct. At
>>> least in debug builds, i.e. it could certainly use typesafe.h if that
>>> ended up to be a good fit (which I'm not sure it would, so this is
>>> merely a thought).
>> 
>> Sadly, TYPE_SAFE does not support pointers. e.g I can't get pointer to
>> an encapsulated value which is also passed as a pointer. I can expand
>> TYPE_SAFE with $FOO_x_ptr():
>> 
>>     static inline _type *_name##_x_ptr(_name##_t *n) { return &n->_name; }
>> 
>> or make custom encapsulation in refcnt.h. Which one you prefer?
>
> First of all, as said - typesafe.h may not be a good fit. And then the
> helper you suggest looks to be UB if the passed in pointer was to an
> array rather than a singular object, so having something like that in
> a very generic piece of infrastructure is inappropriate anyway.

Okay, no problem. I'll use a separate struct. Also, I played a bit with
compiler outputs. Looks like there is no additional overhead in reading
a single value from a struct. So I don't think that we need an additional
non-debug implementation for this type.
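
A minimal standalone sketch of such a wrapper, assuming C11 atomics in
place of Xen's atomic_t (refcnt_read() is a hypothetical helper added
only for the example):

```c
#include <assert.h>
#include <stdatomic.h>

/* A distinct type: passing a bare atomic_int * where refcnt_t * is
 * expected is now diagnosed by the compiler. */
typedef struct { atomic_int refcnt; } refcnt_t;

/* The wrapper is expected to add no storage on top of the counter. */
_Static_assert(sizeof(refcnt_t) == sizeof(atomic_int),
               "refcnt_t should be the same size as its counter");

static inline void refcnt_init(refcnt_t *r)
{
    atomic_init(&r->refcnt, 1);
}

static inline int refcnt_read(refcnt_t *r)
{
    return atomic_load(&r->refcnt);
}
```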

>>>> +static inline void refcnt_init(refcnt_t *refcnt)
>>>> +{
>>>> +	atomic_set(refcnt, 1);
>>>> +}
>>>> +
>>>> +static inline void refcnt_get(refcnt_t *refcnt)
>>>> +{
>>>> +#ifndef NDEBUG
>>>> +	ASSERT(atomic_add_unless(refcnt, 1, 0) > 0);
>>>> +#else
>>>> +	atomic_add_unless(refcnt, 1, 0);
>>>> +#endif
>>>> +}
>> 
>>> I think this wants doing without any #ifdef-ary, e.g.
>>>
>>> static inline void refcnt_get(refcnt_t *refcnt)
>>> {
>>>     int ret = atomic_add_unless(refcnt, 1, 0);
>>>
>>>     ASSERT(ret > 0);
>>> }
>>>
>> 
>> Thanks, did as you suggested. I was afraid that compiler would complain
>> about unused ret in non-debug builds.
>> 
>>> I wonder though whether certain callers may not want to instead know
>>> whether a refcount was successfully obtained, i.e. whether instead of
>>> asserting here you don't want to return a boolean success indicator,
>>> which callers then would deal with (either by asserting or by suitably
>>> handling the case). See get_page() and page_get_owner_and_reference()
>>> for similar behavior we have (and use) already.
>> 
>> For now there are no such callers, so I don't want to implement unused
>> functionality. But, if you prefer this way, I'll do this.
>
> Well, I can see your point about unused functionality. That needs to be
> weighed against this being a pretty basic piece of infrastructure, which
> may want using elsewhere as well. Such re-use would then better not
> trigger touching all the code which already uses it (in principle the
> domain ref counting might be able to re-use it, for example, but there's
> that DOMAIN_DESTROYED special case which may require it to continue to
> have a custom implementation).
>
> What you may want to do is check Linux'es equivalent. Depending on how
> close ours is going to be, using the same naming may also want considering.

I wrote my implementation from scratch to avoid any potential licensing
issues. But, looking at the Linux implementation:

There are two abstractions: refcount_t and struct kref. refcount_t is
like atomic_t but with saturation to avoid wrapping. Struct kref is
built on top of refcount_t. It is tailored to handle reference counted
objects by having the ability to call a release() function when the
reference counter reaches zero. Both kref_get() and refcount_inc()
return void.

My implementation does not separate these two types, a ref counter with
saturation and a kernel object reference counter. It does only the
latter. It is a good idea to add saturation and I will do
this in the next patch version.
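
A rough standalone sketch of what saturation could look like, using C11
atomics as a stand-in for Xen's atomic_t. REFCNT_SATURATED and the
bool-returning prototypes are assumptions for the example, and this is
not Linux's exact scheme:

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REFCNT_SATURATED INT_MAX   /* hypothetical sticky value */

typedef struct { atomic_int count; } refcnt_t;

/* Returns false if the object is already on its way to destruction. */
static inline bool refcnt_get(refcnt_t *r)
{
    int old = atomic_load(&r->count);

    do {
        if ( old == 0 )
            return false;
        if ( old == REFCNT_SATURATED )
            return true;           /* pinned: never wraps, object is leaked */
    } while ( !atomic_compare_exchange_weak(&r->count, &old, old + 1) );

    return true;
}

/* Returns true when the last reference was dropped. */
static inline bool refcnt_put(refcnt_t *r)
{
    int old = atomic_load(&r->count);

    do {
        assert(old > 0);
        if ( old == REFCNT_SATURATED )
            return false;          /* a saturated counter stays pinned */
    } while ( !atomic_compare_exchange_weak(&r->count, &old, old - 1) );

    return old == 1;
}
```

The trade-off is that an object whose counter saturates is leaked rather
than freed, which is preferable to a wrap-around leading to a
use-after-free.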

As for details on function prototypes and type names - I'll do as you
say. If you want refcnt_put() to return bool - no problem. If you want
this functionality renamed or aligned with the Linux one - just tell
me. From my point of view, right now we have a minimal implementation that
covers all available use cases and can be easily extended in the future
to cover new use cases. As for potential users, I can see PCI, cpupool
and maybe a couple of ARM IOMMU drivers. All the others:

- get_domain() uses that DOMAIN_DESTROYED special case you mentioned

- {get,put}_page* do not use atomic_t at all and rely on direct cmpxchg()
  calls for some reason.

- OP-TEE code is happy with atomics due to complex logic

- {get,put}_cpu_var and put_gfn do not use ref counting at all


* Re: [RFC PATCH 09/10] [RFC only] xen: iommu: remove last pcidevs_lock() calls in iommu
  2023-01-28  1:36   ` Stefano Stabellini
@ 2023-02-20  0:41     ` Volodymyr Babchuk
  0 siblings, 0 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2023-02-20  0:41 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Andrushchenko, Kevin Tian, Jan Beulich,
	Paul Durrant, Roger Pau Monné


Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>> There are number of cases where pcidevs_lock() is used to protect
>> something that is not related to PCI devices per se.
>> 
>> Probably pcidev_lock in these places should be replaced with some
>> other lock.
>> 
>> This patch is not intended to be merged and is present only to discuss
>> this use of pcidevs_lock()
>> 
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>
> I wonder if we are being too ambitious: is it necessary to get rid of
> pcidevs_lock completely, without replacing all its occurrences with
> something else?
>

> Because it would be a lot easier to only replace pcidevs_lock with
> pdev->lock, replacing the global lock with a per-device lock. That alone
> would be a major improvement and would be far easier to verify its
> correctness.

This is the aim of this patch series. We can't just replace pcidevs_lock
with pdev->lock, because pcidevs_lock() locks not only PCI devices, but
the PCI subsystem as a whole. As I wrote in the cover letter, I identified
a list of things that pcidevs_lock protects and tried to create separate
locks for them.

>
> While this patch and the previous patch together remove all occurrences
> of pcidevs_lock without adding pdev->lock, which is difficult to prove
> correct.

The previous patch removes occurrences of pcidevs_lock() in places which are
already protected with new locks. Sometimes, this is d->pdevs_lock,
sometimes it is sufficient to call pcidev_get() to increase refcounter,
in other cases we need to call pdev_lock(pdev), ...
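
To illustrate the first two cases, here is a standalone sketch with a
toy spinlock, list and C11 atomics. find_pdev_get() is a hypothetical
helper, and pcidev_put() here is a simplified stand-in that only reports
whether the last reference was dropped instead of freeing anything. The
list lock is held only while looking the device up, and the reference
keeps the object alive afterwards:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for Xen's spinlock_t and list_head (illustration only). */
typedef struct { atomic_flag held; } spinlock_t;
static void spin_lock(spinlock_t *l) { while ( atomic_flag_test_and_set(&l->held) ) ; }
static void spin_unlock(spinlock_t *l) { atomic_flag_clear(&l->held); }

struct list_head { struct list_head *prev, *next; };
static void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

struct domain { struct list_head pdev_list; spinlock_t pdevs_lock; };

struct pci_dev {
    struct list_head domain_list;
    unsigned int devfn;
    atomic_int refcnt;
};

static void domain_init(struct domain *d)
{
    list_init(&d->pdev_list);
    atomic_flag_clear(&d->pdevs_lock.held);
}

/* Hypothetical lookup: returns the device with an extra reference, or NULL. */
static struct pci_dev *find_pdev_get(struct domain *d, unsigned int devfn)
{
    struct pci_dev *found = NULL;
    struct list_head *pos;

    spin_lock(&d->pdevs_lock);
    for ( pos = d->pdev_list.next; pos != &d->pdev_list; pos = pos->next )
    {
        struct pci_dev *pdev = (struct pci_dev *)
            ((char *)pos - offsetof(struct pci_dev, domain_list));

        if ( pdev->devfn == devfn )
        {
            /* Safe: the list lock guarantees pdev is still alive here. */
            atomic_fetch_add(&pdev->refcnt, 1);
            found = pdev;
            break;
        }
    }
    spin_unlock(&d->pdevs_lock);

    return found;
}

/* Simplified stand-in: true means the caller dropped the last reference. */
static bool pcidev_put(struct pci_dev *pdev)
{
    int old = atomic_fetch_sub(&pdev->refcnt, 1);

    assert(old > 0);
    return old == 1;
}
```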

The goal of this patch is to discuss the pieces which are left. As you
can see, there is no pointer to a pdev to lock, this code does not traverse
d->pdev_list, etc. So it is not immediately obvious what exactly those
ASSERTs should protect. Maybe they were added by mistake and are not
actually needed. Maybe I am missing some nuance of how the x86 IOMMU
works. I really need the maintainers' advice here.

>
>
>> ---
>>  xen/drivers/passthrough/vtd/intremap.c | 2 --
>>  xen/drivers/passthrough/vtd/iommu.c    | 5 -----
>>  xen/drivers/passthrough/x86/iommu.c    | 5 -----
>>  3 files changed, 12 deletions(-)
>> 
>> diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
>> index 1512e4866b..44e3b72f91 100644
>> --- a/xen/drivers/passthrough/vtd/intremap.c
>> +++ b/xen/drivers/passthrough/vtd/intremap.c
>> @@ -893,8 +893,6 @@ int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
>>  
>>      spin_unlock_irq(&desc->lock);
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      return msi_msg_write_remap_rte(msi_desc, &msi_desc->msg);
>>  
>>   unlock_out:
>> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
>> index 87868188b7..9d258d154d 100644
>> --- a/xen/drivers/passthrough/vtd/iommu.c
>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>> @@ -127,8 +127,6 @@ static int context_set_domain_id(struct context_entry *context,
>>  {
>>      unsigned int i;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( domid_mapping(iommu) )
>>      {
>>          unsigned int nr_dom = cap_ndoms(iommu->cap);
>> @@ -1882,7 +1880,6 @@ int domain_context_unmap_one(
>>      int iommu_domid, rc, ret;
>>      bool_t flush_dev_iotlb;
>>  
>> -    ASSERT(pcidevs_locked());
>>      spin_lock(&iommu->lock);
>>  
>>      maddr = bus_to_context_maddr(iommu, bus);
>> @@ -2601,7 +2598,6 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
>>      u16 bdf;
>>      int ret, i;
>>  
>> -    pcidevs_lock();
>>      for_each_rmrr_device ( rmrr, bdf, i )
>>      {
>>          /*
>> @@ -2616,7 +2612,6 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
>>              dprintk(XENLOG_ERR VTDPREFIX,
>>                       "IOMMU: mapping reserved region failed\n");
>>      }
>> -    pcidevs_unlock();
>>  }
>>  
>>  static struct iommu_state {
>> diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
>> index f671b0f2bb..4e94ad15df 100644
>> --- a/xen/drivers/passthrough/x86/iommu.c
>> +++ b/xen/drivers/passthrough/x86/iommu.c
>> @@ -207,7 +207,6 @@ int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
>>      struct identity_map *map;
>>      struct domain_iommu *hd = dom_iommu(d);
>>  
>> -    ASSERT(pcidevs_locked());
>>      ASSERT(base < end);
>>  
>>      /*
>> @@ -479,8 +478,6 @@ domid_t iommu_alloc_domid(unsigned long *map)
>>      static unsigned int start;
>>      unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( idx >= UINT16_MAX - DOMID_MASK )
>>          idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
>>      if ( idx >= UINT16_MAX - DOMID_MASK )
>> @@ -495,8 +492,6 @@ domid_t iommu_alloc_domid(unsigned long *map)
>>  
>>  void iommu_free_domid(domid_t domid, unsigned long *map)
>>  {
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( domid == DOMID_INVALID )
>>          return;
>>  
>> -- 
>> 2.36.1
>> 



* Re: [RFC PATCH 05/10] xen: pci: introduce reference counting for pdev
  2023-01-27  0:43   ` Stefano Stabellini
@ 2023-02-20 22:00     ` Volodymyr Babchuk
  0 siblings, 0 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2023-02-20 22:00 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Paul Durrant, Kevin Tian


Hi Stefano,

Thank you for the review

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>> Prior to this change, lifetime of pci_dev objects was protected by global
>> pcidevs_lock(). We are going to get rid of this lock, so we need some
>> other mechanism to ensure that those objects will not disappear under
>> feet of code that access them. Reference counting is a good choice as
>> it provides easy to comprehend way to control object lifetime with
>> better granularity than global super lock.
>> 
>> This patch adds two new helper functions: pcidev_get() and
>> pcidev_put(). pcidev_get() will increase reference counter, while
>> pcidev_put() will decrease it, destroying object when counter reaches
>> zero.
>> 
>> pcidev_get() should be used only when you already have a valid pointer
>> to the object or you are holding lock that protects one of the
>> lists (domain, pseg or ats) that store pci_dev structs.
>> 
>> pcidev_get() is rarely used directly, because there already are
>> functions that will provide valid pointer to pci_dev struct:
>> pci_get_pdev() and pci_get_real_pdev(). They will lock appropriate
>> list, find needed object and increase its reference counter before
>> returning to the caller.
>> 
>> Naturally, pci_put() should be called after finishing working with a
>> received object. This is the reason why this patch have so many
>> pcidev_put()s and so little pcidev_get()s: existing calls to
>> pci_get_*() functions now will increase reference counter
>> automatically, we just need to decrease it back when we finished.
>> 
>> This patch removes "const" qualifier from some pdev pointers because
>> pcidev_put() technically alters the contents of pci_dev structure.
>> 
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>
> tabs everywhere in this patch
>

Oh, yes, sorry. I asked some time ago and want to ask again: instead of
adding Emacs magic to each file, could we put a single .dir-locals.el file
with basically the same config in the xen/ directory? This would accomplish
two things:

 - there would be no need to add Emacs magic strings to each new file

 - the same config would apply to files that do not have magic
   strings. Files with different coding style rules can be filtered by
   code in .dir-locals.el and/or by strategically placed files in
   sub-directories.

I would be happy to hear the maintainers' opinion on this.

>> ---
>> 
>> - Jan, can I add your Suggested-by tag?
>> ---
>>  xen/arch/x86/hvm/vmsi.c                  |   2 +-
>>  xen/arch/x86/irq.c                       |   4 +
>>  xen/arch/x86/msi.c                       |  41 ++++++-
>>  xen/arch/x86/pci.c                       |   4 +-
>>  xen/arch/x86/physdev.c                   |  17 ++-
>>  xen/common/sysctl.c                      |   5 +-
>>  xen/drivers/passthrough/amd/iommu_init.c |  12 ++-
>>  xen/drivers/passthrough/amd/iommu_map.c  |   6 +-
>>  xen/drivers/passthrough/pci.c            | 131 +++++++++++++++--------
>>  xen/drivers/passthrough/vtd/quirks.c     |   2 +
>>  xen/drivers/video/vga.c                  |  10 +-
>>  xen/drivers/vpci/vpci.c                  |   6 +-
>>  xen/include/xen/pci.h                    |  18 ++++
>>  13 files changed, 201 insertions(+), 57 deletions(-)
>> 
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 75f92885dc..7fb1075673 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -912,7 +912,7 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>  
>>              spin_unlock(&msix->pdev->vpci->lock);
>>              process_pending_softirqs();
>> -            /* NB: we assume that pdev cannot go away for an alive domain. */
>> +
>>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>>                  return -EBUSY;
>>              if ( pdev->vpci->msix != msix )
>> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
>> index cd0c8a30a8..d8672a03e1 100644
>> --- a/xen/arch/x86/irq.c
>> +++ b/xen/arch/x86/irq.c
>> @@ -2174,6 +2174,7 @@ int map_domain_pirq(
>>                  msi->entry_nr = ret;
>>                  ret = -ENFILE;
>>              }
>> +	    pcidev_put(pdev);
>
> I think it would be better to move pcidev_put just after done:

I'd love to do this, but pdev is declared inside the "if" block, while
"done:" is outside of that scope. I can move pdev into the outer scope if
you believe that would be better.
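
To illustrate the second option, here is a minimal, self-contained sketch
of what moving pdev into the outer scope could look like. It is not the
actual map_domain_pirq() code: struct toy_pdev, toy_get_pdev(),
toy_pdev_put() and toy_map_pirq() are hypothetical stand-ins for
struct pci_dev, pci_get_pdev(), pcidev_put() and the real function, and
the error values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct pci_dev: only the reference counter matters here. */
struct toy_pdev {
    int refcnt;
};

/* Hypothetical analogue of pci_get_pdev(): returns the device with a reference held. */
static struct toy_pdev *toy_get_pdev(struct toy_pdev *dev)
{
    if ( dev )
        dev->refcnt++;
    return dev;
}

/* Hypothetical analogue of pcidev_put(): drops one reference. */
static void toy_pdev_put(struct toy_pdev *pdev)
{
    assert(pdev->refcnt > 0);
    pdev->refcnt--;
}

/*
 * pdev is declared in the function scope and starts out NULL, so a single
 * put after "done:" covers every exit path, whether or not the MSI branch
 * was taken and whether or not the lookup succeeded.
 */
static int toy_map_pirq(struct toy_pdev *dev, int is_msi, int fail_late)
{
    struct toy_pdev *pdev = NULL;
    int ret = 0;

    if ( is_msi )
    {
        pdev = toy_get_pdev(dev);
        if ( !pdev )
        {
            ret = -1;   /* lookup failed, nothing to put */
            goto done;
        }
        if ( fail_late )
        {
            ret = -2;   /* failure after the reference was taken */
            goto done;
        }
    }

 done:
    if ( pdev )
        toy_pdev_put(pdev);
    return ret;
}
```

With this layout the reference is released in exactly one place, at the
cost of a NULL check after the label.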

[...]

All other comments were taken into account.


* Re: [RFC PATCH 07/10] xen: pci: add per-device locking
  2023-01-28  0:56   ` Stefano Stabellini
@ 2023-02-20 22:29     ` Volodymyr Babchuk
  0 siblings, 0 replies; 43+ messages in thread
From: Volodymyr Babchuk @ 2023-02-20 22:29 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Paul Durrant


Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>> Spinlock in struct pci_device will be used to protect access to device
>> itself. Right now it is used mostly by MSI code.
>> 
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>
> There are 2 instances of:
>
>     BUG_ON(list_empty(&dev->msi_list));
>
> in xen/arch/x86/msi.c:__pci_disable_msi and
> xen/arch/x86/msi.c:__pci_disable_msix which are not protected by
> pcidev_lock. However list_empty needs to be protected. (pci_disable_msi
> can also be called from xen/arch/x86/irq.c where it is not surrounded by
> pcidev_lock.)

I checked all call paths. pci_disable_msi() is called from three places in
xen/arch/x86/irq.c. As far as I can see, all three are "protected" with
ASSERT(pcidevs_locked()), or am I missing something?

>
> Given that they are BUG_ON, I wonder if we could remove them instead of
> adding locks there. It would make things simpler.

Well, I will be happy to remove them, if there are no objections.

>
>
>> ---
>>  xen/arch/x86/hvm/vmsi.c       |  6 +++++-
>>  xen/arch/x86/msi.c            | 16 ++++++++++++++++
>>  xen/drivers/passthrough/msi.c |  8 +++++++-
>>  xen/drivers/passthrough/pci.c |  2 ++
>>  xen/include/xen/pci.h         | 12 ++++++++++++
>>  5 files changed, 42 insertions(+), 2 deletions(-)
>> 
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 7fb1075673..c9e5f279c5 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -203,10 +203,14 @@ static struct msi_desc *msixtbl_addr_to_desc(
>>  
>>      nr_entry = (addr - entry->gtable) / PCI_MSIX_ENTRY_SIZE;
>>  
>> +    pcidev_lock(entry->pdev);
>>      list_for_each_entry( desc, &entry->pdev->msi_list, list )
>>          if ( desc->msi_attrib.type == PCI_CAP_ID_MSIX &&
>> -             desc->msi_attrib.entry_nr == nr_entry )
>> +             desc->msi_attrib.entry_nr == nr_entry ) {
>> +	    pcidev_unlock(entry->pdev);
>
> code style
>
>
>>              return desc;
>> +	}
>> +    pcidev_unlock(entry->pdev);
>>  
>>      return NULL;
>>  }
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index bccaccb98b..6b62c4f452 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -389,6 +389,7 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
>>      default:
>>          return 0;
>>      }
>> +
>
> spurious change
>
>
>>      entry->msi_attrib.host_masked = host;
>>      entry->msi_attrib.guest_masked = guest;
>>  
>> @@ -585,12 +586,17 @@ static struct msi_desc *find_msi_entry(struct pci_dev *dev,
>>  {
>>      struct msi_desc *entry;
>>  
>> +    pcidev_lock(dev);
>>      list_for_each_entry( entry, &dev->msi_list, list )
>>      {
>>          if ( entry->msi_attrib.type == cap_id &&
>>               (irq == -1 || entry->irq == irq) )
>> +	{
>> +	    pcidev_unlock(dev);
>>              return entry;
>> +	}
>>      }
>> +    pcidev_unlock(dev);
>>  
>>      return NULL;
>>  }
>> @@ -661,7 +667,9 @@ static int msi_capability_init(struct pci_dev *dev,
>>          maskbits |= ~(uint32_t)0 >> (32 - dev->msi_maxvec);
>>          pci_conf_write32(dev->sbdf, mpos, maskbits);
>>      }
>> +    pcidev_lock(dev);
>>      list_add_tail(&entry->list, &dev->msi_list);
>> +    pcidev_unlock(dev);
>>  
>>      *desc = entry;
>>      /* Restore the original MSI enabled bits  */
>> @@ -946,7 +954,9 @@ static int msix_capability_init(struct pci_dev *dev,
>>  
>>  	pcidev_get(dev);
>>  
>> +	pcidev_lock(dev);
>>          list_add_tail(&entry->list, &dev->msi_list);
>> +	pcidev_unlock(dev);
>>          *desc = entry;
>>      }
>>  
>> @@ -1231,11 +1241,13 @@ static void msi_free_irqs(struct pci_dev* dev)
>>  {
>>      struct msi_desc *entry, *tmp;
>>  
>> +    pcidev_lock(dev);
>>      list_for_each_entry_safe( entry, tmp, &dev->msi_list, list )
>>      {
>>          pci_disable_msi(entry);
>>          msi_free_irq(entry);
>>      }
>> +    pcidev_unlock(dev);
>>  }
>>  
>>  void pci_cleanup_msi(struct pci_dev *pdev)
>> @@ -1354,6 +1366,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>>      if ( ret )
>>          return ret;
>>  
>> +    pcidev_lock(pdev);
>>      list_for_each_entry_safe( entry, tmp, &pdev->msi_list, list )
>>      {
>>          unsigned int i = 0, nr = 1;
>> @@ -1371,6 +1384,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>>              dprintk(XENLOG_ERR, "Restore MSI for %pp entry %u not set?\n",
>>                      &pdev->sbdf, i);
>>              spin_unlock_irqrestore(&desc->lock, flags);
>> +	    pcidev_unlock(pdev);
>>              if ( type == PCI_CAP_ID_MSIX )
>>                  pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
>>                                   control & ~PCI_MSIX_FLAGS_ENABLE);
>> @@ -1393,6 +1407,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>>              if ( unlikely(!memory_decoded(pdev)) )
>>              {
>>                  spin_unlock_irqrestore(&desc->lock, flags);
>> +		pcidev_unlock(pdev);
>>                  pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
>>                                   control & ~PCI_MSIX_FLAGS_ENABLE);
>>                  return -ENXIO;
>> @@ -1438,6 +1453,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>>          pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
>>                           control | PCI_MSIX_FLAGS_ENABLE);
>>  
>> +    pcidev_unlock(pdev);
>>      return 0;
>>  }
>>  
>> diff --git a/xen/drivers/passthrough/msi.c b/xen/drivers/passthrough/msi.c
>> index ce1a450f6f..98f4d2721a 100644
>> --- a/xen/drivers/passthrough/msi.c
>> +++ b/xen/drivers/passthrough/msi.c
>> @@ -22,6 +22,7 @@ int pdev_msi_init(struct pci_dev *pdev)
>>  {
>>      unsigned int pos;
>>  
>> +    pcidev_lock(pdev);
>>      INIT_LIST_HEAD(&pdev->msi_list);
>>  
>>      pos = pci_find_cap_offset(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
>> @@ -41,7 +42,10 @@ int pdev_msi_init(struct pci_dev *pdev)
>>          uint16_t ctrl;
>>  
>>          if ( !msix )
>> -            return -ENOMEM;
>> +        {
>> +             pcidev_unlock(pdev);
>> +             return -ENOMEM;
>> +        }
>>  
>>          spin_lock_init(&msix->table_lock);
>>  
>> @@ -51,6 +55,8 @@ int pdev_msi_init(struct pci_dev *pdev)
>>          pdev->msix = msix;
>>      }
>>  
>> +    pcidev_unlock(pdev);
>> +
>>      return 0;
>>  }
>>  
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index c8da80b981..c83397211b 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -1383,7 +1383,9 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
>>              printk("%pd", pdev->domain);
>>          printk(" - node %-3d refcnt %d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1,
>>                 atomic_read(&pdev->refcnt));
>> +        pcidev_lock(pdev);
>>          pdev_dump_msi(pdev);
>> +        pcidev_unlock(pdev);
>>          printk("\n");
>>      }
>>      spin_unlock(&pseg->alldevs_lock);
>> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
>> index e71a180ef3..d0a7339d84 100644
>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -106,6 +106,8 @@ struct pci_dev {
>>      uint8_t msi_maxvec;
>>      uint8_t phantom_stride;
>>  
>> +    /* Device lock */
>> +    spinlock_t lock;
>>      nodeid_t node; /* NUMA node */
>>  
>>      /* Device to be quarantined, don't automatically re-assign to dom0 */
>> @@ -235,6 +237,16 @@ int msixtbl_pt_register(struct domain *, struct pirq *, uint64_t gtable);
>>  void msixtbl_pt_unregister(struct domain *, struct pirq *);
>>  void msixtbl_pt_cleanup(struct domain *d);
>>  
>> +static inline void pcidev_lock(struct pci_dev *pdev)
>> +{
>> +    spin_lock(&pdev->lock);
>> +}
>> +
>> +static inline void pcidev_unlock(struct pci_dev *pdev)
>> +{
>> +    spin_unlock(&pdev->lock);
>> +}
>> +
>>  #ifdef CONFIG_HVM
>>  int arch_pci_clean_pirqs(struct domain *d);
>>  #else
>> -- 
>> 2.36.1
>> 



* Re: [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls
  2023-01-28  1:32   ` Stefano Stabellini
@ 2023-02-20 23:13     ` Volodymyr Babchuk
  2023-02-21  9:50       ` Jan Beulich
  2023-02-28 16:51     ` Jan Beulich
  1 sibling, 1 reply; 43+ messages in thread
From: Volodymyr Babchuk @ 2023-02-20 23:13 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Paul Durrant, Kevin Tian


Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>> As pci devices are refcounted now and all list that store them are
>> protected by separate locks, we can safely drop global pcidevs_lock.
>> 
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>
> Up until this patch this patch series introduces:
> - d->pdevs_lock to protect d->pdev_list
> - pci_seg->alldevs_lock to protect pci_seg->alldevs_list
> - iommu->ats_list_lock to protect iommu->ats_devices
> - pdev refcounting to detect a pdev in-use and when to free it
> - pdev->lock to protect pdev->msi_list
>
> They cover a lot of ground.  Are they collectively covering everything
> pcidevs_lock() was protecting?

Well, that is the question. These patches are at the RFC stage because I
can't fully answer it. I tried my best to introduce proper locking, but
apparently missed a couple of places, such as:

> deassign_device is not protected by pcidevs_lock anymore.
> deassign_device accesses a number of pdev fields, including quarantine,
> phantom_stride and fault.count.
>
> deassign_device could run at the same time as assign_device who sets
> quarantine and other fields.
>

I hope this is all, but the problem is that the PCI subsystem is old, large
and complex. For example, as I wrote earlier, there are places that are
protected with pcidevs_lock() but do nothing with PCI. I just don't
know what to do with such places. I hope that the x86 maintainers
will review my changes and give feedback on the spots I missed.


> It looks like assign_device, deassign_device, and other functions
> accessing/modifying pdev fields should be protected by pdev->lock.

You are right, I'll go through this whole patch again to identify places
where additional locks are required. I already have some candidates
besides those you mentioned above.
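
As a rough illustration of the idea, here is a minimal, self-contained
sketch of per-device locking around a field such as quarantine. It is not
Xen code: struct toy_pdev and the toy_*() helpers are hypothetical, a C11
atomic_flag stands in for spinlock_t, and the real assign_device() and
deassign_device() touch more fields than shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct pci_dev with a per-device lock. */
struct toy_pdev {
    atomic_flag lock;   /* stand-in for "spinlock_t lock" */
    bool quarantine;    /* field written by assign, read and cleared by deassign */
};

static void toy_pdev_init(struct toy_pdev *pdev)
{
    atomic_flag_clear(&pdev->lock);
    pdev->quarantine = false;
}

/* Analogue of pcidev_lock(): spin until the flag is acquired. */
static void toy_pdev_lock(struct toy_pdev *pdev)
{
    while ( atomic_flag_test_and_set_explicit(&pdev->lock,
                                              memory_order_acquire) )
        ; /* busy-wait */
}

/* Analogue of pcidev_unlock(). */
static void toy_pdev_unlock(struct toy_pdev *pdev)
{
    atomic_flag_clear_explicit(&pdev->lock, memory_order_release);
}

/* Writer side: updates the field only while holding the device lock. */
static void toy_assign(struct toy_pdev *pdev, bool quarantine)
{
    toy_pdev_lock(pdev);
    pdev->quarantine = quarantine;
    toy_pdev_unlock(pdev);
}

/* Reads and clears the field under the same lock; returns the old value. */
static bool toy_deassign(struct toy_pdev *pdev)
{
    bool was_quarantined;

    toy_pdev_lock(pdev);
    was_quarantined = pdev->quarantine;
    pdev->quarantine = false;
    toy_pdev_unlock(pdev);

    return was_quarantined;
}
```

Because both paths take the same lock, a concurrent assign and deassign
cannot interleave their accesses to the field.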

> In fact, I think it would be safer to make sure every place that
> currently has a pcidevs_lock() gets a pdev->lock (unless there is a
> d->pdevs_lock, pci_seg->alldevs_lock, iommu->ats_list_lock, or another
> lock) ?

I'll try, but the problem is that there are places where we don't have a
pointer to a pdev, so it is not clear what exactly should be locked.

>
>> ---
>>  xen/arch/x86/domctl.c                       |  8 ---
>>  xen/arch/x86/hvm/vioapic.c                  |  2 -
>>  xen/arch/x86/hvm/vmsi.c                     | 12 ----
>>  xen/arch/x86/irq.c                          |  7 ---
>>  xen/arch/x86/msi.c                          | 11 ----
>>  xen/arch/x86/pci.c                          |  4 --
>>  xen/arch/x86/physdev.c                      |  7 +--
>>  xen/common/sysctl.c                         |  2 -
>>  xen/drivers/char/ns16550.c                  |  4 --
>>  xen/drivers/passthrough/amd/iommu_init.c    |  7 ---
>>  xen/drivers/passthrough/amd/iommu_map.c     |  5 --
>>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  4 --
>>  xen/drivers/passthrough/pci.c               | 63 +--------------------
>>  xen/drivers/passthrough/vtd/iommu.c         |  8 ---
>>  xen/drivers/video/vga.c                     |  2 -
>>  15 files changed, 4 insertions(+), 142 deletions(-)
>> 
>> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
>> index 020df615bd..9f4ca03385 100644
>> --- a/xen/arch/x86/domctl.c
>> +++ b/xen/arch/x86/domctl.c
>> @@ -537,11 +537,7 @@ long arch_do_domctl(
>>  
>>          ret = -ESRCH;
>>          if ( is_iommu_enabled(d) )
>> -        {
>> -            pcidevs_lock();
>>              ret = pt_irq_create_bind(d, bind);
>> -            pcidevs_unlock();
>> -        }
>>          if ( ret < 0 )
>>              printk(XENLOG_G_ERR "pt_irq_create_bind failed (%ld) for dom%d\n",
>>                     ret, d->domain_id);
>> @@ -566,11 +562,7 @@ long arch_do_domctl(
>>              break;
>>  
>>          if ( is_iommu_enabled(d) )
>> -        {
>> -            pcidevs_lock();
>>              ret = pt_irq_destroy_bind(d, bind);
>> -            pcidevs_unlock();
>> -        }
>>          if ( ret < 0 )
>>              printk(XENLOG_G_ERR "pt_irq_destroy_bind failed (%ld) for dom%d\n",
>>                     ret, d->domain_id);
>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>> index cb7f440160..aa4e7766a3 100644
>> --- a/xen/arch/x86/hvm/vioapic.c
>> +++ b/xen/arch/x86/hvm/vioapic.c
>> @@ -197,7 +197,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>          return ret;
>>      }
>>  
>> -    pcidevs_lock();
>>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>      if ( ret )
>>      {
>> @@ -207,7 +206,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>          unmap_domain_pirq(currd, pirq);
>>          write_unlock(&currd->event_lock);
>>      }
>> -    pcidevs_unlock();
>>  
>>      return ret;
>>  }
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index c9e5f279c5..344bbd646c 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -470,7 +470,6 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>>      struct msixtbl_entry *entry, *new_entry;
>>      int r = -EINVAL;
>>  
>> -    ASSERT(pcidevs_locked());
>>      ASSERT(rw_is_write_locked(&d->event_lock));
>>  
>>      if ( !msixtbl_initialised(d) )
>> @@ -540,7 +539,6 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>>      struct pci_dev *pdev;
>>      struct msixtbl_entry *entry;
>>  
>> -    ASSERT(pcidevs_locked());
>>      ASSERT(rw_is_write_locked(&d->event_lock));
>>  
>>      if ( !msixtbl_initialised(d) )
>> @@ -686,8 +684,6 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
>>  {
>>      unsigned int i;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
>>      {
>>          gdprintk(XENLOG_ERR, "%pp: PIRQ %u: unsupported address %lx\n",
>> @@ -728,7 +724,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>>  
>>      ASSERT(msi->arch.pirq != INVALID_PIRQ);
>>  
>> -    pcidevs_lock();
>>      for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
>>      {
>>          struct xen_domctl_bind_pt_irq unbind = {
>> @@ -747,7 +742,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>>  
>>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
>>                                         msi->vectors, msi->arch.pirq, msi->mask);
>> -    pcidevs_unlock();
>>  }
>>  
>>  static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
>> @@ -785,10 +779,8 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
>>          return rc;
>>      msi->arch.pirq = rc;
>>  
>> -    pcidevs_lock();
>>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
>>                                         msi->arch.pirq, msi->mask);
>> -    pcidevs_unlock();
>>  
>>      return 0;
>>  }
>> @@ -800,7 +792,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>>  
>>      ASSERT(pirq != INVALID_PIRQ);
>>  
>> -    pcidevs_lock();
>>      for ( i = 0; i < nr && bound; i++ )
>>      {
>>          struct xen_domctl_bind_pt_irq bind = {
>> @@ -816,7 +807,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>>      write_lock(&pdev->domain->event_lock);
>>      unmap_domain_pirq(pdev->domain, pirq);
>>      write_unlock(&pdev->domain->event_lock);
>> -    pcidevs_unlock();
>>  }
>>  
>>  void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
>> @@ -863,7 +853,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>>  
>>      entry->arch.pirq = rc;
>>  
>> -    pcidevs_lock();
>>      rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
>>                           entry->masked);
>>      if ( rc )
>> @@ -871,7 +860,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>>          vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
>>          entry->arch.pirq = INVALID_PIRQ;
>>      }
>> -    pcidevs_unlock();
>>  
>>      return rc;
>>  }
>> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
>> index d8672a03e1..6a08830a55 100644
>> --- a/xen/arch/x86/irq.c
>> +++ b/xen/arch/x86/irq.c
>> @@ -2156,8 +2156,6 @@ int map_domain_pirq(
>>          struct pci_dev *pdev;
>>          unsigned int nr = 0;
>>  
>> -        ASSERT(pcidevs_locked());
>> -
>>          ret = -ENODEV;
>>          if ( !cpu_has_apic )
>>              goto done;
>> @@ -2317,7 +2315,6 @@ int unmap_domain_pirq(struct domain *d, int pirq)
>>      if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
>>          return -EINVAL;
>>  
>> -    ASSERT(pcidevs_locked());
>>      ASSERT(rw_is_write_locked(&d->event_lock));
>>  
>>      info = pirq_info(d, pirq);
>> @@ -2423,7 +2420,6 @@ void free_domain_pirqs(struct domain *d)
>>  {
>>      int i;
>>  
>> -    pcidevs_lock();
>>      write_lock(&d->event_lock);
>>  
>>      for ( i = 0; i < d->nr_pirqs; i++ )
>> @@ -2431,7 +2427,6 @@ void free_domain_pirqs(struct domain *d)
>>              unmap_domain_pirq(d, i);
>>  
>>      write_unlock(&d->event_lock);
>> -    pcidevs_unlock();
>>  }
>>  
>>  static void cf_check dump_irqs(unsigned char key)
>> @@ -2911,7 +2906,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>  
>>      msi->irq = irq;
>>  
>> -    pcidevs_lock();
>>      /* Verify or get pirq. */
>>      write_lock(&d->event_lock);
>>      pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
>> @@ -2927,7 +2921,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>  
>>   done:
>>      write_unlock(&d->event_lock);
>> -    pcidevs_unlock();
>>      if ( ret )
>>      {
>>          switch ( type )
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index 6b62c4f452..f04b90e235 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -623,7 +623,6 @@ static int msi_capability_init(struct pci_dev *dev,
>>      u8 slot = PCI_SLOT(dev->devfn);
>>      u8 func = PCI_FUNC(dev->devfn);
>>  
>> -    ASSERT(pcidevs_locked());
>>      pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
>>      if ( !pos )
>>          return -ENODEV;
>> @@ -810,8 +809,6 @@ static int msix_capability_init(struct pci_dev *dev,
>>      if ( !pos )
>>          return -ENODEV;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>>      /*
>>       * Ensure MSI-X interrupts are masked during setup. Some devices require
>> @@ -1032,7 +1029,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>>      struct msi_desc *old_desc;
>>      int ret;
>>  
>> -    ASSERT(pcidevs_locked());
>>      pdev = pci_get_pdev(NULL, msi->sbdf);
>>      if ( !pdev )
>>          return -ENODEV;
>> @@ -1092,7 +1088,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
>>      struct msi_desc *old_desc;
>>      int ret;
>>  
>> -    ASSERT(pcidevs_locked());
>>      pdev = pci_get_pdev(NULL, msi->sbdf);
>>      if ( !pdev || !pdev->msix )
>>          return -ENODEV;
>> @@ -1191,7 +1186,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>>      if ( !use_msi )
>>          return 0;
>>  
>> -    pcidevs_lock();
>>      pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
>>      if ( !pdev )
>>          rc = -ENODEV;
>> @@ -1204,7 +1198,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>>      }
>>      else
>>          rc = msix_capability_init(pdev, NULL, NULL);
>> -    pcidevs_unlock();
>>  
>>      pcidev_put(pdev);
>>  
>> @@ -1217,8 +1210,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>>   */
>>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>>  {
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !use_msi )
>>          return -EPERM;
>>  
>> @@ -1355,8 +1346,6 @@ int pci_restore_msi_state(struct pci_dev *pdev)
>>      unsigned int type = 0, pos = 0;
>>      u16 control = 0;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !use_msi )
>>          return -EOPNOTSUPP;
>>  
>> diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
>> index 1d38f0df7c..4dcd6d96f3 100644
>> --- a/xen/arch/x86/pci.c
>> +++ b/xen/arch/x86/pci.c
>> @@ -88,15 +88,11 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
>>      if ( reg < 64 || reg >= 256 )
>>          return 0;
>>  
>> -    pcidevs_lock();
>> -
>>      pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
>>      if ( pdev ) {
>>          rc = pci_msi_conf_write_intercept(pdev, reg, size, data);
>>  	pcidev_put(pdev);
>>      }
>>  
>> -    pcidevs_unlock();
>> -
>>      return rc;
>>  }
>> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
>> index 96214a3d40..a41366b609 100644
>> --- a/xen/arch/x86/physdev.c
>> +++ b/xen/arch/x86/physdev.c
>> @@ -162,11 +162,9 @@ int physdev_unmap_pirq(domid_t domid, int pirq)
>>              goto free_domain;
>>      }
>>  
>> -    pcidevs_lock();
>>      write_lock(&d->event_lock);
>>      ret = unmap_domain_pirq(d, pirq);
>>      write_unlock(&d->event_lock);
>> -    pcidevs_unlock();
>>  
>>   free_domain:
>>      rcu_unlock_domain(d);
>> @@ -530,7 +528,6 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>          if ( copy_from_guest(&restore_msi, arg, 1) != 0 )
>>              break;
>>  
>> -        pcidevs_lock();
>>          pdev = pci_get_pdev(NULL,
>>                              PCI_SBDF(0, restore_msi.bus, restore_msi.devfn));
>>          if ( pdev )
>> @@ -541,7 +538,6 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>          else
>>              ret = -ENODEV;
>>  
>> -        pcidevs_unlock();
>>          break;
>>      }
>>  
>> @@ -553,7 +549,6 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>          if ( copy_from_guest(&dev, arg, 1) != 0 )
>>              break;
>>  
>> -        pcidevs_lock();
>>          pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
>>          if ( pdev )
>>          {
>> @@ -562,7 +557,7 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>          }
>>          else
>>              ret = -ENODEV;
>> -        pcidevs_unlock();
>> +
>>          break;
>>      }
>>  
>> diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
>> index 0feef94cd2..6bb8c5c295 100644
>> --- a/xen/common/sysctl.c
>> +++ b/xen/common/sysctl.c
>> @@ -446,7 +446,6 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>>                  break;
>>              }
>>  
>> -            pcidevs_lock();
>>              pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
>>              if ( !pdev )
>>                  node = XEN_INVALID_DEV;
>> @@ -454,7 +453,6 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>>                  node = XEN_INVALID_NODE_ID;
>>              else
>>                  node = pdev->node;
>> -            pcidevs_unlock();
>>  
>>              if ( pdev )
>>                  pcidev_put(pdev);
>> diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
>> index 01a05c9aa8..66c10b18e5 100644
>> --- a/xen/drivers/char/ns16550.c
>> +++ b/xen/drivers/char/ns16550.c
>> @@ -445,8 +445,6 @@ static void __init cf_check ns16550_init_postirq(struct serial_port *port)
>>              {
>>                  struct msi_desc *msi_desc = NULL;
>>  
>> -                pcidevs_lock();
>> -
>>                  rc = pci_enable_msi(&msi, &msi_desc);
>>                  if ( !rc )
>>                  {
>> @@ -460,8 +458,6 @@ static void __init cf_check ns16550_init_postirq(struct serial_port *port)
>>                          pci_disable_msi(msi_desc);
>>                  }
>>  
>> -                pcidevs_unlock();
>> -
>>                  if ( rc )
>>                  {
>>                      uart->irq = 0;
>> diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
>> index 7c1713a602..e42af65a40 100644
>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>> @@ -638,10 +638,7 @@ static void cf_check parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
>>      uint16_t device_id = iommu_get_devid_from_cmd(entry[0]);
>>      struct pci_dev *pdev;
>>  
>> -    pcidevs_lock();
>>      pdev = pci_get_real_pdev(PCI_SBDF(iommu->seg, device_id));
>> -    pcidevs_unlock();
>> -
>>      if ( pdev )
>>          guest_iommu_add_ppr_log(pdev->domain, entry);
>>      pcidev_put(pdev);
>> @@ -747,14 +744,12 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu)
>>          return 0;
>>      }
>>  
>> -    pcidevs_lock();
>>      /*
>>       * XXX: it is unclear if this device can be removed. Right now
>>       * there is no code that clears msi.dev, so no one will decrease
>>       * refcount on it.
>>       */
>>      iommu->msi.dev = pci_get_pdev(NULL, PCI_SBDF(iommu->seg, iommu->bdf));
>> -    pcidevs_unlock();
>>      if ( !iommu->msi.dev )
>>      {
>>          AMD_IOMMU_WARN("no pdev for %pp\n",
>> @@ -1289,9 +1284,7 @@ static int __init cf_check amd_iommu_setup_device_table(
>>              {
>>                  if ( !pci_init )
>>                      continue;
>> -                pcidevs_lock();
>>                  pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
>> -                pcidevs_unlock();
>>              }
>>  
>>              if ( pdev && (pdev->msix || pdev->msi_maxvec) )
>> diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
>> index 9d621e3d36..d04aa37538 100644
>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -726,9 +726,7 @@ int cf_check amd_iommu_get_reserved_device_memory(
>>              /* May need to trigger the workaround in find_iommu_for_device(). */
>>              struct pci_dev *pdev;
>>  
>> -            pcidevs_lock();
>>              pdev = pci_get_pdev(NULL, sbdf);
>> -            pcidevs_unlock();
>>  
>>              if ( pdev )
>>              {
>> @@ -848,7 +846,6 @@ int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
>>      const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
>>      int rc;
>>  
>> -    ASSERT(pcidevs_locked());
>>      ASSERT(!hd->arch.amd.root_table);
>>      ASSERT(page_list_empty(&hd->arch.pgtables.list));
>>  
>> @@ -903,8 +900,6 @@ void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
>>  {
>>      struct domain_iommu *hd = dom_iommu(dom_io);
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev->arch.amd.root_table )
>>          return;
>>  
>> diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> index 955f3af57a..919e30129e 100644
>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -268,8 +268,6 @@ static int __must_check amd_iommu_setup_domain_device(
>>                      req_id, pdev->type, page_to_maddr(root_pg),
>>                      domid, hd->arch.amd.paging_mode);
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
>>           !ivrs_dev->block_ats &&
>>           iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) &&
>> @@ -416,8 +414,6 @@ static void amd_iommu_disable_domain_device(const struct domain *domain,
>>      if ( QUARANTINE_SKIP(domain, pdev) )
>>          return;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
>>           pci_ats_enabled(iommu->seg, bus, pdev->devfn) )
>>      {
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index c83397211b..cc62a5aec4 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -517,7 +517,6 @@ int __init pci_hide_device(unsigned int seg, unsigned int bus,
>>      struct pci_seg *pseg;
>>      int rc = -ENOMEM;
>>  
>> -    pcidevs_lock();
>>      pseg = alloc_pseg(seg);
>>      if ( pseg )
>>      {
>> @@ -528,7 +527,6 @@ int __init pci_hide_device(unsigned int seg, unsigned int bus,
>>              rc = 0;
>>          }
>>      }
>> -    pcidevs_unlock();
>>  
>>      return rc;
>>  }
>> @@ -588,8 +586,6 @@ struct pci_dev *pci_get_pdev(struct domain *d, pci_sbdf_t sbdf)
>>  {
>>      struct pci_dev *pdev;
>>  
>> -    ASSERT(d || pcidevs_locked());
>> -
>>      /*
>>       * The hardware domain owns the majority of the devices in the system.
>>       * When there are multiple segments, traversing the per-segment list is
>> @@ -730,7 +726,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>          pdev_type = "device";
>>      else if ( info->is_virtfn )
>>      {
>> -        pcidevs_lock();
>>          pdev = pci_get_pdev(NULL,
>>                              PCI_SBDF(seg, info->physfn.bus,
>>                                       info->physfn.devfn));
>> @@ -739,7 +734,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>              pf_is_extfn = pdev->info.is_extfn;
>>              pcidev_put(pdev);
>>          }
>> -        pcidevs_unlock();
>>          if ( !pdev )
>>              pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
>>                             NULL, node);
>> @@ -756,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>  
>>      ret = -ENOMEM;
>>  
>> -    pcidevs_lock();
>>      pseg = alloc_pseg(seg);
>>      if ( !pseg )
>>          goto out;
>> @@ -858,7 +851,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>      pci_enable_acs(pdev);
>>  
>>  out:
>> -    pcidevs_unlock();
>>      if ( !ret )
>>      {
>>          printk(XENLOG_DEBUG "PCI add %s %pp\n", pdev_type,  &pdev->sbdf);
>> @@ -889,7 +881,6 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>      if ( !pseg )
>>          return -ENODEV;
>>  
>> -    pcidevs_lock();
>>      spin_lock(&pseg->alldevs_lock);
>>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>>          if ( pdev->bus == bus && pdev->devfn == devfn )
>> @@ -910,12 +901,10 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>              break;
>>          }
>>  
>> -    pcidevs_unlock();
>>      spin_unlock(&pseg->alldevs_lock);
>>      return ret;
>>  }
>>  
>> -/* Caller should hold the pcidevs_lock */
>>  static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>>                             uint8_t devfn)
>>  {
>> @@ -927,7 +916,6 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>>      if ( !is_iommu_enabled(d) )
>>          return -EINVAL;
>>  
>> -    ASSERT(pcidevs_locked());
>>      pdev = pci_get_pdev(d, PCI_SBDF(seg, bus, devfn));
>>      if ( !pdev )
>>          return -ENODEV;
>> @@ -981,13 +969,10 @@ int pci_release_devices(struct domain *d)
>>      u8 bus, devfn;
>>      int ret;
>>  
>> -    pcidevs_lock();
>>      ret = arch_pci_clean_pirqs(d);
>>      if ( ret )
>> -    {
>> -        pcidevs_unlock();
>>          return ret;
>> -    }
>> +
>>      spin_lock(&d->pdevs_lock);
>>      list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
>>      {
>> @@ -996,7 +981,6 @@ int pci_release_devices(struct domain *d)
>>          ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>      }
>>      spin_unlock(&d->pdevs_lock);
>> -    pcidevs_unlock();
>>  
>>      return ret;
>>  }
>> @@ -1094,7 +1078,6 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
>>      s_time_t now = NOW();
>>      u16 cword;
>>  
>> -    pcidevs_lock();
>>      pdev = pci_get_real_pdev(PCI_SBDF(seg, bus, devfn));
>>      if ( pdev )
>>      {
>> @@ -1108,7 +1091,6 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
>>              pdev = NULL;
>>          }
>>      }
>> -    pcidevs_unlock();
>>  
>>      if ( !pdev )
>>          return;
>> @@ -1164,13 +1146,7 @@ static int __init cf_check _scan_pci_devices(struct pci_seg *pseg, void *arg)
>>  
>>  int __init scan_pci_devices(void)
>>  {
>> -    int ret;
>> -
>> -    pcidevs_lock();
>> -    ret = pci_segments_iterate(_scan_pci_devices, NULL);
>> -    pcidevs_unlock();
>> -
>> -    return ret;
>> +    return pci_segments_iterate(_scan_pci_devices, NULL);
>>  }
>>  
>>  struct setup_hwdom {
>> @@ -1239,19 +1215,11 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices(
>>  
>>              pcidev_put(pdev);
>>              if ( iommu_verbose )
>> -            {
>> -                pcidevs_unlock();
>>                  process_pending_softirqs();
>> -                pcidevs_lock();
>> -            }
>>          }
>>  
>>          if ( !iommu_verbose )
>> -        {
>> -            pcidevs_unlock();
>>              process_pending_softirqs();
>> -            pcidevs_lock();
>> -        }
>>      }
>>  
>>      return 0;
>> @@ -1262,9 +1230,7 @@ void __hwdom_init setup_hwdom_pci_devices(
>>  {
>>      struct setup_hwdom ctxt = { .d = d, .handler = handler };
>>  
>> -    pcidevs_lock();
>>      pci_segments_iterate(_setup_hwdom_pci_devices, &ctxt);
>> -    pcidevs_unlock();
>>  }
>>  
>>  /* APEI not supported on ARM yet. */
>> @@ -1396,9 +1362,7 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
>>  static void cf_check dump_pci_devices(unsigned char ch)
>>  {
>>      printk("==== PCI devices ====\n");
>> -    pcidevs_lock();
>>      pci_segments_iterate(_dump_pci_devices, NULL);
>> -    pcidevs_unlock();
>>  }
>>  
>>  static int __init cf_check setup_dump_pcidevs(void)
>> @@ -1417,8 +1381,6 @@ static int iommu_add_device(struct pci_dev *pdev)
>>      if ( !pdev->domain )
>>          return -EINVAL;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      hd = dom_iommu(pdev->domain);
>>      if ( !is_iommu_enabled(pdev->domain) )
>>          return 0;
>> @@ -1446,8 +1408,6 @@ static int iommu_enable_device(struct pci_dev *pdev)
>>      if ( !pdev->domain )
>>          return -EINVAL;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      hd = dom_iommu(pdev->domain);
>>      if ( !is_iommu_enabled(pdev->domain) ||
>>           !hd->platform_ops->enable_device )
>> @@ -1494,7 +1454,6 @@ static int device_assigned(struct pci_dev *pdev)
>>  {
>>      int rc = 0;
>>  
>> -    ASSERT(pcidevs_locked());
>>      /*
>>       * If the device exists and it is not owned by either the hardware
>>       * domain or dom_io then it must be assigned to a guest, or be
>> @@ -1507,7 +1466,6 @@ static int device_assigned(struct pci_dev *pdev)
>>      return rc;
>>  }
>>  
>> -/* Caller should hold the pcidevs_lock */
>>  static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
>>  {
>>      const struct domain_iommu *hd = dom_iommu(d);
>> @@ -1521,7 +1479,6 @@ static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag)
>>          return -EXDEV;
>>  
>>      /* device_assigned() should already have cleared the device for assignment */
>> -    ASSERT(pcidevs_locked());
>>      ASSERT(pdev && (pdev->domain == hardware_domain ||
>>                      pdev->domain == dom_io));
>>  
>> @@ -1587,7 +1544,6 @@ static int iommu_get_device_group(
>>      if ( group_id < 0 )
>>          return group_id;
>>  
>> -    pcidevs_lock();
>>      spin_lock(&d->pdevs_lock);
>>      for_each_pdev( d, pdev )
>>      {
>> @@ -1603,7 +1559,6 @@ static int iommu_get_device_group(
>>          sdev_id = iommu_call(ops, get_device_group_id, seg, b, df);
>>          if ( sdev_id < 0 )
>>          {
>> -            pcidevs_unlock();
>>              spin_unlock(&d->pdevs_lock);
>>              return sdev_id;
>>          }
>> @@ -1614,7 +1569,6 @@ static int iommu_get_device_group(
>>  
>>              if ( unlikely(copy_to_guest_offset(buf, i, &bdf, 1)) )
>>              {
>> -                pcidevs_unlock();
>>                  spin_unlock(&d->pdevs_lock);
>>                  return -EFAULT;
>>              }
>> @@ -1622,7 +1576,6 @@ static int iommu_get_device_group(
>>          }
>>      }
>>  
>> -    pcidevs_unlock();
>>      spin_unlock(&d->pdevs_lock);
>>  
>>      return i;
>> @@ -1630,17 +1583,12 @@ static int iommu_get_device_group(
>>  
>>  void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>>  {
>> -    pcidevs_lock();
>> -
>>      /* iommu->ats_list_lock is taken by the caller of this function */
>>      disable_ats_device(pdev);
>>  
>>      ASSERT(pdev->domain);
>>      if ( d != pdev->domain )
>> -    {
>> -        pcidevs_unlock();
>>          return;
>> -    }
>>  
>>      pdev->broken = true;
>>  
>> @@ -1649,8 +1597,6 @@ void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
>>                 d->domain_id, &pdev->sbdf);
>>      if ( !is_hardware_domain(d) )
>>          domain_crash(d);
>> -
>> -    pcidevs_unlock();
>>  }
>>  
>>  int iommu_do_pci_domctl(
>> @@ -1740,7 +1686,6 @@ int iommu_do_pci_domctl(
>>              break;
>>          }
>>  
>> -        pcidevs_lock();
>>          ret = device_assigned(pdev);
>>          if ( domctl->cmd == XEN_DOMCTL_test_assign_device )
>>          {
>> @@ -1755,7 +1700,7 @@ int iommu_do_pci_domctl(
>>              ret = assign_device(d, pdev, flags);
>>  
>>          pcidev_put(pdev);
>> -        pcidevs_unlock();
>> +
>>          if ( ret == -ERESTART )
>>              ret = hypercall_create_continuation(__HYPERVISOR_domctl,
>>                                                  "h", u_domctl);
>> @@ -1787,9 +1732,7 @@ int iommu_do_pci_domctl(
>>          bus = PCI_BUS(machine_sbdf);
>>          devfn = PCI_DEVFN(machine_sbdf);
>>  
>> -        pcidevs_lock();
>>          ret = deassign_device(d, seg, bus, devfn);
>> -        pcidevs_unlock();
>>          break;
>>  
>>      default:
>> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
>> index 42661f22f4..87868188b7 100644
>> --- a/xen/drivers/passthrough/vtd/iommu.c
>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>> @@ -1490,7 +1490,6 @@ int domain_context_mapping_one(
>>      if ( QUARANTINE_SKIP(domain, pgd_maddr) )
>>          return 0;
>>  
>> -    ASSERT(pcidevs_locked());
>>      spin_lock(&iommu->lock);
>>      maddr = bus_to_context_maddr(iommu, bus);
>>      context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
>> @@ -1711,8 +1710,6 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
>>      if ( drhd && drhd->iommu->node != NUMA_NO_NODE )
>>          dom_iommu(domain)->node = drhd->iommu->node;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      for_each_rmrr_device( rmrr, bdf, i )
>>      {
>>          if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
>> @@ -2072,8 +2069,6 @@ static void quarantine_teardown(struct pci_dev *pdev,
>>  {
>>      struct domain_iommu *hd = dom_iommu(dom_io);
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev->arch.vtd.pgd_maddr )
>>          return;
>>  
>> @@ -2341,8 +2336,6 @@ static int cf_check intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
>>      u16 bdf;
>>      int ret, i;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev->domain )
>>          return -EINVAL;
>>  
>> @@ -3176,7 +3169,6 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
>>      bool rmrr_found = false;
>>      int rc;
>>  
>> -    ASSERT(pcidevs_locked());
>>      ASSERT(!hd->arch.vtd.pgd_maddr);
>>      ASSERT(page_list_empty(&hd->arch.pgtables.list));
>>  
>> diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c
>> index 1298f3a7b6..1f7c496114 100644
>> --- a/xen/drivers/video/vga.c
>> +++ b/xen/drivers/video/vga.c
>> @@ -117,9 +117,7 @@ void __init video_endboot(void)
>>                  struct pci_dev *pdev;
>>                  u8 b = bus, df = devfn, sb;
>>  
>> -                pcidevs_lock();
>>                  pdev = pci_get_pdev(NULL, PCI_SBDF(0, bus, devfn));
>> -                pcidevs_unlock();
>>  
>>                  if ( !pdev ||
>>                       pci_conf_read16(PCI_SBDF(0, bus, devfn),
>> -- 
>> 2.36.1
>> 



* Re: [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls
  2023-02-20 23:13     ` Volodymyr Babchuk
@ 2023-02-21  9:50       ` Jan Beulich
  2023-03-09  1:22         ` Volodymyr Babchuk
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2023-02-21  9:50 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Paul Durrant, Kevin Tian,
	Stefano Stabellini

On 21.02.2023 00:13, Volodymyr Babchuk wrote:
> Stefano Stabellini <sstabellini@kernel.org> writes:
>> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>>> As pci devices are refcounted now and all lists that store them are
>>> protected by separate locks, we can safely drop global pcidevs_lock.
>>>
>>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>>
>> Up until this patch this patch series introduces:
>> - d->pdevs_lock to protect d->pdev_list
>> - pci_seg->alldevs_lock to protect pci_seg->alldevs_list
>> - iommu->ats_list_lock to protect iommu->ats_devices
>> - pdev refcounting to detect a pdev in-use and when to free it
>> - pdev->lock to protect pdev->msi_list
>>
>> They cover a lot of ground.  Are they collectively covering everything
>> pcidevs_lock() was protecting?
> 
> Well, that is the question. These patches are at the RFC stage because I
> can't fully answer your question. I tried my best to introduce proper
> locking, but apparently missed a couple of places, like
> 
>> deassign_device is not protected by pcidevs_lock anymore.
>> deassign_device accesses a number of pdev fields, including quarantine,
>> phantom_stride and fault.count.
>>
>> deassign_device could run at the same time as assign_device, which sets
>> quarantine and other fields.
>>
> 
> I hope this is all, but the problem is that the PCI subsystem is old, large
> and complex. For example, as I wrote earlier, there are places that are
> protected with pcidevs_lock(), but do nothing with PCI. I just don't
> know what to do with such places. I hope that the x86 maintainers
> will review my changes and give feedback on missed spots.

At the risk of it sounding unfair, at least initially: While review may
spot issues, you will want to keep in mind that none of the people who
originally wrote that code are around anymore. And even if they were,
it would be uncertain - just like for the x86 maintainers - that they
would recall (if they were aware at some time in the first place) all
the corner cases. Therefore I'm afraid that proving correctness and
safety of the proposed transformations can only be done by properly
auditing all involved code paths. Yet that's something that imo wants
to already have been done by the time patches are submitted for review.
Reviewers would then "merely" (hard enough perhaps) check the results
of that audit.

I might guess that this locking situation is one of the reasons why
Andrew in particular thinks (afaik) that the IOMMU code we have would
better be re-written almost from scratch. I assume it's clear to him
(it certainly is to me) that this is something that could only be
expected to happen in an ideal world: I see no-one taking on such an
exercise. We already have too little bandwidth.

Jan



* Re: [RFC PATCH 09/10] [RFC only] xen: iommu: remove last pcidevs_lock() calls in iommu
  2022-08-31 14:11 ` [RFC PATCH 09/10] [RFC only] xen: iommu: remove last pcidevs_lock() calls in iommu Volodymyr Babchuk
  2023-01-28  1:36   ` Stefano Stabellini
@ 2023-02-28 16:25   ` Jan Beulich
  1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-02-28 16:25 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Kevin Tian, Paul Durrant,
	Roger Pau Monné,
	xen-devel

On 31.08.2022 16:11, Volodymyr Babchuk wrote:
> There are a number of cases where pcidevs_lock() is used to protect
> something that is not related to PCI devices per se.
> 
> Probably pcidevs_lock() in these places should be replaced with some
> other lock.
> 
> This patch is not intended to be merged and is present only to discuss
> this use of pcidevs_lock().

For all such instances it needs to be understood what (if anything) is
being protected and how the same guarding can be achieved in the new
model. Since I'm afraid that's simply stating the obvious, I guess I
don't really understand what needs discussing here.

Jan



* Re: [RFC PATCH 02/10] xen: pci: add pci_seg->alldevs_lock
  2022-08-31 14:10 ` [RFC PATCH 02/10] xen: pci: add pci_seg->alldevs_lock Volodymyr Babchuk
  2023-01-26 23:40   ` Stefano Stabellini
@ 2023-02-28 16:32   ` Jan Beulich
  1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-02-28 16:32 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Paul Durrant, Roger Pau Monné, xen-devel

On 31.08.2022 16:10, Volodymyr Babchuk wrote:
> This lock protects alldevs_list of struct pci_seg. As such, it should
> be used when we are adding, removing or enumerating PCI devices
> assigned to a PCI segment.
> 
> The radix tree that stores PCI segments has its own locking mechanism;
> also, pci_seg structures are only allocated and never freed, so we need no
> additional locking to access pci_seg structures. But we need a lock
> that protects the alldevs_list field.
> 
> This enables more granular locking instead of one huge pcidevs_lock
> that locks the entire PCI subsystem.  Please note that pcidevs_lock() is
> still used; we are going to remove it in subsequent patches.

Just a thought: To limit the scope of the steps taken, would it be a
possibility (and useful) to move from the global to the per-segment
lock, extending what this per-segment lock is actually protecting?
And only then take further steps, as already done in later parts of
this series?
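
For illustration only, here is a minimal sketch of what such an
intermediate step could look like: the global lock is swapped for one lock
per segment that covers both the segment's list and the state of its
devices. All names (toy_lock, toy_seg, toy_pdev, toy_mark_broken) are
hypothetical and nothing here is Xen code; the lock is a toy built on C11
atomic_flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model, not Xen code: a toy lock built on C11 atomic_flag. */
struct toy_lock { atomic_flag held; };

void toy_lock_init(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

bool toy_trylock(struct toy_lock *l)
{
    /* test_and_set returns the previous state: false means we got the lock. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void toy_unlock(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* One lock per segment, covering the list AND the state of its devices. */
struct toy_seg {
    struct toy_lock lock;
    unsigned int nr;
};

struct toy_pdev {
    struct toy_seg *seg;
    bool broken;
};

/*
 * Mark a device broken under its segment's lock. Returns false instead of
 * spinning when the lock is busy, so the demonstration cannot hang.
 */
bool toy_mark_broken(struct toy_pdev *pdev)
{
    if (!toy_trylock(&pdev->seg->lock))
        return false;
    pdev->broken = true;
    toy_unlock(&pdev->seg->lock);
    return true;
}
```

In this model, work on a device in one segment does not contend with work
on another segment, while every access that used to rely on the global
lock is still serialised per segment.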

Jan




* Re: [RFC PATCH 07/10] xen: pci: add per-device locking
  2022-08-31 14:11 ` [RFC PATCH 07/10] xen: pci: add per-device locking Volodymyr Babchuk
  2023-01-28  0:56   ` Stefano Stabellini
@ 2023-02-28 16:46   ` Jan Beulich
  1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-02-28 16:46 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, xen-devel

On 31.08.2022 16:11, Volodymyr Babchuk wrote:
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -203,10 +203,14 @@ static struct msi_desc *msixtbl_addr_to_desc(
>  
>      nr_entry = (addr - entry->gtable) / PCI_MSIX_ENTRY_SIZE;
>  
> +    pcidev_lock(entry->pdev);
>      list_for_each_entry( desc, &entry->pdev->msi_list, list )
>          if ( desc->msi_attrib.type == PCI_CAP_ID_MSIX &&
> -             desc->msi_attrib.entry_nr == nr_entry )
> +             desc->msi_attrib.entry_nr == nr_entry ) {
> +	    pcidev_unlock(entry->pdev);
>              return desc;

This is a potentially problematic pattern: desc has a backlink to pdev,
just like "entry" has. _If_ locking is required here (and the
refcounting is insufficient), then it is questionable whether the lock
can actually be dropped before returning. The idea with refcounting was,
though, that entities holding a reference can be sure the pdev won't go
away.

But of course there's also the question what "access to device itself"
(as stated in the description) does (or does not) constitute. I think
it is pretty crucial that for every new lock it is spelled out clearly
what it protects.

Seeing the list iteration pattern here (and at least once below)
another question is whether a lock like the one here may want to be a
read/write one.
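
To make the concern concrete, here is a minimal sketch (not Xen code) of a
lookup that unlocks on a single exit path and takes a reference on the
device before the lock is dropped. The types and helpers (toy_lock,
toy_pdev, toy_desc, toy_find_desc, toy_put_desc) are hypothetical
stand-ins, and the lock is a toy spinlock built on C11 atomic_flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not Xen's real types. */
struct toy_lock { atomic_flag held; };

void toy_lock_acquire(struct toy_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the previous holder releases */
}

void toy_lock_release(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct toy_desc;

struct toy_pdev {
    struct toy_lock lock;      /* protects msi_list */
    atomic_int refcnt;         /* keeps the device itself alive */
    struct toy_desc *msi_list;
};

struct toy_desc {
    struct toy_desc *next;
    struct toy_pdev *pdev;     /* backlink to the owning device */
    unsigned int entry_nr;
};

void toy_pdev_init(struct toy_pdev *pdev)
{
    atomic_flag_clear(&pdev->lock.held);
    atomic_init(&pdev->refcnt, 1);
    pdev->msi_list = NULL;
}

/*
 * Find a descriptor by entry number. On success the device's refcount is
 * raised while the lock is still held, so the backlink stays valid after
 * the unlock. Note this pins the device only, not the descriptor.
 */
struct toy_desc *toy_find_desc(struct toy_pdev *pdev, unsigned int nr)
{
    struct toy_desc *desc, *found = NULL;

    toy_lock_acquire(&pdev->lock);
    for (desc = pdev->msi_list; desc; desc = desc->next)
        if (desc->entry_nr == nr) {
            atomic_fetch_add(&pdev->refcnt, 1);
            found = desc;
            break;
        }
    toy_lock_release(&pdev->lock);

    return found;
}

/* Drop the reference taken by a successful toy_find_desc(). */
void toy_put_desc(struct toy_desc *desc)
{
    atomic_fetch_sub(&desc->pdev->refcnt, 1);
}
```

Whether the descriptor itself needs protection beyond the unlock is
exactly the open question; the sketch only shows the device being pinned.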

Jan



* Re: [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls
  2023-01-28  1:32   ` Stefano Stabellini
  2023-02-20 23:13     ` Volodymyr Babchuk
@ 2023-02-28 16:51     ` Jan Beulich
  1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-02-28 16:51 UTC (permalink / raw)
  To: Stefano Stabellini, Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Paul Durrant, Kevin Tian

On 28.01.2023 02:32, Stefano Stabellini wrote:
> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>> As pci devices are refcounted now and all lists that store them are
>> protected by separate locks, we can safely drop global pcidevs_lock.
>>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> Up until this patch this patch series introduces:
> - d->pdevs_lock to protect d->pdev_list
> - pci_seg->alldevs_lock to protect pci_seg->alldevs_list
> - iommu->ats_list_lock to protect iommu->ats_devices
> - pdev refcounting to detect a pdev in-use and when to free it
> - pdev->lock to protect pdev->msi_list
> 
> They cover a lot of ground.  Are they collectively covering everything
> pcidevs_lock() was protecting?
> 
> deassign_device is not protected by pcidevs_lock anymore.
> deassign_device accesses a number of pdev fields, including quarantine,
> phantom_stride and fault.count.
> 
> deassign_device could run at the same time as assign_device, which sets
> quarantine and other fields.
> 
> It looks like assign_device, deassign_device, and other functions
> accessing/modifying pdev fields should be protected by pdev->lock.
> 
> In fact, I think it would be safer to make sure every place that
> currently has a pcidevs_lock() gets a pdev->lock (unless there is a
> d->pdevs_lock, pci_seg->alldevs_lock, iommu->ats_list_lock, or another
> lock) ?

Yes, I agree - there shouldn't be cases where lock uses are removed
with neither replacement nor an explanation why the removal is safe.
Which in turn suggests that a change like the one here likely needs
doing in much smaller chunks. Grouping could possibly be based upon
all touched instances having the same replacement or justification.
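
As a minimal sketch of the "replacement" case, assuming a per-device lock
is the chosen substitute: both paths below take the same lock around every
field they touch, so they cannot interleave. The names (toy_pdev,
toy_assign, toy_deassign) and the simplified ownership model (NULL meaning
"unassigned") are hypothetical and do not reflect the real
assign_device()/deassign_device().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy spinlock; not Xen's spinlock_t. */
struct toy_lock { atomic_flag held; };

void toy_lock_init(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

void toy_lock_acquire(struct toy_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin */
}

void toy_lock_release(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct toy_domain { int id; };

struct toy_pdev {
    struct toy_lock lock;        /* protects every field below */
    struct toy_domain *domain;   /* NULL means unassigned in this model */
    bool quarantine;
    unsigned int fault_count;
};

/* Assign the device to d; fails with -1 if it already has an owner. */
int toy_assign(struct toy_pdev *pdev, struct toy_domain *d)
{
    int rc = 0;

    toy_lock_acquire(&pdev->lock);
    if (pdev->domain)
        rc = -1;
    else {
        pdev->domain = d;
        pdev->quarantine = false;
        pdev->fault_count = 0;
    }
    toy_lock_release(&pdev->lock);

    return rc;
}

/* Take the device away from d; fails with -1 if d is not the owner. */
int toy_deassign(struct toy_pdev *pdev, struct toy_domain *d)
{
    int rc = 0;

    toy_lock_acquire(&pdev->lock);
    if (pdev->domain != d)
        rc = -1;
    else {
        pdev->domain = NULL;
        pdev->quarantine = true;
    }
    toy_lock_release(&pdev->lock);

    return rc;
}
```

Call sites sharing this replacement could then form one small patch, with
the justification stated once in its description.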

Jan



* Re: [RFC PATCH 05/10] xen: pci: introduce reference counting for pdev
  2022-08-31 14:11 ` [RFC PATCH 05/10] xen: pci: introduce reference counting for pdev Volodymyr Babchuk
  2023-01-27  0:43   ` Stefano Stabellini
@ 2023-02-28 17:06   ` Jan Beulich
  1 sibling, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-02-28 17:06 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Paul Durrant, Kevin Tian, xen-devel

On 31.08.2022 16:11, Volodymyr Babchuk wrote:
> Prior to this change, lifetime of pci_dev objects was protected by global
> pcidevs_lock(). We are going to get rid of this lock, so we need some
> other mechanism to ensure that those objects will not disappear from
> under the code that accesses them. Reference counting is a good choice as
> it provides an easy to comprehend way to control object lifetime with
> better granularity than a global super lock.
> 
> This patch adds two new helper functions: pcidev_get() and
> pcidev_put(). pcidev_get() will increase reference counter, while
> pcidev_put() will decrease it, destroying object when counter reaches
> zero.
> 
> pcidev_get() should be used only when you already have a valid pointer
> to the object or you are holding lock that protects one of the
> lists (domain, pseg or ats) that store pci_dev structs.
> 
> pcidev_get() is rarely used directly, because there already are
> functions that will provide valid pointer to pci_dev struct:
> pci_get_pdev() and pci_get_real_pdev(). They will lock appropriate
> list, find needed object and increase its reference counter before
> returning to the caller.
> 
> Naturally, pcidev_put() should be called after finishing working with a
> received object. This is the reason why this patch has so many
> pcidev_put()s and so few pcidev_get()s: existing calls to
> pci_get_*() functions now will increase reference counter
> automatically, we just need to decrease it back when we finished.
> 
> This patch removes "const" qualifier from some pdev pointers because
> pcidev_put() technically alters the contents of pci_dev structure.

I wonder if you have so few "get"s because in some cases references
would be needed, but aren't being obtained. As a rule of thumb I'd
expect any entity storing a pointer in a long-lived data structure
to obtain a ref first. And we have quite a few struct fields pointing
to devices. I'd also expect a reference to be held when a device is
e.g. put on a domain's list. This would then likely mean that for
example in deassign_device() (or maybe pci_add_device()) you wouldn't
drop the ref in the success case, but instead the ref would transfer
to the domain the device is added to.
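
A minimal sketch of that ownership rule, under the assumption that a list
entry owns exactly one reference: attaching consumes the caller's
reference on success, and detaching drops it. The names (toy_pdev_get,
toy_pdev_put, toy_domain_attach, toy_domain_detach_head) are hypothetical,
list locking is deliberately left out, and this is not the series'
pcidev_get()/pcidev_put() implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model; none of these names exist in Xen. */
struct toy_pdev {
    atomic_int refcnt;
    struct toy_pdev *next_in_domain;  /* link in a domain's device list */
};

struct toy_domain {
    struct toy_pdev *pdev_list;       /* each entry owns one reference */
};

int toy_freed;                        /* destroyed devices, for the demo */

struct toy_pdev *toy_pdev_alloc(void)
{
    struct toy_pdev *pdev = calloc(1, sizeof(*pdev));

    if (pdev)
        atomic_init(&pdev->refcnt, 1); /* the allocator's reference */
    return pdev;
}

struct toy_pdev *toy_pdev_get(struct toy_pdev *pdev)
{
    atomic_fetch_add(&pdev->refcnt, 1);
    return pdev;
}

void toy_pdev_put(struct toy_pdev *pdev)
{
    /* fetch_sub returns the old value: 1 means the last reference is gone. */
    if (atomic_fetch_sub(&pdev->refcnt, 1) == 1) {
        free(pdev);
        toy_freed++;
    }
}

/* Consumes the caller's reference on success; the caller keeps it on error. */
int toy_domain_attach(struct toy_domain *d, struct toy_pdev *pdev)
{
    if (!d)
        return -1;
    pdev->next_in_domain = d->pdev_list;
    d->pdev_list = pdev;              /* the reference now belongs to the list */
    return 0;
}

/* Unlinks the head device and drops the reference the list was holding. */
void toy_domain_detach_head(struct toy_domain *d)
{
    struct toy_pdev *pdev = d->pdev_list;

    if (!pdev)
        return;
    d->pdev_list = pdev->next_in_domain;
    pdev->next_in_domain = NULL;
    toy_pdev_put(pdev);
}
```

With this convention the number of "get" calls tracks the number of
long-lived pointers, which makes a missing reference easier to spot.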

> ---
> 
> - Jan, can I add your Suggested-by tag?

Sure, why not.

Jan



* Re: [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls
  2023-02-21  9:50       ` Jan Beulich
@ 2023-03-09  1:22         ` Volodymyr Babchuk
  2023-03-09  9:06           ` Jan Beulich
  0 siblings, 1 reply; 43+ messages in thread
From: Volodymyr Babchuk @ 2023-03-09  1:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Paul Durrant, Kevin Tian,
	Stefano Stabellini


Hello Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 21.02.2023 00:13, Volodymyr Babchuk wrote:
>> Stefano Stabellini <sstabellini@kernel.org> writes:
>>> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>>>> As pci devices are refcounted now and all lists that store them are
>>>> protected by separate locks, we can safely drop global pcidevs_lock.
>>>>
>>>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>>>
>>> Up until this patch this patch series introduces:
>>> - d->pdevs_lock to protect d->pdev_list
>>> - pci_seg->alldevs_lock to protect pci_seg->alldevs_list
>>> - iommu->ats_list_lock to protect iommu->ats_devices
>>> - pdev refcounting to detect a pdev in-use and when to free it
>>> - pdev->lock to protect pdev->msi_list
>>>
>>> They cover a lot of ground.  Are they collectively covering everything
>>> pcidevs_lock() was protecting?
>> 
>> Well, that is the question. These patches are at the RFC stage because I
>> can't fully answer your question. I tried my best to introduce proper
>> locking, but apparently missed a couple of places, like
>> 
>>> deassign_device is not protected by pcidevs_lock anymore.
>>> deassign_device accesses a number of pdev fields, including quarantine,
>>> phantom_stride and fault.count.
>>>
>>> deassign_device could run at the same time as assign_device, which sets
>>> quarantine and other fields.
>>>
>> 
>> I hope this is all, but the problem is that the PCI subsystem is old, large
>> and complex. For example, as I wrote earlier, there are places that are
>> protected with pcidevs_lock(), but do nothing with PCI. I just don't
>> know what to do with such places. I hope that the x86 maintainers
>> will review my changes and give feedback on missed spots.
>
> At the risk of it sounding unfair, at least initially: While review may
> spot issues, you will want to keep in mind that none of the people who
> originally wrote that code are around anymore. And even if they were,
> it would be uncertain - just like for the x86 maintainers - that they
> would recall (if they were aware at some time in the first place) all
> the corner cases. Therefore I'm afraid that proving correctness and
> safety of the proposed transformations can only be done by properly
> auditing all involved code paths. Yet that's something that imo wants
> to already have been done by the time patches are submitted for review.
> Reviewers would then "merely" (hard enough perhaps) check the results
> of that audit.
>
> I might guess that this locking situation is one of the reasons why
> Andrew in particular thinks (afaik) that the IOMMU code we have would
> better be re-written almost from scratch. I assume it's clear to him
> (it certainly is to me) that this is something that could only be
> expected to happen in an ideal world: I see no-one taking on such an
> exercise. We already have too little bandwidth.

The more I dig into IOMMU code, the more I agree with Andrew. I can't
see how current PCI locking can be untangled in the IOMMU code. There
are just too many moving parts. I tried to play with static code
analysis tools, but I haven't found anything that can reliably analyze
locking in Xen. I even tried to write my own tool tailored specifically
for PCI locking analysis. While it works on some synthetic tests, there
is too much work to support actual Xen code.

I am not able to rework x86 IOMMU code. So, I am inclined to drop this
patch series altogether. My current plan is to take minimal refcounting
from this series to satisfy your comments for "vpci: use pcidevs locking
to protect MMIO handlers".

-- 
WBR, Volodymyr



* Re: [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls
  2023-03-09  1:22         ` Volodymyr Babchuk
@ 2023-03-09  9:06           ` Jan Beulich
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Beulich @ 2023-03-09  9:06 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper,
	Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Paul Durrant, Kevin Tian,
	Stefano Stabellini

On 09.03.2023 02:22, Volodymyr Babchuk wrote:
> Jan Beulich <jbeulich@suse.com> writes:
>> On 21.02.2023 00:13, Volodymyr Babchuk wrote:
>>> Stefano Stabellini <sstabellini@kernel.org> writes:
>>>> On Wed, 31 Aug 2022, Volodymyr Babchuk wrote:
>>>>> As pci devices are refcounted now and all lists that store them are
>>>>> protected by separate locks, we can safely drop global pcidevs_lock.
>>>>>
>>>>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>>>>
>>>> Up until this patch this patch series introduces:
>>>> - d->pdevs_lock to protect d->pdev_list
>>>> - pci_seg->alldevs_lock to protect pci_seg->alldevs_list
>>>> - iommu->ats_list_lock to protect iommu->ats_devices
>>>> - pdev refcounting to detect a pdev in-use and when to free it
>>>> - pdev->lock to protect pdev->msi_list
>>>>
>>>> They cover a lot of ground.  Are they collectively covering everything
>>>> pcidevs_lock() was protecting?
>>>
>>> Well, that is the question. These patches are at the RFC stage because I
>>> can't fully answer your question. I tried my best to introduce proper
>>> locking, but apparently missed a couple of places, like
>>>
>>>> deassign_device is not protected by pcidevs_lock anymore.
>>>> deassign_device accesses a number of pdev fields, including quarantine,
>>>> phantom_stride and fault.count.
>>>>
>>>> deassign_device could run at the same time as assign_device, which sets
>>>> quarantine and other fields.
>>>>
>>>
>>> I hope this is all, but the problem is that the PCI subsystem is old, large
>>> and complex. For example, as I wrote earlier, there are places that are
>>> protected with pcidevs_lock(), but do nothing with PCI. I just don't
>>> know what to do with such places. I hope that the x86 maintainers
>>> will review my changes and give feedback on missed spots.
>>
>> At the risk of it sounding unfair, at least initially: While review may
>> spot issues, you will want to keep in mind that none of the people who
>> originally wrote that code are around anymore. And even if they were,
>> it would be uncertain - just like for the x86 maintainers - that they
>> would recall (if they were aware at some time in the first place) all
>> the corner cases. Therefore I'm afraid that proving correctness and
>> safety of the proposed transformations can only be done by properly
>> auditing all involved code paths. Yet that's something that imo wants
>> to already have been done by the time patches are submitted for review.
>> Reviewers would then "merely" (hard enough perhaps) check the results
>> of that audit.
>>
>> I might guess that this locking situation is one of the reasons why
>> Andrew in particular thinks (afaik) that the IOMMU code we have would
>> better be re-written almost from scratch. I assume it's clear to him
>> (it certainly is to me) that this is something that could only be
>> expected to happen in an ideal world: I see no-one taking on such an
>> exercise. We already have too little bandwidth.
> 
> The more I dig into IOMMU code, the more I agree with Andrew. I can't
> see how current PCI locking can be untangled in the IOMMU code. There
> are just too many moving parts. I tried to play with static code
> analysis tools, but I haven't found anything that can reliably analyze
> locking in Xen. I even tried to write my own tool tailored specifically for
> PCI locking analysis. While it works on some synthetic tests, there is
> too much work to support actual Xen code.
> 
> I am not able to rework x86 IOMMU code. So, I am inclined to drop this
> patch series altogether. My current plan is to take minimal refcounting from
> this series to satisfy your comments for "vpci: use pcidevs locking to
> protect MMIO handlers".

I guess this may indeed be the "best" approach for now - introduce
refcounting to use where relevant for new work, and then slowly see about
replacing (dropping) locking where a refcount suffices when one is held.

Jan
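
To illustrate the refcounting idea discussed above (a holder of a reference
may use the object without a global lock keeping it alive), here is a minimal
sketch in C11. It is not the code from this series and not Xen's API: the
names demo_pdev, demo_pdev_get(), demo_pdev_put() and demo_freed are
hypothetical, and C11 atomics stand in for whatever primitives Xen would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted device; NOT Xen's struct pci_dev. */
struct demo_pdev {
    atomic_uint refcnt;   /* number of outstanding references */
    int msi_state;        /* placeholder for data a reference holder may use */
};

static int demo_freed;    /* counts frees; for demonstration only */

/* Allocate an object; the creator holds the first reference. */
static struct demo_pdev *demo_pdev_alloc(void)
{
    struct demo_pdev *pdev = calloc(1, sizeof(*pdev));

    if (pdev)
        atomic_init(&pdev->refcnt, 1);
    return pdev;
}

/*
 * Take an extra reference. The caller must already be entitled to the
 * object, e.g. by holding a reference or a lock on the list it was found on.
 */
static void demo_pdev_get(struct demo_pdev *pdev)
{
    assert(atomic_load(&pdev->refcnt) > 0);
    atomic_fetch_add_explicit(&pdev->refcnt, 1, memory_order_relaxed);
}

/* Drop a reference; the last put frees the object. Returns true if freed. */
static bool demo_pdev_put(struct demo_pdev *pdev)
{
    /* fetch_sub returns the previous value: 1 means this was the last ref. */
    if (atomic_fetch_sub_explicit(&pdev->refcnt, 1,
                                  memory_order_acq_rel) == 1) {
        free(pdev);
        demo_freed++;
        return true;
    }
    return false;
}
```

A reference only guarantees that the object is not freed while it is held.
It does not by itself serialize updates to fields such as quarantine or
fault.count, which is why the audit discussed in this thread is still needed
before any lock is dropped.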



Thread overview: 43+ messages
2022-08-31 14:10 [RFC PATCH 00/10] Rework PCI locking Volodymyr Babchuk
2022-08-31 14:10 ` [RFC PATCH 01/10] xen: pci: add per-domain pci list lock Volodymyr Babchuk
2023-01-26 23:18   ` Stefano Stabellini
2023-01-27  8:01     ` Jan Beulich
2023-02-14 23:38     ` Volodymyr Babchuk
2023-02-15  9:06       ` Jan Beulich
2022-08-31 14:10 ` [RFC PATCH 04/10] xen: add reference counter support Volodymyr Babchuk
2023-02-15 11:20   ` Jan Beulich
2023-02-17  1:56     ` Volodymyr Babchuk
2023-02-17  7:53       ` Jan Beulich
2023-02-19 22:34         ` Volodymyr Babchuk
2022-08-31 14:10 ` [RFC PATCH 03/10] xen: pci: introduce ats_list_lock Volodymyr Babchuk
2023-01-26 23:56   ` Stefano Stabellini
2023-01-27  8:13     ` Jan Beulich
2023-02-17  1:20       ` Volodymyr Babchuk
2023-02-17  7:39         ` Jan Beulich
2022-08-31 14:10 ` [RFC PATCH 02/10] xen: pci: add pci_seg->alldevs_lock Volodymyr Babchuk
2023-01-26 23:40   ` Stefano Stabellini
2023-02-28 16:32   ` Jan Beulich
2022-08-31 14:11 ` [RFC PATCH 05/10] xen: pci: introduce reference counting for pdev Volodymyr Babchuk
2023-01-27  0:43   ` Stefano Stabellini
2023-02-20 22:00     ` Volodymyr Babchuk
2023-02-28 17:06   ` Jan Beulich
2022-08-31 14:11 ` [RFC PATCH 06/10] xen: pci: print reference counter when dumping pci_devs Volodymyr Babchuk
2022-08-31 14:11 ` [RFC PATCH 07/10] xen: pci: add per-device locking Volodymyr Babchuk
2023-01-28  0:56   ` Stefano Stabellini
2023-02-20 22:29     ` Volodymyr Babchuk
2023-02-28 16:46   ` Jan Beulich
2022-08-31 14:11 ` [RFC PATCH 09/10] [RFC only] xen: iommu: remove last pcidevs_lock() calls in iommu Volodymyr Babchuk
2023-01-28  1:36   ` Stefano Stabellini
2023-02-20  0:41     ` Volodymyr Babchuk
2023-02-28 16:25   ` Jan Beulich
2022-08-31 14:11 ` [RFC PATCH 08/10] xen: pci: remove pcidev_[un]lock[ed] calls Volodymyr Babchuk
2023-01-28  1:32   ` Stefano Stabellini
2023-02-20 23:13     ` Volodymyr Babchuk
2023-02-21  9:50       ` Jan Beulich
2023-03-09  1:22         ` Volodymyr Babchuk
2023-03-09  9:06           ` Jan Beulich
2023-02-28 16:51     ` Jan Beulich
2022-08-31 14:11 ` [RFC PATCH 10/10] [RFC only] xen: pci: remove pcidev_lock() function Volodymyr Babchuk
2022-09-06 10:32 ` [RFC PATCH 00/10] Rework PCI locking Jan Beulich
2023-01-18 18:21   ` Julien Grall
2023-01-19  9:47     ` Jan Beulich
