xen-devel.lists.xenproject.org archive mirror
* [PATCH v6 0/5] VT-d Device-TLB flush issue
@ 2016-03-02 14:31 Quan Xu
  2016-03-02 14:31 ` [PATCH v6 1/5] IOMMU/MMU: Adjust top level functions for VT-d Device-TLB flush error Quan Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Quan Xu @ 2016-03-02 14:31 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, ian.campbell, george.dunlap, andrew.cooper3,
	dario.faggioli, tim, xen-devel, stefano.stabellini,
	Aravind.Gopalakrishnan, jun.nakajima, jinsong.liu, Quan Xu, keir,
	suravee.suthikulpanit

This patch series fixes the current timeout concern and also allows limited ATS support:

1. Check VT-d Device-TLB flush errors.
   This patch set checks for all kinds of errors, all the way up the call trees of the VT-d Device-TLB flush.

2. Make the pcidevs_lock a recursive one.

3. Reduce the spin timeout to 1 ms; it can be changed at boot time via 'vtd_qi_timeout'.
   For example:
           multiboot /boot/xen.gz ats=1 vtd_qi_timeout=100

4. Fix the VT-d Device-TLB flush timeout issue.
   If a Device-TLB flush times out, we hide the target ATS device and crash the domain owning it.
   If the impacted domain is the hardware domain, we just print a warning.
   A hidden device is disallowed from being further assigned to any domain.
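To make the policy in point 4 concrete, here is a minimal, self-contained C sketch; the types and the helper name are illustrative stand-ins, not the actual functions added by the series (which operate on struct pci_dev and struct domain inside Xen).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins for the real Xen structures. */
struct ats_dev { bool hidden; };
struct dom { bool is_hwdom; bool crashed; };

/* Policy from point 4: on a Device-TLB flush timeout, hide the ATS
 * device so it can never be assigned again, then crash the owning
 * domain -- unless it is the hardware domain, which only gets a
 * warning. */
static void on_device_tlb_flush_timeout(struct ats_dev *dev, struct dom *owner)
{
    dev->hidden = true;

    if ( owner->is_hwdom )
        fprintf(stderr, "Device-TLB flush timed out on a hwdom device\n");
    else
        owner->crashed = true;
}
```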

----

 * DMAR_OPERATION_TIMEOUT should also be reduced to a low number of milliseconds.
   As Kevin Tian mentioned in 'Revisit VT-d asynchronous flush issue', we also confirmed with the hardware team
   that 1 ms is large enough for the IOMMU's internal flush, so DMAR_OPERATION_TIMEOUT can be changed from 1000 ms to 1 ms.

   IOMMU_WAIT_OP() is only used for VT-d register reads/writes, and it also contains a panic. Whether and how to
   remove this panic needs further discussion for the next patch set.
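As a rough illustration of the IOMMU_WAIT_OP()-style pattern under discussion -- poll a condition until it holds or a deadline passes -- here is a self-contained sketch with a faked clock. The real macro reads VT-d registers and currently panics on timeout; this sketch instead returns an error, which is the direction being discussed.

```c
#include <assert.h>
#include <stdint.h>

#define QI_TIMEOUT_NS (1ull * 1000 * 1000)  /* 1 ms, down from 1000 ms */

/* Fake monotonic clock standing in for Xen's NOW(); each read
 * advances it by 1 us so the loop below always terminates. */
static uint64_t fake_now_ns;
static uint64_t fake_now(void) { return fake_now_ns += 1000; }

/* Spin until *flag becomes nonzero or the 1 ms deadline passes.
 * Returns 0 on success, -110 (-ETIMEDOUT) on timeout -- i.e. it
 * reports failure to the caller instead of panicking. */
static int wait_until_set(const volatile int *flag)
{
    uint64_t deadline = fake_now() + QI_TIMEOUT_NS;

    while ( !*flag )
        if ( fake_now() > deadline )
            return -110;

    return 0;
}
```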

 * The coming patch set will fix IOTLB/Context/IETC flush timeouts.

Changes in v6:

#patch 1/2
   * Make a reasonable attempt at splitting things, adjusting top level functions first and then
     working down to leaf ones.
   * Remove some pointless initializers.
   * Log errors, rather than returning an error value, for hardware domain init and crash shutdown.
   * When populating the IOMMU page table for a DomU, tear down the IOMMU page table on an IOMMU
     IOTLB flush error.
   * When the flush_iotlb_qi() return value is positive, all we need is to:
     - call iommu_flush_write_buffer() only when rc > 0;
     - return zero from this function when rc is positive, i.e. rc = 0 after calling iommu_flush_write_buffer().
   * Fix the issue left unaddressed in v4:
     http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg01555.html
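The positive-return convention described for flush_iotlb_qi() can be sketched as a small helper; iommu_flush_write_buffer() is stubbed here so the example is self-contained -- in the series it is a real VT-d operation.

```c
#include <assert.h>

/* Stub standing in for the real iommu_flush_write_buffer(); we only
 * count invocations here. */
static int write_buffer_flushes;
static void iommu_flush_write_buffer_stub(void) { write_buffer_flushes++; }

/* Convention from patches 1-2: a positive rc from the low-level flush
 * means "caches not flushed; do a write-buffer flush instead", a
 * negative rc is a hard error, and 0 is success.  Callers therefore
 * never see a positive value. */
static int normalize_flush_result(int rc)
{
    if ( rc > 0 )
    {
        iommu_flush_write_buffer_stub();
        rc = 0;
    }

    return rc;
}
```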


#patch 3
   * A new patch making the pcidevs_lock a recursive one (replacing the v4 pcidevs_lock related patches).

#patch 4
   * Add an entry in docs/misc/xen-command-line.markdown _alphabetically_.
   * Add a __must_check annotation on the function queue_invalidate_wait().

#patch 5
   * Fix stray blanks inside the parentheses.
   * Don't iterate over pdevs without holding pcidevs_lock; hold it whenever walking the pdev list.
   * Print SBDF in canonical ssss:bb:dd.f form.
   * Use the 'ret'/'rc' variables consistently within each function, and remove the pointless rc.
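The canonical SBDF form requested for patch 5 is just a fixed-width hex format over segment, bus, slot and function; a minimal sketch (the helper name is assumed, not from the series):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format an SBDF as ssss:bb:dd.f -- segment, bus, device (slot) and
 * function.  The slot is the top five bits of devfn, the function
 * the low three bits. */
static int format_sbdf(char *buf, size_t len,
                       uint16_t seg, uint8_t bus, uint8_t devfn)
{
    return snprintf(buf, len, "%04x:%02x:%02x.%u",
                    seg, bus, (unsigned)(devfn >> 3), (unsigned)(devfn & 7));
}
```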

Quan Xu (5):
  IOMMU/MMU: Adjust top level functions for VT-d Device-TLB flush error
  IOMMU/MMU: Adjust low level functions for VT-d Device-TLB flush error
  IOMMU: Make the pcidevs_lock a recursive one
  VT-d: Reduce spin timeout to 1ms, which can be boot-time changed
  VT-d: Fix vt-d Device-TLB flush timeout issue

 docs/misc/xen-command-line.markdown           |   7 ++
 xen/arch/x86/acpi/power.c                     |  14 ++-
 xen/arch/x86/domctl.c                         |   8 +-
 xen/arch/x86/hvm/vmsi.c                       |   4 +-
 xen/arch/x86/irq.c                            |   8 +-
 xen/arch/x86/mm.c                             |  13 ++-
 xen/arch/x86/mm/p2m-ept.c                     |  12 ++-
 xen/arch/x86/mm/p2m-pt.c                      |  12 ++-
 xen/arch/x86/msi.c                            |  16 +--
 xen/arch/x86/pci.c                            |   4 +-
 xen/arch/x86/physdev.c                        |  16 +--
 xen/common/grant_table.c                      |   5 +-
 xen/common/memory.c                           |   5 +-
 xen/common/sysctl.c                           |   4 +-
 xen/drivers/passthrough/amd/iommu_init.c      |  21 ++--
 xen/drivers/passthrough/amd/iommu_map.c       |   2 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c   |   6 +-
 xen/drivers/passthrough/arm/smmu.c            |  10 +-
 xen/drivers/passthrough/iommu.c               |  25 +++--
 xen/drivers/passthrough/pci.c                 |  99 ++++++++++---------
 xen/drivers/passthrough/vtd/extern.h          |   4 +-
 xen/drivers/passthrough/vtd/iommu.c           | 134 +++++++++++++++++---------
 xen/drivers/passthrough/vtd/qinval.c          |  80 +++++++++++++--
 xen/drivers/passthrough/vtd/quirks.c          |  26 +++--
 xen/drivers/passthrough/vtd/x86/ats.c         |  12 +++
 xen/drivers/passthrough/vtd/x86/vtd.c         |   7 +-
 xen/drivers/passthrough/x86/iommu.c           |   6 +-
 xen/drivers/video/vga.c                       |   4 +-
 xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |   2 +-
 xen/include/asm-x86/iommu.h                   |   2 +-
 xen/include/xen/iommu.h                       |  12 +--
 xen/include/xen/pci.h                         |   5 +-
 32 files changed, 399 insertions(+), 186 deletions(-)

-- 
1.9.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* [PATCH v6 1/5] IOMMU/MMU: Adjust top level functions for VT-d Device-TLB flush error
  2016-03-02 14:31 [PATCH v6 0/5] VT-d Device-TLB flush issue Quan Xu
@ 2016-03-02 14:31 ` Quan Xu
  2016-03-02 14:31 ` [PATCH v6 2/5] IOMMU/MMU: Adjust low " Quan Xu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Quan Xu @ 2016-03-02 14:31 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, ian.campbell, george.dunlap, andrew.cooper3,
	dario.faggioli, tim, xen-devel, stefano.stabellini,
	Aravind.Gopalakrishnan, jun.nakajima, jinsong.liu, Quan Xu, keir,
	suravee.suthikulpanit

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/arch/x86/acpi/power.c             | 14 +++++++++++++-
 xen/arch/x86/mm.c                     | 13 ++++++++-----
 xen/arch/x86/mm/p2m-ept.c             | 10 +++++++++-
 xen/arch/x86/mm/p2m-pt.c              | 12 ++++++++++--
 xen/common/grant_table.c              |  5 +++--
 xen/common/memory.c                   |  5 +++--
 xen/drivers/passthrough/iommu.c       | 16 +++++++++++-----
 xen/drivers/passthrough/vtd/x86/vtd.c |  7 +++++--
 xen/drivers/passthrough/x86/iommu.c   |  6 +++++-
 xen/include/xen/iommu.h               |  6 +++---
 10 files changed, 70 insertions(+), 24 deletions(-)

diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
index f41f0de..ed1173c 100644
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -45,6 +45,8 @@ void do_suspend_lowlevel(void);
 
 static int device_power_down(void)
 {
+    int err;
+
     console_suspend();
 
     time_suspend();
@@ -53,11 +55,21 @@ static int device_power_down(void)
 
     ioapic_suspend();
 
-    iommu_suspend();
+    err = iommu_suspend();
+    if ( err )
+        goto iommu_suspend_error;
 
     lapic_suspend();
 
     return 0;
+
+iommu_suspend_error:
+    ioapic_resume();
+    i8259A_resume();
+    time_resume();
+    console_resume();
+
+    return err;
 }
 
 static void device_power_up(void)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 202ff76..54d8fce 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2332,7 +2332,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
                            int preemptible)
 {
     unsigned long nx, x, y = page->u.inuse.type_info;
-    int rc = 0;
+    int rc = 0, ret = 0;
 
     ASSERT(!(type & ~(PGT_type_mask | PGT_pae_xen_l2)));
 
@@ -2443,11 +2443,11 @@ static int __get_page_type(struct page_info *page, unsigned long type,
         if ( d && is_pv_domain(d) && unlikely(need_iommu(d)) )
         {
             if ( (x & PGT_type_mask) == PGT_writable_page )
-                iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
+                ret = iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
             else if ( type == PGT_writable_page )
-                iommu_map_page(d, mfn_to_gmfn(d, page_to_mfn(page)),
-                               page_to_mfn(page),
-                               IOMMUF_readable|IOMMUF_writable);
+                ret = iommu_map_page(d, mfn_to_gmfn(d, page_to_mfn(page)),
+                                     page_to_mfn(page),
+                                     IOMMUF_readable|IOMMUF_writable);
         }
     }
 
@@ -2464,6 +2464,9 @@ static int __get_page_type(struct page_info *page, unsigned long type,
     if ( (x & PGT_partial) && !(nx & PGT_partial) )
         put_page(page);
 
+    if ( !rc )
+        rc = ret;
+
     return rc;
 }
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 9860c6c..d31b9af 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -834,7 +834,15 @@ out:
         {
             if ( iommu_flags )
                 for ( i = 0; i < (1 << order); i++ )
-                    iommu_map_page(d, gfn + i, mfn_x(mfn) + i, iommu_flags);
+                {
+                    rc = iommu_map_page(d, gfn + i, mfn_x(mfn) + i, iommu_flags);
+                    if ( rc )
+                    {
+                        while ( i-- > 0 )
+                            iommu_unmap_page(d, gfn + i);
+                        break;
+                    }
+                }
             else
                 for ( i = 0; i < (1 << order); i++ )
                     iommu_unmap_page(d, gfn + i);
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index 709920a..690fffc 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -675,8 +675,16 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
         }
         else if ( iommu_pte_flags )
             for ( i = 0; i < (1UL << page_order); i++ )
-                iommu_map_page(p2m->domain, gfn + i, mfn_x(mfn) + i,
-                               iommu_pte_flags);
+            {
+                rc = iommu_map_page(p2m->domain, gfn + i, mfn_x(mfn) + i,
+                                    iommu_pte_flags);
+                if ( rc )
+                {
+                    while ( i-- > 0 )
+                        iommu_unmap_page(p2m->domain, gfn + i);
+                    break;
+                }
+            }
         else
             for ( i = 0; i < (1UL << page_order); i++ )
                 iommu_unmap_page(p2m->domain, gfn + i);
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 2b449d5..f7dd731 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -919,8 +919,9 @@ __gnttab_map_grant_ref(
             {
                 nr_gets++;
                 (void)get_page(pg, rd);
-                if ( !(op->flags & GNTMAP_readonly) )
-                    get_page_type(pg, PGT_writable_page);
+                if ( !(op->flags & GNTMAP_readonly) &&
+                     !get_page_type(pg, PGT_writable_page) )
+                        goto could_not_pin;
             }
         }
     }
diff --git a/xen/common/memory.c b/xen/common/memory.c
index b541f4a1..c228d9f 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -631,8 +631,9 @@ static int xenmem_add_to_physmap(struct domain *d,
     if ( need_iommu(d) )
     {
         this_cpu(iommu_dont_flush_iotlb) = 0;
-        iommu_iotlb_flush(d, xatp->idx - done, done);
-        iommu_iotlb_flush(d, xatp->gpfn - done, done);
+        rc = iommu_iotlb_flush(d, xatp->idx - done, done);
+        if ( !rc )
+            rc = iommu_iotlb_flush(d, xatp->gpfn - done, done);
     }
 #endif
 
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index d5137733..daff00c 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -266,24 +266,28 @@ static void iommu_free_pagetables(unsigned long unused)
                             cpumask_cycle(smp_processor_id(), &cpu_online_map));
 }
 
-void iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count)
+int iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
 
     if ( !iommu_enabled || !hd->platform_ops || !hd->platform_ops->iotlb_flush )
-        return;
+        return 0;
 
     hd->platform_ops->iotlb_flush(d, gfn, page_count);
+
+    return 0;
 }
 
-void iommu_iotlb_flush_all(struct domain *d)
+int iommu_iotlb_flush_all(struct domain *d)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
 
     if ( !iommu_enabled || !hd->platform_ops || !hd->platform_ops->iotlb_flush_all )
-        return;
+        return 0;
 
     hd->platform_ops->iotlb_flush_all(d);
+
+    return 0;
 }
 
 int __init iommu_setup(void)
@@ -354,11 +358,13 @@ int iommu_do_domctl(
     return ret;
 }
 
-void iommu_suspend()
+int iommu_suspend()
 {
     const struct iommu_ops *ops = iommu_get_ops();
     if ( iommu_enabled )
         ops->suspend();
+
+    return 0;
 }
 
 void iommu_share_p2m_table(struct domain* d)
diff --git a/xen/drivers/passthrough/vtd/x86/vtd.c b/xen/drivers/passthrough/vtd/x86/vtd.c
index c0d6aab..e5ab10a 100644
--- a/xen/drivers/passthrough/vtd/x86/vtd.c
+++ b/xen/drivers/passthrough/vtd/x86/vtd.c
@@ -140,8 +140,11 @@ void __hwdom_init vtd_set_hwdom_mapping(struct domain *d)
 
         tmp = 1 << (PAGE_SHIFT - PAGE_SHIFT_4K);
         for ( j = 0; j < tmp; j++ )
-            iommu_map_page(d, pfn * tmp + j, pfn * tmp + j,
-                           IOMMUF_readable|IOMMUF_writable);
+            if ( iommu_map_page(d, pfn * tmp + j, pfn * tmp + j,
+                                IOMMUF_readable|IOMMUF_writable) )
+                printk(XENLOG_G_ERR
+                       "IOMMU: Map page gfn: 0x%lx(mfn: 0x%lx) failed.\n",
+                       pfn * tmp + j, pfn * tmp + j);
 
         if (!(i & (0xfffff >> (PAGE_SHIFT - PAGE_SHIFT_4K))))
             process_pending_softirqs();
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 8cbb655..d8e3c8f 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -104,7 +104,11 @@ int arch_iommu_populate_page_table(struct domain *d)
     this_cpu(iommu_dont_flush_iotlb) = 0;
 
     if ( !rc )
-        iommu_iotlb_flush_all(d);
+    {
+        rc = iommu_iotlb_flush_all(d);
+        if ( rc )
+            iommu_teardown(d);
+    }
     else if ( rc != -ERESTART )
         iommu_teardown(d);
 
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8f3a20e..1341355 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -167,7 +167,7 @@ struct iommu_ops {
     void (*dump_p2m_table)(struct domain *d);
 };
 
-void iommu_suspend(void);
+int iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
@@ -182,8 +182,8 @@ int iommu_do_pci_domctl(struct xen_domctl *, struct domain *d,
 int iommu_do_domctl(struct xen_domctl *, struct domain *d,
                     XEN_GUEST_HANDLE_PARAM(xen_domctl_t));
 
-void iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count);
-void iommu_iotlb_flush_all(struct domain *d);
+int iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count);
+int iommu_iotlb_flush_all(struct domain *d);
 
 /*
  * The purpose of the iommu_dont_flush_iotlb optional cpu flag is to
-- 
1.9.1




* [PATCH v6 2/5] IOMMU/MMU: Adjust low level functions for VT-d Device-TLB flush error
  2016-03-02 14:31 [PATCH v6 0/5] VT-d Device-TLB flush issue Quan Xu
  2016-03-02 14:31 ` [PATCH v6 1/5] IOMMU/MMU: Adjust top level functions for VT-d Device-TLB flush error Quan Xu
@ 2016-03-02 14:31 ` Quan Xu
  2016-03-02 14:31 ` [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one Quan Xu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Quan Xu @ 2016-03-02 14:31 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, ian.campbell, george.dunlap, andrew.cooper3,
	dario.faggioli, tim, xen-devel, stefano.stabellini,
	Aravind.Gopalakrishnan, jun.nakajima, jinsong.liu, Quan Xu, keir,
	suravee.suthikulpanit

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/arch/x86/mm/p2m-ept.c                     |   2 +-
 xen/drivers/passthrough/amd/iommu_init.c      |  12 ++-
 xen/drivers/passthrough/amd/pci_amd_iommu.c   |   2 +-
 xen/drivers/passthrough/arm/smmu.c            |  10 ++-
 xen/drivers/passthrough/iommu.c               |  17 ++--
 xen/drivers/passthrough/vtd/extern.h          |   2 +-
 xen/drivers/passthrough/vtd/iommu.c           | 120 ++++++++++++++++++--------
 xen/drivers/passthrough/vtd/quirks.c          |  26 +++---
 xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |   2 +-
 xen/include/asm-x86/iommu.h                   |   2 +-
 xen/include/xen/iommu.h                       |   6 +-
 11 files changed, 133 insertions(+), 68 deletions(-)

diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index d31b9af..2247972 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -829,7 +829,7 @@ out:
          need_modify_vtd_table )
     {
         if ( iommu_hap_pt_share )
-            iommu_pte_flush(d, gfn, &ept_entry->epte, order, vtd_pte_present);
+            rc = iommu_pte_flush(d, gfn, &ept_entry->epte, order, vtd_pte_present);
         else
         {
             if ( iommu_flags )
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index d90a2d2..5635650 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -1340,12 +1340,14 @@ static void invalidate_all_devices(void)
     iterate_ivrs_mappings(_invalidate_all_devices);
 }
 
-void amd_iommu_suspend(void)
+int amd_iommu_suspend(void)
 {
     struct amd_iommu *iommu;
 
     for_each_amd_iommu ( iommu )
         disable_iommu(iommu);
+
+    return 0;
 }
 
 void amd_iommu_resume(void)
@@ -1369,3 +1371,11 @@ void amd_iommu_resume(void)
         invalidate_all_domain_pages();
     }
 }
+
+void amd_iommu_crash_shutdown(void)
+{
+    struct amd_iommu *iommu;
+
+    for_each_amd_iommu ( iommu )
+        disable_iommu(iommu);
+}
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index c1c0b6b..8d9f358 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -628,6 +628,6 @@ const struct iommu_ops amd_iommu_ops = {
     .suspend = amd_iommu_suspend,
     .resume = amd_iommu_resume,
     .share_p2m = amd_iommu_share_p2m,
-    .crash_shutdown = amd_iommu_suspend,
+    .crash_shutdown = amd_iommu_crash_shutdown,
     .dump_p2m_table = amd_dump_p2m_table,
 };
diff --git a/xen/drivers/passthrough/arm/smmu.c b/xen/drivers/passthrough/arm/smmu.c
index bb08827..96bb568 100644
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2544,7 +2544,7 @@ static int force_stage = 2;
  */
 static u32 platform_features = ARM_SMMU_FEAT_COHERENT_WALK;
 
-static void arm_smmu_iotlb_flush_all(struct domain *d)
+static int arm_smmu_iotlb_flush_all(struct domain *d)
 {
 	struct arm_smmu_xen_domain *smmu_domain = domain_hvm_iommu(d)->arch.priv;
 	struct iommu_domain *cfg;
@@ -2561,13 +2561,15 @@ static void arm_smmu_iotlb_flush_all(struct domain *d)
 		arm_smmu_tlb_inv_context(cfg->priv);
 	}
 	spin_unlock(&smmu_domain->lock);
+
+	return 0;
 }
 
-static void arm_smmu_iotlb_flush(struct domain *d, unsigned long gfn,
-                                 unsigned int page_count)
+static int arm_smmu_iotlb_flush(struct domain *d, unsigned long gfn,
+                                unsigned int page_count)
 {
     /* ARM SMMU v1 doesn't have flush by VMA and VMID */
-    arm_smmu_iotlb_flush_all(d);
+    return arm_smmu_iotlb_flush_all(d);
 }
 
 static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index daff00c..fff60e9 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -171,7 +171,11 @@ void __hwdom_init iommu_hwdom_init(struct domain *d)
                  ((page->u.inuse.type_info & PGT_type_mask)
                   == PGT_writable_page) )
                 mapping |= IOMMUF_writable;
-            hd->platform_ops->map_page(d, gfn, mfn, mapping);
+            if ( hd->platform_ops->map_page(d, gfn, mfn, mapping) )
+                printk(XENLOG_G_ERR
+                       "IOMMU: Map page gfn: 0x%lx(mfn: 0x%lx) failed.\n",
+                       gfn, mfn);
+
             if ( !(i++ & 0xfffff) )
                 process_pending_softirqs();
         }
@@ -273,9 +277,7 @@ int iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_cou
     if ( !iommu_enabled || !hd->platform_ops || !hd->platform_ops->iotlb_flush )
         return 0;
 
-    hd->platform_ops->iotlb_flush(d, gfn, page_count);
-
-    return 0;
+    return hd->platform_ops->iotlb_flush(d, gfn, page_count);
 }
 
 int iommu_iotlb_flush_all(struct domain *d)
@@ -285,9 +287,7 @@ int iommu_iotlb_flush_all(struct domain *d)
     if ( !iommu_enabled || !hd->platform_ops || !hd->platform_ops->iotlb_flush_all )
         return 0;
 
-    hd->platform_ops->iotlb_flush_all(d);
-
-    return 0;
+    return hd->platform_ops->iotlb_flush_all(d);
 }
 
 int __init iommu_setup(void)
@@ -361,8 +361,9 @@ int iommu_do_domctl(
 int iommu_suspend()
 {
     const struct iommu_ops *ops = iommu_get_ops();
+
     if ( iommu_enabled )
-        ops->suspend();
+        return ops->suspend();
 
     return 0;
 }
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 8acf889..507c2e9 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -91,7 +91,7 @@ int is_igd_vt_enabled_quirk(void);
 void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct iommu* iommu);
 void vtd_ops_postamble_quirk(struct iommu* iommu);
-void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map);
+int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map);
 void pci_vtd_quirk(const struct pci_dev *);
 int platform_supports_intremap(void);
 int platform_supports_x2apic(void);
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index dd13865..6e62a42 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -542,11 +542,12 @@ static int iommu_flush_iotlb_psi(
     return status;
 }
 
-static void iommu_flush_all(void)
+static int iommu_flush_all(void)
 {
     struct acpi_drhd_unit *drhd;
     struct iommu *iommu;
     int flush_dev_iotlb;
+    int rc;
 
     flush_all_cache();
     for_each_drhd_unit ( drhd )
@@ -554,11 +555,24 @@ static void iommu_flush_all(void)
         iommu = drhd->iommu;
         iommu_flush_context_global(iommu, 0);
         flush_dev_iotlb = find_ats_dev_drhd(iommu) ? 1 : 0;
-        iommu_flush_iotlb_global(iommu, 0, flush_dev_iotlb);
+        rc = iommu_flush_iotlb_global(iommu, 0, flush_dev_iotlb);
+
+        if ( rc > 0 )
+        {
+            iommu_flush_write_buffer(iommu);
+            rc = 0;
+        }
+        else if ( rc < 0 )
+        {
+            printk(XENLOG_G_ERR "IOMMU: IOMMU flush all failed.\n");
+            break;
+        }
     }
+
+    return rc;
 }
 
-static void __intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn,
+static int __intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn,
         int dma_old_pte_present, unsigned int page_count)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
@@ -566,6 +580,7 @@ static void __intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn,
     struct iommu *iommu;
     int flush_dev_iotlb;
     int iommu_domid;
+    int rc = 0;
 
     /*
      * No need pcideves_lock here because we have flush
@@ -584,37 +599,40 @@ static void __intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn,
             continue;
 
         if ( page_count > 1 || gfn == -1 )
-        {
-            if ( iommu_flush_iotlb_dsi(iommu, iommu_domid,
-                        0, flush_dev_iotlb) )
-                iommu_flush_write_buffer(iommu);
-        }
+            rc = iommu_flush_iotlb_dsi(iommu, iommu_domid,
+                        0, flush_dev_iotlb);
         else
-        {
-            if ( iommu_flush_iotlb_psi(iommu, iommu_domid,
+            rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
                         (paddr_t)gfn << PAGE_SHIFT_4K, 0,
-                        !dma_old_pte_present, flush_dev_iotlb) )
-                iommu_flush_write_buffer(iommu);
+                        !dma_old_pte_present, flush_dev_iotlb);
+
+        if ( rc > 0 )
+        {
+            iommu_flush_write_buffer(iommu);
+            rc = 0;
         }
     }
+
+    return rc;
 }
 
-static void intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count)
+static int intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count)
 {
-    __intel_iommu_iotlb_flush(d, gfn, 1, page_count);
+    return __intel_iommu_iotlb_flush(d, gfn, 1, page_count);
 }
 
-static void intel_iommu_iotlb_flush_all(struct domain *d)
+static int intel_iommu_iotlb_flush_all(struct domain *d)
 {
-    __intel_iommu_iotlb_flush(d, 0, 0, 0);
+    return __intel_iommu_iotlb_flush(d, 0, 0, 0);
 }
 
 /* clear one page's page table */
-static void dma_pte_clear_one(struct domain *domain, u64 addr)
+static int dma_pte_clear_one(struct domain *domain, u64 addr)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(domain);
     struct dma_pte *page = NULL, *pte = NULL;
     u64 pg_maddr;
+    int rc = 0;
 
     spin_lock(&hd->arch.mapping_lock);
     /* get last level pte */
@@ -622,7 +640,7 @@ static void dma_pte_clear_one(struct domain *domain, u64 addr)
     if ( pg_maddr == 0 )
     {
         spin_unlock(&hd->arch.mapping_lock);
-        return;
+        return -ENOMEM;
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
@@ -632,7 +650,7 @@ static void dma_pte_clear_one(struct domain *domain, u64 addr)
     {
         spin_unlock(&hd->arch.mapping_lock);
         unmap_vtd_domain_page(page);
-        return;
+        return 0;
     }
 
     dma_clear_pte(*pte);
@@ -640,9 +658,11 @@ static void dma_pte_clear_one(struct domain *domain, u64 addr)
     iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
 
     if ( !this_cpu(iommu_dont_flush_iotlb) )
-        __intel_iommu_iotlb_flush(domain, addr >> PAGE_SHIFT_4K, 1, 1);
+        rc = __intel_iommu_iotlb_flush(domain, addr >> PAGE_SHIFT_4K, 1, 1);
 
     unmap_vtd_domain_page(page);
+
+    return rc;
 }
 
 static void iommu_free_pagetable(u64 pt_maddr, int level)
@@ -1285,6 +1305,7 @@ int domain_context_mapping_one(
     u64 maddr, pgd_maddr;
     u16 seg = iommu->intel->drhd->segment;
     int agaw;
+    int rc = 0;
 
     ASSERT(spin_is_locked(&pcidevs_lock));
     spin_lock(&iommu->lock);
@@ -1404,7 +1425,14 @@ int domain_context_mapping_one(
     else
     {
         int flush_dev_iotlb = find_ats_dev_drhd(iommu) ? 1 : 0;
-        iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+
+        rc = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+
+        if ( rc > 0 )
+        {
+            iommu_flush_write_buffer(iommu);
+            rc = 0;
+        }
     }
 
     set_bit(iommu->index, &hd->arch.iommu_bitmap);
@@ -1414,7 +1442,7 @@ int domain_context_mapping_one(
     if ( !seg )
         me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
 
-    return 0;
+    return rc;
 }
 
 static int domain_context_mapping(
@@ -1509,6 +1537,7 @@ int domain_context_unmap_one(
     struct context_entry *context, *context_entries;
     u64 maddr;
     int iommu_domid;
+    int rc = 0;
 
     ASSERT(spin_is_locked(&pcidevs_lock));
     spin_lock(&iommu->lock);
@@ -1543,7 +1572,14 @@ int domain_context_unmap_one(
     else
     {
         int flush_dev_iotlb = find_ats_dev_drhd(iommu) ? 1 : 0;
-        iommu_flush_iotlb_dsi(iommu, iommu_domid, 0, flush_dev_iotlb);
+
+        rc = iommu_flush_iotlb_dsi(iommu, iommu_domid, 0, flush_dev_iotlb);
+
+        if ( rc > 0 )
+        {
+            iommu_flush_write_buffer(iommu);
+            rc = 0;
+        }
     }
 
     spin_unlock(&iommu->lock);
@@ -1552,7 +1588,7 @@ int domain_context_unmap_one(
     if ( !iommu->intel->drhd->segment )
         me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
 
-    return 0;
+    return rc;
 }
 
 static int domain_context_unmap(
@@ -1742,7 +1778,7 @@ static int intel_iommu_map_page(
     unmap_vtd_domain_page(page);
 
     if ( !this_cpu(iommu_dont_flush_iotlb) )
-        __intel_iommu_iotlb_flush(d, gfn, dma_pte_present(old), 1);
+        return __intel_iommu_iotlb_flush(d, gfn, dma_pte_present(old), 1);
 
     return 0;
 }
@@ -1753,19 +1789,18 @@ static int intel_iommu_unmap_page(struct domain *d, unsigned long gfn)
     if ( iommu_passthrough && is_hardware_domain(d) )
         return 0;
 
-    dma_pte_clear_one(d, (paddr_t)gfn << PAGE_SHIFT_4K);
-
-    return 0;
+    return dma_pte_clear_one(d, (paddr_t)gfn << PAGE_SHIFT_4K);
 }
 
-void iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
-                     int order, int present)
+int iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
+                    int order, int present)
 {
     struct acpi_drhd_unit *drhd;
     struct iommu *iommu = NULL;
     struct hvm_iommu *hd = domain_hvm_iommu(d);
     int flush_dev_iotlb;
     int iommu_domid;
+    int rc = 0;
 
     iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
 
@@ -1779,11 +1814,17 @@ void iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
         iommu_domid= domain_iommu_domid(d, iommu);
         if ( iommu_domid == -1 )
             continue;
-        if ( iommu_flush_iotlb_psi(iommu, iommu_domid,
+        rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
                                    (paddr_t)gfn << PAGE_SHIFT_4K,
-                                   order, !present, flush_dev_iotlb) )
+                                   order, !present, flush_dev_iotlb);
+        if ( rc > 0 )
+        {
             iommu_flush_write_buffer(iommu);
+            rc = 0;
+        }
     }
+
+    return rc;
 }
 
 static int __init vtd_ept_page_compatible(struct iommu *iommu)
@@ -2103,8 +2144,8 @@ static int init_vtd_hw(void)
             return -EIO;
         }
     }
-    iommu_flush_all();
-    return 0;
+
+    return iommu_flush_all();
 }
 
 static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
@@ -2372,16 +2413,19 @@ static int intel_iommu_group_id(u16 seg, u8 bus, u8 devfn)
 }
 
 static u32 iommu_state[MAX_IOMMUS][MAX_IOMMU_REGS];
-static void vtd_suspend(void)
+static int vtd_suspend(void)
 {
     struct acpi_drhd_unit *drhd;
     struct iommu *iommu;
     u32    i;
+    int rc;
 
     if ( !iommu_enabled )
-        return;
+        return 0;
 
-    iommu_flush_all();
+    rc = iommu_flush_all();
+    if ( rc )
+        return rc;
 
     for_each_drhd_unit ( drhd )
     {
@@ -2410,6 +2454,8 @@ static void vtd_suspend(void)
         if ( !iommu_intremap && iommu_qinval )
             disable_qinval(iommu);
     }
+
+    return 0;
 }
 
 static void vtd_crash_shutdown(void)
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index 1888843..6a5fd87 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -332,10 +332,11 @@ void __init platform_quirks_init(void)
  * assigning Intel integrated wifi device to a guest.
  */
 
-static void map_me_phantom_function(struct domain *domain, u32 dev, int map)
+static int map_me_phantom_function(struct domain *domain, u32 dev, int map)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
+    int rc;
 
     /* find ME VT-d engine base on a real ME device */
     pdev = pci_get_pdev(0, 0, PCI_DEVFN(dev, 0));
@@ -343,23 +344,26 @@ static void map_me_phantom_function(struct domain *domain, u32 dev, int map)
 
     /* map or unmap ME phantom function */
     if ( map )
-        domain_context_mapping_one(domain, drhd->iommu, 0,
-                                   PCI_DEVFN(dev, 7), NULL);
+        rc = domain_context_mapping_one(domain, drhd->iommu, 0,
+                                        PCI_DEVFN(dev, 7), NULL);
     else
-        domain_context_unmap_one(domain, drhd->iommu, 0,
-                                 PCI_DEVFN(dev, 7));
+        rc = domain_context_unmap_one(domain, drhd->iommu, 0,
+                                      PCI_DEVFN(dev, 7));
+
+    return rc;
 }
 
-void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
+int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
 {
     u32 id;
+    int rc = 0;
 
     id = pci_conf_read32(0, 0, 0, 0, 0);
     if ( IS_CTG(id) )
     {
         /* quit if ME does not exist */
         if ( pci_conf_read32(0, 0, 3, 0, 0) == 0xffffffff )
-            return;
+            return -ENOENT;
 
         /* if device is WLAN device, map ME phantom device 0:3.7 */
         id = pci_conf_read32(0, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), 0);
@@ -373,7 +377,7 @@ void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                map_me_phantom_function(domain, 3, map);
+                rc = map_me_phantom_function(domain, 3, map);
                 break;
             default:
                 break;
@@ -383,7 +387,7 @@ void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
     {
         /* quit if ME does not exist */
         if ( pci_conf_read32(0, 0, 22, 0, 0) == 0xffffffff )
-            return;
+            return -ENOENT;
 
         /* if device is WLAN device, map ME phantom device 0:22.7 */
         id = pci_conf_read32(0, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), 0);
@@ -399,12 +403,14 @@ void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                map_me_phantom_function(domain, 22, map);
+                rc = map_me_phantom_function(domain, 22, map);
                 break;
             default:
                 break;
         }
     }
+
+    return rc;
 }
 
 void pci_vtd_quirk(const struct pci_dev *pdev)
diff --git a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
index 9c51172..f540fc8 100644
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -119,7 +119,7 @@ extern unsigned long *shared_intremap_inuse;
 
 /* power management support */
 void amd_iommu_resume(void);
-void amd_iommu_suspend(void);
+int amd_iommu_suspend(void);
 void amd_iommu_crash_shutdown(void);
 
 /* guest iommu support */
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 29203d7..cf2a269 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -26,7 +26,7 @@ int iommu_setup_hpet_msi(struct msi_desc *);
 
 /* While VT-d specific, this must get declared in a generic header. */
 int adjust_vtd_irq_affinities(void);
-void iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte, int order, int present);
+int iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte, int order, int present);
 int iommu_supports_eim(void);
 int iommu_enable_x2apic_IR(void);
 void iommu_disable_x2apic_IR(void);
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 1341355..cc487fa 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -157,12 +157,12 @@ struct iommu_ops {
     unsigned int (*read_apic_from_ire)(unsigned int apic, unsigned int reg);
     int (*setup_hpet_msi)(struct msi_desc *);
 #endif /* CONFIG_X86 */
-    void (*suspend)(void);
+    int (*suspend)(void);
     void (*resume)(void);
     void (*share_p2m)(struct domain *d);
     void (*crash_shutdown)(void);
-    void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
-    void (*iotlb_flush_all)(struct domain *d);
+    int (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
+    int (*iotlb_flush_all)(struct domain *d);
     int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_p2m_table)(struct domain *d);
 };
-- 
1.9.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-02 14:31 [PATCH v6 0/5] VT-d Device-TLB flush issue Quan Xu
  2016-03-02 14:31 ` [PATCH v6 1/5] IOMMU/MMU: Adjust top level functions for VT-d Device-TLB flush error Quan Xu
  2016-03-02 14:31 ` [PATCH v6 2/5] IOMMU/MMU: Adjust low " Quan Xu
@ 2016-03-02 14:31 ` Quan Xu
  2016-03-03 23:59   ` Dario Faggioli
  2016-03-02 14:31 ` [PATCH v6 4/5] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
  2016-03-02 14:31 ` [PATCH v6 5/5] VT-d: Fix vt-d Device-TLB flush timeout issue Quan Xu
  4 siblings, 1 reply; 21+ messages in thread
From: Quan Xu @ 2016-03-02 14:31 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, ian.campbell, george.dunlap, andrew.cooper3,
	dario.faggioli, tim, xen-devel, stefano.stabellini,
	Aravind.Gopalakrishnan, jun.nakajima, jinsong.liu, Quan Xu, keir,
	suravee.suthikulpanit

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/arch/x86/domctl.c                       |  8 +--
 xen/arch/x86/hvm/vmsi.c                     |  4 +-
 xen/arch/x86/irq.c                          |  8 +--
 xen/arch/x86/msi.c                          | 16 ++---
 xen/arch/x86/pci.c                          |  4 +-
 xen/arch/x86/physdev.c                      | 16 ++---
 xen/common/sysctl.c                         |  4 +-
 xen/drivers/passthrough/amd/iommu_init.c    |  9 ++-
 xen/drivers/passthrough/amd/iommu_map.c     |  2 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  4 +-
 xen/drivers/passthrough/pci.c               | 93 ++++++++++++++++-------------
 xen/drivers/passthrough/vtd/iommu.c         | 14 ++---
 xen/drivers/video/vga.c                     |  4 +-
 xen/include/xen/pci.h                       |  4 +-
 14 files changed, 102 insertions(+), 88 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index bf62a88..21cc161 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -427,9 +427,9 @@ long arch_do_domctl(
         ret = -ESRCH;
         if ( iommu_enabled )
         {
-            spin_lock(&pcidevs_lock);
+            pcidevs_lock();
             ret = pt_irq_create_bind(d, bind);
-            spin_unlock(&pcidevs_lock);
+            pcidevs_unlock();
         }
         if ( ret < 0 )
             printk(XENLOG_G_ERR "pt_irq_create_bind failed (%ld) for dom%d\n",
@@ -452,9 +452,9 @@ long arch_do_domctl(
 
         if ( iommu_enabled )
         {
-            spin_lock(&pcidevs_lock);
+            pcidevs_lock();
             ret = pt_irq_destroy_bind(d, bind);
-            spin_unlock(&pcidevs_lock);
+            pcidevs_unlock();
         }
         if ( ret < 0 )
             printk(XENLOG_G_ERR "pt_irq_destroy_bind failed (%ld) for dom%d\n",
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index ac838a9..8e0817b 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -388,7 +388,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
     struct msixtbl_entry *entry, *new_entry;
     int r = -EINVAL;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     ASSERT(spin_is_locked(&d->event_lock));
 
     /*
@@ -443,7 +443,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
     struct pci_dev *pdev;
     struct msixtbl_entry *entry;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     ASSERT(spin_is_locked(&d->event_lock));
 
     irq_desc = pirq_spin_lock_irq_desc(pirq, NULL);
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index bf2e822..68bdf19 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1955,7 +1955,7 @@ int map_domain_pirq(
         struct pci_dev *pdev;
         unsigned int nr = 0;
 
-        ASSERT(spin_is_locked(&pcidevs_lock));
+        ASSERT(pcidevs_is_locked());
 
         ret = -ENODEV;
         if ( !cpu_has_apic )
@@ -2100,7 +2100,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     ASSERT(spin_is_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
@@ -2226,7 +2226,7 @@ void free_domain_pirqs(struct domain *d)
 {
     int i;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     spin_lock(&d->event_lock);
 
     for ( i = 0; i < d->nr_pirqs; i++ )
@@ -2234,7 +2234,7 @@ void free_domain_pirqs(struct domain *d)
             unmap_domain_pirq(d, i);
 
     spin_unlock(&d->event_lock);
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 }
 
 static void dump_irqs(unsigned char key)
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 3dbb84d..6e5e33e 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -694,7 +694,7 @@ static int msi_capability_init(struct pci_dev *dev,
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
     if ( !pos )
         return -ENODEV;
@@ -852,7 +852,7 @@ static int msix_capability_init(struct pci_dev *dev,
     u8 func = PCI_FUNC(dev->devfn);
     bool_t maskall = msix->host_maskall;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
 
     control = pci_conf_read16(seg, bus, slot, func, msix_control_reg(pos));
     /*
@@ -1042,7 +1042,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
     struct pci_dev *pdev;
     struct msi_desc *old_desc;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     pdev = pci_get_pdev(msi->seg, msi->bus, msi->devfn);
     if ( !pdev )
         return -ENODEV;
@@ -1103,7 +1103,7 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
     u8 func = PCI_FUNC(msi->devfn);
     struct msi_desc *old_desc;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     pdev = pci_get_pdev(msi->seg, msi->bus, msi->devfn);
     pos = pci_find_cap_offset(msi->seg, msi->bus, slot, func, PCI_CAP_ID_MSIX);
     if ( !pdev || !pos )
@@ -1205,7 +1205,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool_t off)
     if ( !pos )
         return -ENODEV;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     pdev = pci_get_pdev(seg, bus, devfn);
     if ( !pdev )
         rc = -ENODEV;
@@ -1224,7 +1224,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool_t off)
         rc = msix_capability_init(pdev, pos, NULL, NULL,
                                   multi_msix_capable(control));
     }
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     return rc;
 }
@@ -1235,7 +1235,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool_t off)
  */
 int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
 {
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
 
     if ( !use_msi )
         return -EPERM;
@@ -1351,7 +1351,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
     unsigned int type = 0, pos = 0;
     u16 control = 0;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
 
     if ( !use_msi )
         return -EOPNOTSUPP;
diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
index 5bcecbb..892278d 100644
--- a/xen/arch/x86/pci.c
+++ b/xen/arch/x86/pci.c
@@ -82,13 +82,13 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
     if ( reg < 64 || reg >= 256 )
         return 0;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
 
     pdev = pci_get_pdev(seg, PCI_BUS(bdf), PCI_DEVFN2(bdf));
     if ( pdev )
         rc = pci_msi_conf_write_intercept(pdev, reg, size, data);
 
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     return rc;
 }
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 57b7800..1cb9b58 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -167,7 +167,7 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
         goto free_domain;
     }
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     /* Verify or get pirq. */
     spin_lock(&d->event_lock);
     pirq = domain_irq_to_pirq(d, irq);
@@ -237,7 +237,7 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
 
  done:
     spin_unlock(&d->event_lock);
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
     if ( ret != 0 )
         switch ( type )
         {
@@ -275,11 +275,11 @@ int physdev_unmap_pirq(domid_t domid, int pirq)
             goto free_domain;
     }
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     spin_lock(&d->event_lock);
     ret = unmap_domain_pirq(d, pirq);
     spin_unlock(&d->event_lock);
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
  free_domain:
     rcu_unlock_domain(d);
@@ -689,10 +689,10 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( copy_from_guest(&restore_msi, arg, 1) != 0 )
             break;
 
-        spin_lock(&pcidevs_lock);
+        pcidevs_lock();
         pdev = pci_get_pdev(0, restore_msi.bus, restore_msi.devfn);
         ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV;
-        spin_unlock(&pcidevs_lock);
+        pcidevs_unlock();
         break;
     }
 
@@ -704,10 +704,10 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( copy_from_guest(&dev, arg, 1) != 0 )
             break;
 
-        spin_lock(&pcidevs_lock);
+        pcidevs_lock();
         pdev = pci_get_pdev(dev.seg, dev.bus, dev.devfn);
         ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV;
-        spin_unlock(&pcidevs_lock);
+        pcidevs_unlock();
         break;
     }
 
diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
index 85e853f..401edb9 100644
--- a/xen/common/sysctl.c
+++ b/xen/common/sysctl.c
@@ -426,7 +426,7 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
                 break;
             }
 
-            spin_lock(&pcidevs_lock);
+            pcidevs_lock();
             pdev = pci_get_pdev(dev.seg, dev.bus, dev.devfn);
             if ( !pdev )
                 node = XEN_INVALID_DEV;
@@ -434,7 +434,7 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
                 node = XEN_INVALID_NODE_ID;
             else
                 node = pdev->node;
-            spin_unlock(&pcidevs_lock);
+            pcidevs_unlock();
 
             if ( copy_to_guest_offset(ti->nodes, i, &node, 1) )
             {
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index 5635650..8bb47b2 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -673,9 +673,9 @@ void parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
     bus = PCI_BUS(device_id);
     devfn = PCI_DEVFN2(device_id);
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     pdev = pci_get_real_pdev(iommu->seg, bus, devfn);
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     if ( pdev )
         guest_iommu_add_ppr_log(pdev->domain, entry);
@@ -778,7 +778,6 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu)
 {
     int irq, ret;
     hw_irq_controller *handler;
-    unsigned long flags;
     u16 control;
 
     irq = create_irq(NUMA_NO_NODE);
@@ -788,10 +787,10 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu)
         return 0;
     }
 
-    spin_lock_irqsave(&pcidevs_lock, flags);
+    pcidevs_lock();
     iommu->msi.dev = pci_get_pdev(iommu->seg, PCI_BUS(iommu->bdf),
                                   PCI_DEVFN2(iommu->bdf));
-    spin_unlock_irqrestore(&pcidevs_lock, flags);
+    pcidevs_unlock();
     if ( !iommu->msi.dev )
     {
         AMD_IOMMU_DEBUG("IOMMU: no pdev for %04x:%02x:%02x.%u\n",
diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
index 78862c9..9d91dfc 100644
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -593,7 +593,7 @@ static int update_paging_mode(struct domain *d, unsigned long gfn)
         hd->arch.paging_mode = level;
         hd->arch.root_table = new_root;
 
-        if ( !spin_is_locked(&pcidevs_lock) )
+        if ( !pcidevs_is_locked() )
             AMD_IOMMU_DEBUG("%s Try to access pdev_list "
                             "without aquiring pcidevs_lock.\n", __func__);
 
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 8d9f358..1a72c87 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -158,7 +158,7 @@ static void amd_iommu_setup_domain_device(
 
     spin_unlock_irqrestore(&iommu->lock, flags);
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
          !pci_ats_enabled(iommu->seg, bus, pdev->devfn) )
@@ -345,7 +345,7 @@ void amd_iommu_disable_domain_device(struct domain *domain,
     }
     spin_unlock_irqrestore(&iommu->lock, flags);
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
 
     if ( devfn == pdev->devfn &&
          pci_ats_device(iommu->seg, bus, devfn) &&
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 27b3ca7..d7e94e1 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -48,7 +48,7 @@ struct pci_seg {
     } bus2bridge[MAX_BUSES];
 };
 
-spinlock_t pcidevs_lock = SPIN_LOCK_UNLOCKED;
+static spinlock_t _pcidevs_lock = SPIN_LOCK_UNLOCKED;
 static struct radix_tree_root pci_segments;
 
 static inline struct pci_seg *get_pseg(u16 seg)
@@ -103,6 +103,21 @@ static int pci_segments_iterate(
     return rc;
 }
 
+void pcidevs_lock(void)
+{
+    spin_lock_recursive(&_pcidevs_lock);
+}
+
+void pcidevs_unlock(void)
+{
+    spin_unlock_recursive(&_pcidevs_lock);
+}
+
+int pcidevs_is_locked(void)
+{
+    return spin_is_locked(&_pcidevs_lock);
+}
+
 void __init pt_pci_init(void)
 {
     radix_tree_init(&pci_segments);
@@ -412,14 +427,14 @@ int __init pci_hide_device(int bus, int devfn)
     struct pci_dev *pdev;
     int rc = -ENOMEM;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     pdev = alloc_pdev(get_pseg(0), bus, devfn);
     if ( pdev )
     {
         _pci_hide_device(pdev);
         rc = 0;
     }
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     return rc;
 }
@@ -456,7 +471,7 @@ struct pci_dev *pci_get_pdev(int seg, int bus, int devfn)
     struct pci_seg *pseg = get_pseg(seg);
     struct pci_dev *pdev = NULL;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     ASSERT(seg != -1 || bus == -1);
     ASSERT(bus != -1 || devfn == -1);
 
@@ -581,9 +596,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         pdev_type = "extended function";
     else if (info->is_virtfn)
     {
-        spin_lock(&pcidevs_lock);
+        pcidevs_lock();
         pdev = pci_get_pdev(seg, info->physfn.bus, info->physfn.devfn);
-        spin_unlock(&pcidevs_lock);
+        pcidevs_unlock();
         if ( !pdev )
             pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
                            NULL, node);
@@ -601,7 +616,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
 
     ret = -ENOMEM;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     pseg = alloc_pseg(seg);
     if ( !pseg )
         goto out;
@@ -703,7 +718,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
     pci_enable_acs(pdev);
 
 out:
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
     if ( !ret )
     {
         printk(XENLOG_DEBUG "PCI add %s %04x:%02x:%02x.%u\n", pdev_type,
@@ -735,7 +750,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     if ( !pseg )
         return -ENODEV;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
         {
@@ -749,7 +764,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
             break;
         }
 
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
     return ret;
 }
 
@@ -807,11 +822,11 @@ int pci_release_devices(struct domain *d)
     u8 bus, devfn;
     int ret;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     ret = pci_clean_dpci_irqs(d);
     if ( ret )
     {
-        spin_unlock(&pcidevs_lock);
+        pcidevs_unlock();
         return ret;
     }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
@@ -823,7 +838,7 @@ int pci_release_devices(struct domain *d)
                    d->domain_id, pdev->seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     return 0;
 }
@@ -920,7 +935,7 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
     s_time_t now = NOW();
     u16 cword;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     pdev = pci_get_real_pdev(seg, bus, devfn);
     if ( pdev )
     {
@@ -931,7 +946,7 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
         if ( ++pdev->fault.count < PT_FAULT_THRESHOLD )
             pdev = NULL;
     }
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     if ( !pdev )
         return;
@@ -988,9 +1003,9 @@ int __init scan_pci_devices(void)
 {
     int ret;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     ret = pci_segments_iterate(_scan_pci_devices, NULL);
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     return ret;
 }
@@ -1054,17 +1069,17 @@ static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg
 
             if ( iommu_verbose )
             {
-                spin_unlock(&pcidevs_lock);
+                pcidevs_unlock();
                 process_pending_softirqs();
-                spin_lock(&pcidevs_lock);
+                pcidevs_lock();
             }
         }
 
         if ( !iommu_verbose )
         {
-            spin_unlock(&pcidevs_lock);
+            pcidevs_unlock();
             process_pending_softirqs();
-            spin_lock(&pcidevs_lock);
+            pcidevs_lock();
         }
     }
 
@@ -1076,9 +1091,9 @@ void __hwdom_init setup_hwdom_pci_devices(
 {
     struct setup_hwdom ctxt = { .d = d, .handler = handler };
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     pci_segments_iterate(_setup_hwdom_pci_devices, &ctxt);
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 }
 
 #ifdef CONFIG_ACPI
@@ -1206,9 +1221,9 @@ static int _dump_pci_devices(struct pci_seg *pseg, void *arg)
 static void dump_pci_devices(unsigned char ch)
 {
     printk("==== PCI devices ====\n");
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     pci_segments_iterate(_dump_pci_devices, NULL);
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 }
 
 struct keyhandler dump_pci_devices_keyhandler = {
@@ -1248,7 +1263,7 @@ int iommu_add_device(struct pci_dev *pdev)
     if ( !pdev->domain )
         return -EINVAL;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
 
     hd = domain_hvm_iommu(pdev->domain);
     if ( !iommu_enabled || !hd->platform_ops )
@@ -1277,7 +1292,7 @@ int iommu_enable_device(struct pci_dev *pdev)
     if ( !pdev->domain )
         return -EINVAL;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
 
     hd = domain_hvm_iommu(pdev->domain);
     if ( !iommu_enabled || !hd->platform_ops ||
@@ -1326,9 +1341,9 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
 {
     struct pci_dev *pdev;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     pdev = pci_get_pdev_by_domain(hardware_domain, seg, bus, devfn);
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     return pdev ? 0 : -EBUSY;
 }
@@ -1350,13 +1365,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
              p2m_get_hostp2m(d)->global_logdirty)) )
         return -EXDEV;
 
-    if ( !spin_trylock(&pcidevs_lock) )
-        return -ERESTART;
-
+    pcidevs_lock();
     rc = iommu_construct(d);
     if ( rc )
     {
-        spin_unlock(&pcidevs_lock);
+        pcidevs_unlock();
         return rc;
     }
 
@@ -1387,7 +1400,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
  done:
     if ( !has_arch_pdevs(d) && need_iommu(d) )
         iommu_teardown(d);
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     return rc;
 }
@@ -1402,7 +1415,7 @@ int deassign_device(struct domain *d, u16 seg, u8 bus, u8 devfn)
     if ( !iommu_enabled || !hd->platform_ops )
         return -EINVAL;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
     if ( !pdev )
         return -ENODEV;
@@ -1457,7 +1470,7 @@ static int iommu_get_device_group(
 
     group_id = ops->get_device_group_id(seg, bus, devfn);
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     for_each_pdev( d, pdev )
     {
         if ( (pdev->seg != seg) ||
@@ -1476,14 +1489,14 @@ static int iommu_get_device_group(
 
             if ( unlikely(copy_to_guest_offset(buf, i, &bdf, 1)) )
             {
-                spin_unlock(&pcidevs_lock);
+                pcidevs_unlock();
                 return -1;
             }
             i++;
         }
     }
 
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 
     return i;
 }
@@ -1611,9 +1624,9 @@ int iommu_do_pci_domctl(
         bus = PCI_BUS(machine_sbdf);
         devfn = PCI_DEVFN2(machine_sbdf);
 
-        spin_lock(&pcidevs_lock);
+        pcidevs_lock();
         ret = deassign_device(d, seg, bus, devfn);
-        spin_unlock(&pcidevs_lock);
+        pcidevs_unlock();
         if ( ret )
             printk(XENLOG_G_ERR
                    "deassign %04x:%02x:%02x.%u from dom%d failed (%d)\n",
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 6e62a42..a0e3922 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1307,7 +1307,7 @@ int domain_context_mapping_one(
     int agaw;
     int rc = 0;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     spin_lock(&iommu->lock);
     maddr = bus_to_context_maddr(iommu, bus);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
@@ -1456,7 +1456,7 @@ static int domain_context_mapping(
     if ( !drhd )
         return -ENODEV;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
 
     switch ( pdev->type )
     {
@@ -1539,7 +1539,7 @@ int domain_context_unmap_one(
     int iommu_domid;
     int rc = 0;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     spin_lock(&iommu->lock);
 
     maddr = bus_to_context_maddr(iommu, bus);
@@ -1861,7 +1861,7 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
     struct mapped_rmrr *mrmrr;
     struct hvm_iommu *hd = domain_hvm_iommu(d);
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
     ASSERT(rmrr->base_address < rmrr->end_address);
 
     /*
@@ -1926,7 +1926,7 @@ static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
     u16 bdf;
     int ret, i;
 
-    ASSERT(spin_is_locked(&pcidevs_lock));
+    ASSERT(pcidevs_is_locked());
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -2154,7 +2154,7 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
     u16 bdf;
     int ret, i;
 
-    spin_lock(&pcidevs_lock);
+    pcidevs_lock();
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         /*
@@ -2168,7 +2168,7 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
             dprintk(XENLOG_ERR VTDPREFIX,
                      "IOMMU: mapping reserved region failed\n");
     }
-    spin_unlock(&pcidevs_lock);
+    pcidevs_unlock();
 }
 
 int __init intel_vtd_setup(void)
diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c
index 40e5963..61c6b13 100644
--- a/xen/drivers/video/vga.c
+++ b/xen/drivers/video/vga.c
@@ -117,9 +117,9 @@ void __init video_endboot(void)
                 const struct pci_dev *pdev;
                 u8 b = bus, df = devfn, sb;
 
-                spin_lock(&pcidevs_lock);
+                pcidevs_lock();
                 pdev = pci_get_pdev(0, bus, devfn);
-                spin_unlock(&pcidevs_lock);
+                pcidevs_unlock();
 
                 if ( !pdev ||
                      pci_conf_read16(0, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index a5aef55..017aa0b 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -94,7 +94,9 @@ struct pci_dev {
  * interrupt handling related (the mask bit register).
  */
 
-extern spinlock_t pcidevs_lock;
+void pcidevs_lock(void);
+void pcidevs_unlock(void);
+int pcidevs_is_locked(void);
 
 bool_t pci_known_segment(u16 seg);
 bool_t pci_device_detect(u16 seg, u8 bus, u8 dev, u8 func);
-- 
1.9.1



* [PATCH v6 4/5] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed
  2016-03-02 14:31 [PATCH v6 0/5] VT-d Device-TLB flush issue Quan Xu
                   ` (2 preceding siblings ...)
  2016-03-02 14:31 ` [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one Quan Xu
@ 2016-03-02 14:31 ` Quan Xu
  2016-03-04  0:11   ` Dario Faggioli
  2016-03-02 14:31 ` [PATCH v6 5/5] VT-d: Fix vt-d Device-TLB flush timeout issue Quan Xu
  4 siblings, 1 reply; 21+ messages in thread
From: Quan Xu @ 2016-03-02 14:31 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, ian.campbell, george.dunlap, andrew.cooper3,
	dario.faggioli, tim, xen-devel, stefano.stabellini,
	Aravind.Gopalakrishnan, jun.nakajima, jinsong.liu, Quan Xu, keir,
	suravee.suthikulpanit

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 docs/misc/xen-command-line.markdown  |  7 +++++++
 xen/drivers/passthrough/vtd/qinval.c | 15 +++++++++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index a565c1b..1f5a111 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1466,6 +1466,13 @@ Note that if **watchdog** option is also specified vpmu will be turned off.
 As the BTS virtualisation is not 100% safe and because of the nehalem quirk
 don't use the vpmu flag on production systems with Intel cpus!
 
+### vtd\_qi\_timeout (VT-d)
+> `= <integer>`
+
+> Default: `1`
+
+Specify the timeout of the VT-d Queued Invalidation in milliseconds.
+
 ### watchdog
 > `= force | <boolean>`
 
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index b81b0bd..882b9f4 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -28,6 +28,11 @@
 #include "vtd.h"
 #include "extern.h"
 
+static unsigned int __read_mostly vtd_qi_timeout = 1;
+integer_param("vtd_qi_timeout", vtd_qi_timeout);
+
+#define IOMMU_QI_TIMEOUT (vtd_qi_timeout * MILLISECS(1))
+
 static void print_qi_regs(struct iommu *iommu)
 {
     u64 val;
@@ -130,6 +135,10 @@ static void queue_invalidate_iotlb(struct iommu *iommu,
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 }
 
+/*
+ * NB: errors must be checked here and propagated all the
+ * way up the call tree.
+ */
 static int queue_invalidate_wait(struct iommu *iommu,
     u8 iflag, u8 sw, u8 fn)
 {
@@ -167,10 +176,12 @@ static int queue_invalidate_wait(struct iommu *iommu,
         start_time = NOW();
         while ( poll_slot != QINVAL_STAT_DONE )
         {
-            if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) )
+            if ( NOW() > (start_time + IOMMU_QI_TIMEOUT) )
             {
                 print_qi_regs(iommu);
-                panic("queue invalidate wait descriptor was not executed");
+                printk(XENLOG_WARNING VTDPREFIX
+                       "Queue invalidate wait descriptor was timeout.\n");
+                return -ETIMEDOUT;
             }
             cpu_relax();
         }
-- 
1.9.1
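The behavioural change in the patch above — returning an error when the poll deadline passes, instead of panicking — can be sketched outside of Xen as follows. All names here (`wait_for_done`, the simulated clock, `SIM_ETIMEDOUT`) are invented stand-ins for `queue_invalidate_wait()`, `NOW()` and the QI poll slot, not the real Xen symbols:

```c
#include <assert.h>

/* Invented stand-ins for Xen's NOW(), the QI poll slot and -ETIMEDOUT. */
typedef unsigned long long s_time_t;

#define QINVAL_STAT_DONE 1
#define SIM_ETIMEDOUT    110

static s_time_t fake_now;                        /* simulated clock */
static s_time_t now(void) { return fake_now++; } /* each poll advances time */

/*
 * Poll a completion flag until it reads "done" or a deadline passes.
 * Returns 0 on success and a negative error on timeout, mirroring how
 * the patch replaces panic() with -ETIMEDOUT that callers must check.
 */
static int wait_for_done(const volatile int *poll_slot, s_time_t timeout)
{
    s_time_t start = now();

    while ( *poll_slot != QINVAL_STAT_DONE )
    {
        if ( now() > start + timeout )
            return -SIM_ETIMEDOUT;
        /* cpu_relax() would sit here in the real loop */
    }
    return 0;
}
```

The key point of the patch is not the loop itself but the contract change: once the function can return `-ETIMEDOUT`, every caller up the chain has to propagate or handle it.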



* [PATCH v6 5/5] VT-d: Fix vt-d Device-TLB flush timeout issue
  2016-03-02 14:31 [PATCH v6 0/5] VT-d Device-TLB flush issue Quan Xu
                   ` (3 preceding siblings ...)
  2016-03-02 14:31 ` [PATCH v6 4/5] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
@ 2016-03-02 14:31 ` Quan Xu
  4 siblings, 0 replies; 21+ messages in thread
From: Quan Xu @ 2016-03-02 14:31 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, ian.campbell, george.dunlap, andrew.cooper3,
	dario.faggioli, tim, xen-devel, stefano.stabellini,
	Aravind.Gopalakrishnan, jun.nakajima, jinsong.liu, Quan Xu, keir,
	suravee.suthikulpanit

If a Device-TLB flush times out, we hide the target ATS device
and crash the domain owning this ATS device. If the impacted domain
is the hardware domain, we just print a warning. The hidden device
is disallowed from being assigned to any domain thereafter.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/drivers/passthrough/pci.c         |  6 ++--
 xen/drivers/passthrough/vtd/extern.h  |  2 ++
 xen/drivers/passthrough/vtd/qinval.c  | 65 ++++++++++++++++++++++++++++++++---
 xen/drivers/passthrough/vtd/x86/ats.c | 12 +++++++
 xen/include/xen/pci.h                 |  1 +
 5 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index d7e94e1..53b382a 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -414,7 +414,7 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
     xfree(pdev);
 }
 
-static void _pci_hide_device(struct pci_dev *pdev)
+void pci_hide_existing_device(struct pci_dev *pdev)
 {
     if ( pdev->domain )
         return;
@@ -431,7 +431,7 @@ int __init pci_hide_device(int bus, int devfn)
     pdev = alloc_pdev(get_pseg(0), bus, devfn);
     if ( pdev )
     {
-        _pci_hide_device(pdev);
+        pci_hide_existing_device(pdev);
         rc = 0;
     }
     pcidevs_unlock();
@@ -461,7 +461,7 @@ int __init pci_ro_device(int seg, int bus, int devfn)
     }
 
     __set_bit(PCI_BDF2(bus, devfn), pseg->ro_map);
-    _pci_hide_device(pdev);
+    pci_hide_existing_device(pdev);
 
     return 0;
 }
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 507c2e9..dac1e5a 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -58,6 +58,8 @@ int ats_device(const struct pci_dev *, const struct acpi_drhd_unit *);
 
 int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
                          u64 addr, unsigned int size_order, u64 type);
+int dev_invalidate_iotlb_sync(struct iommu *iommu, u16 did,
+                              u16 seg, u8 bus, u8 devfn);
 
 int qinval_device_iotlb(struct iommu *iommu,
                         u32 max_invs_pend, u16 sid, u16 size, u64 addr);
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index 882b9f4..8ff2d94 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -233,6 +233,57 @@ int qinval_device_iotlb(struct iommu *iommu,
     return 0;
 }
 
+static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
+                                         u16 seg, u8 bus, u8 devfn)
+{
+    struct domain *d = NULL;
+    struct pci_dev *pdev;
+
+    if ( test_bit(did, iommu->domid_bitmap) )
+        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
+
+    if ( d == NULL )
+        return;
+
+    pcidevs_lock();
+    for_each_pdev(d, pdev)
+    {
+        if ( ( pdev->seg == seg ) &&
+             ( pdev->bus == bus ) &&
+             ( pdev->devfn == devfn ) )
+        {
+            ASSERT ( pdev->domain );
+            list_del(&pdev->domain_list);
+            pdev->domain = NULL;
+            pci_hide_existing_device(pdev);
+            break;
+        }
+    }
+
+    pcidevs_unlock();
+
+    if ( !is_hardware_domain(d) )
+        domain_crash(d);
+
+    rcu_unlock_domain(d);
+}
+
+int dev_invalidate_iotlb_sync(struct iommu *iommu, u16 did,
+                              u16 seg, u8 bus, u8 devfn)
+{
+    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
+    int rc = 0;
+
+    if ( qi_ctrl->qinval_maddr )
+    {
+        rc = queue_invalidate_wait(iommu, 0, 1, 1);
+        if ( rc == -ETIMEDOUT )
+            dev_invalidate_iotlb_timeout(iommu, did, seg, bus, devfn);
+    }
+
+    return rc;
+}
+
 static void queue_invalidate_iec(struct iommu *iommu, u8 granu, u8 im, u16 iidx)
 {
     unsigned long flags;
@@ -342,8 +393,6 @@ static int flush_iotlb_qi(
 
     if ( qi_ctrl->qinval_maddr != 0 )
     {
-        int rc;
-
         /* use queued invalidation */
         if (cap_write_drain(iommu->cap))
             dw = 1;
@@ -353,11 +402,17 @@ static int flush_iotlb_qi(
         queue_invalidate_iotlb(iommu,
                                type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
                                dw, did, size_order, 0, addr);
+
+        /*
+         * Before Device-TLB invalidation we need to synchronize
+         * invalidation completions with hardware.
+         */
+        ret = invalidate_sync(iommu);
+        if ( ret )
+             return ret;
+
         if ( flush_dev_iotlb )
             ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
-        rc = invalidate_sync(iommu);
-        if ( !ret )
-            ret = rc;
     }
     return ret;
 }
diff --git a/xen/drivers/passthrough/vtd/x86/ats.c b/xen/drivers/passthrough/vtd/x86/ats.c
index 7c797f6..a79a95a 100644
--- a/xen/drivers/passthrough/vtd/x86/ats.c
+++ b/xen/drivers/passthrough/vtd/x86/ats.c
@@ -162,6 +162,18 @@ int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
             return -EOPNOTSUPP;
         }
 
+        /*
+         * Synchronize with hardware for Device-TLB invalidate
+         * descriptor.
+         */
+        rc = dev_invalidate_iotlb_sync(iommu, did, pdev->seg,
+                                       pdev->bus, pdev->devfn);
+        if ( rc )
+            printk(XENLOG_ERR
+                   "Flush error %d on device %04x:%02x:%02x.%u.\n",
+                   rc, pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                   PCI_FUNC(pdev->devfn));
+
         if ( !ret )
             ret = rc;
     }
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 017aa0b..118d8fd 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -115,6 +115,7 @@ const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *, nodeid_t node);
 int pci_remove_device(u16 seg, u8 bus, u8 devfn);
+void pci_hide_existing_device(struct pci_dev *pdev);
 int pci_ro_device(int seg, int bus, int devfn);
 int pci_hide_device(int bus, int devfn);
 struct pci_dev *pci_get_pdev(int seg, int bus, int devfn);
-- 
1.9.1
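The policy implemented by `dev_invalidate_iotlb_timeout()` above can be modeled in isolation like this. The struct names and fields are invented stand-ins for Xen's `pci_dev`/`domain` handling, and `on_flush_timeout()` is a hypothetical name, not the actual function:

```c
#include <assert.h>
#include <stddef.h>

struct sim_pdev {
    int seg, bus, devfn;
    int hidden;
    struct sim_pdev *next;   /* the domain's device list */
};

struct sim_domain {
    int is_hwdom;
    int crashed;
    struct sim_pdev *pdevs;
};

/*
 * On a Device-TLB flush timeout: unlink the offending device from its
 * owner's list, hide it, and crash the owner unless it is the hardware
 * domain (which only gets a warning in the real code).
 */
static void on_flush_timeout(struct sim_domain *d, int seg, int bus, int devfn)
{
    struct sim_pdev **pp;

    for ( pp = &d->pdevs; *pp; pp = &(*pp)->next )
    {
        if ( (*pp)->seg == seg && (*pp)->bus == bus && (*pp)->devfn == devfn )
        {
            struct sim_pdev *pdev = *pp;

            *pp = pdev->next;  /* list_del(&pdev->domain_list) */
            pdev->next = NULL;
            pdev->hidden = 1;  /* pci_hide_existing_device(pdev) */
            break;
        }
    }

    if ( !d->is_hwdom )
        d->crashed = 1;        /* domain_crash(d) */
}
```

Hiding the device before (possibly) crashing the domain is what guarantees the "disallowed to be further assigned" property even for the hardware domain, which is never crashed.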



* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-02 14:31 ` [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one Quan Xu
@ 2016-03-03 23:59   ` Dario Faggioli
  2016-03-04  2:45     ` Xu, Quan
  0 siblings, 1 reply; 21+ messages in thread
From: Dario Faggioli @ 2016-03-03 23:59 UTC (permalink / raw)
  To: Quan Xu, jbeulich, kevin.tian
  Cc: feng.wu, jun.nakajima, george.dunlap, andrew.cooper3, xen-devel,
	stefano.stabellini, suravee.suthikulpanit, jinsong.liu



[I've removed some of the people that shouldn't be involved in this
discussion any longer from the Cc-list]

On Wed, 2016-03-02 at 22:31 +0800, Quan Xu wrote:
> Signed-off-by: Quan Xu <quan.xu@intel.com>
> 
So, this patch looks ok to me.

I spotted something, though, that I think needs some attention.

Since I'm jumping on this series only now, if this has been discussed
before and I missed it, sorry for the noise.

> @@ -788,10 +787,10 @@ static bool_t __init
> set_iommu_interrupt_handler(struct amd_iommu *iommu)
>          return 0;
>      }
>  
> -    spin_lock_irqsave(&pcidevs_lock, flags);
> +    pcidevs_lock();
>
So, spin_lock_irqsave() does:
  local_irq_save()
    local_save_flags()
    local_irq_disable()
  _spin_lock()

i.e., it saves the flags and disable interrupts.

pcidevs_lock() does:
  spin_lock_recursive()
    ... //handle recursion
    _spin_lock()

i.e., it does not disable interrupts.

And therefore it is possible that we are actually skipping disabling
interrupts (if they're not disabled already), isn't it?

And, of course, the same reasoning --mutatis mutandis-- applies to the
unlock side of things.

>      iommu->msi.dev = pci_get_pdev(iommu->seg, PCI_BUS(iommu->bdf),
>                                    PCI_DEVFN2(iommu->bdf));
> -    spin_unlock_irqrestore(&pcidevs_lock, flags);
> +    pcidevs_unlock();
>
i.e., spin_unlock_irqrestore() restores the flags, including the
interrupt enabled/disabled status, which means it can re-enable the
interrupts or not, depending on whether they were enabled at the time
of the previous spin_lock_irqsave(); pcidevs_unlock() just doesn't
affect interrupt enabling/disabling at all.

So, if the original code is correct in using
spin_lock_irqsave()/spin_unlock_irqrestore(), I think that we need
_irqsave() and _irqrestore() variants of recursive spinlocks, in order
to deal with this case.

However, from a quick inspection, it looks to me that:
 - this can only be called (during initialization), with interrupts
   enabled, so at least saving & restoring flags shouldn't be necessary
   (unless I missed where they can be disabled in the call chain
   from iommu_setup() toward set_iommu_interrupt_handler()).
 - This protects pci_get_pdev(); looking at other places where
   pci_get_pdev() is called, I don't think it is really necessary to
   disable interrupts.

If I'm right, it means that the original code could well have been
using just plain spin_lock() and spin_unlock(), and it would then be
fine to turn them into pcidevs_lock() and pcidevs_unlock(), and so no
need to add more spin_[un]lock_recursive() variants.
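A toy, single-threaded model of a recursive lock makes the difference discussed here concrete. Field names loosely follow Xen's `recurse_cpu`/`recurse_cnt`, but this is an illustrative sketch, not Xen's implementation; note that nothing in it saves, disables, or restores interrupt flags — exactly the gap versus `spin_lock_irqsave()`:

```c
#include <assert.h>

#define LOCK_FREE (-1)

struct rec_lock {
    int owner;        /* CPU id currently holding the lock, or LOCK_FREE */
    unsigned int cnt; /* recursion depth */
};

static void rec_lock_acquire(struct rec_lock *l, int cpu)
{
    if ( l->owner != cpu )
    {
        /* the real code spins on an inner plain spinlock here */
        l->owner = cpu;
    }
    l->cnt++;         /* same CPU re-entering just bumps the depth */
}

static void rec_lock_release(struct rec_lock *l, int cpu)
{
    assert(l->owner == cpu && l->cnt > 0);
    if ( --l->cnt == 0 )
        l->owner = LOCK_FREE;  /* only the outermost unlock frees it */
}
```

An `_irqsave()` variant would have to save the flags on the *outermost* acquisition only and restore them on the matching outermost release, which is why it isn't a trivial wrapper.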

That would also mean that the patch is indeed ok, but I'd add a mention
of this in the changelog.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v6 4/5] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed
  2016-03-02 14:31 ` [PATCH v6 4/5] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
@ 2016-03-04  0:11   ` Dario Faggioli
  2016-03-04  9:32     ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Dario Faggioli @ 2016-03-04  0:11 UTC (permalink / raw)
  To: Quan Xu, jbeulich, kevin.tian
  Cc: feng.wu, jun.nakajima, george.dunlap, andrew.cooper3, xen-devel,
	stefano.stabellini, suravee.suthikulpanit, jinsong.liu



[Trimmed the Cc-list a bit again]

On Wed, 2016-03-02 at 22:31 +0800, Quan Xu wrote:
> Signed-off-by: Quan Xu <quan.xu@intel.com>
>  
> diff --git a/xen/drivers/passthrough/vtd/qinval.c
> b/xen/drivers/passthrough/vtd/qinval.c
> index b81b0bd..882b9f4 100644
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -28,6 +28,11 @@
>  #include "vtd.h"
>  #include "extern.h"
>  
> +static unsigned int __read_mostly vtd_qi_timeout = 1;
> +integer_param("vtd_qi_timeout", vtd_qi_timeout);
> +
> +#define IOMMU_QI_TIMEOUT (vtd_qi_timeout * MILLISECS(1))
> +
>  static void print_qi_regs(struct iommu *iommu)
>  {
>      u64 val;
> @@ -130,6 +135,10 @@ static void queue_invalidate_iotlb(struct iommu
> *iommu,
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
>  }
>  
> +/*
> + * NB. We must check all kinds of error and all the way up the
> + * call trees.
> + */
>  static int queue_invalidate_wait(struct iommu *iommu,
>      u8 iflag, u8 sw, u8 fn)
>  {
> @@ -167,10 +176,12 @@ static int queue_invalidate_wait(struct iommu
> *iommu,
>          start_time = NOW();
>          while ( poll_slot != QINVAL_STAT_DONE )
>          {
> -            if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) )
> +            if ( NOW() > (start_time + IOMMU_QI_TIMEOUT) )
>
Since this now involves a time unit conversion, can't we:
 - instead of start_time, above, compute, once and for all:
     timeout = NOW() + IOMMU_QI_TIMEOUT;
 - check whether ( NOW() > timeout )

I appreciate that the default for vtd_qi_timeout is 1, so it's most
likely not that a big deal, but it still looks better to me.

>              {
>                  print_qi_regs(iommu);
> -                panic("queue invalidate wait descriptor was not
> executed");
> +                printk(XENLOG_WARNING VTDPREFIX
> +                       "Queue invalidate wait descriptor was
> timeout.\n");
>
"Queue invalidate wait descriptor timed out"  ?

Regards,
Dario



* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-03 23:59   ` Dario Faggioli
@ 2016-03-04  2:45     ` Xu, Quan
  2016-03-04  9:29       ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Xu, Quan @ 2016-03-04  2:45 UTC (permalink / raw)
  To: Dario Faggioli, jbeulich, Tian, Kevin
  Cc: Wu, Feng, Nakajima, Jun, george.dunlap, andrew.cooper3,
	xen-devel, stefano.stabellini, suravee.suthikulpanit,
	jinsong.liu

On March 04, 2016 7:59am, <dario.faggioli@citrix.com> wrote:
> On Wed, 2016-03-02 at 22:31 +0800, Quan Xu wrote:
> > Signed-off-by: Quan Xu <quan.xu@intel.com>
> >
> So, this patch looks ok to me.
> 
> I spotted something, though, that I think needs some attention.
> 
> Since I'm jumping on this series only now, if this has been discussed before and I
> missed it, sorry for the noise.
> 

Dario, welcome! It's never too late. :)

> > @@ -788,10 +787,10 @@ static bool_t __init
> > set_iommu_interrupt_handler(struct amd_iommu *iommu)
> >          return 0;
> >      }
> >
> > -    spin_lock_irqsave(&pcidevs_lock, flags);
> > +    pcidevs_lock();
> >
> So, spin_lock_irqsave() does:
>   local_irq_save()
>     local_save_flags()
>     local_irq_disable()
>   _spin_lock()
> 
> i.e., it saves the flags and disable interrupts.
> 
> pcidevs_lock() does:
>   spin_lock_recursive()
>     ... //handle recursion
>     _spin_lock()
> 
> i.e., it does not disable interrupts.
> 
> And therefore it is possible that we are actually skipping disabling interrupts (if
> they're not disabled already), isn't it?
> 
> And, of course, the same reasoning --mutatis mutandis-- applies to the unlock
> side of things.
> 
> >      iommu->msi.dev = pci_get_pdev(iommu->seg, PCI_BUS(iommu->bdf),
> >                                    PCI_DEVFN2(iommu->bdf));
> > -    spin_unlock_irqrestore(&pcidevs_lock, flags);
> > +    pcidevs_unlock();
> >
> i.e., spin_unlock_irqrestore() restore the flags, including the interrupt
> enabled/disabled status, which means it can re-enable the interrupts or not,
> depending on whether they were enabled at the time of the previous
> spin_lock_irqsave(); pcidevs_unlock() just don't affect interrupt
> enabling/disabling at all.
> 
> So, if the original code is correct in using
> spin_lock_irqsave()/spin_unlock_irqrestore(), I think that we need
> _irqsave() and _irqrestore() variants of recursive spinlocks, in order to deal with
> this case.
> 
> However, from a quick inspection, it looks to me that:
>  - this can only be called (during initialization), with interrupt
>    enabled, so least saving & restoring flags shouldn't be necessary
>    (unless I missed where they can be disabled in the call chain
>    from iommu_setup() toward set_iommu_interrupt_handler()).
>  - This protects pci_get_dev(); looking at other places where
>    pci_get_dev() is called, I don't think it is really necessary to
>    disable interrupts.
> 
> If I'm right, it means that the original code could well have been using just plain
> spin_lock() and spin_unlock(), and it would then be fine to turn them into
> pcidevs_lock() and pcidevs_unlock(), and so no need to add more
> spin_[un]lock_recursive() variants.
> 
> That would also mean that the patch is indeed ok, 

Yes, I fully agree with your analysis and conclusion.
I tried to implement _irqsave() and _irqrestore() variants of recursive spinlocks, but found that there is no need to add them.

Also I'd highlight the below modification:
-    if ( !spin_trylock(&pcidevs_lock) )
-        return -ERESTART;
-
+    pcidevs_lock();

IMO, it is right too.


> but I'd add a mention of this in the changelog.

In git log?


Quan

* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-04  2:45     ` Xu, Quan
@ 2016-03-04  9:29       ` Jan Beulich
  2016-03-04 11:54         ` Xu, Quan
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-03-04  9:29 UTC (permalink / raw)
  To: Dario Faggioli, Quan Xu
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, george.dunlap, jinsong.liu,
	xen-devel, stefano.stabellini, suravee.suthikulpanit,
	andrew.cooper3

>>> On 04.03.16 at 03:45, <quan.xu@intel.com> wrote:
> On March 04, 2016 7:59am, <dario.faggioli@citrix.com> wrote:
>> On Wed, 2016-03-02 at 22:31 +0800, Quan Xu wrote:
>> > @@ -788,10 +787,10 @@ static bool_t __init
>> > set_iommu_interrupt_handler(struct amd_iommu *iommu)
>> >          return 0;
>> >      }
>> >
>> > -    spin_lock_irqsave(&pcidevs_lock, flags);
>> > +    pcidevs_lock();
>> >
>> So, spin_lock_irqsave() does:
>>   local_irq_save()
>>     local_save_flags()
>>     local_irq_disable()
>>   _spin_lock()
>> 
>> i.e., it saves the flags and disable interrupts.
>> 
>> pcidevs_lock() does:
>>   spin_lock_recursive()
>>     ... //handle recursion
>>     _spin_lock()
>> 
>> i.e., it does not disable interrupts.
>> 
>> And therefore it is possible that we are actually skipping disabling
>> interrupts (if they're not disabled already), isn't it?
>> 
>> And, of course, the same reasoning --mutatis mutandis-- applies to the unlock
>> side of things.
>> 
>> >      iommu->msi.dev = pci_get_pdev(iommu->seg, PCI_BUS(iommu->bdf),
>> >                                    PCI_DEVFN2(iommu->bdf));
>> > -    spin_unlock_irqrestore(&pcidevs_lock, flags);
>> > +    pcidevs_unlock();
>> >
>> i.e., spin_unlock_irqrestore() restore the flags, including the interrupt
>> enabled/disabled status, which means it can re-enable the interrupts or not,
>> depending on whether they were enabled at the time of the previous
>> spin_lock_irqsave(); pcidevs_unlock() just don't affect interrupt
>> enabling/disabling at all.
>> 
>> So, if the original code is correct in using
>> spin_lock_irqsave()/spin_unlock_irqrestore(), I think that we need
>> _irqsave() and _irqrestore() variants of recursive spinlocks, in order to deal with
>> this case.
>> 
>> However, from a quick inspection, it looks to me that:
>>  - this can only be called (during initialization), with interrupt
>>    enabled, so least saving & restoring flags shouldn't be necessary
>>    (unless I missed where they can be disabled in the call chain
>>    from iommu_setup() toward set_iommu_interrupt_handler()).
>>  - This protects pci_get_dev(); looking at other places where
>>    pci_get_dev() is called, I don't think it is really necessary to
>>    disable interrupts.
>> 
>> If I'm right, it means that the original code could well have been
>> using just plain
>> spin_lock() and spin_unlock(), and it would then be fine to turn them into
>> pcidevs_lock() and pcidevs_unlock(), and so no need to add more
>> spin_[un]lock_recursive() variants.
>> 
>> That would also mean that the patch is indeed ok, 
> 
> Yes, I fully agree your analysis and conclusion.
> I tried to implement _irqsave() and _irqrestore() variants of recursive 
> spinlocks, but I found that it is no need to add them.
> 
> Also I'd highlight the below modification:
> -    if ( !spin_trylock(&pcidevs_lock) )
> -        return -ERESTART;
> -
> +    pcidevs_lock();
> 
> IMO, it is right too.

Well, I'll have to see where exactly this is (pulling such out of
context is pretty unhelpful), but I suspect it can't be replaced
like this.

>> but I'd add a mention of this in the changelog.
> 
> In git log?

To be honest, changes like this would better not be buried in a
big rework like the one here. Make it a prereq patch, where you
then will kind of automatically describe why it's correct. (I agree
current code is bogus, and we're not hitting the respective
BUG_ON() in check_lock() only because spin_debug gets set to
true only after most __init code has run.)
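The check referred to here can be paraphrased as follows. This is a loose, single-file restatement of the idea (a given lock must consistently be taken with IRQs either enabled or disabled), with invented names — not Xen's actual check_lock():

```c
#include <assert.h>

/* -1 = not yet observed; otherwise the IRQ-disabled state on first use */
struct lock_debug { int irq_safe; };

static int spin_debug; /* set to 1 only after most __init code has run */

static int check_lock(struct lock_debug *d, int irqs_disabled)
{
    if ( !spin_debug )
        return 1;                        /* early boot: nothing is checked */
    if ( d->irq_safe < 0 )
        d->irq_safe = irqs_disabled;     /* remember the first usage */
    return d->irq_safe == irqs_disabled; /* a 0 here would be BUG_ON() in Xen */
}
```

This is also why the bogus `spin_lock_irqsave(&pcidevs_lock, ...)` never trips the check: it only ever runs from `__init` context, before the debug flag is enabled.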

Jan



* Re: [PATCH v6 4/5] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed
  2016-03-04  0:11   ` Dario Faggioli
@ 2016-03-04  9:32     ` Jan Beulich
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2016-03-04  9:32 UTC (permalink / raw)
  To: Dario Faggioli, kevin.tian, Quan Xu
  Cc: feng.wu, jun.nakajima, george.dunlap, jinsong.liu, xen-devel,
	stefano.stabellini, suravee.suthikulpanit, andrew.cooper3

>>> On 04.03.16 at 01:11, <dario.faggioli@citrix.com> wrote:
> On Wed, 2016-03-02 at 22:31 +0800, Quan Xu wrote:
>> @@ -167,10 +176,12 @@ static int queue_invalidate_wait(struct iommu
>> *iommu,
>>          start_time = NOW();
>>          while ( poll_slot != QINVAL_STAT_DONE )
>>          {
>> -            if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) )
>> +            if ( NOW() > (start_time + IOMMU_QI_TIMEOUT) )
>>
> Since this now involves a time unit conversion, can't we:
>  - instead of start_time, above, compute, once and for all:
>      timeout = NOW() + IOMMU_QI_TIMEOUT;
>  - check whether ( NOW() > timeout )
> 
> I appreciate that the default for vtd_qi_timeout is 1, so it's most
> likely not that a big deal, but it still looks better to me.

The code might look a little neater, but it really isn't relevant from
a performance pov: We're trying to lose time here anyway.

>>              {
>>                  print_qi_regs(iommu);
>> -                panic("queue invalidate wait descriptor was not
>> executed");
>> +                printk(XENLOG_WARNING VTDPREFIX
>> +                       "Queue invalidate wait descriptor was
>> timeout.\n");
>>
> "Queue invalidate wait descriptor timed out"  ?

Indeed - Quan, I'm sure Kevin and/or I have mentioned this
before. If not for this specific instance, then for another one, in
which case I'd like to remind you to apply review comments not
just to the single instance they're given for.

Jan



* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-04  9:29       ` Jan Beulich
@ 2016-03-04 11:54         ` Xu, Quan
  2016-03-04 13:59           ` Dario Faggioli
  0 siblings, 1 reply; 21+ messages in thread
From: Xu, Quan @ 2016-03-04 11:54 UTC (permalink / raw)
  To: Jan Beulich, Dario Faggioli
  Cc: Tian, Kevin, Wu, Feng, Nakajima, Jun, george.dunlap, jinsong.liu,
	xen-devel, stefano.stabellini, suravee.suthikulpanit,
	andrew.cooper3

On March 04, 2016 5:29pm, <JBeulich@suse.com> wrote:
> >>> On 04.03.16 at 03:45, <quan.xu@intel.com> wrote:
> > On March 04, 2016 7:59am, <dario.faggioli@citrix.com> wrote:
> >> On Wed, 2016-03-02 at 22:31 +0800, Quan Xu wrote:
> >> > @@ -788,10 +787,10 @@ static bool_t __init
> >> > set_iommu_interrupt_handler(struct amd_iommu *iommu)
> >> >          return 0;
> >> >      }
> >> >
> >> > -    spin_lock_irqsave(&pcidevs_lock, flags);
> >> > +    pcidevs_lock();
> >> >
> >> So, spin_lock_irqsave() does:
> >>   local_irq_save()
> >>     local_save_flags()
> >>     local_irq_disable()
> >>   _spin_lock()
> >>
> >> i.e., it saves the flags and disable interrupts.
> >>
> >> pcidevs_lock() does:
> >>   spin_lock_recursive()
> >>     ... //handle recursion
> >>     _spin_lock()
> >>
> >> i.e., it does not disable interrupts.
> >>
> >> And therefore it is possible that we are actually skipping disabling
> >> interrupts (if they're not disabled already), isn't it?
> >>
> >> And, of course, the same reasoning --mutatis mutandis-- applies to
> >> the unlock side of things.
> >>
> >> >      iommu->msi.dev = pci_get_pdev(iommu->seg, PCI_BUS(iommu->bdf),
> >> >                                    PCI_DEVFN2(iommu->bdf));
> >> > -    spin_unlock_irqrestore(&pcidevs_lock, flags);
> >> > +    pcidevs_unlock();
> >> >
> >> i.e., spin_unlock_irqrestore() restore the flags, including the
> >> interrupt enabled/disabled status, which means it can re-enable the
> >> interrupts or not, depending on whether they were enabled at the time
> >> of the previous spin_lock_irqsave(); pcidevs_unlock() just don't
> >> affect interrupt enabling/disabling at all.
> >>
> >> So, if the original code is correct in using
> >> spin_lock_irqsave()/spin_unlock_irqrestore(), I think that we need
> >> _irqsave() and _irqrestore() variants of recursive spinlocks, in
> >> order to deal with this case.
> >>
> >> However, from a quick inspection, it looks to me that:
> >>  - this can only be called (during initialization), with interrupt
> >>    enabled, so least saving & restoring flags shouldn't be necessary
> >>    (unless I missed where they can be disabled in the call chain
> >>    from iommu_setup() toward set_iommu_interrupt_handler()).
> >>  - This protects pci_get_dev(); looking at other places where
> >>    pci_get_dev() is called, I don't think it is really necessary to
> >>    disable interrupts.
> >>
> >> If I'm right, it means that the original code could well have been
> >> using just plain
> >> spin_lock() and spin_unlock(), and it would then be fine to turn them
> >> into
> >> pcidevs_lock() and pcidevs_unlock(), and so no need to add more
> >> spin_[un]lock_recursive() variants.
> >>
> >> That would also mean that the patch is indeed ok,
> >
> > Yes, I fully agree your analysis and conclusion.
> > I tried to implement _irqsave() and _irqrestore() variants of
> > recursive spinlocks, but I found that it is no need to add them.
> >
> > Also I'd highlight the below modification:
> > -    if ( !spin_trylock(&pcidevs_lock) )
> > -        return -ERESTART;
> > -
> > +    pcidevs_lock();
> >
> > IMO, it is right too.
> 
> Well, I'll have to see where exactly this is (pulling such out of context is pretty
> unhelpful), but I suspect it can't be replaced like this.
> 

Jan, I am looking forward to your review.
Btw, it is in assign_device(), in the xen/drivers/passthrough/pci.c file.

> >> but I'd add a mention of this in the changelog.
> >
> > In git log?
> 
> To be honest, changes like this would better not be buried in a big rework like
> the one here. Make it a prereq patch, where you then will kind of automatically
> describe why it's correct. (I agree current code is bogus, and we're not hitting
> the respective
> BUG_ON() in check_lock() only because spin_debug gets set to true only after
> most __init code has run.)

Agreed, I will make it a prereq patch in v7.

Quan


* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-04 11:54         ` Xu, Quan
@ 2016-03-04 13:59           ` Dario Faggioli
  2016-03-04 14:09             ` Jan Beulich
  2016-03-07  7:05             ` Xu, Quan
  0 siblings, 2 replies; 21+ messages in thread
From: Dario Faggioli @ 2016-03-04 13:59 UTC (permalink / raw)
  To: Xu, Quan, Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, Nakajima, Jun, george.dunlap, jinsong.liu,
	xen-devel, stefano.stabellini, suravee.suthikulpanit,
	andrew.cooper3



On Fri, 2016-03-04 at 11:54 +0000, Xu, Quan wrote:
> On March 04, 2016 5:29pm, <JBeulich@suse.com> wrote:
> > On March 04, 2016 7:59am, <dario.faggioli@citrix.com> wrote:
> > 
> > > Also I'd highlight the below modification:
> > > -    if ( !spin_trylock(&pcidevs_lock) )
> > > -        return -ERESTART;
> > > -
> > > +    pcidevs_lock();
> > > 
> > > IMO, it is right too.
> > Well, I'll have to see where exactly this is (pulling such out of
> > context is pretty
> > unhelpful), but I suspect it can't be replaced like this.
> > 
> Jan, I am looking forward to your review.
> btw, It is in the assign_device(), in the
> xen/drivers/passthrough/pci.c file.
> 
Mmm... if multiple CPUs call assign_device() and the calls race, the
behavior before and after the patch does indeed look different to me.

In fact, in the current code, the CPUs that find the lock busy quit the
function immediately and a continuation is created. With the patch, on
the other hand, they would spin and eventually get the lock, one after
the other (if there is more than one), at some point.

Please double-check my reasoning, but it looks to me that what happens
when the hypercall is restarted (i.e., in the current code) is indeed
different from what happens if we just let the others take the lock and
execute the function (i.e., with the patch applied).

I suggest you try to figure out whether that is actually the case. Once
you have, feel free to report back here and ask for help finding a
solution, if you don't see one.
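The difference between the two behaviors can be sketched with ordinary
pthread primitives. This is a minimal illustration only: the function
names and the ERESTART_DEMO constant are made up for the demo and are
not Xen's actual API.

```c
#include <pthread.h>

/* Stand-in error code for the demo (not Xen's -ERESTART value). */
#define ERESTART_DEMO 85

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Style 1: current code -- bail out when the lock is busy and let the
 * caller arrange for the hypercall to be restarted. */
static int assign_device_trylock(void)
{
    if (pthread_mutex_trylock(&demo_lock) != 0)
        return -ERESTART_DEMO;   /* lock busy: fail fast */
    /* ... device assignment work ... */
    pthread_mutex_unlock(&demo_lock);
    return 0;
}

/* Style 2: patched code -- block until the lock is acquired, so a
 * racing caller waits instead of restarting. */
static int assign_device_blocking(void)
{
    pthread_mutex_lock(&demo_lock);  /* may wait arbitrarily long */
    /* ... device assignment work ... */
    pthread_mutex_unlock(&demo_lock);
    return 0;
}
```

Under contention the first style returns immediately with an error,
while the second only returns once the lock holder has released it —
which is exactly the semantic change being discussed.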

> > > > but I'd add a mention of this in the changelog.
> > > In git log?
> > To be honest, changes like this would better not be buried in a big
> > rework like
> > the one here. Make it a prereq patch, where you then will kind of
> > automatically
> > describe why it's correct. (I agree current code is bogus, and
> > we're not hitting
> > the respective
> > BUG_ON() in check_lock() only because spin_debug gets set to true
> > only after
> > most __init code has run.)
> Agreed, I would make a prereq patch in v7.
> 
Ok. In general, I agree with Jan.

In this case, I suggested just mentioning it in the changelog because we
currently basically have a bug, and I think it would be fine to have
something like "Doing what we do also serves as a fix for a bug found in
xxx yyy".

But it's indeed Jan's call, and his solution is certainly cleaner.
Not to mention that, in the case of assign_device(), the fix would most
likely need a patch of its own anyway.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-04 13:59           ` Dario Faggioli
@ 2016-03-04 14:09             ` Jan Beulich
  2016-03-07  7:05             ` Xu, Quan
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2016-03-04 14:09 UTC (permalink / raw)
  To: Dario Faggioli, Quan Xu
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, george.dunlap, jinsong.liu,
	xen-devel, stefano.stabellini, suravee.suthikulpanit,
	andrew.cooper3

>>> On 04.03.16 at 14:59, <dario.faggioli@citrix.com> wrote:
> On Fri, 2016-03-04 at 11:54 +0000, Xu, Quan wrote:
>> On March 04, 2016 5:29pm, <JBeulich@suse.com> wrote:
>> > To be honest, changes like this would better not be buried in a big
>> > rework like
>> > the one here. Make it a prereq patch, where you then will kind of
>> > automatically
>> > describe why it's correct. (I agree current code is bogus, and
>> > we're not hitting
>> > the respective
>> > BUG_ON() in check_lock() only because spin_debug gets set to true
>> > only after
>> > most __init code has run.)
>> Agreed, I would make a prereq patch in v7.
>> 
> Ok. In general, I agree with Jan.
> 
> In this case, I suggested just mentioning in changelog as we curently
> basically have a bug, and I think it would be fine to have something
> like "Doing what we do also serves as a fix for a bug found in xxx
> yyy".
> 
> But it's indeed Jan's call, and his solution is certainly cleaner.

Well, one of the reasons to separate out bug fixes is to make them
backportable.

Jan



* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-04 13:59           ` Dario Faggioli
  2016-03-04 14:09             ` Jan Beulich
@ 2016-03-07  7:05             ` Xu, Quan
  2016-03-07 11:14               ` Jan Beulich
  1 sibling, 1 reply; 21+ messages in thread
From: Xu, Quan @ 2016-03-07  7:05 UTC (permalink / raw)
  To: Dario Faggioli, Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, Nakajima, Jun, george.dunlap, jinsong.liu,
	xen-devel, stefano.stabellini, suravee.suthikulpanit,
	andrew.cooper3

On March 04, 2016 9:59pm, <dario.faggioli@citrix.com> wrote:
> On Fri, 2016-03-04 at 11:54 +0000, Xu, Quan wrote:
> > On March 04, 2016 5:29pm, <JBeulich@suse.com> wrote:
> > > On March 04, 2016 7:59am, <dario.faggioli@citrix.com> wrote:
> > >
> > > > Also I'd highlight the below modification:
> > > > -    if ( !spin_trylock(&pcidevs_lock) )
> > > > -        return -ERESTART;
> > > > -
> > > > +    pcidevs_lock();
> > > >
> > > > IMO, it is right too.
> > > Well, I'll have to see where exactly this is (pulling such out of
> > > context is pretty unhelpful), but I suspect it can't be replaced
> > > like this.
> > >
> > Jan, I am looking forward to your review.
> > btw, It is in the assign_device(), in the
> > xen/drivers/passthrough/pci.c file.
> >
> Mmm... If multiple cpus calls assign_device(), and the calls race, the behavior
> between before and after the patch looks indeed different to me.
> 
> In fact, in current code, the cpus that find the lock busy already, would quit the
> function immediately and a continuation is created. On the other hand, with the
> patch, they would spin and actually get the lock, one after the other (if there's
> more of them) at some point.
> 
> Please, double check my reasoning, but I looks to me that it is indeed different
> what happens when the hypercall is restarted (i.e., in current code) and what
> happens if we just let others take the lock and execute the function (i.e., with
> the patch applied).
> 
> I suggest you try to figure out whether that is actually the case. Once you've
> done, feel free to report here and ask for help for finding a solution, if you don't
> see one.
> 

Good idea.
For multiple CPUs calling assign_device(), let's assume there are 3 calls in parallel:
  (p1). xl pci-attach TestDom 0000:81:00.0
  (p2). xl pci-attach TestDom 0000:81:00.0
  (p3). xl pci-attach TestDom 0000:81:00.0

Furthermore, p1 and p2 run on pCPU1, and p3 runs on pCPU2.

After my patch, IIUC, the invocation flow might be as follows:

pCPU1                                         pCPU2
 .                                             .
 assign_device_1()
 {
   spin_lock_r(lock)                           .
   .                                           assign_device_3()
   assign_device_2()                             spin_lock_r(lock) <-- blocks
   {                                             x <-- spins
     spin_lock_r(lock)   <-- can continue        x <-- spins
     spin_unlock_r(lock) <-- *doesn't*           x <-- spins
   }                         release lock        x <-- spins
   .                                             x <-- spins
 }                                               x <-- spins
 spin_unlock_r(lock) <-- releases lock ------->  assign_device_3() continues,
 .                                               with the lock held
 .                                               .
 .                                               spin_unlock_r(lock) <-- lock is now free

Before my patch, the invocation flow might instead return at the point
of assign_device_2() / assign_device_3().

So, yes, if multiple CPUs call assign_device() and the calls race, the behavior before and after the patch is indeed different.
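The recursive-lock semantics in the flow above (an inner spin_unlock_r()
only drops the recursion count; the lock is really released when the
outermost unlock runs and the count reaches zero) can be modelled with a
toy lock. This is an illustration only, not Xen's actual spinlock
implementation.

```c
/* Toy model of a recursive lock. */
struct rec_lock {
    int owner;   /* -1 when free, else the owning CPU's id */
    int count;   /* recursion depth */
};

static void rec_lock_acquire(struct rec_lock *l, int cpu)
{
    /* Real code would spin here while another CPU owns the lock;
     * the same CPU may re-acquire without deadlocking. */
    l->owner = cpu;
    l->count++;
}

static void rec_lock_release(struct rec_lock *l, int cpu)
{
    /* Only the outermost release actually frees the lock. */
    (void)cpu;
    if (--l->count == 0)
        l->owner = -1;   /* now other CPUs may acquire it */
}

static int rec_lock_is_free(const struct rec_lock *l)
{
    return l->owner == -1;
}
```

Mirroring the diagram: assign_device_1() acquires, the nested
assign_device_2() acquires and releases without freeing the lock, and
only assign_device_1()'s final release lets pCPU2 proceed.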

I try to fix it with the following:
--------patch >> --------

--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -118,6 +118,11 @@ int pcidevs_is_locked(void)
     return spin_is_locked(&_pcidevs_lock);
 }

+int pcidevs_trylock(void)
+{
+    return spin_trylock_recursive(&_pcidevs_lock);
+}
+
 void __init pt_pci_init(void)
 {
     radix_tree_init(&pci_segments);
@@ -1365,7 +1370,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
              p2m_get_hostp2m(d)->global_logdirty)) )
         return -EXDEV;

-    if ( !spin_trylock(&pcidevs_lock) )
+    if ( !pcidevs_trylock() )
         return -ERESTART;

     rc = iommu_construct(d);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 017aa0b..b87571d 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -97,6 +97,7 @@ struct pci_dev {
 void pcidevs_lock(void);
 void pcidevs_unlock(void);
 int pcidevs_is_locked(void);
+int pcidevs_trylock(void);

 bool_t pci_known_segment(u16 seg);
 bool_t pci_device_detect(u16 seg, u8 bus, u8 dev, u8 func);

--------patch << -------- 

A quick question: should it return '-ERESTART', instead of '-EBUSY'?

There is also a similar case, cpu_hotplug:
   $cpu_up()--> cpu_hotplug_begin()-->get_cpu_maps()--> spin_trylock_recursive(&cpu_add_remove_lock)

Feel free to share your idea, and correct me if I'm wrong.

Quan

* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-07  7:05             ` Xu, Quan
@ 2016-03-07 11:14               ` Jan Beulich
  2016-03-07 11:23                 ` Xu, Quan
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-03-07 11:14 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, george.dunlap, jinsong.liu,
	Dario Faggioli, xen-devel, stefano.stabellini,
	suravee.suthikulpanit, andrew.cooper3

>>> On 07.03.16 at 08:05, <quan.xu@intel.com> wrote:
> I try to fix it with follow:
> --------patch >> -------- 
> 
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -118,6 +118,11 @@ int pcidevs_is_locked(void)
>      return spin_is_locked(&_pcidevs_lock);
>  }
> 
> +int pcidevs_trylock(void)
> +{
> +    return spin_trylock_recursive(&_pcidevs_lock);
> +}
> +
>  void __init pt_pci_init(void)
>  {
>      radix_tree_init(&pci_segments);
> @@ -1365,7 +1370,7 @@ static int assign_device(struct domain *d, u16 seg, u8 
> bus, u8 devfn, u32 flag)
>               p2m_get_hostp2m(d)->global_logdirty)) )
>          return -EXDEV;
> 
> -    if ( !spin_trylock(&pcidevs_lock) )
> +    if ( !pcidevs_trylock() )
>          return -ERESTART;

Exactly.

> A quick question, is it '-ERESTART', instead of '-EBUSY' ?

No idea what this question is about in this context.

Jan



* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-07 11:14               ` Jan Beulich
@ 2016-03-07 11:23                 ` Xu, Quan
  2016-03-07 11:36                   ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Xu, Quan @ 2016-03-07 11:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, Nakajima, Jun, george.dunlap, jinsong.liu,
	Dario Faggioli, xen-devel, stefano.stabellini,
	suravee.suthikulpanit, andrew.cooper3

On March 07, 2016 7:14pm, <JBeulich@suse.com> wrote:
> >>> On 07.03.16 at 08:05, <quan.xu@intel.com> wrote:


> > A quick question, is it '-ERESTART', instead of '-EBUSY' ?
> 
> No idea what this question is about in this context.
> 

It is in assign_device(), in xen/drivers/passthrough/pci.c.

static int assign_device()
{
   ....
    if ( !spin_trylock(&pcidevs_lock) )
        return -ERESTART;
   ....
}

Quan


* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-07 11:23                 ` Xu, Quan
@ 2016-03-07 11:36                   ` Jan Beulich
  2016-03-07 11:42                     ` Xu, Quan
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-03-07 11:36 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, george.dunlap, jinsong.liu,
	Dario Faggioli, xen-devel, stefano.stabellini,
	suravee.suthikulpanit, andrew.cooper3

>>> On 07.03.16 at 12:23, <quan.xu@intel.com> wrote:
> On March 07, 2016 7:14pm, <JBeulich@suse.com> wrote:
>> >>> On 07.03.16 at 08:05, <quan.xu@intel.com> wrote:
> 
> 
>> > A quick question, is it '-ERESTART', instead of '-EBUSY' ?
>> 
>> No idea what this question is about in this context.
>> 
> 
> it is in xen/drivers/passthrough/pci.c, assign_device().
> 
> static int assign_device()
> {
>    ....
>     if ( !spin_trylock(&pcidevs_lock) )
>         return -ERESTART;
>    ....
> }

But I still don't understand what you're trying to find out or point
out.

Jan



* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-07 11:36                   ` Jan Beulich
@ 2016-03-07 11:42                     ` Xu, Quan
  2016-03-07 11:49                       ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Xu, Quan @ 2016-03-07 11:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, Nakajima, Jun, george.dunlap, jinsong.liu,
	Dario Faggioli, xen-devel, stefano.stabellini,
	suravee.suthikulpanit, andrew.cooper3

On March 07, 2016 7:36pm, <JBeulich@suse.com> wrote:
> >>> On 07.03.16 at 12:23, <quan.xu@intel.com> wrote:
> > On March 07, 2016 7:14pm, <JBeulich@suse.com> wrote:
> >> >>> On 07.03.16 at 08:05, <quan.xu@intel.com> wrote:
> >
> >
> >> > A quick question, is it '-ERESTART', instead of '-EBUSY' ?
> >>
> >> No idea what this question is about in this context.
> >>
> >
> > it is in xen/drivers/passthrough/pci.c, assign_device().
> >
> > static int assign_device()
> > {
> >    ....
> >     if ( !spin_trylock(&pcidevs_lock) )
> >         return -ERESTART;
> >    ....
> > }
> 
> But I still don't understand what you're trying to find out or point out.
> 

Jan, sorry.
Currently the return error code is '-ERESTART' for 'if ( !spin_trylock(&pcidevs_lock) )' in assign_device(), in xen/drivers/passthrough/pci.c.
I think it should be '-EBUSY'.
Quan


* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-07 11:42                     ` Xu, Quan
@ 2016-03-07 11:49                       ` Jan Beulich
  2016-03-07 11:55                         ` Xu, Quan
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-03-07 11:49 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, george.dunlap, jinsong.liu,
	Dario Faggioli, xen-devel, stefano.stabellini,
	suravee.suthikulpanit, andrew.cooper3

>>> On 07.03.16 at 12:42, <quan.xu@intel.com> wrote:
> On March 07, 2016 7:36pm, <JBeulich@suse.com> wrote:
>> >>> On 07.03.16 at 12:23, <quan.xu@intel.com> wrote:
>> > On March 07, 2016 7:14pm, <JBeulich@suse.com> wrote:
>> >> >>> On 07.03.16 at 08:05, <quan.xu@intel.com> wrote:
>> >
>> >
>> >> > A quick question, is it '-ERESTART', instead of '-EBUSY' ?
>> >>
>> >> No idea what this question is about in this context.
>> >>
>> >
>> > it is in xen/drivers/passthrough/pci.c, assign_device().
>> >
>> > static int assign_device()
>> > {
>> >    ....
>> >     if ( !spin_trylock(&pcidevs_lock) )
>> >         return -ERESTART;
>> >    ....
>> > }
>> 
>> But I still don't understand what you're trying to find out or point out.
> 
> Jan, sorry.
> Now the return error code is '-ERESTART' for ' if ( 
> !spin_trylock(&pcidevs_lock) ', in assign_device(), in 
> xen/drivers/passthrough/pci.c.
> I think it would be '-EBUSY'.

Oh - definitely not. Just follow the call chain back up, and you
should find that this gets taken as an indication to create a
continuation, whereas -EBUSY would bubble back up to the
original (user space) caller (which is _not_ what we want here).
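The distinction Jan describes can be sketched as follows. The dispatcher
function, constants, and flag below are illustrative stand-ins, not
Xen's actual hypercall-continuation code.

```c
/* Stand-in error codes for the demo. */
#define DEMO_ERESTART 85
#define DEMO_EBUSY    16

static int continuation_created;

/* Sketch of the caller further up the chain: -ERESTART is intercepted
 * and turned into a continuation (the guest transparently re-executes
 * the hypercall), while any other error, -EBUSY included, is handed
 * straight back to the original caller. */
static int demo_dispatch(int rc)
{
    if (rc == -DEMO_ERESTART) {
        continuation_created = 1;  /* arrange to re-issue the call */
        return 0;                  /* the caller never sees the error */
    }
    return rc;                     /* -EBUSY etc. bubble back up */
}
```

This is why assign_device() must keep returning -ERESTART: the failure
to take the lock should cause a retry, not an error visible to user
space.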

Jan



* Re: [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one
  2016-03-07 11:49                       ` Jan Beulich
@ 2016-03-07 11:55                         ` Xu, Quan
  0 siblings, 0 replies; 21+ messages in thread
From: Xu, Quan @ 2016-03-07 11:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, Nakajima, Jun, george.dunlap, jinsong.liu,
	Dario Faggioli, xen-devel, stefano.stabellini,
	suravee.suthikulpanit, andrew.cooper3

On March 07, 2016 7:49pm, <JBeulich@suse.com> wrote:
> >>> On 07.03.16 at 12:42, <quan.xu@intel.com> wrote:
> > On March 07, 2016 7:36pm, <JBeulich@suse.com> wrote:
> >> >>> On 07.03.16 at 12:23, <quan.xu@intel.com> wrote:
> >> > On March 07, 2016 7:14pm, <JBeulich@suse.com> wrote:
> >> >> >>> On 07.03.16 at 08:05, <quan.xu@intel.com> wrote:
> >> >
> >> >
> >> >> > A quick question, is it '-ERESTART', instead of '-EBUSY' ?
> >> >>
> >> >> No idea what this question is about in this context.
> >> >>
> >> >
> >> > it is in xen/drivers/passthrough/pci.c, assign_device().
> >> >
> >> > static int assign_device()
> >> > {
> >> >    ....
> >> >     if ( !spin_trylock(&pcidevs_lock) )
> >> >         return -ERESTART;
> >> >    ....
> >> > }
> >>
> >> But I still don't understand what you're trying to find out or point out.
> >
> > Jan, sorry.
> > Now the return error code is '-ERESTART' for ' if (
> > !spin_trylock(&pcidevs_lock) ', in assign_device(), in
> > xen/drivers/passthrough/pci.c.
> > I think it would be '-EBUSY'.
> 
> Oh - definitely not. Just follow the call chain back up, and you should find that
> this gets taken as an indication to create a continuation, whereas -EBUSY would
> bubble back up to the original (user space) caller (which is _not_ what we want
> here).
> 

Got it. thanks.
Quan


end of thread, other threads:[~2016-03-07 11:55 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-02 14:31 [PATCH v6 0/5] VT-d Device-TLB flush issue Quan Xu
2016-03-02 14:31 ` [PATCH v6 1/5] IOMMU/MMU: Adjust top level functions for VT-d Device-TLB flush error Quan Xu
2016-03-02 14:31 ` [PATCH v6 2/5] IOMMU/MMU: Adjust low " Quan Xu
2016-03-02 14:31 ` [PATCH v6 3/5] IOMMU: Make the pcidevs_lock a recursive one Quan Xu
2016-03-03 23:59   ` Dario Faggioli
2016-03-04  2:45     ` Xu, Quan
2016-03-04  9:29       ` Jan Beulich
2016-03-04 11:54         ` Xu, Quan
2016-03-04 13:59           ` Dario Faggioli
2016-03-04 14:09             ` Jan Beulich
2016-03-07  7:05             ` Xu, Quan
2016-03-07 11:14               ` Jan Beulich
2016-03-07 11:23                 ` Xu, Quan
2016-03-07 11:36                   ` Jan Beulich
2016-03-07 11:42                     ` Xu, Quan
2016-03-07 11:49                       ` Jan Beulich
2016-03-07 11:55                         ` Xu, Quan
2016-03-02 14:31 ` [PATCH v6 4/5] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
2016-03-04  0:11   ` Dario Faggioli
2016-03-04  9:32     ` Jan Beulich
2016-03-02 14:31 ` [PATCH v6 5/5] VT-d: Fix vt-d Device-TLB flush timeout issue Quan Xu
