All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/3] VT-d Device-TLB flush issue
@ 2015-12-23  8:25 Quan Xu
  2015-12-23  8:25 ` [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error Quan Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 60+ messages in thread
From: Quan Xu @ 2015-12-23  8:25 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, eddie.dong, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, Quan Xu, keir

This patches are based on Kevin Tian's previous discussion 'Revisit VT-d asynchronous flush issue'.
Fix current timeout concern and also allow limited ATS support in a light way:

1. Check VT-d Device-TLB flush error.
   This patch checks all kinds of error and all the way up the call trees of VT-d Device-TLB flush.

2. Reduce spin timeout to 1ms, which can be boot-time changed with 'iommu_qi_timeout_ms'.
   For example:
           multiboot /boot/xen.gz ats=1 iommu_qi_timeout_ms=100

3. Fix vt-d Device-TLB flush timeout issue.
    Now if IOTLB/Context/IETC flush is timeout, panic hypervisor. The coming patch
    set will fix it.

    If Device-TLB flush is timeout, we'll hide the target ATS
    device and crash the domain owning this ATS device.

    If impacted domain is hardware domain, just throw out a warning.

    The hided Device will be disallowed to be further assigned to
    any domain.

--

 * DMAR_OPERATION_TIMEOUT should be also chopped down to a low number of milliseconds.
   As Kevin Tian mentioned in 'Revisit VT-d asynchronous flush issue', We also confirmed with hardware team
   that 1ms is large enough for IOMMU internal flush. So I can change DMAR_OPERATION_TIMEOUT from 1000 ms to 1 ms.

   IOMMU_WAIT_OP() is only for VT-d registers read/write, and there is also a panic. We need a further discussion
   whether or how to remove this panic in next patch set.

 * if IOTLB/Context/IETC flush is timeout, panic hypervisor. The coming patch set will fix it.


Quan Xu (3):
  VT-d: Check VT-d Device-TLB flush error.
  VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  VT-d: Fix vt-d Device-TLB flush timeout issue.

 xen/arch/x86/acpi/power.c                     |   8 +-
 xen/arch/x86/crash.c                          |   3 +-
 xen/arch/x86/domain_build.c                   |   5 +-
 xen/arch/x86/mm.c                             |  15 ++-
 xen/arch/x86/mm/p2m-ept.c                     |  14 ++-
 xen/arch/x86/mm/p2m-pt.c                      |  14 ++-
 xen/arch/x86/mm/p2m.c                         |  19 +++-
 xen/arch/x86/x86_64/mm.c                      |   7 +-
 xen/common/domain.c                           |   3 +-
 xen/common/grant_table.c                      |   5 +-
 xen/common/memory.c                           |  13 ++-
 xen/drivers/passthrough/amd/iommu_init.c      |   4 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c   |   4 +-
 xen/drivers/passthrough/arm/smmu.c            |  13 ++-
 xen/drivers/passthrough/iommu.c               |  47 +++++---
 xen/drivers/passthrough/pci.c                 |   2 +-
 xen/drivers/passthrough/vtd/extern.h          |   6 +-
 xen/drivers/passthrough/vtd/iommu.c           | 157 ++++++++++++++++++++------
 xen/drivers/passthrough/vtd/qinval.c          |  93 ++++++++++++++-
 xen/drivers/passthrough/vtd/quirks.c          |  26 +++--
 xen/drivers/passthrough/vtd/x86/ats.c         |  13 +++
 xen/drivers/passthrough/vtd/x86/vtd.c         |  13 ++-
 xen/drivers/passthrough/x86/iommu.c           |   6 +-
 xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |   4 +-
 xen/include/asm-x86/iommu.h                   |   2 +-
 xen/include/xen/iommu.h                       |  20 ++--
 26 files changed, 403 insertions(+), 113 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error.
  2015-12-23  8:25 [PATCH v4 0/3] VT-d Device-TLB flush issue Quan Xu
@ 2015-12-23  8:25 ` Quan Xu
  2015-12-25  2:53   ` Tian, Kevin
  2016-01-14 17:15   ` Jan Beulich
  2015-12-23  8:25 ` [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 60+ messages in thread
From: Quan Xu @ 2015-12-23  8:25 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, eddie.dong, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, Quan Xu, keir

This patch checks all kinds of error and all the way up
the call trees of VT-d Device-TLB flush.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/arch/x86/acpi/power.c                     |   8 +-
 xen/arch/x86/crash.c                          |   3 +-
 xen/arch/x86/domain_build.c                   |   5 +-
 xen/arch/x86/mm.c                             |  15 ++-
 xen/arch/x86/mm/p2m-ept.c                     |  14 ++-
 xen/arch/x86/mm/p2m-pt.c                      |  14 ++-
 xen/arch/x86/mm/p2m.c                         |  19 +++-
 xen/arch/x86/x86_64/mm.c                      |   7 +-
 xen/common/domain.c                           |   3 +-
 xen/common/grant_table.c                      |   5 +-
 xen/common/memory.c                           |  13 ++-
 xen/drivers/passthrough/amd/iommu_init.c      |   4 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c   |   4 +-
 xen/drivers/passthrough/arm/smmu.c            |  13 ++-
 xen/drivers/passthrough/iommu.c               |  47 +++++---
 xen/drivers/passthrough/vtd/extern.h          |   4 +-
 xen/drivers/passthrough/vtd/iommu.c           | 157 ++++++++++++++++++++------
 xen/drivers/passthrough/vtd/qinval.c          |   2 +-
 xen/drivers/passthrough/vtd/quirks.c          |  26 +++--
 xen/drivers/passthrough/vtd/x86/vtd.c         |  13 ++-
 xen/drivers/passthrough/x86/iommu.c           |   6 +-
 xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |   4 +-
 xen/include/asm-x86/iommu.h                   |   2 +-
 xen/include/xen/iommu.h                       |  20 ++--
 24 files changed, 300 insertions(+), 108 deletions(-)

diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
index f41f0de..1974721 100644
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -45,6 +45,8 @@ void do_suspend_lowlevel(void);
 
 static int device_power_down(void)
 {
+    int rc;
+
     console_suspend();
 
     time_suspend();
@@ -53,7 +55,9 @@ static int device_power_down(void)
 
     ioapic_suspend();
 
-    iommu_suspend();
+    rc = iommu_suspend();
+    if ( rc )
+        return rc;
 
     lapic_suspend();
 
@@ -182,7 +186,7 @@ static int enter_state(u32 state)
         error = tboot_s3_resume();
         break;
     case ACPI_STATE_S5:
-        acpi_enter_sleep_state(ACPI_STATE_S5);
+        error = acpi_enter_sleep_state(ACPI_STATE_S5);
         break;
     default:
         error = -EINVAL;
diff --git a/xen/arch/x86/crash.c b/xen/arch/x86/crash.c
index 888a214..59e1af6 100644
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -170,7 +170,8 @@ static void nmi_shootdown_cpus(void)
 
     /* Crash shutdown any IOMMU functionality as the crashdump kernel is not
      * happy when booting if interrupt/dma remapping is still enabled */
-    iommu_crash_shutdown();
+    if ( iommu_crash_shutdown() )
+        printk("Failed to shut down IOMMU.\n");
 
     __stop_this_cpu();
 
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index bca6fe7..a11bc2a 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -1627,7 +1627,10 @@ int __init construct_dom0(
     }
 
     if ( d->domain_id == hardware_domid )
-        iommu_hwdom_init(d);
+    {
+        if ( iommu_hwdom_init(d) )
+            printk("Xen warning : IOMMU hardware domain init failed.\n");
+    }
 
     return 0;
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 202ff76..3c1db05 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2443,11 +2443,18 @@ static int __get_page_type(struct page_info *page, unsigned long type,
         if ( d && is_pv_domain(d) && unlikely(need_iommu(d)) )
         {
             if ( (x & PGT_type_mask) == PGT_writable_page )
-                iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
+            {
+                rc = iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
+                return rc;
+            }
             else if ( type == PGT_writable_page )
-                iommu_map_page(d, mfn_to_gmfn(d, page_to_mfn(page)),
-                               page_to_mfn(page),
-                               IOMMUF_readable|IOMMUF_writable);
+            {
+                rc = iommu_map_page(d, mfn_to_gmfn(d, page_to_mfn(page)),
+                                    page_to_mfn(page),
+                                    IOMMUF_readable|IOMMUF_writable);
+                if ( rc )
+                    return rc;
+            }
         }
     }
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 9860c6c..2ed43b0 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -829,15 +829,23 @@ out:
          need_modify_vtd_table )
     {
         if ( iommu_hap_pt_share )
-            iommu_pte_flush(d, gfn, &ept_entry->epte, order, vtd_pte_present);
+            rc = iommu_pte_flush(d, gfn, &ept_entry->epte, order, vtd_pte_present);
         else
         {
             if ( iommu_flags )
                 for ( i = 0; i < (1 << order); i++ )
-                    iommu_map_page(d, gfn + i, mfn_x(mfn) + i, iommu_flags);
+                {
+                    rc = iommu_map_page(d, gfn + i, mfn_x(mfn) + i, iommu_flags);
+                    if ( rc )
+                        break;
+                }
             else
                 for ( i = 0; i < (1 << order); i++ )
-                    iommu_unmap_page(d, gfn + i);
+                {
+                    rc = iommu_unmap_page(d, gfn + i);
+                    if ( rc )
+                        break;
+                }
         }
     }
 
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index 709920a..b2b340d 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -675,11 +675,19 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
         }
         else if ( iommu_pte_flags )
             for ( i = 0; i < (1UL << page_order); i++ )
-                iommu_map_page(p2m->domain, gfn + i, mfn_x(mfn) + i,
-                               iommu_pte_flags);
+            {
+                rc = iommu_map_page(p2m->domain, gfn + i, mfn_x(mfn) + i,
+                                    iommu_pte_flags);
+                if ( rc )
+                    goto out;
+            }
         else
             for ( i = 0; i < (1UL << page_order); i++ )
-                iommu_unmap_page(p2m->domain, gfn + i);
+            {
+                rc = iommu_unmap_page(p2m->domain, gfn + i);
+                if ( rc )
+                    goto out;
+            }
     }
 
     /*
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index c6b883d..6b43da0 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -605,12 +605,17 @@ p2m_remove_page(struct p2m_domain *p2m, unsigned long gfn, unsigned long mfn,
     mfn_t mfn_return;
     p2m_type_t t;
     p2m_access_t a;
+    int rc;
 
     if ( !paging_mode_translate(p2m->domain) )
     {
         if ( need_iommu(p2m->domain) )
             for ( i = 0; i < (1 << page_order); i++ )
-                iommu_unmap_page(p2m->domain, mfn + i);
+            {
+                rc = iommu_unmap_page(p2m->domain, mfn + i);
+                if ( rc )
+                    return rc;
+            }
         return 0;
     }
 
@@ -654,7 +659,7 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
     p2m_access_t a;
     mfn_t omfn;
     int pod_count = 0;
-    int rc = 0;
+    int rc = 0, ret = 0;
 
     if ( !paging_mode_translate(d) )
     {
@@ -667,7 +672,15 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
                 if ( rc != 0 )
                 {
                     while ( i-- > 0 )
-                        iommu_unmap_page(d, mfn + i);
+                    {
+                        ret = iommu_unmap_page(d, mfn + i);
+                        if ( ret )
+                            break;
+                    }
+
+                    if ( ret )
+                        rc = ret;
+
                     return rc;
                 }
             }
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index d918002..fe7b10c 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -1438,7 +1438,12 @@ int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm)
         if ( i != epfn )
         {
             while (i-- > old_max)
-                iommu_unmap_page(hardware_domain, i);
+            {
+                ret = iommu_unmap_page(hardware_domain, i);
+                if ( ret )
+                    break;
+            }
+
             goto destroy_m2p;
         }
     }
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 1b9fcfc..11f526d 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -228,7 +228,8 @@ static int late_hwdom_init(struct domain *d)
 
     rcu_unlock_domain(dom0);
 
-    iommu_hwdom_init(d);
+    if ( iommu_hwdom_init(d) )
+        printk("Xen warning : IOMMU hardware domain init failed.\n");
 
     return rv;
 #else
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 2b449d5..5faa61e 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -920,7 +920,10 @@ __gnttab_map_grant_ref(
                 nr_gets++;
                 (void)get_page(pg, rd);
                 if ( !(op->flags & GNTMAP_readonly) )
-                    get_page_type(pg, PGT_writable_page);
+                {
+                    if ( get_page_type(pg, PGT_writable_page) )
+                        goto could_not_pin;
+                }
             }
         }
     }
diff --git a/xen/common/memory.c b/xen/common/memory.c
index b541f4a1..989b461 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -593,6 +593,10 @@ static int xenmem_add_to_physmap(struct domain *d,
     unsigned int done = 0;
     long rc = 0;
 
+#ifdef HAS_PASSTHROUGH
+    int ret = 0;
+#endif
+
     if ( xatp->space != XENMAPSPACE_gmfn_range )
         return xenmem_add_to_physmap_one(d, xatp->space, DOMID_INVALID,
                                          xatp->idx, xatp->gpfn);
@@ -631,8 +635,13 @@ static int xenmem_add_to_physmap(struct domain *d,
     if ( need_iommu(d) )
     {
         this_cpu(iommu_dont_flush_iotlb) = 0;
-        iommu_iotlb_flush(d, xatp->idx - done, done);
-        iommu_iotlb_flush(d, xatp->gpfn - done, done);
+        ret = iommu_iotlb_flush(d, xatp->idx - done, done);
+        if ( ret )
+            return ret;
+
+        ret = iommu_iotlb_flush(d, xatp->gpfn - done, done);
+        if ( ret )
+            return ret;
     }
 #endif
 
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index d90a2d2..ec47e22 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -1340,12 +1340,14 @@ static void invalidate_all_devices(void)
     iterate_ivrs_mappings(_invalidate_all_devices);
 }
 
-void amd_iommu_suspend(void)
+int amd_iommu_suspend(void)
 {
     struct amd_iommu *iommu;
 
     for_each_amd_iommu ( iommu )
         disable_iommu(iommu);
+
+    return 0;
 }
 
 void amd_iommu_resume(void)
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index c1c0b6b..449de13 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -280,7 +280,7 @@ static int amd_iommu_domain_init(struct domain *d)
     return 0;
 }
 
-static void __hwdom_init amd_iommu_hwdom_init(struct domain *d)
+static int __hwdom_init amd_iommu_hwdom_init(struct domain *d)
 {
     unsigned long i; 
     const struct amd_iommu *iommu;
@@ -312,6 +312,8 @@ static void __hwdom_init amd_iommu_hwdom_init(struct domain *d)
             BUG();
 
     setup_hwdom_pci_devices(d, amd_iommu_setup_hwdom_device);
+
+    return 0;
 }
 
 void amd_iommu_disable_domain_device(struct domain *domain,
diff --git a/xen/drivers/passthrough/arm/smmu.c b/xen/drivers/passthrough/arm/smmu.c
index bb08827..155b7f3 100644
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2544,7 +2544,7 @@ static int force_stage = 2;
  */
 static u32 platform_features = ARM_SMMU_FEAT_COHERENT_WALK;
 
-static void arm_smmu_iotlb_flush_all(struct domain *d)
+static int arm_smmu_iotlb_flush_all(struct domain *d)
 {
 	struct arm_smmu_xen_domain *smmu_domain = domain_hvm_iommu(d)->arch.priv;
 	struct iommu_domain *cfg;
@@ -2561,13 +2561,15 @@ static void arm_smmu_iotlb_flush_all(struct domain *d)
 		arm_smmu_tlb_inv_context(cfg->priv);
 	}
 	spin_unlock(&smmu_domain->lock);
+
+        return 0;
 }
 
-static void arm_smmu_iotlb_flush(struct domain *d, unsigned long gfn,
-                                 unsigned int page_count)
+static int arm_smmu_iotlb_flush(struct domain *d, unsigned long gfn,
+                                unsigned int page_count)
 {
     /* ARM SMMU v1 doesn't have flush by VMA and VMID */
-    arm_smmu_iotlb_flush_all(d);
+    return arm_smmu_iotlb_flush_all(d);
 }
 
 static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
@@ -2737,8 +2739,9 @@ static int arm_smmu_iommu_domain_init(struct domain *d)
 	return 0;
 }
 
-static void __hwdom_init arm_smmu_iommu_hwdom_init(struct domain *d)
+static int __hwdom_init arm_smmu_iommu_hwdom_init(struct domain *d)
 {
+        return 0;
 }
 
 static void arm_smmu_iommu_domain_teardown(struct domain *d)
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index d5137733..34e4ef9 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -146,14 +146,15 @@ static void __hwdom_init check_hwdom_reqs(struct domain *d)
     iommu_dom0_strict = 1;
 }
 
-void __hwdom_init iommu_hwdom_init(struct domain *d)
+int __hwdom_init iommu_hwdom_init(struct domain *d)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
+    int rc = 0;
 
     check_hwdom_reqs(d);
 
     if ( !iommu_enabled )
-        return;
+        return -EINVAL;
 
     register_keyhandler('o', &iommu_p2m_table);
     d->need_iommu = !!iommu_dom0_strict;
@@ -171,7 +172,10 @@ void __hwdom_init iommu_hwdom_init(struct domain *d)
                  ((page->u.inuse.type_info & PGT_type_mask)
                   == PGT_writable_page) )
                 mapping |= IOMMUF_writable;
-            hd->platform_ops->map_page(d, gfn, mfn, mapping);
+            rc = hd->platform_ops->map_page(d, gfn, mfn, mapping);
+            if ( rc )
+                return rc;
+
             if ( !(i++ & 0xfffff) )
                 process_pending_softirqs();
         }
@@ -266,24 +270,24 @@ static void iommu_free_pagetables(unsigned long unused)
                             cpumask_cycle(smp_processor_id(), &cpu_online_map));
 }
 
-void iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count)
+int iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
 
     if ( !iommu_enabled || !hd->platform_ops || !hd->platform_ops->iotlb_flush )
-        return;
+        return 0;
 
-    hd->platform_ops->iotlb_flush(d, gfn, page_count);
+    return hd->platform_ops->iotlb_flush(d, gfn, page_count);
 }
 
-void iommu_iotlb_flush_all(struct domain *d)
+int iommu_iotlb_flush_all(struct domain *d)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
 
     if ( !iommu_enabled || !hd->platform_ops || !hd->platform_ops->iotlb_flush_all )
-        return;
+        return 0;
 
-    hd->platform_ops->iotlb_flush_all(d);
+    return hd->platform_ops->iotlb_flush_all(d);
 }
 
 int __init iommu_setup(void)
@@ -354,11 +358,19 @@ int iommu_do_domctl(
     return ret;
 }
 
-void iommu_suspend()
+int iommu_suspend()
 {
     const struct iommu_ops *ops = iommu_get_ops();
+    int rc;
+
     if ( iommu_enabled )
-        ops->suspend();
+    {
+        rc = ops->suspend();
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
 }
 
 void iommu_share_p2m_table(struct domain* d)
@@ -369,12 +381,21 @@ void iommu_share_p2m_table(struct domain* d)
         ops->share_p2m(d);
 }
 
-void iommu_crash_shutdown(void)
+int iommu_crash_shutdown(void)
 {
     const struct iommu_ops *ops = iommu_get_ops();
+    int rc;
+
     if ( iommu_enabled )
-        ops->crash_shutdown();
+    {
+        rc = ops->crash_shutdown();
+        if ( rc )
+            return rc;
+    }
+
     iommu_enabled = iommu_intremap = 0;
+
+    return rc;
 }
 
 int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 8acf889..ec9c513 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -91,11 +91,11 @@ int is_igd_vt_enabled_quirk(void);
 void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct iommu* iommu);
 void vtd_ops_postamble_quirk(struct iommu* iommu);
-void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map);
+int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map);
 void pci_vtd_quirk(const struct pci_dev *);
 int platform_supports_intremap(void);
 int platform_supports_x2apic(void);
 
-void vtd_set_hwdom_mapping(struct domain *d);
+int vtd_set_hwdom_mapping(struct domain *d);
 
 #endif // _VTD_EXTERN_H_
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index dd13865..08aaaec 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -542,11 +542,12 @@ static int iommu_flush_iotlb_psi(
     return status;
 }
 
-static void iommu_flush_all(void)
+static int iommu_flush_all(void)
 {
     struct acpi_drhd_unit *drhd;
     struct iommu *iommu;
     int flush_dev_iotlb;
+    int rc;
 
     flush_all_cache();
     for_each_drhd_unit ( drhd )
@@ -554,11 +555,15 @@ static void iommu_flush_all(void)
         iommu = drhd->iommu;
         iommu_flush_context_global(iommu, 0);
         flush_dev_iotlb = find_ats_dev_drhd(iommu) ? 1 : 0;
-        iommu_flush_iotlb_global(iommu, 0, flush_dev_iotlb);
+        rc = iommu_flush_iotlb_global(iommu, 0, flush_dev_iotlb);
+        if ( rc )
+            return rc;
     }
+
+    return 0;
 }
 
-static void __intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn,
+static int __intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn,
         int dma_old_pte_present, unsigned int page_count)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
@@ -566,6 +571,7 @@ static void __intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn,
     struct iommu *iommu;
     int flush_dev_iotlb;
     int iommu_domid;
+    int rc;
 
     /*
      * No need pcideves_lock here because we have flush
@@ -585,36 +591,47 @@ static void __intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn,
 
         if ( page_count > 1 || gfn == -1 )
         {
-            if ( iommu_flush_iotlb_dsi(iommu, iommu_domid,
-                        0, flush_dev_iotlb) )
+            rc = iommu_flush_iotlb_dsi(iommu, iommu_domid,
+                        0, flush_dev_iotlb);
+            if ( rc )
+            {
                 iommu_flush_write_buffer(iommu);
+                return rc;
+            }
         }
         else
         {
-            if ( iommu_flush_iotlb_psi(iommu, iommu_domid,
+            rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
                         (paddr_t)gfn << PAGE_SHIFT_4K, 0,
-                        !dma_old_pte_present, flush_dev_iotlb) )
+                        !dma_old_pte_present, flush_dev_iotlb);
+            if ( rc )
+            {
                 iommu_flush_write_buffer(iommu);
+                return rc;
+            }
         }
     }
+
+    return 0;
 }
 
-static void intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count)
+static int intel_iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count)
 {
-    __intel_iommu_iotlb_flush(d, gfn, 1, page_count);
+    return __intel_iommu_iotlb_flush(d, gfn, 1, page_count);
 }
 
-static void intel_iommu_iotlb_flush_all(struct domain *d)
+static int intel_iommu_iotlb_flush_all(struct domain *d)
 {
-    __intel_iommu_iotlb_flush(d, 0, 0, 0);
+    return __intel_iommu_iotlb_flush(d, 0, 0, 0);
 }
 
 /* clear one page's page table */
-static void dma_pte_clear_one(struct domain *domain, u64 addr)
+static int dma_pte_clear_one(struct domain *domain, u64 addr)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(domain);
     struct dma_pte *page = NULL, *pte = NULL;
     u64 pg_maddr;
+    int rc;
 
     spin_lock(&hd->arch.mapping_lock);
     /* get last level pte */
@@ -622,7 +639,7 @@ static void dma_pte_clear_one(struct domain *domain, u64 addr)
     if ( pg_maddr == 0 )
     {
         spin_unlock(&hd->arch.mapping_lock);
-        return;
+        return -ENOENT;
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
@@ -632,7 +649,7 @@ static void dma_pte_clear_one(struct domain *domain, u64 addr)
     {
         spin_unlock(&hd->arch.mapping_lock);
         unmap_vtd_domain_page(page);
-        return;
+        return -ENOENT;
     }
 
     dma_clear_pte(*pte);
@@ -640,9 +657,18 @@ static void dma_pte_clear_one(struct domain *domain, u64 addr)
     iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
 
     if ( !this_cpu(iommu_dont_flush_iotlb) )
-        __intel_iommu_iotlb_flush(domain, addr >> PAGE_SHIFT_4K, 1, 1);
+    {
+        rc = __intel_iommu_iotlb_flush(domain, addr >> PAGE_SHIFT_4K, 1, 1);
+        if ( rc )
+        {
+            unmap_vtd_domain_page(page);
+            return rc;
+        }
+    }
 
     unmap_vtd_domain_page(page);
+
+    return 0;
 }
 
 static void iommu_free_pagetable(u64 pt_maddr, int level)
@@ -1251,20 +1277,24 @@ static int intel_iommu_domain_init(struct domain *d)
     return 0;
 }
 
-static void __hwdom_init intel_iommu_hwdom_init(struct domain *d)
+static int __hwdom_init intel_iommu_hwdom_init(struct domain *d)
 {
     struct acpi_drhd_unit *drhd;
+    int rc;
 
     if ( !iommu_passthrough && !need_iommu(d) )
     {
         /* Set up 1:1 page table for hardware domain. */
-        vtd_set_hwdom_mapping(d);
+        rc = vtd_set_hwdom_mapping(d);
+        if ( rc )
+            return rc;
     }
 
     setup_hwdom_pci_devices(d, setup_hwdom_device);
     setup_hwdom_rmrr(d);
 
-    iommu_flush_all();
+    if ( iommu_flush_all() )
+        printk("Xen warning : iommu flush error.\n");
 
     for_each_drhd_unit ( drhd )
     {
@@ -1273,6 +1303,8 @@ static void __hwdom_init intel_iommu_hwdom_init(struct domain *d)
             BUG();
         iommu_enable_translation(drhd);
     }
+
+    return 0;
 }
 
 int domain_context_mapping_one(
@@ -1404,7 +1436,14 @@ int domain_context_mapping_one(
     else
     {
         int flush_dev_iotlb = find_ats_dev_drhd(iommu) ? 1 : 0;
-        iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+        int rc;
+
+        rc = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+        if ( rc )
+        {
+            unmap_vtd_domain_page(context_entries);
+            return rc;
+        }
     }
 
     set_bit(iommu->index, &hd->arch.iommu_bitmap);
@@ -1412,7 +1451,13 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg )
-        me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
+    {
+        int rc;
+
+        rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
+        if ( rc )
+            return rc;
+    }
 
     return 0;
 }
@@ -1509,6 +1554,7 @@ int domain_context_unmap_one(
     struct context_entry *context, *context_entries;
     u64 maddr;
     int iommu_domid;
+    int rc;
 
     ASSERT(spin_is_locked(&pcidevs_lock));
     spin_lock(&iommu->lock);
@@ -1543,15 +1589,24 @@ int domain_context_unmap_one(
     else
     {
         int flush_dev_iotlb = find_ats_dev_drhd(iommu) ? 1 : 0;
-        iommu_flush_iotlb_dsi(iommu, iommu_domid, 0, flush_dev_iotlb);
+        rc = iommu_flush_iotlb_dsi(iommu, iommu_domid, 0, flush_dev_iotlb);
+        if ( rc )
+        {
+            spin_unlock(&iommu->lock);
+            unmap_vtd_domain_page(context_entries);
+            return rc;
+        }
     }
 
     spin_unlock(&iommu->lock);
     unmap_vtd_domain_page(context_entries);
 
     if ( !iommu->intel->drhd->segment )
-        me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
-
+    {
+       rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
+       if ( rc )
+           return rc;
+    }
     return 0;
 }
 
@@ -1700,6 +1755,7 @@ static int intel_iommu_map_page(
     struct hvm_iommu *hd = domain_hvm_iommu(d);
     struct dma_pte *page = NULL, *pte = NULL, old, new = { 0 };
     u64 pg_maddr;
+    int rc;
 
     /* Do nothing if VT-d shares EPT page table */
     if ( iommu_use_hap_pt(d) )
@@ -1742,30 +1798,39 @@ static int intel_iommu_map_page(
     unmap_vtd_domain_page(page);
 
     if ( !this_cpu(iommu_dont_flush_iotlb) )
-        __intel_iommu_iotlb_flush(d, gfn, dma_pte_present(old), 1);
+    {
+        rc = __intel_iommu_iotlb_flush(d, gfn, dma_pte_present(old), 1);
+        if ( rc )
+            return rc;
+    }
 
     return 0;
 }
 
 static int intel_iommu_unmap_page(struct domain *d, unsigned long gfn)
 {
+    int rc;
+
     /* Do nothing if hardware domain and iommu supports pass thru. */
     if ( iommu_passthrough && is_hardware_domain(d) )
         return 0;
 
-    dma_pte_clear_one(d, (paddr_t)gfn << PAGE_SHIFT_4K);
+    rc = dma_pte_clear_one(d, (paddr_t)gfn << PAGE_SHIFT_4K);
+    if ( rc )
+        return rc;
 
     return 0;
 }
 
-void iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
-                     int order, int present)
+int iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
+                    int order, int present)
 {
     struct acpi_drhd_unit *drhd;
     struct iommu *iommu = NULL;
     struct hvm_iommu *hd = domain_hvm_iommu(d);
     int flush_dev_iotlb;
     int iommu_domid;
+    int rc;
 
     iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
 
@@ -1779,11 +1844,17 @@ void iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
         iommu_domid= domain_iommu_domid(d, iommu);
         if ( iommu_domid == -1 )
             continue;
-        if ( iommu_flush_iotlb_psi(iommu, iommu_domid,
+        rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
                                    (paddr_t)gfn << PAGE_SHIFT_4K,
-                                   order, !present, flush_dev_iotlb) )
+                                   order, !present, flush_dev_iotlb);
+        if ( rc )
+        {
             iommu_flush_write_buffer(iommu);
+            return rc;
+        }
     }
+
+    return 0;
 }
 
 static int __init vtd_ept_page_compatible(struct iommu *iommu)
@@ -2103,7 +2174,11 @@ static int init_vtd_hw(void)
             return -EIO;
         }
     }
-    iommu_flush_all();
+
+    ret = iommu_flush_all();
+    if ( ret )
+        return ret;
+
     return 0;
 }
 
@@ -2372,16 +2447,19 @@ static int intel_iommu_group_id(u16 seg, u8 bus, u8 devfn)
 }
 
 static u32 iommu_state[MAX_IOMMUS][MAX_IOMMU_REGS];
-static void vtd_suspend(void)
+static int vtd_suspend(void)
 {
     struct acpi_drhd_unit *drhd;
     struct iommu *iommu;
+    int rc;
     u32    i;
 
     if ( !iommu_enabled )
-        return;
+        return -EINVAL;
 
-    iommu_flush_all();
+    rc = iommu_flush_all();
+    if ( rc )
+        return rc;
 
     for_each_drhd_unit ( drhd )
     {
@@ -2410,17 +2488,22 @@ static void vtd_suspend(void)
         if ( !iommu_intremap && iommu_qinval )
             disable_qinval(iommu);
     }
+
+    return 0;
 }
 
-static void vtd_crash_shutdown(void)
+static int vtd_crash_shutdown(void)
 {
     struct acpi_drhd_unit *drhd;
     struct iommu *iommu;
+    int rc;
 
     if ( !iommu_enabled )
-        return;
+        return -EINVAL;
 
-    iommu_flush_all();
+    rc = iommu_flush_all();
+    if ( rc )
+        return rc;
 
     for_each_drhd_unit ( drhd )
     {
@@ -2429,6 +2512,8 @@ static void vtd_crash_shutdown(void)
         disable_intremap(drhd->iommu);
         disable_qinval(drhd->iommu);
     }
+
+    return 0;
 }
 
 static void vtd_resume(void)
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index b81b0bd..946e812 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -324,7 +324,7 @@ static int flush_iotlb_qi(
     if ( flush_non_present_entry )
     {
         if ( !cap_caching_mode(iommu->cap) )
-            return 1;
+            return 0;
         else
             did = 0;
     }
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index 1888843..d9aea7e 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -332,10 +332,11 @@ void __init platform_quirks_init(void)
  * assigning Intel integrated wifi device to a guest.
  */
 
-static void map_me_phantom_function(struct domain *domain, u32 dev, int map)
+static int map_me_phantom_function(struct domain *domain, u32 dev, int map)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
+    int rc = 0;
 
     /* find ME VT-d engine base on a real ME device */
     pdev = pci_get_pdev(0, 0, PCI_DEVFN(dev, 0));
@@ -343,23 +344,26 @@ static void map_me_phantom_function(struct domain *domain, u32 dev, int map)
 
     /* map or unmap ME phantom function */
     if ( map )
-        domain_context_mapping_one(domain, drhd->iommu, 0,
-                                   PCI_DEVFN(dev, 7), NULL);
+        rc = domain_context_mapping_one(domain, drhd->iommu, 0,
+                                        PCI_DEVFN(dev, 7), NULL);
     else
-        domain_context_unmap_one(domain, drhd->iommu, 0,
-                                 PCI_DEVFN(dev, 7));
+        rc = domain_context_unmap_one(domain, drhd->iommu, 0,
+                                      PCI_DEVFN(dev, 7));
+
+    return rc;
 }
 
-void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
+int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
 {
     u32 id;
+    int rc = 0;
 
     id = pci_conf_read32(0, 0, 0, 0, 0);
     if ( IS_CTG(id) )
     {
         /* quit if ME does not exist */
         if ( pci_conf_read32(0, 0, 3, 0, 0) == 0xffffffff )
-            return;
+            return -ENOENT;
 
         /* if device is WLAN device, map ME phantom device 0:3.7 */
         id = pci_conf_read32(0, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), 0);
@@ -373,7 +377,7 @@ void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                map_me_phantom_function(domain, 3, map);
+                rc = map_me_phantom_function(domain, 3, map);
                 break;
             default:
                 break;
@@ -383,7 +387,7 @@ void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
     {
         /* quit if ME does not exist */
         if ( pci_conf_read32(0, 0, 22, 0, 0) == 0xffffffff )
-            return;
+            return -ENOENT;
 
         /* if device is WLAN device, map ME phantom device 0:22.7 */
         id = pci_conf_read32(0, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), 0);
@@ -399,12 +403,14 @@ void me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                map_me_phantom_function(domain, 22, map);
+                rc = map_me_phantom_function(domain, 22, map);
                 break;
             default:
                 break;
         }
     }
+
+    return rc;
 }
 
 void pci_vtd_quirk(const struct pci_dev *pdev)
diff --git a/xen/drivers/passthrough/vtd/x86/vtd.c b/xen/drivers/passthrough/vtd/x86/vtd.c
index c0d6aab..b952ff7 100644
--- a/xen/drivers/passthrough/vtd/x86/vtd.c
+++ b/xen/drivers/passthrough/vtd/x86/vtd.c
@@ -108,9 +108,10 @@ void hvm_dpci_isairq_eoi(struct domain *d, unsigned int isairq)
     spin_unlock(&d->event_lock);
 }
 
-void __hwdom_init vtd_set_hwdom_mapping(struct domain *d)
+int __hwdom_init vtd_set_hwdom_mapping(struct domain *d)
 {
     unsigned long i, j, tmp, top;
+    int rc;
 
     BUG_ON(!is_hardware_domain(d));
 
@@ -140,11 +141,17 @@ void __hwdom_init vtd_set_hwdom_mapping(struct domain *d)
 
         tmp = 1 << (PAGE_SHIFT - PAGE_SHIFT_4K);
         for ( j = 0; j < tmp; j++ )
-            iommu_map_page(d, pfn * tmp + j, pfn * tmp + j,
-                           IOMMUF_readable|IOMMUF_writable);
+        {
+            rc = iommu_map_page(d, pfn * tmp + j, pfn * tmp + j,
+                                IOMMUF_readable|IOMMUF_writable);
+            if ( rc )
+                return rc;
+        }
 
         if (!(i & (0xfffff >> (PAGE_SHIFT - PAGE_SHIFT_4K))))
             process_pending_softirqs();
     }
+
+    return 0;
 }
 
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 8cbb655..6674fb0 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -104,7 +104,11 @@ int arch_iommu_populate_page_table(struct domain *d)
     this_cpu(iommu_dont_flush_iotlb) = 0;
 
     if ( !rc )
-        iommu_iotlb_flush_all(d);
+    {
+        rc = iommu_iotlb_flush_all(d);
+        if ( rc )
+            return rc;
+    }
     else if ( rc != -ERESTART )
         iommu_teardown(d);
 
diff --git a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
index 9c51172..4691f9b 100644
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -119,8 +119,8 @@ extern unsigned long *shared_intremap_inuse;
 
 /* power management support */
 void amd_iommu_resume(void);
-void amd_iommu_suspend(void);
-void amd_iommu_crash_shutdown(void);
+int amd_iommu_suspend(void);
+int amd_iommu_crash_shutdown(void);
 
 /* guest iommu support */
 void amd_iommu_send_guest_cmd(struct amd_iommu *iommu, u32 cmd[]);
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 29203d7..cf2a269 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -26,7 +26,7 @@ int iommu_setup_hpet_msi(struct msi_desc *);
 
 /* While VT-d specific, this must get declared in a generic header. */
 int adjust_vtd_irq_affinities(void);
-void iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte, int order, int present);
+int iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte, int order, int present);
 int iommu_supports_eim(void);
 int iommu_enable_x2apic_IR(void);
 void iommu_disable_x2apic_IR(void);
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8f3a20e..f5b6f7e 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -55,7 +55,7 @@ int iommu_add_device(struct pci_dev *pdev);
 int iommu_enable_device(struct pci_dev *pdev);
 int iommu_remove_device(struct pci_dev *pdev);
 int iommu_domain_init(struct domain *d);
-void iommu_hwdom_init(struct domain *d);
+int iommu_hwdom_init(struct domain *d);
 void iommu_domain_destroy(struct domain *d);
 int deassign_device(struct domain *d, u16 seg, u8 bus, u8 devfn);
 
@@ -134,7 +134,7 @@ typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
 
 struct iommu_ops {
     int (*init)(struct domain *d);
-    void (*hwdom_init)(struct domain *d);
+    int (*hwdom_init)(struct domain *d);
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);
@@ -157,19 +157,19 @@ struct iommu_ops {
     unsigned int (*read_apic_from_ire)(unsigned int apic, unsigned int reg);
     int (*setup_hpet_msi)(struct msi_desc *);
 #endif /* CONFIG_X86 */
-    void (*suspend)(void);
+    int (*suspend)(void);
     void (*resume)(void);
     void (*share_p2m)(struct domain *d);
-    void (*crash_shutdown)(void);
-    void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
-    void (*iotlb_flush_all)(struct domain *d);
+    int (*crash_shutdown)(void);
+    int (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
+    int (*iotlb_flush_all)(struct domain *d);
     int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_p2m_table)(struct domain *d);
 };
 
-void iommu_suspend(void);
+int iommu_suspend(void);
 void iommu_resume(void);
-void iommu_crash_shutdown(void);
+int iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
 
 void iommu_share_p2m_table(struct domain *d);
@@ -182,8 +182,8 @@ int iommu_do_pci_domctl(struct xen_domctl *, struct domain *d,
 int iommu_do_domctl(struct xen_domctl *, struct domain *d,
                     XEN_GUEST_HANDLE_PARAM(xen_domctl_t));
 
-void iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count);
-void iommu_iotlb_flush_all(struct domain *d);
+int iommu_iotlb_flush(struct domain *d, unsigned long gfn, unsigned int page_count);
+int iommu_iotlb_flush_all(struct domain *d);
 
 /*
  * The purpose of the iommu_dont_flush_iotlb optional cpu flag is to
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2015-12-23  8:25 [PATCH v4 0/3] VT-d Device-TLB flush issue Quan Xu
  2015-12-23  8:25 ` [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error Quan Xu
@ 2015-12-23  8:25 ` Quan Xu
  2015-12-25  2:56   ` Tian, Kevin
                     ` (2 more replies)
  2015-12-23  8:25 ` [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue Quan Xu
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 60+ messages in thread
From: Quan Xu @ 2015-12-23  8:25 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, eddie.dong, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, Quan Xu, keir

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/drivers/passthrough/vtd/qinval.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index 946e812..b227e4e 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -28,6 +28,11 @@
 #include "vtd.h"
 #include "extern.h"
 
+static int __read_mostly iommu_qi_timeout_ms = 1;
+integer_param("iommu_qi_timeout_ms", iommu_qi_timeout_ms);
+
+#define IOMMU_QI_TIMEOUT (iommu_qi_timeout_ms * MILLISECS(1))
+
 static void print_qi_regs(struct iommu *iommu)
 {
     u64 val;
@@ -167,10 +172,12 @@ static int queue_invalidate_wait(struct iommu *iommu,
         start_time = NOW();
         while ( poll_slot != QINVAL_STAT_DONE )
         {
-            if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) )
+            if ( NOW() > (start_time + IOMMU_QI_TIMEOUT) )
             {
                 print_qi_regs(iommu);
-                panic("queue invalidate wait descriptor was not executed");
+                dprintk(XENLOG_WARNING VTDPREFIX,
+                        "Queue invalidate wait descriptor was timeout.\n");
+                return -ETIMEDOUT;
             }
             cpu_relax();
         }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2015-12-23  8:25 [PATCH v4 0/3] VT-d Device-TLB flush issue Quan Xu
  2015-12-23  8:25 ` [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error Quan Xu
  2015-12-23  8:25 ` [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
@ 2015-12-23  8:25 ` Quan Xu
  2015-12-25  3:06   ` Tian, Kevin
  2016-01-15 13:09   ` Jan Beulich
  2015-12-23 10:19 ` [PATCH v4 0/3] VT-d Device-TLB flush issue Xu, Quan
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 60+ messages in thread
From: Quan Xu @ 2015-12-23  8:25 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, eddie.dong, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, Quan Xu, keir

Now if IOTLB/Context/IETC flush is timeout, panic hypervisor.
The coming patch set will fix it.

If Device-TLB flush is timeout, we'll hide the target ATS
device and crash the domain owning this ATS device.

If impacted domain is hardware domain, just throw out a warning.

The hided Device will be disallowed to be further assigned to
any domain.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/drivers/passthrough/pci.c         |  2 +-
 xen/drivers/passthrough/vtd/extern.h  |  2 +
 xen/drivers/passthrough/vtd/qinval.c  | 80 ++++++++++++++++++++++++++++++++++-
 xen/drivers/passthrough/vtd/x86/ats.c | 13 ++++++
 4 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 27b3ca7..2d7dc59 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -407,7 +407,7 @@ static void _pci_hide_device(struct pci_dev *pdev)
     list_add(&pdev->domain_list, &dom_xen->arch.pdev_list);
 }
 
-int __init pci_hide_device(int bus, int devfn)
+int pci_hide_device(int bus, int devfn)
 {
     struct pci_dev *pdev;
     int rc = -ENOMEM;
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index ec9c513..a2123ce 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -58,6 +58,8 @@ int ats_device(const struct pci_dev *, const struct acpi_drhd_unit *);
 
 int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
                          u64 addr, unsigned int size_order, u64 type);
+int dev_invalidate_iotlb_sync(struct iommu *iommu, u16 did,
+                              u16 seg, u8 bus, u8 devfn);
 
 int qinval_device_iotlb(struct iommu *iommu,
                         u32 max_invs_pend, u16 sid, u16 size, u64 addr);
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index b227e4e..7330c5d 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu *iommu,
 static int invalidate_sync(struct iommu *iommu)
 {
     struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
+    int rc;
 
     if ( qi_ctrl->qinval_maddr )
-        return queue_invalidate_wait(iommu, 0, 1, 1);
+    {
+        rc = queue_invalidate_wait(iommu, 0, 1, 1);
+        if ( rc )
+        {
+            if ( rc == -ETIMEDOUT )
+                panic("Queue invalidate wait descriptor was timeout.\n");
+            return rc;
+        }
+    }
+
     return 0;
 }
 
@@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
     return 0;
 }
 
+static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
+                                         u16 seg, u8 bus, u8 devfn)
+{
+    struct domain *d;
+    struct pci_dev *pdev;
+
+    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
+    ASSERT(d);
+    for_each_pdev(d, pdev)
+    {
+        if ( (pdev->seg == seg) &&
+             (pdev->bus == bus) &&
+             (pdev->devfn == devfn) )
+        {
+            if ( pdev->domain )
+            {
+                list_del(&pdev->domain_list);
+                pdev->domain = NULL;
+            }
+
+            if ( pci_hide_device(bus, devfn) )
+            {
+                printk(XENLOG_ERR
+                       "IOMMU hide device %04x:%02x:%02x error.",
+                       seg, bus, devfn);
+                break;
+            }
+
+            break;
+        }
+    }
+
+    if ( !is_hardware_domain(d) )
+        domain_crash(d);
+    rcu_unlock_domain(d);
+}
+
+int dev_invalidate_iotlb_sync(struct iommu *iommu, u16 did,
+                              u16 seg, u8 bus, u8 devfn)
+{
+    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
+    int rc;
+
+    if ( qi_ctrl->qinval_maddr )
+    {
+        rc = queue_invalidate_wait(iommu, 0, 1, 1);
+        if ( rc )
+        {
+            if ( rc == -ETIMEDOUT )
+                dev_invalidate_iotlb_timeout(iommu, did, seg, bus, devfn);
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
 static void queue_invalidate_iec(struct iommu *iommu, u8 granu, u8 im, u16 iidx)
 {
     unsigned long flags;
@@ -349,9 +416,18 @@ static int flush_iotlb_qi(
         queue_invalidate_iotlb(iommu,
                                type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
                                dw, did, size_order, 0, addr);
+
+        /*
+         * Synchronize with hardware for invalidation request descriptors
+         * submitted before Device-TLB invalidate descriptor.
+         */
+        rc = invalidate_sync(iommu);
+        if ( rc )
+             return rc;
+
         if ( flush_dev_iotlb )
             ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
-        rc = invalidate_sync(iommu);
+
         if ( !ret )
             ret = rc;
     }
diff --git a/xen/drivers/passthrough/vtd/x86/ats.c b/xen/drivers/passthrough/vtd/x86/ats.c
index 7c797f6..6299f52 100644
--- a/xen/drivers/passthrough/vtd/x86/ats.c
+++ b/xen/drivers/passthrough/vtd/x86/ats.c
@@ -162,6 +162,19 @@ int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
             return -EOPNOTSUPP;
         }
 
+        /*
+         * Synchronize with hardware for Device-TLB invalidate
+         * descriptor.
+         */
+        ret = dev_invalidate_iotlb_sync(iommu, did, pdev->seg,
+                                        pdev->bus, pdev->devfn);
+        if ( ret )
+        {
+            dprintk(XENLOG_WARNING VTDPREFIX,
+                    "Device-TLB flush timeout.\n");
+            return ret;
+        }
+
         if ( !ret )
             ret = rc;
     }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/3] VT-d Device-TLB flush issue
  2015-12-23  8:25 [PATCH v4 0/3] VT-d Device-TLB flush issue Quan Xu
                   ` (2 preceding siblings ...)
  2015-12-23  8:25 ` [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue Quan Xu
@ 2015-12-23 10:19 ` Xu, Quan
  2015-12-23 10:40   ` Jan Beulich
  2015-12-23 10:39 ` Jan Beulich
  2015-12-25  1:53 ` Tian, Kevin
  5 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2015-12-23 10:19 UTC (permalink / raw)
  To: jbeulich
  Cc: Tian, Kevin, Wu, Feng, Dong, Eddie, george.dunlap,
	andrew.cooper3, tim, xen-devel, Nakajima, Jun, keir

> On 23.12.2015 at 4:26pm, <quan.xu@intel.com> wrote:
> Subject: [PATCH v4 0/3] VT-d Device-TLB flush issue
> Quan Xu (3):
>   VT-d: Check VT-d Device-TLB flush error.
>   VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
>   VT-d: Fix vt-d Device-TLB flush timeout issue.

Jan, 
    If patch 1 is a huge, could you review the patch 2/3 first? 
Any comment, I can fix it ASAP. Then we can make patch 2/3 done before holiday. 

Thanks. Merry Christmas!!
-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/3] VT-d Device-TLB flush issue
  2015-12-23  8:25 [PATCH v4 0/3] VT-d Device-TLB flush issue Quan Xu
                   ` (3 preceding siblings ...)
  2015-12-23 10:19 ` [PATCH v4 0/3] VT-d Device-TLB flush issue Xu, Quan
@ 2015-12-23 10:39 ` Jan Beulich
  2015-12-23 11:09   ` Xu, Quan
  2015-12-25  1:50   ` Tian, Kevin
  2015-12-25  1:53 ` Tian, Kevin
  5 siblings, 2 replies; 60+ messages in thread
From: Jan Beulich @ 2015-12-23 10:39 UTC (permalink / raw)
  To: quan.xu
  Cc: kevin.tian, feng.wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, keir

>>> Quan Xu <quan.xu@intel.com> 12/23/15 9:26 AM >>>
>This patches are based on Kevin Tian's previous discussion 'Revisit VT-d asynchronous flush issue'.
>Fix current timeout concern and also allow limited ATS support in a light way:
>
>1. Check VT-d Device-TLB flush error.
>This patch checks all kinds of error and all the way up the call trees of VT-d Device-TLB flush.
>
>2. Reduce spin timeout to 1ms, which can be boot-time changed with 'iommu_qi_timeout_ms'.
>For example:
>multiboot /boot/xen.gz ats=1 iommu_qi_timeout_ms=100
>
>3. Fix vt-d Device-TLB flush timeout issue.
>Now if IOTLB/Context/IETC flush is timeout, panic hypervisor. The coming patch
>set will fix it.

There must have been a misunderstanding: Your earlier outline didn't indicate you
mean to introduce panics here, even if only temporarily. I'm afraid I'm not currently
willing to take any conceptually wrong patches anymore with just the promise of
fixing the issue(s) later (and I think we've mentioned this in some past discussion
on the list, albeit unlikely in the context of any of your work). This may mean that
the earlier described ordering of things you mean to do needs changing.

I'm sorry that you're hit first by this, the more that it was not you but colleagues of
yours causing this change to the acceptance model.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/3] VT-d Device-TLB flush issue
  2015-12-23 10:19 ` [PATCH v4 0/3] VT-d Device-TLB flush issue Xu, Quan
@ 2015-12-23 10:40   ` Jan Beulich
  2015-12-23 10:59     ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2015-12-23 10:40 UTC (permalink / raw)
  To: quan.xu
  Cc: tim, kevin.tian, feng.wu, george.dunlap, andrew.cooper3,
	eddie.dong, xen-devel, jun.nakajima, keir

>>> "Xu, Quan" <quan.xu@intel.com> 12/23/15 11:19 AM >>>
>> On 23.12.2015 at 4:26pm, <quan.xu@intel.com> wrote:
>> Subject: [PATCH v4 0/3] VT-d Device-TLB flush issue
>> Quan Xu (3):
>>   VT-d: Check VT-d Device-TLB flush error.
>>   VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
>>   VT-d: Fix vt-d Device-TLB flush timeout issue.
>
>If patch 1 is a huge, could you review the patch 2/3 first? 
>Any comment, I can fix it ASAP. Then we can make patch 2/3 done before holiday. 

No, sorry, I'm already not in the office anymore, and won't do patch review
from home in my spare time. This will need to wait.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/3] VT-d Device-TLB flush issue
  2015-12-23 10:40   ` Jan Beulich
@ 2015-12-23 10:59     ` Xu, Quan
  0 siblings, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2015-12-23 10:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, Dong,
	Eddie, xen-devel, Nakajima, Jun, keir

> On 23.12.2015 at 6:41pm, <jbeulich@suse.com> wrote:
> >>> "Xu, Quan" <quan.xu@intel.com> 12/23/15 11:19 AM >>>
> >> On 23.12.2015 at 4:26pm, <quan.xu@intel.com> wrote:
> >> Subject: [PATCH v4 0/3] VT-d Device-TLB flush issue Quan Xu (3):
> >>   VT-d: Check VT-d Device-TLB flush error.
> >>   VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
> >>   VT-d: Fix vt-d Device-TLB flush timeout issue.
> >
> >If patch 1 is a huge, could you review the patch 2/3 first?
> >Any comment, I can fix it ASAP. Then we can make patch 2/3 done before
> holiday.
> 
> No, sorry, I'm already not in the office anymore, and won't do patch review
> from home in my spare time. This will need to wait.
> 
ok.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/3] VT-d Device-TLB flush issue
  2015-12-23 10:39 ` Jan Beulich
@ 2015-12-23 11:09   ` Xu, Quan
  2015-12-25  1:50   ` Tian, Kevin
  1 sibling, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2015-12-23 11:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> On 23.12.2015 at 6:39pm, 'Jan Beulich' wrote:
> >>> Quan Xu <quan.xu@intel.com> 12/23/15 9:26 AM >>>
> >This patches are based on Kevin Tian's previous discussion 'Revisit VT-d
> asynchronous flush issue'.
> >Fix current timeout concern and also allow limited ATS support in a light way:
> >
> >1. Check VT-d Device-TLB flush error.
> >This patch checks all kinds of error and all the way up the call trees of VT-d
> Device-TLB flush.
> >
> >2. Reduce spin timeout to 1ms, which can be boot-time changed with
> 'iommu_qi_timeout_ms'.
> >For example:
> >multiboot /boot/xen.gz ats=1 iommu_qi_timeout_ms=100
> >
> >3. Fix vt-d Device-TLB flush timeout issue.
> >Now if IOTLB/Context/IETC flush is timeout, panic hypervisor. The
> >coming patch set will fix it.
> 
> There must have been a misunderstanding: Your earlier outline didn't indicate
> you mean to introduce panics here, 

It is a panic for iotlb/context/iec flush issue without my patch set.  I just move the panic from queue_invalidate_wait() to 
invalidate_sync()..
I did NOT introduce a new panic..


> even if only temporarily. I'm afraid I'm not
> currently willing to take any conceptually wrong patches anymore with just the
> promise of fixing the issue(s) later (and I think we've mentioned this in some
> past discussion on the list, albeit unlikely in the context of any of your work).
> This may mean that the earlier described ordering of things you mean to do
> needs changing.
> 

:(:( .. 

> I'm sorry that you're hit first by this, the more that it was not you but colleagues
> of yours causing this change to the acceptance model.
> 
> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/3] VT-d Device-TLB flush issue
  2015-12-23 10:39 ` Jan Beulich
  2015-12-23 11:09   ` Xu, Quan
@ 2015-12-25  1:50   ` Tian, Kevin
  2015-12-25  2:26     ` Xu, Quan
  1 sibling, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2015-12-25  1:50 UTC (permalink / raw)
  To: Jan Beulich, Xu, Quan
  Cc: keir, george.dunlap, andrew.cooper3, tim, xen-devel, Nakajima,
	Jun, Wu, Feng

> From: Jan Beulich [mailto:jbeulich@suse.com]
> Sent: Wednesday, December 23, 2015 6:39 PM
> 
> >>> Quan Xu <quan.xu@intel.com> 12/23/15 9:26 AM >>>
> >This patches are based on Kevin Tian's previous discussion 'Revisit VT-d asynchronous
> flush issue'.
> >Fix current timeout concern and also allow limited ATS support in a light way:
> >
> >1. Check VT-d Device-TLB flush error.
> >This patch checks all kinds of error and all the way up the call trees of VT-d Device-TLB
> flush.
> >
> >2. Reduce spin timeout to 1ms, which can be boot-time changed with
> 'iommu_qi_timeout_ms'.
> >For example:
> >multiboot /boot/xen.gz ats=1 iommu_qi_timeout_ms=100
> >
> >3. Fix vt-d Device-TLB flush timeout issue.
> >Now if IOTLB/Context/IETC flush is timeout, panic hypervisor. The coming patch
> >set will fix it.
> 
> There must have been a misunderstanding: Your earlier outline didn't indicate you
> mean to introduce panics here, even if only temporarily. I'm afraid I'm not currently
> willing to take any conceptually wrong patches anymore with just the promise of
> fixing the issue(s) later (and I think we've mentioned this in some past discussion
> on the list, albeit unlikely in the context of any of your work). This may mean that
> the earlier described ordering of things you mean to do needs changing.
> 
> I'm sorry that you're hit first by this, the more that it was not you but colleagues of
> yours causing this change to the acceptance model.
> 

I believe Quan's point here is to point out the current fact. That is why he said a
coming patch will fix that behavior based on earlier discussion. It's not related
with this patch set which is only step-1 in his plan.

Quan, I think it's confusing for you to miss description of this step-1 patch, with
some TODO discussions together for later steps. It'd be good for the message staying
with whatever is changed in this Device-TLB flush fix, and then in the end you
list several TODO bullets to let community know future plan (again, just list of
expected tasks. no need to discuss them until you have related code ready). 

This way it should make reviewers more focus on the points you want to get
reviewed. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/3] VT-d Device-TLB flush issue
  2015-12-23  8:25 [PATCH v4 0/3] VT-d Device-TLB flush issue Quan Xu
                   ` (4 preceding siblings ...)
  2015-12-23 10:39 ` Jan Beulich
@ 2015-12-25  1:53 ` Tian, Kevin
  2015-12-25  2:04   ` Xu, Quan
  5 siblings, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2015-12-25  1:53 UTC (permalink / raw)
  To: Xu, Quan, jbeulich
  Cc: Wu, Feng, Dong, Eddie, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> From: Xu, Quan
> Sent: Wednesday, December 23, 2015 4:26 PM
> 
> This patches are based on Kevin Tian's previous discussion 'Revisit VT-d asynchronous
> flush issue'.
> Fix current timeout concern and also allow limited ATS support in a light way:
> 
> 1. Check VT-d Device-TLB flush error.
>    This patch checks all kinds of error and all the way up the call trees of VT-d Device-TLB
> flush.
> 
> 2. Reduce spin timeout to 1ms, which can be boot-time changed with
> 'iommu_qi_timeout_ms'.
>    For example:
>            multiboot /boot/xen.gz ats=1 iommu_qi_timeout_ms=100
> 
> 3. Fix vt-d Device-TLB flush timeout issue.
>     Now if IOTLB/Context/IETC flush is timeout, panic hypervisor. The coming patch
>     set will fix it.

remove above sentence which is not related to this patch.

> 
>     If Device-TLB flush is timeout, we'll hide the target ATS
>     device and crash the domain owning this ATS device.
> 
>     If impacted domain is hardware domain, just throw out a warning.
> 
>     The hided Device will be disallowed to be further assigned to
>     any domain.

hided -> hidden. Device -> device.

> 
> --
> 
>  * DMAR_OPERATION_TIMEOUT should be also chopped down to a low number of
> milliseconds.
>    As Kevin Tian mentioned in 'Revisit VT-d asynchronous flush issue', We also confirmed
> with hardware team
>    that 1ms is large enough for IOMMU internal flush. So I can change
> DMAR_OPERATION_TIMEOUT from 1000 ms to 1 ms.
> 
>    IOMMU_WAIT_OP() is only for VT-d registers read/write, and there is also a panic. We
> need a further discussion
>    whether or how to remove this panic in next patch set.
> 
>  * if IOTLB/Context/IETC flush is timeout, panic hypervisor. The coming patch set will fix
> it.

for all above, enough to say TODO tasks based on your summary earlier:

   - context/iotlb flush error. (need 2 ~ 3 weeks)
   - iec flush error. (need 3 ~ 4 weeks)

as I commented in another thread, let's discuss them when you have new
code ready.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/3] VT-d Device-TLB flush issue
  2015-12-25  1:53 ` Tian, Kevin
@ 2015-12-25  2:04   ` Xu, Quan
  0 siblings, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2015-12-25  2:04 UTC (permalink / raw)
  To: Tian, Kevin, jbeulich
  Cc: Wu, Feng, Dong, Eddie, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> On 25.12.2015 at 9:54am, <Tian, Kevin> wrote:
> > From: Xu, Quan
> > Sent: Wednesday, December 23, 2015 4:26 PM
> >
> > This patches are based on Kevin Tian's previous discussion 'Revisit
> > VT-d asynchronous flush issue'.
> > Fix current timeout concern and also allow limited ATS support in a light way:
> >
> > 1. Check VT-d Device-TLB flush error.
> >    This patch checks all kinds of error and all the way up the call
> > trees of VT-d Device-TLB flush.
> >
> > 2. Reduce spin timeout to 1ms, which can be boot-time changed with
> > 'iommu_qi_timeout_ms'.
> >    For example:
> >            multiboot /boot/xen.gz ats=1 iommu_qi_timeout_ms=100
> >
> > 3. Fix vt-d Device-TLB flush timeout issue.
> >     Now if IOTLB/Context/IETC flush is timeout, panic hypervisor. The
> coming patch
> >     set will fix it.
> 
> remove above sentence which is not related to this patch.
>

Agreed. I did NOT introduce these panics.


> >
> >     If Device-TLB flush is timeout, we'll hide the target ATS
> >     device and crash the domain owning this ATS device.
> >
> >     If impacted domain is hardware domain, just throw out a warning.
> >
> >     The hided Device will be disallowed to be further assigned to
> >     any domain.
> 
> hided -> hidden. Device -> device.
> 

Agreed. 


> >
> > --
> >
> >  * DMAR_OPERATION_TIMEOUT should be also chopped down to a low
> number
> > of milliseconds.
> >    As Kevin Tian mentioned in 'Revisit VT-d asynchronous flush issue',
> > We also confirmed with hardware team
> >    that 1ms is large enough for IOMMU internal flush. So I can change
> > DMAR_OPERATION_TIMEOUT from 1000 ms to 1 ms.
> >
> >    IOMMU_WAIT_OP() is only for VT-d registers read/write, and there is
> > also a panic. We need a further discussion
> >    whether or how to remove this panic in next patch set.
> >
> >  * if IOTLB/Context/IETC flush is timeout, panic hypervisor. The
> > coming patch set will fix it.
> 
> for all above, enough to say TODO tasks based on your summary earlier:
> 
>    - context/iotlb flush error. (need 2 ~ 3 weeks)
>    - iec flush error. (need 3 ~ 4 weeks)
> 
> as I commented in another thread, let's discuss them when you have new code
> ready.
> 

Thanks, I will speed up this process.

Merry Christmas!
-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 0/3] VT-d Device-TLB flush issue
  2015-12-25  1:50   ` Tian, Kevin
@ 2015-12-25  2:26     ` Xu, Quan
  0 siblings, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2015-12-25  2:26 UTC (permalink / raw)
  To: Tian, Kevin, Jan Beulich
  Cc: keir, george.dunlap, andrew.cooper3, tim, xen-devel, Nakajima,
	Jun, Wu, Feng

> On 25.12.2015 at 9:51am, <Tian, Kevin> wrote:
> > From: Jan Beulich [mailto:jbeulich@suse.com]
> > Sent: Wednesday, December 23, 2015 6:39 PM
> >
> > >>> Quan Xu <quan.xu@intel.com> 12/23/15 9:26 AM >>>
> > >This patches are based on Kevin Tian's previous discussion 'Revisit
> > >VT-d asynchronous
> > flush issue'.
> > >Fix current timeout concern and also allow limited ATS support in a light way:
> > >
> > >1. Check VT-d Device-TLB flush error.
> > >This patch checks all kinds of error and all the way up the call
> > >trees of VT-d Device-TLB
> > flush.
> > >
> > >2. Reduce spin timeout to 1ms, which can be boot-time changed with
> > 'iommu_qi_timeout_ms'.
> > >For example:
> > >multiboot /boot/xen.gz ats=1 iommu_qi_timeout_ms=100
> > >
> > >3. Fix vt-d Device-TLB flush timeout issue.
> > >Now if IOTLB/Context/IETC flush is timeout, panic hypervisor. The
> > >coming patch set will fix it.
> >
> > There must have been a misunderstanding: Your earlier outline didn't
> > indicate you mean to introduce panics here, even if only temporarily.
> > I'm afraid I'm not currently willing to take any conceptually wrong
> > patches anymore with just the promise of fixing the issue(s) later
> > (and I think we've mentioned this in some past discussion on the list,
> > albeit unlikely in the context of any of your work). This may mean that the
> earlier described ordering of things you mean to do needs changing.
> >
> > I'm sorry that you're hit first by this, the more that it was not you
> > but colleagues of yours causing this change to the acceptance model.
> >
> 
> I believe Quan's point here is to point out the current fact. That is why he said a
> coming patch will fix that behavior based on earlier discussion. It's not related
> with this patch set which is only step-1 in his plan.
> 

Yes.

> Quan, I think it's confusing for you to miss description of this step-1 patch, with
> some TODO discussions together for later steps. It'd be good for the message
> staying with whatever is changed in this Device-TLB flush fix, and then in the
> end you list several TODO bullets to let community know future plan (again, just
> list of expected tasks. no need to discuss them until you have related code
> ready).
> 

I will fix it in next v5.

> This way it should make reviewers more focus on the points you want to get
> reviewed. :-)

Thanks Jan / Kevin. :):)

-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error.
  2015-12-23  8:25 ` [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error Quan Xu
@ 2015-12-25  2:53   ` Tian, Kevin
  2016-01-06  2:12     ` Xu, Quan
  2016-01-14 17:01     ` Jan Beulich
  2016-01-14 17:15   ` Jan Beulich
  1 sibling, 2 replies; 60+ messages in thread
From: Tian, Kevin @ 2015-12-25  2:53 UTC (permalink / raw)
  To: Xu, Quan, jbeulich
  Cc: Wu, Feng, Dong, Eddie, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> From: Xu, Quan
> Sent: Wednesday, December 23, 2015 4:26 PM
> 
> This patch checks all kinds of error and all the way up
> the call trees of VT-d Device-TLB flush.
> 
> Signed-off-by: Quan Xu <quan.xu@intel.com>
> ---
>  xen/arch/x86/acpi/power.c                     |   8 +-
>  xen/arch/x86/crash.c                          |   3 +-
>  xen/arch/x86/domain_build.c                   |   5 +-
>  xen/arch/x86/mm.c                             |  15 ++-
>  xen/arch/x86/mm/p2m-ept.c                     |  14 ++-
>  xen/arch/x86/mm/p2m-pt.c                      |  14 ++-
>  xen/arch/x86/mm/p2m.c                         |  19 +++-
>  xen/arch/x86/x86_64/mm.c                      |   7 +-
>  xen/common/domain.c                           |   3 +-
>  xen/common/grant_table.c                      |   5 +-
>  xen/common/memory.c                           |  13 ++-
>  xen/drivers/passthrough/amd/iommu_init.c      |   4 +-
>  xen/drivers/passthrough/amd/pci_amd_iommu.c   |   4 +-
>  xen/drivers/passthrough/arm/smmu.c            |  13 ++-
>  xen/drivers/passthrough/iommu.c               |  47 +++++---
>  xen/drivers/passthrough/vtd/extern.h          |   4 +-
>  xen/drivers/passthrough/vtd/iommu.c           | 157
> ++++++++++++++++++++------
>  xen/drivers/passthrough/vtd/qinval.c          |   2 +-
>  xen/drivers/passthrough/vtd/quirks.c          |  26 +++--
>  xen/drivers/passthrough/vtd/x86/vtd.c         |  13 ++-
>  xen/drivers/passthrough/x86/iommu.c           |   6 +-
>  xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |   4 +-
>  xen/include/asm-x86/iommu.h                   |   2 +-
>  xen/include/xen/iommu.h                       |  20 ++--
>  24 files changed, 300 insertions(+), 108 deletions(-)
> 
> diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
> index f41f0de..1974721 100644
> --- a/xen/arch/x86/acpi/power.c
> +++ b/xen/arch/x86/acpi/power.c
> @@ -45,6 +45,8 @@ void do_suspend_lowlevel(void);
> 
>  static int device_power_down(void)
>  {
> +    int rc;
> +
>      console_suspend();
> 
>      time_suspend();
> @@ -53,7 +55,9 @@ static int device_power_down(void)
> 
>      ioapic_suspend();
> 
> -    iommu_suspend();
> +    rc = iommu_suspend();
> +    if ( rc )
> +        return rc;
> 
>      lapic_suspend();
> 

Looks error handling is not only a problem in VT-d code. Above
actually should check return values of all suspend callbacks. Just
checking iommu_suspend is not enough, but it's a good improvement
anyway...

[...]

> diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
> index bca6fe7..a11bc2a 100644
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -1627,7 +1627,10 @@ int __init construct_dom0(
>      }
> 
>      if ( d->domain_id == hardware_domid )
> -        iommu_hwdom_init(d);
> +    {
> +        if ( iommu_hwdom_init(d) )
> +            printk("Xen warning : IOMMU hardware domain init failed.\n");
> +    }

if construct_dom0 fails, guess we can panic here? e.g. simply move earlier
BUG_ON(rc != 0) after above trunk. In an ideal case we may disable 
iommu_enabled upon error at this point, to allow moving forward. But that
can be improved separately.

[...]

> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 202ff76..3c1db05 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -2443,11 +2443,18 @@ static int __get_page_type(struct page_info *page, unsigned
> long type,
>          if ( d && is_pv_domain(d) && unlikely(need_iommu(d)) )
>          {
>              if ( (x & PGT_type_mask) == PGT_writable_page )
> -                iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
> +            {
> +                rc = iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
> +                return rc;
> +            }

looks you return absolutely regardless of error check. There are still some
useful code after this point...

>              else if ( type == PGT_writable_page )
> -                iommu_map_page(d, mfn_to_gmfn(d, page_to_mfn(page)),
> -                               page_to_mfn(page),
> -                               IOMMUF_readable|IOMMUF_writable);
> +            {
> +                rc = iommu_map_page(d, mfn_to_gmfn(d, page_to_mfn(page)),
> +                                    page_to_mfn(page),
> +                                    IOMMUF_readable|IOMMUF_writable);
> +                if ( rc )
> +                    return rc;
> +            }

this one is correct.

[...]

> diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
> index 709920a..b2b340d 100644
> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -675,11 +675,19 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long
> gfn, mfn_t mfn,
>          }

curious why there's no similar check on code below:

        if ( iommu_use_hap_pt(p2m->domain) )
        {
            if ( iommu_old_flags )
                **amd_iommu_flush_pages(p2m->domain, gfn, page_order)**;
        }

>          else if ( iommu_pte_flags )
>              for ( i = 0; i < (1UL << page_order); i++ )
> -                iommu_map_page(p2m->domain, gfn + i, mfn_x(mfn) + i,
> -                               iommu_pte_flags);
> +            {
> +                rc = iommu_map_page(p2m->domain, gfn + i, mfn_x(mfn) + i,
> +                                    iommu_pte_flags);
> +                if ( rc )
> +                    goto out;

looks 'break' should be enough here.

[...]

> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> index c6b883d..6b43da0 100644
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -654,7 +659,7 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>      p2m_access_t a;
>      mfn_t omfn;
>      int pod_count = 0;
> -    int rc = 0;
> +    int rc = 0, ret = 0;
> 
>      if ( !paging_mode_translate(d) )
>      {
> @@ -667,7 +672,15 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>                  if ( rc != 0 )
>                  {
>                      while ( i-- > 0 )
> -                        iommu_unmap_page(d, mfn + i);
> +                    {
> +                        ret = iommu_unmap_page(d, mfn + i);
> +                        if ( ret )
> +                            break;
> +                    }
> +
> +                    if ( ret )
> +                        rc = ret;
> +

you can reuse 'rc' here.

[...]

> diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
> index d918002..fe7b10c 100644
> --- a/xen/arch/x86/x86_64/mm.c
> +++ b/xen/arch/x86/x86_64/mm.c
> @@ -1438,7 +1438,12 @@ int memory_add(unsigned long spfn, unsigned long epfn,
> unsigned int pxm)
>          if ( i != epfn )
>          {
>              while (i-- > old_max)
> -                iommu_unmap_page(hardware_domain, i);
> +            {
> +                ret = iommu_unmap_page(hardware_domain, i);
> +                if ( ret )
> +                    break;
> +            }
> +

here you can do simple check:

	if (iommu_unmap_page(hardware_domain, i))
		break;

[...]

> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 1b9fcfc..11f526d 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -228,7 +228,8 @@ static int late_hwdom_init(struct domain *d)
> 
>      rcu_unlock_domain(dom0);
> 
> -    iommu_hwdom_init(d);
> +    if ( iommu_hwdom_init(d) )
> +        printk("Xen warning : IOMMU hardware domain init failed.\n");
> 
>      return rv;
>  #else


rv = iommu_hwdom_init(d), otherwise error is not propagated 
outside.

> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
> index 2b449d5..5faa61e 100644
> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -920,7 +920,10 @@ __gnttab_map_grant_ref(
>                  nr_gets++;
>                  (void)get_page(pg, rd);
>                  if ( !(op->flags & GNTMAP_readonly) )
> -                    get_page_type(pg, PGT_writable_page);
> +                {
> +                    if ( get_page_type(pg, PGT_writable_page) )
> +                        goto could_not_pin;
> +                }

combine two ifs together.

> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index b541f4a1..989b461 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -593,6 +593,10 @@ static int xenmem_add_to_physmap(struct domain *d,
>      unsigned int done = 0;
>      long rc = 0;
> 
> +#ifdef HAS_PASSTHROUGH
> +    int ret = 0;
> +#endif
> +

I think you can reuse rc here.

>      if ( xatp->space != XENMAPSPACE_gmfn_range )
>          return xenmem_add_to_physmap_one(d, xatp->space, DOMID_INVALID,
>                                           xatp->idx, xatp->gpfn);
> @@ -631,8 +635,13 @@ static int xenmem_add_to_physmap(struct domain *d,
>      if ( need_iommu(d) )
>      {
>          this_cpu(iommu_dont_flush_iotlb) = 0;
> -        iommu_iotlb_flush(d, xatp->idx - done, done);
> -        iommu_iotlb_flush(d, xatp->gpfn - done, done);
> +        ret = iommu_iotlb_flush(d, xatp->idx - done, done);
> +        if ( ret )
> +            return ret;
> +
> +        ret = iommu_iotlb_flush(d, xatp->gpfn - done, done);
> +        if ( ret )
> +            return ret;
>      }

rc = iommu_iotlb_flush(d, xatp->idx - done, done);
if ( !rc )
	rc = iommu_iotlb_flush(d, xatp->gpfn - done, done);

return rc;

[...]

> diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
> index d5137733..34e4ef9 100644
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -146,14 +146,15 @@ static void __hwdom_init check_hwdom_reqs(struct domain *d)
>      iommu_dom0_strict = 1;
>  }
> 
> -void __hwdom_init iommu_hwdom_init(struct domain *d)
> +int __hwdom_init iommu_hwdom_init(struct domain *d)
>  {
>      struct hvm_iommu *hd = domain_hvm_iommu(d);
> +    int rc = 0;
> 
>      check_hwdom_reqs(d);
> 
>      if ( !iommu_enabled )
> -        return;
> +        return -EINVAL;

iommu_enabled can be false if user chooses so. You should return
ZERO here to indicate success.

[...]

> @@ -354,11 +358,19 @@ int iommu_do_domctl(
>      return ret;
>  }
> 
> -void iommu_suspend()
> +int iommu_suspend()
>  {
>      const struct iommu_ops *ops = iommu_get_ops();
> +    int rc;
> +
>      if ( iommu_enabled )
> -        ops->suspend();
> +    {
> +        rc = ops->suspend();
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    return 0;

if ( iommu_enabled )
	return ops->suspend();

return 0;

> @@ -369,12 +381,21 @@ void iommu_share_p2m_table(struct domain* d)
>          ops->share_p2m(d);
>  }
> 
> -void iommu_crash_shutdown(void)
> +int iommu_crash_shutdown(void)
>  {
>      const struct iommu_ops *ops = iommu_get_ops();
> +    int rc;
> +
>      if ( iommu_enabled )
> -        ops->crash_shutdown();
> +    {
> +        rc = ops->crash_shutdown();
> +        if ( rc )
> +            return rc;
> +    }
> +

ditto.

[...]

> diff --git a/xen/drivers/passthrough/vtd/iommu.c
> b/xen/drivers/passthrough/vtd/iommu.c
> index dd13865..08aaaec 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -566,6 +571,7 @@ static void __intel_iommu_iotlb_flush(struct domain *d, unsigned
> long gfn,
>      struct iommu *iommu;
>      int flush_dev_iotlb;
>      int iommu_domid;
> +    int rc;
> 
>      /*
>       * No need pcideves_lock here because we have flush
> @@ -585,36 +591,47 @@ static void __intel_iommu_iotlb_flush(struct domain *d,
> unsigned long gfn,
> 
>          if ( page_count > 1 || gfn == -1 )
>          {
> -            if ( iommu_flush_iotlb_dsi(iommu, iommu_domid,
> -                        0, flush_dev_iotlb) )
> +            rc = iommu_flush_iotlb_dsi(iommu, iommu_domid,
> +                        0, flush_dev_iotlb);
> +            if ( rc )
> +            {
>                  iommu_flush_write_buffer(iommu);
> +                return rc;
> +            }
>          }
>          else
>          {
> -            if ( iommu_flush_iotlb_psi(iommu, iommu_domid,
> +            rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
>                          (paddr_t)gfn << PAGE_SHIFT_4K, 0,
> -                        !dma_old_pte_present, flush_dev_iotlb) )
> +                        !dma_old_pte_present, flush_dev_iotlb);
> +            if ( rc )
> +            {
>                  iommu_flush_write_buffer(iommu);
> +                return rc;
> +            }

iommu_flush_write_buffer can be combined to one for above
two branches.

>  /* clear one page's page table */
> -static void dma_pte_clear_one(struct domain *domain, u64 addr)
> +static int dma_pte_clear_one(struct domain *domain, u64 addr)
>  {
>      struct hvm_iommu *hd = domain_hvm_iommu(domain);
>      struct dma_pte *page = NULL, *pte = NULL;
>      u64 pg_maddr;
> +    int rc;
> 
>      spin_lock(&hd->arch.mapping_lock);
>      /* get last level pte */
> @@ -622,7 +639,7 @@ static void dma_pte_clear_one(struct domain *domain, u64 addr)
>      if ( pg_maddr == 0 )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
> -        return;
> +        return -ENOENT;

stay consistent to other places which use -ENOMEM.

>      }
> 
>      page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
> @@ -632,7 +649,7 @@ static void dma_pte_clear_one(struct domain *domain, u64 addr)
>      {
>          spin_unlock(&hd->arch.mapping_lock);
>          unmap_vtd_domain_page(page);
> -        return;
> +        return -ENOENT;

It's a sane case if above code is referred to below:

    if ( !dma_pte_present(*pte) )
    {
        spin_unlock(&hd->arch.mapping_lock);
        unmap_vtd_domain_page(page);
        return;
    }

>      }
> 
>      dma_clear_pte(*pte);
> @@ -640,9 +657,18 @@ static void dma_pte_clear_one(struct domain *domain, u64
> addr)
>      iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
> 
>      if ( !this_cpu(iommu_dont_flush_iotlb) )
> -        __intel_iommu_iotlb_flush(domain, addr >> PAGE_SHIFT_4K, 1, 1);
> +    {
> +        rc = __intel_iommu_iotlb_flush(domain, addr >> PAGE_SHIFT_4K, 1, 1);
> +        if ( rc )
> +        {
> +            unmap_vtd_domain_page(page);
> +            return rc;
> +        }
> +    }

no need for immediate check above. you can return rc in the end.

> 
>      unmap_vtd_domain_page(page);
> +
> +    return 0;
>  }
> 
>  static void iommu_free_pagetable(u64 pt_maddr, int level)
> @@ -1251,20 +1277,24 @@ static int intel_iommu_domain_init(struct domain *d)
>      return 0;
>  }
> 
> -static void __hwdom_init intel_iommu_hwdom_init(struct domain *d)
> +static int __hwdom_init intel_iommu_hwdom_init(struct domain *d)
>  {
>      struct acpi_drhd_unit *drhd;
> +    int rc;
> 
>      if ( !iommu_passthrough && !need_iommu(d) )
>      {
>          /* Set up 1:1 page table for hardware domain. */
> -        vtd_set_hwdom_mapping(d);
> +        rc = vtd_set_hwdom_mapping(d);
> +        if ( rc )
> +            return rc;
>      }
> 
>      setup_hwdom_pci_devices(d, setup_hwdom_device);
>      setup_hwdom_rmrr(d);
> 
> -    iommu_flush_all();
> +    if ( iommu_flush_all() )
> +        printk("Xen warning : iommu flush error.\n");

why no error return in this case.

> 
>      for_each_drhd_unit ( drhd )
>      {
> @@ -1273,6 +1303,8 @@ static void __hwdom_init intel_iommu_hwdom_init(struct
> domain *d)
>              BUG();
>          iommu_enable_translation(drhd);
>      }
> +
> +    return 0;
>  }
> 
>  int domain_context_mapping_one(
> @@ -1404,7 +1436,14 @@ int domain_context_mapping_one(
>      else
>      {
>          int flush_dev_iotlb = find_ats_dev_drhd(iommu) ? 1 : 0;
> -        iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
> +        int rc;
> +
> +        rc = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
> +        if ( rc )
> +        {
> +            unmap_vtd_domain_page(context_entries);
> +            return rc;
> +        }
>      }
> 
>      set_bit(iommu->index, &hd->arch.iommu_bitmap);
> @@ -1412,7 +1451,13 @@ int domain_context_mapping_one(
>      unmap_vtd_domain_page(context_entries);
> 
>      if ( !seg )
> -        me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
> +    {
> +        int rc;
> +
> +        rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
> +        if ( rc )
> +            return rc;
> +    }

if ( !seg )
	return me_wifi_quirk(...);

> 
>      return 0;
>  }
> @@ -1509,6 +1554,7 @@ int domain_context_unmap_one(
>      struct context_entry *context, *context_entries;
>      u64 maddr;
>      int iommu_domid;
> +    int rc;
> 
>      ASSERT(spin_is_locked(&pcidevs_lock));
>      spin_lock(&iommu->lock);
> @@ -1543,15 +1589,24 @@ int domain_context_unmap_one(
>      else
>      {
>          int flush_dev_iotlb = find_ats_dev_drhd(iommu) ? 1 : 0;
> -        iommu_flush_iotlb_dsi(iommu, iommu_domid, 0, flush_dev_iotlb);
> +        rc = iommu_flush_iotlb_dsi(iommu, iommu_domid, 0, flush_dev_iotlb);
> +        if ( rc )
> +        {
> +            spin_unlock(&iommu->lock);
> +            unmap_vtd_domain_page(context_entries);
> +            return rc;
> +        }
>      }

just rc = iommu_flush_iotlb_dsi(...) should be enough. see later.

> 
>      spin_unlock(&iommu->lock);
>      unmap_vtd_domain_page(context_entries);
> 
>      if ( !iommu->intel->drhd->segment )
> -        me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
> -
> +    {
> +       rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
> +       if ( rc )
> +           return rc;
> +    }
>      return 0;

if ( !rc && !iommu->intel->drhd->segment )
	rc = me_wifi_quirk(...);

return rc;

>  }
> 
> @@ -1700,6 +1755,7 @@ static int intel_iommu_map_page(
>      struct hvm_iommu *hd = domain_hvm_iommu(d);
>      struct dma_pte *page = NULL, *pte = NULL, old, new = { 0 };
>      u64 pg_maddr;
> +    int rc;
> 
>      /* Do nothing if VT-d shares EPT page table */
>      if ( iommu_use_hap_pt(d) )
> @@ -1742,30 +1798,39 @@ static int intel_iommu_map_page(
>      unmap_vtd_domain_page(page);
> 
>      if ( !this_cpu(iommu_dont_flush_iotlb) )
> -        __intel_iommu_iotlb_flush(d, gfn, dma_pte_present(old), 1);
> +    {
> +        rc = __intel_iommu_iotlb_flush(d, gfn, dma_pte_present(old), 1);
> +        if ( rc )
> +            return rc;
> +    }

if ( !this_cpu(iommu_dont_flush_iotlb) )
	return __intel_iommu_iotlb_flush(...);

I'll stop comment for similar refinement. Please check to improve
in next version. :-)

> 
>      return 0;
>  }
> 


[...]


> 
> -void iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
> -                     int order, int present)
> +int iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
> +                    int order, int present)
>  {
>      struct acpi_drhd_unit *drhd;
>      struct iommu *iommu = NULL;
>      struct hvm_iommu *hd = domain_hvm_iommu(d);
>      int flush_dev_iotlb;
>      int iommu_domid;
> +    int rc;
> 
>      iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
> 
> @@ -1779,11 +1844,17 @@ void iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
>          iommu_domid= domain_iommu_domid(d, iommu);
>          if ( iommu_domid == -1 )
>              continue;
> -        if ( iommu_flush_iotlb_psi(iommu, iommu_domid,
> +        rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
>                                     (paddr_t)gfn << PAGE_SHIFT_4K,
> -                                   order, !present, flush_dev_iotlb) )
> +                                   order, !present, flush_dev_iotlb);
> +        if ( rc )
> +        {
>              iommu_flush_write_buffer(iommu);
> +            return rc;
> +        }
>      }

just curious. if write_buffer needs be flushed for every iotlb flush error,
shouldn't it be better handled within iommu_flush_... instead of duplicating
in every caller?

> +
> +    return 0;
>  }
> 
>  static int __init vtd_ept_page_compatible(struct iommu *iommu)

[...]

> @@ -2372,16 +2447,19 @@ static int intel_iommu_group_id(u16 seg, u8 bus, u8 devfn)
>  }
> 
>  static u32 iommu_state[MAX_IOMMUS][MAX_IOMMU_REGS];
> -static void vtd_suspend(void)
> +static int vtd_suspend(void)
>  {
>      struct acpi_drhd_unit *drhd;
>      struct iommu *iommu;
> +    int rc;
>      u32    i;
> 
>      if ( !iommu_enabled )
> -        return;
> +        return -EINVAL;

not an error.

> 
> -    iommu_flush_all();
> +    rc = iommu_flush_all();
> +    if ( rc )
> +        return rc;
> 
>      for_each_drhd_unit ( drhd )
>      {
> @@ -2410,17 +2488,22 @@ static void vtd_suspend(void)
>          if ( !iommu_intremap && iommu_qinval )
>              disable_qinval(iommu);
>      }
> +
> +    return 0;
>  }
> 
> -static void vtd_crash_shutdown(void)
> +static int vtd_crash_shutdown(void)
>  {
>      struct acpi_drhd_unit *drhd;
>      struct iommu *iommu;
> +    int rc;
> 
>      if ( !iommu_enabled )
> -        return;
> +        return -EINVAL;

ditto

> 
> -    iommu_flush_all();
> +    rc = iommu_flush_all();
> +    if ( rc )
> +        return rc;
> 
>      for_each_drhd_unit ( drhd )
>      {
> @@ -2429,6 +2512,8 @@ static void vtd_crash_shutdown(void)
>          disable_intremap(drhd->iommu);
>          disable_qinval(drhd->iommu);
>      }
> +
> +    return 0;
>  }
> 
>  static void vtd_resume(void)
> diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
> index b81b0bd..946e812 100644
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -324,7 +324,7 @@ static int flush_iotlb_qi(
>      if ( flush_non_present_entry )
>      {
>          if ( !cap_caching_mode(iommu->cap) )
> -            return 1;
> +            return 0;

this looks problematic. originally 0/1 is used to indicate whether caller
needs to flush cache. Here you return 0 then may break something...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2015-12-23  8:25 ` [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
@ 2015-12-25  2:56   ` Tian, Kevin
  2015-12-25  2:58     ` Xu, Quan
  2016-01-13 16:55   ` Jan Beulich
  2016-01-13 16:57   ` Jan Beulich
  2 siblings, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2015-12-25  2:56 UTC (permalink / raw)
  To: Xu, Quan, jbeulich
  Cc: Wu, Feng, Dong, Eddie, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> From: Xu, Quan
> Sent: Wednesday, December 23, 2015 4:26 PM
> 
> Signed-off-by: Quan Xu <quan.xu@intel.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2015-12-25  2:56   ` Tian, Kevin
@ 2015-12-25  2:58     ` Xu, Quan
  0 siblings, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2015-12-25  2:58 UTC (permalink / raw)
  To: Tian, Kevin, jbeulich
  Cc: Wu, Feng, Dong, Eddie, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> On 25.12.2015 at 10:57am, <Tian, Kevin> wrote:
> > From: Xu, Quan
> > Sent: Wednesday, December 23, 2015 4:26 PM
> >
> > Signed-off-by: Quan Xu <quan.xu@intel.com>
> 
> Acked-by: Kevin Tian <kevin.tian@intel.com>

Thanks Kevin!

-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2015-12-23  8:25 ` [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue Quan Xu
@ 2015-12-25  3:06   ` Tian, Kevin
  2016-01-06 11:25     ` Xu, Quan
  2016-01-15 13:09   ` Jan Beulich
  1 sibling, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2015-12-25  3:06 UTC (permalink / raw)
  To: Xu, Quan, jbeulich
  Cc: Wu, Feng, Dong, Eddie, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> From: Xu, Quan
> Sent: Wednesday, December 23, 2015 4:26 PM
> 
> Now if IOTLB/Context/IETC flush is timeout, panic hypervisor.
> The coming patch set will fix it.

again, please adjust comment to reflect what this patch is
doing, i.e. only about improvement on Device-TLB flush.

> 
> If Device-TLB flush is timeout, we'll hide the target ATS
> device and crash the domain owning this ATS device.
> 
> If impacted domain is hardware domain, just throw out a warning.
> 
> The hided Device will be disallowed to be further assigned to
> any domain.

hided Device -> hidden device

> 
> Signed-off-by: Quan Xu <quan.xu@intel.com>
> ---
>  xen/drivers/passthrough/pci.c         |  2 +-
>  xen/drivers/passthrough/vtd/extern.h  |  2 +
>  xen/drivers/passthrough/vtd/qinval.c  | 80
> ++++++++++++++++++++++++++++++++++-
>  xen/drivers/passthrough/vtd/x86/ats.c | 13 ++++++
>  4 files changed, 94 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 27b3ca7..2d7dc59 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -407,7 +407,7 @@ static void _pci_hide_device(struct pci_dev *pdev)
>      list_add(&pdev->domain_list, &dom_xen->arch.pdev_list);
>  }
> 
> -int __init pci_hide_device(int bus, int devfn)
> +int pci_hide_device(int bus, int devfn)
>  {
>      struct pci_dev *pdev;
>      int rc = -ENOMEM;
> diff --git a/xen/drivers/passthrough/vtd/extern.h
> b/xen/drivers/passthrough/vtd/extern.h
> index ec9c513..a2123ce 100644
> --- a/xen/drivers/passthrough/vtd/extern.h
> +++ b/xen/drivers/passthrough/vtd/extern.h
> @@ -58,6 +58,8 @@ int ats_device(const struct pci_dev *, const struct acpi_drhd_unit *);
> 
>  int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
>                           u64 addr, unsigned int size_order, u64 type);
> +int dev_invalidate_iotlb_sync(struct iommu *iommu, u16 did,
> +                              u16 seg, u8 bus, u8 devfn);
> 
>  int qinval_device_iotlb(struct iommu *iommu,
>                          u32 max_invs_pend, u16 sid, u16 size, u64 addr);
> diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
> index b227e4e..7330c5d 100644
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu *iommu,
>  static int invalidate_sync(struct iommu *iommu)
>  {
>      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> +    int rc;
> 
>      if ( qi_ctrl->qinval_maddr )
> -        return queue_invalidate_wait(iommu, 0, 1, 1);
> +    {
> +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> +        if ( rc )
> +        {
> +            if ( rc == -ETIMEDOUT )
> +                panic("Queue invalidate wait descriptor was timeout.\n");
> +            return rc;
> +        }
> +    }
> +

why do you still do panic here? w/ PATCH 1/3 you should return rc
to caller directly, right?

>      return 0;
>  }
> 
> @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
>      return 0;
>  }
> 
> +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> +                                         u16 seg, u8 bus, u8 devfn)
> +{
> +    struct domain *d;
> +    struct pci_dev *pdev;
> +
> +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> +    ASSERT(d);
> +    for_each_pdev(d, pdev)
> +    {
> +        if ( (pdev->seg == seg) &&
> +             (pdev->bus == bus) &&
> +             (pdev->devfn == devfn) )
> +        {
> +            if ( pdev->domain )
> +            {
> +                list_del(&pdev->domain_list);
> +                pdev->domain = NULL;
> +            }
> +
> +            if ( pci_hide_device(bus, devfn) )
> +            {
> +                printk(XENLOG_ERR
> +                       "IOMMU hide device %04x:%02x:%02x error.",
> +                       seg, bus, devfn);
> +                break;
> +            }

I'm not sure whether using pci_hide_device is enough or not break
other code path here? For example, now you have errors propagated
to callers. What's the final policy against flush error? If they will
finally go to cleanup some more state relying on pdev->domain, then
that code path might be broken. If it's fine to do cleanup at this point,
then just hidden is not enough. If you look at pci_remove_device, there's
quite some state to be further cleaned up...

I didn't think about the full logic thoroughly now. But it would always be
good to not hide device now until a point where all related states have
been cleaned up in error handling path chained up.

Thanks
Kevin 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error.
  2015-12-25  2:53   ` Tian, Kevin
@ 2016-01-06  2:12     ` Xu, Quan
  2016-01-14 17:01     ` Jan Beulich
  1 sibling, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2016-01-06  2:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Wu, Feng, Dong, Eddie, george.dunlap, andrew.cooper3, tim,
	xen-devel, jbeulich, Nakajima, Jun, keir

On December 25 2015 10:54 AM, <Tian, Kevin> wrote:
> > From: Xu, Quan
> > Sent: Wednesday, December 23, 2015 4:26 PM
> >
> > This patch checks all kinds of error and all the way up the call trees
> > of VT-d Device-TLB flush.
> >
> > Signed-off-by: Quan Xu <quan.xu@intel.com>
> > ---
> >  xen/arch/x86/acpi/power.c                     |   8 +-
> >  xen/arch/x86/crash.c                          |   3 +-
> >  xen/arch/x86/domain_build.c                   |   5 +-
> >  xen/arch/x86/mm.c                             |  15 ++-
> >  xen/arch/x86/mm/p2m-ept.c                     |  14 ++-
> >  xen/arch/x86/mm/p2m-pt.c                      |  14 ++-
> >  xen/arch/x86/mm/p2m.c                         |  19 +++-
> >  xen/arch/x86/x86_64/mm.c                      |   7 +-
> >  xen/common/domain.c                           |   3 +-
> >  xen/common/grant_table.c                      |   5 +-
> >  xen/common/memory.c                           |  13 ++-
> >  xen/drivers/passthrough/amd/iommu_init.c      |   4 +-
> >  xen/drivers/passthrough/amd/pci_amd_iommu.c   |   4 +-
> >  xen/drivers/passthrough/arm/smmu.c            |  13 ++-
> >  xen/drivers/passthrough/iommu.c               |  47 +++++---
> >  xen/drivers/passthrough/vtd/extern.h          |   4 +-
> >  xen/drivers/passthrough/vtd/iommu.c           | 157
> > ++++++++++++++++++++------
> >  xen/drivers/passthrough/vtd/qinval.c          |   2 +-
> >  xen/drivers/passthrough/vtd/quirks.c          |  26 +++--
> >  xen/drivers/passthrough/vtd/x86/vtd.c         |  13 ++-
> >  xen/drivers/passthrough/x86/iommu.c           |   6 +-
> >  xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |   4 +-
> >  xen/include/asm-x86/iommu.h                   |   2 +-
> >  xen/include/xen/iommu.h                       |  20 ++--
> >  24 files changed, 300 insertions(+), 108 deletions(-)
> >

Kevin, 
Thanks for your comments!! It would take much time to review it.
I will fix them in next v5. May I discuss with you f2f for some dubious case?


Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2015-12-25  3:06   ` Tian, Kevin
@ 2016-01-06 11:25     ` Xu, Quan
  0 siblings, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2016-01-06 11:25 UTC (permalink / raw)
  To: Tian, Kevin, jbeulich
  Cc: Wu, Feng, Dong, Eddie, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> On December 25 2015, 11:06 AM, <Tian, Kevin> wrote: 
> > From: Xu, Quan
> > Sent: Wednesday, December 23, 2015 4:26 PM
> >
> > Now if IOTLB/Context/IETC flush is timeout, panic hypervisor.
> > The coming patch set will fix it.
> 
> again, please adjust comment to reflect what this patch is doing, i.e. only about
> improvement on Device-TLB flush.
> 

Agreed.

> >
> > If Device-TLB flush is timeout, we'll hide the target ATS device and
> > crash the domain owning this ATS device.
> >
> > If impacted domain is hardware domain, just throw out a warning.
> >
> > The hided Device will be disallowed to be further assigned to any
> > domain.
> 
> hided Device -> hidden device
> 

Agreed.

[...]

> > diff --git a/xen/drivers/passthrough/vtd/qinval.c
> > b/xen/drivers/passthrough/vtd/qinval.c
> > index b227e4e..7330c5d 100644
> > --- a/xen/drivers/passthrough/vtd/qinval.c
> > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu
> > *iommu,  static int invalidate_sync(struct iommu *iommu)  {
> >      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> > +    int rc;
> >
> >      if ( qi_ctrl->qinval_maddr )
> > -        return queue_invalidate_wait(iommu, 0, 1, 1);
> > +    {
> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> > +        if ( rc )
> > +        {
> > +            if ( rc == -ETIMEDOUT )
> > +                panic("Queue invalidate wait descriptor was
> timeout.\n");
> > +            return rc;
> > +        }
> > +    }
> > +
> 
> why do you still do panic here? w/ PATCH 1/3 you should return rc to caller
> directly, right?
> 

This panic is original in queue_invalidate_wait(). Patch 1/3 is just for Device-TLB flush error ( Actually it covers iotlb flush error).
If I fix VT-d iotlb/context/iec flush error to propagated callers, then I can remove this panic and return rc to propagated caller directly.



> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> > +                                         u16 seg, u8 bus, u8 devfn)
> {
> > +    struct domain *d;
> > +    struct pci_dev *pdev;
> > +
> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> > +    ASSERT(d);
> > +    for_each_pdev(d, pdev)
> > +    {
> > +        if ( (pdev->seg == seg) &&
> > +             (pdev->bus == bus) &&
> > +             (pdev->devfn == devfn) )
> > +        {
> > +            if ( pdev->domain )
> > +            {
> > +                list_del(&pdev->domain_list);
> > +                pdev->domain = NULL;
> > +            }
> > +
> > +            if ( pci_hide_device(bus, devfn) )
> > +            {
> > +                printk(XENLOG_ERR
> > +                       "IOMMU hide device %04x:%02x:%02x error.",
> > +                       seg, bus, devfn);
> > +                break;
> > +            }
> 
> I'm not sure whether using pci_hide_device is enough or not break other code
> path here? For example, now you have errors propagated to callers.

More information, after call pci_hide_device(),
..
         pdev->domain = dom_xen;
         list_add(&pdev->domain_list, &dom_xen->arch.pdev_list);
..

Hidden PCI devices are associated with 'dom_xen'.
Now I found it may not clear rmrr mapping / context mapping. .i.e iommu_remove_device() -- [hd->platform_ops is NULL for dom_xen].
But I think it is acceptable, IMO dom_xen is a part hypervisor, and there mappings would not a security issue.
Or I can remove these mappings before to hide this device.

So I think it will not break other code break other code.

> What's the
> final policy against flush error? 
>

If Device-TLB flush is timeout, we'll hide the target ATS device and crash the domain owning this ATS device.
If impacted domain is hardware domain, just throw out a warning, instead of crash the hardware domain.
The hided Device will be disallowed to be further assigned to any domain.



>If they will finally go to cleanup some more state
> relying on pdev->domain, then that code path might be broken. If it's fine to do
> cleanup at this point, then just hidden is not enough. If you look at
> pci_remove_device, there's quite some state to be further cleaned up...
> 


As the pdev->domain is 'dom_xen';
So for this case, it is still working. Correct me if it is not correct.
BTW, a quick question, which case to call pci_remove_device()?


> I didn't think about the full logic thoroughly now. But it would always be good
> to not hide device now until a point where all related states have been cleaned
> up in error handling path chained up.
> 
 
Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2015-12-23  8:25 ` [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
  2015-12-25  2:56   ` Tian, Kevin
@ 2016-01-13 16:55   ` Jan Beulich
  2016-01-14  1:53     ` Xu, Quan
  2016-01-13 16:57   ` Jan Beulich
  2 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-13 16:55 UTC (permalink / raw)
  To: Quan Xu
  Cc: kevin.tian, feng.wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, keir

>>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -28,6 +28,11 @@
>  #include "vtd.h"
>  #include "extern.h"
>  
> +static int __read_mostly iommu_qi_timeout_ms = 1;

I'll take the liberty to convert this to "unsigned int" when committing.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2015-12-23  8:25 ` [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
  2015-12-25  2:56   ` Tian, Kevin
  2016-01-13 16:55   ` Jan Beulich
@ 2016-01-13 16:57   ` Jan Beulich
  2016-01-14  1:59     ` Xu, Quan
  2 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-13 16:57 UTC (permalink / raw)
  To: Quan Xu
  Cc: kevin.tian, feng.wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, keir

>>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -28,6 +28,11 @@
>  #include "vtd.h"
>  #include "extern.h"
>  
> +static int __read_mostly iommu_qi_timeout_ms = 1;
> +integer_param("iommu_qi_timeout_ms", iommu_qi_timeout_ms);

But wait - this needs to be accompanied by an entry in
docs/misc/xen-command-line.markdown.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2016-01-13 16:55   ` Jan Beulich
@ 2016-01-14  1:53     ` Xu, Quan
  0 siblings, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2016-01-14  1:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

On 14.01.2016 at 12:56am, <JBeulich@suse.com> wrote:
> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> > --- a/xen/drivers/passthrough/vtd/qinval.c
> > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > @@ -28,6 +28,11 @@
> >  #include "vtd.h"
> >  #include "extern.h"
> >
> > +static int __read_mostly iommu_qi_timeout_ms = 1;
> 
> I'll take the liberty to convert this to "unsigned int" when committing.
> 

Jan, thanks.
I can also do it in next v5.

Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2016-01-13 16:57   ` Jan Beulich
@ 2016-01-14  1:59     ` Xu, Quan
  2016-01-14  9:03       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-14  1:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

On 14.01.2016 at 12:58am, <JBeulich@suse.com> wrote:
> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> > --- a/xen/drivers/passthrough/vtd/qinval.c
> > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > @@ -28,6 +28,11 @@
> >  #include "vtd.h"
> >  #include "extern.h"
> >
> > +static int __read_mostly iommu_qi_timeout_ms = 1;
> > +integer_param("iommu_qi_timeout_ms", iommu_qi_timeout_ms);
> 
> But wait - this needs to be accompanied by an entry in
> docs/misc/xen-command-line.markdown.
> 

Jan, Is the following right?

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -901,6 +901,14 @@ debug hypervisor only).

 >> Control the use of Queued Invalidation.

+> `qi_timeout_ms` (VT-d)
+
+> Default: `1`
+
+> `= <integer>`
+
+>> Specify the timeout of the VT-d Queued Invalidation in ms.
+
 > `snoop` (Intel)

 > Default: `true`


-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2016-01-14  1:59     ` Xu, Quan
@ 2016-01-14  9:03       ` Jan Beulich
  2016-01-14 10:29         ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-14  9:03 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, Jun Nakajima, keir

>>> On 14.01.16 at 02:59, <quan.xu@intel.com> wrote:
> On 14.01.2016 at 12:58am, <JBeulich@suse.com> wrote:
>> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
>> > --- a/xen/drivers/passthrough/vtd/qinval.c
>> > +++ b/xen/drivers/passthrough/vtd/qinval.c
>> > @@ -28,6 +28,11 @@
>> >  #include "vtd.h"
>> >  #include "extern.h"
>> >
>> > +static int __read_mostly iommu_qi_timeout_ms = 1;
>> > +integer_param("iommu_qi_timeout_ms", iommu_qi_timeout_ms);
>> 
>> But wait - this needs to be accompanied by an entry in
>> docs/misc/xen-command-line.markdown.
>> 
> 
> Jan, Is the following right?

No - you appear to add to the "iommu" option description. Either you
need to change the implementation to match that (which probably
would be a good idea, as that's what "iommu" is being used for in
other cases), or you need to add a new standalone entry with the
correct option name.

Jan

> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -901,6 +901,14 @@ debug hypervisor only).
> 
>  >> Control the use of Queued Invalidation.
> 
> +> `qi_timeout_ms` (VT-d)
> +
> +> Default: `1`
> +
> +> `= <integer>`
> +
> +>> Specify the timeout of the VT-d Queued Invalidation in ms.
> +
>  > `snoop` (Intel)
> 
>  > Default: `true`
> 
> 
> -Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2016-01-14  9:03       ` Jan Beulich
@ 2016-01-14 10:29         ` Xu, Quan
  2016-01-14 12:17           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-14 10:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

On 14.01.2016 at 5:04pm, <JBeulich@suse.com> wrote:
> >>> On 14.01.16 at 02:59, <quan.xu@intel.com> wrote:
> > On 14.01.2016 at 12:58am, <JBeulich@suse.com> wrote:
> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> >> > --- a/xen/drivers/passthrough/vtd/qinval.c
> >> > +++ b/xen/drivers/passthrough/vtd/qinval.c
> >> > @@ -28,6 +28,11 @@
> >> >  #include "vtd.h"
> >> >  #include "extern.h"
> >> >
> >> > +static int __read_mostly iommu_qi_timeout_ms = 1;
> >> > +integer_param("iommu_qi_timeout_ms", iommu_qi_timeout_ms);
> >>
> >> But wait - this needs to be accompanied by an entry in
> >> docs/misc/xen-command-line.markdown.
> >>
> >
> > Jan, Is the following right?
> 
> No - you appear to add to the "iommu" option description. Either you need to
> change the implementation to match that (which probably would be a good
> idea, as that's what "iommu" is being used for in other cases), or you need to
> add a new standalone entry with the correct option name.
> 

I prefer to add a new standalone entry.
Maybe I can modify the option name from "iommu_qi_timeout_ms" to "qi_timeout_ms". 


-Quan



> 
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > @@ -901,6 +901,14 @@ debug hypervisor only).
> >
> >  >> Control the use of Queued Invalidation.
> >
> > +> `qi_timeout_ms` (VT-d)
> > +
> > +> Default: `1`
> > +
> > +> `= <integer>`
> > +
> > +>> Specify the timeout of the VT-d Queued Invalidation in ms.
> > +
> >  > `snoop` (Intel)
> >
> >  > Default: `true`
> >
> >
> > -Quan
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  2016-01-14 10:29         ` Xu, Quan
@ 2016-01-14 12:17           ` Jan Beulich
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2016-01-14 12:17 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, Jun Nakajima, keir

>>> On 14.01.16 at 11:29, <quan.xu@intel.com> wrote:
> On 14.01.2016 at 5:04pm, <JBeulich@suse.com> wrote:
>> >>> On 14.01.16 at 02:59, <quan.xu@intel.com> wrote:
>> > On 14.01.2016 at 12:58am, <JBeulich@suse.com> wrote:
>> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
>> >> > --- a/xen/drivers/passthrough/vtd/qinval.c
>> >> > +++ b/xen/drivers/passthrough/vtd/qinval.c
>> >> > @@ -28,6 +28,11 @@
>> >> >  #include "vtd.h"
>> >> >  #include "extern.h"
>> >> >
>> >> > +static int __read_mostly iommu_qi_timeout_ms = 1;
>> >> > +integer_param("iommu_qi_timeout_ms", iommu_qi_timeout_ms);
>> >>
>> >> But wait - this needs to be accompanied by an entry in
>> >> docs/misc/xen-command-line.markdown.
>> >>
>> >
>> > Jan, Is the following right?
>> 
>> No - you appear to add to the "iommu" option description. Either you need to
>> change the implementation to match that (which probably would be a good
>> idea, as that's what "iommu" is being used for in other cases), or you need 
> to
>> add a new standalone entry with the correct option name.
>> 
> 
> I prefer to add a new standalone entry.
> Maybe I can modify the option name from "iommu_qi_timeout_ms" to 
> "qi_timeout_ms". 

vtd-qi-timeout-ms might be okay, but qi as the prefix is bad. The
trailing ms may also not be needed.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error.
  2015-12-25  2:53   ` Tian, Kevin
  2016-01-06  2:12     ` Xu, Quan
@ 2016-01-14 17:01     ` Jan Beulich
  1 sibling, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2016-01-14 17:01 UTC (permalink / raw)
  To: Kevin Tian, Quan Xu
  Cc: tim, Feng Wu, george.dunlap, andrew.cooper3, Eddie Dong,
	xen-devel, Jun Nakajima, keir

>>> On 25.12.15 at 03:53, <kevin.tian@intel.com> wrote:
>>  From: Xu, Quan
>> Sent: Wednesday, December 23, 2015 4:26 PM
>> --- a/xen/arch/x86/acpi/power.c
>> +++ b/xen/arch/x86/acpi/power.c
>> @@ -45,6 +45,8 @@ void do_suspend_lowlevel(void);
>> 
>>  static int device_power_down(void)
>>  {
>> +    int rc;
>> +
>>      console_suspend();
>> 
>>      time_suspend();
>> @@ -53,7 +55,9 @@ static int device_power_down(void)
>> 
>>      ioapic_suspend();
>> 
>> -    iommu_suspend();
>> +    rc = iommu_suspend();
>> +    if ( rc )
>> +        return rc;
>> 
>>      lapic_suspend();
>> 
> 
> Looks error handling is not only a problem in VT-d code. Above
> actually should check return values of all suspend callbacks. Just
> checking iommu_suspend is not enough, but it's a good improvement
> anyway...

No, it's not - it leaves the system in a non-working state without
undoing whatever succeeded already.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error.
  2015-12-23  8:25 ` [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error Quan Xu
  2015-12-25  2:53   ` Tian, Kevin
@ 2016-01-14 17:15   ` Jan Beulich
  1 sibling, 0 replies; 60+ messages in thread
From: Jan Beulich @ 2016-01-14 17:15 UTC (permalink / raw)
  To: Quan Xu
  Cc: kevin.tian, feng.wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, keir

>>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> @@ -182,7 +186,7 @@ static int enter_state(u32 state)
>          error = tboot_s3_resume();
>          break;
>      case ACPI_STATE_S5:
> -        acpi_enter_sleep_state(ACPI_STATE_S5);
> +        error = acpi_enter_sleep_state(ACPI_STATE_S5);

I can't see how this is related to the purpose of the patch. I don't
mind such error checking being added, but not in this huge patch.
It would anyway be nice if you could see about splitting this
apart, to aid reviewing and - in case it would be needed after
committing - bisection.

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -2443,11 +2443,18 @@ static int __get_page_type(struct page_info *page, unsigned long type,
>          if ( d && is_pv_domain(d) && unlikely(need_iommu(d)) )
>          {
>              if ( (x & PGT_type_mask) == PGT_writable_page )
> -                iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
> +            {
> +                rc = iommu_unmap_page(d, mfn_to_gmfn(d, page_to_mfn(page)));
> +                return rc;
> +            }
>              else if ( type == PGT_writable_page )
> -                iommu_map_page(d, mfn_to_gmfn(d, page_to_mfn(page)),
> -                               page_to_mfn(page),
> -                               IOMMUF_readable|IOMMUF_writable);
> +            {
> +                rc = iommu_map_page(d, mfn_to_gmfn(d, page_to_mfn(page)),
> +                                    page_to_mfn(page),
> +                                    IOMMUF_readable|IOMMUF_writable);
> +                if ( rc )
> +                    return rc;
> +            }
>          }
>      }

Again you can't simply return here, or else you leak the type
reference, and you indefinitely stall any other CPU waiting for
page validation to happen.

> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -829,15 +829,23 @@ out:
>           need_modify_vtd_table )
>      {
>          if ( iommu_hap_pt_share )
> -            iommu_pte_flush(d, gfn, &ept_entry->epte, order, vtd_pte_present);
> +            rc = iommu_pte_flush(d, gfn, &ept_entry->epte, order, vtd_pte_present);
>          else
>          {
>              if ( iommu_flags )
>                  for ( i = 0; i < (1 << order); i++ )
> -                    iommu_map_page(d, gfn + i, mfn_x(mfn) + i, iommu_flags);
> +                {
> +                    rc = iommu_map_page(d, gfn + i, mfn_x(mfn) + i, iommu_flags);
> +                    if ( rc )
> +                        break;
> +                }

And the pattern repeats - you can't just exit without undoing what
so far was done.

>              else
>                  for ( i = 0; i < (1 << order); i++ )
> -                    iommu_unmap_page(d, gfn + i);
> +                {
> +                    rc = iommu_unmap_page(d, gfn + i);
> +                    if ( rc )
> +                        break;
> +                }

As a special case, unmapping should perhaps continue despite an error,
in an attempt to do best effort cleanup.

I'm not going to continue further down, as I suspect I'll find more of
the same class of issues.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2015-12-23  8:25 ` [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue Quan Xu
  2015-12-25  3:06   ` Tian, Kevin
@ 2016-01-15 13:09   ` Jan Beulich
  2016-01-20 10:26     ` Xu, Quan
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-15 13:09 UTC (permalink / raw)
  To: Quan Xu
  Cc: kevin.tian, feng.wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, keir

>>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu *iommu,
>  static int invalidate_sync(struct iommu *iommu)
>  {
>      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> +    int rc;
>  
>      if ( qi_ctrl->qinval_maddr )
> -        return queue_invalidate_wait(iommu, 0, 1, 1);
> +    {
> +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> +        if ( rc )
> +        {
> +            if ( rc == -ETIMEDOUT )
> +                panic("Queue invalidate wait descriptor was timeout.\n");

Unless I'm overlooking something this doesn't replace and existing
panic(), and I think I had indicated quite clearly that this series
shouldn't add any new ones, unless the alternative of crashing
the owning domain is completely unfeasible.

> +                panic("Queue invalidate wait descriptor was timeout.\n");
> +            return rc;
> +        }
> +    }
> +
>      return 0;
>  }

Please avoid introducing multiple return points when this can be
trivially avoided.

> @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
>      return 0;
>  }
>  
> +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> +                                         u16 seg, u8 bus, u8 devfn)
> +{
> +    struct domain *d;
> +    struct pci_dev *pdev;
> +
> +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> +    ASSERT(d);

Don't you need to acquire some lock in order to safely assert this?
Also note that unused slots hold zero, i.e. there's at least a
theoretical risk of problems here when you don't also look at
iommu->domid_bitmap.

> +    for_each_pdev(d, pdev)
> +    {
> +        if ( (pdev->seg == seg) &&
> +             (pdev->bus == bus) &&
> +             (pdev->devfn == devfn) )
> +        {
> +            if ( pdev->domain )

If the device is on the domain's list, can this reasonably be false?
I.e. don't you rather mean ASSERT() here?

> +            {
> +                list_del(&pdev->domain_list);

This should happen under pcidevs_lock - you need to either acquire
it or ASSERT() it being held.

> +                pdev->domain = NULL;
> +            }
> +
> +            if ( pci_hide_device(bus, devfn) )
> +            {
> +                printk(XENLOG_ERR
> +                       "IOMMU hide device %04x:%02x:%02x error.",
> +                       seg, bus, devfn);
> +                break;
> +            }
> +
> +            break;
> +        }

Redundant breaks.

> +    }
> +
> +    if ( !is_hardware_domain(d) )
> +        domain_crash(d);

I wonder whether the device hiding shouldn't also be avoided if
the device is owned by hwdom.

> @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
>          queue_invalidate_iotlb(iommu,
>                                 type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
>                                 dw, did, size_order, 0, addr);
> +
> +        /*
> +         * Synchronize with hardware for invalidation request descriptors
> +         * submitted before Device-TLB invalidate descriptor.
> +         */
> +        rc = invalidate_sync(iommu);
> +        if ( rc )
> +             return rc;
> +
>          if ( flush_dev_iotlb )
>              ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
> -        rc = invalidate_sync(iommu);
> +
>          if ( !ret )
>              ret = rc;
>      }

This change is because of ...?

> --- a/xen/drivers/passthrough/vtd/x86/ats.c
> +++ b/xen/drivers/passthrough/vtd/x86/ats.c
> @@ -162,6 +162,19 @@ int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
>              return -EOPNOTSUPP;
>          }
>  
> +        /*
> +         * Synchronize with hardware for Device-TLB invalidate
> +         * descriptor.
> +         */
> +        ret = dev_invalidate_iotlb_sync(iommu, did, pdev->seg,
> +                                        pdev->bus, pdev->devfn);
> +        if ( ret )
> +        {
> +            dprintk(XENLOG_WARNING VTDPREFIX,
> +                    "Device-TLB flush timeout.\n");

Are you aware that dprintk() compiles to nothing in non-debug builds?
Plus logging the error code and device coordinates might turn out
helpful in such cases. But first of all - if there was a timeout, we'd
make it here only for Dom0. Perhaps the printing should move into
an else to the domain_crash()? And if there was another error, the
message would be outright wrong.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-15 13:09   ` Jan Beulich
@ 2016-01-20 10:26     ` Xu, Quan
  2016-01-20 11:29       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-20 10:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

Jan, thanks for your comments.

> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> > --- a/xen/drivers/passthrough/vtd/qinval.c
> > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu
> > *iommu,  static int invalidate_sync(struct iommu *iommu)  {
> >      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> > +    int rc;
> >
> >      if ( qi_ctrl->qinval_maddr )
> > -        return queue_invalidate_wait(iommu, 0, 1, 1);
> > +    {
> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> > +        if ( rc )
> > +        {
> > +            if ( rc == -ETIMEDOUT )
> > +                panic("Queue invalidate wait descriptor was
> > + timeout.\n");
> 


> Unless I'm overlooking something this doesn't replace and existing panic(), and I
> think I had indicated quite clearly that this series shouldn't add any new ones,
> unless the alternative of crashing the owning domain is completely unfeasible.
> 


I will remove it in next v5.

[...]

> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
> >      return 0;
> >  }
> >
> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> > +                                         u16 seg, u8 bus, u8 devfn)
> {
> > +    struct domain *d;
> > +    struct pci_dev *pdev;
> > +
> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> > +    ASSERT(d);
> 
> Don't you need to acquire some lock in order to safely assert this?

Referred to the other use case of 'rcu_lock_domain_by_id()', Xen checks whether
The domain is 'NULL'. 
Could I also replace the 'ASSERT(d)' with
          + If ( d == NULL )
          +   return;
?

> Also note that unused slots hold zero, i.e. there's at least a theoretical risk of
> problems here when you don't also look at
> iommu->domid_bitmap.
> 
I am not clear how to fix this point. Do you have good idea?
Add a lock to 'iommu->domid_bitmap'?


> > +    for_each_pdev(d, pdev)
> > +    {
> > +        if ( (pdev->seg == seg) &&
> > +             (pdev->bus == bus) &&
> > +             (pdev->devfn == devfn) )
> > +        {
> > +            if ( pdev->domain )
> 
> If the device is on the domain's list, can this reasonably be false?

I can't find the false case.

> I.e. don't you rather mean ASSERT() here?

I'd better replace 'if' with ASSERT() for this case.

> 
> > +            {
> > +                list_del(&pdev->domain_list);
> 
> This should happen under pcidevs_lock - you need to either acquire it or
> ASSERT() it being held.
> 

I should check whether it is under pcidevs_lock -- with spin_is_locked(&pcidevs_lock)
If it is not under pcidevs_lock, I should acquire it.
I think ASSERT() is not a good idea. Hypervisor acquires this lock and then remove the resource.



> > +    }
> > +
> > +    if ( !is_hardware_domain(d) )
> > +        domain_crash(d);
> 
> I wonder whether the device hiding shouldn't also be avoided if the device is
> owned by hwdom.
> 

I think it should be avoided if the device is owned by hwdom.
Otherwise, it is quite similar to panic..


> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
> >          queue_invalidate_iotlb(iommu,
> >                                 type >>
> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
> >                                 dw, did, size_order, 0, addr);
> > +
> > +        /*
> > +         * Synchronize with hardware for invalidation request descriptors
> > +         * submitted before Device-TLB invalidate descriptor.
> > +         */
> > +        rc = invalidate_sync(iommu);
> > +        if ( rc )
> > +             return rc;
> > +
> >          if ( flush_dev_iotlb )
> >              ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
> > -        rc = invalidate_sync(iommu);
> > +
> >          if ( !ret )
> >              ret = rc;
> >      }
> 
> This change is because of ...?
> 
As Kevin's comments,
I add 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb(). 
Then i can know which device is flush timeout.



> > --- a/xen/drivers/passthrough/vtd/x86/ats.c
> > +++ b/xen/drivers/passthrough/vtd/x86/ats.c
> > @@ -162,6 +162,19 @@ int dev_invalidate_iotlb(struct iommu *iommu, u16
> did,
> >              return -EOPNOTSUPP;
> >          }
> >
> > +        /*
> > +         * Synchronize with hardware for Device-TLB invalidate
> > +         * descriptor.
> > +         */
> > +        ret = dev_invalidate_iotlb_sync(iommu, did, pdev->seg,
> > +                                        pdev->bus, pdev->devfn);
> > +        if ( ret )
> > +        {
> > +            dprintk(XENLOG_WARNING VTDPREFIX,
> > +                    "Device-TLB flush timeout.\n");
> 
> Are you aware that dprintk() compiles to nothing in non-debug builds?


To be honest, I was not aware.

> Plus logging the error code and device coordinates might turn out helpful in
> such cases. But first of all - if there was a timeout, we'd make it here only for
> Dom0. Perhaps the printing should move into an else to the domain_crash()?
> And if there was another error, the message would be outright wrong.
> 
IMO, the print for Dom0 is enough.
I think it is not a good idea to move into an else to domain_crash(). Domain_crash is quite a common part.
Anyway I can improve it in low priority.

Jan, thanks..
-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-20 10:26     ` Xu, Quan
@ 2016-01-20 11:29       ` Jan Beulich
  2016-01-21 16:16         ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-20 11:29 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, Jun Nakajima, keir

>>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
>> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
>> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
>> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
>> >      return 0;
>> >  }
>> >
>> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
>> > +                                         u16 seg, u8 bus, u8 devfn)
>> {
>> > +    struct domain *d;
>> > +    struct pci_dev *pdev;
>> > +
>> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> > +    ASSERT(d);
>> 
>> Don't you need to acquire some lock in order to safely assert this?
> 
> Referred to the other use case of 'rcu_lock_domain_by_id()', Xen checks 
> whether
> The domain is 'NULL'. 
> Could I also replace the 'ASSERT(d)' with
>           + If ( d == NULL )
>           +   return;
> ?

At a first glance this doesn't look right, but in the end that's
something you need to answer.

>> Also note that unused slots hold zero, i.e. there's at least a theoretical 
> risk of
>> problems here when you don't also look at
>> iommu->domid_bitmap.
>> 
> I am not clear how to fix this point. Do you have good idea?
> Add a lock to 'iommu->domid_bitmap'?

How would a lock help avoiding mistaking an unused slot to mean
Dom0? As already suggested - I think you simply need to consult
the bitmap along with the domain ID array.

>> > +            {
>> > +                list_del(&pdev->domain_list);
>> 
>> This should happen under pcidevs_lock - you need to either acquire it or
>> ASSERT() it being held.
>> 
> 
> I should check whether it is under pcidevs_lock -- with 
> spin_is_locked(&pcidevs_lock)
> If it is not under pcidevs_lock, I should acquire it.
> I think ASSERT() is not a good idea. Hypervisor acquires this lock and then 
> remove the resource.

I don't understand this last sentence.

>> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
>> >          queue_invalidate_iotlb(iommu,
>> >                                 type >>
>> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
>> >                                 dw, did, size_order, 0, addr);
>> > +
>> > +        /*
>> > +         * Synchronize with hardware for invalidation request descriptors
>> > +         * submitted before Device-TLB invalidate descriptor.
>> > +         */
>> > +        rc = invalidate_sync(iommu);
>> > +        if ( rc )
>> > +             return rc;
>> > +
>> >          if ( flush_dev_iotlb )
>> >              ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
>> > -        rc = invalidate_sync(iommu);
>> > +
>> >          if ( !ret )
>> >              ret = rc;
>> >      }
>> 
>> This change is because of ...?
>> 
> As Kevin's comments,
> I add 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb(). 
> Then i can know which device is flush timeout.

I don't understand: How is your reply related to you moving up the
invocation of invalidate_sync()?

>> Plus logging the error code and device coordinates might turn out helpful in
>> such cases. But first of all - if there was a timeout, we'd make it here only for
>> Dom0. Perhaps the printing should move into an else to the domain_crash()?
>> And if there was another error, the message would be outright wrong.
>> 
> IMO, the print for Dom0 is enough.
> I think it is not a good idea to move into an else to domain_crash(). 
> Domain_crash is quite a common part.
> Anyway I can improve it in low priority.

At the very least the message should not end up being actively
misleading.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-20 11:29       ` Jan Beulich
@ 2016-01-21 16:16         ` Xu, Quan
  2016-01-21 16:31           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-21 16:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> On January 20, 2016 at 7:29 pm,  <JBeulich@suse.com> wrote:
> >>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
> >> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> >> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
> >> >      return 0;
> >> >  }
> >> >
> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> >> > +                                         u16 seg, u8 bus, u8
> >> > +devfn)
> >> {
> >> > +    struct domain *d;
> >> > +    struct pci_dev *pdev;
> >> > +
> >> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >> > +    ASSERT(d);
> >>
> >> Don't you need to acquire some lock in order to safely assert this?
> >
> > Referred to the other use case of 'rcu_lock_domain_by_id()', Xen
> > checks whether The domain is 'NULL'.
> > Could I also replace the 'ASSERT(d)' with
> >           + If ( d == NULL )
> >           +   return;
> > ?
> 
> At a first glance this doesn't look right, but in the end that's something you need
> to answer.
> 

Is it quite similar to whether the domain has been destroyed when Device-TLB is flushing?
Correct me if I still doesn't get you point..


> >> Also note that unused slots hold zero, i.e. there's at least a
> >> theoretical
> > risk of
> >> problems here when you don't also look at
> >> iommu->domid_bitmap.
> >>
> > I am not clear how to fix this point. Do you have good idea?
> > Add a lock to 'iommu->domid_bitmap'?
> 
> How would a lock help avoiding mistaking an unused slot to mean Dom0? As
> already suggested - I think you simply need to consult the bitmap along with the
> domain ID array.
> 

Try to get domain id with iommu->domid_map[did] ?


> >> > +            {
> >> > +                list_del(&pdev->domain_list);
> >>
> >> This should happen under pcidevs_lock - you need to either acquire it
> >> or
> >> ASSERT() it being held.
> >>
> >
> > I should check whether it is under pcidevs_lock -- with
> > spin_is_locked(&pcidevs_lock)
> > If it is not under pcidevs_lock, I should acquire it.
> > I think ASSERT() is not a good idea. Hypervisor acquires this lock and
> > then remove the resource.
> 
> I don't understand this last sentence.
> 
For example: in 
pci_remove_device()
{
...
    spin_lock(&pcidevs_lock);
    ..
    iommu_remove_device()..
    ..
    spin_unlock(&pcidevs_lock);
}

Device-TLB is maybe flush error in iommu_remove_device()..
Then it is under pcidevs_lock..
In this case, I need to check whether it is under pcidevs_lock.
If not, I need to acquire the pcidevs_lock.

> >> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
> >> >          queue_invalidate_iotlb(iommu,
> >> >                                 type >>
> >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
> >> >                                 dw, did, size_order, 0, addr);
> >> > +
> >> > +        /*
> >> > +         * Synchronize with hardware for invalidation request
> descriptors
> >> > +         * submitted before Device-TLB invalidate descriptor.
> >> > +         */
> >> > +        rc = invalidate_sync(iommu);
> >> > +        if ( rc )
> >> > +             return rc;
> >> > +
> >> >          if ( flush_dev_iotlb )
> >> >              ret = dev_invalidate_iotlb(iommu, did, addr, size_order,
> type);
> >> > -        rc = invalidate_sync(iommu);
> >> > +
> >> >          if ( !ret )
> >> >              ret = rc;
> >> >      }
> >>
> >> This change is because of ...?
> >>
> > As Kevin's comments,
> > I add 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb().
> > Then i can know which device is flush timeout.
> 
> I don't understand: How is your reply related to you moving up the invocation of
> invalidate_sync()?

This invalidate_sync() is to synchronize with hardware for invalidation request descriptors
submitted before Device-TLB invalidate descriptor.

If I don't add this invalidate_sync().. 
dev_invalidate_iotlb_timeout() is not clear whether the flush time out is about Device-TLB flush error or the other previous iotlb/iec/context flush error. 


> 
> >> Plus logging the error code and device coordinates might turn out
> >> helpful in such cases. But first of all - if there was a timeout,
> >> we'd make it here only for Dom0. Perhaps the printing should move into an
> else to the domain_crash()?
> >> And if there was another error, the message would be outright wrong.
> >>
> > IMO, the print for Dom0 is enough.
> > I think it is not a good idea to move into an else to domain_crash().
> > Domain_crash is quite a common part.
> > Anyway I can improve it in low priority.
> 
> At the very least the message should not end up being actively misleading.
> 
What about the below?
+                printk(XENLOG_ERR
+                       "Device %04x:%02x:%02x is flush error.",
+                       seg, bus, devfn);



Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-21 16:16         ` Xu, Quan
@ 2016-01-21 16:31           ` Jan Beulich
  2016-01-22 15:57             ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-21 16:31 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, Jun Nakajima, keir

>>> On 21.01.16 at 17:16, <quan.xu@intel.com> wrote:
>>  On January 20, 2016 at 7:29 pm,  <JBeulich@suse.com> wrote:
>> >>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
>> >> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
>> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
>> >> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
>> >> >      return 0;
>> >> >  }
>> >> >
>> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
>> >> > +                                         u16 seg, u8 bus, u8
>> >> > +devfn)
>> >> {
>> >> > +    struct domain *d;
>> >> > +    struct pci_dev *pdev;
>> >> > +
>> >> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> >> > +    ASSERT(d);
>> >>
>> >> Don't you need to acquire some lock in order to safely assert this?
>> >
>> > Referred to the other use case of 'rcu_lock_domain_by_id()', Xen
>> > checks whether The domain is 'NULL'.
>> > Could I also replace the 'ASSERT(d)' with
>> >           + If ( d == NULL )
>> >           +   return;
>> > ?
>> 
>> At a first glance this doesn't look right, but in the end that's something 
> you need
>> to answer.
>> 
> 
> Is it quite similar to whether the domain has been destroyed when Device-TLB 
> is flushing?

Going back to the original question I raised, you need to ask
yourself: Can d validly be NULL when you get here (e.g. due to
some other processor changing something in parallel)? If yes,
you can't ASSERT(), and the next question then is what exactly
it means that it got ripped off the table. The answer to that
determines whether simply returning would be okay.

>> >> Also note that unused slots hold zero, i.e. there's at least a
>> >> theoretical
>> > risk of
>> >> problems here when you don't also look at
>> >> iommu->domid_bitmap.
>> >>
>> > I am not clear how to fix this point. Do you have good idea?
>> > Add a lock to 'iommu->domid_bitmap'?
>> 
>> How would a lock help avoiding mistaking an unused slot to mean Dom0? As
>> already suggested - I think you simply need to consult the bitmap along with 
> the
>> domain ID array.
>> 
> 
> Try to get domain id with iommu->domid_map[did] ?

???

>> >> > +            {
>> >> > +                list_del(&pdev->domain_list);
>> >>
>> >> This should happen under pcidevs_lock - you need to either acquire it
>> >> or
>> >> ASSERT() it being held.
>> >>
>> >
>> > I should check whether it is under pcidevs_lock -- with
>> > spin_is_locked(&pcidevs_lock)
>> > If it is not under pcidevs_lock, I should acquire it.
>> > I think ASSERT() is not a good idea. Hypervisor acquires this lock and
>> > then remove the resource.
>> 
>> I don't understand this last sentence.
>> 
> For example: in 
> pci_remove_device()
> {
> ...
>     spin_lock(&pcidevs_lock);
>     ..
>     iommu_remove_device()..
>     ..
>     spin_unlock(&pcidevs_lock);
> }
> 
> Device-TLB is maybe flush error in iommu_remove_device()..
> Then it is under pcidevs_lock..
> In this case, I need to check whether it is under pcidevs_lock.
> If not, I need to acquire the pcidevs_lock.

Ah, okay. But you can't use spin_is_locked() for that purpose.

>> >> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
>> >> >          queue_invalidate_iotlb(iommu,
>> >> >                                 type >>
>> >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
>> >> >                                 dw, did, size_order, 0, addr);
>> >> > +
>> >> > +        /*
>> >> > +         * Synchronize with hardware for invalidation request
>> descriptors
>> >> > +         * submitted before Device-TLB invalidate descriptor.
>> >> > +         */
>> >> > +        rc = invalidate_sync(iommu);
>> >> > +        if ( rc )
>> >> > +             return rc;
>> >> > +
>> >> >          if ( flush_dev_iotlb )
>> >> >              ret = dev_invalidate_iotlb(iommu, did, addr, size_order,
>> type);
>> >> > -        rc = invalidate_sync(iommu);
>> >> > +
>> >> >          if ( !ret )
>> >> >              ret = rc;
>> >> >      }
>> >>
>> >> This change is because of ...?
>> >>
>> > As Kevin's comments,
>> > I add 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb().
>> > Then i can know which device is flush timeout.
>> 
>> I don't understand: How is your reply related to you moving up the 
> invocation of
>> invalidate_sync()?
> 
> This invalidate_sync() is to synchronize with hardware for invalidation 
> request descriptors
> submitted before Device-TLB invalidate descriptor.
> 
> If I don't add this invalidate_sync().. 
> dev_invalidate_iotlb_timeout() is not clear whether the flush time out is 
> about Device-TLB flush error or the other previous iotlb/iec/context flush 
> error. 

This being anything but obvious, maybe a brief comment would
help?

>> >> Plus logging the error code and device coordinates might turn out
>> >> helpful in such cases. But first of all - if there was a timeout,
>> >> we'd make it here only for Dom0. Perhaps the printing should move into an
>> else to the domain_crash()?
>> >> And if there was another error, the message would be outright wrong.
>> >>
>> > IMO, the print for Dom0 is enough.
>> > I think it is not a good idea to move into an else to domain_crash().
>> > Domain_crash is quite a common part.
>> > Anyway I can improve it in low priority.
>> 
>> At the very least the message should not end up being actively misleading.
>> 
> What about the below?
> +                printk(XENLOG_ERR
> +                       "Device %04x:%02x:%02x is flush error.",
> +                       seg, bus, devfn);

Well, for one you once again omit the error code. And then (I think
others had pointed that out already) a device cannot be flush error.
Perhaps you mean "Flush error %d on device ...\n"? Plus you
absolutely should print PCI device coordinates in the canonical way,
so that grep-ing a log file for issues with a specific device will find
everything related to that device.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-21 16:31           ` Jan Beulich
@ 2016-01-22 15:57             ` Xu, Quan
  2016-01-25 14:09               ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-22 15:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim,
	xen-devel, Nakajima, Jun, keir

> On January 22, 2016 at 12:31am, <JBeulich@suse.com> wrote:
> >>> On 21.01.16 at 17:16, <quan.xu@intel.com> wrote:
> >>  On January 20, 2016 at 7:29 pm,  <JBeulich@suse.com> wrote:
> >> >>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
> >> >> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
> >> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> >> >> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
> >> >> >      return 0;
> >> >> >  }
> >> >> >
> >> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16
> did,
> >> >> > +                                         u16 seg, u8 bus, u8
> >> >> > +devfn)
> >> >> {
> >> >> > +    struct domain *d;
> >> >> > +    struct pci_dev *pdev;
> >> >> > +
> >> >> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >> >> > +    ASSERT(d);
> >> >>
> >> >> Don't you need to acquire some lock in order to safely assert this?
> >> >
> >> > Referred to the other use case of 'rcu_lock_domain_by_id()', Xen
> >> > checks whether The domain is 'NULL'.
> >> > Could I also replace the 'ASSERT(d)' with
> >> >           + If ( d == NULL )
> >> >           +   return;
> >> > ?
> >>
> >> At a first glance this doesn't look right, but in the end that's
> >> something
> > you need
> >> to answer.
> >>
> >
> > Is it quite similar to whether the domain has been destroyed when
> > Device-TLB is flushing?
> 
> Going back to the original question I raised, you need to ask
> yourself: Can d validly be NULL when you get here (e.g. due to some other
> processor changing something in parallel)?

I think that is YES.
Not all of the callers acquire RCU lock. Such as:
   $flush_iotlb_qi()--- iommu_flush_iotlb_psi() -- __intel_iommu_iotlb_flush() --intel_iommu_iotlb_flush() --iommu_iotlb_flush() --xenmem_add_to_physmap()--do_memory_op()
When Device-TLB is flushing for the above call path, the domain is destroyed in parallel. Then the d is NULL.


> If yes, you can't ASSERT(), and the
> next question then is what exactly it means that it got ripped off the table. The
> answer to that determines whether simply returning would be okay.
> 

IMO, Simply returning would be okay, but the device would not be hidden.



> >> >> Also note that unused slots hold zero, i.e. there's at least a
> >> >> theoretical
> >> > risk of
> >> >> problems here when you don't also look at
> >> >> iommu->domid_bitmap.
> >> >>
> >> > I am not clear how to fix this point. Do you have good idea?
> >> > Add a lock to 'iommu->domid_bitmap'?
> >>
> >> How would a lock help avoiding mistaking an unused slot to mean Dom0?
> >> As already suggested - I think you simply need to consult the bitmap
> >> along with
> > the
> >> domain ID array.
> >>
> >
> > Try to get domain id with iommu->domid_map[did] ?
> 
> ???
> 
        +if ( iommu->domid_map )
        +   d = rcu_lock_domain_by_id(iommu->domid_map[did]);

Is it right?


> >> >> > +            {
> >> >> > +                list_del(&pdev->domain_list);
> >> >>
> >> >> This should happen under pcidevs_lock - you need to either acquire
> >> >> it or
> >> >> ASSERT() it being held.
> >> >>
> >> >
> >> > I should check whether it is under pcidevs_lock -- with
> >> > spin_is_locked(&pcidevs_lock)
> >> > If it is not under pcidevs_lock, I should acquire it.
> >> > I think ASSERT() is not a good idea. Hypervisor acquires this lock
> >> > and then remove the resource.
> >>
> >> I don't understand this last sentence.
> >>
> > For example: in
> > pci_remove_device()
> > {
> > ...
> >     spin_lock(&pcidevs_lock);
> >     ..
> >     iommu_remove_device()..
> >     ..
> >     spin_unlock(&pcidevs_lock);
> > }
> >
> > Device-TLB is maybe flush error in iommu_remove_device()..
> > Then it is under pcidevs_lock..
> > In this case, I need to check whether it is under pcidevs_lock.
> > If not, I need to acquire the pcidevs_lock.
> 
> Ah, okay. But you can't use spin_is_locked() for that purpose.
> 
Why?

If I introduce a new parameter 'lock'.
+ int lock = spin_is_locked(&pcidevs_lock);


+ if ( !lock )
+    spin_lock(&pcidevs_lock);
...
+ if ( !lock )
+    spin_unlock(&pcidevs_lock);

Is it right? 
Jan, do you have some better idea? I think ASSERT is not right.




> >> >> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
> >> >> >          queue_invalidate_iotlb(iommu,
> >> >> >                                 type >>
> >> >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
> >> >> >                                 dw, did, size_order, 0, addr);
> >> >> > +
> >> >> > +        /*
> >> >> > +         * Synchronize with hardware for invalidation request
> >> descriptors
> >> >> > +         * submitted before Device-TLB invalidate descriptor.
> >> >> > +         */
> >> >> > +        rc = invalidate_sync(iommu);
> >> >> > +        if ( rc )
> >> >> > +             return rc;
> >> >> > +
> >> >> >          if ( flush_dev_iotlb )
> >> >> >              ret = dev_invalidate_iotlb(iommu, did, addr,
> >> >> > size_order,
> >> type);
> >> >> > -        rc = invalidate_sync(iommu);
> >> >> > +
> >> >> >          if ( !ret )
> >> >> >              ret = rc;
> >> >> >      }
> >> >>
> >> >> This change is because of ...?
> >> >>
> >> > As Kevin's comments,
> >> > I add 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb().
> >> > Then i can know which device is flush timeout.
> >>
> >> I don't understand: How is your reply related to you moving up the
> > invocation of
> >> invalidate_sync()?
> >
> > This invalidate_sync() is to synchronize with hardware for
> > invalidation request descriptors submitted before Device-TLB
> > invalidate descriptor.
> >
> > If I don't add this invalidate_sync()..
> > dev_invalidate_iotlb_timeout() is not clear whether the flush time out
> > is about Device-TLB flush error or the other previous
> > iotlb/iec/context flush error.
> 
> This being anything but obvious, maybe a brief comment would help?
> 

Change 
"
Synchronize with hardware for invalidation request descriptors 
submitted before Device-TLB invalidate descriptor.
"
TO

"Synchronize invalidation completions before Device-TLB invalidation
"
?

> >> >> Plus logging the error code and device coordinates might turn out
> >> >> helpful in such cases. But first of all - if there was a timeout,
> >> >> we'd make it here only for Dom0. Perhaps the printing should move
> >> >> into an
> >> else to the domain_crash()?
> >> >> And if there was another error, the message would be outright wrong.
> >> >>
> >> > IMO, the print for Dom0 is enough.
> >> > I think it is not a good idea to move into an else to domain_crash().
> >> > Domain_crash is quite a common part.
> >> > Anyway I can improve it in low priority.
> >>
> >> At the very least the message should not end up being actively misleading.
> >>
> > What about the below?
> > +                printk(XENLOG_ERR
> > +                       "Device %04x:%02x:%02x is flush error.",
> > +                       seg, bus, devfn);
> 
> Well, for one you once again omit the error code. And then (I think others had
> pointed that out already) a device cannot be flush error.
> Perhaps you mean "Flush error %d on device ...\n"? Plus you absolutely should
> print PCI device coordinates in the canonical way, 

What does 'print PCI device coordinates in the canonical way' mean ?
Such as  "0000:00:02.1"?

> so that grep-ing a log file for
> issues with a specific device will find everything related to that device.


I am appreciate your kindness.
Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-22 15:57             ` Xu, Quan
@ 2016-01-25 14:09               ` Jan Beulich
  2016-01-26 13:47                 ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-25 14:09 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim,
	xen-devel, Jun Nakajima, keir

>>> On 22.01.16 at 16:57, <quan.xu@intel.com> wrote:
>>  On January 22, 2016 at 12:31am, <JBeulich@suse.com> wrote:
>> >>> On 21.01.16 at 17:16, <quan.xu@intel.com> wrote:
>> >>  On January 20, 2016 at 7:29 pm,  <JBeulich@suse.com> wrote:
>> >> >>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
>> >> >> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
>> >> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
>> >> >> Also note that unused slots hold zero, i.e. there's at least a
>> >> >> theoretical
>> >> > risk of
>> >> >> problems here when you don't also look at
>> >> >> iommu->domid_bitmap.
>> >> >>
>> >> > I am not clear how to fix this point. Do you have good idea?
>> >> > Add a lock to 'iommu->domid_bitmap'?
>> >>
>> >> How would a lock help avoiding mistaking an unused slot to mean Dom0?
>> >> As already suggested - I think you simply need to consult the bitmap
>> >> along with
>> > the
>> >> domain ID array.
>> >>
>> >
>> > Try to get domain id with iommu->domid_map[did] ?
>> 
>> ???
>> 
>         +if ( iommu->domid_map )
>         +   d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> 
> Is it right?

I don't see what this changes. Again - what your code has been
lacking so far is some mechanism to guarantee that what you
read from domid_map[] is actually a valid domain ID. I can only
once again point your attention to domid_bitmap, which afaics
gives you exactly that valid/invalid indication.

>> >> >> > +            {
>> >> >> > +                list_del(&pdev->domain_list);
>> >> >>
>> >> >> This should happen under pcidevs_lock - you need to either acquire
>> >> >> it or
>> >> >> ASSERT() it being held.
>> >> >>
>> >> >
>> >> > I should check whether it is under pcidevs_lock -- with
>> >> > spin_is_locked(&pcidevs_lock)
>> >> > If it is not under pcidevs_lock, I should acquire it.
>> >> > I think ASSERT() is not a good idea. Hypervisor acquires this lock
>> >> > and then remove the resource.
>> >>
>> >> I don't understand this last sentence.
>> >>
>> > For example: in
>> > pci_remove_device()
>> > {
>> > ...
>> >     spin_lock(&pcidevs_lock);
>> >     ..
>> >     iommu_remove_device()..
>> >     ..
>> >     spin_unlock(&pcidevs_lock);
>> > }
>> >
>> > Device-TLB is maybe flush error in iommu_remove_device()..
>> > Then it is under pcidevs_lock..
>> > In this case, I need to check whether it is under pcidevs_lock.
>> > If not, I need to acquire the pcidevs_lock.
>> 
>> Ah, okay. But you can't use spin_is_locked() for that purpose.
>> 
> Why?

Because spin_is_locked() returns true if _any CPU_ holds the lock,
not just when the CPU you're running on does.

> If I introduce a new parameter 'lock'.
> + int lock = spin_is_locked(&pcidevs_lock);
> 
> 
> + if ( !lock )
> +    spin_lock(&pcidevs_lock);
> ...
> + if ( !lock )
> +    spin_unlock(&pcidevs_lock);
> 
> Is it right? 
> Jan, do you have some better idea?

If indeed different locking state is possible for different call trees,
then I'm afraid you won't get around passing down flags to indicate
whether the needed lock is already being held.

> I think ASSERT is not right.

Indeed.

>> >> >> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
>> >> >> >          queue_invalidate_iotlb(iommu,
>> >> >> >                                 type >>
>> >> >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
>> >> >> >                                 dw, did, size_order, 0, addr);
>> >> >> > +
>> >> >> > +        /*
>> >> >> > +         * Synchronize with hardware for invalidation request
>> >> descriptors
>> >> >> > +         * submitted before Device-TLB invalidate descriptor.
>> >> >> > +         */
>> >> >> > +        rc = invalidate_sync(iommu);
>> >> >> > +        if ( rc )
>> >> >> > +             return rc;
>> >> >> > +
>> >> >> >          if ( flush_dev_iotlb )
>> >> >> >              ret = dev_invalidate_iotlb(iommu, did, addr,
>> >> >> > size_order,
>> >> type);
>> >> >> > -        rc = invalidate_sync(iommu);
>> >> >> > +
>> >> >> >          if ( !ret )
>> >> >> >              ret = rc;
>> >> >> >      }
>> >> >>
>> >> >> This change is because of ...?
>> >> >>
>> >> > As Kevin's comments,
>> >> > I add 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb().
>> >> > Then i can know which device is flush timeout.
>> >>
>> >> I don't understand: How is your reply related to you moving up the
>> > invocation of
>> >> invalidate_sync()?
>> >
>> > This invalidate_sync() is to synchronize with hardware for
>> > invalidation request descriptors submitted before Device-TLB
>> > invalidate descriptor.
>> >
>> > If I don't add this invalidate_sync()..
>> > dev_invalidate_iotlb_timeout() is not clear whether the flush time out
>> > is about Device-TLB flush error or the other previous
>> > iotlb/iec/context flush error.
>> 
>> This being anything but obvious, maybe a brief comment would help?
>> 
> 
> Change 
> "
> Synchronize with hardware for invalidation request descriptors 
> submitted before Device-TLB invalidate descriptor.
> "
> TO
> 
> "Synchronize invalidation completions before Device-TLB invalidation
> "
> ?

I'd even further emphasize the ordering, by saying "Before
Device-TLB invalidation we need to ..." (or perhaps it's rather
"... we'd like to ...").

>> >> >> Plus logging the error code and device coordinates might turn out
>> >> >> helpful in such cases. But first of all - if there was a timeout,
>> >> >> we'd make it here only for Dom0. Perhaps the printing should move
>> >> >> into an
>> >> else to the domain_crash()?
>> >> >> And if there was another error, the message would be outright wrong.
>> >> >>
>> >> > IMO, the print for Dom0 is enough.
>> >> > I think it is not a good idea to move into an else to domain_crash().
>> >> > Domain_crash is quite a common part.
>> >> > Anyway I can improve it in low priority.
>> >>
>> >> At the very least the message should not end up being actively misleading.
>> >>
>> > What about the below?
>> > +                printk(XENLOG_ERR
>> > +                       "Device %04x:%02x:%02x is flush error.",
>> > +                       seg, bus, devfn);
>> 
>> Well, for one you once again omit the error code. And then (I think others 
> had
>> pointed that out already) a device cannot be flush error.
>> Perhaps you mean "Flush error %d on device ...\n"? Plus you absolutely 
> should
>> print PCI device coordinates in the canonical way, 
> 
> What does 'print PCI device coordinates in the canonical way' mean ?
> Such as  "0000:00:02.1"?

Yes. There are numerous examples throughout the tree.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-25 14:09               ` Jan Beulich
@ 2016-01-26 13:47                 ` Xu, Quan
  2016-01-26 13:59                   ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-26 13:47 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim, xen-devel

> On January 25, 2016 at 10:09pm, <JBeulich@suse.com> wrote:
> >>> On 22.01.16 at 16:57, <quan.xu@intel.com> wrote:
> >>  On January 22, 2016 at 12:31am, <JBeulich@suse.com> wrote:
> >> >>> On 21.01.16 at 17:16, <quan.xu@intel.com> wrote:
> >> >>  On January 20, 2016 at 7:29 pm,  <JBeulich@suse.com> wrote:
> >> >> >>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
> >> >> >> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
> >> >> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> >> >> >> Also note that unused slots hold zero, i.e. there's at least a
> >> >> >> theoretical
> >> >> > risk of
> >> >> >> problems here when you don't also look at
> >> >> >> iommu->domid_bitmap.
> >> >> >>
> >> >> > I am not clear how to fix this point. Do you have good idea?
> >> >> > Add a lock to 'iommu->domid_bitmap'?
> >> >>
> >> >> How would a lock help avoiding mistaking an unused slot to mean Dom0?
> >> >> As already suggested - I think you simply need to consult the
> >> >> bitmap along with
> >> > the
> >> >> domain ID array.
> >> >>
> >> >
> >> > Try to get domain id with iommu->domid_map[did] ?
> >>
> >> ???
> >>
> >         +if ( iommu->domid_map )
> >         +   d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >
> > Is it right?
> 
> I don't see what this changes. Again - what your code has been lacking so far is
> some mechanism to guarantee that what you read from domid_map[] is
> actually a valid domain ID. I can only once again point your attention to
> domid_bitmap, which afaics gives you exactly that valid/invalid indication.
> 
Jan,
Sorry, I was confused about this point as I didn't understand the domid_bitmap. I printed out the iommu->domid_bitmap[did], which was tricky to me.
Now I get it. 
As you mentioned , I simply need to consult the bitmap along with the domain ID array.

+If ( test_bit(did, iommu->domid_bitmap) && iommu->domid_map[did] >= 0 )
+   d = rcu_lock_domain_by_id(iommu->domid_map[did]);

Is it right now?

> >> >> >> > +            {
> >> >> >> > +                list_del(&pdev->domain_list);
> >> >> >>
> >> >> >> This should happen under pcidevs_lock - you need to either
> >> >> >> acquire it or
> >> >> >> ASSERT() it being held.
> >> >> >>
> >> >> >
> >> >> > I should check whether it is under pcidevs_lock -- with
> >> >> > spin_is_locked(&pcidevs_lock)
> >> >> > If it is not under pcidevs_lock, I should acquire it.
> >> >> > I think ASSERT() is not a good idea. Hypervisor acquires this
> >> >> > lock and then remove the resource.
> >> >>
> >> >> I don't understand this last sentence.
> >> >>
> >> > For example: in
> >> > pci_remove_device()
> >> > {
> >> > ...
> >> >     spin_lock(&pcidevs_lock);
> >> >     ..
> >> >     iommu_remove_device()..
> >> >     ..
> >> >     spin_unlock(&pcidevs_lock);
> >> > }
> >> >
> >> > Device-TLB is maybe flush error in iommu_remove_device()..
> >> > Then it is under pcidevs_lock..
> >> > In this case, I need to check whether it is under pcidevs_lock.
> >> > If not, I need to acquire the pcidevs_lock.
> >>
> >> Ah, okay. But you can't use spin_is_locked() for that purpose.
> >>
> > If I introduce a new parameter 'lock'.
> > + int lock = spin_is_locked(&pcidevs_lock);
> >
> >
> > + if ( !lock )
> > +    spin_lock(&pcidevs_lock);
> > ...
> > + if ( !lock )
> > +    spin_unlock(&pcidevs_lock);
> >
> > Is it right?
> > Jan, do you have some better idea?
> 
> If indeed different locking state is possible for different call trees, 

Indeed different. For example,
It is _not_under_ lock for the following call tree:
$ flush_iotlb_qi()--- iommu_flush_iotlb_psi() -- __intel_iommu_iotlb_flush() --intel_iommu_iotlb_flush() --iommu_iotlb_flush() --xenmem_add_to_physmap()--do_memory_op() 

It is _under_ lock for the following call tree:
$flush_iotlb_qi()--iommu_flush_iotlb_dsi()--domain_context_unmap_one()--domain_context_unmap()--reassign_device_ownership()--deassign_device()-iommu_do_pci_domctl()



> then I'm
> afraid you won't get around passing down flags to indicate whether the needed
> lock is already being held.
> 
Agreed.

At first, I am open for any solution.
pcidevs_lock is quite a big lock. For this point, it looks much better to add a new flag to delay hiding device.
I am also afraid that it may raise further security issues.


Jan, thanks!!
-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-26 13:47                 ` Xu, Quan
@ 2016-01-26 13:59                   ` Jan Beulich
  2016-01-26 15:27                     ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-26 13:59 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim, xen-devel

>>> On 26.01.16 at 14:47, <quan.xu@intel.com> wrote:
> As you mentioned , I simply need to consult the bitmap along with the domain 
> ID array.
> 
> +If ( test_bit(did, iommu->domid_bitmap) && iommu->domid_map[did] >= 0 )
> +   d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> 
> Is it right now?

Mostly, except that I don't understand the >= 0 part.

> At first, I am open for any solution.
> pcidevs_lock is quite a big lock. For this point, it looks much better to 
> add a new flag to delay hiding device.
> I am also afraid that it may raise further security issues.

Well, I'd say just go and see which one turns out to be less
cumbersome and/or less intrusive.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-26 13:59                   ` Jan Beulich
@ 2016-01-26 15:27                     ` Xu, Quan
  2016-01-26 15:52                       ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-26 15:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim, xen-devel

> On January 26, 2016 at 10:00pm, <JBeulich@suse.com> wrote:
> >>> On 26.01.16 at 14:47, <quan.xu@intel.com> wrote:
> > As you mentioned , I simply need to consult the bitmap along with the
> > domain ID array.
> >
> > +If ( test_bit(did, iommu->domid_bitmap) && iommu->domid_map[did] >= 0 )
> > +   d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >
> > Is it right now?
> 
> Mostly, except that I don't understand the >= 0 part.
> 
Domain ID should be >= 0..
If it is redundant, I can remove it.

> > At first, I am open for any solution.
> > pcidevs_lock is quite a big lock. For this point, it looks much better
> > to add a new flag to delay hiding device.
> > I am also afraid that it may raise further security issues.
> 
> Well, I'd say just go and see which one turns out to be less cumbersome and/or
> less intrusive.
> 
For this lock, any good idea?
IMO, I can get started to add a new flag to delay hiding device.

-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-26 15:27                     ` Xu, Quan
@ 2016-01-26 15:52                       ` Jan Beulich
  2016-01-26 22:47                         ` Tian, Kevin
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-26 15:52 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim, xen-devel

>>> On 26.01.16 at 16:27, <quan.xu@intel.com> wrote:
>>  On January 26, 2016 at 10:00pm, <JBeulich@suse.com> wrote:
>> >>> On 26.01.16 at 14:47, <quan.xu@intel.com> wrote:
>> > As you mentioned , I simply need to consult the bitmap along with the
>> > domain ID array.
>> >
>> > +If ( test_bit(did, iommu->domid_bitmap) && iommu->domid_map[did] >= 0 )
>> > +   d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> >
>> > Is it right now?
>> 
>> Mostly, except that I don't understand the >= 0 part.
>> 
> Domain ID should be >= 0..
> If it is redundant, I can remove it.

Quan, please: You have the code, so you can check whether
negative values ever get stored there. And had you checked,
you'd have found that domid_map is declared as u16 *. The
"u" here, as I hope you know, stands for "unsigned". (Of
course this really should be domid_t, but I'm sure the VT-d
maintainers won't care at all about such inconsistencies.)

>> > At first, I am open for any solution.
>> > pcidevs_lock is quite a big lock. For this point, it looks much better
>> > to add a new flag to delay hiding device.
>> > I am also afraid that it may raise further security issues.
>> 
>> Well, I'd say just go and see which one turns out to be less cumbersome 
> and/or
>> less intrusive.
>> 
> For this lock, any good idea?
> IMO, I can get started to add a new flag to delay hiding device.

Once again: Before getting started, please assess which route is
going to be the better one. Remember that we had already
discussed and put aside some form of deferring the hiding of
devices, so if you come back with a patch doing that again, you'll
have to be able to explain why the alternative(s) are worse.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-26 15:52                       ` Jan Beulich
@ 2016-01-26 22:47                         ` Tian, Kevin
  2016-01-27 11:09                           ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2016-01-26 22:47 UTC (permalink / raw)
  To: Jan Beulich, Xu, Quan
  Cc: george.dunlap, andrew.cooper3, tim, Wu, Feng, xen-devel

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 26, 2016 11:53 PM
> 
> >> > At first, I am open for any solution.
> >> > pcidevs_lock is quite a big lock. For this point, it looks much better
> >> > to add a new flag to delay hiding device.
> >> > I am also afraid that it may raise further security issues.
> >>
> >> Well, I'd say just go and see which one turns out to be less cumbersome
> > and/or
> >> less intrusive.
> >>
> > For this lock, any good idea?
> > IMO, I can get started to add a new flag to delay hiding device.
> 
> Once again: Before getting started, please assess which route is
> going to be the better one. Remember that we had already
> discussed and put aside some form of deferring the hiding of
> devices, so if you come back with a patch doing that again, you'll
> have to be able to explain why the alternative(s) are worse.
> 

Quan, could you list pros/cons of those alternatives based on
discussion so far? Then we can decide which way should be
done before you go to actual coding. Earlier suggestion on
hiding device immediately is under the assumption that all
locks have been held. If this part becomes too complex, and
you can explain clearly that deferring the hiding action doesn't
lead to any race condition, then people can see why you are
proposing defer again.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-26 22:47                         ` Tian, Kevin
@ 2016-01-27 11:09                           ` Xu, Quan
  2016-01-27 11:24                             ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-27 11:09 UTC (permalink / raw)
  To: Tian, Kevin, Jan Beulich
  Cc: george.dunlap, andrew.cooper3, tim, Wu, Feng, xen-devel

> On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Tuesday, January 26, 2016 11:53 PM


> > Once again: Before getting started, please assess which route is going
> > to be the better one. Remember that we had already discussed and put
> > aside some form of deferring the hiding of devices, so if you come
> > back with a patch doing that again, you'll have to be able to explain
> > why the alternative(s) are worse.
> >
> 
> Quan, could you list pros/cons of those alternatives based on discussion so far?
> Then we can decide which way should be done before you go to actual coding.
> Earlier suggestion on hiding device immediately is under the assumption that all
> locks have been held. If this part becomes too complex, and you can explain
> clearly that deferring the hiding action doesn't lead to any race condition, then
> people can see why you are proposing defer again.


The following are pros/cons of those alternatives. It is also why I propose defer again.

-- --
1. Hiding the devices immediately
Pros:
     * it makes whatever changes are ASAP after the Device-TLB flush error.
 
Cons:
     * It may break the code path.
     * It may lead to any race condition.
     * Hiding the devices immediately is under the assumption that all locks have been held.
      Different locking state is possible for different call trees. This part becomes too complex.
 
2. deferring the hiding of devices
Pros:
    *It doesn't lead to any race condition.
    *It makes existing calling chains with all required error handling completed.
 
Cons:
    * It needs to maintain a flag.
-- --

-----
How to defer the hiding of devices:
    When Device-TLB flush is error, it is to set a new flag and then continue existing calling chains with all required error handling completed. Only at safe place we can safely invoke pci_hide_device() variant to hide devices. We do a lazy hide until next time when it's assigned to another guest while the new flag is recognized.

Furthermore explanation:
A new flag:
         In my previous email, I introduced a "pci_dev->info.is_unassignable" flag. When Device-TLB flush is error, I think it is not a good idea to modify the pci_dev,
         As it is uncertain whether pcidevs_lock is acquired.
         I'd better introduce a new iommu bitmap array to register the device with BDF. It may be quite similar to 'iommu->domid_bitmap'

Safe place: 
         I think the beginning of reassign_device_ownership() is a safe place to hide the device which is registered with the new flag.
         reassign_device_ownership() is under lock.

pci_hide_device() variant:
         Remove the pcidevs_lock, as the pcidevs_lock has been acquired at call tree of reassign_device_ownership().


Thanks!!!
-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-27 11:09                           ` Xu, Quan
@ 2016-01-27 11:24                             ` Jan Beulich
  2016-01-27 12:38                               ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-27 11:24 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim, xen-devel

>>> On 27.01.16 at 12:09, <quan.xu@intel.com> wrote:
>>  On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
>> > From: Jan Beulich [mailto:JBeulich@suse.com]
>> > Sent: Tuesday, January 26, 2016 11:53 PM
> 
> 
>> > Once again: Before getting started, please assess which route is going
>> > to be the better one. Remember that we had already discussed and put
>> > aside some form of deferring the hiding of devices, so if you come
>> > back with a patch doing that again, you'll have to be able to explain
>> > why the alternative(s) are worse.
>> >
>> 
>> Quan, could you list pros/cons of those alternatives based on discussion so far?
>> Then we can decide which way should be done before you go to actual coding.
>> Earlier suggestion on hiding device immediately is under the assumption that all
>> locks have been held. If this part becomes too complex, and you can explain
>> clearly that deferring the hiding action doesn't lead to any race condition, then
>> people can see why you are proposing defer again.
> 
> 
> The following are pros/cons of those alternatives. It is also why I propose 
> defer again.
> 
> -- --
> 1. Hiding the devices immediately
> Pros:
>      * it makes whatever changes are ASAP after the Device-TLB flush error.
>  
> Cons:
>      * It may break the code path.
>      * It may lead to any race condition.
>      * Hiding the devices immediately is under the assumption that all locks have been held.
>       Different locking state is possible for different call trees. This part becomes too complex.

So you just repeat what you've already said before. "This part
becomes too complex" you say without any kind of proof, yet
that's what we need to understand whether the alternative of
doing the locking correctly really is this bad (and I continue to
not see why it would).

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-27 11:24                             ` Jan Beulich
@ 2016-01-27 12:38                               ` Xu, Quan
  2016-01-27 13:15                                 ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-27 12:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim, xen-devel

> On January 27, 2016 at 7:24pm, <JBeulich@suse.com> wrote:
> >>> On 27.01.16 at 12:09, <quan.xu@intel.com> wrote:
> >>  On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
> >> > From: Jan Beulich [mailto:JBeulich@suse.com]
> >> > Sent: Tuesday, January 26, 2016 11:53 PM
> >
> >
> >> > Once again: Before getting started, please assess which route is
> >> > going to be the better one. Remember that we had already discussed
> >> > and put aside some form of deferring the hiding of devices, so if
> >> > you come back with a patch doing that again, you'll have to be able
> >> > to explain why the alternative(s) are worse.
> >> >
> >>
> >> Quan, could you list pros/cons of those alternatives based on discussion so
> far?
> >> Then we can decide which way should be done before you go to actual
> coding.
> >> Earlier suggestion on hiding device immediately is under the
> >> assumption that all locks have been held. If this part becomes too
> >> complex, and you can explain clearly that deferring the hiding action
> >> doesn't lead to any race condition, then people can see why you are
> proposing defer again.
> >
> >
> > The following are pros/cons of those alternatives. It is also why I
> > propose defer again.
> >
> > -- --
> > 1. Hiding the devices immediately
> > Pros:
> >      * it makes whatever changes are ASAP after the Device-TLB flush error.
> >
> > Cons:
> >      * It may break the code path.
> >      * It may lead to any race condition.
> >      * Hiding the devices immediately is under the assumption that all locks
> have been held.
> >       Different locking state is possible for different call trees. This part
> becomes too complex.
> 
> So you just repeat what you've already said before. "This part becomes too
> complex" you say without any kind of proof, yet that's what we need to
> understand whether the alternative of doing the locking correctly really is this
> bad (and I continue to not see why it would).


Such as pcidevs_lock:

1. as I mentioned, it is indeed different locking state is possible for different call trees of flush Device-TLB. When Device-TLB flush is error, It is really challenge to judge whether to acquire the pcidevs_lock or not.

   For example,
   *It is _not_under_ lock for the following call tree:
$ flush_iotlb_qi()--- iommu_flush_iotlb_psi() -- __intel_iommu_iotlb_flush() --intel_iommu_iotlb_flush() --iommu_iotlb_flush() --xenmem_add_to_physmap()--do_memory_op() 

   *It is _under_ lock for the following call tree:
$flush_iotlb_qi()--iommu_flush_iotlb_dsi()--domain_context_unmap_one()--domain_context_unmap()--reassign_device_ownership()--deassign_device()-iommu_do_pci_domctl()

2. if I try to acquire the pcidevs_lock for some _not_under_ lock call tree, it makes things worse. As the pcidevs_lock is a big lock, then
  Frequent memory modification may block the pci-device assign due to the pcidevs_lock. if I try to split the pcidevs_lock into small locks.
  It may takes a great deal of time to make it stable.

Also I should consider the other locks..


-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-27 12:38                               ` Xu, Quan
@ 2016-01-27 13:15                                 ` Jan Beulich
  2016-01-27 14:13                                   ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-27 13:15 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim, xen-devel

>>> On 27.01.16 at 13:38, <quan.xu@intel.com> wrote:
>>  On January 27, 2016 at 7:24pm, <JBeulich@suse.com> wrote:
>> >>> On 27.01.16 at 12:09, <quan.xu@intel.com> wrote:
>> >>  On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
>> >> > From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> > Sent: Tuesday, January 26, 2016 11:53 PM
>> >
>> >
>> >> > Once again: Before getting started, please assess which route is
>> >> > going to be the better one. Remember that we had already discussed
>> >> > and put aside some form of deferring the hiding of devices, so if
>> >> > you come back with a patch doing that again, you'll have to be able
>> >> > to explain why the alternative(s) are worse.
>> >> >
>> >>
>> >> Quan, could you list pros/cons of those alternatives based on discussion so
>> far?
>> >> Then we can decide which way should be done before you go to actual
>> coding.
>> >> Earlier suggestion on hiding device immediately is under the
>> >> assumption that all locks have been held. If this part becomes too
>> >> complex, and you can explain clearly that deferring the hiding action
>> >> doesn't lead to any race condition, then people can see why you are
>> proposing defer again.
>> >
>> >
>> > The following are pros/cons of those alternatives. It is also why I
>> > propose defer again.
>> >
>> > -- --
>> > 1. Hiding the devices immediately
>> > Pros:
>> >      * it makes whatever changes are ASAP after the Device-TLB flush error.
>> >
>> > Cons:
>> >      * It may break the code path.
>> >      * It may lead to any race condition.
>> >      * Hiding the devices immediately is under the assumption that all 
> locks
>> have been held.
>> >       Different locking state is possible for different call trees. This 
> part
>> becomes too complex.
>> 
>> So you just repeat what you've already said before. "This part becomes too
>> complex" you say without any kind of proof, yet that's what we need to
>> understand whether the alternative of doing the locking correctly really is 
> this
>> bad (and I continue to not see why it would).
> 
> 
> Such as pcidevs_lock:
> 
> 1. as I mentioned, it is indeed different locking state is possible for 
> different call trees of flush Device-TLB. When Device-TLB flush is error, It is 
> really challenge to judge whether to acquire the pcidevs_lock or not.
> 
>    For example,
>    *It is _not_under_ lock for the following call tree:
> $ flush_iotlb_qi()--- iommu_flush_iotlb_psi() -- __intel_iommu_iotlb_flush() 
> --intel_iommu_iotlb_flush() --iommu_iotlb_flush() 
> --xenmem_add_to_physmap()--do_memory_op() 
> 
>    *It is _under_ lock for the following call tree:
> $flush_iotlb_qi()--iommu_flush_iotlb_dsi()--domain_context_unmap_one()--domain_con
> text_unmap()--reassign_device_ownership()--deassign_device()-iommu_do_pci_domctl()
> 
> 2. if I try to acquire the pcidevs_lock for some _not_under_ lock call tree, 
> it makes things worse. As the pcidevs_lock is a big lock, then
>   Frequent memory modification may block the pci-device assign due to the 
> pcidevs_lock. if I try to split the pcidevs_lock into small locks.
>   It may takes a great deal of time to make it stable.

I don't understand this, namely in the context of my suggestion to
simply pass down a flag indicating whether the lock is being held
(and hence acquiring it only in the most narrow scope if not already
owning it).

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-27 13:15                                 ` Jan Beulich
@ 2016-01-27 14:13                                   ` Xu, Quan
  2016-01-27 14:29                                     ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-27 14:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim, xen-devel

> On January 27, 2016 at 9:15pm, <JBeulich@suse.com> wrote:
> >>> On 27.01.16 at 13:38, <quan.xu@intel.com> wrote:
> >>  On January 27, 2016 at 7:24pm, <JBeulich@suse.com> wrote:
> >> >>> On 27.01.16 at 12:09, <quan.xu@intel.com> wrote:
> >> >>  On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
> >> >> > From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> > Sent: Tuesday, January 26, 2016 11:53 PM
> >> >
> >> >
> >> >> > Once again: Before getting started, please assess which route is
> >> >> > going to be the better one. Remember that we had already
> >> >> > discussed and put aside some form of deferring the hiding of
> >> >> > devices, so if you come back with a patch doing that again,
> >> >> > you'll have to be able to explain why the alternative(s) are worse.
> >> >> >
> >> >>
> >> >> Quan, could you list pros/cons of those alternatives based on
> >> >> discussion so
> >> far?
> >> >> Then we can decide which way should be done before you go to
> >> >> actual
> >> coding.
> >> >> Earlier suggestion on hiding device immediately is under the
> >> >> assumption that all locks have been held. If this part becomes too
> >> >> complex, and you can explain clearly that deferring the hiding
> >> >> action doesn't lead to any race condition, then people can see why
> >> >> you are
> >> proposing defer again.
> >> >
> >> >
> >> > The following are pros/cons of those alternatives. It is also why I
> >> > propose defer again.
> >> >
> >> > -- --
> >> > 1. Hiding the devices immediately
> >> > Pros:
> >> >      * it makes whatever changes are ASAP after the Device-TLB flush
> error.
> >> >
> >> > Cons:
> >> >      * It may break the code path.
> >> >      * It may lead to any race condition.
> >> >      * Hiding the devices immediately is under the assumption that
> >> > all
> > locks
> >> have been held.
> >> >       Different locking state is possible for different call trees.
> >> > This
> > part
> >> becomes too complex.
> >>
> >> So you just repeat what you've already said before. "This part
> >> becomes too complex" you say without any kind of proof, yet that's
> >> what we need to understand whether the alternative of doing the
> >> locking correctly really is
> > this
> >> bad (and I continue to not see why it would).
> >
> >
> > Such as pcidevs_lock:
> >
> > 1. as I mentioned, it is indeed different locking state is possible
> > for different call trees of flush Device-TLB. When Device-TLB flush is
> > error, It is really challenge to judge whether to acquire the pcidevs_lock or
> not.
> >
> >    For example,
> >    *It is _not_under_ lock for the following call tree:
> > $ flush_iotlb_qi()--- iommu_flush_iotlb_psi() --
> > __intel_iommu_iotlb_flush()
> > --intel_iommu_iotlb_flush() --iommu_iotlb_flush()
> > --xenmem_add_to_physmap()--do_memory_op()
> >
> >    *It is _under_ lock for the following call tree:
> > $flush_iotlb_qi()--iommu_flush_iotlb_dsi()--domain_context_unmap_one()
> > --domain_con
> > text_unmap()--reassign_device_ownership()--deassign_device()-iommu_do_
> > pci_domctl()
> >
> > 2. if I try to acquire the pcidevs_lock for some _not_under_ lock call
> > tree, it makes things worse. As the pcidevs_lock is a big lock, then
> >   Frequent memory modification may block the pci-device assign due to
> > the pcidevs_lock. if I try to split the pcidevs_lock into small locks.
> >   It may takes a great deal of time to make it stable.
> 
> I don't understand this, namely in the context of my suggestion to simply pass
> down a flag indicating whether the lock is being held (and hence acquiring it
> only in the most narrow scope if not already owning it).
> 

This is also an idea.
BTW, Does the lock refer to pcidevs_lock?

-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-27 14:13                                   ` Xu, Quan
@ 2016-01-27 14:29                                     ` Jan Beulich
  2016-01-27 14:35                                       ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-27 14:29 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3, tim, xen-devel

>>> On 27.01.16 at 15:13, <quan.xu@intel.com> wrote:
>>  On January 27, 2016 at 9:15pm, <JBeulich@suse.com> wrote:
>> >>> On 27.01.16 at 13:38, <quan.xu@intel.com> wrote:
>> >>  On January 27, 2016 at 7:24pm, <JBeulich@suse.com> wrote:
>> >> >>> On 27.01.16 at 12:09, <quan.xu@intel.com> wrote:
>> >> >>  On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
>> >> >> > From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> > Sent: Tuesday, January 26, 2016 11:53 PM
>> >> >
>> >> >
>> >> >> > Once again: Before getting started, please assess which route is
>> >> >> > going to be the better one. Remember that we had already
>> >> >> > discussed and put aside some form of deferring the hiding of
>> >> >> > devices, so if you come back with a patch doing that again,
>> >> >> > you'll have to be able to explain why the alternative(s) are worse.
>> >> >> >
>> >> >>
>> >> >> Quan, could you list pros/cons of those alternatives based on
>> >> >> discussion so
>> >> far?
>> >> >> Then we can decide which way should be done before you go to
>> >> >> actual
>> >> coding.
>> >> >> Earlier suggestion on hiding device immediately is under the
>> >> >> assumption that all locks have been held. If this part becomes too
>> >> >> complex, and you can explain clearly that deferring the hiding
>> >> >> action doesn't lead to any race condition, then people can see why
>> >> >> you are
>> >> proposing defer again.
>> >> >
>> >> >
>> >> > The following are pros/cons of those alternatives. It is also why I
>> >> > propose defer again.
>> >> >
>> >> > -- --
>> >> > 1. Hiding the devices immediately
>> >> > Pros:
>> >> >      * it makes whatever changes are ASAP after the Device-TLB flush
>> error.
>> >> >
>> >> > Cons:
>> >> >      * It may break the code path.
>> >> >      * It may lead to any race condition.
>> >> >      * Hiding the devices immediately is under the assumption that
>> >> > all
>> > locks
>> >> have been held.
>> >> >       Different locking state is possible for different call trees.
>> >> > This
>> > part
>> >> becomes too complex.
>> >>
>> >> So you just repeat what you've already said before. "This part
>> >> becomes too complex" you say without any kind of proof, yet that's
>> >> what we need to understand whether the alternative of doing the
>> >> locking correctly really is
>> > this
>> >> bad (and I continue to not see why it would).
>> >
>> >
>> > Such as pcidevs_lock:
>> >
>> > 1. as I mentioned, it is indeed different locking state is possible
>> > for different call trees of flush Device-TLB. When Device-TLB flush is
>> > error, It is really challenge to judge whether to acquire the pcidevs_lock or
>> not.
>> >
>> >    For example,
>> >    *It is _not_under_ lock for the following call tree:
>> > $ flush_iotlb_qi()--- iommu_flush_iotlb_psi() --
>> > __intel_iommu_iotlb_flush()
>> > --intel_iommu_iotlb_flush() --iommu_iotlb_flush()
>> > --xenmem_add_to_physmap()--do_memory_op()
>> >
>> >    *It is _under_ lock for the following call tree:
>> > $flush_iotlb_qi()--iommu_flush_iotlb_dsi()--domain_context_unmap_one()
>> > --domain_con
>> > text_unmap()--reassign_device_ownership()--deassign_device()-iommu_do_
>> > pci_domctl()
>> >
>> > 2. if I try to acquire the pcidevs_lock for some _not_under_ lock call
>> > tree, it makes things worse. As the pcidevs_lock is a big lock, then
>> >   Frequent memory modification may block the pci-device assign due to
>> > the pcidevs_lock. if I try to split the pcidevs_lock into small locks.
>> >   It may takes a great deal of time to make it stable.
>> 
>> I don't understand this, namely in the context of my suggestion to simply pass
>> down a flag indicating whether the lock is being held (and hence acquiring it
>> only in the most narrow scope if not already owning it).
>> 
> 
> This is also an idea.
> BTW, Does the lock refer to pcidevs_lock?

Yes, for now I assume that only that lock actually needs special
attention.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-27 14:29                                     ` Jan Beulich
@ 2016-01-27 14:35                                       ` Xu, Quan
  0 siblings, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2016-01-27 14:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Wu, Feng, george.dunlap, andrew.cooper3, tim, xen-devel

> On January 27, 2016 10:29pm, <JBeulich@suse.com> wrote:
> >>> On 27.01.16 at 15:13, <quan.xu@intel.com> wrote:
> >>  On January 27, 2016 at 9:15pm, <JBeulich@suse.com> wrote:
> >> >>> On 27.01.16 at 13:38, <quan.xu@intel.com> wrote:
> >> >>  On January 27, 2016 at 7:24pm, <JBeulich@suse.com> wrote:
> >> >> >>> On 27.01.16 at 12:09, <quan.xu@intel.com> wrote:
> >> >> >>  On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
> >> >> >> > From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> >> > Sent: Tuesday, January 26, 2016 11:53 PM
> >> >> >
> >> >> >
> >> >> >> > Once again: Before getting started, please assess which route
> >> >> >> > is going to be the better one. Remember that we had already
> >> >> >> > discussed and put aside some form of deferring the hiding of
> >> >> >> > devices, so if you come back with a patch doing that again,
> >> >> >> > you'll have to be able to explain why the alternative(s) are worse.
> >> >> >> >
> >> >> >>
> >> >> >> Quan, could you list pros/cons of those alternatives based on
> >> >> >> discussion so
> >> >> far?
> >> >> >> Then we can decide which way should be done before you go to
> >> >> >> actual
> >> >> coding.
> >> >> >> Earlier suggestion on hiding device immediately is under the
> >> >> >> assumption that all locks have been held. If this part becomes
> >> >> >> too complex, and you can explain clearly that deferring the
> >> >> >> hiding action doesn't lead to any race condition, then people
> >> >> >> can see why you are
> >> >> proposing defer again.
> >> >> >
> >> >> >
> >> >> > The following are pros/cons of those alternatives. It is also
> >> >> > why I propose defer again.
> >> >> >
> >> >> > -- --
> >> >> > 1. Hiding the devices immediately
> >> >> > Pros:
> >> >> >      * it makes whatever changes are ASAP after the Device-TLB
> >> >> > flush
> >> error.
> >> >> >
> >> >> > Cons:
> >> >> >      * It may break the code path.
> >> >> >      * It may lead to any race condition.
> >> >> >      * Hiding the devices immediately is under the assumption
> >> >> > that all
> >> > locks
> >> >> have been held.
> >> >> >       Different locking state is possible for different call trees.
> >> >> > This
> >> > part
> >> >> becomes too complex.
> >> >>
> >> >> So you just repeat what you've already said before. "This part
> >> >> becomes too complex" you say without any kind of proof, yet that's
> >> >> what we need to understand whether the alternative of doing the
> >> >> locking correctly really is
> >> > this
> >> >> bad (and I continue to not see why it would).
> >> >
> >> >
> >> > Such as pcidevs_lock:
> >> >
> >> > 1. as I mentioned, it is indeed different locking state is possible
> >> > for different call trees of flush Device-TLB. When Device-TLB flush
> >> > is error, It is really challenge to judge whether to acquire the
> >> > pcidevs_lock or
> >> not.
> >> >
> >> >    For example,
> >> >    *It is _not_under_ lock for the following call tree:
> >> > $ flush_iotlb_qi()--- iommu_flush_iotlb_psi() --
> >> > __intel_iommu_iotlb_flush()
> >> > --intel_iommu_iotlb_flush() --iommu_iotlb_flush()
> >> > --xenmem_add_to_physmap()--do_memory_op()
> >> >
> >> >    *It is _under_ lock for the following call tree:
> >> > $flush_iotlb_qi()--iommu_flush_iotlb_dsi()--domain_context_unmap_on
> >> > e()
> >> > --domain_con
> >> > text_unmap()--reassign_device_ownership()--deassign_device()-iommu_
> >> > do_
> >> > pci_domctl()
> >> >
> >> > 2. if I try to acquire the pcidevs_lock for some _not_under_ lock
> >> > call tree, it makes things worse. As the pcidevs_lock is a big lock, then
> >> >   Frequent memory modification may block the pci-device assign due
> >> > to the pcidevs_lock. if I try to split the pcidevs_lock into small locks.
> >> >   It may takes a great deal of time to make it stable.
> >>
> >> I don't understand this, namely in the context of my suggestion to
> >> simply pass down a flag indicating whether the lock is being held
> >> (and hence acquiring it only in the most narrow scope if not already owning
> it).
> >>
> >
> > This is also an idea.
> > BTW, Does the lock refer to pcidevs_lock?
> 
> Yes, for now I assume that only that lock actually needs special attention.
> 

Thanks. I think so too.
-Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-18 15:32                 ` Tim Deegan
  2016-01-19  0:46                   ` Tian, Kevin
@ 2016-01-19  6:54                   ` Xu, Quan
  1 sibling, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2016-01-19  6:54 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Tian, Kevin, keir, Jan Beulich, george.dunlap, andrew.cooper3,
	xen-devel, Nakajima, Jun, Wu, Feng

> On January 18, 2016 at 11:32pm, <tim@xen.org> wrote:
> At 10:46 +0000 on 14 Jan (1452768377), Xu, Quan wrote:
> > > It's not about how this specific thing can be fixed. I didn't check all the code.
> > > There could be more examples, and in the future all new code need to
> > > be aware that the majority of IOMMU functions may have pdev->domain
> > > changed due to error. My concern is more that it's not a good
> > > design. I think it's natural to have a state changed only after all
> > > existing references in the call chain have been completed. Adding a new flag
> to delay hiding device looks much clearer to me.
> >
> > It looks we are not on the same page for how to hide a device.
> > Actually I have implemented these two ideas with pci_hide_device() or
> > flag in previous versions.
> >
> > Andrew and Tim, what's your opinion?
> 
> I have no strong opinion, since this isn't my area (and really never was - I
> worked on address translation more than device allocation).
> Since you ask, I don't see anything wrong with hiding the device if you already
> own all the locks, and I'd be inclined to make whatever changes are necessary
> ASAP after the error.  Deferring the change (and having all callers need to look
> for the deferred error) sounds fragile.

Tim, thanks a lots. I always assumed that you know ALL of Xen..
I will follow current v4 implement and send out v5 ASAP.

- Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-18 15:32                 ` Tim Deegan
@ 2016-01-19  0:46                   ` Tian, Kevin
  2016-01-19  6:54                   ` Xu, Quan
  1 sibling, 0 replies; 60+ messages in thread
From: Tian, Kevin @ 2016-01-19  0:46 UTC (permalink / raw)
  To: Tim Deegan, Xu, Quan
  Cc: keir, Nakajima, Jun, george.dunlap, andrew.cooper3, xen-devel,
	Jan Beulich, Wu, Feng

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Monday, January 18, 2016 11:32 PM
> 
> At 10:46 +0000 on 14 Jan (1452768377), Xu, Quan wrote:
> > > It's not about how this specific thing can be fixed. I didn't check all the code.
> > > There could be more examples, and in the future all new code need to be aware
> > > that the majority of IOMMU functions may have pdev->domain changed due to
> > > error. My concern is more that it's not a good design. I think it's natural to have
> > > a state changed only after all existing references in the call chain have been
> > > completed. Adding a new flag to delay hiding device looks much clearer to me.
> >
> > It looks we are not on the same page for how to hide a device.
> > Actually I have implemented these two ideas with pci_hide_device()
> > or flag in previous versions.
> >
> > Andrew and Tim, what's your opinion?
> 
> I have no strong opinion, since this isn't my area (and really never
> was - I worked on address translation more than device allocation).
> Since you ask, I don't see anything wrong with hiding the device if
> you already own all the locks, and I'd be inclined to make whatever
> changes are necessary ASAP after the error.  Deferring the change (and
> having all callers need to look for the deferred error) sounds
> fragile.
> 

It doesn't need to have all callers do same deferred error check. It
can be done in next device assignment point. But I'm OK with going
Jan's proposal, i.e. hide device right when thing goes wrong, as long
as it is the common practice in Xen.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-14 10:46               ` Xu, Quan
@ 2016-01-18 15:32                 ` Tim Deegan
  2016-01-19  0:46                   ` Tian, Kevin
  2016-01-19  6:54                   ` Xu, Quan
  0 siblings, 2 replies; 60+ messages in thread
From: Tim Deegan @ 2016-01-18 15:32 UTC (permalink / raw)
  To: Xu, Quan
  Cc: Tian, Kevin, keir, Jan Beulich, george.dunlap, andrew.cooper3,
	xen-devel, Nakajima, Jun, Wu, Feng

At 10:46 +0000 on 14 Jan (1452768377), Xu, Quan wrote:
> > It's not about how this specific thing can be fixed. I didn't check all the code.
> > There could be more examples, and in the future all new code need to be aware
> > that the majority of IOMMU functions may have pdev->domain changed due to
> > error. My concern is more that it's not a good design. I think it's natural to have
> > a state changed only after all existing references in the call chain have been
> > completed. Adding a new flag to delay hiding device looks much clearer to me.
> 
> It looks we are not on the same page for how to hide a device.
> Actually I have implemented these two ideas with pci_hide_device()
> or flag in previous versions.
> 
> Andrew and Tim, what's your opinion?

I have no strong opinion, since this isn't my area (and really never
was - I worked on address translation more than device allocation).
Since you ask, I don't see anything wrong with hiding the device if
you already own all the locks, and I'd be inclined to make whatever
changes are necessary ASAP after the error.  Deferring the change (and
having all callers need to look for the deferred error) sounds
fragile.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-14  9:12             ` Tian, Kevin
@ 2016-01-14 10:46               ` Xu, Quan
  2016-01-18 15:32                 ` Tim Deegan
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-14 10:46 UTC (permalink / raw)
  To: tim, andrew.cooper3
  Cc: Tian, Kevin, keir, Nakajima, Jun, george.dunlap, xen-devel,
	Jan Beulich, Wu, Feng



> -----Original Message-----
> From: Tian, Kevin
> Sent: Thursday, January 14, 2016 5:13 PM
> To: Jan Beulich; Xu, Quan
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Wu, Feng;
> Nakajima, Jun; xen-devel@lists.xen.org; keir@xen.org; tim@xen.org
> Subject: RE: [Xen-devel] [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout
> issue.
> 

> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Thursday, January 14, 2016 5:00 PM
> >
> > >>> On 14.01.16 at 02:22, <kevin.tian@intel.com> wrote:
> > >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> > >> Sent: Wednesday, January 13, 2016 7:03 PM
> > >>
> > >> >> Jan,
> > >> >> Patch 2/3 and Patch 3/3 are based on v3 (Actually they are v3's
> > >> >> patch 1/2
> > >> > and patch 2/2).
> > >> >> We have discussed how to hide a device with pci_hide_device()
> > >> >> when
> > > Device-TLB
> > >> > flush is
> > >> >> error.
> > >> >>
> > >> >> Now there are 2 concerns:
> > >> >>       1. Hide the PCI device may break the code path. (then the
> > >> >> pdev->domain
> > >> > is
> > >> >> dom_xen)
> > >> >>       2. Is the blew logic right?
> > >> >>            --If Device-TLB flush is timeout, we'll hide the
> > >> >> target ATS device
> > >> > and crash the
> > >> >> domain owning this ATS device.
> > >> >>              If impacted domain is hardware domain, just throw
> > >> >> out a
> > >> > warning, instead of
> > >> >> crash the hardware domain.
> > >> >>             The hided Device will be disallowed to be further
> > >> >> assigned to
> > >> > any domain.
> > >> >>
> > >> >> Kevin, correct me if I am wrong.
> > >> >>
> > >> >>
> > >> >
> > >> > for 2, yes it's the policy we agreed in previous discussion.
> > >> >
> > >> > for 1, after more thinking I think the concern is real.
> > >> > pci_hide_device is used once in early boot-up phase. At that
> > >> > time, it's simple to just have right owner configured. However in
> > >> > the path of normal device assign/deassign, there are tons of more
> > >> > state associated which may be related to the owner. Though we may
> > >> > do some special handling in related code paths to have dom_xen
> > >> > specially handled, it's way tricky and not easy to maintain.
> > >>
> > >> I don't buy this without one of you pointing out the actual
> > >> difficulties: The domain is being crashed anyway, so there's no
> > >> true device de-assignment needed (IOMMU tables will get torn down
> > >> no matter what).
> > >>
> > >
> > > Here is one example of a quick glimpse:
> > >
> > > static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev) {
> > > 	[...]
> > >
> > >     ret = domain_context_mapping(pdev->domain, devfn, pdev);
> > >     if ( ret )
> > >     {
> > >         dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping
> failed\n",
> > >                 pdev->domain->domain_id);
> > >         return ret;
> > >     }
> > >
> > > If pdev->domain is changed inside due to pci_hide_device, the error
> > > message will be inaccurate.
> >
> > But that's a thing trivial to fix up. You had talked about tricky things...
> >
> 
> It's not about how this specific thing can be fixed. I didn't check all the code.
> There could be more examples, and in the future all new code need to be aware
> that the majority of IOMMU functions may have pdev->domain changed due to
> error. My concern is more that it's not a good design. I think it's natural to have
> a state changed only after all existing references in the call chain have been
> completed. Adding a new flag to delay hiding device looks much clearer to me.
> 

It looks we are not on the same page for how to hide a device.
Actually I have implemented these two ideas with pci_hide_device() or flag in previous versions.

Andrew and Tim, what's your opinion?


Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-14  9:00           ` Jan Beulich
@ 2016-01-14  9:12             ` Tian, Kevin
  2016-01-14 10:46               ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2016-01-14  9:12 UTC (permalink / raw)
  To: Jan Beulich, Xu, Quan
  Cc: keir, george.dunlap, andrew.cooper3, tim, xen-devel, Nakajima,
	Jun, Wu, Feng

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, January 14, 2016 5:00 PM
> 
> >>> On 14.01.16 at 02:22, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, January 13, 2016 7:03 PM
> >>
> >> >> Jan,
> >> >> Patch 2/3 and Patch 3/3 are based on v3 (Actually they are v3's patch 1/2
> >> > and patch 2/2).
> >> >> We have discussed how to hide a device with pci_hide_device() when
> > Device-TLB
> >> > flush is
> >> >> error.
> >> >>
> >> >> Now there are 2 concerns:
> >> >>       1. Hide the PCI device may break the code path. (then the pdev->domain
> >> > is
> >> >> dom_xen)
> >> >>       2. Is the blew logic right?
> >> >>            --If Device-TLB flush is timeout, we'll hide the target ATS device
> >> > and crash the
> >> >> domain owning this ATS device.
> >> >>              If impacted domain is hardware domain, just throw out a
> >> > warning, instead of
> >> >> crash the hardware domain.
> >> >>             The hided Device will be disallowed to be further assigned to
> >> > any domain.
> >> >>
> >> >> Kevin, correct me if I am wrong.
> >> >>
> >> >>
> >> >
> >> > for 2, yes it's the policy we agreed in previous discussion.
> >> >
> >> > for 1, after more thinking I think the concern is real. pci_hide_device
> >> > is used once in early boot-up phase. At that time, it's simple to just
> >> > have right owner configured. However in the path of normal device
> >> > assign/deassign, there are tons of more state associated which may
> >> > be related to the owner. Though we may do some special handling
> >> > in related code paths to have dom_xen specially handled, it's way
> >> > tricky and not easy to maintain.
> >>
> >> I don't buy this without one of you pointing out the actual
> >> difficulties: The domain is being crashed anyway, so there's no
> >> true device de-assignment needed (IOMMU tables will get torn
> >> down no matter what).
> >>
> >
> > Here is one example of a quick glimpse:
> >
> > static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
> > {
> > 	[...]
> >
> >     ret = domain_context_mapping(pdev->domain, devfn, pdev);
> >     if ( ret )
> >     {
> >         dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
> >                 pdev->domain->domain_id);
> >         return ret;
> >     }
> >
> > If pdev->domain is changed inside due to pci_hide_device, the error
> > message will be inaccurate.
> 
> But that's a thing trivial to fix up. You had talked about tricky things...
> 

It's not about how this specific thing can be fixed. I didn't check all the code. 
There could be more examples, and in the future all new code need to be
aware that the majority of IOMMU functions may have pdev->domain changed
due to error. My concern is more that it's not a good design. I think it's natural
to have a state changed only after all existing references in the call 
chain have been completed. Adding a new flag to delay hiding device looks much 
clearer to me.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-14  1:22         ` Tian, Kevin
@ 2016-01-14  9:00           ` Jan Beulich
  2016-01-14  9:12             ` Tian, Kevin
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-14  9:00 UTC (permalink / raw)
  To: Kevin Tian, Quan Xu
  Cc: keir, george.dunlap, andrew.cooper3, tim, xen-devel,
	Jun Nakajima, Feng Wu

>>> On 14.01.16 at 02:22, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, January 13, 2016 7:03 PM
>> 
>> >> Jan,
>> >> Patch 2/3 and Patch 3/3 are based on v3 (Actually they are v3's patch 1/2
>> > and patch 2/2).
>> >> We have discussed how to hide a device with pci_hide_device() when 
> Device-TLB
>> > flush is
>> >> error.
>> >>
>> >> Now there are 2 concerns:
>> >>       1. Hide the PCI device may break the code path. (then the pdev->domain
>> > is
>> >> dom_xen)
>> >>       2. Is the blew logic right?
>> >>            --If Device-TLB flush is timeout, we'll hide the target ATS device
>> > and crash the
>> >> domain owning this ATS device.
>> >>              If impacted domain is hardware domain, just throw out a
>> > warning, instead of
>> >> crash the hardware domain.
>> >>             The hided Device will be disallowed to be further assigned to
>> > any domain.
>> >>
>> >> Kevin, correct me if I am wrong.
>> >>
>> >>
>> >
>> > for 2, yes it's the policy we agreed in previous discussion.
>> >
>> > for 1, after more thinking I think the concern is real. pci_hide_device
>> > is used once in early boot-up phase. At that time, it's simple to just
>> > have right owner configured. However in the path of normal device
>> > assign/deassign, there are tons of more state associated which may
>> > be related to the owner. Though we may do some special handling
>> > in related code paths to have dom_xen specially handled, it's way
>> > tricky and not easy to maintain.
>> 
>> I don't buy this without one of you pointing out the actual
>> difficulties: The domain is being crashed anyway, so there's no
>> true device de-assignment needed (IOMMU tables will get torn
>> down no matter what).
>> 
> 
> Here is one example of a quick glimpse:
> 
> static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
> {
> 	[...]
> 
>     ret = domain_context_mapping(pdev->domain, devfn, pdev);
>     if ( ret )
>     {
>         dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
>                 pdev->domain->domain_id);
>         return ret;
>     }
> 
> If pdev->domain is changed inside due to pci_hide_device, the error
> message will be inaccurate.

But that's a thing trivial to fix up. You had talked about tricky things...

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-13 11:02       ` Jan Beulich
@ 2016-01-14  1:22         ` Tian, Kevin
  2016-01-14  9:00           ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2016-01-14  1:22 UTC (permalink / raw)
  To: Jan Beulich, Xu, Quan
  Cc: keir, george.dunlap, andrew.cooper3, tim, xen-devel, Nakajima,
	Jun, Wu, Feng

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, January 13, 2016 7:03 PM
> 
> >> Jan,
> >> Patch 2/3 and Patch 3/3 are based on v3 (Actually they are v3's patch 1/2
> > and patch 2/2).
> >> We have discussed how to hide a device with pci_hide_device() when Device-TLB
> > flush is
> >> error.
> >>
> >> Now there are 2 concerns:
> >>       1. Hide the PCI device may break the code path. (then the pdev->domain
> > is
> >> dom_xen)
> >>       2. Is the blew logic right?
> >>            --If Device-TLB flush is timeout, we'll hide the target ATS device
> > and crash the
> >> domain owning this ATS device.
> >>              If impacted domain is hardware domain, just throw out a
> > warning, instead of
> >> crash the hardware domain.
> >>             The hided Device will be disallowed to be further assigned to
> > any domain.
> >>
> >> Kevin, correct me if I am wrong.
> >>
> >>
> >
> > for 2, yes it's the policy we agreed in previous discussion.
> >
> > for 1, after more thinking I think the concern is real. pci_hide_device
> > is used once in early boot-up phase. At that time, it's simple to just
> > have right owner configured. However in the path of normal device
> > assign/deassign, there are tons of more state associated which may
> > be related to the owner. Though we may do some special handling
> > in related code paths to have dom_xen specially handled, it's way
> > tricky and not easy to maintain.
> 
> I don't buy this without one of you pointing out the actual
> difficulties: The domain is being crashed anyway, so there's no
> true device de-assignment needed (IOMMU tables will get torn
> down no matter what).
> 

Here is one example of a quick glimpse:

static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
{
	[...]

    ret = domain_context_mapping(pdev->domain, devfn, pdev);
    if ( ret )
    {
        dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
                pdev->domain->domain_id);
        return ret;
    }

If pdev->domain is changed inside due to pci_hide_device, the error
message will be inaccurate.

That is why I think we'd better hide device until current call chain
completes.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-13  6:12     ` Tian, Kevin
@ 2016-01-13 11:02       ` Jan Beulich
  2016-01-14  1:22         ` Tian, Kevin
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-13 11:02 UTC (permalink / raw)
  To: Kevin Tian, Quan Xu
  Cc: keir, george.dunlap, andrew.cooper3, tim, xen-devel,
	Jun Nakajima, Feng Wu

>>> On 13.01.16 at 07:12, <kevin.tian@intel.com> wrote:
>>  From: Xu, Quan
>> Sent: Thursday, January 07, 2016 9:47 PM
>> 
>> > On January 07, 2016 9:28 PM, <JBeulich@suse.com> wrote:
>> > >>> On 07.01.16 at 02:39, <quan.xu@intel.com> wrote:
>> > > On January 06, 2016 7:26 PM, <quan.xu@intel.com> wrote:
>> > >> > I didn't think about the full logic thoroughly now. But it would
>> > >> > always be good to not hide device now until a point where all
>> > >> > related states have been cleaned up in error handling path chained up.
>> > >> >
>> > >
>> > > Jan, could you help me to double check it? thanks.
>> >
>> > I'm not sure I understand what you want or need, the more that I didn't 
> even
>> > get around to look at the patches yet.
>> >
>> 
>> Jan,
>> Patch 2/3 and Patch 3/3 are based on v3 (Actually they are v3's patch 1/2 
> and patch 2/2).
>> We have discussed how to hide a device with pci_hide_device() when Device-TLB 
> flush is
>> error.
>> 
>> Now there are 2 concerns:
>>       1. Hide the PCI device may break the code path. (then the pdev->domain 
> is
>> dom_xen)
>>       2. Is the blew logic right?
>>            --If Device-TLB flush is timeout, we'll hide the target ATS device 
> and crash the
>> domain owning this ATS device.
>>              If impacted domain is hardware domain, just throw out a 
> warning, instead of
>> crash the hardware domain.
>>             The hided Device will be disallowed to be further assigned to 
> any domain.
>> 
>> Kevin, correct me if I am wrong.
>> 
>> 
> 
> for 2, yes it's the policy we agreed in previous discussion.
> 
> for 1, after more thinking I think the concern is real. pci_hide_device
> is used once in early boot-up phase. At that time, it's simple to just
> have right owner configured. However in the path of normal device
> assign/deassign, there are tons of more state associated which may
> be related to the owner. Though we may do some special handling
> in related code paths to have dom_xen specially handled, it's way
> tricky and not easy to maintain.

I don't buy this without one of you pointing out the actual
difficulties: The domain is being crashed anyway, so there's no
true device de-assignment needed (IOMMU tables will get torn
down no matter what).

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-07 13:46   ` Xu, Quan
  2016-01-08  1:06     ` Xu, Quan
@ 2016-01-13  6:12     ` Tian, Kevin
  2016-01-13 11:02       ` Jan Beulich
  1 sibling, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2016-01-13  6:12 UTC (permalink / raw)
  To: Xu, Quan, Jan Beulich
  Cc: keir, george.dunlap, andrew.cooper3, tim, xen-devel, Nakajima,
	Jun, Wu, Feng

> From: Xu, Quan
> Sent: Thursday, January 07, 2016 9:47 PM
> 
> > On January 07, 2016 9:28 PM, <JBeulich@suse.com> wrote:
> > >>> On 07.01.16 at 02:39, <quan.xu@intel.com> wrote:
> > > On January 06, 2016 7:26 PM, <quan.xu@intel.com> wrote:
> > >> > I didn't think about the full logic thoroughly now. But it would
> > >> > always be good to not hide device now until a point where all
> > >> > related states have been cleaned up in error handling path chained up.
> > >> >
> > >
> > > Jan, could you help me to double check it? thanks.
> >
> > I'm not sure I understand what you want or need, the more that I didn't even
> > get around to look at the patches yet.
> >
> 
> Jan,
> Patch 2/3 and Patch 3/3 are based on v3 (Actually they are v3's patch 1/2 and patch 2/2).
> We have discussed how to hide a device with pci_hide_device() when Device-TLB flush is
> error.
> 
> Now there are 2 concerns:
>       1. Hide the PCI device may break the code path. (then the pdev->domain is
> dom_xen)
>       2. Is the blew logic right?
>            --If Device-TLB flush is timeout, we'll hide the target ATS device and crash the
> domain owning this ATS device.
>              If impacted domain is hardware domain, just throw out a warning, instead of
> crash the hardware domain.
>             The hided Device will be disallowed to be further assigned to any domain.
> 
> Kevin, correct me if I am wrong.
> 
> 

for 2, yes it's the policy we agreed in previous discussion.

for 1, after more thinking I think the concern is real. pci_hide_device
is used once in early boot-up phase. At that time, it's simple to just
have right owner configured. However in the path of normal device
assign/deassign, there are tons of more state associated which may
be related to the owner. Though we may do some special handling
in related code paths to have dom_xen specially handled, it's way
tricky and not easy to maintain.

I think the cleaner solution, similar to your earlier version, is to
set a flag and then continue existing calling chains with all required
error handling completed. Only at that place we can safely invoke
pci_hide_device. If outmost callers are scattered, we may do a 
lazy hide until next time when it's assigned to another guest while
the new flag is recognized.

Jan, your comments?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-07 13:46   ` Xu, Quan
@ 2016-01-08  1:06     ` Xu, Quan
  2016-01-13  6:12     ` Tian, Kevin
  1 sibling, 0 replies; 60+ messages in thread
From: Xu, Quan @ 2016-01-08  1:06 UTC (permalink / raw)
  To: Xu, Quan, Jan Beulich, Tian, Kevin
  Cc: keir, george.dunlap, andrew.cooper3, tim, xen-devel, Nakajima,
	Jun, Wu, Feng

On January 07, 2016 9:47 PM, <quan.xu@intel.com> wrote:
> Jan,
> Patch 2/3 and Patch 3/3 are based on v3 (Actually they are v3's patch 1/2 and
> patch 2/2).
> We have discussed how to hide a device with pci_hide_device() when
> Device-TLB flush is error.
> 
> Now there are 2 concerns:
>       1. Hide the PCI device may break the code path. (then the pdev->domain
> is dom_xen)
>       2. Is the blew logic right?
>            --If Device-TLB flush is timeout, we'll hide the target ATS device
> and crash the domain owning this ATS device.
>              If impacted domain is hardware domain, just throw out a
> warning, instead of crash the hardware domain.
>             The hided Device will be disallowed to be further assigned to any

Sorry, just correct it. 'hided Device' -> 'hidden device'. 



> domain.
> 
> Kevin, correct me if I am wrong.
> 
> 
> Quan
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-07 13:28 ` Jan Beulich
@ 2016-01-07 13:46   ` Xu, Quan
  2016-01-08  1:06     ` Xu, Quan
  2016-01-13  6:12     ` Tian, Kevin
  0 siblings, 2 replies; 60+ messages in thread
From: Xu, Quan @ 2016-01-07 13:46 UTC (permalink / raw)
  To: Jan Beulich, Tian, Kevin
  Cc: keir, george.dunlap, andrew.cooper3, tim, xen-devel, Nakajima,
	Jun, Wu, Feng

> On January 07, 2016 9:28 PM, <JBeulich@suse.com> wrote:
> >>> On 07.01.16 at 02:39, <quan.xu@intel.com> wrote:
> > On January 06, 2016 7:26 PM, <quan.xu@intel.com> wrote:
> >> > I didn't think about the full logic thoroughly now. But it would
> >> > always be good to not hide device now until a point where all
> >> > related states have been cleaned up in error handling path chained up.
> >> >
> >
> > Jan, could you help me to double check it? thanks.
> 
> I'm not sure I understand what you want or need, the more that I didn't even
> get around to look at the patches yet.
> 

Jan, 
Patch 2/3 and Patch 3/3 are based on v3 (Actually they are v3's patch 1/2 and patch 2/2).
We have discussed how to hide a device with pci_hide_device() when Device-TLB flush is error. 

Now there are 2 concerns:
      1. Hide the PCI device may break the code path. (then the pdev->domain is dom_xen)
      2. Is the blew logic right?
           --If Device-TLB flush is timeout, we'll hide the target ATS device and crash the domain owning this ATS device.
             If impacted domain is hardware domain, just throw out a warning, instead of crash the hardware domain.
            The hided Device will be disallowed to be further assigned to any domain.

Kevin, correct me if I am wrong.


Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
  2016-01-07  1:39 [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue Xu, Quan
@ 2016-01-07 13:28 ` Jan Beulich
  2016-01-07 13:46   ` Xu, Quan
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Beulich @ 2016-01-07 13:28 UTC (permalink / raw)
  To: Quan Xu
  Cc: tim, Kevin Tian, Feng Wu, george.dunlap, andrew.cooper3,
	Eddie Dong, xen-devel, Jun Nakajima, keir

>>> On 07.01.16 at 02:39, <quan.xu@intel.com> wrote:
> On January 06, 2016 7:26 PM, <quan.xu@intel.com> wrote: 
>> > I didn't think about the full logic thoroughly now. But it would
>> > always be good to not hide device now until a point where all related
>> > states have been cleaned up in error handling path chained up.
>> >
> 
> Jan, could you help me to double check it? thanks.

I'm not sure I understand what you want or need, the more that
I didn't even get around to look at the patches yet.

Jan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
@ 2016-01-07  1:39 Xu, Quan
  2016-01-07 13:28 ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-07  1:39 UTC (permalink / raw)
  To: jbeulich
  Cc: Tian, Kevin, Wu, Feng, Xu, Quan, george.dunlap, tim, Dong, Eddie,
	xen-devel, Nakajima, Jun, andrew.cooper3, keir

On January 06, 2016 7:26 PM, <quan.xu@intel.com> wrote: 
> > > diff --git a/xen/drivers/passthrough/vtd/qinval.c
> > > b/xen/drivers/passthrough/vtd/qinval.c
> > > index b227e4e..7330c5d 100644
> > > --- a/xen/drivers/passthrough/vtd/qinval.c
> > > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > > @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu
> > > *iommu,  static int invalidate_sync(struct iommu *iommu)  {
> > >      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> > > +    int rc;
> > >
> > >      if ( qi_ctrl->qinval_maddr )
> > > -        return queue_invalidate_wait(iommu, 0, 1, 1);
> > > +    {
> > > +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> > > +        if ( rc )
> > > +        {
> > > +            if ( rc == -ETIMEDOUT )
> > > +                panic("Queue invalidate wait descriptor was
> > timeout.\n");
> > > +            return rc;
> > > +        }
> > > +    }
> > > +
> >
> > why do you still do panic here? w/ PATCH 1/3 you should return rc to
> > caller directly, right?
> >
> 
> This panic is original in queue_invalidate_wait(). Patch 1/3 is just for Device-TLB
> flush error ( Actually it covers iotlb flush error).
> If I fix VT-d iotlb/context/iec flush error to propagated callers, then I can
> remove this panic and return rc to propagated caller directly.
> 
> 
> 
> > > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> > > +                                         u16 seg, u8 bus, u8
> devfn)
> > {
> > > +    struct domain *d;
> > > +    struct pci_dev *pdev;
> > > +
> > > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> > > +    ASSERT(d);
> > > +    for_each_pdev(d, pdev)
> > > +    {
> > > +        if ( (pdev->seg == seg) &&
> > > +             (pdev->bus == bus) &&
> > > +             (pdev->devfn == devfn) )
> > > +        {
> > > +            if ( pdev->domain )
> > > +            {
> > > +                list_del(&pdev->domain_list);
> > > +                pdev->domain = NULL;
> > > +            }
> > > +
> > > +            if ( pci_hide_device(bus, devfn) )
> > > +            {
> > > +                printk(XENLOG_ERR
> > > +                       "IOMMU hide device %04x:%02x:%02x error.",
> > > +                       seg, bus, devfn);
> > > +                break;
> > > +            }
> >
> > I'm not sure whether using pci_hide_device is enough or not break
> > other code path here? For example, now you have errors propagated to
> callers.
> 
> More information, after call pci_hide_device(), ..
>          pdev->domain = dom_xen;
>          list_add(&pdev->domain_list, &dom_xen->arch.pdev_list); ..
> 
> Hidden PCI devices are associated with 'dom_xen'.
> Now I found it may not clear rmrr mapping / context mapping. .i.e
> iommu_remove_device() -- [hd->platform_ops is NULL for dom_xen].
> But I think it is acceptable, IMO dom_xen is a part hypervisor, and there
> mappings would not a security issue.
> Or I can remove these mappings before to hide this device.
> 
> So I think it will not break other code break other code.
> 
> > What's the
> > final policy against flush error?
> >
> 
> If Device-TLB flush is timeout, we'll hide the target ATS device and crash the
> domain owning this ATS device.
> If impacted domain is hardware domain, just throw out a warning, instead of
> crash the hardware domain.
> The hided Device will be disallowed to be further assigned to any domain.
> 
> 
> 
> >If they will finally go to cleanup some more state  relying on
> >pdev->domain, then that code path might be broken. If it's fine to do
> >cleanup at this point, then just hidden is not enough. If you look at
> >pci_remove_device, there's quite some state to be further cleaned up...
> >
> 
> 
> As the pdev->domain is 'dom_xen';
> So for this case, it is still working. Correct me if it is not correct.
> BTW, a quick question, which case to call pci_remove_device()?
> 
> 
> > I didn't think about the full logic thoroughly now. But it would
> > always be good to not hide device now until a point where all related
> > states have been cleaned up in error handling path chained up.
> >

Jan, could you help me to double check it? thanks.

Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2016-01-27 14:35 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-23  8:25 [PATCH v4 0/3] VT-d Device-TLB flush issue Quan Xu
2015-12-23  8:25 ` [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error Quan Xu
2015-12-25  2:53   ` Tian, Kevin
2016-01-06  2:12     ` Xu, Quan
2016-01-14 17:01     ` Jan Beulich
2016-01-14 17:15   ` Jan Beulich
2015-12-23  8:25 ` [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
2015-12-25  2:56   ` Tian, Kevin
2015-12-25  2:58     ` Xu, Quan
2016-01-13 16:55   ` Jan Beulich
2016-01-14  1:53     ` Xu, Quan
2016-01-13 16:57   ` Jan Beulich
2016-01-14  1:59     ` Xu, Quan
2016-01-14  9:03       ` Jan Beulich
2016-01-14 10:29         ` Xu, Quan
2016-01-14 12:17           ` Jan Beulich
2015-12-23  8:25 ` [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue Quan Xu
2015-12-25  3:06   ` Tian, Kevin
2016-01-06 11:25     ` Xu, Quan
2016-01-15 13:09   ` Jan Beulich
2016-01-20 10:26     ` Xu, Quan
2016-01-20 11:29       ` Jan Beulich
2016-01-21 16:16         ` Xu, Quan
2016-01-21 16:31           ` Jan Beulich
2016-01-22 15:57             ` Xu, Quan
2016-01-25 14:09               ` Jan Beulich
2016-01-26 13:47                 ` Xu, Quan
2016-01-26 13:59                   ` Jan Beulich
2016-01-26 15:27                     ` Xu, Quan
2016-01-26 15:52                       ` Jan Beulich
2016-01-26 22:47                         ` Tian, Kevin
2016-01-27 11:09                           ` Xu, Quan
2016-01-27 11:24                             ` Jan Beulich
2016-01-27 12:38                               ` Xu, Quan
2016-01-27 13:15                                 ` Jan Beulich
2016-01-27 14:13                                   ` Xu, Quan
2016-01-27 14:29                                     ` Jan Beulich
2016-01-27 14:35                                       ` Xu, Quan
2015-12-23 10:19 ` [PATCH v4 0/3] VT-d Device-TLB flush issue Xu, Quan
2015-12-23 10:40   ` Jan Beulich
2015-12-23 10:59     ` Xu, Quan
2015-12-23 10:39 ` Jan Beulich
2015-12-23 11:09   ` Xu, Quan
2015-12-25  1:50   ` Tian, Kevin
2015-12-25  2:26     ` Xu, Quan
2015-12-25  1:53 ` Tian, Kevin
2015-12-25  2:04   ` Xu, Quan
2016-01-07  1:39 [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue Xu, Quan
2016-01-07 13:28 ` Jan Beulich
2016-01-07 13:46   ` Xu, Quan
2016-01-08  1:06     ` Xu, Quan
2016-01-13  6:12     ` Tian, Kevin
2016-01-13 11:02       ` Jan Beulich
2016-01-14  1:22         ` Tian, Kevin
2016-01-14  9:00           ` Jan Beulich
2016-01-14  9:12             ` Tian, Kevin
2016-01-14 10:46               ` Xu, Quan
2016-01-18 15:32                 ` Tim Deegan
2016-01-19  0:46                   ` Tian, Kevin
2016-01-19  6:54                   ` Xu, Quan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.