* [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables
@ 2022-01-10 16:19 Jan Beulich
From: Jan Beulich @ 2022-01-10 16:19 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

For a long time we've been rather inefficient with IOMMU page table
management when not sharing page tables, i.e. in particular for PV (and
specifically also for PV Dom0) and for AMD (where nowadays we never share
page tables). While the AMD code had logic to un-shatter page mappings
until about 2.5 years ago, that logic was ripped out for being buggy
(XSA-275 plus follow-on fixes).

This series enables use of large pages in AMD and Intel (VT-d) code;
Arm is presently not in need of any enabling as pagetables are always
shared there. It also augments PV Dom0 creation with suitable explicit
IOMMU mapping calls to facilitate use of large pages there. Depending
on the amount of memory handed to Dom0 this improves booting time
(latency until Dom0 actually starts) quite a bit; subsequent shattering
of some of the large pages may of course consume some of the saved time.

Known fallout has been spelled out here:
https://lists.xen.org/archives/html/xen-devel/2021-08/msg00781.html

Of course there are also a few seemingly unrelated changes included here,
which I came to consider necessary or at least desirable along the way (in
part because they had been in need of adjustment for a long time). Some of
these changes are likely independent of the bulk of the work here, and
hence may be fine to go in ahead of the earlier patches.

Besides addressing review feedback, v3 now also implements un-shattering
of page mappings, i.e. the re-establishing of large pages. There are also a
few other new small patches. See the individual patches for details.

01: AMD/IOMMU: have callers specify the target level for page table walks
02: VT-d: have callers specify the target level for page table walks
03: VT-d: limit page table population in domain_pgd_maddr()
04: IOMMU: have vendor code announce supported page sizes
05: IOMMU: simplify unmap-on-error in iommu_map()
06: IOMMU: add order parameter to ->{,un}map_page() hooks
07: IOMMU: have iommu_{,un}map() split requests into largest possible chunks
08: IOMMU/x86: restrict IO-APIC mappings for PV Dom0
09: IOMMU/x86: perform PV Dom0 mappings in batches
10: IOMMU/x86: support freeing of pagetables
11: AMD/IOMMU: drop stray TLB flush
12: AMD/IOMMU: walk trees upon page fault
13: AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
14: AMD/IOMMU: allow use of superpage mappings
15: VT-d: allow use of superpage mappings
16: IOMMU: fold flush-all hook into "flush one"
17: IOMMU/x86: prefill newly allocate page tables
18: x86: introduce helper for recording degree of contiguity in page tables
19: AMD/IOMMU: free all-empty page tables
20: VT-d: free all-empty page tables
21: AMD/IOMMU: replace all-contiguous page tables by superpage mappings
22: VT-d: replace all-contiguous page tables by superpage mappings
23: IOMMU/x86: add perf counters for page table splitting / coalescing

While not directly related (except that making this mode work properly
here was a fair part of the overall work), on this occasion I'd also like
to renew my proposal to make "iommu=dom0-strict" the default going forward.
For PVH Dom0 it already is not only the default, but the only possible
mode.

Jan




* [PATCH v3 01/23] AMD/IOMMU: have callers specify the target level for page table walks
From: Jan Beulich @ 2022-01-10 16:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

In order to be able to insert/remove super-pages we need to allow callers
of the walking function to specify at which level to stop the walk. (For
now at least GCC will instantiate just a variant of the function with the
new parameter eliminated, so as far as the parameter addition goes there is
effectively no change to the generated code.)

Instead of merely adjusting the BUG_ON() condition, convert it into an
error return - there's no reason to crash the entire host in that case.
Leave an assertion, though, to spot issues early in debug builds.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v3: Add ASSERT_UNREACHABLE(). Adjust a comment.
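
(Not part of the patch, merely for illustration:) once later patches in the
series make use of the new parameter, a superpage-capable caller might look
roughly like the sketch below. "set_superpage_pte" is a made-up placeholder
for the actual PTE update those later patches introduce, and the mapping of
order to level simply assumes the usual 9 address bits per level.

/*
 * Illustrative sketch only: level numbering matches the existing AMD code
 * (level 1 = table of 4k leaf PTEs); set_superpage_pte() is hypothetical.
 */
static int map_at_order(struct domain *d, unsigned long dfn, mfn_t mfn,
                        unsigned int order, unsigned int flags,
                        unsigned int *flush_flags)
{
    unsigned int level = 1 + order / 9;  /* order 0 -> L1, 9 -> L2, 18 -> L3 */
    unsigned long pt_mfn = 0;

    if ( iommu_pde_from_dfn(d, dfn, level, &pt_mfn, true) || !pt_mfn )
        return -EFAULT;

    set_superpage_pte(pt_mfn, dfn, mfn, level, flags, flush_flags);

    return 0;
}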

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -178,7 +178,8 @@ void __init iommu_dte_add_device_entry(s
  * page tables.
  */
 static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
-                              unsigned long *pt_mfn, bool map)
+                              unsigned int target, unsigned long *pt_mfn,
+                              bool map)
 {
     union amd_iommu_pte *pde, *next_table_vaddr;
     unsigned long  next_table_mfn;
@@ -189,7 +190,11 @@ static int iommu_pde_from_dfn(struct dom
     table = hd->arch.amd.root_table;
     level = hd->arch.amd.paging_mode;
 
-    BUG_ON( table == NULL || level < 1 || level > 6 );
+    if ( !table || target < 1 || level < target || level > 6 )
+    {
+        ASSERT_UNREACHABLE();
+        return 1;
+    }
 
     /*
      * A frame number past what the current page tables can represent can't
@@ -200,7 +205,7 @@ static int iommu_pde_from_dfn(struct dom
 
     next_table_mfn = mfn_x(page_to_mfn(table));
 
-    while ( level > 1 )
+    while ( level > target )
     {
         unsigned int next_level = level - 1;
 
@@ -271,7 +276,7 @@ static int iommu_pde_from_dfn(struct dom
         level--;
     }
 
-    /* mfn of level 1 page table */
+    /* mfn of target level page table */
     *pt_mfn = next_table_mfn;
     return 0;
 }
@@ -307,7 +312,7 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, true) || !pt_mfn )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, true) || !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -340,7 +345,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",




* [PATCH v3 02/23] VT-d: have callers specify the target level for page table walks
From: Jan Beulich @ 2022-01-10 16:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

In order to be able to insert/remove super-pages we need to allow callers
of the walking function to specify at which level to stop the walk.

For intel_iommu_lookup_page() integrate the last level access into
the main walking function.

dma_pte_clear_one() gets only partly adjusted for now: error handling and
an order parameter are put in place, but the order parameter remains
ignored (just like the order part of intel_iommu_map_page()'s flags).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I was actually wondering whether it wouldn't make sense to integrate
dma_pte_clear_one() into its only caller intel_iommu_unmap_page(), for
better symmetry with intel_iommu_map_page().
---
v2: Fix build.
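
(Not part of the patch, merely for illustration:) with the new return
convention a caller distinguishes the cases roughly as in the sketch below;
the wrapper function itself is hypothetical.

/* Illustrative sketch only. */
static int lookup_leaf_table(struct domain *d, daddr_t addr, bool alloc,
                             unsigned int *flush_flags, uint64_t *maddr)
{
    uint64_t val = addr_to_dma_page_maddr(d, addr, 1, flush_flags, alloc);

    if ( val < PAGE_SIZE )
        /*
         * 0: a non-present entry was encountered (possible only when !alloc);
         * 1 ... PAGE_SIZE - 1: the level at which allocation failed.
         */
        return val ? -ENOMEM : -ENOENT;

    *maddr = val;  /* maddr of the page table holding the leaf PTE */

    return 0;
}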

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -347,63 +347,116 @@ static u64 bus_to_context_maddr(struct v
     return maddr;
 }
 
-static u64 addr_to_dma_page_maddr(struct domain *domain, u64 addr, int alloc)
+/*
+ * This function walks (and if requested allocates) page tables to the
+ * designated target level. It returns
+ * - 0 when a non-present entry was encountered and no allocation was
+ *   requested,
+ * - a small positive value (the level, i.e. below PAGE_SIZE) upon allocation
+ *   failure,
+ * - for target > 0 the physical address of the page table holding the leaf
+ *   PTE for the requested address,
+ * - for target == 0 the full PTE.
+ */
+static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
+                                       unsigned int target,
+                                       unsigned int *flush_flags, bool alloc)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     int addr_width = agaw_to_width(hd->arch.vtd.agaw);
     struct dma_pte *parent, *pte = NULL;
-    int level = agaw_to_level(hd->arch.vtd.agaw);
-    int offset;
+    unsigned int level = agaw_to_level(hd->arch.vtd.agaw), offset;
     u64 pte_maddr = 0;
 
     addr &= (((u64)1) << addr_width) - 1;
     ASSERT(spin_is_locked(&hd->arch.mapping_lock));
+    ASSERT(target || !alloc);
+
     if ( !hd->arch.vtd.pgd_maddr )
     {
         struct page_info *pg;
 
-        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
+        if ( !alloc )
+            goto out;
+
+        pte_maddr = level;
+        if ( !(pg = iommu_alloc_pgtable(domain)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
     }
 
-    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
-    while ( level > 1 )
+    pte_maddr = hd->arch.vtd.pgd_maddr;
+    parent = map_vtd_domain_page(pte_maddr);
+    while ( level > target )
     {
         offset = address_level_offset(addr, level);
         pte = &parent[offset];
 
         pte_maddr = dma_pte_addr(*pte);
-        if ( !pte_maddr )
+        if ( !dma_pte_present(*pte) || (level > 1 && dma_pte_superpage(*pte)) )
         {
             struct page_info *pg;
+            /*
+             * Higher level tables always set r/w, last level page table
+             * controls read/write.
+             */
+            struct dma_pte new_pte = { DMA_PTE_PROT };
 
             if ( !alloc )
-                break;
+            {
+                pte_maddr = 0;
+                if ( !dma_pte_present(*pte) )
+                    break;
+
+                /*
+                 * When the leaf entry was requested, pass back the full PTE,
+                 * with the address adjusted to account for the residual of
+                 * the walk.
+                 */
+                pte_maddr = pte->val +
+                    (addr & ((1UL << level_to_offset_bits(level)) - 1) &
+                     PAGE_MASK);
+                if ( !target )
+                    break;
+            }
 
+            pte_maddr = level - 1;
             pg = iommu_alloc_pgtable(domain);
             if ( !pg )
                 break;
 
             pte_maddr = page_to_maddr(pg);
-            dma_set_pte_addr(*pte, pte_maddr);
+            dma_set_pte_addr(new_pte, pte_maddr);
 
-            /*
-             * high level table always sets r/w, last level
-             * page table control read/write
-             */
-            dma_set_pte_readable(*pte);
-            dma_set_pte_writable(*pte);
+            if ( dma_pte_present(*pte) )
+            {
+                struct dma_pte *split = map_vtd_domain_page(pte_maddr);
+                unsigned long inc = 1UL << level_to_offset_bits(level - 1);
+
+                split[0].val = pte->val;
+                if ( inc == PAGE_SIZE )
+                    split[0].val &= ~DMA_PTE_SP;
+
+                for ( offset = 1; offset < PTE_NUM; ++offset )
+                    split[offset].val = split[offset - 1].val + inc;
+
+                iommu_sync_cache(split, PAGE_SIZE);
+                unmap_vtd_domain_page(split);
+
+                if ( flush_flags )
+                    *flush_flags |= IOMMU_FLUSHF_modified;
+            }
+
+            write_atomic(&pte->val, new_pte.val);
             iommu_sync_cache(pte, sizeof(struct dma_pte));
         }
 
-        if ( level == 2 )
+        if ( --level == target )
             break;
 
         unmap_vtd_domain_page(parent);
         parent = map_vtd_domain_page(pte_maddr);
-        level--;
     }
 
     unmap_vtd_domain_page(parent);
@@ -430,7 +483,7 @@ static uint64_t domain_pgd_maddr(struct
         if ( !hd->arch.vtd.pgd_maddr )
         {
             /* Ensure we have pagetables allocated down to leaf PTE. */
-            addr_to_dma_page_maddr(d, 0, 1);
+            addr_to_dma_page_maddr(d, 0, 1, NULL, true);
 
             if ( !hd->arch.vtd.pgd_maddr )
                 return 0;
@@ -770,8 +823,9 @@ static int __must_check iommu_flush_iotl
 }
 
 /* clear one page's page table */
-static void dma_pte_clear_one(struct domain *domain, uint64_t addr,
-                              unsigned int *flush_flags)
+static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
+                             unsigned int order,
+                             unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct dma_pte *page = NULL, *pte = NULL;
@@ -779,11 +833,11 @@ static void dma_pte_clear_one(struct dom
 
     spin_lock(&hd->arch.mapping_lock);
     /* get last level pte */
-    pg_maddr = addr_to_dma_page_maddr(domain, addr, 0);
-    if ( pg_maddr == 0 )
+    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags, false);
+    if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
-        return;
+        return pg_maddr ? -ENOMEM : 0;
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
@@ -793,7 +847,7 @@ static void dma_pte_clear_one(struct dom
     {
         spin_unlock(&hd->arch.mapping_lock);
         unmap_vtd_domain_page(page);
-        return;
+        return 0;
     }
 
     dma_clear_pte(*pte);
@@ -803,6 +857,8 @@ static void dma_pte_clear_one(struct dom
     iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
+
+    return 0;
 }
 
 static int iommu_set_root_entry(struct vtd_iommu *iommu)
@@ -1914,8 +1970,9 @@ static int __must_check intel_iommu_map_
         return 0;
     }
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1);
-    if ( !pg_maddr )
+    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1, flush_flags,
+                                      true);
+    if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
         return -ENOMEM;
@@ -1965,17 +2022,14 @@ static int __must_check intel_iommu_unma
     if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
         return 0;
 
-    dma_pte_clear_one(d, dfn_to_daddr(dfn), flush_flags);
-
-    return 0;
+    return dma_pte_clear_one(d, dfn_to_daddr(dfn), 0, flush_flags);
 }
 
 static int intel_iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn,
                                    unsigned int *flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct dma_pte *page, val;
-    u64 pg_maddr;
+    uint64_t val;
 
     /*
      * If VT-d shares EPT page table or if the domain is the hardware
@@ -1987,25 +2041,16 @@ static int intel_iommu_lookup_page(struc
 
     spin_lock(&hd->arch.mapping_lock);
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0);
-    if ( !pg_maddr )
-    {
-        spin_unlock(&hd->arch.mapping_lock);
-        return -ENOENT;
-    }
-
-    page = map_vtd_domain_page(pg_maddr);
-    val = page[dfn_x(dfn) & LEVEL_MASK];
+    val = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0, NULL, false);
 
-    unmap_vtd_domain_page(page);
     spin_unlock(&hd->arch.mapping_lock);
 
-    if ( !dma_pte_present(val) )
+    if ( val < PAGE_SIZE )
         return -ENOENT;
 
-    *mfn = maddr_to_mfn(dma_pte_addr(val));
-    *flags = dma_pte_read(val) ? IOMMUF_readable : 0;
-    *flags |= dma_pte_write(val) ? IOMMUF_writable : 0;
+    *mfn = maddr_to_mfn(val);
+    *flags = val & DMA_PTE_READ ? IOMMUF_readable : 0;
+    *flags |= val & DMA_PTE_WRITE ? IOMMUF_writable : 0;
 
     return 0;
 }




* [PATCH v3 03/23] VT-d: limit page table population in domain_pgd_maddr()
From: Jan Beulich @ 2022-01-10 16:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

I have to admit that I never understood why domain_pgd_maddr() wants to
populate all page table levels for DFN 0. I can only assume that, despite
the comment there, what is actually needed is population just down to the
smallest nr_pt_levels that the loop later in the function may need to run
to. Hence the minimum of all IOMMUs' nr_pt_levels is what needs to be
passed into addr_to_dma_page_maddr(), instead of the literal 1.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -55,6 +55,7 @@ bool __read_mostly iommu_snoop = true;
 #endif
 
 static unsigned int __read_mostly nr_iommus;
+static unsigned int __read_mostly min_pt_levels = UINT_MAX;
 
 static struct iommu_ops vtd_ops;
 static struct tasklet vtd_fault_tasklet;
@@ -482,8 +483,11 @@ static uint64_t domain_pgd_maddr(struct
     {
         if ( !hd->arch.vtd.pgd_maddr )
         {
-            /* Ensure we have pagetables allocated down to leaf PTE. */
-            addr_to_dma_page_maddr(d, 0, 1, NULL, true);
+            /*
+             * Ensure we have pagetables allocated down to the smallest
+             * level the loop below may need to run to.
+             */
+            addr_to_dma_page_maddr(d, 0, min_pt_levels, NULL, true);
 
             if ( !hd->arch.vtd.pgd_maddr )
                 return 0;
@@ -1381,6 +1385,8 @@ int __init iommu_alloc(struct acpi_drhd_
         return -ENODEV;
     }
     iommu->nr_pt_levels = agaw_to_level(agaw);
+    if ( min_pt_levels > iommu->nr_pt_levels )
+        min_pt_levels = iommu->nr_pt_levels;
 
     if ( !ecap_coherent(iommu->ecap) )
         vtd_ops.sync_cache = sync_cache;




* [PATCH v3 04/23] IOMMU: have vendor code announce supported page sizes
From: Jan Beulich @ 2022-01-10 16:25 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné,
	Julien Grall, Rahul Singh, Kevin Tian, Bertrand Marquis,
	Volodymyr Babchuk

Generic code will use this information to determine what order values can
legitimately be passed to the ->{,un}map_page() hooks. For now all ops
structures simply announce 4k mappings (as the base page size), and there
is (and always has been) an assumption that this matches the CPU MMU's base
page size (eventually we will want to permit IOMMUs with a base page size
smaller than the CPU MMU's).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
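
(Not part of the patch, merely for illustration:) since every bit in
->page_sizes stands for a size of PAGE_SIZE << order, generic code can
check whether a given order is legitimate with a single bit test; the
helper below is a hypothetical sketch of that. A vendor implementation
gaining 2M and 1G support would then announce PAGE_SIZE_4K |
(PAGE_SIZE_4K << 9) | (PAGE_SIZE_4K << 18), which later patches in this
series do (possibly spelled via dedicated constants).

/* Illustrative sketch only. */
static bool order_is_supported(const struct iommu_ops *ops, unsigned int order)
{
    return ops->page_sizes & (PAGE_SIZE << order);
}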

--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -629,6 +629,7 @@ static void amd_dump_page_tables(struct
 }
 
 static const struct iommu_ops __initconstrel _iommu_ops = {
+    .page_sizes = PAGE_SIZE_4K,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
     .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -1298,6 +1298,7 @@ static void ipmmu_iommu_domain_teardown(
 
 static const struct iommu_ops ipmmu_iommu_ops =
 {
+    .page_sizes      = PAGE_SIZE_4K,
     .init            = ipmmu_iommu_domain_init,
     .hwdom_init      = ipmmu_iommu_hwdom_init,
     .teardown        = ipmmu_iommu_domain_teardown,
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2873,6 +2873,7 @@ static void arm_smmu_iommu_domain_teardo
 }
 
 static const struct iommu_ops arm_smmu_iommu_ops = {
+    .page_sizes = PAGE_SIZE_4K,
     .init = arm_smmu_iommu_domain_init,
     .hwdom_init = arm_smmu_iommu_hwdom_init,
     .add_device = arm_smmu_dt_add_device_generic,
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -3426,7 +3426,8 @@ static void arm_smmu_iommu_xen_domain_te
 }
 
 static const struct iommu_ops arm_smmu_iommu_ops = {
-	.init		= arm_smmu_iommu_xen_domain_init,
+	.page_sizes		= PAGE_SIZE_4K,
+	.init			= arm_smmu_iommu_xen_domain_init,
 	.hwdom_init		= arm_smmu_iommu_hwdom_init,
 	.teardown		= arm_smmu_iommu_xen_domain_teardown,
 	.iotlb_flush		= arm_smmu_iotlb_flush,
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -470,7 +470,17 @@ int __init iommu_setup(void)
 
     if ( iommu_enable )
     {
+        const struct iommu_ops *ops = NULL;
+
         rc = iommu_hardware_setup();
+        if ( !rc )
+            ops = iommu_get_ops();
+        if ( ops && (ops->page_sizes & -ops->page_sizes) != PAGE_SIZE )
+        {
+            printk(XENLOG_ERR "IOMMU: page size mask %lx unsupported\n",
+                   ops->page_sizes);
+            rc = ops->page_sizes ? -EPERM : -ENODATA;
+        }
         iommu_enabled = (rc == 0);
     }
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2891,6 +2891,7 @@ static int __init intel_iommu_quarantine
 }
 
 static struct iommu_ops __initdata vtd_ops = {
+    .page_sizes = PAGE_SIZE_4K,
     .init = intel_iommu_domain_init,
     .hwdom_init = intel_iommu_hwdom_init,
     .quarantine_init = intel_iommu_quarantine_init,
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -231,6 +231,7 @@ struct page_info;
 typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
 
 struct iommu_ops {
+    unsigned long page_sizes;
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
     int (*quarantine_init)(struct domain *d);




* [PATCH v3 05/23] IOMMU: simplify unmap-on-error in iommu_map()
From: Jan Beulich @ 2022-01-10 16:25 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

As of 68a8aa5d7264 ("iommu: make map and unmap take a page count,
similar to flush") there's no need anymore to have a loop here.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -285,11 +285,9 @@ int iommu_map(struct domain *d, dfn_t df
                    d->domain_id, dfn_x(dfn_add(dfn, i)),
                    mfn_x(mfn_add(mfn, i)), rc);
 
-        while ( i-- )
-            /* if statement to satisfy __must_check */
-            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                            flush_flags) )
-                continue;
+        /* while statement to satisfy __must_check */
+        while ( iommu_unmap(d, dfn, i, flush_flags) )
+            break;
 
         if ( !is_hardware_domain(d) )
             domain_crash(d);




* [PATCH v3 06/23] IOMMU: add order parameter to ->{,un}map_page() hooks
From: Jan Beulich @ 2022-01-10 16:27 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné,
	Julien Grall, Stefano Stabellini, Volodymyr Babchuk,
	Bertrand Marquis, Kevin Tian

Or really, in the case of ->map_page(), accommodate it in the existing
"flags" parameter. All call sites will pass 0 for now.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
[Arm]
Acked-by: Julien Grall <jgrall@amazon.com>
---
v3: Re-base over new earlier patch.
v2: Re-base over change earlier in the series.
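
(Not part of the patch, merely for illustration:) under the new flags
layout the order occupies the low 6 bits and the permission bits move up to
bits 6/7, so encoding and decoding look like the sketch below.

/* Illustrative sketch only, mirroring the new IOMMUF_* layout. */
static void flags_example(void)
{
    unsigned int flags = IOMMUF_order(9) | IOMMUF_readable | IOMMUF_writable;

    ASSERT(IOMMUF_order(flags) == 9);  /* the low 6 bits carry the order */
    ASSERT(flags & IOMMUF_readable);   /* permission bits now at bits 6/7 */
    ASSERT(flags & IOMMUF_writable);
}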

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -243,6 +243,7 @@ int __must_check amd_iommu_map_page(stru
                                     mfn_t mfn, unsigned int flags,
                                     unsigned int *flush_flags);
 int __must_check amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain *d);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -331,7 +331,7 @@ int amd_iommu_map_page(struct domain *d,
     return 0;
 }
 
-int amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
+int amd_iommu_unmap_page(struct domain *d, dfn_t dfn, unsigned int order,
                          unsigned int *flush_flags)
 {
     unsigned long pt_mfn = 0;
--- a/xen/drivers/passthrough/arm/iommu_helpers.c
+++ b/xen/drivers/passthrough/arm/iommu_helpers.c
@@ -57,11 +57,13 @@ int __must_check arm_iommu_map_page(stru
      * The function guest_physmap_add_entry replaces the current mapping
      * if there is already one...
      */
-    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0, t);
+    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+                                   IOMMUF_order(flags), t);
 }
 
 /* Should only be used if P2M Table is shared between the CPU and the IOMMU. */
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags)
 {
     /*
@@ -71,7 +73,8 @@ int __must_check arm_iommu_unmap_page(st
     if ( !is_domain_direct_mapped(d) )
         return -EINVAL;
 
-    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0);
+    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+                                     order);
 }
 
 /*
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -271,6 +271,8 @@ int iommu_map(struct domain *d, dfn_t df
     if ( !is_iommu_enabled(d) )
         return 0;
 
+    ASSERT(!IOMMUF_order(flags));
+
     for ( i = 0; i < page_count; i++ )
     {
         rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
@@ -331,7 +333,7 @@ int iommu_unmap(struct domain *d, dfn_t
     for ( i = 0; i < page_count; i++ )
     {
         int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                             flush_flags);
+                             0, flush_flags);
 
         if ( likely(!err) )
             continue;
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2018,6 +2018,7 @@ static int __must_check intel_iommu_map_
 }
 
 static int __must_check intel_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                               unsigned int order,
                                                unsigned int *flush_flags)
 {
     /* Do nothing if VT-d shares EPT page table */
@@ -2028,7 +2029,7 @@ static int __must_check intel_iommu_unma
     if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
         return 0;
 
-    return dma_pte_clear_one(d, dfn_to_daddr(dfn), 0, flush_flags);
+    return dma_pte_clear_one(d, dfn_to_daddr(dfn), order, flush_flags);
 }
 
 static int intel_iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn,
--- a/xen/arch/arm/include/asm/iommu.h
+++ b/xen/arch/arm/include/asm/iommu.h
@@ -31,6 +31,7 @@ int __must_check arm_iommu_map_page(stru
                                     unsigned int flags,
                                     unsigned int *flush_flags);
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags);
 
 #endif /* __ARCH_ARM_IOMMU_H__ */
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -127,9 +127,10 @@ void arch_iommu_hwdom_init(struct domain
  * The following flags are passed to map operations and passed by lookup
  * operations.
  */
-#define _IOMMUF_readable 0
+#define IOMMUF_order(n)  ((n) & 0x3f)
+#define _IOMMUF_readable 6
 #define IOMMUF_readable  (1u<<_IOMMUF_readable)
-#define _IOMMUF_writable 1
+#define _IOMMUF_writable 7
 #define IOMMUF_writable  (1u<<_IOMMUF_writable)
 
 /*
@@ -255,6 +256,7 @@ struct iommu_ops {
                                  unsigned int flags,
                                  unsigned int *flush_flags);
     int __must_check (*unmap_page)(struct domain *d, dfn_t dfn,
+                                   unsigned int order,
                                    unsigned int *flush_flags);
     int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
                                     unsigned int *flags);




* [PATCH v3 07/23] IOMMU: have iommu_{,un}map() split requests into largest possible chunks
From: Jan Beulich @ 2022-01-10 16:27 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

Introduce a helper function to determine the largest possible mapping
that allows covering a request (or the next part of it that is left to
be processed).

To avoid adding yet more recurring dfn_add() / mfn_add() invocations in
the two callers of the new helper, also introduce local variables holding
the values presently being operated on.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Re-base over new earlier patch.
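
(Not part of the patch, merely a worked illustration of what the new
mapping_order() helper computes:)

/*
 * Assuming page_sizes = 4k | 2M | 1G, i.e. bits set at PAGE_SHIFT,
 * PAGE_SHIFT + 9 and PAGE_SHIFT + 18, mapping_order() yields e.g.:
 *   dfn = mfn = 0x200, nr = 0x400  ->  order 9 (aligned, enough pages left)
 *   dfn = mfn = 0x200, nr = 0x100  ->  order 0 (fewer pages than a 2M chunk)
 *   dfn = 0x200, mfn = 0x201       ->  order 0 (dfn/mfn alignment differs)
 */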

--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -260,12 +260,38 @@ void iommu_domain_destroy(struct domain
     arch_iommu_domain_destroy(d);
 }
 
-int iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
+static unsigned int mapping_order(const struct domain_iommu *hd,
+                                  dfn_t dfn, mfn_t mfn, unsigned long nr)
+{
+    unsigned long res = dfn_x(dfn) | mfn_x(mfn);
+    unsigned long sizes = hd->platform_ops->page_sizes;
+    unsigned int bit = find_first_set_bit(sizes), order = 0;
+
+    ASSERT(bit == PAGE_SHIFT);
+
+    while ( (sizes = (sizes >> bit) & ~1) )
+    {
+        unsigned long mask;
+
+        bit = find_first_set_bit(sizes);
+        mask = (1UL << bit) - 1;
+        if ( nr <= mask || (res & mask) )
+            break;
+        order += bit;
+        nr >>= bit;
+        res >>= bit;
+    }
+
+    return order;
+}
+
+int iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
               unsigned long page_count, unsigned int flags,
               unsigned int *flush_flags)
 {
     const struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
+    unsigned int order;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
@@ -273,10 +299,15 @@ int iommu_map(struct domain *d, dfn_t df
 
     ASSERT(!IOMMUF_order(flags));
 
-    for ( i = 0; i < page_count; i++ )
+    for ( i = 0; i < page_count; i += 1UL << order )
     {
-        rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
-                        mfn_add(mfn, i), flags, flush_flags);
+        dfn_t dfn = dfn_add(dfn0, i);
+        mfn_t mfn = mfn_add(mfn0, i);
+
+        order = mapping_order(hd, dfn, mfn, page_count - i);
+
+        rc = iommu_call(hd->platform_ops, map_page, d, dfn, mfn,
+                        flags | IOMMUF_order(order), flush_flags);
 
         if ( likely(!rc) )
             continue;
@@ -284,11 +315,10 @@ int iommu_map(struct domain *d, dfn_t df
         if ( !d->is_shutting_down && printk_ratelimit() )
             printk(XENLOG_ERR
                    "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" failed: %d\n",
-                   d->domain_id, dfn_x(dfn_add(dfn, i)),
-                   mfn_x(mfn_add(mfn, i)), rc);
+                   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
 
         /* while statement to satisfy __must_check */
-        while ( iommu_unmap(d, dfn, i, flush_flags) )
+        while ( iommu_unmap(d, dfn0, i, flush_flags) )
             break;
 
         if ( !is_hardware_domain(d) )
@@ -320,20 +350,25 @@ int iommu_legacy_map(struct domain *d, d
     return rc;
 }
 
-int iommu_unmap(struct domain *d, dfn_t dfn, unsigned long page_count,
+int iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
                 unsigned int *flush_flags)
 {
     const struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
+    unsigned int order;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
         return 0;
 
-    for ( i = 0; i < page_count; i++ )
+    for ( i = 0; i < page_count; i += 1UL << order )
     {
-        int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                             0, flush_flags);
+        dfn_t dfn = dfn_add(dfn0, i);
+        int err;
+
+        order = mapping_order(hd, dfn, _mfn(0), page_count - i);
+        err = iommu_call(hd->platform_ops, unmap_page, d, dfn,
+                         order, flush_flags);
 
         if ( likely(!err) )
             continue;
@@ -341,7 +376,7 @@ int iommu_unmap(struct domain *d, dfn_t
         if ( !d->is_shutting_down && printk_ratelimit() )
             printk(XENLOG_ERR
                    "d%d: IOMMU unmapping dfn %"PRI_dfn" failed: %d\n",
-                   d->domain_id, dfn_x(dfn_add(dfn, i)), err);
+                   d->domain_id, dfn_x(dfn), err);
 
         if ( !rc )
             rc = err;




* [PATCH v3 08/23] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
From: Jan Beulich @ 2022-01-10 16:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

While this is already the case for PVH, there's no reason to treat PV
differently here, though of course the addresses are taken from another
source in that case. The one difference is that, to match the CPU side
mappings, by default we permit r/o ones. This then also means we now deal
consistently with IO-APICs whose MMIO is or is not covered by E820 reserved
regions.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
[integrated] v1: Integrate into series.
[standalone] v2: Keep IOMMU mappings in sync with CPU ones.

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -274,12 +274,12 @@ void iommu_identity_map_teardown(struct
     }
 }
 
-static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
-                                         unsigned long pfn,
-                                         unsigned long max_pfn)
+static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
+                                                 unsigned long pfn,
+                                                 unsigned long max_pfn)
 {
     mfn_t mfn = _mfn(pfn);
-    unsigned int i, type;
+    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
 
     /*
      * Set up 1:1 mapping for dom0. Default to include only conventional RAM
@@ -288,44 +288,60 @@ static bool __hwdom_init hwdom_iommu_map
      * that fall in unusable ranges for PV Dom0.
      */
     if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-        return false;
+        return 0;
 
     switch ( type = page_get_ram_type(mfn) )
     {
     case RAM_TYPE_UNUSABLE:
-        return false;
+        return 0;
 
     case RAM_TYPE_CONVENTIONAL:
         if ( iommu_hwdom_strict )
-            return false;
+            return 0;
         break;
 
     default:
         if ( type & RAM_TYPE_RESERVED )
         {
             if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-                return false;
+                perms = 0;
         }
-        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
-            return false;
+        else if ( is_hvm_domain(d) )
+            return 0;
+        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
+            perms = 0;
     }
 
     /* Check that it doesn't overlap with the Interrupt Address Range. */
     if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-        return false;
+        return 0;
     /* ... or the IO-APIC */
-    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
-        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-            return false;
+    if ( has_vioapic(d) )
+    {
+        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
+            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
+                return 0;
+    }
+    else if ( is_pv_domain(d) )
+    {
+        /*
+         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
+         * ones there, so it should also have such established for IOMMUs.
+         */
+        for ( i = 0; i < nr_ioapics; i++ )
+            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
+                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
+                       ? IOMMUF_readable : 0;
+    }
     /*
      * ... or the PCIe MCFG regions.
      * TODO: runtime added MMCFG regions are not checked to make sure they
      * don't overlap with already mapped regions, thus preventing trapping.
      */
     if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
-        return false;
+        return 0;
 
-    return true;
+    return perms;
 }
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
@@ -367,15 +383,19 @@ void __hwdom_init arch_iommu_hwdom_init(
     for ( ; i < top; i++ )
     {
         unsigned long pfn = pdx_to_pfn(i);
+        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
         int rc;
 
-        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
+        if ( !perms )
             rc = 0;
         else if ( paging_mode_translate(d) )
-            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
+            rc = set_identity_p2m_entry(d, pfn,
+                                        perms & IOMMUF_writable ? p2m_access_rw
+                                                                : p2m_access_r,
+                                        0);
         else
             rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+                           perms, &flush_flags);
 
         if ( rc )
             printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",




* [PATCH v3 09/23] IOMMU/x86: perform PV Dom0 mappings in batches
From: Jan Beulich @ 2022-01-10 16:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

For large page mappings to be easily usable (i.e. in particular without
un-shattering of smaller page mappings) and for mapping operations to
then also be more efficient, pass batches of Dom0 memory to iommu_map().
In dom0_construct_pv() and its helpers (covering strict mode) this
additionally requires establishing the type of those pages (albeit with
zero type references).

The earlier establishing of PGT_writable_page | PGT_validated requires
the existing places where this gets done (through get_page_and_type())
to be updated: For pages which actually have a mapping, the type
refcount needs to be 1.

There is actually a related bug that gets fixed here as a side effect:
Typically the last L1 table would get marked as such only after
get_page_and_type(..., PGT_writable_page). While this is fine as far as
refcounting goes, the page did remain mapped in the IOMMU in this case
(when "iommu=dom0-strict").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Subsequently set_identity_p2m_entry() may want to also gain an order
parameter, for arch_iommu_hwdom_init() to use. While this only affects
non-RAM regions, systems typically have 2-16Mb of reserved space
immediately below 4Gb, which hence could be mapped more efficiently.

Installing zero-ref writable types has in fact shown (observed while
putting together the change) that, despite the intention of the XSA-288
changes (which affect DomU-s only), for Dom0 a number of sufficiently
ordinary pages (at the very least initrd and P2M ones, as well as pages
that are part of the initial allocation but not of the initial mapping)
still start out as PGT_none, meaning that they would gain IOMMU mappings
only the first time they get mapped writably. Consequently an open question
is whether iommu_memory_setup() should set the pages to PGT_writable_page
independent of need_iommu_pt_sync().

I didn't think I need to address the bug mentioned in the description in
a separate (prereq) patch, but if others disagree I could certainly
break out that part (needing to first use iommu_legacy_unmap() then).

Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
They'll end up mapped via the later get_page_and_type().

As to the way these refs get installed: I've chosen to avoid the more
expensive {get,put}_page_and_type(), favoring to put in place the
intended type directly. I guess I could be convinced to avoid this
bypassing of the actual logic; I merely think it's unnecessarily
expensive.

Note also that strictly speaking the iommu_iotlb_flush_all() here (as
well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
needed: Actual hooking up (AMD) or enabling of translation (VT-d)
occurs only afterwards anyway, so nothing can have made it into TLBs
just yet.
---
v3: Fold iommu_map() into (the now renamed) iommu_memory_setup(). Move
    iommu_unmap() into mark_pv_pt_pages_rdonly(). Adjust (split) log
    message in arch_iommu_hwdom_init().
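
(Not part of the patch, merely for illustration:) the batching added to
arch_iommu_hwdom_init() amounts to the run-length pattern sketched below,
with names simplified and the pdx/translate handling omitted; the real code
in the hunk further down differs in detail.

/* Illustrative sketch only: map runs of PFNs with identical permissions. */
static void __hwdom_init map_in_runs(struct domain *d, unsigned long first,
                                     unsigned long top, unsigned long max_pfn,
                                     unsigned int *flush_flags)
{
    unsigned long pfn, start = first, count = 0;
    unsigned int start_perms = 0;

    for ( pfn = first; pfn <= top; ++pfn )
    {
        unsigned int perms = pfn < top ? hwdom_iommu_map(d, pfn, max_pfn) : 0;

        if ( pfn < top && perms == start_perms && pfn == start + count )
        {
            ++count;  /* extend the current run */
            continue;
        }

        if ( count && start_perms &&  /* commit the previous run */
             iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
                       flush_flags) )
            printk(XENLOG_WARNING "%pd: IOMMU mapping of [%lx,%lx) failed\n",
                   d, start, start + count);

        start = pfn;
        start_perms = perms;
        count = 1;
    }
}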

--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -46,7 +46,8 @@ void __init dom0_update_physmap(bool com
 static __init void mark_pv_pt_pages_rdonly(struct domain *d,
                                            l4_pgentry_t *l4start,
                                            unsigned long vpt_start,
-                                           unsigned long nr_pt_pages)
+                                           unsigned long nr_pt_pages,
+                                           unsigned int *flush_flags)
 {
     unsigned long count;
     struct page_info *page;
@@ -71,6 +72,14 @@ static __init void mark_pv_pt_pages_rdon
         ASSERT((page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);
         ASSERT(!(page->u.inuse.type_info & ~(PGT_type_mask | PGT_pae_xen_l2)));
 
+        /*
+         * Page table pages need to be removed from the IOMMU again in case
+         * iommu_memory_setup() ended up mapping them.
+         */
+        if ( need_iommu_pt_sync(d) &&
+             iommu_unmap(d, _dfn(mfn_x(page_to_mfn(page))), 1, flush_flags) )
+            BUG();
+
         /* Read-only mapping + PGC_allocated + page-table page. */
         page->count_info         = PGC_allocated | 3;
         page->u.inuse.type_info |= PGT_validated | 1;
@@ -107,11 +116,43 @@ static __init void mark_pv_pt_pages_rdon
     unmap_domain_page(pl3e);
 }
 
+static void __init iommu_memory_setup(struct domain *d, const char *what,
+                                      struct page_info *page, unsigned long nr,
+                                      unsigned int *flush_flags)
+{
+    int rc;
+    mfn_t mfn = page_to_mfn(page);
+
+    if ( !need_iommu_pt_sync(d) )
+        return;
+
+    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, nr,
+                   IOMMUF_readable | IOMMUF_writable, flush_flags);
+    if ( rc )
+    {
+        printk(XENLOG_ERR "pre-mapping %s MFN [%lx,%lx) into IOMMU failed: %d\n",
+               what, mfn_x(mfn), mfn_x(mfn) + nr, rc);
+        return;
+    }
+
+    /*
+     * For successfully established IOMMU mappings the type of the page(s)
+     * needs to match (for _get_page_type() to unmap upon type change). Set
+     * the page(s) to writable with no type ref.
+     */
+    for ( ; nr--; ++page )
+    {
+        ASSERT(!page->u.inuse.type_info);
+        page->u.inuse.type_info = PGT_writable_page | PGT_validated;
+    }
+}
+
 static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
                                     unsigned long v_start, unsigned long v_end,
                                     unsigned long vphysmap_start,
                                     unsigned long vphysmap_end,
-                                    unsigned long nr_pages)
+                                    unsigned long nr_pages,
+                                    unsigned int *flush_flags)
 {
     struct page_info *page = NULL;
     l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
@@ -177,6 +218,10 @@ static __init void setup_pv_physmap(stru
                                              L3_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
+                iommu_memory_setup(d, "P2M 1G", page,
+                                   SUPERPAGE_PAGES * SUPERPAGE_PAGES,
+                                   flush_flags);
+
                 *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
                 continue;
@@ -203,6 +248,9 @@ static __init void setup_pv_physmap(stru
                                              L2_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
+                iommu_memory_setup(d, "P2M 2M", page, SUPERPAGE_PAGES,
+                                   flush_flags);
+
                 *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
                 continue;
@@ -311,6 +359,7 @@ int __init dom0_construct_pv(struct doma
     unsigned long initrd_pfn = -1, initrd_mfn = 0;
     unsigned long count;
     struct page_info *page = NULL;
+    unsigned int flush_flags = 0;
     start_info_t *si;
     struct vcpu *v = d->vcpu[0];
     void *image_base = bootstrap_map(image);
@@ -573,6 +622,9 @@ int __init dom0_construct_pv(struct doma
                     BUG();
         }
         initrd->mod_end = 0;
+
+        iommu_memory_setup(d, "initrd", mfn_to_page(_mfn(initrd_mfn)),
+                           PFN_UP(initrd_len), &flush_flags);
     }
 
     printk("PHYSICAL MEMORY ARRANGEMENT:\n"
@@ -606,6 +658,13 @@ int __init dom0_construct_pv(struct doma
 
     process_pending_softirqs();
 
+    /*
+     * Map the full range here and then punch holes for page tables
+     * alongside marking them as such in mark_pv_pt_pages_rdonly().
+     */
+    iommu_memory_setup(d, "init-alloc", mfn_to_page(_mfn(alloc_spfn)),
+                       alloc_epfn - alloc_spfn, &flush_flags);
+
     mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
     if ( vinitrd_start )
         mpt_alloc -= PAGE_ALIGN(initrd_len);
@@ -690,7 +749,8 @@ int __init dom0_construct_pv(struct doma
         l1tab++;
 
         page = mfn_to_page(_mfn(mfn));
-        if ( !page->u.inuse.type_info &&
+        if ( (!page->u.inuse.type_info ||
+              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
              !get_page_and_type(page, d, PGT_writable_page) )
             BUG();
     }
@@ -719,7 +779,7 @@ int __init dom0_construct_pv(struct doma
     }
 
     /* Pages that are part of page tables must be read only. */
-    mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
+    mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages, &flush_flags);
 
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
@@ -794,7 +854,7 @@ int __init dom0_construct_pv(struct doma
     {
         pfn = pagetable_get_pfn(v->arch.guest_table);
         setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
-                         nr_pages);
+                         nr_pages, &flush_flags);
     }
 
     /* Write the phys->machine and machine->phys table entries. */
@@ -825,7 +885,9 @@ int __init dom0_construct_pv(struct doma
         if ( get_gpfn_from_mfn(mfn) >= count )
         {
             BUG_ON(compat);
-            if ( !page->u.inuse.type_info &&
+            if ( (!page->u.inuse.type_info ||
+                  page->u.inuse.type_info == (PGT_writable_page |
+                                              PGT_validated)) &&
                  !get_page_and_type(page, d, PGT_writable_page) )
                 BUG();
 
@@ -841,8 +903,12 @@ int __init dom0_construct_pv(struct doma
 #endif
     while ( pfn < nr_pages )
     {
-        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
+        count = domain_tot_pages(d);
+        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
             panic("Not enough RAM for DOM0 reservation\n");
+
+        iommu_memory_setup(d, "chunk", page, domain_tot_pages(d) - count,
+                           &flush_flags);
         while ( pfn < domain_tot_pages(d) )
         {
             mfn = mfn_x(page_to_mfn(page));
@@ -857,6 +923,10 @@ int __init dom0_construct_pv(struct doma
         }
     }
 
+    /* Use while() to avoid compiler warning. */
+    while ( iommu_iotlb_flush_all(d, flush_flags) )
+        break;
+
     if ( initrd_len != 0 )
     {
         si->mod_start = vinitrd_start ?: initrd_pfn;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -346,8 +346,8 @@ static unsigned int __hwdom_init hwdom_i
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 {
-    unsigned long i, top, max_pfn;
-    unsigned int flush_flags = 0;
+    unsigned long i, top, max_pfn, start, count;
+    unsigned int flush_flags = 0, start_perms = 0;
 
     BUG_ON(!is_hardware_domain(d));
 
@@ -378,9 +378,9 @@ void __hwdom_init arch_iommu_hwdom_init(
      * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
      * setting up potentially conflicting mappings here.
      */
-    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
+    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
 
-    for ( ; i < top; i++ )
+    for ( i = start, count = 0; i < top; )
     {
         unsigned long pfn = pdx_to_pfn(i);
         unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
@@ -389,20 +389,41 @@ void __hwdom_init arch_iommu_hwdom_init(
         if ( !perms )
             rc = 0;
         else if ( paging_mode_translate(d) )
+        {
             rc = set_identity_p2m_entry(d, pfn,
                                         perms & IOMMUF_writable ? p2m_access_rw
                                                                 : p2m_access_r,
                                         0);
+            if ( rc )
+                printk(XENLOG_WARNING
+                       "%pd: identity mapping of %lx failed: %d\n",
+                       d, pfn, rc);
+        }
+        else if ( pfn != start + count || perms != start_perms )
+        {
+        commit:
+            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
+                           &flush_flags);
+            if ( rc )
+                printk(XENLOG_WARNING
+                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
+                       d, pfn, pfn + count, rc);
+            SWAP(start, pfn);
+            start_perms = perms;
+            count = 1;
+        }
         else
-            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           perms, &flush_flags);
+        {
+            ++count;
+            rc = 0;
+        }
 
-        if ( rc )
-            printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
-                   d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
 
-        if (!(i & 0xfffff))
+        if ( !(++i & 0xfffff) )
             process_pending_softirqs();
+
+        if ( i == top && count )
+            goto commit;
     }
 
     /* Use if to avoid compiler warning */




* [PATCH v3 10/23] IOMMU/x86: support freeing of pagetables
From: Jan Beulich @ 2022-01-10 16:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

For vendor-specific code to support superpages we need to be able to
deal with a superpage mapping replacing an intermediate page table (or
hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
needed to free individual page tables while a domain is still alive.
Since the freeing needs to be deferred until after a suitable IOTLB
flush has been performed, released page tables get queued for processing
by a tasklet.
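
To illustrate the intended use (a minimal sketch only; "pg" stands for
the struct page_info of the intermediate table which just got replaced,
and error handling is omitted), a vendor (un)mapping path would do
roughly:

    /* After writing the replacing PTE, with the IOTLB flush still to be
     * issued by the caller via the accumulated flush flags: */
    *flush_flags |= IOMMU_FLUSHF_modified;
    iommu_queue_free_pgtable(d, pg);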

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I was considering whether to use a softirq-tasklet instead. This would
have the benefit of avoiding extra scheduling operations, but come with
the risk of the freeing happening prematurely because of a
process_pending_softirqs() somewhere.
---
v3: Call process_pending_softirqs() from free_queued_pgtables().

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -12,6 +12,7 @@
  * this program; If not, see <http://www.gnu.org/licenses/>.
  */
 
+#include <xen/cpu.h>
 #include <xen/sched.h>
 #include <xen/iommu.h>
 #include <xen/paging.h>
@@ -491,6 +492,92 @@ struct page_info *iommu_alloc_pgtable(st
     return pg;
 }
 
+/*
+ * Intermediate page tables which get replaced by large pages may only be
+ * freed after a suitable IOTLB flush. Hence such pages get queued on a
+ * per-CPU list, with a per-CPU tasklet processing the list on the assumption
+ * that the necessary IOTLB flush will have occurred by the time tasklets get
+ * to run. (List and tasklet being per-CPU has the benefit of accesses not
+ * requiring any locking.)
+ */
+static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
+static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
+
+static void free_queued_pgtables(void *arg)
+{
+    struct page_list_head *list = arg;
+    struct page_info *pg;
+    unsigned int done = 0;
+
+    while ( (pg = page_list_remove_head(list)) )
+    {
+        free_domheap_page(pg);
+
+        /* Granularity of checking somewhat arbitrary. */
+        if ( !(++done & 0x1ff) )
+             process_pending_softirqs();
+    }
+}
+
+void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    unsigned int cpu = smp_processor_id();
+
+    spin_lock(&hd->arch.pgtables.lock);
+    page_list_del(pg, &hd->arch.pgtables.list);
+    spin_unlock(&hd->arch.pgtables.lock);
+
+    page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu));
+
+    tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu));
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+    struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
+    struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);
+
+    switch ( action )
+    {
+    case CPU_DOWN_PREPARE:
+        tasklet_kill(tasklet);
+        break;
+
+    case CPU_DEAD:
+        page_list_splice(list, &this_cpu(free_pgt_list));
+        INIT_PAGE_LIST_HEAD(list);
+        tasklet_schedule(&this_cpu(free_pgt_tasklet));
+        break;
+
+    case CPU_UP_PREPARE:
+    case CPU_DOWN_FAILED:
+        tasklet_init(tasklet, free_queued_pgtables, list);
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init bsp_init(void)
+{
+    if ( iommu_enabled )
+    {
+        cpu_callback(&cpu_nfb, CPU_UP_PREPARE,
+                     (void *)(unsigned long)smp_processor_id());
+        register_cpu_notifier(&cpu_nfb);
+    }
+
+    return 0;
+}
+presmp_initcall(bsp_init);
+
 bool arch_iommu_use_permitted(const struct domain *d)
 {
     /*
--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -143,6 +143,7 @@ int pi_update_irte(const struct pi_desc
 
 int __must_check iommu_free_pgtables(struct domain *d);
 struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*




* [PATCH v3 11/23] AMD/IOMMU: drop stray TLB flush
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (9 preceding siblings ...)
  2022-01-10 16:29 ` [PATCH v3 10/23] IOMMU/x86: support freeing of pagetables Jan Beulich
@ 2022-01-10 16:29 ` Jan Beulich
  2022-01-10 16:30 ` [PATCH v3 12/23] AMD/IOMMU: walk trees upon page fault Jan Beulich
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

I think this flush was overlooked when flushing was moved out of the
core (un)mapping functions. The flush the caller is required to invoke
anyway will satisfy the needs resulting from the splitting of a
superpage.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -179,7 +179,7 @@ void __init iommu_dte_add_device_entry(s
  */
 static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
                               unsigned int target, unsigned long *pt_mfn,
-                              bool map)
+                              unsigned int *flush_flags, bool map)
 {
     union amd_iommu_pte *pde, *next_table_vaddr;
     unsigned long  next_table_mfn;
@@ -240,7 +240,7 @@ static int iommu_pde_from_dfn(struct dom
             set_iommu_pde_present(pde, next_table_mfn, next_level, true,
                                   true);
 
-            amd_iommu_flush_all_pages(d);
+            *flush_flags |= IOMMU_FLUSHF_modified;
         }
 
         /* Install lower level page table for non-present entries */
@@ -312,7 +312,8 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, true) || !pt_mfn )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+         !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -345,7 +346,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",




* [PATCH v3 12/23] AMD/IOMMU: walk trees upon page fault
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (10 preceding siblings ...)
  2022-01-10 16:29 ` [PATCH v3 11/23] AMD/IOMMU: drop stray TLB flush Jan Beulich
@ 2022-01-10 16:30 ` Jan Beulich
  2022-01-10 16:30 ` [PATCH v3 13/23] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

This is to aid diagnosing issues and largely matches VT-d's behavior.
Since I'm adding permissions output here as well, take the opportunity
to also display them in amd_dump_page_table_level().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Note: "largely matches VT-d's behavior" includes the lack of any locking
      here. Adding suitable locking may not be that easy, as we'd need
      to determine which domain's mapping lock to acquire in addition to
      the necessary IOMMU lock (for the device table access), and
      whether that domain actually still exists. The latter is because,
      if we really want to play it safe here, in my opinion we also need
      to account for the device table being potentially corrupted /
      stale.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -256,6 +256,8 @@ int __must_check amd_iommu_flush_iotlb_p
                                              unsigned long page_count,
                                              unsigned int flush_flags);
 int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+                             dfn_t dfn);
 
 /* device table functions */
 int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -574,6 +574,9 @@ static void parse_event_log_entry(struct
                (flags & 0x002) ? " NX" : "",
                (flags & 0x001) ? " GN" : "");
 
+        if ( iommu_verbose )
+            amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
+
         for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
             if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
                 pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -366,6 +366,50 @@ int amd_iommu_unmap_page(struct domain *
     return 0;
 }
 
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+                             dfn_t dfn)
+{
+    mfn_t pt_mfn;
+    unsigned int level;
+    const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
+
+    if ( !dt[dev_id].tv )
+    {
+        printk("%pp: no root\n", &PCI_SBDF2(iommu->seg, dev_id));
+        return;
+    }
+
+    pt_mfn = _mfn(dt[dev_id].pt_root);
+    level = dt[dev_id].paging_mode;
+    printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
+           &PCI_SBDF2(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
+
+    while ( level )
+    {
+        const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
+        unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
+        union amd_iommu_pte pte = pt[idx];
+
+        unmap_domain_page(pt);
+
+        printk("  L%u[%03x] = %"PRIx64" %c%c\n", level, idx, pte.raw,
+               pte.pr ? pte.ir ? 'r' : '-' : 'n',
+               pte.pr ? pte.iw ? 'w' : '-' : 'p');
+
+        if ( !pte.pr )
+            break;
+
+        if ( pte.next_level >= level )
+        {
+            printk("  L%u[%03x]: next: %u\n", level, idx, pte.next_level);
+            break;
+        }
+
+        pt_mfn = _mfn(pte.mfn);
+        level = pte.next_level;
+    }
+}
+
 static unsigned long flush_count(unsigned long dfn, unsigned long page_count,
                                  unsigned int order)
 {
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -607,10 +607,11 @@ static void amd_dump_page_table_level(st
                 mfn_to_page(_mfn(pde->mfn)), pde->next_level,
                 address, indent + 1);
         else
-            printk("%*sdfn: %08lx  mfn: %08lx\n",
+            printk("%*sdfn: %08lx  mfn: %08lx  %c%c\n",
                    indent, "",
                    (unsigned long)PFN_DOWN(address),
-                   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)));
+                   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)),
+                   pde->ir ? 'r' : '-', pde->iw ? 'w' : '-');
     }
 
     unmap_domain_page(table_vaddr);




* [PATCH v3 13/23] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (11 preceding siblings ...)
  2022-01-10 16:30 ` [PATCH v3 12/23] AMD/IOMMU: walk trees upon page fault Jan Beulich
@ 2022-01-10 16:30 ` Jan Beulich
  2022-01-10 16:31 ` [PATCH v3 14/23] AMD/IOMMU: allow use of superpage mappings Jan Beulich
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

In order to free intermediate page tables when replacing smaller
mappings by a single larger one, callers will need to know the full PTE.
Flush indicators can be derived from this in the callers (and outside
the locked regions). First split set_iommu_pte_present() from
set_iommu_ptes_present(): Only the former needs to return the old PTE,
while the latter (like also set_iommu_pde_present()) doesn't even need
to return flush indicators. Then change return types/values and callers
accordingly.

Note that for subsequent changes returning merely a boolean (old.pr) is
not going to be sufficient; the next_level field will also be required.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -31,30 +31,28 @@ static unsigned int pfn_to_pde_idx(unsig
     return idx;
 }
 
-static unsigned int clear_iommu_pte_present(unsigned long l1_mfn,
-                                            unsigned long dfn)
+static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
+                                                   unsigned long dfn)
 {
-    union amd_iommu_pte *table, *pte;
-    unsigned int flush_flags;
+    union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
     pte = &table[pfn_to_pde_idx(dfn, 1)];
+    old = *pte;
 
-    flush_flags = pte->pr ? IOMMU_FLUSHF_modified : 0;
     write_atomic(&pte->raw, 0);
 
     unmap_domain_page(table);
 
-    return flush_flags;
+    return old;
 }
 
-static unsigned int set_iommu_pde_present(union amd_iommu_pte *pte,
-                                          unsigned long next_mfn,
-                                          unsigned int next_level, bool iw,
-                                          bool ir)
+static void set_iommu_pde_present(union amd_iommu_pte *pte,
+                                  unsigned long next_mfn,
+                                  unsigned int next_level,
+                                  bool iw, bool ir)
 {
-    union amd_iommu_pte new = {}, old;
-    unsigned int flush_flags = IOMMU_FLUSHF_added;
+    union amd_iommu_pte new = {};
 
     /*
      * FC bit should be enabled in PTE, this helps to solve potential
@@ -68,28 +66,42 @@ static unsigned int set_iommu_pde_presen
     new.next_level = next_level;
     new.pr = true;
 
-    old.raw = read_atomic(&pte->raw);
-    old.ign0 = 0;
-    old.ign1 = 0;
-    old.ign2 = 0;
+    write_atomic(&pte->raw, new.raw);
+}
 
-    if ( old.pr && old.raw != new.raw )
-        flush_flags |= IOMMU_FLUSHF_modified;
+static union amd_iommu_pte set_iommu_pte_present(unsigned long pt_mfn,
+                                                 unsigned long dfn,
+                                                 unsigned long next_mfn,
+                                                 unsigned int level,
+                                                 bool iw, bool ir)
+{
+    union amd_iommu_pte *table, *pde, old;
 
-    write_atomic(&pte->raw, new.raw);
+    table = map_domain_page(_mfn(pt_mfn));
+    pde = &table[pfn_to_pde_idx(dfn, level)];
+
+    old = *pde;
+    if ( !old.pr || old.next_level ||
+         old.mfn != next_mfn ||
+         old.iw != iw || old.ir != ir )
+        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+    else
+        old.pr = false; /* signal "no change" to the caller */
 
-    return flush_flags;
+    unmap_domain_page(table);
+
+    return old;
 }
 
-static unsigned int set_iommu_ptes_present(unsigned long pt_mfn,
-                                           unsigned long dfn,
-                                           unsigned long next_mfn,
-                                           unsigned int nr_ptes,
-                                           unsigned int pde_level,
-                                           bool iw, bool ir)
+static void set_iommu_ptes_present(unsigned long pt_mfn,
+                                   unsigned long dfn,
+                                   unsigned long next_mfn,
+                                   unsigned int nr_ptes,
+                                   unsigned int pde_level,
+                                   bool iw, bool ir)
 {
     union amd_iommu_pte *table, *pde;
-    unsigned int page_sz, flush_flags = 0;
+    unsigned int page_sz;
 
     table = map_domain_page(_mfn(pt_mfn));
     pde = &table[pfn_to_pde_idx(dfn, pde_level)];
@@ -98,20 +110,18 @@ static unsigned int set_iommu_ptes_prese
     if ( (void *)(pde + nr_ptes) > (void *)table + PAGE_SIZE )
     {
         ASSERT_UNREACHABLE();
-        return 0;
+        return;
     }
 
     while ( nr_ptes-- )
     {
-        flush_flags |= set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
 
         ++pde;
         next_mfn += page_sz;
     }
 
     unmap_domain_page(table);
-
-    return flush_flags;
 }
 
 void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
@@ -287,6 +297,7 @@ int amd_iommu_map_page(struct domain *d,
     struct domain_iommu *hd = dom_iommu(d);
     int rc;
     unsigned long pt_mfn = 0;
+    union amd_iommu_pte old;
 
     spin_lock(&hd->arch.mapping_lock);
 
@@ -323,12 +334,16 @@ int amd_iommu_map_page(struct domain *d,
     }
 
     /* Install 4k mapping */
-    *flush_flags |= set_iommu_ptes_present(pt_mfn, dfn_x(dfn), mfn_x(mfn),
-                                           1, 1, (flags & IOMMUF_writable),
-                                           (flags & IOMMUF_readable));
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+                                (flags & IOMMUF_writable),
+                                (flags & IOMMUF_readable));
 
     spin_unlock(&hd->arch.mapping_lock);
 
+    *flush_flags |= IOMMU_FLUSHF_added;
+    if ( old.pr )
+        *flush_flags |= IOMMU_FLUSHF_modified;
+
     return 0;
 }
 
@@ -337,6 +352,7 @@ int amd_iommu_unmap_page(struct domain *
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
 
@@ -358,11 +374,14 @@ int amd_iommu_unmap_page(struct domain *
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        *flush_flags |= clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
+    if ( old.pr )
+        *flush_flags |= IOMMU_FLUSHF_modified;
+
     return 0;
 }
 




* [PATCH v3 14/23] AMD/IOMMU: allow use of superpage mappings
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (12 preceding siblings ...)
  2022-01-10 16:30 ` [PATCH v3 13/23] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
@ 2022-01-10 16:31 ` Jan Beulich
  2022-01-10 16:32 ` [PATCH v3 15/23] VT-d: " Jan Beulich
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Paul Durrant, Roger Pau Monné,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu

No separate feature flags exist which would control availability of
these; the only restriction is HATS (establishing the maximum number of
page table levels in general), and even that has a lower bound of 4.
Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
non-default page sizes the implementation in principle permits arbitrary
size mappings, but these require multiple identical leaf PTEs to be
written, which isn't all that different from having to write multiple
consecutive PTEs with increasing frame numbers. IMO that's therefore
beneficial only on hardware where suitable TLBs exist; I'm unaware of
such hardware.)
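
As a quick cross-check of the order-to-level translation used by the
code below (a standalone sketch, assuming the usual 4k base page and
512-entry page tables, i.e. PTE_PER_TABLE_SHIFT == 9):

#include <stdio.h>

int main(void)
{
    static const unsigned int orders[] = { 0, 9, 18, 27 };
    unsigned int i;

    for ( i = 0; i < sizeof(orders) / sizeof(orders[0]); ++i )
        printf("order %2u -> 2^%u-byte mapping, installed at level %u\n",
               orders[i], orders[i] + 12, orders[i] / 9 + 1);

    return 0;
}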

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm not fully sure about allowing 512G mappings: The scheduling-for-
freeing of intermediate page tables would take quite a while when
replacing a tree of 4k mappings by a single 512G one. Yet then again
there's no present code path via which 512G chunks of memory could be
allocated (and hence mapped) anyway, so this would only benefit huge
systems where 512 1G mappings could be re-coalesced (once suitable code
is in place) into a single L4 entry. And re-coalescing wouldn't result
in scheduling-for-freeing of full trees of lower level pagetables.
---
v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
    where possible.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
 }
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
-                                                   unsigned long dfn)
+                                                   unsigned long dfn,
+                                                   unsigned int level)
 {
     union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
-    pte = &table[pfn_to_pde_idx(dfn, 1)];
+    pte = &table[pfn_to_pde_idx(dfn, level)];
     old = *pte;
 
     write_atomic(&pte->raw, 0);
@@ -291,10 +292,31 @@ static int iommu_pde_from_dfn(struct dom
     return 0;
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int level)
+{
+    if ( level > 1 )
+    {
+        union amd_iommu_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
+            if ( pt[i].pr && pt[i].next_level )
+            {
+                ASSERT(pt[i].next_level < level);
+                queue_free_pt(d, _mfn(pt[i].mfn), pt[i].next_level);
+            }
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 int amd_iommu_map_page(struct domain *d, dfn_t dfn, mfn_t mfn,
                        unsigned int flags, unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
     int rc;
     unsigned long pt_mfn = 0;
     union amd_iommu_pte old;
@@ -323,7 +345,7 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
          !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -333,8 +355,8 @@ int amd_iommu_map_page(struct domain *d,
         return -EFAULT;
     }
 
-    /* Install 4k mapping */
-    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+    /* Install mapping */
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
                                 (flags & IOMMUF_writable),
                                 (flags & IOMMUF_readable));
 
@@ -342,8 +364,13 @@ int amd_iommu_map_page(struct domain *d,
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( IOMMUF_order(flags) && old.next_level )
+            queue_free_pt(d, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
@@ -352,6 +379,7 @@ int amd_iommu_unmap_page(struct domain *
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
     union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
@@ -362,7 +390,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -374,14 +402,19 @@ int amd_iommu_unmap_page(struct domain *
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( order && old.next_level )
+            queue_free_pt(d, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -630,7 +630,7 @@ static void amd_dump_page_tables(struct
 }
 
 static const struct iommu_ops __initconstrel _iommu_ops = {
-    .page_sizes = PAGE_SIZE_4K,
+    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
     .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/include/xen/page-defs.h
+++ b/xen/include/xen/page-defs.h
@@ -21,4 +21,19 @@
 #define PAGE_MASK_64K               PAGE_MASK_GRAN(64K)
 #define PAGE_ALIGN_64K(addr)        PAGE_ALIGN_GRAN(64K, addr)
 
+#define PAGE_SHIFT_2M               21
+#define PAGE_SIZE_2M                PAGE_SIZE_GRAN(2M)
+#define PAGE_MASK_2M                PAGE_MASK_GRAN(2M)
+#define PAGE_ALIGN_2M(addr)         PAGE_ALIGN_GRAN(2M, addr)
+
+#define PAGE_SHIFT_1G               30
+#define PAGE_SIZE_1G                PAGE_SIZE_GRAN(1G)
+#define PAGE_MASK_1G                PAGE_MASK_GRAN(1G)
+#define PAGE_ALIGN_1G(addr)         PAGE_ALIGN_GRAN(1G, addr)
+
+#define PAGE_SHIFT_512G             39
+#define PAGE_SIZE_512G              PAGE_SIZE_GRAN(512G)
+#define PAGE_MASK_512G              PAGE_MASK_GRAN(512G)
+#define PAGE_ALIGN_512G(addr)       PAGE_ALIGN_GRAN(512G, addr)
+
 #endif /* __XEN_PAGE_DEFS_H__ */




* [PATCH v3 15/23] VT-d: allow use of superpage mappings
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (13 preceding siblings ...)
  2022-01-10 16:31 ` [PATCH v3 14/23] AMD/IOMMU: allow use of superpage mappings Jan Beulich
@ 2022-01-10 16:32 ` Jan Beulich
  2022-01-30  3:26   ` Tian, Kevin
  2022-01-10 16:33 ` [PATCH v3 16/23] IOMMU: fold flush-all hook into "flush one" Jan Beulich
                   ` (7 subsequent siblings)
  22 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

... depending on feature availability (and absence of quirks).

Also make the page table dumping function aware of superpages.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
    where possible. Tighten assertion.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -826,18 +826,37 @@ static int __must_check iommu_flush_iotl
     return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int level)
+{
+    if ( level > 1 )
+    {
+        struct dma_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_NUM; ++i )
+            if ( dma_pte_present(pt[i]) && !dma_pte_superpage(pt[i]) )
+                queue_free_pt(d, maddr_to_mfn(dma_pte_addr(pt[i])),
+                              level - 1);
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 /* clear one page's page table */
 static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
                              unsigned int order,
                              unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct dma_pte *page = NULL, *pte = NULL;
+    struct dma_pte *page = NULL, *pte = NULL, old;
     u64 pg_maddr;
+    unsigned int level = (order / LEVEL_STRIDE) + 1;
 
     spin_lock(&hd->arch.mapping_lock);
-    /* get last level pte */
-    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags, false);
+    /* get target level pte */
+    pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags, false);
     if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -845,7 +864,7 @@ static int dma_pte_clear_one(struct doma
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-    pte = page + address_level_offset(addr, 1);
+    pte = &page[address_level_offset(addr, level)];
 
     if ( !dma_pte_present(*pte) )
     {
@@ -854,14 +873,20 @@ static int dma_pte_clear_one(struct doma
         return 0;
     }
 
+    old = *pte;
     dma_clear_pte(*pte);
-    *flush_flags |= IOMMU_FLUSHF_modified;
 
     spin_unlock(&hd->arch.mapping_lock);
     iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
 
+    *flush_flags |= IOMMU_FLUSHF_modified;
+
+    if ( order && !dma_pte_superpage(old) )
+        queue_free_pt(domain, maddr_to_mfn(dma_pte_addr(old)),
+                      order / LEVEL_STRIDE);
+
     return 0;
 }
 
@@ -1952,6 +1977,7 @@ static int __must_check intel_iommu_map_
     struct domain_iommu *hd = dom_iommu(d);
     struct dma_pte *page, *pte, old, new = {};
     u64 pg_maddr;
+    unsigned int level = (IOMMUF_order(flags) / LEVEL_STRIDE) + 1;
     int rc = 0;
 
     /* Do nothing if VT-d shares EPT page table */
@@ -1976,7 +2002,7 @@ static int __must_check intel_iommu_map_
         return 0;
     }
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1, flush_flags,
+    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), level, flush_flags,
                                       true);
     if ( pg_maddr < PAGE_SIZE )
     {
@@ -1985,13 +2011,15 @@ static int __must_check intel_iommu_map_
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-    pte = &page[dfn_x(dfn) & LEVEL_MASK];
+    pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
     old = *pte;
 
     dma_set_pte_addr(new, mfn_to_maddr(mfn));
     dma_set_pte_prot(new,
                      ((flags & IOMMUF_readable) ? DMA_PTE_READ  : 0) |
                      ((flags & IOMMUF_writable) ? DMA_PTE_WRITE : 0));
+    if ( IOMMUF_order(flags) )
+        dma_set_pte_superpage(new);
 
     /* Set the SNP on leaf page table if Snoop Control available */
     if ( iommu_snoop )
@@ -2012,8 +2040,14 @@ static int __must_check intel_iommu_map_
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( dma_pte_present(old) )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( IOMMUF_order(flags) && !dma_pte_superpage(old) )
+            queue_free_pt(d, maddr_to_mfn(dma_pte_addr(old)),
+                          IOMMUF_order(flags) / LEVEL_STRIDE);
+    }
+
     return rc;
 }
 
@@ -2370,6 +2404,7 @@ static int __init vtd_setup(void)
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
+    unsigned int large_sizes = PAGE_SIZE_2M | PAGE_SIZE_1G;
     int ret;
     bool reg_inval_supported = true;
 
@@ -2412,6 +2447,11 @@ static int __init vtd_setup(void)
                cap_sps_2mb(iommu->cap) ? ", 2MB" : "",
                cap_sps_1gb(iommu->cap) ? ", 1GB" : "");
 
+        if ( !cap_sps_2mb(iommu->cap) )
+            large_sizes &= ~PAGE_SIZE_2M;
+        if ( !cap_sps_1gb(iommu->cap) )
+            large_sizes &= ~PAGE_SIZE_1G;
+
 #ifndef iommu_snoop
         if ( iommu_snoop && !ecap_snp_ctl(iommu->ecap) )
             iommu_snoop = false;
@@ -2483,6 +2523,9 @@ static int __init vtd_setup(void)
     if ( ret )
         goto error;
 
+    ASSERT(iommu_ops.page_sizes == PAGE_SIZE_4K);
+    iommu_ops.page_sizes |= large_sizes;
+
     register_keyhandler('V', vtd_dump_iommu_info, "dump iommu info", 1);
 
     return 0;
@@ -2797,7 +2840,7 @@ static void vtd_dump_page_table_level(pa
             continue;
 
         address = gpa + offset_level_address(i, level);
-        if ( next_level >= 1 ) 
+        if ( next_level && !dma_pte_superpage(*pte) )
             vtd_dump_page_table_level(dma_pte_addr(*pte), next_level,
                                       address, indent + 1);
         else




* [PATCH v3 16/23] IOMMU: fold flush-all hook into "flush one"
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (14 preceding siblings ...)
  2022-01-10 16:32 ` [PATCH v3 15/23] VT-d: " Jan Beulich
@ 2022-01-10 16:33 ` Jan Beulich
  2022-01-30  3:38   ` Tian, Kevin
  2022-01-10 16:34 ` [PATCH v3 17/23] IOMMU/x86: prefill newly allocated page tables Jan Beulich
                   ` (6 subsequent siblings)
  22 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:33 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné,
	Kevin Tian, Julien Grall, Stefano Stabellini, Volodymyr Babchuk,
	Bertrand Marquis, Rahul Singh

Having a separate flush-all hook has always puzzled me somewhat. We
will want to be able to force a full flush via accumulated flush flags
from the map/unmap functions. Introduce a respective new flag and fold
all flush handling to use the single remaining hook.

Note that because of the respective comments in SMMU and IPMMU-VMSA
code, I've folded the two prior hook functions into one. For SMMU-v3,
which lacks such a comment about the hardware being incapable of
selective flushes, I've left both functions in place on the assumption
that selective and full flushes will eventually want separating.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
[IPMMU-VMSA and SMMU-V2]
Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
[SMMUv3]
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
[Arm]
Acked-by: Julien Grall <jgrall@amazon.com>
---
TBD: What we really are going to need is for the map/unmap functions to
     specify that a wider region needs flushing than just the one
     covered by the present set of (un)maps. This may still be less than
     a full flush, but at least as a first step it seemed better to me
     to keep things simple and go the flush-all route.
---
v3: Re-base over changes earlier in the series.
v2: New.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -255,7 +255,6 @@ int amd_iommu_get_reserved_device_memory
 int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
                                              unsigned long page_count,
                                              unsigned int flush_flags);
-int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
 void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
                              dfn_t dfn);
 
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -478,15 +478,18 @@ int amd_iommu_flush_iotlb_pages(struct d
 {
     unsigned long dfn_l = dfn_x(dfn);
 
-    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
-    ASSERT(flush_flags);
+    if ( !(flush_flags & IOMMU_FLUSHF_all) )
+    {
+        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
+        ASSERT(flush_flags);
+    }
 
     /* Unless a PTE was modified, no flush is required */
     if ( !(flush_flags & IOMMU_FLUSHF_modified) )
         return 0;
 
-    /* If the range wraps then just flush everything */
-    if ( dfn_l + page_count < dfn_l )
+    /* If so requested or if the range wraps then just flush everything. */
+    if ( (flush_flags & IOMMU_FLUSHF_all) || dfn_l + page_count < dfn_l )
     {
         amd_iommu_flush_all_pages(d);
         return 0;
@@ -511,13 +514,6 @@ int amd_iommu_flush_iotlb_pages(struct d
 
     return 0;
 }
-
-int amd_iommu_flush_iotlb_all(struct domain *d)
-{
-    amd_iommu_flush_all_pages(d);
-
-    return 0;
-}
 
 int amd_iommu_reserve_domain_unity_map(struct domain *d,
                                        const struct ivrs_unity_map *map,
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -642,7 +642,6 @@ static const struct iommu_ops __initcons
     .map_page = amd_iommu_map_page,
     .unmap_page = amd_iommu_unmap_page,
     .iotlb_flush = amd_iommu_flush_iotlb_pages,
-    .iotlb_flush_all = amd_iommu_flush_iotlb_all,
     .reassign_device = reassign_device,
     .get_device_group_id = amd_iommu_group_id,
     .enable_x2apic = iov_enable_xt,
--- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -930,13 +930,19 @@ out:
 }
 
 /* Xen IOMMU ops */
-static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
+static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
+                                          unsigned long page_count,
+                                          unsigned int flush_flags)
 {
     struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
 
+    ASSERT(flush_flags);
+
     if ( !xen_domain || !xen_domain->root_domain )
         return 0;
 
+    /* The hardware doesn't support selective TLB flush. */
+
     spin_lock(&xen_domain->lock);
     ipmmu_tlb_invalidate(xen_domain->root_domain);
     spin_unlock(&xen_domain->lock);
@@ -944,16 +950,6 @@ static int __must_check ipmmu_iotlb_flus
     return 0;
 }
 
-static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
-                                          unsigned long page_count,
-                                          unsigned int flush_flags)
-{
-    ASSERT(flush_flags);
-
-    /* The hardware doesn't support selective TLB flush. */
-    return ipmmu_iotlb_flush_all(d);
-}
-
 static struct ipmmu_vmsa_domain *ipmmu_get_cache_domain(struct domain *d,
                                                         struct device *dev)
 {
@@ -1303,7 +1299,6 @@ static const struct iommu_ops ipmmu_iomm
     .hwdom_init      = ipmmu_iommu_hwdom_init,
     .teardown        = ipmmu_iommu_domain_teardown,
     .iotlb_flush     = ipmmu_iotlb_flush,
-    .iotlb_flush_all = ipmmu_iotlb_flush_all,
     .assign_device   = ipmmu_assign_device,
     .reassign_device = ipmmu_reassign_device,
     .map_page        = arm_iommu_map_page,
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2649,11 +2649,17 @@ static int force_stage = 2;
  */
 static u32 platform_features = ARM_SMMU_FEAT_COHERENT_WALK;
 
-static int __must_check arm_smmu_iotlb_flush_all(struct domain *d)
+static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
+					     unsigned long page_count,
+					     unsigned int flush_flags)
 {
 	struct arm_smmu_xen_domain *smmu_domain = dom_iommu(d)->arch.priv;
 	struct iommu_domain *cfg;
 
+	ASSERT(flush_flags);
+
+	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
+
 	spin_lock(&smmu_domain->lock);
 	list_for_each_entry(cfg, &smmu_domain->contexts, list) {
 		/*
@@ -2670,16 +2676,6 @@ static int __must_check arm_smmu_iotlb_f
 	return 0;
 }
 
-static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
-					     unsigned long page_count,
-					     unsigned int flush_flags)
-{
-	ASSERT(flush_flags);
-
-	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
-	return arm_smmu_iotlb_flush_all(d);
-}
-
 static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
 						struct device *dev)
 {
@@ -2879,7 +2875,6 @@ static const struct iommu_ops arm_smmu_i
     .add_device = arm_smmu_dt_add_device_generic,
     .teardown = arm_smmu_iommu_domain_teardown,
     .iotlb_flush = arm_smmu_iotlb_flush,
-    .iotlb_flush_all = arm_smmu_iotlb_flush_all,
     .assign_device = arm_smmu_assign_dev,
     .reassign_device = arm_smmu_reassign_dev,
     .map_page = arm_iommu_map_page,
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -3431,7 +3431,6 @@ static const struct iommu_ops arm_smmu_i
 	.hwdom_init		= arm_smmu_iommu_hwdom_init,
 	.teardown		= arm_smmu_iommu_xen_domain_teardown,
 	.iotlb_flush		= arm_smmu_iotlb_flush,
-	.iotlb_flush_all	= arm_smmu_iotlb_flush_all,
 	.assign_device		= arm_smmu_assign_dev,
 	.reassign_device	= arm_smmu_reassign_dev,
 	.map_page		= arm_iommu_map_page,
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -455,15 +455,12 @@ int iommu_iotlb_flush_all(struct domain
     const struct domain_iommu *hd = dom_iommu(d);
     int rc;
 
-    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush_all ||
+    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush ||
          !flush_flags )
         return 0;
 
-    /*
-     * The operation does a full flush so we don't need to pass the
-     * flush_flags in.
-     */
-    rc = iommu_call(hd->platform_ops, iotlb_flush_all, d);
+    rc = iommu_call(hd->platform_ops, iotlb_flush, d, INVALID_DFN, 0,
+                    flush_flags | IOMMU_FLUSHF_all);
     if ( unlikely(rc) )
     {
         if ( !d->is_shutting_down && printk_ratelimit() )
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -814,18 +814,21 @@ static int __must_check iommu_flush_iotl
                                                 unsigned long page_count,
                                                 unsigned int flush_flags)
 {
-    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
-    ASSERT(flush_flags);
+    if ( flush_flags & IOMMU_FLUSHF_all )
+    {
+        dfn = INVALID_DFN;
+        page_count = 0;
+    }
+    else
+    {
+        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
+        ASSERT(flush_flags);
+    }
 
     return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
                              page_count);
 }
 
-static int __must_check iommu_flush_iotlb_all(struct domain *d)
-{
-    return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
-}
-
 static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int level)
 {
     if ( level > 1 )
@@ -2928,7 +2931,7 @@ static int __init intel_iommu_quarantine
     spin_unlock(&hd->arch.mapping_lock);
 
     if ( !rc )
-        rc = iommu_flush_iotlb_all(d);
+        rc = iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
 
     /* Pages may be leaked in failure case */
     return rc;
@@ -2961,7 +2964,6 @@ static struct iommu_ops __initdata vtd_o
     .resume = vtd_resume,
     .crash_shutdown = vtd_crash_shutdown,
     .iotlb_flush = iommu_flush_iotlb_pages,
-    .iotlb_flush_all = iommu_flush_iotlb_all,
     .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_page_tables = vtd_dump_page_tables,
 };
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -147,9 +147,11 @@ enum
 {
     _IOMMU_FLUSHF_added,
     _IOMMU_FLUSHF_modified,
+    _IOMMU_FLUSHF_all,
 };
 #define IOMMU_FLUSHF_added (1u << _IOMMU_FLUSHF_added)
 #define IOMMU_FLUSHF_modified (1u << _IOMMU_FLUSHF_modified)
+#define IOMMU_FLUSHF_all (1u << _IOMMU_FLUSHF_all)
 
 int __must_check iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
                            unsigned long page_count, unsigned int flags,
@@ -282,7 +284,6 @@ struct iommu_ops {
     int __must_check (*iotlb_flush)(struct domain *d, dfn_t dfn,
                                     unsigned long page_count,
                                     unsigned int flush_flags);
-    int __must_check (*iotlb_flush_all)(struct domain *d);
     int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_page_tables)(struct domain *d);
 




* [PATCH v3 17/23] IOMMU/x86: prefill newly allocated page tables
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (15 preceding siblings ...)
  2022-01-10 16:33 ` [PATCH v3 16/23] IOMMU: fold flush-all hook into "flush one" Jan Beulich
@ 2022-01-10 16:34 ` Jan Beulich
  2022-02-18  5:01   ` Tian, Kevin
  2022-01-10 16:35 ` [PATCH v3 18/23] x86: introduce helper for recording degree of contiguity in " Jan Beulich
                   ` (5 subsequent siblings)
  22 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:34 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

Page tables are used for two purposes after allocation: They either
start out all empty, or they get filled to replace a superpage.
Subsequently, to allow replacing all-empty or fully contiguous page
tables by single higher-level entries, the degree of contiguity of
sub-regions will be recorded within individual page tables.
Install the initial set of markers immediately after allocation. Make
sure to retain these markers when further populating a page table in
preparation for it to replace a superpage.

The markers are simply 4-bit fields holding the order value of
contiguous entries. To demonstrate this, if a page table had just 16
entries, this would be the initial (fully contiguous) set of markers:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, but also a
respective number of all non-present entries.
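
For reference, the layout above can be reproduced with this standalone
sketch (__builtin_ctz() standing in for Xen's find_first_set_bit();
entry 0 gets the base-2 logarithm of the number of entries):

#include <stdio.h>

int main(void)
{
    enum { NR = 16 }; /* 16-entry example; real page tables have 512 */
    unsigned int i;

    printf("index ");
    for ( i = 0; i < NR; ++i )
        printf(" %X", i);
    printf("\nmarker");
    for ( i = 0; i < NR; ++i )
        printf(" %u", i ? __builtin_ctz(i) : 4 /* log2(NR) */);
    printf("\n");

    return 0;
}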

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
An alternative to the ASSERT()s added to set_iommu_ptes_present() would
be to make the function less general-purpose; it's used in a single
place only after all (i.e. it might as well be folded into its only
caller).

While in VT-d's comment ahead of struct dma_pte I'm adjusting the
description of the high bits, I'd like to note that the description of
some of the lower bits isn't correct either. Yet I don't think adjusting
that belongs here.
---
v3: Add comments. Re-base.
v2: New.

--- a/xen/drivers/passthrough/amd/iommu-defs.h
+++ b/xen/drivers/passthrough/amd/iommu-defs.h
@@ -445,11 +445,13 @@ union amd_iommu_x2apic_control {
 #define IOMMU_PAGE_TABLE_U32_PER_ENTRY	(IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
 #define IOMMU_PAGE_TABLE_ALIGNMENT	4096
 
+#define IOMMU_PTE_CONTIG_MASK           0x1e /* The ign0 field below. */
+
 union amd_iommu_pte {
     uint64_t raw;
     struct {
         bool pr:1;
-        unsigned int ign0:4;
+        unsigned int ign0:4; /* Covered by IOMMU_PTE_CONTIG_MASK. */
         bool a:1;
         bool d:1;
         unsigned int ign1:2;
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -116,7 +116,19 @@ static void set_iommu_ptes_present(unsig
 
     while ( nr_ptes-- )
     {
-        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        ASSERT(!pde->next_level);
+        ASSERT(!pde->u);
+
+        if ( pde > table )
+            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
+        else
+            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
+
+        pde->iw = iw;
+        pde->ir = ir;
+        pde->fc = true; /* See set_iommu_pde_present(). */
+        pde->mfn = next_mfn;
+        pde->pr = true;
 
         ++pde;
         next_mfn += page_sz;
@@ -235,7 +247,7 @@ static int iommu_pde_from_dfn(struct dom
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = iommu_alloc_pgtable(d);
+            table = iommu_alloc_pgtable(d, IOMMU_PTE_CONTIG_MASK);
             if ( table == NULL )
             {
                 AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -265,7 +277,7 @@ static int iommu_pde_from_dfn(struct dom
 
             if ( next_table_mfn == 0 )
             {
-                table = iommu_alloc_pgtable(d);
+                table = iommu_alloc_pgtable(d, IOMMU_PTE_CONTIG_MASK);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -651,7 +663,7 @@ int __init amd_iommu_quarantine_init(str
 
     spin_lock(&hd->arch.mapping_lock);
 
-    hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+    hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);
     if ( !hd->arch.amd.root_table )
         goto out;
 
@@ -666,7 +678,7 @@ int __init amd_iommu_quarantine_init(str
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(d, 0);
         if ( !pg )
             break;
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -242,7 +242,7 @@ int amd_iommu_alloc_root(struct domain *
 
     if ( unlikely(!hd->arch.amd.root_table) )
     {
-        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+        hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);
         if ( !hd->arch.amd.root_table )
             return -ENOMEM;
     }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -381,7 +381,7 @@ static uint64_t addr_to_dma_page_maddr(s
             goto out;
 
         pte_maddr = level;
-        if ( !(pg = iommu_alloc_pgtable(domain)) )
+        if ( !(pg = iommu_alloc_pgtable(domain, 0)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
@@ -423,7 +423,7 @@ static uint64_t addr_to_dma_page_maddr(s
             }
 
             pte_maddr = level - 1;
-            pg = iommu_alloc_pgtable(domain);
+            pg = iommu_alloc_pgtable(domain, DMA_PTE_CONTIG_MASK);
             if ( !pg )
                 break;
 
@@ -435,12 +435,13 @@ static uint64_t addr_to_dma_page_maddr(s
                 struct dma_pte *split = map_vtd_domain_page(pte_maddr);
                 unsigned long inc = 1UL << level_to_offset_bits(level - 1);
 
-                split[0].val = pte->val;
+                split[0].val |= pte->val & ~DMA_PTE_CONTIG_MASK;
                 if ( inc == PAGE_SIZE )
                     split[0].val &= ~DMA_PTE_SP;
 
                 for ( offset = 1; offset < PTE_NUM; ++offset )
-                    split[offset].val = split[offset - 1].val + inc;
+                    split[offset].val |=
+                        (split[offset - 1].val & ~DMA_PTE_CONTIG_MASK) + inc;
 
                 iommu_sync_cache(split, PAGE_SIZE);
                 unmap_vtd_domain_page(split);
@@ -2028,7 +2029,7 @@ static int __must_check intel_iommu_map_
     if ( iommu_snoop )
         dma_set_pte_snp(new);
 
-    if ( old.val == new.val )
+    if ( !((old.val ^ new.val) & ~DMA_PTE_CONTIG_MASK) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         unmap_vtd_domain_page(page);
@@ -2885,7 +2886,7 @@ static int __init intel_iommu_quarantine
         goto out;
     }
 
-    pg = iommu_alloc_pgtable(d);
+    pg = iommu_alloc_pgtable(d, 0);
 
     rc = -ENOMEM;
     if ( !pg )
@@ -2904,7 +2905,7 @@ static int __init intel_iommu_quarantine
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(d, 0);
 
         if ( !pg )
             goto out;
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -250,7 +250,10 @@ struct context_entry {
  * 2-6: reserved
  * 7: super page
  * 8-11: available
- * 12-63: Host physcial address
+ * 12-51: Host physical address
+ * 52-61: available (52-55 used for DMA_PTE_CONTIG_MASK)
+ * 62: reserved
+ * 63: available
  */
 struct dma_pte {
     u64 val;
@@ -260,6 +263,7 @@ struct dma_pte {
 #define DMA_PTE_PROT (DMA_PTE_READ | DMA_PTE_WRITE)
 #define DMA_PTE_SP   (1 << 7)
 #define DMA_PTE_SNP  (1 << 11)
+#define DMA_PTE_CONTIG_MASK  (0xfull << PADDR_BITS)
 #define dma_clear_pte(p)    do {(p).val = 0;} while(0)
 #define dma_set_pte_readable(p) do {(p).val |= DMA_PTE_READ;} while(0)
 #define dma_set_pte_writable(p) do {(p).val |= DMA_PTE_WRITE;} while(0)
@@ -273,7 +277,7 @@ struct dma_pte {
 #define dma_pte_write(p) (dma_pte_prot(p) & DMA_PTE_WRITE)
 #define dma_pte_addr(p) ((p).val & PADDR_MASK & PAGE_MASK_4K)
 #define dma_set_pte_addr(p, addr) do {\
-            (p).val |= ((addr) & PAGE_MASK_4K); } while (0)
+            (p).val |= ((addr) & PADDR_MASK & PAGE_MASK_4K); } while (0)
 #define dma_pte_present(p) (((p).val & DMA_PTE_PROT) != 0)
 #define dma_pte_superpage(p) (((p).val & DMA_PTE_SP) != 0)
 
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -461,12 +461,12 @@ int iommu_free_pgtables(struct domain *d
     return 0;
 }
 
-struct page_info *iommu_alloc_pgtable(struct domain *d)
+struct page_info *iommu_alloc_pgtable(struct domain *d, uint64_t contig_mask)
 {
     struct domain_iommu *hd = dom_iommu(d);
     unsigned int memflags = 0;
     struct page_info *pg;
-    void *p;
+    uint64_t *p;
 
 #ifdef CONFIG_NUMA
     if ( hd->node != NUMA_NO_NODE )
@@ -478,7 +478,28 @@ struct page_info *iommu_alloc_pgtable(st
         return NULL;
 
     p = __map_domain_page(pg);
-    clear_page(p);
+
+    if ( contig_mask )
+    {
+        unsigned int i, shift = find_first_set_bit(contig_mask);
+
+        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
+
+        p[0] = (PAGE_SHIFT - 3ull) << shift;
+        p[1] = 0;
+        p[2] = 1ull << shift;
+        p[3] = 0;
+
+        for ( i = 4; i < PAGE_SIZE / 8; i += 4 )
+        {
+            p[i + 0] = (find_first_set_bit(i) + 0ull) << shift;
+            p[i + 1] = 0;
+            p[i + 2] = 1ull << shift;
+            p[i + 3] = 0;
+        }
+    }
+    else
+        clear_page(p);
 
     if ( hd->platform_ops->sync_cache )
         iommu_vcall(hd->platform_ops, sync_cache, p, PAGE_SIZE);
--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -142,7 +142,8 @@ int pi_update_irte(const struct pi_desc
 })
 
 int __must_check iommu_free_pgtables(struct domain *d);
-struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+struct page_info *__must_check iommu_alloc_pgtable(struct domain *d,
+                                                   uint64_t contig_mask);
 void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */




* [PATCH v3 18/23] x86: introduce helper for recording degree of contiguity in page tables
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (16 preceding siblings ...)
  2022-01-10 16:34 ` [PATCH v3 17/23] IOMMU/x86: prefill newly allocated page tables Jan Beulich
@ 2022-01-10 16:35 ` Jan Beulich
  2022-01-10 16:35 ` [PATCH v3 19/23] AMD/IOMMU: free all-empty " Jan Beulich
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

This is a re-usable helper (kind of a template) which gets introduced
without users so that the individual subsequent patches introducing such
users can get committed independently of one another.

See the comment at the top of the new file. To demonstrate the effect,
if a page table had just 16 entries, this would be the set of markers
for a page table with fully contiguous mappings:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, but also a
respective number of all non-present entries.
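
A minimal usage sketch (illustrative only: VENDOR_PTE_IGN_MASK is a
made-up placeholder for a vendor's ignored-bits mask, the prereqs follow
the header comment below, and actual users only appear in subsequent
patches):

#define CONTIG_MASK VENDOR_PTE_IGN_MASK /* hypothetical, per the prereqs */
#include <asm/pt-contig-markers.h>

    /* In an unmap path, with the mapping lock held and "pt" being the
     * uint64_t view of the page table at the given level: */
    pt[idx] = 0;
    if ( pt_update_contig_markers(pt, idx, level, PTE_kind_null) )
    {
        /* All 512 entries are now non-present: the whole table could be
         * replaced by a non-present entry at the next higher level. */
    }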

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Rename function and header. Introduce IS_CONTIG().
v2: New.

--- /dev/null
+++ b/xen/arch/x86/include/asm/pt-contig-markers.h
@@ -0,0 +1,105 @@
+#ifndef __ASM_X86_PT_CONTIG_MARKERS_H
+#define __ASM_X86_PT_CONTIG_MARKERS_H
+
+/*
+ * Short of having function templates in C, the function defined below is
+ * intended to be used by multiple parties interested in recording the
+ * degree of contiguity in mappings by a single page table.
+ *
+ * Scheme: Every entry records the order of contiguous successive entries,
+ * up to the maximum order covered by that entry (which is the number of
+ * clear low bits in its index, with entry 0 being the exception using
+ * the base-2 logarithm of the number of entries in a single page table).
+ * While a few entries need touching upon update, knowing whether the
+ * table is fully contiguous (and can hence be replaced by a higher level
+ * leaf entry) is then possible by simply looking at entry 0's marker.
+ *
+ * Prereqs:
+ * - CONTIG_MASK needs to be #define-d, to a value having at least 4
+ *   contiguous bits (ignored by hardware), before including this file,
+ * - page tables to be passed here need to be initialized with correct
+ *   markers.
+ */
+
+#include <xen/bitops.h>
+#include <xen/lib.h>
+#include <xen/page-size.h>
+
+/* This is the same for all anticipated users, so doesn't need passing in. */
+#define CONTIG_LEVEL_SHIFT 9
+#define CONTIG_NR          (1 << CONTIG_LEVEL_SHIFT)
+
+#define GET_MARKER(e) MASK_EXTR(e, CONTIG_MASK)
+#define SET_MARKER(e, m) \
+    ((void)((e) = ((e) & ~CONTIG_MASK) | MASK_INSR(m, CONTIG_MASK)))
+
+#define IS_CONTIG(kind, pt, i, idx, shift, b) \
+    ((kind) == PTE_kind_leaf \
+     ? (((pt)[i] ^ (pt)[idx]) & ~CONTIG_MASK) == (1ULL << ((b) + (shift))) \
+     : !((pt)[i] & ~CONTIG_MASK))
+
+enum PTE_kind {
+    PTE_kind_null,
+    PTE_kind_leaf,
+    PTE_kind_table,
+};
+
+static bool pt_update_contig_markers(uint64_t *pt, unsigned int idx,
+                                     unsigned int level, enum PTE_kind kind)
+{
+    unsigned int b, i = idx;
+    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
+
+    ASSERT(idx < CONTIG_NR);
+    ASSERT(!(pt[idx] & CONTIG_MASK));
+
+    /* Step 1: Reduce markers in lower numbered entries. */
+    while ( i )
+    {
+        b = find_first_set_bit(i);
+        i &= ~(1U << b);
+        if ( GET_MARKER(pt[i]) > b )
+            SET_MARKER(pt[i], b);
+    }
+
+    /* An intermediate table is never contiguous with anything. */
+    if ( kind == PTE_kind_table )
+        return false;
+
+    /*
+     * Present entries need in-sync index and address to be a candidate
+     * for being contiguous: What we're after is whether ultimately the
+     * intermediate table can be replaced by a superpage.
+     */
+    if ( kind != PTE_kind_null &&
+         idx != ((pt[idx] >> shift) & (CONTIG_NR - 1)) )
+        return false;
+
+    /* Step 2: Check higher numbered entries for contiguity. */
+    for ( b = 0; b < CONTIG_LEVEL_SHIFT && !(idx & (1U << b)); ++b )
+    {
+        i = idx | (1U << b);
+        if ( !IS_CONTIG(kind, pt, i, idx, shift, b) || GET_MARKER(pt[i]) != b )
+            break;
+    }
+
+    /* Step 3: Update markers in this and lower numbered entries. */
+    for ( ; SET_MARKER(pt[idx], b), b < CONTIG_LEVEL_SHIFT; ++b )
+    {
+        i = idx ^ (1U << b);
+        if ( !IS_CONTIG(kind, pt, i, idx, shift, b) || GET_MARKER(pt[i]) != b )
+            break;
+        idx &= ~(1U << b);
+    }
+
+    return b == CONTIG_LEVEL_SHIFT;
+}
+
+#undef IS_CONTIG
+#undef SET_MARKER
+#undef GET_MARKER
+#undef CONTIG_NR
+#undef CONTIG_LEVEL_SHIFT
+#undef CONTIG_MASK
+
+#endif /* __ASM_X86_PT_CONTIG_MARKERS_H */




* [PATCH v3 19/23] AMD/IOMMU: free all-empty page tables
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (17 preceding siblings ...)
  2022-01-10 16:35 ` [PATCH v3 18/23] x86: introduce helper for recording degree of contiguity in " Jan Beulich
@ 2022-01-10 16:35 ` Jan Beulich
  2022-01-10 16:36 ` [PATCH v3 20/23] VT-d: " Jan Beulich
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that while its output isn't used there yet,
pt_update_contig_markers() right away needs to be called in all places
where entries get updated, not just the one where entries get cleared.
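
In outline, the loop added to amd_iommu_unmap_page() below then keeps
walking upwards for as long as the just-modified table has become fully
empty (simplified sketch; locking, error checking, and unlikely()
annotations omitted):

old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);

while ( free && ++level < hd->arch.amd.paging_mode )
{
    /* The table which just became empty ... */
    struct page_info *pg = mfn_to_page(_mfn(pt_mfn));

    /* ... has its referencing PDE, one level up, cleared as well ... */
    iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false);
    clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);

    /* ... and is then scheduled for freeing. */
    *flush_flags |= IOMMU_FLUSHF_all;
    iommu_queue_free_pgtable(d, pg);
}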

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Re-base over changes earlier in the series.
v2: New.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -21,6 +21,9 @@
 
 #include "iommu.h"
 
+#define CONTIG_MASK IOMMU_PTE_CONTIG_MASK
+#include <asm/pt-contig-markers.h>
+
 /* Given pfn and page table level, return pde index */
 static unsigned int pfn_to_pde_idx(unsigned long pfn, unsigned int level)
 {
@@ -33,16 +36,20 @@ static unsigned int pfn_to_pde_idx(unsig
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
                                                    unsigned long dfn,
-                                                   unsigned int level)
+                                                   unsigned int level,
+                                                   bool *free)
 {
     union amd_iommu_pte *table, *pte, old;
+    unsigned int idx = pfn_to_pde_idx(dfn, level);
 
     table = map_domain_page(_mfn(l1_mfn));
-    pte = &table[pfn_to_pde_idx(dfn, level)];
+    pte = &table[idx];
     old = *pte;
 
     write_atomic(&pte->raw, 0);
 
+    *free = pt_update_contig_markers(&table->raw, idx, level, PTE_kind_null);
+
     unmap_domain_page(table);
 
     return old;
@@ -85,7 +92,11 @@ static union amd_iommu_pte set_iommu_pte
     if ( !old.pr || old.next_level ||
          old.mfn != next_mfn ||
          old.iw != iw || old.ir != ir )
+    {
         set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        pt_update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level),
+                                 level, PTE_kind_leaf);
+    }
     else
         old.pr = false; /* signal "no change" to the caller */
 
@@ -262,6 +273,9 @@ static int iommu_pde_from_dfn(struct dom
             smp_wmb();
             set_iommu_pde_present(pde, next_table_mfn, next_level, true,
                                   true);
+            pt_update_contig_markers(&next_table_vaddr->raw,
+                                     pfn_to_pde_idx(dfn, level),
+                                     level, PTE_kind_table);
 
             *flush_flags |= IOMMU_FLUSHF_modified;
         }
@@ -287,6 +301,9 @@ static int iommu_pde_from_dfn(struct dom
                 next_table_mfn = mfn_x(page_to_mfn(table));
                 set_iommu_pde_present(pde, next_table_mfn, next_level, true,
                                       true);
+                pt_update_contig_markers(&next_table_vaddr->raw,
+                                         pfn_to_pde_idx(dfn, level),
+                                         level, PTE_kind_table);
             }
             else /* should never reach here */
             {
@@ -413,8 +430,24 @@ int amd_iommu_unmap_page(struct domain *
 
     if ( pt_mfn )
     {
+        bool free;
+
         /* Mark PTE as 'page not present'. */
-        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
+
+        while ( unlikely(free) && ++level < hd->arch.amd.paging_mode )
+        {
+            struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
+
+            if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn,
+                                    flush_flags, false) )
+                BUG();
+            BUG_ON(!pt_mfn);
+
+            clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
+            *flush_flags |= IOMMU_FLUSHF_all;
+            iommu_queue_free_pgtable(d, pg);
+        }
     }
 
     spin_unlock(&hd->arch.mapping_lock);




* [PATCH v3 20/23] VT-d: free all-empty page tables
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (18 preceding siblings ...)
  2022-01-10 16:35 ` [PATCH v3 19/23] AMD/IOMMU: free all-empty " Jan Beulich
@ 2022-01-10 16:36 ` Jan Beulich
  2022-02-18  5:20   ` Tian, Kevin
  2022-01-10 16:37 ` [PATCH v3 21/23] AMD/IOMMU: replace all-contiguous page tables by superpage mappings Jan Beulich
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that while its output isn't used there yet,
pt_update_contig_markers() right away needs to be called in all places
where entries get updated, not just the one where entries get cleared.

Note further that while pt_update_contig_markers() updates perhaps
several PTEs within the table, since these are changes to "avail" bits
only I do not think that cache flushing would be needed afterwards. Such
cache flushing (of entire pages, unless adding yet more logic to be more
selective) would be quite noticeable performance-wise (very prominent
during Dom0 boot).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Properly bound loop. Re-base over changes earlier in the series.
v2: New.
---
The hang during boot on my Latitude E6410 (see the respective code
comment) occurred pretty soon after iommu_enable_translation(). No errors,
no watchdog would kick in, just sometimes the first few pixel lines of
the next log message's (XEN) prefix would have made it out to the screen
(and there's no serial there). It took a lot of experimenting until I
figured out the workaround (which I consider ugly, but halfway acceptable).
I've been trying hard to make sure the workaround wouldn't be masking a
real issue, yet I'm still wary of it possibly doing so ... My best guess
at this point is that on these old IOMMUs the ignored bits 52...61
aren't really ignored for present entries, but also aren't "reserved"
enough to trigger faults. This guess is from having tried to set other
bits in this range (unconditionally, and with the workaround here in
place), which yielded the same behavior.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -42,6 +42,9 @@
 #include "vtd.h"
 #include "../ats.h"
 
+#define CONTIG_MASK DMA_PTE_CONTIG_MASK
+#include <asm/pt-contig-markers.h>
+
 /* dom_io is used as a sentinel for quarantined devices */
 #define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.vtd.pgd_maddr)
 
@@ -452,6 +455,9 @@ static uint64_t addr_to_dma_page_maddr(s
 
             write_atomic(&pte->val, new_pte.val);
             iommu_sync_cache(pte, sizeof(struct dma_pte));
+            pt_update_contig_markers(&parent->val,
+                                     address_level_offset(addr, level),
+                                     level, PTE_kind_table);
         }
 
         if ( --level == target )
@@ -879,9 +885,31 @@ static int dma_pte_clear_one(struct doma
 
     old = *pte;
     dma_clear_pte(*pte);
+    iommu_sync_cache(pte, sizeof(*pte));
+
+    while ( pt_update_contig_markers(&page->val,
+                                     address_level_offset(addr, level),
+                                     level, PTE_kind_null) &&
+            ++level < min_pt_levels )
+    {
+        struct page_info *pg = maddr_to_page(pg_maddr);
+
+        unmap_vtd_domain_page(page);
+
+        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
+                                          false);
+        BUG_ON(pg_maddr < PAGE_SIZE);
+
+        page = map_vtd_domain_page(pg_maddr);
+        pte = &page[address_level_offset(addr, level)];
+        dma_clear_pte(*pte);
+        iommu_sync_cache(pte, sizeof(*pte));
+
+        *flush_flags |= IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(domain, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
-    iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
 
@@ -2037,8 +2065,21 @@ static int __must_check intel_iommu_map_
     }
 
     *pte = new;
-
     iommu_sync_cache(pte, sizeof(struct dma_pte));
+
+    /*
+     * While the (ab)use of PTE_kind_table here allows to save some work in
+     * the function, the main motivation for it is that it avoids a so far
+     * unexplained hang during boot (while preparing Dom0) on a Westmere
+     * based laptop.
+     */
+    pt_update_contig_markers(&page->val,
+                             address_level_offset(dfn_to_daddr(dfn), level),
+                             level,
+                             (hd->platform_ops->page_sizes &
+                              (1UL << level_to_offset_bits(level + 1))
+                              ? PTE_kind_leaf : PTE_kind_table));
+
     spin_unlock(&hd->arch.mapping_lock);
     unmap_vtd_domain_page(page);
 




* [PATCH v3 21/23] AMD/IOMMU: replace all-contiguous page tables by superpage mappings
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (19 preceding siblings ...)
  2022-01-10 16:36 ` [PATCH v3 20/23] VT-d: " Jan Beulich
@ 2022-01-10 16:37 ` Jan Beulich
  2022-01-10 16:38 ` [PATCH v3 22/23] VT-d: " Jan Beulich
  2022-01-10 16:38 ` [PATCH v3 23/23] IOMMU/x86: add perf counters for page table splitting / coalescing Jan Beulich
  22 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:37 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

When a page table ends up with all contiguous entries (including all
identical attributes), it can be replaced by a superpage entry at the
next higher level. The page table itself can then be scheduled for
freeing.
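
The one not entirely obvious step in the hunk below is deriving the
superpage's MFN: the MFN just mapped simply has its low bits masked off to
the next level's alignment. In isolation (values purely illustrative):

/*
 * At level 2 (PTE_PER_TABLE_SHIFT == 9, i.e. a 2M superpage) any MFN
 * inside the fully contiguous 512-entry range, e.g. 0x12345, yields the
 * range's base MFN 0x12200.
 */
next_mfn = mfn_x(mfn) & (~0UL << (PTE_PER_TABLE_SHIFT * (level - 1)));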

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Unlike the freeing of all-empty page tables, this causes quite a bit of
back and forth for PV domains, due to their mapping/unmapping of pages
when they get converted to/from being page tables. It may therefore be
worth considering delaying re-coalescing a little, to avoid doing so
when the superpage would otherwise get split again pretty soon. But I
think this would better be the subject of a separate change anyway.

Of course this could also be helped by more "aware" kernel side
behavior: They could avoid immediately mapping freed page tables
writable again, in anticipation of re-using that same page for another
page table elsewhere.
---
v3: New.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -81,7 +81,8 @@ static union amd_iommu_pte set_iommu_pte
                                                  unsigned long dfn,
                                                  unsigned long next_mfn,
                                                  unsigned int level,
-                                                 bool iw, bool ir)
+                                                 bool iw, bool ir,
+                                                 bool *contig)
 {
     union amd_iommu_pte *table, *pde, old;
 
@@ -94,11 +95,15 @@ static union amd_iommu_pte set_iommu_pte
          old.iw != iw || old.ir != ir )
     {
         set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
-        pt_update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level),
-                                 level, PTE_kind_leaf);
+        *contig = pt_update_contig_markers(&table->raw,
+                                           pfn_to_pde_idx(dfn, level),
+                                           level, PTE_kind_leaf);
     }
     else
+    {
         old.pr = false; /* signal "no change" to the caller */
+        *contig = false;
+    }
 
     unmap_domain_page(table);
 
@@ -346,6 +351,7 @@ int amd_iommu_map_page(struct domain *d,
 {
     struct domain_iommu *hd = dom_iommu(d);
     unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
+    bool contig;
     int rc;
     unsigned long pt_mfn = 0;
     union amd_iommu_pte old;
@@ -386,8 +392,26 @@ int amd_iommu_map_page(struct domain *d,
 
     /* Install mapping */
     old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
-                                (flags & IOMMUF_writable),
-                                (flags & IOMMUF_readable));
+                                flags & IOMMUF_writable,
+                                flags & IOMMUF_readable, &contig);
+
+    while ( unlikely(contig) && ++level < hd->arch.amd.paging_mode )
+    {
+        struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
+        unsigned long next_mfn;
+
+        if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags,
+                                false) )
+            BUG();
+        BUG_ON(!pt_mfn);
+
+        next_mfn = mfn_x(mfn) & (~0UL << (PTE_PER_TABLE_SHIFT * (level - 1)));
+        set_iommu_pte_present(pt_mfn, dfn_x(dfn), next_mfn, level,
+                              flags & IOMMUF_writable,
+                              flags & IOMMUF_readable, &contig);
+        *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(d, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
 




* [PATCH v3 22/23] VT-d: replace all-contiguous page tables by superpage mappings
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (20 preceding siblings ...)
  2022-01-10 16:37 ` [PATCH v3 21/23] AMD/IOMMU: replace all-contiguous page tables by superpage mappings Jan Beulich
@ 2022-01-10 16:38 ` Jan Beulich
  2022-02-18  5:22   ` Tian, Kevin
  2022-01-10 16:38 ` [PATCH v3 23/23] IOMMU/x86: add perf counters for page table splitting / coalescing Jan Beulich
  22 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:38 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

When a page table ends up with all contiguous entries (including all
identical attributes), it can be replaced by a superpage entry at the
next higher level. The page table itself can then be scheduled for
freeing.

The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
for whenever we (and obviously hardware) start supporting 512G mappings.
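
To spell out the latent trap (illustrative only, with level_to_offset_bits(3)
being 30, i.e. the 512G case):

/* Old definition: (1 << LEVEL_STRIDE) - 1 has (32-bit) int type. */
new.val &= ~(511 << 30);     /* 511 << 30 overflows int: undefined behaviour */

/* New definition: PTE_NUM - 1UL is unsigned long. */
new.val &= ~(511UL << 30);   /* well defined, clears address bits 30..38 */

For the 2M and 1G cases used by this series the int-typed shift still happens
to fit, which is why the issue stays latent until 512G mappings appear.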

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Unlike the freeing of all-empty page tables, this causes quite a bit of
back and forth for PV domains, due to their mapping/unmapping of pages
when they get converted to/from being page tables. It may therefore be
worth considering delaying re-coalescing a little, to avoid doing so
when the superpage would otherwise get split again pretty soon. But I
think this would better be the subject of a separate change anyway.

Of course this could also be helped by more "aware" kernel side
behavior: They could avoid immediately mapping freed page tables
writable again, in anticipation of re-using that same page for another
page table elsewhere.
---
v3: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2071,14 +2071,35 @@ static int __must_check intel_iommu_map_
      * While the (ab)use of PTE_kind_table here allows to save some work in
      * the function, the main motivation for it is that it avoids a so far
      * unexplained hang during boot (while preparing Dom0) on a Westmere
-     * based laptop.
+     * based laptop.  This also has the intended effect of terminating the
+     * loop when super pages aren't supported anymore at the next level.
      */
-    pt_update_contig_markers(&page->val,
-                             address_level_offset(dfn_to_daddr(dfn), level),
-                             level,
-                             (hd->platform_ops->page_sizes &
-                              (1UL << level_to_offset_bits(level + 1))
-                              ? PTE_kind_leaf : PTE_kind_table));
+    while ( pt_update_contig_markers(&page->val,
+                                     address_level_offset(dfn_to_daddr(dfn), level),
+                                     level,
+                                     (hd->platform_ops->page_sizes &
+                                      (1UL << level_to_offset_bits(level + 1))
+                                       ? PTE_kind_leaf : PTE_kind_table)) )
+    {
+        struct page_info *pg = maddr_to_page(pg_maddr);
+
+        unmap_vtd_domain_page(page);
+
+        new.val &= ~(LEVEL_MASK << level_to_offset_bits(level));
+        dma_set_pte_superpage(new);
+
+        pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), ++level,
+                                          flush_flags, false);
+        BUG_ON(pg_maddr < PAGE_SIZE);
+
+        page = map_vtd_domain_page(pg_maddr);
+        pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
+        *pte = new;
+        iommu_sync_cache(pte, sizeof(*pte));
+
+        *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(d, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
     unmap_vtd_domain_page(page);
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -229,7 +229,7 @@ struct context_entry {
 
 /* page table handling */
 #define LEVEL_STRIDE       (9)
-#define LEVEL_MASK         ((1 << LEVEL_STRIDE) - 1)
+#define LEVEL_MASK         (PTE_NUM - 1UL)
 #define PTE_NUM            (1 << LEVEL_STRIDE)
 #define level_to_agaw(val) ((val) - 2)
 #define agaw_to_level(val) ((val) + 2)




* [PATCH v3 23/23] IOMMU/x86: add perf counters for page table splitting / coalescing
  2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (21 preceding siblings ...)
  2022-01-10 16:38 ` [PATCH v3 22/23] VT-d: " Jan Beulich
@ 2022-01-10 16:38 ` Jan Beulich
  2022-02-18  5:23   ` Tian, Kevin
  22 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2022-01-10 16:38 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -283,6 +283,8 @@ static int iommu_pde_from_dfn(struct dom
                                      level, PTE_kind_table);
 
             *flush_flags |= IOMMU_FLUSHF_modified;
+
+            perfc_incr(iommu_pt_shatters);
         }
 
         /* Install lower level page table for non-present entries */
@@ -411,6 +413,7 @@ int amd_iommu_map_page(struct domain *d,
                               flags & IOMMUF_readable, &contig);
         *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
         iommu_queue_free_pgtable(d, pg);
+        perfc_incr(iommu_pt_coalesces);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
@@ -471,6 +474,7 @@ int amd_iommu_unmap_page(struct domain *
             clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
             *flush_flags |= IOMMU_FLUSHF_all;
             iommu_queue_free_pgtable(d, pg);
+            perfc_incr(iommu_pt_coalesces);
         }
     }
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -451,6 +451,8 @@ static uint64_t addr_to_dma_page_maddr(s
 
                 if ( flush_flags )
                     *flush_flags |= IOMMU_FLUSHF_modified;
+
+                perfc_incr(iommu_pt_shatters);
             }
 
             write_atomic(&pte->val, new_pte.val);
@@ -907,6 +909,7 @@ static int dma_pte_clear_one(struct doma
 
         *flush_flags |= IOMMU_FLUSHF_all;
         iommu_queue_free_pgtable(domain, pg);
+        perfc_incr(iommu_pt_coalesces);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
@@ -2099,6 +2102,7 @@ static int __must_check intel_iommu_map_
 
         *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
         iommu_queue_free_pgtable(d, pg);
+        perfc_incr(iommu_pt_coalesces);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
--- a/xen/arch/x86/include/asm/perfc_defn.h
+++ b/xen/arch/x86/include/asm/perfc_defn.h
@@ -125,4 +125,7 @@ PERFCOUNTER(realmode_exits,      "vmexit
 
 PERFCOUNTER(pauseloop_exits, "vmexits from Pause-Loop Detection")
 
+PERFCOUNTER(iommu_pt_shatters,    "IOMMU page table shatters")
+PERFCOUNTER(iommu_pt_coalesces,   "IOMMU page table coalesces")
+
 /*#endif*/ /* __XEN_PERFC_DEFN_H__ */




* RE: [PATCH v3 02/23] VT-d: have callers specify the target level for page table walks
  2022-01-10 16:22 ` [PATCH v3 02/23] VT-d: " Jan Beulich
@ 2022-01-30  3:17   ` Tian, Kevin
  2022-01-31 10:04     ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2022-01-30  3:17 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, January 11, 2022 12:23 AM
> 
> In order to be able to insert/remove super-pages we need to allow
> callers of the walking function to specify at which point to stop the
> walk.
> 
> For intel_iommu_lookup_page() integrate the last level access into
> the main walking function.
> 
> dma_pte_clear_one() gets only partly adjusted for now: Error handling
> and order parameter get put in place, but the order parameter remains
> ignored (just like intel_iommu_map_page()'s order part of the flags).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I was actually wondering whether it wouldn't make sense to integrate
> dma_pte_clear_one() into its only caller intel_iommu_unmap_page(), for
> better symmetry with intel_iommu_map_page().

I think it's the right thing to do. It was there because it had multiple
callers when first introduced. But now, given there is only one caller,
merging it with that caller for symmetry makes sense.

with or without that change (given it's simple):

	Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v2: Fix build.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -347,63 +347,116 @@ static u64 bus_to_context_maddr(struct v
>      return maddr;
>  }
> 
> -static u64 addr_to_dma_page_maddr(struct domain *domain, u64 addr, int
> alloc)
> +/*
> + * This function walks (and if requested allocates) page tables to the
> + * designated target level. It returns
> + * - 0 when a non-present entry was encountered and no allocation was
> + *   requested,
> + * - a small positive value (the level, i.e. below PAGE_SIZE) upon allocation
> + *   failure,
> + * - for target > 0 the physical address of the page table holding the leaf
> + *   PTE for the requested address,
> + * - for target == 0 the full PTE.
> + */
> +static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t
> addr,
> +                                       unsigned int target,
> +                                       unsigned int *flush_flags, bool alloc)
>  {
>      struct domain_iommu *hd = dom_iommu(domain);
>      int addr_width = agaw_to_width(hd->arch.vtd.agaw);
>      struct dma_pte *parent, *pte = NULL;
> -    int level = agaw_to_level(hd->arch.vtd.agaw);
> -    int offset;
> +    unsigned int level = agaw_to_level(hd->arch.vtd.agaw), offset;
>      u64 pte_maddr = 0;
> 
>      addr &= (((u64)1) << addr_width) - 1;
>      ASSERT(spin_is_locked(&hd->arch.mapping_lock));
> +    ASSERT(target || !alloc);
> +
>      if ( !hd->arch.vtd.pgd_maddr )
>      {
>          struct page_info *pg;
> 
> -        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
> +        if ( !alloc )
> +            goto out;
> +
> +        pte_maddr = level;
> +        if ( !(pg = iommu_alloc_pgtable(domain)) )
>              goto out;
> 
>          hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
>      }
> 
> -    parent = (struct dma_pte *)map_vtd_domain_page(hd-
> >arch.vtd.pgd_maddr);
> -    while ( level > 1 )
> +    pte_maddr = hd->arch.vtd.pgd_maddr;
> +    parent = map_vtd_domain_page(pte_maddr);
> +    while ( level > target )
>      {
>          offset = address_level_offset(addr, level);
>          pte = &parent[offset];
> 
>          pte_maddr = dma_pte_addr(*pte);
> -        if ( !pte_maddr )
> +        if ( !dma_pte_present(*pte) || (level > 1 &&
> dma_pte_superpage(*pte)) )
>          {
>              struct page_info *pg;
> +            /*
> +             * Higher level tables always set r/w, last level page table
> +             * controls read/write.
> +             */
> +            struct dma_pte new_pte = { DMA_PTE_PROT };
> 
>              if ( !alloc )
> -                break;
> +            {
> +                pte_maddr = 0;
> +                if ( !dma_pte_present(*pte) )
> +                    break;
> +
> +                /*
> +                 * When the leaf entry was requested, pass back the full PTE,
> +                 * with the address adjusted to account for the residual of
> +                 * the walk.
> +                 */
> +                pte_maddr = pte->val +
> +                    (addr & ((1UL << level_to_offset_bits(level)) - 1) &
> +                     PAGE_MASK);
> +                if ( !target )
> +                    break;
> +            }
> 
> +            pte_maddr = level - 1;
>              pg = iommu_alloc_pgtable(domain);
>              if ( !pg )
>                  break;
> 
>              pte_maddr = page_to_maddr(pg);
> -            dma_set_pte_addr(*pte, pte_maddr);
> +            dma_set_pte_addr(new_pte, pte_maddr);
> 
> -            /*
> -             * high level table always sets r/w, last level
> -             * page table control read/write
> -             */
> -            dma_set_pte_readable(*pte);
> -            dma_set_pte_writable(*pte);
> +            if ( dma_pte_present(*pte) )
> +            {
> +                struct dma_pte *split = map_vtd_domain_page(pte_maddr);
> +                unsigned long inc = 1UL << level_to_offset_bits(level - 1);
> +
> +                split[0].val = pte->val;
> +                if ( inc == PAGE_SIZE )
> +                    split[0].val &= ~DMA_PTE_SP;
> +
> +                for ( offset = 1; offset < PTE_NUM; ++offset )
> +                    split[offset].val = split[offset - 1].val + inc;
> +
> +                iommu_sync_cache(split, PAGE_SIZE);
> +                unmap_vtd_domain_page(split);
> +
> +                if ( flush_flags )
> +                    *flush_flags |= IOMMU_FLUSHF_modified;
> +            }
> +
> +            write_atomic(&pte->val, new_pte.val);
>              iommu_sync_cache(pte, sizeof(struct dma_pte));
>          }
> 
> -        if ( level == 2 )
> +        if ( --level == target )
>              break;
> 
>          unmap_vtd_domain_page(parent);
>          parent = map_vtd_domain_page(pte_maddr);
> -        level--;
>      }
> 
>      unmap_vtd_domain_page(parent);
> @@ -430,7 +483,7 @@ static uint64_t domain_pgd_maddr(struct
>          if ( !hd->arch.vtd.pgd_maddr )
>          {
>              /* Ensure we have pagetables allocated down to leaf PTE. */
> -            addr_to_dma_page_maddr(d, 0, 1);
> +            addr_to_dma_page_maddr(d, 0, 1, NULL, true);
> 
>              if ( !hd->arch.vtd.pgd_maddr )
>                  return 0;
> @@ -770,8 +823,9 @@ static int __must_check iommu_flush_iotl
>  }
> 
>  /* clear one page's page table */
> -static void dma_pte_clear_one(struct domain *domain, uint64_t addr,
> -                              unsigned int *flush_flags)
> +static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
> +                             unsigned int order,
> +                             unsigned int *flush_flags)
>  {
>      struct domain_iommu *hd = dom_iommu(domain);
>      struct dma_pte *page = NULL, *pte = NULL;
> @@ -779,11 +833,11 @@ static void dma_pte_clear_one(struct dom
> 
>      spin_lock(&hd->arch.mapping_lock);
>      /* get last level pte */
> -    pg_maddr = addr_to_dma_page_maddr(domain, addr, 0);
> -    if ( pg_maddr == 0 )
> +    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags,
> false);
> +    if ( pg_maddr < PAGE_SIZE )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
> -        return;
> +        return pg_maddr ? -ENOMEM : 0;
>      }
> 
>      page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
> @@ -793,7 +847,7 @@ static void dma_pte_clear_one(struct dom
>      {
>          spin_unlock(&hd->arch.mapping_lock);
>          unmap_vtd_domain_page(page);
> -        return;
> +        return 0;
>      }
> 
>      dma_clear_pte(*pte);
> @@ -803,6 +857,8 @@ static void dma_pte_clear_one(struct dom
>      iommu_sync_cache(pte, sizeof(struct dma_pte));
> 
>      unmap_vtd_domain_page(page);
> +
> +    return 0;
>  }
> 
>  static int iommu_set_root_entry(struct vtd_iommu *iommu)
> @@ -1914,8 +1970,9 @@ static int __must_check intel_iommu_map_
>          return 0;
>      }
> 
> -    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1);
> -    if ( !pg_maddr )
> +    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1,
> flush_flags,
> +                                      true);
> +    if ( pg_maddr < PAGE_SIZE )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
>          return -ENOMEM;
> @@ -1965,17 +2022,14 @@ static int __must_check intel_iommu_unma
>      if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
>          return 0;
> 
> -    dma_pte_clear_one(d, dfn_to_daddr(dfn), flush_flags);
> -
> -    return 0;
> +    return dma_pte_clear_one(d, dfn_to_daddr(dfn), 0, flush_flags);
>  }
> 
>  static int intel_iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t
> *mfn,
>                                     unsigned int *flags)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
> -    struct dma_pte *page, val;
> -    u64 pg_maddr;
> +    uint64_t val;
> 
>      /*
>       * If VT-d shares EPT page table or if the domain is the hardware
> @@ -1987,25 +2041,16 @@ static int intel_iommu_lookup_page(struc
> 
>      spin_lock(&hd->arch.mapping_lock);
> 
> -    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0);
> -    if ( !pg_maddr )
> -    {
> -        spin_unlock(&hd->arch.mapping_lock);
> -        return -ENOENT;
> -    }
> -
> -    page = map_vtd_domain_page(pg_maddr);
> -    val = page[dfn_x(dfn) & LEVEL_MASK];
> +    val = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0, NULL, false);
> 
> -    unmap_vtd_domain_page(page);
>      spin_unlock(&hd->arch.mapping_lock);
> 
> -    if ( !dma_pte_present(val) )
> +    if ( val < PAGE_SIZE )
>          return -ENOENT;
> 
> -    *mfn = maddr_to_mfn(dma_pte_addr(val));
> -    *flags = dma_pte_read(val) ? IOMMUF_readable : 0;
> -    *flags |= dma_pte_write(val) ? IOMMUF_writable : 0;
> +    *mfn = maddr_to_mfn(val);
> +    *flags = val & DMA_PTE_READ ? IOMMUF_readable : 0;
> +    *flags |= val & DMA_PTE_WRITE ? IOMMUF_writable : 0;
> 
>      return 0;
>  }



* RE: [PATCH v3 03/23] VT-d: limit page table population in domain_pgd_maddr()
  2022-01-10 16:23 ` [PATCH v3 03/23] VT-d: limit page table population in domain_pgd_maddr() Jan Beulich
@ 2022-01-30  3:22   ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2022-01-30  3:22 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, January 11, 2022 12:23 AM
> 
> I have to admit that I never understood why domain_pgd_maddr() wants to
> populate all page table levels for DFN 0. I can only assume that despite
> the comment there what is needed is population just down to the smallest
> possible nr_pt_levels that the loop later in the function may need to
> run to. Hence what is needed is the minimum of all possible
> iommu->nr_pt_levels, to then be passed into addr_to_dma_page_maddr()
> instead of literal 1.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v3: New.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -55,6 +55,7 @@ bool __read_mostly iommu_snoop = true;
>  #endif
> 
>  static unsigned int __read_mostly nr_iommus;
> +static unsigned int __read_mostly min_pt_levels = UINT_MAX;
> 
>  static struct iommu_ops vtd_ops;
>  static struct tasklet vtd_fault_tasklet;
> @@ -482,8 +483,11 @@ static uint64_t domain_pgd_maddr(struct
>      {
>          if ( !hd->arch.vtd.pgd_maddr )
>          {
> -            /* Ensure we have pagetables allocated down to leaf PTE. */
> -            addr_to_dma_page_maddr(d, 0, 1, NULL, true);
> +            /*
> +             * Ensure we have pagetables allocated down to the smallest
> +             * level the loop below may need to run to.
> +             */
> +            addr_to_dma_page_maddr(d, 0, min_pt_levels, NULL, true);
> 
>              if ( !hd->arch.vtd.pgd_maddr )
>                  return 0;
> @@ -1381,6 +1385,8 @@ int __init iommu_alloc(struct acpi_drhd_
>          return -ENODEV;
>      }
>      iommu->nr_pt_levels = agaw_to_level(agaw);
> +    if ( min_pt_levels > iommu->nr_pt_levels )
> +        min_pt_levels = iommu->nr_pt_levels;
> 
>      if ( !ecap_coherent(iommu->ecap) )
>          vtd_ops.sync_cache = sync_cache;



* RE: [PATCH v3 15/23] VT-d: allow use of superpage mappings
  2022-01-10 16:32 ` [PATCH v3 15/23] VT-d: " Jan Beulich
@ 2022-01-30  3:26   ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2022-01-30  3:26 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich
> Sent: Tuesday, January 11, 2022 12:32 AM
> 
> ... depending on feature availability (and absence of quirks).
> 
> Also make the page table dumping function aware of superpages.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
>     where possible. Tighten assertion.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -826,18 +826,37 @@ static int __must_check iommu_flush_iotl
>      return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
>  }
> 
> +static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int level)
> +{
> +    if ( level > 1 )
> +    {
> +        struct dma_pte *pt = map_domain_page(mfn);
> +        unsigned int i;
> +
> +        for ( i = 0; i < PTE_NUM; ++i )
> +            if ( dma_pte_present(pt[i]) && !dma_pte_superpage(pt[i]) )
> +                queue_free_pt(d, maddr_to_mfn(dma_pte_addr(pt[i])),
> +                              level - 1);
> +
> +        unmap_domain_page(pt);
> +    }
> +
> +    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
> +}
> +
>  /* clear one page's page table */
>  static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
>                               unsigned int order,
>                               unsigned int *flush_flags)
>  {
>      struct domain_iommu *hd = dom_iommu(domain);
> -    struct dma_pte *page = NULL, *pte = NULL;
> +    struct dma_pte *page = NULL, *pte = NULL, old;
>      u64 pg_maddr;
> +    unsigned int level = (order / LEVEL_STRIDE) + 1;
> 
>      spin_lock(&hd->arch.mapping_lock);
> -    /* get last level pte */
> -    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags,
> false);
> +    /* get target level pte */
> +    pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> false);
>      if ( pg_maddr < PAGE_SIZE )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
> @@ -845,7 +864,7 @@ static int dma_pte_clear_one(struct doma
>      }
> 
>      page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
> -    pte = page + address_level_offset(addr, 1);
> +    pte = &page[address_level_offset(addr, level)];
> 
>      if ( !dma_pte_present(*pte) )
>      {
> @@ -854,14 +873,20 @@ static int dma_pte_clear_one(struct doma
>          return 0;
>      }
> 
> +    old = *pte;
>      dma_clear_pte(*pte);
> -    *flush_flags |= IOMMU_FLUSHF_modified;
> 
>      spin_unlock(&hd->arch.mapping_lock);
>      iommu_sync_cache(pte, sizeof(struct dma_pte));
> 
>      unmap_vtd_domain_page(page);
> 
> +    *flush_flags |= IOMMU_FLUSHF_modified;
> +
> +    if ( order && !dma_pte_superpage(old) )
> +        queue_free_pt(domain, maddr_to_mfn(dma_pte_addr(old)),
> +                      order / LEVEL_STRIDE);
> +
>      return 0;
>  }
> 
> @@ -1952,6 +1977,7 @@ static int __must_check intel_iommu_map_
>      struct domain_iommu *hd = dom_iommu(d);
>      struct dma_pte *page, *pte, old, new = {};
>      u64 pg_maddr;
> +    unsigned int level = (IOMMUF_order(flags) / LEVEL_STRIDE) + 1;
>      int rc = 0;
> 
>      /* Do nothing if VT-d shares EPT page table */
> @@ -1976,7 +2002,7 @@ static int __must_check intel_iommu_map_
>          return 0;
>      }
> 
> -    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1,
> flush_flags,
> +    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), level,
> flush_flags,
>                                        true);
>      if ( pg_maddr < PAGE_SIZE )
>      {
> @@ -1985,13 +2011,15 @@ static int __must_check intel_iommu_map_
>      }
> 
>      page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
> -    pte = &page[dfn_x(dfn) & LEVEL_MASK];
> +    pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
>      old = *pte;
> 
>      dma_set_pte_addr(new, mfn_to_maddr(mfn));
>      dma_set_pte_prot(new,
>                       ((flags & IOMMUF_readable) ? DMA_PTE_READ  : 0) |
>                       ((flags & IOMMUF_writable) ? DMA_PTE_WRITE : 0));
> +    if ( IOMMUF_order(flags) )
> +        dma_set_pte_superpage(new);
> 
>      /* Set the SNP on leaf page table if Snoop Control available */
>      if ( iommu_snoop )
> @@ -2012,8 +2040,14 @@ static int __must_check intel_iommu_map_
> 
>      *flush_flags |= IOMMU_FLUSHF_added;
>      if ( dma_pte_present(old) )
> +    {
>          *flush_flags |= IOMMU_FLUSHF_modified;
> 
> +        if ( IOMMUF_order(flags) && !dma_pte_superpage(old) )
> +            queue_free_pt(d, maddr_to_mfn(dma_pte_addr(old)),
> +                          IOMMUF_order(flags) / LEVEL_STRIDE);
> +    }
> +
>      return rc;
>  }
> 
> @@ -2370,6 +2404,7 @@ static int __init vtd_setup(void)
>  {
>      struct acpi_drhd_unit *drhd;
>      struct vtd_iommu *iommu;
> +    unsigned int large_sizes = PAGE_SIZE_2M | PAGE_SIZE_1G;
>      int ret;
>      bool reg_inval_supported = true;
> 
> @@ -2412,6 +2447,11 @@ static int __init vtd_setup(void)
>                 cap_sps_2mb(iommu->cap) ? ", 2MB" : "",
>                 cap_sps_1gb(iommu->cap) ? ", 1GB" : "");
> 
> +        if ( !cap_sps_2mb(iommu->cap) )
> +            large_sizes &= ~PAGE_SIZE_2M;
> +        if ( !cap_sps_1gb(iommu->cap) )
> +            large_sizes &= ~PAGE_SIZE_1G;
> +
>  #ifndef iommu_snoop
>          if ( iommu_snoop && !ecap_snp_ctl(iommu->ecap) )
>              iommu_snoop = false;
> @@ -2483,6 +2523,9 @@ static int __init vtd_setup(void)
>      if ( ret )
>          goto error;
> 
> +    ASSERT(iommu_ops.page_sizes == PAGE_SIZE_4K);
> +    iommu_ops.page_sizes |= large_sizes;
> +
>      register_keyhandler('V', vtd_dump_iommu_info, "dump iommu info", 1);
> 
>      return 0;
> @@ -2797,7 +2840,7 @@ static void vtd_dump_page_table_level(pa
>              continue;
> 
>          address = gpa + offset_level_address(i, level);
> -        if ( next_level >= 1 )
> +        if ( next_level && !dma_pte_superpage(*pte) )
>              vtd_dump_page_table_level(dma_pte_addr(*pte), next_level,
>                                        address, indent + 1);
>          else
> 



* RE: [PATCH v3 16/23] IOMMU: fold flush-all hook into "flush one"
  2022-01-10 16:33 ` [PATCH v3 16/23] IOMMU: fold flush-all hook into "flush one" Jan Beulich
@ 2022-01-30  3:38   ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2022-01-30  3:38 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné,
	Roger, Julien Grall, Stefano Stabellini, Volodymyr Babchuk,
	Bertrand Marquis, Rahul Singh

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, January 11, 2022 12:34 AM
> 
> Having a separate flush-all hook has always been puzzling me some. We
> will want to be able to force a full flush via accumulated flush flags
> from the map/unmap functions. Introduce a respective new flag and fold
> all flush handling to use the single remaining hook.
> 
> Note that because of the respective comments in SMMU and IPMMU-VMSA
> code, I've folded the two prior hook functions into one. For SMMU-v3,
> which lacks a comment towards incapable hardware, I've left both
> functions in place on the assumption that selective and full flushes
> will eventually want separating.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> [IPMMU-VMSA and SMMU-V2]
> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> [SMMUv3]
> Reviewed-by: Rahul Singh <rahul.singh@arm.com>
> [Arm]
> Acked-by: Julien Grall <jgrall@amazon.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> TBD: What we really are going to need is for the map/unmap functions to
>      specify that a wider region needs flushing than just the one
>      covered by the present set of (un)maps. This may still be less than
>      a full flush, but at least as a first step it seemed better to me
>      to keep things simple and go the flush-all route.
> ---
> v3: Re-base over changes earlier in the series.
> v2: New.
> 
> --- a/xen/drivers/passthrough/amd/iommu.h
> +++ b/xen/drivers/passthrough/amd/iommu.h
> @@ -255,7 +255,6 @@ int amd_iommu_get_reserved_device_memory
>  int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t
> dfn,
>                                               unsigned long page_count,
>                                               unsigned int flush_flags);
> -int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
>  void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned
> int dev_id,
>                               dfn_t dfn);
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -478,15 +478,18 @@ int amd_iommu_flush_iotlb_pages(struct d
>  {
>      unsigned long dfn_l = dfn_x(dfn);
> 
> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> -    ASSERT(flush_flags);
> +    if ( !(flush_flags & IOMMU_FLUSHF_all) )
> +    {
> +        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> +        ASSERT(flush_flags);
> +    }
> 
>      /* Unless a PTE was modified, no flush is required */
>      if ( !(flush_flags & IOMMU_FLUSHF_modified) )
>          return 0;
> 
> -    /* If the range wraps then just flush everything */
> -    if ( dfn_l + page_count < dfn_l )
> +    /* If so requested or if the range wraps then just flush everything. */
> +    if ( (flush_flags & IOMMU_FLUSHF_all) || dfn_l + page_count < dfn_l )
>      {
>          amd_iommu_flush_all_pages(d);
>          return 0;
> @@ -511,13 +514,6 @@ int amd_iommu_flush_iotlb_pages(struct d
> 
>      return 0;
>  }
> -
> -int amd_iommu_flush_iotlb_all(struct domain *d)
> -{
> -    amd_iommu_flush_all_pages(d);
> -
> -    return 0;
> -}
> 
>  int amd_iommu_reserve_domain_unity_map(struct domain *d,
>                                         const struct ivrs_unity_map *map,
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -642,7 +642,6 @@ static const struct iommu_ops __initcons
>      .map_page = amd_iommu_map_page,
>      .unmap_page = amd_iommu_unmap_page,
>      .iotlb_flush = amd_iommu_flush_iotlb_pages,
> -    .iotlb_flush_all = amd_iommu_flush_iotlb_all,
>      .reassign_device = reassign_device,
>      .get_device_group_id = amd_iommu_group_id,
>      .enable_x2apic = iov_enable_xt,
> --- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> +++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> @@ -930,13 +930,19 @@ out:
>  }
> 
>  /* Xen IOMMU ops */
> -static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
> +static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
> +                                          unsigned long page_count,
> +                                          unsigned int flush_flags)
>  {
>      struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)-
> >arch.priv;
> 
> +    ASSERT(flush_flags);
> +
>      if ( !xen_domain || !xen_domain->root_domain )
>          return 0;
> 
> +    /* The hardware doesn't support selective TLB flush. */
> +
>      spin_lock(&xen_domain->lock);
>      ipmmu_tlb_invalidate(xen_domain->root_domain);
>      spin_unlock(&xen_domain->lock);
> @@ -944,16 +950,6 @@ static int __must_check ipmmu_iotlb_flus
>      return 0;
>  }
> 
> -static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
> -                                          unsigned long page_count,
> -                                          unsigned int flush_flags)
> -{
> -    ASSERT(flush_flags);
> -
> -    /* The hardware doesn't support selective TLB flush. */
> -    return ipmmu_iotlb_flush_all(d);
> -}
> -
>  static struct ipmmu_vmsa_domain *ipmmu_get_cache_domain(struct
> domain *d,
>                                                          struct device *dev)
>  {
> @@ -1303,7 +1299,6 @@ static const struct iommu_ops ipmmu_iomm
>      .hwdom_init      = ipmmu_iommu_hwdom_init,
>      .teardown        = ipmmu_iommu_domain_teardown,
>      .iotlb_flush     = ipmmu_iotlb_flush,
> -    .iotlb_flush_all = ipmmu_iotlb_flush_all,
>      .assign_device   = ipmmu_assign_device,
>      .reassign_device = ipmmu_reassign_device,
>      .map_page        = arm_iommu_map_page,
> --- a/xen/drivers/passthrough/arm/smmu.c
> +++ b/xen/drivers/passthrough/arm/smmu.c
> @@ -2649,11 +2649,17 @@ static int force_stage = 2;
>   */
>  static u32 platform_features = ARM_SMMU_FEAT_COHERENT_WALK;
> 
> -static int __must_check arm_smmu_iotlb_flush_all(struct domain *d)
> +static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
> +					     unsigned long page_count,
> +					     unsigned int flush_flags)
>  {
>  	struct arm_smmu_xen_domain *smmu_domain = dom_iommu(d)-
> >arch.priv;
>  	struct iommu_domain *cfg;
> 
> +	ASSERT(flush_flags);
> +
> +	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
> +
>  	spin_lock(&smmu_domain->lock);
>  	list_for_each_entry(cfg, &smmu_domain->contexts, list) {
>  		/*
> @@ -2670,16 +2676,6 @@ static int __must_check arm_smmu_iotlb_f
>  	return 0;
>  }
> 
> -static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
> -					     unsigned long page_count,
> -					     unsigned int flush_flags)
> -{
> -	ASSERT(flush_flags);
> -
> -	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
> -	return arm_smmu_iotlb_flush_all(d);
> -}
> -
>  static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
>  						struct device *dev)
>  {
> @@ -2879,7 +2875,6 @@ static const struct iommu_ops arm_smmu_i
>      .add_device = arm_smmu_dt_add_device_generic,
>      .teardown = arm_smmu_iommu_domain_teardown,
>      .iotlb_flush = arm_smmu_iotlb_flush,
> -    .iotlb_flush_all = arm_smmu_iotlb_flush_all,
>      .assign_device = arm_smmu_assign_dev,
>      .reassign_device = arm_smmu_reassign_dev,
>      .map_page = arm_iommu_map_page,
> --- a/xen/drivers/passthrough/arm/smmu-v3.c
> +++ b/xen/drivers/passthrough/arm/smmu-v3.c
> @@ -3431,7 +3431,6 @@ static const struct iommu_ops arm_smmu_i
>  	.hwdom_init		= arm_smmu_iommu_hwdom_init,
>  	.teardown		=
> arm_smmu_iommu_xen_domain_teardown,
>  	.iotlb_flush		= arm_smmu_iotlb_flush,
> -	.iotlb_flush_all	= arm_smmu_iotlb_flush_all,
>  	.assign_device		= arm_smmu_assign_dev,
>  	.reassign_device	= arm_smmu_reassign_dev,
>  	.map_page		= arm_iommu_map_page,
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -455,15 +455,12 @@ int iommu_iotlb_flush_all(struct domain
>      const struct domain_iommu *hd = dom_iommu(d);
>      int rc;
> 
> -    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush_all ||
> +    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush ||
>           !flush_flags )
>          return 0;
> 
> -    /*
> -     * The operation does a full flush so we don't need to pass the
> -     * flush_flags in.
> -     */
> -    rc = iommu_call(hd->platform_ops, iotlb_flush_all, d);
> +    rc = iommu_call(hd->platform_ops, iotlb_flush, d, INVALID_DFN, 0,
> +                    flush_flags | IOMMU_FLUSHF_all);
>      if ( unlikely(rc) )
>      {
>          if ( !d->is_shutting_down && printk_ratelimit() )
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -814,18 +814,21 @@ static int __must_check iommu_flush_iotl
>                                                  unsigned long page_count,
>                                                  unsigned int flush_flags)
>  {
> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> -    ASSERT(flush_flags);
> +    if ( flush_flags & IOMMU_FLUSHF_all )
> +    {
> +        dfn = INVALID_DFN;
> +        page_count = 0;
> +    }
> +    else
> +    {
> +        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> +        ASSERT(flush_flags);
> +    }
> 
>      return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
>                               page_count);
>  }
> 
> -static int __must_check iommu_flush_iotlb_all(struct domain *d)
> -{
> -    return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
> -}
> -
>  static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int level)
>  {
>      if ( level > 1 )
> @@ -2928,7 +2931,7 @@ static int __init intel_iommu_quarantine
>      spin_unlock(&hd->arch.mapping_lock);
> 
>      if ( !rc )
> -        rc = iommu_flush_iotlb_all(d);
> +        rc = iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
> 
>      /* Pages may be leaked in failure case */
>      return rc;
> @@ -2961,7 +2964,6 @@ static struct iommu_ops __initdata vtd_o
>      .resume = vtd_resume,
>      .crash_shutdown = vtd_crash_shutdown,
>      .iotlb_flush = iommu_flush_iotlb_pages,
> -    .iotlb_flush_all = iommu_flush_iotlb_all,
>      .get_reserved_device_memory =
> intel_iommu_get_reserved_device_memory,
>      .dump_page_tables = vtd_dump_page_tables,
>  };
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -147,9 +147,11 @@ enum
>  {
>      _IOMMU_FLUSHF_added,
>      _IOMMU_FLUSHF_modified,
> +    _IOMMU_FLUSHF_all,
>  };
>  #define IOMMU_FLUSHF_added (1u << _IOMMU_FLUSHF_added)
>  #define IOMMU_FLUSHF_modified (1u << _IOMMU_FLUSHF_modified)
> +#define IOMMU_FLUSHF_all (1u << _IOMMU_FLUSHF_all)
> 
>  int __must_check iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
>                             unsigned long page_count, unsigned int flags,
> @@ -282,7 +284,6 @@ struct iommu_ops {
>      int __must_check (*iotlb_flush)(struct domain *d, dfn_t dfn,
>                                      unsigned long page_count,
>                                      unsigned int flush_flags);
> -    int __must_check (*iotlb_flush_all)(struct domain *d);
>      int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
>      void (*dump_page_tables)(struct domain *d);
> 



* Re: [PATCH v3 02/23] VT-d: have callers specify the target level for page table walks
  2022-01-30  3:17   ` Tian, Kevin
@ 2022-01-31 10:04     ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2022-01-31 10:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger, xen-devel

On 30.01.2022 04:17, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, January 11, 2022 12:23 AM
>>
>> In order to be able to insert/remove super-pages we need to allow
>> callers of the walking function to specify at which point to stop the
>> walk.
>>
>> For intel_iommu_lookup_page() integrate the last level access into
>> the main walking function.
>>
>> dma_pte_clear_one() gets only partly adjusted for now: Error handling
>> and order parameter get put in place, but the order parameter remains
>> ignored (just like intel_iommu_map_page()'s order part of the flags).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I was actually wondering whether it wouldn't make sense to integrate
>> dma_pte_clear_one() into its only caller intel_iommu_unmap_page(), for
>> better symmetry with intel_iommu_map_page().
> 
> I think it's the right thing to do. It was there because it had multiple
> callers when first introduced. But now, given there is only one caller,
> merging it with that caller for symmetry makes sense.

I guess I'll make this a separate change towards the end of this series
now, to save me from some rebasing of other patches.

> with or without that change (given it's simple):
> 
> 	Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Thanks.

Jan




* RE: [PATCH v3 17/23] IOMMU/x86: prefill newly allocate page tables
  2022-01-10 16:34 ` [PATCH v3 17/23] IOMMU/x86: prefill newly allocate page tables Jan Beulich
@ 2022-02-18  5:01   ` Tian, Kevin
  2022-02-18  8:24     ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2022-02-18  5:01 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, January 11, 2022 12:35 AM
> 
> Page tables are used for two purposes after allocation: They either
> start out all empty, or they get filled to replace a superpage.
> Subsequently, to replace all empty or fully contiguous page tables,
> contiguous sub-regions will be recorded within individual page tables.
> Install the initial set of markers immediately after allocation. Make
> sure to retain these markers when further populating a page table in
> preparation for it to replace a superpage.
> 
> The markers are simply 4-bit fields holding the order value of
> contiguous entries. To demonstrate this, if a page table had just 16
> entries, this would be the initial (fully contiguous) set of markers:
> 
> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
> 
> "Contiguous" here means not only present entries with successively
> increasing MFNs, each one suitably aligned for its slot, but also a
> respective number of all non-present entries.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>, with a nit:

> @@ -478,7 +478,28 @@ struct page_info *iommu_alloc_pgtable(st
>          return NULL;
> 
>      p = __map_domain_page(pg);
> -    clear_page(p);
> +
> +    if ( contig_mask )
> +    {
> +        unsigned int i, shift = find_first_set_bit(contig_mask);
> +
> +        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
> +
> +        p[0] = (PAGE_SHIFT - 3ull) << shift;
> +        p[1] = 0;
> +        p[2] = 1ull << shift;
> +        p[3] = 0;
> +
> +        for ( i = 4; i < PAGE_SIZE / 8; i += 4 )
> +        {
> +            p[i + 0] = (find_first_set_bit(i) + 0ull) << shift;
> +            p[i + 1] = 0;
> +            p[i + 2] = 1ull << shift;
> +            p[i + 3] = 0;
> +        }

some comment similar to what commit msg describes can improve
the readability here.

Thanks
Kevin
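
As an aside, the initial marker pattern described in the quoted commit message
is easy to reproduce with a few lines of standalone C (not code from the
series): marker 0 holds the table's total order, while every other marker
holds the number of trailing zero bits of its index. The sketch uses the
GCC/Clang __builtin_ctz() builtin.

#include <stdio.h>

int main(void)
{
    const unsigned int order = 4;              /* 2^4 = 16-entry toy table */

    for ( unsigned int i = 0; i < (1u << order); ++i )
    {
        /* Slot 0 starts out covering the whole table; any other slot
         * covers the largest naturally aligned run starting there. */
        unsigned int marker = i ? (unsigned int)__builtin_ctz(i) : order;

        printf("%X:%u ", i, marker);
    }
    printf("\n");  /* 0:4 1:0 2:1 3:0 4:2 5:0 6:1 7:0 8:3 9:0 A:1 B:0 C:2 D:0 E:1 F:0 */
    return 0;
}

The quoted loop produces the same pattern for a 512-entry table (order
PAGE_SHIFT - 3 = 9), writing four entries per iteration and shifting each
value into the PTE bit field selected by contig_mask.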


* RE: [PATCH v3 20/23] VT-d: free all-empty page tables
  2022-01-10 16:36 ` [PATCH v3 20/23] VT-d: " Jan Beulich
@ 2022-02-18  5:20   ` Tian, Kevin
  2022-02-18  8:31     ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2022-02-18  5:20 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, January 11, 2022 12:36 AM
> 
> When a page table ends up with no present entries left, it can be
> replaced by a non-present entry at the next higher level. The page table
> itself can then be scheduled for freeing.
> 
> Note that while its output isn't used there yet,
> pt_update_contig_markers() right away needs to be called in all places
> where entries get updated, not just the one where entries get cleared.
> 
> Note further that while pt_update_contig_markers() updates perhaps
> several PTEs within the table, since these are changes to "avail" bits
> only I do not think that cache flushing would be needed afterwards. Such
> cache flushing (of entire pages, unless adding yet more logic to be more
> selective) would be quite noticeable performance-wise (very prominent
> during Dom0 boot).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: Properly bound loop. Re-base over changes earlier in the series.
> v2: New.
> ---
> The hang during boot on my Latitude E6410 (see the respective code
> comment) was pretty close after iommu_enable_translation(). No errors,
> no watchdog would kick in, just sometimes the first few pixel lines of
> the next log message's (XEN) prefix would have made it out to the screen
> (and there's no serial there). It's been a lot of experimenting until I
> figured the workaround (which I consider ugly, but halfway acceptable).
> I've been trying hard to make sure the workaround wouldn't be masking a
> real issue, yet I'm still wary of it possibly doing so ... My best guess
> at this point is that on these old IOMMUs the ignored bits 52...61
> aren't really ignored for present entries, but also aren't "reserved"
> enough to trigger faults. This guess is from having tried to set other

Is this machine able to capture any VT-d faults before? If yes maybe
you will observe more information if trying to tweak those bits at a later
time (instead of when iommu is enabled)?

Thanks
Kevin
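
To make the quoted description concrete, here is a small standalone model
(not Xen code) of the collapse step: once a table holds no present entries
any more, the referencing entry one level up is cleared and the table can be
freed. The series itself gets this information from the contiguity-marker
bookkeeping rather than rescanning the table as this toy does; the PRESENT
bit value and the helper names below are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PT_ENTRIES  512            /* 512 64-bit entries per table, as on VT-d */
#define PTE_PRESENT 1ull           /* assumed present bit, for this toy only */

static bool pt_is_empty(const uint64_t pt[PT_ENTRIES])
{
    for ( unsigned int i = 0; i < PT_ENTRIES; ++i )
        if ( pt[i] & PTE_PRESENT )
            return false;
    return true;
}

/* After clearing an entry in 'pt', try to collapse it into its parent. */
static bool maybe_collapse(uint64_t *parent_slot, uint64_t pt[PT_ENTRIES])
{
    if ( !pt_is_empty(pt) )
        return false;

    *parent_slot = 0;              /* non-present entry at the next level up */
    /* The real code would also accumulate IOMMU_FLUSHF_all and queue the
     * now-unreferenced table page for (deferred) freeing. */
    return true;
}

int main(void)
{
    static uint64_t table[PT_ENTRIES];         /* starts out all non-present */
    uint64_t parent = 0x1000 | PTE_PRESENT;    /* fake reference to 'table' */

    printf("collapsed: %d\n", maybe_collapse(&parent, table));  /* prints 1 */
    return 0;
}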


* RE: [PATCH v3 22/23] VT-d: replace all-contiguous page tables by superpage mappings
  2022-01-10 16:38 ` [PATCH v3 22/23] VT-d: " Jan Beulich
@ 2022-02-18  5:22   ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2022-02-18  5:22 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, January 11, 2022 12:38 AM
> 
> When a page table ends up with all contiguous entries (including all
> identical attributes), it can be replaced by a superpage entry at the
> next higher level. The page table itself can then be scheduled for
> freeing.
> 
> The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
> for whenever we (and obviously hardware) start supporting 512G mappings.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Unlike the freeing of all-empty page tables, this causes quite a bit of
> back and forth for PV domains, due to their mapping/unmapping of pages
> when they get converted to/from being page tables. It may therefore be
> worth considering to delay re-coalescing a little, to avoid doing so
> when the superpage would otherwise get split again pretty soon. But I
> think this would better be the subject of a separate change anyway.
> 

Agree, thus:

	Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> Of course this could also be helped by more "aware" kernel side
> behavior: They could avoid immediately mapping freed page tables
> writable again, in anticipation of re-using that same page for another
> page table elsewhere.
> ---
> v3: New.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2071,14 +2071,35 @@ static int __must_check intel_iommu_map_
>       * While the (ab)use of PTE_kind_table here allows to save some work in
>       * the function, the main motivation for it is that it avoids a so far
>       * unexplained hang during boot (while preparing Dom0) on a Westmere
> -     * based laptop.
> +     * based laptop.  This also has the intended effect of terminating the
> +     * loop when super pages aren't supported anymore at the next level.
>       */
> -    pt_update_contig_markers(&page->val,
> -                             address_level_offset(dfn_to_daddr(dfn), level),
> -                             level,
> -                             (hd->platform_ops->page_sizes &
> -                              (1UL << level_to_offset_bits(level + 1))
> -                              ? PTE_kind_leaf : PTE_kind_table));
> +    while ( pt_update_contig_markers(&page->val,
> +                                     address_level_offset(dfn_to_daddr(dfn), level),
> +                                     level,
> +                                     (hd->platform_ops->page_sizes &
> +                                      (1UL << level_to_offset_bits(level + 1))
> +                                       ? PTE_kind_leaf : PTE_kind_table)) )
> +    {
> +        struct page_info *pg = maddr_to_page(pg_maddr);
> +
> +        unmap_vtd_domain_page(page);
> +
> +        new.val &= ~(LEVEL_MASK << level_to_offset_bits(level));
> +        dma_set_pte_superpage(new);
> +
> +        pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), ++level,
> +                                          flush_flags, false);
> +        BUG_ON(pg_maddr < PAGE_SIZE);
> +
> +        page = map_vtd_domain_page(pg_maddr);
> +        pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
> +        *pte = new;
> +        iommu_sync_cache(pte, sizeof(*pte));
> +
> +        *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
> +        iommu_queue_free_pgtable(d, pg);
> +    }
> 
>      spin_unlock(&hd->arch.mapping_lock);
>      unmap_vtd_domain_page(page);
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -229,7 +229,7 @@ struct context_entry {
> 
>  /* page table handling */
>  #define LEVEL_STRIDE       (9)
> -#define LEVEL_MASK         ((1 << LEVEL_STRIDE) - 1)
> +#define LEVEL_MASK         (PTE_NUM - 1UL)
>  #define PTE_NUM            (1 << LEVEL_STRIDE)
>  #define level_to_agaw(val) ((val) - 2)
>  #define agaw_to_level(val) ((val) + 2)

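On the LEVEL_MASK change at the end of the quoted patch: as the description
says, the old definition leaves a latent trap for 512G (level 4) mappings.
A standalone illustration (not Xen code), assuming the usual
12 + 9 * (level - 1) offset-bit layout and a 64-bit unsigned long as in the
hypervisor:

#include <stdio.h>

#define LEVEL_STRIDE 9
#define PTE_NUM      (1 << LEVEL_STRIDE)

int main(void)
{
    unsigned int offset_bits = 12 + (4 - 1) * LEVEL_STRIDE;   /* level 4: 39 */

    /* Old definition: ((1 << LEVEL_STRIDE) - 1) has type int, so shifting
     * it by 39 is undefined behaviour with a 32-bit int. */
    /* unsigned long bad = ((1 << LEVEL_STRIDE) - 1) << offset_bits; */

    /* New definition: PTE_NUM - 1UL is unsigned long, so the shift is fine. */
    unsigned long mask = (PTE_NUM - 1UL) << offset_bits;

    printf("%#lx\n", mask);   /* 0xff8000000000 */
    return 0;
}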


* RE: [PATCH v3 23/23] IOMMU/x86: add perf counters for page table splitting / coalescing
  2022-01-10 16:38 ` [PATCH v3 23/23] IOMMU/x86: add perf counters for page table splitting / coalescing Jan Beulich
@ 2022-02-18  5:23   ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2022-02-18  5:23 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, January 11, 2022 12:39 AM
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v3: New.
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -283,6 +283,8 @@ static int iommu_pde_from_dfn(struct dom
>                                       level, PTE_kind_table);
> 
>              *flush_flags |= IOMMU_FLUSHF_modified;
> +
> +            perfc_incr(iommu_pt_shatters);
>          }
> 
>          /* Install lower level page table for non-present entries */
> @@ -411,6 +413,7 @@ int amd_iommu_map_page(struct domain *d,
>                                flags & IOMMUF_readable, &contig);
>          *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
>          iommu_queue_free_pgtable(d, pg);
> +        perfc_incr(iommu_pt_coalesces);
>      }
> 
>      spin_unlock(&hd->arch.mapping_lock);
> @@ -471,6 +474,7 @@ int amd_iommu_unmap_page(struct domain *
>              clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
>              *flush_flags |= IOMMU_FLUSHF_all;
>              iommu_queue_free_pgtable(d, pg);
> +            perfc_incr(iommu_pt_coalesces);
>          }
>      }
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -451,6 +451,8 @@ static uint64_t addr_to_dma_page_maddr(s
> 
>                  if ( flush_flags )
>                      *flush_flags |= IOMMU_FLUSHF_modified;
> +
> +                perfc_incr(iommu_pt_shatters);
>              }
> 
>              write_atomic(&pte->val, new_pte.val);
> @@ -907,6 +909,7 @@ static int dma_pte_clear_one(struct doma
> 
>          *flush_flags |= IOMMU_FLUSHF_all;
>          iommu_queue_free_pgtable(domain, pg);
> +        perfc_incr(iommu_pt_coalesces);
>      }
> 
>      spin_unlock(&hd->arch.mapping_lock);
> @@ -2099,6 +2102,7 @@ static int __must_check intel_iommu_map_
> 
>          *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
>          iommu_queue_free_pgtable(d, pg);
> +        perfc_incr(iommu_pt_coalesces);
>      }
> 
>      spin_unlock(&hd->arch.mapping_lock);
> --- a/xen/arch/x86/include/asm/perfc_defn.h
> +++ b/xen/arch/x86/include/asm/perfc_defn.h
> @@ -125,4 +125,7 @@ PERFCOUNTER(realmode_exits,      "vmexit
> 
>  PERFCOUNTER(pauseloop_exits, "vmexits from Pause-Loop Detection")
> 
> +PERFCOUNTER(iommu_pt_shatters,    "IOMMU page table shatters")
> +PERFCOUNTER(iommu_pt_coalesces,   "IOMMU page table coalesces")
> +
>  /*#endif*/ /* __XEN_PERFC_DEFN_H__ */



* Re: [PATCH v3 17/23] IOMMU/x86: prefill newly allocate page tables
  2022-02-18  5:01   ` Tian, Kevin
@ 2022-02-18  8:24     ` Jan Beulich
  2022-02-18  8:26       ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2022-02-18  8:24 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger, xen-devel

On 18.02.2022 06:01, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, January 11, 2022 12:35 AM
>>
>> Page tables are used for two purposes after allocation: They either
>> start out all empty, or they get filled to replace a superpage.
>> Subsequently, to replace all empty or fully contiguous page tables,
>> contiguous sub-regions will be recorded within individual page tables.
>> Install the initial set of markers immediately after allocation. Make
>> sure to retain these markers when further populating a page table in
>> preparation for it to replace a superpage.
>>
>> The markers are simply 4-bit fields holding the order value of
>> contiguous entries. To demonstrate this, if a page table had just 16
>> entries, this would be the initial (fully contiguous) set of markers:
>>
>> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
>> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
>>
>> "Contiguous" here means not only present entries with successively
>> increasing MFNs, each one suitably aligned for its slot, but also a
>> respective number of all non-present entries.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>, with a nit:

Thanks.

>> @@ -478,7 +478,28 @@ struct page_info *iommu_alloc_pgtable(st
>>          return NULL;
>>
>>      p = __map_domain_page(pg);
>> -    clear_page(p);
>> +
>> +    if ( contig_mask )
>> +    {
>> +        unsigned int i, shift = find_first_set_bit(contig_mask);
>> +
>> +        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
>> +
>> +        p[0] = (PAGE_SHIFT - 3ull) << shift;
>> +        p[1] = 0;
>> +        p[2] = 1ull << shift;
>> +        p[3] = 0;
>> +
>> +        for ( i = 4; i < PAGE_SIZE / 8; i += 4 )
>> +        {
>> +            p[i + 0] = (find_first_set_bit(i) + 0ull) << shift;
>> +            p[i + 1] = 0;
>> +            p[i + 2] = 1ull << shift;
>> +            p[i + 3] = 0;
>> +        }
> 
> some comment similar to what commit msg describes can improve
> the readability here.

I wouldn't want to replicate what pt-contig-markers.h describes, so
maybe a comment referring there would do?

Jan




* RE: [PATCH v3 17/23] IOMMU/x86: prefill newly allocate page tables
  2022-02-18  8:24     ` Jan Beulich
@ 2022-02-18  8:26       ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2022-02-18  8:26 UTC (permalink / raw)
  To: Beulich, Jan
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger, xen-devel

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, February 18, 2022 4:25 PM
> 
> On 18.02.2022 06:01, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Tuesday, January 11, 2022 12:35 AM
> >>
> >> Page tables are used for two purposes after allocation: They either
> >> start out all empty, or they get filled to replace a superpage.
> >> Subsequently, to replace all empty or fully contiguous page tables,
> >> contiguous sub-regions will be recorded within individual page tables.
> >> Install the initial set of markers immediately after allocation. Make
> >> sure to retain these markers when further populating a page table in
> >> preparation for it to replace a superpage.
> >>
> >> The markers are simply 4-bit fields holding the order value of
> >> contiguous entries. To demonstrate this, if a page table had just 16
> >> entries, this would be the initial (fully contiguous) set of markers:
> >>
> >> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
> >> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
> >>
> >> "Contiguous" here means not only present entries with successively
> >> increasing MFNs, each one suitably aligned for its slot, but also a
> >> respective number of all non-present entries.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >
> > Reviewed-by: Kevin Tian <kevin.tian@intel.com>, with a nit:
> 
> Thanks.
> 
> >> @@ -478,7 +478,28 @@ struct page_info *iommu_alloc_pgtable(st
> >>          return NULL;
> >>
> >>      p = __map_domain_page(pg);
> >> -    clear_page(p);
> >> +
> >> +    if ( contig_mask )
> >> +    {
> >> +        unsigned int i, shift = find_first_set_bit(contig_mask);
> >> +
> >> +        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
> >> +
> >> +        p[0] = (PAGE_SHIFT - 3ull) << shift;
> >> +        p[1] = 0;
> >> +        p[2] = 1ull << shift;
> >> +        p[3] = 0;
> >> +
> >> +        for ( i = 4; i < PAGE_SIZE / 8; i += 4 )
> >> +        {
> >> +            p[i + 0] = (find_first_set_bit(i) + 0ull) << shift;
> >> +            p[i + 1] = 0;
> >> +            p[i + 2] = 1ull << shift;
> >> +            p[i + 3] = 0;
> >> +        }
> >
> > some comment similar to what commit msg describes can improve
> > the readability here.
> 
> I wouldn't want to replicate what pt-contig-markers.h describes, so
> maybe a comment referring there would do?
> 

sounds good.


* Re: [PATCH v3 20/23] VT-d: free all-empty page tables
  2022-02-18  5:20   ` Tian, Kevin
@ 2022-02-18  8:31     ` Jan Beulich
  2022-03-14  4:01       ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2022-02-18  8:31 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger, xen-devel

On 18.02.2022 06:20, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, January 11, 2022 12:36 AM
>>
>> When a page table ends up with no present entries left, it can be
>> replaced by a non-present entry at the next higher level. The page table
>> itself can then be scheduled for freeing.
>>
>> Note that while its output isn't used there yet,
>> pt_update_contig_markers() right away needs to be called in all places
>> where entries get updated, not just the one where entries get cleared.
>>
>> Note further that while pt_update_contig_markers() updates perhaps
>> several PTEs within the table, since these are changes to "avail" bits
>> only I do not think that cache flushing would be needed afterwards. Such
>> cache flushing (of entire pages, unless adding yet more logic to be more
>> selective) would be quite noticeable performance-wise (very prominent
>> during Dom0 boot).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v3: Properly bound loop. Re-base over changes earlier in the series.
>> v2: New.
>> ---
>> The hang during boot on my Latitude E6410 (see the respective code
>> comment) was pretty close after iommu_enable_translation(). No errors,
>> no watchdog would kick in, just sometimes the first few pixel lines of
>> the next log message's (XEN) prefix would have made it out to the screen
>> (and there's no serial there). It's been a lot of experimenting until I
>> figured the workaround (which I consider ugly, but halfway acceptable).
>> I've been trying hard to make sure the workaround wouldn't be masking a
>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>> at this point is that on these old IOMMUs the ignored bits 52...61
>> aren't really ignored for present entries, but also aren't "reserved"
>> enough to trigger faults. This guess is from having tried to set other
> 
> Is this machine able to capture any VT-d faults before?

Not sure what you mean here. I don't think I can trigger any I/O at this
point in time, and hence I also couldn't try to trigger a fault. But if
the question is whether fault reporting at this time actually works,
then yes, I think so: This is during Dom0 construction, i.e. late enough
for fault reporting to be fully set up and enabled.

Jan

> If yes maybe
> you will observe more information if trying to tweak those bits at a later
> time (instead of when iommu is enabled)?
> 
> Thanks
> Kevin




* RE: [PATCH v3 20/23] VT-d: free all-empty page tables
  2022-02-18  8:31     ` Jan Beulich
@ 2022-03-14  4:01       ` Tian, Kevin
  2022-03-14  7:33         ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2022-03-14  4:01 UTC (permalink / raw)
  To: Beulich, Jan
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger, xen-devel

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, February 18, 2022 4:31 PM
> 
> On 18.02.2022 06:20, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Tuesday, January 11, 2022 12:36 AM
> >>
> >> When a page table ends up with no present entries left, it can be
> >> replaced by a non-present entry at the next higher level. The page table
> >> itself can then be scheduled for freeing.
> >>
> >> Note that while its output isn't used there yet,
> >> pt_update_contig_markers() right away needs to be called in all places
> >> where entries get updated, not just the one where entries get cleared.
> >>
> >> Note further that while pt_update_contig_markers() updates perhaps
> >> several PTEs within the table, since these are changes to "avail" bits
> >> only I do not think that cache flushing would be needed afterwards. Such
> >> cache flushing (of entire pages, unless adding yet more logic to be more
> >> selective) would be quite noticeable performance-wise (very prominent
> >> during Dom0 boot).
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> v3: Properly bound loop. Re-base over changes earlier in the series.
> >> v2: New.
> >> ---
> >> The hang during boot on my Latitude E6410 (see the respective code
> >> comment) was pretty close after iommu_enable_translation(). No errors,
> >> no watchdog would kick in, just sometimes the first few pixel lines of
> >> the next log message's (XEN) prefix would have made it out to the screen
> >> (and there's no serial there). It's been a lot of experimenting until I
> >> figured the workaround (which I consider ugly, but halfway acceptable).
> >> I've been trying hard to make sure the workaround wouldn't be masking a
> >> real issue, yet I'm still wary of it possibly doing so ... My best guess
> >> at this point is that on these old IOMMUs the ignored bits 52...61
> >> aren't really ignored for present entries, but also aren't "reserved"
> >> enough to trigger faults. This guess is from having tried to set other
> >
> > Is this machine able to capture any VT-d faults before?
> 
> Not sure what you mean here. I don't think I can trigger any I/O at this
> point in time, and hence I also couldn't try to trigger a fault. But if
> the question is whether fault reporting at this time actually works,
> then yes, I think so: This is during Dom0 construction, i.e. late enough
> for fault reporting to be fully set up and enabled.
> 

My point was that, with your guess that the ignored bits are not
really ignored, some VT-d faults should be triggered. If the reason
you cannot observe such faults is that they happen too early for
much to be shown on the screen, then trying to set those bits at a
much later point might get more shown, to verify your guess.

btw any progress since last post? How urgent do you want this
feature in (compared to the issue that it may paper over)? 

Thanks
Kevin


* Re: [PATCH v3 20/23] VT-d: free all-empty page tables
  2022-03-14  4:01       ` Tian, Kevin
@ 2022-03-14  7:33         ` Jan Beulich
  2022-03-17  5:55           ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2022-03-14  7:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger, xen-devel

On 14.03.2022 05:01, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Friday, February 18, 2022 4:31 PM
>>
>> On 18.02.2022 06:20, Tian, Kevin wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Tuesday, January 11, 2022 12:36 AM
>>>>
>>>> When a page table ends up with no present entries left, it can be
>>>> replaced by a non-present entry at the next higher level. The page table
>>>> itself can then be scheduled for freeing.
>>>>
>>>> Note that while its output isn't used there yet,
>>>> pt_update_contig_markers() right away needs to be called in all places
>>>> where entries get updated, not just the one where entries get cleared.
>>>>
>>>> Note further that while pt_update_contig_markers() updates perhaps
>>>> several PTEs within the table, since these are changes to "avail" bits
>>>> only I do not think that cache flushing would be needed afterwards. Such
>>>> cache flushing (of entire pages, unless adding yet more logic to be more
>>>> selective) would be quite noticeable performance-wise (very prominent
>>>> during Dom0 boot).
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> v3: Properly bound loop. Re-base over changes earlier in the series.
>>>> v2: New.
>>>> ---
>>>> The hang during boot on my Latitude E6410 (see the respective code
>>>> comment) was pretty close after iommu_enable_translation(). No errors,
>>>> no watchdog would kick in, just sometimes the first few pixel lines of
>>>> the next log message's (XEN) prefix would have made it out to the screen
>>>> (and there's no serial there). It's been a lot of experimenting until I
>>>> figured the workaround (which I consider ugly, but halfway acceptable).
>>>> I've been trying hard to make sure the workaround wouldn't be masking a
>>>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>>>> at this point is that on these old IOMMUs the ignored bits 52...61
>>>> aren't really ignored for present entries, but also aren't "reserved"
>>>> enough to trigger faults. This guess is from having tried to set other
>>>
>>> Is this machine able to capture any VT-d faults before?
>>
>> Not sure what you mean here. I don't think I can trigger any I/O at this
>> point in time, and hence I also couldn't try to trigger a fault. But if
>> the question is whether fault reporting at this time actually works,
>> then yes, I think so: This is during Dom0 construction, i.e. late enough
>> for fault reporting to be fully set up and enabled.
>>
> 
> My point was that, with your guess that the ignored bits are not
> really ignored, some VT-d faults should be triggered. If the reason
> you cannot observe such faults is that they happen too early for
> much to be shown on the screen, then trying to set those bits at a
> much later point might get more shown, to verify your guess.

Pretty clearly there aren't any faults. And in fact my suspicion is
that the bits are used for addressing memory, and then the memory
access hangs (doesn't complete).

> btw any progress since last post? How urgent do you want this
> feature in (compared to the issue that it may paper over)? 

Well, one way or another the issue needs to be dealt with for this
series to eventually go in. To be honest I hadn't expected that it
would still be pending ...

Jan




* RE: [PATCH v3 20/23] VT-d: free all-empty page tables
  2022-03-14  7:33         ` Jan Beulich
@ 2022-03-17  5:55           ` Tian, Kevin
  2022-03-17  8:55             ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2022-03-17  5:55 UTC (permalink / raw)
  To: Beulich, Jan
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger, xen-devel

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 14, 2022 3:33 PM
> 
> On 14.03.2022 05:01, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Friday, February 18, 2022 4:31 PM
> >>
> >> On 18.02.2022 06:20, Tian, Kevin wrote:
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: Tuesday, January 11, 2022 12:36 AM
> >>>>
> >>>> When a page table ends up with no present entries left, it can be
> >>>> replaced by a non-present entry at the next higher level. The page table
> >>>> itself can then be scheduled for freeing.
> >>>>
> >>>> Note that while its output isn't used there yet,
> >>>> pt_update_contig_markers() right away needs to be called in all places
> >>>> where entries get updated, not just the one where entries get cleared.
> >>>>
> >>>> Note further that while pt_update_contig_markers() updates perhaps
> >>>> several PTEs within the table, since these are changes to "avail" bits
> >>>> only I do not think that cache flushing would be needed afterwards. Such
> >>>> cache flushing (of entire pages, unless adding yet more logic to be more
> >>>> selective) would be quite noticeable performance-wise (very prominent
> >>>> during Dom0 boot).
> >>>>
> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>> ---
> >>>> v3: Properly bound loop. Re-base over changes earlier in the series.
> >>>> v2: New.
> >>>> ---
> >>>> The hang during boot on my Latitude E6410 (see the respective code
> >>>> comment) was pretty close after iommu_enable_translation(). No errors,
> >>>> no watchdog would kick in, just sometimes the first few pixel lines of
> >>>> the next log message's (XEN) prefix would have made it out to the screen
> >>>> (and there's no serial there). It's been a lot of experimenting until I
> >>>> figured the workaround (which I consider ugly, but halfway acceptable).
> >>>> I've been trying hard to make sure the workaround wouldn't be masking a
> >>>> real issue, yet I'm still wary of it possibly doing so ... My best guess
> >>>> at this point is that on these old IOMMUs the ignored bits 52...61
> >>>> aren't really ignored for present entries, but also aren't "reserved"
> >>>> enough to trigger faults. This guess is from having tried to set other
> >>>
> >>> Is this machine able to capture any VT-d faults before?
> >>
> >> Not sure what you mean here. I don't think I can trigger any I/O at this
> >> point in time, and hence I also couldn't try to trigger a fault. But if
> >> the question is whether fault reporting at this time actually works,
> >> then yes, I think so: This is during Dom0 construction, i.e. late enough
> >> for fault reporting to be fully set up and enabled.
> >>
> >
> > My point was that, with your guess that the ignored bits are not
> > really ignored, some VT-d faults should be triggered. If the reason
> > you cannot observe such faults is that they happen too early for
> > much to be shown on the screen, then trying to set those bits at a
> > much later point might get more shown, to verify your guess.
> 
> Pretty clearly there aren't any faults. And in fact my suspicion is
> that the bits are used for addressing memory, and then the memory
> access hangs (doesn't complete).
> 
> > btw any progress since last post? How urgent do you want this
> > feature in (compared to the issue that it may paper over)?
> 
> Well, one way or another the issue needs to be dealt with for this
> series to eventually go in. To be honest I hadn't expected that it
> would still be pending ...
> 

Sorry, I didn't get your meaning. Do you mean that you didn't
expect that I hadn't yet given my R-b, or that you hadn't yet found
time to root-cause this issue?

Thanks
Kevin


* Re: [PATCH v3 20/23] VT-d: free all-empty page tables
  2022-03-17  5:55           ` Tian, Kevin
@ 2022-03-17  8:55             ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2022-03-17  8:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger, xen-devel

On 17.03.2022 06:55, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, March 14, 2022 3:33 PM
>>
>> On 14.03.2022 05:01, Tian, Kevin wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Friday, February 18, 2022 4:31 PM
>>>>
>>>> On 18.02.2022 06:20, Tian, Kevin wrote:
>>>>>> From: Jan Beulich <jbeulich@suse.com>
>>>>>> Sent: Tuesday, January 11, 2022 12:36 AM
>>>>>>
>>>>>> When a page table ends up with no present entries left, it can be
>>>>>> replaced by a non-present entry at the next higher level. The page table
>>>>>> itself can then be scheduled for freeing.
>>>>>>
>>>>>> Note that while its output isn't used there yet,
>>>>>> pt_update_contig_markers() right away needs to be called in all places
>>>>>> where entries get updated, not just the one where entries get cleared.
>>>>>>
>>>>>> Note further that while pt_update_contig_markers() updates perhaps
>>>>>> several PTEs within the table, since these are changes to "avail" bits
>>>>>> only I do not think that cache flushing would be needed afterwards. Such
>>>>>> cache flushing (of entire pages, unless adding yet more logic to be more
>>>>>> selective) would be quite noticeable performance-wise (very prominent
>>>>>> during Dom0 boot).
>>>>>>
>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>>> ---
>>>>>> v3: Properly bound loop. Re-base over changes earlier in the series.
>>>>>> v2: New.
>>>>>> ---
>>>>>> The hang during boot on my Latitude E6410 (see the respective code
>>>>>> comment) was pretty close after iommu_enable_translation(). No errors,
>>>>>> no watchdog would kick in, just sometimes the first few pixel lines of
>>>>>> the next log message's (XEN) prefix would have made it out to the screen
>>>>>> (and there's no serial there). It's been a lot of experimenting until I
>>>>>> figured the workaround (which I consider ugly, but halfway acceptable).
>>>>>> I've been trying hard to make sure the workaround wouldn't be masking a
>>>>>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>>>>>> at this point is that on these old IOMMUs the ignored bits 52...61
>>>>>> aren't really ignored for present entries, but also aren't "reserved"
>>>>>> enough to trigger faults. This guess is from having tried to set other
>>>>>
>>>>> Is this machine able to capture any VT-d faults before?
>>>>
>>>> Not sure what you mean here. I don't think I can trigger any I/O at this
>>>> point in time, and hence I also couldn't try to trigger a fault. But if
>>>> the question is whether fault reporting at this time actually works,
>>>> then yes, I think so: This is during Dom0 construction, i.e. late enough
>>>> for fault reporting to be fully set up and enabled.
>>>>
>>>
>>> My point was that, with your guess that the ignored bits are not
>>> really ignored, some VT-d faults should be triggered. If the reason
>>> you cannot observe such faults is that they happen too early for
>>> much to be shown on the screen, then trying to set those bits at a
>>> much later point might get more shown, to verify your guess.
>>
>> Pretty clearly there aren't any faults. And in fact my suspicion is
>> that the bits are used for addressing memory, and then the memory
>> access hangs (doesn't complete).
>>
>>> btw any progress since last post? How urgent do you want this
>>> feature in (compared to the issue that it may paper over)?
>>
>> Well, one way or another the issue needs to be dealt with for this
>> series to eventually go in. To be honest I hadn't expected that it
>> would still be pending ...
>>
> 
> Sorry, I didn't get your meaning. Do you mean that you didn't
> expect that I hadn't yet given my R-b, or that you hadn't yet found
> time to root-cause this issue?

Neither - the comment about the series as a whole still being pending
was a general one. After all it's been over half a year since the
original posting of the first parts of it.

As to root-causing this issue: I don't see any reasonable way for me
to do so. Hence it's not a matter of finding time anymore (that was
only the case until I could actually associate the behavior with the
one specific piece of code that causes it), but a matter of simply
not being in the position to sensibly try to dig deeper. I continue
to think that the only reasonable way to gain further insight is for
someone with access to the sources of the (I assume) involved
microcode in the chipset to spell out what the expected behavior
given that microcode would actually be. Without such knowledge I do
not see any alternative to what I'm currently doing to document and
work around the issue. Yet I also understand that given this is
rather old hardware, there's little interest at Intel to actually
put time into such research.

Jan




end of thread, newest message: 2022-03-17  8:56 UTC

Thread overview: 40+ messages
2022-01-10 16:19 [PATCH v3 00/23] IOMMU: superpage support when not sharing pagetables Jan Beulich
2022-01-10 16:22 ` [PATCH v3 01/23] AMD/IOMMU: have callers specify the target level for page table walks Jan Beulich
2022-01-10 16:22 ` [PATCH v3 02/23] VT-d: " Jan Beulich
2022-01-30  3:17   ` Tian, Kevin
2022-01-31 10:04     ` Jan Beulich
2022-01-10 16:23 ` [PATCH v3 03/23] VT-d: limit page table population in domain_pgd_maddr() Jan Beulich
2022-01-30  3:22   ` Tian, Kevin
2022-01-10 16:25 ` [PATCH v3 04/23] IOMMU: have vendor code announce supported page sizes Jan Beulich
2022-01-10 16:25 ` [PATCH v3 05/23] IOMMU: simplify unmap-on-error in iommu_map() Jan Beulich
2022-01-10 16:27 ` [PATCH v3 06/23] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
2022-01-10 16:27 ` [PATCH v3 07/23] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
2022-01-10 16:28 ` [PATCH v3 08/23] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
2022-01-10 16:28 ` [PATCH v3 09/23] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
2022-01-10 16:29 ` [PATCH v3 10/23] IOMMU/x86: support freeing of pagetables Jan Beulich
2022-01-10 16:29 ` [PATCH v3 11/23] AMD/IOMMU: drop stray TLB flush Jan Beulich
2022-01-10 16:30 ` [PATCH v3 12/23] AMD/IOMMU: walk trees upon page fault Jan Beulich
2022-01-10 16:30 ` [PATCH v3 13/23] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
2022-01-10 16:31 ` [PATCH v3 14/23] AMD/IOMMU: allow use of superpage mappings Jan Beulich
2022-01-10 16:32 ` [PATCH v3 15/23] VT-d: " Jan Beulich
2022-01-30  3:26   ` Tian, Kevin
2022-01-10 16:33 ` [PATCH v3 16/23] IOMMU: fold flush-all hook into "flush one" Jan Beulich
2022-01-30  3:38   ` Tian, Kevin
2022-01-10 16:34 ` [PATCH v3 17/23] IOMMU/x86: prefill newly allocate page tables Jan Beulich
2022-02-18  5:01   ` Tian, Kevin
2022-02-18  8:24     ` Jan Beulich
2022-02-18  8:26       ` Tian, Kevin
2022-01-10 16:35 ` [PATCH v3 18/23] x86: introduce helper for recording degree of contiguity in " Jan Beulich
2022-01-10 16:35 ` [PATCH v3 19/23] AMD/IOMMU: free all-empty " Jan Beulich
2022-01-10 16:36 ` [PATCH v3 20/23] VT-d: " Jan Beulich
2022-02-18  5:20   ` Tian, Kevin
2022-02-18  8:31     ` Jan Beulich
2022-03-14  4:01       ` Tian, Kevin
2022-03-14  7:33         ` Jan Beulich
2022-03-17  5:55           ` Tian, Kevin
2022-03-17  8:55             ` Jan Beulich
2022-01-10 16:37 ` [PATCH v3 21/23] AMD/IOMMU: replace all-contiguous page tables by superpage mappings Jan Beulich
2022-01-10 16:38 ` [PATCH v3 22/23] VT-d: " Jan Beulich
2022-02-18  5:22   ` Tian, Kevin
2022-01-10 16:38 ` [PATCH v3 23/23] IOMMU/x86: add perf counters for page table splitting / coalescing Jan Beulich
2022-02-18  5:23   ` Tian, Kevin
