* [PATCH 00/17] IOMMU: superpage support when not sharing pagetables
@ 2021-08-24 14:13 Jan Beulich
  2021-08-24 14:15 ` [PATCH 01/17] AMD/IOMMU: avoid recording each level's MFN when walking page table Jan Beulich
                   ` (17 more replies)
  0 siblings, 18 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

For a long time we've been rather inefficient with IOMMU page table
management when not sharing page tables, i.e. in particular for PV (and
more specifically for PV Dom0) and for AMD (where nowadays we never
share page tables). While up to about 2.5 years ago the AMD code had logic
to un-shatter page mappings, that logic was ripped out for being buggy
(XSA-275 plus follow-on).

This series enables use of large pages in AMD and Intel (VT-d) code;
Arm is presently not in need of any enabling as pagetables are always
shared there. It also augments PV Dom0 creation with suitable explicit
IOMMU mapping calls to facilitate use of large pages there without
getting into the business of un-shattering page mappings just yet.
Depending on the amount of memory handed to Dom0 this improves booting
time (latency until Dom0 actually starts) quite a bit; subsequent
shattering of some of the large pages may of course consume some of the
saved time.

Parts of this series functionally depend on the previously submitted
"VT-d: fix caching mode IOTLB flushing".

Known fallout has been spelled out here:
https://lists.xen.org/archives/html/xen-devel/2021-08/msg00781.html

Of course there are also various seemingly unrelated changes included
here, which I came to consider necessary or at least desirable along the
way (in part because the affected code had been in need of adjustment
for a long time). Some of these changes are likely independent of the
bulk of the work here, and hence may be fine to go in ahead of the
earlier patches.

While, as said above, un-shattering of mappings isn't an immediate goal,
I intend to at least arrange for freeing page tables which have ended up
all empty. But that's not part of this v1 of the series.

01: AMD/IOMMU: avoid recording each level's MFN when walking page table
02: AMD/IOMMU: have callers specify the target level for page table walks
03: VT-d: have callers specify the target level for page table walks
04: IOMMU: have vendor code announce supported page sizes
05: IOMMU: add order parameter to ->{,un}map_page() hooks
06: IOMMU: have iommu_{,un}map() split requests into largest possible chunks
07: IOMMU/x86: restrict IO-APIC mappings for PV Dom0
08: IOMMU/x86: perform PV Dom0 mappings in batches
09: IOMMU/x86: support freeing of pagetables
10: AMD/IOMMU: drop stray TLB flush
11: AMD/IOMMU: walk trees upon page fault
12: AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
13: AMD/IOMMU: allow use of superpage mappings
14: VT-d: allow use of superpage mappings
15: IOMMU: page table dumping adjustments
16: VT-d: show permissions during page table walks
17: IOMMU/x86: drop pointless NULL checks

While not directly related (except that making this mode work properly
here was a fair part of the overall work), on this occasion I'd also
like to renew my proposal to make "iommu=dom0-strict" the default going
forward. It is already not just the default, but the only possible mode
for PVH Dom0.

Jan




* [PATCH 01/17] AMD/IOMMU: avoid recording each level's MFN when walking page table
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
@ 2021-08-24 14:15 ` Jan Beulich
  2021-08-25 13:29   ` Andrew Cooper
  2021-08-24 14:15 ` [PATCH 02/17] AMD/IOMMU: have callers specify the target level for page table walks Jan Beulich
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:15 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

Both callers only care about the target (level 1) MFN. I also cannot
see what we might need higher level MFNs for down the road. And even
modern gcc doesn't recognize the optimization potential.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -178,7 +178,7 @@ void __init iommu_dte_add_device_entry(s
  * page tables.
  */
 static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
-                              unsigned long pt_mfn[], bool map)
+                              unsigned long *pt_mfn, bool map)
 {
     union amd_iommu_pte *pde, *next_table_vaddr;
     unsigned long  next_table_mfn;
@@ -203,7 +203,6 @@ static int iommu_pde_from_dfn(struct dom
     while ( level > 1 )
     {
         unsigned int next_level = level - 1;
-        pt_mfn[level] = next_table_mfn;
 
         next_table_vaddr = map_domain_page(_mfn(next_table_mfn));
         pde = &next_table_vaddr[pfn_to_pde_idx(dfn, level)];
@@ -273,7 +272,7 @@ static int iommu_pde_from_dfn(struct dom
     }
 
     /* mfn of level 1 page table */
-    pt_mfn[level] = next_table_mfn;
+    *pt_mfn = next_table_mfn;
     return 0;
 }
 
@@ -282,9 +281,7 @@ int amd_iommu_map_page(struct domain *d,
 {
     struct domain_iommu *hd = dom_iommu(d);
     int rc;
-    unsigned long pt_mfn[7];
-
-    memset(pt_mfn, 0, sizeof(pt_mfn));
+    unsigned long pt_mfn = 0;
 
     spin_lock(&hd->arch.mapping_lock);
 
@@ -310,7 +307,7 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn, true) || (pt_mfn[1] == 0) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, true) || !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -320,7 +317,7 @@ int amd_iommu_map_page(struct domain *d,
     }
 
     /* Install 4k mapping */
-    *flush_flags |= set_iommu_ptes_present(pt_mfn[1], dfn_x(dfn), mfn_x(mfn),
+    *flush_flags |= set_iommu_ptes_present(pt_mfn, dfn_x(dfn), mfn_x(mfn),
                                            1, 1, (flags & IOMMUF_writable),
                                            (flags & IOMMUF_readable));
 
@@ -332,11 +329,9 @@ int amd_iommu_map_page(struct domain *d,
 int amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
                          unsigned int *flush_flags)
 {
-    unsigned long pt_mfn[7];
+    unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
 
-    memset(pt_mfn, 0, sizeof(pt_mfn));
-
     spin_lock(&hd->arch.mapping_lock);
 
     if ( !hd->arch.amd.root_table )
@@ -345,7 +340,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_mfn, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -354,10 +349,10 @@ int amd_iommu_unmap_page(struct domain *
         return -EFAULT;
     }
 
-    if ( pt_mfn[1] )
+    if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        *flush_flags |= clear_iommu_pte_present(pt_mfn[1], dfn_x(dfn));
+        *flush_flags |= clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
     }
 
     spin_unlock(&hd->arch.mapping_lock);




* [PATCH 02/17] AMD/IOMMU: have callers specify the target level for page table walks
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
  2021-08-24 14:15 ` [PATCH 01/17] AMD/IOMMU: avoid recording each level's MFN when walking page table Jan Beulich
@ 2021-08-24 14:15 ` Jan Beulich
  2021-08-24 14:16 ` [PATCH 03/17] VT-d: " Jan Beulich
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:15 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

In order to be able to insert/remove super-pages we need to allow
callers of the walking function to specify at which point to stop the
walk. (For now at least gcc will instantiate just a variant of the
function with the parameter eliminated, so effectively no change to
generated code as far as the parameter addition goes.)

Instead of merely adjusting a BUG_ON() condition, convert it into an
error return - there's no reason to crash the entire host in that case.
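
As an illustration (not part of this patch), a hypothetical caller wanting
a 2Mb mapping - as later patches in this series will - would then stop the
walk at level 2:

    unsigned long pt_mfn = 0;

    /* Sketch only: obtain the table holding the (would-be) 2Mb leaf PDE. */
    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 2, &pt_mfn, true) || !pt_mfn )
        return -EFAULT;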

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -178,7 +178,8 @@ void __init iommu_dte_add_device_entry(s
  * page tables.
  */
 static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
-                              unsigned long *pt_mfn, bool map)
+                              unsigned int target, unsigned long *pt_mfn,
+                              bool map)
 {
     union amd_iommu_pte *pde, *next_table_vaddr;
     unsigned long  next_table_mfn;
@@ -189,7 +190,8 @@ static int iommu_pde_from_dfn(struct dom
     table = hd->arch.amd.root_table;
     level = hd->arch.amd.paging_mode;
 
-    BUG_ON( table == NULL || level < 1 || level > 6 );
+    if ( !table || target < 1 || level < target || level > 6 )
+        return 1;
 
     /*
      * A frame number past what the current page tables can represent can't
@@ -200,7 +202,7 @@ static int iommu_pde_from_dfn(struct dom
 
     next_table_mfn = mfn_x(page_to_mfn(table));
 
-    while ( level > 1 )
+    while ( level > target )
     {
         unsigned int next_level = level - 1;
 
@@ -307,7 +309,7 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, true) || !pt_mfn )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, true) || !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -340,7 +342,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",




* [PATCH 03/17] VT-d: have callers specify the target level for page table walks
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
  2021-08-24 14:15 ` [PATCH 01/17] AMD/IOMMU: avoid recording each level's MFN when walking page table Jan Beulich
  2021-08-24 14:15 ` [PATCH 02/17] AMD/IOMMU: have callers specify the target level for page table walks Jan Beulich
@ 2021-08-24 14:16 ` Jan Beulich
  2021-08-24 14:17 ` [PATCH 04/17] IOMMU: have vendor code announce supported page sizes Jan Beulich
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Kevin Tian

In order to be able to insert/remove super-pages we need to allow
callers of the walking function to specify at which point to stop the
walk.

For intel_iommu_lookup_page() integrate the last level access into
the main walking function.

dma_pte_clear_one() gets only partly adjusted for now: Error handling
and order parameter get put in place, but the order parameter remains
ignored (just like intel_iommu_map_page()'s order part of the flags).
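
Purely for illustration, a non-allocating caller then interprets the new
return convention like this (mirroring the dma_pte_clear_one() adjustment
below; values below PAGE_SIZE encode "nothing mapped" (zero) or the level
at which an allocation failed (small positive)):

    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags, false);
    if ( pg_maddr < PAGE_SIZE )
        return pg_maddr ? -ENOMEM : 0;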

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I have to admit that I don't understand why domain_pgd_maddr() wants to
populate all page table levels for DFN 0.

I was actually wondering whether it wouldn't make sense to integrate
dma_pte_clear_one() into its only caller intel_iommu_unmap_page(), for
better symmetry with intel_iommu_map_page().

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -270,63 +270,116 @@ static u64 bus_to_context_maddr(struct v
     return maddr;
 }
 
-static u64 addr_to_dma_page_maddr(struct domain *domain, u64 addr, int alloc)
+/*
+ * This function walks (and if requested allocates) page tables to the
+ * designated target level. It returns
+ * - 0 when a non-present entry was encountered and no allocation was
+ *   requested,
+ * - a small positive value (the level, i.e. below PAGE_SIZE) upon allocation
+ *   failure,
+ * - for target > 0 the address of the page table holding the leaf PTE for
+ *   the requested address,
+ * - for target == 0 the full PTE.
+ */
+static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
+                                       unsigned int target,
+                                       unsigned int *flush_flags, bool alloc)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     int addr_width = agaw_to_width(hd->arch.vtd.agaw);
     struct dma_pte *parent, *pte = NULL;
-    int level = agaw_to_level(hd->arch.vtd.agaw);
-    int offset;
+    unsigned int level = agaw_to_level(hd->arch.vtd.agaw), offset;
     u64 pte_maddr = 0;
 
     addr &= (((u64)1) << addr_width) - 1;
     ASSERT(spin_is_locked(&hd->arch.mapping_lock));
+    ASSERT(target || !alloc);
+
     if ( !hd->arch.vtd.pgd_maddr )
     {
         struct page_info *pg;
 
-        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
+        if ( !alloc )
+            goto out;
+
+        pte_maddr = level;
+        if ( !(pg = iommu_alloc_pgtable(domain)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
     }
 
-    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
-    while ( level > 1 )
+    pte_maddr = hd->arch.vtd.pgd_maddr;
+    parent = map_vtd_domain_page(pte_maddr);
+    while ( level > target )
     {
         offset = address_level_offset(addr, level);
         pte = &parent[offset];
 
         pte_maddr = dma_pte_addr(*pte);
-        if ( !pte_maddr )
+        if ( !dma_pte_present(*pte) || (level > 1 && dma_pte_superpage(*pte)) )
         {
             struct page_info *pg;
+            /*
+             * Higher level tables always set r/w, last level page table
+             * controls read/write.
+             */
+            struct dma_pte new_pte = { DMA_PTE_PROT };
 
             if ( !alloc )
-                break;
+            {
+                pte_maddr = 0;
+                if ( !dma_pte_present(*pte) )
+                    break;
+
+                /*
+                 * When the leaf entry was requested, pass back the full PTE,
+                 * with the address adjusted to account for the residual of
+                 * the walk.
+                 */
+                pte_maddr = pte->val +
+                    (addr & ((1UL << level_to_offset_bits(level)) - 1) &
+                     PAGE_MASK);
+                if ( !target )
+                    break;
+            }
 
+            pte_maddr = level - 1;
             pg = iommu_alloc_pgtable(domain);
             if ( !pg )
                 break;
 
             pte_maddr = page_to_maddr(pg);
-            dma_set_pte_addr(*pte, pte_maddr);
+            dma_set_pte_addr(new_pte, pte_maddr);
 
-            /*
-             * high level table always sets r/w, last level
-             * page table control read/write
-             */
-            dma_set_pte_readable(*pte);
-            dma_set_pte_writable(*pte);
+            if ( dma_pte_present(*pte) )
+            {
+                struct dma_pte *split = map_vtd_domain_page(pte_maddr);
+                unsigned long inc = 1UL << level_to_offset_bits(level - 1);
+
+                split[0].val = pte->val;
+                if ( inc == PAGE_SIZE )
+                    split[0].val &= ~DMA_PTE_SP;
+
+                for ( offset = 1; offset < PTE_NUM; ++offset )
+                    split[offset].val = split[offset - 1].val + inc;
+
+                iommu_sync_cache(split, PAGE_SIZE);
+                unmap_vtd_domain_page(split);
+
+                if ( flush_flags )
+                    *flush_flags |= IOMMU_FLUSHF_modified;
+            }
+
+            write_atomic(&pte->val, new_pte.val);
             iommu_sync_cache(pte, sizeof(struct dma_pte));
         }
 
-        if ( level == 2 )
+        if ( --level == target )
             break;
 
         unmap_vtd_domain_page(parent);
         parent = map_vtd_domain_page(pte_maddr);
-        level--;
     }
 
     unmap_vtd_domain_page(parent);
@@ -352,7 +405,7 @@ static uint64_t domain_pgd_maddr(struct
     if ( !hd->arch.vtd.pgd_maddr )
     {
         /* Ensure we have pagetables allocated down to leaf PTE. */
-        addr_to_dma_page_maddr(d, 0, 1);
+        addr_to_dma_page_maddr(d, 0, 1, NULL, true);
 
         if ( !hd->arch.vtd.pgd_maddr )
             return 0;
@@ -704,8 +757,9 @@ static int __must_check iommu_flush_iotl
 }
 
 /* clear one page's page table */
-static void dma_pte_clear_one(struct domain *domain, uint64_t addr,
-                              unsigned int *flush_flags)
+static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
+                             unsigned int order,
+                             unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct dma_pte *page = NULL, *pte = NULL;
@@ -713,11 +767,11 @@ static void dma_pte_clear_one(struct dom
 
     spin_lock(&hd->arch.mapping_lock);
     /* get last level pte */
-    pg_maddr = addr_to_dma_page_maddr(domain, addr, 0);
-    if ( pg_maddr == 0 )
+    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags, false);
+    if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
-        return;
+        return pg_maddr ? -ENOMEM : 0;
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
@@ -727,7 +781,7 @@ static void dma_pte_clear_one(struct dom
     {
         spin_unlock(&hd->arch.mapping_lock);
         unmap_vtd_domain_page(page);
-        return;
+        return 0;
     }
 
     dma_clear_pte(*pte);
@@ -737,6 +791,8 @@ static void dma_pte_clear_one(struct dom
     iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
+
+    return 0;
 }
 
 static int iommu_set_root_entry(struct vtd_iommu *iommu)
@@ -1834,8 +1890,9 @@ static int __must_check intel_iommu_map_
         return 0;
     }
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1);
-    if ( !pg_maddr )
+    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1, flush_flags,
+                                      true);
+    if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
         return -ENOMEM;
@@ -1885,17 +1942,14 @@ static int __must_check intel_iommu_unma
     if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
         return 0;
 
-    dma_pte_clear_one(d, dfn_to_daddr(dfn), flush_flags);
-
-    return 0;
+    return dma_pte_clear_one(d, dfn_to_daddr(dfn), order, flush_flags);
 }
 
 static int intel_iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn,
                                    unsigned int *flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct dma_pte *page, val;
-    u64 pg_maddr;
+    uint64_t val;
 
     /*
      * If VT-d shares EPT page table or if the domain is the hardware
@@ -1907,25 +1961,16 @@ static int intel_iommu_lookup_page(struc
 
     spin_lock(&hd->arch.mapping_lock);
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0);
-    if ( !pg_maddr )
-    {
-        spin_unlock(&hd->arch.mapping_lock);
-        return -ENOENT;
-    }
-
-    page = map_vtd_domain_page(pg_maddr);
-    val = page[dfn_x(dfn) & LEVEL_MASK];
+    val = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0, NULL, false);
 
-    unmap_vtd_domain_page(page);
     spin_unlock(&hd->arch.mapping_lock);
 
-    if ( !dma_pte_present(val) )
+    if ( val < PAGE_SIZE )
         return -ENOENT;
 
-    *mfn = maddr_to_mfn(dma_pte_addr(val));
-    *flags = dma_pte_read(val) ? IOMMUF_readable : 0;
-    *flags |= dma_pte_write(val) ? IOMMUF_writable : 0;
+    *mfn = maddr_to_mfn(val);
+    *flags = val & DMA_PTE_READ ? IOMMUF_readable : 0;
+    *flags |= val & DMA_PTE_WRITE ? IOMMUF_writable : 0;
 
     return 0;
 }




* [PATCH 04/17] IOMMU: have vendor code announce supported page sizes
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (2 preceding siblings ...)
  2021-08-24 14:16 ` [PATCH 03/17] VT-d: " Jan Beulich
@ 2021-08-24 14:17 ` Jan Beulich
  2021-09-16  7:38   ` Tian, Kevin
  2021-08-24 14:18 ` [PATCH 05/17] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:17 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Kevin Tian, Stefano Stabellini,
	Julien Grall, Bertrand Marquis

Generic code will use this information to determine what order values
can legitimately be passed to the ->{,un}map_page() hooks. For now all
ops structures simply get to announce 4k mappings (as base page size),
and there is (and always has been) an assumption that this matches the
CPU's MMU base page size (eventually we will want to permit IOMMUs with
a base page size smaller than the CPU MMU's).
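
Purely as an illustration (no driver announces more than 4k yet, and the
2Mb/1Gb bits below are written out explicitly only for this example), a
vendor driver gaining superpage support later in the series would set a
mask such as

    /* Hypothetical example: 4k base pages plus 2Mb and 1Gb superpages. */
    .page_sizes = PAGE_SIZE_4K | (1UL << 21) | (1UL << 30),

The check added to iommu_setup() merely insists that the smallest
announced size equal PAGE_SIZE: "x & -x" isolates the lowest set bit of
x, so any mask whose lowest bit isn't the 4k one gets rejected.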

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -583,6 +583,7 @@ static void amd_dump_page_tables(struct
 }
 
 static const struct iommu_ops __initconstrel _iommu_ops = {
+    .page_sizes = PAGE_SIZE_4K,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
     .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -1298,6 +1298,7 @@ static void ipmmu_iommu_domain_teardown(
 
 static const struct iommu_ops ipmmu_iommu_ops =
 {
+    .page_sizes      = PAGE_SIZE_4K,
     .init            = ipmmu_iommu_domain_init,
     .hwdom_init      = ipmmu_iommu_hwdom_init,
     .teardown        = ipmmu_iommu_domain_teardown,
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2873,6 +2873,7 @@ static void arm_smmu_iommu_domain_teardo
 }
 
 static const struct iommu_ops arm_smmu_iommu_ops = {
+    .page_sizes = PAGE_SIZE_4K,
     .init = arm_smmu_iommu_domain_init,
     .hwdom_init = arm_smmu_iommu_hwdom_init,
     .add_device = arm_smmu_dt_add_device_generic,
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -3426,7 +3426,8 @@ static void arm_smmu_iommu_xen_domain_te
 }
 
 static const struct iommu_ops arm_smmu_iommu_ops = {
-	.init		= arm_smmu_iommu_xen_domain_init,
+	.page_sizes		= PAGE_SIZE_4K,
+	.init			= arm_smmu_iommu_xen_domain_init,
 	.hwdom_init		= arm_smmu_iommu_hwdom_init,
 	.teardown		= arm_smmu_iommu_xen_domain_teardown,
 	.iotlb_flush		= arm_smmu_iotlb_flush,
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -470,7 +470,17 @@ int __init iommu_setup(void)
 
     if ( iommu_enable )
     {
+        const struct iommu_ops *ops = NULL;
+
         rc = iommu_hardware_setup();
+        if ( !rc )
+            ops = iommu_get_ops();
+        if ( ops && (ops->page_sizes & -ops->page_sizes) != PAGE_SIZE )
+        {
+            printk(XENLOG_ERR "IOMMU: page size mask %lx unsupported\n",
+                   ops->page_sizes);
+            rc = ops->page_sizes ? -EPERM : -ENODATA;
+        }
         iommu_enabled = (rc == 0);
     }
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2875,6 +2875,7 @@ static int __init intel_iommu_quarantine
 }
 
 static struct iommu_ops __initdata vtd_ops = {
+    .page_sizes = PAGE_SIZE_4K,
     .init = intel_iommu_domain_init,
     .hwdom_init = intel_iommu_hwdom_init,
     .quarantine_init = intel_iommu_quarantine_init,
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -231,6 +231,7 @@ struct page_info;
 typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
 
 struct iommu_ops {
+    unsigned long page_sizes;
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
     int (*quarantine_init)(struct domain *d);




* [PATCH 05/17] IOMMU: add order parameter to ->{,un}map_page() hooks
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (3 preceding siblings ...)
  2021-08-24 14:17 ` [PATCH 04/17] IOMMU: have vendor code announce supported page sizes Jan Beulich
@ 2021-08-24 14:18 ` Jan Beulich
  2021-09-16  7:41   ` Tian, Kevin
  2021-08-24 14:19 ` [PATCH 06/17] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Kevin Tian, Stefano Stabellini,
	Julien Grall

Or really, in the case of ->map_page(), accommodate it in the existing
"flags" parameter. All call sites will pass 0 for now.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -225,6 +225,7 @@ int __must_check amd_iommu_map_page(stru
                                     mfn_t mfn, unsigned int flags,
                                     unsigned int *flush_flags);
 int __must_check amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain *d);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -328,7 +328,7 @@ int amd_iommu_map_page(struct domain *d,
     return 0;
 }
 
-int amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
+int amd_iommu_unmap_page(struct domain *d, dfn_t dfn, unsigned int order,
                          unsigned int *flush_flags)
 {
     unsigned long pt_mfn = 0;
--- a/xen/drivers/passthrough/arm/iommu_helpers.c
+++ b/xen/drivers/passthrough/arm/iommu_helpers.c
@@ -57,11 +57,13 @@ int __must_check arm_iommu_map_page(stru
      * The function guest_physmap_add_entry replaces the current mapping
      * if there is already one...
      */
-    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0, t);
+    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+                                   IOMMUF_order(flags), t);
 }
 
 /* Should only be used if P2M Table is shared between the CPU and the IOMMU. */
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags)
 {
     /*
@@ -71,7 +73,8 @@ int __must_check arm_iommu_unmap_page(st
     if ( !is_domain_direct_mapped(d) )
         return -EINVAL;
 
-    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0);
+    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+                                     order);
 }
 
 /*
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -271,6 +271,8 @@ int iommu_map(struct domain *d, dfn_t df
     if ( !is_iommu_enabled(d) )
         return 0;
 
+    ASSERT(!IOMMUF_order(flags));
+
     for ( i = 0; i < page_count; i++ )
     {
         rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
@@ -288,7 +290,7 @@ int iommu_map(struct domain *d, dfn_t df
         while ( i-- )
             /* if statement to satisfy __must_check */
             if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                            flush_flags) )
+                            0, flush_flags) )
                 continue;
 
         if ( !is_hardware_domain(d) )
@@ -333,7 +335,7 @@ int iommu_unmap(struct domain *d, dfn_t
     for ( i = 0; i < page_count; i++ )
     {
         int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                             flush_flags);
+                             0, flush_flags);
 
         if ( likely(!err) )
             continue;
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1932,6 +1932,7 @@ static int __must_check intel_iommu_map_
 }
 
 static int __must_check intel_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                               unsigned int order,
                                                unsigned int *flush_flags)
 {
     /* Do nothing if VT-d shares EPT page table */
--- a/xen/include/asm-arm/iommu.h
+++ b/xen/include/asm-arm/iommu.h
@@ -31,6 +31,7 @@ int __must_check arm_iommu_map_page(stru
                                     unsigned int flags,
                                     unsigned int *flush_flags);
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags);
 
 #endif /* __ARCH_ARM_IOMMU_H__ */
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -127,9 +127,10 @@ void arch_iommu_hwdom_init(struct domain
  * The following flags are passed to map operations and passed by lookup
  * operations.
  */
-#define _IOMMUF_readable 0
+#define IOMMUF_order(n)  ((n) & 0x3f)
+#define _IOMMUF_readable 6
 #define IOMMUF_readable  (1u<<_IOMMUF_readable)
-#define _IOMMUF_writable 1
+#define _IOMMUF_writable 7
 #define IOMMUF_writable  (1u<<_IOMMUF_writable)
 
 /*
@@ -255,6 +256,7 @@ struct iommu_ops {
                                  unsigned int flags,
                                  unsigned int *flush_flags);
     int __must_check (*unmap_page)(struct domain *d, dfn_t dfn,
+                                   unsigned int order,
                                    unsigned int *flush_flags);
     int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
                                     unsigned int *flags);




* [PATCH 06/17] IOMMU: have iommu_{,un}map() split requests into largest possible chunks
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (4 preceding siblings ...)
  2021-08-24 14:18 ` [PATCH 05/17] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
@ 2021-08-24 14:19 ` Jan Beulich
  2021-08-24 14:21 ` [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:19 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

Introduce a helper function to determine the largest possible mapping
that allows covering a request (or the next part of it that is left to
be processed).

In order to not add yet more recurring dfn_add() / mfn_add() to the two
callers of the new helper, also introduce local variables holding the
values presently operated on.
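
To illustrate the effect (numbers invented, assuming a 4k/2Mb/1Gb size
mask as announced by later patches in this series):

    /*
     * iommu_map(d, dfn 0x200, mfn 0x400, 1024 pages, ...)
     *   i = 0:   mapping_order(hd, 0x200, 0x400, 1024) = 9  -> one 2Mb mapping
     *   i = 512: mapping_order(hd, 0x400, 0x600,  512) = 9  -> one 2Mb mapping
     * i.e. two ->map_page() invocations instead of 1024, as both start
     * addresses are 2Mb-aligned but the request falls far short of (and
     * isn't aligned to) a full 1Gb worth of pages.
     */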

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -260,12 +260,38 @@ void iommu_domain_destroy(struct domain
     arch_iommu_domain_destroy(d);
 }
 
-int iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
+static unsigned int mapping_order(const struct domain_iommu *hd,
+                                  dfn_t dfn, mfn_t mfn, unsigned long nr)
+{
+    unsigned long res = dfn_x(dfn) | mfn_x(mfn);
+    unsigned long sizes = hd->platform_ops->page_sizes;
+    unsigned int bit = find_first_set_bit(sizes), order = 0;
+
+    ASSERT(bit == PAGE_SHIFT);
+
+    while ( (sizes = (sizes >> bit) & ~1) )
+    {
+        unsigned long mask;
+
+        bit = find_first_set_bit(sizes);
+        mask = (1UL << bit) - 1;
+        if ( nr <= mask || (res & mask) )
+            break;
+        order += bit;
+        nr >>= bit;
+        res >>= bit;
+    }
+
+    return order;
+}
+
+int iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
               unsigned long page_count, unsigned int flags,
               unsigned int *flush_flags)
 {
     const struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
+    unsigned int order;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
@@ -273,10 +299,16 @@ int iommu_map(struct domain *d, dfn_t df
 
     ASSERT(!IOMMUF_order(flags));
 
-    for ( i = 0; i < page_count; i++ )
+    for ( i = 0; i < page_count; i += 1UL << order )
     {
-        rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
-                        mfn_add(mfn, i), flags, flush_flags);
+        dfn_t dfn = dfn_add(dfn0, i);
+        mfn_t mfn = mfn_add(mfn0, i);
+        unsigned long j;
+
+        order = mapping_order(hd, dfn, mfn, page_count - i);
+
+        rc = iommu_call(hd->platform_ops, map_page, d, dfn, mfn,
+                        flags | IOMMUF_order(order), flush_flags);
 
         if ( likely(!rc) )
             continue;
@@ -284,14 +316,18 @@ int iommu_map(struct domain *d, dfn_t df
         if ( !d->is_shutting_down && printk_ratelimit() )
             printk(XENLOG_ERR
                    "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" failed: %d\n",
-                   d->domain_id, dfn_x(dfn_add(dfn, i)),
-                   mfn_x(mfn_add(mfn, i)), rc);
+                   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
+
+        for ( j = 0; j < i; j += 1UL << order )
+        {
+            dfn = dfn_add(dfn0, j);
+            order = mapping_order(hd, dfn, _mfn(0), i - j);
 
-        while ( i-- )
             /* if statement to satisfy __must_check */
-            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                            0, flush_flags) )
+            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn, order,
+                            flush_flags) )
                 continue;
+        }
 
         if ( !is_hardware_domain(d) )
             domain_crash(d);
@@ -322,20 +358,25 @@ int iommu_legacy_map(struct domain *d, d
     return rc;
 }
 
-int iommu_unmap(struct domain *d, dfn_t dfn, unsigned long page_count,
+int iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
                 unsigned int *flush_flags)
 {
     const struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
+    unsigned int order;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
         return 0;
 
-    for ( i = 0; i < page_count; i++ )
+    for ( i = 0; i < page_count; i += 1UL << order )
     {
-        int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                             0, flush_flags);
+        dfn_t dfn = dfn_add(dfn0, i);
+        int err;
+
+        order = mapping_order(hd, dfn, _mfn(0), page_count - i);
+        err = iommu_call(hd->platform_ops, unmap_page, d, dfn,
+                         order, flush_flags);
 
         if ( likely(!err) )
             continue;
@@ -343,7 +384,7 @@ int iommu_unmap(struct domain *d, dfn_t
         if ( !d->is_shutting_down && printk_ratelimit() )
             printk(XENLOG_ERR
                    "d%d: IOMMU unmapping dfn %"PRI_dfn" failed: %d\n",
-                   d->domain_id, dfn_x(dfn_add(dfn, i)), err);
+                   d->domain_id, dfn_x(dfn), err);
 
         if ( !rc )
             rc = err;




* [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (5 preceding siblings ...)
  2021-08-24 14:19 ` [PATCH 06/17] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
@ 2021-08-24 14:21 ` Jan Beulich
  2021-08-26 11:57   ` Andrew Cooper
  2021-08-24 14:21 ` [PATCH 08/17] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:21 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

While this is already the case for PVH, there's no reason to treat PV
differently here, though of course the addresses get taken from another
source in that case. The one difference is that, to match CPU-side
mappings, we permit r/o mappings by default. This also means we now deal
consistently with IO-APICs whose MMIO is or is not covered by E820
reserved regions.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
[integrated] v1: Integrate into series.
[standalone] v2: Keep IOMMU mappings in sync with CPU ones.

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -159,12 +159,12 @@ void arch_iommu_domain_destroy(struct do
            page_list_empty(&dom_iommu(d)->arch.pgtables.list));
 }
 
-static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
-                                         unsigned long pfn,
-                                         unsigned long max_pfn)
+static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
+                                                 unsigned long pfn,
+                                                 unsigned long max_pfn)
 {
     mfn_t mfn = _mfn(pfn);
-    unsigned int i, type;
+    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
 
     /*
      * Set up 1:1 mapping for dom0. Default to include only conventional RAM
@@ -173,44 +173,60 @@ static bool __hwdom_init hwdom_iommu_map
      * that fall in unusable ranges for PV Dom0.
      */
     if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-        return false;
+        return 0;
 
     switch ( type = page_get_ram_type(mfn) )
     {
     case RAM_TYPE_UNUSABLE:
-        return false;
+        return 0;
 
     case RAM_TYPE_CONVENTIONAL:
         if ( iommu_hwdom_strict )
-            return false;
+            return 0;
         break;
 
     default:
         if ( type & RAM_TYPE_RESERVED )
         {
             if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-                return false;
+                perms = 0;
         }
-        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
-            return false;
+        else if ( is_hvm_domain(d) )
+            return 0;
+        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
+            perms = 0;
     }
 
     /* Check that it doesn't overlap with the Interrupt Address Range. */
     if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-        return false;
+        return 0;
     /* ... or the IO-APIC */
-    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
-        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-            return false;
+    if ( has_vioapic(d) )
+    {
+        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
+            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
+                return 0;
+    }
+    else if ( is_pv_domain(d) )
+    {
+        /*
+         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
+         * ones there, so it should also have such established for IOMMUs.
+         */
+        for ( i = 0; i < nr_ioapics; i++ )
+            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
+                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
+                       ? IOMMUF_readable : 0;
+    }
     /*
      * ... or the PCIe MCFG regions.
      * TODO: runtime added MMCFG regions are not checked to make sure they
      * don't overlap with already mapped regions, thus preventing trapping.
      */
     if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
-        return false;
+        return 0;
 
-    return true;
+    return perms;
 }
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
@@ -246,15 +262,19 @@ void __hwdom_init arch_iommu_hwdom_init(
     for ( i = 0; i < top; i++ )
     {
         unsigned long pfn = pdx_to_pfn(i);
+        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
         int rc;
 
-        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
+        if ( !perms )
             rc = 0;
         else if ( paging_mode_translate(d) )
-            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
+            rc = set_identity_p2m_entry(d, pfn,
+                                        perms & IOMMUF_writable ? p2m_access_rw
+                                                                : p2m_access_r,
+                                        0);
         else
             rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+                           perms, &flush_flags);
 
         if ( rc )
             printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",




* [PATCH 08/17] IOMMU/x86: perform PV Dom0 mappings in batches
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (6 preceding siblings ...)
  2021-08-24 14:21 ` [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
@ 2021-08-24 14:21 ` Jan Beulich
  2021-08-24 14:22 ` [PATCH 09/17] IOMMU/x86: support freeing of pagetables Jan Beulich
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:21 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper, Wei Liu, Roger Pau Monné

For large page mappings to be easily usable (i.e. in particular without
un-shattering of smaller page mappings) and for mapping operations to
then also be more efficient, pass batches of Dom0 memory to iommu_map().
In dom0_construct_pv() and its helpers (covering strict mode) this
additionally requires establishing the type of those pages (albeit with
zero type references).

The earlier establishing of PGT_writable_page | PGT_validated requires
the existing places where this gets done (through get_page_and_type())
to be updated: For pages which actually have a mapping, the type
refcount needs to be 1.

There is actually a related bug that gets fixed here as a side effect:
Typically the last L1 table would get marked as such only after
get_page_and_type(..., PGT_writable_page). While this is fine as far as
refcounting goes, the page did remain mapped in the IOMMU in this case
(when "iommu=dom0-strict").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Subsequently p2m_add_identity_entry() may want to also gain an order
parameter, for arch_iommu_hwdom_init() to use. While this only affects
non-RAM regions, systems typically have 2-16Mb of reserved space
immediately below 4Gb, which hence could be mapped more efficiently.

The installing of zero-ref writable types has in fact shown (observed
while putting together the change) that, despite the intention of the
XSA-288 changes (which affect DomU-s only), for Dom0 a number of
sufficiently ordinary pages still start out as PGT_none: at the very
least initrd and P2M pages, as well as pages that are part of the
initial allocation but not part of the initial mapping. Such pages would
therefore have gained IOMMU mappings only the first time they got mapped
writably.

I didn't think I needed to address the bug mentioned in the description
in a separate (prereq) patch, but if others disagree I could certainly
break out that part (which would then need to use iommu_legacy_unmap()
first).

Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
They'll end up mapped via the later get_page_and_type().

As to the way these refs get installed: I've chosen to avoid the more
expensive {get,put}_page_and_type(), putting in place the intended type
directly. I guess I could be convinced to avoid this bypassing of the
actual logic; I merely think it's unnecessarily expensive.

--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -106,11 +106,26 @@ static __init void mark_pv_pt_pages_rdon
     unmap_domain_page(pl3e);
 }
 
+/*
+ * For IOMMU mappings done while building Dom0 the type of the pages needs to
+ * match (for _get_page_type() to unmap upon type change). Set the pages to
+ * writable with no type ref. NB: This is benign when !need_iommu_pt_sync(d).
+ */
+static void __init make_pages_writable(struct page_info *page, unsigned long nr)
+{
+    for ( ; nr--; ++page )
+    {
+        ASSERT(!page->u.inuse.type_info);
+        page->u.inuse.type_info = PGT_writable_page | PGT_validated;
+    }
+}
+
 static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
                                     unsigned long v_start, unsigned long v_end,
                                     unsigned long vphysmap_start,
                                     unsigned long vphysmap_end,
-                                    unsigned long nr_pages)
+                                    unsigned long nr_pages,
+                                    unsigned int *flush_flags)
 {
     struct page_info *page = NULL;
     l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
@@ -123,6 +138,8 @@ static __init void setup_pv_physmap(stru
 
     while ( vphysmap_start < vphysmap_end )
     {
+        int rc = 0;
+
         if ( domain_tot_pages(d) +
              ((round_pgup(vphysmap_end) - vphysmap_start) >> PAGE_SHIFT) +
              3 > nr_pages )
@@ -176,7 +193,22 @@ static __init void setup_pv_physmap(stru
                                              L3_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
-                *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
+                mfn_t mfn = page_to_mfn(page);
+
+                if ( need_iommu_pt_sync(d) )
+                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn,
+                                   SUPERPAGE_PAGES * SUPERPAGE_PAGES,
+                                   IOMMUF_readable | IOMMUF_writable,
+                                   flush_flags);
+                if ( !rc )
+                    make_pages_writable(page,
+                                        SUPERPAGE_PAGES * SUPERPAGE_PAGES);
+                else
+                    printk(XENLOG_ERR
+                           "pre-mapping P2M 1G-MFN %lx into IOMMU failed: %d\n",
+                           mfn_x(mfn), rc);
+
+                *pl3e = l3e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
                 continue;
             }
@@ -202,7 +234,20 @@ static __init void setup_pv_physmap(stru
                                              L2_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
-                *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
+                mfn_t mfn = page_to_mfn(page);
+
+                if ( need_iommu_pt_sync(d) )
+                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, SUPERPAGE_PAGES,
+                                   IOMMUF_readable | IOMMUF_writable,
+                                   flush_flags);
+                if ( !rc )
+                    make_pages_writable(page, SUPERPAGE_PAGES);
+                else
+                    printk(XENLOG_ERR
+                           "pre-mapping P2M 2M-MFN %lx into IOMMU failed: %d\n",
+                           mfn_x(mfn), rc);
+
+                *pl2e = l2e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
                 continue;
             }
@@ -310,6 +355,7 @@ int __init dom0_construct_pv(struct doma
     unsigned long initrd_pfn = -1, initrd_mfn = 0;
     unsigned long count;
     struct page_info *page = NULL;
+    unsigned int flush_flags = 0;
     start_info_t *si;
     struct vcpu *v = d->vcpu[0];
     void *image_base = bootstrap_map(image);
@@ -572,6 +618,18 @@ int __init dom0_construct_pv(struct doma
                     BUG();
         }
         initrd->mod_end = 0;
+
+        count = PFN_UP(initrd_len);
+
+        if ( need_iommu_pt_sync(d) )
+            rc = iommu_map(d, _dfn(initrd_mfn), _mfn(initrd_mfn), count,
+                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+        if ( !rc )
+            make_pages_writable(mfn_to_page(_mfn(initrd_mfn)), count);
+        else
+            printk(XENLOG_ERR
+                   "pre-mapping initrd (MFN %lx) into IOMMU failed: %d\n",
+                   initrd_mfn, rc);
     }
 
     printk("PHYSICAL MEMORY ARRANGEMENT:\n"
@@ -605,6 +663,22 @@ int __init dom0_construct_pv(struct doma
 
     process_pending_softirqs();
 
+    /*
+     * We map the full range here and then punch a hole for page tables via
+     * iommu_unmap() further down, once they have got marked as such.
+     */
+    if ( need_iommu_pt_sync(d) )
+        rc = iommu_map(d, _dfn(alloc_spfn), _mfn(alloc_spfn),
+                       alloc_epfn - alloc_spfn,
+                       IOMMUF_readable | IOMMUF_writable, &flush_flags);
+    if ( !rc )
+        make_pages_writable(mfn_to_page(_mfn(alloc_spfn)),
+                            alloc_epfn - alloc_spfn);
+    else
+        printk(XENLOG_ERR
+               "pre-mapping MFNs [%lx,%lx) into IOMMU failed: %d\n",
+               alloc_spfn, alloc_epfn, rc);
+
     mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
     if ( vinitrd_start )
         mpt_alloc -= PAGE_ALIGN(initrd_len);
@@ -689,7 +763,8 @@ int __init dom0_construct_pv(struct doma
         l1tab++;
 
         page = mfn_to_page(_mfn(mfn));
-        if ( !page->u.inuse.type_info &&
+        if ( (!page->u.inuse.type_info ||
+              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
              !get_page_and_type(page, d, PGT_writable_page) )
             BUG();
     }
@@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
     /* Pages that are part of page tables must be read only. */
     mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
 
+    /*
+     * This needs to come after all potentially excess
+     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
+     * few lines further up, where the effect of calling that function in an
+     * earlier loop iteration may get overwritten by a later one.
+     */
+    if ( need_iommu_pt_sync(d) &&
+         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
+                     &flush_flags) )
+        BUG();
+
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
         shared_info(d, vcpu_info[i].evtchn_upcall_mask) = 1;
@@ -793,7 +879,7 @@ int __init dom0_construct_pv(struct doma
     {
         pfn = pagetable_get_pfn(v->arch.guest_table);
         setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
-                         nr_pages);
+                         nr_pages, &flush_flags);
     }
 
     /* Write the phys->machine and machine->phys table entries. */
@@ -825,7 +911,9 @@ int __init dom0_construct_pv(struct doma
         if ( get_gpfn_from_mfn(mfn) >= count )
         {
             BUG_ON(compat);
-            if ( !page->u.inuse.type_info &&
+            if ( (!page->u.inuse.type_info ||
+                  page->u.inuse.type_info == (PGT_writable_page |
+                                              PGT_validated)) &&
                  !get_page_and_type(page, d, PGT_writable_page) )
                 BUG();
 
@@ -841,22 +929,41 @@ int __init dom0_construct_pv(struct doma
 #endif
     while ( pfn < nr_pages )
     {
-        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
+        count = domain_tot_pages(d);
+        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
             panic("Not enough RAM for DOM0 reservation\n");
+        mfn = mfn_x(page_to_mfn(page));
+
+        if ( need_iommu_pt_sync(d) )
+        {
+            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
+                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+            if ( rc )
+                printk(XENLOG_ERR
+                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
+                       mfn, pfn, rc);
+        }
+
         while ( pfn < domain_tot_pages(d) )
         {
-            mfn = mfn_x(page_to_mfn(page));
+            if ( !rc )
+                make_pages_writable(page, 1);
+
 #ifndef NDEBUG
 #define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
 #endif
             dom0_update_physmap(compat, pfn, mfn, vphysmap_start);
 #undef pfn
-            page++; pfn++;
+            page++; mfn++; pfn++;
             if ( !(pfn & 0xfffff) )
                 process_pending_softirqs();
         }
     }
 
+    /* Use while() to avoid compiler warning. */
+    while ( iommu_iotlb_flush_all(d, flush_flags) )
+        break;
+
     if ( initrd_len != 0 )
     {
         si->mod_start = vinitrd_start ?: initrd_pfn;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -231,8 +231,8 @@ static unsigned int __hwdom_init hwdom_i
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 {
-    unsigned long i, top, max_pfn;
-    unsigned int flush_flags = 0;
+    unsigned long i, top, max_pfn, start, count;
+    unsigned int flush_flags = 0, start_perms = 0;
 
     BUG_ON(!is_hardware_domain(d));
 
@@ -259,7 +259,7 @@ void __hwdom_init arch_iommu_hwdom_init(
     max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
     top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
 
-    for ( i = 0; i < top; i++ )
+    for ( start = count = i = 0; i < top; )
     {
         unsigned long pfn = pdx_to_pfn(i);
         unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
@@ -272,16 +272,30 @@ void __hwdom_init arch_iommu_hwdom_init(
                                         perms & IOMMUF_writable ? p2m_access_rw
                                                                 : p2m_access_r,
                                         0);
+        else if ( pfn != start + count || perms != start_perms )
+        {
+        commit:
+            rc = iommu_map(d, _dfn(start), _mfn(start), count,
+                           start_perms, &flush_flags);
+            SWAP(start, pfn);
+            start_perms = perms;
+            count = 1;
+        }
         else
-            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           perms, &flush_flags);
+        {
+            ++count;
+            rc = 0;
+        }
 
         if ( rc )
             printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
                    d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
 
-        if (!(i & 0xfffff))
+        if ( !(++i & 0xfffff) )
             process_pending_softirqs();
+
+        if ( i == top && count )
+            goto commit;
     }
 
     /* Use if to avoid compiler warning */




* [PATCH 09/17] IOMMU/x86: support freeing of pagetables
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (7 preceding siblings ...)
  2021-08-24 14:21 ` [PATCH 08/17] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
@ 2021-08-24 14:22 ` Jan Beulich
  2021-08-24 14:22 ` [PATCH 10/17] AMD/IOMMU: drop stray TLB flush Jan Beulich
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

For vendor-specific code to support superpages we need to be able to
deal with a superpage mapping replacing an intermediate page table (or a
hierarchy thereof). Consequently a counterpart to iommu_alloc_pgtable()
is needed, to free individual page tables while a domain is still alive.
Since the freeing needs to be deferred until after a suitable IOTLB
flush has been performed, released page tables get queued for processing
by a tasklet.
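
For illustration only (the vendor-side changes come later in the series;
the condition and names other than iommu_queue_free_pgtable() are
placeholders), the intended use is roughly:

    if ( /* new superpage PDE replaces an intermediate page table */ )
    {
        struct page_info *pg = maddr_to_page(old_table_maddr);

        /* ... install the superpage PDE, set IOMMU_FLUSHF_modified ... */

        /* Freed from the per-CPU tasklet, i.e. after the IOTLB flush. */
        iommu_queue_free_pgtable(d, pg);
    }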

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I was considering whether to use a softirq-tasklet instead. This would
have the benefit of avoiding extra scheduling operations, but come with
the risk of the freeing happening prematurely because of a
process_pending_softirqs() somewhere.
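
A minimal usage sketch, not part of this patch: names follow the later AMD
patches in this series, recursion over lower-level tables is omitted (see the
queue_free_pt() helpers added there), and the IOTLB flush itself is issued by
the generic iommu_{,un}map() caller before the tasklet gets to run.

    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
                                flags & IOMMUF_writable,
                                flags & IOMMUF_readable);

    spin_unlock(&hd->arch.mapping_lock);

    *flush_flags |= IOMMU_FLUSHF_added;
    if ( old.pr )
    {
        *flush_flags |= IOMMU_FLUSHF_modified;

        if ( level > 1 && old.next_level )
            /* Deferred: the page is actually freed by the per-CPU tasklet. */
            iommu_queue_free_pgtable(d, mfn_to_page(_mfn(old.mfn)));
    }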

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -12,6 +12,7 @@
  * this program; If not, see <http://www.gnu.org/licenses/>.
  */
 
+#include <xen/cpu.h>
 #include <xen/sched.h>
 #include <xen/iommu.h>
 #include <xen/paging.h>
@@ -363,6 +364,85 @@ struct page_info *iommu_alloc_pgtable(st
     return pg;
 }
 
+/*
+ * Intermediate page tables which get replaced by large pages may only be
+ * freed after a suitable IOTLB flush. Hence such pages get queued on a
+ * per-CPU list, with a per-CPU tasklet processing the list on the assumption
+ * that the necessary IOTLB flush will have occurred by the time tasklets get
+ * to run. (List and tasklet being per-CPU has the benefit of accesses not
+ * requiring any locking.)
+ */
+static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
+static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
+
+static void free_queued_pgtables(void *arg)
+{
+    struct page_list_head *list = arg;
+    struct page_info *pg;
+
+    while ( (pg = page_list_remove_head(list)) )
+        free_domheap_page(pg);
+}
+
+void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    unsigned int cpu = smp_processor_id();
+
+    spin_lock(&hd->arch.pgtables.lock);
+    page_list_del(pg, &hd->arch.pgtables.list);
+    spin_unlock(&hd->arch.pgtables.lock);
+
+    page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu));
+
+    tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu));
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+    struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
+    struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);
+
+    switch ( action )
+    {
+    case CPU_DOWN_PREPARE:
+        tasklet_kill(tasklet);
+        break;
+
+    case CPU_DEAD:
+        page_list_splice(list, &this_cpu(free_pgt_list));
+        INIT_PAGE_LIST_HEAD(list);
+        tasklet_schedule(&this_cpu(free_pgt_tasklet));
+        break;
+
+    case CPU_UP_PREPARE:
+    case CPU_DOWN_FAILED:
+        tasklet_init(tasklet, free_queued_pgtables, list);
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init bsp_init(void)
+{
+    if ( iommu_enabled )
+    {
+        cpu_callback(&cpu_nfb, CPU_UP_PREPARE,
+                     (void *)(unsigned long)smp_processor_id());
+        register_cpu_notifier(&cpu_nfb);
+    }
+
+    return 0;
+}
+presmp_initcall(bsp_init);
+
 bool arch_iommu_use_permitted(const struct domain *d)
 {
     /*
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -136,6 +136,7 @@ int pi_update_irte(const struct pi_desc
 
 int __must_check iommu_free_pgtables(struct domain *d);
 struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 10/17] AMD/IOMMU: drop stray TLB flush
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (8 preceding siblings ...)
  2021-08-24 14:22 ` [PATCH 09/17] IOMMU/x86: support freeing of pagetables Jan Beulich
@ 2021-08-24 14:22 ` Jan Beulich
  2021-08-24 14:23 ` [PATCH 11/17] AMD/IOMMU: walk trees upon page fault Jan Beulich
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

I think this flush was overlooked when flushing was moved out of the
core (un)mapping functions. The flush the caller is required to invoke
anyway will satisfy the needs resulting from the splitting of a
superpage.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -179,7 +179,7 @@ void __init iommu_dte_add_device_entry(s
  */
 static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
                               unsigned int target, unsigned long *pt_mfn,
-                              bool map)
+                              unsigned int *flush_flags, bool map)
 {
     union amd_iommu_pte *pde, *next_table_vaddr;
     unsigned long  next_table_mfn;
@@ -237,7 +237,7 @@ static int iommu_pde_from_dfn(struct dom
             set_iommu_pde_present(pde, next_table_mfn, next_level, true,
                                   true);
 
-            amd_iommu_flush_all_pages(d);
+            *flush_flags |= IOMMU_FLUSHF_modified;
         }
 
         /* Install lower level page table for non-present entries */
@@ -309,7 +309,8 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, true) || !pt_mfn )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+         !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -342,7 +343,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 11/17] AMD/IOMMU: walk trees upon page fault
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (9 preceding siblings ...)
  2021-08-24 14:22 ` [PATCH 10/17] AMD/IOMMU: drop stray TLB flush Jan Beulich
@ 2021-08-24 14:23 ` Jan Beulich
  2021-08-24 14:24 ` [PATCH 12/17] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

This is to aid diagnosing issues and largely matches VT-d's behavior.
Since I'm adding permissions output here as well, take the opportunity
to also display them in amd_dump_page_table_level().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -235,6 +235,8 @@ int __must_check amd_iommu_flush_iotlb_p
                                              unsigned long page_count,
                                              unsigned int flush_flags);
 int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+                             dfn_t dfn);
 
 /* device table functions */
 int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -573,6 +573,9 @@ static void parse_event_log_entry(struct
                (flags & 0x002) ? " NX" : "",
                (flags & 0x001) ? " GN" : "");
 
+        if ( iommu_verbose )
+            amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
+
         for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
             if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
                 pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -363,6 +363,50 @@ int amd_iommu_unmap_page(struct domain *
     return 0;
 }
 
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+                             dfn_t dfn)
+{
+    mfn_t pt_mfn;
+    unsigned int level;
+    const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
+
+    if ( !dt[dev_id].tv )
+    {
+        printk("%pp: no root\n", &PCI_SBDF2(iommu->seg, dev_id));
+        return;
+    }
+
+    pt_mfn = _mfn(dt[dev_id].pt_root);
+    level = dt[dev_id].paging_mode;
+    printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
+           &PCI_SBDF2(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
+
+    while ( level )
+    {
+        const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
+        unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
+        union amd_iommu_pte pte = pt[idx];
+
+        unmap_domain_page(pt);
+
+        printk("  L%u[%03x] = %"PRIx64" %c%c\n", level, idx, pte.raw,
+               pte.pr ? pte.ir ? 'r' : '-' : 'n',
+               pte.pr ? pte.iw ? 'w' : '-' : 'p');
+
+        if ( !pte.pr )
+            break;
+
+        if ( pte.next_level >= level )
+        {
+            printk("  L%u[%03x]: next: %u\n", level, idx, pte.next_level);
+            break;
+        }
+
+        pt_mfn = _mfn(pte.mfn);
+        level = pte.next_level;
+    }
+}
+
 static unsigned long flush_count(unsigned long dfn, unsigned long page_count,
                                  unsigned int order)
 {
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -561,10 +561,11 @@ static void amd_dump_page_table_level(st
                 mfn_to_page(_mfn(pde->mfn)), pde->next_level,
                 address, indent + 1);
         else
-            printk("%*sdfn: %08lx  mfn: %08lx\n",
+            printk("%*sdfn: %08lx  mfn: %08lx  %c%c\n",
                    indent, "",
                    (unsigned long)PFN_DOWN(address),
-                   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)));
+                   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)),
+                   pde->ir ? 'r' : '-', pde->iw ? 'w' : '-');
     }
 
     unmap_domain_page(table_vaddr);



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 12/17] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (10 preceding siblings ...)
  2021-08-24 14:23 ` [PATCH 11/17] AMD/IOMMU: walk trees upon page fault Jan Beulich
@ 2021-08-24 14:24 ` Jan Beulich
  2021-08-24 14:25 ` [PATCH 13/17] AMD/IOMMU: allow use of superpage mappings Jan Beulich
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

In order to free intermediate page tables when replacing smaller
mappings by a single larger one, callers will need to know the full PTE.
Flush indicators can be derived from this in the callers (and outside
the locked regions). First split set_iommu_pte_present() from
set_iommu_ptes_present(): Only the former needs to return the old PTE,
while the latter (like also set_iommu_pde_present()) doesn't even need
to return flush indicators. Then change return types/values and callers
accordingly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -31,30 +31,28 @@ static unsigned int pfn_to_pde_idx(unsig
     return idx;
 }
 
-static unsigned int clear_iommu_pte_present(unsigned long l1_mfn,
-                                            unsigned long dfn)
+static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
+                                                   unsigned long dfn)
 {
-    union amd_iommu_pte *table, *pte;
-    unsigned int flush_flags;
+    union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
     pte = &table[pfn_to_pde_idx(dfn, 1)];
+    old = *pte;
 
-    flush_flags = pte->pr ? IOMMU_FLUSHF_modified : 0;
     write_atomic(&pte->raw, 0);
 
     unmap_domain_page(table);
 
-    return flush_flags;
+    return old;
 }
 
-static unsigned int set_iommu_pde_present(union amd_iommu_pte *pte,
-                                          unsigned long next_mfn,
-                                          unsigned int next_level, bool iw,
-                                          bool ir)
+static void set_iommu_pde_present(union amd_iommu_pte *pte,
+                                  unsigned long next_mfn,
+                                  unsigned int next_level,
+                                  bool iw, bool ir)
 {
-    union amd_iommu_pte new = {}, old;
-    unsigned int flush_flags = IOMMU_FLUSHF_added;
+    union amd_iommu_pte new = {};
 
     /*
      * FC bit should be enabled in PTE, this helps to solve potential
@@ -68,28 +66,42 @@ static unsigned int set_iommu_pde_presen
     new.next_level = next_level;
     new.pr = true;
 
-    old.raw = read_atomic(&pte->raw);
-    old.ign0 = 0;
-    old.ign1 = 0;
-    old.ign2 = 0;
+    write_atomic(&pte->raw, new.raw);
+}
 
-    if ( old.pr && old.raw != new.raw )
-        flush_flags |= IOMMU_FLUSHF_modified;
+static union amd_iommu_pte set_iommu_pte_present(unsigned long pt_mfn,
+                                                 unsigned long dfn,
+                                                 unsigned long next_mfn,
+                                                 unsigned int level,
+                                                 bool iw, bool ir)
+{
+    union amd_iommu_pte *table, *pde, old;
 
-    write_atomic(&pte->raw, new.raw);
+    table = map_domain_page(_mfn(pt_mfn));
+    pde = &table[pfn_to_pde_idx(dfn, level)];
+
+    old = *pde;
+    if ( !old.pr || old.next_level ||
+         old.mfn != next_mfn ||
+         old.iw != iw || old.ir != ir )
+        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+    else
+        old.pr = false; /* signal "no change" to the caller */
 
-    return flush_flags;
+    unmap_domain_page(table);
+
+    return old;
 }
 
-static unsigned int set_iommu_ptes_present(unsigned long pt_mfn,
-                                           unsigned long dfn,
-                                           unsigned long next_mfn,
-                                           unsigned int nr_ptes,
-                                           unsigned int pde_level,
-                                           bool iw, bool ir)
+static void set_iommu_ptes_present(unsigned long pt_mfn,
+                                   unsigned long dfn,
+                                   unsigned long next_mfn,
+                                   unsigned int nr_ptes,
+                                   unsigned int pde_level,
+                                   bool iw, bool ir)
 {
     union amd_iommu_pte *table, *pde;
-    unsigned int page_sz, flush_flags = 0;
+    unsigned int page_sz;
 
     table = map_domain_page(_mfn(pt_mfn));
     pde = &table[pfn_to_pde_idx(dfn, pde_level)];
@@ -98,20 +110,18 @@ static unsigned int set_iommu_ptes_prese
     if ( (void *)(pde + nr_ptes) > (void *)table + PAGE_SIZE )
     {
         ASSERT_UNREACHABLE();
-        return 0;
+        return;
     }
 
     while ( nr_ptes-- )
     {
-        flush_flags |= set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
 
         ++pde;
         next_mfn += page_sz;
     }
 
     unmap_domain_page(table);
-
-    return flush_flags;
 }
 
 void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
@@ -284,6 +294,7 @@ int amd_iommu_map_page(struct domain *d,
     struct domain_iommu *hd = dom_iommu(d);
     int rc;
     unsigned long pt_mfn = 0;
+    union amd_iommu_pte old;
 
     spin_lock(&hd->arch.mapping_lock);
 
@@ -320,12 +331,16 @@ int amd_iommu_map_page(struct domain *d,
     }
 
     /* Install 4k mapping */
-    *flush_flags |= set_iommu_ptes_present(pt_mfn, dfn_x(dfn), mfn_x(mfn),
-                                           1, 1, (flags & IOMMUF_writable),
-                                           (flags & IOMMUF_readable));
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+                                (flags & IOMMUF_writable),
+                                (flags & IOMMUF_readable));
 
     spin_unlock(&hd->arch.mapping_lock);
 
+    *flush_flags |= IOMMU_FLUSHF_added;
+    if ( old.pr )
+        *flush_flags |= IOMMU_FLUSHF_modified;
+
     return 0;
 }
 
@@ -334,6 +349,7 @@ int amd_iommu_unmap_page(struct domain *
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
 
@@ -355,11 +371,14 @@ int amd_iommu_unmap_page(struct domain *
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        *flush_flags |= clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
+    if ( old.pr )
+        *flush_flags |= IOMMU_FLUSHF_modified;
+
     return 0;
 }
 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 13/17] AMD/IOMMU: allow use of superpage mappings
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (11 preceding siblings ...)
  2021-08-24 14:24 ` [PATCH 12/17] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
@ 2021-08-24 14:25 ` Jan Beulich
  2021-08-24 14:25 ` [PATCH 14/17] VT-d: " Jan Beulich
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:25 UTC (permalink / raw)
  To: xen-devel
  Cc: Paul Durrant, Andrew Cooper, George Dunlap, Ian Jackson,
	Julien Grall, Stefano Stabellini, Wei Liu

No separate feature flags exist which would control availability of
these; the only restriction is HATS (establishing the maximum number of
page table levels in general), and even that has a lower bound of 4.
Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
non-default page sizes the implementation in principle permits arbitrary
size mappings, but these require multiple identical leaf PTEs to be
written, which isn't all that different from having to write multiple
consecutive PTEs with increasing frame numbers. IMO that's therefore
beneficial only on hardware where suitable TLBs exist; I'm unaware of
such hardware.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm not fully sure about allowing 512G mappings: The scheduling-for-
freeing of intermediate page tables can take quite a while when
replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
there's no present code path via which 512G chunks of memory could be
allocated (and hence mapped) anyway.
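
For reference, a sketch of the order-to-level translation the changes below
rely on (the helper name is made up for illustration; the patch open-codes
the expression, with PTE_PER_TABLE_SHIFT being 9, i.e. 512 entries per
table):

    /*
     * order  0 -> level 1 (4k)     order  9 -> level 2 (2M)
     * order 18 -> level 3 (1G)     order 27 -> level 4 (512G)
     */
    static inline unsigned int order_to_level(unsigned int order)
    {
        return order / PTE_PER_TABLE_SHIFT + 1;
    }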

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
 }
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
-                                                   unsigned long dfn)
+                                                   unsigned long dfn,
+                                                   unsigned int level)
 {
     union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
-    pte = &table[pfn_to_pde_idx(dfn, 1)];
+    pte = &table[pfn_to_pde_idx(dfn, level)];
     old = *pte;
 
     write_atomic(&pte->raw, 0);
@@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
     return 0;
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
+{
+    if ( next_level > 1 )
+    {
+        union amd_iommu_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
+            if ( pt[i].pr && pt[i].next_level )
+            {
+                ASSERT(pt[i].next_level < next_level);
+                queue_free_pt(d, _mfn(pt[i].mfn), pt[i].next_level);
+            }
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 int amd_iommu_map_page(struct domain *d, dfn_t dfn, mfn_t mfn,
                        unsigned int flags, unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
     int rc;
     unsigned long pt_mfn = 0;
     union amd_iommu_pte old;
@@ -320,7 +342,7 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
          !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -330,8 +352,8 @@ int amd_iommu_map_page(struct domain *d,
         return -EFAULT;
     }
 
-    /* Install 4k mapping */
-    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+    /* Install mapping */
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
                                 (flags & IOMMUF_writable),
                                 (flags & IOMMUF_readable));
 
@@ -339,8 +361,13 @@ int amd_iommu_map_page(struct domain *d,
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( level > 1 && old.next_level )
+            queue_free_pt(d, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
@@ -349,6 +376,7 @@ int amd_iommu_unmap_page(struct domain *
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
     union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
@@ -359,7 +387,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -371,14 +399,19 @@ int amd_iommu_unmap_page(struct domain *
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( level > 1 && old.next_level )
+            queue_free_pt(d, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -584,7 +584,7 @@ static void amd_dump_page_tables(struct
 }
 
 static const struct iommu_ops __initconstrel _iommu_ops = {
-    .page_sizes = PAGE_SIZE_4K,
+    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
     .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/include/xen/page-defs.h
+++ b/xen/include/xen/page-defs.h
@@ -21,4 +21,19 @@
 #define PAGE_MASK_64K               PAGE_MASK_GRAN(64K)
 #define PAGE_ALIGN_64K(addr)        PAGE_ALIGN_GRAN(64K, addr)
 
+#define PAGE_SHIFT_2M               21
+#define PAGE_SIZE_2M                PAGE_SIZE_GRAN(2M)
+#define PAGE_MASK_2M                PAGE_MASK_GRAN(2M)
+#define PAGE_ALIGN_2M(addr)         PAGE_ALIGN_GRAN(2M, addr)
+
+#define PAGE_SHIFT_1G               30
+#define PAGE_SIZE_1G                PAGE_SIZE_GRAN(1G)
+#define PAGE_MASK_1G                PAGE_MASK_GRAN(1G)
+#define PAGE_ALIGN_1G(addr)         PAGE_ALIGN_GRAN(1G, addr)
+
+#define PAGE_SHIFT_512G             39
+#define PAGE_SIZE_512G              PAGE_SIZE_GRAN(512G)
+#define PAGE_MASK_512G              PAGE_MASK_GRAN(512G)
+#define PAGE_ALIGN_512G(addr)       PAGE_ALIGN_GRAN(512G, addr)
+
 #endif /* __XEN_PAGE_DEFS_H__ */



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 14/17] VT-d: allow use of superpage mappings
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (12 preceding siblings ...)
  2021-08-24 14:25 ` [PATCH 13/17] AMD/IOMMU: allow use of superpage mappings Jan Beulich
@ 2021-08-24 14:25 ` Jan Beulich
  2021-08-24 14:26 ` [PATCH 15/17] IOMMU: page table dumping adjustments Jan Beulich
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:25 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Kevin Tian

... depending on feature availability (and absence of quirks).

Also make the page table dumping function aware of superpages.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
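
As on the AMD side, the target level is derived from the requested order,
LEVEL_STRIDE being 9 here as well; a sketch (helper name made up for
illustration, the patch open-codes the expression):

    /* order 0 -> level 1 (4k), order 9 -> level 2 (2M), order 18 -> level 3 (1G) */
    static inline unsigned int order_to_level(unsigned int order)
    {
        return order / LEVEL_STRIDE + 1;
    }

For a non-zero order the new PTE additionally gets its superpage bit set (via
dma_set_pte_superpage()), and each large size is announced only if all IOMMUs
report the corresponding SPS capability bit - see vtd_setup() below.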

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -756,18 +756,37 @@ static int __must_check iommu_flush_iotl
     return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
+{
+    if ( next_level > 1 )
+    {
+        struct dma_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_NUM; ++i )
+            if ( dma_pte_present(pt[i]) && !dma_pte_superpage(pt[i]) )
+                queue_free_pt(d, maddr_to_mfn(dma_pte_addr(pt[i])),
+                              next_level - 1);
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 /* clear one page's page table */
 static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
                              unsigned int order,
                              unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct dma_pte *page = NULL, *pte = NULL;
+    struct dma_pte *page = NULL, *pte = NULL, old;
     u64 pg_maddr;
+    unsigned int level = (order / LEVEL_STRIDE) + 1;
 
     spin_lock(&hd->arch.mapping_lock);
-    /* get last level pte */
-    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags, false);
+    /* get target level pte */
+    pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags, false);
     if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -775,7 +794,7 @@ static int dma_pte_clear_one(struct doma
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-    pte = page + address_level_offset(addr, 1);
+    pte = &page[address_level_offset(addr, level)];
 
     if ( !dma_pte_present(*pte) )
     {
@@ -784,14 +803,19 @@ static int dma_pte_clear_one(struct doma
         return 0;
     }
 
+    old = *pte;
     dma_clear_pte(*pte);
-    *flush_flags |= IOMMU_FLUSHF_modified;
 
     spin_unlock(&hd->arch.mapping_lock);
     iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
 
+    *flush_flags |= IOMMU_FLUSHF_modified;
+
+    if ( level > 1 && !dma_pte_superpage(old) )
+        queue_free_pt(domain, maddr_to_mfn(dma_pte_addr(old)), level - 1);
+
     return 0;
 }
 
@@ -1866,6 +1890,7 @@ static int __must_check intel_iommu_map_
     struct domain_iommu *hd = dom_iommu(d);
     struct dma_pte *page, *pte, old, new = {};
     u64 pg_maddr;
+    unsigned int level = (IOMMUF_order(flags) / LEVEL_STRIDE) + 1;
     int rc = 0;
 
     /* Do nothing if VT-d shares EPT page table */
@@ -1890,7 +1915,7 @@ static int __must_check intel_iommu_map_
         return 0;
     }
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1, flush_flags,
+    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), level, flush_flags,
                                       true);
     if ( pg_maddr < PAGE_SIZE )
     {
@@ -1899,13 +1924,15 @@ static int __must_check intel_iommu_map_
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-    pte = &page[dfn_x(dfn) & LEVEL_MASK];
+    pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
     old = *pte;
 
     dma_set_pte_addr(new, mfn_to_maddr(mfn));
     dma_set_pte_prot(new,
                      ((flags & IOMMUF_readable) ? DMA_PTE_READ  : 0) |
                      ((flags & IOMMUF_writable) ? DMA_PTE_WRITE : 0));
+    if ( IOMMUF_order(flags) )
+        dma_set_pte_superpage(new);
 
     /* Set the SNP on leaf page table if Snoop Control available */
     if ( iommu_snoop )
@@ -1926,8 +1953,13 @@ static int __must_check intel_iommu_map_
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( dma_pte_present(old) )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( level > 1 && !dma_pte_superpage(old) )
+            queue_free_pt(d, maddr_to_mfn(dma_pte_addr(old)), level - 1);
+    }
+
     return rc;
 }
 
@@ -2348,6 +2380,7 @@ static int __init vtd_setup(void)
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
+    unsigned int large_sizes = PAGE_SIZE_2M | PAGE_SIZE_1G;
     int ret;
     bool reg_inval_supported = true;
 
@@ -2390,6 +2423,11 @@ static int __init vtd_setup(void)
                cap_sps_2mb(iommu->cap) ? ", 2MB" : "",
                cap_sps_1gb(iommu->cap) ? ", 1GB" : "");
 
+        if ( !cap_sps_2mb(iommu->cap) )
+            large_sizes &= ~PAGE_SIZE_2M;
+        if ( !cap_sps_1gb(iommu->cap) )
+            large_sizes &= ~PAGE_SIZE_1G;
+
 #ifndef iommu_snoop
         if ( iommu_snoop && !ecap_snp_ctl(iommu->ecap) )
             iommu_snoop = false;
@@ -2461,6 +2499,9 @@ static int __init vtd_setup(void)
     if ( ret )
         goto error;
 
+    ASSERT(iommu_ops.page_sizes & PAGE_SIZE_4K);
+    iommu_ops.page_sizes |= large_sizes;
+
     register_keyhandler('V', vtd_dump_iommu_info, "dump iommu info", 1);
 
     return 0;
@@ -2777,7 +2818,7 @@ static void vtd_dump_page_table_level(pa
             continue;
 
         address = gpa + offset_level_address(i, level);
-        if ( next_level >= 1 ) 
+        if ( next_level && !dma_pte_superpage(*pte) )
             vtd_dump_page_table_level(dma_pte_addr(*pte), next_level,
                                       address, indent + 1);
         else



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 15/17] IOMMU: page table dumping adjustments
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (13 preceding siblings ...)
  2021-08-24 14:25 ` [PATCH 14/17] VT-d: " Jan Beulich
@ 2021-08-24 14:26 ` Jan Beulich
  2021-08-24 14:28   ` Jan Beulich
  2021-08-24 14:27 ` [PATCH 16/17] VT-d: show permissions during page table walks Jan Beulich
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

For one, none of the three IOMMU implementations on Arm specifies a dumping
hook. Generalize VT-d's "don't dump shared page tables" to cover for
this.

Further, in the past I was told that on Arm in principle there could be
multiple different IOMMUs, and hence different domains' platform_ops
pointers could differ. Use each domain's ops for calling the dump hook.
(In the long run all uses of iommu_get_ops() would likely need to
disappear for this reason.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -640,12 +640,9 @@ bool_t iommu_has_feature(struct domain *
 static void iommu_dump_page_tables(unsigned char key)
 {
     struct domain *d;
-    const struct iommu_ops *ops;
 
     ASSERT(iommu_enabled);
 
-    ops = iommu_get_ops();
-
     rcu_read_lock(&domlist_read_lock);
 
     for_each_domain(d)
@@ -653,7 +650,13 @@ static void iommu_dump_page_tables(unsig
         if ( is_hardware_domain(d) || !is_iommu_enabled(d) )
             continue;
 
-        ops->dump_page_tables(d);
+        if ( iommu_use_hap_pt(d) )
+        {
+            printk("%pd sharing page tables\n", d);
+            continue;
+        }
+
+        dom_iommu(d)->platform_ops->dump_page_tables(d);
     }
 
     rcu_read_unlock(&domlist_read_lock);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2835,12 +2835,6 @@ static void vtd_dump_page_tables(struct
 {
     const struct domain_iommu *hd = dom_iommu(d);
 
-    if ( iommu_use_hap_pt(d) )
-    {
-        printk(VTDPREFIX " %pd sharing EPT table\n", d);
-        return;
-    }
-
     printk(VTDPREFIX" %pd table has %d levels\n", d,
            agaw_to_level(hd->arch.vtd.agaw));
     vtd_dump_page_table_level(hd->arch.vtd.pgd_maddr,



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 16/17] VT-d: show permissions during page table walks
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (14 preceding siblings ...)
  2021-08-24 14:26 ` [PATCH 15/17] IOMMU: page table dumping adjustments Jan Beulich
@ 2021-08-24 14:27 ` Jan Beulich
  2021-09-16  7:36   ` Tian, Kevin
  2021-08-24 14:27 ` [PATCH 17/17] IOMMU/x86: drop pointless NULL checks Jan Beulich
  2021-08-25 12:06 ` [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
  17 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:27 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Kevin Tian

Besides the addresses, this is the next crucial bit of information one
might be after.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2822,10 +2822,12 @@ static void vtd_dump_page_table_level(pa
             vtd_dump_page_table_level(dma_pte_addr(*pte), next_level,
                                       address, indent + 1);
         else
-            printk("%*sdfn: %08lx mfn: %08lx\n",
+            printk("%*sdfn: %08lx mfn: %08lx %c%c\n",
                    indent, "",
                    (unsigned long)(address >> PAGE_SHIFT_4K),
-                   (unsigned long)(dma_pte_addr(*pte) >> PAGE_SHIFT_4K));
+                   (unsigned long)(dma_pte_addr(*pte) >> PAGE_SHIFT_4K),
+                   dma_pte_read(*pte) ? 'r' : '-',
+                   dma_pte_write(*pte) ? 'w' : '-');
     }
 
     unmap_vtd_domain_page(pt_vaddr);
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -159,14 +159,11 @@ void print_vtd_entries(struct vtd_iommu
         l_index = get_level_index(gmfn, level);
         pte.val = l[l_index];
         unmap_vtd_domain_page(l);
-        printk("    l%u[%03x] = %"PRIx64"\n", level, l_index, pte.val);
+        printk("    l%u[%03x] = %"PRIx64" %c%c\n", level, l_index, pte.val,
+               dma_pte_read(pte) ? 'r' : '-',
+               dma_pte_write(pte) ? 'w' : '-');
 
-        if ( !dma_pte_present(pte) )
-        {
-            printk("    l%u[%03x] not present\n", level, l_index);
-            break;
-        }
-        if ( dma_pte_superpage(pte) )
+        if ( !dma_pte_present(pte) || dma_pte_superpage(pte) )
             break;
         val = dma_pte_addr(pte);
     } while ( --level );



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 17/17] IOMMU/x86: drop pointless NULL checks
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (15 preceding siblings ...)
  2021-08-24 14:27 ` [PATCH 16/17] VT-d: show permissions during page table walks Jan Beulich
@ 2021-08-24 14:27 ` Jan Beulich
  2021-08-26 12:05   ` Andrew Cooper
                     ` (2 more replies)
  2021-08-25 12:06 ` [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
  17 siblings, 3 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:27 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Kevin Tian

map_domain_page() et al never fail; no need to check their return values
against NULL, and no need to carry dead printk()s.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -530,12 +530,6 @@ static void amd_dump_page_table_level(st
         return;
 
     table_vaddr = __map_domain_page(pg);
-    if ( table_vaddr == NULL )
-    {
-        printk("AMD IOMMU failed to map domain page %"PRIpaddr"\n",
-                page_to_maddr(pg));
-        return;
-    }
 
     for ( index = 0; index < PTE_PER_TABLE_SIZE; index++ )
     {
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2800,12 +2800,6 @@ static void vtd_dump_page_table_level(pa
         return;
 
     pt_vaddr = map_vtd_domain_page(pt_maddr);
-    if ( pt_vaddr == NULL )
-    {
-        printk(VTDPREFIX " failed to map domain page %"PRIpaddr"\n",
-               pt_maddr);
-        return;
-    }
 
     next_level = level - 1;
     for ( i = 0; i < PTE_NUM; i++ )
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -106,11 +106,6 @@ void print_vtd_entries(struct vtd_iommu
     }
 
     root_entry = (struct root_entry *)map_vtd_domain_page(iommu->root_maddr);
-    if ( root_entry == NULL )
-    {
-        printk("    root_entry == NULL\n");
-        return;
-    }
 
     printk("    root_entry[%02x] = %"PRIx64"\n", bus, root_entry[bus].val);
     if ( !root_present(root_entry[bus]) )
@@ -123,11 +118,6 @@ void print_vtd_entries(struct vtd_iommu
     val = root_entry[bus].val;
     unmap_vtd_domain_page(root_entry);
     ctxt_entry = map_vtd_domain_page(val);
-    if ( ctxt_entry == NULL )
-    {
-        printk("    ctxt_entry == NULL\n");
-        return;
-    }
 
     val = ctxt_entry[devfn].lo;
     printk("    context[%02x] = %"PRIx64"_%"PRIx64"\n",
@@ -151,11 +141,6 @@ void print_vtd_entries(struct vtd_iommu
     do
     {
         l = map_vtd_domain_page(val);
-        if ( l == NULL )
-        {
-            printk("    l%u == NULL\n", level);
-            break;
-        }
         l_index = get_level_index(gmfn, level);
         pte.val = l[l_index];
         unmap_vtd_domain_page(l);



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 15/17] IOMMU: page table dumping adjustments
  2021-08-24 14:26 ` [PATCH 15/17] IOMMU: page table dumping adjustments Jan Beulich
@ 2021-08-24 14:28   ` Jan Beulich
  2021-09-16  7:33     ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2021-08-24 14:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Kevin Tian

On 24.08.2021 16:26, Jan Beulich wrote:
> For one, none of the three IOMMU implementations on Arm specifies a dumping
> hook. Generalize VT-d's "don't dump shared page tables" to cover for
> this.
> 
> Further, in the past I was told that on Arm in principle there could be
> multiple different IOMMUs, and hence different domains' platform_ops
> pointers could differ. Use each domain's ops for calling the dump hook.
> (In the long run all uses of iommu_get_ops() would likely need to
> disappear for this reason.)
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Should have Cc-ed Kevin on this one as well; now added.

Jan

> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -640,12 +640,9 @@ bool_t iommu_has_feature(struct domain *
>  static void iommu_dump_page_tables(unsigned char key)
>  {
>      struct domain *d;
> -    const struct iommu_ops *ops;
>  
>      ASSERT(iommu_enabled);
>  
> -    ops = iommu_get_ops();
> -
>      rcu_read_lock(&domlist_read_lock);
>  
>      for_each_domain(d)
> @@ -653,7 +650,13 @@ static void iommu_dump_page_tables(unsig
>          if ( is_hardware_domain(d) || !is_iommu_enabled(d) )
>              continue;
>  
> -        ops->dump_page_tables(d);
> +        if ( iommu_use_hap_pt(d) )
> +        {
> +            printk("%pd sharing page tables\n", d);
> +            continue;
> +        }
> +
> +        dom_iommu(d)->platform_ops->dump_page_tables(d);
>      }
>  
>      rcu_read_unlock(&domlist_read_lock);
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2835,12 +2835,6 @@ static void vtd_dump_page_tables(struct
>  {
>      const struct domain_iommu *hd = dom_iommu(d);
>  
> -    if ( iommu_use_hap_pt(d) )
> -    {
> -        printk(VTDPREFIX " %pd sharing EPT table\n", d);
> -        return;
> -    }
> -
>      printk(VTDPREFIX" %pd table has %d levels\n", d,
>             agaw_to_level(hd->arch.vtd.agaw));
>      vtd_dump_page_table_level(hd->arch.vtd.pgd_maddr,
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 00/17] IOMMU: superpage support when not sharing pagetables
  2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (16 preceding siblings ...)
  2021-08-24 14:27 ` [PATCH 17/17] IOMMU/x86: drop pointless NULL checks Jan Beulich
@ 2021-08-25 12:06 ` Jan Beulich
  17 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-25 12:06 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

On 24.08.2021 16:13, Jan Beulich wrote:
> For a long time we've been rather inefficient with IOMMU page table
> management when not sharing page tables, i.e. in particular for PV (and
> further specifically also for PV Dom0) and AMD (where nowadays we never
> share page tables). While up to about 2.5 years ago AMD code had logic
> to un-shatter page mappings, that logic was ripped out for being buggy
> (XSA-275 plus follow-on).
> 
> This series enables use of large pages in AMD and Intel (VT-d) code;
> Arm is presently not in need of any enabling as pagetables are always
> shared there. It also augments PV Dom0 creation with suitable explicit
> IOMMU mapping calls to facilitate use of large pages there without
> getting into the business of un-shattering page mappings just yet.
> Depending on the amount of memory handed to Dom0 this improves booting
> time (latency until Dom0 actually starts) quite a bit; subsequent
> shattering of some of the large pages may of course consume some of the
> saved time.
> 
> Parts of this series functionally depend on the previously submitted
> "VT-d: fix caching mode IOTLB flushing".
> 
> Known fallout has been spelled out here:
> https://lists.xen.org/archives/html/xen-devel/2021-08/msg00781.html
> 
> I'm inclined to say "of course" there are also various seemingly
> unrelated changes included here, which I just came to consider necessary
> or at least desirable (in part for having been in need of adjustment for
> a long time) along the way. Some of these changes are likely independent
> of the bulk of the work here, and hence may be fine to go in ahead of
> earlier patches.
> 
> While, as said above, un-shattering of mappings isn't an immediate goal,
> I intend to at least arrange for freeing page tables which have ended up
> all empty. But that's not part of this v1 of the series.
> 
> 01: AMD/IOMMU: avoid recording each level's MFN when walking page table
> 02: AMD/IOMMU: have callers specify the target level for page table walks
> 03: VT-d: have callers specify the target level for page table walks
> 04: IOMMU: have vendor code announce supported page sizes
> 05: IOMMU: add order parameter to ->{,un}map_page() hooks
> 06: IOMMU: have iommu_{,un}map() split requests into largest possible chunks
> 07: IOMMU/x86: restrict IO-APIC mappings for PV Dom0
> 08: IOMMU/x86: perform PV Dom0 mappings in batches
> 09: IOMMU/x86: support freeing of pagetables
> 10: AMD/IOMMU: drop stray TLB flush
> 11: AMD/IOMMU: walk trees upon page fault
> 12: AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
> 13: AMD/IOMMU: allow use of superpage mappings
> 14: VT-d: allow use of superpage mappings

I should probably spell this out explicitly: These last two, which actually
enable use of superpages, and which hence introduce the need to shatter
some, come with the risk of tripping code like this

	ret = xenmem_reservation_decrease(nr_pages, frame_list);
	BUG_ON(ret != nr_pages);

still found e.g. in up-to-date Linux. This is - similar to the other
fallout description pointed at further up - a result of now potentially
needing to allocate memory in order to free some, and perhaps the needed
amount being higher than the freed one.

Jan



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/17] AMD/IOMMU: avoid recording each level's MFN when walking page table
  2021-08-24 14:15 ` [PATCH 01/17] AMD/IOMMU: avoid recording each level's MFN when walking page table Jan Beulich
@ 2021-08-25 13:29   ` Andrew Cooper
  0 siblings, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2021-08-25 13:29 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant

On 24/08/2021 15:15, Jan Beulich wrote:
> Both callers only care about the target (level 1) MFN. I also cannot
> see what we might need higher level MFNs for down the road. And even
> modern gcc doesn't recognize the optimization potential.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

I've been wanting to delete this for several years.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-08-24 14:21 ` [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
@ 2021-08-26 11:57   ` Andrew Cooper
  2021-08-26 12:55     ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2021-08-26 11:57 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant

On 24/08/2021 15:21, Jan Beulich wrote:
> While already the case for PVH, there's no reason to treat PV
> differently here, though of course the addresses get taken from another
> source in this case. Except that, to match CPU side mappings, by default
> we permit r/o ones. This then also means we now deal consistently with
> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
it, it cannot possibly be usable.

IO-APICs use an indirect window, and Xen doesn't perform any
write-emulation (that I can see), so the guest can't even read the data
register and work out which register it represents.  It also can't do an
atomic 64bit read across the index and data registers, as that is
explicitly undefined behaviour in the IO-APIC spec.
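
For illustration (made-up helper, not Xen code), the architectural indirect
access pattern, with IOREGSEL at MMIO offset 0x00 selecting a register and
IOWIN at offset 0x10 holding its contents; with the index write silently
discarded, the data read returns whatever register the window happened to
point at last:

    static uint32_t ioapic_mmio_read(volatile uint32_t *ioapic, unsigned int reg)
    {
        ioapic[0] = reg;    /* select the register via IOREGSEL (offset 0x00) */
        return ioapic[4];   /* read its contents through IOWIN (offset 0x10) */
    }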

On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls
were introduced upon discovering that a read-only mapping of the
IO-APIC is totally useless.

I think we can safely not expose the IO-APICs to PV dom0 at all, which
simplifies the memory handling further.

~Andrew



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks
  2021-08-24 14:27 ` [PATCH 17/17] IOMMU/x86: drop pointless NULL checks Jan Beulich
@ 2021-08-26 12:05   ` Andrew Cooper
  2021-08-26 12:36     ` Jan Beulich
  2021-09-15 12:42   ` Ping: " Jan Beulich
  2021-09-16  7:29   ` Tian, Kevin
  2 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2021-08-26 12:05 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant, Kevin Tian

On 24/08/2021 15:27, Jan Beulich wrote:
> --- a/xen/drivers/passthrough/vtd/utils.c
> +++ b/xen/drivers/passthrough/vtd/utils.c
> @@ -106,11 +106,6 @@ void print_vtd_entries(struct vtd_iommu
>      }
>  
>      root_entry = (struct root_entry *)map_vtd_domain_page(iommu->root_maddr);

Seeing as you're actually cleaning up mapping calls, drop this cast too?

Either way, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

> -    if ( root_entry == NULL )
> -    {
> -        printk("    root_entry == NULL\n");
> -        return;
> -    }
>  
>      printk("    root_entry[%02x] = %"PRIx64"\n", bus, root_entry[bus].val);
>      if ( !root_present(root_entry[bus]) )
>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks
  2021-08-26 12:05   ` Andrew Cooper
@ 2021-08-26 12:36     ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-08-26 12:36 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, Kevin Tian, xen-devel

On 26.08.2021 14:05, Andrew Cooper wrote:
> On 24/08/2021 15:27, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/vtd/utils.c
>> +++ b/xen/drivers/passthrough/vtd/utils.c
>> @@ -106,11 +106,6 @@ void print_vtd_entries(struct vtd_iommu
>>      }
>>  
>>      root_entry = (struct root_entry *)map_vtd_domain_page(iommu->root_maddr);
> 
> Seeing as you're actually cleaning up mapping calls, drop this cast too?

There are more of these, so this would be yet another cleanup patch.
I didn't get annoyed enough by these to put one together; instead I
was hoping that we might touch these lines anyway at some point. E.g.
when doing away with the funny map_vtd_domain_page() wrapper around
map_domain_page() ...

> Either way, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.

Jan



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-08-26 11:57   ` Andrew Cooper
@ 2021-08-26 12:55     ` Jan Beulich
  2021-09-07 17:13       ` Andrew Cooper
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2021-08-26 12:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, xen-devel

On 26.08.2021 13:57, Andrew Cooper wrote:
> On 24/08/2021 15:21, Jan Beulich wrote:
>> While already the case for PVH, there's no reason to treat PV
>> differently here, though of course the addresses get taken from another
>> source in this case. Except that, to match CPU side mappings, by default
>> we permit r/o ones. This then also means we now deal consistently with
>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
> it, it cannot possibly be usable.
> 
> IO-APICs use an indirect window, and Xen doesn't perform any
> write-emulation (that I can see), so the guest can't even read the data
> register and work out which register it represents.  It also can't do an
> atomic 64bit read across the index and data registers, as that is
> explicitly undefined behaviour in the IO-APIC spec.
> 
> On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
> the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls
> were introduced upon discovering that a read-only mapping of the
> IO-APIC is totally useless.
> 
> I think we can safely not expose the IO-APICs to PV dom0 at all, which
> simplifies the memory handling further.

The reason we do expose it r/o is that some ACPI implementations access
(read and write) some RTEs from AML. If we don't allow Dom0 to map the
IO-APIC, Dom0 will typically fail to boot on such systems. What we have
right now seems to be good enough for those systems, no matter that the
data they get to read makes little sense. If we learned of systems
where this isn't sufficient, we'd have to implement more complete read
emulation (i.e. latching writes to the window register while still
discarding writes to the data register).
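
One hypothetical shape of that emulation (illustration only, not existing Xen
code): let writes to the window (index) register take effect, so that
subsequent reads through the r/o mapping return the selected register, while
continuing to discard writes to the data register.

    static void ioapic_ro_write(volatile uint32_t *ioapic, unsigned int offset,
                                uint32_t val)
    {
        if ( offset == 0x00 )     /* IOREGSEL: let the write through */
            ioapic[0] = val;
        /* IOWIN (offset 0x10) and anything else: discard. */
    }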

If we produced a not-present PTE instead of a r/o one for such mapping
requests, I'm afraid we'd actually further complicate memory handling,
because we'd then need to consider for emulation also n/p #PF, not just
writes to r/o mappings.

This said - I'd also be fine with consistently not mapping the IO-APICs
in the IOMMU page tables. It was you who requested CPU and IOMMU mappings
to match. What I want to do away with is the present mixture, derived
from the E820 type covering the IO-APIC space.

Jan



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-08-26 12:55     ` Jan Beulich
@ 2021-09-07 17:13       ` Andrew Cooper
  2021-09-08  9:44         ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2021-09-07 17:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Paul Durrant, xen-devel

On 26/08/2021 13:55, Jan Beulich wrote:
> On 26.08.2021 13:57, Andrew Cooper wrote:
>> On 24/08/2021 15:21, Jan Beulich wrote:
>>> While already the case for PVH, there's no reason to treat PV
>>> differently here, though of course the addresses get taken from another
>>> source in this case. Except that, to match CPU side mappings, by default
>>> we permit r/o ones. This then also means we now deal consistently with
>>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
>> it, it cannot possibly be usable.
>>
>> IO-APICs use an indirect window, and Xen doesn't perform any
>> write-emulation (that I can see), so the guest can't even read the data
>> register and work out which register it represents.  It also can't do an
>> atomic 64bit read across the index and data registers, as that is
>> explicitly undefined behaviour in the IO-APIC spec.
>>
>> On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
>> the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls
>> were introduced upon discovering that a read-only mapping of the
>> IO-APIC is totally useless.
>>
>> I think we can safely not expose the IO-APICs to PV dom0 at all, which
>> simplifies the memory handling further.
> The reason we do expose it r/o is that some ACPI implementations access
> (read and write) some RTEs from AML. If we don't allow Dom0 to map the
> IO-APIC, Dom0 will typically fail to boot on such systems.

I think more details are needed.

If AML is reading the RTEs, it is also writing to the index register.

Irrespective of Xen, doing this behind the back of the OS cannot work
safely, because at a minimum the ACPI interpreter would need to take the
ioapic lock, and I see no evidence of workarounds like this in Linux.


In Xen, we appear to swallow writes to mmio_ro ranges which is rude, and
causes the associated reads to read garbage.  This is Xen-induced memory
corruption inside dom0.


I don't think any of this is legitimate behaviour.  ACPI needs disabling
on such systems, or interlocking suitably, and it's not just IO-APICs
which are problematic - any windowed I/O (e.g. RTC) as well as any other
device with read side effects.

I think we should seriously consider not mapping the IO-APIC at all. 
That said, I would be surprised if Linux is cleanly avoiding the
IO-APIC, so a slightly less bad alternative is to redirect to an "MMIO"
frame of ~0's which we still write-discard on top of.

That at least makes the Xen-induced memory corruption behave
deterministically.

~Andrew



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-09-07 17:13       ` Andrew Cooper
@ 2021-09-08  9:44         ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-09-08  9:44 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, xen-devel

On 07.09.2021 19:13, Andrew Cooper wrote:
> On 26/08/2021 13:55, Jan Beulich wrote:
>> On 26.08.2021 13:57, Andrew Cooper wrote:
>>> On 24/08/2021 15:21, Jan Beulich wrote:
>>>> While already the case for PVH, there's no reason to treat PV
>>>> differently here, though of course the addresses get taken from another
>>>> source in this case. Except that, to match CPU side mappings, by default
>>>> we permit r/o ones. This then also means we now deal consistently with
>>>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
>>> it, it cannot possibly be usable.
>>>
>>> IO-APICs use an indirect window, and Xen doesn't perform any
>>> write-emulation (that I can see), so the guest can't even read the data
>>> register and work out which register it represents.  It also can't do an
>>> atomic 64bit read across the index and data registers, as that is
>>> explicitly undefined behaviour in the IO-APIC spec.
>>>
>>> On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
>>> the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls
>>> were introduced upon discovering that a read-only mapping of the
>>> IO-APIC is totally useless.
>>>
>>> I think we can safely not expose the IO-APICs to PV dom0 at all, which
>>> simplifies the memory handling further.
>> The reason we do expose it r/o is that some ACPI implementations access
>> (read and write) some RTEs from AML. If we don't allow Dom0 to map the
>> IO-APIC, Dom0 will typically fail to boot on such systems.
> 
> I think more details are needed.

How do you expect to collect the necessary info without having an affected
system to test? I see close to zero chance to locate the old reports (and
possible discussion) via web search.

> If AML is reading the RTEs, it is also writing to the index register.

Quite likely, yes. This being broken to a fair degree in the
first place, ...

> Irrespective of Xen, doing this behind the back of the OS cannot work
> safely, because at a minimum the ACPI interpreter would need to take the
> ioapic lock, and I see no evidence of workarounds like this in Linux.

... as you indicate (and as I also think), leaves room for the
actual accesses to also be flawed (and hence meaningless in the first
place). I do recall looking at the disassembled AML back at the time,
but I do not recall any details for sure. What I seem to vaguely recall
is that their whole purpose was to set the mask bit in an RTE (I think
to work around the dual routing issue, and I assume in turn to work
around missing workarounds in certain OSes). For that the current
approach as well as the alternative one you suggest below would be
equally "good enough".
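
At the hardware level such an AML sequence boils down to a read-modify-write
of the RTE's low dword through the same indirect window; a sketch using the
standard redirection table layout (illustrative names):

    /* RTE for pin N: low dword at register 0x10 + 2*N; the mask is bit 16. */
    static void ioapic_mask_pin(volatile uint32_t *base, unsigned int pin)
    {
        uint32_t lo;

        base[0x00 / 4] = 0x10 + 2 * pin;    /* IOREGSEL: select RTE low */
        lo = base[0x10 / 4];                /* read current value via IOWIN */
        base[0x00 / 4] = 0x10 + 2 * pin;
        base[0x10 / 4] = lo | (1u << 16);   /* set the mask bit */
    }

With the r/o mapping both the select and the data writes get discarded.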

> In Xen, we appear to swallow writes to mmio_ro ranges which is rude, and
> causes the associated reads to read garbage.  This is Xen-induced memory
> corruption inside dom0.
> 
> 
> I don't think any of this is legitimate behaviour.  ACPI needs disabling
> on such systems, or interlocking suitably, and it's not just IO-APICs
> which are problematic - any windowed I/O (e.g. RTC) as well as any other
> device with read side effects.

I don't think disabling ACPI on such systems would be a viable option.
Things tend to not work very well that way ... Plus iirc these issues
weren't exclusively on some random no-name systems, but in at least one
of the cases on systems from a pretty large vendor.

> I think we should seriously consider not mapping the IO-APIC at all. 

We can easily do so on the IOMMU side, if you agree to have CPU and
IOMMU mappings diverge for this case. Since the behavior is PV-
specific anyway, there are also no concerns as to differing behavior
with vs without shared page tables (on PVH).

> That said, I would be surprised if Linux is cleanly avoiding the
> IO-APIC, so a slightly less bad alternative is to redirect to an "MMIO"
> frame of ~0's which we still write-discard on top of.
> 
> That at least makes the Xen-induced memory corruption behave
> deterministically.

It would work more deterministically, yes, but without such a system
available to test we wouldn't know whether that approach would actually
make things work (sufficiently). Whereas for the current approach we do
know (from the testing done back at the time).

Jan



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Ping: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks
  2021-08-24 14:27 ` [PATCH 17/17] IOMMU/x86: drop pointless NULL checks Jan Beulich
  2021-08-26 12:05   ` Andrew Cooper
@ 2021-09-15 12:42   ` Jan Beulich
  2021-09-16  7:47     ` Tian, Kevin
  2021-09-16  7:29   ` Tian, Kevin
  2 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2021-09-15 12:42 UTC (permalink / raw)
  To: Kevin Tian; +Cc: Andrew Cooper, Paul Durrant, xen-devel

Kevin,

On 24.08.2021 16:27, Jan Beulich wrote:
> map_domain_page() et al never fail; no need to check their return values
> against NULL, and no need to carry dead printk()s.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

may I get an ack (or otherwise) on this patch please? Ideally also
for some other VT-d specific ones in this series? I'd really like
to get in at least some parts of this series, before submitting an
even larger v2.

Thanks, Jan

> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -530,12 +530,6 @@ static void amd_dump_page_table_level(st
>          return;
>  
>      table_vaddr = __map_domain_page(pg);
> -    if ( table_vaddr == NULL )
> -    {
> -        printk("AMD IOMMU failed to map domain page %"PRIpaddr"\n",
> -                page_to_maddr(pg));
> -        return;
> -    }
>  
>      for ( index = 0; index < PTE_PER_TABLE_SIZE; index++ )
>      {
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2800,12 +2800,6 @@ static void vtd_dump_page_table_level(pa
>          return;
>  
>      pt_vaddr = map_vtd_domain_page(pt_maddr);
> -    if ( pt_vaddr == NULL )
> -    {
> -        printk(VTDPREFIX " failed to map domain page %"PRIpaddr"\n",
> -               pt_maddr);
> -        return;
> -    }
>  
>      next_level = level - 1;
>      for ( i = 0; i < PTE_NUM; i++ )
> --- a/xen/drivers/passthrough/vtd/utils.c
> +++ b/xen/drivers/passthrough/vtd/utils.c
> @@ -106,11 +106,6 @@ void print_vtd_entries(struct vtd_iommu
>      }
>  
>      root_entry = (struct root_entry *)map_vtd_domain_page(iommu->root_maddr);
> -    if ( root_entry == NULL )
> -    {
> -        printk("    root_entry == NULL\n");
> -        return;
> -    }
>  
>      printk("    root_entry[%02x] = %"PRIx64"\n", bus, root_entry[bus].val);
>      if ( !root_present(root_entry[bus]) )
> @@ -123,11 +118,6 @@ void print_vtd_entries(struct vtd_iommu
>      val = root_entry[bus].val;
>      unmap_vtd_domain_page(root_entry);
>      ctxt_entry = map_vtd_domain_page(val);
> -    if ( ctxt_entry == NULL )
> -    {
> -        printk("    ctxt_entry == NULL\n");
> -        return;
> -    }
>  
>      val = ctxt_entry[devfn].lo;
>      printk("    context[%02x] = %"PRIx64"_%"PRIx64"\n",
> @@ -151,11 +141,6 @@ void print_vtd_entries(struct vtd_iommu
>      do
>      {
>          l = map_vtd_domain_page(val);
> -        if ( l == NULL )
> -        {
> -            printk("    l%u == NULL\n", level);
> -            break;
> -        }
>          l_index = get_level_index(gmfn, level);
>          pte.val = l[l_index];
>          unmap_vtd_domain_page(l);
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks
  2021-08-24 14:27 ` [PATCH 17/17] IOMMU/x86: drop pointless NULL checks Jan Beulich
  2021-08-26 12:05   ` Andrew Cooper
  2021-09-15 12:42   ` Ping: " Jan Beulich
@ 2021-09-16  7:29   ` Tian, Kevin
  2 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2021-09-16  7:29 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Cooper, Andrew, Paul Durrant

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, August 24, 2021 10:28 PM
> 
> map_domain_page() et al never fail; no need to check their return values
> against NULL, and no need to carry dead printk()s.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
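
For context, the pattern the patch leaves behind is simply map / use /
unmap, with no error path; a sketch (the dump helper is illustrative):

    void *table = __map_domain_page(pg);   /* cannot fail, hence no check */
    const uint64_t *pte = table;
    unsigned int i;

    for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
        dump_pte(pte[i]);                  /* illustrative helper */

    unmap_domain_page(table);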

> 
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -530,12 +530,6 @@ static void amd_dump_page_table_level(st
>          return;
> 
>      table_vaddr = __map_domain_page(pg);
> -    if ( table_vaddr == NULL )
> -    {
> -        printk("AMD IOMMU failed to map domain page %"PRIpaddr"\n",
> -                page_to_maddr(pg));
> -        return;
> -    }
> 
>      for ( index = 0; index < PTE_PER_TABLE_SIZE; index++ )
>      {
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2800,12 +2800,6 @@ static void vtd_dump_page_table_level(pa
>          return;
> 
>      pt_vaddr = map_vtd_domain_page(pt_maddr);
> -    if ( pt_vaddr == NULL )
> -    {
> -        printk(VTDPREFIX " failed to map domain page %"PRIpaddr"\n",
> -               pt_maddr);
> -        return;
> -    }
> 
>      next_level = level - 1;
>      for ( i = 0; i < PTE_NUM; i++ )
> --- a/xen/drivers/passthrough/vtd/utils.c
> +++ b/xen/drivers/passthrough/vtd/utils.c
> @@ -106,11 +106,6 @@ void print_vtd_entries(struct vtd_iommu
>      }
> 
>      root_entry = (struct root_entry *)map_vtd_domain_page(iommu->root_maddr);
> -    if ( root_entry == NULL )
> -    {
> -        printk("    root_entry == NULL\n");
> -        return;
> -    }
> 
>      printk("    root_entry[%02x] = %"PRIx64"\n", bus, root_entry[bus].val);
>      if ( !root_present(root_entry[bus]) )
> @@ -123,11 +118,6 @@ void print_vtd_entries(struct vtd_iommu
>      val = root_entry[bus].val;
>      unmap_vtd_domain_page(root_entry);
>      ctxt_entry = map_vtd_domain_page(val);
> -    if ( ctxt_entry == NULL )
> -    {
> -        printk("    ctxt_entry == NULL\n");
> -        return;
> -    }
> 
>      val = ctxt_entry[devfn].lo;
>      printk("    context[%02x] = %"PRIx64"_%"PRIx64"\n",
> @@ -151,11 +141,6 @@ void print_vtd_entries(struct vtd_iommu
>      do
>      {
>          l = map_vtd_domain_page(val);
> -        if ( l == NULL )
> -        {
> -            printk("    l%u == NULL\n", level);
> -            break;
> -        }
>          l_index = get_level_index(gmfn, level);
>          pte.val = l[l_index];
>          unmap_vtd_domain_page(l);


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 15/17] IOMMU: page table dumping adjustments
  2021-08-24 14:28   ` Jan Beulich
@ 2021-09-16  7:33     ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2021-09-16  7:33 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Cooper, Andrew, Paul Durrant

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, August 24, 2021 10:29 PM
> 
> On 24.08.2021 16:26, Jan Beulich wrote:
> > For one, none of the three IOMMU implementations on Arm specifies a dumping
> > hook. Generalize VT-d's "don't dump shared page tables" to cover for
> > this.
> >
> > Further, in the past I was told that on Arm in principle there could be
> > multiple different IOMMUs, and hence different domains' platform_ops
> > pointers could differ. Use each domain's ops for calling the dump hook.
> > (In the long run all uses of iommu_get_ops() would likely need to
> > disappear for this reason.)
> >
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Should have Cc-ed Kevin on this one as well; now added.
> 
> Jan

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> 
> > --- a/xen/drivers/passthrough/iommu.c
> > +++ b/xen/drivers/passthrough/iommu.c
> > @@ -640,12 +640,9 @@ bool_t iommu_has_feature(struct domain *
> >  static void iommu_dump_page_tables(unsigned char key)
> >  {
> >      struct domain *d;
> > -    const struct iommu_ops *ops;
> >
> >      ASSERT(iommu_enabled);
> >
> > -    ops = iommu_get_ops();
> > -
> >      rcu_read_lock(&domlist_read_lock);
> >
> >      for_each_domain(d)
> > @@ -653,7 +650,13 @@ static void iommu_dump_page_tables(unsig
> >          if ( is_hardware_domain(d) || !is_iommu_enabled(d) )
> >              continue;
> >
> > -        ops->dump_page_tables(d);
> > +        if ( iommu_use_hap_pt(d) )
> > +        {
> > +            printk("%pd sharing page tables\n", d);
> > +            continue;
> > +        }
> > +
> > +        dom_iommu(d)->platform_ops->dump_page_tables(d);
> >      }
> >
> >      rcu_read_unlock(&domlist_read_lock);
> > --- a/xen/drivers/passthrough/vtd/iommu.c
> > +++ b/xen/drivers/passthrough/vtd/iommu.c
> > @@ -2835,12 +2835,6 @@ static void vtd_dump_page_tables(struct
> >  {
> >      const struct domain_iommu *hd = dom_iommu(d);
> >
> > -    if ( iommu_use_hap_pt(d) )
> > -    {
> > -        printk(VTDPREFIX " %pd sharing EPT table\n", d);
> > -        return;
> > -    }
> > -
> >      printk(VTDPREFIX" %pd table has %d levels\n", d,
> >             agaw_to_level(hd->arch.vtd.agaw));
> >      vtd_dump_page_table_level(hd->arch.vtd.pgd_maddr,
> >
> >


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 16/17] VT-d: show permissions during page table walks
  2021-08-24 14:27 ` [PATCH 16/17] VT-d: show permissions during page table walks Jan Beulich
@ 2021-09-16  7:36   ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2021-09-16  7:36 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Cooper, Andrew, Paul Durrant

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, August 24, 2021 10:27 PM
> 
> Besides the addresses this is the next crucial bit of information one
> might be after.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2822,10 +2822,12 @@ static void vtd_dump_page_table_level(pa
>              vtd_dump_page_table_level(dma_pte_addr(*pte), next_level,
>                                        address, indent + 1);
>          else
> -            printk("%*sdfn: %08lx mfn: %08lx\n",
> +            printk("%*sdfn: %08lx mfn: %08lx %c%c\n",
>                     indent, "",
>                     (unsigned long)(address >> PAGE_SHIFT_4K),
> -                   (unsigned long)(dma_pte_addr(*pte) >> PAGE_SHIFT_4K));
> +                   (unsigned long)(dma_pte_addr(*pte) >> PAGE_SHIFT_4K),
> +                   dma_pte_read(*pte) ? 'r' : '-',
> +                   dma_pte_write(*pte) ? 'w' : '-');
>      }
> 
>      unmap_vtd_domain_page(pt_vaddr);
> --- a/xen/drivers/passthrough/vtd/utils.c
> +++ b/xen/drivers/passthrough/vtd/utils.c
> @@ -159,14 +159,11 @@ void print_vtd_entries(struct vtd_iommu
>          l_index = get_level_index(gmfn, level);
>          pte.val = l[l_index];
>          unmap_vtd_domain_page(l);
> -        printk("    l%u[%03x] = %"PRIx64"\n", level, l_index, pte.val);
> +        printk("    l%u[%03x] = %"PRIx64" %c%c\n", level, l_index, pte.val,
> +               dma_pte_read(pte) ? 'r' : '-',
> +               dma_pte_write(pte) ? 'w' : '-');
> 
> -        if ( !dma_pte_present(pte) )
> -        {
> -            printk("    l%u[%03x] not present\n", level, l_index);
> -            break;
> -        }
> -        if ( dma_pte_superpage(pte) )
> +        if ( !dma_pte_present(pte) || dma_pte_superpage(pte) )
>              break;
>          val = dma_pte_addr(pte);
>      } while ( --level );


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 04/17] IOMMU: have vendor code announce supported page sizes
  2021-08-24 14:17 ` [PATCH 04/17] IOMMU: have vendor code announce supported page sizes Jan Beulich
@ 2021-09-16  7:38   ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2021-09-16  7:38 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Stefano Stabellini, Julien Grall,
	Bertrand Marquis

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, August 24, 2021 10:18 PM
> 
> Generic code will use this information to determine what order values
> can legitimately be passed to the ->{,un}map_page() hooks. For now all
> ops structures simply get to announce 4k mappings (as base page size),
> and there is (and always has been) an assumption that this matches the
> CPU's MMU base page size (eventually we will want to permit IOMMUs with
> a base page size smaller than the CPU MMU's).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
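
As a sketch of how generic code can consume this mask later in the series,
e.g. when validating a requested order (illustrative, not a hunk from the
series):

    static bool order_ok(const struct iommu_ops *ops, unsigned int order)
    {
        /* Valid iff the IOMMU announced this exact page size. */
        return ops->page_sizes & (1UL << (order + PAGE_SHIFT));
    }

The check added to iommu_setup() below uses the usual x & -x trick to
insist that the smallest announced size is exactly PAGE_SIZE.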

> 
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -583,6 +583,7 @@ static void amd_dump_page_tables(struct
>  }
> 
>  static const struct iommu_ops __initconstrel _iommu_ops = {
> +    .page_sizes = PAGE_SIZE_4K,
>      .init = amd_iommu_domain_init,
>      .hwdom_init = amd_iommu_hwdom_init,
>      .quarantine_init = amd_iommu_quarantine_init,
> --- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> +++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> @@ -1298,6 +1298,7 @@ static void ipmmu_iommu_domain_teardown(
> 
>  static const struct iommu_ops ipmmu_iommu_ops =
>  {
> +    .page_sizes      = PAGE_SIZE_4K,
>      .init            = ipmmu_iommu_domain_init,
>      .hwdom_init      = ipmmu_iommu_hwdom_init,
>      .teardown        = ipmmu_iommu_domain_teardown,
> --- a/xen/drivers/passthrough/arm/smmu.c
> +++ b/xen/drivers/passthrough/arm/smmu.c
> @@ -2873,6 +2873,7 @@ static void arm_smmu_iommu_domain_teardo
>  }
> 
>  static const struct iommu_ops arm_smmu_iommu_ops = {
> +    .page_sizes = PAGE_SIZE_4K,
>      .init = arm_smmu_iommu_domain_init,
>      .hwdom_init = arm_smmu_iommu_hwdom_init,
>      .add_device = arm_smmu_dt_add_device_generic,
> --- a/xen/drivers/passthrough/arm/smmu-v3.c
> +++ b/xen/drivers/passthrough/arm/smmu-v3.c
> @@ -3426,7 +3426,8 @@ static void arm_smmu_iommu_xen_domain_te
>  }
> 
>  static const struct iommu_ops arm_smmu_iommu_ops = {
> -	.init		= arm_smmu_iommu_xen_domain_init,
> +	.page_sizes		= PAGE_SIZE_4K,
> +	.init			= arm_smmu_iommu_xen_domain_init,
>  	.hwdom_init		= arm_smmu_iommu_hwdom_init,
>  	.teardown		= arm_smmu_iommu_xen_domain_teardown,
>  	.iotlb_flush		= arm_smmu_iotlb_flush,
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -470,7 +470,17 @@ int __init iommu_setup(void)
> 
>      if ( iommu_enable )
>      {
> +        const struct iommu_ops *ops = NULL;
> +
>          rc = iommu_hardware_setup();
> +        if ( !rc )
> +            ops = iommu_get_ops();
> +        if ( ops && (ops->page_sizes & -ops->page_sizes) != PAGE_SIZE )
> +        {
> +            printk(XENLOG_ERR "IOMMU: page size mask %lx unsupported\n",
> +                   ops->page_sizes);
> +            rc = ops->page_sizes ? -EPERM : -ENODATA;
> +        }
>          iommu_enabled = (rc == 0);
>      }
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2875,6 +2875,7 @@ static int __init intel_iommu_quarantine
>  }
> 
>  static struct iommu_ops __initdata vtd_ops = {
> +    .page_sizes = PAGE_SIZE_4K,
>      .init = intel_iommu_domain_init,
>      .hwdom_init = intel_iommu_hwdom_init,
>      .quarantine_init = intel_iommu_quarantine_init,
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -231,6 +231,7 @@ struct page_info;
>  typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
> 
>  struct iommu_ops {
> +    unsigned long page_sizes;
>      int (*init)(struct domain *d);
>      void (*hwdom_init)(struct domain *d);
>      int (*quarantine_init)(struct domain *d);


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 05/17] IOMMU: add order parameter to ->{,un}map_page() hooks
  2021-08-24 14:18 ` [PATCH 05/17] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
@ 2021-09-16  7:41   ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2021-09-16  7:41 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Stefano Stabellini, Julien Grall

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, August 24, 2021 10:19 PM
> 
> Or really, in the case of ->map_page(), accommodate it in the existing
> "flags" parameter. All call sites will pass 0 for now.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
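
For reference, with the revised layout in the xen/iommu.h hunk below, the
order and the permissions share the map flags, while unmap gets the order
as a separate parameter.  A fragment-style sketch using the names from the
patch (surrounding declarations assumed):

    /* Bits 0-5 of "flags" carry the order; r/w moved up to bits 6 and 7. */
    unsigned int flags = IOMMUF_readable | IOMMUF_writable | 2; /* order 2 */

    rc = iommu_call(hd->platform_ops, map_page, d, dfn, mfn, flags,
                    &flush_flags);
    ...
    rc = iommu_call(hd->platform_ops, unmap_page, d, dfn,
                    IOMMUF_order(flags), &flush_flags);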

> 
> --- a/xen/drivers/passthrough/amd/iommu.h
> +++ b/xen/drivers/passthrough/amd/iommu.h
> @@ -225,6 +225,7 @@ int __must_check amd_iommu_map_page(stru
>                                      mfn_t mfn, unsigned int flags,
>                                      unsigned int *flush_flags);
>  int __must_check amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
> +                                      unsigned int order,
>                                        unsigned int *flush_flags);
>  int __must_check amd_iommu_alloc_root(struct domain *d);
>  int amd_iommu_reserve_domain_unity_map(struct domain *domain,
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -328,7 +328,7 @@ int amd_iommu_map_page(struct domain *d,
>      return 0;
>  }
> 
> -int amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
> +int amd_iommu_unmap_page(struct domain *d, dfn_t dfn, unsigned int order,
>                           unsigned int *flush_flags)
>  {
>      unsigned long pt_mfn = 0;
> --- a/xen/drivers/passthrough/arm/iommu_helpers.c
> +++ b/xen/drivers/passthrough/arm/iommu_helpers.c
> @@ -57,11 +57,13 @@ int __must_check arm_iommu_map_page(stru
>       * The function guest_physmap_add_entry replaces the current mapping
>       * if there is already one...
>       */
> -    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0, t);
> +    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
> +                                   IOMMUF_order(flags), t);
>  }
> 
>  /* Should only be used if P2M Table is shared between the CPU and the IOMMU. */
>  int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
> +                                      unsigned int order,
>                                        unsigned int *flush_flags)
>  {
>      /*
> @@ -71,7 +73,8 @@ int __must_check arm_iommu_unmap_page(st
>      if ( !is_domain_direct_mapped(d) )
>          return -EINVAL;
> 
> -    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0);
> +    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
> +                                     order);
>  }
> 
>  /*
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -271,6 +271,8 @@ int iommu_map(struct domain *d, dfn_t df
>      if ( !is_iommu_enabled(d) )
>          return 0;
> 
> +    ASSERT(!IOMMUF_order(flags));
> +
>      for ( i = 0; i < page_count; i++ )
>      {
>          rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
> @@ -288,7 +290,7 @@ int iommu_map(struct domain *d, dfn_t df
>          while ( i-- )
>              /* if statement to satisfy __must_check */
>              if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
> -                            flush_flags) )
> +                            0, flush_flags) )
>                  continue;
> 
>          if ( !is_hardware_domain(d) )
> @@ -333,7 +335,7 @@ int iommu_unmap(struct domain *d, dfn_t
>      for ( i = 0; i < page_count; i++ )
>      {
>          int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
> -                             flush_flags);
> +                             0, flush_flags);
> 
>          if ( likely(!err) )
>              continue;
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1932,6 +1932,7 @@ static int __must_check intel_iommu_map_
>  }
> 
>  static int __must_check intel_iommu_unmap_page(struct domain *d, dfn_t dfn,
> +                                               unsigned int order,
>                                                 unsigned int *flush_flags)
>  {
>      /* Do nothing if VT-d shares EPT page table */
> --- a/xen/include/asm-arm/iommu.h
> +++ b/xen/include/asm-arm/iommu.h
> @@ -31,6 +31,7 @@ int __must_check arm_iommu_map_page(stru
>                                      unsigned int flags,
>                                      unsigned int *flush_flags);
>  int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
> +                                      unsigned int order,
>                                        unsigned int *flush_flags);
> 
>  #endif /* __ARCH_ARM_IOMMU_H__ */
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -127,9 +127,10 @@ void arch_iommu_hwdom_init(struct domain
>   * The following flags are passed to map operations and passed by lookup
>   * operations.
>   */
> -#define _IOMMUF_readable 0
> +#define IOMMUF_order(n)  ((n) & 0x3f)
> +#define _IOMMUF_readable 6
>  #define IOMMUF_readable  (1u<<_IOMMUF_readable)
> -#define _IOMMUF_writable 1
> +#define _IOMMUF_writable 7
>  #define IOMMUF_writable  (1u<<_IOMMUF_writable)
> 
>  /*
> @@ -255,6 +256,7 @@ struct iommu_ops {
>                                   unsigned int flags,
>                                   unsigned int *flush_flags);
>      int __must_check (*unmap_page)(struct domain *d, dfn_t dfn,
> +                                   unsigned int order,
>                                     unsigned int *flush_flags);
>      int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
>                                      unsigned int *flags);


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: Ping: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks
  2021-09-15 12:42   ` Ping: " Jan Beulich
@ 2021-09-16  7:47     ` Tian, Kevin
  2021-09-16  8:24       ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2021-09-16  7:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Cooper, Andrew, Paul Durrant, xen-devel

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, September 15, 2021 8:42 PM
> 
> Kevin,
> 
> On 24.08.2021 16:27, Jan Beulich wrote:
> > map_domain_page() et al never fail; no need to check their return values
> > against NULL, and no need to carry dead printk()s.
> >
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> may I get an ack (or otherwise) on this patch please? Ideally also
> for some other VT-d specific ones in this series? I'd really like
> to get in at least some parts of this series, before submitting an
> even larger v2.
> 

I haven't found time to check the entire series in detail. Only acked
a few that looked obvious.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Ping: [PATCH 17/17] IOMMU/x86: drop pointless NULL checks
  2021-09-16  7:47     ` Tian, Kevin
@ 2021-09-16  8:24       ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2021-09-16  8:24 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Cooper, Andrew, Paul Durrant, xen-devel

On 16.09.2021 09:47, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Wednesday, September 15, 2021 8:42 PM
>>
>> Kevin,
>>
>> On 24.08.2021 16:27, Jan Beulich wrote:
>>> map_domain_page() et al never fail; no need to check their return values
>>> against NULL, and no need to carry dead printk()s.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> may I get an ack (or otherwise) on this patch please? Ideally also
>> for some other VT-d specific ones in this series? I'd really like
>> to get in at least some parts of this series, before submitting an
>> even larger v2.
> 
> I haven't found time to check the entire series in detail. Only acked
> a few ones which look obvious.

That's fine, thanks! It still allows me to move forward a little.

Jan



^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2021-09-16  8:25 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-24 14:13 [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
2021-08-24 14:15 ` [PATCH 01/17] AMD/IOMMU: avoid recording each level's MFN when walking page table Jan Beulich
2021-08-25 13:29   ` Andrew Cooper
2021-08-24 14:15 ` [PATCH 02/17] AMD/IOMMU: have callers specify the target level for page table walks Jan Beulich
2021-08-24 14:16 ` [PATCH 03/17] VT-d: " Jan Beulich
2021-08-24 14:17 ` [PATCH 04/17] IOMMU: have vendor code announce supported page sizes Jan Beulich
2021-09-16  7:38   ` Tian, Kevin
2021-08-24 14:18 ` [PATCH 05/17] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
2021-09-16  7:41   ` Tian, Kevin
2021-08-24 14:19 ` [PATCH 06/17] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
2021-08-24 14:21 ` [PATCH 07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
2021-08-26 11:57   ` Andrew Cooper
2021-08-26 12:55     ` Jan Beulich
2021-09-07 17:13       ` Andrew Cooper
2021-09-08  9:44         ` Jan Beulich
2021-08-24 14:21 ` [PATCH 08/17] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
2021-08-24 14:22 ` [PATCH 09/17] IOMMU/x86: support freeing of pagetables Jan Beulich
2021-08-24 14:22 ` [PATCH 10/17] AMD/IOMMU: drop stray TLB flush Jan Beulich
2021-08-24 14:23 ` [PATCH 11/17] AMD/IOMMU: walk trees upon page fault Jan Beulich
2021-08-24 14:24 ` [PATCH 12/17] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
2021-08-24 14:25 ` [PATCH 13/17] AMD/IOMMU: allow use of superpage mappings Jan Beulich
2021-08-24 14:25 ` [PATCH 14/17] VT-d: " Jan Beulich
2021-08-24 14:26 ` [PATCH 15/17] IOMMU: page table dumping adjustments Jan Beulich
2021-08-24 14:28   ` Jan Beulich
2021-09-16  7:33     ` Tian, Kevin
2021-08-24 14:27 ` [PATCH 16/17] VT-d: show permissions during page table walks Jan Beulich
2021-09-16  7:36   ` Tian, Kevin
2021-08-24 14:27 ` [PATCH 17/17] IOMMU/x86: drop pointless NULL checks Jan Beulich
2021-08-26 12:05   ` Andrew Cooper
2021-08-26 12:36     ` Jan Beulich
2021-09-15 12:42   ` Ping: " Jan Beulich
2021-09-16  7:47     ` Tian, Kevin
2021-09-16  8:24       ` Jan Beulich
2021-09-16  7:29   ` Tian, Kevin
2021-08-25 12:06 ` [PATCH 00/17] IOMMU: superpage support when not sharing pagetables Jan Beulich
