* [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables
@ 2021-09-24  9:39 Jan Beulich
  2021-09-24  9:41 ` [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks Jan Beulich
                   ` (17 more replies)
  0 siblings, 18 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:39 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

For a long time we've been rather inefficient with IOMMU page table
management when not sharing page tables, i.e. in particular for PV (and
further specifically also for PV Dom0) and AMD (where nowadays we never
share page tables). While up to about 2.5 years ago AMD code had logic
to un-shatter page mappings, that logic was ripped out for being buggy
(XSA-275 plus follow-on).

This series enables use of large pages in AMD and Intel (VT-d) code;
Arm is presently not in need of any enabling as pagetables are always
shared there. It also augments PV Dom0 creation with suitable explicit
IOMMU mapping calls to facilitate use of large pages there without
getting into the business of un-shattering page mappings just yet.
Depending on the amount of memory handed to Dom0 this improves booting
time (latency until Dom0 actually starts) quite a bit; subsequent
shattering of some of the large pages may of course consume some of the
saved time.

Known fallout has been spelled out here:
https://lists.xen.org/archives/html/xen-devel/2021-08/msg00781.html

As is perhaps to be expected, a few seemingly unrelated changes are also
included here, which I came to consider necessary or at least desirable
along the way (in part because the affected code had been in need of
adjustment for a long time). Some of these changes are likely independent
of the bulk of the work here, and hence may be fine to go in ahead of
earlier patches.

While, as said above, un-shattering of mappings isn't an immediate goal,
the last few patches now at least arrange for freeing page tables which
have ended up all empty. This also introduces the underlying support to
then un-shatter large pages (potentially re-usable elsewhere as well),
but that's not part of this v2 of the series.

01: AMD/IOMMU: have callers specify the target level for page table walks
02: VT-d: have callers specify the target level for page table walks
03: IOMMU: have vendor code announce supported page sizes
04: IOMMU: add order parameter to ->{,un}map_page() hooks
05: IOMMU: have iommu_{,un}map() split requests into largest possible chunks
06: IOMMU/x86: restrict IO-APIC mappings for PV Dom0
07: IOMMU/x86: perform PV Dom0 mappings in batches
08: IOMMU/x86: support freeing of pagetables
09: AMD/IOMMU: drop stray TLB flush
10: AMD/IOMMU: walk trees upon page fault
11: AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
12: AMD/IOMMU: allow use of superpage mappings
13: VT-d: allow use of superpage mappings
14: IOMMU: fold flush-all hook into "flush one"
15: IOMMU/x86: prefill newly allocated page tables
16: x86: introduce helper for recording degree of contiguity in page tables
17: AMD/IOMMU: free all-empty page tables
18: VT-d: free all-empty page tables

While not directly related (except that making this mode work properly
here was a fair part of the overall work), on this occasion I'd also
like to renew my proposal to make "iommu=dom0-strict" the default going
forward. For PVH Dom0 it already is not only the default, but the only
possible mode.
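
(For illustration only, not part of the series: opting in today is a single
hypervisor command line option, e.g. in a GRUB2 menu entry - the surrounding
syntax is merely assumed context:

    multiboot2 /boot/xen.gz ... iommu=dom0-strict

with "..." standing for whatever other Xen options are in use.)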

Jan




* [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
@ 2021-09-24  9:41 ` Jan Beulich
  2021-09-24 10:58   ` Roger Pau Monné
  2021-09-24  9:42 ` [PATCH v2 02/18] VT-d: " Jan Beulich
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:41 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

In order to be able to insert/remove super-pages we need to allow
callers of the walking function to specify at which point to stop the
walk. (For now at least gcc will instantiate just a variant of the
function with the parameter eliminated, so effectively no change to
generated code as far as the parameter addition goes.)

Instead of merely adjusting a BUG_ON() condition, convert it into an
error return - there's no reason to crash the entire host in that case.
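
Purely as an illustrative sketch (not part of this patch): a later,
superpage-capable caller would pass the level matching the intended mapping
size instead of the current callers' hard-coded 1, along the lines of

    /* Hypothetical caller: stop the walk at level 2 for a 2M (order-9) leaf. */
    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 2, &pt_mfn, true) || !pt_mfn )
        /* unwind: unlock and report the failed walk to the caller */;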

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -178,7 +178,8 @@ void __init iommu_dte_add_device_entry(s
  * page tables.
  */
 static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
-                              unsigned long *pt_mfn, bool map)
+                              unsigned int target, unsigned long *pt_mfn,
+                              bool map)
 {
     union amd_iommu_pte *pde, *next_table_vaddr;
     unsigned long  next_table_mfn;
@@ -189,7 +190,8 @@ static int iommu_pde_from_dfn(struct dom
     table = hd->arch.amd.root_table;
     level = hd->arch.amd.paging_mode;
 
-    BUG_ON( table == NULL || level < 1 || level > 6 );
+    if ( !table || target < 1 || level < target || level > 6 )
+        return 1;
 
     /*
      * A frame number past what the current page tables can represent can't
@@ -200,7 +202,7 @@ static int iommu_pde_from_dfn(struct dom
 
     next_table_mfn = mfn_x(page_to_mfn(table));
 
-    while ( level > 1 )
+    while ( level > target )
     {
         unsigned int next_level = level - 1;
 
@@ -307,7 +309,7 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, true) || !pt_mfn )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, true) || !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -340,7 +342,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), &pt_mfn, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",




* [PATCH v2 02/18] VT-d: have callers specify the target level for page table walks
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
  2021-09-24  9:41 ` [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks Jan Beulich
@ 2021-09-24  9:42 ` Jan Beulich
  2021-09-24 14:45   ` Roger Pau Monné
  2021-09-24  9:43 ` [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes Jan Beulich
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:42 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Kevin Tian

In order to be able to insert/remove super-pages we need to allow
callers of the walking function to specify at which point to stop the
walk.

For intel_iommu_lookup_page() integrate the last level access into
the main walking function.

dma_pte_clear_one() gets only partly adjusted for now: Error handling
and order parameter get put in place, but the order parameter remains
ignored (just like intel_iommu_map_page()'s order part of the flags).
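
Purely for illustration (mirroring what the adjusted callers below end up
doing), the new return value convention leads to checks of this shape:

    /*
     * Sketch: anything below PAGE_SIZE is either "not present" (0) or the
     * level at which page table allocation failed.
     */
    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1, flush_flags, true);
    if ( pg_maddr < PAGE_SIZE )
        return pg_maddr ? -ENOMEM : 0;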

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I have to admit that I don't understand why domain_pgd_maddr() wants to
populate all page table levels for DFN 0.

I was actually wondering whether it wouldn't make sense to integrate
dma_pte_clear_one() into its only caller intel_iommu_unmap_page(), for
better symmetry with intel_iommu_map_page().
---
v2: Fix build.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -264,63 +264,116 @@ static u64 bus_to_context_maddr(struct v
     return maddr;
 }
 
-static u64 addr_to_dma_page_maddr(struct domain *domain, u64 addr, int alloc)
+/*
+ * This function walks (and if requested allocates) page tables to the
+ * designated target level. It returns
+ * - 0 when a non-present entry was encountered and no allocation was
+ *   requested,
+ * - a small positive value (the level, i.e. below PAGE_SIZE) upon allocation
+ *   failure,
+ * - for target > 0 the address of the page table holding the leaf PTE for
+ *   the requested address,
+ * - for target == 0 the full PTE.
+ */
+static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
+                                       unsigned int target,
+                                       unsigned int *flush_flags, bool alloc)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     int addr_width = agaw_to_width(hd->arch.vtd.agaw);
     struct dma_pte *parent, *pte = NULL;
-    int level = agaw_to_level(hd->arch.vtd.agaw);
-    int offset;
+    unsigned int level = agaw_to_level(hd->arch.vtd.agaw), offset;
     u64 pte_maddr = 0;
 
     addr &= (((u64)1) << addr_width) - 1;
     ASSERT(spin_is_locked(&hd->arch.mapping_lock));
+    ASSERT(target || !alloc);
+
     if ( !hd->arch.vtd.pgd_maddr )
     {
         struct page_info *pg;
 
-        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
+        if ( !alloc )
+            goto out;
+
+        pte_maddr = level;
+        if ( !(pg = iommu_alloc_pgtable(domain)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
     }
 
-    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
-    while ( level > 1 )
+    pte_maddr = hd->arch.vtd.pgd_maddr;
+    parent = map_vtd_domain_page(pte_maddr);
+    while ( level > target )
     {
         offset = address_level_offset(addr, level);
         pte = &parent[offset];
 
         pte_maddr = dma_pte_addr(*pte);
-        if ( !pte_maddr )
+        if ( !dma_pte_present(*pte) || (level > 1 && dma_pte_superpage(*pte)) )
         {
             struct page_info *pg;
+            /*
+             * Higher level tables always set r/w, last level page table
+             * controls read/write.
+             */
+            struct dma_pte new_pte = { DMA_PTE_PROT };
 
             if ( !alloc )
-                break;
+            {
+                pte_maddr = 0;
+                if ( !dma_pte_present(*pte) )
+                    break;
+
+                /*
+                 * When the leaf entry was requested, pass back the full PTE,
+                 * with the address adjusted to account for the residual of
+                 * the walk.
+                 */
+                pte_maddr = pte->val +
+                    (addr & ((1UL << level_to_offset_bits(level)) - 1) &
+                     PAGE_MASK);
+                if ( !target )
+                    break;
+            }
 
+            pte_maddr = level - 1;
             pg = iommu_alloc_pgtable(domain);
             if ( !pg )
                 break;
 
             pte_maddr = page_to_maddr(pg);
-            dma_set_pte_addr(*pte, pte_maddr);
+            dma_set_pte_addr(new_pte, pte_maddr);
 
-            /*
-             * high level table always sets r/w, last level
-             * page table control read/write
-             */
-            dma_set_pte_readable(*pte);
-            dma_set_pte_writable(*pte);
+            if ( dma_pte_present(*pte) )
+            {
+                struct dma_pte *split = map_vtd_domain_page(pte_maddr);
+                unsigned long inc = 1UL << level_to_offset_bits(level - 1);
+
+                split[0].val = pte->val;
+                if ( inc == PAGE_SIZE )
+                    split[0].val &= ~DMA_PTE_SP;
+
+                for ( offset = 1; offset < PTE_NUM; ++offset )
+                    split[offset].val = split[offset - 1].val + inc;
+
+                iommu_sync_cache(split, PAGE_SIZE);
+                unmap_vtd_domain_page(split);
+
+                if ( flush_flags )
+                    *flush_flags |= IOMMU_FLUSHF_modified;
+            }
+
+            write_atomic(&pte->val, new_pte.val);
             iommu_sync_cache(pte, sizeof(struct dma_pte));
         }
 
-        if ( level == 2 )
+        if ( --level == target )
             break;
 
         unmap_vtd_domain_page(parent);
         parent = map_vtd_domain_page(pte_maddr);
-        level--;
     }
 
     unmap_vtd_domain_page(parent);
@@ -346,7 +399,7 @@ static uint64_t domain_pgd_maddr(struct
     if ( !hd->arch.vtd.pgd_maddr )
     {
         /* Ensure we have pagetables allocated down to leaf PTE. */
-        addr_to_dma_page_maddr(d, 0, 1);
+        addr_to_dma_page_maddr(d, 0, 1, NULL, true);
 
         if ( !hd->arch.vtd.pgd_maddr )
             return 0;
@@ -691,8 +744,9 @@ static int __must_check iommu_flush_iotl
 }
 
 /* clear one page's page table */
-static void dma_pte_clear_one(struct domain *domain, uint64_t addr,
-                              unsigned int *flush_flags)
+static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
+                             unsigned int order,
+                             unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct dma_pte *page = NULL, *pte = NULL;
@@ -700,11 +754,11 @@ static void dma_pte_clear_one(struct dom
 
     spin_lock(&hd->arch.mapping_lock);
     /* get last level pte */
-    pg_maddr = addr_to_dma_page_maddr(domain, addr, 0);
-    if ( pg_maddr == 0 )
+    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags, false);
+    if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
-        return;
+        return pg_maddr ? -ENOMEM : 0;
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
@@ -714,7 +768,7 @@ static void dma_pte_clear_one(struct dom
     {
         spin_unlock(&hd->arch.mapping_lock);
         unmap_vtd_domain_page(page);
-        return;
+        return 0;
     }
 
     dma_clear_pte(*pte);
@@ -724,6 +778,8 @@ static void dma_pte_clear_one(struct dom
     iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
+
+    return 0;
 }
 
 static int iommu_set_root_entry(struct vtd_iommu *iommu)
@@ -1836,8 +1892,9 @@ static int __must_check intel_iommu_map_
         return 0;
     }
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1);
-    if ( !pg_maddr )
+    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1, flush_flags,
+                                      true);
+    if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
         return -ENOMEM;
@@ -1887,17 +1944,14 @@ static int __must_check intel_iommu_unma
     if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
         return 0;
 
-    dma_pte_clear_one(d, dfn_to_daddr(dfn), flush_flags);
-
-    return 0;
+    return dma_pte_clear_one(d, dfn_to_daddr(dfn), 0, flush_flags);
 }
 
 static int intel_iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn,
                                    unsigned int *flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct dma_pte *page, val;
-    u64 pg_maddr;
+    uint64_t val;
 
     /*
      * If VT-d shares EPT page table or if the domain is the hardware
@@ -1909,25 +1963,16 @@ static int intel_iommu_lookup_page(struc
 
     spin_lock(&hd->arch.mapping_lock);
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0);
-    if ( !pg_maddr )
-    {
-        spin_unlock(&hd->arch.mapping_lock);
-        return -ENOENT;
-    }
-
-    page = map_vtd_domain_page(pg_maddr);
-    val = page[dfn_x(dfn) & LEVEL_MASK];
+    val = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0, NULL, false);
 
-    unmap_vtd_domain_page(page);
     spin_unlock(&hd->arch.mapping_lock);
 
-    if ( !dma_pte_present(val) )
+    if ( val < PAGE_SIZE )
         return -ENOENT;
 
-    *mfn = maddr_to_mfn(dma_pte_addr(val));
-    *flags = dma_pte_read(val) ? IOMMUF_readable : 0;
-    *flags |= dma_pte_write(val) ? IOMMUF_writable : 0;
+    *mfn = maddr_to_mfn(val);
+    *flags = val & DMA_PTE_READ ? IOMMUF_readable : 0;
+    *flags |= val & DMA_PTE_WRITE ? IOMMUF_writable : 0;
 
     return 0;
 }




* [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
  2021-09-24  9:41 ` [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks Jan Beulich
  2021-09-24  9:42 ` [PATCH v2 02/18] VT-d: " Jan Beulich
@ 2021-09-24  9:43 ` Jan Beulich
  2021-11-30 12:25   ` Roger Pau Monné
                     ` (2 more replies)
  2021-09-24  9:44 ` [PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
                   ` (14 subsequent siblings)
  17 siblings, 3 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:43 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis,
	Rahul Singh

Generic code will use this information to determine what order values
can legitimately be passed to the ->{,un}map_page() hooks. For now all
ops structures simply get to announce 4k mappings (as base page size),
and there is (and always has been) an assumption that this matches the
CPU's MMU base page size (eventually we will want to permit IOMMUs with
a base page size smaller than the CPU MMU's).
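
As an illustration of the intended semantics (sketch only - this patch
introduces just the 4k users, and the multi-size masks only appear in later
patches of the series), an implementation capable of 4k, 2M, and 1G mappings
would announce all of them in a single bitmask, while generic code merely
insists on the smallest announced size matching PAGE_SIZE:

    /* Hypothetical ops fragment; larger sizes expressed via shifts of 4k. */
    static const struct iommu_ops example_ops = {
        .page_sizes = PAGE_SIZE_4K | (PAGE_SIZE_4K << 9) | (PAGE_SIZE_4K << 18),
        /* ... remaining hooks ... */
    };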

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -629,6 +629,7 @@ static void amd_dump_page_tables(struct
 }
 
 static const struct iommu_ops __initconstrel _iommu_ops = {
+    .page_sizes = PAGE_SIZE_4K,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
     .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -1298,6 +1298,7 @@ static void ipmmu_iommu_domain_teardown(
 
 static const struct iommu_ops ipmmu_iommu_ops =
 {
+    .page_sizes      = PAGE_SIZE_4K,
     .init            = ipmmu_iommu_domain_init,
     .hwdom_init      = ipmmu_iommu_hwdom_init,
     .teardown        = ipmmu_iommu_domain_teardown,
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2873,6 +2873,7 @@ static void arm_smmu_iommu_domain_teardo
 }
 
 static const struct iommu_ops arm_smmu_iommu_ops = {
+    .page_sizes = PAGE_SIZE_4K,
     .init = arm_smmu_iommu_domain_init,
     .hwdom_init = arm_smmu_iommu_hwdom_init,
     .add_device = arm_smmu_dt_add_device_generic,
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -3426,7 +3426,8 @@ static void arm_smmu_iommu_xen_domain_te
 }
 
 static const struct iommu_ops arm_smmu_iommu_ops = {
-	.init		= arm_smmu_iommu_xen_domain_init,
+	.page_sizes		= PAGE_SIZE_4K,
+	.init			= arm_smmu_iommu_xen_domain_init,
 	.hwdom_init		= arm_smmu_iommu_hwdom_init,
 	.teardown		= arm_smmu_iommu_xen_domain_teardown,
 	.iotlb_flush		= arm_smmu_iotlb_flush,
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -470,7 +470,17 @@ int __init iommu_setup(void)
 
     if ( iommu_enable )
     {
+        const struct iommu_ops *ops = NULL;
+
         rc = iommu_hardware_setup();
+        if ( !rc )
+            ops = iommu_get_ops();
+        if ( ops && (ops->page_sizes & -ops->page_sizes) != PAGE_SIZE )
+        {
+            printk(XENLOG_ERR "IOMMU: page size mask %lx unsupported\n",
+                   ops->page_sizes);
+            rc = ops->page_sizes ? -EPERM : -ENODATA;
+        }
         iommu_enabled = (rc == 0);
     }
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2806,6 +2806,7 @@ static int __init intel_iommu_quarantine
 }
 
 static struct iommu_ops __initdata vtd_ops = {
+    .page_sizes = PAGE_SIZE_4K,
     .init = intel_iommu_domain_init,
     .hwdom_init = intel_iommu_hwdom_init,
     .quarantine_init = intel_iommu_quarantine_init,
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -231,6 +231,7 @@ struct page_info;
 typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
 
 struct iommu_ops {
+    unsigned long page_sizes;
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
     int (*quarantine_init)(struct domain *d);




* [PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (2 preceding siblings ...)
  2021-09-24  9:43 ` [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes Jan Beulich
@ 2021-09-24  9:44 ` Jan Beulich
  2021-11-30 13:49   ` Roger Pau Monné
  2021-12-17 14:42   ` Julien Grall
  2021-09-24  9:45 ` [PATCH v2 05/18] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:44 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk

Or really, in the case of ->map_page(), accommodate it in the existing
"flags" parameter. All call sites will pass 0 for now.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v2: Re-base over change earlier in the series.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -230,6 +230,7 @@ int __must_check amd_iommu_map_page(stru
                                     mfn_t mfn, unsigned int flags,
                                     unsigned int *flush_flags);
 int __must_check amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain *d);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -328,7 +328,7 @@ int amd_iommu_map_page(struct domain *d,
     return 0;
 }
 
-int amd_iommu_unmap_page(struct domain *d, dfn_t dfn,
+int amd_iommu_unmap_page(struct domain *d, dfn_t dfn, unsigned int order,
                          unsigned int *flush_flags)
 {
     unsigned long pt_mfn = 0;
--- a/xen/drivers/passthrough/arm/iommu_helpers.c
+++ b/xen/drivers/passthrough/arm/iommu_helpers.c
@@ -57,11 +57,13 @@ int __must_check arm_iommu_map_page(stru
      * The function guest_physmap_add_entry replaces the current mapping
      * if there is already one...
      */
-    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0, t);
+    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+                                   IOMMUF_order(flags), t);
 }
 
 /* Should only be used if P2M Table is shared between the CPU and the IOMMU. */
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags)
 {
     /*
@@ -71,7 +73,8 @@ int __must_check arm_iommu_unmap_page(st
     if ( !is_domain_direct_mapped(d) )
         return -EINVAL;
 
-    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0);
+    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+                                     order);
 }
 
 /*
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -271,6 +271,8 @@ int iommu_map(struct domain *d, dfn_t df
     if ( !is_iommu_enabled(d) )
         return 0;
 
+    ASSERT(!IOMMUF_order(flags));
+
     for ( i = 0; i < page_count; i++ )
     {
         rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
@@ -288,7 +290,7 @@ int iommu_map(struct domain *d, dfn_t df
         while ( i-- )
             /* if statement to satisfy __must_check */
             if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                            flush_flags) )
+                            0, flush_flags) )
                 continue;
 
         if ( !is_hardware_domain(d) )
@@ -333,7 +335,7 @@ int iommu_unmap(struct domain *d, dfn_t
     for ( i = 0; i < page_count; i++ )
     {
         int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                             flush_flags);
+                             0, flush_flags);
 
         if ( likely(!err) )
             continue;
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1934,6 +1934,7 @@ static int __must_check intel_iommu_map_
 }
 
 static int __must_check intel_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                               unsigned int order,
                                                unsigned int *flush_flags)
 {
     /* Do nothing if VT-d shares EPT page table */
@@ -1944,7 +1945,7 @@ static int __must_check intel_iommu_unma
     if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
         return 0;
 
-    return dma_pte_clear_one(d, dfn_to_daddr(dfn), 0, flush_flags);
+    return dma_pte_clear_one(d, dfn_to_daddr(dfn), order, flush_flags);
 }
 
 static int intel_iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn,
--- a/xen/include/asm-arm/iommu.h
+++ b/xen/include/asm-arm/iommu.h
@@ -31,6 +31,7 @@ int __must_check arm_iommu_map_page(stru
                                     unsigned int flags,
                                     unsigned int *flush_flags);
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags);
 
 #endif /* __ARCH_ARM_IOMMU_H__ */
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -127,9 +127,10 @@ void arch_iommu_hwdom_init(struct domain
  * The following flags are passed to map operations and passed by lookup
  * operations.
  */
-#define _IOMMUF_readable 0
+#define IOMMUF_order(n)  ((n) & 0x3f)
+#define _IOMMUF_readable 6
 #define IOMMUF_readable  (1u<<_IOMMUF_readable)
-#define _IOMMUF_writable 1
+#define _IOMMUF_writable 7
 #define IOMMUF_writable  (1u<<_IOMMUF_writable)
 
 /*
@@ -255,6 +256,7 @@ struct iommu_ops {
                                  unsigned int flags,
                                  unsigned int *flush_flags);
     int __must_check (*unmap_page)(struct domain *d, dfn_t dfn,
+                                   unsigned int order,
                                    unsigned int *flush_flags);
     int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
                                     unsigned int *flags);




* [PATCH v2 05/18] IOMMU: have iommu_{,un}map() split requests into largest possible chunks
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (3 preceding siblings ...)
  2021-09-24  9:44 ` [PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
@ 2021-09-24  9:45 ` Jan Beulich
  2021-11-30 15:24   ` Roger Pau Monné
  2021-09-24  9:46 ` [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:45 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

Introduce a helper function to determine the largest possible mapping
that allows covering a request (or the next part of it that is left to
be processed).

In order to not add yet more recurring dfn_add() / mfn_add() to the two
callers of the new helper, also introduce local variables holding the
values presently operated on.
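
A worked example may help (assuming a page_sizes mask with bits 12, 21, and
30 set, as later patches of the series arrange for the x86 IOMMUs):

    /*
     * Sketch: dfn = 0x200, mfn = 0x80200, 0x400 pages left to map.
     * dfn|mfn has its low 9 bits clear and at least 512 pages remain, but
     * neither the alignment nor the remaining count allow a 1G step, so
     * mapping_order() returns 9. iommu_map() therefore issues a single 2M
     * mapping and advances by 1UL << 9 pages.
     */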

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -260,12 +260,38 @@ void iommu_domain_destroy(struct domain
     arch_iommu_domain_destroy(d);
 }
 
-int iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
+static unsigned int mapping_order(const struct domain_iommu *hd,
+                                  dfn_t dfn, mfn_t mfn, unsigned long nr)
+{
+    unsigned long res = dfn_x(dfn) | mfn_x(mfn);
+    unsigned long sizes = hd->platform_ops->page_sizes;
+    unsigned int bit = find_first_set_bit(sizes), order = 0;
+
+    ASSERT(bit == PAGE_SHIFT);
+
+    while ( (sizes = (sizes >> bit) & ~1) )
+    {
+        unsigned long mask;
+
+        bit = find_first_set_bit(sizes);
+        mask = (1UL << bit) - 1;
+        if ( nr <= mask || (res & mask) )
+            break;
+        order += bit;
+        nr >>= bit;
+        res >>= bit;
+    }
+
+    return order;
+}
+
+int iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
               unsigned long page_count, unsigned int flags,
               unsigned int *flush_flags)
 {
     const struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
+    unsigned int order;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
@@ -273,10 +299,16 @@ int iommu_map(struct domain *d, dfn_t df
 
     ASSERT(!IOMMUF_order(flags));
 
-    for ( i = 0; i < page_count; i++ )
+    for ( i = 0; i < page_count; i += 1UL << order )
     {
-        rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
-                        mfn_add(mfn, i), flags, flush_flags);
+        dfn_t dfn = dfn_add(dfn0, i);
+        mfn_t mfn = mfn_add(mfn0, i);
+        unsigned long j;
+
+        order = mapping_order(hd, dfn, mfn, page_count - i);
+
+        rc = iommu_call(hd->platform_ops, map_page, d, dfn, mfn,
+                        flags | IOMMUF_order(order), flush_flags);
 
         if ( likely(!rc) )
             continue;
@@ -284,14 +316,18 @@ int iommu_map(struct domain *d, dfn_t df
         if ( !d->is_shutting_down && printk_ratelimit() )
             printk(XENLOG_ERR
                    "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" failed: %d\n",
-                   d->domain_id, dfn_x(dfn_add(dfn, i)),
-                   mfn_x(mfn_add(mfn, i)), rc);
+                   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
+
+        for ( j = 0; j < i; j += 1UL << order )
+        {
+            dfn = dfn_add(dfn0, j);
+            order = mapping_order(hd, dfn, _mfn(0), i - j);
 
-        while ( i-- )
             /* if statement to satisfy __must_check */
-            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                            0, flush_flags) )
+            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn, order,
+                            flush_flags) )
                 continue;
+        }
 
         if ( !is_hardware_domain(d) )
             domain_crash(d);
@@ -322,20 +358,25 @@ int iommu_legacy_map(struct domain *d, d
     return rc;
 }
 
-int iommu_unmap(struct domain *d, dfn_t dfn, unsigned long page_count,
+int iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
                 unsigned int *flush_flags)
 {
     const struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
+    unsigned int order;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
         return 0;
 
-    for ( i = 0; i < page_count; i++ )
+    for ( i = 0; i < page_count; i += 1UL << order )
     {
-        int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                             0, flush_flags);
+        dfn_t dfn = dfn_add(dfn0, i);
+        int err;
+
+        order = mapping_order(hd, dfn, _mfn(0), page_count - i);
+        err = iommu_call(hd->platform_ops, unmap_page, d, dfn,
+                         order, flush_flags);
 
         if ( likely(!err) )
             continue;
@@ -343,7 +384,7 @@ int iommu_unmap(struct domain *d, dfn_t
         if ( !d->is_shutting_down && printk_ratelimit() )
             printk(XENLOG_ERR
                    "d%d: IOMMU unmapping dfn %"PRI_dfn" failed: %d\n",
-                   d->domain_id, dfn_x(dfn_add(dfn, i)), err);
+                   d->domain_id, dfn_x(dfn), err);
 
         if ( !rc )
             rc = err;




* [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (4 preceding siblings ...)
  2021-09-24  9:45 ` [PATCH v2 05/18] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
@ 2021-09-24  9:46 ` Jan Beulich
  2021-12-01  9:09   ` Roger Pau Monné
  2021-09-24  9:47 ` [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:46 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

While already the case for PVH, there's no reason to treat PV
differently here, though of course the addresses get taken from another
source in this case. The one difference is that, to match CPU side mappings,
we permit r/o ones by default. This then also means we now deal consistently with
IO-APICs whose MMIO is or is not covered by E820 reserved regions.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
[integrated] v1: Integrate into series.
[standalone] v2: Keep IOMMU mappings in sync with CPU ones.

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -253,12 +253,12 @@ void iommu_identity_map_teardown(struct
     }
 }
 
-static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
-                                         unsigned long pfn,
-                                         unsigned long max_pfn)
+static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
+                                                 unsigned long pfn,
+                                                 unsigned long max_pfn)
 {
     mfn_t mfn = _mfn(pfn);
-    unsigned int i, type;
+    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
 
     /*
      * Set up 1:1 mapping for dom0. Default to include only conventional RAM
@@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
      * that fall in unusable ranges for PV Dom0.
      */
     if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-        return false;
+        return 0;
 
     switch ( type = page_get_ram_type(mfn) )
     {
     case RAM_TYPE_UNUSABLE:
-        return false;
+        return 0;
 
     case RAM_TYPE_CONVENTIONAL:
         if ( iommu_hwdom_strict )
-            return false;
+            return 0;
         break;
 
     default:
         if ( type & RAM_TYPE_RESERVED )
         {
             if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-                return false;
+                perms = 0;
         }
-        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
-            return false;
+        else if ( is_hvm_domain(d) )
+            return 0;
+        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
+            perms = 0;
     }
 
     /* Check that it doesn't overlap with the Interrupt Address Range. */
     if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-        return false;
+        return 0;
     /* ... or the IO-APIC */
-    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
-        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-            return false;
+    if ( has_vioapic(d) )
+    {
+        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
+            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
+                return 0;
+    }
+    else if ( is_pv_domain(d) )
+    {
+        /*
+         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
+         * ones there, so it should also have such established for IOMMUs.
+         */
+        for ( i = 0; i < nr_ioapics; i++ )
+            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
+                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
+                       ? IOMMUF_readable : 0;
+    }
     /*
      * ... or the PCIe MCFG regions.
      * TODO: runtime added MMCFG regions are not checked to make sure they
      * don't overlap with already mapped regions, thus preventing trapping.
      */
     if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
-        return false;
+        return 0;
 
-    return true;
+    return perms;
 }
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
@@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
     for ( ; i < top; i++ )
     {
         unsigned long pfn = pdx_to_pfn(i);
+        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
         int rc;
 
-        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
+        if ( !perms )
             rc = 0;
         else if ( paging_mode_translate(d) )
-            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
+            rc = set_identity_p2m_entry(d, pfn,
+                                        perms & IOMMUF_writable ? p2m_access_rw
+                                                                : p2m_access_r,
+                                        0);
         else
             rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+                           perms, &flush_flags);
 
         if ( rc )
             printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",




* [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (5 preceding siblings ...)
  2021-09-24  9:46 ` [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
@ 2021-09-24  9:47 ` Jan Beulich
  2021-12-02 14:10   ` Roger Pau Monné
  2021-09-24  9:48 ` [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables Jan Beulich
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:47 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Wei Liu

For large page mappings to be easily usable (i.e. in particular without
un-shattering of smaller page mappings) and for mapping operations to
then also be more efficient, pass batches of Dom0 memory to iommu_map().
In dom0_construct_pv() and its helpers (covering strict mode) this
additionally requires establishing the type of those pages (albeit with
zero type references).

The earlier establishing of PGT_writable_page | PGT_validated requires
the existing places where this gets done (through get_page_and_type())
to be updated: For pages which actually have a mapping, the type
refcount needs to be 1.

There is actually a related bug that gets fixed here as a side effect:
Typically the last L1 table would get marked as such only after
get_page_and_type(..., PGT_writable_page). While this is fine as far as
refcounting goes, the page did remain mapped in the IOMMU in this case
(when "iommu=dom0-strict").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Subsequently p2m_add_identity_entry() may want to also gain an order
parameter, for arch_iommu_hwdom_init() to use. While this only affects
non-RAM regions, systems typically have 2-16Mb of reserved space
immediately below 4Gb, which hence could be mapped more efficiently.

The installing of zero-ref writable types has in fact shown (observed
while putting together the change) that, despite the intention of the
XSA-288 changes (affecting DomU-s only), for Dom0 a number of
sufficiently ordinary pages (at the very least initrd and P2M ones as
well as pages that are part of the initial allocation but not part of
the initial mapping) still have been starting out as PGT_none, meaning
that they would have gained IOMMU mappings only the first time these
pages would get mapped writably.

I didn't think I needed to address the bug mentioned in the description in
a separate (prereq) patch, but if others disagree I could certainly
break out that part (needing to first use iommu_legacy_unmap() then).

Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
They'll end up mapped via the later get_page_and_type().

As to the way these refs get installed: I've chosen to avoid the more
expensive {get,put}_page_and_type(), putting in place the intended type
directly. I guess I could be convinced to avoid this bypassing of the
actual logic; I merely think it's unnecessarily expensive.

--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -106,11 +106,26 @@ static __init void mark_pv_pt_pages_rdon
     unmap_domain_page(pl3e);
 }
 
+/*
+ * For IOMMU mappings done while building Dom0 the type of the pages needs to
+ * match (for _get_page_type() to unmap upon type change). Set the pages to
+ * writable with no type ref. NB: This is benign when !need_iommu_pt_sync(d).
+ */
+static void __init make_pages_writable(struct page_info *page, unsigned long nr)
+{
+    for ( ; nr--; ++page )
+    {
+        ASSERT(!page->u.inuse.type_info);
+        page->u.inuse.type_info = PGT_writable_page | PGT_validated;
+    }
+}
+
 static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
                                     unsigned long v_start, unsigned long v_end,
                                     unsigned long vphysmap_start,
                                     unsigned long vphysmap_end,
-                                    unsigned long nr_pages)
+                                    unsigned long nr_pages,
+                                    unsigned int *flush_flags)
 {
     struct page_info *page = NULL;
     l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
@@ -123,6 +138,8 @@ static __init void setup_pv_physmap(stru
 
     while ( vphysmap_start < vphysmap_end )
     {
+        int rc = 0;
+
         if ( domain_tot_pages(d) +
              ((round_pgup(vphysmap_end) - vphysmap_start) >> PAGE_SHIFT) +
              3 > nr_pages )
@@ -176,7 +193,22 @@ static __init void setup_pv_physmap(stru
                                              L3_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
-                *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
+                mfn_t mfn = page_to_mfn(page);
+
+                if ( need_iommu_pt_sync(d) )
+                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn,
+                                   SUPERPAGE_PAGES * SUPERPAGE_PAGES,
+                                   IOMMUF_readable | IOMMUF_writable,
+                                   flush_flags);
+                if ( !rc )
+                    make_pages_writable(page,
+                                        SUPERPAGE_PAGES * SUPERPAGE_PAGES);
+                else
+                    printk(XENLOG_ERR
+                           "pre-mapping P2M 1G-MFN %lx into IOMMU failed: %d\n",
+                           mfn_x(mfn), rc);
+
+                *pl3e = l3e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
                 continue;
             }
@@ -202,7 +234,20 @@ static __init void setup_pv_physmap(stru
                                              L2_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
-                *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
+                mfn_t mfn = page_to_mfn(page);
+
+                if ( need_iommu_pt_sync(d) )
+                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, SUPERPAGE_PAGES,
+                                   IOMMUF_readable | IOMMUF_writable,
+                                   flush_flags);
+                if ( !rc )
+                    make_pages_writable(page, SUPERPAGE_PAGES);
+                else
+                    printk(XENLOG_ERR
+                           "pre-mapping P2M 2M-MFN %lx into IOMMU failed: %d\n",
+                           mfn_x(mfn), rc);
+
+                *pl2e = l2e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
                 continue;
             }
@@ -310,6 +355,7 @@ int __init dom0_construct_pv(struct doma
     unsigned long initrd_pfn = -1, initrd_mfn = 0;
     unsigned long count;
     struct page_info *page = NULL;
+    unsigned int flush_flags = 0;
     start_info_t *si;
     struct vcpu *v = d->vcpu[0];
     void *image_base = bootstrap_map(image);
@@ -572,6 +618,18 @@ int __init dom0_construct_pv(struct doma
                     BUG();
         }
         initrd->mod_end = 0;
+
+        count = PFN_UP(initrd_len);
+
+        if ( need_iommu_pt_sync(d) )
+            rc = iommu_map(d, _dfn(initrd_mfn), _mfn(initrd_mfn), count,
+                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+        if ( !rc )
+            make_pages_writable(mfn_to_page(_mfn(initrd_mfn)), count);
+        else
+            printk(XENLOG_ERR
+                   "pre-mapping initrd (MFN %lx) into IOMMU failed: %d\n",
+                   initrd_mfn, rc);
     }
 
     printk("PHYSICAL MEMORY ARRANGEMENT:\n"
@@ -605,6 +663,22 @@ int __init dom0_construct_pv(struct doma
 
     process_pending_softirqs();
 
+    /*
+     * We map the full range here and then punch a hole for page tables via
+     * iommu_unmap() further down, once they have got marked as such.
+     */
+    if ( need_iommu_pt_sync(d) )
+        rc = iommu_map(d, _dfn(alloc_spfn), _mfn(alloc_spfn),
+                       alloc_epfn - alloc_spfn,
+                       IOMMUF_readable | IOMMUF_writable, &flush_flags);
+    if ( !rc )
+        make_pages_writable(mfn_to_page(_mfn(alloc_spfn)),
+                            alloc_epfn - alloc_spfn);
+    else
+        printk(XENLOG_ERR
+               "pre-mapping MFNs [%lx,%lx) into IOMMU failed: %d\n",
+               alloc_spfn, alloc_epfn, rc);
+
     mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
     if ( vinitrd_start )
         mpt_alloc -= PAGE_ALIGN(initrd_len);
@@ -689,7 +763,8 @@ int __init dom0_construct_pv(struct doma
         l1tab++;
 
         page = mfn_to_page(_mfn(mfn));
-        if ( !page->u.inuse.type_info &&
+        if ( (!page->u.inuse.type_info ||
+              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
              !get_page_and_type(page, d, PGT_writable_page) )
             BUG();
     }
@@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
     /* Pages that are part of page tables must be read only. */
     mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
 
+    /*
+     * This needs to come after all potentially excess
+     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
+     * few lines further up, where the effect of calling that function in an
+     * earlier loop iteration may get overwritten by a later one.
+     */
+    if ( need_iommu_pt_sync(d) &&
+         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
+                     &flush_flags) )
+        BUG();
+
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
         shared_info(d, vcpu_info[i].evtchn_upcall_mask) = 1;
@@ -793,7 +879,7 @@ int __init dom0_construct_pv(struct doma
     {
         pfn = pagetable_get_pfn(v->arch.guest_table);
         setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
-                         nr_pages);
+                         nr_pages, &flush_flags);
     }
 
     /* Write the phys->machine and machine->phys table entries. */
@@ -824,7 +910,9 @@ int __init dom0_construct_pv(struct doma
         if ( get_gpfn_from_mfn(mfn) >= count )
         {
             BUG_ON(compat);
-            if ( !page->u.inuse.type_info &&
+            if ( (!page->u.inuse.type_info ||
+                  page->u.inuse.type_info == (PGT_writable_page |
+                                              PGT_validated)) &&
                  !get_page_and_type(page, d, PGT_writable_page) )
                 BUG();
 
@@ -840,22 +928,41 @@ int __init dom0_construct_pv(struct doma
 #endif
     while ( pfn < nr_pages )
     {
-        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
+        count = domain_tot_pages(d);
+        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
             panic("Not enough RAM for DOM0 reservation\n");
+        mfn = mfn_x(page_to_mfn(page));
+
+        if ( need_iommu_pt_sync(d) )
+        {
+            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
+                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+            if ( rc )
+                printk(XENLOG_ERR
+                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
+                       mfn, pfn, rc);
+        }
+
         while ( pfn < domain_tot_pages(d) )
         {
-            mfn = mfn_x(page_to_mfn(page));
+            if ( !rc )
+                make_pages_writable(page, 1);
+
 #ifndef NDEBUG
 #define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
 #endif
             dom0_update_physmap(compat, pfn, mfn, vphysmap_start);
 #undef pfn
-            page++; pfn++;
+            page++; mfn++; pfn++;
             if ( !(pfn & 0xfffff) )
                 process_pending_softirqs();
         }
     }
 
+    /* Use while() to avoid compiler warning. */
+    while ( iommu_iotlb_flush_all(d, flush_flags) )
+        break;
+
     if ( initrd_len != 0 )
     {
         si->mod_start = vinitrd_start ?: initrd_pfn;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -325,8 +325,8 @@ static unsigned int __hwdom_init hwdom_i
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 {
-    unsigned long i, top, max_pfn;
-    unsigned int flush_flags = 0;
+    unsigned long i, top, max_pfn, start, count;
+    unsigned int flush_flags = 0, start_perms = 0;
 
     BUG_ON(!is_hardware_domain(d));
 
@@ -357,9 +357,9 @@ void __hwdom_init arch_iommu_hwdom_init(
      * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
      * setting up potentially conflicting mappings here.
      */
-    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
+    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
 
-    for ( ; i < top; i++ )
+    for ( i = start, count = 0; i < top; )
     {
         unsigned long pfn = pdx_to_pfn(i);
         unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
@@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
                                         perms & IOMMUF_writable ? p2m_access_rw
                                                                 : p2m_access_r,
                                         0);
+        else if ( pfn != start + count || perms != start_perms )
+        {
+        commit:
+            rc = iommu_map(d, _dfn(start), _mfn(start), count,
+                           start_perms, &flush_flags);
+            SWAP(start, pfn);
+            start_perms = perms;
+            count = 1;
+        }
         else
-            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           perms, &flush_flags);
+        {
+            ++count;
+            rc = 0;
+        }
 
         if ( rc )
             printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
                    d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
 
-        if (!(i & 0xfffff))
+        if ( !(++i & 0xfffff) )
             process_pending_softirqs();
+
+        if ( i == top && count )
+            goto commit;
     }
 
     /* Use if to avoid compiler warning */




* [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (6 preceding siblings ...)
  2021-09-24  9:47 ` [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
@ 2021-09-24  9:48 ` Jan Beulich
  2021-12-02 16:03   ` Roger Pau Monné
  2021-12-10 13:51   ` Roger Pau Monné
  2021-09-24  9:48 ` [PATCH v2 09/18] AMD/IOMMU: drop stray TLB flush Jan Beulich
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:48 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Wei Liu

For vendor specific code to support superpages we need to be able to
deal with a superpage mapping replacing an intermediate page table (or
hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
needed to free individual page tables while a domain is still alive.
Since the freeing needs to be deferred until after a suitable IOTLB
flush was performed, released page tables get queued for processing by a
tasklet.
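
A hedged sketch (not part of this patch) of the intended vendor-side use when
a superpage entry replaces an intermediate table; "old_entry_was_non_leaf",
"old_table_maddr", and the flush_flags pointer are illustrative placeholders:

    /*
     * Sketch: after writing the superpage entry over a previously non-leaf
     * one, hand the now-unreferenced table to the deferred freeing machinery
     * rather than freeing it right away.
     */
    if ( old_entry_was_non_leaf )
    {
        struct page_info *pg = maddr_to_page(old_table_maddr);

        *flush_flags |= IOMMU_FLUSHF_modified;
        iommu_queue_free_pgtable(d, pg); /* actually freed after the IOTLB flush */
    }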

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I was considering whether to use a softirq-tasklet instead. This would
have the benefit of avoiding extra scheduling operations, but come with
the risk of the freeing happening prematurely because of a
process_pending_softirqs() somewhere.

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -12,6 +12,7 @@
  * this program; If not, see <http://www.gnu.org/licenses/>.
  */
 
+#include <xen/cpu.h>
 #include <xen/sched.h>
 #include <xen/iommu.h>
 #include <xen/paging.h>
@@ -463,6 +464,85 @@ struct page_info *iommu_alloc_pgtable(st
     return pg;
 }
 
+/*
+ * Intermediate page tables which get replaced by large pages may only be
+ * freed after a suitable IOTLB flush. Hence such pages get queued on a
+ * per-CPU list, with a per-CPU tasklet processing the list on the assumption
+ * that the necessary IOTLB flush will have occurred by the time tasklets get
+ * to run. (List and tasklet being per-CPU has the benefit of accesses not
+ * requiring any locking.)
+ */
+static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
+static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
+
+static void free_queued_pgtables(void *arg)
+{
+    struct page_list_head *list = arg;
+    struct page_info *pg;
+
+    while ( (pg = page_list_remove_head(list)) )
+        free_domheap_page(pg);
+}
+
+void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    unsigned int cpu = smp_processor_id();
+
+    spin_lock(&hd->arch.pgtables.lock);
+    page_list_del(pg, &hd->arch.pgtables.list);
+    spin_unlock(&hd->arch.pgtables.lock);
+
+    page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu));
+
+    tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu));
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+    struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
+    struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);
+
+    switch ( action )
+    {
+    case CPU_DOWN_PREPARE:
+        tasklet_kill(tasklet);
+        break;
+
+    case CPU_DEAD:
+        page_list_splice(list, &this_cpu(free_pgt_list));
+        INIT_PAGE_LIST_HEAD(list);
+        tasklet_schedule(&this_cpu(free_pgt_tasklet));
+        break;
+
+    case CPU_UP_PREPARE:
+    case CPU_DOWN_FAILED:
+        tasklet_init(tasklet, free_queued_pgtables, list);
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init bsp_init(void)
+{
+    if ( iommu_enabled )
+    {
+        cpu_callback(&cpu_nfb, CPU_UP_PREPARE,
+                     (void *)(unsigned long)smp_processor_id());
+        register_cpu_notifier(&cpu_nfb);
+    }
+
+    return 0;
+}
+presmp_initcall(bsp_init);
+
 bool arch_iommu_use_permitted(const struct domain *d)
 {
     /*
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -143,6 +143,7 @@ int pi_update_irte(const struct pi_desc
 
 int __must_check iommu_free_pgtables(struct domain *d);
 struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 09/18] AMD/IOMMU: drop stray TLB flush
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (7 preceding siblings ...)
  2021-09-24  9:48 ` [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables Jan Beulich
@ 2021-09-24  9:48 ` Jan Beulich
  2021-12-02 16:16   ` Roger Pau Monné
  2021-09-24  9:51 ` [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault Jan Beulich
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:48 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

I think this flush was overlooked when flushing was moved out of the
core (un)mapping functions. The flush the caller is required to invoke
anyway will satisfy the needs resulting from the splitting of a
superpage.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
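
For illustration only (not part of the patch), the caller-side pattern
this relies on, with the generic layer reduced to a made-up wrapper; the
two AMD functions are used with their real signatures:

/*
 * Illustrative sketch: flush indications accumulate across the
 * individual map operations, and a single flush follows.  That flush
 * also covers any page table splitting done along the way.
 */
static int map_and_flush(struct domain *d, dfn_t dfn, mfn_t mfn,
                         unsigned long nr)
{
    unsigned int flush_flags = 0;
    unsigned long i;
    int rc = 0;

    for ( i = 0; !rc && i < nr; ++i )
        rc = amd_iommu_map_page(d, dfn_add(dfn, i), mfn_add(mfn, i),
                                IOMMUF_readable | IOMMUF_writable,
                                &flush_flags);

    if ( flush_flags )
    {
        int err = amd_iommu_flush_iotlb_pages(d, dfn, nr, flush_flags);

        if ( !rc )
            rc = err;
    }

    return rc;
}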

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -179,7 +179,7 @@ void __init iommu_dte_add_device_entry(s
  */
 static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
                               unsigned int target, unsigned long *pt_mfn,
-                              bool map)
+                              unsigned int *flush_flags, bool map)
 {
     union amd_iommu_pte *pde, *next_table_vaddr;
     unsigned long  next_table_mfn;
@@ -237,7 +237,7 @@ static int iommu_pde_from_dfn(struct dom
             set_iommu_pde_present(pde, next_table_mfn, next_level, true,
                                   true);
 
-            amd_iommu_flush_all_pages(d);
+            *flush_flags |= IOMMU_FLUSHF_modified;
         }
 
         /* Install lower level page table for non-present entries */
@@ -309,7 +309,8 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, true) || !pt_mfn )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+         !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -342,7 +343,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (8 preceding siblings ...)
  2021-09-24  9:48 ` [PATCH v2 09/18] AMD/IOMMU: drop stray TLB flush Jan Beulich
@ 2021-09-24  9:51 ` Jan Beulich
  2021-12-03  9:03   ` Roger Pau Monné
  2021-09-24  9:51 ` [PATCH v2 11/18] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

This is to aid diagnosing issues and largely matches VT-d's behavior.
Since I'm adding permissions output here as well, take the opportunity
to also display them in amd_dump_page_table_level().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -243,6 +243,8 @@ int __must_check amd_iommu_flush_iotlb_p
                                              unsigned long page_count,
                                              unsigned int flush_flags);
 int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+                             dfn_t dfn);
 
 /* device table functions */
 int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -573,6 +573,9 @@ static void parse_event_log_entry(struct
                (flags & 0x002) ? " NX" : "",
                (flags & 0x001) ? " GN" : "");
 
+        if ( iommu_verbose )
+            amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
+
         for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
             if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
                 pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -363,6 +363,50 @@ int amd_iommu_unmap_page(struct domain *
     return 0;
 }
 
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+                             dfn_t dfn)
+{
+    mfn_t pt_mfn;
+    unsigned int level;
+    const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
+
+    if ( !dt[dev_id].tv )
+    {
+        printk("%pp: no root\n", &PCI_SBDF2(iommu->seg, dev_id));
+        return;
+    }
+
+    pt_mfn = _mfn(dt[dev_id].pt_root);
+    level = dt[dev_id].paging_mode;
+    printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
+           &PCI_SBDF2(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
+
+    while ( level )
+    {
+        const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
+        unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
+        union amd_iommu_pte pte = pt[idx];
+
+        unmap_domain_page(pt);
+
+        printk("  L%u[%03x] = %"PRIx64" %c%c\n", level, idx, pte.raw,
+               pte.pr ? pte.ir ? 'r' : '-' : 'n',
+               pte.pr ? pte.iw ? 'w' : '-' : 'p');
+
+        if ( !pte.pr )
+            break;
+
+        if ( pte.next_level >= level )
+        {
+            printk("  L%u[%03x]: next: %u\n", level, idx, pte.next_level);
+            break;
+        }
+
+        pt_mfn = _mfn(pte.mfn);
+        level = pte.next_level;
+    }
+}
+
 static unsigned long flush_count(unsigned long dfn, unsigned long page_count,
                                  unsigned int order)
 {
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -607,10 +607,11 @@ static void amd_dump_page_table_level(st
                 mfn_to_page(_mfn(pde->mfn)), pde->next_level,
                 address, indent + 1);
         else
-            printk("%*sdfn: %08lx  mfn: %08lx\n",
+            printk("%*sdfn: %08lx  mfn: %08lx  %c%c\n",
                    indent, "",
                    (unsigned long)PFN_DOWN(address),
-                   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)));
+                   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)),
+                   pde->ir ? 'r' : '-', pde->iw ? 'w' : '-');
     }
 
     unmap_domain_page(table_vaddr);



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 11/18] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (9 preceding siblings ...)
  2021-09-24  9:51 ` [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault Jan Beulich
@ 2021-09-24  9:51 ` Jan Beulich
  2021-12-10 12:05   ` Roger Pau Monné
  2021-09-24  9:52 ` [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings Jan Beulich
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

In order to free intermediate page tables when replacing smaller
mappings by a single larger one, callers will need to know the full PTE.
Flush indicators can be derived from this in the callers (and outside
the locked regions). First split set_iommu_pte_present() from
set_iommu_ptes_present(): Only the former needs to return the old PTE,
while the latter (like set_iommu_pde_present()) doesn't even need to
return flush indicators. Then change return types/values and callers
accordingly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -31,30 +31,28 @@ static unsigned int pfn_to_pde_idx(unsig
     return idx;
 }
 
-static unsigned int clear_iommu_pte_present(unsigned long l1_mfn,
-                                            unsigned long dfn)
+static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
+                                                   unsigned long dfn)
 {
-    union amd_iommu_pte *table, *pte;
-    unsigned int flush_flags;
+    union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
     pte = &table[pfn_to_pde_idx(dfn, 1)];
+    old = *pte;
 
-    flush_flags = pte->pr ? IOMMU_FLUSHF_modified : 0;
     write_atomic(&pte->raw, 0);
 
     unmap_domain_page(table);
 
-    return flush_flags;
+    return old;
 }
 
-static unsigned int set_iommu_pde_present(union amd_iommu_pte *pte,
-                                          unsigned long next_mfn,
-                                          unsigned int next_level, bool iw,
-                                          bool ir)
+static void set_iommu_pde_present(union amd_iommu_pte *pte,
+                                  unsigned long next_mfn,
+                                  unsigned int next_level,
+                                  bool iw, bool ir)
 {
-    union amd_iommu_pte new = {}, old;
-    unsigned int flush_flags = IOMMU_FLUSHF_added;
+    union amd_iommu_pte new = {};
 
     /*
      * FC bit should be enabled in PTE, this helps to solve potential
@@ -68,28 +66,42 @@ static unsigned int set_iommu_pde_presen
     new.next_level = next_level;
     new.pr = true;
 
-    old.raw = read_atomic(&pte->raw);
-    old.ign0 = 0;
-    old.ign1 = 0;
-    old.ign2 = 0;
+    write_atomic(&pte->raw, new.raw);
+}
 
-    if ( old.pr && old.raw != new.raw )
-        flush_flags |= IOMMU_FLUSHF_modified;
+static union amd_iommu_pte set_iommu_pte_present(unsigned long pt_mfn,
+                                                 unsigned long dfn,
+                                                 unsigned long next_mfn,
+                                                 unsigned int level,
+                                                 bool iw, bool ir)
+{
+    union amd_iommu_pte *table, *pde, old;
 
-    write_atomic(&pte->raw, new.raw);
+    table = map_domain_page(_mfn(pt_mfn));
+    pde = &table[pfn_to_pde_idx(dfn, level)];
+
+    old = *pde;
+    if ( !old.pr || old.next_level ||
+         old.mfn != next_mfn ||
+         old.iw != iw || old.ir != ir )
+        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+    else
+        old.pr = false; /* signal "no change" to the caller */
 
-    return flush_flags;
+    unmap_domain_page(table);
+
+    return old;
 }
 
-static unsigned int set_iommu_ptes_present(unsigned long pt_mfn,
-                                           unsigned long dfn,
-                                           unsigned long next_mfn,
-                                           unsigned int nr_ptes,
-                                           unsigned int pde_level,
-                                           bool iw, bool ir)
+static void set_iommu_ptes_present(unsigned long pt_mfn,
+                                   unsigned long dfn,
+                                   unsigned long next_mfn,
+                                   unsigned int nr_ptes,
+                                   unsigned int pde_level,
+                                   bool iw, bool ir)
 {
     union amd_iommu_pte *table, *pde;
-    unsigned int page_sz, flush_flags = 0;
+    unsigned int page_sz;
 
     table = map_domain_page(_mfn(pt_mfn));
     pde = &table[pfn_to_pde_idx(dfn, pde_level)];
@@ -98,20 +110,18 @@ static unsigned int set_iommu_ptes_prese
     if ( (void *)(pde + nr_ptes) > (void *)table + PAGE_SIZE )
     {
         ASSERT_UNREACHABLE();
-        return 0;
+        return;
     }
 
     while ( nr_ptes-- )
     {
-        flush_flags |= set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
 
         ++pde;
         next_mfn += page_sz;
     }
 
     unmap_domain_page(table);
-
-    return flush_flags;
 }
 
 void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
@@ -284,6 +294,7 @@ int amd_iommu_map_page(struct domain *d,
     struct domain_iommu *hd = dom_iommu(d);
     int rc;
     unsigned long pt_mfn = 0;
+    union amd_iommu_pte old;
 
     spin_lock(&hd->arch.mapping_lock);
 
@@ -320,12 +331,16 @@ int amd_iommu_map_page(struct domain *d,
     }
 
     /* Install 4k mapping */
-    *flush_flags |= set_iommu_ptes_present(pt_mfn, dfn_x(dfn), mfn_x(mfn),
-                                           1, 1, (flags & IOMMUF_writable),
-                                           (flags & IOMMUF_readable));
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+                                (flags & IOMMUF_writable),
+                                (flags & IOMMUF_readable));
 
     spin_unlock(&hd->arch.mapping_lock);
 
+    *flush_flags |= IOMMU_FLUSHF_added;
+    if ( old.pr )
+        *flush_flags |= IOMMU_FLUSHF_modified;
+
     return 0;
 }
 
@@ -334,6 +349,7 @@ int amd_iommu_unmap_page(struct domain *
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
 
@@ -355,11 +371,14 @@ int amd_iommu_unmap_page(struct domain *
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        *flush_flags |= clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
+    if ( old.pr )
+        *flush_flags |= IOMMU_FLUSHF_modified;
+
     return 0;
 }
 



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (10 preceding siblings ...)
  2021-09-24  9:51 ` [PATCH v2 11/18] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
@ 2021-09-24  9:52 ` Jan Beulich
  2021-12-10 15:06   ` Roger Pau Monné
  2021-09-24  9:52 ` [PATCH v2 13/18] VT-d: " Jan Beulich
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:52 UTC (permalink / raw)
  To: xen-devel
  Cc: Paul Durrant, Andrew Cooper, George Dunlap, Ian Jackson,
	Julien Grall, Stefano Stabellini, Wei Liu

No separate feature flags exist which would control availability of
these; the only restriction is HATS (establishing the maximum number of
page table levels in general), and even that has a lower bound of 4.
Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
non-default page sizes the implementation in principle permits arbitrary
size mappings, but these require multiple identical leaf PTEs to be
written, which isn't all that different from having to write multiple
consecutive PTEs with increasing frame numbers. IMO that's therefore
beneficial only on hardware where suitable TLBs exist; I'm unaware of
such hardware.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm not fully sure about allowing 512G mappings: The scheduling-for-
freeing of intermediate page tables can take quite a while when
replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
there's no present code path via which 512G chunks of memory could be
allocated (and hence mapped) anyway.
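
To spell out the order-to-level mapping used below (stand-alone
illustration; that PTE_PER_TABLE_SHIFT is 9, i.e. 512 entries per table
level, is an assumption here, implied by the 2M/1G/512G sizes):

#include <stdio.h>

#define PTE_PER_TABLE_SHIFT 9 /* 512 entries per table level (assumed) */

int main(void)
{
    static const char *const name[] = { "4k", "2M", "1G", "512G" };
    unsigned int order;

    /* Orders as passed via IOMMUF_order(): multiples of the stride. */
    for ( order = 0; order <= 27; order += PTE_PER_TABLE_SHIFT )
    {
        unsigned int level = order / PTE_PER_TABLE_SHIFT + 1;

        printf("order %2u -> level %u (%s mapping)\n", order, level,
               name[level - 1]);
    }

    return 0;
}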

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
 }
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
-                                                   unsigned long dfn)
+                                                   unsigned long dfn,
+                                                   unsigned int level)
 {
     union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
-    pte = &table[pfn_to_pde_idx(dfn, 1)];
+    pte = &table[pfn_to_pde_idx(dfn, level)];
     old = *pte;
 
     write_atomic(&pte->raw, 0);
@@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
     return 0;
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
+{
+    if ( next_level > 1 )
+    {
+        union amd_iommu_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
+            if ( pt[i].pr && pt[i].next_level )
+            {
+                ASSERT(pt[i].next_level < next_level);
+                queue_free_pt(d, _mfn(pt[i].mfn), pt[i].next_level);
+            }
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 int amd_iommu_map_page(struct domain *d, dfn_t dfn, mfn_t mfn,
                        unsigned int flags, unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
     int rc;
     unsigned long pt_mfn = 0;
     union amd_iommu_pte old;
@@ -320,7 +342,7 @@ int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
          !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -330,8 +352,8 @@ int amd_iommu_map_page(struct domain *d,
         return -EFAULT;
     }
 
-    /* Install 4k mapping */
-    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+    /* Install mapping */
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
                                 (flags & IOMMUF_writable),
                                 (flags & IOMMUF_readable));
 
@@ -339,8 +361,13 @@ int amd_iommu_map_page(struct domain *d,
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( level > 1 && old.next_level )
+            queue_free_pt(d, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
@@ -349,6 +376,7 @@ int amd_iommu_unmap_page(struct domain *
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
     union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
@@ -359,7 +387,7 @@ int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -371,14 +399,19 @@ int amd_iommu_unmap_page(struct domain *
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( level > 1 && old.next_level )
+            queue_free_pt(d, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -630,7 +630,7 @@ static void amd_dump_page_tables(struct
 }
 
 static const struct iommu_ops __initconstrel _iommu_ops = {
-    .page_sizes = PAGE_SIZE_4K,
+    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
     .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/include/xen/page-defs.h
+++ b/xen/include/xen/page-defs.h
@@ -21,4 +21,19 @@
 #define PAGE_MASK_64K               PAGE_MASK_GRAN(64K)
 #define PAGE_ALIGN_64K(addr)        PAGE_ALIGN_GRAN(64K, addr)
 
+#define PAGE_SHIFT_2M               21
+#define PAGE_SIZE_2M                PAGE_SIZE_GRAN(2M)
+#define PAGE_MASK_2M                PAGE_MASK_GRAN(2M)
+#define PAGE_ALIGN_2M(addr)         PAGE_ALIGN_GRAN(2M, addr)
+
+#define PAGE_SHIFT_1G               30
+#define PAGE_SIZE_1G                PAGE_SIZE_GRAN(1G)
+#define PAGE_MASK_1G                PAGE_MASK_GRAN(1G)
+#define PAGE_ALIGN_1G(addr)         PAGE_ALIGN_GRAN(1G, addr)
+
+#define PAGE_SHIFT_512G             39
+#define PAGE_SIZE_512G              PAGE_SIZE_GRAN(512G)
+#define PAGE_MASK_512G              PAGE_MASK_GRAN(512G)
+#define PAGE_ALIGN_512G(addr)       PAGE_ALIGN_GRAN(512G, addr)
+
 #endif /* __XEN_PAGE_DEFS_H__ */



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 13/18] VT-d: allow use of superpage mappings
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (11 preceding siblings ...)
  2021-09-24  9:52 ` [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings Jan Beulich
@ 2021-09-24  9:52 ` Jan Beulich
  2021-12-13 11:54   ` Roger Pau Monné
  2021-09-24  9:53 ` [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one" Jan Beulich
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Kevin Tian

... depending on feature availability (and absence of quirks).

Also make the page table dumping function aware of superpages.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
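
As a stand-alone illustration (not the patch code) of the effect of the
vtd_setup() changes below: the announced sizes end up being the
intersection of all units' SPS capabilities, on top of 4k; the
capability values here are made up:

#include <stdio.h>

#define PAGE_SIZE_4K (1u << 12)
#define PAGE_SIZE_2M (1u << 21)
#define PAGE_SIZE_1G (1u << 30)

int main(void)
{
    /* Made-up example: the second unit lacks 1GB superpage support. */
    static const struct { unsigned char sps_2mb, sps_1gb; } drhd[] =
        { { 1, 1 }, { 1, 0 } };
    unsigned int large_sizes = PAGE_SIZE_2M | PAGE_SIZE_1G;
    unsigned int i;

    for ( i = 0; i < sizeof(drhd) / sizeof(drhd[0]); ++i )
    {
        if ( !drhd[i].sps_2mb )
            large_sizes &= ~PAGE_SIZE_2M;
        if ( !drhd[i].sps_1gb )
            large_sizes &= ~PAGE_SIZE_1G;
    }

    /* Only sizes supported by every unit get announced. */
    printf("page_sizes = %#x\n", PAGE_SIZE_4K | large_sizes);

    return 0;
}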

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -743,18 +743,37 @@ static int __must_check iommu_flush_iotl
     return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
+{
+    if ( next_level > 1 )
+    {
+        struct dma_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_NUM; ++i )
+            if ( dma_pte_present(pt[i]) && !dma_pte_superpage(pt[i]) )
+                queue_free_pt(d, maddr_to_mfn(dma_pte_addr(pt[i])),
+                              next_level - 1);
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 /* clear one page's page table */
 static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
                              unsigned int order,
                              unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct dma_pte *page = NULL, *pte = NULL;
+    struct dma_pte *page = NULL, *pte = NULL, old;
     u64 pg_maddr;
+    unsigned int level = (order / LEVEL_STRIDE) + 1;
 
     spin_lock(&hd->arch.mapping_lock);
-    /* get last level pte */
-    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags, false);
+    /* get target level pte */
+    pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags, false);
     if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -762,7 +781,7 @@ static int dma_pte_clear_one(struct doma
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-    pte = page + address_level_offset(addr, 1);
+    pte = &page[address_level_offset(addr, level)];
 
     if ( !dma_pte_present(*pte) )
     {
@@ -771,14 +790,19 @@ static int dma_pte_clear_one(struct doma
         return 0;
     }
 
+    old = *pte;
     dma_clear_pte(*pte);
-    *flush_flags |= IOMMU_FLUSHF_modified;
 
     spin_unlock(&hd->arch.mapping_lock);
     iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
 
+    *flush_flags |= IOMMU_FLUSHF_modified;
+
+    if ( level > 1 && !dma_pte_superpage(old) )
+        queue_free_pt(domain, maddr_to_mfn(dma_pte_addr(old)), level - 1);
+
     return 0;
 }
 
@@ -1868,6 +1892,7 @@ static int __must_check intel_iommu_map_
     struct domain_iommu *hd = dom_iommu(d);
     struct dma_pte *page, *pte, old, new = {};
     u64 pg_maddr;
+    unsigned int level = (IOMMUF_order(flags) / LEVEL_STRIDE) + 1;
     int rc = 0;
 
     /* Do nothing if VT-d shares EPT page table */
@@ -1892,7 +1917,7 @@ static int __must_check intel_iommu_map_
         return 0;
     }
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1, flush_flags,
+    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), level, flush_flags,
                                       true);
     if ( pg_maddr < PAGE_SIZE )
     {
@@ -1901,13 +1926,15 @@ static int __must_check intel_iommu_map_
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-    pte = &page[dfn_x(dfn) & LEVEL_MASK];
+    pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
     old = *pte;
 
     dma_set_pte_addr(new, mfn_to_maddr(mfn));
     dma_set_pte_prot(new,
                      ((flags & IOMMUF_readable) ? DMA_PTE_READ  : 0) |
                      ((flags & IOMMUF_writable) ? DMA_PTE_WRITE : 0));
+    if ( IOMMUF_order(flags) )
+        dma_set_pte_superpage(new);
 
     /* Set the SNP on leaf page table if Snoop Control available */
     if ( iommu_snoop )
@@ -1928,8 +1955,13 @@ static int __must_check intel_iommu_map_
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( dma_pte_present(old) )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( level > 1 && !dma_pte_superpage(old) )
+            queue_free_pt(d, maddr_to_mfn(dma_pte_addr(old)), level - 1);
+    }
+
     return rc;
 }
 
@@ -2286,6 +2318,7 @@ static int __init vtd_setup(void)
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
+    unsigned int large_sizes = PAGE_SIZE_2M | PAGE_SIZE_1G;
     int ret;
     bool reg_inval_supported = true;
 
@@ -2328,6 +2361,11 @@ static int __init vtd_setup(void)
                cap_sps_2mb(iommu->cap) ? ", 2MB" : "",
                cap_sps_1gb(iommu->cap) ? ", 1GB" : "");
 
+        if ( !cap_sps_2mb(iommu->cap) )
+            large_sizes &= ~PAGE_SIZE_2M;
+        if ( !cap_sps_1gb(iommu->cap) )
+            large_sizes &= ~PAGE_SIZE_1G;
+
 #ifndef iommu_snoop
         if ( iommu_snoop && !ecap_snp_ctl(iommu->ecap) )
             iommu_snoop = false;
@@ -2399,6 +2437,9 @@ static int __init vtd_setup(void)
     if ( ret )
         goto error;
 
+    ASSERT(iommu_ops.page_sizes & PAGE_SIZE_4K);
+    iommu_ops.page_sizes |= large_sizes;
+
     register_keyhandler('V', vtd_dump_iommu_info, "dump iommu info", 1);
 
     return 0;
@@ -2712,7 +2753,7 @@ static void vtd_dump_page_table_level(pa
             continue;
 
         address = gpa + offset_level_address(i, level);
-        if ( next_level >= 1 ) 
+        if ( next_level && !dma_pte_superpage(*pte) )
             vtd_dump_page_table_level(dma_pte_addr(*pte), next_level,
                                       address, indent + 1);
         else



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (12 preceding siblings ...)
  2021-09-24  9:52 ` [PATCH v2 13/18] VT-d: " Jan Beulich
@ 2021-09-24  9:53 ` Jan Beulich
  2021-12-13 15:04   ` Roger Pau Monné
                     ` (3 more replies)
  2021-09-24  9:54 ` [PATCH v2 15/18] IOMMU/x86: prefill newly allocated page tables Jan Beulich
                   ` (3 subsequent siblings)
  17 siblings, 4 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:53 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis,
	Rahul Singh

Having a separate flush-all hook has always puzzled me somewhat. We
will want to be able to force a full flush via accumulated flush flags
from the map/unmap functions. Introduce a respective new flag and fold
all flush handling to use the single remaining hook.

Note that because of the respective comments in the SMMU and IPMMU-VMSA
code, I've folded the two prior hook functions into one. For SMMU-v3,
which lacks a comment indicating incapable hardware, I've left both
functions in place on the assumption that selective and full flushes
will eventually want separating.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: What we really are going to need is for the map/unmap functions to
     specify that a wider region needs flushing than just the one
     covered by the present set of (un)maps. This may still be less than
     a full flush, but at least as a first step it seemed better to me
     to keep things simple and go the flush-all route.
---
v2: New.
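
For illustration only (not part of the patch), the shape a driver's
single flush hook takes after the fold; the example_*() helpers are
hypothetical, while the real implementations are in the AMD and VT-d
hunks below:

/* Illustrative sketch of a folded flush hook for a hypothetical driver. */
static int __must_check example_iotlb_flush(struct domain *d, dfn_t dfn,
                                            unsigned long page_count,
                                            unsigned int flush_flags)
{
    ASSERT(flush_flags);

    /* A full flush is now requested through the same hook. */
    if ( flush_flags & IOMMU_FLUSHF_all )
        return example_flush_everything(d);          /* hypothetical */

    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));

    return example_flush_range(d, dfn, page_count);  /* hypothetical */
}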

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -242,7 +242,6 @@ int amd_iommu_get_reserved_device_memory
 int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
                                              unsigned long page_count,
                                              unsigned int flush_flags);
-int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
 void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
                              dfn_t dfn);
 
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -475,15 +475,18 @@ int amd_iommu_flush_iotlb_pages(struct d
 {
     unsigned long dfn_l = dfn_x(dfn);
 
-    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
-    ASSERT(flush_flags);
+    if ( !(flush_flags & IOMMU_FLUSHF_all) )
+    {
+        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
+        ASSERT(flush_flags);
+    }
 
     /* Unless a PTE was modified, no flush is required */
     if ( !(flush_flags & IOMMU_FLUSHF_modified) )
         return 0;
 
-    /* If the range wraps then just flush everything */
-    if ( dfn_l + page_count < dfn_l )
+    /* If so requested or if the range wraps then just flush everything. */
+    if ( (flush_flags & IOMMU_FLUSHF_all) || dfn_l + page_count < dfn_l )
     {
         amd_iommu_flush_all_pages(d);
         return 0;
@@ -508,13 +511,6 @@ int amd_iommu_flush_iotlb_pages(struct d
 
     return 0;
 }
-
-int amd_iommu_flush_iotlb_all(struct domain *d)
-{
-    amd_iommu_flush_all_pages(d);
-
-    return 0;
-}
 
 int amd_iommu_reserve_domain_unity_map(struct domain *d,
                                        const struct ivrs_unity_map *map,
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -642,7 +642,6 @@ static const struct iommu_ops __initcons
     .map_page = amd_iommu_map_page,
     .unmap_page = amd_iommu_unmap_page,
     .iotlb_flush = amd_iommu_flush_iotlb_pages,
-    .iotlb_flush_all = amd_iommu_flush_iotlb_all,
     .reassign_device = reassign_device,
     .get_device_group_id = amd_iommu_group_id,
     .enable_x2apic = iov_enable_xt,
--- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -930,13 +930,19 @@ out:
 }
 
 /* Xen IOMMU ops */
-static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
+static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
+                                          unsigned long page_count,
+                                          unsigned int flush_flags)
 {
     struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
 
+    ASSERT(flush_flags);
+
     if ( !xen_domain || !xen_domain->root_domain )
         return 0;
 
+    /* The hardware doesn't support selective TLB flush. */
+
     spin_lock(&xen_domain->lock);
     ipmmu_tlb_invalidate(xen_domain->root_domain);
     spin_unlock(&xen_domain->lock);
@@ -944,16 +950,6 @@ static int __must_check ipmmu_iotlb_flus
     return 0;
 }
 
-static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
-                                          unsigned long page_count,
-                                          unsigned int flush_flags)
-{
-    ASSERT(flush_flags);
-
-    /* The hardware doesn't support selective TLB flush. */
-    return ipmmu_iotlb_flush_all(d);
-}
-
 static struct ipmmu_vmsa_domain *ipmmu_get_cache_domain(struct domain *d,
                                                         struct device *dev)
 {
@@ -1303,7 +1299,6 @@ static const struct iommu_ops ipmmu_iomm
     .hwdom_init      = ipmmu_iommu_hwdom_init,
     .teardown        = ipmmu_iommu_domain_teardown,
     .iotlb_flush     = ipmmu_iotlb_flush,
-    .iotlb_flush_all = ipmmu_iotlb_flush_all,
     .assign_device   = ipmmu_assign_device,
     .reassign_device = ipmmu_reassign_device,
     .map_page        = arm_iommu_map_page,
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2649,11 +2649,17 @@ static int force_stage = 2;
  */
 static u32 platform_features = ARM_SMMU_FEAT_COHERENT_WALK;
 
-static int __must_check arm_smmu_iotlb_flush_all(struct domain *d)
+static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
+					     unsigned long page_count,
+					     unsigned int flush_flags)
 {
 	struct arm_smmu_xen_domain *smmu_domain = dom_iommu(d)->arch.priv;
 	struct iommu_domain *cfg;
 
+	ASSERT(flush_flags);
+
+	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
+
 	spin_lock(&smmu_domain->lock);
 	list_for_each_entry(cfg, &smmu_domain->contexts, list) {
 		/*
@@ -2670,16 +2676,6 @@ static int __must_check arm_smmu_iotlb_f
 	return 0;
 }
 
-static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
-					     unsigned long page_count,
-					     unsigned int flush_flags)
-{
-	ASSERT(flush_flags);
-
-	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
-	return arm_smmu_iotlb_flush_all(d);
-}
-
 static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
 						struct device *dev)
 {
@@ -2879,7 +2875,6 @@ static const struct iommu_ops arm_smmu_i
     .add_device = arm_smmu_dt_add_device_generic,
     .teardown = arm_smmu_iommu_domain_teardown,
     .iotlb_flush = arm_smmu_iotlb_flush,
-    .iotlb_flush_all = arm_smmu_iotlb_flush_all,
     .assign_device = arm_smmu_assign_dev,
     .reassign_device = arm_smmu_reassign_dev,
     .map_page = arm_iommu_map_page,
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -3431,7 +3431,6 @@ static const struct iommu_ops arm_smmu_i
 	.hwdom_init		= arm_smmu_iommu_hwdom_init,
 	.teardown		= arm_smmu_iommu_xen_domain_teardown,
 	.iotlb_flush		= arm_smmu_iotlb_flush,
-	.iotlb_flush_all	= arm_smmu_iotlb_flush_all,
 	.assign_device		= arm_smmu_assign_dev,
 	.reassign_device	= arm_smmu_reassign_dev,
 	.map_page		= arm_iommu_map_page,
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -463,15 +463,12 @@ int iommu_iotlb_flush_all(struct domain
     const struct domain_iommu *hd = dom_iommu(d);
     int rc;
 
-    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush_all ||
+    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush ||
          !flush_flags )
         return 0;
 
-    /*
-     * The operation does a full flush so we don't need to pass the
-     * flush_flags in.
-     */
-    rc = iommu_call(hd->platform_ops, iotlb_flush_all, d);
+    rc = iommu_call(hd->platform_ops, iotlb_flush, d, INVALID_DFN, 0,
+                    flush_flags | IOMMU_FLUSHF_all);
     if ( unlikely(rc) )
     {
         if ( !d->is_shutting_down && printk_ratelimit() )
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -731,18 +731,21 @@ static int __must_check iommu_flush_iotl
                                                 unsigned long page_count,
                                                 unsigned int flush_flags)
 {
-    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
-    ASSERT(flush_flags);
+    if ( flush_flags & IOMMU_FLUSHF_all )
+    {
+        dfn = INVALID_DFN;
+        page_count = 0;
+    }
+    else
+    {
+        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
+        ASSERT(flush_flags);
+    }
 
     return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
                              page_count);
 }
 
-static int __must_check iommu_flush_iotlb_all(struct domain *d)
-{
-    return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
-}
-
 static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
 {
     if ( next_level > 1 )
@@ -2841,7 +2844,7 @@ static int __init intel_iommu_quarantine
     spin_unlock(&hd->arch.mapping_lock);
 
     if ( !rc )
-        rc = iommu_flush_iotlb_all(d);
+        rc = iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
 
     /* Pages may be leaked in failure case */
     return rc;
@@ -2874,7 +2877,6 @@ static struct iommu_ops __initdata vtd_o
     .resume = vtd_resume,
     .crash_shutdown = vtd_crash_shutdown,
     .iotlb_flush = iommu_flush_iotlb_pages,
-    .iotlb_flush_all = iommu_flush_iotlb_all,
     .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_page_tables = vtd_dump_page_tables,
 };
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -147,9 +147,11 @@ enum
 {
     _IOMMU_FLUSHF_added,
     _IOMMU_FLUSHF_modified,
+    _IOMMU_FLUSHF_all,
 };
 #define IOMMU_FLUSHF_added (1u << _IOMMU_FLUSHF_added)
 #define IOMMU_FLUSHF_modified (1u << _IOMMU_FLUSHF_modified)
+#define IOMMU_FLUSHF_all (1u << _IOMMU_FLUSHF_all)
 
 int __must_check iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
                            unsigned long page_count, unsigned int flags,
@@ -282,7 +284,6 @@ struct iommu_ops {
     int __must_check (*iotlb_flush)(struct domain *d, dfn_t dfn,
                                     unsigned long page_count,
                                     unsigned int flush_flags);
-    int __must_check (*iotlb_flush_all)(struct domain *d);
     int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_page_tables)(struct domain *d);
 



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 15/18] IOMMU/x86: prefill newly allocated page tables
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (13 preceding siblings ...)
  2021-09-24  9:53 ` [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one" Jan Beulich
@ 2021-09-24  9:54 ` Jan Beulich
  2021-12-13 15:51   ` Roger Pau Monné
                     ` (2 more replies)
  2021-09-24  9:55 ` [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in " Jan Beulich
                   ` (2 subsequent siblings)
  17 siblings, 3 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Kevin Tian, Roger Pau Monné

Page tables are used for two purposes after allocation: They either
start out all empty, or they get filled to replace a superpage.
Subsequently, to allow replacing all-empty or fully contiguous page
tables, contiguous sub-regions will be recorded within individual page
tables. Install the initial set of markers immediately after allocation.
Make sure to retain these markers when further populating a page table
in preparation for it to replace a superpage.

The markers are simply 4-bit fields holding the order value of
contiguous entries. To demonstrate this, if a page table had just 16
entries, this would be the initial (fully contiguous) set of markers:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, but also a
respective number of all non-present entries.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
An alternative to the ASSERT()s added to set_iommu_ptes_present() would
be to make the function less general-purpose; after all it's used in
only a single place (i.e. it might as well be folded into its only
caller).
---
v2: New.
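
The initial marker values can be reproduced with the following
stand-alone snippet (illustration only; 16 entries to match the table
above, while real page tables have 512 entries):

#include <stdio.h>

/*
 * Initial ("fully contiguous") marker: entry 0 holds the base-2 log of
 * the number of entries, every other entry the number of clear low bits
 * in its index.
 */
static unsigned int initial_marker(unsigned int idx, unsigned int nr_entries)
{
    return __builtin_ctz(idx ? idx : nr_entries);
}

int main(void)
{
    unsigned int i;

    for ( i = 0; i < 16; ++i )
        printf("index %X marker %u\n", i, initial_marker(i, 16));

    return 0;
}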

--- a/xen/drivers/passthrough/amd/iommu-defs.h
+++ b/xen/drivers/passthrough/amd/iommu-defs.h
@@ -445,6 +445,8 @@ union amd_iommu_x2apic_control {
 #define IOMMU_PAGE_TABLE_U32_PER_ENTRY	(IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
 #define IOMMU_PAGE_TABLE_ALIGNMENT	4096
 
+#define IOMMU_PTE_CONTIG_MASK           0x1e /* The ign0 field below. */
+
 union amd_iommu_pte {
     uint64_t raw;
     struct {
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -116,7 +116,19 @@ static void set_iommu_ptes_present(unsig
 
     while ( nr_ptes-- )
     {
-        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        ASSERT(!pde->next_level);
+        ASSERT(!pde->u);
+
+        if ( pde > table )
+            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
+        else
+            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
+
+        pde->iw = iw;
+        pde->ir = ir;
+        pde->fc = true; /* See set_iommu_pde_present(). */
+        pde->mfn = next_mfn;
+        pde->pr = true;
 
         ++pde;
         next_mfn += page_sz;
@@ -232,7 +244,7 @@ static int iommu_pde_from_dfn(struct dom
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = iommu_alloc_pgtable(d);
+            table = iommu_alloc_pgtable(d, IOMMU_PTE_CONTIG_MASK);
             if ( table == NULL )
             {
                 AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -262,7 +274,7 @@ static int iommu_pde_from_dfn(struct dom
 
             if ( next_table_mfn == 0 )
             {
-                table = iommu_alloc_pgtable(d);
+                table = iommu_alloc_pgtable(d, IOMMU_PTE_CONTIG_MASK);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -648,7 +660,7 @@ int __init amd_iommu_quarantine_init(str
 
     spin_lock(&hd->arch.mapping_lock);
 
-    hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+    hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);
     if ( !hd->arch.amd.root_table )
         goto out;
 
@@ -663,7 +675,7 @@ int __init amd_iommu_quarantine_init(str
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(d, 0);
         if ( !pg )
             break;
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -238,7 +238,7 @@ int amd_iommu_alloc_root(struct domain *
 
     if ( unlikely(!hd->arch.amd.root_table) )
     {
-        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+        hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);
         if ( !hd->arch.amd.root_table )
             return -ENOMEM;
     }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -297,7 +297,7 @@ static uint64_t addr_to_dma_page_maddr(s
             goto out;
 
         pte_maddr = level;
-        if ( !(pg = iommu_alloc_pgtable(domain)) )
+        if ( !(pg = iommu_alloc_pgtable(domain, 0)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
@@ -339,7 +339,7 @@ static uint64_t addr_to_dma_page_maddr(s
             }
 
             pte_maddr = level - 1;
-            pg = iommu_alloc_pgtable(domain);
+            pg = iommu_alloc_pgtable(domain, DMA_PTE_CONTIG_MASK);
             if ( !pg )
                 break;
 
@@ -351,12 +351,13 @@ static uint64_t addr_to_dma_page_maddr(s
                 struct dma_pte *split = map_vtd_domain_page(pte_maddr);
                 unsigned long inc = 1UL << level_to_offset_bits(level - 1);
 
-                split[0].val = pte->val;
+                split[0].val |= pte->val & ~DMA_PTE_CONTIG_MASK;
                 if ( inc == PAGE_SIZE )
                     split[0].val &= ~DMA_PTE_SP;
 
                 for ( offset = 1; offset < PTE_NUM; ++offset )
-                    split[offset].val = split[offset - 1].val + inc;
+                    split[offset].val |=
+                        (split[offset - 1].val & ~DMA_PTE_CONTIG_MASK) + inc;
 
                 iommu_sync_cache(split, PAGE_SIZE);
                 unmap_vtd_domain_page(split);
@@ -1943,7 +1944,7 @@ static int __must_check intel_iommu_map_
     if ( iommu_snoop )
         dma_set_pte_snp(new);
 
-    if ( old.val == new.val )
+    if ( !((old.val ^ new.val) & ~DMA_PTE_CONTIG_MASK) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         unmap_vtd_domain_page(page);
@@ -2798,7 +2799,7 @@ static int __init intel_iommu_quarantine
         goto out;
     }
 
-    pg = iommu_alloc_pgtable(d);
+    pg = iommu_alloc_pgtable(d, 0);
 
     rc = -ENOMEM;
     if ( !pg )
@@ -2817,7 +2818,7 @@ static int __init intel_iommu_quarantine
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(d, 0);
 
         if ( !pg )
             goto out;
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -265,6 +265,7 @@ struct dma_pte {
 #define DMA_PTE_PROT (DMA_PTE_READ | DMA_PTE_WRITE)
 #define DMA_PTE_SP   (1 << 7)
 #define DMA_PTE_SNP  (1 << 11)
+#define DMA_PTE_CONTIG_MASK  (0xfull << PADDR_BITS)
 #define dma_clear_pte(p)    do {(p).val = 0;} while(0)
 #define dma_set_pte_readable(p) do {(p).val |= DMA_PTE_READ;} while(0)
 #define dma_set_pte_writable(p) do {(p).val |= DMA_PTE_WRITE;} while(0)
@@ -278,7 +279,7 @@ struct dma_pte {
 #define dma_pte_write(p) (dma_pte_prot(p) & DMA_PTE_WRITE)
 #define dma_pte_addr(p) ((p).val & PADDR_MASK & PAGE_MASK_4K)
 #define dma_set_pte_addr(p, addr) do {\
-            (p).val |= ((addr) & PAGE_MASK_4K); } while (0)
+            (p).val |= ((addr) & PADDR_MASK & PAGE_MASK_4K); } while (0)
 #define dma_pte_present(p) (((p).val & DMA_PTE_PROT) != 0)
 #define dma_pte_superpage(p) (((p).val & DMA_PTE_SP) != 0)
 
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -433,12 +433,12 @@ int iommu_free_pgtables(struct domain *d
     return 0;
 }
 
-struct page_info *iommu_alloc_pgtable(struct domain *d)
+struct page_info *iommu_alloc_pgtable(struct domain *d, uint64_t contig_mask)
 {
     struct domain_iommu *hd = dom_iommu(d);
     unsigned int memflags = 0;
     struct page_info *pg;
-    void *p;
+    uint64_t *p;
 
 #ifdef CONFIG_NUMA
     if ( hd->node != NUMA_NO_NODE )
@@ -450,7 +450,28 @@ struct page_info *iommu_alloc_pgtable(st
         return NULL;
 
     p = __map_domain_page(pg);
-    clear_page(p);
+
+    if ( contig_mask )
+    {
+        unsigned int i, shift = find_first_set_bit(contig_mask);
+
+        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
+
+        p[0] = (PAGE_SHIFT - 3ull) << shift;
+        p[1] = 0;
+        p[2] = 1ull << shift;
+        p[3] = 0;
+
+        for ( i = 4; i < PAGE_SIZE / 8; i += 4 )
+        {
+            p[i + 0] = (find_first_set_bit(i) + 0ull) << shift;
+            p[i + 1] = 0;
+            p[i + 2] = 1ull << shift;
+            p[i + 3] = 0;
+        }
+    }
+    else
+        clear_page(p);
 
     if ( hd->platform_ops->sync_cache )
         iommu_vcall(hd->platform_ops, sync_cache, p, PAGE_SIZE);
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -142,7 +142,8 @@ int pi_update_irte(const struct pi_desc
 })
 
 int __must_check iommu_free_pgtables(struct domain *d);
-struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+struct page_info *__must_check iommu_alloc_pgtable(struct domain *d,
+                                                   uint64_t contig_mask);
 void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in page tables
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (14 preceding siblings ...)
  2021-09-24  9:54 ` [PATCH v2 15/18] IOMMU/x86: prefill newly allocated page tables Jan Beulich
@ 2021-09-24  9:55 ` Jan Beulich
  2021-12-15 13:57   ` Roger Pau Monné
  2021-09-24  9:55 ` [PATCH v2 17/18] AMD/IOMMU: free all-empty " Jan Beulich
  2021-09-24  9:56 ` [PATCH v2 18/18] VT-d: " Jan Beulich
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:55 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Wei Liu

This is a re-usable helper (kind of a template) which gets introduced
without users so that the individual subsequent patches introducing such
users can get committed independently of one another.

See the comment at the top of the new file. To demonstrate the effect,
if a page table had just 16 entries, this would be the set of markers
for a page table with fully contiguous mappings:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, but also a
respective number of all non-present entries.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
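
For illustration only, the intended usage pattern (mirroring what the
later AMD patch does, with vendor details reduced to comments; the
wrapper function is made up):

/* In the vendor source file, before including the helper: */
#define CONTIG_MASK IOMMU_PTE_CONTIG_MASK
#include <asm/contig-marker.h>

/* When clearing an entry (simplified from the AMD unmap path): */
static bool clear_entry(uint64_t *table, unsigned int idx, unsigned int level)
{
    /* ... write the zero PTE itself ... */

    /*
     * PTE_kind_null: the slot is now empty.  A true return value means
     * every entry in the table is (apart from the marker bits) clear,
     * i.e. the table can be replaced by a non-present entry one level
     * up and be queued for freeing.
     */
    return update_contig_markers(table, idx, level, PTE_kind_null);
}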

--- /dev/null
+++ b/xen/include/asm-x86/contig-marker.h
@@ -0,0 +1,105 @@
+#ifndef __ASM_X86_CONTIG_MARKER_H
+#define __ASM_X86_CONTIG_MARKER_H
+
+/*
+ * Short of having function templates in C, the function defined below is
+ * intended to be used by multiple parties interested in recording the
+ * degree of contiguity in mappings by a single page table.
+ *
+ * Scheme: Every entry records the order of contiguous successive entries,
+ * up to the maximum order covered by that entry (which is the number of
+ * clear low bits in its index, with entry 0 being the exception using
+ * the base-2 logarithm of the number of entries in a single page table).
+ * While a few entries need touching upon update, knowing whether the
+ * table is fully contiguous (and can hence be replaced by a higher level
+ * leaf entry) is then possible by simply looking at entry 0's marker.
+ *
+ * Prereqs:
+ * - CONTIG_MASK needs to be #define-d, to a value having at least 4
+ *   contiguous bits (ignored by hardware), before including this file,
+ * - page tables to be passed here need to be initialized with correct
+ *   markers.
+ */
+
+#include <xen/bitops.h>
+#include <xen/lib.h>
+#include <xen/page-size.h>
+
+/* This is the same for all anticipated users, so doesn't need passing in. */
+#define CONTIG_LEVEL_SHIFT 9
+#define CONTIG_NR          (1 << CONTIG_LEVEL_SHIFT)
+
+#define GET_MARKER(e) MASK_EXTR(e, CONTIG_MASK)
+#define SET_MARKER(e, m) \
+    ((void)(e = ((e) & ~CONTIG_MASK) | MASK_INSR(m, CONTIG_MASK)))
+
+enum PTE_kind {
+    PTE_kind_null,
+    PTE_kind_leaf,
+    PTE_kind_table,
+};
+
+static bool update_contig_markers(uint64_t *pt, unsigned int idx,
+                                  unsigned int level, enum PTE_kind kind)
+{
+    unsigned int b, i = idx;
+    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
+
+    ASSERT(idx < CONTIG_NR);
+    ASSERT(!(pt[idx] & CONTIG_MASK));
+
+    /* Step 1: Reduce markers in lower numbered entries. */
+    while ( i )
+    {
+        b = find_first_set_bit(i);
+        i &= ~(1U << b);
+        if ( GET_MARKER(pt[i]) > b )
+            SET_MARKER(pt[i], b);
+    }
+
+    /* An intermediate table is never contiguous with anything. */
+    if ( kind == PTE_kind_table )
+        return false;
+
+    /*
+     * Present entries need in sync index and address to be a candidate
+     * for being contiguous: What we're after is whether ultimately the
+     * intermediate table can be replaced by a superpage.
+     */
+    if ( kind != PTE_kind_null &&
+         idx != ((pt[idx] >> shift) & (CONTIG_NR - 1)) )
+        return false;
+
+    /* Step 2: Check higher numbered entries for contiguity. */
+    for ( b = 0; b < CONTIG_LEVEL_SHIFT && !(idx & (1U << b)); ++b )
+    {
+        i = idx | (1U << b);
+        if ( (kind == PTE_kind_leaf
+              ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))
+              : pt[i] & ~CONTIG_MASK) ||
+             GET_MARKER(pt[i]) != b )
+            break;
+    }
+
+    /* Step 3: Update markers in this and lower numbered entries. */
+    for ( ; SET_MARKER(pt[idx], b), b < CONTIG_LEVEL_SHIFT; ++b )
+    {
+        i = idx ^ (1U << b);
+        if ( (kind == PTE_kind_leaf
+              ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))
+              : pt[i] & ~CONTIG_MASK) ||
+             GET_MARKER(pt[i]) != b )
+            break;
+        idx &= ~(1U << b);
+    }
+
+    return b == CONTIG_LEVEL_SHIFT;
+}
+
+#undef SET_MARKER
+#undef GET_MARKER
+#undef CONTIG_NR
+#undef CONTIG_LEVEL_SHIFT
+#undef CONTIG_MASK
+
+#endif /* __ASM_X86_CONTIG_MARKER_H */



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 17/18] AMD/IOMMU: free all-empty page tables
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (15 preceding siblings ...)
  2021-09-24  9:55 ` [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in " Jan Beulich
@ 2021-09-24  9:55 ` Jan Beulich
  2021-12-15 15:14   ` Roger Pau Monné
  2021-09-24  9:56 ` [PATCH v2 18/18] VT-d: " Jan Beulich
  17 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:55 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant

When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that, while its output isn't used there yet, update_contig_markers()
already needs to be called in all places where entries get updated, not
just the one where entries get cleared.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -21,6 +21,9 @@
 
 #include "iommu.h"
 
+#define CONTIG_MASK IOMMU_PTE_CONTIG_MASK
+#include <asm/contig-marker.h>
+
 /* Given pfn and page table level, return pde index */
 static unsigned int pfn_to_pde_idx(unsigned long pfn, unsigned int level)
 {
@@ -33,16 +36,20 @@ static unsigned int pfn_to_pde_idx(unsig
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
                                                    unsigned long dfn,
-                                                   unsigned int level)
+                                                   unsigned int level,
+                                                   bool *free)
 {
     union amd_iommu_pte *table, *pte, old;
+    unsigned int idx = pfn_to_pde_idx(dfn, level);
 
     table = map_domain_page(_mfn(l1_mfn));
-    pte = &table[pfn_to_pde_idx(dfn, level)];
+    pte = &table[idx];
     old = *pte;
 
     write_atomic(&pte->raw, 0);
 
+    *free = update_contig_markers(&table->raw, idx, level, PTE_kind_null);
+
     unmap_domain_page(table);
 
     return old;
@@ -85,7 +92,11 @@ static union amd_iommu_pte set_iommu_pte
     if ( !old.pr || old.next_level ||
          old.mfn != next_mfn ||
          old.iw != iw || old.ir != ir )
+    {
         set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level), level,
+                              PTE_kind_leaf);
+    }
     else
         old.pr = false; /* signal "no change" to the caller */
 
@@ -259,6 +270,9 @@ static int iommu_pde_from_dfn(struct dom
             smp_wmb();
             set_iommu_pde_present(pde, next_table_mfn, next_level, true,
                                   true);
+            update_contig_markers(&next_table_vaddr->raw,
+                                  pfn_to_pde_idx(dfn, level),
+                                  level, PTE_kind_table);
 
             *flush_flags |= IOMMU_FLUSHF_modified;
         }
@@ -284,6 +298,9 @@ static int iommu_pde_from_dfn(struct dom
                 next_table_mfn = mfn_x(page_to_mfn(table));
                 set_iommu_pde_present(pde, next_table_mfn, next_level, true,
                                       true);
+                update_contig_markers(&next_table_vaddr->raw,
+                                      pfn_to_pde_idx(dfn, level),
+                                      level, PTE_kind_table);
             }
             else /* should never reach here */
             {
@@ -410,8 +427,25 @@ int amd_iommu_unmap_page(struct domain *
 
     if ( pt_mfn )
     {
+        bool free;
+        unsigned int pt_lvl = level;
+
         /* Mark PTE as 'page not present'. */
-        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
+
+        while ( unlikely(free) && ++pt_lvl < hd->arch.amd.paging_mode )
+        {
+            struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
+
+            if ( iommu_pde_from_dfn(d, dfn_x(dfn), pt_lvl, &pt_mfn,
+                                    flush_flags, false) )
+                BUG();
+            BUG_ON(!pt_mfn);
+
+            clear_iommu_pte_present(pt_mfn, dfn_x(dfn), pt_lvl, &free);
+            *flush_flags |= IOMMU_FLUSHF_all;
+            iommu_queue_free_pgtable(d, pg);
+        }
     }
 
     spin_unlock(&hd->arch.mapping_lock);



^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 18/18] VT-d: free all-empty page tables
  2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (16 preceding siblings ...)
  2021-09-24  9:55 ` [PATCH v2 17/18] AMD/IOMMU: free all-empty " Jan Beulich
@ 2021-09-24  9:56 ` Jan Beulich
  17 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-24  9:56 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Kevin Tian

When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that, while its output isn't used there yet, update_contig_markers()
needs to be called right away in all places where entries get updated, not
just the one where entries get cleared.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -42,6 +42,9 @@
 #include "vtd.h"
 #include "../ats.h"
 
+#define CONTIG_MASK DMA_PTE_CONTIG_MASK
+#include <asm/contig-marker.h>
+
 /* dom_io is used as a sentinel for quarantined devices */
 #define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.vtd.pgd_maddr)
 
@@ -368,6 +371,9 @@ static uint64_t addr_to_dma_page_maddr(s
 
             write_atomic(&pte->val, new_pte.val);
             iommu_sync_cache(pte, sizeof(struct dma_pte));
+            update_contig_markers(&parent->val,
+                                  address_level_offset(addr, level),
+                                  level, PTE_kind_table);
         }
 
         if ( --level == target )
@@ -773,7 +779,7 @@ static int dma_pte_clear_one(struct doma
     struct domain_iommu *hd = dom_iommu(domain);
     struct dma_pte *page = NULL, *pte = NULL, old;
     u64 pg_maddr;
-    unsigned int level = (order / LEVEL_STRIDE) + 1;
+    unsigned int level = (order / LEVEL_STRIDE) + 1, pt_lvl = level;
 
     spin_lock(&hd->arch.mapping_lock);
     /* get target level pte */
@@ -796,9 +802,31 @@ static int dma_pte_clear_one(struct doma
 
     old = *pte;
     dma_clear_pte(*pte);
+    iommu_sync_cache(pte, sizeof(*pte));
+
+    while ( update_contig_markers(&page->val,
+                                  address_level_offset(addr, pt_lvl),
+                                  pt_lvl, PTE_kind_null) &&
+            ++pt_lvl < agaw_to_level(hd->arch.vtd.agaw) )
+    {
+        struct page_info *pg = maddr_to_page(pg_maddr);
+
+        unmap_vtd_domain_page(page);
+
+        pg_maddr = addr_to_dma_page_maddr(domain, addr, pt_lvl, flush_flags,
+                                          false);
+        BUG_ON(pg_maddr < PAGE_SIZE);
+
+        page = map_vtd_domain_page(pg_maddr);
+        pte = &page[address_level_offset(addr, pt_lvl)];
+        dma_clear_pte(*pte);
+        iommu_sync_cache(pte, sizeof(*pte));
+
+        *flush_flags |= IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(domain, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
-    iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
 
@@ -1952,8 +1980,11 @@ static int __must_check intel_iommu_map_
     }
 
     *pte = new;
-
     iommu_sync_cache(pte, sizeof(struct dma_pte));
+    update_contig_markers(&page->val,
+                          address_level_offset(dfn_to_daddr(dfn), level),
+                          level, PTE_kind_leaf);
+
     spin_unlock(&hd->arch.mapping_lock);
     unmap_vtd_domain_page(page);
 



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks
  2021-09-24  9:41 ` [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks Jan Beulich
@ 2021-09-24 10:58   ` Roger Pau Monné
  2021-09-24 12:02     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-09-24 10:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Fri, Sep 24, 2021 at 11:41:14AM +0200, Jan Beulich wrote:
> In order to be able to insert/remove super-pages we need to allow
> callers of the walking function to specify at which point to stop the
> walk. (For now at least gcc will instantiate just a variant of the
> function with the parameter eliminated, so effectively no change to
> generated code as far as the parameter addition goes.)
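(So, if I understand the intention right, a caller wanting to install a 2M
mapping would stop the walk at level 2 - something like

    rc = iommu_pde_from_dfn(d, dfn_x(dfn), 2, &pt_mfn, true);

- while all existing users keep passing 1 as the target.)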
> 
> Instead of merely adjusting a BUG_ON() condition, convert it into an
> error return - there's no reason to crash the entire host in that case.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -178,7 +178,8 @@ void __init iommu_dte_add_device_entry(s
>   * page tables.
>   */
>  static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
> -                              unsigned long *pt_mfn, bool map)
> +                              unsigned int target, unsigned long *pt_mfn,
> +                              bool map)
>  {
>      union amd_iommu_pte *pde, *next_table_vaddr;
>      unsigned long  next_table_mfn;
> @@ -189,7 +190,8 @@ static int iommu_pde_from_dfn(struct dom
>      table = hd->arch.amd.root_table;
>      level = hd->arch.amd.paging_mode;
>  
> -    BUG_ON( table == NULL || level < 1 || level > 6 );
> +    if ( !table || target < 1 || level < target || level > 6 )
> +        return 1;

I would consider adding an ASSERT_UNREACHABLE here, since there should
be no callers passing those parameters, and we shouldn't be
introducing new ones. Unless you believe there could be valid callers
passing level < target.
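I.e. something along the lines of (sketch only):

    if ( !table || target < 1 || level < target || level > 6 )
    {
        ASSERT_UNREACHABLE();
        return 1;
    }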

>  
>      /*
>       * A frame number past what the current page tables can represent can't
> @@ -200,7 +202,7 @@ static int iommu_pde_from_dfn(struct dom
>  
>      next_table_mfn = mfn_x(page_to_mfn(table));
>  
> -    while ( level > 1 )
> +    while ( level > target )
>      {
>          unsigned int next_level = level - 1;

There's a comment at the bottom of iommu_pde_from_dfn that needs to be
adjusted to no longer explicitly mention level 1.

With that adjusted:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

FWIW, I always get confused with AMD and shadow code using level 1 to
denote the smallest page size level while Intel uses 0.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks
  2021-09-24 10:58   ` Roger Pau Monné
@ 2021-09-24 12:02     ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-24 12:02 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 24.09.2021 12:58, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:41:14AM +0200, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -178,7 +178,8 @@ void __init iommu_dte_add_device_entry(s
>>   * page tables.
>>   */
>>  static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
>> -                              unsigned long *pt_mfn, bool map)
>> +                              unsigned int target, unsigned long *pt_mfn,
>> +                              bool map)
>>  {
>>      union amd_iommu_pte *pde, *next_table_vaddr;
>>      unsigned long  next_table_mfn;
>> @@ -189,7 +190,8 @@ static int iommu_pde_from_dfn(struct dom
>>      table = hd->arch.amd.root_table;
>>      level = hd->arch.amd.paging_mode;
>>  
>> -    BUG_ON( table == NULL || level < 1 || level > 6 );
>> +    if ( !table || target < 1 || level < target || level > 6 )
>> +        return 1;
> 
> I would consider adding an ASSERT_UNREACHABLE here, since there should
> be no callers passing those parameters, and we shouldn't be
> introducing new ones. Unless you believe there could be valid callers
> passing level < target.

Ah yes - added.

>> @@ -200,7 +202,7 @@ static int iommu_pde_from_dfn(struct dom
>>  
>>      next_table_mfn = mfn_x(page_to_mfn(table));
>>  
>> -    while ( level > 1 )
>> +    while ( level > target )
>>      {
>>          unsigned int next_level = level - 1;
> 
> There's a comment at the bottom of iommu_pde_from_dfn that needs to be
> adjusted to no longer explicitly mention level 1.

Oh, thanks for noticing. I recall spotting that comment as in
need of updating before starting any of this work. And then I
forgot ...

> With that adjusted:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> FWIW, I always get confused with AMD and shadow code using level 1 to
> denote the smallest page size level while Intel uses 0.

Wait - with "Intel" you mean just EPT here, don't you? VT-d
code is using 1-based numbering again from all I can tell.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 02/18] VT-d: have callers specify the target level for page table walks
  2021-09-24  9:42 ` [PATCH v2 02/18] VT-d: " Jan Beulich
@ 2021-09-24 14:45   ` Roger Pau Monné
  2021-09-27  9:04     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-09-24 14:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Fri, Sep 24, 2021 at 11:42:13AM +0200, Jan Beulich wrote:
> In order to be able to insert/remove super-pages we need to allow
> callers of the walking function to specify at which point to stop the
> walk.
> 
> For intel_iommu_lookup_page() integrate the last level access into
> the main walking function.
> 
> dma_pte_clear_one() gets only partly adjusted for now: Error handling
> and order parameter get put in place, but the order parameter remains
> ignored (just like intel_iommu_map_page()'s order part of the flags).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I have to admit that I don't understand why domain_pgd_maddr() wants to
> populate all page table levels for DFN 0.

I think it would be enough to create up to the level requested by the
caller?

Seems like a lazy way to always assert that the level requested by the
caller would be present.

> 
> I was actually wondering whether it wouldn't make sense to integrate
> dma_pte_clear_one() into its only caller intel_iommu_unmap_page(), for
> better symmetry with intel_iommu_map_page().
> ---
> v2: Fix build.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -264,63 +264,116 @@ static u64 bus_to_context_maddr(struct v
>      return maddr;
>  }
>  
> -static u64 addr_to_dma_page_maddr(struct domain *domain, u64 addr, int alloc)
> +/*
> + * This function walks (and if requested allocates) page tables to the
> + * designated target level. It returns
> + * - 0 when a non-present entry was encountered and no allocation was
> + *   requested,
> + * - a small positive value (the level, i.e. below PAGE_SIZE) upon allocation
> + *   failure,
> + * - for target > 0 the address of the page table holding the leaf PTE for
                          ^ physical

I think it's clearer, as the return type could be ambiguous.

> + *   the requested address,
> + * - for target == 0 the full PTE.

Could this create confusion if for example one PTE maps physical page
0? A caller getting back a full PTE with address 0 and some of the low
bits set could interpret the result as an error.

I think we already had this discussion on other occasions, but I would
rather add a parameter to be used as a return placeholder (ie: a
*dma_pte maybe?) and use the function return value just for errors
because IMO it's clearer, but I know you don't usually like this
approach, so I'm not going to insist.
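Roughly what I have in mind (sketch only, with the extra parameter name
made up):

    static int addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
                                      unsigned int target,
                                      unsigned int *flush_flags, bool alloc,
                                      struct dma_pte *pte_out);

i.e. errors signalled purely via the return value, with the table address
or full PTE handed back through the last parameter.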

> + */
> +static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
> +                                       unsigned int target,
> +                                       unsigned int *flush_flags, bool alloc)
>  {
>      struct domain_iommu *hd = dom_iommu(domain);
>      int addr_width = agaw_to_width(hd->arch.vtd.agaw);
>      struct dma_pte *parent, *pte = NULL;
> -    int level = agaw_to_level(hd->arch.vtd.agaw);
> -    int offset;
> +    unsigned int level = agaw_to_level(hd->arch.vtd.agaw), offset;
>      u64 pte_maddr = 0;
>  
>      addr &= (((u64)1) << addr_width) - 1;
>      ASSERT(spin_is_locked(&hd->arch.mapping_lock));
> +    ASSERT(target || !alloc);

Might be better to use an if with ASSERT_UNREACHABLE and return an
error? (ie: level itself?)
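E.g. (sketch):

    if ( !target && alloc )
    {
        ASSERT_UNREACHABLE();
        return level;
    }

which would also fit the "small positive value upon failure" convention
described in the comment at the top of the function.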

> +
>      if ( !hd->arch.vtd.pgd_maddr )
>      {
>          struct page_info *pg;
>  
> -        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
> +        if ( !alloc )
> +            goto out;
> +
> +        pte_maddr = level;
> +        if ( !(pg = iommu_alloc_pgtable(domain)) )
>              goto out;
>  
>          hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
>      }
>  
> -    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
> -    while ( level > 1 )
> +    pte_maddr = hd->arch.vtd.pgd_maddr;
> +    parent = map_vtd_domain_page(pte_maddr);
> +    while ( level > target )
>      {
>          offset = address_level_offset(addr, level);
>          pte = &parent[offset];
>  
>          pte_maddr = dma_pte_addr(*pte);
> -        if ( !pte_maddr )
> +        if ( !dma_pte_present(*pte) || (level > 1 && dma_pte_superpage(*pte)) )
>          {
>              struct page_info *pg;
> +            /*
> +             * Higher level tables always set r/w, last level page table
> +             * controls read/write.
> +             */
> +            struct dma_pte new_pte = { DMA_PTE_PROT };
>  
>              if ( !alloc )
> -                break;
> +            {
> +                pte_maddr = 0;
> +                if ( !dma_pte_present(*pte) )
> +                    break;
> +
> +                /*
> +                 * When the leaf entry was requested, pass back the full PTE,
> +                 * with the address adjusted to account for the residual of
> +                 * the walk.
> +                 */
> +                pte_maddr = pte->val +

Wouldn't it be better to use dma_pte_addr(*pte) rather than accessing
pte->val, and then you could drop the PAGE_MASK?

Or is the addr parameter not guaranteed to be page aligned?

> +                    (addr & ((1UL << level_to_offset_bits(level)) - 1) &
> +                     PAGE_MASK);
> +                if ( !target )
> +                    break;

I'm confused by the conditional break here: why calculate pte_maddr
unconditionally, only for it to be overwritten just below if target != 0?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 02/18] VT-d: have callers specify the target level for page table walks
  2021-09-24 14:45   ` Roger Pau Monné
@ 2021-09-27  9:04     ` Jan Beulich
  2021-09-27  9:13       ` Jan Beulich
  2021-11-30 11:56       ` Roger Pau Monné
  0 siblings, 2 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-27  9:04 UTC (permalink / raw)
  To: Roger Pau Monné, Kevin Tian; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 24.09.2021 16:45, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:42:13AM +0200, Jan Beulich wrote:
>> In order to be able to insert/remove super-pages we need to allow
>> callers of the walking function to specify at which point to stop the
>> walk.
>>
>> For intel_iommu_lookup_page() integrate the last level access into
>> the main walking function.
>>
>> dma_pte_clear_one() gets only partly adjusted for now: Error handling
>> and order parameter get put in place, but the order parameter remains
>> ignored (just like intel_iommu_map_page()'s order part of the flags).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I have to admit that I don't understand why domain_pgd_maddr() wants to
>> populate all page table levels for DFN 0.
> 
> I think it would be enough to create up to the level requested by the
> caller?
> 
> Seems like a lazy way to always assert that the level requested by the
> caller would be present.

The caller doesn't request any level here. What the caller passes in
is the number of levels the respective IOMMU can deal with (varying
of which across all IOMMUs is somewhat funny anyway). Hence I _guess_
that it would really be sufficient to install as many levels as are
necessary for the loop at the end of the function to complete
successfully. Full depth population then would have happened only
because until here addr_to_dma_page_maddr() didn't have a way to
limit the number of levels. But then the comment is misleading. As
I'm merely guessing here, I'm still hoping for Kevin to have (and
share) some insight.

>> --- a/xen/drivers/passthrough/vtd/iommu.c
>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>> @@ -264,63 +264,116 @@ static u64 bus_to_context_maddr(struct v
>>      return maddr;
>>  }
>>  
>> -static u64 addr_to_dma_page_maddr(struct domain *domain, u64 addr, int alloc)
>> +/*
>> + * This function walks (and if requested allocates) page tables to the
>> + * designated target level. It returns
>> + * - 0 when a non-present entry was encountered and no allocation was
>> + *   requested,
>> + * - a small positive value (the level, i.e. below PAGE_SIZE) upon allocation
>> + *   failure,
>> + * - for target > 0 the address of the page table holding the leaf PTE for
>                           ^ physical
> 
> I think it's clearer, as the return type could be ambiguous.

Added.

>> + *   the requested address,
>> + * - for target == 0 the full PTE.
> 
> Could this create confusion if for example one PTE maps physical page
> 0? A caller getting back a full PTE with address 0 and some of the low
> bits set could interpret the result as an error.
> 
> I think we already had this discussion on other occasions, but I would
> rather add a parameter to be used as a return placeholder (ie: a
> *dma_pte maybe?) and use the function return value just for errors
> because IMO it's clearer, but I know you don't usually like this
> approach, so I'm not going to insist.

MFN 0 is never used for anything. This in particular includes it not
getting used as a page table.

>> +static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
>> +                                       unsigned int target,
>> +                                       unsigned int *flush_flags, bool alloc)
>>  {
>>      struct domain_iommu *hd = dom_iommu(domain);
>>      int addr_width = agaw_to_width(hd->arch.vtd.agaw);
>>      struct dma_pte *parent, *pte = NULL;
>> -    int level = agaw_to_level(hd->arch.vtd.agaw);
>> -    int offset;
>> +    unsigned int level = agaw_to_level(hd->arch.vtd.agaw), offset;
>>      u64 pte_maddr = 0;
>>  
>>      addr &= (((u64)1) << addr_width) - 1;
>>      ASSERT(spin_is_locked(&hd->arch.mapping_lock));
>> +    ASSERT(target || !alloc);
> 
> Might be better to use an if with ASSERT_UNREACHABLE and return an
> error? (ie: level itself?)

I did consider this, but decided against it because neither of the two
error indicators properly expresses that case. If you're concerned about
hitting the case in a release build, I'd rather switch to BUG_ON().

>> +
>>      if ( !hd->arch.vtd.pgd_maddr )
>>      {
>>          struct page_info *pg;
>>  
>> -        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
>> +        if ( !alloc )
>> +            goto out;
>> +
>> +        pte_maddr = level;
>> +        if ( !(pg = iommu_alloc_pgtable(domain)) )
>>              goto out;
>>  
>>          hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
>>      }
>>  
>> -    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
>> -    while ( level > 1 )
>> +    pte_maddr = hd->arch.vtd.pgd_maddr;
>> +    parent = map_vtd_domain_page(pte_maddr);
>> +    while ( level > target )
>>      {
>>          offset = address_level_offset(addr, level);
>>          pte = &parent[offset];
>>  
>>          pte_maddr = dma_pte_addr(*pte);
>> -        if ( !pte_maddr )
>> +        if ( !dma_pte_present(*pte) || (level > 1 && dma_pte_superpage(*pte)) )
>>          {
>>              struct page_info *pg;
>> +            /*
>> +             * Higher level tables always set r/w, last level page table
>> +             * controls read/write.
>> +             */
>> +            struct dma_pte new_pte = { DMA_PTE_PROT };
>>  
>>              if ( !alloc )
>> -                break;
>> +            {
>> +                pte_maddr = 0;
>> +                if ( !dma_pte_present(*pte) )
>> +                    break;
>> +
>> +                /*
>> +                 * When the leaf entry was requested, pass back the full PTE,
>> +                 * with the address adjusted to account for the residual of
>> +                 * the walk.
>> +                 */
>> +                pte_maddr = pte->val +
> 
> Wouldn't it be better to use dma_pte_addr(*pte) rather than accessing
> pte->val, and then you could drop the PAGE_MASK?
> 
> Or is the addr parameter not guaranteed to be page aligned?

addr is page aligned, but may not be superpage aligned. Yet that's not
the point here. As per the comment at the top of the function (and as
per the needs of intel_iommu_lookup_page()) we want to return a proper
(even if fake) PTE here, i.e. in particular including the access
control bits. Is "full" in the comment not sufficient to express this?
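To make it concrete (made-up example values): for a 2M superpage at level 2
whose address bits read 0x12200000, a lookup of addr = 0x12345000 has
level_to_offset_bits(2) == 21, so the residual added to pte->val is

    addr & ((1UL << 21) - 1) & PAGE_MASK  ==  0x145000

and the value passed back has its address portion reading 0x12345000, while
the low (access control) bits of the PTE are passed through untouched.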

>> +                    (addr & ((1UL << level_to_offset_bits(level)) - 1) &
>> +                     PAGE_MASK);
>> +                if ( !target )
>> +                    break;
> 
> I'm confused by the conditional break here: why calculate pte_maddr
> unconditionally, only for it to be overwritten just below if target != 0?

That's simply for style reasons - calculating ahead of the if() allows
for less indentation and hence a better flow of the overall expression.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 02/18] VT-d: have callers specify the target level for page table walks
  2021-09-27  9:04     ` Jan Beulich
@ 2021-09-27  9:13       ` Jan Beulich
  2021-11-30 11:56       ` Roger Pau Monné
  1 sibling, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-09-27  9:13 UTC (permalink / raw)
  To: Roger Pau Monné, Kevin Tian; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 27.09.2021 11:04, Jan Beulich wrote:
> On 24.09.2021 16:45, Roger Pau Monné wrote:
>> On Fri, Sep 24, 2021 at 11:42:13AM +0200, Jan Beulich wrote:
>>> In order to be able to insert/remove super-pages we need to allow
>>> callers of the walking function to specify at which point to stop the
>>> walk.
>>>
>>> For intel_iommu_lookup_page() integrate the last level access into
>>> the main walking function.
>>>
>>> dma_pte_clear_one() gets only partly adjusted for now: Error handling
>>> and order parameter get put in place, but the order parameter remains
>>> ignored (just like intel_iommu_map_page()'s order part of the flags).
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> I have to admit that I don't understand why domain_pgd_maddr() wants to
>>> populate all page table levels for DFN 0.
>>
>> I think it would be enough to create up to the level requested by the
>> caller?
>>
>> Seems like a lazy way to always assert that the level requested by the
>> caller would be present.
> 
> The caller doesn't request any level here. What the caller passes in
> is the number of levels the respective IOMMU can deal with (varying
> of which across all IOMMUs is somewhat funny anyway). Hence I _guess_
> that it would really be sufficient to install as many levels as are
> necessary for the loop at the end of the function to complete
> successfully. Full depth population then would have happened only
> because until here addr_to_dma_page_maddr() didn't have a way to
> limit the number of levels. But then the comment is misleading. As
> I'm merely guessing here, I'm still hoping for Kevin to have (and
> share) some insight.

I've extended this post-commit-message comment to:

I have to admit that I don't understand why domain_pgd_maddr() wants to
populate all page table levels for DFN 0. I _guess_ that despite the
comment there what is needed is really only population down to
nr_pt_levels (such that the loop at the end of the function would
succeed). Problem is that this gets done only upon first allocating the
root table, yet the loop may later get executed with a smaller
nr_pt_levels. IOW population would need to be done down to the smallest
of all possible iommu->nr_pt_levels. As per a comment in iommu_alloc()
this can be between 2 and 4, yet once again the code there isn't fully
in line with the comment, going all the way down to 0.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 02/18] VT-d: have callers specify the target level for page table walks
  2021-09-27  9:04     ` Jan Beulich
  2021-09-27  9:13       ` Jan Beulich
@ 2021-11-30 11:56       ` Roger Pau Monné
  2021-11-30 14:38         ` Jan Beulich
  1 sibling, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-11-30 11:56 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Kevin Tian, xen-devel, Andrew Cooper, Paul Durrant

On Mon, Sep 27, 2021 at 11:04:26AM +0200, Jan Beulich wrote:
> On 24.09.2021 16:45, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:42:13AM +0200, Jan Beulich wrote:
> >> -    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
> >> -    while ( level > 1 )
> >> +    pte_maddr = hd->arch.vtd.pgd_maddr;
> >> +    parent = map_vtd_domain_page(pte_maddr);
> >> +    while ( level > target )
> >>      {
> >>          offset = address_level_offset(addr, level);
> >>          pte = &parent[offset];
> >>  
> >>          pte_maddr = dma_pte_addr(*pte);
> >> -        if ( !pte_maddr )
> >> +        if ( !dma_pte_present(*pte) || (level > 1 && dma_pte_superpage(*pte)) )
> >>          {
> >>              struct page_info *pg;
> >> +            /*
> >> +             * Higher level tables always set r/w, last level page table
> >> +             * controls read/write.
> >> +             */
> >> +            struct dma_pte new_pte = { DMA_PTE_PROT };
> >>  
> >>              if ( !alloc )
> >> -                break;
> >> +            {
> >> +                pte_maddr = 0;
> >> +                if ( !dma_pte_present(*pte) )
> >> +                    break;
> >> +
> >> +                /*
> >> +                 * When the leaf entry was requested, pass back the full PTE,
> >> +                 * with the address adjusted to account for the residual of
> >> +                 * the walk.
> >> +                 */
> >> +                pte_maddr = pte->val +
> > 
> > Wouldn't it be better to use dma_pte_addr(*pte) rather than accessing
> > pte->val, and then you could drop the PAGE_MASK?
> > 
> > Or is the addr parameter not guaranteed to be page aligned?
> 
> addr is page aligned, but may not be superpage aligned. Yet that's not
> the point here. As per the comment at the top of the function (and as
> per the needs of intel_iommu_lookup_page()) we want to return a proper
> (even if fake) PTE here, i.e. in particular including the access
> control bits. Is "full" in the comment not sufficient to express this?

I see. I guess I got confused by the function name. It would be better
called addr_to_dma_pte?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes
  2021-09-24  9:43 ` [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes Jan Beulich
@ 2021-11-30 12:25   ` Roger Pau Monné
  2021-12-17 14:43   ` Julien Grall
  2021-12-21  9:26   ` Rahul Singh
  2 siblings, 0 replies; 100+ messages in thread
From: Roger Pau Monné @ 2021-11-30 12:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis,
	Rahul Singh

On Fri, Sep 24, 2021 at 11:43:57AM +0200, Jan Beulich wrote:
> Generic code will use this information to determine what order values
> can legitimately be passed to the ->{,un}map_page() hooks. For now all
> ops structures simply get to announce 4k mappings (as base page size),
> and there is (and always has been) an assumption that this matches the
> CPU's MMU base page size (eventually we will want to permit IOMMUs with
> a base page size smaller than the CPU MMU's).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks
  2021-09-24  9:44 ` [PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
@ 2021-11-30 13:49   ` Roger Pau Monné
  2021-11-30 14:45     ` Jan Beulich
  2021-12-17 14:42   ` Julien Grall
  1 sibling, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-11-30 13:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk

On Fri, Sep 24, 2021 at 11:44:50AM +0200, Jan Beulich wrote:
> Or really, in the case of ->map_page(), accommodate it in the existing
> "flags" parameter. All call sites will pass 0 for now.

It feels slightly weird from an interface PoV that the map handler
takes the order from the flags parameter, while the unmap one has a
function parameter for it.

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 02/18] VT-d: have callers specify the target level for page table walks
  2021-11-30 11:56       ` Roger Pau Monné
@ 2021-11-30 14:38         ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-11-30 14:38 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Kevin Tian, xen-devel, Andrew Cooper, Paul Durrant

On 30.11.2021 12:56, Roger Pau Monné wrote:
> On Mon, Sep 27, 2021 at 11:04:26AM +0200, Jan Beulich wrote:
>> On 24.09.2021 16:45, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:42:13AM +0200, Jan Beulich wrote:
>>>> -    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
>>>> -    while ( level > 1 )
>>>> +    pte_maddr = hd->arch.vtd.pgd_maddr;
>>>> +    parent = map_vtd_domain_page(pte_maddr);
>>>> +    while ( level > target )
>>>>      {
>>>>          offset = address_level_offset(addr, level);
>>>>          pte = &parent[offset];
>>>>  
>>>>          pte_maddr = dma_pte_addr(*pte);
>>>> -        if ( !pte_maddr )
>>>> +        if ( !dma_pte_present(*pte) || (level > 1 && dma_pte_superpage(*pte)) )
>>>>          {
>>>>              struct page_info *pg;
>>>> +            /*
>>>> +             * Higher level tables always set r/w, last level page table
>>>> +             * controls read/write.
>>>> +             */
>>>> +            struct dma_pte new_pte = { DMA_PTE_PROT };
>>>>  
>>>>              if ( !alloc )
>>>> -                break;
>>>> +            {
>>>> +                pte_maddr = 0;
>>>> +                if ( !dma_pte_present(*pte) )
>>>> +                    break;
>>>> +
>>>> +                /*
>>>> +                 * When the leaf entry was requested, pass back the full PTE,
>>>> +                 * with the address adjusted to account for the residual of
>>>> +                 * the walk.
>>>> +                 */
>>>> +                pte_maddr = pte->val +
>>>
>>> Wouldn't it be better to use dma_pte_addr(*pte) rather than accessing
>>> pte->val, and then you could drop the PAGE_MASK?
>>>
>>> Or is the addr parameter not guaranteed to be page aligned?
>>
>> addr is page aligned, but may not be superpage aligned. Yet that's not
>> the point here. As per the comment at the top of the function (and as
>> per the needs of intel_iommu_lookup_page()) we want to return a proper
>> (even if fake) PTE here, i.e. in particular including the access
>> control bits. Is "full" in the comment not sufficient to express this?
> 
> I see. I guess I got confused by the function name. It would be better
> called addr_to_dma_pte?

That wouldn't match its new purpose either. It can return an address
_or_ a full PTE, as per - as said - the comment being added at the
top of the function.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks
  2021-11-30 13:49   ` Roger Pau Monné
@ 2021-11-30 14:45     ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-11-30 14:45 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk

On 30.11.2021 14:49, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:44:50AM +0200, Jan Beulich wrote:
>> Or really, in the case of ->map_page(), accommodate it in the existing
>> "flags" parameter. All call sites will pass 0 for now.
> 
> It feels slightly weird from an interface PoV that the map handler
> takes the order from the flags parameter, while the unmap one has a
> function parameter for it.

Well, I wouldn't want to call the unmap parameter "flags" just for
consistency. If there ever is a flag to be passed, I guess that's
going to be the natural course of action.

>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 05/18] IOMMU: have iommu_{,un}map() split requests into largest possible chunks
  2021-09-24  9:45 ` [PATCH v2 05/18] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
@ 2021-11-30 15:24   ` Roger Pau Monné
  2021-12-02 15:59     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-11-30 15:24 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Fri, Sep 24, 2021 at 11:45:57AM +0200, Jan Beulich wrote:
> Introduce a helper function to determine the largest possible mapping
> that allows covering a request (or the next part of it that is left to
> be processed).
> 
> In order to not add yet more recurring dfn_add() / mfn_add() to the two
> callers of the new helper, also introduce local variables holding the
> values presently operated on.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -260,12 +260,38 @@ void iommu_domain_destroy(struct domain
>      arch_iommu_domain_destroy(d);
>  }
>  
> -int iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
> +static unsigned int mapping_order(const struct domain_iommu *hd,
> +                                  dfn_t dfn, mfn_t mfn, unsigned long nr)
> +{
> +    unsigned long res = dfn_x(dfn) | mfn_x(mfn);
> +    unsigned long sizes = hd->platform_ops->page_sizes;
> +    unsigned int bit = find_first_set_bit(sizes), order = 0;
> +
> +    ASSERT(bit == PAGE_SHIFT);
> +
> +    while ( (sizes = (sizes >> bit) & ~1) )
> +    {
> +        unsigned long mask;
> +
> +        bit = find_first_set_bit(sizes);
> +        mask = (1UL << bit) - 1;
> +        if ( nr <= mask || (res & mask) )
> +            break;
> +        order += bit;
> +        nr >>= bit;
> +        res >>= bit;
> +    }
> +
> +    return order;
> +}

This looks like it could be used in other places; I would at least
consider using it in pvh_populate_memory_range, where we also need to
figure out the maximum order given an address and a number of pages.

Do you think you could place it in a more generic file and also use
more generic parameters (ie: unsigned long gfn and mfn)?
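FWIW, my reading of the helper with made-up numbers, assuming page_sizes
advertises 4k/2M/1G (bits 12, 21 and 30):

    dfn = 0x80200, mfn = 0x240400, nr = 0x300
    res = 0x80200 | 0x240400 = 0x2c0600
    1st pass: bit = 9, mask = 0x1ff; nr > mask and (res & mask) == 0,
              so order = 9, nr >>= 9 (now 1), res >>= 9
    2nd pass: bit = 9 again (the 1G step), but nr <= mask, so stop

i.e. a single 2M mapping is issued for the first 512 pages, and the
caller's loop then recomputes the order for the remaining 0x100 pages.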

> +
> +int iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
>                unsigned long page_count, unsigned int flags,
>                unsigned int *flush_flags)
>  {
>      const struct domain_iommu *hd = dom_iommu(d);
>      unsigned long i;
> +    unsigned int order;
>      int rc = 0;
>  
>      if ( !is_iommu_enabled(d) )
> @@ -273,10 +299,16 @@ int iommu_map(struct domain *d, dfn_t df
>  
>      ASSERT(!IOMMUF_order(flags));
>  
> -    for ( i = 0; i < page_count; i++ )
> +    for ( i = 0; i < page_count; i += 1UL << order )
>      {
> -        rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
> -                        mfn_add(mfn, i), flags, flush_flags);
> +        dfn_t dfn = dfn_add(dfn0, i);
> +        mfn_t mfn = mfn_add(mfn0, i);
> +        unsigned long j;
> +
> +        order = mapping_order(hd, dfn, mfn, page_count - i);
> +
> +        rc = iommu_call(hd->platform_ops, map_page, d, dfn, mfn,
> +                        flags | IOMMUF_order(order), flush_flags);
>  
>          if ( likely(!rc) )
>              continue;
> @@ -284,14 +316,18 @@ int iommu_map(struct domain *d, dfn_t df
>          if ( !d->is_shutting_down && printk_ratelimit() )
>              printk(XENLOG_ERR
>                     "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" failed: %d\n",
> -                   d->domain_id, dfn_x(dfn_add(dfn, i)),
> -                   mfn_x(mfn_add(mfn, i)), rc);
> +                   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
> +
> +        for ( j = 0; j < i; j += 1UL << order )
> +        {
> +            dfn = dfn_add(dfn0, j);
> +            order = mapping_order(hd, dfn, _mfn(0), i - j);
>  
> -        while ( i-- )
>              /* if statement to satisfy __must_check */
> -            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
> -                            0, flush_flags) )
> +            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn, order,
> +                            flush_flags) )
>                  continue;
> +        }

Why do you need this unmap loop here? Can't you just use iommu_unmap?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-09-24  9:46 ` [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
@ 2021-12-01  9:09   ` Roger Pau Monné
  2021-12-01  9:27     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-01  9:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
> While already the case for PVH, there's no reason to treat PV
> differently here, though of course the addresses get taken from another
> source in this case. Except that, to match CPU side mappings, by default
> we permit r/o ones. This then also means we now deal consistently with
> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> [integrated] v1: Integrate into series.
> [standalone] v2: Keep IOMMU mappings in sync with CPU ones.
> 
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -253,12 +253,12 @@ void iommu_identity_map_teardown(struct
>      }
>  }
>  
> -static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
> -                                         unsigned long pfn,
> -                                         unsigned long max_pfn)
> +static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
> +                                                 unsigned long pfn,
> +                                                 unsigned long max_pfn)
>  {
>      mfn_t mfn = _mfn(pfn);
> -    unsigned int i, type;
> +    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
>  
>      /*
>       * Set up 1:1 mapping for dom0. Default to include only conventional RAM
> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>       * that fall in unusable ranges for PV Dom0.
>       */
>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
> -        return false;
> +        return 0;
>  
>      switch ( type = page_get_ram_type(mfn) )
>      {
>      case RAM_TYPE_UNUSABLE:
> -        return false;
> +        return 0;
>  
>      case RAM_TYPE_CONVENTIONAL:
>          if ( iommu_hwdom_strict )
> -            return false;
> +            return 0;
>          break;
>  
>      default:
>          if ( type & RAM_TYPE_RESERVED )
>          {
>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> -                return false;
> +                perms = 0;
>          }
> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
> -            return false;
> +        else if ( is_hvm_domain(d) )
> +            return 0;
> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
> +            perms = 0;

I'm confused about the reason to set perms = 0 instead of just
returning here. AFAICT perms won't be set to any other value below,
so you might as well just return 0.

>      }
>  
>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
> -        return false;
> +        return 0;
>      /* ... or the IO-APIC */
> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> -            return false;
> +    if ( has_vioapic(d) )
> +    {
> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> +                return 0;
> +    }
> +    else if ( is_pv_domain(d) )
> +    {
> +        /*
> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
> +         * ones there, so it should also have such established for IOMMUs.
> +         */
> +        for ( i = 0; i < nr_ioapics; i++ )
> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
> +                       ? IOMMUF_readable : 0;
> +    }

Note that the emulated vIO-APICs are mapped over the real ones (ie:
using the same base addresses), and hence both loops will end up using
the same regions. I would rather keep them separated anyway, just in
case we decide to somehow change the position of the emulated ones in
the future.

>      /*
>       * ... or the PCIe MCFG regions.
>       * TODO: runtime added MMCFG regions are not checked to make sure they
>       * don't overlap with already mapped regions, thus preventing trapping.
>       */
>      if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
> -        return false;
> +        return 0;
>  
> -    return true;
> +    return perms;
>  }
>  
>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
>      for ( ; i < top; i++ )
>      {
>          unsigned long pfn = pdx_to_pfn(i);
> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>          int rc;
>  
> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
> +        if ( !perms )
>              rc = 0;
>          else if ( paging_mode_translate(d) )
> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
> +            rc = set_identity_p2m_entry(d, pfn,
> +                                        perms & IOMMUF_writable ? p2m_access_rw
> +                                                                : p2m_access_r,
> +                                        0);
>          else
>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> +                           perms, &flush_flags);

You could just call set_identity_p2m_entry uniformly here. It will
DTRT for non-translated guests also, and then hwdom_iommu_map could
perhaps return a p2m_access_t?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-12-01  9:09   ` Roger Pau Monné
@ 2021-12-01  9:27     ` Jan Beulich
  2021-12-01 10:32       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-01  9:27 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 01.12.2021 10:09, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>       * that fall in unusable ranges for PV Dom0.
>>       */
>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>> -        return false;
>> +        return 0;
>>  
>>      switch ( type = page_get_ram_type(mfn) )
>>      {
>>      case RAM_TYPE_UNUSABLE:
>> -        return false;
>> +        return 0;
>>  
>>      case RAM_TYPE_CONVENTIONAL:
>>          if ( iommu_hwdom_strict )
>> -            return false;
>> +            return 0;
>>          break;
>>  
>>      default:
>>          if ( type & RAM_TYPE_RESERVED )
>>          {
>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>> -                return false;
>> +                perms = 0;
>>          }
>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>> -            return false;
>> +        else if ( is_hvm_domain(d) )
>> +            return 0;
>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>> +            perms = 0;
> 
> I'm confused about the reason to set perms = 0 instead of just
> returning here. AFAICT perms won't be set to any other value below,
> so you might as well just return 0.

This is so that ...

>>      }
>>  
>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>> -        return false;
>> +        return 0;
>>      /* ... or the IO-APIC */
>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>> -            return false;
>> +    if ( has_vioapic(d) )
>> +    {
>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>> +                return 0;
>> +    }
>> +    else if ( is_pv_domain(d) )
>> +    {
>> +        /*
>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>> +         * ones there, so it should also have such established for IOMMUs.
>> +         */
>> +        for ( i = 0; i < nr_ioapics; i++ )
>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>> +                       ? IOMMUF_readable : 0;
>> +    }

... this return, as per the comment, takes precedence over returning
zero.

> Note that the emulated vIO-APICs are mapped over the real ones (ie:
> using the same base addresses), and hence both loops will end up using
> the same regions. I would rather keep them separated anyway, just in
> case we decide to somehow change the position of the emulated ones in
> the future.

Yes - I don't think we should bake any such assumption into the code
here.

>> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
>>      for ( ; i < top; i++ )
>>      {
>>          unsigned long pfn = pdx_to_pfn(i);
>> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>          int rc;
>>  
>> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
>> +        if ( !perms )
>>              rc = 0;
>>          else if ( paging_mode_translate(d) )
>> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
>> +            rc = set_identity_p2m_entry(d, pfn,
>> +                                        perms & IOMMUF_writable ? p2m_access_rw
>> +                                                                : p2m_access_r,
>> +                                        0);
>>          else
>>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>> +                           perms, &flush_flags);
> 
> You could just call set_identity_p2m_entry uniformly here. It will
> DTRT for non-translated guests also, and then hwdom_iommu_map could
> perhaps return a p2m_access_t?

That's an orthogonal change imo, i.e. could be done as a prereq change,
but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
adjusting the code here (and vpci_make_msix_hole()) to call the
translated-only function.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-12-01  9:27     ` Jan Beulich
@ 2021-12-01 10:32       ` Roger Pau Monné
  2021-12-01 11:45         ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-01 10:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
> On 01.12.2021 10:09, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
> >> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
> >>       * that fall in unusable ranges for PV Dom0.
> >>       */
> >>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
> >> -        return false;
> >> +        return 0;
> >>  
> >>      switch ( type = page_get_ram_type(mfn) )
> >>      {
> >>      case RAM_TYPE_UNUSABLE:
> >> -        return false;
> >> +        return 0;
> >>  
> >>      case RAM_TYPE_CONVENTIONAL:
> >>          if ( iommu_hwdom_strict )
> >> -            return false;
> >> +            return 0;
> >>          break;
> >>  
> >>      default:
> >>          if ( type & RAM_TYPE_RESERVED )
> >>          {
> >>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> >> -                return false;
> >> +                perms = 0;
> >>          }
> >> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
> >> -            return false;
> >> +        else if ( is_hvm_domain(d) )
> >> +            return 0;
> >> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
> >> +            perms = 0;
> > 
> > I'm confused about the reason to set perms = 0 instead of just
> > returning here. AFAICT perms won't be set to any other value below,
> > so you might as well just return 0.
> 
> This is so that ...
> 
> >>      }
> >>  
> >>      /* Check that it doesn't overlap with the Interrupt Address Range. */
> >>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
> >> -        return false;
> >> +        return 0;
> >>      /* ... or the IO-APIC */
> >> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
> >> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >> -            return false;
> >> +    if ( has_vioapic(d) )
> >> +    {
> >> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
> >> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >> +                return 0;
> >> +    }
> >> +    else if ( is_pv_domain(d) )
> >> +    {
> >> +        /*
> >> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
> >> +         * ones there, so it should also have such established for IOMMUs.
> >> +         */
> >> +        for ( i = 0; i < nr_ioapics; i++ )
> >> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
> >> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
> >> +                       ? IOMMUF_readable : 0;
> >> +    }
> 
> ... this return, as per the comment, takes precedence over returning
> zero.

I see. This is because you want to map those in the IOMMU page tables
even if the IO-APIC ranges are outside of a reserved region.

I have to admit this is kind of weird, because the purpose of this
function is to add mappings for all memory below 4G, and/or for all
reserved regions.

I also wonder whether we should kind of generalize the handling of RO
regions in the IOMMU for PV dom0 by using mmio_ro_ranges instead? Ie:
we could loop around the RO ranges in PV dom0 build and map them.
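Something like the below is what I'm thinking of (rough sketch only; the
callback and context names are made up):

    struct ro_map_ctxt {
        struct domain *d;
        unsigned int *flush_flags;
    };

    static int __hwdom_init map_ro_range(unsigned long s, unsigned long e,
                                         void *arg)
    {
        const struct ro_map_ctxt *ctxt = arg;

        return iommu_map(ctxt->d, _dfn(s), _mfn(s), e - s + 1,
                         IOMMUF_readable, ctxt->flush_flags);
    }

with arch_iommu_hwdom_init() then doing

    struct ro_map_ctxt ctxt = { .d = d, .flush_flags = &flush_flags };

    rc = rangeset_report_ranges(mmio_ro_ranges, 0, ~0UL, map_ro_range,
                                &ctxt);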

FWIW MSI-X tables are also RO, but adding and removing those to the
IOMMU might be quite complex as we have to track the memory decoding
and MSI-X enable bits.

And we are likely missing a check for iomem_access_permitted in
hwdom_iommu_map?

> >> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>      for ( ; i < top; i++ )
> >>      {
> >>          unsigned long pfn = pdx_to_pfn(i);
> >> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> >>          int rc;
> >>  
> >> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
> >> +        if ( !perms )
> >>              rc = 0;
> >>          else if ( paging_mode_translate(d) )
> >> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
> >> +            rc = set_identity_p2m_entry(d, pfn,
> >> +                                        perms & IOMMUF_writable ? p2m_access_rw
> >> +                                                                : p2m_access_r,
> >> +                                        0);
> >>          else
> >>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> >> +                           perms, &flush_flags);
> > 
> > You could just call set_identity_p2m_entry uniformly here. It will
> > DTRT for non-translated guests also, and then hwdom_iommu_map could
> > perhaps return a p2m_access_t?
> 
> That's an orthogonal change imo, i.e. could be done as a prereq change,
> but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
> set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
> adjusting the code here 

I would rather adjust the code here to just call
set_identity_p2m_entry instead of differentiating between PV and
HVM.

> (and vpci_make_msix_hole()) to call the
> translated-only function.

This one does make sense, as vpci is strictly HVM only.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-12-01 10:32       ` Roger Pau Monné
@ 2021-12-01 11:45         ` Jan Beulich
  2021-12-02 15:12           ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-01 11:45 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 01.12.2021 11:32, Roger Pau Monné wrote:
> On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
>> On 01.12.2021 10:09, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>>       * that fall in unusable ranges for PV Dom0.
>>>>       */
>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>>> -        return false;
>>>> +        return 0;
>>>>  
>>>>      switch ( type = page_get_ram_type(mfn) )
>>>>      {
>>>>      case RAM_TYPE_UNUSABLE:
>>>> -        return false;
>>>> +        return 0;
>>>>  
>>>>      case RAM_TYPE_CONVENTIONAL:
>>>>          if ( iommu_hwdom_strict )
>>>> -            return false;
>>>> +            return 0;
>>>>          break;
>>>>  
>>>>      default:
>>>>          if ( type & RAM_TYPE_RESERVED )
>>>>          {
>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>>> -                return false;
>>>> +                perms = 0;
>>>>          }
>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>>> -            return false;
>>>> +        else if ( is_hvm_domain(d) )
>>>> +            return 0;
>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>>> +            perms = 0;
>>>
>>> I'm confused about the reason to set perms = 0 instead of just
>>> returning here. AFAICT perms won't be set to any other value below,
>>> so you might as well just return 0.
>>
>> This is so that ...
>>
>>>>      }
>>>>  
>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>>> -        return false;
>>>> +        return 0;
>>>>      /* ... or the IO-APIC */
>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>> -            return false;
>>>> +    if ( has_vioapic(d) )
>>>> +    {
>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>> +                return 0;
>>>> +    }
>>>> +    else if ( is_pv_domain(d) )
>>>> +    {
>>>> +        /*
>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>>> +         * ones there, so it should also have such established for IOMMUs.
>>>> +         */
>>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>>> +                       ? IOMMUF_readable : 0;
>>>> +    }
>>
>> ... this return, as per the comment, takes precedence over returning
>> zero.
> 
> I see. This is because you want to map those in the IOMMU page tables
> even if the IO-APIC ranges are outside of a reserved region.
> 
> I have to admit this is kind of weird, because the purpose of this
> function is to add mappings for all memory below 4G, and/or for all
> reserved regions.

Well, that was what it started out as. The purpose here is to be consistent
about IO-APICs: Either have them all mapped, or none of them. Since we map
them in the CPU page tables and since Andrew asked for the two mappings to
be consistent, this is the only way to satisfy the requests. Personally I'd
be okay with not mapping IO-APICs here (but then regardless of whether they
are covered by a reserved region).

> I also wonder whether we should kind of generalize the handling of RO
> regions in the IOMMU for PV dom0 by using mmio_ro_ranges instead? Ie:
> we could loop around the RO ranges in PV dom0 build and map them.

We shouldn't, for example because of ...

> FWIW MSI-X tables are also RO, but adding and removing those to the
> IOMMU might be quite complex as we have to track the memory decoding
> and MSI-X enable bits.

... these: Dom0 shouldn't have a need for mappings of these tables. It's
bad enough that we need to map them in the CPU page tables.

But yes, if the goal is to map stuff uniformly in CPU and IOMMU, then
what you suggest would look to be a reasonable approach.

> And we are likely missing a check for iomem_access_permitted in
> hwdom_iommu_map?

This would be a documentation-only check: The pages have permissions
removed when not in mmio_ro_ranges (see dom0_setup_permissions()).
IOW their presence there is an indication of permissions having been
granted.

>>>> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>      for ( ; i < top; i++ )
>>>>      {
>>>>          unsigned long pfn = pdx_to_pfn(i);
>>>> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>>>          int rc;
>>>>  
>>>> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
>>>> +        if ( !perms )
>>>>              rc = 0;
>>>>          else if ( paging_mode_translate(d) )
>>>> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
>>>> +            rc = set_identity_p2m_entry(d, pfn,
>>>> +                                        perms & IOMMUF_writable ? p2m_access_rw
>>>> +                                                                : p2m_access_r,
>>>> +                                        0);
>>>>          else
>>>>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>>>> +                           perms, &flush_flags);
>>>
>>> You could just call set_identity_p2m_entry uniformly here. It will
>>> DTRT for non-translated guests also, and then hwdom_iommu_map could
>>> perhaps return a p2m_access_t?
>>
>> That's an orthogonal change imo, i.e. could be done as a prereq change,
>> but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
>> set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
>> adjusting the code here 
> 
> I would rather adjust the code here to just call
> set_identity_p2m_entry instead of differentiating between PV and
> HVM.

I'm a little hesitant, in particular considering your suggestion to
then have hwdom_iommu_map() return p2m_access_t. Andrew has made quite
clear that the use of p2m_access_* here and in a number of other places
is actually an abuse.

Plus - forgot about this in my earlier reply - there would also be a
conflict with the next patch in this series, where larger orders will
get passed to iommu_map(), while set_identity_p2m_entry() has no
respective parameter yet (and imo it isn't urgent for it to gain one).

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches
  2021-09-24  9:47 ` [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
@ 2021-12-02 14:10   ` Roger Pau Monné
  2021-12-03 12:38     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-02 14:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
> For large page mappings to be easily usable (i.e. in particular without
> un-shattering of smaller page mappings) and for mapping operations to
> then also be more efficient, pass batches of Dom0 memory to iommu_map().
> In dom0_construct_pv() and its helpers (covering strict mode) this
> additionally requires establishing the type of those pages (albeit with
> zero type references).
> 
> The earlier establishing of PGT_writable_page | PGT_validated requires
> the existing places where this gets done (through get_page_and_type())
> to be updated: For pages which actually have a mapping, the type
> refcount needs to be 1.
> 
> There is actually a related bug that gets fixed here as a side effect:
> Typically the last L1 table would get marked as such only after
> get_page_and_type(..., PGT_writable_page). While this is fine as far as
> refcounting goes, the page did remain mapped in the IOMMU in this case
> (when "iommu=dom0-strict").
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Subsequently p2m_add_identity_entry() may want to also gain an order
> parameter, for arch_iommu_hwdom_init() to use. While this only affects
> non-RAM regions, systems typically have 2-16Mb of reserved space
> immediately below 4Gb, which hence could be mapped more efficiently.
> 
> The installing of zero-ref writable types has in fact shown (observed
> while putting together the change) that despite the intention by the
> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
> sufficiently ordinary pages (at the very least initrd and P2M ones as
> well as pages that are part of the initial allocation but not part of
> the initial mapping) still have been starting out as PGT_none, meaning
> that they would have gained IOMMU mappings only the first time these
> pages would get mapped writably.
> 
> I didn't think I need to address the bug mentioned in the description in
> a separate (prereq) patch, but if others disagree I could certainly
> break out that part (needing to first use iommu_legacy_unmap() then).
> 
> Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
> They'll end up mapped via the later get_page_and_type().
> 
> As to the way these refs get installed: I've chosen to avoid the more
> expensive {get,put}_page_and_type(), putting in place the intended type
> directly. I guess I could be convinced to avoid this bypassing of the
> actual logic; I merely think it's unnecessarily expensive.
> 
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -106,11 +106,26 @@ static __init void mark_pv_pt_pages_rdon
>      unmap_domain_page(pl3e);
>  }
>  
> +/*
> + * For IOMMU mappings done while building Dom0 the type of the pages needs to
> + * match (for _get_page_type() to unmap upon type change). Set the pages to
> + * writable with no type ref. NB: This is benign when !need_iommu_pt_sync(d).
> + */
> +static void __init make_pages_writable(struct page_info *page, unsigned long nr)
> +{
> +    for ( ; nr--; ++page )
> +    {
> +        ASSERT(!page->u.inuse.type_info);
> +        page->u.inuse.type_info = PGT_writable_page | PGT_validated;
> +    }
> +}
> +
>  static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
>                                      unsigned long v_start, unsigned long v_end,
>                                      unsigned long vphysmap_start,
>                                      unsigned long vphysmap_end,
> -                                    unsigned long nr_pages)
> +                                    unsigned long nr_pages,
> +                                    unsigned int *flush_flags)
>  {
>      struct page_info *page = NULL;
>      l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
> @@ -123,6 +138,8 @@ static __init void setup_pv_physmap(stru
>  
>      while ( vphysmap_start < vphysmap_end )
>      {
> +        int rc = 0;
> +
>          if ( domain_tot_pages(d) +
>               ((round_pgup(vphysmap_end) - vphysmap_start) >> PAGE_SHIFT) +
>               3 > nr_pages )
> @@ -176,7 +193,22 @@ static __init void setup_pv_physmap(stru
>                                               L3_PAGETABLE_SHIFT - PAGE_SHIFT,
>                                               MEMF_no_scrub)) != NULL )
>              {
> -                *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
> +                mfn_t mfn = page_to_mfn(page);
> +
> +                if ( need_iommu_pt_sync(d) )
> +                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn,
> +                                   SUPERPAGE_PAGES * SUPERPAGE_PAGES,
> +                                   IOMMUF_readable | IOMMUF_writable,
> +                                   flush_flags);
> +                if ( !rc )
> +                    make_pages_writable(page,
> +                                        SUPERPAGE_PAGES * SUPERPAGE_PAGES);
> +                else
> +                    printk(XENLOG_ERR
> +                           "pre-mapping P2M 1G-MFN %lx into IOMMU failed: %d\n",
> +                           mfn_x(mfn), rc);
> +
> +                *pl3e = l3e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
>                  vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
>                  continue;
>              }
> @@ -202,7 +234,20 @@ static __init void setup_pv_physmap(stru
>                                               L2_PAGETABLE_SHIFT - PAGE_SHIFT,
>                                               MEMF_no_scrub)) != NULL )
>              {
> -                *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
> +                mfn_t mfn = page_to_mfn(page);
> +
> +                if ( need_iommu_pt_sync(d) )
> +                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, SUPERPAGE_PAGES,
> +                                   IOMMUF_readable | IOMMUF_writable,
> +                                   flush_flags);
> +                if ( !rc )
> +                    make_pages_writable(page, SUPERPAGE_PAGES);
> +                else
> +                    printk(XENLOG_ERR
> +                           "pre-mapping P2M 2M-MFN %lx into IOMMU failed: %d\n",
> +                           mfn_x(mfn), rc);
> +
> +                *pl2e = l2e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
>                  vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
>                  continue;
>              }
> @@ -310,6 +355,7 @@ int __init dom0_construct_pv(struct doma
>      unsigned long initrd_pfn = -1, initrd_mfn = 0;
>      unsigned long count;
>      struct page_info *page = NULL;
> +    unsigned int flush_flags = 0;
>      start_info_t *si;
>      struct vcpu *v = d->vcpu[0];
>      void *image_base = bootstrap_map(image);
> @@ -572,6 +618,18 @@ int __init dom0_construct_pv(struct doma
>                      BUG();
>          }
>          initrd->mod_end = 0;
> +
> +        count = PFN_UP(initrd_len);
> +
> +        if ( need_iommu_pt_sync(d) )
> +            rc = iommu_map(d, _dfn(initrd_mfn), _mfn(initrd_mfn), count,
> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> +        if ( !rc )
> +            make_pages_writable(mfn_to_page(_mfn(initrd_mfn)), count);
> +        else
> +            printk(XENLOG_ERR
> +                   "pre-mapping initrd (MFN %lx) into IOMMU failed: %d\n",
> +                   initrd_mfn, rc);
>      }
>  
>      printk("PHYSICAL MEMORY ARRANGEMENT:\n"
> @@ -605,6 +663,22 @@ int __init dom0_construct_pv(struct doma
>  
>      process_pending_softirqs();
>  
> +    /*
> +     * We map the full range here and then punch a hole for page tables via
> +     * iommu_unmap() further down, once they have got marked as such.
> +     */
> +    if ( need_iommu_pt_sync(d) )
> +        rc = iommu_map(d, _dfn(alloc_spfn), _mfn(alloc_spfn),
> +                       alloc_epfn - alloc_spfn,
> +                       IOMMUF_readable | IOMMUF_writable, &flush_flags);
> +    if ( !rc )
> +        make_pages_writable(mfn_to_page(_mfn(alloc_spfn)),
> +                            alloc_epfn - alloc_spfn);
> +    else
> +        printk(XENLOG_ERR
> +               "pre-mapping MFNs [%lx,%lx) into IOMMU failed: %d\n",
> +               alloc_spfn, alloc_epfn, rc);
> +
>      mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
>      if ( vinitrd_start )
>          mpt_alloc -= PAGE_ALIGN(initrd_len);
> @@ -689,7 +763,8 @@ int __init dom0_construct_pv(struct doma
>          l1tab++;
>  
>          page = mfn_to_page(_mfn(mfn));
> -        if ( !page->u.inuse.type_info &&
> +        if ( (!page->u.inuse.type_info ||
> +              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&

Would it be clearer to get the page and type for all pages that have a 0
type count, i.e. !(type_info & PGT_count_mask)? Or would that interact
badly with page table pages?
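
I.e. something like this sketch of the alternative condition:

    /*
     * Take the ref whenever the type refcount is still zero, regardless
     * of which type bits may already be set.
     */
    if ( !(page->u.inuse.type_info & PGT_count_mask) &&
         !get_page_and_type(page, d, PGT_writable_page) )
        BUG();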

>               !get_page_and_type(page, d, PGT_writable_page) )
>              BUG();
>      }
> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
>      /* Pages that are part of page tables must be read only. */
>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
>  
> +    /*
> +     * This needs to come after all potentially excess
> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
> +     * few lines further up, where the effect of calling that function in an
> +     * earlier loop iteration may get overwritten by a later one.
> +     */
> +    if ( need_iommu_pt_sync(d) &&
> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
> +                     &flush_flags) )
> +        BUG();

Wouldn't such an unmap better happen as part of changing the types of the
pages that become part of the guest page tables?

>      /* Mask all upcalls... */
>      for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
>          shared_info(d, vcpu_info[i].evtchn_upcall_mask) = 1;
> @@ -793,7 +879,7 @@ int __init dom0_construct_pv(struct doma
>      {
>          pfn = pagetable_get_pfn(v->arch.guest_table);
>          setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
> -                         nr_pages);
> +                         nr_pages, &flush_flags);
>      }
>  
>      /* Write the phys->machine and machine->phys table entries. */
> @@ -824,7 +910,9 @@ int __init dom0_construct_pv(struct doma
>          if ( get_gpfn_from_mfn(mfn) >= count )
>          {
>              BUG_ON(compat);
> -            if ( !page->u.inuse.type_info &&
> +            if ( (!page->u.inuse.type_info ||
> +                  page->u.inuse.type_info == (PGT_writable_page |
> +                                              PGT_validated)) &&
>                   !get_page_and_type(page, d, PGT_writable_page) )
>                  BUG();
>  
> @@ -840,22 +928,41 @@ int __init dom0_construct_pv(struct doma
>  #endif
>      while ( pfn < nr_pages )
>      {
> -        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
> +        count = domain_tot_pages(d);
> +        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
>              panic("Not enough RAM for DOM0 reservation\n");
> +        mfn = mfn_x(page_to_mfn(page));
> +
> +        if ( need_iommu_pt_sync(d) )
> +        {
> +            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> +            if ( rc )
> +                printk(XENLOG_ERR
> +                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
> +                       mfn, pfn, rc);
> +        }
> +
>          while ( pfn < domain_tot_pages(d) )
>          {
> -            mfn = mfn_x(page_to_mfn(page));
> +            if ( !rc )
> +                make_pages_writable(page, 1);

There's quite a lot of repetition of the pattern: allocate, iommu_map,
set as writable. Would it be possible to abstract this into some
kind of helper?

I've realized some of the allocations use alloc_chunk while others use
alloc_domheap_pages, so it might require some work.
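
Something along these lines perhaps (name and exact shape purely
illustrative, using only what the patch already introduces):

    static void __init iommu_premap_writable(struct domain *d,
                                             struct page_info *pg,
                                             unsigned long nr,
                                             unsigned int *flush_flags)
    {
        mfn_t mfn = page_to_mfn(pg);
        int rc = 0;

        if ( need_iommu_pt_sync(d) )
            rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, nr,
                           IOMMUF_readable | IOMMUF_writable, flush_flags);

        if ( !rc )
            make_pages_writable(pg, nr);
        else
            printk(XENLOG_ERR
                   "pre-mapping MFN %lx into IOMMU failed: %d\n",
                   mfn_x(mfn), rc);
    }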

> +
>  #ifndef NDEBUG
>  #define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
>  #endif
>              dom0_update_physmap(compat, pfn, mfn, vphysmap_start);
>  #undef pfn
> -            page++; pfn++;
> +            page++; mfn++; pfn++;
>              if ( !(pfn & 0xfffff) )
>                  process_pending_softirqs();
>          }
>      }
>  
> +    /* Use while() to avoid compiler warning. */
> +    while ( iommu_iotlb_flush_all(d, flush_flags) )
> +        break;

Might be worth printing a message here in case of error?

> +
>      if ( initrd_len != 0 )
>      {
>          si->mod_start = vinitrd_start ?: initrd_pfn;
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -325,8 +325,8 @@ static unsigned int __hwdom_init hwdom_i
>  
>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>  {
> -    unsigned long i, top, max_pfn;
> -    unsigned int flush_flags = 0;
> +    unsigned long i, top, max_pfn, start, count;
> +    unsigned int flush_flags = 0, start_perms = 0;
>  
>      BUG_ON(!is_hardware_domain(d));
>  
> @@ -357,9 +357,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>       * setting up potentially conflicting mappings here.
>       */
> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>  
> -    for ( ; i < top; i++ )
> +    for ( i = start, count = 0; i < top; )
>      {
>          unsigned long pfn = pdx_to_pfn(i);
>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
>                                          perms & IOMMUF_writable ? p2m_access_rw
>                                                                  : p2m_access_r,
>                                          0);
> +        else if ( pfn != start + count || perms != start_perms )
> +        {
> +        commit:
> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
> +                           start_perms, &flush_flags);
> +            SWAP(start, pfn);
> +            start_perms = perms;
> +            count = 1;
> +        }
>          else
> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> -                           perms, &flush_flags);
> +        {
> +            ++count;
> +            rc = 0;
> +        }
>  
>          if ( rc )
>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);

Would be nice to print the count (or end pfn) in case it's a range.

While not something that you have to fix here, the logic is
becoming overly complicated IMO. It might be easier to just put all
the RAM and reserved regions (or everything < 4G) into a rangeset and
then punch holes in it for non-guest-mappable regions, and finally use
rangeset_consume_ranges to iterate and map those. That's likely faster
than having to iterate over all pfns on the system, and easier to
understand from a logic PoV.
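
Roughly (just a sketch of the idea; error handling is omitted, and the
final iteration step would use rangeset_consume_ranges() with a callback
doing the actual mapping, which I'm leaving out here):

    /* Build the set of frames to map, then punch holes into it. */
    struct rangeset *map = rangeset_new(NULL, "hwdom IOMMU map", 0);

    rangeset_add_range(map, 0, PFN_DOWN(GB(4)) - 1);  /* all of the low 4G */
    /* ... plus reserved regions above 4G, depending on the iommu= options */
    rangeset_remove_range(map, 0xfee00, 0xfeeff);     /* Interrupt Address Range */
    /* ... likewise remove the Xen ranges, IO-APICs, unusable RAM, etc. */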

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-12-01 11:45         ` Jan Beulich
@ 2021-12-02 15:12           ` Roger Pau Monné
  2021-12-02 15:28             ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-02 15:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Wed, Dec 01, 2021 at 12:45:12PM +0100, Jan Beulich wrote:
> On 01.12.2021 11:32, Roger Pau Monné wrote:
> > On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
> >> On 01.12.2021 10:09, Roger Pau Monné wrote:
> >>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
> >>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
> >>>>       * that fall in unusable ranges for PV Dom0.
> >>>>       */
> >>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
> >>>> -        return false;
> >>>> +        return 0;
> >>>>  
> >>>>      switch ( type = page_get_ram_type(mfn) )
> >>>>      {
> >>>>      case RAM_TYPE_UNUSABLE:
> >>>> -        return false;
> >>>> +        return 0;
> >>>>  
> >>>>      case RAM_TYPE_CONVENTIONAL:
> >>>>          if ( iommu_hwdom_strict )
> >>>> -            return false;
> >>>> +            return 0;
> >>>>          break;
> >>>>  
> >>>>      default:
> >>>>          if ( type & RAM_TYPE_RESERVED )
> >>>>          {
> >>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> >>>> -                return false;
> >>>> +                perms = 0;
> >>>>          }
> >>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
> >>>> -            return false;
> >>>> +        else if ( is_hvm_domain(d) )
> >>>> +            return 0;
> >>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
> >>>> +            perms = 0;
> >>>
> >>> I'm confused about the reason to set perms = 0 instead of just
> >>> returning here. AFAICT perms won't be set to any other value below,
> >>> so you might as well just return 0.
> >>
> >> This is so that ...
> >>
> >>>>      }
> >>>>  
> >>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
> >>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
> >>>> -        return false;
> >>>> +        return 0;
> >>>>      /* ... or the IO-APIC */
> >>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
> >>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >>>> -            return false;
> >>>> +    if ( has_vioapic(d) )
> >>>> +    {
> >>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
> >>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >>>> +                return 0;
> >>>> +    }
> >>>> +    else if ( is_pv_domain(d) )
> >>>> +    {
> >>>> +        /*
> >>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
> >>>> +         * ones there, so it should also have such established for IOMMUs.
> >>>> +         */
> >>>> +        for ( i = 0; i < nr_ioapics; i++ )
> >>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
> >>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
> >>>> +                       ? IOMMUF_readable : 0;
> >>>> +    }
> >>
> >> ... this return, as per the comment, takes precedence over returning
> >> zero.
> > 
> > I see. This is because you want to map those in the IOMMU page tables
> > even if the IO-APIC ranges are outside of a reserved region.
> > 
> > I have to admit this is kind of weird, because the purpose of this
> > function is to add mappings for all memory below 4G, and/or for all
> > reserved regions.
> 
> Well, that was what it started out as. The purpose here is to be consistent
> about IO-APICs: Either have them all mapped, or none of them. Since we map
> them in the CPU page tables and since Andrew asked for the two mappings to
> be consistent, this is the only way to satisfy the requests. Personally I'd
> be okay with not mapping IO-APICs here (but then regardless of whether they
> are covered by a reserved region).

I'm unsure of the best way to deal with this, it seems like both
the CPU and the IOMMU page tables would never be equal for PV dom0,
because we have no intention to map the MSI-X tables in RO mode in the
IOMMU page tables.

I'm not really opposed to having the IO-APIC mapped RO in the IOMMU
page tables, but I also don't see much benefit of doing it unless we
have a use case for it. The IO-APIC handling in PV is already
different from native, so I would be fine if we add a comment noting
that while the IO-APIC is mappable to the CPU page tables as RO it's
not present in the IOMMU page tables (and then adjust hwdom_iommu_map
to prevent its mapping).

I think we should also prevent mapping of the LAPIC, the HPET and the
HyperTransport range in case they fall into a reserved region?

TBH looks like we should be using iomem_access_permitted in
arch_iommu_hwdom_init? (not just for the IO-APIC, but for other device
ranges)

> >>>> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>>>      for ( ; i < top; i++ )
> >>>>      {
> >>>>          unsigned long pfn = pdx_to_pfn(i);
> >>>> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> >>>>          int rc;
> >>>>  
> >>>> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
> >>>> +        if ( !perms )
> >>>>              rc = 0;
> >>>>          else if ( paging_mode_translate(d) )
> >>>> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
> >>>> +            rc = set_identity_p2m_entry(d, pfn,
> >>>> +                                        perms & IOMMUF_writable ? p2m_access_rw
> >>>> +                                                                : p2m_access_r,
> >>>> +                                        0);
> >>>>          else
> >>>>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >>>> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> >>>> +                           perms, &flush_flags);
> >>>
> >>> You could just call set_identity_p2m_entry uniformly here. It will
> >>> DTRT for non-translated guests also, and then hwdom_iommu_map could
> >>> perhaps return a p2m_access_t?
> >>
> >> That's an orthogonal change imo, i.e. could be done as a prereq change,
> >> but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
> >> set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
> >> adjusting the code here 
> > 
> > I would rather adjust the code here to just call
> > set_identity_p2m_entry instead of differentiating between PV and
> > HVM.
> 
> I'm a little hesitant, in particular considering your suggestion to
> then have hwdom_iommu_map() return p2m_access_t. Andrew has made quite
> clear that the use of p2m_access_* here and in a number of other places
> is actually an abuse.
> 
> Plus - forgot about this in my earlier reply - there would also be a
> conflict with the next patch in this series, where larger orders will
> get passed to iommu_map(), while set_identity_p2m_entry() has no
> respective parameter yet (and imo it isn't urgent for it to gain one).

I've now seen that the iommu_map path is modified to handle ranges
instead of single pages. Long term we also want to expand this range
handling to the HVM branch.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-12-02 15:12           ` Roger Pau Monné
@ 2021-12-02 15:28             ` Jan Beulich
  2021-12-02 19:16               ` Andrew Cooper
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-02 15:28 UTC (permalink / raw)
  To: Andrew Cooper, Roger Pau Monné; +Cc: xen-devel, Paul Durrant

On 02.12.2021 16:12, Roger Pau Monné wrote:
> On Wed, Dec 01, 2021 at 12:45:12PM +0100, Jan Beulich wrote:
>> On 01.12.2021 11:32, Roger Pau Monné wrote:
>>> On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
>>>> On 01.12.2021 10:09, Roger Pau Monné wrote:
>>>>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>>>>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>>>>       * that fall in unusable ranges for PV Dom0.
>>>>>>       */
>>>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>  
>>>>>>      switch ( type = page_get_ram_type(mfn) )
>>>>>>      {
>>>>>>      case RAM_TYPE_UNUSABLE:
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>  
>>>>>>      case RAM_TYPE_CONVENTIONAL:
>>>>>>          if ( iommu_hwdom_strict )
>>>>>> -            return false;
>>>>>> +            return 0;
>>>>>>          break;
>>>>>>  
>>>>>>      default:
>>>>>>          if ( type & RAM_TYPE_RESERVED )
>>>>>>          {
>>>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>>>>> -                return false;
>>>>>> +                perms = 0;
>>>>>>          }
>>>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>> -            return false;
>>>>>> +        else if ( is_hvm_domain(d) )
>>>>>> +            return 0;
>>>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>> +            perms = 0;
>>>>>
>>>>> I'm confused about the reason to set perms = 0 instead of just
>>>>> returning here. AFAICT perms won't be set to any other value below,
>>>>> so you might as well just return 0.
>>>>
>>>> This is so that ...
>>>>
>>>>>>      }
>>>>>>  
>>>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>      /* ... or the IO-APIC */
>>>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>> -            return false;
>>>>>> +    if ( has_vioapic(d) )
>>>>>> +    {
>>>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>> +                return 0;
>>>>>> +    }
>>>>>> +    else if ( is_pv_domain(d) )
>>>>>> +    {
>>>>>> +        /*
>>>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>>>>> +         * ones there, so it should also have such established for IOMMUs.
>>>>>> +         */
>>>>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>>>>> +                       ? IOMMUF_readable : 0;
>>>>>> +    }
>>>>
>>>> ... this return, as per the comment, takes precedence over returning
>>>> zero.
>>>
>>> I see. This is because you want to map those in the IOMMU page tables
>>> even if the IO-APIC ranges are outside of a reserved region.
>>>
>>> I have to admit this is kind of weird, because the purpose of this
>>> function is to add mappings for all memory below 4G, and/or for all
>>> reserved regions.
>>
>> Well, that was what it started out as. The purpose here is to be consistent
>> about IO-APICs: Either have them all mapped, or none of them. Since we map
>> them in the CPU page tables and since Andrew asked for the two mappings to
>> be consistent, this is the only way to satisfy the requests. Personally I'd
>> be okay with not mapping IO-APICs here (but then regardless of whether they
>> are covered by a reserved region).
> 
> I'm unsure of the best way to deal with this, it seems like both
> the CPU and the IOMMU page tables would never be equal for PV dom0,
> because we have no intention to map the MSI-X tables in RO mode in the
> IOMMU page tables.
> 
> I'm not really opposed to having the IO-APIC mapped RO in the IOMMU
> page tables, but I also don't see much benefit of doing it unless we
> have a use case for it. The IO-APIC handling in PV is already
> different from native, so I would be fine if we add a comment noting
> that while the IO-APIC is mappable to the CPU page tables as RO it's
> not present in the IOMMU page tables (and then adjust hwdom_iommu_map
> to prevent its mapping).

Andrew, you did request both mappings to get in sync - thoughts?

> I think we should also prevent mapping of the LAPIC, the HPET and the
> HyperTransport range in case they fall into a reserved region?

Probably.

> TBH looks like we should be using iomem_access_permitted in
> arch_iommu_hwdom_init? (not just for the IO-APIC, but for other device
> ranges)

In general - perhaps. Not sure though whether to switch to doing so
right here.

>>>>>> @@ -346,15 +362,19 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>>>      for ( ; i < top; i++ )
>>>>>>      {
>>>>>>          unsigned long pfn = pdx_to_pfn(i);
>>>>>> +        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>>>>>          int rc;
>>>>>>  
>>>>>> -        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
>>>>>> +        if ( !perms )
>>>>>>              rc = 0;
>>>>>>          else if ( paging_mode_translate(d) )
>>>>>> -            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
>>>>>> +            rc = set_identity_p2m_entry(d, pfn,
>>>>>> +                                        perms & IOMMUF_writable ? p2m_access_rw
>>>>>> +                                                                : p2m_access_r,
>>>>>> +                                        0);
>>>>>>          else
>>>>>>              rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>>>> -                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>>>>>> +                           perms, &flush_flags);
>>>>>
>>>>> You could just call set_identity_p2m_entry uniformly here. It will
>>>>> DTRT for non-translated guests also, and then hwdom_iommu_map could
>>>>> perhaps return a p2m_access_t?
>>>>
>>>> That's an orthogonal change imo, i.e. could be done as a prereq change,
>>>> but I'd prefer to leave it as is for now. Furthermore see "x86/mm: split
>>>> set_identity_p2m_entry() into PV and HVM parts": In v2 I'm now also
>>>> adjusting the code here 
>>>
>>> I would rather adjust the code here to just call
>>> set_identity_p2m_entry instead of differentiating between PV and
>>> HVM.
>>
>> I'm a little hesitant, in particular considering your suggestion to
>> then have hwdom_iommu_map() return p2m_access_t. Andrew has made quite
>> clear that the use of p2m_access_* here and in a number of other places
>> is actually an abuse.
>>
>> Plus - forgot about this in my earlier reply - there would also be a
>> conflict with the next patch in this series, where larger orders will
>> get passed to iommu_map(), while set_identity_p2m_entry() has no
>> respective parameter yet (and imo it isn't urgent for it to gain one).
> 
> I've now seen that the iommu_map path is modified to handle ranges
> instead of single pages. Long term we also want to expand this range
> handling to the HVM branch.

Long (or maybe better mid) term, yes.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 05/18] IOMMU: have iommu_{,un}map() split requests into largest possible chunks
  2021-11-30 15:24   ` Roger Pau Monné
@ 2021-12-02 15:59     ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-02 15:59 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 30.11.2021 16:24, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:45:57AM +0200, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/iommu.c
>> +++ b/xen/drivers/passthrough/iommu.c
>> @@ -260,12 +260,38 @@ void iommu_domain_destroy(struct domain
>>      arch_iommu_domain_destroy(d);
>>  }
>>  
>> -int iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
>> +static unsigned int mapping_order(const struct domain_iommu *hd,
>> +                                  dfn_t dfn, mfn_t mfn, unsigned long nr)
>> +{
>> +    unsigned long res = dfn_x(dfn) | mfn_x(mfn);
>> +    unsigned long sizes = hd->platform_ops->page_sizes;
>> +    unsigned int bit = find_first_set_bit(sizes), order = 0;
>> +
>> +    ASSERT(bit == PAGE_SHIFT);
>> +
>> +    while ( (sizes = (sizes >> bit) & ~1) )
>> +    {
>> +        unsigned long mask;
>> +
>> +        bit = find_first_set_bit(sizes);
>> +        mask = (1UL << bit) - 1;
>> +        if ( nr <= mask || (res & mask) )
>> +            break;
>> +        order += bit;
>> +        nr >>= bit;
>> +        res >>= bit;
>> +    }
>> +
>> +    return order;
>> +}
> 
> This looks like it could be used in other places, I would at least
> consider using it in pvh_populate_memory_range where we also need to
> figure out the maximum order given an address and a number of pages.
> 
> Do you think you could place it in a more generic file and also use
> more generic parameters (ie: unsigned long gfn and mfn)?

The function as is surely isn't reusable, for its use of IOMMU
specific data. If and when a 2nd user appears, it'll be far clearer
whether and if so how much of it is worth generalizing. (Among other
things I'd like to retain the typesafe parameter types here.)
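
(For completeness, the generalization suggested would presumably boil down
to something like the sketch below - raw frame numbers instead of the
typesafe dfn_t/mfn_t, with the caller passing in the supported-size mask.)

    static unsigned int order_for(unsigned long sizes, unsigned long frm1,
                                  unsigned long frm2, unsigned long nr)
    {
        unsigned long res = frm1 | frm2;
        unsigned int bit = find_first_set_bit(sizes), order = 0;

        while ( (sizes = (sizes >> bit) & ~1) )
        {
            unsigned long mask;

            bit = find_first_set_bit(sizes);
            mask = (1UL << bit) - 1;
            if ( nr <= mask || (res & mask) )
                break;
            order += bit;
            nr >>= bit;
            res >>= bit;
        }

        return order;
    }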

>> @@ -284,14 +316,18 @@ int iommu_map(struct domain *d, dfn_t df
>>          if ( !d->is_shutting_down && printk_ratelimit() )
>>              printk(XENLOG_ERR
>>                     "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" failed: %d\n",
>> -                   d->domain_id, dfn_x(dfn_add(dfn, i)),
>> -                   mfn_x(mfn_add(mfn, i)), rc);
>> +                   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
>> +
>> +        for ( j = 0; j < i; j += 1UL << order )
>> +        {
>> +            dfn = dfn_add(dfn0, j);
>> +            order = mapping_order(hd, dfn, _mfn(0), i - j);
>>  
>> -        while ( i-- )
>>              /* if statement to satisfy __must_check */
>> -            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
>> -                            0, flush_flags) )
>> +            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn, order,
>> +                            flush_flags) )
>>                  continue;
>> +        }
> 
> Why you need this unmap loop here, can't you just use iommu_unmap?

Good question - I merely converted the loop that was already there.
Looks like that could have been changed to a simple call already
before. I'll change it here, on the assumption that splitting this
out isn't going to be a worthwhile exercise.
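
I.e. (a sketch, on the assumption that iommu_unmap() copes fine with the
partially mapped range):

    if ( i )
        /* Tear down what was mapped so far; while() satisfies __must_check. */
        while ( iommu_unmap(d, dfn0, i, flush_flags) )
            break;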

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables
  2021-09-24  9:48 ` [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables Jan Beulich
@ 2021-12-02 16:03   ` Roger Pau Monné
  2021-12-02 16:10     ` Jan Beulich
  2021-12-10 13:51   ` Roger Pau Monné
  1 sibling, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-02 16:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, Sep 24, 2021 at 11:48:21AM +0200, Jan Beulich wrote:
> For vendor specific code to support superpages we need to be able to
> deal with a superpage mapping replacing an intermediate page table (or
> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
> needed to free individual page tables while a domain is still alive.
> Since the freeing needs to be deferred until after a suitable IOTLB
> flush was performed, released page tables get queued for processing by a
> tasklet.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I was considering whether to use a softirq-tasklet instead. This would
> have the benefit of avoiding extra scheduling operations, but come with
> the risk of the freeing happening prematurely because of a
> process_pending_softirqs() somewhere.

Another approach that comes to mind (maybe you already thought of it
and discarded it) would be to perform the freeing after the flush in
iommu_iotlb_flush{_all} while keeping the per-pCPU lists.

That would IMO seem better from a safety PoV, as we know that the
flush has been performed when the pages are freed, and would avoid the
switch to the idle domain in order to do the freeing.
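
Something along the lines of the sketch below (the per-pCPU list and the
draining helper are made-up names, just to illustrate the idea, and the
exact hook invocation is whatever iommu_iotlb_flush_all() already does):

    /* In iommu_iotlb_flush_all(), once the hardware flush has completed: */
    rc = iommu_call(hd->platform_ops, iotlb_flush_all, d, flush_flags);
    if ( !rc )
        free_queued_pgtables(&this_cpu(free_pgt_list)); /* hypothetical */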

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables
  2021-12-02 16:03   ` Roger Pau Monné
@ 2021-12-02 16:10     ` Jan Beulich
  2021-12-03  8:30       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-02 16:10 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 02.12.2021 17:03, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:48:21AM +0200, Jan Beulich wrote:
>> For vendor specific code to support superpages we need to be able to
>> deal with a superpage mapping replacing an intermediate page table (or
>> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
>> needed to free individual page tables while a domain is still alive.
>> Since the freeing needs to be deferred until after a suitable IOTLB
>> flush was performed, released page tables get queued for processing by a
>> tasklet.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I was considering whether to use a softirq-tasklet instead. This would
>> have the benefit of avoiding extra scheduling operations, but come with
>> the risk of the freeing happening prematurely because of a
>> process_pending_softirqs() somewhere.
> 
> Another approach that comes to mind (maybe you already thought of it
> and discarded it) would be to perform the freeing after the flush in
> iommu_iotlb_flush{_all} while keeping the per-pCPU lists.

This was my initial plan, but I couldn't convince myself that the first
flush to happen would be _the_ one associated with the to-be-freed
page tables. ISTR (vaguely though) actually having found an example to
the contrary.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 09/18] AMD/IOMMU: drop stray TLB flush
  2021-09-24  9:48 ` [PATCH v2 09/18] AMD/IOMMU: drop stray TLB flush Jan Beulich
@ 2021-12-02 16:16   ` Roger Pau Monné
  0 siblings, 0 replies; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-02 16:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Fri, Sep 24, 2021 at 11:48:57AM +0200, Jan Beulich wrote:
> I think this flush was overlooked when flushing was moved out of the
> core (un)mapping functions. The flush the caller is required to invoke
> anyway will satisfy the needs resulting from the splitting of a
> superpage.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-12-02 15:28             ` Jan Beulich
@ 2021-12-02 19:16               ` Andrew Cooper
  2021-12-03  6:41                 ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Andrew Cooper @ 2021-12-02 19:16 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, Roger Pau Monné; +Cc: xen-devel, Paul Durrant

On 02/12/2021 15:28, Jan Beulich wrote:
> On 02.12.2021 16:12, Roger Pau Monné wrote:
>> On Wed, Dec 01, 2021 at 12:45:12PM +0100, Jan Beulich wrote:
>>> On 01.12.2021 11:32, Roger Pau Monné wrote:
>>>> On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
>>>>> On 01.12.2021 10:09, Roger Pau Monné wrote:
>>>>>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>>>>>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>>>>>       * that fall in unusable ranges for PV Dom0.
>>>>>>>       */
>>>>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>>>>>> -        return false;
>>>>>>> +        return 0;
>>>>>>>  
>>>>>>>      switch ( type = page_get_ram_type(mfn) )
>>>>>>>      {
>>>>>>>      case RAM_TYPE_UNUSABLE:
>>>>>>> -        return false;
>>>>>>> +        return 0;
>>>>>>>  
>>>>>>>      case RAM_TYPE_CONVENTIONAL:
>>>>>>>          if ( iommu_hwdom_strict )
>>>>>>> -            return false;
>>>>>>> +            return 0;
>>>>>>>          break;
>>>>>>>  
>>>>>>>      default:
>>>>>>>          if ( type & RAM_TYPE_RESERVED )
>>>>>>>          {
>>>>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>>>>>> -                return false;
>>>>>>> +                perms = 0;
>>>>>>>          }
>>>>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>>> -            return false;
>>>>>>> +        else if ( is_hvm_domain(d) )
>>>>>>> +            return 0;
>>>>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>>> +            perms = 0;
>>>>>> I'm confused about the reason to set perms = 0 instead of just
>>>>>> returning here. AFAICT perms won't be set to any other value below,
>>>>>> so you might as well just return 0.
>>>>> This is so that ...
>>>>>
>>>>>>>      }
>>>>>>>  
>>>>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>>>>>> -        return false;
>>>>>>> +        return 0;
>>>>>>>      /* ... or the IO-APIC */
>>>>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>>>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>>> -            return false;
>>>>>>> +    if ( has_vioapic(d) )
>>>>>>> +    {
>>>>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>>>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>>> +                return 0;
>>>>>>> +    }
>>>>>>> +    else if ( is_pv_domain(d) )
>>>>>>> +    {
>>>>>>> +        /*
>>>>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>>>>>> +         * ones there, so it should also have such established for IOMMUs.
>>>>>>> +         */
>>>>>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>>>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>>>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>>>>>> +                       ? IOMMUF_readable : 0;
>>>>>>> +    }
>>>>> ... this return, as per the comment, takes precedence over returning
>>>>> zero.
>>>> I see. This is because you want to map those in the IOMMU page tables
>>>> even if the IO-APIC ranges are outside of a reserved region.
>>>>
>>>> I have to admit this is kind of weird, because the purpose of this
>>>> function is to add mappings for all memory below 4G, and/or for all
>>>> reserved regions.
>>> Well, that was what it started out as. The purpose here is to be consistent
>>> about IO-APICs: Either have them all mapped, or none of them. Since we map
>>> them in the CPU page tables and since Andrew asked for the two mappings to
>>> be consistent, this is the only way to satisfy the requests. Personally I'd
>>> be okay with not mapping IO-APICs here (but then regardless of whether they
>>> are covered by a reserved region).
>> I'm unsure of the best way to deal with this, it seems like both
>> the CPU and the IOMMU page tables would never be equal for PV dom0,
>> because we have no intention to map the MSI-X tables in RO mode in the
>> IOMMU page tables.
>>
>> I'm not really opposed to having the IO-APIC mapped RO in the IOMMU
>> page tables, but I also don't see much benefit of doing it unless we
>> have a use case for it. The IO-APIC handling in PV is already
>> different from native, so I would be fine if we add a comment noting
>> that while the IO-APIC is mappable to the CPU page tables as RO it's
>> not present in the IOMMU page tables (and then adjust hwdom_iommu_map
>> to prevent its mapping).
> Andrew, you did request both mappings to get in sync - thoughts?

Let's step back to first principles.

On real hardware, there is no such thing as read-only-ness of the
physical address space.  Anything like that is a device which accepts
and discards writes.

It's not clear what a real hardware platform would do in this scenario,
but from reading some of the platform docs, I suspect the System Address
Decoder would provide a symmetric view of the hardware address space,
but this doesn't mean that UBOX would tolerate memory accesses uniformly
from all sources.  Also, there's nothing to say that all platforms
behave the same.


For HVM with shared-pt, the CPU and IOMMU mappings really are
identical.  The IOMMU really will get a read-only mapping of real MMCFG,
and holes for fully-emulated devices, which would suffer an IOMMU fault
if targeted.

For HVM without shared-pt, the translations are mostly kept in sync, but
the permissions in the CPU mappings may be reduced for e.g. logdirty
reasons.

For PV guests, things are mostly like the HVM shared-pt case, except
we've got the real IO-APICs mapped read-only, and no fully-emulated devices.


Putting the real IO-APICs in the IOMMU is about as short-sighted as
letting the PV guest see them to begin with, but there is nothing
fundamentally wrong with letting a PV guest do a DMA read of the
IO-APIC, seeing as we let it do a CPU read.  (And whether the platform
will even allow it, is a different matter.)


However, it is really important for there to not be a load of special
casing (all undocumented, naturally) keeping the CPU and IOMMU views
different.  It is an error that the views were ever different
(translation wise), and the only legitimate permission difference I can
think of is to support logdirty mode for migration.  (Introspection
protection for device-enabled VMs will be left as an exercise to
whomever first wants to use it.)

Making the guest physical address space view consistent between the CPU
and device is a "because it's obviously the correct thing to do" issue.
Deciding "well it makes no sense for you to have an IO mapping of $FOO"
is a matter of policy that Xen has no legitimate right to be enforcing.

~Andrew


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2021-12-02 19:16               ` Andrew Cooper
@ 2021-12-03  6:41                 ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-03  6:41 UTC (permalink / raw)
  To: Andrew Cooper, Andrew Cooper
  Cc: xen-devel, Paul Durrant, Roger Pau Monné

On 02.12.2021 20:16, Andrew Cooper wrote:
> On 02/12/2021 15:28, Jan Beulich wrote:
>> On 02.12.2021 16:12, Roger Pau Monné wrote:
>>> On Wed, Dec 01, 2021 at 12:45:12PM +0100, Jan Beulich wrote:
>>>> On 01.12.2021 11:32, Roger Pau Monné wrote:
>>>>> On Wed, Dec 01, 2021 at 10:27:21AM +0100, Jan Beulich wrote:
>>>>>> On 01.12.2021 10:09, Roger Pau Monné wrote:
>>>>>>> On Fri, Sep 24, 2021 at 11:46:57AM +0200, Jan Beulich wrote:
>>>>>>>> @@ -267,44 +267,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>>>>>>       * that fall in unusable ranges for PV Dom0.
>>>>>>>>       */
>>>>>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>>>>>>> -        return false;
>>>>>>>> +        return 0;
>>>>>>>>  
>>>>>>>>      switch ( type = page_get_ram_type(mfn) )
>>>>>>>>      {
>>>>>>>>      case RAM_TYPE_UNUSABLE:
>>>>>>>> -        return false;
>>>>>>>> +        return 0;
>>>>>>>>  
>>>>>>>>      case RAM_TYPE_CONVENTIONAL:
>>>>>>>>          if ( iommu_hwdom_strict )
>>>>>>>> -            return false;
>>>>>>>> +            return 0;
>>>>>>>>          break;
>>>>>>>>  
>>>>>>>>      default:
>>>>>>>>          if ( type & RAM_TYPE_RESERVED )
>>>>>>>>          {
>>>>>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>>>>>>> -                return false;
>>>>>>>> +                perms = 0;
>>>>>>>>          }
>>>>>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>>>> -            return false;
>>>>>>>> +        else if ( is_hvm_domain(d) )
>>>>>>>> +            return 0;
>>>>>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>>>>> +            perms = 0;
>>>>>>> I'm confused about the reason to set perms = 0 instead of just
>>>>>>> returning here. AFAICT perms won't be set to any other value below,
>>>>>>> so you might as well just return 0.
>>>>>> This is so that ...
>>>>>>
>>>>>>>>      }
>>>>>>>>  
>>>>>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>>>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>>>>>>> -        return false;
>>>>>>>> +        return 0;
>>>>>>>>      /* ... or the IO-APIC */
>>>>>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>>>>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>>>> -            return false;
>>>>>>>> +    if ( has_vioapic(d) )
>>>>>>>> +    {
>>>>>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>>>>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>>>>> +                return 0;
>>>>>>>> +    }
>>>>>>>> +    else if ( is_pv_domain(d) )
>>>>>>>> +    {
>>>>>>>> +        /*
>>>>>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>>>>>>> +         * ones there, so it should also have such established for IOMMUs.
>>>>>>>> +         */
>>>>>>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>>>>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>>>>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>>>>>>> +                       ? IOMMUF_readable : 0;
>>>>>>>> +    }
>>>>>> ... this return, as per the comment, takes precedence over returning
>>>>>> zero.
>>>>> I see. This is because you want to map those in the IOMMU page tables
>>>>> even if the IO-APIC ranges are outside of a reserved region.
>>>>>
>>>>> I have to admit this is kind of weird, because the purpose of this
>>>>> function is to add mappings for all memory below 4G, and/or for all
>>>>> reserved regions.
>>>> Well, that was what it started out as. The purpose here is to be consistent
>>>> about IO-APICs: Either have them all mapped, or none of them. Since we map
>>>> them in the CPU page tables and since Andrew asked for the two mappings to
>>>> be consistent, this is the only way to satisfy the requests. Personally I'd
>>>> be okay with not mapping IO-APICs here (but then regardless of whether they
>>>> are covered by a reserved region).
>>> I'm unsure of the best way to deal with this, it seems like both
>>> the CPU and the IOMMU page tables would never be equal for PV dom0,
>>> because we have no intention to map the MSI-X tables in RO mode in the
>>> IOMMU page tables.
>>>
>>> I'm not really opposed to having the IO-APIC mapped RO in the IOMMU
>>> page tables, but I also don't see much benefit of doing it unless we
>>> have a use-case for it. The IO-APIC handling in PV is already
>>> different from native, so I would be fine if we add a comment noting
>>> that while the IO-APIC is mappable to the CPU page tables as RO it's
>>> not present in the IOMMU page tables (and then adjust hwdom_iommu_map
>>> to prevent its mapping).
>> Andrew, you did request both mappings to get in sync - thoughts?
> 
> Lets step back to first principles.
> 
> On real hardware, there is no such thing as read-only-ness of the
> physical address space.  Anything like that is a device which accepts
> and discards writes.
> 
> It's not clear what a real hardware platform would do in this scenario,
> but from reading some of the platform docs, I suspect the System Address
> Decoder would provide a symmetric view of the hardware address space,
> but this doesn't mean that UBOX would tolerate memory accesses uniformly
> from all sources.  Also, there's nothing to say that all platforms
> behave the same.
> 
> 
> For HVM with shared-pt, the CPU and IOMMU mappings really are
> identical.  The IOMMU really will get a read-only mapping of real MMCFG,
> and holes for fully-emulated devices, which would suffer an IOMMU fault
> if targeted.
> 
> For HVM without shared-pt, the translations are mostly kept in sync, but
> the permissions in the CPU mappings may be reduced for e.g. logdirty
> reasons.
> 
> For PV guests, things are mostly like the HVM shared-pt case, except
> we've got the real IO-APICs mapped read-only, and no fully-emulated devices.
> 
> 
> Putting the real IO-APICs in the IOMMU is about as short sighted as
> letting the PV guest see them to begin with, but there is nothing
> fundamentally wrong with letting a PV guest do a DMA read of the
> IO-APIC, seeing as we let it do a CPU read.  (And whether the platform
> will even allow it, is a different matter.)
> 
> 
> However, it is really important for there to not be a load of special
> casing (all undocumented, naturally) keeping the CPU and IOMMU views
> different.  It is an error that the views were ever different
> (translation wise), and the only legitimate permission difference I can
> think of is to support logdirty mode for migration.  (Introspection
> protection for device-enabled VMs will be left as an exercise to
> whomever first wants to use it.)
> 
> Making the guest physical address space view consistent between the CPU
> and device is a "because its obviously the correct thing to do" issue. 
> Deciding "well it makes no sense for you to have an IO mapping of $FOO"
> is a matter of policy that Xen has no legitimate right to be enforcing.

To summarize: You continue to think it's better to map the IO-APICs r/o
also in the IOMMU, despite there not being any practical need for these
mappings (the CPU ones get permitted as a workaround only, after all).
Please correct me if that's a wrong understanding of your reply. And I
take it that you're aware that CPU mappings get inserted only upon Dom0's
request, whereas IOMMU mappings get created once during boot (the
inconsistent form of which had been present prior to this patch).

Any decision here would then imo also want to apply to e.g. the HPET
region, for which we have a mode where Dom0 can map it r/o. And the
MSI-X tables and PBAs (which get dynamically entered into mmio_ro_ranges).
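
I.e., if we went that route, presumably something of this shape in
hwdom_iommu_map() (illustration only, not what the patch does):

    /*
     * Mirror regions Dom0 may map r/o on the CPU side as r/o in the IOMMU
     * (which would still leave out anything only entered into
     * mmio_ro_ranges after this runs, e.g. MSI-X tables and PBAs).
     */
    if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
        return IOMMUF_readable;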

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables
  2021-12-02 16:10     ` Jan Beulich
@ 2021-12-03  8:30       ` Roger Pau Monné
  2021-12-03  9:38         ` Roger Pau Monné
  2021-12-03  9:40         ` Jan Beulich
  0 siblings, 2 replies; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-03  8:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Thu, Dec 02, 2021 at 05:10:38PM +0100, Jan Beulich wrote:
> On 02.12.2021 17:03, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:48:21AM +0200, Jan Beulich wrote:
> >> For vendor specific code to support superpages we need to be able to
> >> deal with a superpage mapping replacing an intermediate page table (or
> >> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
> >> needed to free individual page tables while a domain is still alive.
> >> Since the freeing needs to be deferred until after a suitable IOTLB
> >> flush was performed, released page tables get queued for processing by a
> >> tasklet.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> I was considering whether to use a softirq-tasklet instead. This would
> >> have the benefit of avoiding extra scheduling operations, but come with
> >> the risk of the freeing happening prematurely because of a
> >> process_pending_softirqs() somewhere.
> > 
> > Another approach that comes to mind (maybe you already thought of it
> > and discarded) would be to perform the freeing after the flush in
> > iommu_iotlb_flush{_all} while keeping the per-pCPU lists.
> 
> This was my initial plan, but I couldn't convince myself that the first
> flush to happen would be _the_ one associated with the to-be-freed
> page tables. ISTR (vaguely though) actually having found an example to
> the contrary.

I see. If we keep the list per pCPU I'm not sure we could have an
IOMMU flush not related to the to-be-freed pages, as we cannot have
interleaved IOMMU operations on the same pCPU.

Also, if we strictly add the pages to the freeing list once unhooked
from the IOMMU page tables it should be safe to flush and then free
them, as there would be no references remaining anymore.
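
To make this more concrete, a rough sketch of what I have in mind
(hypothetical and untested; it reuses the per-CPU list from your patch,
just without the tasklet):

static void free_queued_pgtables(void)
{
    struct page_list_head *list = &this_cpu(free_pgt_list);
    struct page_info *pg;

    while ( (pg = page_list_remove_head(list)) )
        free_domheap_page(pg);
}

... with a call to it at the tail of iommu_iotlb_flush{,_all}() once the
flush has succeeded.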

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault
  2021-09-24  9:51 ` [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault Jan Beulich
@ 2021-12-03  9:03   ` Roger Pau Monné
  2021-12-03  9:49     ` Jan Beulich
  2021-12-03  9:59     ` Jan Beulich
  0 siblings, 2 replies; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-03  9:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Fri, Sep 24, 2021 at 11:51:15AM +0200, Jan Beulich wrote:
> This is to aid diagnosing issues and largely matches VT-d's behavior.
> Since I'm adding permissions output here as well, take the opportunity
> and also add their displaying to amd_dump_page_table_level().
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/drivers/passthrough/amd/iommu.h
> +++ b/xen/drivers/passthrough/amd/iommu.h
> @@ -243,6 +243,8 @@ int __must_check amd_iommu_flush_iotlb_p
>                                               unsigned long page_count,
>                                               unsigned int flush_flags);
>  int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
> +                             dfn_t dfn);
>  
>  /* device table functions */
>  int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
> --- a/xen/drivers/passthrough/amd/iommu_init.c
> +++ b/xen/drivers/passthrough/amd/iommu_init.c
> @@ -573,6 +573,9 @@ static void parse_event_log_entry(struct
>                 (flags & 0x002) ? " NX" : "",
>                 (flags & 0x001) ? " GN" : "");
>  
> +        if ( iommu_verbose )
> +            amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
> +
>          for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
>              if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
>                  pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -363,6 +363,50 @@ int amd_iommu_unmap_page(struct domain *
>      return 0;
>  }
>  
> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
> +                             dfn_t dfn)
> +{
> +    mfn_t pt_mfn;
> +    unsigned int level;
> +    const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
> +
> +    if ( !dt[dev_id].tv )
> +    {
> +        printk("%pp: no root\n", &PCI_SBDF2(iommu->seg, dev_id));
> +        return;
> +    }
> +
> +    pt_mfn = _mfn(dt[dev_id].pt_root);
> +    level = dt[dev_id].paging_mode;
> +    printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
> +           &PCI_SBDF2(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
> +
> +    while ( level )
> +    {
> +        const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
> +        unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
> +        union amd_iommu_pte pte = pt[idx];

Don't you need to take a lock here (mapping_lock maybe?) in order to
prevent changes to the IOMMU page tables while walking them?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables
  2021-12-03  8:30       ` Roger Pau Monné
@ 2021-12-03  9:38         ` Roger Pau Monné
  2021-12-03  9:40         ` Jan Beulich
  1 sibling, 0 replies; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-03  9:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, Dec 03, 2021 at 09:30:00AM +0100, Roger Pau Monné wrote:
> On Thu, Dec 02, 2021 at 05:10:38PM +0100, Jan Beulich wrote:
> > On 02.12.2021 17:03, Roger Pau Monné wrote:
> > > On Fri, Sep 24, 2021 at 11:48:21AM +0200, Jan Beulich wrote:
> > >> For vendor specific code to support superpages we need to be able to
> > >> deal with a superpage mapping replacing an intermediate page table (or
> > >> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
> > >> needed to free individual page tables while a domain is still alive.
> > >> Since the freeing needs to be deferred until after a suitable IOTLB
> > >> flush was performed, released page tables get queued for processing by a
> > >> tasklet.
> > >>
> > >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > >> ---
> > >> I was considering whether to use a softirq-tasklet instead. This would
> > >> have the benefit of avoiding extra scheduling operations, but come with
> > >> the risk of the freeing happening prematurely because of a
> > >> process_pending_softirqs() somewhere.
> > > 
> > > Another approach that comes to mind (maybe you already thought of it
> > > and discarded) would be to perform the freeing after the flush in
> > > iommu_iotlb_flush{_all} while keeping the per-pCPU lists.
> > 
> > This was my initial plan, but I couldn't convince myself that the first
> > flush to happen would be _the_ one associated with the to-be-freed
> > page tables. ISTR (vaguely though) actually having found an example to
> > the contrary.
> 
> I see. If we keep the list per pCPU I'm not sure we could have an
> IOMMU flush not related to the to-be-freed pages, as we cannot have
> interleaved IOMMU operations on the same pCPU.
> 
> Also, if we strictly add the pages to the freeing list once unhooked
> from the IOMMU page tables it should be safe to flush and then free
> them, as there would be no references remaining anymore.

Replying to my last paragraph: there are different types of flushes,
and they have different scopes, so just adding the pages to be freed
to a random list and expecting any flush to remove them from the IOMMU
cache is not correct.

I still think the first paragraph is accurate, as we shouldn't have
interleaving IOMMU operations on the same pCPU, so a flush on a pCPU
should always clear the entries that have been freed as a result of
the ongoing operation on that pCPU, and those operations should be
sequential.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables
  2021-12-03  8:30       ` Roger Pau Monné
  2021-12-03  9:38         ` Roger Pau Monné
@ 2021-12-03  9:40         ` Jan Beulich
  1 sibling, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-03  9:40 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 03.12.2021 09:30, Roger Pau Monné wrote:
> On Thu, Dec 02, 2021 at 05:10:38PM +0100, Jan Beulich wrote:
>> On 02.12.2021 17:03, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:48:21AM +0200, Jan Beulich wrote:
>>>> For vendor specific code to support superpages we need to be able to
>>>> deal with a superpage mapping replacing an intermediate page table (or
>>>> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
>>>> needed to free individual page tables while a domain is still alive.
>>>> Since the freeing needs to be deferred until after a suitable IOTLB
>>>> flush was performed, released page tables get queued for processing by a
>>>> tasklet.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> I was considering whether to use a softirq-tasklet instead. This would
>>>> have the benefit of avoiding extra scheduling operations, but come with
>>>> the risk of the freeing happening prematurely because of a
>>>> process_pending_softirqs() somewhere.
>>>
>>> Another approach that comes to mind (maybe you already thought of it
>>> and discarded) would be to perform the freeing after the flush in
>>> iommu_iotlb_flush{_all} while keeping the per-pCPU lists.
>>
>> This was my initial plan, but I couldn't convince myself that the first
>> flush to happen would be _the_ one associated with the to-be-freed
>> page tables. ISTR (vaguely though) actually having found an example to
>> the contrary.
> 
> I see. If we keep the list per pCPU I'm not sure we could have an
> IOMMU flush not related to the to-be-freed pages, as we cannot have
> interleaved IOMMU operations on the same pCPU.

"interleaved" is perhaps the wrong word. But can you easily exclude e.g.
a put_page() in the middle of some other operation? That could in turn
invoke one of the legacy map/unmap functions (see cleanup_page_mappings()
for an example), where the flushing happens immediately after the
map/unmap.

> Also, if we strictly add the pages to the freeing list once unhooked
> from the IOMMU page tables it should be safe to flush and then free
> them, as there would be no references remaining anymore.

Only if the flush is a full-address-space one.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault
  2021-12-03  9:03   ` Roger Pau Monné
@ 2021-12-03  9:49     ` Jan Beulich
  2021-12-03  9:55       ` Jan Beulich
  2021-12-03  9:59     ` Jan Beulich
  1 sibling, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-03  9:49 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 03.12.2021 10:03, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:51:15AM +0200, Jan Beulich wrote:
>> This is to aid diagnosing issues and largely matches VT-d's behavior.
>> Since I'm adding permissions output here as well, take the opportunity
>> and also add their displaying to amd_dump_page_table_level().
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/drivers/passthrough/amd/iommu.h
>> +++ b/xen/drivers/passthrough/amd/iommu.h
>> @@ -243,6 +243,8 @@ int __must_check amd_iommu_flush_iotlb_p
>>                                               unsigned long page_count,
>>                                               unsigned int flush_flags);
>>  int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
>> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
>> +                             dfn_t dfn);
>>  
>>  /* device table functions */
>>  int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>> @@ -573,6 +573,9 @@ static void parse_event_log_entry(struct
>>                 (flags & 0x002) ? " NX" : "",
>>                 (flags & 0x001) ? " GN" : "");
>>  
>> +        if ( iommu_verbose )
>> +            amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
>> +
>>          for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
>>              if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
>>                  pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -363,6 +363,50 @@ int amd_iommu_unmap_page(struct domain *
>>      return 0;
>>  }
>>  
>> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
>> +                             dfn_t dfn)
>> +{
>> +    mfn_t pt_mfn;
>> +    unsigned int level;
>> +    const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
>> +
>> +    if ( !dt[dev_id].tv )
>> +    {
>> +        printk("%pp: no root\n", &PCI_SBDF2(iommu->seg, dev_id));
>> +        return;
>> +    }
>> +
>> +    pt_mfn = _mfn(dt[dev_id].pt_root);
>> +    level = dt[dev_id].paging_mode;
>> +    printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
>> +           &PCI_SBDF2(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
>> +
>> +    while ( level )
>> +    {
>> +        const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
>> +        unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
>> +        union amd_iommu_pte pte = pt[idx];
> 
> Don't you need to take a lock here (mapping_lock maybe?) in order to
> prevent changes to the IOMMU page tables while walking them?

Generally speaking - yes. But see the description saying "largely
matches VT-d's behavior". On VT-d both the IOMMU lock and the mapping
lock would need acquiring to be safe (the former could perhaps be
dropped early). Likewise here. If I wanted to do so here, I ought to
add a prereq patch adjusting the VT-d function. The main "excuse" not
to do so is/was probably the size of the series.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault
  2021-12-03  9:49     ` Jan Beulich
@ 2021-12-03  9:55       ` Jan Beulich
  2021-12-10 10:23         ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-03  9:55 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 03.12.2021 10:49, Jan Beulich wrote:
> On 03.12.2021 10:03, Roger Pau Monné wrote:
>> On Fri, Sep 24, 2021 at 11:51:15AM +0200, Jan Beulich wrote:
>>> This is to aid diagnosing issues and largely matches VT-d's behavior.
>>> Since I'm adding permissions output here as well, take the opportunity
>>> and also add their displaying to amd_dump_page_table_level().
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> --- a/xen/drivers/passthrough/amd/iommu.h
>>> +++ b/xen/drivers/passthrough/amd/iommu.h
>>> @@ -243,6 +243,8 @@ int __must_check amd_iommu_flush_iotlb_p
>>>                                               unsigned long page_count,
>>>                                               unsigned int flush_flags);
>>>  int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
>>> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
>>> +                             dfn_t dfn);
>>>  
>>>  /* device table functions */
>>>  int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
>>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>>> @@ -573,6 +573,9 @@ static void parse_event_log_entry(struct
>>>                 (flags & 0x002) ? " NX" : "",
>>>                 (flags & 0x001) ? " GN" : "");
>>>  
>>> +        if ( iommu_verbose )
>>> +            amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
>>> +
>>>          for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
>>>              if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
>>>                  pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>>> @@ -363,6 +363,50 @@ int amd_iommu_unmap_page(struct domain *
>>>      return 0;
>>>  }
>>>  
>>> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
>>> +                             dfn_t dfn)
>>> +{
>>> +    mfn_t pt_mfn;
>>> +    unsigned int level;
>>> +    const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
>>> +
>>> +    if ( !dt[dev_id].tv )
>>> +    {
>>> +        printk("%pp: no root\n", &PCI_SBDF2(iommu->seg, dev_id));
>>> +        return;
>>> +    }
>>> +
>>> +    pt_mfn = _mfn(dt[dev_id].pt_root);
>>> +    level = dt[dev_id].paging_mode;
>>> +    printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
>>> +           &PCI_SBDF2(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
>>> +
>>> +    while ( level )
>>> +    {
>>> +        const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
>>> +        unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
>>> +        union amd_iommu_pte pte = pt[idx];
>>
>> Don't you need to take a lock here (mapping_lock maybe?) in order to
>> prevent changes to the IOMMU page tables while walking them?
> 
> Generally speaking - yes. But see the description saying "largely
> matches VT-d's behavior". On VT-d both the IOMMU lock and the mapping
> lock would need acquiring to be safe (the former could perhaps be
> dropped early). Likewise here. If I wanted to do so here, I ought to
> add a prereq patch adjusting the VT-d function. The main "excuse" not
> to do so is/was probably the size of the series.

Which in turn would call for {amd,vtd}_dump_page_tables() to gain proper
locking. Not sure where this would end; these further items are more and
more unrelated to the actual purpose of this series (whereas I needed the
patch here anyway for debugging purposes) ...

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault
  2021-12-03  9:03   ` Roger Pau Monné
  2021-12-03  9:49     ` Jan Beulich
@ 2021-12-03  9:59     ` Jan Beulich
  1 sibling, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-03  9:59 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 03.12.2021 10:03, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:51:15AM +0200, Jan Beulich wrote:
>> This is to aid diagnosing issues and largely matches VT-d's behavior.
>> Since I'm adding permissions output here as well, take the opportunity
>> and also add their displaying to amd_dump_page_table_level().
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/drivers/passthrough/amd/iommu.h
>> +++ b/xen/drivers/passthrough/amd/iommu.h
>> @@ -243,6 +243,8 @@ int __must_check amd_iommu_flush_iotlb_p
>>                                               unsigned long page_count,
>>                                               unsigned int flush_flags);
>>  int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
>> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
>> +                             dfn_t dfn);
>>  
>>  /* device table functions */
>>  int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>> @@ -573,6 +573,9 @@ static void parse_event_log_entry(struct
>>                 (flags & 0x002) ? " NX" : "",
>>                 (flags & 0x001) ? " GN" : "");
>>  
>> +        if ( iommu_verbose )
>> +            amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
>> +
>>          for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
>>              if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
>>                  pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -363,6 +363,50 @@ int amd_iommu_unmap_page(struct domain *
>>      return 0;
>>  }
>>  
>> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
>> +                             dfn_t dfn)
>> +{
>> +    mfn_t pt_mfn;
>> +    unsigned int level;
>> +    const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
>> +
>> +    if ( !dt[dev_id].tv )
>> +    {
>> +        printk("%pp: no root\n", &PCI_SBDF2(iommu->seg, dev_id));
>> +        return;
>> +    }
>> +
>> +    pt_mfn = _mfn(dt[dev_id].pt_root);
>> +    level = dt[dev_id].paging_mode;
>> +    printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
>> +           &PCI_SBDF2(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
>> +
>> +    while ( level )
>> +    {
>> +        const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
>> +        unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
>> +        union amd_iommu_pte pte = pt[idx];
> 
> Don't you need to take a lock here (mapping_lock maybe?) in order to
> prevent changes to the IOMMU page tables while walking them?

Further to my earlier reply, taking the mapping lock here isn't
straightforward, as that would mean determining the correct domain.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches
  2021-12-02 14:10   ` Roger Pau Monné
@ 2021-12-03 12:38     ` Jan Beulich
  2021-12-10  9:36       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-03 12:38 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 02.12.2021 15:10, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
>> @@ -689,7 +763,8 @@ int __init dom0_construct_pv(struct doma
>>          l1tab++;
>>  
>>          page = mfn_to_page(_mfn(mfn));
>> -        if ( !page->u.inuse.type_info &&
>> +        if ( (!page->u.inuse.type_info ||
>> +              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
> 
> Would it be clearer to get page for all pages that have a 0 count:
> !(type_info & PGT_count_mask). Or would that interact badly with page
> table pages?

Indeed this wouldn't work with page tables (and I recall having learned
this the hard way): Prior to mark_pv_pt_pages_rdonly() they all have a
type refcount of zero. Even if it wasn't for this, I'd prefer to not
relax the condition here more than necessary.

>> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
>>      /* Pages that are part of page tables must be read only. */
>>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
>>  
>> +    /*
>> +     * This needs to come after all potentially excess
>> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
>> +     * few lines further up, where the effect of calling that function in an
>> +     * earlier loop iteration may get overwritten by a later one.
>> +     */
>> +    if ( need_iommu_pt_sync(d) &&
>> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
>> +                     &flush_flags) )
>> +        BUG();
> 
> Wouldn't such unmap better happen as part of changing the types of the
> pages that become part of the guest page tables?

Not sure - it's a single call here, but would be a call per page when
e.g. moved into mark_pv_pt_pages_rdonly().

>> @@ -840,22 +928,41 @@ int __init dom0_construct_pv(struct doma
>>  #endif
>>      while ( pfn < nr_pages )
>>      {
>> -        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
>> +        count = domain_tot_pages(d);
>> +        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
>>              panic("Not enough RAM for DOM0 reservation\n");
>> +        mfn = mfn_x(page_to_mfn(page));
>> +
>> +        if ( need_iommu_pt_sync(d) )
>> +        {
>> +            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
>> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>> +            if ( rc )
>> +                printk(XENLOG_ERR
>> +                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
>> +                       mfn, pfn, rc);
>> +        }
>> +
>>          while ( pfn < domain_tot_pages(d) )
>>          {
>> -            mfn = mfn_x(page_to_mfn(page));
>> +            if ( !rc )
>> +                make_pages_writable(page, 1);
> 
> There's quite a lot of repetition of the pattern: allocate, iommu_map,
> set as writable. Would it be possible to abstract this into some
> kind of helper?
> 
> I've realized some of the allocations use alloc_chunk while others use
> alloc_domheap_pages, so it might require some work.

Right, I'd leave the allocation part aside for the moment. I had actually
considered to fold iommu_map() and make_pages_writable() into a common
helper (or really rename make_pages_writable() and fold iommu_map() into
there). What I lacked was a reasonable, not overly long name for such a
function. Plus - maybe minor - I wanted to avoid extra MFN <-> page
translations.

With a fair name suggested, I'd be happy to give this a try.

>>  #ifndef NDEBUG
>>  #define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
>>  #endif
>>              dom0_update_physmap(compat, pfn, mfn, vphysmap_start);
>>  #undef pfn
>> -            page++; pfn++;
>> +            page++; mfn++; pfn++;
>>              if ( !(pfn & 0xfffff) )
>>                  process_pending_softirqs();
>>          }
>>      }
>>  
>> +    /* Use while() to avoid compiler warning. */
>> +    while ( iommu_iotlb_flush_all(d, flush_flags) )
>> +        break;
> 
> Might be worth to print a message here in case of error?

Maybe. But then not just here. See e.g. arch_iommu_hwdom_init().

>> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>                                                                  : p2m_access_r,
>>                                          0);
>> +        else if ( pfn != start + count || perms != start_perms )
>> +        {
>> +        commit:
>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
>> +                           start_perms, &flush_flags);
>> +            SWAP(start, pfn);
>> +            start_perms = perms;
>> +            count = 1;
>> +        }
>>          else
>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>> -                           perms, &flush_flags);
>> +        {
>> +            ++count;
>> +            rc = 0;
>> +        }
>>  
>>          if ( rc )
>>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
>>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
> 
> Would be nice to print the count (or end pfn) in case it's a range.

I can do so if you think it's worth further extra code. I can't use
"count" here in particular, as that was updated already (in context
above). The most reasonable change towards this would perhaps be to
duplicate the printk() into both the "if()" and the "else if()" bodies.

> While not something that you have to fix here, the logic here is
> becoming overly complicated IMO. It might be easier to just put all
> the ram and reserved regions (or everything < 4G) into a rangeset and
> then punch holes on it for non guest mappable regions, and finally use
> rangeset_consume_ranges to iterate and map those. That's likely faster
> than having to iterate over all pfns on the system, and easier to
> understand from a logic PoV.

Maybe; I didn't spend much time on figuring possible ways of
consolidating some of this.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches
  2021-12-03 12:38     ` Jan Beulich
@ 2021-12-10  9:36       ` Roger Pau Monné
  2021-12-10 11:41         ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-10  9:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, Dec 03, 2021 at 01:38:48PM +0100, Jan Beulich wrote:
> On 02.12.2021 15:10, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
> >> @@ -689,7 +763,8 @@ int __init dom0_construct_pv(struct doma
> >>          l1tab++;
> >>  
> >>          page = mfn_to_page(_mfn(mfn));
> >> -        if ( !page->u.inuse.type_info &&
> >> +        if ( (!page->u.inuse.type_info ||
> >> +              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
> > 
> > Would it be clearer to get page for all pages that have a 0 count:
> > !(type_info & PGT_count_mask). Or would that interact badly with page
> > table pages?
> 
> Indeed this wouldn't work with page tables (and I recall having learned
> this the hard way): Prior to mark_pv_pt_pages_rdonly() they all have a
> type refcount of zero. Even if it wasn't for this, I'd prefer to not
> relax the condition here more than necessary.

Right. Page tables will have some types set but a count of 0.

> >> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
> >>      /* Pages that are part of page tables must be read only. */
> >>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
> >>  
> >> +    /*
> >> +     * This needs to come after all potentially excess
> >> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
> >> +     * few lines further up, where the effect of calling that function in an
> >> +     * earlier loop iteration may get overwritten by a later one.
> >> +     */
> >> +    if ( need_iommu_pt_sync(d) &&
> >> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
> >> +                     &flush_flags) )
> >> +        BUG();
> > 
> > Wouldn't such unmap better happen as part of changing the types of the
> > pages that become part of the guest page tables?
> 
> Not sure - it's a single call here, but would be a call per page when
> e.g. moved into mark_pv_pt_pages_rdonly().

I see. So this would result in multiple calls when moved, plus all the
involved page shattering and aggregation logic. Overall it would be
less error prone, as the iommu unmap would happen next to the type
change, but I'm not going to insist if you think it's not worth it.
The page table structure pages shouldn't be that many anyway?
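
I.e. roughly the following next to the type change (purely illustrative;
flush_flags would need passing into mark_pv_pt_pages_rdonly()):

        /* The page table page at MFN "mfn" just became r/o for the CPU. */
        if ( need_iommu_pt_sync(d) &&
             iommu_unmap(d, _dfn(mfn), 1, &flush_flags) )
            BUG();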

> >> @@ -840,22 +928,41 @@ int __init dom0_construct_pv(struct doma
> >>  #endif
> >>      while ( pfn < nr_pages )
> >>      {
> >> -        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
> >> +        count = domain_tot_pages(d);
> >> +        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
> >>              panic("Not enough RAM for DOM0 reservation\n");
> >> +        mfn = mfn_x(page_to_mfn(page));
> >> +
> >> +        if ( need_iommu_pt_sync(d) )
> >> +        {
> >> +            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
> >> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> >> +            if ( rc )
> >> +                printk(XENLOG_ERR
> >> +                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
> >> +                       mfn, pfn, rc);
> >> +        }
> >> +
> >>          while ( pfn < domain_tot_pages(d) )
> >>          {
> >> -            mfn = mfn_x(page_to_mfn(page));
> >> +            if ( !rc )
> >> +                make_pages_writable(page, 1);
> > 
> > There's quite a lot of repetition of the pattern: allocate, iommu_map,
> > set as writable. Would it be possible to abstract this into some
> > kind of helper?
> > 
> > I've realized some of the allocations use alloc_chunk while others use
> > alloc_domheap_pages, so it might require some work.
> 
> Right, I'd leave the allocation part aside for the moment. I had actually
> considered to fold iommu_map() and make_pages_writable() into a common
> helper (or really rename make_pages_writable() and fold iommu_map() into
> there). What I lacked was a reasonable, not overly long name for such a
> function.

I'm not overly good at naming, but I think we need to somehow find a
way to place those together into a single helper.

I would be fine with naming this iommu_memory_{setup,add} or some
such. Marking the pages as writable is a result (or a requirement
might be a better way to express it?) of adding them to the IOMMU.
Would you be OK with one of those names?
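
For illustration, something along these lines is what I'd imagine
(hypothetical sketch only - name, exact signature and error handling of
course up for discussion):

static int __init iommu_memory_setup(struct domain *d, struct page_info *page,
                                     unsigned long nr,
                                     unsigned int *flush_flags)
{
    int rc;
    mfn_t mfn = page_to_mfn(page);

    if ( !need_iommu_pt_sync(d) )
        return 0;

    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, nr,
                   IOMMUF_readable | IOMMUF_writable, flush_flags);
    if ( rc )
    {
        printk(XENLOG_ERR
               "pre-mapping MFN %lx (%lu pages) into IOMMU failed: %d\n",
               mfn_x(mfn), nr, rc);
        return rc;
    }

    /* Pages mapped for DMA also become CPU-writable, as the callers do now. */
    make_pages_writable(page, nr);

    return 0;
}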

> >> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>                                          perms & IOMMUF_writable ? p2m_access_rw
> >>                                                                  : p2m_access_r,
> >>                                          0);
> >> +        else if ( pfn != start + count || perms != start_perms )
> >> +        {
> >> +        commit:
> >> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
> >> +                           start_perms, &flush_flags);
> >> +            SWAP(start, pfn);
> >> +            start_perms = perms;
> >> +            count = 1;
> >> +        }
> >>          else
> >> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >> -                           perms, &flush_flags);
> >> +        {
> >> +            ++count;
> >> +            rc = 0;
> >> +        }
> >>  
> >>          if ( rc )
> >>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
> >>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
> > 
> > Would be nice to print the count (or end pfn) in case it's a range.
> 
> I can do so if you think it's worth further extra code. I can't use
> "count" here in particular, as that was updated already (in context
> above). The most reasonable change towards this would perhaps be to
> duplicate the printk() into both the "if()" and the "else if()" bodies.

Maybe. The current message gives the impression that mapping a single
pfn failed, but without printing the range that failed the
message will not be that helpful in diagnosing further issues that
might arise due to the mapping failure.

> > While not something that you have to fix here, the logic here is
> > becoming overly complicated IMO. It might be easier to just put all
> > the ram and reserved regions (or everything < 4G) into a rangeset and
> > then punch holes on it for non guest mappable regions, and finally use
> > rangeset_consume_ranges to iterate and map those. That's likely faster
> > than having to iterate over all pfns on the system, and easier to
> > understand from a logic PoV.
> 
> Maybe; I didn't spend much time on figuring possible ways of
> consolidating some of this.

I can give it a try after your code is merged. I think it's getting a
bit messy.
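
To sketch the direction (rough and untested; this assumes the callback
shape rangeset_consume_ranges() currently expects, and only covers the
non-translated case):

struct map_data {
    struct domain *d;
    unsigned int flush_flags;
};

static int __hwdom_init identity_map(unsigned long s, unsigned long e,
                                     void *data, unsigned long *c)
{
    struct map_data *info = data;
    unsigned long nr = e - s + 1;
    int rc = iommu_map(info->d, _dfn(s), _mfn(s), nr,
                       IOMMUF_readable | IOMMUF_writable, &info->flush_flags);

    if ( rc )
        printk(XENLOG_WARNING
               "%pd: identity mapping of [%lx, %lx] failed: %d\n",
               info->d, s, e, rc);

    *c = nr;

    return rc;
}

arch_iommu_hwdom_init() would then populate a rangeset with the RAM and
reserved regions, punch holes for the Interrupt Address Range, the
IO-APICs and the like, and finish with a single
rangeset_consume_ranges(map, identity_map, &data) followed by the flush.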

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault
  2021-12-03  9:55       ` Jan Beulich
@ 2021-12-10 10:23         ` Roger Pau Monné
  0 siblings, 0 replies; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-10 10:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Fri, Dec 03, 2021 at 10:55:54AM +0100, Jan Beulich wrote:
> On 03.12.2021 10:49, Jan Beulich wrote:
> > On 03.12.2021 10:03, Roger Pau Monné wrote:
> >> On Fri, Sep 24, 2021 at 11:51:15AM +0200, Jan Beulich wrote:
> >>> This is to aid diagnosing issues and largely matches VT-d's behavior.
> >>> Since I'm adding permissions output here as well, take the opportunity
> >>> and also add their displaying to amd_dump_page_table_level().
> >>>
> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>
> >>> --- a/xen/drivers/passthrough/amd/iommu.h
> >>> +++ b/xen/drivers/passthrough/amd/iommu.h
> >>> @@ -243,6 +243,8 @@ int __must_check amd_iommu_flush_iotlb_p
> >>>                                               unsigned long page_count,
> >>>                                               unsigned int flush_flags);
> >>>  int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
> >>> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
> >>> +                             dfn_t dfn);
> >>>  
> >>>  /* device table functions */
> >>>  int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
> >>> --- a/xen/drivers/passthrough/amd/iommu_init.c
> >>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
> >>> @@ -573,6 +573,9 @@ static void parse_event_log_entry(struct
> >>>                 (flags & 0x002) ? " NX" : "",
> >>>                 (flags & 0x001) ? " GN" : "");
> >>>  
> >>> +        if ( iommu_verbose )
> >>> +            amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
> >>> +
> >>>          for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
> >>>              if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
> >>>                  pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
> >>> --- a/xen/drivers/passthrough/amd/iommu_map.c
> >>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> >>> @@ -363,6 +363,50 @@ int amd_iommu_unmap_page(struct domain *
> >>>      return 0;
> >>>  }
> >>>  
> >>> +void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
> >>> +                             dfn_t dfn)
> >>> +{
> >>> +    mfn_t pt_mfn;
> >>> +    unsigned int level;
> >>> +    const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
> >>> +
> >>> +    if ( !dt[dev_id].tv )
> >>> +    {
> >>> +        printk("%pp: no root\n", &PCI_SBDF2(iommu->seg, dev_id));
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    pt_mfn = _mfn(dt[dev_id].pt_root);
> >>> +    level = dt[dev_id].paging_mode;
> >>> +    printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
> >>> +           &PCI_SBDF2(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
> >>> +
> >>> +    while ( level )
> >>> +    {
> >>> +        const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
> >>> +        unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
> >>> +        union amd_iommu_pte pte = pt[idx];
> >>
> >> Don't you need to take a lock here (mapping_lock maybe?) in order to
> >> prevent changes to the IOMMU page tables while walking them?
> > 
> > Generally speaking - yes. But see the description saying "largely
> > matches VT-d's behavior". On VT-d both the IOMMU lock and the mapping
> > lock would need acquiring to be safe (the former could perhaps be
> > dropped early). Likewise here. If I wanted to do so here, I ought to
> > add a prereq patch adjusting the VT-d function. The main "excuse" not
> > to do so is/was probably the size of the series.
> 
> Which in turn would call for {amd,vtd}_dump_page_tables() to gain proper
> locking. Not sure where this would end; these further items are more and
> more unrelated to the actual purpose of this series (whereas I needed the
> patch here anyway for debugging purposes) ...

I think adding a comment regarding the lack of locking due to the
function only being used as a debug aid would help clarify this. I
don't think we support running with iommu debug or verbose modes.
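
Something along these lines maybe (wording only a suggestion):

    /*
     * Intended as a debugging aid only (currently invoked from the event
     * log handler).  No locks are taken, so the tables may change under
     * our feet; treat the output as a best-effort snapshot.
     */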

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches
  2021-12-10  9:36       ` Roger Pau Monné
@ 2021-12-10 11:41         ` Jan Beulich
  2021-12-10 12:35           ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-10 11:41 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 10.12.2021 10:36, Roger Pau Monné wrote:
> On Fri, Dec 03, 2021 at 01:38:48PM +0100, Jan Beulich wrote:
>> On 02.12.2021 15:10, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
>>>> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
>>>>      /* Pages that are part of page tables must be read only. */
>>>>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
>>>>  
>>>> +    /*
>>>> +     * This needs to come after all potentially excess
>>>> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
>>>> +     * few lines further up, where the effect of calling that function in an
>>>> +     * earlier loop iteration may get overwritten by a later one.
>>>> +     */
>>>> +    if ( need_iommu_pt_sync(d) &&
>>>> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
>>>> +                     &flush_flags) )
>>>> +        BUG();
>>>
>>> Wouldn't such unmap better happen as part of changing the types of the
>>> pages that become part of the guest page tables?
>>
>> Not sure - it's a single call here, but would be a call per page when
>> e.g. moved into mark_pv_pt_pages_rdonly().
> 
> I see. So this would result in multiple calls when moved, plus all the
> involved page shattering and aggregation logic. Overall it would be
> less error prone, as the iommu unmap would happen next to the type
> change, but I'm not going to insist if you think it's not worth it.
> The page table structure pages shouldn't be that many anyway?

Typically it wouldn't be that many, true. I'm not sure about "less
error prone", though: We'd have more problems if the range unmapped
here wasn't properly representing the set of page tables used.

>>>> @@ -840,22 +928,41 @@ int __init dom0_construct_pv(struct doma
>>>>  #endif
>>>>      while ( pfn < nr_pages )
>>>>      {
>>>> -        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
>>>> +        count = domain_tot_pages(d);
>>>> +        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
>>>>              panic("Not enough RAM for DOM0 reservation\n");
>>>> +        mfn = mfn_x(page_to_mfn(page));
>>>> +
>>>> +        if ( need_iommu_pt_sync(d) )
>>>> +        {
>>>> +            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
>>>> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>>>> +            if ( rc )
>>>> +                printk(XENLOG_ERR
>>>> +                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
>>>> +                       mfn, pfn, rc);
>>>> +        }
>>>> +
>>>>          while ( pfn < domain_tot_pages(d) )
>>>>          {
>>>> -            mfn = mfn_x(page_to_mfn(page));
>>>> +            if ( !rc )
>>>> +                make_pages_writable(page, 1);
>>>
>>> There's quite a lot of repetition of the pattern: allocate, iommu_map,
>>> set as writable. Would it be possible to abstract this into some
>>> kind of helper?
>>>
>>> I've realized some of the allocations use alloc_chunk while others use
>>> alloc_domheap_pages, so it might require some work.
>>
>> Right, I'd leave the allocation part aside for the moment. I had actually
>> considered to fold iommu_map() and make_pages_writable() into a common
>> helper (or really rename make_pages_writable() and fold iommu_map() into
>> there). What I lacked was a reasonable, not overly long name for such a
>> function.
> 
> I'm not overly good at naming, but I think we need to somehow find a
> way to place those together into a single helper.
> 
> I would be fine with naming this iommu_memory_{setup,add} or some
> such. Marking the pages as writable is a result (or a requirement
> might be a better way to express it?) of adding them to the IOMMU.
> Would you be OK with one of those names?

I'll use the suggestion as a basis and see how it ends up looking /
feeling.

>>>> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>>>                                                                  : p2m_access_r,
>>>>                                          0);
>>>> +        else if ( pfn != start + count || perms != start_perms )
>>>> +        {
>>>> +        commit:
>>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
>>>> +                           start_perms, &flush_flags);
>>>> +            SWAP(start, pfn);
>>>> +            start_perms = perms;
>>>> +            count = 1;
>>>> +        }
>>>>          else
>>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>> -                           perms, &flush_flags);
>>>> +        {
>>>> +            ++count;
>>>> +            rc = 0;
>>>> +        }
>>>>  
>>>>          if ( rc )
>>>>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
>>>>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
>>>
>>> Would be nice to print the count (or end pfn) in case it's a range.
>>
>> I can do so if you think it's worth further extra code. I can't use
>> "count" here in particular, as that was updated already (in context
>> above). The most reasonable change towards this would perhaps be to
>> duplicate the printk() into both the "if()" and the "else if()" bodies.
> 
> Maybe. The current message gives the impression that a single pfn has
> been added and failed, but without printing the range that failed the
> message will not be that helpful in diagnosing further issues that
> might arise due to the mapping failure.

I guess I'll make the change then. I'm still not really convinced though,
as the presence of the message should be far more concerning than whether
it's a single page or a range. As middle ground, would

             printk(XENLOG_WARNING "%pd: identity %smapping of %lx... failed: %d\n",

be indicative enough of this perhaps not having been just a single page?
Otoh splitting (and moving) the message would allow to drop the separate
paging_mode_translate() check.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 11/18] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
  2021-09-24  9:51 ` [PATCH v2 11/18] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
@ 2021-12-10 12:05   ` Roger Pau Monné
  2021-12-10 12:59     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-10 12:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Fri, Sep 24, 2021 at 11:51:40AM +0200, Jan Beulich wrote:
> In order to free intermediate page tables when replacing smaller
> mappings by a single larger one callers will need to know the full PTE.
> Flush indicators can be derived from this in the callers (and outside
> the locked regions). First split set_iommu_pte_present() from
> set_iommu_ptes_present(): Only the former needs to return the old PTE,
> while the latter (like also set_iommu_pde_present()) doesn't even need
> to return flush indicators. Then change return types/values and callers
> accordingly.

Without looking at further patches I would say you only care to know
whether the old PTE was present (ie: pr bit set), at which point those
functions could also return a boolean instead of a full PTE?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches
  2021-12-10 11:41         ` Jan Beulich
@ 2021-12-10 12:35           ` Roger Pau Monné
  0 siblings, 0 replies; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-10 12:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, Dec 10, 2021 at 12:41:31PM +0100, Jan Beulich wrote:
> On 10.12.2021 10:36, Roger Pau Monné wrote:
> > On Fri, Dec 03, 2021 at 01:38:48PM +0100, Jan Beulich wrote:
> >> On 02.12.2021 15:10, Roger Pau Monné wrote:
> >>> On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
> >>>> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
> >>>>      /* Pages that are part of page tables must be read only. */
> >>>>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
> >>>>  
> >>>> +    /*
> >>>> +     * This needs to come after all potentially excess
> >>>> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
> >>>> +     * few lines further up, where the effect of calling that function in an
> >>>> +     * earlier loop iteration may get overwritten by a later one.
> >>>> +     */
> >>>> +    if ( need_iommu_pt_sync(d) &&
> >>>> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
> >>>> +                     &flush_flags) )
> >>>> +        BUG();
> >>>
> >>> Wouldn't such unmap better happen as part of changing the types of the
> >>> pages that become part of the guest page tables?
> >>
> >> Not sure - it's a single call here, but would be a call per page when
> >> e.g. moved into mark_pv_pt_pages_rdonly().
> > 
> > I see. So this would result in multiple calls when moved, plus all the
> > involved page shattering and aggregation logic. Overall it would be
> > less error prone, as the iommu unmap would happen next to the type
> > change, but I'm not going to insist if you think it's not worth it.
> > The page table structure pages shouldn't be that many anyway?
> 
> Typically it wouldn't be that many, true. I'm not sure about "less
> error prone", though: We'd have more problems if the range unmapped
> here wasn't properly representing the set of page tables used.

I have to admit I'm biased regarding the PV dom0 building code because
I find it utterly hard to follow, so IMO pairing the unmap call with
the code that marks the pages as read-only seemed less error prone and
less likely to go out of sync with regards to future changes.

That said, if you still feel it's better to do it in a block here I
won't argue anymore.

> >>>> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>>>                                          perms & IOMMUF_writable ? p2m_access_rw
> >>>>                                                                  : p2m_access_r,
> >>>>                                          0);
> >>>> +        else if ( pfn != start + count || perms != start_perms )
> >>>> +        {
> >>>> +        commit:
> >>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
> >>>> +                           start_perms, &flush_flags);
> >>>> +            SWAP(start, pfn);
> >>>> +            start_perms = perms;
> >>>> +            count = 1;
> >>>> +        }
> >>>>          else
> >>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >>>> -                           perms, &flush_flags);
> >>>> +        {
> >>>> +            ++count;
> >>>> +            rc = 0;
> >>>> +        }
> >>>>  
> >>>>          if ( rc )
> >>>>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
> >>>>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
> >>>
> >>> Would be nice to print the count (or end pfn) in case it's a range.
> >>
> >> I can do so if you think it's worth further extra code. I can't use
> >> "count" here in particular, as that was updated already (in context
> >> above). The most reasonable change towards this would perhaps be to
> >> duplicate the printk() into both the "if()" and the "else if()" bodies.
> > 
> > Maybe. The current message gives the impression that mapping a single
> > pfn failed, but without printing the range that failed the
> > message will not be that helpful in diagnosing further issues that
> > might arise due to the mapping failure.
> 
> I guess I'll make the change then. I'm still not really convinced though,
> as the presence of the message should be far more concerning than whether
> it's a single page or a range. As middle ground, would
> 
>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx... failed: %d\n",
> 
> be indicative enough of this perhaps not having been just a single page?

Let's go with that last suggestion then.

I would like to attempt to simplify part of the logic here, at which
point it might be easier to print a unified message for both the
translated and non-translated guests.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 11/18] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
  2021-12-10 12:05   ` Roger Pau Monné
@ 2021-12-10 12:59     ` Jan Beulich
  2021-12-10 13:53       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-10 12:59 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 10.12.2021 13:05, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:51:40AM +0200, Jan Beulich wrote:
>> In order to free intermediate page tables when replacing smaller
>> mappings by a single larger one callers will need to know the full PTE.
>> Flush indicators can be derived from this in the callers (and outside
>> the locked regions). First split set_iommu_pte_present() from
>> set_iommu_ptes_present(): Only the former needs to return the old PTE,
>> while the latter (like also set_iommu_pde_present()) doesn't even need
>> to return flush indicators. Then change return types/values and callers
>> accordingly.
> 
> Without looking at further patches I would say you only care to know
> whether the old PTE was present (ie: pr bit set), at which point those
> functions could also return a boolean instead of a full PTE?

But looking at further patches will reveal that I then also need the
next_level field from the old PTE (to tell superpages from page tables).
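
To give a rough idea, a later call site will want to do something along
these lines (only a sketch, with names as used further on in the series,
not the exact code):

    union amd_iommu_pte old = set_iommu_pte_present(pt_mfn, dfn, mfn,
                                                    level, iw, ir);

    if ( old.pr )
    {
        flush_flags |= IOMMU_FLUSHF_modified;
        if ( old.next_level )
            /* A page table was replaced - queue it for freeing once a
               suitable IOTLB flush has happened. */
            queue_free_pt(d, _mfn(old.mfn), old.next_level);
    }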

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables
  2021-09-24  9:48 ` [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables Jan Beulich
  2021-12-02 16:03   ` Roger Pau Monné
@ 2021-12-10 13:51   ` Roger Pau Monné
  2021-12-13  8:38     ` Jan Beulich
  1 sibling, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-10 13:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, Sep 24, 2021 at 11:48:21AM +0200, Jan Beulich wrote:
> For vendor specific code to support superpages we need to be able to
> deal with a superpage mapping replacing an intermediate page table (or
> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
> needed to free individual page tables while a domain is still alive.
> Since the freeing needs to be deferred until after a suitable IOTLB
> flush was performed, released page tables get queued for processing by a
> tasklet.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I was considering whether to use a softirq-tasklet instead. This would
> have the benefit of avoiding extra scheduling operations, but come with
> the risk of the freeing happening prematurely because of a
> process_pending_softirqs() somewhere.

The main one that comes to mind would be the debug keys and their usage
of process_pending_softirqs that could interfere with iommu unmaps, so
I guess if only for that reason it's best to run in idle vcpu context.

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -12,6 +12,7 @@
>   * this program; If not, see <http://www.gnu.org/licenses/>.
>   */
>  
> +#include <xen/cpu.h>
>  #include <xen/sched.h>
>  #include <xen/iommu.h>
>  #include <xen/paging.h>
> @@ -463,6 +464,85 @@ struct page_info *iommu_alloc_pgtable(st
>      return pg;
>  }
>  
> +/*
> + * Intermediate page tables which get replaced by large pages may only be
> + * freed after a suitable IOTLB flush. Hence such pages get queued on a
> + * per-CPU list, with a per-CPU tasklet processing the list on the assumption
> + * that the necessary IOTLB flush will have occurred by the time tasklets get
> + * to run. (List and tasklet being per-CPU has the benefit of accesses not
> + * requiring any locking.)
> + */
> +static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
> +static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
> +
> +static void free_queued_pgtables(void *arg)
> +{
> +    struct page_list_head *list = arg;
> +    struct page_info *pg;
> +
> +    while ( (pg = page_list_remove_head(list)) )
> +        free_domheap_page(pg);

Should you add a preempt check here to yield and schedule the tasklet
again, in order to be able to process any pending work?

Maybe just calling process_pending_softirqs would be enough?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 11/18] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
  2021-12-10 12:59     ` Jan Beulich
@ 2021-12-10 13:53       ` Roger Pau Monné
  0 siblings, 0 replies; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-10 13:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Fri, Dec 10, 2021 at 01:59:02PM +0100, Jan Beulich wrote:
> On 10.12.2021 13:05, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:51:40AM +0200, Jan Beulich wrote:
> >> In order to free intermediate page tables when replacing smaller
> >> mappings by a single larger one callers will need to know the full PTE.
> >> Flush indicators can be derived from this in the callers (and outside
> >> the locked regions). First split set_iommu_pte_present() from
> >> set_iommu_ptes_present(): Only the former needs to return the old PTE,
> >> while the latter (like also set_iommu_pde_present()) doesn't even need
> >> to return flush indicators. Then change return types/values and callers
> >> accordingly.
> > 
> > Without looking at further patches I would say you only care to know
> > whether the old PTE was present (ie: pr bit set), at which point those
> > functions could also return a boolean instead of a full PTE?
> 
> But looking at further patches will reveal that I then also need the
> next_level field from the old PTE (to tell superpages from page tables).

Oh, OK. I was expecting something like that.

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

I wouldn't mind if you added a note to the commit message that the
full PTE is returned because new callers will require more data.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings
  2021-09-24  9:52 ` [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings Jan Beulich
@ 2021-12-10 15:06   ` Roger Pau Monné
  2021-12-13  8:49     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-10 15:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Paul Durrant, Andrew Cooper, George Dunlap,
	Ian Jackson, Julien Grall, Stefano Stabellini, Wei Liu

On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
> No separate feature flags exist which would control availability of
> these; the only restriction is HATS (establishing the maximum number of
> page table levels in general), and even that has a lower bound of 4.
> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> non-default page sizes the implementation in principle permits arbitrary
> size mappings, but these require multiple identical leaf PTEs to be
> written, which isn't all that different from having to write multiple
> consecutive PTEs with increasing frame numbers. IMO that's therefore
> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> such hardware.)

Also replacing/shattering such non-standard page sizes will require
more logic, so unless there's a performance benefit I would just skip
using them.

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> freeing of intermediate page tables can take quite a while when
> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
> there's no present code path via which 512G chunks of memory could be
> allocated (and hence mapped) anyway.

I would limit to 1G, which is what we support for CPU page tables
also.

> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
>  }
>  
>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
> -                                                   unsigned long dfn)
> +                                                   unsigned long dfn,
> +                                                   unsigned int level)
>  {
>      union amd_iommu_pte *table, *pte, old;
>  
>      table = map_domain_page(_mfn(l1_mfn));
> -    pte = &table[pfn_to_pde_idx(dfn, 1)];
> +    pte = &table[pfn_to_pde_idx(dfn, level)];
>      old = *pte;
>  
>      write_atomic(&pte->raw, 0);
> @@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
>      return 0;
>  }
>  
> +static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)

Nit: should the last parameter be named level rather than next_level?
AFAICT it's the level of the mfn parameter.

Should we also assert that level (or next_level) is always != 0 for
extra safety?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables
  2021-12-10 13:51   ` Roger Pau Monné
@ 2021-12-13  8:38     ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-13  8:38 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 10.12.2021 14:51, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:48:21AM +0200, Jan Beulich wrote:
>> For vendor specific code to support superpages we need to be able to
>> deal with a superpage mapping replacing an intermediate page table (or
>> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
>> needed to free individual page tables while a domain is still alive.
>> Since the freeing needs to be deferred until after a suitable IOTLB
>> flush was performed, released page tables get queued for processing by a
>> tasklet.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I was considering whether to use a softirq-tasklet instead. This would
>> have the benefit of avoiding extra scheduling operations, but come with
>> the risk of the freeing happening prematurely because of a
>> process_pending_softirqs() somewhere.
> 
> The main one that comes to mind would be the debug keys and their usage
> of process_pending_softirqs that could interfere with iommu unmaps, so
> I guess if only for that reason it's best to run in idle vcpu context.

IOW you support the choice made.

>> --- a/xen/drivers/passthrough/x86/iommu.c
>> +++ b/xen/drivers/passthrough/x86/iommu.c
>> @@ -12,6 +12,7 @@
>>   * this program; If not, see <http://www.gnu.org/licenses/>.
>>   */
>>  
>> +#include <xen/cpu.h>
>>  #include <xen/sched.h>
>>  #include <xen/iommu.h>
>>  #include <xen/paging.h>
>> @@ -463,6 +464,85 @@ struct page_info *iommu_alloc_pgtable(st
>>      return pg;
>>  }
>>  
>> +/*
>> + * Intermediate page tables which get replaced by large pages may only be
>> + * freed after a suitable IOTLB flush. Hence such pages get queued on a
>> + * per-CPU list, with a per-CPU tasklet processing the list on the assumption
>> + * that the necessary IOTLB flush will have occurred by the time tasklets get
>> + * to run. (List and tasklet being per-CPU has the benefit of accesses not
>> + * requiring any locking.)
>> + */
>> +static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
>> +static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
>> +
>> +static void free_queued_pgtables(void *arg)
>> +{
>> +    struct page_list_head *list = arg;
>> +    struct page_info *pg;
>> +
>> +    while ( (pg = page_list_remove_head(list)) )
>> +        free_domheap_page(pg);
> 
> Should you add a preempt check here to yield and schedule the tasklet
> again, in order to be able to process any pending work?

I did ask myself this question, yes, but ...

> Maybe just calling process_pending_softirqs would be enough?

... I think I didn't consider this as a possible simpler variant (compared
to re-scheduling the tasklet). Let me add such - I agree that this should
be sufficient.
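
I.e. just as a sketch of what I mean to fold in (the exact placement of
the call may still change):

static void free_queued_pgtables(void *arg)
{
    struct page_list_head *list = arg;
    struct page_info *pg;

    while ( (pg = page_list_remove_head(list)) )
    {
        free_domheap_page(pg);
        /* Keep softirqs serviced while draining the list. */
        process_pending_softirqs();
    }
}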

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings
  2021-12-10 15:06   ` Roger Pau Monné
@ 2021-12-13  8:49     ` Jan Beulich
  2021-12-13  9:45       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-13  8:49 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Paul Durrant, Andrew Cooper, George Dunlap,
	Ian Jackson, Julien Grall, Stefano Stabellini, Wei Liu

On 10.12.2021 16:06, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
>> ---
>> I'm not fully sure about allowing 512G mappings: The scheduling-for-
>> freeing of intermediate page tables can take quite a while when
>> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
>> there's no present code path via which 512G chunks of memory could be
>> allocated (and hence mapped) anyway.
> 
> I would limit to 1G, which is what we support for CPU page tables
> also.

I'm not sure I buy comparing with CPU side support when not sharing
page tables. Not the least with PV in mind.

>> @@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
>>      return 0;
>>  }
>>  
>> +static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
> 
> Nit: should the last parameter be named level rather than next_level?
> AFAICT it's the level of the mfn parameter.

Yeah, might make sense.

> Should we also assert that level (or next_level) is always != 0 for
> extra safety?

As said elsewhere - if this wasn't a static helper, I'd agree. But all
call sites have respective conditionals around the call. If anything
I'd move those checks into the function (but only if you think that
would improve things, as to me having them at the call sites is more
logical).

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings
  2021-12-13  8:49     ` Jan Beulich
@ 2021-12-13  9:45       ` Roger Pau Monné
  2021-12-13 10:00         ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-13  9:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Paul Durrant, Andrew Cooper, George Dunlap,
	Ian Jackson, Julien Grall, Stefano Stabellini, Wei Liu

On Mon, Dec 13, 2021 at 09:49:50AM +0100, Jan Beulich wrote:
> On 10.12.2021 16:06, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
> >> ---
> >> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> >> freeing of intermediate page tables can take quite a while when
> >> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
> >> there's no present code path via which 512G chunks of memory could be
> >> allocated (and hence mapped) anyway.
> > 
> > I would limit to 1G, which is what we support for CPU page tables
> > also.
> 
> I'm not sure I buy comparing with CPU side support when not sharing
> page tables. Not the least with PV in mind.

Hm, my thinking was that similar reasons that don't allow us to do
512G mappings for the CPU side would also apply to IOMMU. Regardless
of that, given the current way in which replaced page table entries
are freed, I'm not sure it's fine to allow 512G mappings as the
freeing of the possible huge amount of 4K entries could allow guests
to hog a CPU for a long time.

It would be better if we could somehow account this in a per-vCPU way,
kind of similar to what we do with vPCI BAR mappings.

> > Should we also assert that level (or next_level) is always != 0 for
> > extra safety?
> 
> As said elsewhere - if this wasn't a static helper, I'd agree. But all
> call sites have respective conditionals around the call. If anything
> I'd move those checks into the function (but only if you think that
> would improve things, as to me having them at the call sites is more
> logical).

I'm fine with leaving the checks in the callers; it was just a suggestion
in case we gain new callers that forget to do the checks themselves.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings
  2021-12-13  9:45       ` Roger Pau Monné
@ 2021-12-13 10:00         ` Jan Beulich
  2021-12-13 10:33           ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-13 10:00 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Paul Durrant, Andrew Cooper, George Dunlap,
	Ian Jackson, Julien Grall, Stefano Stabellini, Wei Liu

On 13.12.2021 10:45, Roger Pau Monné wrote:
> On Mon, Dec 13, 2021 at 09:49:50AM +0100, Jan Beulich wrote:
>> On 10.12.2021 16:06, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
>>>> ---
>>>> I'm not fully sure about allowing 512G mappings: The scheduling-for-
>>>> freeing of intermediate page tables can take quite a while when
>>>> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
>>>> there's no present code path via which 512G chunks of memory could be
>>>> allocated (and hence mapped) anyway.
>>>
>>> I would limit to 1G, which is what we support for CPU page tables
>>> also.
>>
>> I'm not sure I buy comparing with CPU side support when not sharing
>> page tables. Not the least with PV in mind.
> 
> Hm, my thinking was that similar reasons that don't allow us to do
> 512G mappings for the CPU side would also apply to IOMMU. Regardless
> of that, given the current way in which replaced page table entries
> are freed, I'm not sure it's fine to allow 512G mappings as the
> freeing of the possible huge amount of 4K entries could allow guests
> to hog a CPU for a long time.

This huge amount can occur only when replacing a hierarchy with
sufficiently many 4k leaves by a single 512G page. Yet there's no
way - afaics - that such an operation can be initiated right now.
That's, as said in the remark, because there's no way to allocate
a 512G chunk of memory in one go. When re-coalescing, the worst
that can happen is one L1 table worth of 4k mappings, one L2
table worth of 2M mappings, and one L3 table worth of 1G mappings.
All other mappings already need to have been superpage ones at the
time of the checking. Hence the total upper bound (for the
enclosing map / unmap) is again primarily determined by there not
being any way to establish 512G mappings.

Actually, thinking about it, there's one path where 512G mappings
could be established, but that's Dom0-reachable only
(XEN_DOMCTL_memory_mapping) and would assume gigantic BARs in a
PCI device. Even if such a device existed, I think we're fine to
assume that Dom0 won't establish such mappings to replace
existing ones, but only ever put them in place when nothing was
mapped in that range yet.

> It would be better if we could somehow account this in a per-vCPU way,
> kind of similar to what we do with vPCI BAR mappings.

But recording them per-vCPU wouldn't make any difference to the
number of pages that could accumulate in a single run. Maybe I'm
missing something in what you're thinking about here ...

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings
  2021-12-13 10:00         ` Jan Beulich
@ 2021-12-13 10:33           ` Roger Pau Monné
  2021-12-13 10:41             ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-13 10:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Paul Durrant, Andrew Cooper, George Dunlap,
	Ian Jackson, Julien Grall, Stefano Stabellini, Wei Liu

On Mon, Dec 13, 2021 at 11:00:23AM +0100, Jan Beulich wrote:
> On 13.12.2021 10:45, Roger Pau Monné wrote:
> > On Mon, Dec 13, 2021 at 09:49:50AM +0100, Jan Beulich wrote:
> >> On 10.12.2021 16:06, Roger Pau Monné wrote:
> >>> On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
> >>>> ---
> >>>> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> >>>> freeing of intermediate page tables can take quite a while when
> >>>> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
> >>>> there's no present code path via which 512G chunks of memory could be
> >>>> allocated (and hence mapped) anyway.
> >>>
> >>> I would limit to 1G, which is what we support for CPU page tables
> >>> also.
> >>
> >> I'm not sure I buy comparing with CPU side support when not sharing
> >> page tables. Not the least with PV in mind.
> > 
> > Hm, my thinking was that similar reasons that don't allow us to do
> > 512G mappings for the CPU side would also apply to IOMMU. Regardless
> > of that, given the current way in which replaced page table entries
> > are freed, I'm not sure it's fine to allow 512G mappings as the
> > freeing of the possible huge amount of 4K entries could allow guests
> > to hog a CPU for a long time.
> 
> This huge amount can occur only when replacing a hierarchy with
> sufficiently many 4k leaves by a single 512G page. Yet there's no
> way - afaics - that such an operation can be initiated right now.
> That's, as said in the remark, because there's no way to allocate
> a 512G chunk of memory in one go. When re-coalescing, the worst
> that can happen is one L1 table worth of 4k mappings, one L2
> table worth of 2M mappings, and one L3 table worth of 1G mappings.
> All other mappings already need to have been superpage ones at the
> time of the checking. Hence the total upper bound (for the
> enclosing map / unmap) is again primarily determined by there not
> being any way to establish 512G mappings.
> 
> Actually, thinking about it, there's one path where 512G mappings
> could be established, but that's Dom0-reachable only
> (XEN_DOMCTL_memory_mapping) and would assume gigantic BARs in a
> PCI device. Even if such a device existed, I think we're fine to
> assume that Dom0 won't establish such mappings to replace
> existing ones, but only ever put them in place when nothing was
> mapped in that range yet.
> 
> > It would be better if we could somehow account this in a per-vCPU way,
> > kind of similar to what we do with vPCI BAR mappings.
> 
> But recording them per-vCPU wouldn't make any difference to the
> number of pages that could accumulate in a single run. Maybe I'm
> missing something in what you're thinking about here ...

If Xen somehow did the free in guest vCPU context before resuming
guest execution then you could yield when events are pending and thus
allow other guests to run without hogging the pCPU, and the freeing
would be accounted to vCPU sched slice time.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings
  2021-12-13 10:33           ` Roger Pau Monné
@ 2021-12-13 10:41             ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-13 10:41 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Paul Durrant, Andrew Cooper, George Dunlap,
	Ian Jackson, Julien Grall, Stefano Stabellini, Wei Liu

On 13.12.2021 11:33, Roger Pau Monné wrote:
> On Mon, Dec 13, 2021 at 11:00:23AM +0100, Jan Beulich wrote:
>> On 13.12.2021 10:45, Roger Pau Monné wrote:
>>> It would be better if we could somehow account this in a per-vCPU way,
>>> kind of similar to what we do with vPCI BAR mappings.
>>
>> But recording them per-vCPU wouldn't make any difference to the
>> number of pages that could accumulate in a single run. Maybe I'm
>> missing something in what you're thinking about here ...
> 
> If Xen somehow did the free in guest vCPU context before resuming
> guest execution then you could yield when events are pending and thus
> allow other guests to run without hogging the pCPU, and the freeing
> would be accounted to vCPU sched slice time.

That's an interesting thought. Shouldn't be difficult to arrange for
HVM (from {svm,vmx}_vmenter_helper()), but I can't immediately see a
clean way of having the same for PV (short of an ad hoc call out of
assembly code somewhere after test_all_events).

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 13/18] VT-d: allow use of superpage mappings
  2021-09-24  9:52 ` [PATCH v2 13/18] VT-d: " Jan Beulich
@ 2021-12-13 11:54   ` Roger Pau Monné
  2021-12-13 13:39     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-13 11:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Fri, Sep 24, 2021 at 11:52:47AM +0200, Jan Beulich wrote:
> ... depending on feature availability (and absence of quirks).
> 
> Also make the page table dumping function aware of superpages.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Just some minor nits.

> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -743,18 +743,37 @@ static int __must_check iommu_flush_iotl
>      return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
>  }
>  
> +static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)

Same comment as the AMD side patch, about naming the parameter just
level.

> @@ -1901,13 +1926,15 @@ static int __must_check intel_iommu_map_
>      }
>  
>      page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
> -    pte = &page[dfn_x(dfn) & LEVEL_MASK];
> +    pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
>      old = *pte;
>  
>      dma_set_pte_addr(new, mfn_to_maddr(mfn));
>      dma_set_pte_prot(new,
>                       ((flags & IOMMUF_readable) ? DMA_PTE_READ  : 0) |
>                       ((flags & IOMMUF_writable) ? DMA_PTE_WRITE : 0));
> +    if ( IOMMUF_order(flags) )

You seem to use level > 1 in other places to check for whether the
entry is intended to be a super-page. Is there any reason to use
IOMMUF_order here instead?


> @@ -2328,6 +2361,11 @@ static int __init vtd_setup(void)
>                 cap_sps_2mb(iommu->cap) ? ", 2MB" : "",
>                 cap_sps_1gb(iommu->cap) ? ", 1GB" : "");
>  
> +        if ( !cap_sps_2mb(iommu->cap) )
> +            large_sizes &= ~PAGE_SIZE_2M;
> +        if ( !cap_sps_1gb(iommu->cap) )
> +            large_sizes &= ~PAGE_SIZE_1G;
> +
>  #ifndef iommu_snoop
>          if ( iommu_snoop && !ecap_snp_ctl(iommu->ecap) )
>              iommu_snoop = false;
> @@ -2399,6 +2437,9 @@ static int __init vtd_setup(void)
>      if ( ret )
>          goto error;
>  
> +    ASSERT(iommu_ops.page_sizes & PAGE_SIZE_4K);

Since you are adding the assert, it might be more complete to check no
other page sizes are set, iommu_ops.page_sizes == PAGE_SIZE_4K?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 13/18] VT-d: allow use of superpage mappings
  2021-12-13 11:54   ` Roger Pau Monné
@ 2021-12-13 13:39     ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-13 13:39 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 13.12.2021 12:54, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:52:47AM +0200, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/vtd/iommu.c
>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>> @@ -743,18 +743,37 @@ static int __must_check iommu_flush_iotl
>>      return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
>>  }
>>  
>> +static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
> 
> Same comment as the AMD side patch, about naming the parameter just
> level.

Sure, will change.

>> @@ -1901,13 +1926,15 @@ static int __must_check intel_iommu_map_
>>      }
>>  
>>      page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
>> -    pte = &page[dfn_x(dfn) & LEVEL_MASK];
>> +    pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
>>      old = *pte;
>>  
>>      dma_set_pte_addr(new, mfn_to_maddr(mfn));
>>      dma_set_pte_prot(new,
>>                       ((flags & IOMMUF_readable) ? DMA_PTE_READ  : 0) |
>>                       ((flags & IOMMUF_writable) ? DMA_PTE_WRITE : 0));
>> +    if ( IOMMUF_order(flags) )
> 
> You seem to use level > 1 in other places to check for whether the
> entry is intended to be a super-page. Is there any reason to use
> IOMMUF_order here instead?

"flags" is the original source of information here, so it seemed more
natural to use it. The following hunk uses "level > 1" to better
match the similar unmap logic as well as AMD code. Maybe I should
change those to also use "flags" (or "order" in the unmap case), as
that would allow re-using the local variable in the new patches in v3
doing the re-coalescing of present superpages (right now I'm using a
second, not very nicely named variable there).

I'll have to think about this some and check whether there are other
issues if I made such a change.

>> @@ -2328,6 +2361,11 @@ static int __init vtd_setup(void)
>>                 cap_sps_2mb(iommu->cap) ? ", 2MB" : "",
>>                 cap_sps_1gb(iommu->cap) ? ", 1GB" : "");
>>  
>> +        if ( !cap_sps_2mb(iommu->cap) )
>> +            large_sizes &= ~PAGE_SIZE_2M;
>> +        if ( !cap_sps_1gb(iommu->cap) )
>> +            large_sizes &= ~PAGE_SIZE_1G;
>> +
>>  #ifndef iommu_snoop
>>          if ( iommu_snoop && !ecap_snp_ctl(iommu->ecap) )
>>              iommu_snoop = false;
>> @@ -2399,6 +2437,9 @@ static int __init vtd_setup(void)
>>      if ( ret )
>>          goto error;
>>  
>> +    ASSERT(iommu_ops.page_sizes & PAGE_SIZE_4K);
> 
> Since you are adding the assert, it might be more complete to check no
> other page sizes are set, iommu_ops.page_sizes == PAGE_SIZE_4K?

Ah yes, would make sense. Let me change this.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-09-24  9:53 ` [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one" Jan Beulich
@ 2021-12-13 15:04   ` Roger Pau Monné
  2021-12-14  9:06     ` Jan Beulich
  2021-12-15 15:28   ` Oleksandr
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-13 15:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis,
	Rahul Singh

On Fri, Sep 24, 2021 at 11:53:59AM +0200, Jan Beulich wrote:
> Having a separate flush-all hook has always been puzzling me some. We
> will want to be able to force a full flush via accumulated flush flags
> from the map/unmap functions. Introduce a respective new flag and fold
> all flush handling to use the single remaining hook.
> 
> Note that because of the respective comments in SMMU and IPMMU-VMSA
> code, I've folded the two prior hook functions into one. For SMMU-v3,
> which lacks a comment towards incapable hardware, I've left both
> functions in place on the assumption that selective and full flushes
> will eventually want separating.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Just one nit I think.

> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -731,18 +731,21 @@ static int __must_check iommu_flush_iotl
>                                                  unsigned long page_count,
>                                                  unsigned int flush_flags)
>  {
> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> -    ASSERT(flush_flags);
> +    if ( flush_flags & IOMMU_FLUSHF_all )
> +    {
> +        dfn = INVALID_DFN;
> +        page_count = 0;

Don't we expect callers to already pass an invalid dfn and a 0 page
count when doing a full flush?

In the equivalent AMD code you didn't set those for the FLUSHF_all
case.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-09-24  9:54 ` [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables Jan Beulich
@ 2021-12-13 15:51   ` Roger Pau Monné
  2021-12-14  9:15     ` Jan Beulich
  2021-12-14 14:50   ` Roger Pau Monné
  2021-12-14 15:06   ` Roger Pau Monné
  2 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-13 15:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
> Page tables are used for two purposes after allocation: They either start
> out all empty, or they get filled to replace a superpage. Subsequently,
> to replace all empty or fully contiguous page tables, contiguous sub-
> regions will be recorded within individual page tables. Install the
> initial set of markers immediately after allocation. Make sure to retain
> these markers when further populating a page table in preparation for it
> to replace a superpage.
> 
> The markers are simply 4-bit fields holding the order value of
> contiguous entries. To demonstrate this, if a page table had just 16
> entries, this would be the initial (fully contiguous) set of markers:
> 
> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
> 
> "Contiguous" here means not only present entries with successively
> increasing MFNs, each one suitably aligned for its slot, but also a
> respective number of all non-present entries.

I'm afraid I'm slightly lost with all this, please bear with me. Is
this just a performance improvement when doing super-page
replacements, or there's more to it?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-12-13 15:04   ` Roger Pau Monné
@ 2021-12-14  9:06     ` Jan Beulich
  2021-12-14  9:27       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-14  9:06 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis,
	Rahul Singh

On 13.12.2021 16:04, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:53:59AM +0200, Jan Beulich wrote:
>> Having a separate flush-all hook has always been puzzling me some. We
>> will want to be able to force a full flush via accumulated flush flags
>> from the map/unmap functions. Introduce a respective new flag and fold
>> all flush handling to use the single remaining hook.
>>
>> Note that because of the respective comments in SMMU and IPMMU-VMSA
>> code, I've folded the two prior hook functions into one. For SMMU-v3,
>> which lacks a comment towards incapable hardware, I've left both
>> functions in place on the assumption that selective and full flushes
>> will eventually want separating.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Just one nit I think.
> 
>> --- a/xen/drivers/passthrough/vtd/iommu.c
>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>> @@ -731,18 +731,21 @@ static int __must_check iommu_flush_iotl
>>                                                  unsigned long page_count,
>>                                                  unsigned int flush_flags)
>>  {
>> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
>> -    ASSERT(flush_flags);
>> +    if ( flush_flags & IOMMU_FLUSHF_all )
>> +    {
>> +        dfn = INVALID_DFN;
>> +        page_count = 0;
> 
> Don't we expect callers to already pass an invalid dfn and a 0 page
> count when doing a full flush?

I didn't want to introduce such a requirement. The two arguments should
imo be don't-cares with IOMMU_FLUSHF_all, such that callers handing on
(or accumulating) flags don't need to apply extra care.

> In the equivalent AMD code you didn't set those for the FLUSHF_all
> case.

There's no similar dependency there in AMD code. For VT-d,
iommu_flush_iotlb() needs at least one of the two set this way to
actually do a full-address-space flush. (Which, as an aside, I've
recently learned is supposedly wrong when cap_isoch() returns true. But
that's an orthogonal issue, although it may be possible to deal with it
at the same time as, down the road, limiting the overly aggressive
flushing done by subsequent patches using this new flag.)

I could be talked into setting just page_count to zero here.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-12-13 15:51   ` Roger Pau Monné
@ 2021-12-14  9:15     ` Jan Beulich
  2021-12-14 11:41       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-14  9:15 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 13.12.2021 16:51, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
>> Page tables are used for two purposes after allocation: They either start
>> out all empty, or they get filled to replace a superpage. Subsequently,
>> to replace all empty or fully contiguous page tables, contiguous sub-
>> regions will be recorded within individual page tables. Install the
>> initial set of markers immediately after allocation. Make sure to retain
>> these markers when further populating a page table in preparation for it
>> to replace a superpage.
>>
>> The markers are simply 4-bit fields holding the order value of
>> contiguous entries. To demonstrate this, if a page table had just 16
>> entries, this would be the initial (fully contiguous) set of markers:
>>
>> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
>> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
>>
>> "Contiguous" here means not only present entries with successively
>> increasing MFNs, each one suitably aligned for its slot, but also a
>> respective number of all non-present entries.
> 
> I'm afraid I'm slightly lost with all this, please bear with me. Is
> this just a performance improvement when doing super-page
> replacements, or there's more to it?

What I wanted to strictly avoid is to have to scan entire pages for
contiguity (i.e. on average touching half a page), like e.g.
map_pages_to_xen() and modify_xen_mappings() do. Hence I tried to
find a scheme where for any individual update only a predictably
very limited number of entries need inspecting (some of these then
of course also need updating).

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-12-14  9:06     ` Jan Beulich
@ 2021-12-14  9:27       ` Roger Pau Monné
  0 siblings, 0 replies; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-14  9:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis,
	Rahul Singh

On Tue, Dec 14, 2021 at 10:06:37AM +0100, Jan Beulich wrote:
> On 13.12.2021 16:04, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:53:59AM +0200, Jan Beulich wrote:
> >> Having a separate flush-all hook has always been puzzling me some. We
> >> will want to be able to force a full flush via accumulated flush flags
> >> from the map/unmap functions. Introduce a respective new flag and fold
> >> all flush handling to use the single remaining hook.
> >>
> >> Note that because of the respective comments in SMMU and IPMMU-VMSA
> >> code, I've folded the two prior hook functions into one. For SMMU-v3,
> >> which lacks a comment towards incapable hardware, I've left both
> >> functions in place on the assumption that selective and full flushes
> >> will eventually want separating.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> >> --- a/xen/drivers/passthrough/vtd/iommu.c
> >> +++ b/xen/drivers/passthrough/vtd/iommu.c
> >> @@ -731,18 +731,21 @@ static int __must_check iommu_flush_iotl
> >>                                                  unsigned long page_count,
> >>                                                  unsigned int flush_flags)
> >>  {
> >> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> >> -    ASSERT(flush_flags);
> >> +    if ( flush_flags & IOMMU_FLUSHF_all )
> >> +    {
> >> +        dfn = INVALID_DFN;
> >> +        page_count = 0;
> > 
> > Don't we expect callers to already pass an invalid dfn and a 0 page
> > count when doing a full flush?
> 
> I didn't want to introduce such a requirement. The two arguments should
> imo be don't-cares with IOMMU_FLUSHF_all, such that callers handing on
> (or accumulating) flags don't need to apply extra care.
> 
> > In the equivalent AMD code you didn't set those for the FLUSHF_all
> > case.
> 
> There's no similar dependency there in AMD code. For VT-d,
> iommu_flush_iotlb() needs at least one of the two set this way to
> actually do a full-address-space flush. (Which, as an aside, I've
> recently learned is supposedly wrong when cap_isoch() returns true. But
> that's an orthogonal issue, albeit it may be possible to deal with at
> the same time as, down the road, limiting the too aggressive flushing
> done by subsequent patches using this new flag.)

I see. The AMD flush helper gets the flags as a parameter (because
the flush-all function is a wrapper around the flush-pages one), so
there's no need to signal a full flush using the other parameters.

> I could be talked into setting just page_count to zero here.

No, I think it's fine.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-12-14  9:15     ` Jan Beulich
@ 2021-12-14 11:41       ` Roger Pau Monné
  2021-12-14 11:48         ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-14 11:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Tue, Dec 14, 2021 at 10:15:37AM +0100, Jan Beulich wrote:
> On 13.12.2021 16:51, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
> >> Page tables are used for two purposes after allocation: They either start
> >> out all empty, or they get filled to replace a superpage. Subsequently,
> >> to replace all empty or fully contiguous page tables, contiguous sub-
> >> regions will be recorded within individual page tables. Install the
> >> initial set of markers immediately after allocation. Make sure to retain
> >> these markers when further populating a page table in preparation for it
> >> to replace a superpage.
> >>
> >> The markers are simply 4-bit fields holding the order value of
> >> contiguous entries. To demonstrate this, if a page table had just 16
> >> entries, this would be the initial (fully contiguous) set of markers:
> >>
> >> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
> >> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

Could you expand a bit more on this explanation?

I don't get how such markers are used, or how they relate to the page
table entries. I guess the point is to note whether entries are
populated, and whether such populated entries are contiguous?

Could you provide a more visual example maybe, about what would go
into each relevant page table entry, including the marker
information?

I would like to understand this instead of trying to figure it out from
the code (as then I could be making wrong assumptions).

> >>
> >> "Contiguous" here means not only present entries with successively
> >> increasing MFNs, each one suitably aligned for its slot, but also a
> >> respective number of all non-present entries.
> > 
> > I'm afraid I'm slightly lost with all this, please bear with me. Is
> > this just a performance improvement when doing super-page
> > replacements, or there's more to it?
> 
> What I wanted to strictly avoid is to have to scan entire pages for
> contiguity (i.e. on average touching half a page), like e.g.
> map_pages_to_xen() and modify_xen_mappings() do. Hence I tried to
> find a scheme where for any individual update only a predictably
> very limited number of entries need inspecting (some of these then
> of course also need updating).

Thanks. So there's some extra cost here of having to update those
markers when a page table entry is modified.

Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-12-14 11:41       ` Roger Pau Monné
@ 2021-12-14 11:48         ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-14 11:48 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 14.12.2021 12:41, Roger Pau Monné wrote:
> On Tue, Dec 14, 2021 at 10:15:37AM +0100, Jan Beulich wrote:
>> On 13.12.2021 16:51, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
>>>> Page tables are used for two purposes after allocation: They either start
>>>> out all empty, or they get filled to replace a superpage. Subsequently,
>>>> to replace all empty or fully contiguous page tables, contiguous sub-
>>>> regions will be recorded within individual page tables. Install the
>>>> initial set of markers immediately after allocation. Make sure to retain
>>>> these markers when further populating a page table in preparation for it
>>>> to replace a superpage.
>>>>
>>>> The markers are simply 4-bit fields holding the order value of
>>>> contiguous entries. To demonstrate this, if a page table had just 16
>>>> entries, this would be the initial (fully contiguous) set of markers:
>>>>
>>>> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
>>>> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
> 
> Could you expand a bit more on this explanation?
> 
> I don't get how such markers are used, or how they relate to the page
> table entries. I guess the point is to note whether entries are
> populated, and whether such populated entries are contiguous?
> 
> Could you provide a more visual example maybe, about what would go
> into each relevant page table entry, including the marker
> information?

I'm not sure I understand what you're after. The markers say "This
2^marker-aligned range is contiguous" (including the case of
contiguously clear). And they go into a vendor-dependent ignored
4-bit field in each PTE. (Obviously odd-numbered PTEs won't ever
be updated from holding a zero marker.)

An intermediate page table is eligible for replacement when the
marker of entry 0 is 9.
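
Or, to put the rule into (purely illustrative) code - init_marker() is a
made-up name just for this mail; the real logic lives in the patch's
iommu_alloc_pgtable():

static unsigned int init_marker(unsigned int i)
{
    /* Order of the largest suitably aligned contiguous block starting
       at slot i of a 512-entry table; slot 0 covers the whole table. */
    return i ? find_first_set_bit(i) : PAGE_SHIFT - 3;
}

For the 16-entry example in the description slot 0 would hold 4 rather
than 9, simply because such a table has only 2^4 entries.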

>>>> "Contiguous" here means not only present entries with successively
>>>> increasing MFNs, each one suitably aligned for its slot, but also a
>>>> respective number of all non-present entries.
>>>
>>> I'm afraid I'm slightly lost with all this, please bear with me. Is
>>> this just a performance improvement when doing super-page
>>> replacements, or there's more to it?
>>
>> What I wanted to strictly avoid is to have to scan entire pages for
>> contiguity (i.e. on average touching half a page), like e.g.
>> map_pages_to_xen() and modify_xen_mappings() do. Hence I tried to
>> find a scheme where for any individual update only a predictably
>> very limited number of entries need inspecting (some of these then
>> of course also need updating).
> 
> Thanks. So there's some extra cost here of having to update those
> markers when a page table entry is modified.

Well, yes, in order to re-coalesce _some_ extra cost is to be paid in
any event.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-09-24  9:54 ` [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables Jan Beulich
  2021-12-13 15:51   ` Roger Pau Monné
@ 2021-12-14 14:50   ` Roger Pau Monné
  2021-12-14 15:05     ` Jan Beulich
  2021-12-14 15:06   ` Roger Pau Monné
  2 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-14 14:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
> Page tables are used for two purposes after allocation: They either start
> out all empty, or they get filled to replace a superpage. Subsequently,
> to replace all empty or fully contiguous page tables, contiguous sub-
> regions will be recorded within individual page tables. Install the
> initial set of markers immediately after allocation. Make sure to retain
> these markers when further populating a page table in preparation for it
> to replace a superpage.
> 
> The markers are simply 4-bit fields holding the order value of
> contiguous entries. To demonstrate this, if a page table had just 16
> entries, this would be the initial (fully contiguous) set of markers:
> 
> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
> 
> "Contiguous" here means not only present entries with successively
> increasing MFNs, each one suitably aligned for its slot, but also a
> respective number of all non-present entries.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Obviously this marker only works for newly created page tables right
now; the moment we start poking holes or replacing entries, the marker
is not updated anymore. I expect further patches will expand on
this.

> ---
> An alternative to the ASSERT()s added to set_iommu_ptes_present() would
> be to make the function less general-purpose; it's used in a single
> place only after all (i.e. it might as well be folded into its only
> caller).
> ---
> v2: New.
> 
> --- a/xen/drivers/passthrough/amd/iommu-defs.h
> +++ b/xen/drivers/passthrough/amd/iommu-defs.h
> @@ -445,6 +445,8 @@ union amd_iommu_x2apic_control {
>  #define IOMMU_PAGE_TABLE_U32_PER_ENTRY	(IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
>  #define IOMMU_PAGE_TABLE_ALIGNMENT	4096
>  
> +#define IOMMU_PTE_CONTIG_MASK           0x1e /* The ign0 field below. */

Should you rename ign0 to contig_mask or some such now?

Same would apply to the comment next to dma_pte for VT-d, where bits
52:62 are ignored (the comment seems to be missing this already) and
we will be using bits 52:55 to store the contiguous mask for the
entry.

> +
>  union amd_iommu_pte {
>      uint64_t raw;
>      struct {
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -116,7 +116,19 @@ static void set_iommu_ptes_present(unsig
>  
>      while ( nr_ptes-- )
>      {
> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> +        ASSERT(!pde->next_level);
> +        ASSERT(!pde->u);
> +
> +        if ( pde > table )
> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> +        else
> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);

You could even special case (pde - table) % 2 != 0, but this is debug
only code, and it's possible a mod is more costly than
find_first_set_bit.

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -433,12 +433,12 @@ int iommu_free_pgtables(struct domain *d
>      return 0;
>  }
>  
> -struct page_info *iommu_alloc_pgtable(struct domain *d)
> +struct page_info *iommu_alloc_pgtable(struct domain *d, uint64_t contig_mask)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
>      unsigned int memflags = 0;
>      struct page_info *pg;
> -    void *p;
> +    uint64_t *p;
>  
>  #ifdef CONFIG_NUMA
>      if ( hd->node != NUMA_NO_NODE )
> @@ -450,7 +450,28 @@ struct page_info *iommu_alloc_pgtable(st
>          return NULL;
>  
>      p = __map_domain_page(pg);
> -    clear_page(p);
> +
> +    if ( contig_mask )
> +    {
> +        unsigned int i, shift = find_first_set_bit(contig_mask);
> +
> +        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
> +
> +        p[0] = (PAGE_SHIFT - 3ull) << shift;
> +        p[1] = 0;
> +        p[2] = 1ull << shift;
> +        p[3] = 0;
> +
> +        for ( i = 4; i < PAGE_SIZE / 8; i += 4 )
> +        {
> +            p[i + 0] = (find_first_set_bit(i) + 0ull) << shift;
> +            p[i + 1] = 0;
> +            p[i + 2] = 1ull << shift;
> +            p[i + 3] = 0;
> +        }

You could likely do:

for ( i = 0; i < PAGE_SIZE / 8; i += 4 )
{
    p[i + 0] = i ? ((find_first_set_bit(i) + 0ull) << shift)
                 : ((PAGE_SHIFT - 3ull) << shift);
    p[i + 1] = 0;
    p[i + 2] = 1ull << shift;
    p[i + 3] = 0;
}

To avoid having to open code the first loop iteration. The ternary
operator could also be nested before the shift, but I find that
harder to read.

> +    }
> +    else
> +        clear_page(p);
>  
>      if ( hd->platform_ops->sync_cache )
>          iommu_vcall(hd->platform_ops, sync_cache, p, PAGE_SIZE);
> --- a/xen/include/asm-x86/iommu.h
> +++ b/xen/include/asm-x86/iommu.h
> @@ -142,7 +142,8 @@ int pi_update_irte(const struct pi_desc
>  })
>  
>  int __must_check iommu_free_pgtables(struct domain *d);
> -struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
> +struct page_info *__must_check iommu_alloc_pgtable(struct domain *d,
> +                                                   uint64_t contig_mask);
>  void iommu_queue_free_pgtable(struct domain *d, struct page_info *pg);
>  
>  #endif /* !__ARCH_X86_IOMMU_H__ */
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-12-14 14:50   ` Roger Pau Monné
@ 2021-12-14 15:05     ` Jan Beulich
  2021-12-14 15:15       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-14 15:05 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 14.12.2021 15:50, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
>> Page tables are used for two purposes after allocation: They either start
>> out all empty, or they get filled to replace a superpage. Subsequently,
>> to replace all empty or fully contiguous page tables, contiguous sub-
>> regions will be recorded within individual page tables. Install the
>> initial set of markers immediately after allocation. Make sure to retain
>> these markers when further populating a page table in preparation for it
>> to replace a superpage.
>>
>> The markers are simply 4-bit fields holding the order value of
>> contiguous entries. To demonstrate this, if a page table had just 16
>> entries, this would be the initial (fully contiguous) set of markers:
>>
>> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
>> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
>>
>> "Contiguous" here means not only present entries with successively
>> increasing MFNs, each one suitably aligned for its slot, but also a
>> respective number of all non-present entries.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Obviously this marker only works for newly created page tables right
> now, the moment we start poking holes or replacing entries the marker
> is not updated anymore. I expect further patches will expand on
> this.

Well, until there's a consumer of the markers, there's no need to
keep them updated. That updating is indeed going to be added in
subsequent patches. I've merely tried to split off pieces that can
go on their own.

>> --- a/xen/drivers/passthrough/amd/iommu-defs.h
>> +++ b/xen/drivers/passthrough/amd/iommu-defs.h
>> @@ -445,6 +445,8 @@ union amd_iommu_x2apic_control {
>>  #define IOMMU_PAGE_TABLE_U32_PER_ENTRY	(IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
>>  #define IOMMU_PAGE_TABLE_ALIGNMENT	4096
>>  
>> +#define IOMMU_PTE_CONTIG_MASK           0x1e /* The ign0 field below. */
> 
> Should you rename ign0 to contig_mask or some such now?

Not sure. I guess I should attach a comment linking here.

> Same would apply to the comment next to dma_pte for VT-d, where bits
> 52:62 are ignored (the comment seems to be missing this already) and
> we will be using bits 52:55 to store the contiguous mask for the
> entry.

Same here then.

>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -116,7 +116,19 @@ static void set_iommu_ptes_present(unsig
>>  
>>      while ( nr_ptes-- )
>>      {
>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>> +        ASSERT(!pde->next_level);
>> +        ASSERT(!pde->u);
>> +
>> +        if ( pde > table )
>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
>> +        else
>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
> 
> You could even special case (pde - table) % 2 != 0, but this is debug
> only code, and it's possible a mod is more costly than
> find_first_set_bit.

Not sure why I would want to special case anything that doesn't need
special casing. The pde == table case needs special care because there
find_first_set_bit() cannot be called.
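
(If the special casing as such bothered anyone, the two ASSERT()s could
of course be folded into a single expression - shown merely as an
illustration, not something I intend to switch to:

    ASSERT(pde->ign0 == (pde > table ? find_first_set_bit(pde - table)
                                     : PAGE_SHIFT - 3));

The conditional operator keeps find_first_set_bit() away from a zero
argument just the same.)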

>> @@ -450,7 +450,28 @@ struct page_info *iommu_alloc_pgtable(st
>>          return NULL;
>>  
>>      p = __map_domain_page(pg);
>> -    clear_page(p);
>> +
>> +    if ( contig_mask )
>> +    {
>> +        unsigned int i, shift = find_first_set_bit(contig_mask);
>> +
>> +        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
>> +
>> +        p[0] = (PAGE_SHIFT - 3ull) << shift;
>> +        p[1] = 0;
>> +        p[2] = 1ull << shift;
>> +        p[3] = 0;
>> +
>> +        for ( i = 4; i < PAGE_SIZE / 8; i += 4 )
>> +        {
>> +            p[i + 0] = (find_first_set_bit(i) + 0ull) << shift;
>> +            p[i + 1] = 0;
>> +            p[i + 2] = 1ull << shift;
>> +            p[i + 3] = 0;
>> +        }
> 
> You could likely do:
> 
> for ( i = 0; i < PAGE_SIZE / 8; i += 4 )
> {
>     p[i + 0] = i ? ((find_first_set_bit(i) + 0ull) << shift)
>                  : ((PAGE_SHIFT - 3ull) << shift);
>     p[i + 1] = 0;
>     p[i + 2] = 1ull << shift;
>     p[i + 3] = 0;
> }
> 
> To avoid having to open code the first loop iteration.

I could, but I explicitly didn't want to. I consider conditionals
inside a loop which special case just the first (or last) iteration
quite odd (unless they really save a lot of duplication).

> The ternary
> operator could also be nested before the shift, but I find that
> harder to read.

If I were to make the change, then it would be that alternative way, as it
would allow avoiding the addition of 0ull:

    p[i + 0] = (i ? find_first_set_bit(i)
                  : PAGE_SHIFT - 3ull) << shift;

I could be talked into going that route (but not the intermediate
one you've suggested).
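
I.e. the full loop would then read (sketch only):

    for ( i = 0; i < PAGE_SIZE / 8; i += 4 )
    {
        p[i + 0] = (i ? find_first_set_bit(i)
                      : PAGE_SHIFT - 3ull) << shift;
        p[i + 1] = 0;
        p[i + 2] = 1ull << shift;
        p[i + 3] = 0;
    }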

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-09-24  9:54 ` [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables Jan Beulich
  2021-12-13 15:51   ` Roger Pau Monné
  2021-12-14 14:50   ` Roger Pau Monné
@ 2021-12-14 15:06   ` Roger Pau Monné
  2021-12-14 15:10     ` Jan Beulich
  2 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-14 15:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

Forgot to comment.

On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -238,7 +238,7 @@ int amd_iommu_alloc_root(struct domain *
>  
>      if ( unlikely(!hd->arch.amd.root_table) )
>      {
> -        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
> +        hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);

So root tables don't get markers setup...


>          if ( !hd->arch.amd.root_table )
>              return -ENOMEM;
>      }
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -297,7 +297,7 @@ static uint64_t addr_to_dma_page_maddr(s
>              goto out;
>  
>          pte_maddr = level;
> -        if ( !(pg = iommu_alloc_pgtable(domain)) )
> +        if ( !(pg = iommu_alloc_pgtable(domain, 0)) )

...likewise here.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-12-14 15:06   ` Roger Pau Monné
@ 2021-12-14 15:10     ` Jan Beulich
  2021-12-14 15:17       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-14 15:10 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 14.12.2021 16:06, Roger Pau Monné wrote:
> Forgot to comment.
> 
> On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -238,7 +238,7 @@ int amd_iommu_alloc_root(struct domain *
>>  
>>      if ( unlikely(!hd->arch.amd.root_table) )
>>      {
>> -        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
>> +        hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);
> 
> So root tables don't get markers setup...
> 
> 
>>          if ( !hd->arch.amd.root_table )
>>              return -ENOMEM;
>>      }
>> --- a/xen/drivers/passthrough/vtd/iommu.c
>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>> @@ -297,7 +297,7 @@ static uint64_t addr_to_dma_page_maddr(s
>>              goto out;
>>  
>>          pte_maddr = level;
>> -        if ( !(pg = iommu_alloc_pgtable(domain)) )
>> +        if ( !(pg = iommu_alloc_pgtable(domain, 0)) )
> 
> ...likewise here.

Yes. Plus quarantine domain's page tables also don't. Neither root
tables nor quarantine domain's are ever eligible for re-coalescing,
so there's no point having markers there.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-12-14 15:05     ` Jan Beulich
@ 2021-12-14 15:15       ` Roger Pau Monné
  2021-12-14 15:21         ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-14 15:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Tue, Dec 14, 2021 at 04:05:27PM +0100, Jan Beulich wrote:
> On 14.12.2021 15:50, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
> >> --- a/xen/drivers/passthrough/amd/iommu-defs.h
> >> +++ b/xen/drivers/passthrough/amd/iommu-defs.h
> >> @@ -445,6 +445,8 @@ union amd_iommu_x2apic_control {
> >>  #define IOMMU_PAGE_TABLE_U32_PER_ENTRY	(IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
> >>  #define IOMMU_PAGE_TABLE_ALIGNMENT	4096
> >>  
> >> +#define IOMMU_PTE_CONTIG_MASK           0x1e /* The ign0 field below. */
> > 
> > Should you rename ign0 to contig_mask or some such now?
> 
> Not sure. I guess I should attach a comment linking here.
> 
> > Same would apply to the comment next to dma_pte for VT-d, where bits
> > 52:62 are ignored (the comments seems to be missing this already) and
> > we will be using bits 52:55 to store the contiguous mask for the
> > entry.
> 
> Same here then.

I would prefer that.

> >> --- a/xen/drivers/passthrough/amd/iommu_map.c
> >> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> >> @@ -116,7 +116,19 @@ static void set_iommu_ptes_present(unsig
> >>  
> >>      while ( nr_ptes-- )
> >>      {
> >> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> >> +        ASSERT(!pde->next_level);
> >> +        ASSERT(!pde->u);
> >> +
> >> +        if ( pde > table )
> >> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> >> +        else
> >> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
> > 
> > You could even special case (pde - table) % 2 != 0, but this is debug
> > only code, and it's possible a mod is more costly than
> > find_first_set_bit.
> 
> Not sure why I would want to special case anything that doesn't need
> special casing. The pde == table case needs special care because there
> find_first_set_bit() cannot be called.

Well in iommu_alloc_pgtable you already special case odd entries by
explicitly setting the mask to 0 instead of using find_first_set_bit.

> >> @@ -450,7 +450,28 @@ struct page_info *iommu_alloc_pgtable(st
> >>          return NULL;
> >>  
> >>      p = __map_domain_page(pg);
> >> -    clear_page(p);
> >> +
> >> +    if ( contig_mask )
> >> +    {
> >> +        unsigned int i, shift = find_first_set_bit(contig_mask);
> >> +
> >> +        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
> >> +
> >> +        p[0] = (PAGE_SHIFT - 3ull) << shift;
> >> +        p[1] = 0;
> >> +        p[2] = 1ull << shift;
> >> +        p[3] = 0;
> >> +
> >> +        for ( i = 4; i < PAGE_SIZE / 8; i += 4 )
> >> +        {
> >> +            p[i + 0] = (find_first_set_bit(i) + 0ull) << shift;
> >> +            p[i + 1] = 0;
> >> +            p[i + 2] = 1ull << shift;
> >> +            p[i + 3] = 0;
> >> +        }
> > 
> > You could likely do:
> > 
> > for ( i = 0; i < PAGE_SIZE / 8; i += 4 )
> > {
> >     p[i + 0] = i ? ((find_first_set_bit(i) + 0ull) << shift)
> >                  : ((PAGE_SHIFT - 3ull) << shift);
> >     p[i + 1] = 0;
> >     p[i + 2] = 1ull << shift;
> >     p[i + 3] = 0;
> > }
> > 
> > To avoid having to open code the first loop iteration.
> 
> I could, but I explicitly didn't want to. I consider conditionals
> inside a loop which special case just the first (or last) iteration
> quite odd (unless they really save a lot of duplication).
> 
> > The ternary
> > operator could also be nested before the shift, but I find that
> > harder to read.
> 
> If I was to make the change, then that alternative way, as it would
> allow to avoid the addition of 0ull:
> 
>     p[i + 0] = (i ? find_first_set_bit(i)
>                   : PAGE_SHIFT - 3ull) << shift;
> 
> I could be talked into going that route (but not the intermediate
> one you've suggested).

If you prefer to open code the first iteration I'm also fine with
that.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-12-14 15:10     ` Jan Beulich
@ 2021-12-14 15:17       ` Roger Pau Monné
  2021-12-14 15:24         ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-14 15:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Tue, Dec 14, 2021 at 04:10:28PM +0100, Jan Beulich wrote:
> On 14.12.2021 16:06, Roger Pau Monné wrote:
> > Forgot to comment.
> > 
> > On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
> >> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> @@ -238,7 +238,7 @@ int amd_iommu_alloc_root(struct domain *
> >>  
> >>      if ( unlikely(!hd->arch.amd.root_table) )
> >>      {
> >> -        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
> >> +        hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);
> > 
> > So root tables don't get markers setup...
> > 
> > 
> >>          if ( !hd->arch.amd.root_table )
> >>              return -ENOMEM;
> >>      }
> >> --- a/xen/drivers/passthrough/vtd/iommu.c
> >> +++ b/xen/drivers/passthrough/vtd/iommu.c
> >> @@ -297,7 +297,7 @@ static uint64_t addr_to_dma_page_maddr(s
> >>              goto out;
> >>  
> >>          pte_maddr = level;
> >> -        if ( !(pg = iommu_alloc_pgtable(domain)) )
> >> +        if ( !(pg = iommu_alloc_pgtable(domain, 0)) )
> > 
> > ...likewise here.
> 
> Yes. Plus quarantine domain's page tables also don't. Neither root
> tables nor quarantine domain's are ever eligible for re-coalescing,
> so there's no point having markers there.

Quarantine won't be coalesced anyway as the same mfn is repeated over
all the entries, so it will never be a suitable candidate for
coalescing?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-12-14 15:15       ` Roger Pau Monné
@ 2021-12-14 15:21         ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-14 15:21 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 14.12.2021 16:15, Roger Pau Monné wrote:
> On Tue, Dec 14, 2021 at 04:05:27PM +0100, Jan Beulich wrote:
>> On 14.12.2021 15:50, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>>>> @@ -116,7 +116,19 @@ static void set_iommu_ptes_present(unsig
>>>>  
>>>>      while ( nr_ptes-- )
>>>>      {
>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>>>> +        ASSERT(!pde->next_level);
>>>> +        ASSERT(!pde->u);
>>>> +
>>>> +        if ( pde > table )
>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
>>>> +        else
>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
>>>
>>> You could even special case (pde - table) % 2 != 0, but this is debug
>>> only code, and it's possible a mod is more costly than
>>> find_first_set_bit.
>>
>> Not sure why I would want to special case anything that doesn't need
>> special casing. The pde == table case needs special care because there
>> find_first_set_bit() cannot be called.
> 
> Well in iommu_alloc_pgtable you already special case odd entries by
> explicitly setting the mask to 0 instead of using find_first_set_bit.

I don't consider this special casing; instead I'm unrolling the loop
4 times to simplify calculations not only for odd entries, but also
for those where index % 4 == 2. Unrolling the loop here just for the
assertions doesn't look very desirable.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables
  2021-12-14 15:17       ` Roger Pau Monné
@ 2021-12-14 15:24         ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-14 15:24 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 14.12.2021 16:17, Roger Pau Monné wrote:
> On Tue, Dec 14, 2021 at 04:10:28PM +0100, Jan Beulich wrote:
>> On 14.12.2021 16:06, Roger Pau Monné wrote:
>>> Forgot to comment.
>>>
>>> On Fri, Sep 24, 2021 at 11:54:58AM +0200, Jan Beulich wrote:
>>>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>>>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>>>> @@ -238,7 +238,7 @@ int amd_iommu_alloc_root(struct domain *
>>>>  
>>>>      if ( unlikely(!hd->arch.amd.root_table) )
>>>>      {
>>>> -        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
>>>> +        hd->arch.amd.root_table = iommu_alloc_pgtable(d, 0);
>>>
>>> So root tables don't get markers setup...
>>>
>>>
>>>>          if ( !hd->arch.amd.root_table )
>>>>              return -ENOMEM;
>>>>      }
>>>> --- a/xen/drivers/passthrough/vtd/iommu.c
>>>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>>>> @@ -297,7 +297,7 @@ static uint64_t addr_to_dma_page_maddr(s
>>>>              goto out;
>>>>  
>>>>          pte_maddr = level;
>>>> -        if ( !(pg = iommu_alloc_pgtable(domain)) )
>>>> +        if ( !(pg = iommu_alloc_pgtable(domain, 0)) )
>>>
>>> ...likewise here.
>>
>> Yes. Plus quarantine domain's page tables also don't. Neither root
>> tables nor quarantine domain's are ever eligible for re-coalescing,
>> so there's no point having markers there.
> 
> Quarantine won't be coalesced anyway as the same mfn is repeated over
> all the entries, so it will never be a suitable candidate for
> coalescing?

Correct.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in page tables
  2021-09-24  9:55 ` [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in " Jan Beulich
@ 2021-12-15 13:57   ` Roger Pau Monné
  2021-12-16 15:47     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-15 13:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, Sep 24, 2021 at 11:55:30AM +0200, Jan Beulich wrote:
> This is a re-usable helper (kind of a template) which gets introduced
> without users so that the individual subsequent patches introducing such
> users can get committed independently of one another.
> 
> See the comment at the top of the new file. To demonstrate the effect,
> if a page table had just 16 entries, this would be the set of markers
> for a page table with fully contiguous mappings:
> 
> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
> 
> "Contiguous" here means not only present entries with successively
> increasing MFNs, each one suitably aligned for its slot, but also a
> respective number of all non-present entries.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: New.
> 
> --- /dev/null
> +++ b/xen/include/asm-x86/contig-marker.h
> @@ -0,0 +1,105 @@
> +#ifndef __ASM_X86_CONTIG_MARKER_H
> +#define __ASM_X86_CONTIG_MARKER_H
> +
> +/*
> + * Short of having function templates in C, the function defined below is
> + * intended to be used by multiple parties interested in recording the
> + * degree of contiguity in mappings by a single page table.
> + *
> + * Scheme: Every entry records the order of contiguous successive entries,
> + * up to the maximum order covered by that entry (which is the number of
> + * clear low bits in its index, with entry 0 being the exception using
> + * the base-2 logarithm of the number of entries in a single page table).
> + * While a few entries need touching upon update, knowing whether the
> + * table is fully contiguous (and can hence be replaced by a higher level
> + * leaf entry) is then possible by simply looking at entry 0's marker.
> + *
> + * Prereqs:
> + * - CONTIG_MASK needs to be #define-d, to a value having at least 4
> + *   contiguous bits (ignored by hardware), before including this file,
> + * - page tables to be passed here need to be initialized with correct
> + *   markers.

Given this requirement I think it would make sense to place the page
table marker initialization currently placed in iommu_alloc_pgtable as
a helper here also?

> + */
> +
> +#include <xen/bitops.h>
> +#include <xen/lib.h>
> +#include <xen/page-size.h>
> +
> +/* This is the same for all anticipated users, so doesn't need passing in. */
> +#define CONTIG_LEVEL_SHIFT 9
> +#define CONTIG_NR          (1 << CONTIG_LEVEL_SHIFT)
> +
> +#define GET_MARKER(e) MASK_EXTR(e, CONTIG_MASK)
> +#define SET_MARKER(e, m) \
> +    ((void)(e = ((e) & ~CONTIG_MASK) | MASK_INSR(m, CONTIG_MASK)))
> +
> +enum PTE_kind {
> +    PTE_kind_null,
> +    PTE_kind_leaf,
> +    PTE_kind_table,
> +};
> +
> +static bool update_contig_markers(uint64_t *pt, unsigned int idx,

Maybe pt_update_contig_markers, so it's not such a generic name.

> +                                  unsigned int level, enum PTE_kind kind)
> +{
> +    unsigned int b, i = idx;
> +    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
> +
> +    ASSERT(idx < CONTIG_NR);
> +    ASSERT(!(pt[idx] & CONTIG_MASK));
> +
> +    /* Step 1: Reduce markers in lower numbered entries. */
> +    while ( i )
> +    {
> +        b = find_first_set_bit(i);
> +        i &= ~(1U << b);
> +        if ( GET_MARKER(pt[i]) > b )
> +            SET_MARKER(pt[i], b);
> +    }
> +
> +    /* An intermediate table is never contiguous with anything. */
> +    if ( kind == PTE_kind_table )
> +        return false;
> +
> +    /*
> +     * Present entries need in sync index and address to be a candidate
> +     * for being contiguous: What we're after is whether ultimately the
> +     * intermediate table can be replaced by a superpage.
> +     */
> +    if ( kind != PTE_kind_null &&
> +         idx != ((pt[idx] >> shift) & (CONTIG_NR - 1)) )

Don't you just need to check that the address is aligned to at least
idx, not that it's exactly aligned?

AFAICT this will return early if the address has an alignment that
exceeds the requirements imposed by idx.

> +        return false;
> +
> +    /* Step 2: Check higher numbered entries for contiguity. */
> +    for ( b = 0; b < CONTIG_LEVEL_SHIFT && !(idx & (1U << b)); ++b )
> +    {
> +        i = idx | (1U << b);
> +        if ( (kind == PTE_kind_leaf
> +              ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))

Maybe this could be a macro, CHECK_CONTIG or some such? It's also used
below.

I would also think this would be clearer as:

(pt[idx] & ~CONTIG_MASK) + (1ULL << (shift + b)) == (pt[i] & ~CONTIG_MASK)

> +              : pt[i] & ~CONTIG_MASK) ||

Isn't PTE_kind_null always supposed to be empty? (ie: wouldn't this
check always succeed?)

> +             GET_MARKER(pt[i]) != b )
> +            break;
> +    }
> +
> +    /* Step 3: Update markers in this and lower numbered entries. */
> +    for ( ; SET_MARKER(pt[idx], b), b < CONTIG_LEVEL_SHIFT; ++b )
> +    {
> +        i = idx ^ (1U << b);
> +        if ( (kind == PTE_kind_leaf
> +              ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))
> +              : pt[i] & ~CONTIG_MASK) ||
> +             GET_MARKER(pt[i]) != b )
> +            break;
> +        idx &= ~(1U << b);

There's an iteration where idx will be 0, and then there's no further
point in doing the & anymore?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 17/18] AMD/IOMMU: free all-empty page tables
  2021-09-24  9:55 ` [PATCH v2 17/18] AMD/IOMMU: free all-empty " Jan Beulich
@ 2021-12-15 15:14   ` Roger Pau Monné
  2021-12-16 15:54     ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-15 15:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Fri, Sep 24, 2021 at 11:55:57AM +0200, Jan Beulich wrote:
> When a page table ends up with no present entries left, it can be
> replaced by a non-present entry at the next higher level. The page table
> itself can then be scheduled for freeing.
> 
> Note that while its output isn't used there yet, update_contig_markers()
> right away needs to be called in all places where entries get updated,
> not just the one where entries get cleared.

Couldn't you also coalesce all contiguous page tables into a
super-page entry at the higher level? (not that it should be done here,
it's just that I'm not seeing any patch to that effect in the series)

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: New.
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -21,6 +21,9 @@
>  
>  #include "iommu.h"
>  
> +#define CONTIG_MASK IOMMU_PTE_CONTIG_MASK
> +#include <asm/contig-marker.h>
> +
>  /* Given pfn and page table level, return pde index */
>  static unsigned int pfn_to_pde_idx(unsigned long pfn, unsigned int level)
>  {
> @@ -33,16 +36,20 @@ static unsigned int pfn_to_pde_idx(unsig
>  
>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
>                                                     unsigned long dfn,
> -                                                   unsigned int level)
> +                                                   unsigned int level,
> +                                                   bool *free)
>  {
>      union amd_iommu_pte *table, *pte, old;
> +    unsigned int idx = pfn_to_pde_idx(dfn, level);
>  
>      table = map_domain_page(_mfn(l1_mfn));
> -    pte = &table[pfn_to_pde_idx(dfn, level)];
> +    pte = &table[idx];
>      old = *pte;
>  
>      write_atomic(&pte->raw, 0);
>  
> +    *free = update_contig_markers(&table->raw, idx, level, PTE_kind_null);
> +
>      unmap_domain_page(table);
>  
>      return old;
> @@ -85,7 +92,11 @@ static union amd_iommu_pte set_iommu_pte
>      if ( !old.pr || old.next_level ||
>           old.mfn != next_mfn ||
>           old.iw != iw || old.ir != ir )
> +    {
>          set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> +        update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level), level,
> +                              PTE_kind_leaf);
> +    }
>      else
>          old.pr = false; /* signal "no change" to the caller */
>  
> @@ -259,6 +270,9 @@ static int iommu_pde_from_dfn(struct dom
>              smp_wmb();
>              set_iommu_pde_present(pde, next_table_mfn, next_level, true,
>                                    true);
> +            update_contig_markers(&next_table_vaddr->raw,
> +                                  pfn_to_pde_idx(dfn, level),
> +                                  level, PTE_kind_table);
>  
>              *flush_flags |= IOMMU_FLUSHF_modified;
>          }
> @@ -284,6 +298,9 @@ static int iommu_pde_from_dfn(struct dom
>                  next_table_mfn = mfn_x(page_to_mfn(table));
>                  set_iommu_pde_present(pde, next_table_mfn, next_level, true,
>                                        true);
> +                update_contig_markers(&next_table_vaddr->raw,
> +                                      pfn_to_pde_idx(dfn, level),
> +                                      level, PTE_kind_table);

Would be nice if we could pack the update_contig_markers in
set_iommu_pde_present (like you do for clear_iommu_pte_present).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-09-24  9:53 ` [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one" Jan Beulich
  2021-12-13 15:04   ` Roger Pau Monné
@ 2021-12-15 15:28   ` Oleksandr
  2021-12-16  8:49     ` Jan Beulich
  2021-12-16 11:30   ` Rahul Singh
  2021-12-17 14:38   ` Julien Grall
  3 siblings, 1 reply; 100+ messages in thread
From: Oleksandr @ 2021-12-15 15:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis,
	Rahul Singh


On 24.09.21 12:53, Jan Beulich wrote:

Hi Jan

> Having a separate flush-all hook has always been puzzling me some. We
> will want to be able to force a full flush via accumulated flush flags
> from the map/unmap functions. Introduce a respective new flag and fold
> all flush handling to use the single remaining hook.
>
> Note that because of the respective comments in SMMU and IPMMU-VMSA
> code, I've folded the two prior hook functions into one.

Changes to IPMMU-VMSA lgtm, for SMMU-v2 I think the same.


> For SMMU-v3,
> which lacks a comment towards incapable hardware, I've left both
> functions in place on the assumption that selective and full flushes
> will eventually want separating.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: What we really are going to need is for the map/unmap functions to
>       specify that a wider region needs flushing than just the one
>       covered by the present set of (un)maps. This may still be less than
>       a full flush, but at least as a first step it seemed better to me
>       to keep things simple and go the flush-all route.
> ---
> v2: New.
>
> --- a/xen/drivers/passthrough/amd/iommu.h
> +++ b/xen/drivers/passthrough/amd/iommu.h
> @@ -242,7 +242,6 @@ int amd_iommu_get_reserved_device_memory
>   int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
>                                                unsigned long page_count,
>                                                unsigned int flush_flags);
> -int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
>   void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
>                                dfn_t dfn);
>   
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -475,15 +475,18 @@ int amd_iommu_flush_iotlb_pages(struct d
>   {
>       unsigned long dfn_l = dfn_x(dfn);
>   
> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> -    ASSERT(flush_flags);
> +    if ( !(flush_flags & IOMMU_FLUSHF_all) )
> +    {
> +        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> +        ASSERT(flush_flags);
> +    }
>   
>       /* Unless a PTE was modified, no flush is required */
>       if ( !(flush_flags & IOMMU_FLUSHF_modified) )
>           return 0;
>   
> -    /* If the range wraps then just flush everything */
> -    if ( dfn_l + page_count < dfn_l )
> +    /* If so requested or if the range wraps then just flush everything. */
> +    if ( (flush_flags & IOMMU_FLUSHF_all) || dfn_l + page_count < dfn_l )
>       {
>           amd_iommu_flush_all_pages(d);
>           return 0;
> @@ -508,13 +511,6 @@ int amd_iommu_flush_iotlb_pages(struct d
>   
>       return 0;
>   }
> -
> -int amd_iommu_flush_iotlb_all(struct domain *d)
> -{
> -    amd_iommu_flush_all_pages(d);
> -
> -    return 0;
> -}
>   
>   int amd_iommu_reserve_domain_unity_map(struct domain *d,
>                                          const struct ivrs_unity_map *map,
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -642,7 +642,6 @@ static const struct iommu_ops __initcons
>       .map_page = amd_iommu_map_page,
>       .unmap_page = amd_iommu_unmap_page,
>       .iotlb_flush = amd_iommu_flush_iotlb_pages,
> -    .iotlb_flush_all = amd_iommu_flush_iotlb_all,
>       .reassign_device = reassign_device,
>       .get_device_group_id = amd_iommu_group_id,
>       .enable_x2apic = iov_enable_xt,
> --- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> +++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> @@ -930,13 +930,19 @@ out:
>   }
>   
>   /* Xen IOMMU ops */
> -static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
> +static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
> +                                          unsigned long page_count,
> +                                          unsigned int flush_flags)
>   {
>       struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
>   
> +    ASSERT(flush_flags);
> +
>       if ( !xen_domain || !xen_domain->root_domain )
>           return 0;
>   
> +    /* The hardware doesn't support selective TLB flush. */
> +
>       spin_lock(&xen_domain->lock);
>       ipmmu_tlb_invalidate(xen_domain->root_domain);
>       spin_unlock(&xen_domain->lock);
> @@ -944,16 +950,6 @@ static int __must_check ipmmu_iotlb_flus
>       return 0;
>   }
>   
> -static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
> -                                          unsigned long page_count,
> -                                          unsigned int flush_flags)
> -{
> -    ASSERT(flush_flags);
> -
> -    /* The hardware doesn't support selective TLB flush. */
> -    return ipmmu_iotlb_flush_all(d);
> -}
> -
>   static struct ipmmu_vmsa_domain *ipmmu_get_cache_domain(struct domain *d,
>                                                           struct device *dev)
>   {
> @@ -1303,7 +1299,6 @@ static const struct iommu_ops ipmmu_iomm
>       .hwdom_init      = ipmmu_iommu_hwdom_init,
>       .teardown        = ipmmu_iommu_domain_teardown,
>       .iotlb_flush     = ipmmu_iotlb_flush,
> -    .iotlb_flush_all = ipmmu_iotlb_flush_all,
>       .assign_device   = ipmmu_assign_device,
>       .reassign_device = ipmmu_reassign_device,
>       .map_page        = arm_iommu_map_page,
> --- a/xen/drivers/passthrough/arm/smmu.c
> +++ b/xen/drivers/passthrough/arm/smmu.c
> @@ -2649,11 +2649,17 @@ static int force_stage = 2;
>    */
>   static u32 platform_features = ARM_SMMU_FEAT_COHERENT_WALK;
>   
> -static int __must_check arm_smmu_iotlb_flush_all(struct domain *d)
> +static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
> +					     unsigned long page_count,
> +					     unsigned int flush_flags)
>   {
>   	struct arm_smmu_xen_domain *smmu_domain = dom_iommu(d)->arch.priv;
>   	struct iommu_domain *cfg;
>   
> +	ASSERT(flush_flags);
> +
> +	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
> +
>   	spin_lock(&smmu_domain->lock);
>   	list_for_each_entry(cfg, &smmu_domain->contexts, list) {
>   		/*
> @@ -2670,16 +2676,6 @@ static int __must_check arm_smmu_iotlb_f
>   	return 0;
>   }
>   
> -static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
> -					     unsigned long page_count,
> -					     unsigned int flush_flags)
> -{
> -	ASSERT(flush_flags);
> -
> -	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
> -	return arm_smmu_iotlb_flush_all(d);
> -}
> -
>   static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
>   						struct device *dev)
>   {
> @@ -2879,7 +2875,6 @@ static const struct iommu_ops arm_smmu_i
>       .add_device = arm_smmu_dt_add_device_generic,
>       .teardown = arm_smmu_iommu_domain_teardown,
>       .iotlb_flush = arm_smmu_iotlb_flush,
> -    .iotlb_flush_all = arm_smmu_iotlb_flush_all,
>       .assign_device = arm_smmu_assign_dev,
>       .reassign_device = arm_smmu_reassign_dev,
>       .map_page = arm_iommu_map_page,
> --- a/xen/drivers/passthrough/arm/smmu-v3.c
> +++ b/xen/drivers/passthrough/arm/smmu-v3.c
> @@ -3431,7 +3431,6 @@ static const struct iommu_ops arm_smmu_i
>   	.hwdom_init		= arm_smmu_iommu_hwdom_init,
>   	.teardown		= arm_smmu_iommu_xen_domain_teardown,
>   	.iotlb_flush		= arm_smmu_iotlb_flush,
> -	.iotlb_flush_all	= arm_smmu_iotlb_flush_all,
>   	.assign_device		= arm_smmu_assign_dev,
>   	.reassign_device	= arm_smmu_reassign_dev,
>   	.map_page		= arm_iommu_map_page,
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -463,15 +463,12 @@ int iommu_iotlb_flush_all(struct domain
>       const struct domain_iommu *hd = dom_iommu(d);
>       int rc;
>   
> -    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush_all ||
> +    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush ||
>            !flush_flags )
>           return 0;
>   
> -    /*
> -     * The operation does a full flush so we don't need to pass the
> -     * flush_flags in.
> -     */
> -    rc = iommu_call(hd->platform_ops, iotlb_flush_all, d);
> +    rc = iommu_call(hd->platform_ops, iotlb_flush, d, INVALID_DFN, 0,
> +                    flush_flags | IOMMU_FLUSHF_all);
>       if ( unlikely(rc) )
>       {
>           if ( !d->is_shutting_down && printk_ratelimit() )
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -731,18 +731,21 @@ static int __must_check iommu_flush_iotl
>                                                   unsigned long page_count,
>                                                   unsigned int flush_flags)
>   {
> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> -    ASSERT(flush_flags);
> +    if ( flush_flags & IOMMU_FLUSHF_all )
> +    {
> +        dfn = INVALID_DFN;
> +        page_count = 0;
> +    }
> +    else
> +    {
> +        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> +        ASSERT(flush_flags);
> +    }
>   
>       return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
>                                page_count);
>   }
>   
> -static int __must_check iommu_flush_iotlb_all(struct domain *d)
> -{
> -    return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
> -}
> -
>   static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
>   {
>       if ( next_level > 1 )
> @@ -2841,7 +2844,7 @@ static int __init intel_iommu_quarantine
>       spin_unlock(&hd->arch.mapping_lock);
>   
>       if ( !rc )
> -        rc = iommu_flush_iotlb_all(d);
> +        rc = iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
>   
>       /* Pages may be leaked in failure case */
>       return rc;
> @@ -2874,7 +2877,6 @@ static struct iommu_ops __initdata vtd_o
>       .resume = vtd_resume,
>       .crash_shutdown = vtd_crash_shutdown,
>       .iotlb_flush = iommu_flush_iotlb_pages,
> -    .iotlb_flush_all = iommu_flush_iotlb_all,
>       .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
>       .dump_page_tables = vtd_dump_page_tables,
>   };
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -147,9 +147,11 @@ enum
>   {
>       _IOMMU_FLUSHF_added,
>       _IOMMU_FLUSHF_modified,
> +    _IOMMU_FLUSHF_all,
>   };
>   #define IOMMU_FLUSHF_added (1u << _IOMMU_FLUSHF_added)
>   #define IOMMU_FLUSHF_modified (1u << _IOMMU_FLUSHF_modified)
> +#define IOMMU_FLUSHF_all (1u << _IOMMU_FLUSHF_all)
>   
>   int __must_check iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
>                              unsigned long page_count, unsigned int flags,
> @@ -282,7 +284,6 @@ struct iommu_ops {
>       int __must_check (*iotlb_flush)(struct domain *d, dfn_t dfn,
>                                       unsigned long page_count,
>                                       unsigned int flush_flags);
> -    int __must_check (*iotlb_flush_all)(struct domain *d);
>       int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
>       void (*dump_page_tables)(struct domain *d);
>   
>
>
-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-12-15 15:28   ` Oleksandr
@ 2021-12-16  8:49     ` Jan Beulich
  2021-12-16 10:39       ` Oleksandr
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-16  8:49 UTC (permalink / raw)
  To: Oleksandr
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis,
	Rahul Singh

On 15.12.2021 16:28, Oleksandr wrote:
> On 24.09.21 12:53, Jan Beulich wrote:
>> Having a separate flush-all hook has always been puzzling me some. We
>> will want to be able to force a full flush via accumulated flush flags
>> from the map/unmap functions. Introduce a respective new flag and fold
>> all flush handling to use the single remaining hook.
>>
>> Note that because of the respective comments in SMMU and IPMMU-VMSA
>> code, I've folded the two prior hook functions into one.
> 
> Changes to IPMMU-VMSA lgtm, for SMMU-v2 I think the same.

Thanks; I wonder whether I may transform this into some kind of tag.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-12-16  8:49     ` Jan Beulich
@ 2021-12-16 10:39       ` Oleksandr
  0 siblings, 0 replies; 100+ messages in thread
From: Oleksandr @ 2021-12-16 10:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis,
	Rahul Singh


On 16.12.21 10:49, Jan Beulich wrote:

Hi Jan


> On 15.12.2021 16:28, Oleksandr wrote:
>> On 24.09.21 12:53, Jan Beulich wrote:
>>> Having a separate flush-all hook has always been puzzling me some. We
>>> will want to be able to force a full flush via accumulated flush flags
>>> from the map/unmap functions. Introduce a respective new flag and fold
>>> all flush handling to use the single remaining hook.
>>>
>>> Note that because of the respective comments in SMMU and IPMMU-VMSA
>>> code, I've folded the two prior hook functions into one.
>> Changes to IPMMU-VMSA lgtm, for SMMU-v2 I think the same.
> Thanks; I wonder whether I may transform this into some kind of tag.


[IPMMU-VMSA and SMMU-V2 bits]

Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>


>
> Jan
>
-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-09-24  9:53 ` [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one" Jan Beulich
  2021-12-13 15:04   ` Roger Pau Monné
  2021-12-15 15:28   ` Oleksandr
@ 2021-12-16 11:30   ` Rahul Singh
  2021-12-21  8:04     ` Jan Beulich
  2021-12-17 14:38   ` Julien Grall
  3 siblings, 1 reply; 100+ messages in thread
From: Rahul Singh @ 2021-12-16 11:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

Hi Jan

> On 24 Sep 2021, at 10:53 am, Jan Beulich <jbeulich@suse.com> wrote:
> 
> Having a separate flush-all hook has always been puzzling me some. We
> will want to be able to force a full flush via accumulated flush flags
> from the map/unmap functions. Introduce a respective new flag and fold
> all flush handling to use the single remaining hook.
> 
> Note that because of the respective comments in SMMU and IPMMU-VMSA
> code, I've folded the two prior hook functions into one. For SMMU-v3,
> which lacks a comment towards incapable hardware, I've left both
> functions in place on the assumption that selective and full flushes
> will eventually want separating.


For SMMUv3-related changes:
Reviewed-by: Rahul Singh <rahul.singh@arm.com>

Regards,
Rahul
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: What we really are going to need is for the map/unmap functions to
>     specify that a wider region needs flushing than just the one
>     covered by the present set of (un)maps. This may still be less than
>     a full flush, but at least as a first step it seemed better to me
>     to keep things simple and go the flush-all route.
> ---
> v2: New.
> 
> --- a/xen/drivers/passthrough/amd/iommu.h
> +++ b/xen/drivers/passthrough/amd/iommu.h
> @@ -242,7 +242,6 @@ int amd_iommu_get_reserved_device_memory
> int __must_check amd_iommu_flush_iotlb_pages(struct domain *d, dfn_t dfn,
>                                              unsigned long page_count,
>                                              unsigned int flush_flags);
> -int __must_check amd_iommu_flush_iotlb_all(struct domain *d);
> void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
>                              dfn_t dfn);
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -475,15 +475,18 @@ int amd_iommu_flush_iotlb_pages(struct d
> {
>     unsigned long dfn_l = dfn_x(dfn);
> 
> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> -    ASSERT(flush_flags);
> +    if ( !(flush_flags & IOMMU_FLUSHF_all) )
> +    {
> +        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> +        ASSERT(flush_flags);
> +    }
> 
>     /* Unless a PTE was modified, no flush is required */
>     if ( !(flush_flags & IOMMU_FLUSHF_modified) )
>         return 0;
> 
> -    /* If the range wraps then just flush everything */
> -    if ( dfn_l + page_count < dfn_l )
> +    /* If so requested or if the range wraps then just flush everything. */
> +    if ( (flush_flags & IOMMU_FLUSHF_all) || dfn_l + page_count < dfn_l )
>     {
>         amd_iommu_flush_all_pages(d);
>         return 0;
> @@ -508,13 +511,6 @@ int amd_iommu_flush_iotlb_pages(struct d
> 
>     return 0;
> }
> -
> -int amd_iommu_flush_iotlb_all(struct domain *d)
> -{
> -    amd_iommu_flush_all_pages(d);
> -
> -    return 0;
> -}
> 
> int amd_iommu_reserve_domain_unity_map(struct domain *d,
>                                        const struct ivrs_unity_map *map,
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -642,7 +642,6 @@ static const struct iommu_ops __initcons
>     .map_page = amd_iommu_map_page,
>     .unmap_page = amd_iommu_unmap_page,
>     .iotlb_flush = amd_iommu_flush_iotlb_pages,
> -    .iotlb_flush_all = amd_iommu_flush_iotlb_all,
>     .reassign_device = reassign_device,
>     .get_device_group_id = amd_iommu_group_id,
>     .enable_x2apic = iov_enable_xt,
> --- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> +++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> @@ -930,13 +930,19 @@ out:
> }
> 
> /* Xen IOMMU ops */
> -static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
> +static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
> +                                          unsigned long page_count,
> +                                          unsigned int flush_flags)
> {
>     struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
> 
> +    ASSERT(flush_flags);
> +
>     if ( !xen_domain || !xen_domain->root_domain )
>         return 0;
> 
> +    /* The hardware doesn't support selective TLB flush. */
> +
>     spin_lock(&xen_domain->lock);
>     ipmmu_tlb_invalidate(xen_domain->root_domain);
>     spin_unlock(&xen_domain->lock);
> @@ -944,16 +950,6 @@ static int __must_check ipmmu_iotlb_flus
>     return 0;
> }
> 
> -static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
> -                                          unsigned long page_count,
> -                                          unsigned int flush_flags)
> -{
> -    ASSERT(flush_flags);
> -
> -    /* The hardware doesn't support selective TLB flush. */
> -    return ipmmu_iotlb_flush_all(d);
> -}
> -
> static struct ipmmu_vmsa_domain *ipmmu_get_cache_domain(struct domain *d,
>                                                         struct device *dev)
> {
> @@ -1303,7 +1299,6 @@ static const struct iommu_ops ipmmu_iomm
>     .hwdom_init      = ipmmu_iommu_hwdom_init,
>     .teardown        = ipmmu_iommu_domain_teardown,
>     .iotlb_flush     = ipmmu_iotlb_flush,
> -    .iotlb_flush_all = ipmmu_iotlb_flush_all,
>     .assign_device   = ipmmu_assign_device,
>     .reassign_device = ipmmu_reassign_device,
>     .map_page        = arm_iommu_map_page,
> --- a/xen/drivers/passthrough/arm/smmu.c
> +++ b/xen/drivers/passthrough/arm/smmu.c
> @@ -2649,11 +2649,17 @@ static int force_stage = 2;
>  */
> static u32 platform_features = ARM_SMMU_FEAT_COHERENT_WALK;
> 
> -static int __must_check arm_smmu_iotlb_flush_all(struct domain *d)
> +static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
> +					     unsigned long page_count,
> +					     unsigned int flush_flags)
> {
> 	struct arm_smmu_xen_domain *smmu_domain = dom_iommu(d)->arch.priv;
> 	struct iommu_domain *cfg;
> 
> +	ASSERT(flush_flags);
> +
> +	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
> +
> 	spin_lock(&smmu_domain->lock);
> 	list_for_each_entry(cfg, &smmu_domain->contexts, list) {
> 		/*
> @@ -2670,16 +2676,6 @@ static int __must_check arm_smmu_iotlb_f
> 	return 0;
> }
> 
> -static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
> -					     unsigned long page_count,
> -					     unsigned int flush_flags)
> -{
> -	ASSERT(flush_flags);
> -
> -	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
> -	return arm_smmu_iotlb_flush_all(d);
> -}
> -
> static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
> 						struct device *dev)
> {
> @@ -2879,7 +2875,6 @@ static const struct iommu_ops arm_smmu_i
>     .add_device = arm_smmu_dt_add_device_generic,
>     .teardown = arm_smmu_iommu_domain_teardown,
>     .iotlb_flush = arm_smmu_iotlb_flush,
> -    .iotlb_flush_all = arm_smmu_iotlb_flush_all,
>     .assign_device = arm_smmu_assign_dev,
>     .reassign_device = arm_smmu_reassign_dev,
>     .map_page = arm_iommu_map_page,
> --- a/xen/drivers/passthrough/arm/smmu-v3.c
> +++ b/xen/drivers/passthrough/arm/smmu-v3.c
> @@ -3431,7 +3431,6 @@ static const struct iommu_ops arm_smmu_i
> 	.hwdom_init		= arm_smmu_iommu_hwdom_init,
> 	.teardown		= arm_smmu_iommu_xen_domain_teardown,
> 	.iotlb_flush		= arm_smmu_iotlb_flush,
> -	.iotlb_flush_all	= arm_smmu_iotlb_flush_all,
> 	.assign_device		= arm_smmu_assign_dev,
> 	.reassign_device	= arm_smmu_reassign_dev,
> 	.map_page		= arm_iommu_map_page,
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -463,15 +463,12 @@ int iommu_iotlb_flush_all(struct domain
>     const struct domain_iommu *hd = dom_iommu(d);
>     int rc;
> 
> -    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush_all ||
> +    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush ||
>          !flush_flags )
>         return 0;
> 
> -    /*
> -     * The operation does a full flush so we don't need to pass the
> -     * flush_flags in.
> -     */
> -    rc = iommu_call(hd->platform_ops, iotlb_flush_all, d);
> +    rc = iommu_call(hd->platform_ops, iotlb_flush, d, INVALID_DFN, 0,
> +                    flush_flags | IOMMU_FLUSHF_all);
>     if ( unlikely(rc) )
>     {
>         if ( !d->is_shutting_down && printk_ratelimit() )
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -731,18 +731,21 @@ static int __must_check iommu_flush_iotl
>                                                 unsigned long page_count,
>                                                 unsigned int flush_flags)
> {
> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> -    ASSERT(flush_flags);
> +    if ( flush_flags & IOMMU_FLUSHF_all )
> +    {
> +        dfn = INVALID_DFN;
> +        page_count = 0;
> +    }
> +    else
> +    {
> +        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> +        ASSERT(flush_flags);
> +    }
> 
>     return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
>                              page_count);
> }
> 
> -static int __must_check iommu_flush_iotlb_all(struct domain *d)
> -{
> -    return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
> -}
> -
> static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
> {
>     if ( next_level > 1 )
> @@ -2841,7 +2844,7 @@ static int __init intel_iommu_quarantine
>     spin_unlock(&hd->arch.mapping_lock);
> 
>     if ( !rc )
> -        rc = iommu_flush_iotlb_all(d);
> +        rc = iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
> 
>     /* Pages may be leaked in failure case */
>     return rc;
> @@ -2874,7 +2877,6 @@ static struct iommu_ops __initdata vtd_o
>     .resume = vtd_resume,
>     .crash_shutdown = vtd_crash_shutdown,
>     .iotlb_flush = iommu_flush_iotlb_pages,
> -    .iotlb_flush_all = iommu_flush_iotlb_all,
>     .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
>     .dump_page_tables = vtd_dump_page_tables,
> };
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -147,9 +147,11 @@ enum
> {
>     _IOMMU_FLUSHF_added,
>     _IOMMU_FLUSHF_modified,
> +    _IOMMU_FLUSHF_all,
> };
> #define IOMMU_FLUSHF_added (1u << _IOMMU_FLUSHF_added)
> #define IOMMU_FLUSHF_modified (1u << _IOMMU_FLUSHF_modified)
> +#define IOMMU_FLUSHF_all (1u << _IOMMU_FLUSHF_all)
> 
> int __must_check iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
>                            unsigned long page_count, unsigned int flags,
> @@ -282,7 +284,6 @@ struct iommu_ops {
>     int __must_check (*iotlb_flush)(struct domain *d, dfn_t dfn,
>                                     unsigned long page_count,
>                                     unsigned int flush_flags);
> -    int __must_check (*iotlb_flush_all)(struct domain *d);
>     int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
>     void (*dump_page_tables)(struct domain *d);
> 
> 



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in page tables
  2021-12-15 13:57   ` Roger Pau Monné
@ 2021-12-16 15:47     ` Jan Beulich
  2021-12-20 15:25       ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-16 15:47 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 15.12.2021 14:57, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:55:30AM +0200, Jan Beulich wrote:
>> --- /dev/null
>> +++ b/xen/include/asm-x86/contig-marker.h
>> @@ -0,0 +1,105 @@
>> +#ifndef __ASM_X86_CONTIG_MARKER_H
>> +#define __ASM_X86_CONTIG_MARKER_H
>> +
>> +/*
>> + * Short of having function templates in C, the function defined below is
>> + * intended to be used by multiple parties interested in recording the
>> + * degree of contiguity in mappings by a single page table.
>> + *
>> + * Scheme: Every entry records the order of contiguous successive entries,
>> + * up to the maximum order covered by that entry (which is the number of
>> + * clear low bits in its index, with entry 0 being the exception using
>> + * the base-2 logarithm of the number of entries in a single page table).
>> + * While a few entries need touching upon update, knowing whether the
>> + * table is fully contiguous (and can hence be replaced by a higher level
>> + * leaf entry) is then possible by simply looking at entry 0's marker.
>> + *
>> + * Prereqs:
>> + * - CONTIG_MASK needs to be #define-d, to a value having at least 4
>> + *   contiguous bits (ignored by hardware), before including this file,
>> + * - page tables to be passed here need to be initialized with correct
>> + *   markers.
> 
> Given this requirement I think it would make sense to place the page
> table marker initialization currently placed in iommu_alloc_pgtable as
> a helper here also?

It would be nice, yes, but it would also cause problems. I specifically do
not want to make the function here "inline". Hence a source file including
it would need to be given a way to suppress its visibility to the compiler.
Which would mean #ifdef-ary I'd prefer to avoid. Yet by saying "prefer" I
mean to leave open that I could be talked into doing what you suggest.
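
(For reference, the usage pattern the header is written for, as visible
in the AMD patch later in the series, is simply

    #define CONTIG_MASK IOMMU_PTE_CONTIG_MASK
    #include <asm/contig-marker.h>

from the source file wanting the function.)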

>> + */
>> +
>> +#include <xen/bitops.h>
>> +#include <xen/lib.h>
>> +#include <xen/page-size.h>
>> +
>> +/* This is the same for all anticipated users, so doesn't need passing in. */
>> +#define CONTIG_LEVEL_SHIFT 9
>> +#define CONTIG_NR          (1 << CONTIG_LEVEL_SHIFT)
>> +
>> +#define GET_MARKER(e) MASK_EXTR(e, CONTIG_MASK)
>> +#define SET_MARKER(e, m) \
>> +    ((void)(e = ((e) & ~CONTIG_MASK) | MASK_INSR(m, CONTIG_MASK)))
>> +
>> +enum PTE_kind {
>> +    PTE_kind_null,
>> +    PTE_kind_leaf,
>> +    PTE_kind_table,
>> +};
>> +
>> +static bool update_contig_markers(uint64_t *pt, unsigned int idx,
> 
> Maybe pt_update_contig_markers, so it's not such a generic name.

I can do that. The header may then want to be named pt-contig-marker.h
or pt-contig-markers.h. Thoughts?

>> +                                  unsigned int level, enum PTE_kind kind)
>> +{
>> +    unsigned int b, i = idx;
>> +    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
>> +
>> +    ASSERT(idx < CONTIG_NR);
>> +    ASSERT(!(pt[idx] & CONTIG_MASK));
>> +
>> +    /* Step 1: Reduce markers in lower numbered entries. */
>> +    while ( i )
>> +    {
>> +        b = find_first_set_bit(i);
>> +        i &= ~(1U << b);
>> +        if ( GET_MARKER(pt[i]) > b )
>> +            SET_MARKER(pt[i], b);
>> +    }
>> +
>> +    /* An intermediate table is never contiguous with anything. */
>> +    if ( kind == PTE_kind_table )
>> +        return false;
>> +
>> +    /*
>> +     * Present entries need in sync index and address to be a candidate
>> +     * for being contiguous: What we're after is whether ultimately the
>> +     * intermediate table can be replaced by a superpage.
>> +     */
>> +    if ( kind != PTE_kind_null &&
>> +         idx != ((pt[idx] >> shift) & (CONTIG_NR - 1)) )
> 
> Don't you just need to check that the address is aligned to at least
> idx, not that it's exactly aligned?

No, that wouldn't be sufficient. We're not after a general "is
contiguous" here, but strictly after "is this slot meeting the
requirements for the whole table eventually getting replaced by a
superpage".
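
To give a concrete (made-up) example at level 1, where shift ==
PAGE_SHIFT: slot 5 holding MFN 0x205 passes the check (0x205 & 0x1ff ==
5 == idx), whereas slot 5 holding MFN 0x105 doesn't (0x105 & 0x1ff !=
5), even though the latter may well be contiguous with its neighbours;
it just can't ever become part of the superpage which would replace the
table.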

>> +        return false;
>> +
>> +    /* Step 2: Check higher numbered entries for contiguity. */
>> +    for ( b = 0; b < CONTIG_LEVEL_SHIFT && !(idx & (1U << b)); ++b )
>> +    {
>> +        i = idx | (1U << b);
>> +        if ( (kind == PTE_kind_leaf
>> +              ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))
> 
> Maybe this could be a macro, CHECK_CONTIG or some such? It's also used
> below.

Hmm, yes, this might indeed help readability. There are going to be a
lot of parameters though; not sure whether omitting all(?) parameters
for such a locally used macro would be considered acceptable.

> I would also think this would be clearer as:
> 
> (pt[idx] & ~CONTIG_MASK) + (1ULL << (shift + b)) == (pt[i] & ~CONTIG_MASK)

By using + we'd consider entries contiguous which for our purposes
shouldn't be considered so. Yes, the earlier check should already
have caught that case, but I'd like the checks to be as tight as
possible.
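
To illustrate with made-up numbers, taking b + shift == 21: two entries
holding addresses 0x1fe00000 and 0x20000000 differ by exactly 1 << 21,
so the addition form would accept them, yet their XOR is 0x3fe00000,
i.e. more than just bit 21 differs, and such a pair can't satisfy the
slot/address relationship needed for an eventual superpage replacement.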

>> +              : pt[i] & ~CONTIG_MASK) ||
> 
> Isn't PTE_kind_null always supposed to be empty?

Yes (albeit this could be relaxed, but then the logic here would
need to know where the "present" bit(s) is/are).

> (ie: wouldn't this check always succeed?)

No - "kind" describes pt[idx], not pt[i].

>> +             GET_MARKER(pt[i]) != b )
>> +            break;
>> +    }
>> +
>> +    /* Step 3: Update markers in this and lower numbered entries. */
>> +    for ( ; SET_MARKER(pt[idx], b), b < CONTIG_LEVEL_SHIFT; ++b )
>> +    {
>> +        i = idx ^ (1U << b);
>> +        if ( (kind == PTE_kind_leaf
>> +              ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))
>> +              : pt[i] & ~CONTIG_MASK) ||
>> +             GET_MARKER(pt[i]) != b )
>> +            break;
>> +        idx &= ~(1U << b);
> 
> There's an iteration where idx will be 0, and then there's no further
> point in doing the & anymore?

Yes, but doing the & anyway is cheaper than adding a conditional.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 17/18] AMD/IOMMU: free all-empty page tables
  2021-12-15 15:14   ` Roger Pau Monné
@ 2021-12-16 15:54     ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-16 15:54 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 15.12.2021 16:14, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:55:57AM +0200, Jan Beulich wrote:
>> When a page table ends up with no present entries left, it can be
>> replaced by a non-present entry at the next higher level. The page table
>> itself can then be scheduled for freeing.
>>
>> Note that while its output isn't used there yet, update_contig_markers()
>> right away needs to be called in all places where entries get updated,
>> not just the one where entries get cleared.
> 
> Couldn't you also coalesce all contiguous page tables into a
> super-page entry at the higher level? (not that it should be done here,
> it's just that I'm not seeing any patch to that effect in the series)

Yes I could. And in v3 I will (but before getting to that I first
had to work around what looks to be an erratum on very old VT-d
hardware). See the cover letter.
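
Purely as an illustration of the direction (not the actual v3 code; the
freeing helper name is an assumption, the other names are taken from the
quoted context): once update_contig_markers() reports a fully contiguous
leaf table, the caller could collapse it roughly like this:

    if ( update_contig_markers(&table->raw, idx, level, PTE_kind_leaf) )
    {
        /* All 512 entries map contiguous, suitably aligned frames with
         * identical attributes: point the next level's PDE at the first
         * frame directly ... */
        set_iommu_pde_present(pde, first_mfn, 0 /* leaf */, iw, ir);
        /* ... and queue the now-redundant table for freeing. */
        iommu_queue_free_pgtable(d, pg);
    }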

>> @@ -33,16 +36,20 @@ static unsigned int pfn_to_pde_idx(unsig
>>  
>>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
>>                                                     unsigned long dfn,
>> -                                                   unsigned int level)
>> +                                                   unsigned int level,
>> +                                                   bool *free)
>>  {
>>      union amd_iommu_pte *table, *pte, old;
>> +    unsigned int idx = pfn_to_pde_idx(dfn, level);
>>  
>>      table = map_domain_page(_mfn(l1_mfn));
>> -    pte = &table[pfn_to_pde_idx(dfn, level)];
>> +    pte = &table[idx];
>>      old = *pte;
>>  
>>      write_atomic(&pte->raw, 0);
>>  
>> +    *free = update_contig_markers(&table->raw, idx, level, PTE_kind_null);
>> +
>>      unmap_domain_page(table);
>>  
>>      return old;
>> @@ -85,7 +92,11 @@ static union amd_iommu_pte set_iommu_pte
>>      if ( !old.pr || old.next_level ||
>>           old.mfn != next_mfn ||
>>           old.iw != iw || old.ir != ir )
>> +    {
>>          set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>> +        update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level), level,
>> +                              PTE_kind_leaf);
>> +    }
>>      else
>>          old.pr = false; /* signal "no change" to the caller */
>>  
>> @@ -259,6 +270,9 @@ static int iommu_pde_from_dfn(struct dom
>>              smp_wmb();
>>              set_iommu_pde_present(pde, next_table_mfn, next_level, true,
>>                                    true);
>> +            update_contig_markers(&next_table_vaddr->raw,
>> +                                  pfn_to_pde_idx(dfn, level),
>> +                                  level, PTE_kind_table);
>>  
>>              *flush_flags |= IOMMU_FLUSHF_modified;
>>          }
>> @@ -284,6 +298,9 @@ static int iommu_pde_from_dfn(struct dom
>>                  next_table_mfn = mfn_x(page_to_mfn(table));
>>                  set_iommu_pde_present(pde, next_table_mfn, next_level, true,
>>                                        true);
>> +                update_contig_markers(&next_table_vaddr->raw,
>> +                                      pfn_to_pde_idx(dfn, level),
>> +                                      level, PTE_kind_table);
> 
> Would be nice if we could pack the update_contig_markers in
> set_iommu_pde_present (like you do for clear_iommu_pte_present).

I'm actually viewing things the other way around: I would have liked
to avoid placing the call in clear_iommu_pte_present(), but that's
where the mapping gets established and torn down.
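
For reference, a minimal sketch of how the returned flag might be consumed
on the unmap path (variable and helper names are illustrative, not the
exact call site):

    bool free;
    union amd_iommu_pte old = clear_iommu_pte_present(l1_mfn, dfn, level, &free);

    if ( free )
        /* The table at l1_mfn has no present entries left: the caller can
         * replace the PDE referencing it by a non-present one and queue the
         * table page itself for freeing (hypothetical helper). */
        queue_free_pt(d, _mfn(l1_mfn), level);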

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-09-24  9:53 ` [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one" Jan Beulich
                     ` (2 preceding siblings ...)
  2021-12-16 11:30   ` Rahul Singh
@ 2021-12-17 14:38   ` Julien Grall
  3 siblings, 0 replies; 100+ messages in thread
From: Julien Grall @ 2021-12-17 14:38 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, Paul Durrant, Kevin Tian, Stefano Stabellini,
	Volodymyr Babchuk, Bertrand Marquis, Rahul Singh



On 24/09/2021 10:53, Jan Beulich wrote:
> Having a separate flush-all hook has always puzzled me somewhat. We
> will want to be able to force a full flush via accumulated flush flags
> from the map/unmap functions. Introduce a respective new flag and fold
> all flush handling to use the single remaining hook.
> 
> Note that because of the respective comments in SMMU and IPMMU-VMSA
> code, I've folded the two prior hook functions into one. For SMMU-v3,
> which lacks a comment towards incapable hardware, I've left both
> functions in place on the assumption that selective and full flushes
> will eventually want separating.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

For the Arm part:

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks
  2021-09-24  9:44 ` [PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
  2021-11-30 13:49   ` Roger Pau Monné
@ 2021-12-17 14:42   ` Julien Grall
  1 sibling, 0 replies; 100+ messages in thread
From: Julien Grall @ 2021-12-17 14:42 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, Paul Durrant, Kevin Tian, Stefano Stabellini,
	Volodymyr Babchuk

Hi Jan,

On 24/09/2021 10:44, Jan Beulich wrote:
> Or really, in the case of ->map_page(), accommodate it in the existing
> "flags" parameter. All call sites will pass 0 for now.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

For the Arm bits:

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes
  2021-09-24  9:43 ` [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes Jan Beulich
  2021-11-30 12:25   ` Roger Pau Monné
@ 2021-12-17 14:43   ` Julien Grall
  2021-12-21  9:26   ` Rahul Singh
  2 siblings, 0 replies; 100+ messages in thread
From: Julien Grall @ 2021-12-17 14:43 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, Paul Durrant, Kevin Tian, Stefano Stabellini,
	Volodymyr Babchuk, Bertrand Marquis, Rahul Singh

Hi Jan,

On 24/09/2021 10:43, Jan Beulich wrote:
> Generic code will use this information to determine what order values
> can legitimately be passed to the ->{,un}map_page() hooks. For now all
> ops structures simply get to announce 4k mappings (as base page size),
> and there is (and always has been) an assumption that this matches the
> CPU's MMU base page size (eventually we will want to permit IOMMUs with
> a base page size smaller than the CPU MMU's).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in page tables
  2021-12-16 15:47     ` Jan Beulich
@ 2021-12-20 15:25       ` Roger Pau Monné
  2021-12-21  8:09         ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2021-12-20 15:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Thu, Dec 16, 2021 at 04:47:30PM +0100, Jan Beulich wrote:
> On 15.12.2021 14:57, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:55:30AM +0200, Jan Beulich wrote:
> >> --- /dev/null
> >> +++ b/xen/include/asm-x86/contig-marker.h
> >> @@ -0,0 +1,105 @@
> >> +#ifndef __ASM_X86_CONTIG_MARKER_H
> >> +#define __ASM_X86_CONTIG_MARKER_H
> >> +
> >> +/*
> >> + * Short of having function templates in C, the function defined below is
> >> + * intended to be used by multiple parties interested in recording the
> >> + * degree of contiguity in mappings by a single page table.
> >> + *
> >> + * Scheme: Every entry records the order of contiguous successive entries,
> >> + * up to the maximum order covered by that entry (which is the number of
> >> + * clear low bits in its index, with entry 0 being the exception using
> >> + * the base-2 logarithm of the number of entries in a single page table).
> >> + * While a few entries need touching upon update, knowing whether the
> >> + * table is fully contiguous (and can hence be replaced by a higher level
> >> + * leaf entry) is then possible by simply looking at entry 0's marker.
> >> + *
> >> + * Prereqs:
> >> + * - CONTIG_MASK needs to be #define-d, to a value having at least 4
> >> + *   contiguous bits (ignored by hardware), before including this file,
> >> + * - page tables to be passed here need to be initialized with correct
> >> + *   markers.
> > 
> > Given this requirement I think it would make sense to place the page
> > table marker initialization currently placed in iommu_alloc_pgtable as
> > a helper here also?
> 
> It would be nice, yes, but it would also cause problems. I specifically do
> not want to make the function here "inline". Hence a source file including
> it would need to be given a way to suppress its visibility to the compiler.
> Which would mean #ifdef-ary I'd prefer to avoid. Yet by saying "prefer" I
> mean to leave open that I could be talked into doing what you suggest.

Could you mark those as __maybe_unused? Or would you rather like to
assert that they are used when included?

> >> + */
> >> +
> >> +#include <xen/bitops.h>
> >> +#include <xen/lib.h>
> >> +#include <xen/page-size.h>
> >> +
> >> +/* This is the same for all anticipated users, so doesn't need passing in. */
> >> +#define CONTIG_LEVEL_SHIFT 9
> >> +#define CONTIG_NR          (1 << CONTIG_LEVEL_SHIFT)
> >> +
> >> +#define GET_MARKER(e) MASK_EXTR(e, CONTIG_MASK)
> >> +#define SET_MARKER(e, m) \
> >> +    ((void)(e = ((e) & ~CONTIG_MASK) | MASK_INSR(m, CONTIG_MASK)))
> >> +
> >> +enum PTE_kind {
> >> +    PTE_kind_null,
> >> +    PTE_kind_leaf,
> >> +    PTE_kind_table,
> >> +};
> >> +
> >> +static bool update_contig_markers(uint64_t *pt, unsigned int idx,
> > 
> > Maybe pt_update_contig_markers, so it's not such a generic name.
> 
> I can do that. The header may then want to be named pt-contig-marker.h
> or pt-contig-markers.h. Thoughts?

Seems fine to me.

> >> +                                  unsigned int level, enum PTE_kind kind)
> >> +{
> >> +    unsigned int b, i = idx;
> >> +    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
> >> +
> >> +    ASSERT(idx < CONTIG_NR);
> >> +    ASSERT(!(pt[idx] & CONTIG_MASK));
> >> +
> >> +    /* Step 1: Reduce markers in lower numbered entries. */
> >> +    while ( i )
> >> +    {
> >> +        b = find_first_set_bit(i);
> >> +        i &= ~(1U << b);
> >> +        if ( GET_MARKER(pt[i]) > b )
> >> +            SET_MARKER(pt[i], b);
> >> +    }
> >> +
> >> +    /* An intermediate table is never contiguous with anything. */
> >> +    if ( kind == PTE_kind_table )
> >> +        return false;
> >> +
> >> +    /*
> >> +     * Present entries need in sync index and address to be a candidate
> >> +     * for being contiguous: What we're after is whether ultimately the
> >> +     * intermediate table can be replaced by a superpage.
> >> +     */
> >> +    if ( kind != PTE_kind_null &&
> >> +         idx != ((pt[idx] >> shift) & (CONTIG_NR - 1)) )
> > 
> > Don't you just need to check that the address is aligned to at least
> > idx, not that it's exactly aligned?
> 
> No, that wouldn't be sufficient. We're not after a general "is
> contiguous" here, but strictly after "is this slot meeting the
> requirements for the whole table eventually getting replaced by a
> superpage".

I see, makes sense. I didn't relate this check to the 'replaced by a
superpage' part of the comment.

> >> +        return false;
> >> +
> >> +    /* Step 2: Check higher numbered entries for contiguity. */
> >> +    for ( b = 0; b < CONTIG_LEVEL_SHIFT && !(idx & (1U << b)); ++b )
> >> +    {
> >> +        i = idx | (1U << b);
> >> +        if ( (kind == PTE_kind_leaf
> >> +              ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))
> > 
> > Maybe this could be a macro, CHECK_CONTIG or some such? It's also used
> > below.
> 
> Hmm, yes, this might indeed help readability. There's going to be a
> lot of parameters though; not sure whether omitting all(?) parameters
> for such a locally used macro would be considered acceptable.
> 
> > I would also think this would be clearer as:
> > 
> > (pt[idx] & ~CONTIG_MASK) + (1ULL << (shift + b)) == (pt[i] & ~CONTIG_MASK)
> 
> By using + we'd consider entries contiguous which for our purposes
> shouldn't be considered so. Yes, the earlier check should already
> have caught that case, but I'd like the checks to be as tight as
> possible.
> 
> >> +              : pt[i] & ~CONTIG_MASK) ||
> > 
> > Isn't PTE_kind_null always supposed to be empty?
> 
> Yes (albeit this could be relaxed, but then the logic here would
> need to know where the "present" bit(s) is/are).
> 
> > (ie: wouldn't this check always succeed?)
> 
> No - "kind" describes pt[idx], not pt[i].
> 
> >> +             GET_MARKER(pt[i]) != b )
> >> +            break;
> >> +    }
> >> +
> >> +    /* Step 3: Update markers in this and lower numbered entries. */
> >> +    for ( ; SET_MARKER(pt[idx], b), b < CONTIG_LEVEL_SHIFT; ++b )
> >> +    {
> >> +        i = idx ^ (1U << b);
> >> +        if ( (kind == PTE_kind_leaf
> >> +              ? ((pt[i] ^ pt[idx]) & ~CONTIG_MASK) != (1ULL << (b + shift))
> >> +              : pt[i] & ~CONTIG_MASK) ||
> >> +             GET_MARKER(pt[i]) != b )
> >> +            break;
> >> +        idx &= ~(1U << b);
> > 
> > There's an iteration where idx will be 0, and then there's no further
> > point in doing the & anymore?
> 
> Yes, but doing the & anyway is cheaper than adding a conditional.

I think it might be interesting to add some kind of unit testing to
this code in tools/tests. It's a standalone piece of code that could
be easily tested for correct functionality. Not that you should do it
here, in fact it might be interesting for me to do so in order to
better understand the code.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one"
  2021-12-16 11:30   ` Rahul Singh
@ 2021-12-21  8:04     ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2021-12-21  8:04 UTC (permalink / raw)
  To: Rahul Singh
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

On 16.12.2021 12:30, Rahul Singh wrote:
>> On 24 Sep 2021, at 10:53 am, Jan Beulich <jbeulich@suse.com> wrote:
>>
>> Having a separate flush-all hook has always puzzled me somewhat. We
>> will want to be able to force a full flush via accumulated flush flags
>> from the map/unmap functions. Introduce a respective new flag and fold
>> all flush handling to use the single remaining hook.
>>
>> Note that because of the respective comments in SMMU and IPMMU-VMSA
>> code, I've folded the two prior hook functions into one. For SMMU-v3,
>> which lacks a comment towards incapable hardware, I've left both
>> functions in place on the assumption that selective and full flushes
>> will eventually want separating.
> 
> 
> For SMMUv3 related changes:
> Reviewed-by: Rahul Singh <rahul.singh@arm.com>

Thanks. Any chance of an ack / R-b also for patch 3?

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in page tables
  2021-12-20 15:25       ` Roger Pau Monné
@ 2021-12-21  8:09         ` Jan Beulich
  2022-01-04  8:57           ` Roger Pau Monné
  0 siblings, 1 reply; 100+ messages in thread
From: Jan Beulich @ 2021-12-21  8:09 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 20.12.2021 16:25, Roger Pau Monné wrote:
> I think it might be interesting to add some kind of unit testing to
> this code in tools/tests. It's a standalone piece of code that could
> be easily tested for correct functionality. Not that you should do it
> here, in fact it might be interesting for me to do so in order to
> better understand the code.

Actually I developed this by first having a user space app where I could
control insertions / removals from the command line. Only once I had it
working that way did I convert the helper function to what's now
in this header. But that user space app wouldn't directly lend itself to
become an element under tools/tests/, I'm afraid.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes
  2021-09-24  9:43 ` [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes Jan Beulich
  2021-11-30 12:25   ` Roger Pau Monné
  2021-12-17 14:43   ` Julien Grall
@ 2021-12-21  9:26   ` Rahul Singh
  2 siblings, 0 replies; 100+ messages in thread
From: Rahul Singh @ 2021-12-21  9:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian, Julien Grall,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

Hi Jan,

> On 24 Sep 2021, at 10:43 am, Jan Beulich <jbeulich@suse.com> wrote:
> 
> Generic code will use this information to determine what order values
> can legitimately be passed to the ->{,un}map_page() hooks. For now all
> ops structures simply get to announce 4k mappings (as base page size),
> and there is (and always has been) an assumption that this matches the
> CPU's MMU base page size (eventually we will want to permit IOMMUs with
> a base page size smaller than the CPU MMU's).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Reviewed-by: Rahul Singh <rahul.singh@arm.com>

Regards,
Rahul
> 
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -629,6 +629,7 @@ static void amd_dump_page_tables(struct
> }
> 
> static const struct iommu_ops __initconstrel _iommu_ops = {
> +    .page_sizes = PAGE_SIZE_4K,
>     .init = amd_iommu_domain_init,
>     .hwdom_init = amd_iommu_hwdom_init,
>     .quarantine_init = amd_iommu_quarantine_init,
> --- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> +++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
> @@ -1298,6 +1298,7 @@ static void ipmmu_iommu_domain_teardown(
> 
> static const struct iommu_ops ipmmu_iommu_ops =
> {
> +    .page_sizes      = PAGE_SIZE_4K,
>     .init            = ipmmu_iommu_domain_init,
>     .hwdom_init      = ipmmu_iommu_hwdom_init,
>     .teardown        = ipmmu_iommu_domain_teardown,
> --- a/xen/drivers/passthrough/arm/smmu.c
> +++ b/xen/drivers/passthrough/arm/smmu.c
> @@ -2873,6 +2873,7 @@ static void arm_smmu_iommu_domain_teardo
> }
> 
> static const struct iommu_ops arm_smmu_iommu_ops = {
> +    .page_sizes = PAGE_SIZE_4K,
>     .init = arm_smmu_iommu_domain_init,
>     .hwdom_init = arm_smmu_iommu_hwdom_init,
>     .add_device = arm_smmu_dt_add_device_generic,
> --- a/xen/drivers/passthrough/arm/smmu-v3.c
> +++ b/xen/drivers/passthrough/arm/smmu-v3.c
> @@ -3426,7 +3426,8 @@ static void arm_smmu_iommu_xen_domain_te
> }
> 
> static const struct iommu_ops arm_smmu_iommu_ops = {
> -	.init		= arm_smmu_iommu_xen_domain_init,
> +	.page_sizes		= PAGE_SIZE_4K,
> +	.init			= arm_smmu_iommu_xen_domain_init,
> 	.hwdom_init		= arm_smmu_iommu_hwdom_init,
> 	.teardown		= arm_smmu_iommu_xen_domain_teardown,
> 	.iotlb_flush		= arm_smmu_iotlb_flush,
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -470,7 +470,17 @@ int __init iommu_setup(void)
> 
>     if ( iommu_enable )
>     {
> +        const struct iommu_ops *ops = NULL;
> +
>         rc = iommu_hardware_setup();
> +        if ( !rc )
> +            ops = iommu_get_ops();
> +        if ( ops && (ops->page_sizes & -ops->page_sizes) != PAGE_SIZE )
> +        {
> +            printk(XENLOG_ERR "IOMMU: page size mask %lx unsupported\n",
> +                   ops->page_sizes);
> +            rc = ops->page_sizes ? -EPERM : -ENODATA;
> +        }
>         iommu_enabled = (rc == 0);
>     }
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2806,6 +2806,7 @@ static int __init intel_iommu_quarantine
> }
> 
> static struct iommu_ops __initdata vtd_ops = {
> +    .page_sizes = PAGE_SIZE_4K,
>     .init = intel_iommu_domain_init,
>     .hwdom_init = intel_iommu_hwdom_init,
>     .quarantine_init = intel_iommu_quarantine_init,
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -231,6 +231,7 @@ struct page_info;
> typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
> 
> struct iommu_ops {
> +    unsigned long page_sizes;
>     int (*init)(struct domain *d);
>     void (*hwdom_init)(struct domain *d);
>     int (*quarantine_init)(struct domain *d);
> 
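
As an aside, the ops->page_sizes & -ops->page_sizes expression in the
iommu_setup() hunk above isolates the lowest set bit of the mask, i.e. the
smallest page size the vendor code announces, which is then required to
equal the CPU's base page size. A standalone illustration (not Xen code;
the helper and the example mask are made up):

    #include <assert.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12 /* assumed 4k base pages */

    /* An order is usable iff the corresponding size bit is announced. */
    static bool order_ok(unsigned long page_sizes, unsigned int order)
    {
        return page_sizes & (1UL << (order + PAGE_SHIFT));
    }

    int main(void)
    {
        unsigned long mask = (1UL << 12) | (1UL << 21) | (1UL << 30); /* 4k/2M/1G */

        /* The iommu_setup() check: smallest announced size == base page size. */
        assert((mask & -mask) == (1UL << PAGE_SHIFT));

        assert(order_ok(mask, 0));  /* 4k */
        assert(order_ok(mask, 9));  /* 2M */
        assert(!order_ok(mask, 1)); /* 8k not announced */
        return 0;
    }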



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in page tables
  2021-12-21  8:09         ` Jan Beulich
@ 2022-01-04  8:57           ` Roger Pau Monné
  2022-01-04  9:00             ` Jan Beulich
  0 siblings, 1 reply; 100+ messages in thread
From: Roger Pau Monné @ 2022-01-04  8:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Tue, Dec 21, 2021 at 09:09:45AM +0100, Jan Beulich wrote:
> On 20.12.2021 16:25, Roger Pau Monné wrote:
> > I think it might be interesting to add some kind of unit testing to
> > this code in tools/tests. It's a standalone piece of code that could
> > be easily tested for correct functionality. Not that you should do it
> > here, in fact it might be interesting for me to do so in order to
> > better understand the code.
> 
> Actually I developed this by first having a user space app where I could
> control insertions / removals from the command line. Only once I had it
> working that way did I convert the helper function to what's now
> in this header. But that user space app wouldn't directly lend itself to
> become an element under tools/tests/, I'm afraid.

Also, I'm curious, did you develop the algorithm yourself, or is there
some prior literature about it? I wonder how other OSes deal with this
problem if they support coalescing contiguous pages.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in page tables
  2022-01-04  8:57           ` Roger Pau Monné
@ 2022-01-04  9:00             ` Jan Beulich
  0 siblings, 0 replies; 100+ messages in thread
From: Jan Beulich @ 2022-01-04  9:00 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 04.01.2022 09:57, Roger Pau Monné wrote:
> On Tue, Dec 21, 2021 at 09:09:45AM +0100, Jan Beulich wrote:
>> On 20.12.2021 16:25, Roger Pau Monné wrote:
>>> I think it might be interesting to add some kind of unit testing to
>>> this code in tools/tests. It's a standalone piece of code that could
>>> be easily tested for correct functionality. Not that you should do it
>>> here, in fact it might be interesting for me to do so in order to
>>> better understand the code.
>>
>> Actually I developed this by first having a user space app where I could
>> control insertions / removals from the command line. Only once I had it
>> working that way did I convert the helper function to what's now
>> in this header. But that user space app wouldn't directly lend itself to
>> become an element under tools/tests/, I'm afraid.
> 
> Also, I'm curious, did you develop the algorithm yourself, or is there
> some prior literature about it?

I would have added some form of reference if I had taken it from somewhere.

Jan



^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2022-01-04  9:00 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-24  9:39 [PATCH v2 00/18] IOMMU: superpage support when not sharing pagetables Jan Beulich
2021-09-24  9:41 ` [PATCH v2 01/18] AMD/IOMMU: have callers specify the target level for page table walks Jan Beulich
2021-09-24 10:58   ` Roger Pau Monné
2021-09-24 12:02     ` Jan Beulich
2021-09-24  9:42 ` [PATCH v2 02/18] VT-d: " Jan Beulich
2021-09-24 14:45   ` Roger Pau Monné
2021-09-27  9:04     ` Jan Beulich
2021-09-27  9:13       ` Jan Beulich
2021-11-30 11:56       ` Roger Pau Monné
2021-11-30 14:38         ` Jan Beulich
2021-09-24  9:43 ` [PATCH v2 03/18] IOMMU: have vendor code announce supported page sizes Jan Beulich
2021-11-30 12:25   ` Roger Pau Monné
2021-12-17 14:43   ` Julien Grall
2021-12-21  9:26   ` Rahul Singh
2021-09-24  9:44 ` [PATCH v2 04/18] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
2021-11-30 13:49   ` Roger Pau Monné
2021-11-30 14:45     ` Jan Beulich
2021-12-17 14:42   ` Julien Grall
2021-09-24  9:45 ` [PATCH v2 05/18] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
2021-11-30 15:24   ` Roger Pau Monné
2021-12-02 15:59     ` Jan Beulich
2021-09-24  9:46 ` [PATCH v2 06/18] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
2021-12-01  9:09   ` Roger Pau Monné
2021-12-01  9:27     ` Jan Beulich
2021-12-01 10:32       ` Roger Pau Monné
2021-12-01 11:45         ` Jan Beulich
2021-12-02 15:12           ` Roger Pau Monné
2021-12-02 15:28             ` Jan Beulich
2021-12-02 19:16               ` Andrew Cooper
2021-12-03  6:41                 ` Jan Beulich
2021-09-24  9:47 ` [PATCH v2 07/18] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
2021-12-02 14:10   ` Roger Pau Monné
2021-12-03 12:38     ` Jan Beulich
2021-12-10  9:36       ` Roger Pau Monné
2021-12-10 11:41         ` Jan Beulich
2021-12-10 12:35           ` Roger Pau Monné
2021-09-24  9:48 ` [PATCH v2 08/18] IOMMU/x86: support freeing of pagetables Jan Beulich
2021-12-02 16:03   ` Roger Pau Monné
2021-12-02 16:10     ` Jan Beulich
2021-12-03  8:30       ` Roger Pau Monné
2021-12-03  9:38         ` Roger Pau Monné
2021-12-03  9:40         ` Jan Beulich
2021-12-10 13:51   ` Roger Pau Monné
2021-12-13  8:38     ` Jan Beulich
2021-09-24  9:48 ` [PATCH v2 09/18] AMD/IOMMU: drop stray TLB flush Jan Beulich
2021-12-02 16:16   ` Roger Pau Monné
2021-09-24  9:51 ` [PATCH v2 10/18] AMD/IOMMU: walk trees upon page fault Jan Beulich
2021-12-03  9:03   ` Roger Pau Monné
2021-12-03  9:49     ` Jan Beulich
2021-12-03  9:55       ` Jan Beulich
2021-12-10 10:23         ` Roger Pau Monné
2021-12-03  9:59     ` Jan Beulich
2021-09-24  9:51 ` [PATCH v2 11/18] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
2021-12-10 12:05   ` Roger Pau Monné
2021-12-10 12:59     ` Jan Beulich
2021-12-10 13:53       ` Roger Pau Monné
2021-09-24  9:52 ` [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings Jan Beulich
2021-12-10 15:06   ` Roger Pau Monné
2021-12-13  8:49     ` Jan Beulich
2021-12-13  9:45       ` Roger Pau Monné
2021-12-13 10:00         ` Jan Beulich
2021-12-13 10:33           ` Roger Pau Monné
2021-12-13 10:41             ` Jan Beulich
2021-09-24  9:52 ` [PATCH v2 13/18] VT-d: " Jan Beulich
2021-12-13 11:54   ` Roger Pau Monné
2021-12-13 13:39     ` Jan Beulich
2021-09-24  9:53 ` [PATCH v2 14/18] IOMMU: fold flush-all hook into "flush one" Jan Beulich
2021-12-13 15:04   ` Roger Pau Monné
2021-12-14  9:06     ` Jan Beulich
2021-12-14  9:27       ` Roger Pau Monné
2021-12-15 15:28   ` Oleksandr
2021-12-16  8:49     ` Jan Beulich
2021-12-16 10:39       ` Oleksandr
2021-12-16 11:30   ` Rahul Singh
2021-12-21  8:04     ` Jan Beulich
2021-12-17 14:38   ` Julien Grall
2021-09-24  9:54 ` [PATCH v2 15/18] IOMMU/x86: prefill newly allocate page tables Jan Beulich
2021-12-13 15:51   ` Roger Pau Monné
2021-12-14  9:15     ` Jan Beulich
2021-12-14 11:41       ` Roger Pau Monné
2021-12-14 11:48         ` Jan Beulich
2021-12-14 14:50   ` Roger Pau Monné
2021-12-14 15:05     ` Jan Beulich
2021-12-14 15:15       ` Roger Pau Monné
2021-12-14 15:21         ` Jan Beulich
2021-12-14 15:06   ` Roger Pau Monné
2021-12-14 15:10     ` Jan Beulich
2021-12-14 15:17       ` Roger Pau Monné
2021-12-14 15:24         ` Jan Beulich
2021-09-24  9:55 ` [PATCH v2 16/18] x86: introduce helper for recording degree of contiguity in " Jan Beulich
2021-12-15 13:57   ` Roger Pau Monné
2021-12-16 15:47     ` Jan Beulich
2021-12-20 15:25       ` Roger Pau Monné
2021-12-21  8:09         ` Jan Beulich
2022-01-04  8:57           ` Roger Pau Monné
2022-01-04  9:00             ` Jan Beulich
2021-09-24  9:55 ` [PATCH v2 17/18] AMD/IOMMU: free all-empty " Jan Beulich
2021-12-15 15:14   ` Roger Pau Monné
2021-12-16 15:54     ` Jan Beulich
2021-09-24  9:56 ` [PATCH v2 18/18] VT-d: " Jan Beulich
