* [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables
@ 2022-04-25  8:29 Jan Beulich
  2022-04-25  8:30 ` [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts Jan Beulich
                   ` (21 more replies)
  0 siblings, 22 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

For a long time we've been rather inefficient with IOMMU page table
management when not sharing page tables, i.e. in particular for PV (and
specifically PV Dom0) and for AMD (where nowadays we never share page
tables). While up to about 2.5 years ago the AMD code had logic to
un-shatter page mappings, that logic was ripped out for being buggy
(XSA-275 plus follow-on).

This series enables use of large pages in AMD and Intel (VT-d) code;
Arm is presently not in need of any enabling as pagetables are always
shared there. It also augments PV Dom0 creation with suitable explicit
IOMMU mapping calls to facilitate use of large pages there. Depending
on the amount of memory handed to Dom0 this improves boot time
(latency until Dom0 actually starts) quite a bit; subsequent shattering
of some of the large pages may of course consume some of the saved time.

Known fallout has been spelled out here:
https://lists.xen.org/archives/html/xen-devel/2021-08/msg00781.html

There's a dependency on 'PCI: replace "secondary" flavors of
PCI_{DEVFN,BDF,SBDF}()', in particular for patch 8. Its prerequisite
patch still lacks an Arm ack, so it couldn't go in yet.

As is perhaps to be expected, there are also a few seemingly unrelated
changes included here, which I came to consider necessary or at least
desirable along the way (in part because the affected code had been in
need of adjustment for a long time). Some of these changes are likely
independent of the bulk of the work here, and hence may be fine to go in
ahead of the earlier patches.

See individual patches for details on the v4 changes.

01: AMD/IOMMU: correct potentially-UB shifts
02: IOMMU: simplify unmap-on-error in iommu_map()
03: IOMMU: add order parameter to ->{,un}map_page() hooks
04: IOMMU: have iommu_{,un}map() split requests into largest possible chunks
05: IOMMU/x86: restrict IO-APIC mappings for PV Dom0
06: IOMMU/x86: perform PV Dom0 mappings in batches
07: IOMMU/x86: support freeing of pagetables
08: AMD/IOMMU: walk trees upon page fault
09: AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
10: AMD/IOMMU: allow use of superpage mappings
11: VT-d: allow use of superpage mappings
12: IOMMU: fold flush-all hook into "flush one"
13: IOMMU/x86: prefill newly allocated page tables
14: x86: introduce helper for recording degree of contiguity in page tables
15: AMD/IOMMU: free all-empty page tables
16: VT-d: free all-empty page tables
17: AMD/IOMMU: replace all-contiguous page tables by superpage mappings
18: VT-d: replace all-contiguous page tables by superpage mappings
19: IOMMU/x86: add perf counters for page table splitting / coalescing
20: VT-d: fold iommu_flush_iotlb{,_pages}()
21: VT-d: fold dma_pte_clear_one() into its only caller

While not directly related (except that making this mode work properly
here was a fair part of the overall work), on this occasion I'd also
like to renew my proposal to make "iommu=dom0-strict" the default going
forward. For PVH Dom0 it already is not only the default, but the only
possible mode.

Jan




* [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
@ 2022-04-25  8:30 ` Jan Beulich
  2022-04-27 13:08   ` Andrew Cooper
  2022-05-03 10:10   ` Roger Pau Monné
  2022-04-25  8:32 ` [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map() Jan Beulich
                   ` (20 subsequent siblings)
  21 siblings, 2 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

Recent changes (likely 5fafa6cf529a ["AMD/IOMMU: have callers specify
the target level for page table walks"]) have made Coverity notice a
shift count in iommu_pde_from_dfn() which might in theory grow too
large. While this isn't a problem in practice, address the concern
nevertheless, so as not to leave latent breakage in case very large
superpages are enabled at some point.

Coverity ID: 1504264

While there, also address a similar issue in set_iommu_ptes_present().
It's not clear to me why Coverity hasn't spotted that one.
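
For illustration only (not part of the change): with PTE_PER_TABLE_SHIFT
being 9, such a shift count would reach 36 for a hypothetical level-5
superpage, at which point the type of the shifted constant matters:

    unsigned int n = PTE_PER_TABLE_SHIFT * 4;          /* 36 */
    unsigned int  ub   = 1U  << n;  /* shift count >= type width: UB */
    unsigned long fine = 1UL << n;  /* well-defined with 64-bit long */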

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -89,11 +89,11 @@ static unsigned int set_iommu_ptes_prese
                                            bool iw, bool ir)
 {
     union amd_iommu_pte *table, *pde;
-    unsigned int page_sz, flush_flags = 0;
+    unsigned long page_sz = 1UL << (PTE_PER_TABLE_SHIFT * (pde_level - 1));
+    unsigned int flush_flags = 0;
 
     table = map_domain_page(_mfn(pt_mfn));
     pde = &table[pfn_to_pde_idx(dfn, pde_level)];
-    page_sz = 1U << (PTE_PER_TABLE_SHIFT * (pde_level - 1));
 
     if ( (void *)(pde + nr_ptes) > (void *)table + PAGE_SIZE )
     {
@@ -281,7 +281,7 @@ static int iommu_pde_from_dfn(struct dom
         {
             unsigned long mfn, pfn;
 
-            pfn =  dfn & ~((1 << (PTE_PER_TABLE_SHIFT * next_level)) - 1);
+            pfn = dfn & ~((1UL << (PTE_PER_TABLE_SHIFT * next_level)) - 1);
             mfn = next_table_mfn;
 
             /* allocate lower level page table */




* [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map()
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
  2022-04-25  8:30 ` [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts Jan Beulich
@ 2022-04-25  8:32 ` Jan Beulich
  2022-04-27 13:16   ` Andrew Cooper
  2022-05-03 10:25   ` Roger Pau Monné
  2022-04-25  8:32 ` [PATCH v4 03/21] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
                   ` (19 subsequent siblings)
  21 siblings, 2 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

As of 68a8aa5d7264 ("iommu: make map and unmap take a page count,
similar to flush") there's no need anymore to have a loop here.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -308,11 +308,9 @@ int iommu_map(struct domain *d, dfn_t df
                    d->domain_id, dfn_x(dfn_add(dfn, i)),
                    mfn_x(mfn_add(mfn, i)), rc);
 
-        while ( i-- )
-            /* if statement to satisfy __must_check */
-            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                            flush_flags) )
-                continue;
+        /* while statement to satisfy __must_check */
+        while ( iommu_unmap(d, dfn, i, flush_flags) )
+            break;
 
         if ( !is_hardware_domain(d) )
             domain_crash(d);




* [PATCH v4 03/21] IOMMU: add order parameter to ->{,un}map_page() hooks
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
  2022-04-25  8:30 ` [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts Jan Beulich
  2022-04-25  8:32 ` [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map() Jan Beulich
@ 2022-04-25  8:32 ` Jan Beulich
  2022-04-25  8:33 ` [PATCH v4 04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

Or really, in the case of ->map_page(), accommodate it in the existing
"flags" parameter. All call sites will pass 0 for now.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com> # Arm
---
v4: Re-base.
v3: Re-base over new earlier patch.
v2: Re-base over change earlier in the series.

--- a/xen/arch/arm/include/asm/iommu.h
+++ b/xen/arch/arm/include/asm/iommu.h
@@ -31,6 +31,7 @@ int __must_check arm_iommu_map_page(stru
                                     unsigned int flags,
                                     unsigned int *flush_flags);
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags);
 
 #endif /* __ARCH_ARM_IOMMU_H__ */
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -245,7 +245,8 @@ int __must_check cf_check amd_iommu_map_
     struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
     unsigned int *flush_flags);
 int __must_check cf_check amd_iommu_unmap_page(
-    struct domain *d, dfn_t dfn, unsigned int *flush_flags);
+    struct domain *d, dfn_t dfn, unsigned int order,
+    unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain *d);
 int amd_iommu_reserve_domain_unity_map(struct domain *domain,
                                        const struct ivrs_unity_map *map,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -395,7 +395,7 @@ int cf_check amd_iommu_map_page(
 }
 
 int cf_check amd_iommu_unmap_page(
-    struct domain *d, dfn_t dfn, unsigned int *flush_flags)
+    struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_flags)
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
--- a/xen/drivers/passthrough/arm/iommu_helpers.c
+++ b/xen/drivers/passthrough/arm/iommu_helpers.c
@@ -57,11 +57,13 @@ int __must_check arm_iommu_map_page(stru
      * The function guest_physmap_add_entry replaces the current mapping
      * if there is already one...
      */
-    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0, t);
+    return guest_physmap_add_entry(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+                                   IOMMUF_order(flags), t);
 }
 
 /* Should only be used if P2M Table is shared between the CPU and the IOMMU. */
 int __must_check arm_iommu_unmap_page(struct domain *d, dfn_t dfn,
+                                      unsigned int order,
                                       unsigned int *flush_flags)
 {
     /*
@@ -71,7 +73,8 @@ int __must_check arm_iommu_unmap_page(st
     if ( !is_domain_direct_mapped(d) )
         return -EINVAL;
 
-    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)), 0);
+    return guest_physmap_remove_page(d, _gfn(dfn_x(dfn)), _mfn(dfn_x(dfn)),
+                                     order);
 }
 
 /*
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -294,6 +294,8 @@ int iommu_map(struct domain *d, dfn_t df
     if ( !is_iommu_enabled(d) )
         return 0;
 
+    ASSERT(!IOMMUF_order(flags));
+
     for ( i = 0; i < page_count; i++ )
     {
         rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
@@ -354,7 +356,7 @@ int iommu_unmap(struct domain *d, dfn_t
     for ( i = 0; i < page_count; i++ )
     {
         int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                             flush_flags);
+                             0, flush_flags);
 
         if ( likely(!err) )
             continue;
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2163,7 +2163,7 @@ static int __must_check cf_check intel_i
 }
 
 static int __must_check cf_check intel_iommu_unmap_page(
-    struct domain *d, dfn_t dfn, unsigned int *flush_flags)
+    struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_flags)
 {
     /* Do nothing if VT-d shares EPT page table */
     if ( iommu_use_hap_pt(d) )
@@ -2173,7 +2173,7 @@ static int __must_check cf_check intel_i
     if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
         return 0;
 
-    return dma_pte_clear_one(d, dfn_to_daddr(dfn), 0, flush_flags);
+    return dma_pte_clear_one(d, dfn_to_daddr(dfn), order, flush_flags);
 }
 
 static int cf_check intel_iommu_lookup_page(
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -127,9 +127,10 @@ void arch_iommu_hwdom_init(struct domain
  * The following flags are passed to map operations and passed by lookup
  * operations.
  */
-#define _IOMMUF_readable 0
+#define IOMMUF_order(n)  ((n) & 0x3f)
+#define _IOMMUF_readable 6
 #define IOMMUF_readable  (1u<<_IOMMUF_readable)
-#define _IOMMUF_writable 1
+#define _IOMMUF_writable 7
 #define IOMMUF_writable  (1u<<_IOMMUF_writable)
 
 /*
@@ -255,6 +256,7 @@ struct iommu_ops {
                                  unsigned int flags,
                                  unsigned int *flush_flags);
     int __must_check (*unmap_page)(struct domain *d, dfn_t dfn,
+                                   unsigned int order,
                                    unsigned int *flush_flags);
     int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
                                     unsigned int *flags);




* [PATCH v4 04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (2 preceding siblings ...)
  2022-04-25  8:32 ` [PATCH v4 03/21] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
@ 2022-04-25  8:33 ` Jan Beulich
  2022-05-03 12:37   ` Roger Pau Monné
  2022-04-25  8:34 ` [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:33 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

Introduce a helper function to determine the largest possible mapping
that allows covering a request (or the next part of it that is left to
be processed).

In order not to add yet more recurring dfn_add() / mfn_add() invocations
to the two callers of the new helper, also introduce local variables
holding the values presently being operated on.
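
As an illustration of the resulting splitting (assuming a ->page_sizes
value covering 4k and 2M pages), a request like

    iommu_map(d, _dfn(0x200), _mfn(0x600), 0x401, flags, &flush_flags);

would reach the ->map_page() hook as three chunks: dfn 0x200 and
dfn 0x400 each with order 9 (both DFN and MFN are 2M-aligned and at
least 512 pages remain to be processed), and finally dfn 0x600 with
order 0 for the last, single page.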

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Re-base over new earlier patch.

--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -283,12 +283,38 @@ void iommu_domain_destroy(struct domain
     arch_iommu_domain_destroy(d);
 }
 
-int iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
+static unsigned int mapping_order(const struct domain_iommu *hd,
+                                  dfn_t dfn, mfn_t mfn, unsigned long nr)
+{
+    unsigned long res = dfn_x(dfn) | mfn_x(mfn);
+    unsigned long sizes = hd->platform_ops->page_sizes;
+    unsigned int bit = find_first_set_bit(sizes), order = 0;
+
+    ASSERT(bit == PAGE_SHIFT);
+
+    while ( (sizes = (sizes >> bit) & ~1) )
+    {
+        unsigned long mask;
+
+        bit = find_first_set_bit(sizes);
+        mask = (1UL << bit) - 1;
+        if ( nr <= mask || (res & mask) )
+            break;
+        order += bit;
+        nr >>= bit;
+        res >>= bit;
+    }
+
+    return order;
+}
+
+int iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
               unsigned long page_count, unsigned int flags,
               unsigned int *flush_flags)
 {
     const struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
+    unsigned int order;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
@@ -296,10 +322,15 @@ int iommu_map(struct domain *d, dfn_t df
 
     ASSERT(!IOMMUF_order(flags));
 
-    for ( i = 0; i < page_count; i++ )
+    for ( i = 0; i < page_count; i += 1UL << order )
     {
-        rc = iommu_call(hd->platform_ops, map_page, d, dfn_add(dfn, i),
-                        mfn_add(mfn, i), flags, flush_flags);
+        dfn_t dfn = dfn_add(dfn0, i);
+        mfn_t mfn = mfn_add(mfn0, i);
+
+        order = mapping_order(hd, dfn, mfn, page_count - i);
+
+        rc = iommu_call(hd->platform_ops, map_page, d, dfn, mfn,
+                        flags | IOMMUF_order(order), flush_flags);
 
         if ( likely(!rc) )
             continue;
@@ -307,11 +338,10 @@ int iommu_map(struct domain *d, dfn_t df
         if ( !d->is_shutting_down && printk_ratelimit() )
             printk(XENLOG_ERR
                    "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" failed: %d\n",
-                   d->domain_id, dfn_x(dfn_add(dfn, i)),
-                   mfn_x(mfn_add(mfn, i)), rc);
+                   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
 
         /* while statement to satisfy __must_check */
-        while ( iommu_unmap(d, dfn, i, flush_flags) )
+        while ( iommu_unmap(d, dfn0, i, flush_flags) )
             break;
 
         if ( !is_hardware_domain(d) )
@@ -343,20 +373,25 @@ int iommu_legacy_map(struct domain *d, d
     return rc;
 }
 
-int iommu_unmap(struct domain *d, dfn_t dfn, unsigned long page_count,
+int iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
                 unsigned int *flush_flags)
 {
     const struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
+    unsigned int order;
     int rc = 0;
 
     if ( !is_iommu_enabled(d) )
         return 0;
 
-    for ( i = 0; i < page_count; i++ )
+    for ( i = 0; i < page_count; i += 1UL << order )
     {
-        int err = iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
-                             0, flush_flags);
+        dfn_t dfn = dfn_add(dfn0, i);
+        int err;
+
+        order = mapping_order(hd, dfn, _mfn(0), page_count - i);
+        err = iommu_call(hd->platform_ops, unmap_page, d, dfn,
+                         order, flush_flags);
 
         if ( likely(!err) )
             continue;
@@ -364,7 +399,7 @@ int iommu_unmap(struct domain *d, dfn_t
         if ( !d->is_shutting_down && printk_ratelimit() )
             printk(XENLOG_ERR
                    "d%d: IOMMU unmapping dfn %"PRI_dfn" failed: %d\n",
-                   d->domain_id, dfn_x(dfn_add(dfn, i)), err);
+                   d->domain_id, dfn_x(dfn), err);
 
         if ( !rc )
             rc = err;




* [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (3 preceding siblings ...)
  2022-04-25  8:33 ` [PATCH v4 04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
@ 2022-04-25  8:34 ` Jan Beulich
  2022-05-03 13:00   ` Roger Pau Monné
  2022-04-25  8:34 ` [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:34 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

While this is already the case for PVH, there's no reason to treat PV
differently here, though of course the addresses are taken from another
source in this case. The one exception is that, to match CPU-side
mappings, we permit r/o ones by default. This then also means we now
deal consistently with IO-APICs whose MMIO is or is not covered by E820
reserved regions.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
[integrated] v1: Integrate into series.
[standalone] v2: Keep IOMMU mappings in sync with CPU ones.

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -275,12 +275,12 @@ void iommu_identity_map_teardown(struct
     }
 }
 
-static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
-                                         unsigned long pfn,
-                                         unsigned long max_pfn)
+static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
+                                                 unsigned long pfn,
+                                                 unsigned long max_pfn)
 {
     mfn_t mfn = _mfn(pfn);
-    unsigned int i, type;
+    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
 
     /*
      * Set up 1:1 mapping for dom0. Default to include only conventional RAM
@@ -289,44 +289,60 @@ static bool __hwdom_init hwdom_iommu_map
      * that fall in unusable ranges for PV Dom0.
      */
     if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-        return false;
+        return 0;
 
     switch ( type = page_get_ram_type(mfn) )
     {
     case RAM_TYPE_UNUSABLE:
-        return false;
+        return 0;
 
     case RAM_TYPE_CONVENTIONAL:
         if ( iommu_hwdom_strict )
-            return false;
+            return 0;
         break;
 
     default:
         if ( type & RAM_TYPE_RESERVED )
         {
             if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-                return false;
+                perms = 0;
         }
-        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
-            return false;
+        else if ( is_hvm_domain(d) )
+            return 0;
+        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
+            perms = 0;
     }
 
     /* Check that it doesn't overlap with the Interrupt Address Range. */
     if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-        return false;
+        return 0;
     /* ... or the IO-APIC */
-    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
-        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-            return false;
+    if ( has_vioapic(d) )
+    {
+        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
+            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
+                return 0;
+    }
+    else if ( is_pv_domain(d) )
+    {
+        /*
+         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
+         * ones there, so it should also have such established for IOMMUs.
+         */
+        for ( i = 0; i < nr_ioapics; i++ )
+            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
+                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
+                       ? IOMMUF_readable : 0;
+    }
     /*
      * ... or the PCIe MCFG regions.
      * TODO: runtime added MMCFG regions are not checked to make sure they
      * don't overlap with already mapped regions, thus preventing trapping.
      */
     if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
-        return false;
+        return 0;
 
-    return true;
+    return perms;
 }
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
@@ -368,15 +384,19 @@ void __hwdom_init arch_iommu_hwdom_init(
     for ( ; i < top; i++ )
     {
         unsigned long pfn = pdx_to_pfn(i);
+        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
         int rc;
 
-        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
+        if ( !perms )
             rc = 0;
         else if ( paging_mode_translate(d) )
-            rc = p2m_add_identity_entry(d, pfn, p2m_access_rw, 0);
+            rc = p2m_add_identity_entry(d, pfn,
+                                        perms & IOMMUF_writable ? p2m_access_rw
+                                                                : p2m_access_r,
+                                        0);
         else
             rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+                           perms, &flush_flags);
 
         if ( rc )
             printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",




* [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (4 preceding siblings ...)
  2022-04-25  8:34 ` [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
@ 2022-04-25  8:34 ` Jan Beulich
  2022-05-03 14:49   ` Roger Pau Monné
  2022-04-25  8:35 ` [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables Jan Beulich
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:34 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Wei Liu

For large page mappings to be easily usable (i.e. in particular without
un-shattering of smaller page mappings) and for mapping operations to
then also be more efficient, pass batches of Dom0 memory to iommu_map().
In dom0_construct_pv() and its helpers (covering strict mode) this
additionally requires establishing the type of those pages (albeit with
zero type references).

Establishing PGT_writable_page | PGT_validated earlier requires the
existing places where this gets done (through get_page_and_type()) to be
updated: for pages which actually have a mapping, the type refcount
needs to be 1.

There is actually a related bug that gets fixed here as a side effect:
Typically the last L1 table would get marked as such only after
get_page_and_type(..., PGT_writable_page). While this is fine as far as
refcounting goes, the page did remain mapped in the IOMMU in this case
(when "iommu=dom0-strict").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Subsequently p2m_add_identity_entry() may want to also gain an order
parameter, for arch_iommu_hwdom_init() to use. While this only affects
non-RAM regions, systems typically have 2-16Mb of reserved space
immediately below 4Gb, which hence could be mapped more efficiently.

The installing of zero-ref writable types has in fact shown (observed
while putting together the change) that, despite the intention of the
XSA-288 changes (affecting DomU-s only), for Dom0 a number of
sufficiently ordinary pages have still been starting out as PGT_none: at
the very least initrd and P2M pages, as well as pages which are part of
the initial allocation but not part of the initial mapping. Such pages
would have gained IOMMU mappings only the first time they got mapped
writably. Consequently an open question is whether iommu_memory_setup()
should set the pages to PGT_writable_page independent of
need_iommu_pt_sync().

I didn't think I needed to address the bug mentioned in the description
in a separate (prereq) patch, but if others disagree I could certainly
break out that part (which would then need to use iommu_legacy_unmap()
first).

Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
They'll end up mapped via the later get_page_and_type().

As to the way these refs get installed: I've chosen to avoid the more
expensive {get,put}_page_and_type(), favoring to put in place the
intended type directly. I guess I could be convinced to avoid this
bypassing of the actual logic; I merely think it's unnecessarily
expensive.

Note also that strictly speaking the iommu_iotlb_flush_all() here (as
well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
needed: Actual hooking up (AMD) or enabling of translation (VT-d)
occurs only afterwards anyway, so nothing can have made it into TLBs
just yet.
---
v3: Fold iommu_map() into (the now renamed) iommu_memory_setup(). Move
    iommu_unmap() into mark_pv_pt_pages_rdonly(). Adjust (split) log
    message in arch_iommu_hwdom_init().

--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -46,7 +46,8 @@ void __init dom0_update_physmap(bool com
 static __init void mark_pv_pt_pages_rdonly(struct domain *d,
                                            l4_pgentry_t *l4start,
                                            unsigned long vpt_start,
-                                           unsigned long nr_pt_pages)
+                                           unsigned long nr_pt_pages,
+                                           unsigned int *flush_flags)
 {
     unsigned long count;
     struct page_info *page;
@@ -71,6 +72,14 @@ static __init void mark_pv_pt_pages_rdon
         ASSERT((page->u.inuse.type_info & PGT_type_mask) <= PGT_root_page_table);
         ASSERT(!(page->u.inuse.type_info & ~(PGT_type_mask | PGT_pae_xen_l2)));
 
+        /*
+         * Page table pages need to be removed from the IOMMU again in case
+         * iommu_memory_setup() ended up mapping them.
+         */
+        if ( need_iommu_pt_sync(d) &&
+             iommu_unmap(d, _dfn(mfn_x(page_to_mfn(page))), 1, flush_flags) )
+            BUG();
+
         /* Read-only mapping + PGC_allocated + page-table page. */
         page->count_info         = PGC_allocated | 3;
         page->u.inuse.type_info |= PGT_validated | 1;
@@ -107,11 +116,43 @@ static __init void mark_pv_pt_pages_rdon
     unmap_domain_page(pl3e);
 }
 
+static void __init iommu_memory_setup(struct domain *d, const char *what,
+                                      struct page_info *page, unsigned long nr,
+                                      unsigned int *flush_flags)
+{
+    int rc;
+    mfn_t mfn = page_to_mfn(page);
+
+    if ( !need_iommu_pt_sync(d) )
+        return;
+
+    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, nr,
+                   IOMMUF_readable | IOMMUF_writable, flush_flags);
+    if ( rc )
+    {
+        printk(XENLOG_ERR "pre-mapping %s MFN [%lx,%lx) into IOMMU failed: %d\n",
+               what, mfn_x(mfn), mfn_x(mfn) + nr, rc);
+        return;
+    }
+
+    /*
+     * For successfully established IOMMU mappings the type of the page(s)
+     * needs to match (for _get_page_type() to unmap upon type change). Set
+     * the page(s) to writable with no type ref.
+     */
+    for ( ; nr--; ++page )
+    {
+        ASSERT(!page->u.inuse.type_info);
+        page->u.inuse.type_info = PGT_writable_page | PGT_validated;
+    }
+}
+
 static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
                                     unsigned long v_start, unsigned long v_end,
                                     unsigned long vphysmap_start,
                                     unsigned long vphysmap_end,
-                                    unsigned long nr_pages)
+                                    unsigned long nr_pages,
+                                    unsigned int *flush_flags)
 {
     struct page_info *page = NULL;
     l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
@@ -177,6 +218,10 @@ static __init void setup_pv_physmap(stru
                                              L3_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
+                iommu_memory_setup(d, "P2M 1G", page,
+                                   SUPERPAGE_PAGES * SUPERPAGE_PAGES,
+                                   flush_flags);
+
                 *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
                 continue;
@@ -203,6 +248,9 @@ static __init void setup_pv_physmap(stru
                                              L2_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
+                iommu_memory_setup(d, "P2M 2M", page, SUPERPAGE_PAGES,
+                                   flush_flags);
+
                 *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
                 continue;
@@ -311,6 +359,7 @@ int __init dom0_construct_pv(struct doma
     unsigned long initrd_pfn = -1, initrd_mfn = 0;
     unsigned long count;
     struct page_info *page = NULL;
+    unsigned int flush_flags = 0;
     start_info_t *si;
     struct vcpu *v = d->vcpu[0];
     void *image_base = bootstrap_map(image);
@@ -573,6 +622,9 @@ int __init dom0_construct_pv(struct doma
                     BUG();
         }
         initrd->mod_end = 0;
+
+        iommu_memory_setup(d, "initrd", mfn_to_page(_mfn(initrd_mfn)),
+                           PFN_UP(initrd_len), &flush_flags);
     }
 
     printk("PHYSICAL MEMORY ARRANGEMENT:\n"
@@ -606,6 +658,13 @@ int __init dom0_construct_pv(struct doma
 
     process_pending_softirqs();
 
+    /*
+     * Map the full range here and then punch holes for page tables
+     * alongside marking them as such in mark_pv_pt_pages_rdonly().
+     */
+    iommu_memory_setup(d, "init-alloc", mfn_to_page(_mfn(alloc_spfn)),
+                       alloc_epfn - alloc_spfn, &flush_flags);
+
     mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
     if ( vinitrd_start )
         mpt_alloc -= PAGE_ALIGN(initrd_len);
@@ -690,7 +749,8 @@ int __init dom0_construct_pv(struct doma
         l1tab++;
 
         page = mfn_to_page(_mfn(mfn));
-        if ( !page->u.inuse.type_info &&
+        if ( (!page->u.inuse.type_info ||
+              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
              !get_page_and_type(page, d, PGT_writable_page) )
             BUG();
     }
@@ -719,7 +779,7 @@ int __init dom0_construct_pv(struct doma
     }
 
     /* Pages that are part of page tables must be read only. */
-    mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
+    mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages, &flush_flags);
 
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
@@ -794,7 +854,7 @@ int __init dom0_construct_pv(struct doma
     {
         pfn = pagetable_get_pfn(v->arch.guest_table);
         setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
-                         nr_pages);
+                         nr_pages, &flush_flags);
     }
 
     /* Write the phys->machine and machine->phys table entries. */
@@ -825,7 +885,9 @@ int __init dom0_construct_pv(struct doma
         if ( get_gpfn_from_mfn(mfn) >= count )
         {
             BUG_ON(compat);
-            if ( !page->u.inuse.type_info &&
+            if ( (!page->u.inuse.type_info ||
+                  page->u.inuse.type_info == (PGT_writable_page |
+                                              PGT_validated)) &&
                  !get_page_and_type(page, d, PGT_writable_page) )
                 BUG();
 
@@ -841,8 +903,12 @@ int __init dom0_construct_pv(struct doma
 #endif
     while ( pfn < nr_pages )
     {
-        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
+        count = domain_tot_pages(d);
+        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
             panic("Not enough RAM for DOM0 reservation\n");
+
+        iommu_memory_setup(d, "chunk", page, domain_tot_pages(d) - count,
+                           &flush_flags);
         while ( pfn < domain_tot_pages(d) )
         {
             mfn = mfn_x(page_to_mfn(page));
@@ -857,6 +923,10 @@ int __init dom0_construct_pv(struct doma
         }
     }
 
+    /* Use while() to avoid compiler warning. */
+    while ( iommu_iotlb_flush_all(d, flush_flags) )
+        break;
+
     if ( initrd_len != 0 )
     {
         si->mod_start = vinitrd_start ?: initrd_pfn;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 {
-    unsigned long i, top, max_pfn;
-    unsigned int flush_flags = 0;
+    unsigned long i, top, max_pfn, start, count;
+    unsigned int flush_flags = 0, start_perms = 0;
 
     BUG_ON(!is_hardware_domain(d));
 
@@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
      * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
      * setting up potentially conflicting mappings here.
      */
-    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
+    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
 
-    for ( ; i < top; i++ )
+    for ( i = start, count = 0; i < top; )
     {
         unsigned long pfn = pdx_to_pfn(i);
         unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
@@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
         if ( !perms )
             rc = 0;
         else if ( paging_mode_translate(d) )
+        {
             rc = p2m_add_identity_entry(d, pfn,
                                         perms & IOMMUF_writable ? p2m_access_rw
                                                                 : p2m_access_r,
                                         0);
+            if ( rc )
+                printk(XENLOG_WARNING
+                       "%pd: identity mapping of %lx failed: %d\n",
+                       d, pfn, rc);
+        }
+        else if ( pfn != start + count || perms != start_perms )
+        {
+        commit:
+            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
+                           &flush_flags);
+            if ( rc )
+                printk(XENLOG_WARNING
+                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
+                       d, pfn, pfn + count, rc);
+            SWAP(start, pfn);
+            start_perms = perms;
+            count = 1;
+        }
         else
-            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           perms, &flush_flags);
+        {
+            ++count;
+            rc = 0;
+        }
 
-        if ( rc )
-            printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
-                   d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
 
-        if (!(i & 0xfffff))
+        if ( !(++i & 0xfffff) )
             process_pending_softirqs();
+
+        if ( i == top && count )
+            goto commit;
     }
 
     /* Use if to avoid compiler warning */




* [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (5 preceding siblings ...)
  2022-04-25  8:34 ` [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
@ 2022-04-25  8:35 ` Jan Beulich
  2022-05-03 16:20   ` Roger Pau Monné
  2022-04-25  8:36 ` [PATCH v4 08/21] AMD/IOMMU: walk trees upon page fault Jan Beulich
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Wei Liu

For vendor specific code to support superpages we need to be able to
deal with a superpage mapping replacing an intermediate page table (or
hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
needed to free individual page tables while a domain is still alive.
Since the freeing needs to be deferred until after a suitable IOTLB
flush has been performed, released page tables get queued for processing
by a tasklet.
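
The intended use by vendor code (the real caller, also covering the
recursive freeing of lower level tables, appears in a later patch of
this series) is roughly

    /* "old" is the PTE which the just installed superpage replaced. */
    if ( old.pr && old.next_level )
        iommu_queue_free_pgtable(hd, mfn_to_page(_mfn(old.mfn)));

i.e. the page table is merely queued here; it gets freed from the
per-CPU tasklet, and hence only after the IOTLB flush which the caller
remains responsible for.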

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I was considering whether to use a softirq-tasklet instead. This would
have the benefit of avoiding extra scheduling operations, but come with
the risk of the freeing happening prematurely because of a
process_pending_softirqs() somewhere.
---
v4: Change type of iommu_queue_free_pgtable()'s 1st parameter. Re-base.
v3: Call process_pending_softirqs() from free_queued_pgtables().

--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -147,6 +147,7 @@ void iommu_free_domid(domid_t domid, uns
 int __must_check iommu_free_pgtables(struct domain *d);
 struct domain_iommu;
 struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
+void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *pg);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -12,6 +12,7 @@
  * this program; If not, see <http://www.gnu.org/licenses/>.
  */
 
+#include <xen/cpu.h>
 #include <xen/sched.h>
 #include <xen/iommu.h>
 #include <xen/paging.h>
@@ -550,6 +551,91 @@ struct page_info *iommu_alloc_pgtable(st
     return pg;
 }
 
+/*
+ * Intermediate page tables which get replaced by large pages may only be
+ * freed after a suitable IOTLB flush. Hence such pages get queued on a
+ * per-CPU list, with a per-CPU tasklet processing the list on the assumption
+ * that the necessary IOTLB flush will have occurred by the time tasklets get
+ * to run. (List and tasklet being per-CPU has the benefit of accesses not
+ * requiring any locking.)
+ */
+static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
+static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
+
+static void free_queued_pgtables(void *arg)
+{
+    struct page_list_head *list = arg;
+    struct page_info *pg;
+    unsigned int done = 0;
+
+    while ( (pg = page_list_remove_head(list)) )
+    {
+        free_domheap_page(pg);
+
+        /* Granularity of checking somewhat arbitrary. */
+        if ( !(++done & 0x1ff) )
+             process_pending_softirqs();
+    }
+}
+
+void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *pg)
+{
+    unsigned int cpu = smp_processor_id();
+
+    spin_lock(&hd->arch.pgtables.lock);
+    page_list_del(pg, &hd->arch.pgtables.list);
+    spin_unlock(&hd->arch.pgtables.lock);
+
+    page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu));
+
+    tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu));
+}
+
+static int cf_check cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+    struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
+    struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);
+
+    switch ( action )
+    {
+    case CPU_DOWN_PREPARE:
+        tasklet_kill(tasklet);
+        break;
+
+    case CPU_DEAD:
+        page_list_splice(list, &this_cpu(free_pgt_list));
+        INIT_PAGE_LIST_HEAD(list);
+        tasklet_schedule(&this_cpu(free_pgt_tasklet));
+        break;
+
+    case CPU_UP_PREPARE:
+    case CPU_DOWN_FAILED:
+        tasklet_init(tasklet, free_queued_pgtables, list);
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init cf_check bsp_init(void)
+{
+    if ( iommu_enabled )
+    {
+        cpu_callback(&cpu_nfb, CPU_UP_PREPARE,
+                     (void *)(unsigned long)smp_processor_id());
+        register_cpu_notifier(&cpu_nfb);
+    }
+
+    return 0;
+}
+presmp_initcall(bsp_init);
+
 bool arch_iommu_use_permitted(const struct domain *d)
 {
     /*




* [PATCH v4 08/21] AMD/IOMMU: walk trees upon page fault
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (6 preceding siblings ...)
  2022-04-25  8:35 ` [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables Jan Beulich
@ 2022-04-25  8:36 ` Jan Beulich
  2022-05-04 15:57   ` Roger Pau Monné
  2022-04-25  8:37 ` [PATCH v4 09/21] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

This is to aid diagnosing issues and largely matches VT-d's behavior.
Since I'm adding permissions output here as well, take the opportunity
to also add their display to amd_dump_page_table_level().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Note: "largely matches VT-d's behavior" includes the lack of any locking
      here. Adding suitable locking may not be that easy, as we'd need
      to determine which domain's mapping lock to acquire in addition to
      the necessary IOMMU lock (for the device table access), and
      whether that domain actually still exists. The latter is because,
      if we really want to play safe here, imo we also need to account
      for the device table being potentially corrupted / stale.
---
v4: Re-base.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -259,6 +259,8 @@ int __must_check cf_check amd_iommu_flus
     struct domain *d, dfn_t dfn, unsigned long page_count,
     unsigned int flush_flags);
 int __must_check cf_check amd_iommu_flush_iotlb_all(struct domain *d);
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+                             dfn_t dfn);
 
 /* device table functions */
 int get_dma_requestor_id(uint16_t seg, uint16_t bdf);
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -575,6 +575,9 @@ static void cf_check parse_event_log_ent
                (flags & 0x002) ? " NX" : "",
                (flags & 0x001) ? " GN" : "");
 
+        if ( iommu_verbose )
+            amd_iommu_print_entries(iommu, device_id, daddr_to_dfn(addr));
+
         for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ )
             if ( get_dma_requestor_id(iommu->seg, bdf) == device_id )
                 pci_check_disable_device(iommu->seg, PCI_BUS(bdf),
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -428,6 +428,50 @@ int cf_check amd_iommu_unmap_page(
     return 0;
 }
 
+void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
+                             dfn_t dfn)
+{
+    mfn_t pt_mfn;
+    unsigned int level;
+    const struct amd_iommu_dte *dt = iommu->dev_table.buffer;
+
+    if ( !dt[dev_id].tv )
+    {
+        printk("%pp: no root\n", &PCI_SBDF(iommu->seg, dev_id));
+        return;
+    }
+
+    pt_mfn = _mfn(dt[dev_id].pt_root);
+    level = dt[dev_id].paging_mode;
+    printk("%pp root @ %"PRI_mfn" (%u levels) dfn=%"PRI_dfn"\n",
+           &PCI_SBDF(iommu->seg, dev_id), mfn_x(pt_mfn), level, dfn_x(dfn));
+
+    while ( level )
+    {
+        const union amd_iommu_pte *pt = map_domain_page(pt_mfn);
+        unsigned int idx = pfn_to_pde_idx(dfn_x(dfn), level);
+        union amd_iommu_pte pte = pt[idx];
+
+        unmap_domain_page(pt);
+
+        printk("  L%u[%03x] = %"PRIx64" %c%c\n", level, idx, pte.raw,
+               pte.pr ? pte.ir ? 'r' : '-' : 'n',
+               pte.pr ? pte.iw ? 'w' : '-' : 'p');
+
+        if ( !pte.pr )
+            break;
+
+        if ( pte.next_level >= level )
+        {
+            printk("  L%u[%03x]: next: %u\n", level, idx, pte.next_level);
+            break;
+        }
+
+        pt_mfn = _mfn(pte.mfn);
+        level = pte.next_level;
+    }
+}
+
 static unsigned long flush_count(unsigned long dfn, unsigned long page_count,
                                  unsigned int order)
 {
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -724,10 +724,11 @@ static void amd_dump_page_table_level(st
                 mfn_to_page(_mfn(pde->mfn)), pde->next_level,
                 address, indent + 1);
         else
-            printk("%*sdfn: %08lx  mfn: %08lx\n",
+            printk("%*sdfn: %08lx  mfn: %08lx  %c%c\n",
                    indent, "",
                    (unsigned long)PFN_DOWN(address),
-                   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)));
+                   (unsigned long)PFN_DOWN(pfn_to_paddr(pde->mfn)),
+                   pde->ir ? 'r' : '-', pde->iw ? 'w' : '-');
     }
 
     unmap_domain_page(table_vaddr);




* [PATCH v4 09/21] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present()
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (7 preceding siblings ...)
  2022-04-25  8:36 ` [PATCH v4 08/21] AMD/IOMMU: walk trees upon page fault Jan Beulich
@ 2022-04-25  8:37 ` Jan Beulich
  2022-04-25  8:38 ` [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings Jan Beulich
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:37 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

In order to free intermediate page tables when replacing smaller
mappings by a single larger one, callers will need to know the full PTE.
Flush indicators can be derived from this in the callers (and outside
the locked regions). First split set_iommu_pte_present() from
set_iommu_ptes_present(): only the former needs to return the old PTE,
while the latter (like set_iommu_pde_present()) doesn't even need to
return flush indicators. Then change return types/values and callers
accordingly.

Note that for subsequent changes returning merely a boolean (old.pr) is
not going to be sufficient; the next_level field will also be required.
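
The caller side pattern this enables looks like the following (taken
from the amd_iommu_map_page() hunk below), with the flush decision now
taken outside the locked region:

    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
                                (flags & IOMMUF_writable),
                                (flags & IOMMUF_readable));

    spin_unlock(&hd->arch.mapping_lock);

    *flush_flags |= IOMMU_FLUSHF_added;
    if ( old.pr ) /* an existing mapping was replaced */
        *flush_flags |= IOMMU_FLUSHF_modified;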

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v4: Re-base over changes earlier in the series.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -31,30 +31,28 @@ static unsigned int pfn_to_pde_idx(unsig
     return idx;
 }
 
-static unsigned int clear_iommu_pte_present(unsigned long l1_mfn,
-                                            unsigned long dfn)
+static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
+                                                   unsigned long dfn)
 {
-    union amd_iommu_pte *table, *pte;
-    unsigned int flush_flags;
+    union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
     pte = &table[pfn_to_pde_idx(dfn, 1)];
+    old = *pte;
 
-    flush_flags = pte->pr ? IOMMU_FLUSHF_modified : 0;
     write_atomic(&pte->raw, 0);
 
     unmap_domain_page(table);
 
-    return flush_flags;
+    return old;
 }
 
-static unsigned int set_iommu_pde_present(union amd_iommu_pte *pte,
-                                          unsigned long next_mfn,
-                                          unsigned int next_level, bool iw,
-                                          bool ir)
+static void set_iommu_pde_present(union amd_iommu_pte *pte,
+                                  unsigned long next_mfn,
+                                  unsigned int next_level,
+                                  bool iw, bool ir)
 {
-    union amd_iommu_pte new = {}, old;
-    unsigned int flush_flags = IOMMU_FLUSHF_added;
+    union amd_iommu_pte new = {};
 
     /*
      * FC bit should be enabled in PTE, this helps to solve potential
@@ -68,29 +66,42 @@ static unsigned int set_iommu_pde_presen
     new.next_level = next_level;
     new.pr = true;
 
-    old.raw = read_atomic(&pte->raw);
-    old.ign0 = 0;
-    old.ign1 = 0;
-    old.ign2 = 0;
+    write_atomic(&pte->raw, new.raw);
+}
 
-    if ( old.pr && old.raw != new.raw )
-        flush_flags |= IOMMU_FLUSHF_modified;
+static union amd_iommu_pte set_iommu_pte_present(unsigned long pt_mfn,
+                                                 unsigned long dfn,
+                                                 unsigned long next_mfn,
+                                                 unsigned int level,
+                                                 bool iw, bool ir)
+{
+    union amd_iommu_pte *table, *pde, old;
 
-    write_atomic(&pte->raw, new.raw);
+    table = map_domain_page(_mfn(pt_mfn));
+    pde = &table[pfn_to_pde_idx(dfn, level)];
+
+    old = *pde;
+    if ( !old.pr || old.next_level ||
+         old.mfn != next_mfn ||
+         old.iw != iw || old.ir != ir )
+        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+    else
+        old.pr = false; /* signal "no change" to the caller */
 
-    return flush_flags;
+    unmap_domain_page(table);
+
+    return old;
 }
 
-static unsigned int set_iommu_ptes_present(unsigned long pt_mfn,
-                                           unsigned long dfn,
-                                           unsigned long next_mfn,
-                                           unsigned int nr_ptes,
-                                           unsigned int pde_level,
-                                           bool iw, bool ir)
+static void set_iommu_ptes_present(unsigned long pt_mfn,
+                                   unsigned long dfn,
+                                   unsigned long next_mfn,
+                                   unsigned int nr_ptes,
+                                   unsigned int pde_level,
+                                   bool iw, bool ir)
 {
     union amd_iommu_pte *table, *pde;
     unsigned long page_sz = 1UL << (PTE_PER_TABLE_SHIFT * (pde_level - 1));
-    unsigned int flush_flags = 0;
 
     table = map_domain_page(_mfn(pt_mfn));
     pde = &table[pfn_to_pde_idx(dfn, pde_level)];
@@ -98,20 +109,18 @@ static unsigned int set_iommu_ptes_prese
     if ( (void *)(pde + nr_ptes) > (void *)table + PAGE_SIZE )
     {
         ASSERT_UNREACHABLE();
-        return 0;
+        return;
     }
 
     while ( nr_ptes-- )
     {
-        flush_flags |= set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
 
         ++pde;
         next_mfn += page_sz;
     }
 
     unmap_domain_page(table);
-
-    return flush_flags;
 }
 
 /*
@@ -349,6 +358,7 @@ int cf_check amd_iommu_map_page(
     struct domain_iommu *hd = dom_iommu(d);
     int rc;
     unsigned long pt_mfn = 0;
+    union amd_iommu_pte old;
 
     spin_lock(&hd->arch.mapping_lock);
 
@@ -385,12 +395,16 @@ int cf_check amd_iommu_map_page(
     }
 
     /* Install 4k mapping */
-    *flush_flags |= set_iommu_ptes_present(pt_mfn, dfn_x(dfn), mfn_x(mfn),
-                                           1, 1, (flags & IOMMUF_writable),
-                                           (flags & IOMMUF_readable));
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+                                (flags & IOMMUF_writable),
+                                (flags & IOMMUF_readable));
 
     spin_unlock(&hd->arch.mapping_lock);
 
+    *flush_flags |= IOMMU_FLUSHF_added;
+    if ( old.pr )
+        *flush_flags |= IOMMU_FLUSHF_modified;
+
     return 0;
 }
 
@@ -399,6 +413,7 @@ int cf_check amd_iommu_unmap_page(
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
 
@@ -420,11 +435,14 @@ int cf_check amd_iommu_unmap_page(
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        *flush_flags |= clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
+    if ( old.pr )
+        *flush_flags |= IOMMU_FLUSHF_modified;
+
     return 0;
 }
 




* [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (8 preceding siblings ...)
  2022-04-25  8:37 ` [PATCH v4 09/21] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
@ 2022-04-25  8:38 ` Jan Beulich
  2022-05-05 13:19   ` Roger Pau Monné
  2022-04-25  8:38 ` [PATCH v4 11/21] VT-d: " Jan Beulich
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:38 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

No separate feature flags exist which would control availability of
these; the only restriction is HATS (establishing the maximum number of
page table levels in general), and even that has a lower bound of 4.
Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
non-default page sizes the implementation in principle permits arbitrary
size mappings, but these require multiple identical leaf PTEs to be
written, which isn't all that different from having to write multiple
consecutive PTEs with increasing frame numbers. IMO that's therefore
beneficial only on hardware where suitable TLBs exist; I'm unaware of
such hardware.)
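
For reference, with PTE_PER_TABLE_SHIFT being 9 the page table level
targeted by a mapping request follows directly from the order passed in
via the flags (see amd_iommu_map_page() below):

    level = order / PTE_PER_TABLE_SHIFT + 1

i.e. order 0 (4k) -> level 1, order 9 (2M) -> level 2, order 18 (1G) ->
level 3, and order 27 (512G) -> level 4.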

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm not fully sure about allowing 512G mappings: The scheduling-for-
freeing of intermediate page tables would take quite a while when
replacing a tree of 4k mappings by a single 512G one. Yet then again
there's no present code path via which 512G chunks of memory could be
allocated (and hence mapped) anyway, so this would only benefit huge
systems where 512 1G mappings could be re-coalesced (once suitable code
is in place) into a single L4 entry. And re-coalescing wouldn't result
in scheduling-for-freeing of full trees of lower level pagetables.
---
v4: Change type of queue_free_pt()'s 1st parameter. Re-base.
v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
    where possible.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
 }
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
-                                                   unsigned long dfn)
+                                                   unsigned long dfn,
+                                                   unsigned int level)
 {
     union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
-    pte = &table[pfn_to_pde_idx(dfn, 1)];
+    pte = &table[pfn_to_pde_idx(dfn, level)];
     old = *pte;
 
     write_atomic(&pte->raw, 0);
@@ -351,11 +352,32 @@ static int iommu_pde_from_dfn(struct dom
     return 0;
 }
 
+static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
+{
+    if ( level > 1 )
+    {
+        union amd_iommu_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
+            if ( pt[i].pr && pt[i].next_level )
+            {
+                ASSERT(pt[i].next_level < level);
+                queue_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level);
+            }
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
+}
+
 int cf_check amd_iommu_map_page(
     struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
     unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
     int rc;
     unsigned long pt_mfn = 0;
     union amd_iommu_pte old;
@@ -384,7 +406,7 @@ int cf_check amd_iommu_map_page(
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
          !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -394,8 +416,8 @@ int cf_check amd_iommu_map_page(
         return -EFAULT;
     }
 
-    /* Install 4k mapping */
-    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+    /* Install mapping */
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
                                 (flags & IOMMUF_writable),
                                 (flags & IOMMUF_readable));
 
@@ -403,8 +425,13 @@ int cf_check amd_iommu_map_page(
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( IOMMUF_order(flags) && old.next_level )
+            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
@@ -413,6 +440,7 @@ int cf_check amd_iommu_unmap_page(
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
     union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
@@ -423,7 +451,7 @@ int cf_check amd_iommu_unmap_page(
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -435,14 +463,19 @@ int cf_check amd_iommu_unmap_page(
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( order && old.next_level )
+            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -747,7 +747,7 @@ static void cf_check amd_dump_page_table
 }
 
 static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
-    .page_sizes = PAGE_SIZE_4K,
+    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
     .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/include/xen/page-defs.h
+++ b/xen/include/xen/page-defs.h
@@ -21,4 +21,19 @@
 #define PAGE_MASK_64K               PAGE_MASK_GRAN(64K)
 #define PAGE_ALIGN_64K(addr)        PAGE_ALIGN_GRAN(64K, addr)
 
+#define PAGE_SHIFT_2M               21
+#define PAGE_SIZE_2M                PAGE_SIZE_GRAN(2M)
+#define PAGE_MASK_2M                PAGE_MASK_GRAN(2M)
+#define PAGE_ALIGN_2M(addr)         PAGE_ALIGN_GRAN(2M, addr)
+
+#define PAGE_SHIFT_1G               30
+#define PAGE_SIZE_1G                PAGE_SIZE_GRAN(1G)
+#define PAGE_MASK_1G                PAGE_MASK_GRAN(1G)
+#define PAGE_ALIGN_1G(addr)         PAGE_ALIGN_GRAN(1G, addr)
+
+#define PAGE_SHIFT_512G             39
+#define PAGE_SIZE_512G              PAGE_SIZE_GRAN(512G)
+#define PAGE_MASK_512G              PAGE_MASK_GRAN(512G)
+#define PAGE_ALIGN_512G(addr)       PAGE_ALIGN_GRAN(512G, addr)
+
 #endif /* __XEN_PAGE_DEFS_H__ */



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 11/21] VT-d: allow use of superpage mappings
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (9 preceding siblings ...)
  2022-04-25  8:38 ` [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings Jan Beulich
@ 2022-04-25  8:38 ` Jan Beulich
  2022-05-05 16:20   ` Roger Pau Monné
  2022-04-25  8:40 ` [PATCH v4 12/21] IOMMU: fold flush-all hook into "flush one" Jan Beulich
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:38 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

... depending on feature availability (and absence of quirks).

Also make the page table dumping function aware of superpages.
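
As an illustration of the resulting behavior (my wording, not from the
patch): on a system whose IOMMUs all report 2MB but not 1GB superpage
support in their capability registers, the vtd_setup() changes below
reduce large_sizes to just PAGE_SIZE_2M, so iommu_ops.page_sizes ends up
as PAGE_SIZE_4K | PAGE_SIZE_2M.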

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v4: Change type of queue_free_pt()'s 1st parameter. Re-base.
v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
    where possible. Tighten assertion.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -784,18 +784,37 @@ static int __must_check cf_check iommu_f
     return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
 }
 
+static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
+{
+    if ( level > 1 )
+    {
+        struct dma_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_NUM; ++i )
+            if ( dma_pte_present(pt[i]) && !dma_pte_superpage(pt[i]) )
+                queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(pt[i])),
+                              level - 1);
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
+}
+
 /* clear one page's page table */
 static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
                              unsigned int order,
                              unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct dma_pte *page = NULL, *pte = NULL;
+    struct dma_pte *page = NULL, *pte = NULL, old;
     u64 pg_maddr;
+    unsigned int level = (order / LEVEL_STRIDE) + 1;
 
     spin_lock(&hd->arch.mapping_lock);
-    /* get last level pte */
-    pg_maddr = addr_to_dma_page_maddr(domain, addr, 1, flush_flags, false);
+    /* get target level pte */
+    pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags, false);
     if ( pg_maddr < PAGE_SIZE )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -803,7 +822,7 @@ static int dma_pte_clear_one(struct doma
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-    pte = page + address_level_offset(addr, 1);
+    pte = &page[address_level_offset(addr, level)];
 
     if ( !dma_pte_present(*pte) )
     {
@@ -812,14 +831,20 @@ static int dma_pte_clear_one(struct doma
         return 0;
     }
 
+    old = *pte;
     dma_clear_pte(*pte);
-    *flush_flags |= IOMMU_FLUSHF_modified;
 
     spin_unlock(&hd->arch.mapping_lock);
     iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
 
+    *flush_flags |= IOMMU_FLUSHF_modified;
+
+    if ( order && !dma_pte_superpage(old) )
+        queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)),
+                      order / LEVEL_STRIDE);
+
     return 0;
 }
 
@@ -2097,6 +2122,7 @@ static int __must_check cf_check intel_i
     struct domain_iommu *hd = dom_iommu(d);
     struct dma_pte *page, *pte, old, new = {};
     u64 pg_maddr;
+    unsigned int level = (IOMMUF_order(flags) / LEVEL_STRIDE) + 1;
     int rc = 0;
 
     /* Do nothing if VT-d shares EPT page table */
@@ -2121,7 +2147,7 @@ static int __must_check cf_check intel_i
         return 0;
     }
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 1, flush_flags,
+    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), level, flush_flags,
                                       true);
     if ( pg_maddr < PAGE_SIZE )
     {
@@ -2130,13 +2156,15 @@ static int __must_check cf_check intel_i
     }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-    pte = &page[dfn_x(dfn) & LEVEL_MASK];
+    pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
     old = *pte;
 
     dma_set_pte_addr(new, mfn_to_maddr(mfn));
     dma_set_pte_prot(new,
                      ((flags & IOMMUF_readable) ? DMA_PTE_READ  : 0) |
                      ((flags & IOMMUF_writable) ? DMA_PTE_WRITE : 0));
+    if ( IOMMUF_order(flags) )
+        dma_set_pte_superpage(new);
 
     /* Set the SNP on leaf page table if Snoop Control available */
     if ( iommu_snoop )
@@ -2157,8 +2185,14 @@ static int __must_check cf_check intel_i
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( dma_pte_present(old) )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( IOMMUF_order(flags) && !dma_pte_superpage(old) )
+            queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)),
+                          IOMMUF_order(flags) / LEVEL_STRIDE);
+    }
+
     return rc;
 }
 
@@ -2516,6 +2550,7 @@ static int __init cf_check vtd_setup(voi
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
+    unsigned int large_sizes = PAGE_SIZE_2M | PAGE_SIZE_1G;
     int ret;
     bool reg_inval_supported = true;
 
@@ -2558,6 +2593,11 @@ static int __init cf_check vtd_setup(voi
                cap_sps_2mb(iommu->cap) ? ", 2MB" : "",
                cap_sps_1gb(iommu->cap) ? ", 1GB" : "");
 
+        if ( !cap_sps_2mb(iommu->cap) )
+            large_sizes &= ~PAGE_SIZE_2M;
+        if ( !cap_sps_1gb(iommu->cap) )
+            large_sizes &= ~PAGE_SIZE_1G;
+
 #ifndef iommu_snoop
         if ( iommu_snoop && !ecap_snp_ctl(iommu->ecap) )
             iommu_snoop = false;
@@ -2629,6 +2669,9 @@ static int __init cf_check vtd_setup(voi
     if ( ret )
         goto error;
 
+    ASSERT(iommu_ops.page_sizes == PAGE_SIZE_4K);
+    iommu_ops.page_sizes |= large_sizes;
+
     register_keyhandler('V', vtd_dump_iommu_info, "dump iommu info", 1);
 
     return 0;
@@ -2961,7 +3004,7 @@ static void vtd_dump_page_table_level(pa
             continue;
 
         address = gpa + offset_level_address(i, level);
-        if ( next_level >= 1 ) 
+        if ( next_level && !dma_pte_superpage(*pte) )
             vtd_dump_page_table_level(dma_pte_addr(*pte), next_level,
                                       address, indent + 1);
         else



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 12/21] IOMMU: fold flush-all hook into "flush one"
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (10 preceding siblings ...)
  2022-04-25  8:38 ` [PATCH v4 11/21] VT-d: " Jan Beulich
@ 2022-04-25  8:40 ` Jan Beulich
  2022-05-06  8:38   ` Roger Pau Monné
  2022-04-25  8:40 ` [PATCH v4 13/21] IOMMU/x86: prefill newly allocated page tables Jan Beulich
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:40 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

Having a separate flush-all hook has always puzzled me somewhat. We
will want to be able to force a full flush via accumulated flush flags
from the map/unmap functions. Introduce a respective new flag and fold
all flush handling to use the single remaining hook.

Note that because of the respective comments in SMMU and IPMMU-VMSA
code, I've folded the two prior hook functions into one. For SMMU-v3,
which lacks such a comment about incapable hardware, I've left both
functions in place on the assumption that selective and full flushes
will eventually want separating.
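
Purely as an illustration of the new flag scheme (a standalone sketch,
not Xen code): callers wanting a full flush OR the new IOMMU_FLUSHF_all
bit into the accumulated flags handed to the one remaining hook, and a
full flush is then performed regardless of the dfn/page-count arguments.

#include <stdio.h>

enum {
    _IOMMU_FLUSHF_added,
    _IOMMU_FLUSHF_modified,
    _IOMMU_FLUSHF_all,
};
#define IOMMU_FLUSHF_added    (1u << _IOMMU_FLUSHF_added)
#define IOMMU_FLUSHF_modified (1u << _IOMMU_FLUSHF_modified)
#define IOMMU_FLUSHF_all      (1u << _IOMMU_FLUSHF_all)

int main(void)
{
    unsigned int flush_flags = IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;

    if ( flush_flags & IOMMU_FLUSHF_all )
        puts("full flush requested (dfn/page_count disregarded)");
    else if ( flush_flags & IOMMU_FLUSHF_modified )
        puts("selective flush of the range just (un)mapped");
    else
        puts("no present entry was changed, so nothing to flush");

    return 0;
}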

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # IPMMU-VMSA, SMMU-V2
Reviewed-by: Rahul Singh <rahul.singh@arm.com> # SMMUv3
Acked-by: Julien Grall <jgrall@amazon.com> # Arm
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
TBD: What we really are going to need is for the map/unmap functions to
     specify that a wider region needs flushing than just the one
     covered by the present set of (un)maps. This may still be less than
     a full flush, but at least as a first step it seemed better to me
     to keep things simple and go the flush-all route.
---
v4: Re-base.
v3: Re-base over changes earlier in the series.
v2: New.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -258,7 +258,6 @@ int cf_check amd_iommu_get_reserved_devi
 int __must_check cf_check amd_iommu_flush_iotlb_pages(
     struct domain *d, dfn_t dfn, unsigned long page_count,
     unsigned int flush_flags);
-int __must_check cf_check amd_iommu_flush_iotlb_all(struct domain *d);
 void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
                              dfn_t dfn);
 
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -539,15 +539,18 @@ int cf_check amd_iommu_flush_iotlb_pages
 {
     unsigned long dfn_l = dfn_x(dfn);
 
-    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
-    ASSERT(flush_flags);
+    if ( !(flush_flags & IOMMU_FLUSHF_all) )
+    {
+        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
+        ASSERT(flush_flags);
+    }
 
     /* Unless a PTE was modified, no flush is required */
     if ( !(flush_flags & IOMMU_FLUSHF_modified) )
         return 0;
 
-    /* If the range wraps then just flush everything */
-    if ( dfn_l + page_count < dfn_l )
+    /* If so requested or if the range wraps then just flush everything. */
+    if ( (flush_flags & IOMMU_FLUSHF_all) || dfn_l + page_count < dfn_l )
     {
         amd_iommu_flush_all_pages(d);
         return 0;
@@ -572,13 +575,6 @@ int cf_check amd_iommu_flush_iotlb_pages
 
     return 0;
 }
-
-int cf_check amd_iommu_flush_iotlb_all(struct domain *d)
-{
-    amd_iommu_flush_all_pages(d);
-
-    return 0;
-}
 
 int amd_iommu_reserve_domain_unity_map(struct domain *d,
                                        const struct ivrs_unity_map *map,
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -759,7 +759,6 @@ static const struct iommu_ops __initcons
     .map_page = amd_iommu_map_page,
     .unmap_page = amd_iommu_unmap_page,
     .iotlb_flush = amd_iommu_flush_iotlb_pages,
-    .iotlb_flush_all = amd_iommu_flush_iotlb_all,
     .reassign_device = reassign_device,
     .get_device_group_id = amd_iommu_group_id,
     .enable_x2apic = iov_enable_xt,
--- a/xen/drivers/passthrough/arm/ipmmu-vmsa.c
+++ b/xen/drivers/passthrough/arm/ipmmu-vmsa.c
@@ -1000,13 +1000,19 @@ out:
 }
 
 /* Xen IOMMU ops */
-static int __must_check ipmmu_iotlb_flush_all(struct domain *d)
+static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
+                                          unsigned long page_count,
+                                          unsigned int flush_flags)
 {
     struct ipmmu_vmsa_xen_domain *xen_domain = dom_iommu(d)->arch.priv;
 
+    ASSERT(flush_flags);
+
     if ( !xen_domain || !xen_domain->root_domain )
         return 0;
 
+    /* The hardware doesn't support selective TLB flush. */
+
     spin_lock(&xen_domain->lock);
     ipmmu_tlb_invalidate(xen_domain->root_domain);
     spin_unlock(&xen_domain->lock);
@@ -1014,16 +1020,6 @@ static int __must_check ipmmu_iotlb_flus
     return 0;
 }
 
-static int __must_check ipmmu_iotlb_flush(struct domain *d, dfn_t dfn,
-                                          unsigned long page_count,
-                                          unsigned int flush_flags)
-{
-    ASSERT(flush_flags);
-
-    /* The hardware doesn't support selective TLB flush. */
-    return ipmmu_iotlb_flush_all(d);
-}
-
 static struct ipmmu_vmsa_domain *ipmmu_get_cache_domain(struct domain *d,
                                                         struct device *dev)
 {
@@ -1360,7 +1356,6 @@ static const struct iommu_ops ipmmu_iomm
     .hwdom_init      = arch_iommu_hwdom_init,
     .teardown        = ipmmu_iommu_domain_teardown,
     .iotlb_flush     = ipmmu_iotlb_flush,
-    .iotlb_flush_all = ipmmu_iotlb_flush_all,
     .assign_device   = ipmmu_assign_device,
     .reassign_device = ipmmu_reassign_device,
     .map_page        = arm_iommu_map_page,
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2649,11 +2649,17 @@ static int force_stage = 2;
  */
 static u32 platform_features = ARM_SMMU_FEAT_COHERENT_WALK;
 
-static int __must_check arm_smmu_iotlb_flush_all(struct domain *d)
+static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
+					     unsigned long page_count,
+					     unsigned int flush_flags)
 {
 	struct arm_smmu_xen_domain *smmu_domain = dom_iommu(d)->arch.priv;
 	struct iommu_domain *cfg;
 
+	ASSERT(flush_flags);
+
+	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
+
 	spin_lock(&smmu_domain->lock);
 	list_for_each_entry(cfg, &smmu_domain->contexts, list) {
 		/*
@@ -2670,16 +2676,6 @@ static int __must_check arm_smmu_iotlb_f
 	return 0;
 }
 
-static int __must_check arm_smmu_iotlb_flush(struct domain *d, dfn_t dfn,
-					     unsigned long page_count,
-					     unsigned int flush_flags)
-{
-	ASSERT(flush_flags);
-
-	/* ARM SMMU v1 doesn't have flush by VMA and VMID */
-	return arm_smmu_iotlb_flush_all(d);
-}
-
 static struct iommu_domain *arm_smmu_get_domain(struct domain *d,
 						struct device *dev)
 {
@@ -2864,7 +2860,6 @@ static const struct iommu_ops arm_smmu_i
     .add_device = arm_smmu_dt_add_device_generic,
     .teardown = arm_smmu_iommu_domain_teardown,
     .iotlb_flush = arm_smmu_iotlb_flush,
-    .iotlb_flush_all = arm_smmu_iotlb_flush_all,
     .assign_device = arm_smmu_assign_dev,
     .reassign_device = arm_smmu_reassign_dev,
     .map_page = arm_iommu_map_page,
--- a/xen/drivers/passthrough/arm/smmu-v3.c
+++ b/xen/drivers/passthrough/arm/smmu-v3.c
@@ -3416,7 +3416,6 @@ static const struct iommu_ops arm_smmu_i
 	.hwdom_init		= arch_iommu_hwdom_init,
 	.teardown		= arm_smmu_iommu_xen_domain_teardown,
 	.iotlb_flush		= arm_smmu_iotlb_flush,
-	.iotlb_flush_all	= arm_smmu_iotlb_flush_all,
 	.assign_device		= arm_smmu_assign_dev,
 	.reassign_device	= arm_smmu_reassign_dev,
 	.map_page		= arm_iommu_map_page,
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -478,15 +478,12 @@ int iommu_iotlb_flush_all(struct domain
     const struct domain_iommu *hd = dom_iommu(d);
     int rc;
 
-    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush_all ||
+    if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush ||
          !flush_flags )
         return 0;
 
-    /*
-     * The operation does a full flush so we don't need to pass the
-     * flush_flags in.
-     */
-    rc = iommu_call(hd->platform_ops, iotlb_flush_all, d);
+    rc = iommu_call(hd->platform_ops, iotlb_flush, d, INVALID_DFN, 0,
+                    flush_flags | IOMMU_FLUSHF_all);
     if ( unlikely(rc) )
     {
         if ( !d->is_shutting_down && printk_ratelimit() )
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -772,18 +772,21 @@ static int __must_check cf_check iommu_f
     struct domain *d, dfn_t dfn, unsigned long page_count,
     unsigned int flush_flags)
 {
-    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
-    ASSERT(flush_flags);
+    if ( flush_flags & IOMMU_FLUSHF_all )
+    {
+        dfn = INVALID_DFN;
+        page_count = 0;
+    }
+    else
+    {
+        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
+        ASSERT(flush_flags);
+    }
 
     return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
                              page_count);
 }
 
-static int __must_check cf_check iommu_flush_iotlb_all(struct domain *d)
-{
-    return iommu_flush_iotlb(d, INVALID_DFN, 0, 0);
-}
-
 static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
 {
     if ( level > 1 )
@@ -3185,7 +3188,6 @@ static const struct iommu_ops __initcons
     .resume = vtd_resume,
     .crash_shutdown = vtd_crash_shutdown,
     .iotlb_flush = iommu_flush_iotlb_pages,
-    .iotlb_flush_all = iommu_flush_iotlb_all,
     .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_page_tables = vtd_dump_page_tables,
 };
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -147,9 +147,11 @@ enum
 {
     _IOMMU_FLUSHF_added,
     _IOMMU_FLUSHF_modified,
+    _IOMMU_FLUSHF_all,
 };
 #define IOMMU_FLUSHF_added (1u << _IOMMU_FLUSHF_added)
 #define IOMMU_FLUSHF_modified (1u << _IOMMU_FLUSHF_modified)
+#define IOMMU_FLUSHF_all (1u << _IOMMU_FLUSHF_all)
 
 int __must_check iommu_map(struct domain *d, dfn_t dfn, mfn_t mfn,
                            unsigned long page_count, unsigned int flags,
@@ -281,7 +283,6 @@ struct iommu_ops {
     int __must_check (*iotlb_flush)(struct domain *d, dfn_t dfn,
                                     unsigned long page_count,
                                     unsigned int flush_flags);
-    int __must_check (*iotlb_flush_all)(struct domain *d);
     int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_page_tables)(struct domain *d);
 



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 13/21] IOMMU/x86: prefill newly allocated page tables
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (11 preceding siblings ...)
  2022-04-25  8:40 ` [PATCH v4 12/21] IOMMU: fold flush-all hook into "flush one" Jan Beulich
@ 2022-04-25  8:40 ` Jan Beulich
  2022-05-06 11:16   ` Roger Pau Monné
  2022-04-25  8:41 ` [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in " Jan Beulich
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:40 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Wei Liu

Page tables are used for two purposes after allocation: They either
start out all empty, or they get filled to replace a superpage.
Subsequently, to allow replacing all-empty or fully contiguous page tables,
contiguous sub-regions will be recorded within individual page tables.
Install the initial set of markers immediately after allocation. Make
sure to retain these markers when further populating a page table in
preparation for it to replace a superpage.

The markers are simply 4-bit fields holding the order value of
contiguous entries. To demonstrate this, if a page table had just 16
entries, this would be the initial (fully contiguous) set of markers:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, but also a
respective number of all non-present entries.
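
A tiny standalone demo (mine, not part of the patch) reproduces the
layout above; the real initialization added to iommu_alloc_pgtable()
below does the same for 512-entry tables, storing the values in the PTE
bits selected by the new contig_mask parameter. The local
find_first_set_bit() is merely a stand-in for Xen's helper.

#include <stdio.h>

static unsigned int find_first_set_bit(unsigned int x)
{
    return __builtin_ctz(x); /* x must be non-zero */
}

int main(void)
{
    const unsigned int entries = 16, table_order = 4; /* log2(entries) */
    unsigned int i;

    for ( i = 0; i < entries; ++i )
        printf("%u ", i ? find_first_set_bit(i) : table_order);
    putchar('\n'); /* prints: 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0 */

    return 0;
}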

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
An alternative to the ASSERT()s added to set_iommu_ptes_present() would
be to make the function less general-purpose; it's used in a single
place only after all (i.e. it might as well be folded into its only
caller).

While in VT-d's comment ahead of struct dma_pte I'm adjusting the
description of the high bits, I'd like to note that the description of
some of the lower bits isn't correct either. Yet I don't think adjusting
that belongs here.
---
v4: Add another comment referring to pt-contig-markers.h. Re-base.
v3: Add comments. Re-base.
v2: New.

--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -146,7 +146,8 @@ void iommu_free_domid(domid_t domid, uns
 
 int __must_check iommu_free_pgtables(struct domain *d);
 struct domain_iommu;
-struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
+struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd,
+                                                   uint64_t contig_mask);
 void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *pg);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
--- a/xen/drivers/passthrough/amd/iommu-defs.h
+++ b/xen/drivers/passthrough/amd/iommu-defs.h
@@ -446,11 +446,13 @@ union amd_iommu_x2apic_control {
 #define IOMMU_PAGE_TABLE_U32_PER_ENTRY	(IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
 #define IOMMU_PAGE_TABLE_ALIGNMENT	4096
 
+#define IOMMU_PTE_CONTIG_MASK           0x1e /* The ign0 field below. */
+
 union amd_iommu_pte {
     uint64_t raw;
     struct {
         bool pr:1;
-        unsigned int ign0:4;
+        unsigned int ign0:4; /* Covered by IOMMU_PTE_CONTIG_MASK. */
         bool a:1;
         bool d:1;
         unsigned int ign1:2;
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
 
     while ( nr_ptes-- )
     {
-        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        ASSERT(!pde->next_level);
+        ASSERT(!pde->u);
+
+        if ( pde > table )
+            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
+        else
+            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
+
+        pde->iw = iw;
+        pde->ir = ir;
+        pde->fc = true; /* See set_iommu_pde_present(). */
+        pde->mfn = next_mfn;
+        pde->pr = true;
 
         ++pde;
         next_mfn += page_sz;
@@ -295,7 +307,7 @@ static int iommu_pde_from_dfn(struct dom
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = iommu_alloc_pgtable(hd);
+            table = iommu_alloc_pgtable(hd, IOMMU_PTE_CONTIG_MASK);
             if ( table == NULL )
             {
                 AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -325,7 +337,7 @@ static int iommu_pde_from_dfn(struct dom
 
             if ( next_table_mfn == 0 )
             {
-                table = iommu_alloc_pgtable(hd);
+                table = iommu_alloc_pgtable(hd, IOMMU_PTE_CONTIG_MASK);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -717,7 +729,7 @@ static int fill_qpt(union amd_iommu_pte
                  * page table pages, and the resulting allocations are always
                  * zeroed.
                  */
-                pgs[level] = iommu_alloc_pgtable(hd);
+                pgs[level] = iommu_alloc_pgtable(hd, 0);
                 if ( !pgs[level] )
                 {
                     rc = -ENOMEM;
@@ -775,7 +787,7 @@ int cf_check amd_iommu_quarantine_init(s
         return 0;
     }
 
-    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd);
+    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd, 0);
     if ( !pdev->arch.amd.root_table )
         return -ENOMEM;
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -342,7 +342,7 @@ int amd_iommu_alloc_root(struct domain *
 
     if ( unlikely(!hd->arch.amd.root_table) && d != dom_io )
     {
-        hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
+        hd->arch.amd.root_table = iommu_alloc_pgtable(hd, 0);
         if ( !hd->arch.amd.root_table )
             return -ENOMEM;
     }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -334,7 +334,7 @@ static uint64_t addr_to_dma_page_maddr(s
             goto out;
 
         pte_maddr = level;
-        if ( !(pg = iommu_alloc_pgtable(hd)) )
+        if ( !(pg = iommu_alloc_pgtable(hd, 0)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
@@ -376,7 +376,7 @@ static uint64_t addr_to_dma_page_maddr(s
             }
 
             pte_maddr = level - 1;
-            pg = iommu_alloc_pgtable(hd);
+            pg = iommu_alloc_pgtable(hd, DMA_PTE_CONTIG_MASK);
             if ( !pg )
                 break;
 
@@ -388,12 +388,13 @@ static uint64_t addr_to_dma_page_maddr(s
                 struct dma_pte *split = map_vtd_domain_page(pte_maddr);
                 unsigned long inc = 1UL << level_to_offset_bits(level - 1);
 
-                split[0].val = pte->val;
+                split[0].val |= pte->val & ~DMA_PTE_CONTIG_MASK;
                 if ( inc == PAGE_SIZE )
                     split[0].val &= ~DMA_PTE_SP;
 
                 for ( offset = 1; offset < PTE_NUM; ++offset )
-                    split[offset].val = split[offset - 1].val + inc;
+                    split[offset].val |=
+                        (split[offset - 1].val & ~DMA_PTE_CONTIG_MASK) + inc;
 
                 iommu_sync_cache(split, PAGE_SIZE);
                 unmap_vtd_domain_page(split);
@@ -2173,7 +2174,7 @@ static int __must_check cf_check intel_i
     if ( iommu_snoop )
         dma_set_pte_snp(new);
 
-    if ( old.val == new.val )
+    if ( !((old.val ^ new.val) & ~DMA_PTE_CONTIG_MASK) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         unmap_vtd_domain_page(page);
@@ -3052,7 +3053,7 @@ static int fill_qpt(struct dma_pte *this
                  * page table pages, and the resulting allocations are always
                  * zeroed.
                  */
-                pgs[level] = iommu_alloc_pgtable(hd);
+                pgs[level] = iommu_alloc_pgtable(hd, 0);
                 if ( !pgs[level] )
                 {
                     rc = -ENOMEM;
@@ -3109,7 +3110,7 @@ static int cf_check intel_iommu_quaranti
     if ( !drhd )
         return -ENODEV;
 
-    pg = iommu_alloc_pgtable(hd);
+    pg = iommu_alloc_pgtable(hd, 0);
     if ( !pg )
         return -ENOMEM;
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -253,7 +253,10 @@ struct context_entry {
  * 2-6: reserved
  * 7: super page
  * 8-11: available
- * 12-63: Host physcial address
+ * 12-51: Host physical address
+ * 52-61: available (52-55 used for DMA_PTE_CONTIG_MASK)
+ * 62: reserved
+ * 63: available
  */
 struct dma_pte {
     u64 val;
@@ -263,6 +266,7 @@ struct dma_pte {
 #define DMA_PTE_PROT (DMA_PTE_READ | DMA_PTE_WRITE)
 #define DMA_PTE_SP   (1 << 7)
 #define DMA_PTE_SNP  (1 << 11)
+#define DMA_PTE_CONTIG_MASK  (0xfull << PADDR_BITS)
 #define dma_clear_pte(p)    do {(p).val = 0;} while(0)
 #define dma_set_pte_readable(p) do {(p).val |= DMA_PTE_READ;} while(0)
 #define dma_set_pte_writable(p) do {(p).val |= DMA_PTE_WRITE;} while(0)
@@ -276,7 +280,7 @@ struct dma_pte {
 #define dma_pte_write(p) (dma_pte_prot(p) & DMA_PTE_WRITE)
 #define dma_pte_addr(p) ((p).val & PADDR_MASK & PAGE_MASK_4K)
 #define dma_set_pte_addr(p, addr) do {\
-            (p).val |= ((addr) & PAGE_MASK_4K); } while (0)
+            (p).val |= ((addr) & PADDR_MASK & PAGE_MASK_4K); } while (0)
 #define dma_pte_present(p) (((p).val & DMA_PTE_PROT) != 0)
 #define dma_pte_superpage(p) (((p).val & DMA_PTE_SP) != 0)
 
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -522,11 +522,12 @@ int iommu_free_pgtables(struct domain *d
     return 0;
 }
 
-struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd)
+struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd,
+                                      uint64_t contig_mask)
 {
     unsigned int memflags = 0;
     struct page_info *pg;
-    void *p;
+    uint64_t *p;
 
 #ifdef CONFIG_NUMA
     if ( hd->node != NUMA_NO_NODE )
@@ -538,7 +539,29 @@ struct page_info *iommu_alloc_pgtable(st
         return NULL;
 
     p = __map_domain_page(pg);
-    clear_page(p);
+
+    if ( contig_mask )
+    {
+        /* See pt-contig-markers.h for a description of the marker scheme. */
+        unsigned int i, shift = find_first_set_bit(contig_mask);
+
+        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
+
+        p[0] = (PAGE_SHIFT - 3ull) << shift;
+        p[1] = 0;
+        p[2] = 1ull << shift;
+        p[3] = 0;
+
+        for ( i = 4; i < PAGE_SIZE / 8; i += 4 )
+        {
+            p[i + 0] = (find_first_set_bit(i) + 0ull) << shift;
+            p[i + 1] = 0;
+            p[i + 2] = 1ull << shift;
+            p[i + 3] = 0;
+        }
+    }
+    else
+        clear_page(p);
 
     iommu_sync_cache(p, PAGE_SIZE);
 



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in page tables
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (12 preceding siblings ...)
  2022-04-25  8:40 ` [PATCH v4 13/21] IOMMU/x86: prefill newly allocated page tables Jan Beulich
@ 2022-04-25  8:41 ` Jan Beulich
  2022-05-06 13:25   ` Roger Pau Monné
  2022-04-25  8:42 ` [PATCH v4 15/21] AMD/IOMMU: free all-empty " Jan Beulich
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:41 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Wei Liu

This is a re-usable helper (kind of a template) which gets introduced
without users so that the individual subsequent patches introducing such
users can get committed independently of one another.

See the comment at the top of the new file. To demonstrate the effect,
if a page table had just 16 entries, this would be the set of markers
for a page table with fully contiguous mappings:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, but also a
respective number of all non-present entries.
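
As a worked example (mine, not from the patch): take the fully
populated, fully contiguous 16-entry table above, clear the entry at
index 5, and call pt_update_contig_markers() for that index with
PTE_kind_null. Step 1 lowers the marker at index 4 from 2 to 0 and the
marker at index 0 from 4 to 2; step 2 finds nothing to extend (the low
bit of index 5 is set), and step 3 stops right away at the still present
buddy at index 4. All other markers stay unchanged and the function
returns false, correctly recording that indices 0-3 still form an
order-2 contiguous run while the table as a whole can no longer be
replaced by a single higher-level entry.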

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Rename function and header. Introduce IS_CONTIG().
v2: New.

--- /dev/null
+++ b/xen/arch/x86/include/asm/pt-contig-markers.h
@@ -0,0 +1,105 @@
+#ifndef __ASM_X86_PT_CONTIG_MARKERS_H
+#define __ASM_X86_PT_CONTIG_MARKERS_H
+
+/*
+ * Short of having function templates in C, the function defined below is
+ * intended to be used by multiple parties interested in recording the
+ * degree of contiguity in mappings by a single page table.
+ *
+ * Scheme: Every entry records the order of contiguous successive entries,
+ * up to the maximum order covered by that entry (which is the number of
+ * clear low bits in its index, with entry 0 being the exception using
+ * the base-2 logarithm of the number of entries in a single page table).
+ * While a few entries need touching upon update, knowing whether the
+ * table is fully contiguous (and can hence be replaced by a higher level
+ * leaf entry) is then possible by simply looking at entry 0's marker.
+ *
+ * Prereqs:
+ * - CONTIG_MASK needs to be #define-d, to a value having at least 4
+ *   contiguous bits (ignored by hardware), before including this file,
+ * - page tables to be passed here need to be initialized with correct
+ *   markers.
+ */
+
+#include <xen/bitops.h>
+#include <xen/lib.h>
+#include <xen/page-size.h>
+
+/* This is the same for all anticipated users, so doesn't need passing in. */
+#define CONTIG_LEVEL_SHIFT 9
+#define CONTIG_NR          (1 << CONTIG_LEVEL_SHIFT)
+
+#define GET_MARKER(e) MASK_EXTR(e, CONTIG_MASK)
+#define SET_MARKER(e, m) \
+    ((void)((e) = ((e) & ~CONTIG_MASK) | MASK_INSR(m, CONTIG_MASK)))
+
+#define IS_CONTIG(kind, pt, i, idx, shift, b) \
+    ((kind) == PTE_kind_leaf \
+     ? (((pt)[i] ^ (pt)[idx]) & ~CONTIG_MASK) == (1ULL << ((b) + (shift))) \
+     : !((pt)[i] & ~CONTIG_MASK))
+
+enum PTE_kind {
+    PTE_kind_null,
+    PTE_kind_leaf,
+    PTE_kind_table,
+};
+
+static bool pt_update_contig_markers(uint64_t *pt, unsigned int idx,
+                                     unsigned int level, enum PTE_kind kind)
+{
+    unsigned int b, i = idx;
+    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
+
+    ASSERT(idx < CONTIG_NR);
+    ASSERT(!(pt[idx] & CONTIG_MASK));
+
+    /* Step 1: Reduce markers in lower numbered entries. */
+    while ( i )
+    {
+        b = find_first_set_bit(i);
+        i &= ~(1U << b);
+        if ( GET_MARKER(pt[i]) > b )
+            SET_MARKER(pt[i], b);
+    }
+
+    /* An intermediate table is never contiguous with anything. */
+    if ( kind == PTE_kind_table )
+        return false;
+
+    /*
+     * Present entries need in-sync index and address to be a candidate
+     * for being contiguous: What we're after is whether ultimately the
+     * intermediate table can be replaced by a superpage.
+     */
+    if ( kind != PTE_kind_null &&
+         idx != ((pt[idx] >> shift) & (CONTIG_NR - 1)) )
+        return false;
+
+    /* Step 2: Check higher numbered entries for contiguity. */
+    for ( b = 0; b < CONTIG_LEVEL_SHIFT && !(idx & (1U << b)); ++b )
+    {
+        i = idx | (1U << b);
+        if ( !IS_CONTIG(kind, pt, i, idx, shift, b) || GET_MARKER(pt[i]) != b )
+            break;
+    }
+
+    /* Step 3: Update markers in this and lower numbered entries. */
+    for ( ; SET_MARKER(pt[idx], b), b < CONTIG_LEVEL_SHIFT; ++b )
+    {
+        i = idx ^ (1U << b);
+        if ( !IS_CONTIG(kind, pt, i, idx, shift, b) || GET_MARKER(pt[i]) != b )
+            break;
+        idx &= ~(1U << b);
+    }
+
+    return b == CONTIG_LEVEL_SHIFT;
+}
+
+#undef IS_CONTIG
+#undef SET_MARKER
+#undef GET_MARKER
+#undef CONTIG_NR
+#undef CONTIG_LEVEL_SHIFT
+#undef CONTIG_MASK
+
+#endif /* __ASM_X86_PT_CONTIG_MARKERS_H */



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 15/21] AMD/IOMMU: free all-empty page tables
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (13 preceding siblings ...)
  2022-04-25  8:41 ` [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in " Jan Beulich
@ 2022-04-25  8:42 ` Jan Beulich
  2022-05-10 13:30   ` Roger Pau Monné
  2022-04-25  8:42 ` [PATCH v4 16/21] VT-d: " Jan Beulich
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:42 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that while its output isn't used there yet,
pt_update_contig_markers() right away needs to be called in all places
where entries get updated, not just the one where entries get cleared.
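
For example (my illustration, not from the patch description): once the
last present 4k entry of an L1 table gets cleared, the loop added to
amd_iommu_unmap_page() below also clears the referencing L2 entry and
queues the L1 table for freeing. This can cascade upwards, but the
"++level < hd->arch.amd.paging_mode" bound ensures entries in the root
table are never cleared this way.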

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Re-base over changes earlier in the series.
v3: Re-base over changes earlier in the series.
v2: New.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -21,6 +21,9 @@
 
 #include "iommu.h"
 
+#define CONTIG_MASK IOMMU_PTE_CONTIG_MASK
+#include <asm/pt-contig-markers.h>
+
 /* Given pfn and page table level, return pde index */
 static unsigned int pfn_to_pde_idx(unsigned long pfn, unsigned int level)
 {
@@ -33,16 +36,20 @@ static unsigned int pfn_to_pde_idx(unsig
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
                                                    unsigned long dfn,
-                                                   unsigned int level)
+                                                   unsigned int level,
+                                                   bool *free)
 {
     union amd_iommu_pte *table, *pte, old;
+    unsigned int idx = pfn_to_pde_idx(dfn, level);
 
     table = map_domain_page(_mfn(l1_mfn));
-    pte = &table[pfn_to_pde_idx(dfn, level)];
+    pte = &table[idx];
     old = *pte;
 
     write_atomic(&pte->raw, 0);
 
+    *free = pt_update_contig_markers(&table->raw, idx, level, PTE_kind_null);
+
     unmap_domain_page(table);
 
     return old;
@@ -85,7 +92,11 @@ static union amd_iommu_pte set_iommu_pte
     if ( !old.pr || old.next_level ||
          old.mfn != next_mfn ||
          old.iw != iw || old.ir != ir )
+    {
         set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
+        pt_update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level),
+                                 level, PTE_kind_leaf);
+    }
     else
         old.pr = false; /* signal "no change" to the caller */
 
@@ -322,6 +333,9 @@ static int iommu_pde_from_dfn(struct dom
             smp_wmb();
             set_iommu_pde_present(pde, next_table_mfn, next_level, true,
                                   true);
+            pt_update_contig_markers(&next_table_vaddr->raw,
+                                     pfn_to_pde_idx(dfn, level),
+                                     level, PTE_kind_table);
 
             *flush_flags |= IOMMU_FLUSHF_modified;
         }
@@ -347,6 +361,9 @@ static int iommu_pde_from_dfn(struct dom
                 next_table_mfn = mfn_x(page_to_mfn(table));
                 set_iommu_pde_present(pde, next_table_mfn, next_level, true,
                                       true);
+                pt_update_contig_markers(&next_table_vaddr->raw,
+                                         pfn_to_pde_idx(dfn, level),
+                                         level, PTE_kind_table);
             }
             else /* should never reach here */
             {
@@ -474,8 +491,24 @@ int cf_check amd_iommu_unmap_page(
 
     if ( pt_mfn )
     {
+        bool free;
+
         /* Mark PTE as 'page not present'. */
-        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
+
+        while ( unlikely(free) && ++level < hd->arch.amd.paging_mode )
+        {
+            struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
+
+            if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn,
+                                    flush_flags, false) )
+                BUG();
+            BUG_ON(!pt_mfn);
+
+            clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
+            *flush_flags |= IOMMU_FLUSHF_all;
+            iommu_queue_free_pgtable(hd, pg);
+        }
     }
 
     spin_unlock(&hd->arch.mapping_lock);



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 16/21] VT-d: free all-empty page tables
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (14 preceding siblings ...)
  2022-04-25  8:42 ` [PATCH v4 15/21] AMD/IOMMU: free all-empty " Jan Beulich
@ 2022-04-25  8:42 ` Jan Beulich
  2022-04-27  4:09   ` Tian, Kevin
  2022-05-10 14:30   ` Roger Pau Monné
  2022-04-25  8:43 ` [PATCH v4 17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings Jan Beulich
                   ` (5 subsequent siblings)
  21 siblings, 2 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:42 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that while its output isn't used there yet,
pt_update_contig_markers() right away needs to be called in all places
where entries get updated, not just the one where entries get cleared.

Note further that while pt_update_contig_markers() may update several
PTEs within the table, these are changes to "avail" bits only, so I do
not think that cache flushing would be needed afterwards. Such cache
flushing (of entire pages, unless adding yet more logic to be more
selective) would be quite noticeable performance-wise (very prominent
during Dom0 boot).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Re-base over changes earlier in the series.
v3: Properly bound loop. Re-base over changes earlier in the series.
v2: New.
---
The hang during boot on my Latitude E6410 (see the respective code
comment) occurred shortly after iommu_enable_translation(). No errors,
no watchdog would kick in; just sometimes the first few pixel lines of
the next log message's (XEN) prefix would have made it out to the screen
(and there's no serial there). It took a lot of experimenting until I
figured out the workaround (which I consider ugly, but halfway acceptable).
I've been trying hard to make sure the workaround wouldn't be masking a
real issue, yet I'm still wary of it possibly doing so ... My best guess
at this point is that on these old IOMMUs the ignored bits 52...61
aren't really ignored for present entries, but also aren't "reserved"
enough to trigger faults. This guess is from having tried to set other
bits in this range (unconditionally, and with the workaround here in
place), which yielded the same behavior.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -43,6 +43,9 @@
 #include "vtd.h"
 #include "../ats.h"
 
+#define CONTIG_MASK DMA_PTE_CONTIG_MASK
+#include <asm/pt-contig-markers.h>
+
 /* dom_io is used as a sentinel for quarantined devices */
 #define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
 #define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
@@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
 
             write_atomic(&pte->val, new_pte.val);
             iommu_sync_cache(pte, sizeof(struct dma_pte));
+            pt_update_contig_markers(&parent->val,
+                                     address_level_offset(addr, level),
+                                     level, PTE_kind_table);
         }
 
         if ( --level == target )
@@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
 
     old = *pte;
     dma_clear_pte(*pte);
+    iommu_sync_cache(pte, sizeof(*pte));
+
+    while ( pt_update_contig_markers(&page->val,
+                                     address_level_offset(addr, level),
+                                     level, PTE_kind_null) &&
+            ++level < min_pt_levels )
+    {
+        struct page_info *pg = maddr_to_page(pg_maddr);
+
+        unmap_vtd_domain_page(page);
+
+        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
+                                          false);
+        BUG_ON(pg_maddr < PAGE_SIZE);
+
+        page = map_vtd_domain_page(pg_maddr);
+        pte = &page[address_level_offset(addr, level)];
+        dma_clear_pte(*pte);
+        iommu_sync_cache(pte, sizeof(*pte));
+
+        *flush_flags |= IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(hd, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
-    iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
 
@@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
     }
 
     *pte = new;
-
     iommu_sync_cache(pte, sizeof(struct dma_pte));
+
+    /*
+     * While the (ab)use of PTE_kind_table here allows to save some work in
+     * the function, the main motivation for it is that it avoids a so far
+     * unexplained hang during boot (while preparing Dom0) on a Westmere
+     * based laptop.
+     */
+    pt_update_contig_markers(&page->val,
+                             address_level_offset(dfn_to_daddr(dfn), level),
+                             level,
+                             (hd->platform_ops->page_sizes &
+                              (1UL << level_to_offset_bits(level + 1))
+                              ? PTE_kind_leaf : PTE_kind_table));
+
     spin_unlock(&hd->arch.mapping_lock);
     unmap_vtd_domain_page(page);
 



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (15 preceding siblings ...)
  2022-04-25  8:42 ` [PATCH v4 16/21] VT-d: " Jan Beulich
@ 2022-04-25  8:43 ` Jan Beulich
  2022-05-10 15:31   ` Roger Pau Monné
  2022-04-25  8:43 ` [PATCH v4 18/21] VT-d: " Jan Beulich
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné

When a page table ends up with all contiguous entries (including all
identical attributes), it can be replaced by a superpage entry at the
next higher level. The page table itself can then be scheduled for
freeing.
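
As a concrete illustration (mine, not from the patch description): once
the 512th 4k mapping with suitably aligned, contiguous MFNs and
identical attributes is installed into an L1 table,
pt_update_contig_markers() reports the table as fully contiguous, and
the loop added to amd_iommu_map_page() below replaces the referencing
L2 entry with a single 2M mapping and queues the L1 table for freeing.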

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Unlike the freeing of all-empty page tables, this causes quite a bit of
back and forth for PV domains, due to their mapping/unmapping of pages
when they get converted to/from being page tables. It may therefore be
worth considering delaying re-coalescing a little, to avoid doing so
when the superpage would otherwise get split again pretty soon. But I
think this would better be the subject of a separate change anyway.

Of course this could also be helped by more "aware" kernel side
behavior: They could avoid immediately mapping freed page tables
writable again, in anticipation of re-using that same page for another
page table elsewhere.
---
v4: Re-base over changes earlier in the series.
v3: New.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -81,7 +81,8 @@ static union amd_iommu_pte set_iommu_pte
                                                  unsigned long dfn,
                                                  unsigned long next_mfn,
                                                  unsigned int level,
-                                                 bool iw, bool ir)
+                                                 bool iw, bool ir,
+                                                 bool *contig)
 {
     union amd_iommu_pte *table, *pde, old;
 
@@ -94,11 +95,15 @@ static union amd_iommu_pte set_iommu_pte
          old.iw != iw || old.ir != ir )
     {
         set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
-        pt_update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level),
-                                 level, PTE_kind_leaf);
+        *contig = pt_update_contig_markers(&table->raw,
+                                           pfn_to_pde_idx(dfn, level),
+                                           level, PTE_kind_leaf);
     }
     else
+    {
         old.pr = false; /* signal "no change" to the caller */
+        *contig = false;
+    }
 
     unmap_domain_page(table);
 
@@ -407,6 +412,7 @@ int cf_check amd_iommu_map_page(
 {
     struct domain_iommu *hd = dom_iommu(d);
     unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
+    bool contig;
     int rc;
     unsigned long pt_mfn = 0;
     union amd_iommu_pte old;
@@ -447,8 +453,26 @@ int cf_check amd_iommu_map_page(
 
     /* Install mapping */
     old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
-                                (flags & IOMMUF_writable),
-                                (flags & IOMMUF_readable));
+                                flags & IOMMUF_writable,
+                                flags & IOMMUF_readable, &contig);
+
+    while ( unlikely(contig) && ++level < hd->arch.amd.paging_mode )
+    {
+        struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
+        unsigned long next_mfn;
+
+        if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags,
+                                false) )
+            BUG();
+        BUG_ON(!pt_mfn);
+
+        next_mfn = mfn_x(mfn) & (~0UL << (PTE_PER_TABLE_SHIFT * (level - 1)));
+        set_iommu_pte_present(pt_mfn, dfn_x(dfn), next_mfn, level,
+                              flags & IOMMUF_writable,
+                              flags & IOMMUF_readable, &contig);
+        *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(hd, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
 



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 18/21] VT-d: replace all-contiguous page tables by superpage mappings
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (16 preceding siblings ...)
  2022-04-25  8:43 ` [PATCH v4 17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings Jan Beulich
@ 2022-04-25  8:43 ` Jan Beulich
  2022-05-11 11:08   ` Roger Pau Monné
  2022-04-25  8:44 ` [PATCH v4 19/21] IOMMU/x86: add perf counters for page table splitting / coalescing Jan Beulich
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

When a page table ends up with all contiguous entries (including all
identical attributes), it can be replaced by a superpage entry at the
next higher level. The page table itself can then be scheduled for
freeing.

The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
for whenever we (and obviously hardware) start supporting 512G mappings.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
Unlike the freeing of all-empty page tables, this causes quite a bit of
back and forth for PV domains, due to their mapping/unmapping of pages
when they get converted to/from being page tables. It may therefore be
worth considering delaying re-coalescing a little, to avoid doing so
when the superpage would otherwise get split again pretty soon. But I
think this would better be the subject of a separate change anyway.

Of course this could also be helped by more "aware" kernel side
behavior: They could avoid immediately mapping freed page tables
writable again, in anticipation of re-using that same page for another
page table elsewhere.
---
v4: Re-base over changes earlier in the series.
v3: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2216,14 +2216,35 @@ static int __must_check cf_check intel_i
      * While the (ab)use of PTE_kind_table here allows to save some work in
      * the function, the main motivation for it is that it avoids a so far
      * unexplained hang during boot (while preparing Dom0) on a Westmere
-     * based laptop.
+     * based laptop.  This also has the intended effect of terminating the
+     * loop when super pages aren't supported anymore at the next level.
      */
-    pt_update_contig_markers(&page->val,
-                             address_level_offset(dfn_to_daddr(dfn), level),
-                             level,
-                             (hd->platform_ops->page_sizes &
-                              (1UL << level_to_offset_bits(level + 1))
-                              ? PTE_kind_leaf : PTE_kind_table));
+    while ( pt_update_contig_markers(&page->val,
+                                     address_level_offset(dfn_to_daddr(dfn), level),
+                                     level,
+                                     (hd->platform_ops->page_sizes &
+                                      (1UL << level_to_offset_bits(level + 1))
+                                       ? PTE_kind_leaf : PTE_kind_table)) )
+    {
+        struct page_info *pg = maddr_to_page(pg_maddr);
+
+        unmap_vtd_domain_page(page);
+
+        new.val &= ~(LEVEL_MASK << level_to_offset_bits(level));
+        dma_set_pte_superpage(new);
+
+        pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), ++level,
+                                          flush_flags, false);
+        BUG_ON(pg_maddr < PAGE_SIZE);
+
+        page = map_vtd_domain_page(pg_maddr);
+        pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
+        *pte = new;
+        iommu_sync_cache(pte, sizeof(*pte));
+
+        *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(hd, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
     unmap_vtd_domain_page(page);
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -232,7 +232,7 @@ struct context_entry {
 
 /* page table handling */
 #define LEVEL_STRIDE       (9)
-#define LEVEL_MASK         ((1 << LEVEL_STRIDE) - 1)
+#define LEVEL_MASK         (PTE_NUM - 1UL)
 #define PTE_NUM            (1 << LEVEL_STRIDE)
 #define level_to_agaw(val) ((val) - 2)
 #define agaw_to_level(val) ((val) + 2)



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 19/21] IOMMU/x86: add perf counters for page table splitting / coalescing
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (17 preceding siblings ...)
  2022-04-25  8:43 ` [PATCH v4 18/21] VT-d: " Jan Beulich
@ 2022-04-25  8:44 ` Jan Beulich
  2022-05-11 13:48   ` Roger Pau Monné
  2022-04-25  8:44 ` [PATCH v4 20/21] VT-d: fold iommu_flush_iotlb{,_pages}() Jan Beulich
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:44 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Wei Liu

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v3: New.

--- a/xen/arch/x86/include/asm/perfc_defn.h
+++ b/xen/arch/x86/include/asm/perfc_defn.h
@@ -125,4 +125,7 @@ PERFCOUNTER(realmode_exits,      "vmexit
 
 PERFCOUNTER(pauseloop_exits, "vmexits from Pause-Loop Detection")
 
+PERFCOUNTER(iommu_pt_shatters,    "IOMMU page table shatters")
+PERFCOUNTER(iommu_pt_coalesces,   "IOMMU page table coalesces")
+
 /*#endif*/ /* __XEN_PERFC_DEFN_H__ */
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -343,6 +343,8 @@ static int iommu_pde_from_dfn(struct dom
                                      level, PTE_kind_table);
 
             *flush_flags |= IOMMU_FLUSHF_modified;
+
+            perfc_incr(iommu_pt_shatters);
         }
 
         /* Install lower level page table for non-present entries */
@@ -472,6 +474,7 @@ int cf_check amd_iommu_map_page(
                               flags & IOMMUF_readable, &contig);
         *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
         iommu_queue_free_pgtable(hd, pg);
+        perfc_incr(iommu_pt_coalesces);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
@@ -532,6 +535,7 @@ int cf_check amd_iommu_unmap_page(
             clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
             *flush_flags |= IOMMU_FLUSHF_all;
             iommu_queue_free_pgtable(hd, pg);
+            perfc_incr(iommu_pt_coalesces);
         }
     }
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -404,6 +404,8 @@ static uint64_t addr_to_dma_page_maddr(s
 
                 if ( flush_flags )
                     *flush_flags |= IOMMU_FLUSHF_modified;
+
+                perfc_incr(iommu_pt_shatters);
             }
 
             write_atomic(&pte->val, new_pte.val);
@@ -865,6 +867,7 @@ static int dma_pte_clear_one(struct doma
 
         *flush_flags |= IOMMU_FLUSHF_all;
         iommu_queue_free_pgtable(hd, pg);
+        perfc_incr(iommu_pt_coalesces);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
@@ -2244,6 +2247,7 @@ static int __must_check cf_check intel_i
 
         *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
         iommu_queue_free_pgtable(hd, pg);
+        perfc_incr(iommu_pt_coalesces);
     }
 
     spin_unlock(&hd->arch.mapping_lock);



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 20/21] VT-d: fold iommu_flush_iotlb{,_pages}()
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (18 preceding siblings ...)
  2022-04-25  8:44 ` [PATCH v4 19/21] IOMMU/x86: add perf counters for page table splitting / coalescing Jan Beulich
@ 2022-04-25  8:44 ` Jan Beulich
  2022-04-27  4:12   ` Tian, Kevin
  2022-05-11 13:50   ` Roger Pau Monné
  2022-04-25  8:45 ` [PATCH v4 21/21] VT-d: fold dma_pte_clear_one() into its only caller Jan Beulich
  2022-05-18 12:50 ` [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
  21 siblings, 2 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:44 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

With iommu_flush_iotlb_all() gone, iommu_flush_iotlb_pages() is merely a
wrapper around the not otherwise called iommu_flush_iotlb(). Fold both
functions.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -728,9 +728,9 @@ static int __must_check iommu_flush_all(
     return rc;
 }
 
-static int __must_check iommu_flush_iotlb(struct domain *d, dfn_t dfn,
-                                          bool_t dma_old_pte_present,
-                                          unsigned long page_count)
+static int __must_check cf_check iommu_flush_iotlb(struct domain *d, dfn_t dfn,
+                                                   unsigned long page_count,
+                                                   unsigned int flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
     struct acpi_drhd_unit *drhd;
@@ -739,6 +739,17 @@ static int __must_check iommu_flush_iotl
     int iommu_domid;
     int ret = 0;
 
+    if ( flush_flags & IOMMU_FLUSHF_all )
+    {
+        dfn = INVALID_DFN;
+        page_count = 0;
+    }
+    else
+    {
+        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
+        ASSERT(flush_flags);
+    }
+
     /*
      * No need pcideves_lock here because we have flush
      * when assign/deassign device
@@ -765,7 +776,7 @@ static int __must_check iommu_flush_iotl
             rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
                                        dfn_to_daddr(dfn),
                                        get_order_from_pages(page_count),
-                                       !dma_old_pte_present,
+                                       !(flush_flags & IOMMU_FLUSHF_modified),
                                        flush_dev_iotlb);
 
         if ( rc > 0 )
@@ -777,25 +788,6 @@ static int __must_check iommu_flush_iotl
     return ret;
 }
 
-static int __must_check cf_check iommu_flush_iotlb_pages(
-    struct domain *d, dfn_t dfn, unsigned long page_count,
-    unsigned int flush_flags)
-{
-    if ( flush_flags & IOMMU_FLUSHF_all )
-    {
-        dfn = INVALID_DFN;
-        page_count = 0;
-    }
-    else
-    {
-        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
-        ASSERT(flush_flags);
-    }
-
-    return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
-                             page_count);
-}
-
 static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
 {
     if ( level > 1 )
@@ -3254,7 +3246,7 @@ static const struct iommu_ops __initcons
     .suspend = vtd_suspend,
     .resume = vtd_resume,
     .crash_shutdown = vtd_crash_shutdown,
-    .iotlb_flush = iommu_flush_iotlb_pages,
+    .iotlb_flush = iommu_flush_iotlb,
     .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_page_tables = vtd_dump_page_tables,
 };



^ permalink raw reply	[flat|nested] 106+ messages in thread

* [PATCH v4 21/21] VT-d: fold dma_pte_clear_one() into its only caller
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (19 preceding siblings ...)
  2022-04-25  8:44 ` [PATCH v4 20/21] VT-d: fold iommu_flush_iotlb{,_pages}() Jan Beulich
@ 2022-04-25  8:45 ` Jan Beulich
  2022-04-27  4:13   ` Tian, Kevin
  2022-05-11 13:57   ` Roger Pau Monné
  2022-05-18 12:50 ` [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
  21 siblings, 2 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-25  8:45 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Paul Durrant, Roger Pau Monné, Kevin Tian

This way intel_iommu_unmap_page() ends up quite a bit more similar to
intel_iommu_map_page().

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -806,75 +806,6 @@ static void queue_free_pt(struct domain_
     iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
 }
 
-/* clear one page's page table */
-static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
-                             unsigned int order,
-                             unsigned int *flush_flags)
-{
-    struct domain_iommu *hd = dom_iommu(domain);
-    struct dma_pte *page = NULL, *pte = NULL, old;
-    u64 pg_maddr;
-    unsigned int level = (order / LEVEL_STRIDE) + 1;
-
-    spin_lock(&hd->arch.mapping_lock);
-    /* get target level pte */
-    pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags, false);
-    if ( pg_maddr < PAGE_SIZE )
-    {
-        spin_unlock(&hd->arch.mapping_lock);
-        return pg_maddr ? -ENOMEM : 0;
-    }
-
-    page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
-    pte = &page[address_level_offset(addr, level)];
-
-    if ( !dma_pte_present(*pte) )
-    {
-        spin_unlock(&hd->arch.mapping_lock);
-        unmap_vtd_domain_page(page);
-        return 0;
-    }
-
-    old = *pte;
-    dma_clear_pte(*pte);
-    iommu_sync_cache(pte, sizeof(*pte));
-
-    while ( pt_update_contig_markers(&page->val,
-                                     address_level_offset(addr, level),
-                                     level, PTE_kind_null) &&
-            ++level < min_pt_levels )
-    {
-        struct page_info *pg = maddr_to_page(pg_maddr);
-
-        unmap_vtd_domain_page(page);
-
-        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
-                                          false);
-        BUG_ON(pg_maddr < PAGE_SIZE);
-
-        page = map_vtd_domain_page(pg_maddr);
-        pte = &page[address_level_offset(addr, level)];
-        dma_clear_pte(*pte);
-        iommu_sync_cache(pte, sizeof(*pte));
-
-        *flush_flags |= IOMMU_FLUSHF_all;
-        iommu_queue_free_pgtable(hd, pg);
-        perfc_incr(iommu_pt_coalesces);
-    }
-
-    spin_unlock(&hd->arch.mapping_lock);
-
-    unmap_vtd_domain_page(page);
-
-    *flush_flags |= IOMMU_FLUSHF_modified;
-
-    if ( order && !dma_pte_superpage(old) )
-        queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)),
-                      order / LEVEL_STRIDE);
-
-    return 0;
-}
-
 static int iommu_set_root_entry(struct vtd_iommu *iommu)
 {
     u32 sts;
@@ -2261,6 +2192,12 @@ static int __must_check cf_check intel_i
 static int __must_check cf_check intel_iommu_unmap_page(
     struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_flags)
 {
+    struct domain_iommu *hd = dom_iommu(d);
+    daddr_t addr = dfn_to_daddr(dfn);
+    struct dma_pte *page = NULL, *pte = NULL, old;
+    uint64_t pg_maddr;
+    unsigned int level = (order / LEVEL_STRIDE) + 1;
+
     /* Do nothing if VT-d shares EPT page table */
     if ( iommu_use_hap_pt(d) )
         return 0;
@@ -2269,7 +2206,62 @@ static int __must_check cf_check intel_i
     if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
         return 0;
 
-    return dma_pte_clear_one(d, dfn_to_daddr(dfn), order, flush_flags);
+    spin_lock(&hd->arch.mapping_lock);
+    /* get target level pte */
+    pg_maddr = addr_to_dma_page_maddr(d, addr, level, flush_flags, false);
+    if ( pg_maddr < PAGE_SIZE )
+    {
+        spin_unlock(&hd->arch.mapping_lock);
+        return pg_maddr ? -ENOMEM : 0;
+    }
+
+    page = map_vtd_domain_page(pg_maddr);
+    pte = &page[address_level_offset(addr, level)];
+
+    if ( !dma_pte_present(*pte) )
+    {
+        spin_unlock(&hd->arch.mapping_lock);
+        unmap_vtd_domain_page(page);
+        return 0;
+    }
+
+    old = *pte;
+    dma_clear_pte(*pte);
+    iommu_sync_cache(pte, sizeof(*pte));
+
+    while ( pt_update_contig_markers(&page->val,
+                                     address_level_offset(addr, level),
+                                     level, PTE_kind_null) &&
+            ++level < min_pt_levels )
+    {
+        struct page_info *pg = maddr_to_page(pg_maddr);
+
+        unmap_vtd_domain_page(page);
+
+        pg_maddr = addr_to_dma_page_maddr(d, addr, level, flush_flags, false);
+        BUG_ON(pg_maddr < PAGE_SIZE);
+
+        page = map_vtd_domain_page(pg_maddr);
+        pte = &page[address_level_offset(addr, level)];
+        dma_clear_pte(*pte);
+        iommu_sync_cache(pte, sizeof(*pte));
+
+        *flush_flags |= IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(hd, pg);
+        perfc_incr(iommu_pt_coalesces);
+    }
+
+    spin_unlock(&hd->arch.mapping_lock);
+
+    unmap_vtd_domain_page(page);
+
+    *flush_flags |= IOMMU_FLUSHF_modified;
+
+    if ( order && !dma_pte_superpage(old) )
+        queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)),
+                      order / LEVEL_STRIDE);
+
+    return 0;
 }
 
 static int cf_check intel_iommu_lookup_page(



^ permalink raw reply	[flat|nested] 106+ messages in thread

* RE: [PATCH v4 16/21] VT-d: free all-empty page tables
  2022-04-25  8:42 ` [PATCH v4 16/21] VT-d: " Jan Beulich
@ 2022-04-27  4:09   ` Tian, Kevin
  2022-05-10 14:30   ` Roger Pau Monné
  1 sibling, 0 replies; 106+ messages in thread
From: Tian, Kevin @ 2022-04-27  4:09 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, April 25, 2022 4:43 PM
> 
> When a page table ends up with no present entries left, it can be
> replaced by a non-present entry at the next higher level. The page table
> itself can then be scheduled for freeing.
> 
> Note that while its output isn't used there yet,
> pt_update_contig_markers() right away needs to be called in all places
> where entries get updated, not just the one where entries get cleared.
> 
> Note further that while pt_update_contig_markers() updates perhaps
> several PTEs within the table, since these are changes to "avail" bits
> only I do not think that cache flushing would be needed afterwards. Such
> cache flushing (of entire pages, unless adding yet more logic to be more
> selective) would be quite noticeable performance-wise (very prominent
> during Dom0 boot).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v4: Re-base over changes earlier in the series.
> v3: Properly bound loop. Re-base over changes earlier in the series.
> v2: New.
> ---
> The hang during boot on my Latitude E6410 (see the respective code
> comment) was pretty close after iommu_enable_translation(). No errors,
> no watchdog would kick in, just sometimes the first few pixel lines of
> the next log message's (XEN) prefix would have made it out to the screen
> (and there's no serial there). It took a lot of experimenting until I
> figured out the workaround (which I consider ugly, but halfway acceptable).
> I've been trying hard to make sure the workaround wouldn't be masking a
> real issue, yet I'm still wary of it possibly doing so ... My best guess
> at this point is that on these old IOMMUs the ignored bits 52...61
> aren't really ignored for present entries, but also aren't "reserved"
> enough to trigger faults. This guess is from having tried to set other
> bits in this range (unconditionally, and with the workaround here in
> place), which yielded the same behavior.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -43,6 +43,9 @@
>  #include "vtd.h"
>  #include "../ats.h"
> 
> +#define CONTIG_MASK DMA_PTE_CONTIG_MASK
> +#include <asm/pt-contig-markers.h>
> +
>  /* dom_io is used as a sentinel for quarantined devices */
>  #define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
>  #define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
> @@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
> 
>              write_atomic(&pte->val, new_pte.val);
>              iommu_sync_cache(pte, sizeof(struct dma_pte));
> +            pt_update_contig_markers(&parent->val,
> +                                     address_level_offset(addr, level),
> +                                     level, PTE_kind_table);
>          }
> 
>          if ( --level == target )
> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
> 
>      old = *pte;
>      dma_clear_pte(*pte);
> +    iommu_sync_cache(pte, sizeof(*pte));
> +
> +    while ( pt_update_contig_markers(&page->val,
> +                                     address_level_offset(addr, level),
> +                                     level, PTE_kind_null) &&
> +            ++level < min_pt_levels )
> +    {
> +        struct page_info *pg = maddr_to_page(pg_maddr);
> +
> +        unmap_vtd_domain_page(page);
> +
> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level,
> flush_flags,
> +                                          false);
> +        BUG_ON(pg_maddr < PAGE_SIZE);
> +
> +        page = map_vtd_domain_page(pg_maddr);
> +        pte = &page[address_level_offset(addr, level)];
> +        dma_clear_pte(*pte);
> +        iommu_sync_cache(pte, sizeof(*pte));
> +
> +        *flush_flags |= IOMMU_FLUSHF_all;
> +        iommu_queue_free_pgtable(hd, pg);
> +    }
> 
>      spin_unlock(&hd->arch.mapping_lock);
> -    iommu_sync_cache(pte, sizeof(struct dma_pte));
> 
>      unmap_vtd_domain_page(page);
> 
> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>      }
> 
>      *pte = new;
> -
>      iommu_sync_cache(pte, sizeof(struct dma_pte));
> +
> +    /*
> +     * While the (ab)use of PTE_kind_table here allows to save some work in
> +     * the function, the main motivation for it is that it avoids a so far
> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
> +     * based laptop.
> +     */
> +    pt_update_contig_markers(&page->val,
> +                             address_level_offset(dfn_to_daddr(dfn), level),
> +                             level,
> +                             (hd->platform_ops->page_sizes &
> +                              (1UL << level_to_offset_bits(level + 1))
> +                              ? PTE_kind_leaf : PTE_kind_table));
> +
>      spin_unlock(&hd->arch.mapping_lock);
>      unmap_vtd_domain_page(page);
> 


^ permalink raw reply	[flat|nested] 106+ messages in thread

* RE: [PATCH v4 20/21] VT-d: fold iommu_flush_iotlb{,_pages}()
  2022-04-25  8:44 ` [PATCH v4 20/21] VT-d: fold iommu_flush_iotlb{,_pages}() Jan Beulich
@ 2022-04-27  4:12   ` Tian, Kevin
  2022-05-11 13:50   ` Roger Pau Monné
  1 sibling, 0 replies; 106+ messages in thread
From: Tian, Kevin @ 2022-04-27  4:12 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, April 25, 2022 4:45 PM
> 
> With iommu_flush_iotlb_all() gone, iommu_flush_iotlb_pages() is merely a
> wrapper around the not otherwise called iommu_flush_iotlb(). Fold both
> functions.
> 
> No functional change intended.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v4: New.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -728,9 +728,9 @@ static int __must_check iommu_flush_all(
>      return rc;
>  }
> 
> -static int __must_check iommu_flush_iotlb(struct domain *d, dfn_t dfn,
> -                                          bool_t dma_old_pte_present,
> -                                          unsigned long page_count)
> +static int __must_check cf_check iommu_flush_iotlb(struct domain *d,
> dfn_t dfn,
> +                                                   unsigned long page_count,
> +                                                   unsigned int flush_flags)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
>      struct acpi_drhd_unit *drhd;
> @@ -739,6 +739,17 @@ static int __must_check iommu_flush_iotl
>      int iommu_domid;
>      int ret = 0;
> 
> +    if ( flush_flags & IOMMU_FLUSHF_all )
> +    {
> +        dfn = INVALID_DFN;
> +        page_count = 0;
> +    }
> +    else
> +    {
> +        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> +        ASSERT(flush_flags);
> +    }
> +
>      /*
>       * No need pcideves_lock here because we have flush
>       * when assign/deassign device
> @@ -765,7 +776,7 @@ static int __must_check iommu_flush_iotl
>              rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
>                                         dfn_to_daddr(dfn),
>                                         get_order_from_pages(page_count),
> -                                       !dma_old_pte_present,
> +                                       !(flush_flags & IOMMU_FLUSHF_modified),
>                                         flush_dev_iotlb);
> 
>          if ( rc > 0 )
> @@ -777,25 +788,6 @@ static int __must_check iommu_flush_iotl
>      return ret;
>  }
> 
> -static int __must_check cf_check iommu_flush_iotlb_pages(
> -    struct domain *d, dfn_t dfn, unsigned long page_count,
> -    unsigned int flush_flags)
> -{
> -    if ( flush_flags & IOMMU_FLUSHF_all )
> -    {
> -        dfn = INVALID_DFN;
> -        page_count = 0;
> -    }
> -    else
> -    {
> -        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> -        ASSERT(flush_flags);
> -    }
> -
> -    return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
> -                             page_count);
> -}
> -
>  static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned
> int level)
>  {
>      if ( level > 1 )
> @@ -3254,7 +3246,7 @@ static const struct iommu_ops __initcons
>      .suspend = vtd_suspend,
>      .resume = vtd_resume,
>      .crash_shutdown = vtd_crash_shutdown,
> -    .iotlb_flush = iommu_flush_iotlb_pages,
> +    .iotlb_flush = iommu_flush_iotlb,
>      .get_reserved_device_memory =
> intel_iommu_get_reserved_device_memory,
>      .dump_page_tables = vtd_dump_page_tables,
>  };


^ permalink raw reply	[flat|nested] 106+ messages in thread

* RE: [PATCH v4 21/21] VT-d: fold dma_pte_clear_one() into its only caller
  2022-04-25  8:45 ` [PATCH v4 21/21] VT-d: fold dma_pte_clear_one() into its only caller Jan Beulich
@ 2022-04-27  4:13   ` Tian, Kevin
  2022-05-11 13:57   ` Roger Pau Monné
  1 sibling, 0 replies; 106+ messages in thread
From: Tian, Kevin @ 2022-04-27  4:13 UTC (permalink / raw)
  To: Beulich, Jan, xen-devel
  Cc: Cooper, Andrew, Paul Durrant, Pau Monné, Roger

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, April 25, 2022 4:45 PM
> 
> This way intel_iommu_unmap_page() ends up quite a bit more similar to
> intel_iommu_map_page().
> 
> No functional change intended.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v4: New.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -806,75 +806,6 @@ static void queue_free_pt(struct domain_
>      iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
>  }
> 
> -/* clear one page's page table */
> -static int dma_pte_clear_one(struct domain *domain, daddr_t addr,
> -                             unsigned int order,
> -                             unsigned int *flush_flags)
> -{
> -    struct domain_iommu *hd = dom_iommu(domain);
> -    struct dma_pte *page = NULL, *pte = NULL, old;
> -    u64 pg_maddr;
> -    unsigned int level = (order / LEVEL_STRIDE) + 1;
> -
> -    spin_lock(&hd->arch.mapping_lock);
> -    /* get target level pte */
> -    pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> false);
> -    if ( pg_maddr < PAGE_SIZE )
> -    {
> -        spin_unlock(&hd->arch.mapping_lock);
> -        return pg_maddr ? -ENOMEM : 0;
> -    }
> -
> -    page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
> -    pte = &page[address_level_offset(addr, level)];
> -
> -    if ( !dma_pte_present(*pte) )
> -    {
> -        spin_unlock(&hd->arch.mapping_lock);
> -        unmap_vtd_domain_page(page);
> -        return 0;
> -    }
> -
> -    old = *pte;
> -    dma_clear_pte(*pte);
> -    iommu_sync_cache(pte, sizeof(*pte));
> -
> -    while ( pt_update_contig_markers(&page->val,
> -                                     address_level_offset(addr, level),
> -                                     level, PTE_kind_null) &&
> -            ++level < min_pt_levels )
> -    {
> -        struct page_info *pg = maddr_to_page(pg_maddr);
> -
> -        unmap_vtd_domain_page(page);
> -
> -        pg_maddr = addr_to_dma_page_maddr(domain, addr, level,
> flush_flags,
> -                                          false);
> -        BUG_ON(pg_maddr < PAGE_SIZE);
> -
> -        page = map_vtd_domain_page(pg_maddr);
> -        pte = &page[address_level_offset(addr, level)];
> -        dma_clear_pte(*pte);
> -        iommu_sync_cache(pte, sizeof(*pte));
> -
> -        *flush_flags |= IOMMU_FLUSHF_all;
> -        iommu_queue_free_pgtable(hd, pg);
> -        perfc_incr(iommu_pt_coalesces);
> -    }
> -
> -    spin_unlock(&hd->arch.mapping_lock);
> -
> -    unmap_vtd_domain_page(page);
> -
> -    *flush_flags |= IOMMU_FLUSHF_modified;
> -
> -    if ( order && !dma_pte_superpage(old) )
> -        queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)),
> -                      order / LEVEL_STRIDE);
> -
> -    return 0;
> -}
> -
>  static int iommu_set_root_entry(struct vtd_iommu *iommu)
>  {
>      u32 sts;
> @@ -2261,6 +2192,12 @@ static int __must_check cf_check intel_i
>  static int __must_check cf_check intel_iommu_unmap_page(
>      struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_flags)
>  {
> +    struct domain_iommu *hd = dom_iommu(d);
> +    daddr_t addr = dfn_to_daddr(dfn);
> +    struct dma_pte *page = NULL, *pte = NULL, old;
> +    uint64_t pg_maddr;
> +    unsigned int level = (order / LEVEL_STRIDE) + 1;
> +
>      /* Do nothing if VT-d shares EPT page table */
>      if ( iommu_use_hap_pt(d) )
>          return 0;
> @@ -2269,7 +2206,62 @@ static int __must_check cf_check intel_i
>      if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
>          return 0;
> 
> -    return dma_pte_clear_one(d, dfn_to_daddr(dfn), order, flush_flags);
> +    spin_lock(&hd->arch.mapping_lock);
> +    /* get target level pte */
> +    pg_maddr = addr_to_dma_page_maddr(d, addr, level, flush_flags, false);
> +    if ( pg_maddr < PAGE_SIZE )
> +    {
> +        spin_unlock(&hd->arch.mapping_lock);
> +        return pg_maddr ? -ENOMEM : 0;
> +    }
> +
> +    page = map_vtd_domain_page(pg_maddr);
> +    pte = &page[address_level_offset(addr, level)];
> +
> +    if ( !dma_pte_present(*pte) )
> +    {
> +        spin_unlock(&hd->arch.mapping_lock);
> +        unmap_vtd_domain_page(page);
> +        return 0;
> +    }
> +
> +    old = *pte;
> +    dma_clear_pte(*pte);
> +    iommu_sync_cache(pte, sizeof(*pte));
> +
> +    while ( pt_update_contig_markers(&page->val,
> +                                     address_level_offset(addr, level),
> +                                     level, PTE_kind_null) &&
> +            ++level < min_pt_levels )
> +    {
> +        struct page_info *pg = maddr_to_page(pg_maddr);
> +
> +        unmap_vtd_domain_page(page);
> +
> +        pg_maddr = addr_to_dma_page_maddr(d, addr, level, flush_flags,
> false);
> +        BUG_ON(pg_maddr < PAGE_SIZE);
> +
> +        page = map_vtd_domain_page(pg_maddr);
> +        pte = &page[address_level_offset(addr, level)];
> +        dma_clear_pte(*pte);
> +        iommu_sync_cache(pte, sizeof(*pte));
> +
> +        *flush_flags |= IOMMU_FLUSHF_all;
> +        iommu_queue_free_pgtable(hd, pg);
> +        perfc_incr(iommu_pt_coalesces);
> +    }
> +
> +    spin_unlock(&hd->arch.mapping_lock);
> +
> +    unmap_vtd_domain_page(page);
> +
> +    *flush_flags |= IOMMU_FLUSHF_modified;
> +
> +    if ( order && !dma_pte_superpage(old) )
> +        queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)),
> +                      order / LEVEL_STRIDE);
> +
> +    return 0;
>  }
> 
>  static int cf_check intel_iommu_lookup_page(


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts
  2022-04-25  8:30 ` [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts Jan Beulich
@ 2022-04-27 13:08   ` Andrew Cooper
  2022-04-27 13:57     ` Jan Beulich
  2022-05-03 10:10   ` Roger Pau Monné
  1 sibling, 1 reply; 106+ messages in thread
From: Andrew Cooper @ 2022-04-27 13:08 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant, Roger Pau Monne

On 25/04/2022 09:30, Jan Beulich wrote:
> Recent changes (likely 5fafa6cf529a ["AMD/IOMMU: have callers specify
> the target level for page table walks"]) have made Coverity notice a
> shift count in iommu_pde_from_dfn() which might in theory grow too
> large. While this isn't a problem in practice, address the concern
> nevertheless to not leave dangling breakage in case very large
> superpages would be enabled at some point.
>
> Coverity ID: 1504264
>
> While there also address a similar issue in set_iommu_ptes_present().
> It's not clear to me why Coverity hasn't spotted that one.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v4: New.
>
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -89,11 +89,11 @@ static unsigned int set_iommu_ptes_prese
>                                             bool iw, bool ir)
>  {
>      union amd_iommu_pte *table, *pde;
> -    unsigned int page_sz, flush_flags = 0;
> +    unsigned long page_sz = 1UL << (PTE_PER_TABLE_SHIFT * (pde_level - 1));

There's an off-by-12 error somewhere here.

Judging by its use, it should be named mapping_frames (or similar) instead.

With that fixed, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map()
  2022-04-25  8:32 ` [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map() Jan Beulich
@ 2022-04-27 13:16   ` Andrew Cooper
  2022-04-27 14:05     ` Jan Beulich
  2022-05-03 10:25   ` Roger Pau Monné
  1 sibling, 1 reply; 106+ messages in thread
From: Andrew Cooper @ 2022-04-27 13:16 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant, Roger Pau Monne

On 25/04/2022 09:32, Jan Beulich wrote:
> As of 68a8aa5d7264 ("iommu: make map and unmap take a page count,
> similar to flush") there's no need anymore to have a loop here.
>
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: New.
>
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -308,11 +308,9 @@ int iommu_map(struct domain *d, dfn_t df
>                     d->domain_id, dfn_x(dfn_add(dfn, i)),
>                     mfn_x(mfn_add(mfn, i)), rc);
>  
> -        while ( i-- )
> -            /* if statement to satisfy __must_check */
> -            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
> -                            flush_flags) )
> -                continue;
> +        /* while statement to satisfy __must_check */
> +        while ( iommu_unmap(d, dfn, i, flush_flags) )
> +            break;

How can this possibly be correct?

The map_page() calls are made one 4k page at a time, and this while loop
is undoing every iteration, one 4k page at a time.

Without this while loop, any failure after the first page will end up
not being unmapped.

~Andrew

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts
  2022-04-27 13:08   ` Andrew Cooper
@ 2022-04-27 13:57     ` Jan Beulich
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-27 13:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, Roger Pau Monne, xen-devel

On 27.04.2022 15:08, Andrew Cooper wrote:
> On 25/04/2022 09:30, Jan Beulich wrote:
>> Recent changes (likely 5fafa6cf529a ["AMD/IOMMU: have callers specify
>> the target level for page table walks"]) have made Coverity notice a
>> shift count in iommu_pde_from_dfn() which might in theory grow too
>> large. While this isn't a problem in practice, address the concern
>> nevertheless to not leave dangling breakage in case very large
>> superpages would be enabled at some point.
>>
>> Coverity ID: 1504264
>>
>> While there also address a similar issue in set_iommu_ptes_present().
>> It's not clear to me why Coverity hasn't spotted that one.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v4: New.
>>
>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -89,11 +89,11 @@ static unsigned int set_iommu_ptes_prese
>>                                             bool iw, bool ir)
>>  {
>>      union amd_iommu_pte *table, *pde;
>> -    unsigned int page_sz, flush_flags = 0;
>> +    unsigned long page_sz = 1UL << (PTE_PER_TABLE_SHIFT * (pde_level - 1));
> 
> There's an off-by-12 error somewhere here.
> 
> Judging by its use, it should be named mapping_frames (or similar) instead.

Hmm, I think the author meant "size of the (potentially large) page
in units of 4k (base) pages". That's still some form of "page size".
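
To spell out the arithmetic (merely expanding the existing definition,
with PTE_PER_TABLE_SHIFT being 9):

    pde_level 1:  page_sz = 1UL << (9 * 0) =      1  /* 4k frames ->  4k */
    pde_level 2:  page_sz = 1UL << (9 * 1) =    512  /* 4k frames ->  2M */
    pde_level 3:  page_sz = 1UL << (9 * 2) = 262144  /* 4k frames ->  1G */

I.e. the value counts base-page frames, not bytes - hence the "off-by-12"
observation when reading the name as a byte count.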

> With that fixed, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

If anything there could be another patch renaming the variable; that's
certainly not the goal here. But as said, I don't think the variable
name is strictly wrong. And with that it also doesn't feel entirely
right that I would be on the hook of renaming it. I also think that
"mapping_frames" isn't much better; it would need to be something
like "nr_frames_per_pte", which is starting to get longish.

So for the moment thanks for the R-b, but I will only apply it once
we've sorted the condition you provided it under.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map()
  2022-04-27 13:16   ` Andrew Cooper
@ 2022-04-27 14:05     ` Jan Beulich
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-04-27 14:05 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, Roger Pau Monne, xen-devel

On 27.04.2022 15:16, Andrew Cooper wrote:
> On 25/04/2022 09:32, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/iommu.c
>> +++ b/xen/drivers/passthrough/iommu.c
>> @@ -308,11 +308,9 @@ int iommu_map(struct domain *d, dfn_t df
>>                     d->domain_id, dfn_x(dfn_add(dfn, i)),
>>                     mfn_x(mfn_add(mfn, i)), rc);
>>  
>> -        while ( i-- )
>> -            /* if statement to satisfy __must_check */
>> -            if ( iommu_call(hd->platform_ops, unmap_page, d, dfn_add(dfn, i),
>> -                            flush_flags) )
>> -                continue;
>> +        /* while statement to satisfy __must_check */
>> +        while ( iommu_unmap(d, dfn, i, flush_flags) )
>> +            break;
> 
> How can this possibly be correct?
> 
> The map_page() calls are made one 4k page at a time, and this while loop
> is undoing every iteration, one 4k page at a time.
> 
> Without this while loop, any failure after the first page will end up
> not being unmapped.

There's no real "while loop" here, it's effectively

        if ( iommu_unmap(d, dfn, i, flush_flags) )
            /* nothing */;

just that I wanted to avoid the empty body (but I could switch if
that's preferred).

Note that the 3rd argument to iommu_unmap() is i, not 1.

But I have to admit that I also have trouble interpreting your last
sentence - how would it matter if there was no code here at all? Or
did you maybe mean "With ..." instead of "Without ..."?

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts
  2022-04-25  8:30 ` [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts Jan Beulich
  2022-04-27 13:08   ` Andrew Cooper
@ 2022-05-03 10:10   ` Roger Pau Monné
  2022-05-03 14:34     ` Jan Beulich
  1 sibling, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-03 10:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:30:33AM +0200, Jan Beulich wrote:
> Recent changes (likely 5fafa6cf529a ["AMD/IOMMU: have callers specify
> the target level for page table walks"]) have made Coverity notice a
> shift count in iommu_pde_from_dfn() which might in theory grow too
> large. While this isn't a problem in practice, address the concern
> nevertheless to not leave dangling breakage in case very large
> superpages would be enabled at some point.
> 
> Coverity ID: 1504264
> 
> While there also address a similar issue in set_iommu_ptes_present().
> It's not clear to me why Coverity hasn't spotted that one.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> v4: New.
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -89,11 +89,11 @@ static unsigned int set_iommu_ptes_prese
>                                             bool iw, bool ir)
>  {
>      union amd_iommu_pte *table, *pde;
> -    unsigned int page_sz, flush_flags = 0;
> +    unsigned long page_sz = 1UL << (PTE_PER_TABLE_SHIFT * (pde_level - 1));

Seeing the discussion from Andrews reply, nr_pages might be more
appropriate while still quite short.

I'm not making my Rb conditional to that change though.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map()
  2022-04-25  8:32 ` [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map() Jan Beulich
  2022-04-27 13:16   ` Andrew Cooper
@ 2022-05-03 10:25   ` Roger Pau Monné
  2022-05-03 14:37     ` Jan Beulich
  1 sibling, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-03 10:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:32:10AM +0200, Jan Beulich wrote:
> As of 68a8aa5d7264 ("iommu: make map and unmap take a page count,
> similar to flush") there's no need anymore to have a loop here.
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

I wonder whether we should have a macro to ignore returns from
__must_check attributed functions.  Ie:

#define IGNORE_RETURN(exp) while ( exp ) break;

So as to avoid confusion (and having to reason about) whether the usage of
while is correct.  I always find it hard to convince myself that such loop
expressions are correct.
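
E.g. the site at hand would then simply become:

    IGNORE_RETURN(iommu_unmap(d, dfn, i, flush_flags));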

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks
  2022-04-25  8:33 ` [PATCH v4 04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
@ 2022-05-03 12:37   ` Roger Pau Monné
  2022-05-03 14:44     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-03 12:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:33:32AM +0200, Jan Beulich wrote:
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -307,11 +338,10 @@ int iommu_map(struct domain *d, dfn_t df
>          if ( !d->is_shutting_down && printk_ratelimit() )
>              printk(XENLOG_ERR
>                     "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" failed: %d\n",
> -                   d->domain_id, dfn_x(dfn_add(dfn, i)),
> -                   mfn_x(mfn_add(mfn, i)), rc);
> +                   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);

Since you are already adjusting the line, I wouldn't mind if you also
switched to use %pd at once (and in the same adjustment done to
iommu_unmap).

>  
>          /* while statement to satisfy __must_check */
> -        while ( iommu_unmap(d, dfn, i, flush_flags) )
> +        while ( iommu_unmap(d, dfn0, i, flush_flags) )

To match previous behavior you likely need to use i + (1UL << order),
so pages covered by the map_page call above are also taken care of in the
unmap request?

With that fixed:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

(Feel free to adjust the printks to use %pd or not, that's not a
requirement for the Rb)

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-04-25  8:34 ` [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
@ 2022-05-03 13:00   ` Roger Pau Monné
  2022-05-03 14:50     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-03 13:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:34:23AM +0200, Jan Beulich wrote:
> While already the case for PVH, there's no reason to treat PV
> differently here, though of course the addresses get taken from another
> source in this case. Except that, to match CPU side mappings, by default
> we permit r/o ones. This then also means we now deal consistently with
> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> [integrated] v1: Integrate into series.
> [standalone] v2: Keep IOMMU mappings in sync with CPU ones.
> 
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -275,12 +275,12 @@ void iommu_identity_map_teardown(struct
>      }
>  }
>  
> -static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
> -                                         unsigned long pfn,
> -                                         unsigned long max_pfn)
> +static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
> +                                                 unsigned long pfn,
> +                                                 unsigned long max_pfn)
>  {
>      mfn_t mfn = _mfn(pfn);
> -    unsigned int i, type;
> +    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
>  
>      /*
>       * Set up 1:1 mapping for dom0. Default to include only conventional RAM
> @@ -289,44 +289,60 @@ static bool __hwdom_init hwdom_iommu_map
>       * that fall in unusable ranges for PV Dom0.
>       */
>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
> -        return false;
> +        return 0;
>  
>      switch ( type = page_get_ram_type(mfn) )
>      {
>      case RAM_TYPE_UNUSABLE:
> -        return false;
> +        return 0;
>  
>      case RAM_TYPE_CONVENTIONAL:
>          if ( iommu_hwdom_strict )
> -            return false;
> +            return 0;
>          break;
>  
>      default:
>          if ( type & RAM_TYPE_RESERVED )
>          {
>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> -                return false;
> +                perms = 0;
>          }
> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
> -            return false;
> +        else if ( is_hvm_domain(d) )
> +            return 0;
> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
> +            perms = 0;
>      }
>  
>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
> -        return false;
> +        return 0;
>      /* ... or the IO-APIC */
> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> -            return false;
> +    if ( has_vioapic(d) )
> +    {
> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> +                return 0;
> +    }
> +    else if ( is_pv_domain(d) )
> +    {
> +        /*
> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
> +         * ones there, so it should also have such established for IOMMUs.
> +         */
> +        for ( i = 0; i < nr_ioapics; i++ )
> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
> +                       ? IOMMUF_readable : 0;

If we really are after consistency with CPU side mappings, we should
likely take the whole contents of mmio_ro_ranges and d->iomem_caps
into account, not just the pages belonging to the IO-APIC?

There could also be HPET pages mapped as RO for PV.
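
Something along these lines maybe (just a sketch of the idea, not of where
exactly in the function it would sit):

    /* Keep IOMMU mappings in sync with what the CPU side permits. */
    if ( !iomem_access_permitted(d, pfn, pfn) )
        return 0;
    if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
        return IOMMUF_readable;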

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts
  2022-05-03 10:10   ` Roger Pau Monné
@ 2022-05-03 14:34     ` Jan Beulich
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-05-03 14:34 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 03.05.2022 12:10, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:30:33AM +0200, Jan Beulich wrote:
>> Recent changes (likely 5fafa6cf529a ["AMD/IOMMU: have callers specify
>> the target level for page table walks"]) have made Coverity notice a
>> shift count in iommu_pde_from_dfn() which might in theory grow too
>> large. While this isn't a problem in practice, address the concern
>> nevertheless to not leave dangling breakage in case very large
>> superpages would be enabled at some point.
>>
>> Coverity ID: 1504264
>>
>> While there also address a similar issue in set_iommu_ptes_present().
>> It's not clear to me why Coverity hasn't spotted that one.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -89,11 +89,11 @@ static unsigned int set_iommu_ptes_prese
>>                                             bool iw, bool ir)
>>  {
>>      union amd_iommu_pte *table, *pde;
>> -    unsigned int page_sz, flush_flags = 0;
>> +    unsigned long page_sz = 1UL << (PTE_PER_TABLE_SHIFT * (pde_level - 1));
> 
> Seeing the discussion from Andrews reply, nr_pages might be more
> appropriate while still quite short.

Yes and no - it then would be ambiguous as to what size pages are
meant.

> I'm not making my Rb conditional to that change though.

Good, thanks. But I guess I'm still somewhat stuck until hearing
back from Andrew (although one might not count a conditional R-b
as a "pending objection"). I'll give him a few more days, but I
continue to think this ought to be a separate change (if renaming
is really needed in the 1st place) ...

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map()
  2022-05-03 10:25   ` Roger Pau Monné
@ 2022-05-03 14:37     ` Jan Beulich
  2022-05-03 16:22       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-03 14:37 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 03.05.2022 12:25, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:32:10AM +0200, Jan Beulich wrote:
>> As of 68a8aa5d7264 ("iommu: make map and unmap take a page count,
>> similar to flush") there's no need anymore to have a loop here.
>>
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> I wonder whether we should have a macro to ignore returns from
> __must_check attributed functions.  Ie:
> 
> #define IGNORE_RETURN(exp) while ( exp ) break;
> 
> As to avoid confusion (and having to reason) whether the usage of
> while is correct.  I always find it confusing to assert such loop
> expressions are correct.

I've been considering some form of wrapper macro (not specifically
the one you suggest), but I'm of two minds: On one hand I agree it
would help readers, but otoh I fear it may make it more attractive
to actually override the __must_check (which really ought to be an
exception).

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks
  2022-05-03 12:37   ` Roger Pau Monné
@ 2022-05-03 14:44     ` Jan Beulich
  2022-05-04 10:20       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-03 14:44 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 03.05.2022 14:37, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:33:32AM +0200, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/iommu.c
>> +++ b/xen/drivers/passthrough/iommu.c
>> @@ -307,11 +338,10 @@ int iommu_map(struct domain *d, dfn_t df
>>          if ( !d->is_shutting_down && printk_ratelimit() )
>>              printk(XENLOG_ERR
>>                     "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" failed: %d\n",
>> -                   d->domain_id, dfn_x(dfn_add(dfn, i)),
>> -                   mfn_x(mfn_add(mfn, i)), rc);
>> +                   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
> 
> Since you are already adjusting the line, I wouldn't mind if you also
> switched to use %pd at once (and in the same adjustment done to
> iommu_unmap).

I did consider doing so, but decided against since this would lead
to also touching the format string (which right now is unaltered).

>>  
>>          /* while statement to satisfy __must_check */
>> -        while ( iommu_unmap(d, dfn, i, flush_flags) )
>> +        while ( iommu_unmap(d, dfn0, i, flush_flags) )
> 
> To match previous behavior you likely need to use i + (1UL << order),
> so pages covered by the map_page call above are also taken care in the
> unmap request?

I'm afraid I don't follow: Prior behavior was to unmap only what
was mapped on earlier iterations. This continues to be that way.

> With that fixed:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, but I'll wait with applying this.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches
  2022-04-25  8:34 ` [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
@ 2022-05-03 14:49   ` Roger Pau Monné
  2022-05-04  9:46     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-03 14:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
> For large page mappings to be easily usable (i.e. in particular without
> un-shattering of smaller page mappings) and for mapping operations to
> then also be more efficient, pass batches of Dom0 memory to iommu_map().
> In dom0_construct_pv() and its helpers (covering strict mode) this
> additionally requires establishing the type of those pages (albeit with
> zero type references).

I think it's possible I've already asked this.  Would it make sense to
add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
specific flag?

It would seem to me that doing it that way would also allow the
mappings to get established in blocks for domUs.

And it would be less error prone than having to match memory allocations
with iommu_memory_setup() calls in order for the pages to be added to the
IOMMU page tables.
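
Roughly what I have in mind (MEMF_iommu_map is a made-up flag name, and
the sketch glosses over flushing and error unwinding):

    /* In alloc_domheap_pages(), once the pages have been assigned to d: */
    if ( (memflags & MEMF_iommu_map) && need_iommu_pt_sync(d) )
    {
        unsigned int flush_flags = 0;
        int rc = iommu_map(d, _dfn(mfn_x(page_to_mfn(pg))), page_to_mfn(pg),
                           1UL << order, IOMMUF_readable | IOMMUF_writable,
                           &flush_flags);

        /* rc / flush_flags handling left out of the sketch. */
    }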

> The earlier establishing of PGT_writable_page | PGT_validated requires
> the existing places where this gets done (through get_page_and_type())
> to be updated: For pages which actually have a mapping, the type
> refcount needs to be 1.
> 
> There is actually a related bug that gets fixed here as a side effect:
> Typically the last L1 table would get marked as such only after
> get_page_and_type(..., PGT_writable_page). While this is fine as far as
> refcounting goes, the page did remain mapped in the IOMMU in this case
> (when "iommu=dom0-strict").
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Subsequently p2m_add_identity_entry() may want to also gain an order
> parameter, for arch_iommu_hwdom_init() to use. While this only affects
> non-RAM regions, systems typically have 2-16Mb of reserved space
> immediately below 4Gb, which hence could be mapped more efficiently.

Indeed.

> The installing of zero-ref writable types has in fact shown (observed
> while putting together the change) that despite the intention by the
> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
> sufficiently ordinary pages (at the very least initrd and P2M ones as
> well as pages that are part of the initial allocation but not part of
> the initial mapping) still have been starting out as PGT_none, meaning
> that they would have gained IOMMU mappings only the first time these
> pages would get mapped writably. Consequently an open question is
> whether iommu_memory_setup() should set the pages to PGT_writable_page
> independent of need_iommu_pt_sync().

I think I'm confused: doesn't the setting of PGT_writable_page happen
as a result of need_iommu_pt_sync() and having those pages added to
the IOMMU page tables? (so they can be properly tracked and IOMMU
mappings are removed if the page is also removed)

If the pages are not added here (because dom0 is not running in strict
mode) then setting PGT_writable_page is not required?

> I didn't think I need to address the bug mentioned in the description in
> a separate (prereq) patch, but if others disagree I could certainly
> break out that part (needing to first use iommu_legacy_unmap() then).
> 
> Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
> They'll end up mapped via the later get_page_and_type().
> 
> As to the way these refs get installed: I've chosen to avoid the more
> expensive {get,put}_page_and_type(), favoring to put in place the
> intended type directly. I guess I could be convinced to avoid this
> bypassing of the actual logic; I merely think it's unnecessarily
> expensive.

In a different piece of code I would have asked to avoid open-coding
the type changes.  But there are already open-coded type changes in
dom0_construct_pv(), so adding those doesn't make the current status
worse.

> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
> occurs only afterwards anyway, so nothing can have made it into TLBs
> just yet.

Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
away, as we must strictly do the hwdom init before enabling the iommu
itself.

The one in dom0 build I'm less convinced, just to be on the safe side
if we ever change the order of IOMMU init and memory setup.  I would
expect flushing an empty TLB to not be very expensive?

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>  
>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>  {
> -    unsigned long i, top, max_pfn;
> -    unsigned int flush_flags = 0;
> +    unsigned long i, top, max_pfn, start, count;
> +    unsigned int flush_flags = 0, start_perms = 0;
>  
>      BUG_ON(!is_hardware_domain(d));
>  
> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>       * setting up potentially conflicting mappings here.
>       */
> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>  
> -    for ( ; i < top; i++ )
> +    for ( i = start, count = 0; i < top; )
>      {
>          unsigned long pfn = pdx_to_pfn(i);
>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>          if ( !perms )
>              rc = 0;
>          else if ( paging_mode_translate(d) )
> +        {
>              rc = p2m_add_identity_entry(d, pfn,
>                                          perms & IOMMUF_writable ? p2m_access_rw
>                                                                  : p2m_access_r,
>                                          0);
> +            if ( rc )
> +                printk(XENLOG_WARNING
> +                       "%pd: identity mapping of %lx failed: %d\n",
> +                       d, pfn, rc);
> +        }
> +        else if ( pfn != start + count || perms != start_perms )
> +        {
> +        commit:
> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
> +                           &flush_flags);
> +            if ( rc )
> +                printk(XENLOG_WARNING
> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
> +                       d, pfn, pfn + count, rc);
> +            SWAP(start, pfn);
> +            start_perms = perms;
> +            count = 1;
> +        }
>          else
> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> -                           perms, &flush_flags);
> +        {
> +            ++count;
> +            rc = 0;

Seeing as we want to process this in blocks now, I wonder whether it
would make sense to take a different approach, and use a rangeset to
track which regions need to be mapped.  What gets added would be based
on the host e820 plus the options
iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
based on the logic in hwdom_iommu_map() and finally we could iterate
over the regions afterwards using rangeset_consume_ranges().

Not that you strictly need to do it here, just think the end result
would be clearer.
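
Something along these lines (untested sketch; error handling omitted,
the callback signature mirrors the existing vPCI usage of
rangeset_consume_ranges(), and tracking of per-range permissions,
e.g. via a separate rangeset for r/o pages, is left out):

struct map_data {
    struct domain *d;
    unsigned int flush_flags;
};

static int __hwdom_init cf_check identity_map(unsigned long s,
                                              unsigned long e,
                                              void *data, unsigned long *c)
{
    struct map_data *info = data;

    /* Rangeset ranges are inclusive, hence the +1. */
    return iommu_map(info->d, _dfn(s), _mfn(s), e - s + 1,
                     IOMMUF_readable | IOMMUF_writable,
                     &info->flush_flags);
}

and then in arch_iommu_hwdom_init():

    struct rangeset *map = rangeset_new(NULL, NULL, 0);
    struct map_data data = { .d = d };

    /*
     * Populate from the host E820 plus
     * iommu_hwdom_{strict,inclusive,reserved}, then punch holes (Xen
     * ranges, the Interrupt Address Range, IO-APIC pages, ...) based
     * on the logic currently in hwdom_iommu_map().
     */
    rc = rangeset_consume_ranges(map, identity_map, &data);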

Thanks, Roger.



* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-03 13:00   ` Roger Pau Monné
@ 2022-05-03 14:50     ` Jan Beulich
  2022-05-04  9:32       ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-03 14:50 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 03.05.2022 15:00, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:34:23AM +0200, Jan Beulich wrote:
>> While already the case for PVH, there's no reason to treat PV
>> differently here, though of course the addresses get taken from another
>> source in this case. Except that, to match CPU side mappings, by default
>> we permit r/o ones. This then also means we now deal consistently with
>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> [integrated] v1: Integrate into series.
>> [standalone] v2: Keep IOMMU mappings in sync with CPU ones.
>>
>> --- a/xen/drivers/passthrough/x86/iommu.c
>> +++ b/xen/drivers/passthrough/x86/iommu.c
>> @@ -275,12 +275,12 @@ void iommu_identity_map_teardown(struct
>>      }
>>  }
>>  
>> -static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
>> -                                         unsigned long pfn,
>> -                                         unsigned long max_pfn)
>> +static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
>> +                                                 unsigned long pfn,
>> +                                                 unsigned long max_pfn)
>>  {
>>      mfn_t mfn = _mfn(pfn);
>> -    unsigned int i, type;
>> +    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
>>  
>>      /*
>>       * Set up 1:1 mapping for dom0. Default to include only conventional RAM
>> @@ -289,44 +289,60 @@ static bool __hwdom_init hwdom_iommu_map
>>       * that fall in unusable ranges for PV Dom0.
>>       */
>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>> -        return false;
>> +        return 0;
>>  
>>      switch ( type = page_get_ram_type(mfn) )
>>      {
>>      case RAM_TYPE_UNUSABLE:
>> -        return false;
>> +        return 0;
>>  
>>      case RAM_TYPE_CONVENTIONAL:
>>          if ( iommu_hwdom_strict )
>> -            return false;
>> +            return 0;
>>          break;
>>  
>>      default:
>>          if ( type & RAM_TYPE_RESERVED )
>>          {
>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>> -                return false;
>> +                perms = 0;
>>          }
>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>> -            return false;
>> +        else if ( is_hvm_domain(d) )
>> +            return 0;
>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>> +            perms = 0;
>>      }
>>  
>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>> -        return false;
>> +        return 0;
>>      /* ... or the IO-APIC */
>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>> -            return false;
>> +    if ( has_vioapic(d) )
>> +    {
>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>> +                return 0;
>> +    }
>> +    else if ( is_pv_domain(d) )
>> +    {
>> +        /*
>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>> +         * ones there, so it should also have such established for IOMMUs.
>> +         */
>> +        for ( i = 0; i < nr_ioapics; i++ )
>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>> +                       ? IOMMUF_readable : 0;
> 
> If we really are after consistency with CPU side mappings, we should
> likely take the whole contents of mmio_ro_ranges and d->iomem_caps
> into account, not just the pages belonging to the IO-APIC?
> 
> There could also be HPET pages mapped as RO for PV.

Hmm. This would be a yet bigger functional change, but indeed would further
improve consistency. But shouldn't we then also establish r/w mappings for
stuff in ->iomem_caps but not in mmio_ro_ranges? This would feel like going
too far ...

Jan




* Re: [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables
  2022-04-25  8:35 ` [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables Jan Beulich
@ 2022-05-03 16:20   ` Roger Pau Monné
  2022-05-04 13:07     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-03 16:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Mon, Apr 25, 2022 at 10:35:45AM +0200, Jan Beulich wrote:
> For vendor specific code to support superpages we need to be able to
> deal with a superpage mapping replacing an intermediate page table (or
> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
> needed to free individual page tables while a domain is still alive.
> Since the freeing needs to be deferred until after a suitable IOTLB
> flush was performed, released page tables get queued for processing by a
> tasklet.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I was considering whether to use a softirq-tasklet instead. This would
> have the benefit of avoiding extra scheduling operations, but come with
> the risk of the freeing happening prematurely because of a
> process_pending_softirqs() somewhere.

I'm sorry again if I already raised this; I don't seem to be able to
find a reference.

What about doing the freeing before resuming the guest execution in
guest vCPU context?

We already have a hook like this on HVM in hvm_do_resume() calling
vpci_process_pending().  I wonder whether we could have a similar hook
for PV and keep the pages to be freed in the vCPU instead of the pCPU.
This would have the benefit of being able to context switch the vCPU
in case the operation takes too long.

Not that the current approach is wrong, but by doing it in the guest
resume path we could likely prevent guests doing heavy p2m
modifications from hogging CPU time.
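
Roughly what I have in mind (sketch only; the hook name and the
per-vCPU list are invented here, they're not part of this patch):

/* Called on the way back to the guest, before resuming the vCPU. */
static void vcpu_free_queued_pgtables(struct vcpu *v)
{
    unsigned int done = 0;
    struct page_info *pg;

    /* 'pgt_free_list' would be a new per-vCPU field. */
    while ( (pg = page_list_remove_head(&v->arch.pgt_free_list)) != NULL )
    {
        free_domheap_page(pg);

        /*
         * Bound the work done per resume; anything left over gets
         * picked up the next time the vCPU comes back this way (or the
         * vCPU could be sent through the scheduler, similar to what
         * vpci_process_pending() arranges for).
         */
        if ( !(++done & 0x1ff) )
            break;
    }
}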

> ---
> v4: Change type of iommu_queue_free_pgtable()'s 1st parameter. Re-base.
> v3: Call process_pending_softirqs() from free_queued_pgtables().
> 
> --- a/xen/arch/x86/include/asm/iommu.h
> +++ b/xen/arch/x86/include/asm/iommu.h
> @@ -147,6 +147,7 @@ void iommu_free_domid(domid_t domid, uns
>  int __must_check iommu_free_pgtables(struct domain *d);
>  struct domain_iommu;
>  struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
> +void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *pg);
>  
>  #endif /* !__ARCH_X86_IOMMU_H__ */
>  /*
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -12,6 +12,7 @@
>   * this program; If not, see <http://www.gnu.org/licenses/>.
>   */
>  
> +#include <xen/cpu.h>
>  #include <xen/sched.h>
>  #include <xen/iommu.h>
>  #include <xen/paging.h>
> @@ -550,6 +551,91 @@ struct page_info *iommu_alloc_pgtable(st
>      return pg;
>  }
>  
> +/*
> + * Intermediate page tables which get replaced by large pages may only be
> + * freed after a suitable IOTLB flush. Hence such pages get queued on a
> + * per-CPU list, with a per-CPU tasklet processing the list on the assumption
> + * that the necessary IOTLB flush will have occurred by the time tasklets get
> + * to run. (List and tasklet being per-CPU has the benefit of accesses not
> + * requiring any locking.)
> + */
> +static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
> +static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
> +
> +static void free_queued_pgtables(void *arg)
> +{
> +    struct page_list_head *list = arg;
> +    struct page_info *pg;
> +    unsigned int done = 0;
> +

With the current logic I think it might be helpful to assert that the
list is not empty when we get here?

Given that the operation requires a context switch, we would like to
avoid it unless there's indeed pending work to do.
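
I.e. something like (sketch only), right at the top of the function:

static void free_queued_pgtables(void *arg)
{
    struct page_list_head *list = arg;

    /* The tasklet should only ever be scheduled with work queued. */
    ASSERT(!page_list_empty(list));
    ...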

> +    while ( (pg = page_list_remove_head(list)) )
> +    {
> +        free_domheap_page(pg);
> +
> +        /* Granularity of checking somewhat arbitrary. */
> +        if ( !(++done & 0x1ff) )
> +             process_pending_softirqs();
> +    }
> +}
> +
> +void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *pg)
> +{
> +    unsigned int cpu = smp_processor_id();
> +
> +    spin_lock(&hd->arch.pgtables.lock);
> +    page_list_del(pg, &hd->arch.pgtables.list);
> +    spin_unlock(&hd->arch.pgtables.lock);
> +
> +    page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu));
> +
> +    tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu));
> +}
> +
> +static int cf_check cpu_callback(
> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> +{
> +    unsigned int cpu = (unsigned long)hcpu;
> +    struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
> +    struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);
> +
> +    switch ( action )
> +    {
> +    case CPU_DOWN_PREPARE:
> +        tasklet_kill(tasklet);
> +        break;
> +
> +    case CPU_DEAD:
> +        page_list_splice(list, &this_cpu(free_pgt_list));

I think you could check whether list is empty before queuing it?
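
I.e. something like (sketch; whatever the hook does after the splice,
which isn't quoted above, would move inside the if() as appropriate):

    case CPU_DEAD:
        if ( !page_list_empty(list) )
            page_list_splice(list, &this_cpu(free_pgt_list));
        break;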

Thanks, Roger.



* Re: [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map()
  2022-05-03 14:37     ` Jan Beulich
@ 2022-05-03 16:22       ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-03 16:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Tue, May 03, 2022 at 04:37:29PM +0200, Jan Beulich wrote:
> On 03.05.2022 12:25, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:32:10AM +0200, Jan Beulich wrote:
> >> As of 68a8aa5d7264 ("iommu: make map and unmap take a page count,
> >> similar to flush") there's no need anymore to have a loop here.
> >>
> >> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> > I wonder whether we should have a macro to ignore returns from
> > __must_check attributed functions.  Ie:
> > 
> > #define IGNORE_RETURN(exp) while ( exp ) break;
> > 
> > So as to avoid confusion (and having to reason about) whether the
> > usage of while is correct.  I always find it confusing to have to
> > verify that such loop expressions are correct.
> 
> I've been considering some form of wrapper macro (not specifically
> the one you suggest), but I'm of two minds: On one hand I agree it
> would help readers, but otoh I fear it may make it more attractive
> to actually override the __must_check (which really ought to be an
> exception).

Well, I think anyone reviewing the code would realize that the error
is being ignored, and hence check that this is actually intended.
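
For comparison, such a call site would then read (purely illustrative):

    /* current idiom */
    while ( iommu_unmap(d, dfn, i, flush_flags) )
        break;

    /* with the suggested wrapper */
    IGNORE_RETURN(iommu_unmap(d, dfn, i, flush_flags));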

Thanks, Roger.



* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-03 14:50     ` Jan Beulich
@ 2022-05-04  9:32       ` Jan Beulich
  2022-05-04 10:30         ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-04  9:32 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 03.05.2022 16:50, Jan Beulich wrote:
> On 03.05.2022 15:00, Roger Pau Monné wrote:
>> On Mon, Apr 25, 2022 at 10:34:23AM +0200, Jan Beulich wrote:
>>> While already the case for PVH, there's no reason to treat PV
>>> differently here, though of course the addresses get taken from another
>>> source in this case. Except that, to match CPU side mappings, by default
>>> we permit r/o ones. This then also means we now deal consistently with
>>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> [integrated] v1: Integrate into series.
>>> [standalone] v2: Keep IOMMU mappings in sync with CPU ones.
>>>
>>> --- a/xen/drivers/passthrough/x86/iommu.c
>>> +++ b/xen/drivers/passthrough/x86/iommu.c
>>> @@ -275,12 +275,12 @@ void iommu_identity_map_teardown(struct
>>>      }
>>>  }
>>>  
>>> -static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
>>> -                                         unsigned long pfn,
>>> -                                         unsigned long max_pfn)
>>> +static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
>>> +                                                 unsigned long pfn,
>>> +                                                 unsigned long max_pfn)
>>>  {
>>>      mfn_t mfn = _mfn(pfn);
>>> -    unsigned int i, type;
>>> +    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
>>>  
>>>      /*
>>>       * Set up 1:1 mapping for dom0. Default to include only conventional RAM
>>> @@ -289,44 +289,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>       * that fall in unusable ranges for PV Dom0.
>>>       */
>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>> -        return false;
>>> +        return 0;
>>>  
>>>      switch ( type = page_get_ram_type(mfn) )
>>>      {
>>>      case RAM_TYPE_UNUSABLE:
>>> -        return false;
>>> +        return 0;
>>>  
>>>      case RAM_TYPE_CONVENTIONAL:
>>>          if ( iommu_hwdom_strict )
>>> -            return false;
>>> +            return 0;
>>>          break;
>>>  
>>>      default:
>>>          if ( type & RAM_TYPE_RESERVED )
>>>          {
>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>> -                return false;
>>> +                perms = 0;
>>>          }
>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>> -            return false;
>>> +        else if ( is_hvm_domain(d) )
>>> +            return 0;
>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>> +            perms = 0;
>>>      }
>>>  
>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>> -        return false;
>>> +        return 0;
>>>      /* ... or the IO-APIC */
>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>> -            return false;
>>> +    if ( has_vioapic(d) )
>>> +    {
>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>> +                return 0;
>>> +    }
>>> +    else if ( is_pv_domain(d) )
>>> +    {
>>> +        /*
>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>> +         * ones there, so it should also have such established for IOMMUs.
>>> +         */
>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>> +                       ? IOMMUF_readable : 0;
>>
>> If we really are after consistency with CPU side mappings, we should
>> likely take the whole contents of mmio_ro_ranges and d->iomem_caps
>> into account, not just the pages belonging to the IO-APIC?
>>
>> There could also be HPET pages mapped as RO for PV.
> 
> Hmm. This would be a yet bigger functional change, but indeed would further
> improve consistency. But shouldn't we then also establish r/w mappings for
> stuff in ->iomem_caps but not in mmio_ro_ranges? This would feel like going
> too far ...

FTAOD I didn't mean to say that I think such mappings shouldn't be there;
I have been of the opinion that e.g. I/O directly to/from the linear
frame buffer of a graphics device should in principle be permitted. But
which specific mappings to put in place can imo not be derived from
->iomem_caps, as we merely subtract certain ranges after initially having
set all bits in it. Besides ranges not mapping any MMIO, even something
like the PCI ECAM ranges (parts of which we may also force to r/o, and
which we would hence cover here if I followed your suggestion) are
questionable in this regard.

Jan




* Re: [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches
  2022-05-03 14:49   ` Roger Pau Monné
@ 2022-05-04  9:46     ` Jan Beulich
  2022-05-04 11:20       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-04  9:46 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 03.05.2022 16:49, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
>> For large page mappings to be easily usable (i.e. in particular without
>> un-shattering of smaller page mappings) and for mapping operations to
>> then also be more efficient, pass batches of Dom0 memory to iommu_map().
>> In dom0_construct_pv() and its helpers (covering strict mode) this
>> additionally requires establishing the type of those pages (albeit with
>> zero type references).
> 
> I think it's possible I've already asked this.  Would it make sense to
> add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
> specific flag?

I don't think you did ask, but now that you do: This would look like a
layering violation to me. I don't think allocation should ever have
mapping (into the IOMMU or elsewhere) as a "side effect", no matter
that ...

> It would seem to me that doing it that way would also allow the
> mappings to get established in blocks for domUs.

... then this would perhaps be possible.

>> The installing of zero-ref writable types has in fact shown (observed
>> while putting together the change) that despite the intention by the
>> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
>> sufficiently ordinary pages (at the very least initrd and P2M ones as
>> well as pages that are part of the initial allocation but not part of
>> the initial mapping) still have been starting out as PGT_none, meaning
>> that they would have gained IOMMU mappings only the first time these
>> pages would get mapped writably. Consequently an open question is
>> whether iommu_memory_setup() should set the pages to PGT_writable_page
>> independent of need_iommu_pt_sync().
> 
> I think I'm confused, doesn't the setting of PGT_writable_page happen
> as a result of need_iommu_pt_sync() and having those pages added to
> the IOMMU page tables? (so they can be properly tracked and IOMMU
> > mappings are removed if the page is also removed)

In principle yes - in guest_physmap_add_page(). But this function isn't
called for the pages I did enumerate in the remark. XSA-288 really only
cared about getting this right for DomU-s.

> If the pages are not added here (because dom0 is not running in strict
> mode) then setting PGT_writable_page is not required?

Correct - in that case we skip fiddling with IOMMU mappings on
transitions to/from PGT_writable_page, and hence putting this type in
place would be benign (but improve consistency).

>> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
>> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
>> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
>> occurs only afterwards anyway, so nothing can have made it into TLBs
>> just yet.
> 
> Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
> away, as we must strictly do the hwdom init before enabling the iommu
> itself.

Why would that be? That's imo as much of an implementation detail as
...

> > I'm less convinced about the one in the dom0 build, just to be on the
> > safe side in case we ever change the order of IOMMU init and memory
> > setup.

... this. Just like we populate tables with the IOMMU already enabled
for DomU-s, I think the same would be valid to do for Dom0.

> I would expect flushing an empty TLB to not be very expensive?

I wouldn't "expect" this - it might be this way, but it surely depends
on whether an implementation can easily tell whether the TLB is empty,
and whether its emptiness actually makes a difference for the latency
of a flush operation.

>> --- a/xen/drivers/passthrough/x86/iommu.c
>> +++ b/xen/drivers/passthrough/x86/iommu.c
>> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>>  
>>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>>  {
>> -    unsigned long i, top, max_pfn;
>> -    unsigned int flush_flags = 0;
>> +    unsigned long i, top, max_pfn, start, count;
>> +    unsigned int flush_flags = 0, start_perms = 0;
>>  
>>      BUG_ON(!is_hardware_domain(d));
>>  
>> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>>       * setting up potentially conflicting mappings here.
>>       */
>> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>  
>> -    for ( ; i < top; i++ )
>> +    for ( i = start, count = 0; i < top; )
>>      {
>>          unsigned long pfn = pdx_to_pfn(i);
>>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>>          if ( !perms )
>>              rc = 0;
>>          else if ( paging_mode_translate(d) )
>> +        {
>>              rc = p2m_add_identity_entry(d, pfn,
>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>                                                                  : p2m_access_r,
>>                                          0);
>> +            if ( rc )
>> +                printk(XENLOG_WARNING
>> +                       "%pd: identity mapping of %lx failed: %d\n",
>> +                       d, pfn, rc);
>> +        }
>> +        else if ( pfn != start + count || perms != start_perms )
>> +        {
>> +        commit:
>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
>> +                           &flush_flags);
>> +            if ( rc )
>> +                printk(XENLOG_WARNING
>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
>> +                       d, pfn, pfn + count, rc);
>> +            SWAP(start, pfn);
>> +            start_perms = perms;
>> +            count = 1;
>> +        }
>>          else
>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>> -                           perms, &flush_flags);
>> +        {
>> +            ++count;
>> +            rc = 0;
> 
> Seeing as we want to process this in blocks now, I wonder whether it
> would make sense to take a different approach, and use a rangeset to
> track which regions need to be mapped.  What gets added would be based
> on the host e820 plus the options
> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
> based on the logic in hwdom_iommu_map() and finally we could iterate
> over the regions afterwards using rangeset_consume_ranges().
> 
> Not that you strictly need to do it here, just think the end result
> would be clearer.

The end result might indeed be, but it would be more of a change right
here. Hence I'd prefer to leave that out of the series for now.

Jan




* Re: [PATCH v4 04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks
  2022-05-03 14:44     ` Jan Beulich
@ 2022-05-04 10:20       ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 10:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Tue, May 03, 2022 at 04:44:45PM +0200, Jan Beulich wrote:
> On 03.05.2022 14:37, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:33:32AM +0200, Jan Beulich wrote:
> >> --- a/xen/drivers/passthrough/iommu.c
> >> +++ b/xen/drivers/passthrough/iommu.c
> >> @@ -307,11 +338,10 @@ int iommu_map(struct domain *d, dfn_t df
> >>          if ( !d->is_shutting_down && printk_ratelimit() )
> >>              printk(XENLOG_ERR
> >>                     "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" failed: %d\n",
> >> -                   d->domain_id, dfn_x(dfn_add(dfn, i)),
> >> -                   mfn_x(mfn_add(mfn, i)), rc);
> >> +                   d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
> > 
> > Since you are already adjusting the line, I wouldn't mind if you also
> > switched to use %pd at once (and in the same adjustment done to
> > iommu_unmap).
> 
> I did consider doing so, but decided against since this would lead
> to also touching the format string (which right now is unaltered).
> 
> >>  
> >>          /* while statement to satisfy __must_check */
> >> -        while ( iommu_unmap(d, dfn, i, flush_flags) )
> >> +        while ( iommu_unmap(d, dfn0, i, flush_flags) )
> > 
> > To match previous behavior you likely need to use i + (1UL << order),
> > so pages covered by the map_page call above are also taken care in the
> > unmap request?
> 
> I'm afraid I don't follow: Prior behavior was to unmap only what
> was mapped on earlier iterations. This continues to be that way.

My bad, I was wrong and somehow assumed that the previous behavior
would also pass the failed map entry, but that's not the case.

> > With that fixed:
> > 
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks, but I'll wait with applying this.

I withdraw my previous comment, feel free to apply this.

Thanks, Roger.



* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-04  9:32       ` Jan Beulich
@ 2022-05-04 10:30         ` Roger Pau Monné
  2022-05-04 10:51           ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 10:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Wed, May 04, 2022 at 11:32:51AM +0200, Jan Beulich wrote:
> On 03.05.2022 16:50, Jan Beulich wrote:
> > On 03.05.2022 15:00, Roger Pau Monné wrote:
> >> On Mon, Apr 25, 2022 at 10:34:23AM +0200, Jan Beulich wrote:
> >>> While already the case for PVH, there's no reason to treat PV
> >>> differently here, though of course the addresses get taken from another
> >>> source in this case. Except that, to match CPU side mappings, by default
> >>> we permit r/o ones. This then also means we now deal consistently with
> >>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
> >>>
> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>> ---
> >>> [integrated] v1: Integrate into series.
> >>> [standalone] v2: Keep IOMMU mappings in sync with CPU ones.
> >>>
> >>> --- a/xen/drivers/passthrough/x86/iommu.c
> >>> +++ b/xen/drivers/passthrough/x86/iommu.c
> >>> @@ -275,12 +275,12 @@ void iommu_identity_map_teardown(struct
> >>>      }
> >>>  }
> >>>  
> >>> -static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
> >>> -                                         unsigned long pfn,
> >>> -                                         unsigned long max_pfn)
> >>> +static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
> >>> +                                                 unsigned long pfn,
> >>> +                                                 unsigned long max_pfn)
> >>>  {
> >>>      mfn_t mfn = _mfn(pfn);
> >>> -    unsigned int i, type;
> >>> +    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
> >>>  
> >>>      /*
> >>>       * Set up 1:1 mapping for dom0. Default to include only conventional RAM
> >>> @@ -289,44 +289,60 @@ static bool __hwdom_init hwdom_iommu_map
> >>>       * that fall in unusable ranges for PV Dom0.
> >>>       */
> >>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
> >>> -        return false;
> >>> +        return 0;
> >>>  
> >>>      switch ( type = page_get_ram_type(mfn) )
> >>>      {
> >>>      case RAM_TYPE_UNUSABLE:
> >>> -        return false;
> >>> +        return 0;
> >>>  
> >>>      case RAM_TYPE_CONVENTIONAL:
> >>>          if ( iommu_hwdom_strict )
> >>> -            return false;
> >>> +            return 0;
> >>>          break;
> >>>  
> >>>      default:
> >>>          if ( type & RAM_TYPE_RESERVED )
> >>>          {
> >>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> >>> -                return false;
> >>> +                perms = 0;
> >>>          }
> >>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
> >>> -            return false;
> >>> +        else if ( is_hvm_domain(d) )
> >>> +            return 0;
> >>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
> >>> +            perms = 0;
> >>>      }
> >>>  
> >>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
> >>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
> >>> -        return false;
> >>> +        return 0;
> >>>      /* ... or the IO-APIC */
> >>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
> >>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >>> -            return false;
> >>> +    if ( has_vioapic(d) )
> >>> +    {
> >>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
> >>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >>> +                return 0;
> >>> +    }
> >>> +    else if ( is_pv_domain(d) )
> >>> +    {
> >>> +        /*
> >>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
> >>> +         * ones there, so it should also have such established for IOMMUs.
> >>> +         */
> >>> +        for ( i = 0; i < nr_ioapics; i++ )
> >>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
> >>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
> >>> +                       ? IOMMUF_readable : 0;
> >>
> >> If we really are after consistency with CPU side mappings, we should
> >> likely take the whole contents of mmio_ro_ranges and d->iomem_caps
> >> into account, not just the pages belonging to the IO-APIC?
> >>
> >> There could also be HPET pages mapped as RO for PV.
> > 
> > Hmm. This would be a yet bigger functional change, but indeed would further
> > improve consistency. But shouldn't we then also establish r/w mappings for
> > stuff in ->iomem_caps but not in mmio_ro_ranges? This would feel like going
> > too far ...
> 
> FTAOD I didn't mean to say that I think such mappings shouldn't be there;
> I have been of the opinion that e.g. I/O directly to/from the linear
> frame buffer of a graphics device should in principle be permitted. But
> which specific mappings to put in place can imo not be derived from
> ->iomem_caps, as we merely subtract certain ranges after initially having
> set all bits in it. Besides ranges not mapping any MMIO, even something
> like the PCI ECAM ranges (parts of which we may also force to r/o, and
> which we would hence cover here if I followed your suggestion) are
> questionable in this regard.

Right, ->iomem_caps is indeed too wide for our purpose.  What
about using something like:

else if ( is_pv_domain(d) )
{
    if ( !iomem_access_permitted(d, pfn, pfn) )
        return 0;
    if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
        return IOMMUF_readable;
}

That would get us a bit closer to allowed CPU side mappings, and we
don't need to special case IO-APIC or HPET addresses as those are
already added to ->iomem_caps or mmio_ro_ranges respectively by
dom0_setup_permissions().

Thanks, Roger.



* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-04 10:30         ` Roger Pau Monné
@ 2022-05-04 10:51           ` Jan Beulich
  2022-05-04 12:01             ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-04 10:51 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 04.05.2022 12:30, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 11:32:51AM +0200, Jan Beulich wrote:
>> On 03.05.2022 16:50, Jan Beulich wrote:
>>> On 03.05.2022 15:00, Roger Pau Monné wrote:
>>>> On Mon, Apr 25, 2022 at 10:34:23AM +0200, Jan Beulich wrote:
>>>>> While already the case for PVH, there's no reason to treat PV
>>>>> differently here, though of course the addresses get taken from another
>>>>> source in this case. Except that, to match CPU side mappings, by default
>>>>> we permit r/o ones. This then also means we now deal consistently with
>>>>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>>>>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>> ---
>>>>> [integrated] v1: Integrate into series.
>>>>> [standalone] v2: Keep IOMMU mappings in sync with CPU ones.
>>>>>
>>>>> --- a/xen/drivers/passthrough/x86/iommu.c
>>>>> +++ b/xen/drivers/passthrough/x86/iommu.c
>>>>> @@ -275,12 +275,12 @@ void iommu_identity_map_teardown(struct
>>>>>      }
>>>>>  }
>>>>>  
>>>>> -static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
>>>>> -                                         unsigned long pfn,
>>>>> -                                         unsigned long max_pfn)
>>>>> +static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
>>>>> +                                                 unsigned long pfn,
>>>>> +                                                 unsigned long max_pfn)
>>>>>  {
>>>>>      mfn_t mfn = _mfn(pfn);
>>>>> -    unsigned int i, type;
>>>>> +    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
>>>>>  
>>>>>      /*
>>>>>       * Set up 1:1 mapping for dom0. Default to include only conventional RAM
>>>>> @@ -289,44 +289,60 @@ static bool __hwdom_init hwdom_iommu_map
>>>>>       * that fall in unusable ranges for PV Dom0.
>>>>>       */
>>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
>>>>> -        return false;
>>>>> +        return 0;
>>>>>  
>>>>>      switch ( type = page_get_ram_type(mfn) )
>>>>>      {
>>>>>      case RAM_TYPE_UNUSABLE:
>>>>> -        return false;
>>>>> +        return 0;
>>>>>  
>>>>>      case RAM_TYPE_CONVENTIONAL:
>>>>>          if ( iommu_hwdom_strict )
>>>>> -            return false;
>>>>> +            return 0;
>>>>>          break;
>>>>>  
>>>>>      default:
>>>>>          if ( type & RAM_TYPE_RESERVED )
>>>>>          {
>>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
>>>>> -                return false;
>>>>> +                perms = 0;
>>>>>          }
>>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>> -            return false;
>>>>> +        else if ( is_hvm_domain(d) )
>>>>> +            return 0;
>>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
>>>>> +            perms = 0;
>>>>>      }
>>>>>  
>>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
>>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
>>>>> -        return false;
>>>>> +        return 0;
>>>>>      /* ... or the IO-APIC */
>>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
>>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>> -            return false;
>>>>> +    if ( has_vioapic(d) )
>>>>> +    {
>>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
>>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
>>>>> +                return 0;
>>>>> +    }
>>>>> +    else if ( is_pv_domain(d) )
>>>>> +    {
>>>>> +        /*
>>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
>>>>> +         * ones there, so it should also have such established for IOMMUs.
>>>>> +         */
>>>>> +        for ( i = 0; i < nr_ioapics; i++ )
>>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
>>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
>>>>> +                       ? IOMMUF_readable : 0;
>>>>
>>>> If we really are after consistency with CPU side mappings, we should
>>>> likely take the whole contents of mmio_ro_ranges and d->iomem_caps
>>>> into account, not just the pages belonging to the IO-APIC?
>>>>
>>>> There could also be HPET pages mapped as RO for PV.
>>>
>>> Hmm. This would be a yet bigger functional change, but indeed would further
>>> improve consistency. But shouldn't we then also establish r/w mappings for
>>> stuff in ->iomem_caps but not in mmio_ro_ranges? This would feel like going
>>> too far ...
>>
>> FTAOD I didn't mean to say that I think such mappings shouldn't be there;
>> I have been of the opinion that e.g. I/O directly to/from the linear
>> frame buffer of a graphics device should in principle be permitted. But
>> which specific mappings to put in place can imo not be derived from
>> ->iomem_caps, as we merely subtract certain ranges after initially having
>> set all bits in it. Besides ranges not mapping any MMIO, even something
>> like the PCI ECAM ranges (parts of which we may also force to r/o, and
>> which we would hence cover here if I followed your suggestion) are
>> questionable in this regard.
> 
> Right, ->iomem_caps is indeed too wide for our purpose.  What
> about using something like:
> 
> else if ( is_pv_domain(d) )
> {
>     if ( !iomem_access_permitted(d, pfn, pfn) )
>         return 0;

We can't return 0 here (as RAM pages also make it here when
!iommu_hwdom_strict), so I can at best take this as a vague outline
of what you really mean. And I don't want to rely on RAM pages being
(imo wrongly) represented by set bits in Dom0's iomem_caps.

>     if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
>         return IOMMUF_readable;
> }
> 
> That would get us a bit closer to allowed CPU side mappings, and we
> don't need to special case IO-APIC or HPET addresses as those are
> already added to ->iomem_caps or mmio_ro_ranges respectively by
> dom0_setup_permissions().

This won't fit in a region of code framed by a (split) comment
saying "Check that it doesn't overlap with ...". Hence if anything
I could put something like this further down. Yet even then the
question remains what to do with ranges which pass
iomem_access_permitted() but
- aren't really MMIO,
- are inside MMCFG,
- are otherwise special.

Or did you perhaps mean to suggest something like

else if ( is_pv_domain(d) && iomem_access_permitted(d, pfn, pfn) &&
          rangeset_contains_singleton(mmio_ro_ranges, pfn) )
    return IOMMUF_readable;

? Then there would only remain the question of whether mapping r/o
MMCFG pages is okay (I don't think it is), but that could then be
special-cased similar to what's done further down for vPCI (by not
returning in the "else if", but merely updating "perms").

Jan




* Re: [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches
  2022-05-04  9:46     ` Jan Beulich
@ 2022-05-04 11:20       ` Roger Pau Monné
  2022-05-04 12:27         ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 11:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Wed, May 04, 2022 at 11:46:37AM +0200, Jan Beulich wrote:
> On 03.05.2022 16:49, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
> >> For large page mappings to be easily usable (i.e. in particular without
> >> un-shattering of smaller page mappings) and for mapping operations to
> >> then also be more efficient, pass batches of Dom0 memory to iommu_map().
> >> In dom0_construct_pv() and its helpers (covering strict mode) this
> >> additionally requires establishing the type of those pages (albeit with
> >> zero type references).
> > 
> > I think it's possible I've already asked this.  Would it make sense to
> > add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
> > specific flag?
> 
> I don't think you did ask, but now that you do: This would look like a
> layering violation to me. I don't think allocation should ever have
> mapping (into the IOMMU or elsewhere) as a "side effect", no matter
> that ...

Hm, I'm certainly not that familiar with PV itself to likely be able
to make a proper argument here.  I fully agree with you for translated
guests using a p2m.

For PV we currently establish/teardown IOMMU mappings in
_get_page_type(), which already looks like a layering violation to me,
so also doing so in alloc_domheap_pages() wouldn't seem that bad if it
allows simplifying the resulting code overall.

> > It would seem to me that doing it that way would also allow the
> > mappings to get established in blocks for domUs.
> 
> ... then this would perhaps be possible.
> 
> >> The installing of zero-ref writable types has in fact shown (observed
> >> while putting together the change) that despite the intention by the
> >> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
> >> sufficiently ordinary pages (at the very least initrd and P2M ones as
> >> well as pages that are part of the initial allocation but not part of
> >> the initial mapping) still have been starting out as PGT_none, meaning
> >> that they would have gained IOMMU mappings only the first time these
> >> pages would get mapped writably. Consequently an open question is
> >> whether iommu_memory_setup() should set the pages to PGT_writable_page
> >> independent of need_iommu_pt_sync().
> > 
> > I think I'm confused, doesn't the setting of PGT_writable_page happen
> > as a result of need_iommu_pt_sync() and having those pages added to
> > the IOMMU page tables? (so they can be properly tracked and IOMMU
> > mappings are removed if the page is also removed)
> 
> In principle yes - in guest_physmap_add_page(). But this function isn't
> called for the pages I did enumerate in the remark. XSA-288 really only
> cared about getting this right for DomU-s.

Would it make sense to change guest_physmap_add_page() to be able to
pass the page_order parameter down to iommu_map(), and then use it for
dom0 build instead of introducing iommu_memory_setup()?

I think guest_physmap_add_page() will need to be adjusted at some
point for domUs, and hence it could be unified with dom0 usage
also?

> > If the pages are not added here (because dom0 is not running in strict
> > mode) then setting PGT_writable_page is not required?
> 
> Correct - in that case we skip fiddling with IOMMU mappings on
> transitions to/from PGT_writable_page, and hence putting this type in
> place would be benign (but improve consistency).
> 
> >> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
> >> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
> >> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
> >> occurs only afterwards anyway, so nothing can have made it into TLBs
> >> just yet.
> > 
> > Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
> > away, as we must strictly do the hwdom init before enabling the iommu
> > itself.
> 
> Why would that be? That's imo as much of an implementation detail as
> ...

Well, you want to have the reserved/inclusive options applied (and
mappings created) before enabling the IOMMU, because at that point
devices have already been assigned.  So it depends more on a
combination of devices assigned & IOMMU enabled rather than just IOMMU
being enabled.

> > I'm less convinced about the one in the dom0 build, just to be on the
> > safe side in case we ever change the order of IOMMU init and memory
> > setup.
> 
> ... this. Just like we populate tables with the IOMMU already enabled
> for DomU-s, I think the same would be valid to do for Dom0.
> 
> > I would expect flushing an empty TLB to not be very expensive?
> 
> I wouldn't "expect" this - it might be this way, but it surely depends
> on whether an implementation can easily tell whether the TLB is empty,
> and whether its emptiness actually makes a difference for the latency
> of a flush operation.
> 
> >> --- a/xen/drivers/passthrough/x86/iommu.c
> >> +++ b/xen/drivers/passthrough/x86/iommu.c
> >> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
> >>  
> >>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >>  {
> >> -    unsigned long i, top, max_pfn;
> >> -    unsigned int flush_flags = 0;
> >> +    unsigned long i, top, max_pfn, start, count;
> >> +    unsigned int flush_flags = 0, start_perms = 0;
> >>  
> >>      BUG_ON(!is_hardware_domain(d));
> >>  
> >> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
> >>       * setting up potentially conflicting mappings here.
> >>       */
> >> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> >> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> >>  
> >> -    for ( ; i < top; i++ )
> >> +    for ( i = start, count = 0; i < top; )
> >>      {
> >>          unsigned long pfn = pdx_to_pfn(i);
> >>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> >> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>          if ( !perms )
> >>              rc = 0;
> >>          else if ( paging_mode_translate(d) )
> >> +        {
> >>              rc = p2m_add_identity_entry(d, pfn,
> >>                                          perms & IOMMUF_writable ? p2m_access_rw
> >>                                                                  : p2m_access_r,
> >>                                          0);
> >> +            if ( rc )
> >> +                printk(XENLOG_WARNING
> >> +                       "%pd: identity mapping of %lx failed: %d\n",
> >> +                       d, pfn, rc);
> >> +        }
> >> +        else if ( pfn != start + count || perms != start_perms )
> >> +        {
> >> +        commit:
> >> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
> >> +                           &flush_flags);
> >> +            if ( rc )
> >> +                printk(XENLOG_WARNING
> >> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
> >> +                       d, pfn, pfn + count, rc);
> >> +            SWAP(start, pfn);
> >> +            start_perms = perms;
> >> +            count = 1;
> >> +        }
> >>          else
> >> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >> -                           perms, &flush_flags);
> >> +        {
> >> +            ++count;
> >> +            rc = 0;
> > 
> > Seeing as we want to process this in blocks now, I wonder whether it
> > would make sense to take a different approach, and use a rangeset to
> > track which regions need to be mapped.  What gets added would be based
> > on the host e820 plus the options
> > iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
> > based on the logic in hwdom_iommu_map() and finally we could iterate
> > over the regions afterwards using rangeset_consume_ranges().
> > 
> > Not that you strictly need to do it here, just think the end result
> > would be clearer.
> 
> The end result might indeed be, but it would be more of a change right
> here. Hence I'd prefer to leave that out of the series for now.

OK.  I think it might be nice to add a comment in that regard, mostly
because I tend to forget myself.

Thanks, Roger.



* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-04 10:51           ` Jan Beulich
@ 2022-05-04 12:01             ` Roger Pau Monné
  2022-05-04 12:12               ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 12:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Wed, May 04, 2022 at 12:51:25PM +0200, Jan Beulich wrote:
> On 04.05.2022 12:30, Roger Pau Monné wrote:
> > On Wed, May 04, 2022 at 11:32:51AM +0200, Jan Beulich wrote:
> >> On 03.05.2022 16:50, Jan Beulich wrote:
> >>> On 03.05.2022 15:00, Roger Pau Monné wrote:
> >>>> On Mon, Apr 25, 2022 at 10:34:23AM +0200, Jan Beulich wrote:
> >>>>> While already the case for PVH, there's no reason to treat PV
> >>>>> differently here, though of course the addresses get taken from another
> >>>>> source in this case. Except that, to match CPU side mappings, by default
> >>>>> we permit r/o ones. This then also means we now deal consistently with
> >>>>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
> >>>>>
> >>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>>> ---
> >>>>> [integrated] v1: Integrate into series.
> >>>>> [standalone] v2: Keep IOMMU mappings in sync with CPU ones.
> >>>>>
> >>>>> --- a/xen/drivers/passthrough/x86/iommu.c
> >>>>> +++ b/xen/drivers/passthrough/x86/iommu.c
> >>>>> @@ -275,12 +275,12 @@ void iommu_identity_map_teardown(struct
> >>>>>      }
> >>>>>  }
> >>>>>  
> >>>>> -static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
> >>>>> -                                         unsigned long pfn,
> >>>>> -                                         unsigned long max_pfn)
> >>>>> +static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
> >>>>> +                                                 unsigned long pfn,
> >>>>> +                                                 unsigned long max_pfn)
> >>>>>  {
> >>>>>      mfn_t mfn = _mfn(pfn);
> >>>>> -    unsigned int i, type;
> >>>>> +    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
> >>>>>  
> >>>>>      /*
> >>>>>       * Set up 1:1 mapping for dom0. Default to include only conventional RAM
> >>>>> @@ -289,44 +289,60 @@ static bool __hwdom_init hwdom_iommu_map
> >>>>>       * that fall in unusable ranges for PV Dom0.
> >>>>>       */
> >>>>>      if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
> >>>>> -        return false;
> >>>>> +        return 0;
> >>>>>  
> >>>>>      switch ( type = page_get_ram_type(mfn) )
> >>>>>      {
> >>>>>      case RAM_TYPE_UNUSABLE:
> >>>>> -        return false;
> >>>>> +        return 0;
> >>>>>  
> >>>>>      case RAM_TYPE_CONVENTIONAL:
> >>>>>          if ( iommu_hwdom_strict )
> >>>>> -            return false;
> >>>>> +            return 0;
> >>>>>          break;
> >>>>>  
> >>>>>      default:
> >>>>>          if ( type & RAM_TYPE_RESERVED )
> >>>>>          {
> >>>>>              if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
> >>>>> -                return false;
> >>>>> +                perms = 0;
> >>>>>          }
> >>>>> -        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
> >>>>> -            return false;
> >>>>> +        else if ( is_hvm_domain(d) )
> >>>>> +            return 0;
> >>>>> +        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
> >>>>> +            perms = 0;
> >>>>>      }
> >>>>>  
> >>>>>      /* Check that it doesn't overlap with the Interrupt Address Range. */
> >>>>>      if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
> >>>>> -        return false;
> >>>>> +        return 0;
> >>>>>      /* ... or the IO-APIC */
> >>>>> -    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
> >>>>> -        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >>>>> -            return false;
> >>>>> +    if ( has_vioapic(d) )
> >>>>> +    {
> >>>>> +        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
> >>>>> +            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
> >>>>> +                return 0;
> >>>>> +    }
> >>>>> +    else if ( is_pv_domain(d) )
> >>>>> +    {
> >>>>> +        /*
> >>>>> +         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
> >>>>> +         * ones there, so it should also have such established for IOMMUs.
> >>>>> +         */
> >>>>> +        for ( i = 0; i < nr_ioapics; i++ )
> >>>>> +            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
> >>>>> +                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
> >>>>> +                       ? IOMMUF_readable : 0;
> >>>>
> >>>> If we really are after consistency with CPU side mappings, we should
> >>>> likely take the whole contents of mmio_ro_ranges and d->iomem_caps
> >>>> into account, not just the pages belonging to the IO-APIC?
> >>>>
> >>>> There could also be HPET pages mapped as RO for PV.
> >>>
> >>> Hmm. This would be a yet bigger functional change, but indeed would further
> >>> improve consistency. But shouldn't we then also establish r/w mappings for
> >>> stuff in ->iomem_caps but not in mmio_ro_ranges? This would feel like going
> >>> too far ...
> >>
> >> FTAOD I didn't mean to say that I think such mappings shouldn't be there;
> >> I have been of the opinion that e.g. I/O directly to/from the linear
> >> frame buffer of a graphics device should in principle be permitted. But
> >> which specific mappings to put in place can imo not be derived from
> >> ->iomem_caps, as we merely subtract certain ranges after initially having
> >> set all bits in it. Besides ranges not mapping any MMIO, even something
> >> like the PCI ECAM ranges (parts of which we may also force to r/o, and
> >> which we would hence cover here if I followed your suggestion) are
> >> questionable in this regard.
> > 
> > Right, ->iomem_caps is indeed too wide for our purpose.  What
> > about using something like:
> > 
> > else if ( is_pv_domain(d) )
> > {
> >     if ( !iomem_access_permitted(d, pfn, pfn) )
> >         return 0;
> 
> We can't return 0 here (as RAM pages also make it here when
> !iommu_hwdom_strict), so I can at best take this as a vague outline
> of what you really mean. And I don't want to rely on RAM pages being
> (imo wrongly) represented by set bits in Dom0's iomem_caps.

Well, yes, my suggestion was taking into account that ->iomem_caps for
dom0 has mostly holes for things that shouldn't be mapped, but
otherwise contains everything else as allowed (including RAM).

We could instead do:

else if ( is_pv_domain(d) && type != RAM_TYPE_CONVENTIONAL )
{
    ...

So that we don't rely on RAM being 'allowed' in ->iomem_caps?
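
Spelled out, that alternative might look roughly like the following inside
hwdom_iommu_map() (just a sketch stitched together from the snippets in this
thread; the surrounding switch and the pfn/type/perms handling are assumed
from the patch fragment quoted above):

else if ( is_pv_domain(d) && type != RAM_TYPE_CONVENTIONAL )
{
    if ( !iomem_access_permitted(d, pfn, pfn) )
        return 0;
    if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
        return IOMMUF_readable;
}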

> >     if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
> >         return IOMMUF_readable;
> > }
> > 
> > That would get us a bit closer to allowed CPU side mappings, and we
> > don't need to special case IO-APIC or HPET addresses as those are
> > already added to ->iomem_caps or mmio_ro_ranges respectively by
> > dom0_setup_permissions().
> 
> This won't fit in a region of code framed by a (split) comment
> saying "Check that it doesn't overlap with ...". Hence if anything
> I could put something like this further down. Yet even then the
> question remains what to do with ranges which pass
> iomem_access_permitted() but
> - aren't really MMIO,
> - are inside MMCFG,
> - are otherwise special.
> 
> Or did you perhaps mean to suggest something like
> 
> else if ( is_pv_domain(d) && iomem_access_permitted(d, pfn, pfn) &&
>           rangeset_contains_singleton(mmio_ro_ranges, pfn) )
>     return IOMMUF_readable;

I don't think this would be fully correct, as we would still allow
mappings of IO-APIC pages explicitly banned in ->iomem_caps by not
handling those?

> ? Then there would only remain the question of whether mapping r/o
> MMCFG pages is okay (I don't think it is), but that could then be
> special-cased similar to what's done further down for vPCI (by not
> returning in the "else if", but merely updating "perms").

Well part of the point of this is to make CPU and Device mappings
more similar.  I don't think devices have any business in poking at
the MMCFG range, so it's fine to explicitly ban that range.  But I
would have also said the same for IO-APIC pages, so I'm unsure why are
IO-APIC pages fine to be mapped RO, but not the MMCFG range.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-04 12:01             ` Roger Pau Monné
@ 2022-05-04 12:12               ` Jan Beulich
  2022-05-04 13:00                 ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-04 12:12 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 04.05.2022 14:01, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 12:51:25PM +0200, Jan Beulich wrote:
>> On 04.05.2022 12:30, Roger Pau Monné wrote:
>>> Right, ->iomem_caps is indeed too wide for our purpose.  What
>>> about using something like:
>>>
>>> else if ( is_pv_domain(d) )
>>> {
>>>     if ( !iomem_access_permitted(d, pfn, pfn) )
>>>         return 0;
>>
>> We can't return 0 here (as RAM pages also make it here when
>> !iommu_hwdom_strict), so I can at best take this as a vague outline
>> of what you really mean. And I don't want to rely on RAM pages being
>> (imo wrongly) represented by set bits in Dom0's iomem_caps.
> 
> Well, yes, my suggestion was taking into account that ->iomem_caps for
> dom0 has mostly holes for things that shouldn't be mapped, but
> otherwise contains everything else as allowed (including RAM).
> 
> We could instead do:
> 
> else if ( is_pv_domain(d) && type != RAM_TYPE_CONVENTIONAL )
> {
>     ...
> 
> So that we don't rely on RAM being 'allowed' in ->iomem_caps?

This would feel to me like excess special casing.

>>>     if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
>>>         return IOMMUF_readable;
>>> }
>>>
>>> That would get us a bit closer to allowed CPU side mappings, and we
>>> don't need to special case IO-APIC or HPET addresses as those are
>>> already added to ->iomem_caps or mmio_ro_ranges respectively by
>>> dom0_setup_permissions().
>>
>> This won't fit in a region of code framed by a (split) comment
>> saying "Check that it doesn't overlap with ...". Hence if anything
>> I could put something like this further down. Yet even then the
>> question remains what to do with ranges which pass
>> iomem_access_permitted() but
>> - aren't really MMIO,
>> - are inside MMCFG,
>> - are otherwise special.
>>
>> Or did you perhaps mean to suggest something like
>>
>> else if ( is_pv_domain(d) && iomem_access_permitted(d, pfn, pfn) &&
>>           rangeset_contains_singleton(mmio_ro_ranges, pfn) )
>>     return IOMMUF_readable;
> 
> I don't think this would be fully correct, as we would still allow
> mappings of IO-APIC pages explicitly banned in ->iomem_caps by not
> handling those?

CPU side mappings don't deal with the IO-APICs specifically. They only
care about iomem_caps and mmio_ro_ranges. Hence explicitly banned
IO-APIC pages cannot be mapped there either. (Of course we only do
such banning if IO-APIC pages weren't possible to represent in
mmio_ro_ranges, which should effectively be never.)

>> ? Then there would only remain the question of whether mapping r/o
>> MMCFG pages is okay (I don't think it is), but that could then be
>> special-cased similar to what's done further down for vPCI (by not
>> returning in the "else if", but merely updating "perms").
> 
> Well part of the point of this is to make CPU and Device mappings
> more similar.  I don't think devices have any business in poking at
> the MMCFG range, so it's fine to explicitly ban that range.  But I
> would have also said the same for IO-APIC pages, so I'm unsure why are
> IO-APIC pages fine to be mapped RO, but not the MMCFG range.

I wouldn't have wanted to allow r/o mappings of the IO-APICs, but
Linux plus the ACPI tables of certain vendors require us to permit
this. If we didn't, Dom0 would crash there during boot.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches
  2022-05-04 11:20       ` Roger Pau Monné
@ 2022-05-04 12:27         ` Jan Beulich
  2022-05-04 13:55           ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-04 12:27 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 04.05.2022 13:20, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 11:46:37AM +0200, Jan Beulich wrote:
>> On 03.05.2022 16:49, Roger Pau Monné wrote:
>>> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
>>>> For large page mappings to be easily usable (i.e. in particular without
>>>> un-shattering of smaller page mappings) and for mapping operations to
>>>> then also be more efficient, pass batches of Dom0 memory to iommu_map().
>>>> In dom0_construct_pv() and its helpers (covering strict mode) this
>>>> additionally requires establishing the type of those pages (albeit with
>>>> zero type references).
>>>
>>> I think it's possible I've already asked this.  Would it make sense to
>>> add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
>>> specific flag?
>>
>> I don't think you did ask, but now that you do: This would look like a
>> layering violation to me. I don't think allocation should ever have
>> mapping (into the IOMMU or elsewhere) as a "side effect", no matter
>> that ...
> 
> Hm, I'm certainly not that familiar with PV itself to likely be able
> to make a proper argument here.  I fully agree with you for translated
> guests using a p2m.
> 
> For PV we currently establish/teardown IOMMU mappings in
> _get_page_type(), which already looks like a layering violation to me,
> hence also doing so in alloc_domheap_pages() wouldn't seem that bad if
> it allows to simplify the resulting code overall.

That's a layering violation too, I agree, but at least it's one central
place.

>>> It would seem to me that doing it that way would also allow the
>>> mappings to get established in blocks for domUs.
>>
>> ... then this would perhaps be possible.
>>
>>>> The installing of zero-ref writable types has in fact shown (observed
>>>> while putting together the change) that despite the intention by the
>>>> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
>>>> sufficiently ordinary pages (at the very least initrd and P2M ones as
>>>> well as pages that are part of the initial allocation but not part of
>>>> the initial mapping) still have been starting out as PGT_none, meaning
>>>> that they would have gained IOMMU mappings only the first time these
>>>> pages would get mapped writably. Consequently an open question is
>>>> whether iommu_memory_setup() should set the pages to PGT_writable_page
>>>> independent of need_iommu_pt_sync().
>>>
>>> I think I'm confused, doesn't the setting of PGT_writable_page happen
>>> as a result of need_iommu_pt_sync() and having those pages added to
>>> the IOMMU page tables? (so they can be properly tracked and IOMMU
> >>> mappings are removed if the page is also removed)
>>
>> In principle yes - in guest_physmap_add_page(). But this function isn't
>> called for the pages I did enumerate in the remark. XSA-288 really only
>> cared about getting this right for DomU-s.
> 
> Would it make sense to change guest_physmap_add_page() to be able to
> pass the page_order parameter down to iommu_map(), and then use it for
> dom0 build instead of introducing iommu_memory_setup()?

To be quite frank: This is something that I might have been willing to
do months ago, when this series was still fresh. If I was to start
re-doing all of this code now, it would take far more time than it
would have taken back then. Hence I'd like to avoid a full re-work here
unless entirely unacceptable in the way currently done (which largely
fits with how we have been doing Dom0 setup).

Furthermore, guest_physmap_add_page() doesn't itself call iommu_map().
What you're suggesting would require get_page_and_type() to be able to
work on higher-order pages. I view adjustments like this as well out
of scope for this series.

> I think guest_physmap_add_page() will need to be adjusted at some
> point for domUs, and hence it could be unified with dom0 usage
> also?

As an optimization - perhaps. I view it as more important to have HVM
guests work reasonably well (which includes the performance aspect of
setting them up).

>>> If the pages are not added here (because dom0 is not running in strict
>>> mode) then setting PGT_writable_page is not required?
>>
>> Correct - in that case we skip fiddling with IOMMU mappings on
>> transitions to/from PGT_writable_page, and hence putting this type in
>> place would be benign (but improve consistency).
>>
>>>> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
>>>> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
>>>> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
>>>> occurs only afterwards anyway, so nothing can have made it into TLBs
>>>> just yet.
>>>
>>> Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
>>> away, as we must strictly do the hwdom init before enabling the iommu
>>> itself.
>>
>> Why would that be? That's imo as much of an implementation detail as
>> ...
> 
> Well, you want to have the reserved/inclusive options applied (and
> mappings created) before enabling the IOMMU, because at that point
> devices have already been assigned.  So it depends more on a
> combination of devices assigned & IOMMU enabled rather than just IOMMU
> being enabled.
> 
>>> The one in dom0 build I'm less convinced, just to be on the safe side
>>> if we ever change the order of IOMMU init and memory setup.
>>
>> ... this. Just like we populate tables with the IOMMU already enabled
>> for DomU-s, I think the same would be valid to do for Dom0.
>>
>>> I would expect flushing an empty TLB to not be very expensive?
>>
>> I wouldn't "expect" this - it might be this way, but it surely depends
>> on whether an implementation can easily tell whether the TLB is empty,
>> and whether its emptiness actually makes a difference for the latency
>> of a flush operation.
>>
>>>> --- a/xen/drivers/passthrough/x86/iommu.c
>>>> +++ b/xen/drivers/passthrough/x86/iommu.c
>>>> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>>>>  
>>>>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>>>>  {
>>>> -    unsigned long i, top, max_pfn;
>>>> -    unsigned int flush_flags = 0;
>>>> +    unsigned long i, top, max_pfn, start, count;
>>>> +    unsigned int flush_flags = 0, start_perms = 0;
>>>>  
>>>>      BUG_ON(!is_hardware_domain(d));
>>>>  
>>>> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>>>>       * setting up potentially conflicting mappings here.
>>>>       */
>>>> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>>> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>>>  
>>>> -    for ( ; i < top; i++ )
>>>> +    for ( i = start, count = 0; i < top; )
>>>>      {
>>>>          unsigned long pfn = pdx_to_pfn(i);
>>>>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>          if ( !perms )
>>>>              rc = 0;
>>>>          else if ( paging_mode_translate(d) )
>>>> +        {
>>>>              rc = p2m_add_identity_entry(d, pfn,
>>>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>>>                                                                  : p2m_access_r,
>>>>                                          0);
>>>> +            if ( rc )
>>>> +                printk(XENLOG_WARNING
>>>> +                       "%pd: identity mapping of %lx failed: %d\n",
>>>> +                       d, pfn, rc);
>>>> +        }
>>>> +        else if ( pfn != start + count || perms != start_perms )
>>>> +        {
>>>> +        commit:
>>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
>>>> +                           &flush_flags);
>>>> +            if ( rc )
>>>> +                printk(XENLOG_WARNING
>>>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
>>>> +                       d, pfn, pfn + count, rc);
>>>> +            SWAP(start, pfn);
>>>> +            start_perms = perms;
>>>> +            count = 1;
>>>> +        }
>>>>          else
>>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>> -                           perms, &flush_flags);
>>>> +        {
>>>> +            ++count;
>>>> +            rc = 0;
>>>
>>> Seeing as we want to process this in blocks now, I wonder whether it
>>> would make sense to take a different approach, and use a rangeset to
>>> track which regions need to be mapped.  What gets added would be based
>>> on the host e820 plus the options
>>> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
>>> based on the logic in hwdom_iommu_map() and finally we could iterate
>>> over the regions afterwards using rangeset_consume_ranges().
>>>
>>> Not that you strictly need to do it here, just think the end result
>>> would be clearer.
>>
>> The end result might indeed be, but it would be more of a change right
>> here. Hence I'd prefer to leave that out of the series for now.
> 
> OK.  I think it might be nice to add a comment in that regard, mostly
> because I tend to forget myself.

Sure, I've added another post-commit-message remark.
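
For future reference, the rangeset-based variant might look roughly like the
sketch below.  This is only an outline: the callback signature is modelled on
how vPCI uses rangeset_consume_ranges(), the helper name is invented, and
details such as read-only ranges and IOTLB flushing are left out.

/* Hypothetical callback; signature assumed, not taken from this series. */
static int __hwdom_init cf_check identity_map(unsigned long s, unsigned long e,
                                              void *data, unsigned long *c)
{
    struct domain *d = data;
    unsigned int flush_flags = 0;
    int rc = iommu_map(d, _dfn(s), _mfn(s), e - s + 1,
                       IOMMUF_readable | IOMMUF_writable, &flush_flags);

    *c += e - s + 1;
    return rc;
}

    /* In arch_iommu_hwdom_init(): */
    struct rangeset *map = rangeset_new(d, "hwdom IOMMU map", 0);

    /*
     * 1) Add ranges from the host E820 according to
     *    iommu_hwdom_{strict,inclusive,reserved}.
     * 2) Punch holes for whatever hwdom_iommu_map() would reject,
     *    using rangeset_remove_range().
     * 3) Finally map everything that is left:
     */
    rc = rangeset_consume_ranges(map, identity_map, d);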

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-04 12:12               ` Jan Beulich
@ 2022-05-04 13:00                 ` Roger Pau Monné
  2022-05-04 13:19                   ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 13:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Wed, May 04, 2022 at 02:12:58PM +0200, Jan Beulich wrote:
> On 04.05.2022 14:01, Roger Pau Monné wrote:
> > On Wed, May 04, 2022 at 12:51:25PM +0200, Jan Beulich wrote:
> >> On 04.05.2022 12:30, Roger Pau Monné wrote:
> >>> Right, ->iomem_caps is indeed too wide for our purpose.  What
> >>> about using something like:
> >>>
> >>> else if ( is_pv_domain(d) )
> >>> {
> >>>     if ( !iomem_access_permitted(d, pfn, pfn) )
> >>>         return 0;
> >>
> >> We can't return 0 here (as RAM pages also make it here when
> >> !iommu_hwdom_strict), so I can at best take this as a vague outline
> >> of what you really mean. And I don't want to rely on RAM pages being
> >> (imo wrongly) represented by set bits in Dom0's iomem_caps.
> > 
> > Well, yes, my suggestion was taking into account that ->iomem_caps for
> > dom0 has mostly holes for things that shouldn't be mapped, but
> > otherwise contains everything else as allowed (including RAM).
> > 
> > We could instead do:
> > 
> > else if ( is_pv_domain(d) && type != RAM_TYPE_CONVENTIONAL )
> > {
> >     ...
> > 
> > So that we don't rely on RAM being 'allowed' in ->iomem_caps?
> 
> This would feel to me like excess special casing.

What about placing this in the 'default:' label on the type switch a
bit above?

> >>>     if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
> >>>         return IOMMUF_readable;
> >>> }
> >>>
> >>> That would get us a bit closer to allowed CPU side mappings, and we
> >>> don't need to special case IO-APIC or HPET addresses as those are
> >>> already added to ->iomem_caps or mmio_ro_ranges respectively by
> >>> dom0_setup_permissions().
> >>
> >> This won't fit in a region of code framed by a (split) comment
> >> saying "Check that it doesn't overlap with ...". Hence if anything
> >> I could put something like this further down. Yet even then the
> >> question remains what to do with ranges which pass
> >> iomem_access_permitted() but
> >> - aren't really MMIO,
> >> - are inside MMCFG,
> >> - are otherwise special.
> >>
> >> Or did you perhaps mean to suggest something like
> >>
> >> else if ( is_pv_domain(d) && iomem_access_permitted(d, pfn, pfn) &&
> >>           rangeset_contains_singleton(mmio_ro_ranges, pfn) )
> >>     return IOMMUF_readable;
> > 
> > I don't think this would be fully correct, as we would still allow
> > mappings of IO-APIC pages explicitly banned in ->iomem_caps by not
> > handling those?
> 
> CPU side mappings don't deal with the IO-APICs specifically. They only
> care about iomem_caps and mmio_ro_ranges. Hence explicitly banned
> IO-APIC pages cannot be mapped there either. (Of course we only do
> such banning if IO-APIC pages weren't possible to represent in
> mmio_ro_ranges, which should effectively be never.)

I think I haven't expressed myself correctly.

This construct won't return 0 for pfns not in iomem_caps, and hence
could allow mapping of addresses not in iomem_caps?

> >> ? Then there would only remain the question of whether mapping r/o
> >> MMCFG pages is okay (I don't think it is), but that could then be
> >> special-cased similar to what's done further down for vPCI (by not
> >> returning in the "else if", but merely updating "perms").
> > 
> > Well part of the point of this is to make CPU and Device mappings
> > more similar.  I don't think devices have any business in poking at
> > the MMCFG range, so it's fine to explicitly ban that range.  But I
> > would have also said the same for IO-APIC pages, so I'm unsure why are
> > IO-APIC pages fine to be mapped RO, but not the MMCFG range.
> 
> I wouldn't have wanted to allow r/o mappings of the IO-APICs, but
> Linux plus the ACPI tables of certain vendors require us to permit
> this. If we didn't, Dom0 would crash there during boot.

Right, but those are required for the CPU only.  I think it's a fine
goal to try to have similar mappings for CPU and Devices, and then
that would also cover MMCFG in the PV case.  Or else it's fine to assume
CPU vs Device mappings will be slightly different, and then don't add
any mappings for IO-APIC, HPET or MMCFG to the Device page tables
(likely there's more that could be added here).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables
  2022-05-03 16:20   ` Roger Pau Monné
@ 2022-05-04 13:07     ` Jan Beulich
  2022-05-04 15:06       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-04 13:07 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 03.05.2022 18:20, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:35:45AM +0200, Jan Beulich wrote:
>> For vendor specific code to support superpages we need to be able to
>> deal with a superpage mapping replacing an intermediate page table (or
>> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
>> needed to free individual page tables while a domain is still alive.
>> Since the freeing needs to be deferred until after a suitable IOTLB
>> flush was performed, released page tables get queued for processing by a
>> tasklet.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I was considering whether to use a softirq-tasklet instead. This would
>> have the benefit of avoiding extra scheduling operations, but come with
>> the risk of the freeing happening prematurely because of a
>> process_pending_softirqs() somewhere.
> 
> I'm sorry again if I already raised this, I don't seem to find a
> reference.

Earlier on you only suggested "to perform the freeing after the flush".

> What about doing the freeing before resuming the guest execution in
> guest vCPU context?
> 
> We already have a hook like this on HVM in hvm_do_resume() calling
> vpci_process_pending().  I wonder whether we could have a similar hook
> for PV and keep the pages to be freed in the vCPU instead of the pCPU.
> This would have the benefit of being able to context switch the vCPU
> in case the operation takes too long.

I think this might work in general, but would be troublesome when
preparing Dom0 (where we don't run on any of Dom0's vCPU-s, and we
won't ever "exit to guest context" on an idle vCPU). I'm also not
really fancying to use something like

    v = current->domain == d ? current : d->vcpu[0];

(leaving aside that we don't really have d available in
iommu_queue_free_pgtable() and I'd be hesitant to convert it back).
Otoh it might be okay to free page tables right away for domains
which haven't run at all so far. But this would again require
passing struct domain * to iommu_queue_free_pgtable().

Another upside (I think) of the current approach is that all logic
is contained in a single source file (i.e. in particular there's no
new field needed in a per-vCPU structure defined in some header).

> Not that the current approach is wrong, but doing it in the guest
> resume path we could likely prevent guests doing heavy p2m
> modifications from hogging CPU time.

Well, they would still be hogging time, but that time would then be
accounted towards their time slices, yes.

>> @@ -550,6 +551,91 @@ struct page_info *iommu_alloc_pgtable(st
>>      return pg;
>>  }
>>  
>> +/*
>> + * Intermediate page tables which get replaced by large pages may only be
>> + * freed after a suitable IOTLB flush. Hence such pages get queued on a
>> + * per-CPU list, with a per-CPU tasklet processing the list on the assumption
>> + * that the necessary IOTLB flush will have occurred by the time tasklets get
>> + * to run. (List and tasklet being per-CPU has the benefit of accesses not
>> + * requiring any locking.)
>> + */
>> +static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
>> +static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
>> +
>> +static void free_queued_pgtables(void *arg)
>> +{
>> +    struct page_list_head *list = arg;
>> +    struct page_info *pg;
>> +    unsigned int done = 0;
>> +
> 
> With the current logic I think it might be helpful to assert that the
> list is not empty when we get here?
> 
> Given the operation requires a context switch we would like to avoid
> such unless there's indeed pending work to do.

But is that worth adding an assertion and risking killing a system just
because there's a race somewhere by which we might get here without any
work to do? If you strongly think we want to know about such instances,
how about a WARN_ON_ONCE() (except that we still don't have that
specific construct, it would need to be open-coded for the time being)?
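
For the record, an open-coded warn-once at the top of free_queued_pgtables()
could be as simple as the sketch below (placement and wording are merely an
illustration of the idea, not a concrete proposal):

static void free_queued_pgtables(void *arg)
{
    struct page_list_head *list = arg;

    if ( page_list_empty(list) )
    {
        static bool __read_mostly warned;

        /* Note (once) that we were scheduled without any work to do. */
        if ( !warned )
        {
            warned = true;
            printk(XENLOG_WARNING "pgtable freeing tasklet ran without work\n");
        }
        return;
    }

    ...
}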

>> +static int cf_check cpu_callback(
>> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
>> +{
>> +    unsigned int cpu = (unsigned long)hcpu;
>> +    struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
>> +    struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);
>> +
>> +    switch ( action )
>> +    {
>> +    case CPU_DOWN_PREPARE:
>> +        tasklet_kill(tasklet);
>> +        break;
>> +
>> +    case CPU_DEAD:
>> +        page_list_splice(list, &this_cpu(free_pgt_list));
> 
> I think you could check whether list is empty before queuing it?

I could, but this would make the code (slightly) more complicated
for improving something which doesn't occur frequently.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-04 13:00                 ` Roger Pau Monné
@ 2022-05-04 13:19                   ` Jan Beulich
  2022-05-04 13:46                     ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-04 13:19 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 04.05.2022 15:00, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 02:12:58PM +0200, Jan Beulich wrote:
>> On 04.05.2022 14:01, Roger Pau Monné wrote:
>>> On Wed, May 04, 2022 at 12:51:25PM +0200, Jan Beulich wrote:
>>>> On 04.05.2022 12:30, Roger Pau Monné wrote:
>>>>> Right, ->iomem_caps is indeed too wide for our purpose.  What
>>>>> about using something like:
>>>>>
>>>>> else if ( is_pv_domain(d) )
>>>>> {
>>>>>     if ( !iomem_access_permitted(d, pfn, pfn) )
>>>>>         return 0;
>>>>
>>>> We can't return 0 here (as RAM pages also make it here when
>>>> !iommu_hwdom_strict), so I can at best take this as a vague outline
>>>> of what you really mean. And I don't want to rely on RAM pages being
>>>> (imo wrongly) represented by set bits in Dom0's iomem_caps.
>>>
>>> Well, yes, my suggestion was taking into account that ->iomem_caps for
>>> dom0 has mostly holes for things that shouldn't be mapped, but
>>> otherwise contains everything else as allowed (including RAM).
>>>
>>> We could instead do:
>>>
>>> else if ( is_pv_domain(d) && type != RAM_TYPE_CONVENTIONAL )
>>> {
>>>     ...
>>>
>>> So that we don't rely on RAM being 'allowed' in ->iomem_caps?
>>
>> This would feel to me like excess special casing.
> 
> What about placing this in the 'default:' label on the type switch a
> bit above?

I'd really like to stick to the present layout of where the special
casing is done, with PV and PVH logic at least next to each other. I
continue to think the construct I suggested (still visible below)
would do.

>>>>>     if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
>>>>>         return IOMMUF_readable;
>>>>> }
>>>>>
>>>>> That would get us a bit closer to allowed CPU side mappings, and we
>>>>> don't need to special case IO-APIC or HPET addresses as those are
>>>>> already added to ->iomem_caps or mmio_ro_ranges respectively by
>>>>> dom0_setup_permissions().
>>>>
>>>> This won't fit in a region of code framed by a (split) comment
>>>> saying "Check that it doesn't overlap with ...". Hence if anything
>>>> I could put something like this further down. Yet even then the
>>>> question remains what to do with ranges which pass
>>>> iomem_access_permitted() but
>>>> - aren't really MMIO,
>>>> - are inside MMCFG,
>>>> - are otherwise special.
>>>>
>>>> Or did you perhaps mean to suggest something like
>>>>
>>>> else if ( is_pv_domain(d) && iomem_access_permitted(d, pfn, pfn) &&
>>>>           rangeset_contains_singleton(mmio_ro_ranges, pfn) )
>>>>     return IOMMUF_readable;
>>>
>>> I don't think this would be fully correct, as we would still allow
>>> mappings of IO-APIC pages explicitly banned in ->iomem_caps by not
>>> handling those?
>>
>> CPU side mappings don't deal with the IO-APICs specifically. They only
>> care about iomem_caps and mmio_ro_ranges. Hence explicitly banned
>> IO-APIC pages cannot be mapped there either. (Of course we only do
>> such banning if IO-APIC pages weren't possible to represent in
>> mmio_ro_ranges, which should effectively be never.)
> 
> I think I haven't expressed myself correctly.
> 
> This construct won't return 0 for pfns not in iomem_caps, and hence
> could allow mapping of addresses not in iomem_caps?

I'm afraid I don't understand: There's an iomem_access_permitted()
in the conditional. How would this allow mapping pages outside of
iomem_caps? The default case higher up has already forced perms to
zero for any non-RAM page (unless iommu_hwdom_inclusive).

>>>> ? Then there would only remain the question of whether mapping r/o
>>>> MMCFG pages is okay (I don't think it is), but that could then be
>>>> special-cased similar to what's done further down for vPCI (by not
>>>> returning in the "else if", but merely updating "perms").
>>>
>>> Well part of the point of this is to make CPU and Device mappings
>>> more similar.  I don't think devices have any business in poking at
>>> the MMCFG range, so it's fine to explicitly ban that range.  But I
>>> would have also said the same for IO-APIC pages, so I'm unsure why are
>>> IO-APIC pages fine to be mapped RO, but not the MMCFG range.
>>
>> I wouldn't have wanted to allow r/o mappings of the IO-APICs, but
>> Linux plus the ACPI tables of certain vendors require us to permit
>> this. If we didn't, Dom0 would crash there during boot.
> 
> Right, but those are required for the CPU only.  I think it's a fine
> goal to try to have similar mappings for CPU and Devices, and then
> that would also cover MMCFG in the PV case.  Or else it's fine to assume
> CPU vs Device mappings will be slightly different, and then don't add
> any mappings for IO-APIC, HPET or MMCFG to the Device page tables
> (likely there's more that could be added here).

It being different is what Andrew looks to strongly dislike. And I agree
with this up to a certain point, i.e. I'm having a hard time seeing why
we should put in MMCFG mappings just for this reason. But if consensus
was that consistency across all types of MMIO is the goal, then I could
live with also making MMCFG mappings ...

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-04 13:19                   ` Jan Beulich
@ 2022-05-04 13:46                     ` Roger Pau Monné
  2022-05-04 13:55                       ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 13:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Wed, May 04, 2022 at 03:19:16PM +0200, Jan Beulich wrote:
> On 04.05.2022 15:00, Roger Pau Monné wrote:
> > On Wed, May 04, 2022 at 02:12:58PM +0200, Jan Beulich wrote:
> >> On 04.05.2022 14:01, Roger Pau Monné wrote:
> >>> On Wed, May 04, 2022 at 12:51:25PM +0200, Jan Beulich wrote:
> >>>> On 04.05.2022 12:30, Roger Pau Monné wrote:
> >>>>> Right, ->iomem_caps is indeed too wide for our purpose.  What
> >>>>> about using something like:
> >>>>>
> >>>>> else if ( is_pv_domain(d) )
> >>>>> {
> >>>>>     if ( !iomem_access_permitted(d, pfn, pfn) )
> >>>>>         return 0;
> >>>>
> >>>> We can't return 0 here (as RAM pages also make it here when
> >>>> !iommu_hwdom_strict), so I can at best take this as a vague outline
> >>>> of what you really mean. And I don't want to rely on RAM pages being
> >>>> (imo wrongly) represented by set bits in Dom0's iomem_caps.
> >>>
> >>> Well, yes, my suggestion was taking into account that ->iomem_caps for
> >>> dom0 has mostly holes for things that shouldn't be mapped, but
> >>> otherwise contains everything else as allowed (including RAM).
> >>>
> >>> We could instead do:
> >>>
> >>> else if ( is_pv_domain(d) && type != RAM_TYPE_CONVENTIONAL )
> >>> {
> >>>     ...
> >>>
> >>> So that we don't rely on RAM being 'allowed' in ->iomem_caps?
> >>
> >> This would feel to me like excess special casing.
> > 
> > What about placing this in the 'default:' label on the type switch a
> > bit above?
> 
> I'd really like to stick to the present layout of where the special
> casing is done, with PV and PVH logic at least next to each other. I
> continue to think the construct I suggested (still visible below)
> would do.
> 
> >>>>>     if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
> >>>>>         return IOMMUF_readable;
> >>>>> }
> >>>>>
> >>>>> That would get us a bit closer to allowed CPU side mappings, and we
> >>>>> don't need to special case IO-APIC or HPET addresses as those are
> >>>>> already added to ->iomem_caps or mmio_ro_ranges respectively by
> >>>>> dom0_setup_permissions().
> >>>>
> >>>> This won't fit in a region of code framed by a (split) comment
> >>>> saying "Check that it doesn't overlap with ...". Hence if anything
> >>>> I could put something like this further down. Yet even then the
> >>>> question remains what to do with ranges which pass
> >>>> iomem_access_permitted() but
> >>>> - aren't really MMIO,
> >>>> - are inside MMCFG,
> >>>> - are otherwise special.
> >>>>
> >>>> Or did you perhaps mean to suggest something like
> >>>>
> >>>> else if ( is_pv_domain(d) && iomem_access_permitted(d, pfn, pfn) &&
> >>>>           rangeset_contains_singleton(mmio_ro_ranges, pfn) )
> >>>>     return IOMMUF_readable;
> >>>
> >>> I don't think this would be fully correct, as we would still allow
> >>> mappings of IO-APIC pages explicitly banned in ->iomem_caps by not
> >>> handling those?
> >>
> >> CPU side mappings don't deal with the IO-APICs specifically. They only
> >> care about iomem_caps and mmio_ro_ranges. Hence explicitly banned
> >> IO-APIC pages cannot be mapped there either. (Of course we only do
> >> such banning if IO-APIC pages weren't possible to represent in
> >> mmio_ro_ranges, which should effectively be never.)
> > 
> > I think I haven't expressed myself correctly.
> > 
> > This construct won't return 0 for pfns not in iomem_caps, and hence
> > could allow mapping of addresses not in iomem_caps?
> 
> I'm afraid I don't understand: There's an iomem_access_permitted()
> in the conditional. How would this allow mapping pages outside of
> iomem_caps? The default case higher up has already forced perms to
> zero for any non-RAM page (unless iommu_hwdom_inclusive).

It was my understanding that when using iommu_hwdom_inclusive (or
iommu_hwdom_reserved if the IO-APIC page is a reserved region) we
still want to deny access to the IO-APIC page if it's not in
iomem_caps, and the proposed conditional won't do that.

So I guess the discussion is really whether
iommu_hwdom_{inclusive,reserved} take precedence over ->iomem_caps?

It seems a bit inconsistent IMO to enforce mmio_ro_ranges but not
->iomem_caps when using iommu_hwdom_{inclusive,reserved}.

> >>>> ? Then there would only remain the question of whether mapping r/o
> >>>> MMCFG pages is okay (I don't think it is), but that could then be
> >>>> special-cased similar to what's done further down for vPCI (by not
> >>>> returning in the "else if", but merely updating "perms").
> >>>
> >>> Well part of the point of this is to make CPU and Device mappings
> >>> more similar.  I don't think devices have any business in poking at
> >>> the MMCFG range, so it's fine to explicitly ban that range.  But I
> >>> would have also said the same for IO-APIC pages, so I'm unsure why are
> >>> IO-APIC pages fine to be mapped RO, but not the MMCFG range.
> >>
> >> I wouldn't have wanted to allow r/o mappings of the IO-APICs, but
> >> Linux plus the ACPI tables of certain vendors require us to permit
> >> this. If we didn't, Dom0 would crash there during boot.
> > 
> > Right, but those are required for the CPU only.  I think it's a fine
> > goal to try to have similar mappings for CPU and Devices, and then
> > that would also cover MMCFG in the PV case.  Or else it's fine to assume
> > CPU vs Device mappings will be slightly different, and then don't add
> > any mappings for IO-APIC, HPET or MMCFG to the Device page tables
> > (likely there's more that could be added here).
> 
> It being different is what Andrew looks to strongly dislike. And I agree
> with this up to a certain point, i.e. I'm having a hard time seeing why
> we should put in MMCFG mappings just for this reason. But if consensus
> was that consistency across all types of MMIO is the goal, then I could
> live with also making MMCFG mappings ...

For HVM/PVH I think we want to be consistent as long as it's doable (we
can't provide devices access to the emulated MMCFG there for example).

For PV I guess it's also a worthy goal if it makes the code easier.
PV (and PV dom0 especially) is already a very custom platform with
weird properties (like the mapping of the IO-APIC and HPET regions RO
or no mappings at all).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-04 13:46                     ` Roger Pau Monné
@ 2022-05-04 13:55                       ` Jan Beulich
  2022-05-04 15:22                         ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-04 13:55 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 04.05.2022 15:46, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 03:19:16PM +0200, Jan Beulich wrote:
>> On 04.05.2022 15:00, Roger Pau Monné wrote:
>>> On Wed, May 04, 2022 at 02:12:58PM +0200, Jan Beulich wrote:
>>>> On 04.05.2022 14:01, Roger Pau Monné wrote:
>>>>> On Wed, May 04, 2022 at 12:51:25PM +0200, Jan Beulich wrote:
>>>>>> On 04.05.2022 12:30, Roger Pau Monné wrote:
>>>>>>> Right, ->iomem_caps is indeed too wide for our purpose.  What
>>>>>>> about using something like:
>>>>>>>
>>>>>>> else if ( is_pv_domain(d) )
>>>>>>> {
>>>>>>>     if ( !iomem_access_permitted(d, pfn, pfn) )
>>>>>>>         return 0;
>>>>>>
>>>>>> We can't return 0 here (as RAM pages also make it here when
>>>>>> !iommu_hwdom_strict), so I can at best take this as a vague outline
>>>>>> of what you really mean. And I don't want to rely on RAM pages being
>>>>>> (imo wrongly) represented by set bits in Dom0's iomem_caps.
>>>>>
>>>>> Well, yes, my suggestion was taking into account that ->iomem_caps for
>>>>> dom0 has mostly holes for things that shouldn't be mapped, but
>>>>> otherwise contains everything else as allowed (including RAM).
>>>>>
>>>>> We could instead do:
>>>>>
>>>>> else if ( is_pv_domain(d) && type != RAM_TYPE_CONVENTIONAL )
>>>>> {
>>>>>     ...
>>>>>
>>>>> So that we don't rely on RAM being 'allowed' in ->iomem_caps?
>>>>
>>>> This would feel to me like excess special casing.
>>>
>>> What about placing this in the 'default:' label on the type switch a
>>> bit above?
>>
>> I'd really like to stick to the present layout of where the special
>> casing is done, with PV and PVH logic at least next to each other. I
>> continue to think the construct I suggested (still visible below)
>> would do.
>>
>>>>>>>     if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
>>>>>>>         return IOMMUF_readable;
>>>>>>> }
>>>>>>>
>>>>>>> That would get us a bit closer to allowed CPU side mappings, and we
>>>>>>> don't need to special case IO-APIC or HPET addresses as those are
>>>>>>> already added to ->iomem_caps or mmio_ro_ranges respectively by
>>>>>>> dom0_setup_permissions().
>>>>>>
>>>>>> This won't fit in a region of code framed by a (split) comment
>>>>>> saying "Check that it doesn't overlap with ...". Hence if anything
>>>>>> I could put something like this further down. Yet even then the
>>>>>> question remains what to do with ranges which pass
>>>>>> iomem_access_permitted() but
>>>>>> - aren't really MMIO,
>>>>>> - are inside MMCFG,
>>>>>> - are otherwise special.
>>>>>>
>>>>>> Or did you perhaps mean to suggest something like
>>>>>>
>>>>>> else if ( is_pv_domain(d) && iomem_access_permitted(d, pfn, pfn) &&
>>>>>>           rangeset_contains_singleton(mmio_ro_ranges, pfn) )
>>>>>>     return IOMMUF_readable;
>>>>>
>>>>> I don't think this would be fully correct, as we would still allow
>>>>> mappings of IO-APIC pages explicitly banned in ->iomem_caps by not
>>>>> handling those?
>>>>
>>>> CPU side mappings don't deal with the IO-APICs specifically. They only
>>>> care about iomem_caps and mmio_ro_ranges. Hence explicitly banned
>>>> IO-APIC pages cannot be mapped there either. (Of course we only do
>>>> such banning if IO-APIC pages weren't possible to represent in
>>>> mmio_ro_ranges, which should effectively be never.)
>>>
>>> I think I haven't expressed myself correctly.
>>>
>>> This construct won't return 0 for pfns not in iomem_caps, and hence
>>> could allow mapping of addresses not in iomem_caps?
>>
>> I'm afraid I don't understand: There's an iomem_access_permitted()
>> in the conditional. How would this allow mapping pages outside of
>> iomem_caps? The default case higher up has already forced perms to
>> zero for any non-RAM page (unless iommu_hwdom_inclusive).
> 
> It was my understanding that when using iommu_hwdom_inclusive (or
> iommu_hwdom_reserved if the IO-APIC page is a reserved region) we
> still want to deny access to the IO-APIC page if it's not in
> iomem_caps, and the proposed conditional won't do that.
> 
> So I guess the discussion is really whether
> iommu_hwdom_{inclusive,reserved} take precedence over ->iomem_caps?

I think the intended interaction is not spelled out anywhere. I
also think that it is to be expected for such interaction to be
quirky; after all the options themselves are quirks.

> It seems a bit inconsistent IMO to enforce mmio_ro_ranges but not
> ->iomem_caps when using iommu_hwdom_{inclusive,reserved}.

In a way, yes. But as said before - it's highly theoretical for
IO-APIC pages to not be in ->iomem_caps (and this case also won't
go silently).

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches
  2022-05-04 12:27         ` Jan Beulich
@ 2022-05-04 13:55           ` Roger Pau Monné
  2022-05-04 14:26             ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 13:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Wed, May 04, 2022 at 02:27:14PM +0200, Jan Beulich wrote:
> On 04.05.2022 13:20, Roger Pau Monné wrote:
> > On Wed, May 04, 2022 at 11:46:37AM +0200, Jan Beulich wrote:
> >> On 03.05.2022 16:49, Roger Pau Monné wrote:
> >>> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
> >>> It would seem to me that doing it that way would also allow the
> >>> mappings to get established in blocks for domUs.
> >>
> >> ... then this would perhaps be possible.
> >>
> >>>> The installing of zero-ref writable types has in fact shown (observed
> >>>> while putting together the change) that despite the intention by the
> >>>> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
> >>>> sufficiently ordinary pages (at the very least initrd and P2M ones as
> >>>> well as pages that are part of the initial allocation but not part of
> >>>> the initial mapping) still have been starting out as PGT_none, meaning
> >>>> that they would have gained IOMMU mappings only the first time these
> >>>> pages would get mapped writably. Consequently an open question is
> >>>> whether iommu_memory_setup() should set the pages to PGT_writable_page
> >>>> independent of need_iommu_pt_sync().
> >>>
> >>> I think I'm confused, doesn't the setting of PGT_writable_page happen
> >>> as a result of need_iommu_pt_sync() and having those pages added to
> >>> the IOMMU page tables? (so they can be properly tracked and IOMMU
> >>> mappings are removed if the page is also removed)
> >>
> >> In principle yes - in guest_physmap_add_page(). But this function isn't
> >> called for the pages I did enumerate in the remark. XSA-288 really only
> >> cared about getting this right for DomU-s.
> > 
> > Would it make sense to change guest_physmap_add_page() to be able to
> > pass the page_order parameter down to iommu_map(), and then use it for
> > dom0 build instead of introducing iommu_memory_setup()?
> 
> To be quite frank: This is something that I might have been willing to
> do months ago, when this series was still fresh. If I was to start
> re-doing all of this code now, it would take far more time than it
> would have taken back then. Hence I'd like to avoid a full re-work here
> unless entirely unacceptable in the way currently done (which largely
> fits with how we have been doing Dom0 setup).

Sorry, I would have really liked to be more on time with reviews of
this, but there's always something that comes up.

> Furthermore, guest_physmap_add_page() doesn't itself call iommu_map().
> What you're suggesting would require get_page_and_type() to be able to
> work on higher-order pages. I view adjustments like this as well out
> of scope for this series.

Well, my initial thinking was to do something similar to what you
currently have in iommu_memory_setup: a direct call to iommu_map and
adjusting the page types manually, but I think this will only work for
dom0 because pages are fresh at that point.  For domUs we must use
get_page_and_type so any previous mapping is also removed.

> > I think guest_physmap_add_page() will need to be adjusted at some
> > point for domUs, and hence it could be unified with dom0 usage
> > also?
> 
> As an optimization - perhaps. I view it as more important to have HVM
> guests work reasonably well (which includes the performance aspect of
> setting them up).

OK, I'm fine with focusing on HVM.

> >>>> --- a/xen/drivers/passthrough/x86/iommu.c
> >>>> +++ b/xen/drivers/passthrough/x86/iommu.c
> >>>> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
> >>>>  
> >>>>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >>>>  {
> >>>> -    unsigned long i, top, max_pfn;
> >>>> -    unsigned int flush_flags = 0;
> >>>> +    unsigned long i, top, max_pfn, start, count;
> >>>> +    unsigned int flush_flags = 0, start_perms = 0;
> >>>>  
> >>>>      BUG_ON(!is_hardware_domain(d));
> >>>>  
> >>>> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>>>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
> >>>>       * setting up potentially conflicting mappings here.
> >>>>       */
> >>>> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> >>>> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> >>>>  
> >>>> -    for ( ; i < top; i++ )
> >>>> +    for ( i = start, count = 0; i < top; )
> >>>>      {
> >>>>          unsigned long pfn = pdx_to_pfn(i);
> >>>>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> >>>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>>>          if ( !perms )
> >>>>              rc = 0;
> >>>>          else if ( paging_mode_translate(d) )
> >>>> +        {
> >>>>              rc = p2m_add_identity_entry(d, pfn,
> >>>>                                          perms & IOMMUF_writable ? p2m_access_rw
> >>>>                                                                  : p2m_access_r,
> >>>>                                          0);
> >>>> +            if ( rc )
> >>>> +                printk(XENLOG_WARNING
> >>>> +                       "%pd: identity mapping of %lx failed: %d\n",
> >>>> +                       d, pfn, rc);
> >>>> +        }
> >>>> +        else if ( pfn != start + count || perms != start_perms )
> >>>> +        {
> >>>> +        commit:
> >>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
> >>>> +                           &flush_flags);
> >>>> +            if ( rc )
> >>>> +                printk(XENLOG_WARNING
> >>>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
> >>>> +                       d, pfn, pfn + count, rc);
> >>>> +            SWAP(start, pfn);
> >>>> +            start_perms = perms;
> >>>> +            count = 1;
> >>>> +        }
> >>>>          else
> >>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >>>> -                           perms, &flush_flags);
> >>>> +        {
> >>>> +            ++count;
> >>>> +            rc = 0;
> >>>
> >>> Seeing as we want to process this in blocks now, I wonder whether it
> >>> would make sense to take a different approach, and use a rangeset to
> >>> track which regions need to be mapped.  What gets added would be based
> >>> on the host e820 plus the options
> >>> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
> >>> based on the logic in hwdom_iommu_map() and finally we could iterate
> >>> over the regions afterwards using rangeset_consume_ranges().
> >>>
> >>> Not that you strictly need to do it here, just think the end result
> >>> would be clearer.
> >>
> >> The end result might indeed be, but it would be more of a change right
> >> here. Hence I'd prefer to leave that out of the series for now.
> > 
> > OK.  I think it might be nice to add a comment in that regard, mostly
> > because I tend to forget myself.
> 
> Sure, I've added another post-commit-message remark.

Sorry for being confused, but are those reflected in the final commit
message, or in the code itself?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches
  2022-05-04 13:55           ` Roger Pau Monné
@ 2022-05-04 14:26             ` Jan Beulich
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-05-04 14:26 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 04.05.2022 15:55, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 02:27:14PM +0200, Jan Beulich wrote:
>> On 04.05.2022 13:20, Roger Pau Monné wrote:
>>> On Wed, May 04, 2022 at 11:46:37AM +0200, Jan Beulich wrote:
>>>> On 03.05.2022 16:49, Roger Pau Monné wrote:
>>>>> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
>>>>>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>>>          if ( !perms )
>>>>>>              rc = 0;
>>>>>>          else if ( paging_mode_translate(d) )
>>>>>> +        {
>>>>>>              rc = p2m_add_identity_entry(d, pfn,
>>>>>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>>>>>                                                                  : p2m_access_r,
>>>>>>                                          0);
>>>>>> +            if ( rc )
>>>>>> +                printk(XENLOG_WARNING
>>>>>> +                       "%pd: identity mapping of %lx failed: %d\n",
>>>>>> +                       d, pfn, rc);
>>>>>> +        }
>>>>>> +        else if ( pfn != start + count || perms != start_perms )
>>>>>> +        {
>>>>>> +        commit:
>>>>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
>>>>>> +                           &flush_flags);
>>>>>> +            if ( rc )
>>>>>> +                printk(XENLOG_WARNING
>>>>>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
>>>>>> +                       d, pfn, pfn + count, rc);
>>>>>> +            SWAP(start, pfn);
>>>>>> +            start_perms = perms;
>>>>>> +            count = 1;
>>>>>> +        }
>>>>>>          else
>>>>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>>>> -                           perms, &flush_flags);
>>>>>> +        {
>>>>>> +            ++count;
>>>>>> +            rc = 0;
>>>>>
>>>>> Seeing as we want to process this in blocks now, I wonder whether it
>>>>> would make sense to take a different approach, and use a rangeset to
>>>>> track which regions need to be mapped.  What gets added would be based
>>>>> on the host e820 plus the options
>>>>> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
>>>>> based on the logic in hwdom_iommu_map() and finally we could iterate
>>>>> over the regions afterwards using rangeset_consume_ranges().
>>>>>
>>>>> Not that you strictly need to do it here, just think the end result
>>>>> would be clearer.
>>>>
>>>> The end result might indeed be, but it would be more of a change right
>>>> here. Hence I'd prefer to leave that out of the series for now.
>>>
>>> OK.  I think it might be nice to add a comment in that regard, mostly
>>> because I tend to forget myself.
>>
>> Sure, I've added another post-commit-message remark.
> 
> Sorry for being confused, but are those reflected in the final commit
> message, or in the code itself?

Neither - I don't think we have any code comments anywhere which outline
future plans, including reasons for not doing so right away. When writing
that new remark I didn't even think it would belong in the commit message.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables
  2022-05-04 13:07     ` Jan Beulich
@ 2022-05-04 15:06       ` Roger Pau Monné
  2022-05-05  8:20         ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 15:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Wed, May 04, 2022 at 03:07:24PM +0200, Jan Beulich wrote:
> On 03.05.2022 18:20, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:35:45AM +0200, Jan Beulich wrote:
> >> For vendor specific code to support superpages we need to be able to
> >> deal with a superpage mapping replacing an intermediate page table (or
> >> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
> >> needed to free individual page tables while a domain is still alive.
> >> Since the freeing needs to be deferred until after a suitable IOTLB
> >> flush was performed, released page tables get queued for processing by a
> >> tasklet.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> I was considering whether to use a softirq-tasklet instead. This would
> >> have the benefit of avoiding extra scheduling operations, but come with
> >> the risk of the freeing happening prematurely because of a
> >> process_pending_softirqs() somewhere.
> > 
> > I'm sorry again if I already raised this, I don't seem to find a
> > reference.
> 
> Earlier on you only suggested "to perform the freeing after the flush".
> 
> > What about doing the freeing before resuming the guest execution in
> > guest vCPU context?
> > 
> > We already have a hook like this on HVM in hvm_do_resume() calling
> > vpci_process_pending().  I wonder whether we could have a similar hook
> > for PV and keep the pages to be freed in the vCPU instead of the pCPU.
> > This would have the benefit of being able to context switch the vCPU
> > in case the operation takes too long.
> 
> I think this might work in general, but would be troublesome when
> preparing Dom0 (where we don't run on any of Dom0's vCPU-s, and we
> won't ever "exit to guest context" on an idle vCPU). I'm also not
> really fancying to use something like
> 
>     v = current->domain == d ? current : d->vcpu[0];

I guess a problematic case would also be hypercalls executed in a
domain context triggering the freeing of a different domain's IOMMU page
table pages.  The freeing would then be accounted to the current
domain instead of the owner of the pages.

dom0 doesn't seem that problematic, any freeing triggered on a system
domain context could be performed in place (with
process_pending_softirqs() calls to ensure no watchdog triggering).
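
As a rough illustration of that "perform in place" idea (assuming the queued
pages are ordinary domheap pages and re-using the list variable from the hunk
quoted further up - both assumptions, not taken from the patch):

    struct page_info *pg;
    unsigned int done = 0;

    while ( (pg = page_list_remove_head(list)) != NULL )
    {
        free_domheap_page(pg);
        /* Avoid watchdog trouble if a large hierarchy was queued at once. */
        if ( !(++done & 0xff) )
            process_pending_softirqs();
    }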

> (leaving aside that we don't really have d available in
> iommu_queue_free_pgtable() and I'd be hesitant to convert it back).
> Otoh it might be okay to free page tables right away for domains
> which haven't run at all so far.

Could be, but then we would have to make hypercalls that can trigger
those paths preemptible I would think.

> But this would again require
> passing struct domain * to iommu_queue_free_pgtable().

Hm, I guess we could use container_of with the domain_iommu parameter
to obtain a pointer to the domain struct.
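
Something along these lines, assuming struct domain_iommu stays embedded in
struct domain the way dom_iommu() suggests (the helper name is made up):

    /* Hypothetical helper; hd would be the domain_iommu parameter at hand. */
    static struct domain *iommu_domain(struct domain_iommu *hd)
    {
        return container_of(hd, struct domain, iommu);
    }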

> Another upside (I think) of the current approach is that all logic
> is contained in a single source file (i.e. in particular there's no
> new field needed in a per-vCPU structure defined in some header).

Right, I do agree with that.  I'm mostly worried about the resource
starvation aspect.  I guess freeing the pages replaced by a 1G
superpage entry is still fine; anything bigger could be a problem.
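For scale (assuming 512 entries per level): replacing a fully
populated tree by a 1G mapping queues at most one L2 table plus 512 L1
tables, i.e. 513 pages, whereas a single 512G mapping could queue
1 + 512 + 512 * 512 = 262657 page table pages.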

> > Not that the current approach is wrong, but doing it in the guest
> > resume path we could likely prevent guests doing heavy p2m
> > modifications from hogging CPU time.
> 
> Well, they would still be hogging time, but that time would then be
> accounted towards their time slices, yes.
> 
> >> @@ -550,6 +551,91 @@ struct page_info *iommu_alloc_pgtable(st
> >>      return pg;
> >>  }
> >>  
> >> +/*
> >> + * Intermediate page tables which get replaced by large pages may only be
> >> + * freed after a suitable IOTLB flush. Hence such pages get queued on a
> >> + * per-CPU list, with a per-CPU tasklet processing the list on the assumption
> >> + * that the necessary IOTLB flush will have occurred by the time tasklets get
> >> + * to run. (List and tasklet being per-CPU has the benefit of accesses not
> >> + * requiring any locking.)
> >> + */
> >> +static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
> >> +static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
> >> +
> >> +static void free_queued_pgtables(void *arg)
> >> +{
> >> +    struct page_list_head *list = arg;
> >> +    struct page_info *pg;
> >> +    unsigned int done = 0;
> >> +
> > 
> > With the current logic I think it might be helpful to assert that the
> > list is not empty when we get here?
> > 
> > Given the operation requires a context switch we would like to avoid
> > such unless there's indeed pending work to do.
> 
> But is that worth adding an assertion and risking to kill a system just
> because there's a race somewhere by which we might get here without any
> work to do? If you strongly think we want to know about such instances,
> how about a WARN_ON_ONCE() (except that we still don't have that
> specific construct, it would need to be open-coded for the time being)?

Well, I was recommending an assert because I think it's fine to kill a
debug system in order to catch those outliers. On production builds we
should obviously not crash.

> >> +static int cf_check cpu_callback(
> >> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
> >> +{
> >> +    unsigned int cpu = (unsigned long)hcpu;
> >> +    struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
> >> +    struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);
> >> +
> >> +    switch ( action )
> >> +    {
> >> +    case CPU_DOWN_PREPARE:
> >> +        tasklet_kill(tasklet);
> >> +        break;
> >> +
> >> +    case CPU_DEAD:
> >> +        page_list_splice(list, &this_cpu(free_pgt_list));
> > 
> > I think you could check whether list is empty before queuing it?
> 
> I could, but this would make the code (slightly) more complicated
> for improving something which doesn't occur frequently.

It's just a:

if ( page_list_empty(list) )
    break;

at the start of the CPU_DEAD case AFAICT.  As you say this notifier is
not to be called frequently, so not a big deal (also I don't think the
addition makes the code more complicated).

Now that I look at the code again, I think there's a
tasklet_schedule() missing in the CPU_DOWN_FAILED case if there are
entries pending on the list?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0
  2022-05-04 13:55                       ` Jan Beulich
@ 2022-05-04 15:22                         ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 15:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Wed, May 04, 2022 at 03:55:09PM +0200, Jan Beulich wrote:
> On 04.05.2022 15:46, Roger Pau Monné wrote:
> > On Wed, May 04, 2022 at 03:19:16PM +0200, Jan Beulich wrote:
> >> On 04.05.2022 15:00, Roger Pau Monné wrote:
> >>> On Wed, May 04, 2022 at 02:12:58PM +0200, Jan Beulich wrote:
> >>>> On 04.05.2022 14:01, Roger Pau Monné wrote:
> >>>>> On Wed, May 04, 2022 at 12:51:25PM +0200, Jan Beulich wrote:
> >>>>>> On 04.05.2022 12:30, Roger Pau Monné wrote:
> >>>>>>> Right, ->iomem_caps is indeed too wide for our purpose.  What
> >>>>>>> about using something like:
> >>>>>>>
> >>>>>>> else if ( is_pv_domain(d) )
> >>>>>>> {
> >>>>>>>     if ( !iomem_access_permitted(d, pfn, pfn) )
> >>>>>>>         return 0;
> >>>>>>
> >>>>>> We can't return 0 here (as RAM pages also make it here when
> >>>>>> !iommu_hwdom_strict), so I can at best take this as a vague outline
> >>>>>> of what you really mean. And I don't want to rely on RAM pages being
> >>>>>> (imo wrongly) represented by set bits in Dom0's iomem_caps.
> >>>>>
> >>>>> Well, yes, my suggestion was taking into account that ->iomem_caps for
> >>>>> dom0 has mostly holes for things that shouldn't be mapped, but
> >>>>> otherwise contains everything else as allowed (including RAM).
> >>>>>
> >>>>> We could instead do:
> >>>>>
> >>>>> else if ( is_pv_domain(d) && type != RAM_TYPE_CONVENTIONAL )
> >>>>> {
> >>>>>     ...
> >>>>>
> >>>>> So that we don't rely on RAM being 'allowed' in ->iomem_caps?
> >>>>
> >>>> This would feel to me like excess special casing.
> >>>
> >>> What about placing this in the 'default:' label on the type switch a
> >>> bit above?
> >>
> >> I'd really like to stick to the present layout of where the special
> >> casing is done, with PV and PVH logic at least next to each other. I
> >> continue to think the construct I suggested (still visible below)
> >> would do.
> >>
> >>>>>>>     if ( rangeset_contains_singleton(mmio_ro_ranges, pfn) )
> >>>>>>>         return IOMMUF_readable;
> >>>>>>> }
> >>>>>>>
> >>>>>>> That would get us a bit closer to allowed CPU side mappings, and we
> >>>>>>> don't need to special case IO-APIC or HPET addresses as those are
> >>>>>>> already added to ->iomem_caps or mmio_ro_ranges respectively by
> >>>>>>> dom0_setup_permissions().
> >>>>>>
> >>>>>> This won't fit in a region of code framed by a (split) comment
> >>>>>> saying "Check that it doesn't overlap with ...". Hence if anything
> >>>>>> I could put something like this further down. Yet even then the
> >>>>>> question remains what to do with ranges which pass
> >>>>>> iomem_access_permitted() but
> >>>>>> - aren't really MMIO,
> >>>>>> - are inside MMCFG,
> >>>>>> - are otherwise special.
> >>>>>>
> >>>>>> Or did you perhaps mean to suggest something like
> >>>>>>
> >>>>>> else if ( is_pv_domain(d) && iomem_access_permitted(d, pfn, pfn) &&
> >>>>>>           rangeset_contains_singleton(mmio_ro_ranges, pfn) )
> >>>>>>     return IOMMUF_readable;
> >>>>>
> >>>>> I don't think this would be fully correct, as we would still allow
> >>>>> mappings of IO-APIC pages explicitly banned in ->iomem_caps by not
> >>>>> handling those?
> >>>>
> >>>> CPU side mappings don't deal with the IO-APICs specifically. They only
> >>>> care about iomem_caps and mmio_ro_ranges. Hence explicitly banned
> >>>> IO-APIC pages cannot be mapped there either. (Of course we only do
> >>>> such banning if IO-APIC pages weren't possible to represent in
> >>>> mmio_ro_ranges, which should effectively be never.)
> >>>
> >>> I think I haven't expressed myself correctly.
> >>>
> >>> This construct won't return 0 for pfns not in iomem_caps, and hence
> >>> could allow mapping of addresses not in iomem_caps?
> >>
> >> I'm afraid I don't understand: There's an iomem_access_permitted()
> >> in the conditional. How would this allow mapping pages outside of
> >> iomem_caps? The default case higher up has already forced perms to
> >> zero for any non-RAM page (unless iommu_hwdom_inclusive).
> > 
> > It was my understanding that when using iommu_hwdom_inclusive (or
> > iommu_hwdom_reserved if the IO-APIC page is a reserved region) we
> > still want to deny access to the IO-APIC page if it's not in
> > iomem_caps, and the proposed conditional won't do that.
> > 
> > So I guess the discussion is really whether
> > iommu_hwdom_{inclusive,reserved} take precedence over ->iomem_caps?
> 
> I think the intended interaction is not spelled out anywhere. I
> also think that it is to be expected for such interaction to be
> quirky; after all the options themselves are quirks.
> 
> > It seems a bit inconsistent IMO to enforce mmio_ro_ranges but not
> > ->iomem_caps when using iommu_hwdom_{inclusive,reserved}.
> 
> In a way, yes. But as said before - it's highly theoretical for
> IO-APIC pages to not be in ->iomem_caps (and this case also won't
> go silently).

My idea was for whatever check we add for PV to also cover HPET, which
is in a similar situation (can be either blocked in ->iomem_caps or in
mmio_ro_ranges).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 08/21] AMD/IOMMU: walk trees upon page fault
  2022-04-25  8:36 ` [PATCH v4 08/21] AMD/IOMMU: walk trees upon page fault Jan Beulich
@ 2022-05-04 15:57   ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-04 15:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:36:42AM +0200, Jan Beulich wrote:
> This is to aid diagnosing issues and largely matches VT-d's behavior.
> Since I'm adding permissions output here as well, take the opportunity
> and also add their displaying to amd_dump_page_table_level().
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> Note: "largely matches VT-d's behavior" includes the lack of any locking
>       here. Adding suitable locking may not be that easy, as we'd need
>       to determine which domain's mapping lock to acquire in addition to
>       the necessary IOMMU lock (for the device table access), and
>       whether that domain actually still exists. The latter is because
>       if we really want to play safe here, imo we also need to account
>       for the device table to be potentially corrupted / stale.

I think that's fine.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables
  2022-05-04 15:06       ` Roger Pau Monné
@ 2022-05-05  8:20         ` Jan Beulich
  2022-05-05  9:57           ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-05  8:20 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 04.05.2022 17:06, Roger Pau Monné wrote:
> On Wed, May 04, 2022 at 03:07:24PM +0200, Jan Beulich wrote:
>> On 03.05.2022 18:20, Roger Pau Monné wrote:
>>> On Mon, Apr 25, 2022 at 10:35:45AM +0200, Jan Beulich wrote:
>>>> For vendor specific code to support superpages we need to be able to
>>>> deal with a superpage mapping replacing an intermediate page table (or
>>>> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
>>>> needed to free individual page tables while a domain is still alive.
>>>> Since the freeing needs to be deferred until after a suitable IOTLB
>>>> flush was performed, released page tables get queued for processing by a
>>>> tasklet.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> I was considering whether to use a softirq-tasklet instead. This would
>>>> have the benefit of avoiding extra scheduling operations, but come with
>>>> the risk of the freeing happening prematurely because of a
>>>> process_pending_softirqs() somewhere.
>>>
>>> I'm sorry again if I already raised this, I don't seem to find a
>>> reference.
>>
>> Earlier on you only suggested "to perform the freeing after the flush".
>>
>>> What about doing the freeing before resuming the guest execution in
>>> guest vCPU context?
>>>
>>> We already have a hook like this on HVM in hvm_do_resume() calling
>>> vpci_process_pending().  I wonder whether we could have a similar hook
>>> for PV and keep the pages to be freed in the vCPU instead of the pCPU.
>>> This would have the benefit of being able to context switch the vCPU
>>> in case the operation takes too long.
>>
>> I think this might work in general, but would be troublesome when
>> preparing Dom0 (where we don't run on any of Dom0's vCPU-s, and we
>> won't ever "exit to guest context" on an idle vCPU). I'm also not
>> really fancying to use something like
>>
>>     v = current->domain == d ? current : d->vcpu[0];
> 
> I guess a problematic case would also be hypercalls executed in one
> domain's context triggering the freeing of a different domain's IOMMU
> page table pages, as the freeing would then be accounted to the
> current domain instead of the owner of the pages.

Aiui such can happen only during domain construction. Any such
operation behind the back of a running guest is imo problematic.

> dom0 doesn't seem that problematic: any freeing triggered in a system
> domain context could be performed in place (with
> process_pending_softirqs() calls to avoid triggering the watchdog).
> 
>> (leaving aside that we don't really have d available in
>> iommu_queue_free_pgtable() and I'd be hesitant to convert it back).
>> Otoh it might be okay to free page tables right away for domains
>> which haven't run at all so far.
> 
> Could be, but then we would have to make hypercalls that can trigger
> those paths preemptible I would think.

Yes, if they aren't already and if they allow for freeing of
sufficiently large numbers of pages. That's kind of another argument
against doing so right here, isn't it?

>> But this would again require
>> passing struct domain * to iommu_queue_free_pgtable().
> 
> Hm, I guess we could use container_of with the domain_iommu parameter
> to obtain a pointer to the domain struct.

I was fearing you might suggest this. It would be sort of okay since
the reference to struct domain isn't really altering that struct,
but the goal of limiting what is passed to the function was to
prove that the full struct domain isn't required there. Also doing
so would tie us to the iommu piece actually being a sub-structure of
struct domain, whereas I expect it to become a pointer to a separate
structure sooner or later.

>>>> @@ -550,6 +551,91 @@ struct page_info *iommu_alloc_pgtable(st
>>>>      return pg;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Intermediate page tables which get replaced by large pages may only be
>>>> + * freed after a suitable IOTLB flush. Hence such pages get queued on a
>>>> + * per-CPU list, with a per-CPU tasklet processing the list on the assumption
>>>> + * that the necessary IOTLB flush will have occurred by the time tasklets get
>>>> + * to run. (List and tasklet being per-CPU has the benefit of accesses not
>>>> + * requiring any locking.)
>>>> + */
>>>> +static DEFINE_PER_CPU(struct page_list_head, free_pgt_list);
>>>> +static DEFINE_PER_CPU(struct tasklet, free_pgt_tasklet);
>>>> +
>>>> +static void free_queued_pgtables(void *arg)
>>>> +{
>>>> +    struct page_list_head *list = arg;
>>>> +    struct page_info *pg;
>>>> +    unsigned int done = 0;
>>>> +
>>>
>>> With the current logic I think it might be helpful to assert that the
>>> list is not empty when we get here?
>>>
>>> Given the operation requires a context switch we would like to avoid
>>> such unless there's indeed pending work to do.
>>
>> But is that worth adding an assertion and risking to kill a system just
>> because there's a race somewhere by which we might get here without any
>> work to do? If you strongly think we want to know about such instances,
>> how about a WARN_ON_ONCE() (except that we still don't have that
>> specific construct, it would need to be open-coded for the time being)?
> 
> Well, I was recommending an assert because I think it's fine to kill a
> debug system in order to catch those outliers. On production builds we
> should obviously not crash.

I disagree - such a crash may be rather disturbing to someone doing work
on Xen without being familiar with the IOMMU details.

>>>> +static int cf_check cpu_callback(
>>>> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
>>>> +{
>>>> +    unsigned int cpu = (unsigned long)hcpu;
>>>> +    struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
>>>> +    struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);
>>>> +
>>>> +    switch ( action )
>>>> +    {
>>>> +    case CPU_DOWN_PREPARE:
>>>> +        tasklet_kill(tasklet);
>>>> +        break;
>>>> +
>>>> +    case CPU_DEAD:
>>>> +        page_list_splice(list, &this_cpu(free_pgt_list));
>>>
>>> I think you could check whether list is empty before queuing it?
>>
>> I could, but this would make the code (slightly) more complicated
>> for improving something which doesn't occur frequently.
> 
> It's just a:
> 
> if ( page_list_empty(list) )
>     break;
> 
> at the start of the CPU_DEAD case AFAICT.  As you say this notifier is
> not to be called frequently, so not a big deal (also I don't think the
> addition makes the code more complicated).

Okay, I've made that conditional, not the least because I think ...

> Now that I look at the code again, I think there's a
> tasklet_schedule() missing in the CPU_DOWN_FAILED case if there are
> entries pending on the list?

... this, which indeed was missing, wants to be conditional. While
adding this I did notice that INIT_PAGE_LIST_HEAD() was also missing
for CPU_UP_PREPARE - that's benign for most configs, but necessary
in BIGMEM ones.
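
To spell out the shape this leaves the notifier in (sketch of the
result only, mirroring the names from the hunk above, not the literal
patch):

static int cf_check cpu_callback(
    struct notifier_block *nfb, unsigned long action, void *hcpu)
{
    unsigned int cpu = (unsigned long)hcpu;
    struct page_list_head *list = &per_cpu(free_pgt_list, cpu);
    struct tasklet *tasklet = &per_cpu(free_pgt_tasklet, cpu);

    switch ( action )
    {
    case CPU_DOWN_PREPARE:
        tasklet_kill(tasklet);
        break;

    case CPU_DEAD:
        if ( !page_list_empty(list) )
        {
            /* Hand any leftover pages to the CPU running the notifier. */
            page_list_splice(list, &this_cpu(free_pgt_list));
            INIT_PAGE_LIST_HEAD(list);
            tasklet_schedule(&this_cpu(free_pgt_tasklet));
        }
        break;

    case CPU_UP_PREPARE:
        INIT_PAGE_LIST_HEAD(list);
        break;

    case CPU_DOWN_FAILED:
        /* The tasklet was killed in CPU_DOWN_PREPARE; re-arm if needed. */
        if ( !page_list_empty(list) )
            tasklet_schedule(tasklet);
        break;
    }

    return NOTIFY_DONE;
}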

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables
  2022-05-05  8:20         ` Jan Beulich
@ 2022-05-05  9:57           ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-05  9:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Thu, May 05, 2022 at 10:20:36AM +0200, Jan Beulich wrote:
> On 04.05.2022 17:06, Roger Pau Monné wrote:
> > On Wed, May 04, 2022 at 03:07:24PM +0200, Jan Beulich wrote:
> >> On 03.05.2022 18:20, Roger Pau Monné wrote:
> >>> On Mon, Apr 25, 2022 at 10:35:45AM +0200, Jan Beulich wrote:
> >>>> For vendor specific code to support superpages we need to be able to
> >>>> deal with a superpage mapping replacing an intermediate page table (or
> >>>> hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
> >>>> needed to free individual page tables while a domain is still alive.
> >>>> Since the freeing needs to be deferred until after a suitable IOTLB
> >>>> flush was performed, released page tables get queued for processing by a
> >>>> tasklet.
> >>>>
> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>> ---
> >>>> I was considering whether to use a softirq-tasklet instead. This would
> >>>> have the benefit of avoiding extra scheduling operations, but come with
> >>>> the risk of the freeing happening prematurely because of a
> >>>> process_pending_softirqs() somewhere.
> >>>
> >>> I'm sorry again if I already raised this, I don't seem to find a
> >>> reference.
> >>
> >> Earlier on you only suggested "to perform the freeing after the flush".
> >>
> >>> What about doing the freeing before resuming the guest execution in
> >>> guest vCPU context?
> >>>
> >>> We already have a hook like this on HVM in hvm_do_resume() calling
> >>> vpci_process_pending().  I wonder whether we could have a similar hook
> >>> for PV and keep the pages to be freed in the vCPU instead of the pCPU.
> >>> This would have the benefit of being able to context switch the vCPU
> >>> in case the operation takes too long.
> >>
> >> I think this might work in general, but would be troublesome when
> >> preparing Dom0 (where we don't run on any of Dom0's vCPU-s, and we
> >> won't ever "exit to guest context" on an idle vCPU). I'm also not
> >> really fancying to use something like
> >>
> >>     v = current->domain == d ? current : d->vcpu[0];
> > 
> > I guess a problematic case would also be hypercalls executed in a
> > domain context triggering the freeing of a different domain iommu page
> > table pages.  As then the freeing would be accounted to the current
> > domain instead of the owner of the pages.
> 
> Aiui such can happen only during domain construction. Any such
> operation behind the back of a running guest is imo problematic.
> 
> > dom0 doesn't seem that problematic, any freeing triggered on a system
> > domain context could be performed in place (with
> > process_pending_softirqs() calls to ensure no watchdog triggering).
> > 
> >> (leaving aside that we don't really have d available in
> >> iommu_queue_free_pgtable() and I'd be hesitant to convert it back).
> >> Otoh it might be okay to free page tables right away for domains
> >> which haven't run at all so far.
> > 
> > Could be, but then we would have to make hypercalls that can trigger
> > those paths preemptible I would think.
> 
> Yes, if they aren't already and if they allow for freeing of
> sufficiently large numbers of pages. That's kind of another argument
> against doing so right here, isn't it?

Indeed, as it's likely to make the implementation more complex IMO.

So let's use this pCPU implementation.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings
  2022-04-25  8:38 ` [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings Jan Beulich
@ 2022-05-05 13:19   ` Roger Pau Monné
  2022-05-05 14:34     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-05 13:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:38:06AM +0200, Jan Beulich wrote:
> No separate feature flags exist which would control availability of
> these; the only restriction is HATS (establishing the maximum number of
> page table levels in general), and even that has a lower bound of 4.
> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> non-default page sizes the implementation in principle permits arbitrary
> size mappings, but these require multiple identical leaf PTEs to be
> written, which isn't all that different from having to write multiple
> consecutive PTEs with increasing frame numbers. IMO that's therefore
> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> such hardware.)
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> freeing of intermediate page tables would take quite a while when
> replacing a tree of 4k mappings by a single 512G one. Yet then again
> there's no present code path via which 512G chunks of memory could be
> allocated (and hence mapped) anyway, so this would only benefit huge
> systems where 512 1G mappings could be re-coalesced (once suitable code
> is in place) into a single L4 entry. And re-coalescing wouldn't result
> in scheduling-for-freeing of full trees of lower level pagetables.

I would think part of this should go into the commit message, as to
why enabling 512G superpages is fine.

> ---
> v4: Change type of queue_free_pt()'s 1st parameter. Re-base.
> v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
>     where possible.
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
>  }
>  
>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
> -                                                   unsigned long dfn)
> +                                                   unsigned long dfn,
> +                                                   unsigned int level)
>  {
>      union amd_iommu_pte *table, *pte, old;
>  
>      table = map_domain_page(_mfn(l1_mfn));
> -    pte = &table[pfn_to_pde_idx(dfn, 1)];
> +    pte = &table[pfn_to_pde_idx(dfn, level)];
>      old = *pte;
>  
>      write_atomic(&pte->raw, 0);
> @@ -351,11 +352,32 @@ static int iommu_pde_from_dfn(struct dom
>      return 0;
>  }
>  
> +static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
> +{
> +    if ( level > 1 )
> +    {
> +        union amd_iommu_pte *pt = map_domain_page(mfn);
> +        unsigned int i;
> +
> +        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
> +            if ( pt[i].pr && pt[i].next_level )
> +            {
> +                ASSERT(pt[i].next_level < level);
> +                queue_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level);
> +            }
> +
> +        unmap_domain_page(pt);
> +    }
> +
> +    iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
> +}
> +
>  int cf_check amd_iommu_map_page(
>      struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
>      unsigned int *flush_flags)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
>      int rc;
>      unsigned long pt_mfn = 0;
>      union amd_iommu_pte old;
> @@ -384,7 +406,7 @@ int cf_check amd_iommu_map_page(
>          return rc;
>      }
>  

I think it might be helpful to assert or otherwise check that the
input order is supported by the IOMMU, just to be on the safe side.

> -    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
> +    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
>           !pt_mfn )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
> @@ -394,8 +416,8 @@ int cf_check amd_iommu_map_page(
>          return -EFAULT;
>      }
>  
> -    /* Install 4k mapping */
> -    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
> +    /* Install mapping */
> +    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
>                                  (flags & IOMMUF_writable),
>                                  (flags & IOMMUF_readable));
>  
> @@ -403,8 +425,13 @@ int cf_check amd_iommu_map_page(
>  
>      *flush_flags |= IOMMU_FLUSHF_added;
>      if ( old.pr )
> +    {
>          *flush_flags |= IOMMU_FLUSHF_modified;
>  
> +        if ( IOMMUF_order(flags) && old.next_level )
> +            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> +    }
> +
>      return 0;
>  }
>  
> @@ -413,6 +440,7 @@ int cf_check amd_iommu_unmap_page(
>  {
>      unsigned long pt_mfn = 0;
>      struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
>      union amd_iommu_pte old = {};
>  
>      spin_lock(&hd->arch.mapping_lock);
> @@ -423,7 +451,7 @@ int cf_check amd_iommu_unmap_page(
>          return 0;
>      }
>  
> -    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
> +    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
>          AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
> @@ -435,14 +463,19 @@ int cf_check amd_iommu_unmap_page(
>      if ( pt_mfn )
>      {
>          /* Mark PTE as 'page not present'. */
> -        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
> +        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
>      }
>  
>      spin_unlock(&hd->arch.mapping_lock);
>  
>      if ( old.pr )
> +    {
>          *flush_flags |= IOMMU_FLUSHF_modified;
>  
> +        if ( order && old.next_level )
> +            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> +    }
> +
>      return 0;
>  }
>  
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -747,7 +747,7 @@ static void cf_check amd_dump_page_table
>  }
>  
>  static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
> -    .page_sizes = PAGE_SIZE_4K,
> +    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,

As mentioned in a previous email, I'm worried about the case where we
replace an entry populated with 4K pages with a 512G superpage, as the
freeing cost of the associated pagetables would be quite high.

I guess we will have to implement a more preemption-friendly freeing
behavior if issues arise.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings
  2022-05-05 13:19   ` Roger Pau Monné
@ 2022-05-05 14:34     ` Jan Beulich
  2022-05-05 15:26       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-05 14:34 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 05.05.2022 15:19, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:38:06AM +0200, Jan Beulich wrote:
>> No separate feature flags exist which would control availability of
>> these; the only restriction is HATS (establishing the maximum number of
>> page table levels in general), and even that has a lower bound of 4.
>> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
>> non-default page sizes the implementation in principle permits arbitrary
>> size mappings, but these require multiple identical leaf PTEs to be
>> written, which isn't all that different from having to write multiple
>> consecutive PTEs with increasing frame numbers. IMO that's therefore
>> beneficial only on hardware where suitable TLBs exist; I'm unaware of
>> such hardware.)
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> ---
>> I'm not fully sure about allowing 512G mappings: The scheduling-for-
>> freeing of intermediate page tables would take quite a while when
>> replacing a tree of 4k mappings by a single 512G one. Yet then again
>> there's no present code path via which 512G chunks of memory could be
>> allocated (and hence mapped) anyway, so this would only benefit huge
>> systems where 512 1G mappings could be re-coalesced (once suitable code
>> is in place) into a single L4 entry. And re-coalescing wouldn't result
>> in scheduling-for-freeing of full trees of lower level pagetables.
> 
> I would think part of this should go into the commit message, as to
> why enabling 512G superpages is fine.

Together with what you say at the bottom I wonder whether, rather than
moving this into the description in a slightly edited form, I shouldn't
drop the PAGE_SIZE_512G there. I don't think that would invalidate your
R-b.

>> @@ -384,7 +406,7 @@ int cf_check amd_iommu_map_page(
>>          return rc;
>>      }
>>  
> 
> I think it might be helpful to assert or otherwise check that the
> input order is supported by the IOMMU, just to be on the safe side.

Well, yes, I can certainly do so. Given how the code was developed it
didn't seem very likely that such a fundamental assumption could be
violated, but I guess I see your point.
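
Presumably something as simple as this would do (sketch; checking the
requested order against the ops' page_sizes bitmap):

    ASSERT((hd->platform_ops->page_sizes >> IOMMUF_order(flags)) &
           PAGE_SIZE_4K);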

Jan

>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -747,7 +747,7 @@ static void cf_check amd_dump_page_table
>>  }
>>  
>>  static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
>> -    .page_sizes = PAGE_SIZE_4K,
>> +    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,
> 
> As mentioned in a previous email, I'm worried about the case where we
> replace an entry populated with 4K pages with a 512G superpage, as the
> freeing cost of the associated pagetables would be quite high.
> 
> I guess we will have to implement a more preemption-friendly freeing
> behavior if issues arise.
> 
> Thanks, Roger.
> 



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings
  2022-05-05 14:34     ` Jan Beulich
@ 2022-05-05 15:26       ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-05 15:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Thu, May 05, 2022 at 04:34:54PM +0200, Jan Beulich wrote:
> On 05.05.2022 15:19, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:38:06AM +0200, Jan Beulich wrote:
> >> No separate feature flags exist which would control availability of
> >> these; the only restriction is HATS (establishing the maximum number of
> >> page table levels in general), and even that has a lower bound of 4.
> >> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> >> non-default page sizes the implementation in principle permits arbitrary
> >> size mappings, but these require multiple identical leaf PTEs to be
> >> written, which isn't all that different from having to write multiple
> >> consecutive PTEs with increasing frame numbers. IMO that's therefore
> >> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> >> such hardware.)
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> >> ---
> >> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> >> freeing of intermediate page tables would take quite a while when
> >> replacing a tree of 4k mappings by a single 512G one. Yet then again
> >> there's no present code path via which 512G chunks of memory could be
> >> allocated (and hence mapped) anyway, so this would only benefit huge
> >> systems where 512 1G mappings could be re-coalesced (once suitable code
> >> is in place) into a single L4 entry. And re-coalescing wouldn't result
> >> in scheduling-for-freeing of full trees of lower level pagetables.
> > 
> > I would think part of this should go into the commit message, as to
> > why enabling 512G superpages is fine.
> 
> Together with what you say at the bottom I wonder whether, rather than
> moving this into the description in a slightly edited form, I shouldn't
> drop the PAGE_SIZE_512G there. I don't think that would invalidate your
> R-b.

Right, might be good to add a comment that 512G super pages could be
enabled (ie: there's no hardware limitation), but we need to be sure
that the freeing of the removed page table pages doesn't starve the
pCPU.
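
Something along these lines maybe (wording just a suggestion):

    /*
     * 512G mappings could be advertised as well (HATS has a lower bound
     * of 4 levels, so there's no hardware restriction), but replacing a
     * full tree of lower level page tables would queue a huge number of
     * pages for freeing.  Leave 512G out until that can be done without
     * starving the pCPU.
     */
    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G,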

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 11/21] VT-d: allow use of superpage mappings
  2022-04-25  8:38 ` [PATCH v4 11/21] VT-d: " Jan Beulich
@ 2022-05-05 16:20   ` Roger Pau Monné
  2022-05-06  6:13     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-05 16:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Mon, Apr 25, 2022 at 10:38:37AM +0200, Jan Beulich wrote:
> ... depending on feature availability (and absence of quirks).
> 
> Also make the page table dumping function aware of superpages.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Might be interesting to also add an assert that the passed order
matches the supported values, like requested on AMD.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 11/21] VT-d: allow use of superpage mappings
  2022-05-05 16:20   ` Roger Pau Monné
@ 2022-05-06  6:13     ` Jan Beulich
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-05-06  6:13 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 05.05.2022 18:20, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:38:37AM +0200, Jan Beulich wrote:
>> ... depending on feature availability (and absence of quirks).
>>
>> Also make the page table dumping function aware of superpages.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> Might be interesting to also add an assert that the passed order
> matches the supported values, like requested on AMD.

Sure - I extended your comment there to cover the patch here right away.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 12/21] IOMMU: fold flush-all hook into "flush one"
  2022-04-25  8:40 ` [PATCH v4 12/21] IOMMU: fold flush-all hook into "flush one" Jan Beulich
@ 2022-05-06  8:38   ` Roger Pau Monné
  2022-05-06  9:59     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-06  8:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:40:06AM +0200, Jan Beulich wrote:
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -772,18 +772,21 @@ static int __must_check cf_check iommu_f
>      struct domain *d, dfn_t dfn, unsigned long page_count,
>      unsigned int flush_flags)
>  {
> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> -    ASSERT(flush_flags);
> +    if ( flush_flags & IOMMU_FLUSHF_all )
> +    {
> +        dfn = INVALID_DFN;
> +        page_count = 0;
> +    }
> +    else
> +    {
> +        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
> +        ASSERT(flush_flags);
> +    }
>  
>      return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
>                               page_count);

In a future patch we could likely move the code in iommu_flush_iotlb
into iommu_flush_iotlb_pages, seeing as iommu_flush_iotlb_pages is the
only caller of iommu_flush_iotlb.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 12/21] IOMMU: fold flush-all hook into "flush one"
  2022-05-06  8:38   ` Roger Pau Monné
@ 2022-05-06  9:59     ` Jan Beulich
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-05-06  9:59 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 06.05.2022 10:38, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:40:06AM +0200, Jan Beulich wrote:
>> --- a/xen/drivers/passthrough/vtd/iommu.c
>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>> @@ -772,18 +772,21 @@ static int __must_check cf_check iommu_f
>>      struct domain *d, dfn_t dfn, unsigned long page_count,
>>      unsigned int flush_flags)
>>  {
>> -    ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
>> -    ASSERT(flush_flags);
>> +    if ( flush_flags & IOMMU_FLUSHF_all )
>> +    {
>> +        dfn = INVALID_DFN;
>> +        page_count = 0;
>> +    }
>> +    else
>> +    {
>> +        ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN));
>> +        ASSERT(flush_flags);
>> +    }
>>  
>>      return iommu_flush_iotlb(d, dfn, flush_flags & IOMMU_FLUSHF_modified,
>>                               page_count);
> 
> In a future patch we could likely move the code in iommu_flush_iotlb
> into iommu_flush_iotlb_pages, seeing as iommu_flush_iotlb_pages is the
> only caller of iommu_flush_iotlb.

And indeed a later patch does so, and an earlier version of the patch
here did say so in a post-commit-message remark.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-04-25  8:40 ` [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables Jan Beulich
@ 2022-05-06 11:16   ` Roger Pau Monné
  2022-05-19 12:12     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-06 11:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
> Page tables are used for two purposes after allocation: They either
> start out all empty, or they get filled to replace a superpage.
> Subsequently, to replace all empty or fully contiguous page tables,
> contiguous sub-regions will be recorded within individual page tables.
> Install the initial set of markers immediately after allocation. Make
> sure to retain these markers when further populating a page table in
> preparation for it to replace a superpage.
> 
> The markers are simply 4-bit fields holding the order value of
> contiguous entries. To demonstrate this, if a page table had just 16
> entries, this would be the initial (fully contiguous) set of markers:
> 
> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
> 
> "Contiguous" here means not only present entries with successively
> increasing MFNs, each one suitably aligned for its slot, but also a
> respective number of all non-present entries.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> An alternative to the ASSERT()s added to set_iommu_ptes_present() would
> be to make the function less general-purpose; it's used in a single
> place only after all (i.e. it might as well be folded into its only
> caller).

I would think adding a comment that the function requires the PDE to
be empty would be good.  Also given the current usage we could drop
the nr_ptes parameter and just name the function fill_pde() or
similar.

> 
> While in VT-d's comment ahead of struct dma_pte I'm adjusting the
> description of the high bits, I'd like to note that the description of
> some of the lower bits isn't correct either. Yet I don't think adjusting
> that belongs here.
> ---
> v4: Add another comment referring to pt-contig-markers.h. Re-base.
> v3: Add comments. Re-base.
> v2: New.
> 
> --- a/xen/arch/x86/include/asm/iommu.h
> +++ b/xen/arch/x86/include/asm/iommu.h
> @@ -146,7 +146,8 @@ void iommu_free_domid(domid_t domid, uns
>  
>  int __must_check iommu_free_pgtables(struct domain *d);
>  struct domain_iommu;
> -struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
> +struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd,
> +                                                   uint64_t contig_mask);
>  void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *pg);
>  
>  #endif /* !__ARCH_X86_IOMMU_H__ */
> --- a/xen/drivers/passthrough/amd/iommu-defs.h
> +++ b/xen/drivers/passthrough/amd/iommu-defs.h
> @@ -446,11 +446,13 @@ union amd_iommu_x2apic_control {
>  #define IOMMU_PAGE_TABLE_U32_PER_ENTRY	(IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
>  #define IOMMU_PAGE_TABLE_ALIGNMENT	4096
>  
> +#define IOMMU_PTE_CONTIG_MASK           0x1e /* The ign0 field below. */
> +
>  union amd_iommu_pte {
>      uint64_t raw;
>      struct {
>          bool pr:1;
> -        unsigned int ign0:4;
> +        unsigned int ign0:4; /* Covered by IOMMU_PTE_CONTIG_MASK. */
>          bool a:1;
>          bool d:1;
>          unsigned int ign1:2;
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
>  
>      while ( nr_ptes-- )
>      {
> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> +        ASSERT(!pde->next_level);
> +        ASSERT(!pde->u);
> +
> +        if ( pde > table )
> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> +        else
> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);

I think PAGETABLE_ORDER would be clearer here.

While here, could you also assert that next_mfn matches the contiguous
order currently set in the PTE?

> +
> +        pde->iw = iw;
> +        pde->ir = ir;
> +        pde->fc = true; /* See set_iommu_pde_present(). */
> +        pde->mfn = next_mfn;
> +        pde->pr = true;
>  
>          ++pde;
>          next_mfn += page_sz;
> @@ -295,7 +307,7 @@ static int iommu_pde_from_dfn(struct dom
>              mfn = next_table_mfn;
>  
>              /* allocate lower level page table */
> -            table = iommu_alloc_pgtable(hd);
> +            table = iommu_alloc_pgtable(hd, IOMMU_PTE_CONTIG_MASK);
>              if ( table == NULL )
>              {
>                  AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
> @@ -325,7 +337,7 @@ static int iommu_pde_from_dfn(struct dom
>  
>              if ( next_table_mfn == 0 )
>              {
> -                table = iommu_alloc_pgtable(hd);
> +                table = iommu_alloc_pgtable(hd, IOMMU_PTE_CONTIG_MASK);
>                  if ( table == NULL )
>                  {
>                      AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
> @@ -717,7 +729,7 @@ static int fill_qpt(union amd_iommu_pte
>                   * page table pages, and the resulting allocations are always
>                   * zeroed.
>                   */
> -                pgs[level] = iommu_alloc_pgtable(hd);
> +                pgs[level] = iommu_alloc_pgtable(hd, 0);

Is it worth not setting up the contiguous data for quarantine page
tables?

I think it's fine now given the current code, but your having added
ASSERTs that the contig data is correct in set_iommu_ptes_present()
makes me wonder whether we could trigger those in the future.

I understand that the contig data is not helpful for quarantine page
tables, but it still doesn't seem bad to have it just for coherency.

>                  if ( !pgs[level] )
>                  {
>                      rc = -ENOMEM;
> @@ -775,7 +787,7 @@ int cf_check amd_iommu_quarantine_init(s
>          return 0;
>      }
>  
> -    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd);
> +    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd, 0);
>      if ( !pdev->arch.amd.root_table )
>          return -ENOMEM;
>  
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -342,7 +342,7 @@ int amd_iommu_alloc_root(struct domain *
>  
>      if ( unlikely(!hd->arch.amd.root_table) && d != dom_io )
>      {
> -        hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
> +        hd->arch.amd.root_table = iommu_alloc_pgtable(hd, 0);
>          if ( !hd->arch.amd.root_table )
>              return -ENOMEM;
>      }
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -334,7 +334,7 @@ static uint64_t addr_to_dma_page_maddr(s
>              goto out;
>  
>          pte_maddr = level;
> -        if ( !(pg = iommu_alloc_pgtable(hd)) )
> +        if ( !(pg = iommu_alloc_pgtable(hd, 0)) )
>              goto out;
>  
>          hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
> @@ -376,7 +376,7 @@ static uint64_t addr_to_dma_page_maddr(s
>              }
>  
>              pte_maddr = level - 1;
> -            pg = iommu_alloc_pgtable(hd);
> +            pg = iommu_alloc_pgtable(hd, DMA_PTE_CONTIG_MASK);
>              if ( !pg )
>                  break;
>  
> @@ -388,12 +388,13 @@ static uint64_t addr_to_dma_page_maddr(s
>                  struct dma_pte *split = map_vtd_domain_page(pte_maddr);
>                  unsigned long inc = 1UL << level_to_offset_bits(level - 1);
>  
> -                split[0].val = pte->val;
> +                split[0].val |= pte->val & ~DMA_PTE_CONTIG_MASK;
>                  if ( inc == PAGE_SIZE )
>                      split[0].val &= ~DMA_PTE_SP;
>  
>                  for ( offset = 1; offset < PTE_NUM; ++offset )
> -                    split[offset].val = split[offset - 1].val + inc;
> +                    split[offset].val |=
> +                        (split[offset - 1].val & ~DMA_PTE_CONTIG_MASK) + inc;
>  
>                  iommu_sync_cache(split, PAGE_SIZE);
>                  unmap_vtd_domain_page(split);
> @@ -2173,7 +2174,7 @@ static int __must_check cf_check intel_i
>      if ( iommu_snoop )
>          dma_set_pte_snp(new);
>  
> -    if ( old.val == new.val )
> +    if ( !((old.val ^ new.val) & ~DMA_PTE_CONTIG_MASK) )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
>          unmap_vtd_domain_page(page);
> @@ -3052,7 +3053,7 @@ static int fill_qpt(struct dma_pte *this
>                   * page table pages, and the resulting allocations are always
>                   * zeroed.
>                   */
> -                pgs[level] = iommu_alloc_pgtable(hd);
> +                pgs[level] = iommu_alloc_pgtable(hd, 0);
>                  if ( !pgs[level] )
>                  {
>                      rc = -ENOMEM;
> @@ -3109,7 +3110,7 @@ static int cf_check intel_iommu_quaranti
>      if ( !drhd )
>          return -ENODEV;
>  
> -    pg = iommu_alloc_pgtable(hd);
> +    pg = iommu_alloc_pgtable(hd, 0);
>      if ( !pg )
>          return -ENOMEM;
>  
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -253,7 +253,10 @@ struct context_entry {
>   * 2-6: reserved
>   * 7: super page
>   * 8-11: available
> - * 12-63: Host physcial address
> + * 12-51: Host physcial address
> + * 52-61: available (52-55 used for DMA_PTE_CONTIG_MASK)
> + * 62: reserved
> + * 63: available
>   */
>  struct dma_pte {
>      u64 val;
> @@ -263,6 +266,7 @@ struct dma_pte {
>  #define DMA_PTE_PROT (DMA_PTE_READ | DMA_PTE_WRITE)
>  #define DMA_PTE_SP   (1 << 7)
>  #define DMA_PTE_SNP  (1 << 11)
> +#define DMA_PTE_CONTIG_MASK  (0xfull << PADDR_BITS)
>  #define dma_clear_pte(p)    do {(p).val = 0;} while(0)
>  #define dma_set_pte_readable(p) do {(p).val |= DMA_PTE_READ;} while(0)
>  #define dma_set_pte_writable(p) do {(p).val |= DMA_PTE_WRITE;} while(0)
> @@ -276,7 +280,7 @@ struct dma_pte {
>  #define dma_pte_write(p) (dma_pte_prot(p) & DMA_PTE_WRITE)
>  #define dma_pte_addr(p) ((p).val & PADDR_MASK & PAGE_MASK_4K)
>  #define dma_set_pte_addr(p, addr) do {\
> -            (p).val |= ((addr) & PAGE_MASK_4K); } while (0)
> +            (p).val |= ((addr) & PADDR_MASK & PAGE_MASK_4K); } while (0)

While I'm not opposed to this, I would assume that addr is not
expected to contain bits cleared by PADDR_MASK? (or PAGE_MASK_4K FWIW)

Or else callers are really messed up.

>  #define dma_pte_present(p) (((p).val & DMA_PTE_PROT) != 0)
>  #define dma_pte_superpage(p) (((p).val & DMA_PTE_SP) != 0)
>  
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -522,11 +522,12 @@ int iommu_free_pgtables(struct domain *d
>      return 0;
>  }
>  
> -struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd)
> +struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd,
> +                                      uint64_t contig_mask)
>  {
>      unsigned int memflags = 0;
>      struct page_info *pg;
> -    void *p;
> +    uint64_t *p;
>  
>  #ifdef CONFIG_NUMA
>      if ( hd->node != NUMA_NO_NODE )
> @@ -538,7 +539,29 @@ struct page_info *iommu_alloc_pgtable(st
>          return NULL;
>  
>      p = __map_domain_page(pg);
> -    clear_page(p);
> +
> +    if ( contig_mask )
> +    {
> +        /* See pt-contig-markers.h for a description of the marker scheme. */
> +        unsigned int i, shift = find_first_set_bit(contig_mask);
> +
> +        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);

I think it might be clearer to use PAGETABLE_ORDER rather than
PAGE_SHIFT - 3.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in page tables
  2022-04-25  8:41 ` [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in " Jan Beulich
@ 2022-05-06 13:25   ` Roger Pau Monné
  2022-05-18 10:06     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-06 13:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Mon, Apr 25, 2022 at 10:41:23AM +0200, Jan Beulich wrote:
> This is a re-usable helper (kind of a template) which gets introduced
> without users so that the individual subsequent patches introducing such
> users can get committed independently of one another.
> 
> See the comment at the top of the new file. To demonstrate the effect,
> if a page table had just 16 entries, this would be the set of markers
> for a page table with fully contiguous mappings:
> 
> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
> 
> "Contiguous" here means not only present entries with successively
> increasing MFNs, each one suitably aligned for its slot, but also a
> respective number of all non-present entries.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: Rename function and header. Introduce IS_CONTIG().
> v2: New.
> 
> --- /dev/null
> +++ b/xen/arch/x86/include/asm/pt-contig-markers.h
> @@ -0,0 +1,105 @@
> +#ifndef __ASM_X86_PT_CONTIG_MARKERS_H
> +#define __ASM_X86_PT_CONTIG_MARKERS_H
> +
> +/*
> + * Short of having function templates in C, the function defined below is
> + * intended to be used by multiple parties interested in recording the
> + * degree of contiguity in mappings by a single page table.
> + *
> + * Scheme: Every entry records the order of contiguous successive entries,
> + * up to the maximum order covered by that entry (which is the number of
> + * clear low bits in its index, with entry 0 being the exception using
> + * the base-2 logarithm of the number of entries in a single page table).
> + * While a few entries need touching upon update, knowing whether the
> + * table is fully contiguous (and can hence be replaced by a higher level
> + * leaf entry) is then possible by simply looking at entry 0's marker.
> + *
> + * Prereqs:
> + * - CONTIG_MASK needs to be #define-d, to a value having at least 4
> + *   contiguous bits (ignored by hardware), before including this file,
> + * - page tables to be passed here need to be initialized with correct
> + *   markers.

Not sure it's very relevant, but it might be worth adding that:

- Null entries must have the PTE zeroed except for the CONTIG_MASK
  region in order to be considered as inactive.

> + */
> +
> +#include <xen/bitops.h>
> +#include <xen/lib.h>
> +#include <xen/page-size.h>
> +
> +/* This is the same for all anticipated users, so doesn't need passing in. */
> +#define CONTIG_LEVEL_SHIFT 9
> +#define CONTIG_NR          (1 << CONTIG_LEVEL_SHIFT)
> +
> +#define GET_MARKER(e) MASK_EXTR(e, CONTIG_MASK)
> +#define SET_MARKER(e, m) \
> +    ((void)((e) = ((e) & ~CONTIG_MASK) | MASK_INSR(m, CONTIG_MASK)))
> +
> +#define IS_CONTIG(kind, pt, i, idx, shift, b) \
> +    ((kind) == PTE_kind_leaf \
> +     ? (((pt)[i] ^ (pt)[idx]) & ~CONTIG_MASK) == (1ULL << ((b) + (shift))) \
> +     : !((pt)[i] & ~CONTIG_MASK))
> +
> +enum PTE_kind {
> +    PTE_kind_null,
> +    PTE_kind_leaf,
> +    PTE_kind_table,
> +};
> +
> +static bool pt_update_contig_markers(uint64_t *pt, unsigned int idx,
> +                                     unsigned int level, enum PTE_kind kind)
> +{
> +    unsigned int b, i = idx;
> +    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
> +
> +    ASSERT(idx < CONTIG_NR);
> +    ASSERT(!(pt[idx] & CONTIG_MASK));
> +
> +    /* Step 1: Reduce markers in lower numbered entries. */
> +    while ( i )
> +    {
> +        b = find_first_set_bit(i);
> +        i &= ~(1U << b);
> +        if ( GET_MARKER(pt[i]) > b )
> +            SET_MARKER(pt[i], b);

Can't you exit early when you find an entry that already has the
to-be-set contiguous marker <= b, as lower numbered entries will then
also be <= b'?

Ie:

if ( GET_MARKER(pt[i]) <= b )
    break;
else
    SET_MARKER(pt[i], b);

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 15/21] AMD/IOMMU: free all-empty page tables
  2022-04-25  8:42 ` [PATCH v4 15/21] AMD/IOMMU: free all-empty " Jan Beulich
@ 2022-05-10 13:30   ` Roger Pau Monné
  2022-05-18 10:18     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-10 13:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:42:19AM +0200, Jan Beulich wrote:
> When a page table ends up with no present entries left, it can be
> replaced by a non-present entry at the next higher level. The page table
> itself can then be scheduled for freeing.
> 
> Note that while its output isn't used there yet,
> pt_update_contig_markers() right away needs to be called in all places
> where entries get updated, not just the one where entries get cleared.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Some comments below.

> ---
> v4: Re-base over changes earlier in the series.
> v3: Re-base over changes earlier in the series.
> v2: New.
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -21,6 +21,9 @@
>  
>  #include "iommu.h"
>  
> +#define CONTIG_MASK IOMMU_PTE_CONTIG_MASK
> +#include <asm/pt-contig-markers.h>
> +
>  /* Given pfn and page table level, return pde index */
>  static unsigned int pfn_to_pde_idx(unsigned long pfn, unsigned int level)
>  {
> @@ -33,16 +36,20 @@ static unsigned int pfn_to_pde_idx(unsig
>  
>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
>                                                     unsigned long dfn,
> -                                                   unsigned int level)
> +                                                   unsigned int level,
> +                                                   bool *free)
>  {
>      union amd_iommu_pte *table, *pte, old;
> +    unsigned int idx = pfn_to_pde_idx(dfn, level);
>  
>      table = map_domain_page(_mfn(l1_mfn));
> -    pte = &table[pfn_to_pde_idx(dfn, level)];
> +    pte = &table[idx];
>      old = *pte;
>  
>      write_atomic(&pte->raw, 0);
>  
> +    *free = pt_update_contig_markers(&table->raw, idx, level, PTE_kind_null);
> +
>      unmap_domain_page(table);
>  
>      return old;
> @@ -85,7 +92,11 @@ static union amd_iommu_pte set_iommu_pte
>      if ( !old.pr || old.next_level ||
>           old.mfn != next_mfn ||
>           old.iw != iw || old.ir != ir )
> +    {
>          set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> +        pt_update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level),
> +                                 level, PTE_kind_leaf);

It would be better to call pt_update_contig_markers inside of
set_iommu_pde_present, but that would imply changing the parameters
passed to the function.  It's cumbersome (and error prone) to have to
pair calls to set_iommu_pde_present() with pt_update_contig_markers().
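
Just to illustrate (hypothetical only, not meant as the exact shape to
use): the helper would then need at least the table base and the index,
plus the level, in addition to what it takes today, e.g.

    static void set_iommu_pde_present(union amd_iommu_pte *table,
                                      unsigned int idx,
                                      unsigned long next_mfn,
                                      unsigned int next_level,
                                      unsigned int level, bool iw, bool ir);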

> +    }
>      else
>          old.pr = false; /* signal "no change" to the caller */
>  
> @@ -322,6 +333,9 @@ static int iommu_pde_from_dfn(struct dom
>              smp_wmb();
>              set_iommu_pde_present(pde, next_table_mfn, next_level, true,
>                                    true);
> +            pt_update_contig_markers(&next_table_vaddr->raw,
> +                                     pfn_to_pde_idx(dfn, level),
> +                                     level, PTE_kind_table);
>  
>              *flush_flags |= IOMMU_FLUSHF_modified;
>          }
> @@ -347,6 +361,9 @@ static int iommu_pde_from_dfn(struct dom
>                  next_table_mfn = mfn_x(page_to_mfn(table));
>                  set_iommu_pde_present(pde, next_table_mfn, next_level, true,
>                                        true);
> +                pt_update_contig_markers(&next_table_vaddr->raw,
> +                                         pfn_to_pde_idx(dfn, level),
> +                                         level, PTE_kind_table);
>              }
>              else /* should never reach here */
>              {
> @@ -474,8 +491,24 @@ int cf_check amd_iommu_unmap_page(
>  
>      if ( pt_mfn )
>      {
> +        bool free;
> +
>          /* Mark PTE as 'page not present'. */
> -        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
> +        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
> +
> +        while ( unlikely(free) && ++level < hd->arch.amd.paging_mode )
> +        {
> +            struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
> +
> +            if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn,
> +                                    flush_flags, false) )
> +                BUG();
> +            BUG_ON(!pt_mfn);
> +
> +            clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);

Not sure it's worth initializing free to false (at definition and
before each call to clear_iommu_pte_present), just in case we manage
to return early from clear_iommu_pte_present without having updated
'free'.
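
Ie. just defensively something like (a sketch on top of the hunk above):

        bool free = false;

        /* Mark PTE as 'page not present'. */
        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);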

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 16/21] VT-d: free all-empty page tables
  2022-04-25  8:42 ` [PATCH v4 16/21] VT-d: " Jan Beulich
  2022-04-27  4:09   ` Tian, Kevin
@ 2022-05-10 14:30   ` Roger Pau Monné
  2022-05-18 10:26     ` Jan Beulich
  1 sibling, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-10 14:30 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
> When a page table ends up with no present entries left, it can be
> replaced by a non-present entry at the next higher level. The page table
> itself can then be scheduled for freeing.
> 
> Note that while its output isn't used there yet,
> pt_update_contig_markers() right away needs to be called in all places
> where entries get updated, not just the one where entries get cleared.
> 
> Note further that while pt_update_contig_markers() updates perhaps
> several PTEs within the table, since these are changes to "avail" bits
> only I do not think that cache flushing would be needed afterwards. Such
> cache flushing (of entire pages, unless adding yet more logic to be more
> selective) would be quite noticeable performance-wise (very prominent
> during Dom0 boot).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v4: Re-base over changes earlier in the series.
> v3: Properly bound loop. Re-base over changes earlier in the series.
> v2: New.
> ---
> The hang during boot on my Latitude E6410 (see the respective code
> comment) was pretty close after iommu_enable_translation(). No errors,
> no watchdog would kick in, just sometimes the first few pixel lines of
> the next log message's (XEN) prefix would have made it out to the screen
> (and there's no serial there). It's been a lot of experimenting until I
> figured the workaround (which I consider ugly, but halfway acceptable).
> I've been trying hard to make sure the workaround wouldn't be masking a
> real issue, yet I'm still wary of it possibly doing so ... My best guess
> at this point is that on these old IOMMUs the ignored bits 52...61
> aren't really ignored for present entries, but also aren't "reserved"
> enough to trigger faults. This guess is from having tried to set other
> bits in this range (unconditionally, and with the workaround here in
> place), which yielded the same behavior.

Should we take Kevin's Reviewed-by as a heads up that bits 52..61 on
some? IOMMUs are not usable?

Would be good if we could get a more formal response I think.

> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -43,6 +43,9 @@
>  #include "vtd.h"
>  #include "../ats.h"
>  
> +#define CONTIG_MASK DMA_PTE_CONTIG_MASK
> +#include <asm/pt-contig-markers.h>
> +
>  /* dom_io is used as a sentinel for quarantined devices */
>  #define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
>  #define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
> @@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
>  
>              write_atomic(&pte->val, new_pte.val);
>              iommu_sync_cache(pte, sizeof(struct dma_pte));
> +            pt_update_contig_markers(&parent->val,
> +                                     address_level_offset(addr, level),

I think (unless previous patches in the series have changed this)
there already is an 'offset' local variable that you could use.

> +                                     level, PTE_kind_table);
>          }
>  
>          if ( --level == target )
> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>  
>      old = *pte;
>      dma_clear_pte(*pte);
> +    iommu_sync_cache(pte, sizeof(*pte));
> +
> +    while ( pt_update_contig_markers(&page->val,
> +                                     address_level_offset(addr, level),
> +                                     level, PTE_kind_null) &&
> +            ++level < min_pt_levels )
> +    {
> +        struct page_info *pg = maddr_to_page(pg_maddr);
> +
> +        unmap_vtd_domain_page(page);
> +
> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> +                                          false);
> +        BUG_ON(pg_maddr < PAGE_SIZE);
> +
> +        page = map_vtd_domain_page(pg_maddr);
> +        pte = &page[address_level_offset(addr, level)];
> +        dma_clear_pte(*pte);
> +        iommu_sync_cache(pte, sizeof(*pte));
> +
> +        *flush_flags |= IOMMU_FLUSHF_all;
> +        iommu_queue_free_pgtable(hd, pg);
> +    }

I think I'm setting myself up for trouble, but do we need to sync the
cache for the lower level entries if higher level ones are to be changed?

IOW, would it be fine to just flush the highest level modified PTE?
As the lower level ones won't be reachable anyway.

>      spin_unlock(&hd->arch.mapping_lock);
> -    iommu_sync_cache(pte, sizeof(struct dma_pte));
>  
>      unmap_vtd_domain_page(page);
>  
> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>      }
>  
>      *pte = new;
> -
>      iommu_sync_cache(pte, sizeof(struct dma_pte));
> +
> +    /*
> +     * While the (ab)use of PTE_kind_table here allows to save some work in
> +     * the function, the main motivation for it is that it avoids a so far
> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
> +     * based laptop.
> +     */
> +    pt_update_contig_markers(&page->val,
> +                             address_level_offset(dfn_to_daddr(dfn), level),
> +                             level,
> +                             (hd->platform_ops->page_sizes &
> +                              (1UL << level_to_offset_bits(level + 1))
> +                              ? PTE_kind_leaf : PTE_kind_table));

So this works because on what we believe to be affected models the
only supported page sizes are 4K?

Do we want to do the same with AMD if we don't allow 512G super pages?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings
  2022-04-25  8:43 ` [PATCH v4 17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings Jan Beulich
@ 2022-05-10 15:31   ` Roger Pau Monné
  2022-05-18 10:40     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-10 15:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Mon, Apr 25, 2022 at 10:43:16AM +0200, Jan Beulich wrote:
> When a page table ends up with all contiguous entries (including all
> identical attributes), it can be replaced by a superpage entry at the
> next higher level. The page table itself can then be scheduled for
> freeing.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Unlike the freeing of all-empty page tables, this causes quite a bit of
> back and forth for PV domains, due to their mapping/unmapping of pages
> when they get converted to/from being page tables. It may therefore be
> worth considering to delay re-coalescing a little, to avoid doing so
> when the superpage would otherwise get split again pretty soon. But I
> think this would better be the subject of a separate change anyway.
> 
> Of course this could also be helped by more "aware" kernel side
> behavior: They could avoid immediately mapping freed page tables
> writable again, in anticipation of re-using that same page for another
> page table elsewhere.
> ---
> v4: Re-base over changes earlier in the series.
> v3: New.
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -81,7 +81,8 @@ static union amd_iommu_pte set_iommu_pte
>                                                   unsigned long dfn,
>                                                   unsigned long next_mfn,
>                                                   unsigned int level,
> -                                                 bool iw, bool ir)
> +                                                 bool iw, bool ir,
> +                                                 bool *contig)
>  {
>      union amd_iommu_pte *table, *pde, old;
>  
> @@ -94,11 +95,15 @@ static union amd_iommu_pte set_iommu_pte
>           old.iw != iw || old.ir != ir )
>      {
>          set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> -        pt_update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level),
> -                                 level, PTE_kind_leaf);
> +        *contig = pt_update_contig_markers(&table->raw,
> +                                           pfn_to_pde_idx(dfn, level),
> +                                           level, PTE_kind_leaf);
>      }
>      else
> +    {
>          old.pr = false; /* signal "no change" to the caller */
> +        *contig = false;

So we assume that any caller getting contig == true must have acted
and coalesced the page table?

Might be worth a comment, to note that the function assumes that a
previous return of contig == true will have coalesced the page table
and hence a "no change" PTE write is not expected to happen on a
contig page table.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 18/21] VT-d: replace all-contiguous page tables by superpage mappings
  2022-04-25  8:43 ` [PATCH v4 18/21] VT-d: " Jan Beulich
@ 2022-05-11 11:08   ` Roger Pau Monné
  2022-05-18 10:44     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-11 11:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Mon, Apr 25, 2022 at 10:43:45AM +0200, Jan Beulich wrote:
> When a page table ends up with all contiguous entries (including all
> identical attributes), it can be replaced by a superpage entry at the
> next higher level. The page table itself can then be scheduled for
> freeing.
> 
> The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
> for whenever we (and obviously hardware) start supporting 512G mappings.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Like on the AMD side, I wonder whether you can get away with only
doing a cache flush for the last (highest level) PTE, as the lower
ones won't be reachable anyway, as the page-table is freed.

Then the flush could be done outside of the locked region.

The rest LGTM.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 19/21] IOMMU/x86: add perf counters for page table splitting / coalescing
  2022-04-25  8:44 ` [PATCH v4 19/21] IOMMU/x86: add perf counters for page table splitting / coalescing Jan Beulich
@ 2022-05-11 13:48   ` Roger Pau Monné
  2022-05-18 11:39     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-11 13:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Mon, Apr 25, 2022 at 10:44:11AM +0200, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin tian <kevin.tian@intel.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Would be helpful to also have those per-guest I think.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 20/21] VT-d: fold iommu_flush_iotlb{,_pages}()
  2022-04-25  8:44 ` [PATCH v4 20/21] VT-d: fold iommu_flush_iotlb{,_pages}() Jan Beulich
  2022-04-27  4:12   ` Tian, Kevin
@ 2022-05-11 13:50   ` Roger Pau Monné
  1 sibling, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-11 13:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Mon, Apr 25, 2022 at 10:44:38AM +0200, Jan Beulich wrote:
> With iommu_flush_iotlb_all() gone, iommu_flush_iotlb_pages() is merely a
> wrapper around the not otherwise called iommu_flush_iotlb(). Fold both
> functions.
> 
> No functional change intended.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 21/21] VT-d: fold dma_pte_clear_one() into its only caller
  2022-04-25  8:45 ` [PATCH v4 21/21] VT-d: fold dma_pte_clear_one() into its only caller Jan Beulich
  2022-04-27  4:13   ` Tian, Kevin
@ 2022-05-11 13:57   ` Roger Pau Monné
  1 sibling, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-11 13:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Mon, Apr 25, 2022 at 10:45:10AM +0200, Jan Beulich wrote:
> This way intel_iommu_unmap_page() ends up quite a bit more similar to
> intel_iommu_map_page().
> 
> No functional change intended.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in page tables
  2022-05-06 13:25   ` Roger Pau Monné
@ 2022-05-18 10:06     ` Jan Beulich
  2022-05-20 10:22       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-18 10:06 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 06.05.2022 15:25, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:41:23AM +0200, Jan Beulich wrote:
>> --- /dev/null
>> +++ b/xen/arch/x86/include/asm/pt-contig-markers.h
>> @@ -0,0 +1,105 @@
>> +#ifndef __ASM_X86_PT_CONTIG_MARKERS_H
>> +#define __ASM_X86_PT_CONTIG_MARKERS_H
>> +
>> +/*
>> + * Short of having function templates in C, the function defined below is
>> + * intended to be used by multiple parties interested in recording the
>> + * degree of contiguity in mappings by a single page table.
>> + *
>> + * Scheme: Every entry records the order of contiguous successive entries,
>> + * up to the maximum order covered by that entry (which is the number of
>> + * clear low bits in its index, with entry 0 being the exception using
>> + * the base-2 logarithm of the number of entries in a single page table).
>> + * While a few entries need touching upon update, knowing whether the
>> + * table is fully contiguous (and can hence be replaced by a higher level
>> + * leaf entry) is then possible by simply looking at entry 0's marker.
>> + *
>> + * Prereqs:
>> + * - CONTIG_MASK needs to be #define-d, to a value having at least 4
>> + *   contiguous bits (ignored by hardware), before including this file,
>> + * - page tables to be passed here need to be initialized with correct
>> + *   markers.
> 
> Not sure it's very relevant, but might be worth adding that:
> 
> - Null entries must have the PTE zeroed except for the CONTIG_MASK
>   region in order to be considered as inactive.

NP, I've added an item along these lines.

>> +static bool pt_update_contig_markers(uint64_t *pt, unsigned int idx,
>> +                                     unsigned int level, enum PTE_kind kind)
>> +{
>> +    unsigned int b, i = idx;
>> +    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
>> +
>> +    ASSERT(idx < CONTIG_NR);
>> +    ASSERT(!(pt[idx] & CONTIG_MASK));
>> +
>> +    /* Step 1: Reduce markers in lower numbered entries. */
>> +    while ( i )
>> +    {
>> +        b = find_first_set_bit(i);
>> +        i &= ~(1U << b);
>> +        if ( GET_MARKER(pt[i]) > b )
>> +            SET_MARKER(pt[i], b);
> 
> Can't you exit early when you find an entry that already has the
> to-be-set contiguous marker <= b, as lower numbered entries will then
> also be <= b'?
> 
> Ie:
> 
> if ( GET_MARKER(pt[i]) <= b )
>     break;
> else
>     SET_MARKER(pt[i], b);

Almost - I think it would need to be 

        if ( GET_MARKER(pt[i]) < b )
            break;
        if ( GET_MARKER(pt[i]) > b )
            SET_MARKER(pt[i], b);

or, accepting redundant updates, 

        if ( GET_MARKER(pt[i]) < b )
            break;
        SET_MARKER(pt[i], b);

. Neither the redundant updates nor the extra (easily mis-predicted)
conditional looked very appealing to me, but I guess I could change
this if you are convinced that's better than continuing a loop with
at most 9 (typically less) iterations.
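
For completeness, with the latter variant step 1 as a whole would then
look like (sketch only):

    /* Step 1: Reduce markers in lower numbered entries. */
    while ( i )
    {
        b = find_first_set_bit(i);
        i &= ~(1U << b);
        if ( GET_MARKER(pt[i]) < b )
            break;
        SET_MARKER(pt[i], b); /* possibly redundant when == b */
    }

E.g. for idx = 6 the loop would visit entry 4 with b = 1 and then entry 0
with b = 2.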

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 15/21] AMD/IOMMU: free all-empty page tables
  2022-05-10 13:30   ` Roger Pau Monné
@ 2022-05-18 10:18     ` Jan Beulich
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-05-18 10:18 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 10.05.2022 15:30, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:42:19AM +0200, Jan Beulich wrote:
>> When a page table ends up with no present entries left, it can be
>> replaced by a non-present entry at the next higher level. The page table
>> itself can then be scheduled for freeing.
>>
>> Note that while its output isn't used there yet,
>> pt_update_contig_markers() right away needs to be called in all places
>> where entries get updated, not just the one where entries get cleared.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> @@ -85,7 +92,11 @@ static union amd_iommu_pte set_iommu_pte
>>      if ( !old.pr || old.next_level ||
>>           old.mfn != next_mfn ||
>>           old.iw != iw || old.ir != ir )
>> +    {
>>          set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>> +        pt_update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level),
>> +                                 level, PTE_kind_leaf);
> 
> It would be better to call pt_update_contig_markers inside of
> set_iommu_pde_present, but that would imply changing the parameters
> passed to the function.  It's cumbersome (and error prone) to have to
> pair calls to set_iommu_pde_present() with pt_update_contig_markers().

Right, but then already the sheer number of parameters would become
excessive (imo).

>> @@ -474,8 +491,24 @@ int cf_check amd_iommu_unmap_page(
>>  
>>      if ( pt_mfn )
>>      {
>> +        bool free;
>> +
>>          /* Mark PTE as 'page not present'. */
>> -        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
>> +        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
>> +
>> +        while ( unlikely(free) && ++level < hd->arch.amd.paging_mode )
>> +        {
>> +            struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
>> +
>> +            if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn,
>> +                                    flush_flags, false) )
>> +                BUG();
>> +            BUG_ON(!pt_mfn);
>> +
>> +            clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
> 
> Not sure it's worth initializing free to false (at definition and
> before each call to clear_iommu_pte_present), just in case we manage
> to return early from clear_iommu_pte_present without having updated
> 'free'.

There's no such path now, so I'd view it as dead code to do so. If
necessary a patch introducing such an early exit would need to deal
with this. But even then I'd rather see this being dealt with right
in clear_iommu_pte_present().

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 16/21] VT-d: free all-empty page tables
  2022-05-10 14:30   ` Roger Pau Monné
@ 2022-05-18 10:26     ` Jan Beulich
  2022-05-20  0:38       ` Tian, Kevin
  2022-05-20 11:13       ` Roger Pau Monné
  0 siblings, 2 replies; 106+ messages in thread
From: Jan Beulich @ 2022-05-18 10:26 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 10.05.2022 16:30, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
>> When a page table ends up with no present entries left, it can be
>> replaced by a non-present entry at the next higher level. The page table
>> itself can then be scheduled for freeing.
>>
>> Note that while its output isn't used there yet,
>> pt_update_contig_markers() right away needs to be called in all places
>> where entries get updated, not just the one where entries get cleared.
>>
>> Note further that while pt_update_contig_markers() updates perhaps
>> several PTEs within the table, since these are changes to "avail" bits
>> only I do not think that cache flushing would be needed afterwards. Such
>> cache flushing (of entire pages, unless adding yet more logic to be more
>> selective) would be quite noticeable performance-wise (very prominent
>> during Dom0 boot).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v4: Re-base over changes earlier in the series.
>> v3: Properly bound loop. Re-base over changes earlier in the series.
>> v2: New.
>> ---
>> The hang during boot on my Latitude E6410 (see the respective code
>> comment) was pretty close after iommu_enable_translation(). No errors,
>> no watchdog would kick in, just sometimes the first few pixel lines of
>> the next log message's (XEN) prefix would have made it out to the screen
>> (and there's no serial there). It's been a lot of experimenting until I
>> figured the workaround (which I consider ugly, but halfway acceptable).
>> I've been trying hard to make sure the workaround wouldn't be masking a
>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>> at this point is that on these old IOMMUs the ignored bits 52...61
>> aren't really ignored for present entries, but also aren't "reserved"
>> enough to trigger faults. This guess is from having tried to set other
>> bits in this range (unconditionally, and with the workaround here in
>> place), which yielded the same behavior.
> 
> Should we take Kevin's Reviewed-by as a heads up that bits 52..61 on
> some? IOMMUs are not usable?
> 
> Would be good if we could get a more formal response I think.

A more formal response would be nice, but given the age of the affected
hardware I don't expect anything more will be done there by Intel.

>> @@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
>>  
>>              write_atomic(&pte->val, new_pte.val);
>>              iommu_sync_cache(pte, sizeof(struct dma_pte));
>> +            pt_update_contig_markers(&parent->val,
>> +                                     address_level_offset(addr, level),
> 
> I think (unless previous patches in the series have changed this)
> there already is an 'offset' local variable that you could use.

The variable is clobbered by "IOMMU/x86: prefill newly allocate page
tables".

>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>>  
>>      old = *pte;
>>      dma_clear_pte(*pte);
>> +    iommu_sync_cache(pte, sizeof(*pte));
>> +
>> +    while ( pt_update_contig_markers(&page->val,
>> +                                     address_level_offset(addr, level),
>> +                                     level, PTE_kind_null) &&
>> +            ++level < min_pt_levels )
>> +    {
>> +        struct page_info *pg = maddr_to_page(pg_maddr);
>> +
>> +        unmap_vtd_domain_page(page);
>> +
>> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
>> +                                          false);
>> +        BUG_ON(pg_maddr < PAGE_SIZE);
>> +
>> +        page = map_vtd_domain_page(pg_maddr);
>> +        pte = &page[address_level_offset(addr, level)];
>> +        dma_clear_pte(*pte);
>> +        iommu_sync_cache(pte, sizeof(*pte));
>> +
>> +        *flush_flags |= IOMMU_FLUSHF_all;
>> +        iommu_queue_free_pgtable(hd, pg);
>> +    }
> 
> I think I'm setting myself up for trouble, but do we need to sync the
> cache for the lower level entries if higher level ones are to be changed?
> 
> IOW, would it be fine to just flush the highest level modified PTE?
> As the lower level ones won't be reachable anyway.

I definitely want to err on the safe side here. If later we can
prove that some cache flush is unneeded, I'd be happy to see it
go away.

>> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>>      }
>>  
>>      *pte = new;
>> -
>>      iommu_sync_cache(pte, sizeof(struct dma_pte));
>> +
>> +    /*
>> +     * While the (ab)use of PTE_kind_table here allows to save some work in
>> +     * the function, the main motivation for it is that it avoids a so far
>> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
>> +     * based laptop.
>> +     */
>> +    pt_update_contig_markers(&page->val,
>> +                             address_level_offset(dfn_to_daddr(dfn), level),
>> +                             level,
>> +                             (hd->platform_ops->page_sizes &
>> +                              (1UL << level_to_offset_bits(level + 1))
>> +                              ? PTE_kind_leaf : PTE_kind_table));
> 
> So this works because on what we believe to be affected models the
> only supported page sizes are 4K?

Yes.

> Do we want to do the same with AMD if we don't allow 512G super pages?

Why? They don't have a similar flaw.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings
  2022-05-10 15:31   ` Roger Pau Monné
@ 2022-05-18 10:40     ` Jan Beulich
  2022-05-20 10:35       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-18 10:40 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On 10.05.2022 17:31, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:43:16AM +0200, Jan Beulich wrote:
>> @@ -94,11 +95,15 @@ static union amd_iommu_pte set_iommu_pte
>>           old.iw != iw || old.ir != ir )
>>      {
>>          set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>> -        pt_update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level),
>> -                                 level, PTE_kind_leaf);
>> +        *contig = pt_update_contig_markers(&table->raw,
>> +                                           pfn_to_pde_idx(dfn, level),
>> +                                           level, PTE_kind_leaf);
>>      }
>>      else
>> +    {
>>          old.pr = false; /* signal "no change" to the caller */
>> +        *contig = false;
> 
> So we assume that any caller getting contig == true must have acted
> and coalesced the page table?

Yes, except that I wouldn't use "must", but "would". It's not a
requirement after all, functionality-wise all will be fine without
re-coalescing.

> Might be worth a comment, to note that the function assumes that a
> previous return of contig == true will have coalesced the page table
> and hence a "no change" PTE write is not expected to happen on a
> contig page table.

I'm not convinced, as there's effectively only one caller,
amd_iommu_map_page(). I also don't see why "no change" would be a
problem. "No change" can't result in a fully contiguous page table
if the page table wasn't fully contiguous already before (at which
point it would have been replaced by a superpage).

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 18/21] VT-d: replace all-contiguous page tables by superpage mappings
  2022-05-11 11:08   ` Roger Pau Monné
@ 2022-05-18 10:44     ` Jan Beulich
  2022-05-20 10:38       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-18 10:44 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 11.05.2022 13:08, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:43:45AM +0200, Jan Beulich wrote:
>> When a page table ends up with all contiguous entries (including all
>> identical attributes), it can be replaced by a superpage entry at the
>> next higher level. The page table itself can then be scheduled for
>> freeing.
>>
>> The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
>> for whenever we (and obviously hardware) start supporting 512G mappings.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> 
> Like on the AMD side, I wonder whether you can get away with only

FTAOD I take it you mean "like on the all-empty side", as on AMD we
don't need to do any cache flushing?

> doing a cache flush for the last (highest level) PTE, as the lower
> ones won't be reachable anyway, as the page-table is freed.

But that freeing will happen only later, with a TLB flush in between.
Until then we had better make sure the IOMMU sees what was written,
even if its reading a stale value _should_ be benign.

Jan

> Then the flush could be done outside of the locked region.
> 
> The rest LGTM.
> 
> Thanks, Roger.
> 



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 19/21] IOMMU/x86: add perf counters for page table splitting / coalescing
  2022-05-11 13:48   ` Roger Pau Monné
@ 2022-05-18 11:39     ` Jan Beulich
  2022-05-20 10:41       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-18 11:39 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 11.05.2022 15:48, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:44:11AM +0200, Jan Beulich wrote:
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Kevin tian <kevin.tian@intel.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> Would be helpful to also have those per-guest I think.

Perhaps, but we don't have per-guest counter infrastructure, do we?

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables
  2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
                   ` (20 preceding siblings ...)
  2022-04-25  8:45 ` [PATCH v4 21/21] VT-d: fold dma_pte_clear_one() into its only caller Jan Beulich
@ 2022-05-18 12:50 ` Jan Beulich
  21 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-05-18 12:50 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, Roger Pau Monné, xen-devel

On 25.04.2022 10:29, Jan Beulich wrote:
> For a long time we've been rather inefficient with IOMMU page table
> management when not sharing page tables, i.e. in particular for PV (and
> further specifically also for PV Dom0) and AMD (where nowadays we never
> share page tables). While up to about 2.5 years ago AMD code had logic
> to un-shatter page mappings, that logic was ripped out for being buggy
> (XSA-275 plus follow-on).
> 
> This series enables use of large pages in AMD and Intel (VT-d) code;
> Arm is presently not in need of any enabling as pagetables are always
> shared there. It also augments PV Dom0 creation with suitable explicit
> IOMMU mapping calls to facilitate use of large pages there. Depending
> on the amount of memory handed to Dom0 this improves booting time
> (latency until Dom0 actually starts) quite a bit; subsequent shattering
> of some of the large pages may of course consume some of the saved time.
> 
> Known fallout has been spelled out here:
> https://lists.xen.org/archives/html/xen-devel/2021-08/msg00781.html
> 
> There's a dependency on 'PCI: replace "secondary" flavors of
> PCI_{DEVFN,BDF,SBDF}()', in particular by patch 8. Its prereq patch
> still lacks an Arm ack, so it couldn't go in yet.
> 
> I'm inclined to say "of course" there are also a few seemingly unrelated
> changes included here, which I just came to consider necessary or at
> least desirable (in part for having been in need of adjustment for a
> long time) along the way. Some of these changes are likely independent
> of the bulk of the work here, and hence may be fine to go in ahead of
> earlier patches.
> 
> See individual patches for details on the v4 changes.
> 
> 01: AMD/IOMMU: correct potentially-UB shifts
> 02: IOMMU: simplify unmap-on-error in iommu_map()
> 03: IOMMU: add order parameter to ->{,un}map_page() hooks
> 04: IOMMU: have iommu_{,un}map() split requests into largest possible chunks

These first 4 patches are in principle ready to go in. If only there
wasn't (sadly once again) the unclear state with comments on the
first 2 that you had given on Apr 27. I did reply verbally, and hence
I'm intending to commit these 4 by the end of the week - on the
assumption that no response to my replies means I've sufficiently
addressed the concerns - unless I hear back otherwise.

Thanks for your understanding, Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-06 11:16   ` Roger Pau Monné
@ 2022-05-19 12:12     ` Jan Beulich
  2022-05-20 10:47       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-19 12:12 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 06.05.2022 13:16, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
>> ---
>> An alternative to the ASSERT()s added to set_iommu_ptes_present() would
>> be to make the function less general-purpose; it's used in a single
>> place only after all (i.e. it might as well be folded into its only
>> caller).
> 
> I would think adding a comment that the function requires the PDE to
> be empty would be good.

But that's not the case - what the function expects to be clear is
what is being ASSERT()ed.

>  Also given the current usage we could drop
> the nr_ptes parameter and just name the function fill_pde() or
> similar.

Right, but that would want to be a separate change.

>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
>>  
>>      while ( nr_ptes-- )
>>      {
>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>> +        ASSERT(!pde->next_level);
>> +        ASSERT(!pde->u);
>> +
>> +        if ( pde > table )
>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
>> +        else
>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
> 
> I think PAGETABLE_ORDER would be clearer here.

I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
in IOMMU code afaics.

> While here, could you also assert that next_mfn matches the contiguous
> order currently set in the PTE?

I can, yet that wouldn't be here, but outside (ahead) of the loop.

>> @@ -717,7 +729,7 @@ static int fill_qpt(union amd_iommu_pte
>>                   * page table pages, and the resulting allocations are always
>>                   * zeroed.
>>                   */
>> -                pgs[level] = iommu_alloc_pgtable(hd);
>> +                pgs[level] = iommu_alloc_pgtable(hd, 0);
> 
> Is it worth not setting up the contiguous data for quarantine page
> tables?

Well, it's (slightly) less code, and (hopefully) faster due to the use
of clear_page().

> I think it's fine now given the current code, but you having added
> ASSERTs that the contig data is correct in set_iommu_ptes_present()
> makes me wonder whether we could trigger those in the future.

I'd like to deal with that if and when needed.

> I understand that the contig data is not helpful for quarantine page
> tables, but still doesn't seem bad to have it just for coherency.

You realize that the markers all being zero in a table is a valid
state, functionality-wise? It would merely mean no re-coalescing
until respective entries were touched (updated) at least once.

>> @@ -276,7 +280,7 @@ struct dma_pte {
>>  #define dma_pte_write(p) (dma_pte_prot(p) & DMA_PTE_WRITE)
>>  #define dma_pte_addr(p) ((p).val & PADDR_MASK & PAGE_MASK_4K)
>>  #define dma_set_pte_addr(p, addr) do {\
>> -            (p).val |= ((addr) & PAGE_MASK_4K); } while (0)
>> +            (p).val |= ((addr) & PADDR_MASK & PAGE_MASK_4K); } while (0)
> 
> While I'm not opposed to this, I would assume that addr is not
> expected to contain bit cleared by PADDR_MASK? (or PAGE_MASK_4K FWIW)

Indeed. But I'd prefer to be on the safe side, now that some of the
bits have gained a different meaning.

>> @@ -538,7 +539,29 @@ struct page_info *iommu_alloc_pgtable(st
>>          return NULL;
>>  
>>      p = __map_domain_page(pg);
>> -    clear_page(p);
>> +
>> +    if ( contig_mask )
>> +    {
>> +        /* See pt-contig-markers.h for a description of the marker scheme. */
>> +        unsigned int i, shift = find_first_set_bit(contig_mask);
>> +
>> +        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);
> 
> I think it might be clearer to use PAGETABLE_ORDER rather than
> PAGE_SHIFT - 3.

See above.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* RE: [PATCH v4 16/21] VT-d: free all-empty page tables
  2022-05-18 10:26     ` Jan Beulich
@ 2022-05-20  0:38       ` Tian, Kevin
  2022-05-20 11:13       ` Roger Pau Monné
  1 sibling, 0 replies; 106+ messages in thread
From: Tian, Kevin @ 2022-05-20  0:38 UTC (permalink / raw)
  To: Beulich, Jan, Pau Monné, Roger
  Cc: xen-devel, Cooper, Andrew, Paul Durrant

> From: Jan Beulich
> Sent: Wednesday, May 18, 2022 6:26 PM
> 
> On 10.05.2022 16:30, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
> >> When a page table ends up with no present entries left, it can be
> >> replaced by a non-present entry at the next higher level. The page table
> >> itself can then be scheduled for freeing.
> >>
> >> Note that while its output isn't used there yet,
> >> pt_update_contig_markers() right away needs to be called in all places
> >> where entries get updated, not just the one where entries get cleared.
> >>
> >> Note further that while pt_update_contig_markers() updates perhaps
> >> several PTEs within the table, since these are changes to "avail" bits
> >> only I do not think that cache flushing would be needed afterwards. Such
> >> cache flushing (of entire pages, unless adding yet more logic to be more
> >> selective) would be quite noticeable performance-wise (very prominent
> >> during Dom0 boot).
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> v4: Re-base over changes earlier in the series.
> >> v3: Properly bound loop. Re-base over changes earlier in the series.
> >> v2: New.
> >> ---
> >> The hang during boot on my Latitude E6410 (see the respective code
> >> comment) was pretty close after iommu_enable_translation(). No errors,
> >> no watchdog would kick in, just sometimes the first few pixel lines of
> >> the next log message's (XEN) prefix would have made it out to the screen
> >> (and there's no serial there). It's been a lot of experimenting until I
> >> figured the workaround (which I consider ugly, but halfway acceptable).
> >> I've been trying hard to make sure the workaround wouldn't be masking a
> >> real issue, yet I'm still wary of it possibly doing so ... My best guess
> >> at this point is that on these old IOMMUs the ignored bits 52...61
> >> aren't really ignored for present entries, but also aren't "reserved"
> >> enough to trigger faults. This guess is from having tried to set other
> >> bits in this range (unconditionally, and with the workaround here in
> >> place), which yielded the same behavior.
> >
> > Should we take Kevin's Reviewed-by as a heads up that bits 52..61 on
> > some? IOMMUs are not usable?
> >
> > Would be good if we could get a more formal response I think.
> 
> A more formal response would be nice, but given the age of the affected
> hardware I don't expect anything more will be done there by Intel.
> 

I didn't hear a response on this open question internally.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in page tables
  2022-05-18 10:06     ` Jan Beulich
@ 2022-05-20 10:22       ` Roger Pau Monné
  2022-05-20 10:59         ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 10:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Wed, May 18, 2022 at 12:06:29PM +0200, Jan Beulich wrote:
> On 06.05.2022 15:25, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:41:23AM +0200, Jan Beulich wrote:
> >> --- /dev/null
> >> +++ b/xen/arch/x86/include/asm/pt-contig-markers.h
> >> @@ -0,0 +1,105 @@
> >> +#ifndef __ASM_X86_PT_CONTIG_MARKERS_H
> >> +#define __ASM_X86_PT_CONTIG_MARKERS_H
> >> +
> >> +/*
> >> + * Short of having function templates in C, the function defined below is
> >> + * intended to be used by multiple parties interested in recording the
> >> + * degree of contiguity in mappings by a single page table.
> >> + *
> >> + * Scheme: Every entry records the order of contiguous successive entries,
> >> + * up to the maximum order covered by that entry (which is the number of
> >> + * clear low bits in its index, with entry 0 being the exception using
> >> + * the base-2 logarithm of the number of entries in a single page table).
> >> + * While a few entries need touching upon update, knowing whether the
> >> + * table is fully contiguous (and can hence be replaced by a higher level
> >> + * leaf entry) is then possible by simply looking at entry 0's marker.
> >> + *
> >> + * Prereqs:
> >> + * - CONTIG_MASK needs to be #define-d, to a value having at least 4
> >> + *   contiguous bits (ignored by hardware), before including this file,
> >> + * - page tables to be passed here need to be initialized with correct
> >> + *   markers.
> > 
> > Not sure it's very relevant, but might be worth adding that:
> > 
> > - Null entries must have the PTE zeroed except for the CONTIG_MASK
> >   region in order to be considered as inactive.
> 
> NP, I've added an item along these lines.
> 
> >> +static bool pt_update_contig_markers(uint64_t *pt, unsigned int idx,
> >> +                                     unsigned int level, enum PTE_kind kind)
> >> +{
> >> +    unsigned int b, i = idx;
> >> +    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
> >> +
> >> +    ASSERT(idx < CONTIG_NR);
> >> +    ASSERT(!(pt[idx] & CONTIG_MASK));
> >> +
> >> +    /* Step 1: Reduce markers in lower numbered entries. */
> >> +    while ( i )
> >> +    {
> >> +        b = find_first_set_bit(i);
> >> +        i &= ~(1U << b);
> >> +        if ( GET_MARKER(pt[i]) > b )
> >> +            SET_MARKER(pt[i], b);
> > 
> > Can't you exit early when you find an entry that already has the
> > to-be-set contiguous marker <= b, as lower numbered entries will then
> > also be <= b'?
> > 
> > Ie:
> > 
> > if ( GET_MARKER(pt[i]) <= b )
> >     break;
> > else
> >     SET_MARKER(pt[i], b);
> 
> Almost - I think it would need to be 
> 
>         if ( GET_MARKER(pt[i]) < b )
>             break;
>         if ( GET_MARKER(pt[i]) > b )
>             SET_MARKER(pt[i], b);

I guess I'm slightly confused, but if the marker at i is <= b, then all
following markers will also be <= b, and hence could be skipped?

Not sure why we need to keep iterating if GET_MARKER(pt[i]) == b.

FWIW, you could even do:

if ( GET_MARKER(pt[i]) <= b )
    break;
SET_MARKER(pt[i], b);

Which would keep the conditionals to 1 like it currently is.

> 
> or, accepting redundant updates, 
> 
>         if ( GET_MARKER(pt[i]) < b )
>             break;
>         SET_MARKER(pt[i], b);
> 
> . Neither the redundant updates nor the extra (easily mis-predicted)
> conditional looked very appealing to me, but I guess I could change
> this if you are convinced that's better than continuing a loop with
> at most 9 (typically less) iterations.

Well, I think I at least partly understood the logic.  Not sure
whether it's worth adding the conditional or just assuming that
continuing the loop is going to be cheaper.  Might be worth adding a
comment that we choose to explicitly not add an extra conditional to
check for early exit, because we assume that to be more expensive than
just continuing.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings
  2022-05-18 10:40     ` Jan Beulich
@ 2022-05-20 10:35       ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 10:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant

On Wed, May 18, 2022 at 12:40:59PM +0200, Jan Beulich wrote:
> On 10.05.2022 17:31, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:43:16AM +0200, Jan Beulich wrote:
> >> @@ -94,11 +95,15 @@ static union amd_iommu_pte set_iommu_pte
> >>           old.iw != iw || old.ir != ir )
> >>      {
> >>          set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> >> -        pt_update_contig_markers(&table->raw, pfn_to_pde_idx(dfn, level),
> >> -                                 level, PTE_kind_leaf);
> >> +        *contig = pt_update_contig_markers(&table->raw,
> >> +                                           pfn_to_pde_idx(dfn, level),
> >> +                                           level, PTE_kind_leaf);
> >>      }
> >>      else
> >> +    {
> >>          old.pr = false; /* signal "no change" to the caller */
> >> +        *contig = false;
> > 
> > So we assume that any caller getting contig == true must have acted
> > and coalesced the page table?
> 
> Yes, except that I wouldn't use "must", but "would". It's not a
> requirement after all, functionality-wise all will be fine without
> re-coalescing.
> 
> > Might be worth a comment, to note that the function assumes that a
> > previous return of contig == true will have coalesced the page table
> > and hence a "no change" PTE write is not expected to happen on a
> > contig page table.
> 
> I'm not convinced, as there's effectively only one caller,
> amd_iommu_map_page(). I also don't see why "no change" would be a
> problem. "No change" can't result in a fully contiguous page table
> if the page table wasn't fully contiguous already before (at which
> point it would have been replaced by a superpage).

Right, I agree, it's just that I would have preferred the result from
set_iommu_pte_present() to be consistent, ie: repeated calls to it
using the same PTE should set contig to the same value.  Anyway,
that's not relevant to any current callers, so:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 18/21] VT-d: replace all-contiguous page tables by superpage mappings
  2022-05-18 10:44     ` Jan Beulich
@ 2022-05-20 10:38       ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 10:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Wed, May 18, 2022 at 12:44:06PM +0200, Jan Beulich wrote:
> On 11.05.2022 13:08, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:43:45AM +0200, Jan Beulich wrote:
> >> When a page table ends up with all contiguous entries (including all
> >> identical attributes), it can be replaced by a superpage entry at the
> >> next higher level. The page table itself can then be scheduled for
> >> freeing.
> >>
> >> The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
> >> for whenever we (and obviously hardware) start supporting 512G mappings.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > 
> > Like on the AMD side, I wonder whether you can get away with only
> 
> FTAOD I take it you mean "like on the all-empty side", as on AMD we
> don't need to do any cache flushing?

Heh, yes, sorry.

> > doing a cache flush for the last (highest level) PTE, as the lower
> > ones won't be reachable anyway, as the page-table is freed.
> 
> But that freeing will happen only later, with a TLB flush in between.
> Until then we had better make sure the IOMMU sees what was written,
> even if its reading a stale value _should_ be benign.

Hm, but when doing the TLB flush the paging structures will already be
fully updated, and the top level visible entry will have its cache
flushed, so the lower ones would never be reachable AFAICT.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 19/21] IOMMU/x86: add perf counters for page table splitting / coalescing
  2022-05-18 11:39     ` Jan Beulich
@ 2022-05-20 10:41       ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 10:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Wed, May 18, 2022 at 01:39:02PM +0200, Jan Beulich wrote:
> On 11.05.2022 15:48, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:44:11AM +0200, Jan Beulich wrote:
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> Reviewed-by: Kevin tian <kevin.tian@intel.com>
> > 
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> > Would be helpful to also have those per-guest I think.
> 
> Perhaps, but we don't have per-guest counter infrastructure, do we?

No, I don't think so?  Would be nice, but I don't see us doing it any
time soon.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-19 12:12     ` Jan Beulich
@ 2022-05-20 10:47       ` Roger Pau Monné
  2022-05-20 11:11         ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 10:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
> On 06.05.2022 13:16, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
> >> ---
> >> An alternative to the ASSERT()s added to set_iommu_ptes_present() would
> >> be to make the function less general-purpose; it's used in a single
> >> place only after all (i.e. it might as well be folded into its only
> >> caller).
> > 
> > I would think adding a comment that the function requires the PDE to
> > be empty would be good.
> 
> But that's not the case - what the function expects to be clear is
> what is being ASSERT()ed.
> 
> >  Also given the current usage we could drop
> > the nr_ptes parameter and just name the function fill_pde() or
> > similar.
> 
> Right, but that would want to be a separate change.
> 
> >> --- a/xen/drivers/passthrough/amd/iommu_map.c
> >> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> >> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
> >>  
> >>      while ( nr_ptes-- )
> >>      {
> >> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> >> +        ASSERT(!pde->next_level);
> >> +        ASSERT(!pde->u);
> >> +
> >> +        if ( pde > table )
> >> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> >> +        else
> >> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
> > 
> > I think PAGETABLE_ORDER would be clearer here.
> 
> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
> in IOMMU code afaics.

Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
sure what the rule is for declaring that PAGE_SHIFT is fine to use in
IOMMU code but not PAGETABLE_ORDER.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in page tables
  2022-05-20 10:22       ` Roger Pau Monné
@ 2022-05-20 10:59         ` Jan Beulich
  2022-05-20 11:27           ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-20 10:59 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 20.05.2022 12:22, Roger Pau Monné wrote:
> On Wed, May 18, 2022 at 12:06:29PM +0200, Jan Beulich wrote:
>> On 06.05.2022 15:25, Roger Pau Monné wrote:
>>> On Mon, Apr 25, 2022 at 10:41:23AM +0200, Jan Beulich wrote:
>>>> --- /dev/null
>>>> +++ b/xen/arch/x86/include/asm/pt-contig-markers.h
>>>> @@ -0,0 +1,105 @@
>>>> +#ifndef __ASM_X86_PT_CONTIG_MARKERS_H
>>>> +#define __ASM_X86_PT_CONTIG_MARKERS_H
>>>> +
>>>> +/*
>>>> + * Short of having function templates in C, the function defined below is
>>>> + * intended to be used by multiple parties interested in recording the
>>>> + * degree of contiguity in mappings by a single page table.
>>>> + *
>>>> + * Scheme: Every entry records the order of contiguous successive entries,
>>>> + * up to the maximum order covered by that entry (which is the number of
>>>> + * clear low bits in its index, with entry 0 being the exception using
>>>> + * the base-2 logarithm of the number of entries in a single page table).
>>>> + * While a few entries need touching upon update, knowing whether the
>>>> + * table is fully contiguous (and can hence be replaced by a higher level
>>>> + * leaf entry) is then possible by simply looking at entry 0's marker.
>>>> + *
>>>> + * Prereqs:
>>>> + * - CONTIG_MASK needs to be #define-d, to a value having at least 4
>>>> + *   contiguous bits (ignored by hardware), before including this file,
>>>> + * - page tables to be passed here need to be initialized with correct
>>>> + *   markers.
>>>
>>> Not sure it's very relevant, but might we worth adding that:
>>>
>>> - Null entries must have the PTE zeroed except for the CONTIG_MASK
>>>   region in order to be considered as inactive.
>>
>> NP, I've added an item along these lines.
>>
>>>> +static bool pt_update_contig_markers(uint64_t *pt, unsigned int idx,
>>>> +                                     unsigned int level, enum PTE_kind kind)
>>>> +{
>>>> +    unsigned int b, i = idx;
>>>> +    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
>>>> +
>>>> +    ASSERT(idx < CONTIG_NR);
>>>> +    ASSERT(!(pt[idx] & CONTIG_MASK));
>>>> +
>>>> +    /* Step 1: Reduce markers in lower numbered entries. */
>>>> +    while ( i )
>>>> +    {
>>>> +        b = find_first_set_bit(i);
>>>> +        i &= ~(1U << b);
>>>> +        if ( GET_MARKER(pt[i]) > b )
>>>> +            SET_MARKER(pt[i], b);
>>>
>>> Can't you exit early when you find an entry that already has the
>>> to-be-set contiguous marker <= b, as lower numbered entries will then
>>> also be <= b'?
>>>
>>> Ie:
>>>
>>> if ( GET_MARKER(pt[i]) <= b )
>>>     break;
>>> else
>>>     SET_MARKER(pt[i], b);
>>
>> Almost - I think it would need to be 
>>
>>         if ( GET_MARKER(pt[i]) < b )
>>             break;
>>         if ( GET_MARKER(pt[i]) > b )
>>             SET_MARKER(pt[i], b);
> 
> I guess I'm slightly confused, but if marker at i is <= b, then all
> following markers will also be <=, and hence could be skipped?

Your use of "following" is ambiguous here, because the iteration
moves downwards as far as PTEs inspected are concerned (and it's
b which grows from one iteration to the next). But yes, I think I
agree now that ...

> Not sure why we need to keep iterating if GET_MARKER(pt[i]) == b.

... this isn't needed. At which point ...

> FWIW, you could even do:
> 
> if ( GET_MARKER(pt[i]) <= b )
>     break;
> SET_MARKER(pt[i], b);
> 
> Which would keep the conditionals to 1 like it currently is.
> 
>>
>> or, accepting redundant updates, 
>>
>>         if ( GET_MARKER(pt[i]) < b )
>>             break;
>>         SET_MARKER(pt[i], b);
>>
>> . Neither the redundant updates nor the extra (easily mis-predicted)
>> conditional looked very appealing to me, but I guess I could change
>> this if you are convinced that's better than continuing a loop with
>> at most 9 (typically less) iterations.
> 
> Well, I think I at least partly understood the logic.  Not sure
> whether it's worth adding the conditional or just assuming that
> continuing the loop is going to be cheaper.  Might be worth adding a
> comment that we choose to explicitly not add an extra conditional to
> check for early exit, because we assume that to be more expensive than
> just continuing.

... this resolves without further action.

Jan
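
To make the marker scheme and the step 1 loop above concrete, here is a
self-contained sketch (not the Xen implementation; the array, the ffs()
stand-in for find_first_set_bit() and the demo values are assumptions purely
for illustration). It mirrors the reduction being discussed, including the
optional early exit weighed above:

#include <stdint.h>
#include <stdio.h>
#include <strings.h>                     /* ffs() */

#define CONTIG_NR 512                    /* entries per table, order 9 */

static uint8_t marker[CONTIG_NR];        /* stand-in for the ignored PTE bits */

#define GET_MARKER(i)    (marker[i])
#define SET_MARKER(i, v) (marker[i] = (v))

static void reduce_markers(unsigned int idx, int early_exit)
{
    unsigned int i = idx, b;

    while ( i )
    {
        b = ffs(i) - 1;                  /* find_first_set_bit(i) */
        i &= ~(1U << b);
        if ( early_exit && GET_MARKER(i) <= b )
            break;                       /* lower numbered entries are <= b already */
        if ( GET_MARKER(i) > b )
            SET_MARKER(i, b);
    }
}

int main(void)
{
    unsigned int i;

    /* Freshly initialized, fully contiguous table: entry 0 carries order 9. */
    for ( i = 1; i < CONTIG_NR; ++i )
        marker[i] = ffs(i) - 1;
    marker[0] = 9;

    /* Pretend entry 123 just stopped being contiguous with its neighbours. */
    reduce_markers(123, 1);

    /* Entry 0 now only guarantees an order-6 (64-entry) contiguous run. */
    printf("entry 0 marker: %u\n", (unsigned int)marker[0]);
    return 0;
}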



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-20 10:47       ` Roger Pau Monné
@ 2022-05-20 11:11         ` Jan Beulich
  2022-05-20 11:13           ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-20 11:11 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 20.05.2022 12:47, Roger Pau Monné wrote:
> On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
>> On 06.05.2022 13:16, Roger Pau Monné wrote:
>>> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>>>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
>>>>  
>>>>      while ( nr_ptes-- )
>>>>      {
>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>>>> +        ASSERT(!pde->next_level);
>>>> +        ASSERT(!pde->u);
>>>> +
>>>> +        if ( pde > table )
>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
>>>> +        else
>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
>>>
>>> I think PAGETABLE_ORDER would be clearer here.
>>
>> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
>> in IOMMU code afaics.
> 
> Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
> sure what's the rule for declaring that PAGE_SHIFT is fine to use in
> IOMMU code  but not PAGETABLE_ORDER.

Hmm, yes and no. But for consistency with other IOMMU code I may want
to switch to PAGE_SHIFT_4K.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 16/21] VT-d: free all-empty page tables
  2022-05-18 10:26     ` Jan Beulich
  2022-05-20  0:38       ` Tian, Kevin
@ 2022-05-20 11:13       ` Roger Pau Monné
  2022-05-27  7:40         ` Jan Beulich
  1 sibling, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 11:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Wed, May 18, 2022 at 12:26:03PM +0200, Jan Beulich wrote:
> On 10.05.2022 16:30, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
> >> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
> >>  
> >>      old = *pte;
> >>      dma_clear_pte(*pte);
> >> +    iommu_sync_cache(pte, sizeof(*pte));
> >> +
> >> +    while ( pt_update_contig_markers(&page->val,
> >> +                                     address_level_offset(addr, level),
> >> +                                     level, PTE_kind_null) &&
> >> +            ++level < min_pt_levels )
> >> +    {
> >> +        struct page_info *pg = maddr_to_page(pg_maddr);
> >> +
> >> +        unmap_vtd_domain_page(page);
> >> +
> >> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> >> +                                          false);
> >> +        BUG_ON(pg_maddr < PAGE_SIZE);
> >> +
> >> +        page = map_vtd_domain_page(pg_maddr);
> >> +        pte = &page[address_level_offset(addr, level)];
> >> +        dma_clear_pte(*pte);
> >> +        iommu_sync_cache(pte, sizeof(*pte));
> >> +
> >> +        *flush_flags |= IOMMU_FLUSHF_all;
> >> +        iommu_queue_free_pgtable(hd, pg);
> >> +    }
> > 
> > I think I'm setting myself up for trouble, but do we need to sync cache
> > the lower level entries if higher level ones are to be changed?
> >
> > IOW, would it be fine to just flush the highest level modified PTE?
> > As the lower level ones won't be reachable anyway.
> 
> I definitely want to err on the safe side here. If later we can
> prove that some cache flush is unneeded, I'd be happy to see it
> go away.

Hm, so it's not only about adding more cache flushes, but moving them
inside of the locked region: previously the only cache flush was done
outside of the locked region.

I guess I can't convince myself why we would need to flush cache of
entries that are to be removed, and that also point to pages scheduled
to be freed.

> >> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
> >>      }
> >>  
> >>      *pte = new;
> >> -
> >>      iommu_sync_cache(pte, sizeof(struct dma_pte));
> >> +
> >> +    /*
> >> +     * While the (ab)use of PTE_kind_table here allows to save some work in
> >> +     * the function, the main motivation for it is that it avoids a so far
> >> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
> >> +     * based laptop.
> >> +     */
> >> +    pt_update_contig_markers(&page->val,
> >> +                             address_level_offset(dfn_to_daddr(dfn), level),
> >> +                             level,
> >> +                             (hd->platform_ops->page_sizes &
> >> +                              (1UL << level_to_offset_bits(level + 1))
> >> +                              ? PTE_kind_leaf : PTE_kind_table));
> > 
> > So this works because on what we believe to be affected models the
> > only supported page sizes are 4K?
> 
> Yes.
> 
> > Do we want to do the same with AMD if we don't allow 512G super pages?
> 
> Why? They don't have a similar flaw.

So the question was mostly whether we should also avoid the
pt_update_contig_markers for 1G entries, because we won't coalesce
them into a 512G anyway.  IOW avoid the overhead of updating the
contig markers if we know that the resulting super-page is not
supported by ->page_sizes.

Thanks, Roger.
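
The suggestion boils down to a simple bitmask test against the IOMMU's
advertised page sizes before bothering with the markers. A tiny self-contained
illustration (the constants and helper below are assumptions, not the Xen
structures):

#include <stdbool.h>
#include <stdio.h>

/* Assumed example: an IOMMU advertising 4K, 2M and 1G mappings only. */
#define SZ_4K   (1UL << 12)
#define SZ_2M   (1UL << 21)
#define SZ_1G   (1UL << 30)
#define SZ_512G (1UL << 39)

static bool worth_tracking(unsigned long page_sizes, unsigned long next_size)
{
    /* Only maintain contiguity markers at a level whose next-level
     * superpage could actually be used. */
    return page_sizes & next_size;
}

int main(void)
{
    unsigned long page_sizes = SZ_4K | SZ_2M | SZ_1G;

    printf("track 1G level for 512G coalescing: %d\n",
           worth_tracking(page_sizes, SZ_512G));   /* prints 0 */
    return 0;
}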


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-20 11:11         ` Jan Beulich
@ 2022-05-20 11:13           ` Jan Beulich
  2022-05-20 12:22             ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-20 11:13 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 20.05.2022 13:11, Jan Beulich wrote:
> On 20.05.2022 12:47, Roger Pau Monné wrote:
>> On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
>>> On 06.05.2022 13:16, Roger Pau Monné wrote:
>>>> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
>>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>>>>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
>>>>>  
>>>>>      while ( nr_ptes-- )
>>>>>      {
>>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>>>>> +        ASSERT(!pde->next_level);
>>>>> +        ASSERT(!pde->u);
>>>>> +
>>>>> +        if ( pde > table )
>>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
>>>>> +        else
>>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
>>>>
>>>> I think PAGETABLE_ORDER would be clearer here.
>>>
>>> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
>>> in IOMMU code afaics.
>>
>> Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
>> sure what's the rule for declaring that PAGE_SHIFT is fine to use in
>> IOMMU code  but not PAGETABLE_ORDER.
> 
> Hmm, yes and no. But for consistency with other IOMMU code I may want
> to switch to PAGE_SHIFT_4K.

Except that, with the plan to re-use pt_update_contig_markers() for CPU-
side re-coalescing, there I'd prefer to stick to PAGE_SHIFT.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in page tables
  2022-05-20 10:59         ` Jan Beulich
@ 2022-05-20 11:27           ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 11:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, May 20, 2022 at 12:59:55PM +0200, Jan Beulich wrote:
> On 20.05.2022 12:22, Roger Pau Monné wrote:
> > On Wed, May 18, 2022 at 12:06:29PM +0200, Jan Beulich wrote:
> >> On 06.05.2022 15:25, Roger Pau Monné wrote:
> >>> On Mon, Apr 25, 2022 at 10:41:23AM +0200, Jan Beulich wrote:
> >>>> --- /dev/null
> >>>> +++ b/xen/arch/x86/include/asm/pt-contig-markers.h
> >>>> @@ -0,0 +1,105 @@
> >>>> +#ifndef __ASM_X86_PT_CONTIG_MARKERS_H
> >>>> +#define __ASM_X86_PT_CONTIG_MARKERS_H
> >>>> +
> >>>> +/*
> >>>> + * Short of having function templates in C, the function defined below is
> >>>> + * intended to be used by multiple parties interested in recording the
> >>>> + * degree of contiguity in mappings by a single page table.
> >>>> + *
> >>>> + * Scheme: Every entry records the order of contiguous successive entries,
> >>>> + * up to the maximum order covered by that entry (which is the number of
> >>>> + * clear low bits in its index, with entry 0 being the exception using
> >>>> + * the base-2 logarithm of the number of entries in a single page table).
> >>>> + * While a few entries need touching upon update, knowing whether the
> >>>> + * table is fully contiguous (and can hence be replaced by a higher level
> >>>> + * leaf entry) is then possible by simply looking at entry 0's marker.
> >>>> + *
> >>>> + * Prereqs:
> >>>> + * - CONTIG_MASK needs to be #define-d, to a value having at least 4
> >>>> + *   contiguous bits (ignored by hardware), before including this file,
> >>>> + * - page tables to be passed here need to be initialized with correct
> >>>> + *   markers.
> >>>
> >>> Not sure it's very relevant, but might we worth adding that:
> >>>
> >>> - Null entries must have the PTE zeroed except for the CONTIG_MASK
> >>>   region in order to be considered as inactive.
> >>
> >> NP, I've added an item along these lines.
> >>
> >>>> +static bool pt_update_contig_markers(uint64_t *pt, unsigned int idx,
> >>>> +                                     unsigned int level, enum PTE_kind kind)
> >>>> +{
> >>>> +    unsigned int b, i = idx;
> >>>> +    unsigned int shift = (level - 1) * CONTIG_LEVEL_SHIFT + PAGE_SHIFT;
> >>>> +
> >>>> +    ASSERT(idx < CONTIG_NR);
> >>>> +    ASSERT(!(pt[idx] & CONTIG_MASK));
> >>>> +
> >>>> +    /* Step 1: Reduce markers in lower numbered entries. */
> >>>> +    while ( i )
> >>>> +    {
> >>>> +        b = find_first_set_bit(i);
> >>>> +        i &= ~(1U << b);
> >>>> +        if ( GET_MARKER(pt[i]) > b )
> >>>> +            SET_MARKER(pt[i], b);
> >>>
> >>> Can't you exit early when you find an entry that already has the
> >>> to-be-set contiguous marker <= b, as lower numbered entries will then
> >>> also be <= b'?
> >>>
> >>> Ie:
> >>>
> >>> if ( GET_MARKER(pt[i]) <= b )
> >>>     break;
> >>> else
> >>>     SET_MARKER(pt[i], b);
> >>
> >> Almost - I think it would need to be 
> >>
> >>         if ( GET_MARKER(pt[i]) < b )
> >>             break;
> >>         if ( GET_MARKER(pt[i]) > b )
> >>             SET_MARKER(pt[i], b);
> > 
> > I guess I'm slightly confused, but if marker at i is <= b, then all
> > following markers will also be <=, and hence could be skipped?
> 
> Your use of "following" is ambiguous here, because the iteration
> moves downwards as far as PTEs inspected are concerned (and it's
> b which grows from one iteration to the next). But yes, I think I
> agree now that ...

Right, 'following' here would be the next item processed by the loop.

> > Not sure why we need to keep iterating if GET_MARKER(pt[i]) == b.
> 
> ... this isn't needed. At which point ...
> 
> > FWIW, you could even do:
> > 
> > if ( GET_MARKER(pt[i]) <= b )
> >     break;
> > SET_MARKER(pt[i], b);
> > 
> > Which would keep the conditionals to 1 like it currently is.
> > 
> >>
> >> or, accepting redundant updates, 
> >>
> >>         if ( GET_MARKER(pt[i]) < b )
> >>             break;
> >>         SET_MARKER(pt[i], b);
> >>
> >> . Neither the redundant updates nor the extra (easily mis-predicted)
> >> conditional looked very appealing to me, but I guess I could change
> >> this if you are convinced that's better than continuing a loop with
> >> at most 9 (typically less) iterations.
> > 
> > Well, I think I at least partly understood the logic.  Not sure
> > whether it's worth adding the conditional or just assuming that
> > continuing the loop is going to be cheaper.  Might be worth adding a
> > comment that we choose to explicitly not add an extra conditional to
> > check for early exit, because we assume that to be more expensive than
> > just continuing.
> 
> ... this resolves without further action.

OK, since we agree, and that was the only comment I had, you can add:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-20 11:13           ` Jan Beulich
@ 2022-05-20 12:22             ` Roger Pau Monné
  2022-05-20 12:36               ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 12:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, May 20, 2022 at 01:13:28PM +0200, Jan Beulich wrote:
> On 20.05.2022 13:11, Jan Beulich wrote:
> > On 20.05.2022 12:47, Roger Pau Monné wrote:
> >> On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
> >>> On 06.05.2022 13:16, Roger Pau Monné wrote:
> >>>> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
> >>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
> >>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> >>>>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
> >>>>>  
> >>>>>      while ( nr_ptes-- )
> >>>>>      {
> >>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> >>>>> +        ASSERT(!pde->next_level);
> >>>>> +        ASSERT(!pde->u);
> >>>>> +
> >>>>> +        if ( pde > table )
> >>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> >>>>> +        else
> >>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
> >>>>
> >>>> I think PAGETABLE_ORDER would be clearer here.
> >>>
> >>> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
> >>> in IOMMU code afaics.
> >>
> >> Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
> >> sure what's the rule for declaring that PAGE_SHIFT is fine to use in
> >> IOMMU code  but not PAGETABLE_ORDER.
> > 
> > Hmm, yes and no. But for consistency with other IOMMU code I may want
> > to switch to PAGE_SHIFT_4K.
> 
> Except that, with the plan to re-use pt_update_contig_markers() for CPU-
> side re-coalescing, there I'd prefer to stick to PAGE_SHIFT.

Then can PAGETABLE_ORDER be used instead of PAGE_SHIFT - 3?

IMO it makes the code quite easier to understand.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-20 12:22             ` Roger Pau Monné
@ 2022-05-20 12:36               ` Jan Beulich
  2022-05-20 14:28                 ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-20 12:36 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 20.05.2022 14:22, Roger Pau Monné wrote:
> On Fri, May 20, 2022 at 01:13:28PM +0200, Jan Beulich wrote:
>> On 20.05.2022 13:11, Jan Beulich wrote:
>>> On 20.05.2022 12:47, Roger Pau Monné wrote:
>>>> On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
>>>>> On 06.05.2022 13:16, Roger Pau Monné wrote:
>>>>>> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
>>>>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>>>>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>>>>>>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
>>>>>>>  
>>>>>>>      while ( nr_ptes-- )
>>>>>>>      {
>>>>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>>>>>>> +        ASSERT(!pde->next_level);
>>>>>>> +        ASSERT(!pde->u);
>>>>>>> +
>>>>>>> +        if ( pde > table )
>>>>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
>>>>>>> +        else
>>>>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
>>>>>>
>>>>>> I think PAGETABLE_ORDER would be clearer here.
>>>>>
>>>>> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
>>>>> in IOMMU code afaics.
>>>>
>>>> Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
>>>> sure what's the rule for declaring that PAGE_SHIFT is fine to use in
>>>> IOMMU code  but not PAGETABLE_ORDER.
>>>
>>> Hmm, yes and no. But for consistency with other IOMMU code I may want
>>> to switch to PAGE_SHIFT_4K.
>>
>> Except that, with the plan to re-use pt_update_contig_markers() for CPU-
>> side re-coalescing, there I'd prefer to stick to PAGE_SHIFT.
> 
> Then can PAGETABLE_ORDER be used instead of PAGE_SHIFT - 3?

pt_update_contig_markers() isn't IOMMU code; since I've said I'd switch
to PAGE_SHIFT_4K in IOMMU code I'm having a hard time seeing how I could
at the same time start using PAGETABLE_ORDER there.

What I maybe could do is use PTE_PER_TABLE_SHIFT in AMD code and
LEVEL_STRIDE in VT-d one. Yet I'm not sure that would be fully correct/
consistent, ...

> IMO it makes the code quite easier to understand.

... or in fact helping readability.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-20 12:36               ` Jan Beulich
@ 2022-05-20 14:28                 ` Roger Pau Monné
  2022-05-20 14:38                   ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 14:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, May 20, 2022 at 02:36:02PM +0200, Jan Beulich wrote:
> On 20.05.2022 14:22, Roger Pau Monné wrote:
> > On Fri, May 20, 2022 at 01:13:28PM +0200, Jan Beulich wrote:
> >> On 20.05.2022 13:11, Jan Beulich wrote:
> >>> On 20.05.2022 12:47, Roger Pau Monné wrote:
> >>>> On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
> >>>>> On 06.05.2022 13:16, Roger Pau Monné wrote:
> >>>>>> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
> >>>>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
> >>>>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> >>>>>>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
> >>>>>>>  
> >>>>>>>      while ( nr_ptes-- )
> >>>>>>>      {
> >>>>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> >>>>>>> +        ASSERT(!pde->next_level);
> >>>>>>> +        ASSERT(!pde->u);
> >>>>>>> +
> >>>>>>> +        if ( pde > table )
> >>>>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> >>>>>>> +        else
> >>>>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
> >>>>>>
> >>>>>> I think PAGETABLE_ORDER would be clearer here.
> >>>>>
> >>>>> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
> >>>>> in IOMMU code afaics.
> >>>>
> >>>> Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
> >>>> sure what's the rule for declaring that PAGE_SHIFT is fine to use in
> >>>> IOMMU code  but not PAGETABLE_ORDER.
> >>>
> >>> Hmm, yes and no. But for consistency with other IOMMU code I may want
> >>> to switch to PAGE_SHIFT_4K.
> >>
> >> Except that, with the plan to re-use pt_update_contig_markers() for CPU-
> >> side re-coalescing, there I'd prefer to stick to PAGE_SHIFT.
> > 
> > Then can PAGETABLE_ORDER be used instead of PAGE_SHIFT - 3?
> 
> pt_update_contig_markers() isn't IOMMU code; since I've said I'd switch
> to PAGE_SHIFT_4K in IOMMU code I'm having a hard time seeing how I could
> at the same time start using PAGETABLE_ORDER there.

I've got confused by the double reply and read it as if you were
going to stick to using PAGE_SHIFT everywhere as proposed originally.

> What I maybe could do is use PTE_PER_TABLE_SHIFT in AMD code and
> LEVEL_STRIDE in VT-d one. Yet I'm not sure that would be fully correct/
> consistent, ...
> 
> > IMO it makes the code quite easier to understand.
> 
> ... or in fact helping readability.

Looking at pt_update_contig_markers() we hardcode CONTIG_LEVEL_SHIFT
to 9 there, which means all users must have a page table order of 9.

It seems to me we are just making things more complicated than
necessary by trying to avoid dependencies between CPU and IOMMU
page-table sizes and definitions, when the underlying mechanism to set
->ign0 has those assumptions baked in.

Would it help if you introduced a PAGE_TABLE_ORDER in page-defs.h?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-20 14:28                 ` Roger Pau Monné
@ 2022-05-20 14:38                   ` Roger Pau Monné
  2022-05-23  6:49                     ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-20 14:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Fri, May 20, 2022 at 04:28:14PM +0200, Roger Pau Monné wrote:
> On Fri, May 20, 2022 at 02:36:02PM +0200, Jan Beulich wrote:
> > On 20.05.2022 14:22, Roger Pau Monné wrote:
> > > On Fri, May 20, 2022 at 01:13:28PM +0200, Jan Beulich wrote:
> > >> On 20.05.2022 13:11, Jan Beulich wrote:
> > >>> On 20.05.2022 12:47, Roger Pau Monné wrote:
> > >>>> On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
> > >>>>> On 06.05.2022 13:16, Roger Pau Monné wrote:
> > >>>>>> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
> > >>>>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
> > >>>>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> > >>>>>>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
> > >>>>>>>  
> > >>>>>>>      while ( nr_ptes-- )
> > >>>>>>>      {
> > >>>>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> > >>>>>>> +        ASSERT(!pde->next_level);
> > >>>>>>> +        ASSERT(!pde->u);
> > >>>>>>> +
> > >>>>>>> +        if ( pde > table )
> > >>>>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> > >>>>>>> +        else
> > >>>>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
> > >>>>>>
> > >>>>>> I think PAGETABLE_ORDER would be clearer here.
> > >>>>>
> > >>>>> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
> > >>>>> in IOMMU code afaics.
> > >>>>
> > >>>> Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
> > >>>> sure what's the rule for declaring that PAGE_SHIFT is fine to use in
> > >>>> IOMMU code  but not PAGETABLE_ORDER.
> > >>>
> > >>> Hmm, yes and no. But for consistency with other IOMMU code I may want
> > >>> to switch to PAGE_SHIFT_4K.
> > >>
> > >> Except that, with the plan to re-use pt_update_contig_markers() for CPU-
> > >> side re-coalescing, there I'd prefer to stick to PAGE_SHIFT.
> > > 
> > > Then can PAGETABLE_ORDER be used instead of PAGE_SHIFT - 3?
> > 
> > pt_update_contig_markers() isn't IOMMU code; since I've said I'd switch
> > to PAGE_SHIFT_4K in IOMMU code I'm having a hard time seeing how I could
> > at the same time start using PAGETABLE_ORDER there.
> 
> I've got confused by the double reply and read it as if you were
> going to stick to using PAGE_SHIFT everywhere as proposed originally.
> 
> > What I maybe could do is use PTE_PER_TABLE_SHIFT in AMD code and
> > LEVEL_STRIDE in VT-d one. Yet I'm not sure that would be fully correct/
> > consistent, ...
> > 
> > > IMO it makes the code quite easier to understand.
> > 
> > ... or in fact helping readability.
> 
> Looking at pt_update_contig_markers() we hardcode CONTIG_LEVEL_SHIFT
> to 9 there, which means all users must have a page table order of 9.
> 
> It seems to me we are just making things more complicated than
> necessary by trying to avoid dependencies between CPU and IOMMU
> page-table sizes and definitions, when the underlying mechanism to set
> ->ign0 has those assumptions baked in.
> 
> Would it help if you introduced a PAGE_TABLE_ORDER in page-defs.h?

Sorry, should be PAGE_TABLE_ORDER_4K.

Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-20 14:38                   ` Roger Pau Monné
@ 2022-05-23  6:49                     ` Jan Beulich
  2022-05-23  9:10                       ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-23  6:49 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 20.05.2022 16:38, Roger Pau Monné wrote:
> On Fri, May 20, 2022 at 04:28:14PM +0200, Roger Pau Monné wrote:
>> On Fri, May 20, 2022 at 02:36:02PM +0200, Jan Beulich wrote:
>>> On 20.05.2022 14:22, Roger Pau Monné wrote:
>>>> On Fri, May 20, 2022 at 01:13:28PM +0200, Jan Beulich wrote:
>>>>> On 20.05.2022 13:11, Jan Beulich wrote:
>>>>>> On 20.05.2022 12:47, Roger Pau Monné wrote:
>>>>>>> On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
>>>>>>>> On 06.05.2022 13:16, Roger Pau Monné wrote:
>>>>>>>>> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
>>>>>>>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>>>>>>>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>>>>>>>>>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
>>>>>>>>>>  
>>>>>>>>>>      while ( nr_ptes-- )
>>>>>>>>>>      {
>>>>>>>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>>>>>>>>>> +        ASSERT(!pde->next_level);
>>>>>>>>>> +        ASSERT(!pde->u);
>>>>>>>>>> +
>>>>>>>>>> +        if ( pde > table )
>>>>>>>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
>>>>>>>>>> +        else
>>>>>>>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
>>>>>>>>>
>>>>>>>>> I think PAGETABLE_ORDER would be clearer here.
>>>>>>>>
>>>>>>>> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
>>>>>>>> in IOMMU code afaics.
>>>>>>>
>>>>>>> Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
>>>>>>> sure what's the rule for declaring that PAGE_SHIFT is fine to use in
>>>>>>> IOMMU code  but not PAGETABLE_ORDER.
>>>>>>
>>>>>> Hmm, yes and no. But for consistency with other IOMMU code I may want
>>>>>> to switch to PAGE_SHIFT_4K.
>>>>>
>>>>> Except that, with the plan to re-use pt_update_contig_markers() for CPU-
>>>>> side re-coalescing, there I'd prefer to stick to PAGE_SHIFT.
>>>>
>>>> Then can PAGETABLE_ORDER be used instead of PAGE_SHIFT - 3?
>>>
>>> pt_update_contig_markers() isn't IOMMU code; since I've said I'd switch
>>> to PAGE_SHIFT_4K in IOMMU code I'm having a hard time seeing how I could
>>> at the same time start using PAGETABLE_ORDER there.
>>
>> I've got confused by the double reply and read it as if you where
>> going to stick to using PAGE_SHIFT everywhere as proposed originally.
>>
>>> What I maybe could do is use PTE_PER_TABLE_SHIFT in AMD code and
>>> LEVEL_STRIDE in VT-d one. Yet I'm not sure that would be fully correct/
>>> consistent, ...
>>>
>>>> IMO it makes the code quite easier to understand.
>>>
>>> ... or in fact helping readability.
>>
>> Looking at pt_update_contig_markers() we hardcode CONTIG_LEVEL_SHIFT
>> to 9 there, which means all users must have a page table order of 9.
>>
>> It seems to me we are just making things more complicated than
>> necessary by trying to avoid dependencies between CPU and IOMMU
>> page-table sizes and definitions, when the underlying mechanism to set
>> ->ign0 has those assumptions baked in.
>>
>> Would it help if you introduced a PAGE_TABLE_ORDER in page-defs.h?
> 
> Sorry, should be PAGE_TABLE_ORDER_4K.

Oh, good that I looked here before replying to the earlier mail: I'm
afraid I view PAGE_TABLE_ORDER_4K as not very useful. From an
abstract POV, what is the base unit meant to be that the order
is based upon? PAGE_SHIFT? Or PAGE_SHIFT_4K? I think such an
ambiguity is going to remain even if we very clearly spelled out what
we mean things to be, as one would always need to go back to that
comment to check which of the two possible ways it is.

Furthermore I'm not convinced PAGETABLE_ORDER is really meant to be
associated with a particular page size anyway: PAGE_TABLE_ORDER_2M
imo makes no sense at all. And page-defs.h is not supposed to
express any platform properties anyway, it's merely an accumulation
of (believed) useful constants.

Hence the only thing which I might see as a (remote) option is
IOMMU_PAGE_TABLE_ORDER (for platforms where all IOMMU variants have
all page table levels using identical sizes, which isn't a given, but
which would hold for x86 and hence for the purpose here).

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-23  6:49                     ` Jan Beulich
@ 2022-05-23  9:10                       ` Roger Pau Monné
  2022-05-23 10:52                         ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-23  9:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On Mon, May 23, 2022 at 08:49:27AM +0200, Jan Beulich wrote:
> On 20.05.2022 16:38, Roger Pau Monné wrote:
> > On Fri, May 20, 2022 at 04:28:14PM +0200, Roger Pau Monné wrote:
> >> On Fri, May 20, 2022 at 02:36:02PM +0200, Jan Beulich wrote:
> >>> On 20.05.2022 14:22, Roger Pau Monné wrote:
> >>>> On Fri, May 20, 2022 at 01:13:28PM +0200, Jan Beulich wrote:
> >>>>> On 20.05.2022 13:11, Jan Beulich wrote:
> >>>>>> On 20.05.2022 12:47, Roger Pau Monné wrote:
> >>>>>>> On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
> >>>>>>>> On 06.05.2022 13:16, Roger Pau Monné wrote:
> >>>>>>>>> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
> >>>>>>>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
> >>>>>>>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> >>>>>>>>>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
> >>>>>>>>>>  
> >>>>>>>>>>      while ( nr_ptes-- )
> >>>>>>>>>>      {
> >>>>>>>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> >>>>>>>>>> +        ASSERT(!pde->next_level);
> >>>>>>>>>> +        ASSERT(!pde->u);
> >>>>>>>>>> +
> >>>>>>>>>> +        if ( pde > table )
> >>>>>>>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> >>>>>>>>>> +        else
> >>>>>>>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
> >>>>>>>>>
> >>>>>>>>> I think PAGETABLE_ORDER would be clearer here.
> >>>>>>>>
> >>>>>>>> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
> >>>>>>>> in IOMMU code afaics.
> >>>>>>>
> >>>>>>> Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
> >>>>>>> sure what's the rule for declaring that PAGE_SHIFT is fine to use in
> >>>>>>> IOMMU code  but not PAGETABLE_ORDER.
> >>>>>>
> >>>>>> Hmm, yes and no. But for consistency with other IOMMU code I may want
> >>>>>> to switch to PAGE_SHIFT_4K.
> >>>>>
> >>>>> Except that, with the plan to re-use pt_update_contig_markers() for CPU-
> >>>>> side re-coalescing, there I'd prefer to stick to PAGE_SHIFT.
> >>>>
> >>>> Then can PAGETABLE_ORDER be used instead of PAGE_SHIFT - 3?
> >>>
> >>> pt_update_contig_markers() isn't IOMMU code; since I've said I'd switch
> >>> to PAGE_SHIFT_4K in IOMMU code I'm having a hard time seeing how I could
> >>> at the same time start using PAGETABLE_ORDER there.
> >>
> >> I've got confused by the double reply and read it as if you were
> >> going to stick to using PAGE_SHIFT everywhere as proposed originally.
> >>
> >>> What I maybe could do is use PTE_PER_TABLE_SHIFT in AMD code and
> >>> LEVEL_STRIDE in VT-d one. Yet I'm not sure that would be fully correct/
> >>> consistent, ...
> >>>
> >>>> IMO it makes the code quite easier to understand.
> >>>
> >>> ... or in fact helping readability.
> >>
> >> Looking at pt_update_contig_markers() we hardcode CONTIG_LEVEL_SHIFT
> >> to 9 there, which means all users must have a page table order of 9.
> >>
> >> It seems to me we are just making things more complicated than
> >> necessary by trying to avoid dependencies between CPU and IOMMU
> >> page-table sizes and definitions, when the underlying mechanism to set
> >> ->ign0 has those assumptions baked in.
> >>
> >> Would it help if you introduced a PAGE_TABLE_ORDER in page-defs.h?
> > 
> > Sorry, should be PAGE_TABLE_ORDER_4K.
> 
> Oh, good that I looked here before replying to the earlier mail: I'm
> afraid I view PAGE_TABLE_ORDER_4K as not very useful. From an
> abstract POV, what is the base unit meant to be that the order
> is based upon? PAGE_SHIFT? Or PAGE_SHIFT_4K? I think such an
> ambiguity is going to remain even if we very clearly spelled out what
> we mean things to be, as one would always need to go back to that
> comment to check which of the two possible ways it is.
> 
> Furthermore I'm not convinced PAGETABLE_ORDER is really meant to be
> associated with a particular page size anyway: PAGE_TABLE_ORDER_2M
> imo makes no sense at all. And page-defs.h is not supposed to
> express any platform properties anyway, it's merely an accumulation
> of (believed) useful constants.
> 
> Hence the only thing which I might see as a (remote) option is
> IOMMU_PAGE_TABLE_ORDER (for platforms where all IOMMU variants have
> all page table levels using identical sizes, which isn't a given, but
> which would hold for x86 and hence for the purpose here).

Since you already define a page table order in pt-contig-markers.h
(CONTIG_NR) it might be possible to export and use that?  In fact the
check done here would be even more accurate if it was done using the
same constant that's used in pt_update_contig_markers(), because the
purpose here is to check that the vendor specific code to init the
page tables has used the correct value.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
  2022-05-23  9:10                       ` Roger Pau Monné
@ 2022-05-23 10:52                         ` Jan Beulich
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Beulich @ 2022-05-23 10:52 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Wei Liu

On 23.05.2022 11:10, Roger Pau Monné wrote:
> On Mon, May 23, 2022 at 08:49:27AM +0200, Jan Beulich wrote:
>> On 20.05.2022 16:38, Roger Pau Monné wrote:
>>> On Fri, May 20, 2022 at 04:28:14PM +0200, Roger Pau Monné wrote:
>>>> On Fri, May 20, 2022 at 02:36:02PM +0200, Jan Beulich wrote:
>>>>> On 20.05.2022 14:22, Roger Pau Monné wrote:
>>>>>> On Fri, May 20, 2022 at 01:13:28PM +0200, Jan Beulich wrote:
>>>>>>> On 20.05.2022 13:11, Jan Beulich wrote:
>>>>>>>> On 20.05.2022 12:47, Roger Pau Monné wrote:
>>>>>>>>> On Thu, May 19, 2022 at 02:12:04PM +0200, Jan Beulich wrote:
>>>>>>>>>> On 06.05.2022 13:16, Roger Pau Monné wrote:
>>>>>>>>>>> On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
>>>>>>>>>>>> --- a/xen/drivers/passthrough/amd/iommu_map.c
>>>>>>>>>>>> +++ b/xen/drivers/passthrough/amd/iommu_map.c
>>>>>>>>>>>> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
>>>>>>>>>>>>  
>>>>>>>>>>>>      while ( nr_ptes-- )
>>>>>>>>>>>>      {
>>>>>>>>>>>> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
>>>>>>>>>>>> +        ASSERT(!pde->next_level);
>>>>>>>>>>>> +        ASSERT(!pde->u);
>>>>>>>>>>>> +
>>>>>>>>>>>> +        if ( pde > table )
>>>>>>>>>>>> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
>>>>>>>>>>>> +        else
>>>>>>>>>>>> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);
>>>>>>>>>>>
>>>>>>>>>>> I think PAGETABLE_ORDER would be clearer here.
>>>>>>>>>>
>>>>>>>>>> I disagree - PAGETABLE_ORDER is a CPU-side concept. It's not used anywhere
>>>>>>>>>> in IOMMU code afaics.
>>>>>>>>>
>>>>>>>>> Isn't PAGE_SHIFT also a CPU-side concept in the same way?  I'm not
>>>>>>>>> sure what's the rule for declaring that PAGE_SHIFT is fine to use in
>>>>>>>>> IOMMU code  but not PAGETABLE_ORDER.
>>>>>>>>
>>>>>>>> Hmm, yes and no. But for consistency with other IOMMU code I may want
>>>>>>>> to switch to PAGE_SHIFT_4K.
>>>>>>>
>>>>>>> Except that, with the plan to re-use pt_update_contig_markers() for CPU-
>>>>>>> side re-coalescing, there I'd prefer to stick to PAGE_SHIFT.
>>>>>>
>>>>>> Then can PAGETABLE_ORDER be used instead of PAGE_SHIFT - 3?
>>>>>
>>>>> pt_update_contig_markers() isn't IOMMU code; since I've said I'd switch
>>>>> to PAGE_SHIFT_4K in IOMMU code I'm having a hard time seeing how I could
>>>>> at the same time start using PAGETABLE_ORDER there.
>>>>
>>>> I've got confused by the double reply and read it as if you were
>>>> going to stick to using PAGE_SHIFT everywhere as proposed originally.
>>>>
>>>>> What I maybe could do is use PTE_PER_TABLE_SHIFT in AMD code and
>>>>> LEVEL_STRIDE in VT-d one. Yet I'm not sure that would be fully correct/
>>>>> consistent, ...
>>>>>
>>>>>> IMO it makes the code quite easier to understand.
>>>>>
>>>>> ... or in fact helping readability.
>>>>
>>>> Looking at pt_update_contig_markers() we hardcode CONTIG_LEVEL_SHIFT
>>>> to 9 there, which means all users must have a page table order of 9.
>>>>
>>>> It seems to me we are just making things more complicated than
>>>> necessary by trying to avoid dependencies between CPU and IOMMU
>>>> page-table sizes and definitions, when the underlying mechanism to set
>>>> ->ign0 has those assumptions baked in.
>>>>
>>>> Would it help if you introduced a PAGE_TABLE_ORDER in page-defs.h?
>>>
>>> Sorry, should be PAGE_TABLE_ORDER_4K.
>>
>> Oh, good that I looked here before replying to the earlier mail: I'm
>> afraid I view PAGE_TABLE_ORDER_4K as not very useful. From an
>> abstract POV, what is the base unit meant to be that the order
>> is based upon? PAGE_SHIFT? Or PAGE_SHIFT_4K? I think such an
>> ambiguity is going to remain even if we very clearly spelled out what
>> we mean things to be, as one would always need to go back to that
>> comment to check which of the two possible ways it is.
>>
>> Furthermore I'm not convinced PAGETABLE_ORDER is really meant to be
>> associated with a particular page size anyway: PAGE_TABLE_ORDER_2M
>> imo makes no sense at all. And page-defs.h is not supposed to
>> express any platform properties anyway, it's merely an accumulation
>> of (believed) useful constants.
>>
>> Hence the only thing which I might see as a (remote) option is
>> IOMMU_PAGE_TABLE_ORDER (for platforms where all IOMMU variants have
>> all page table levels using identical sizes, which isn't a given, but
>> which would hold for x86 and hence for the purpose here).
> 
> Since you already define a page table order in pt-contig-markers.h
> (CONTIG_NR) it might be possible to export and use that?  In fact the
> check done here would be even more accurate if it was done using the
> same constant that's used in pt_update_contig_markers(), because the
> purpose here is to check that the vendor specific code to init the
> page tables has used the correct value.

Hmm, yes, let me do that. It'll be a little odd in the header itself
(as I'll need to exclude the bulk of it when CONTIG_MASK is not
defined), but apart from that it should indeed end up being better.

Jan
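
The reshuffling Jan describes could plausibly take a shape along these lines (a
sketch only, not the actual commit): keep the geometry constants visible to all
includers and gate the marker machinery on CONTIG_MASK.

#ifndef __ASM_X86_PT_CONTIG_MARKERS_H
#define __ASM_X86_PT_CONTIG_MARKERS_H

/* Geometry shared by CPU and IOMMU page tables on x86. */
#define CONTIG_LEVEL_SHIFT 9
#define CONTIG_NR          (1 << CONTIG_LEVEL_SHIFT)

#ifdef CONTIG_MASK
/* GET_MARKER()/SET_MARKER() and pt_update_contig_markers() as before,
 * available only to users which define the mask of ignored PTE bits. */
#endif

#endif /* __ASM_X86_PT_CONTIG_MARKERS_H */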



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 16/21] VT-d: free all-empty page tables
  2022-05-20 11:13       ` Roger Pau Monné
@ 2022-05-27  7:40         ` Jan Beulich
  2022-05-27  7:53           ` Jan Beulich
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-27  7:40 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 20.05.2022 13:13, Roger Pau Monné wrote:
> On Wed, May 18, 2022 at 12:26:03PM +0200, Jan Beulich wrote:
>> On 10.05.2022 16:30, Roger Pau Monné wrote:
>>> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
>>>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>>>>  
>>>>      old = *pte;
>>>>      dma_clear_pte(*pte);
>>>> +    iommu_sync_cache(pte, sizeof(*pte));
>>>> +
>>>> +    while ( pt_update_contig_markers(&page->val,
>>>> +                                     address_level_offset(addr, level),
>>>> +                                     level, PTE_kind_null) &&
>>>> +            ++level < min_pt_levels )
>>>> +    {
>>>> +        struct page_info *pg = maddr_to_page(pg_maddr);
>>>> +
>>>> +        unmap_vtd_domain_page(page);
>>>> +
>>>> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
>>>> +                                          false);
>>>> +        BUG_ON(pg_maddr < PAGE_SIZE);
>>>> +
>>>> +        page = map_vtd_domain_page(pg_maddr);
>>>> +        pte = &page[address_level_offset(addr, level)];
>>>> +        dma_clear_pte(*pte);
>>>> +        iommu_sync_cache(pte, sizeof(*pte));
>>>> +
>>>> +        *flush_flags |= IOMMU_FLUSHF_all;
>>>> +        iommu_queue_free_pgtable(hd, pg);
>>>> +    }
>>>
>>> I think I'm setting myself up for trouble, but do we need to sync cache
>>> the lower level entries if higher level ones are to be changed?
>>>
>>> IOW, would it be fine to just flush the highest level modified PTE?
>>> As the lower level ones won't be reachable anyway.
>>
>> I definitely want to err on the safe side here. If later we can
>> prove that some cache flush is unneeded, I'd be happy to see it
>> go away.
> 
> Hm, so it's not only about adding more cache flushes, but moving them
> inside of the locked region: previously the only cache flush was done
> outside of the locked region.
> 
> I guess I can't convince myself why we would need to flush cache of
> entries that are to be removed, and that also point to pages scheduled
> to be freed.

As previously said - with a series like this I wanted to strictly be
on the safe side, maintaining the pre-existing pattern of all
modifications of live tables being accompanied by a flush (if flushes
are needed in the first place, of course). As to moving flushes into
the locked region, I don't view this as a problem, seeing in
particular that elsewhere we already have flushes with the lock held
(at the very least the _full page_ one in addr_to_dma_page_maddr(),
but also e.g. in intel_iommu_map_page(), where it could be easily
moved past the unlock).

If you (continue to) think that breaking the present pattern isn't
going to misguide future changes, I can certainly drop these not
really necessary flushes. Otoh I was actually considering,
subsequently, integrating the flushes into e.g. dma_clear_pte() to
make it virtually impossible to break that pattern. This would
imply that all page table related flushes would then occur with the
lock held.

(I won't separately reply to the similar topic on patch 18.)

>>>> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>>>>      }
>>>>  
>>>>      *pte = new;
>>>> -
>>>>      iommu_sync_cache(pte, sizeof(struct dma_pte));
>>>> +
>>>> +    /*
>>>> +     * While the (ab)use of PTE_kind_table here allows to save some work in
>>>> +     * the function, the main motivation for it is that it avoids a so far
>>>> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
>>>> +     * based laptop.
>>>> +     */
>>>> +    pt_update_contig_markers(&page->val,
>>>> +                             address_level_offset(dfn_to_daddr(dfn), level),
>>>> +                             level,
>>>> +                             (hd->platform_ops->page_sizes &
>>>> +                              (1UL << level_to_offset_bits(level + 1))
>>>> +                              ? PTE_kind_leaf : PTE_kind_table));
>>>
>>> So this works because on what we believe to be affected models the
>>> only supported page sizes are 4K?
>>
>> Yes.
>>
>>> Do we want to do the same with AMD if we don't allow 512G super pages?
>>
>> Why? They don't have a similar flaw.
> 
> So the question was mostly whether we should also avoid the
> pt_update_contig_markers for 1G entries, because we won't coalesce
> them into a 512G anyway.  IOW avoid the overhead of updating the
> contig markers if we know that the resulting super-page is not
> supported by ->page_sizes.

As the comment says, I consider this at least partly an abuse of
PTE_kind_table, so I'm wary of extending this to AMD. But if you
continue to think it's worth it, I could certainly do so there as
well.

Jan
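
A sketch of the follow-up idea (not an actual patch; the struct and the helper
below are simplified stand-ins so the snippet is self-contained): folding the
cache maintenance into the clearing helper makes it impossible to modify a live
PTE without the accompanying flush.

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the VT-d PTE (assumption for illustration). */
struct dma_pte { uint64_t val; };

static void iommu_sync_cache(const void *addr, unsigned int size)
{
    /* The real helper writes back the cache lines when the IOMMU is not
     * coherent; here we only log the call. */
    (void)addr;
    printf("synced %u bytes\n", size);
}

/* Clearing and flushing become one operation. */
static void dma_clear_pte_sync(struct dma_pte *pte)
{
    pte->val = 0;
    iommu_sync_cache(pte, sizeof(*pte));
}

int main(void)
{
    struct dma_pte pte = { .val = (0x1000 | 3) };

    dma_clear_pte_sync(&pte);
    return 0;
}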



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 16/21] VT-d: free all-empty page tables
  2022-05-27  7:40         ` Jan Beulich
@ 2022-05-27  7:53           ` Jan Beulich
  2022-05-27  9:21             ` Roger Pau Monné
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Beulich @ 2022-05-27  7:53 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On 27.05.2022 09:40, Jan Beulich wrote:
> On 20.05.2022 13:13, Roger Pau Monné wrote:
>> On Wed, May 18, 2022 at 12:26:03PM +0200, Jan Beulich wrote:
>>> On 10.05.2022 16:30, Roger Pau Monné wrote:
>>>> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
>>>>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>>>>>  
>>>>>      old = *pte;
>>>>>      dma_clear_pte(*pte);
>>>>> +    iommu_sync_cache(pte, sizeof(*pte));
>>>>> +
>>>>> +    while ( pt_update_contig_markers(&page->val,
>>>>> +                                     address_level_offset(addr, level),
>>>>> +                                     level, PTE_kind_null) &&
>>>>> +            ++level < min_pt_levels )
>>>>> +    {
>>>>> +        struct page_info *pg = maddr_to_page(pg_maddr);
>>>>> +
>>>>> +        unmap_vtd_domain_page(page);
>>>>> +
>>>>> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
>>>>> +                                          false);
>>>>> +        BUG_ON(pg_maddr < PAGE_SIZE);
>>>>> +
>>>>> +        page = map_vtd_domain_page(pg_maddr);
>>>>> +        pte = &page[address_level_offset(addr, level)];
>>>>> +        dma_clear_pte(*pte);
>>>>> +        iommu_sync_cache(pte, sizeof(*pte));
>>>>> +
>>>>> +        *flush_flags |= IOMMU_FLUSHF_all;
>>>>> +        iommu_queue_free_pgtable(hd, pg);
>>>>> +    }
>>>>
>>>> I think I'm setting myself up for trouble, but do we need to sync cache
>>>> the lower level entries if higher level ones are to be changed?
>>>>
>>>> IOW, would it be fine to just flush the highest level modified PTE?
>>>> As the lower level ones won't be reachable anyway.
>>>
>>> I definitely want to err on the safe side here. If later we can
>>> prove that some cache flush is unneeded, I'd be happy to see it
>>> go away.
>>
>> Hm, so it's not only about adding more cache flushes, but moving them
>> inside of the locked region: previously the only cache flush was done
>> outside of the locked region.
>>
>> I guess I can't convince myself why we would need to flush cache of
>> entries that are to be removed, and that also point to pages scheduled
>> to be freed.
> 
> As previously said - with a series like this I wanted to strictly be
> on the safe side, maintaining the pre-existing pattern of all
> modifications of live tables being accompanied by a flush (if flushes
> are needed in the first place, of course). As to moving flushes into
> the locked region, I don't view this as a problem, seeing in
> particular that elsewhere we already have flushes with the lock held
> (at the very least the _full page_ one in addr_to_dma_page_maddr(),
> but also e.g. in intel_iommu_map_page(), where it could be easily
> moved past the unlock).
> 
> If you (continue to) think that breaking the present pattern isn't
> going to misguide future changes, I can certainly drop these not
> really necessary flushes. Otoh I was actually considering,
> subsequently, integrating the flushes into e.g. dma_clear_pte() to
> make it virtually impossible to break that pattern. This would
> imply that all page table related flushes would then occur with the
> lock held.
> 
> (I won't separately reply to the similar topic on patch 18.)

Oh, one more (formal / minor) aspect: Changing when to (not) flush
would also invalidate Kevin's R-b which I've got already for both
this and the later, similarly affected patch.

Jan



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v4 16/21] VT-d: free all-empty page tables
  2022-05-27  7:53           ` Jan Beulich
@ 2022-05-27  9:21             ` Roger Pau Monné
  0 siblings, 0 replies; 106+ messages in thread
From: Roger Pau Monné @ 2022-05-27  9:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Paul Durrant, Kevin Tian

On Fri, May 27, 2022 at 09:53:01AM +0200, Jan Beulich wrote:
> On 27.05.2022 09:40, Jan Beulich wrote:
> > On 20.05.2022 13:13, Roger Pau Monné wrote:
> >> On Wed, May 18, 2022 at 12:26:03PM +0200, Jan Beulich wrote:
> >>> On 10.05.2022 16:30, Roger Pau Monné wrote:
> >>>> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
> >>>>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
> >>>>>  
> >>>>>      old = *pte;
> >>>>>      dma_clear_pte(*pte);
> >>>>> +    iommu_sync_cache(pte, sizeof(*pte));
> >>>>> +
> >>>>> +    while ( pt_update_contig_markers(&page->val,
> >>>>> +                                     address_level_offset(addr, level),
> >>>>> +                                     level, PTE_kind_null) &&
> >>>>> +            ++level < min_pt_levels )
> >>>>> +    {
> >>>>> +        struct page_info *pg = maddr_to_page(pg_maddr);
> >>>>> +
> >>>>> +        unmap_vtd_domain_page(page);
> >>>>> +
> >>>>> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> >>>>> +                                          false);
> >>>>> +        BUG_ON(pg_maddr < PAGE_SIZE);
> >>>>> +
> >>>>> +        page = map_vtd_domain_page(pg_maddr);
> >>>>> +        pte = &page[address_level_offset(addr, level)];
> >>>>> +        dma_clear_pte(*pte);
> >>>>> +        iommu_sync_cache(pte, sizeof(*pte));
> >>>>> +
> >>>>> +        *flush_flags |= IOMMU_FLUSHF_all;
> >>>>> +        iommu_queue_free_pgtable(hd, pg);
> >>>>> +    }
> >>>>
> >>>> I think I'm setting myself up for trouble, but do we need to sync cache
> >>>> the lower level entries if higher level ones are to be changed?
> >>>>
> >>>> IOW, would it be fine to just flush the highest level modified PTE?
> >>>> As the lower level ones won't be reachable anyway.
> >>>
> >>> I definitely want to err on the safe side here. If later we can
> >>> prove that some cache flush is unneeded, I'd be happy to see it
> >>> go away.
> >>
> >> Hm, so it's not only about adding more cache flushes, but moving them
> >> inside of the locked region: previously the only cache flush was done
> >> outside of the locked region.
> >>
> >> I guess I can't convince myself why we would need to flush cache of
> >> entries that are to be removed, and that also point to pages scheduled
> >> to be freed.
> > 
> > As previously said - with a series like this I wanted to strictly be
> > on the safe side, maintaining the pre-existing pattern of all
> > modifications of live tables being accompanied by a flush (if flushes
> > are needed in the first place, of course). As to moving flushes into
> > the locked region, I don't view this as a problem, seeing in
> > particular that elsewhere we already have flushes with the lock held
> > (at the very least the _full page_ one in addr_to_dma_page_maddr(),
> > but also e.g. in intel_iommu_map_page(), where it could be easily
> > moved past the unlock).
> > 
> > If you (continue to) think that breaking the present pattern isn't
> > going to misguide future changes, I can certainly drop these not
> > really necessary flushes. Otoh I was actually considering,
> > subsequently, integrating the flushes into e.g. dma_clear_pte() to
> > make it virtually impossible to break that pattern. This would
> > imply that all page table related flushes would then occur with the
> > lock held.

Hm, while I agree it's safer to do the flush in dma_clear_pte()
itself, I wonder how much of a performance impact this has.  It
might not be relevant, in which case I would certainly be fine with
placing the flush in dma_clear_pte().

> > (I won't separately reply to the similar topic on patch 18.)
> 
> Oh, one more (formal / minor) aspect: Changing when to (not) flush
> would also invalidate Kevin's R-b which I've got already for both
> this and the later, similarly affected patch.

OK, so let's go with this for now.  I don't have further comments:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 106+ messages in thread

end of thread, other threads:[~2022-05-27  9:22 UTC | newest]

Thread overview: 106+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
2022-04-25  8:30 ` [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts Jan Beulich
2022-04-27 13:08   ` Andrew Cooper
2022-04-27 13:57     ` Jan Beulich
2022-05-03 10:10   ` Roger Pau Monné
2022-05-03 14:34     ` Jan Beulich
2022-04-25  8:32 ` [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map() Jan Beulich
2022-04-27 13:16   ` Andrew Cooper
2022-04-27 14:05     ` Jan Beulich
2022-05-03 10:25   ` Roger Pau Monné
2022-05-03 14:37     ` Jan Beulich
2022-05-03 16:22       ` Roger Pau Monné
2022-04-25  8:32 ` [PATCH v4 03/21] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
2022-04-25  8:33 ` [PATCH v4 04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
2022-05-03 12:37   ` Roger Pau Monné
2022-05-03 14:44     ` Jan Beulich
2022-05-04 10:20       ` Roger Pau Monné
2022-04-25  8:34 ` [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
2022-05-03 13:00   ` Roger Pau Monné
2022-05-03 14:50     ` Jan Beulich
2022-05-04  9:32       ` Jan Beulich
2022-05-04 10:30         ` Roger Pau Monné
2022-05-04 10:51           ` Jan Beulich
2022-05-04 12:01             ` Roger Pau Monné
2022-05-04 12:12               ` Jan Beulich
2022-05-04 13:00                 ` Roger Pau Monné
2022-05-04 13:19                   ` Jan Beulich
2022-05-04 13:46                     ` Roger Pau Monné
2022-05-04 13:55                       ` Jan Beulich
2022-05-04 15:22                         ` Roger Pau Monné
2022-04-25  8:34 ` [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
2022-05-03 14:49   ` Roger Pau Monné
2022-05-04  9:46     ` Jan Beulich
2022-05-04 11:20       ` Roger Pau Monné
2022-05-04 12:27         ` Jan Beulich
2022-05-04 13:55           ` Roger Pau Monné
2022-05-04 14:26             ` Jan Beulich
2022-04-25  8:35 ` [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables Jan Beulich
2022-05-03 16:20   ` Roger Pau Monné
2022-05-04 13:07     ` Jan Beulich
2022-05-04 15:06       ` Roger Pau Monné
2022-05-05  8:20         ` Jan Beulich
2022-05-05  9:57           ` Roger Pau Monné
2022-04-25  8:36 ` [PATCH v4 08/21] AMD/IOMMU: walk trees upon page fault Jan Beulich
2022-05-04 15:57   ` Roger Pau Monné
2022-04-25  8:37 ` [PATCH v4 09/21] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
2022-04-25  8:38 ` [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings Jan Beulich
2022-05-05 13:19   ` Roger Pau Monné
2022-05-05 14:34     ` Jan Beulich
2022-05-05 15:26       ` Roger Pau Monné
2022-04-25  8:38 ` [PATCH v4 11/21] VT-d: " Jan Beulich
2022-05-05 16:20   ` Roger Pau Monné
2022-05-06  6:13     ` Jan Beulich
2022-04-25  8:40 ` [PATCH v4 12/21] IOMMU: fold flush-all hook into "flush one" Jan Beulich
2022-05-06  8:38   ` Roger Pau Monné
2022-05-06  9:59     ` Jan Beulich
2022-04-25  8:40 ` [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables Jan Beulich
2022-05-06 11:16   ` Roger Pau Monné
2022-05-19 12:12     ` Jan Beulich
2022-05-20 10:47       ` Roger Pau Monné
2022-05-20 11:11         ` Jan Beulich
2022-05-20 11:13           ` Jan Beulich
2022-05-20 12:22             ` Roger Pau Monné
2022-05-20 12:36               ` Jan Beulich
2022-05-20 14:28                 ` Roger Pau Monné
2022-05-20 14:38                   ` Roger Pau Monné
2022-05-23  6:49                     ` Jan Beulich
2022-05-23  9:10                       ` Roger Pau Monné
2022-05-23 10:52                         ` Jan Beulich
2022-04-25  8:41 ` [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in " Jan Beulich
2022-05-06 13:25   ` Roger Pau Monné
2022-05-18 10:06     ` Jan Beulich
2022-05-20 10:22       ` Roger Pau Monné
2022-05-20 10:59         ` Jan Beulich
2022-05-20 11:27           ` Roger Pau Monné
2022-04-25  8:42 ` [PATCH v4 15/21] AMD/IOMMU: free all-empty " Jan Beulich
2022-05-10 13:30   ` Roger Pau Monné
2022-05-18 10:18     ` Jan Beulich
2022-04-25  8:42 ` [PATCH v4 16/21] VT-d: " Jan Beulich
2022-04-27  4:09   ` Tian, Kevin
2022-05-10 14:30   ` Roger Pau Monné
2022-05-18 10:26     ` Jan Beulich
2022-05-20  0:38       ` Tian, Kevin
2022-05-20 11:13       ` Roger Pau Monné
2022-05-27  7:40         ` Jan Beulich
2022-05-27  7:53           ` Jan Beulich
2022-05-27  9:21             ` Roger Pau Monné
2022-04-25  8:43 ` [PATCH v4 17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings Jan Beulich
2022-05-10 15:31   ` Roger Pau Monné
2022-05-18 10:40     ` Jan Beulich
2022-05-20 10:35       ` Roger Pau Monné
2022-04-25  8:43 ` [PATCH v4 18/21] VT-d: " Jan Beulich
2022-05-11 11:08   ` Roger Pau Monné
2022-05-18 10:44     ` Jan Beulich
2022-05-20 10:38       ` Roger Pau Monné
2022-04-25  8:44 ` [PATCH v4 19/21] IOMMU/x86: add perf counters for page table splitting / coalescing Jan Beulich
2022-05-11 13:48   ` Roger Pau Monné
2022-05-18 11:39     ` Jan Beulich
2022-05-20 10:41       ` Roger Pau Monné
2022-04-25  8:44 ` [PATCH v4 20/21] VT-d: fold iommu_flush_iotlb{,_pages}() Jan Beulich
2022-04-27  4:12   ` Tian, Kevin
2022-05-11 13:50   ` Roger Pau Monné
2022-04-25  8:45 ` [PATCH v4 21/21] VT-d: fold dma_pte_clear_one() into its only caller Jan Beulich
2022-04-27  4:13   ` Tian, Kevin
2022-05-11 13:57   ` Roger Pau Monné
2022-05-18 12:50 ` [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
