* [PATCH v2 0/6] xen/x86: various XPTI speedups
@ 2018-03-02  8:13 Juergen Gross
  2018-03-02  8:13 ` [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-02  8:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, wei.liu2, jbeulich, dfaggioli

This patch series aims at reducing the overhead of the XPTI Meltdown
mitigation. It is based on Jan's XPTI speedup series and Wei's series
for support of PCID and INVPCID.

Patch 1 has been posted before; the main changes in this version address
Jan's comments on v1. The main objective of that patch is to avoid
copying the L4 page table each time the guest is being activated, as
often the contents didn't change while the hypervisor was active.

Patch 2 tries to minimize flushing the TLB: there is no need to flush
it in write_ptbase() and when activating the guest.

Patch 3 sets the stage for being able to activate XPTI per domain. As a
first step it is now possible to switch XPTI off for dom0 via the xpti
boot parameter.

Patch 4 reduces the cost of TLB flushes even further: as we don't make
any use of global TLB entries with XPTI active, we can avoid removing
all global TLB entries on TLB flushes by simply disabling global pages
in CR4.

Patch 5 was originally only meant to prepare for using PCIDs in patch 6.
For that purpose it was necessary to allow CR3 values with bit 63 set
in order to avoid flushing TLB entries when writing CR3. This required
replacing Jan's rather clever state machine of positive and negative
hypervisor CR3 values with a dedicated flag byte. It turned out this
modification saves one branch on interrupt entry, speeding up the
handling by a few percent.

Patch 6 is the main performance contributor: by making use of the PCID
feature (if available), TLB entries can survive CR3 switches. The TLB
then needs to be flushed on context switches only, not when switching
between guest and hypervisor or between guest kernel and user mode.

The complete series has been verified to still mitigate Meltdown
attacks. A simple performance test (make -j 4 in the Xen hypervisor
directory) showed significant improvements compared to the state
without this series (i.e. with only Jan's and Wei's series applied).
The percentages after the numbers are always relative to XPTI off:

       XPTI off     Jan+Wei, XPTI on    +this series, XPTI on
real   1m21.169s    1m52.149s (+38%)    1m25.692s (+6%)
user   2m47.652s    2m50.054s (+1%)     2m46.428s (-1%)
sys    1m11.949s    2m21.767s (+97%)    1m23.053s (+15%)

A git branch of that series (+ Jan's and Wei's patches) is available:

https://github.com/jgross1/xen.git xpti


Juergen Gross (6):
  x86/xpti: avoid copying L4 page table contents when possible
  x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  xen/x86: support per-domain flag for xpti
  xen/x86: disable global pages for domains with XPTI active
  xen/x86: use flag byte for decision whether xen_cr3 is valid
  xen/x86: use PCID feature for XPTI

 docs/misc/xen-command-line.markdown |  8 +++-
 xen/arch/x86/cpu/mtrr/generic.c     | 37 ++++++++++-----
 xen/arch/x86/domain.c               |  1 +
 xen/arch/x86/domain_page.c          |  2 +-
 xen/arch/x86/domctl.c               |  4 ++
 xen/arch/x86/flushtlb.c             | 85 ++++++++++++++++++++++-------------
 xen/arch/x86/mm.c                   | 57 +++++++++++++++++------
 xen/arch/x86/pv/dom0_build.c        |  4 ++
 xen/arch/x86/pv/domain.c            | 90 ++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/setup.c                | 23 +++-------
 xen/arch/x86/smp.c                  |  2 +
 xen/arch/x86/smpboot.c              |  6 ++-
 xen/arch/x86/x86_64/asm-offsets.c   |  2 +
 xen/arch/x86/x86_64/compat/entry.S  |  5 +--
 xen/arch/x86/x86_64/entry.S         | 79 ++++++++++++++------------------
 xen/include/asm-x86/current.h       | 22 ++++++---
 xen/include/asm-x86/domain.h        | 38 +++++++++++-----
 xen/include/asm-x86/flushtlb.h      |  2 +
 xen/include/asm-x86/pv/domain.h     |  4 ++
 xen/include/asm-x86/x86-defns.h     |  1 +
 20 files changed, 327 insertions(+), 145 deletions(-)

-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-02  8:13 [PATCH v2 0/6] xen/x86: various XPTI speedups Juergen Gross
@ 2018-03-02  8:13 ` Juergen Gross
  2018-03-05 16:43   ` Jan Beulich
       [not found]   ` <5A9D81DC02000078001AEB68@suse.com>
  2018-03-02  8:13 ` [PATCH v2 2/6] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-02  8:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, wei.liu2, jbeulich, dfaggioli

For mitigation of Meltdown the current L4 page table is copied to the
CPU-local root page table each time a 64-bit PV guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag indicating whether the copy should be performed,
and set that flag only when loading a new %cr3 or modifying the L4 page
table. This includes synchronization of the CPU-local root page table
with other CPUs, so add a dedicated synchronization flag for that case.

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds

Signed-off-by: Juergen Gross <jgross@suse.com>
---
To be applied on top of Jan's "Meltdown band-aid overhead reduction"
series
---
 xen/arch/x86/mm.c                 | 32 +++++++++++++++++++-------------
 xen/arch/x86/pv/domain.c          |  1 +
 xen/arch/x86/smp.c                |  2 ++
 xen/arch/x86/x86_64/asm-offsets.c |  1 +
 xen/arch/x86/x86_64/entry.S       |  8 ++++++--
 xen/include/asm-x86/current.h     |  8 ++++++++
 xen/include/asm-x86/flushtlb.h    |  2 ++
 7 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d8d3ee2ecd..fdc1636817 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -509,6 +509,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
+    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
+                                       !is_pv_32bit_vcpu(v);
     write_cr3(v->arch.cr3);
 }
 
@@ -3704,18 +3706,22 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                    /*
-                     * No need to sync if all uses of the page can be accounted
-                     * to the page lock we hold, its pinned status, and uses on
-                     * this (v)CPU.
-                     */
-                    if ( !rc && !cpu_has_no_xpti &&
-                         ((page->u.inuse.type_info & PGT_count_mask) >
-                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
-                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
-                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
-                            mfn))) )
-                        sync_guest = true;
+                    if ( !rc && !cpu_has_no_xpti )
+                    {
+                        get_cpu_info()->root_pgt_changed = true;
+                        /*
+                         * No need to sync if all uses of the page can be
+                         * accounted to the page lock we hold, its pinned
+                         * status, and uses on this (v)CPU.
+                         */
+                        if ( (page->u.inuse.type_info & PGT_count_mask) >
+                             (1 + !!(page->u.inuse.type_info & PGT_pinned) +
+                              (pagetable_get_pfn(curr->arch.guest_table) ==
+                               mfn) +
+                              (pagetable_get_pfn(curr->arch.guest_table_user) ==
+                               mfn)) )
+                            sync_guest = true;
+                    }
                     break;
 
                 case PGT_writable_page:
@@ -3830,7 +3836,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL);
+            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 868a23fd7e..7742d522f5 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -238,6 +238,7 @@ static void _toggle_guest_pt(struct vcpu *v, bool force_cr3)
 
     /* Don't flush user global mappings from the TLB. Don't tick TLB clock. */
     asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+    get_cpu_info()->root_pgt_changed = true;
 
     if ( !(v->arch.flags & TF_kernel_mode) )
         return;
diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index 033dd05958..60b0657ab7 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -207,6 +207,8 @@ void invalidate_interrupt(struct cpu_user_regs *regs)
     unsigned int flags = flush_flags;
     ack_APIC_irq();
     perfc_incr(ipis);
+    if ( flags & FLUSH_ROOT_PGTBL )
+        get_cpu_info()->root_pgt_changed = true;
     if ( (flags & FLUSH_VCPU_STATE) && __sync_local_execstate() )
         flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL);
     if ( flags & ~(FLUSH_VCPU_STATE | FLUSH_ORDER_MASK) )
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index e925e6589c..c9225b06c1 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -144,6 +144,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_shadow_spec_ctrl, struct cpu_info, shadow_spec_ctrl);
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
+    OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 3b578b4392..828f9ccfe8 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -48,10 +48,13 @@ restore_all_guest:
 .Lrag_cr3_start:
         mov   VCPU_cr3(%rbx), %r9
         GET_STACK_END(dx)
-        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax
+        cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
+        je    .Lrag_copy_done
+        movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
         movabs $PADDR_MASK & PAGE_MASK, %rsi
         movabs $DIRECTMAP_VIRT_START, %rcx
-        mov   %rdi, %rax
+        mov   %rax, %rdi
         and   %rsi, %rdi
         and   %r9, %rsi
         add   %rcx, %rdi
@@ -67,6 +70,7 @@ restore_all_guest:
         sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
                 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
         rep movsq
+.Lrag_copy_done:
         mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
         mov   %rdi, %rsi
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 4678a0fcf5..3c96c173c2 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -59,6 +59,14 @@ struct cpu_info {
     bool         use_shadow_spec_ctrl;
     uint8_t      bti_ist_info;
 
+    /*
+     * The following field controls copying of the L4 page table of 64-bit
+     * PV guests to the per-cpu root page table on entering the guest context.
+     * If set the L4 page table is being copied to the root page table and
+     * the field will be reset.
+     */
+    bool         root_pgt_changed;
+
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
diff --git a/xen/include/asm-x86/flushtlb.h b/xen/include/asm-x86/flushtlb.h
index 2cade9cbfb..052f0fa403 100644
--- a/xen/include/asm-x86/flushtlb.h
+++ b/xen/include/asm-x86/flushtlb.h
@@ -103,6 +103,8 @@ void write_cr3(unsigned long cr3);
 #define FLUSH_VA_VALID   0x800
  /* Flush CPU state */
 #define FLUSH_VCPU_STATE 0x1000
+ /* Flush the per-cpu root page table */
+#define FLUSH_ROOT_PGTBL 0x2000
 
 /* Flush local TLBs/caches. */
 unsigned int flush_area_local(const void *va, unsigned int flags);
-- 
2.13.6




* [PATCH v2 2/6] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  2018-03-02  8:13 [PATCH v2 0/6] xen/x86: various XPTI speedups Juergen Gross
  2018-03-02  8:13 ` [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
@ 2018-03-02  8:13 ` Juergen Gross
  2018-03-05 16:49   ` Jan Beulich
       [not found]   ` <5A9D831F02000078001AEB7E@suse.com>
  2018-03-02  8:14 ` [PATCH v2 3/6] xen/x86: support per-domain flag for xpti Juergen Gross
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-02  8:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, wei.liu2, jbeulich, dfaggioli

When switching to a 64-bit PV context the TLB is flushed twice today:
the first time when switching to the new address space in
write_ptbase(), the second time when switching to guest mode in
restore_all_guest.

Avoid the first TLB flush in that case.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/mm.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fdc1636817..cf512ee306 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -123,6 +123,7 @@
 #include <asm/io_apic.h>
 #include <asm/pci.h>
 #include <asm/guest.h>
+#include <asm/processor.h>
 
 #include <asm/hvm/grant_table.h>
 #include <asm/pv/grant_table.h>
@@ -509,9 +510,16 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
-                                       !is_pv_32bit_vcpu(v);
-    write_cr3(v->arch.cr3);
+    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+    {
+        get_cpu_info()->root_pgt_changed = true;
+        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+    }
+    else
+    {
+        get_cpu_info()->root_pgt_changed = false;
+        write_cr3(v->arch.cr3);
+    }
 }
 
 /*
-- 
2.13.6




* [PATCH v2 3/6] xen/x86: support per-domain flag for xpti
  2018-03-02  8:13 [PATCH v2 0/6] xen/x86: various XPTI speedups Juergen Gross
  2018-03-02  8:13 ` [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
  2018-03-02  8:13 ` [PATCH v2 2/6] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
@ 2018-03-02  8:14 ` Juergen Gross
  2018-03-08 10:17   ` Jan Beulich
       [not found]   ` <5AA11BDE02000078001AFB92@suse.com>
  2018-03-02  8:14 ` [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active Juergen Gross
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-02  8:14 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, wei.liu2, jbeulich, dfaggioli

Instead of switching XPTI globally on or off, add a per-domain flag for
that purpose. This allows modifying the xpti boot parameter to support
running dom0 without Meltdown mitigations: using "xpti=nodom0" as boot
parameter will achieve that.

Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is PV-domain specific.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 docs/misc/xen-command-line.markdown |  8 +++-
 xen/arch/x86/domctl.c               |  4 ++
 xen/arch/x86/mm.c                   |  6 ++-
 xen/arch/x86/pv/dom0_build.c        |  4 ++
 xen/arch/x86/pv/domain.c            | 85 ++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/setup.c                | 20 +--------
 xen/arch/x86/smpboot.c              |  4 +-
 xen/arch/x86/x86_64/entry.S         |  2 +
 xen/include/asm-x86/current.h       |  3 +-
 xen/include/asm-x86/domain.h        |  3 ++
 xen/include/asm-x86/pv/domain.h     |  4 ++
 11 files changed, 119 insertions(+), 24 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index a95195f246..6122d6d7d5 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1940,7 +1940,7 @@ clustered mode.  The default, given no hint from the **FADT**, is cluster
 mode.
 
 ### xpti
-> `= <boolean>`
+> `= nodom0 | <boolean>`
 
 > Default: `false` on AMD hardware
 > Default: `true` everywhere else
@@ -1948,6 +1948,12 @@ mode.
 Override default selection of whether to isolate 64-bit PV guest page
 tables.
 
+`true` activates page table isolation even on AMD hardware.
+
+`false` deactivates page table isolation on all systems.
+
+`nodom0` deactivates page table isolation for dom0.
+
 ### xsave
 > `= <boolean>`
 
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 8fbbf3aeb3..0704f398c7 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -24,6 +24,7 @@
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/support.h>
 #include <asm/processor.h>
+#include <asm/pv/domain.h>
 #include <asm/acpi.h> /* for hvm_acpi_power_button */
 #include <xen/hypercall.h> /* for arch_do_domctl */
 #include <xsm/xsm.h>
@@ -610,6 +611,9 @@ long arch_do_domctl(
             ret = switch_compat(d);
         else
             ret = -EINVAL;
+
+        if ( ret == 0 )
+            xpti_domain_init(d);
         break;
 
     case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index cf512ee306..0d0badea86 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -510,15 +510,19 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
         get_cpu_info()->root_pgt_changed = true;
+        get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
         asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
     }
     else
     {
         get_cpu_info()->root_pgt_changed = false;
+        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+        get_cpu_info()->xen_cr3 = 0;
         write_cr3(v->arch.cr3);
+        get_cpu_info()->pv_cr3 = 0;
     }
 }
 
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 0bd2f1bf90..21e8cd56bf 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -19,6 +19,7 @@
 #include <asm/dom0_build.h>
 #include <asm/guest.h>
 #include <asm/page.h>
+#include <asm/pv/domain.h>
 #include <asm/pv/mm.h>
 #include <asm/setup.h>
 
@@ -707,6 +708,9 @@ int __init dom0_construct_pv(struct domain *d,
             cpu = p->processor;
     }
 
+    if ( !is_pv_32bit_domain(d) )
+        xpti_domain_init(d);
+
     d->arch.paging.mode = 0;
 
     /* Set up CR3 value for write_ptbase */
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 7742d522f5..5f15c9e25b 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -9,6 +9,8 @@
 #include <xen/lib.h>
 #include <xen/sched.h>
 
+#include <asm/cpufeature.h>
+#include <asm/msr-index.h>
 #include <asm/pv/domain.h>
 
 /* Override macros from asm/page.h to make them work with mfn_t */
@@ -17,6 +19,87 @@
 #undef page_to_mfn
 #define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
 
+static __read_mostly enum {
+    XPTI_DEFAULT,
+    XPTI_ON,
+    XPTI_OFF,
+    XPTI_NODOM0
+} opt_xpti = XPTI_DEFAULT;
+
+static int parse_xpti(const char *s)
+{
+    int rc = 0;
+
+    switch ( parse_bool(s, NULL) )
+    {
+    case 0:
+        opt_xpti = XPTI_OFF;
+        break;
+    case 1:
+        opt_xpti = XPTI_ON;
+        break;
+    default:
+        if ( !strcmp(s, "default") )
+            opt_xpti = XPTI_DEFAULT;
+        else if ( !strcmp(s, "nodom0") )
+            opt_xpti = XPTI_NODOM0;
+        else
+            rc = -EINVAL;
+        break;
+    }
+
+    return rc;
+}
+
+custom_param("xpti", parse_xpti);
+
+void xpti_init(void)
+{
+    uint64_t caps = 0;
+
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
+        caps = ARCH_CAPABILITIES_RDCL_NO;
+    else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( opt_xpti != XPTI_ON && (caps & ARCH_CAPABILITIES_RDCL_NO) )
+        opt_xpti = XPTI_OFF;
+    else if ( opt_xpti == XPTI_DEFAULT )
+        opt_xpti = XPTI_ON;
+
+    if ( opt_xpti == XPTI_OFF )
+        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
+    else
+        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
+}
+
+void xpti_domain_init(struct domain *d)
+{
+    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
+        return;
+
+    switch ( opt_xpti )
+    {
+    case XPTI_OFF:
+        d->arch.pv_domain.xpti = false;
+        break;
+    case XPTI_ON:
+        d->arch.pv_domain.xpti = true;
+        break;
+    case XPTI_NODOM0:
+        d->arch.pv_domain.xpti = d->domain_id != 0 &&
+                                 d->domain_id != hardware_domid;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+
+    if ( d->arch.pv_domain.xpti )
+        printk("Enabling Xen Pagetable protection (XPTI) for Domain %d\n",
+               d->domain_id);
+}
+
 static void noreturn continue_nonidle_domain(struct vcpu *v)
 {
     check_wakeup_from_wait();
@@ -266,7 +349,7 @@ void toggle_guest_mode(struct vcpu *v)
     }
     asm volatile ( "swapgs" );
 
-    _toggle_guest_pt(v, cpu_has_no_xpti);
+    _toggle_guest_pt(v, !v->domain->arch.pv_domain.xpti);
 }
 
 void toggle_guest_pt(struct vcpu *v)
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 148054c438..7c9fbfe04a 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -53,6 +53,7 @@
 #include <asm/cpuid.h>
 #include <asm/spec_ctrl.h>
 #include <asm/guest.h>
+#include <asm/pv/domain.h>
 
 /* opt_nosmp: If true, secondary processors are ignored. */
 static bool __initdata opt_nosmp;
@@ -169,9 +170,6 @@ static int __init parse_smap_param(const char *s)
 }
 custom_param("smap", parse_smap_param);
 
-static int8_t __initdata opt_xpti = -1;
-boolean_param("xpti", opt_xpti);
-
 bool __read_mostly acpi_disabled;
 bool __initdata acpi_force;
 static char __initdata acpi_param[10] = "";
@@ -1544,21 +1542,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( opt_xpti < 0 )
-    {
-        uint64_t caps = 0;
-
-        if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
-            caps = ARCH_CAPABILITIES_RDCL_NO;
-        else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
-            rdmsrl(MSR_ARCH_CAPABILITIES, caps);
-
-        opt_xpti = !(caps & ARCH_CAPABILITIES_RDCL_NO);
-    }
-    if ( opt_xpti )
-        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
-    else
-        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
+    xpti_init();
 
     if ( cpu_has_fsgsbase )
         set_in_cr4(X86_CR4_FSGSBASE);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index c587e4db9f..60604f4535 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -331,7 +331,7 @@ void start_secondary(void *unused)
     spin_debug_disable();
 
     get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = this_cpu(root_pgt) ? __pa(this_cpu(root_pgt)) : 0;
+    get_cpu_info()->pv_cr3 = 0;
 
     load_system_tables();
 
@@ -1050,7 +1050,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
         panic("Error %d setting up PV root page table\n", rc);
     if ( per_cpu(root_pgt, 0) )
     {
-        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
+        get_cpu_info()->pv_cr3 = 0;
         /*
          * All entry points which may need to switch page tables have to start
          * with interrupts off. Re-write what pv_trap_init() has put there.
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 828f9ccfe8..cdcdc2c40a 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -49,6 +49,8 @@ restore_all_guest:
         mov   VCPU_cr3(%rbx), %r9
         GET_STACK_END(dx)
         mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax
+        test  %rax, %rax
+        jz    .Lrag_cr3_end
         cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
         je    .Lrag_copy_done
         movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 3c96c173c2..d5236c82de 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -44,7 +44,8 @@ struct cpu_info {
     /*
      * Of the two following fields the latter is being set to the CR3 value
      * to be used on the given pCPU for loading whenever 64-bit PV guest
-     * context is being entered. The value never changes once set.
+     * context is being entered. A value of zero indicates no setting of CR3
+     * is to be performed.
      * The former is the value to restore when re-entering Xen, if any. IOW
      * its value being zero means there's nothing to restore. However, its
      * value can also be negative, indicating to the exit-to-Xen code that
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 4679d5477d..0cc37dea05 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -257,6 +257,9 @@ struct pv_domain
     struct mapcache_domain mapcache;
 
     struct cpuidmasks *cpuidmasks;
+
+    /* XPTI active? */
+    bool xpti;
 };
 
 struct monitor_write_data {
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index acdf140fbd..c6fdf3fe56 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -28,6 +28,8 @@ int pv_vcpu_initialise(struct vcpu *v);
 void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d, unsigned int domcr_flags,
                          struct xen_arch_domainconfig *config);
+void xpti_init(void);
+void xpti_domain_init(struct domain *d);
 
 #else  /* !CONFIG_PV */
 
@@ -42,6 +44,8 @@ static inline int pv_domain_initialise(struct domain *d,
 {
     return -EOPNOTSUPP;
 }
+static inline void xpti_init(void) {}
+static inline void xpti_domain_init(struct domain *d) {}
 #endif	/* CONFIG_PV */
 
 void paravirt_ctxt_switch_from(struct vcpu *v);
-- 
2.13.6




* [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-02  8:13 [PATCH v2 0/6] xen/x86: various XPTI speedups Juergen Gross
                   ` (2 preceding siblings ...)
  2018-03-02  8:14 ` [PATCH v2 3/6] xen/x86: support per-domain flag for xpti Juergen Gross
@ 2018-03-02  8:14 ` Juergen Gross
  2018-03-02 11:03   ` Wei Liu
                     ` (3 more replies)
  2018-03-02  8:14 ` [PATCH v2 5/6] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
                   ` (2 subsequent siblings)
  6 siblings, 4 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-02  8:14 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, wei.liu2, jbeulich, dfaggioli

Instead of flushing global pages from the TLB when switching address
spaces with XPTI active, just disable global pages completely via %cr4
whenever a domain subject to XPTI is active. This avoids the need for
extra TLB flushes, as loading %cr3 will then remove all TLB entries.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/cpu/mtrr/generic.c | 32 +++++++++++++++++++++-----------
 xen/arch/x86/flushtlb.c         | 39 +++++++++++++++++++++++++--------------
 xen/arch/x86/x86_64/entry.S     | 10 ----------
 xen/include/asm-x86/domain.h    |  3 ++-
 4 files changed, 48 insertions(+), 36 deletions(-)

diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c
index e9c0e5e059..d705138100 100644
--- a/xen/arch/x86/cpu/mtrr/generic.c
+++ b/xen/arch/x86/cpu/mtrr/generic.c
@@ -400,8 +400,10 @@ static DEFINE_SPINLOCK(set_atomicity_lock);
  * has been called.
  */
 
-static void prepare_set(void)
+static bool prepare_set(void)
 {
+	unsigned long cr4;
+
 	/*  Note that this is not ideal, since the cache is only flushed/disabled
 	   for this CPU while the MTRRs are changed, but changing this requires
 	   more invasive changes to the way the kernel boots  */
@@ -412,18 +414,22 @@ static void prepare_set(void)
 	write_cr0(read_cr0() | X86_CR0_CD);
 	wbinvd();
 
-	/*  TLB flushing here relies on Xen always using CR4.PGE. */
-	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
-	write_cr4(read_cr4() & ~X86_CR4_PGE);
+	cr4 = read_cr4();
+	if (cr4 & X86_CR4_PGE)
+		write_cr4(cr4 & ~X86_CR4_PGE);
+	else
+		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
 	/*  Save MTRR state */
 	rdmsrl(MSR_MTRRdefType, deftype);
 
 	/*  Disable MTRRs, and set the default type to uncached  */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
+
+	return !!(cr4 & X86_CR4_PGE);
 }
 
-static void post_set(void)
+static void post_set(bool pge)
 {
 	/* Intel (P6) standard MTRRs */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype);
@@ -432,7 +438,10 @@ static void post_set(void)
 	write_cr0(read_cr0() & ~X86_CR0_CD);
 
 	/*  Reenable CR4.PGE (also flushes the TLB) */
-	write_cr4(read_cr4() | X86_CR4_PGE);
+	if (pge)
+		write_cr4(read_cr4() | X86_CR4_PGE);
+	else
+		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
 	spin_unlock(&set_atomicity_lock);
 }
@@ -441,14 +450,15 @@ static void generic_set_all(void)
 {
 	unsigned long mask, count;
 	unsigned long flags;
+	bool pge;
 
 	local_irq_save(flags);
-	prepare_set();
+	pge = prepare_set();
 
 	/* Actually set the state */
 	mask = set_mtrr_state();
 
-	post_set();
+	post_set(pge);
 	local_irq_restore(flags);
 
 	/*  Use the atomic bitops to update the global mask  */
@@ -457,7 +467,6 @@ static void generic_set_all(void)
 			set_bit(count, &smp_changes_mask);
 		mask >>= 1;
 	}
-	
 }
 
 static void generic_set_mtrr(unsigned int reg, unsigned long base,
@@ -474,11 +483,12 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
 {
 	unsigned long flags;
 	struct mtrr_var_range *vr;
+	bool pge;
 
 	vr = &mtrr_state.var_ranges[reg];
 
 	local_irq_save(flags);
-	prepare_set();
+	pge = prepare_set();
 
 	if (size == 0) {
 		/* The invalid bit is kept in the mask, so we simply clear the
@@ -499,7 +509,7 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
 		mtrr_wrmsr(MSR_IA32_MTRR_PHYSMASK(reg), vr->mask);
 	}
 
-	post_set();
+	post_set(pge);
 	local_irq_restore(flags);
 }
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index e4ea4f3297..186d9099f6 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -72,20 +72,39 @@ static void post_flush(u32 t)
     this_cpu(tlbflush_time) = t;
 }
 
+static void do_flush_tlb(unsigned long cr3)
+{
+    unsigned long cr4;
+
+    cr4 = read_cr4();
+    if ( cr4 & X86_CR4_PGE )
+    {
+        write_cr4(cr4 & ~X86_CR4_PGE);
+        if ( cr3 )
+            asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
+        else
+            barrier();
+        write_cr4(cr4);
+    }
+    else
+    {
+        if ( !cr3 )
+            cr3 = read_cr3();
+        asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
+    }
+}
+
 void write_cr3(unsigned long cr3)
 {
-    unsigned long flags, cr4;
+    unsigned long flags;
     u32 t;
 
     /* This non-reentrant function is sometimes called in interrupt context. */
     local_irq_save(flags);
 
     t = pre_flush();
-    cr4 = read_cr4();
 
-    write_cr4(cr4 & ~X86_CR4_PGE);
-    asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
-    write_cr4(cr4);
+    do_flush_tlb(cr3);
 
     post_flush(t);
 
@@ -123,22 +142,14 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
             u32 t = pre_flush();
 
             if ( !cpu_has_invpcid )
-            {
-                unsigned long cr4 = read_cr4();
-
-                write_cr4(cr4 & ~X86_CR4_PGE);
-                barrier();
-                write_cr4(cr4);
-            }
+                do_flush_tlb(0);
             else
-            {
                 /*
                  * Using invpcid to flush all mappings works
                  * regardless of whether PCID is enabled or not.
                  * It is faster than read-modify-write CR4.
                  */
                 invpcid_flush_all();
-            }
 
             post_flush(t);
         }
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index cdcdc2c40a..a8d38e7eb2 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -73,13 +73,8 @@ restore_all_guest:
                 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
         rep movsq
 .Lrag_copy_done:
-        mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
-        mov   %rdi, %rsi
-        and   $~X86_CR4_PGE, %rdi
-        mov   %rdi, %cr4
         mov   %rax, %cr3
-        mov   %rsi, %cr4
 .Lrag_cr3_end:
         ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
 
@@ -136,12 +131,7 @@ restore_all_xen:
          * so "g" will have to do.
          */
 UNLIKELY_START(g, exit_cr3)
-        mov   %cr4, %rdi
-        mov   %rdi, %rsi
-        and   $~X86_CR4_PGE, %rdi
-        mov   %rdi, %cr4
         mov   %rax, %cr3
-        mov   %rsi, %cr4
 UNLIKELY_END(exit_cr3)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 0cc37dea05..316418a6fe 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -622,7 +622,8 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
             X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
             X86_CR4_FSGSBASE))                              \
       | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
-     & ~X86_CR4_DE)
+     & ~(X86_CR4_DE |                                       \
+         ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))
 #define real_cr4_to_pv_guest_cr4(c)                         \
     ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
              X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 5/6] xen/x86: use flag byte for decision whether xen_cr3 is valid
  2018-03-02  8:13 [PATCH v2 0/6] xen/x86: various XPTI speedups Juergen Gross
                   ` (3 preceding siblings ...)
  2018-03-02  8:14 ` [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active Juergen Gross
@ 2018-03-02  8:14 ` Juergen Gross
  2018-03-08 14:24   ` Jan Beulich
       [not found]   ` <5AA155BE02000078001AFD89@suse.com>
  2018-03-02  8:14 ` [PATCH v2 6/6] xen/x86: use PCID feature for XPTI Juergen Gross
  2018-03-05 16:20 ` [PATCH v2 0/6] xen/x86: various XPTI speedups Dario Faggioli
  6 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-02  8:14 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, wei.liu2, jbeulich, dfaggioli

Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.

Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.

This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).
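
The effect of the encoding change can be sketched as follows (illustrative
C, not Xen code): under the old signed-value scheme, any CR3 value with
bit 63 set (e.g. one carrying the NOFLUSH bit used with PCID) would be
misread as "negative, do not restore", whereas the separate flag byte
keeps "which value" and "whether to load it" independent:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CR3 NOFLUSH bit, as used by the PCID patch later in this
 * series. */
#define CR3_NOFLUSH (1ULL << 63)

/* Old scheme: restore on exit to Xen only if the signed value is positive. */
static bool old_must_restore(int64_t xen_cr3)
{
    return xen_cr3 > 0;
}

/* New scheme: the decision lives in a dedicated flag byte, so the value
 * itself may have bit 63 set. */
static bool new_must_restore(uint64_t xen_cr3, bool use_xen_cr3)
{
    (void)xen_cr3;
    return use_xen_cr3;
}
```

With a NOFLUSH-tagged CR3 the old test gives the wrong answer, which is
why the flag byte is needed before PCID can be used.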

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/domain.c              |  1 +
 xen/arch/x86/mm.c                  |  1 +
 xen/arch/x86/smpboot.c             |  2 ++
 xen/arch/x86/x86_64/asm-offsets.c  |  1 +
 xen/arch/x86/x86_64/compat/entry.S |  5 ++--
 xen/arch/x86/x86_64/entry.S        | 59 ++++++++++++++++----------------------
 xen/include/asm-x86/current.h      | 11 ++++---
 7 files changed, 39 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 1f8b08ef02..c01ae60296 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1698,6 +1698,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
     ASSERT(local_irq_is_enabled());
 
     get_cpu_info()->xen_cr3 = 0;
+    get_cpu_info()->use_xen_cr3 = false;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
     {
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0d0badea86..2d8366a01c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -520,6 +520,7 @@ void write_ptbase(struct vcpu *v)
     {
         get_cpu_info()->root_pgt_changed = false;
         /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+        get_cpu_info()->use_xen_cr3 = false;
         get_cpu_info()->xen_cr3 = 0;
         write_cr3(v->arch.cr3);
         get_cpu_info()->pv_cr3 = 0;
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 60604f4535..1b665991bd 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -330,6 +330,7 @@ void start_secondary(void *unused)
      */
     spin_debug_disable();
 
+    get_cpu_info()->use_xen_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
     get_cpu_info()->pv_cr3 = 0;
 
@@ -1129,6 +1130,7 @@ void __init smp_prepare_boot_cpu(void)
     per_cpu(scratch_cpumask, cpu) = &scratch_cpu0mask;
 #endif
 
+    get_cpu_info()->use_xen_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
     get_cpu_info()->pv_cr3 = 0;
 }
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index c9225b06c1..d276f70a75 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -145,6 +145,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
     OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
+    OFFSET(CPUINFO_use_xen_cr3, struct cpu_info, use_xen_cr3);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 27cdf244e6..74055eade8 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -218,10 +218,9 @@ ENTRY(cstar_enter)
         GET_STACK_END(bx)
 .Lcstar_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lcstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%rbx)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
 .Lcstar_cr3_okay:
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index a8d38e7eb2..2c9ce8d821 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -74,6 +74,7 @@ restore_all_guest:
         rep movsq
 .Lrag_copy_done:
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
+        movb  $1, STACK_CPUINFO_FIELD(use_xen_cr3)(%rdx)
         mov   %rax, %cr3
 .Lrag_cr3_end:
         ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
@@ -123,14 +124,9 @@ restore_all_xen:
          * case we return to late PV exit code (from an NMI or #MC).
          */
         GET_STACK_END(bx)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rdx
+        cmpb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%rbx)
+UNLIKELY_START(ne, exit_cr3)
         mov   STACK_CPUINFO_FIELD(pv_cr3)(%rbx), %rax
-        test  %rdx, %rdx
-        /*
-         * Ideally the condition would be "nsz", but such doesn't exist,
-         * so "g" will have to do.
-         */
-UNLIKELY_START(g, exit_cr3)
         mov   %rax, %cr3
 UNLIKELY_END(exit_cr3)
 
@@ -173,10 +169,9 @@ ENTRY(lstar_enter)
         GET_STACK_END(bx)
 .Llstar_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Llstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%rbx)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
 .Llstar_cr3_okay:
@@ -283,10 +278,9 @@ GLOBAL(sysenter_eflags_saved)
         GET_STACK_END(bx)
 .Lsyse_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lsyse_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%rbx)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
 .Lsyse_cr3_okay:
@@ -336,10 +330,9 @@ ENTRY(int80_direct_trap)
         GET_STACK_END(bx)
 .Lint80_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lint80_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%rbx)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
 .Lint80_cr3_okay:
@@ -523,18 +516,17 @@ ENTRY(common_interrupt)
 
 .Lintr_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_xen_cr3)(%r14), %bl
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lintr_cr3_okay
-        jns   .Lintr_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lintr_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%r14)
         mov   %rcx, %cr3
         xor   %ecx, %ecx
         mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3, UREGS_cs(%rsp)
         cmovnz %rcx, %r15
+        cmovnz %cx, %bx
 .Lintr_cr3_okay:
 
         CR4_PV32_RESTORE
@@ -542,6 +534,7 @@ ENTRY(common_interrupt)
         callq do_IRQ
 .Lintr_cr3_restore:
         mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        mov   %bl, STACK_CPUINFO_FIELD(use_xen_cr3)(%r14)
 .Lintr_cr3_end:
         jmp ret_from_intr
 
@@ -571,18 +564,17 @@ GLOBAL(handle_exception)
 
 .Lxcpt_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_xen_cr3)(%r14), %r13b
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lxcpt_cr3_okay
-        jns   .Lxcpt_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lxcpt_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%r14)
         mov   %rcx, %cr3
         xor   %ecx, %ecx
         mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3, UREGS_cs(%rsp)
         cmovnz %rcx, %r15
+        cmovnz %rcx, %r13
 .Lxcpt_cr3_okay:
 
 handle_exception_saved:
@@ -652,6 +644,7 @@ handle_exception_saved:
         INDIRECT_CALL %rdx
 .Lxcpt_cr3_restore1:
         mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        mov   %r13b, STACK_CPUINFO_FIELD(use_xen_cr3)(%r14)
 .Lxcpt_cr3_end1:
         testb $3,UREGS_cs(%rsp)
         jz    restore_all_xen
@@ -687,6 +680,7 @@ exception_with_ints_disabled:
         movq  %rax,UREGS_kernel_sizeof(%rsp)
 .Lxcpt_cr3_restore2:
         mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        mov   %r13b, STACK_CPUINFO_FIELD(use_xen_cr3)(%r14)
 .Lxcpt_cr3_end2:
         jmp   restore_all_xen           # return to fixup code
 
@@ -781,9 +775,6 @@ ENTRY(double_fault)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx
         test  %rbx, %rbx
         jz    .Ldblf_cr3_okay
-        jns   .Ldblf_cr3_load
-        neg   %rbx
-.Ldblf_cr3_load:
         mov   %rbx, %cr3
 .Ldblf_cr3_okay:
 
@@ -812,13 +803,11 @@ handle_ist_exception:
         /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
 
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_xen_cr3)(%r14), %bl
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .List_cr3_okay
-        jns   .List_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.List_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%r14)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
 .List_cr3_okay:
@@ -831,6 +820,7 @@ handle_ist_exception:
          * and copy the context to stack bottom.
          */
         xor   %r15, %r15
+        xor   %bl, %bl
         GET_CPUINFO_FIELD(guest_cpu_user_regs,di)
         movq  %rsp,%rsi
         movl  $UREGS_kernel_sizeof/8,%ecx
@@ -842,6 +832,7 @@ handle_ist_exception:
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
         mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        mov   %bl, STACK_CPUINFO_FIELD(use_xen_cr3)(%r14)
         cmpb  $TRAP_nmi,UREGS_entry_vector(%rsp)
         jne   ret_from_intr
 
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index d5236c82de..34a9512d93 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -47,10 +47,7 @@ struct cpu_info {
      * context is being entered. A value of zero indicates no setting of CR3
      * is to be performed.
      * The former is the value to restore when re-entering Xen, if any. IOW
-     * its value being zero means there's nothing to restore. However, its
-     * value can also be negative, indicating to the exit-to-Xen code that
-     * restoring is not necessary, but allowing any nested entry code paths
-     * to still know the value to put back into CR3.
+     * its value being zero means there's nothing to restore.
      */
     unsigned long xen_cr3;
     unsigned long pv_cr3;
@@ -68,6 +65,12 @@ struct cpu_info {
      */
     bool         root_pgt_changed;
 
+    /*
+     * use_xen_cr3 is set in case the value of xen_cr3 is to be written into
+     * CR3 when entering the hypervisor.
+     */
+    bool         use_xen_cr3;
+
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
-- 
2.13.6



* [PATCH v2 6/6] xen/x86: use PCID feature for XPTI
  2018-03-02  8:13 [PATCH v2 0/6] xen/x86: various XPTI speedups Juergen Gross
                   ` (4 preceding siblings ...)
  2018-03-02  8:14 ` [PATCH v2 5/6] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
@ 2018-03-02  8:14 ` Juergen Gross
  2018-03-08 15:27   ` Jan Beulich
  2018-03-05 16:20 ` [PATCH v2 0/6] xen/x86: various XPTI speedups Dario Faggioli
  6 siblings, 1 reply; 38+ messages in thread
From: Juergen Gross @ 2018-03-02  8:14 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, wei.liu2, jbeulich, dfaggioli

Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.

We are using 4 PCID values for a 64-bit pv domain subject to XPTI:

- hypervisor active and guest in kernel mode
- guest active and in kernel mode
- hypervisor active and guest in user mode
- guest active and in user mode

The 2 hypervisor cases could possibly be merged, but for security
reasons this is left for another patch.

Add a pcid flag to struct pv_domain to make it possible to use PCID
without XPTI later.
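
The four address spaces above map onto PCID values as follows
(illustrative C sketch, not the actual Xen code; the names mirror the
PCID_PV_PRIV/PCID_PV_USER/PCID_PV_XEN constants added by this patch):

```c
#include <assert.h>
#include <stdbool.h>

/* PCID values for the address spaces, as introduced by this patch. */
#define PCID_PV_PRIV 0x0001   /* guest kernel mode */
#define PCID_PV_USER 0x0002   /* guest user mode */
#define PCID_PV_XEN  0x0004   /* ORed in while the hypervisor is active */

/* Standalone re-statement of the get_pv_pcid() selection logic. */
static unsigned int pv_pcid(bool xen_active, bool kernel_mode)
{
    return (xen_active ? PCID_PV_XEN : 0) |
           (kernel_mode ? PCID_PV_PRIV : PCID_PV_USER);
}
```

Each of the four combinations thus gets a distinct, non-zero PCID, so a
CR3 load with the NOFLUSH bit set can preserve the TLB entries of the
other three address spaces.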

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/cpu/mtrr/generic.c |  5 +++
 xen/arch/x86/domain_page.c      |  2 +-
 xen/arch/x86/flushtlb.c         | 74 +++++++++++++++++++++++------------------
 xen/arch/x86/mm.c               | 12 ++++++-
 xen/arch/x86/pv/domain.c        |  4 +++
 xen/arch/x86/setup.c            |  3 ++
 xen/include/asm-x86/domain.h    | 34 +++++++++++++------
 xen/include/asm-x86/x86-defns.h |  1 +
 8 files changed, 90 insertions(+), 45 deletions(-)

diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c
index d705138100..84b9cd78df 100644
--- a/xen/arch/x86/cpu/mtrr/generic.c
+++ b/xen/arch/x86/cpu/mtrr/generic.c
@@ -5,6 +5,7 @@
 #include <xen/mm.h>
 #include <xen/stdbool.h>
 #include <asm/flushtlb.h>
+#include <asm/invpcid.h>
 #include <asm/io.h>
 #include <asm/mtrr.h>
 #include <asm/msr.h>
@@ -417,6 +418,8 @@ static bool prepare_set(void)
 	cr4 = read_cr4();
 	if (cr4 & X86_CR4_PGE)
 		write_cr4(cr4 & ~X86_CR4_PGE);
+	else if ( cpu_has_invpcid )
+		invpcid_flush_all();
 	else
 		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
@@ -440,6 +443,8 @@ static void post_set(bool pge)
 	/*  Reenable CR4.PGE (also flushes the TLB) */
 	if (pge)
 		write_cr4(read_cr4() | X86_CR4_PGE);
+	else if ( cpu_has_invpcid )
+		invpcid_flush_all();
 	else
 		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 3432a854dd..e4b7f74f34 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
             sync_local_execstate();
         /* We must now be running on the idle page table. */
-        ASSERT(read_cr3() == __pa(idle_pg_table));
+        ASSERT((read_cr3() & ~X86_CR3_PCIDMASK) == __pa(idle_pg_table));
     }
 
     return v;
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 186d9099f6..a65fad00ed 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -75,39 +75,46 @@ static void post_flush(u32 t)
 static void do_flush_tlb(unsigned long cr3)
 {
     unsigned long cr4;
+    u32 t;
+
+    t = pre_flush();
 
     cr4 = read_cr4();
-    if ( cr4 & X86_CR4_PGE )
+
+    if ( cpu_has_invpcid )
     {
-        write_cr4(cr4 & ~X86_CR4_PGE);
         if ( cr3 )
             asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
-        else
-            barrier();
-        write_cr4(cr4);
+        if ( !cr3 || (cr3 & X86_CR3_NOFLUSH) || (cr4 & X86_CR4_PGE) )
+            invpcid_flush_all();
     }
     else
     {
-        if ( !cr3 )
+        /* PCID not possible here, as invpcid is required for PCID. */
+        if ( cr4 & X86_CR4_PGE )
+            write_cr4(cr4 & ~X86_CR4_PGE);
+        else if ( !cr3 )
             cr3 = read_cr3();
-        asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
+        if ( cr3 )
+            asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
+        else
+            barrier();
+        if ( cr4 & X86_CR4_PGE )
+            write_cr4(cr4);
     }
+
+    post_flush(t);
 }
 
 void write_cr3(unsigned long cr3)
 {
     unsigned long flags;
-    u32 t;
 
     /* This non-reentrant function is sometimes called in interrupt context. */
     local_irq_save(flags);
 
-    t = pre_flush();
-
     do_flush_tlb(cr3);
 
-    post_flush(t);
-
     local_irq_restore(flags);
 }
 
@@ -128,30 +135,33 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
     {
         if ( order == 0 )
         {
-            /*
-             * We don't INVLPG multi-page regions because the 2M/4M/1G
-             * region may not have been mapped with a superpage. Also there
-             * are various errata surrounding INVLPG usage on superpages, and
-             * a full flush is in any case not *that* expensive.
-             */
-            asm volatile ( "invlpg %0"
-                           : : "m" (*(const char *)(va)) : "memory" );
-        }
-        else
-        {
-            u32 t = pre_flush();
+            if ( read_cr3() & X86_CR3_PCIDMASK )
+            {
+                unsigned long addr = (unsigned long)va;
 
-            if ( !cpu_has_invpcid )
-                do_flush_tlb(0);
+                /*
+                 * Flush the addresses for all potential address spaces.
+                 */
+                invpcid_flush_one(PCID_PV_PRIV, addr);
+                invpcid_flush_one(PCID_PV_USER, addr);
+                invpcid_flush_one(PCID_PV_PRIV | PCID_PV_XEN, addr);
+                invpcid_flush_one(PCID_PV_USER | PCID_PV_XEN, addr);
+            }
             else
+            {
                 /*
-                 * Using invpcid to flush all mappings works
-                 * regardless of whether PCID is enabled or not.
-                 * It is faster than read-modify-write CR4.
+                 * We don't INVLPG multi-page regions because the 2M/4M/1G
+                 * region may not have been mapped with a superpage. Also there
+                 * are various errata surrounding INVLPG usage on superpages,
+                 * and a full flush is in any case not *that* expensive.
                  */
-                invpcid_flush_all();
-
-            post_flush(t);
+                asm volatile ( "invlpg %0"
+                               : : "m" (*(const char *)(va)) : "memory" );
+            }
+        }
+        else
+        {
+            do_flush_tlb(0);
         }
     }
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2d8366a01c..82fbbe0a10 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -506,6 +506,8 @@ void free_shared_domheap_page(struct page_info *page)
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
     v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
+    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.pcid )
+        v->arch.cr3 |= X86_CR3_NOFLUSH | get_pv_pcid(v, 1);
 }
 
 void write_ptbase(struct vcpu *v)
@@ -514,7 +516,15 @@ void write_ptbase(struct vcpu *v)
     {
         get_cpu_info()->root_pgt_changed = true;
         get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
-        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+        if ( v->domain->arch.pv_domain.pcid )
+        {
+            get_cpu_info()->pv_cr3 |= X86_CR3_NOFLUSH | get_pv_pcid(v, 0);
+            write_cr3(v->arch.cr3);
+        }
+        else
+        {
+            asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+        }
     }
     else
     {
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 5f15c9e25b..37338b2a01 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -96,8 +96,12 @@ void xpti_domain_init(struct domain *d)
     }
 
     if ( d->arch.pv_domain.xpti )
+    {
+        d->arch.pv_domain.pcid = cpu_has_pcid && cpu_has_invpcid;
+
         printk("Enabling Xen Pagetable protection (XPTI) for Domain %d\n",
                d->domain_id);
+    }
 }
 
 static void noreturn continue_nonidle_domain(struct vcpu *v)
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 7c9fbfe04a..781f191e6e 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1547,6 +1547,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( cpu_has_fsgsbase )
         set_in_cr4(X86_CR4_FSGSBASE);
 
+    if ( cpu_has_invpcid && cpu_has_pcid )
+        set_in_cr4(X86_CR4_PCIDE);
+
     init_speculation_mitigations();
 
     init_idle_domain();
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 316418a6fe..a2ca03583f 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -260,8 +260,20 @@ struct pv_domain
 
     /* XPTI active? */
     bool xpti;
+
+    /* Use PCID for the different address spaces? */
+    bool pcid;
 };
 
+/* PCID values for the address spaces: */
+#define PCID_PV_PRIV      0x0001
+#define PCID_PV_USER      0x0002
+#define PCID_PV_XEN       0x0004    /* To be ORed to above values. */
+
+#define get_pv_pcid(v, xen)                                              \
+    (((xen) ? PCID_PV_XEN : 0) |                                         \
+     (((v)->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER))
+
 struct monitor_write_data {
     struct {
         unsigned int msr : 1;
@@ -615,18 +627,18 @@ void vcpu_show_registers(const struct vcpu *);
 unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
 
 /* Convert between guest-visible and real CR4 values. */
-#define pv_guest_cr4_to_real_cr4(v)                         \
-    (((v)->arch.pv_vcpu.ctrlreg[4]                          \
-      | (mmu_cr4_features                                   \
-         & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |      \
-            X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
-            X86_CR4_FSGSBASE))                              \
-      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
-     & ~(X86_CR4_DE |                                       \
+#define pv_guest_cr4_to_real_cr4(v)                             \
+    (((v)->arch.pv_vcpu.ctrlreg[4]                              \
+      | (mmu_cr4_features                                       \
+         & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |          \
+            X86_CR4_SMAP | X86_CR4_OSXSAVE |                    \
+            X86_CR4_FSGSBASE | X86_CR4_PCIDE))                  \
+      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))             \
+     & ~(X86_CR4_DE |                                           \
          ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))
-#define real_cr4_to_pv_guest_cr4(c)                         \
-    ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
-             X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
+#define real_cr4_to_pv_guest_cr4(c)                             \
+    ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |          \
+             X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_PCIDE |   \
              X86_CR4_FSGSBASE | X86_CR4_SMAP))
 
 #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
diff --git a/xen/include/asm-x86/x86-defns.h b/xen/include/asm-x86/x86-defns.h
index 8598adef14..d007997f88 100644
--- a/xen/include/asm-x86/x86-defns.h
+++ b/xen/include/asm-x86/x86-defns.h
@@ -46,6 +46,7 @@
  * Intel CPU flags in CR3
  */
 #define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
+#define X86_CR3_PCIDMASK _AC(0x0000000000000fff, ULL) /* Mask for PCID */
 
 /*
  * Intel CPU features in CR4
-- 
2.13.6



* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-02  8:14 ` [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active Juergen Gross
@ 2018-03-02 11:03   ` Wei Liu
  2018-03-02 11:30     ` Juergen Gross
  2018-03-08 13:38   ` Jan Beulich
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Wei Liu @ 2018-03-02 11:03 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, dfaggioli, wei.liu2, jbeulich, andrew.cooper3

On Fri, Mar 02, 2018 at 09:14:01AM +0100, Juergen Gross wrote:
> Instead of flushing the TLB from global pages when switching address
> spaces with XPTI being active just disable global pages via %cr4
> completely when a domain subject to XPTI is active. This avoids the
> need for extra TLB flushes as loading %cr3 will remove all TLB
> entries.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>

I think your patch tries to disable PGE wholesale for Xen, but I don't
think this patch catches all of them.

XEN_MINIMAL_CR4 still has PGE in it. EFI path and the stack resyncing
asm in __start_xen are missing. 

You also missed fixing the comment of setting MTRR. It is not a big deal
though.

I have in fact written a small series to make CR4.PGE configurable but
haven't got around to send the patches yet because I didn't think it is
very useful in its own.

Wei.


* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-02 11:03   ` Wei Liu
@ 2018-03-02 11:30     ` Juergen Gross
  0 siblings, 0 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-02 11:30 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, dfaggioli, jbeulich, andrew.cooper3

On 02/03/18 12:03, Wei Liu wrote:
> On Fri, Mar 02, 2018 at 09:14:01AM +0100, Juergen Gross wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
> 
> I think your patch tries to disable PGE wholesale for Xen, but I don't
> think this patch catches all of them.

I do it only while a domain subject to XPTI is active on the cpu.

> XEN_MINIMAL_CR4 still has PGE in it. EFI path and the stack resyncing
> asm in __start_xen are missing.

That's on purpose. Without XPTI (e.g. on AMD cpus) PGE is still good
for performance of 64-bit pv domains (but that's the only case AFAIK).

> You also missed fixing the comment of setting MTRR. It is not a big deal
> though.

I'm not sure this comment wants to be modified as normally PGE is still
on, e.g. when booting. The only case I could think of where MTRR would
be modified with PGE off would be a dom0 hypercall, e.g.
XENPF_add_memtype.

> I have in fact written a small series to make CR4.PGE configurable but
> haven't got around to send the patches yet because I didn't think it is
> very useful in its own.

Depends on whether re-using global TLB entry saves more than flushing
the TLB from global pages does cost.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [PATCH v2 0/6] xen/x86: various XPTI speedups
  2018-03-02  8:13 [PATCH v2 0/6] xen/x86: various XPTI speedups Juergen Gross
                   ` (5 preceding siblings ...)
  2018-03-02  8:14 ` [PATCH v2 6/6] xen/x86: use PCID feature for XPTI Juergen Gross
@ 2018-03-05 16:20 ` Dario Faggioli
  6 siblings, 0 replies; 38+ messages in thread
From: Dario Faggioli @ 2018-03-05 16:20 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: andrew.cooper3, wei.liu2, jbeulich



On Fri, 2018-03-02 at 09:13 +0100, Juergen Gross wrote:
> The complete series has been verified to still mitigate against
> Meltdown attacks. A simple performance test (make -j 4 in the Xen
> hypervisor directory) showed significant improvements compared to the
> state without this series (so with Jan's and Wei's series applied),
> the percentage after the numbers is always related to XPTI off:
> 
>        XPTI off     Jan+Wei, XPTI on    +this series, XPTI on
> real   1m21.169s    1m52.149s (+38%)    1m25.692s (+6%)
> user   2m47.652s    2m50.054s (+1%)     2m46.428s (-1%)
> sys    1m11.949s    2m21.767s (+97%)    1m23.053s (+15%)
> 
> A git branch of that series (+ Jan's and Wei's patches) is available:
> 
> https://github.com/jgross1/xen.git xpti
> 
I've run some more benchmarks, and here are the results:

https://openbenchmarking.org/result/1803039-DARI-180303217
http://openbenchmarking.org/result/1803039-DARI-180303217&obr_nor=y&obr_hgv=Jan%2BWei%2C+XPTI+on

(I also include a textual recap at the bottom of this email.)

These numbers show that Juergen's series is quite effective at
improving performance in pretty much all workloads that I've tested.

The only exception is schbench, but I don't think that's very relevant,
because of how the benchmark is configured in the PhoronixTestSuite (I
just recently discovered that).

The in-guest context-switching heavy workloads are the ones where this
series makes the most (positive) difference.

Note that on Stream and on Stress-ng:MemoryCopy, XPTI=on+this series
does even *better* than XPTI=off. This is most likely due to the fact
that Juergen, for now, takes advantage of PCID only for the XPTI=on
case. However, although it is indeed a bit of an unfair comparison, I
think it does prove the point that we want to have (something like)
this series.

Regards,
Dario

AIO-Stress 0.21
Test: Random Write
    MB/s > Higher Is Better
    XPTI off ............... 1926.57 |=============================================================================================================================================================================
    +this series, XPTI on .. 1931.44 |=============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 1807.30 |==================================================================================================================================================================

Stream 2013-01-17
Type: Copy
    MB/s > Higher Is Better
    XPTI off ............... 15738.48 |==============================================================================================================================================
    +this series, XPTI on .. 19011.66 |============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 15381.94 |===========================================================================================================================================

Stream 2013-01-17
Type: Scale
    MB/s > Higher Is Better
    XPTI off ............... 10849.14 |=================================================================================================================================================
    +this series, XPTI on .. 12833.84 |============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 10696.66 |===============================================================================================================================================

Stream 2013-01-17
Type: Triad
    MB/s > Higher Is Better
    XPTI off ............... 12268.20 |======================================================================================================================================================
    +this series, XPTI on .. 14085.56 |============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 12120.56 |====================================================================================================================================================

Stream 2013-01-17
Type: Add
    MB/s > Higher Is Better
    XPTI off ............... 12323.60 |=====================================================================================================================================
    +this series, XPTI on .. 15881.14 |============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 12085.76 |===================================================================================================================================

t-test1 2017-01-13
Threads: 1
    Seconds < Lower Is Better
    XPTI off ............... 75.20  |====================================================================================================
    +this series, XPTI on .. 111.12 |====================================================================================================================================================
    Jan+Wei, XPTI on ....... 130.34 |==============================================================================================================================================================================

t-test1 2017-01-13
Threads: 2
    Seconds < Lower Is Better
    XPTI off ............... 24.13 |===========================================================================================================
    +this series, XPTI on .. 34.12 |=======================================================================================================================================================
    Jan+Wei, XPTI on ....... 39.62 |===============================================================================================================================================================================

CacheBench 
Test: Read
    MB/s > Higher Is Better
    XPTI off ............... 2026.19 |=============================================================================================================================================================================
    +this series, XPTI on .. 2024.43 |=============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 2023.90 |=============================================================================================================================================================================

CacheBench 
Test: Write
    MB/s > Higher Is Better
    XPTI off ............... 9741.40 |=============================================================================================================================================================================
    +this series, XPTI on .. 9734.24 |=============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 9727.98 |=============================================================================================================================================================================

CacheBench 
Test: Read / Modify / Write
    MB/s > Higher Is Better
    XPTI off ............... 13253.33 |============================================================================================================================================================================
    +this series, XPTI on .. 13247.78 |============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 13239.67 |============================================================================================================================================================================

Timed Linux Kernel Compilation 4.13
Time To Compile
    Seconds < Lower Is Better
    XPTI off ............... 155.54 |=======================================================================================================================================================
    +this series, XPTI on .. 169.21 |=====================================================================================================================================================================
    Jan+Wei, XPTI on ....... 178.70 |==============================================================================================================================================================================

Timed MPlayer Compilation 1.0-rc3
Time To Compile
    Seconds < Lower Is Better
    XPTI off ............... 46.12 |=================================================================================================================================================================
    +this series, XPTI on .. 47.70 |=======================================================================================================================================================================
    Jan+Wei, XPTI on ....... 50.02 |===============================================================================================================================================================================

Timed PHP Compilation 7.1.9
Time To Compile
    Seconds < Lower Is Better
    XPTI off ............... 119.46 |==========================================================================================================================================================
    +this series, XPTI on .. 128.66 |======================================================================================================================================================================
    Jan+Wei, XPTI on ....... 135.01 |==============================================================================================================================================================================

Parallel BZIP2 Compression 1.1.12
256MB File Compression
    Seconds < Lower Is Better
    XPTI off ............... 8.96 |===================================================================================================================================================================
    +this series, XPTI on .. 8.47 |==========================================================================================================================================================
    Jan+Wei, XPTI on ....... 9.67 |================================================================================================================================================================================

Hackbench 
Count: 1 - Type: Thread
    Seconds < Lower Is Better
    XPTI off ............... 11.45 |================================================================================================================
    +this series, XPTI on .. 15.48 |=======================================================================================================================================================
    Jan+Wei, XPTI on ....... 17.91 |===============================================================================================================================================================================

Hackbench 
Count: 4 - Type: Thread
    Seconds < Lower Is Better
    XPTI off ............... 38.43 |=========================================================================================================
    +this series, XPTI on .. 53.90 |====================================================================================================================================================
    Jan+Wei, XPTI on ....... 63.83 |===============================================================================================================================================================================

Hackbench 
Count: 16 - Type: Thread
    Seconds < Lower Is Better
    XPTI off ............... 140.91 |====================================================================================================
    +this series, XPTI on .. 203.78 |=================================================================================================================================================
    Jan+Wei, XPTI on ....... 244.98 |==============================================================================================================================================================================

Stress-NG 0.07.26
Test: Crypto
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 689.36 |==============================================================================================================================================================================
    +this series, XPTI on .. 688.78 |==============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 688.35 |==============================================================================================================================================================================

Stress-NG 0.07.26
Test: Bsearch
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 3941.30 |=============================================================================================================================================================================
    +this series, XPTI on .. 3902.20 |===========================================================================================================================================================================
    Jan+Wei, XPTI on ....... 3909.30 |============================================================================================================================================================================

Stress-NG 0.07.26
Test: Forking
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 6208.40 |=============================================================================================================================================================================
    +this series, XPTI on .. 5589.59 |============================================================================================================================================================
    Jan+Wei, XPTI on ....... 4601.54 |================================================================================================================================

Stress-NG 0.07.26
Test: Hsearch
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 26380.63 |============================================================================================================================================================================
    +this series, XPTI on .. 26389.66 |============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 26246.49 |===========================================================================================================================================================================

Stress-NG 0.07.26
Test: Lsearch
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 60.27 |===============================================================================================================================================================================
    +this series, XPTI on .. 60.27 |===============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 55.76 |==================================================================================================================================================================

Stress-NG 0.07.26
Test: Tsearch
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 125.72 |==============================================================================================================================================================================
    +this series, XPTI on .. 125.02 |=============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 124.12 |============================================================================================================================================================================

Stress-NG 0.07.26
Test: CPU Stress
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 1706.72 |=====================================================================================================================================================================
    +this series, XPTI on .. 1733.77 |========================================================================================================================================================================
    Jan+Wei, XPTI on ....... 1787.12 |=============================================================================================================================================================================

Stress-NG 0.07.26
Test: Semaphores
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 2262689.37 |==========================================================================================================================================================================
    +this series, XPTI on .. 2109096.56 |==============================================================================================================================================================
    Jan+Wei, XPTI on ....... 2041884.08 |=========================================================================================================================================================

Stress-NG 0.07.26
Test: Matrix Math
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 20287.24 |============================================================================================================================================================================
    +this series, XPTI on .. 20259.78 |============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 20263.31 |============================================================================================================================================================================

Stress-NG 0.07.26
Test: Vector Math
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 10516.51 |============================================================================================================================================================================
    +this series, XPTI on .. 10507.48 |============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 10497.54 |============================================================================================================================================================================

Stress-NG 0.07.26
Test: Memory Copying
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 978.38  |================================================================================================================================================================
    +this series, XPTI on .. 1058.67 |=============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 983.75  |=================================================================================================================================================================

Stress-NG 0.07.26
Test: Socket Activity
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 1568.00 |=============================================================================================================================================================================
    +this series, XPTI on .. 1170.07 |=================================================================================================================================
    Jan+Wei, XPTI on ....... 923.73  |======================================================================================================

Stress-NG 0.07.26
Test: Context Switching
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 822072.29 |===========================================================================================================================================================================
    +this series, XPTI on .. 366599.11 |============================================================================
    Jan+Wei, XPTI on ....... 261709.75 |======================================================

Stress-NG 0.07.26
Test: Glibc C String Functions
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 269733.97 |===========================================================================================================================================================================
    +this series, XPTI on .. 263300.54 |=======================================================================================================================================================================
    Jan+Wei, XPTI on ....... 262641.54 |=======================================================================================================================================================================

Stress-NG 0.07.26
Test: Glibc Qsort Data Sorting
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 58.00 |===============================================================================================================================================================================
    +this series, XPTI on .. 56.32 |==========================================================================================================================================================================
    Jan+Wei, XPTI on ....... 57.99 |===============================================================================================================================================================================

Stress-NG 0.07.26
Test: System V Message Passing
    Bogo Ops/s > Higher Is Better
    XPTI off ............... 3737187.66 |==========================================================================================================================================================================
    +this series, XPTI on .. 2018958.18 |============================================================================================
    Jan+Wei, XPTI on ....... 1413159.24 |================================================================

PyBench 2018-02-16
Total For Average Test Times
    Milliseconds < Lower Is Better
    XPTI off ............... 2988 |===============================================================================================================================================================================
    +this series, XPTI on .. 2999 |================================================================================================================================================================================
    Jan+Wei, XPTI on ....... 3004 |================================================================================================================================================================================

Schbench 
Message Threads: 2 - Workers Per Message Thread: 16
    usec, 99.9th Latency Percentile < Lower Is Better
    XPTI off ............... 17696 |=======================================================================================================================================================
    +this series, XPTI on .. 18368 |=============================================================================================================================================================
    Jan+Wei, XPTI on ....... 20523 |===============================================================================================================================================================================

Schbench 
Message Threads: 4 - Workers Per Message Thread: 16
    usec, 99.9th Latency Percentile < Lower Is Better
    XPTI off ............... 86144 |=============================================================================================================================================================================
    +this series, XPTI on .. 87168 |===============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 85717 |============================================================================================================================================================================

Schbench 
Message Threads: 6 - Workers Per Message Thread: 16
    usec, 99.9th Latency Percentile < Lower Is Better
    XPTI off ............... 130219 |==============================================================================================================================================================================
    +this series, XPTI on .. 130389 |==============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 130219 |==============================================================================================================================================================================

Schbench 
Message Threads: 8 - Workers Per Message Thread: 16
    usec, 99.9th Latency Percentile < Lower Is Better
    XPTI off ............... 173141 |===========================================================================================================================================================================
    +this series, XPTI on .. 176555 |==============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 172459 |==========================================================================================================================================================================

Schbench 
Message Threads: 16 - Workers Per Message Thread: 16
    usec, 99.9th Latency Percentile < Lower Is Better
    XPTI off ............... 338773 |=============================================================================================================================================================================
    +this series, XPTI on .. 339797 |==============================================================================================================================================================================
    Jan+Wei, XPTI on ....... 329557 |=========================================================================================================================================================================

Schbench 
Message Threads: 32 - Workers Per Message Thread: 16
    usec, 99.9th Latency Percentile < Lower Is Better
    XPTI off ............... 636587 |==============================================================================================================================================================================
    +this series, XPTI on .. 624981 |===========================================================================================================================================================================
    Jan+Wei, XPTI on ....... 635904 |==============================================================================================================================================================================

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/




* Re: [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-02  8:13 ` [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
@ 2018-03-05 16:43   ` Jan Beulich
  2018-03-08 11:59     ` Juergen Gross
       [not found]   ` <5A9D81DC02000078001AEB68@suse.com>
  1 sibling, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2018-03-05 16:43 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -509,6 +509,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>  
>  void write_ptbase(struct vcpu *v)
>  {
> +    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
> +                                       !is_pv_32bit_vcpu(v);

Why is_pv_vcpu() when you already check is_pv_32bit_vcpu()?

> @@ -3704,18 +3706,22 @@ long do_mmu_update(
>                          break;
>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> -                    /*
> -                     * No need to sync if all uses of the page can be accounted
> -                     * to the page lock we hold, its pinned status, and uses on
> -                     * this (v)CPU.
> -                     */
> -                    if ( !rc && !cpu_has_no_xpti &&
> -                         ((page->u.inuse.type_info & PGT_count_mask) >
> -                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
> -                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
> -                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
> -                            mfn))) )
> -                        sync_guest = true;
> +                    if ( !rc && !cpu_has_no_xpti )
> +                    {
> +                        get_cpu_info()->root_pgt_changed = true;

Why would you set this when a foreign domain's L4 got updated?
And don't you need to disallow updating L4s of running guests now
(which is a bad idea anyway)?

> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -207,6 +207,8 @@ void invalidate_interrupt(struct cpu_user_regs *regs)
>      unsigned int flags = flush_flags;
>      ack_APIC_irq();
>      perfc_incr(ipis);
> +    if ( flags & FLUSH_ROOT_PGTBL )
> +        get_cpu_info()->root_pgt_changed = true;

While for the caller in do_mmu_update() you don't need it, for
full correctness you also want to do this in flush_area_mask()
for the sender, if it's part of the CPU mask.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/6] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  2018-03-02  8:13 ` [PATCH v2 2/6] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
@ 2018-03-05 16:49   ` Jan Beulich
       [not found]   ` <5A9D831F02000078001AEB7E@suse.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-05 16:49 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
> @@ -509,9 +510,16 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>  
>  void write_ptbase(struct vcpu *v)
>  {
> -    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
> -                                       !is_pv_32bit_vcpu(v);
> -    write_cr3(v->arch.cr3);
> +    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
> +    {
> +        get_cpu_info()->root_pgt_changed = true;
> +        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
> +    }
> +    else
> +    {
> +        get_cpu_info()->root_pgt_changed = false;

Is this write really necessary?

Jan



* Re: [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible
       [not found]   ` <5A9D81DC02000078001AEB68@suse.com>
@ 2018-03-06  7:01     ` Juergen Gross
  2018-03-06  7:58       ` Jan Beulich
       [not found]       ` <5A9E583002000078001AED3A@suse.com>
  0 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-06  7:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, wei.liu2, xen-devel, Dario Faggioli

On 05/03/18 17:43, Jan Beulich wrote:
>>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -509,6 +509,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  
>>  void write_ptbase(struct vcpu *v)
>>  {
>> +    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
>> +                                       !is_pv_32bit_vcpu(v);
> 
> Why is_pv_vcpu() when you already check is_pv_32bit_vcpu()?

I check !is_pv_32bit_vcpu() to catch 64-bit pv-domains only.

> 
>> @@ -3704,18 +3706,22 @@ long do_mmu_update(
>>                          break;
>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> -                    /*
>> -                     * No need to sync if all uses of the page can be accounted
>> -                     * to the page lock we hold, its pinned status, and uses on
>> -                     * this (v)CPU.
>> -                     */
>> -                    if ( !rc && !cpu_has_no_xpti &&
>> -                         ((page->u.inuse.type_info & PGT_count_mask) >
>> -                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>> -                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
>> -                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
>> -                            mfn))) )
>> -                        sync_guest = true;
>> +                    if ( !rc && !cpu_has_no_xpti )
>> +                    {
>> +                        get_cpu_info()->root_pgt_changed = true;
> 
> Why would you set this when a foreign domain's L4 got updated?

Right. I should set it only when modifying the currently active L4.

> And don't you need to disallow updating L4s of running guests now
> (which is a bad idea anyway)?

Yes, I should do that.

> 
>> --- a/xen/arch/x86/smp.c
>> +++ b/xen/arch/x86/smp.c
>> @@ -207,6 +207,8 @@ void invalidate_interrupt(struct cpu_user_regs *regs)
>>      unsigned int flags = flush_flags;
>>      ack_APIC_irq();
>>      perfc_incr(ipis);
>> +    if ( flags & FLUSH_ROOT_PGTBL )
>> +        get_cpu_info()->root_pgt_changed = true;
> 
> While for the caller in do_mmu_update() you don't need it, for
> full correctness you also want to do this in flush_area_mask()
> for the sender, if it's part of the CPU mask.

Right.


Juergen


* Re: [PATCH v2 2/6] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
       [not found]   ` <5A9D831F02000078001AEB7E@suse.com>
@ 2018-03-06  7:02     ` Juergen Gross
  0 siblings, 0 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-06  7:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, wei.liu2, xen-devel, Dario Faggioli

On 05/03/18 17:49, Jan Beulich wrote:
>>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
>> @@ -509,9 +510,16 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  
>>  void write_ptbase(struct vcpu *v)
>>  {
>> -    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
>> -                                       !is_pv_32bit_vcpu(v);
>> -    write_cr3(v->arch.cr3);
>> +    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>> +    {
>> +        get_cpu_info()->root_pgt_changed = true;
>> +        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
>> +    }
>> +    else
>> +    {
>> +        get_cpu_info()->root_pgt_changed = false;
> 
> Is this write really necessary?

Probably not. I'll drop it.


Juergen



* Re: [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-06  7:01     ` Juergen Gross
@ 2018-03-06  7:58       ` Jan Beulich
       [not found]       ` <5A9E583002000078001AED3A@suse.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-06  7:58 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 06.03.18 at 08:01, <jgross@suse.com> wrote:
> On 05/03/18 17:43, Jan Beulich wrote:
>>>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -509,6 +509,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>  
>>>  void write_ptbase(struct vcpu *v)
>>>  {
>>> +    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
>>> +                                       !is_pv_32bit_vcpu(v);
>> 
>> Why is_pv_vcpu() when you already check is_pv_32bit_vcpu()?
> 
> I check !is_pv_32bit_vcpu() to catch 64-bit pv-domains only.

Oh, I'm sorry - for whatever reason I've ignored the ! there.

>> And don't you need to disallow updating L4s of running guests now
>> (which is a bad idea anyway)?
> 
> Yes, I should do that.

But please do this as a separate change, as strictly speaking this is
a behavioral change that we can't allow. But I think we simply need
to accept this (and perhaps uniformly for all page table levels).

Jan



* Re: [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible
       [not found]       ` <5A9E583002000078001AED3A@suse.com>
@ 2018-03-06  8:06         ` Juergen Gross
  2018-03-06  8:17           ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Juergen Gross @ 2018-03-06  8:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, wei.liu2, xen-devel, Dario Faggioli

On 06/03/18 08:58, Jan Beulich wrote:
>>>> On 06.03.18 at 08:01, <jgross@suse.com> wrote:
>> On 05/03/18 17:43, Jan Beulich wrote:
>>>>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -509,6 +509,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>  
>>>>  void write_ptbase(struct vcpu *v)
>>>>  {
>>>> +    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
>>>> +                                       !is_pv_32bit_vcpu(v);
>>>
>>> Why is_pv_vcpu() when you already check is_pv_32bit_vcpu()?
>>
>> I check !is_pv_32bit_vcpu() to catch 64-bit pv-domains only.
> 
> Oh, I'm sorry - for whatever reason I've ignored the ! there.
> 
>>> And don't you need to disallow updating L4s of running guests now
>>> (which is a bad idea anyway)?
>>
>> Yes, I should do that.
> 
> But please do this as a separate change, as strictly speaking this is
> a behavioral change that we can't allow. But I think we simply need
> to accept this (and perhaps uniformly for all page table levels).

Hmm, thinking more about it: wouldn't it be enough to do the change and
then send a FLUSH_ROOT_PGTBL to the vcpu(s) the affected guest is
running on? This would cause a cycle through the hypervisor, resulting
in an updated root page table for that guest.


Juergen


* Re: [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-06  8:06         ` Juergen Gross
@ 2018-03-06  8:17           ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-06  8:17 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 06.03.18 at 09:06, <jgross@suse.com> wrote:
> On 06/03/18 08:58, Jan Beulich wrote:
>>>>> On 06.03.18 at 08:01, <jgross@suse.com> wrote:
>>> On 05/03/18 17:43, Jan Beulich wrote:
>>>>>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
>>>>> --- a/xen/arch/x86/mm.c
>>>>> +++ b/xen/arch/x86/mm.c
>>>>> @@ -509,6 +509,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>>  
>>>>>  void write_ptbase(struct vcpu *v)
>>>>>  {
>>>>> +    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
>>>>> +                                       !is_pv_32bit_vcpu(v);
>>>>
>>>> Why is_pv_vcpu() when you already check is_pv_32bit_vcpu()?
>>>
>>> I check !is_pv_32bit_vcpu() to catch 64-bit pv-domains only.
>> 
>> Oh, I'm sorry - for whatever reason I've ignored the ! there.
>> 
>>>> And don't you need to disallow updating L4s of running guests now
>>>> (which is a bad idea anyway)?
>>>
>>> Yes, I should do that.
>> 
>> But please do this as a separate change, as strictly speaking this is
>> a behavioral change that we can't allow. But I think we simply need
>> to accept this (and perhaps uniformly for all page table levels).
> 
> Hmm, thinking more about it: wouldn't it be enough to do the change and
> then send a FLUSH_ROOT_PGTBL to the vcpu(s) the affected guest is
> running on? This would cause a cycle through the hypervisor, resulting
> in an updated root page table for that guest.

Ah, yes, of course. That was the idea behind the original addition
of the flush invocation after all.

Jan



* Re: [PATCH v2 3/6] xen/x86: support per-domain flag for xpti
  2018-03-02  8:14 ` [PATCH v2 3/6] xen/x86: support per-domain flag for xpti Juergen Gross
@ 2018-03-08 10:17   ` Jan Beulich
       [not found]   ` <5AA11BDE02000078001AFB92@suse.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-08 10:17 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -510,15 +510,19 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>  
>  void write_ptbase(struct vcpu *v)
>  {
> -    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
> +    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
>      {
>          get_cpu_info()->root_pgt_changed = true;
> +        get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
>          asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
>      }
>      else
>      {
>          get_cpu_info()->root_pgt_changed = false;
> +        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
> +        get_cpu_info()->xen_cr3 = 0;
>          write_cr3(v->arch.cr3);
> +        get_cpu_info()->pv_cr3 = 0;
>      }
>  }

I think you want to latch the return value of get_cpu_info() into a
local variable now.

> @@ -707,6 +708,9 @@ int __init dom0_construct_pv(struct domain *d,
>              cpu = p->processor;
>      }
>  
> +    if ( !is_pv_32bit_domain(d) )
> +        xpti_domain_init(d);

Perhaps better to omit the conditional here? Or otherwise use the
"compat32" local variable?

> +static int parse_xpti(const char *s)

__init

> +{
> +    int rc = 0;
> +
> +    switch ( parse_bool(s, NULL) )
> +    {
> +    case 0:
> +        opt_xpti = XPTI_OFF;
> +        break;
> +    case 1:
> +        opt_xpti = XPTI_ON;
> +        break;
> +    default:
> +        if ( !strcmp(s, "default") )

This wants to also be mentioned in the command line doc.

> +            opt_xpti = XPTI_DEFAULT;
> +        else if ( !strcmp(s, "nodom0") )
> +            opt_xpti = XPTI_NODOM0;
> +        else
> +            rc = -EINVAL;
> +        break;
> +    }
> +
> +    return rc;
> +}
> +
> +custom_param("xpti", parse_xpti);

Please omit the blank line above here.

> +void xpti_init(void)

__init

> +void xpti_domain_init(struct domain *d)
> +{
> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
> +        return;
> +
> +    switch ( opt_xpti )
> +    {
> +    case XPTI_OFF:
> +        d->arch.pv_domain.xpti = false;
> +        break;
> +    case XPTI_ON:
> +        d->arch.pv_domain.xpti = true;
> +        break;
> +    case XPTI_NODOM0:
> +        d->arch.pv_domain.xpti = d->domain_id != 0 &&
> +                                 d->domain_id != hardware_domid;
> +        break;
> +    default:
> +        ASSERT_UNREACHABLE();
> +        break;
> +    }
> +
> +    if ( d->arch.pv_domain.xpti )
> +        printk("Enabling Xen Pagetable protection (XPTI) for Domain %d\n",
> +               d->domain_id);

Please don't, even less so without XENLOG_G_*. And if you really,
really want this at, say, XENLOG_G_DEBUG, then Dom%d please.

> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -257,6 +257,9 @@ struct pv_domain
>      struct mapcache_domain mapcache;
>  
>      struct cpuidmasks *cpuidmasks;
> +
> +    /* XPTI active? */
> +    bool xpti;
>  };

Is there really no 1 byte slot available elsewhere in the structure?
Like between nr_l4_pages and mapcache?

Jan



* Re: [PATCH v2 3/6] xen/x86: support per-domain flag for xpti
       [not found]   ` <5AA11BDE02000078001AFB92@suse.com>
@ 2018-03-08 11:30     ` Juergen Gross
  2018-03-08 12:49       ` Jan Beulich
       [not found]       ` <5AA13F7D02000078001AFCB3@suse.com>
  0 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-08 11:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, wei.liu2, xen-devel, Dario Faggioli

On 08/03/18 11:17, Jan Beulich wrote:
>>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -510,15 +510,19 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  
>>  void write_ptbase(struct vcpu *v)
>>  {
>> -    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>> +    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
>>      {
>>          get_cpu_info()->root_pgt_changed = true;
>> +        get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
>>          asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
>>      }
>>      else
>>      {
>>          get_cpu_info()->root_pgt_changed = false;
>> +        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
>> +        get_cpu_info()->xen_cr3 = 0;
>>          write_cr3(v->arch.cr3);
>> +        get_cpu_info()->pv_cr3 = 0;
>>      }
>>  }
> 
> I think you want to latch the return value of get_cpu_info() into a
> local variable now.

Yes.

> 
>> @@ -707,6 +708,9 @@ int __init dom0_construct_pv(struct domain *d,
>>              cpu = p->processor;
>>      }
>>  
>> +    if ( !is_pv_32bit_domain(d) )
>> +        xpti_domain_init(d);
> 
> Perhaps better to omit the conditional here? Or otherwise use the
> "compat32" local variable?

I'll drop the conditional.

> 
>> +static int parse_xpti(const char *s)
> 
> __init

Aah, of course.

> 
>> +{
>> +    int rc = 0;
>> +
>> +    switch ( parse_bool(s, NULL) )
>> +    {
>> +    case 0:
>> +        opt_xpti = XPTI_OFF;
>> +        break;
>> +    case 1:
>> +        opt_xpti = XPTI_ON;
>> +        break;
>> +    default:
>> +        if ( !strcmp(s, "default") )
> 
> This wants to also be mentioned in the command line doc.

Uuh, this was a copy-and-paste result from my alternative XPTI approach.
I'll just drop that value.

> 
>> +            opt_xpti = XPTI_DEFAULT;
>> +        else if ( !strcmp(s, "nodom0") )
>> +            opt_xpti = XPTI_NODOM0;
>> +        else
>> +            rc = -EINVAL;
>> +        break;
>> +    }
>> +
>> +    return rc;
>> +}
>> +
>> +custom_param("xpti", parse_xpti);
> 
> Please omit the blank line above here.

Okay.

> 
>> +void xpti_init(void)
> 
> __init

Yes.

> 
>> +void xpti_domain_init(struct domain *d)
>> +{
>> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
>> +        return;
>> +
>> +    switch ( opt_xpti )
>> +    {
>> +    case XPTI_OFF:
>> +        d->arch.pv_domain.xpti = false;
>> +        break;
>> +    case XPTI_ON:
>> +        d->arch.pv_domain.xpti = true;
>> +        break;
>> +    case XPTI_NODOM0:
>> +        d->arch.pv_domain.xpti = d->domain_id != 0 &&
>> +                                 d->domain_id != hardware_domid;
>> +        break;
>> +    default:
>> +        ASSERT_UNREACHABLE();
>> +        break;
>> +    }
>> +
>> +    if ( d->arch.pv_domain.xpti )
>> +        printk("Enabling Xen Pagetable protection (XPTI) for Domain %d\n",
>> +               d->domain_id);
> 
> Please don't, even less so without XENLOG_G_*. And if you really,
> really want this at, say, XENLOG_G_DEBUG, then Dom%d please.

Okay, I'll drop that message.

> 
>> --- a/xen/include/asm-x86/domain.h
>> +++ b/xen/include/asm-x86/domain.h
>> @@ -257,6 +257,9 @@ struct pv_domain
>>      struct mapcache_domain mapcache;
>>  
>>      struct cpuidmasks *cpuidmasks;
>> +
>> +    /* XPTI active? */
>> +    bool xpti;
>>  };
> 
> Is there really no 1 byte slot available elsewhere in the structure?
> Like between nr_l4_pages and mapcache?

I'll have a look.


Juergen


* Re: [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-05 16:43   ` Jan Beulich
@ 2018-03-08 11:59     ` Juergen Gross
  2018-03-08 12:47       ` Jan Beulich
       [not found]       ` <5AA13EEA02000078001AFCAF@suse.com>
  0 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-08 11:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, wei.liu2, xen-devel, Dario Faggioli

On 05/03/18 17:43, Jan Beulich wrote:
>>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -509,6 +509,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  
>>  void write_ptbase(struct vcpu *v)
>>  {
>> +    get_cpu_info()->root_pgt_changed = this_cpu(root_pgt) && is_pv_vcpu(v) &&
>> +                                       !is_pv_32bit_vcpu(v);
> 
> Why is_pv_vcpu() when you already check is_pv_32bit_vcpu()?
> 
>> @@ -3704,18 +3706,22 @@ long do_mmu_update(
>>                          break;
>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> -                    /*
>> -                     * No need to sync if all uses of the page can be accounted
>> -                     * to the page lock we hold, its pinned status, and uses on
>> -                     * this (v)CPU.
>> -                     */
>> -                    if ( !rc && !cpu_has_no_xpti &&
>> -                         ((page->u.inuse.type_info & PGT_count_mask) >
>> -                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>> -                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
>> -                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
>> -                            mfn))) )

Is it really possible this code is running with the user page table
being active on the current cpu? I think this test can be dropped.


Juergen


* Re: [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-08 11:59     ` Juergen Gross
@ 2018-03-08 12:47       ` Jan Beulich
       [not found]       ` <5AA13EEA02000078001AFCAF@suse.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-08 12:47 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 08.03.18 at 12:59, <jgross@suse.com> wrote:
> On 05/03/18 17:43, Jan Beulich wrote:
>>>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
>>> @@ -3704,18 +3706,22 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> -                    /*
>>> -                     * No need to sync if all uses of the page can be accounted
>>> -                     * to the page lock we hold, its pinned status, and uses on
>>> -                     * this (v)CPU.
>>> -                     */
>>> -                    if ( !rc && !cpu_has_no_xpti &&
>>> -                         ((page->u.inuse.type_info & PGT_count_mask) >
>>> -                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>>> -                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
>>> -                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
>>> -                            mfn))) )
> 
> Is it really possible this code is running with the user page table
> being active on the current cpu? I think this test can be dropped.

I'm not sure I understand: The check above isn't for what is
active on a CPU, but for what references are being held.
_Installing_ a root page table into ->arch.guest_table{,_user}
is when a reference is being obtained, not _loading_ the table
into CR3. (In theory the above could be extended to also
check vCPU-s other than current, but one would need to deal
with races; obviously pausing the other vCPU-s of the domain
wouldn't be a good idea, but that would be one possible way.)

Jan



* Re: [PATCH v2 3/6] xen/x86: support per-domain flag for xpti
  2018-03-08 11:30     ` Juergen Gross
@ 2018-03-08 12:49       ` Jan Beulich
       [not found]       ` <5AA13F7D02000078001AFCB3@suse.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-08 12:49 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 08.03.18 at 12:30, <jgross@suse.com> wrote:
> On 08/03/18 11:17, Jan Beulich wrote:
>>>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
>>> +static int parse_xpti(const char *s)
>>> +{
>>> +    int rc = 0;
>>> +
>>> +    switch ( parse_bool(s, NULL) )
>>> +    {
>>> +    case 0:
>>> +        opt_xpti = XPTI_OFF;
>>> +        break;
>>> +    case 1:
>>> +        opt_xpti = XPTI_ON;
>>> +        break;
>>> +    default:
>>> +        if ( !strcmp(s, "default") )
>> 
>> This wants to also be mentioned in the command line doc.
> 
> Uuh, this was a copy-and-paste result from my alternative XPTI approach.
> I'll just drop that value.

I'm not sure that's the best route (and I did intentionally not ask
you to do so): in cases where you can't edit the pre-built command
line options (e.g. those read by xen.efi from the config file), it
is quite useful to be able to override whatever may be there back to
the default.

Jan



* Re: [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible
       [not found]       ` <5AA13EEA02000078001AFCAF@suse.com>
@ 2018-03-08 13:03         ` Juergen Gross
  0 siblings, 0 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-08 13:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, wei.liu2, xen-devel, Dario Faggioli

On 08/03/18 13:47, Jan Beulich wrote:
>>>> On 08.03.18 at 12:59, <jgross@suse.com> wrote:
>> On 05/03/18 17:43, Jan Beulich wrote:
>>>>>> On 02.03.18 at 09:13, <jgross@suse.com> wrote:
>>>> @@ -3704,18 +3706,22 @@ long do_mmu_update(
>>>>                          break;
>>>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>>> -                    /*
>>>> -                     * No need to sync if all uses of the page can be accounted
>>>> -                     * to the page lock we hold, its pinned status, and uses on
>>>> -                     * this (v)CPU.
>>>> -                     */
>>>> -                    if ( !rc && !cpu_has_no_xpti &&
>>>> -                         ((page->u.inuse.type_info & PGT_count_mask) >
>>>> -                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>>>> -                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
>>>> -                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
>>>> -                            mfn))) )
>>
>> Is it really possible this code is running with the user page table
>> being active on the current cpu? I think this test can be dropped.
> 
> I'm not sure I understand: The check above isn't for what is
> active on a CPU, but for what references are being held.
> _Installing_ a root page table into ->arch.guest_table{,_user}
> is when a reference is being obtained, not _loading_ the table
> into CR3. (In theory the above could be extended to also
> check vCPU-s other than current, but one would need to deal
> with races; obviously pausing the other vCPU-s of the domain
> wouldn't be a good idea, but that would be one possible way.)

Aah, sorry. Now I understand.

Thanks,


Juergen


* Re: [PATCH v2 3/6] xen/x86: support per-domain flag for xpti
       [not found]       ` <5AA13F7D02000078001AFCB3@suse.com>
@ 2018-03-08 13:13         ` Juergen Gross
  0 siblings, 0 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-08 13:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, wei.liu2, xen-devel, Dario Faggioli

On 08/03/18 13:49, Jan Beulich wrote:
>>>> On 08.03.18 at 12:30, <jgross@suse.com> wrote:
>> On 08/03/18 11:17, Jan Beulich wrote:
>>>>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
>>>> +static int parse_xpti(const char *s)
>>>> +{
>>>> +    int rc = 0;
>>>> +
>>>> +    switch ( parse_bool(s, NULL) )
>>>> +    {
>>>> +    case 0:
>>>> +        opt_xpti = XPTI_OFF;
>>>> +        break;
>>>> +    case 1:
>>>> +        opt_xpti = XPTI_ON;
>>>> +        break;
>>>> +    default:
>>>> +        if ( !strcmp(s, "default") )
>>>
>>> This wants to also be mentioned in the command line doc.
>>
>> Uuh, this was a copy-and-paste result from my alternative XPTI approach.
>> I'll just drop that value.
> 
> I'm not sure that's the best route (and I did intentionally not ask for
> you doing so): In cases where you can't edit the pre-built command
> line options (like e.g. those read by xen.efi from the config file), it
> is quite useful to be able to override what may be there back to the
> default.

Okay, then I'll document it.


Juergen



* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-02  8:14 ` [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active Juergen Gross
  2018-03-02 11:03   ` Wei Liu
@ 2018-03-08 13:38   ` Jan Beulich
  2018-03-09  3:01     ` Tian, Kevin
  2018-03-09  5:23     ` Tian, Kevin
       [not found]   ` <5AA14AF302000078001AFD30@suse.com>
  2018-03-08 15:06   ` Jan Beulich
  3 siblings, 2 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-08 13:38 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Kevin Tian, wei.liu2, andrew.cooper3, Dario Faggioli,
	Jun Nakajima, xen-devel

>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
> Instead of flushing the TLB from global pages when switching address
> spaces with XPTI being active just disable global pages via %cr4
> completely when a domain subject to XPTI is active. This avoids the
> need for extra TLB flushes as loading %cr3 will remove all TLB
> entries.

Hmm, it's far from obvious that this is an improvement overall.
Besides Xen's global pages, we also prevent guest user pages from
being evicted from the TLB across user <-> kernel mode changes.
And the effects of this are likely quite workload dependent.

> @@ -412,18 +414,22 @@ static void prepare_set(void)
>  	write_cr0(read_cr0() | X86_CR0_CD);
>  	wbinvd();
>  
> -	/*  TLB flushing here relies on Xen always using CR4.PGE. */
> -	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
> -	write_cr4(read_cr4() & ~X86_CR4_PGE);
> +	cr4 = read_cr4();
> +	if (cr4 & X86_CR4_PGE)
> +		write_cr4(cr4 & ~X86_CR4_PGE);
> +	else
> +		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
>  
>  	/*  Save MTRR state */
>  	rdmsrl(MSR_MTRRdefType, deftype);
>  
>  	/*  Disable MTRRs, and set the default type to uncached  */
>  	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
> +
> +	return !!(cr4 & X86_CR4_PGE);

Unnecessary !!.

> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -72,20 +72,39 @@ static void post_flush(u32 t)
>      this_cpu(tlbflush_time) = t;
>  }
>  
> +static void do_flush_tlb(unsigned long cr3)

I think this is not a good name, because for its use in write_cr3()
the TLB flush is specifically a secondary effect. Personally I'd
prefer the function to be named e.g. do_write_cr3().

> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -622,7 +622,8 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
>              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
>              X86_CR4_FSGSBASE))                              \
>        | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
> -     & ~X86_CR4_DE)
> +     & ~(X86_CR4_DE |                                       \
> +         ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))

With this you manage to turn off global pages when switching to
a PV vCPU. But I can't see how you turn global pages back on when
switching away from it. I can see they would be turned back on e.g.
on the first entry to a VMX guest, but how about an SVM one? And
how about the time between switching away from the PV vCPU and
that VM entry? Granted all flushes are global ones right now, but
that should change with the modification here: If you look back at
4.2 code, you'll see that FLUSH_TLB was handled differently in that
case, retaining Xen's global mappings. Any flush IPI not requesting
global pages to be flushed could then leave intact Xen's own TLB
entries, which takes as a prereq that CR4.PGE gets turned back on
earlier.

And one more change would belong into this patch, I think: In patch
2 you change write_ptbase(). The bare CR3 write there would
become eligible to tick the TLB flush clock with what you do here.

Talking of VMX: I wonder whether it wouldn't be better (perhaps both
cheaper and long term possibly more correct) if vmx_ctxt_switch_to()
didn't write CR4, but instead sync-ed HOST_CR4 to what read_cr4()
returns. Jun, Kevin, do you have any thoughts on this? For the patch
here, this would get the behavior on par with SVM, as described above.

Jan




* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
       [not found]   ` <5AA14AF302000078001AFD30@suse.com>
@ 2018-03-08 14:05     ` Juergen Gross
  2018-03-08 14:33       ` Jan Beulich
       [not found]       ` <5AA157E002000078001AFDA4@suse.com>
  0 siblings, 2 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-08 14:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, andrew.cooper3, Dario Faggioli,
	Jun Nakajima, xen-devel

On 08/03/18 14:38, Jan Beulich wrote:
>>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
> 
> Hmm, it's far from obvious that this is an improvement overall.
> Besides Xen's global pages, we also cause guest user pages to be
> evicted from the TLB across user <-> kernel mode changes.
> And the effects of this are likely quite workload dependent.

With XPTI active we flush the TLB, including all global entries, when
switching page tables on returning to the guest. So there are no
entries which could survive.

>> @@ -412,18 +414,22 @@ static void prepare_set(void)
>>  	write_cr0(read_cr0() | X86_CR0_CD);
>>  	wbinvd();
>>  
>> -	/*  TLB flushing here relies on Xen always using CR4.PGE. */
>> -	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
>> -	write_cr4(read_cr4() & ~X86_CR4_PGE);
>> +	cr4 = read_cr4();
>> +	if (cr4 & X86_CR4_PGE)
>> +		write_cr4(cr4 & ~X86_CR4_PGE);
>> +	else
>> +		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
>>  
>>  	/*  Save MTRR state */
>>  	rdmsrl(MSR_MTRRdefType, deftype);
>>  
>>  	/*  Disable MTRRs, and set the default type to uncached  */
>>  	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
>> +
>> +	return !!(cr4 & X86_CR4_PGE);
> 
> Unnecessary !!.

Return type is bool. Isn't !! better in this case?

> 
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -72,20 +72,39 @@ static void post_flush(u32 t)
>>      this_cpu(tlbflush_time) = t;
>>  }
>>  
>> +static void do_flush_tlb(unsigned long cr3)
> 
> I think this is not a good name, because for its use in write_cr3()
> the TLB flush is specifically a secondary effect. Personally I'd
> prefer the function to be named e.g. do_write_cr3().

Okay.

> 
>> --- a/xen/include/asm-x86/domain.h
>> +++ b/xen/include/asm-x86/domain.h
>> @@ -622,7 +622,8 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
>>              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
>>              X86_CR4_FSGSBASE))                              \
>>        | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
>> -     & ~X86_CR4_DE)
>> +     & ~(X86_CR4_DE |                                       \
>> +         ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))
> 
> With this you manage to turn off global pages when switching to
> a PV vCPU. But I can't see how you turn global pages back on when
> switching away from it. I can see they would be turned back on e.g.
> on the first entry to a VMX guest, but how about an SVM one? And
> how about the time between switching away from the PV vCPU and
> that VM entry? Granted all flushes are global ones right now, but
> that should change with the modification here: If you look back at
> 4.2 code, you'll see that FLUSH_TLB was handled differently in that
> case, retaining Xen's global mappings. Any flush IPI not requesting
> global pages to be flushed could then leave intact Xen's own TLB
> entries, which takes as a prereq that CR4.PGE gets turned back on
> earlier.

Right, turning PGE on again is missing. I had a different solution for
switching PGE on and off in the beginning, but things got rather
complicated. So I changed my mind and turning PGE on again must have
slipped through.

> And one more change would belong into this patch, I think: In patch
> 2 you change write_ptbase(). The bare CR3 write there would
> become eligible to tick the TLB flush clock with what you do here.

Yes, I'll add that.


Juergen



* Re: [PATCH v2 5/6] xen/x86: use flag byte for decision whether xen_cr3 is valid
  2018-03-02  8:14 ` [PATCH v2 5/6] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
@ 2018-03-08 14:24   ` Jan Beulich
       [not found]   ` <5AA155BE02000078001AFD89@suse.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-08 14:24 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
> This reduces the number of branches in interrupt handling and results
> in better performance (e.g. parallel make of the Xen hypervisor on my
> system was using about 3% less system time).

3% seems an awful lot for a single conditional branch on each of the
three affected entry paths.

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1698,6 +1698,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>      ASSERT(local_irq_is_enabled());
>  
>      get_cpu_info()->xen_cr3 = 0;
> +    get_cpu_info()->use_xen_cr3 = false;

Don't you need this to be the other way around _and_ a barrier() in
between? As the context above shows, interrupts are enabled here
(and NMI/#MC can occur at any time anyway), so with the order
above it seems to me as if restore_all_xen might write zero into CR3.
While the ordering appears to be right elsewhere, the barrier() part
may apply to changes further down as well.
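The ordering being asked for can be sketched as follows (a stand-alone
illustration, not the actual Xen code: the struct layout and the
clear_xen_cr3() helper name are made up for this sketch):

```c
#include <stdbool.h>

/*
 * Sketch of the store ordering Jan asks for: clear use_xen_cr3 first,
 * then zero xen_cr3, with a compiler barrier in between, so an
 * interrupt/NMI arriving between the two stores can never pick up
 * xen_cr3 == 0 as a value to load into CR3.
 */
struct cpu_info {
    unsigned long xen_cr3;
    bool use_xen_cr3;
};

#define barrier() __asm__ __volatile__("" ::: "memory")

static void clear_xen_cr3(struct cpu_info *ci)
{
    ci->use_xen_cr3 = false;  /* exit path stops consuming xen_cr3 ... */
    barrier();                /* ... before the value itself goes away  */
    ci->xen_cr3 = 0;
}
```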

> @@ -523,18 +516,17 @@ ENTRY(common_interrupt)
>  
>  .Lintr_cr3_start:
>          mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
> +        mov   STACK_CPUINFO_FIELD(use_xen_cr3)(%r14), %bl
>          mov   %rcx, %r15
> -        neg   %rcx
> +        test  %rcx, %rcx
>          jz    .Lintr_cr3_okay
> -        jns   .Lintr_cr3_load
> -        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
> -        neg   %rcx
> -.Lintr_cr3_load:
> +        movb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%r14)
>          mov   %rcx, %cr3
>          xor   %ecx, %ecx
>          mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
>          testb $3, UREGS_cs(%rsp)
>          cmovnz %rcx, %r15
> +        cmovnz %cx, %bx

32-bit operation please.

> @@ -831,6 +820,7 @@ handle_ist_exception:
>           * and copy the context to stack bottom.
>           */
>          xor   %r15, %r15
> +        xor   %bl, %bl

Same here.

> @@ -68,6 +65,12 @@ struct cpu_info {
>       */
>      bool         root_pgt_changed;
>  
> +    /*
> +     * use_xen_cr3 is set in case the value of xen_cr3 is to be written into
> +     * CR3 when entering the hypervisor.
> +     */
> +    bool         use_xen_cr3;

When entering the hypervisor? Afaics the flag is evaluated only to
trigger the unlikely code in restore_all_xen, which is an exit path (as
the comment portion you remove from xen_cr3 also says).

Jan




* Re: [PATCH v2 5/6] xen/x86: use flag byte for decision whether xen_cr3 is valid
       [not found]   ` <5AA155BE02000078001AFD89@suse.com>
@ 2018-03-08 14:28     ` Juergen Gross
  0 siblings, 0 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-08 14:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, wei.liu2, xen-devel, Dario Faggioli

On 08/03/18 15:24, Jan Beulich wrote:
>>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
>> This reduces the number of branches in interrupt handling and results
>> in better performance (e.g. parallel make of the Xen hypervisor on my
>> system was using about 3% less system time).
> 
> 3% seems an awful lot for a single conditional branch on each of the
> three affected entry paths.

I measured it multiple times because I couldn't believe it.

>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1698,6 +1698,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>      ASSERT(local_irq_is_enabled());
>>  
>>      get_cpu_info()->xen_cr3 = 0;
>> +    get_cpu_info()->use_xen_cr3 = false;
> 
> Don't you need this to be the other way around _and_ a barrier() in
> between? As the context above shows, interrupts are enabled here
> (and NMI/#MC can occur at any time anyway), so with the order
> above it seems to me as if restore_all_xen might write zero into CR3.
> While the ordering appears to be right elsewhere, the barrier() part
> may apply to changes further down as well.

Yes, you are right. Thanks for catching this bug.

> 
>> @@ -523,18 +516,17 @@ ENTRY(common_interrupt)
>>  
>>  .Lintr_cr3_start:
>>          mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
>> +        mov   STACK_CPUINFO_FIELD(use_xen_cr3)(%r14), %bl
>>          mov   %rcx, %r15
>> -        neg   %rcx
>> +        test  %rcx, %rcx
>>          jz    .Lintr_cr3_okay
>> -        jns   .Lintr_cr3_load
>> -        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
>> -        neg   %rcx
>> -.Lintr_cr3_load:
>> +        movb  $0, STACK_CPUINFO_FIELD(use_xen_cr3)(%r14)
>>          mov   %rcx, %cr3
>>          xor   %ecx, %ecx
>>          mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
>>          testb $3, UREGS_cs(%rsp)
>>          cmovnz %rcx, %r15
>> +        cmovnz %cx, %bx
> 
> 32-bit operation please.

Okay.

> 
>> @@ -831,6 +820,7 @@ handle_ist_exception:
>>           * and copy the context to stack bottom.
>>           */
>>          xor   %r15, %r15
>> +        xor   %bl, %bl
> 
> Same here.

Okay.

> 
>> @@ -68,6 +65,12 @@ struct cpu_info {
>>       */
>>      bool         root_pgt_changed;
>>  
>> +    /*
>> +     * use_xen_cr3 is set in case the value of xen_cr3 is to be written into
>> +     * CR3 when entering the hypervisor.
>> +     */
>> +    bool         use_xen_cr3;
> 
> When entering the hypervisor? Afaics the flag is evaluated only to
> trigger the unlikely code in restore_all_xen, which is an exit path (as
> the comment portion you remove from xen_cr3 also says).

Sorry, you are right, of course. Will change the comment.


Juergen



* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-08 14:05     ` Juergen Gross
@ 2018-03-08 14:33       ` Jan Beulich
       [not found]       ` <5AA157E002000078001AFDA4@suse.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-08 14:33 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Kevin Tian, wei.liu2, andrew.cooper3, Dario Faggioli,
	Jun Nakajima, xen-devel

>>> On 08.03.18 at 15:05, <jgross@suse.com> wrote:
> On 08/03/18 14:38, Jan Beulich wrote:
>>>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
>>> Instead of flushing the TLB from global pages when switching address
>>> spaces with XPTI being active just disable global pages via %cr4
>>> completely when a domain subject to XPTI is active. This avoids the
>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>> entries.
>> 
>> Hmm, it's far from obvious that this is an improvement overall.
>> Besides Xen's global pages, we also cause guest user pages to be
>> evicted from the TLB across user <-> kernel mode changes.
>> And the effects of this are likely quite workload dependent.
> 
> With XPTI active we flush the TLB, including all global entries, when
> switching page tables on returning to the guest. So there are
> no entries which could survive.

Well, right, but that's the "light" in "XPTI light". From a functionality
POV we don't need the guest user mappings to be flushed. Did you
happen to look at what the effect would be of running Xen with
PTE.G uniformly clear (and with the CR4 writes dropped from the
exit paths)?

>>> @@ -412,18 +414,22 @@ static void prepare_set(void)
>>>  	write_cr0(read_cr0() | X86_CR0_CD);
>>>  	wbinvd();
>>>  
>>> -	/*  TLB flushing here relies on Xen always using CR4.PGE. */
>>> -	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
>>> -	write_cr4(read_cr4() & ~X86_CR4_PGE);
>>> +	cr4 = read_cr4();
>>> +	if (cr4 & X86_CR4_PGE)
>>> +		write_cr4(cr4 & ~X86_CR4_PGE);
>>> +	else
>>> +		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
>>>  
>>>  	/*  Save MTRR state */
>>>  	rdmsrl(MSR_MTRRdefType, deftype);
>>>  
>>>  	/*  Disable MTRRs, and set the default type to uncached  */
>>>  	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
>>> +
>>> +	return !!(cr4 & X86_CR4_PGE);
>> 
>> Unnecessary !!.
> 
> Return type is bool. Isn't !! better in this case?

That was strictly a requirement when we were still using bool_t.
But with bool the compiler is required to do the conversion even
without the !! (and you'll observe that step by step we're getting
rid of those !! now).
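The point about C99 bool can be seen in a minimal stand-alone sketch
(the pge_enabled() helper is hypothetical, not Xen code):

```c
#include <stdbool.h>

#define X86_CR4_PGE (1UL << 7)

/*
 * With C99 bool, returning any scalar performs the 0/1 normalisation
 * automatically, so an explicit !! is redundant; with the old typedef
 * bool_t (a plain integer type) it was needed to avoid returning e.g.
 * 0x80 where callers expected 0 or 1.
 */
static bool pge_enabled(unsigned long cr4)
{
    return cr4 & X86_CR4_PGE;  /* implicitly converted to true/false */
}
```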

Jan




* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
       [not found]       ` <5AA157E002000078001AFDA4@suse.com>
@ 2018-03-08 14:39         ` Juergen Gross
  0 siblings, 0 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-08 14:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, andrew.cooper3, Dario Faggioli,
	Jun Nakajima, xen-devel

On 08/03/18 15:33, Jan Beulich wrote:
>>>> On 08.03.18 at 15:05, <jgross@suse.com> wrote:
>> On 08/03/18 14:38, Jan Beulich wrote:
>>>>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
>>>> Instead of flushing the TLB from global pages when switching address
>>>> spaces with XPTI being active just disable global pages via %cr4
>>>> completely when a domain subject to XPTI is active. This avoids the
>>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>>> entries.
>>>
>>> Hmm, it's far from obvious that this is an improvement overall.
>>> Besides Xen's global pages, we also cause guest user pages to be
>>> evicted from the TLB across user <-> kernel mode changes.
>>> And the effects of this are likely quite workload dependent.
>>
>> With XPTI active we flush the TLB, including all global entries, when
>> switching page tables on returning to the guest. So there are
>> no entries which could survive.
> 
> Well, right, but that's the "light" in "XPTI light". From a functionality
> POV we don't need the guest user mappings to be flushed. Did you
> happen to look at what the effect would be of running Xen with
> PTE.G uniformly clear (and with the CR4 writes dropped from the
> exit paths)?

Not yet. I wanted to do that when adding PCID support for the
XPTI=false case, to get an idea whether it makes sense to keep global
pages when a processor doesn't support PCID, or whether there are
cases where global pages are better than using PCID.

> 
>>>> @@ -412,18 +414,22 @@ static void prepare_set(void)
>>>>  	write_cr0(read_cr0() | X86_CR0_CD);
>>>>  	wbinvd();
>>>>  
>>>> -	/*  TLB flushing here relies on Xen always using CR4.PGE. */
>>>> -	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
>>>> -	write_cr4(read_cr4() & ~X86_CR4_PGE);
>>>> +	cr4 = read_cr4();
>>>> +	if (cr4 & X86_CR4_PGE)
>>>> +		write_cr4(cr4 & ~X86_CR4_PGE);
>>>> +	else
>>>> +		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
>>>>  
>>>>  	/*  Save MTRR state */
>>>>  	rdmsrl(MSR_MTRRdefType, deftype);
>>>>  
>>>>  	/*  Disable MTRRs, and set the default type to uncached  */
>>>>  	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
>>>> +
>>>> +	return !!(cr4 & X86_CR4_PGE);
>>>
>>> Unnecessary !!.
>>
>> Return type is bool. Isn't !! better in this case?
> 
> That was strictly a requirement when we were still using bool_t.
> But with bool the compiler is required to do the conversion even
> without the !! (and you'll observe that step by step we're getting
> rid of those !! now).

Okay, I'll drop the !!.


Juergen



* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-02  8:14 ` [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active Juergen Gross
                     ` (2 preceding siblings ...)
       [not found]   ` <5AA14AF302000078001AFD30@suse.com>
@ 2018-03-08 15:06   ` Jan Beulich
  2018-03-09 14:40     ` Juergen Gross
  3 siblings, 1 reply; 38+ messages in thread
From: Jan Beulich @ 2018-03-08 15:06 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
> @@ -123,22 +142,14 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>              u32 t = pre_flush();
>  
>              if ( !cpu_has_invpcid )
> -            {
> -                unsigned long cr4 = read_cr4();
> -
> -                write_cr4(cr4 & ~X86_CR4_PGE);
> -                barrier();
> -                write_cr4(cr4);
> -            }
> +                do_flush_tlb(0);
>              else
> -            {
>                  /*
>                   * Using invpcid to flush all mappings works
>                   * regardless of whether PCID is enabled or not.
>                   * It is faster than read-modify-write CR4.
>                   */
>                  invpcid_flush_all();
> -            }

Btw, this is correct for FLUSH_TLB_GLOBAL, but goes too far for
FLUSH_TLB.
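For reference, the read-modify-write CR4 sequence removed by the hunk
above is the classic global-flush idiom; a toy model with a fake
register (the fake_cr4 variable and flush counter are purely
illustrative; on real hardware any CR4.PGE transition invalidates all
TLB entries, global ones included):

```c
#define X86_CR4_PGE (1UL << 7)

static unsigned long fake_cr4 = X86_CR4_PGE;
static unsigned int full_flushes;

static unsigned long read_cr4(void) { return fake_cr4; }

static void write_cr4(unsigned long val)
{
    if ((fake_cr4 ^ val) & X86_CR4_PGE)  /* PGE changed: global flush */
        full_flushes++;
    fake_cr4 = val;
}

/* The CR4.PGE toggle used as a full TLB flush (incl. global pages). */
static void flush_tlb_global(void)
{
    unsigned long cr4 = read_cr4();

    write_cr4(cr4 & ~X86_CR4_PGE);  /* clearing PGE flushes everything */
    write_cr4(cr4);                 /* restoring it flushes again */
}
```

This is why the toggle is only warranted for FLUSH_TLB_GLOBAL: a plain
FLUSH_TLB does not need the global entries dropped.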

Jan




* Re: [PATCH v2 6/6] xen/x86: use PCID feature for XPTI
  2018-03-02  8:14 ` [PATCH v2 6/6] xen/x86: use PCID feature for XPTI Juergen Gross
@ 2018-03-08 15:27   ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-08 15:27 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
> Avoid flushing the complete TLB when switching %cr3 for mitigation of
> Meltdown by using the PCID feature if available.
> 
> We are using 4 PCID values for a 64 bit pv domain subject to XPTI:
> 
> - hypervisor active and guest in kernel mode
> - guest active and in kernel mode
> - hypervisor active and guest in user mode
> - guest active and in user mode
> 
> The 2 hypervisor cases could possibly be merged, but for security
> reasons this is left for another patch.

I don't understand this part - one less PCID can't mean the code is
going to become more complicated, can it? And without PCID there's
no extra boundary there anyway. Furthermore, with two different
PCIDs INVLPG may no longer do what we expect it to do. This
(for the guest) similarly would affect e.g. MMUEXT_INVLPG_*.

> @@ -417,6 +418,8 @@ static bool prepare_set(void)
>  	cr4 = read_cr4();
>  	if (cr4 & X86_CR4_PGE)
>  		write_cr4(cr4 & ~X86_CR4_PGE);
> +	else if ( cpu_has_invpcid )
> +		invpcid_flush_all();
>  	else
>  		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
>  
> @@ -440,6 +443,8 @@ static void post_set(bool pge)
>  	/*  Reenable CR4.PGE (also flushes the TLB) */
>  	if (pge)
>  		write_cr4(read_cr4() | X86_CR4_PGE);
> +	else if ( cpu_has_invpcid )
> +		invpcid_flush_all();
>  	else
>  		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );

I'm confused by the ordering of these - I take it you imply that with
PCID enabled we'll have CR4.PGE clear at all times (but see also
below on that aspect). But is it worth making the code above
depend on it? It would look more natural and less prone to become
a hidden trap if you checked cpu_has_invpcid first.

And what about the PCID-but-no-INVPCID case? A simple CR3
write doesn't flush "foreign" PCIDs aiui. Oh, I've just found further
down that you enable PCID only when both are available. The
description alone doesn't make that clear.

> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -75,39 +75,46 @@ static void post_flush(u32 t)
>  static void do_flush_tlb(unsigned long cr3)
>  {
>      unsigned long cr4;
> +    u32 t;
> +
> +    t = pre_flush();
>  
>      cr4 = read_cr4();
> -    if ( cr4 & X86_CR4_PGE )
> +
> +    if ( cpu_has_invpcid )
>      {
> -        write_cr4(cr4 & ~X86_CR4_PGE);
>          if ( cr3 )
>              asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
> -        else
> -            barrier();
> -        write_cr4(cr4);
> +        if ( !cr3 || (cr3 & X86_CR3_NOFLUSH) || (cr4 & X86_CR4_PGE) )
> +            invpcid_flush_all();

Andrew has already explained how this may not be correct in all cases.

As to the function's name - by the time we get here I'm no longer
sure what the best name for it would be.

> @@ -128,30 +135,33 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>      {
>          if ( order == 0 )
>          {
> -            /*
> -             * We don't INVLPG multi-page regions because the 2M/4M/1G
> -             * region may not have been mapped with a superpage. Also there
> -             * are various errata surrounding INVLPG usage on superpages, and
> -             * a full flush is in any case not *that* expensive.
> -             */
> -            asm volatile ( "invlpg %0"
> -                           : : "m" (*(const char *)(va)) : "memory" );
> -        }
> -        else
> -        {
> -            u32 t = pre_flush();
> +            if ( read_cr3() & X86_CR3_PCIDMASK )

Wouldn't it be better to look at CR4.PCIDE? And then also flush PCID 0
below, just to be on the safe side?

> +            {
> +                unsigned long addr = (unsigned long)va;
>  
> -            if ( !cpu_has_invpcid )
> -                do_flush_tlb(0);
> +                /*
> +                 * Flush the addresses for all potential address spaces.
> +                 */

Single line comment.

> +                invpcid_flush_one(PCID_PV_PRIV, addr);
> +                invpcid_flush_one(PCID_PV_USER, addr);
> +                invpcid_flush_one(PCID_PV_PRIV | PCID_PV_XEN, addr);
> +                invpcid_flush_one(PCID_PV_USER | PCID_PV_XEN, addr);

Other than INVLPG this doesn't invalidate global mappings, yet I
can't identify any code to globally disable (or never enable)
CR4.PGE.

Out of curiosity, how do 4 INVPCIDs compare to a single INVLPG
performance-wise?
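A toy model of the four-way INVPCID flush quoted above (the PCID
constants mirror the patch; invpcid_flush_one() here merely records
which PCIDs were hit instead of executing the instruction, and note
that this INVPCID type indeed leaves global mappings alone):

```c
#define PCID_PV_PRIV 0x1
#define PCID_PV_USER 0x2
#define PCID_PV_XEN  0x4

static unsigned int flushed_pcids;

/* Stand-in for the single-address INVPCID type: records the PCID. */
static void invpcid_flush_one(unsigned int pcid, unsigned long addr)
{
    (void)addr;                 /* address ignored in this model */
    flushed_pcids |= 1u << pcid;
}

/* Flush one VA in all four PV address spaces, as in the patch. */
static void flush_va_in_all_spaces(unsigned long va)
{
    invpcid_flush_one(PCID_PV_PRIV, va);
    invpcid_flush_one(PCID_PV_USER, va);
    invpcid_flush_one(PCID_PV_PRIV | PCID_PV_XEN, va);
    invpcid_flush_one(PCID_PV_USER | PCID_PV_XEN, va);
}
```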

> +            }
>              else
> +            {
>                  /*
> -                 * Using invpcid to flush all mappings works
> -                 * regardless of whether PCID is enabled or not.
> -                 * It is faster than read-modify-write CR4.
> +                 * We don't INVLPG multi-page regions because the 2M/4M/1G
> +                 * region may not have been mapped with a superpage. Also there
> +                 * are various errata surrounding INVLPG usage on superpages,
> +                 * and a full flush is in any case not *that* expensive.
>                   */
> -                invpcid_flush_all();
> -
> -            post_flush(t);
> +                asm volatile ( "invlpg %0"
> +                               : : "m" (*(const char *)(va)) : "memory" );
> +            }
> +        }
> +        else
> +        {
> +            do_flush_tlb(0);
>          }

Pointless braces.

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -506,6 +506,8 @@ void free_shared_domheap_page(struct page_info *page)
>  void make_cr3(struct vcpu *v, mfn_t mfn)
>  {
>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
> +    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.pcid )
> +        v->arch.cr3 |= X86_CR3_NOFLUSH | get_pv_pcid(v, 1);

Is it a good idea to suppress the flush this way for every reader
of v->arch.cr3? For example, what about the use in
dbg_pv_va2mfn()? I think it should be left to the consumers of the
field to decide whether to OR in that flag (which isn't part of the
register value anyway).

> @@ -514,7 +516,15 @@ void write_ptbase(struct vcpu *v)
>      {
>          get_cpu_info()->root_pgt_changed = true;
>          get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
> -        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
> +        if ( v->domain->arch.pv_domain.pcid )
> +        {
> +            get_cpu_info()->pv_cr3 |= X86_CR3_NOFLUSH | get_pv_pcid(v, 0);
> +            write_cr3(v->arch.cr3);
> +        }
> +        else
> +        {
> +            asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
> +        }

Pointless braces again.

> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -96,8 +96,12 @@ void xpti_domain_init(struct domain *d)
>      }
>  
>      if ( d->arch.pv_domain.xpti )
> +    {
> +        d->arch.pv_domain.pcid = cpu_has_pcid && cpu_has_invpcid;

Perhaps again better to key off of CR4.PCIDE? For example I imagine
we'll want to have a command line option to suppress the use of PCID
in case of problems.

> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -260,8 +260,20 @@ struct pv_domain
>  
>      /* XPTI active? */
>      bool xpti;
> +
> +    /* Use PCID for the different address spaces? */
> +    bool pcid;
>  };
>  
> +/* PCID values for the address spaces: */
> +#define PCID_PV_PRIV      0x0001
> +#define PCID_PV_USER      0x0002
> +#define PCID_PV_XEN       0x0004    /* To be ORed to above values. */
> +
> +#define get_pv_pcid(v, xen)                                              \
> +    (((xen) ? PCID_PV_XEN : 0) |                                         \
> +     (((v)->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER))

Inline function? Plus - the "xen" parameter is really of boolean kind,
which means the callers want to pass true/false instead of 1/0.
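The suggested inline-function form might look like this (the struct
vcpu stand-in and the TF_kernel_mode value are illustrative only; the
real Xen definitions differ):

```c
#include <stdbool.h>

#define PCID_PV_PRIV 0x1
#define PCID_PV_USER 0x2
#define PCID_PV_XEN  0x4

#define TF_kernel_mode 0x1      /* illustrative flag value */

/* Minimal stand-in for struct vcpu. */
struct vcpu {
    unsigned int flags;
};

/*
 * get_pv_pcid() as an inline function: typed parameters, and "xen"
 * as a genuine bool so callers pass true/false.
 */
static inline unsigned int get_pv_pcid(const struct vcpu *v, bool xen)
{
    return (xen ? PCID_PV_XEN : 0) |
           ((v->flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);
}
```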

> @@ -615,18 +627,18 @@ void vcpu_show_registers(const struct vcpu *);
>  unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
>  
>  /* Convert between guest-visible and real CR4 values. */
> -#define pv_guest_cr4_to_real_cr4(v)                         \
> -    (((v)->arch.pv_vcpu.ctrlreg[4]                          \
> -      | (mmu_cr4_features                                   \
> -         & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |      \
> -            X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
> -            X86_CR4_FSGSBASE))                              \
> -      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
> -     & ~(X86_CR4_DE |                                       \
> +#define pv_guest_cr4_to_real_cr4(v)                             \
> +    (((v)->arch.pv_vcpu.ctrlreg[4]                              \
> +      | (mmu_cr4_features                                       \
> +         & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |          \
> +            X86_CR4_SMAP | X86_CR4_OSXSAVE |                    \
> +            X86_CR4_FSGSBASE | X86_CR4_PCIDE))                  \
> +      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))             \
> +     & ~(X86_CR4_DE |                                           \
>           ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))

I don't see why you need to reposition the backslashes here, without
which this would be quite a bit easier to review.

> --- a/xen/include/asm-x86/x86-defns.h
> +++ b/xen/include/asm-x86/x86-defns.h
> @@ -46,6 +46,7 @@
>   * Intel CPU flags in CR3
>   */
>  #define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
> +#define X86_CR3_PCIDMASK _AC(0x0000000000000fff, ULL) /* Mask for PCID */

I'm not intending to count whether the number of zeros here is "right";
any number of them is okay actually. Why don't you omit them?
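How the two CR3 bits compose with the page-table base can be sketched
like this (the compose_cr3() helper is hypothetical, and the mask is
written without the padding zeros as suggested):

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR3_NOFLUSH  (UINT64_C(1) << 63)
#define X86_CR3_PCIDMASK UINT64_C(0xfff)   /* PCID: low 12 bits */

/*
 * Compose a CR3 value from a page-aligned table base, a PCID, and
 * the no-flush request bit (bit 63).
 */
static uint64_t compose_cr3(uint64_t base, unsigned int pcid, bool noflush)
{
    return (base & ~(X86_CR3_NOFLUSH | X86_CR3_PCIDMASK)) |
           (pcid & X86_CR3_PCIDMASK) |
           (noflush ? X86_CR3_NOFLUSH : 0);
}
```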

Jan



* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-08 13:38   ` Jan Beulich
@ 2018-03-09  3:01     ` Tian, Kevin
  2018-03-09  5:23     ` Tian, Kevin
  1 sibling, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2018-03-09  3:01 UTC (permalink / raw)
  To: Jan Beulich, Juergen Gross
  Cc: andrew.cooper3, Dario Faggioli, wei.liu2, Nakajima, Jun, xen-devel

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 8, 2018 9:39 PM
> 
> > --- a/xen/include/asm-x86/domain.h
> > +++ b/xen/include/asm-x86/domain.h
> > @@ -622,7 +622,8 @@ unsigned long pv_guest_cr4_fixup(const struct
> vcpu *, unsigned long guest_cr4);
> >              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
> >              X86_CR4_FSGSBASE))                              \
> >        | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
> > -     & ~X86_CR4_DE)
> > +     & ~(X86_CR4_DE |                                       \
> > +         ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))
> 
> With this you manage to turn off global pages when switching to
> a PV vCPU. But I can't see how you turn global pages back on when
> switching away from it. I can see they would be turned back on e.g.
> on the first entry to a VMX guest, but how about an SVM one? And
> how about the time between switching away from the PV vCPU and
> that VM entry? Granted all flushes are global ones right now, but
> that should change with the modification here: If you look back at
> 4.2 code, you'll see that FLUSH_TLB was handled differently in that
> case, retaining Xen's global mappings. Any flush IPI not requesting
> global pages to be flushed could then leave intact Xen's own TLB
> entries, which takes as a prereq that CR4.PGE gets turned back on
> earlier.
> 
> And one more change would belong into this patch, I think: In patch
> 2 you change write_ptbase(). The bare CR3 write there would
> become eligible to tick the TLB flush clock with what you do here.
> 
> Talking of VMX: I wonder whether it wouldn't be better (perhaps both
> cheaper and long term possibly more correct) if vmx_ctxt_switch_to()
> didn't write CR4, but instead sync-ed HOST_CR4 to what read_cr4()
> returns. Jun, Kevin, do you have any thoughts on this? For the patch
> here, this would get the behavior on par with SVM, as described above.
> 

I don't see a reason why we would want HOST_CR4 to differ from the
latest CR4 value, so yes, I think it's a reasonable cleanup.

P.S. The current VMX logic looks problematic: it actually does a
reverse sync from HOST_CR4 to CR4 before VM entry, which doesn't
make any sense.

Thanks
Kevin 


* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-08 13:38   ` Jan Beulich
  2018-03-09  3:01     ` Tian, Kevin
@ 2018-03-09  5:23     ` Tian, Kevin
  2018-03-09  8:34       ` Jan Beulich
       [not found]       ` <5AA2551002000078001B0116@suse.com>
  1 sibling, 2 replies; 38+ messages in thread
From: Tian, Kevin @ 2018-03-09  5:23 UTC (permalink / raw)
  To: Jan Beulich, Juergen Gross
  Cc: andrew.cooper3, Dario Faggioli, wei.liu2, Nakajima, Jun, xen-devel

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 8, 2018 9:39 PM
> 
> > --- a/xen/include/asm-x86/domain.h
> > +++ b/xen/include/asm-x86/domain.h
> > @@ -622,7 +622,8 @@ unsigned long pv_guest_cr4_fixup(const struct
> vcpu *, unsigned long guest_cr4);
> >              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
> >              X86_CR4_FSGSBASE))                              \
> >        | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
> > -     & ~X86_CR4_DE)
> > +     & ~(X86_CR4_DE |                                       \
> > +         ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))
> 
> With this you manage to turn off global pages when switching to
> a PV vCPU. But I can't see how you turn global pages back on when
> switching away from it. I can see they would be turned back on e.g.
> on the first entry to a VMX guest, but how about an SVM one? And
> how about the time between switching away from the PV vCPU and
> that VM entry? Granted all flushes are global ones right now, but
> that should change with the modification here: If you look back at
> 4.2 code, you'll see that FLUSH_TLB was handled differently in that
> case, retaining Xen's global mappings. Any flush IPI not requesting
> global pages to be flushed could then leave intact Xen's own TLB
> entries, which takes as a prereq that CR4.PGE gets turned back on
> earlier.
> 

Btw, does PGE really matter with regard to entry to an HVM guest? Xen's
mappings are either all flushed (VPID disabled) or all retained
(VPID enabled) at VM entries, regardless of the global setting. So if
PGE is turned off for PV anyway and doesn't matter for HVM, is it
still useful to keep it turned on between switches?
 
Thanks
Kevin


* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-09  5:23     ` Tian, Kevin
@ 2018-03-09  8:34       ` Jan Beulich
       [not found]       ` <5AA2551002000078001B0116@suse.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-09  8:34 UTC (permalink / raw)
  To: Kevin Tian
  Cc: Juergen Gross, wei.liu2, andrew.cooper3, Dario Faggioli,
	Jun Nakajima, xen-devel

>>> On 09.03.18 at 06:23, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, March 8, 2018 9:39 PM
>> 
>> > --- a/xen/include/asm-x86/domain.h
>> > +++ b/xen/include/asm-x86/domain.h
>> > @@ -622,7 +622,8 @@ unsigned long pv_guest_cr4_fixup(const struct
>> vcpu *, unsigned long guest_cr4);
>> >              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
>> >              X86_CR4_FSGSBASE))                              \
>> >        | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
>> > -     & ~X86_CR4_DE)
>> > +     & ~(X86_CR4_DE |                                       \
>> > +         ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))
>> 
>> With this you manage to turn off global pages when switching to
>> a PV vCPU. But I can't see how you turn global pages back on when
>> switching away from it. I can see they would be turned back on e.g.
>> on the first entry to a VMX guest, but how about an SVM one? And
>> how about the time between switching away from the PV vCPU and
>> that VM entry? Granted all flushes are global ones right now, but
>> that should change with the modification here: If you look back at
>> 4.2 code, you'll see that FLUSH_TLB was handled differently in that
>> case, retaining Xen's global mappings. Any flush IPI not requesting
>> global pages to be flushed could then leave intact Xen's own TLB
>> entries, which takes as a prereq that CR4.PGE gets turned back on
>> earlier.
> 
> Btw, does PGE really matter with regard to entry to an HVM guest? Xen's
> mappings are either all flushed (VPID disabled) or all retained
> (VPID enabled) at VM entries, regardless of the global setting. So if
> PGE is turned off for PV anyway and doesn't matter for HVM, is it
> still useful to keep it turned on between switches?

Well, yes indeed, but that's only half of where we may want to
retain them (if possible). The other half is the time between
context switching in an HVM vCPU and the subsequent VM entry (or,
to be precise, the point in time when interrupts get turned off
before that VM entry, as that's where flush IPIs can still arrive).

Jan



* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
       [not found]       ` <5AA2551002000078001B0116@suse.com>
@ 2018-03-09  8:42         ` Juergen Gross
  0 siblings, 0 replies; 38+ messages in thread
From: Juergen Gross @ 2018-03-09  8:42 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian
  Cc: andrew.cooper3, wei.liu2, xen-devel, Jun Nakajima, Dario Faggioli

On 09/03/18 09:34, Jan Beulich wrote:
>>>> On 09.03.18 at 06:23, <kevin.tian@intel.com> wrote:
>>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>>> Sent: Thursday, March 8, 2018 9:39 PM
>>>
>>>> --- a/xen/include/asm-x86/domain.h
>>>> +++ b/xen/include/asm-x86/domain.h
>>>> @@ -622,7 +622,8 @@ unsigned long pv_guest_cr4_fixup(const struct
>>> vcpu *, unsigned long guest_cr4);
>>>>              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
>>>>              X86_CR4_FSGSBASE))                              \
>>>>        | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
>>>> -     & ~X86_CR4_DE)
>>>> +     & ~(X86_CR4_DE |                                       \
>>>> +         ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))
>>>
>>> With this you manage to turn off global pages when switching to
>>> a PV vCPU. But I can't see how you turn global pages back on when
>>> switching away from it. I can see they would be turned back on e.g.
>>> on the first entry to a VMX guest, but how about an SVM one? And
>>> how about the time between switching away from the PV vCPU and
>>> that VM entry? Granted all flushes are global ones right now, but
>>> that should change with the modification here: If you look back at
>>> 4.2 code, you'll see that FLUSH_TLB was handled differently in that
>>> case, retaining Xen's global mappings. Any flush IPI not requesting
>>> global pages to be flushed could then leave intact Xen's own TLB
>>> entries, which takes as a prereq that CR4.PGE gets turned back on
>>> earlier.
>>
>> Btw, does PGE really matter with regard to entry to an HVM guest? Xen's 
>> mappings are either all flushed (VPID disabled) or all retained
>> (VPID enabled) at VM entries, regardless of the global setting. So if 
>> PGE is turned off for PV anyway and doesn't matter for HVM, is it 
>> still useful to keep it turned on between switches?
> 
> Well, yes indeed, but that's only half of where we may want to
> retain them (if possible). The other half is the time between
> context switching in an HVM vCPU and the subsequent VM entry (or,
> to be precise, the point in time when interrupts get turned off
> before that VM entry, as that's where flush IPIs can still arrive).

I'd like to move loading cr4 from *_ctxt_switch_to() to write_ptbase()
in order to minimize TLB flushes. This will make it much easier to use
the optimal flushing strategy as all intermediate states are handled in
one place.


Juergen


* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-08 15:06   ` Jan Beulich
@ 2018-03-09 14:40     ` Juergen Gross
  2018-03-09 15:30       ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Juergen Gross @ 2018-03-09 14:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, wei.liu2, xen-devel, Dario Faggioli

On 08/03/18 16:06, Jan Beulich wrote:
>>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
>> @@ -123,22 +142,14 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>>              u32 t = pre_flush();
>>  
>>              if ( !cpu_has_invpcid )
>> -            {
>> -                unsigned long cr4 = read_cr4();
>> -
>> -                write_cr4(cr4 & ~X86_CR4_PGE);
>> -                barrier();
>> -                write_cr4(cr4);
>> -            }
>> +                do_flush_tlb(0);
>>              else
>> -            {
>>                  /*
>>                   * Using invpcid to flush all mappings works
>>                   * regardless of whether PCID is enabled or not.
>>                   * It is faster than read-modify-write CR4.
>>                   */
>>                  invpcid_flush_all();
>> -            }
> 
> Btw, this is correct for FLUSH_TLB_GLOBAL, but goes too far for
> FLUSH_TLB.

You are aware that my patches didn't change anything in this regard?


Juergen



* Re: [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active
  2018-03-09 14:40     ` Juergen Gross
@ 2018-03-09 15:30       ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2018-03-09 15:30 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, wei.liu2, xen-devel

>>> On 09.03.18 at 15:40, <jgross@suse.com> wrote:
> On 08/03/18 16:06, Jan Beulich wrote:
>>>>> On 02.03.18 at 09:14, <jgross@suse.com> wrote:
>>> @@ -123,22 +142,14 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>>>              u32 t = pre_flush();
>>>  
>>>              if ( !cpu_has_invpcid )
>>> -            {
>>> -                unsigned long cr4 = read_cr4();
>>> -
>>> -                write_cr4(cr4 & ~X86_CR4_PGE);
>>> -                barrier();
>>> -                write_cr4(cr4);
>>> -            }
>>> +                do_flush_tlb(0);
>>>              else
>>> -            {
>>>                  /*
>>>                   * Using invpcid to flush all mappings works
>>>                   * regardless of whether PCID is enabled or not.
>>>                   * It is faster than read-modify-write CR4.
>>>                   */
>>>                  invpcid_flush_all();
>>> -            }
>> 
>> Btw, this is correct for FLUSH_TLB_GLOBAL, but goes too far for
>> FLUSH_TLB.
> 
> You are aware that my patches didn't change anything in this regard?

Oh, indeed - I've just responded to Wei's patch.

Jan



end of thread, other threads:[~2018-03-09 15:30 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-02  8:13 [PATCH v2 0/6] xen/x86: various XPTI speedups Juergen Gross
2018-03-02  8:13 ` [PATCH v2 1/6] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
2018-03-05 16:43   ` Jan Beulich
2018-03-08 11:59     ` Juergen Gross
2018-03-08 12:47       ` Jan Beulich
     [not found]       ` <5AA13EEA02000078001AFCAF@suse.com>
2018-03-08 13:03         ` Juergen Gross
     [not found]   ` <5A9D81DC02000078001AEB68@suse.com>
2018-03-06  7:01     ` Juergen Gross
2018-03-06  7:58       ` Jan Beulich
     [not found]       ` <5A9E583002000078001AED3A@suse.com>
2018-03-06  8:06         ` Juergen Gross
2018-03-06  8:17           ` Jan Beulich
2018-03-02  8:13 ` [PATCH v2 2/6] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
2018-03-05 16:49   ` Jan Beulich
     [not found]   ` <5A9D831F02000078001AEB7E@suse.com>
2018-03-06  7:02     ` Juergen Gross
2018-03-02  8:14 ` [PATCH v2 3/6] xen/x86: support per-domain flag for xpti Juergen Gross
2018-03-08 10:17   ` Jan Beulich
     [not found]   ` <5AA11BDE02000078001AFB92@suse.com>
2018-03-08 11:30     ` Juergen Gross
2018-03-08 12:49       ` Jan Beulich
     [not found]       ` <5AA13F7D02000078001AFCB3@suse.com>
2018-03-08 13:13         ` Juergen Gross
2018-03-02  8:14 ` [PATCH v2 4/6] xen/x86: disable global pages for domains with XPTI active Juergen Gross
2018-03-02 11:03   ` Wei Liu
2018-03-02 11:30     ` Juergen Gross
2018-03-08 13:38   ` Jan Beulich
2018-03-09  3:01     ` Tian, Kevin
2018-03-09  5:23     ` Tian, Kevin
2018-03-09  8:34       ` Jan Beulich
     [not found]       ` <5AA2551002000078001B0116@suse.com>
2018-03-09  8:42         ` Juergen Gross
     [not found]   ` <5AA14AF302000078001AFD30@suse.com>
2018-03-08 14:05     ` Juergen Gross
2018-03-08 14:33       ` Jan Beulich
     [not found]       ` <5AA157E002000078001AFDA4@suse.com>
2018-03-08 14:39         ` Juergen Gross
2018-03-08 15:06   ` Jan Beulich
2018-03-09 14:40     ` Juergen Gross
2018-03-09 15:30       ` Jan Beulich
2018-03-02  8:14 ` [PATCH v2 5/6] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
2018-03-08 14:24   ` Jan Beulich
     [not found]   ` <5AA155BE02000078001AFD89@suse.com>
2018-03-08 14:28     ` Juergen Gross
2018-03-02  8:14 ` [PATCH v2 6/6] xen/x86: use PCID feature for XPTI Juergen Gross
2018-03-08 15:27   ` Jan Beulich
2018-03-05 16:20 ` [PATCH v2 0/6] xen/x86: various XPTI speedups Dario Faggioli
