* [PATCH v4 0/7] xen/x86: various XPTI speedups
From: Juergen Gross @ 2018-03-27  9:06 UTC
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

This patch series aims at reducing the overhead of the XPTI Meltdown
mitigation. It is based on Jan's XPTI speedup series.

Patch 1 has been posted before; the main changes in this version
address Jan's comments on the first version. The main objective of the
patch is to avoid copying the L4 page table each time the guest is
activated, as the contents often haven't changed while the hypervisor
was active.

Patch 2 minimizes TLB flushing: there is no need to flush the TLB in
write_ptbase() when it will be flushed again when activating the guest.

Patch 3 sets the stage for being able to activate XPTI per domain. As a
first step it is now possible to switch XPTI off for dom0 via the xpti
boot parameter.

Patch 4 adds support for using the INVPCID instruction for flushing
the TLB.

Patch 5 reduces the cost of TLB flushes even further: as we don't make
any use of global TLB entries with XPTI active, we can avoid having to
remove all global TLB entries on TLB flushes by simply disabling global
pages via CR4.

Patch 6 was originally only meant to prepare for using PCIDs in patch
7. For that purpose it was necessary to allow CR3 values with bit 63
set in order to avoid flushing TLB entries when writing CR3. This
required replacing Jan's rather clever state machine of positive and
negative CR3 values for the hypervisor with a dedicated flag byte. It
turned out this modification saves one branch on interrupt entry,
speeding up the handling by a few percent.

Patch 7 is the main performance contributor: by making use of the PCID
feature (if available), TLB entries can survive CR3 switches. The TLB
needs to be flushed only on context switches, not when switching
between guest and hypervisor or between guest kernel and user mode.

On my machine (Intel i7-4600M) using the PCID feature in the non-XPTI
case showed slightly worse performance than using global pages instead
(using PCID and global pages together is a bad idea, as invalidating a
global page would then require a complete TLB flush). For this reason
I've decided to use PCID only for XPTI by default. That can easily be
changed via the command line parameter "pcid=all".

The complete series has been verified to still mitigate Meltdown
attacks. A simple performance test (make -j 4 in the Xen hypervisor
directory) showed significant improvements compared to the state
without this series (i.e. with only Jan's series applied); the
percentages after the numbers are always relative to "XPTI off":

       XPTI off     Jan, XPTI on        +this series, XPTI on
real   1m21.169s    1m52.149s (+38%)    1m25.692s (+6%)
user   2m47.652s    2m50.054s (+1%)     2m46.428s (-1%)
sys    1m11.949s    2m21.767s (+97%)    1m23.053s (+15%)


Juergen Gross (7):
  x86/xpti: avoid copying L4 page table contents when possible
  x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  xen/x86: support per-domain flag for xpti
  xen/x86: use invpcid for flushing the TLB
  xen/x86: disable global pages for domains with XPTI active
  xen/x86: use flag byte for decision whether xen_cr3 is valid
  xen/x86: use PCID feature

 docs/misc/xen-command-line.markdown |  32 +++++++-
 xen/arch/x86/cpu/mtrr/generic.c     |  37 ++++++---
 xen/arch/x86/debug.c                |   2 +-
 xen/arch/x86/domain.c               |   6 +-
 xen/arch/x86/domain_page.c          |   2 +-
 xen/arch/x86/domctl.c               |   8 ++
 xen/arch/x86/flushtlb.c             |  91 ++++++++++++++++-----
 xen/arch/x86/mm.c                   |  84 ++++++++++++++++----
 xen/arch/x86/pv/dom0_build.c        |   4 +
 xen/arch/x86/pv/domain.c            | 154 +++++++++++++++++++++++++++++++++++-
 xen/arch/x86/setup.c                |  27 +++----
 xen/arch/x86/smp.c                  |   2 +-
 xen/arch/x86/smpboot.c              |   6 +-
 xen/arch/x86/x86_64/asm-offsets.c   |   2 +
 xen/arch/x86/x86_64/compat/entry.S  |   5 +-
 xen/arch/x86/x86_64/entry.S         |  87 +++++++++-----------
 xen/common/efi/runtime.c            |   4 +-
 xen/include/asm-x86/current.h       |  23 ++++--
 xen/include/asm-x86/domain.h        |  17 ++--
 xen/include/asm-x86/flushtlb.h      |   4 +-
 xen/include/asm-x86/processor.h     |   3 +
 xen/include/asm-x86/pv/domain.h     |  24 ++++++
 xen/include/asm-x86/x86-defns.h     |   4 +-
 23 files changed, 483 insertions(+), 145 deletions(-)

-- 
2.13.6



* [PATCH v4 1/7] x86/xpti: avoid copying L4 page table contents when possible
From: Juergen Gross @ 2018-03-27  9:06 UTC
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

To mitigate Meltdown, the current L4 page table is copied to the
cpu-local root page table each time a 64-bit pv guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag indicating whether the copying should be
performed, and set that flag only when loading a new %cr3 or modifying
the L4 page table. This includes synchronizing the cpu-local root page
table with other cpus, so add a special synchronization flag for that
case.
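
The intended fast path can be illustrated with a small C sketch (this
is only a model of the entry.S logic below; names other than
root_pgt_changed are made up for illustration):

    /* Conceptual model of the restore_all_guest path, not actual code. */
    static void enter_64bit_pv_guest(struct cpu_info *ci)
    {
        if ( ci->root_pgt_changed )
        {
            ci->root_pgt_changed = false;
            /* Hypothetical helper: copy the guest L4 entries into the
             * per-cpu root page table, skipping the Xen slots. */
            copy_guest_l4_to_root_pgt(ci);
        }
        /* With an unchanged L4 the copy above is skipped entirely. */
    }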

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V4:
- move setting of root_pgt_changed flag in flush_area_local() out of
  irq disabled section (Jan Beulich)
- move setting of root_pgt_changed in make_cr3() to _toggle_guest_pt()
  (Jan Beulich)
- remove most conditionals in write_ptbase() (Jan Beulich)
- don't set root_pgt_changed in do_mmu_update() for modification of
  the user page table (Jan Beulich)

V3:
- set flag locally only if affected L4 is active (Jan Beulich)
- add setting flag to flush_area_mask() (Jan Beulich)
- set flag in make_cr3() only if called for current active vcpu

To be applied on top of Jan's "Meltdown band-aid overhead reduction"
series
---
 xen/arch/x86/flushtlb.c           |  3 +++
 xen/arch/x86/mm.c                 | 38 +++++++++++++++++++++++++-------------
 xen/arch/x86/pv/domain.c          |  1 +
 xen/arch/x86/smp.c                |  2 +-
 xen/arch/x86/x86_64/asm-offsets.c |  1 +
 xen/arch/x86/x86_64/entry.S       |  8 ++++++--
 xen/include/asm-x86/current.h     |  8 ++++++++
 xen/include/asm-x86/flushtlb.h    |  2 ++
 8 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 8a7a76b8ff..38cedf3b22 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -160,5 +160,8 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
 
     local_irq_restore(irqfl);
 
+    if ( flags & FLUSH_ROOT_PGTBL )
+        get_cpu_info()->root_pgt_changed = true;
+
     return flags;
 }
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f0195561c2..b1d3f0eda8 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -503,6 +503,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
+    if ( this_cpu(root_pgt) )
+        get_cpu_info()->root_pgt_changed = true;
     write_cr3(v->arch.cr3);
 }
 
@@ -3698,18 +3700,28 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                    /*
-                     * No need to sync if all uses of the page can be accounted
-                     * to the page lock we hold, its pinned status, and uses on
-                     * this (v)CPU.
-                     */
-                    if ( !rc && !cpu_has_no_xpti &&
-                         ((page->u.inuse.type_info & PGT_count_mask) >
-                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
-                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
-                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
-                            mfn))) )
-                        sync_guest = true;
+                    if ( !rc && !cpu_has_no_xpti )
+                    {
+                        bool local_in_use = false;
+
+                        if ( pagetable_get_pfn(curr->arch.guest_table) == mfn )
+                        {
+                            local_in_use = true;
+                            get_cpu_info()->root_pgt_changed = true;
+                        }
+
+                        /*
+                         * No need to sync if all uses of the page can be
+                         * accounted to the page lock we hold, its pinned
+                         * status, and uses on this (v)CPU.
+                         */
+                        if ( (page->u.inuse.type_info & PGT_count_mask) >
+                             (1 + !!(page->u.inuse.type_info & PGT_pinned) +
+                              (pagetable_get_pfn(curr->arch.guest_table_user) ==
+                               mfn) +
+                              local_in_use) )
+                            sync_guest = true;
+                    }
                     break;
 
                 case PGT_writable_page:
@@ -3824,7 +3836,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL);
+            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2524326b74..fd529332a3 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -225,6 +225,7 @@ static void _toggle_guest_pt(struct vcpu *v, bool force_cr3)
 
     v->arch.flags ^= TF_kernel_mode;
     update_cr3(v);
+    get_cpu_info()->root_pgt_changed = true;
 
     /*
      * There's no need to load CR3 here when it is going to be loaded on the
diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index 033dd05958..63e819ca38 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -208,7 +208,7 @@ void invalidate_interrupt(struct cpu_user_regs *regs)
     ack_APIC_irq();
     perfc_incr(ipis);
     if ( (flags & FLUSH_VCPU_STATE) && __sync_local_execstate() )
-        flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL);
+        flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
     if ( flags & ~(FLUSH_VCPU_STATE | FLUSH_ORDER_MASK) )
         flush_area_local(flush_va, flags);
     cpumask_clear_cpu(smp_processor_id(), &flush_cpumask);
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index a2fea94f4c..9e2aefb00f 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -143,6 +143,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_shadow_spec_ctrl, struct cpu_info, shadow_spec_ctrl);
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
+    OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index e58ca90c18..18b79be539 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -129,10 +129,13 @@ restore_all_guest:
 .Lrag_cr3_start:
         mov   VCPU_cr3(%rbx), %r9
         GET_STACK_END(dx)
-        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax
+        cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
+        je    .Lrag_copy_done
+        movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
         movabs $PADDR_MASK & PAGE_MASK, %rsi
         movabs $DIRECTMAP_VIRT_START, %rcx
-        mov   %rdi, %rax
+        mov   %rax, %rdi
         and   %rsi, %rdi
         and   %r9, %rsi
         add   %rcx, %rdi
@@ -148,6 +151,7 @@ restore_all_guest:
         sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
                 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
         rep movsq
+.Lrag_copy_done:
         mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
         mov   %rdi, %rsi
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 3a0e1eef36..f2491b4423 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -59,6 +59,14 @@ struct cpu_info {
     bool         use_shadow_spec_ctrl;
     uint8_t      bti_ist_info;
 
+    /*
+     * The following field controls copying of the L4 page table of 64-bit
+     * PV guests to the per-cpu root page table on entering the guest context.
+     * If set the L4 page table is being copied to the root page table and
+     * the field will be reset.
+     */
+    bool         root_pgt_changed;
+
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
diff --git a/xen/include/asm-x86/flushtlb.h b/xen/include/asm-x86/flushtlb.h
index 2cade9cbfb..052f0fa403 100644
--- a/xen/include/asm-x86/flushtlb.h
+++ b/xen/include/asm-x86/flushtlb.h
@@ -103,6 +103,8 @@ void write_cr3(unsigned long cr3);
 #define FLUSH_VA_VALID   0x800
  /* Flush CPU state */
 #define FLUSH_VCPU_STATE 0x1000
+ /* Flush the per-cpu root page table */
+#define FLUSH_ROOT_PGTBL 0x2000
 
 /* Flush local TLBs/caches. */
 unsigned int flush_area_local(const void *va, unsigned int flags);
-- 
2.13.6



* [PATCH v4 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
From: Juergen Gross @ 2018-03-27  9:06 UTC
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

When switching to a 64-bit pv context the TLB is currently flushed
twice: the first time when switching to the new address space in
write_ptbase(), the second time when switching to guest mode in
restore_to_guest.

Avoid the first TLB flush in that case.
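
This boils down to using a plain CR3 write instead of write_cr3() in
that case. A sketch of the reasoning (assuming write_cr3() toggles
CR4.PGE around the load and thereby flushes global entries as well):

    /* A raw CR3 write flushes only non-global TLB entries, which is
     * sufficient here: the TLB is flushed again in restore_to_guest
     * anyway. */
    asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );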

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V3:
- omit setting root_pgt_changed to false (Jan Beulich)
---
 xen/arch/x86/mm.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b1d3f0eda8..f7d24a1f8b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -123,6 +123,7 @@
 #include <asm/io_apic.h>
 #include <asm/pci.h>
 #include <asm/guest.h>
+#include <asm/processor.h>
 
 #include <asm/hvm/grant_table.h>
 #include <asm/pv/grant_table.h>
@@ -503,9 +504,13 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-    if ( this_cpu(root_pgt) )
+    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+    {
         get_cpu_info()->root_pgt_changed = true;
-    write_cr3(v->arch.cr3);
+        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+    }
+    else
+        write_cr3(v->arch.cr3);
 }
 
 /*
-- 
2.13.6



* [PATCH v4 3/7] xen/x86: support per-domain flag for xpti
From: Juergen Gross @ 2018-03-27  9:07 UTC
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

Instead of switching XPTI globally on or off, add a per-domain flag for
that purpose. This allows extending the xpti boot parameter to support
running dom0 without Meltdown mitigations: using "xpti=nodom0" as boot
parameter will achieve that.

Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.
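
As an illustration of the resulting semantics, the per-domain decision
added below (in xpti_domain_init()) amounts to roughly the following
sketch (the helper name here is made up):

    /* Sketch only -- the real logic lives in xpti_domain_init() below. */
    static bool domain_wants_xpti(const struct domain *d)
    {
        if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
            return false;            /* XPTI concerns 64-bit pv only */
        switch ( opt_xpti )
        {
        case XPTI_OFF:    return false;
        case XPTI_ON:     return true;
        case XPTI_NODOM0: return d->domain_id != 0 &&
                                 d->domain_id != hardware_domid;
        default:          return false;  /* XPTI_DEFAULT resolved at boot */
        }
    }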

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V3:
- latch get_cpu_info() return value in variable (Jan Beulich)
- call always xpti_domain_init() for pv dom0 (Jan Beulich)
- add __init annotations (Jan Beulich)
- drop per domain XPTI message (Jan Beulich)
- document xpti=default support (Jan Beulich)
- move domain xpti flag into a padding hole (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 10 ++++-
 xen/arch/x86/domctl.c               |  4 ++
 xen/arch/x86/mm.c                   | 12 +++++-
 xen/arch/x86/pv/dom0_build.c        |  3 ++
 xen/arch/x86/pv/domain.c            | 79 ++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/setup.c                | 20 +---------
 xen/arch/x86/smpboot.c              |  4 +-
 xen/arch/x86/x86_64/entry.S         |  2 +
 xen/include/asm-x86/current.h       |  3 +-
 xen/include/asm-x86/domain.h        |  3 ++
 xen/include/asm-x86/pv/domain.h     |  4 ++
 11 files changed, 119 insertions(+), 25 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index b353352adf..79be9a6ba5 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1955,7 +1955,7 @@ clustered mode.  The default, given no hint from the **FADT**, is cluster
 mode.
 
 ### xpti
-> `= <boolean>`
+> `= default | nodom0 | <boolean>`
 
 > Default: `false` on AMD hardware
 > Default: `true` everywhere else
@@ -1963,6 +1963,14 @@ mode.
 Override default selection of whether to isolate 64-bit PV guest page
 tables.
 
+`true` activates page table isolation even on AMD hardware.
+
+`false` deactivates page table isolation on all systems.
+
+`default` sets the default behaviour.
+
+`nodom0` deactivates page table isolation for dom0.
+
 ### xsave
 > `= <boolean>`
 
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 8fbbf3aeb3..0704f398c7 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -24,6 +24,7 @@
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/support.h>
 #include <asm/processor.h>
+#include <asm/pv/domain.h>
 #include <asm/acpi.h> /* for hvm_acpi_power_button */
 #include <xen/hypercall.h> /* for arch_do_domctl */
 #include <xsm/xsm.h>
@@ -610,6 +611,9 @@ long arch_do_domctl(
             ret = switch_compat(d);
         else
             ret = -EINVAL;
+
+        if ( ret == 0 )
+            xpti_domain_init(d);
         break;
 
     case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f7d24a1f8b..008dcc1749 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -504,13 +504,21 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+    struct cpu_info *cpu_info = get_cpu_info();
+
+    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
-        get_cpu_info()->root_pgt_changed = true;
+        cpu_info->root_pgt_changed = true;
+        cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
         asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
     }
     else
+    {
+        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+        cpu_info->xen_cr3 = 0;
         write_cr3(v->arch.cr3);
+        cpu_info->pv_cr3 = 0;
+    }
 }
 
 /*
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 0bd2f1bf90..77186c19bd 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -19,6 +19,7 @@
 #include <asm/dom0_build.h>
 #include <asm/guest.h>
 #include <asm/page.h>
+#include <asm/pv/domain.h>
 #include <asm/pv/mm.h>
 #include <asm/setup.h>
 
@@ -707,6 +708,8 @@ int __init dom0_construct_pv(struct domain *d,
             cpu = p->processor;
     }
 
+    xpti_domain_init(d);
+
     d->arch.paging.mode = 0;
 
     /* Set up CR3 value for write_ptbase */
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index fd529332a3..e6bb2bac76 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -9,6 +9,8 @@
 #include <xen/lib.h>
 #include <xen/sched.h>
 
+#include <asm/cpufeature.h>
+#include <asm/msr-index.h>
 #include <asm/pv/domain.h>
 
 /* Override macros from asm/page.h to make them work with mfn_t */
@@ -17,6 +19,81 @@
 #undef page_to_mfn
 #define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
 
+static __read_mostly enum {
+    XPTI_DEFAULT,
+    XPTI_ON,
+    XPTI_OFF,
+    XPTI_NODOM0
+} opt_xpti = XPTI_DEFAULT;
+
+static __init int parse_xpti(const char *s)
+{
+    int rc = 0;
+
+    switch ( parse_bool(s, NULL) )
+    {
+    case 0:
+        opt_xpti = XPTI_OFF;
+        break;
+    case 1:
+        opt_xpti = XPTI_ON;
+        break;
+    default:
+        if ( !strcmp(s, "default") )
+            opt_xpti = XPTI_DEFAULT;
+        else if ( !strcmp(s, "nodom0") )
+            opt_xpti = XPTI_NODOM0;
+        else
+            rc = -EINVAL;
+        break;
+    }
+
+    return rc;
+}
+custom_param("xpti", parse_xpti);
+
+void __init xpti_init(void)
+{
+    uint64_t caps = 0;
+
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
+        caps = ARCH_CAPABILITIES_RDCL_NO;
+    else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( opt_xpti != XPTI_ON && (caps & ARCH_CAPABILITIES_RDCL_NO) )
+        opt_xpti = XPTI_OFF;
+    else if ( opt_xpti == XPTI_DEFAULT )
+        opt_xpti = XPTI_ON;
+
+    if ( opt_xpti == XPTI_OFF )
+        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
+    else
+        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
+}
+
+void xpti_domain_init(struct domain *d)
+{
+    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
+        return;
+
+    switch ( opt_xpti )
+    {
+    case XPTI_OFF:
+        break;
+    case XPTI_ON:
+        d->arch.pv_domain.xpti = true;
+        break;
+    case XPTI_NODOM0:
+        d->arch.pv_domain.xpti = d->domain_id != 0 &&
+                                 d->domain_id != hardware_domid;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
 static void noreturn continue_nonidle_domain(struct vcpu *v)
 {
     check_wakeup_from_wait();
@@ -265,7 +342,7 @@ void toggle_guest_mode(struct vcpu *v)
     }
     asm volatile ( "swapgs" );
 
-    _toggle_guest_pt(v, cpu_has_no_xpti);
+    _toggle_guest_pt(v, !v->domain->arch.pv_domain.xpti);
 }
 
 void toggle_guest_pt(struct vcpu *v)
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 9591fd987b..33bb05b5d7 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -53,6 +53,7 @@
 #include <asm/cpuid.h>
 #include <asm/spec_ctrl.h>
 #include <asm/guest.h>
+#include <asm/pv/domain.h>
 
 /* opt_nosmp: If true, secondary processors are ignored. */
 static bool __initdata opt_nosmp;
@@ -169,9 +170,6 @@ static int __init parse_smap_param(const char *s)
 }
 custom_param("smap", parse_smap_param);
 
-static int8_t __initdata opt_xpti = -1;
-boolean_param("xpti", opt_xpti);
-
 bool __read_mostly acpi_disabled;
 bool __initdata acpi_force;
 static char __initdata acpi_param[10] = "";
@@ -1546,21 +1544,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( opt_xpti < 0 )
-    {
-        uint64_t caps = 0;
-
-        if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
-            caps = ARCH_CAPABILITIES_RDCL_NO;
-        else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
-            rdmsrl(MSR_ARCH_CAPABILITIES, caps);
-
-        opt_xpti = !(caps & ARCH_CAPABILITIES_RDCL_NO);
-    }
-    if ( opt_xpti )
-        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
-    else
-        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
+    xpti_init();
 
     if ( cpu_has_fsgsbase )
         set_in_cr4(X86_CR4_FSGSBASE);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index b0b72ca544..346a8e8a3f 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -331,7 +331,7 @@ void start_secondary(void *unused)
     spin_debug_disable();
 
     get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = this_cpu(root_pgt) ? __pa(this_cpu(root_pgt)) : 0;
+    get_cpu_info()->pv_cr3 = 0;
 
     load_system_tables();
 
@@ -1050,7 +1050,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
         panic("Error %d setting up PV root page table\n", rc);
     if ( per_cpu(root_pgt, 0) )
     {
-        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
+        get_cpu_info()->pv_cr3 = 0;
 
         /*
          * All entry points which may need to switch page tables have to start
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 18b79be539..2a06cd1a51 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -130,6 +130,8 @@ restore_all_guest:
         mov   VCPU_cr3(%rbx), %r9
         GET_STACK_END(dx)
         mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax
+        test  %rax, %rax
+        jz    .Lrag_cr3_end
         cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
         je    .Lrag_copy_done
         movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index f2491b4423..b2475783f8 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -44,7 +44,8 @@ struct cpu_info {
     /*
      * Of the two following fields the latter is being set to the CR3 value
      * to be used on the given pCPU for loading whenever 64-bit PV guest
-     * context is being entered. The value never changes once set.
+     * context is being entered. A value of zero indicates no setting of CR3
+     * is to be performed.
      * The former is the value to restore when re-entering Xen, if any. IOW
      * its value being zero means there's nothing to restore. However, its
      * value can also be negative, indicating to the exit-to-Xen code that
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index a12ae47f1b..ed4199931a 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -253,6 +253,9 @@ struct pv_domain
 
     atomic_t nr_l4_pages;
 
+    /* XPTI active? */
+    bool xpti;
+
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
 
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index 5e34176939..911e5dc07f 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -27,6 +27,8 @@ void pv_vcpu_destroy(struct vcpu *v);
 int pv_vcpu_initialise(struct vcpu *v);
 void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d);
+void xpti_init(void);
+void xpti_domain_init(struct domain *d);
 
 #else  /* !CONFIG_PV */
 
@@ -36,6 +38,8 @@ static inline void pv_vcpu_destroy(struct vcpu *v) {}
 static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
 static inline void pv_domain_destroy(struct domain *d) {}
 static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
+static inline void xpti_init(void) {}
+static inline void xpti_domain_init(struct domain *d) {}
 
 #endif	/* CONFIG_PV */
 
-- 
2.13.6



* [PATCH v4 4/7] xen/x86: use invpcid for flushing the TLB
From: Juergen Gross @ 2018-03-27  9:07 UTC
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

If possible, use the INVPCID instruction for flushing the TLB instead
of toggling cr4.pge for that purpose.

While at it remove the dependency on cr4.pge being set for mtrr
loading, as this will be required later anyway.

Add a command line option "invpcid" for controlling the use of
INVPCID (defaulting to true).
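
For reference, a minimal sketch of how such an INVPCID wrapper can
look (modeled on the SDM's descriptor layout; Xen's actual helpers
live in asm/invpcid.h and may differ):

    #include <stdint.h>

    /* INVPCID types per the Intel SDM: 0 = individual address,
     * 1 = single PCID, 2 = all contexts incl. globals,
     * 3 = all contexts excl. globals. */
    static inline void invpcid_sketch(unsigned long type,
                                      unsigned long pcid,
                                      unsigned long addr)
    {
        /* 128-bit descriptor: PCID in the low 12 bits of the first
         * qword, linear address in the second. */
        struct { uint64_t pcid, addr; } desc = { pcid, addr };

        asm volatile ( "invpcid %0, %1"
                       : : "m" (desc), "r" (type) : "memory" );
    }

An invpcid_flush_all() as used below then corresponds to type 2 with a
zero descriptor, flushing global entries as well.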

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V4:
- option "invpcid" instead of "noinvpcid" (Jan Beulich)

V3:
- new patch
---
 docs/misc/xen-command-line.markdown | 10 ++++++++++
 xen/arch/x86/cpu/mtrr/generic.c     | 37 ++++++++++++++++++++++++++-----------
 xen/arch/x86/flushtlb.c             | 31 +++++++++++++++++++++----------
 xen/arch/x86/setup.c                |  7 +++++++
 4 files changed, 64 insertions(+), 21 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 79be9a6ba5..5f6ae654ad 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1380,6 +1380,16 @@ Because responsibility for APIC setup is shared between Xen and the
 domain 0 kernel this option is automatically propagated to the domain
 0 command line.
 
+### invpcid (x86)
+> `= <boolean>`
+
+> Default: `true`
+
+Control using the INVPCID instruction for flushing TLB entries.
+This should only be used in case of known issues on the current platform
+with that instruction. Disabling INVPCID will normally result in a slightly
+degraded performance.
+
 ### noirqbalance
 > `= <boolean>`
 
diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c
index e9c0e5e059..e88643f4bf 100644
--- a/xen/arch/x86/cpu/mtrr/generic.c
+++ b/xen/arch/x86/cpu/mtrr/generic.c
@@ -5,6 +5,7 @@
 #include <xen/mm.h>
 #include <xen/stdbool.h>
 #include <asm/flushtlb.h>
+#include <asm/invpcid.h>
 #include <asm/io.h>
 #include <asm/mtrr.h>
 #include <asm/msr.h>
@@ -400,8 +401,10 @@ static DEFINE_SPINLOCK(set_atomicity_lock);
  * has been called.
  */
 
-static void prepare_set(void)
+static bool prepare_set(void)
 {
+	unsigned long cr4;
+
 	/*  Note that this is not ideal, since the cache is only flushed/disabled
 	   for this CPU while the MTRRs are changed, but changing this requires
 	   more invasive changes to the way the kernel boots  */
@@ -412,18 +415,24 @@ static void prepare_set(void)
 	write_cr0(read_cr0() | X86_CR0_CD);
 	wbinvd();
 
-	/*  TLB flushing here relies on Xen always using CR4.PGE. */
-	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
-	write_cr4(read_cr4() & ~X86_CR4_PGE);
+	cr4 = read_cr4();
+	if (cr4 & X86_CR4_PGE)
+		write_cr4(cr4 & ~X86_CR4_PGE);
+	else if (cpu_has_invpcid)
+		invpcid_flush_all();
+	else
+		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
 	/*  Save MTRR state */
 	rdmsrl(MSR_MTRRdefType, deftype);
 
 	/*  Disable MTRRs, and set the default type to uncached  */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
+
+	return cr4 & X86_CR4_PGE;
 }
 
-static void post_set(void)
+static void post_set(bool pge)
 {
 	/* Intel (P6) standard MTRRs */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype);
@@ -432,7 +441,12 @@ static void post_set(void)
 	write_cr0(read_cr0() & ~X86_CR0_CD);
 
 	/*  Reenable CR4.PGE (also flushes the TLB) */
-	write_cr4(read_cr4() | X86_CR4_PGE);
+	if (pge)
+		write_cr4(read_cr4() | X86_CR4_PGE);
+	else if (cpu_has_invpcid)
+		invpcid_flush_all();
+	else
+		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
 	spin_unlock(&set_atomicity_lock);
 }
@@ -441,14 +455,15 @@ static void generic_set_all(void)
 {
 	unsigned long mask, count;
 	unsigned long flags;
+	bool pge;
 
 	local_irq_save(flags);
-	prepare_set();
+	pge = prepare_set();
 
 	/* Actually set the state */
 	mask = set_mtrr_state();
 
-	post_set();
+	post_set(pge);
 	local_irq_restore(flags);
 
 	/*  Use the atomic bitops to update the global mask  */
@@ -457,7 +472,6 @@ static void generic_set_all(void)
 			set_bit(count, &smp_changes_mask);
 		mask >>= 1;
 	}
-	
 }
 
 static void generic_set_mtrr(unsigned int reg, unsigned long base,
@@ -474,11 +488,12 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
 {
 	unsigned long flags;
 	struct mtrr_var_range *vr;
+	bool pge;
 
 	vr = &mtrr_state.var_ranges[reg];
 
 	local_irq_save(flags);
-	prepare_set();
+	pge = prepare_set();
 
 	if (size == 0) {
 		/* The invalid bit is kept in the mask, so we simply clear the
@@ -499,7 +514,7 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
 		mtrr_wrmsr(MSR_IA32_MTRR_PHYSMASK(reg), vr->mask);
 	}
 
-	post_set();
+	post_set(pge);
 	local_irq_restore(flags);
 }
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 38cedf3b22..e740520a8b 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -10,6 +10,7 @@
 #include <xen/sched.h>
 #include <xen/softirq.h>
 #include <asm/flushtlb.h>
+#include <asm/invpcid.h>
 #include <asm/page.h>
 
 /* Debug builds: Wrap frequently to stress-test the wrap logic. */
@@ -71,6 +72,25 @@ static void post_flush(u32 t)
     this_cpu(tlbflush_time) = t;
 }
 
+static void do_tlb_flush(void)
+{
+    u32 t;
+
+    t = pre_flush();
+
+    if ( cpu_has_invpcid )
+        invpcid_flush_all();
+    else
+    {
+        unsigned long cr4 = read_cr4();
+
+        write_cr4(cr4 ^ X86_CR4_PGE);
+        write_cr4(cr4);
+    }
+
+    post_flush(t);
+}
+
 void write_cr3(unsigned long cr3)
 {
     unsigned long flags, cr4;
@@ -118,16 +138,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
                            : : "m" (*(const char *)(va)) : "memory" );
         }
         else
-        {
-            u32 t = pre_flush();
-            unsigned long cr4 = read_cr4();
-
-            write_cr4(cr4 & ~X86_CR4_PGE);
-            barrier();
-            write_cr4(cr4);
-
-            post_flush(t);
-        }
+            do_tlb_flush();
     }
 
     if ( flags & FLUSH_CACHE )
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 33bb05b5d7..4eb85b2364 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -63,6 +63,10 @@ boolean_param("nosmp", opt_nosmp);
 static unsigned int __initdata max_cpus;
 integer_param("maxcpus", max_cpus);
 
+/* opt_invpcid: If false, don't use INVPCID instruction even if available. */
+static bool __initdata opt_invpcid = true;
+boolean_param("invpcid", opt_invpcid);
+
 unsigned long __read_mostly cr4_pv32_mask;
 
 /* **** Linux config option: propagated to domain0. */
@@ -1549,6 +1553,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( cpu_has_fsgsbase )
         set_in_cr4(X86_CR4_FSGSBASE);
 
+    if ( !opt_invpcid )
+        setup_clear_cpu_cap(X86_FEATURE_INVPCID);
+
     init_speculation_mitigations();
 
     init_idle_domain();
-- 
2.13.6



* [PATCH v4 5/7] xen/x86: disable global pages for domains with XPTI active
From: Juergen Gross @ 2018-03-27  9:07 UTC
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

Instead of flushing global pages from the TLB when switching address
spaces with XPTI active, just disable global pages completely via %cr4
while a domain subject to XPTI is active. This avoids the need for
extra TLB flushes, as loading %cr3 will then remove all TLB entries.

In order to avoid states where cr3 and cr4 have inconsistent values
(e.g. global pages being enabled while cr3 already specifies an XPTI
address space), move loading of the new cr4 value to write_ptbase()
(actually to write_cr3_cr4(), called by write_ptbase()).
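
The ordering constraint can be summarized in a short sketch (a
simplified model of the write_cr3_cr4() introduced below; the real
function also handles the TLB-flush clock and IRQ state):

    void write_cr3_cr4_sketch(unsigned long cr3, unsigned long cr4)
    {
        if ( read_cr4() & X86_CR4_PGE )
            write_cr4(cr4 & ~X86_CR4_PGE);  /* drop globals first */

        asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );

        if ( read_cr4() != cr4 )
            write_cr4(cr4);                 /* may re-enable globals */
    }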

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V4:
- don't use mmu_cr4_features for setting new cr4 value (Jan Beulich)
- use simpler scheme for setting X86_CR4_PGE in
  pv_guest_cr4_to_real_cr4() (Jan Beulich)

V3:
- move cr4 loading for all domains from *_ctxt_switch_to() to
  write_cr3_cr4() called by write_ptbase() (Jan Beulich)
- rebase
---
 xen/arch/x86/domain.c          |  5 -----
 xen/arch/x86/flushtlb.c        | 13 ++++++++-----
 xen/arch/x86/mm.c              | 14 +++++++++++---
 xen/arch/x86/x86_64/entry.S    | 10 ----------
 xen/common/efi/runtime.c       |  4 ++--
 xen/include/asm-x86/domain.h   |  3 ++-
 xen/include/asm-x86/flushtlb.h |  2 +-
 7 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index fbb320da9c..38c61dc13a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1522,17 +1522,12 @@ void paravirt_ctxt_switch_from(struct vcpu *v)
 void paravirt_ctxt_switch_to(struct vcpu *v)
 {
     root_pgentry_t *root_pgt = this_cpu(root_pgt);
-    unsigned long cr4;
 
     if ( root_pgt )
         root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
             l4e_from_page(v->domain->arch.perdomain_l3_pg,
                           __PAGE_HYPERVISOR_RW);
 
-    cr4 = pv_guest_cr4_to_real_cr4(v);
-    if ( unlikely(cr4 != read_cr4()) )
-        write_cr4(cr4);
-
     if ( unlikely(v->arch.debugreg[7] & DR7_ACTIVE_MASK) )
         activate_debugregs(v);
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index e740520a8b..5cb0ad97b8 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -91,20 +91,23 @@ static void do_tlb_flush(void)
     post_flush(t);
 }
 
-void write_cr3(unsigned long cr3)
+void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
 {
-    unsigned long flags, cr4;
+    unsigned long flags;
     u32 t;
 
     /* This non-reentrant function is sometimes called in interrupt context. */
     local_irq_save(flags);
 
     t = pre_flush();
-    cr4 = read_cr4();
 
-    write_cr4(cr4 & ~X86_CR4_PGE);
+    if ( read_cr4() & X86_CR4_PGE )
+        write_cr4(cr4 & ~X86_CR4_PGE);
+
     asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
-    write_cr4(cr4);
+
+    if ( read_cr4() != cr4 )
+        write_cr4(cr4);
 
     post_flush(t);
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 008dcc1749..4f637a3c3c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -505,20 +505,28 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 void write_ptbase(struct vcpu *v)
 {
     struct cpu_info *cpu_info = get_cpu_info();
+    unsigned long new_cr4;
+
+    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
+              ? pv_guest_cr4_to_real_cr4(v)
+              : ((read_cr4() & ~X86_CR4_TSD) | X86_CR4_PGE);
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
-        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+        write_cr3_cr4(v->arch.cr3, new_cr4);
     }
     else
     {
-        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+        /* Make sure to clear xen_cr3 before pv_cr3. */
         cpu_info->xen_cr3 = 0;
-        write_cr3(v->arch.cr3);
+        /* write_cr3_cr4() serializes. */
+        write_cr3_cr4(v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
+
+    ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
 }
 
 /*
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 2a06cd1a51..eb33ec835a 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -154,13 +154,8 @@ restore_all_guest:
                 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
         rep movsq
 .Lrag_copy_done:
-        mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
-        mov   %rdi, %rsi
-        and   $~X86_CR4_PGE, %rdi
-        mov   %rdi, %cr4
         mov   %rax, %cr3
-        mov   %rsi, %cr4
 .Lrag_cr3_end:
         ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
 
@@ -218,12 +213,7 @@ restore_all_xen:
          * so "g" will have to do.
          */
 UNLIKELY_START(g, exit_cr3)
-        mov   %cr4, %rdi
-        mov   %rdi, %rsi
-        and   $~X86_CR4_PGE, %rdi
-        mov   %rdi, %cr4
         mov   %rax, %cr3
-        mov   %rsi, %cr4
 UNLIKELY_END(exit_cr3)
 .Lrax_cr3_end:
         ALTERNATIVE_NOP .Lrax_cr3_start, .Lrax_cr3_end, X86_FEATURE_NO_XPTI
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 3dbc2e8ee5..fad8ca9e95 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -111,7 +111,7 @@ struct efi_rs_state efi_rs_enter(void)
         lgdt(&gdt_desc);
     }
 
-    write_cr3(virt_to_maddr(efi_l4_pgtable));
+    write_cr3_cr4(virt_to_maddr(efi_l4_pgtable), read_cr4());
 
     return state;
 }
@@ -120,7 +120,7 @@ void efi_rs_leave(struct efi_rs_state *state)
 {
     if ( !state->cr3 )
         return;
-    write_cr3(state->cr3);
+    write_cr3_cr4(state->cr3, read_cr4());
     if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
     {
         struct desc_ptr gdt_desc = {
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index ed4199931a..b7894dc8c8 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -618,9 +618,10 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
 #define pv_guest_cr4_to_real_cr4(v)                         \
     (((v)->arch.pv_vcpu.ctrlreg[4]                          \
       | (mmu_cr4_features                                   \
-         & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |      \
+         & (X86_CR4_PSE | X86_CR4_SMEP |                    \
             X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
             X86_CR4_FSGSBASE))                              \
+      | ((v)->domain->arch.pv_domain.xpti ? 0 : X86_CR4_PGE) \
       | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
      & ~X86_CR4_DE)
 #define real_cr4_to_pv_guest_cr4(c)                         \
diff --git a/xen/include/asm-x86/flushtlb.h b/xen/include/asm-x86/flushtlb.h
index 052f0fa403..1eb9682de4 100644
--- a/xen/include/asm-x86/flushtlb.h
+++ b/xen/include/asm-x86/flushtlb.h
@@ -84,7 +84,7 @@ static inline unsigned long read_cr3(void)
 }
 
 /* Write pagetable base and implicitly tick the tlbflush clock. */
-void write_cr3(unsigned long cr3);
+void write_cr3_cr4(unsigned long cr3, unsigned long cr4);
 
 /* flush_* flag fields: */
  /*
-- 
2.13.6



* [PATCH v4 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid
From: Juergen Gross @ 2018-03-27  9:07 UTC
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

Today cpu_info->xen_cr3 is either 0, indicating %cr3 doesn't need to
be switched on entry to Xen; negative, for keeping the value while
indicating not to restore %cr3; or positive, in case %cr3 is to be
restored.

Switch to using a flag byte instead of a negative xen_cr3 value in
order to allow %cr3 values with the high bit set, in case we want to
keep TLB entries when using the PCID feature.

This reduces the number of branches in interrupt handling and results
in better performance (e.g. a parallel make of the Xen hypervisor on
my system used about 3% less system time).
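
A sketch of the old vs. new encoding (illustrative comments only, not
actual Xen code):

    /* Old scheme: state encoded in the sign of xen_cr3.
     *   xen_cr3 == 0 : nothing to restore on entry to Xen
     *   xen_cr3 <  0 : don't restore, but keep the value for nested
     *                  entry paths (negate to get the real cr3)
     *   xen_cr3 >  0 : restore this value into %cr3
     *
     * New scheme: xen_cr3 always holds the plain value; a separate
     * use_pv_cr3 flag byte says whether pv_cr3 must be written back
     * on exit to Xen. This frees bit 63 of the cr3 value, which the
     * PCID patch will use as the "no flush" bit. */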

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V3:
- renamed use_xen_cr3 to better fitting use_pv_cr3
- corrected comment regarding semantics of use_pv_cr3 (Jan Beulich)
- prefer 32-bit operations over 8- or 16-bit ones (Jan Beulich)
---
 xen/arch/x86/domain.c              |  1 +
 xen/arch/x86/mm.c                  |  3 +-
 xen/arch/x86/smpboot.c             |  2 ++
 xen/arch/x86/x86_64/asm-offsets.c  |  1 +
 xen/arch/x86/x86_64/compat/entry.S |  5 ++-
 xen/arch/x86/x86_64/entry.S        | 67 +++++++++++++++++---------------------
 xen/include/asm-x86/current.h      | 12 ++++---
 7 files changed, 45 insertions(+), 46 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 38c61dc13a..57ff40bad8 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1695,6 +1695,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
     ASSERT(local_irq_is_enabled());
 
+    get_cpu_info()->use_pv_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 4f637a3c3c..856eb9e67f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -519,7 +519,8 @@ void write_ptbase(struct vcpu *v)
     }
     else
     {
-        /* Make sure to clear xen_cr3 before pv_cr3. */
+        /* Make sure to clear use_pv_cr3 and xen_cr3 before pv_cr3. */
+        cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
         /* write_cr3_cr4() serializes. */
         write_cr3_cr4(v->arch.cr3, new_cr4);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 346a8e8a3f..05109a98fa 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -330,6 +330,7 @@ void start_secondary(void *unused)
      */
     spin_debug_disable();
 
+    get_cpu_info()->use_pv_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
     get_cpu_info()->pv_cr3 = 0;
 
@@ -1129,6 +1130,7 @@ void __init smp_prepare_boot_cpu(void)
     per_cpu(scratch_cpumask, cpu) = &scratch_cpu0mask;
 #endif
 
+    get_cpu_info()->use_pv_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
     get_cpu_info()->pv_cr3 = 0;
 }
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 9e2aefb00f..7ad024cf37 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -144,6 +144,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
     OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
+    OFFSET(CPUINFO_use_pv_cr3, struct cpu_info, use_pv_cr3);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 6d2a14eacf..a18598f106 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -211,10 +211,9 @@ ENTRY(cstar_enter)
         GET_STACK_END(bx)
 .Lcstar_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lcstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index eb33ec835a..f51d091c23 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -155,6 +155,7 @@ restore_all_guest:
         rep movsq
 .Lrag_copy_done:
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
+        movb  $1, STACK_CPUINFO_FIELD(use_pv_cr3)(%rdx)
         mov   %rax, %cr3
 .Lrag_cr3_end:
         ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
@@ -205,14 +206,9 @@ restore_all_xen:
          */
         GET_STACK_END(bx)
 .Lrax_cr3_start:
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rdx
+        cmpb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
+UNLIKELY_START(ne, exit_cr3)
         mov   STACK_CPUINFO_FIELD(pv_cr3)(%rbx), %rax
-        test  %rdx, %rdx
-        /*
-         * Ideally the condition would be "nsz", but such doesn't exist,
-         * so "g" will have to do.
-         */
-UNLIKELY_START(g, exit_cr3)
         mov   %rax, %cr3
 UNLIKELY_END(exit_cr3)
 .Lrax_cr3_end:
@@ -257,10 +253,9 @@ ENTRY(lstar_enter)
         GET_STACK_END(bx)
 .Llstar_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Llstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
@@ -297,10 +292,9 @@ GLOBAL(sysenter_eflags_saved)
         /* PUSHF above has saved EFLAGS.IF clear (the caller had it set). */
         orl   $X86_EFLAGS_IF, UREGS_eflags(%rsp)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lsyse_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
@@ -350,10 +344,9 @@ ENTRY(int80_direct_trap)
         GET_STACK_END(bx)
 .Lint80_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lint80_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
@@ -561,25 +554,25 @@ ENTRY(common_interrupt)
 
 .Lintr_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_pv_cr3)(%r14), %bl
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lintr_cr3_okay
-        jns   .Lintr_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lintr_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3, UREGS_cs(%rsp)
         cmovnz %r12, %r15
+        cmovnz %r12, %rbx
 .Lintr_cr3_okay:
         ALTERNATIVE_NOP .Lintr_cr3_start, .Lintr_cr3_okay, X86_FEATURE_NO_XPTI
 
         CR4_PV32_RESTORE
         movq %rsp,%rdi
         callq do_IRQ
-        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)), \
+        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14);    \
+                                mov %bl, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)), \
                     "", X86_FEATURE_NO_XPTI
         jmp ret_from_intr
 
@@ -596,18 +589,17 @@ GLOBAL(handle_exception)
 
 .Lxcpt_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_pv_cr3)(%r14), %r13b
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lxcpt_cr3_okay
-        jns   .Lxcpt_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lxcpt_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3, UREGS_cs(%rsp)
         cmovnz %r12, %r15
+        cmovnz %r12, %r13
 .Lxcpt_cr3_okay:
         ALTERNATIVE_NOP .Lxcpt_cr3_start, .Lxcpt_cr3_okay, X86_FEATURE_NO_XPTI
 
@@ -662,7 +654,8 @@ handle_exception_saved:
         PERFC_INCR(exceptions, %rax, %rbx)
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
-        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)), \
+        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14);      \
+                                mov %r13b, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)), \
                     "", X86_FEATURE_NO_XPTI
         testb $3,UREGS_cs(%rsp)
         jz    restore_all_xen
@@ -696,7 +689,8 @@ exception_with_ints_disabled:
         rep;  movsq                     # make room for ec/ev
 1:      movq  UREGS_error_code(%rsp),%rax # ec/ev
         movq  %rax,UREGS_kernel_sizeof(%rsp)
-        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)), \
+        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14);      \
+                                mov %r13b, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)), \
                     "", X86_FEATURE_NO_XPTI
         jmp   restore_all_xen           # return to fixup code
 
@@ -785,9 +779,6 @@ ENTRY(double_fault)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx
         test  %rbx, %rbx
         jz    .Ldblf_cr3_okay
-        jns   .Ldblf_cr3_load
-        neg   %rbx
-.Ldblf_cr3_load:
         mov   %rbx, %cr3
 .Ldblf_cr3_okay:
 
@@ -817,13 +808,11 @@ handle_ist_exception:
 
 .List_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_pv_cr3)(%r14), %bl
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .List_cr3_okay
-        jns   .List_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.List_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
@@ -838,6 +827,7 @@ handle_ist_exception:
          * and copy the context to stack bottom.
          */
         xor   %r15, %r15
+        xor   %ebx, %ebx
         GET_CPUINFO_FIELD(guest_cpu_user_regs,di)
         movq  %rsp,%rsi
         movl  $UREGS_kernel_sizeof/8,%ecx
@@ -848,7 +838,8 @@ handle_ist_exception:
         leaq  exception_table(%rip),%rdx
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
-        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)), \
+        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14);    \
+                                mov %bl, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)), \
                     "", X86_FEATURE_NO_XPTI
         cmpb  $TRAP_nmi,UREGS_entry_vector(%rsp)
         jne   ret_from_intr
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index b2475783f8..43bdec1f49 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -47,10 +47,7 @@ struct cpu_info {
      * context is being entered. A value of zero indicates no setting of CR3
      * is to be performed.
      * The former is the value to restore when re-entering Xen, if any. IOW
-     * its value being zero means there's nothing to restore. However, its
-     * value can also be negative, indicating to the exit-to-Xen code that
-     * restoring is not necessary, but allowing any nested entry code paths
-     * to still know the value to put back into CR3.
+     * its value being zero means there's nothing to restore.
      */
     unsigned long xen_cr3;
     unsigned long pv_cr3;
@@ -68,6 +65,13 @@ struct cpu_info {
      */
     bool         root_pgt_changed;
 
+    /*
+     * use_pv_cr3 is set in case the value of pv_cr3 is to be written into
+     * CR3 when returning from an interrupt. The main use is when returning
+     * from an NMI or MCE to hypervisor code where pv_cr3 was active.
+     */
+    bool         use_pv_cr3;
+
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
-- 
2.13.6



* [PATCH v4 7/7] xen/x86: use PCID feature
  2018-03-27  9:06 [PATCH v4 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (5 preceding siblings ...)
  2018-03-27  9:07 ` [PATCH v4 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
@ 2018-03-27  9:07 ` Juergen Gross
  2018-03-29 14:19   ` Jan Beulich
       [not found]   ` <5ABD122B02000078001B73A2@suse.com>
  6 siblings, 2 replies; 19+ messages in thread
From: Juergen Gross @ 2018-03-27  9:07 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.

We are using 4 PCID values for a 64-bit pv domain subject to XPTI and
2 values for the non-XPTI case, as sketched after the list:

- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)
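
Expressed with the PCID_PV_* constants introduced later in this patch,
the mapping is roughly the following (the helper name here is
illustrative only; the real code uses get_pcid_bits()):

    /*
     * XPTI domain with the guest running (shadow L4):  PCIDs 2/3
     * Xen running, or domain not subject to XPTI:      PCIDs 0/1
     */
    static inline unsigned int pcid_for(bool xpti_guest, bool kernel)
    {
        return (xpti_guest ? PCID_PV_XPTI : 0) |
               (kernel ? PCID_PV_PRIV : PCID_PV_USER);
    }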

We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.

As the non-XPTI case has been shown not to perform better with PCID on
at least some machines, the default is to use PCID only for domains
subject to XPTI.

With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.

pv_guest_cr4_to_real_cr4() is switched from a macro to a real function
now as it has become more complex.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V4:
- add cr3 mask for page table address and use that in dbg_pv_va2mfn()
  (Jan Beulich)
- use invpcid_flush_all_nonglobals() instead of invpcid_flush_all()
  (Jan Beulich)
- use PCIDs 0/1 when running in Xen or without XPTI, 2/3 with XPTI in
  guest (Jan Beulich)
- ASSERT cr4.pge and cr4.pcide are never active at the same time
  (Jan Beulich)
- make pv_guest_cr4_to_real_cr4() a real function

V3:
- support PCID for non-XPTI case, too
- add command line parameter for controlling usage of PCID
- check PCID active by using cr4.pcide (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 12 ++++++
 xen/arch/x86/debug.c                |  2 +-
 xen/arch/x86/domain_page.c          |  2 +-
 xen/arch/x86/domctl.c               |  4 ++
 xen/arch/x86/flushtlb.c             | 44 +++++++++++++++++++--
 xen/arch/x86/mm.c                   | 24 +++++++++++-
 xen/arch/x86/pv/dom0_build.c        |  1 +
 xen/arch/x86/pv/domain.c            | 76 ++++++++++++++++++++++++++++++++++++-
 xen/include/asm-x86/domain.h        | 15 +++-----
 xen/include/asm-x86/processor.h     |  3 ++
 xen/include/asm-x86/pv/domain.h     | 20 ++++++++++
 xen/include/asm-x86/x86-defns.h     |  4 +-
 12 files changed, 188 insertions(+), 19 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 5f6ae654ad..db87fd326d 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1452,6 +1452,18 @@ All numbers specified must be hexadecimal ones.
 
 This option can be specified more than once (up to 8 times at present).
 
+### pcid (x86)
+> `= <boolean> | xpti | noxpti`
+
+> Default: `xpti`
+
+> Can be modified at runtime
+
+If available, control usage of the PCID feature of the processor for
+64-bit pv-domains. PCID can be used for no domain at all (`false`),
+for all of them (`true`), only for those subject to XPTI (`xpti`), or
+only for those not subject to XPTI (`noxpti`).
+
 ### ple\_gap
 > `= <integer>`
 
diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index 9159f32db4..0d46f2f45a 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -97,7 +97,7 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t pgd3val)
     l3_pgentry_t l3e, *l3t;
     l2_pgentry_t l2e, *l2t;
     l1_pgentry_t l1e, *l1t;
-    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
+    unsigned long cr3 = pgd3val ?: (dp->vcpu[0]->arch.cr3 & X86_CR3_ADDR_MASK);
     mfn_t mfn = maddr_to_mfn(cr3);
 
     DBGP2("vaddr:%lx domid:%d cr3:%lx pgd3:%lx\n", vaddr, dp->domain_id, 
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index b5780f201f..bf5bf45188 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
             sync_local_execstate();
         /* We must now be running on the idle page table. */
-        ASSERT(read_cr3() == __pa(idle_pg_table));
+        ASSERT((read_cr3() & ~X86_CR3_PCID_MASK) == __pa(idle_pg_table));
     }
 
     return v;
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 0704f398c7..a7c8772fa6 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -613,7 +613,11 @@ long arch_do_domctl(
             ret = -EINVAL;
 
         if ( ret == 0 )
+        {
             xpti_domain_init(d);
+            pcid_domain_init(d);
+        }
+
         break;
 
     case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 5cb0ad97b8..96f09239f0 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -12,6 +12,7 @@
 #include <asm/flushtlb.h>
 #include <asm/invpcid.h>
 #include <asm/page.h>
+#include <asm/pv/domain.h>
 
 /* Debug builds: Wrap frequently to stress-test the wrap logic. */
 #ifdef NDEBUG
@@ -102,7 +103,21 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
     t = pre_flush();
 
     if ( read_cr4() & X86_CR4_PGE )
+        /*
+         * X86_CR4_PGE set means PCID being inactive.
+         * We have to purge the TLB via flipping cr4.pge.
+         */
         write_cr4(cr4 & ~X86_CR4_PGE);
+    else if ( cpu_has_invpcid )
+        /*
+         * If we are using PCID purge the TLB via INVPCID as loading cr3
+         * will affect the current PCID only.
+         * If INVPCID is not supported we don't use PCIDs so loading cr3
+         * will purge the TLB (we are in the "global pages off" branch).
+         * invpcid_flush_all_nonglobals() seems to be faster than
+         * invpcid_flush_all().
+         */
+        invpcid_flush_all_nonglobals();
 
     asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
 
@@ -134,11 +149,32 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
             /*
              * We don't INVLPG multi-page regions because the 2M/4M/1G
              * region may not have been mapped with a superpage. Also there
-             * are various errata surrounding INVLPG usage on superpages, and
-             * a full flush is in any case not *that* expensive.
+             * are various errata surrounding INVLPG usage on superpages,
+             * and a full flush is in any case not *that* expensive.
              */
-            asm volatile ( "invlpg %0"
-                           : : "m" (*(const char *)(va)) : "memory" );
+            if ( read_cr4() & X86_CR4_PCIDE )
+            {
+                unsigned long addr = (unsigned long)va;
+
+                /*
+                 * Flush the addresses for all potential address spaces.
+                 * We can't check the current domain for being subject to
+                 * XPTI as current might be the idle vcpu while we still have
+                 * some XPTI domain TLB entries.
+                 * Using invpcid is okay here, as with PCID enabled we always
+                 * have global pages disabled.
+                 */
+                invpcid_flush_one(PCID_PV_PRIV, addr);
+                invpcid_flush_one(PCID_PV_USER, addr);
+                if ( !cpu_has_no_xpti )
+                {
+                    invpcid_flush_one(PCID_PV_PRIV | PCID_PV_XPTI, addr);
+                    invpcid_flush_one(PCID_PV_USER | PCID_PV_XPTI, addr);
+                }
+            }
+            else
+                asm volatile ( "invlpg %0"
+                               : : "m" (*(const char *)(va)) : "memory" );
         }
         else
             do_tlb_flush();
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 856eb9e67f..1566e5b3aa 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -126,6 +126,7 @@
 #include <asm/processor.h>
 
 #include <asm/hvm/grant_table.h>
+#include <asm/pv/domain.h>
 #include <asm/pv/grant_table.h>
 #include <asm/pv/mm.h>
 
@@ -499,7 +500,26 @@ void free_shared_domheap_page(struct page_info *page)
 
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
+    struct domain *d = v->domain;
+
     v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
+    if ( is_pv_domain(d) && d->arch.pv_domain.pcid )
+        v->arch.cr3 |= get_pcid_bits(v, false);
+}
+
+unsigned long pv_guest_cr4_to_real_cr4(struct vcpu *v)
+{
+    struct domain *d = v->domain;
+    unsigned long cr4;
+
+    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
+    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
+                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
+    cr4 |= (d->arch.pv_domain.xpti || d->arch.pv_domain.pcid) ? 0 : X86_CR4_PGE;
+    cr4 |= d->arch.pv_domain.pcid ? X86_CR4_PCIDE : 0;
+    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
+
+    return cr4;
 }
 
 void write_ptbase(struct vcpu *v)
@@ -509,12 +529,14 @@ void write_ptbase(struct vcpu *v)
 
     new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
               ? pv_guest_cr4_to_real_cr4(v)
-              : ((read_cr4() & ~X86_CR4_TSD) | X86_CR4_PGE);
+              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
+        if ( new_cr4 & X86_CR4_PCIDE )
+            cpu_info->pv_cr3 |= get_pcid_bits(v, true);
         write_cr3_cr4(v->arch.cr3, new_cr4);
     }
     else
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 77186c19bd..2af0094e95 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -709,6 +709,7 @@ int __init dom0_construct_pv(struct domain *d,
     }
 
     xpti_domain_init(d);
+    pcid_domain_init(d);
 
     d->arch.paging.mode = 0;
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index e6bb2bac76..838525f6e7 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -94,6 +94,70 @@ void xpti_domain_init(struct domain *d)
     }
 }
 
+static __read_mostly enum {
+    PCID_OFF,
+    PCID_ALL,
+    PCID_XPTI,
+    PCID_NOXPTI
+} opt_pcid = PCID_XPTI;
+
+static __init int parse_pcid(const char *s)
+{
+    int rc = 0;
+
+    switch ( parse_bool(s, NULL) )
+    {
+    case 0:
+        opt_pcid = PCID_OFF;
+        break;
+    case 1:
+        opt_pcid = PCID_ALL;
+        break;
+    default:
+        switch ( parse_boolean("xpti", s, NULL) )
+        {
+        case 0:
+            opt_pcid = PCID_NOXPTI;
+            break;
+        case 1:
+            opt_pcid = PCID_XPTI;
+            break;
+        default:
+            rc = -EINVAL;
+            break;
+        }
+        break;
+    }
+
+    return rc;
+}
+custom_runtime_param("pcid", parse_pcid);
+
+void pcid_domain_init(struct domain *d)
+{
+    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) ||
+         !cpu_has_invpcid || !cpu_has_pcid )
+        return;
+
+    switch ( opt_pcid )
+    {
+    case PCID_OFF:
+        break;
+    case PCID_ALL:
+        d->arch.pv_domain.pcid = true;
+        break;
+    case PCID_XPTI:
+        d->arch.pv_domain.pcid = d->arch.pv_domain.xpti;
+        break;
+    case PCID_NOXPTI:
+        d->arch.pv_domain.pcid = !d->arch.pv_domain.xpti;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
 static void noreturn continue_nonidle_domain(struct vcpu *v)
 {
     check_wakeup_from_wait();
@@ -298,11 +362,21 @@ int pv_domain_initialise(struct domain *d)
 
 static void _toggle_guest_pt(struct vcpu *v, bool force_cr3)
 {
+    struct domain *d = v->domain;
+
     ASSERT(!in_irq());
 
     v->arch.flags ^= TF_kernel_mode;
     update_cr3(v);
-    get_cpu_info()->root_pgt_changed = true;
+    if ( d->arch.pv_domain.xpti )
+    {
+        struct cpu_info *cpu_info = get_cpu_info();
+
+        cpu_info->root_pgt_changed = true;
+        cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
+                           (d->arch.pv_domain.pcid
+                            ? get_pcid_bits(v, true) : 0);
+    }
 
     /*
      * There's no need to load CR3 here when it is going to be loaded on the
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index b7894dc8c8..9a624d0e5f 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -255,6 +255,8 @@ struct pv_domain
 
     /* XPTI active? */
     bool xpti;
+    /* Use PCID feature? */
+    bool pcid;
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
@@ -615,19 +617,12 @@ void vcpu_show_registers(const struct vcpu *);
 unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
 
 /* Convert between guest-visible and real CR4 values. */
-#define pv_guest_cr4_to_real_cr4(v)                         \
-    (((v)->arch.pv_vcpu.ctrlreg[4]                          \
-      | (mmu_cr4_features                                   \
-         & (X86_CR4_PSE | X86_CR4_SMEP |                    \
-            X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
-            X86_CR4_FSGSBASE))                              \
-      | ((v)->domain->arch.pv_domain.xpti ? 0 : X86_CR4_PGE) \
-      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
-     & ~X86_CR4_DE)
+unsigned long pv_guest_cr4_to_real_cr4(struct vcpu *v);
+
 #define real_cr4_to_pv_guest_cr4(c)                         \
     ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
              X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-             X86_CR4_FSGSBASE | X86_CR4_SMAP))
+             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
 
 #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
 
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index db9988ab33..3067a8c58f 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -290,6 +290,9 @@ static inline unsigned long read_cr4(void)
 
 static inline void write_cr4(unsigned long val)
 {
+    /* No global pages in case of PCIDs enabled! */
+    ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
+
     get_cpu_info()->cr4 = val;
     asm volatile ( "mov %0,%%cr4" : : "r" (val) );
 }
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index 911e5dc07f..3c8c8f4ccc 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -21,6 +21,24 @@
 #ifndef __X86_PV_DOMAIN_H__
 #define __X86_PV_DOMAIN_H__
 
+/* PCID values for the address spaces of 64-bit pv domains: */
+#define PCID_PV_PRIV      0x0000    /* Used for other domains, too. */
+#define PCID_PV_USER      0x0001
+#define PCID_PV_XPTI      0x0002    /* To be ORed to above values. */
+
+/*
+ * Return additional PCID specific cr3 bits.
+ *
+ * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
+ * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCID_MASK in case
+ * the value is used to address the root page table.
+ */
+static inline unsigned long get_pcid_bits(const struct vcpu *v, bool is_xpti)
+{
+    return X86_CR3_NOFLUSH | (is_xpti ? PCID_PV_XPTI : 0) |
+           ((v->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);
+}
+
 #ifdef CONFIG_PV
 
 void pv_vcpu_destroy(struct vcpu *v);
@@ -29,6 +47,7 @@ void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d);
 void xpti_init(void);
 void xpti_domain_init(struct domain *d);
+void pcid_domain_init(struct domain *d);
 
 #else  /* !CONFIG_PV */
 
@@ -40,6 +59,7 @@ static inline void pv_domain_destroy(struct domain *d) {}
 static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
 static inline void xpti_init(void) {}
 static inline void xpti_domain_init(struct domain *d) {}
+static inline void pcid_domain_init(struct domain *d) {}
 
 #endif	/* CONFIG_PV */
 
diff --git a/xen/include/asm-x86/x86-defns.h b/xen/include/asm-x86/x86-defns.h
index ff8d66be3c..a80325f86d 100644
--- a/xen/include/asm-x86/x86-defns.h
+++ b/xen/include/asm-x86/x86-defns.h
@@ -45,7 +45,9 @@
 /*
  * Intel CPU flags in CR3
  */
-#define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
+#define X86_CR3_NOFLUSH    (_AC(1, ULL) << 63)
+#define X86_CR3_ADDR_MASK  (PAGE_MASK & ~X86_CR3_NOFLUSH)
+#define X86_CR3_PCID_MASK  _AC(0x0fff, ULL) /* Mask for PCID */
 
 /*
  * Intel CPU features in CR4
-- 
2.13.6



* Re: [PATCH v4 1/7] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-27  9:06 ` [PATCH v4 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
@ 2018-03-28 13:45   ` Jan Beulich
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2018-03-28 13:45 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, xen-devel

>>> On 27.03.18 at 11:06, <jgross@suse.com> wrote:
> For mitigation of Meltdown the current L4 page table is copied to the
> cpu local root page table each time a 64 bit pv guest is entered.
> 
> Copying can be avoided in cases where the guest L4 page table hasn't
> been modified while running the hypervisor, e.g. when handling
> interrupts or any hypercall not modifying the L4 page table or %cr3.
> 
> So add a per-cpu flag whether the copying should be performed and set
> that flag only when loading a new %cr3 or modifying the L4 page table.
> This includes synchronization of the cpu local root page table with
> other cpus, so add a special synchronization flag for that case.
> 
> A simple performance check (compiling the hypervisor via "make -j 4")
> in dom0 with 4 vcpus shows a significant improvement:
> 
> - real time drops from 112 seconds to 103 seconds
> - system time drops from 142 seconds to 131 seconds
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




* Re: [PATCH v4 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  2018-03-27  9:06 ` [PATCH v4 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
@ 2018-03-28 13:48   ` Jan Beulich
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2018-03-28 13:48 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, xen-devel

>>> On 27.03.18 at 11:06, <jgross@suse.com> wrote:
> When switching to a 64-bit pv context the TLB is flushed twice today:
> the first time when switching to the new address space in
> write_ptbase(), the second time when switching to guest mode in
> restore_to_guest.
> 
> Avoid the first TLB flush in that case.

This isn't entirely correct - what you avoid is just the flushing of
global entries. Perhaps "Limit the first flush to non-global entries"?

> Signed-off-by: Juergen Gross <jgross@suse.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




* Re: [PATCH v4 4/7] xen/x86: use invpcid for flushing the TLB
  2018-03-27  9:07 ` [PATCH v4 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
@ 2018-03-28 13:56   ` Jan Beulich
  2018-03-29 13:44   ` Jan Beulich
       [not found]   ` <5ABD09EC02000078001B737F@suse.com>
  2 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2018-03-28 13:56 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, xen-devel

>>> On 27.03.18 at 11:07, <jgross@suse.com> wrote:
> If possible use the INVPCID instruction for flushing the TLB instead of
> toggling cr4.pge for that purpose.
> 
> While at it remove the dependency on cr4.pge being required for mtrr
> loading, as this will be required later anyway.
> 
> Add a command line option "invpcid" for controlling the use of
> INVPCID (default to true).
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
with one remark (which could be addressed while committing):

> @@ -71,6 +72,25 @@ static void post_flush(u32 t)
>      this_cpu(tlbflush_time) = t;
>  }
>  
> +static void do_tlb_flush(void)
> +{
> +    u32 t;
> +
> +    t = pre_flush();

In the original code this was the initializer of the variable, and I'd
prefer if that stayed that way (unless later changes require the
split).
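
A minimal sketch of the shape I mean (the function body is only
reconstructed from the patch description, so take it as illustration):

    static void do_tlb_flush(void)
    {
        u32 t = pre_flush();    /* initializer, as in the original code */

        if ( cpu_has_invpcid )
            invpcid_flush_all();
        else
        {
            unsigned long cr4 = read_cr4();

            /* Toggle cr4.pge to flush, then restore it. */
            write_cr4(cr4 ^ X86_CR4_PGE);
            write_cr4(cr4);
        }

        post_flush(t);
    }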

Jan



* Re: [PATCH v4 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-27  9:07 ` [PATCH v4 5/7] xen/x86: disable global pages for domains with XPTI active Juergen Gross
@ 2018-03-29 13:12   ` Jan Beulich
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2018-03-29 13:12 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, xen-devel

>>> On 27.03.18 at 11:07, <jgross@suse.com> wrote:
> Instead of flushing the TLB from global pages when switching address
> spaces with XPTI being active just disable global pages via %cr4
> completely when a domain subject to XPTI is active. This avoids the
> need for extra TLB flushes as loading %cr3 will remove all TLB
> entries.
> 
> In order to avoid states with cr3/cr4 having inconsistent values
> (e.g. global pages being activated while cr3 already specifies a XPTI
> address space) move loading of the new cr4 value to write_ptbase()
> (actually to write_cr3_cr4() called by write_ptbase()).
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




* Re: [PATCH v4 4/7] xen/x86: use invpcid for flushing the TLB
  2018-03-27  9:07 ` [PATCH v4 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
  2018-03-28 13:56   ` Jan Beulich
@ 2018-03-29 13:44   ` Jan Beulich
       [not found]   ` <5ABD09EC02000078001B737F@suse.com>
  2 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2018-03-29 13:44 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, xen-devel

>>> On 27.03.18 at 11:07, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -63,6 +63,10 @@ boolean_param("nosmp", opt_nosmp);
>  static unsigned int __initdata max_cpus;
>  integer_param("maxcpus", max_cpus);
>  
> +/* opt_invpcid: If false, don't use INVPCID instruction even if available. */
> +static bool __initdata opt_invpcid = true;
> +boolean_param("invpcid", opt_invpcid);

Hmm, I'm sorry for noticing only now (while seeing the questionable
uses of cpu_has_invpcid in patch 7), but this being an init-only
variable and having ...

> @@ -1549,6 +1553,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>      if ( cpu_has_fsgsbase )
>          set_in_cr4(X86_CR4_FSGSBASE);
>  
> +    if ( !opt_invpcid )
> +        setup_clear_cpu_cap(X86_FEATURE_INVPCID);

... this effect has two issues: For one, in such a case this should
be a sub-option to "cpuid=". And then afaict it also disables use of
INVPCID in HVM guests. IOW I think you want to retain the option
but make the variable non-init and non-static. Obviously for early
boot use it may then no longer be possible to set it to true at build
time (you may end up needing two variables).
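
A rough sketch of what I have in mind (all names, and the helper
wrapping the late assignment, are purely illustrative):

    /* Boot time selection, as today: */
    static bool __initdata opt_invpcid = true;
    boolean_param("invpcid", opt_invpcid);

    /* Runtime visible result, usable by the flush code: */
    bool __read_mostly use_invpcid;

    static void __init init_invpcid(void)
    {
        use_invpcid = opt_invpcid && cpu_has_invpcid;
    }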

Jan



* Re: [PATCH v4 7/7] xen/x86: use PCID feature
  2018-03-27  9:07 ` [PATCH v4 7/7] xen/x86: use PCID feature Juergen Gross
@ 2018-03-29 14:19   ` Jan Beulich
       [not found]   ` <5ABD122B02000078001B73A2@suse.com>
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2018-03-29 14:19 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, xen-devel

>>> On 27.03.18 at 11:07, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/domain_page.c
> +++ b/xen/arch/x86/domain_page.c
> @@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
>          if ( (v = idle_vcpu[smp_processor_id()]) == current )
>              sync_local_execstate();
>          /* We must now be running on the idle page table. */
> -        ASSERT(read_cr3() == __pa(idle_pg_table));
> +        ASSERT((read_cr3() & ~X86_CR3_PCID_MASK) == __pa(idle_pg_table));

This would better use X86_CR3_ADDR_MASK as well.

> @@ -102,7 +103,21 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>      t = pre_flush();
>  
>      if ( read_cr4() & X86_CR4_PGE )
> +        /*
> +         * X86_CR4_PGE set means PCID being inactive.
> +         * We have to purge the TLB via flipping cr4.pge.
> +         */
>          write_cr4(cr4 & ~X86_CR4_PGE);
> +    else if ( cpu_has_invpcid )
> +        /*
> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
> +         * will affect the current PCID only.

s/current/new/ ?

> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
> +         * will purge the TLB (we are in the "global pages off" branch).
> +         * invpcid_flush_all_nonglobals() seems to be faster than
> +         * invpcid_flush_all().
> +         */
> +        invpcid_flush_all_nonglobals();
>  
>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );

What about the TLB entries that have been re-created between
the INVPCID and the write of CR3? It's not obvious to me that
leaving them around is not going to be a problem subsequently,
the more that you write cr3 frequently with the no-flush bit set.
Don't you need to do a single context invalidation for the prior
PCID (if different from the new one)?

> @@ -499,7 +500,26 @@ void free_shared_domheap_page(struct page_info *page)
>  
>  void make_cr3(struct vcpu *v, mfn_t mfn)
>  {
> +    struct domain *d = v->domain;
> +
>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
> +    if ( is_pv_domain(d) && d->arch.pv_domain.pcid )
> +        v->arch.cr3 |= get_pcid_bits(v, false);
> +}
> +
> +unsigned long pv_guest_cr4_to_real_cr4(struct vcpu *v)

const

> +{
> +    struct domain *d = v->domain;

again

> @@ -298,11 +362,21 @@ int pv_domain_initialise(struct domain *d)
>  
>  static void _toggle_guest_pt(struct vcpu *v, bool force_cr3)
>  {
> +    struct domain *d = v->domain;

and one more

> --- a/xen/include/asm-x86/x86-defns.h
> +++ b/xen/include/asm-x86/x86-defns.h
> @@ -45,7 +45,9 @@
>  /*
>   * Intel CPU flags in CR3
>   */
> -#define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
> +#define X86_CR3_NOFLUSH    (_AC(1, ULL) << 63)
> +#define X86_CR3_ADDR_MASK  (PAGE_MASK & ~X86_CR3_NOFLUSH)

This would better be PAGE_MASK & PADDR_MASK: bits 52...62
are reserved (the respective figure in chapter 2 of the SDM is not to
be trusted, the tables in the "4-level paging" section are more likely to
be correct).
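
I.e.:

    #define X86_CR3_ADDR_MASK  (PAGE_MASK & PADDR_MASK)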

Jan



* Re: [PATCH v4 4/7] xen/x86: use invpcid for flushing the TLB
       [not found]   ` <5ABD09EC02000078001B737F@suse.com>
@ 2018-03-29 14:29     ` Juergen Gross
  0 siblings, 0 replies; 19+ messages in thread
From: Juergen Gross @ 2018-03-29 14:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel

On 29/03/18 15:44, Jan Beulich wrote:
>>>> On 27.03.18 at 11:07, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -63,6 +63,10 @@ boolean_param("nosmp", opt_nosmp);
>>  static unsigned int __initdata max_cpus;
>>  integer_param("maxcpus", max_cpus);
>>  
>> +/* opt_invpcid: If false, don't use INVPCID instruction even if available. */
>> +static bool __initdata opt_invpcid = true;
>> +boolean_param("invpcid", opt_invpcid);
> 
> Hmm, I'm sorry for noticing only now (while seeing the questionable
> uses of cpu_has_invpcid in patch 7), but this being an init-only
> variable and having ...
> 
>> @@ -1549,6 +1553,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>      if ( cpu_has_fsgsbase )
>>          set_in_cr4(X86_CR4_FSGSBASE);
>>  
>> +    if ( !opt_invpcid )
>> +        setup_clear_cpu_cap(X86_FEATURE_INVPCID);
> 
> ... this effect has two issues: For one, in such a case this should
> be a sub-option to "cpuid=". And then afaict it also disables use of
> INVPCID in HVM guests. IOW I think you want to retain the option
> but make the variable non-init and non-static. Obviously for early
> boot use it may then no longer be possible to set it to true at build
> time (you may end up needing two variables).

Okay. I'll change it.


Juergen


* Re: [PATCH v4 7/7] xen/x86: use PCID feature
       [not found]   ` <5ABD122B02000078001B73A2@suse.com>
@ 2018-03-29 15:15     ` Juergen Gross
  2018-03-29 15:37       ` Jan Beulich
       [not found]       ` <5ABD245702000078001B745D@suse.com>
  0 siblings, 2 replies; 19+ messages in thread
From: Juergen Gross @ 2018-03-29 15:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel

On 29/03/18 16:19, Jan Beulich wrote:
>>>> On 27.03.18 at 11:07, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/domain_page.c
>> +++ b/xen/arch/x86/domain_page.c
>> @@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
>>          if ( (v = idle_vcpu[smp_processor_id()]) == current )
>>              sync_local_execstate();
>>          /* We must now be running on the idle page table. */
>> -        ASSERT(read_cr3() == __pa(idle_pg_table));
>> +        ASSERT((read_cr3() & ~X86_CR3_PCID_MASK) == __pa(idle_pg_table));
> 
> This would better use X86_CR3_ADDR_MASK as well.

Right.

> 
>> @@ -102,7 +103,21 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>      t = pre_flush();
>>  
>>      if ( read_cr4() & X86_CR4_PGE )
>> +        /*
>> +         * X86_CR4_PGE set means PCID being inactive.
>> +         * We have to purge the TLB via flipping cr4.pge.
>> +         */
>>          write_cr4(cr4 & ~X86_CR4_PGE);
>> +    else if ( cpu_has_invpcid )
>> +        /*
>> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
>> +         * will affect the current PCID only.
> 
> s/current/new/ ?

Okay.

> 
>> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
>> +         * will purge the TLB (we are in the "global pages off" branch).
>> +         * invpcid_flush_all_nonglobals() seems to be faster than
>> +         * invpcid_flush_all().
>> +         */
>> +        invpcid_flush_all_nonglobals();
>>  
>>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
> 
> What about the TLB entries that have been re-created between
> the INVPCID and the write of CR3? It's not obvious to me that
> leaving them around is not going to be a problem subsequently,
> the more that you write cr3 frequently with the no-flush bit set.

The no-flush bit should not make any difference here. It controls only
flushing of TLB-entries with the new PCID.

> Don't you need to do a single context invalidation for the prior
> PCID (if different from the new one)?

Hmm, I don't think so, as the mov to %cr3 is a serializing instruction
acting as a speculation barrier. So the only TLB entries which could
survive would be the ones for the few instruction bytes after the
invpcid instruction until the end of the mov to %cr3. Those are harmless
as they are associated with the hypervisor PCID value, so there is no
risk of any data leak to a guest. Maybe a comment explaining that would
be a good idea.
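
Something along these lines, perhaps (wording is just a sketch):

    invpcid_flush_all_nonglobals();

    /*
     * Any TLB entry the CPU re-creates between the INVPCID above and
     * the cr3 write below is tagged with the still-loaded hypervisor
     * PCID and thus cannot leak information to a guest.
     */
    asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );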

> 
>> @@ -499,7 +500,26 @@ void free_shared_domheap_page(struct page_info *page)
>>  
>>  void make_cr3(struct vcpu *v, mfn_t mfn)
>>  {
>> +    struct domain *d = v->domain;
>> +
>>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
>> +    if ( is_pv_domain(d) && d->arch.pv_domain.pcid )
>> +        v->arch.cr3 |= get_pcid_bits(v, false);
>> +}
>> +
>> +unsigned long pv_guest_cr4_to_real_cr4(struct vcpu *v)
> 
> const

Okay (for the other cases, too).

> 
>> +{
>> +    struct domain *d = v->domain;
> 
> again
> 
>> @@ -298,11 +362,21 @@ int pv_domain_initialise(struct domain *d)
>>  
>>  static void _toggle_guest_pt(struct vcpu *v, bool force_cr3)
>>  {
>> +    struct domain *d = v->domain;
> 
> and one more
> 
>> --- a/xen/include/asm-x86/x86-defns.h
>> +++ b/xen/include/asm-x86/x86-defns.h
>> @@ -45,7 +45,9 @@
>>  /*
>>   * Intel CPU flags in CR3
>>   */
>> -#define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
>> +#define X86_CR3_NOFLUSH    (_AC(1, ULL) << 63)
>> +#define X86_CR3_ADDR_MASK  (PAGE_MASK & ~X86_CR3_NOFLUSH)
> 
> This would better be PAGE_MASK & PADDR_MASK: bits 52...62
> are reserved (the respective figure in chapter 2 of the SDM is not to
> be trusted, the tables in the "4-level paging" section are more likely to
> be correct).

Okay.


Juergen



* Re: [PATCH v4 7/7] xen/x86: use PCID feature
  2018-03-29 15:15     ` Juergen Gross
@ 2018-03-29 15:37       ` Jan Beulich
       [not found]       ` <5ABD245702000078001B745D@suse.com>
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2018-03-29 15:37 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, xen-devel

>>> On 29.03.18 at 17:15, <jgross@suse.com> wrote:
> On 29/03/18 16:19, Jan Beulich wrote:
>>>>> On 27.03.18 at 11:07, <jgross@suse.com> wrote:
>>> @@ -102,7 +103,21 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>>      t = pre_flush();
>>>  
>>>      if ( read_cr4() & X86_CR4_PGE )
>>> +        /*
>>> +         * X86_CR4_PGE set means PCID being inactive.
>>> +         * We have to purge the TLB via flipping cr4.pge.
>>> +         */
>>>          write_cr4(cr4 & ~X86_CR4_PGE);
>>> +    else if ( cpu_has_invpcid )
>>> +        /*
>>> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
>>> +         * will affect the current PCID only.
>> 
>> s/current/new/ ?
> 
> Okay.
> 
>> 
>>> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
>>> +         * will purge the TLB (we are in the "global pages off" branch).
>>> +         * invpcid_flush_all_nonglobals() seems to be faster than
>>> +         * invpcid_flush_all().
>>> +         */
>>> +        invpcid_flush_all_nonglobals();
>>>  
>>>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>> 
>> What about the TLB entries that have been re-created between
>> the INVPCID and the write of CR3? It's not obvious to me that
>> leaving them around is not going to be a problem subsequently,
>> the more that you write cr3 frequently with the no-flush bit set.
> 
> The no-flush bit should not make any difference here. It controls only
> flushing of TLB-entries with the new PCID.

Right, but in a subsequent write to CR3 you may make active again
what was the old PCID here. This is in particular so for PCID 0 (which
has dual use).

>> Don't you need to do a single context invalidation for the prior
>> PCID (if different from the new one)?
> 
> Hmm, I don't think so, as the mov to %cr3 is a serializing instruction
> acting as a speculation barrier. So the only TLB entries which could
> survive would be the ones for the few instruction bytes after the
> invpcid instruction until the end of the mov to %cr3. Those are harmless
> as they are associated with the hypervisor PCID value, so there is no
> risk of any data leak to a guest. Maybe a comment explaining that would
> be a good idea.

Well, to be honest I don't trust in things like "speculation barrier"
anymore, at least not as far as things ahead of the barrier go.
Who knows what exactly the CPU does (and hence which TLB
entries it creates) between the INVPCID and the CR3 write. I
don't think we can blindly assume only entries for Xen mappings
could be created during that window.

Jan



* Re: [PATCH v4 7/7] xen/x86: use PCID feature
       [not found]       ` <5ABD245702000078001B745D@suse.com>
@ 2018-04-03  7:14         ` Juergen Gross
  2018-04-05  6:25           ` Juergen Gross
  0 siblings, 1 reply; 19+ messages in thread
From: Juergen Gross @ 2018-04-03  7:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel

On 29/03/18 17:37, Jan Beulich wrote:
>>>> On 29.03.18 at 17:15, <jgross@suse.com> wrote:
>> On 29/03/18 16:19, Jan Beulich wrote:
>>>>>> On 27.03.18 at 11:07, <jgross@suse.com> wrote:
>>>> @@ -102,7 +103,21 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>>>      t = pre_flush();
>>>>  
>>>>      if ( read_cr4() & X86_CR4_PGE )
>>>> +        /*
>>>> +         * X86_CR4_PGE set means PCID being inactive.
>>>> +         * We have to purge the TLB via flipping cr4.pge.
>>>> +         */
>>>>          write_cr4(cr4 & ~X86_CR4_PGE);
>>>> +    else if ( cpu_has_invpcid )
>>>> +        /*
>>>> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
>>>> +         * will affect the current PCID only.
>>>
>>> s/current/new/ ?
>>
>> Okay.
>>
>>>
>>>> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
>>>> +         * will purge the TLB (we are in the "global pages off" branch).
>>>> +         * invpcid_flush_all_nonglobals() seems to be faster than
>>>> +         * invpcid_flush_all().
>>>> +         */
>>>> +        invpcid_flush_all_nonglobals();
>>>>  
>>>>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>>>
>>> What about the TLB entries that have been re-created between
>>> the INVPCID and the write of CR3? It's not obvious to me that
>>> leaving them around is not going to be a problem subsequently,
>>> the more that you write cr3 frequently with the no-flush bit set.
>>
>> The no-flush bit should not make any difference here. It controls only
>> flushing of TLB-entries with the new PCID.
> 
> Right, but in a subsequent write to CR3 you may make active again
> what was the old PCID here. This is in particular so for PCID 0 (which
> has dual use).
> 
>>> Don't you need to do a single context invalidation for the prior
>>> PCID (if different from the new one)?
>>
>> Hmm, I don't think so, as the mov to %cr3 is a serializing instruction
>> acting as a speculation barrier. So the only TLB entries which could
>> survive would be the ones for the few instruction bytes after the
>> invpcid instruction until the end of the mov to %cr3. Those are harmless
>> as they are associated with the hypervisor PCID value, so there is no
>> risk of any data leak to a guest. Maybe a comment explaining that would
>> be a good idea.
> 
> Well, to be honest I don't trust in things like "speculation barrier"
> anymore, at least not as far as things ahead of the barrier go.
> Who knows what exactly the CPU does (and hence which TLB
> entries it creates) between the INVPCID and the CR3 write. I
> don't think we can blindly assume only entries for Xen mappings
> could be created during that window.

Those speculation barriers are one of the main mitigations for Spectre.
So either we don't think Spectre can be mitigated by using those or we
should trust the barriers to be effective in this case, too, IMHO.

Which documented behavior of the processor are you going to trust? I
think as long as there are no known errata in this regard we should
assume the CPU will behave as documented. And mov to %cr3 is documented
to be serializing, and serializing instructions are documented to be
speculation barriers.


Juergen


* Re: [PATCH v4 7/7] xen/x86: use PCID feature
  2018-04-03  7:14         ` Juergen Gross
@ 2018-04-05  6:25           ` Juergen Gross
  0 siblings, 0 replies; 19+ messages in thread
From: Juergen Gross @ 2018-04-05  6:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel

On 03/04/18 09:14, Juergen Gross wrote:
> On 29/03/18 17:37, Jan Beulich wrote:
>>>>> On 29.03.18 at 17:15, <jgross@suse.com> wrote:
>>> On 29/03/18 16:19, Jan Beulich wrote:
>>>>>>> On 27.03.18 at 11:07, <jgross@suse.com> wrote:
>>>>> @@ -102,7 +103,21 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>>>>      t = pre_flush();
>>>>>  
>>>>>      if ( read_cr4() & X86_CR4_PGE )
>>>>> +        /*
>>>>> +         * X86_CR4_PGE set means PCID being inactive.
>>>>> +         * We have to purge the TLB via flipping cr4.pge.
>>>>> +         */
>>>>>          write_cr4(cr4 & ~X86_CR4_PGE);
>>>>> +    else if ( cpu_has_invpcid )
>>>>> +        /*
>>>>> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
>>>>> +         * will affect the current PCID only.
>>>>
>>>> s/current/new/ ?
>>>
>>> Okay.
>>>
>>>>
>>>>> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
>>>>> +         * will purge the TLB (we are in the "global pages off" branch).
>>>>> +         * invpcid_flush_all_nonglobals() seems to be faster than
>>>>> +         * invpcid_flush_all().
>>>>> +         */
>>>>> +        invpcid_flush_all_nonglobals();
>>>>>  
>>>>>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>>>>
>>>> What about the TLB entries that have been re-created between
>>>> the INVPCID and the write of CR3? It's not obvious to me that
>>>> leaving them around is not going to be a problem subsequently,
>>>> the more that you write cr3 frequently with the no-flush bit set.
>>>
>>> The no-flush bit should not make any difference here. It controls only
>>> flushing of TLB-entries with the new PCID.
>>
>> Right, but in a subsequent write to CR3 you may make active again
>> what was the old PCID here. This is in particular so for PCID 0 (which
>> has dual use).
>>
>>>> Don't you need to do a single context invalidation for the prior
>>>> PCID (if different from the new one)?
>>>
>>> Hmm, I don't think so, as the mov to %cr3 is a serializing instruction
>>> acting as a speculation barrier. So the only TLB entries which could
>>> survive would be the ones for the few instruction bytes after the
>>> invpcid instruction until the end of the mov to %cr3. Those are harmless
>>> as they are associated with the hypervisor PCID value, so there is no
>>> risk of any data leak to a guest. Maybe a comment explaining that would
>>> be a good idea.
>>
>> Well, to be honest I don't trust in things like "speculation barrier"
>> anymore, at least not as far as things ahead of the barrier go.
>> Who knows what exactly the CPU does (and hence which TLB
>> entries it creates) between the INVPCID and the CR3 write. I
>> don't think we can blindly assume only entries for Xen mappings
>> could be created during that window.
> 
> Those speculation barriers are one of the main mitigations for Spectre.
> So either we don't think Spectre can be mitigated by using those or we
> should trust the barriers to be effective in this case, too, IMHO.
> 
> Which documented behavior of the processor are you going to trust? I
> think as long as there are no known errata in this regard we should
> assume the CPU will behave as documented. And mov to %cr3 is documented
> to be serializing, and serializing instructions are documented to be
> speculation barriers.

I have done my standard simple performance test with an extra
invpcid_flush_single_context() added in case the PCID changed in
write_cr3_cr4().

Performance impact seems to be negligible. OTOH this isn't really
surprising, as the only case where the additional flush would be needed
is vcpu scheduling with differing PCIDs: either the old or the new
vcpu (not both) needs to be in user mode of an XPTI domain.
Guest address space switches will always happen with PCID 0 (guest is in
kernel).

So I'm adding the additional flush to the patch. Better safe than sorry.
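
Condensed, write_cr3_cr4() then looks roughly as below (the cr4 restore
at the end is abridged from the existing function; treat this as a
sketch of the planned change, not the final code):

    void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
    {
        unsigned long old_pcid = read_cr3() & X86_CR3_PCID_MASK;
        u32 t = pre_flush();

        if ( read_cr4() & X86_CR4_PGE )
            /* PCID inactive: flush everything by flipping cr4.pge. */
            write_cr4(cr4 & ~X86_CR4_PGE);
        else if ( cpu_has_invpcid )
            invpcid_flush_all_nonglobals();

        asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );

        /*
         * Old and new PCID can only differ when PCIDs are in use, which
         * in turn requires INVPCID to be available.
         */
        if ( old_pcid != (cr3 & X86_CR3_PCID_MASK) )
            invpcid_flush_single_context(old_pcid);

        if ( read_cr4() != cr4 )
            write_cr4(cr4);

        post_flush(t);
    }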


Juergen

