* [PATCH v5 0/7] xen/x86: various XPTI speedups
@ 2018-04-06  7:52 Juergen Gross
  2018-04-06  7:52 ` [PATCH v5 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Juergen Gross @ 2018-04-06  7:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

This patch series aims at reducing the overhead of the XPTI Meltdown
mitigation.

Patch 1 has been posted before; the main changes in this version are
due to addressing Jan's comments on my first version. The main
objective of the patch is to avoid copying the L4 page table each time
the guest is activated, as its contents often haven't changed while the
hypervisor was active.

Patch 2 minimizes TLB flushing: there is no need to flush the TLB both
in write_ptbase() and again when entering the guest.

Patch 3 sets the stage for being able to activate XPTI per domain. As a
first step it is now possible to switch XPTI off for dom0 via the xpti
boot parameter.

Patch 4 adds support for using the INVPCID instruction for flushing
the TLB.

Patch 5 reduces the cost of TLB flushes even further: as we don't make
any use of global TLB entries with XPTI active, we can avoid having to
remove all global TLB entries on TLB flushes by simply disabling global
pages in CR4.

Patch 6 prepares for using PCIDs in patch 7.
For that purpose it is necessary to allow CR3 values with bit 63 set in
order to avoid flushing TLB entries when writing CR3. This requires
replacing Jan's rather clever state machine with positive and negative
CR3 values for the hypervisor by a dedicated flag byte instead.

Patch 7 is the main performance contributor: by making use of the PCID
feature (if available), TLB entries can survive CR3 switches. The TLB
then needs to be flushed on context switches only, not when switching
between guest and hypervisor or between guest kernel and user mode.
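
To illustrate how patches 6 and 7 fit together, here is a rough C
sketch of how a CR3 value with a PCID and the "no flush" bit might be
composed (the constants and the helper are illustrative assumptions,
not the series' actual definitions): with CR4.PCIDE set, bits 11:0 of
CR3 select the PCID, and setting bit 63 on a CR3 write tells the
processor not to flush the TLB entries tagged with that PCID.

    #include <stdbool.h>

    /* Illustrative sketch only - not the series' actual code. */
    #define CR3_NOFLUSH   (1UL << 63)  /* bit 63: keep entries of this PCID */
    #define CR3_PCID_MASK 0xfffUL      /* bits 11:0: PCID */

    static unsigned long sketch_make_cr3(unsigned long pgt_maddr,
                                         unsigned int pcid, bool noflush)
    {
        /* Page table base in the upper bits, PCID in the low 12 bits. */
        unsigned long cr3 = (pgt_maddr & ~CR3_PCID_MASK) |
                            (pcid & CR3_PCID_MASK);

        if ( noflush )
            cr3 |= CR3_NOFLUSH;   /* avoid flushing on this CR3 write */

        return cr3;
    }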

On my machine (Intel i7-4600M) using the PCID feature in the non-XPTI
case showed slightly worse performance than using global pages instead
(using PCID and global pages together is a bad idea, as invalidating
global pages would then need a complete TLB flush). For this reason
I've decided to use PCID only for XPTI as the default. That can easily
be changed via the command line parameter "pcid=true".
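
For reference, the tunables of this series could then be combined on
the Xen command line like this (illustrative fragment only; "xpti=" is
extended in patch 3, "invpcid=" is added in patch 4 and "pcid=" in
patch 7):

    xpti=nodom0 invpcid=true pcid=true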

The complete series has been verified to still mitigate Meltdown
attacks. A simple performance test (make -j 4 in the Xen hypervisor
directory) showed significant improvements compared to the state
without this series.
Numbers are seconds, stddev in parentheses.

xpti=false  elapsed         system         user
unpatched:  88.42 ( 2.01)   94.49 ( 1.38)  180.40 ( 1.41)
patched  :  89.45 ( 3.10)   96.47 ( 3.22)  181.34 ( 1.98)

xpti=true   elapsed         system         user
unpatched: 113.43 ( 3.68)  165.44 ( 4.41)  183.30 ( 1.72)
patched  :  92.76 ( 2.11)  103.39 ( 1.13)  184.86 ( 0.12)


Juergen Gross (7):
  x86/xpti: avoid copying L4 page table contents when possible
  x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  xen/x86: support per-domain flag for xpti
  xen/x86: use invpcid for flushing the TLB
  xen/x86: disable global pages for domains with XPTI active
  xen/x86: use flag byte for decision whether xen_cr3 is valid
  xen/x86: use PCID feature

 docs/misc/xen-command-line.markdown |  32 +++++++-
 xen/arch/x86/cpu/mtrr/generic.c     |  37 ++++++---
 xen/arch/x86/debug.c                |   2 +-
 xen/arch/x86/domain.c               |   6 +-
 xen/arch/x86/domain_page.c          |   2 +-
 xen/arch/x86/domctl.c               |   8 ++
 xen/arch/x86/flushtlb.c             |  99 ++++++++++++++++++-----
 xen/arch/x86/mm.c                   |  84 ++++++++++++++++----
 xen/arch/x86/pv/dom0_build.c        |   4 +
 xen/arch/x86/pv/domain.c            | 154 ++++++++++++++++++++++++++++++++++++
 xen/arch/x86/setup.c                |  28 +++----
 xen/arch/x86/smp.c                  |   2 +-
 xen/arch/x86/smpboot.c              |   6 +-
 xen/arch/x86/x86_64/asm-offsets.c   |   2 +
 xen/arch/x86/x86_64/compat/entry.S  |   5 +-
 xen/arch/x86/x86_64/entry.S         |  79 ++++++++----------
 xen/common/efi/runtime.c            |   4 +-
 xen/include/asm-x86/current.h       |  23 ++++--
 xen/include/asm-x86/domain.h        |  17 ++--
 xen/include/asm-x86/flushtlb.h      |   4 +-
 xen/include/asm-x86/invpcid.h       |   2 +
 xen/include/asm-x86/processor.h     |   3 +
 xen/include/asm-x86/pv/domain.h     |  24 ++++++
 xen/include/asm-x86/x86-defns.h     |   4 +-
 24 files changed, 491 insertions(+), 140 deletions(-)

-- 
2.13.6



* [PATCH v5 1/7] x86/xpti: avoid copying L4 page table contents when possible
  2018-04-06  7:52 [PATCH v5 0/7] xen/x86: various XPTI speedups Juergen Gross
@ 2018-04-06  7:52 ` Juergen Gross
  2018-04-06 11:09   ` Andrew Cooper
  2018-04-06  7:52 ` [PATCH v5 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2018-04-06  7:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

For the mitigation of Meltdown the current L4 page table is copied to
the cpu-local root page table each time a 64-bit PV guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag indicating whether the copying should be
performed, and set that flag only when loading a new %cr3 or modifying
the L4 page table.
This includes synchronization of the cpu local root page table with
other cpus, so add a special synchronization flag for that case.
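
A rough C sketch of the resulting protocol (illustrative only; in the
patch the flag lives in struct cpu_info and the conditional copy is
done in restore_all_guest, see the diff below):

    #include <stdbool.h>

    /* Per-cpu flag, modelled after cpu_info->root_pgt_changed below. */
    static bool root_pgt_changed;

    /* Called whenever the active L4 may have changed: a new %cr3 was
     * loaded, an L4 entry was modified, or a remote sync was requested
     * via FLUSH_ROOT_PGTBL. */
    static void note_root_pgt_change(void)
    {
        root_pgt_changed = true;
    }

    /* On entering the 64-bit PV guest: copy the guest L4 into the
     * per-cpu root page table only if something might have changed. */
    static void maybe_copy_l4(void (*copy_guest_l4)(void))
    {
        if ( root_pgt_changed )
        {
            root_pgt_changed = false;
            copy_guest_l4();
        }
    }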

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 113 seconds to 109 seconds
- system time drops from 165 seconds to 155 seconds

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V4:
- move setting of root_pgt_changed flag in flush_area_local() out of
  irq disabled section (Jan Beulich)
- move setting of root_pgt_changed in make_cr3() to _toggle_guest_pt()
  (Jan Beulich)
- remove most conditionals in write_ptbase() (Jan Beulich)
- don't set root_pgt_changed in do_mmu_update() for modification of
  the user page table (Jan Beulich)

V3:
- set flag locally only if affected L4 is active (Jan Beulich)
- add setting flag to flush_area_mask() (Jan Beulich)
- set flag in make_cr3() only if called for current active vcpu

To be applied on top of Jan's "Meltdown band-aid overhead reduction"
series
---
 xen/arch/x86/flushtlb.c           |  3 +++
 xen/arch/x86/mm.c                 | 38 +++++++++++++++++++++++++-------------
 xen/arch/x86/pv/domain.c          |  2 ++
 xen/arch/x86/smp.c                |  2 +-
 xen/arch/x86/x86_64/asm-offsets.c |  1 +
 xen/arch/x86/x86_64/entry.S       |  8 ++++++--
 xen/include/asm-x86/current.h     |  8 ++++++++
 xen/include/asm-x86/flushtlb.h    |  2 ++
 8 files changed, 48 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 8a7a76b8ff..38cedf3b22 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -160,5 +160,8 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
 
     local_irq_restore(irqfl);
 
+    if ( flags & FLUSH_ROOT_PGTBL )
+        get_cpu_info()->root_pgt_changed = true;
+
     return flags;
 }
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 6d39d2c8ab..fd89685486 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -504,6 +504,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
+    if ( this_cpu(root_pgt) )
+        get_cpu_info()->root_pgt_changed = true;
     write_cr3(v->arch.cr3);
 }
 
@@ -3701,18 +3703,28 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                    /*
-                     * No need to sync if all uses of the page can be accounted
-                     * to the page lock we hold, its pinned status, and uses on
-                     * this (v)CPU.
-                     */
-                    if ( !rc && !cpu_has_no_xpti &&
-                         ((page->u.inuse.type_info & PGT_count_mask) >
-                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
-                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
-                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
-                            mfn))) )
-                        sync_guest = true;
+                    if ( !rc && !cpu_has_no_xpti )
+                    {
+                        bool local_in_use = false;
+
+                        if ( pagetable_get_pfn(curr->arch.guest_table) == mfn )
+                        {
+                            local_in_use = true;
+                            get_cpu_info()->root_pgt_changed = true;
+                        }
+
+                        /*
+                         * No need to sync if all uses of the page can be
+                         * accounted to the page lock we hold, its pinned
+                         * status, and uses on this (v)CPU.
+                         */
+                        if ( (page->u.inuse.type_info & PGT_count_mask) >
+                             (1 + !!(page->u.inuse.type_info & PGT_pinned) +
+                              (pagetable_get_pfn(curr->arch.guest_table_user) ==
+                               mfn) +
+                              local_in_use) )
+                            sync_guest = true;
+                    }
                     break;
 
                 case PGT_writable_page:
@@ -3827,7 +3839,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL);
+            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 01c62e2d45..42522a2db3 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -223,6 +223,8 @@ static void _toggle_guest_pt(struct vcpu *v)
 {
     v->arch.flags ^= TF_kernel_mode;
     update_cr3(v);
+    get_cpu_info()->root_pgt_changed = true;
+
     /* Don't flush user global mappings from the TLB. Don't tick TLB clock. */
     asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
 
diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index 033dd05958..63e819ca38 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -208,7 +208,7 @@ void invalidate_interrupt(struct cpu_user_regs *regs)
     ack_APIC_irq();
     perfc_incr(ipis);
     if ( (flags & FLUSH_VCPU_STATE) && __sync_local_execstate() )
-        flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL);
+        flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
     if ( flags & ~(FLUSH_VCPU_STATE | FLUSH_ORDER_MASK) )
         flush_area_local(flush_va, flags);
     cpumask_clear_cpu(smp_processor_id(), &flush_cpumask);
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index a2fea94f4c..9e2aefb00f 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -143,6 +143,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_shadow_spec_ctrl, struct cpu_info, shadow_spec_ctrl);
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
+    OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 45d9842d09..30c9da5446 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -128,10 +128,13 @@ restore_all_guest:
         /* Copy guest mappings and switch to per-CPU root page table. */
         mov   VCPU_cr3(%rbx), %r9
         GET_STACK_END(dx)
-        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax
+        cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
+        je    .Lrag_copy_done
+        movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
         movabs $PADDR_MASK & PAGE_MASK, %rsi
         movabs $DIRECTMAP_VIRT_START, %rcx
-        mov   %rdi, %rax
+        mov   %rax, %rdi
         and   %rsi, %rdi
         jz    .Lrag_keep_cr3
         and   %r9, %rsi
@@ -148,6 +151,7 @@ restore_all_guest:
         sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
                 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
         rep movsq
+.Lrag_copy_done:
         mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
         mov   %rdi, %rsi
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 3a0e1eef36..f2491b4423 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -59,6 +59,14 @@ struct cpu_info {
     bool         use_shadow_spec_ctrl;
     uint8_t      bti_ist_info;
 
+    /*
+     * The following field controls copying of the L4 page table of 64-bit
+     * PV guests to the per-cpu root page table on entering the guest context.
+     * If set the L4 page table is being copied to the root page table and
+     * the field will be reset.
+     */
+    bool         root_pgt_changed;
+
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
diff --git a/xen/include/asm-x86/flushtlb.h b/xen/include/asm-x86/flushtlb.h
index 2cade9cbfb..052f0fa403 100644
--- a/xen/include/asm-x86/flushtlb.h
+++ b/xen/include/asm-x86/flushtlb.h
@@ -103,6 +103,8 @@ void write_cr3(unsigned long cr3);
 #define FLUSH_VA_VALID   0x800
  /* Flush CPU state */
 #define FLUSH_VCPU_STATE 0x1000
+ /* Flush the per-cpu root page table */
+#define FLUSH_ROOT_PGTBL 0x2000
 
 /* Flush local TLBs/caches. */
 unsigned int flush_area_local(const void *va, unsigned int flags);
-- 
2.13.6



* [PATCH v5 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  2018-04-06  7:52 [PATCH v5 0/7] xen/x86: various XPTI speedups Juergen Gross
  2018-04-06  7:52 ` [PATCH v5 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
@ 2018-04-06  7:52 ` Juergen Gross
  2018-04-06 11:21   ` Andrew Cooper
  2018-04-06  7:52 ` [PATCH v5 3/7] xen/x86: support per-domain flag for xpti Juergen Gross
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2018-04-06  7:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

When switching to a 64-bit PV context the TLB is flushed twice today:
the first time when switching to the new address space in
write_ptbase(), the second time when switching to guest mode in
restore_all_guest.

Limit the first flush to non-global entries in that case.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V3:
- omit setting root_pgt_changed to false (Jan Beulich)
---
 xen/arch/x86/mm.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fd89685486..cf2ccb07e6 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -124,6 +124,7 @@
 #include <asm/pci.h>
 #include <asm/guest.h>
 #include <asm/hvm/ioreq.h>
+#include <asm/processor.h>
 
 #include <asm/hvm/grant_table.h>
 #include <asm/pv/grant_table.h>
@@ -504,9 +505,13 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-    if ( this_cpu(root_pgt) )
+    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+    {
         get_cpu_info()->root_pgt_changed = true;
-    write_cr3(v->arch.cr3);
+        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+    }
+    else
+        write_cr3(v->arch.cr3);
 }
 
 /*
-- 
2.13.6



* [PATCH v5 3/7] xen/x86: support per-domain flag for xpti
  2018-04-06  7:52 [PATCH v5 0/7] xen/x86: various XPTI speedups Juergen Gross
  2018-04-06  7:52 ` [PATCH v5 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
  2018-04-06  7:52 ` [PATCH v5 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
@ 2018-04-06  7:52 ` Juergen Gross
  2018-04-06 13:16   ` Andrew Cooper
  2018-04-06  7:52 ` [PATCH v5 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2018-04-06  7:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

Instead of switching XPTI globally on or off, add a per-domain flag for
that purpose. This allows extending the xpti boot parameter to support
running dom0 without Meltdown mitigations: specifying "xpti=nodom0" as
boot parameter achieves that.

Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V3:
- latch get_cpu_info() return value in variable (Jan Beulich)
- call always xpti_domain_init() for pv dom0 (Jan Beulich)
- add __init annotations (Jan Beulich)
- drop per domain XPTI message (Jan Beulich)
- document xpti=default support (Jan Beulich)
- move domain xpti flag into a padding hole (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 10 ++++-
 xen/arch/x86/domctl.c               |  4 ++
 xen/arch/x86/mm.c                   | 12 +++++-
 xen/arch/x86/pv/dom0_build.c        |  3 ++
 xen/arch/x86/pv/domain.c            | 77 +++++++++++++++++++++++++++++++++++++
 xen/arch/x86/setup.c                | 20 +---------
 xen/arch/x86/smpboot.c              |  4 +-
 xen/arch/x86/x86_64/entry.S         |  2 +
 xen/include/asm-x86/current.h       |  3 +-
 xen/include/asm-x86/domain.h        |  3 ++
 xen/include/asm-x86/pv/domain.h     |  4 ++
 11 files changed, 118 insertions(+), 24 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index b353352adf..79be9a6ba5 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1955,7 +1955,7 @@ clustered mode.  The default, given no hint from the **FADT**, is cluster
 mode.
 
 ### xpti
-> `= <boolean>`
+> `= default | nodom0 | <boolean>`
 
 > Default: `false` on AMD hardware
 > Default: `true` everywhere else
@@ -1963,6 +1963,14 @@ mode.
 Override default selection of whether to isolate 64-bit PV guest page
 tables.
 
+`true` activates page table isolation even on AMD hardware.
+
+`false` deactivates page table isolation on all systems.
+
+`default` sets the default behaviour.
+
+`nodom0` deactivates page table isolation for dom0.
+
 ### xsave
 > `= <boolean>`
 
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 8fbbf3aeb3..0704f398c7 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -24,6 +24,7 @@
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/support.h>
 #include <asm/processor.h>
+#include <asm/pv/domain.h>
 #include <asm/acpi.h> /* for hvm_acpi_power_button */
 #include <xen/hypercall.h> /* for arch_do_domctl */
 #include <xsm/xsm.h>
@@ -610,6 +611,9 @@ long arch_do_domctl(
             ret = switch_compat(d);
         else
             ret = -EINVAL;
+
+        if ( ret == 0 )
+            xpti_domain_init(d);
         break;
 
     case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index cf2ccb07e6..131433af9b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -505,13 +505,21 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+    struct cpu_info *cpu_info = get_cpu_info();
+
+    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
-        get_cpu_info()->root_pgt_changed = true;
+        cpu_info->root_pgt_changed = true;
+        cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
         asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
     }
     else
+    {
+        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+        cpu_info->xen_cr3 = 0;
         write_cr3(v->arch.cr3);
+        cpu_info->pv_cr3 = 0;
+    }
 }
 
 /*
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 0bd2f1bf90..77186c19bd 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -19,6 +19,7 @@
 #include <asm/dom0_build.h>
 #include <asm/guest.h>
 #include <asm/page.h>
+#include <asm/pv/domain.h>
 #include <asm/pv/mm.h>
 #include <asm/setup.h>
 
@@ -707,6 +708,8 @@ int __init dom0_construct_pv(struct domain *d,
             cpu = p->processor;
     }
 
+    xpti_domain_init(d);
+
     d->arch.paging.mode = 0;
 
     /* Set up CR3 value for write_ptbase */
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 42522a2db3..2bef9c48bc 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -9,6 +9,8 @@
 #include <xen/lib.h>
 #include <xen/sched.h>
 
+#include <asm/cpufeature.h>
+#include <asm/msr-index.h>
 #include <asm/pv/domain.h>
 
 /* Override macros from asm/page.h to make them work with mfn_t */
@@ -17,6 +19,81 @@
 #undef page_to_mfn
 #define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
 
+static __read_mostly enum {
+    XPTI_DEFAULT,
+    XPTI_ON,
+    XPTI_OFF,
+    XPTI_NODOM0
+} opt_xpti = XPTI_DEFAULT;
+
+static __init int parse_xpti(const char *s)
+{
+    int rc = 0;
+
+    switch ( parse_bool(s, NULL) )
+    {
+    case 0:
+        opt_xpti = XPTI_OFF;
+        break;
+    case 1:
+        opt_xpti = XPTI_ON;
+        break;
+    default:
+        if ( !strcmp(s, "default") )
+            opt_xpti = XPTI_DEFAULT;
+        else if ( !strcmp(s, "nodom0") )
+            opt_xpti = XPTI_NODOM0;
+        else
+            rc = -EINVAL;
+        break;
+    }
+
+    return rc;
+}
+custom_param("xpti", parse_xpti);
+
+void __init xpti_init(void)
+{
+    uint64_t caps = 0;
+
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
+        caps = ARCH_CAPABILITIES_RDCL_NO;
+    else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( opt_xpti != XPTI_ON && (caps & ARCH_CAPABILITIES_RDCL_NO) )
+        opt_xpti = XPTI_OFF;
+    else if ( opt_xpti == XPTI_DEFAULT )
+        opt_xpti = XPTI_ON;
+
+    if ( opt_xpti == XPTI_OFF )
+        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
+    else
+        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
+}
+
+void xpti_domain_init(struct domain *d)
+{
+    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
+        return;
+
+    switch ( opt_xpti )
+    {
+    case XPTI_OFF:
+        break;
+    case XPTI_ON:
+        d->arch.pv_domain.xpti = true;
+        break;
+    case XPTI_NODOM0:
+        d->arch.pv_domain.xpti = d->domain_id != 0 &&
+                                 d->domain_id != hardware_domid;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
 static void noreturn continue_nonidle_domain(struct vcpu *v)
 {
     check_wakeup_from_wait();
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 02673d9512..e83d3b4f07 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -53,6 +53,7 @@
 #include <asm/cpuid.h>
 #include <asm/spec_ctrl.h>
 #include <asm/guest.h>
+#include <asm/pv/domain.h>
 
 /* opt_nosmp: If true, secondary processors are ignored. */
 static bool __initdata opt_nosmp;
@@ -169,9 +170,6 @@ static int __init parse_smap_param(const char *s)
 }
 custom_param("smap", parse_smap_param);
 
-static int8_t __initdata opt_xpti = -1;
-boolean_param("xpti", opt_xpti);
-
 bool __read_mostly acpi_disabled;
 bool __initdata acpi_force;
 static char __initdata acpi_param[10] = "";
@@ -1546,21 +1544,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( opt_xpti < 0 )
-    {
-        uint64_t caps = 0;
-
-        if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
-            caps = ARCH_CAPABILITIES_RDCL_NO;
-        else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
-            rdmsrl(MSR_ARCH_CAPABILITIES, caps);
-
-        opt_xpti = !(caps & ARCH_CAPABILITIES_RDCL_NO);
-    }
-    if ( opt_xpti )
-        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
-    else
-        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
+    xpti_init();
 
     if ( cpu_has_fsgsbase )
         set_in_cr4(X86_CR4_FSGSBASE);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index b0b72ca544..346a8e8a3f 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -331,7 +331,7 @@ void start_secondary(void *unused)
     spin_debug_disable();
 
     get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = this_cpu(root_pgt) ? __pa(this_cpu(root_pgt)) : 0;
+    get_cpu_info()->pv_cr3 = 0;
 
     load_system_tables();
 
@@ -1050,7 +1050,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
         panic("Error %d setting up PV root page table\n", rc);
     if ( per_cpu(root_pgt, 0) )
     {
-        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
+        get_cpu_info()->pv_cr3 = 0;
 
         /*
          * All entry points which may need to switch page tables have to start
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 30c9da5446..5026cfa490 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -129,6 +129,8 @@ restore_all_guest:
         mov   VCPU_cr3(%rbx), %r9
         GET_STACK_END(dx)
         mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax
+        test  %rax, %rax
+        jz    .Lrag_keep_cr3
         cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
         je    .Lrag_copy_done
         movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index f2491b4423..b2475783f8 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -44,7 +44,8 @@ struct cpu_info {
     /*
      * Of the two following fields the latter is being set to the CR3 value
      * to be used on the given pCPU for loading whenever 64-bit PV guest
-     * context is being entered. The value never changes once set.
+     * context is being entered. A value of zero indicates no setting of CR3
+     * is to be performed.
      * The former is the value to restore when re-entering Xen, if any. IOW
      * its value being zero means there's nothing to restore. However, its
      * value can also be negative, indicating to the exit-to-Xen code that
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index a12ae47f1b..ed4199931a 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -253,6 +253,9 @@ struct pv_domain
 
     atomic_t nr_l4_pages;
 
+    /* XPTI active? */
+    bool xpti;
+
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
 
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index 5e34176939..911e5dc07f 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -27,6 +27,8 @@ void pv_vcpu_destroy(struct vcpu *v);
 int pv_vcpu_initialise(struct vcpu *v);
 void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d);
+void xpti_init(void);
+void xpti_domain_init(struct domain *d);
 
 #else  /* !CONFIG_PV */
 
@@ -36,6 +38,8 @@ static inline void pv_vcpu_destroy(struct vcpu *v) {}
 static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
 static inline void pv_domain_destroy(struct domain *d) {}
 static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
+static inline void xpti_init(void) {}
+static inline void xpti_domain_init(struct domain *d) {}
 
 #endif	/* CONFIG_PV */
 
-- 
2.13.6



* [PATCH v5 4/7] xen/x86: use invpcid for flushing the TLB
  2018-04-06  7:52 [PATCH v5 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (2 preceding siblings ...)
  2018-04-06  7:52 ` [PATCH v5 3/7] xen/x86: support per-domain flag for xpti Juergen Gross
@ 2018-04-06  7:52 ` Juergen Gross
  2018-04-06 13:40   ` Andrew Cooper
  2018-04-06  7:52 ` [PATCH v5 5/7] xen/x86: disable global pages for domains with XPTI active Juergen Gross
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2018-04-06  7:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

If possible use the INVPCID instruction for flushing the TLB instead of
toggling cr4.pge for that purpose.

While at it remove the dependency of MTRR loading on cr4.pge being set,
as this removal will be required later anyway.

Add a command line option "invpcid" for controlling the use of
INVPCID (defaulting to true).

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V5:
- use pre_flush() as an initializer in do_tlb_flush() (Jan Beulich)
- introduce boolean use_invpcid instead of clearing X86_FEATURE_INVPCID
  (Jan Beulich)

V4:
- option "invpcid" instead of "noinvpcid" (Jan Beulich)

V3:
- new patch
---
 docs/misc/xen-command-line.markdown | 10 ++++++++++
 xen/arch/x86/cpu/mtrr/generic.c     | 37 ++++++++++++++++++++++++++-----------
 xen/arch/x86/flushtlb.c             | 29 +++++++++++++++++++----------
 xen/arch/x86/setup.c                |  8 ++++++++
 xen/include/asm-x86/invpcid.h       |  2 ++
 5 files changed, 65 insertions(+), 21 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 79be9a6ba5..5f6ae654ad 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1380,6 +1380,16 @@ Because responsibility for APIC setup is shared between Xen and the
 domain 0 kernel this option is automatically propagated to the domain
 0 command line.
 
+### invpcid (x86)
+> `= <boolean>`
+
+> Default: `true`
+
+Control using the INVPCID instruction for flushing TLB entries.
+This should only be used in case of known issues on the current platform
+with that instruction. Disabling INVPCID will normally result in a slightly
+degraded performance.
+
 ### noirqbalance
 > `= <boolean>`
 
diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c
index e9c0e5e059..705855e753 100644
--- a/xen/arch/x86/cpu/mtrr/generic.c
+++ b/xen/arch/x86/cpu/mtrr/generic.c
@@ -5,6 +5,7 @@
 #include <xen/mm.h>
 #include <xen/stdbool.h>
 #include <asm/flushtlb.h>
+#include <asm/invpcid.h>
 #include <asm/io.h>
 #include <asm/mtrr.h>
 #include <asm/msr.h>
@@ -400,8 +401,10 @@ static DEFINE_SPINLOCK(set_atomicity_lock);
  * has been called.
  */
 
-static void prepare_set(void)
+static bool prepare_set(void)
 {
+	unsigned long cr4;
+
 	/*  Note that this is not ideal, since the cache is only flushed/disabled
 	   for this CPU while the MTRRs are changed, but changing this requires
 	   more invasive changes to the way the kernel boots  */
@@ -412,18 +415,24 @@ static void prepare_set(void)
 	write_cr0(read_cr0() | X86_CR0_CD);
 	wbinvd();
 
-	/*  TLB flushing here relies on Xen always using CR4.PGE. */
-	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
-	write_cr4(read_cr4() & ~X86_CR4_PGE);
+	cr4 = read_cr4();
+	if (cr4 & X86_CR4_PGE)
+		write_cr4(cr4 & ~X86_CR4_PGE);
+	else if (use_invpcid)
+		invpcid_flush_all();
+	else
+		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
 	/*  Save MTRR state */
 	rdmsrl(MSR_MTRRdefType, deftype);
 
 	/*  Disable MTRRs, and set the default type to uncached  */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
+
+	return cr4 & X86_CR4_PGE;
 }
 
-static void post_set(void)
+static void post_set(bool pge)
 {
 	/* Intel (P6) standard MTRRs */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype);
@@ -432,7 +441,12 @@ static void post_set(void)
 	write_cr0(read_cr0() & ~X86_CR0_CD);
 
 	/*  Reenable CR4.PGE (also flushes the TLB) */
-	write_cr4(read_cr4() | X86_CR4_PGE);
+	if (pge)
+		write_cr4(read_cr4() | X86_CR4_PGE);
+	else if (use_invpcid)
+		invpcid_flush_all();
+	else
+		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
 	spin_unlock(&set_atomicity_lock);
 }
@@ -441,14 +455,15 @@ static void generic_set_all(void)
 {
 	unsigned long mask, count;
 	unsigned long flags;
+	bool pge;
 
 	local_irq_save(flags);
-	prepare_set();
+	pge = prepare_set();
 
 	/* Actually set the state */
 	mask = set_mtrr_state();
 
-	post_set();
+	post_set(pge);
 	local_irq_restore(flags);
 
 	/*  Use the atomic bitops to update the global mask  */
@@ -457,7 +472,6 @@ static void generic_set_all(void)
 			set_bit(count, &smp_changes_mask);
 		mask >>= 1;
 	}
-	
 }
 
 static void generic_set_mtrr(unsigned int reg, unsigned long base,
@@ -474,11 +488,12 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
 {
 	unsigned long flags;
 	struct mtrr_var_range *vr;
+	bool pge;
 
 	vr = &mtrr_state.var_ranges[reg];
 
 	local_irq_save(flags);
-	prepare_set();
+	pge = prepare_set();
 
 	if (size == 0) {
 		/* The invalid bit is kept in the mask, so we simply clear the
@@ -499,7 +514,7 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
 		mtrr_wrmsr(MSR_IA32_MTRR_PHYSMASK(reg), vr->mask);
 	}
 
-	post_set();
+	post_set(pge);
 	local_irq_restore(flags);
 }
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 38cedf3b22..f793b70696 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -10,6 +10,7 @@
 #include <xen/sched.h>
 #include <xen/softirq.h>
 #include <asm/flushtlb.h>
+#include <asm/invpcid.h>
 #include <asm/page.h>
 
 /* Debug builds: Wrap frequently to stress-test the wrap logic. */
@@ -71,6 +72,23 @@ static void post_flush(u32 t)
     this_cpu(tlbflush_time) = t;
 }
 
+static void do_tlb_flush(void)
+{
+    u32 t = pre_flush();
+
+    if ( use_invpcid )
+        invpcid_flush_all();
+    else
+    {
+        unsigned long cr4 = read_cr4();
+
+        write_cr4(cr4 ^ X86_CR4_PGE);
+        write_cr4(cr4);
+    }
+
+    post_flush(t);
+}
+
 void write_cr3(unsigned long cr3)
 {
     unsigned long flags, cr4;
@@ -118,16 +136,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
                            : : "m" (*(const char *)(va)) : "memory" );
         }
         else
-        {
-            u32 t = pre_flush();
-            unsigned long cr4 = read_cr4();
-
-            write_cr4(cr4 & ~X86_CR4_PGE);
-            barrier();
-            write_cr4(cr4);
-
-            post_flush(t);
-        }
+            do_tlb_flush();
     }
 
     if ( flags & FLUSH_CACHE )
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index e83d3b4f07..0a27402e4d 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -63,6 +63,11 @@ boolean_param("nosmp", opt_nosmp);
 static unsigned int __initdata max_cpus;
 integer_param("maxcpus", max_cpus);
 
+/* opt_invpcid: If false, don't use INVPCID instruction even if available. */
+static bool __initdata opt_invpcid = true;
+boolean_param("invpcid", opt_invpcid);
+bool use_invpcid;
+
 unsigned long __read_mostly cr4_pv32_mask;
 
 /* **** Linux config option: propagated to domain0. */
@@ -1549,6 +1554,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( cpu_has_fsgsbase )
         set_in_cr4(X86_CR4_FSGSBASE);
 
+    if ( opt_invpcid && cpu_has_invpcid )
+        use_invpcid = true;
+
     init_speculation_mitigations();
 
     init_idle_domain();
diff --git a/xen/include/asm-x86/invpcid.h b/xen/include/asm-x86/invpcid.h
index b46624a865..edd8b68706 100644
--- a/xen/include/asm-x86/invpcid.h
+++ b/xen/include/asm-x86/invpcid.h
@@ -3,6 +3,8 @@
 
 #include <xen/types.h>
 
+extern bool use_invpcid;
+
 #define INVPCID_TYPE_INDIV_ADDR      0
 #define INVPCID_TYPE_SINGLE_CTXT     1
 #define INVPCID_TYPE_ALL_INCL_GLOBAL 2
-- 
2.13.6



* [PATCH v5 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-04-06  7:52 [PATCH v5 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (3 preceding siblings ...)
  2018-04-06  7:52 ` [PATCH v5 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
@ 2018-04-06  7:52 ` Juergen Gross
  2018-04-06 15:17   ` Andrew Cooper
  2018-04-06  7:52 ` [PATCH v5 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
  2018-04-06  7:52 ` [PATCH v5 7/7] xen/x86: use PCID feature Juergen Gross
  6 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2018-04-06  7:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

Instead of flushing global pages from the TLB when switching address
spaces with XPTI active, just disable global pages via %cr4 completely
when a domain subject to XPTI is active. This avoids the need for extra
TLB flushes, as loading %cr3 will then remove all TLB entries.

In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being enabled while cr3 already specifies an XPTI
address space), move loading of the new cr4 value to write_ptbase()
(actually to write_cr3_cr4(), called by write_ptbase()).

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V4:
- don't use mmu_cr4_features for setting new cr4 value (Jan Beulich)
- use simpler scheme for setting X86_CR4_PGE in
  pv_guest_cr4_to_real_cr4() (Jan Beulich)

V3:
- move cr4 loading for all domains from *_ctxt_switch_to() to
  write_cr3_cr4() called by write_ptbase() (Jan Beulich)
- rebase
---
 xen/arch/x86/domain.c          |  5 -----
 xen/arch/x86/flushtlb.c        | 13 ++++++++-----
 xen/arch/x86/mm.c              | 14 +++++++++++---
 xen/arch/x86/x86_64/entry.S    | 10 ----------
 xen/common/efi/runtime.c       |  4 ++--
 xen/include/asm-x86/domain.h   |  3 ++-
 xen/include/asm-x86/flushtlb.h |  2 +-
 7 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 9c229594f4..c2bb70c483 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1522,17 +1522,12 @@ void paravirt_ctxt_switch_from(struct vcpu *v)
 void paravirt_ctxt_switch_to(struct vcpu *v)
 {
     root_pgentry_t *root_pgt = this_cpu(root_pgt);
-    unsigned long cr4;
 
     if ( root_pgt )
         root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
             l4e_from_page(v->domain->arch.perdomain_l3_pg,
                           __PAGE_HYPERVISOR_RW);
 
-    cr4 = pv_guest_cr4_to_real_cr4(v);
-    if ( unlikely(cr4 != read_cr4()) )
-        write_cr4(cr4);
-
     if ( unlikely(v->arch.debugreg[7] & DR7_ACTIVE_MASK) )
         activate_debugregs(v);
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index f793b70696..5dcd9a2bf6 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -89,20 +89,23 @@ static void do_tlb_flush(void)
     post_flush(t);
 }
 
-void write_cr3(unsigned long cr3)
+void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
 {
-    unsigned long flags, cr4;
+    unsigned long flags;
     u32 t;
 
     /* This non-reentrant function is sometimes called in interrupt context. */
     local_irq_save(flags);
 
     t = pre_flush();
-    cr4 = read_cr4();
 
-    write_cr4(cr4 & ~X86_CR4_PGE);
+    if ( read_cr4() & X86_CR4_PGE )
+        write_cr4(cr4 & ~X86_CR4_PGE);
+
     asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
-    write_cr4(cr4);
+
+    if ( read_cr4() != cr4 )
+        write_cr4(cr4);
 
     post_flush(t);
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 131433af9b..55c437751f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -506,20 +506,28 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 void write_ptbase(struct vcpu *v)
 {
     struct cpu_info *cpu_info = get_cpu_info();
+    unsigned long new_cr4;
+
+    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
+              ? pv_guest_cr4_to_real_cr4(v)
+              : ((read_cr4() & ~X86_CR4_TSD) | X86_CR4_PGE);
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
-        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+        write_cr3_cr4(v->arch.cr3, new_cr4);
     }
     else
     {
-        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+        /* Make sure to clear xen_cr3 before pv_cr3. */
         cpu_info->xen_cr3 = 0;
-        write_cr3(v->arch.cr3);
+        /* write_cr3_cr4() serializes. */
+        write_cr3_cr4(v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
+
+    ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
 }
 
 /*
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 5026cfa490..6298ed65cc 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -154,13 +154,8 @@ restore_all_guest:
                 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
         rep movsq
 .Lrag_copy_done:
-        mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
-        mov   %rdi, %rsi
-        and   $~X86_CR4_PGE, %rdi
-        mov   %rdi, %cr4
         mov   %rax, %cr3
-        mov   %rsi, %cr4
 .Lrag_keep_cr3:
 
         /* Restore stashed SPEC_CTRL value. */
@@ -216,12 +211,7 @@ restore_all_xen:
          * so "g" will have to do.
          */
 UNLIKELY_START(g, exit_cr3)
-        mov   %cr4, %rdi
-        mov   %rdi, %rsi
-        and   $~X86_CR4_PGE, %rdi
-        mov   %rdi, %cr4
         mov   %rax, %cr3
-        mov   %rsi, %cr4
 UNLIKELY_END(exit_cr3)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 3dbc2e8ee5..fad8ca9e95 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -111,7 +111,7 @@ struct efi_rs_state efi_rs_enter(void)
         lgdt(&gdt_desc);
     }
 
-    write_cr3(virt_to_maddr(efi_l4_pgtable));
+    write_cr3_cr4(virt_to_maddr(efi_l4_pgtable), read_cr4());
 
     return state;
 }
@@ -120,7 +120,7 @@ void efi_rs_leave(struct efi_rs_state *state)
 {
     if ( !state->cr3 )
         return;
-    write_cr3(state->cr3);
+    write_cr3_cr4(state->cr3, read_cr4());
     if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
     {
         struct desc_ptr gdt_desc = {
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index ed4199931a..b7894dc8c8 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -618,9 +618,10 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
 #define pv_guest_cr4_to_real_cr4(v)                         \
     (((v)->arch.pv_vcpu.ctrlreg[4]                          \
       | (mmu_cr4_features                                   \
-         & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |      \
+         & (X86_CR4_PSE | X86_CR4_SMEP |                    \
             X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
             X86_CR4_FSGSBASE))                              \
+      | ((v)->domain->arch.pv_domain.xpti ? 0 : X86_CR4_PGE) \
       | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
      & ~X86_CR4_DE)
 #define real_cr4_to_pv_guest_cr4(c)                         \
diff --git a/xen/include/asm-x86/flushtlb.h b/xen/include/asm-x86/flushtlb.h
index 052f0fa403..1eb9682de4 100644
--- a/xen/include/asm-x86/flushtlb.h
+++ b/xen/include/asm-x86/flushtlb.h
@@ -84,7 +84,7 @@ static inline unsigned long read_cr3(void)
 }
 
 /* Write pagetable base and implicitly tick the tlbflush clock. */
-void write_cr3(unsigned long cr3);
+void write_cr3_cr4(unsigned long cr3, unsigned long cr4);
 
 /* flush_* flag fields: */
  /*
-- 
2.13.6



* [PATCH v5 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid
  2018-04-06  7:52 [PATCH v5 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (4 preceding siblings ...)
  2018-04-06  7:52 ` [PATCH v5 5/7] xen/x86: disable global pages for domains with XPTI active Juergen Gross
@ 2018-04-06  7:52 ` Juergen Gross
  2018-04-06  7:52 ` [PATCH v5 7/7] xen/x86: use PCID feature Juergen Gross
  6 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-04-06  7:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

Today cpu_info->xen_cr3 is either 0, indicating %cr3 doesn't need to be
switched on entry to Xen; negative, for keeping the value while
indicating not to restore %cr3; or positive, in case %cr3 is to be
restored.

Switch to using a flag byte instead of a negative xen_cr3 value in
order to allow %cr3 values with the high bit set, which is needed to
keep TLB entries when using the PCID feature.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V3:
- renamed use_xen_cr3 to better fitting use_pv_cr3
- corrected comment regarding semantics of use_pv_cr3 (Jan Beulich)
- prefer 32-bit operations over 8- or 16-bit ones (Jan Beulich)
---
 xen/arch/x86/domain.c              |  1 +
 xen/arch/x86/mm.c                  |  3 +-
 xen/arch/x86/smpboot.c             |  2 ++
 xen/arch/x86/x86_64/asm-offsets.c  |  1 +
 xen/arch/x86/x86_64/compat/entry.S |  5 ++--
 xen/arch/x86/x86_64/entry.S        | 59 ++++++++++++++++----------------------
 xen/include/asm-x86/current.h      | 12 +++++---
 7 files changed, 41 insertions(+), 42 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index c2bb70c483..60af9a5906 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1695,6 +1695,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
     ASSERT(local_irq_is_enabled());
 
+    get_cpu_info()->use_pv_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 55c437751f..03aa44be76 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -520,7 +520,8 @@ void write_ptbase(struct vcpu *v)
     }
     else
     {
-        /* Make sure to clear xen_cr3 before pv_cr3. */
+        /* Make sure to clear use_pv_cr3 and xen_cr3 before pv_cr3. */
+        cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
         /* write_cr3_cr4() serializes. */
         write_cr3_cr4(v->arch.cr3, new_cr4);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 346a8e8a3f..05109a98fa 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -330,6 +330,7 @@ void start_secondary(void *unused)
      */
     spin_debug_disable();
 
+    get_cpu_info()->use_pv_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
     get_cpu_info()->pv_cr3 = 0;
 
@@ -1129,6 +1130,7 @@ void __init smp_prepare_boot_cpu(void)
     per_cpu(scratch_cpumask, cpu) = &scratch_cpu0mask;
 #endif
 
+    get_cpu_info()->use_pv_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
     get_cpu_info()->pv_cr3 = 0;
 }
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 9e2aefb00f..7ad024cf37 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -144,6 +144,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
     OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
+    OFFSET(CPUINFO_use_pv_cr3, struct cpu_info, use_pv_cr3);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 6c7fcf95b3..c9ddf35ee5 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -210,10 +210,9 @@ ENTRY(cstar_enter)
 
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lcstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
 .Lcstar_cr3_okay:
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 6298ed65cc..38b1bfc7b5 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -155,6 +155,7 @@ restore_all_guest:
         rep movsq
 .Lrag_copy_done:
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
+        movb  $1, STACK_CPUINFO_FIELD(use_pv_cr3)(%rdx)
         mov   %rax, %cr3
 .Lrag_keep_cr3:
 
@@ -203,14 +204,9 @@ restore_all_xen:
          * case we return to late PV exit code (from an NMI or #MC).
          */
         GET_STACK_END(bx)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rdx
+        cmpb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
+UNLIKELY_START(ne, exit_cr3)
         mov   STACK_CPUINFO_FIELD(pv_cr3)(%rbx), %rax
-        test  %rdx, %rdx
-        /*
-         * Ideally the condition would be "nsz", but such doesn't exist,
-         * so "g" will have to do.
-         */
-UNLIKELY_START(g, exit_cr3)
         mov   %rax, %cr3
 UNLIKELY_END(exit_cr3)
 
@@ -252,10 +248,9 @@ ENTRY(lstar_enter)
 
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Llstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
 .Llstar_cr3_okay:
@@ -289,10 +284,9 @@ GLOBAL(sysenter_eflags_saved)
         /* PUSHF above has saved EFLAGS.IF clear (the caller had it set). */
         orl   $X86_EFLAGS_IF, UREGS_eflags(%rsp)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lsyse_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
 .Lsyse_cr3_okay:
@@ -339,10 +333,9 @@ ENTRY(int80_direct_trap)
 
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lint80_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
 .Lint80_cr3_okay:
@@ -547,24 +540,24 @@ ENTRY(common_interrupt)
         /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
 
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_pv_cr3)(%r14), %bl
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lintr_cr3_okay
-        jns   .Lintr_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lintr_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         mov   %rcx, %cr3
         xor   %ecx, %ecx
         mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3, UREGS_cs(%rsp)
         cmovnz %rcx, %r15
+        cmovnz %rcx, %rbx
 .Lintr_cr3_okay:
 
         CR4_PV32_RESTORE
         movq %rsp,%rdi
         callq do_IRQ
         mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        mov   %bl, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         jmp ret_from_intr
 
 ENTRY(page_fault)
@@ -579,18 +572,17 @@ GLOBAL(handle_exception)
         /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
 
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_pv_cr3)(%r14), %r13b
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lxcpt_cr3_okay
-        jns   .Lxcpt_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lxcpt_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         mov   %rcx, %cr3
         xor   %ecx, %ecx
         mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3, UREGS_cs(%rsp)
         cmovnz %rcx, %r15
+        cmovnz %rcx, %r13
 .Lxcpt_cr3_okay:
 
 handle_exception_saved:
@@ -645,6 +637,7 @@ handle_exception_saved:
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
         mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        mov   %r13b, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         testb $3,UREGS_cs(%rsp)
         jz    restore_all_xen
         leaq  VCPU_trap_bounce(%rbx),%rdx
@@ -678,6 +671,7 @@ exception_with_ints_disabled:
 1:      movq  UREGS_error_code(%rsp),%rax # ec/ev
         movq  %rax,UREGS_kernel_sizeof(%rsp)
         mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        mov   %r13b, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         jmp   restore_all_xen           # return to fixup code
 
 /* No special register assumptions. */
@@ -765,9 +759,6 @@ ENTRY(double_fault)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx
         test  %rbx, %rbx
         jz    .Ldblf_cr3_okay
-        jns   .Ldblf_cr3_load
-        neg   %rbx
-.Ldblf_cr3_load:
         mov   %rbx, %cr3
 .Ldblf_cr3_okay:
 
@@ -796,13 +787,11 @@ handle_ist_exception:
         /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
 
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_pv_cr3)(%r14), %bl
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .List_cr3_okay
-        jns   .List_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.List_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         mov   %rcx, %cr3
         movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
 .List_cr3_okay:
@@ -815,6 +804,7 @@ handle_ist_exception:
          * and copy the context to stack bottom.
          */
         xor   %r15, %r15
+        xor   %ebx, %ebx
         GET_CPUINFO_FIELD(guest_cpu_user_regs,di)
         movq  %rsp,%rsi
         movl  $UREGS_kernel_sizeof/8,%ecx
@@ -826,6 +816,7 @@ handle_ist_exception:
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
         mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        mov   %bl, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         cmpb  $TRAP_nmi,UREGS_entry_vector(%rsp)
         jne   ret_from_intr
 
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index b2475783f8..43bdec1f49 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -47,10 +47,7 @@ struct cpu_info {
      * context is being entered. A value of zero indicates no setting of CR3
      * is to be performed.
      * The former is the value to restore when re-entering Xen, if any. IOW
-     * its value being zero means there's nothing to restore. However, its
-     * value can also be negative, indicating to the exit-to-Xen code that
-     * restoring is not necessary, but allowing any nested entry code paths
-     * to still know the value to put back into CR3.
+     * its value being zero means there's nothing to restore.
      */
     unsigned long xen_cr3;
     unsigned long pv_cr3;
@@ -68,6 +65,13 @@ struct cpu_info {
      */
     bool         root_pgt_changed;
 
+    /*
+     * use_pv_cr3 is set in case the value of pv_cr3 is to be written into
+     * CR3 when returning from an interrupt. The main use is when returning
+     * from an NMI or MCE to hypervisor code where pv_cr3 was active.
+     */
+    bool         use_pv_cr3;
+
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 7/7] xen/x86: use PCID feature
  2018-04-06  7:52 [PATCH v5 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (5 preceding siblings ...)
  2018-04-06  7:52 ` [PATCH v5 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
@ 2018-04-06  7:52 ` Juergen Gross
  2018-04-06 17:30   ` Andrew Cooper
  2018-04-09 14:15   ` Jan Beulich
  6 siblings, 2 replies; 22+ messages in thread
From: Juergen Gross @ 2018-04-06  7:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich

Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.

We use 4 PCID values for a 64-bit pv domain subject to XPTI and 2 values
for the non-XPTI case:

- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)
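
As a rough sketch of how these values end up in %cr3 (illustrative only;
the helper name below is made up, while X86_CR3_ADDR_MASK and
get_pcid_bits() are introduced further down in this patch):

    /*
     * Illustrative sketch: combine the root page table address with the
     * PCID bits and the no-flush bit for the XPTI shadow page table.
     * This mirrors what write_ptbase() does for cpu_info->pv_cr3.
     */
    static unsigned long example_xpti_cr3(unsigned long root_maddr,
                                          const struct vcpu *v)
    {
        /* get_pcid_bits() returns X86_CR3_NOFLUSH | <PCID in bits 0-11>. */
        return (root_maddr & X86_CR3_ADDR_MASK) | get_pcid_bits(v, true);
    }

Which of the four PCID values is used is decided by get_pcid_bits() from
TF_kernel_mode and from whether the per-cpu shadow root page table or the
guest's own L4 is being loaded.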

We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.

As the non-XPTI case has been shown not to perform better with PCID on at
least some machines, the default is to use PCID only for domains subject
to XPTI.
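
For example (boot entries are illustrative only; the values are the ones
accepted by the new "pcid" parameter):

    pcid=xpti      # default: use PCID only for domains subject to XPTI
    pcid=true      # use PCID for all 64-bit pv domains
    pcid=noxpti    # use PCID only for domains not subject to XPTI
    pcid=false     # never use PCID

As the parameter is a runtime one it can also be adjusted on a running
system via the toolstack's runtime parameter interface (e.g. "xl
set-parameters pcid=xpti", assuming that interface is available).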

With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.

pv_guest_cr4_to_real_cr4() is switched from a macro to a real function
now, as it has become more complex.

Performance for the XPTI case improves a lot: a parallel make of the Xen
hypervisor on my machine reduced elapsed time from 107 to 93 seconds and
system time from 140 to 103 seconds.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V5:
- use X86_CR3_ADDR_MASK instead of ~X86_CR3_PCID_MASK (Jan Beulich)
- add some const qualifiers (Jan Beulich)
- mask X86_CR3_ADDR_MASK with PADDR_MASK (Jan Beulich)
- add flushing the TLB from old PCID related entries in write_cr3_cr4()
  (Jan Beulich)

V4:
- add cr3 mask for page table address and use that in dbg_pv_va2mfn()
  (Jan Beulich)
- use invpcid_flush_all_nonglobals() instead of invpcid_flush_all()
  (Jan Beulich)
- use PCIDs 0/1 when running in Xen or without XPTI, 2/3 with XPTI in
  guest (Jan Beulich)
- ASSERT cr4.pge and cr4.pcide are never active at the same time
  (Jan Beulich)
- make pv_guest_cr4_to_real_cr4() a real function

V3:
- support PCID for non-XPTI case, too
- add command line parameter for controlling usage of PCID
- check PCID active by using cr4.pcide (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 12 ++++++
 xen/arch/x86/debug.c                |  2 +-
 xen/arch/x86/domain_page.c          |  2 +-
 xen/arch/x86/domctl.c               |  4 ++
 xen/arch/x86/flushtlb.c             | 54 ++++++++++++++++++++++++--
 xen/arch/x86/mm.c                   | 24 +++++++++++-
 xen/arch/x86/pv/dom0_build.c        |  1 +
 xen/arch/x86/pv/domain.c            | 77 ++++++++++++++++++++++++++++++++++++-
 xen/include/asm-x86/domain.h        | 15 +++-----
 xen/include/asm-x86/processor.h     |  3 ++
 xen/include/asm-x86/pv/domain.h     | 20 ++++++++++
 xen/include/asm-x86/x86-defns.h     |  4 +-
 12 files changed, 199 insertions(+), 19 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 5f6ae654ad..db87fd326d 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1452,6 +1452,18 @@ All numbers specified must be hexadecimal ones.
 
 This option can be specified more than once (up to 8 times at present).
 
+### pcid (x86)
+> `= <boolean> | xpti | noxpti`
+
+> Default: `xpti`
+
+> Can be modified at runtime
+
+If available, control usage of the PCID feature of the processor for
+64-bit pv-domains. PCID can be used either for no domain at all (`false`),
+for all of them (`true`), only for those subject to XPTI (`xpti`) or for
+those not subject to XPTI (`noxpti`).
+
 ### ple\_gap
 > `= <integer>`
 
diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index 9159f32db4..0d46f2f45a 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -97,7 +97,7 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t pgd3val)
     l3_pgentry_t l3e, *l3t;
     l2_pgentry_t l2e, *l2t;
     l1_pgentry_t l1e, *l1t;
-    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
+    unsigned long cr3 = pgd3val ?: (dp->vcpu[0]->arch.cr3 & X86_CR3_ADDR_MASK);
     mfn_t mfn = maddr_to_mfn(cr3);
 
     DBGP2("vaddr:%lx domid:%d cr3:%lx pgd3:%lx\n", vaddr, dp->domain_id, 
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index b5780f201f..b5af0e639a 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
             sync_local_execstate();
         /* We must now be running on the idle page table. */
-        ASSERT(read_cr3() == __pa(idle_pg_table));
+        ASSERT((read_cr3() & X86_CR3_ADDR_MASK) == __pa(idle_pg_table));
     }
 
     return v;
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 0704f398c7..a7c8772fa6 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -613,7 +613,11 @@ long arch_do_domctl(
             ret = -EINVAL;
 
         if ( ret == 0 )
+        {
             xpti_domain_init(d);
+            pcid_domain_init(d);
+        }
+
         break;
 
     case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 5dcd9a2bf6..6c7d57b7aa 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -12,6 +12,7 @@
 #include <asm/flushtlb.h>
 #include <asm/invpcid.h>
 #include <asm/page.h>
+#include <asm/pv/domain.h>
 
 /* Debug builds: Wrap frequently to stress-test the wrap logic. */
 #ifdef NDEBUG
@@ -93,6 +94,7 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
 {
     unsigned long flags;
     u32 t;
+    unsigned long old_pcid = read_cr3() & X86_CR3_PCID_MASK;
 
     /* This non-reentrant function is sometimes called in interrupt context. */
     local_irq_save(flags);
@@ -100,12 +102,35 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
     t = pre_flush();
 
     if ( read_cr4() & X86_CR4_PGE )
+        /*
+         * X86_CR4_PGE set means PCID is inactive.
+         * We have to purge the TLB via flipping cr4.pge.
+         */
         write_cr4(cr4 & ~X86_CR4_PGE);
+    else if ( use_invpcid )
+        /*
+         * If we are using PCID purge the TLB via INVPCID as loading cr3
+         * will affect the new PCID only.
+         * If INVPCID is not supported we don't use PCIDs so loading cr3
+         * will purge the TLB (we are in the "global pages off" branch).
+         * invpcid_flush_all_nonglobals() seems to be faster than
+         * invpcid_flush_all().
+         */
+        invpcid_flush_all_nonglobals();
 
     asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
 
     if ( read_cr4() != cr4 )
         write_cr4(cr4);
+    else if ( old_pcid != (cr3 & X86_CR3_PCID_MASK) )
+        /*
+         * Make sure no TLB entries related to the old PCID created between
+         * flushing the TLB and writing the new %cr3 value remain in the TLB.
+         * Writing %cr3 is documented to be a speculation barrier, OTOH the
+         * performance impact of the additional flush is next to invisible.
+         * So better be safe than sorry.
+         */
+        invpcid_flush_single_context(old_pcid);
 
     post_flush(t);
 
@@ -132,11 +157,32 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
             /*
              * We don't INVLPG multi-page regions because the 2M/4M/1G
              * region may not have been mapped with a superpage. Also there
-             * are various errata surrounding INVLPG usage on superpages, and
-             * a full flush is in any case not *that* expensive.
+             * are various errata surrounding INVLPG usage on superpages,
+             * and a full flush is in any case not *that* expensive.
              */
-            asm volatile ( "invlpg %0"
-                           : : "m" (*(const char *)(va)) : "memory" );
+            if ( read_cr4() & X86_CR4_PCIDE )
+            {
+                unsigned long addr = (unsigned long)va;
+
+                /*
+                 * Flush the addresses for all potential address spaces.
+                 * We can't check the current domain for being subject to
+                 * XPTI as current might be the idle vcpu while we still have
+                 * some XPTI domain TLB entries.
+                 * Using invpcid is okay here, as with PCID enabled we always
+                 * have global pages disabled.
+                 */
+                invpcid_flush_one(PCID_PV_PRIV, addr);
+                invpcid_flush_one(PCID_PV_USER, addr);
+                if ( !cpu_has_no_xpti )
+                {
+                    invpcid_flush_one(PCID_PV_PRIV | PCID_PV_XPTI, addr);
+                    invpcid_flush_one(PCID_PV_USER | PCID_PV_XPTI, addr);
+                }
+            }
+            else
+                asm volatile ( "invlpg %0"
+                               : : "m" (*(const char *)(va)) : "memory" );
         }
         else
             do_tlb_flush();
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 03aa44be76..d31ada0dc9 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -127,6 +127,7 @@
 #include <asm/processor.h>
 
 #include <asm/hvm/grant_table.h>
+#include <asm/pv/domain.h>
 #include <asm/pv/grant_table.h>
 #include <asm/pv/mm.h>
 
@@ -500,7 +501,26 @@ void free_shared_domheap_page(struct page_info *page)
 
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
+    struct domain *d = v->domain;
+
     v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
+    if ( is_pv_domain(d) && d->arch.pv_domain.pcid )
+        v->arch.cr3 |= get_pcid_bits(v, false);
+}
+
+unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    unsigned long cr4;
+
+    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
+    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
+                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
+    cr4 |= (d->arch.pv_domain.xpti || d->arch.pv_domain.pcid) ? 0 : X86_CR4_PGE;
+    cr4 |= d->arch.pv_domain.pcid ? X86_CR4_PCIDE : 0;
+    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
+
+    return cr4;
 }
 
 void write_ptbase(struct vcpu *v)
@@ -510,12 +530,14 @@ void write_ptbase(struct vcpu *v)
 
     new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
               ? pv_guest_cr4_to_real_cr4(v)
-              : ((read_cr4() & ~X86_CR4_TSD) | X86_CR4_PGE);
+              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
+        if ( new_cr4 & X86_CR4_PCIDE )
+            cpu_info->pv_cr3 |= get_pcid_bits(v, true);
         write_cr3_cr4(v->arch.cr3, new_cr4);
     }
     else
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 77186c19bd..2af0094e95 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -709,6 +709,7 @@ int __init dom0_construct_pv(struct domain *d,
     }
 
     xpti_domain_init(d);
+    pcid_domain_init(d);
 
     d->arch.paging.mode = 0;
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2bef9c48bc..bfaab414e1 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -10,6 +10,7 @@
 #include <xen/sched.h>
 
 #include <asm/cpufeature.h>
+#include <asm/invpcid.h>
 #include <asm/msr-index.h>
 #include <asm/pv/domain.h>
 
@@ -94,6 +95,70 @@ void xpti_domain_init(struct domain *d)
     }
 }
 
+static __read_mostly enum {
+    PCID_OFF,
+    PCID_ALL,
+    PCID_XPTI,
+    PCID_NOXPTI
+} opt_pcid = PCID_XPTI;
+
+static __init int parse_pcid(const char *s)
+{
+    int rc = 0;
+
+    switch ( parse_bool(s, NULL) )
+    {
+    case 0:
+        opt_pcid = PCID_OFF;
+        break;
+    case 1:
+        opt_pcid = PCID_ALL;
+        break;
+    default:
+        switch ( parse_boolean("xpti", s, NULL) )
+        {
+        case 0:
+            opt_pcid = PCID_NOXPTI;
+            break;
+        case 1:
+            opt_pcid = PCID_XPTI;
+            break;
+        default:
+            rc = -EINVAL;
+            break;
+        }
+        break;
+    }
+
+    return rc;
+}
+custom_runtime_param("pcid", parse_pcid);
+
+void pcid_domain_init(struct domain *d)
+{
+    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) ||
+         !use_invpcid || !cpu_has_pcid )
+        return;
+
+    switch ( opt_pcid )
+    {
+    case PCID_OFF:
+        break;
+    case PCID_ALL:
+        d->arch.pv_domain.pcid = true;
+        break;
+    case PCID_XPTI:
+        d->arch.pv_domain.pcid = d->arch.pv_domain.xpti;
+        break;
+    case PCID_NOXPTI:
+        d->arch.pv_domain.pcid = !d->arch.pv_domain.xpti;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
 static void noreturn continue_nonidle_domain(struct vcpu *v)
 {
     check_wakeup_from_wait();
@@ -298,9 +363,19 @@ int pv_domain_initialise(struct domain *d)
 
 static void _toggle_guest_pt(struct vcpu *v)
 {
+    const struct domain *d = v->domain;
+
     v->arch.flags ^= TF_kernel_mode;
     update_cr3(v);
-    get_cpu_info()->root_pgt_changed = true;
+    if ( d->arch.pv_domain.xpti )
+    {
+        struct cpu_info *cpu_info = get_cpu_info();
+
+        cpu_info->root_pgt_changed = true;
+        cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
+                           (d->arch.pv_domain.pcid
+                            ? get_pcid_bits(v, true) : 0);
+    }
 
     /* Don't flush user global mappings from the TLB. Don't tick TLB clock. */
     asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index b7894dc8c8..8b66096e7f 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -255,6 +255,8 @@ struct pv_domain
 
     /* XPTI active? */
     bool xpti;
+    /* Use PCID feature? */
+    bool pcid;
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
@@ -615,19 +617,12 @@ void vcpu_show_registers(const struct vcpu *);
 unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
 
 /* Convert between guest-visible and real CR4 values. */
-#define pv_guest_cr4_to_real_cr4(v)                         \
-    (((v)->arch.pv_vcpu.ctrlreg[4]                          \
-      | (mmu_cr4_features                                   \
-         & (X86_CR4_PSE | X86_CR4_SMEP |                    \
-            X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
-            X86_CR4_FSGSBASE))                              \
-      | ((v)->domain->arch.pv_domain.xpti ? 0 : X86_CR4_PGE) \
-      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
-     & ~X86_CR4_DE)
+unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
+
 #define real_cr4_to_pv_guest_cr4(c)                         \
     ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
              X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-             X86_CR4_FSGSBASE | X86_CR4_SMAP))
+             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
 
 #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
 
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index db9988ab33..3067a8c58f 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -290,6 +290,9 @@ static inline unsigned long read_cr4(void)
 
 static inline void write_cr4(unsigned long val)
 {
+    /* No global pages in case of PCIDs enabled! */
+    ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
+
     get_cpu_info()->cr4 = val;
     asm volatile ( "mov %0,%%cr4" : : "r" (val) );
 }
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index 911e5dc07f..3c8c8f4ccc 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -21,6 +21,24 @@
 #ifndef __X86_PV_DOMAIN_H__
 #define __X86_PV_DOMAIN_H__
 
+/* PCID values for the address spaces of 64-bit pv domains: */
+#define PCID_PV_PRIV      0x0000    /* Used for other domains, too. */
+#define PCID_PV_USER      0x0001
+#define PCID_PV_XPTI      0x0002    /* To be ORed to above values. */
+
+/*
+ * Return additional PCID specific cr3 bits.
+ *
+ * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
+ * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCID_MASK in case
+ * the value is used to address the root page table.
+ */
+static inline unsigned long get_pcid_bits(const struct vcpu *v, bool is_xpti)
+{
+    return X86_CR3_NOFLUSH | (is_xpti ? PCID_PV_XPTI : 0) |
+           ((v->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);
+}
+
 #ifdef CONFIG_PV
 
 void pv_vcpu_destroy(struct vcpu *v);
@@ -29,6 +47,7 @@ void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d);
 void xpti_init(void);
 void xpti_domain_init(struct domain *d);
+void pcid_domain_init(struct domain *d);
 
 #else  /* !CONFIG_PV */
 
@@ -40,6 +59,7 @@ static inline void pv_domain_destroy(struct domain *d) {}
 static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
 static inline void xpti_init(void) {}
 static inline void xpti_domain_init(struct domain *d) {}
+static inline void pcid_domain_init(struct domain *d) {}
 
 #endif	/* CONFIG_PV */
 
diff --git a/xen/include/asm-x86/x86-defns.h b/xen/include/asm-x86/x86-defns.h
index ff8d66be3c..123560897e 100644
--- a/xen/include/asm-x86/x86-defns.h
+++ b/xen/include/asm-x86/x86-defns.h
@@ -45,7 +45,9 @@
 /*
  * Intel CPU flags in CR3
  */
-#define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
+#define X86_CR3_NOFLUSH    (_AC(1, ULL) << 63)
+#define X86_CR3_ADDR_MASK  (PAGE_MASK & PADDR_MASK & ~X86_CR3_NOFLUSH)
+#define X86_CR3_PCID_MASK  _AC(0x0fff, ULL) /* Mask for PCID */
 
 /*
  * Intel CPU features in CR4
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/7] x86/xpti: avoid copying L4 page table contents when possible
  2018-04-06  7:52 ` [PATCH v5 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
@ 2018-04-06 11:09   ` Andrew Cooper
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2018-04-06 11:09 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: jbeulich

On 06/04/18 08:52, Juergen Gross wrote:
> For mitigation of Meltdown the current L4 page table is copied to the
> cpu local root page table each time a 64 bit pv guest is entered.
>
> Copying can be avoided in cases where the guest L4 page table hasn't
> been modified while running the hypervisor, e.g. when handling
> interrupts or any hypercall not modifying the L4 page table or %cr3.
>
> So add a per-cpu flag whether the copying should be performed and set

"flag indicating whether"

> that flag only when loading a new %cr3 or modifying the L4 page table.
> This includes synchronization of the cpu local root page table with
> other cpus, so add a special synchronization flag for that case.
>
> A simple performance check (compiling the hypervisor via "make -j 4")
> in dom0 with 4 vcpus shows a significant improvement:
>
> - real time drops from 113 seconds to 109 seconds
> - system time drops from 165 seconds to 155 seconds
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
> V4:
> - move setting of root_pgt_changed flag in flush_area_local() out of
>   irq disabled section (Jan Beulich)
> - move setting of root_pgt_changed in make_cr3() to _toggle_guest_pt()
>   (Jan Beulich)
> - remove most conditionals in write_ptbase() (Jan Beulich)
> - don't set root_pgt_changed in do_mmu_update() for modification of
>   the user page table (Jan Beulich)
>
> V3:
> - set flag locally only if affected L4 is active (Jan Beulich)
> - add setting flag to flush_area_mask() (Jan Beulich)
> - set flag in make_cr3() only if called for current active vcpu
>
> To be applied on top of Jan's "Meltdown band-aid overhead reduction"
> series
> ---
>  xen/arch/x86/flushtlb.c           |  3 +++
>  xen/arch/x86/mm.c                 | 38 +++++++++++++++++++++++++-------------
>  xen/arch/x86/pv/domain.c          |  2 ++
>  xen/arch/x86/smp.c                |  2 +-
>  xen/arch/x86/x86_64/asm-offsets.c |  1 +
>  xen/arch/x86/x86_64/entry.S       |  8 ++++++--
>  xen/include/asm-x86/current.h     |  8 ++++++++
>  xen/include/asm-x86/flushtlb.h    |  2 ++
>  8 files changed, 48 insertions(+), 16 deletions(-)
>
> diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
> index 8a7a76b8ff..38cedf3b22 100644
> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -160,5 +160,8 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>  
>      local_irq_restore(irqfl);
>  
> +    if ( flags & FLUSH_ROOT_PGTBL )
> +        get_cpu_info()->root_pgt_changed = true;
> +
>      return flags;
>  }
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 6d39d2c8ab..fd89685486 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -504,6 +504,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>  
>  void write_ptbase(struct vcpu *v)
>  {
> +    if ( this_cpu(root_pgt) )

I'm not sure this conditional is wise.  A call hitting here has changed
the root pgt, irrespective of whether the exit path cares.

> +        get_cpu_info()->root_pgt_changed = true;
>      write_cr3(v->arch.cr3);
>  }
>  
> @@ -3701,18 +3703,28 @@ long do_mmu_update(
>                          break;
>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> -                    /*
> -                     * No need to sync if all uses of the page can be accounted
> -                     * to the page lock we hold, its pinned status, and uses on
> -                     * this (v)CPU.
> -                     */
> -                    if ( !rc && !cpu_has_no_xpti &&
> -                         ((page->u.inuse.type_info & PGT_count_mask) >
> -                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
> -                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
> -                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
> -                            mfn))) )
> -                        sync_guest = true;
> +                    if ( !rc && !cpu_has_no_xpti )
> +                    {
> +                        bool local_in_use = false;
> +
> +                        if ( pagetable_get_pfn(curr->arch.guest_table) == mfn )
> +                        {
> +                            local_in_use = true;
> +                            get_cpu_info()->root_pgt_changed = true;
> +                        }
> +
> +                        /*
> +                         * No need to sync if all uses of the page can be
> +                         * accounted to the page lock we hold, its pinned
> +                         * status, and uses on this (v)CPU.
> +                         */
> +                        if ( (page->u.inuse.type_info & PGT_count_mask) >
> +                             (1 + !!(page->u.inuse.type_info & PGT_pinned) +
> +                              (pagetable_get_pfn(curr->arch.guest_table_user) ==
> +                               mfn) +
> +                              local_in_use) )

These two lines can be folded together.

> +                            sync_guest = true;
> +                    }
>                      break;
>  
>                  case PGT_writable_page:
> @@ -3827,7 +3839,7 @@ long do_mmu_update(
>  
>          cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
>          if ( !cpumask_empty(mask) )
> -            flush_mask(mask, FLUSH_TLB_GLOBAL);
> +            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
>      }
>  
>      perfc_add(num_page_updates, i);
> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> index 01c62e2d45..42522a2db3 100644
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -223,6 +223,8 @@ static void _toggle_guest_pt(struct vcpu *v)
>  {
>      v->arch.flags ^= TF_kernel_mode;
>      update_cr3(v);
> +    get_cpu_info()->root_pgt_changed = true;
> +
>      /* Don't flush user global mappings from the TLB. Don't tick TLB clock. */
>      asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
>  
> diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
> index 033dd05958..63e819ca38 100644
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -208,7 +208,7 @@ void invalidate_interrupt(struct cpu_user_regs *regs)
>      ack_APIC_irq();
>      perfc_incr(ipis);
>      if ( (flags & FLUSH_VCPU_STATE) && __sync_local_execstate() )
> -        flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL);
> +        flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
>      if ( flags & ~(FLUSH_VCPU_STATE | FLUSH_ORDER_MASK) )
>          flush_area_local(flush_va, flags);
>      cpumask_clear_cpu(smp_processor_id(), &flush_cpumask);
> diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
> index a2fea94f4c..9e2aefb00f 100644
> --- a/xen/arch/x86/x86_64/asm-offsets.c
> +++ b/xen/arch/x86/x86_64/asm-offsets.c
> @@ -143,6 +143,7 @@ void __dummy__(void)
>      OFFSET(CPUINFO_shadow_spec_ctrl, struct cpu_info, shadow_spec_ctrl);
>      OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
>      OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
> +    OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
>      DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
>      BLANK();
>  
> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
> index 45d9842d09..30c9da5446 100644
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -128,10 +128,13 @@ restore_all_guest:
>          /* Copy guest mappings and switch to per-CPU root page table. */
>          mov   VCPU_cr3(%rbx), %r9
>          GET_STACK_END(dx)
> -        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax

I think this assembly block is correct (based on the register
requirements at .Lrag_copy_done).  However (as a check of whether I've
understood it correctly), I think it would be more obviously correct as:

mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi   (no delta)
mov   %rdi, %rax  (move from lower down)
cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)

and (IIRC) gives slightly better static instruction scheduling for in-order
processors.

~Andrew

> +        cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
> +        je    .Lrag_copy_done
> +        movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
>          movabs $PADDR_MASK & PAGE_MASK, %rsi
>          movabs $DIRECTMAP_VIRT_START, %rcx
> -        mov   %rdi, %rax
> +        mov   %rax, %rdi
>          and   %rsi, %rdi
>          jz    .Lrag_keep_cr3
>          and   %r9, %rsi
> @@ -148,6 +151,7 @@ restore_all_guest:
>          sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
>                  ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
>          rep movsq
> +.Lrag_copy_done:
>          mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
>          mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
>          mov   %rdi, %rsi
> diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
> index 3a0e1eef36..f2491b4423 100644
> --- a/xen/include/asm-x86/current.h
> +++ b/xen/include/asm-x86/current.h
> @@ -59,6 +59,14 @@ struct cpu_info {
>      bool         use_shadow_spec_ctrl;
>      uint8_t      bti_ist_info;
>  
> +    /*
> +     * The following field controls copying of the L4 page table of 64-bit
> +     * PV guests to the per-cpu root page table on entering the guest context.
> +     * If set the L4 page table is being copied to the root page table and
> +     * the field will be reset.
> +     */
> +    bool         root_pgt_changed;
> +
>      unsigned long __pad;
>      /* get_stack_bottom() must be 16-byte aligned */
>  };
> diff --git a/xen/include/asm-x86/flushtlb.h b/xen/include/asm-x86/flushtlb.h
> index 2cade9cbfb..052f0fa403 100644
> --- a/xen/include/asm-x86/flushtlb.h
> +++ b/xen/include/asm-x86/flushtlb.h
> @@ -103,6 +103,8 @@ void write_cr3(unsigned long cr3);
>  #define FLUSH_VA_VALID   0x800
>   /* Flush CPU state */
>  #define FLUSH_VCPU_STATE 0x1000
> + /* Flush the per-cpu root page table */
> +#define FLUSH_ROOT_PGTBL 0x2000
>  
>  /* Flush local TLBs/caches. */
>  unsigned int flush_area_local(const void *va, unsigned int flags);


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  2018-04-06  7:52 ` [PATCH v5 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
@ 2018-04-06 11:21   ` Andrew Cooper
  2018-04-06 11:38     ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-04-06 11:21 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: jbeulich

On 06/04/18 08:52, Juergen Gross wrote:
> When switching to a 64-bit pv context the TLB is flushed twice today:
> the first time when switching to the new address space in
> write_ptbase(), the second time when switching to guest mode in
> restore_to_guest.
>
> Limit the first flush to non-global entries in that case.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

This isn't safe.  If any speculation/prefetching happens into the new
guest mappings, we will get machine check exceptions.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  2018-04-06 11:21   ` Andrew Cooper
@ 2018-04-06 11:38     ` Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-04-06 11:38 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: jbeulich

On 06/04/18 13:21, Andrew Cooper wrote:
> On 06/04/18 08:52, Juergen Gross wrote:
>> When switching to a 64-bit pv context the TLB is flushed twice today:
>> the first time when switching to the new address space in
>> write_ptbase(), the second time when switching to guest mode in
>> restore_to_guest.
>>
>> Limit the first flush to non-global entries in that case.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> This isn't safe.  If any speculation/prefetching happens into the new
> guest mappings, we will get machine check exceptions.

Okay, I'll drop that patch (it is more or less undone by patch 4
anyway).


Juergen


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/7] xen/x86: support per-domain flag for xpti
  2018-04-06  7:52 ` [PATCH v5 3/7] xen/x86: support per-domain flag for xpti Juergen Gross
@ 2018-04-06 13:16   ` Andrew Cooper
  2018-04-06 13:33     ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-04-06 13:16 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: jbeulich

On 06/04/18 08:52, Juergen Gross wrote:
> Instead of switching XPTI globally on or off add a per-domain flag for
> that purpose. This allows to modify the xpti boot parameter to support
> running dom0 without Meltdown mitigations. Using "xpti=nodom0" as boot
> parameter will achieve that.
>
> Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
> it is pv-domain specific.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
> V3:
> - latch get_cpu_info() return value in variable (Jan Beulich)
> - call always xpti_domain_init() for pv dom0 (Jan Beulich)
> - add __init annotations (Jan Beulich)
> - drop per domain XPTI message (Jan Beulich)
> - document xpti=default support (Jan Beulich)
> - move domain xpti flag into a padding hole (Jan Beulich)
> ---
>  docs/misc/xen-command-line.markdown | 10 ++++-
>  xen/arch/x86/domctl.c               |  4 ++
>  xen/arch/x86/mm.c                   | 12 +++++-
>  xen/arch/x86/pv/dom0_build.c        |  3 ++
>  xen/arch/x86/pv/domain.c            | 77 +++++++++++++++++++++++++++++++++++++
>  xen/arch/x86/setup.c                | 20 +---------
>  xen/arch/x86/smpboot.c              |  4 +-
>  xen/arch/x86/x86_64/entry.S         |  2 +
>  xen/include/asm-x86/current.h       |  3 +-
>  xen/include/asm-x86/domain.h        |  3 ++
>  xen/include/asm-x86/pv/domain.h     |  4 ++
>  11 files changed, 118 insertions(+), 24 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index b353352adf..79be9a6ba5 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -1955,7 +1955,7 @@ clustered mode.  The default, given no hint from the **FADT**, is cluster
>  mode.
>  
>  ### xpti
> -> `= <boolean>`
> +> `= default | nodom0 | <boolean>`
>  
>  > Default: `false` on AMD hardware
>  > Default: `true` everywhere else
> @@ -1963,6 +1963,14 @@ mode.
>  Override default selection of whether to isolate 64-bit PV guest page
>  tables.
>  
> +`true` activates page table isolation even on AMD hardware.
> +
> +`false` deactivates page table isolation on all systems.
> +
> +`default` sets the default behaviour.
> +
> +`nodom0` deactivates page table isolation for dom0. 

Thinking more about this, a simple set of options like this is
insufficiently expressive.  What if I'd like to run domU's but not dom0
with XPTI on AMD hardware?

How about `= List of [ <boolean>, dom0=bool ]`

which allows for selecting dom0 independently of the general default. 
It also avoids the somewhat odd "nodom0" which doesn't match our
prevailing style of "no-" prefixes for boolean options.
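
A rough sketch of how such parsing might look (purely illustrative - the
dom0 flag and its default are made up here, but the helpers are the ones
parse_xpti() already uses in this series):

    static bool __read_mostly opt_xpti_dom0 = true;   /* hypothetical */

    static __init int parse_xpti(const char *s)
    {
        const char *ss;
        int val, rc = 0;

        do {
            ss = strchr(s, ',');
            if ( !ss )
                ss = strchr(s, '\0');

            if ( (val = parse_bool(s, ss)) >= 0 )
                opt_xpti = val ? XPTI_ON : XPTI_OFF;
            else if ( (val = parse_boolean("dom0", s, ss)) >= 0 )
                opt_xpti_dom0 = val;
            else
                rc = -EINVAL;

            s = ss + 1;
        } while ( *ss );

        return rc;
    }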

Also, the text should note that the dom0 option is ignored for non-PV
dom0's.

> +
>  ### xsave
>  > `= <boolean>`
>  
> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
> index 8fbbf3aeb3..0704f398c7 100644
> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -24,6 +24,7 @@
>  #include <asm/hvm/hvm.h>
>  #include <asm/hvm/support.h>
>  #include <asm/processor.h>
> +#include <asm/pv/domain.h>
>  #include <asm/acpi.h> /* for hvm_acpi_power_button */
>  #include <xen/hypercall.h> /* for arch_do_domctl */
>  #include <xsm/xsm.h>
> @@ -610,6 +611,9 @@ long arch_do_domctl(
>              ret = switch_compat(d);
>          else
>              ret = -EINVAL;
> +
> +        if ( ret == 0 )
> +            xpti_domain_init(d);
>          break;
>  
>      case XEN_DOMCTL_get_address_size:
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index cf2ccb07e6..131433af9b 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -505,13 +505,21 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>  
>  void write_ptbase(struct vcpu *v)
>  {
> -    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
> +    struct cpu_info *cpu_info = get_cpu_info();
> +
> +    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
>      {
> -        get_cpu_info()->root_pgt_changed = true;
> +        cpu_info->root_pgt_changed = true;
> +        cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
>          asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
>      }
>      else
> +    {
> +        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
> +        cpu_info->xen_cr3 = 0;
>          write_cr3(v->arch.cr3);
> +        cpu_info->pv_cr3 = 0;
> +    }
>  }
>  
>  /*
> diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
> index 0bd2f1bf90..77186c19bd 100644
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -19,6 +19,7 @@
>  #include <asm/dom0_build.h>
>  #include <asm/guest.h>
>  #include <asm/page.h>
> +#include <asm/pv/domain.h>
>  #include <asm/pv/mm.h>
>  #include <asm/setup.h>
>  
> @@ -707,6 +708,8 @@ int __init dom0_construct_pv(struct domain *d,
>              cpu = p->processor;
>      }
>  
> +    xpti_domain_init(d);
> +
>      d->arch.paging.mode = 0;
>  
>      /* Set up CR3 value for write_ptbase */
> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> index 42522a2db3..2bef9c48bc 100644
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -9,6 +9,8 @@
>  #include <xen/lib.h>
>  #include <xen/sched.h>
>  
> +#include <asm/cpufeature.h>
> +#include <asm/msr-index.h>
>  #include <asm/pv/domain.h>
>  
>  /* Override macros from asm/page.h to make them work with mfn_t */
> @@ -17,6 +19,81 @@
>  #undef page_to_mfn
>  #define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
>  
> +static __read_mostly enum {
> +    XPTI_DEFAULT,
> +    XPTI_ON,
> +    XPTI_OFF,
> +    XPTI_NODOM0
> +} opt_xpti = XPTI_DEFAULT;
> +
> +static __init int parse_xpti(const char *s)
> +{
> +    int rc = 0;
> +
> +    switch ( parse_bool(s, NULL) )
> +    {
> +    case 0:
> +        opt_xpti = XPTI_OFF;
> +        break;
> +    case 1:
> +        opt_xpti = XPTI_ON;
> +        break;
> +    default:
> +        if ( !strcmp(s, "default") )
> +            opt_xpti = XPTI_DEFAULT;
> +        else if ( !strcmp(s, "nodom0") )
> +            opt_xpti = XPTI_NODOM0;
> +        else
> +            rc = -EINVAL;
> +        break;
> +    }
> +
> +    return rc;
> +}
> +custom_param("xpti", parse_xpti);
> +
> +void __init xpti_init(void)
> +{
> +    uint64_t caps = 0;
> +
> +    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
> +        caps = ARCH_CAPABILITIES_RDCL_NO;
> +    else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
> +        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
> +
> +    if ( opt_xpti != XPTI_ON && (caps & ARCH_CAPABILITIES_RDCL_NO) )
> +        opt_xpti = XPTI_OFF;
> +    else if ( opt_xpti == XPTI_DEFAULT )
> +        opt_xpti = XPTI_ON;
> +
> +    if ( opt_xpti == XPTI_OFF )
> +        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
> +    else
> +        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
> +}

Hmm - it's probably arguable, but I'd say that init work for XPTI really
does belong in spec_ctrl.c rather than pv/domain.c.  This would also
avoid having both a toplevel xpti_init() and
init_speculation_mitigations() call from __start_xen().

> +
> +void xpti_domain_init(struct domain *d)
> +{
> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
> +        return;

Nice, but nil pois.  You've ended up with the same bug as my patch has,
despite attempting to fix the 32bit case.  This 32bit check causes an
early exit and leaves xpti set.

The least invasive way to fix is to have the switch statement fill in a
local xpti boolean, then have

d->arch.pv_domain.xpti = xpti && is_pv_32bit_domain(d);

at the end.

> +
> +    switch ( opt_xpti )
> +    {
> +    case XPTI_OFF:
> +        break;
> +    case XPTI_ON:
> +        d->arch.pv_domain.xpti = true;
> +        break;
> +    case XPTI_NODOM0:
> +        d->arch.pv_domain.xpti = d->domain_id != 0 &&
> +                                 d->domain_id != hardware_domid;
> +        break;

Newlines after breaks please.

This wants to be is_hardware_domain(d), but that is liable to cause
problems when constructing dom0.  Instead, I'd suggest just the latter
half of the conditional, as hardware_domid is 0 until dom0 chooses to build
a second hardware domain.

~Andrew

> +    default:
> +        ASSERT_UNREACHABLE();
> +        break;
> +    }
> +}
> +
>  static void noreturn continue_nonidle_domain(struct vcpu *v)
>  {
>      check_wakeup_from_wait();
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 02673d9512..e83d3b4f07 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -53,6 +53,7 @@
>  #include <asm/cpuid.h>
>  #include <asm/spec_ctrl.h>
>  #include <asm/guest.h>
> +#include <asm/pv/domain.h>
>  
>  /* opt_nosmp: If true, secondary processors are ignored. */
>  static bool __initdata opt_nosmp;
> @@ -169,9 +170,6 @@ static int __init parse_smap_param(const char *s)
>  }
>  custom_param("smap", parse_smap_param);
>  
> -static int8_t __initdata opt_xpti = -1;
> -boolean_param("xpti", opt_xpti);
> -
>  bool __read_mostly acpi_disabled;
>  bool __initdata acpi_force;
>  static char __initdata acpi_param[10] = "";
> @@ -1546,21 +1544,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>  
>      cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
>  
> -    if ( opt_xpti < 0 )
> -    {
> -        uint64_t caps = 0;
> -
> -        if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
> -            caps = ARCH_CAPABILITIES_RDCL_NO;
> -        else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
> -            rdmsrl(MSR_ARCH_CAPABILITIES, caps);
> -
> -        opt_xpti = !(caps & ARCH_CAPABILITIES_RDCL_NO);
> -    }
> -    if ( opt_xpti )
> -        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
> -    else
> -        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
> +    xpti_init();
>  
>      if ( cpu_has_fsgsbase )
>          set_in_cr4(X86_CR4_FSGSBASE);
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index b0b72ca544..346a8e8a3f 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -331,7 +331,7 @@ void start_secondary(void *unused)
>      spin_debug_disable();
>  
>      get_cpu_info()->xen_cr3 = 0;
> -    get_cpu_info()->pv_cr3 = this_cpu(root_pgt) ? __pa(this_cpu(root_pgt)) : 0;
> +    get_cpu_info()->pv_cr3 = 0;
>  
>      load_system_tables();
>  
> @@ -1050,7 +1050,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>          panic("Error %d setting up PV root page table\n", rc);
>      if ( per_cpu(root_pgt, 0) )
>      {
> -        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
> +        get_cpu_info()->pv_cr3 = 0;
>  
>          /*
>           * All entry points which may need to switch page tables have to start
> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
> index 30c9da5446..5026cfa490 100644
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -129,6 +129,8 @@ restore_all_guest:
>          mov   VCPU_cr3(%rbx), %r9
>          GET_STACK_END(dx)
>          mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax
> +        test  %rax, %rax
> +        jz    .Lrag_keep_cr3
>          cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
>          je    .Lrag_copy_done
>          movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
> diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
> index f2491b4423..b2475783f8 100644
> --- a/xen/include/asm-x86/current.h
> +++ b/xen/include/asm-x86/current.h
> @@ -44,7 +44,8 @@ struct cpu_info {
>      /*
>       * Of the two following fields the latter is being set to the CR3 value
>       * to be used on the given pCPU for loading whenever 64-bit PV guest
> -     * context is being entered. The value never changes once set.
> +     * context is being entered. A value of zero indicates no setting of CR3
> +     * is to be performed.
>       * The former is the value to restore when re-entering Xen, if any. IOW
>       * its value being zero means there's nothing to restore. However, its
>       * value can also be negative, indicating to the exit-to-Xen code that
> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
> index a12ae47f1b..ed4199931a 100644
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -253,6 +253,9 @@ struct pv_domain
>  
>      atomic_t nr_l4_pages;
>  
> +    /* XPTI active? */
> +    bool xpti;
> +
>      /* map_domain_page() mapping cache. */
>      struct mapcache_domain mapcache;
>  
> diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
> index 5e34176939..911e5dc07f 100644
> --- a/xen/include/asm-x86/pv/domain.h
> +++ b/xen/include/asm-x86/pv/domain.h
> @@ -27,6 +27,8 @@ void pv_vcpu_destroy(struct vcpu *v);
>  int pv_vcpu_initialise(struct vcpu *v);
>  void pv_domain_destroy(struct domain *d);
>  int pv_domain_initialise(struct domain *d);
> +void xpti_init(void);
> +void xpti_domain_init(struct domain *d);
>  
>  #else  /* !CONFIG_PV */
>  
> @@ -36,6 +38,8 @@ static inline void pv_vcpu_destroy(struct vcpu *v) {}
>  static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
>  static inline void pv_domain_destroy(struct domain *d) {}
>  static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
> +static inline void xpti_init(void) {}
> +static inline void xpti_domain_init(struct domain *d) {}
>  
>  #endif	/* CONFIG_PV */
>  


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/7] xen/x86: support per-domain flag for xpti
  2018-04-06 13:16   ` Andrew Cooper
@ 2018-04-06 13:33     ` Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-04-06 13:33 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: jbeulich

On 06/04/18 15:16, Andrew Cooper wrote:
> On 06/04/18 08:52, Juergen Gross wrote:
>> Instead of switching XPTI globally on or off add a per-domain flag for
>> that purpose. This allows to modify the xpti boot parameter to support
>> running dom0 without Meltdown mitigations. Using "xpti=nodom0" as boot
>> parameter will achieve that.
>>
>> Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
>> it is pv-domain specific.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> V3:
>> - latch get_cpu_info() return value in variable (Jan Beulich)
>> - call always xpti_domain_init() for pv dom0 (Jan Beulich)
>> - add __init annotations (Jan Beulich)
>> - drop per domain XPTI message (Jan Beulich)
>> - document xpti=default support (Jan Beulich)
>> - move domain xpti flag into a padding hole (Jan Beulich)
>> ---
>>  docs/misc/xen-command-line.markdown | 10 ++++-
>>  xen/arch/x86/domctl.c               |  4 ++
>>  xen/arch/x86/mm.c                   | 12 +++++-
>>  xen/arch/x86/pv/dom0_build.c        |  3 ++
>>  xen/arch/x86/pv/domain.c            | 77 +++++++++++++++++++++++++++++++++++++
>>  xen/arch/x86/setup.c                | 20 +---------
>>  xen/arch/x86/smpboot.c              |  4 +-
>>  xen/arch/x86/x86_64/entry.S         |  2 +
>>  xen/include/asm-x86/current.h       |  3 +-
>>  xen/include/asm-x86/domain.h        |  3 ++
>>  xen/include/asm-x86/pv/domain.h     |  4 ++
>>  11 files changed, 118 insertions(+), 24 deletions(-)
>>
>> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
>> index b353352adf..79be9a6ba5 100644
>> --- a/docs/misc/xen-command-line.markdown
>> +++ b/docs/misc/xen-command-line.markdown
>> @@ -1955,7 +1955,7 @@ clustered mode.  The default, given no hint from the **FADT**, is cluster
>>  mode.
>>  
>>  ### xpti
>> -> `= <boolean>`
>> +> `= default | nodom0 | <boolean>`
>>  
>>  > Default: `false` on AMD hardware
>>  > Default: `true` everywhere else
>> @@ -1963,6 +1963,14 @@ mode.
>>  Override default selection of whether to isolate 64-bit PV guest page
>>  tables.
>>  
>> +`true` activates page table isolation even on AMD hardware.
>> +
>> +`false` deactivates page table isolation on all systems.
>> +
>> +`default` sets the default behaviour.
>> +
>> +`nodom0` deactivates page table isolation for dom0. 
> 
> Thinking more about this, a simple set of options like this is
> insufficiently expressive.  What if I'd like to run domU's but not dom0
> with XPTI on AMD hardware?

Why would anyone want to do this?

My plan was to add a way to switch XPTI on a per-domain basis in 4.12.

> 
> How about `= List of [ <boolean>, dom0=bool ]`
> 
> which allows for selecting dom0 independently of the general default. 
> It also avoids the somewhat odd "nodom0" which doesn't match our
> prevailing style of "no-" prefixes for boolean options.

This reasoning makes more sense for me. :-)

> 
> Also, the text should note that the dom0 option is ignored for non-PV
> dom0's.

It clearly states "64-bit PV guest". A non-pv dom0 clearly isn't subject
to this parameter.

> 
>> +
>>  ### xsave
>>  > `= <boolean>`
>>  
>> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
>> index 8fbbf3aeb3..0704f398c7 100644
>> --- a/xen/arch/x86/domctl.c
>> +++ b/xen/arch/x86/domctl.c
>> @@ -24,6 +24,7 @@
>>  #include <asm/hvm/hvm.h>
>>  #include <asm/hvm/support.h>
>>  #include <asm/processor.h>
>> +#include <asm/pv/domain.h>
>>  #include <asm/acpi.h> /* for hvm_acpi_power_button */
>>  #include <xen/hypercall.h> /* for arch_do_domctl */
>>  #include <xsm/xsm.h>
>> @@ -610,6 +611,9 @@ long arch_do_domctl(
>>              ret = switch_compat(d);
>>          else
>>              ret = -EINVAL;
>> +
>> +        if ( ret == 0 )
>> +            xpti_domain_init(d);
>>          break;
>>  
>>      case XEN_DOMCTL_get_address_size:
>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>> index cf2ccb07e6..131433af9b 100644
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -505,13 +505,21 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  
>>  void write_ptbase(struct vcpu *v)
>>  {
>> -    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>> +    struct cpu_info *cpu_info = get_cpu_info();
>> +
>> +    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
>>      {
>> -        get_cpu_info()->root_pgt_changed = true;
>> +        cpu_info->root_pgt_changed = true;
>> +        cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
>>          asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
>>      }
>>      else
>> +    {
>> +        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
>> +        cpu_info->xen_cr3 = 0;
>>          write_cr3(v->arch.cr3);
>> +        cpu_info->pv_cr3 = 0;
>> +    }
>>  }
>>  
>>  /*
>> diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
>> index 0bd2f1bf90..77186c19bd 100644
>> --- a/xen/arch/x86/pv/dom0_build.c
>> +++ b/xen/arch/x86/pv/dom0_build.c
>> @@ -19,6 +19,7 @@
>>  #include <asm/dom0_build.h>
>>  #include <asm/guest.h>
>>  #include <asm/page.h>
>> +#include <asm/pv/domain.h>
>>  #include <asm/pv/mm.h>
>>  #include <asm/setup.h>
>>  
>> @@ -707,6 +708,8 @@ int __init dom0_construct_pv(struct domain *d,
>>              cpu = p->processor;
>>      }
>>  
>> +    xpti_domain_init(d);
>> +
>>      d->arch.paging.mode = 0;
>>  
>>      /* Set up CR3 value for write_ptbase */
>> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
>> index 42522a2db3..2bef9c48bc 100644
>> --- a/xen/arch/x86/pv/domain.c
>> +++ b/xen/arch/x86/pv/domain.c
>> @@ -9,6 +9,8 @@
>>  #include <xen/lib.h>
>>  #include <xen/sched.h>
>>  
>> +#include <asm/cpufeature.h>
>> +#include <asm/msr-index.h>
>>  #include <asm/pv/domain.h>
>>  
>>  /* Override macros from asm/page.h to make them work with mfn_t */
>> @@ -17,6 +19,81 @@
>>  #undef page_to_mfn
>>  #define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
>>  
>> +static __read_mostly enum {
>> +    XPTI_DEFAULT,
>> +    XPTI_ON,
>> +    XPTI_OFF,
>> +    XPTI_NODOM0
>> +} opt_xpti = XPTI_DEFAULT;
>> +
>> +static __init int parse_xpti(const char *s)
>> +{
>> +    int rc = 0;
>> +
>> +    switch ( parse_bool(s, NULL) )
>> +    {
>> +    case 0:
>> +        opt_xpti = XPTI_OFF;
>> +        break;
>> +    case 1:
>> +        opt_xpti = XPTI_ON;
>> +        break;
>> +    default:
>> +        if ( !strcmp(s, "default") )
>> +            opt_xpti = XPTI_DEFAULT;
>> +        else if ( !strcmp(s, "nodom0") )
>> +            opt_xpti = XPTI_NODOM0;
>> +        else
>> +            rc = -EINVAL;
>> +        break;
>> +    }
>> +
>> +    return rc;
>> +}
>> +custom_param("xpti", parse_xpti);
>> +
>> +void __init xpti_init(void)
>> +{
>> +    uint64_t caps = 0;
>> +
>> +    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
>> +        caps = ARCH_CAPABILITIES_RDCL_NO;
>> +    else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
>> +        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
>> +
>> +    if ( opt_xpti != XPTI_ON && (caps & ARCH_CAPABILITIES_RDCL_NO) )
>> +        opt_xpti = XPTI_OFF;
>> +    else if ( opt_xpti == XPTI_DEFAULT )
>> +        opt_xpti = XPTI_ON;
>> +
>> +    if ( opt_xpti == XPTI_OFF )
>> +        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
>> +    else
>> +        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
>> +}
> 
> Hmm - it's probably arguable, but I'd say that init work for XPTI really
> does belong in spec_ctrl.c rather than pv/domain.c.  This would also
> avoid having both a toplevel xpti_init() and
> init_speculation_mitigations() call from __start_xen().

I don't mind either way.

> 
>> +
>> +void xpti_domain_init(struct domain *d)
>> +{
>> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
>> +        return;
> 
> Nice, but nil pois.  You've ended up with the same bug as my patch has,
> despite attempting to fix the 32bit case.  This 32bit check causes an
> early exit and leaves xpti set.

Huh? The domain structure is zeroed on allocation. How would xpti
be true for 32-bit pv guests then?

> 
> The least invasive way to fix is to have the switch statement fill in a
> local xpti boolean, then have
> 
> d->arch.pv_domain.xpti = xpti && is_pv_32bit_domain(d);

Oh no. I want 32-bit pv domains to have xpti = false and 64-bit pv
domains may have xpti = true. Your suggestion is wrong in this regard.

> 
> at the end.
> 
>> +
>> +    switch ( opt_xpti )
>> +    {
>> +    case XPTI_OFF:
>> +        break;
>> +    case XPTI_ON:
>> +        d->arch.pv_domain.xpti = true;
>> +        break;
>> +    case XPTI_NODOM0:
>> +        d->arch.pv_domain.xpti = d->domain_id != 0 &&
>> +                                 d->domain_id != hardware_domid;
>> +        break;
> 
> Newlines after breaks please.

Okay.

> 
> This wants to be is_hardware_domain(d), but that is liable to cause
> problems when constructing dom0.  Instead, I'd suggest just the latter
> half of the conditional, as hardware_domid is 0 until dom0 chooses to build
> a second hardware domain.

Okay.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 4/7] xen/x86: use invpcid for flushing the TLB
  2018-04-06  7:52 ` [PATCH v5 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
@ 2018-04-06 13:40   ` Andrew Cooper
  2018-04-06 13:48     ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-04-06 13:40 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: jbeulich

On 06/04/18 08:52, Juergen Gross wrote:
> If possible use the INVPCID instruction for flushing the TLB instead of
> toggling cr4.pge for that purpose.
>
> While at it remove the dependency on cr4.pge being required for mtrr
> loading, as this will be required later anyway.
>
> Add a command line option "invpcid" for controlling the use of
> INVPCID (default to true).
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V5:
> - use pre_flush() as an initializer in do_tlb_flush() (Jan Beulich)
> - introduce boolean use_invpcid instead of clearing X86_FEATURE_INVPCID
>   (Jan Beulich)
>
> V4:
> - option "invpcid" instead of "noinvpcid" (Jan Beulich)
>
> V3:
> - new patch
> ---
>  docs/misc/xen-command-line.markdown | 10 ++++++++++
>  xen/arch/x86/cpu/mtrr/generic.c     | 37 ++++++++++++++++++++++++++-----------
>  xen/arch/x86/flushtlb.c             | 29 +++++++++++++++++++----------
>  xen/arch/x86/setup.c                |  8 ++++++++
>  xen/include/asm-x86/invpcid.h       |  2 ++
>  5 files changed, 65 insertions(+), 21 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 79be9a6ba5..5f6ae654ad 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -1380,6 +1380,16 @@ Because responsibility for APIC setup is shared between Xen and the
>  domain 0 kernel this option is automatically propagated to the domain
>  0 command line.
>  
> +### invpcid (x86)
> +> `= <boolean>`
> +
> +> Default: `true`
> +
> +Control using the INVPCID instruction for flushing TLB entries.
> +This should only be used in case of known issues on the current platform
> +with that instruction. Disabling INVPCID will normally result in a slightly
> +degraded performance.

How about:

By default, Xen will use the INVPCID instruction for TLB management if
it is available.  This option can be used to cause Xen to fall back to
older mechanisms, which are generally slower.

?

We are not aware of any systems with INVPCID problems, but "for testing
purposes" is also a valid reason to use this option.


> +
>  ### noirqbalance
>  > `= <boolean>`
>  
> diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c
> index e9c0e5e059..705855e753 100644
> --- a/xen/arch/x86/cpu/mtrr/generic.c
> +++ b/xen/arch/x86/cpu/mtrr/generic.c
> @@ -5,6 +5,7 @@
>  #include <xen/mm.h>
>  #include <xen/stdbool.h>
>  #include <asm/flushtlb.h>
> +#include <asm/invpcid.h>
>  #include <asm/io.h>
>  #include <asm/mtrr.h>
>  #include <asm/msr.h>
> @@ -400,8 +401,10 @@ static DEFINE_SPINLOCK(set_atomicity_lock);
>   * has been called.
>   */
>  
> -static void prepare_set(void)
> +static bool prepare_set(void)
>  {
> +	unsigned long cr4;
> +
>  	/*  Note that this is not ideal, since the cache is only flushed/disabled
>  	   for this CPU while the MTRRs are changed, but changing this requires
>  	   more invasive changes to the way the kernel boots  */
> @@ -412,18 +415,24 @@ static void prepare_set(void)
>  	write_cr0(read_cr0() | X86_CR0_CD);
>  	wbinvd();
>  
> -	/*  TLB flushing here relies on Xen always using CR4.PGE. */
> -	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
> -	write_cr4(read_cr4() & ~X86_CR4_PGE);
> +	cr4 = read_cr4();
> +	if (cr4 & X86_CR4_PGE)
> +		write_cr4(cr4 & ~X86_CR4_PGE);
> +	else if (use_invpcid)
> +		invpcid_flush_all();
> +	else
> +		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );

We could really do with gaining a static inline wrapper for write_cr3(),
because at the very least, the memory clobber going missing would be a
BadThing(tm).  Sadly that name is already taken in Xen, and behaves
> differently to (all?) other OSes' examples.

There are currently only 3 callers of write_cr3().  How about a prereq
patch renaming write_cr3() to switch_cr3() (as switch_cr3() logically
implies the TLB clock activities), and a new thin write_cr3() wrapper?

>  
>  	/*  Save MTRR state */
>  	rdmsrl(MSR_MTRRdefType, deftype);
>  
>  	/*  Disable MTRRs, and set the default type to uncached  */
>  	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
> +
> +	return cr4 & X86_CR4_PGE;
>  }
>  
> -static void post_set(void)
> +static void post_set(bool pge)
>  {
>  	/* Intel (P6) standard MTRRs */
>  	mtrr_wrmsr(MSR_MTRRdefType, deftype);
> @@ -432,7 +441,12 @@ static void post_set(void)
>  	write_cr0(read_cr0() & ~X86_CR0_CD);
>  
>  	/*  Reenable CR4.PGE (also flushes the TLB) */
> -	write_cr4(read_cr4() | X86_CR4_PGE);
> +	if (pge)
> +		write_cr4(read_cr4() | X86_CR4_PGE);
> +	else if (use_invpcid)
> +		invpcid_flush_all();
> +	else
> +		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
>  
>  	spin_unlock(&set_atomicity_lock);
>  }
> @@ -441,14 +455,15 @@ static void generic_set_all(void)
>  {
>  	unsigned long mask, count;
>  	unsigned long flags;
> +	bool pge;
>  
>  	local_irq_save(flags);
> -	prepare_set();
> +	pge = prepare_set();
>  
>  	/* Actually set the state */
>  	mask = set_mtrr_state();
>  
> -	post_set();
> +	post_set(pge);
>  	local_irq_restore(flags);
>  
>  	/*  Use the atomic bitops to update the global mask  */
> @@ -457,7 +472,6 @@ static void generic_set_all(void)
>  			set_bit(count, &smp_changes_mask);
>  		mask >>= 1;
>  	}
> -	
>  }
>  
>  static void generic_set_mtrr(unsigned int reg, unsigned long base,
> @@ -474,11 +488,12 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
>  {
>  	unsigned long flags;
>  	struct mtrr_var_range *vr;
> +	bool pge;
>  
>  	vr = &mtrr_state.var_ranges[reg];
>  
>  	local_irq_save(flags);
> -	prepare_set();
> +	pge = prepare_set();
>  
>  	if (size == 0) {
>  		/* The invalid bit is kept in the mask, so we simply clear the
> @@ -499,7 +514,7 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
>  		mtrr_wrmsr(MSR_IA32_MTRR_PHYSMASK(reg), vr->mask);
>  	}
>  
> -	post_set();
> +	post_set(pge);
>  	local_irq_restore(flags);
>  }
>  
> diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
> index 38cedf3b22..f793b70696 100644
> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -10,6 +10,7 @@
>  #include <xen/sched.h>
>  #include <xen/softirq.h>
>  #include <asm/flushtlb.h>
> +#include <asm/invpcid.h>
>  #include <asm/page.h>
>  
>  /* Debug builds: Wrap frequently to stress-test the wrap logic. */
> @@ -71,6 +72,23 @@ static void post_flush(u32 t)
>      this_cpu(tlbflush_time) = t;
>  }
>  
> +static void do_tlb_flush(void)
> +{
> +    u32 t = pre_flush();
> +
> +    if ( use_invpcid )
> +        invpcid_flush_all();
> +    else
> +    {
> +        unsigned long cr4 = read_cr4();
> +
> +        write_cr4(cr4 ^ X86_CR4_PGE);
> +        write_cr4(cr4);
> +    }
> +
> +    post_flush(t);
> +}
> +
>  void write_cr3(unsigned long cr3)
>  {
>      unsigned long flags, cr4;
> @@ -118,16 +136,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>                             : : "m" (*(const char *)(va)) : "memory" );
>          }
>          else
> -        {
> -            u32 t = pre_flush();
> -            unsigned long cr4 = read_cr4();
> -
> -            write_cr4(cr4 & ~X86_CR4_PGE);
> -            barrier();
> -            write_cr4(cr4);
> -
> -            post_flush(t);
> -        }
> +            do_tlb_flush();
>      }
>  
>      if ( flags & FLUSH_CACHE )
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index e83d3b4f07..0a27402e4d 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -63,6 +63,11 @@ boolean_param("nosmp", opt_nosmp);
>  static unsigned int __initdata max_cpus;
>  integer_param("maxcpus", max_cpus);
>  
> +/* opt_invpcid: If false, don't use INVPCID instruction even if available. */
> +static bool __initdata opt_invpcid = true;
> +boolean_param("invpcid", opt_invpcid);
> +bool use_invpcid;

__read_mostly.

~Andrew

> +
>  unsigned long __read_mostly cr4_pv32_mask;
>  
>  /* **** Linux config option: propagated to domain0. */
> @@ -1549,6 +1554,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>      if ( cpu_has_fsgsbase )
>          set_in_cr4(X86_CR4_FSGSBASE);
>  
> +    if ( opt_invpcid && cpu_has_invpcid )
> +        use_invpcid = true;
> +
>      init_speculation_mitigations();
>  
>      init_idle_domain();
> diff --git a/xen/include/asm-x86/invpcid.h b/xen/include/asm-x86/invpcid.h
> index b46624a865..edd8b68706 100644
> --- a/xen/include/asm-x86/invpcid.h
> +++ b/xen/include/asm-x86/invpcid.h
> @@ -3,6 +3,8 @@
>  
>  #include <xen/types.h>
>  
> +extern bool use_invpcid;
> +
>  #define INVPCID_TYPE_INDIV_ADDR      0
>  #define INVPCID_TYPE_SINGLE_CTXT     1
>  #define INVPCID_TYPE_ALL_INCL_GLOBAL 2


* Re: [PATCH v5 4/7] xen/x86: use invpcid for flushing the TLB
  2018-04-06 13:40   ` Andrew Cooper
@ 2018-04-06 13:48     ` Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-04-06 13:48 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: jbeulich

On 06/04/18 15:40, Andrew Cooper wrote:
> On 06/04/18 08:52, Juergen Gross wrote:
>> If possible use the INVPCID instruction for flushing the TLB instead of
>> toggling cr4.pge for that purpose.
>>
>> While at it remove the dependency on cr4.pge being required for mtrr
>> loading, as this will be required later anyway.
>>
>> Add a command line option "invpcid" for controlling the use of
>> INVPCID (default to true).
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> V5:
>> - use pre_flush() as an initializer in do_tlb_flush() (Jan Beulich)
>> - introduce boolean use_invpcid instead of clearing X86_FEATURE_INVPCID
>>   (Jan Beulich)
>>
>> V4:
>> - option "invpcid" instead of "noinvpcid" (Jan Beulich)
>>
>> V3:
>> - new patch
>> ---
>>  docs/misc/xen-command-line.markdown | 10 ++++++++++
>>  xen/arch/x86/cpu/mtrr/generic.c     | 37 ++++++++++++++++++++++++++-----------
>>  xen/arch/x86/flushtlb.c             | 29 +++++++++++++++++++----------
>>  xen/arch/x86/setup.c                |  8 ++++++++
>>  xen/include/asm-x86/invpcid.h       |  2 ++
>>  5 files changed, 65 insertions(+), 21 deletions(-)
>>
>> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
>> index 79be9a6ba5..5f6ae654ad 100644
>> --- a/docs/misc/xen-command-line.markdown
>> +++ b/docs/misc/xen-command-line.markdown
>> @@ -1380,6 +1380,16 @@ Because responsibility for APIC setup is shared between Xen and the
>>  domain 0 kernel this option is automatically propagated to the domain
>>  0 command line.
>>  
>> +### invpcid (x86)
>> +> `= <boolean>`
>> +
>> +> Default: `true`
>> +
>> +Control using the INVPCID instruction for flushing TLB entries.
>> +This should only be used in case of known issues on the current platform
>> +with that instruction. Disabling INVPCID will normally result in a slightly
>> +degraded performance.
> 
> How about:
> 
> By default, Xen will use the INVPCID instruction for TLB management if
> it is available.  This option can be used to cause Xen to fall back to
> older mechanisms, which are generally slower.
> 
> ?

Works for me.

> We are not aware of any systems with INVPCID problems, but "for testing
> purposes" is also a valid reason to use this option.

Okay, I'll change it.
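
So the entry would end up reading roughly like this (a sketch of the
combined wording, not necessarily the final text):

    By default, Xen will use the INVPCID instruction for TLB management
    if it is available.  This option can be used to cause Xen to fall
    back to older mechanisms, which are generally slower.  There are no
    known systems with INVPCID problems, but the option may also be
    useful for testing purposes.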

> 
> 
>> +
>>  ### noirqbalance
>>  > `= <boolean>`
>>  
>> diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c
>> index e9c0e5e059..705855e753 100644
>> --- a/xen/arch/x86/cpu/mtrr/generic.c
>> +++ b/xen/arch/x86/cpu/mtrr/generic.c
>> @@ -5,6 +5,7 @@
>>  #include <xen/mm.h>
>>  #include <xen/stdbool.h>
>>  #include <asm/flushtlb.h>
>> +#include <asm/invpcid.h>
>>  #include <asm/io.h>
>>  #include <asm/mtrr.h>
>>  #include <asm/msr.h>
>> @@ -400,8 +401,10 @@ static DEFINE_SPINLOCK(set_atomicity_lock);
>>   * has been called.
>>   */
>>  
>> -static void prepare_set(void)
>> +static bool prepare_set(void)
>>  {
>> +	unsigned long cr4;
>> +
>>  	/*  Note that this is not ideal, since the cache is only flushed/disabled
>>  	   for this CPU while the MTRRs are changed, but changing this requires
>>  	   more invasive changes to the way the kernel boots  */
>> @@ -412,18 +415,24 @@ static void prepare_set(void)
>>  	write_cr0(read_cr0() | X86_CR0_CD);
>>  	wbinvd();
>>  
>> -	/*  TLB flushing here relies on Xen always using CR4.PGE. */
>> -	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
>> -	write_cr4(read_cr4() & ~X86_CR4_PGE);
>> +	cr4 = read_cr4();
>> +	if (cr4 & X86_CR4_PGE)
>> +		write_cr4(cr4 & ~X86_CR4_PGE);
>> +	else if (use_invpcid)
>> +		invpcid_flush_all();
>> +	else
>> +		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
> 
> We could really do with gaining a static inline wrapper for write_cr3(),
> because at the very least, the memory clobber going missing would be a
> BadThing(tm).  Sadly that name is already taken in Xen, and behaves
> differently to (all?) other OSes' examples.
> 
> There are currently only 3 callers of write_cr3().  How about a prereq
> patch renaming write_cr3() to switch_cr3() (as switch_cr3() logically
> implies the TLB clock activities), and a new thin write_cr3() wrapper?

Sure.
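
Something along these lines, I suppose (a sketch only; the rename of the
current function to switch_cr3() and the exact placement would be part
of the prereq patch):

    /* Thin wrapper: raw cr3 write keeping the memory clobber, without
     * the TLB clock handling done by switch_cr3(). */
    static inline void write_cr3(unsigned long cr3)
    {
        asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
    }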

> 
>>  
>>  	/*  Save MTRR state */
>>  	rdmsrl(MSR_MTRRdefType, deftype);
>>  
>>  	/*  Disable MTRRs, and set the default type to uncached  */
>>  	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
>> +
>> +	return cr4 & X86_CR4_PGE;
>>  }
>>  
>> -static void post_set(void)
>> +static void post_set(bool pge)
>>  {
>>  	/* Intel (P6) standard MTRRs */
>>  	mtrr_wrmsr(MSR_MTRRdefType, deftype);
>> @@ -432,7 +441,12 @@ static void post_set(void)
>>  	write_cr0(read_cr0() & ~X86_CR0_CD);
>>  
>>  	/*  Reenable CR4.PGE (also flushes the TLB) */
>> -	write_cr4(read_cr4() | X86_CR4_PGE);
>> +	if (pge)
>> +		write_cr4(read_cr4() | X86_CR4_PGE);
>> +	else if (use_invpcid)
>> +		invpcid_flush_all();
>> +	else
>> +		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
>>  
>>  	spin_unlock(&set_atomicity_lock);
>>  }
>> @@ -441,14 +455,15 @@ static void generic_set_all(void)
>>  {
>>  	unsigned long mask, count;
>>  	unsigned long flags;
>> +	bool pge;
>>  
>>  	local_irq_save(flags);
>> -	prepare_set();
>> +	pge = prepare_set();
>>  
>>  	/* Actually set the state */
>>  	mask = set_mtrr_state();
>>  
>> -	post_set();
>> +	post_set(pge);
>>  	local_irq_restore(flags);
>>  
>>  	/*  Use the atomic bitops to update the global mask  */
>> @@ -457,7 +472,6 @@ static void generic_set_all(void)
>>  			set_bit(count, &smp_changes_mask);
>>  		mask >>= 1;
>>  	}
>> -	
>>  }
>>  
>>  static void generic_set_mtrr(unsigned int reg, unsigned long base,
>> @@ -474,11 +488,12 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
>>  {
>>  	unsigned long flags;
>>  	struct mtrr_var_range *vr;
>> +	bool pge;
>>  
>>  	vr = &mtrr_state.var_ranges[reg];
>>  
>>  	local_irq_save(flags);
>> -	prepare_set();
>> +	pge = prepare_set();
>>  
>>  	if (size == 0) {
>>  		/* The invalid bit is kept in the mask, so we simply clear the
>> @@ -499,7 +514,7 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
>>  		mtrr_wrmsr(MSR_IA32_MTRR_PHYSMASK(reg), vr->mask);
>>  	}
>>  
>> -	post_set();
>> +	post_set(pge);
>>  	local_irq_restore(flags);
>>  }
>>  
>> diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
>> index 38cedf3b22..f793b70696 100644
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -10,6 +10,7 @@
>>  #include <xen/sched.h>
>>  #include <xen/softirq.h>
>>  #include <asm/flushtlb.h>
>> +#include <asm/invpcid.h>
>>  #include <asm/page.h>
>>  
>>  /* Debug builds: Wrap frequently to stress-test the wrap logic. */
>> @@ -71,6 +72,23 @@ static void post_flush(u32 t)
>>      this_cpu(tlbflush_time) = t;
>>  }
>>  
>> +static void do_tlb_flush(void)
>> +{
>> +    u32 t = pre_flush();
>> +
>> +    if ( use_invpcid )
>> +        invpcid_flush_all();
>> +    else
>> +    {
>> +        unsigned long cr4 = read_cr4();
>> +
>> +        write_cr4(cr4 ^ X86_CR4_PGE);
>> +        write_cr4(cr4);
>> +    }
>> +
>> +    post_flush(t);
>> +}
>> +
>>  void write_cr3(unsigned long cr3)
>>  {
>>      unsigned long flags, cr4;
>> @@ -118,16 +136,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>>                             : : "m" (*(const char *)(va)) : "memory" );
>>          }
>>          else
>> -        {
>> -            u32 t = pre_flush();
>> -            unsigned long cr4 = read_cr4();
>> -
>> -            write_cr4(cr4 & ~X86_CR4_PGE);
>> -            barrier();
>> -            write_cr4(cr4);
>> -
>> -            post_flush(t);
>> -        }
>> +            do_tlb_flush();
>>      }
>>  
>>      if ( flags & FLUSH_CACHE )
>> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
>> index e83d3b4f07..0a27402e4d 100644
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -63,6 +63,11 @@ boolean_param("nosmp", opt_nosmp);
>>  static unsigned int __initdata max_cpus;
>>  integer_param("maxcpus", max_cpus);
>>  
>> +/* opt_invpcid: If false, don't use INVPCID instruction even if available. */
>> +static bool __initdata opt_invpcid = true;
>> +boolean_param("invpcid", opt_invpcid);
>> +bool use_invpcid;
> 
> __read_mostly.

Yes.
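
I.e. the declaration simply becomes:

    bool __read_mostly use_invpcid;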


Juergen

* Re: [PATCH v5 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-04-06  7:52 ` [PATCH v5 5/7] xen/x86: disable global pages for domains with XPTI active Juergen Gross
@ 2018-04-06 15:17   ` Andrew Cooper
  2018-04-06 15:41     ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-04-06 15:17 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: jbeulich

On 06/04/18 08:52, Juergen Gross wrote:
> Instead of flushing the TLB from global pages when switching address
> spaces with XPTI being active just disable global pages via %cr4
> completely when a domain subject to XPTI is active. This avoids the
> need for extra TLB flushes as loading %cr3 will remove all TLB
> entries.
>
> In order to avoid states with cr3/cr4 having inconsistent values
> (e.g. global pages being activated while cr3 already specifies a XPTI
> address space) move loading of the new cr4 value to write_ptbase()
> (actually to write_cr3_cr4() called by write_ptbase()).
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

This is contrary to the performance optimisations taken by Linux,
Windows and Apple, which borrow Xen's 64-bit PV optimisation of having
global user pages, because it really is a (small) performance improvement.

Are there performance numbers for this change alone, and/or are later
changes strictly dependent on this functionality?

On the Xen side of things, an argument could probably be made that the
extra cr4 rewrites due to the L4 shadowing might eat away the
performance we would otherwise gain, but I'd be hesitant to blindly
assume that this is the case.

A complicating factor is that Intel have said that the performance gains
from user global pages would be more noticeable on older hardware, due
to differences in the TLB architecture.

> ---
> V4:
> - don't use mmu_cr4_features for setting new cr4 value (Jan Beulich)
> - use simpler scheme for setting X86_CR4_PGE in
>   pv_guest_cr4_to_real_cr4() (Jan Beulich)
>
> V3:
> - move cr4 loading for all domains from *_ctxt_switch_to() to
>   write_cr3_cr4() called by write_ptbase() (Jan Beulich)
> - rebase
> ---
>  xen/arch/x86/domain.c          |  5 -----
>  xen/arch/x86/flushtlb.c        | 13 ++++++++-----
>  xen/arch/x86/mm.c              | 14 +++++++++++---
>  xen/arch/x86/x86_64/entry.S    | 10 ----------
>  xen/common/efi/runtime.c       |  4 ++--
>  xen/include/asm-x86/domain.h   |  3 ++-
>  xen/include/asm-x86/flushtlb.h |  2 +-
>  7 files changed, 24 insertions(+), 27 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 9c229594f4..c2bb70c483 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1522,17 +1522,12 @@ void paravirt_ctxt_switch_from(struct vcpu *v)
>  void paravirt_ctxt_switch_to(struct vcpu *v)
>  {
>      root_pgentry_t *root_pgt = this_cpu(root_pgt);
> -    unsigned long cr4;
>  
>      if ( root_pgt )
>          root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
>              l4e_from_page(v->domain->arch.perdomain_l3_pg,
>                            __PAGE_HYPERVISOR_RW);
>  
> -    cr4 = pv_guest_cr4_to_real_cr4(v);
> -    if ( unlikely(cr4 != read_cr4()) )
> -        write_cr4(cr4);
> -
>      if ( unlikely(v->arch.debugreg[7] & DR7_ACTIVE_MASK) )
>          activate_debugregs(v);
>  
> diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
> index f793b70696..5dcd9a2bf6 100644
> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -89,20 +89,23 @@ static void do_tlb_flush(void)
>      post_flush(t);
>  }
>  
> -void write_cr3(unsigned long cr3)
> +void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>  {
> -    unsigned long flags, cr4;
> +    unsigned long flags;
>      u32 t;
>  
>      /* This non-reentrant function is sometimes called in interrupt context. */
>      local_irq_save(flags);
>  
>      t = pre_flush();
> -    cr4 = read_cr4();
>  
> -    write_cr4(cr4 & ~X86_CR4_PGE);
> +    if ( read_cr4() & X86_CR4_PGE )
> +        write_cr4(cr4 & ~X86_CR4_PGE);
> +
>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
> -    write_cr4(cr4);
> +
> +    if ( read_cr4() != cr4 )

read_cr4(), despite being a cached read, isn't free because of the %rsp
manipulation required to access the variable.  I'd keep the locally
cached cr4, and use "if ( cr4 & X86_CR4_PGE )" here.

~Andrew

* Re: [PATCH v5 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-04-06 15:17   ` Andrew Cooper
@ 2018-04-06 15:41     ` Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-04-06 15:41 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: jbeulich

On 06/04/18 17:17, Andrew Cooper wrote:
> On 06/04/18 08:52, Juergen Gross wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
>>
>> In order to avoid states with cr3/cr4 having inconsistent values
>> (e.g. global pages being activated while cr3 already specifies a XPTI
>> address space) move loading of the new cr4 value to write_ptbase()
>> (actually to write_cr3_cr4() called by write_ptbase()).
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> This is contrary to the performance optimisations taken by Linux,
> Windows and Apple, which borrow Xen's 64-bit PV optimisation of having
> global user pages, because it really is a (small) performance improvement.
> 
> Are there performance numbers for this change alone, and/or are later
> changes strictly dependent on this functionality?

Yes and yes.

Performance is much better for XPTI (again my standard compile test):
elapsed time dropped from 112 to 106 seconds, system time from 160 to
139 seconds, user time from 187 to 186 seconds.

Page invalidation with PCID enabled is _much_ easier.

> On the Xen side of things, an argument could probably be made that the
> extra cr4 rewrites due to the L4 shadowing might eat away the
> performance we would otherwise gain, but I'd be hesitant to blindly
> assume that this is the case.

Another problem I wanted to avoid was the global page handling of Xen
private pages: I would have needed to remove all the global bits, either
unconditionally (even for AMD CPUs) or dynamically.

> A complicating factor is that Intel have said that the performance gains
> from user global pages would be more noticeable on older hardware, due
> to differences in the TLB architecture.
> 
>> ---
>> V4:
>> - don't use mmu_cr4_features for setting new cr4 value (Jan Beulich)
>> - use simpler scheme for setting X86_CR4_PGE in
>>   pv_guest_cr4_to_real_cr4() (Jan Beulich)
>>
>> V3:
>> - move cr4 loading for all domains from *_ctxt_switch_to() to
>>   write_cr3_cr4() called by write_ptbase() (Jan Beulich)
>> - rebase
>> ---
>>  xen/arch/x86/domain.c          |  5 -----
>>  xen/arch/x86/flushtlb.c        | 13 ++++++++-----
>>  xen/arch/x86/mm.c              | 14 +++++++++++---
>>  xen/arch/x86/x86_64/entry.S    | 10 ----------
>>  xen/common/efi/runtime.c       |  4 ++--
>>  xen/include/asm-x86/domain.h   |  3 ++-
>>  xen/include/asm-x86/flushtlb.h |  2 +-
>>  7 files changed, 24 insertions(+), 27 deletions(-)
>>
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index 9c229594f4..c2bb70c483 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1522,17 +1522,12 @@ void paravirt_ctxt_switch_from(struct vcpu *v)
>>  void paravirt_ctxt_switch_to(struct vcpu *v)
>>  {
>>      root_pgentry_t *root_pgt = this_cpu(root_pgt);
>> -    unsigned long cr4;
>>  
>>      if ( root_pgt )
>>          root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
>>              l4e_from_page(v->domain->arch.perdomain_l3_pg,
>>                            __PAGE_HYPERVISOR_RW);
>>  
>> -    cr4 = pv_guest_cr4_to_real_cr4(v);
>> -    if ( unlikely(cr4 != read_cr4()) )
>> -        write_cr4(cr4);
>> -
>>      if ( unlikely(v->arch.debugreg[7] & DR7_ACTIVE_MASK) )
>>          activate_debugregs(v);
>>  
>> diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
>> index f793b70696..5dcd9a2bf6 100644
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -89,20 +89,23 @@ static void do_tlb_flush(void)
>>      post_flush(t);
>>  }
>>  
>> -void write_cr3(unsigned long cr3)
>> +void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>  {
>> -    unsigned long flags, cr4;
>> +    unsigned long flags;
>>      u32 t;
>>  
>>      /* This non-reentrant function is sometimes called in interrupt context. */
>>      local_irq_save(flags);
>>  
>>      t = pre_flush();
>> -    cr4 = read_cr4();
>>  
>> -    write_cr4(cr4 & ~X86_CR4_PGE);
>> +    if ( read_cr4() & X86_CR4_PGE )
>> +        write_cr4(cr4 & ~X86_CR4_PGE);
>> +
>>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>> -    write_cr4(cr4);
>> +
>> +    if ( read_cr4() != cr4 )
> 
> read_cr4(), despite being a cached read, isn't free because of the %rsp
> manipulation required to access the variable.  I'd keep the locally
> cached cr4, and use "if ( cr4 & X86_CR4_PGE )" here.

I did this on purpose: cr4 might be modified in more bits than just
PGE. I can nevertheless cache the read cr4 value, of course.
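
A sketch of what that would look like (caching the value while still
handling cr4 changes beyond PGE; note the cached value has to track the
intermediate write, or a target cr4 with PGE set would never be
restored):

    unsigned long old_cr4 = read_cr4();

    if ( old_cr4 & X86_CR4_PGE )
    {
        old_cr4 = cr4 & ~X86_CR4_PGE;
        write_cr4(old_cr4);
    }

    asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );

    if ( old_cr4 != cr4 )
        write_cr4(cr4);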


Juergen

* Re: [PATCH v5 7/7] xen/x86: use PCID feature
  2018-04-06  7:52 ` [PATCH v5 7/7] xen/x86: use PCID feature Juergen Gross
@ 2018-04-06 17:30   ` Andrew Cooper
  2018-04-09  6:31     ` Juergen Gross
  2018-04-09 14:15   ` Jan Beulich
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-04-06 17:30 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: jbeulich

On 06/04/18 08:52, Juergen Gross wrote:
> Avoid flushing the complete TLB when switching %cr3 for mitigation of
> Meltdown by using the PCID feature if available.
>
> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
> 2 values for the non-XPTI case:
>
> - guest active and in kernel mode
> - guest active and in user mode
> - hypervisor active and guest in user mode (XPTI only)
> - hypervisor active and guest in kernel mode (XPTI only)

Information like this should be in the source code as well.  Either
x86/mm.h, or domain.h along with the defines.

>
> We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
> we disable global pages in cr4. A command line parameter controls in
> which cases PCID is being used.
>
> As the non-XPTI case has been shown not to perform better with PCID, at
> least on some machines, the default is to use PCID only for domains
> subject to XPTI.

Have we worked out why?  By simple analysis, a system-call heavy
workload should benefit massively from PCID, irrespective of XPTI.

If measurements say there is no improvement, then it probably means we
are flushing far too much somewhere.

>
> With PCID enabled we always disable global pages. This avoids having to
> either flush the complete TLB or do a cycle through all PCID values
> when invalidating a single global page.
>
> pv_guest_cr4_to_real_cr4() is switched from a macro to a real function
> now as it has become more complex.

I think it would help to split this change out.  This particular patch
is already complicated enough to review. :)

>
> Performance for the XPTI case improves a lot: parallel make of the Xen
> hypervisor on my machine reduced elapsed time from 107 to 93 seconds,
> system time was reduced from 140 to 103 seconds.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V5:
> - use X86_CR3_ADDR_MASK instead of ~X86_CR3_PCID_MASK (Jan Beulich)
> - add some const qualifiers (Jan Beulich)
> - mask X86_CR3_ADDR_MASK with PADDR_MASK (Jan Beulich)
> - add flushing the TLB from old PCID related entries in write_cr3_cr4()
>   (Jan Beulich)
>
> V4:
> - add cr3 mask for page table address and use that in dbg_pv_va2mfn()
>   (Jan Beulich)
> - use invpcid_flush_all_nonglobals() instead of invpcid_flush_all()
>   (Jan Beulich)
> - use PCIDs 0/1 when running in Xen or without XPTI, 2/3 with XPTI in
>   guest (Jan Beulich)
> - ASSERT cr4.pge and cr4.pcide are never active at the same time
>   (Jan Beulich)
> - make pv_guest_cr4_to_real_cr4() a real function
>
> V3:
> - support PCID for non-XPTI case, too
> - add command line parameter for controlling usage of PCID
> - check PCID active by using cr4.pcide (Jan Beulich)
> ---
>  docs/misc/xen-command-line.markdown | 12 ++++++
>  xen/arch/x86/debug.c                |  2 +-
>  xen/arch/x86/domain_page.c          |  2 +-
>  xen/arch/x86/domctl.c               |  4 ++
>  xen/arch/x86/flushtlb.c             | 54 ++++++++++++++++++++++++--
>  xen/arch/x86/mm.c                   | 24 +++++++++++-
>  xen/arch/x86/pv/dom0_build.c        |  1 +
>  xen/arch/x86/pv/domain.c            | 77 ++++++++++++++++++++++++++++++++++++-
>  xen/include/asm-x86/domain.h        | 15 +++-----
>  xen/include/asm-x86/processor.h     |  3 ++
>  xen/include/asm-x86/pv/domain.h     | 20 ++++++++++
>  xen/include/asm-x86/x86-defns.h     |  4 +-
>  12 files changed, 199 insertions(+), 19 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 5f6ae654ad..db87fd326d 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -1452,6 +1452,18 @@ All numbers specified must be hexadecimal ones.
>  
>  This option can be specified more than once (up to 8 times at present).
>  
> +### pcid (x86)
> +> `= <boolean> | xpti | noxpti`

Your implementation below is actually `= <boolean> | xpti=<boolean>`

This is subtly different for the final option, in which case
"pcid=no-xpti" is the string which is actually recognised.

> +
> +> Default: `xpti`
> +
> +> Can be modified at runtime

From the implementation, it looks like changes only apply to new
domains, not to existing ones?

> +
> +If available, control usage of the PCID feature of the processor for
> +64-bit pv-domains.

It is more complicated than this.  PCID only gets used for guests when
Xen is otherwise using INVPCID.

>  PCID can be used either for no domain at all (`false`),
> +for all of them (`true`), only for those subject to XPTI (`xpti`) or for
> +those not subject to XPTI (`noxpti`).
> +
>  ### ple\_gap
>  > `= <integer>`
>  
> diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
> index 9159f32db4..0d46f2f45a 100644
> --- a/xen/arch/x86/debug.c
> +++ b/xen/arch/x86/debug.c
> @@ -97,7 +97,7 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t pgd3val)
>      l3_pgentry_t l3e, *l3t;
>      l2_pgentry_t l2e, *l2t;
>      l1_pgentry_t l1e, *l1t;
> -    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
> +    unsigned long cr3 = pgd3val ?: (dp->vcpu[0]->arch.cr3 & X86_CR3_ADDR_MASK);

Is this needed?  The PCID is shifted out by the maddr_to_mfn(), and we'd
better never be storing the NOFLUSH bit in arch.cr3.

OTOH, I think reporting the PCID in the debug messages would be helpful.

>      mfn_t mfn = maddr_to_mfn(cr3);
>  
>      DBGP2("vaddr:%lx domid:%d cr3:%lx pgd3:%lx\n", vaddr, dp->domain_id, 
> diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
> index b5780f201f..b5af0e639a 100644
> --- a/xen/arch/x86/domain_page.c
> +++ b/xen/arch/x86/domain_page.c
> @@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
>          if ( (v = idle_vcpu[smp_processor_id()]) == current )
>              sync_local_execstate();
>          /* We must now be running on the idle page table. */
> -        ASSERT(read_cr3() == __pa(idle_pg_table));
> +        ASSERT((read_cr3() & X86_CR3_ADDR_MASK) == __pa(idle_pg_table));

I think we want probably want a cr3_pa() wrapper.

>      }
>  
>      return v;
> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
> index 0704f398c7..a7c8772fa6 100644
> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -613,7 +613,11 @@ long arch_do_domctl(
>              ret = -EINVAL;
>  
>          if ( ret == 0 )
> +        {
>              xpti_domain_init(d);
> +            pcid_domain_init(d);
> +        }
> +
>          break;
>  
>      case XEN_DOMCTL_get_address_size:
> diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
> index 5dcd9a2bf6..6c7d57b7aa 100644
> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -12,6 +12,7 @@
>  #include <asm/flushtlb.h>
>  #include <asm/invpcid.h>
>  #include <asm/page.h>
> +#include <asm/pv/domain.h>
>  
>  /* Debug builds: Wrap frequently to stress-test the wrap logic. */
>  #ifdef NDEBUG
> @@ -93,6 +94,7 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>  {
>      unsigned long flags;
>      u32 t;
> +    unsigned long old_pcid = read_cr3() & X86_CR3_PCID_MASK;

And a cr3_pcid() helper as well.

>  
>      /* This non-reentrant function is sometimes called in interrupt context. */
>      local_irq_save(flags);
> @@ -100,12 +102,35 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>      t = pre_flush();
>  
>      if ( read_cr4() & X86_CR4_PGE )
> +        /*
> +         * X86_CR4_PGE set means PCID being inactive.

"means PCID is inactive".

> +         * We have to purge the TLB via flipping cr4.pge.
> +         */
>          write_cr4(cr4 & ~X86_CR4_PGE);
> +    else if ( use_invpcid )
> +        /*
> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
> +         * will affect the new PCID only.
> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
> +         * will purge the TLB (we are in the "global pages off" branch).
> +         * invpcid_flush_all_nonglobals() seems to be faster than
> +         * invpcid_flush_all().

I'm afraid that I can't parse either of these sentences.

> +         */
> +        invpcid_flush_all_nonglobals();
>  
>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>  
>      if ( read_cr4() != cr4 )
>          write_cr4(cr4);
> +    else if ( old_pcid != (cr3 & X86_CR3_PCID_MASK) )
> +        /*
> +         * Make sure no TLB entries related to the old PCID created between
> +         * flushing the TLB and writing the new %cr3 value remain in the TLB.
> +         * Writing %cr3 is documented to be a speculation barrier, OTOH the
> +         * performance impact of the additional flush is next to invisible.
> +         * So better be safe than sorry.
> +         */
> +        invpcid_flush_single_context(old_pcid);
>  
>      post_flush(t);
>  
> @@ -132,11 +157,32 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>              /*
>               * We don't INVLPG multi-page regions because the 2M/4M/1G
>               * region may not have been mapped with a superpage. Also there
> -             * are various errata surrounding INVLPG usage on superpages, and
> -             * a full flush is in any case not *that* expensive.
> +             * are various errata surrounding INVLPG usage on superpages,
> +             * and a full flush is in any case not *that* expensive.
>               */
> -            asm volatile ( "invlpg %0"
> -                           : : "m" (*(const char *)(va)) : "memory" );
> +            if ( read_cr4() & X86_CR4_PCIDE )
> +            {
> +                unsigned long addr = (unsigned long)va;
> +
> +                /*
> +                 * Flush the addresses for all potential address spaces.
> +                 * We can't check the current domain for being subject to
> +                 * XPTI as current might be the idle vcpu while we still have
> +                 * some XPTI domain TLB entries.
> +                 * Using invpcid is okay here, as with PCID enabled we always
> +                 * have global pages disabled.
> +                 */
> +                invpcid_flush_one(PCID_PV_PRIV, addr);
> +                invpcid_flush_one(PCID_PV_USER, addr);
> +                if ( !cpu_has_no_xpti )
> +                {
> +                    invpcid_flush_one(PCID_PV_PRIV | PCID_PV_XPTI, addr);
> +                    invpcid_flush_one(PCID_PV_USER | PCID_PV_XPTI, addr);
> +                }
> +            }
> +            else
> +                asm volatile ( "invlpg %0"
> +                               : : "m" (*(const char *)(va)) : "memory" );
>          }
>          else
>              do_tlb_flush();
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 03aa44be76..d31ada0dc9 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -127,6 +127,7 @@
>  #include <asm/processor.h>
>  
>  #include <asm/hvm/grant_table.h>
> +#include <asm/pv/domain.h>
>  #include <asm/pv/grant_table.h>
>  #include <asm/pv/mm.h>
>  
> @@ -500,7 +501,26 @@ void free_shared_domheap_page(struct page_info *page)
>  
>  void make_cr3(struct vcpu *v, mfn_t mfn)
>  {
> +    struct domain *d = v->domain;
> +
>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
> +    if ( is_pv_domain(d) && d->arch.pv_domain.pcid )
> +        v->arch.cr3 |= get_pcid_bits(v, false);
> +}
> +
> +unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
> +{
> +    const struct domain *d = v->domain;
> +    unsigned long cr4;
> +
> +    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
> +    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
> +                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
> +    cr4 |= (d->arch.pv_domain.xpti || d->arch.pv_domain.pcid) ? 0 : X86_CR4_PGE;
> +    cr4 |= d->arch.pv_domain.pcid ? X86_CR4_PCIDE : 0;
> +    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
> +
> +    return cr4;
>  }
>  
>  void write_ptbase(struct vcpu *v)
> @@ -510,12 +530,14 @@ void write_ptbase(struct vcpu *v)
>  
>      new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>                ? pv_guest_cr4_to_real_cr4(v)
> -              : ((read_cr4() & ~X86_CR4_TSD) | X86_CR4_PGE);
> +              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
>  
>      if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
>      {
>          cpu_info->root_pgt_changed = true;
>          cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
> +        if ( new_cr4 & X86_CR4_PCIDE )
> +            cpu_info->pv_cr3 |= get_pcid_bits(v, true);
>          write_cr3_cr4(v->arch.cr3, new_cr4);
>      }
>      else
> diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
> index 77186c19bd..2af0094e95 100644
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -709,6 +709,7 @@ int __init dom0_construct_pv(struct domain *d,
>      }
>  
>      xpti_domain_init(d);
> +    pcid_domain_init(d);
>  
>      d->arch.paging.mode = 0;
>  
> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> index 2bef9c48bc..bfaab414e1 100644
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -10,6 +10,7 @@
>  #include <xen/sched.h>
>  
>  #include <asm/cpufeature.h>
> +#include <asm/invpcid.h>
>  #include <asm/msr-index.h>
>  #include <asm/pv/domain.h>
>  
> @@ -94,6 +95,70 @@ void xpti_domain_init(struct domain *d)
>      }
>  }
>  
> +static __read_mostly enum {
> +    PCID_OFF,
> +    PCID_ALL,
> +    PCID_XPTI,
> +    PCID_NOXPTI
> +} opt_pcid = PCID_XPTI;
> +
> +static __init int parse_pcid(const char *s)
> +{
> +    int rc = 0;
> +
> +    switch ( parse_bool(s, NULL) )
> +    {
> +    case 0:
> +        opt_pcid = PCID_OFF;
> +        break;
> +    case 1:
> +        opt_pcid = PCID_ALL;
> +        break;
> +    default:
> +        switch ( parse_boolean("xpti", s, NULL) )
> +        {
> +        case 0:
> +            opt_pcid = PCID_NOXPTI;
> +            break;
> +        case 1:
> +            opt_pcid = PCID_XPTI;
> +            break;
> +        default:
> +            rc = -EINVAL;
> +            break;
> +        }
> +        break;
> +    }
> +
> +    return rc;
> +}
> +custom_runtime_param("pcid", parse_pcid);
> +
> +void pcid_domain_init(struct domain *d)
> +{
> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) ||
> +         !use_invpcid || !cpu_has_pcid )
> +        return;
> +
> +    switch ( opt_pcid )
> +    {
> +    case PCID_OFF:
> +        break;
> +    case PCID_ALL:
> +        d->arch.pv_domain.pcid = true;
> +        break;
> +    case PCID_XPTI:
> +        d->arch.pv_domain.pcid = d->arch.pv_domain.xpti;
> +        break;
> +    case PCID_NOXPTI:
> +        d->arch.pv_domain.pcid = !d->arch.pv_domain.xpti;
> +        break;
> +    default:
> +        ASSERT_UNREACHABLE();
> +        break;
> +    }
> +}
> +
>  static void noreturn continue_nonidle_domain(struct vcpu *v)
>  {
>      check_wakeup_from_wait();
> @@ -298,9 +363,19 @@ int pv_domain_initialise(struct domain *d)
>  
>  static void _toggle_guest_pt(struct vcpu *v)
>  {
> +    const struct domain *d = v->domain;
> +
>      v->arch.flags ^= TF_kernel_mode;
>      update_cr3(v);
> -    get_cpu_info()->root_pgt_changed = true;
> +    if ( d->arch.pv_domain.xpti )
> +    {
> +        struct cpu_info *cpu_info = get_cpu_info();
> +
> +        cpu_info->root_pgt_changed = true;
> +        cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
> +                           (d->arch.pv_domain.pcid
> +                            ? get_pcid_bits(v, true) : 0);
> +    }
>  
>      /* Don't flush user global mappings from the TLB. Don't tick TLB clock. */
>      asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
> index b7894dc8c8..8b66096e7f 100644
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -255,6 +255,8 @@ struct pv_domain
>  
>      /* XPTI active? */
>      bool xpti;
> +    /* Use PCID feature? */
> +    bool pcid;
>  
>      /* map_domain_page() mapping cache. */
>      struct mapcache_domain mapcache;
> @@ -615,19 +617,12 @@ void vcpu_show_registers(const struct vcpu *);
>  unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
>  
>  /* Convert between guest-visible and real CR4 values. */
> -#define pv_guest_cr4_to_real_cr4(v)                         \
> -    (((v)->arch.pv_vcpu.ctrlreg[4]                          \
> -      | (mmu_cr4_features                                   \
> -         & (X86_CR4_PSE | X86_CR4_SMEP |                    \
> -            X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
> -            X86_CR4_FSGSBASE))                              \
> -      | ((v)->domain->arch.pv_domain.xpti ? 0 : X86_CR4_PGE) \
> -      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
> -     & ~X86_CR4_DE)
> +unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
> +
>  #define real_cr4_to_pv_guest_cr4(c)                         \
>      ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
>               X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
> -             X86_CR4_FSGSBASE | X86_CR4_SMAP))
> +             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
>  
>  #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
>  
> diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
> index db9988ab33..3067a8c58f 100644
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -290,6 +290,9 @@ static inline unsigned long read_cr4(void)
>  
>  static inline void write_cr4(unsigned long val)
>  {
> +    /* No global pages in case of PCIDs enabled! */
> +    ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
> +
>      get_cpu_info()->cr4 = val;
>      asm volatile ( "mov %0,%%cr4" : : "r" (val) );
>  }
> diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
> index 911e5dc07f..3c8c8f4ccc 100644
> --- a/xen/include/asm-x86/pv/domain.h
> +++ b/xen/include/asm-x86/pv/domain.h
> @@ -21,6 +21,24 @@
>  #ifndef __X86_PV_DOMAIN_H__
>  #define __X86_PV_DOMAIN_H__
>  
> +/* PCID values for the address spaces of 64-bit pv domains: */
> +#define PCID_PV_PRIV      0x0000    /* Used for other domains, too. */
> +#define PCID_PV_USER      0x0001
> +#define PCID_PV_XPTI      0x0002    /* To be ORed to above values. */
> +
> +/*
> + * Return additional PCID specific cr3 bits.
> + *
> + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
> + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCID_MASK in case
> + * the value is used to address the root page table.
> + */
> +static inline unsigned long get_pcid_bits(const struct vcpu *v, bool is_xpti)
> +{
> +    return X86_CR3_NOFLUSH | (is_xpti ? PCID_PV_XPTI : 0) |
> +           ((v->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);

Constructing the correct pa+pcid value for a vcpu is independent of
whether we wish to not flush mappings when switching CR3.

Also, can't is_xpti be derived from v directly?

~Andrew

> +}
> +
>  #ifdef CONFIG_PV
>  
>  void pv_vcpu_destroy(struct vcpu *v);
> @@ -29,6 +47,7 @@ void pv_domain_destroy(struct domain *d);
>  int pv_domain_initialise(struct domain *d);
>  void xpti_init(void);
>  void xpti_domain_init(struct domain *d);
> +void pcid_domain_init(struct domain *d);
>  
>  #else  /* !CONFIG_PV */
>  
> @@ -40,6 +59,7 @@ static inline void pv_domain_destroy(struct domain *d) {}
>  static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
>  static inline void xpti_init(void) {}
>  static inline void xpti_domain_init(struct domain *d) {}
> +static inline void pcid_domain_init(struct domain *d) {}
>  
>  #endif	/* CONFIG_PV */
>  
> diff --git a/xen/include/asm-x86/x86-defns.h b/xen/include/asm-x86/x86-defns.h
> index ff8d66be3c..123560897e 100644
> --- a/xen/include/asm-x86/x86-defns.h
> +++ b/xen/include/asm-x86/x86-defns.h
> @@ -45,7 +45,9 @@
>  /*
>   * Intel CPU flags in CR3
>   */
> -#define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
> +#define X86_CR3_NOFLUSH    (_AC(1, ULL) << 63)
> +#define X86_CR3_ADDR_MASK  (PAGE_MASK & PADDR_MASK & ~X86_CR3_NOFLUSH)
> +#define X86_CR3_PCID_MASK  _AC(0x0fff, ULL) /* Mask for PCID */
>  
>  /*
>   * Intel CPU features in CR4


* Re: [PATCH v5 7/7] xen/x86: use PCID feature
  2018-04-06 17:30   ` Andrew Cooper
@ 2018-04-09  6:31     ` Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-04-09  6:31 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: jbeulich

On 06/04/18 19:30, Andrew Cooper wrote:
> On 06/04/18 08:52, Juergen Gross wrote:
>> Avoid flushing the complete TLB when switching %cr3 for mitigation of
>> Meltdown by using the PCID feature if available.
>>
>> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
>> 2 values for the non-XPTI case:
>>
>> - guest active and in kernel mode
>> - guest active and in user mode
>> - hypervisor active and guest in user mode (XPTI only)
>> - hypervisor active and guest in kernel mode (XPTI only)
> 
> Information like this should be in the source code as well.  Either
> x86/mm.h, or domain.h along with the defines.

Okay.

> 
>>
>> We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
>> we disable global pages in cr4. A command line parameter controls in
>> which cases PCID is being used.
>>
>> As the non-XPTI case has been shown not to perform better with PCID, at
>> least on some machines, the default is to use PCID only for domains
>> subject to XPTI.
> 
> Have we worked out why?  By simple analysis, a system-call heavy
> workload should benefit massively from PCID, irrespective of XPTI.

I did only very basic performance tests. So the possible explanations
that come to mind are:

- doing a build of the hypervisor doesn't take advantage of using PCIDs,
  other benchmarks might do

- this might be an effect showing on some processors only, e.g. on my
  system

- my current implementation is using no global pages when PCIDs are in
  use; this might be fine for XPTI, while the non-XPTI case might
  perform better with global pages _and_ PCID

> If measurements say there is no improvement, then it probably means we
> are flushing far too much somewhere.

I'm not aware of such a problem.

> 
>>
>> With PCID enabled we always disable global pages. This avoids having to
>> either flush the complete TLB or do a cycle through all PCID values
>> when invalidating a single global page.
>>
>> pv_guest_cr4_to_real_cr4() is switched from a macro to a real function
>> now as it has become more complex.
> 
> I think it would help to split this change out.  This particular patch
> is already complicated enough to review. :)

Works for me.

> 
>>
>> Performance for the XPTI case improves a lot: parallel make of the Xen
>> hypervisor on my machine reduced elapsed time from 107 to 93 seconds,
>> system time was reduced from 140 to 103 seconds.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> V5:
>> - use X86_CR3_ADDR_MASK instead of ~X86_CR3_PCID_MASK (Jan Beulich)
>> - add some const qualifiers (Jan Beulich)
>> - mask X86_CR3_ADDR_MASK with PADDR_MASK (Jan Beulich)
>> - add flushing the TLB from old PCID related entries in write_cr3_cr4()
>>   (Jan Beulich)
>>
>> V4:
>> - add cr3 mask for page table address and use that in dbg_pv_va2mfn()
>>   (Jan Beulich)
>> - use invpcid_flush_all_nonglobals() instead of invpcid_flush_all()
>>   (Jan Beulich)
>> - use PCIDs 0/1 when running in Xen or without XPTI, 2/3 with XPTI in
>>   guest (Jan Beulich)
>> - ASSERT cr4.pge and cr4.pcide are never active at the same time
>>   (Jan Beulich)
>> - make pv_guest_cr4_to_real_cr4() a real function
>>
>> V3:
>> - support PCID for non-XPTI case, too
>> - add command line parameter for controlling usage of PCID
>> - check PCID active by using cr4.pcide (Jan Beulich)
>> ---
>>  docs/misc/xen-command-line.markdown | 12 ++++++
>>  xen/arch/x86/debug.c                |  2 +-
>>  xen/arch/x86/domain_page.c          |  2 +-
>>  xen/arch/x86/domctl.c               |  4 ++
>>  xen/arch/x86/flushtlb.c             | 54 ++++++++++++++++++++++++--
>>  xen/arch/x86/mm.c                   | 24 +++++++++++-
>>  xen/arch/x86/pv/dom0_build.c        |  1 +
>>  xen/arch/x86/pv/domain.c            | 77 ++++++++++++++++++++++++++++++++++++-
>>  xen/include/asm-x86/domain.h        | 15 +++-----
>>  xen/include/asm-x86/processor.h     |  3 ++
>>  xen/include/asm-x86/pv/domain.h     | 20 ++++++++++
>>  xen/include/asm-x86/x86-defns.h     |  4 +-
>>  12 files changed, 199 insertions(+), 19 deletions(-)
>>
>> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
>> index 5f6ae654ad..db87fd326d 100644
>> --- a/docs/misc/xen-command-line.markdown
>> +++ b/docs/misc/xen-command-line.markdown
>> @@ -1452,6 +1452,18 @@ All numbers specified must be hexadecimal ones.
>>  
>>  This option can be specified more than once (up to 8 times at present).
>>  
>> +### pcid (x86)
>> +> `= <boolean> | xpti | noxpti`
> 
> Your implementation below is actually `= <boolean> | xpti=<boolean>`
> 
> This is subtly different for the final option, in which case
> "pcid=no-xpti" is the string which is actually recognised.

Oh, sorry. Will correct it.

> 
>> +
>> +> Default: `xpti`
>> +
>> +> Can be modified at runtime
> 
> From the implementation, it looks like changes only apply to new
> domains, not to existing ones?

Correct. I will make this more clear.

> 
>> +
>> +If available, control usage of the PCID feature of the processor for
>> +64-bit pv-domains.
> 
> It is more complicated than this.  PCID only gets used for guests when
> Xen is otherwise using INVPCID.

I'll add another sentence explaining that.
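
Taken together the entry would then read roughly like this (a sketch of
the reworked text, exact wording still open):

    ### pcid (x86)
    > `= <boolean> | xpti=<boolean>`

    > Default: `xpti`

    > Can be modified at runtime (only affects domains created after the
    > change)

    If INVPCID is being used by Xen, control usage of the PCID feature
    of the processor for 64-bit pv-domains.  PCID can be used either for
    no domain at all (`false`), for all of them (`true`), only for those
    subject to XPTI (`xpti`) or only for those not subject to XPTI
    (`no-xpti`).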

> 
>>  PCID can be used either for no domain at all (`false`),
>> +for all of them (`true`), only for those subject to XPTI (`xpti`) or for
>> +those not subject to XPTI (`noxpti`).
>> +
>>  ### ple\_gap
>>  > `= <integer>`
>>  
>> diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
>> index 9159f32db4..0d46f2f45a 100644
>> --- a/xen/arch/x86/debug.c
>> +++ b/xen/arch/x86/debug.c
>> @@ -97,7 +97,7 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t pgd3val)
>>      l3_pgentry_t l3e, *l3t;
>>      l2_pgentry_t l2e, *l2t;
>>      l1_pgentry_t l1e, *l1t;
>> -    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
>> +    unsigned long cr3 = pgd3val ?: (dp->vcpu[0]->arch.cr3 & X86_CR3_ADDR_MASK);
> 
> Is this needed?  The PCID is shifted out by the maddr_to_mfn(), and we'd
> better never be storing the NOFLUSH bit in arch.cr3.

I had that discussion with Jan already.

Jan wanted me to add the mask here.

And not storing NOFLUSH in arch.cr3 would mean making the assembly code
more complicated, as it would have to decide whether to add NOFLUSH to
the value before writing it to cr3.

> 
> OTOH, I think reporting the PCID in the debug messages would be helpful.
> 
>>      mfn_t mfn = maddr_to_mfn(cr3);
>>  
>>      DBGP2("vaddr:%lx domid:%d cr3:%lx pgd3:%lx\n", vaddr, dp->domain_id, 
>> diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
>> index b5780f201f..b5af0e639a 100644
>> --- a/xen/arch/x86/domain_page.c
>> +++ b/xen/arch/x86/domain_page.c
>> @@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
>>          if ( (v = idle_vcpu[smp_processor_id()]) == current )
>>              sync_local_execstate();
>>          /* We must now be running on the idle page table. */
>> -        ASSERT(read_cr3() == __pa(idle_pg_table));
>> +        ASSERT((read_cr3() & X86_CR3_ADDR_MASK) == __pa(idle_pg_table));
> 
> I think we want probably want a cr3_pa() wrapper.

Okay. New patch, I guess?

> 
>>      }
>>  
>>      return v;
>> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
>> index 0704f398c7..a7c8772fa6 100644
>> --- a/xen/arch/x86/domctl.c
>> +++ b/xen/arch/x86/domctl.c
>> @@ -613,7 +613,11 @@ long arch_do_domctl(
>>              ret = -EINVAL;
>>  
>>          if ( ret == 0 )
>> +        {
>>              xpti_domain_init(d);
>> +            pcid_domain_init(d);
>> +        }
>> +
>>          break;
>>  
>>      case XEN_DOMCTL_get_address_size:
>> diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
>> index 5dcd9a2bf6..6c7d57b7aa 100644
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -12,6 +12,7 @@
>>  #include <asm/flushtlb.h>
>>  #include <asm/invpcid.h>
>>  #include <asm/page.h>
>> +#include <asm/pv/domain.h>
>>  
>>  /* Debug builds: Wrap frequently to stress-test the wrap logic. */
>>  #ifdef NDEBUG
>> @@ -93,6 +94,7 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>  {
>>      unsigned long flags;
>>      u32 t;
>> +    unsigned long old_pcid = read_cr3() & X86_CR3_PCID_MASK;
> 
> And a cr3_pcid() helper as well.

Okay.
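
Both accessors would be trivial (a sketch; whether they live next to the
X86_CR3_* definitions or in processor.h is still open):

    static inline unsigned long cr3_pa(unsigned long cr3)
    {
        return cr3 & X86_CR3_ADDR_MASK;
    }

    static inline unsigned long cr3_pcid(unsigned long cr3)
    {
        return cr3 & X86_CR3_PCID_MASK;
    }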

> 
>>  
>>      /* This non-reentrant function is sometimes called in interrupt context. */
>>      local_irq_save(flags);
>> @@ -100,12 +102,35 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>      t = pre_flush();
>>  
>>      if ( read_cr4() & X86_CR4_PGE )
>> +        /*
>> +         * X86_CR4_PGE set means PCID being inactive.
> 
> "means PCID is inactive".

Okay.

> 
>> +         * We have to purge the TLB via flipping cr4.pge.
>> +         */
>>          write_cr4(cr4 & ~X86_CR4_PGE);
>> +    else if ( use_invpcid )
>> +        /*
>> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
>> +         * will affect the new PCID only.
>> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
>> +         * will purge the TLB (we are in the "global pages off" branch).
>> +         * invpcid_flush_all_nonglobals() seems to be faster than
>> +         * invpcid_flush_all().
> 
> I'm afraid that I can't parse either of these sentences.

I'll rephrase the comment.
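
Maybe something along these lines (just a draft, exact wording to be
settled in the next version):

    /*
     * If we are using PCID, the TLB has to be flushed via INVPCID here:
     * writing %cr3 below will only flush entries tagged with the new
     * PCID, not those of other PCIDs.
     * If INVPCID is not supported, PCIDs are not used either, and writing
     * %cr3 (we are in the "global pages off" branch) already flushes the
     * TLB, so nothing needs to be done in that case.
     * invpcid_flush_all_nonglobals() seems to be faster than
     * invpcid_flush_all(), which is why it is used here.
     */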

> 
>> +         */
>> +        invpcid_flush_all_nonglobals();
>>  
>>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>>  
>>      if ( read_cr4() != cr4 )
>>          write_cr4(cr4);
>> +    else if ( old_pcid != (cr3 & X86_CR3_PCID_MASK) )
>> +        /*
>> +         * Make sure no TLB entries related to the old PCID created between
>> +         * flushing the TLB and writing the new %cr3 value remain in the TLB.
>> +         * Writing %cr3 is documented to be a speculation barrier, OTOH the
>> +         * performance impact of the additional flush is next to invisible.
>> +         * So better be save than sorry.
>> +         */
>> +        invpcid_flush_single_context(old_pcid);
>>  
>>      post_flush(t);
>>  
>> @@ -132,11 +157,32 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>>              /*
>>               * We don't INVLPG multi-page regions because the 2M/4M/1G
>>               * region may not have been mapped with a superpage. Also there
>> -             * are various errata surrounding INVLPG usage on superpages, and
>> -             * a full flush is in any case not *that* expensive.
>> +             * are various errata surrounding INVLPG usage on superpages,
>> +             * and a full flush is in any case not *that* expensive.
>>               */
>> -            asm volatile ( "invlpg %0"
>> -                           : : "m" (*(const char *)(va)) : "memory" );
>> +            if ( read_cr4() & X86_CR4_PCIDE )
>> +            {
>> +                unsigned long addr = (unsigned long)va;
>> +
>> +                /*
>> +                 * Flush the addresses for all potential address spaces.
>> +                 * We can't check the current domain for being subject to
>> +                 * XPTI as current might be the idle vcpu while we still have
>> +                 * some XPTI domain TLB entries.
>> +                 * Using invpcid is okay here, as with PCID enabled we always
>> +                 * have global pages disabled.
>> +                 */
>> +                invpcid_flush_one(PCID_PV_PRIV, addr);
>> +                invpcid_flush_one(PCID_PV_USER, addr);
>> +                if ( !cpu_has_no_xpti )
>> +                {
>> +                    invpcid_flush_one(PCID_PV_PRIV | PCID_PV_XPTI, addr);
>> +                    invpcid_flush_one(PCID_PV_USER | PCID_PV_XPTI, addr);
>> +                }
>> +            }
>> +            else
>> +                asm volatile ( "invlpg %0"
>> +                               : : "m" (*(const char *)(va)) : "memory" );
>>          }
>>          else
>>              do_tlb_flush();
>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>> index 03aa44be76..d31ada0dc9 100644
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -127,6 +127,7 @@
>>  #include <asm/processor.h>
>>  
>>  #include <asm/hvm/grant_table.h>
>> +#include <asm/pv/domain.h>
>>  #include <asm/pv/grant_table.h>
>>  #include <asm/pv/mm.h>
>>  
>> @@ -500,7 +501,26 @@ void free_shared_domheap_page(struct page_info *page)
>>  
>>  void make_cr3(struct vcpu *v, mfn_t mfn)
>>  {
>> +    struct domain *d = v->domain;
>> +
>>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
>> +    if ( is_pv_domain(d) && d->arch.pv_domain.pcid )
>> +        v->arch.cr3 |= get_pcid_bits(v, false);
>> +}
>> +
>> +unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
>> +{
>> +    const struct domain *d = v->domain;
>> +    unsigned long cr4;
>> +
>> +    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
>> +    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
>> +                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
>> +    cr4 |= (d->arch.pv_domain.xpti || d->arch.pv_domain.pcid) ? 0 : X86_CR4_PGE;
>> +    cr4 |= d->arch.pv_domain.pcid ? X86_CR4_PCIDE : 0;
>> +    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
>> +
>> +    return cr4;
>>  }
>>  
>>  void write_ptbase(struct vcpu *v)
>> @@ -510,12 +530,14 @@ void write_ptbase(struct vcpu *v)
>>  
>>      new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>                ? pv_guest_cr4_to_real_cr4(v)
>> -              : ((read_cr4() & ~X86_CR4_TSD) | X86_CR4_PGE);
>> +              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
>>  
>>      if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
>>      {
>>          cpu_info->root_pgt_changed = true;
>>          cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
>> +        if ( new_cr4 & X86_CR4_PCIDE )
>> +            cpu_info->pv_cr3 |= get_pcid_bits(v, true);
>>          write_cr3_cr4(v->arch.cr3, new_cr4);
>>      }
>>      else
>> diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
>> index 77186c19bd..2af0094e95 100644
>> --- a/xen/arch/x86/pv/dom0_build.c
>> +++ b/xen/arch/x86/pv/dom0_build.c
>> @@ -709,6 +709,7 @@ int __init dom0_construct_pv(struct domain *d,
>>      }
>>  
>>      xpti_domain_init(d);
>> +    pcid_domain_init(d);
>>  
>>      d->arch.paging.mode = 0;
>>  
>> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
>> index 2bef9c48bc..bfaab414e1 100644
>> --- a/xen/arch/x86/pv/domain.c
>> +++ b/xen/arch/x86/pv/domain.c
>> @@ -10,6 +10,7 @@
>>  #include <xen/sched.h>
>>  
>>  #include <asm/cpufeature.h>
>> +#include <asm/invpcid.h>
>>  #include <asm/msr-index.h>
>>  #include <asm/pv/domain.h>
>>  
>> @@ -94,6 +95,70 @@ void xpti_domain_init(struct domain *d)
>>      }
>>  }
>>  
>> +static __read_mostly enum {
>> +    PCID_OFF,
>> +    PCID_ALL,
>> +    PCID_XPTI,
>> +    PCID_NOXPTI
>> +} opt_pcid = PCID_XPTI;
>> +
>> +static __init int parse_pcid(const char *s)
>> +{
>> +    int rc = 0;
>> +
>> +    switch ( parse_bool(s, NULL) )
>> +    {
>> +    case 0:
>> +        opt_pcid = PCID_OFF;
>> +        break;
>> +    case 1:
>> +        opt_pcid = PCID_ALL;
>> +        break;
>> +    default:
>> +        switch ( parse_boolean("xpti", s, NULL) )
>> +        {
>> +        case 0:
>> +            opt_pcid = PCID_NOXPTI;
>> +            break;
>> +        case 1:
>> +            opt_pcid = PCID_XPTI;
>> +            break;
>> +        default:
>> +            rc = -EINVAL;
>> +            break;
>> +        }
>> +        break;
>> +    }
>> +
>> +    return rc;
>> +}
>> +custom_runtime_param("pcid", parse_pcid);
>> +
>> +void pcid_domain_init(struct domain *d)
>> +{
>> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) ||
>> +         !use_invpcid || !cpu_has_pcid )
>> +        return;
>> +
>> +    switch ( opt_pcid )
>> +    {
>> +    case PCID_OFF:
>> +        break;
>> +    case PCID_ALL:
>> +        d->arch.pv_domain.pcid = true;
>> +        break;
>> +    case PCID_XPTI:
>> +        d->arch.pv_domain.pcid = d->arch.pv_domain.xpti;
>> +        break;
>> +    case PCID_NOXPTI:
>> +        d->arch.pv_domain.pcid = !d->arch.pv_domain.xpti;
>> +        break;
>> +    default:
>> +        ASSERT_UNREACHABLE();
>> +        break;
>> +    }
>> +}
>> +
>>  static void noreturn continue_nonidle_domain(struct vcpu *v)
>>  {
>>      check_wakeup_from_wait();
>> @@ -298,9 +363,19 @@ int pv_domain_initialise(struct domain *d)
>>  
>>  static void _toggle_guest_pt(struct vcpu *v)
>>  {
>> +    const struct domain *d = v->domain;
>> +
>>      v->arch.flags ^= TF_kernel_mode;
>>      update_cr3(v);
>> -    get_cpu_info()->root_pgt_changed = true;
>> +    if ( d->arch.pv_domain.xpti )
>> +    {
>> +        struct cpu_info *cpu_info = get_cpu_info();
>> +
>> +        cpu_info->root_pgt_changed = true;
>> +        cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
>> +                           (d->arch.pv_domain.pcid
>> +                            ? get_pcid_bits(v, true) : 0);
>> +    }
>>  
>>      /* Don't flush user global mappings from the TLB. Don't tick TLB clock. */
>>      asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
>> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
>> index b7894dc8c8..8b66096e7f 100644
>> --- a/xen/include/asm-x86/domain.h
>> +++ b/xen/include/asm-x86/domain.h
>> @@ -255,6 +255,8 @@ struct pv_domain
>>  
>>      /* XPTI active? */
>>      bool xpti;
>> +    /* Use PCID feature? */
>> +    bool pcid;
>>  
>>      /* map_domain_page() mapping cache. */
>>      struct mapcache_domain mapcache;
>> @@ -615,19 +617,12 @@ void vcpu_show_registers(const struct vcpu *);
>>  unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
>>  
>>  /* Convert between guest-visible and real CR4 values. */
>> -#define pv_guest_cr4_to_real_cr4(v)                         \
>> -    (((v)->arch.pv_vcpu.ctrlreg[4]                          \
>> -      | (mmu_cr4_features                                   \
>> -         & (X86_CR4_PSE | X86_CR4_SMEP |                    \
>> -            X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
>> -            X86_CR4_FSGSBASE))                              \
>> -      | ((v)->domain->arch.pv_domain.xpti ? 0 : X86_CR4_PGE) \
>> -      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
>> -     & ~X86_CR4_DE)
>> +unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
>> +
>>  #define real_cr4_to_pv_guest_cr4(c)                         \
>>      ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
>>               X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
>> -             X86_CR4_FSGSBASE | X86_CR4_SMAP))
>> +             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
>>  
>>  #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
>>  
>> diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
>> index db9988ab33..3067a8c58f 100644
>> --- a/xen/include/asm-x86/processor.h
>> +++ b/xen/include/asm-x86/processor.h
>> @@ -290,6 +290,9 @@ static inline unsigned long read_cr4(void)
>>  
>>  static inline void write_cr4(unsigned long val)
>>  {
>> +    /* No global pages in case of PCIDs enabled! */
>> +    ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
>> +
>>      get_cpu_info()->cr4 = val;
>>      asm volatile ( "mov %0,%%cr4" : : "r" (val) );
>>  }
>> diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
>> index 911e5dc07f..3c8c8f4ccc 100644
>> --- a/xen/include/asm-x86/pv/domain.h
>> +++ b/xen/include/asm-x86/pv/domain.h
>> @@ -21,6 +21,24 @@
>>  #ifndef __X86_PV_DOMAIN_H__
>>  #define __X86_PV_DOMAIN_H__
>>  
>> +/* PCID values for the address spaces of 64-bit pv domains: */
>> +#define PCID_PV_PRIV      0x0000    /* Used for other domains, too. */
>> +#define PCID_PV_USER      0x0001
>> +#define PCID_PV_XPTI      0x0002    /* To be ORed to above values. */
>> +
>> +/*
>> + * Return additional PCID specific cr3 bits.
>> + *
>> + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
>> + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in case
>> + * the value is used to address the root page table.
>> + */
>> +static inline unsigned long get_pcid_bits(const struct vcpu *v, bool is_xpti)
>> +{
>> +    return X86_CR3_NOFLUSH | (is_xpti ? PCID_PV_XPTI : 0) |
>> +           ((v->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);
> 
> Constructing the correct pa+pcid value for a vcpu is independent of
> whether we wish to not flush mappings when switching CR3.

I'd have to add the NOFLUSH bit at all places where get_pcid_bits() is
being called.
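
E.g. make_cr3() would then have to read something like (sketch only):

    v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
    if ( is_pv_domain(d) && d->arch.pv_domain.pcid )
        v->arch.cr3 |= X86_CR3_NOFLUSH | get_pcid_bits(v, false);

and similarly for the other call sites.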

> 
> Also, can't is_xpti be derived from v directly?

No, it also depends on whether the address space is meant for guest
mode or hypervisor mode.
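
For illustration, the two flavours used in the hunks above:

    v->arch.cr3 |= get_pcid_bits(v, false);      /* make_cr3(): the domain's own L4 */
    cpu_info->pv_cr3 |= get_pcid_bits(v, true);  /* write_ptbase(): per-cpu root_pgt used with XPTI */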


Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 7/7] xen/x86: use PCID feature
  2018-04-06  7:52 ` [PATCH v5 7/7] xen/x86: use PCID feature Juergen Gross
  2018-04-06 17:30   ` Andrew Cooper
@ 2018-04-09 14:15   ` Jan Beulich
  2018-04-09 14:33     ` Juergen Gross
  1 sibling, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2018-04-09 14:15 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, xen-devel

>>> On 06.04.18 at 09:52, <jgross@suse.com> wrote:
> @@ -100,12 +102,35 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>      t = pre_flush();
>  
>      if ( read_cr4() & X86_CR4_PGE )
> +        /*
> +         * X86_CR4_PGE set means PCID being inactive.
> +         * We have to purge the TLB via flipping cr4.pge.
> +         */
>          write_cr4(cr4 & ~X86_CR4_PGE);
> +    else if ( use_invpcid )
> +        /*
> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
> +         * will affect the new PCID only.
> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
> +         * will purge the TLB (we are in the "global pages off" branch).
> +         * invpcid_flush_all_nonglobals() seems to be faster than
> +         * invpcid_flush_all().
> +         */
> +        invpcid_flush_all_nonglobals();
>  
>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>  
>      if ( read_cr4() != cr4 )
>          write_cr4(cr4);
> +    else if ( old_pcid != (cr3 & X86_CR3_PCID_MASK) )
> +        /*
> +         * Make sure no TLB entries related to the old PCID created between
> +         * flushing the TLB and writing the new %cr3 value remain in the TLB.
> +         * Writing %cr3 is documented to be a speculation barrier, OTOH the
> +         * performance impact of the additional flush is next to invisible.
> +         * So better be save than sorry.
> +         */
> +        invpcid_flush_single_context(old_pcid);

I'm not really happy about this comment. The CR3 write being a
speculation barrier is of no real interest here. Until the CPU's
speculation logic reaches that insn, all sorts of things can happen.
We don't even know the exact code the compiler will generate,
much less what that code will trigger inside the CPU.

Also I think it's "safe" in the last sentence.

> --- a/xen/include/asm-x86/x86-defns.h
> +++ b/xen/include/asm-x86/x86-defns.h
> @@ -45,7 +45,9 @@
>  /*
>   * Intel CPU flags in CR3
>   */
> -#define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
> +#define X86_CR3_NOFLUSH    (_AC(1, ULL) << 63)
> +#define X86_CR3_ADDR_MASK  (PAGE_MASK & PADDR_MASK & ~X86_CR3_NOFLUSH)

Why still ~X86_CR3_NOFLUSH now that you use PADDR_MASK?

Jan



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 7/7] xen/x86: use PCID feature
  2018-04-09 14:15   ` Jan Beulich
@ 2018-04-09 14:33     ` Juergen Gross
  2018-04-09 15:13       ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2018-04-09 14:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel

On 09/04/18 16:15, Jan Beulich wrote:
>>>> On 06.04.18 at 09:52, <jgross@suse.com> wrote:
>> @@ -100,12 +102,35 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>      t = pre_flush();
>>  
>>      if ( read_cr4() & X86_CR4_PGE )
>> +        /*
>> +         * X86_CR4_PGE set means PCID being inactive.
>> +         * We have to purge the TLB via flipping cr4.pge.
>> +         */
>>          write_cr4(cr4 & ~X86_CR4_PGE);
>> +    else if ( use_invpcid )
>> +        /*
>> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
>> +         * will affect the new PCID only.
>> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
>> +         * will purge the TLB (we are in the "global pages off" branch).
>> +         * invpcid_flush_all_nonglobals() seems to be faster than
>> +         * invpcid_flush_all().
>> +         */
>> +        invpcid_flush_all_nonglobals();
>>  
>>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>>  
>>      if ( read_cr4() != cr4 )
>>          write_cr4(cr4);
>> +    else if ( old_pcid != (cr3 & X86_CR3_PCID_MASK) )
>> +        /*
>> +         * Make sure no TLB entries related to the old PCID created between
>> +         * flushing the TLB and writing the new %cr3 value remain in the TLB.
>> +         * Writing %cr3 is documented to be a speculation barrier, OTOH the
>> +         * performance impact of the additional flush is next to invisible.
>> +         * So better be save than sorry.
>> +         */
>> +        invpcid_flush_single_context(old_pcid);
> 
> I'm not really happy about this comment. The CR3 write being a
> speculation barrier is of no real interest here. Until the CPU's
> speculation logic reaches that insn, all sorts of things can happen.
> We don't even know the exact code the compiler will generate,
> much less what that code will trigger inside the CPU.

If you feel strongly about this comment I can remove it.

> 
> Also I think it's "safe" in the last sentence.

Of course.

> 
>> --- a/xen/include/asm-x86/x86-defns.h
>> +++ b/xen/include/asm-x86/x86-defns.h
>> @@ -45,7 +45,9 @@
>>  /*
>>   * Intel CPU flags in CR3
>>   */
>> -#define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
>> +#define X86_CR3_NOFLUSH    (_AC(1, ULL) << 63)
>> +#define X86_CR3_ADDR_MASK  (PAGE_MASK & PADDR_MASK & ~X86_CR3_NOFLUSH)
> 
> Why still ~X86_CR3_NOFLUSH now that you use PADDR_MASK?

I just want to make clear that NOFLUSH is not included. Would a
comment be better?


Juergen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 7/7] xen/x86: use PCID feature
  2018-04-09 14:33     ` Juergen Gross
@ 2018-04-09 15:13       ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2018-04-09 15:13 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, xen-devel

>>> On 09.04.18 at 16:33, <jgross@suse.com> wrote:
> On 09/04/18 16:15, Jan Beulich wrote:
>>>>> On 06.04.18 at 09:52, <jgross@suse.com> wrote:
>>> @@ -100,12 +102,35 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>>      t = pre_flush();
>>>  
>>>      if ( read_cr4() & X86_CR4_PGE )
>>> +        /*
>>> +         * X86_CR4_PGE set means PCID being inactive.
>>> +         * We have to purge the TLB via flipping cr4.pge.
>>> +         */
>>>          write_cr4(cr4 & ~X86_CR4_PGE);
>>> +    else if ( use_invpcid )
>>> +        /*
>>> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
>>> +         * will affect the new PCID only.
>>> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
>>> +         * will purge the TLB (we are in the "global pages off" branch).
>>> +         * invpcid_flush_all_nonglobals() seems to be faster than
>>> +         * invpcid_flush_all().
>>> +         */
>>> +        invpcid_flush_all_nonglobals();
>>>  
>>>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>>>  
>>>      if ( read_cr4() != cr4 )
>>>          write_cr4(cr4);
>>> +    else if ( old_pcid != (cr3 & X86_CR3_PCID_MASK) )
>>> +        /*
>>> +         * Make sure no TLB entries related to the old PCID created between
>>> +         * flushing the TLB and writing the new %cr3 value remain in the TLB.
>>> +         * Writing %cr3 is documented to be a speculation barrier, OTOH the
>>> +         * performance impact of the additional flush is next to invisible.
>>> +         * So better be save than sorry.
>>> +         */
>>> +        invpcid_flush_single_context(old_pcid);
>> 
>> I'm not really happy about this comment. The CR3 write being a
>> speculation barrier is of no real interest here. Until the CPU's
>> speculation logic reaches that insn, all sorts of things can happen.
>> We don't even know the exact code the compiler will generate,
>> much less what that code will trigger inside the CPU.
> 
> In case you are feeling strong regarding this comment I can
> remove it.

The first sentence is fine (and useful), so please don't drop it
wholesale.

>>> --- a/xen/include/asm-x86/x86-defns.h
>>> +++ b/xen/include/asm-x86/x86-defns.h
>>> @@ -45,7 +45,9 @@
>>>  /*
>>>   * Intel CPU flags in CR3
>>>   */
>>> -#define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
>>> +#define X86_CR3_NOFLUSH    (_AC(1, ULL) << 63)
>>> +#define X86_CR3_ADDR_MASK  (PAGE_MASK & PADDR_MASK & ~X86_CR3_NOFLUSH)
>> 
>> Why still ~X86_CR3_NOFLUSH now that you use PADDR_MASK?
> 
> I just want to make clear that NOFLUSH is not included. Would you
> like a comment better?

No - the involved symbol names together make clear that the
NOFLUSH bit can't be included. The same bit can't possibly be
part of the address _and_ be the signal to not flush.
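
I.e. presumably just

    #define X86_CR3_ADDR_MASK  (PAGE_MASK & PADDR_MASK)

given that bit 63 lies well outside PADDR_MASK anyway.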

Jan



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-04-09 15:13 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-06  7:52 [PATCH v5 0/7] xen/x86: various XPTI speedups Juergen Gross
2018-04-06  7:52 ` [PATCH v5 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
2018-04-06 11:09   ` Andrew Cooper
2018-04-06  7:52 ` [PATCH v5 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
2018-04-06 11:21   ` Andrew Cooper
2018-04-06 11:38     ` Juergen Gross
2018-04-06  7:52 ` [PATCH v5 3/7] xen/x86: support per-domain flag for xpti Juergen Gross
2018-04-06 13:16   ` Andrew Cooper
2018-04-06 13:33     ` Juergen Gross
2018-04-06  7:52 ` [PATCH v5 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
2018-04-06 13:40   ` Andrew Cooper
2018-04-06 13:48     ` Juergen Gross
2018-04-06  7:52 ` [PATCH v5 5/7] xen/x86: disable global pages for domains with XPTI active Juergen Gross
2018-04-06 15:17   ` Andrew Cooper
2018-04-06 15:41     ` Juergen Gross
2018-04-06  7:52 ` [PATCH v5 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
2018-04-06  7:52 ` [PATCH v5 7/7] xen/x86: use PCID feature Juergen Gross
2018-04-06 17:30   ` Andrew Cooper
2018-04-09  6:31     ` Juergen Gross
2018-04-09 14:15   ` Jan Beulich
2018-04-09 14:33     ` Juergen Gross
2018-04-09 15:13       ` Jan Beulich
