All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/7] xen/x86: various XPTI speedups
@ 2018-03-21 12:51 Juergen Gross
  2018-03-21 12:51 ` [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
                   ` (7 more replies)
  0 siblings, 8 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-21 12:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich, dfaggioli

This patch series aims at reducing the overhead of the XPTI Meltdown
mitigation. It is based on Jan's XPTI speedup series.

Patch 1 had been posted before, the main changes in this patch are due
to addressing Jan's comments on my first version. The main objective of
that patch is to avoid copying the L4 page table each time the guest is
being activated, as often the contents didn't change while the
hypervisor was active.

Patch 2 tries to minimize flushing the TLB: there is no need to flush
it in write_ptbase() and when activating the guest.

Patch 3 sets the stage for being able to activate XPTI per domain. As a
first step it is now possible to switch XPTI off for dom0 via the xpti
boot parameter.

Patch 4 adds support for using the INVPCID instruction for flushing
the TLB.

Patch 5 reduces the costs of TLB flushes even further: as we don't make
any use of global TLB entries with XPTI being active we can avoid
removing all global TLB entries on TLB flushes by simply deactivating
the global pages in CR4.

Patch 6 was originally only meant to prepare using PCIDs in patch 6.
For that purpose it was necessary to allow CR3 values with bit 63 set
in order to avoid flushing TLB entries when writing CR3. This requires
a modification of Jan's rather clever state machine with positive and
negative CR3 values for the hypervisor by using a dedicated flag byte
instead. It turned out this modification saved one branch on interrupt
entry speeding up the handling by a few percent.

Patch 7 is the main performance contributor: by making use of the PCID
feature (if available) TLB entries can survive CR3 switches. The TLB
needs to be flushed on context switches only and not when switching
between guest and hypervisor or guest kernel and user mode.

On my machine (Intel i7-4600M) using the PCID feature in the non-XPTI
case showed a slightly worse performance than using global pages
instead (using PCID and global pages is a bad idea as invalidating
global pages in this case would need a complete TLB flush). For this
reason I've decided to use PCID for XPTI only as the default. That
can easily be changed by using the command line parameter "pcid=all".

The complete series has been verified to still mitigate against
Meltdown attacks. A simple performance test (make -j 4 in the Xen
hypervisor directory) showed significant improvements compared to the
state without this series (so with Jan's series applied),
the percentage after the numbers is always related to XPTI off:

       XPTI off     Jan, XPTI on        +this series, XPTI on
real   1m21.169s    1m52.149s (+38%)    1m25.692s (+6%)
user   2m47.652s    2m50.054s (+1%)     2m46.428s (-1%)
sys    1m11.949s    2m21.767s (+97%)    1m23.053s (+15%)

A git branch of that series (+ Jan's patches) is available:

https://github.com/jgross1/xen.git xpti


Juergen Gross (7):
  x86/xpti: avoid copying L4 page table contents when possible
  x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  xen/x86: support per-domain flag for xpti
  xen/x86: use invpcid for flushing the TLB
  xen/x86: disable global pages for domains with XPTI active
  xen/x86: use flag byte for decision whether xen_cr3 is valid
  xen/x86: use PCID feature

 docs/misc/xen-command-line.markdown |  30 +++++++-
 xen/arch/x86/cpu/mtrr/generic.c     |  37 +++++++---
 xen/arch/x86/debug.c                |   3 +-
 xen/arch/x86/domain.c               |   6 +-
 xen/arch/x86/domain_page.c          |   2 +-
 xen/arch/x86/domctl.c               |   8 +++
 xen/arch/x86/flushtlb.c             |  96 +++++++++++++++++++-------
 xen/arch/x86/mm.c                   |  91 +++++++++++++++++++++----
 xen/arch/x86/pv/dom0_build.c        |   4 ++
 xen/arch/x86/pv/domain.c            | 132 +++++++++++++++++++++++++++++++++++-
 xen/arch/x86/setup.c                |  27 +++-----
 xen/arch/x86/smp.c                  |   2 +-
 xen/arch/x86/smpboot.c              |   7 +-
 xen/arch/x86/x86_64/asm-offsets.c   |   2 +
 xen/arch/x86/x86_64/compat/entry.S  |   5 +-
 xen/arch/x86/x86_64/entry.S         |  87 ++++++++++--------------
 xen/common/efi/runtime.c            |   4 +-
 xen/include/asm-x86/current.h       |  23 +++++--
 xen/include/asm-x86/domain.h        |  20 ++++--
 xen/include/asm-x86/flushtlb.h      |   4 +-
 xen/include/asm-x86/pv/domain.h     |   7 +-
 xen/include/asm-x86/x86-defns.h     |   1 +
 22 files changed, 453 insertions(+), 145 deletions(-)

-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-21 12:51 [PATCH v3 0/7] xen/x86: various XPTI speedups Juergen Gross
@ 2018-03-21 12:51 ` Juergen Gross
  2018-03-22 14:31   ` Jan Beulich
       [not found]   ` <5AB3CC5502000078001B5254@suse.com>
  2018-03-21 12:51 ` [PATCH v3 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-21 12:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich, dfaggioli

For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag whether the copying should be performed and set
that flag only when loading a new %cr3 or modifying the L4 page table.
This includes synchronization of the cpu local root page table with
other cpus, so add a special synchronization flag for that case.

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V3:
- set flag locally only if affected L4 is active (Jan Beulich)
- add setting flag to flush_area_mask() (Jan Beulich)
- set flag in make_cr3() only if called for current active vcpu

To be applied on top of Jan's "Meltdown band-aid overhead reduction"
series
---
 xen/arch/x86/flushtlb.c           |  3 +++
 xen/arch/x86/mm.c                 | 42 +++++++++++++++++++++++++++------------
 xen/arch/x86/smp.c                |  2 +-
 xen/arch/x86/x86_64/asm-offsets.c |  1 +
 xen/arch/x86/x86_64/entry.S       |  8 ++++++--
 xen/include/asm-x86/current.h     |  8 ++++++++
 xen/include/asm-x86/flushtlb.h    |  2 ++
 7 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 8a7a76b8ff..9a9af71d5a 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
         }
     }
 
+    if ( flags & FLUSH_ROOT_PGTBL )
+        get_cpu_info()->root_pgt_changed = true;
+
     local_irq_restore(irqfl);
 
     return flags;
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f0195561c2..352600ad73 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page)
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
     v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
+    if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
+         !is_pv_32bit_vcpu(v) )
+        get_cpu_info()->root_pgt_changed = true;
 }
 
 void write_ptbase(struct vcpu *v)
 {
+    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+        get_cpu_info()->root_pgt_changed = true;
     write_cr3(v->arch.cr3);
 }
 
@@ -3698,18 +3703,29 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                    /*
-                     * No need to sync if all uses of the page can be accounted
-                     * to the page lock we hold, its pinned status, and uses on
-                     * this (v)CPU.
-                     */
-                    if ( !rc && !cpu_has_no_xpti &&
-                         ((page->u.inuse.type_info & PGT_count_mask) >
-                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
-                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
-                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
-                            mfn))) )
-                        sync_guest = true;
+                    if ( !rc && !cpu_has_no_xpti )
+                    {
+                        bool local_in_use = false;
+
+                        if ( (pagetable_get_pfn(curr->arch.guest_table) ==
+                              mfn) ||
+                             (pagetable_get_pfn(curr->arch.guest_table_user) ==
+                              mfn) )
+                        {
+                            local_in_use = true;
+                            get_cpu_info()->root_pgt_changed = true;
+                        }
+
+                        /*
+                         * No need to sync if all uses of the page can be
+                         * accounted to the page lock we hold, its pinned
+                         * status, and uses on this (v)CPU.
+                         */
+                        if ( (page->u.inuse.type_info & PGT_count_mask) >
+                             (1 + !!(page->u.inuse.type_info & PGT_pinned) +
+                              local_in_use) )
+                            sync_guest = true;
+                    }
                     break;
 
                 case PGT_writable_page:
@@ -3824,7 +3840,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL);
+            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index 033dd05958..63e819ca38 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -208,7 +208,7 @@ void invalidate_interrupt(struct cpu_user_regs *regs)
     ack_APIC_irq();
     perfc_incr(ipis);
     if ( (flags & FLUSH_VCPU_STATE) && __sync_local_execstate() )
-        flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL);
+        flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
     if ( flags & ~(FLUSH_VCPU_STATE | FLUSH_ORDER_MASK) )
         flush_area_local(flush_va, flags);
     cpumask_clear_cpu(smp_processor_id(), &flush_cpumask);
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index a2fea94f4c..9e2aefb00f 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -143,6 +143,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_shadow_spec_ctrl, struct cpu_info, shadow_spec_ctrl);
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
+    OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index e58ca90c18..18b79be539 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -129,10 +129,13 @@ restore_all_guest:
 .Lrag_cr3_start:
         mov   VCPU_cr3(%rbx), %r9
         GET_STACK_END(dx)
-        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax
+        cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
+        je    .Lrag_copy_done
+        movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
         movabs $PADDR_MASK & PAGE_MASK, %rsi
         movabs $DIRECTMAP_VIRT_START, %rcx
-        mov   %rdi, %rax
+        mov   %rax, %rdi
         and   %rsi, %rdi
         and   %r9, %rsi
         add   %rcx, %rdi
@@ -148,6 +151,7 @@ restore_all_guest:
         sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
                 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
         rep movsq
+.Lrag_copy_done:
         mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
         mov   %rdi, %rsi
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 3a0e1eef36..f2491b4423 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -59,6 +59,14 @@ struct cpu_info {
     bool         use_shadow_spec_ctrl;
     uint8_t      bti_ist_info;
 
+    /*
+     * The following field controls copying of the L4 page table of 64-bit
+     * PV guests to the per-cpu root page table on entering the guest context.
+     * If set the L4 page table is being copied to the root page table and
+     * the field will be reset.
+     */
+    bool         root_pgt_changed;
+
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
diff --git a/xen/include/asm-x86/flushtlb.h b/xen/include/asm-x86/flushtlb.h
index 2cade9cbfb..052f0fa403 100644
--- a/xen/include/asm-x86/flushtlb.h
+++ b/xen/include/asm-x86/flushtlb.h
@@ -103,6 +103,8 @@ void write_cr3(unsigned long cr3);
 #define FLUSH_VA_VALID   0x800
  /* Flush CPU state */
 #define FLUSH_VCPU_STATE 0x1000
+ /* Flush the per-cpu root page table */
+#define FLUSH_ROOT_PGTBL 0x2000
 
 /* Flush local TLBs/caches. */
 unsigned int flush_area_local(const void *va, unsigned int flags);
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  2018-03-21 12:51 [PATCH v3 0/7] xen/x86: various XPTI speedups Juergen Gross
  2018-03-21 12:51 ` [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
@ 2018-03-21 12:51 ` Juergen Gross
  2018-03-22 14:50   ` Jan Beulich
  2018-03-21 12:51 ` [PATCH v3 3/7] xen/x86: support per-domain flag for xpti Juergen Gross
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread
From: Juergen Gross @ 2018-03-21 12:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich, dfaggioli

When switching to a 64-bit pv context the TLB is flushed twice today:
the first time when switching to the new address space in
write_ptbase(), the second time when switching to guest mode in
restore_to_guest.

Avoid the first TLB flush in that case.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V3:
- omit setting root_pgt_changed to false (Jan Beulich)
---
 xen/arch/x86/mm.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 352600ad73..8c944b33c9 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -123,6 +123,7 @@
 #include <asm/io_apic.h>
 #include <asm/pci.h>
 #include <asm/guest.h>
+#include <asm/processor.h>
 
 #include <asm/hvm/grant_table.h>
 #include <asm/pv/grant_table.h>
@@ -507,8 +508,14 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 void write_ptbase(struct vcpu *v)
 {
     if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+    {
         get_cpu_info()->root_pgt_changed = true;
-    write_cr3(v->arch.cr3);
+        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+    }
+    else
+    {
+        write_cr3(v->arch.cr3);
+    }
 }
 
 /*
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 3/7] xen/x86: support per-domain flag for xpti
  2018-03-21 12:51 [PATCH v3 0/7] xen/x86: various XPTI speedups Juergen Gross
  2018-03-21 12:51 ` [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
  2018-03-21 12:51 ` [PATCH v3 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
@ 2018-03-21 12:51 ` Juergen Gross
  2018-03-22 15:26   ` Jan Beulich
       [not found]   ` <5AB3D92D02000078001B5326@suse.com>
  2018-03-21 12:51 ` [PATCH v3 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-21 12:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich, dfaggioli

Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows to modify the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=nodom0" as boot
parameter will achieve that.

Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V3:
- latch get_cpu_info() return value in variable (Jan Beulich)
- call always xpti_domain_init() for pv dom0 (Jan Beulich)
- add __init annotations (Jan Beulich)
- drop per domain XPTI message (Jan Beulich)
- document xpti=default support (Jan Beulich)
- move domain xpti flag into a padding hole (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 10 ++++-
 xen/arch/x86/domctl.c               |  4 ++
 xen/arch/x86/mm.c                   | 10 ++++-
 xen/arch/x86/pv/dom0_build.c        |  3 ++
 xen/arch/x86/pv/domain.c            | 80 ++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/setup.c                | 20 +---------
 xen/arch/x86/smpboot.c              |  5 +--
 xen/arch/x86/x86_64/entry.S         |  2 +
 xen/include/asm-x86/current.h       |  3 +-
 xen/include/asm-x86/domain.h        |  3 ++
 xen/include/asm-x86/pv/domain.h     |  5 ++-
 11 files changed, 118 insertions(+), 27 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index b353352adf..79be9a6ba5 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1955,7 +1955,7 @@ clustered mode.  The default, given no hint from the **FADT**, is cluster
 mode.
 
 ### xpti
-> `= <boolean>`
+> `= default | nodom0 | <boolean>`
 
 > Default: `false` on AMD hardware
 > Default: `true` everywhere else
@@ -1963,6 +1963,14 @@ mode.
 Override default selection of whether to isolate 64-bit PV guest page
 tables.
 
+`true` activates page table isolation even on AMD hardware.
+
+`false` deactivates page table isolation on all systems.
+
+`default` sets the default behaviour.
+
+`nodom0` deactivates page table isolation for dom0.
+
 ### xsave
 > `= <boolean>`
 
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 8fbbf3aeb3..0704f398c7 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -24,6 +24,7 @@
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/support.h>
 #include <asm/processor.h>
+#include <asm/pv/domain.h>
 #include <asm/acpi.h> /* for hvm_acpi_power_button */
 #include <xen/hypercall.h> /* for arch_do_domctl */
 #include <xsm/xsm.h>
@@ -610,6 +611,9 @@ long arch_do_domctl(
             ret = switch_compat(d);
         else
             ret = -EINVAL;
+
+        if ( ret == 0 )
+            xpti_domain_init(d);
         break;
 
     case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 8c944b33c9..e68ed474bf 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -507,14 +507,20 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
-    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+    struct cpu_info *cpu_info = get_cpu_info();
+
+    if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
-        get_cpu_info()->root_pgt_changed = true;
+        cpu_info->root_pgt_changed = true;
+        cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
         asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
     }
     else
     {
+        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+        cpu_info->xen_cr3 = 0;
         write_cr3(v->arch.cr3);
+        cpu_info->pv_cr3 = 0;
     }
 }
 
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 0bd2f1bf90..77186c19bd 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -19,6 +19,7 @@
 #include <asm/dom0_build.h>
 #include <asm/guest.h>
 #include <asm/page.h>
+#include <asm/pv/domain.h>
 #include <asm/pv/mm.h>
 #include <asm/setup.h>
 
@@ -707,6 +708,8 @@ int __init dom0_construct_pv(struct domain *d,
             cpu = p->processor;
     }
 
+    xpti_domain_init(d);
+
     d->arch.paging.mode = 0;
 
     /* Set up CR3 value for write_ptbase */
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2524326b74..266117e804 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -9,6 +9,8 @@
 #include <xen/lib.h>
 #include <xen/sched.h>
 
+#include <asm/cpufeature.h>
+#include <asm/msr-index.h>
 #include <asm/pv/domain.h>
 
 /* Override macros from asm/page.h to make them work with mfn_t */
@@ -17,6 +19,82 @@
 #undef page_to_mfn
 #define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
 
+static __read_mostly enum {
+    XPTI_DEFAULT,
+    XPTI_ON,
+    XPTI_OFF,
+    XPTI_NODOM0
+} opt_xpti = XPTI_DEFAULT;
+
+static __init int parse_xpti(const char *s)
+{
+    int rc = 0;
+
+    switch ( parse_bool(s, NULL) )
+    {
+    case 0:
+        opt_xpti = XPTI_OFF;
+        break;
+    case 1:
+        opt_xpti = XPTI_ON;
+        break;
+    default:
+        if ( !strcmp(s, "default") )
+            opt_xpti = XPTI_DEFAULT;
+        else if ( !strcmp(s, "nodom0") )
+            opt_xpti = XPTI_NODOM0;
+        else
+            rc = -EINVAL;
+        break;
+    }
+
+    return rc;
+}
+custom_param("xpti", parse_xpti);
+
+void __init xpti_init(void)
+{
+    uint64_t caps = 0;
+
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
+        caps = ARCH_CAPABILITIES_RDCL_NO;
+    else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( opt_xpti != XPTI_ON && (caps & ARCH_CAPABILITIES_RDCL_NO) )
+        opt_xpti = XPTI_OFF;
+    else if ( opt_xpti == XPTI_DEFAULT )
+        opt_xpti = XPTI_ON;
+
+    if ( opt_xpti == XPTI_OFF )
+        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
+    else
+        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
+}
+
+void xpti_domain_init(struct domain *d)
+{
+    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
+        return;
+
+    switch ( opt_xpti )
+    {
+    case XPTI_OFF:
+        d->arch.pv_domain.xpti = false;
+        break;
+    case XPTI_ON:
+        d->arch.pv_domain.xpti = true;
+        break;
+    case XPTI_NODOM0:
+        d->arch.pv_domain.xpti = d->domain_id != 0 &&
+                                 d->domain_id != hardware_domid;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
 static void noreturn continue_nonidle_domain(struct vcpu *v)
 {
     check_wakeup_from_wait();
@@ -264,7 +342,7 @@ void toggle_guest_mode(struct vcpu *v)
     }
     asm volatile ( "swapgs" );
 
-    _toggle_guest_pt(v, cpu_has_no_xpti);
+    _toggle_guest_pt(v, !v->domain->arch.pv_domain.xpti);
 }
 
 void toggle_guest_pt(struct vcpu *v)
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index d7aac724b0..801334b43f 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -53,6 +53,7 @@
 #include <asm/cpuid.h>
 #include <asm/spec_ctrl.h>
 #include <asm/guest.h>
+#include <asm/pv/domain.h>
 
 /* opt_nosmp: If true, secondary processors are ignored. */
 static bool __initdata opt_nosmp;
@@ -169,9 +170,6 @@ static int __init parse_smap_param(const char *s)
 }
 custom_param("smap", parse_smap_param);
 
-static int8_t __initdata opt_xpti = -1;
-boolean_param("xpti", opt_xpti);
-
 bool __read_mostly acpi_disabled;
 bool __initdata acpi_force;
 static char __initdata acpi_param[10] = "";
@@ -1546,21 +1544,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( opt_xpti < 0 )
-    {
-        uint64_t caps = 0;
-
-        if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
-            caps = ARCH_CAPABILITIES_RDCL_NO;
-        else if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
-            rdmsrl(MSR_ARCH_CAPABILITIES, caps);
-
-        opt_xpti = !(caps & ARCH_CAPABILITIES_RDCL_NO);
-    }
-    if ( opt_xpti )
-        setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
-    else
-        setup_force_cpu_cap(X86_FEATURE_NO_XPTI);
+    xpti_init();
 
     if ( cpu_has_fsgsbase )
         set_in_cr4(X86_CR4_FSGSBASE);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index b0b72ca544..07624da32b 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -331,7 +331,7 @@ void start_secondary(void *unused)
     spin_debug_disable();
 
     get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = this_cpu(root_pgt) ? __pa(this_cpu(root_pgt)) : 0;
+    get_cpu_info()->pv_cr3 = 0;
 
     load_system_tables();
 
@@ -1050,8 +1050,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
         panic("Error %d setting up PV root page table\n", rc);
     if ( per_cpu(root_pgt, 0) )
     {
-        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
-
+        get_cpu_info()->pv_cr3 = 0;
         /*
          * All entry points which may need to switch page tables have to start
          * with interrupts off. Re-write what pv_trap_init() has put there.
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 18b79be539..2a06cd1a51 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -130,6 +130,8 @@ restore_all_guest:
         mov   VCPU_cr3(%rbx), %r9
         GET_STACK_END(dx)
         mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rax
+        test  %rax, %rax
+        jz    .Lrag_cr3_end
         cmpb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
         je    .Lrag_copy_done
         movb  $0, STACK_CPUINFO_FIELD(root_pgt_changed)(%rdx)
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index f2491b4423..b2475783f8 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -44,7 +44,8 @@ struct cpu_info {
     /*
      * Of the two following fields the latter is being set to the CR3 value
      * to be used on the given pCPU for loading whenever 64-bit PV guest
-     * context is being entered. The value never changes once set.
+     * context is being entered. A value of zero indicates no setting of CR3
+     * is to be performed.
      * The former is the value to restore when re-entering Xen, if any. IOW
      * its value being zero means there's nothing to restore. However, its
      * value can also be negative, indicating to the exit-to-Xen code that
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 47aadc2600..b97a8792e9 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -253,6 +253,9 @@ struct pv_domain
 
     atomic_t nr_l4_pages;
 
+    /* XPTI active? */
+    bool xpti;
+
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
 
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index 5e34176939..2213a8fb3d 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -27,6 +27,8 @@ void pv_vcpu_destroy(struct vcpu *v);
 int pv_vcpu_initialise(struct vcpu *v);
 void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d);
+void xpti_init(void);
+void xpti_domain_init(struct domain *d);
 
 #else  /* !CONFIG_PV */
 
@@ -36,7 +38,8 @@ static inline void pv_vcpu_destroy(struct vcpu *v) {}
 static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
 static inline void pv_domain_destroy(struct domain *d) {}
 static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
-
+static inline void xpti_init(void) {}
+static inline void xpti_domain_init(struct domain *d) {}
 #endif	/* CONFIG_PV */
 
 void paravirt_ctxt_switch_from(struct vcpu *v);
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 4/7] xen/x86: use invpcid for flushing the TLB
  2018-03-21 12:51 [PATCH v3 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (2 preceding siblings ...)
  2018-03-21 12:51 ` [PATCH v3 3/7] xen/x86: support per-domain flag for xpti Juergen Gross
@ 2018-03-21 12:51 ` Juergen Gross
  2018-03-22 15:35   ` Jan Beulich
       [not found]   ` <5AB3DB5102000078001B5358@suse.com>
  2018-03-21 12:51 ` [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active Juergen Gross
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-21 12:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich, dfaggioli

If possible use the INVPCID instruction for flushing the TLB instead of
toggling cr4.pge for that purpose.

While at it remove the dependency on cr4.pge being required for mtrr
loading, as this will be required later anyway.

Add a command line option "noinvpcid" for deactivating the use of
INVPCID.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V3:
- new patch
---
 docs/misc/xen-command-line.markdown |  8 ++++++++
 xen/arch/x86/cpu/mtrr/generic.c     | 37 ++++++++++++++++++++++++++-----------
 xen/arch/x86/flushtlb.c             | 31 +++++++++++++++++++++----------
 xen/arch/x86/setup.c                |  7 +++++++
 4 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 79be9a6ba5..8fc7b2ff3b 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1380,6 +1380,14 @@ Because responsibility for APIC setup is shared between Xen and the
 domain 0 kernel this option is automatically propagated to the domain
 0 command line.
 
+### noinvpcid (x86)
+> `= <boolean>`
+
+Disable using the INVPCID instruction for flushing TLB entries.
+This should only be used in case of known issues on the current platform
+with that instruction. Disabling INVPCID will normally result in a slightly
+degraded performance.
+
 ### noirqbalance
 > `= <boolean>`
 
diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c
index e9c0e5e059..e88643f4bf 100644
--- a/xen/arch/x86/cpu/mtrr/generic.c
+++ b/xen/arch/x86/cpu/mtrr/generic.c
@@ -5,6 +5,7 @@
 #include <xen/mm.h>
 #include <xen/stdbool.h>
 #include <asm/flushtlb.h>
+#include <asm/invpcid.h>
 #include <asm/io.h>
 #include <asm/mtrr.h>
 #include <asm/msr.h>
@@ -400,8 +401,10 @@ static DEFINE_SPINLOCK(set_atomicity_lock);
  * has been called.
  */
 
-static void prepare_set(void)
+static bool prepare_set(void)
 {
+	unsigned long cr4;
+
 	/*  Note that this is not ideal, since the cache is only flushed/disabled
 	   for this CPU while the MTRRs are changed, but changing this requires
 	   more invasive changes to the way the kernel boots  */
@@ -412,18 +415,24 @@ static void prepare_set(void)
 	write_cr0(read_cr0() | X86_CR0_CD);
 	wbinvd();
 
-	/*  TLB flushing here relies on Xen always using CR4.PGE. */
-	BUILD_BUG_ON(!(XEN_MINIMAL_CR4 & X86_CR4_PGE));
-	write_cr4(read_cr4() & ~X86_CR4_PGE);
+	cr4 = read_cr4();
+	if (cr4 & X86_CR4_PGE)
+		write_cr4(cr4 & ~X86_CR4_PGE);
+	else if (cpu_has_invpcid)
+		invpcid_flush_all();
+	else
+		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
 	/*  Save MTRR state */
 	rdmsrl(MSR_MTRRdefType, deftype);
 
 	/*  Disable MTRRs, and set the default type to uncached  */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype & ~0xcff);
+
+	return cr4 & X86_CR4_PGE;
 }
 
-static void post_set(void)
+static void post_set(bool pge)
 {
 	/* Intel (P6) standard MTRRs */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype);
@@ -432,7 +441,12 @@ static void post_set(void)
 	write_cr0(read_cr0() & ~X86_CR0_CD);
 
 	/*  Reenable CR4.PGE (also flushes the TLB) */
-	write_cr4(read_cr4() | X86_CR4_PGE);
+	if (pge)
+		write_cr4(read_cr4() | X86_CR4_PGE);
+	else if (cpu_has_invpcid)
+		invpcid_flush_all();
+	else
+		asm volatile( "mov %0, %%cr3" : : "r" (read_cr3()) : "memory" );
 
 	spin_unlock(&set_atomicity_lock);
 }
@@ -441,14 +455,15 @@ static void generic_set_all(void)
 {
 	unsigned long mask, count;
 	unsigned long flags;
+	bool pge;
 
 	local_irq_save(flags);
-	prepare_set();
+	pge = prepare_set();
 
 	/* Actually set the state */
 	mask = set_mtrr_state();
 
-	post_set();
+	post_set(pge);
 	local_irq_restore(flags);
 
 	/*  Use the atomic bitops to update the global mask  */
@@ -457,7 +472,6 @@ static void generic_set_all(void)
 			set_bit(count, &smp_changes_mask);
 		mask >>= 1;
 	}
-	
 }
 
 static void generic_set_mtrr(unsigned int reg, unsigned long base,
@@ -474,11 +488,12 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
 {
 	unsigned long flags;
 	struct mtrr_var_range *vr;
+	bool pge;
 
 	vr = &mtrr_state.var_ranges[reg];
 
 	local_irq_save(flags);
-	prepare_set();
+	pge = prepare_set();
 
 	if (size == 0) {
 		/* The invalid bit is kept in the mask, so we simply clear the
@@ -499,7 +514,7 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
 		mtrr_wrmsr(MSR_IA32_MTRR_PHYSMASK(reg), vr->mask);
 	}
 
-	post_set();
+	post_set(pge);
 	local_irq_restore(flags);
 }
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 9a9af71d5a..6345676bc5 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -10,6 +10,7 @@
 #include <xen/sched.h>
 #include <xen/softirq.h>
 #include <asm/flushtlb.h>
+#include <asm/invpcid.h>
 #include <asm/page.h>
 
 /* Debug builds: Wrap frequently to stress-test the wrap logic. */
@@ -71,6 +72,25 @@ static void post_flush(u32 t)
     this_cpu(tlbflush_time) = t;
 }
 
+static void do_tlb_flush(void)
+{
+    u32 t;
+
+    t = pre_flush();
+
+    if ( cpu_has_invpcid )
+        invpcid_flush_all();
+    else
+    {
+        unsigned long cr4 = read_cr4();
+
+        write_cr4(cr4 ^ X86_CR4_PGE);
+        write_cr4(cr4);
+    }
+
+    post_flush(t);
+}
+
 void write_cr3(unsigned long cr3)
 {
     unsigned long flags, cr4;
@@ -118,16 +138,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
                            : : "m" (*(const char *)(va)) : "memory" );
         }
         else
-        {
-            u32 t = pre_flush();
-            unsigned long cr4 = read_cr4();
-
-            write_cr4(cr4 & ~X86_CR4_PGE);
-            barrier();
-            write_cr4(cr4);
-
-            post_flush(t);
-        }
+            do_tlb_flush();
     }
 
     if ( flags & FLUSH_CACHE )
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 801334b43f..5318aaa859 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -63,6 +63,10 @@ boolean_param("nosmp", opt_nosmp);
 static unsigned int __initdata max_cpus;
 integer_param("maxcpus", max_cpus);
 
+/* opt_noinvpcid: If true, don't use INVPCID instruction even if available. */
+static bool __initdata opt_noinvpcid;
+boolean_param("noinvpcid", opt_noinvpcid);
+
 unsigned long __read_mostly cr4_pv32_mask;
 
 /* **** Linux config option: propagated to domain0. */
@@ -1549,6 +1553,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( cpu_has_fsgsbase )
         set_in_cr4(X86_CR4_FSGSBASE);
 
+    if ( opt_noinvpcid )
+        setup_clear_cpu_cap(X86_FEATURE_INVPCID);
+
     init_speculation_mitigations();
 
     init_idle_domain();
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-21 12:51 [PATCH v3 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (3 preceding siblings ...)
  2018-03-21 12:51 ` [PATCH v3 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
@ 2018-03-21 12:51 ` Juergen Gross
  2018-03-22 16:30   ` Jan Beulich
       [not found]   ` <5AB3E82402000078001B5408@suse.com>
  2018-03-21 12:51 ` [PATCH v3 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-21 12:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich, dfaggioli

Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.

In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to write_cr3_cr4() called by write_ptbase()).

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V3:
- move cr4 loading for all domains from *_ctxt_switch_to() to
  write_cr3_cr4() called by write_ptbase() (Jan Beulich)
- rebase
---
 xen/arch/x86/domain.c          |  5 -----
 xen/arch/x86/flushtlb.c        | 13 ++++++++-----
 xen/arch/x86/mm.c              | 11 ++++++++---
 xen/arch/x86/x86_64/entry.S    | 10 ----------
 xen/common/efi/runtime.c       |  4 ++--
 xen/include/asm-x86/domain.h   |  3 ++-
 xen/include/asm-x86/flushtlb.h |  2 +-
 7 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 4cac8906ea..fe18d9d3c4 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1520,17 +1520,12 @@ void paravirt_ctxt_switch_from(struct vcpu *v)
 void paravirt_ctxt_switch_to(struct vcpu *v)
 {
     root_pgentry_t *root_pgt = this_cpu(root_pgt);
-    unsigned long cr4;
 
     if ( root_pgt )
         root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
             l4e_from_page(v->domain->arch.perdomain_l3_pg,
                           __PAGE_HYPERVISOR_RW);
 
-    cr4 = pv_guest_cr4_to_real_cr4(v);
-    if ( unlikely(cr4 != read_cr4()) )
-        write_cr4(cr4);
-
     if ( unlikely(v->arch.debugreg[7] & DR7_ACTIVE_MASK) )
         activate_debugregs(v);
 
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 6345676bc5..d4b8acc837 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -91,20 +91,23 @@ static void do_tlb_flush(void)
     post_flush(t);
 }
 
-void write_cr3(unsigned long cr3)
+void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
 {
-    unsigned long flags, cr4;
+    unsigned long flags;
     u32 t;
 
     /* This non-reentrant function is sometimes called in interrupt context. */
     local_irq_save(flags);
 
     t = pre_flush();
-    cr4 = read_cr4();
 
-    write_cr4(cr4 & ~X86_CR4_PGE);
+    if ( read_cr4() & X86_CR4_PGE )
+        write_cr4(cr4 & ~X86_CR4_PGE);
+
     asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
-    write_cr4(cr4);
+
+    if ( read_cr4() != cr4 )
+        write_cr4(cr4);
 
     post_flush(t);
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e68ed474bf..c2738dd3fd 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 void write_ptbase(struct vcpu *v)
 {
     struct cpu_info *cpu_info = get_cpu_info();
+    unsigned long new_cr4;
+
+    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
+              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
-        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+        write_cr3_cr4(v->arch.cr3, new_cr4);
     }
     else
     {
-        /* Make sure to clear xen_cr3 before pv_cr3; write_cr3() serializes. */
+        /* Make sure to clear xen_cr3 before pv_cr3. */
         cpu_info->xen_cr3 = 0;
-        write_cr3(v->arch.cr3);
+        /* write_cr3_cr4() serializes. */
+        write_cr3_cr4(v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
 }
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 2a06cd1a51..eb33ec835a 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -154,13 +154,8 @@ restore_all_guest:
                 ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
         rep movsq
 .Lrag_copy_done:
-        mov   STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
-        mov   %rdi, %rsi
-        and   $~X86_CR4_PGE, %rdi
-        mov   %rdi, %cr4
         mov   %rax, %cr3
-        mov   %rsi, %cr4
 .Lrag_cr3_end:
         ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
 
@@ -218,12 +213,7 @@ restore_all_xen:
          * so "g" will have to do.
          */
 UNLIKELY_START(g, exit_cr3)
-        mov   %cr4, %rdi
-        mov   %rdi, %rsi
-        and   $~X86_CR4_PGE, %rdi
-        mov   %rdi, %cr4
         mov   %rax, %cr3
-        mov   %rsi, %cr4
 UNLIKELY_END(exit_cr3)
 .Lrax_cr3_end:
         ALTERNATIVE_NOP .Lrax_cr3_start, .Lrax_cr3_end, X86_FEATURE_NO_XPTI
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 3dbc2e8ee5..fad8ca9e95 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -111,7 +111,7 @@ struct efi_rs_state efi_rs_enter(void)
         lgdt(&gdt_desc);
     }
 
-    write_cr3(virt_to_maddr(efi_l4_pgtable));
+    write_cr3_cr4(virt_to_maddr(efi_l4_pgtable), read_cr4());
 
     return state;
 }
@@ -120,7 +120,7 @@ void efi_rs_leave(struct efi_rs_state *state)
 {
     if ( !state->cr3 )
         return;
-    write_cr3(state->cr3);
+    write_cr3_cr4(state->cr3, read_cr4());
     if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
     {
         struct desc_ptr gdt_desc = {
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index b97a8792e9..ef9e9639fd 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -621,7 +621,8 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
             X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
             X86_CR4_FSGSBASE))                              \
       | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
-     & ~X86_CR4_DE)
+     & ~(X86_CR4_DE |                                       \
+         ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))
 #define real_cr4_to_pv_guest_cr4(c)                         \
     ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
              X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
diff --git a/xen/include/asm-x86/flushtlb.h b/xen/include/asm-x86/flushtlb.h
index 052f0fa403..1eb9682de4 100644
--- a/xen/include/asm-x86/flushtlb.h
+++ b/xen/include/asm-x86/flushtlb.h
@@ -84,7 +84,7 @@ static inline unsigned long read_cr3(void)
 }
 
 /* Write pagetable base and implicitly tick the tlbflush clock. */
-void write_cr3(unsigned long cr3);
+void write_cr3_cr4(unsigned long cr3, unsigned long cr4);
 
 /* flush_* flag fields: */
  /*
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid
  2018-03-21 12:51 [PATCH v3 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (4 preceding siblings ...)
  2018-03-21 12:51 ` [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active Juergen Gross
@ 2018-03-21 12:51 ` Juergen Gross
  2018-03-23  9:47   ` Jan Beulich
  2018-03-21 12:51 ` [PATCH v3 7/7] xen/x86: use PCID feature Juergen Gross
  2018-03-23 14:10 ` [PATCH v3 0/7] xen/x86: various XPTI speedups Dario Faggioli
  7 siblings, 1 reply; 49+ messages in thread
From: Juergen Gross @ 2018-03-21 12:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich, dfaggioli

Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.

Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.

This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V3:
- renamed use_xen_cr3 to better fitting use_pv_cr3
- corrected comment regarding semantics of use_pv_cr3 (Jan Beulich)
- prefer 32-bit operations over 8- or 16-bit ones (Jan Beulich)
---
 xen/arch/x86/domain.c              |  1 +
 xen/arch/x86/mm.c                  |  3 +-
 xen/arch/x86/smpboot.c             |  2 ++
 xen/arch/x86/x86_64/asm-offsets.c  |  1 +
 xen/arch/x86/x86_64/compat/entry.S |  5 ++-
 xen/arch/x86/x86_64/entry.S        | 67 +++++++++++++++++---------------------
 xen/include/asm-x86/current.h      | 12 ++++---
 7 files changed, 45 insertions(+), 46 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index fe18d9d3c4..6f3e29c677 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1693,6 +1693,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
     ASSERT(local_irq_is_enabled());
 
+    get_cpu_info()->use_pv_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c2738dd3fd..29071bb257 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -521,7 +521,8 @@ void write_ptbase(struct vcpu *v)
     }
     else
     {
-        /* Make sure to clear xen_cr3 before pv_cr3. */
+        /* Make sure to clear use_pv_cr3 and xen_cr3 before pv_cr3. */
+        cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
         /* write_cr3_cr4() serializes. */
         write_cr3_cr4(v->arch.cr3, new_cr4);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 07624da32b..8fa24cead8 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -330,6 +330,7 @@ void start_secondary(void *unused)
      */
     spin_debug_disable();
 
+    get_cpu_info()->use_pv_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
     get_cpu_info()->pv_cr3 = 0;
 
@@ -1128,6 +1129,7 @@ void __init smp_prepare_boot_cpu(void)
     per_cpu(scratch_cpumask, cpu) = &scratch_cpu0mask;
 #endif
 
+    get_cpu_info()->use_pv_cr3 = false;
     get_cpu_info()->xen_cr3 = 0;
     get_cpu_info()->pv_cr3 = 0;
 }
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 9e2aefb00f..7ad024cf37 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -144,6 +144,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
     OFFSET(CPUINFO_root_pgt_changed, struct cpu_info, root_pgt_changed);
+    OFFSET(CPUINFO_use_pv_cr3, struct cpu_info, use_pv_cr3);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 6d2a14eacf..a18598f106 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -211,10 +211,9 @@ ENTRY(cstar_enter)
         GET_STACK_END(bx)
 .Lcstar_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lcstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index eb33ec835a..f51d091c23 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -155,6 +155,7 @@ restore_all_guest:
         rep movsq
 .Lrag_copy_done:
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
+        movb  $1, STACK_CPUINFO_FIELD(use_pv_cr3)(%rdx)
         mov   %rax, %cr3
 .Lrag_cr3_end:
         ALTERNATIVE_NOP .Lrag_cr3_start, .Lrag_cr3_end, X86_FEATURE_NO_XPTI
@@ -205,14 +206,9 @@ restore_all_xen:
          */
         GET_STACK_END(bx)
 .Lrax_cr3_start:
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rdx
+        cmpb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
+UNLIKELY_START(ne, exit_cr3)
         mov   STACK_CPUINFO_FIELD(pv_cr3)(%rbx), %rax
-        test  %rdx, %rdx
-        /*
-         * Ideally the condition would be "nsz", but such doesn't exist,
-         * so "g" will have to do.
-         */
-UNLIKELY_START(g, exit_cr3)
         mov   %rax, %cr3
 UNLIKELY_END(exit_cr3)
 .Lrax_cr3_end:
@@ -257,10 +253,9 @@ ENTRY(lstar_enter)
         GET_STACK_END(bx)
 .Llstar_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Llstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
@@ -297,10 +292,9 @@ GLOBAL(sysenter_eflags_saved)
         /* PUSHF above has saved EFLAGS.IF clear (the caller had it set). */
         orl   $X86_EFLAGS_IF, UREGS_eflags(%rsp)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lsyse_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
@@ -350,10 +344,9 @@ ENTRY(int80_direct_trap)
         GET_STACK_END(bx)
 .Lint80_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lint80_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%rbx)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
@@ -561,25 +554,25 @@ ENTRY(common_interrupt)
 
 .Lintr_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_pv_cr3)(%r14), %bl
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lintr_cr3_okay
-        jns   .Lintr_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lintr_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3, UREGS_cs(%rsp)
         cmovnz %r12, %r15
+        cmovnz %r12, %rbx
 .Lintr_cr3_okay:
         ALTERNATIVE_NOP .Lintr_cr3_start, .Lintr_cr3_okay, X86_FEATURE_NO_XPTI
 
         CR4_PV32_RESTORE
         movq %rsp,%rdi
         callq do_IRQ
-        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)), \
+        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14);    \
+                                mov %bl, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)), \
                     "", X86_FEATURE_NO_XPTI
         jmp ret_from_intr
 
@@ -596,18 +589,17 @@ GLOBAL(handle_exception)
 
 .Lxcpt_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_pv_cr3)(%r14), %r13b
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .Lxcpt_cr3_okay
-        jns   .Lxcpt_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lxcpt_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3, UREGS_cs(%rsp)
         cmovnz %r12, %r15
+        cmovnz %r12, %r13
 .Lxcpt_cr3_okay:
         ALTERNATIVE_NOP .Lxcpt_cr3_start, .Lxcpt_cr3_okay, X86_FEATURE_NO_XPTI
 
@@ -662,7 +654,8 @@ handle_exception_saved:
         PERFC_INCR(exceptions, %rax, %rbx)
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
-        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)), \
+        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14);      \
+                                mov %r13b, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)), \
                     "", X86_FEATURE_NO_XPTI
         testb $3,UREGS_cs(%rsp)
         jz    restore_all_xen
@@ -696,7 +689,8 @@ exception_with_ints_disabled:
         rep;  movsq                     # make room for ec/ev
 1:      movq  UREGS_error_code(%rsp),%rax # ec/ev
         movq  %rax,UREGS_kernel_sizeof(%rsp)
-        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)), \
+        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14);      \
+                                mov %r13b, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)), \
                     "", X86_FEATURE_NO_XPTI
         jmp   restore_all_xen           # return to fixup code
 
@@ -785,9 +779,6 @@ ENTRY(double_fault)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx
         test  %rbx, %rbx
         jz    .Ldblf_cr3_okay
-        jns   .Ldblf_cr3_load
-        neg   %rbx
-.Ldblf_cr3_load:
         mov   %rbx, %cr3
 .Ldblf_cr3_okay:
 
@@ -817,13 +808,11 @@ handle_ist_exception:
 
 .List_cr3_start:
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   STACK_CPUINFO_FIELD(use_pv_cr3)(%r14), %bl
         mov   %rcx, %r15
-        neg   %rcx
+        test  %rcx, %rcx
         jz    .List_cr3_okay
-        jns   .List_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.List_cr3_load:
+        movb  $0, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)
         mov   %rcx, %cr3
         /* %r12 is still zero at this point. */
         mov   %r12, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
@@ -838,6 +827,7 @@ handle_ist_exception:
          * and copy the context to stack bottom.
          */
         xor   %r15, %r15
+        xor   %ebx, %ebx
         GET_CPUINFO_FIELD(guest_cpu_user_regs,di)
         movq  %rsp,%rsi
         movl  $UREGS_kernel_sizeof/8,%ecx
@@ -848,7 +838,8 @@ handle_ist_exception:
         leaq  exception_table(%rip),%rdx
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
-        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)), \
+        ALTERNATIVE __stringify(mov %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14);    \
+                                mov %bl, STACK_CPUINFO_FIELD(use_pv_cr3)(%r14)), \
                     "", X86_FEATURE_NO_XPTI
         cmpb  $TRAP_nmi,UREGS_entry_vector(%rsp)
         jne   ret_from_intr
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index b2475783f8..43bdec1f49 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -47,10 +47,7 @@ struct cpu_info {
      * context is being entered. A value of zero indicates no setting of CR3
      * is to be performed.
      * The former is the value to restore when re-entering Xen, if any. IOW
-     * its value being zero means there's nothing to restore. However, its
-     * value can also be negative, indicating to the exit-to-Xen code that
-     * restoring is not necessary, but allowing any nested entry code paths
-     * to still know the value to put back into CR3.
+     * its value being zero means there's nothing to restore.
      */
     unsigned long xen_cr3;
     unsigned long pv_cr3;
@@ -68,6 +65,13 @@ struct cpu_info {
      */
     bool         root_pgt_changed;
 
+    /*
+     * use_pv_cr3 is set in case the value of pv_cr3 is to be written into
+     * CR3 when returning from an interrupt. The main use is when returning
+     * from a NMI or MCE to hypervisor code where pv_cr3 was active.
+     */
+    bool         use_pv_cr3;
+
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v3 7/7] xen/x86: use PCID feature
  2018-03-21 12:51 [PATCH v3 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (5 preceding siblings ...)
  2018-03-21 12:51 ` [PATCH v3 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
@ 2018-03-21 12:51 ` Juergen Gross
  2018-03-23 10:51   ` Jan Beulich
       [not found]   ` <5AB4EA4902000078001B577B@suse.com>
  2018-03-23 14:10 ` [PATCH v3 0/7] xen/x86: various XPTI speedups Dario Faggioli
  7 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-21 12:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, andrew.cooper3, jbeulich, dfaggioli

Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.

We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:

- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)

We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.

As the non-XPTI case has shown not to perform better with PCID at least
on some machines the default is to use PCID only for domains subject to
XPTI.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V3:
- support PCID for non-XPTI case, too
- add command line parameter for controlling usage of PCID
- check PCID active by using cr4.pcide (Jan Beulich)
---
 docs/misc/xen-command-line.markdown | 12 +++++++++
 xen/arch/x86/debug.c                |  3 ++-
 xen/arch/x86/domain_page.c          |  2 +-
 xen/arch/x86/domctl.c               |  4 +++
 xen/arch/x86/flushtlb.c             | 49 ++++++++++++++++++++++++++++------
 xen/arch/x86/mm.c                   | 34 +++++++++++++++++++++---
 xen/arch/x86/pv/dom0_build.c        |  1 +
 xen/arch/x86/pv/domain.c            | 52 +++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h        | 14 +++++++---
 xen/include/asm-x86/pv/domain.h     |  2 ++
 xen/include/asm-x86/x86-defns.h     |  1 +
 11 files changed, 158 insertions(+), 16 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 8fc7b2ff3b..4ecf471ea9 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1450,6 +1450,18 @@ All numbers specified must be hexadecimal ones.
 
 This option can be specified more than once (up to 8 times at present).
 
+### pcid (x86)
+> `= off | all | xpti | noxpti`
+
+> Default: `xpti`
+
+> Can be modified at runtime
+
+If available, control usage of the PCID feature of the processor for
+64-bit pv-domains. PCID can be used either for no domain at all, for
+all of them, only for those subject to XPTI or for those not subject
+to XPTI.
+
 ### ple\_gap
 > `= <integer>`
 
diff --git a/xen/arch/x86/debug.c b/xen/arch/x86/debug.c
index 9159f32db4..c8079569c4 100644
--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t pgd3val)
     l3_pgentry_t l3e, *l3t;
     l2_pgentry_t l2e, *l2t;
     l1_pgentry_t l1e, *l1t;
-    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
+    unsigned long cr3 = (pgd3val ? pgd3val
+                                 : (dp->vcpu[0]->arch.cr3 & ~X86_CR3_NOFLUSH));
     mfn_t mfn = maddr_to_mfn(cr3);
 
     DBGP2("vaddr:%lx domid:%d cr3:%lx pgd3:%lx\n", vaddr, dp->domain_id, 
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index b5780f201f..8073ae5282 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
             sync_local_execstate();
         /* We must now be running on the idle page table. */
-        ASSERT(read_cr3() == __pa(idle_pg_table));
+        ASSERT((read_cr3() & ~X86_CR3_PCIDMASK) == __pa(idle_pg_table));
     }
 
     return v;
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 0704f398c7..a7c8772fa6 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -613,7 +613,11 @@ long arch_do_domctl(
             ret = -EINVAL;
 
         if ( ret == 0 )
+        {
             xpti_domain_init(d);
+            pcid_domain_init(d);
+        }
+
         break;
 
     case XEN_DOMCTL_get_address_size:
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index d4b8acc837..092ef86314 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -102,7 +102,19 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
     t = pre_flush();
 
     if ( read_cr4() & X86_CR4_PGE )
+        /*
+         * X86_CR4_PGE set means PCID being inactive.
+         * We have to purge the TLB via flipping cr4.pge.
+         */
         write_cr4(cr4 & ~X86_CR4_PGE);
+    else if ( cpu_has_invpcid )
+        /*
+         * If we are using PCID purge the TLB via INVPCID as loading cr3
+         * will affect the current PCID only.
+         * If INVPCID is not supported we don't use PCIDs so loading cr3
+         * will purge the TLB (we are in the "global pages off" branch).
+         */
+        invpcid_flush_all();
 
     asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
 
@@ -131,14 +143,35 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
     {
         if ( order == 0 )
         {
-            /*
-             * We don't INVLPG multi-page regions because the 2M/4M/1G
-             * region may not have been mapped with a superpage. Also there
-             * are various errata surrounding INVLPG usage on superpages, and
-             * a full flush is in any case not *that* expensive.
-             */
-            asm volatile ( "invlpg %0"
-                           : : "m" (*(const char *)(va)) : "memory" );
+            if ( read_cr4() & X86_CR4_PCIDE )
+            {
+                unsigned long addr = (unsigned long)va;
+
+                /*
+                 * Flush the addresses for all potential address spaces.
+                 * We can't check the current domain for being subject to
+                 * XPTI as current might be the idle vcpu while we still have
+                 * some XPTI domain TLB entries.
+                 * Using invpcid is okay here, as with PCID enabled we always
+                 * have global pages disabled.
+                 */
+                invpcid_flush_one(PCID_PV_PRIV, addr);
+                invpcid_flush_one(PCID_PV_USER, addr);
+                if ( !cpu_has_no_xpti )
+                {
+                    invpcid_flush_one(PCID_PV_PRIV | PCID_PV_XEN, addr);
+                    invpcid_flush_one(PCID_PV_USER | PCID_PV_XEN, addr);
+                }
+            }
+            else
+                /*
+                 * We don't INVLPG multi-page regions because the 2M/4M/1G
+                 * region may not have been mapped with a superpage. Also there
+                 * are various errata surrounding INVLPG usage on superpages,
+                 * and a full flush is in any case not *that* expensive.
+                 */
+                asm volatile ( "invlpg %0"
+                               : : "m" (*(const char *)(va)) : "memory" );
         }
         else
             do_tlb_flush();
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 29071bb257..242425c075 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -497,12 +497,38 @@ void free_shared_domheap_page(struct page_info *page)
     free_domheap_page(page);
 }
 
+/*
+ * Return additional PCID specific cr3 bits.
+ *
+ * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
+ * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in case
+ * the value is used to address the root page table.
+ */
+static unsigned long get_pcid_bits(struct vcpu *v, bool is_xen)
+{
+    return X86_CR3_NOFLUSH | (is_xen ? PCID_PV_XEN : 0) |
+           ((v->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);
+}
+
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
+    struct domain *d = v->domain;
+
     v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
-    if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
-         !is_pv_32bit_vcpu(v) )
-        get_cpu_info()->root_pgt_changed = true;
+    if ( is_pv_domain(d) )
+    {
+        if ( d->arch.pv_domain.xpti && v == current )
+        {
+            struct cpu_info *cpu_info = get_cpu_info();
+
+            cpu_info->root_pgt_changed = true;
+            if ( d->arch.pv_domain.pcid )
+                cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
+                                   get_pcid_bits(v, false);
+        }
+        if ( d->arch.pv_domain.pcid )
+            v->arch.cr3 |= get_pcid_bits(v, d->arch.pv_domain.xpti);
+    }
 }
 
 void write_ptbase(struct vcpu *v)
@@ -517,6 +543,8 @@ void write_ptbase(struct vcpu *v)
     {
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
+        if ( new_cr4 & X86_CR4_PCIDE )
+            cpu_info->pv_cr3 |= get_pcid_bits(v, false);
         write_cr3_cr4(v->arch.cr3, new_cr4);
     }
     else
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 77186c19bd..2af0094e95 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -709,6 +709,7 @@ int __init dom0_construct_pv(struct domain *d,
     }
 
     xpti_domain_init(d);
+    pcid_domain_init(d);
 
     d->arch.paging.mode = 0;
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 266117e804..46c050aeeb 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -95,6 +95,58 @@ void xpti_domain_init(struct domain *d)
     }
 }
 
+static __read_mostly enum {
+    PCID_OFF,
+    PCID_ALL,
+    PCID_XPTI,
+    PCID_NOXPTI
+} opt_pcid = PCID_XPTI;
+
+static __init int parse_pcid(const char *s)
+{
+    int rc = 0;
+
+    if ( !strcmp(s, "off") )
+        opt_pcid = PCID_OFF;
+    else if ( !strcmp(s, "all") )
+        opt_pcid = PCID_ALL;
+    else if ( !strcmp(s, "xpti") )
+        opt_pcid = PCID_XPTI;
+    else if ( !strcmp(s, "noxpti") )
+        opt_pcid = PCID_NOXPTI;
+    else
+        rc = -EINVAL;
+
+    return rc;
+}
+custom_runtime_param("pcid", parse_pcid);
+
+void pcid_domain_init(struct domain *d)
+{
+    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) ||
+         !cpu_has_invpcid || !cpu_has_pcid )
+        return;
+
+    switch ( opt_pcid )
+    {
+    case PCID_OFF:
+        d->arch.pv_domain.pcid = false;
+        break;
+    case PCID_ALL:
+        d->arch.pv_domain.pcid = true;
+        break;
+    case PCID_XPTI:
+        d->arch.pv_domain.pcid = d->arch.pv_domain.xpti;
+        break;
+    case PCID_NOXPTI:
+        d->arch.pv_domain.pcid = !d->arch.pv_domain.xpti;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
 static void noreturn continue_nonidle_domain(struct vcpu *v)
 {
     check_wakeup_from_wait();
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index ef9e9639fd..7751f9225b 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -255,6 +255,8 @@ struct pv_domain
 
     /* XPTI active? */
     bool xpti;
+    /* Use PCID feature? */
+    bool pcid;
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
@@ -262,6 +264,11 @@ struct pv_domain
     struct cpuidmasks *cpuidmasks;
 };
 
+/* PCID values for the address spaces of 64-bit pv domains: */
+#define PCID_PV_PRIV      0x0000    /* Used for other domains, too. */
+#define PCID_PV_USER      0x0001
+#define PCID_PV_XEN       0x0002    /* To be ORed to above values. */
+
 struct monitor_write_data {
     struct {
         unsigned int msr : 1;
@@ -619,14 +626,15 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
       | (mmu_cr4_features                                   \
          & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |      \
             X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
-            X86_CR4_FSGSBASE))                              \
-      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
+            X86_CR4_FSGSBASE | X86_CR4_PCIDE))              \
+      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0)          \
+      | ((v)->domain->arch.pv_domain.pcid ? X86_CR4_PCIDE : 0)) \
      & ~(X86_CR4_DE |                                       \
          ((v)->domain->arch.pv_domain.xpti ? X86_CR4_PGE : 0)))
 #define real_cr4_to_pv_guest_cr4(c)                         \
     ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
              X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-             X86_CR4_FSGSBASE | X86_CR4_SMAP))
+             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
 
 #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
 
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index 2213a8fb3d..8fd822ca83 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -29,6 +29,7 @@ void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d);
 void xpti_init(void);
 void xpti_domain_init(struct domain *d);
+void pcid_domain_init(struct domain *d);
 
 #else  /* !CONFIG_PV */
 
@@ -40,6 +41,7 @@ static inline void pv_domain_destroy(struct domain *d) {}
 static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
 static inline void xpti_init(void) {}
 static inline void xpti_domain_init(struct domain *d) {}
+static inline void pcid_domain_init(struct domain *d) {}
 #endif	/* CONFIG_PV */
 
 void paravirt_ctxt_switch_from(struct vcpu *v);
diff --git a/xen/include/asm-x86/x86-defns.h b/xen/include/asm-x86/x86-defns.h
index ff8d66be3c..e323d3c01f 100644
--- a/xen/include/asm-x86/x86-defns.h
+++ b/xen/include/asm-x86/x86-defns.h
@@ -46,6 +46,7 @@
  * Intel CPU flags in CR3
  */
 #define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
+#define X86_CR3_PCIDMASK _AC(0x0fff, ULL) /* Mask for PCID */
 
 /*
  * Intel CPU features in CR4
-- 
2.13.6


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-21 12:51 ` [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
@ 2018-03-22 14:31   ` Jan Beulich
       [not found]   ` <5AB3CC5502000078001B5254@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-22 14:31 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>          }
>      }
>  
> +    if ( flags & FLUSH_ROOT_PGTBL )
> +        get_cpu_info()->root_pgt_changed = true;
> +
>      local_irq_restore(irqfl);
>  
>      return flags;

Does this really need to sit inside the interrupts disabled section?

Thinking about it I even wonder whether the cache flush part needs
to be. Even for the INVLPG portion of the TLB flush part I can't
seem to see a need for IRQs to be off. I think it's really just the
pre_flush() / post_flush() pair which needs to be inside such a
section. I'll prepare a patch (for after 4.11). I think some of the
changes later in your series will actually further ease this.

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page)
>  void make_cr3(struct vcpu *v, mfn_t mfn)
>  {
>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
> +    if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
> +         !is_pv_32bit_vcpu(v) )
> +        get_cpu_info()->root_pgt_changed = true;
>  }

As this doesn't actually update CR3, setting the flag shouldn't
generally be necessary if the caller then invokes write_ptbase().
Isn't setting the flag here needed solely in the case of
_toggle_guest_pt() being up the call tree? In which case it would
perhaps better be set there (and in turn some or even all of the
conditional around it could be dropped)?

>  void write_ptbase(struct vcpu *v)
>  {
> +    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
> +        get_cpu_info()->root_pgt_changed = true;
>      write_cr3(v->arch.cr3);

When you come here from e.g. __sync_local_execstate(), you
don't really need to set the flag. Of course you'll come here again
before the next 64-bit PV vCPU will make it to restore_all_guest,
so by the time we make it there the flag will be set anyway.
However, if you already use such a subtlety, then there's also
no point excluding 32-bit vCPU-s here (nor in make_cr3()), as
those will never make it to restore_all_guest. Same then for
excluding HVM vCPU-s. And I then wonder whether (here or
more likely in a later patch) the root_pgt check couldn't go away
as well.

> @@ -3698,18 +3703,29 @@ long do_mmu_update(
>                          break;
>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> -                    /*
> -                     * No need to sync if all uses of the page can be accounted
> -                     * to the page lock we hold, its pinned status, and uses on
> -                     * this (v)CPU.
> -                     */
> -                    if ( !rc && !cpu_has_no_xpti &&
> -                         ((page->u.inuse.type_info & PGT_count_mask) >
> -                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
> -                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) 
> +
> -                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
> -                            mfn))) )
> -                        sync_guest = true;
> +                    if ( !rc && !cpu_has_no_xpti )
> +                    {
> +                        bool local_in_use = false;
> +
> +                        if ( (pagetable_get_pfn(curr->arch.guest_table) ==
> +                              mfn) ||
> +                             (pagetable_get_pfn(curr->arch.guest_table_user) ==
> +                              mfn) )
> +                        {
> +                            local_in_use = true;
> +                            get_cpu_info()->root_pgt_changed = true;
> +                        }

The conditional causes root_pgt_changed to get set even in cases
where what CR3 points to doesn't actually change (if it's the user
page tables that get modified). I think you want to check
curr->arch.cr3 here, or only curr->arch.guest_table (as user mode
can't invoke hypercalls).

> +                        /*
> +                         * No need to sync if all uses of the page can be
> +                         * accounted to the page lock we hold, its pinned
> +                         * status, and uses on this (v)CPU.
> +                         */
> +                        if ( (page->u.inuse.type_info & PGT_count_mask) >
> +                             (1 + !!(page->u.inuse.type_info & PGT_pinned) +
> +                              local_in_use) )

The boolean local_in_use evaluates to 1 here, when previously the
value could have been 1 or 2 (I agree that's highly theoretical, but
anyway). Of course this will be addressed implicitly if you check
(only) curr->arch.guest_table above and move the
curr->arch.guest_table_user check here.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  2018-03-21 12:51 ` [PATCH v3 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
@ 2018-03-22 14:50   ` Jan Beulich
  2018-03-23 12:35     ` Juergen Gross
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Beulich @ 2018-03-22 14:50 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
> When switching to a 64-bit pv context the TLB is flushed twice today:
> the first time when switching to the new address space in
> write_ptbase(), the second time when switching to guest mode in
> restore_to_guest.
> 
> Avoid the first TLB flush in that case.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> V3:
> - omit setting root_pgt_changed to false (Jan Beulich)
> ---
>  xen/arch/x86/mm.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 352600ad73..8c944b33c9 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -123,6 +123,7 @@
>  #include <asm/io_apic.h>
>  #include <asm/pci.h>
>  #include <asm/guest.h>
> +#include <asm/processor.h>
>  
>  #include <asm/hvm/grant_table.h>
>  #include <asm/pv/grant_table.h>
> @@ -507,8 +508,14 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>  void write_ptbase(struct vcpu *v)
>  {
>      if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
> +    {
>          get_cpu_info()->root_pgt_changed = true;
> -    write_cr3(v->arch.cr3);
> +        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
> +    }
> +    else
> +    {
> +        write_cr3(v->arch.cr3);
> +    }

Unnecessary braces. with that
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(This could be taken care of while committing, but the patch
depends on patch 1 anyway, which may see further
transformation.)

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/7] xen/x86: support per-domain flag for xpti
  2018-03-21 12:51 ` [PATCH v3 3/7] xen/x86: support per-domain flag for xpti Juergen Gross
@ 2018-03-22 15:26   ` Jan Beulich
       [not found]   ` <5AB3D92D02000078001B5326@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-22 15:26 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
> +void xpti_domain_init(struct domain *d)
> +{
> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
> +        return;

As you rely on the zero-initialization of the field here, ...

> +    switch ( opt_xpti )
> +    {
> +    case XPTI_OFF:
> +        d->arch.pv_domain.xpti = false;

... this could go away as well.

> @@ -1050,8 +1050,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>          panic("Error %d setting up PV root page table\n", rc);
>      if ( per_cpu(root_pgt, 0) )
>      {
> -        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
> -
> +        get_cpu_info()->pv_cr3 = 0;
>          /*
>           * All entry points which may need to switch page tables have to start
>           * with interrupts off. Re-write what pv_trap_init() has put there.

Please don't drop the blank line.

> @@ -36,7 +38,8 @@ static inline void pv_vcpu_destroy(struct vcpu *v) {}
>  static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
>  static inline void pv_domain_destroy(struct domain *d) {}
>  static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
> -
> +static inline void xpti_init(void) {}
> +static inline void xpti_domain_init(struct domain *d) {}
>  #endif	/* CONFIG_PV */

Same here. With that
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible
       [not found]   ` <5AB3CC5502000078001B5254@suse.com>
@ 2018-03-22 15:26     ` Juergen Gross
  2018-03-22 15:42       ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Juergen Gross @ 2018-03-22 15:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 22/03/18 15:31, Jan Beulich wrote:
>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>>          }
>>      }
>>  
>> +    if ( flags & FLUSH_ROOT_PGTBL )
>> +        get_cpu_info()->root_pgt_changed = true;
>> +
>>      local_irq_restore(irqfl);
>>  
>>      return flags;
> 
> Does this really need to sit inside the interrupts disabled section?

Hmm, no, I don't think so. I'll move it below local_irq_restore().

> Thinking about it I even wonder whether the cache flush part needs
> to be. Even for the INVLPG portion of the TLB flush part I can't
> seem to see a need for IRQs to be off. I think it's really just the
> pre_flush() / post_flush() pair which needs to be inside such a
> section. I'll prepare a patch (for after 4.11). I think some of the
> changes later in your series will actually further ease this.
> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page)
>>  void make_cr3(struct vcpu *v, mfn_t mfn)
>>  {
>>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
>> +    if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
>> +         !is_pv_32bit_vcpu(v) )
>> +        get_cpu_info()->root_pgt_changed = true;
>>  }
> 
> As this doesn't actually update CR3, setting the flag shouldn't
> generally be necessary if the caller then invokes write_ptbase().
> Isn't setting the flag here needed solely in the case of
> _toggle_guest_pt() being up the call tree? In which case it would
> perhaps better be set there (and in turn some or even all of the
> conditional around it could be dropped)?

Yes, you are right.

> 
>>  void write_ptbase(struct vcpu *v)
>>  {
>> +    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>> +        get_cpu_info()->root_pgt_changed = true;
>>      write_cr3(v->arch.cr3);
> 
> When you come here from e.g. __sync_local_execstate(), you
> don't really need to set the flag. Of course you'll come here again
> before the next 64-bit PV vCPU will make it to restore_all_guest,
> so by the time we make it there the flag will be set anyway.
> However, if you already use such a subtlety, then there's also
> no point excluding 32-bit vCPU-s here (nor in make_cr3()), as
> those will never make it to restore_all_guest. Same then for
> excluding HVM vCPU-s. And I then wonder whether (here or
> more likely in a later patch) the root_pgt check couldn't go away
> as well.

I'm not sure this is worth it. Patch 3 will re-introduce a conditional
here and it will look rather different (e.g. without the root_pgt
check). So micro-optimizing this patch barely makes any sense.

> 
>> @@ -3698,18 +3703,29 @@ long do_mmu_update(
>>                          break;
>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> -                    /*
>> -                     * No need to sync if all uses of the page can be accounted
>> -                     * to the page lock we hold, its pinned status, and uses on
>> -                     * this (v)CPU.
>> -                     */
>> -                    if ( !rc && !cpu_has_no_xpti &&
>> -                         ((page->u.inuse.type_info & PGT_count_mask) >
>> -                          (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>> -                           (pagetable_get_pfn(curr->arch.guest_table) == mfn) 
>> +
>> -                           (pagetable_get_pfn(curr->arch.guest_table_user) ==
>> -                            mfn))) )
>> -                        sync_guest = true;
>> +                    if ( !rc && !cpu_has_no_xpti )
>> +                    {
>> +                        bool local_in_use = false;
>> +
>> +                        if ( (pagetable_get_pfn(curr->arch.guest_table) ==
>> +                              mfn) ||
>> +                             (pagetable_get_pfn(curr->arch.guest_table_user) ==
>> +                              mfn) )
>> +                        {
>> +                            local_in_use = true;
>> +                            get_cpu_info()->root_pgt_changed = true;
>> +                        }
> 
> The conditional causes root_pgt_changed to get set even in cases
> where what CR3 points to doesn't actually change (if it's the user
> page tables that get modified). I think you want to check
> curr->arch.cr3 here, or only curr->arch.guest_table (as user mode
> can't invoke hypercalls).

I'll go with curr->arch.guest_table.

> 
>> +                        /*
>> +                         * No need to sync if all uses of the page can be
>> +                         * accounted to the page lock we hold, its pinned
>> +                         * status, and uses on this (v)CPU.
>> +                         */
>> +                        if ( (page->u.inuse.type_info & PGT_count_mask) >
>> +                             (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>> +                              local_in_use) )
> 
> The boolean local_in_use evaluates to 1 here, when previously the
> value could have been 1 or 2 (I agree that's highly theoretical, but
> anyway). Of course this will be addressed implicitly if you check
> (only) curr->arch.guest_table above and move the
> curr->arch.guest_table_user check here.

Yes.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/7] xen/x86: support per-domain flag for xpti
       [not found]   ` <5AB3D92D02000078001B5326@suse.com>
@ 2018-03-22 15:29     ` Juergen Gross
  2018-03-22 15:44       ` Jan Beulich
       [not found]       ` <5AB3DD6A02000078001B538B@suse.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-22 15:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 22/03/18 16:26, Jan Beulich wrote:
>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>> +void xpti_domain_init(struct domain *d)
>> +{
>> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
>> +        return;
> 
> As you rely on the zero-initialization of the field here, ...
> 
>> +    switch ( opt_xpti )
>> +    {
>> +    case XPTI_OFF:
>> +        d->arch.pv_domain.xpti = false;
> 
> ... this could go away as well.

I wanted to make the switch statement complete. No problem to drop
setting of xpti here of you like that better.

> 
>> @@ -1050,8 +1050,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>>          panic("Error %d setting up PV root page table\n", rc);
>>      if ( per_cpu(root_pgt, 0) )
>>      {
>> -        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
>> -
>> +        get_cpu_info()->pv_cr3 = 0;
>>          /*
>>           * All entry points which may need to switch page tables have to start
>>           * with interrupts off. Re-write what pv_trap_init() has put there.
> 
> Please don't drop the blank line.

Okay.

> 
>> @@ -36,7 +38,8 @@ static inline void pv_vcpu_destroy(struct vcpu *v) {}
>>  static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
>>  static inline void pv_domain_destroy(struct domain *d) {}
>>  static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
>> -
>> +static inline void xpti_init(void) {}
>> +static inline void xpti_domain_init(struct domain *d) {}
>>  #endif	/* CONFIG_PV */
> 
> Same here. With that
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks,


Juergen


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 4/7] xen/x86: use invpcid for flushing the TLB
  2018-03-21 12:51 ` [PATCH v3 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
@ 2018-03-22 15:35   ` Jan Beulich
       [not found]   ` <5AB3DB5102000078001B5358@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-22 15:35 UTC (permalink / raw)
  To: Andrew Cooper, Juergen Gross; +Cc: xen-devel, Dario Faggioli

>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -1380,6 +1380,14 @@ Because responsibility for APIC setup is shared between Xen and the
>  domain 0 kernel this option is automatically propagated to the domain
>  0 command line.
>  
> +### noinvpcid (x86)
> +> `= <boolean>`
> +
> +Disable using the INVPCID instruction for flushing TLB entries.
> +This should only be used in case of known issues on the current platform
> +with that instruction. Disabling INVPCID will normally result in a slightly
> +degraded performance.

At the first glance this looks as if it wants to be a cpuid=
sub-option. However, that would disable use by both Xen and
(HVM) guests. Andrew, what are your plans here as to
distinguishing the "Xen uses a feature" from the "disable use of
a feature altogether"?

If we stay with a separate option, then please make this a
normal boolean one (i.e. drop the "no" prefix), as "no-noinvpcid"
is rather ugly.

> @@ -457,7 +472,6 @@ static void generic_set_all(void)
>  			set_bit(count, &smp_changes_mask);
>  		mask >>= 1;
>  	}
> -	
>  }
>  
>  static void generic_set_mtrr(unsigned int reg, unsigned long base,

I don't mind this line being dropped, but in general please avoid
stray changes which aren't assimilated into changes you do anyway.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible
  2018-03-22 15:26     ` Juergen Gross
@ 2018-03-22 15:42       ` Jan Beulich
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-22 15:42 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 22.03.18 at 16:26, <jgross@suse.com> wrote:
> On 22/03/18 15:31, Jan Beulich wrote:
>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>  void write_ptbase(struct vcpu *v)
>>>  {
>>> +    if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>>> +        get_cpu_info()->root_pgt_changed = true;
>>>      write_cr3(v->arch.cr3);
>> 
>> When you come here from e.g. __sync_local_execstate(), you
>> don't really need to set the flag. Of course you'll come here again
>> before the next 64-bit PV vCPU will make it to restore_all_guest,
>> so by the time we make it there the flag will be set anyway.
>> However, if you already use such a subtlety, then there's also
>> no point excluding 32-bit vCPU-s here (nor in make_cr3()), as
>> those will never make it to restore_all_guest. Same then for
>> excluding HVM vCPU-s. And I then wonder whether (here or
>> more likely in a later patch) the root_pgt check couldn't go away
>> as well.
> 
> I'm not sure this is worth it. Patch 3 will re-introduce a conditional
> here and it will look rather different (e.g. without the root_pgt
> check). So micro-optimizing this patch barely makes any sense.

Yes, I've seen that once I made it there. Perhaps worth dropping
the two is_*() checks here, but not worry about the root_pgt one.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/7] xen/x86: support per-domain flag for xpti
  2018-03-22 15:29     ` Juergen Gross
@ 2018-03-22 15:44       ` Jan Beulich
       [not found]       ` <5AB3DD6A02000078001B538B@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-22 15:44 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 22.03.18 at 16:29, <jgross@suse.com> wrote:
> On 22/03/18 16:26, Jan Beulich wrote:
>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>> +void xpti_domain_init(struct domain *d)
>>> +{
>>> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
>>> +        return;
>> 
>> As you rely on the zero-initialization of the field here, ...
>> 
>>> +    switch ( opt_xpti )
>>> +    {
>>> +    case XPTI_OFF:
>>> +        d->arch.pv_domain.xpti = false;
>> 
>> ... this could go away as well.
> 
> I wanted to make the switch statement complete. No problem to drop
> setting of xpti here of you like that better.

FAOD I didn't mean dropping the entire case block.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-21 12:51 ` [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active Juergen Gross
@ 2018-03-22 16:30   ` Jan Beulich
       [not found]   ` <5AB3E82402000078001B5408@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-22 16:30 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
> Instead of flushing the TLB from global pages when switching address
> spaces with XPTI being active just disable global pages via %cr4
> completely when a domain subject to XPTI is active. This avoids the
> need for extra TLB flushes as loading %cr3 will remove all TLB
> entries.

I continue to be not entirely convinced of this move. I had an
alternative in mind: Since retaining global pages is particularly
relevant for switches between guest user and guest kernel
modes, what if we made a shortcut from e.g. lstar_enter through
switch_to_kernel to restore_all_guest without ever switching to
the full page Xen tables?

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>  void write_ptbase(struct vcpu *v)
>  {
>      struct cpu_info *cpu_info = get_cpu_info();
> +    unsigned long new_cr4;
> +
> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;

I'm not overly happy to see any new uses of mmu_cr4_features.
This should really only be used for priming certain values imo,
which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
so too, and perhaps better wouldn't). Hence I wonder whether
this shouldn't be read_cr4() | X86_CR4_PGE, not the least
because we've just got rid of the blanket reversion to
mmu_cr4_features in VMX code.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 4/7] xen/x86: use invpcid for flushing the TLB
       [not found]   ` <5AB3DB5102000078001B5358@suse.com>
@ 2018-03-22 18:03     ` Juergen Gross
  0 siblings, 0 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-22 18:03 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: xen-devel, Dario Faggioli

On 22/03/18 16:35, Jan Beulich wrote:
>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>> --- a/docs/misc/xen-command-line.markdown
>> +++ b/docs/misc/xen-command-line.markdown
>> @@ -1380,6 +1380,14 @@ Because responsibility for APIC setup is shared between Xen and the
>>  domain 0 kernel this option is automatically propagated to the domain
>>  0 command line.
>>  
>> +### noinvpcid (x86)
>> +> `= <boolean>`
>> +
>> +Disable using the INVPCID instruction for flushing TLB entries.
>> +This should only be used in case of known issues on the current platform
>> +with that instruction. Disabling INVPCID will normally result in a slightly
>> +degraded performance.
> 
> At the first glance this looks as if it wants to be a cpuid=
> sub-option. However, that would disable use by both Xen and
> (HVM) guests. Andrew, what are your plans here as to
> distinguishing the "Xen uses a feature" from the "disable use of
> a feature altogether"?
> 
> If we stay with a separate option, then please make this a
> normal boolean one (i.e. drop the "no" prefix), as "no-noinvpcid"
> is rather ugly.

Okay.

> 
>> @@ -457,7 +472,6 @@ static void generic_set_all(void)
>>  			set_bit(count, &smp_changes_mask);
>>  		mask >>= 1;
>>  	}
>> -	
>>  }
>>  
>>  static void generic_set_mtrr(unsigned int reg, unsigned long base,
> 
> I don't mind this line being dropped, but in general please avoid
> stray changes which aren't assimilated into changes you do anyway.

The main reason I did drop this line was the trailing tab. I just took
the risk of someone complaining.


Juergen


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 3/7] xen/x86: support per-domain flag for xpti
       [not found]       ` <5AB3DD6A02000078001B538B@suse.com>
@ 2018-03-22 18:05         ` Juergen Gross
  0 siblings, 0 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-22 18:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 22/03/18 16:44, Jan Beulich wrote:
>>>> On 22.03.18 at 16:29, <jgross@suse.com> wrote:
>> On 22/03/18 16:26, Jan Beulich wrote:
>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>> +void xpti_domain_init(struct domain *d)
>>>> +{
>>>> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) )
>>>> +        return;
>>>
>>> As you rely on the zero-initialization of the field here, ...
>>>
>>>> +    switch ( opt_xpti )
>>>> +    {
>>>> +    case XPTI_OFF:
>>>> +        d->arch.pv_domain.xpti = false;
>>>
>>> ... this could go away as well.
>>
>> I wanted to make the switch statement complete. No problem to drop
>> setting of xpti here of you like that better.
> 
> FAOD I didn't mean dropping the entire case block.

Of course not. This would just be wrong with the current default block.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
       [not found]   ` <5AB3E82402000078001B5408@suse.com>
@ 2018-03-22 18:18     ` Juergen Gross
  2018-03-23  7:46       ` Jan Beulich
       [not found]       ` <5AB4BEED02000078001B563B@suse.com>
  2018-03-27  7:14     ` Juergen Gross
  1 sibling, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-22 18:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 22/03/18 17:30, Jan Beulich wrote:
>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
> 
> I continue to be not entirely convinced of this move. I had an
> alternative in mind: Since retaining global pages is particularly
> relevant for switches between guest user and guest kernel
> modes, what if we made a shortcut from e.g. lstar_enter through
> switch_to_kernel to restore_all_guest without ever switching to
> the full page Xen tables?

With patch 7 of this series in mind I'm not convinced the extra effort
is really making sense. Today most processors do have PCID support so
for that old hardware I don't think we need to make the handling even
more complex.

> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  void write_ptbase(struct vcpu *v)
>>  {
>>      struct cpu_info *cpu_info = get_cpu_info();
>> +    unsigned long new_cr4;
>> +
>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
> 
> I'm not overly happy to see any new uses of mmu_cr4_features.
> This should really only be used for priming certain values imo,
> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
> so too, and perhaps better wouldn't). Hence I wonder whether
> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
> because we've just got rid of the blanket reversion to
> mmu_cr4_features in VMX code.

I do understand that using mmu_cr4_features isn't the best way to set
cr4. But I think it is a good idea to have a default value which should
normally be used instead of only switching various bits on and off.

In case cr4 is loaded with a strange value in some corner case that
value might be used from then on instead of being repaired by loading a
dedicated value at certain points in time, e.g. when doing a context
switch.

So maybe we should introduce cr4_default which is derived from
mmu_cr4_features? mmu_cr4_features would contain all bits which are
allowed on the current processor with the current command line options,
while cr4_default would be a subset of mmu_cr4_features.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-22 18:18     ` Juergen Gross
@ 2018-03-23  7:46       ` Jan Beulich
       [not found]       ` <5AB4BEED02000078001B563B@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-23  7:46 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 22.03.18 at 19:18, <jgross@suse.com> wrote:
> On 22/03/18 17:30, Jan Beulich wrote:
>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>> Instead of flushing the TLB from global pages when switching address
>>> spaces with XPTI being active just disable global pages via %cr4
>>> completely when a domain subject to XPTI is active. This avoids the
>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>> entries.
>> 
>> I continue to be not entirely convinced of this move. I had an
>> alternative in mind: Since retaining global pages is particularly
>> relevant for switches between guest user and guest kernel
>> modes, what if we made a shortcut from e.g. lstar_enter through
>> switch_to_kernel to restore_all_guest without ever switching to
>> the full page Xen tables?
> 
> With patch 7 of this series in mind I'm not convinced the extra effort
> is really making sense. Today most processors do have PCID support so
> for that old hardware I don't think we need to make the handling even
> more complex.

PCID yes, but INVPCID (and we're talking about Intel hardware
here only anyway)? But yes, the extra complexity is what has kept
me so far from investing time here.

>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>  void write_ptbase(struct vcpu *v)
>>>  {
>>>      struct cpu_info *cpu_info = get_cpu_info();
>>> +    unsigned long new_cr4;
>>> +
>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>> 
>> I'm not overly happy to see any new uses of mmu_cr4_features.
>> This should really only be used for priming certain values imo,
>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>> so too, and perhaps better wouldn't). Hence I wonder whether
>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>> because we've just got rid of the blanket reversion to
>> mmu_cr4_features in VMX code.
> 
> I do understand that using mmu_cr4_features isn't the best way to set
> cr4. But I think it is a good idea to have a default value which should
> normally be used instead of only switching various bits on and off.
> 
> In case cr4 is loaded with a strange value in some corner case that
> value might be used from then on instead of being repaired by loading a
> dedicated value at certain points in time, e.g. when doing a context
> switch.

But that would make it even more difficult to notice and debug a
possible problem. The more we play with CR4 bits, the more
important it is that we keep an accurate record of what is currently
loaded into it, and that we have a clear understanding of what we
mean to be loaded into the register at any given point in time.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
       [not found]       ` <5AB4BEED02000078001B563B@suse.com>
@ 2018-03-23  7:58         ` Juergen Gross
  2018-03-23  8:14           ` Jan Beulich
       [not found]           ` <5AB4C56602000078001B567F@suse.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-23  7:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 23/03/18 08:46, Jan Beulich wrote:
>>>> On 22.03.18 at 19:18, <jgross@suse.com> wrote:
>> On 22/03/18 17:30, Jan Beulich wrote:
>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>> Instead of flushing the TLB from global pages when switching address
>>>> spaces with XPTI being active just disable global pages via %cr4
>>>> completely when a domain subject to XPTI is active. This avoids the
>>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>>> entries.
>>>
>>> I continue to be not entirely convinced of this move. I had an
>>> alternative in mind: Since retaining global pages is particularly
>>> relevant for switches between guest user and guest kernel
>>> modes, what if we made a shortcut from e.g. lstar_enter through
>>> switch_to_kernel to restore_all_guest without ever switching to
>>> the full page Xen tables?
>>
>> With patch 7 of this series in mind I'm not convinced the extra effort
>> is really making sense. Today most processors do have PCID support so
>> for that old hardware I don't think we need to make the handling even
>> more complex.
> 
> PCID yes, but INVPCID (and we're talking about Intel hardware
> here only anyway)? But yes, the extra complexity is what has kept
> me so far from investing time here.

As PCID seems to speed up XPTI only and XPTI is necessary for INTEL cpus
only I don't see a problem here. :-)

> 
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>  void write_ptbase(struct vcpu *v)
>>>>  {
>>>>      struct cpu_info *cpu_info = get_cpu_info();
>>>> +    unsigned long new_cr4;
>>>> +
>>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>
>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>> This should really only be used for priming certain values imo,
>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>> because we've just got rid of the blanket reversion to
>>> mmu_cr4_features in VMX code.
>>
>> I do understand that using mmu_cr4_features isn't the best way to set
>> cr4. But I think it is a good idea to have a default value which should
>> normally be used instead of only switching various bits on and off.
>>
>> In case cr4 is loaded with a strange value in some corner case that
>> value might be used from then on instead of being repaired by loading a
>> dedicated value at certain points in time, e.g. when doing a context
>> switch.
> 
> But that would make it even more difficult to notice and debug a
> possible problem. The more we play with CR4 bits, the more
> important it is that we keep an accurate record of what is currently
> loaded into it, and that we have a clear understanding of what we
> mean to be loaded into the register at any given point in time.

What about adding an appropriate ASSERT() for that case?

So I would use read_cr4() | X86_CR4_PGE and ASSERT() that cr4 matches
the default value.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-23  7:58         ` Juergen Gross
@ 2018-03-23  8:14           ` Jan Beulich
       [not found]           ` <5AB4C56602000078001B567F@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-23  8:14 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 23.03.18 at 08:58, <jgross@suse.com> wrote:
> On 23/03/18 08:46, Jan Beulich wrote:
>>>>> On 22.03.18 at 19:18, <jgross@suse.com> wrote:
>>> On 22/03/18 17:30, Jan Beulich wrote:
>>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>>> Instead of flushing the TLB from global pages when switching address
>>>>> spaces with XPTI being active just disable global pages via %cr4
>>>>> completely when a domain subject to XPTI is active. This avoids the
>>>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>>>> entries.
>>>>
>>>> I continue to be not entirely convinced of this move. I had an
>>>> alternative in mind: Since retaining global pages is particularly
>>>> relevant for switches between guest user and guest kernel
>>>> modes, what if we made a shortcut from e.g. lstar_enter through
>>>> switch_to_kernel to restore_all_guest without ever switching to
>>>> the full page Xen tables?
>>>
>>> With patch 7 of this series in mind I'm not convinced the extra effort
>>> is really making sense. Today most processors do have PCID support so
>>> for that old hardware I don't think we need to make the handling even
>>> more complex.
>> 
>> PCID yes, but INVPCID (and we're talking about Intel hardware
>> here only anyway)? But yes, the extra complexity is what has kept
>> me so far from investing time here.
> 
> As PCID seems to speed up XPTI only and XPTI is necessary for INTEL cpus
> only I don't see a problem here. :-)

Well, yes as far as AMD is unaffected, but not entirely yes to the
rest, as I intentionally pointed out the difference of availability of
PCID (which even Westmere's have) and INVPCID (which only my
Haswell has).

>>>>> --- a/xen/arch/x86/mm.c
>>>>> +++ b/xen/arch/x86/mm.c
>>>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>>  void write_ptbase(struct vcpu *v)
>>>>>  {
>>>>>      struct cpu_info *cpu_info = get_cpu_info();
>>>>> +    unsigned long new_cr4;
>>>>> +
>>>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>>
>>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>>> This should really only be used for priming certain values imo,
>>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>>> because we've just got rid of the blanket reversion to
>>>> mmu_cr4_features in VMX code.
>>>
>>> I do understand that using mmu_cr4_features isn't the best way to set
>>> cr4. But I think it is a good idea to have a default value which should
>>> normally be used instead of only switching various bits on and off.
>>>
>>> In case cr4 is loaded with a strange value in some corner case that
>>> value might be used from then on instead of being repaired by loading a
>>> dedicated value at certain points in time, e.g. when doing a context
>>> switch.
>> 
>> But that would make it even more difficult to notice and debug a
>> possible problem. The more we play with CR4 bits, the more
>> important it is that we keep an accurate record of what is currently
>> loaded into it, and that we have a clear understanding of what we
>> mean to be loaded into the register at any given point in time.
> 
> What about adding an appropriate ASSERT() for that case?
> 
> So I would use read_cr4() | X86_CR4_PGE and ASSERT() that cr4 matches
> the default value.

That's an option; later we may want to sprinkle around a few more
such assertions.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
       [not found]           ` <5AB4C56602000078001B567F@suse.com>
@ 2018-03-23  8:29             ` Juergen Gross
  2018-03-23  8:51               ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Juergen Gross @ 2018-03-23  8:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 23/03/18 09:14, Jan Beulich wrote:
>>>> On 23.03.18 at 08:58, <jgross@suse.com> wrote:
>> On 23/03/18 08:46, Jan Beulich wrote:
>>>>>> On 22.03.18 at 19:18, <jgross@suse.com> wrote:
>>>> On 22/03/18 17:30, Jan Beulich wrote:
>>>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>>>> Instead of flushing the TLB from global pages when switching address
>>>>>> spaces with XPTI being active just disable global pages via %cr4
>>>>>> completely when a domain subject to XPTI is active. This avoids the
>>>>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>>>>> entries.
>>>>>
>>>>> I continue to be not entirely convinced of this move. I had an
>>>>> alternative in mind: Since retaining global pages is particularly
>>>>> relevant for switches between guest user and guest kernel
>>>>> modes, what if we made a shortcut from e.g. lstar_enter through
>>>>> switch_to_kernel to restore_all_guest without ever switching to
>>>>> the full page Xen tables?
>>>>
>>>> With patch 7 of this series in mind I'm not convinced the extra effort
>>>> is really making sense. Today most processors do have PCID support so
>>>> for that old hardware I don't think we need to make the handling even
>>>> more complex.
>>>
>>> PCID yes, but INVPCID (and we're talking about Intel hardware
>>> here only anyway)? But yes, the extra complexity is what has kept
>>> me so far from investing time here.
>>
>> As PCID seems to speed up XPTI only and XPTI is necessary for INTEL cpus
>> only I don't see a problem here. :-)
> 
> Well, yes as far as AMD is unaffected, but not entirely yes to the
> rest, as I intentionally pointed out the difference of availability of
> PCID (which even Westmere's have) and INVPCID (which only my
> Haswell has).

Haswells are out for nearly 5 years now.

I think in case we want to do something else here we should delay that
to 4.12. Even without PCID this patch is speeding up XPTI handling
significantly (parallel hypervisor compilation: elapsed time -6%,
system time -12%), so I'm not making anything worse.

BTW: while Westmere is supporting PCID I remember having done some PCID
testing with Westmere not showing any performance gains at all.

> 
>>>>>> --- a/xen/arch/x86/mm.c
>>>>>> +++ b/xen/arch/x86/mm.c
>>>>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>>>  void write_ptbase(struct vcpu *v)
>>>>>>  {
>>>>>>      struct cpu_info *cpu_info = get_cpu_info();
>>>>>> +    unsigned long new_cr4;
>>>>>> +
>>>>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>>>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>>>
>>>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>>>> This should really only be used for priming certain values imo,
>>>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>>>> because we've just got rid of the blanket reversion to
>>>>> mmu_cr4_features in VMX code.
>>>>
>>>> I do understand that using mmu_cr4_features isn't the best way to set
>>>> cr4. But I think it is a good idea to have a default value which should
>>>> normally be used instead of only switching various bits on and off.
>>>>
>>>> In case cr4 is loaded with a strange value in some corner case that
>>>> value might be used from then on instead of being repaired by loading a
>>>> dedicated value at certain points in time, e.g. when doing a context
>>>> switch.
>>>
>>> But that would make it even more difficult to notice and debug a
>>> possible problem. The more we play with CR4 bits, the more
>>> important it is that we keep an accurate record of what is currently
>>> loaded into it, and that we have a clear understanding of what we
>>> mean to be loaded into the register at any given point in time.
>>
>> What about adding an appropriate ASSERT() for that case?
>>
>> So I would use read_cr4() | X86_CR4_PGE and ASSERT() that cr4 matches
>> the default value.
> 
> That's an option; later we may want to sprinkle around a few more
> such assertions.

Okay, I'll go that route then. Do you want me to use a new default_cr4
variable for doing the assertion (and as initial cr4 value for secondary
cpus), or are you fine with using mmu_cr4_features for now?


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-23  8:29             ` Juergen Gross
@ 2018-03-23  8:51               ` Jan Beulich
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-23  8:51 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 23.03.18 at 09:29, <jgross@suse.com> wrote:
> On 23/03/18 09:14, Jan Beulich wrote:
>>>>> On 23.03.18 at 08:58, <jgross@suse.com> wrote:
>>> On 23/03/18 08:46, Jan Beulich wrote:
>>>>>>> On 22.03.18 at 19:18, <jgross@suse.com> wrote:
>>>>> On 22/03/18 17:30, Jan Beulich wrote:
>>>>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>>>>> --- a/xen/arch/x86/mm.c
>>>>>>> +++ b/xen/arch/x86/mm.c
>>>>>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>>>>  void write_ptbase(struct vcpu *v)
>>>>>>>  {
>>>>>>>      struct cpu_info *cpu_info = get_cpu_info();
>>>>>>> +    unsigned long new_cr4;
>>>>>>> +
>>>>>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>>>>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>>>>
>>>>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>>>>> This should really only be used for priming certain values imo,
>>>>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>>>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>>>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>>>>> because we've just got rid of the blanket reversion to
>>>>>> mmu_cr4_features in VMX code.
>>>>>
>>>>> I do understand that using mmu_cr4_features isn't the best way to set
>>>>> cr4. But I think it is a good idea to have a default value which should
>>>>> normally be used instead of only switching various bits on and off.
>>>>>
>>>>> In case cr4 is loaded with a strange value in some corner case that
>>>>> value might be used from then on instead of being repaired by loading a
>>>>> dedicated value at certain points in time, e.g. when doing a context
>>>>> switch.
>>>>
>>>> But that would make it even more difficult to notice and debug a
>>>> possible problem. The more we play with CR4 bits, the more
>>>> important it is that we keep an accurate record of what is currently
>>>> loaded into it, and that we have a clear understanding of what we
>>>> mean to be loaded into the register at any given point in time.
>>>
>>> What about adding an appropriate ASSERT() for that case?
>>>
>>> So I would use read_cr4() | X86_CR4_PGE and ASSERT() that cr4 matches
>>> the default value.
>> 
>> That's an option; later we may want to sprinkle around a few more
>> such assertions.
> 
> Okay, I'll go that route then. Do you want me to use a new default_cr4
> variable for doing the assertion (and as initial cr4 value for secondary
> cpus), or are you fine with using mmu_cr4_features for now?

I'd prefer if we could get away without yet another variable holding
some variant of possible CR4 values. As a follow-up we'll then want
to switch pv_guest_cr4_to_real_cr4() away from using
mmu_cr4_features as well (except for a possible assertion to put
there).

And btw, please consider re-organizing your change to
pv_guest_cr4_to_real_cr4() to match what we do for X86_CR4_TSD,
rather than first ORing in X86_CR4_PGE in order to then
(conditionally) mask it back out.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid
  2018-03-21 12:51 ` [PATCH v3 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
@ 2018-03-23  9:47   ` Jan Beulich
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-23  9:47 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
> Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
> be switched on entry to Xen, or negative for keeping the value while
> indicating not to restore %cr3, or positive in case %cr3 is to be
> restored.
> 
> Switch to use a flag byte instead of a negative xen_cr3 value in order
> to allow %cr3 values with the high bit set in case we want to keep TLB
> entries when using the PCID feature.
> 
> This reduces the number of branches in interrupt handling and results
> in better performance (e.g. parallel make of the Xen hypervisor on my
> system was using about 3% less system time).
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
  2018-03-21 12:51 ` [PATCH v3 7/7] xen/x86: use PCID feature Juergen Gross
@ 2018-03-23 10:51   ` Jan Beulich
       [not found]   ` <5AB4EA4902000078001B577B@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-23 10:51 UTC (permalink / raw)
  To: Jun Nakajima, Kevin Tian, Juergen Gross
  Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
> Avoid flushing the complete TLB when switching %cr3 for mitigation of
> Meltdown by using the PCID feature if available.
> 
> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
> 2 values for the non-XPTI case:
> 
> - guest active and in kernel mode
> - guest active and in user mode
> - hypervisor active and guest in user mode (XPTI only)
> - hypervisor active and guest in kernel mode (XPTI only)

Before committing to this route, Jun, Kevin, can we please get
confirmation that PCID isn't (and isn't going to be) subject to the
same speculation issues in the pipeline that the U/S bit is (having
caused Meltdown in the first place)? To me it seems a pretty
likely thing to play all the same games.

> ---
>  docs/misc/xen-command-line.markdown | 12 +++++++++
>  xen/arch/x86/debug.c                |  3 ++-
>  xen/arch/x86/domain_page.c          |  2 +-
>  xen/arch/x86/domctl.c               |  4 +++
>  xen/arch/x86/flushtlb.c             | 49 ++++++++++++++++++++++++++++------
>  xen/arch/x86/mm.c                   | 34 +++++++++++++++++++++---
>  xen/arch/x86/pv/dom0_build.c        |  1 +
>  xen/arch/x86/pv/domain.c            | 52 
> +++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/domain.h        | 14 +++++++---
>  xen/include/asm-x86/pv/domain.h     |  2 ++
>  xen/include/asm-x86/x86-defns.h     |  1 +
>  11 files changed, 158 insertions(+), 16 deletions(-)

Having had the discussion previously, I'm missing a change to
smp.c:new_tlbflush_clock_period() here.

> --- a/xen/arch/x86/debug.c
> +++ b/xen/arch/x86/debug.c
> @@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t pgd3val)
>      l3_pgentry_t l3e, *l3t;
>      l2_pgentry_t l2e, *l2t;
>      l1_pgentry_t l1e, *l1t;
> -    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
> +    unsigned long cr3 = (pgd3val ? pgd3val
> +                                 : (dp->vcpu[0]->arch.cr3 & ~X86_CR3_NOFLUSH));

What about the PCID portion? You want the address of the page
here, so I think you should use a "white-listing" masking operation
instead of a blacklisting one.

Also, as you touch this anyway, it would have been nice to drop
the unnecessary middle argument at the same time.

> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -102,7 +102,19 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>      t = pre_flush();
>  
>      if ( read_cr4() & X86_CR4_PGE )
> +        /*
> +         * X86_CR4_PGE set means PCID being inactive.
> +         * We have to purge the TLB via flipping cr4.pge.
> +         */
>          write_cr4(cr4 & ~X86_CR4_PGE);
> +    else if ( cpu_has_invpcid )
> +        /*
> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
> +         * will affect the current PCID only.
> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
> +         * will purge the TLB (we are in the "global pages off" branch).
> +         */
> +        invpcid_flush_all();

Since with CR4.PGE correctness-wise clear it doesn't matter whether
you use invpcid_flush_all() or invpcid_flush_all_nonglobals() here,
does it also not matter performance-wise?

> @@ -131,14 +143,35 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>      {
>          if ( order == 0 )
>          {
> -            /*
> -             * We don't INVLPG multi-page regions because the 2M/4M/1G
> -             * region may not have been mapped with a superpage. Also there
> -             * are various errata surrounding INVLPG usage on superpages, and
> -             * a full flush is in any case not *that* expensive.
> -             */

This comment really explains the order == 0 check above, so I
don't think it should be moved.

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -497,12 +497,38 @@ void free_shared_domheap_page(struct page_info *page)
>      free_domheap_page(page);
>  }
>  
> +/*
> + * Return additional PCID specific cr3 bits.
> + *
> + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
> + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in case

I stand to my previous comment, which was left unanswered afaics:
"Is it a good idea to suppress the flush this way for every reader
 of v->arch.cr3? For example, what about the use in
 dbg_pv_va2mfn()? I think it should be the consumers of the field
 to decide whether to OR in that flag (which isn't part of the
 register value anyway)."
To be more precise, I can see the pcid to be put here (which will
require consumers to be aware anyway), but I don't think the non-
register-value no-flush indicator belongs here. IOW I think after
writing the value into %cr3, the value read back should match the
stored value.

> + * the value is used to address the root page table.
> + */
> +static unsigned long get_pcid_bits(struct vcpu *v, bool is_xen)
> +{
> +    return X86_CR3_NOFLUSH | (is_xen ? PCID_PV_XEN : 0) |
> +           ((v->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);
> +}
> +
>  void make_cr3(struct vcpu *v, mfn_t mfn)
>  {
> +    struct domain *d = v->domain;
> +
>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
> -    if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
> -         !is_pv_32bit_vcpu(v) )
> -        get_cpu_info()->root_pgt_changed = true;
> +    if ( is_pv_domain(d) )
> +    {
> +        if ( d->arch.pv_domain.xpti && v == current )
> +        {
> +            struct cpu_info *cpu_info = get_cpu_info();
> +
> +            cpu_info->root_pgt_changed = true;
> +            if ( d->arch.pv_domain.pcid )
> +                cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
> +                                   get_pcid_bits(v, false);

Just like we seem to have agreed (on an earlier patch) that
setting root_pgt_changed elsewhere may be better, perhaps
that extends to pv_cr3 as well? It would certainly be nice if
this was written with the final intended value right away, yet
the lack of an "else" here suggests there is at least one place
where this doesn't happen, but pv_cr3 is also written with a
non-zero value.

> +        }
> +        if ( d->arch.pv_domain.pcid )
> +            v->arch.cr3 |= get_pcid_bits(v, d->arch.pv_domain.xpti);

It is certainly at least confusing that you pass "xpti" as argument
for a parameter named "is_xen". The question is what mode you
want us to be in when running with PCID but no XPTI: Should Xen
use its own PCID then? That would seem reasonable to me, but
would seem to require passing true here. Yet then this would
require switching CR3 on the way out to guests and back in from
guests even in that case (just without copying the root page table).

If in that mode Xen isn't meant to use its own PCID, the command
line option "pcid=noxpti" would seem at least misleading to me then,
as you wouldn't really use different PCIDs in that mode, but only
disable global pages (which probably hurts performance) and use
INVPCID for flushing (which probably helps performance).

> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -95,6 +95,58 @@ void xpti_domain_init(struct domain *d)
>      }
>  }
>  
> +static __read_mostly enum {
> +    PCID_OFF,
> +    PCID_ALL,
> +    PCID_XPTI,
> +    PCID_NOXPTI
> +} opt_pcid = PCID_XPTI;
> +
> +static __init int parse_pcid(const char *s)
> +{
> +    int rc = 0;
> +
> +    if ( !strcmp(s, "off") )
> +        opt_pcid = PCID_OFF;
> +    else if ( !strcmp(s, "all") )
> +        opt_pcid = PCID_ALL;
> +    else if ( !strcmp(s, "xpti") )
> +        opt_pcid = PCID_XPTI;
> +    else if ( !strcmp(s, "noxpti") )
> +        opt_pcid = PCID_NOXPTI;

I'd prefer if you used parse_bool() and parse_boolean() here, so
that the syntax of this new option fits with all other boolean ones
we have.

> +void pcid_domain_init(struct domain *d)
> +{
> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) ||
> +         !cpu_has_invpcid || !cpu_has_pcid )
> +        return;
> +
> +    switch ( opt_pcid )
> +    {
> +    case PCID_OFF:
> +        d->arch.pv_domain.pcid = false;

As for the earlier patch, with the return above implying the value
to be zero this store ought to be unnecessary.

> @@ -619,14 +626,15 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
>        | (mmu_cr4_features                                   \
>           & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |      \
>              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
> -            X86_CR4_FSGSBASE))                              \
> -      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
> +            X86_CR4_FSGSBASE | X86_CR4_PCIDE))              \

Why? Afaics you never set the bit in mmu_cr4_features.

> --- a/xen/include/asm-x86/x86-defns.h
> +++ b/xen/include/asm-x86/x86-defns.h
> @@ -46,6 +46,7 @@
>   * Intel CPU flags in CR3
>   */
>  #define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
> +#define X86_CR3_PCIDMASK _AC(0x0fff, ULL) /* Mask for PCID */

As per an earlier comment you also want X86_CR3_ADDR_MASK here
(and as an implication from the suggested name I think it would be
better if you also added another underscore in the one you already
add).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
       [not found]   ` <5AB4EA4902000078001B577B@suse.com>
@ 2018-03-23 11:29     ` Juergen Gross
  2018-03-23 13:46       ` Jan Beulich
       [not found]       ` <5AB5136302000078001B58D9@suse.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-23 11:29 UTC (permalink / raw)
  To: Jan Beulich, Jun Nakajima, Kevin Tian
  Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 23/03/18 11:51, Jan Beulich wrote:
>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>> Avoid flushing the complete TLB when switching %cr3 for mitigation of
>> Meltdown by using the PCID feature if available.
>>
>> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
>> 2 values for the non-XPTI case:
>>
>> - guest active and in kernel mode
>> - guest active and in user mode
>> - hypervisor active and guest in user mode (XPTI only)
>> - hypervisor active and guest in kernel mode (XPTI only)
> 
> Before committing to this route, Jun, Kevin, can we please get
> confirmation that PCID isn't (and isn't going to be) subject to the
> same speculation issues in the pipeline that the U/S bit is (having
> caused Meltdown in the first place)? To me it seems a pretty
> likely thing to play all the same games.

Really? This would assume either the processor is capable to deal with
multiple matching TLB entries when speculating or that there can be
only one entry per virtual address present in the TLB at the same time
in spite of different PCIDs.

And why aren't you asking for the same problem with VPIDs? This should
be comparable to the PCID problem you are suspecting.

> 
>> ---
>>  docs/misc/xen-command-line.markdown | 12 +++++++++
>>  xen/arch/x86/debug.c                |  3 ++-
>>  xen/arch/x86/domain_page.c          |  2 +-
>>  xen/arch/x86/domctl.c               |  4 +++
>>  xen/arch/x86/flushtlb.c             | 49 ++++++++++++++++++++++++++++------
>>  xen/arch/x86/mm.c                   | 34 +++++++++++++++++++++---
>>  xen/arch/x86/pv/dom0_build.c        |  1 +
>>  xen/arch/x86/pv/domain.c            | 52 
>> +++++++++++++++++++++++++++++++++++++
>>  xen/include/asm-x86/domain.h        | 14 +++++++---
>>  xen/include/asm-x86/pv/domain.h     |  2 ++
>>  xen/include/asm-x86/x86-defns.h     |  1 +
>>  11 files changed, 158 insertions(+), 16 deletions(-)
> 
> Having had the discussion previously, I'm missing a change to
> smp.c:new_tlbflush_clock_period() here.

I can add that. I did not do this as I haven't treated FLUSH_TLB
differently to FLUSH_TLB_GLOBAL (trying this even without any other
change to staging led to degfaults in dom0). So such a change to
new_tlbflush_clock_period() should be a separate patch I believe.

> 
>> --- a/xen/arch/x86/debug.c
>> +++ b/xen/arch/x86/debug.c
>> @@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t pgd3val)
>>      l3_pgentry_t l3e, *l3t;
>>      l2_pgentry_t l2e, *l2t;
>>      l1_pgentry_t l1e, *l1t;
>> -    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
>> +    unsigned long cr3 = (pgd3val ? pgd3val
>> +                                 : (dp->vcpu[0]->arch.cr3 & ~X86_CR3_NOFLUSH));
> 
> What about the PCID portion? You want the address of the page
> here, so I think you should use a "white-listing" masking operation
> instead of a blacklisting one.

The PCID portion is no problem here as the value is converted into a
mfn.

I can do the modification you are asking for, of course.

> 
> Also, as you touch this anyway, it would have been nice to drop
> the unnecessary middle argument at the same time.

Okay.

> 
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -102,7 +102,19 @@ void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>      t = pre_flush();
>>  
>>      if ( read_cr4() & X86_CR4_PGE )
>> +        /*
>> +         * X86_CR4_PGE set means PCID being inactive.
>> +         * We have to purge the TLB via flipping cr4.pge.
>> +         */
>>          write_cr4(cr4 & ~X86_CR4_PGE);
>> +    else if ( cpu_has_invpcid )
>> +        /*
>> +         * If we are using PCID purge the TLB via INVPCID as loading cr3
>> +         * will affect the current PCID only.
>> +         * If INVPCID is not supported we don't use PCIDs so loading cr3
>> +         * will purge the TLB (we are in the "global pages off" branch).
>> +         */
>> +        invpcid_flush_all();
> 
> Since with CR4.PGE correctness-wise clear it doesn't matter whether
> you use invpcid_flush_all() or invpcid_flush_all_nonglobals() here,
> does it also not matter performance-wise?

I didn't check that. I'll have a try.

> 
>> @@ -131,14 +143,35 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
>>      {
>>          if ( order == 0 )
>>          {
>> -            /*
>> -             * We don't INVLPG multi-page regions because the 2M/4M/1G
>> -             * region may not have been mapped with a superpage. Also there
>> -             * are various errata surrounding INVLPG usage on superpages, and
>> -             * a full flush is in any case not *that* expensive.
>> -             */
> 
> This comment really explains the order == 0 check above, so I
> don't think it should be moved.

Okay.

> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -497,12 +497,38 @@ void free_shared_domheap_page(struct page_info *page)
>>      free_domheap_page(page);
>>  }
>>  
>> +/*
>> + * Return additional PCID specific cr3 bits.
>> + *
>> + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
>> + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in case
> 
> I stand to my previous comment, which was left unanswered afaics:

Uuh, sorry for that.

> "Is it a good idea to suppress the flush this way for every reader
>  of v->arch.cr3? For example, what about the use in
>  dbg_pv_va2mfn()? I think it should be the consumers of the field
>  to decide whether to OR in that flag (which isn't part of the
>  register value anyway)."
> To be more precise, I can see the pcid to be put here (which will
> require consumers to be aware anyway), but I don't think the non-
> register-value no-flush indicator belongs here. IOW I think after
> writing the value into %cr3, the value read back should match the
> stored value.

This will make restore_all_guest more complicated. v->arch.cr3 is copied
to cpu_info->xen_cr3 there and this value is then used for %cr3. I
really don't want to add complex logic there to add the no-flush
indicator in case PCIDs are active.

>> + * the value is used to address the root page table.
>> + */
>> +static unsigned long get_pcid_bits(struct vcpu *v, bool is_xen)
>> +{
>> +    return X86_CR3_NOFLUSH | (is_xen ? PCID_PV_XEN : 0) |
>> +           ((v->arch.flags & TF_kernel_mode) ? PCID_PV_PRIV : PCID_PV_USER);
>> +}
>> +
>>  void make_cr3(struct vcpu *v, mfn_t mfn)
>>  {
>> +    struct domain *d = v->domain;
>> +
>>      v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
>> -    if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
>> -         !is_pv_32bit_vcpu(v) )
>> -        get_cpu_info()->root_pgt_changed = true;
>> +    if ( is_pv_domain(d) )
>> +    {
>> +        if ( d->arch.pv_domain.xpti && v == current )
>> +        {
>> +            struct cpu_info *cpu_info = get_cpu_info();
>> +
>> +            cpu_info->root_pgt_changed = true;
>> +            if ( d->arch.pv_domain.pcid )
>> +                cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)) |
>> +                                   get_pcid_bits(v, false);
> 
> Just like we seem to have agreed (on an earlier patch) that
> setting root_pgt_changed elsewhere may be better, perhaps
> that extends to pv_cr3 as well? It would certainly be nice if
> this was written with the final intended value right away, yet
> the lack of an "else" here suggests there is at least one place
> where this doesn't happen, but pv_cr3 is also written with a
> non-zero value.

Yes. I'll look into moving it to _toggle_guest_pt().

> 
>> +        }
>> +        if ( d->arch.pv_domain.pcid )
>> +            v->arch.cr3 |= get_pcid_bits(v, d->arch.pv_domain.xpti);
> 
> It is certainly at least confusing that you pass "xpti" as argument
> for a parameter named "is_xen". The question is what mode you
> want us to be in when running with PCID but no XPTI: Should Xen
> use its own PCID then? That would seem reasonable to me, but
> would seem to require passing true here. Yet then this would
> require switching CR3 on the way out to guests and back in from
> guests even in that case (just without copying the root page table).
> 
> If in that mode Xen isn't meant to use its own PCID, the command
> line option "pcid=noxpti" would seem at least misleading to me then,
> as you wouldn't really use different PCIDs in that mode, but only
> disable global pages (which probably hurts performance) and use
> INVPCID for flushing (which probably helps performance).

The idea is to use different PCID values for guest kernel and user
mode. This removes the need for global guest user pages. I don't
want to use global guest user pages together with PCID as flushing
global pages from the TLB with PCID enabled requires flushing either
the complete TLB or you'd have to use INVLPG in all possible address
spaces (so you'd need to have multiple %cr3 switches).

I'll add some comments in this regard.

> 
>> --- a/xen/arch/x86/pv/domain.c
>> +++ b/xen/arch/x86/pv/domain.c
>> @@ -95,6 +95,58 @@ void xpti_domain_init(struct domain *d)
>>      }
>>  }
>>  
>> +static __read_mostly enum {
>> +    PCID_OFF,
>> +    PCID_ALL,
>> +    PCID_XPTI,
>> +    PCID_NOXPTI
>> +} opt_pcid = PCID_XPTI;
>> +
>> +static __init int parse_pcid(const char *s)
>> +{
>> +    int rc = 0;
>> +
>> +    if ( !strcmp(s, "off") )
>> +        opt_pcid = PCID_OFF;
>> +    else if ( !strcmp(s, "all") )
>> +        opt_pcid = PCID_ALL;
>> +    else if ( !strcmp(s, "xpti") )
>> +        opt_pcid = PCID_XPTI;
>> +    else if ( !strcmp(s, "noxpti") )
>> +        opt_pcid = PCID_NOXPTI;
> 
> I'd prefer if you used parse_bool() and parse_boolean() here, so
> that the syntax of this new option fits with all other boolean ones
> we have.

Okay.

> 
>> +void pcid_domain_init(struct domain *d)
>> +{
>> +    if ( !is_pv_domain(d) || is_pv_32bit_domain(d) ||
>> +         !cpu_has_invpcid || !cpu_has_pcid )
>> +        return;
>> +
>> +    switch ( opt_pcid )
>> +    {
>> +    case PCID_OFF:
>> +        d->arch.pv_domain.pcid = false;
> 
> As for the earlier patch, with the return above implying the value
> to be zero this store ought to be unnecessary.

Okay.

> 
>> @@ -619,14 +626,15 @@ unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
>>        | (mmu_cr4_features                                   \
>>           & (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP |      \
>>              X86_CR4_SMAP | X86_CR4_OSXSAVE |                \
>> -            X86_CR4_FSGSBASE))                              \
>> -      | ((v)->domain->arch.vtsc ? X86_CR4_TSD : 0))         \
>> +            X86_CR4_FSGSBASE | X86_CR4_PCIDE))              \
> 
> Why? Afaics you never set the bit in mmu_cr4_features.

Oops, this is a leftover from v2.

> 
>> --- a/xen/include/asm-x86/x86-defns.h
>> +++ b/xen/include/asm-x86/x86-defns.h
>> @@ -46,6 +46,7 @@
>>   * Intel CPU flags in CR3
>>   */
>>  #define X86_CR3_NOFLUSH (_AC(1, ULL) << 63)
>> +#define X86_CR3_PCIDMASK _AC(0x0fff, ULL) /* Mask for PCID */
> 
> As per an earlier comment you also want X86_CR3_ADDR_MASK here
> (and as an implication from the suggested name I think it would be
> better if you also added another underscore in the one you already
> add).

Yes.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context
  2018-03-22 14:50   ` Jan Beulich
@ 2018-03-23 12:35     ` Juergen Gross
  0 siblings, 0 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-23 12:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 22/03/18 15:50, Jan Beulich wrote:
>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>> When switching to a 64-bit pv context the TLB is flushed twice today:
>> the first time when switching to the new address space in
>> write_ptbase(), the second time when switching to guest mode in
>> restore_to_guest.
>>
>> Avoid the first TLB flush in that case.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> V3:
>> - omit setting root_pgt_changed to false (Jan Beulich)
>> ---
>>  xen/arch/x86/mm.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
>> index 352600ad73..8c944b33c9 100644
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -123,6 +123,7 @@
>>  #include <asm/io_apic.h>
>>  #include <asm/pci.h>
>>  #include <asm/guest.h>
>> +#include <asm/processor.h>
>>  
>>  #include <asm/hvm/grant_table.h>
>>  #include <asm/pv/grant_table.h>
>> @@ -507,8 +508,14 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  void write_ptbase(struct vcpu *v)
>>  {
>>      if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>> +    {
>>          get_cpu_info()->root_pgt_changed = true;
>> -    write_cr3(v->arch.cr3);
>> +        asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
>> +    }
>> +    else
>> +    {
>> +        write_cr3(v->arch.cr3);
>> +    }
> 
> Unnecessary braces. with that
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> (This could be taken care of while committing, but the patch
> depends on patch 1 anyway, which may see further
> transformation.)

Just realized it now: I have to re-introduce the conditionals I removed
on your behalf from patch 1, as otherwise I'd omit TLB flushing via
write_cr3() for 32-bit pv-domains and HVM-domains.

Therefor I'm removing your R-b in case you don't like that (patch 3
changes the conditional again, so I don't think that is really bad).


Juergen


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
  2018-03-23 11:29     ` Juergen Gross
@ 2018-03-23 13:46       ` Jan Beulich
       [not found]       ` <5AB5136302000078001B58D9@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-23 13:46 UTC (permalink / raw)
  To: Juergen Gross
  Cc: andrew.cooper3, Kevin Tian, Dario Faggioli, Jun Nakajima, xen-devel

>>> On 23.03.18 at 12:29, <jgross@suse.com> wrote:
> On 23/03/18 11:51, Jan Beulich wrote:
>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>> Avoid flushing the complete TLB when switching %cr3 for mitigation of
>>> Meltdown by using the PCID feature if available.
>>>
>>> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
>>> 2 values for the non-XPTI case:
>>>
>>> - guest active and in kernel mode
>>> - guest active and in user mode
>>> - hypervisor active and guest in user mode (XPTI only)
>>> - hypervisor active and guest in kernel mode (XPTI only)
>> 
>> Before committing to this route, Jun, Kevin, can we please get
>> confirmation that PCID isn't (and isn't going to be) subject to the
>> same speculation issues in the pipeline that the U/S bit is (having
>> caused Meltdown in the first place)? To me it seems a pretty
>> likely thing to play all the same games.
> 
> Really? This would assume either the processor is capable to deal with
> multiple matching TLB entries when speculating or that there can be
> only one entry per virtual address present in the TLB at the same time
> in spite of different PCIDs.

Hmm, yes, good point.

> And why aren't you asking for the same problem with VPIDs? This should
> be comparable to the PCID problem you are suspecting.

Since Linux is using the approach, I'm not really suspecting a
problem. I'd just like to be sure there is none / not going to be
one. None of this is spelled out in the doc after all.

>>> ---
>>>  docs/misc/xen-command-line.markdown | 12 +++++++++
>>>  xen/arch/x86/debug.c                |  3 ++-
>>>  xen/arch/x86/domain_page.c          |  2 +-
>>>  xen/arch/x86/domctl.c               |  4 +++
>>>  xen/arch/x86/flushtlb.c             | 49 ++++++++++++++++++++++++++++------
>>>  xen/arch/x86/mm.c                   | 34 +++++++++++++++++++++---
>>>  xen/arch/x86/pv/dom0_build.c        |  1 +
>>>  xen/arch/x86/pv/domain.c            | 52 
>>> +++++++++++++++++++++++++++++++++++++
>>>  xen/include/asm-x86/domain.h        | 14 +++++++---
>>>  xen/include/asm-x86/pv/domain.h     |  2 ++
>>>  xen/include/asm-x86/x86-defns.h     |  1 +
>>>  11 files changed, 158 insertions(+), 16 deletions(-)
>> 
>> Having had the discussion previously, I'm missing a change to
>> smp.c:new_tlbflush_clock_period() here.
> 
> I can add that. I did not do this as I haven't treated FLUSH_TLB
> differently to FLUSH_TLB_GLOBAL (trying this even without any other
> change to staging led to degfaults in dom0). So such a change to
> new_tlbflush_clock_period() should be a separate patch I believe.

Ah yes, you don't have the "flushing too much" issue anymore in
this version, or at least not in as obvious a way. Or wait - it's in
patch 4. You still do an including-global flush there regardless of
whether FLUSH_TLB_GLOBAL was actually requested. Anyway,
we'll save this aspect for later.

>>> --- a/xen/arch/x86/debug.c
>>> +++ b/xen/arch/x86/debug.c
>>> @@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t pgd3val)
>>>      l3_pgentry_t l3e, *l3t;
>>>      l2_pgentry_t l2e, *l2t;
>>>      l1_pgentry_t l1e, *l1t;
>>> -    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
>>> +    unsigned long cr3 = (pgd3val ? pgd3val
>>> +                                 : (dp->vcpu[0]->arch.cr3 & ~X86_CR3_NOFLUSH));
>> 
>> What about the PCID portion? You want the address of the page
>> here, so I think you should use a "white-listing" masking operation
>> instead of a blacklisting one.
> 
> The PCID portion is no problem here as the value is converted into a
> mfn.
> 
> I can do the modification you are asking for, of course.

Well, even if the bits are shifter out in the end, the code could
look more correct. Plus by masking to just the address, future
new meaning assigned to currently unused bits would not be
a problem for this piece of code anymore.

>>> +/*
>>> + * Return additional PCID specific cr3 bits.
>>> + *
>>> + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
>>> + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in case
>> 
>> I stand to my previous comment, which was left unanswered afaics:
> 
> Uuh, sorry for that.
> 
>> "Is it a good idea to suppress the flush this way for every reader
>>  of v->arch.cr3? For example, what about the use in
>>  dbg_pv_va2mfn()? I think it should be the consumers of the field
>>  to decide whether to OR in that flag (which isn't part of the
>>  register value anyway)."
>> To be more precise, I can see the pcid to be put here (which will
>> require consumers to be aware anyway), but I don't think the non-
>> register-value no-flush indicator belongs here. IOW I think after
>> writing the value into %cr3, the value read back should match the
>> stored value.
> 
> This will make restore_all_guest more complicated. v->arch.cr3 is copied
> to cpu_info->xen_cr3 there and this value is then used for %cr3. I
> really don't want to add complex logic there to add the no-flush
> indicator in case PCIDs are active.

Valid point. Looking at all present uses of ->arch.cr3, it's probably
indeed better the way you have it. However, I'm now wondering
about something else: make_cr3() leaves PCID as zero for HVM
and idle domains, but runs Xen with PCIDs 2 and 3 for (some) PV
domains. That looks like an undesirable setup though - it would
seem better to run Xen (with full page tables) with PCID 0 at all
times.

Then we'd have e.g.
PCID 0	Xen (full page tables)
PCID x	PV guest priv
PCID y	PV guest user

Global pages in PCID 0 could then still be permitted, and wouldn't
ever need flushing except when FLUSH_TLB_GLOBAL is requested.

As to the use of two separate PCIDs for PV kernel and user modes
- while this helps isolation, it prevents recovering the non-XPTI
property of user mode TLB entries surviving in-guest mode switches.
I wonder whether this is part of the reason you see PCID have a
negative effect in the non-XPTI case.

So in the end the question is: Why not use just two PCIDs, and
allow global pages just like we do now, with the added benefit
that we no longer need to flush Xen's global TLB entries just
because we want to get rid of PV guest user ones.

>>> +        }
>>> +        if ( d->arch.pv_domain.pcid )
>>> +            v->arch.cr3 |= get_pcid_bits(v, d->arch.pv_domain.xpti);
>> 
>> It is certainly at least confusing that you pass "xpti" as argument
>> for a parameter named "is_xen". The question is what mode you
>> want us to be in when running with PCID but no XPTI: Should Xen
>> use its own PCID then? That would seem reasonable to me, but
>> would seem to require passing true here. Yet then this would
>> require switching CR3 on the way out to guests and back in from
>> guests even in that case (just without copying the root page table).
>> 
>> If in that mode Xen isn't meant to use its own PCID, the command
>> line option "pcid=noxpti" would seem at least misleading to me then,
>> as you wouldn't really use different PCIDs in that mode, but only
>> disable global pages (which probably hurts performance) and use
>> INVPCID for flushing (which probably helps performance).
> 
> The idea is to use different PCID values for guest kernel and user
> mode. This removes the need for global guest user pages.

Partly only. Global guest user pages serve two purposes: They
survive a round trip user -> kernel -> user (as long as no Xen
context switch occurs in the middle), and they also allow the
kernel to utilize TLB entries user mode has just created e.g. for
system call input. Your current use of PCIDs retains only the
first property.

> I don't
> want to use global guest user pages together with PCID as flushing
> global pages from the TLB with PCID enabled requires flushing either
> the complete TLB or you'd have to use INVLPG in all possible address
> spaces (so you'd need to have multiple %cr3 switches).

Well, yes, flushing _individual_ pages is a problem in that case.
As to multiple CR3 switches - are these all that bad really with
the no-flush bit set? With the reduced number of PCIDs in actual
use (as discussed above) "all possible address spaces" would
mean just two. And I could imagine that in a number of cases
just one INVLPG (with the right PCID active) might suffice.

One complicating factor is that we don't want to introduce
Xen TLB entries for other than what we map in the minimal page
tables into PV guest PCID space, which would happen if we
simply switched PCID around an INVLPG.

What I don't understand in any event is why you need separate
PCIDs for Xen depending on whether the active PV guest is in
kernel or user mode.

As a side note, I've noticed only now that get_pcid_bits()'s
first parameter wants to be constified.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 0/7] xen/x86: various XPTI speedups
  2018-03-21 12:51 [PATCH v3 0/7] xen/x86: various XPTI speedups Juergen Gross
                   ` (6 preceding siblings ...)
  2018-03-21 12:51 ` [PATCH v3 7/7] xen/x86: use PCID feature Juergen Gross
@ 2018-03-23 14:10 ` Dario Faggioli
  7 siblings, 0 replies; 49+ messages in thread
From: Dario Faggioli @ 2018-03-23 14:10 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: andrew.cooper3, jbeulich


[-- Attachment #1.1: Type: text/plain, Size: 34722 bytes --]

On Wed, 2018-03-21 at 13:51 +0100, Juergen Gross wrote:
> On my machine (Intel i7-4600M) using the PCID feature in the non-XPTI
> case showed a slightly worse performance than using global pages
> instead (using PCID and global pages is a bad idea as invalidating
> global pages in this case would need a complete TLB flush). For this
> reason I've decided to use PCID for XPTI only as the default. That
> can easily be changed by using the command line parameter "pcid=all".
> 
>        XPTI off     Jan, XPTI on        +this series, XPTI on
> real   1m21.169s    1m52.149s (+38%)    1m25.692s (+6%)
> user   2m47.652s    2m50.054s (+1%)     2m46.428s (-1%)
> sys    1m11.949s    2m21.767s (+97%)    1m23.053s (+15%)
> 
> A git branch of that series (+ Jan's patches) is available:
> 
> https://github.com/jgross1/xen.git xpti
> 
As usual, I've run a few benchmarks. Long story short, this series
mitigates the performance hit of XPTI, in pretty much all the cases. In
a couple of cases, even significantly so (~10%).

The improvement is not as good as in the above numbers, even for
similar workloads (compile ones), but I guess that depends on the
hardware. But anyway, things do improve, and I think we should have the
series in 4.11.

I can also confirm that using PCID with XPTI=no, does not bring much of
an improvement, if any at all.

There's the weird case of STREAM, where using PCID makes things go
significantly worse, not only than no-PCID and xpti=no, but also than 
PCID and xpri=true.
Basically, when using PCID, xpti boosts STREAM performance, which is a
little weird. I'll try to manually re-run the benchmark in a couple of
configurations, and let's see what I will find.

Here's the results. Basically:
- if you want to compare xpti=on, with and without this series, look at
   'Jan, XPTI on' (without) and '+this series, XPTI on' (with);
- if you want to compare, with this series, xpti on and off, look at
  'XPTI off' (off) and '+this series, XPTI on' (on);
- if you want to compare xpti=off, with and without PCID, look at
  'XPTI off' (without) and '+this series, XPTI off, PCID all' (with).

https://openbenchmarking.org/result/1803232-DARI-180323589

Also available here:

AIO-Stress 0.21
Test: Random Write
    MB/s > Higher Is Better
    XPTI off .......................... 1783.11 |===================================================================================================================================================
    +this series, XPTI on ............. 1967.75 |==================================================================================================================================================================
    +this series, XPTI off, PCID all .. 1827.43 |======================================================================================================================================================
    Jan, XPTI on ...................... 1866.80 |==========================================================================================================================================================

Stream 2013-01-17
Type: Copy
    MB/s > Higher Is Better
    XPTI off .......................... 19442.54 |=================================================================================================================================================================
    +this series, XPTI on ............. 18939.40 |=============================================================================================================================================================
    +this series, XPTI off, PCID all .. 15478.48 |================================================================================================================================
    Jan, XPTI on ...................... 15638.96 |==================================================================================================================================

Stream 2013-01-17
Type: Scale
    MB/s > Higher Is Better
    XPTI off .......................... 12978.74 |=================================================================================================================================================================
    +this series, XPTI on ............. 12851.56 |===============================================================================================================================================================
    +this series, XPTI off, PCID all .. 10699.02 |=====================================================================================================================================
    Jan, XPTI on ...................... 10778.84 |======================================================================================================================================

Stream 2013-01-17
Type: Triad
    MB/s > Higher Is Better
    XPTI off .......................... 14280.68 |=================================================================================================================================================================
    +this series, XPTI on ............. 14022.30 |==============================================================================================================================================================
    +this series, XPTI off, PCID all .. 12149.18 |=========================================================================================================================================
    Jan, XPTI on ...................... 12218.22 |==========================================================================================================================================

Stream 2013-01-17
Type: Add
    MB/s > Higher Is Better
    XPTI off .......................... 16093.28 |=================================================================================================================================================================
    +this series, XPTI on ............. 15815.04 |==============================================================================================================================================================
    +this series, XPTI off, PCID all .. 12077.40 |=========================================================================================================================
    Jan, XPTI on ...................... 12269.30 |===========================================================================================================================

t-test1 2017-01-13
Threads: 1
    Seconds < Lower Is Better
    XPTI off .......................... 76.16  |===============================================================================================
    +this series, XPTI on ............. 113.93 |=============================================================================================================================================
    +this series, XPTI off, PCID all .. 74.68  |=============================================================================================
    Jan, XPTI on ...................... 131.26 |===================================================================================================================================================================

t-test1 2017-01-13
Threads: 2
    Seconds < Lower Is Better
    XPTI off .......................... 24.28 |====================================================================================================
    +this series, XPTI on ............. 34.91 |================================================================================================================================================
    +this series, XPTI off, PCID all .. 23.92 |===================================================================================================
    Jan, XPTI on ...................... 39.73 |====================================================================================================================================================================

Timed Linux Kernel Compilation 4.13
Time To Compile
    Seconds < Lower Is Better
    XPTI off .......................... 155.35 |==============================================================================================================================================
    +this series, XPTI on ............. 170.36 |===========================================================================================================================================================
    +this series, XPTI off, PCID all .. 155.81 |==============================================================================================================================================
    Jan, XPTI on ...................... 178.89 |===================================================================================================================================================================

Timed MPlayer Compilation 1.0-rc3
Time To Compile
    Seconds < Lower Is Better
    XPTI off .......................... 45.94 |=======================================================================================================================================================
    +this series, XPTI on ............. 48.85 |================================================================================================================================================================
    +this series, XPTI off, PCID all .. 45.95 |=======================================================================================================================================================
    Jan, XPTI on ...................... 49.97 |====================================================================================================================================================================

Timed PHP Compilation 7.1.9
Time To Compile
    Seconds < Lower Is Better
    XPTI off .......................... 119.18 |================================================================================================================================================
    +this series, XPTI on ............. 129.24 |============================================================================================================================================================
    +this series, XPTI off, PCID all .. 120.29 |=================================================================================================================================================
    Jan, XPTI on ...................... 134.76 |===================================================================================================================================================================

Parallel BZIP2 Compression 1.1.12
256MB File Compression
    Seconds < Lower Is Better
    XPTI off .......................... 8.97 |===========================================================================================================================================================
    +this series, XPTI on ............. 8.43 |==================================================================================================================================================
    +this series, XPTI off, PCID all .. 9.00 |===========================================================================================================================================================
    Jan, XPTI on ...................... 9.55 |=====================================================================================================================================================================

Hackbench 
Count: 1 - Type: Process
    Seconds < Lower Is Better
    XPTI off .......................... 11.11 |=====================================================================================================
    +this series, XPTI on ............. 16.62 |=======================================================================================================================================================
    +this series, XPTI off, PCID all .. 11.05 |====================================================================================================
    Jan, XPTI on ...................... 18.11 |====================================================================================================================================================================

Hackbench 
Count: 4 - Type: Process
    Seconds < Lower Is Better
    XPTI off .......................... 37.82 |===============================================================================================
    +this series, XPTI on ............. 59.66 |=====================================================================================================================================================
    +this series, XPTI off, PCID all .. 37.58 |==============================================================================================
    Jan, XPTI on ...................... 65.47 |====================================================================================================================================================================

Hackbench 
Count: 16 - Type: Process
    Seconds < Lower Is Better
    XPTI off .......................... 142.24 |==========================================================================================
    +this series, XPTI on ............. 232.55 |===================================================================================================================================================
    +this series, XPTI off, PCID all .. 141.58 |==========================================================================================
    Jan, XPTI on ...................... 257.20 |===================================================================================================================================================================

Redis 4.0.8
Test: LPOP
    Requests Per Second > Higher Is Better
    XPTI off .......................... 402887.81 |================================================================================================================================================================
    +this series, XPTI on ............. 146349.66 |==========================================================
    +this series, XPTI off, PCID all .. 341182.95 |=======================================================================================================================================
    Jan, XPTI on ...................... 128094.76 |===================================================

Redis 4.0.8
Test: SADD
    Requests Per Second > Higher Is Better
    XPTI off .......................... 378952.25 |================================================================================================================================================================
    +this series, XPTI on ............. 165097.91 |======================================================================
    +this series, XPTI off, PCID all .. 371460.27 |=============================================================================================================================================================
    Jan, XPTI on ...................... 141056.13 |============================================================

Redis 4.0.8
Test: LPUSH
    Requests Per Second > Higher Is Better
    XPTI off .......................... 327837.31 |=============================================================================================================================================================
    +this series, XPTI on ............. 146894.11 |=======================================================================
    +this series, XPTI off, PCID all .. 333126.30 |================================================================================================================================================================
    Jan, XPTI on ...................... 126966.19 |=============================================================

Redis 4.0.8
Test: GET
    Requests Per Second > Higher Is Better
    XPTI off .......................... 397632.29 |================================================================================================================================================================
    +this series, XPTI on ............. 169390.22 |====================================================================
    +this series, XPTI off, PCID all .. 393664.83 |==============================================================================================================================================================
    Jan, XPTI on ...................... 143045.55 |==========================================================

Redis 4.0.8
Test: SET
    Requests Per Second > Higher Is Better
    XPTI off .......................... 352699.10 |================================================================================================================================================================
    +this series, XPTI on ............. 153986.24 |======================================================================
    +this series, XPTI off, PCID all .. 350564.06 |===============================================================================================================================================================
    Jan, XPTI on ...................... 133726.65 |=============================================================

Stress-NG 0.07.26
Test: Crypto
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 689.86 |===================================================================================================================================================================
    +this series, XPTI on ............. 688.27 |===================================================================================================================================================================
    +this series, XPTI off, PCID all .. 689.49 |===================================================================================================================================================================
    Jan, XPTI on ...................... 688.75 |===================================================================================================================================================================

Stress-NG 0.07.26
Test: Bsearch
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 3939.99 |==================================================================================================================================================================
    +this series, XPTI on ............. 3936.73 |==================================================================================================================================================================
    +this series, XPTI off, PCID all .. 3936.85 |==================================================================================================================================================================
    Jan, XPTI on ...................... 3930.97 |==================================================================================================================================================================

Stress-NG 0.07.26
Test: Forking
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 6056.42 |=================================================================================================================================================================
    +this series, XPTI on ............. 5222.55 |===========================================================================================================================================
    +this series, XPTI off, PCID all .. 6107.00 |==================================================================================================================================================================
    Jan, XPTI on ...................... 4515.24 |========================================================================================================================

Stress-NG 0.07.26
Test: Hsearch
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 26376.03 |=================================================================================================================================================================
    +this series, XPTI on ............. 26278.84 |================================================================================================================================================================
    +this series, XPTI off, PCID all .. 26391.88 |=================================================================================================================================================================
    Jan, XPTI on ...................... 26258.08 |================================================================================================================================================================

Stress-NG 0.07.26
Test: Lsearch
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 60.27 |====================================================================================================================================================================
    +this series, XPTI on ............. 60.26 |====================================================================================================================================================================
    +this series, XPTI off, PCID all .. 60.27 |====================================================================================================================================================================
    Jan, XPTI on ...................... 60.27 |====================================================================================================================================================================

Stress-NG 0.07.26
Test: Tsearch
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 124.15 |==================================================================================================================================================================
    +this series, XPTI on ............. 124.12 |==================================================================================================================================================================
    +this series, XPTI off, PCID all .. 125.04 |===================================================================================================================================================================
    Jan, XPTI on ...................... 124.75 |===================================================================================================================================================================

Stress-NG 0.07.26
Test: CPU Stress
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 1709.34 |============================================================================================================================================================
    +this series, XPTI on ............. 1751.09 |================================================================================================================================================================
    +this series, XPTI off, PCID all .. 1746.89 |================================================================================================================================================================
    Jan, XPTI on ...................... 1772.44 |==================================================================================================================================================================

Stress-NG 0.07.26
Test: Semaphores
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 2208236.65 |============================================================================================================================================================
    +this series, XPTI on ............. 2044183.40 |================================================================================================================================================
    +this series, XPTI off, PCID all .. 2252052.71 |===============================================================================================================================================================
    Jan, XPTI on ...................... 2092475.64 |====================================================================================================================================================

Stress-NG 0.07.26
Test: Matrix Math
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 20279.06 |=================================================================================================================================================================
    +this series, XPTI on ............. 20261.14 |=================================================================================================================================================================
    +this series, XPTI off, PCID all .. 20287.40 |=================================================================================================================================================================
    Jan, XPTI on ...................... 20264.35 |=================================================================================================================================================================

Stress-NG 0.07.26
Test: Vector Math
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 10506.60 |=================================================================================================================================================================
    +this series, XPTI on ............. 10500.10 |=================================================================================================================================================================
    +this series, XPTI off, PCID all .. 10509.78 |=================================================================================================================================================================
    Jan, XPTI on ...................... 10493.20 |=================================================================================================================================================================

Stress-NG 0.07.26
Test: Memory Copying
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 1051.84 |==================================================================================================================================================================
    +this series, XPTI on ............. 1012.27 |============================================================================================================================================================
    +this series, XPTI off, PCID all .. 983.86  |========================================================================================================================================================
    Jan, XPTI on ...................... 992.60  |=========================================================================================================================================================

Stress-NG 0.07.26
Test: Socket Activity
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 1587.44 |=================================================================================================================================================================
    +this series, XPTI on ............. 1087.62 |==============================================================================================================
    +this series, XPTI off, PCID all .. 1594.80 |==================================================================================================================================================================
    Jan, XPTI on ...................... 922.89  |==============================================================================================

Stress-NG 0.07.26
Test: Context Switching
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 696817.28 |================================================================================================================================================================
    +this series, XPTI on ............. 343513.18 |===============================================================================
    +this series, XPTI off, PCID all .. 657917.78 |=======================================================================================================================================================
    Jan, XPTI on ...................... 280122.99 |================================================================

Stress-NG 0.07.26
Test: Glibc C String Functions
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 268516.17 |=============================================================================================================================================================
    +this series, XPTI on ............. 269261.27 |=============================================================================================================================================================
    +this series, XPTI off, PCID all .. 273643.72 |================================================================================================================================================================
    Jan, XPTI on ...................... 266052.71 |============================================================================================================================================================

Stress-NG 0.07.26
Test: Glibc Qsort Data Sorting
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 57.80 |====================================================================================================================================================================
    +this series, XPTI on ............. 56.66 |=================================================================================================================================================================
    +this series, XPTI off, PCID all .. 57.77 |====================================================================================================================================================================
    Jan, XPTI on ...................... 56.88 |=================================================================================================================================================================

Stress-NG 0.07.26
Test: System V Message Passing
    Bogo Ops/s > Higher Is Better
    XPTI off .......................... 3645928.49 |===============================================================================================================================================================
    +this series, XPTI on ............. 1743511.46 |============================================================================
    +this series, XPTI off, PCID all .. 3573098.79 |============================================================================================================================================================
    Jan, XPTI on ...................... 1596148.22 |======================================================================

PyBench 2018-02-16
Total For Average Test Times
    Milliseconds < Lower Is Better
    XPTI off .......................... 3007 |=====================================================================================================================================================================
    +this series, XPTI on ............. 3000 |=====================================================================================================================================================================
    +this series, XPTI off, PCID all .. 2995 |====================================================================================================================================================================
    Jan, XPTI on ...................... 3000 |=====================================================================================================================================================================

NGINX Benchmark 1.9.9
Static Web Page Serving
    Requests Per Second > Higher Is Better
    XPTI off .......................... 15930.51 |=================================================================================================================================================================
    +this series, XPTI on ............. 11537.88 |=====================================================================================================================
    +this series, XPTI off, PCID all .. 15915.73 |=================================================================================================================================================================
    Jan, XPTI on ...................... 10454.68 |==========================================================================================================

PHPBench 0.8.1
PHP Benchmark Suite
    Score > Higher Is Better
    XPTI off .......................... 279331 |==================================================================================================================================================================
    +this series, XPTI on ............. 253206 |===================================================================================================================================================
    +this series, XPTI off, PCID all .. 280877 |===================================================================================================================================================================
    Jan, XPTI on ...................... 247288 |================================================================================================================================================

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
       [not found]       ` <5AB5136302000078001B58D9@suse.com>
@ 2018-03-23 14:11         ` Juergen Gross
  2018-03-23 15:58           ` Jan Beulich
       [not found]           ` <5AB5324C02000078001B59E7@suse.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-23 14:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, Kevin Tian, xen-devel, Jun Nakajima, Dario Faggioli

On 23/03/18 14:46, Jan Beulich wrote:
>>>> On 23.03.18 at 12:29, <jgross@suse.com> wrote:
>> On 23/03/18 11:51, Jan Beulich wrote:
>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>> Avoid flushing the complete TLB when switching %cr3 for mitigation of
>>>> Meltdown by using the PCID feature if available.
>>>>
>>>> We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
>>>> 2 values for the non-XPTI case:
>>>>
>>>> - guest active and in kernel mode
>>>> - guest active and in user mode
>>>> - hypervisor active and guest in user mode (XPTI only)
>>>> - hypervisor active and guest in kernel mode (XPTI only)
>>>
>>> Before committing to this route, Jun, Kevin, can we please get
>>> confirmation that PCID isn't (and isn't going to be) subject to the
>>> same speculation issues in the pipeline that the U/S bit is (having
>>> caused Meltdown in the first place)? To me it seems a pretty
>>> likely thing to play all the same games.
>>
>> Really? This would assume either the processor is capable to deal with
>> multiple matching TLB entries when speculating or that there can be
>> only one entry per virtual address present in the TLB at the same time
>> in spite of different PCIDs.
> 
> Hmm, yes, good point.
> 
>> And why aren't you asking for the same problem with VPIDs? This should
>> be comparable to the PCID problem you are suspecting.
> 
> Since Linux is using the approach, I'm not really suspecting a
> problem. I'd just like to be sure there is none / not going to be
> one. None of this is spelled out in the doc after all.
> 
>>>> ---
>>>>  docs/misc/xen-command-line.markdown | 12 +++++++++
>>>>  xen/arch/x86/debug.c                |  3 ++-
>>>>  xen/arch/x86/domain_page.c          |  2 +-
>>>>  xen/arch/x86/domctl.c               |  4 +++
>>>>  xen/arch/x86/flushtlb.c             | 49 ++++++++++++++++++++++++++++------
>>>>  xen/arch/x86/mm.c                   | 34 +++++++++++++++++++++---
>>>>  xen/arch/x86/pv/dom0_build.c        |  1 +
>>>>  xen/arch/x86/pv/domain.c            | 52 
>>>> +++++++++++++++++++++++++++++++++++++
>>>>  xen/include/asm-x86/domain.h        | 14 +++++++---
>>>>  xen/include/asm-x86/pv/domain.h     |  2 ++
>>>>  xen/include/asm-x86/x86-defns.h     |  1 +
>>>>  11 files changed, 158 insertions(+), 16 deletions(-)
>>>
>>> Having had the discussion previously, I'm missing a change to
>>> smp.c:new_tlbflush_clock_period() here.
>>
>> I can add that. I did not do this as I haven't treated FLUSH_TLB
>> differently to FLUSH_TLB_GLOBAL (trying this even without any other
>> change to staging led to degfaults in dom0). So such a change to
>> new_tlbflush_clock_period() should be a separate patch I believe.
> 
> Ah yes, you don't have the "flushing too much" issue anymore in
> this version, or at least not in as obvious a way. Or wait - it's in
> patch 4. You still do an including-global flush there regardless of
> whether FLUSH_TLB_GLOBAL was actually requested. Anyway,
> we'll save this aspect for later.
> 
>>>> --- a/xen/arch/x86/debug.c
>>>> +++ b/xen/arch/x86/debug.c
>>>> @@ -97,7 +97,8 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct domain *dp, uint64_t pgd3val)
>>>>      l3_pgentry_t l3e, *l3t;
>>>>      l2_pgentry_t l2e, *l2t;
>>>>      l1_pgentry_t l1e, *l1t;
>>>> -    unsigned long cr3 = (pgd3val ? pgd3val : dp->vcpu[0]->arch.cr3);
>>>> +    unsigned long cr3 = (pgd3val ? pgd3val
>>>> +                                 : (dp->vcpu[0]->arch.cr3 & ~X86_CR3_NOFLUSH));
>>>
>>> What about the PCID portion? You want the address of the page
>>> here, so I think you should use a "white-listing" masking operation
>>> instead of a blacklisting one.
>>
>> The PCID portion is no problem here as the value is converted into a
>> mfn.
>>
>> I can do the modification you are asking for, of course.
> 
> Well, even if the bits are shifter out in the end, the code could
> look more correct. Plus by masking to just the address, future
> new meaning assigned to currently unused bits would not be
> a problem for this piece of code anymore.
> 
>>>> +/*
>>>> + * Return additional PCID specific cr3 bits.
>>>> + *
>>>> + * Note that X86_CR3_NOFLUSH will not be readable in cr3. Anyone consuming
>>>> + * v->arch.cr3 should mask away X86_CR3_NOFLUSH and X86_CR3_PCIDMASK in case
>>>
>>> I stand to my previous comment, which was left unanswered afaics:
>>
>> Uuh, sorry for that.
>>
>>> "Is it a good idea to suppress the flush this way for every reader
>>>  of v->arch.cr3? For example, what about the use in
>>>  dbg_pv_va2mfn()? I think it should be the consumers of the field
>>>  to decide whether to OR in that flag (which isn't part of the
>>>  register value anyway)."
>>> To be more precise, I can see the pcid to be put here (which will
>>> require consumers to be aware anyway), but I don't think the non-
>>> register-value no-flush indicator belongs here. IOW I think after
>>> writing the value into %cr3, the value read back should match the
>>> stored value.
>>
>> This will make restore_all_guest more complicated. v->arch.cr3 is copied
>> to cpu_info->xen_cr3 there and this value is then used for %cr3. I
>> really don't want to add complex logic there to add the no-flush
>> indicator in case PCIDs are active.
> 
> Valid point. Looking at all present uses of ->arch.cr3, it's probably
> indeed better the way you have it. However, I'm now wondering
> about something else: make_cr3() leaves PCID as zero for HVM
> and idle domains, but runs Xen with PCIDs 2 and 3 for (some) PV
> domains. That looks like an undesirable setup though - it would
> seem better to run Xen (with full page tables) with PCID 0 at all
> times.
> 
> Then we'd have e.g.
> PCID 0	Xen (full page tables)
> PCID x	PV guest priv
> PCID y	PV guest user

So this would need another way to switch between guest and xen %cr3.
Or would you want to use different %cr3 values with the same PCID
without flushing the TLB in between? This seems to be a way to ask for
problems...

In case you'd use the same %cr3 (guest kernel one, I guess) for both
cases: are you really sure there is no problem in any hypervisor path
accessing guest data which would result in using guest kernel access
rights when coming from user mode (BTW: that was the security note I
had in v2 of my series).

> Global pages in PCID 0 could then still be permitted, and wouldn't
> ever need flushing except when FLUSH_TLB_GLOBAL is requested.
> 
> As to the use of two separate PCIDs for PV kernel and user modes
> - while this helps isolation, it prevents recovering the non-XPTI
> property of user mode TLB entries surviving in-guest mode switches.

I don't get that. With PCID the guest's kernel _and_ user entries
will survive in-guest mode switches as there is no TLB flushing
involved (the no-flush bit is set in v->arch.cr3 for both modes).

The only downside are guest kernel accesses to user pages: they will
need additional TLB entries as the PCID is different.

> I wonder whether this is part of the reason you see PCID have a
> negative effect in the non-XPTI case.
> 
> So in the end the question is: Why not use just two PCIDs, and
> allow global pages just like we do now, with the added benefit
> that we no longer need to flush Xen's global TLB entries just
> because we want to get rid of PV guest user ones.

I can't see how that would work without either needing some more TLB
flushes in order to prevent stale TLB entries or loosing the Meltdown
mitigation.

Which %cr3/PCID combination should be used in hypervisor, guest kernel
and guest user mode? Which pages would be global?

> 
>>>> +        }
>>>> +        if ( d->arch.pv_domain.pcid )
>>>> +            v->arch.cr3 |= get_pcid_bits(v, d->arch.pv_domain.xpti);
>>>
>>> It is certainly at least confusing that you pass "xpti" as argument
>>> for a parameter named "is_xen". The question is what mode you
>>> want us to be in when running with PCID but no XPTI: Should Xen
>>> use its own PCID then? That would seem reasonable to me, but
>>> would seem to require passing true here. Yet then this would
>>> require switching CR3 on the way out to guests and back in from
>>> guests even in that case (just without copying the root page table).
>>>
>>> If in that mode Xen isn't meant to use its own PCID, the command
>>> line option "pcid=noxpti" would seem at least misleading to me then,
>>> as you wouldn't really use different PCIDs in that mode, but only
>>> disable global pages (which probably hurts performance) and use
>>> INVPCID for flushing (which probably helps performance).
>>
>> The idea is to use different PCID values for guest kernel and user
>> mode. This removes the need for global guest user pages.
> 
> Partly only. Global guest user pages serve two purposes: They
> survive a round trip user -> kernel -> user (as long as no Xen
> context switch occurs in the middle), and they also allow the
> kernel to utilize TLB entries user mode has just created e.g. for
> system call input. Your current use of PCIDs retains only the
> first property.

Right.

> 
>> I don't
>> want to use global guest user pages together with PCID as flushing
>> global pages from the TLB with PCID enabled requires flushing either
>> the complete TLB or you'd have to use INVLPG in all possible address
>> spaces (so you'd need to have multiple %cr3 switches).
> 
> Well, yes, flushing _individual_ pages is a problem in that case.
> As to multiple CR3 switches - are these all that bad really with
> the no-flush bit set? With the reduced number of PCIDs in actual
> use (as discussed above) "all possible address spaces" would
> mean just two. And I could imagine that in a number of cases
> just one INVLPG (with the right PCID active) might suffice.
> 
> One complicating factor is that we don't want to introduce
> Xen TLB entries for other than what we map in the minimal page
> tables into PV guest PCID space, which would happen if we
> simply switched PCID around an INVLPG.
> 
> What I don't understand in any event is why you need separate
> PCIDs for Xen depending on whether the active PV guest is in
> kernel or user mode.

Main reason are the different page tables anchored in %cr3.

> As a side note, I've noticed only now that get_pcid_bits()'s
> first parameter wants to be constified.

Will change that.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
  2018-03-23 14:11         ` Juergen Gross
@ 2018-03-23 15:58           ` Jan Beulich
       [not found]           ` <5AB5324C02000078001B59E7@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-23 15:58 UTC (permalink / raw)
  To: Juergen Gross
  Cc: andrew.cooper3, Kevin Tian, Dario Faggioli, Jun Nakajima, xen-devel

>>> On 23.03.18 at 15:11, <jgross@suse.com> wrote:
> On 23/03/18 14:46, Jan Beulich wrote:
>> Valid point. Looking at all present uses of ->arch.cr3, it's probably
>> indeed better the way you have it. However, I'm now wondering
>> about something else: make_cr3() leaves PCID as zero for HVM
>> and idle domains, but runs Xen with PCIDs 2 and 3 for (some) PV
>> domains. That looks like an undesirable setup though - it would
>> seem better to run Xen (with full page tables) with PCID 0 at all
>> times.
>> 
>> Then we'd have e.g.
>> PCID 0	Xen (full page tables)
>> PCID x	PV guest priv
>> PCID y	PV guest user
> 
> So this would need another way to switch between guest and xen %cr3.
> Or would you want to use different %cr3 values with the same PCID
> without flushing the TLB in between? This seems to be a way to ask for
> problems...

Well, a TLB flush is clearly needed in such a setup when going
from kernel to user mode.

> In case you'd use the same %cr3 (guest kernel one, I guess) for both
> cases: are you really sure there is no problem in any hypervisor path
> accessing guest data which would result in using guest kernel access
> rights when coming from user mode (BTW: that was the security note I
> had in v2 of my series).

I'm afraid I don't understand: Same %cr3? There are separate
kernel and user page tables, requiring different values anyway.
I also don't understand what problems in hypervisor code paths
you suspect, when everything looks to work fine right now
without PCID.

>> Global pages in PCID 0 could then still be permitted, and wouldn't
>> ever need flushing except when FLUSH_TLB_GLOBAL is requested.
>> 
>> As to the use of two separate PCIDs for PV kernel and user modes
>> - while this helps isolation, it prevents recovering the non-XPTI
>> property of user mode TLB entries surviving in-guest mode switches.
> 
> I don't get that. With PCID the guest's kernel _and_ user entries
> will survive in-guest mode switches as there is no TLB flushing
> involved (the no-flush bit is set in v->arch.cr3 for both modes).
> 
> The only downside are guest kernel accesses to user pages: they will
> need additional TLB entries as the PCID is different.

That's the point I was trying to make. This was further explained
in my previous reply a little further down.

>> I wonder whether this is part of the reason you see PCID have a
>> negative effect in the non-XPTI case.
>> 
>> So in the end the question is: Why not use just two PCIDs, and
>> allow global pages just like we do now, with the added benefit
>> that we no longer need to flush Xen's global TLB entries just
>> because we want to get rid of PV guest user ones.
> 
> I can't see how that would work without either needing some more TLB
> flushes in order to prevent stale TLB entries or loosing the Meltdown
> mitigation.
> 
> Which %cr3/PCID combination should be used in hypervisor, guest kernel
> and guest user mode?

Xen would run with PCID 0 (and full Xen mappings) at all times
(except early entry and late exit code of course). The guest would
run with PCID 1 (and minimal Xen mappings) at all times. The switch
of PCID eliminates the need for flushes on the way out and back in.

> Which pages would be global?

Use of global pages would continue to be as today: Xen has some,
and guest user mode has some. Of course it is quite possible that
the use of global pages with a single guest PCID is still worse than
no global pages with two guest PCIDs, but that's a separate step
to take (and measure) imo.

>>> I don't
>>> want to use global guest user pages together with PCID as flushing
>>> global pages from the TLB with PCID enabled requires flushing either
>>> the complete TLB or you'd have to use INVLPG in all possible address
>>> spaces (so you'd need to have multiple %cr3 switches).
>> 
>> Well, yes, flushing _individual_ pages is a problem in that case.
>> As to multiple CR3 switches - are these all that bad really with
>> the no-flush bit set? With the reduced number of PCIDs in actual
>> use (as discussed above) "all possible address spaces" would
>> mean just two. And I could imagine that in a number of cases
>> just one INVLPG (with the right PCID active) might suffice.
>> 
>> One complicating factor is that we don't want to introduce
>> Xen TLB entries for other than what we map in the minimal page
>> tables into PV guest PCID space, which would happen if we
>> simply switched PCID around an INVLPG.
>> 
>> What I don't understand in any event is why you need separate
>> PCIDs for Xen depending on whether the active PV guest is in
>> kernel or user mode.
> 
> Main reason are the different page tables anchored in %cr3.

Hmm, right, looks like we can't have the best of both worlds: We'd
like the Xen part of the address space to be shared, but the guest
part of it to be separate. Question then still is whether the reduced
flushing outweighs the reduced sharing. IOW between what we
have today (a single PCID and a lot of flushing) and what you
introduce (four PCIDs and very little flushing) is a middle approach -
two PCIDs plus some flushing.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
       [not found]           ` <5AB5324C02000078001B59E7@suse.com>
@ 2018-03-26  6:49             ` Juergen Gross
  2018-03-26  8:28               ` Jan Beulich
       [not found]               ` <5AB8CB3902000078001B5F3A@suse.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-26  6:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, Kevin Tian, xen-devel, Jun Nakajima, Dario Faggioli

On 23/03/18 16:58, Jan Beulich wrote:
>>>> On 23.03.18 at 15:11, <jgross@suse.com> wrote:
>> On 23/03/18 14:46, Jan Beulich wrote:
>>> Valid point. Looking at all present uses of ->arch.cr3, it's probably
>>> indeed better the way you have it. However, I'm now wondering
>>> about something else: make_cr3() leaves PCID as zero for HVM
>>> and idle domains, but runs Xen with PCIDs 2 and 3 for (some) PV
>>> domains. That looks like an undesirable setup though - it would
>>> seem better to run Xen (with full page tables) with PCID 0 at all
>>> times.
>>>
>>> Then we'd have e.g.
>>> PCID 0	Xen (full page tables)
>>> PCID x	PV guest priv
>>> PCID y	PV guest user
>>
>> So this would need another way to switch between guest and xen %cr3.
>> Or would you want to use different %cr3 values with the same PCID
>> without flushing the TLB in between? This seems to be a way to ask for
>> problems...
> 
> Well, a TLB flush is clearly needed in such a setup when going
> from kernel to user mode.
> 
>> In case you'd use the same %cr3 (guest kernel one, I guess) for both
>> cases: are you really sure there is no problem in any hypervisor path
>> accessing guest data which would result in using guest kernel access
>> rights when coming from user mode (BTW: that was the security note I
>> had in v2 of my series).
> 
> I'm afraid I don't understand: Same %cr3? There are separate
> kernel and user page tables, requiring different values anyway.
> I also don't understand what problems in hypervisor code paths
> you suspect, when everything looks to work fine right now
> without PCID.

With switching between different page tables you need to flush the TLB.
That was my point.

> 
>>> Global pages in PCID 0 could then still be permitted, and wouldn't
>>> ever need flushing except when FLUSH_TLB_GLOBAL is requested.
>>>
>>> As to the use of two separate PCIDs for PV kernel and user modes
>>> - while this helps isolation, it prevents recovering the non-XPTI
>>> property of user mode TLB entries surviving in-guest mode switches.
>>
>> I don't get that. With PCID the guest's kernel _and_ user entries
>> will survive in-guest mode switches as there is no TLB flushing
>> involved (the no-flush bit is set in v->arch.cr3 for both modes).
>>
>> The only downside are guest kernel accesses to user pages: they will
>> need additional TLB entries as the PCID is different.
> 
> That's the point I was trying to make. This was further explained
> in my previous reply a little further down.
> 
>>> I wonder whether this is part of the reason you see PCID have a
>>> negative effect in the non-XPTI case.
>>>
>>> So in the end the question is: Why not use just two PCIDs, and
>>> allow global pages just like we do now, with the added benefit
>>> that we no longer need to flush Xen's global TLB entries just
>>> because we want to get rid of PV guest user ones.
>>
>> I can't see how that would work without either needing some more TLB
>> flushes in order to prevent stale TLB entries or loosing the Meltdown
>> mitigation.
>>
>> Which %cr3/PCID combination should be used in hypervisor, guest kernel
>> and guest user mode?
> 
> Xen would run with PCID 0 (and full Xen mappings) at all times
> (except early entry and late exit code of course). The guest would
> run with PCID 1 (and minimal Xen mappings) at all times. The switch
> of PCID eliminates the need for flushes on the way out and back in.

You still need the kernel page tables flushed when switching to user
mode, right?

> 
>> Which pages would be global?
> 
> Use of global pages would continue to be as today: Xen has some,
> and guest user mode has some. Of course it is quite possible that
> the use of global pages with a single guest PCID is still worse than
> no global pages with two guest PCIDs, but that's a separate step
> to take (and measure) imo.

But global pages of Xen would either make it vulnerable with regard to
Meltdown or you need a TLB flush again when switching between Xen and
guest making all the PCID stuff moot.

> 
>>>> I don't
>>>> want to use global guest user pages together with PCID as flushing
>>>> global pages from the TLB with PCID enabled requires flushing either
>>>> the complete TLB or you'd have to use INVLPG in all possible address
>>>> spaces (so you'd need to have multiple %cr3 switches).
>>>
>>> Well, yes, flushing _individual_ pages is a problem in that case.
>>> As to multiple CR3 switches - are these all that bad really with
>>> the no-flush bit set? With the reduced number of PCIDs in actual
>>> use (as discussed above) "all possible address spaces" would
>>> mean just two. And I could imagine that in a number of cases
>>> just one INVLPG (with the right PCID active) might suffice.
>>>
>>> One complicating factor is that we don't want to introduce
>>> Xen TLB entries for other than what we map in the minimal page
>>> tables into PV guest PCID space, which would happen if we
>>> simply switched PCID around an INVLPG.
>>>
>>> What I don't understand in any event is why you need separate
>>> PCIDs for Xen depending on whether the active PV guest is in
>>> kernel or user mode.
>>
>> Main reason are the different page tables anchored in %cr3.
> 
> Hmm, right, looks like we can't have the best of both worlds: We'd
> like the Xen part of the address space to be shared, but the guest
> part of it to be separate. Question then still is whether the reduced
> flushing outweighs the reduced sharing. IOW between what we
> have today (a single PCID and a lot of flushing) and what you
> introduce (four PCIDs and very little flushing) is a middle approach -
> two PCIDs plus some flushing.

As I believe kernel is touching much less user pages than kernel pages
I'm pretty sure my approach is better in most cases.

And TBH I'm still not sure the "some flushing" wouldn't be too
expensive.

So lets compare the possibilities:

My approach:
- no global pages
- 4 different PCIDs
- no TLB flushes needed when switching between Xen and guest
- no TLB flushes needed when switching between guest user and kernel
- flushing of single pages (guest or Xen) rather simple (4 INVPCIDs)
- flushing of complete TLB via 1 INVPCID

2 PCIDs (Xen and guest), keeping guest user pages as global pages
- Xen can't use global pages - global bit must be handled dynamically
  for Xen pages (or do we want to drop global pages e.g. for AMD, too?
- 2 PCIDs
- no TLB flushes needed when switching between Xen and guest
- when switching from guest kernel to guest user the kernel pages must
  be flushed from TLB
- flushing of single guest user pages needs 2 changes of %cr3 and 2
  INVLPGs, switch code must be mapped to guest page tables
- flushing of complete TLB via 1 INVPCID

So the advantage of the 2 PCID solution are the single TLB entries for
guest user pages compared to 2 entries for guest user pages accessed by
the guest kernel or Xen.

The disadvantage are the flushed guest kernel pages when executing user
code, the more complicated single user page flushing and the dynamical
Xen global bit handling.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
  2018-03-26  6:49             ` Juergen Gross
@ 2018-03-26  8:28               ` Jan Beulich
       [not found]               ` <5AB8CB3902000078001B5F3A@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-26  8:28 UTC (permalink / raw)
  To: Juergen Gross
  Cc: andrew.cooper3, Kevin Tian, Dario Faggioli, Jun Nakajima, xen-devel

>>> On 26.03.18 at 08:49, <jgross@suse.com> wrote:
> On 23/03/18 16:58, Jan Beulich wrote:
>>>>> On 23.03.18 at 15:11, <jgross@suse.com> wrote:
>>> On 23/03/18 14:46, Jan Beulich wrote:
>>>> So in the end the question is: Why not use just two PCIDs, and
>>>> allow global pages just like we do now, with the added benefit
>>>> that we no longer need to flush Xen's global TLB entries just
>>>> because we want to get rid of PV guest user ones.
>>>
>>> I can't see how that would work without either needing some more TLB
>>> flushes in order to prevent stale TLB entries or loosing the Meltdown
>>> mitigation.
>>>
>>> Which %cr3/PCID combination should be used in hypervisor, guest kernel
>>> and guest user mode?
>> 
>> Xen would run with PCID 0 (and full Xen mappings) at all times
>> (except early entry and late exit code of course). The guest would
>> run with PCID 1 (and minimal Xen mappings) at all times. The switch
>> of PCID eliminates the need for flushes on the way out and back in.
> 
> You still need the kernel page tables flushed when switching to user
> mode, right?

Of course.

>>> Which pages would be global?
>> 
>> Use of global pages would continue to be as today: Xen has some,
>> and guest user mode has some. Of course it is quite possible that
>> the use of global pages with a single guest PCID is still worse than
>> no global pages with two guest PCIDs, but that's a separate step
>> to take (and measure) imo.
> 
> But global pages of Xen would either make it vulnerable with regard to
> Meltdown or you need a TLB flush again when switching between Xen and
> guest making all the PCID stuff moot.

No - the guest would run with PCID 1 active, and global Xen TLB
entries would exist for PCID 0 only.

> So lets compare the possibilities:
> 
> My approach:
> - no global pages
> - 4 different PCIDs
> - no TLB flushes needed when switching between Xen and guest
> - no TLB flushes needed when switching between guest user and kernel
> - flushing of single pages (guest or Xen) rather simple (4 INVPCIDs)
> - flushing of complete TLB via 1 INVPCID
> 
> 2 PCIDs (Xen and guest), keeping guest user pages as global pages
> - Xen can't use global pages - global bit must be handled dynamically
>   for Xen pages (or do we want to drop global pages e.g. for AMD, too?

As per above - I don't see why Xen couldn't use global pages.
The option of using them is part of why I'm wondering whether
this might be worth looking into.

> - 2 PCIDs
> - no TLB flushes needed when switching between Xen and guest
> - when switching from guest kernel to guest user the kernel pages must
>   be flushed from TLB
> - flushing of single guest user pages needs 2 changes of %cr3 and 2
>   INVLPGs, switch code must be mapped to guest page tables
> - flushing of complete TLB via 1 INVPCID
> 
> So the advantage of the 2 PCID solution are the single TLB entries for
> guest user pages compared to 2 entries for guest user pages accessed by
> the guest kernel or Xen.
> 
> The disadvantage are the flushed guest kernel pages when executing user
> code, the more complicated single user page flushing and the dynamical
> Xen global bit handling.

Right. In order to make forward progress here I think we should
shelve the discussion on the 2-PCID alternative for now. What I'd
like to ask for as a change to your current approach is to use
PCID 0 for Xen rather than running Xen with PCIDs 2 or 3 when
PCIDs are enabled, and (implicitly) with PCID 0 when they're
disabled. Or alternatively don't use PCID 0 at all when PCIDs are
enabled. I'm simply worried of us overlooking a case where PCID
0 TLB entries may be left in place (when switching between PCIDs
enabled and PCIDs disabled) when they should have been flushed,
opening back up a Meltdown-like attack window.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
       [not found]               ` <5AB8CB3902000078001B5F3A@suse.com>
@ 2018-03-26  8:55                 ` Juergen Gross
  2018-03-26 10:13                   ` Jan Beulich
       [not found]                   ` <5AB8E3DC02000078001B5FEB@suse.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-26  8:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, Kevin Tian, xen-devel, Jun Nakajima, Dario Faggioli

On 26/03/18 10:28, Jan Beulich wrote:
>>>> On 26.03.18 at 08:49, <jgross@suse.com> wrote:
>> On 23/03/18 16:58, Jan Beulich wrote:
>>>>>> On 23.03.18 at 15:11, <jgross@suse.com> wrote:
>>>> On 23/03/18 14:46, Jan Beulich wrote:
>>>>> So in the end the question is: Why not use just two PCIDs, and
>>>>> allow global pages just like we do now, with the added benefit
>>>>> that we no longer need to flush Xen's global TLB entries just
>>>>> because we want to get rid of PV guest user ones.
>>>>
>>>> I can't see how that would work without either needing some more TLB
>>>> flushes in order to prevent stale TLB entries or loosing the Meltdown
>>>> mitigation.
>>>>
>>>> Which %cr3/PCID combination should be used in hypervisor, guest kernel
>>>> and guest user mode?
>>>
>>> Xen would run with PCID 0 (and full Xen mappings) at all times
>>> (except early entry and late exit code of course). The guest would
>>> run with PCID 1 (and minimal Xen mappings) at all times. The switch
>>> of PCID eliminates the need for flushes on the way out and back in.
>>
>> You still need the kernel page tables flushed when switching to user
>> mode, right?
> 
> Of course.
> 
>>>> Which pages would be global?
>>>
>>> Use of global pages would continue to be as today: Xen has some,
>>> and guest user mode has some. Of course it is quite possible that
>>> the use of global pages with a single guest PCID is still worse than
>>> no global pages with two guest PCIDs, but that's a separate step
>>> to take (and measure) imo.
>>
>> But global pages of Xen would either make it vulnerable with regard to
>> Meltdown or you need a TLB flush again when switching between Xen and
>> guest making all the PCID stuff moot.
> 
> No - the guest would run with PCID 1 active, and global Xen TLB
> entries would exist for PCID 0 only.

Uuh, global pages are accessible via all PCIDs. That's why they are
called global...

> 
>> So lets compare the possibilities:
>>
>> My approach:
>> - no global pages
>> - 4 different PCIDs
>> - no TLB flushes needed when switching between Xen and guest
>> - no TLB flushes needed when switching between guest user and kernel
>> - flushing of single pages (guest or Xen) rather simple (4 INVPCIDs)
>> - flushing of complete TLB via 1 INVPCID
>>
>> 2 PCIDs (Xen and guest), keeping guest user pages as global pages
>> - Xen can't use global pages - global bit must be handled dynamically
>>   for Xen pages (or do we want to drop global pages e.g. for AMD, too?
> 
> As per above - I don't see why Xen couldn't use global pages.
> The option of using them is part of why I'm wondering whether
> this might be worth looking into.

See chapter 4.10.2.4 SDM Vol. 3

> 
>> - 2 PCIDs
>> - no TLB flushes needed when switching between Xen and guest
>> - when switching from guest kernel to guest user the kernel pages must
>>   be flushed from TLB
>> - flushing of single guest user pages needs 2 changes of %cr3 and 2
>>   INVLPGs, switch code must be mapped to guest page tables
>> - flushing of complete TLB via 1 INVPCID
>>
>> So the advantage of the 2 PCID solution are the single TLB entries for
>> guest user pages compared to 2 entries for guest user pages accessed by
>> the guest kernel or Xen.
>>
>> The disadvantage are the flushed guest kernel pages when executing user
>> code, the more complicated single user page flushing and the dynamical
>> Xen global bit handling.
> 
> Right. In order to make forward progress here I think we should
> shelve the discussion on the 2-PCID alternative for now. What I'd
> like to ask for as a change to your current approach is to use
> PCID 0 for Xen rather than running Xen with PCIDs 2 or 3 when
> PCIDs are enabled, and (implicitly) with PCID 0 when they're
> disabled. Or alternatively don't use PCID 0 at all when PCIDs are
> enabled. I'm simply worried of us overlooking a case where PCID
> 0 TLB entries may be left in place (when switching between PCIDs
> enabled and PCIDs disabled) when they should have been flushed,
> opening back up a Meltdown-like attack window.
The reason I didn't use PCID 0 for running Xen was to use a few
INVPCID calls as possible for single page invalidation and still
covering the cases for PCID on while XPTI off and including PCID 0.

I can change the scheme to use different values for guest PCIDs
with XPTI on, of course. Are you fine with:

- XPTI off: PCID 0 = kernel, PCID 1 = user
- XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
            PCID 2 = kernel/guest, PCID 3 = user/guest

Juergen


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
  2018-03-26  8:55                 ` Juergen Gross
@ 2018-03-26 10:13                   ` Jan Beulich
       [not found]                   ` <5AB8E3DC02000078001B5FEB@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-26 10:13 UTC (permalink / raw)
  To: Juergen Gross
  Cc: andrew.cooper3, Kevin Tian, Dario Faggioli, Jun Nakajima, xen-devel

>>> On 26.03.18 at 10:55, <jgross@suse.com> wrote:
> On 26/03/18 10:28, Jan Beulich wrote:
>>>>> On 26.03.18 at 08:49, <jgross@suse.com> wrote:
>>> On 23/03/18 16:58, Jan Beulich wrote:
>>>>>>> On 23.03.18 at 15:11, <jgross@suse.com> wrote:
>>>>> On 23/03/18 14:46, Jan Beulich wrote:
>>>>>> So in the end the question is: Why not use just two PCIDs, and
>>>>>> allow global pages just like we do now, with the added benefit
>>>>>> that we no longer need to flush Xen's global TLB entries just
>>>>>> because we want to get rid of PV guest user ones.
>>>>>
>>>>> I can't see how that would work without either needing some more TLB
>>>>> flushes in order to prevent stale TLB entries or loosing the Meltdown
>>>>> mitigation.
>>>>>
>>>>> Which %cr3/PCID combination should be used in hypervisor, guest kernel
>>>>> and guest user mode?
>>>>
>>>> Xen would run with PCID 0 (and full Xen mappings) at all times
>>>> (except early entry and late exit code of course). The guest would
>>>> run with PCID 1 (and minimal Xen mappings) at all times. The switch
>>>> of PCID eliminates the need for flushes on the way out and back in.
>>>
>>> You still need the kernel page tables flushed when switching to user
>>> mode, right?
>> 
>> Of course.
>> 
>>>>> Which pages would be global?
>>>>
>>>> Use of global pages would continue to be as today: Xen has some,
>>>> and guest user mode has some. Of course it is quite possible that
>>>> the use of global pages with a single guest PCID is still worse than
>>>> no global pages with two guest PCIDs, but that's a separate step
>>>> to take (and measure) imo.
>>>
>>> But global pages of Xen would either make it vulnerable with regard to
>>> Meltdown or you need a TLB flush again when switching between Xen and
>>> guest making all the PCID stuff moot.
>> 
>> No - the guest would run with PCID 1 active, and global Xen TLB
>> entries would exist for PCID 0 only.
> 
> Uuh, global pages are accessible via all PCIDs. That's why they are
> called global...

Okay, in that case all of what I've said in this regard was rubbish.
(I don't, btw, think that this is the only sensible interpretation of
"global" - it could as well mean protected from ordinary flushes
within the given PCID realm.)

>>> - 2 PCIDs
>>> - no TLB flushes needed when switching between Xen and guest
>>> - when switching from guest kernel to guest user the kernel pages must
>>>   be flushed from TLB
>>> - flushing of single guest user pages needs 2 changes of %cr3 and 2
>>>   INVLPGs, switch code must be mapped to guest page tables
>>> - flushing of complete TLB via 1 INVPCID
>>>
>>> So the advantage of the 2 PCID solution are the single TLB entries for
>>> guest user pages compared to 2 entries for guest user pages accessed by
>>> the guest kernel or Xen.
>>>
>>> The disadvantage are the flushed guest kernel pages when executing user
>>> code, the more complicated single user page flushing and the dynamical
>>> Xen global bit handling.
>> 
>> Right. In order to make forward progress here I think we should
>> shelve the discussion on the 2-PCID alternative for now. What I'd
>> like to ask for as a change to your current approach is to use
>> PCID 0 for Xen rather than running Xen with PCIDs 2 or 3 when
>> PCIDs are enabled, and (implicitly) with PCID 0 when they're
>> disabled. Or alternatively don't use PCID 0 at all when PCIDs are
>> enabled. I'm simply worried of us overlooking a case where PCID
>> 0 TLB entries may be left in place (when switching between PCIDs
>> enabled and PCIDs disabled) when they should have been flushed,
>> opening back up a Meltdown-like attack window.
> The reason I didn't use PCID 0 for running Xen was to use a few
> INVPCID calls as possible for single page invalidation and still
> covering the cases for PCID on while XPTI off and including PCID 0.

How would the number of INVPCIDs needed differ depending on
the actual PCID values used?

> I can change the scheme to use different values for guest PCIDs
> with XPTI on, of course. Are you fine with:
> 
> - XPTI off: PCID 0 = kernel, PCID 1 = user
> - XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
>             PCID 2 = kernel/guest, PCID 3 = user/guest

Yes, that would fit the first variant I've described. I take it you
prefer not to avoid PCID 0 altogether when PCIDs are enabled -
is there a particular reason?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
       [not found]                   ` <5AB8E3DC02000078001B5FEB@suse.com>
@ 2018-03-26 10:29                     ` Juergen Gross
  2018-03-26 10:43                       ` Jan Beulich
       [not found]                       ` <5AB8EAE902000078001B6034@suse.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-26 10:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, Kevin Tian, xen-devel, Jun Nakajima, Dario Faggioli

On 26/03/18 12:13, Jan Beulich wrote:
>>>> On 26.03.18 at 10:55, <jgross@suse.com> wrote:
>> On 26/03/18 10:28, Jan Beulich wrote:
>>>>>> On 26.03.18 at 08:49, <jgross@suse.com> wrote:
>>>> On 23/03/18 16:58, Jan Beulich wrote:
>>>>>>>> On 23.03.18 at 15:11, <jgross@suse.com> wrote:
>>>>>> On 23/03/18 14:46, Jan Beulich wrote:
>>>>>>> So in the end the question is: Why not use just two PCIDs, and
>>>>>>> allow global pages just like we do now, with the added benefit
>>>>>>> that we no longer need to flush Xen's global TLB entries just
>>>>>>> because we want to get rid of PV guest user ones.
>>>>>>
>>>>>> I can't see how that would work without either needing some more TLB
>>>>>> flushes in order to prevent stale TLB entries or loosing the Meltdown
>>>>>> mitigation.
>>>>>>
>>>>>> Which %cr3/PCID combination should be used in hypervisor, guest kernel
>>>>>> and guest user mode?
>>>>>
>>>>> Xen would run with PCID 0 (and full Xen mappings) at all times
>>>>> (except early entry and late exit code of course). The guest would
>>>>> run with PCID 1 (and minimal Xen mappings) at all times. The switch
>>>>> of PCID eliminates the need for flushes on the way out and back in.
>>>>
>>>> You still need the kernel page tables flushed when switching to user
>>>> mode, right?
>>>
>>> Of course.
>>>
>>>>>> Which pages would be global?
>>>>>
>>>>> Use of global pages would continue to be as today: Xen has some,
>>>>> and guest user mode has some. Of course it is quite possible that
>>>>> the use of global pages with a single guest PCID is still worse than
>>>>> no global pages with two guest PCIDs, but that's a separate step
>>>>> to take (and measure) imo.
>>>>
>>>> But global pages of Xen would either make it vulnerable with regard to
>>>> Meltdown or you need a TLB flush again when switching between Xen and
>>>> guest making all the PCID stuff moot.
>>>
>>> No - the guest would run with PCID 1 active, and global Xen TLB
>>> entries would exist for PCID 0 only.
>>
>> Uuh, global pages are accessible via all PCIDs. That's why they are
>> called global...
> 
> Okay, in that case all of what I've said in this regard was rubbish.
> (I don't, btw, think that this is the only sensible interpretation of
> "global" - it could as well mean protected from ordinary flushes
> within the given PCID realm.)

That's the reason I gave the reference to the SDM. It clearly states
that TLB entries with the global bit set don't have to match the current
PCID for being regarded to match.

> 
>>>> - 2 PCIDs
>>>> - no TLB flushes needed when switching between Xen and guest
>>>> - when switching from guest kernel to guest user the kernel pages must
>>>>   be flushed from TLB
>>>> - flushing of single guest user pages needs 2 changes of %cr3 and 2
>>>>   INVLPGs, switch code must be mapped to guest page tables
>>>> - flushing of complete TLB via 1 INVPCID
>>>>
>>>> So the advantage of the 2 PCID solution are the single TLB entries for
>>>> guest user pages compared to 2 entries for guest user pages accessed by
>>>> the guest kernel or Xen.
>>>>
>>>> The disadvantage are the flushed guest kernel pages when executing user
>>>> code, the more complicated single user page flushing and the dynamical
>>>> Xen global bit handling.
>>>
>>> Right. In order to make forward progress here I think we should
>>> shelve the discussion on the 2-PCID alternative for now. What I'd
>>> like to ask for as a change to your current approach is to use
>>> PCID 0 for Xen rather than running Xen with PCIDs 2 or 3 when
>>> PCIDs are enabled, and (implicitly) with PCID 0 when they're
>>> disabled. Or alternatively don't use PCID 0 at all when PCIDs are
>>> enabled. I'm simply worried of us overlooking a case where PCID
>>> 0 TLB entries may be left in place (when switching between PCIDs
>>> enabled and PCIDs disabled) when they should have been flushed,
>>> opening back up a Meltdown-like attack window.
>> The reason I didn't use PCID 0 for running Xen was to use a few
>> INVPCID calls as possible for single page invalidation and still
>> covering the cases for PCID on while XPTI off and including PCID 0.
> 
> How would the number of INVPCIDs needed differ depending on
> the actual PCID values used?

See answer below.

>> I can change the scheme to use different values for guest PCIDs
>> with XPTI on, of course. Are you fine with:
>>
>> - XPTI off: PCID 0 = kernel, PCID 1 = user
>> - XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
>>             PCID 2 = kernel/guest, PCID 3 = user/guest
> 
> Yes, that would fit the first variant I've described. I take it you
> prefer not to avoid PCID 0 altogether when PCIDs are enabled -
> is there a particular reason?

Yes. As written in the comment in flush_area_local() I can't be sure
whether the current address space is that of a domain with XPTI
enabled (the idle domain could be "current"). So I need to always
flush with PCID 0 and with the possible PCID values for a XPTI domain.
When using PCID 0 for XPTI as well I'll need 4 INVPCIDs, while when
avoiding it I'd need 5 (at least when current == idle).


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
  2018-03-26 10:29                     ` Juergen Gross
@ 2018-03-26 10:43                       ` Jan Beulich
       [not found]                       ` <5AB8EAE902000078001B6034@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-26 10:43 UTC (permalink / raw)
  To: Juergen Gross
  Cc: andrew.cooper3, Kevin Tian, Dario Faggioli, Jun Nakajima, xen-devel

>>> On 26.03.18 at 12:29, <jgross@suse.com> wrote:
> On 26/03/18 12:13, Jan Beulich wrote:
>>>>> On 26.03.18 at 10:55, <jgross@suse.com> wrote:
>>> I can change the scheme to use different values for guest PCIDs
>>> with XPTI on, of course. Are you fine with:
>>>
>>> - XPTI off: PCID 0 = kernel, PCID 1 = user
>>> - XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
>>>             PCID 2 = kernel/guest, PCID 3 = user/guest
>> 
>> Yes, that would fit the first variant I've described. I take it you
>> prefer not to avoid PCID 0 altogether when PCIDs are enabled -
>> is there a particular reason?
> 
> Yes. As written in the comment in flush_area_local() I can't be sure
> whether the current address space is that of a domain with XPTI
> enabled (the idle domain could be "current"). So I need to always
> flush with PCID 0 and with the possible PCID values for a XPTI domain.
> When using PCID 0 for XPTI as well I'll need 4 INVPCIDs, while when
> avoiding it I'd need 5 (at least when current == idle).

I see. Which makes me wonder whether a suitable combination
of INVLPG (to get rid of global entries) and INVPCID couldn't be
used instead. For example, you may be able to replace the
INVPCID for the active PCID by INVLPG (without needing to
know who "current" is).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
       [not found]                       ` <5AB8EAE902000078001B6034@suse.com>
@ 2018-03-26 12:04                         ` Juergen Gross
  2018-03-26 12:19                           ` Jan Beulich
       [not found]                           ` <5AB9016002000078001B611E@suse.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-26 12:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, Kevin Tian, xen-devel, Jun Nakajima, Dario Faggioli

On 26/03/18 12:43, Jan Beulich wrote:
>>>> On 26.03.18 at 12:29, <jgross@suse.com> wrote:
>> On 26/03/18 12:13, Jan Beulich wrote:
>>>>>> On 26.03.18 at 10:55, <jgross@suse.com> wrote:
>>>> I can change the scheme to use different values for guest PCIDs
>>>> with XPTI on, of course. Are you fine with:
>>>>
>>>> - XPTI off: PCID 0 = kernel, PCID 1 = user
>>>> - XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
>>>>             PCID 2 = kernel/guest, PCID 3 = user/guest
>>>
>>> Yes, that would fit the first variant I've described. I take it you
>>> prefer not to avoid PCID 0 altogether when PCIDs are enabled -
>>> is there a particular reason?
>>
>> Yes. As written in the comment in flush_area_local() I can't be sure
>> whether the current address space is that of a domain with XPTI
>> enabled (the idle domain could be "current"). So I need to always
>> flush with PCID 0 and with the possible PCID values for a XPTI domain.
>> When using PCID 0 for XPTI as well I'll need 4 INVPCIDs, while when
>> avoiding it I'd need 5 (at least when current == idle).
> 
> I see. Which makes me wonder whether a suitable combination
> of INVLPG (to get rid of global entries) and INVPCID couldn't be
> used instead. For example, you may be able to replace the
> INVPCID for the active PCID by INVLPG (without needing to
> know who "current" is).

INVLPG has the disadvantage to clear all paging-structure cache
entries associated with the current PCID.

And I thought we were finally on the same page not to use global
pages with PCID enabled?


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
  2018-03-26 12:04                         ` Juergen Gross
@ 2018-03-26 12:19                           ` Jan Beulich
       [not found]                           ` <5AB9016002000078001B611E@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-26 12:19 UTC (permalink / raw)
  To: Juergen Gross
  Cc: andrew.cooper3, Kevin Tian, Dario Faggioli, Jun Nakajima, xen-devel

>>> On 26.03.18 at 14:04, <jgross@suse.com> wrote:
> On 26/03/18 12:43, Jan Beulich wrote:
>>>>> On 26.03.18 at 12:29, <jgross@suse.com> wrote:
>>> On 26/03/18 12:13, Jan Beulich wrote:
>>>>>>> On 26.03.18 at 10:55, <jgross@suse.com> wrote:
>>>>> I can change the scheme to use different values for guest PCIDs
>>>>> with XPTI on, of course. Are you fine with:
>>>>>
>>>>> - XPTI off: PCID 0 = kernel, PCID 1 = user
>>>>> - XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
>>>>>             PCID 2 = kernel/guest, PCID 3 = user/guest
>>>>
>>>> Yes, that would fit the first variant I've described. I take it you
>>>> prefer not to avoid PCID 0 altogether when PCIDs are enabled -
>>>> is there a particular reason?
>>>
>>> Yes. As written in the comment in flush_area_local() I can't be sure
>>> whether the current address space is that of a domain with XPTI
>>> enabled (the idle domain could be "current"). So I need to always
>>> flush with PCID 0 and with the possible PCID values for a XPTI domain.
>>> When using PCID 0 for XPTI as well I'll need 4 INVPCIDs, while when
>>> avoiding it I'd need 5 (at least when current == idle).
>> 
>> I see. Which makes me wonder whether a suitable combination
>> of INVLPG (to get rid of global entries) and INVPCID couldn't be
>> used instead. For example, you may be able to replace the
>> INVPCID for the active PCID by INVLPG (without needing to
>> know who "current" is).
> 
> INVLPG has the disadvantage to clear all paging-structure cache
> entries associated with the current PCID.
> 
> And I thought we were finally on the same page not to use global
> pages with PCID enabled?

Yes, we are. If no global TLB entries can ever survive a CR4.PCIDE
0 -> 1 transition, all would be fine without INVLPG of course.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 7/7] xen/x86: use PCID feature
       [not found]                           ` <5AB9016002000078001B611E@suse.com>
@ 2018-03-26 12:22                             ` Juergen Gross
  0 siblings, 0 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-26 12:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, Kevin Tian, xen-devel, Jun Nakajima, Dario Faggioli

On 26/03/18 14:19, Jan Beulich wrote:
>>>> On 26.03.18 at 14:04, <jgross@suse.com> wrote:
>> On 26/03/18 12:43, Jan Beulich wrote:
>>>>>> On 26.03.18 at 12:29, <jgross@suse.com> wrote:
>>>> On 26/03/18 12:13, Jan Beulich wrote:
>>>>>>>> On 26.03.18 at 10:55, <jgross@suse.com> wrote:
>>>>>> I can change the scheme to use different values for guest PCIDs
>>>>>> with XPTI on, of course. Are you fine with:
>>>>>>
>>>>>> - XPTI off: PCID 0 = kernel, PCID 1 = user
>>>>>> - XPTI on:  PCID 0 = kernel/Xen, PCID 1 = user/Xen,
>>>>>>             PCID 2 = kernel/guest, PCID 3 = user/guest
>>>>>
>>>>> Yes, that would fit the first variant I've described. I take it you
>>>>> prefer not to avoid PCID 0 altogether when PCIDs are enabled -
>>>>> is there a particular reason?
>>>>
>>>> Yes. As written in the comment in flush_area_local() I can't be sure
>>>> whether the current address space is that of a domain with XPTI
>>>> enabled (the idle domain could be "current"). So I need to always
>>>> flush with PCID 0 and with the possible PCID values for a XPTI domain.
>>>> When using PCID 0 for XPTI as well I'll need 4 INVPCIDs, while when
>>>> avoiding it I'd need 5 (at least when current == idle).
>>>
>>> I see. Which makes me wonder whether a suitable combination
>>> of INVLPG (to get rid of global entries) and INVPCID couldn't be
>>> used instead. For example, you may be able to replace the
>>> INVPCID for the active PCID by INVLPG (without needing to
>>> know who "current" is).
>>
>> INVLPG has the disadvantage to clear all paging-structure cache
>> entries associated with the current PCID.
>>
>> And I thought we were finally on the same page not to use global
>> pages with PCID enabled?
> 
> Yes, we are. If no global TLB entries can ever survive a CR4.PCIDE
> 0 -> 1 transition, all would be fine without INVLPG of course.

Okay, I'll add an ASSERT() to make sure this is true:

ASSERT(!cr4.pge || !cr4.pcide)


Juergen


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
       [not found]   ` <5AB3E82402000078001B5408@suse.com>
  2018-03-22 18:18     ` Juergen Gross
@ 2018-03-27  7:14     ` Juergen Gross
  2018-03-27  7:23       ` Jan Beulich
       [not found]       ` <5ABA0DAF02000078001B6588@suse.com>
  1 sibling, 2 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-27  7:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 22/03/18 17:30, Jan Beulich wrote:
>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
> 
> I continue to be not entirely convinced of this move. I had an
> alternative in mind: Since retaining global pages is particularly
> relevant for switches between guest user and guest kernel
> modes, what if we made a shortcut from e.g. lstar_enter through
> switch_to_kernel to restore_all_guest without ever switching to
> the full page Xen tables?
> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>  void write_ptbase(struct vcpu *v)
>>  {
>>      struct cpu_info *cpu_info = get_cpu_info();
>> +    unsigned long new_cr4;
>> +
>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
> 
> I'm not overly happy to see any new uses of mmu_cr4_features.
> This should really only be used for priming certain values imo,
> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
> so too, and perhaps better wouldn't). Hence I wonder whether
> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
> because we've just got rid of the blanket reversion to
> mmu_cr4_features in VMX code.

I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
keep bits switched on which a pv domain is allowed to modify (plus
CR4_TSD eventually).

Do we really want that?

We could mask away certain bits, of course, but in the end we'd just
have a default calculated cr4 value instead of having it just set
somewhere initially.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-27  7:14     ` Juergen Gross
@ 2018-03-27  7:23       ` Jan Beulich
       [not found]       ` <5ABA0DAF02000078001B6588@suse.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-27  7:23 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 27.03.18 at 09:14, <jgross@suse.com> wrote:
> On 22/03/18 17:30, Jan Beulich wrote:
>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>> Instead of flushing the TLB from global pages when switching address
>>> spaces with XPTI being active just disable global pages via %cr4
>>> completely when a domain subject to XPTI is active. This avoids the
>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>> entries.
>> 
>> I continue to be not entirely convinced of this move. I had an
>> alternative in mind: Since retaining global pages is particularly
>> relevant for switches between guest user and guest kernel
>> modes, what if we made a shortcut from e.g. lstar_enter through
>> switch_to_kernel to restore_all_guest without ever switching to
>> the full page Xen tables?
>> 
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>  void write_ptbase(struct vcpu *v)
>>>  {
>>>      struct cpu_info *cpu_info = get_cpu_info();
>>> +    unsigned long new_cr4;
>>> +
>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>> 
>> I'm not overly happy to see any new uses of mmu_cr4_features.
>> This should really only be used for priming certain values imo,
>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>> so too, and perhaps better wouldn't). Hence I wonder whether
>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>> because we've just got rid of the blanket reversion to
>> mmu_cr4_features in VMX code.
> 
> I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
> keep bits switched on which a pv domain is allowed to modify (plus
> CR4_TSD eventually).
> 
> Do we really want that?

Does it matter what exact CR4 settings we run with when it's not
a PV guest that's in context, and when we don't depend on the
settings ourselves? I don't think it does, and HVM guests run with
their own CR4 anyway. In fact there may end up being cases
where we won't need to switch CR4 another time when we come
here the next time with v being a PV vCPU.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
       [not found]       ` <5ABA0DAF02000078001B6588@suse.com>
@ 2018-03-27  7:37         ` Juergen Gross
  2018-03-27  8:29           ` Juergen Gross
                             ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-27  7:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 27/03/18 09:23, Jan Beulich wrote:
>>>> On 27.03.18 at 09:14, <jgross@suse.com> wrote:
>> On 22/03/18 17:30, Jan Beulich wrote:
>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>> Instead of flushing the TLB from global pages when switching address
>>>> spaces with XPTI being active just disable global pages via %cr4
>>>> completely when a domain subject to XPTI is active. This avoids the
>>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>>> entries.
>>>
>>> I continue to be not entirely convinced of this move. I had an
>>> alternative in mind: Since retaining global pages is particularly
>>> relevant for switches between guest user and guest kernel
>>> modes, what if we made a shortcut from e.g. lstar_enter through
>>> switch_to_kernel to restore_all_guest without ever switching to
>>> the full page Xen tables?
>>>
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>  void write_ptbase(struct vcpu *v)
>>>>  {
>>>>      struct cpu_info *cpu_info = get_cpu_info();
>>>> +    unsigned long new_cr4;
>>>> +
>>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>
>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>> This should really only be used for priming certain values imo,
>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>> because we've just got rid of the blanket reversion to
>>> mmu_cr4_features in VMX code.
>>
>> I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
>> keep bits switched on which a pv domain is allowed to modify (plus
>> CR4_TSD eventually).
>>
>> Do we really want that?
> 
> Does it matter what exact CR4 settings we run with when it's not
> a PV guest that's in context, and when we don't depend on the
> settings ourselves? I don't think it does, and HVM guests run with
> their own CR4 anyway. In fact there may end up being cases
> where we won't need to switch CR4 another time when we come
> here the next time with v being a PV vCPU.

I could imagine that there is some performance impact. cr4.tsd set
might make rdtsc a little bit slower as an additional privilege level
check is needed.

Suspending requires cr4.fsgsbase to be set, which might have been
reset by a pv guest.


Juergen


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-27  7:37         ` Juergen Gross
@ 2018-03-27  8:29           ` Juergen Gross
  2018-03-27  8:33           ` Jan Beulich
       [not found]           ` <5ABA1DE202000078001B65FD@suse.com>
  2 siblings, 0 replies; 49+ messages in thread
From: Juergen Gross @ 2018-03-27  8:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

On 27/03/18 09:37, Juergen Gross wrote:
> On 27/03/18 09:23, Jan Beulich wrote:
>>>>> On 27.03.18 at 09:14, <jgross@suse.com> wrote:
>>> On 22/03/18 17:30, Jan Beulich wrote:
>>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>>> Instead of flushing the TLB from global pages when switching address
>>>>> spaces with XPTI being active just disable global pages via %cr4
>>>>> completely when a domain subject to XPTI is active. This avoids the
>>>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>>>> entries.
>>>>
>>>> I continue to be not entirely convinced of this move. I had an
>>>> alternative in mind: Since retaining global pages is particularly
>>>> relevant for switches between guest user and guest kernel
>>>> modes, what if we made a shortcut from e.g. lstar_enter through
>>>> switch_to_kernel to restore_all_guest without ever switching to
>>>> the full page Xen tables?
>>>>
>>>>> --- a/xen/arch/x86/mm.c
>>>>> +++ b/xen/arch/x86/mm.c
>>>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>>  void write_ptbase(struct vcpu *v)
>>>>>  {
>>>>>      struct cpu_info *cpu_info = get_cpu_info();
>>>>> +    unsigned long new_cr4;
>>>>> +
>>>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>>
>>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>>> This should really only be used for priming certain values imo,
>>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>>> because we've just got rid of the blanket reversion to
>>>> mmu_cr4_features in VMX code.
>>>
>>> I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
>>> keep bits switched on which a pv domain is allowed to modify (plus
>>> CR4_TSD eventually).
>>>
>>> Do we really want that?
>>
>> Does it matter what exact CR4 settings we run with when it's not
>> a PV guest that's in context, and when we don't depend on the
>> settings ourselves? I don't think it does, and HVM guests run with
>> their own CR4 anyway. In fact there may end up being cases
>> where we won't need to switch CR4 another time when we come
>> here the next time with v being a PV vCPU.
> 
> I could imagine that there is some performance impact. cr4.tsd set
> might make rdtsc a little bit slower as an additional privilege level
> check is needed.
> 
> Suspending requires cr4.fsgsbase to be set, which might have been
> reset by a pv guest.

Which is wrong, of course, as pv_guest_cr4_to_real_cr4() will always
return a 1 for cr4.fsgsbase if the processor supports that bit.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-27  7:37         ` Juergen Gross
  2018-03-27  8:29           ` Juergen Gross
@ 2018-03-27  8:33           ` Jan Beulich
       [not found]           ` <5ABA1DE202000078001B65FD@suse.com>
  2 siblings, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-27  8:33 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 27.03.18 at 09:37, <jgross@suse.com> wrote:
> On 27/03/18 09:23, Jan Beulich wrote:
>>>>> On 27.03.18 at 09:14, <jgross@suse.com> wrote:
>>> On 22/03/18 17:30, Jan Beulich wrote:
>>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>>> Instead of flushing the TLB from global pages when switching address
>>>>> spaces with XPTI being active just disable global pages via %cr4
>>>>> completely when a domain subject to XPTI is active. This avoids the
>>>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>>>> entries.
>>>>
>>>> I continue to be not entirely convinced of this move. I had an
>>>> alternative in mind: Since retaining global pages is particularly
>>>> relevant for switches between guest user and guest kernel
>>>> modes, what if we made a shortcut from e.g. lstar_enter through
>>>> switch_to_kernel to restore_all_guest without ever switching to
>>>> the full page Xen tables?
>>>>
>>>>> --- a/xen/arch/x86/mm.c
>>>>> +++ b/xen/arch/x86/mm.c
>>>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>>  void write_ptbase(struct vcpu *v)
>>>>>  {
>>>>>      struct cpu_info *cpu_info = get_cpu_info();
>>>>> +    unsigned long new_cr4;
>>>>> +
>>>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>>
>>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>>> This should really only be used for priming certain values imo,
>>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>>> because we've just got rid of the blanket reversion to
>>>> mmu_cr4_features in VMX code.
>>>
>>> I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
>>> keep bits switched on which a pv domain is allowed to modify (plus
>>> CR4_TSD eventually).
>>>
>>> Do we really want that?
>> 
>> Does it matter what exact CR4 settings we run with when it's not
>> a PV guest that's in context, and when we don't depend on the
>> settings ourselves? I don't think it does, and HVM guests run with
>> their own CR4 anyway. In fact there may end up being cases
>> where we won't need to switch CR4 another time when we come
>> here the next time with v being a PV vCPU.
> 
> I could imagine that there is some performance impact. cr4.tsd set
> might make rdtsc a little bit slower as an additional privilege level
> check is needed.

Quite possible, indeed. Another opinion on the route to take
would be helpful here. Andrew?

> Suspending requires cr4.fsgsbase to be set, which might have been
> reset by a pv guest.

pv_guest_cr4_to_real_cr4() consistently forces this bit to 1 (on
capable hardware).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
       [not found]           ` <5ABA1DE202000078001B65FD@suse.com>
@ 2018-03-27  8:44             ` Juergen Gross
  2018-03-27  8:54               ` Jan Beulich
  0 siblings, 1 reply; 49+ messages in thread
From: Juergen Gross @ 2018-03-27  8:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, xen-devel, Dario Faggioli

On 27/03/18 10:33, Jan Beulich wrote:
>>>> On 27.03.18 at 09:37, <jgross@suse.com> wrote:
>> On 27/03/18 09:23, Jan Beulich wrote:
>>>>>> On 27.03.18 at 09:14, <jgross@suse.com> wrote:
>>>> On 22/03/18 17:30, Jan Beulich wrote:
>>>>>>>> On 21.03.18 at 13:51, <jgross@suse.com> wrote:
>>>>>> Instead of flushing the TLB from global pages when switching address
>>>>>> spaces with XPTI being active just disable global pages via %cr4
>>>>>> completely when a domain subject to XPTI is active. This avoids the
>>>>>> need for extra TLB flushes as loading %cr3 will remove all TLB
>>>>>> entries.
>>>>>
>>>>> I continue to be not entirely convinced of this move. I had an
>>>>> alternative in mind: Since retaining global pages is particularly
>>>>> relevant for switches between guest user and guest kernel
>>>>> modes, what if we made a shortcut from e.g. lstar_enter through
>>>>> switch_to_kernel to restore_all_guest without ever switching to
>>>>> the full page Xen tables?
>>>>>
>>>>>> --- a/xen/arch/x86/mm.c
>>>>>> +++ b/xen/arch/x86/mm.c
>>>>>> @@ -508,18 +508,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
>>>>>>  void write_ptbase(struct vcpu *v)
>>>>>>  {
>>>>>>      struct cpu_info *cpu_info = get_cpu_info();
>>>>>> +    unsigned long new_cr4;
>>>>>> +
>>>>>> +    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
>>>>>> +              ? pv_guest_cr4_to_real_cr4(v) : mmu_cr4_features;
>>>>>
>>>>> I'm not overly happy to see any new uses of mmu_cr4_features.
>>>>> This should really only be used for priming certain values imo,
>>>>> which isn't the case here (otoh pv_guest_cr4_to_real_cr4() does
>>>>> so too, and perhaps better wouldn't). Hence I wonder whether
>>>>> this shouldn't be read_cr4() | X86_CR4_PGE, not the least
>>>>> because we've just got rid of the blanket reversion to
>>>>> mmu_cr4_features in VMX code.
>>>>
>>>> I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
>>>> keep bits switched on which a pv domain is allowed to modify (plus
>>>> CR4_TSD eventually).
>>>>
>>>> Do we really want that?
>>>
>>> Does it matter what exact CR4 settings we run with when it's not
>>> a PV guest that's in context, and when we don't depend on the
>>> settings ourselves? I don't think it does, and HVM guests run with
>>> their own CR4 anyway. In fact there may end up being cases
>>> where we won't need to switch CR4 another time when we come
>>> here the next time with v being a PV vCPU.
>>
>> I could imagine that there is some performance impact. cr4.tsd set
>> might make rdtsc a little bit slower as an additional privilege level
>> check is needed.
> 
> Quite possible, indeed. Another opinion on the route to take
> would be helpful here. Andrew?

I could mask away tsd, of course. I need to do so for pcide already,
so that would be just another bit reset in the mask.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active
  2018-03-27  8:44             ` Juergen Gross
@ 2018-03-27  8:54               ` Jan Beulich
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Beulich @ 2018-03-27  8:54 UTC (permalink / raw)
  To: Juergen Gross; +Cc: andrew.cooper3, Dario Faggioli, xen-devel

>>> On 27.03.18 at 10:44, <jgross@suse.com> wrote:
> On 27/03/18 10:33, Jan Beulich wrote:
>>>>> On 27.03.18 at 09:37, <jgross@suse.com> wrote:
>>> On 27/03/18 09:23, Jan Beulich wrote:
>>>>>>> On 27.03.18 at 09:14, <jgross@suse.com> wrote:
>>>>> I just realized that using read_cr4() | X86_CR4_PGE is wrong. We would
>>>>> keep bits switched on which a pv domain is allowed to modify (plus
>>>>> CR4_TSD eventually).
>>>>>
>>>>> Do we really want that?
>>>>
>>>> Does it matter what exact CR4 settings we run with when it's not
>>>> a PV guest that's in context, and when we don't depend on the
>>>> settings ourselves? I don't think it does, and HVM guests run with
>>>> their own CR4 anyway. In fact there may end up being cases
>>>> where we won't need to switch CR4 another time when we come
>>>> here the next time with v being a PV vCPU.
>>>
>>> I could imagine that there is some performance impact. cr4.tsd set
>>> might make rdtsc a little bit slower as an additional privilege level
>>> check is needed.
>> 
>> Quite possible, indeed. Another opinion on the route to take
>> would be helpful here. Andrew?
> 
> I could mask away tsd, of course. I need to do so for pcide already,
> so that would be just another bit reset in the mask.

Fine with me.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2018-03-27  8:55 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-21 12:51 [PATCH v3 0/7] xen/x86: various XPTI speedups Juergen Gross
2018-03-21 12:51 ` [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible Juergen Gross
2018-03-22 14:31   ` Jan Beulich
     [not found]   ` <5AB3CC5502000078001B5254@suse.com>
2018-03-22 15:26     ` Juergen Gross
2018-03-22 15:42       ` Jan Beulich
2018-03-21 12:51 ` [PATCH v3 2/7] x86/xpti: don't flush TLB twice when switching to 64-bit pv context Juergen Gross
2018-03-22 14:50   ` Jan Beulich
2018-03-23 12:35     ` Juergen Gross
2018-03-21 12:51 ` [PATCH v3 3/7] xen/x86: support per-domain flag for xpti Juergen Gross
2018-03-22 15:26   ` Jan Beulich
     [not found]   ` <5AB3D92D02000078001B5326@suse.com>
2018-03-22 15:29     ` Juergen Gross
2018-03-22 15:44       ` Jan Beulich
     [not found]       ` <5AB3DD6A02000078001B538B@suse.com>
2018-03-22 18:05         ` Juergen Gross
2018-03-21 12:51 ` [PATCH v3 4/7] xen/x86: use invpcid for flushing the TLB Juergen Gross
2018-03-22 15:35   ` Jan Beulich
     [not found]   ` <5AB3DB5102000078001B5358@suse.com>
2018-03-22 18:03     ` Juergen Gross
2018-03-21 12:51 ` [PATCH v3 5/7] xen/x86: disable global pages for domains with XPTI active Juergen Gross
2018-03-22 16:30   ` Jan Beulich
     [not found]   ` <5AB3E82402000078001B5408@suse.com>
2018-03-22 18:18     ` Juergen Gross
2018-03-23  7:46       ` Jan Beulich
     [not found]       ` <5AB4BEED02000078001B563B@suse.com>
2018-03-23  7:58         ` Juergen Gross
2018-03-23  8:14           ` Jan Beulich
     [not found]           ` <5AB4C56602000078001B567F@suse.com>
2018-03-23  8:29             ` Juergen Gross
2018-03-23  8:51               ` Jan Beulich
2018-03-27  7:14     ` Juergen Gross
2018-03-27  7:23       ` Jan Beulich
     [not found]       ` <5ABA0DAF02000078001B6588@suse.com>
2018-03-27  7:37         ` Juergen Gross
2018-03-27  8:29           ` Juergen Gross
2018-03-27  8:33           ` Jan Beulich
     [not found]           ` <5ABA1DE202000078001B65FD@suse.com>
2018-03-27  8:44             ` Juergen Gross
2018-03-27  8:54               ` Jan Beulich
2018-03-21 12:51 ` [PATCH v3 6/7] xen/x86: use flag byte for decision whether xen_cr3 is valid Juergen Gross
2018-03-23  9:47   ` Jan Beulich
2018-03-21 12:51 ` [PATCH v3 7/7] xen/x86: use PCID feature Juergen Gross
2018-03-23 10:51   ` Jan Beulich
     [not found]   ` <5AB4EA4902000078001B577B@suse.com>
2018-03-23 11:29     ` Juergen Gross
2018-03-23 13:46       ` Jan Beulich
     [not found]       ` <5AB5136302000078001B58D9@suse.com>
2018-03-23 14:11         ` Juergen Gross
2018-03-23 15:58           ` Jan Beulich
     [not found]           ` <5AB5324C02000078001B59E7@suse.com>
2018-03-26  6:49             ` Juergen Gross
2018-03-26  8:28               ` Jan Beulich
     [not found]               ` <5AB8CB3902000078001B5F3A@suse.com>
2018-03-26  8:55                 ` Juergen Gross
2018-03-26 10:13                   ` Jan Beulich
     [not found]                   ` <5AB8E3DC02000078001B5FEB@suse.com>
2018-03-26 10:29                     ` Juergen Gross
2018-03-26 10:43                       ` Jan Beulich
     [not found]                       ` <5AB8EAE902000078001B6034@suse.com>
2018-03-26 12:04                         ` Juergen Gross
2018-03-26 12:19                           ` Jan Beulich
     [not found]                           ` <5AB9016002000078001B611E@suse.com>
2018-03-26 12:22                             ` Juergen Gross
2018-03-23 14:10 ` [PATCH v3 0/7] xen/x86: various XPTI speedups Dario Faggioli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.