* [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
@ 2018-01-22 12:32 Juergen Gross
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

In preparation for doing page table isolation in the Xen hypervisor in
order to mitigate "Meltdown", use dedicated stacks, GDT and TSS for
64-bit PV domains, mapped into the per-domain virtual area.

The per-vcpu stacks are used for early interrupt handling only. After
saving the domain's registers, the stacks are switched back to the
normal per physical cpu ones so that on-stack data of other cpus can
be addressed, e.g. while handling IPIs.

Adding %cr3 switching between saving the registers and switching the
stacks will make it possible to run guest code without any per
physical cpu mapping, i.e. avoiding the threat of a guest being able
to access other domains' data.
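
Roughly, the intended entry sequence will then look like the following
sketch (illustrative pseudo-assembly only, not taken from this series;
the symbol names are made up):

    /* Entered from the guest on the per-vcpu stack (per-domain area). */
    SAVE_ALL                     /* save the guest's registers */
    mov   %rsp, %rdi             /* %rdi keeps pointing at the saved regs */
    /* later in the series: switch %cr3 to the Xen page tables here */
    mov   per_cpu_stack, %rsp    /* back to the per physical cpu stack */
    call  handler                /* may now touch other cpus' on-stack data */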

Without further measures it will still be possible for e.g. a guest's
user program to read stack data of another vcpu of the same domain,
but this can easily be avoided by a small PV-ABI modification
introducing per-cpu user address spaces.

This series is meant as a replacement for Andrew's patch series:
"x86: Prerequisite work for a Xen KAISER solution".

What needs to be done:
- verify livepatching is still working
- performance evaluation (Dario is working on it)
- the real page table switching


Changes since RFC V1:
- switch back to per physical cpu stacks in interrupt handling
- complete rework of series
- rebase to current staging
- adding reverts of Jan's band-aid patches
- adding two minor cleanups at the beginning of the series
- done much more testing, including NMIs

Juergen Gross (12):
  x86: cleanup processor.h
  x86: don't use hypervisor stack size for dumping guest stacks
  x86: do a revert of e871e80c38547d9faefc6604532ba3e985e65873
  x86: revert 5784de3e2067ed73efc2fe42e62831e8ae7f46c4
  x86: don't access saved user regs via rsp in trap handlers
  x86: add a xpti command line parameter
  x86: allow per-domain mappings without NX bit or with specific mfn
  xen/x86: use dedicated function for tss initialization
  x86: enhance syscall stub to work in per-domain mapping
  x86: allocate per-vcpu stacks for interrupt entries
  x86: modify interrupt handlers to support stack switching
  x86: activate per-vcpu stacks in case of xpti

 docs/misc/xen-command-line.markdown |  16 +-
 xen/arch/x86/cpu/common.c           |  56 ++++---
 xen/arch/x86/domain.c               |  84 ++++++++--
 xen/arch/x86/mm.c                   | 102 ++++++++++---
 xen/arch/x86/pv/domain.c            | 161 +++++++++++++++++++-
 xen/arch/x86/smpboot.c              | 211 --------------------------
 xen/arch/x86/traps.c                |  26 ++--
 xen/arch/x86/x86_64/asm-offsets.c   |   6 +-
 xen/arch/x86/x86_64/compat/entry.S  |  98 ++++++------
 xen/arch/x86/x86_64/entry.S         | 295 ++++++++++++------------------------
 xen/arch/x86/x86_64/traps.c         |  47 +++---
 xen/common/wait.c                   |   8 +-
 xen/include/asm-x86/asm_defns.h     |  49 +++---
 xen/include/asm-x86/config.h        |  13 +-
 xen/include/asm-x86/current.h       |  71 ++++++---
 xen/include/asm-x86/desc.h          |   5 +
 xen/include/asm-x86/domain.h        |   5 +
 xen/include/asm-x86/mm.h            |   3 +
 xen/include/asm-x86/processor.h     |  42 -----
 xen/include/asm-x86/regs.h          |   2 +
 xen/include/asm-x86/system.h        |   8 +
 xen/include/asm-x86/x86_64/page.h   |   5 +-
 22 files changed, 647 insertions(+), 666 deletions(-)

-- 
2.13.6



* [PATCH RFC v2 01/12] x86: cleanup processor.h
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

Remove the NSC/Cyrix CPU macros and current_text_addr(), which are not
used anywhere.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/include/asm-x86/processor.h | 41 -----------------------------------------
 1 file changed, 41 deletions(-)

diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 9dd29bb04c..e8c2f02e99 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -102,16 +102,6 @@
 struct domain;
 struct vcpu;
 
-/*
- * Default implementation of macro that returns current
- * instruction pointer ("program counter").
- */
-#define current_text_addr() ({                      \
-    void *pc;                                       \
-    asm ( "leaq 1f(%%rip),%0\n1:" : "=r" (pc) );    \
-    pc;                                             \
-})
-
 struct x86_cpu_id {
     uint16_t vendor;
     uint16_t family;
@@ -375,37 +365,6 @@ static inline bool_t read_pkru_wd(uint32_t pkru, unsigned int pkey)
     return (pkru >> (pkey * PKRU_ATTRS + PKRU_WRITE)) & 1;
 }
 
-/*
- *      NSC/Cyrix CPU configuration register indexes
- */
-
-#define CX86_PCR0 0x20
-#define CX86_GCR  0xb8
-#define CX86_CCR0 0xc0
-#define CX86_CCR1 0xc1
-#define CX86_CCR2 0xc2
-#define CX86_CCR3 0xc3
-#define CX86_CCR4 0xe8
-#define CX86_CCR5 0xe9
-#define CX86_CCR6 0xea
-#define CX86_CCR7 0xeb
-#define CX86_PCR1 0xf0
-#define CX86_DIR0 0xfe
-#define CX86_DIR1 0xff
-#define CX86_ARR_BASE 0xc4
-#define CX86_RCR_BASE 0xdc
-
-/*
- *      NSC/Cyrix CPU indexed register access macros
- */
-
-#define getCx86(reg) ({ outb((reg), 0x22); inb(0x23); })
-
-#define setCx86(reg, data) do { \
-    outb((reg), 0x22); \
-    outb((data), 0x23); \
-} while (0)
-
 static always_inline void __monitor(const void *eax, unsigned long ecx,
                                     unsigned long edx)
 {
-- 
2.13.6



* [PATCH RFC v2 02/12] x86: don't use hypervisor stack size for dumping guest stacks
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

show_guest_stack() and compat_show_guest_stack() stop dumping the
guest's stack whenever its virtual address reaches the same alignment
that is used for the hypervisor stacks.

Remove this arbitrary limit and try to dump a fixed number of lines
instead.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/traps.c | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index a3e8f0c9b9..1115b69050 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -191,7 +191,8 @@ static void compat_show_guest_stack(struct vcpu *v,
                                     const struct cpu_user_regs *regs,
                                     int debug_stack_lines)
 {
-    unsigned int i, *stack, addr, mask = STACK_SIZE;
+    unsigned int i, *stack, addr;
+    unsigned long last_addr = -1L;
 
     stack = (unsigned int *)(unsigned long)regs->esp;
     printk("Guest stack trace from esp=%08lx:\n ", (unsigned long)stack);
@@ -220,13 +221,13 @@ static void compat_show_guest_stack(struct vcpu *v,
                 printk("Inaccessible guest memory.\n");
                 return;
             }
-            mask = PAGE_SIZE;
+            last_addr = round_pgup((unsigned long)stack);
         }
     }
 
     for ( i = 0; i < debug_stack_lines * 8; i++ )
     {
-        if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
+        if ( (unsigned long)stack >= last_addr )
             break;
         if ( __get_user(addr, stack) )
         {
@@ -241,11 +242,9 @@ static void compat_show_guest_stack(struct vcpu *v,
         printk(" %08x", addr);
         stack++;
     }
-    if ( mask == PAGE_SIZE )
-    {
-        BUILD_BUG_ON(PAGE_SIZE == STACK_SIZE);
+    if ( last_addr != -1L )
         unmap_domain_page(stack);
-    }
+
     if ( i == 0 )
         printk("Stack empty.");
     printk("\n");
@@ -254,8 +253,7 @@ static void compat_show_guest_stack(struct vcpu *v,
 static void show_guest_stack(struct vcpu *v, const struct cpu_user_regs *regs)
 {
     int i;
-    unsigned long *stack, addr;
-    unsigned long mask = STACK_SIZE;
+    unsigned long *stack, addr, last_addr = -1L;
 
     /* Avoid HVM as we don't know what the stack looks like. */
     if ( is_hvm_vcpu(v) )
@@ -290,13 +288,13 @@ static void show_guest_stack(struct vcpu *v, const struct cpu_user_regs *regs)
                 printk("Inaccessible guest memory.\n");
                 return;
             }
-            mask = PAGE_SIZE;
+            last_addr = round_pgup((unsigned long)stack);
         }
     }
 
     for ( i = 0; i < (debug_stack_lines*stack_words_per_line); i++ )
     {
-        if ( (((long)stack - 1) ^ ((long)(stack + 1) - 1)) & mask )
+        if ( (unsigned long)stack >= last_addr )
             break;
         if ( __get_user(addr, stack) )
         {
@@ -311,11 +309,9 @@ static void show_guest_stack(struct vcpu *v, const struct cpu_user_regs *regs)
         printk(" %p", _p(addr));
         stack++;
     }
-    if ( mask == PAGE_SIZE )
-    {
-        BUILD_BUG_ON(PAGE_SIZE == STACK_SIZE);
+    if ( last_addr != -1L )
         unmap_domain_page(stack);
-    }
+
     if ( i == 0 )
         printk("Stack empty.");
     printk("\n");
-- 
2.13.6



* [PATCH RFC v2 03/12] x86: do a revert of e871e80c38547d9faefc6604532ba3e985e65873
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

Revert "x86: allow Meltdown band-aid to be disabled" in order to
prepare for a final Meltdown mitigation.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 docs/misc/xen-command-line.markdown | 12 ------------
 xen/arch/x86/domain.c               |  7 ++-----
 xen/arch/x86/mm.c                   |  2 +-
 xen/arch/x86/smpboot.c              | 17 +++--------------
 xen/arch/x86/x86_64/entry.S         |  2 --
 5 files changed, 6 insertions(+), 34 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index f73990f7cd..f5214defbb 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1911,18 +1911,6 @@ In the case that x2apic is in use, this option switches between physical and
 clustered mode.  The default, given no hint from the **FADT**, is cluster
 mode.
 
-### xpti
-> `= <boolean>`
-
-> Default: `false` on AMD hardware
-> Default: `true` everywhere else
-
-Override default selection of whether to isolate 64-bit PV guest page
-tables.
-
-** WARNING: Not yet a complete isolation implementation, but better than
-nothing. **
-
 ### xsave
 > `= <boolean>`
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index dbf4522e69..8589d856be 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1509,13 +1509,10 @@ void paravirt_ctxt_switch_from(struct vcpu *v)
 
 void paravirt_ctxt_switch_to(struct vcpu *v)
 {
-    root_pgentry_t *root_pgt = this_cpu(root_pgt);
     unsigned long cr4;
 
-    if ( root_pgt )
-        root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
-            l4e_from_page(v->domain->arch.perdomain_l3_pg,
-                          __PAGE_HYPERVISOR_RW);
+    this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
+        l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
 
     cr4 = pv_guest_cr4_to_real_cr4(v);
     if ( unlikely(cr4 != read_cr4()) )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 5a1b472432..c83f5224c1 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3654,7 +3654,7 @@ long do_mmu_update(
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
                     if ( !rc )
-                        sync_guest = this_cpu(root_pgt);
+                        sync_guest = true;
                     break;
 
                 case PGT_writable_page:
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index fe637dae40..37a7e59760 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -329,7 +329,7 @@ void start_secondary(void *unused)
     spin_debug_disable();
 
     get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = this_cpu(root_pgt) ? __pa(this_cpu(root_pgt)) : 0;
+    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
 
     load_system_tables();
 
@@ -738,20 +738,14 @@ static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
     return 0;
 }
 
-static __read_mostly int8_t opt_xpti = -1;
-boolean_param("xpti", opt_xpti);
 DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
 
 static int setup_cpu_root_pgt(unsigned int cpu)
 {
-    root_pgentry_t *rpt;
+    root_pgentry_t *rpt = alloc_xen_pagetable();
     unsigned int off;
     int rc;
 
-    if ( !opt_xpti )
-        return 0;
-
-    rpt = alloc_xen_pagetable();
     if ( !rpt )
         return -ENOMEM;
 
@@ -1000,14 +994,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 
     stack_base[0] = stack_start;
 
-    if ( opt_xpti < 0 )
-        opt_xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
-
     rc = setup_cpu_root_pgt(0);
     if ( rc )
         panic("Error %d setting up PV root page table\n", rc);
-    if ( per_cpu(root_pgt, 0) )
-        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
+    get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
 
     set_nr_sockets();
 
@@ -1079,7 +1069,6 @@ void __init smp_prepare_boot_cpu(void)
 #endif
 
     get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = 0;
 }
 
 static void
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 710c0616ba..f753eb4c02 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -46,7 +46,6 @@ restore_all_guest:
         movabs $DIRECTMAP_VIRT_START, %rcx
         mov   %rdi, %rax
         and   %rsi, %rdi
-        jz    .Lrag_keep_cr3
         and   %r9, %rsi
         add   %rcx, %rdi
         add   %rcx, %rsi
@@ -63,7 +62,6 @@ restore_all_guest:
         rep movsq
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
         write_cr3 rax, rdi, rsi
-.Lrag_keep_cr3:
 
         RESTORE_ALL
         testw $TRAP_syscall,4(%rsp)
-- 
2.13.6



* [PATCH RFC v2 04/12] x86: revert 5784de3e2067ed73efc2fe42e62831e8ae7f46c4
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

Revert patch "x86: Meltdown band-aid against malicious 64-bit PV
guests" in order to prepare for a final Meltdown mitigation.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/domain.c              |   5 -
 xen/arch/x86/mm.c                  |  21 ----
 xen/arch/x86/smpboot.c             | 200 -------------------------------------
 xen/arch/x86/x86_64/asm-offsets.c  |   2 -
 xen/arch/x86/x86_64/compat/entry.S |  11 --
 xen/arch/x86/x86_64/entry.S        | 149 +--------------------------
 xen/include/asm-x86/asm_defns.h    |  30 ------
 xen/include/asm-x86/current.h      |  12 ---
 xen/include/asm-x86/processor.h    |   1 -
 xen/include/asm-x86/x86_64/page.h  |   5 +-
 10 files changed, 6 insertions(+), 430 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 8589d856be..da1bf1a97b 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1511,9 +1511,6 @@ void paravirt_ctxt_switch_to(struct vcpu *v)
 {
     unsigned long cr4;
 
-    this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
-
     cr4 = pv_guest_cr4_to_real_cr4(v);
     if ( unlikely(cr4 != read_cr4()) )
         write_cr4(cr4);
@@ -1685,8 +1682,6 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
     ASSERT(local_irq_is_enabled());
 
-    get_cpu_info()->xen_cr3 = 0;
-
     cpumask_copy(&dirty_mask, next->vcpu_dirty_cpumask);
     /* Allow at most one CPU at a time to be dirty. */
     ASSERT(cpumask_weight(&dirty_mask) <= 1);
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c83f5224c1..74cdb6e14d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3489,7 +3489,6 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN;
-    bool sync_guest = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -3653,8 +3652,6 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-                    if ( !rc )
-                        sync_guest = true;
                     break;
 
                 case PGT_writable_page:
@@ -3759,24 +3756,6 @@ long do_mmu_update(
     if ( va )
         unmap_domain_page(va);
 
-    if ( sync_guest )
-    {
-        /*
-         * Force other vCPU-s of the affected guest to pick up L4 entry
-         * changes (if any). Issue a flush IPI with empty operation mask to
-         * facilitate this (including ourselves waiting for the IPI to
-         * actually have arrived). Utilize the fact that FLUSH_VA_VALID is
-         * meaningless without FLUSH_CACHE, but will allow to pass the no-op
-         * check in flush_area_mask().
-         */
-        unsigned int cpu = smp_processor_id();
-        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
-
-        cpumask_andnot(mask, pt_owner->domain_dirty_cpumask, cpumask_of(cpu));
-        if ( !cpumask_empty(mask) )
-            flush_area_mask(mask, ZERO_BLOCK_PTR, FLUSH_VA_VALID);
-    }
-
     perfc_add(num_page_updates, i);
 
  out:
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 37a7e59760..eebc4e8528 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -328,9 +328,6 @@ void start_secondary(void *unused)
      */
     spin_debug_disable();
 
-    get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
-
     load_system_tables();
 
     /* Full exception support from here on in. */
@@ -640,187 +637,6 @@ void cpu_exit_clear(unsigned int cpu)
     set_cpu_state(CPU_STATE_DEAD);
 }
 
-static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
-{
-    unsigned long linear = (unsigned long)ptr, pfn;
-    unsigned int flags;
-    l3_pgentry_t *pl3e = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]) +
-                         l3_table_offset(linear);
-    l2_pgentry_t *pl2e;
-    l1_pgentry_t *pl1e;
-
-    if ( linear < DIRECTMAP_VIRT_START )
-        return 0;
-
-    flags = l3e_get_flags(*pl3e);
-    ASSERT(flags & _PAGE_PRESENT);
-    if ( flags & _PAGE_PSE )
-    {
-        pfn = (l3e_get_pfn(*pl3e) & ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
-              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));
-        flags &= ~_PAGE_PSE;
-    }
-    else
-    {
-        pl2e = l3e_to_l2e(*pl3e) + l2_table_offset(linear);
-        flags = l2e_get_flags(*pl2e);
-        ASSERT(flags & _PAGE_PRESENT);
-        if ( flags & _PAGE_PSE )
-        {
-            pfn = (l2e_get_pfn(*pl2e) & ~((1UL << PAGETABLE_ORDER) - 1)) |
-                  (PFN_DOWN(linear) & ((1UL << PAGETABLE_ORDER) - 1));
-            flags &= ~_PAGE_PSE;
-        }
-        else
-        {
-            pl1e = l2e_to_l1e(*pl2e) + l1_table_offset(linear);
-            flags = l1e_get_flags(*pl1e);
-            if ( !(flags & _PAGE_PRESENT) )
-                return 0;
-            pfn = l1e_get_pfn(*pl1e);
-        }
-    }
-
-    if ( !(root_get_flags(rpt[root_table_offset(linear)]) & _PAGE_PRESENT) )
-    {
-        pl3e = alloc_xen_pagetable();
-        if ( !pl3e )
-            return -ENOMEM;
-        clear_page(pl3e);
-        l4e_write(&rpt[root_table_offset(linear)],
-                  l4e_from_paddr(__pa(pl3e), __PAGE_HYPERVISOR));
-    }
-    else
-        pl3e = l4e_to_l3e(rpt[root_table_offset(linear)]);
-
-    pl3e += l3_table_offset(linear);
-
-    if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
-    {
-        pl2e = alloc_xen_pagetable();
-        if ( !pl2e )
-            return -ENOMEM;
-        clear_page(pl2e);
-        l3e_write(pl3e, l3e_from_paddr(__pa(pl2e), __PAGE_HYPERVISOR));
-    }
-    else
-    {
-        ASSERT(!(l3e_get_flags(*pl3e) & _PAGE_PSE));
-        pl2e = l3e_to_l2e(*pl3e);
-    }
-
-    pl2e += l2_table_offset(linear);
-
-    if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
-    {
-        pl1e = alloc_xen_pagetable();
-        if ( !pl1e )
-            return -ENOMEM;
-        clear_page(pl1e);
-        l2e_write(pl2e, l2e_from_paddr(__pa(pl1e), __PAGE_HYPERVISOR));
-    }
-    else
-    {
-        ASSERT(!(l2e_get_flags(*pl2e) & _PAGE_PSE));
-        pl1e = l2e_to_l1e(*pl2e);
-    }
-
-    pl1e += l1_table_offset(linear);
-
-    if ( l1e_get_flags(*pl1e) & _PAGE_PRESENT )
-    {
-        ASSERT(l1e_get_pfn(*pl1e) == pfn);
-        ASSERT(l1e_get_flags(*pl1e) == flags);
-    }
-    else
-        l1e_write(pl1e, l1e_from_pfn(pfn, flags));
-
-    return 0;
-}
-
-DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
-
-static int setup_cpu_root_pgt(unsigned int cpu)
-{
-    root_pgentry_t *rpt = alloc_xen_pagetable();
-    unsigned int off;
-    int rc;
-
-    if ( !rpt )
-        return -ENOMEM;
-
-    clear_page(rpt);
-    per_cpu(root_pgt, cpu) = rpt;
-
-    rpt[root_table_offset(RO_MPT_VIRT_START)] =
-        idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
-    /* SH_LINEAR_PT inserted together with guest mappings. */
-    /* PERDOMAIN inserted during context switch. */
-    rpt[root_table_offset(XEN_VIRT_START)] =
-        idle_pg_table[root_table_offset(XEN_VIRT_START)];
-
-    /* Install direct map page table entries for stack, IDT, and TSS. */
-    for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
-        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
-
-    if ( !rc )
-        rc = clone_mapping(idt_tables[cpu], rpt);
-    if ( !rc )
-        rc = clone_mapping(&per_cpu(init_tss, cpu), rpt);
-
-    return rc;
-}
-
-static void cleanup_cpu_root_pgt(unsigned int cpu)
-{
-    root_pgentry_t *rpt = per_cpu(root_pgt, cpu);
-    unsigned int r;
-
-    if ( !rpt )
-        return;
-
-    per_cpu(root_pgt, cpu) = NULL;
-
-    for ( r = root_table_offset(DIRECTMAP_VIRT_START);
-          r < root_table_offset(HYPERVISOR_VIRT_END); ++r )
-    {
-        l3_pgentry_t *l3t;
-        unsigned int i3;
-
-        if ( !(root_get_flags(rpt[r]) & _PAGE_PRESENT) )
-            continue;
-
-        l3t = l4e_to_l3e(rpt[r]);
-
-        for ( i3 = 0; i3 < L3_PAGETABLE_ENTRIES; ++i3 )
-        {
-            l2_pgentry_t *l2t;
-            unsigned int i2;
-
-            if ( !(l3e_get_flags(l3t[i3]) & _PAGE_PRESENT) )
-                continue;
-
-            ASSERT(!(l3e_get_flags(l3t[i3]) & _PAGE_PSE));
-            l2t = l3e_to_l2e(l3t[i3]);
-
-            for ( i2 = 0; i2 < L2_PAGETABLE_ENTRIES; ++i2 )
-            {
-                if ( !(l2e_get_flags(l2t[i2]) & _PAGE_PRESENT) )
-                    continue;
-
-                ASSERT(!(l2e_get_flags(l2t[i2]) & _PAGE_PSE));
-                free_xen_pagetable(l2e_to_l1e(l2t[i2]));
-            }
-
-            free_xen_pagetable(l2t);
-        }
-
-        free_xen_pagetable(l3t);
-    }
-
-    free_xen_pagetable(rpt);
-}
-
 static void cpu_smpboot_free(unsigned int cpu)
 {
     unsigned int order, socket = cpu_to_socket(cpu);
@@ -859,8 +675,6 @@ static void cpu_smpboot_free(unsigned int cpu)
             free_domheap_page(mfn_to_page(mfn));
     }
 
-    cleanup_cpu_root_pgt(cpu);
-
     order = get_order_from_pages(NR_RESERVED_GDT_PAGES);
     free_xenheap_pages(per_cpu(gdt_table, cpu), order);
 
@@ -915,11 +729,6 @@ static int cpu_smpboot_alloc(unsigned int cpu)
     memcpy(idt_tables[cpu], idt_table, IDT_ENTRIES * sizeof(idt_entry_t));
     disable_each_ist(idt_tables[cpu]);
 
-    rc = setup_cpu_root_pgt(cpu);
-    if ( rc )
-        goto out;
-    rc = -ENOMEM;
-
     for ( stub_page = 0, i = cpu & ~(STUBS_PER_PAGE - 1);
           i < nr_cpu_ids && i <= (cpu | (STUBS_PER_PAGE - 1)); ++i )
         if ( cpu_online(i) && cpu_to_node(i) == node )
@@ -979,8 +788,6 @@ static struct notifier_block cpu_smpboot_nfb = {
 
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
-    int rc;
-
     register_cpu_notifier(&cpu_smpboot_nfb);
 
     mtrr_aps_sync_begin();
@@ -994,11 +801,6 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 
     stack_base[0] = stack_start;
 
-    rc = setup_cpu_root_pgt(0);
-    if ( rc )
-        panic("Error %d setting up PV root page table\n", rc);
-    get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
-
     set_nr_sockets();
 
     socket_cpumask = xzalloc_array(cpumask_t *, nr_sockets);
@@ -1067,8 +869,6 @@ void __init smp_prepare_boot_cpu(void)
 #if NR_CPUS > 2 * BITS_PER_LONG
     per_cpu(scratch_cpumask, cpu) = &scratch_cpu0mask;
 #endif
-
-    get_cpu_info()->xen_cr3 = 0;
 }
 
 static void
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index b1a4310974..e136af6b99 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -137,8 +137,6 @@ void __dummy__(void)
     OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
-    OFFSET(CPUINFO_xen_cr3, struct cpu_info, xen_cr3);
-    OFFSET(CPUINFO_pv_cr3, struct cpu_info, pv_cr3);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index e668f00c36..3fea54ee9d 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -199,17 +199,6 @@ ENTRY(cstar_enter)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
-
-        GET_STACK_END(bx)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
-        jz    .Lcstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
-        write_cr3 rcx, rdi, rsi
-        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-.Lcstar_cr3_okay:
-
         GET_CURRENT(bx)
         movq  VCPU_domain(%rbx),%rcx
         cmpb  $0,DOMAIN_is_32bit_pv(%rcx)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index f753eb4c02..cbd73f6c22 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -37,32 +37,6 @@ ENTRY(switch_to_kernel)
 /* %rbx: struct vcpu, interrupts disabled */
 restore_all_guest:
         ASSERT_INTERRUPTS_DISABLED
-
-        /* Copy guest mappings and switch to per-CPU root page table. */
-        mov   %cr3, %r9
-        GET_STACK_END(dx)
-        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
-        movabs $PADDR_MASK & PAGE_MASK, %rsi
-        movabs $DIRECTMAP_VIRT_START, %rcx
-        mov   %rdi, %rax
-        and   %rsi, %rdi
-        and   %r9, %rsi
-        add   %rcx, %rdi
-        add   %rcx, %rsi
-        mov   $ROOT_PAGETABLE_FIRST_XEN_SLOT, %ecx
-        mov   root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rsi), %r8
-        mov   %r8, root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rdi)
-        rep movsq
-        mov   $ROOT_PAGETABLE_ENTRIES - \
-               ROOT_PAGETABLE_LAST_XEN_SLOT - 1, %ecx
-        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
-                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rsi
-        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
-                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
-        rep movsq
-        mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
-        write_cr3 rax, rdi, rsi
-
         RESTORE_ALL
         testw $TRAP_syscall,4(%rsp)
         jz    iret_exit_to_guest
@@ -97,22 +71,6 @@ iret_exit_to_guest:
         ALIGN
 /* No special register assumptions. */
 restore_all_xen:
-        /*
-         * Check whether we need to switch to the per-CPU page tables, in
-         * case we return to late PV exit code (from an NMI or #MC).
-         */
-        GET_STACK_END(ax)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
-        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
-        test  %rdx, %rdx
-        /*
-         * Ideally the condition would be "nsz", but such doesn't exist,
-         * so "g" will have to do.
-         */
-UNLIKELY_START(g, exit_cr3)
-        write_cr3 rax, rdi, rsi
-UNLIKELY_END(exit_cr3)
-
         RESTORE_ALL adj=8
         iretq
 
@@ -142,18 +100,7 @@ ENTRY(lstar_enter)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
-
-        GET_STACK_END(bx)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
-        jz    .Llstar_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
-        write_cr3 rcx, rdi, rsi
-        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-.Llstar_cr3_okay:
-
-        __GET_CURRENT(bx)
+        GET_CURRENT(bx)
         testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
         jz    switch_to_kernel
 
@@ -245,18 +192,7 @@ GLOBAL(sysenter_eflags_saved)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
-
-        GET_STACK_END(bx)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
-        jz    .Lsyse_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
-        write_cr3 rcx, rdi, rsi
-        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-.Lsyse_cr3_okay:
-
-        __GET_CURRENT(bx)
+        GET_CURRENT(bx)
         cmpb  $0,VCPU_sysenter_disables_events(%rbx)
         movq  VCPU_sysenter_addr(%rbx),%rax
         setne %cl
@@ -292,23 +228,13 @@ ENTRY(int80_direct_trap)
         movl  $0x80, 4(%rsp)
         SAVE_ALL
 
-        GET_STACK_END(bx)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
-        neg   %rcx
-        jz    .Lint80_cr3_okay
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-        neg   %rcx
-        write_cr3 rcx, rdi, rsi
-        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
-.Lint80_cr3_okay:
-
         cmpb  $0,untrusted_msi(%rip)
 UNLIKELY_START(ne, msi_check)
         movl  $0x80,%edi
         call  check_for_unexpected_msi
 UNLIKELY_END(msi_check)
 
-        __GET_CURRENT(bx)
+        GET_CURRENT(bx)
 
         /* Check that the callback is non-null. */
         leaq  VCPU_int80_bounce(%rbx),%rdx
@@ -465,27 +391,9 @@ ENTRY(dom_crash_sync_extable)
 
 ENTRY(common_interrupt)
         SAVE_ALL CLAC
-
-        GET_STACK_END(14)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
-        mov   %rcx, %r15
-        neg   %rcx
-        jz    .Lintr_cr3_okay
-        jns   .Lintr_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lintr_cr3_load:
-        write_cr3 rcx, rdi, rsi
-        xor   %ecx, %ecx
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        testb $3, UREGS_cs(%rsp)
-        cmovnz %rcx, %r15
-.Lintr_cr3_okay:
-
         CR4_PV32_RESTORE
         movq %rsp,%rdi
         callq do_IRQ
-        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         jmp ret_from_intr
 
 /* No special register assumptions. */
@@ -503,23 +411,6 @@ ENTRY(page_fault)
 /* No special register assumptions. */
 GLOBAL(handle_exception)
         SAVE_ALL CLAC
-
-        GET_STACK_END(14)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
-        mov   %rcx, %r15
-        neg   %rcx
-        jz    .Lxcpt_cr3_okay
-        jns   .Lxcpt_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.Lxcpt_cr3_load:
-        write_cr3 rcx, rdi, rsi
-        xor   %ecx, %ecx
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        testb $3, UREGS_cs(%rsp)
-        cmovnz %rcx, %r15
-.Lxcpt_cr3_okay:
-
 handle_exception_saved:
         GET_CURRENT(bx)
         testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rsp)
@@ -585,7 +476,6 @@ handle_exception_saved:
         PERFC_INCR(exceptions, %rax, %rbx)
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
-        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3,UREGS_cs(%rsp)
         jz    restore_all_xen
         leaq  VCPU_trap_bounce(%rbx),%rdx
@@ -618,7 +508,6 @@ exception_with_ints_disabled:
         rep;  movsq                     # make room for ec/ev
 1:      movq  UREGS_error_code(%rsp),%rax # ec/ev
         movq  %rax,UREGS_kernel_sizeof(%rsp)
-        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         jmp   restore_all_xen           # return to fixup code
 
 /* No special register assumptions. */
@@ -697,17 +586,6 @@ ENTRY(double_fault)
         movl  $TRAP_double_fault,4(%rsp)
         /* Set AC to reduce chance of further SMAP faults */
         SAVE_ALL STAC
-
-        GET_STACK_END(bx)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
-        test  %rbx, %rbx
-        jz    .Ldblf_cr3_okay
-        jns   .Ldblf_cr3_load
-        neg   %rbx
-.Ldblf_cr3_load:
-        write_cr3 rbx, rdi, rsi
-.Ldblf_cr3_okay:
-
         movq  %rsp,%rdi
         call  do_double_fault
         BUG   /* do_double_fault() shouldn't return. */
@@ -726,28 +604,10 @@ ENTRY(nmi)
         movl  $TRAP_nmi,4(%rsp)
 handle_ist_exception:
         SAVE_ALL CLAC
-
-        GET_STACK_END(14)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
-        mov   %rcx, %r15
-        neg   %rcx
-        jz    .List_cr3_okay
-        jns   .List_cr3_load
-        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-        neg   %rcx
-.List_cr3_load:
-        write_cr3 rcx, rdi, rsi
-        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
-.List_cr3_okay:
-
         CR4_PV32_RESTORE
         testb $3,UREGS_cs(%rsp)
         jz    1f
-        /*
-         * Interrupted guest context. Clear the restore value for xen_cr3
-         * and copy the context to stack bottom.
-         */
-        xor   %r15, %r15
+        /* Interrupted guest context. Copy the context to stack bottom. */
         GET_CPUINFO_FIELD(guest_cpu_user_regs,di)
         movq  %rsp,%rsi
         movl  $UREGS_kernel_sizeof/8,%ecx
@@ -758,7 +618,6 @@ handle_ist_exception:
         leaq  exception_table(%rip),%rdx
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
-        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         cmpb  $TRAP_nmi,UREGS_entry_vector(%rsp)
         jne   ret_from_intr
 
diff --git a/xen/include/asm-x86/asm_defns.h b/xen/include/asm-x86/asm_defns.h
index d2d91ca1fa..ae9fef7450 100644
--- a/xen/include/asm-x86/asm_defns.h
+++ b/xen/include/asm-x86/asm_defns.h
@@ -101,30 +101,9 @@ void ret_from_intr(void);
         UNLIKELY_DONE(mp, tag);   \
         __UNLIKELY_END(tag)
 
-        .equ .Lrax, 0
-        .equ .Lrcx, 1
-        .equ .Lrdx, 2
-        .equ .Lrbx, 3
-        .equ .Lrsp, 4
-        .equ .Lrbp, 5
-        .equ .Lrsi, 6
-        .equ .Lrdi, 7
-        .equ .Lr8,  8
-        .equ .Lr9,  9
-        .equ .Lr10, 10
-        .equ .Lr11, 11
-        .equ .Lr12, 12
-        .equ .Lr13, 13
-        .equ .Lr14, 14
-        .equ .Lr15, 15
-
 #define STACK_CPUINFO_FIELD(field) (1 - CPUINFO_sizeof + CPUINFO_##field)
 #define GET_STACK_END(reg)                        \
-        .if .Lr##reg > 8;                         \
-        movq $STACK_SIZE-1, %r##reg;              \
-        .else;                                    \
         movl $STACK_SIZE-1, %e##reg;              \
-        .endif;                                   \
         orq  %rsp, %r##reg
 
 #define GET_CPUINFO_FIELD(field, reg)             \
@@ -206,15 +185,6 @@ void ret_from_intr(void);
 #define ASM_STAC ASM_AC(STAC)
 #define ASM_CLAC ASM_AC(CLAC)
 
-.macro write_cr3 val:req, tmp1:req, tmp2:req
-        mov   %cr4, %\tmp1
-        mov   %\tmp1, %\tmp2
-        and   $~X86_CR4_PGE, %\tmp1
-        mov   %\tmp1, %cr4
-        mov   %\val, %cr3
-        mov   %\tmp2, %cr4
-.endm
-
 #define CR4_PV32_RESTORE                                           \
         667: ASM_NOP5;                                             \
         .pushsection .altinstr_replacement, "ax";                  \
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index b929c48c85..89849929eb 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -41,18 +41,6 @@ struct cpu_info {
     struct vcpu *current_vcpu;
     unsigned long per_cpu_offset;
     unsigned long cr4;
-    /*
-     * Of the two following fields the latter is being set to the CR3 value
-     * to be used on the given pCPU for loading whenever 64-bit PV guest
-     * context is being entered. The value never changes once set.
-     * The former is the value to restore when re-entering Xen, if any. IOW
-     * its value being zero means there's nothing to restore. However, its
-     * value can also be negative, indicating to the exit-to-Xen code that
-     * restoring is not necessary, but allowing any nested entry code paths
-     * to still know the value to put back into CR3.
-     */
-    unsigned long xen_cr3;
-    unsigned long pv_cr3;
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index e8c2f02e99..78e17a46fa 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -437,7 +437,6 @@ extern idt_entry_t idt_table[];
 extern idt_entry_t *idt_tables[];
 
 DECLARE_PER_CPU(struct tss_struct, init_tss);
-DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
 extern void init_int80_direct_trap(struct vcpu *v);
 
diff --git a/xen/include/asm-x86/x86_64/page.h b/xen/include/asm-x86/x86_64/page.h
index 05a0334893..6fb7cd5553 100644
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -24,8 +24,8 @@
 /* These are architectural limits. Current CPUs support only 40-bit phys. */
 #define PADDR_BITS              52
 #define VADDR_BITS              48
-#define PADDR_MASK              ((_AC(1,UL) << PADDR_BITS) - 1)
-#define VADDR_MASK              ((_AC(1,UL) << VADDR_BITS) - 1)
+#define PADDR_MASK              ((1UL << PADDR_BITS)-1)
+#define VADDR_MASK              ((1UL << VADDR_BITS)-1)
 
 #define VADDR_TOP_BIT           (1UL << (VADDR_BITS - 1))
 #define CANONICAL_MASK          (~0UL & ~VADDR_MASK)
@@ -107,7 +107,6 @@ typedef l4_pgentry_t root_pgentry_t;
       : (((_s) < ROOT_PAGETABLE_FIRST_XEN_SLOT) ||  \
          ((_s) > ROOT_PAGETABLE_LAST_XEN_SLOT)))
 
-#define root_table_offset         l4_table_offset
 #define root_get_pfn              l4e_get_pfn
 #define root_get_flags            l4e_get_flags
 #define root_get_intpte           l4e_get_intpte
-- 
2.13.6



* [PATCH RFC v2 05/12] x86: don't access saved user regs via rsp in trap handlers
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

In order to support switching stacks when entering the hypervisor for
page table isolation, don't access the saved user registers via %rsp,
but via %rdi.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/x86_64/compat/entry.S |  82 +++++++++++++----------
 xen/arch/x86/x86_64/entry.S        | 129 +++++++++++++++++++++++--------------
 xen/include/asm-x86/current.h      |  10 ++-
 3 files changed, 134 insertions(+), 87 deletions(-)

diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 3fea54ee9d..abf3fcae48 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -18,14 +18,14 @@ ENTRY(entry_int82)
         pushq $0
         movl  $HYPERCALL_VECTOR, 4(%rsp)
         SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
+        mov   %rsp, %rdi
         CR4_PV32_RESTORE
 
         GET_CURRENT(bx)
 
-        mov   %rsp, %rdi
         call  do_entry_int82
 
-/* %rbx: struct vcpu */
+/* %rbx: struct vcpu, %rdi: user_regs */
 ENTRY(compat_test_all_events)
         ASSERT_NOT_IN_ATOMIC
         cli                             # tests must not race interrupts
@@ -58,20 +58,24 @@ compat_test_guest_events:
         jmp   compat_test_all_events
 
         ALIGN
-/* %rbx: struct vcpu */
+/* %rbx: struct vcpu, %rdi: user_regs */
 compat_process_softirqs:
         sti
+        pushq %rdi
         call  do_softirq
+        popq  %rdi
         jmp   compat_test_all_events
 
 	ALIGN
-/* %rbx: struct vcpu */
+/* %rbx: struct vcpu, %rdi: user_regs */
 compat_process_mce:
         testb $1 << VCPU_TRAP_MCE,VCPU_async_exception_mask(%rbx)
         jnz   .Lcompat_test_guest_nmi
         sti
         movb $0,VCPU_mce_pending(%rbx)
+        pushq %rdi
         call set_guest_machinecheck_trapbounce
+        popq  %rdi
         testl %eax,%eax
         jz    compat_test_all_events
         movzbl VCPU_async_exception_mask(%rbx),%edx # save mask for the
@@ -81,13 +85,15 @@ compat_process_mce:
         jmp   compat_process_trap
 
 	ALIGN
-/* %rbx: struct vcpu */
+/* %rbx: struct vcpu, %rdi: user_regs */
 compat_process_nmi:
         testb $1 << VCPU_TRAP_NMI,VCPU_async_exception_mask(%rbx)
         jnz  compat_test_guest_events
         sti
         movb  $0,VCPU_nmi_pending(%rbx)
+        pushq %rdi
         call  set_guest_nmi_trapbounce
+        popq  %rdi
         testl %eax,%eax
         jz    compat_test_all_events
         movzbl VCPU_async_exception_mask(%rbx),%edx # save mask for the
@@ -178,7 +184,7 @@ ENTRY(cr4_pv32_restore)
         xor   %eax, %eax
         ret
 
-/* %rdx: trap_bounce, %rbx: struct vcpu */
+/* %rdx: trap_bounce, %rbx: struct vcpu, %rdi: user_regs */
 ENTRY(compat_post_handle_exception)
         testb $TBF_EXCEPTION,TRAPBOUNCE_flags(%rdx)
         jz    compat_test_all_events
@@ -199,6 +205,7 @@ ENTRY(cstar_enter)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
+        movq  %rsp, %rdi
         GET_CURRENT(bx)
         movq  VCPU_domain(%rbx),%rcx
         cmpb  $0,DOMAIN_is_32bit_pv(%rcx)
@@ -211,13 +218,15 @@ ENTRY(cstar_enter)
         testl $~3,%esi
         leal  (,%rcx,TBF_INTERRUPT),%ecx
 UNLIKELY_START(z, compat_syscall_gpf)
-        movq  VCPU_trap_ctxt(%rbx),%rdi
-        movl  $TRAP_gp_fault,UREGS_entry_vector(%rsp)
-        subl  $2,UREGS_rip(%rsp)
+        pushq %rcx
+        movq  VCPU_trap_ctxt(%rbx),%rcx
+        movl  $TRAP_gp_fault,UREGS_entry_vector(%rdi)
+        subl  $2,UREGS_rip(%rdi)
         movl  $0,TRAPBOUNCE_error_code(%rdx)
-        movl  TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_eip(%rdi),%eax
-        movzwl TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_cs(%rdi),%esi
-        testb $4,TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_flags(%rdi)
+        movl  TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_eip(%rcx),%eax
+        movzwl TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_cs(%rcx),%esi
+        testb $4,TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_flags(%rcx)
+        popq  %rcx
         setnz %cl
         leal  TBF_EXCEPTION|TBF_EXCEPTION_ERRCODE(,%rcx,TBF_INTERRUPT),%ecx
 UNLIKELY_END(compat_syscall_gpf)
@@ -229,12 +238,12 @@ UNLIKELY_END(compat_syscall_gpf)
 ENTRY(compat_sysenter)
         CR4_PV32_RESTORE
         movq  VCPU_trap_ctxt(%rbx),%rcx
-        cmpb  $TRAP_gp_fault,UREGS_entry_vector(%rsp)
+        cmpb  $TRAP_gp_fault,UREGS_entry_vector(%rdi)
         movzwl VCPU_sysenter_sel(%rbx),%eax
         movzwl TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_cs(%rcx),%ecx
         cmovel %ecx,%eax
         testl $~3,%eax
-        movl  $FLAT_COMPAT_USER_SS,UREGS_ss(%rsp)
+        movl  $FLAT_COMPAT_USER_SS,UREGS_ss(%rdi)
         cmovzl %ecx,%eax
         movw  %ax,TRAPBOUNCE_cs(%rdx)
         call  compat_create_bounce_frame
@@ -247,26 +256,27 @@ ENTRY(compat_int80_direct_trap)
 
 /* CREATE A BASIC EXCEPTION FRAME ON GUEST OS (RING-1) STACK:            */
 /*   {[ERRCODE,] EIP, CS, EFLAGS, [ESP, SS]}                             */
-/* %rdx: trap_bounce, %rbx: struct vcpu                                  */
-/* On return only %rbx and %rdx are guaranteed non-clobbered.            */
+/* %rdx: trap_bounce, %rbx: struct vcpu, %rdi: user_regs                 */
+/* On return only %rbx, %rdi and %rdx are guaranteed non-clobbered.      */
 compat_create_bounce_frame:
         ASSERT_INTERRUPTS_ENABLED
-        mov   %fs,%edi
+        mov   %fs,%ecx
+        pushq %rcx
         ASM_STAC
-        testb $2,UREGS_cs+8(%rsp)
+        testb $2,UREGS_cs(%rdi)
         jz    1f
         /* Push new frame at registered guest-OS stack base. */
         movl  VCPU_kernel_sp(%rbx),%esi
 .Lft1:  mov   VCPU_kernel_ss(%rbx),%fs
         subl  $2*4,%esi
-        movl  UREGS_rsp+8(%rsp),%eax
+        movl  UREGS_rsp(%rdi),%eax
 .Lft2:  movl  %eax,%fs:(%rsi)
-        movl  UREGS_ss+8(%rsp),%eax
+        movl  UREGS_ss(%rdi),%eax
 .Lft3:  movl  %eax,%fs:4(%rsi)
         jmp   2f
 1:      /* In kernel context already: push new frame at existing %rsp. */
-        movl  UREGS_rsp+8(%rsp),%esi
-.Lft4:  mov   UREGS_ss+8(%rsp),%fs
+        movl  UREGS_rsp(%rdi),%esi
+.Lft4:  mov   UREGS_ss(%rdi),%fs
 2:
         movq  VCPU_domain(%rbx),%r8
         subl  $3*4,%esi
@@ -277,12 +287,12 @@ compat_create_bounce_frame:
         orb   %ch,COMPAT_VCPUINFO_upcall_mask(%rax)
         popq  %rax
         shll  $16,%eax                  # Bits 16-23: saved_upcall_mask
-        movw  UREGS_cs+8(%rsp),%ax      # Bits  0-15: CS
+        movw  UREGS_cs(%rdi),%ax        # Bits  0-15: CS
 .Lft5:  movl  %eax,%fs:4(%rsi)          # CS / saved_upcall_mask
         shrl  $16,%eax
         testb %al,%al                   # Bits 0-7: saved_upcall_mask
         setz  %ch                       # %ch == !saved_upcall_mask
-        movl  UREGS_eflags+8(%rsp),%eax
+        movl  UREGS_eflags(%rdi),%eax
         andl  $~(X86_EFLAGS_IF|X86_EFLAGS_IOPL),%eax
         addb  %ch,%ch                   # Bit 9 (EFLAGS.IF)
         orb   %ch,%ah                   # Fold EFLAGS.IF into %eax
@@ -291,7 +301,7 @@ compat_create_bounce_frame:
         cmovnzl VCPU_iopl(%rbx),%ecx    # Bits 13:12 (EFLAGS.IOPL)
         orl   %ecx,%eax                 # Fold EFLAGS.IOPL into %eax
 .Lft6:  movl  %eax,%fs:2*4(%rsi)        # EFLAGS
-        movl  UREGS_rip+8(%rsp),%eax
+        movl  UREGS_rip(%rdi),%eax
 .Lft7:  movl  %eax,%fs:(%rsi)           # EIP
         testb $TBF_EXCEPTION_ERRCODE,TRAPBOUNCE_flags(%rdx)
         jz    1f
@@ -303,10 +313,11 @@ compat_create_bounce_frame:
         /* Rewrite our stack frame and return to guest-OS mode. */
         /* IA32 Ref. Vol. 3: TF, VM, RF and NT flags are cleared on trap. */
         andl  $~(X86_EFLAGS_VM|X86_EFLAGS_RF|\
-                 X86_EFLAGS_NT|X86_EFLAGS_TF),UREGS_eflags+8(%rsp)
-        mov   %fs,UREGS_ss+8(%rsp)
-        movl  %esi,UREGS_rsp+8(%rsp)
-.Lft13: mov   %edi,%fs
+                 X86_EFLAGS_NT|X86_EFLAGS_TF),UREGS_eflags(%rdi)
+        mov   %fs,UREGS_ss(%rdi)
+        movl  %esi,UREGS_rsp(%rdi)
+.Lft13: popq  %rax
+        mov   %eax,%fs
         movzwl TRAPBOUNCE_cs(%rdx),%eax
         /* Null selectors (0-3) are not allowed. */
         testl $~3,%eax
@@ -314,13 +325,14 @@ UNLIKELY_START(z, compat_bounce_null_selector)
         lea   UNLIKELY_DISPATCH_LABEL(compat_bounce_null_selector)(%rip), %rdi
         jmp   asm_domain_crash_synchronous  /* Does not return */
 __UNLIKELY_END(compat_bounce_null_selector)
-        movl  %eax,UREGS_cs+8(%rsp)
+        movl  %eax,UREGS_cs(%rdi)
         movl  TRAPBOUNCE_eip(%rdx),%eax
-        movl  %eax,UREGS_rip+8(%rsp)
+        movl  %eax,UREGS_rip(%rdi)
         ret
 .section .fixup,"ax"
 .Lfx13:
-        xorl  %edi,%edi
+        popq  %rax
+        pushq $0
         jmp   .Lft13
 .previous
         _ASM_EXTABLE(.Lft1,  dom_crash_sync_extable)
@@ -338,14 +350,16 @@ compat_crash_page_fault_8:
 compat_crash_page_fault_4:
         addl  $4,%esi
 compat_crash_page_fault:
-.Lft14: mov   %edi,%fs
+.Lft14: popq  %rax
+        mov   %eax,%fs
         ASM_CLAC
         movl  %esi,%edi
         call  show_page_walk
         jmp   dom_crash_sync_extable
 .section .fixup,"ax"
 .Lfx14:
-        xorl  %edi,%edi
+        popq  %rax
+        pushq $0
         jmp   .Lft14
 .previous
         _ASM_EXTABLE(.Lft14, .Lfx14)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index cbd73f6c22..f7412b87c2 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -14,13 +14,13 @@
 #include <public/xen.h>
 #include <irq_vectors.h>
 
-/* %rbx: struct vcpu */
+/* %rbx: struct vcpu, %rdi: user_regs */
 ENTRY(switch_to_kernel)
         leaq  VCPU_trap_bounce(%rbx),%rdx
         /* TB_eip = (32-bit syscall && syscall32_addr) ?
          *          syscall32_addr : syscall_addr */
         xor   %eax,%eax
-        cmpw  $FLAT_USER_CS32,UREGS_cs(%rsp)
+        cmpw  $FLAT_USER_CS32,UREGS_cs(%rdi)
         cmoveq VCPU_syscall32_addr(%rbx),%rax
         testq %rax,%rax
         cmovzq VCPU_syscall_addr(%rbx),%rax
@@ -31,7 +31,7 @@ ENTRY(switch_to_kernel)
         leal  (,%rcx,TBF_INTERRUPT),%ecx
         movb  %cl,TRAPBOUNCE_flags(%rdx)
         call  create_bounce_frame
-        andl  $~X86_EFLAGS_DF,UREGS_eflags(%rsp)
+        andl  $~X86_EFLAGS_DF,UREGS_eflags(%rdi)
         jmp   test_all_events
 
 /* %rbx: struct vcpu, interrupts disabled */
@@ -100,14 +100,16 @@ ENTRY(lstar_enter)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
+        mov   %rsp, %rdi
         GET_CURRENT(bx)
         testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
         jz    switch_to_kernel
 
-        mov   %rsp, %rdi
+        push  %rdi
         call  pv_hypercall
+        pop   %rdi
 
-/* %rbx: struct vcpu */
+/* %rbx: struct vcpu, %rdi: user_regs */
 test_all_events:
         ASSERT_NOT_IN_ATOMIC
         cli                             # tests must not race interrupts
@@ -138,20 +140,24 @@ test_guest_events:
         jmp   test_all_events
 
         ALIGN
-/* %rbx: struct vcpu */
+/* %rbx: struct vcpu, %rdi: user_regs */
 process_softirqs:
         sti       
+        pushq %rdi
         call do_softirq
+        popq  %rdi
         jmp  test_all_events
 
         ALIGN
-/* %rbx: struct vcpu */
+/* %rbx: struct vcpu, %rdi: user_regs */
 process_mce:
         testb $1 << VCPU_TRAP_MCE,VCPU_async_exception_mask(%rbx)
         jnz  .Ltest_guest_nmi
         sti
         movb $0,VCPU_mce_pending(%rbx)
+        push %rdi
         call set_guest_machinecheck_trapbounce
+        pop  %rdi
         test %eax,%eax
         jz   test_all_events
         movzbl VCPU_async_exception_mask(%rbx),%edx # save mask for the
@@ -167,7 +173,9 @@ process_nmi:
         jnz  test_guest_events
         sti
         movb $0,VCPU_nmi_pending(%rbx)
+        push %rdi
         call set_guest_nmi_trapbounce
+        pop  %rdi
         test %eax,%eax
         jz   test_all_events
         movzbl VCPU_async_exception_mask(%rbx),%edx # save mask for the
@@ -192,11 +200,12 @@ GLOBAL(sysenter_eflags_saved)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
+        movq  %rsp, %rdi
         GET_CURRENT(bx)
         cmpb  $0,VCPU_sysenter_disables_events(%rbx)
         movq  VCPU_sysenter_addr(%rbx),%rax
         setne %cl
-        testl $X86_EFLAGS_NT,UREGS_eflags(%rsp)
+        testl $X86_EFLAGS_NT,UREGS_eflags(%rdi)
         leaq  VCPU_trap_bounce(%rbx),%rdx
 UNLIKELY_START(nz, sysenter_nt_set)
         pushfq
@@ -208,17 +217,17 @@ UNLIKELY_END(sysenter_nt_set)
         leal  (,%rcx,TBF_INTERRUPT),%ecx
 UNLIKELY_START(z, sysenter_gpf)
         movq  VCPU_trap_ctxt(%rbx),%rsi
-        movl  $TRAP_gp_fault,UREGS_entry_vector(%rsp)
+        movl  $TRAP_gp_fault,UREGS_entry_vector(%rdi)
         movl  %eax,TRAPBOUNCE_error_code(%rdx)
         movq  TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_eip(%rsi),%rax
         testb $4,TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_flags(%rsi)
         setnz %cl
         leal  TBF_EXCEPTION|TBF_EXCEPTION_ERRCODE(,%rcx,TBF_INTERRUPT),%ecx
 UNLIKELY_END(sysenter_gpf)
-        movq  VCPU_domain(%rbx),%rdi
+        movq  VCPU_domain(%rbx),%rsi
         movq  %rax,TRAPBOUNCE_eip(%rdx)
         movb  %cl,TRAPBOUNCE_flags(%rdx)
-        testb $1,DOMAIN_is_32bit_pv(%rdi)
+        testb $1,DOMAIN_is_32bit_pv(%rsi)
         jnz   compat_sysenter
         jmp   .Lbounce_exception
 
@@ -227,11 +236,14 @@ ENTRY(int80_direct_trap)
         pushq $0
         movl  $0x80, 4(%rsp)
         SAVE_ALL
+        mov   %rsp, %rdi
 
         cmpb  $0,untrusted_msi(%rip)
 UNLIKELY_START(ne, msi_check)
+        pushq %rdi
         movl  $0x80,%edi
         call  check_for_unexpected_msi
+        popq  %rdi
 UNLIKELY_END(msi_check)
 
         GET_CURRENT(bx)
@@ -253,30 +265,32 @@ int80_slow_path:
          * Setup entry vector and error code as if this was a GPF caused by an
          * IDT entry with DPL==0.
          */
-        movl  $((0x80 << 3) | X86_XEC_IDT),UREGS_error_code(%rsp)
-        movl  $TRAP_gp_fault,UREGS_entry_vector(%rsp)
+        movl  $((0x80 << 3) | X86_XEC_IDT),UREGS_error_code(%rdi)
+        movl  $TRAP_gp_fault,UREGS_entry_vector(%rdi)
         /* A GPF wouldn't have incremented the instruction pointer. */
-        subq  $2,UREGS_rip(%rsp)
+        subq  $2,UREGS_rip(%rdi)
         jmp   handle_exception_saved
 
 /* CREATE A BASIC EXCEPTION FRAME ON GUEST OS STACK:                     */
 /*   { RCX, R11, [ERRCODE,] RIP, CS, RFLAGS, RSP, SS }                   */
-/* %rdx: trap_bounce, %rbx: struct vcpu                                  */
-/* On return only %rbx and %rdx are guaranteed non-clobbered.            */
+/* %rdx: trap_bounce, %rbx: struct vcpu, %rdi: user_regs                 */
+/* On return only %rdi, %rbx and %rdx are guaranteed non-clobbered.      */
 create_bounce_frame:
         ASSERT_INTERRUPTS_ENABLED
         testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
         jnz   1f
         /* Push new frame at registered guest-OS stack base. */
         pushq %rdx
+        pushq %rdi
         movq  %rbx,%rdi
         call  toggle_guest_mode
+        popq  %rdi
         popq  %rdx
         movq  VCPU_kernel_sp(%rbx),%rsi
         jmp   2f
 1:      /* In kernel context already: push new frame at existing %rsp. */
-        movq  UREGS_rsp+8(%rsp),%rsi
-        andb  $0xfc,UREGS_cs+8(%rsp)    # Indicate kernel context to guest.
+        movq  UREGS_rsp(%rdi),%rsi
+        andb  $0xfc,UREGS_cs(%rdi)      # Indicate kernel context to guest.
 2:      andq  $~0xf,%rsi                # Stack frames are 16-byte aligned.
         movq  $HYPERVISOR_VIRT_START+1,%rax
         cmpq  %rax,%rsi
@@ -294,11 +308,10 @@ __UNLIKELY_END(create_bounce_frame_bad_sp)
         _ASM_EXTABLE(0b, domain_crash_page_fault_ ## n ## x8)
 
         subq  $7*8,%rsi
-        movq  UREGS_ss+8(%rsp),%rax
+        movq  UREGS_ss(%rdi),%rax
         ASM_STAC
-        movq  VCPU_domain(%rbx),%rdi
         STORE_GUEST_STACK(rax,6)        # SS
-        movq  UREGS_rsp+8(%rsp),%rax
+        movq  UREGS_rsp(%rdi),%rax
         STORE_GUEST_STACK(rax,5)        # RSP
         movq  VCPU_vcpu_info(%rbx),%rax
         pushq VCPUINFO_upcall_mask(%rax)
@@ -307,21 +320,24 @@ __UNLIKELY_END(create_bounce_frame_bad_sp)
         orb   %ch,VCPUINFO_upcall_mask(%rax)
         popq  %rax
         shlq  $32,%rax                  # Bits 32-39: saved_upcall_mask
-        movw  UREGS_cs+8(%rsp),%ax      # Bits  0-15: CS
+        movw  UREGS_cs(%rdi),%ax        # Bits  0-15: CS
         STORE_GUEST_STACK(rax,3)        # CS / saved_upcall_mask
         shrq  $32,%rax
         testb $0xFF,%al                 # Bits 0-7: saved_upcall_mask
         setz  %ch                       # %ch == !saved_upcall_mask
-        movl  UREGS_eflags+8(%rsp),%eax
+        movl  UREGS_eflags(%rdi),%eax
+        pushq %rdi
+        movq  VCPU_domain(%rbx),%rdi
         andl  $~(X86_EFLAGS_IF|X86_EFLAGS_IOPL),%eax
         addb  %ch,%ch                   # Bit 9 (EFLAGS.IF)
         orb   %ch,%ah                   # Fold EFLAGS.IF into %eax
         xorl  %ecx,%ecx                 # if ( VM_ASSIST(v->domain, architectural_iopl) )
         testb $1 << VMASST_TYPE_architectural_iopl,DOMAIN_vm_assist(%rdi)
+        popq  %rdi
         cmovnzl VCPU_iopl(%rbx),%ecx    # Bits 13:12 (EFLAGS.IOPL)
         orl   %ecx,%eax                 # Fold EFLAGS.IOPL into %eax
         STORE_GUEST_STACK(rax,4)        # RFLAGS
-        movq  UREGS_rip+8(%rsp),%rax
+        movq  UREGS_rip(%rdi),%rax
         STORE_GUEST_STACK(rax,2)        # RIP
         testb $TBF_EXCEPTION_ERRCODE,TRAPBOUNCE_flags(%rdx)
         jz    1f
@@ -329,9 +345,9 @@ __UNLIKELY_END(create_bounce_frame_bad_sp)
         movl  TRAPBOUNCE_error_code(%rdx),%eax
         STORE_GUEST_STACK(rax,2)        # ERROR CODE
 1:
-        movq  UREGS_r11+8(%rsp),%rax
+        movq  UREGS_r11(%rdi),%rax
         STORE_GUEST_STACK(rax,1)        # R11
-        movq  UREGS_rcx+8(%rsp),%rax
+        movq  UREGS_rcx(%rdi),%rax
         STORE_GUEST_STACK(rax,0)        # RCX
         ASM_CLAC
 
@@ -340,19 +356,19 @@ __UNLIKELY_END(create_bounce_frame_bad_sp)
         /* Rewrite our stack frame and return to guest-OS mode. */
         /* IA32 Ref. Vol. 3: TF, VM, RF and NT flags are cleared on trap. */
         /* Also clear AC: alignment checks shouldn't trigger in kernel mode. */
-        orl   $TRAP_syscall,UREGS_entry_vector+8(%rsp)
+        orl   $TRAP_syscall,UREGS_entry_vector(%rdi)
         andl  $~(X86_EFLAGS_AC|X86_EFLAGS_VM|X86_EFLAGS_RF|\
-                 X86_EFLAGS_NT|X86_EFLAGS_TF),UREGS_eflags+8(%rsp)
-        movq  $FLAT_KERNEL_SS,UREGS_ss+8(%rsp)
-        movq  %rsi,UREGS_rsp+8(%rsp)
-        movq  $FLAT_KERNEL_CS,UREGS_cs+8(%rsp)
+                 X86_EFLAGS_NT|X86_EFLAGS_TF),UREGS_eflags(%rdi)
+        movq  $FLAT_KERNEL_SS,UREGS_ss(%rdi)
+        movq  %rsi,UREGS_rsp(%rdi)
+        movq  $FLAT_KERNEL_CS,UREGS_cs(%rdi)
         movq  TRAPBOUNCE_eip(%rdx),%rax
         testq %rax,%rax
 UNLIKELY_START(z, create_bounce_frame_bad_bounce_ip)
         lea   UNLIKELY_DISPATCH_LABEL(create_bounce_frame_bad_bounce_ip)(%rip), %rdi
         jmp   asm_domain_crash_synchronous  /* Does not return */
 __UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
-        movq  %rax,UREGS_rip+8(%rsp)
+        movq  %rax,UREGS_rip(%rdi)
         ret
 
         .pushsection .fixup, "ax", @progbits
@@ -391,15 +407,17 @@ ENTRY(dom_crash_sync_extable)
 
 ENTRY(common_interrupt)
         SAVE_ALL CLAC
-        CR4_PV32_RESTORE
         movq %rsp,%rdi
+        CR4_PV32_RESTORE
+        pushq %rdi
         callq do_IRQ
+        popq  %rdi
         jmp ret_from_intr
 
 /* No special register assumptions. */
 ENTRY(ret_from_intr)
         GET_CURRENT(bx)
-        testb $3,UREGS_cs(%rsp)
+        testb $3,UREGS_cs(%rdi)
         jz    restore_all_xen
         movq  VCPU_domain(%rbx),%rax
         testb $1,DOMAIN_is_32bit_pv(%rax)
@@ -411,9 +429,10 @@ ENTRY(page_fault)
 /* No special register assumptions. */
 GLOBAL(handle_exception)
         SAVE_ALL CLAC
+        movq  %rsp, %rdi
 handle_exception_saved:
         GET_CURRENT(bx)
-        testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rsp)
+        testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rdi)
         jz    exception_with_ints_disabled
 
 .Lcr4_pv32_orig:
@@ -434,7 +453,7 @@ handle_exception_saved:
                              (.Lcr4_pv32_alt_end - .Lcr4_pv32_alt)
         .popsection
 
-        testb $3,UREGS_cs(%rsp)
+        testb $3,UREGS_cs(%rdi)
         jz    .Lcr4_pv32_done
         cmpb  $0,DOMAIN_is_32bit_pv(%rax)
         je    .Lcr4_pv32_done
@@ -463,20 +482,21 @@ handle_exception_saved:
          *     goto compat_test_all_events;
          */
         mov   $PFEC_page_present,%al
-        cmpb  $TRAP_page_fault,UREGS_entry_vector(%rsp)
+        cmpb  $TRAP_page_fault,UREGS_entry_vector(%rdi)
         jne   .Lcr4_pv32_done
-        xor   UREGS_error_code(%rsp),%eax
+        xor   UREGS_error_code(%rdi),%eax
         test  $~(PFEC_write_access|PFEC_insn_fetch),%eax
         jz    compat_test_all_events
 .Lcr4_pv32_done:
         sti
-1:      movq  %rsp,%rdi
-        movzbl UREGS_entry_vector(%rsp),%eax
+1:      movzbl UREGS_entry_vector(%rdi),%eax
         leaq  exception_table(%rip),%rdx
         PERFC_INCR(exceptions, %rax, %rbx)
+        pushq %rdi
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
-        testb $3,UREGS_cs(%rsp)
+        popq  %rdi
+        testb $3,UREGS_cs(%rdi)
         jz    restore_all_xen
         leaq  VCPU_trap_bounce(%rbx),%rdx
         movq  VCPU_domain(%rbx),%rax
@@ -491,10 +511,11 @@ handle_exception_saved:
 
 /* No special register assumptions. */
 exception_with_ints_disabled:
-        testb $3,UREGS_cs(%rsp)         # interrupts disabled outside Xen?
+        testb $3,UREGS_cs(%rdi)         # interrupts disabled outside Xen?
         jnz   FATAL_exception_with_ints_disabled
-        movq  %rsp,%rdi
+        /* %rsp == %rdi here! */
         call  search_pre_exception_table
+        movq  %rsp,%rdi
         testq %rax,%rax                 # no fixup code for faulting EIP?
         jz    1b
         movq  %rax,UREGS_rip(%rsp)
@@ -513,7 +534,6 @@ exception_with_ints_disabled:
 /* No special register assumptions. */
 FATAL_exception_with_ints_disabled:
         xorl  %esi,%esi
-        movq  %rsp,%rdi
         call  fatal_trap
         BUG   /* fatal_trap() shouldn't return. */
 
@@ -604,25 +624,32 @@ ENTRY(nmi)
         movl  $TRAP_nmi,4(%rsp)
 handle_ist_exception:
         SAVE_ALL CLAC
+        movq  %rsp, %rdi
         CR4_PV32_RESTORE
-        testb $3,UREGS_cs(%rsp)
+        movq  %rdi,%rdx
+        movq  %rdi,%rbx
+        subq  %rsp,%rbx
+        testb $3,UREGS_cs(%rdi)
         jz    1f
         /* Interrupted guest context. Copy the context to stack bottom. */
         GET_CPUINFO_FIELD(guest_cpu_user_regs,di)
-        movq  %rsp,%rsi
+        addq  %rbx,%rdi
+        movq  %rdx,%rsi
         movl  $UREGS_kernel_sizeof/8,%ecx
         movq  %rdi,%rsp
         rep   movsq
-1:      movq  %rsp,%rdi
-        movzbl UREGS_entry_vector(%rsp),%eax
+        movq  %rdx,%rdi
+1:      movzbl UREGS_entry_vector(%rdi),%eax
         leaq  exception_table(%rip),%rdx
+        pushq %rdi
         mov   (%rdx, %rax, 8), %rdx
         INDIRECT_CALL %rdx
-        cmpb  $TRAP_nmi,UREGS_entry_vector(%rsp)
+        popq  %rdi
+        cmpb  $TRAP_nmi,UREGS_entry_vector(%rdi)
         jne   ret_from_intr
 
         /* We want to get straight to the IRET on the NMI exit path. */
-        testb $3,UREGS_cs(%rsp)
+        testb $3,UREGS_cs(%rdi)
         jz    restore_all_xen
         GET_CURRENT(bx)
         /* Send an IPI to ourselves to cover for the lack of event checking. */
@@ -631,8 +658,10 @@ handle_ist_exception:
         leaq  irq_stat+IRQSTAT_softirq_pending(%rip),%rcx
         cmpl  $0,(%rcx,%rax,1)
         je    1f
+        pushq %rdi
         movl  $EVENT_CHECK_VECTOR,%edi
         call  send_IPI_self
+        popq  %rdi
 1:      movq  VCPU_domain(%rbx),%rax
         cmpb  $0,DOMAIN_is_32bit_pv(%rax)
         je    restore_all_guest
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 89849929eb..c7acbb97da 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -95,9 +95,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
     ({                                                                  \
         __asm__ __volatile__ (                                          \
             "mov %0,%%"__OP"sp;"                                        \
-            CHECK_FOR_LIVEPATCH_WORK                                      \
-             "jmp %c1"                                                  \
-            : : "r" (guest_cpu_user_regs()), "i" (__fn) : "memory" );   \
+            "mov %1,%%"__OP"di;"                                        \
+            "pushq %%"__OP"di;"                                         \
+            CHECK_FOR_LIVEPATCH_WORK                                    \
+            "popq %%"__OP"di;"                                          \
+            "jmp %c2"                                                   \
+            : : "r" (get_cpu_info()), "r" (guest_cpu_user_regs()),      \
+                "i" (__fn) : "memory" );                                \
         unreachable();                                                  \
     })
 
-- 
2.13.6



* [PATCH RFC v2 06/12] x86: add a xpti command line parameter
  2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
                   ` (4 preceding siblings ...)
  2018-01-22 12:32 ` [PATCH RFC v2 05/12] x86: don't access saved user regs via rsp in trap handlers Juergen Gross
@ 2018-01-22 12:32 ` Juergen Gross
  2018-01-30 15:39   ` Jan Beulich
       [not found]   ` <5A709FDF02000078001A3C2C@suse.com>
  2018-01-22 12:32 ` [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn Juergen Gross
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

Add a command line parameter for controlling Xen page table isolation
(XPTI): by default it is enabled for 64-bit PV domains on non-AMD
systems.

Possible settings are:
- true: switched on even on AMD systems
- false: switched off for all
- nodom0: switched off for dom0
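
As a usage sketch (hypothetical command lines, not part of this patch),
the parameter is given on the Xen command line like any other option:

  xpti=false      # no page table isolation at all
  xpti=nodom0     # isolation for domUs, but not for dom0
  xpti=true       # isolation even on AMD hardware
  xpti=default    # revert to the built-in default selection

Being registered as a runtime parameter it can also be adjusted after
boot (e.g. via "xl set-parameters", assuming that interface is
available on the toolstack in use).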

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 docs/misc/xen-command-line.markdown | 18 ++++++++++++
 xen/arch/x86/pv/domain.c            | 55 +++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h        |  2 ++
 3 files changed, 75 insertions(+)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index f5214defbb..90202a5cc9 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1911,6 +1911,24 @@ In the case that x2apic is in use, this option switches between physical and
 clustered mode.  The default, given no hint from the **FADT**, is cluster
 mode.
 
+### xpti
+> `= nodom0 | default | <boolean>`
+
+> Default: `false` on AMD hardware, `true` everywhere else.
+
+> Can be modified at runtime
+
+Override default selection of whether to isolate 64-bit PV guest page
+tables.
+
+`true` activates page table isolation even on AMD hardware.
+
+`false` deactivates page table isolation on all systems.
+
+`nodom0` deactivates page table isolation for dom0.
+
+`default` switches back to the default selection.
+
 ### xsave
 > `= <boolean>`
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 74e9e667d2..7d50f9bc19 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -6,6 +6,7 @@
 
 #include <xen/domain_page.h>
 #include <xen/errno.h>
+#include <xen/init.h>
 #include <xen/lib.h>
 #include <xen/sched.h>
 
@@ -17,6 +18,40 @@
 #undef page_to_mfn
 #define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
 
+static __read_mostly enum {
+    XPTI_DEFAULT,
+    XPTI_ON,
+    XPTI_OFF,
+    XPTI_NODOM0
+} opt_xpti = XPTI_DEFAULT;
+
+static int parse_xpti(const char *s)
+{
+    int rc = 0;
+
+    switch ( parse_bool(s, NULL) )
+    {
+    case 0:
+        opt_xpti = XPTI_OFF;
+        break;
+    case 1:
+        opt_xpti = XPTI_ON;
+        break;
+    default:
+        if ( !strcmp(s, "default") )
+            opt_xpti = XPTI_DEFAULT;
+        else if ( !strcmp(s, "nodom0") )
+            opt_xpti = XPTI_NODOM0;
+        else
+            rc = -EINVAL;
+        break;
+    }
+
+    return rc;
+}
+
+custom_runtime_param("xpti", parse_xpti);
+
 static void noreturn continue_nonidle_domain(struct vcpu *v)
 {
     check_wakeup_from_wait();
@@ -76,6 +111,8 @@ int switch_compat(struct domain *d)
             goto undo_and_fail;
     }
 
+    d->arch.pv_domain.xpti = false;
+
     domain_set_alloc_bitsize(d);
     recalculate_cpuid_policy(d);
 
@@ -212,6 +249,24 @@ int pv_domain_initialise(struct domain *d, unsigned int domcr_flags,
     /* 64-bit PV guest by default. */
     d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
 
+    switch (opt_xpti)
+    {
+    case XPTI_OFF:
+        d->arch.pv_domain.xpti = false;
+        break;
+    case XPTI_ON:
+        d->arch.pv_domain.xpti = true;
+        break;
+    case XPTI_NODOM0:
+        d->arch.pv_domain.xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
+                                 d->domain_id != 0 &&
+                                 d->domain_id != hardware_domid;
+        break;
+    case XPTI_DEFAULT:
+        d->arch.pv_domain.xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
+        break;
+    }
+
     return 0;
 
   fail:
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 4679d5477d..f1230ac621 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -257,6 +257,8 @@ struct pv_domain
     struct mapcache_domain mapcache;
 
     struct cpuidmasks *cpuidmasks;
+
+    bool xpti;
 };
 
 struct monitor_write_data {
-- 
2.13.6



* [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn
  2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
                   ` (5 preceding siblings ...)
  2018-01-22 12:32 ` [PATCH RFC v2 06/12] x86: add a xpti command line parameter Juergen Gross
@ 2018-01-22 12:32 ` Juergen Gross
  2018-01-29 17:06   ` Jan Beulich
                     ` (2 more replies)
  2018-01-22 12:32 ` [PATCH RFC v2 08/12] xen/x86: use dedicated function for tss initialization Juergen Gross
                   ` (7 subsequent siblings)
  14 siblings, 3 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

For support of per-vcpu stacks we need per-vcpu trampolines. To be
able to put those into the per-domain mappings, the upper-level page
tables of the per-domain area must not have the NX bit set.

In order to be able to reset the NX bit for a per-domain mapping, add
a helper flipflags_perdomain_mapping() for flipping the page table
flags of a specific mapped page.

To be able to use a page from the Xen heap for the last per-vcpu
stack page, add a helper to map an arbitrary mfn in the per-domain
area.
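
As a rough usage sketch (function and parameter names here are invented
for illustration; the real consumer is the per-vcpu stack setup added
later in this series), a caller inside the hypervisor could combine the
two new helpers like this:

  /* Sketch only: 'va' lies in the per-domain area and has been
   * populated via create_perdomain_mapping() with domheap pages. */
  static void perdomain_example(struct domain *d, unsigned long va,
                                void *xenheap_page)
  {
      /* Flip NX (and RW/DIRTY) on the page already mapped at 'va',
       * e.g. to make a trampoline placed there executable. */
      flipflags_perdomain_mapping(d, va, _PAGE_NX | _PAGE_RW | _PAGE_DIRTY);

      /* Replace the mapping at 'va + PAGE_SIZE' with an arbitrary Xen
       * heap page; the previously mapped domheap page gets freed. */
      addmfn_to_perdomain_mapping(d, va + PAGE_SIZE,
                                  _mfn(virt_to_mfn(xenheap_page)));
  }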

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/mm.c        | 81 ++++++++++++++++++++++++++++++++++++++++++++++--
 xen/include/asm-x86/mm.h |  3 ++
 2 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 74cdb6e14d..ab990cc667 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1568,7 +1568,7 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
 
     /* Slot 260: Per-domain mappings (if applicable). */
     l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW)
+        d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR)
           : l4e_empty();
 
     /* Slot 261-: text/data/bss, RW M2P, vmap, frametable, directmap. */
@@ -5269,7 +5269,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
         }
         l2tab = __map_domain_page(pg);
         clear_page(l2tab);
-        l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR_RW);
+        l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR);
     }
     else
         l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]);
@@ -5311,7 +5311,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
                 l1tab = __map_domain_page(pg);
             }
             clear_page(l1tab);
-            *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR_RW);
+            *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR);
         }
         else if ( !l1tab )
             l1tab = map_l1t_from_l2e(*pl2e);
@@ -5401,6 +5401,81 @@ void destroy_perdomain_mapping(struct domain *d, unsigned long va,
     unmap_domain_page(l3tab);
 }
 
+void flipflags_perdomain_mapping(struct domain *d, unsigned long va,
+                                 unsigned int flags)
+{
+    const l3_pgentry_t *l3tab, *pl3e;
+
+    ASSERT(va >= PERDOMAIN_VIRT_START &&
+           va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
+
+    if ( !d->arch.perdomain_l3_pg )
+        return;
+
+    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+    pl3e = l3tab + l3_table_offset(va);
+
+    if ( l3e_get_flags(*pl3e) & _PAGE_PRESENT )
+    {
+        const l2_pgentry_t *l2tab = map_l2t_from_l3e(*pl3e);
+        const l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
+
+        if ( l2e_get_flags(*pl2e) & _PAGE_PRESENT )
+        {
+            l1_pgentry_t *l1tab = map_l1t_from_l2e(*pl2e);
+            unsigned int off = l1_table_offset(va);
+
+            if ( (l1e_get_flags(l1tab[off]) & (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
+                 (_PAGE_PRESENT | _PAGE_AVAIL0) )
+                l1e_flip_flags(l1tab[off], flags);
+
+            unmap_domain_page(l1tab);
+        }
+
+        unmap_domain_page(l2tab);
+    }
+
+    unmap_domain_page(l3tab);
+}
+
+void addmfn_to_perdomain_mapping(struct domain *d, unsigned long va, mfn_t mfn)
+{
+    const l3_pgentry_t *l3tab, *pl3e;
+
+    ASSERT(va >= PERDOMAIN_VIRT_START &&
+           va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
+
+    if ( !d->arch.perdomain_l3_pg )
+        return;
+
+    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+    pl3e = l3tab + l3_table_offset(va);
+
+    if ( l3e_get_flags(*pl3e) & _PAGE_PRESENT )
+    {
+        const l2_pgentry_t *l2tab = map_l2t_from_l3e(*pl3e);
+        const l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
+
+        if ( l2e_get_flags(*pl2e) & _PAGE_PRESENT )
+        {
+            l1_pgentry_t *l1tab = map_l1t_from_l2e(*pl2e);
+            unsigned int off = l1_table_offset(va);
+
+            if ( (l1e_get_flags(l1tab[off]) & (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
+                 (_PAGE_PRESENT | _PAGE_AVAIL0) )
+                free_domheap_page(l1e_get_page(l1tab[off]));
+
+            l1tab[off] = l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW);
+
+            unmap_domain_page(l1tab);
+        }
+
+        unmap_domain_page(l2tab);
+    }
+
+    unmap_domain_page(l3tab);
+}
+
 void free_perdomain_mappings(struct domain *d)
 {
     l3_pgentry_t *l3tab;
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 3013c266fe..fa158bd96a 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -582,6 +582,9 @@ int create_perdomain_mapping(struct domain *, unsigned long va,
                              struct page_info **);
 void destroy_perdomain_mapping(struct domain *, unsigned long va,
                                unsigned int nr);
+void flipflags_perdomain_mapping(struct domain *d, unsigned long va,
+                                 unsigned int flags);
+void addmfn_to_perdomain_mapping(struct domain *d, unsigned long va, mfn_t mfn);
 void free_perdomain_mappings(struct domain *);
 
 extern int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm);
-- 
2.13.6



* [PATCH RFC v2 08/12] xen/x86: use dedicated function for tss initialization
  2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
                   ` (6 preceding siblings ...)
  2018-01-22 12:32 ` [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn Juergen Gross
@ 2018-01-22 12:32 ` Juergen Gross
  2018-01-22 12:32 ` [PATCH RFC v2 09/12] x86: enhance syscall stub to work in per-domain mapping Juergen Gross
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

Carve out the TSS initialization from load_system_tables().

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/cpu/common.c    | 56 ++++++++++++++++++++++++--------------------
 xen/include/asm-x86/system.h |  1 +
 2 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 4306e59650..f9ec05c3ee 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -702,6 +702,35 @@ void __init early_cpu_init(void)
 	early_cpu_detect();
 }
 
+void tss_init(struct tss_struct *tss, unsigned long stack_bottom)
+{
+	unsigned long stack_top = stack_bottom & ~(STACK_SIZE - 1);
+
+	*tss = (struct tss_struct){
+		/* Main stack for interrupts/exceptions. */
+		.rsp0 = stack_bottom,
+
+		/* Ring 1 and 2 stacks poisoned. */
+		.rsp1 = 0x8600111111111111ul,
+		.rsp2 = 0x8600111111111111ul,
+
+		/*
+		 * MCE, NMI and Double Fault handlers get their own stacks.
+		 * All others poisoned.
+		 */
+		.ist = {
+			[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
+			[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE,
+			[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE,
+
+			[IST_MAX ... ARRAY_SIZE(tss->ist) - 1] =
+				0x8600111111111111ul,
+		},
+
+		.bitmap = IOBMP_INVALID_OFFSET,
+	};
+}
+
 /*
  * Sets up system tables and descriptors.
  *
@@ -713,8 +742,7 @@ void __init early_cpu_init(void)
 void load_system_tables(void)
 {
 	unsigned int cpu = smp_processor_id();
-	unsigned long stack_bottom = get_stack_bottom(),
-		stack_top = stack_bottom & ~(STACK_SIZE - 1);
+	unsigned long stack_bottom = get_stack_bottom();
 
 	struct tss_struct *tss = &this_cpu(init_tss);
 	struct desc_struct *gdt =
@@ -731,29 +759,7 @@ void load_system_tables(void)
 		.limit = (IDT_ENTRIES * sizeof(idt_entry_t)) - 1,
 	};
 
-	*tss = (struct tss_struct){
-		/* Main stack for interrupts/exceptions. */
-		.rsp0 = stack_bottom,
-
-		/* Ring 1 and 2 stacks poisoned. */
-		.rsp1 = 0x8600111111111111ul,
-		.rsp2 = 0x8600111111111111ul,
-
-		/*
-		 * MCE, NMI and Double Fault handlers get their own stacks.
-		 * All others poisoned.
-		 */
-		.ist = {
-			[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
-			[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE,
-			[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE,
-
-			[IST_MAX ... ARRAY_SIZE(tss->ist) - 1] =
-				0x8600111111111111ul,
-		},
-
-		.bitmap = IOBMP_INVALID_OFFSET,
-	};
+	tss_init(tss, stack_bottom);
 
 	_set_tssldt_desc(
 		gdt + TSS_ENTRY,
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 8ac170371b..2cf50d1d49 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -230,6 +230,7 @@ static inline int local_irq_is_enabled(void)
 
 void trap_init(void);
 void init_idt_traps(void);
+void tss_init(struct tss_struct *tss, unsigned long stack_bottom);
 void load_system_tables(void);
 void percpu_traps_init(void);
 void subarch_percpu_traps_init(void);
-- 
2.13.6



* [PATCH RFC v2 09/12] x86: enhance syscall stub to work in per-domain mapping
  2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
                   ` (7 preceding siblings ...)
  2018-01-22 12:32 ` [PATCH RFC v2 08/12] xen/x86: use dedicated function for tss initialization Juergen Gross
@ 2018-01-22 12:32 ` Juergen Gross
  2018-01-30 15:11   ` Jan Beulich
       [not found]   ` <5A70991902000078001A3C16@suse.com>
  2018-01-22 12:32 ` [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries Juergen Gross
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

Use an indirect jump via a register in case the target address isn't
reachable via a 32-bit relative jump.

Add macros for the stub size and use those instead of returning the
size when writing the stub trampoline, in order to support easy
switching between differently sized stubs.
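
The reachability check relies on sign extension: a displacement fits
the 32-bit immediate of a near jmp exactly when bits 63..31 all carry
the same value, i.e. when it lies within [-2^31, 2^31). A standalone
sketch of that test (helper name made up, arithmetic right shift
assumed as in the patch):

  #include <stdbool.h>
  #include <stdint.h>

  /* True if 'disp' can be encoded as the rel32 of a near jmp. */
  static bool fits_rel32(int64_t disp)
  {
      /* Each shift yields either all zeros or all ones; the results
       * agree iff bits 63..32 are copies of bit 31. */
      return (disp >> 31) == (disp >> 63);
  }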

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/x86_64/traps.c  | 47 +++++++++++++++++++++++++-------------------
 xen/include/asm-x86/system.h |  7 +++++++
 2 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index 3652f5ff21..b4836f623c 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -260,10 +260,11 @@ void do_double_fault(struct cpu_user_regs *regs)
     panic("DOUBLE FAULT -- system shutdown");
 }
 
-static unsigned int write_stub_trampoline(
-    unsigned char *stub, unsigned long stub_va,
-    unsigned long stack_bottom, unsigned long target_va)
+void write_stub_trampoline(unsigned char *stub, unsigned long stub_va,
+                           unsigned long stack_bottom, unsigned long target_va)
 {
+    long target_diff;
+
     /* movabsq %rax, stack_bottom - 8 */
     stub[0] = 0x48;
     stub[1] = 0xa3;
@@ -282,24 +283,32 @@ static unsigned int write_stub_trampoline(
     /* pushq %rax */
     stub[23] = 0x50;
 
-    /* jmp target_va */
-    stub[24] = 0xe9;
-    *(int32_t *)&stub[25] = target_va - (stub_va + 29);
-
-    /* Round up to a multiple of 16 bytes. */
-    return 32;
+    target_diff = target_va - (stub_va + 29);
+    if ( target_diff >> 31 == target_diff >> 63 )
+    {
+        /* jmp target_va */
+        stub[24] = 0xe9;
+        *(int32_t *)&stub[25] = target_diff;
+    }
+    else
+    {
+        /* movabs target_va, %rax */
+        stub[24] = 0x48;
+        stub[25] = 0xb8;
+        *(uint64_t *)&stub[26] = target_va;
+        /* jmpq *%rax */
+        stub[34] = 0xff;
+        stub[35] = 0xe0;
+    }
 }
 
 DEFINE_PER_CPU(struct stubs, stubs);
-void lstar_enter(void);
-void cstar_enter(void);
 
 void subarch_percpu_traps_init(void)
 {
     unsigned long stack_bottom = get_stack_bottom();
     unsigned long stub_va = this_cpu(stubs.addr);
     unsigned char *stub_page;
-    unsigned int offset;
 
     /* IST_MAX IST pages + 1 syscall page + 1 guard page + primary stack. */
     BUILD_BUG_ON((IST_MAX + 2) * PAGE_SIZE + PRIMARY_STACK_SIZE > STACK_SIZE);
@@ -312,10 +321,9 @@ void subarch_percpu_traps_init(void)
      * start of the stubs.
      */
     wrmsrl(MSR_LSTAR, stub_va);
-    offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
-                                   stub_va, stack_bottom,
-                                   (unsigned long)lstar_enter);
-    stub_va += offset;
+    write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK), stub_va,
+                          stack_bottom, (unsigned long)lstar_enter);
+    stub_va += STUB_TRAMPOLINE_SIZE_PERCPU;
 
     if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
          boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR )
@@ -328,12 +336,11 @@ void subarch_percpu_traps_init(void)
 
     /* Trampoline for SYSCALL entry from compatibility mode. */
     wrmsrl(MSR_CSTAR, stub_va);
-    offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
-                                    stub_va, stack_bottom,
-                                    (unsigned long)cstar_enter);
+    write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK), stub_va,
+                          stack_bottom, (unsigned long)cstar_enter);
 
     /* Don't consume more than half of the stub space here. */
-    ASSERT(offset <= STUB_BUF_SIZE / 2);
+    ASSERT(2 * STUB_TRAMPOLINE_SIZE_PERCPU <= STUB_BUF_SIZE / 2);
 
     unmap_domain_page(stub_page);
 
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 2cf50d1d49..c5baf7c991 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -231,6 +231,13 @@ static inline int local_irq_is_enabled(void)
 void trap_init(void);
 void init_idt_traps(void);
 void tss_init(struct tss_struct *tss, unsigned long stack_bottom);
+void write_stub_trampoline(unsigned char *stub, unsigned long stub_va,
+                           unsigned long stack_bottom,
+                           unsigned long target_va);
+#define STUB_TRAMPOLINE_SIZE_PERCPU   32
+#define STUB_TRAMPOLINE_SIZE_PERVCPU  64
+void lstar_enter(void);
+void cstar_enter(void);
 void load_system_tables(void);
 void percpu_traps_init(void);
 void subarch_percpu_traps_init(void);
-- 
2.13.6



* [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries
  2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
                   ` (8 preceding siblings ...)
  2018-01-22 12:32 ` [PATCH RFC v2 09/12] x86: enhance syscall stub to work in per-domain mapping Juergen Gross
@ 2018-01-22 12:32 ` Juergen Gross
  2018-01-30 15:40   ` Jan Beulich
       [not found]   ` <5A70A01402000078001A3C30@suse.com>
  2018-01-22 12:32 ` [PATCH RFC v2 11/12] x86: modify interrupt handlers to support stack switching Juergen Gross
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

In case XPTI is active for a PV domain, allocate and initialize
per-vcpu stacks. The stacks are added to the per-domain mappings of
the domain.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/pv/domain.c      | 72 +++++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/config.h  | 13 +++++++-
 xen/include/asm-x86/current.h | 39 ++++++++++++++++++++---
 xen/include/asm-x86/domain.h  |  3 ++
 4 files changed, 121 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 7d50f9bc19..834be96ed8 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -156,6 +156,75 @@ void pv_vcpu_destroy(struct vcpu *v)
     pv_destroy_gdt_ldt_l1tab(v);
     xfree(v->arch.pv_vcpu.trap_ctxt);
     v->arch.pv_vcpu.trap_ctxt = NULL;
+
+    if ( v->domain->arch.pv_domain.xpti )
+    {
+        free_xenheap_page(v->arch.pv_vcpu.stack_regs);
+        v->arch.pv_vcpu.stack_regs = NULL;
+        destroy_perdomain_mapping(v->domain, XPTI_START(v), STACK_PAGES);
+    }
+}
+
+static int pv_vcpu_init_xpti(struct vcpu *v)
+{
+    struct domain *d = v->domain;
+    struct page_info *pg;
+    void *ptr;
+    struct cpu_info *info;
+    unsigned long stack_bottom;
+    int rc;
+
+    /* Populate page tables. */
+    rc = create_perdomain_mapping(d, XPTI_START(v), STACK_PAGES,
+                                  NIL(l1_pgentry_t *), NULL);
+    if ( rc )
+        goto done;
+
+    /* Map stacks. */
+    rc = create_perdomain_mapping(d, XPTI_START(v), IST_MAX,
+                                  NULL, NIL(struct page_info *));
+    if ( rc )
+        goto done;
+
+    ptr = alloc_xenheap_page();
+    if ( !ptr )
+    {
+        rc = -ENOMEM;
+        goto done;
+    }
+    clear_page(ptr);
+    addmfn_to_perdomain_mapping(d, XPTI_START(v) + STACK_SIZE - PAGE_SIZE,
+                                _mfn(virt_to_mfn(ptr)));
+    info = (struct cpu_info *)((unsigned long)ptr + PAGE_SIZE) - 1;
+    info->flags = ON_VCPUSTACK;
+    v->arch.pv_vcpu.stack_regs = &info->guest_cpu_user_regs;
+
+    /* Map TSS. */
+    rc = create_perdomain_mapping(d, XPTI_TSS(v), 1, NULL, &pg);
+    if ( rc )
+        goto done;
+    info = (struct cpu_info *)(XPTI_START(v) + STACK_SIZE) - 1;
+    stack_bottom = (unsigned long)&info->guest_cpu_user_regs.es;
+    ptr = __map_domain_page(pg);
+    tss_init(ptr, stack_bottom);
+    unmap_domain_page(ptr);
+
+    /* Map stub trampolines. */
+    rc = create_perdomain_mapping(d, XPTI_TRAMPOLINE(v), 1, NULL, &pg);
+    if ( rc )
+        goto done;
+    ptr = __map_domain_page(pg);
+    write_stub_trampoline((unsigned char *)ptr, XPTI_TRAMPOLINE(v),
+                          stack_bottom, (unsigned long)lstar_enter);
+    write_stub_trampoline((unsigned char *)ptr + STUB_TRAMPOLINE_SIZE_PERVCPU,
+                          XPTI_TRAMPOLINE(v) + STUB_TRAMPOLINE_SIZE_PERVCPU,
+                          stack_bottom, (unsigned long)cstar_enter);
+    unmap_domain_page(ptr);
+    flipflags_perdomain_mapping(d, XPTI_TRAMPOLINE(v),
+                                _PAGE_NX | _PAGE_RW | _PAGE_DIRTY);
+
+ done:
+    return rc;
 }
 
 int pv_vcpu_initialise(struct vcpu *v)
@@ -195,6 +264,9 @@ int pv_vcpu_initialise(struct vcpu *v)
             goto done;
     }
 
+    if ( d->arch.pv_domain.xpti )
+        rc = pv_vcpu_init_xpti(v);
+
  done:
     if ( rc )
         pv_vcpu_destroy(v);
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 9ef9d03ca7..cb107255af 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -66,6 +66,7 @@
 #endif
 
 #define STACK_ORDER 3
+#define STACK_PAGES (1 << STACK_ORDER)
 #define STACK_SIZE  (PAGE_SIZE << STACK_ORDER)
 
 #define TRAMPOLINE_STACK_SPACE  PAGE_SIZE
@@ -202,7 +203,7 @@ extern unsigned char boot_edid_info[128];
 /* Slot 260: per-domain mappings (including map cache). */
 #define PERDOMAIN_VIRT_START    (PML4_ADDR(260))
 #define PERDOMAIN_SLOT_MBYTES   (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
-#define PERDOMAIN_SLOTS         3
+#define PERDOMAIN_SLOTS         4
 #define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
                                  (PERDOMAIN_SLOT_MBYTES << 20))
 /* Slot 261: machine-to-phys conversion table (256GB). */
@@ -310,6 +311,16 @@ extern unsigned long xen_phys_start;
 #define ARG_XLAT_START(v)        \
     (ARG_XLAT_VIRT_START + ((v)->vcpu_id << ARG_XLAT_VA_SHIFT))
 
+/* Per-vcpu XPTI pages. The fourth per-domain-mapping sub-area. */
+#define XPTI_VIRT_START          PERDOMAIN_VIRT_SLOT(3)
+#define XPTI_VA_SHIFT            (PAGE_SHIFT + STACK_ORDER)
+#define XPTI_TRAMPOLINE_OFF      (IST_MAX << PAGE_SHIFT)
+#define XPTI_TSS_OFF             ((IST_MAX + 2) << PAGE_SHIFT)
+#define XPTI_START(v)            (XPTI_VIRT_START + \
+                                  ((v)->vcpu_id << XPTI_VA_SHIFT))
+#define XPTI_TRAMPOLINE(v)       (XPTI_START(v) + XPTI_TRAMPOLINE_OFF)
+#define XPTI_TSS(v)              (XPTI_START(v) + XPTI_TSS_OFF)
+
 #define NATIVE_VM_ASSIST_VALID   ((1UL << VMASST_TYPE_4gb_segments)        | \
                                   (1UL << VMASST_TYPE_4gb_segments_notify) | \
                                   (1UL << VMASST_TYPE_writable_pagetables) | \
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index c7acbb97da..6ae0931a59 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -12,7 +12,7 @@
 #include <asm/page.h>
 
 /*
- * Xen's cpu stacks are 8 pages (8-page aligned), arranged as:
+ * Xen's physical cpu stacks are 8 pages (8-page aligned), arranged as:
  *
  * 7 - Primary stack (with a struct cpu_info at the top)
  * 6 - Primary stack
@@ -25,6 +25,21 @@
  */
 
 /*
+ * The vcpu stacks used for XPTI are arranged similarly to the physical cpu
+ * stacks with some modifications. The main differences are the primary stack
+ * size (only 1 page) and the usage of the unused mappings for TSS and IDT.
+ *
+ * 7 - Primary stack (with a struct cpu_info at the top)
+ * 6 - unused
+ * 5 - TSS
+ * 4 - unused
+ * 3 - Syscall trampolines
+ * 2 - MCE IST stack
+ * 1 - NMI IST stack
+ * 0 - Double Fault IST stack
+ */
+
+/*
  * Identify which stack page the stack pointer is on.  Returns an index
  * as per the comment above.
  */
@@ -37,10 +52,24 @@ struct vcpu;
 
 struct cpu_info {
     struct cpu_user_regs guest_cpu_user_regs;
-    unsigned int processor_id;
-    struct vcpu *current_vcpu;
-    unsigned long per_cpu_offset;
-    unsigned long cr4;
+    union {
+        /* per physical cpu mapping */
+        struct {
+            struct vcpu *current_vcpu;
+            unsigned long per_cpu_offset;
+            unsigned long cr4;
+        };
+        /* per vcpu mapping (xpti) */
+        struct {
+            unsigned long pad1;
+            unsigned long pad2;
+            unsigned long stack_bottom_cpu;
+        };
+    };
+    unsigned int processor_id;  /* per physical cpu mapping only */
+    unsigned int flags;
+#define ON_VCPUSTACK      0x00000001
+#define VCPUSTACK_ACTIVE  0x00000002
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index f1230ac621..5eb67f4f4c 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -503,6 +503,9 @@ struct pv_vcpu
     /* Deferred VA-based update state. */
     bool_t need_update_runstate_area;
     struct vcpu_time_info pending_system_time;
+
+    /* If XPTI is active: pointer to user regs on stack. */
+    struct cpu_user_regs *stack_regs;
 };
 
 typedef enum __packed {
-- 
2.13.6



* [PATCH RFC v2 11/12] x86: modify interrupt handlers to support stack switching
  2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
                   ` (9 preceding siblings ...)
  2018-01-22 12:32 ` [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries Juergen Gross
@ 2018-01-22 12:32 ` Juergen Gross
  2018-01-30 16:07   ` Jan Beulich
       [not found]   ` <5A70A63D02000078001A3C7C@suse.com>
  2018-01-22 12:32 ` [PATCH RFC v2 12/12] x86: activate per-vcpu stacks in case of xpti Juergen Gross
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

Modify the interrupt handlers to switch stacks on interrupt entry in
case they are running on a per-vcpu stack. The same applies to
returning to the guest: in case the context to be loaded is located on
a per-vcpu stack, switch to that stack before returning to the guest.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/x86_64/asm-offsets.c  |  4 ++++
 xen/arch/x86/x86_64/compat/entry.S |  5 ++++-
 xen/arch/x86/x86_64/entry.S        | 15 +++++++++++++--
 xen/common/wait.c                  |  8 ++++----
 xen/include/asm-x86/asm_defns.h    | 19 +++++++++++++++++++
 xen/include/asm-x86/current.h      | 10 +++++++++-
 6 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index e136af6b99..0da756e7af 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -137,6 +137,10 @@ void __dummy__(void)
     OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
+    OFFSET(CPUINFO_stack_bottom_cpu, struct cpu_info, stack_bottom_cpu);
+    OFFSET(CPUINFO_flags, struct cpu_info, flags);
+    DEFINE(ASM_ON_VCPUSTACK, ON_VCPUSTACK);
+    DEFINE(ASM_VCPUSTACK_ACTIVE, VCPUSTACK_ACTIVE);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index abf3fcae48..b8d74e83db 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -19,6 +19,7 @@ ENTRY(entry_int82)
         movl  $HYPERCALL_VECTOR, 4(%rsp)
         SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
         mov   %rsp, %rdi
+        SWITCH_FROM_VCPU_STACK
         CR4_PV32_RESTORE
 
         GET_CURRENT(bx)
@@ -109,6 +110,7 @@ compat_process_trap:
 /* %rbx: struct vcpu, interrupts disabled */
 ENTRY(compat_restore_all_guest)
         ASSERT_INTERRUPTS_DISABLED
+        SWITCH_TO_VCPU_STACK
         mov   $~(X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_VM),%r11d
         and   UREGS_eflags(%rsp),%r11d
 .Lcr4_orig:
@@ -195,7 +197,6 @@ ENTRY(compat_post_handle_exception)
 
 /* See lstar_enter for entry register state. */
 ENTRY(cstar_enter)
-        sti
         CR4_PV32_RESTORE
         movq  8(%rsp),%rax /* Restore %rax. */
         movq  $FLAT_KERNEL_SS,8(%rsp)
@@ -206,6 +207,8 @@ ENTRY(cstar_enter)
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
         movq  %rsp, %rdi
+        SWITCH_FROM_VCPU_STACK
+        sti
         GET_CURRENT(bx)
         movq  VCPU_domain(%rbx),%rcx
         cmpb  $0,DOMAIN_is_32bit_pv(%rcx)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index f7412b87c2..991a8799a9 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -37,6 +37,7 @@ ENTRY(switch_to_kernel)
 /* %rbx: struct vcpu, interrupts disabled */
 restore_all_guest:
         ASSERT_INTERRUPTS_DISABLED
+        SWITCH_TO_VCPU_STACK
         RESTORE_ALL
         testw $TRAP_syscall,4(%rsp)
         jz    iret_exit_to_guest
@@ -71,6 +72,7 @@ iret_exit_to_guest:
         ALIGN
 /* No special register assumptions. */
 restore_all_xen:
+        SWITCH_TO_VCPU_STACK
         RESTORE_ALL adj=8
         iretq
 
@@ -91,7 +93,6 @@ restore_all_xen:
  * %ss must be saved into the space left by the trampoline.
  */
 ENTRY(lstar_enter)
-        sti
         movq  8(%rsp),%rax /* Restore %rax. */
         movq  $FLAT_KERNEL_SS,8(%rsp)
         pushq %r11
@@ -101,6 +102,8 @@ ENTRY(lstar_enter)
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
         mov   %rsp, %rdi
+        SWITCH_FROM_VCPU_STACK
+        sti
         GET_CURRENT(bx)
         testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
         jz    switch_to_kernel
@@ -189,7 +192,6 @@ process_trap:
         jmp  test_all_events
 
 ENTRY(sysenter_entry)
-        sti
         pushq $FLAT_USER_SS
         pushq $0
         pushfq
@@ -201,6 +203,8 @@ GLOBAL(sysenter_eflags_saved)
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
         movq  %rsp, %rdi
+        SWITCH_FROM_VCPU_STACK
+        sti
         GET_CURRENT(bx)
         cmpb  $0,VCPU_sysenter_disables_events(%rbx)
         movq  VCPU_sysenter_addr(%rbx),%rax
@@ -237,6 +241,7 @@ ENTRY(int80_direct_trap)
         movl  $0x80, 4(%rsp)
         SAVE_ALL
         mov   %rsp, %rdi
+        SWITCH_FROM_VCPU_STACK
 
         cmpb  $0,untrusted_msi(%rip)
 UNLIKELY_START(ne, msi_check)
@@ -408,6 +413,7 @@ ENTRY(dom_crash_sync_extable)
 ENTRY(common_interrupt)
         SAVE_ALL CLAC
         movq %rsp,%rdi
+        SWITCH_FROM_VCPU_STACK
         CR4_PV32_RESTORE
         pushq %rdi
         callq do_IRQ
@@ -430,6 +436,7 @@ ENTRY(page_fault)
 GLOBAL(handle_exception)
         SAVE_ALL CLAC
         movq  %rsp, %rdi
+        SWITCH_FROM_VCPU_STACK
 handle_exception_saved:
         GET_CURRENT(bx)
         testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rdi)
@@ -607,6 +614,7 @@ ENTRY(double_fault)
         /* Set AC to reduce chance of further SMAP faults */
         SAVE_ALL STAC
         movq  %rsp,%rdi
+        SWITCH_FROM_VCPU_STACK_IST
         call  do_double_fault
         BUG   /* do_double_fault() shouldn't return. */
 
@@ -615,7 +623,9 @@ ENTRY(early_page_fault)
         movl  $TRAP_page_fault,4(%rsp)
         SAVE_ALL
         movq  %rsp,%rdi
+        SWITCH_FROM_VCPU_STACK
         call  do_early_page_fault
+        movq  %rsp, %rdi
         jmp   restore_all_xen
         .popsection
 
@@ -625,6 +635,7 @@ ENTRY(nmi)
 handle_ist_exception:
         SAVE_ALL CLAC
         movq  %rsp, %rdi
+        SWITCH_FROM_VCPU_STACK_IST
         CR4_PV32_RESTORE
         movq  %rdi,%rdx
         movq  %rdi,%rbx
diff --git a/xen/common/wait.c b/xen/common/wait.c
index a57bc10d61..fbb5d996e5 100644
--- a/xen/common/wait.c
+++ b/xen/common/wait.c
@@ -122,10 +122,10 @@ void wake_up_all(struct waitqueue_head *wq)
 
 static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
 {
-    struct cpu_info *cpu_info = get_cpu_info();
+    struct cpu_user_regs *user_regs = guest_cpu_user_regs();
     struct vcpu *curr = current;
     unsigned long dummy;
-    u32 entry_vector = cpu_info->guest_cpu_user_regs.entry_vector;
+    u32 entry_vector = user_regs->entry_vector;
 
     ASSERT(wqv->esp == 0);
 
@@ -160,7 +160,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
         "pop %%r11; pop %%r10; pop %%r9;  pop %%r8;"
         "pop %%rbp; pop %%rdx; pop %%rbx; pop %%rax"
         : "=&S" (wqv->esp), "=&c" (dummy), "=&D" (dummy)
-        : "i" (PAGE_SIZE), "0" (0), "1" (cpu_info), "2" (wqv->stack)
+        : "i" (PAGE_SIZE), "0" (0), "1" (user_regs), "2" (wqv->stack)
         : "memory" );
 
     if ( unlikely(wqv->esp == 0) )
@@ -169,7 +169,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
         domain_crash_synchronous();
     }
 
-    cpu_info->guest_cpu_user_regs.entry_vector = entry_vector;
+    user_regs->entry_vector = entry_vector;
 }
 
 static void __finish_wait(struct waitqueue_vcpu *wqv)
diff --git a/xen/include/asm-x86/asm_defns.h b/xen/include/asm-x86/asm_defns.h
index ae9fef7450..e759726a4b 100644
--- a/xen/include/asm-x86/asm_defns.h
+++ b/xen/include/asm-x86/asm_defns.h
@@ -116,6 +116,25 @@ void ret_from_intr(void);
         GET_STACK_END(reg);                       \
         __GET_CURRENT(reg)
 
+#define SWITCH_FROM_VCPU_STACK                                           \
+        GET_STACK_END(ax);                                               \
+        testb $ASM_ON_VCPUSTACK, STACK_CPUINFO_FIELD(flags)(%rax);       \
+        jz    1f;                                                        \
+        movq  STACK_CPUINFO_FIELD(stack_bottom_cpu)(%rax), %rsp;         \
+1:
+
+#define SWITCH_FROM_VCPU_STACK_IST                                       \
+        GET_STACK_END(ax);                                               \
+        testb $ASM_ON_VCPUSTACK, STACK_CPUINFO_FIELD(flags)(%rax);       \
+        jz    1f;                                                        \
+        subq  $(CPUINFO_sizeof - 1), %rax;                               \
+        addq  CPUINFO_stack_bottom_cpu(%rax), %rsp;                      \
+        subq  %rax, %rsp;                                                \
+1:
+
+#define SWITCH_TO_VCPU_STACK                                             \
+        movq  %rdi, %rsp
+
 #ifndef NDEBUG
 #define ASSERT_NOT_IN_ATOMIC                                             \
     sti; /* sometimes called with interrupts disabled: safe to enable */ \
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 6ae0931a59..4b7e9104be 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -9,6 +9,7 @@
 
 #include <xen/percpu.h>
 #include <public/xen.h>
+#include <asm/config.h>
 #include <asm/page.h>
 
 /*
@@ -94,9 +95,16 @@ static inline struct cpu_info *get_cpu_info(void)
 #define set_processor_id(id)  do {                                      \
     struct cpu_info *ci__ = get_cpu_info();                             \
     ci__->per_cpu_offset = __per_cpu_offset[ci__->processor_id = (id)]; \
+    ci__->flags = 0;                                                    \
 } while (0)
 
-#define guest_cpu_user_regs() (&get_cpu_info()->guest_cpu_user_regs)
+#define guest_cpu_user_regs() ({                                        \
+    struct cpu_info *info = get_cpu_info();                             \
+    if ( info->flags & VCPUSTACK_ACTIVE )                               \
+        info = (struct cpu_info *)(XPTI_START(info->current_vcpu) +     \
+                                   STACK_SIZE) - 1;                     \
+    &info->guest_cpu_user_regs;                                         \
+})
 
 /*
  * Get the bottom-of-stack, as stored in the per-CPU TSS. This actually points
-- 
2.13.6



* [PATCH RFC v2 12/12] x86: activate per-vcpu stacks in case of xpti
  2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
                   ` (10 preceding siblings ...)
  2018-01-22 12:32 ` [PATCH RFC v2 11/12] x86: modify interrupt handlers to support stack switching Juergen Gross
@ 2018-01-22 12:32 ` Juergen Gross
  2018-01-30 16:33   ` Jan Beulich
       [not found]   ` <5A70AC7F02000078001A3CA6@suse.com>
  2018-01-22 12:50 ` [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Jan Beulich
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 12:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, dfaggioli, jbeulich

When scheduling a vcpu subject to XPTI, activate the per-vcpu stacks
by loading the vcpu-specific GDT and TSS. When de-scheduling such a
vcpu, switch back to the per-physical-cpu GDT and TSS.

Accessing the user registers on the stack is done via helpers, as the
registers are located either on the per-vcpu stack or on the default
stack, depending on whether XPTI is active.
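
One detail worth calling out (the sketch below merely mirrors the hunks
in this patch, with illustrative parameter names): LTR only accepts an
"available" TSS descriptor and marks it busy, which is why the
descriptor type is reset before each lgdt/ltr pair:

  /* Sketch: re-arm and reload the TSS after switching GDTs. */
  static void reload_tss_sketch(struct desc_struct *tss_desc,
                                const struct desc_ptr *gdt_desc)
  {
      _set_tssldt_type(tss_desc, SYS_DESC_tss_avail); /* clear busy bit */
      lgdt(gdt_desc);
      ltr(TSS_ENTRY << 3);                            /* marks it busy again */
  }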

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/domain.c              | 76 +++++++++++++++++++++++++++++++++++---
 xen/arch/x86/pv/domain.c           | 34 +++++++++++++++--
 xen/include/asm-x86/desc.h         |  5 +++
 xen/include/asm-x86/regs.h         |  2 +
 4 files changed, 107 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index da1bf1a97b..d75234ca35 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1585,9 +1585,28 @@ static inline bool need_full_gdt(const struct domain *d)
     return is_pv_domain(d) && !is_idle_domain(d);
 }
 
+static void copy_user_regs_from_stack(struct vcpu *v)
+{
+    struct cpu_user_regs *stack_regs;
+
+    stack_regs = (is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti)
+                 ? v->arch.pv_vcpu.stack_regs
+                 : &get_cpu_info()->guest_cpu_user_regs;
+    memcpy(&v->arch.user_regs, stack_regs, CTXT_SWITCH_STACK_BYTES);
+}
+
+static void copy_user_regs_to_stack(struct vcpu *v)
+{
+    struct cpu_user_regs *stack_regs;
+
+    stack_regs = (is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti)
+                 ? v->arch.pv_vcpu.stack_regs
+                 : &get_cpu_info()->guest_cpu_user_regs;
+    memcpy(stack_regs, &v->arch.user_regs, CTXT_SWITCH_STACK_BYTES);
+}
+
 static void __context_switch(void)
 {
-    struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
     unsigned int          cpu = smp_processor_id();
     struct vcpu          *p = per_cpu(curr_vcpu, cpu);
     struct vcpu          *n = current;
@@ -1600,7 +1619,7 @@ static void __context_switch(void)
 
     if ( !is_idle_domain(pd) )
     {
-        memcpy(&p->arch.user_regs, stack_regs, CTXT_SWITCH_STACK_BYTES);
+        copy_user_regs_from_stack(p);
         vcpu_save_fpu(p);
         pd->arch.ctxt_switch->from(p);
     }
@@ -1616,7 +1635,7 @@ static void __context_switch(void)
 
     if ( !is_idle_domain(nd) )
     {
-        memcpy(stack_regs, &n->arch.user_regs, CTXT_SWITCH_STACK_BYTES);
+        copy_user_regs_to_stack(n);
         if ( cpu_has_xsave )
         {
             u64 xcr0 = n->arch.xcr0 ?: XSTATE_FP_SSE;
@@ -1635,7 +1654,7 @@ static void __context_switch(void)
 
     gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
                                     per_cpu(compat_gdt_table, cpu);
-    if ( need_full_gdt(nd) )
+    if ( need_full_gdt(nd) && !nd->arch.pv_domain.xpti )
     {
         unsigned long mfn = virt_to_mfn(gdt);
         l1_pgentry_t *pl1e = pv_gdt_ptes(n);
@@ -1647,23 +1666,68 @@ static void __context_switch(void)
     }
 
     if ( need_full_gdt(pd) &&
-         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
+         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd) ||
+          pd->arch.pv_domain.xpti) )
     {
         gdt_desc.limit = LAST_RESERVED_GDT_BYTE;
         gdt_desc.base  = (unsigned long)(gdt - FIRST_RESERVED_GDT_ENTRY);
 
+        if ( pd->arch.pv_domain.xpti )
+            _set_tssldt_type(gdt + TSS_ENTRY - FIRST_RESERVED_GDT_ENTRY,
+                             SYS_DESC_tss_avail);
+
         lgdt(&gdt_desc);
+
+        if ( pd->arch.pv_domain.xpti )
+        {
+            unsigned long stub_va = this_cpu(stubs.addr);
+
+            ltr(TSS_ENTRY << 3);
+            get_cpu_info()->flags &= ~VCPUSTACK_ACTIVE;
+            wrmsrl(MSR_LSTAR, stub_va);
+            wrmsrl(MSR_CSTAR, stub_va + STUB_TRAMPOLINE_SIZE_PERCPU);
+            if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
+                 boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR )
+                wrmsrl(MSR_IA32_SYSENTER_ESP,
+                       (unsigned long)&get_cpu_info()->guest_cpu_user_regs.es);
+        }
     }
 
     write_ptbase(n);
 
     if ( need_full_gdt(nd) &&
-         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd)) )
+         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd) ||
+          nd->arch.pv_domain.xpti) )
     {
         gdt_desc.limit = LAST_RESERVED_GDT_BYTE;
         gdt_desc.base = GDT_VIRT_START(n);
 
+        if ( nd->arch.pv_domain.xpti )
+        {
+            struct cpu_info *info;
+
+            gdt = (struct desc_struct *)GDT_VIRT_START(n);
+            gdt[PER_CPU_GDT_ENTRY].a = cpu;
+            _set_tssldt_type(gdt + TSS_ENTRY, SYS_DESC_tss_avail);
+            info = (struct cpu_info *)(XPTI_START(n) + STACK_SIZE) - 1;
+            info->stack_bottom_cpu = (unsigned long)guest_cpu_user_regs();
+        }
+
         lgdt(&gdt_desc);
+
+        if ( nd->arch.pv_domain.xpti )
+        {
+            unsigned long stub_va = XPTI_TRAMPOLINE(n);
+
+            ltr(TSS_ENTRY << 3);
+            get_cpu_info()->flags |= VCPUSTACK_ACTIVE;
+            wrmsrl(MSR_LSTAR, stub_va);
+            wrmsrl(MSR_CSTAR, stub_va + STUB_TRAMPOLINE_SIZE_PERVCPU);
+            if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
+                 boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR )
+                wrmsrl(MSR_IA32_SYSENTER_ESP,
+                       (unsigned long)&guest_cpu_user_regs()->es);
+        }
     }
 
     if ( pd != nd )
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 834be96ed8..6158086087 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -133,10 +133,36 @@ int switch_compat(struct domain *d)
 
 static int pv_create_gdt_ldt_l1tab(struct vcpu *v)
 {
-    return create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
-                                    1U << GDT_LDT_VCPU_SHIFT,
-                                    v->domain->arch.pv_domain.gdt_ldt_l1tab,
-                                    NULL);
+    int rc;
+
+    rc = create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
+                                  1U << GDT_LDT_VCPU_SHIFT,
+                                  v->domain->arch.pv_domain.gdt_ldt_l1tab,
+                                  NULL);
+    if ( !rc && v->domain->arch.pv_domain.xpti )
+    {
+        struct desc_struct *gdt;
+        struct page_info *gdt_pg;
+
+        BUILD_BUG_ON(NR_RESERVED_GDT_PAGES > 1);
+        gdt = (struct desc_struct *)GDT_VIRT_START(v) +
+              FIRST_RESERVED_GDT_ENTRY;
+        rc = create_perdomain_mapping(v->domain, (unsigned long)gdt,
+                                      NR_RESERVED_GDT_PAGES,
+                                      NULL, &gdt_pg);
+        if ( !rc )
+        {
+            gdt = __map_domain_page(gdt_pg);
+            memcpy(gdt, boot_cpu_gdt_table, NR_RESERVED_GDT_BYTES);
+            _set_tssldt_desc(gdt + TSS_ENTRY - FIRST_RESERVED_GDT_ENTRY,
+                         XPTI_TSS(v),
+                         offsetof(struct tss_struct, __cacheline_filler) - 1,
+                         SYS_DESC_tss_avail);
+            unmap_domain_page(gdt);
+        }
+    }
+
+    return rc;
 }
 
 static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
diff --git a/xen/include/asm-x86/desc.h b/xen/include/asm-x86/desc.h
index 4093c65faa..d5fff4cce5 100644
--- a/xen/include/asm-x86/desc.h
+++ b/xen/include/asm-x86/desc.h
@@ -185,6 +185,11 @@ do {                                                     \
         (((u32)(addr) & 0x00FF0000U) >> 16);             \
 } while (0)
 
+#define _set_tssldt_type(desc,type)                      \
+do {                                                     \
+    ((u8 *)&(desc)[0].b)[1] = (type) | 0x80;             \
+} while (0)
+
 struct __packed desc_ptr {
 	unsigned short limit;
 	unsigned long base;
diff --git a/xen/include/asm-x86/regs.h b/xen/include/asm-x86/regs.h
index 725a664e0a..361de4c54e 100644
--- a/xen/include/asm-x86/regs.h
+++ b/xen/include/asm-x86/regs.h
@@ -7,6 +7,8 @@
 #define guest_mode(r)                                                         \
 ({                                                                            \
     unsigned long diff = (char *)guest_cpu_user_regs() - (char *)(r);         \
+    if ( diff >= STACK_SIZE )                                                 \
+        diff = (char *)&get_cpu_info()->guest_cpu_user_regs - (char *)(r);    \
     /* Frame pointer must point into current CPU stack. */                    \
     ASSERT(diff < STACK_SIZE);                                                \
     /* If not a guest frame, it must be a hypervisor frame. */                \
-- 
2.13.6



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
                   ` (11 preceding siblings ...)
  2018-01-22 12:32 ` [PATCH RFC v2 12/12] x86: activate per-vcpu stacks in case of xpti Juergen Gross
@ 2018-01-22 12:50 ` Jan Beulich
       [not found] ` <5A65EC0A02000078001A1118@suse.com>
  2018-01-22 21:45 ` Konrad Rzeszutek Wilk
  14 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-22 12:50 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> As a preparation for doing page table isolation in the Xen hypervisor
> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
> 64 bit PV domains mapped to the per-domain virtual area.
> 
> The per-vcpu stacks are used for early interrupt handling only. After
> saving the domain's registers stacks are switched back to the normal
> per physical cpu ones in order to be able to address on-stack data
> from other cpus e.g. while handling IPIs.
> 
> Adding %cr3 switching between saving of the registers and switching
> the stacks will enable the possibility to run guest code without any
> per physical cpu mapping, i.e. avoiding the threat of a guest being
> able to access other domains data.
> 
> Without any further measures it will still be possible for e.g. a
> guest's user program to read stack data of another vcpu of the same
> domain, but this can be easily avoided by a little PV-ABI modification
> introducing per-cpu user address spaces.
> 
> This series is meant as a replacement for Andrew's patch series:
> "x86: Prerequisite work for a Xen KAISER solution".

Considering in particular the two reverts, what I'm missing here
is a clear description of the meaningful additional protection this
approach provides over the band-aid. For context see also
https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html

Jan



* Re: [PATCH RFC v2 01/12] x86: cleanup processor.h
  2018-01-22 12:32 ` [PATCH RFC v2 01/12] x86: cleanup processor.h Juergen Gross
@ 2018-01-22 12:52   ` Jan Beulich
       [not found]   ` <5A65ECA502000078001A111C@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-22 12:52 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> Remove NSC/Cyrix CPU macros and current_text_addr() which are used
> nowhere.

I agree with doing the former, but I have a vague recollection that we've
left the latter in place despite there not being any callers at present.

Jan



* Re: [PATCH RFC v2 01/12] x86: cleanup processor.h
       [not found]   ` <5A65ECA502000078001A111C@suse.com>
@ 2018-01-22 14:10     ` Juergen Gross
  2018-01-22 14:25       ` Andrew Cooper
  0 siblings, 1 reply; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 14:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 22/01/18 13:52, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> Remove NSC/Cyrix CPU macros and current_text_addr() which are used
>> nowhere.
> 
> I agree doing the former, but I have a vague recollection that we've
> left the latter in place despite there not being any callers at present.

It isn't as if current_text_addr() would be rocket science. I'm quite
sure that, in case it is needed, there will be enough brain power
available to either build it from scratch again or find it in git.
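
For reference, such a helper is only a few lines. A rough sketch of what
it usually looks like (not necessarily the exact definition being
removed):

static inline void *current_text_addr(void)
{
    void *pc;

    /* RIP-relative LEA of a local label yields the current text address. */
    asm volatile ( "leaq 1f(%%rip), %0\n1:" : "=r" (pc) );

    return pc;
}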

In case you really like it to stay I won't object, of course.


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
       [not found] ` <5A65EC0A02000078001A1118@suse.com>
@ 2018-01-22 14:18   ` Juergen Gross
  2018-01-22 14:22     ` Jan Beulich
       [not found]     ` <5A6601D302000078001A1230@suse.com>
  0 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 14:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 22/01/18 13:50, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> As a preparation for doing page table isolation in the Xen hypervisor
>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>> 64 bit PV domains mapped to the per-domain virtual area.
>>
>> The per-vcpu stacks are used for early interrupt handling only. After
>> saving the domain's registers stacks are switched back to the normal
>> per physical cpu ones in order to be able to address on-stack data
>> from other cpus e.g. while handling IPIs.
>>
>> Adding %cr3 switching between saving of the registers and switching
>> the stacks will enable the possibility to run guest code without any
>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>> able to access other domains data.
>>
>> Without any further measures it will still be possible for e.g. a
>> guest's user program to read stack data of another vcpu of the same
>> domain, but this can be easily avoided by a little PV-ABI modification
>> introducing per-cpu user address spaces.
>>
>> This series is meant as a replacement for Andrew's patch series:
>> "x86: Prerequisite work for a Xen KAISER solution".
> 
> Considering in particular the two reverts, what I'm missing here
> is a clear description of the meaningful additional protection this
> approach provides over the band-aid. For context see also
> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html

My approach supports mapping only the following data while the guest is
running (apart from the guest's own data, of course):

- the per-vcpu entry stacks of the domain, which will contain only the
  guest's registers saved when an interrupt occurs
- the per-vcpu GDTs and TSSs of the domain
- the IDT
- the interrupt handler code (arch/x86/x86_64/[compat/]entry.S)

All other hypervisor data and code can be completely hidden from the
guests.


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 14:18   ` Juergen Gross
@ 2018-01-22 14:22     ` Jan Beulich
       [not found]     ` <5A6601D302000078001A1230@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-22 14:22 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
> On 22/01/18 13:50, Jan Beulich wrote:
>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>> As a preparation for doing page table isolation in the Xen hypervisor
>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>
>>> The per-vcpu stacks are used for early interrupt handling only. After
>>> saving the domain's registers stacks are switched back to the normal
>>> per physical cpu ones in order to be able to address on-stack data
>>> from other cpus e.g. while handling IPIs.
>>>
>>> Adding %cr3 switching between saving of the registers and switching
>>> the stacks will enable the possibility to run guest code without any
>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>> able to access other domains data.
>>>
>>> Without any further measures it will still be possible for e.g. a
>>> guest's user program to read stack data of another vcpu of the same
>>> domain, but this can be easily avoided by a little PV-ABI modification
>>> introducing per-cpu user address spaces.
>>>
>>> This series is meant as a replacement for Andrew's patch series:
>>> "x86: Prerequisite work for a Xen KAISER solution".
>> 
>> Considering in particular the two reverts, what I'm missing here
>> is a clear description of the meaningful additional protection this
>> approach provides over the band-aid. For context see also
>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
> 
> My approach supports mapping only the following data while the guest is
> running (apart form the guest's own data, of course):
> 
> - the per-vcpu entry stacks of the domain which will contain only the
>   guest's registers saved when an interrupt occurs
> - the per-vcpu GDTs and TSSs of the domain
> - the IDT
> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
> 
> All other hypervisor data and code can be completely hidden from the
> guests.

I understand that. What I'm not clear about is: Which parts of
the additionally hidden data are actually necessary (or at least
very desirable) to hide?

Jan



* Re: [PATCH RFC v2 01/12] x86: cleanup processor.h
  2018-01-22 14:10     ` Juergen Gross
@ 2018-01-22 14:25       ` Andrew Cooper
  2018-01-22 14:32         ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Andrew Cooper @ 2018-01-22 14:25 UTC (permalink / raw)
  To: Juergen Gross, Jan Beulich
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 22/01/18 14:10, Juergen Gross wrote:
> On 22/01/18 13:52, Jan Beulich wrote:
>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>> Remove NSC/Cyrix CPU macros and current_text_addr() which are used
>>> nowhere.
>> I agree doing the former, but I have a vague recollection that we've
>> left the latter in place despite there not being any callers at present.
> It isn't as if current_text_addr() would be rocket science. I'm quite
> sure in case it is needed there will be enough brain power available to
> build it either from scratch again or to find it in git.
>
> In case you really like it to stay I won't object, of course.

FWIW, I've disliked all the recent patches which have tried to use
current_text_addr(), and I don't see it as a useful debugging utility
either.

I would prefer to see it gone rather than stay.

~Andrew


* Re: [PATCH RFC v2 01/12] x86: cleanup processor.h
  2018-01-22 14:25       ` Andrew Cooper
@ 2018-01-22 14:32         ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-22 14:32 UTC (permalink / raw)
  To: Andrew Cooper, Juergen Gross
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

>>> On 22.01.18 at 15:25, <andrew.cooper3@citrix.com> wrote:
> On 22/01/18 14:10, Juergen Gross wrote:
>> On 22/01/18 13:52, Jan Beulich wrote:
>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>> Remove NSC/Cyrix CPU macros and current_text_addr() which are used
>>>> nowhere.
>>> I agree doing the former, but I have a vague recollection that we've
>>> left the latter in place despite there not being any callers at present.
>> It isn't as if current_text_addr() would be rocket science. I'm quite
>> sure in case it is needed there will be enough brain power available to
>> build it either from scratch again or to find it in git.
>>
>> In case you really like it to stay I won't object, of course.
> 
> FWIW, I've disliked all the recent patches which have tried to use
> current_text_addr(), and I don't see it as a useful debugging utility
> either.
> 
> I would prefer to see it gone than to stay.

Well, okay then. The patch is independent of the other, actual
RFC stuff, so could go in right away.

Jan



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
       [not found]     ` <5A6601D302000078001A1230@suse.com>
@ 2018-01-22 14:38       ` Juergen Gross
  2018-01-22 14:48         ` Jan Beulich
       [not found]         ` <5A6607DB02000078001A127B@suse.com>
  0 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 14:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 22/01/18 15:22, Jan Beulich wrote:
>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>
>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>> saving the domain's registers stacks are switched back to the normal
>>>> per physical cpu ones in order to be able to address on-stack data
>>>> from other cpus e.g. while handling IPIs.
>>>>
>>>> Adding %cr3 switching between saving of the registers and switching
>>>> the stacks will enable the possibility to run guest code without any
>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>> able to access other domains data.
>>>>
>>>> Without any further measures it will still be possible for e.g. a
>>>> guest's user program to read stack data of another vcpu of the same
>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>> introducing per-cpu user address spaces.
>>>>
>>>> This series is meant as a replacement for Andrew's patch series:
>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>
>>> Considering in particular the two reverts, what I'm missing here
>>> is a clear description of the meaningful additional protection this
>>> approach provides over the band-aid. For context see also
>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>
>> My approach supports mapping only the following data while the guest is
>> running (apart form the guest's own data, of course):
>>
>> - the per-vcpu entry stacks of the domain which will contain only the
>>   guest's registers saved when an interrupt occurs
>> - the per-vcpu GDTs and TSSs of the domain
>> - the IDT
>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>
>> All other hypervisor data and code can be completely hidden from the
>> guests.
> 
> I understand that. What I'm not clear about is: Which parts of
> the additionally hidden data are actually necessary (or at least
> very desirable) to hide?

Necessary:
- other guests' memory (e.g. physical memory 1:1 mapping)
- data from other guests, e.g. in stack pages, debug buffers, I/O buffers,
  code emulator buffers
- other guests' register values e.g. in vcpu structure

Desirable: as much as possible. For instance I don't buy your reasoning
regarding the Xen binary: how would you do this e.g. in a public cloud?
How do you know which Xen binary (possibly with livepatches) is being
used there? And today we don't have something like KASLR in Xen, but
not hiding the text and RO data will make the introduction of that quite
useless.


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 14:38       ` Juergen Gross
@ 2018-01-22 14:48         ` Jan Beulich
       [not found]         ` <5A6607DB02000078001A127B@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-22 14:48 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
> On 22/01/18 15:22, Jan Beulich wrote:
>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>
>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>> saving the domain's registers stacks are switched back to the normal
>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>> from other cpus e.g. while handling IPIs.
>>>>>
>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>> the stacks will enable the possibility to run guest code without any
>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>> able to access other domains data.
>>>>>
>>>>> Without any further measures it will still be possible for e.g. a
>>>>> guest's user program to read stack data of another vcpu of the same
>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>> introducing per-cpu user address spaces.
>>>>>
>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>
>>>> Considering in particular the two reverts, what I'm missing here
>>>> is a clear description of the meaningful additional protection this
>>>> approach provides over the band-aid. For context see also
>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>
>>> My approach supports mapping only the following data while the guest is
>>> running (apart form the guest's own data, of course):
>>>
>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>   guest's registers saved when an interrupt occurs
>>> - the per-vcpu GDTs and TSSs of the domain
>>> - the IDT
>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>
>>> All other hypervisor data and code can be completely hidden from the
>>> guests.
>> 
>> I understand that. What I'm not clear about is: Which parts of
>> the additionally hidden data are actually necessary (or at least
>> very desirable) to hide?
> 
> Necessary:
> - other guests' memory (e.g. physical memory 1:1 mapping)
> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>   code emulator buffers
> - other guests' register values e.g. in vcpu structure

All of this is already being made invisible by the band-aid (with the
exception of leftovers on the hypervisor stacks across context
switches, which we've already said could be taken care of by
memset()ing that area). I'm asking about the _additional_ benefits
of your approach.

> Desirable: as much as possible. For instance I don't buy your reasoning
> regarding the Xen binary: how would you do this e.g. in a public cloud?
> How do you know which Xen binary (possibly with livepatches) is being
> used there? And today we don't have something like KASLR in Xen, but
> not hiding the text and RO data will make the introduction of that quite
> useless.

I'm aware that there are people thinking that .text and .rodata
should be hidden; what I'm not really aware of is the reasoning
behind that.

Jan



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
       [not found]         ` <5A6607DB02000078001A127B@suse.com>
@ 2018-01-22 15:00           ` Juergen Gross
  2018-01-22 16:51             ` Jan Beulich
       [not found]             ` <5A6624A602000078001A1375@suse.com>
  0 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-22 15:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 22/01/18 15:48, Jan Beulich wrote:
>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>
>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>
>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>> able to access other domains data.
>>>>>>
>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>> introducing per-cpu user address spaces.
>>>>>>
>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>
>>>>> Considering in particular the two reverts, what I'm missing here
>>>>> is a clear description of the meaningful additional protection this
>>>>> approach provides over the band-aid. For context see also
>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>
>>>> My approach supports mapping only the following data while the guest is
>>>> running (apart form the guest's own data, of course):
>>>>
>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>   guest's registers saved when an interrupt occurs
>>>> - the per-vcpu GDTs and TSSs of the domain
>>>> - the IDT
>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>
>>>> All other hypervisor data and code can be completely hidden from the
>>>> guests.
>>>
>>> I understand that. What I'm not clear about is: Which parts of
>>> the additionally hidden data are actually necessary (or at least
>>> very desirable) to hide?
>>
>> Necessary:
>> - other guests' memory (e.g. physical memory 1:1 mapping)
>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>   code emulator buffers
>> - other guests' register values e.g. in vcpu structure
> 
> All of this is already being made invisible by the band-aid (with the
> exception of leftovers on the hypervisor stacks across context
> switches, which we've already said could be taken care of by
> memset()ing that area). I'm asking about the _additional_ benefits
> of your approach.

I'm quite sure the performance will be much better, as it doesn't require
per-physical-cpu L4 page tables, but just a shadow L4 table for each
guest L4 table, similar to the Linux kernel KPTI approach.
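
Roughly the idea (only a hand-wavy sketch, not code from this series;
the slot numbers and names below are made up):

#include <stdint.h>
#include <string.h>

typedef uint64_t l4e_t;
#define L4_ENTRIES      512
#define GUEST_SLOTS     256   /* slots under guest control (simplified) */
#define XEN_ENTRY_SLOT  258   /* hypothetical slot holding the per-vcpu
                                 stack, GDT/TSS, IDT and entry stubs    */

/* Called whenever the guest L4 changes; the shadow is what %cr3 points
 * at while the guest is running. */
static void sync_shadow_l4(const l4e_t *guest_l4, l4e_t *shadow_l4,
                           l4e_t minimal_xen_l4e)
{
    /* Guest-controlled slots are mirrored 1:1. */
    memcpy(shadow_l4, guest_l4, GUEST_SLOTS * sizeof(l4e_t));

    /* Hypervisor slots carry only the minimal per-domain mappings, so
     * the rest of Xen stays unreachable from guest context. */
    memset(shadow_l4 + GUEST_SLOTS, 0,
           (L4_ENTRIES - GUEST_SLOTS) * sizeof(l4e_t));
    shadow_l4[XEN_ENTRY_SLOT] = minimal_xen_l4e;
}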

> 
>> Desirable: as much as possible. For instance I don't buy your reasoning
>> regarding the Xen binary: how would you do this e.g. in a public cloud?
>> How do you know which Xen binary (possibly with livepatches) is being
>> used there? And today we don't have something like KASLR in Xen, but
>> not hiding the text and RO data will make the introduction of that quite
>> useless.
> 
> I'm aware that there are people thinking that .text and .rodata
> should be hidden; what I'm not really aware of is the reasoning
> behind that.

In case an attacker knows of some vulnerability, it is just harder to use
that knowledge without knowing where specific data structures or code
live. It's like switching the lights off when you know somebody is
aiming a gun at you: the odds are much better if the killer can't
see you.


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 15:00           ` Juergen Gross
@ 2018-01-22 16:51             ` Jan Beulich
  2018-01-22 18:39               ` Andrew Cooper
       [not found]             ` <5A6624A602000078001A1375@suse.com>
  1 sibling, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2018-01-22 16:51 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 16:00, <jgross@suse.com> wrote:
> On 22/01/18 15:48, Jan Beulich wrote:
>>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>>
>>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>>
>>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>>> able to access other domains data.
>>>>>>>
>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>>> introducing per-cpu user address spaces.
>>>>>>>
>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>
>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>> is a clear description of the meaningful additional protection this
>>>>>> approach provides over the band-aid. For context see also
>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>>
>>>>> My approach supports mapping only the following data while the guest is
>>>>> running (apart form the guest's own data, of course):
>>>>>
>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>   guest's registers saved when an interrupt occurs
>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>> - the IDT
>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>>
>>>>> All other hypervisor data and code can be completely hidden from the
>>>>> guests.
>>>>
>>>> I understand that. What I'm not clear about is: Which parts of
>>>> the additionally hidden data are actually necessary (or at least
>>>> very desirable) to hide?
>>>
>>> Necessary:
>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>>   code emulator buffers
>>> - other guests' register values e.g. in vcpu structure
>> 
>> All of this is already being made invisible by the band-aid (with the
>> exception of leftovers on the hypervisor stacks across context
>> switches, which we've already said could be taken care of by
>> memset()ing that area). I'm asking about the _additional_ benefits
>> of your approach.
> 
> I'm quite sure the performance will be much better as it doesn't require
> per physical cpu L4 page tables, but just a shadow L4 table for each
> guest L4 table, similar to the Linux kernel KPTI approach.

But doesn't that model have the same synchronization issues upon
guest L4 updates that Andrew was fighting with?

Jan



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 16:51             ` Jan Beulich
@ 2018-01-22 18:39               ` Andrew Cooper
  2018-01-22 18:48                 ` George Dunlap
                                   ` (3 more replies)
  0 siblings, 4 replies; 74+ messages in thread
From: Andrew Cooper @ 2018-01-22 18:39 UTC (permalink / raw)
  To: Jan Beulich, Juergen Gross
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 22/01/18 16:51, Jan Beulich wrote:
>>>> On 22.01.18 at 16:00, <jgross@suse.com> wrote:
>> On 22/01/18 15:48, Jan Beulich wrote:
>>>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>>>
>>>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>>>
>>>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>>>> able to access other domains data.
>>>>>>>>
>>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>>>> introducing per-cpu user address spaces.
>>>>>>>>
>>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>>> is a clear description of the meaningful additional protection this
>>>>>>> approach provides over the band-aid. For context see also
>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>>> My approach supports mapping only the following data while the guest is
>>>>>> running (apart form the guest's own data, of course):
>>>>>>
>>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>>   guest's registers saved when an interrupt occurs
>>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>>> - the IDT
>>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>>>
>>>>>> All other hypervisor data and code can be completely hidden from the
>>>>>> guests.
>>>>> I understand that. What I'm not clear about is: Which parts of
>>>>> the additionally hidden data are actually necessary (or at least
>>>>> very desirable) to hide?
>>>> Necessary:
>>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>>>   code emulator buffers
>>>> - other guests' register values e.g. in vcpu structure
>>> All of this is already being made invisible by the band-aid (with the
>>> exception of leftovers on the hypervisor stacks across context
>>> switches, which we've already said could be taken care of by
>>> memset()ing that area). I'm asking about the _additional_ benefits
>>> of your approach.
>> I'm quite sure the performance will be much better as it doesn't require
>> per physical cpu L4 page tables, but just a shadow L4 table for each
>> guest L4 table, similar to the Linux kernel KPTI approach.
> But isn't that model having the same synchronization issues upon
> guest L4 updates which Andrew was fighting with?

(Condensing a lot of threads down into one)

All the methods have L4 synchronisation update issues, until we have a
PV ABI which guarantees that L4's don't get reused.  Any improvements to
the shadowing/synchronisation algorithm will benefit all approaches.

Juergen: you're now adding an LTR into the context switch path, which
tends to be very slow.  I.e., as currently presented, this series
necessarily has a higher runtime overhead than Jan's XPTI.

One of my concerns is that this patch series moves further away from the
secondary goal of my KAISER series, which was to have the IDT and GDT
mapped at the same linear addresses on every CPU, so that a) SIDT/SGDT
don't leak into PV guests which CPU you're currently scheduled on, and
b) the context switch code can drop a load of its slow instructions like
LGDT and the VMWRITEs to update the VMCS.
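
To illustrate the SIDT/SGDT point (a stand-alone sketch, not taken from
either series):

#include <stdint.h>
#include <stdio.h>

struct __attribute__((packed)) desc_ptr {
    uint16_t limit;
    uint64_t base;
};

int main(void)
{
    struct desc_ptr gdtr, idtr;

    /* Neither instruction is privileged (absent UMIP), so this works from
     * guest userspace as well as from the guest kernel. */
    asm volatile ( "sgdt %0" : "=m" (gdtr) );
    asm volatile ( "sidt %0" : "=m" (idtr) );

    /* With per-CPU tables at distinct linear addresses these values change
     * as the vCPU migrates between physical CPUs; identical mappings on
     * every CPU remove that signal. */
    printf("GDT base %#llx, IDT base %#llx\n",
           (unsigned long long)gdtr.base, (unsigned long long)idtr.base);

    return 0;
}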

Jan: As to the things not covered by the current XPTI, hiding most of
the .text section is important to prevent fingerprinting or ROP
scanning.  This is a defence-in-depth argument, but a guest being easily
able to identify whether certain XSAs are fixed or not is quite bad. 
Also, a load of CPU 0's data structures, including its stack, are
visible in .data.
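
As a rough illustration of the fingerprinting concern (leak_byte() is a
stand-in for whatever read primitive an attacker has; the address is an
example only):

#include <stdint.h>
#include <stdio.h>

/* Placeholder for a Meltdown-style read of hypervisor memory. */
static uint8_t leak_byte(uint64_t hyp_va)
{
    (void)hyp_va;
    return 0;
}

int main(void)
{
    const uint64_t text_base = 0xffff82d080200000ULL; /* example VA only */
    uint64_t hash = 14695981039346656037ULL;          /* FNV-1a basis */
    unsigned int i;

    /* Hashing a few readable .text pages pins down the exact build, and
     * with it the set of XSA fixes it does or does not contain. */
    for ( i = 0; i < 4096; i++ )
    {
        hash ^= leak_byte(text_base + i);
        hash *= 1099511628211ULL;                      /* FNV-1a prime */
    }

    printf("hypervisor .text fingerprint: %#llx\n",
           (unsigned long long)hash);

    return 0;
}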

~Andrew


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 18:39               ` Andrew Cooper
@ 2018-01-22 18:48                 ` George Dunlap
  2018-01-22 19:02                   ` Andrew Cooper
  2018-01-23  6:34                 ` Juergen Gross
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: George Dunlap @ 2018-01-22 18:48 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Juergen Gross
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 01/22/2018 06:39 PM, Andrew Cooper wrote:
> On 22/01/18 16:51, Jan Beulich wrote:
>>>>> On 22.01.18 at 16:00, <jgross@suse.com> wrote:
>>> On 22/01/18 15:48, Jan Beulich wrote:
>>>>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>>>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>>>>
>>>>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>>>>
>>>>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>>>>> able to access other domains data.
>>>>>>>>>
>>>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>>>>> introducing per-cpu user address spaces.
>>>>>>>>>
>>>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>>>> is a clear description of the meaningful additional protection this
>>>>>>>> approach provides over the band-aid. For context see also
>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>>>> My approach supports mapping only the following data while the guest is
>>>>>>> running (apart form the guest's own data, of course):
>>>>>>>
>>>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>>>   guest's registers saved when an interrupt occurs
>>>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>>>> - the IDT
>>>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>>>>
>>>>>>> All other hypervisor data and code can be completely hidden from the
>>>>>>> guests.
>>>>>> I understand that. What I'm not clear about is: Which parts of
>>>>>> the additionally hidden data are actually necessary (or at least
>>>>>> very desirable) to hide?
>>>>> Necessary:
>>>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>>>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>>>>   code emulator buffers
>>>>> - other guests' register values e.g. in vcpu structure
>>>> All of this is already being made invisible by the band-aid (with the
>>>> exception of leftovers on the hypervisor stacks across context
>>>> switches, which we've already said could be taken care of by
>>>> memset()ing that area). I'm asking about the _additional_ benefits
>>>> of your approach.
>>> I'm quite sure the performance will be much better as it doesn't require
>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>> guest L4 table, similar to the Linux kernel KPTI approach.
>> But isn't that model having the same synchronization issues upon
>> guest L4 updates which Andrew was fighting with?
> 
> (Condensing a lot of threads down into one)
> 
> All the methods have L4 synchronisation update issues, until we have a
> PV ABI which guarantees that L4's don't get reused.  Any improvements to
> the shadowing/synchronisation algorithm will benefit all approaches.
> 
> Juergen: you're now adding a LTR into the context switch path which
> tends to be very slow.  I.e. As currently presented, this series
> necessarily has a higher runtime overhead than Jan's XPTI.
> 
> One of my concerns is that this patch series moves further away from the
> secondary goal of my KAISER series, which was to have the IDT and GDT
> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
> leak which CPU you're currently scheduled on into PV guests and b) the
> context switch code can drop a load of its slow instructions like LGDT
> and the VMWRITEs to update the VMCS.
> 
> Jan: As to the things not covered by the current XPTI, hiding most of
> the .text section is important to prevent fingerprinting or ROP
> scanning.  This is a defence-in-depth argument, but a guest being easily
> able to identify whether certain XSAs are fixed or not is quite bad. 

I'm afraid we have a fairly different opinion of what is "quite bad".
Suppose we handed users a knob and said, "If you flip this switch,
attackers won't be able to tell if you've fixed XSAs or not without
trying them; but it will slow down your guests 20%."  How many do you
think would flip it, and how many would reckon that an attacker could
probably find out that information anyway?

 -George


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 18:48                 ` George Dunlap
@ 2018-01-22 19:02                   ` Andrew Cooper
  2018-01-23  8:36                     ` Jan Beulich
  2018-01-23 11:06                     ` George Dunlap
  0 siblings, 2 replies; 74+ messages in thread
From: Andrew Cooper @ 2018-01-22 19:02 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich, Juergen Gross
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 22/01/18 18:48, George Dunlap wrote:
> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>> On 22/01/18 16:51, Jan Beulich wrote:
>>>>>> On 22.01.18 at 16:00, <jgross@suse.com> wrote:
>>>> On 22/01/18 15:48, Jan Beulich wrote:
>>>>>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>>>>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>>>>>
>>>>>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>>>>>
>>>>>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>>>>>> able to access other domains data.
>>>>>>>>>>
>>>>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>>>>>> introducing per-cpu user address spaces.
>>>>>>>>>>
>>>>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>>>>> is a clear description of the meaningful additional protection this
>>>>>>>>> approach provides over the band-aid. For context see also
>>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>>>>> My approach supports mapping only the following data while the guest is
>>>>>>>> running (apart form the guest's own data, of course):
>>>>>>>>
>>>>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>>>>   guest's registers saved when an interrupt occurs
>>>>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>>>>> - the IDT
>>>>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>>>>>
>>>>>>>> All other hypervisor data and code can be completely hidden from the
>>>>>>>> guests.
>>>>>>> I understand that. What I'm not clear about is: Which parts of
>>>>>>> the additionally hidden data are actually necessary (or at least
>>>>>>> very desirable) to hide?
>>>>>> Necessary:
>>>>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>>>>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>>>>>   code emulator buffers
>>>>>> - other guests' register values e.g. in vcpu structure
>>>>> All of this is already being made invisible by the band-aid (with the
>>>>> exception of leftovers on the hypervisor stacks across context
>>>>> switches, which we've already said could be taken care of by
>>>>> memset()ing that area). I'm asking about the _additional_ benefits
>>>>> of your approach.
>>>> I'm quite sure the performance will be much better as it doesn't require
>>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>>> guest L4 table, similar to the Linux kernel KPTI approach.
>>> But isn't that model having the same synchronization issues upon
>>> guest L4 updates which Andrew was fighting with?
>> (Condensing a lot of threads down into one)
>>
>> All the methods have L4 synchronisation update issues, until we have a
>> PV ABI which guarantees that L4's don't get reused.  Any improvements to
>> the shadowing/synchronisation algorithm will benefit all approaches.
>>
>> Juergen: you're now adding a LTR into the context switch path which
>> tends to be very slow.  I.e. As currently presented, this series
>> necessarily has a higher runtime overhead than Jan's XPTI.
>>
>> One of my concerns is that this patch series moves further away from the
>> secondary goal of my KAISER series, which was to have the IDT and GDT
>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>> leak which CPU you're currently scheduled on into PV guests and b) the
>> context switch code can drop a load of its slow instructions like LGDT
>> and the VMWRITEs to update the VMCS.
>>
>> Jan: As to the things not covered by the current XPTI, hiding most of
>> the .text section is important to prevent fingerprinting or ROP
>> scanning.  This is a defence-in-depth argument, but a guest being easily
>> able to identify whether certain XSAs are fixed or not is quite bad. 
> I'm afraid we have a fairly different opinion of what is "quite bad".

I suggest you try talking to some real users then.

> Suppose we handed users a knob and said, "If you flip this switch,
> attackers won't be able to tell if you've fixed XSAs or not without
> trying them; but it will slow down your guests 20%."  How many do you
> think would flip it, and how many would reckon that an attacker could
> probably find out that information anyway?

Nonsense.  The performance hit is already taken.  The argument is "do
you want an attacker able to trivially evaluate security weaknesses in
your hypervisor", a process which usually has to be done by guesswork
and knowing the exact binary under attack.  Having .text fully readable
lowers the barrier to entry substantially.

~Andrew


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
                   ` (13 preceding siblings ...)
       [not found] ` <5A65EC0A02000078001A1118@suse.com>
@ 2018-01-22 21:45 ` Konrad Rzeszutek Wilk
  2018-01-23  6:38   ` Juergen Gross
  14 siblings, 1 reply; 74+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-01-22 21:45 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson, dfaggioli,
	jbeulich, xen-devel

On Mon, Jan 22, 2018 at 01:32:44PM +0100, Juergen Gross wrote:
> As a preparation for doing page table isolation in the Xen hypervisor
> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
> 64 bit PV domains mapped to the per-domain virtual area.
> 
> The per-vcpu stacks are used for early interrupt handling only. After
> saving the domain's registers stacks are switched back to the normal
> per physical cpu ones in order to be able to address on-stack data
> from other cpus e.g. while handling IPIs.
> 
> Adding %cr3 switching between saving of the registers and switching
> the stacks will enable the possibility to run guest code without any
> per physical cpu mapping, i.e. avoiding the threat of a guest being
> able to access other domains data.
> 
> Without any further measures it will still be possible for e.g. a
> guest's user program to read stack data of another vcpu of the same
> domain, but this can be easily avoided by a little PV-ABI modification
> introducing per-cpu user address spaces.
> 
> This series is meant as a replacement for Andrew's patch series:
> "x86: Prerequisite work for a Xen KAISER solution".
> 
> What needs to be done:
> - verify livepatching is still working

Is there a git repo for this?

> - performance evaluation (Dario is working on it)
> - the real page table switching
> 
> 
> Changes since RFC V1:
> - switch back to per physical cpu stacks in interrupt handling
> - complete rework of series
> - rebase to current staging
> - adding reverts of Jan's band-aid patches
> - adding two minor cleanups at the begin of the series
> - done much more testing, including NMIs
> 
> Juergen Gross (12):
>   x86: cleanup processor.h
>   x86: don't use hypervisor stack size for dumping guest stacks
>   x86: do a revert of e871e80c38547d9faefc6604532ba3e985e65873
>   x86: revert 5784de3e2067ed73efc2fe42e62831e8ae7f46c4
>   x86: don't access saved user regs via rsp in trap handlers
>   x86: add a xpti command line parameter
>   x86: allow per-domain mappings without NX bit or with specific mfn
>   xen/x86: use dedicated function for tss initialization
>   x86: enhance syscall stub to work in per-domain mapping
>   x86: allocate per-vcpu stacks for interrupt entries
>   x86: modify interrupt handlers to support stack switching
>   x86: activate per-vcpu stacks in case of xpti
> 
>  docs/misc/xen-command-line.markdown |  16 +-
>  xen/arch/x86/cpu/common.c           |  56 ++++---
>  xen/arch/x86/domain.c               |  84 ++++++++--
>  xen/arch/x86/mm.c                   | 102 ++++++++++---
>  xen/arch/x86/pv/domain.c            | 161 +++++++++++++++++++-
>  xen/arch/x86/smpboot.c              | 211 --------------------------
>  xen/arch/x86/traps.c                |  26 ++--
>  xen/arch/x86/x86_64/asm-offsets.c   |   6 +-
>  xen/arch/x86/x86_64/compat/entry.S  |  98 ++++++------
>  xen/arch/x86/x86_64/entry.S         | 295 ++++++++++++------------------------
>  xen/arch/x86/x86_64/traps.c         |  47 +++---
>  xen/common/wait.c                   |   8 +-
>  xen/include/asm-x86/asm_defns.h     |  49 +++---
>  xen/include/asm-x86/config.h        |  13 +-
>  xen/include/asm-x86/current.h       |  71 ++++++---
>  xen/include/asm-x86/desc.h          |   5 +
>  xen/include/asm-x86/domain.h        |   5 +
>  xen/include/asm-x86/mm.h            |   3 +
>  xen/include/asm-x86/processor.h     |  42 -----
>  xen/include/asm-x86/regs.h          |   2 +
>  xen/include/asm-x86/system.h        |   8 +
>  xen/include/asm-x86/x86_64/page.h   |   5 +-
>  22 files changed, 647 insertions(+), 666 deletions(-)
> 
> -- 
> 2.13.6
> 


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
       [not found]             ` <5A6624A602000078001A1375@suse.com>
@ 2018-01-23  5:50               ` Juergen Gross
  2018-01-23  8:40                 ` Jan Beulich
       [not found]                 ` <5A67030F02000078001A164B@suse.com>
  0 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-23  5:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 22/01/18 17:51, Jan Beulich wrote:
>>>> On 22.01.18 at 16:00, <jgross@suse.com> wrote:
>> On 22/01/18 15:48, Jan Beulich wrote:
>>>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>>>
>>>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>>>
>>>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>>>> able to access other domains data.
>>>>>>>>
>>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>>>> introducing per-cpu user address spaces.
>>>>>>>>
>>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>>
>>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>>> is a clear description of the meaningful additional protection this
>>>>>>> approach provides over the band-aid. For context see also
>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>>>
>>>>>> My approach supports mapping only the following data while the guest is
>>>>>> running (apart form the guest's own data, of course):
>>>>>>
>>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>>   guest's registers saved when an interrupt occurs
>>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>>> - the IDT
>>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>>>
>>>>>> All other hypervisor data and code can be completely hidden from the
>>>>>> guests.
>>>>>
>>>>> I understand that. What I'm not clear about is: Which parts of
>>>>> the additionally hidden data are actually necessary (or at least
>>>>> very desirable) to hide?
>>>>
>>>> Necessary:
>>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>>>   code emulator buffers
>>>> - other guests' register values e.g. in vcpu structure
>>>
>>> All of this is already being made invisible by the band-aid (with the
>>> exception of leftovers on the hypervisor stacks across context
>>> switches, which we've already said could be taken care of by
>>> memset()ing that area). I'm asking about the _additional_ benefits
>>> of your approach.
>>
>> I'm quite sure the performance will be much better as it doesn't require
>> per physical cpu L4 page tables, but just a shadow L4 table for each
>> guest L4 table, similar to the Linux kernel KPTI approach.
> 
> But isn't that model having the same synchronization issues upon
> guest L4 updates which Andrew was fighting with?

I don't think so, as the number of shadows will always be at most 1
with my approach.
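
(To make the "at most 1" point concrete, here is a minimal sketch of
one-shadow-per-guest-L4 bookkeeping. The struct, function name and the
slot split are made up for illustration and are not code from this
series.)

/* Illustrative sketch only: names and the slot split are assumptions. */
#include <stdint.h>
#include <string.h>

#define L4_ENTRIES      512
#define GUEST_L4_SLOTS  256   /* assumed guest-controlled portion */

typedef uint64_t l4e_t;

struct shadow_l4 {
    l4e_t *guest_l4;   /* the guest's own L4, as mapped by the hypervisor */
    l4e_t *shadow_l4;  /* the single shadow used while the guest runs */
};

/* Copy the guest-controlled slots, overlay a minimal hypervisor mapping. */
static void sync_shadow_l4(struct shadow_l4 *s, const l4e_t *xen_minimal)
{
    memcpy(s->shadow_l4, s->guest_l4, GUEST_L4_SLOTS * sizeof(l4e_t));
    memcpy(s->shadow_l4 + GUEST_L4_SLOTS, xen_minimal,
           (L4_ENTRIES - GUEST_L4_SLOTS) * sizeof(l4e_t));
}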

Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 18:39               ` Andrew Cooper
  2018-01-22 18:48                 ` George Dunlap
@ 2018-01-23  6:34                 ` Juergen Gross
  2018-01-23  7:21                   ` Juergen Gross
                                     ` (2 more replies)
  2018-01-23 13:24                 ` Dario Faggioli
  2018-01-23 16:45                 ` George Dunlap
  3 siblings, 3 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-23  6:34 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 22/01/18 19:39, Andrew Cooper wrote:
> On 22/01/18 16:51, Jan Beulich wrote:
>>>>> On 22.01.18 at 16:00, <jgross@suse.com> wrote:
>>> On 22/01/18 15:48, Jan Beulich wrote:
>>>>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>>>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>>>>
>>>>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>>>>
>>>>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>>>>> able to access other domains data.
>>>>>>>>>
>>>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>>>>> introducing per-cpu user address spaces.
>>>>>>>>>
>>>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>>>> is a clear description of the meaningful additional protection this
>>>>>>>> approach provides over the band-aid. For context see also
>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>>>> My approach supports mapping only the following data while the guest is
>>>>>>> running (apart form the guest's own data, of course):
>>>>>>>
>>>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>>>   guest's registers saved when an interrupt occurs
>>>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>>>> - the IDT
>>>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>>>>
>>>>>>> All other hypervisor data and code can be completely hidden from the
>>>>>>> guests.
>>>>>> I understand that. What I'm not clear about is: Which parts of
>>>>>> the additionally hidden data are actually necessary (or at least
>>>>>> very desirable) to hide?
>>>>> Necessary:
>>>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>>>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>>>>   code emulator buffers
>>>>> - other guests' register values e.g. in vcpu structure
>>>> All of this is already being made invisible by the band-aid (with the
>>>> exception of leftovers on the hypervisor stacks across context
>>>> switches, which we've already said could be taken care of by
>>>> memset()ing that area). I'm asking about the _additional_ benefits
>>>> of your approach.
>>> I'm quite sure the performance will be much better as it doesn't require
>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>> guest L4 table, similar to the Linux kernel KPTI approach.
>> But isn't that model having the same synchronization issues upon
>> guest L4 updates which Andrew was fighting with?
> 
> (Condensing a lot of threads down into one)
> 
> All the methods have L4 synchronisation update issues, until we have a
> PV ABI which guarantees that L4's don't get reused.  Any improvements to
> the shadowing/synchronisation algorithm will benefit all approaches.
> 
> Juergen: you're now adding a LTR into the context switch path which
> tends to be very slow.  I.e. As currently presented, this series
> necessarily has a higher runtime overhead than Jan's XPTI.

Sure? How slow is LTR compared to a copy of nearly 4kB of data?

> One of my concerns is that this patch series moves further away from the
> secondary goal of my KAISER series, which was to have the IDT and GDT
> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
> leak which CPU you're currently scheduled on into PV guests and b) the
> context switch code can drop a load of its slow instructions like LGDT
> and the VMWRITEs to update the VMCS.

The GDT address of a PV vcpu depends on vcpu_id only. I don't see why
the IDT can't be mapped to the same address on each cpu with my
approach.
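
(To illustrate the "vcpu_id only" point: with the GDT in the per-domain
area, its linear address can be computed from the vcpu id alone, along
the lines of the sketch below. The base and stride constants are
placeholders, not Xen's real layout.)

/* Illustrative only: placeholder constants, not Xen's real layout. */
#include <stdint.h>

#define PERDOMAIN_GDT_BASE  0xffff830000000000UL  /* assumed base address */
#define GDT_VCPU_STRIDE     (1UL << 16)           /* assumed per-vcpu slot size */

static inline uint64_t gdt_linear_addr(unsigned int vcpu_id)
{
    /* Same formula on every pcpu, so SGDT can only reveal the vcpu id. */
    return PERDOMAIN_GDT_BASE + (uint64_t)vcpu_id * GDT_VCPU_STRIDE;
}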


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 21:45 ` Konrad Rzeszutek Wilk
@ 2018-01-23  6:38   ` Juergen Gross
  0 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-23  6:38 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson, dfaggioli,
	jbeulich, xen-devel

On 22/01/18 22:45, Konrad Rzeszutek Wilk wrote:
> On Mon, Jan 22, 2018 at 01:32:44PM +0100, Juergen Gross wrote:
>> As a preparation for doing page table isolation in the Xen hypervisor
>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>> 64 bit PV domains mapped to the per-domain virtual area.
>>
>> The per-vcpu stacks are used for early interrupt handling only. After
>> saving the domain's registers stacks are switched back to the normal
>> per physical cpu ones in order to be able to address on-stack data
>> from other cpus e.g. while handling IPIs.
>>
>> Adding %cr3 switching between saving of the registers and switching
>> the stacks will enable the possibility to run guest code without any
>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>> able to access other domains data.
>>
>> Without any further measures it will still be possible for e.g. a
>> guest's user program to read stack data of another vcpu of the same
>> domain, but this can be easily avoided by a little PV-ABI modification
>> introducing per-cpu user address spaces.
>>
>> This series is meant as a replacement for Andrew's patch series:
>> "x86: Prerequisite work for a Xen KAISER solution".
>>
>> What needs to be done:
>> - verify livepatching is still working
> 
> Is there an git repo for this?

https://github.com/jgross1/xen.git xpti


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23  6:34                 ` Juergen Gross
@ 2018-01-23  7:21                   ` Juergen Gross
  2018-01-23  8:53                   ` Jan Beulich
       [not found]                   ` <5A67061F02000078001A1669@suse.com>
  2 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-23  7:21 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: George.Dunlap, ian.jackson, wei.liu2, xen-devel, Dario Faggioli

On 23/01/18 07:34, Juergen Gross wrote:
> On 22/01/18 19:39, Andrew Cooper wrote:
>> On 22/01/18 16:51, Jan Beulich wrote:
>>>>>> On 22.01.18 at 16:00, <jgross@suse.com> wrote:
>>>> On 22/01/18 15:48, Jan Beulich wrote:
>>>>>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>>>>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>>>>>
>>>>>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>>>>>
>>>>>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>>>>>> able to access other domains data.
>>>>>>>>>>
>>>>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>>>>>> introducing per-cpu user address spaces.
>>>>>>>>>>
>>>>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>>>>> is a clear description of the meaningful additional protection this
>>>>>>>>> approach provides over the band-aid. For context see also
>>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>>>>> My approach supports mapping only the following data while the guest is
>>>>>>>> running (apart form the guest's own data, of course):
>>>>>>>>
>>>>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>>>>   guest's registers saved when an interrupt occurs
>>>>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>>>>> - the IDT
>>>>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>>>>>
>>>>>>>> All other hypervisor data and code can be completely hidden from the
>>>>>>>> guests.
>>>>>>> I understand that. What I'm not clear about is: Which parts of
>>>>>>> the additionally hidden data are actually necessary (or at least
>>>>>>> very desirable) to hide?
>>>>>> Necessary:
>>>>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>>>>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>>>>>   code emulator buffers
>>>>>> - other guests' register values e.g. in vcpu structure
>>>>> All of this is already being made invisible by the band-aid (with the
>>>>> exception of leftovers on the hypervisor stacks across context
>>>>> switches, which we've already said could be taken care of by
>>>>> memset()ing that area). I'm asking about the _additional_ benefits
>>>>> of your approach.
>>>> I'm quite sure the performance will be much better as it doesn't require
>>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>>> guest L4 table, similar to the Linux kernel KPTI approach.
>>> But isn't that model having the same synchronization issues upon
>>> guest L4 updates which Andrew was fighting with?
>>
>> (Condensing a lot of threads down into one)
>>
>> All the methods have L4 synchronisation update issues, until we have a
>> PV ABI which guarantees that L4's don't get reused.  Any improvements to
>> the shadowing/synchronisation algorithm will benefit all approaches.
>>
>> Juergen: you're now adding a LTR into the context switch path which
>> tends to be very slow.  I.e. As currently presented, this series
>> necessarily has a higher runtime overhead than Jan's XPTI.
> 
> Sure? How slow is LTR compared to a copy of nearly 4kB of data?

I just added some measurement code to ltr(). On my system ltr takes
about 320 cycles, so a little bit more than 100ns (2.9 GHz).

With 10,000 context switches per second and 2 ltr instructions per
context switch this would add up to about 0.2% performance loss.
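
(A quick back-of-the-envelope check of that estimate, reusing only the
numbers quoted above; this is just the arithmetic, not the measurement
code added to ltr():)

/* Recomputes the ~0.2% figure from the numbers given above. */
#include <stdio.h>

int main(void)
{
    double cycles_per_ltr   = 320.0;     /* measured value quoted above */
    double cpu_hz           = 2.9e9;     /* 2.9 GHz */
    double ctx_switches_sec = 10000.0;   /* assumed context switch rate */
    double ltr_per_switch   = 2.0;

    double ns_per_ltr = cycles_per_ltr / cpu_hz * 1e9;   /* ~110 ns */
    double lost_ratio = ctx_switches_sec * ltr_per_switch * ns_per_ltr / 1e9;

    printf("ltr: ~%.0f ns, overhead: ~%.2f%%\n",
           ns_per_ltr, lost_ratio * 100.0);
    return 0;
}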


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 19:02                   ` Andrew Cooper
@ 2018-01-23  8:36                     ` Jan Beulich
  2018-01-23 11:23                       ` Andrew Cooper
  2018-01-23 11:06                     ` George Dunlap
  1 sibling, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2018-01-23  8:36 UTC (permalink / raw)
  To: Andrew Cooper, George Dunlap
  Cc: Juergen Gross, wei.liu2, George.Dunlap, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 20:02, <andrew.cooper3@citrix.com> wrote:
> On 22/01/18 18:48, George Dunlap wrote:
>> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>>> Jan: As to the things not covered by the current XPTI, hiding most of
>>> the .text section is important to prevent fingerprinting or ROP
>>> scanning.  This is a defence-in-depth argument, but a guest being easily
>>> able to identify whether certain XSAs are fixed or not is quite bad. 
>> I'm afraid we have a fairly different opinion of what is "quite bad".
> 
> I suggest you try talking to some real users then.
> 
>> Suppose we handed users a knob and said, "If you flip this switch,
>> attackers won't be able to tell if you've fixed XSAs or not without
>> trying them; but it will slow down your guests 20%."  How many do you
>> think would flip it, and how many would reckon that an attacker could
>> probably find out that information anyway?
> 
> Nonsense.  The performance hit is already taken.  The argument is "do
> you want an attacker able to trivially evaluate security weaknesses in
> your hypervisor", a process which usually has to be done by guesswork
> and knowing the exact binary under attack.  Having .text fully readable
> lowers the barrier to entry substantially.

I neither agree with George's reply being nonsense, nor do I think
this is an appropriate tone. _Some_ performance hit is already
taken. Further hiding of information may incur further loss of
performance, or are you telling me you can guarantee this never
ever to happen? Additionally, the amount of "guesswork" may
heavily depend on the nature of a specific issue. I can imagine
cases where such guesswork may even turn out easier than using
some side channel approach like those recent ones.

As indicated earlier, I'm not fundamentally opposed to hiding
more things, but I'm also not convinced we should hide more stuff
regardless of the price to pay.

Jan



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23  5:50               ` Juergen Gross
@ 2018-01-23  8:40                 ` Jan Beulich
       [not found]                 ` <5A67030F02000078001A164B@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-23  8:40 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 23.01.18 at 06:50, <jgross@suse.com> wrote:
> On 22/01/18 17:51, Jan Beulich wrote:
>> But isn't that model having the same synchronization issues upon
>> guest L4 updates which Andrew was fighting with?
> 
> I don't think so, as the number of shadows will always only be max. 1
> with my approach.

How can I know that? The overview mail doesn't talk about the
intended shadowing algorithm afaics, and none of the patches
(judging by their titles) implements any part thereof. In
particular I'd be curious to know whether what you say will
hold also for guests not making use of the intended PV ABI
extension.

Jan



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23  6:34                 ` Juergen Gross
  2018-01-23  7:21                   ` Juergen Gross
@ 2018-01-23  8:53                   ` Jan Beulich
       [not found]                   ` <5A67061F02000078001A1669@suse.com>
  2 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-23  8:53 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, Andrew Cooper, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 23.01.18 at 07:34, <jgross@suse.com> wrote:
> On 22/01/18 19:39, Andrew Cooper wrote:
>> One of my concerns is that this patch series moves further away from the
>> secondary goal of my KAISER series, which was to have the IDT and GDT
>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>> leak which CPU you're currently scheduled on into PV guests and b) the
>> context switch code can drop a load of its slow instructions like LGDT
>> and the VMWRITEs to update the VMCS.
> 
> The GDT address of a PV vcpu is depending on vcpu_id only. I don't
> see why the IDT can't be mapped to the same address on each cpu with
> my approach.

You're not introducing a per-CPU range in the page tables afaics
(again from overview and titles only), yet with the IDT needing
to be per-CPU you'd also need a per-CPU range to map it to if
you want to avoid the LIDT as well as exposing what CPU you're
on (same goes for the GDT and the respective avoidance of LGDT
afaict).

Jan



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
       [not found]                   ` <5A67061F02000078001A1669@suse.com>
@ 2018-01-23  9:24                     ` Juergen Gross
  2018-01-23  9:31                       ` Jan Beulich
       [not found]                       ` <5A670F0E02000078001A16C9@suse.com>
  0 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-23  9:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, Andrew Cooper, ian.jackson,
	Dario Faggioli, xen-devel

On 23/01/18 09:53, Jan Beulich wrote:
>>>> On 23.01.18 at 07:34, <jgross@suse.com> wrote:
>> On 22/01/18 19:39, Andrew Cooper wrote:
>>> One of my concerns is that this patch series moves further away from the
>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>> context switch code can drop a load of its slow instructions like LGDT
>>> and the VMWRITEs to update the VMCS.
>>
>> The GDT address of a PV vcpu is depending on vcpu_id only. I don't
>> see why the IDT can't be mapped to the same address on each cpu with
>> my approach.
> 
> You're not introducing a per-CPU range in the page tables afaics
> (again from overview and titles only), yet with the IDT needing
> to be per-CPU you'd also need a per-CPU range to map it to if
> you want to avoid the LIDT as well as exposing what CPU you're
> on (same goes for the GDT and the respective avoidance of LGDT
> afaict).

After a quick look I don't see why a Meltdown mitigation can't use
the same IDT for all cpus: the only reason I could find for having
per-cpu IDTs seems to be in SVM code, so it seems to be AMD specific.
And AMD won't need XPTI at all.

The GDT of pv domains is already in the per-domain region even without
my patches, so I don't have to change anything regarding usage of LGDT.


Juergen


* Re: [PATCH RFC v2 02/12] x86: don't use hypervisor stack size for dumping guest stacks
  2018-01-22 12:32 ` [PATCH RFC v2 02/12] x86: don't use hypervisor stack size for dumping guest stacks Juergen Gross
@ 2018-01-23  9:26   ` Jan Beulich
       [not found]   ` <5A670DEF02000078001A16AF@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-23  9:26 UTC (permalink / raw)
  To: Andrew Cooper, Juergen Gross
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> show_guest_stack() and compat_show_guest_stack() stop dumping the
> stack of the guest whenever its virtual address reaches the same
> alignment which is used for the hypervisor stacks.
> 
> Remove this arbitrary limit and try to dump a fixed number of lines
> instead.

Hmm, I can see your point, but before looking at the change in detail
I think we need to agree on what behavior we want. Dumping
arbitrary data as if it was a part of the stack isn't very helpful, limiting
the risk of which is, I think, the reason for the way things currently
work (assuming that guest kernels won't have stacks larger than Xen
itself, and that they too would align them). What would perhaps be
better is for the guest to supply information about the restrictions it
enforces on its stacks, which Xen could then use here. In the
absence of such hints using the values currently being used would
possibly make sense.

Jan



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23  9:24                     ` Juergen Gross
@ 2018-01-23  9:31                       ` Jan Beulich
       [not found]                       ` <5A670F0E02000078001A16C9@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-23  9:31 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, Andrew Cooper, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 23.01.18 at 10:24, <jgross@suse.com> wrote:
> On 23/01/18 09:53, Jan Beulich wrote:
>>>>> On 23.01.18 at 07:34, <jgross@suse.com> wrote:
>>> On 22/01/18 19:39, Andrew Cooper wrote:
>>>> One of my concerns is that this patch series moves further away from the
>>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>>> context switch code can drop a load of its slow instructions like LGDT
>>>> and the VMWRITEs to update the VMCS.
>>>
>>> The GDT address of a PV vcpu is depending on vcpu_id only. I don't
>>> see why the IDT can't be mapped to the same address on each cpu with
>>> my approach.
>> 
>> You're not introducing a per-CPU range in the page tables afaics
>> (again from overview and titles only), yet with the IDT needing
>> to be per-CPU you'd also need a per-CPU range to map it to if
>> you want to avoid the LIDT as well as exposing what CPU you're
>> on (same goes for the GDT and the respective avoidance of LGDT
>> afaict).
> 
> After a quick look I don't see why a Meltdown mitigation can't use
> the same IDT for all cpus: the only reason I could find for having
> per-cpu IDTs seems to be in SVM code, so it seems to be AMD specific.
> And AMD won't need XPTI at all.

Isn't your RFC series allowing XPTI to be enabled even on AMD?

> The GDT of pv domains is already in the per-domain region even without
> my patches, so I don't have to change anything regarding usage of LGDT.

Andrew's point was that eliminating the LGDT is a secondary goal.

Jan



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
       [not found]                 ` <5A67030F02000078001A164B@suse.com>
@ 2018-01-23  9:45                   ` Juergen Gross
  0 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-23  9:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 23/01/18 09:40, Jan Beulich wrote:
>>>> On 23.01.18 at 06:50, <jgross@suse.com> wrote:
>> On 22/01/18 17:51, Jan Beulich wrote:
>>> But isn't that model having the same synchronization issues upon
>>> guest L4 updates which Andrew was fighting with?
>>
>> I don't think so, as the number of shadows will always only be max. 1
>> with my approach.
> 
> How can I know that? The overview mail doesn't talk about the
> intended shadowing algorithm afaics, and none of the patches
> (judging by their titles) implements any part thereof. In

Right. That's the reason I'm telling you about it.

> particular I'd be curious to know whether what you say will
> hold also for guests not making use of the intended PV ABI
> extension.

Those guests will still be vulnerable to cross-vcpu accesses to Xen
stacks regarding Meltdown. The Linux kernel is vulnerable in the same
way regarding its own stacks, so there is no new vulnerability added
for Linux running as a pv guest (I have to admit I don't know whether
the same applies to BSD).


Juergen



* Re: [PATCH RFC v2 02/12] x86: don't use hypervisor stack size for dumping guest stacks
       [not found]   ` <5A670DEF02000078001A16AF@suse.com>
@ 2018-01-23  9:58     ` Juergen Gross
  2018-01-23 10:11       ` Jan Beulich
       [not found]       ` <5A67187C02000078001A1742@suse.com>
  0 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-23  9:58 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 23/01/18 10:26, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> show_guest_stack() and compat_show_guest_stack() stop dumping the
>> stack of the guest whenever its virtual address reaches the same
>> alignment which is used for the hypervisor stacks.
>>
>> Remove this arbitrary limit and try to dump a fixed number of lines
>> instead.
> 
> Hmm, I can see your point, but before looking at the change in detail
> I think we need to agree on what behavior we want. Dumping
> arbitrary data as if it was a part of the stack isn't very helpful, limiting
> the risk of which is, I think, the reason for the way things currently
> work (assuming that guest kernels won't have stacks larger than Xen
> itself, and that they too would align them). What would perhaps be
> better is for the guest to supply information about the restrictions it
> enforces on its stacks, which Xen could then use here. In the
> absence of such hints using the values currently being used would
> possibly make sense.

Currently the stack dump will have the same fixed number of lines as
with my patch. I'm only removing the premature end of dumping whenever
the stack address crosses a 32kB boundary. Linux 64 bit pv guests use
a 16kB stack size, so using that boundary would be more natural.
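
(To make the difference concrete, a minimal sketch of the two
loop-termination conditions being debated follows; this is illustrative
C, not the actual show_guest_stack() code, and the line budget is an
assumed value:)

/* Illustrative only: not the actual show_guest_stack() implementation. */
#include <stdint.h>

#define DUMP_ENTRIES  64            /* assumed fixed line budget */
#define STACK_ALIGN   (32 * 1024)   /* the 32kB boundary discussed above */

static void dump_stopping_at_boundary(const uint64_t *sp)
{
    for (int i = 0; i < DUMP_ENTRIES; i++, sp++) {
        if (i && ((uintptr_t)sp & (STACK_ALIGN - 1)) == 0)
            break;                  /* old behaviour: stop at the boundary */
        /* ... print *sp ... */
    }
}

static void dump_fixed_number_of_lines(const uint64_t *sp)
{
    for (int i = 0; i < DUMP_ENTRIES; i++, sp++) {
        /* ... print *sp ... */     /* new behaviour: always dump the budget */
    }
}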


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
       [not found]                       ` <5A670F0E02000078001A16C9@suse.com>
@ 2018-01-23 10:10                         ` Juergen Gross
  2018-01-23 11:45                           ` Andrew Cooper
  0 siblings, 1 reply; 74+ messages in thread
From: Juergen Gross @ 2018-01-23 10:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, Andrew Cooper, ian.jackson,
	Dario Faggioli, xen-devel

On 23/01/18 10:31, Jan Beulich wrote:
>>>> On 23.01.18 at 10:24, <jgross@suse.com> wrote:
>> On 23/01/18 09:53, Jan Beulich wrote:
>>>>>> On 23.01.18 at 07:34, <jgross@suse.com> wrote:
>>>> On 22/01/18 19:39, Andrew Cooper wrote:
>>>>> One of my concerns is that this patch series moves further away from the
>>>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>>>> context switch code can drop a load of its slow instructions like LGDT
>>>>> and the VMWRITEs to update the VMCS.
>>>>
>>>> The GDT address of a PV vcpu is depending on vcpu_id only. I don't
>>>> see why the IDT can't be mapped to the same address on each cpu with
>>>> my approach.
>>>
>>> You're not introducing a per-CPU range in the page tables afaics
>>> (again from overview and titles only), yet with the IDT needing
>>> to be per-CPU you'd also need a per-CPU range to map it to if
>>> you want to avoid the LIDT as well as exposing what CPU you're
>>> on (same goes for the GDT and the respective avoidance of LGDT
>>> afaict).
>>
>> After a quick look I don't see why a Meltdown mitigation can't use
>> the same IDT for all cpus: the only reason I could find for having
>> per-cpu IDTs seems to be in SVM code, so it seems to be AMD specific.
>> And AMD won't need XPTI at all.
> 
> Isn't your RFC series allowing XPTI to be enabled even on AMD?

Yes, you are right. This might either want to be revisited, or the
address space activated for SVM domains could map an IDT with the
IST related traps removed.
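
(For illustration, one way to present "an IDT with IST related traps
removed" would be to hand SVM domains a copy of the IDT whose gates
have the IST field cleared. The gate layout below is the generic x86-64
one; the names are made up and this is not code from the series:)

/* Illustrative only: generic x86-64 gate layout, made-up names. */
#include <stdint.h>
#include <string.h>

struct idt_gate {
    uint16_t offset_lo;
    uint16_t selector;
    uint8_t  ist;         /* bits 0-2: IST index, 0 = use the current stack */
    uint8_t  type_attr;
    uint16_t offset_mid;
    uint32_t offset_hi;
    uint32_t reserved;
} __attribute__((packed));

static void build_no_ist_idt(struct idt_gate *dst, const struct idt_gate *src,
                             unsigned int entries)
{
    memcpy(dst, src, entries * sizeof(*dst));
    for (unsigned int i = 0; i < entries; i++)
        dst[i].ist &= ~0x7;   /* drop any IST reference from the gate */
}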

>> The GDT of pv domains is already in the per-domain region even without
>> my patches, so I don't have to change anything regarding usage of LGDT.
> 
> Andrew's point was that eliminating the LGDT is a secondary goal.

With per-cpu mappings this is surely an obvious optimization. In the
end the overall performance should be taken as the basis for a decision.
His main point was avoiding exposing data like the physical cpu number,
and this doesn't apply here, as the GDT is per vcpu in my case.


Juergen


* Re: [PATCH RFC v2 02/12] x86: don't use hypervisor stack size for dumping guest stacks
  2018-01-23  9:58     ` Juergen Gross
@ 2018-01-23 10:11       ` Jan Beulich
       [not found]       ` <5A67187C02000078001A1742@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-23 10:11 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, Andrew Cooper, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 23.01.18 at 10:58, <jgross@suse.com> wrote:
> On 23/01/18 10:26, Jan Beulich wrote:
>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>> show_guest_stack() and compat_show_guest_stack() stop dumping the
>>> stack of the guest whenever its virtual address reaches the same
>>> alignment which is used for the hypervisor stacks.
>>>
>>> Remove this arbitrary limit and try to dump a fixed number of lines
>>> instead.
>> 
>> Hmm, I can see your point, but before looking at the change in detail
>> I think we need to agree on what behavior we want. Dumping
>> arbitrary data as if it was a part of the stack isn't very helpful, limiting
>> the risk of which is, I think, the reason for the way things currently
>> work (assuming that guest kernels won't have stacks larger than Xen
>> itself, and that they too would align them). What would perhaps be
>> better is for the guest to supply information about the restrictions it
>> enforces on its stacks, which Xen could then use here. In the
>> absence of such hints using the values currently being used would
>> possibly make sense.
> 
> Currently the stack dump will have the same fixed number of lines as
> with my patch. I'm only removing the premature end of dumping whenever
> the stack address crosses a 32kB boundary. Linux 64 bit pv guests are
> using 16kB stack size. So using this boundary would be more natural.

IOW your change converts a 50:50 chance of dumping non-stack
data to 100% (in all cases where the stack pointer isn't far away from
the stack start).

Jan



* Re: [PATCH RFC v2 02/12] x86: don't use hypervisor stack size for dumping guest stacks
       [not found]       ` <5A67187C02000078001A1742@suse.com>
@ 2018-01-23 10:19         ` Juergen Gross
  0 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-23 10:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, Andrew Cooper, ian.jackson,
	Dario Faggioli, xen-devel

On 23/01/18 11:11, Jan Beulich wrote:
>>>> On 23.01.18 at 10:58, <jgross@suse.com> wrote:
>> On 23/01/18 10:26, Jan Beulich wrote:
>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>> show_guest_stack() and compat_show_guest_stack() stop dumping the
>>>> stack of the guest whenever its virtual address reaches the same
>>>> alignment which is used for the hypervisor stacks.
>>>>
>>>> Remove this arbitrary limit and try to dump a fixed number of lines
>>>> instead.
>>>
>>> Hmm, I can see your point, but before looking at the change in detail
>>> I think we need to agree on what behavior we want. Dumping
>>> arbitrary data as if it was a part of the stack isn't very helpful, limiting
>>> the risk of which is, I think, the reason for the way things currently
>>> work (assuming that guest kernels won't have stacks larger than Xen
>>> itself, and that they too would align them). What would perhaps be
>>> better is for the guest to supply information about the restrictions it
>>> enforces on its stacks, which Xen could then use here. In the
>>> absence of such hints using the values currently being used would
>>> possibly make sense.
>>
>> Currently the stack dump will have the same fixed number of lines as
>> with my patch. I'm only removing the premature end of dumping whenever
>> the stack address crosses a 32kB boundary. Linux 64 bit pv guests are
>> using 16kB stack size. So using this boundary would be more natural.
> 
> IOW your change converts a 50:50 chance of dumping non-stack
> data to 100% (all in case the stack pointer isn't far away from the
> stack start).

I'd rather dump some non-stack data than omit some stack data.

I can't see that show_guest_stack() is limited to guest kernel mode.
User stacks can be much larger than 32kB.


Juergen



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 19:02                   ` Andrew Cooper
  2018-01-23  8:36                     ` Jan Beulich
@ 2018-01-23 11:06                     ` George Dunlap
  1 sibling, 0 replies; 74+ messages in thread
From: George Dunlap @ 2018-01-23 11:06 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Juergen Gross
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 01/22/2018 07:02 PM, Andrew Cooper wrote:
> On 22/01/18 18:48, George Dunlap wrote:
>> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>>> On 22/01/18 16:51, Jan Beulich wrote:
>>>>>>> On 22.01.18 at 16:00, <jgross@suse.com> wrote:
>>>>> On 22/01/18 15:48, Jan Beulich wrote:
>>>>>>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>>>>>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>>>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>>>>>>
>>>>>>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>>>>>>
>>>>>>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>>>>>>> able to access other domains data.
>>>>>>>>>>>
>>>>>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>>>>>>> introducing per-cpu user address spaces.
>>>>>>>>>>>
>>>>>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>>>>>> is a clear description of the meaningful additional protection this
>>>>>>>>>> approach provides over the band-aid. For context see also
>>>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>>>>>> My approach supports mapping only the following data while the guest is
>>>>>>>>> running (apart form the guest's own data, of course):
>>>>>>>>>
>>>>>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>>>>>   guest's registers saved when an interrupt occurs
>>>>>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>>>>>> - the IDT
>>>>>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>>>>>>
>>>>>>>>> All other hypervisor data and code can be completely hidden from the
>>>>>>>>> guests.
>>>>>>>> I understand that. What I'm not clear about is: Which parts of
>>>>>>>> the additionally hidden data are actually necessary (or at least
>>>>>>>> very desirable) to hide?
>>>>>>> Necessary:
>>>>>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>>>>>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>>>>>>   code emulator buffers
>>>>>>> - other guests' register values e.g. in vcpu structure
>>>>>> All of this is already being made invisible by the band-aid (with the
>>>>>> exception of leftovers on the hypervisor stacks across context
>>>>>> switches, which we've already said could be taken care of by
>>>>>> memset()ing that area). I'm asking about the _additional_ benefits
>>>>>> of your approach.
>>>>> I'm quite sure the performance will be much better as it doesn't require
>>>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>>>> guest L4 table, similar to the Linux kernel KPTI approach.
>>>> But isn't that model having the same synchronization issues upon
>>>> guest L4 updates which Andrew was fighting with?
>>> (Condensing a lot of threads down into one)
>>>
>>> All the methods have L4 synchronisation update issues, until we have a
>>> PV ABI which guarantees that L4's don't get reused.  Any improvements to
>>> the shadowing/synchronisation algorithm will benefit all approaches.
>>>
>>> Juergen: you're now adding a LTR into the context switch path which
>>> tends to be very slow.  I.e. As currently presented, this series
>>> necessarily has a higher runtime overhead than Jan's XPTI.
>>>
>>> One of my concerns is that this patch series moves further away from the
>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>> context switch code can drop a load of its slow instructions like LGDT
>>> and the VMWRITEs to update the VMCS.
>>>
>>> Jan: As to the things not covered by the current XPTI, hiding most of
>>> the .text section is important to prevent fingerprinting or ROP
>>> scanning.  This is a defence-in-depth argument, but a guest being easily
>>> able to identify whether certain XSAs are fixed or not is quite bad. 
>> I'm afraid we have a fairly different opinion of what is "quite bad".
> 
> I suggest you try talking to some real users then.
> 
>> Suppose we handed users a knob and said, "If you flip this switch,
>> attackers won't be able to tell if you've fixed XSAs or not without
>> trying them; but it will slow down your guests 20%."  How many do you
>> think would flip it, and how many would reckon that an attacker could
>> probably find out that information anyway?
> 
> Nonsense.  The performance hit is already taken. 

You just said:

"Juergen: you're now adding a LTR into the context switch path which
tends to be very slow.  I.e. As currently presented, this series
necessarily has a higher runtime overhead than Jan's XPTI."

And:

"As to the things not covered by the current XPTI, hiding most of
the .text section is important..."

You've previously said that the overhead for your KAISER series was much
higher than Jan's "bandaid" XPTI series, and implied that Juergen's
approach would suffer the same fate.

This led me to infer:

1. The .text segment is not hidden in XPTI, but would be under your and
Juergen's approaches

2. The cost of hiding the .text segment, over and above XPTI stage 1,
according to our current best efforts, is significant (making up 20% as
a reasonable strawman).

In which case the performance hit is most certainly *not* already taken.

> The argument is "do
> you want an attacker able to trivially evaluate security weaknesses in
> your hypervisor", a process which usually has to be done by guesswork
> and knowing the exact binary under attack.  Having .text fully readable
> lowers the barrier to entry substantially.

And I can certainly see that some users would want to protect against
that.  But faced with an even higher performance hit, a significant
number of users would probably pass.

 -George


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23  8:36                     ` Jan Beulich
@ 2018-01-23 11:23                       ` Andrew Cooper
  0 siblings, 0 replies; 74+ messages in thread
From: Andrew Cooper @ 2018-01-23 11:23 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap
  Cc: Juergen Gross, wei.liu2, George.Dunlap, ian.jackson,
	Dario Faggioli, xen-devel

On 23/01/18 08:36, Jan Beulich wrote:
>>>> On 22.01.18 at 20:02, <andrew.cooper3@citrix.com> wrote:
>> On 22/01/18 18:48, George Dunlap wrote:
>>> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>>>> Jan: As to the things not covered by the current XPTI, hiding most of
>>>> the .text section is important to prevent fingerprinting or ROP
>>>> scanning.  This is a defence-in-depth argument, but a guest being easily
>>>> able to identify whether certain XSAs are fixed or not is quite bad. 
>>> I'm afraid we have a fairly different opinion of what is "quite bad".
>> I suggest you try talking to some real users then.
>>
>>> Suppose we handed users a knob and said, "If you flip this switch,
>>> attackers won't be able to tell if you've fixed XSAs or not without
>>> trying them; but it will slow down your guests 20%."  How many do you
>>> think would flip it, and how many would reckon that an attacker could
>>> probably find out that information anyway?
>> Nonsense.  The performance hit is already taken.  The argument is "do
>> you want an attacker able to trivially evaluate security weaknesses in
>> your hypervisor", a process which usually has to be done by guesswork
>> and knowing the exact binary under attack.  Having .text fully readable
>> lowers the barrier to entry substantially.
> I neither agree with George's reply being nonsense, nor do I think
> this is an appropriate tone. _Some_ performance hit is already
> taken. Further hiding of information may incur further loss of
> performance, or are you telling me you can guarantee this never
> ever to happen? Additionally, the amount of "guesswork" may
> heavily depend on the nature of a specific issue. I can imagine
> cases where such guesswork may even turn out easier than using
> some side channel approach like those recent ones.
>
> As indicated earlier, I'm not fundamentally opposed to hiding
> more things, but I'm also not convinced we should hide more stuff
> regardless of the price to pay.

Here is an example which comes with zero extra overhead.

Shuffle the virtual layout to put .text adjacent to MMCFG, and steal
some space (1G?) from the top of MMCFG for .entry.text and the per-cpu
stubs.  With some linker adjustments, relative jumps/references will
even work properly.

Anyone serious about security is not going to be happy with XPTI in its
current form, because being able to arbitrarily read .text is far too
valuable for an attacker.  Anyone serious about performance will turn
the whole lot off.

In some theoretical world with three options, only a fool would choose
the middle option, because a 10% hit is not going to be chosen lightly
in the first place, but there is no point taking the hit with the
reduced security.

~Andrew


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23 10:10                         ` Juergen Gross
@ 2018-01-23 11:45                           ` Andrew Cooper
  2018-01-23 13:31                             ` Juergen Gross
  0 siblings, 1 reply; 74+ messages in thread
From: Andrew Cooper @ 2018-01-23 11:45 UTC (permalink / raw)
  To: Juergen Gross, Jan Beulich
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 23/01/18 10:10, Juergen Gross wrote:
> On 23/01/18 10:31, Jan Beulich wrote:
>>>>> On 23.01.18 at 10:24, <jgross@suse.com> wrote:
>>> On 23/01/18 09:53, Jan Beulich wrote:
>>>>>>> On 23.01.18 at 07:34, <jgross@suse.com> wrote:
>>>>> On 22/01/18 19:39, Andrew Cooper wrote:
>>>>>> One of my concerns is that this patch series moves further away from the
>>>>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>>>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>>>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>>>>> context switch code can drop a load of its slow instructions like LGDT
>>>>>> and the VMWRITEs to update the VMCS.
>>>>> The GDT address of a PV vcpu is depending on vcpu_id only. I don't
>>>>> see why the IDT can't be mapped to the same address on each cpu with
>>>>> my approach.
>>>> You're not introducing a per-CPU range in the page tables afaics
>>>> (again from overview and titles only), yet with the IDT needing
>>>> to be per-CPU you'd also need a per-CPU range to map it to if
>>>> you want to avoid the LIDT as well as exposing what CPU you're
>>>> on (same goes for the GDT and the respective avoidance of LGDT
>>>> afaict).
>>> After a quick look I don't see why a Meltdown mitigation can't use
>>> the same IDT for all cpus: the only reason I could find for having
>>> per-cpu IDTs seems to be in SVM code, so it seems to be AMD specific.
>>> And AMD won't need XPTI at all.
>> Isn't your RFC series allowing XPTI to be enabled even on AMD?
> Yes, you are right. This might either want to be revisited or the
> address space to be activated for SVM domains could map an IDT with
> IST related traps removed.

I've experimented quite a lot in this area.  Ideally, we'd vmload/save
in the SVM critical region (like all other hypervisors) at which point
we don't need any adjustments to the IDT (as IST references are safe to
use), and we'd catch stack overflows in the #DF handler rather than
immediately triple faulting.

Using LIDT to switch between alternative IDTs, or INVLPG to swap the
mapping under a fixed linear address are both much slower than the
current implementation.

>
>>> The GDT of pv domains is already in the per-domain region even without
>>> my patches, so I don't have to change anything regarding usage of LGDT.
>> Andrew's point was that eliminating the LGDT is a secondary goal.
> With per-cpu mappings this is surely an obvious optimization. In the
> end the overall performance should be taken as base for a decision.
> His main point was avoiding exposing data like the physical cpu number
> and this doesn't apply here, as the GDT is per vcpu in my case.

The GDT leaks vcpu_id into guest userspace, which is similarly problematic.

The secondary goals of my KAISER series stand irrespective of the
Meltdown issues:
* The stack and mutable critical structures really should be numa-local
to the CPU using them.
* The GDT should sit fully fat over zeros.  At the moment in HVM
context, there are 14 frames of arbitrary directmap living within the
GDT limit.
* The IDT/GDT should exist at the same linear address on every pcpu to
avoid leaking information  (This property is what allows the removal of
the lgdt from the context switch path).
* The critical data structures should be mapped read only to make
exploitation harder for an attacker with a write-primitive.
* With the stack at the same linear address on each CPU, we don't need
the syscall stubs, and the TSS is identical on all cpus.

In some copious free time, it would be nice to fix these issues.

~Andrew


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 18:39               ` Andrew Cooper
  2018-01-22 18:48                 ` George Dunlap
  2018-01-23  6:34                 ` Juergen Gross
@ 2018-01-23 13:24                 ` Dario Faggioli
  2018-01-23 16:45                 ` George Dunlap
  3 siblings, 0 replies; 74+ messages in thread
From: Dario Faggioli @ 2018-01-23 13:24 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Juergen Gross
  Cc: George.Dunlap, ian.jackson, wei.liu2, xen-devel


Hey, Hi!

On Mon, 2018-01-22 at 18:39 +0000, Andrew Cooper wrote:
> > > > On 22.01.18 at 15:38, <jgross@suse.com> wrote:
> > > I'm quite sure the performance will be much better as it doesn't
> > > require
> > > per physical cpu L4 page tables, but just a shadow L4 table for
> > > each
> > > guest L4 table, similar to the Linux kernel KPTI approach.
> > 
> Juergen: you're now adding a LTR into the context switch path which
> tends to be very slow.  I.e. As currently presented, this series
> necessarily has a higher runtime overhead than Jan's XPTI.
> 
So, as Juergen mentioned, I'm trying to do some performance evaluation
of these solutions.

This is just the first set of numbers, so consider it preliminary. In
particular, I'm sure there is a better set of benchmarks than the ones
I've used for now (in order to have something quickly)... I am looking
more into this.

Anyway, what I'm seeing for now is that Juergen's branch performs
pretty much as current staging, if booted with xpti=false (i.e., with
Jan's band-aid compiled but disabled).

OTOH, staging with xpti=true does show some performance impact. I
appreciate that this is still an unfair comparison (as Juergen's series
lacks the "real XPTI" bits), but the goal here was to figure out
whether the current status of the series is already introducing
regressions or not (and, as far as this first set of benches says, it's
not).

Anyway, here are the numbers. The benchmarks are run in a 16-vCPU Debian
PV guest, on a 16-pCPU (Intel Xeon) Debian host.

Raw numbers:
https://openbenchmarking.org/result/1801238-AL-1801232AL05

Normalized against "Staging xpti=false"
https://openbenchmarking.org/result/1801238-AL-1801232AL05&obr_nor=y&obr_hgv=4.11+Staging+xpti%3Dfalse

You'll have to forgive me about the labels (I'll pick better titles
next time). Their meaning is as follows:
- "4.11 Staging xpti=false": current staging, booted with xpti=false
  (so, with Jan's band-aid applied, but disabled);
- "staging-xpti-on": current staging, booted with xpti=true
  (so, with Jan's band-aid applied, but enabled);
- "4.11 Juergen xpti": Juergen's GitHub branch, booted with xpti=true.

I'll post more as soon as I have it.
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23 11:45                           ` Andrew Cooper
@ 2018-01-23 13:31                             ` Juergen Gross
  0 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-23 13:31 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 23/01/18 12:45, Andrew Cooper wrote:
> On 23/01/18 10:10, Juergen Gross wrote:
>> On 23/01/18 10:31, Jan Beulich wrote:
>>>>>> On 23.01.18 at 10:24, <jgross@suse.com> wrote:
>>>> On 23/01/18 09:53, Jan Beulich wrote:
>>>>>>>> On 23.01.18 at 07:34, <jgross@suse.com> wrote:
>>>>>> On 22/01/18 19:39, Andrew Cooper wrote:
>>>>>>> One of my concerns is that this patch series moves further away from the
>>>>>>> secondary goal of my KAISER series, which was to have the IDT and GDT
>>>>>>> mapped at the same linear addresses on every CPU so a) SIDT/SGDT don't
>>>>>>> leak which CPU you're currently scheduled on into PV guests and b) the
>>>>>>> context switch code can drop a load of its slow instructions like LGDT
>>>>>>> and the VMWRITEs to update the VMCS.
>>>>>> The GDT address of a PV vcpu is depending on vcpu_id only. I don't
>>>>>> see why the IDT can't be mapped to the same address on each cpu with
>>>>>> my approach.
>>>>> You're not introducing a per-CPU range in the page tables afaics
>>>>> (again from overview and titles only), yet with the IDT needing
>>>>> to be per-CPU you'd also need a per-CPU range to map it to if
>>>>> you want to avoid the LIDT as well as exposing what CPU you're
>>>>> on (same goes for the GDT and the respective avoidance of LGDT
>>>>> afaict).
>>>> After a quick look I don't see why a Meltdown mitigation can't use
>>>> the same IDT for all cpus: the only reason I could find for having
>>>> per-cpu IDTs seems to be in SVM code, so it seems to be AMD specific.
>>>> And AMD won't need XPTI at all.
>>> Isn't your RFC series allowing XPTI to be enabled even on AMD?
>> Yes, you are right. This might either want to be revisited or the
>> address space to be activated for SVM domains could map an IDT with
>> IST related traps removed.
> 
> I've experimented quite a lot in this area.  Ideally, we'd vmload/save
> in the SVM critical region (like all other hypervisors) at which point
> we don't need any adjustments to the IDT (as IST references are safe to
> use), and we'd catch stack overflows in the #DF handler rather than
> immediately triple faulting.
> 
> Using LIDT to switch between alternative IDTs, or INVLPG to swap the
> mapping under a fixed linear address are both much slower than the
> current implementation.
> 
>>
>>>> The GDT of pv domains is already in the per-domain region even without
>>>> my patches, so I don't have to change anything regarding usage of LGDT.
>>> Andrew's point was that eliminating the LGDT is a secondary goal.
>> With per-cpu mappings this is surely an obvious optimization. In the
>> end the overall performance should be taken as base for a decision.
>> His main point was avoiding exposing data like the physical cpu number
>> and this doesn't apply here, as the GDT is per vcpu in my case.
> 
> The GDT leaks vcpu_id into guest userspace, which is similarly problematic.

Mind explaining this? Why is leaking the vcpu_id problematic?

> The secondary goals of my KAISER series stand irrespective of the
> Meltdown issues:
> * The stack and mutable critical structures really should be numa-local
> to the CPU using it.
> * The GDT should sit fully fat over zeros.  At the moment in HVM
> context, there are 14 frames of arbitrary directmap living within the
> GDT limit.
> * The IDT/GDT should exist at the same linear address on every pcpu to
> avoid leaking information (this property is what allows the removal of
> the lgdt from the context switch path).
> * The critical data structures should be mapped read-only to make
> exploitation harder for an attacker with a write primitive.
> * With the stack at the same linear address on each CPU, we don't need
> the syscall stubs, and the TSS is identical on all cpus.
> 
> In some copious free time, it would be nice to fix these issues.

As long as you can't solve the primary performance problem of your
approach for existing PV guests I don't see why the above tuning attempts
would make any sense.

I know for sure there are users out there who cannot switch to HVM
or PVH guests because they need more than 64 vcpus per guest. So before
tackling the above problems you really have to solve the large HVM guest
problem. And making it impossible for those users to continue using
PV guests by hurting performance this badly won't be an accepted "solution".


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-22 18:39               ` Andrew Cooper
                                   ` (2 preceding siblings ...)
  2018-01-23 13:24                 ` Dario Faggioli
@ 2018-01-23 16:45                 ` George Dunlap
  2018-01-23 16:56                   ` Juergen Gross
  3 siblings, 1 reply; 74+ messages in thread
From: George Dunlap @ 2018-01-23 16:45 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Juergen Gross
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 01/22/2018 06:39 PM, Andrew Cooper wrote:
> On 22/01/18 16:51, Jan Beulich wrote:
>>>>> On 22.01.18 at 16:00, <jgross@suse.com> wrote:
>>> On 22/01/18 15:48, Jan Beulich wrote:
>>>>>>> On 22.01.18 at 15:38, <jgross@suse.com> wrote:
>>>>> On 22/01/18 15:22, Jan Beulich wrote:
>>>>>>>>> On 22.01.18 at 15:18, <jgross@suse.com> wrote:
>>>>>>> On 22/01/18 13:50, Jan Beulich wrote:
>>>>>>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>>>>>>> As a preparation for doing page table isolation in the Xen hypervisor
>>>>>>>>> in order to mitigate "Meltdown" use dedicated stacks, GDT and TSS for
>>>>>>>>> 64 bit PV domains mapped to the per-domain virtual area.
>>>>>>>>>
>>>>>>>>> The per-vcpu stacks are used for early interrupt handling only. After
>>>>>>>>> saving the domain's registers stacks are switched back to the normal
>>>>>>>>> per physical cpu ones in order to be able to address on-stack data
>>>>>>>>> from other cpus e.g. while handling IPIs.
>>>>>>>>>
>>>>>>>>> Adding %cr3 switching between saving of the registers and switching
>>>>>>>>> the stacks will enable the possibility to run guest code without any
>>>>>>>>> per physical cpu mapping, i.e. avoiding the threat of a guest being
>>>>>>>>> able to access other domains data.
>>>>>>>>>
>>>>>>>>> Without any further measures it will still be possible for e.g. a
>>>>>>>>> guest's user program to read stack data of another vcpu of the same
>>>>>>>>> domain, but this can be easily avoided by a little PV-ABI modification
>>>>>>>>> introducing per-cpu user address spaces.
>>>>>>>>>
>>>>>>>>> This series is meant as a replacement for Andrew's patch series:
>>>>>>>>> "x86: Prerequisite work for a Xen KAISER solution".
>>>>>>>> Considering in particular the two reverts, what I'm missing here
>>>>>>>> is a clear description of the meaningful additional protection this
>>>>>>>> approach provides over the band-aid. For context see also
>>>>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg01735.html 
>>>>>>> My approach supports mapping only the following data while the guest is
>>>>>>> running (apart form the guest's own data, of course):
>>>>>>>
>>>>>>> - the per-vcpu entry stacks of the domain which will contain only the
>>>>>>>   guest's registers saved when an interrupt occurs
>>>>>>> - the per-vcpu GDTs and TSSs of the domain
>>>>>>> - the IDT
>>>>>>> - the interrupt handler code (arch/x86/x86_64/[compat/]entry.S
>>>>>>>
>>>>>>> All other hypervisor data and code can be completely hidden from the
>>>>>>> guests.
>>>>>> I understand that. What I'm not clear about is: Which parts of
>>>>>> the additionally hidden data are actually necessary (or at least
>>>>>> very desirable) to hide?
>>>>> Necessary:
>>>>> - other guests' memory (e.g. physical memory 1:1 mapping)
>>>>> - data from other guests e.g.in stack pages, debug buffers, I/O buffers,
>>>>>   code emulator buffers
>>>>> - other guests' register values e.g. in vcpu structure
>>>> All of this is already being made invisible by the band-aid (with the
>>>> exception of leftovers on the hypervisor stacks across context
>>>> switches, which we've already said could be taken care of by
>>>> memset()ing that area). I'm asking about the _additional_ benefits
>>>> of your approach.
>>> I'm quite sure the performance will be much better as it doesn't require
>>> per physical cpu L4 page tables, but just a shadow L4 table for each
>>> guest L4 table, similar to the Linux kernel KPTI approach.
>> But isn't that model having the same synchronization issues upon
>> guest L4 updates which Andrew was fighting with?
> 
> (Condensing a lot of threads down into one)
> 
> All the methods have L4 synchronisation update issues, until we have a
> PV ABI which guarantees that L4's don't get reused.  Any improvements to
> the shadowing/synchronisation algorithm will benefit all approaches.
> 
> Juergen: you're now adding a LTR into the context switch path which
> tends to be very slow.  I.e. As currently presented, this series
> necessarily has a higher runtime overhead than Jan's XPTI.

So here is a repeat of the "hypervisor compile" tests I did, comparing
the different XPTI-like series so far.

# Experimental setup:
Host:
 - Intel(R) Xeon(R) CPU E5630  @ 2.53GHz
 - 4 pcpus
 - Memory: 4GiB
Guest:
 - 4vcpus, 512MiB, blkback to raw file
 - CentOS 6 userspace
 - Linux 4.14 kernel with PV / PVH / PVHVM / KVM guest support (along
with expected drivers) built-in
Test:
 - cd xen-4.10.0
 - make -C xen clean
 - time make -j 4 xen

# Results
- In all cases, running a "default" build with CONFIG_DEBUG=n

* Staging, xpti=off
real    1m2.995s
user    2m52.527s
sys     0m40.276s

Result: 63s

* Staging [xpti default]
real    1m27.190s
user    3m3.900s
sys     1m42.686s

Result: 87s (38% overhead)

Note also that the "system time" here is about 2.5x of "xpti=off"; so
total wasted cpu time is significantly higher.

* Staging + "x86: slightly reduce Meltdown band-aid overhead"
real    1m21.661s
user    3m3.809s
sys     1m25.344s

Result: 81s (28% overhead)

NB that the "system time" here is significantly reduced from above, but
still nearly double of the "system time" for plain PV

* Above + "x86: reduce Meltdown band-aid overhead a little further"
real    1m21.357s
user    3m3.284s
sys     1m25.379s

Result: 81s (28% overhead)

No real change

* Staging + Juergen's v2 series
real    1m3.018s
user    2m52.217s
sys     0m40.357s

Result: 63s (0% overhead)

Unfortunately, I can't really verify that Juergen's patches are having
any effect; there's no printk indicating whether it's enabling the
mitigation or not.  I have verified that the changeset reported in `xl
dmesg` corresponds to the branch I have with the patches applied.

So it's *possible* something has gotten mixed up, and the mitigation
isn't being applied; but if it *is* applied, the performance is
significantly better than the "band-aid" XPTI.

 -George



* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23 16:45                 ` George Dunlap
@ 2018-01-23 16:56                   ` Juergen Gross
  2018-01-23 17:33                     ` George Dunlap
  0 siblings, 1 reply; 74+ messages in thread
From: Juergen Gross @ 2018-01-23 16:56 UTC (permalink / raw)
  To: George Dunlap, Andrew Cooper, Jan Beulich
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 23/01/18 17:45, George Dunlap wrote:
> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>> Juergen: you're now adding a LTR into the context switch path which
>> tends to be very slow.  I.e. As currently presented, this series
>> necessarily has a higher runtime overhead than Jan's XPTI.
> 
> So here is a repeat of the "hypervisor compile" tests I did, comparing
> the different XPTI-like series so far.
> 
> # Experimental setup:
> Host:
>  - Intel(R) Xeon(R) CPU E5630  @ 2.53GHz
>  - 4 pcpus
>  - Memory: 4GiB
> Guest:
>  - 4vcpus, 512MiB, blkback to raw file
>  - CentOS 6 userspace
>  - Linux 4.14 kernel with PV / PVH / PVHVM / KVM guest support (along
> with expected drivers) built-in
> Test:
>  - cd xen-4.10.0
>  - make -C xen clean
>  - time make -j 4 xen
> 

...

> * Staging + Juergen's v2 series
> real    1m3.018s
> user    2m52.217s
> sys     0m40.357s
> 
> Result: 63s (0% overhead)
> 
> Unfortunately, I can't really verify that Juergen's patches are having
> any effect; there's no printk indicating whether it's enabling the
> mitigation or not.  I have verified that the changeset reported in `xl
> dmesg` corresponds to the branch I have with the patches applied.
> 
> So it's *possible* something has gotten mixed up, and the mitigation
> isn't being applied; but if it *is* applied, the performance is
> significantly better than the "band-aid" XPTI.

As there is no real mitigation in place yet, but only the needed rework
of the interrupt handling and context switching, anything not close to
the xpti=off numbers would have been disappointing for me. :-)

I'll add some statistics in the next patches so it can be verified the
patches are really doing something.

Thanks for doing the tests,


Juergen


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23 16:56                   ` Juergen Gross
@ 2018-01-23 17:33                     ` George Dunlap
  2018-01-24  7:37                       ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: George Dunlap @ 2018-01-23 17:33 UTC (permalink / raw)
  To: Juergen Gross, Andrew Cooper, Jan Beulich
  Cc: wei.liu2, George.Dunlap, ian.jackson, Dario Faggioli, xen-devel

On 01/23/2018 04:56 PM, Juergen Gross wrote:
> On 23/01/18 17:45, George Dunlap wrote:
>> On 01/22/2018 06:39 PM, Andrew Cooper wrote:
>>> Juergen: you're now adding a LTR into the context switch path which
>>> tends to be very slow.  I.e. As currently presented, this series
>>> necessarily has a higher runtime overhead than Jan's XPTI.
>>
>> So here is a repeat of the "hypervisor compile" tests I did, comparing
>> the different XPTI-like series so far.
>>
>> # Experimental setup:
>> Host:
>>  - Intel(R) Xeon(R) CPU E5630  @ 2.53GHz
>>  - 4 pcpus
>>  - Memory: 4GiB
>> Guest:
>>  - 4vcpus, 512MiB, blkback to raw file
>>  - CentOS 6 userspace
>>  - Linux 4.14 kernel with PV / PVH / PVHVM / KVM guest support (along
>> with expected drivers) built-in
>> Test:
>>  - cd xen-4.10.0
>>  - make -C xen clean
>>  - time make -j 4 xen
>>
> 
> ...
> 
>> * Staging + Juergen's v2 series
>> real    1m3.018s
>> user    2m52.217s
>> sys     0m40.357s
>>
>> Result: 63s (0% overhead)
>>
>> Unfortunately, I can't really verify that Juergen's patches are having
>> any effect; there's no printk indicating whether it's enabling the
>> mitigation or not.  I have verified that the changeset reported in `xl
>> dmesg` corresponds to the branch I have with the patches applied.
>>
>> So it's *possible* something has gotten mixed up, and the mitigation
>> isn't being applied; but if it *is* applied, the performance is
>> significantly better than the "band-aid" XPTI.
> 
> As there is no real mitigation in place yet, but only the needed rework
> of the interrupt handling and context switching, anything not close to
> the xpti=off numbers would have been disappointing for me. :-)
> 
> I'll add some statistics in the next patches so it can be verified the
> patches are really doing something.

Well, at the very least there should be something in the boot scroll that
says, "Enabling Xen Pagetable protection (XPTI) for PV guests" or
something.  (That goes for the current round of XPTI as well really.)
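
For illustration, a minimal sketch of the kind of message meant here (the
helper name and its placement are assumptions, not part of the posted
series; opt_xpti and the XPTI_* values are the command line setting added
earlier in the series):

/* Hypothetical boot-time reporting helper -- a sketch only. */
static void __init xpti_report(void)
{
    printk("XPTI (Xen page table isolation): %s\n",
           opt_xpti == XPTI_OFF    ? "disabled" :
           opt_xpti == XPTI_ON     ? "enabled for all PV domains" :
           opt_xpti == XPTI_NODOM0 ? "enabled for PV domUs only" :
                                     "enabled by default (non-AMD only)");
}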

 -George


* Re: [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains
  2018-01-23 17:33                     ` George Dunlap
@ 2018-01-24  7:37                       ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-24  7:37 UTC (permalink / raw)
  To: George Dunlap
  Cc: Juergen Gross, wei.liu2, George.Dunlap, Andrew Cooper,
	ian.jackson, Dario Faggioli, xen-devel

>>> On 23.01.18 at 18:33, <george.dunlap@citrix.com> wrote:
> Well, at the very least there should be something in the boot scroll that
> says, "Enabling Xen Pagetable protection (XPTI) for PV guests" or
> something.  (That goes for the current round of XPTI as well really.)

And indeed I have this on my list of follow-up things, but didn't get
to it yet.

Jan



* Re: [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn
  2018-01-22 12:32 ` [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn Juergen Gross
@ 2018-01-29 17:06   ` Jan Beulich
       [not found]   ` <5A6F62B602000078001A3810@suse.com>
  2018-01-31 10:30   ` Jan Beulich
  2 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-29 17:06 UTC (permalink / raw)
  To: xen-devel, Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson, Dario Faggioli

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -1568,7 +1568,7 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
>  
>      /* Slot 260: Per-domain mappings (if applicable). */
>      l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
> -        d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW)
> +        d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR)
>            : l4e_empty();
>  
>      /* Slot 261-: text/data/bss, RW M2P, vmap, frametable, directmap. */
> @@ -5269,7 +5269,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
>          }
>          l2tab = __map_domain_page(pg);
>          clear_page(l2tab);
> -        l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR_RW);
> +        l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR);
>      }
>      else
>          l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]);
> @@ -5311,7 +5311,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
>                  l1tab = __map_domain_page(pg);
>              }
>              clear_page(l1tab);
> -            *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR_RW);
> +            *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR);

These changes (in the absence of the description saying otherwise)
leave open whether any of the per-domain mappings now suddenly
become executable.

> @@ -5401,6 +5401,81 @@ void destroy_perdomain_mapping(struct domain *d, unsigned long va,
>      unmap_domain_page(l3tab);
>  }
>  
> +void flipflags_perdomain_mapping(struct domain *d, unsigned long va,
> +                                 unsigned int flags)

Flipping flags means the caller has to know (perhaps track) what state
the flags are in at present. I think it would be better to pass in two
masks - one for flags to be set, and the other for flags to be cleared.
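
For illustration, a minimal sketch of the core of such an interface (the
helper name and exact shape are assumptions; the page table walk would be
the one flipflags_perdomain_mapping() already performs):

/* Sketch only: update the flags of an existing per-domain L1 entry by
 * clearing the bits in 'clear' and setting the bits in 'set', so the
 * caller does not need to track the current state of the flags. */
static void update_perdomain_l1e_flags(l1_pgentry_t *pl1e,
                                       unsigned int set, unsigned int clear)
{
    unsigned int flags = (l1e_get_flags(*pl1e) & ~clear) | set;

    *pl1e = l1e_from_pfn(l1e_get_pfn(*pl1e), flags);
}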

> +void addmfn_to_perdomain_mapping(struct domain *d, unsigned long va, mfn_t mfn)
> +{
> +    const l3_pgentry_t *l3tab, *pl3e;
> +
> +    ASSERT(va >= PERDOMAIN_VIRT_START &&
> +           va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
> +
> +    if ( !d->arch.perdomain_l3_pg )
> +        return;
> +
> +    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
> +    pl3e = l3tab + l3_table_offset(va);
> +
> +    if ( l3e_get_flags(*pl3e) & _PAGE_PRESENT )
> +    {
> +        const l2_pgentry_t *l2tab = map_l2t_from_l3e(*pl3e);
> +        const l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
> +
> +        if ( l2e_get_flags(*pl2e) & _PAGE_PRESENT )
> +        {
> +            l1_pgentry_t *l1tab = map_l1t_from_l2e(*pl2e);
> +            unsigned int off = l1_table_offset(va);
> +
> +            if ( (l1e_get_flags(l1tab[off]) & (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
> +                 (_PAGE_PRESENT | _PAGE_AVAIL0) )
> +                free_domheap_page(l1e_get_page(l1tab[off]));
> +
> +            l1tab[off] = l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW);
> +
> +            unmap_domain_page(l1tab);
> +        }
> +
> +        unmap_domain_page(l2tab);
> +    }
> +
> +    unmap_domain_page(l3tab);
> +}

Here even more than in the flipflags function - what if an
intermediate page table entry was not present? The caller will
have no idea that what was requested wasn't carried out.

Jan



* Re: [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn
       [not found]   ` <5A6F62B602000078001A3810@suse.com>
@ 2018-01-30  8:02     ` Juergen Gross
  2018-01-30  8:41       ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Juergen Gross @ 2018-01-30  8:02 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson, Dario Faggioli

On 29/01/18 18:06, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -1568,7 +1568,7 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
>>  
>>      /* Slot 260: Per-domain mappings (if applicable). */
>>      l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
>> -        d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW)
>> +        d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR)
>>            : l4e_empty();
>>  
>>      /* Slot 261-: text/data/bss, RW M2P, vmap, frametable, directmap. */
>> @@ -5269,7 +5269,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
>>          }
>>          l2tab = __map_domain_page(pg);
>>          clear_page(l2tab);
>> -        l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR_RW);
>> +        l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR);
>>      }
>>      else
>>          l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]);
>> @@ -5311,7 +5311,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
>>                  l1tab = __map_domain_page(pg);
>>              }
>>              clear_page(l1tab);
>> -            *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR_RW);
>> +            *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR);
> 
> These changes (in the absence of the description saying otherwise)
> leave open whether any of the per-domain mappings now suddenly
> become executable.

Are you fine with me adding something like the following to the commit
message:

As create_perdomain_mapping() creates L1 mappings with flags being
__PAGE_HYPERVISOR_RW, this won't cause any of the current per-domain
mappings to become executable.

> 
>> @@ -5401,6 +5401,81 @@ void destroy_perdomain_mapping(struct domain *d, unsigned long va,
>>      unmap_domain_page(l3tab);
>>  }
>>  
>> +void flipflags_perdomain_mapping(struct domain *d, unsigned long va,
>> +                                 unsigned int flags)
> 
> Flipping flags means the caller has to know (perhaps track) what state
> the flags are in at present. I think it would be better to pass in two
> masks - one for flags to be set, and the other for flags to be cleared.

Okay.

> 
>> +void addmfn_to_perdomain_mapping(struct domain *d, unsigned long va, mfn_t mfn)
>> +{
>> +    const l3_pgentry_t *l3tab, *pl3e;
>> +
>> +    ASSERT(va >= PERDOMAIN_VIRT_START &&
>> +           va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
>> +
>> +    if ( !d->arch.perdomain_l3_pg )
>> +        return;
>> +
>> +    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
>> +    pl3e = l3tab + l3_table_offset(va);
>> +
>> +    if ( l3e_get_flags(*pl3e) & _PAGE_PRESENT )
>> +    {
>> +        const l2_pgentry_t *l2tab = map_l2t_from_l3e(*pl3e);
>> +        const l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
>> +
>> +        if ( l2e_get_flags(*pl2e) & _PAGE_PRESENT )
>> +        {
>> +            l1_pgentry_t *l1tab = map_l1t_from_l2e(*pl2e);
>> +            unsigned int off = l1_table_offset(va);
>> +
>> +            if ( (l1e_get_flags(l1tab[off]) & (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
>> +                 (_PAGE_PRESENT | _PAGE_AVAIL0) )
>> +                free_domheap_page(l1e_get_page(l1tab[off]));
>> +
>> +            l1tab[off] = l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW);
>> +
>> +            unmap_domain_page(l1tab);
>> +        }
>> +
>> +        unmap_domain_page(l2tab);
>> +    }
>> +
>> +    unmap_domain_page(l3tab);
>> +}
> 
> Here even more than in the flipflags function - what if an
> intermediate page table entry was not present? The caller will
> have no idea that what was requested wasn't carried out.

I'll add returning -ENOENT for both functions in that case.
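
For illustration, a sketch of how the mfn variant could report that, derived
from the hunk quoted above (the exact error handling is an assumption):

int addmfn_to_perdomain_mapping(struct domain *d, unsigned long va, mfn_t mfn)
{
    const l3_pgentry_t *l3tab, *pl3e;
    int rc = -ENOENT;

    ASSERT(va >= PERDOMAIN_VIRT_START &&
           va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));

    if ( !d->arch.perdomain_l3_pg )
        return rc;

    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
    pl3e = l3tab + l3_table_offset(va);

    if ( l3e_get_flags(*pl3e) & _PAGE_PRESENT )
    {
        const l2_pgentry_t *l2tab = map_l2t_from_l3e(*pl3e);
        const l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);

        if ( l2e_get_flags(*pl2e) & _PAGE_PRESENT )
        {
            l1_pgentry_t *l1tab = map_l1t_from_l2e(*pl2e);
            unsigned int off = l1_table_offset(va);

            if ( (l1e_get_flags(l1tab[off]) & (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
                 (_PAGE_PRESENT | _PAGE_AVAIL0) )
                free_domheap_page(l1e_get_page(l1tab[off]));

            l1tab[off] = l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW);
            rc = 0;          /* the requested mapping is now in place */

            unmap_domain_page(l1tab);
        }

        unmap_domain_page(l2tab);
    }

    unmap_domain_page(l3tab);

    return rc;
}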


Juergen


* Re: [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn
  2018-01-30  8:02     ` Juergen Gross
@ 2018-01-30  8:41       ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-30  8:41 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 30.01.18 at 09:02, <jgross@suse.com> wrote:
> On 29/01/18 18:06, Jan Beulich wrote:
>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -1568,7 +1568,7 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
>>>  
>>>      /* Slot 260: Per-domain mappings (if applicable). */
>>>      l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
>>> -        d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW)
>>> +        d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR)
>>>            : l4e_empty();
>>>  
>>>      /* Slot 261-: text/data/bss, RW M2P, vmap, frametable, directmap. */
>>> @@ -5269,7 +5269,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
>>>          }
>>>          l2tab = __map_domain_page(pg);
>>>          clear_page(l2tab);
>>> -        l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR_RW);
>>> +        l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR);
>>>      }
>>>      else
>>>          l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]);
>>> @@ -5311,7 +5311,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
>>>                  l1tab = __map_domain_page(pg);
>>>              }
>>>              clear_page(l1tab);
>>> -            *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR_RW);
>>> +            *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR);
>> 
>> These changes (in the absence of the description saying otherwise)
>> leave open whether any of the per-domain mappings now suddenly
>> become executable.
> 
> Are you fine with me adding something like the following to the commit
> message:
> 
> As create_perdomain_mapping() creates L1 mappings with flags being
> __PAGE_HYPERVISOR_RW, this won't cause any of the current per-domain
> mappings to become executable.

That would seem to be sufficient, yes.

Jan



* Re: [PATCH RFC v2 05/12] x86: don't access saved user regs via rsp in trap handlers
  2018-01-22 12:32 ` [PATCH RFC v2 05/12] x86: don't access saved user regs via rsp in trap handlers Juergen Gross
@ 2018-01-30 14:49   ` Jan Beulich
       [not found]   ` <5A70941B02000078001A3BF0@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-30 14:49 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> In order to support switching stacks when entering the hypervisor for
> support of page table isolation, don't use %rsp for accessing the
> saved user registers, but do that via %rdi.

If this really turns out to be necessary ...

> @@ -58,20 +58,24 @@ compat_test_guest_events:
>          jmp   compat_test_all_events
>  
>          ALIGN
> -/* %rbx: struct vcpu */
> +/* %rbx: struct vcpu, %rdi: user_regs */
>  compat_process_softirqs:
>          sti
> +        pushq %rdi
>          call  do_softirq
> +        popq  %rdi
>          jmp   compat_test_all_events

... to avoid changes like this one (which unduly affect stack
alignment) you will want to consider using e.g. %r12 instead.

But concerning specifically the compat entry code, it's unclear to
me why you'd need to switch stacks there too.

> @@ -211,13 +218,15 @@ ENTRY(cstar_enter)
>          testl $~3,%esi
>          leal  (,%rcx,TBF_INTERRUPT),%ecx
>  UNLIKELY_START(z, compat_syscall_gpf)
> -        movq  VCPU_trap_ctxt(%rbx),%rdi
> -        movl  $TRAP_gp_fault,UREGS_entry_vector(%rsp)
> -        subl  $2,UREGS_rip(%rsp)
> +        pushq %rcx
> +        movq  VCPU_trap_ctxt(%rbx),%rcx
> +        movl  $TRAP_gp_fault,UREGS_entry_vector(%rdi)
> +        subl  $2,UREGS_rip(%rdi)
>          movl  $0,TRAPBOUNCE_error_code(%rdx)
> -        movl  TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_eip(%rdi),%eax
> -        movzwl TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_cs(%rdi),%esi
> -        testb $4,TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_flags(%rdi)
> +        movl  TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_eip(%rcx),%eax
> +        movzwl TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_cs(%rcx),%esi
> +        testb $4,TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_flags(%rcx)
> +        popq  %rcx

Is there really no register available, requiring you to push/pop
%rcx here?

> --- a/xen/include/asm-x86/current.h
> +++ b/xen/include/asm-x86/current.h
> @@ -95,9 +95,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>      ({                                                                  \
>          __asm__ __volatile__ (                                          \
>              "mov %0,%%"__OP"sp;"                                        \
> -            CHECK_FOR_LIVEPATCH_WORK                                      \
> -             "jmp %c1"                                                  \
> -            : : "r" (guest_cpu_user_regs()), "i" (__fn) : "memory" );   \
> +            "mov %1,%%"__OP"di;"                                        \
> +            "pushq %%"__OP"di;"                                         \
> +            CHECK_FOR_LIVEPATCH_WORK                                    \
> +            "popq %%"__OP"di;"                                          \
> +            "jmp %c2"                                                   \
> +            : : "r" (get_cpu_info()), "r" (guest_cpu_user_regs()),      \
> +                "i" (__fn) : "memory" );                                \
>          unreachable();                                                  \
>      })

If you want guest_cpu_user_regs() in %rdi, why don't you use
"D" as constraint? Why do you need to restore %rdi prior to the
final JMP? And why do you need the value in %rdi before calling
check_for_livepatch_work(), when the function takes no arguments?

Jan



* Re: [PATCH RFC v2 09/12] x86: enhance syscall stub to work in per-domain mapping
  2018-01-22 12:32 ` [PATCH RFC v2 09/12] x86: enhance syscall stub to work in per-domain mapping Juergen Gross
@ 2018-01-30 15:11   ` Jan Beulich
       [not found]   ` <5A70991902000078001A3C16@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-30 15:11 UTC (permalink / raw)
  To: xen-devel, Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson, Dario Faggioli

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/x86_64/traps.c
> +++ b/xen/arch/x86/x86_64/traps.c
> @@ -260,10 +260,11 @@ void do_double_fault(struct cpu_user_regs *regs)
>      panic("DOUBLE FAULT -- system shutdown");
>  }
>  
> -static unsigned int write_stub_trampoline(
> -    unsigned char *stub, unsigned long stub_va,
> -    unsigned long stack_bottom, unsigned long target_va)
> +void write_stub_trampoline(unsigned char *stub, unsigned long stub_va,
> +                           unsigned long stack_bottom, unsigned long target_va)

Why does the static go away?

> @@ -282,24 +283,32 @@ static unsigned int write_stub_trampoline(
>      /* pushq %rax */
>      stub[23] = 0x50;
>  
> -    /* jmp target_va */
> -    stub[24] = 0xe9;
> -    *(int32_t *)&stub[25] = target_va - (stub_va + 29);
> -
> -    /* Round up to a multiple of 16 bytes. */
> -    return 32;
> +    target_diff = target_va - (stub_va + 29);
> +    if ( target_diff >> 31 == target_diff >> 63 )
> +    {
> +        /* jmp target_va */
> +        stub[24] = 0xe9;
> +        *(int32_t *)&stub[25] = target_diff;
> +    }
> +    else
> +    {
> +        /* movabs target_va, %rax */
> +        stub[24] = 0x48;
> +        stub[25] = 0xb8;
> +        *(uint64_t *)&stub[26] = target_va;
> +        /* jmpq *%rax */
> +        stub[34] = 0xff;
> +        stub[35] = 0xe0;
> +    }

This clearly needs another solution, as you'd have to go through a
thunk now, and the thunk would be unreachable too.

>  }
>  
>  DEFINE_PER_CPU(struct stubs, stubs);
> -void lstar_enter(void);
> -void cstar_enter(void);

Why do these move into a header?

> @@ -312,10 +321,9 @@ void subarch_percpu_traps_init(void)
>       * start of the stubs.
>       */
>      wrmsrl(MSR_LSTAR, stub_va);
> -    offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
> -                                   stub_va, stack_bottom,
> -                                   (unsigned long)lstar_enter);
> -    stub_va += offset;
> +    write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK), stub_va,
> +                          stack_bottom, (unsigned long)lstar_enter);
> +    stub_va += STUB_TRAMPOLINE_SIZE_PERCPU;

The function may have written more than 32 bytes now; you'd
notice the breakage if you put a suitable BUILD_BUG_ON() into
the function. Otherwise I recommend you stick to the current
"return number of bytes written" model.

> @@ -328,12 +336,11 @@ void subarch_percpu_traps_init(void)
>  
>      /* Trampoline for SYSCALL entry from compatibility mode. */
>      wrmsrl(MSR_CSTAR, stub_va);
> -    offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
> -                                    stub_va, stack_bottom,
> -                                    (unsigned long)cstar_enter);
> +    write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK), stub_va,
> +                          stack_bottom, (unsigned long)cstar_enter);
>  
>      /* Don't consume more than half of the stub space here. */
> -    ASSERT(offset <= STUB_BUF_SIZE / 2);
> +    ASSERT(2 * STUB_TRAMPOLINE_SIZE_PERCPU <= STUB_BUF_SIZE / 2);

BUILD_BUG_ON() for compile time constants.
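
For illustration, a sketch of the compile-time checks being suggested (the
constants are those used in the series; where exactly they go is an
assumption):

/* In write_stub_trampoline(): the far-target branch writes up to
 * stub[35], i.e. 36 bytes, so every stub slot size must cover that --
 * if one of them doesn't, the build breaks, which is exactly what the
 * check is meant to catch. */
BUILD_BUG_ON(STUB_TRAMPOLINE_SIZE_PERCPU < 36);
BUILD_BUG_ON(STUB_TRAMPOLINE_SIZE_PERVCPU < 36);

/* In subarch_percpu_traps_init(): both operands are compile-time
 * constants, so the runtime ASSERT can become a build-time check. */
BUILD_BUG_ON(2 * STUB_TRAMPOLINE_SIZE_PERCPU > STUB_BUF_SIZE / 2);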

Jan



* Re: [PATCH RFC v2 06/12] x86: add a xpti command line parameter
  2018-01-22 12:32 ` [PATCH RFC v2 06/12] x86: add a xpti command line parameter Juergen Gross
@ 2018-01-30 15:39   ` Jan Beulich
       [not found]   ` <5A709FDF02000078001A3C2C@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-30 15:39 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> @@ -212,6 +249,24 @@ int pv_domain_initialise(struct domain *d, unsigned int domcr_flags,
>      /* 64-bit PV guest by default. */
>      d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
>  
> +    switch (opt_xpti)

Style: Xen coding style wants blanks inside the parentheses here, i.e.
switch ( opt_xpti ).

> +    {
> +    case XPTI_OFF:
> +        d->arch.pv_domain.xpti = false;
> +        break;
> +    case XPTI_ON:
> +        d->arch.pv_domain.xpti = true;
> +        break;
> +    case XPTI_NODOM0:
> +        d->arch.pv_domain.xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
> +                                 d->domain_id != 0 &&
> +                                 d->domain_id != hardware_domid;
> +        break;
> +    case XPTI_DEFAULT:
> +        d->arch.pv_domain.xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
> +        break;
> +    }

Why does a 32-bit domain need this?

Jan



* Re: [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries
  2018-01-22 12:32 ` [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries Juergen Gross
@ 2018-01-30 15:40   ` Jan Beulich
  2018-02-09 12:35     ` Juergen Gross
       [not found]   ` <5A70A01402000078001A3C30@suse.com>
  1 sibling, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2018-01-30 15:40 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> In case of XPTI being active for a pv-domain allocate and initialize
> per-vcpu stacks. The stacks are added to the per-domain mappings of
> the pv-domain.

Considering the intended use of these stacks (as per the overview
mail) I consider 32k per vCPU a non-negligible amount of extra memory
use.

> +static int pv_vcpu_init_xpti(struct vcpu *v)
> +{
> +    struct domain *d = v->domain;
> +    struct page_info *pg;
> +    void *ptr;
> +    struct cpu_info *info;
> +    unsigned long stack_bottom;
> +    int rc;
> +
> +    /* Populate page tables. */
> +    rc = create_perdomain_mapping(d, XPTI_START(v), STACK_PAGES,
> +                                  NIL(l1_pgentry_t *), NULL);
> +    if ( rc )
> +        goto done;
> +
> +    /* Map stacks. */
> +    rc = create_perdomain_mapping(d, XPTI_START(v), IST_MAX,
> +                                  NULL, NIL(struct page_info *));
> +    if ( rc )
> +        goto done;
> +
> +    ptr = alloc_xenheap_page();
> +    if ( !ptr )
> +    {
> +        rc = -ENOMEM;
> +        goto done;
> +    }
> +    clear_page(ptr);
> +    addmfn_to_perdomain_mapping(d, XPTI_START(v) + STACK_SIZE - PAGE_SIZE,
> +                                _mfn(virt_to_mfn(ptr)));

This can't be create_perdomain_mapping() because of ...? If it's
the Xen heap page you use here - that would be the next question:
Does it need to be such, rather than a domheap one? I do see ...

> +    info = (struct cpu_info *)((unsigned long)ptr + PAGE_SIZE) - 1;
> +    info->flags = ON_VCPUSTACK;
> +    v->arch.pv_vcpu.stack_regs = &info->guest_cpu_user_regs;

... this pointer, but without a clear picture on intended use it's
hard to judge.

> +    /* Map TSS. */
> +    rc = create_perdomain_mapping(d, XPTI_TSS(v), 1, NULL, &pg);
> +    if ( rc )
> +        goto done;
> +    info = (struct cpu_info *)(XPTI_START(v) + STACK_SIZE) - 1;

IIUC this is a pointer one absolutely must not de-reference. A bit
dangerous, I would say, especially since further up the same
variable is being de-referenced.

Also I would assume the TSS can be mapped r/o.

> +    stack_bottom = (unsigned long)&info->guest_cpu_user_regs.es;
> +    ptr = __map_domain_page(pg);
> +    tss_init(ptr, stack_bottom);
> +    unmap_domain_page(ptr);
> +
> +    /* Map stub trampolines. */
> +    rc = create_perdomain_mapping(d, XPTI_TRAMPOLINE(v), 1, NULL, &pg);
> +    if ( rc )
> +        goto done;
> +    ptr = __map_domain_page(pg);
> +    write_stub_trampoline((unsigned char *)ptr, XPTI_TRAMPOLINE(v),

I would be very surprised if you really needed the cast here.

> @@ -25,6 +25,21 @@
>   */
>  
>  /*
> + * The vcpu stacks used for XPTI are arranged similar to the physical cpu
> + * stacks with some modifications. The main difference are the primary stack
> + * size (only 1 page) and usage of the unused mappings for TSS and IDT.
> + *
> + * 7 - Primary stack (with a struct cpu_info at the top)
> + * 6 - unused
> + * 5 - TSS

Judging by the comment this might mean "TSS / IDT", or slots 4 or 6
might be used for the IDT. Otoh I don't see any IDT related logic in
pv_vcpu_init_xpti(). Please clarify this.

> @@ -37,10 +52,24 @@ struct vcpu;
>  
>  struct cpu_info {
>      struct cpu_user_regs guest_cpu_user_regs;
> -    unsigned int processor_id;
> -    struct vcpu *current_vcpu;
> -    unsigned long per_cpu_offset;
> -    unsigned long cr4;
> +    union {
> +        /* per physical cpu mapping */
> +        struct {
> +            struct vcpu *current_vcpu;
> +            unsigned long per_cpu_offset;
> +            unsigned long cr4;
> +        };
> +        /* per vcpu mapping (xpti) */
> +        struct {
> +            unsigned long pad1;
> +            unsigned long pad2;
> +            unsigned long stack_bottom_cpu;
> +        };

In order to avoid accidental use in the wrong context as much as
possible, I think you want to name both structures.
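
For illustration, a sketch of what naming both structures could look like
(the member names pcpu/vcpustack are assumptions):

    union {
        struct {                     /* per physical cpu mapping */
            struct vcpu *current_vcpu;
            unsigned long per_cpu_offset;
            unsigned long cr4;
        } pcpu;
        struct {                     /* per vcpu mapping (xpti) */
            unsigned long pad1;
            unsigned long pad2;
            unsigned long stack_bottom_cpu;
        } vcpustack;
    };

Accesses would then have to spell out which view they expect, e.g.
get_cpu_info()->pcpu.cr4 vs. get_cpu_info()->vcpustack.stack_bottom_cpu.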

Jan



* Re: [PATCH RFC v2 11/12] x86: modify interrupt handlers to support stack switching
  2018-01-22 12:32 ` [PATCH RFC v2 11/12] x86: modify interrupt handlers to support stack switching Juergen Gross
@ 2018-01-30 16:07   ` Jan Beulich
       [not found]   ` <5A70A63D02000078001A3C7C@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-30 16:07 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/x86_64/asm-offsets.c
> +++ b/xen/arch/x86/x86_64/asm-offsets.c
> @@ -137,6 +137,10 @@ void __dummy__(void)
>      OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
>      OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
>      OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
> +    OFFSET(CPUINFO_stack_bottom_cpu, struct cpu_info, stack_bottom_cpu);
> +    OFFSET(CPUINFO_flags, struct cpu_info, flags);
> +    DEFINE(ASM_ON_VCPUSTACK, ON_VCPUSTACK);
> +    DEFINE(ASM_VCPUSTACK_ACTIVE, VCPUSTACK_ACTIVE);

Seeing their uses in asm_defns.h it's not really clear to me why
you can't use the C constants there, especially since those uses
are inside C macros (which would perhaps better be assembler
ones). The latter doesn't even appear to be used in assembly
code.

> --- a/xen/arch/x86/x86_64/compat/entry.S
> +++ b/xen/arch/x86/x86_64/compat/entry.S
> @@ -19,6 +19,7 @@ ENTRY(entry_int82)
>          movl  $HYPERCALL_VECTOR, 4(%rsp)
>          SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
>          mov   %rsp, %rdi
> +        SWITCH_FROM_VCPU_STACK
>          CR4_PV32_RESTORE

Once again - why for compat mode guests?

> @@ -615,7 +623,9 @@ ENTRY(early_page_fault)
>          movl  $TRAP_page_fault,4(%rsp)
>          SAVE_ALL
>          movq  %rsp,%rdi
> +        SWITCH_FROM_VCPU_STACK

Why, in this context?

>          call  do_early_page_fault
> +        movq  %rsp, %rdi
>          jmp   restore_all_xen

Doesn't this belong in an earlier patch?

> --- a/xen/common/wait.c
> +++ b/xen/common/wait.c
> @@ -122,10 +122,10 @@ void wake_up_all(struct waitqueue_head *wq)
>  
>  static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>  {
> -    struct cpu_info *cpu_info = get_cpu_info();
> +    struct cpu_user_regs *user_regs = guest_cpu_user_regs();
>      struct vcpu *curr = current;
>      unsigned long dummy;
> -    u32 entry_vector = cpu_info->guest_cpu_user_regs.entry_vector;
> +    u32 entry_vector = user_regs->entry_vector;
>  
>      ASSERT(wqv->esp == 0);
>  
> @@ -160,7 +160,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>          "pop %%r11; pop %%r10; pop %%r9;  pop %%r8;"
>          "pop %%rbp; pop %%rdx; pop %%rbx; pop %%rax"
>          : "=&S" (wqv->esp), "=&c" (dummy), "=&D" (dummy)
> -        : "i" (PAGE_SIZE), "0" (0), "1" (cpu_info), "2" (wqv->stack)
> +        : "i" (PAGE_SIZE), "0" (0), "1" (user_regs), "2" (wqv->stack)
>          : "memory" );
>  
>      if ( unlikely(wqv->esp == 0) )
> @@ -169,7 +169,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>          domain_crash_synchronous();
>      }
>  
> -    cpu_info->guest_cpu_user_regs.entry_vector = entry_vector;
> +    user_regs->entry_vector = entry_vector;
>  }

I don't see how this change is related to the purpose of this patch,
or why the change is needed. All you do is utilize that
guest_cpu_user_regs is the first field of struct cpu_info afaics.

> --- a/xen/include/asm-x86/asm_defns.h
> +++ b/xen/include/asm-x86/asm_defns.h
> @@ -116,6 +116,25 @@ void ret_from_intr(void);
>          GET_STACK_END(reg);                       \
>          __GET_CURRENT(reg)
>  
> +#define SWITCH_FROM_VCPU_STACK                                           \
> +        GET_STACK_END(ax);                                               \
> +        testb $ASM_ON_VCPUSTACK, STACK_CPUINFO_FIELD(flags)(%rax);       \
> +        jz    1f;                                                        \
> +        movq  STACK_CPUINFO_FIELD(stack_bottom_cpu)(%rax), %rsp;         \
> +1:
> +
> +#define SWITCH_FROM_VCPU_STACK_IST                                       \
> +        GET_STACK_END(ax);                                               \
> +        testb $ASM_ON_VCPUSTACK, STACK_CPUINFO_FIELD(flags)(%rax);       \
> +        jz    1f;                                                        \
> +        subq  $(CPUINFO_sizeof - 1), %rax;                               \
> +        addq  CPUINFO_stack_bottom_cpu(%rax), %rsp;                      \
> +        subq  %rax, %rsp;                                                \

If I'm not mistaken, %rsp is complete rubbish for one instruction
here. While quite likely not a problem in practice, it would still
feel better if you went through an intermediate register. I also
think the calculation might then end up easier to follow. It'll also
make analysis of a crash easier if an NMI or #MC hits exactly at
this boundary.

> +1:
> +
> +#define SWITCH_TO_VCPU_STACK                                             \
> +        movq  %rdi, %rsp

For these additions as a whole: At least in new pieces of code
please avoid insn suffixes when they're redundant with registers
used.

> @@ -94,9 +95,16 @@ static inline struct cpu_info *get_cpu_info(void)
>  #define set_processor_id(id)  do {                                      \
>      struct cpu_info *ci__ = get_cpu_info();                             \
>      ci__->per_cpu_offset = __per_cpu_offset[ci__->processor_id = (id)]; \
> +    ci__->flags = 0;                                                    \
>  } while (0)

Not here, no. Considering other similar changes by recent patches
I can see the need for a helper doing that, but this shouldn't be
hidden in a completely unrelated macro.
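
For illustration, a minimal sketch of such a helper (name and placement are
assumptions):

/* Sketch only: make the (re)initialisation of cpu_info->flags explicit
 * at its call sites instead of hiding it in set_processor_id(). */
static inline void reset_stack_flags(void)
{
    get_cpu_info()->flags = 0;
}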

> -#define guest_cpu_user_regs() (&get_cpu_info()->guest_cpu_user_regs)
> +#define guest_cpu_user_regs() ({                                        \
> +    struct cpu_info *info = get_cpu_info();                             \

Please use a more macro-suitable name, e.g. ci__ as above.

Jan


* Re: [PATCH RFC v2 12/12] x86: activate per-vcpu stacks in case of xpti
  2018-01-22 12:32 ` [PATCH RFC v2 12/12] x86: activate per-vcpu stacks in case of xpti Juergen Gross
@ 2018-01-30 16:33   ` Jan Beulich
       [not found]   ` <5A70AC7F02000078001A3CA6@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-30 16:33 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> When scheduling a vcpu subject to xpti activate the per-vcpu stacks
> by loading the vcpu specific gdt and tss. When de-scheduling such a
> vcpu switch back to the per physical cpu gdt and tss.
> 
> Accessing the user registers on the stack is done via helpers as
> depending on XPTI active or not the registers are located either on
> the per-vcpu stack or on the default stack.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
>  xen/arch/x86/domain.c              | 76 +++++++++++++++++++++++++++++++++++---
>  xen/arch/x86/pv/domain.c           | 34 +++++++++++++++--
>  xen/include/asm-x86/desc.h         |  5 +++
>  xen/include/asm-x86/regs.h         |  2 +
>  4 files changed, 107 insertions(+), 10 deletions(-)
> 
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index da1bf1a97b..d75234ca35 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1585,9 +1585,28 @@ static inline bool need_full_gdt(const struct domain 
> *d)
>      return is_pv_domain(d) && !is_idle_domain(d);
>  }
>  
> +static void copy_user_regs_from_stack(struct vcpu *v)
> +{
> +    struct cpu_user_regs *stack_regs;

const

> +    stack_regs = (is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti)
> +                 ? v->arch.pv_vcpu.stack_regs
> +                 : &get_cpu_info()->guest_cpu_user_regs;

Ugly open coding of what previously was guest_cpu_user_regs().

> +    memcpy(&v->arch.user_regs, stack_regs, CTXT_SWITCH_STACK_BYTES);
> +}
> +
> +static void copy_user_regs_to_stack(struct vcpu *v)

const

> @@ -1635,7 +1654,7 @@ static void __context_switch(void)
>  
>      gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
>                                      per_cpu(compat_gdt_table, cpu);
> -    if ( need_full_gdt(nd) )
> +    if ( need_full_gdt(nd) && !nd->arch.pv_domain.xpti )
>      {
>          unsigned long mfn = virt_to_mfn(gdt);
>          l1_pgentry_t *pl1e = pv_gdt_ptes(n);
> @@ -1647,23 +1666,68 @@ static void __context_switch(void)
>      }
>  
>      if ( need_full_gdt(pd) &&
> -         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
> +         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd) ||
> +          pd->arch.pv_domain.xpti) )
>      {
>          gdt_desc.limit = LAST_RESERVED_GDT_BYTE;
>          gdt_desc.base  = (unsigned long)(gdt - FIRST_RESERVED_GDT_ENTRY);
>  
> +        if ( pd->arch.pv_domain.xpti )
> +            _set_tssldt_type(gdt + TSS_ENTRY - FIRST_RESERVED_GDT_ENTRY,
> +                             SYS_DESC_tss_avail);

Why is this not done in the if() after lgdt()?

>          lgdt(&gdt_desc);
> +
> +        if ( pd->arch.pv_domain.xpti )
> +        {
> +            unsigned long stub_va = this_cpu(stubs.addr);
> +
> +            ltr(TSS_ENTRY << 3);
> +            get_cpu_info()->flags &= ~VCPUSTACK_ACTIVE;
> +            wrmsrl(MSR_LSTAR, stub_va);
> +            wrmsrl(MSR_CSTAR, stub_va + STUB_TRAMPOLINE_SIZE_PERCPU);
> +            if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
> +                 boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR )
> +                wrmsrl(MSR_IA32_SYSENTER_ESP,
> +                       (unsigned long)&get_cpu_info()->guest_cpu_user_regs.es);

Why is this not - like below - &guest_cpu_user_regs()->es?

> +        }
>      }
>  
>      write_ptbase(n);
>  
>      if ( need_full_gdt(nd) &&
> -         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd)) )
> +         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd) ||
> +          nd->arch.pv_domain.xpti) )
>      {
>          gdt_desc.limit = LAST_RESERVED_GDT_BYTE;
>          gdt_desc.base = GDT_VIRT_START(n);
>  
> +        if ( nd->arch.pv_domain.xpti )
> +        {
> +            struct cpu_info *info;
> +
> +            gdt = (struct desc_struct *)GDT_VIRT_START(n);
> +            gdt[PER_CPU_GDT_ENTRY].a = cpu;
> +            _set_tssldt_type(gdt + TSS_ENTRY, SYS_DESC_tss_avail);
> +            info = (struct cpu_info *)(XPTI_START(n) + STACK_SIZE) - 1;
> +            info->stack_bottom_cpu = (unsigned long)guest_cpu_user_regs();
> +        }
> +
>          lgdt(&gdt_desc);
> +
> +        if ( nd->arch.pv_domain.xpti )
> +        {
> +            unsigned long stub_va = XPTI_TRAMPOLINE(n);
> +
> +            ltr(TSS_ENTRY << 3);
> +            get_cpu_info()->flags |= VCPUSTACK_ACTIVE;
> +            wrmsrl(MSR_LSTAR, stub_va);
> +            wrmsrl(MSR_CSTAR, stub_va + STUB_TRAMPOLINE_SIZE_PERVCPU);
> +            if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
> +                 boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR )
> +                wrmsrl(MSR_IA32_SYSENTER_ESP,
> +                       (unsigned long)&guest_cpu_user_regs()->es);
> +        }

So on a switch from PV to PV you add two LTR and 6 WRMSR. Quite
a lot, and I'm not at all convinced that this double writing is all really
needed in such a case.

> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -133,10 +133,36 @@ int switch_compat(struct domain *d)
>  
>  static int pv_create_gdt_ldt_l1tab(struct vcpu *v)
>  {
> -    return create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
> -                                    1U << GDT_LDT_VCPU_SHIFT,
> -                                    v->domain->arch.pv_domain.gdt_ldt_l1tab,
> -                                    NULL);
> +    int rc;
> +
> +    rc = create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
> +                                  1U << GDT_LDT_VCPU_SHIFT,
> +                                  v->domain->arch.pv_domain.gdt_ldt_l1tab,
> +                                  NULL);
> +    if ( !rc && v->domain->arch.pv_domain.xpti )
> +    {
> +        struct desc_struct *gdt;
> +        struct page_info *gdt_pg;
> +
> +        BUILD_BUG_ON(NR_RESERVED_GDT_PAGES > 1);
> +        gdt = (struct desc_struct *)GDT_VIRT_START(v) +
> +              FIRST_RESERVED_GDT_ENTRY;
> +        rc = create_perdomain_mapping(v->domain, (unsigned long)gdt,
> +                                      NR_RESERVED_GDT_PAGES,
> +                                      NULL, &gdt_pg);
> +        if ( !rc )
> +        {
> +            gdt = __map_domain_page(gdt_pg);
> +            memcpy(gdt, boot_cpu_gdt_table, NR_RESERVED_GDT_BYTES);
> +            _set_tssldt_desc(gdt + TSS_ENTRY - FIRST_RESERVED_GDT_ENTRY,
> +                         XPTI_TSS(v),
> +                         offsetof(struct tss_struct, __cacheline_filler) - 1,
> +                         SYS_DESC_tss_avail);
> +            unmap_domain_page(gdt);
> +        }
> +    }
> +
> +    return rc;
>  }

Since you fiddle with the GDT anyway during context switch - do
you really need to allocate another page here, rather than simply
mapping the pCPU's GDT page into the vCPU's per-domain area?
That would also eliminate a concern regarding changes being made
to the GDT after a domain was created.

Jan


* Re: [PATCH RFC v2 05/12] x86: don't access saved user regs via rsp in trap handlers
       [not found]   ` <5A70941B02000078001A3BF0@suse.com>
@ 2018-01-30 16:33     ` Juergen Gross
  0 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-30 16:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 30/01/18 15:49, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> In order to support switching stacks when entering the hypervisor for
>> support of page table isolation, don't use %rsp for accessing the
>> saved user registers, but do that via %rdi.
> 
> If this really turns out to be necessary ...
> 
>> @@ -58,20 +58,24 @@ compat_test_guest_events:
>>          jmp   compat_test_all_events
>>  
>>          ALIGN
>> -/* %rbx: struct vcpu */
>> +/* %rbx: struct vcpu, %rdi: user_regs */
>>  compat_process_softirqs:
>>          sti
>> +        pushq %rdi
>>          call  do_softirq
>> +        popq  %rdi
>>          jmp   compat_test_all_events
> 
> ... to avoid changes like this one (which unduly affect stack
> alignment) you will want to consider using e.g. %r12 instead.

Right. I have this already on my agenda for the next version of the
patches.

> But concerning specifically the compat entry code, it's unclear to
> me why you'd need to switch stacks there too.

That was just for consistency. I can drop that if you prefer.

> 
>> @@ -211,13 +218,15 @@ ENTRY(cstar_enter)
>>          testl $~3,%esi
>>          leal  (,%rcx,TBF_INTERRUPT),%ecx
>>  UNLIKELY_START(z, compat_syscall_gpf)
>> -        movq  VCPU_trap_ctxt(%rbx),%rdi
>> -        movl  $TRAP_gp_fault,UREGS_entry_vector(%rsp)
>> -        subl  $2,UREGS_rip(%rsp)
>> +        pushq %rcx
>> +        movq  VCPU_trap_ctxt(%rbx),%rcx
>> +        movl  $TRAP_gp_fault,UREGS_entry_vector(%rdi)
>> +        subl  $2,UREGS_rip(%rdi)
>>          movl  $0,TRAPBOUNCE_error_code(%rdx)
>> -        movl  TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_eip(%rdi),%eax
>> -        movzwl TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_cs(%rdi),%esi
>> -        testb $4,TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_flags(%rdi)
>> +        movl  TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_eip(%rcx),%eax
>> +        movzwl TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_cs(%rcx),%esi
>> +        testb $4,TRAP_gp_fault * TRAPINFO_sizeof + TRAPINFO_flags(%rcx)
>> +        popq  %rcx
> 
> Is there really no register available, requiring you to push/pop
> %rcx here?

With switching from %rdi to e.g. %r12 this is no longer an issue.

> 
>> --- a/xen/include/asm-x86/current.h
>> +++ b/xen/include/asm-x86/current.h
>> @@ -95,9 +95,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>      ({                                                                  \
>>          __asm__ __volatile__ (                                          \
>>              "mov %0,%%"__OP"sp;"                                        \
>> -            CHECK_FOR_LIVEPATCH_WORK                                      \
>> -             "jmp %c1"                                                  \
>> -            : : "r" (guest_cpu_user_regs()), "i" (__fn) : "memory" );   \
>> +            "mov %1,%%"__OP"di;"                                        \
>> +            "pushq %%"__OP"di;"                                         \
>> +            CHECK_FOR_LIVEPATCH_WORK                                    \
>> +            "popq %%"__OP"di;"                                          \
>> +            "jmp %c2"                                                   \
>> +            : : "r" (get_cpu_info()), "r" (guest_cpu_user_regs()),      \
>> +                "i" (__fn) : "memory" );                                \
>>          unreachable();                                                  \
>>      })
> 
> If you want guest_cpu_user_regs() in %rdi, why don't you use
> "D" as constraint? Why do you need to restore %rdi prior to the
> final JMP? And why do you need the value in %rdi before calling
> check_for_livepatch_work(), when the function takes no arguments?

Will change.


Juergen



* Re: [PATCH RFC v2 09/12] x86: enhance syscall stub to work in per-domain mapping
       [not found]   ` <5A70991902000078001A3C16@suse.com>
@ 2018-01-30 16:50     ` Juergen Gross
  0 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-30 16:50 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson, Dario Faggioli

On 30/01/18 16:11, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/x86_64/traps.c
>> +++ b/xen/arch/x86/x86_64/traps.c
>> @@ -260,10 +260,11 @@ void do_double_fault(struct cpu_user_regs *regs)
>>      panic("DOUBLE FAULT -- system shutdown");
>>  }
>>  
>> -static unsigned int write_stub_trampoline(
>> -    unsigned char *stub, unsigned long stub_va,
>> -    unsigned long stack_bottom, unsigned long target_va)
>> +void write_stub_trampoline(unsigned char *stub, unsigned long stub_va,
>> +                           unsigned long stack_bottom, unsigned long target_va)
> 
> Why does the static go away?

I'll need it in patch 10.

> 
>> @@ -282,24 +283,32 @@ static unsigned int write_stub_trampoline(
>>      /* pushq %rax */
>>      stub[23] = 0x50;
>>  
>> -    /* jmp target_va */
>> -    stub[24] = 0xe9;
>> -    *(int32_t *)&stub[25] = target_va - (stub_va + 29);
>> -
>> -    /* Round up to a multiple of 16 bytes. */
>> -    return 32;
>> +    target_diff = target_va - (stub_va + 29);
>> +    if ( target_diff >> 31 == target_diff >> 63 )
>> +    {
>> +        /* jmp target_va */
>> +        stub[24] = 0xe9;
>> +        *(int32_t *)&stub[25] = target_diff;
>> +    }
>> +    else
>> +    {
>> +        /* movabs target_va, %rax */
>> +        stub[24] = 0x48;
>> +        stub[25] = 0xb8;
>> +        *(uint64_t *)&stub[26] = target_va;
>> +        /* jmpq *%rax */
>> +        stub[34] = 0xff;
>> +        stub[35] = 0xe0;
>> +    }
> 
> This clearly needs another solution, as you'd have to go through a
> thunk now, and the thunk would be unreachable too.

Aah, right. So maybe it would be better not to share the code for
writing the stub page with XPTI.

I'll replace this patch with one adding a new function for XPTI.


Juergen

> 
>>  }
>>  
>>  DEFINE_PER_CPU(struct stubs, stubs);
>> -void lstar_enter(void);
>> -void cstar_enter(void);
> 
> Why do these move into a header?
> 
>> @@ -312,10 +321,9 @@ void subarch_percpu_traps_init(void)
>>       * start of the stubs.
>>       */
>>      wrmsrl(MSR_LSTAR, stub_va);
>> -    offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>> -                                   stub_va, stack_bottom,
>> -                                   (unsigned long)lstar_enter);
>> -    stub_va += offset;
>> +    write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK), stub_va,
>> +                          stack_bottom, (unsigned long)lstar_enter);
>> +    stub_va += STUB_TRAMPOLINE_SIZE_PERCPU;
> 
> The function may have written more than 32 bytes now; you'd
> notice the breakage if you put a suitable BUILD_BUG_ON() into
> the function. Otherwise I recommend you stick to the current
> "return number of bytes written" model.
> 
>> @@ -328,12 +336,11 @@ void subarch_percpu_traps_init(void)
>>  
>>      /* Trampoline for SYSCALL entry from compatibility mode. */
>>      wrmsrl(MSR_CSTAR, stub_va);
>> -    offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
>> -                                    stub_va, stack_bottom,
>> -                                    (unsigned long)cstar_enter);
>> +    write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK), stub_va,
>> +                          stack_bottom, (unsigned long)cstar_enter);
>>  
>>      /* Don't consume more than half of the stub space here. */
>> -    ASSERT(offset <= STUB_BUF_SIZE / 2);
>> +    ASSERT(2 * STUB_TRAMPOLINE_SIZE_PERCPU <= STUB_BUF_SIZE / 2);
> 
> BUILD_BUG_ON() for compile time constants.
> 
> Jan
> 



* Re: [PATCH RFC v2 06/12] x86: add a xpti command line parameter
       [not found]   ` <5A709FDF02000078001A3C2C@suse.com>
@ 2018-01-30 16:51     ` Juergen Gross
  0 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-30 16:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 30/01/18 16:39, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> @@ -212,6 +249,24 @@ int pv_domain_initialise(struct domain *d, unsigned int domcr_flags,
>>      /* 64-bit PV guest by default. */
>>      d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
>>  
>> +    switch (opt_xpti)
> 
> Style.
> 
>> +    {
>> +    case XPTI_OFF:
>> +        d->arch.pv_domain.xpti = false;
>> +        break;
>> +    case XPTI_ON:
>> +        d->arch.pv_domain.xpti = true;
>> +        break;
>> +    case XPTI_NODOM0:
>> +        d->arch.pv_domain.xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
>> +                                 d->domain_id != 0 &&
>> +                                 d->domain_id != hardware_domid;
>> +        break;
>> +    case XPTI_DEFAULT:
>> +        d->arch.pv_domain.xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
>> +        break;
>> +    }
> 
> Why does a 32-bit domain need this?

It doesn't. In my current version I have moved this initialization and
it will never run for 32 bit domains.


Juergen


* Re: [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries
       [not found]   ` <5A70A01402000078001A3C30@suse.com>
@ 2018-01-30 17:12     ` Juergen Gross
  2018-01-31 10:18       ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Juergen Gross @ 2018-01-30 17:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 30/01/18 16:40, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> In case of XPTI being active for a pv-domain allocate and initialize
>> per-vcpu stacks. The stacks are added to the per-domain mappings of
>> the pv-domain.
> 
> Considering the intended use of these stacks (as per the overview
> mail) I consider 32k per vCPU a non-negligible amount of extra memory
> use.

Maybe I can shrink this by putting multiple entry stacks into one page.
In the end I only need a struct cpu_info and maybe some spare space for
each stack.
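
A minimal sketch of that idea (slot size, macro and helper names are made
up here, not part of the series): with a struct cpu_info at the top of each
slot, several per-vcpu entry stacks could share a single page, e.g.:

#define ENTRY_STACK_BYTES      1024    /* assumed per-vcpu slot size */
#define ENTRY_STACKS_PER_PAGE  (PAGE_SIZE / ENTRY_STACK_BYTES)

static inline struct cpu_info *entry_stack_info(void *page, unsigned int slot)
{
    /* struct cpu_info sits at the top of its slot, just like on the
     * regular hypervisor stack. */
    unsigned long top = (unsigned long)page + (slot + 1) * ENTRY_STACK_BYTES;

    return (struct cpu_info *)top - 1;
}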

> 
>> +static int pv_vcpu_init_xpti(struct vcpu *v)
>> +{
>> +    struct domain *d = v->domain;
>> +    struct page_info *pg;
>> +    void *ptr;
>> +    struct cpu_info *info;
>> +    unsigned long stack_bottom;
>> +    int rc;
>> +
>> +    /* Populate page tables. */
>> +    rc = create_perdomain_mapping(d, XPTI_START(v), STACK_PAGES,
>> +                                  NIL(l1_pgentry_t *), NULL);
>> +    if ( rc )
>> +        goto done;
>> +
>> +    /* Map stacks. */
>> +    rc = create_perdomain_mapping(d, XPTI_START(v), IST_MAX,
>> +                                  NULL, NIL(struct page_info *));
>> +    if ( rc )
>> +        goto done;
>> +
>> +    ptr = alloc_xenheap_page();
>> +    if ( !ptr )
>> +    {
>> +        rc = -ENOMEM;
>> +        goto done;
>> +    }
>> +    clear_page(ptr);
>> +    addmfn_to_perdomain_mapping(d, XPTI_START(v) + STACK_SIZE - PAGE_SIZE,
>> +                                _mfn(virt_to_mfn(ptr)));
> 
> This can't be create_perdomain_mapping() because of ...? If it's
> the Xen heap page you use here - that would be the next question:
> Does it need to be such, rather than a domheap one? I do see ...

I need to reference the user regs in __context_switch() before
switching to the new address space (otherwise I'd have had to rework
__context_switch(), which I wanted to avoid).

> 
>> +    info = (struct cpu_info *)((unsigned long)ptr + PAGE_SIZE) - 1;
>> +    info->flags = ON_VCPUSTACK;
>> +    v->arch.pv_vcpu.stack_regs = &info->guest_cpu_user_regs;
> 
> ... this pointer, but without a clear picture on intended use it's
> hard to judge.

See patch 12.

> 
>> +    /* Map TSS. */
>> +    rc = create_perdomain_mapping(d, XPTI_TSS(v), 1, NULL, &pg);
>> +    if ( rc )
>> +        goto done;
>> +    info = (struct cpu_info *)(XPTI_START(v) + STACK_SIZE) - 1;
> 
> Iiuc this is a pointer one absolutely must not de-reference. A bit
> dangerous, I would say, the more that further up the same
> variable is being de-referenced.

Okay, I'll add another variable for this purpose.

> 
> Also I would assume the TSS can be mapped r/o.

Right.

> 
>> +    stack_bottom = (unsigned long)&info->guest_cpu_user_regs.es;
>> +    ptr = __map_domain_page(pg);
>> +    tss_init(ptr, stack_bottom);
>> +    unmap_domain_page(ptr);
>> +
>> +    /* Map stub trampolines. */
>> +    rc = create_perdomain_mapping(d, XPTI_TRAMPOLINE(v), 1, NULL, &pg);
>> +    if ( rc )
>> +        goto done;
>> +    ptr = __map_domain_page(pg);
>> +    write_stub_trampoline((unsigned char *)ptr, XPTI_TRAMPOLINE(v),
> 
> I would be very surprised if you really needed the cast here.

Oh, this is a leftover from a previous version where ptr was char *.

> 
>> @@ -25,6 +25,21 @@
>>   */
>>  
>>  /*
>> + * The vcpu stacks used for XPTI are arranged similar to the physical cpu
>> + * stacks with some modifications. The main difference are the primary stack
>> + * size (only 1 page) and usage of the unused mappings for TSS and IDT.
>> + *
>> + * 7 - Primary stack (with a struct cpu_info at the top)
>> + * 6 - unused
>> + * 5 - TSS
> 
> Judging by the comment this might mean "TSS / IDT", or slots 4 or 6
> might be used for the IDT. Otoh I don't see any IDT related logic in
> pv_vcpu_init_xpti(). Please clarify this.

Oh yes. I'll remove the IDT related comments, as I think I can just map
the original IDT.

> 
>> @@ -37,10 +52,24 @@ struct vcpu;
>>  
>>  struct cpu_info {
>>      struct cpu_user_regs guest_cpu_user_regs;
>> -    unsigned int processor_id;
>> -    struct vcpu *current_vcpu;
>> -    unsigned long per_cpu_offset;
>> -    unsigned long cr4;
>> +    union {
>> +        /* per physical cpu mapping */
>> +        struct {
>> +            struct vcpu *current_vcpu;
>> +            unsigned long per_cpu_offset;
>> +            unsigned long cr4;
>> +        };
>> +        /* per vcpu mapping (xpti) */
>> +        struct {
>> +            unsigned long pad1;
>> +            unsigned long pad2;
>> +            unsigned long stack_bottom_cpu;
>> +        };
> 
> In order to avoid accidental use in the wrong context as much as
> possible, I think you want to name both structures.

Okay.


Juergen


* Re: [PATCH RFC v2 11/12] x86: modify interrupt handlers to support stack switching
       [not found]   ` <5A70A63D02000078001A3C7C@suse.com>
@ 2018-01-30 17:19     ` Juergen Gross
  2018-01-31 10:36       ` Jan Beulich
       [not found]       ` <5A71AA4202000078001A3F56@suse.com>
  0 siblings, 2 replies; 74+ messages in thread
From: Juergen Gross @ 2018-01-30 17:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 30/01/18 17:07, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/x86_64/asm-offsets.c
>> +++ b/xen/arch/x86/x86_64/asm-offsets.c
>> @@ -137,6 +137,10 @@ void __dummy__(void)
>>      OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
>>      OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
>>      OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
>> +    OFFSET(CPUINFO_stack_bottom_cpu, struct cpu_info, stack_bottom_cpu);
>> +    OFFSET(CPUINFO_flags, struct cpu_info, flags);
>> +    DEFINE(ASM_ON_VCPUSTACK, ON_VCPUSTACK);
>> +    DEFINE(ASM_VCPUSTACK_ACTIVE, VCPUSTACK_ACTIVE);
> 
> Seeing their uses in asm_defns.h it's not really clear to me why
> you can't use the C constants there, the more that those uses
> are inside C macros (which perhaps would better be assembler
> ones). The latter doesn't even appear to be used in assembly
> code.

I tried using the C constants but this led to rather nasty include
dependencies.

ASM_VCPUSTACK_ACTIVE will be used when %cr3 switching is being added.

> 
>> --- a/xen/arch/x86/x86_64/compat/entry.S
>> +++ b/xen/arch/x86/x86_64/compat/entry.S
>> @@ -19,6 +19,7 @@ ENTRY(entry_int82)
>>          movl  $HYPERCALL_VECTOR, 4(%rsp)
>>          SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
>>          mov   %rsp, %rdi
>> +        SWITCH_FROM_VCPU_STACK
>>          CR4_PV32_RESTORE
> 
> Once again - why for compat mode guests?
> 
>> @@ -615,7 +623,9 @@ ENTRY(early_page_fault)
>>          movl  $TRAP_page_fault,4(%rsp)
>>          SAVE_ALL
>>          movq  %rsp,%rdi
>> +        SWITCH_FROM_VCPU_STACK
> 
> Why, in this context?

Same as before: consistency. I can remove this.

> 
>>          call  do_early_page_fault
>> +        movq  %rsp, %rdi
>>          jmp   restore_all_xen
> 
> Doesn't this belong in an earlier patch?

I have cleaned this up already.

> 
>> --- a/xen/common/wait.c
>> +++ b/xen/common/wait.c
>> @@ -122,10 +122,10 @@ void wake_up_all(struct waitqueue_head *wq)
>>  
>>  static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>>  {
>> -    struct cpu_info *cpu_info = get_cpu_info();
>> +    struct cpu_user_regs *user_regs = guest_cpu_user_regs();
>>      struct vcpu *curr = current;
>>      unsigned long dummy;
>> -    u32 entry_vector = cpu_info->guest_cpu_user_regs.entry_vector;
>> +    u32 entry_vector = user_regs->entry_vector;
>>  
>>      ASSERT(wqv->esp == 0);
>>  
>> @@ -160,7 +160,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>>          "pop %%r11; pop %%r10; pop %%r9;  pop %%r8;"
>>          "pop %%rbp; pop %%rdx; pop %%rbx; pop %%rax"
>>          : "=&S" (wqv->esp), "=&c" (dummy), "=&D" (dummy)
>> -        : "i" (PAGE_SIZE), "0" (0), "1" (cpu_info), "2" (wqv->stack)
>> +        : "i" (PAGE_SIZE), "0" (0), "1" (user_regs), "2" (wqv->stack)
>>          : "memory" );
>>  
>>      if ( unlikely(wqv->esp == 0) )
>> @@ -169,7 +169,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>>          domain_crash_synchronous();
>>      }
>>  
>> -    cpu_info->guest_cpu_user_regs.entry_vector = entry_vector;
>> +    user_regs->entry_vector = entry_vector;
>>  }
> 
> I don't see how this change is related to the purpose of this patch,
> or why the change is needed. All you do is utilize that
> guest_cpu_user_regs is the first field of struct cpu_info afaics.

guest_cpu_user_regs() might point to either stack, while get_cpu_info()
will always reference the Xen stack and never the per-vcpu one.
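
Just to illustrate the shape I mean (a sketch only - the real macro in the
patch may differ; the flag test and field names are taken from the other
hunks of this series):

#define guest_cpu_user_regs() ({                                        \
    struct cpu_info *ci__ = get_cpu_info();      /* per physical cpu */ \
    struct cpu_user_regs *regs__ = &ci__->guest_cpu_user_regs;          \
    if ( ci__->flags & VCPUSTACK_ACTIVE )                               \
        regs__ = current->arch.pv_vcpu.stack_regs; /* per-vcpu copy */  \
    regs__;                                                             \
})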

> 
>> --- a/xen/include/asm-x86/asm_defns.h
>> +++ b/xen/include/asm-x86/asm_defns.h
>> @@ -116,6 +116,25 @@ void ret_from_intr(void);
>>          GET_STACK_END(reg);                       \
>>          __GET_CURRENT(reg)
>>  
>> +#define SWITCH_FROM_VCPU_STACK                                           \
>> +        GET_STACK_END(ax);                                               \
>> +        testb $ASM_ON_VCPUSTACK, STACK_CPUINFO_FIELD(flags)(%rax);       \
>> +        jz    1f;                                                        \
>> +        movq  STACK_CPUINFO_FIELD(stack_bottom_cpu)(%rax), %rsp;         \
>> +1:
>> +
>> +#define SWITCH_FROM_VCPU_STACK_IST                                       \
>> +        GET_STACK_END(ax);                                               \
>> +        testb $ASM_ON_VCPUSTACK, STACK_CPUINFO_FIELD(flags)(%rax);       \
>> +        jz    1f;                                                        \
>> +        subq  $(CPUINFO_sizeof - 1), %rax;                               \
>> +        addq  CPUINFO_stack_bottom_cpu(%rax), %rsp;                      \
>> +        subq  %rax, %rsp;                                                \
> 
> If I'm not mistaken, %rsp is complete rubbish for on instruction
> here. While quite likely not a problem in practice, it would still
> feel better if you went through an intermediate register. I also
> think the calculation might then end up easier to follow. It'll also
> make analysis of a crash easier if an NMI or #MC hits exactly at
> this boundary.

Okay. Will change.

> 
>> +1:
>> +
>> +#define SWITCH_TO_VCPU_STACK                                             \
>> +        movq  %rdi, %rsp
> 
> For these additions as a whole: At least in new pieces of code
> please avoid insn suffixes when they're redundant with registers
> used.

Okay.

> 
>> @@ -94,9 +95,16 @@ static inline struct cpu_info *get_cpu_info(void)
>>  #define set_processor_id(id)  do {                                      \
>>      struct cpu_info *ci__ = get_cpu_info();                             \
>>      ci__->per_cpu_offset = __per_cpu_offset[ci__->processor_id = (id)]; \
>> +    ci__->flags = 0;                                                    \
>>  } while (0)
> 
> Not here, no. Considering other similar changes by recent patches
> I can see the need for a helper doing that, but this shouldn't be
> hidden in a completely unrelated macro.

Okay.

> 
>> -#define guest_cpu_user_regs() (&get_cpu_info()->guest_cpu_user_regs)
>> +#define guest_cpu_user_regs() ({                                        \
>> +    struct cpu_info *info = get_cpu_info();                             \
> 
> Please use a more macro-suitable name, e.g. ci__ as above.

Okay.


Juergen



* Re: [PATCH RFC v2 12/12] x86: activate per-vcpu stacks in case of xpti
       [not found]   ` <5A70AC7F02000078001A3CA6@suse.com>
@ 2018-01-30 17:33     ` Juergen Gross
  2018-01-31 10:40       ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Juergen Gross @ 2018-01-30 17:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 30/01/18 17:33, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> When scheduling a vcpu subject to xpti activate the per-vcpu stacks
>> by loading the vcpu specific gdt and tss. When de-scheduling such a
>> vcpu switch back to the per physical cpu gdt and tss.
>>
>> Accessing the user registers on the stack is done via helpers as
>> depending on XPTI active or not the registers are located either on
>> the per-vcpu stack or on the default stack.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>>  xen/arch/x86/domain.c              | 76 +++++++++++++++++++++++++++++++++++---
>>  xen/arch/x86/pv/domain.c           | 34 +++++++++++++++--
>>  xen/include/asm-x86/desc.h         |  5 +++
>>  xen/include/asm-x86/regs.h         |  2 +
>>  4 files changed, 107 insertions(+), 10 deletions(-)
>>
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index da1bf1a97b..d75234ca35 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1585,9 +1585,28 @@ static inline bool need_full_gdt(const struct domain *d)
>>      return is_pv_domain(d) && !is_idle_domain(d);
>>  }
>>  
>> +static void copy_user_regs_from_stack(struct vcpu *v)
>> +{
>> +    struct cpu_user_regs *stack_regs;
> 
> const

Okay.

> 
>> +    stack_regs = (is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti)
>> +                 ? v->arch.pv_vcpu.stack_regs
>> +                 : &get_cpu_info()->guest_cpu_user_regs;
> 
> Ugly open coding of what previously was guest_cpu_user_regs().

I have to make sure to address the per physical cpu stack here;
guest_cpu_user_regs() might point to the per-vcpu one instead.

> 
>> +    memcpy(&v->arch.user_regs, stack_regs, CTXT_SWITCH_STACK_BYTES);
>> +}
>> +
>> +static void copy_user_regs_to_stack(struct vcpu *v)
> 
> const

Okay.

> 
>> @@ -1635,7 +1654,7 @@ static void __context_switch(void)
>>  
>>      gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
>>                                      per_cpu(compat_gdt_table, cpu);
>> -    if ( need_full_gdt(nd) )
>> +    if ( need_full_gdt(nd) && !nd->arch.pv_domain.xpti )
>>      {
>>          unsigned long mfn = virt_to_mfn(gdt);
>>          l1_pgentry_t *pl1e = pv_gdt_ptes(n);
>> @@ -1647,23 +1666,68 @@ static void __context_switch(void)
>>      }
>>  
>>      if ( need_full_gdt(pd) &&
>> -         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
>> +         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd) ||
>> +          pd->arch.pv_domain.xpti) )
>>      {
>>          gdt_desc.limit = LAST_RESERVED_GDT_BYTE;
>>          gdt_desc.base  = (unsigned long)(gdt - FIRST_RESERVED_GDT_ENTRY);
>>  
>> +        if ( pd->arch.pv_domain.xpti )
>> +            _set_tssldt_type(gdt + TSS_ENTRY - FIRST_RESERVED_GDT_ENTRY,
>> +                             SYS_DESC_tss_avail);
> 
> Why is this not done in the if() after lgdt()?

I had some problems here when developing the patches. I just wanted to
make sure all changes to the GDT are in place before activating it.
I can move it.

> 
>>          lgdt(&gdt_desc);
>> +
>> +        if ( pd->arch.pv_domain.xpti )
>> +        {
>> +            unsigned long stub_va = this_cpu(stubs.addr);
>> +
>> +            ltr(TSS_ENTRY << 3);
>> +            get_cpu_info()->flags &= ~VCPUSTACK_ACTIVE;
>> +            wrmsrl(MSR_LSTAR, stub_va);
>> +            wrmsrl(MSR_CSTAR, stub_va + STUB_TRAMPOLINE_SIZE_PERCPU);
>> +            if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
>> +                 boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR )
>> +                wrmsrl(MSR_IA32_SYSENTER_ESP,
>> +                       (unsigned long)&get_cpu_info()->guest_cpu_user_regs.es);
> 
> Why is this not - like below - &guest_cpu_user_regs()->es?

Right, this would have been possible, but I needed to move restoring the
MSRs to another place where VCPUSTACK_ACTIVE was still set, so using
guest_cpu_user_regs() would be wrong.

I'll add a comment.

> 
>> +        }
>>      }
>>  
>>      write_ptbase(n);
>>  
>>      if ( need_full_gdt(nd) &&
>> -         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd)) )
>> +         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd) ||
>> +          nd->arch.pv_domain.xpti) )
>>      {
>>          gdt_desc.limit = LAST_RESERVED_GDT_BYTE;
>>          gdt_desc.base = GDT_VIRT_START(n);
>>  
>> +        if ( nd->arch.pv_domain.xpti )
>> +        {
>> +            struct cpu_info *info;
>> +
>> +            gdt = (struct desc_struct *)GDT_VIRT_START(n);
>> +            gdt[PER_CPU_GDT_ENTRY].a = cpu;
>> +            _set_tssldt_type(gdt + TSS_ENTRY, SYS_DESC_tss_avail);
>> +            info = (struct cpu_info *)(XPTI_START(n) + STACK_SIZE) - 1;
>> +            info->stack_bottom_cpu = (unsigned long)guest_cpu_user_regs();
>> +        }
>> +
>>          lgdt(&gdt_desc);
>> +
>> +        if ( nd->arch.pv_domain.xpti )
>> +        {
>> +            unsigned long stub_va = XPTI_TRAMPOLINE(n);
>> +
>> +            ltr(TSS_ENTRY << 3);
>> +            get_cpu_info()->flags |= VCPUSTACK_ACTIVE;
>> +            wrmsrl(MSR_LSTAR, stub_va);
>> +            wrmsrl(MSR_CSTAR, stub_va + STUB_TRAMPOLINE_SIZE_PERVCPU);
>> +            if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
>> +                 boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR )
>> +                wrmsrl(MSR_IA32_SYSENTER_ESP,
>> +                       (unsigned long)&guest_cpu_user_regs()->es);
>> +        }
> 
> So on a switch from PV to PV you add two LTR and 6 WRMSR. Quite
> a lot, and I'm not at all convinced that this double writing is all really
> needed in such a case.

I'll test if I can omit some of those. Maybe not at once, but when I
have a working XPTI version showing that my approach is really worth
considering.
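
One obvious candidate would be to skip redundant MSR writes by caching the
last value written on each physical cpu, roughly like this (sketch only,
names invented, not part of the series):

static DEFINE_PER_CPU(unsigned long, last_lstar);

static void write_lstar(unsigned long val)
{
    /* WRMSR is serialising and slow - skip it if the value is unchanged. */
    if ( this_cpu(last_lstar) == val )
        return;
    this_cpu(last_lstar) = val;
    wrmsrl(MSR_LSTAR, val);
}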

> 
>> --- a/xen/arch/x86/pv/domain.c
>> +++ b/xen/arch/x86/pv/domain.c
>> @@ -133,10 +133,36 @@ int switch_compat(struct domain *d)
>>  
>>  static int pv_create_gdt_ldt_l1tab(struct vcpu *v)
>>  {
>> -    return create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
>> -                                    1U << GDT_LDT_VCPU_SHIFT,
>> -                                    v->domain->arch.pv_domain.gdt_ldt_l1tab,
>> -                                    NULL);
>> +    int rc;
>> +
>> +    rc = create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
>> +                                  1U << GDT_LDT_VCPU_SHIFT,
>> +                                  v->domain->arch.pv_domain.gdt_ldt_l1tab,
>> +                                  NULL);
>> +    if ( !rc && v->domain->arch.pv_domain.xpti )
>> +    {
>> +        struct desc_struct *gdt;
>> +        struct page_info *gdt_pg;
>> +
>> +        BUILD_BUG_ON(NR_RESERVED_GDT_PAGES > 1);
>> +        gdt = (struct desc_struct *)GDT_VIRT_START(v) +
>> +              FIRST_RESERVED_GDT_ENTRY;
>> +        rc = create_perdomain_mapping(v->domain, (unsigned long)gdt,
>> +                                      NR_RESERVED_GDT_PAGES,
>> +                                      NULL, &gdt_pg);
>> +        if ( !rc )
>> +        {
>> +            gdt = __map_domain_page(gdt_pg);
>> +            memcpy(gdt, boot_cpu_gdt_table, NR_RESERVED_GDT_BYTES);
>> +            _set_tssldt_desc(gdt + TSS_ENTRY - FIRST_RESERVED_GDT_ENTRY,
>> +                         XPTI_TSS(v),
>> +                         offsetof(struct tss_struct, __cacheline_filler) - 1,
>> +                         SYS_DESC_tss_avail);
>> +            unmap_domain_page(gdt);
>> +        }
>> +    }
>> +
>> +    return rc;
>>  }
> 
> Since you fiddle with the GDT anyway during context switch - do
> you really need to allocate another page here, rather than simply
> mapping the pCPU's GDT page into the vCPU's per-domain area?
> That would also eliminate a concern regarding changes being made
> to the GDT after a domain was created.

Hmm, let me think about that. I'll postpone it to later.


Juergen


* Re: [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries
  2018-01-30 17:12     ` Juergen Gross
@ 2018-01-31 10:18       ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-31 10:18 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 30.01.18 at 18:12, <jgross@suse.com> wrote:
> On 30/01/18 16:40, Jan Beulich wrote:
>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>> +static int pv_vcpu_init_xpti(struct vcpu *v)
>>> +{
>>> +    struct domain *d = v->domain;
>>> +    struct page_info *pg;
>>> +    void *ptr;
>>> +    struct cpu_info *info;
>>> +    unsigned long stack_bottom;
>>> +    int rc;
>>> +
>>> +    /* Populate page tables. */
>>> +    rc = create_perdomain_mapping(d, XPTI_START(v), STACK_PAGES,
>>> +                                  NIL(l1_pgentry_t *), NULL);
>>> +    if ( rc )
>>> +        goto done;
>>> +
>>> +    /* Map stacks. */
>>> +    rc = create_perdomain_mapping(d, XPTI_START(v), IST_MAX,
>>> +                                  NULL, NIL(struct page_info *));
>>> +    if ( rc )
>>> +        goto done;
>>> +
>>> +    ptr = alloc_xenheap_page();
>>> +    if ( !ptr )
>>> +    {
>>> +        rc = -ENOMEM;
>>> +        goto done;
>>> +    }
>>> +    clear_page(ptr);
>>> +    addmfn_to_perdomain_mapping(d, XPTI_START(v) + STACK_SIZE - PAGE_SIZE,
>>> +                                _mfn(virt_to_mfn(ptr)));
>> 
>> This can't be create_perdomain_mapping() because of ...? If it's
>> the Xen heap page you use here - that would be the next question:
>> Does it need to be such, rather than a domheap one? I do see ...
> 
> I need to reference the user regs in __context_switch() before
> switching to the new address space (otherwise I'd had to rework
> __context_switch() which I wanted to avoid).

And a suitably mapped domain-heap page won't do?
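
I.e. something along these lines, perhaps (a rough sketch of the
alternative, error handling omitted; whether a global mapping is acceptable
at this point is exactly what would need checking):

struct page_info *pg = alloc_domheap_page(NULL, MEMF_no_owner);
void *ptr = __map_domain_page_global(pg);   /* usable across CR3 switches */

clear_page(ptr);
addmfn_to_perdomain_mapping(d, XPTI_START(v) + STACK_SIZE - PAGE_SIZE,
                            _mfn(page_to_mfn(pg)));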

>>> +    info = (struct cpu_info *)((unsigned long)ptr + PAGE_SIZE) - 1;
>>> +    info->flags = ON_VCPUSTACK;
>>> +    v->arch.pv_vcpu.stack_regs = &info->guest_cpu_user_regs;
>> 
>> ... this pointer, but without a clear picture on intended use it's
>> hard to judge.
> 
> See patch 12.

Well, that's one of the big problems with this RFC: The overview
mail doesn't give a clear picture of the intended overall changes
(including ones yet to be submitted), and individual patches rely
on the reader to pull out information from later patches to
understand what the patch one is currently looking at does.

>>> +    /* Map TSS. */
>>> +    rc = create_perdomain_mapping(d, XPTI_TSS(v), 1, NULL, &pg);
>>> +    if ( rc )
>>> +        goto done;
>>> +    info = (struct cpu_info *)(XPTI_START(v) + STACK_SIZE) - 1;
>> 
>> Iiuc this is a pointer one absolutely must not de-reference. A bit
>> dangerous, I would say, the more that further up the same
>> variable is being de-referenced.
> 
> Okay, I'll add another variable for this purpose.

Or at least add a comment clearly stating the restriction.

Jan



* Re: [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn
  2018-01-22 12:32 ` [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn Juergen Gross
  2018-01-29 17:06   ` Jan Beulich
       [not found]   ` <5A6F62B602000078001A3810@suse.com>
@ 2018-01-31 10:30   ` Jan Beulich
  2 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-31 10:30 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
> For support of per-vcpu stacks we need per-vcpu trampolines. To be
> able to put those into the per-domain mappings the upper levels
> page tables must not have NX set for per-domain mappings.
> 
> In order to be able to reset the NX bit for a per-domain mapping add
> a helper flipflags_perdomain_mapping() for flipping page table flags
> of a specific mapped page.
> 
> To be able to use a page from xen heap for the last per-vcpu stack
> page add a helper to map an arbitrary mfn in the perdomain area.

One further remark on this patch as a whole:
create_perdomain_mapping() allows the L1 tables to be returned,
and I think making this fit your needs (if it doesn't in its current
shape) might be better than introducing new functions which in
the end only want to fiddle with the L1 entries of previously
established mappings. This might also help mapping the pCPU's
GDT into the vCPU's per-domain mappings during context switch,
as suggested elsewhere.

Jan



* Re: [PATCH RFC v2 11/12] x86: modify interrupt handlers to support stack switching
  2018-01-30 17:19     ` Juergen Gross
@ 2018-01-31 10:36       ` Jan Beulich
       [not found]       ` <5A71AA4202000078001A3F56@suse.com>
  1 sibling, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-31 10:36 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 30.01.18 at 18:19, <jgross@suse.com> wrote:
> On 30/01/18 17:07, Jan Beulich wrote:
>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>> --- a/xen/arch/x86/x86_64/asm-offsets.c
>>> +++ b/xen/arch/x86/x86_64/asm-offsets.c
>>> @@ -137,6 +137,10 @@ void __dummy__(void)
>>>      OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
>>>      OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
>>>      OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
>>> +    OFFSET(CPUINFO_stack_bottom_cpu, struct cpu_info, stack_bottom_cpu);
>>> +    OFFSET(CPUINFO_flags, struct cpu_info, flags);
>>> +    DEFINE(ASM_ON_VCPUSTACK, ON_VCPUSTACK);
>>> +    DEFINE(ASM_VCPUSTACK_ACTIVE, VCPUSTACK_ACTIVE);
>> 
>> Seeing their uses in asm_defns.h it's not really clear to me why
>> you can't use the C constants there, the more that those uses
>> are inside C macros (which perhaps would better be assembler
>> ones). The latter doesn't even appear to be used in assembly
>> code.
> 
> I tried using the C constants but this led to rather nasty include
> dependencies.

Hmm, I can imagine this to be the case, but I'd like to have more
detail as justification. current.h itself doesn't have that many
dependencies, and, if half-way reasonable, disentangling our
headers may be the better choice.

> ASM_VCPUSTACK_ACTIVE will be used when %cr3 switching is being added.

Please introduce it when needed.

>>> --- a/xen/common/wait.c
>>> +++ b/xen/common/wait.c
>>> @@ -122,10 +122,10 @@ void wake_up_all(struct waitqueue_head *wq)
>>>  
>>>  static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>>>  {
>>> -    struct cpu_info *cpu_info = get_cpu_info();
>>> +    struct cpu_user_regs *user_regs = guest_cpu_user_regs();
>>>      struct vcpu *curr = current;
>>>      unsigned long dummy;
>>> -    u32 entry_vector = cpu_info->guest_cpu_user_regs.entry_vector;
>>> +    u32 entry_vector = user_regs->entry_vector;
>>>  
>>>      ASSERT(wqv->esp == 0);
>>>  
>>> @@ -160,7 +160,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>>>          "pop %%r11; pop %%r10; pop %%r9;  pop %%r8;"
>>>          "pop %%rbp; pop %%rdx; pop %%rbx; pop %%rax"
>>>          : "=&S" (wqv->esp), "=&c" (dummy), "=&D" (dummy)
>>> -        : "i" (PAGE_SIZE), "0" (0), "1" (cpu_info), "2" (wqv->stack)
>>> +        : "i" (PAGE_SIZE), "0" (0), "1" (user_regs), "2" (wqv->stack)
>>>          : "memory" );
>>>  
>>>      if ( unlikely(wqv->esp == 0) )
>>> @@ -169,7 +169,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>>>          domain_crash_synchronous();
>>>      }
>>>  
>>> -    cpu_info->guest_cpu_user_regs.entry_vector = entry_vector;
>>> +    user_regs->entry_vector = entry_vector;
>>>  }
>> 
>> I don't see how this change is related to the purpose of this patch,
>> or why the change is needed. All you do is utilize that
>> guest_cpu_user_regs is the first field of struct cpu_info afaics.
> 
> guest_cpu_user_regs() might point to either stack, while get_cpu_info()
> will always reference the Xen stack and never the per-vcpu one.

Then the description should say so for justification.

Jan



* Re: [PATCH RFC v2 12/12] x86: activate per-vcpu stacks in case of xpti
  2018-01-30 17:33     ` Juergen Gross
@ 2018-01-31 10:40       ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-01-31 10:40 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 30.01.18 at 18:33, <jgross@suse.com> wrote:
> On 30/01/18 17:33, Jan Beulich wrote:
>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>> --- a/xen/arch/x86/domain.c
>>> +++ b/xen/arch/x86/domain.c
>>> @@ -1585,9 +1585,28 @@ static inline bool need_full_gdt(const struct domain *d)
>>>      return is_pv_domain(d) && !is_idle_domain(d);
>>>  }
>>>  
>>> +static void copy_user_regs_from_stack(struct vcpu *v)
>>> +{
>>> +    struct cpu_user_regs *stack_regs;
>> 
>> const
> 
> Okay.
> 
>> 
>>> +    stack_regs = (is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti)
>>> +                 ? v->arch.pv_vcpu.stack_regs
>>> +                 : &get_cpu_info()->guest_cpu_user_regs;
>> 
>> Ugly open coding of what previously was guest_cpu_user_regs().
> 
> I have to make sure to address the per physical cpu stack.

I would have guessed that's the reason, but especially when
uses are inconsistent (see e.g. the two MSR_IA32_SYSENTER_ESP
writes) a brief comment should be attached to clarify why the
other variant is unsuitable in the specific case.

Jan



* Re: [PATCH RFC v2 11/12] x86: modify interrupt handlers to support stack switching
       [not found]       ` <5A71AA4202000078001A3F56@suse.com>
@ 2018-02-02 15:42         ` Juergen Gross
  0 siblings, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2018-02-02 15:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 31/01/18 11:36, Jan Beulich wrote:
>>>> On 30.01.18 at 18:19, <jgross@suse.com> wrote:
>> On 30/01/18 17:07, Jan Beulich wrote:
>>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>>> --- a/xen/arch/x86/x86_64/asm-offsets.c
>>>> +++ b/xen/arch/x86/x86_64/asm-offsets.c
>>>> @@ -137,6 +137,10 @@ void __dummy__(void)
>>>>      OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
>>>>      OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
>>>>      OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
>>>> +    OFFSET(CPUINFO_stack_bottom_cpu, struct cpu_info, stack_bottom_cpu);
>>>> +    OFFSET(CPUINFO_flags, struct cpu_info, flags);
>>>> +    DEFINE(ASM_ON_VCPUSTACK, ON_VCPUSTACK);
>>>> +    DEFINE(ASM_VCPUSTACK_ACTIVE, VCPUSTACK_ACTIVE);
>>>
>>> Seeing their uses in asm_defns.h it's not really clear to me why
>>> you can't use the C constants there, the more that those uses
>>> are inside C macros (which perhaps would better be assembler
>>> ones). The latter doesn't even appear to be used in assembly
>>> code.
>>
>> I tried using the C constants but this led to rather nasty include
>> dependencies.
> 
> Hmm, I can imagine this to be the case, but I'd like to have more
> detail for justification. current.h itself doesn't have that many
> dependencies, and if half-way reasonable disentangling our
> headers may be the better choice.

Some #ifndef __ASSEMBLY__ made it work.

I think I had the defines in another header in the beginning and just
didn't switch back after moving them to current.h.
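
I.e. the usual pattern, roughly (sketch; the concrete values are just
placeholders):

/* Plain numeric constants, usable from both C and assembly code. */
#define ON_VCPUSTACK      0x01
#define VCPUSTACK_ACTIVE  0x02

#ifndef __ASSEMBLY__
/* struct cpu_info and the other C-only declarations live below the guard. */
#endif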

> 
>> ASM_VCPUSTACK_ACTIVE will be used when %cr3 switching is being added.
> 
> Please introduce it when needed.
> 
>>>> --- a/xen/common/wait.c
>>>> +++ b/xen/common/wait.c
>>>> @@ -122,10 +122,10 @@ void wake_up_all(struct waitqueue_head *wq)
>>>>  
>>>>  static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>>>>  {
>>>> -    struct cpu_info *cpu_info = get_cpu_info();
>>>> +    struct cpu_user_regs *user_regs = guest_cpu_user_regs();
>>>>      struct vcpu *curr = current;
>>>>      unsigned long dummy;
>>>> -    u32 entry_vector = cpu_info->guest_cpu_user_regs.entry_vector;
>>>> +    u32 entry_vector = user_regs->entry_vector;
>>>>  
>>>>      ASSERT(wqv->esp == 0);
>>>>  
>>>> @@ -160,7 +160,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>>>>          "pop %%r11; pop %%r10; pop %%r9;  pop %%r8;"
>>>>          "pop %%rbp; pop %%rdx; pop %%rbx; pop %%rax"
>>>>          : "=&S" (wqv->esp), "=&c" (dummy), "=&D" (dummy)
>>>> -        : "i" (PAGE_SIZE), "0" (0), "1" (cpu_info), "2" (wqv->stack)
>>>> +        : "i" (PAGE_SIZE), "0" (0), "1" (user_regs), "2" (wqv->stack)
>>>>          : "memory" );
>>>>  
>>>>      if ( unlikely(wqv->esp == 0) )
>>>> @@ -169,7 +169,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>>>>          domain_crash_synchronous();
>>>>      }
>>>>  
>>>> -    cpu_info->guest_cpu_user_regs.entry_vector = entry_vector;
>>>> +    user_regs->entry_vector = entry_vector;
>>>>  }
>>>
>>> I don't see how this change is related to the purpose of this patch,
>>> or why the change is needed. All you do is utilize that
>>> guest_cpu_user_regs is the first field of struct cpu_info afaics.
>>
>> guest_cpu_user_regs() might point to either stack, while get_cpu_info()
>> will always reference the Xen stack and never the per-vcpu one.
> 
> Then the description should say so for justification.

Okay, added.


Juergen


* Re: [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries
  2018-01-30 15:40   ` Jan Beulich
@ 2018-02-09 12:35     ` Juergen Gross
  2018-02-13  9:10       ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Juergen Gross @ 2018-02-09 12:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

On 30/01/18 16:40, Jan Beulich wrote:
>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>> @@ -37,10 +52,24 @@ struct vcpu;
>>  
>>  struct cpu_info {
>>      struct cpu_user_regs guest_cpu_user_regs;
>> -    unsigned int processor_id;
>> -    struct vcpu *current_vcpu;
>> -    unsigned long per_cpu_offset;
>> -    unsigned long cr4;
>> +    union {
>> +        /* per physical cpu mapping */
>> +        struct {
>> +            struct vcpu *current_vcpu;
>> +            unsigned long per_cpu_offset;
>> +            unsigned long cr4;
>> +        };
>> +        /* per vcpu mapping (xpti) */
>> +        struct {
>> +            unsigned long pad1;
>> +            unsigned long pad2;
>> +            unsigned long stack_bottom_cpu;
>> +        };
> 
> In order to avoid accidental use in the wrong context as much as
> possible, I think you want to name both structures.

I'd like to leave it as is in order to make a possible backport much
easier.


Juergen



* Re: [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries
  2018-02-09 12:35     ` Juergen Gross
@ 2018-02-13  9:10       ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2018-02-13  9:10 UTC (permalink / raw)
  To: Juergen Gross
  Cc: wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	Dario Faggioli, xen-devel

>>> On 09.02.18 at 13:35, <jgross@suse.com> wrote:
> On 30/01/18 16:40, Jan Beulich wrote:
>>>>> On 22.01.18 at 13:32, <jgross@suse.com> wrote:
>>> @@ -37,10 +52,24 @@ struct vcpu;
>>>  
>>>  struct cpu_info {
>>>      struct cpu_user_regs guest_cpu_user_regs;
>>> -    unsigned int processor_id;
>>> -    struct vcpu *current_vcpu;
>>> -    unsigned long per_cpu_offset;
>>> -    unsigned long cr4;
>>> +    union {
>>> +        /* per physical cpu mapping */
>>> +        struct {
>>> +            struct vcpu *current_vcpu;
>>> +            unsigned long per_cpu_offset;
>>> +            unsigned long cr4;
>>> +        };
>>> +        /* per vcpu mapping (xpti) */
>>> +        struct {
>>> +            unsigned long pad1;
>>> +            unsigned long pad2;
>>> +            unsigned long stack_bottom_cpu;
>>> +        };
>> 
>> In order to avoid accidental use in the wrong context as much as
>> possible, I think you want to name both structures.
> 
> I'd like to leave it as is in order to make a possible backport much
> more easier.

Well, I can see why you would want the pre-existing fields left
without a structure field name, but the new (vcpu) ones? And
even the pre-existing (pcpu) ones should gain a name, perhaps
just in a patch late in the series, which then wouldn't be
backported.
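
I.e. something like this (the member names pcpu/vcpu are merely
placeholders):

    union {
        /* per physical cpu mapping */
        struct {
            struct vcpu *current_vcpu;
            unsigned long per_cpu_offset;
            unsigned long cr4;
        } pcpu;
        /* per vcpu mapping (xpti) */
        struct {
            unsigned long pad1;
            unsigned long pad2;
            unsigned long stack_bottom_cpu;
        } vcpu;
    };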

Jan




Thread overview: 74+ messages
2018-01-22 12:32 [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Juergen Gross
2018-01-22 12:32 ` [PATCH RFC v2 01/12] x86: cleanup processor.h Juergen Gross
2018-01-22 12:52   ` Jan Beulich
     [not found]   ` <5A65ECA502000078001A111C@suse.com>
2018-01-22 14:10     ` Juergen Gross
2018-01-22 14:25       ` Andrew Cooper
2018-01-22 14:32         ` Jan Beulich
2018-01-22 12:32 ` [PATCH RFC v2 02/12] x86: don't use hypervisor stack size for dumping guest stacks Juergen Gross
2018-01-23  9:26   ` Jan Beulich
     [not found]   ` <5A670DEF02000078001A16AF@suse.com>
2018-01-23  9:58     ` Juergen Gross
2018-01-23 10:11       ` Jan Beulich
     [not found]       ` <5A67187C02000078001A1742@suse.com>
2018-01-23 10:19         ` Juergen Gross
2018-01-22 12:32 ` [PATCH RFC v2 03/12] x86: do a revert of e871e80c38547d9faefc6604532ba3e985e65873 Juergen Gross
2018-01-22 12:32 ` [PATCH RFC v2 04/12] x86: revert 5784de3e2067ed73efc2fe42e62831e8ae7f46c4 Juergen Gross
2018-01-22 12:32 ` [PATCH RFC v2 05/12] x86: don't access saved user regs via rsp in trap handlers Juergen Gross
2018-01-30 14:49   ` Jan Beulich
     [not found]   ` <5A70941B02000078001A3BF0@suse.com>
2018-01-30 16:33     ` Juergen Gross
2018-01-22 12:32 ` [PATCH RFC v2 06/12] x86: add a xpti command line parameter Juergen Gross
2018-01-30 15:39   ` Jan Beulich
     [not found]   ` <5A709FDF02000078001A3C2C@suse.com>
2018-01-30 16:51     ` Juergen Gross
2018-01-22 12:32 ` [PATCH RFC v2 07/12] x86: allow per-domain mappings without NX bit or with specific mfn Juergen Gross
2018-01-29 17:06   ` Jan Beulich
     [not found]   ` <5A6F62B602000078001A3810@suse.com>
2018-01-30  8:02     ` Juergen Gross
2018-01-30  8:41       ` Jan Beulich
2018-01-31 10:30   ` Jan Beulich
2018-01-22 12:32 ` [PATCH RFC v2 08/12] xen/x86: use dedicated function for tss initialization Juergen Gross
2018-01-22 12:32 ` [PATCH RFC v2 09/12] x86: enhance syscall stub to work in per-domain mapping Juergen Gross
2018-01-30 15:11   ` Jan Beulich
     [not found]   ` <5A70991902000078001A3C16@suse.com>
2018-01-30 16:50     ` Juergen Gross
2018-01-22 12:32 ` [PATCH RFC v2 10/12] x86: allocate per-vcpu stacks for interrupt entries Juergen Gross
2018-01-30 15:40   ` Jan Beulich
2018-02-09 12:35     ` Juergen Gross
2018-02-13  9:10       ` Jan Beulich
     [not found]   ` <5A70A01402000078001A3C30@suse.com>
2018-01-30 17:12     ` Juergen Gross
2018-01-31 10:18       ` Jan Beulich
2018-01-22 12:32 ` [PATCH RFC v2 11/12] x86: modify interrupt handlers to support stack switching Juergen Gross
2018-01-30 16:07   ` Jan Beulich
     [not found]   ` <5A70A63D02000078001A3C7C@suse.com>
2018-01-30 17:19     ` Juergen Gross
2018-01-31 10:36       ` Jan Beulich
     [not found]       ` <5A71AA4202000078001A3F56@suse.com>
2018-02-02 15:42         ` Juergen Gross
2018-01-22 12:32 ` [PATCH RFC v2 12/12] x86: activate per-vcpu stacks in case of xpti Juergen Gross
2018-01-30 16:33   ` Jan Beulich
     [not found]   ` <5A70AC7F02000078001A3CA6@suse.com>
2018-01-30 17:33     ` Juergen Gross
2018-01-31 10:40       ` Jan Beulich
2018-01-22 12:50 ` [PATCH RFC v2 00/12] xen/x86: use per-vcpu stacks for 64 bit pv domains Jan Beulich
     [not found] ` <5A65EC0A02000078001A1118@suse.com>
2018-01-22 14:18   ` Juergen Gross
2018-01-22 14:22     ` Jan Beulich
     [not found]     ` <5A6601D302000078001A1230@suse.com>
2018-01-22 14:38       ` Juergen Gross
2018-01-22 14:48         ` Jan Beulich
     [not found]         ` <5A6607DB02000078001A127B@suse.com>
2018-01-22 15:00           ` Juergen Gross
2018-01-22 16:51             ` Jan Beulich
2018-01-22 18:39               ` Andrew Cooper
2018-01-22 18:48                 ` George Dunlap
2018-01-22 19:02                   ` Andrew Cooper
2018-01-23  8:36                     ` Jan Beulich
2018-01-23 11:23                       ` Andrew Cooper
2018-01-23 11:06                     ` George Dunlap
2018-01-23  6:34                 ` Juergen Gross
2018-01-23  7:21                   ` Juergen Gross
2018-01-23  8:53                   ` Jan Beulich
     [not found]                   ` <5A67061F02000078001A1669@suse.com>
2018-01-23  9:24                     ` Juergen Gross
2018-01-23  9:31                       ` Jan Beulich
     [not found]                       ` <5A670F0E02000078001A16C9@suse.com>
2018-01-23 10:10                         ` Juergen Gross
2018-01-23 11:45                           ` Andrew Cooper
2018-01-23 13:31                             ` Juergen Gross
2018-01-23 13:24                 ` Dario Faggioli
2018-01-23 16:45                 ` George Dunlap
2018-01-23 16:56                   ` Juergen Gross
2018-01-23 17:33                     ` George Dunlap
2018-01-24  7:37                       ` Jan Beulich
     [not found]             ` <5A6624A602000078001A1375@suse.com>
2018-01-23  5:50               ` Juergen Gross
2018-01-23  8:40                 ` Jan Beulich
     [not found]                 ` <5A67030F02000078001A164B@suse.com>
2018-01-23  9:45                   ` Juergen Gross
2018-01-22 21:45 ` Konrad Rzeszutek Wilk
2018-01-23  6:38   ` Juergen Gross
