* [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks
@ 2020-05-01 22:58 Andrew Cooper
  2020-05-01 22:58 ` [PATCH 01/16] x86/traps: Drop last_extable_addr Andrew Cooper
                   ` (15 more replies)
  0 siblings, 16 replies; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

This series implements Shadow Stack support for Xen to use.

You'll need a CET-capable toolchain (Binutils 2.32 or later), but no specific
compiler support is required.

CET-SS makes PV32 unusable, so using shadow stacks prevents the use of 32bit
PV guests.  Compatibility can be obtained using the PV Shim.

Andrew Cooper (16):
  x86/traps: Drop last_extable_addr
  x86/traps: Clean up printing in do_reserved_trap()/fatal_trap()
  x86/traps: Factor out exception_fixup() and make printing consistent
  x86/smpboot: Write the top-of-stack block in cpu_smpboot_alloc()
  x86/shstk: Introduce Supervisor Shadow Stack support
  x86/traps: Implement #CP handler and extend #PF for shadow stacks
  x86/shstk: Re-layout the stack block for shadow stacks
  x86/shstk: Create shadow stacks
  x86/cpu: Adjust enable_nmis() to be shadow stack compatible
  x86/cpu: Adjust reset_stack_and_jump() to be shadow stack compatible
  x86/spec-ctrl: Adjust DO_OVERWRITE_RSB to be shadow stack compatible
  x86/extable: Adjust extable handling to be shadow stack compatible
  x86/ioemul: Rewrite stub generation to be shadow stack compatible
  x86/alt: Adjust _alternative_instructions() to not create shadow stacks
  x86/entry: Adjust guest paths to be shadow stack compatible
  x86/shstk: Activate Supervisor Shadow Stacks

 xen/arch/x86/Kconfig                |  17 +++
 xen/arch/x86/acpi/wakeup_prot.S     |  56 ++++++++++
 xen/arch/x86/alternative.c          |  14 +++
 xen/arch/x86/boot/x86_64.S          |  30 +++++-
 xen/arch/x86/cpu/common.c           |  34 +++++-
 xen/arch/x86/crash.c                |   7 ++
 xen/arch/x86/ioport_emulate.c       |  11 +-
 xen/arch/x86/mm.c                   |  41 ++++---
 xen/arch/x86/pv/emul-priv-op.c      |  91 ++++++++++++----
 xen/arch/x86/pv/gpr_switch.S        |  37 ++-----
 xen/arch/x86/setup.c                |  56 ++++++++++
 xen/arch/x86/smpboot.c              |  10 +-
 xen/arch/x86/spec_ctrl.c            |   8 ++
 xen/arch/x86/traps.c                | 206 ++++++++++++++++++++++--------------
 xen/arch/x86/x86_64/compat/entry.S  |   2 +-
 xen/arch/x86/x86_64/entry.S         |  39 ++++++-
 xen/include/asm-x86/cpufeature.h    |   1 +
 xen/include/asm-x86/cpufeatures.h   |   1 +
 xen/include/asm-x86/current.h       |  59 ++++++++---
 xen/include/asm-x86/io.h            |   3 +-
 xen/include/asm-x86/mm.h            |   1 -
 xen/include/asm-x86/msr-index.h     |   3 +
 xen/include/asm-x86/page.h          |   1 +
 xen/include/asm-x86/processor.h     |  60 +++++++----
 xen/include/asm-x86/spec_ctrl_asm.h |  16 ++-
 xen/include/asm-x86/x86-defns.h     |  36 +++++++
 xen/include/asm-x86/x86_64/page.h   |   1 +
 xen/scripts/Kconfig.include         |   4 +
 28 files changed, 640 insertions(+), 205 deletions(-)

-- 
2.11.0



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 01/16] x86/traps: Drop last_extable_addr
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-04 12:44   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap() Andrew Cooper
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

The only user of this facility is dom_crash_sync_extable(), which passes 0
into asm_domain_crash_synchronous().  The common error cases are already covered
with show_page_walk(), leaving only %ss/%fs selector/segment errors in the
compat case.

Point at dom_crash_sync_extable in the error message, which is about as good
as the error hints from other users of asm_domain_crash_synchronous(), and
drop last_extable_addr.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/traps.c        | 11 +----------
 xen/arch/x86/x86_64/entry.S |  2 +-
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 33e5d21ece..fe9457cdb6 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -96,7 +96,6 @@ static char __read_mostly opt_nmi[10] = "fatal";
 string_param("nmi", opt_nmi);
 
 DEFINE_PER_CPU(uint64_t, efer);
-static DEFINE_PER_CPU(unsigned long, last_extable_addr);
 
 DEFINE_PER_CPU_READ_MOSTLY(seg_desc_t *, gdt);
 DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, gdt_l1e);
@@ -786,7 +785,6 @@ static void do_trap(struct cpu_user_regs *regs)
     {
         dprintk(XENLOG_ERR, "Trap %u: %p [%ps] -> %p\n",
                 trapnr, _p(regs->rip), _p(regs->rip), _p(fixup));
-        this_cpu(last_extable_addr) = regs->rip;
         regs->rip = fixup;
         return;
     }
@@ -1099,7 +1097,6 @@ void do_invalid_op(struct cpu_user_regs *regs)
  die:
     if ( (fixup = search_exception_table(regs)) != 0 )
     {
-        this_cpu(last_extable_addr) = regs->rip;
         regs->rip = fixup;
         return;
     }
@@ -1122,7 +1119,6 @@ void do_int3(struct cpu_user_regs *regs)
 
         if ( (fixup = search_exception_table(regs)) != 0 )
         {
-            this_cpu(last_extable_addr) = regs->rip;
             dprintk(XENLOG_DEBUG, "Trap %u: %p [%ps] -> %p\n",
                     TRAP_int3, _p(regs->rip), _p(regs->rip), _p(fixup));
             regs->rip = fixup;
@@ -1461,7 +1457,6 @@ void do_page_fault(struct cpu_user_regs *regs)
             perfc_incr(copy_user_faults);
             if ( unlikely(regs->error_code & PFEC_reserved_bit) )
                 reserved_bit_page_fault(addr, regs);
-            this_cpu(last_extable_addr) = regs->rip;
             regs->rip = fixup;
             return;
         }
@@ -1591,7 +1586,6 @@ void do_general_protection(struct cpu_user_regs *regs)
     {
         dprintk(XENLOG_INFO, "GPF (%04x): %p [%ps] -> %p\n",
                 regs->error_code, _p(regs->rip), _p(regs->rip), _p(fixup));
-        this_cpu(last_extable_addr) = regs->rip;
         regs->rip = fixup;
         return;
     }
@@ -2085,10 +2079,7 @@ void asm_domain_crash_synchronous(unsigned long addr)
      */
     clac();
 
-    if ( addr == 0 )
-        addr = this_cpu(last_extable_addr);
-
-    printk("domain_crash_sync called from entry.S: fault at %p %pS\n",
+    printk("domain_crash_sync called from entry.S: issue around %p %pS\n",
            _p(addr), _p(addr));
 
     __domain_crash(current->domain);
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index d55453f3f3..a3ce298529 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -527,7 +527,7 @@ ENTRY(dom_crash_sync_extable)
         sete  %al
         leal  (%rax,%rax,2),%eax
         orb   %al,UREGS_cs(%rsp)
-        xorl  %edi,%edi
+        lea   dom_crash_sync_extable(%rip), %rdi
         jmp   asm_domain_crash_synchronous /* Does not return */
         .popsection
 #endif /* CONFIG_PV */
-- 
2.11.0




* [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap()
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
  2020-05-01 22:58 ` [PATCH 01/16] x86/traps: Drop last_extable_addr Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-04 13:08   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent Andrew Cooper
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

For one, they render the vector in a different base.

Introduce X86_EXC_* constants and vec_name() to refer to exceptions by their
mnemonic, which starts bringing the code/diagnostics in line with the Intel
and AMD manuals.

Provide constants for every architecturally defined exception, even those not
yet implemented by Xen, as do_reserved_trap() is a catch-all handler.
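
The vec_name() helper below relies on C designated initialisers and an
empty-string sentinel for unnamed vectors.  A minimal standalone sketch of the
same pattern (reduced to a handful of vectors, with a simplified bounds check)
might look like:

```c
#include <assert.h>
#include <string.h>

/* A few of the X86_EXC_* constants introduced by this patch. */
#define X86_EXC_DE   0 /* Divide Error. */
#define X86_EXC_GP  13 /* General Protection Fault. */
#define X86_EXC_PF  14 /* Page Fault. */
#define X86_EXC_CP  21 /* Control-flow Protection. */

static const char *vec_name(unsigned int vec)
{
    /* 4-byte entries: up to three mnemonic characters plus the NUL. */
    static const char names[][4] = {
#define N(x) [X86_EXC_ ## x] = #x
        N(DE), N(GP), N(PF), N(CP),
#undef N
    };

    /* Unnamed vectors are all-zero entries, caught by the names[vec][0] test. */
    return (vec < sizeof(names) / sizeof(names[0]) && names[vec][0])
           ? names[vec] : "??";
}
```

The array is sized by the highest initialised index, so out-of-range vectors
and gaps in the table both fall back to "??".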

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/traps.c            | 24 +++++++++++++++++++-----
 xen/include/asm-x86/processor.h |  6 +-----
 xen/include/asm-x86/x86-defns.h | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index fe9457cdb6..e73f07f28a 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -686,6 +686,20 @@ const char *trapstr(unsigned int trapnr)
     return trapnr < ARRAY_SIZE(strings) ? strings[trapnr] : "???";
 }
 
+static const char *vec_name(unsigned int vec)
+{
+    static const char names[][4] = {
+#define N(x) [X86_EXC_ ## x] = #x
+        N(DE),  N(DB),  N(NMI), N(BP),  N(OF),  N(BR),  N(UD),  N(NM),
+        N(DF),  N(CSO), N(TS),  N(NP),  N(SS),  N(GP),  N(PF),  N(SPV),
+        N(MF),  N(AC),  N(MC),  N(XM),  N(VE),  N(CP),
+                                        N(HV),  N(VC),  N(SX),
+#undef N
+    };
+
+    return (vec < ARRAY_SIZE(names) && names[vec][0]) ? names[vec] : "??";
+}
+
 /*
  * This is called for faults at very unexpected times (e.g., when interrupts
  * are disabled). In such situations we can't do much that is safe. We try to
@@ -743,10 +757,9 @@ void fatal_trap(const struct cpu_user_regs *regs, bool show_remote)
         }
     }
 
-    panic("FATAL TRAP: vector = %d (%s)\n"
-          "[error_code=%04x] %s\n",
-          trapnr, trapstr(trapnr), regs->error_code,
-          (regs->eflags & X86_EFLAGS_IF) ? "" : ", IN INTERRUPT CONTEXT");
+    panic("FATAL TRAP: vec %u, #%s[%04x]%s\n",
+          trapnr, vec_name(trapnr), regs->error_code,
+          (regs->eflags & X86_EFLAGS_IF) ? "" : " IN INTERRUPT CONTEXT");
 }
 
 static void do_reserved_trap(struct cpu_user_regs *regs)
@@ -757,7 +770,8 @@ static void do_reserved_trap(struct cpu_user_regs *regs)
         return;
 
     show_execution_state(regs);
-    panic("FATAL RESERVED TRAP %#x: %s\n", trapnr, trapstr(trapnr));
+    panic("FATAL RESERVED TRAP: vec %u, #%s[%04x]\n",
+          trapnr, vec_name(trapnr), regs->error_code);
 }
 
 static void do_trap(struct cpu_user_regs *regs)
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 8f6f5a97dd..12b55e1022 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -43,11 +43,7 @@
 #define TRAP_virtualisation   20
 #define TRAP_nr               32
 
-#define TRAP_HAVE_EC                                                    \
-    ((1u << TRAP_double_fault) | (1u << TRAP_invalid_tss) |             \
-     (1u << TRAP_no_segment) | (1u << TRAP_stack_error) |               \
-     (1u << TRAP_gp_fault) | (1u << TRAP_page_fault) |                  \
-     (1u << TRAP_alignment_check))
+#define TRAP_HAVE_EC X86_EXC_HAVE_EC
 
 /* Set for entry via SYSCALL. Informs return code to use SYSRETQ not IRETQ. */
 /* NB. Same as VGCF_in_syscall. No bits in common with any other TRAP_ defn. */
diff --git a/xen/include/asm-x86/x86-defns.h b/xen/include/asm-x86/x86-defns.h
index 8bf503220a..84e15b15be 100644
--- a/xen/include/asm-x86/x86-defns.h
+++ b/xen/include/asm-x86/x86-defns.h
@@ -118,4 +118,39 @@
 
 #define X86_NR_VECTORS 256
 
+/* Exception Vectors */
+#define X86_EXC_DE             0 /* Divide Error. */
+#define X86_EXC_DB             1 /* Debug Exception. */
+#define X86_EXC_NMI            2 /* NMI. */
+#define X86_EXC_BP             3 /* Breakpoint. */
+#define X86_EXC_OF             4 /* Overflow. */
+#define X86_EXC_BR             5 /* BOUND Range. */
+#define X86_EXC_UD             6 /* Invalid Opcode. */
+#define X86_EXC_NM             7 /* Device Not Available. */
+#define X86_EXC_DF             8 /* Double Fault. */
+#define X86_EXC_CSO            9 /* Coprocessor Segment Overrun. */
+#define X86_EXC_TS            10 /* Invalid TSS. */
+#define X86_EXC_NP            11 /* Segment Not Present. */
+#define X86_EXC_SS            12 /* Stack-Segment Fault. */
+#define X86_EXC_GP            13 /* General Protection Fault. */
+#define X86_EXC_PF            14 /* Page Fault. */
+#define X86_EXC_SPV           15 /* PIC Spurious Interrupt Vector. */
+#define X86_EXC_MF            16 /* Maths fault (x87 FPU). */
+#define X86_EXC_AC            17 /* Alignment Check. */
+#define X86_EXC_MC            18 /* Machine Check. */
+#define X86_EXC_XM            19 /* SIMD Exception. */
+#define X86_EXC_VE            20 /* Virtualisation Exception. */
+#define X86_EXC_CP            21 /* Control-flow Protection. */
+#define X86_EXC_HV            28 /* Hypervisor Injection. */
+#define X86_EXC_VC            29 /* VMM Communication. */
+#define X86_EXC_SX            30 /* Security Exception. */
+
+/* Bitmap of exceptions which have error codes. */
+#define X86_EXC_HAVE_EC                                             \
+    ((1u << X86_EXC_DF) | (1u << X86_EXC_TS) | (1u << X86_EXC_NP) | \
+     (1u << X86_EXC_SS) | (1u << X86_EXC_GP) | (1u << X86_EXC_PF) | \
+     (1u << X86_EXC_AC) | (1u << X86_EXC_CP) |                      \
+     (1u << X86_EXC_VC) | (1u << X86_EXC_SX))
+
+
 #endif	/* __XEN_X86_DEFNS_H__ */
-- 
2.11.0




* [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
  2020-05-01 22:58 ` [PATCH 01/16] x86/traps: Drop last_extable_addr Andrew Cooper
  2020-05-01 22:58 ` [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap() Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-04 13:20   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 04/16] x86/smpboot: Write the top-of-stack block in cpu_smpboot_alloc() Andrew Cooper
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

UD faults never had any diagnostics printed, and the others were inconsistent.

Don't use dprintk() because identifying traps.c is actively unhelpful in the
message, as it is the location of the fixup, not the fault.  Use the new
vec_name() infrastructure, rather than leaving raw numbers for the log.

  (XEN) Running stub recovery selftests...
  (XEN) Fixup #UD[0000]: ffff82d07fffd040 [ffff82d07fffd040] -> ffff82d0403ac9d6
  (XEN) Fixup #GP[0000]: ffff82d07fffd041 [ffff82d07fffd041] -> ffff82d0403ac9d6
  (XEN) Fixup #SS[0000]: ffff82d07fffd040 [ffff82d07fffd040] -> ffff82d0403ac9d6
  (XEN) Fixup #BP[0000]: ffff82d07fffd041 [ffff82d07fffd041] -> ffff82d0403ac9d6
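
The shape of the factored-out helper can be sketched in isolation as below;
the regs structure, extable contents, and linear search here are hypothetical
stand-ins for Xen's real cpu_user_regs and search_exception_table():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for Xen's cpu_user_regs and exception table. */
struct regs { unsigned long rip; };

struct extable_entry {
    unsigned long addr;   /* faulting instruction pointer */
    unsigned long fixup;  /* recovery address */
};

static const struct extable_entry extable[] = {
    { 0x1000, 0x2000 },
    { 0x1010, 0x2040 },
};

static unsigned long search_exception_table(const struct regs *regs)
{
    for ( size_t i = 0; i < sizeof(extable) / sizeof(extable[0]); i++ )
        if ( extable[i].addr == regs->rip )
            return extable[i].fixup;

    return 0;
}

/* One helper, one diagnostic format, optionally suppressed by the caller. */
static bool exception_fixup(struct regs *regs, bool print)
{
    unsigned long fixup = search_exception_table(regs);

    if ( fixup == 0 )
        return false;

    if ( print )
        printf("Fixup: %#lx -> %#lx\n", regs->rip, fixup);

    regs->rip = fixup;

    return true;
}
```

Callers then collapse to `if ( likely(exception_fixup(regs, true)) ) return;`
with the print decision made per fault type rather than per call site.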

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/traps.c | 68 ++++++++++++++++++++++++----------------------------
 1 file changed, 31 insertions(+), 37 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index e73f07f28a..737ab036d2 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -774,10 +774,27 @@ static void do_reserved_trap(struct cpu_user_regs *regs)
           trapnr, vec_name(trapnr), regs->error_code);
 }
 
+static bool exception_fixup(struct cpu_user_regs *regs, bool print)
+{
+    unsigned long fixup = search_exception_table(regs);
+
+    if ( unlikely(fixup == 0) )
+        return false;
+
+    /* Can currently be triggered by guests.  Make sure we ratelimit. */
+    if ( IS_ENABLED(CONFIG_DEBUG) && print )
+        printk(XENLOG_GUEST XENLOG_WARNING "Fixup #%s[%04x]: %p [%ps] -> %p\n",
+               vec_name(regs->entry_vector), regs->error_code,
+               _p(regs->rip), _p(regs->rip), _p(fixup));
+
+    regs->rip = fixup;
+
+    return true;
+}
+
 static void do_trap(struct cpu_user_regs *regs)
 {
     unsigned int trapnr = regs->entry_vector;
-    unsigned long fixup;
 
     if ( regs->error_code & X86_XEC_EXT )
         goto hardware_trap;
@@ -795,13 +812,8 @@ static void do_trap(struct cpu_user_regs *regs)
         return;
     }
 
-    if ( likely((fixup = search_exception_table(regs)) != 0) )
-    {
-        dprintk(XENLOG_ERR, "Trap %u: %p [%ps] -> %p\n",
-                trapnr, _p(regs->rip), _p(regs->rip), _p(fixup));
-        regs->rip = fixup;
+    if ( likely(exception_fixup(regs, true)) )
         return;
-    }
 
  hardware_trap:
     if ( debugger_trap_fatal(trapnr, regs) )
@@ -1109,11 +1121,8 @@ void do_invalid_op(struct cpu_user_regs *regs)
     }
 
  die:
-    if ( (fixup = search_exception_table(regs)) != 0 )
-    {
-        regs->rip = fixup;
+    if ( likely(exception_fixup(regs, true)) )
         return;
-    }
 
     if ( debugger_trap_fatal(TRAP_invalid_op, regs) )
         return;
@@ -1129,15 +1138,8 @@ void do_int3(struct cpu_user_regs *regs)
 
     if ( !guest_mode(regs) )
     {
-        unsigned long fixup;
-
-        if ( (fixup = search_exception_table(regs)) != 0 )
-        {
-            dprintk(XENLOG_DEBUG, "Trap %u: %p [%ps] -> %p\n",
-                    TRAP_int3, _p(regs->rip), _p(regs->rip), _p(fixup));
-            regs->rip = fixup;
+        if ( likely(exception_fixup(regs, true)) )
             return;
-        }
 
         if ( !debugger_trap_fatal(TRAP_int3, regs) )
             printk(XENLOG_DEBUG "Hit embedded breakpoint at %p [%ps]\n",
@@ -1435,7 +1437,7 @@ static int fixup_page_fault(unsigned long addr, struct cpu_user_regs *regs)
  */
 void do_page_fault(struct cpu_user_regs *regs)
 {
-    unsigned long addr, fixup;
+    unsigned long addr;
     unsigned int error_code;
 
     addr = read_cr2();
@@ -1466,12 +1468,11 @@ void do_page_fault(struct cpu_user_regs *regs)
         if ( pf_type != real_fault )
             return;
 
-        if ( likely((fixup = search_exception_table(regs)) != 0) )
+        if ( likely(exception_fixup(regs, false)) )
         {
             perfc_incr(copy_user_faults);
             if ( unlikely(regs->error_code & PFEC_reserved_bit) )
                 reserved_bit_page_fault(addr, regs);
-            regs->rip = fixup;
             return;
         }
 
@@ -1529,7 +1530,6 @@ void do_general_protection(struct cpu_user_regs *regs)
 #ifdef CONFIG_PV
     struct vcpu *v = current;
 #endif
-    unsigned long fixup;
 
     if ( debugger_trap_entry(TRAP_gp_fault, regs) )
         return;
@@ -1596,13 +1596,8 @@ void do_general_protection(struct cpu_user_regs *regs)
 
  gp_in_kernel:
 
-    if ( likely((fixup = search_exception_table(regs)) != 0) )
-    {
-        dprintk(XENLOG_INFO, "GPF (%04x): %p [%ps] -> %p\n",
-                regs->error_code, _p(regs->rip), _p(regs->rip), _p(fixup));
-        regs->rip = fixup;
+    if ( likely(exception_fixup(regs, true)) )
         return;
-    }
 
  hardware_gp:
     if ( debugger_trap_fatal(TRAP_gp_fault, regs) )
@@ -1761,18 +1756,17 @@ void do_device_not_available(struct cpu_user_regs *regs)
 
     if ( !guest_mode(regs) )
     {
-        unsigned long fixup = search_exception_table(regs);
-
-        gprintk(XENLOG_ERR, "#NM: %p [%ps] -> %p\n",
-                _p(regs->rip), _p(regs->rip), _p(fixup));
         /*
          * We shouldn't be able to reach here, but for release builds have
          * the recovery logic in place nevertheless.
          */
-        ASSERT_UNREACHABLE();
-        BUG_ON(!fixup);
-        regs->rip = fixup;
-        return;
+        if ( exception_fixup(regs, true) )
+        {
+            ASSERT_UNREACHABLE();
+            return;
+        }
+
+        fatal_trap(regs, false);
     }
 
 #ifdef CONFIG_PV
-- 
2.11.0




* [PATCH 04/16] x86/smpboot: Write the top-of-stack block in cpu_smpboot_alloc()
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (2 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-04 13:22   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 05/16] x86/shstk: Introduce Supervisor Shadow Stack support Andrew Cooper
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

This allows the AP boot assembly to use per-cpu variables, and brings the
semantics closer to those of the BSP, which can use per-cpu variables from the
start of day.
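
The idea of locating the top-of-stack block purely from a stack address can
be sketched as follows; STACK_SIZE, the struct layout, and the rounding
arithmetic here are illustrative stand-ins for Xen's real struct cpu_info and
get_cpu_info_from_stack():

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative values only: Xen's real STACK_SIZE and struct cpu_info
 * (xen/include/asm-x86/current.h) are larger and more involved. */
#define STACK_SIZE (8u * 4096)

struct cpu_info {
    unsigned int processor_id;
    unsigned long per_cpu_offset;
};

/* The cpu_info block sits at the very top of the STACK_SIZE-aligned stack
 * allocation, so any address within the stack can be rounded up to find it. */
static struct cpu_info *get_cpu_info_from_stack(unsigned long sp)
{
    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
}
```

With the block filled in once at allocation time, an AP can derive its
processor id and per-cpu offset from nothing but its own stack pointer, which
is what lets set_processor_id() disappear from start_secondary().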

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/smpboot.c        | 7 ++++++-
 xen/include/asm-x86/current.h | 5 -----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 5a3786d399..f999323bc4 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -329,7 +329,6 @@ void start_secondary(void *unused)
 
     /* Critical region without IDT or TSS.  Any fault is deadly! */
 
-    set_processor_id(cpu);
     set_current(idle_vcpu[cpu]);
     this_cpu(curr_vcpu) = idle_vcpu[cpu];
     rdmsrl(MSR_EFER, this_cpu(efer));
@@ -986,6 +985,7 @@ static void cpu_smpboot_free(unsigned int cpu, bool remove)
 
 static int cpu_smpboot_alloc(unsigned int cpu)
 {
+    struct cpu_info *info;
     unsigned int i, memflags = 0;
     nodeid_t node = cpu_to_node(cpu);
     seg_desc_t *gdt;
@@ -999,6 +999,11 @@ static int cpu_smpboot_alloc(unsigned int cpu)
         stack_base[cpu] = alloc_xenheap_pages(STACK_ORDER, memflags);
     if ( stack_base[cpu] == NULL )
         goto out;
+
+    info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
+    info->processor_id = cpu;
+    info->per_cpu_offset = __per_cpu_offset[cpu];
+
     memguard_guard_stack(stack_base[cpu]);
 
     gdt = per_cpu(gdt, cpu) ?: alloc_xenheap_pages(0, memflags);
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 0b47485337..5b8f4dbc79 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -100,11 +100,6 @@ static inline struct cpu_info *get_cpu_info(void)
 #define current               (get_current())
 
 #define get_processor_id()    (get_cpu_info()->processor_id)
-#define set_processor_id(id)  do {                                      \
-    struct cpu_info *ci__ = get_cpu_info();                             \
-    ci__->per_cpu_offset = __per_cpu_offset[ci__->processor_id = (id)]; \
-} while (0)
-
 #define guest_cpu_user_regs() (&get_cpu_info()->guest_cpu_user_regs)
 
 /*
-- 
2.11.0




* [PATCH 05/16] x86/shstk: Introduce Supervisor Shadow Stack support
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (3 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 04/16] x86/smpboot: Write the top-of-stack block in cpu_smpboot_alloc() Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-04 13:52   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 06/16] x86/traps: Implement #CP handler and extend #PF for shadow stacks Andrew Cooper
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

Introduce CONFIG_HAS_AS_CET to determine whether CET instructions are
supported in the assembler, and CONFIG_XEN_SHSTK as the main build option.

Introduce xen={no-,}shstk for the user to select whether or not to use shadow
stacks at runtime, and X86_FEATURE_XEN_SHSTK to determine Xen's overall
enablement of shadow stacks.
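
The command line handling can be sketched in isolation as below;
parse_bool_opt() is a simplified, hypothetical stand-in for Xen's
parse_boolean(), and error handling is reduced to a bare -1:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool opt_xen_shstk = true;

/* Simplified stand-in for Xen's parse_boolean(): matches "shstk" or
 * "no-shstk" within [s, ss) and returns 1/0, or -1 for no match. */
static int parse_bool_opt(const char *name, const char *s, const char *ss)
{
    size_t len = ss - s;
    bool val = true;

    if ( len > 3 && !strncmp(s, "no-", 3) )
    {
        val = false;
        s += 3;
        len -= 3;
    }

    if ( len == strlen(name) && !strncmp(s, name, len) )
        return val;

    return -1;
}

/* Walk a comma-separated "xen=" option string, as parse_xen() does. */
static int parse_xen(const char *s)
{
    const char *ss;
    int val, rc = 0;

    do {
        ss = strchr(s, ',');
        if ( !ss )
            ss = strchr(s, '\0');

        if ( (val = parse_bool_opt("shstk", s, ss)) >= 0 )
            opt_xen_shstk = val;
        else
            rc = -1;

        s = ss + 1;
    } while ( *ss );

    return rc;
}
```

The trailing `while ( *ss )` test is what distinguishes "stopped at a comma,
keep going" from "hit the terminating NUL, done".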

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/Kconfig              | 17 +++++++++++++++++
 xen/arch/x86/setup.c              | 30 ++++++++++++++++++++++++++++++
 xen/include/asm-x86/cpufeature.h  |  1 +
 xen/include/asm-x86/cpufeatures.h |  1 +
 xen/scripts/Kconfig.include       |  4 ++++
 5 files changed, 53 insertions(+)

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 96432f1f69..ebd01e6893 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -34,6 +34,9 @@ config ARCH_DEFCONFIG
 config INDIRECT_THUNK
 	def_bool $(cc-option,-mindirect-branch-register)
 
+config HAS_AS_CET
+	def_bool $(as-instr,wrssq %rax$(comma)0;setssbsy;endbr64)
+
 menu "Architecture Features"
 
 source "arch/Kconfig"
@@ -97,6 +100,20 @@ config HVM
 
 	  If unsure, say Y.
 
+config XEN_SHSTK
+	bool "Supervisor Shadow Stacks"
+	depends on HAS_AS_CET && EXPERT = "y"
+	default y
+        ---help---
+          Control-flow Enforcement Technology (CET) is a set of features in
+          hardware designed to combat Return-oriented Programming (ROP, and
+          its call/jump variants COP/JOP) attacks.  Shadow Stacks are one
+          CET feature designed to provide return address protection.
+
+          This option arranges for Xen to use CET-SS for its own protection.
+          When CET-SS is active, 32bit PV guests cannot be used.  Backwards
+          compatibility can be provided via the PV Shim mechanism.
+
 config SHADOW_PAGING
         bool "Shadow Paging"
         default y
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 9e9576344c..aa21201507 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -95,6 +95,36 @@ unsigned long __initdata highmem_start;
 size_param("highmem-start", highmem_start);
 #endif
 
+static bool __initdata opt_xen_shstk = true;
+
+static int parse_xen(const char *s)
+{
+    const char *ss;
+    int val, rc = 0;
+
+    do {
+        ss = strchr(s, ',');
+        if ( !ss )
+            ss = strchr(s, '\0');
+
+        if ( (val = parse_boolean("shstk", s, ss)) >= 0 )
+        {
+#ifdef CONFIG_XEN_SHSTK
+            opt_xen_shstk = val;
+#else
+            no_config_param("XEN_SHSTK", "xen", s, ss);
+#endif
+        }
+        else
+            rc = -EINVAL;
+
+        s = ss + 1;
+    } while ( *ss );
+
+    return rc;
+}
+custom_param("xen", parse_xen);
+
 cpumask_t __read_mostly cpu_present_map;
 
 unsigned long __read_mostly xen_phys_start;
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 859970570b..6b25f61832 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -136,6 +136,7 @@
 #define cpu_has_aperfmperf      boot_cpu_has(X86_FEATURE_APERFMPERF)
 #define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
 #define cpu_has_xen_lbr         boot_cpu_has(X86_FEATURE_XEN_LBR)
+#define cpu_has_xen_shstk       boot_cpu_has(X86_FEATURE_XEN_SHSTK)
 
 #define cpu_has_msr_tsc_aux     (cpu_has_rdtscp || cpu_has_rdpid)
 
diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index b9d3cac975..d7e42d9bb6 100644
--- a/xen/include/asm-x86/cpufeatures.h
+++ b/xen/include/asm-x86/cpufeatures.h
@@ -38,6 +38,7 @@ XEN_CPUFEATURE(XEN_LBR,           X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */
 XEN_CPUFEATURE(SC_VERW_PV,        X86_SYNTH(23)) /* VERW used by Xen for PV */
 XEN_CPUFEATURE(SC_VERW_HVM,       X86_SYNTH(24)) /* VERW used by Xen for HVM */
 XEN_CPUFEATURE(SC_VERW_IDLE,      X86_SYNTH(25)) /* VERW used by Xen for idle */
+XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
diff --git a/xen/scripts/Kconfig.include b/xen/scripts/Kconfig.include
index 8221095ca3..e1f13e1720 100644
--- a/xen/scripts/Kconfig.include
+++ b/xen/scripts/Kconfig.include
@@ -31,6 +31,10 @@ cc-option = $(success,$(CC) -Werror $(CLANG_FLAGS) $(1) -E -x c /dev/null -o /de
 # Return y if the linker supports <flag>, n otherwise
 ld-option = $(success,$(LD) -v $(1))
 
+# $(as-instr,<instr>)
+# Return y if the assembler supports <instr>, n otherwise
+as-instr = $(success,printf "%b\n" "$(1)" | $(CC) $(CLANG_FLAGS) -c -x assembler -o /dev/null -)
+
 # check if $(CC) and $(LD) exist
 $(error-if,$(failure,command -v $(CC)),compiler '$(CC)' not found)
 $(error-if,$(failure,command -v $(LD)),linker '$(LD)' not found)
-- 
2.11.0




* [PATCH 06/16] x86/traps: Implement #CP handler and extend #PF for shadow stacks
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (4 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 05/16] x86/shstk: Introduce Supervisor Shadow Stack support Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-04 14:10   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 07/16] x86/shstk: Re-layout the stack block " Andrew Cooper
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

For now, any #CP exception or shadow stack #PF indicates a bug in Xen, but
attempt to recover if taken in guest context.

Drop the comment beside do_page_fault().  It's stale (missing PFEC_prot_key),
and inaccurate (PFEC_present being set means just that, not necessarily a
protection violation).
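
For reference, the full #PF error code layout being alluded to can be written
out as below; the bit positions are from the Intel SDM rather than from this
patch, and pf_is_shstk() is only an illustration of the new fatal check:

```c
#include <assert.h>
#include <stdbool.h>

/* #PF error code bits, per the Intel SDM (Vol 3, "Interrupt 14 - Page-Fault
 * Exception"); bits 5 and 6 are the additions the stale comment lacked. */
#define PFEC_page_present (1u << 0) /* 0: not present; 1: protection violation */
#define PFEC_write_access (1u << 1)
#define PFEC_user_mode    (1u << 2)
#define PFEC_reserved_bit (1u << 3)
#define PFEC_insn_fetch   (1u << 4)
#define PFEC_prot_key     (1u << 5)
#define PFEC_shstk        (1u << 6)

/* The new handling treats any fault on a shadow stack access as fatal. */
static bool pf_is_shstk(unsigned int ec)
{
    return ec & PFEC_shstk;
}
```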

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/traps.c            | 55 ++++++++++++++++++++++++++++++++++-------
 xen/arch/x86/x86_64/entry.S     |  7 +++++-
 xen/include/asm-x86/processor.h |  2 ++
 3 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 737ab036d2..ddbe312f89 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -158,7 +158,9 @@ void (* const exception_table[TRAP_nr])(struct cpu_user_regs *regs) = {
     [TRAP_alignment_check]              = do_trap,
     [TRAP_machine_check]                = (void *)do_machine_check,
     [TRAP_simd_error]                   = do_trap,
-    [TRAP_virtualisation ...
+    [TRAP_virtualisation]               = do_reserved_trap,
+    [X86_EXC_CP]                        = do_entry_CP,
+    [X86_EXC_CP + 1 ...
      (ARRAY_SIZE(exception_table) - 1)] = do_reserved_trap,
 };
 
@@ -1427,14 +1429,6 @@ static int fixup_page_fault(unsigned long addr, struct cpu_user_regs *regs)
     return 0;
 }
 
-/*
- * #PF error code:
- *  Bit 0: Protection violation (=1) ; Page not present (=0)
- *  Bit 1: Write access
- *  Bit 2: User mode (=1) ; Supervisor mode (=0)
- *  Bit 3: Reserved bit violation
- *  Bit 4: Instruction fetch
- */
 void do_page_fault(struct cpu_user_regs *regs)
 {
     unsigned long addr;
@@ -1457,6 +1451,10 @@ void do_page_fault(struct cpu_user_regs *regs)
     {
         enum pf_type pf_type = spurious_page_fault(addr, regs);
 
+        /* Any fault on a shadow stack access is a bug in Xen. */
+        if ( error_code & PFEC_shstk )
+            goto fatal;
+
         if ( (pf_type == smep_fault) || (pf_type == smap_fault) )
         {
             console_start_sync();
@@ -1476,6 +1474,7 @@ void do_page_fault(struct cpu_user_regs *regs)
             return;
         }
 
+    fatal:
         if ( debugger_trap_fatal(TRAP_page_fault, regs) )
             return;
 
@@ -1906,6 +1905,43 @@ void do_debug(struct cpu_user_regs *regs)
     pv_inject_hw_exception(TRAP_debug, X86_EVENT_NO_EC);
 }
 
+void do_entry_CP(struct cpu_user_regs *regs)
+{
+    static const char errors[][10] = {
+        [1] = "near ret",
+        [2] = "far/iret",
+        [3] = "endbranch",
+        [4] = "rstorssp",
+        [5] = "setssbsy",
+    };
+    const char *err = "??";
+    unsigned int ec = regs->error_code;
+
+    if ( debugger_trap_entry(X86_EXC_CP, regs) )
+        return;
+
+    /* Decode ec if possible */
+    if ( ec < ARRAY_SIZE(errors) && errors[ec][0] )
+        err = errors[ec];
+
+    /*
+     * For now, only supervisor shadow stacks should be active.  A #CP from
+     * guest context is probably a Xen bug, but kill the guest in an attempt
+     * to recover.
+     */
+    if ( guest_mode(regs) )
+    {
+        gprintk(XENLOG_ERR, "Hit #CP[%04x] in guest context %04x:%p\n",
+                ec, regs->cs, _p(regs->rip));
+        ASSERT_UNREACHABLE();
+        domain_crash(current->domain);
+        return;
+    }
+
+    show_execution_state(regs);
+    panic("CONTROL-FLOW PROTECTION FAULT: #CP[%04x] %s\n", ec, err);
+}
+
 static void __init noinline __set_intr_gate(unsigned int n,
                                             uint32_t dpl, void *addr)
 {
@@ -1995,6 +2031,7 @@ void __init init_idt_traps(void)
     set_intr_gate(TRAP_alignment_check,&alignment_check);
     set_intr_gate(TRAP_machine_check,&machine_check);
     set_intr_gate(TRAP_simd_error,&simd_coprocessor_error);
+    set_intr_gate(X86_EXC_CP, entry_CP);
 
     /* Specify dedicated interrupt stacks for NMI, #DF, and #MC. */
     enable_each_ist(idt_table);
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index a3ce298529..6403c0ab92 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -795,6 +795,10 @@ ENTRY(alignment_check)
         movl  $TRAP_alignment_check,4(%rsp)
         jmp   handle_exception
 
+ENTRY(entry_CP)
+        movl  $X86_EXC_CP, 4(%rsp)
+        jmp   handle_exception
+
 ENTRY(double_fault)
         movl  $TRAP_double_fault,4(%rsp)
         /* Set AC to reduce chance of further SMAP faults */
@@ -940,7 +944,8 @@ autogen_stubs: /* Automatically generated stubs. */
         entrypoint 1b
 
         /* Reserved exceptions, heading towards do_reserved_trap(). */
-        .elseif vec == TRAP_copro_seg || vec == TRAP_spurious_int || (vec > TRAP_simd_error && vec < TRAP_nr)
+        .elseif vec == TRAP_copro_seg || vec == TRAP_spurious_int || \
+                vec == TRAP_virtualisation || (vec > X86_EXC_CP && vec < TRAP_nr)
 
 1:      test  $8,%spl        /* 64bit exception frames are 16 byte aligned, but the word */
         jz    2f             /* size is 8 bytes.  Check whether the processor gave us an */
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 12b55e1022..5e8a0fb649 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -68,6 +68,7 @@
 #define PFEC_reserved_bit   (_AC(1,U) << 3)
 #define PFEC_insn_fetch     (_AC(1,U) << 4)
 #define PFEC_prot_key       (_AC(1,U) << 5)
+#define PFEC_shstk          (_AC(1,U) << 6)
 #define PFEC_arch_mask      (_AC(0xffff,U)) /* Architectural PFEC values. */
 /* Internally used only flags. */
 #define PFEC_page_paged     (1U<<16)
@@ -529,6 +530,7 @@ DECLARE_TRAP_HANDLER(coprocessor_error);
 DECLARE_TRAP_HANDLER(simd_coprocessor_error);
 DECLARE_TRAP_HANDLER_CONST(machine_check);
 DECLARE_TRAP_HANDLER(alignment_check);
+DECLARE_TRAP_HANDLER(entry_CP);
 
 DECLARE_TRAP_HANDLER(entry_int82);
 
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 07/16] x86/shstk: Re-layout the stack block for shadow stacks
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (5 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 06/16] x86/traps: Implement #CP handler and extend #PF for shadow stacks Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-04 14:24   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 08/16] x86/shstk: Create " Andrew Cooper
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

We have two free pages in the current stack.  A useful property of shadow
stacks and regular stacks is that they act as each other's guard pages as
far as OoB writes go.

Move the regular IST stacks up by one page, to allow their shadow stack page
to be in slot 0.  The primary shadow stack uses slot 5.

As the shadow IST stacks are only 1k large, shuffle the order of IST vectors
to have #DF numerically highest (so there is no chance of a shadow stack
overflow clobbering the supervisor token).

The XPTI code already breaks the MEMORY_GUARD abstraction for stacks by
forcing it to be present.  To avoid having too many configurations, do away
with the concept entirely, and unconditionally unmap the pages in all cases.

A later change will turn these properly into shadow stacks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/cpu/common.c       | 10 +++++-----
 xen/arch/x86/mm.c               | 19 ++++++-------------
 xen/arch/x86/smpboot.c          |  3 +--
 xen/arch/x86/traps.c            | 23 ++++++-----------------
 xen/include/asm-x86/current.h   | 12 ++++++------
 xen/include/asm-x86/mm.h        |  1 -
 xen/include/asm-x86/processor.h |  6 +++---
 7 files changed, 27 insertions(+), 47 deletions(-)

diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 131ff03fcf..290f9f1c30 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -732,14 +732,14 @@ void load_system_tables(void)
 		.rsp2 = 0x8600111111111111ul,
 
 		/*
-		 * MCE, NMI and Double Fault handlers get their own stacks.
+		 * #DB, NMI, #DF and #MC handlers get their own stacks.
 		 * All others poisoned.
 		 */
 		.ist = {
-			[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
-			[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE,
-			[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE,
-			[IST_DB  - 1] = stack_top + IST_DB  * PAGE_SIZE,
+			[IST_MCE - 1] = stack_top + (1 + IST_MCE) * PAGE_SIZE,
+			[IST_NMI - 1] = stack_top + (1 + IST_NMI) * PAGE_SIZE,
+			[IST_DB  - 1] = stack_top + (1 + IST_DB)  * PAGE_SIZE,
+			[IST_DF  - 1] = stack_top + (1 + IST_DF)  * PAGE_SIZE,
 
 			[IST_MAX ... ARRAY_SIZE(tss->ist) - 1] =
 				0x8600111111111111ul,
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 355c50ff91..bc44d865ef 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6002,25 +6002,18 @@ void memguard_unguard_range(void *p, unsigned long l)
 
 void memguard_guard_stack(void *p)
 {
-    /* IST_MAX IST pages + at least 1 guard page + primary stack. */
-    BUILD_BUG_ON((IST_MAX + 1) * PAGE_SIZE + PRIMARY_STACK_SIZE > STACK_SIZE);
+    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
 
-    memguard_guard_range(p + IST_MAX * PAGE_SIZE,
-                         STACK_SIZE - PRIMARY_STACK_SIZE - IST_MAX * PAGE_SIZE);
+    p += 5 * PAGE_SIZE;
+    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
 }
 
 void memguard_unguard_stack(void *p)
 {
-    memguard_unguard_range(p + IST_MAX * PAGE_SIZE,
-                           STACK_SIZE - PRIMARY_STACK_SIZE - IST_MAX * PAGE_SIZE);
-}
-
-bool memguard_is_stack_guard_page(unsigned long addr)
-{
-    addr &= STACK_SIZE - 1;
+    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_RW);
 
-    return addr >= IST_MAX * PAGE_SIZE &&
-           addr < STACK_SIZE - PRIMARY_STACK_SIZE;
+    p += 5 * PAGE_SIZE;
+    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_RW);
 }
 
 void arch_dump_shared_mem_info(void)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index f999323bc4..e0f421ca3d 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -823,8 +823,7 @@ static int setup_cpu_root_pgt(unsigned int cpu)
 
     /* Install direct map page table entries for stack, IDT, and TSS. */
     for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
-        if ( !memguard_is_stack_guard_page(off) )
-            rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
+        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
 
     if ( !rc )
         rc = clone_mapping(idt_tables[cpu], rpt);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index ddbe312f89..1cf00c1f4a 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -369,20 +369,15 @@ static void show_guest_stack(struct vcpu *v, const struct cpu_user_regs *regs)
 /*
  * Notes for get_stack_trace_bottom() and get_stack_dump_bottom()
  *
- * Stack pages 0 - 3:
+ * Stack pages 1 - 4:
  *   These are all 1-page IST stacks.  Each of these stacks have an exception
  *   frame and saved register state at the top.  The interesting bound for a
  *   trace is the word adjacent to this, while the bound for a dump is the
  *   very top, including the exception frame.
  *
- * Stack pages 4 and 5:
- *   None of these are particularly interesting.  With MEMORY_GUARD, page 5 is
- *   explicitly not present, so attempting to dump or trace it is
- *   counterproductive.  Without MEMORY_GUARD, it is possible for a call chain
- *   to use the entire primary stack and wander into page 5.  In this case,
- *   consider these pages an extension of the primary stack to aid debugging
- *   hopefully rare situations where the primary stack has effective been
- *   overflown.
+ * Stack pages 0 and 5:
+ *   Shadow stacks.  These are mapped read-only, and used by CET-SS capable
+ *   processors.  They will never contain regular stack data.
  *
  * Stack pages 6 and 7:
  *   These form the primary stack, and have a cpu_info at the top.  For a
@@ -396,13 +391,10 @@ unsigned long get_stack_trace_bottom(unsigned long sp)
 {
     switch ( get_stack_page(sp) )
     {
-    case 0 ... 3:
+    case 1 ... 4:
         return ROUNDUP(sp, PAGE_SIZE) -
             offsetof(struct cpu_user_regs, es) - sizeof(unsigned long);
 
-#ifndef MEMORY_GUARD
-    case 4 ... 5:
-#endif
     case 6 ... 7:
         return ROUNDUP(sp, STACK_SIZE) -
             sizeof(struct cpu_info) - sizeof(unsigned long);
@@ -416,12 +408,9 @@ unsigned long get_stack_dump_bottom(unsigned long sp)
 {
     switch ( get_stack_page(sp) )
     {
-    case 0 ... 3:
+    case 1 ... 4:
         return ROUNDUP(sp, PAGE_SIZE) - sizeof(unsigned long);
 
-#ifndef MEMORY_GUARD
-    case 4 ... 5:
-#endif
     case 6 ... 7:
         return ROUNDUP(sp, STACK_SIZE) - sizeof(unsigned long);
 
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 5b8f4dbc79..99b66a0087 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -16,12 +16,12 @@
  *
  * 7 - Primary stack (with a struct cpu_info at the top)
  * 6 - Primary stack
- * 5 - Optionally not present (MEMORY_GUARD)
- * 4 - Unused; optionally not present (MEMORY_GUARD)
- * 3 - Unused; optionally not present (MEMORY_GUARD)
- * 2 - MCE IST stack
- * 1 - NMI IST stack
- * 0 - Double Fault IST stack
+ * 5 - Primary Shadow Stack (read-only)
+ * 4 - #DF IST stack
+ * 3 - #DB IST stack
+ * 2 - NMI IST stack
+ * 1 - #MC IST stack
+ * 0 - IST Shadow Stacks (4x 1k, read-only)
  */
 
 /*
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 3d3f9d49ac..7e74996053 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -536,7 +536,6 @@ void memguard_unguard_range(void *p, unsigned long l);
 
 void memguard_guard_stack(void *p);
 void memguard_unguard_stack(void *p);
-bool __attribute_const__ memguard_is_stack_guard_page(unsigned long addr);
 
 struct mmio_ro_emulate_ctxt {
         unsigned long cr2;
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 5e8a0fb649..f7e80d12e4 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -439,10 +439,10 @@ struct tss_page {
 DECLARE_PER_CPU(struct tss_page, tss_page);
 
 #define IST_NONE 0UL
-#define IST_DF   1UL
+#define IST_MCE  1UL
 #define IST_NMI  2UL
-#define IST_MCE  3UL
-#define IST_DB   4UL
+#define IST_DB   3UL
+#define IST_DF   4UL
 #define IST_MAX  4UL
 
 /* Set the Interrupt Stack Table used by a particular IDT entry. */
-- 
2.11.0




* [PATCH 08/16] x86/shstk: Create shadow stacks
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (6 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 07/16] x86/shstk: Re-layout the stack block " Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-04 14:55   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 09/16] x86/cpu: Adjust enable_nmis() to be shadow stack compatible Andrew Cooper
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

Introduce HYPERVISOR_SHSTK pagetable constants, which are Read-Only + Dirty.
Use these in place of _PAGE_NONE for memguard_guard_stack().

Supervisor shadow stacks need a token written at the top, which is most easily
done before making the frame read only.

Allocate the shadow IST stack block in struct tss_page.  It doesn't strictly
need to live here, but it is a convenient location (and XPTI-safe, for testing
purposes).

Have load_system_tables() set up the shadow IST stack table when setting up
the regular IST in the TSS.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/cpu/common.c         | 19 +++++++++++++++++++
 xen/arch/x86/mm.c                 | 22 +++++++++++++++++++---
 xen/include/asm-x86/page.h        |  1 +
 xen/include/asm-x86/processor.h   |  3 ++-
 xen/include/asm-x86/x86_64/page.h |  1 +
 5 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 290f9f1c30..3962717aa5 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -748,6 +748,25 @@ void load_system_tables(void)
 		.bitmap = IOBMP_INVALID_OFFSET,
 	};
 
+	/* Set up the shadow stack IST. */
+	if ( cpu_has_xen_shstk ) {
+		unsigned int i;
+		uint64_t *ist_ssp = this_cpu(tss_page).ist_ssp;
+
+		/* Must point at the supervisor stack token. */
+		ist_ssp[IST_MCE] = stack_top + (IST_MCE * 0x400) - 8;
+		ist_ssp[IST_NMI] = stack_top + (IST_NMI * 0x400) - 8;
+		ist_ssp[IST_DB]  = stack_top + (IST_DB  * 0x400) - 8;
+		ist_ssp[IST_DF]  = stack_top + (IST_DF  * 0x400) - 8;
+
+		/* Poison unused entries. */
+		for ( i = IST_MAX;
+		      i < ARRAY_SIZE(this_cpu(tss_page).ist_ssp); ++i )
+			ist_ssp[i] = 0x8600111111111111ul;
+
+		wrmsrl(MSR_INTERRUPT_SSP_TABLE, (unsigned long)ist_ssp);
+	}
+
 	BUILD_BUG_ON(sizeof(*tss) <= 0x67); /* Mandated by the architecture. */
 
 	_set_tssldt_desc(gdt + TSS_ENTRY, (unsigned long)tss,
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index bc44d865ef..4e2c3c9735 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6000,12 +6000,28 @@ void memguard_unguard_range(void *p, unsigned long l)
 
 #endif
 
-void memguard_guard_stack(void *p)
+static void write_sss_token(unsigned long *ptr)
 {
-    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
+    /*
+     * A supervisor shadow stack token is its own linear address, with the
+     * busy bit (0) clear.
+     */
+    *ptr = (unsigned long)ptr;
+}
 
+void memguard_guard_stack(void *p)
+{
+    /* IST Shadow stacks.  4x 1k in stack page 0. */
+    write_sss_token(p + 0x3f8);
+    write_sss_token(p + 0x7f8);
+    write_sss_token(p + 0xbf8);
+    write_sss_token(p + 0xff8);
+    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK);
+
+    /* Primary Shadow Stack.  1x 4k in stack page 5. */
     p += 5 * PAGE_SIZE;
-    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
+    write_sss_token(p + 0xff8);
+    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK);
 }
 
 void memguard_unguard_stack(void *p)
diff --git a/xen/include/asm-x86/page.h b/xen/include/asm-x86/page.h
index 5acf3d3d5a..f632affaef 100644
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -364,6 +364,7 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t);
                                    _PAGE_DIRTY | _PAGE_RW)
 #define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_PCD)
 #define __PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT)
+#define __PAGE_HYPERVISOR_SHSTK   (__PAGE_HYPERVISOR_RO | _PAGE_DIRTY)
 
 #define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */
 
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index f7e80d12e4..54e1a8b605 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -434,7 +434,8 @@ struct __packed tss64 {
     uint16_t :16, bitmap;
 };
 struct tss_page {
-    struct tss64 __aligned(PAGE_SIZE) tss;
+    uint64_t __aligned(PAGE_SIZE) ist_ssp[8];
+    struct tss64 tss;
 };
 DECLARE_PER_CPU(struct tss_page, tss_page);
 
diff --git a/xen/include/asm-x86/x86_64/page.h b/xen/include/asm-x86/x86_64/page.h
index 9876634881..26621f9519 100644
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -171,6 +171,7 @@ static inline intpte_t put_pte_flags(unsigned int x)
 #define PAGE_HYPERVISOR_RW      (__PAGE_HYPERVISOR_RW      | _PAGE_GLOBAL)
 #define PAGE_HYPERVISOR_RX      (__PAGE_HYPERVISOR_RX      | _PAGE_GLOBAL)
 #define PAGE_HYPERVISOR_RWX     (__PAGE_HYPERVISOR         | _PAGE_GLOBAL)
+#define PAGE_HYPERVISOR_SHSTK   (__PAGE_HYPERVISOR_SHSTK   | _PAGE_GLOBAL)
 
 #define PAGE_HYPERVISOR         PAGE_HYPERVISOR_RW
 #define PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR_UCMINUS | \
-- 
2.11.0




* [PATCH 09/16] x86/cpu: Adjust enable_nmis() to be shadow stack compatible
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (7 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 08/16] x86/shstk: Create " Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-05 14:48   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 10/16] x86/cpu: Adjust reset_stack_and_jump() " Andrew Cooper
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

When executing an IRET-to-self, the shadow stack must agree with the regular
stack.  We can't manipulate SSP directly, so we have to fake a shadow IRET
frame by executing 3 CALLs, then editing the result to look correct.

This is not a fastpath, is called on the BSP long before CET can be set up,
and may be called on the crash path after CET is disabled.  Use the fact that
INCSSP is allocated from the hint nop space to construct a test for CET being
active which is safe on all processors.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/include/asm-x86/processor.h | 43 +++++++++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 54e1a8b605..654d46a6f4 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -544,17 +544,40 @@ static inline void enable_nmis(void)
 {
     unsigned long tmp;
 
-    asm volatile ( "mov %%rsp, %[tmp]     \n\t"
-                   "push %[ss]            \n\t"
-                   "push %[tmp]           \n\t"
-                   "pushf                 \n\t"
-                   "push %[cs]            \n\t"
-                   "lea 1f(%%rip), %[tmp] \n\t"
-                   "push %[tmp]           \n\t"
-                   "iretq; 1:             \n\t"
-                   : [tmp] "=&r" (tmp)
+    asm volatile ( "mov     %%rsp, %[rsp]        \n\t"
+                   "lea    .Ldone(%%rip), %[rip] \n\t"
+#ifdef CONFIG_XEN_SHSTK
+                   /* Check for CET-SS being active. */
+                   "mov    $1, %k[ssp]           \n\t"
+                   "rdsspq %[ssp]                \n\t"
+                   "cmp    $1, %k[ssp]           \n\t"
+                   "je     .Lshstk_done          \n\t"
+
+                   /* Push 3 words on the shadow stack */
+                   ".rept 3                      \n\t"
+                   "call 1f; nop; 1:             \n\t"
+                   ".endr                        \n\t"
+
+                   /* Fixup to be an IRET shadow stack frame */
+                   "wrssq  %q[cs], -1*8(%[ssp])  \n\t"
+                   "wrssq  %[rip], -2*8(%[ssp])  \n\t"
+                   "wrssq  %[ssp], -3*8(%[ssp])  \n\t"
+
+                   ".Lshstk_done:"
+#endif
+                   /* Write an IRET regular frame */
+                   "push   %[ss]                 \n\t"
+                   "push   %[rsp]                \n\t"
+                   "pushf                        \n\t"
+                   "push   %q[cs]                \n\t"
+                   "push   %[rip]                \n\t"
+                   "iretq                        \n\t"
+                   ".Ldone:                      \n\t"
+                   : [rip] "=&r" (tmp),
+                     [rsp] "=&r" (tmp),
+                     [ssp] "=&r" (tmp)
                    : [ss] "i" (__HYPERVISOR_DS),
-                     [cs] "i" (__HYPERVISOR_CS) );
+                     [cs] "r" (__HYPERVISOR_CS) );
 }
 
 void sysenter_entry(void);
-- 
2.11.0




* [PATCH 10/16] x86/cpu: Adjust reset_stack_and_jump() to be shadow stack compatible
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (8 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 09/16] x86/cpu: Adjust enable_nmis() to be shadow stack compatible Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-07 13:17   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 11/16] x86/spec-ctrl: Adjust DO_OVERWRITE_RSB " Andrew Cooper
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

We need to unwind up to the supervisor token.  See the comment for details.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/include/asm-x86/current.h | 42 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 99b66a0087..2a7b728b1e 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -124,13 +124,49 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
 # define CHECK_FOR_LIVEPATCH_WORK ""
 #endif
 
+#ifdef CONFIG_XEN_SHSTK
+/*
+ * We need to unwind the primary shadow stack to its supervisor token, located
+ * at 0x5ff8 from the base of the stack blocks.
+ *
+ * Read the shadow stack pointer, subtract it from 0x5ff8, divide by 8 to get
+ * the number of slots needing popping.
+ *
+ * INCSSPQ can't pop more than 255 entries.  We shouldn't ever need to pop
+ * that many entries, and getting this wrong will cause us to #DF later.
+ */
+# define SHADOW_STACK_WORK                      \
+    "mov $1, %[ssp];"                           \
+    "rdsspd %[ssp];"                            \
+    "cmp $1, %[ssp];"                           \
+    "je 1f;" /* CET not active?  Skip. */       \
+    "mov $"STR(0x5ff8)", %[val];"               \
+    "and $"STR(STACK_SIZE - 1)", %[ssp];"       \
+    "sub %[ssp], %[val];"                       \
+    "shr $3, %[val];"                           \
+    "cmp $255, %[val];"                         \
+    "jle 2f;"                                   \
+    "ud2a;"                                     \
+    "2: incsspq %q[val];"                       \
+    "1:"
+#else
+# define SHADOW_STACK_WORK ""
+#endif
+
 #define switch_stack_and_jump(fn, instr)                                \
     ({                                                                  \
+        unsigned int tmp;                                               \
         __asm__ __volatile__ (                                          \
-            "mov %0,%%"__OP"sp;"                                        \
+            "cmc;"                                                      \
+            SHADOW_STACK_WORK                                           \
+            "mov %[stk], %%rsp;"                                        \
             instr                                                       \
-             "jmp %c1"                                                  \
-            : : "r" (guest_cpu_user_regs()), "i" (fn) : "memory" );     \
+            "jmp %c[fun];"                                              \
+            : [val] "=&r" (tmp),                                        \
+              [ssp] "=&r" (tmp)                                         \
+            : [stk] "r" (guest_cpu_user_regs()),                        \
+              [fun] "i" (fn)                                            \
+            : "memory" );                                               \
         unreachable();                                                  \
     })
 
-- 
2.11.0




* [PATCH 11/16] x86/spec-ctrl: Adjust DO_OVERWRITE_RSB to be shadow stack compatible
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (9 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 10/16] x86/cpu: Adjust reset_stack_and_jump() " Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-07 13:22   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 12/16] x86/extable: Adjust extable handling " Andrew Cooper
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

The 32 calls need dropping from the shadow stack as well as the regular stack.
To shorten the code, we can use the 32bit forms of RDSSP/INCSSP, but need to
double up the input to INCSSP to counter the operand size based multiplier.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/include/asm-x86/spec_ctrl_asm.h | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index c60093b090..cb34299a86 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -83,9 +83,9 @@
  * Requires nothing
  * Clobbers \tmp (%rax by default), %rcx
  *
- * Requires 256 bytes of stack space, but %rsp has no net change. Based on
- * Google's performance numbers, the loop is unrolled to 16 iterations and two
- * calls per iteration.
+ * Requires 256 bytes of {,shadow}stack space, but %rsp/SSP has no net
+ * change. Based on Google's performance numbers, the loop is unrolled to 16
+ * iterations and two calls per iteration.
  *
  * The call filling the RSB needs a nonzero displacement.  A nop would do, but
  * we use "1: pause; lfence; jmp 1b" to safely contains any ret-based
@@ -114,6 +114,16 @@
     sub $1, %ecx
     jnz .L\@_fill_rsb_loop
     mov %\tmp, %rsp                 /* Restore old %rsp */
+
+#ifdef CONFIG_XEN_SHSTK
+    mov $1, %ecx
+    rdsspd %ecx
+    cmp $1, %ecx
+    je .L\@_shstk_done
+    mov $64, %ecx                   /* 64 * 4 bytes, given incsspd */
+    incsspd %ecx                    /* Restore old SSP */
+.L\@_shstk_done:
+#endif
 .endm
 
 .macro DO_SPEC_CTRL_ENTRY_FROM_HVM
-- 
2.11.0




* [PATCH 12/16] x86/extable: Adjust extable handling to be shadow stack compatible
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (10 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 11/16] x86/spec-ctrl: Adjust DO_OVERWRITE_RSB " Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-07 13:35   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 13/16] x86/ioemul: Rewrite stub generation " Andrew Cooper
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

When adjusting an IRET frame to recover from a fault, an equivalent
adjustment needs to be made to the shadow IRET frame.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/traps.c        | 22 ++++++++++++++++++++++
 xen/arch/x86/x86_64/entry.S | 11 ++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 1cf00c1f4a..2354357cc1 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -778,6 +778,28 @@ static bool exception_fixup(struct cpu_user_regs *regs, bool print)
                vec_name(regs->entry_vector), regs->error_code,
                _p(regs->rip), _p(regs->rip), _p(fixup));
 
+    if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
+    {
+        unsigned long ssp;
+
+        asm ("rdsspq %0" : "=r" (ssp) : "0" (1) );
+        if ( ssp != 1 )
+        {
+            unsigned long *ptr = _p(ssp);
+
+            /* Search for %rip in the shadow stack, ... */
+            while ( *ptr != regs->rip )
+                ptr++;
+
+            ASSERT(ptr[1] == __HYPERVISOR_CS);
+
+            /* ... and adjust to the fixup location. */
+            asm ("wrssq %[fix], %[stk]"
+                 : [stk] "=m" (*ptr)
+                 : [fix] "r" (fixup));
+        }
+    }
+
     regs->rip = fixup;
 
     return true;
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 6403c0ab92..06da350ba0 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -708,7 +708,16 @@ exception_with_ints_disabled:
         call  search_pre_exception_table
         testq %rax,%rax                 # no fixup code for faulting EIP?
         jz    1b
-        movq  %rax,UREGS_rip(%rsp)
+        movq  %rax,UREGS_rip(%rsp)      # fixup regular stack
+
+#ifdef CONFIG_XEN_SHSTK
+        mov    $1, %edi
+        rdsspq %rdi
+        cmp    $1, %edi
+        je     .L_exn_shstk_done
+        wrssq  %rax, (%rdi)             # fixup shadow stack
+.L_exn_shstk_done:
+#endif
         subq  $8,UREGS_rsp(%rsp)        # add ec/ev to previous stack frame
         testb $15,UREGS_rsp(%rsp)       # return %rsp is now aligned?
         jz    1f                        # then there is a pad quadword already
-- 
2.11.0




* [PATCH 13/16] x86/ioemul: Rewrite stub generation to be shadow stack compatible
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (11 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 12/16] x86/extable: Adjust extable handling " Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-07 13:46   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 14/16] x86/alt: Adjust _alternative_instructions() to not create shadow stacks Andrew Cooper
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

The logic is completely undocumented and almost impossible to follow.  It
actually uses return oriented programming.  Rewrite it to conform to more
normal call mechanics, and leave a big comment explaining things.  As well as
the code being easier to follow, it will execute faster as it isn't fighting
the branch predictor.

Move the ioemul_handle_quirk() function pointer from traps.c to
ioport_emulate.c.  There is no reason for it to live anywhere other than in one
of the two translation units which use it.  Alter the behaviour to return the
number of bytes written into the stub.

Access the addresses of the host/guest helpers with extern const char arrays.
Nothing good will come of C thinking they are regular functions.
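The rel32 emission backing the new APPEND_CALL() helper can be sketched in
isolation.  Below is an illustrative, self-contained model of the displacement
arithmetic only (the buffer and function names are local to this sketch, not
Xen's; it assumes a little-endian host, as on x86):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Append a 5-byte "call rel32" at p, targeting func, as if the stub
 * buffer were mapped at stub_va.  Returns the new write pointer.
 */
static unsigned char *append_call(unsigned char *buf, unsigned char *p,
                                  uint64_t stub_va, uint64_t func)
{
    /* rel32 is relative to the end of the 5-byte call instruction. */
    int64_t disp = (int64_t)func - (int64_t)(stub_va + (p - buf) + 5);
    int32_t rel32 = (int32_t)disp;

    assert(rel32 == disp);      /* target must be within +/-2G of the stub */

    *p++ = 0xe8;                /* call opcode */
    memcpy(p, &rel32, 4);       /* little-endian rel32 immediate */

    return p + 4;
}
```

The real helper in the patch uses BUG_ON() rather than assert(), and writes
into the per-CPU stub mapping instead of a local buffer.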

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
--
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>

Posted previously on its perf benefits alone, but here is the real reason
behind the change.
---
 xen/arch/x86/ioport_emulate.c  | 11 ++---
 xen/arch/x86/pv/emul-priv-op.c | 91 +++++++++++++++++++++++++++++++-----------
 xen/arch/x86/pv/gpr_switch.S   | 37 +++++------------
 xen/arch/x86/traps.c           |  3 --
 xen/include/asm-x86/io.h       |  3 +-
 5 files changed, 85 insertions(+), 60 deletions(-)

diff --git a/xen/arch/x86/ioport_emulate.c b/xen/arch/x86/ioport_emulate.c
index 499c1f6056..f7511a9c49 100644
--- a/xen/arch/x86/ioport_emulate.c
+++ b/xen/arch/x86/ioport_emulate.c
@@ -8,7 +8,10 @@
 #include <xen/sched.h>
 #include <xen/dmi.h>
 
-static bool ioemul_handle_proliant_quirk(
+unsigned int (*ioemul_handle_quirk)(
+    u8 opcode, char *io_emul_stub, struct cpu_user_regs *regs);
+
+static unsigned int ioemul_handle_proliant_quirk(
     u8 opcode, char *io_emul_stub, struct cpu_user_regs *regs)
 {
     static const char stub[] = {
@@ -19,18 +22,16 @@ static bool ioemul_handle_proliant_quirk(
         0xa8, 0x80, /*    test $0x80, %al */
         0x75, 0xfb, /*    jnz 1b          */
         0x9d,       /*    popf            */
-        0xc3,       /*    ret             */
     };
     uint16_t port = regs->dx;
     uint8_t value = regs->al;
 
     if ( (opcode != 0xee) || (port != 0xcd4) || !(value & 0x80) )
-        return false;
+        return 0;
 
     memcpy(io_emul_stub, stub, sizeof(stub));
-    BUILD_BUG_ON(IOEMUL_QUIRK_STUB_BYTES < sizeof(stub));
 
-    return true;
+    return sizeof(stub);
 }
 
 /* This table is the set of system-specific I/O emulation hooks. */
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index e24b84f46a..f150886711 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -54,51 +54,96 @@ struct priv_op_ctxt {
     unsigned int bpmatch;
 };
 
-/* I/O emulation support. Helper routines for, and type of, the stack stub. */
-void host_to_guest_gpr_switch(struct cpu_user_regs *);
-unsigned long guest_to_host_gpr_switch(unsigned long);
+/* I/O emulation helpers.  Use non-standard calling conventions. */
+extern const char load_guest_gprs[], save_guest_gprs[];
 
 typedef void io_emul_stub_t(struct cpu_user_regs *);
 
 static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
                                           unsigned int port, unsigned int bytes)
 {
+    /*
+     * Construct a stub for IN/OUT emulation.
+     *
+     * Some platform drivers communicate with the SMM handler using GPRs as a
+     * mailbox.  Therefore, we must perform the emulation with the hardware
+     * domain's registers in view.
+     *
+     * We write a stub of the following form, using the guest load/save
+     * helpers (abnormal calling conventions), and one of several possible
+     * stubs performing the real I/O.
+     */
+    static const char prologue[] = {
+        0x53,       /* push %rbx */
+        0x55,       /* push %rbp */
+        0x41, 0x54, /* push %r12 */
+        0x41, 0x55, /* push %r13 */
+        0x41, 0x56, /* push %r14 */
+        0x41, 0x57, /* push %r15 */
+        0x57,       /* push %rdi (param for save_guest_gprs) */
+    };              /* call load_guest_gprs */
+                    /* <I/O stub> */
+                    /* call save_guest_gprs */
+    static const char epilogue[] = {
+        0x5f,       /* pop %rdi  */
+        0x41, 0x5f, /* pop %r15  */
+        0x41, 0x5e, /* pop %r14  */
+        0x41, 0x5d, /* pop %r13  */
+        0x41, 0x5c, /* pop %r12  */
+        0x5d,       /* pop %rbp  */
+        0x5b,       /* pop %rbx  */
+        0xc3,       /* ret       */
+    };
+
     struct stubs *this_stubs = &this_cpu(stubs);
     unsigned long stub_va = this_stubs->addr + STUB_BUF_SIZE / 2;
-    long disp;
-    bool use_quirk_stub = false;
+    unsigned int quirk_bytes = 0;
+    char *p;
+
+    /* Helpers - Read outer scope but only modify p. */
+#define APPEND_BUFF(b) ({ memcpy(p, b, sizeof(b)); p += sizeof(b); })
+#define APPEND_CALL(f)                                                  \
+    ({                                                                  \
+        long disp = (long)(f) - (stub_va + p - ctxt->io_emul_stub + 5); \
+        BUG_ON((int32_t)disp != disp);                                  \
+        *p++ = 0xe8;                                                    \
+        *(int32_t *)p = disp; p += 4;                                   \
+    })
 
     if ( !ctxt->io_emul_stub )
         ctxt->io_emul_stub =
             map_domain_page(_mfn(this_stubs->mfn)) + (stub_va & ~PAGE_MASK);
 
-    /* call host_to_guest_gpr_switch */
-    ctxt->io_emul_stub[0] = 0xe8;
-    disp = (long)host_to_guest_gpr_switch - (stub_va + 5);
-    BUG_ON((int32_t)disp != disp);
-    *(int32_t *)&ctxt->io_emul_stub[1] = disp;
+    p = ctxt->io_emul_stub;
+
+    APPEND_BUFF(prologue);
+    APPEND_CALL(load_guest_gprs);
 
+    /* Some platforms might need to quirk the stub for specific inputs. */
     if ( unlikely(ioemul_handle_quirk) )
-        use_quirk_stub = ioemul_handle_quirk(opcode, &ctxt->io_emul_stub[5],
-                                             ctxt->ctxt.regs);
+    {
+        quirk_bytes = ioemul_handle_quirk(opcode, p, ctxt->ctxt.regs);
+        p += quirk_bytes;
+    }
 
-    if ( !use_quirk_stub )
+    /* Default I/O stub. */
+    if ( likely(!quirk_bytes) )
     {
-        /* data16 or nop */
-        ctxt->io_emul_stub[5] = (bytes != 2) ? 0x90 : 0x66;
-        /* <io-access opcode> */
-        ctxt->io_emul_stub[6] = opcode;
-        /* imm8 or nop */
-        ctxt->io_emul_stub[7] = !(opcode & 8) ? port : 0x90;
-        /* ret (jumps to guest_to_host_gpr_switch) */
-        ctxt->io_emul_stub[8] = 0xc3;
+        *p++ = (bytes != 2) ? 0x90 : 0x66;  /* data16 or nop */
+        *p++ = opcode;                      /* <opcode>      */
+        *p++ = !(opcode & 8) ? port : 0x90; /* imm8 or nop   */
     }
 
-    BUILD_BUG_ON(STUB_BUF_SIZE / 2 < MAX(9, /* Default emul stub */
-                                         5 + IOEMUL_QUIRK_STUB_BYTES));
+    APPEND_CALL(save_guest_gprs);
+    APPEND_BUFF(epilogue);
+
+    BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
     /* Handy function-typed pointer to the stub. */
     return (void *)stub_va;
+
+#undef APPEND_CALL
+#undef APPEND_BUFF
 }
 
 
diff --git a/xen/arch/x86/pv/gpr_switch.S b/xen/arch/x86/pv/gpr_switch.S
index 6d26192c2c..e3f8037b69 100644
--- a/xen/arch/x86/pv/gpr_switch.S
+++ b/xen/arch/x86/pv/gpr_switch.S
@@ -9,59 +9,42 @@
 
 #include <asm/asm_defns.h>
 
-ENTRY(host_to_guest_gpr_switch)
-        movq  (%rsp), %rcx
-        movq  %rdi, (%rsp)
+/* Load guest GPRs.  Parameter in %rdi, clobbers all registers. */
+ENTRY(load_guest_gprs)
         movq  UREGS_rdx(%rdi), %rdx
-        pushq %rbx
         movq  UREGS_rax(%rdi), %rax
         movq  UREGS_rbx(%rdi), %rbx
-        pushq %rbp
         movq  UREGS_rsi(%rdi), %rsi
         movq  UREGS_rbp(%rdi), %rbp
-        pushq %r12
-        movq  UREGS_r8(%rdi), %r8
+        movq  UREGS_r8 (%rdi), %r8
         movq  UREGS_r12(%rdi), %r12
-        pushq %r13
-        movq  UREGS_r9(%rdi), %r9
+        movq  UREGS_r9 (%rdi), %r9
         movq  UREGS_r13(%rdi), %r13
-        pushq %r14
         movq  UREGS_r10(%rdi), %r10
         movq  UREGS_r14(%rdi), %r14
-        pushq %r15
         movq  UREGS_r11(%rdi), %r11
         movq  UREGS_r15(%rdi), %r15
-        pushq %rcx /* dummy push, filled by guest_to_host_gpr_switch pointer */
-        pushq %rcx
-        leaq  guest_to_host_gpr_switch(%rip),%rcx
-        movq  %rcx,8(%rsp)
         movq  UREGS_rcx(%rdi), %rcx
         movq  UREGS_rdi(%rdi), %rdi
         ret
 
-ENTRY(guest_to_host_gpr_switch)
+/* Save guest GPRs.  Parameter on the stack above the return address. */
+ENTRY(save_guest_gprs)
         pushq %rdi
-        movq  7*8(%rsp), %rdi
+        movq  2*8(%rsp), %rdi
         movq  %rax, UREGS_rax(%rdi)
-        popq  UREGS_rdi(%rdi)
+        popq        UREGS_rdi(%rdi)
         movq  %r15, UREGS_r15(%rdi)
         movq  %r11, UREGS_r11(%rdi)
-        popq  %r15
         movq  %r14, UREGS_r14(%rdi)
         movq  %r10, UREGS_r10(%rdi)
-        popq  %r14
         movq  %r13, UREGS_r13(%rdi)
-        movq  %r9, UREGS_r9(%rdi)
-        popq  %r13
+        movq  %r9,  UREGS_r9 (%rdi)
         movq  %r12, UREGS_r12(%rdi)
-        movq  %r8, UREGS_r8(%rdi)
-        popq  %r12
+        movq  %r8,  UREGS_r8 (%rdi)
         movq  %rbp, UREGS_rbp(%rdi)
         movq  %rsi, UREGS_rsi(%rdi)
-        popq  %rbp
         movq  %rbx, UREGS_rbx(%rdi)
         movq  %rdx, UREGS_rdx(%rdi)
-        popq  %rbx
         movq  %rcx, UREGS_rcx(%rdi)
-        popq  %rcx
         ret
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 2354357cc1..3923950df7 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -117,9 +117,6 @@ idt_entry_t *idt_tables[NR_CPUS] __read_mostly;
  */
 DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_page, tss_page);
 
-bool (*ioemul_handle_quirk)(
-    u8 opcode, char *io_emul_stub, struct cpu_user_regs *regs);
-
 static int debug_stack_lines = 20;
 integer_param("debug_stack_lines", debug_stack_lines);
 
diff --git a/xen/include/asm-x86/io.h b/xen/include/asm-x86/io.h
index 8708b79b99..c4ec52cba7 100644
--- a/xen/include/asm-x86/io.h
+++ b/xen/include/asm-x86/io.h
@@ -49,8 +49,7 @@ __OUT(w,"w",short)
 __OUT(l,,int)
 
 /* Function pointer used to handle platform specific I/O port emulation. */
-#define IOEMUL_QUIRK_STUB_BYTES 10
-extern bool (*ioemul_handle_quirk)(
+extern unsigned int (*ioemul_handle_quirk)(
     u8 opcode, char *io_emul_stub, struct cpu_user_regs *regs);
 
 #endif
-- 
2.11.0




* [PATCH 14/16] x86/alt: Adjust _alternative_instructions() to not create shadow stacks
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (12 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 13/16] x86/ioemul: Rewrite stub generation " Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-07 13:49   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible Andrew Cooper
  2020-05-01 22:58 ` [PATCH 16/16] x86/shstk: Activate Supervisor Shadow Stacks Andrew Cooper
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

The current alternatives algorithm clears CR0.WP and writes into .text.  This
has a side effect of the mappings becoming shadow stacks once CET is active.

Adjust _alternative_instructions() to clean up after itself.  This involves
extending the set of bits modify_xen_mappings() can alter to include Dirty
(and Accessed for good measure).
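As a side illustration of why this matters: under CET, a present, read-only,
Dirty mapping *is* a shadow stack mapping, so the post-patching pass must be
able to clear Dirty.  A minimal model of the flag replacement follows (bit
positions match the x86 PTE encoding; the helper name is invented for this
sketch and is not Xen's):

```c
#include <assert.h>
#include <stdint.h>

#define _PAGE_PRESENT  (1ull << 0)
#define _PAGE_RW       (1ull << 1)
#define _PAGE_ACCESSED (1ull << 5)
#define _PAGE_DIRTY    (1ull << 6)
#define _PAGE_NX       (1ull << 63)

/* Set of PTE bits modify_xen_mappings() may now alter. */
#define FLAGS_MASK \
    (_PAGE_NX | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_PRESENT)

/* Replace the modifiable flags of @pte with @nf (illustrative only). */
static uint64_t apply_flags(uint64_t pte, uint64_t nf)
{
    return (pte & ~FLAGS_MASK) | (nf & FLAGS_MASK);
}
```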

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/alternative.c | 14 ++++++++++++++
 xen/arch/x86/mm.c          |  6 +++---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index ce2b4302e6..004e9ede25 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -21,6 +21,7 @@
 #include <asm/processor.h>
 #include <asm/alternative.h>
 #include <xen/init.h>
+#include <asm/setup.h>
 #include <asm/system.h>
 #include <asm/traps.h>
 #include <asm/nmi.h>
@@ -398,6 +399,19 @@ static void __init _alternative_instructions(bool force)
         panic("Timed out waiting for alternatives self-NMI to hit\n");
 
     set_nmi_callback(saved_nmi_callback);
+
+    /*
+     * When Xen is using shadow stacks, the alternatives clearing CR0.WP and
+     * writing into the mappings set dirty bits, turning the mappings into
+     * shadow stack mappings.
+     *
+     * While we can execute from them, this would also permit them to be the
+     * target of WRSS instructions, so reset the dirty after patching.
+     */
+    if ( cpu_has_xen_shstk )
+        modify_xen_mappings(XEN_VIRT_START + MB(2),
+                            (unsigned long)&__2M_text_end,
+                            PAGE_HYPERVISOR_RX);
 }
 
 void __init alternative_instructions(void)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 4e2c3c9735..26b01cb917 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5448,8 +5448,8 @@ int populate_pt_range(unsigned long virt, unsigned long nr_mfns)
  * mappings, but will shatter superpages if necessary, and will destroy
  * mappings if not passed _PAGE_PRESENT.
  *
- * The only flags considered are NX, RW and PRESENT.  All other input flags
- * are ignored.
+ * The only flags considered are NX, D, A, RW and PRESENT.  All other input
+ * flags are ignored.
  *
  * It is an error to call with present flags over an unpopulated range.
  */
@@ -5462,7 +5462,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     unsigned long v = s;
 
     /* Set of valid PTE bits which may be altered. */
-#define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
+#define FLAGS_MASK (_PAGE_NX|_PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_RW|_PAGE_PRESENT)
     nf &= FLAGS_MASK;
 
     ASSERT(IS_ALIGNED(s, PAGE_SIZE));
-- 
2.11.0




* [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (13 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 14/16] x86/alt: Adjust _alternative_instructions() to not create shadow stacks Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-07 14:12   ` Jan Beulich
  2020-05-01 22:58 ` [PATCH 16/16] x86/shstk: Activate Supervisor Shadow Stacks Andrew Cooper
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

The SYSCALL/SYSRET paths need to use {SET,CLR}SSBSY.  The IRET to guest paths
must not, which forces us to spill a register to the stack.

The IST switch onto the primary stack is not great as we have an instruction
boundary with no shadow stack.  This is the least bad option available.

These paths are not used before shadow stacks are properly established, so can
use alternatives to avoid extra runtime CET detection logic.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/x86_64/compat/entry.S |  2 +-
 xen/arch/x86/x86_64/entry.S        | 19 ++++++++++++++++++-
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 3cd375bd48..7816d0d4ac 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -198,7 +198,7 @@ ENTRY(cr4_pv32_restore)
 
 /* See lstar_enter for entry register state. */
 ENTRY(cstar_enter)
-        /* sti could live here when we don't switch page tables below. */
+        ALTERNATIVE "", "setssbsy", X86_FEATURE_XEN_SHSTK
         CR4_PV32_RESTORE
         movq  8(%rsp),%rax /* Restore %rax. */
         movq  $FLAT_USER_SS32, 8(%rsp) /* Assume a 64bit domain.  Compat handled lower. */
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 06da350ba0..91cd8f94fd 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -194,6 +194,15 @@ restore_all_guest:
         movq  8(%rsp),%rcx            # RIP
         ja    iret_exit_to_guest
 
+        /* Clear the supervisor shadow stack token busy bit. */
+.macro rag_clrssbsy
+        push %rax
+        rdsspq %rax
+        clrssbsy (%rax)
+        pop %rax
+.endm
+        ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
+
         cmpw  $FLAT_USER_CS32,16(%rsp)# CS
         movq  32(%rsp),%rsp           # RSP
         je    1f
@@ -226,7 +235,7 @@ iret_exit_to_guest:
  * %ss must be saved into the space left by the trampoline.
  */
 ENTRY(lstar_enter)
-        /* sti could live here when we don't switch page tables below. */
+        ALTERNATIVE "", "setssbsy", X86_FEATURE_XEN_SHSTK
         movq  8(%rsp),%rax /* Restore %rax. */
         movq  $FLAT_KERNEL_SS,8(%rsp)
         pushq %r11
@@ -877,6 +886,14 @@ handle_ist_exception:
         movl  $UREGS_kernel_sizeof/8,%ecx
         movq  %rdi,%rsp
         rep   movsq
+
+        /* Switch Shadow Stacks */
+.macro ist_switch_shstk
+        rdsspq %rdi
+        clrssbsy (%rdi)
+        setssbsy
+.endm
+        ALTERNATIVE "", ist_switch_shstk, X86_FEATURE_XEN_SHSTK
 1:
 #else
         ASSERT_CONTEXT_IS_XEN
-- 
2.11.0




* [PATCH 16/16] x86/shstk: Activate Supervisor Shadow Stacks
  2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
                   ` (14 preceding siblings ...)
  2020-05-01 22:58 ` [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible Andrew Cooper
@ 2020-05-01 22:58 ` Andrew Cooper
  2020-05-07 14:54   ` Jan Beulich
  15 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-01 22:58 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Wei Liu, Jan Beulich, Roger Pau Monné

With all other plumbing in place, activate shadow stacks when possible.

The BSP needs to wait until alternatives have run (to avoid interaction with
CR0.WP), and after the first reset_stack_and_jump() to avoid having a pristine
shadow stack interact in problematic ways with an in-use regular stack.
Activate shadow stack in reinit_bsp_stack().

APs have all infrastructure set up by the booting CPU, so enable shadow stacks
before entering C.  The S3 path needs to save and restore SSP alongside RSP.

The crash path needs to turn CET off to avoid interfering with the kexec
kernel's environment.
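The recurring 0x5ff8 constant is the location of the primary shadow stack's
supervisor token within Xen's 32k (8-page) per-CPU stack block: the top eight
bytes of the page at offset 0x5000.  A small sketch of the MSR_PL0_SSP value
computation, mirroring the "and $~(STACK_SIZE - 1); or $0x5ff8" sequences in
the patch (helper name invented; the asm operates on 32-bit register halves,
which this sketch glosses over):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE  4096ull
#define STACK_SIZE (PAGE_SIZE << 3)          /* Xen's 8-page stack block */
#define PRIMARY_SSTK_TOKEN_OFFS 0x5ff8ull    /* top of the shadow stack page */

/* Compute MSR_PL0_SSP from any %rsp value within the stack block. */
static uint64_t pl0_ssp(uint64_t rsp)
{
    return (rsp & ~(STACK_SIZE - 1)) | PRIMARY_SSTK_TOKEN_OFFS;
}
```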

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/acpi/wakeup_prot.S | 56 +++++++++++++++++++++++++++++++++++++++++
 xen/arch/x86/boot/x86_64.S      | 30 +++++++++++++++++++++-
 xen/arch/x86/cpu/common.c       |  5 ++++
 xen/arch/x86/crash.c            |  7 ++++++
 xen/arch/x86/setup.c            | 26 +++++++++++++++++++
 xen/arch/x86/spec_ctrl.c        |  8 ++++++
 xen/include/asm-x86/msr-index.h |  3 +++
 xen/include/asm-x86/x86-defns.h |  1 +
 8 files changed, 135 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index 4dba6020a7..22c0f8cc79 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -1,3 +1,8 @@
+#include <asm/config.h>
+#include <asm/msr-index.h>
+#include <asm/page.h>
+#include <asm/processor.h>
+
         .file __FILE__
         .text
         .code64
@@ -15,6 +20,12 @@ ENTRY(do_suspend_lowlevel)
         mov     %cr0, %rax
         mov     %rax, saved_cr0(%rip)
 
+#ifdef CONFIG_XEN_SHSTK
+        mov     $1, %eax
+        rdsspq  %rax
+        mov     %rax, saved_ssp(%rip)
+#endif
+
         /* enter sleep state physically */
         mov     $3, %edi
         call    acpi_enter_sleep_state
@@ -48,6 +59,48 @@ ENTRY(s3_resume)
         pushq   %rax
         lretq
 1:
+#ifdef CONFIG_XEN_SHSTK
+        /*
+         * Restoring SSP is a little convoluted, because we are intercepting
+         * the middle of an in-use shadow stack.  Write a temporary supervisor
+         * token under the stack, so SETSSBSY takes us where we want, then
+         * reset MSR_PL0_SSP to its usual value and pop the temporary token.
+         */
+        mov     saved_ssp(%rip), %rdi
+        cmpq    $1, %rdi
+        je      .L_shstk_done
+
+        /* Write a supervisor token under SSP. */
+        sub     $8, %rdi
+        mov     %rdi, (%rdi)
+
+        /* Enable CET. */
+        mov     $MSR_S_CET, %ecx
+        xor     %edx, %edx
+        mov     $CET_SHSTK_EN | CET_WRSS_EN, %eax
+        wrmsr
+
+        /* Load the temporary token into MSR_PL0_SSP. */
+        mov     $MSR_PL0_SSP, %ecx
+        mov     %rdi, %rdx
+        shr     $32, %rdx
+        mov     %edi, %eax
+        wrmsr
+
+        /* Activate our temporary token. */
+        mov     $XEN_MINIMAL_CR4 | X86_CR4_CET, %ebx
+        mov     %rbx, %cr4
+        setssbsy
+
+        /* Reset MSR_PL0_SSP back to its expected value. */
+        and     $~(STACK_SIZE - 1), %eax
+        or      $0x5ff8, %eax
+        wrmsr
+
+        /* Pop the temporary token off the stack. */
+        mov     $2, %eax
+        incsspd %eax
+.L_shstk_done:
+#endif
 
         call    load_system_tables
 
@@ -65,6 +118,9 @@ ENTRY(s3_resume)
 
 saved_rsp:      .quad   0
 saved_cr0:      .quad   0
+#ifdef CONFIG_XEN_SHSTK
+saved_ssp:      .quad   0
+#endif
 
 GLOBAL(saved_magic)
         .long   0x9abcdef0
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 314a32a19f..59b770f955 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -28,8 +28,36 @@ ENTRY(__high_start)
         lretq
 1:
         test    %ebx,%ebx
-        jnz     start_secondary
+        jz      .L_bsp
 
+        /* APs.  Set up shadow stacks before entering C. */
+
+        testl   $cpufeat_mask(X86_FEATURE_XEN_SHSTK), \
+                CPUINFO_FEATURE_OFFSET(X86_FEATURE_XEN_SHSTK) + boot_cpu_data(%rip)
+        je      .L_ap_shstk_done
+
+        mov     $MSR_S_CET, %ecx
+        xor     %edx, %edx
+        mov     $CET_SHSTK_EN | CET_WRSS_EN, %eax
+        wrmsr
+
+        mov     $MSR_PL0_SSP, %ecx
+        mov     %rsp, %rdx
+        shr     $32, %rdx
+        mov     %esp, %eax
+        and     $~(STACK_SIZE - 1), %eax
+        or      $0x5ff8, %eax
+        wrmsr
+
+        mov     $XEN_MINIMAL_CR4 | X86_CR4_CET, %ecx
+        mov     %rcx, %cr4
+        setssbsy
+
+.L_ap_shstk_done:
+        call    start_secondary
+        BUG     /* start_secondary() shouldn't return. */
+
+.L_bsp:
         /* Pass off the Multiboot info structure to C land (if applicable). */
         mov     multiboot_ptr(%rip),%edi
         call    __start_xen
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 3962717aa5..a77be36349 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -323,6 +323,11 @@ void __init early_cpu_init(void)
 	       x86_cpuid_vendor_to_str(c->x86_vendor), c->x86, c->x86,
 	       c->x86_model, c->x86_model, c->x86_mask, eax);
 
+	if (c->cpuid_level >= 7) {
+		cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
+		c->x86_capability[cpufeat_word(X86_FEATURE_CET_SS)] = ecx;
+	}
+
 	eax = cpuid_eax(0x80000000);
 	if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
 		eax = cpuid_eax(0x80000008);
diff --git a/xen/arch/x86/crash.c b/xen/arch/x86/crash.c
index 450eecd46b..0611b4fb9b 100644
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -200,6 +200,13 @@ void machine_crash_shutdown(void)
     /* Reset CPUID masking and faulting to the host's default. */
     ctxt_switch_levelling(NULL);
 
+    /* Disable shadow stacks. */
+    if ( cpu_has_xen_shstk )
+    {
+        wrmsrl(MSR_S_CET, 0);
+        write_cr4(read_cr4() & ~X86_CR4_CET);
+    }
+
     info = kexec_crash_save_info();
     info->xen_phys_start = xen_phys_start;
     info->dom0_pfn_to_mfn_frame_list_list =
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index aa21201507..5c574b2035 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -664,6 +664,13 @@ static void __init noreturn reinit_bsp_stack(void)
     stack_base[0] = stack;
     memguard_guard_stack(stack);
 
+    if ( cpu_has_xen_shstk )
+    {
+        wrmsrl(MSR_PL0_SSP, (unsigned long)stack + 0x5ff8);
+        wrmsrl(MSR_S_CET, CET_SHSTK_EN | CET_WRSS_EN);
+        asm volatile ("setssbsy" ::: "memory");
+    }
+
     reset_stack_and_jump_nolp(init_done);
 }
 
@@ -985,6 +992,21 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     /* This must come before e820 code because it sets paddr_bits. */
     early_cpu_init();
 
+    /* Choose shadow stack early, to set infrastructure up appropriately. */
+    if ( opt_xen_shstk && boot_cpu_has(X86_FEATURE_CET_SS) )
+    {
+        printk("Enabling Supervisor Shadow Stacks\n");
+
+        setup_force_cpu_cap(X86_FEATURE_XEN_SHSTK);
+#ifdef CONFIG_PV32
+        if ( opt_pv32 )
+        {
+            opt_pv32 = 0;
+            printk("  - Disabling PV32 due to Shadow Stacks\n");
+        }
+#endif
+    }
+
     /* Sanitise the raw E820 map to produce a final clean version. */
     max_page = raw_max_page = init_e820(memmap_type, &e820_raw);
 
@@ -1721,6 +1743,10 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     alternative_branches();
 
+    /* Defer CR4.CET until alternatives have finished playing with CR0.WP */
+    if ( cpu_has_xen_shstk )
+        set_in_cr4(X86_CR4_CET);
+
     /*
      * NB: when running as a PV shim VCPUOP_up/down is wired to the shim
      * physical cpu_add/remove functions, so launch the guest with only
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index c5d8e587a8..a94be2d594 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -882,6 +882,14 @@ void __init init_speculation_mitigations(void)
     hw_smt_enabled = check_smt_enabled();
 
     /*
+     * First, disable the use of retpolines if Xen is using shadow stacks, as
+     * they are incompatible.
+     */
+    if ( cpu_has_xen_shstk &&
+         (opt_thunk == THUNK_DEFAULT || opt_thunk == THUNK_RETPOLINE) )
+        thunk = THUNK_JMP;
+
+    /*
      * Has the user specified any custom BTI mitigations?  If so, follow their
      * instructions exactly and disable all heuristics.
      */
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 85c5f20b76..cdfb7b047b 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -68,6 +68,9 @@
 
 #define MSR_U_CET                           0x000006a0
 #define MSR_S_CET                           0x000006a2
+#define  CET_SHSTK_EN                       (_AC(1, ULL) <<  0)
+#define  CET_WRSS_EN                        (_AC(1, ULL) <<  1)
+
 #define MSR_PL0_SSP                         0x000006a4
 #define MSR_PL1_SSP                         0x000006a5
 #define MSR_PL2_SSP                         0x000006a6
diff --git a/xen/include/asm-x86/x86-defns.h b/xen/include/asm-x86/x86-defns.h
index 84e15b15be..4051a80485 100644
--- a/xen/include/asm-x86/x86-defns.h
+++ b/xen/include/asm-x86/x86-defns.h
@@ -73,6 +73,7 @@
 #define X86_CR4_SMEP       0x00100000 /* enable SMEP */
 #define X86_CR4_SMAP       0x00200000 /* enable SMAP */
 #define X86_CR4_PKE        0x00400000 /* enable PKE */
+#define X86_CR4_CET        0x00800000 /* Control-flow Enforcement Technology */
 
 /*
  * XSTATE component flags in XCR0
-- 
2.11.0




* Re: [PATCH 01/16] x86/traps: Drop last_extable_addr
  2020-05-01 22:58 ` [PATCH 01/16] x86/traps: Drop last_extable_addr Andrew Cooper
@ 2020-05-04 12:44   ` Jan Beulich
  2020-05-11 14:53     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-04 12:44 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> The only user of this facility is dom_crash_sync_extable() by passing 0 into
> asm_domain_crash_synchronous().  The common error cases are already covered
> with show_page_walk(), leaving only %ss/%fs selector/segment errors in the
> compat case.
> 
> Point at dom_crash_sync_extable in the error message, which is about as good
> as the error hints from other users of asm_domain_crash_synchronous(), and
> drop last_extable_addr.

While I'm not entirely opposed, I'd like you to clarify that you indeed
mean to (mostly) revert your own improvement from 6 or 7 years back
(commit 8e0da8c07f4f). I'm also surprised to find this as part of the
series it's in - in what way does this logic get in the way of CET-SS?

Jan



* Re: [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap()
  2020-05-01 22:58 ` [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap() Andrew Cooper
@ 2020-05-04 13:08   ` Jan Beulich
  2020-05-11 15:01     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-04 13:08 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> For one, they render the vector in a different base.
> 
> Introduce X86_EXC_* constants and vec_name() to refer to exceptions by their
> mnemonic, which starts bringing the code/diagnostics in line with the Intel
> and AMD manuals.

For this "bringing in line" purpose I'd like to see whether you could
live with some adjustments to how you're currently doing things:
- NMI is nowhere prefixed by #, hence I think we'd better not do so
  either; may require embedding the #-es in the names[] table, or not
  using N() for NMI
- neither Coprocessor Segment Overrun nor vector 0x0f has a mnemonic
  and hence I think we shouldn't invent one; just treat them like
  other reserved vectors (of which at least vector 0x09 indeed is one
  on x86-64)?

Jan



* Re: [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent
  2020-05-01 22:58 ` [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent Andrew Cooper
@ 2020-05-04 13:20   ` Jan Beulich
  2020-05-11 15:14     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-04 13:20 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -774,10 +774,27 @@ static void do_reserved_trap(struct cpu_user_regs *regs)
>            trapnr, vec_name(trapnr), regs->error_code);
>  }
>  
> +static bool exception_fixup(struct cpu_user_regs *regs, bool print)
> +{
> +    unsigned long fixup = search_exception_table(regs);
> +
> +    if ( unlikely(fixup == 0) )
> +        return false;
> +
> +    /* Can currently be triggered by guests.  Make sure we ratelimit. */
> +    if ( IS_ENABLED(CONFIG_DEBUG) && print )

I didn't think we consider dprintk()-s a possible security issue.
Why would we consider so a printk() hidden behind
IS_ENABLED(CONFIG_DEBUG)? IOW I think one of XENLOG_GUEST and
IS_ENABLED(CONFIG_DEBUG) wants dropping.

> @@ -1466,12 +1468,11 @@ void do_page_fault(struct cpu_user_regs *regs)
>          if ( pf_type != real_fault )
>              return;
>  
> -        if ( likely((fixup = search_exception_table(regs)) != 0) )
> +        if ( likely(exception_fixup(regs, false)) )
>          {
>              perfc_incr(copy_user_faults);
>              if ( unlikely(regs->error_code & PFEC_reserved_bit) )
>                  reserved_bit_page_fault(addr, regs);
> -            regs->rip = fixup;

I'm afraid this modification can't validly be pulled ahead -
show_execution_state(), as called from reserved_bit_page_fault(),
wants to / should print the original RIP, not the fixed up one.

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/16] x86/smpboot: Write the top-of-stack block in cpu_smpboot_alloc()
  2020-05-01 22:58 ` [PATCH 04/16] x86/smpboot: Write the top-of-stack block in cpu_smpboot_alloc() Andrew Cooper
@ 2020-05-04 13:22   ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-04 13:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> This allows the AP boot assembly to use per-cpu variables, and brings the
> semantics closer to that of the BSP, which can use per-cpu variables from the
> start of day.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/16] x86/shstk: Introduce Supervisor Shadow Stack support
  2020-05-01 22:58 ` [PATCH 05/16] x86/shstk: Introduce Supervisor Shadow Stack support Andrew Cooper
@ 2020-05-04 13:52   ` Jan Beulich
  2020-05-11 15:46     ` Andrew Cooper
  2020-05-15 16:21     ` Anthony PERARD
  0 siblings, 2 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-04 13:52 UTC (permalink / raw)
  To: Andrew Cooper, Anthony Perard; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> --- a/xen/arch/x86/Kconfig
> +++ b/xen/arch/x86/Kconfig
> @@ -34,6 +34,9 @@ config ARCH_DEFCONFIG
>  config INDIRECT_THUNK
>  	def_bool $(cc-option,-mindirect-branch-register)
>  
> +config HAS_AS_CET
> +	def_bool $(as-instr,wrssq %rax$(comma)0;setssbsy;endbr64)

I see you add as-instr here as a side effect. Until the other
similar checks get converted, I think for the time being we should
use the old model, to have all such checks in one place. I realize
this means you can't have a Kconfig dependency then.

Also why do you check multiple insns, when just one (like we do
elsewhere) would suffice?

The crucial insns to check are those which got changed shortly
before the release of 2.29 (in the cover letter you
mention 2.32): One of incssp{d,q}, setssbsy, or saveprevssp.
There weren't official binutils releases with the original
insns, but distros may still have picked up intermediate
snapshots.

> @@ -97,6 +100,20 @@ config HVM
>  
>  	  If unsure, say Y.
>  
> +config XEN_SHSTK
> +	bool "Supervisor Shadow Stacks"
> +	depends on HAS_AS_CET && EXPERT = "y"
> +	default y
> +        ---help---
> +          Control-flow Enforcement Technology (CET) is a set of features in
> +          hardware designed to combat Return-oriented Programming (ROP, also
> +          call/jump COP/JOP) attacks.  Shadow Stacks are one CET feature
> +          designed to provide return address protection.
> +
> +          This option arranges for Xen to use CET-SS for its own protection.
> +          When CET-SS is active, 32bit PV guests cannot be used.  Backwards
> +          compatibility can be provided via the PV Shim mechanism.

Indentation looks odd here - the whole help section should
start with hard tabs, I think.

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -95,6 +95,36 @@ unsigned long __initdata highmem_start;
>  size_param("highmem-start", highmem_start);
>  #endif
>  
> +static bool __initdata opt_xen_shstk = true;
> +
> +static int parse_xen(const char *s)
> +{
> +    const char *ss;
> +    int val, rc = 0;
> +
> +    do {
> +        ss = strchr(s, ',');
> +        if ( !ss )
> +            ss = strchr(s, '\0');
> +
> +        if ( (val = parse_boolean("shstk", s, ss)) >= 0 )
> +        {
> +#ifdef CONFIG_XEN_SHSTK
> +            opt_xen_shstk = val;
> +#else
> +            no_config_param("XEN_SHSTK", "xen", s, ss);
> +#endif
> +        }
> +        else
> +            rc = -EINVAL;
> +
> +        s = ss + 1;
> +    } while ( *ss );
> +
> +    return rc;
> +}
> +custom_param("xen", parse_xen);

What's the idea here going forward, i.e. why the new top level
"xen" option? Almost all options are for Xen itself, after all.
Did you perhaps mean this to be "cet"?

Also you surely meant to document this new option?

> --- a/xen/scripts/Kconfig.include
> +++ b/xen/scripts/Kconfig.include
> @@ -31,6 +31,10 @@ cc-option = $(success,$(CC) -Werror $(CLANG_FLAGS) $(1) -E -x c /dev/null -o /de
>  # Return y if the linker supports <flag>, n otherwise
>  ld-option = $(success,$(LD) -v $(1))
>  
> +# $(as-instr,<instr>)
> +# Return y if the assembler supports <instr>, n otherwise
> +as-instr = $(success,printf "%b\n" "$(1)" | $(CC) $(CLANG_FLAGS) -c -x assembler -o /dev/null -)

CLANG_FLAGS caught my eye here, then noticing that cc-option
also uses it. Anthony - what's the deal with this? It doesn't
look to get defined anywhere, and I also don't see what's
clang-specific about these constructs.

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/16] x86/traps: Implement #CP handler and extend #PF for shadow stacks
  2020-05-01 22:58 ` [PATCH 06/16] x86/traps: Implement #CP handler and extend #PF for shadow stacks Andrew Cooper
@ 2020-05-04 14:10   ` Jan Beulich
  2020-05-11 17:20     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-04 14:10 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> @@ -1457,6 +1451,10 @@ void do_page_fault(struct cpu_user_regs *regs)
>      {
>          enum pf_type pf_type = spurious_page_fault(addr, regs);
>  
> +        /* Any fault on a shadow stack access is a bug in Xen. */
> +        if ( error_code & PFEC_shstk )
> +            goto fatal;

Not going through the full spurious_page_fault() in this case
would seem desirable, as would be at least a respective
adjustment to __page_fault_type(). Perhaps such an adjustment
could then avoid the change (and the need for goto) here?

> @@ -1906,6 +1905,43 @@ void do_debug(struct cpu_user_regs *regs)
>      pv_inject_hw_exception(TRAP_debug, X86_EVENT_NO_EC);
>  }
>  
> +void do_entry_CP(struct cpu_user_regs *regs)

If there's a plan to change other exception handlers to this naming
scheme too, I can live with the intermediate inconsistency.
Otherwise, though, I'd prefer a name closer to what other handlers
use, e.g. do_control_prot(). Same for the asm entry point then.

> @@ -940,7 +944,8 @@ autogen_stubs: /* Automatically generated stubs. */
>          entrypoint 1b
>  
>          /* Reserved exceptions, heading towards do_reserved_trap(). */
> -        .elseif vec == TRAP_copro_seg || vec == TRAP_spurious_int || (vec > TRAP_simd_error && vec < TRAP_nr)
> +        .elseif vec == TRAP_copro_seg || vec == TRAP_spurious_int || \
> +                vec == TRAP_virtualisation || (vec > X86_EXC_CP && vec < TRAP_nr)

Use the shorter X86_EXC_VE here, since you don't want/need to
retain what was there before? (I'd be fine with you changing
the two other TRAP_* too at this occasion.)

> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -68,6 +68,7 @@
>  #define PFEC_reserved_bit   (_AC(1,U) << 3)
>  #define PFEC_insn_fetch     (_AC(1,U) << 4)
>  #define PFEC_prot_key       (_AC(1,U) << 5)
> +#define PFEC_shstk         (_AC(1,U) << 6)

One too few padding blanks?

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/16] x86/shstk: Re-layout the stack block for shadow stacks
  2020-05-01 22:58 ` [PATCH 07/16] x86/shstk: Re-layout the stack block " Andrew Cooper
@ 2020-05-04 14:24   ` Jan Beulich
  2020-05-11 17:48     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-04 14:24 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -732,14 +732,14 @@ void load_system_tables(void)
>  		.rsp2 = 0x8600111111111111ul,
>  
>  		/*
> -		 * MCE, NMI and Double Fault handlers get their own stacks.
> +		 * #DB, NMI, DF and #MCE handlers get their own stacks.

Then also #DF and #MC?

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -6002,25 +6002,18 @@ void memguard_unguard_range(void *p, unsigned long l)
>  
>  void memguard_guard_stack(void *p)
>  {
> -    /* IST_MAX IST pages + at least 1 guard page + primary stack. */
> -    BUILD_BUG_ON((IST_MAX + 1) * PAGE_SIZE + PRIMARY_STACK_SIZE > STACK_SIZE);
> +    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
>  
> -    memguard_guard_range(p + IST_MAX * PAGE_SIZE,
> -                         STACK_SIZE - PRIMARY_STACK_SIZE - IST_MAX * PAGE_SIZE);
> +    p += 5 * PAGE_SIZE;

The literal 5 here and ...

> +    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
>  }
>  
>  void memguard_unguard_stack(void *p)
>  {
> -    memguard_unguard_range(p + IST_MAX * PAGE_SIZE,
> -                           STACK_SIZE - PRIMARY_STACK_SIZE - IST_MAX * PAGE_SIZE);
> -}
> -
> -bool memguard_is_stack_guard_page(unsigned long addr)
> -{
> -    addr &= STACK_SIZE - 1;
> +    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_RW);
>  
> -    return addr >= IST_MAX * PAGE_SIZE &&
> -           addr < STACK_SIZE - PRIMARY_STACK_SIZE;
> +    p += 5 * PAGE_SIZE;

... here could do with macro-izing: IST_MAX + 1 would already be
a little better, I guess.

Preferably with adjustments along these lines
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 08/16] x86/shstk: Create shadow stacks
  2020-05-01 22:58 ` [PATCH 08/16] x86/shstk: Create " Andrew Cooper
@ 2020-05-04 14:55   ` Jan Beulich
  2020-05-04 15:08     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-04 14:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -748,6 +748,25 @@ void load_system_tables(void)
>  		.bitmap = IOBMP_INVALID_OFFSET,
>  	};
>  
> +	/* Set up the shadow stack IST. */
> +	if ( cpu_has_xen_shstk ) {

This being a Linux style function, you want to omit the blanks
immediately inside the parentheses both here and in the for()
below.

> +		unsigned int i;
> +		uint64_t *ist_ssp = this_cpu(tss_page).ist_ssp;
> +
> +		/* Must point at the supervisor stack token. */
> +		ist_ssp[IST_MCE] = stack_top + (IST_MCE * 0x400) - 8;
> +		ist_ssp[IST_NMI] = stack_top + (IST_NMI * 0x400) - 8;
> +		ist_ssp[IST_DB]  = stack_top + (IST_DB  * 0x400) - 8;
> +		ist_ssp[IST_DF]  = stack_top + (IST_DF  * 0x400) - 8;

Introduce a constant for 0x400, to then also be used in the
invocations of write_sss_token()?

> +		/* Poison unused entries. */
> +		for ( i = IST_MAX;
> +		      i < ARRAY_SIZE(this_cpu(tss_page).ist_ssp); ++i )
> +			ist_ssp[i] = 0x8600111111111111ul;

IST_MAX == IST_DF, so you're overwriting one token here.

> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -434,7 +434,8 @@ struct __packed tss64 {
>      uint16_t :16, bitmap;
>  };
>  struct tss_page {
> -    struct tss64 __aligned(PAGE_SIZE) tss;
> +    uint64_t __aligned(PAGE_SIZE) ist_ssp[8];
> +    struct tss64 tss;
>  };

Just curious - any particular reason you put this ahead of the TSS?

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 08/16] x86/shstk: Create shadow stacks
  2020-05-04 14:55   ` Jan Beulich
@ 2020-05-04 15:08     ` Andrew Cooper
  0 siblings, 0 replies; 66+ messages in thread
From: Andrew Cooper @ 2020-05-04 15:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 04/05/2020 15:55, Jan Beulich wrote:
>> +		/* Poison unused entries. */
>> +		for ( i = IST_MAX;
>> +		      i < ARRAY_SIZE(this_cpu(tss_page).ist_ssp); ++i )
>> +			ist_ssp[i] = 0x8600111111111111ul;
> IST_MAX == IST_DF, so you're overwriting one token here.

And failing to poison entry 0.  This was a bad rearrangement when
tidying the series up.

Unfortunately, testing the #DF path isn't terribly easy.

>> --- a/xen/include/asm-x86/processor.h
>> +++ b/xen/include/asm-x86/processor.h
>> @@ -434,7 +434,8 @@ struct __packed tss64 {
>>      uint16_t :16, bitmap;
>>  };
>>  struct tss_page {
>> -    struct tss64 __aligned(PAGE_SIZE) tss;
>> +    uint64_t __aligned(PAGE_SIZE) ist_ssp[8];
>> +    struct tss64 tss;
>>  };
> Just curious - any particular reason you put this ahead of the TSS?

Yes.  Reduced chance of interacting with a buggy IO bitmap offset.

Furthermore, we could do away with most of the IO emulation quirking, and the
#GP path overhead, if we actually constructed a real IO bitmap for
dom0.  That would require using the 8k following the TSS.

~Andrew


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/16] x86/cpu: Adjust enable_nmis() to be shadow stack compatible
  2020-05-01 22:58 ` [PATCH 09/16] x86/cpu: Adjust enable_nmis() to be shadow stack compatible Andrew Cooper
@ 2020-05-05 14:48   ` Jan Beulich
  2020-05-11 18:48     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-05 14:48 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> When executing an IRET-to-self, the shadow stack must agree with the regular
> stack.  We can't manipulate SSP directly, so have to fake a shadow IRET frame
> by executing 3 CALLs, then editing the result to look correct.
> 
> This is not a fastpath, is called on the BSP long before CET can be set up,
> and may be called on the crash path after CET is disabled.  Use the fact that
> INCSSP is allocated from the hint nop space to construct a test for CET being
> active which is safe on all processors.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
albeit with a question which may make a further change necessary:

> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -544,17 +544,40 @@ static inline void enable_nmis(void)
>  {
>      unsigned long tmp;
>  
> -    asm volatile ( "mov %%rsp, %[tmp]     \n\t"
> -                   "push %[ss]            \n\t"
> -                   "push %[tmp]           \n\t"
> -                   "pushf                 \n\t"
> -                   "push %[cs]            \n\t"
> -                   "lea 1f(%%rip), %[tmp] \n\t"
> -                   "push %[tmp]           \n\t"
> -                   "iretq; 1:             \n\t"
> -                   : [tmp] "=&r" (tmp)
> +    asm volatile ( "mov     %%rsp, %[rsp]        \n\t"
> +                   "lea    .Ldone(%%rip), %[rip] \n\t"
> +#ifdef CONFIG_XEN_SHSTK
> +                   /* Check for CET-SS being active. */
> +                   "mov    $1, %k[ssp]           \n\t"
> +                   "rdsspq %[ssp]                \n\t"
> +                   "cmp    $1, %k[ssp]           \n\t"
> +                   "je     .Lshstk_done          \n\t"
> +
> +                   /* Push 3 words on the shadow stack */
> +                   ".rept 3                      \n\t"
> +                   "call 1f; nop; 1:             \n\t"
> +                   ".endr                        \n\t"
> +
> +                   /* Fixup to be an IRET shadow stack frame */
> +                   "wrssq  %q[cs], -1*8(%[ssp])  \n\t"
> +                   "wrssq  %[rip], -2*8(%[ssp])  \n\t"
> +                   "wrssq  %[ssp], -3*8(%[ssp])  \n\t"
> +
> +                   ".Lshstk_done:"
> +#endif
> +                   /* Write an IRET regular frame */
> +                   "push   %[ss]                 \n\t"
> +                   "push   %[rsp]                \n\t"
> +                   "pushf                        \n\t"
> +                   "push   %q[cs]                \n\t"
> +                   "push   %[rip]                \n\t"
> +                   "iretq                        \n\t"
> +                   ".Ldone:                      \n\t"
> +                   : [rip] "=&r" (tmp),
> +                     [rsp] "=&r" (tmp),
> +                     [ssp] "=&r" (tmp)

Even after an hour of reading and searching through the gcc docs
I can't convince myself that this utilizes defined behavior. We
do tie multiple outputs to the same C variable elsewhere, yes,
but only in cases where we don't really care about the register,
or where the register is a fixed one anyway. What I can't find
is a clear statement that gcc wouldn't ever, now or in the
future, use the same register for all three outputs. I think one
can deduce this in certain ways, and experimentation also seems
to confirm it, but still.

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 10/16] x86/cpu: Adjust reset_stack_and_jump() to be shadow stack compatible
  2020-05-01 22:58 ` [PATCH 10/16] x86/cpu: Adjust reset_stack_and_jump() " Andrew Cooper
@ 2020-05-07 13:17   ` Jan Beulich
  2020-05-11 20:07     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-07 13:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> We need to unwind up to the supervisor token.  See the comment for details.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Wei Liu <wl@xen.org>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/include/asm-x86/current.h | 42 +++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
> index 99b66a0087..2a7b728b1e 100644
> --- a/xen/include/asm-x86/current.h
> +++ b/xen/include/asm-x86/current.h
> @@ -124,13 +124,49 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  # define CHECK_FOR_LIVEPATCH_WORK ""
>  #endif
>  
> +#ifdef CONFIG_XEN_SHSTK
> +/*
> + * We need to unwind the primary shadow stack to its supervisor token, located
> + * at 0x5ff8 from the base of the stack blocks.
> + *
> + * Read the shadow stack pointer, subtract it from 0x5ff8, divide by 8 to get
> + * the number of slots needing popping.
> + *
> + * INCSSPQ can't pop more than 255 entries.  We shouldn't ever need to pop
> + * that many entries, and getting this wrong will cause us to #DF later.
> + */
> +# define SHADOW_STACK_WORK                      \
> +    "mov $1, %[ssp];"                           \
> +    "rdsspd %[ssp];"                            \
> +    "cmp $1, %[ssp];"                           \
> +    "je 1f;" /* CET not active?  Skip. */       \
> +    "mov $"STR(0x5ff8)", %[val];"               \

As per comments on earlier patches, I think it would be nice if
this wasn't a literal number here, but tied to actual stack
layout via some suitable expression. An option might be to use
0xff8 (or the constant to be introduced for it in the earlier
patch) here and ...

> +    "and $"STR(STACK_SIZE - 1)", %[ssp];"       \

... PAGE_SIZE here.

> +    "sub %[ssp], %[val];"                       \
> +    "shr $3, %[val];"                           \
> +    "cmp $255, %[val];"                         \
> +    "jle 2f;"                                   \

Perhaps better "jbe", treating the unsigned values as such?

> +    "ud2a;"                                     \
> +    "2: incsspq %q[val];"                       \
> +    "1:"
> +#else
> +# define SHADOW_STACK_WORK ""
> +#endif
> +
>  #define switch_stack_and_jump(fn, instr)                                \
>      ({                                                                  \
> +        unsigned int tmp;                                               \
>          __asm__ __volatile__ (                                          \
> -            "mov %0,%%"__OP"sp;"                                        \
> +            "cmc;"                                                      \
> +            SHADOW_STACK_WORK                                           \
> +            "mov %[stk], %%rsp;"                                        \
>              instr                                                       \
> -             "jmp %c1"                                                  \
> -            : : "r" (guest_cpu_user_regs()), "i" (fn) : "memory" );     \
> +            "jmp %c[fun];"                                              \
> +            : [val] "=&r" (tmp),                                        \
> +              [ssp] "=&r" (tmp)                                         \

See my concern on the earlier similar construct.

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 11/16] x86/spec-ctrl: Adjust DO_OVERWRITE_RSB to be shadow stack compatible
  2020-05-01 22:58 ` [PATCH 11/16] x86/spec-ctrl: Adjust DO_OVERWRITE_RSB " Andrew Cooper
@ 2020-05-07 13:22   ` Jan Beulich
  2020-05-07 13:25     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-07 13:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> @@ -114,6 +114,16 @@
>      sub $1, %ecx
>      jnz .L\@_fill_rsb_loop
>      mov %\tmp, %rsp                 /* Restore old %rsp */
> +
> +#ifdef CONFIG_XEN_SHSTK
> +    mov $1, %ecx
> +    rdsspd %ecx
> +    cmp $1, %ecx
> +    je .L\@_shstk_done
> +    mov $64, %ecx                   /* 64 * 4 bytes, given incsspd */
> +    incsspd %ecx                    /* Restore old SSP */
> +.L\@_shstk_done:
> +#endif

The latest here I wonder why you don't use alternatives patching.
I thought that's what you've introduced the synthetic feature
flag for.

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 11/16] x86/spec-ctrl: Adjust DO_OVERWRITE_RSB to be shadow stack compatible
  2020-05-07 13:22   ` Jan Beulich
@ 2020-05-07 13:25     ` Andrew Cooper
  2020-05-07 13:38       ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-07 13:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 07/05/2020 14:22, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> @@ -114,6 +114,16 @@
>>      sub $1, %ecx
>>      jnz .L\@_fill_rsb_loop
>>      mov %\tmp, %rsp                 /* Restore old %rsp */
>> +
>> +#ifdef CONFIG_XEN_SHSTK
>> +    mov $1, %ecx
>> +    rdsspd %ecx
>> +    cmp $1, %ecx
>> +    je .L\@_shstk_done
>> +    mov $64, %ecx                   /* 64 * 4 bytes, given incsspd */
>> +    incsspd %ecx                    /* Restore old SSP */
>> +.L\@_shstk_done:
>> +#endif
> The latest here I wonder why you don't use alternatives patching.
> I thought that's what you've introduced the synthetic feature
> flag for.

We're already in the middle of an alternative and they don't nest.  More
importantly, this path gets used on the BSP, after patching and before
CET gets enabled.

~Andrew


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 12/16] x86/extable: Adjust extable handling to be shadow stack compatible
  2020-05-01 22:58 ` [PATCH 12/16] x86/extable: Adjust extable handling " Andrew Cooper
@ 2020-05-07 13:35   ` Jan Beulich
  2020-05-11 21:09     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-07 13:35 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -778,6 +778,28 @@ static bool exception_fixup(struct cpu_user_regs *regs, bool print)
>                 vec_name(regs->entry_vector), regs->error_code,
>                 _p(regs->rip), _p(regs->rip), _p(fixup));
>  
> +    if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
> +    {
> +        unsigned long ssp;
> +
> +        asm ("rdsspq %0" : "=r" (ssp) : "0" (1) );
> +        if ( ssp != 1 )
> +        {
> +            unsigned long *ptr = _p(ssp);
> +
> +            /* Search for %rip in the shadow stack, ... */
> +            while ( *ptr != regs->rip )
> +                ptr++;

Wouldn't it be better to bound the loop, as it shouldn't search past
(strictly speaking not even to) the next page boundary? Also you
don't care about the top of the stack (being the to-be-restored SSP),
do you? I.e. maybe

            while ( *++ptr != regs->rip )
                ;

?

And then - isn't searching for a specific RIP value alone prone to
error, in case it matches an ordinary return address? I.e.
wouldn't you better search for a matching RIP accompanied by a
suitable pointer into the shadow stack and a matching CS value?
Otherwise, ...

> +            ASSERT(ptr[1] == __HYPERVISOR_CS);

... also assert that ptr[-1] points into the shadow stack?

> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -708,7 +708,16 @@ exception_with_ints_disabled:
>          call  search_pre_exception_table
>          testq %rax,%rax                 # no fixup code for faulting EIP?
>          jz    1b
> -        movq  %rax,UREGS_rip(%rsp)
> +        movq  %rax,UREGS_rip(%rsp)      # fixup regular stack
> +
> +#ifdef CONFIG_XEN_SHSTK
> +        mov    $1, %edi
> +        rdsspq %rdi
> +        cmp    $1, %edi
> +        je     .L_exn_shstk_done
> +        wrssq  %rax, (%rdi)             # fixup shadow stack
> +.L_exn_shstk_done:
> +#endif

Again avoid the conditional jump by using alternatives patching?

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 11/16] x86/spec-ctrl: Adjust DO_OVERWRITE_RSB to be shadow stack compatible
  2020-05-07 13:25     ` Andrew Cooper
@ 2020-05-07 13:38       ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-07 13:38 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 07.05.2020 15:25, Andrew Cooper wrote:
> On 07/05/2020 14:22, Jan Beulich wrote:
>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>> @@ -114,6 +114,16 @@
>>>      sub $1, %ecx
>>>      jnz .L\@_fill_rsb_loop
>>>      mov %\tmp, %rsp                 /* Restore old %rsp */
>>> +
>>> +#ifdef CONFIG_XEN_SHSTK
>>> +    mov $1, %ecx
>>> +    rdsspd %ecx
>>> +    cmp $1, %ecx
>>> +    je .L\@_shstk_done
>>> +    mov $64, %ecx                   /* 64 * 4 bytes, given incsspd */
>>> +    incsspd %ecx                    /* Restore old SSP */
>>> +.L\@_shstk_done:
>>> +#endif
>> The latest here I wonder why you don't use alternatives patching.
>> I thought that's what you've introduced the synthetic feature
>> flag for.
> 
> We're already in the middle of an alternative and they don't nest.  More
> importantly, this path gets used on the BSP, after patching and before
> CET gets enabled.

Oh, I should have noticed this. The first point could be dealt with,
but I agree the second pretty much rules out patching.

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 13/16] x86/ioemul: Rewrite stub generation to be shadow stack compatible
  2020-05-01 22:58 ` [PATCH 13/16] x86/ioemul: Rewrite stub generation " Andrew Cooper
@ 2020-05-07 13:46   ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-07 13:46 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> The logic is completely undocumented and almost impossible to follow.  It
> actually uses return oriented programming.  Rewrite it to conform to more
> normal call mechanics, and leave a big comment explaining things.  As well as
> the code being easier to follow, it will execute faster as it isn't fighting
> the branch predictor.
> 
> Move the ioemul_handle_quirk() function pointer from traps.c to
> ioport_emulate.c.  There is no reason for it to live outside the two
> translation units which use it.  Alter the behaviour to return the number of
> bytes written into the stub.
> 
> Access the addresses of the host/guest helpers with extern const char arrays.
> Nothing good will come of C thinking they are regular functions.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> --
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Wei Liu <wl@xen.org>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> 
> Posted previously on its perf benefits alone, but here is the real reason
> behind the change.
> ---
>  xen/arch/x86/ioport_emulate.c  | 11 ++---
>  xen/arch/x86/pv/emul-priv-op.c | 91 +++++++++++++++++++++++++++++++-----------
>  xen/arch/x86/pv/gpr_switch.S   | 37 +++++------------
>  xen/arch/x86/traps.c           |  3 --
>  xen/include/asm-x86/io.h       |  3 +-
>  5 files changed, 85 insertions(+), 60 deletions(-)
> 
> diff --git a/xen/arch/x86/ioport_emulate.c b/xen/arch/x86/ioport_emulate.c
> index 499c1f6056..f7511a9c49 100644
> --- a/xen/arch/x86/ioport_emulate.c
> +++ b/xen/arch/x86/ioport_emulate.c
> @@ -8,7 +8,10 @@
>  #include <xen/sched.h>
>  #include <xen/dmi.h>
>  
> -static bool ioemul_handle_proliant_quirk(
> +unsigned int (*ioemul_handle_quirk)(
> +    u8 opcode, char *io_emul_stub, struct cpu_user_regs *regs);

As requested for the standalone patch - would you mind adding
__read_mostly on this occasion? And I notice now that use of
uint8_t would also be more appropriate.

> --- a/xen/arch/x86/pv/gpr_switch.S
> +++ b/xen/arch/x86/pv/gpr_switch.S
> @@ -9,59 +9,42 @@
>  
>  #include <asm/asm_defns.h>
>  
> -ENTRY(host_to_guest_gpr_switch)
> -        movq  (%rsp), %rcx
> -        movq  %rdi, (%rsp)
> +/* Load guest GPRs.  Parameter in %rdi, clobbers all registers. */
> +ENTRY(load_guest_gprs)
>          movq  UREGS_rdx(%rdi), %rdx
> -        pushq %rbx
>          movq  UREGS_rax(%rdi), %rax
>          movq  UREGS_rbx(%rdi), %rbx
> -        pushq %rbp
>          movq  UREGS_rsi(%rdi), %rsi
>          movq  UREGS_rbp(%rdi), %rbp
> -        pushq %r12
> -        movq  UREGS_r8(%rdi), %r8
> +        movq  UREGS_r8 (%rdi), %r8
>          movq  UREGS_r12(%rdi), %r12
> -        pushq %r13
> -        movq  UREGS_r9(%rdi), %r9
> +        movq  UREGS_r9 (%rdi), %r9
>          movq  UREGS_r13(%rdi), %r13
> -        pushq %r14
>          movq  UREGS_r10(%rdi), %r10
>          movq  UREGS_r14(%rdi), %r14
> -        pushq %r15
>          movq  UREGS_r11(%rdi), %r11
>          movq  UREGS_r15(%rdi), %r15
> -        pushq %rcx /* dummy push, filled by guest_to_host_gpr_switch pointer */
> -        pushq %rcx
> -        leaq  guest_to_host_gpr_switch(%rip),%rcx
> -        movq  %rcx,8(%rsp)
>          movq  UREGS_rcx(%rdi), %rcx
>          movq  UREGS_rdi(%rdi), %rdi
>          ret
>  
> -ENTRY(guest_to_host_gpr_switch)
> +/* Save guest GPRs.  Parameter on the stack above the return address. */
> +ENTRY(save_guest_gprs)
>          pushq %rdi
> -        movq  7*8(%rsp), %rdi
> +        movq  2*8(%rsp), %rdi
>          movq  %rax, UREGS_rax(%rdi)
> -        popq  UREGS_rdi(%rdi)
> +        popq        UREGS_rdi(%rdi)
>          movq  %r15, UREGS_r15(%rdi)
>          movq  %r11, UREGS_r11(%rdi)
> -        popq  %r15
>          movq  %r14, UREGS_r14(%rdi)
>          movq  %r10, UREGS_r10(%rdi)
> -        popq  %r14
>          movq  %r13, UREGS_r13(%rdi)
> -        movq  %r9, UREGS_r9(%rdi)
> -        popq  %r13
> +        movq  %r9,  UREGS_r9 (%rdi)
>          movq  %r12, UREGS_r12(%rdi)
> -        movq  %r8, UREGS_r8(%rdi)
> -        popq  %r12
> +        movq  %r8,  UREGS_r8 (%rdi)
>          movq  %rbp, UREGS_rbp(%rdi)
>          movq  %rsi, UREGS_rsi(%rdi)
> -        popq  %rbp
>          movq  %rbx, UREGS_rbx(%rdi)
>          movq  %rdx, UREGS_rdx(%rdi)
> -        popq  %rbx
>          movq  %rcx, UREGS_rcx(%rdi)
> -        popq  %rcx
>          ret

Now that these are ordinary functions (with just a non-standard
calling convention), I think - as said before - they should be marked
STT_FUNC in the object. With that, as also said before, I then
don't think using char[] for the C side declarations (leading to
STT_OBJECT) is appropriate - linkers are imo fine to warn
about such a mismatch. To prevent the declaration from being used for
an actual call, use e.g. __attribute__((__warning__())) as already
suggested.

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 14/16] x86/alt: Adjust _alternative_instructions() to not create shadow stacks
  2020-05-01 22:58 ` [PATCH 14/16] x86/alt: Adjust _alternative_instructions() to not create shadow stacks Andrew Cooper
@ 2020-05-07 13:49   ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-07 13:49 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> The current alternatives algorithm clears CR0.WP and writes into .text.  This
> has a side effect of the mappings becoming shadow stacks once CET is active.
> 
> Adjust _alternative_instructions() to clean up after itself.  This involves
> extending the set of bits modify_xen_mappings() can change, to include Dirty
> (and Accessed for good measure).
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible
  2020-05-01 22:58 ` [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible Andrew Cooper
@ 2020-05-07 14:12   ` Jan Beulich
  2020-05-07 15:50     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-07 14:12 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel; +Cc: Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> The SYSCALL/SYSEXIT paths need to use {SET,CLR}SSBSY.

I take it you mean SYSRET, not SYSEXIT. I do think though that you
also need to deal with the SYSENTER entry point we have.

> --- a/xen/arch/x86/x86_64/compat/entry.S
> +++ b/xen/arch/x86/x86_64/compat/entry.S
> @@ -198,7 +198,7 @@ ENTRY(cr4_pv32_restore)
>  
>  /* See lstar_enter for entry register state. */
>  ENTRY(cstar_enter)
> -        /* sti could live here when we don't switch page tables below. */
> +        ALTERNATIVE "", "setssbsy", X86_FEATURE_XEN_SHSTK

I don't see why you delete the comment here (or elsewhere). While
I recall you not really wanting them there, I still think they're
useful to have, and they shouldn't be deleted as a side effect of
an entirely unrelated change. Of course they need to live after
your insertions then.

> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -194,6 +194,15 @@ restore_all_guest:
>          movq  8(%rsp),%rcx            # RIP
>          ja    iret_exit_to_guest
>  
> +        /* Clear the supervisor shadow stack token busy bit. */
> +.macro rag_clrssbsy
> +        push %rax
> +        rdsspq %rax
> +        clrssbsy (%rax)
> +        pop %rax
> +.endm
> +        ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK

In principle you could get away without spilling %rax:

        cmpl  $1,%ecx
        ja    iret_exit_to_guest

        /* Clear the supervisor shadow stack token busy bit. */
.macro rag_clrssbsy
        rdsspq %rcx
        clrssbsy (%rcx)
.endm
        ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
        movq  8(%rsp),%rcx            # RIP
        cmpw  $FLAT_USER_CS32,16(%rsp)# CS
        movq  32(%rsp),%rsp           # RSP
        je    1f
        sysretq
1:      sysretl

        ALIGN
/* No special register assumptions. */
iret_exit_to_guest:
        movq  8(%rsp),%rcx            # RIP
        andl  $~(X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_VM),24(%rsp)
        ...

Also - what about CLRSSBSY failing? It would seem easier to diagnose
this right here than when getting presumably #DF upon next entry into
Xen. At the very least I think it deserves a comment if an error case
does not get handled.

Somewhat similar for SETSSBSY, except there things get complicated by
it raising #CP instead of setting EFLAGS.CF: Aiui it would require us
to handle #CP on an IST stack in order to avoid #DF there.

> @@ -877,6 +886,14 @@ handle_ist_exception:
>          movl  $UREGS_kernel_sizeof/8,%ecx
>          movq  %rdi,%rsp
>          rep   movsq
> +
> +        /* Switch Shadow Stacks */
> +.macro ist_switch_shstk
> +        rdsspq %rdi
> +        clrssbsy (%rdi)
> +        setssbsy
> +.endm

Could you extend the comment to mention the caveat that you point
out in the description?

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 16/16] x86/shstk: Activate Supervisor Shadow Stacks
  2020-05-01 22:58 ` [PATCH 16/16] x86/shstk: Activate Supervisor Shadow Stacks Andrew Cooper
@ 2020-05-07 14:54   ` Jan Beulich
  2020-05-11 23:46     ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-07 14:54 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 02.05.2020 00:58, Andrew Cooper wrote:
> --- a/xen/arch/x86/acpi/wakeup_prot.S
> +++ b/xen/arch/x86/acpi/wakeup_prot.S
> @@ -1,3 +1,8 @@
> +#include <asm/config.h>

Why is this needed? Afaics assembly files, just like C ones, get
xen/config.h included from the compiler command line.

> @@ -48,6 +59,48 @@ ENTRY(s3_resume)
>          pushq   %rax
>          lretq
>  1:
> +#ifdef CONFIG_XEN_SHSTK
> +	/*
> +         * Restoring SSP is a little convoluted, because we are intercepting
> +         * the middle of an in-use shadow stack.  Write a temporary supervisor
> +         * token under the stack, so SETSSBSY takes us where we want, then
> +         * reset MSR_PL0_SSP to its usual value and pop the temporary token.

What do you mean by "takes us where we want"? I take it "us" is really
SSP here?

> +         */
> +        mov     saved_rsp(%rip), %rdi
> +        cmpq    $1, %rdi
> +        je      .L_shstk_done
> +
> +        /* Write a supervisor token under SSP. */
> +        sub     $8, %rdi
> +        mov     %rdi, (%rdi)
> +
> +        /* Load it into MSR_PL0_SSP. */
> +        mov     $MSR_PL0_SSP, %ecx
> +        mov     %rdi, %rdx
> +        shr     $32, %rdx
> +        mov     %edi, %eax
> +
> +        /* Enable CET. */
> +        mov     $MSR_S_CET, %ecx
> +        xor     %edx, %edx
> +        mov     $CET_SHSTK_EN | CET_WRSS_EN, %eax
> +        wrmsr
> +
> +        /* Activate our temporary token. */
> +        mov     $XEN_MINIMAL_CR4 | X86_CR4_CET, %ebx
> +        mov     %rbx, %cr4
> +        setssbsy
> +
> +        /* Reset MSR_PL0_SSP back to its expected value. */
> +        and     $~(STACK_SIZE - 1), %eax
> +        or      $0x5ff8, %eax
> +        wrmsr

Ahead of this WRMSR neither %ecx nor %edx look to have their intended
values anymore. Also there is a again a magic 0x5ff8 here (and at
least one more further down).

> --- a/xen/arch/x86/boot/x86_64.S
> +++ b/xen/arch/x86/boot/x86_64.S
> @@ -28,8 +28,36 @@ ENTRY(__high_start)
>          lretq
>  1:
>          test    %ebx,%ebx
> -        jnz     start_secondary
> +        jz      .L_bsp
>  
> +        /* APs.  Set up shadow stacks before entering C. */
> +
> +        testl   $cpufeat_mask(X86_FEATURE_XEN_SHSTK), \
> +                CPUINFO_FEATURE_OFFSET(X86_FEATURE_XEN_SHSTK) + boot_cpu_data(%rip)
> +        je      .L_ap_shstk_done
> +
> +        mov     $MSR_S_CET, %ecx
> +        xor     %edx, %edx
> +        mov     $CET_SHSTK_EN | CET_WRSS_EN, %eax
> +        wrmsr
> +
> +        mov     $MSR_PL0_SSP, %ecx
> +        mov     %rsp, %rdx
> +        shr     $32, %rdx
> +        mov     %esp, %eax
> +        and     $~(STACK_SIZE - 1), %eax
> +        or      $0x5ff8, %eax
> +        wrmsr
> +
> +        mov     $XEN_MINIMAL_CR4 | X86_CR4_CET, %ecx
> +        mov     %rcx, %cr4
> +        setssbsy

Since the token doesn't get written here, could you make the comment
say where this happens? I have to admit that I had to go through
earlier patches to find it again.

> +.L_ap_shstk_done:
> +        call    start_secondary
> +        BUG     /* start_secondary() shouldn't return. */

This conversion from a jump to CALL is unrelated and hence would
better be mentioned in the description imo.

> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -323,6 +323,11 @@ void __init early_cpu_init(void)
>  	       x86_cpuid_vendor_to_str(c->x86_vendor), c->x86, c->x86,
>  	       c->x86_model, c->x86_model, c->x86_mask, eax);
>  
> +	if (c->cpuid_level >= 7) {
> +		cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
> +		c->x86_capability[cpufeat_word(X86_FEATURE_CET_SS)] = ecx;
> +	}

How about moving the leaf 7 code from generic_identify() here as
a whole?

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -664,6 +664,13 @@ static void __init noreturn reinit_bsp_stack(void)
>      stack_base[0] = stack;
>      memguard_guard_stack(stack);
>  
> +    if ( cpu_has_xen_shstk )
> +    {
> +        wrmsrl(MSR_PL0_SSP, (unsigned long)stack + 0x5ff8);
> +        wrmsrl(MSR_S_CET, CET_SHSTK_EN | CET_WRSS_EN);
> +        asm volatile ("setssbsy" ::: "memory");
> +    }

Same as for APs - a brief comment pointing at where the token was
written would seem helpful.

Could you also have the patch description say a word on the choice
of enabling CET_WRSS_EN uniformly and globally?

> @@ -985,6 +992,21 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>      /* This must come before e820 code because it sets paddr_bits. */
>      early_cpu_init();
>  
> +    /* Choose shadow stack early, to set infrastructure up appropriately. */
> +    if ( opt_xen_shstk && boot_cpu_has(X86_FEATURE_CET_SS) )
> +    {
> +        printk("Enabling Supervisor Shadow Stacks\n");
> +
> +        setup_force_cpu_cap(X86_FEATURE_XEN_SHSTK);
> +#ifdef CONFIG_PV32
> +        if ( opt_pv32 )
> +        {
> +            opt_pv32 = 0;
> +            printk("  - Disabling PV32 due to Shadow Stacks\n");
> +        }
> +#endif

I think this deserves an explanation, either in a comment or in
the patch description.

> @@ -1721,6 +1743,10 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>  
>      alternative_branches();
>  
> +    /* Defer CR4.CET until alternatives have finished playing with CR4.WP */
> +    if ( cpu_has_xen_shstk )
> +        set_in_cr4(X86_CR4_CET);

Nit: CR0.WP (in the comment)

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible
  2020-05-07 14:12   ` Jan Beulich
@ 2020-05-07 15:50     ` Andrew Cooper
  2020-05-07 16:15       ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-07 15:50 UTC (permalink / raw)
  To: Jan Beulich, Xen-devel; +Cc: Wei Liu, Roger Pau Monné

On 07/05/2020 15:12, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> The SYSCALL/SYSEXIT paths need to use {SET,CLR}SSBSY.
> I take it you mean SYSRET, not SYSEXIT.

I do, sorry.

> I do think though that you
> also need to deal with the SYSENTER entry point we have.

Oh - so we do.

>> --- a/xen/arch/x86/x86_64/compat/entry.S
>> +++ b/xen/arch/x86/x86_64/compat/entry.S
>> @@ -198,7 +198,7 @@ ENTRY(cr4_pv32_restore)
>>  
>>  /* See lstar_enter for entry register state. */
>>  ENTRY(cstar_enter)
>> -        /* sti could live here when we don't switch page tables below. */
>> +        ALTERNATIVE "", "setssbsy", X86_FEATURE_XEN_SHSTK
> I don't see why you delete the comment here (or elsewhere). While
> I recall you not really wanting them there, I still think they're
> useful to have, and they shouldn't be deleted as a side effect of
> an entirely unrelated change. Of course they need to live after
> your insertions then.

Do you not remember Juergen's performance testing results concerning this
comment?  The results were provably worse.

It is a useless comment.  Sure, it's technically accurate, but so are an
arbitrarily large number of other comments about how we could permute
the code.

It has already been concluded that we won't be making the suggested
change.  Having a /* TODO - doing X will make the system slower */ isn't
something we should have: it adds to the complexity of the code, and
tricks people into thinking that something should be done.

>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -194,6 +194,15 @@ restore_all_guest:
>>          movq  8(%rsp),%rcx            # RIP
>>          ja    iret_exit_to_guest
>>  
>> +        /* Clear the supervisor shadow stack token busy bit. */
>> +.macro rag_clrssbsy
>> +        push %rax
>> +        rdsspq %rax
>> +        clrssbsy (%rax)
>> +        pop %rax
>> +.endm
>> +        ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
> In principle you could get away without spilling %rax:
>
>         cmpl  $1,%ecx
>         ja    iret_exit_to_guest
>
>         /* Clear the supervisor shadow stack token busy bit. */
> .macro rag_clrssbsy
>         rdsspq %rcx
>         clrssbsy (%rcx)
> .endm
>         ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
>         movq  8(%rsp),%rcx            # RIP
>         cmpw  $FLAT_USER_CS32,16(%rsp)# CS
>         movq  32(%rsp),%rsp           # RSP
>         je    1f
>         sysretq
> 1:      sysretl
>
>         ALIGN
> /* No special register assumptions. */
> iret_exit_to_guest:
>         movq  8(%rsp),%rcx            # RIP
>         andl  $~(X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_VM),24(%rsp)
>         ...
>
> Also - what about CLRSSBSY failing? It would seem easier to diagnose
> this right here than when getting presumably #DF upon next entry into
> Xen. At the very least I think it deserves a comment if an error case
> does not get handled.

I did consider this, but ultimately decided against it.

You can't have an unlikely block inside an alternative block because the
jmp's displacement doesn't get fixed up.  Keeping everything inline puts
an incorrect statically-predicted branch in program flow.

Most importantly however, is that the SYSRET path is vastly less common
than the IRET path.  There is no easy way to proactively spot problems
in the IRET path, which means that conditions leading to a problem are
already far more likely to manifest as #DF, so there is very little
value in adding complexity to the SYSRET path in the first place.

> Somewhat similar for SETSSBSY, except there things get complicated by
> it raising #CP instead of setting EFLAGS.CF: Aiui it would require us
> to handle #CP on an IST stack in order to avoid #DF there.

Right, but having #CP as IST gives us far worse problems.

Being able to spot #CP vs #DF doesn't help usefully.  It's still some
arbitrary period of time after the damage was done.

Any nesting of #CP (including fault on IRET out) results in losing
program state and entering an infinite loop.

The cases which end up as #DF are properly fatal to the system, and we
at least get a clean crash out of it.

>> @@ -877,6 +886,14 @@ handle_ist_exception:
>>          movl  $UREGS_kernel_sizeof/8,%ecx
>>          movq  %rdi,%rsp
>>          rep   movsq
>> +
>> +        /* Switch Shadow Stacks */
>> +.macro ist_switch_shstk
>> +        rdsspq %rdi
>> +        clrssbsy (%rdi)
>> +        setssbsy
>> +.endm
> Could you extend the comment to mention the caveat that you point
> out in the description?

Ok.

~Andrew


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible
  2020-05-07 15:50     ` Andrew Cooper
@ 2020-05-07 16:15       ` Jan Beulich
  2020-05-11 21:45         ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-07 16:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 07.05.2020 17:50, Andrew Cooper wrote:
> On 07/05/2020 15:12, Jan Beulich wrote:
>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/x86_64/compat/entry.S
>>> +++ b/xen/arch/x86/x86_64/compat/entry.S
>>> @@ -198,7 +198,7 @@ ENTRY(cr4_pv32_restore)
>>>  
>>>  /* See lstar_enter for entry register state. */
>>>  ENTRY(cstar_enter)
>>> -        /* sti could live here when we don't switch page tables below. */
>>> +        ALTERNATIVE "", "setssbsy", X86_FEATURE_XEN_SHSTK
>> I don't see why you delete the comment here (or elsewhere). While
>> I recall you not really wanting them there, I still think they're
>> useful to have, and they shouldn't be deleted as a side effect of
>> an entirely unrelated change. Of course they need to live after
>> your insertions then.
> 
> Do you not remember Juergen's performance testing results concerning this
> comment?  The results were provably worse.
> 
> It is a useless comment.  Sure, it's technically accurate, but so are an
> arbitrarily large number of other comments about how we could permute
> the code.
> 
> It has already been concluded that we won't be making the suggested
> change.  Having a /* TODO - doing X will make the system slower */ isn't
> something we should have: it adds to the complexity of the code, and
> tricks people into thinking that something should be done.

A separate patch is still the way to go then, with reference to
the claimed performance testing results.

>>> --- a/xen/arch/x86/x86_64/entry.S
>>> +++ b/xen/arch/x86/x86_64/entry.S
>>> @@ -194,6 +194,15 @@ restore_all_guest:
>>>          movq  8(%rsp),%rcx            # RIP
>>>          ja    iret_exit_to_guest
>>>  
>>> +        /* Clear the supervisor shadow stack token busy bit. */
>>> +.macro rag_clrssbsy
>>> +        push %rax
>>> +        rdsspq %rax
>>> +        clrssbsy (%rax)
>>> +        pop %rax
>>> +.endm
>>> +        ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
>> In principle you could get away without spilling %rax:
>>
>>         cmpl  $1,%ecx
>>         ja    iret_exit_to_guest
>>
>>         /* Clear the supervisor shadow stack token busy bit. */
>> .macro rag_clrssbsy
>>         rdsspq %rcx
>>         clrssbsy (%rcx)
>> .endm
>>         ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
>>         movq  8(%rsp),%rcx            # RIP
>>         cmpw  $FLAT_USER_CS32,16(%rsp)# CS
>>         movq  32(%rsp),%rsp           # RSP
>>         je    1f
>>         sysretq
>> 1:      sysretl
>>
>>         ALIGN
>> /* No special register assumptions. */
>> iret_exit_to_guest:
>>         movq  8(%rsp),%rcx            # RIP
>>         andl  $~(X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_VM),24(%rsp)
>>         ...
>>
>> Also - what about CLRSSBSY failing? It would seem easier to diagnose
>> this right here than when getting presumably #DF upon next entry into
>> Xen. At the very least I think it deserves a comment if an error case
>> does not get handled.
> 
> I did consider this, but ultimately decided against it.
> 
> You can't have an unlikely block inside an alternative block because the
> jmp's displacement doesn't get fixed up.

We do fix up unconditional JMP/CALL displacements; I don't
see why we couldn't also do so for conditional ones.

>  Keeping everything inline puts
> an incorrect statically-predicted branch in program flow.
> 
> Most importantly however, is that the SYSRET path is vastly less common
> than the IRET path.  There is no easy way to proactively spot problems
> in the IRET path, which means that conditions leading to a problem are
> already far more likely to manifest as #DF, so there is very little
> value in adding complexity to the SYSRET path in the first place.

The SYSRET path being uncommon is a problem by itself imo, if
that's indeed the case. I'm sure I've suggested before that
we convert frames to TRAP_syscall ones whenever possible,
such that we wouldn't go the slower IRET path.

>> Somewhat similar for SETSSBSY, except there things get complicated by
>> it raising #CP instead of setting EFLAGS.CF: Aiui it would require us
>> to handle #CP on an IST stack in order to avoid #DF there.
> 
> Right, but having #CP as IST gives us far worse problems.
> 
> Being able to spot #CP vs #DF doesn't help usefully.  It's still some
> arbitrary period of time after the damage was done.
> 
> Any nesting of #CP (including fault on IRET out) results in losing
> program state and entering an infinite loop.
> 
> The cases which end up as #DF are properly fatal to the system, and we
> at least get a clean crash out of it.

May I suggest that all of this gets spelled out in at least
the description of the patch, so that it can be properly
understood (and, if need be, revisited) later on?

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/16] x86/traps: Drop last_extable_addr
  2020-05-04 12:44   ` Jan Beulich
@ 2020-05-11 14:53     ` Andrew Cooper
  2020-05-11 15:00       ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 14:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 04/05/2020 13:44, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> The only user of this facility is dom_crash_sync_extable() by passing 0 into
>> asm_domain_crash_synchronous().  The common error cases are already covered
>> with show_page_walk(), leaving only %ss/%fs selector/segment errors in the
>> compat case.
>>
>> Point at dom_crash_sync_extable in the error message, which is about as good
>> as the error hints from other users of asm_domain_crash_synchronous(), and
>> drop last_extable_addr.
> While I'm not entirely opposed, I'd like you to clarify that you indeed
> mean to (mostly) revert your own improvement from 6 or 7 years back
> (commit 8e0da8c07f4f). I'm also surprised to find this as part of the
> series it's in - in what way does this logic get in the way of CET-SS?

It was part of the exception_fixup() cleanup.  The first 4 patches are not
specifically related to CET-SS.

~Andrew


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/16] x86/traps: Drop last_extable_addr
  2020-05-11 14:53     ` Andrew Cooper
@ 2020-05-11 15:00       ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-11 15:00 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 11.05.2020 16:53, Andrew Cooper wrote:
> On 04/05/2020 13:44, Jan Beulich wrote:
>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>> The only user of this facility is dom_crash_sync_extable() by passing 0 into
>>> asm_domain_crash_synchronous().  The common error cases are already covered
>>> with show_page_walk(), leaving only %ss/%fs selector/segment errors in the
>>> compat case.
>>>
>>> Point at dom_crash_sync_extable in the error message, which is about as good
>>> as the error hints from other users of asm_domain_crash_synchronous(), and
>>> drop last_extable_addr.
>> While I'm not entirely opposed, I'd like you to clarify that you indeed
>> mean to (mostly) revert your own improvement from 6 or 7 years back
>> (commit 8e0da8c07f4f). I'm also surprised to find this as part of the
>> series it's in - in what way does this logic get in the way of CET-SS?
> 
> It was part of the exception_fixup() cleanup.  The first 4 patches are not
> specifically related to CET-SS.

"Cleanup" doesn't mean it's needed as a prereq for the CET-SS work, so I'm
afraid it's still not really clear to me whether we indeed want to
effectively revert the earlier change, which had a reason to be done
the way it was done.

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap()
  2020-05-04 13:08   ` Jan Beulich
@ 2020-05-11 15:01     ` Andrew Cooper
  2020-05-11 15:09       ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 15:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 04/05/2020 14:08, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> For one, they render the vector in a different base.
>>
>> Introduce X86_EXC_* constants and vec_name() to refer to exceptions by their
>> mnemonic, which starts bringing the code/diagnostics in line with the Intel
>> and AMD manuals.
> For this "bringing in line" purpose I'd like to see whether you could
> live with some adjustments to how you're currently doing things:
> - NMI is nowhere prefixed by #, hence I think we'd better not do so
>   either; may require embedding the #-es in the names[] table, or not
>   using N() for NMI

No-one is going to get confused at seeing #NMI in an error message.  I
don't mind rejigging the existing names table, but anything more
complicated is overkill.

> - neither Coprocessor Segment Overrun nor vector 0x0f have a mnemonic
>   and hence I think we shouldn't invent one; just treat them like
>   other reserved vectors (of which at least vector 0x09 indeed is one
>   on x86-64)?

This I disagree with.  Coprocessor Segment Overrun *is* its name in both
manuals, and the avoidance of vector 0xf is clearly documented as well,
due to it being the default PIC Spurious Interrupt Vector.

Neither CSO nor SPV are expected to be encountered in practice, but if
they are, highlighting them is a damn sight more helpful than pretending
they don't exist.

~Andrew


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap()
  2020-05-11 15:01     ` Andrew Cooper
@ 2020-05-11 15:09       ` Jan Beulich
  2020-05-18 16:54         ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-11 15:09 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 11.05.2020 17:01, Andrew Cooper wrote:
> On 04/05/2020 14:08, Jan Beulich wrote:
>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>> For one, they render the vector in a different base.
>>>
>>> Introduce X86_EXC_* constants and vec_name() to refer to exceptions by their
>>> mnemonic, which starts bringing the code/diagnostics in line with the Intel
>>> and AMD manuals.
>> For this "bringing in line" purpose I'd like to see whether you could
>> live with some adjustments to how you're currently doing things:
>> - NMI is nowhere prefixed by #, hence I think we'd better not do so
>>   either; may require embedding the #-es in the names[] table, or not
>>   using N() for NMI
> 
> No-one is going to get confused at seeing #NMI in an error message.  I
> don't mind rejigging the existing names table, but anything more
> complicated is overkill.
> 
>> - neither Coprocessor Segment Overrun nor vector 0x0f have a mnemonic
>>   and hence I think we shouldn't invent one; just treat them like
>>   other reserved vectors (of which at least vector 0x09 indeed is one
>>   on x86-64)?
> 
> This I disagree with.  Coprocessor Segment Overrun *is* its name in both
> manuals, and the avoidance of vector 0xf is clearly documented as well,
> due to it being the default PIC Spurious Interrupt Vector.
> 
> Neither CSO nor SPV are expected to be encountered in practice, but if
> they are, highlighting them is a damn sight more helpful than pretending
> they don't exist.

How is them occurring (and getting logged with their vector numbers)
any different from other reserved, acronym-less vectors? I particularly
didn't suggest to pretend they don't exist; instead I did suggest that
they are as reserved as, say, vector 0x18. By inventing an acronym and
logging this instead of the vector number you'll make people other than
you have to look up what the odd acronym means iff such an exception
ever got raised.

Jan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent
  2020-05-04 13:20   ` Jan Beulich
@ 2020-05-11 15:14     ` Andrew Cooper
  2020-05-12 13:05       ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 15:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 04/05/2020 14:20, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -774,10 +774,27 @@ static void do_reserved_trap(struct cpu_user_regs *regs)
>>            trapnr, vec_name(trapnr), regs->error_code);
>>  }
>>  
>> +static bool exception_fixup(struct cpu_user_regs *regs, bool print)
>> +{
>> +    unsigned long fixup = search_exception_table(regs);
>> +
>> +    if ( unlikely(fixup == 0) )
>> +        return false;
>> +
>> +    /* Can currently be triggered by guests.  Make sure we ratelimit. */
>> +    if ( IS_ENABLED(CONFIG_DEBUG) && print )
> I didn't think we consider dprintk()-s a possible security issue.
> Why would we consider so a printk() hidden behind
> IS_ENABLED(CONFIG_DEBUG)? IOW I think one of XENLOG_GUEST and
> IS_ENABLED(CONFIG_DEBUG) wants dropping.

Who said anything about a security issue?

I'm deliberately not using dprintk() for the reasons explained in the
commit message, so IS_ENABLED(CONFIG_DEBUG) is to cover that.

XENLOG_GUEST is because everything (other than the boot path) hitting
this caused directly by a guest action, and needs rate-limiting.  This
was not consistent before.

>
>> @@ -1466,12 +1468,11 @@ void do_page_fault(struct cpu_user_regs *regs)
>>          if ( pf_type != real_fault )
>>              return;
>>  
>> -        if ( likely((fixup = search_exception_table(regs)) != 0) )
>> +        if ( likely(exception_fixup(regs, false)) )
>>          {
>>              perfc_incr(copy_user_faults);
>>              if ( unlikely(regs->error_code & PFEC_reserved_bit) )
>>                  reserved_bit_page_fault(addr, regs);
>> -            regs->rip = fixup;
> I'm afraid this modification can't validly be pulled ahead -
> show_execution_state(), as called from reserved_bit_page_fault(),
> wants to / should print the original RIP, not the fixed up one.

This path is bogus to begin with.  Any RSVD pagefault (other than the
Shadow MMIO fastpath, handled earlier) is a bug in Xen and should be
fatal rather than just a warning on extable-tagged instructions.

Amongst other things, it would constitute an L1TF-vulnerable gadget.
The MMIO fastpath is only safe-ish because it also inverts the upper
address bits.

~Andrew


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/16] x86/shstk: Introduce Supervisor Shadow Stack support
  2020-05-04 13:52   ` Jan Beulich
@ 2020-05-11 15:46     ` Andrew Cooper
  2020-05-12 13:54       ` Jan Beulich
  2020-05-15 16:21     ` Anthony PERARD
  1 sibling, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 15:46 UTC (permalink / raw)
  To: Jan Beulich, Anthony Perard; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 04/05/2020 14:52, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> --- a/xen/arch/x86/Kconfig
>> +++ b/xen/arch/x86/Kconfig
>> @@ -34,6 +34,9 @@ config ARCH_DEFCONFIG
>>  config INDIRECT_THUNK
>>  	def_bool $(cc-option,-mindirect-branch-register)
>>  
>> +config HAS_AS_CET
>> +	def_bool $(as-instr,wrssq %rax$(comma)0;setssbsy;endbr64)
> I see you add as-instr here as a side effect. Until the other
> similar checks get converted, I think for the time being we should
> use the old model, to have all such checks in one place. I realize
> this means you can't have a Kconfig dependency then.

No.  That's like asking me to keep on using bool_t, and you are the
first person to point out and object to that in newly submitted patches.

> Also why do you check multiple insns, when just one (like we do
> elsewhere) would suffice?

Because CET-SS and CET-IBT are orthogonal ABIs, but you wanted a single
CET symbol, rather than a CET_SS symbol.

I picked a sample of various instructions to get broad coverage of CET
without including every instruction.

> The crucial insns to check are those which got changed pretty
> close before the release of 2.29 (in the cover letter you
> mention 2.32): One of incssp{d,q}, setssbsy, or saveprevssp.
> There weren't official binutils releases with the original
> insns, but distros may still have picked up intermediate
> snapshots.

I've got zero interest in catering to distros which are still using
obsolete pre-release toolchains.  Bleeding edge toolchains are one
thing, but this is like asking us to support the middle changeset of the
series introducing CET, which is absolutely a non-starter.

If the instructions were missing from real binutils releases, then
obviously we should exclude those releases, but that doesn't appear to
be the case.

>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -95,6 +95,36 @@ unsigned long __initdata highmem_start;
>>  size_param("highmem-start", highmem_start);
>>  #endif
>>  
>> +static bool __initdata opt_xen_shstk = true;
>> +
>> +static int parse_xen(const char *s)
>> +{
>> +    const char *ss;
>> +    int val, rc = 0;
>> +
>> +    do {
>> +        ss = strchr(s, ',');
>> +        if ( !ss )
>> +            ss = strchr(s, '\0');
>> +
>> +        if ( (val = parse_boolean("shstk", s, ss)) >= 0 )
>> +        {
>> +#ifdef CONFIG_XEN_SHSTK
>> +            opt_xen_shstk = val;
>> +#else
>> +            no_config_param("XEN_SHSTK", "xen", s, ss);
>> +#endif
>> +        }
>> +        else
>> +            rc = -EINVAL;
>> +
>> +        s = ss + 1;
>> +    } while ( *ss );
>> +
>> +    return rc;
>> +}
>> +custom_param("xen", parse_xen);
> What's the idea here going forward, i.e. why the new top level
> "xen" option? Almost all options are for Xen itself, after all.
> Did you perhaps mean this to be "cet"?

I forgot an RFC for this, as I couldn't think of anything better.  "cet"
as a top level option isn't going to gain more than {no-}shstk and
{no-}ibt as suboptions.

>> --- a/xen/scripts/Kconfig.include
>> +++ b/xen/scripts/Kconfig.include
>> @@ -31,6 +31,10 @@ cc-option = $(success,$(CC) -Werror $(CLANG_FLAGS) $(1) -E -x c /dev/null -o /de
>>  # Return y if the linker supports <flag>, n otherwise
>>  ld-option = $(success,$(LD) -v $(1))
>>  
>> +# $(as-instr,<instr>)
>> +# Return y if the assembler supports <instr>, n otherwise
>> +as-instr = $(success,printf "%b\n" "$(1)" | $(CC) $(CLANG_FLAGS) -c -x assembler -o /dev/null -)
> CLANG_FLAGS caught my eye here, then noticing that cc-option
> also uses it. Anthony - what's the deal with this? It doesn't
> look to get defined anywhere, and I also don't see what clang-
> specific about these constructs.

This is inherited from Linux.  There is obviously a reason for it.

However, I'm not interested in diving into that rabbit hole in an
unrelated series.

~Andrew



* Re: [PATCH 06/16] x86/traps: Implement #CP handler and extend #PF for shadow stacks
  2020-05-04 14:10   ` Jan Beulich
@ 2020-05-11 17:20     ` Andrew Cooper
  2020-05-12 13:58       ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 17:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 04/05/2020 15:10, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> @@ -1457,6 +1451,10 @@ void do_page_fault(struct cpu_user_regs *regs)
>>      {
>>          enum pf_type pf_type = spurious_page_fault(addr, regs);
>>  
>> +        /* Any fault on a shadow stack access is a bug in Xen. */
>> +        if ( error_code & PFEC_shstk )
>> +            goto fatal;
> Not going through the full spurious_page_fault() in this case
> would seem desirable, as would be at least a respective
> adjustment to __page_fault_type(). Perhaps such an adjustment
> could then avoid the change (and the need for goto) here?

This seems to do a lot of things which have little/nothing to do with
spurious faults.

In particular, we don't need to disable interrupts to look at
PFEC_shstk, or RSVD for that matter.

>> @@ -1906,6 +1905,43 @@ void do_debug(struct cpu_user_regs *regs)
>>      pv_inject_hw_exception(TRAP_debug, X86_EVENT_NO_EC);
>>  }
>>  
>> +void do_entry_CP(struct cpu_user_regs *regs)
> If there's a plan to change other exception handlers to this naming
> scheme too, I can live with the intermediate inconsistency.

Yes.  This isn't the first time I've introduced this naming scheme
either, but the emulator patch got stuck in trivialities.

> Otherwise, though, I'd prefer a name closer to what other handlers
> use, e.g. do_control_prot(). Same for the asm entry point then.

These names are impossible to search for, because there is no
consistency even amongst the similarly named ones.

>
>> @@ -940,7 +944,8 @@ autogen_stubs: /* Automatically generated stubs. */
>>          entrypoint 1b
>>  
>>          /* Reserved exceptions, heading towards do_reserved_trap(). */
>> -        .elseif vec == TRAP_copro_seg || vec == TRAP_spurious_int || (vec > TRAP_simd_error && vec < TRAP_nr)
>> +        .elseif vec == TRAP_copro_seg || vec == TRAP_spurious_int || \
>> +                vec == TRAP_virtualisation || (vec > X86_EXC_CP && vec < TRAP_nr)
> Use the shorter X86_EXC_VE here, since you don't want/need to
> retain what was there before? (I'd be fine with you changing
> the two other TRAP_* too at this occasion.)

Ok.

Having played this game several times now, I'm considering reworking how
do_reserved_trap() gets set up, because we've now got two places which
can go wrong and result in an unhelpful assert early in boot, rather
than a build-time failure.  (The one thing I'm not sure of is how to
usefully turn this into a build-time failure.)

>
>> --- a/xen/include/asm-x86/processor.h
>> +++ b/xen/include/asm-x86/processor.h
>> @@ -68,6 +68,7 @@
>>  #define PFEC_reserved_bit   (_AC(1,U) << 3)
>>  #define PFEC_insn_fetch     (_AC(1,U) << 4)
>>  #define PFEC_prot_key       (_AC(1,U) << 5)
>> +#define PFEC_shstk         (_AC(1,U) << 6)
> One too few padding blanks?

Oops.

~Andrew



* Re: [PATCH 07/16] x86/shstk: Re-layout the stack block for shadow stacks
  2020-05-04 14:24   ` Jan Beulich
@ 2020-05-11 17:48     ` Andrew Cooper
  2020-05-12 14:07       ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 17:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 04/05/2020 15:24, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> --- a/xen/arch/x86/cpu/common.c
>> +++ b/xen/arch/x86/cpu/common.c
>> @@ -732,14 +732,14 @@ void load_system_tables(void)
>>  		.rsp2 = 0x8600111111111111ul,
>>  
>>  		/*
>> -		 * MCE, NMI and Double Fault handlers get their own stacks.
>> +		 * #DB, NMI, DF and #MCE handlers get their own stacks.
> Then also #DF and #MC?

Ok.

>
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -6002,25 +6002,18 @@ void memguard_unguard_range(void *p, unsigned long l)
>>  
>>  void memguard_guard_stack(void *p)
>>  {
>> -    /* IST_MAX IST pages + at least 1 guard page + primary stack. */
>> -    BUILD_BUG_ON((IST_MAX + 1) * PAGE_SIZE + PRIMARY_STACK_SIZE > STACK_SIZE);
>> +    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
>>  
>> -    memguard_guard_range(p + IST_MAX * PAGE_SIZE,
>> -                         STACK_SIZE - PRIMARY_STACK_SIZE - IST_MAX * PAGE_SIZE);
>> +    p += 5 * PAGE_SIZE;
> The literal 5 here and ...
>
>> +    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
>>  }
>>  
>>  void memguard_unguard_stack(void *p)
>>  {
>> -    memguard_unguard_range(p + IST_MAX * PAGE_SIZE,
>> -                           STACK_SIZE - PRIMARY_STACK_SIZE - IST_MAX * PAGE_SIZE);
>> -}
>> -
>> -bool memguard_is_stack_guard_page(unsigned long addr)
>> -{
>> -    addr &= STACK_SIZE - 1;
>> +    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_RW);
>>  
>> -    return addr >= IST_MAX * PAGE_SIZE &&
>> -           addr < STACK_SIZE - PRIMARY_STACK_SIZE;
>> +    p += 5 * PAGE_SIZE;
> ... here could do with macro-izing: IST_MAX + 1 would already be
> a little better, I guess.

The problem is that "IST_MAX + 1" is now less meaningful than a literal
5, because at least 5 obviously matches up with the comment describing
which page does what.

~Andrew

>
> Preferably with adjustments along these lines
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
> Jan




* Re: [PATCH 09/16] x86/cpu: Adjust enable_nmis() to be shadow stack compatible
  2020-05-05 14:48   ` Jan Beulich
@ 2020-05-11 18:48     ` Andrew Cooper
  0 siblings, 0 replies; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 18:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 05/05/2020 15:48, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> When executing an IRET-to-self, the shadow stack must agree with the regular
>> stack.  We can't manipulate SSP directly, so have to fake a shadow IRET frame
>> by executing 3 CALLs, then editing the result to look correct.
>>
>> This is not a fastpath, is called on the BSP long before CET can be set up,
>> and may be called on the crash path after CET is disabled.  Use the fact that
>> INCSSP is allocated from the hint nop space to construct a test for CET being
>> active which is safe on all processors.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> albeit with a question which may make a further change necessary:
>
>> --- a/xen/include/asm-x86/processor.h
>> +++ b/xen/include/asm-x86/processor.h
>> @@ -544,17 +544,40 @@ static inline void enable_nmis(void)
>>  {
>>      unsigned long tmp;
>>  
>> -    asm volatile ( "mov %%rsp, %[tmp]     \n\t"
>> -                   "push %[ss]            \n\t"
>> -                   "push %[tmp]           \n\t"
>> -                   "pushf                 \n\t"
>> -                   "push %[cs]            \n\t"
>> -                   "lea 1f(%%rip), %[tmp] \n\t"
>> -                   "push %[tmp]           \n\t"
>> -                   "iretq; 1:             \n\t"
>> -                   : [tmp] "=&r" (tmp)
>> +    asm volatile ( "mov     %%rsp, %[rsp]        \n\t"
>> +                   "lea    .Ldone(%%rip), %[rip] \n\t"
>> +#ifdef CONFIG_XEN_SHSTK
>> +                   /* Check for CET-SS being active. */
>> +                   "mov    $1, %k[ssp]           \n\t"
>> +                   "rdsspq %[ssp]                \n\t"
>> +                   "cmp    $1, %k[ssp]           \n\t"
>> +                   "je     .Lshstk_done          \n\t"
>> +
>> +                   /* Push 3 words on the shadow stack */
>> +                   ".rept 3                      \n\t"
>> +                   "call 1f; nop; 1:             \n\t"
>> +                   ".endr                        \n\t"
>> +
>> +                   /* Fixup to be an IRET shadow stack frame */
>> +                   "wrssq  %q[cs], -1*8(%[ssp])  \n\t"
>> +                   "wrssq  %[rip], -2*8(%[ssp])  \n\t"
>> +                   "wrssq  %[ssp], -3*8(%[ssp])  \n\t"
>> +
>> +                   ".Lshstk_done:"
>> +#endif
>> +                   /* Write an IRET regular frame */
>> +                   "push   %[ss]                 \n\t"
>> +                   "push   %[rsp]                \n\t"
>> +                   "pushf                        \n\t"
>> +                   "push   %q[cs]                \n\t"
>> +                   "push   %[rip]                \n\t"
>> +                   "iretq                        \n\t"
>> +                   ".Ldone:                      \n\t"
>> +                   : [rip] "=&r" (tmp),
>> +                     [rsp] "=&r" (tmp),
>> +                     [ssp] "=&r" (tmp)
> Even after an hour of reading and searching through the gcc docs
> I can't convince myself that this utilizes defined behavior. We
> do tie multiple outputs to the same C variable elsewhere, yes,
> but only in cases where we don't really care about the register,
> or where the register is a fixed one anyway. What I can't find
> is a clear statement that gcc wouldn't ever, now or in the
> future, use the same register for all three outputs. I think one
> can deduce this in certain ways, and experimentation also seems
> to confirm it, but still.

I don't see how different local variables will make any difference.  The
variables themselves will be dropped for being write-only before
register scheduling happens.

GCC and Clang reliably use the callee preserved registers first, then
start spilling caller preserved registers, and finally refuse to
assemble as soon as we hit 16 registers of this form, citing impossible
constraints (GCC) or too many registers (Clang).

The important point is that they are listed as separate outputs, so
cannot share register allocations.  If this were to be violated, loads
of code would malfunction.

~Andrew



* Re: [PATCH 10/16] x86/cpu: Adjust reset_stack_and_jump() to be shadow stack compatible
  2020-05-07 13:17   ` Jan Beulich
@ 2020-05-11 20:07     ` Andrew Cooper
  0 siblings, 0 replies; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 20:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 07/05/2020 14:17, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> We need to unwind up to the supervisor token.  See the comment for details.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Wei Liu <wl@xen.org>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> ---
>>  xen/include/asm-x86/current.h | 42 +++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 39 insertions(+), 3 deletions(-)
>>
>> diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
>> index 99b66a0087..2a7b728b1e 100644
>> --- a/xen/include/asm-x86/current.h
>> +++ b/xen/include/asm-x86/current.h
>> @@ -124,13 +124,49 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>  # define CHECK_FOR_LIVEPATCH_WORK ""
>>  #endif
>>  
>> +#ifdef CONFIG_XEN_SHSTK
>> +/*
>> + * We need to unwind the primary shadow stack to its supervisor token, located
>> + * at 0x5ff8 from the base of the stack blocks.
>> + *
>> + * Read the shadow stack pointer, subtract it from 0x5ff8, divide by 8 to get
>> + * the number of slots needing popping.
>> + *
>> + * INCSSPQ can't pop more than 255 entries.  We shouldn't ever need to pop
>> + * that many entries, and getting this wrong will cause us to #DF later.
>> + */
>> +# define SHADOW_STACK_WORK                      \
>> +    "mov $1, %[ssp];"                           \
>> +    "rdsspd %[ssp];"                            \
>> +    "cmp $1, %[ssp];"                           \
>> +    "je 1f;" /* CET not active?  Skip. */       \
>> +    "mov $"STR(0x5ff8)", %[val];"               \
> As per comments on earlier patches, I think it would be nice if
> this wasn't a literal number here, but tied to actual stack
> layout via some suitable expression. An option might be to use
> 0xff8 (or the constant to be introduced for it in the earlier
> patch) here and ...
>
>> +    "and $"STR(STACK_SIZE - 1)", %[ssp];"       \
> ... PAGE_SIZE here.

It is important to use STACK_SIZE here and not PAGE_SIZE to trigger...

>> +    "sub %[ssp], %[val];"                       \
>> +    "shr $3, %[val];"                           \
>> +    "cmp $255, %[val];"                         \
>> +    "jle 2f;"                                   \

... this condition if we try to reset&jump from more than 4k away from
0x5ff8, e.g. from an IST stack.

Whatever happens we're going to crash, but given that we're talking
about imm32's here,

> Perhaps better "jbe", treating the unsigned values as such?

What I really want is actually to opencode an UNLIKELY() region, seeing
as none of our infrastructure works inside inline asm.  Same for...

>
>> +    "ud2a;"                                     \

... this to turn into a real BUG.

~Andrew



* Re: [PATCH 12/16] x86/extable: Adjust extable handling to be shadow stack compatible
  2020-05-07 13:35   ` Jan Beulich
@ 2020-05-11 21:09     ` Andrew Cooper
  2020-05-12 14:31       ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 21:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 07/05/2020 14:35, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -778,6 +778,28 @@ static bool exception_fixup(struct cpu_user_regs *regs, bool print)
>>                 vec_name(regs->entry_vector), regs->error_code,
>>                 _p(regs->rip), _p(regs->rip), _p(fixup));
>>  
>> +    if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
>> +    {
>> +        unsigned long ssp;
>> +
>> +        asm ("rdsspq %0" : "=r" (ssp) : "0" (1) );
>> +        if ( ssp != 1 )
>> +        {
>> +            unsigned long *ptr = _p(ssp);
>> +
>> +            /* Search for %rip in the shadow stack, ... */
>> +            while ( *ptr != regs->rip )
>> +                ptr++;
> Wouldn't it be better to bound the loop, as it shouldn't search past
> (strictly speaking not even to) the next page boundary? Also you
> don't care about the top of the stack (being the to be restored SSP),
> do you? I.e. maybe
>
>             while ( *++ptr != regs->rip )
>                 ;
>
> ?
>
> And then - isn't searching for a specific RIP value alone prone to
> error, in case a it matches an ordinary return address? I.e.
> wouldn't you better search for a matching RIP accompanied by a
> suitable pointer into the shadow stack and a matching CS value?
> Otherwise, ...
>
>> +            ASSERT(ptr[1] == __HYPERVISOR_CS);
> ... also assert that ptr[-1] points into the shadow stack?

So this is the problem I was talking about that the previous contexts
SSP isn't stored anywhere helpful.

What we are in practice doing is looking 2 or 3 words up the shadow
stack (depending on exactly how deep our call graph is), to the shadow
IRET frame matching the real IRET frame which regs is pointing to.

Both IRET frames were pushed in the process of generating the exception,
and we've already matched regs->rip to the exception table record.  We
need to fix up regs->rip and the shadow lip to the fixup address.

As we are always fixing up an exception generated from Xen context, we
know that ptr[1] == __HYPERVISOR_CS, and ptr[-1] == (unsigned long)&ptr[2], as we
haven't switched shadow stack as part of taking this exception. 
However, this second point is fragile to exception handlers moving onto IST.

We can't encounter regs->rip in the shadow stack between the current SSP
and the IRET frame we're looking to fix up, or we would have taken a
recursive fault and not reached exception_fixup() to begin with.

Therefore, the loop is reasonably bounded in all cases.

Sadly, there is no RDSS instruction, so we can't actually use shadow
stack reads to spot if we underflowed the shadow stack, and there is no
useful alternative to panic() if we fail to find the shadow IRET frame.

>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -708,7 +708,16 @@ exception_with_ints_disabled:
>>          call  search_pre_exception_table
>>          testq %rax,%rax                 # no fixup code for faulting EIP?
>>          jz    1b
>> -        movq  %rax,UREGS_rip(%rsp)
>> +        movq  %rax,UREGS_rip(%rsp)      # fixup regular stack
>> +
>> +#ifdef CONFIG_XEN_SHSTK
>> +        mov    $1, %edi
>> +        rdsspq %rdi
>> +        cmp    $1, %edi
>> +        je     .L_exn_shstk_done
>> +        wrssq  %rax, (%rdi)             # fixup shadow stack
>> +.L_exn_shstk_done:
>> +#endif
> Again avoid the conditional jump by using alternatives patching?

Well - that depends on whether we're likely to gain any new content in
the pre exception table.

As it stands, it is only the IRET(s) to userspace so would be safe to
turn this into an unconditional alternative.  Even in the crash case, we
won't be returning to guest context after having started the crash
teardown path.

~Andrew



* Re: [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible
  2020-05-07 16:15       ` Jan Beulich
@ 2020-05-11 21:45         ` Andrew Cooper
  2020-05-12 14:56           ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 21:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 07/05/2020 17:15, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/x86_64/entry.S
>>>> +++ b/xen/arch/x86/x86_64/entry.S
>>>> @@ -194,6 +194,15 @@ restore_all_guest:
>>>>          movq  8(%rsp),%rcx            # RIP
>>>>          ja    iret_exit_to_guest
>>>>  
>>>> +        /* Clear the supervisor shadow stack token busy bit. */
>>>> +.macro rag_clrssbsy
>>>> +        push %rax
>>>> +        rdsspq %rax
>>>> +        clrssbsy (%rax)
>>>> +        pop %rax
>>>> +.endm
>>>> +        ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
>>> In principle you could get away without spilling %rax:
>>>
>>>         cmpl  $1,%ecx
>>>         ja    iret_exit_to_guest
>>>
>>>         /* Clear the supervisor shadow stack token busy bit. */
>>> .macro rag_clrssbsy
>>>         rdsspq %rcx
>>>         clrssbsy (%rcx)
>>> .endm
>>>         ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
>>>         movq  8(%rsp),%rcx            # RIP
>>>         cmpw  $FLAT_USER_CS32,16(%rsp)# CS
>>>         movq  32(%rsp),%rsp           # RSP
>>>         je    1f
>>>         sysretq
>>> 1:      sysretl
>>>
>>>         ALIGN
>>> /* No special register assumptions. */
>>> iret_exit_to_guest:
>>>         movq  8(%rsp),%rcx            # RIP
>>>         andl  $~(X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_VM),24(%rsp)
>>>         ...
>>>
>>> Also - what about CLRSSBSY failing? It would seem easier to diagnose
>>> this right here than when getting presumably #DF upon next entry into
>>> Xen. At the very least I think it deserves a comment if an error case
>>> does not get handled.
>> I did consider this, but ultimately decided against it.
>>
>> You can't have an unlikely block inside a alternative block because the
>> jmp's displacement doesn't get fixed up.
> We do fix up unconditional JMP/CALL displacements; I don't
> see why we couldn't also do so for conditional ones.

Only for the first instruction in the block.

We do not decode the entire block of instructions and fix up each
displacement.

>
>>   Keeping everything inline puts
>> an incorrect statically-predicted branch in program flow.
>>
>> Most importantly however, is that the SYSRET path is vastly less common
>> than the IRET path.  There is no easy way to proactively spot problems
>> in the IRET path, which means that conditions leading to a problem are
>> already far more likely to manifest as #DF, so there is very little
>> value in adding complexity to the SYSRET path in the first place.
> The SYSRET path being uncommon is a problem by itself imo, if
> that's indeed the case. I'm sure I've suggested before that
> we convert frames to TRAP_syscall ones whenever possible,
> such that we wouldn't go the slower IRET path.

It is not possible to convert any.

The opportunistic SYSRET logic in Linux loses you performance in
reality.  It's just that the extra conditionals are very highly predicted
and totally dominated by the ring transition cost.

You can create a synthetic test case where the opportunistic logic makes
a performance win, but the chance of encountering real world code where
TRAP_syscall is clear and %r11 and %rcx match flags/rip is 1 in 2^128.

It is very much not worth the extra code and cycles taken to implement.

>>> Somewhat similar for SETSSBSY, except there things get complicated by
>>> it raising #CP instead of setting EFLAGS.CF: Aiui it would require us
>>> to handle #CP on an IST stack in order to avoid #DF there.
>> Right, but having #CP as IST gives us far worse problems.
>>
Being able to spot #CP vs #DF doesn't help usefully.  It's still some
>> arbitrary period of time after the damage was done.
>>
>> Any nesting of #CP (including fault on IRET out) results in losing
>> program state and entering an infinite loop.
>>
>> The cases which end up as #DF are properly fatal to the system, and we
>> at least get a clean crash out it.
> May I suggest that all of this gets spelled out in at least
> the description of the patch, so that it can be properly
> understood (and, if need be, revisited) later on?

Is this really the right patch to do that?

I do eventually plan to put a whole load of this kinds of details into
the hypervisor guide.

~Andrew



* Re: [PATCH 16/16] x86/shstk: Activate Supervisor Shadow Stacks
  2020-05-07 14:54   ` Jan Beulich
@ 2020-05-11 23:46     ` Andrew Cooper
  0 siblings, 0 replies; 66+ messages in thread
From: Andrew Cooper @ 2020-05-11 23:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 07/05/2020 15:54, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
>> --- a/xen/arch/x86/acpi/wakeup_prot.S
>> +++ b/xen/arch/x86/acpi/wakeup_prot.S
>> @@ -1,3 +1,8 @@
>> +#include <asm/config.h>
> Why is this needed? Afaics assembly files, just like C ones, get
> xen/config.h included from the compiler command line.

I'll double check, but I do recall it being necessary.

>> @@ -48,6 +59,48 @@ ENTRY(s3_resume)
>>          pushq   %rax
>>          lretq
>>  1:
>> +#ifdef CONFIG_XEN_SHSTK
>> +	/*
>> +         * Restoring SSP is a little convoluted, because we are intercepting
>> +         * the middle of an in-use shadow stack.  Write a temporary supervisor
>> +         * token under the stack, so SETSSBSY takes us where we want, then
>> +         * reset MSR_PL0_SSP to its usual value and pop the temporary token.
> What do you mean by "takes us where we want"? I take it "us" is really
> SSP here?

Load the SSP that we want.  SETSSBSY is the only instruction which can
do a fairly arbitrary load of SSP, but it still has to complete the
check and activation of the supervisor token.

>> +         */
>> +        mov     saved_rsp(%rip), %rdi
>> +        cmpq    $1, %rdi
>> +        je      .L_shstk_done
>> +
>> +        /* Write a supervisor token under SSP. */
>> +        sub     $8, %rdi
>> +        mov     %rdi, (%rdi)
>> +
>> +        /* Load it into MSR_PL0_SSP. */
>> +        mov     $MSR_PL0_SSP, %ecx
>> +        mov     %rdi, %rdx
>> +        shr     $32, %rdx
>> +        mov     %edi, %eax
>> +
>> +        /* Enable CET. */
>> +        mov     $MSR_S_CET, %ecx
>> +        xor     %edx, %edx
>> +        mov     $CET_SHSTK_EN | CET_WRSS_EN, %eax
>> +        wrmsr
>> +
>> +        /* Activate our temporary token. */
>> +        mov     $XEN_MINIMAL_CR4 | X86_CR4_CET, %ebx
>> +        mov     %rbx, %cr4
>> +        setssbsy
>> +
>> +        /* Reset MSR_PL0_SSP back to its expected value. */
>> +        and     $~(STACK_SIZE - 1), %eax
>> +        or      $0x5ff8, %eax
>> +        wrmsr
> Ahead of this WRMSR neither %ecx nor %edx look to have their intended
> values anymore. Also there is a again a magic 0x5ff8 here (and at
> least one more further down).

There is another bug in this version which I spotted and fixed.  The
write of the supervisor shadow stack token has to be done after CET is
enabled and with WRSSQ because the mapping is already read-only.

>> --- a/xen/arch/x86/boot/x86_64.S
>> +++ b/xen/arch/x86/boot/x86_64.S
>> @@ -28,8 +28,36 @@ ENTRY(__high_start)
>>          lretq
>>  1:
>>          test    %ebx,%ebx
>> -        jnz     start_secondary
>> +        jz      .L_bsp
>>  
>> +        /* APs.  Set up shadow stacks before entering C. */
>> +
>> +        testl   $cpufeat_mask(X86_FEATURE_XEN_SHSTK), \
>> +                CPUINFO_FEATURE_OFFSET(X86_FEATURE_XEN_SHSTK) + boot_cpu_data(%rip)
>> +        je      .L_ap_shstk_done
>> +
>> +        mov     $MSR_S_CET, %ecx
>> +        xor     %edx, %edx
>> +        mov     $CET_SHSTK_EN | CET_WRSS_EN, %eax
>> +        wrmsr
>> +
>> +        mov     $MSR_PL0_SSP, %ecx
>> +        mov     %rsp, %rdx
>> +        shr     $32, %rdx
>> +        mov     %esp, %eax
>> +        and     $~(STACK_SIZE - 1), %eax
>> +        or      $0x5ff8, %eax
>> +        wrmsr
>> +
>> +        mov     $XEN_MINIMAL_CR4 | X86_CR4_CET, %ecx
>> +        mov     %rcx, %cr4
>> +        setssbsy
> Since the token doesn't get written here, could you make the comment
> say where this happens? I have to admit that I had to go through
> earlier patches to find it again.

Ok.

>
>> +.L_ap_shstk_done:
>> +        call    start_secondary
>> +        BUG     /* start_secondary() shouldn't return. */
> This conversion from a jump to CALL is unrelated and hence would
> better be mentioned in the description imo.
>
>> --- a/xen/arch/x86/cpu/common.c
>> +++ b/xen/arch/x86/cpu/common.c
>> @@ -323,6 +323,11 @@ void __init early_cpu_init(void)
>>  	       x86_cpuid_vendor_to_str(c->x86_vendor), c->x86, c->x86,
>>  	       c->x86_model, c->x86_model, c->x86_mask, eax);
>>  
>> +	if (c->cpuid_level >= 7) {
>> +		cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
>> +		c->x86_capability[cpufeat_word(X86_FEATURE_CET_SS)] = ecx;
>> +	}
> How about moving the leaf 7 code from generic_identify() here as
> a whole?

In the past, we've deliberately not done that to avoid code gaining a
reliance on the pre-cached values.

I have a plan to rework this substantially when I move microcode loading
to the start of __start_xen(), at which point early_cpu_init() will
disappear and become the BSP's regular cpu_init().

Until then, we shouldn't cache unnecessary leaves this early.

>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -664,6 +664,13 @@ static void __init noreturn reinit_bsp_stack(void)
>>      stack_base[0] = stack;
>>      memguard_guard_stack(stack);
>>  
>> +    if ( cpu_has_xen_shstk )
>> +    {
>> +        wrmsrl(MSR_PL0_SSP, (unsigned long)stack + 0x5ff8);
>> +        wrmsrl(MSR_S_CET, CET_SHSTK_EN | CET_WRSS_EN);
>> +        asm volatile ("setssbsy" ::: "memory");
>> +    }
> Same as for APs - a brief comment pointing at where the token was
> written would seem helpful.
>
> Could you also have the patch description say a word on the choice
> of enabling CET_WRSS_EN uniformly and globally?

That is an area for possible improvement.  For now, it is unilaterally
enabled for simplicity.

None of the places we need to use WRSSQ are fast paths.  It is used in
extable recovery, the S3 path, and enable_nmis().

We can get away with an RDMSR/WRMSR/WRMSR sequence, which keeps us safe
from ROP gadgets and from problems caused by poisoning a read-mostly
default.

>> @@ -985,6 +992,21 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>      /* This must come before e820 code because it sets paddr_bits. */
>>      early_cpu_init();
>>  
>> +    /* Choose shadow stack early, to set infrastructure up appropriately. */
>> +    if ( opt_xen_shstk && boot_cpu_has(X86_FEATURE_CET_SS) )
>> +    {
>> +        printk("Enabling Supervisor Shadow Stacks\n");
>> +
>> +        setup_force_cpu_cap(X86_FEATURE_XEN_SHSTK);
>> +#ifdef CONFIG_PV32
>> +        if ( opt_pv32 )
>> +        {
>> +            opt_pv32 = 0;
>> +            printk("  - Disabling PV32 due to Shadow Stacks\n");
>> +        }
>> +#endif
> I think this deserves an explanation, either in a comment or in
> the patch description.

Probably both.

>
>> @@ -1721,6 +1743,10 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>  
>>      alternative_branches();
>>  
>> +    /* Defer CR4.CET until alternatives have finished playing with CR4.WP */
>> +    if ( cpu_has_xen_shstk )
>> +        set_in_cr4(X86_CR4_CET);
> Nit: CR0.WP (in the comment)

Oops.

~Andrew


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent
  2020-05-11 15:14     ` Andrew Cooper
@ 2020-05-12 13:05       ` Jan Beulich
  2020-05-26 18:06         ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-12 13:05 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 11.05.2020 17:14, Andrew Cooper wrote:
> On 04/05/2020 14:20, Jan Beulich wrote:
>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -774,10 +774,27 @@ static void do_reserved_trap(struct cpu_user_regs *regs)
>>>            trapnr, vec_name(trapnr), regs->error_code);
>>>  }
>>>  
>>> +static bool exception_fixup(struct cpu_user_regs *regs, bool print)
>>> +{
>>> +    unsigned long fixup = search_exception_table(regs);
>>> +
>>> +    if ( unlikely(fixup == 0) )
>>> +        return false;
>>> +
>>> +    /* Can currently be triggered by guests.  Make sure we ratelimit. */
>>> +    if ( IS_ENABLED(CONFIG_DEBUG) && print )
>> I didn't think we consider dprintk()-s a possible security issue.
>> Why would we consider so a printk() hidden behind
>> IS_ENABLED(CONFIG_DEBUG)? IOW I think one of XENLOG_GUEST and
>> IS_ENABLED(CONFIG_DEBUG) wants dropping.
> 
> Who said anything about a security issue?

The need to rate limit is (among other aspects) to prevent a
(logspam) security issue, isn't it?

> I'm deliberately not using dprintk() for the reasons explained in the
> commit message, so IS_ENABLED(CONFIG_DEBUG) is to cover that.

Which is fine, and which I understood.

> XENLOG_GUEST is because everything (other than the boot path) hitting
> this caused directly by a guest action, and needs rate-limiting.  This
> was not consistent before.

But why do we need both CONFIG_DEBUG and XENLOG_GUEST? In a debug
build I think we'd better know of such events under the control
of the general log level setting, not under that of guest specific
messages.

>>> @@ -1466,12 +1468,11 @@ void do_page_fault(struct cpu_user_regs *regs)
>>>          if ( pf_type != real_fault )
>>>              return;
>>>  
>>> -        if ( likely((fixup = search_exception_table(regs)) != 0) )
>>> +        if ( likely(exception_fixup(regs, false)) )
>>>          {
>>>              perfc_incr(copy_user_faults);
>>>              if ( unlikely(regs->error_code & PFEC_reserved_bit) )
>>>                  reserved_bit_page_fault(addr, regs);
>>> -            regs->rip = fixup;
>> I'm afraid this modification can't validly be pulled ahead -
>> show_execution_state(), as called from reserved_bit_page_fault(),
>> wants to / should print the original RIP, not the fixed up one.
> 
> This path is bogus to begin with.  Any RSVD pagefault (other than the
> Shadow MMIO fastpath, handled earlier) is a bug in Xen and should be
> fatal rather than just a warning on extable-tagged instructions.

I could see reasons to convert to a fatal exception (without looking
up fixups), but then in a separate change with suitable justification.
I'd like to point out though that this would convert a logspam kind
of DoS to a Xen crash one, in case such a PTE would indeed be created
anywhere by mistake.

However, it is not helpful to the diagnosis of such a problem if the
logged address points at the fixup code rather than the faulting one.

Jan



* Re: [PATCH 05/16] x86/shstk: Introduce Supervisor Shadow Stack support
  2020-05-11 15:46     ` Andrew Cooper
@ 2020-05-12 13:54       ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-12 13:54 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Anthony Perard, Xen-devel, Wei Liu, Roger Pau Monné

On 11.05.2020 17:46, Andrew Cooper wrote:
> On 04/05/2020 14:52, Jan Beulich wrote:
>>
>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/Kconfig
>>> +++ b/xen/arch/x86/Kconfig
>>> @@ -34,6 +34,9 @@ config ARCH_DEFCONFIG
>>>  config INDIRECT_THUNK
>>>  	def_bool $(cc-option,-mindirect-branch-register)
>>>  
>>> +config HAS_AS_CET
>>> +	def_bool $(as-instr,wrssq %rax$(comma)0;setssbsy;endbr64)
>> I see you add as-instr here as a side effect. Until the other
>> similar checks get converted, I think for the time being we should
>> use the old model, to have all such checks in one place. I realize
>> this means you can't have a Kconfig dependency then.
> 
> No.  That's like asking me to keep on using bool_t, and you are the
> first person to point out and object to that in newly submitted patches.

These are entirely different things. The bool_t => bool
conversion is agreed upon. The conversion to record tool chain
capabilities in xen/.config isn't. I've raised my reservations
against this elsewhere. I can be convinced, but not by trying to
introduce such functionality as a side effect of an unrelated
change.

>> Also why do you check multiple insns, when just one (like we do
>> elsewhere) would suffice?
> 
> Because CET-SS and CET-IBT are orthogonal ABIs, but you wanted a single
> CET symbol, rather than a CET_SS symbol.
> 
> I picked a sample of various instructions to get broad coverage of CET
> without including every instruction.

I wanted HAS_AS_CET rather than HAS_AS_CET_SS and HAS_AS_CET_IBT
because both got added to gas at the same time, and hence there's
little point in having separate symbols. If there was a reason to
assume assemblers might be out there which support one but not
the other, then we indeed ought to have two symbols.

>> The crucial insns to check are those which got changed pretty
>> close before the release of 2.29 (in the cover letter you
>> mention 2.32): One of incssp{d,q}, setssbsy, or saveprevssp.
>> There weren't official binutils releases with the original
>> insns, but distros may still have picked up intermediate
>> snapshots.
> 
> I've got zero interest in catering to distros which are still using
> obsolete pre-release toolchains.  Bleeding edge toolchains are one
> thing, but this is like asking us to support the middle changeset of the
> series introducing CET, which is absolutely a non-starter.
> 
> If the instructions were missing from real binutils releases, then
> obviously we should exclude those releases, but that doesn't appear to
> be the case.

But you realize that there's no special effort needed? We merely
need to check for the right insns. Their operands not matching
for binutils from the intermediate time window is enough for our
purposes. With my remark I merely meant to guide which of the
three insns you've picked needs to remain, and which would imo
better be dropped.

>>> --- a/xen/arch/x86/setup.c
>>> +++ b/xen/arch/x86/setup.c
>>> @@ -95,6 +95,36 @@ unsigned long __initdata highmem_start;
>>>  size_param("highmem-start", highmem_start);
>>>  #endif
>>>  
>>> +static bool __initdata opt_xen_shstk = true;
>>> +
>>> +static int parse_xen(const char *s)
>>> +{
>>> +    const char *ss;
>>> +    int val, rc = 0;
>>> +
>>> +    do {
>>> +        ss = strchr(s, ',');
>>> +        if ( !ss )
>>> +            ss = strchr(s, '\0');
>>> +
>>> +        if ( (val = parse_boolean("shstk", s, ss)) >= 0 )
>>> +        {
>>> +#ifdef CONFIG_XEN_SHSTK
>>> +            opt_xen_shstk = val;
>>> +#else
>>> +            no_config_param("XEN_SHSTK", "xen", s, ss);
>>> +#endif
>>> +        }
>>> +        else
>>> +            rc = -EINVAL;
>>> +
>>> +        s = ss + 1;
>>> +    } while ( *ss );
>>> +
>>> +    return rc;
>>> +}
>>> +custom_param("xen", parse_xen);
>> What's the idea here going forward, i.e. why the new top level
>> "xen" option? Almost all options are for Xen itself, after all.
>> Did you perhaps mean this to be "cet"?
> 
> I forgot an RFC for this, as I couldn't think of anything better.  "cet"
> as a top level option isn't going to gain more than {no-}shstk and
> {no-}ibt as suboptions.

Imo that's still better than "xen=".

>>> --- a/xen/scripts/Kconfig.include
>>> +++ b/xen/scripts/Kconfig.include
>>> @@ -31,6 +31,10 @@ cc-option = $(success,$(CC) -Werror $(CLANG_FLAGS) $(1) -E -x c /dev/null -o /de
>>>  # Return y if the linker supports <flag>, n otherwise
>>>  ld-option = $(success,$(LD) -v $(1))
>>>  
>>> +# $(as-instr,<instr>)
>>> +# Return y if the assembler supports <instr>, n otherwise
>>> +as-instr = $(success,printf "%b\n" "$(1)" | $(CC) $(CLANG_FLAGS) -c -x assembler -o /dev/null -)
>> CLANG_FLAGS caught my eye here, then noticing that cc-option
>> also uses it. Anthony - what's the deal with this? It doesn't
>> look to get defined anywhere, and I also don't see what clang-
>> specific about these constructs.
> 
> This is as it inherits from Linux.  There is obviously a reason.
> 
> However, I'm not interested in diving into that rabbit hole in an
> unrelated series.

That's fine - my question was directed at Anthony after all.

Jan



* Re: [PATCH 06/16] x86/traps: Implement #CP handler and extend #PF for shadow stacks
  2020-05-11 17:20     ` Andrew Cooper
@ 2020-05-12 13:58       ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-12 13:58 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 11.05.2020 19:20, Andrew Cooper wrote:
> On 04/05/2020 15:10, Jan Beulich wrote:
>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>> @@ -1457,6 +1451,10 @@ void do_page_fault(struct cpu_user_regs *regs)
>>>      {
>>>          enum pf_type pf_type = spurious_page_fault(addr, regs);
>>>  
>>> +        /* Any fault on a shadow stack access is a bug in Xen. */
>>> +        if ( error_code & PFEC_shstk )
>>> +            goto fatal;
>> Not going through the full spurious_page_fault() in this case
>> would seem desirable, as would be at least a respective
>> adjustment to __page_fault_type(). Perhaps such an adjustment
>> could then avoid the change (and the need for goto) here?
> 
> This seems to do a lot of things which have little/nothing to do with
> spurious faults.
> 
> In particular, we don't need to disable interrupts to look at
> PFEC_shstk, or RSVD for that matter.

Perhaps even more so a reason to make spurious_page_fault()
return a new enum pf_type enumerator? In any event your reply
looks more like a "yes" to my suggestion than an objection,
but I may be getting it entirely wrong ...

Jan



* Re: [PATCH 07/16] x86/shstk: Re-layout the stack block for shadow stacks
  2020-05-11 17:48     ` Andrew Cooper
@ 2020-05-12 14:07       ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-12 14:07 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 11.05.2020 19:48, Andrew Cooper wrote:
> On 04/05/2020 15:24, Jan Beulich wrote:
>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/cpu/common.c
>>> +++ b/xen/arch/x86/cpu/common.c
>>> @@ -732,14 +732,14 @@ void load_system_tables(void)
>>>  		.rsp2 = 0x8600111111111111ul,
>>>  
>>>  		/*
>>> -		 * MCE, NMI and Double Fault handlers get their own stacks.
>>> +		 * #DB, NMI, DF and #MCE handlers get their own stacks.
>> Then also #DF and #MC?
> 
> Ok.
> 
>>
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -6002,25 +6002,18 @@ void memguard_unguard_range(void *p, unsigned long l)
>>>  
>>>  void memguard_guard_stack(void *p)
>>>  {
>>> -    /* IST_MAX IST pages + at least 1 guard page + primary stack. */
>>> -    BUILD_BUG_ON((IST_MAX + 1) * PAGE_SIZE + PRIMARY_STACK_SIZE > STACK_SIZE);
>>> +    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
>>>  
>>> -    memguard_guard_range(p + IST_MAX * PAGE_SIZE,
>>> -                         STACK_SIZE - PRIMARY_STACK_SIZE - IST_MAX * PAGE_SIZE);
>>> +    p += 5 * PAGE_SIZE;
>> The literal 5 here and ...
>>
>>> +    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, _PAGE_NONE);
>>>  }
>>>  
>>>  void memguard_unguard_stack(void *p)
>>>  {
>>> -    memguard_unguard_range(p + IST_MAX * PAGE_SIZE,
>>> -                           STACK_SIZE - PRIMARY_STACK_SIZE - IST_MAX * PAGE_SIZE);
>>> -}
>>> -
>>> -bool memguard_is_stack_guard_page(unsigned long addr)
>>> -{
>>> -    addr &= STACK_SIZE - 1;
>>> +    map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_RW);
>>>  
>>> -    return addr >= IST_MAX * PAGE_SIZE &&
>>> -           addr < STACK_SIZE - PRIMARY_STACK_SIZE;
>>> +    p += 5 * PAGE_SIZE;
>> ... here could do with macro-izing: IST_MAX + 1 would already be
>> a little better, I guess.
> 
> The problem is that "IST_MAX + 1" is now less meaningful than a literal
> 5, because at least 5 obviously matches up with the comment describing
> which page does what.

If you don't like IST_MAX+1, can I at least talk you into introducing
a separate constant? This is the only way to connect together the
various places where it is used. We can't be sure that we're not going
to touch this code anymore, ever. And if we do, it'll help if related
places are easy to spot.

Jan



* Re: [PATCH 12/16] x86/extable: Adjust extable handling to be shadow stack compatible
  2020-05-11 21:09     ` Andrew Cooper
@ 2020-05-12 14:31       ` Jan Beulich
  2020-05-12 16:14         ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-12 14:31 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 11.05.2020 23:09, Andrew Cooper wrote:
> On 07/05/2020 14:35, Jan Beulich wrote:
>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -778,6 +778,28 @@ static bool exception_fixup(struct cpu_user_regs *regs, bool print)
>>>                 vec_name(regs->entry_vector), regs->error_code,
>>>                 _p(regs->rip), _p(regs->rip), _p(fixup));
>>>  
>>> +    if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
>>> +    {
>>> +        unsigned long ssp;
>>> +
>>> +        asm ("rdsspq %0" : "=r" (ssp) : "0" (1) );
>>> +        if ( ssp != 1 )
>>> +        {
>>> +            unsigned long *ptr = _p(ssp);
>>> +
>>> +            /* Search for %rip in the shadow stack, ... */
>>> +            while ( *ptr != regs->rip )
>>> +                ptr++;
>> Wouldn't it be better to bound the loop, as it shouldn't search past
>> (strictly speaking not even to) the next page boundary? Also you
>> don't care about the top of the stack (being the to be restored SSP),
>> do you? I.e. maybe
>>
>>             while ( *++ptr != regs->rip )
>>                 ;
>>
>> ?
>>
>> And then - isn't searching for a specific RIP value alone prone to
>> error, in case it matches an ordinary return address? I.e.
>> wouldn't you better search for a matching RIP accompanied by a
>> suitable pointer into the shadow stack and a matching CS value?
>> Otherwise, ...
>>
>>> +            ASSERT(ptr[1] == __HYPERVISOR_CS);
>> ... also assert that ptr[-1] points into the shadow stack?
> 
> So this is the problem I was talking about that the previous contexts
> SSP isn't stored anywhere helpful.
> 
> What we are in practice doing is looking 2 or 3 words up the shadow
> stack (depending on exactly how deep our call graph is), to the shadow
> IRET frame matching the real IRET frame which regs is pointing to.
> 
> Both IRET frames were pushed in the process of generating the exception,
> and we've already matched regs->rip to the exception table record.  We
> need to fix up regs->rip and the shadow lip to the fixup address.
> 
> As we are always fixing up an exception generated from Xen context, we
> know that ptr[1] == __HYPERVISOR_CS, and *ptr[-1] = &ptr[2], as we
> haven't switched shadow stack as part of taking this exception. 
> However, this second point is fragile to exception handlers moving onto IST.
> 
> We can't encounter regs->rip in the shadow stack between the current SSP
> and the IRET frame we're looking to fix up, or we would have taken a
> recursive fault and not reached exception_fixup() to begin with.

I'm afraid I don't follow here. Consider a function (also)
involved in exception handling having this code sequence:

    call    func
    mov     (%rax), %eax

If the fault we're handling occured on the MOV and
exception_fixup() is a descendant of func(), then the first
instance of an address on the shadow stack pointing at this
MOV is going to be the one which did not fault.

> Therefore, the loop is reasonably bounded in all cases.
> 
> Sadly, there is no RDSS instruction, so we can't actually use shadow
> stack reads to spot if we underflowed the shadow stack, and there is no
> useful alternative to panic() if we fail to find the shadow IRET frame.

But afaics you don't panic() in this case. Instead you continue
looping until (presumably) you hit some form of fault.

>>> --- a/xen/arch/x86/x86_64/entry.S
>>> +++ b/xen/arch/x86/x86_64/entry.S
>>> @@ -708,7 +708,16 @@ exception_with_ints_disabled:
>>>          call  search_pre_exception_table
>>>          testq %rax,%rax                 # no fixup code for faulting EIP?
>>>          jz    1b
>>> -        movq  %rax,UREGS_rip(%rsp)
>>> +        movq  %rax,UREGS_rip(%rsp)      # fixup regular stack
>>> +
>>> +#ifdef CONFIG_XEN_SHSTK
>>> +        mov    $1, %edi
>>> +        rdsspq %rdi
>>> +        cmp    $1, %edi
>>> +        je     .L_exn_shstk_done
>>> +        wrssq  %rax, (%rdi)             # fixup shadow stack
>>> +.L_exn_shstk_done:
>>> +#endif
>> Again avoid the conditional jump by using alternatives patching?
> 
> Well - that depends on whether we're likely to gain any new content in
> the pre exception table.
> 
> As it stands, it is only the IRET(s) to userspace so would be safe to
> turn this into an unconditional alternative.  Even in the crash case, we
> won't be returning to guest context after having started the crash
> teardown path.

Ah, right - perhaps indeed better keep it as is then.

Jan



* Re: [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible
  2020-05-11 21:45         ` Andrew Cooper
@ 2020-05-12 14:56           ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-12 14:56 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 11.05.2020 23:45, Andrew Cooper wrote:
> On 07/05/2020 17:15, Jan Beulich wrote:
>>>>> --- a/xen/arch/x86/x86_64/entry.S
>>>>> +++ b/xen/arch/x86/x86_64/entry.S
>>>>> @@ -194,6 +194,15 @@ restore_all_guest:
>>>>>          movq  8(%rsp),%rcx            # RIP
>>>>>          ja    iret_exit_to_guest
>>>>>  
>>>>> +        /* Clear the supervisor shadow stack token busy bit. */
>>>>> +.macro rag_clrssbsy
>>>>> +        push %rax
>>>>> +        rdsspq %rax
>>>>> +        clrssbsy (%rax)
>>>>> +        pop %rax
>>>>> +.endm
>>>>> +        ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
>>>> In principle you could get away without spilling %rax:
>>>>
>>>>         cmpl  $1,%ecx
>>>>         ja    iret_exit_to_guest
>>>>
>>>>         /* Clear the supervisor shadow stack token busy bit. */
>>>> .macro rag_clrssbsy
>>>>         rdsspq %rcx
>>>>         clrssbsy (%rcx)
>>>> .endm
>>>>         ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
>>>>         movq  8(%rsp),%rcx            # RIP
>>>>         cmpw  $FLAT_USER_CS32,16(%rsp)# CS
>>>>         movq  32(%rsp),%rsp           # RSP
>>>>         je    1f
>>>>         sysretq
>>>> 1:      sysretl
>>>>
>>>>         ALIGN
>>>> /* No special register assumptions. */
>>>> iret_exit_to_guest:
>>>>         movq  8(%rsp),%rcx            # RIP
>>>>         andl  $~(X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_VM),24(%rsp)
>>>>         ...
>>>>
>>>> Also - what about CLRSSBSY failing? It would seem easier to diagnose
>>>> this right here than when getting presumably #DF upon next entry into
>>>> Xen. At the very least I think it deserves a comment if an error case
>>>> does not get handled.
>>> I did consider this, but ultimately decided against it.
>>>
>>> You can't have an unlikely block inside a alternative block because the
>>> jmp's displacement doesn't get fixed up.
>> We do fix up unconditional JMP/CALL displacements; I don't
>> see why we couldn't also do so for conditional ones.
> 
> Only for the first instruction in the block.
> 
> We do not decode the entire block of instructions and fix up each
> displacement.

Right, but that's not overly difficult to overcome - simply split
the ALTERNATIVE in two.

>>>   Keeping everything inline puts
>>> an incorrect statically-predicted branch in program flow.
>>>
>>> Most importantly however, is that the SYSRET path is vastly less common
>>> than the IRET path.  There is no easy way to proactively spot problems
>>> in the IRET path, which means that conditions leading to a problem are
>>> already far more likely to manifest as #DF, so there is very little
>>> value in adding complexity to the SYSRET path in the first place.
>> The SYSRET path being uncommon is a problem by itself imo, if
>> that's indeed the case. I'm sure I've suggested before that
>> we convert frames to TRAP_syscall ones whenever possible,
>> such that we wouldn't go the slower IRET path.
> 
> It is not possible to convert any.
> 
> The opportunistic SYSRET logic in Linux loses you performance in
> reality.  Its just that the extra conditionals are very highly predicted
> and totally dominated by the ring transition cost.
> 
> You can create a synthetic test case where the opportunistic logic makes
> a performance win, but the chances of encountering real world code where
> TRAP_syscall is clear and %r11 and %rcx match flags/rip is 2 ^ 128.
> 
> It is very much not worth the extra code and cycles taken to implement.

Oops, yes, for a moment I forgot this minor detail of %rcx/%r11.

>>>> Somewhat similar for SETSSBSY, except there things get complicated by
>>>> it raising #CP instead of setting EFLAGS.CF: Aiui it would require us
>>>> to handle #CP on an IST stack in order to avoid #DF there.
>>> Right, but having #CP as IST gives us far worse problems.
>>>
>>> Being able to spot #CP vs #DF doesn't help usefully.  Its still some
>>> arbitrary period of time after the damage was done.
>>>
>>> Any nesting of #CP (including fault on IRET out) results in losing
>>> program state and entering an infinite loop.
>>>
>>> The cases which end up as #DF are properly fatal to the system, and we
>>> at least get a clean crash out it.
>> May I suggest that all of this gets spelled out in at least
>> the description of the patch, so that it can be properly
>> understood (and, if need be, revisited) later on?
> 
> Is this really the right patch to do that?
> 
> I do eventually plan to put a whole load of this kinds of details into
> the hypervisor guide.

Well, as you can see having some of these considerations and
decisions spelled out would already have helped review here.
Whether this is exactly the right patch I'm not sure, but I'd
find it quite helpful if such was available at least for
cross referencing.

Jan



* Re: [PATCH 12/16] x86/extable: Adjust extable handling to be shadow stack compatible
  2020-05-12 14:31       ` Jan Beulich
@ 2020-05-12 16:14         ` Andrew Cooper
  2020-05-13  9:22           ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-12 16:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 12/05/2020 15:31, Jan Beulich wrote:
> On 11.05.2020 23:09, Andrew Cooper wrote:
>> On 07/05/2020 14:35, Jan Beulich wrote:
>>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/traps.c
>>>> +++ b/xen/arch/x86/traps.c
>>>> @@ -778,6 +778,28 @@ static bool exception_fixup(struct cpu_user_regs *regs, bool print)
>>>>                 vec_name(regs->entry_vector), regs->error_code,
>>>>                 _p(regs->rip), _p(regs->rip), _p(fixup));
>>>>  
>>>> +    if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
>>>> +    {
>>>> +        unsigned long ssp;
>>>> +
>>>> +        asm ("rdsspq %0" : "=r" (ssp) : "0" (1) );
>>>> +        if ( ssp != 1 )
>>>> +        {
>>>> +            unsigned long *ptr = _p(ssp);
>>>> +
>>>> +            /* Search for %rip in the shadow stack, ... */
>>>> +            while ( *ptr != regs->rip )
>>>> +                ptr++;
>>> Wouldn't it be better to bound the loop, as it shouldn't search past
>>> (strictly speaking not even to) the next page boundary? Also you
>>> don't care about the top of the stack (being the to be restored SSP),
>>> do you? I.e. maybe
>>>
>>>             while ( *++ptr != regs->rip )
>>>                 ;
>>>
>>> ?
>>>
>>> And then - isn't searching for a specific RIP value alone prone to
>>> error, in case it matches an ordinary return address? I.e.
>>> wouldn't you better search for a matching RIP accompanied by a
>>> suitable pointer into the shadow stack and a matching CS value?
>>> Otherwise, ...
>>>
>>>> +            ASSERT(ptr[1] == __HYPERVISOR_CS);
>>> ... also assert that ptr[-1] points into the shadow stack?
>> So this is the problem I was talking about that the previous contexts
>> SSP isn't stored anywhere helpful.
>>
>> What we are in practice doing is looking 2 or 3 words up the shadow
>> stack (depending on exactly how deep our call graph is), to the shadow
>> IRET frame matching the real IRET frame which regs is pointing to.
>>
>> Both IRET frames were pushed in the process of generating the exception,
>> and we've already matched regs->rip to the exception table record.  We
>> need to fix up regs->rip and the shadow lip to the fixup address.
>>
>> As we are always fixing up an exception generated from Xen context, we
>> know that ptr[1] == __HYPERVISOR_CS, and *ptr[-1] = &ptr[2], as we
>> haven't switched shadow stack as part of taking this exception. 
>> However, this second point is fragile to exception handlers moving onto IST.
>>
>> We can't encounter regs->rip in the shadow stack between the current SSP
>> and the IRET frame we're looking to fix up, or we would have taken a
>> recursive fault and not reached exception_fixup() to begin with.
> I'm afraid I don't follow here. Consider a function (also)
> involved in exception handling having this code sequence:
>
>     call    func
>     mov     (%rax), %eax
>
> If the fault we're handling occured on the MOV and
> exception_fixup() is a descendant of func(), then the first
> instance of an address on the shadow stack pointing at this
> MOV is going to be the one which did not fault.

No.  The moment `call func` returns, the address you're looking to match
is rubble, no longer on the stack.  Numerically, it will be located at
SSP-8 when the fault for the MOV is generated.

In this exact case, it would be clobbered by the shadow IRET frame, but
even if it were deeper in the call tree, we would still never encounter
it from walking up the shadow stack from SSP.

The only things you will find on the shadow stack are the shadow IRET
frame, handle_exception_saved(), do_*() and fixup_exception(), except
that I'd not like to fix the behaviour to require exactly two function
calls of depth for fixup_exception().

>> Therefore, the loop is reasonably bounded in all cases.
>>
>> Sadly, there is no RDSS instruction, so we can't actually use shadow
>> stack reads to spot if we underflowed the shadow stack, and there is no
>> useful alternative to panic() if we fail to find the shadow IRET frame.
> But afaics you don't panic() in this case. Instead you continue
> looping until (presumably) you hit some form of fault.
>
>>>> --- a/xen/arch/x86/x86_64/entry.S
>>>> +++ b/xen/arch/x86/x86_64/entry.S
>>>> @@ -708,7 +708,16 @@ exception_with_ints_disabled:
>>>>          call  search_pre_exception_table
>>>>          testq %rax,%rax                 # no fixup code for faulting EIP?
>>>>          jz    1b
>>>> -        movq  %rax,UREGS_rip(%rsp)
>>>> +        movq  %rax,UREGS_rip(%rsp)      # fixup regular stack
>>>> +
>>>> +#ifdef CONFIG_XEN_SHSTK
>>>> +        mov    $1, %edi
>>>> +        rdsspq %rdi
>>>> +        cmp    $1, %edi
>>>> +        je     .L_exn_shstk_done
>>>> +        wrssq  %rax, (%rdi)             # fixup shadow stack
>>>> +.L_exn_shstk_done:
>>>> +#endif
>>> Again avoid the conditional jump by using alternatives patching?
>> Well - that depends on whether we're likely to gain any new content in
>> the pre exception table.
>>
>> As it stands, it is only the IRET(s) to userspace so would be safe to
>> turn this into an unconditional alternative.  Even in the crash case, we
>> won't be returning to guest context after having started the crash
>> teardown path.
> Ah, right - perhaps indeed better keep it as is then.

That was my reasoning.  It is a path we never expect to execute with
well-behaved guests, so I erred on the safe side.

~Andrew



* Re: [PATCH 12/16] x86/extable: Adjust extable handling to be shadow stack compatible
  2020-05-12 16:14         ` Andrew Cooper
@ 2020-05-13  9:22           ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-13  9:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 12.05.2020 18:14, Andrew Cooper wrote:
> On 12/05/2020 15:31, Jan Beulich wrote:
>> On 11.05.2020 23:09, Andrew Cooper wrote:
>>> On 07/05/2020 14:35, Jan Beulich wrote:
>>>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>>>> --- a/xen/arch/x86/traps.c
>>>>> +++ b/xen/arch/x86/traps.c
>>>>> @@ -778,6 +778,28 @@ static bool exception_fixup(struct cpu_user_regs *regs, bool print)
>>>>>                 vec_name(regs->entry_vector), regs->error_code,
>>>>>                 _p(regs->rip), _p(regs->rip), _p(fixup));
>>>>>  
>>>>> +    if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
>>>>> +    {
>>>>> +        unsigned long ssp;
>>>>> +
>>>>> +        asm ("rdsspq %0" : "=r" (ssp) : "0" (1) );
>>>>> +        if ( ssp != 1 )
>>>>> +        {
>>>>> +            unsigned long *ptr = _p(ssp);
>>>>> +
>>>>> +            /* Search for %rip in the shadow stack, ... */
>>>>> +            while ( *ptr != regs->rip )
>>>>> +                ptr++;
>>>> Wouldn't it be better to bound the loop, as it shouldn't search past
>>>> (strictly speaking not even to) the next page boundary? Also you
>>>> don't care about the top of the stack (being the to be restored SSP),
>>>> do you? I.e. maybe
>>>>
>>>>             while ( *++ptr != regs->rip )
>>>>                 ;
>>>>
>>>> ?
>>>>
>>>> And then - isn't searching for a specific RIP value alone prone to
>>>> error, in case a it matches an ordinary return address? I.e.
>>>> wouldn't you better search for a matching RIP accompanied by a
>>>> suitable pointer into the shadow stack and a matching CS value?
>>>> Otherwise, ...
>>>>
>>>>> +            ASSERT(ptr[1] == __HYPERVISOR_CS);
>>>> ... also assert that ptr[-1] points into the shadow stack?
>>> So this is the problem I was talking about that the previous contexts
>>> SSP isn't stored anywhere helpful.
>>>
>>> What we are in practice doing is looking 2 or 3 words up the shadow
>>> stack (depending on exactly how deep our call graph is), to the shadow
>>> IRET frame matching the real IRET frame which regs is pointing to.
>>>
>>> Both IRET frames were pushed in the process of generating the exception,
>>> and we've already matched regs->rip to the exception table record.  We
>>> need to fix up regs->rip and the shadow lip to the fixup address.
>>>
>>> As we are always fixing up an exception generated from Xen context, we
>>> know that ptr[1] == __HYPERVISOR_CS, and ptr[-1] == &ptr[2], as we
>>> haven't switched shadow stack as part of taking this exception. 
>>> However, this second point is fragile to exception handlers moving onto IST.
>>>
>>> We can't encounter regs->rip in the shadow stack between the current SSP
>>> and the IRET frame we're looking to fix up, or we would have taken a
>>> recursive fault and not reached exception_fixup() to begin with.
>> I'm afraid I don't follow here. Consider a function (also)
>> involved in exception handling having this code sequence:
>>
>>     call    func
>>     mov     (%rax), %eax
>>
>> If the fault we're handling occurred on the MOV and
>> exception_fixup() is a descendant of func(), then the first
>> instance of an address on the shadow stack pointing at this
>> MOV is going to be the one which did not fault.
> 
> No.  The moment `call func` returns, the address you're looking to match
> is rubble, no longer on the stack.  Numerically, it will be located at
> SSP-8 when the fault for MOV is generated.

I think I still didn't explain the scenario sufficiently well:

In some function, say test(), we have above code sequence and
the MOV faults. The exception handling then goes
- assembly entry
- C entry
- test()
- exception_fixup()
in order of (nesting) call sequence, i.e. with _all_ respective
return addresses still on the stack. Since your lookup starts
from SSP, there'll be two active frames on the stack which both
have the same return address.

We may not actively have such a code structure right now, but
it would seem shortsighted to me to not account for the
possibility.
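
[Editorial aside: a bounded, cross-checked search of the kind suggested
earlier in the thread might look like the sketch below.  The frame layout
({ saved SSP, LIP, CS } at ascending addresses) follows the discussion above,
but `HYPERVISOR_CS` and the function name are illustrative assumptions, not
Xen's actual code.]

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Illustrative selector value only; not Xen's real __HYPERVISOR_CS. */
#define HYPERVISOR_CS 0xe008UL

/*
 * Search upwards from the current SSP for the shadow IRET frame matching
 * the faulting %rip.  The search is bounded by the page containing SSP,
 * and each candidate is cross-checked against the CS slot and the
 * saved-SSP slot, so an ordinary return address that happens to equal
 * %rip is skipped.
 */
static uint64_t *find_shstk_iret_frame(uint64_t *ssp, uint64_t rip)
{
    uint64_t *end = (uint64_t *)(((uintptr_t)ssp | (PAGE_SIZE - 1)) + 1);
    uint64_t *ptr;

    for ( ptr = ssp + 1; ptr + 2 <= end; ptr++ )
        if ( ptr[0] == rip && ptr[1] == HYPERVISOR_CS &&
             ptr[-1] == (uint64_t)&ptr[2] )
            return ptr;

    return NULL; /* Not found within the page; bail rather than walk off. */
}
```

The saved-SSP cross-check is what disambiguates the duplicate-return-address
scenario described above: a plain call-graph entry has no adjacent slot
pointing back into the shadow stack.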

Jan



* Re: [PATCH 05/16] x86/shstk: Introduce Supervisor Shadow Stack support
  2020-05-04 13:52   ` Jan Beulich
  2020-05-11 15:46     ` Andrew Cooper
@ 2020-05-15 16:21     ` Anthony PERARD
  1 sibling, 0 replies; 66+ messages in thread
From: Anthony PERARD @ 2020-05-15 16:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, Xen-devel

On Mon, May 04, 2020 at 03:52:58PM +0200, Jan Beulich wrote:
> On 02.05.2020 00:58, Andrew Cooper wrote:
> > --- a/xen/scripts/Kconfig.include
> > +++ b/xen/scripts/Kconfig.include
> > @@ -31,6 +31,10 @@ cc-option = $(success,$(CC) -Werror $(CLANG_FLAGS) $(1) -E -x c /dev/null -o /de
> >  # Return y if the linker supports <flag>, n otherwise
> >  ld-option = $(success,$(LD) -v $(1))
> >  
> > +# $(as-instr,<instr>)
> > +# Return y if the assembler supports <instr>, n otherwise
> > +as-instr = $(success,printf "%b\n" "$(1)" | $(CC) $(CLANG_FLAGS) -c -x assembler -o /dev/null -)
> 
> CLANG_FLAGS caught my eye here, then noticing that cc-option
> also uses it. Anthony - what's the deal with this? It doesn't
> look to get defined anywhere, and I also don't see what clang-
> specific about these constructs.

It's because these constructs are gcc-specific :-). Indeed CLANG_FLAGS
probably needs to be defined as I don't think providing the full
AFLAGS/CFLAGS is going to be a good idea and may change the result of
the commands.

Linux has a few clang specific flags in CLANG_FLAGS, I have found those:
    The ones for cross compilation: --prefix, --target, --gcc-toolchain
    -no-integrated-as
    -Werror=unknown-warning-option
And that's it.

So, I think we could keep using CLANG_FLAGS in Kconfig.include and
define it in Makefile with a comment saying that it's only used by
Kconfig.  It would always have -Werror=unknown-warning-option, and have
-no-integrated-as when needed; -Wunknown-warning-option has been present
since clang 3.0.0, according to Linux's commit 589834b3a009 ("kbuild: Add
-Werror=unknown-warning-option to CLANG_FLAGS").

The option -Werror=unknown-warning-option is there to make sure that the
warning is enabled; it is by default, but could be disabled in a
particular build of clang, see e8de12fb7cde ("kbuild: Check for
unknown options with cc-option usage in Kconfig and clang").

I'll write a patch with this new CLANG_FLAGS.

-- 
Anthony PERARD



* Re: [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap()
  2020-05-11 15:09       ` Jan Beulich
@ 2020-05-18 16:54         ` Andrew Cooper
  2020-05-19  8:50           ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-18 16:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 11/05/2020 16:09, Jan Beulich wrote:
> On 11.05.2020 17:01, Andrew Cooper wrote:
>> On 04/05/2020 14:08, Jan Beulich wrote:
>>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>>> For one, they render the vector in a different base.
>>>>
>>>> Introduce X86_EXC_* constants and vec_name() to refer to exceptions by their
>>>> mnemonic, which starts bringing the code/diagnostics in line with the Intel
>>>> and AMD manuals.
>>> For this "bringing in line" purpose I'd like to see whether you could
>>> live with some adjustments to how you're currently doing things:
>>> - NMI is nowhere prefixed by #, hence I think we'd better not do so
>>>   either; may require embedding the #-es in the names[] table, or not
>>>   using N() for NMI
>> No-one is going to get confused at seeing #NMI in an error message.  I
>> don't mind jugging the existing names table, but anything more
>> complicated is overkill.
>>
>>> - neither Coprocessor Segment Overrun nor vector 0x0f have a mnemonic
>>>   and hence I think we shouldn't invent one; just treat them like
>>>   other reserved vectors (of which at least vector 0x09 indeed is one
>>>   on x86-64)?
>> This I disagree with.  Coprocessor Segment Overrun *is* its name in both
>> manuals, and the avoidance of vector 0xf is clearly documented as well,
>> due to it being the default PIC Spurious Interrupt Vector.
>>
>> Neither CSO nor SPV is expected to be encountered in practice, but if
>> they are, highlighting them is a damn-sight more helpful than pretending
>> they don't exist.
> How is them occurring (and getting logged with their vector numbers)
> any different from other reserved, acronym-less vectors? I particularly
> didn't suggest to pretend they don't exist; instead I did suggest that
> they are as reserved as, say, vector 0x18. By inventing an acronym and
> logging this instead of the vector number you'll make people other than
> you have to look up what the odd acronym means iff such an exception
> ever got raised.

You snipped the bits in the patch where both the vector number and
acronym are printed together.

Anyone who doesn't know the vector has to look it up anyway, at which
point they'll find that what Xen prints out matches what both manuals
say.  OTOH, people who know what a coprocessor segment overrun or PIC
spurious vector is won't need to look it up.
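
[Editorial aside: for readers following along, a mnemonic table in the style
under discussion might look like the sketch below, using the SDM/APM names
plus the contested "CSO" and "SPV" entries for vectors 9 and 15.  The real
patch's X86_EXC_* constants and vec_name() may differ in detail.]

```c
#include <stddef.h>

/* Illustrative sketch only, not the actual table from the patch. */
static const char *vec_name(unsigned int vec)
{
    static const char *const names[] = {
        "#DE",  "#DB",  "NMI",  "#BP",  "#OF",  "#BR",  "#UD",  "#NM",
        "#DF",  "CSO",  "#TS",  "#NP",  "#SS",  "#GP",  "#PF",  "SPV",
        "#MF",  "#AC",  "#MC",  "#XM",  "#VE",  "#CP",
    };

    if ( vec < sizeof(names) / sizeof(names[0]) )
        return names[vec];

    return "???"; /* Reserved vectors get the vector number only. */
}
```

The diagnostic then prints both the vector number and this string, which is
the point Andrew makes below: nothing is lost for readers who only know the
number.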

~Andrew



* Re: [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap()
  2020-05-18 16:54         ` Andrew Cooper
@ 2020-05-19  8:50           ` Jan Beulich
  2020-05-26 15:38             ` Andrew Cooper
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Beulich @ 2020-05-19  8:50 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 18.05.2020 18:54, Andrew Cooper wrote:
> On 11/05/2020 16:09, Jan Beulich wrote:
>> On 11.05.2020 17:01, Andrew Cooper wrote:
>>> On 04/05/2020 14:08, Jan Beulich wrote:
>>>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>>>> For one, they render the vector in a different base.
>>>>>
>>>>> Introduce X86_EXC_* constants and vec_name() to refer to exceptions by their
>>>>> mnemonic, which starts bringing the code/diagnostics in line with the Intel
>>>>> and AMD manuals.
>>>> For this "bringing in line" purpose I'd like to see whether you could
>>>> live with some adjustments to how you're currently doing things:
>>>> - NMI is nowhere prefixed by #, hence I think we'd better not do so
>>>>   either; may require embedding the #-es in the names[] table, or not
>>>>   using N() for NMI
>>> No-one is going to get confused at seeing #NMI in an error message.  I
>>> don't mind juggling the existing names table, but anything more
>>> complicated is overkill.
>>>
>>>> - neither Coprocessor Segment Overrun nor vector 0x0f have a mnemonic
>>>>   and hence I think we shouldn't invent one; just treat them like
>>>>   other reserved vectors (of which at least vector 0x09 indeed is one
>>>>   on x86-64)?
>>> This I disagree with.  Coprocessor Segment Overrun *is* its name in both
>>> manuals, and the avoidance of vector 0xf is clearly documented as well,
>>> due to it being the default PIC Spurious Interrupt Vector.
>>>
>>> Neither CSO nor SPV is expected to be encountered in practice, but if
>>> they are, highlighting them is a damn-sight more helpful than pretending
>>> they don't exist.
>> How is them occurring (and getting logged with their vector numbers)
>> any different from other reserved, acronym-less vectors? I particularly
>> didn't suggest to pretend they don't exist; instead I did suggest that
>> they are as reserved as, say, vector 0x18. By inventing an acronym and
>> logging this instead of the vector number you'll make people other than
>> you have to look up what the odd acronym means iff such an exception
>> ever got raised.
> 
> You snipped the bits in the patch where both the vector number and
> acronym are printed together.
> 
> Anyone who doesn't know the vector has to look it up anyway, at which
> point they'll find that what Xen prints out matches what both manuals
> say.  OTOH, people who know what a coprocessor segment overrun or PIC
> spurious vector is won't need to look it up.

And who knows how to decipher the non-standard CPO and SPV (which are what
triggered my comments in the first place). What I continue to fail to
see is why these reserved vectors need treatment different from all
others. In addition I'm having trouble seeing how the default spurious
PIC vector matters for us - we program the PIC to vectors 0x20-0x2f,
i.e. a spurious PIC0 IRQ would show up at vector 0x27. (I notice we
still blindly assume there's a pair of PICs in the first place.)

Jan



* Re: [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap()
  2020-05-19  8:50           ` Jan Beulich
@ 2020-05-26 15:38             ` Andrew Cooper
  2020-05-27  6:54               ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-26 15:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 19/05/2020 09:50, Jan Beulich wrote:
> On 18.05.2020 18:54, Andrew Cooper wrote:
>> On 11/05/2020 16:09, Jan Beulich wrote:
>>> On 11.05.2020 17:01, Andrew Cooper wrote:
>>>> On 04/05/2020 14:08, Jan Beulich wrote:
>>>>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>>>>> For one, they render the vector in a different base.
>>>>>>
>>>>>> Introduce X86_EXC_* constants and vec_name() to refer to exceptions by their
>>>>>> mnemonic, which starts bringing the code/diagnostics in line with the Intel
>>>>>> and AMD manuals.
>>>>> For this "bringing in line" purpose I'd like to see whether you could
>>>>> live with some adjustments to how you're currently doing things:
>>>>> - NMI is nowhere prefixed by #, hence I think we'd better not do so
>>>>>   either; may require embedding the #-es in the names[] table, or not
>>>>>   using N() for NMI
>>>> No-one is going to get confused at seeing #NMI in an error message.  I
>>>> don't mind juggling the existing names table, but anything more
>>>> complicated is overkill.
>>>>
>>>>> - neither Coprocessor Segment Overrun nor vector 0x0f have a mnemonic
>>>>>   and hence I think we shouldn't invent one; just treat them like
>>>>>   other reserved vectors (of which at least vector 0x09 indeed is one
>>>>>   on x86-64)?
>>>> This I disagree with.  Coprocessor Segment Overrun *is* its name in both
>>>> manuals, and the avoidance of vector 0xf is clearly documented as well,
>>>> due to it being the default PIC Spurious Interrupt Vector.
>>>>
>>>> Neither CSO nor SPV is expected to be encountered in practice, but if
>>>> they are, highlighting them is a damn-sight more helpful than pretending
>>>> they don't exist.
>>> How is them occurring (and getting logged with their vector numbers)
>>> any different from other reserved, acronym-less vectors? I particularly
>>> didn't suggest to pretend they don't exist; instead I did suggest that
>>> they are as reserved as, say, vector 0x18. By inventing an acronym and
>>> logging this instead of the vector number you'll make people other than
>>> you have to look up what the odd acronym means iff such an exception
>>> ever got raised.
>> You snipped the bits in the patch where both the vector number and
>> acronym are printed together.
>>
>> Anyone who doesn't know the vector has to look it up anyway, at which
>> point they'll find that what Xen prints out matches what both manuals
>> say.  OTOH, people who know what a coprocessor segment overrun or PIC
>> spurious vector is won't need to look it up.
>> And who knows how to decipher the non-standard CPO and SPV (which are what
> triggered my comments in the first place).

CSO, and no.

Anyone who doesn't know the text still has the vector number to work
with, and still needs to look it up.

At which point they will observe that the text is appropriate in context.

> What I continue to fail to
> see is why these reserved vectors need treatment different from all
> others.

Because it has nothing to do with reserved-ness.

It is about providing clarifying information (for all vectors which
currently have, or have ever had, meaning) for mere mortals who can't
(or rather, don't want to) debug crashes based on raw numbers alone.

> In addition I'm having trouble seeing how the default spurious
> PIC vector matters for us - we program the PIC to vectors 0x20-0x2f,
> i.e. a spurious PIC0 IRQ would show up at vector 0x27. (I notice we
> still blindly assume there's a pair of PICs in the first place.)

That's not relevant.  What is relevant is the actions taken when we see
vector 15 being raised.

Hitting CSO means that the legacy #FERR_FREEZE external signal has been
wired up (and it is very SMP-unsafe, hence why it was phased out with
the introduction of integrated x87s).

Hitting SPV means that the PIC wasn't reprogrammed and something wonky
is going on with one of the input pins.

Both of these are strictly more helpful in a log than "something went
wrong - figure it out yourself", and both indicate that something is
very wrong with the system.

~Andrew



* Re: [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent
  2020-05-12 13:05       ` Jan Beulich
@ 2020-05-26 18:06         ` Andrew Cooper
  2020-05-27  7:01           ` Jan Beulich
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Cooper @ 2020-05-26 18:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 12/05/2020 14:05, Jan Beulich wrote:
> On 11.05.2020 17:14, Andrew Cooper wrote:
>> On 04/05/2020 14:20, Jan Beulich wrote:
>>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/traps.c
>>>> +++ b/xen/arch/x86/traps.c
>>>> @@ -774,10 +774,27 @@ static void do_reserved_trap(struct cpu_user_regs *regs)
>>>>            trapnr, vec_name(trapnr), regs->error_code);
>>>>  }
>>>>  
>>>> +static bool exception_fixup(struct cpu_user_regs *regs, bool print)
>>>> +{
>>>> +    unsigned long fixup = search_exception_table(regs);
>>>> +
>>>> +    if ( unlikely(fixup == 0) )
>>>> +        return false;
>>>> +
>>>> +    /* Can currently be triggered by guests.  Make sure we ratelimit. */
>>>> +    if ( IS_ENABLED(CONFIG_DEBUG) && print )
>>> I didn't think we consider dprintk()-s a possible security issue.
>>> Why would we consider so a printk() hidden behind
>>> IS_ENABLED(CONFIG_DEBUG)? IOW I think one of XENLOG_GUEST and
>>> IS_ENABLED(CONFIG_DEBUG) wants dropping.
>> Who said anything about a security issue?
> The need to rate limit is (among other aspects) to prevent a
> (logspam) security issue, isn't it?

Rate limiting (from a security aspect) is a stopgap solution to relieve
incidental pressure on the various global spinlocks involved.

It specifically does not prevent a guest from trivially filling the
console ring with junk, or for that junk to be written to
/var/log/xen/hypervisor.log at an alarming rate, both of which are
issues in production setups, but not security issues.

Technical solutions to these problems do exist, such as deleting the
offending printk(), or maintaining per-guest console rings, but both
come with downsides in terms of usability, which similarly impacts
production setups.


What ratelimiting even in debug builds gets you is that a quick spate of
printk()s (e.g. any new sshd connection on an AMD system where the
MSR_VIRT_SPEC_CTRL patch is still uncommitted, and the default WRMSR
behaviour breaking wrmsr_safe() logic in Linux) doesn't waste an
unreasonable quantity of space in the console ring.
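
[Editorial aside: a minimal sketch of the behaviour being discussed, not
Xen's actual implementation.  The struct and function names are made up for
illustration: permit a quick burst of messages, then suppress output until
the current window expires, so a guest hammering a fixed-up fault path
cannot flood the console ring.]

```c
#include <stdbool.h>
#include <stdint.h>

struct ratelimit {
    uint64_t window_start;  /* Timestamp of the current window. */
    uint64_t window_len;    /* Window length, same units as 'now'. */
    unsigned int burst;     /* Messages permitted per window. */
    unsigned int count;     /* Messages seen this window. */
};

/* Returns true if a message may be emitted at time 'now'. */
static bool ratelimit_ok(struct ratelimit *rl, uint64_t now)
{
    if ( now - rl->window_start >= rl->window_len )
    {
        rl->window_start = now;
        rl->count = 0;
    }

    return rl->count++ < rl->burst;
}
```

The trade-off Andrew describes is visible here: the gate caps console-ring
pressure but does nothing to stop a determined guest emitting `burst`
messages every window, which is why it is a stopgap rather than a security
boundary.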

~Andrew



* Re: [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap()
  2020-05-26 15:38             ` Andrew Cooper
@ 2020-05-27  6:54               ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-27  6:54 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 26.05.2020 17:38, Andrew Cooper wrote:
> On 19/05/2020 09:50, Jan Beulich wrote:
>> On 18.05.2020 18:54, Andrew Cooper wrote:
>>> On 11/05/2020 16:09, Jan Beulich wrote:
>>>> On 11.05.2020 17:01, Andrew Cooper wrote:
>>>>> On 04/05/2020 14:08, Jan Beulich wrote:
>>>>>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>>>>>> For one, they render the vector in a different base.
>>>>>>>
>>>>>>> Introduce X86_EXC_* constants and vec_name() to refer to exceptions by their
>>>>>>> mnemonic, which starts bringing the code/diagnostics in line with the Intel
>>>>>>> and AMD manuals.
>>>>>> For this "bringing in line" purpose I'd like to see whether you could
>>>>>> live with some adjustments to how you're currently doing things:
>>>>>> - NMI is nowhere prefixed by #, hence I think we'd better not do so
>>>>>>   either; may require embedding the #-es in the names[] table, or not
>>>>>>   using N() for NMI
>>>>> No-one is going to get confused at seeing #NMI in an error message.  I
>>>>> don't mind jugging the existing names table, but anything more
>>>>> complicated is overkill.
>>>>>
>>>>>> - neither Coprocessor Segment Overrun nor vector 0x0f have a mnemonic
>>>>>>   and hence I think we shouldn't invent one; just treat them like
>>>>>>   other reserved vectors (of which at least vector 0x09 indeed is one
>>>>>>   on x86-64)?
>>>>> This I disagree with.  Coprocessor Segment Overrun *is* its name in both
>>>>> manuals, and the avoidance of vector 0xf is clearly documented as well,
>>>>> due to it being the default PIC Spurious Interrupt Vector.
>>>>>
>>>>> Neither CSO nor SPV is expected to be encountered in practice, but if
>>>>> they are, highlighting them is a damn-sight more helpful than pretending
>>>>> they don't exist.
>>>> How is them occurring (and getting logged with their vector numbers)
>>>> any different from other reserved, acronym-less vectors? I particularly
>>>> didn't suggest to pretend they don't exist; instead I did suggest that
>>>> they are as reserved as, say, vector 0x18. By inventing an acronym and
>>>> logging this instead of the vector number you'll make people other than
>>>> you have to look up what the odd acronym means iff such an exception
>>>> ever got raised.
>>> You snipped the bits in the patch where both the vector number and
>>> acronym are printed together.
>>>
>>> Anyone who doesn't know the vector has to look it up anyway, at which
>>> point they'll find that what Xen prints out matches what both manuals
>>> say.  OTOH, people who know what a coprocessor segment overrun or PIC
>>> spurious vector is won't need to look it up.
>> And who knows how to decipher the non-standard CPO and SPV (which are what
>> triggered my comments in the first place).
> 
> CSO, and no.
> 
> Anyone who doesn't know the text still has the vector number to work
> with, and still needs to look it up.
> 
> At which point they will observe that the text is appropriate in context.
> 
>> What I continue to fail to
>> see is why these reserved vectors need treatment different from all
>> others.
> 
> Because it has nothing to do with reserved-ness.

How does it not? The SDM page, among historic information, specifically
says "Intel reserved". Seeing more exception vectors getting used after
many years of "silence" in this area, I'm pretty sure if they ran out
of vectors they'd re-use this one. Vector 15 doesn't even have a page,
which puts it even more in the same group as other reserved ones.

> It is about providing clarifying information (for all vectors which
> currently have, or have ever had, meaning) for mere mortals who can't
> (or rather, don't want to) debug crashes based on raw numbers alone.
> 
>> In addition I'm having trouble seeing how the default spurious
>> PIC vector matters for us - we program the PIC to vectors 0x20-0x2f,
>> i.e. a spurious PIC0 IRQ would show up at vector 0x27. (I notice we
>> still blindly assume there's a pair of PICs in the first place.)
> 
> That's not relevant.  What is relevant is the actions taken when we see
> vector 15 being raised.
> 
> Hitting CSO means that the legacy #FERR_FREEZE external signal has been
> wired up (and it is very SMP-unsafe, hence why it was phased out with
> the introduction of integrated x87s).

What does FERR have to do with this vector? This exception is a stand-
in for #GP (and maybe #PF) on the 386/387 pair.

> Hitting SPV means that the PIC wasn't reprogrammed and something wonky
> is going on with one of the input pins.

If the PIC was neither re-programmed nor properly masked, we're in
bigger trouble, I'm afraid.

> Both of these are strictly more helpful in a log than "something went
> wrong - figure it out yourself", and both indicate that something is
> very wrong with the system.

So what do we do? We can't seem to be able to reach agreement here,
because our views are different and neither of us can convince the
other. Looking back at my initial reply, hesitantly
Acked-by: Jan Beulich <jbeulich@suse.com>
then.

Jan



* Re: [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent
  2020-05-26 18:06         ` Andrew Cooper
@ 2020-05-27  7:01           ` Jan Beulich
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Beulich @ 2020-05-27  7:01 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Wei Liu, Roger Pau Monné

On 26.05.2020 20:06, Andrew Cooper wrote:
> On 12/05/2020 14:05, Jan Beulich wrote:
>> On 11.05.2020 17:14, Andrew Cooper wrote:
>>> On 04/05/2020 14:20, Jan Beulich wrote:
>>>> On 02.05.2020 00:58, Andrew Cooper wrote:
>>>>> --- a/xen/arch/x86/traps.c
>>>>> +++ b/xen/arch/x86/traps.c
>>>>> @@ -774,10 +774,27 @@ static void do_reserved_trap(struct cpu_user_regs *regs)
>>>>>            trapnr, vec_name(trapnr), regs->error_code);
>>>>>  }
>>>>>  
>>>>> +static bool exception_fixup(struct cpu_user_regs *regs, bool print)
>>>>> +{
>>>>> +    unsigned long fixup = search_exception_table(regs);
>>>>> +
>>>>> +    if ( unlikely(fixup == 0) )
>>>>> +        return false;
>>>>> +
>>>>> +    /* Can currently be triggered by guests.  Make sure we ratelimit. */
>>>>> +    if ( IS_ENABLED(CONFIG_DEBUG) && print )
>>>> I didn't think we consider dprintk()-s a possible security issue.
>>>> Why would we consider so a printk() hidden behind
>>>> IS_ENABLED(CONFIG_DEBUG)? IOW I think one of XENLOG_GUEST and
>>>> IS_ENABLED(CONFIG_DEBUG) wants dropping.
>>> Who said anything about a security issue?
>> The need to rate limit is (among other aspects) to prevent a
>> (logspam) security issue, isn't it?
> 
> Rate limiting (from a security aspect) is a stopgap solution to relieve
> incidental pressure on the various global spinlocks involved.
> 
> It specifically does not prevent a guest from trivially filling the
> console ring with junk, or for that junk to be written to
> /var/log/xen/hypervisor.log at an alarming rate, both of which are
> issues in production setups, but not security issues.

IOW you assert that e.g. XSA-141 should not have been issued?

> Technical solutions to these problems do exist, such as deleting the
> offending printk(), or maintaining per-guest console rings, but both
> come with downsides in terms of usability, which similarly impacts
> production setups.
> 
> 
> What ratelimiting even in debug builds gets you is a quick spate of
> printks() (e.g. any new sshd connection on an AMD system where the
> MSR_VIRT_SPEC_CTRL patch is still uncommitted, and the default WRMSR
> behaviour breaking wrmsr_safe() logic in Linux) not wasting an
> unreasonable quantity of space in the console ring.

Hmm, okay, I can accept this perspective. Since the other comment I
gave has been taken care of by re-arrangements in a separate patch:
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan



end of thread, other threads:[~2020-05-27  7:02 UTC | newest]

Thread overview: 66+ messages
2020-05-01 22:58 [PATCH 00/16] x86: Support for CET Supervisor Shadow Stacks Andrew Cooper
2020-05-01 22:58 ` [PATCH 01/16] x86/traps: Drop last_extable_addr Andrew Cooper
2020-05-04 12:44   ` Jan Beulich
2020-05-11 14:53     ` Andrew Cooper
2020-05-11 15:00       ` Jan Beulich
2020-05-01 22:58 ` [PATCH 02/16] x86/traps: Clean up printing in do_reserved_trap()/fatal_trap() Andrew Cooper
2020-05-04 13:08   ` Jan Beulich
2020-05-11 15:01     ` Andrew Cooper
2020-05-11 15:09       ` Jan Beulich
2020-05-18 16:54         ` Andrew Cooper
2020-05-19  8:50           ` Jan Beulich
2020-05-26 15:38             ` Andrew Cooper
2020-05-27  6:54               ` Jan Beulich
2020-05-01 22:58 ` [PATCH 03/16] x86/traps: Factor out exception_fixup() and make printing consistent Andrew Cooper
2020-05-04 13:20   ` Jan Beulich
2020-05-11 15:14     ` Andrew Cooper
2020-05-12 13:05       ` Jan Beulich
2020-05-26 18:06         ` Andrew Cooper
2020-05-27  7:01           ` Jan Beulich
2020-05-01 22:58 ` [PATCH 04/16] x86/smpboot: Write the top-of-stack block in cpu_smpboot_alloc() Andrew Cooper
2020-05-04 13:22   ` Jan Beulich
2020-05-01 22:58 ` [PATCH 05/16] x86/shstk: Introduce Supervisor Shadow Stack support Andrew Cooper
2020-05-04 13:52   ` Jan Beulich
2020-05-11 15:46     ` Andrew Cooper
2020-05-12 13:54       ` Jan Beulich
2020-05-15 16:21     ` Anthony PERARD
2020-05-01 22:58 ` [PATCH 06/16] x86/traps: Implement #CP handler and extend #PF for shadow stacks Andrew Cooper
2020-05-04 14:10   ` Jan Beulich
2020-05-11 17:20     ` Andrew Cooper
2020-05-12 13:58       ` Jan Beulich
2020-05-01 22:58 ` [PATCH 07/16] x86/shstk: Re-layout the stack block " Andrew Cooper
2020-05-04 14:24   ` Jan Beulich
2020-05-11 17:48     ` Andrew Cooper
2020-05-12 14:07       ` Jan Beulich
2020-05-01 22:58 ` [PATCH 08/16] x86/shstk: Create " Andrew Cooper
2020-05-04 14:55   ` Jan Beulich
2020-05-04 15:08     ` Andrew Cooper
2020-05-01 22:58 ` [PATCH 09/16] x86/cpu: Adjust enable_nmis() to be shadow stack compatible Andrew Cooper
2020-05-05 14:48   ` Jan Beulich
2020-05-11 18:48     ` Andrew Cooper
2020-05-01 22:58 ` [PATCH 10/16] x86/cpu: Adjust reset_stack_and_jump() " Andrew Cooper
2020-05-07 13:17   ` Jan Beulich
2020-05-11 20:07     ` Andrew Cooper
2020-05-01 22:58 ` [PATCH 11/16] x86/spec-ctrl: Adjust DO_OVERWRITE_RSB " Andrew Cooper
2020-05-07 13:22   ` Jan Beulich
2020-05-07 13:25     ` Andrew Cooper
2020-05-07 13:38       ` Jan Beulich
2020-05-01 22:58 ` [PATCH 12/16] x86/extable: Adjust extable handling " Andrew Cooper
2020-05-07 13:35   ` Jan Beulich
2020-05-11 21:09     ` Andrew Cooper
2020-05-12 14:31       ` Jan Beulich
2020-05-12 16:14         ` Andrew Cooper
2020-05-13  9:22           ` Jan Beulich
2020-05-01 22:58 ` [PATCH 13/16] x86/ioemul: Rewrite stub generation " Andrew Cooper
2020-05-07 13:46   ` Jan Beulich
2020-05-01 22:58 ` [PATCH 14/16] x86/alt: Adjust _alternative_instructions() to not create shadow stacks Andrew Cooper
2020-05-07 13:49   ` Jan Beulich
2020-05-01 22:58 ` [PATCH 15/16] x86/entry: Adjust guest paths to be shadow stack compatible Andrew Cooper
2020-05-07 14:12   ` Jan Beulich
2020-05-07 15:50     ` Andrew Cooper
2020-05-07 16:15       ` Jan Beulich
2020-05-11 21:45         ` Andrew Cooper
2020-05-12 14:56           ` Jan Beulich
2020-05-01 22:58 ` [PATCH 16/16] x86/shstk: Activate Supervisor Shadow Stacks Andrew Cooper
2020-05-07 14:54   ` Jan Beulich
2020-05-11 23:46     ` Andrew Cooper
