* [PATCH 0 of 4 RFC] Kexec path alteration
@ 2012-12-04 18:16 Andrew Cooper
  2012-12-04 18:16 ` [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path Andrew Cooper
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Andrew Cooper @ 2012-12-04 18:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

I discovered a triple fault bug in the crash kernel while working on the
wider reentrant NMI/MCE issues discussed over the past few weeks on the
list.

As this is quite a substantial change, I present it here as an RFC ahead
of the longer series for reentrant behaviour (which I am still working
on).

I present patches 2 through 4 simply as debugging aids for the bug fixed
in patch 1.  I am primarily concerned about RFC feedback on patch 1,
although comments on patches 2 and 3 will not go amiss, as they are
expected to be in the longer reentrant series.

Patch 4 is a debugkey '1' which deliberately creates reentrant NMIs and
stack corruption.  It is not intended to be committed to the codebase.

I am at a loss to explain why the triple fault is actually occurring; I
failed to get anything from early printk in the crash kernel, but did
not attempt to debug the extra gubbins which kexec puts in the blob.
The fault, however, is completely deterministic, and can be reproduced
by commenting out the call to enable_nmis() in machine_kexec().

My setup is xen-unstable, with 2.6.32-classic-based dom0 and kdump
kernels, and kexec-tools 2.0.2.

~Andrew

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
  2012-12-04 18:16 [PATCH 0 of 4 RFC] Kexec path alteration Andrew Cooper
@ 2012-12-04 18:16 ` Andrew Cooper
  2012-12-04 18:25   ` Andrew Cooper
                     ` (2 more replies)
  2012-12-04 18:16 ` [PATCH 2 of 4 RFC] xen/panic: Introduce panic_in_progress Andrew Cooper
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 13+ messages in thread
From: Andrew Cooper @ 2012-12-04 18:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

Experimentally, certain crash kernels will triple fault very early if
started with NMIs disabled.  This was discovered when
experimenting with a debug keyhandler which deliberately created a
reentrant NMI, causing stack corruption.

Because of this bug, and because future changes to the NMI
handling will make the kexec path more fragile, take the time now to
bullet-proof the kexec behaviour to be safe in all circumstances.

This patch adds three new low level routines:
 * trap_nop
     This is a no-op handler which irets immediately
 * nmi_crash
     This is a special NMI handler for use during a kexec crash
 * enable_nmis
     This function enables NMIs by executing an iret-to-self

As a result, the new behaviour of the kexec_crash path is:

nmi_shootdown_cpus() will:

 * Disable interrupt stack tables.
    Disabling ISTs will prevent stack corruption from future NMIs and
    MCEs. As Xen is going to crash, we are not concerned with the
    security race condition with 'sysret'; any guest which managed to
    benefit from the race condition will only be able to execute a handful
    of instructions before it will be killed anyway, and a guest is
    still unable to trigger the race condition in the first place.

 * Swap the NMI trap handlers.
    The crashing pcpu gets the nop handler, to prevent it from getting
    stuck in an NMI context and causing a hang instead of a crash.  The
    non-crashing pcpus all get the nmi_crash handler, which is designed
    never to return.

do_nmi_crash() will:

 * Save the crash notes and shut the pcpu down.
    There is now an extra per-cpu variable to prevent us from executing
    this multiple times.  If we re-enter midway through, attempt the
    whole operation again rather than leave it incomplete.

 * Set up another NMI at the LAPIC.
    Even when the LAPIC has been disabled, the ID and command registers
    are still usable.  As a result, we can deliberately queue up a new
    NMI to re-interrupt us later if NMIs get unlatched.  Because of the
    call to __stop_this_cpu(), we have to hand-craft self_nmi() to be
    safe from General Protection Faults.

 * Fall into an infinite loop.

machine_kexec() will:

  * Swap the MCE handlers to be nops.
     We cannot prevent MCEs from being delivered when we pass off to the
     crash kernel, and the less Xen context is touched, the better.

  * Explicitly enable NMIs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/crash.c
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -32,41 +32,97 @@
 
 static atomic_t waiting_for_crash_ipi;
 static unsigned int crashing_cpu;
+static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
 
-static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
+/* This becomes the NMI handler for non-crashing CPUs. */
+void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
 {
-    /* Don't do anything if this handler is invoked on crashing cpu.
-     * Otherwise, system will completely hang. Crashing cpu can get
-     * an NMI if system was initially booted with nmi_watchdog parameter.
+    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
+    ASSERT( crashing_cpu != smp_processor_id() );
+
+    /* Save crash information and shut down CPU.  Attempt only once. */
+    if ( ! this_cpu(crash_save_done) )
+    {
+        kexec_crash_save_cpu();
+        __stop_this_cpu();
+
+        this_cpu(crash_save_done) = 1;
+    }
+
+    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
+     * back to its boot state, so we are unable to rely on the regular
+     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
+     * (The likely scenario is that we have reverted from x2apic mode to
+     * xapic, at which point #GPFs will occur if we use the apic_*
+     * functions)
+     *
+     * The ICR and APIC ID of the LAPIC are still valid even during
+     * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
+     * queue up another NMI, to force us back into this loop if we exit.
      */
-    if ( cpu == crashing_cpu )
-        return 1;
-    local_irq_disable();
+    switch ( current_local_apic_mode() )
+    {
+        u32 apic_id;
 
-    kexec_crash_save_cpu();
+    case APIC_MODE_X2APIC:
+        apic_id = apic_rdmsr(APIC_ID);
 
-    __stop_this_cpu();
+        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
+        break;
 
-    atomic_dec(&waiting_for_crash_ipi);
+    case APIC_MODE_XAPIC:
+        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
+
+        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
+            cpu_relax();
+
+        apic_mem_write(APIC_ICR2, apic_id << 24);
+        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
+        break;
+
+    default:
+        break;
+    }
 
     for ( ; ; )
         halt();
-
-    return 1;
 }
 
 static void nmi_shootdown_cpus(void)
 {
     unsigned long msecs;
+    int i;
+    int cpu = smp_processor_id();
 
     local_irq_disable();
 
-    crashing_cpu = smp_processor_id();
+    crashing_cpu = cpu;
     local_irq_count(crashing_cpu) = 0;
 
     atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
-    /* Would it be better to replace the trap vector here? */
-    set_nmi_callback(crash_nmi_callback);
+
+    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash, which
+     * invokes do_nmi_crash() (above), causing them to write state and
+     * fall into a loop.  The crashing pcpu gets the nop handler to
+     * cause it to return to this function ASAP.
+     *
+     * Furthermore, disable stack tables for NMIs and MCEs to prevent
+     * race conditions resulting in corrupt stack frames.  As Xen is
+     * about to die, we no longer care about the security-related race
+     * condition with 'sysret' which ISTs are designed to solve.
+     */
+    for ( i = 0; i < nr_cpu_ids; ++i )
+        if ( idt_tables[i] )
+        {
+            idt_tables[i][TRAP_nmi].a           &= ~(7UL << 32);
+            idt_tables[i][TRAP_machine_check].a &= ~(7UL << 32);
+
+            if ( i == cpu )
+                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
+            else
+                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
+        }
+
     /* Ensure the new callback function is set before sending out the NMI. */
     wmb();
 
diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/machine_kexec.c
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -87,6 +87,17 @@ void machine_kexec(xen_kexec_image_t *im
      */
     local_irq_disable();
 
+    /* Now that regular interrupts are disabled, we need to reduce the impact
+     * of interrupts not disabled by 'cli'.  NMIs have already been
+     * dealt with by machine_crash_shutdown().
+     *
+     * Set all pcpu MCE handlers to be noops. */
+    set_intr_gate(TRAP_machine_check, &trap_nop);
+
+    /* Explicitly enable NMIs on this CPU.  Some crashdump kernels do
+     * not like running with NMIs disabled. */
+    enable_nmis();
+
     /*
      * compat_machine_kexec() returns to idle pagetables, which requires us
      * to be running on a static GDT mapping (idle pagetables have no GDT
diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/x86_64/entry.S
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -635,11 +635,42 @@ ENTRY(nmi)
         movl  $TRAP_nmi,4(%rsp)
         jmp   handle_ist_exception
 
+ENTRY(nmi_crash)
+        cli
+        pushq $0
+        movl $TRAP_nmi,4(%rsp)
+        SAVE_ALL
+        movq %rsp,%rdi
+        callq do_nmi_crash /* Does not return */
+        ud2
+
 ENTRY(machine_check)
         pushq $0
         movl  $TRAP_machine_check,4(%rsp)
         jmp   handle_ist_exception
 
+/* No op trap handler.  Required for kexec path. */
+ENTRY(trap_nop)
+        iretq
+
+/* Enable NMIs.  No special register assumptions, and all preserved. */
+ENTRY(enable_nmis)
+        push %rax
+        movq %rsp, %rax /* Grab RSP before pushing */
+
+        /* Set up stack frame */
+        pushq $0               /* SS */
+        pushq %rax             /* RSP */
+        pushfq                 /* RFLAGS */
+        pushq $__HYPERVISOR_CS /* CS */
+        leaq  1f(%rip),%rax
+        pushq %rax             /* RIP */
+
+        iretq /* Disable the hardware NMI latch */
+1:
+        popq %rax
+        retq
+
 .section .rodata, "a", @progbits
 
 ENTRY(exception_table)
diff -r 1c69c938f641 -r b15d3ae525af xen/include/asm-x86/processor.h
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -517,6 +517,7 @@ void do_ ## _name(struct cpu_user_regs *
 DECLARE_TRAP_HANDLER(divide_error);
 DECLARE_TRAP_HANDLER(debug);
 DECLARE_TRAP_HANDLER(nmi);
+DECLARE_TRAP_HANDLER(nmi_crash);
 DECLARE_TRAP_HANDLER(int3);
 DECLARE_TRAP_HANDLER(overflow);
 DECLARE_TRAP_HANDLER(bounds);
@@ -535,6 +536,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
 DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
 #undef DECLARE_TRAP_HANDLER
 
+void trap_nop(void);
+void enable_nmis(void);
+
 void syscall_enter(void);
 void sysenter_entry(void);
 void sysenter_eflags_saved(void);


* [PATCH 2 of 4 RFC] xen/panic: Introduce panic_in_progress
  2012-12-04 18:16 [PATCH 0 of 4 RFC] Kexec path alteration Andrew Cooper
  2012-12-04 18:16 ` [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path Andrew Cooper
@ 2012-12-04 18:16 ` Andrew Cooper
  2012-12-04 18:16 ` [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler Andrew Cooper
  2012-12-04 18:16 ` [PATCH 4 of 4 RFC] xen/nmi: DO NOT APPLY: debugkey to deliberatly invoke a reentrant NMI Andrew Cooper
  3 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2012-12-04 18:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

The panic() function will re-enable NMIs.  In addition, because of its
private spinlock, panic() is not safe to re-enter on the same pcpu from
an NMI or MCE.

We introduce a panic_in_progress flag and an is_panic_in_progress()
helper, to be used in subsequent patches.  (This also fixes a
whitespace error.)

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

diff -r b15d3ae525af -r 48a60a407e15 xen/drivers/char/console.c
--- a/xen/drivers/char/console.c
+++ b/xen/drivers/char/console.c
@@ -65,6 +65,8 @@ static int __read_mostly sercon_handle =
 
 static DEFINE_SPINLOCK(console_lock);
 
+static atomic_t panic_in_progress = ATOMIC_INIT(0);
+
 /*
  * To control the amount of printing, thresholds are added.
  * These thresholds correspond to the XENLOG logging levels.
@@ -980,13 +982,22 @@ static int __init debugtrace_init(void)
  * **************************************************************
  */
 
+/* Is a panic() in progress? Used to prevent reentrancy of panic() from
+ * NMIs/MCEs, as there is potential to deadlock from those contexts. */
+long is_panic_in_progress(void)
+{
+    return atomic_read(&panic_in_progress);
+}
+
 void panic(const char *fmt, ...)
 {
     va_list args;
     unsigned long flags;
     static DEFINE_SPINLOCK(lock);
     static char buf[128];
-    
+
+    atomic_set(&panic_in_progress, 1);
+
     debugtrace_dump();
 
     /* Protects buf[] and ensure multi-line message prints atomically. */
diff -r b15d3ae525af -r 48a60a407e15 xen/include/xen/lib.h
--- a/xen/include/xen/lib.h
+++ b/xen/include/xen/lib.h
@@ -78,6 +78,7 @@ extern void debugtrace_printk(const char
 #define _p(_x) ((void *)(unsigned long)(_x))
 extern void printk(const char *format, ...)
     __attribute__ ((format (printf, 1, 2)));
+extern long is_panic_in_progress(void);
 extern void panic(const char *format, ...)
     __attribute__ ((format (printf, 1, 2)));
 extern long vm_assist(struct domain *, unsigned int, unsigned int);


* [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler
  2012-12-04 18:16 [PATCH 0 of 4 RFC] Kexec path alteration Andrew Cooper
  2012-12-04 18:16 ` [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path Andrew Cooper
  2012-12-04 18:16 ` [PATCH 2 of 4 RFC] xen/panic: Introduce panic_in_progress Andrew Cooper
@ 2012-12-04 18:16 ` Andrew Cooper
  2012-12-05  9:21   ` Jan Beulich
  2012-12-06 11:39   ` Tim Deegan
  2012-12-04 18:16 ` [PATCH 4 of 4 RFC] xen/nmi: DO NOT APPLY: debugkey to deliberatly invoke a reentrant NMI Andrew Cooper
  3 siblings, 2 replies; 13+ messages in thread
From: Andrew Cooper @ 2012-12-04 18:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

The (old) function do_nmi() is not reentrant-safe.  Rename it to
_do_nmi() and introduce a new do_nmi() which adds reentrancy guards.

If a reentrant NMI has been detected, then it is highly likely that the
outer NMI exception frame has been corrupted, meaning we cannot return
to the original context.  In this case, we panic() in an obvious manner
rather than fall into an infinite loop.

panic(), however, is not safe to re-enter from NMI context, as an NMI
(or MCE) can interrupt it inside its critical section, at which point a
new call to panic() will deadlock.  As a result, we bail out early if a
panic() is already in progress, as Xen is about to die anyway.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

--
I am fairly sure this is safe with the current kexec_crash functionality,
which involves holding all non-crashing pcpus in an NMI loop.  In the
case of reentrant NMIs and panic_in_progress, we will repeatedly bail
out early in an infinite loop of NMIs, which has the same intended
effect of simply keeping all non-crashing CPUs out of the way while the
main crash occurs.

diff -r 48a60a407e15 -r f6ad86b61d5a xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -88,6 +88,7 @@ static char __read_mostly opt_nmi[10] = 
 string_param("nmi", opt_nmi);
 
 DEFINE_PER_CPU(u64, efer);
+static DEFINE_PER_CPU(bool_t, nmi_in_progress) = 0;
 
 DEFINE_PER_CPU_READ_MOSTLY(u32, ler_msr);
 
@@ -3182,7 +3183,8 @@ static int dummy_nmi_callback(struct cpu
  
 static nmi_callback_t nmi_callback = dummy_nmi_callback;
 
-void do_nmi(struct cpu_user_regs *regs)
+/* This function should never be called directly.  Use do_nmi() instead. */
+static void _do_nmi(struct cpu_user_regs *regs)
 {
     unsigned int cpu = smp_processor_id();
     unsigned char reason;
@@ -3208,6 +3210,44 @@ void do_nmi(struct cpu_user_regs *regs)
     }
 }
 
+/* This function is NOT SAFE to call from C code in general.
+ * Use with extreme care! */
+void do_nmi(struct cpu_user_regs *regs)
+{
+    bool_t * in_progress = &this_cpu(nmi_in_progress);
+
+    if ( is_panic_in_progress() )
+    {
+        /* A panic is already in progress.  It may have reenabled NMIs,
+     * or we are simply unlucky enough to receive one right now.  Either
+         * way, bail early, as Xen is about to die.
+         *
+         * TODO: Ideally we should exit without executing an iret, to
+         * leave NMIs disabled, but that option is not currently
+         * available to us.
+         */
+        return;
+    }
+
+    if ( test_and_set_bool(*in_progress) )
+    {
+        /* Crash in an obvious manner, as opposed to falling into an
+         * infinite loop because our exception frame corrupted the
+         * exception frame of the previous NMI.
+         *
+         * TODO: This check does not cover all possible cases of corrupt
+         * exception frames, but it is substantially better than
+         * nothing.
+         */
+        console_force_unlock();
+        show_execution_state(regs);
+        panic("Reentrant NMI detected\n");
+    }
+
+    _do_nmi(regs);
+    *in_progress = 0;
+}
+
 void set_nmi_callback(nmi_callback_t callback)
 {
     nmi_callback = callback;


* [PATCH 4 of 4 RFC] xen/nmi: DO NOT APPLY: debugkey to deliberatly invoke a reentrant NMI
  2012-12-04 18:16 [PATCH 0 of 4 RFC] Kexec path alteration Andrew Cooper
                   ` (2 preceding siblings ...)
  2012-12-04 18:16 ` [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler Andrew Cooper
@ 2012-12-04 18:16 ` Andrew Cooper
  3 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2012-12-04 18:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

FOR TESTING PURPOSES ONLY.  Use debug key '1'.

diff -r f6ad86b61d5a -r 303d94fa720c xen/arch/x86/nmi.c
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -39,6 +39,7 @@ static unsigned int nmi_perfctr_msr;	/* 
 static unsigned int nmi_p4_cccr_val;
 static DEFINE_PER_CPU(struct timer, nmi_timer);
 static DEFINE_PER_CPU(unsigned int, nmi_timer_ticks);
+static DEFINE_PER_CPU(unsigned int, reent_state);
 
 /* opt_watchdog: If true, run a watchdog NMI on each processor. */
 bool_t __initdata opt_watchdog = 0;
@@ -532,10 +533,106 @@ static struct keyhandler nmi_stats_keyha
     .desc = "NMI statistics"
 };
 
+void do_nmi_reent_fault(struct cpu_user_regs * regs)
+{
+    printk("In fault handler from NMI - iret from this is bad\n");
+    mdelay(10);
+}
+
+void do_nmi_reentrant_handler(void)
+{
+    volatile unsigned int * state_ptr = &(this_cpu(reent_state));
+    unsigned int state;
+
+    state = *state_ptr;
+
+    switch ( state )
+    {
+    case 0: // Not a reentrant NMI request
+        return;
+
+    case 1: // Outer NMI call - let's try to set up a reentrant case
+        printk("In outer reentrant case\n");
+
+        // Signal inner state
+        *state_ptr = state = 2;
+        // Queue up another NMI which can't currently be delivered
+        self_nmi();
+        /* printk("Queued up supplementary NMI\n"); */
+        /* mdelay(10); */
+        /* // iret from exception handler should re-enable NMIs */
+        run_in_exception_handler(do_nmi_reent_fault);
+        /* mdelay(10); */
+
+        return;
+
+    case 2:
+        printk("In inner reentrant case - stuff will probably break quickly\n");
+        *state_ptr = state = 0;
+    }
+}
+
+static void nmi_reentrant(unsigned char key)
+{
+    volatile unsigned int * state_ptr = &(this_cpu(reent_state));
+    unsigned int state;
+    s_time_t deadline;
+
+    printk("*** Trying to force reentrant NMI ***\n");
+
+    if ( *state_ptr != 0 )
+    {
+        printk("State not ready - waiting for up to 2 seconds\n");
+        deadline = (NOW() + SECONDS(2));
+        do
+        {
+            cpu_relax();
+            state = *state_ptr;
+        } while ( NOW() < deadline && state != 0 );
+
+        if ( state != 0 )
+        {
+            printk("Forcibly resetting state\n");
+            *state_ptr = 0;
+        }
+        else
+            printk("Waiting seemed to work\n");
+    }
+
+    // Set up trap for NMI handler
+    *state_ptr = 1;
+    deadline = (NOW() + SECONDS(1));
+    printk("Set up trigger for NMI handler and waiting for magic\n");
+    self_nmi();
+
+    do
+    {
+        cpu_relax();
+        state = *state_ptr;
+    } while ( NOW() < deadline && state != 0 );
+
+    if ( state != 0 )
+    {
+        printk("Forcibly resetting state\n");
+        *state_ptr = 0;
+    }
+    else
+        printk("Apparently AOK\n");
+
+    printk("*** Done reentrant test ***\n");
+}
+
+static struct keyhandler nmi_reentrant_keyhandler = {
+    .diagnostic = 0,
+    .u.fn = nmi_reentrant,
+    .desc = "Reentrant NMI test"
+};
+
 static __init int register_nmi_trigger(void)
 {
     register_keyhandler('N', &nmi_trigger_keyhandler);
     register_keyhandler('n', &nmi_stats_keyhandler);
+    register_keyhandler('1', &nmi_reentrant_keyhandler);
     return 0;
 }
 __initcall(register_nmi_trigger);
diff -r f6ad86b61d5a -r 303d94fa720c xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3183,6 +3183,7 @@ static int dummy_nmi_callback(struct cpu
  
 static nmi_callback_t nmi_callback = dummy_nmi_callback;
 
+extern void do_nmi_reentrant_handler(void);
 /* This function should never be called directly.  Use do_nmi() instead. */
 static void _do_nmi(struct cpu_user_regs *regs)
 {
@@ -3191,6 +3192,8 @@ static void _do_nmi(struct cpu_user_regs
 
     ++nmi_count(cpu);
 
+    do_nmi_reentrant_handler();
+
     if ( nmi_callback(regs, cpu) )
         return;


* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
  2012-12-04 18:16 ` [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path Andrew Cooper
@ 2012-12-04 18:25   ` Andrew Cooper
  2012-12-05  9:17   ` Jan Beulich
  2012-12-06 11:36   ` Tim Deegan
  2 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2012-12-04 18:25 UTC (permalink / raw)
  To: xen-devel


On 04/12/12 18:16, Andrew Cooper wrote:
> Experimentally, certain crash kernels will triple fault very early if
> started with NMIs disabled.  This was discovered when
> experimenting with a debug keyhandler which deliberately created a
> reentrant NMI, causing stack corruption.
>
> Because of this bug, and because future changes to the NMI
> handling will make the kexec path more fragile, take the time now to
> bullet-proof the kexec behaviour to be safe in all circumstances.
>
> This patch adds three new low level routines:
>  * trap_nop
>      This is a no-op handler which irets immediately
>  * nmi_crash
>      This is a special NMI handler for use during a kexec crash
>  * enable_nmis
>      This function enables NMIs by executing an iret-to-self
>
> As a result, the new behaviour of the kexec_crash path is:
>
> nmi_shootdown_cpus() will:
>
>  * Disable interrupt stack tables.
>     Disabling ISTs will prevent stack corruption from future NMIs and
>     MCEs. As Xen is going to crash, we are not concerned with the
>     security race condition with 'sysret'; any guest which managed to
>     benefit from the race condition will only be able to execute a handful
>     of instructions before it will be killed anyway, and a guest is
>     still unable to trigger the race condition in the first place.
>
>  * Swap the NMI trap handlers.
>     The crashing pcpu gets the nop handler, to prevent it from getting
>     stuck in an NMI context and causing a hang instead of a crash.  The
>     non-crashing pcpus all get the nmi_crash handler, which is designed
>     never to return.
>
> do_nmi_crash() will:
>
>  * Save the crash notes and shut the pcpu down.
>     There is now an extra per-cpu variable to prevent us from executing
>     this multiple times.  If we re-enter midway through, attempt the
>     whole operation again rather than leave it incomplete.
>
>  * Set up another NMI at the LAPIC.
>     Even when the LAPIC has been disabled, the ID and command registers
>     are still usable.  As a result, we can deliberately queue up a new
>     NMI to re-interrupt us later if NMIs get unlatched.  Because of the
>     call to __stop_this_cpu(), we have to hand-craft self_nmi() to be
>     safe from General Protection Faults.
>
>  * Fall into an infinite loop.
>
> machine_kexec() will:
>
>   * Swap the MCE handlers to be nops.
>      We cannot prevent MCEs from being delivered when we pass off to the
>      crash kernel, and the less Xen context is touched, the better.
>
>   * Explicitly enable NMIs.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/crash.c
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -32,41 +32,97 @@
>  
>  static atomic_t waiting_for_crash_ipi;
>  static unsigned int crashing_cpu;
> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>  
> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
> +/* This becomes the NMI handler for non-crashing CPUs. */
> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>  {
> -    /* Don't do anything if this handler is invoked on crashing cpu.
> -     * Otherwise, system will completely hang. Crashing cpu can get
> -     * an NMI if system was initially booted with nmi_watchdog parameter.
> +    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
> +    ASSERT( crashing_cpu != smp_processor_id() );
> +
> +    /* Save crash information and shut down CPU.  Attempt only once. */
> +    if ( ! this_cpu(crash_save_done) )
> +    {
> +        kexec_crash_save_cpu();
> +        __stop_this_cpu();
> +
> +        this_cpu(crash_save_done) = 1;

And I have already found a bug.  The
"atomic_dec(&waiting_for_crash_ipi);" from below should appear here.

Without it, we just wait the full second in nmi_shootdown_cpus() and
continue anyway.

~Andrew

> +    }
> +
> +    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
> +     * back to its boot state, so we are unable to rely on the regular
> +     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
> +     * (The likely scenario is that we have reverted from x2apic mode to
> +     * xapic, at which point #GPFs will occur if we use the apic_*
> +     * functions)
> +     *
> +     * The ICR and APIC ID of the LAPIC are still valid even during
> +     * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
> +     * queue up another NMI, to force us back into this loop if we exit.
>       */
> -    if ( cpu == crashing_cpu )
> -        return 1;
> -    local_irq_disable();
> +    switch ( current_local_apic_mode() )
> +    {
> +        u32 apic_id;
>  
> -    kexec_crash_save_cpu();
> +    case APIC_MODE_X2APIC:
> +        apic_id = apic_rdmsr(APIC_ID);
>  
> -    __stop_this_cpu();
> +        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
> +        break;
>  
> -    atomic_dec(&waiting_for_crash_ipi);
> +    case APIC_MODE_XAPIC:
> +        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
> +
> +        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
> +            cpu_relax();
> +
> +        apic_mem_write(APIC_ICR2, apic_id << 24);
> +        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
> +        break;
> +
> +    default:
> +        break;
> +    }
>  
>      for ( ; ; )
>          halt();
> -
> -    return 1;
>  }
>  
>  static void nmi_shootdown_cpus(void)
>  {
>      unsigned long msecs;
> +    int i;
> +    int cpu = smp_processor_id();
>  
>      local_irq_disable();
>  
> -    crashing_cpu = smp_processor_id();
> +    crashing_cpu = cpu;
>      local_irq_count(crashing_cpu) = 0;
>  
>      atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
> -    /* Would it be better to replace the trap vector here? */
> -    set_nmi_callback(crash_nmi_callback);
> +
> +    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash, which
> +     * invokes do_nmi_crash() (above), causing them to write state and
> +     * fall into a loop.  The crashing pcpu gets the nop handler to
> +     * cause it to return to this function ASAP.
> +     *
> +     * Furthermore, disable stack tables for NMIs and MCEs to prevent
> +     * race conditions resulting in corrupt stack frames.  As Xen is
> +     * about to die, we no longer care about the security-related race
> +     * condition with 'sysret' which ISTs are designed to solve.
> +     */
> +    for ( i = 0; i < nr_cpu_ids; ++i )
> +        if ( idt_tables[i] )
> +        {
> +            idt_tables[i][TRAP_nmi].a           &= ~(7UL << 32);
> +            idt_tables[i][TRAP_machine_check].a &= ~(7UL << 32);
> +
> +            if ( i == cpu )
> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
> +            else
> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
> +        }
> +
>      /* Ensure the new callback function is set before sending out the NMI. */
>      wmb();
>  
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/machine_kexec.c
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -87,6 +87,17 @@ void machine_kexec(xen_kexec_image_t *im
>       */
>      local_irq_disable();
>  
> +    /* Now that regular interrupts are disabled, we need to reduce the impact
> +     * of interrupts not disabled by 'cli'.  NMIs have already been
> +     * dealt with by machine_crash_shutdown().
> +     *
> +     * Set all pcpu MCE handlers to be noops. */
> +    set_intr_gate(TRAP_machine_check, &trap_nop);
> +
> +    /* Explicitly enable NMIs on this CPU.  Some crashdump kernels do
> +     * not like running with NMIs disabled. */
> +    enable_nmis();
> +
>      /*
>       * compat_machine_kexec() returns to idle pagetables, which requires us
>       * to be running on a static GDT mapping (idle pagetables have no GDT
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/x86_64/entry.S
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -635,11 +635,42 @@ ENTRY(nmi)
>          movl  $TRAP_nmi,4(%rsp)
>          jmp   handle_ist_exception
>  
> +ENTRY(nmi_crash)
> +        cli
> +        pushq $0
> +        movl $TRAP_nmi,4(%rsp)
> +        SAVE_ALL
> +        movq %rsp,%rdi
> +        callq do_nmi_crash /* Does not return */
> +        ud2
> +
>  ENTRY(machine_check)
>          pushq $0
>          movl  $TRAP_machine_check,4(%rsp)
>          jmp   handle_ist_exception
>  
> +/* No op trap handler.  Required for kexec path. */
> +ENTRY(trap_nop)
> +        iretq
> +
> +/* Enable NMIs.  No special register assumptions, and all preserved. */
> +ENTRY(enable_nmis)
> +        push %rax
> +        movq %rsp, %rax /* Grab RSP before pushing */
> +
> +        /* Set up stack frame */
> +        pushq $0               /* SS */
> +        pushq %rax             /* RSP */
> +        pushfq                 /* RFLAGS */
> +        pushq $__HYPERVISOR_CS /* CS */
> +        leaq  1f(%rip),%rax
> +        pushq %rax             /* RIP */
> +
> +        iretq /* Disable the hardware NMI latch */
> +1:
> +        popq %rax
> +        retq
> +
>  .section .rodata, "a", @progbits
>  
>  ENTRY(exception_table)
> diff -r 1c69c938f641 -r b15d3ae525af xen/include/asm-x86/processor.h
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -517,6 +517,7 @@ void do_ ## _name(struct cpu_user_regs *
>  DECLARE_TRAP_HANDLER(divide_error);
>  DECLARE_TRAP_HANDLER(debug);
>  DECLARE_TRAP_HANDLER(nmi);
> +DECLARE_TRAP_HANDLER(nmi_crash);
>  DECLARE_TRAP_HANDLER(int3);
>  DECLARE_TRAP_HANDLER(overflow);
>  DECLARE_TRAP_HANDLER(bounds);
> @@ -535,6 +536,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
>  DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
>  #undef DECLARE_TRAP_HANDLER
>  
> +void trap_nop(void);
> +void enable_nmis(void);
> +
>  void syscall_enter(void);
>  void sysenter_entry(void);
>  void sysenter_eflags_saved(void);

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
  2012-12-04 18:16 ` [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path Andrew Cooper
  2012-12-04 18:25   ` Andrew Cooper
@ 2012-12-05  9:17   ` Jan Beulich
  2012-12-05 10:05     ` Andrew Cooper
  2012-12-06 11:36   ` Tim Deegan
  2 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2012-12-05  9:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -32,41 +32,97 @@
>  
>  static atomic_t waiting_for_crash_ipi;
>  static unsigned int crashing_cpu;
> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>  
> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
> +/* This becomes the NMI handler for non-crashing CPUs. */
> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>  {
> -    /* Don't do anything if this handler is invoked on crashing cpu.
> -     * Otherwise, system will completely hang. Crashing cpu can get
> -     * an NMI if system was initially booted with nmi_watchdog parameter.
> +    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
> +    ASSERT( crashing_cpu != smp_processor_id() );
> +
> +    /* Save crash information and shut down CPU.  Attempt only once. */
> +    if ( ! this_cpu(crash_save_done) )
> +    {
> +        kexec_crash_save_cpu();
> +        __stop_this_cpu();
> +
> +        this_cpu(crash_save_done) = 1;
> +    }
> +
> +    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
> +     * back to its boot state, so we are unable to rely on the regular
> +     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
> +     * (The likely scenario is that we have reverted from x2apic mode to
> +     * xapic, at which point #GPFs will occur if we use the apic_*
> +     * functions)
> +     *
> +     * The ICR and APIC ID of the LAPIC are still valid even during
> +     * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
> +     * queue up another NMI, to force us back into this loop if we exit.
>       */
> -    if ( cpu == crashing_cpu )
> -        return 1;
> -    local_irq_disable();
> +    switch ( current_local_apic_mode() )
> +    {
> +        u32 apic_id;
>  
> -    kexec_crash_save_cpu();
> +    case APIC_MODE_X2APIC:
> +        apic_id = apic_rdmsr(APIC_ID);
>  
> -    __stop_this_cpu();
> +        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));

As in the description you talk about the LAPIC being (possibly)
disabled here - did you check that this would not #GP in that
case?

> +        break;
>  
> -    atomic_dec(&waiting_for_crash_ipi);
> +    case APIC_MODE_XAPIC:
> +        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
> +
> +        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
> +            cpu_relax();
> +
> +        apic_mem_write(APIC_ICR2, apic_id << 24);
> +        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
> +        break;
> +
> +    default:
> +        break;
> +    }
>  
>      for ( ; ; )
>          halt();
> -
> -    return 1;
>  }

>...

>  ENTRY(machine_check)
>          pushq $0
>          movl  $TRAP_machine_check,4(%rsp)
>          jmp   handle_ist_exception
>  
> +/* No op trap handler.  Required for kexec path. */
> +ENTRY(trap_nop)

I'd prefer you to re-use an existing IRETQ (e.g. the one you add
below) here - this is not performance critical, and hence there's
no need for it to be aligned.

Jan

> +        iretq
> +
> +/* Enable NMIs.  No special register assumptions, and all preserved. */
> +ENTRY(enable_nmis)
> +        push %rax
> +        movq %rsp, %rax /* Grab RSP before pushing */
> +
> +        /* Set up stack frame */
> +        pushq $0               /* SS */
> +        pushq %rax             /* RSP */
> +        pushfq                 /* RFLAGS */
> +        pushq $__HYPERVISOR_CS /* CS */
> +        leaq  1f(%rip),%rax
> +        pushq %rax             /* RIP */
> +
> +        iretq /* Disable the hardware NMI latch */
> +1:
> +        popq %rax
> +        retq
> +
>  .section .rodata, "a", @progbits
>  
>  ENTRY(exception_table)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler
  2012-12-04 18:16 ` [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler Andrew Cooper
@ 2012-12-05  9:21   ` Jan Beulich
  2012-12-05 10:47     ` Andrew Cooper
  2012-12-06 11:39   ` Tim Deegan
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2012-12-05  9:21 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> The (old) function do_nmi() is not reentrantly-safe.  Rename it to
> _do_nmi() and present a new do_nmi() which has reentrancy guards.
> 
> If a reentrant NMI has been detected, then it is highly likely that the
> outer NMI exception frame has been corrupted, meaning we cannot return
> to the original context.  In this case, we panic() in an obvious manner rather than
> falling into an infinite loop.
> 
> panic() however is not safe to reenter from an NMI context, as an NMI
> (or MCE) can interrupt it inside its critical section, at which point a
> new call to panic() will deadlock.  As a result, we bail early if a
> panic() is already in progress, as Xen is about to die anyway.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> --
> I am fairly sure this is safe with the current kexec_crash functionality
> which involves holding all non-crashing pcpus in an NMI loop.  In the
> case of reentrant NMIs and panic_in_progress, we will repeatedly bail
> early in an infinite loop of NMIs, which has the same intended effect of
> simply causing all non-crashing CPUs to stay out of the way while the
> main crash occurs.
> 
> diff -r 48a60a407e15 -r f6ad86b61d5a xen/arch/x86/traps.c
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -88,6 +88,7 @@ static char __read_mostly opt_nmi[10] = 
>  string_param("nmi", opt_nmi);
>  
>  DEFINE_PER_CPU(u64, efer);
> +static DEFINE_PER_CPU(bool_t, nmi_in_progress) = 0;
>  
>  DEFINE_PER_CPU_READ_MOSTLY(u32, ler_msr);
>  
> @@ -3182,7 +3183,8 @@ static int dummy_nmi_callback(struct cpu
>   
>  static nmi_callback_t nmi_callback = dummy_nmi_callback;
>  
> -void do_nmi(struct cpu_user_regs *regs)
> +/* This function should never be called directly.  Use do_nmi() instead. */
> +static void _do_nmi(struct cpu_user_regs *regs)
>  {
>      unsigned int cpu = smp_processor_id();
>      unsigned char reason;
> @@ -3208,6 +3210,44 @@ void do_nmi(struct cpu_user_regs *regs)
>      }
>  }
>  
> +/* This function is NOT SAFE to call from C code in general.
> + * Use with extreme care! */
> +void do_nmi(struct cpu_user_regs *regs)
> +{
> +    bool_t * in_progress = &this_cpu(nmi_in_progress);
> +
> +    if ( is_panic_in_progress() )
> +    {
> +        /* A panic is already in progress.  It may have reenabled NMIs,
> +         * or we are simply unlucky to receive one right now.  Either
> +         * way, bail early, as Xen is about to die.
> +         *
> +         * TODO: Ideally we should exit without executing an iret, to
> +         * leave NMIs disabled, but that option is not currently
> +         * available to us.

You could easily provide the ground work for this here by having
the function return a bool_t (even if not immediately consumed by
the caller in this same patch).

Jan

> +         */
> +        return;
> +    }
> +
> +    if ( test_and_set_bool(*in_progress) )
> +    {
> +        /* Crash in an obvious mannor, as opposed to falling into
> +         * infinite loop because our exception frame corrupted the
> +         * exception frame of the previous NMI.
> +         *
> +         * TODO: This check does not cover all possible cases of corrupt
> +         * exception frames, but it is substantially better than
> +         * nothing.
> +         */
> +        console_force_unlock();
> +        show_execution_state(regs);
> +        panic("Reentrant NMI detected\n");
> +    }
> +
> +    _do_nmi(regs);
> +    *in_progress = 0;
> +}
> +
>  void set_nmi_callback(nmi_callback_t callback)
>  {
>      nmi_callback = callback;
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org 
> http://lists.xen.org/xen-devel 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
  2012-12-05  9:17   ` Jan Beulich
@ 2012-12-05 10:05     ` Andrew Cooper
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2012-12-05 10:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 05/12/12 09:17, Jan Beulich wrote:
>>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> --- a/xen/arch/x86/crash.c
>> +++ b/xen/arch/x86/crash.c
>> @@ -32,41 +32,97 @@
>>  
>>  static atomic_t waiting_for_crash_ipi;
>>  static unsigned int crashing_cpu;
>> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>>  
>> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
>> +/* This becomes the NMI handler for non-crashing CPUs. */
>> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>>  {
>> -    /* Don't do anything if this handler is invoked on crashing cpu.
>> -     * Otherwise, system will completely hang. Crashing cpu can get
>> -     * an NMI if system was initially booted with nmi_watchdog parameter.
>> +    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
>> +    ASSERT( crashing_cpu != smp_processor_id() );
>> +
>> +    /* Save crash information and shut down CPU.  Attempt only once. */
>> +    if ( ! this_cpu(crash_save_done) )
>> +    {
>> +        kexec_crash_save_cpu();
>> +        __stop_this_cpu();
>> +
>> +        this_cpu(crash_save_done) = 1;
>> +    }
>> +
>> +    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
>> +     * back to its boot state, so we are unable to rely on the regular
>> +     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
>> +     * (The likely scenario is that we have reverted from x2apic mode to
>> +     * xapic, at which point #GPFs will occur if we use the apic_*
>> +     * functions)
>> +     *
>> +     * The ICR and APIC ID of the LAPIC are still valid even during
>> +     * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
>> +     * queue up another NMI, to force us back into this loop if we exit.
>>       */
>> -    if ( cpu == crashing_cpu )
>> -        return 1;
>> -    local_irq_disable();
>> +    switch ( current_local_apic_mode() )
>> +    {
>> +        u32 apic_id;
>>  
>> -    kexec_crash_save_cpu();
>> +    case APIC_MODE_X2APIC:
>> +        apic_id = apic_rdmsr(APIC_ID);
>>  
>> -    __stop_this_cpu();
>> +        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
> As in the description you talk about the LAPIC being (possibly)
> disabled here - did you check that this would not #GP in that
> case?

Yes - we are switching on current_local_apic_mode() which reads the
APICBASE MSR to work out which mode we are in, and returns an enum
apic_mode.

With this information, all the apic_**msr accesses are in case x2apic,
and all the apic_mem_** accesses are in case xapic.

If the apic_mode is disabled, then we skip trying to set up the next NMI.

The patch is sadly rather less legible than I would have hoped, but the
code post patch is far more logical to read.

>> +        break;
>>  
>> -    atomic_dec(&waiting_for_crash_ipi);
>> +    case APIC_MODE_XAPIC:
>> +        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
>> +
>> +        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
>> +            cpu_relax();
>> +
>> +        apic_mem_write(APIC_ICR2, apic_id << 24);
>> +        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
>> +        break;
>> +
>> +    default:
>> +        break;
>> +    }
>>  
>>      for ( ; ; )
>>          halt();
>> -
>> -    return 1;
>>  }
>> ...
>>  ENTRY(machine_check)
>>          pushq $0
>>          movl  $TRAP_machine_check,4(%rsp)
>>          jmp   handle_ist_exception
>>  
>> +/* No op trap handler.  Required for kexec path. */
>> +ENTRY(trap_nop)
> I'd prefer you to re-use an existing IRETQ (e.g. the one you add
> below) here - this is not performance critical, and hence there's
> no need for it to be aligned.
>
> Jan

Certainly.

~Andrew

>
>> +        iretq
>> +
>> +/* Enable NMIs.  No special register assumptions, and all preserved. */
>> +ENTRY(enable_nmis)
>> +        push %rax
>> +        movq %rsp, %rax /* Grab RSP before pushing */
>> +
>> +        /* Set up stack frame */
>> +        pushq $0               /* SS */
>> +        pushq %rax             /* RSP */
>> +        pushfq                 /* RFLAGS */
>> +        pushq $__HYPERVISOR_CS /* CS */
>> +        leaq  1f(%rip),%rax
>> +        pushq %rax             /* RIP */
>> +
>> +        iretq /* Disable the hardware NMI latch */
>> +1:
>> +        popq %rax
>> +        retq
>> +
>>  .section .rodata, "a", @progbits
>>  
>>  ENTRY(exception_table)

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler
  2012-12-05  9:21   ` Jan Beulich
@ 2012-12-05 10:47     ` Andrew Cooper
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2012-12-05 10:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 05/12/12 09:21, Jan Beulich wrote:
>>>> On 04.12.12 at 19:16, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> The (old) function do_nmi() is not reentrantly-safe.  Rename it to
>> _do_nmi() and present a new do_nmi() which has reentrancy guards.
>>
>> If a reentrant NMI has been detected, then it is highly likely that the
>> outer NMI exception frame has been corrupted, meaning we cannot return
>> to the original context.  In this case, we panic() in an obvious manner rather than
>> falling into an infinite loop.
>>
>> panic() however is not safe to reenter from an NMI context, as an NMI
>> (or MCE) can interrupt it inside its critical section, at which point a
>> new call to panic() will deadlock.  As a result, we bail early if a
>> panic() is already in progress, as Xen is about to die anyway.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>> --
>> I am fairly sure this is safe with the current kexec_crash functionality
>> which involves holding all non-crashing pcpus in an NMI loop.  In the
>> case of reentrant NMIs and panic_in_progress, we will repeatedly bail
>> early in an infinite loop of NMIs, which has the same intended effect of
>> simply causing all non-crashing CPUs to stay out of the way while the
>> main crash occurs.
>>
>> diff -r 48a60a407e15 -r f6ad86b61d5a xen/arch/x86/traps.c
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -88,6 +88,7 @@ static char __read_mostly opt_nmi[10] = 
>>  string_param("nmi", opt_nmi);
>>  
>>  DEFINE_PER_CPU(u64, efer);
>> +static DEFINE_PER_CPU(bool_t, nmi_in_progress) = 0;
>>  
>>  DEFINE_PER_CPU_READ_MOSTLY(u32, ler_msr);
>>  
>> @@ -3182,7 +3183,8 @@ static int dummy_nmi_callback(struct cpu
>>   
>>  static nmi_callback_t nmi_callback = dummy_nmi_callback;
>>  
>> -void do_nmi(struct cpu_user_regs *regs)
>> +/* This function should never be called directly.  Use do_nmi() instead. */
>> +static void _do_nmi(struct cpu_user_regs *regs)
>>  {
>>      unsigned int cpu = smp_processor_id();
>>      unsigned char reason;
>> @@ -3208,6 +3210,44 @@ void do_nmi(struct cpu_user_regs *regs)
>>      }
>>  }
>>  
>> +/* This function is NOT SAFE to call from C code in general.
>> + * Use with extreme care! */
>> +void do_nmi(struct cpu_user_regs *regs)
>> +{
>> +    bool_t * in_progress = &this_cpu(nmi_in_progress);
>> +
>> +    if ( is_panic_in_progress() )
>> +    {
>> +        /* A panic is already in progress.  It may have reenabled NMIs,
>> +         * or we are simply unlucky to receive one right now.  Either
>> +         * way, bail early, as Xen is about to die.
>> +         *
>> +         * TODO: Ideally we should exit without executing an iret, to
>> +         * leave NMIs disabled, but that option is not currently
>> +         * available to us.
> You could easily provide the ground work for this here by having
> the function return a bool_t (even if not immediately consumed by
> the caller in this same patch).
>
> Jan

Will do.  I had considered a bool_t and was planning to integrate it
later in development.

~Andrew

>
>> +         */
>> +        return;
>> +    }
>> +
>> +    if ( test_and_set_bool(*in_progress) )
>> +    {
>> +        /* Crash in an obvious manner, as opposed to falling into an
>> +         * infinite loop because our exception frame corrupted the
>> +         * exception frame of the previous NMI.
>> +         *
>> +         * TODO: This check does not cover all possible cases of corrupt
>> +         * exception frames, but it is substantially better than
>> +         * nothing.
>> +         */
>> +        console_force_unlock();
>> +        show_execution_state(regs);
>> +        panic("Reentrant NMI detected\n");
>> +    }
>> +
>> +    _do_nmi(regs);
>> +    *in_progress = 0;
>> +}
>> +
>>  void set_nmi_callback(nmi_callback_t callback)
>>  {
>>      nmi_callback = callback;
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org 
>> http://lists.xen.org/xen-devel 
>
>

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
  2012-12-04 18:16 ` [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path Andrew Cooper
  2012-12-04 18:25   ` Andrew Cooper
  2012-12-05  9:17   ` Jan Beulich
@ 2012-12-06 11:36   ` Tim Deegan
  2012-12-06 11:58     ` Andrew Cooper
  2 siblings, 1 reply; 13+ messages in thread
From: Tim Deegan @ 2012-12-06 11:36 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

At 18:16 +0000 on 04 Dec (1354644968), Andrew Cooper wrote:
> do_nmi_crash() will:
> 
>  * Save the crash notes and shut the pcpu down.
>     There is now an extra per-cpu variable to prevent us from executing
>     this multiple times.  In the case where we reenter midway through,
>     attempt the whole operation again in preference to not completing
>     it in the first place.
> 
>  * Set up another NMI at the LAPIC.
>     Even when the LAPIC has been disabled, the ID and command registers
>     are still usable.  As a result, we can deliberately queue up a new
>     NMI to re-interrupt us later if NMIs get unlatched.

Can you please include this reasoning in the code itself, as well as in
the commit message?

> machine_kexec() will:
> 
>   * Swap the MCE handlers to be a nop.
>      We cannot prevent MCEs from being delivered when we pass off to the
>      crash kernel, and the less Xen context is being touched the better.
> 
>   * Explicitly enable NMIs.

Would it be cleaner to have this path explicitly set the IDT entry to
invoke the noop handler?  Or do we know this is always the case when we
reach this point?

I'm just wondering about the case where we get here with an NMI pending.
Presumably if we have some other source of NMIs active while we kexec,
the post-exec kernel will crash if it overwrites our handlers &c before
setting up its own IDT.  But if kexec()ing with NMIs _disabled_ also
fails then we haven't much choice. :|

Otherwise, this looks good to me.

Tim.

> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/crash.c
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -32,41 +32,97 @@
>  
>  static atomic_t waiting_for_crash_ipi;
>  static unsigned int crashing_cpu;
> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>  
> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
> +/* This becomes the NMI handler for non-crashing CPUs. */
> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>  {
> -    /* Don't do anything if this handler is invoked on crashing cpu.
> -     * Otherwise, system will completely hang. Crashing cpu can get
> -     * an NMI if system was initially booted with nmi_watchdog parameter.
> +    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
> +    ASSERT( crashing_cpu != smp_processor_id() );
> +
> +    /* Save crash information and shut down CPU.  Attempt only once. */
> +    if ( ! this_cpu(crash_save_done) )
> +    {
> +        kexec_crash_save_cpu();
> +        __stop_this_cpu();
> +
> +        this_cpu(crash_save_done) = 1;
> +    }
> +
> +    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
> +     * back to its boot state, so we are unable to rely on the regular
> +     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
> +     * (The likely scenario is that we have reverted from x2apic mode to
> +     * xapic, at which point #GPFs will occur if we use the apic_*
> +     * functions)
> +     *
> +     * The ICR and APIC ID of the LAPIC are still valid even during
> +     * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
> +     * queue up another NMI, to force us back into this loop if we exit.
>       */
> -    if ( cpu == crashing_cpu )
> -        return 1;
> -    local_irq_disable();
> +    switch ( current_local_apic_mode() )
> +    {
> +        u32 apic_id;
>  
> -    kexec_crash_save_cpu();
> +    case APIC_MODE_X2APIC:
> +        apic_id = apic_rdmsr(APIC_ID);
>  
> -    __stop_this_cpu();
> +        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
> +        break;
>  
> -    atomic_dec(&waiting_for_crash_ipi);
> +    case APIC_MODE_XAPIC:
> +        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
> +
> +        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
> +            cpu_relax();
> +
> +        apic_mem_write(APIC_ICR2, apic_id << 24);
> +        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
> +        break;
> +
> +    default:
> +        break;
> +    }
>  
>      for ( ; ; )
>          halt();
> -
> -    return 1;
>  }
>  
>  static void nmi_shootdown_cpus(void)
>  {
>      unsigned long msecs;
> +    int i;
> +    int cpu = smp_processor_id();
>  
>      local_irq_disable();
>  
> -    crashing_cpu = smp_processor_id();
> +    crashing_cpu = cpu;
>      local_irq_count(crashing_cpu) = 0;
>  
>      atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
> -    /* Would it be better to replace the trap vector here? */
> -    set_nmi_callback(crash_nmi_callback);
> +
> +    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash, which
> +     * invokes do_nmi_crash (above), which causes them to write state and
> +     * fall into a loop.  The crashing pcpu gets the nop handler to
> +     * cause it to return to this function ASAP.
> +     *
> +     * Furthermore, disable stack tables for NMIs and MCEs to prevent
> +     * race conditions resulting in corrupt stack frames.  As Xen is
> +     * about to die, we no longer care about the security-related race
> +     * condition with 'sysret' which ISTs are designed to solve.
> +     */
> +    for ( i = 0; i < nr_cpu_ids; ++i )
> +        if ( idt_tables[i] )
> +        {
> +            idt_tables[i][TRAP_nmi].a           &= ~(7UL << 32);
> +            idt_tables[i][TRAP_machine_check].a &= ~(7UL << 32);
> +
> +            if ( i == cpu )
> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
> +            else
> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
> +        }
> +
>      /* Ensure the new callback function is set before sending out the NMI. */
>      wmb();
>  
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/machine_kexec.c
> --- a/xen/arch/x86/machine_kexec.c
> +++ b/xen/arch/x86/machine_kexec.c
> @@ -87,6 +87,17 @@ void machine_kexec(xen_kexec_image_t *im
>       */
>      local_irq_disable();
>  
> +    /* Now that regular interrupts are disabled, we need to reduce the
> +     * impact of interrupts not disabled by 'cli'.  NMIs have already
> +     * been dealt with by machine_crash_shutdown().
> +     *
> +     * Set all pcpu MCE handlers to be a noop. */
> +    set_intr_gate(TRAP_machine_check, &trap_nop);
> +
> +    /* Explicitly enable NMIs on this CPU.  Some crashdump kernels do
> +     * not like running with NMIs disabled. */
> +    enable_nmis();
> +
>      /*
>       * compat_machine_kexec() returns to idle pagetables, which requires us
>       * to be running on a static GDT mapping (idle pagetables have no GDT
> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/x86_64/entry.S
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -635,11 +635,42 @@ ENTRY(nmi)
>          movl  $TRAP_nmi,4(%rsp)
>          jmp   handle_ist_exception
>  
> +ENTRY(nmi_crash)
> +        cli
> +        pushq $0
> +        movl $TRAP_nmi,4(%rsp)
> +        SAVE_ALL
> +        movq %rsp,%rdi
> +        callq do_nmi_crash /* Does not return */
> +        ud2
> +
>  ENTRY(machine_check)
>          pushq $0
>          movl  $TRAP_machine_check,4(%rsp)
>          jmp   handle_ist_exception
>  
> +/* No op trap handler.  Required for kexec path. */
> +ENTRY(trap_nop)
> +        iretq
> +
> +/* Enable NMIs.  No special register assumptions, and all preserved. */
> +ENTRY(enable_nmis)
> +        push %rax
> +        movq %rsp, %rax /* Grab RSP before pushing */
> +
> +        /* Set up stack frame */
> +        pushq $0               /* SS */
> +        pushq %rax             /* RSP */
> +        pushfq                 /* RFLAGS */
> +        pushq $__HYPERVISOR_CS /* CS */
> +        leaq  1f(%rip),%rax
> +        pushq %rax             /* RIP */
> +
> +        iretq /* Disable the hardware NMI latch */
> +1:
> +        popq %rax
> +        retq
> +
>  .section .rodata, "a", @progbits
>  
>  ENTRY(exception_table)
> diff -r 1c69c938f641 -r b15d3ae525af xen/include/asm-x86/processor.h
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -517,6 +517,7 @@ void do_ ## _name(struct cpu_user_regs *
>  DECLARE_TRAP_HANDLER(divide_error);
>  DECLARE_TRAP_HANDLER(debug);
>  DECLARE_TRAP_HANDLER(nmi);
> +DECLARE_TRAP_HANDLER(nmi_crash);
>  DECLARE_TRAP_HANDLER(int3);
>  DECLARE_TRAP_HANDLER(overflow);
>  DECLARE_TRAP_HANDLER(bounds);
> @@ -535,6 +536,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
>  DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
>  #undef DECLARE_TRAP_HANDLER
>  
> +void trap_nop(void);
> +void enable_nmis(void);
> +
>  void syscall_enter(void);
>  void sysenter_entry(void);
>  void sysenter_eflags_saved(void);
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler
  2012-12-04 18:16 ` [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler Andrew Cooper
  2012-12-05  9:21   ` Jan Beulich
@ 2012-12-06 11:39   ` Tim Deegan
  1 sibling, 0 replies; 13+ messages in thread
From: Tim Deegan @ 2012-12-06 11:39 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

At 18:16 +0000 on 04 Dec (1354644970), Andrew Cooper wrote:
> +/* This function is NOT SAFE to call from C code in general.
> + * Use with extreme care! */
> +void do_nmi(struct cpu_user_regs *regs)
> +{
> +    bool_t * in_progress = &this_cpu(nmi_in_progress);
> +
> +    if ( is_panic_in_progress() )
> +    {
> +        /* A panic is already in progress.  It may have reenabled NMIs,
> +         * or we are simply unlucky to receive one right now.  Either
> +         * way, bail early, as Xen is about to die.
> +         *
> +         * TODO: Ideally we should exit without executing an iret, to
> +         * leave NMIs disabled, but that option is not currently
> +         * available to us.
> +         */
> +        return;
> +    }
> +
> +    if ( test_and_set_bool(*in_progress) )
> +    {
> +        /* Crash in an obvious manner, as opposed to falling into an
> +         * infinite loop because our exception frame corrupted the
> +         * exception frame of the previous NMI.
> +         *
> +         * TODO: This check does not cover all possible cases of corrupt
> +         * exception frames, but it is substantially better than
> +         * nothing.
> +         */
> +        console_force_unlock();
> +        show_execution_state(regs);
> +        panic("Reentrant NMI detected\n");
> +    }
> +
> +    _do_nmi(regs);

I think you need a barrier() above and below _do_nmi() in case the
compiler inlines _do_nmi() and then decides it can shuffle the writes to
in_progress around.

Otherwise, this looks fine, and you can add my ack if you like.

> +    *in_progress = 0;
> +}
> +
>  void set_nmi_callback(nmi_callback_t callback)
>  {
>      nmi_callback = callback;
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path
  2012-12-06 11:36   ` Tim Deegan
@ 2012-12-06 11:58     ` Andrew Cooper
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2012-12-06 11:58 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

On 06/12/12 11:36, Tim Deegan wrote:
> At 18:16 +0000 on 04 Dec (1354644968), Andrew Cooper wrote:
>> do_nmi_crash() will:
>>
>>  * Save the crash notes and shut the pcpu down.
>>     There is now an extra per-cpu variable to prevent us from executing
>>     this multiple times.  In the case where we reenter midway through,
>>     attempt the whole operation again in preference to not completing
>>     it in the first place.
>>
>>  * Set up another NMI at the LAPIC.
>>     Even when the LAPIC has been disabled, the ID and command registers
>>     are still usable.  As a result, we can deliberately queue up a new
>>     NMI to re-interrupt us later if NMIs get unlatched.
> Can you please include this reasoning in the code itself, as well as in
> the commit message?

There is a comment alluding to this (the final sentence in the block
beginning "Poor man's self_nmi()"), but I will reword it to make it clearer.

>
>> machine_kexec() will:
>>
>>   * Swap the MCE handlers to be a nop.
>>      We cannot prevent MCEs from being delivered when we pass off to the
>>      crash kernel, and the less Xen context is being touched the better.
>>
>>   * Explicitly enable NMIs.
> Would it be cleaner to have this path explicitly set the IDT entry to
> invoke the noop handler?  Or do we know this is always the case when we
> reach this point?

Early versions of this patch did change the NMI entry here, but that was
before I decided that nmi_shootdown_cpus() needed redoing.

As it currently stands, all pcpus other than us have the nmi_crash
handler, and we have the nop handler because the call graph looks like:

kexec_crash()
  ...
  machine_crash_shutdown()
    nmi_shootdown_cpus()
  machine_kexec()

I will however clarify this NMI setup with a comment at this location.

>
> I'm just wondering about the case where we get here with an NMI pending.
> Presumably if we have some other source of NMIs active while we kexec,
> the post-exec kernel will crash if it overwrites our handlers &c before
> setting up its own IDT.  But if kexec()ing with NMIs _disabled_ also
> fails then we haven't much choice. :|
>
> Otherwise, this looks good to me.
>
> Tim.

We do have another source of NMIs active: the watchdog disable routine
only tells the NMI handler to ignore watchdog NMIs, rather than
disabling the source of the NMIs itself.  It is another item on my hitlist.

The other problem we have with the kexec mechanism is to do with
taking NMIs and MCEs between purgatory changing addressing schemes and
setting up its own IDT.  I believe that compat_machine_kexec() does this
in an unsafe way even before jumping to purgatory.  I was wondering
whether it was possible to monopolise some very low pages which could be
identity mapped in any paging scheme, but I can't think of a nice way to
fix a move into real mode, at which point the loaded IDT is completely
invalid.  Another item for the hitlist.

~Andrew

>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/crash.c
>> --- a/xen/arch/x86/crash.c
>> +++ b/xen/arch/x86/crash.c
>> @@ -32,41 +32,97 @@
>>  
>>  static atomic_t waiting_for_crash_ipi;
>>  static unsigned int crashing_cpu;
>> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>>  
>> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
>> +/* This becomes the NMI handler for non-crashing CPUs. */
>> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>>  {
>> -    /* Don't do anything if this handler is invoked on crashing cpu.
>> -     * Otherwise, system will completely hang. Crashing cpu can get
>> -     * an NMI if system was initially booted with nmi_watchdog parameter.
>> +    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
>> +    ASSERT( crashing_cpu != smp_processor_id() );
>> +
>> +    /* Save crash information and shut down CPU.  Attempt only once. */
>> +    if ( ! this_cpu(crash_save_done) )
>> +    {
>> +        kexec_crash_save_cpu();
>> +        __stop_this_cpu();
>> +
>> +        this_cpu(crash_save_done) = 1;
>> +    }
>> +
>> +    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
>> +     * back to its boot state, so we are unable to rely on the regular
>> +     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
>> +     * (The likely scenario is that we have reverted from x2apic mode to
>> +     * xapic, at which point #GPFs will occur if we use the apic_*
>> +     * functions)
>> +     *
>> +     * The ICR and APIC ID of the LAPIC are still valid even during
>> +     * software disable (Intel SDM Vol 3, 10.4.7.2), which allows us to
>> +     * queue up another NMI, to force us back into this loop if we exit.
>>       */
>> -    if ( cpu == crashing_cpu )
>> -        return 1;
>> -    local_irq_disable();
>> +    switch ( current_local_apic_mode() )
>> +    {
>> +        u32 apic_id;
>>  
>> -    kexec_crash_save_cpu();
>> +    case APIC_MODE_X2APIC:
>> +        apic_id = apic_rdmsr(APIC_ID);
>>  
>> -    __stop_this_cpu();
>> +        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
>> +        break;
>>  
>> -    atomic_dec(&waiting_for_crash_ipi);
>> +    case APIC_MODE_XAPIC:
>> +        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
>> +
>> +        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
>> +            cpu_relax();
>> +
>> +        apic_mem_write(APIC_ICR2, apic_id << 24);
>> +        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
>> +        break;
>> +
>> +    default:
>> +        break;
>> +    }
>>  
>>      for ( ; ; )
>>          halt();
>> -
>> -    return 1;
>>  }
>>  
>>  static void nmi_shootdown_cpus(void)
>>  {
>>      unsigned long msecs;
>> +    int i;
>> +    int cpu = smp_processor_id();
>>  
>>      local_irq_disable();
>>  
>> -    crashing_cpu = smp_processor_id();
>> +    crashing_cpu = cpu;
>>      local_irq_count(crashing_cpu) = 0;
>>  
>>      atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
>> -    /* Would it be better to replace the trap vector here? */
>> -    set_nmi_callback(crash_nmi_callback);
>> +
>> +    /* Change NMI trap handlers.  Non-crashing pcpus get crash_nmi which
>> +     * invokes do_nmi_crash (above), which causes them to write state and
>> +     * fall into a loop.  The crashing pcpu gets the nop handler to
>> +     * cause it to return to this function ASAP.
>> +     *
>> +     * Furthermore, disable stack tables for NMIs and MCEs to prevent
>> +     * race conditions resulting in corrupt stack frames.  As Xen is
>> +     * about to die, we no longer care about the security-related race
>> +     * condition with 'sysret' which ISTs are designed to solve.
>> +     */
>> +    for ( i = 0; i < nr_cpu_ids; ++i )
>> +        if ( idt_tables[i] )
>> +        {
>> +            idt_tables[i][TRAP_nmi].a           &= ~(7UL << 32);
>> +            idt_tables[i][TRAP_machine_check].a &= ~(7UL << 32);
>> +
>> +            if ( i == cpu )
>> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
>> +            else
>> +                _set_gate(&idt_tables[i][TRAP_nmi], 14, 0, &nmi_crash);
>> +        }
>> +
>>      /* Ensure the new callback function is set before sending out the NMI. */
>>      wmb();
>>  
>> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/machine_kexec.c
>> --- a/xen/arch/x86/machine_kexec.c
>> +++ b/xen/arch/x86/machine_kexec.c
>> @@ -87,6 +87,17 @@ void machine_kexec(xen_kexec_image_t *im
>>       */
>>      local_irq_disable();
>>  
>> +    /* Now regular interrupts are disabled, we need to reduce the impact
>> +     * of interrupts not disabled by 'cli'.  NMIs have already been
>> +     * dealt with by machine_crash_shutdown().
>> +     *
>> +     * Set all pcpu MCE handler to be a noop. */
>> +    set_intr_gate(TRAP_machine_check, &trap_nop);
>> +
>> +    /* Explicitly enable NMIs on this CPU.  Some crashdump kernels do
>> +     * not like running with NMIs disabled. */
>> +    enable_nmis();
>> +
>>      /*
>>       * compat_machine_kexec() returns to idle pagetables, which requires us
>>       * to be running on a static GDT mapping (idle pagetables have no GDT
>> diff -r 1c69c938f641 -r b15d3ae525af xen/arch/x86/x86_64/entry.S
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -635,11 +635,42 @@ ENTRY(nmi)
>>          movl  $TRAP_nmi,4(%rsp)
>>          jmp   handle_ist_exception
>>  
>> +ENTRY(nmi_crash)
>> +        cli
>> +        pushq $0
>> +        movl $TRAP_nmi,4(%rsp)
>> +        SAVE_ALL
>> +        movq %rsp,%rdi
>> +        callq do_nmi_crash /* Does not return */
>> +        ud2
>> +
>>  ENTRY(machine_check)
>>          pushq $0
>>          movl  $TRAP_machine_check,4(%rsp)
>>          jmp   handle_ist_exception
>>  
>> +/* No op trap handler.  Required for kexec path. */
>> +ENTRY(trap_nop)
>> +        iretq
>> +
>> +/* Enable NMIs.  No special register assumptions, and all preserved. */
>> +ENTRY(enable_nmis)
>> +        push %rax
>> +        movq %rsp, %rax /* Grab RSP before pushing */
>> +
>> +        /* Set up stack frame */
>> +        pushq $0               /* SS */
>> +        pushq %rax             /* RSP */
>> +        pushfq                 /* RFLAGS */
>> +        pushq $__HYPERVISOR_CS /* CS */
>> +        leaq  1f(%rip),%rax
>> +        pushq %rax             /* RIP */
>> +
>> +        iretq /* Disable the hardware NMI latch */
>> +1:
>> +        popq %rax
>> +        retq
>> +
>>  .section .rodata, "a", @progbits
>>  
>>  ENTRY(exception_table)
>> diff -r 1c69c938f641 -r b15d3ae525af xen/include/asm-x86/processor.h
>> --- a/xen/include/asm-x86/processor.h
>> +++ b/xen/include/asm-x86/processor.h
>> @@ -517,6 +517,7 @@ void do_ ## _name(struct cpu_user_regs *
>>  DECLARE_TRAP_HANDLER(divide_error);
>>  DECLARE_TRAP_HANDLER(debug);
>>  DECLARE_TRAP_HANDLER(nmi);
>> +DECLARE_TRAP_HANDLER(nmi_crash);
>>  DECLARE_TRAP_HANDLER(int3);
>>  DECLARE_TRAP_HANDLER(overflow);
>>  DECLARE_TRAP_HANDLER(bounds);
>> @@ -535,6 +536,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
>>  DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
>>  #undef DECLARE_TRAP_HANDLER
>>  
>> +void trap_nop(void);
>> +void enable_nmis(void);
>> +
>>  void syscall_enter(void);
>>  void sysenter_entry(void);
>>  void sysenter_eflags_saved(void);
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


end of thread, other threads:[~2012-12-06 11:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-04 18:16 [PATCH 0 of 4 RFC] Kexec path alteration Andrew Cooper
2012-12-04 18:16 ` [PATCH 1 of 4 RFC] x86/kexec: Change NMI and MCE handling on kexec path Andrew Cooper
2012-12-04 18:25   ` Andrew Cooper
2012-12-05  9:17   ` Jan Beulich
2012-12-05 10:05     ` Andrew Cooper
2012-12-06 11:36   ` Tim Deegan
2012-12-06 11:58     ` Andrew Cooper
2012-12-04 18:16 ` [PATCH 2 of 4 RFC] xen/panic: Introduce panic_in_progress Andrew Cooper
2012-12-04 18:16 ` [PATCH 3 of 4 RFC] x86/nmi: Prevent reentrant execution of the C nmi handler Andrew Cooper
2012-12-05  9:21   ` Jan Beulich
2012-12-05 10:47     ` Andrew Cooper
2012-12-06 11:39   ` Tim Deegan
2012-12-04 18:16 ` [PATCH 4 of 4 RFC] xen/nmi: DO NOT APPLY: debugkey to deliberatly invoke a reentrant NMI Andrew Cooper
