* [PATCH] x86/entry/64: De-Xen-ify our NMI code further
@ 2021-01-25  2:14 Lai Jiangshan
  2021-01-25  3:00 ` Andy Lutomirski
  0 siblings, 1 reply; 8+ messages in thread
From: Lai Jiangshan @ 2021-01-25  2:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steven Rostedt, Peter Zijlstra, Lai Jiangshan, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
the NMI code by changing paravirt code into native code and left a comment
about "inspecting RIP instead".  But "inspecting RIP instead" has never
actually happened; this patch completes it.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S | 46 +++++++++++----------------------------
 1 file changed, 13 insertions(+), 33 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..cb6b8a6c6652 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1268,32 +1268,12 @@ SYM_CODE_START(asm_exc_nmi)
 	je	nested_nmi
 
 	/*
-	 * Now test if the previous stack was an NMI stack.  This covers
-	 * the case where we interrupt an outer NMI after it clears
-	 * "NMI executing" but before IRET.  We need to be careful, though:
-	 * there is one case in which RSP could point to the NMI stack
-	 * despite there being no NMI active: naughty userspace controls
-	 * RSP at the very beginning of the SYSCALL targets.  We can
-	 * pull a fast one on naughty userspace, though: we program
-	 * SYSCALL to mask DF, so userspace cannot cause DF to be set
-	 * if it controls the kernel's RSP.  We set DF before we clear
-	 * "NMI executing".
+	 * Now test if we interrupt an outer NMI after it clears
+	 * "NMI executing" but before iret.
 	 */
-	lea	6*8(%rsp), %rdx
-	/* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
-	cmpq	%rdx, 4*8(%rsp)
-	/* If the stack pointer is above the NMI stack, this is a normal NMI */
-	ja	first_nmi
-
-	subq	$EXCEPTION_STKSZ, %rdx
-	cmpq	%rdx, 4*8(%rsp)
-	/* If it is below the NMI stack, it is a normal NMI */
-	jb	first_nmi
-
-	/* Ah, it is within the NMI stack. */
-
-	testb	$(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
-	jz	first_nmi	/* RSP was user controlled. */
+	movq	$nmi_executing_cleared, %rdx
+	cmpq	8(%rsp), %rdx
+	jne	first_nmi
 
 	/* This is a nested NMI. */
 
@@ -1438,16 +1418,16 @@ nmi_restore:
 	addq	$6*8, %rsp
 
 	/*
-	 * Clear "NMI executing".  Set DF first so that we can easily
-	 * distinguish the remaining code between here and IRET from
-	 * the SYSCALL entry and exit paths.
-	 *
-	 * We arguably should just inspect RIP instead, but I (Andy) wrote
-	 * this code when I had the misapprehension that Xen PV supported
-	 * NMIs, and Xen PV would break that approach.
+	 * Clear "NMI executing".  It also leaves a window after it before
+	 * iret which should be also considered to be "NMI executing" albeit
+	 * with "NMI executing" variable being zero.  So we should also check
+	 * the RIP after it when checking "NMI executing".  See the code
+	 * before nested_nmi.  No code is allowed to be added to between
+	 * clearing "NMI executing" and iret unless we check a larger window
+	 * with a range of RIPs instead of currently a single-RIP window.
 	 */
-	std
 	movq	$0, 5*8(%rsp)		/* clear "NMI executing" */
+nmi_executing_cleared:
 
 	/*
 	 * iretq reads the "iret" frame and exits the NMI stack in a
-- 
2.19.1.6.gb485710b



* Re: [PATCH] x86/entry/64: De-Xen-ify our NMI code further
  2021-01-25  2:14 [PATCH] x86/entry/64: De-Xen-ify our NMI code further Lai Jiangshan
@ 2021-01-25  3:00 ` Andy Lutomirski
  2021-01-25  7:45   ` [PATCH V2] " Lai Jiangshan
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2021-01-25  3:00 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, Steven Rostedt, Peter Zijlstra, Lai Jiangshan,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	X86 ML, H. Peter Anvin

On Sun, Jan 24, 2021 at 5:13 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> From: Lai Jiangshan <laijs@linux.alibaba.com>
>
> Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
> the NMI code by changing paravirt code into native code and left a comment
> about "inspecting RIP instead".  But "inspecting RIP instead" has never
> actually happened; this patch completes it.
>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/entry/entry_64.S | 46 +++++++++++----------------------------
>  1 file changed, 13 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cad08703c4ad..cb6b8a6c6652 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1268,32 +1268,12 @@ SYM_CODE_START(asm_exc_nmi)
>         je      nested_nmi
>
>         /*
> -        * Now test if the previous stack was an NMI stack.  This covers
> -        * the case where we interrupt an outer NMI after it clears
> -        * "NMI executing" but before IRET.  We need to be careful, though:
> -        * there is one case in which RSP could point to the NMI stack
> -        * despite there being no NMI active: naughty userspace controls
> -        * RSP at the very beginning of the SYSCALL targets.  We can
> -        * pull a fast one on naughty userspace, though: we program
> -        * SYSCALL to mask DF, so userspace cannot cause DF to be set
> -        * if it controls the kernel's RSP.  We set DF before we clear
> -        * "NMI executing".
> +        * Now test if we interrupt an outer NMI after it clears
> +        * "NMI executing" but before iret.

s/interrupt/interrupted

But let's make it a lot more clear:


Now test if we interrupted an outer NMI that just cleared "NMI
executing" and is about to IRET.  This is a single-instruction window.
This check does not handle the case in which we get a nested interrupt
(#MC, #VE, #VC, etc.) after clearing "NMI executing" but before the
outer NMI executes IRET.

> +       movq    $nmi_executing_cleared, %rdx
> +       cmpq    8(%rsp), %rdx
> +       jne     first_nmi

If we're okay with non-PIC code, then this is suboptimal -- you can
just compare directly.  But using PIC is polite, so that movq should
be a RIP-relative leaq.
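
Roughly, the two alternatives look like this (a sketch, not the literal
patch; note that cmpq takes a 32-bit sign-extended immediate, which works
here only because kernel text lives in the top 2 GiB of the address space):

	/* non-PIC: encode the absolute address as an immediate */
	cmpq	$nmi_executing_cleared, 8(%rsp)
	jne	first_nmi

	/* PIC: compute the address RIP-relatively, then compare */
	leaq	nmi_executing_cleared(%rip), %rdx
	cmpq	%rdx, 8(%rsp)
	jne	first_nmi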

>
>         /* This is a nested NMI. */
>
> @@ -1438,16 +1418,16 @@ nmi_restore:
>         addq    $6*8, %rsp
>
>         /*
> -        * Clear "NMI executing".  Set DF first so that we can easily
> -        * distinguish the remaining code between here and IRET from
> -        * the SYSCALL entry and exit paths.
> -        *
> -        * We arguably should just inspect RIP instead, but I (Andy) wrote
> -        * this code when I had the misapprehension that Xen PV supported
> -        * NMIs, and Xen PV would break that approach.
> +        * Clear "NMI executing".  It also leaves a window after it before
> +        * iret which should be also considered to be "NMI executing" albeit
> +        * with "NMI executing" variable being zero.  So we should also check
> +        * the RIP after it when checking "NMI executing".  See the code
> +        * before nested_nmi.  No code is allowed to be added to between
> +        * clearing "NMI executing" and iret unless we check a larger window
> +        * with a range of RIPs instead of currently a single-RIP window.

Let's simplify this comment:

Clear "NMI executing".  This leaves a window in which a nested NMI
could observe "NMI executing" cleared, and a nested NMI will detect
this by inspecting RIP.

>          */
> -       std
>         movq    $0, 5*8(%rsp)           /* clear "NMI executing" */
> +nmi_executing_cleared:
>

This should be local.  Let's call it .Lnmi_iret.  And add a comment:

.Lnmi_iret: /* must be immediately after clearing "NMI executing" */
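
The tail of the exit path would then look roughly like this (a condensed
sketch with the suggested naming, not the exact patch):

	movq	$0, 5*8(%rsp)		/* clear "NMI executing" */
.Lnmi_iret:	/* must be immediately after clearing "NMI executing" */
	iretq	/* a nested NMI that sees RIP == .Lnmi_iret treats this
		 * CPU as still being inside the NMI handler */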

>         /*
>          * iretq reads the "iret" frame and exits the NMI stack in a


* [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further
  2021-01-25  3:00 ` Andy Lutomirski
@ 2021-01-25  7:45   ` Lai Jiangshan
  2021-01-25 17:38     ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Lai Jiangshan @ 2021-01-25  7:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steven Rostedt, Peter Zijlstra, Lai Jiangshan, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
the NMI code by changing paravirt code into native code and left a comment
about "inspecting RIP instead".  But "inspecting RIP instead" has never
actually happened; this patch completes it.

Comments in the code are from Andy Lutomirski.  Thanks!

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
 1 file changed, 11 insertions(+), 33 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..21f67ea62341 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
 	je	nested_nmi
 
 	/*
-	 * Now test if the previous stack was an NMI stack.  This covers
-	 * the case where we interrupt an outer NMI after it clears
-	 * "NMI executing" but before IRET.  We need to be careful, though:
-	 * there is one case in which RSP could point to the NMI stack
-	 * despite there being no NMI active: naughty userspace controls
-	 * RSP at the very beginning of the SYSCALL targets.  We can
-	 * pull a fast one on naughty userspace, though: we program
-	 * SYSCALL to mask DF, so userspace cannot cause DF to be set
-	 * if it controls the kernel's RSP.  We set DF before we clear
-	 * "NMI executing".
+	 * Now test if we interrupted an outer NMI that just cleared "NMI
+	 * executing" and is about to IRET.  This is a single-instruction
+	 * window.  This check does not handle the case in which we get a
+	 * nested interrupt (#MC, #VE, #VC, etc.) after clearing
+	 * "NMI executing" but before the outer NMI executes IRET.
 	 */
-	lea	6*8(%rsp), %rdx
-	/* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
-	cmpq	%rdx, 4*8(%rsp)
-	/* If the stack pointer is above the NMI stack, this is a normal NMI */
-	ja	first_nmi
-
-	subq	$EXCEPTION_STKSZ, %rdx
-	cmpq	%rdx, 4*8(%rsp)
-	/* If it is below the NMI stack, it is a normal NMI */
-	jb	first_nmi
-
-	/* Ah, it is within the NMI stack. */
-
-	testb	$(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
-	jz	first_nmi	/* RSP was user controlled. */
+	cmpq	$.Lnmi_iret, 8(%rsp)
+	jne	first_nmi
 
 	/* This is a nested NMI. */
 
@@ -1438,17 +1420,13 @@ nmi_restore:
 	addq	$6*8, %rsp
 
 	/*
-	 * Clear "NMI executing".  Set DF first so that we can easily
-	 * distinguish the remaining code between here and IRET from
-	 * the SYSCALL entry and exit paths.
-	 *
-	 * We arguably should just inspect RIP instead, but I (Andy) wrote
-	 * this code when I had the misapprehension that Xen PV supported
-	 * NMIs, and Xen PV would break that approach.
+	 * Clear "NMI executing".  This leaves a window in which a nested NMI
+	 * could observe "NMI executing" cleared, and a nested NMI will detect
+	 * this by inspecting RIP.
 	 */
-	std
 	movq	$0, 5*8(%rsp)		/* clear "NMI executing" */
 
+.Lnmi_iret: /* must be immediately after clearing "NMI executing" */
 	/*
 	 * iretq reads the "iret" frame and exits the NMI stack in a
 	 * single instruction.  We are returning to kernel mode, so this
-- 
2.19.1.6.gb485710b



* Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further
  2021-01-25  7:45   ` [PATCH V2] " Lai Jiangshan
@ 2021-01-25 17:38     ` Steven Rostedt
  2021-01-25 17:51       ` Andy Lutomirski
  2021-01-25 17:52       ` [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further Steven Rostedt
  0 siblings, 2 replies; 8+ messages in thread
From: Steven Rostedt @ 2021-01-25 17:38 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Lai Jiangshan, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On Mon, 25 Jan 2021 15:45:06 +0800
Lai Jiangshan <jiangshanlai@gmail.com> wrote:

> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
> the NMI code by changing paravirt code into native code and left a comment
> about "inspecting RIP instead".  But "inspecting RIP instead" has never
> actually happened; this patch completes it.
> 
> Comments in the code are from Andy Lutomirski.  Thanks!
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
>  1 file changed, 11 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cad08703c4ad..21f67ea62341 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
>  	je	nested_nmi
>  
>  	/*
> -	 * Now test if the previous stack was an NMI stack.  This covers
> -	 * the case where we interrupt an outer NMI after it clears
> -	 * "NMI executing" but before IRET.  We need to be careful, though:
> -	 * there is one case in which RSP could point to the NMI stack
> -	 * despite there being no NMI active: naughty userspace controls
> -	 * RSP at the very beginning of the SYSCALL targets.  We can
> -	 * pull a fast one on naughty userspace, though: we program
> -	 * SYSCALL to mask DF, so userspace cannot cause DF to be set
> -	 * if it controls the kernel's RSP.  We set DF before we clear
> -	 * "NMI executing".
> +	 * Now test if we interrupted an outer NMI that just cleared "NMI
> +	 * executing" and is about to IRET.  This is a single-instruction
> +	 * window.  This check does not handle the case in which we get a
> +	 * nested interrupt (#MC, #VE, #VC, etc.) after clearing
> +	 * "NMI executing" but before the outer NMI executes IRET.
>  	 */
> -	lea	6*8(%rsp), %rdx
> -	/* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
> -	cmpq	%rdx, 4*8(%rsp)
> -	/* If the stack pointer is above the NMI stack, this is a normal NMI */
> -	ja	first_nmi
> -
> -	subq	$EXCEPTION_STKSZ, %rdx
> -	cmpq	%rdx, 4*8(%rsp)
> -	/* If it is below the NMI stack, it is a normal NMI */
> -	jb	first_nmi
> -
> -	/* Ah, it is within the NMI stack. */
> -
> -	testb	$(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
> -	jz	first_nmi	/* RSP was user controlled. */

So we no longer check to see if the current stack is on the NMI stack.
Makes sense, since this beginning of the NMI code cannot be interrupted,
as no breakpoints or faults can occur there. The "NMI executing" variable
is set in all other locations except for:

  repeat_nmi - end_repeat_nmi
  and for the iretq itself (.Lnmi_iret).

Thus, by just checking the "NMI executing" variable on the stack along
with the RIP compared to repeat_nmi-end_repeat_nmi and .Lnmi_iret, we can
safely tell if the NMI is nested or not.
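
In other words, the complete nested-NMI test with this patch applied
reduces to three checks, roughly (a condensed sketch of the sequence in
entry_64.S, not a literal excerpt):

	/* Is RIP within repeat_nmi..end_repeat_nmi?  Then we're nested. */
	movq	$repeat_nmi, %rdx
	cmpq	8(%rsp), %rdx
	ja	1f
	movq	$end_repeat_nmi, %rdx
	cmpq	8(%rsp), %rdx
	ja	nested_nmi_out
1:
	/* Is the "NMI executing" variable set?  Then we're nested. */
	cmpl	$1, -8(%rsp)
	je	nested_nmi

	/* Did we interrupt the outer NMI's iretq?  Then we're nested. */
	cmpq	$.Lnmi_iret, 8(%rsp)
	jne	first_nmi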

I was working on rewriting the beginning comments to reflect these updates,
and discovered a possible bug that exists (unrelated to this patch).

> +	cmpq	$.Lnmi_iret, 8(%rsp)
> +	jne	first_nmi
>  

On triggering an NMI from user space, I see the switch to the thread stack
is done, and "exc_nmi" is called.

The problem I see with this is that exc_nmi is called on the thread
stack; if it were to take an exception, NMIs would be enabled, allowing
a nested NMI to run. From what I can tell, I don't see anything stopping
that NMI from executing over the currently running NMI. That is, this means
that NMI handlers are now re-entrant.

Yes, the stack issue is not a problem here, but NMI handlers are not
allowed to be re-entrant. For example, we have spin locks in NMI handlers
that are considered fine if they are only used in NMI handlers. But because
there's a possible way to make NMI handlers re-entrant, these spin locks
can deadlock.

I'm guessing that we need to add some tricks to the user space path to
set and clear the "NMI executing" variable, but clearing it on the return
path without races may become a bit complex.

-- Steve


* Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further
  2021-01-25 17:38     ` Steven Rostedt
@ 2021-01-25 17:51       ` Andy Lutomirski
  2021-01-25 18:16         ` Steven Rostedt
  2021-01-25 18:36         ` [PATCH] x86_64: Update the NMI handler nesting logic comment Steven Rostedt
  2021-01-25 17:52       ` [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further Steven Rostedt
  1 sibling, 2 replies; 8+ messages in thread
From: Andy Lutomirski @ 2021-01-25 17:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Lai Jiangshan, linux-kernel, Peter Zijlstra, Lai Jiangshan,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin


> On Jan 25, 2021, at 9:39 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Mon, 25 Jan 2021 15:45:06 +0800
> Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> 
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>> 
>> Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
>> the NMI code by changing paravirt code into native code and left a comment
>> about "inspecting RIP instead".  But "inspecting RIP instead" has never
>> actually happened; this patch completes it.
>> 
>> Comments in the code are from Andy Lutomirski.  Thanks!
>> 
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>> arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
>> 1 file changed, 11 insertions(+), 33 deletions(-)
>> 
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index cad08703c4ad..21f67ea62341 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
>>    je    nested_nmi
>> 
>>    /*
>> -     * Now test if the previous stack was an NMI stack.  This covers
>> -     * the case where we interrupt an outer NMI after it clears
>> -     * "NMI executing" but before IRET.  We need to be careful, though:
>> -     * there is one case in which RSP could point to the NMI stack
>> -     * despite there being no NMI active: naughty userspace controls
>> -     * RSP at the very beginning of the SYSCALL targets.  We can
>> -     * pull a fast one on naughty userspace, though: we program
>> -     * SYSCALL to mask DF, so userspace cannot cause DF to be set
>> -     * if it controls the kernel's RSP.  We set DF before we clear
>> -     * "NMI executing".
>> +     * Now test if we interrupted an outer NMI that just cleared "NMI
>> +     * executing" and is about to IRET.  This is a single-instruction
>> +     * window.  This check does not handle the case in which we get a
>> +     * nested interrupt (#MC, #VE, #VC, etc.) after clearing
>> +     * "NMI executing" but before the outer NMI executes IRET.
>>     */
>> -    lea    6*8(%rsp), %rdx
>> -    /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
>> -    cmpq    %rdx, 4*8(%rsp)
>> -    /* If the stack pointer is above the NMI stack, this is a normal NMI */
>> -    ja    first_nmi
>> -
>> -    subq    $EXCEPTION_STKSZ, %rdx
>> -    cmpq    %rdx, 4*8(%rsp)
>> -    /* If it is below the NMI stack, it is a normal NMI */
>> -    jb    first_nmi
>> -
>> -    /* Ah, it is within the NMI stack. */
>> -
>> -    testb    $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
>> -    jz    first_nmi    /* RSP was user controlled. */
> 
> So we no longer check to see if the current stack is on the NMI stack.
> Makes sense, since this beginning of the NMI code cannot be interrupted,
> as no breakpoints or faults can occur there. The "NMI executing" variable
> is set in all other locations except for:
> 
>  repeat_nmi - end_repeat_nmi
>  and for the iretq itself (.Lnmi_iret).
> 
> Thus, by just checking the "NMI executing" variable on the stack along
> with the RIP compared to repeat_nmi-end_repeat_nmi and .Lnmi_iret, we can
> safely tell if the NMI is nested or not.
> 
> I was working on rewriting the beginning comments to reflect these updates,
> and discovered a possible bug that exists (unrelated to this patch).
> 
>> +    cmpq    $.Lnmi_iret, 8(%rsp)
>> +    jne    first_nmi
>> 
> 
> On triggering an NMI from user space, I see the switch to the thread stack
> is done, and "exc_nmi" is called.
> 
> The problem I see with this is that exc_nmi is called on the thread
> stack; if it were to take an exception, NMIs would be enabled, allowing
> a nested NMI to run. From what I can tell, I don't see anything stopping
> that NMI from executing over the currently running NMI. That is, this means
> that NMI handlers are now re-entrant.

That was intentional on my part. The C code checks for this condition and handles it, just like it does on 32-bit kernels.
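
(For reference, the handling Andy refers to is the per-CPU latching at the
top of exc_nmi in arch/x86/kernel/nmi.c -- an abridged sketch, not the
verbatim source:)

	enum nmi_states { NMI_NOT_RUNNING = 0, NMI_EXECUTING, NMI_LATCHED };
	static DEFINE_PER_CPU(enum nmi_states, nmi_state);

	DEFINE_IDTENTRY_RAW(exc_nmi)
	{
		if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
			/* Re-entered: latch the NMI and bail out */
			this_cpu_write(nmi_state, NMI_LATCHED);
			return;
		}
		this_cpu_write(nmi_state, NMI_EXECUTING);
	nmi_restart:
		/* ... dispatch to the registered NMI handlers ... */

		/* If another NMI was latched meanwhile, replay it now */
		if (this_cpu_dec_return(nmi_state))
			goto nmi_restart;
	}

So a re-entry on the thread stack only latches the NMI and returns; the
first entry then replays it.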


* Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further
  2021-01-25 17:38     ` Steven Rostedt
  2021-01-25 17:51       ` Andy Lutomirski
@ 2021-01-25 17:52       ` Steven Rostedt
  1 sibling, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2021-01-25 17:52 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Lai Jiangshan, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin

On Mon, 25 Jan 2021 12:38:59 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On triggering an NMI from user space, I see the switch to the thread stack
> is done, and "exc_nmi" is called.
> 
> The problem I see with this is that exc_nmi is called on the thread
> stack; if it were to take an exception, NMIs would be enabled, allowing
> a nested NMI to run. From what I can tell, I don't see anything stopping
> that NMI from executing over the currently running NMI. That is, this means
> that NMI handlers are now re-entrant.
> 
> Yes, the stack issue is not a problem here, but NMI handlers are not
> allowed to be re-entrant. For example, we have spin locks in NMI handlers
> that are considered fine if they are only used in NMI handlers. But because
> there's a possible way to make NMI handlers re-entrant, these spin locks
> can deadlock.
> 
> I'm guessing that we need to add some tricks to the user space path to
> set and clear the "NMI executing" variable, but clearing it on the return
> path without races may become a bit complex.

I think this may work if we wrap the exc_nmi call with the following:

Overwrite the NMI HW stack frame on the NMI stack as if an NMI came in at
the user-space return path of the NMI handler. Set the stack pointer to the
NMI stack just after the first frame that was updated. Then jump to
asm_exc_nmi.

The code would then act as if the NMI came in from kernel mode and execute
the NMI nesting code normally. When it finishes and does the iretq, it will
return to the NMI handler's user-space return path on the kernel thread
stack, where the special code for returning to user space can be called.

The exc_nmi C code will need to handle this case and update pt_regs to make
sure the registered NMI handlers still see the pt_regs from user space. But
I think something like this may be the easiest way to handle this without
dealing with more NMI stack nesting races.

I could try to write something up to implement this.

-- Steve


* Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further
  2021-01-25 17:51       ` Andy Lutomirski
@ 2021-01-25 18:16         ` Steven Rostedt
  2021-01-25 18:36         ` [PATCH] x86_64: Update the NMI handler nesting logic comment Steven Rostedt
  1 sibling, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2021-01-25 18:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Lai Jiangshan, linux-kernel, Peter Zijlstra, Lai Jiangshan,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin

On Mon, 25 Jan 2021 09:51:45 -0800
Andy Lutomirski <luto@amacapital.net> wrote:

> > The problem I see with this is that exc_nmi is called on the thread
> > stack; if it were to take an exception, NMIs would be enabled, allowing
> > a nested NMI to run. From what I can tell, I don't see anything stopping
> > that NMI from executing over the currently running NMI. That is, this means
> > that NMI handlers are now re-entrant.
> 
> That was intentional on my part. The C code checks for this condition and handles it, just like it does on 32-bit kernels.

I vaguely remember you implementing this, and me reviewing it. I'm getting
to that age where there's less and less I remember doing :-/

I'll include a comment about that in my rewrite. But first, I'll re-review
your changes to make sure there are no one-offs in the mixture of NMI stack
handling and the x86_32 version doing the same thing.

Thanks for the reminder.

-- Steve


* [PATCH] x86_64: Update the NMI handler nesting logic comment
  2021-01-25 17:51       ` Andy Lutomirski
  2021-01-25 18:16         ` Steven Rostedt
@ 2021-01-25 18:36         ` Steven Rostedt
  1 sibling, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2021-01-25 18:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Lai Jiangshan, linux-kernel, Peter Zijlstra, Lai Jiangshan,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

The NMI handler on x86 needs to deal with nested NMIs because if an NMI
takes an exception (page fault or breakpoint), the exception handler
triggers an iretq, which unlatches NMIs in the hardware and allows another
NMI to take place.

The current code uses a number of workarounds to handle this case, and over
time it has changed to become safer and more efficient. But the comment has
not kept up with the current logic. That needs to be fixed.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---

This is added on top of Lai's patch.
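
For reference, the stack layout these comments keep referring to is the
diagram that lives a little higher up in entry_64.S; an abridged sketch
(the 5*8(%rsp) / -8(%rsp) style offsets in the code index into this
layout):

 * +---------------------------------------------------------+
 * | original SS                                             |
 * | original Return RSP                                     |
 * | original RFLAGS                                         |
 * | original CS                                             |
 * | original RIP                                            |
 * +---------------------------------------------------------+
 * | temp storage for rdx                                    |
 * +---------------------------------------------------------+
 * | "NMI executing" variable                                |
 * +---------------------------------------------------------+
 * | iret SS          } copied from "outermost" frame on     |
 * | iret Return RSP  } each loop iteration; overwritten by  |
 * | iret RFLAGS      } a nested NMI to force another        |
 * | iret CS          } iteration if needed                  |
 * | iret RIP         }                                      |
 * +---------------------------------------------------------+
 * | outermost SS          } initialized in first_nmi;       |
 * | outermost Return RSP  } copied to the "iret" frame and  |
 * | outermost RFLAGS      } otherwise never changed         |
 * | outermost CS          }                                 |
 * | outermost RIP         }                                 |
 * +---------------------------------------------------------+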

 arch/x86/entry/entry_64.S | 68 +++++++++++++++++++++++++--------------
 1 file changed, 43 insertions(+), 25 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 21f67ea62341..962e0bfb69dd 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1117,33 +1117,46 @@ SYM_CODE_START(asm_exc_nmi)
 	 * stack of the previous NMI. NMI handlers are not re-entrant
 	 * anyway.
 	 *
-	 * To handle this case we do the following:
-	 *  Check the a special location on the stack that contains
-	 *  a variable that is set when NMIs are executing.
-	 *  The interrupted task's stack is also checked to see if it
-	 *  is an NMI stack.
-	 *  If the variable is not set and the stack is not the NMI
-	 *  stack then:
-	 *    o Set the special variable on the stack
-	 *    o Copy the interrupt frame into an "outermost" location on the
-	 *      stack
-	 *    o Copy the interrupt frame into an "iret" location on the stack
-	 *    o Continue processing the NMI
-	 *  If the variable is set or the previous stack is the NMI stack:
-	 *    o Modify the "iret" location to jump to the repeat_nmi
-	 *    o return back to the first NMI
+	 * To handle this case, the following is performed:
 	 *
-	 * Now on exit of the first NMI, we first clear the stack variable
-	 * The NMI stack will tell any nested NMIs at that point that it is
-	 * nested. Then we pop the stack normally with iret, and if there was
-	 * a nested NMI that updated the copy interrupt stack frame, a
-	 * jump will be made to the repeat_nmi code that will handle the second
-	 * NMI.
+	 *  If the NMI came in from user space, then the stack used is the
+	 *   kernel thread stack, and the C code of the nmi handler handles
+	 *   NMI nesting just like it is done for x86_32. This also allows
+	 *   NMIs from user space to handle the issues that ISTs have
+	 *   when going back to user space.
 	 *
-	 * However, espfix prevents us from directly returning to userspace
-	 * with a single IRET instruction.  Similarly, IRET to user mode
-	 * can fault.  We therefore handle NMIs from user space like
-	 * other IST entries.
+	 *  If the NMI came in from kernel space, then perform the following:
+	 *   Add a variable at a specific location on the stack which is
+	 *    called "NMI executing". This variable will be set as early
+	 *    in the process as possible, and cleared just before the iretq.
+	 *   Copy the hardware stack frame twice onto the stack.
+	 *    The original hardware stack frame will be overwritten by the
+	 *	hardware if any nested NMIs occur, so it must not be used.
+	 *    The first copy will be used by the iretq of the NMI handler.
+	 *      If a nested NMI arrives, it will update this copy to have
+	 *      the interrupted NMI return to the "repeat_nmi" location,
+	 *      which will execute the next NMI when the first NMI finishes.
+	 *    The second copy is used to reset the first copy of the NMI stack
+	 *	frame to return to the original location of the first NMI.
+	 *
+	 *  On entering the NMI handler, if the "NMI executing" variable is set
+	 *   or the RIP of the interrupted location is located between
+	 *   the labels of "repeat_nmi" and "end_repeat_nmi" or is located
+	 *   at the iretq (.Lnmi_iret) then the current NMI had interrupted
+	 *   another NMI handler. It will update the first copy of the
+	 *   stack frame to return to "repeat_nmi" and exit.
+	 *
+	 *  When the first NMI exits, it clears the "NMI executing" variable
+	 *  and immediately calls iretq. If this NMI had triggered an exception
+	 *  that executed an iretq, allowing a nested NMI to come in, then
+	 *  that nested NMI would have updated the stack frame that the iretq
+	 *  is executing to jump to the "repeat_nmi" label. The repeat_nmi
+	 *  code is responsible for setting the "NMI executing" variable and
+	 *  copying the second copy of the stack frame to the first copy.
+	 *
+	 *  NB the repeated NMI still has NMIs enabled from the start, and
+	 *   an NMI can come in even after the first iretq jumps to the
+	 *   repeat_nmi code.
 	 */
 
 	ASM_CLAC
@@ -1193,6 +1206,11 @@ SYM_CODE_START(asm_exc_nmi)
 	call	exc_nmi
 
 	/*
+	 * However, espfix prevents us from directly returning to userspace
+	 * with a single IRET instruction.  Similarly, IRET to user mode
+	 * can fault.  We therefore handle NMIs from user space like
+	 * other IST entries.
+	 *
 	 * Return back to user mode.  We must *not* do the normal exit
 	 * work, because we don't want to enable interrupts.
 	 */
-- 
2.25.4

