Re: [PATCH] x86/entry/64: De-Xen-ify our NMI code further

From: Andy Lutomirski <luto@kernel.org>
To: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Lai Jiangshan <laijs@linux.alibaba.com>,
	Andy Lutomirski <luto@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	X86 ML <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [PATCH] x86/entry/64: De-Xen-ify our NMI code further
Date: Sun, 24 Jan 2021 19:00:46 -0800	[thread overview]
Message-ID: <CALCETrW1qP=vbHCSdgOLjjP+-i=io3o1w5bMdtH_UHSV3gvBXg@mail.gmail.com> (raw)
In-Reply-To: <20210125021435.16646-1-jiangshanlai@gmail.com>

On Sun, Jan 24, 2021 at 5:13 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> From: Lai Jiangshan <laijs@linux.alibaba.com>
>
> The commit 929bacec21478("x86/entry/64: De-Xen-ify our NMI code") simplified
> the NMI code by changing paravirt code into native code and left a comment
> about "inspecting RIP instead".  But until now, "inspecting RIP instead"
> has not been made happened and this patch tries to complete it.
>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/entry/entry_64.S | 46 +++++++++++----------------------------
>  1 file changed, 13 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cad08703c4ad..cb6b8a6c6652 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1268,32 +1268,12 @@ SYM_CODE_START(asm_exc_nmi)
>         je      nested_nmi
>
>         /*
> -        * Now test if the previous stack was an NMI stack.  This covers
> -        * the case where we interrupt an outer NMI after it clears
> -        * "NMI executing" but before IRET.  We need to be careful, though:
> -        * there is one case in which RSP could point to the NMI stack
> -        * despite there being no NMI active: naughty userspace controls
> -        * RSP at the very beginning of the SYSCALL targets.  We can
> -        * pull a fast one on naughty userspace, though: we program
> -        * SYSCALL to mask DF, so userspace cannot cause DF to be set
> -        * if it controls the kernel's RSP.  We set DF before we clear
> -        * "NMI executing".
> +        * Now test if we interrupt an outer NMI after it clears
> +        * "NMI executing" but before iret.

s/interrupt/interrupted

But let's make it a lot more clear:

Now test if we interrupted an outer NMI that just cleared "NMI
executing" and is about to IRET.  This is a single-instruction window.
This check does not handle the case in which we get a nested interrupt
(#MC, #VE, #VC, etc.) after clearing "NMI executing" but before the
outer NMI executes IRET.

> +       movq    $nmi_executing_cleared, %rdx
> +       cmpq    8(%rsp), %rdx
> +       jne     first_nmi

If we're okay with non-PIC code, then this is suboptimal -- you can
just compare directly.  But using PIC is polite, so that movq should
be a RIP-relative leaq.

>
>         /* This is a nested NMI. */
>
> @@ -1438,16 +1418,16 @@ nmi_restore:
>         addq    $6*8, %rsp
>
>         /*
> -        * Clear "NMI executing".  Set DF first so that we can easily
> -        * distinguish the remaining code between here and IRET from
> -        * the SYSCALL entry and exit paths.
> -        *
> -        * We arguably should just inspect RIP instead, but I (Andy) wrote
> -        * this code when I had the misapprehension that Xen PV supported
> -        * NMIs, and Xen PV would break that approach.
> +        * Clear "NMI executing".  It also leaves a window after it before
> +        * iret which should be also considered to be "NMI executing" albeit
> +        * with "NMI executing" variable being zero.  So we should also check
> +        * the RIP after it when checking "NMI executing".  See the code
> +        * before nested_nmi.  No code is allowed to be added to between
> +        * clearing "NMI executing" and iret unless we check a larger window
> +        * with a range of RIPs instead of currently a single-RIP window.

Let's simplify this comment:

Clear "NMI executing".  This leaves a window in which a nested NMI
could observe "NMI executing" cleared, and a nested NMI will detect
this by inspecting RIP.

>          */
> -       std
>         movq    $0, 5*8(%rsp)           /* clear "NMI executing" */
> +nmi_executing_cleared:
>

This should be local.  Let's call it .Lnmi_iret.  And add a comment:

.Lnmi_iret: /* must be immediately after clearing "NMI executing" */

>         /*
>          * iretq reads the "iret" frame and exits the NMI stack in a