[PATCH 0/3] x86/sev-es: Check for trusted regs->sp in __sev_es_ist

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] x86/sev-es: Check for trusted regs->sp in __sev_es_ist_enter()
@ 2021-02-17 12:01 ` Joerg Roedel
  0 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-17 12:01 UTC (permalink / raw)
  To: x86
  Cc: Joerg Roedel, Joerg Roedel, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, David Rientjes, Cfir Cohen,
	Erdem Aktas, Masami Hiramatsu, Mike Stunes, Sean Christopherson,
	Martin Radev, Arvind Sankar, linux-kernel, kvm, virtualization

From: Joerg Roedel <jroedel@suse.de>

Hi,

here are some changes to the Linux SEV-ES code to check whether the
value in regs->sp can be trusted, before checking whether it points to
the #VC IST stack.

Andy Lutomirski reported that it is entirely possible to reach this
function with a regs->sp value which was set by user-space. So check
for this condition and don't use regs->sp if it can't be trusted.

Also improve the comments around __sev_es_ist_enter/exit() to better
explain what these function do and why they are there.

Please review.

Thanks,

	Joerg

Joerg Roedel (3):
  x86/sev-es: Introduce from_syscall_gap() helper
  x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST
    stack
  x86/sev-es: Improve comments in and around __sev_es_ist_enter/exit()

 arch/x86/include/asm/ptrace.h |  8 ++++++++
 arch/x86/kernel/sev-es.c      | 27 +++++++++++++++++++--------
 arch/x86/kernel/traps.c       |  3 +--
 3 files changed, 28 insertions(+), 10 deletions(-)

-- 
2.30.0

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 0/3] x86/sev-es: Check for trusted regs->sp in __sev_es_ist_enter()
@ 2021-02-17 12:01 ` Joerg Roedel
  0 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-17 12:01 UTC (permalink / raw)
  To: x86
  Cc: kvm, Peter Zijlstra, Dave Hansen, virtualization, Arvind Sankar,
	hpa, Jiri Slaby, Joerg Roedel, David Rientjes, Martin Radev,
	Tom Lendacky, Joerg Roedel, Kees Cook, Cfir Cohen,
	Andy Lutomirski, Dan Williams, Juergen Gross, Mike Stunes,
	Sean Christopherson, linux-kernel, Masami Hiramatsu, Erdem Aktas

From: Joerg Roedel <jroedel@suse.de>

Hi,

here are some changes to the Linux SEV-ES code to check whether the
value in regs->sp can be trusted, before checking whether it points to
the #VC IST stack.

Andy Lutomirski reported that it is entirely possible to reach this
function with a regs->sp value which was set by user-space. So check
for this condition and don't use regs->sp if it can't be trusted.

Also improve the comments around __sev_es_ist_enter/exit() to better
explain what these function do and why they are there.

Please review.

Thanks,

	Joerg

Joerg Roedel (3):
  x86/sev-es: Introduce from_syscall_gap() helper
  x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST
    stack
  x86/sev-es: Improve comments in and around __sev_es_ist_enter/exit()

 arch/x86/include/asm/ptrace.h |  8 ++++++++
 arch/x86/kernel/sev-es.c      | 27 +++++++++++++++++++--------
 arch/x86/kernel/traps.c       |  3 +--
 3 files changed, 28 insertions(+), 10 deletions(-)

-- 
2.30.0

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] x86/sev-es: Introduce from_syscall_gap() helper
  2021-02-17 12:01 ` Joerg Roedel
@ 2021-02-17 12:01   ` Joerg Roedel
  -1 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-17 12:01 UTC (permalink / raw)
  To: x86
  Cc: Joerg Roedel, Joerg Roedel, stable, hpa, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, David Rientjes,
	Cfir Cohen, Erdem Aktas, Masami Hiramatsu, Mike Stunes,
	Sean Christopherson, Martin Radev, Arvind Sankar, linux-kernel,
	kvm, virtualization

From: Joerg Roedel <jroedel@suse.de>

Introduce a helper to check whether an exception came from the syscall
gap and use it in the SEV-ES code

Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/ptrace.h | 8 ++++++++
 arch/x86/kernel/traps.c       | 3 +--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index d8324a236696..14854b2c4944 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -94,6 +94,8 @@ struct pt_regs {
 #include <asm/paravirt_types.h>
 #endif
 
+#include <asm/proto.h>
+
 struct cpuinfo_x86;
 struct task_struct;
 
@@ -175,6 +177,12 @@ static inline bool any_64bit_mode(struct pt_regs *regs)
 #ifdef CONFIG_X86_64
 #define current_user_stack_pointer()	current_pt_regs()->sp
 #define compat_user_stack_pointer()	current_pt_regs()->sp
+
+static inline bool from_syscall_gap(struct pt_regs *regs)
+{
+	return (regs->ip >= (unsigned long)entry_SYSCALL_64 &&
+		regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack);
+}
 #endif
 
 static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7f5aec758f0e..b4f2b4e9066d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -694,8 +694,7 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 	 * In the SYSCALL entry path the RSP value comes from user-space - don't
 	 * trust it and switch to the current kernel stack
 	 */
-	if (regs->ip >= (unsigned long)entry_SYSCALL_64 &&
-	    regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack) {
+	if (from_syscall_gap(regs)) {
 		sp = this_cpu_read(cpu_current_top_of_stack);
 		goto sync;
 	}
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 1/3] x86/sev-es: Introduce from_syscall_gap() helper
@ 2021-02-17 12:01   ` Joerg Roedel
  0 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-17 12:01 UTC (permalink / raw)
  To: x86
  Cc: kvm, Peter Zijlstra, Dave Hansen, virtualization, Arvind Sankar,
	hpa, Jiri Slaby, Joerg Roedel, David Rientjes, Martin Radev,
	Tom Lendacky, Joerg Roedel, Kees Cook, Cfir Cohen,
	Andy Lutomirski, Dan Williams, Juergen Gross, Mike Stunes,
	Sean Christopherson, linux-kernel, stable, Masami Hiramatsu,
	Erdem Aktas

From: Joerg Roedel <jroedel@suse.de>

Introduce a helper to check whether an exception came from the syscall
gap and use it in the SEV-ES code

Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/ptrace.h | 8 ++++++++
 arch/x86/kernel/traps.c       | 3 +--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index d8324a236696..14854b2c4944 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -94,6 +94,8 @@ struct pt_regs {
 #include <asm/paravirt_types.h>
 #endif
 
+#include <asm/proto.h>
+
 struct cpuinfo_x86;
 struct task_struct;
 
@@ -175,6 +177,12 @@ static inline bool any_64bit_mode(struct pt_regs *regs)
 #ifdef CONFIG_X86_64
 #define current_user_stack_pointer()	current_pt_regs()->sp
 #define compat_user_stack_pointer()	current_pt_regs()->sp
+
+static inline bool from_syscall_gap(struct pt_regs *regs)
+{
+	return (regs->ip >= (unsigned long)entry_SYSCALL_64 &&
+		regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack);
+}
 #endif
 
 static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7f5aec758f0e..b4f2b4e9066d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -694,8 +694,7 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 	 * In the SYSCALL entry path the RSP value comes from user-space - don't
 	 * trust it and switch to the current kernel stack
 	 */
-	if (regs->ip >= (unsigned long)entry_SYSCALL_64 &&
-	    regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack) {
+	if (from_syscall_gap(regs)) {
 		sp = this_cpu_read(cpu_current_top_of_stack);
 		goto sync;
 	}
-- 
2.30.0

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
  2021-02-17 12:01 ` Joerg Roedel
@ 2021-02-17 12:01   ` Joerg Roedel
  -1 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-17 12:01 UTC (permalink / raw)
  To: x86
  Cc: Joerg Roedel, Joerg Roedel, Andy Lutomirski, stable, hpa,
	Dave Hansen, Peter Zijlstra, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, David Rientjes,
	Cfir Cohen, Erdem Aktas, Masami Hiramatsu, Mike Stunes,
	Sean Christopherson, Martin Radev, Arvind Sankar, linux-kernel,
	kvm, virtualization

From: Joerg Roedel <jroedel@suse.de>

The code in the NMI handler to adjust the #VC handler IST stack is
needed in case an NMI hits when the #VC handler is still using its IST
stack.
But the check for this condition also needs to look if the regs->sp
value is trusted, meaning it was not set by user-space. Extend the
check to not use regs->sp when the NMI interrupted user-space code or
the SYSCALL gap.

Reported-by: Andy Lutomirski <luto@kernel.org>
Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 84c1821819af..0df38b185d53 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -144,7 +144,9 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
 	old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
 
 	/* Make room on the IST stack */
-	if (on_vc_stack(regs->sp))
+	if (on_vc_stack(regs->sp) &&
+	    !user_mode(regs) &&
+	    !from_syscall_gap(regs))
 		new_ist = ALIGN_DOWN(regs->sp, 8) - sizeof(old_ist);
 	else
 		new_ist = old_ist - sizeof(old_ist);
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
@ 2021-02-17 12:01   ` Joerg Roedel
  0 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-17 12:01 UTC (permalink / raw)
  To: x86
  Cc: kvm, Peter Zijlstra, Dave Hansen, virtualization, Arvind Sankar,
	hpa, Jiri Slaby, Joerg Roedel, David Rientjes, Martin Radev,
	Tom Lendacky, Joerg Roedel, Kees Cook, Cfir Cohen,
	Andy Lutomirski, Dan Williams, Juergen Gross, Mike Stunes,
	Sean Christopherson, linux-kernel, stable, Masami Hiramatsu,
	Erdem Aktas

From: Joerg Roedel <jroedel@suse.de>

The code in the NMI handler to adjust the #VC handler IST stack is
needed in case an NMI hits when the #VC handler is still using its IST
stack.
But the check for this condition also needs to look if the regs->sp
value is trusted, meaning it was not set by user-space. Extend the
check to not use regs->sp when the NMI interrupted user-space code or
the SYSCALL gap.

Reported-by: Andy Lutomirski <luto@kernel.org>
Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 84c1821819af..0df38b185d53 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -144,7 +144,9 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
 	old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
 
 	/* Make room on the IST stack */
-	if (on_vc_stack(regs->sp))
+	if (on_vc_stack(regs->sp) &&
+	    !user_mode(regs) &&
+	    !from_syscall_gap(regs))
 		new_ist = ALIGN_DOWN(regs->sp, 8) - sizeof(old_ist);
 	else
 		new_ist = old_ist - sizeof(old_ist);
-- 
2.30.0

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/3] x86/sev-es: Improve comments in and around __sev_es_ist_enter/exit()
  2021-02-17 12:01 ` Joerg Roedel
@ 2021-02-17 12:01   ` Joerg Roedel
  -1 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-17 12:01 UTC (permalink / raw)
  To: x86
  Cc: Joerg Roedel, Joerg Roedel, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, David Rientjes, Cfir Cohen,
	Erdem Aktas, Masami Hiramatsu, Mike Stunes, Sean Christopherson,
	Martin Radev, Arvind Sankar, linux-kernel, kvm, virtualization

From: Joerg Roedel <jroedel@suse.de>

Better explain why this code is necessary and what it is doing.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 0df38b185d53..79241bc45f25 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -127,14 +127,20 @@ static __always_inline bool on_vc_stack(unsigned long sp)
 }
 
 /*
- * This function handles the case when an NMI is raised in the #VC exception
- * handler entry code. In this case, the IST entry for #VC must be adjusted, so
- * that any subsequent #VC exception will not overwrite the stack contents of the
- * interrupted #VC handler.
+ * This function handles the case when an NMI is raised in the #VC
+ * exception handler entry code, before the #VC handler has switched off
+ * its IST stack. In this case, the IST entry for #VC must be adjusted,
+ * so that any nested #VC exception will not overwrite the stack
+ * contents of the interrupted #VC handler.
  *
  * The IST entry is adjusted unconditionally so that it can be also be
- * unconditionally adjusted back in sev_es_ist_exit(). Otherwise a nested
- * sev_es_ist_exit() call may adjust back the IST entry too early.
+ * unconditionally adjusted back in __sev_es_ist_exit(). Otherwise a
+ * nested sev_es_ist_exit() call may adjust back the IST entry too
+ * early.
+ *
+ * The __sev_es_ist_enter() and __sev_es_ist_exit() functions always run
+ * on the NMI IST stack, as they are only called from NMI handling code
+ * right now.
  */
 void noinstr __sev_es_ist_enter(struct pt_regs *regs)
 {
@@ -143,7 +149,10 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
 	/* Read old IST entry */
 	old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
 
-	/* Make room on the IST stack */
+	/*
+	 * Make room on the IST stack - Reserve 8 bytes to store the old
+	 * IST entry.
+	 */
 	if (on_vc_stack(regs->sp) &&
 	    !user_mode(regs) &&
 	    !from_syscall_gap(regs))
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/3] x86/sev-es: Improve comments in and around __sev_es_ist_enter/exit()
@ 2021-02-17 12:01   ` Joerg Roedel
  0 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-17 12:01 UTC (permalink / raw)
  To: x86
  Cc: kvm, Peter Zijlstra, Dave Hansen, virtualization, Arvind Sankar,
	hpa, Jiri Slaby, Joerg Roedel, David Rientjes, Martin Radev,
	Tom Lendacky, Joerg Roedel, Kees Cook, Cfir Cohen,
	Andy Lutomirski, Dan Williams, Juergen Gross, Mike Stunes,
	Sean Christopherson, linux-kernel, Masami Hiramatsu, Erdem Aktas

From: Joerg Roedel <jroedel@suse.de>

Better explain why this code is necessary and what it is doing.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 0df38b185d53..79241bc45f25 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -127,14 +127,20 @@ static __always_inline bool on_vc_stack(unsigned long sp)
 }
 
 /*
- * This function handles the case when an NMI is raised in the #VC exception
- * handler entry code. In this case, the IST entry for #VC must be adjusted, so
- * that any subsequent #VC exception will not overwrite the stack contents of the
- * interrupted #VC handler.
+ * This function handles the case when an NMI is raised in the #VC
+ * exception handler entry code, before the #VC handler has switched off
+ * its IST stack. In this case, the IST entry for #VC must be adjusted,
+ * so that any nested #VC exception will not overwrite the stack
+ * contents of the interrupted #VC handler.
  *
  * The IST entry is adjusted unconditionally so that it can be also be
- * unconditionally adjusted back in sev_es_ist_exit(). Otherwise a nested
- * sev_es_ist_exit() call may adjust back the IST entry too early.
+ * unconditionally adjusted back in __sev_es_ist_exit(). Otherwise a
+ * nested sev_es_ist_exit() call may adjust back the IST entry too
+ * early.
+ *
+ * The __sev_es_ist_enter() and __sev_es_ist_exit() functions always run
+ * on the NMI IST stack, as they are only called from NMI handling code
+ * right now.
  */
 void noinstr __sev_es_ist_enter(struct pt_regs *regs)
 {
@@ -143,7 +149,10 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
 	/* Read old IST entry */
 	old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
 
-	/* Make room on the IST stack */
+	/*
+	 * Make room on the IST stack - Reserve 8 bytes to store the old
+	 * IST entry.
+	 */
 	if (on_vc_stack(regs->sp) &&
 	    !user_mode(regs) &&
 	    !from_syscall_gap(regs))
-- 
2.30.0

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] x86/sev-es: Introduce from_syscall_gap() helper
  2021-02-17 12:01   ` Joerg Roedel
@ 2021-02-17 17:59     ` Borislav Petkov
  -1 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2021-02-17 17:59 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, Joerg Roedel, stable, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, David Rientjes, Cfir Cohen,
	Erdem Aktas, Masami Hiramatsu, Mike Stunes, Sean Christopherson,
	Martin Radev, Arvind Sankar, linux-kernel, kvm, virtualization

I guess subject prefix should be "x86/traps:" but I'll fix that up while
applying eventually.

On Wed, Feb 17, 2021 at 01:01:41PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Introduce a helper to check whether an exception came from the syscall
> gap and use it in the SEV-ES code
> 
> Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
> Cc: stable@vger.kernel.org # 5.10+
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/include/asm/ptrace.h | 8 ++++++++
>  arch/x86/kernel/traps.c       | 3 +--
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index d8324a236696..14854b2c4944 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -94,6 +94,8 @@ struct pt_regs {
>  #include <asm/paravirt_types.h>
>  #endif
>  
> +#include <asm/proto.h>
> +
>  struct cpuinfo_x86;
>  struct task_struct;
>  
> @@ -175,6 +177,12 @@ static inline bool any_64bit_mode(struct pt_regs *regs)
>  #ifdef CONFIG_X86_64
>  #define current_user_stack_pointer()	current_pt_regs()->sp
>  #define compat_user_stack_pointer()	current_pt_regs()->sp
> +
> +static inline bool from_syscall_gap(struct pt_regs *regs)

rip_within_syscall_gap() sounds kinda better to me and it is more
readable when you look at it at the usage site:

	if (rip_within_syscall_gap(regs))
		...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] x86/sev-es: Introduce from_syscall_gap() helper
@ 2021-02-17 17:59     ` Borislav Petkov
  0 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2021-02-17 17:59 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm, Peter Zijlstra, Dave Hansen, virtualization, Arvind Sankar,
	hpa, Jiri Slaby, x86, David Rientjes, Martin Radev, Tom Lendacky,
	Joerg Roedel, Kees Cook, Cfir Cohen, Andy Lutomirski,
	Dan Williams, Juergen Gross, Mike Stunes, Sean Christopherson,
	linux-kernel, stable, Masami Hiramatsu, Erdem Aktas

I guess subject prefix should be "x86/traps:" but I'll fix that up while
applying eventually.

On Wed, Feb 17, 2021 at 01:01:41PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Introduce a helper to check whether an exception came from the syscall
> gap and use it in the SEV-ES code
> 
> Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
> Cc: stable@vger.kernel.org # 5.10+
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/include/asm/ptrace.h | 8 ++++++++
>  arch/x86/kernel/traps.c       | 3 +--
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index d8324a236696..14854b2c4944 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -94,6 +94,8 @@ struct pt_regs {
>  #include <asm/paravirt_types.h>
>  #endif
>  
> +#include <asm/proto.h>
> +
>  struct cpuinfo_x86;
>  struct task_struct;
>  
> @@ -175,6 +177,12 @@ static inline bool any_64bit_mode(struct pt_regs *regs)
>  #ifdef CONFIG_X86_64
>  #define current_user_stack_pointer()	current_pt_regs()->sp
>  #define compat_user_stack_pointer()	current_pt_regs()->sp
> +
> +static inline bool from_syscall_gap(struct pt_regs *regs)

rip_within_syscall_gap() sounds kinda better to me and it is more
readable when you look at it at the usage site:

	if (rip_within_syscall_gap(regs))
		...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
  2021-02-17 12:01   ` Joerg Roedel
@ 2021-02-17 18:00     ` Borislav Petkov
  -1 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2021-02-17 18:00 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, Joerg Roedel, Andy Lutomirski, stable, hpa, Dave Hansen,
	Peter Zijlstra, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, David Rientjes, Cfir Cohen,
	Erdem Aktas, Masami Hiramatsu, Mike Stunes, Sean Christopherson,
	Martin Radev, Arvind Sankar, linux-kernel, kvm, virtualization

On Wed, Feb 17, 2021 at 01:01:42PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> The code in the NMI handler to adjust the #VC handler IST stack is
> needed in case an NMI hits when the #VC handler is still using its IST
> stack.
> But the check for this condition also needs to look if the regs->sp
> value is trusted, meaning it was not set by user-space. Extend the
> check to not use regs->sp when the NMI interrupted user-space code or
> the SYSCALL gap.
> 
> Reported-by: Andy Lutomirski <luto@kernel.org>
> Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
> Cc: stable@vger.kernel.org # 5.10+
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/sev-es.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 84c1821819af..0df38b185d53 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -144,7 +144,9 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
>  	old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
>  
>  	/* Make room on the IST stack */
> -	if (on_vc_stack(regs->sp))
> +	if (on_vc_stack(regs->sp) &&
> +	    !user_mode(regs) &&
> +	    !from_syscall_gap(regs))

Why not add those checks to on_vc_stack() directly? Because in it, you
can say:

on_vc_stack():

	/* user mode rSP is not trusted */
	if (user_mode())
		return false;

	/* ditto */
	if (ip_within_syscall_gap())
		return false;

	...

?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
@ 2021-02-17 18:00     ` Borislav Petkov
  0 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2021-02-17 18:00 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm, Peter Zijlstra, Dave Hansen, virtualization, Arvind Sankar,
	hpa, Jiri Slaby, x86, David Rientjes, Martin Radev, Tom Lendacky,
	Joerg Roedel, Kees Cook, Cfir Cohen, Andy Lutomirski,
	Dan Williams, Juergen Gross, Mike Stunes, Sean Christopherson,
	linux-kernel, stable, Masami Hiramatsu, Erdem Aktas

On Wed, Feb 17, 2021 at 01:01:42PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> The code in the NMI handler to adjust the #VC handler IST stack is
> needed in case an NMI hits when the #VC handler is still using its IST
> stack.
> But the check for this condition also needs to look if the regs->sp
> value is trusted, meaning it was not set by user-space. Extend the
> check to not use regs->sp when the NMI interrupted user-space code or
> the SYSCALL gap.
> 
> Reported-by: Andy Lutomirski <luto@kernel.org>
> Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
> Cc: stable@vger.kernel.org # 5.10+
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/sev-es.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 84c1821819af..0df38b185d53 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -144,7 +144,9 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
>  	old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
>  
>  	/* Make room on the IST stack */
> -	if (on_vc_stack(regs->sp))
> +	if (on_vc_stack(regs->sp) &&
> +	    !user_mode(regs) &&
> +	    !from_syscall_gap(regs))

Why not add those checks to on_vc_stack() directly? Because in it, you
can say:

on_vc_stack():

	/* user mode rSP is not trusted */
	if (user_mode())
		return false;

	/* ditto */
	if (ip_within_syscall_gap())
		return false;

	...

?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] x86/sev-es: Improve comments in and around __sev_es_ist_enter/exit()
  2021-02-17 12:01   ` Joerg Roedel
@ 2021-02-17 18:00     ` Borislav Petkov
  -1 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2021-02-17 18:00 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, Joerg Roedel, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, David Rientjes, Cfir Cohen,
	Erdem Aktas, Masami Hiramatsu, Mike Stunes, Sean Christopherson,
	Martin Radev, Arvind Sankar, linux-kernel, kvm, virtualization

On Wed, Feb 17, 2021 at 01:01:43PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Better explain why this code is necessary and what it is doing.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/sev-es.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 0df38b185d53..79241bc45f25 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -127,14 +127,20 @@ static __always_inline bool on_vc_stack(unsigned long sp)
>  }
>  
>  /*
> - * This function handles the case when an NMI is raised in the #VC exception
> - * handler entry code. In this case, the IST entry for #VC must be adjusted, so
> - * that any subsequent #VC exception will not overwrite the stack contents of the
> - * interrupted #VC handler.
> + * This function handles the case when an NMI is raised in the #VC
> + * exception handler entry code, before the #VC handler has switched off
> + * its IST stack. In this case, the IST entry for #VC must be adjusted,
> + * so that any nested #VC exception will not overwrite the stack
> + * contents of the interrupted #VC handler.
>   *
>   * The IST entry is adjusted unconditionally so that it can be also be
> - * unconditionally adjusted back in sev_es_ist_exit(). Otherwise a nested
> - * sev_es_ist_exit() call may adjust back the IST entry too early.
> + * unconditionally adjusted back in __sev_es_ist_exit(). Otherwise a
> + * nested sev_es_ist_exit() call may adjust back the IST entry too
> + * early.
> + *
> + * The __sev_es_ist_enter() and __sev_es_ist_exit() functions always run
> + * on the NMI IST stack, as they are only called from NMI handling code
> + * right now.
>   */
>  void noinstr __sev_es_ist_enter(struct pt_regs *regs)
>  {
> @@ -143,7 +149,10 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
>  	/* Read old IST entry */
>  	old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
>  
> -	/* Make room on the IST stack */
> +	/*
> +	 * Make room on the IST stack - Reserve 8 bytes to store the old
> +	 * IST entry.
> +	 */
>  	if (on_vc_stack(regs->sp) &&
>  	    !user_mode(regs) &&
>  	    !from_syscall_gap(regs))
> -- 

Yah, and then we probably should simplify this __sev_es_ist_enter()
function even more as it is not easy to grok.

For example, the ALIGN_DOWN(regs->sp, 8) is not really needed, right?

Also, both branches do "- sizeof(old_ist);" so you can just as well do
it unconditionally.

And the sizeof(old_ist) is just a confusing way to write 8, right? We're
64-bit only so there's no need for that, I'd say.

And then you probably should change the comments from

	/* Store old IST entry */

and

	/* Set new IST entry */

to something like:

 /*
  * If on the #VC IST stack, new_ist gets set to point one stack slot
  * further down from the #VC interrupt frame which has been pushed on
  * it during the first #VC exception entry.
  *
  * If not, simply the next slot on the #VC IST stack is set to point...

and here I'm not even sure why we're doing it?

The else branch, when we're not on the #VC stack, why are we doing

	new_ist = old_ist - sizeof(old_ist);

?

I mean, if the NMI handler causes a #VC exception, it will simply run on
the #VC IST stack so why do we have to do that - 8 thing at all?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] x86/sev-es: Improve comments in and around __sev_es_ist_enter/exit()
@ 2021-02-17 18:00     ` Borislav Petkov
  0 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2021-02-17 18:00 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm, Peter Zijlstra, Dave Hansen, virtualization, Arvind Sankar,
	hpa, Jiri Slaby, x86, David Rientjes, Martin Radev, Tom Lendacky,
	Joerg Roedel, Kees Cook, Cfir Cohen, Andy Lutomirski,
	Dan Williams, Juergen Gross, Mike Stunes, Sean Christopherson,
	linux-kernel, Masami Hiramatsu, Erdem Aktas

On Wed, Feb 17, 2021 at 01:01:43PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Better explain why this code is necessary and what it is doing.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/sev-es.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 0df38b185d53..79241bc45f25 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -127,14 +127,20 @@ static __always_inline bool on_vc_stack(unsigned long sp)
>  }
>  
>  /*
> - * This function handles the case when an NMI is raised in the #VC exception
> - * handler entry code. In this case, the IST entry for #VC must be adjusted, so
> - * that any subsequent #VC exception will not overwrite the stack contents of the
> - * interrupted #VC handler.
> + * This function handles the case when an NMI is raised in the #VC
> + * exception handler entry code, before the #VC handler has switched off
> + * its IST stack. In this case, the IST entry for #VC must be adjusted,
> + * so that any nested #VC exception will not overwrite the stack
> + * contents of the interrupted #VC handler.
>   *
>   * The IST entry is adjusted unconditionally so that it can be also be
> - * unconditionally adjusted back in sev_es_ist_exit(). Otherwise a nested
> - * sev_es_ist_exit() call may adjust back the IST entry too early.
> + * unconditionally adjusted back in __sev_es_ist_exit(). Otherwise a
> + * nested sev_es_ist_exit() call may adjust back the IST entry too
> + * early.
> + *
> + * The __sev_es_ist_enter() and __sev_es_ist_exit() functions always run
> + * on the NMI IST stack, as they are only called from NMI handling code
> + * right now.
>   */
>  void noinstr __sev_es_ist_enter(struct pt_regs *regs)
>  {
> @@ -143,7 +149,10 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
>  	/* Read old IST entry */
>  	old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
>  
> -	/* Make room on the IST stack */
> +	/*
> +	 * Make room on the IST stack - Reserve 8 bytes to store the old
> +	 * IST entry.
> +	 */
>  	if (on_vc_stack(regs->sp) &&
>  	    !user_mode(regs) &&
>  	    !from_syscall_gap(regs))
> -- 

Yah, and then we probably should simplify this __sev_es_ist_enter()
function even more as it is not easy to grok.

For example, the ALIGN_DOWN(regs->sp, 8) is not really needed, right?

Also, both branches do "- sizeof(old_ist);" so you can just as well do
it unconditionally.

And the sizeof(old_ist) is just a confusing way to write 8, right? We're
64-bit only so there's no need for that, I'd say.

And then you probably should change the comments from

	/* Store old IST entry */

and

	/* Set new IST entry */

to something like:

 /*
  * If on the #VC IST stack, new_ist gets set to point one stack slot
  * further down from the #VC interrupt frame which has been pushed on
  * it during the first #VC exception entry.
  *
  * If not, simply the next slot on the #VC IST stack is set to point...

and here I'm not even sure why we're doing it?

The else branch, when we're not on the #VC stack, why are we doing

	new_ist = old_ist - sizeof(old_ist);

?

I mean, if the NMI handler causes a #VC exception, it will simply run on
the #VC IST stack so why do we have to do that - 8 thing at all?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
  2021-02-17 12:01   ` Joerg Roedel
@ 2021-02-17 18:09     ` Andy Lutomirski
  -1 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2021-02-17 18:09 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: X86 ML, Joerg Roedel, Andy Lutomirski, stable, H. Peter Anvin,
	Dave Hansen, Peter Zijlstra, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, David Rientjes,
	Cfir Cohen, Erdem Aktas, Masami Hiramatsu, Mike Stunes,
	Sean Christopherson, Martin Radev, Arvind Sankar, LKML, kvm list,
	Linux Virtualization

On Wed, Feb 17, 2021 at 4:02 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Joerg Roedel <jroedel@suse.de>
>
> The code in the NMI handler to adjust the #VC handler IST stack is
> needed in case an NMI hits when the #VC handler is still using its IST
> stack.
> But the check for this condition also needs to look if the regs->sp
> value is trusted, meaning it was not set by user-space. Extend the
> check to not use regs->sp when the NMI interrupted user-space code or
> the SYSCALL gap.
>
> Reported-by: Andy Lutomirski <luto@kernel.org>
> Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
> Cc: stable@vger.kernel.org # 5.10+
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/sev-es.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 84c1821819af..0df38b185d53 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -144,7 +144,9 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
>         old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
>
>         /* Make room on the IST stack */
> -       if (on_vc_stack(regs->sp))
> +       if (on_vc_stack(regs->sp) &&
> +           !user_mode(regs) &&
> +           !from_syscall_gap(regs))
>                 new_ist = ALIGN_DOWN(regs->sp, 8) - sizeof(old_ist);
>         else
>

Can you get rid of the linked list hack while you're at it?  This code
is unnecessarily convoluted right now, and it seems to be just asking
for weird bugs.  Just stash the old value in a local variable, please.

Meanwhile, I'm pretty sure I can break this whole scheme if the
hypervisor is messing with us.  As a trivial example, the sequence
SYSCALL gap -> #VC -> NMI -> #VC will go quite poorly.  Is this really
better than just turning IST off for #VC and documenting that we are
not secure against a malicious hypervisor yet?

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
@ 2021-02-17 18:09     ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2021-02-17 18:09 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm list, Peter Zijlstra, Dave Hansen, Linux Virtualization,
	Arvind Sankar, H. Peter Anvin, Jiri Slaby, X86 ML,
	David Rientjes, Martin Radev, Tom Lendacky, Joerg Roedel,
	Kees Cook, Cfir Cohen, Andy Lutomirski, Dan Williams,
	Juergen Gross, Mike Stunes, Sean Christopherson, LKML, stable,
	Masami Hiramatsu, Erdem Aktas

On Wed, Feb 17, 2021 at 4:02 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Joerg Roedel <jroedel@suse.de>
>
> The code in the NMI handler to adjust the #VC handler IST stack is
> needed in case an NMI hits when the #VC handler is still using its IST
> stack.
> But the check for this condition also needs to look if the regs->sp
> value is trusted, meaning it was not set by user-space. Extend the
> check to not use regs->sp when the NMI interrupted user-space code or
> the SYSCALL gap.
>
> Reported-by: Andy Lutomirski <luto@kernel.org>
> Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
> Cc: stable@vger.kernel.org # 5.10+
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/sev-es.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 84c1821819af..0df38b185d53 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -144,7 +144,9 @@ void noinstr __sev_es_ist_enter(struct pt_regs *regs)
>         old_ist = __this_cpu_read(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC]);
>
>         /* Make room on the IST stack */
> -       if (on_vc_stack(regs->sp))
> +       if (on_vc_stack(regs->sp) &&
> +           !user_mode(regs) &&
> +           !from_syscall_gap(regs))
>                 new_ist = ALIGN_DOWN(regs->sp, 8) - sizeof(old_ist);
>         else
>

Can you get rid of the linked list hack while you're at it?  This code
is unnecessarily convoluted right now, and it seems to be just asking
for weird bugs.  Just stash the old value in a local variable, please.

Meanwhile, I'm pretty sure I can break this whole scheme if the
hypervisor is messing with us.  As a trivial example, the sequence
SYSCALL gap -> #VC -> NMI -> #VC will go quite poorly.  Is this really
better than just turning IST off for #VC and documenting that we are
not secure against a malicious hypervisor yet?

--Andy
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
  2021-02-17 18:09     ` Andy Lutomirski
@ 2021-02-18 11:25       ` Joerg Roedel
  -1 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-18 11:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Joerg Roedel, stable, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, David Rientjes, Cfir Cohen,
	Erdem Aktas, Masami Hiramatsu, Mike Stunes, Sean Christopherson,
	Martin Radev, Arvind Sankar, LKML, kvm list,
	Linux Virtualization

Hi Andy,

On Wed, Feb 17, 2021 at 10:09:46AM -0800, Andy Lutomirski wrote:
> Can you get rid of the linked list hack while you're at it?  This code
> is unnecessarily convoluted right now, and it seems to be just asking
> for weird bugs.  Just stash the old value in a local variable, please.

Yeah, the linked list is not really necessary right now, because of the
way nested NMI handling works and given that these functions are only
used in the NMI handler right now.
The whole #VC handling code was written with future requirements in
mind, like what is needed when debugging registers get virtualized and
#HV gets enabled.
Until its clear whether __sev_es_ist_enter/exit() is needed in any of
these paths, I'd like to keep the linked list for now. It is more
complicated but allows nesting.

> Meanwhile, I'm pretty sure I can break this whole scheme if the
> hypervisor is messing with us.  As a trivial example, the sequence
> SYSCALL gap -> #VC -> NMI -> #VC will go quite poorly.

I don't see how this would break, can you elaborate?

What I think happens is:

SYSCALL gap (RSP is from userspace and untrusted)

	-> #VC - Handler on #VC IST stack detects that it interrupted
	   the SYSCALL gap and switches to the task stack.

	-> NMI - Now running on NMI IST stack. Depending on whether the
	   stack switch in the #VC handler already happened, the #VC IST
	   entry is adjusted so that a subsequent #VC will not overwrite
	   the interrupted handlers stack frame.

	-> #VC - Handler runs on the adjusted #VC IST stack and switches
	   itself back to the NMI IST stack. This is safe wrt. nested
	   NMIs as long as nested NMIs itself are safe.

As a rule of thumb, think of the #VC handler as trying to be a non-IST
handler by switching itself to the interrupted stack or the task stack.
If it detects that this is not possible (which can't happen right now,
but with SNP), it will kill the guest.

Also #VC is currently not safe against #MC, but this is the same as with
NMI and #MC. And more care is needed when SNP gets enabled and #VCs can
happen in the stack switching/stack adjustment code itself. I will
probably just add a check then to kill the guest if an SNP related #VC
comes from noinstr code.

> Is this really better than just turning IST off for #VC and
> documenting that we are not secure against a malicious hypervisor yet?

It needs to be IST, even without SNP, because #DB is IST too. When the
hypervisor intercepts #DB then any #DB exception will be turned into
#VC, so #VC needs to be handled anywhere a #DB can happen.

And with SNP we need to be able to at least detect a malicious HV so we
can reliably kill the guest. Otherwise the HV could potentially take
control over the guest's execution flow and make it reveal its secrets.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
@ 2021-02-18 11:25       ` Joerg Roedel
  0 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-18 11:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kvm list, Peter Zijlstra, Dave Hansen, Linux Virtualization,
	Arvind Sankar, H. Peter Anvin, Jiri Slaby, X86 ML,
	David Rientjes, Martin Radev, Tom Lendacky, Joerg Roedel,
	Kees Cook, Cfir Cohen, Dan Williams, Juergen Gross, Mike Stunes,
	Sean Christopherson, LKML, stable, Masami Hiramatsu, Erdem Aktas

Hi Andy,

On Wed, Feb 17, 2021 at 10:09:46AM -0800, Andy Lutomirski wrote:
> Can you get rid of the linked list hack while you're at it?  This code
> is unnecessarily convoluted right now, and it seems to be just asking
> for weird bugs.  Just stash the old value in a local variable, please.

Yeah, the linked list is not really necessary right now, because of the
way nested NMI handling works and given that these functions are only
used in the NMI handler right now.
The whole #VC handling code was written with future requirements in
mind, like what is needed when debugging registers get virtualized and
#HV gets enabled.
Until its clear whether __sev_es_ist_enter/exit() is needed in any of
these paths, I'd like to keep the linked list for now. It is more
complicated but allows nesting.

> Meanwhile, I'm pretty sure I can break this whole scheme if the
> hypervisor is messing with us.  As a trivial example, the sequence
> SYSCALL gap -> #VC -> NMI -> #VC will go quite poorly.

I don't see how this would break, can you elaborate?

What I think happens is:

SYSCALL gap (RSP is from userspace and untrusted)

	-> #VC - Handler on #VC IST stack detects that it interrupted
	   the SYSCALL gap and switches to the task stack.

	-> NMI - Now running on NMI IST stack. Depending on whether the
	   stack switch in the #VC handler already happened, the #VC IST
	   entry is adjusted so that a subsequent #VC will not overwrite
	   the interrupted handlers stack frame.

	-> #VC - Handler runs on the adjusted #VC IST stack and switches
	   itself back to the NMI IST stack. This is safe wrt. nested
	   NMIs as long as nested NMIs itself are safe.

As a rule of thumb, think of the #VC handler as trying to be a non-IST
handler by switching itself to the interrupted stack or the task stack.
If it detects that this is not possible (which can't happen right now,
but with SNP), it will kill the guest.

Also #VC is currently not safe against #MC, but this is the same as with
NMI and #MC. And more care is needed when SNP gets enabled and #VCs can
happen in the stack switching/stack adjustment code itself. I will
probably just add a check then to kill the guest if an SNP related #VC
comes from noinstr code.

> Is this really better than just turning IST off for #VC and
> documenting that we are not secure against a malicious hypervisor yet?

It needs to be IST, even without SNP, because #DB is IST too. When the
hypervisor intercepts #DB then any #DB exception will be turned into
#VC, so #VC needs to be handled anywhere a #DB can happen.

And with SNP we need to be able to at least detect a malicious HV so we
can reliably kill the guest. Otherwise the HV could potentially take
control over the guest's execution flow and make it reveal its secrets.

Regards,

	Joerg
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
  2021-02-18 11:25       ` Joerg Roedel
@ 2021-02-18 17:49         ` Andy Lutomirski
  -1 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2021-02-18 17:49 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, X86 ML, Joerg Roedel, stable, H. Peter Anvin,
	Dave Hansen, Peter Zijlstra, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, David Rientjes,
	Cfir Cohen, Erdem Aktas, Masami Hiramatsu, Mike Stunes,
	Sean Christopherson, Martin Radev, Arvind Sankar, LKML, kvm list,
	Linux Virtualization

On Thu, Feb 18, 2021 at 3:25 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> Hi Andy,
>
> On Wed, Feb 17, 2021 at 10:09:46AM -0800, Andy Lutomirski wrote:
> > Can you get rid of the linked list hack while you're at it?  This code
> > is unnecessarily convoluted right now, and it seems to be just asking
> > for weird bugs.  Just stash the old value in a local variable, please.
>
> Yeah, the linked list is not really necessary right now, because of the
> way nested NMI handling works and given that these functions are only
> used in the NMI handler right now.
> The whole #VC handling code was written with future requirements in
> mind, like what is needed when debugging registers get virtualized and
> #HV gets enabled.
> Until its clear whether __sev_es_ist_enter/exit() is needed in any of
> these paths, I'd like to keep the linked list for now. It is more
> complicated but allows nesting.

I don't understand what this means.  The whole entry mechanism on x86
is structured so that we call a C function *and return from that C
function without longjmp-like magic* with the sole exception of
unwind_stack_do_exit().  This means that you can match up enters and
exits, and that unwind_stack_do_exit() needs to unwind correctly.  In
the former case, it's normal C and we can use normal local variables.
In the latter case, we know exactly what state we're trying to restore
and we can restore it directly without any linked lists or similar.

What do you have in mind that requires a linked list?

>
> > Meanwhile, I'm pretty sure I can break this whole scheme if the
> > hypervisor is messing with us.  As a trivial example, the sequence
> > SYSCALL gap -> #VC -> NMI -> #VC will go quite poorly.
>
> I don't see how this would break, can you elaborate?
>
> What I think happens is:
>
> SYSCALL gap (RSP is from userspace and untrusted)
>
>         -> #VC - Handler on #VC IST stack detects that it interrupted
>            the SYSCALL gap and switches to the task stack.
>

Can you point me to exactly what code you're referring to?  I spent a
while digging through the code and macro tangle and I can't find this.

>
>         -> NMI - Now running on NMI IST stack. Depending on whether the
>            stack switch in the #VC handler already happened, the #VC IST
>            entry is adjusted so that a subsequent #VC will not overwrite
>            the interrupted handlers stack frame.
>
>         -> #VC - Handler runs on the adjusted #VC IST stack and switches
>            itself back to the NMI IST stack. This is safe wrt. nested
>            NMIs as long as nested NMIs itself are safe.
>
> As a rule of thumb, think of the #VC handler as trying to be a non-IST
> handler by switching itself to the interrupted stack or the task stack.
> If it detects that this is not possible (which can't happen right now,
> but with SNP), it will kill the guest.

I will try to think of this, but it's hard, since I can't find the code :)

I found this comment:

 * With the current implementation it is always possible to switch to a safe
 * stack because #VC exceptions only happen at known places, like intercepted
 * instructions or accesses to MMIO areas/IO ports. They can also happen with
 * code instrumentation when the hypervisor intercepts #DB, but the critical
 * paths are forbidden to be instrumented, so #DB exceptions currently also
 * only happen in safe places.

Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
that #DB is *not* always called in safe places.

But I *thnk* the code you're talking about is this:

    /*
     * If the entry is from userspace, switch stacks and treat it as
     * a normal entry.
     */
    testb    $3, CS-ORIG_RAX(%rsp)
    jnz    .Lfrom_usermode_switch_stack_\@

which does not run on #VC from kernel code.

> It needs to be IST, even without SNP, because #DB is IST too. When the
> hypervisor intercepts #DB then any #DB exception will be turned into
> #VC, so #VC needs to be handled anywhere a #DB can happen.

Eww.

>
> And with SNP we need to be able to at least detect a malicious HV so we
> can reliably kill the guest. Otherwise the HV could potentially take
> control over the guest's execution flow and make it reveal its secrets.

True.  But is the rest of the machinery to be secure against EFLAGS.IF
violations and such in place yet?

>
> Regards,
>
>         Joerg

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
@ 2021-02-18 17:49         ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2021-02-18 17:49 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm list, Peter Zijlstra, Dave Hansen, Linux Virtualization,
	Arvind Sankar, H. Peter Anvin, Jiri Slaby, X86 ML,
	David Rientjes, Martin Radev, Tom Lendacky, Joerg Roedel,
	Kees Cook, Cfir Cohen, Andy Lutomirski, Dan Williams,
	Juergen Gross, Mike Stunes, Sean Christopherson, LKML, stable,
	Masami Hiramatsu, Erdem Aktas

On Thu, Feb 18, 2021 at 3:25 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> Hi Andy,
>
> On Wed, Feb 17, 2021 at 10:09:46AM -0800, Andy Lutomirski wrote:
> > Can you get rid of the linked list hack while you're at it?  This code
> > is unnecessarily convoluted right now, and it seems to be just asking
> > for weird bugs.  Just stash the old value in a local variable, please.
>
> Yeah, the linked list is not really necessary right now, because of the
> way nested NMI handling works and given that these functions are only
> used in the NMI handler right now.
> The whole #VC handling code was written with future requirements in
> mind, like what is needed when debugging registers get virtualized and
> #HV gets enabled.
> Until its clear whether __sev_es_ist_enter/exit() is needed in any of
> these paths, I'd like to keep the linked list for now. It is more
> complicated but allows nesting.

I don't understand what this means.  The whole entry mechanism on x86
is structured so that we call a C function *and return from that C
function without longjmp-like magic* with the sole exception of
unwind_stack_do_exit().  This means that you can match up enters and
exits, and that unwind_stack_do_exit() needs to unwind correctly.  In
the former case, it's normal C and we can use normal local variables.
In the latter case, we know exactly what state we're trying to restore
and we can restore it directly without any linked lists or similar.

What do you have in mind that requires a linked list?

>
> > Meanwhile, I'm pretty sure I can break this whole scheme if the
> > hypervisor is messing with us.  As a trivial example, the sequence
> > SYSCALL gap -> #VC -> NMI -> #VC will go quite poorly.
>
> I don't see how this would break, can you elaborate?
>
> What I think happens is:
>
> SYSCALL gap (RSP is from userspace and untrusted)
>
>         -> #VC - Handler on #VC IST stack detects that it interrupted
>            the SYSCALL gap and switches to the task stack.
>

Can you point me to exactly what code you're referring to?  I spent a
while digging through the code and macro tangle and I can't find this.

>
>         -> NMI - Now running on NMI IST stack. Depending on whether the
>            stack switch in the #VC handler already happened, the #VC IST
>            entry is adjusted so that a subsequent #VC will not overwrite
>            the interrupted handlers stack frame.
>
>         -> #VC - Handler runs on the adjusted #VC IST stack and switches
>            itself back to the NMI IST stack. This is safe wrt. nested
>            NMIs as long as nested NMIs itself are safe.
>
> As a rule of thumb, think of the #VC handler as trying to be a non-IST
> handler by switching itself to the interrupted stack or the task stack.
> If it detects that this is not possible (which can't happen right now,
> but with SNP), it will kill the guest.

I will try to think of this, but it's hard, since I can't find the code :)

I found this comment:

 * With the current implementation it is always possible to switch to a safe
 * stack because #VC exceptions only happen at known places, like intercepted
 * instructions or accesses to MMIO areas/IO ports. They can also happen with
 * code instrumentation when the hypervisor intercepts #DB, but the critical
 * paths are forbidden to be instrumented, so #DB exceptions currently also
 * only happen in safe places.

Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
that #DB is *not* always called in safe places.

But I *thnk* the code you're talking about is this:

    /*
     * If the entry is from userspace, switch stacks and treat it as
     * a normal entry.
     */
    testb    $3, CS-ORIG_RAX(%rsp)
    jnz    .Lfrom_usermode_switch_stack_\@

which does not run on #VC from kernel code.

> It needs to be IST, even without SNP, because #DB is IST too. When the
> hypervisor intercepts #DB then any #DB exception will be turned into
> #VC, so #VC needs to be handled anywhere a #DB can happen.

Eww.

>
> And with SNP we need to be able to at least detect a malicious HV so we
> can reliably kill the guest. Otherwise the HV could potentially take
> control over the guest's execution flow and make it reveal its secrets.

True.  But is the rest of the machinery to be secure against EFLAGS.IF
violations and such in place yet?

>
> Regards,
>
>         Joerg
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
  2021-02-18 17:49         ` Andy Lutomirski
@ 2021-02-18 19:21           ` Joerg Roedel
  -1 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-18 19:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, X86 ML, stable, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, David Rientjes, Cfir Cohen,
	Erdem Aktas, Masami Hiramatsu, Mike Stunes, Sean Christopherson,
	Martin Radev, Arvind Sankar, LKML, kvm list,
	Linux Virtualization

On Thu, Feb 18, 2021 at 09:49:06AM -0800, Andy Lutomirski wrote:
> I don't understand what this means.  The whole entry mechanism on x86
> is structured so that we call a C function *and return from that C
> function without longjmp-like magic* with the sole exception of
> unwind_stack_do_exit().  This means that you can match up enters and
> exits, and that unwind_stack_do_exit() needs to unwind correctly.  In
> the former case, it's normal C and we can use normal local variables.
> In the latter case, we know exactly what state we're trying to restore
> and we can restore it directly without any linked lists or similar.

Okay, the unwinder will likely get confused by this logic.

> What do you have in mind that requires a linked list?

Cases when there are multiple IST vectors besides NMI that can hit while
the #VC handler is still on its own IST stack. #MCE comes to mind, but
that is broken anyway. At some point #VC itself will be one of them, but
when that happens the code will kill the machine.
This leaves #HV in the list, and I am not sure how that is going to be
handled yet. I think the goal is that the #HV handler is not allowed to
cause any #VC exception. In that case the linked-list logic will not be
needed.

> > I don't see how this would break, can you elaborate?
> >
> > What I think happens is:
> >
> > SYSCALL gap (RSP is from userspace and untrusted)
> >
> >         -> #VC - Handler on #VC IST stack detects that it interrupted
> >            the SYSCALL gap and switches to the task stack.
> >
> 
> Can you point me to exactly what code you're referring to?  I spent a
> while digging through the code and macro tangle and I can't find this.

See the entry code in arch/x86/entry/entry_64.S, macro idtentry_vc. It
creates the assembly code for the handler. At some point it calls
vc_switch_off_ist(), which is a C function in arch/x86/kernel/traps.c.
This function tries to find a new stack for the #VC handler.

The first thing it does is checking whether the exception came from the
SYSCALL gap and just uses the task stack in that case.

Then it will check for other kernel stacks which are safe to switch
to. If that fails it uses the fall-back stack (VC2), which will direct
the handler to a separate function which, for now, just calls panic().
Not safe are the entry or unknown stacks.

The function then copies pt_regs and returns the new stack pointer to
assembly code, which then writes it to %RSP.

> Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
> that #DB is *not* always called in safe places.

You are right, forgot about this. The MOV SS bug can very well
trigger a #VC(#DB) exception from the syscall gap.

> > And with SNP we need to be able to at least detect a malicious HV so we
> > can reliably kill the guest. Otherwise the HV could potentially take
> > control over the guest's execution flow and make it reveal its secrets.
> 
> True.  But is the rest of the machinery to be secure against EFLAGS.IF
> violations and such in place yet?

Not sure what you mean by EFLAGS.IF violations, probably enabling IRQs
while in the #VC handler? The #VC handler _must_ _not_ enable IRQs
anywhere in its call-path. If that ever happens it is a bug.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
@ 2021-02-18 19:21           ` Joerg Roedel
  0 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-18 19:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kvm list, Peter Zijlstra, Dave Hansen, Linux Virtualization,
	Arvind Sankar, H. Peter Anvin, Jiri Slaby, Joerg Roedel, X86 ML,
	David Rientjes, Martin Radev, Tom Lendacky, Kees Cook,
	Cfir Cohen, Dan Williams, Juergen Gross, Mike Stunes,
	Sean Christopherson, LKML, stable, Masami Hiramatsu, Erdem Aktas

On Thu, Feb 18, 2021 at 09:49:06AM -0800, Andy Lutomirski wrote:
> I don't understand what this means.  The whole entry mechanism on x86
> is structured so that we call a C function *and return from that C
> function without longjmp-like magic* with the sole exception of
> unwind_stack_do_exit().  This means that you can match up enters and
> exits, and that unwind_stack_do_exit() needs to unwind correctly.  In
> the former case, it's normal C and we can use normal local variables.
> In the latter case, we know exactly what state we're trying to restore
> and we can restore it directly without any linked lists or similar.

Okay, the unwinder will likely get confused by this logic.

> What do you have in mind that requires a linked list?

Cases when there are multiple IST vectors besides NMI that can hit while
the #VC handler is still on its own IST stack. #MCE comes to mind, but
that is broken anyway. At some point #VC itself will be one of them, but
when that happens the code will kill the machine.
This leaves #HV in the list, and I am not sure how that is going to be
handled yet. I think the goal is that the #HV handler is not allowed to
cause any #VC exception. In that case the linked-list logic will not be
needed.

> > I don't see how this would break, can you elaborate?
> >
> > What I think happens is:
> >
> > SYSCALL gap (RSP is from userspace and untrusted)
> >
> >         -> #VC - Handler on #VC IST stack detects that it interrupted
> >            the SYSCALL gap and switches to the task stack.
> >
> 
> Can you point me to exactly what code you're referring to?  I spent a
> while digging through the code and macro tangle and I can't find this.

See the entry code in arch/x86/entry/entry_64.S, macro idtentry_vc. It
creates the assembly code for the handler. At some point it calls
vc_switch_off_ist(), which is a C function in arch/x86/kernel/traps.c.
This function tries to find a new stack for the #VC handler.

The first thing it does is checking whether the exception came from the
SYSCALL gap and just uses the task stack in that case.

Then it will check for other kernel stacks which are safe to switch
to. If that fails it uses the fall-back stack (VC2), which will direct
the handler to a separate function which, for now, just calls panic().
Not safe are the entry or unknown stacks.

The function then copies pt_regs and returns the new stack pointer to
assembly code, which then writes it to %RSP.

> Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
> that #DB is *not* always called in safe places.

You are right, forgot about this. The MOV SS bug can very well
trigger a #VC(#DB) exception from the syscall gap.

> > And with SNP we need to be able to at least detect a malicious HV so we
> > can reliably kill the guest. Otherwise the HV could potentially take
> > control over the guest's execution flow and make it reveal its secrets.
> 
> True.  But is the rest of the machinery to be secure against EFLAGS.IF
> violations and such in place yet?

Not sure what you mean by EFLAGS.IF violations, probably enabling IRQs
while in the #VC handler? The #VC handler _must_ _not_ enable IRQs
anywhere in its call-path. If that ever happens it is a bug.

Regards,

	Joerg
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
  2021-02-18 19:21           ` Joerg Roedel
@ 2021-02-19  0:28             ` Andy Lutomirski
  -1 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2021-02-19  0:28 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, X86 ML, stable, H. Peter Anvin,
	Dave Hansen, Peter Zijlstra, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, David Rientjes,
	Cfir Cohen, Erdem Aktas, Masami Hiramatsu, Mike Stunes,
	Sean Christopherson, Martin Radev, Arvind Sankar, LKML, kvm list,
	Linux Virtualization

On Thu, Feb 18, 2021 at 11:21 AM Joerg Roedel <jroedel@suse.de> wrote:
>
> On Thu, Feb 18, 2021 at 09:49:06AM -0800, Andy Lutomirski wrote:
> > I don't understand what this means.  The whole entry mechanism on x86
> > is structured so that we call a C function *and return from that C
> > function without longjmp-like magic* with the sole exception of
> > unwind_stack_do_exit().  This means that you can match up enters and
> > exits, and that unwind_stack_do_exit() needs to unwind correctly.  In
> > the former case, it's normal C and we can use normal local variables.
> > In the latter case, we know exactly what state we're trying to restore
> > and we can restore it directly without any linked lists or similar.
>
> Okay, the unwinder will likely get confused by this logic.
>
> > What do you have in mind that requires a linked list?
>
> Cases when there are multiple IST vectors besides NMI that can hit while
> the #VC handler is still on its own IST stack. #MCE comes to mind, but
> that is broken anyway. At some point #VC itself will be one of them, but
> when that happens the code will kill the machine.
> This leaves #HV in the list, and I am not sure how that is going to be
> handled yet. I think the goal is that the #HV handler is not allowed to
> cause any #VC exception. In that case the linked-list logic will not be
> needed.

Can you give me an example, even artificial, in which the linked-list
logic is useful?

>
> > > I don't see how this would break, can you elaborate?
> > >
> > > What I think happens is:
> > >
> > > SYSCALL gap (RSP is from userspace and untrusted)
> > >
> > >         -> #VC - Handler on #VC IST stack detects that it interrupted
> > >            the SYSCALL gap and switches to the task stack.
> > >
> >
> > Can you point me to exactly what code you're referring to?  I spent a
> > while digging through the code and macro tangle and I can't find this.
>
> See the entry code in arch/x86/entry/entry_64.S, macro idtentry_vc. It
> creates the assembly code for the handler. At some point it calls
> vc_switch_off_ist(), which is a C function in arch/x86/kernel/traps.c.
> This function tries to find a new stack for the #VC handler.
>
> The first thing it does is checking whether the exception came from the
> SYSCALL gap and just uses the task stack in that case.
>
> Then it will check for other kernel stacks which are safe to switch
> to. If that fails it uses the fall-back stack (VC2), which will direct
> the handler to a separate function which, for now, just calls panic().
> Not safe are the entry or unknown stacks.

Can you explain your reasoning in considering the entry stack unsafe?
It's 4k bytes these days.

You forgot about entry_SYSCALL_compat.

Your 8-byte alignment is confusing to me.  In valid kernel code, SP
should be 8-byte-aligned already, and, if you're trying to match
architectural behavior, the CPU aligns to 16 bytes.

We're not robust against #VC, NMI in the #VC prologue before the magic
stack switch, and a new #VC in the NMI prologue.  Nor do we appear to
have any detection of the case where #VC nests directly inside its own
prologue.  Or did I miss something else here?

If we get NMI and get #VC in the NMI *asm*, the #VC magic stack switch
looks like it will merrily run itself in the NMI special-stack-layout
section, and that sounds really quite bad.

>
> The function then copies pt_regs and returns the new stack pointer to
> assembly code, which then writes it to %RSP.
>
> > Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
> > that #DB is *not* always called in safe places.
>
> You are right, forgot about this. The MOV SS bug can very well
> trigger a #VC(#DB) exception from the syscall gap.
>
> > > And with SNP we need to be able to at least detect a malicious HV so we
> > > can reliably kill the guest. Otherwise the HV could potentially take
> > > control over the guest's execution flow and make it reveal its secrets.
> >
> > True.  But is the rest of the machinery to be secure against EFLAGS.IF
> > violations and such in place yet?
>
> Not sure what you mean by EFLAGS.IF violations, probably enabling IRQs
> while in the #VC handler? The #VC handler _must_ _not_ enable IRQs
> anywhere in its call-path. If that ever happens it is a bug.
>

I mean that, IIRC, a malicious hypervisor can inject inappropriate
vectors at inappropriate times if the #HV mechanism isn't enabled.
For example, it could inject a page fault or an interrupt in a context
in which we have the wrong GSBASE loaded.

But the #DB issue makes this moot.  We have to use IST unless we turn
off SCE.  But I admit I'm leaning toward turning off SCE until we have
a solution that seems convincingly robust.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
@ 2021-02-19  0:28             ` Andy Lutomirski
  0 siblings, 0 replies; 26+ messages in thread
From: Andy Lutomirski @ 2021-02-19  0:28 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: kvm list, Peter Zijlstra, Dave Hansen, Linux Virtualization,
	Arvind Sankar, H. Peter Anvin, Jiri Slaby, Joerg Roedel, X86 ML,
	David Rientjes, Martin Radev, Tom Lendacky, Kees Cook,
	Cfir Cohen, Andy Lutomirski, Dan Williams, Juergen Gross,
	Mike Stunes, Sean Christopherson, LKML, stable, Masami Hiramatsu,
	Erdem Aktas

On Thu, Feb 18, 2021 at 11:21 AM Joerg Roedel <jroedel@suse.de> wrote:
>
> On Thu, Feb 18, 2021 at 09:49:06AM -0800, Andy Lutomirski wrote:
> > I don't understand what this means.  The whole entry mechanism on x86
> > is structured so that we call a C function *and return from that C
> > function without longjmp-like magic* with the sole exception of
> > unwind_stack_do_exit().  This means that you can match up enters and
> > exits, and that unwind_stack_do_exit() needs to unwind correctly.  In
> > the former case, it's normal C and we can use normal local variables.
> > In the latter case, we know exactly what state we're trying to restore
> > and we can restore it directly without any linked lists or similar.
>
> Okay, the unwinder will likely get confused by this logic.
>
> > What do you have in mind that requires a linked list?
>
> Cases when there are multiple IST vectors besides NMI that can hit while
> the #VC handler is still on its own IST stack. #MCE comes to mind, but
> that is broken anyway. At some point #VC itself will be one of them, but
> when that happens the code will kill the machine.
> This leaves #HV in the list, and I am not sure how that is going to be
> handled yet. I think the goal is that the #HV handler is not allowed to
> cause any #VC exception. In that case the linked-list logic will not be
> needed.

Can you give me an example, even artificial, in which the linked-list
logic is useful?

>
> > > I don't see how this would break, can you elaborate?
> > >
> > > What I think happens is:
> > >
> > > SYSCALL gap (RSP is from userspace and untrusted)
> > >
> > >         -> #VC - Handler on #VC IST stack detects that it interrupted
> > >            the SYSCALL gap and switches to the task stack.
> > >
> >
> > Can you point me to exactly what code you're referring to?  I spent a
> > while digging through the code and macro tangle and I can't find this.
>
> See the entry code in arch/x86/entry/entry_64.S, macro idtentry_vc. It
> creates the assembly code for the handler. At some point it calls
> vc_switch_off_ist(), which is a C function in arch/x86/kernel/traps.c.
> This function tries to find a new stack for the #VC handler.
>
> The first thing it does is checking whether the exception came from the
> SYSCALL gap and just uses the task stack in that case.
>
> Then it will check for other kernel stacks which are safe to switch
> to. If that fails it uses the fall-back stack (VC2), which will direct
> the handler to a separate function which, for now, just calls panic().
> Not safe are the entry or unknown stacks.

Can you explain your reasoning in considering the entry stack unsafe?
It's 4k bytes these days.

You forgot about entry_SYSCALL_compat.

Your 8-byte alignment is confusing to me.  In valid kernel code, SP
should be 8-byte-aligned already, and, if you're trying to match
architectural behavior, the CPU aligns to 16 bytes.

We're not robust against #VC, NMI in the #VC prologue before the magic
stack switch, and a new #VC in the NMI prologue.  Nor do we appear to
have any detection of the case where #VC nests directly inside its own
prologue.  Or did I miss something else here?

If we get NMI and get #VC in the NMI *asm*, the #VC magic stack switch
looks like it will merrily run itself in the NMI special-stack-layout
section, and that sounds really quite bad.

>
> The function then copies pt_regs and returns the new stack pointer to
> assembly code, which then writes it to %RSP.
>
> > Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means
> > that #DB is *not* always called in safe places.
>
> You are right, forgot about this. The MOV SS bug can very well
> trigger a #VC(#DB) exception from the syscall gap.
>
> > > And with SNP we need to be able to at least detect a malicious HV so we
> > > can reliably kill the guest. Otherwise the HV could potentially take
> > > control over the guest's execution flow and make it reveal its secrets.
> >
> > True.  But is the rest of the machinery to be secure against EFLAGS.IF
> > violations and such in place yet?
>
> Not sure what you mean by EFLAGS.IF violations, probably enabling IRQs
> while in the #VC handler? The #VC handler _must_ _not_ enable IRQs
> anywhere in its call-path. If that ever happens it is a bug.
>

I mean that, IIRC, a malicious hypervisor can inject inappropriate
vectors at inappropriate times if the #HV mechanism isn't enabled.
For example, it could inject a page fault or an interrupt in a context
in which we have the wrong GSBASE loaded.

But the #DB issue makes this moot.  We have to use IST unless we turn
off SCE.  But I admit I'm leaning toward turning off SCE until we have
a solution that seems convincingly robust.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
  2021-02-19  0:28             ` Andy Lutomirski
@ 2021-02-19 11:05               ` Joerg Roedel
  -1 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-19 11:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, X86 ML, stable, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, David Rientjes, Cfir Cohen,
	Erdem Aktas, Masami Hiramatsu, Mike Stunes, Sean Christopherson,
	Martin Radev, Arvind Sankar, LKML, kvm list,
	Linux Virtualization

On Thu, Feb 18, 2021 at 04:28:36PM -0800, Andy Lutomirski wrote:
> On Thu, Feb 18, 2021 at 11:21 AM Joerg Roedel <jroedel@suse.de> wrote:
> Can you give me an example, even artificial, in which the linked-list
> logic is useful?

So here we go, its of course artificial, but still:

	1. #VC happens, not important where
	2. NMI in the #VC prologue before it moved off its IST stack
	   - first VC IST adjustment happening here
	3. #VC in the NMI handler
	4. #HV in the #VC prologue again
	   - second VC IST adjustment happening here, so the #HV handler
	     can cause its own #VC exceptions.

Can only happen if the #HV handler is allowed to cause #VC exceptions.
But even if its not allowed, it can happen with SNP and a malicious
Hypervisor. But in this case the only option is to reliably panic.

> Can you explain your reasoning in considering the entry stack unsafe?
> It's 4k bytes these days.

I wasn't aware that it is 4k in size now. I still thought it was just
these 64 words large and one can not simply execute C code on it.

> You forgot about entry_SYSCALL_compat.

Right, thanks for pointing this out.

> Your 8-byte alignment is confusing to me.  In valid kernel code, SP
> should be 8-byte-aligned already, and, if you're trying to match
> architectural behavior, the CPU aligns to 16 bytes.

Yeah, I was just being cautious. The explicit alignment can be removed,
Boris also pointed this out.

> We're not robust against #VC, NMI in the #VC prologue before the magic
> stack switch, and a new #VC in the NMI prologue.  Nor do we appear to
> have any detection of the case where #VC nests directly inside its own
> prologue.  Or did I miss something else here?

No, you don't miss anything here. At the moment #VC can't happen at
those places, so this is not handled yet. With SNP it can happen and
needs to be handled in a way to at least allow a reliable panic (because
if it really happens the Hypervisor is messing with us).

> If we get NMI and get #VC in the NMI *asm*, the #VC magic stack switch
> looks like it will merrily run itself in the NMI special-stack-layout
> section, and that sounds really quite bad.

Yes, I havn't looked at the details yet, but if a #VC happens there it
probably better not returns.

> I mean that, IIRC, a malicious hypervisor can inject inappropriate
> vectors at inappropriate times if the #HV mechanism isn't enabled.
> For example, it could inject a page fault or an interrupt in a context
> in which we have the wrong GSBASE loaded.

Yes, a malicious Hypervisor can do that, and without #HV there is no
real protection against this besides turning all vectors (even IRQs)
into paranoid entries. Maybe even more care is needed, but I think its
not worth to care about this. 

> But the #DB issue makes this moot.  We have to use IST unless we turn
> off SCE.  But I admit I'm leaning toward turning off SCE until we have
> a solution that seems convincingly robust.

Turning off SCE might be tempting, but I guess doing so would break a
quite some user-space code, no?

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack
@ 2021-02-19 11:05               ` Joerg Roedel
  0 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2021-02-19 11:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kvm list, Peter Zijlstra, Dave Hansen, Linux Virtualization,
	Arvind Sankar, H. Peter Anvin, Jiri Slaby, X86 ML,
	David Rientjes, Martin Radev, Tom Lendacky, Joerg Roedel,
	Kees Cook, Cfir Cohen, Dan Williams, Juergen Gross, Mike Stunes,
	Sean Christopherson, LKML, stable, Masami Hiramatsu, Erdem Aktas

On Thu, Feb 18, 2021 at 04:28:36PM -0800, Andy Lutomirski wrote:
> On Thu, Feb 18, 2021 at 11:21 AM Joerg Roedel <jroedel@suse.de> wrote:
> Can you give me an example, even artificial, in which the linked-list
> logic is useful?

So here we go, its of course artificial, but still:

	1. #VC happens, not important where
	2. NMI in the #VC prologue before it moved off its IST stack
	   - first VC IST adjustment happening here
	3. #VC in the NMI handler
	4. #HV in the #VC prologue again
	   - second VC IST adjustment happening here, so the #HV handler
	     can cause its own #VC exceptions.

Can only happen if the #HV handler is allowed to cause #VC exceptions.
But even if its not allowed, it can happen with SNP and a malicious
Hypervisor. But in this case the only option is to reliably panic.

> Can you explain your reasoning in considering the entry stack unsafe?
> It's 4k bytes these days.

I wasn't aware that it is 4k in size now. I still thought it was just
these 64 words large and one can not simply execute C code on it.

> You forgot about entry_SYSCALL_compat.

Right, thanks for pointing this out.

> Your 8-byte alignment is confusing to me.  In valid kernel code, SP
> should be 8-byte-aligned already, and, if you're trying to match
> architectural behavior, the CPU aligns to 16 bytes.

Yeah, I was just being cautious. The explicit alignment can be removed,
Boris also pointed this out.

> We're not robust against #VC, NMI in the #VC prologue before the magic
> stack switch, and a new #VC in the NMI prologue.  Nor do we appear to
> have any detection of the case where #VC nests directly inside its own
> prologue.  Or did I miss something else here?

No, you don't miss anything here. At the moment #VC can't happen at
those places, so this is not handled yet. With SNP it can happen and
needs to be handled in a way to at least allow a reliable panic (because
if it really happens the Hypervisor is messing with us).

> If we get NMI and get #VC in the NMI *asm*, the #VC magic stack switch
> looks like it will merrily run itself in the NMI special-stack-layout
> section, and that sounds really quite bad.

Yes, I havn't looked at the details yet, but if a #VC happens there it
probably better not returns.

> I mean that, IIRC, a malicious hypervisor can inject inappropriate
> vectors at inappropriate times if the #HV mechanism isn't enabled.
> For example, it could inject a page fault or an interrupt in a context
> in which we have the wrong GSBASE loaded.

Yes, a malicious Hypervisor can do that, and without #HV there is no
real protection against this besides turning all vectors (even IRQs)
into paranoid entries. Maybe even more care is needed, but I think its
not worth to care about this. 

> But the #DB issue makes this moot.  We have to use IST unless we turn
> off SCE.  But I admit I'm leaning toward turning off SCE until we have
> a solution that seems convincingly robust.

Turning off SCE might be tempting, but I guess doing so would break a
quite some user-space code, no?

Regards,

	Joerg
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-02-19 11:06 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-17 12:01 [PATCH 0/3] x86/sev-es: Check for trusted regs->sp in __sev_es_ist_enter() Joerg Roedel
2021-02-17 12:01 ` Joerg Roedel
2021-02-17 12:01 ` [PATCH 1/3] x86/sev-es: Introduce from_syscall_gap() helper Joerg Roedel
2021-02-17 12:01   ` Joerg Roedel
2021-02-17 17:59   ` Borislav Petkov
2021-02-17 17:59     ` Borislav Petkov
2021-02-17 12:01 ` [PATCH 2/3] x86/sev-es: Check if regs->sp is trusted before adjusting #VC IST stack Joerg Roedel
2021-02-17 12:01   ` Joerg Roedel
2021-02-17 18:00   ` Borislav Petkov
2021-02-17 18:00     ` Borislav Petkov
2021-02-17 18:09   ` Andy Lutomirski
2021-02-17 18:09     ` Andy Lutomirski
2021-02-18 11:25     ` Joerg Roedel
2021-02-18 11:25       ` Joerg Roedel
2021-02-18 17:49       ` Andy Lutomirski
2021-02-18 17:49         ` Andy Lutomirski
2021-02-18 19:21         ` Joerg Roedel
2021-02-18 19:21           ` Joerg Roedel
2021-02-19  0:28           ` Andy Lutomirski
2021-02-19  0:28             ` Andy Lutomirski
2021-02-19 11:05             ` Joerg Roedel
2021-02-19 11:05               ` Joerg Roedel
2021-02-17 12:01 ` [PATCH 3/3] x86/sev-es: Improve comments in and around __sev_es_ist_enter/exit() Joerg Roedel
2021-02-17 12:01   ` Joerg Roedel
2021-02-17 18:00   ` Borislav Petkov
2021-02-17 18:00     ` Borislav Petkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.