All of lore.kernel.org
 help / color / mirror / Atom feed
* [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
@ 2023-01-27  3:56 Alexey Kardashevskiy
  2023-01-27  9:08 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2023-01-27  3:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, x86, linux-kernel, Thomas Gleixner, Sean Christopherson,
	Jiri Kosina, Ingo Molnar, Dave Hansen, Borislav Petkov,
	Peter Zijlstra (Intel),
	H. Peter Anvin, Tom Lendacky

AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
"perf" executes. "./perf record sleep 20" is an example.

Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
and actual reading of DB7 (which in turn causes #VC) every function is
inlined and no stack frame is created (?). Replacing __always_inline with
noinline in  local_db_save() or native_get_debugreg() fixes the problem.

The crash does not happen with CONFIG_PARAVIRT_XXL as in this case
paravirt_get_debugreg() is used and there is an indirect call via
PVOP_CALL1(). It has not been noticed as the most configs have CONFIG_XEN
enabled which enables CONFIG_PARAVIRT_XXL.

This happens with the recent tip/master, here is my test kernel
and the config:
https://github.com/aik/linux/commits/debug_dr7

Found this while testing DebugSwap (which also fixes the crash as
it eliminates DB7 interception == #VC):
https://lore.kernel.org/all/20230120031047.628097-1-aik@amd.com

Define local_db_save_exc_nmi() to demostrate that the problem better.

Why is this crash happening and how to fix that? I am still reading
the assembly but was hoping for a shortcut here :) Thanks,

aik-Standard-PC-i440FX-PIIX-1996 login: ^[[A[   15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
[   15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[   15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
[   15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[   15.775323] RIP: 0010:error_entry+0x17/0x140
[   15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
[   15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
[   15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
[   15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
[   15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
[   15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
[   15.775339] FS:  0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
[   15.775340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
[   15.775342] Call Trace:
[   15.775352]  <NMI>
[   15.775355]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775357]  ? exc_page_fault+0x11/0x120
[   15.775360]  ? asm_exc_page_fault+0x22/0x30
[   15.775364]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775365]  ? exc_page_fault+0x11/0x120
[   15.775367]  ? asm_exc_page_fault+0x22/0x30
[   15.775368]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775369]  ? exc_page_fault+0x11/0x120
[   15.775371]  ? asm_exc_page_fault+0x22/0x30
[   15.775372]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775373]  ? exc_page_fault+0x11/0x120
[   15.775374]  ? asm_exc_page_fault+0x22/0x30
[   15.775375]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775376]  ? exc_page_fault+0x11/0x120
[   15.775378]  ? asm_exc_page_fault+0x22/0x30
[   15.775379]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775380]  ? exc_page_fault+0x11/0x120
[   15.775381]  ? asm_exc_page_fault+0x22/0x30
[   15.775382]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775383]  ? exc_page_fault+0x11/0x120
[   15.775384]  ? asm_exc_page_fault+0x22/0x30
[   15.775385]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775386]  ? exc_page_fault+0x11/0x120
[   15.775388]  ? asm_exc_page_fault+0x22/0x30
[   15.775389]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775390]  ? exc_page_fault+0x11/0x120
[   15.775391]  ? asm_exc_page_fault+0x22/0x30
[   15.775392]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775393]  ? exc_page_fault+0x11/0x120
[   15.775395]  ? asm_exc_page_fault+0x22/0x30
[   15.775396]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775397]  ? exc_page_fault+0x11/0x120
[   15.775398]  ? asm_exc_page_fault+0x22/0x30
[   15.775399]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775400]  ? exc_page_fault+0x11/0x120
[   15.775401]  ? asm_exc_page_fault+0x22/0x30
[   15.775403]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775404]  ? exc_page_fault+0x11/0x120
[   15.775405]  ? asm_exc_page_fault+0x22/0x30
[   15.775406]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775407]  ? exc_page_fault+0x11/0x120
[   15.775408]  ? asm_exc_page_fault+0x22/0x30
[   15.775409]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775410]  ? exc_page_fault+0x11/0x120
[   15.775412]  ? asm_exc_page_fault+0x22/0x30
[   15.775413]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775414]  ? exc_page_fault+0x11/0x120
[   15.775415]  ? asm_exc_page_fault+0x22/0x30
[   15.775416]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775420]  ? exc_page_fault+0x11/0x120
[   15.775421]  ? asm_exc_page_fault+0x22/0x30
[   15.775422]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775423]  ? exc_page_fault+0x11/0x120
[   15.775425]  ? asm_exc_page_fault+0x22/0x30
[   15.775426]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775427]  ? exc_page_fault+0x11/0x120
[   15.775431]  ? asm_exc_page_fault+0x22/0x30
[   15.775432]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775433]  ? exc_page_fault+0x11/0x120
[   15.775435]  ? asm_exc_page_fault+0x22/0x30
[   15.775436]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775437]  ? exc_page_fault+0x11/0x120
[   15.775438]  ? asm_exc_page_fault+0x22/0x30
[   15.775439]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775440]  ? exc_page_fault+0x11/0x120
[   15.775441]  ? asm_exc_page_fault+0x22/0x30
[   15.775442]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775443]  ? exc_page_fault+0x11/0x120
[   15.775445]  ? asm_exc_page_fault+0x22/0x30
[   15.775446]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775447]  ? exc_page_fault+0x11/0x120
[   15.775448]  ? asm_exc_page_fault+0x22/0x30
[   15.775449]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775450]  ? exc_page_fault+0x11/0x120
[   15.775454]  ? asm_exc_page_fault+0x22/0x30
[   15.775455]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775456]  ? exc_page_fault+0x11/0x120
[   15.775458]  ? asm_exc_page_fault+0x22/0x30
[   15.775459]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775460]  ? exc_page_fault+0x11/0x120
[   15.775461]  ? asm_exc_page_fault+0x22/0x30
[   15.775462]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775463]  ? exc_page_fault+0x11/0x120
[   15.775465]  ? asm_exc_page_fault+0x22/0x30
[   15.775466]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775467]  ? exc_page_fault+0x11/0x120
[   15.775468]  ? asm_exc_page_fault+0x22/0x30
[   15.775469]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775470]  ? exc_page_fault+0x11/0x120
[   15.775471]  ? asm_exc_page_fault+0x22/0x30
[   15.775472]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775473]  ? exc_page_fault+0x11/0x120
[   15.775475]  ? asm_exc_page_fault+0x22/0x30
[   15.775476]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775477]  ? exc_page_fault+0x11/0x120
[   15.775478]  ? asm_exc_page_fault+0x22/0x30
[   15.775482]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775483]  ? exc_page_fault+0x11/0x120
[   15.775485]  ? asm_exc_page_fault+0x22/0x30
[   15.775486]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775487]  ? exc_page_fault+0x11/0x120
[   15.775488]  ? asm_exc_page_fault+0x22/0x30
[   15.775490]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775491]  ? exc_page_fault+0x11/0x120
[   15.775492]  ? asm_exc_page_fault+0x22/0x30
[   15.775493]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775494]  ? exc_page_fault+0x11/0x120
[   15.775495]  ? asm_exc_page_fault+0x22/0x30
[   15.775496]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775497]  ? exc_page_fault+0x11/0x120
[   15.775499]  ? asm_exc_page_fault+0x22/0x30
[   15.775500]  ? check_preemption_disabled+0x8/0xe0
[   15.775502]  ? __sev_es_ist_enter+0x13/0x100
[   15.775503]  ? exc_nmi+0x10e/0x150
[   15.775505]  ? end_repeat_nmi+0x16/0x67
[   15.775506]  ? asm_exc_double_fault+0x30/0x30
[   15.775507]  ? asm_exc_double_fault+0x30/0x30
[   15.775508]  ? asm_exc_double_fault+0x30/0x30
[   15.775509]  </NMI>
[   15.775509]  <#VC>
[   15.775510]  ? __show_regs.cold+0x18e/0x23d
[   15.775511]  </#VC>
[   15.775511]  <#DF>
[   15.775512]  ? __die_body.cold+0x1a/0x1f
[   15.775513]  ? die+0x26/0x40
[   15.775517]  ? handle_stack_overflow+0x44/0x60
[   15.775518]  ? exc_double_fault+0x14b/0x180
[   15.775519]  ? asm_exc_double_fault+0x1f/0x30
[   15.775520]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775521]  ? asm_exc_page_fault+0x9/0x30
[   15.775522]  ? error_entry+0x17/0x140
[   15.775523]  </#DF>
[   15.775523] WARNING: stack recursion on stack type 6
[   15.775524] Modules linked in: msr efivarfs
[   15.837935] ---[ end trace 0000000000000000 ]---

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
 arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
 arch/x86/kernel/nmi.c           |  2 +-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index b049d950612f..396e2ea114dc 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7)
 		set_debugreg(dr7, 7);
 }
 
+/* noinline here inline __always_inline'd native_get_debugreg(int regno) */
+static noinline unsigned long native_get_debugreg7(void)
+{
+	unsigned long dr7;
+	asm("mov %%db7, %0" :"=r" (dr7));
+	return dr7;
+}
+
+static __always_inline unsigned long local_db_save_exc_nmi(void)
+{
+	unsigned long dr7;
+
+	if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
+		return 0;
+
+	dr7 = native_get_debugreg7();
+	dr7 &= ~0x400; /* architecturally set bit */
+	if (dr7)
+		set_debugreg(0, 7);
+	/*
+	 * Ensure the compiler doesn't lower the above statements into
+	 * the critical section; disabling breakpoints late would not
+	 * be good.
+	 */
+	barrier();
+
+	return dr7;
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 extern void set_dr_addr_mask(unsigned long mask, int dr);
 #else
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index cec0bfa3bc04..400b5b6b74f6 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 	 */
 	sev_es_ist_enter(regs);
 
-	this_cpu_write(nmi_dr7, local_db_save());
+	this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
 
 	irq_state = irqentry_nmi_enter(regs);
 
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-27  3:56 [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?) Alexey Kardashevskiy
@ 2023-01-27  9:08 ` Peter Zijlstra
  2023-01-27 10:37   ` Joerg Roedel
  2023-01-27 12:13   ` Alexey Kardashevskiy
  2023-01-31 10:37 ` [tip: x86/urgent] x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses tip-bot2 for Joerg Roedel
  2023-01-31 11:57 ` tip-bot2 for Joerg Roedel
  2 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-01-27  9:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, x86, linux-kernel, Thomas Gleixner, Sean Christopherson,
	Jiri Kosina, Ingo Molnar, Dave Hansen, Borislav Petkov,
	H. Peter Anvin, Tom Lendacky, joro

On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote:
> AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
> "perf" executes. "./perf record sleep 20" is an example.
>
> Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
> defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
> and actual reading of DB7 (which in turn causes #VC) every function is
> inlined

Very much on purpose.

> and no stack frame is created (?).

Silly compilers ;-) I think you can force it to by using inline asm with
a rsp dependency or somesuch.

> Replacing __always_inline with
> noinline in  local_db_save() or native_get_debugreg() fixes the problem.

It will create others.

> Why is this crash happening and how to fix that? I am still reading
> the assembly but was hoping for a shortcut here :) Thanks,

Welcome to the wonderful shit show that is x86 exceptions :/

I thought sev_es_*() is supposed to fix this. Joerg, any clue?

> aik-Standard-PC-i440FX-PIIX-1996 login: ^[[A[   15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
> [   15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> [   15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
> [   15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
> [   15.775323] RIP: 0010:error_entry+0x17/0x140
> [   15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
> [   15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
> [   15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
> [   15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
> [   15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
> [   15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [   15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
> [   15.775339] FS:  0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
> [   15.775340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
> [   15.775342] Call Trace:
> [   15.775352]  <NMI>

<snip>

> [   15.775495]  ? asm_exc_page_fault+0x22/0x30
> [   15.775496]  ? restore_regs_and_return_to_kernel+0x22/0x22
> [   15.775497]  ? exc_page_fault+0x11/0x120
> [   15.775499]  ? asm_exc_page_fault+0x22/0x30
> [   15.775500]  ? check_preemption_disabled+0x8/0xe0
> [   15.775502]  ? __sev_es_ist_enter+0x13/0x100
> [   15.775503]  ? exc_nmi+0x10e/0x150
> [   15.775505]  ? end_repeat_nmi+0x16/0x67
> [   15.775506]  ? asm_exc_double_fault+0x30/0x30
> [   15.775507]  ? asm_exc_double_fault+0x30/0x30
> [   15.775508]  ? asm_exc_double_fault+0x30/0x30
> [   15.775509]  </NMI>
> [   15.775509]  <#VC>
> [   15.775510]  ? __show_regs.cold+0x18e/0x23d
> [   15.775511]  </#VC>
> [   15.775511]  <#DF>
> [   15.775512]  ? __die_body.cold+0x1a/0x1f
> [   15.775513]  ? die+0x26/0x40
> [   15.775517]  ? handle_stack_overflow+0x44/0x60
> [   15.775518]  ? exc_double_fault+0x14b/0x180
> [   15.775519]  ? asm_exc_double_fault+0x1f/0x30
> [   15.775520]  ? restore_regs_and_return_to_kernel+0x22/0x22
> [   15.775521]  ? asm_exc_page_fault+0x9/0x30
> [   15.775522]  ? error_entry+0x17/0x140
> [   15.775523]  </#DF>
> [   15.775523] WARNING: stack recursion on stack type 6
> [   15.775524] Modules linked in: msr efivarfs
> [   15.837935] ---[ end trace 0000000000000000 ]---
> 
> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> ---
>  arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
>  arch/x86/kernel/nmi.c           |  2 +-
>  2 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
> index b049d950612f..396e2ea114dc 100644
> --- a/arch/x86/include/asm/debugreg.h
> +++ b/arch/x86/include/asm/debugreg.h
> @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7)
>  		set_debugreg(dr7, 7);
>  }
>  
> +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */
> +static noinline unsigned long native_get_debugreg7(void)
> +{
> +	unsigned long dr7;
> +	asm("mov %%db7, %0" :"=r" (dr7));
> +	return dr7;
> +}
> +
> +static __always_inline unsigned long local_db_save_exc_nmi(void)
> +{
> +	unsigned long dr7;
> +
> +	if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
> +		return 0;
> +
> +	dr7 = native_get_debugreg7();
> +	dr7 &= ~0x400; /* architecturally set bit */
> +	if (dr7)
> +		set_debugreg(0, 7);
> +	/*
> +	 * Ensure the compiler doesn't lower the above statements into
> +	 * the critical section; disabling breakpoints late would not
> +	 * be good.
> +	 */
> +	barrier();
> +
> +	return dr7;
> +}

This is broken, and building with DEBUG_ENTRY=y would've told you.

> +
>  #ifdef CONFIG_CPU_SUP_AMD
>  extern void set_dr_addr_mask(unsigned long mask, int dr);
>  #else
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index cec0bfa3bc04..400b5b6b74f6 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
>  	 */
>  	sev_es_ist_enter(regs);
>  
> -	this_cpu_write(nmi_dr7, local_db_save());
> +	this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
>  
>  	irq_state = irqentry_nmi_enter(regs);

So what I don't get is why sev_es_ist_enter() doesn't cause us to make a
stack frame, that has an actual call in it (although admittedly
conditional).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-27  9:08 ` Peter Zijlstra
@ 2023-01-27 10:37   ` Joerg Roedel
  2023-01-27 11:56     ` Alexey Kardashevskiy
  2023-01-27 12:13   ` Alexey Kardashevskiy
  1 sibling, 1 reply; 19+ messages in thread
From: Joerg Roedel @ 2023-01-27 10:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexey Kardashevskiy, kvm, x86, linux-kernel, Thomas Gleixner,
	Sean Christopherson, Jiri Kosina, Ingo Molnar, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Tom Lendacky

On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote:
> Welcome to the wonderful shit show that is x86 exceptions :/
> 
> I thought sev_es_*() is supposed to fix this. Joerg, any clue?

Hmm, no, not yet, the stack-trace doesn't make much sense to me. The
sev_es_* function calls in the NMI path are for re-enabling NMI and
adjusting the #VC IST stack to allow nested VCs.

Alexey, can you try to get a more stable backtrace? For example by
building the kernel with frame pointers?

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-27 10:37   ` Joerg Roedel
@ 2023-01-27 11:56     ` Alexey Kardashevskiy
  2023-01-27 12:59       ` Joerg Roedel
  2023-01-27 17:25       ` Joerg Roedel
  0 siblings, 2 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2023-01-27 11:56 UTC (permalink / raw)
  To: Joerg Roedel, Peter Zijlstra
  Cc: kvm, x86, linux-kernel, Thomas Gleixner, Sean Christopherson,
	Jiri Kosina, Ingo Molnar, Dave Hansen, Borislav Petkov,
	H. Peter Anvin, Tom Lendacky



On 27/1/23 21:37, Joerg Roedel wrote:
> On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote:
>> Welcome to the wonderful shit show that is x86 exceptions :/
>>
>> I thought sev_es_*() is supposed to fix this. Joerg, any clue?
> 
> Hmm, no, not yet, the stack-trace doesn't make much sense to me. The
> sev_es_* function calls in the NMI path are for re-enabling NMI and
> adjusting the #VC IST stack to allow nested VCs.
> 
> Alexey, can you try to get a more stable backtrace? For example by
> building the kernel with frame pointers?

Do you mean these guys (below)?

aik@aik-Standard-PC-i440FX-PIIX-1996:~$ grep FRAME_POINTER 
/boot/config-$(uname -r)
CONFIG_SCHED_OMIT_FRAME_POINTER=y 

CONFIG_FRAME_POINTER=y 

CONFIG_UNWINDER_FRAME_POINTER=y


Here is the complete output of that VM (200k so not in the email):

https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc

Note that the original long backtrace appears more than once, I just did 
not copy all of that in the first email.

-- 
Alexey

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-27  9:08 ` Peter Zijlstra
  2023-01-27 10:37   ` Joerg Roedel
@ 2023-01-27 12:13   ` Alexey Kardashevskiy
  2023-01-27 12:41     ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Alexey Kardashevskiy @ 2023-01-27 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, x86, linux-kernel, Thomas Gleixner, Sean Christopherson,
	Jiri Kosina, Ingo Molnar, Dave Hansen, Borislav Petkov,
	H. Peter Anvin, Tom Lendacky, joro



On 27/1/23 20:08, Peter Zijlstra wrote:
> On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote:
>> AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
>> "perf" executes. "./perf record sleep 20" is an example.
>>
>> Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
>> defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
>> and actual reading of DB7 (which in turn causes #VC) every function is
>> inlined
> 
> Very much on purpose.
> 
>> and no stack frame is created (?).
> 
> Silly compilers ;-) I think you can force it to by using inline asm with
> a rsp dependency or somesuch.
> 
>> Replacing __always_inline with
>> noinline in  local_db_save() or native_get_debugreg() fixes the problem.
> 
> It will create others.
> 
>> Why is this crash happening and how to fix that? I am still reading
>> the assembly but was hoping for a shortcut here :) Thanks,
> 
> Welcome to the wonderful shit show that is x86 exceptions :/
> 
> I thought sev_es_*() is supposed to fix this. Joerg, any clue?
> 
>> aik-Standard-PC-i440FX-PIIX-1996 login: ^[[A[   15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
>> [   15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
>> [   15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
>> [   15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
>> [   15.775323] RIP: 0010:error_entry+0x17/0x140
>> [   15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
>> [   15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
>> [   15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
>> [   15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
>> [   15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
>> [   15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> [   15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
>> [   15.775339] FS:  0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
>> [   15.775340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
>> [   15.775342] Call Trace:
>> [   15.775352]  <NMI>
> 
> <snip>
> 
>> [   15.775495]  ? asm_exc_page_fault+0x22/0x30
>> [   15.775496]  ? restore_regs_and_return_to_kernel+0x22/0x22
>> [   15.775497]  ? exc_page_fault+0x11/0x120
>> [   15.775499]  ? asm_exc_page_fault+0x22/0x30
>> [   15.775500]  ? check_preemption_disabled+0x8/0xe0
>> [   15.775502]  ? __sev_es_ist_enter+0x13/0x100
>> [   15.775503]  ? exc_nmi+0x10e/0x150
>> [   15.775505]  ? end_repeat_nmi+0x16/0x67
>> [   15.775506]  ? asm_exc_double_fault+0x30/0x30
>> [   15.775507]  ? asm_exc_double_fault+0x30/0x30
>> [   15.775508]  ? asm_exc_double_fault+0x30/0x30
>> [   15.775509]  </NMI>
>> [   15.775509]  <#VC>
>> [   15.775510]  ? __show_regs.cold+0x18e/0x23d
>> [   15.775511]  </#VC>
>> [   15.775511]  <#DF>
>> [   15.775512]  ? __die_body.cold+0x1a/0x1f
>> [   15.775513]  ? die+0x26/0x40
>> [   15.775517]  ? handle_stack_overflow+0x44/0x60
>> [   15.775518]  ? exc_double_fault+0x14b/0x180
>> [   15.775519]  ? asm_exc_double_fault+0x1f/0x30
>> [   15.775520]  ? restore_regs_and_return_to_kernel+0x22/0x22
>> [   15.775521]  ? asm_exc_page_fault+0x9/0x30
>> [   15.775522]  ? error_entry+0x17/0x140
>> [   15.775523]  </#DF>
>> [   15.775523] WARNING: stack recursion on stack type 6
>> [   15.775524] Modules linked in: msr efivarfs
>> [   15.837935] ---[ end trace 0000000000000000 ]---
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
>> ---
>>   arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
>>   arch/x86/kernel/nmi.c           |  2 +-
>>   2 files changed, 30 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
>> index b049d950612f..396e2ea114dc 100644
>> --- a/arch/x86/include/asm/debugreg.h
>> +++ b/arch/x86/include/asm/debugreg.h
>> @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7)
>>   		set_debugreg(dr7, 7);
>>   }
>>   
>> +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */
>> +static noinline unsigned long native_get_debugreg7(void)
>> +{
>> +	unsigned long dr7;
>> +	asm("mov %%db7, %0" :"=r" (dr7));
>> +	return dr7;
>> +}
>> +
>> +static __always_inline unsigned long local_db_save_exc_nmi(void)
>> +{
>> +	unsigned long dr7;
>> +
>> +	if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
>> +		return 0;
>> +
>> +	dr7 = native_get_debugreg7();
>> +	dr7 &= ~0x400; /* architecturally set bit */
>> +	if (dr7)
>> +		set_debugreg(0, 7);
>> +	/*
>> +	 * Ensure the compiler doesn't lower the above statements into
>> +	 * the critical section; disabling breakpoints late would not
>> +	 * be good.
>> +	 */
>> +	barrier();
>> +
>> +	return dr7;
>> +}
> 
> This is broken, and building with DEBUG_ENTRY=y would've told you.


Huh, good to know. Is this it telling me so?

vmlinux.o: warning: objtool: exc_nmi+0x73: call to 
native_get_debugreg7() leaves .noinstr.text section



>> +
>>   #ifdef CONFIG_CPU_SUP_AMD
>>   extern void set_dr_addr_mask(unsigned long mask, int dr);
>>   #else
>> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
>> index cec0bfa3bc04..400b5b6b74f6 100644
>> --- a/arch/x86/kernel/nmi.c
>> +++ b/arch/x86/kernel/nmi.c
>> @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
>>   	 */
>>   	sev_es_ist_enter(regs);
>>   
>> -	this_cpu_write(nmi_dr7, local_db_save());
>> +	this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
>>   
>>   	irq_state = irqentry_nmi_enter(regs);
> 
> So what I don't get is why sev_es_ist_enter() doesn't cause us to make a
> stack frame, that has an actual call in it (although admittedly
> conditional).

Is not the frame gone when sev_es_ist_enter() (which does not get 
inlined as per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>:
") returned?



-- 
Alexey

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-27 12:13   ` Alexey Kardashevskiy
@ 2023-01-27 12:41     ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2023-01-27 12:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, x86, linux-kernel, Thomas Gleixner, Sean Christopherson,
	Jiri Kosina, Ingo Molnar, Dave Hansen, Borislav Petkov,
	H. Peter Anvin, Tom Lendacky, joro

On Fri, Jan 27, 2023 at 11:13:38PM +1100, Alexey Kardashevskiy wrote:

> > This is broken, and building with DEBUG_ENTRY=y would've told you.
> 
> 
> Huh, good to know. Is this it telling me so?
> 
> vmlinux.o: warning: objtool: exc_nmi+0x73: call to native_get_debugreg7()
> leaves .noinstr.text section
> 

Yep. The ramification of all that is that by calling non-noinstr code
(double negative, iow, regular instrumented code) is that you can end up
in the tracers/*SAN/breakpoints etc.. code -- something we're very much
not ready for at this point.

> > > +
> > >   #ifdef CONFIG_CPU_SUP_AMD
> > >   extern void set_dr_addr_mask(unsigned long mask, int dr);
> > >   #else
> > > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> > > index cec0bfa3bc04..400b5b6b74f6 100644
> > > --- a/arch/x86/kernel/nmi.c
> > > +++ b/arch/x86/kernel/nmi.c
> > > @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
> > >   	 */
> > >   	sev_es_ist_enter(regs);
> > > -	this_cpu_write(nmi_dr7, local_db_save());
> > > +	this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
> > >   	irq_state = irqentry_nmi_enter(regs);
> > 
> > So what I don't get is why sev_es_ist_enter() doesn't cause us to make a
> > stack frame, that has an actual call in it (although admittedly
> > conditional).
> 
> Is not the frame gone when sev_es_ist_enter() (which does not get inlined as
> per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>:
> ") returned?

Well, returning would consume the callframe, but the stack setup of the
caller should remain. Let me go stare at some asm.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-27 11:56     ` Alexey Kardashevskiy
@ 2023-01-27 12:59       ` Joerg Roedel
  2023-01-27 17:25       ` Joerg Roedel
  1 sibling, 0 replies; 19+ messages in thread
From: Joerg Roedel @ 2023-01-27 12:59 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Zijlstra, kvm, x86, linux-kernel, Thomas Gleixner,
	Sean Christopherson, Jiri Kosina, Ingo Molnar, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Tom Lendacky

On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
> Here is the complete output of that VM (200k so not in the email):
> 
> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc

Thanks. So looking at the code in the traces:

Code starting with the faulting instruction
===========================================
   0:	65 48 8b 04 25 c0 db 	mov    %gs:0x2dbc0,%rax
   7:	02 00
   9:	48 8b 80 a8 08 00 00 	mov    0x8a8(%rax),%rax
  10:	0f 0d 48 70          	prefetchw 0x70(%rax)
  14:	e8                   	.byte 0xe8
  15:	82                   	.byte 0x82

I think the fault in the page-fault handler happens here:

DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
        unsigned long address = read_cr2();
        irqentry_state_t state;

        prefetchw(&current->mm->mmap_lock); <--- Here

To be precise, it faults while dereferencing current. That means that
GS_BASE is likely broken, need to find out why...

This at least explains why it page-faults in a loop until the stack
overflows and the guard page is hit.

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-27 11:56     ` Alexey Kardashevskiy
  2023-01-27 12:59       ` Joerg Roedel
@ 2023-01-27 17:25       ` Joerg Roedel
  2023-01-28 11:24         ` Alexey Kardashevskiy
  1 sibling, 1 reply; 19+ messages in thread
From: Joerg Roedel @ 2023-01-27 17:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Zijlstra, kvm, x86, linux-kernel, Thomas Gleixner,
	Sean Christopherson, Jiri Kosina, Ingo Molnar, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Tom Lendacky

On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc

Okay, I reproduced the problem here and the root cause turned out to be
that the compiler moved the DR7 read instruction before the 5-byte NOP
which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
guaranteed to cause #VC exception stack recursion if the NMI was
triggered on the #VC stack, and that leads to all kinds of undefined
behavior.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-27 17:25       ` Joerg Roedel
@ 2023-01-28 11:24         ` Alexey Kardashevskiy
  2023-01-28 13:52           ` Joerg Roedel
  2023-01-30 17:30           ` H. Peter Anvin
  0 siblings, 2 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2023-01-28 11:24 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Peter Zijlstra, kvm, x86, linux-kernel, Thomas Gleixner,
	Sean Christopherson, Jiri Kosina, Ingo Molnar, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Tom Lendacky



On 28/1/23 04:25, Joerg Roedel wrote:
> On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
>> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc
> 
> Okay, I reproduced the problem here and the root cause turned out to be
> that the compiler moved the DR7 read instruction before the 5-byte NOP
> which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
> guaranteed to cause #VC exception stack recursion if the NMI was
> triggered on the #VC stack, and that leads to all kinds of undefined
> behavior.

Cool!

(out of curiosity) where do you see these NOPs? "objdump -D vmlinux" 
does not show any, is this after lifepatching?

Meanwhile, this seems to be doing the right thing:


diff --git a/arch/x86/include/asm/debugreg.h 
b/arch/x86/include/asm/debugreg.h
index b049d950612f..687b15297057 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -39,7 +39,7 @@ static __always_inline unsigned long 
native_get_debugreg(int regno)
                 asm("mov %%db6, %0" :"=r" (val));
                 break;
         case 7:
-               asm("mov %%db7, %0" :"=r" (val));
+               asm volatile ("mov %%db7, %0" :"=r" (val));



-- 
Alexey

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-28 11:24         ` Alexey Kardashevskiy
@ 2023-01-28 13:52           ` Joerg Roedel
  2023-01-30  9:17             ` Joerg Roedel
  2023-01-30 17:30           ` H. Peter Anvin
  1 sibling, 1 reply; 19+ messages in thread
From: Joerg Roedel @ 2023-01-28 13:52 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Zijlstra, kvm, x86, linux-kernel, Thomas Gleixner,
	Sean Christopherson, Jiri Kosina, Ingo Molnar, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Tom Lendacky

On Sat, Jan 28, 2023 at 10:24:56PM +1100, Alexey Kardashevskiy wrote:
> (out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does
> not show any, is this after lifepatching?

Here is the disassembly of exc_nmi of a kernel built from tip/master
with CONFIG_PARAVIRT=n:

<exc_nmi>:
       41 54                   push   %r12
       55                      push   %rbp
       48 89 fd                mov    %rdi,%rbp
       53                      push   %rbx
       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
       65 8b 05 69 66 41 7e    mov    %gs:0x7e416669(%rip),%eax        # 3254c <pcpu_hot+0xc>
       48 98                   cltq
       48 0f a3 05 33 00 2b    bt     %rax,0x12b0033(%rip)        # ffffffff82ecbf20 <__cpu_online_mask>
       01 
       0f 83 c9 00 00 00       jae    ffffffff81c1bfbc <exc_nmi+0xec>
       65 8b 05 f6 41 40 7e    mov    %gs:0x7e4041f6(%rip),%eax        # 200f0 <nmi_state>
       85 c0                   test   %eax,%eax
       0f 85 f8 00 00 00       jne    ffffffff81c1bffa <exc_nmi+0x12a>
       65 c7 05 e3 41 40 7e    movl   $0x1,%gs:0x7e4041e3(%rip)        # 200f0 <nmi_state>
       01 00 00 00 
       0f 20 d0                mov    %cr2,%rax
       65 48 89 05 d0 41 40    mov    %rax,%gs:0x7e4041d0(%rip)        # 200e8 <nmi_cr2>
       7e 
       41 0f 21 fc             mov    %db7,%r12			<-- here is the DR7 read
       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)		<-- here are the NOPS that become a
       								    call to sev_es_ist_enter() in
								    SEV-ES guests

The DR7 read will cause a #VC exception, switching to the #VC IST stack.
If the NMI was raised while already on the #VC IST stack, this DR7 read
will overwrite the previous stack frame and cause stack recursion, with
all funny side effects.


> diff --git a/arch/x86/include/asm/debugreg.h
> b/arch/x86/include/asm/debugreg.h
> index b049d950612f..687b15297057 100644
> --- a/arch/x86/include/asm/debugreg.h
> +++ b/arch/x86/include/asm/debugreg.h
> @@ -39,7 +39,7 @@ static __always_inline unsigned long
> native_get_debugreg(int regno)
>                 asm("mov %%db6, %0" :"=r" (val));
>                 break;
>         case 7:
> -               asm("mov %%db7, %0" :"=r" (val));
> +               asm volatile ("mov %%db7, %0" :"=r" (val));

Yeah, something like this will be the fix. I am still thinking about
the right place to put the volatile to make it explicit to the situation
we are encountering here (which is SEV-ES specific).

Best would be an explicit barrier in C code between sev_es_ist_enter()
and the DR7 read, but all barriers I tried to far only seem to affect
memory instructions and had no influence on the DR7 read (which is
obviously not considered as a memory read by the compiler).

The best place to put the barrier is in the sev_es_ist_enter() inline
function, right after the static_call to __sev_es_ist_enter().

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-28 13:52           ` Joerg Roedel
@ 2023-01-30  9:17             ` Joerg Roedel
  0 siblings, 0 replies; 19+ messages in thread
From: Joerg Roedel @ 2023-01-30  9:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Peter Zijlstra, kvm, x86, linux-kernel, Thomas Gleixner,
	Sean Christopherson, Jiri Kosina, Ingo Molnar, Dave Hansen,
	Borislav Petkov, H. Peter Anvin, Tom Lendacky

On Sat, Jan 28, 2023 at 02:52:23PM +0100, Joerg Roedel wrote:
> Yeah, something like this will be the fix. I am still thinking about
> the right place to put the volatile to make it explicit to the situation
> we are encountering here (which is SEV-ES specific).
> 
> Best would be an explicit barrier in C code between sev_es_ist_enter()
> and the DR7 read, but all barriers I tried to far only seem to affect
> memory instructions and had no influence on the DR7 read (which is
> obviously not considered as a memory read by the compiler).
> 
> The best place to put the barrier is in the sev_es_ist_enter() inline
> function, right after the static_call to __sev_es_ist_enter().

Okay, after some investigation I was not able to find a compiler barrier
which affects DR7 read ordering. This leaves us with the only solution
of directly forbidding DR7 register access re-ordering by adding a
volatile to the asm, like you did before.

I will send a fix later today.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-28 11:24         ` Alexey Kardashevskiy
  2023-01-28 13:52           ` Joerg Roedel
@ 2023-01-30 17:30           ` H. Peter Anvin
  2023-01-30 18:04             ` Borislav Petkov
  2023-01-31  8:57             ` Joerg Roedel
  1 sibling, 2 replies; 19+ messages in thread
From: H. Peter Anvin @ 2023-01-30 17:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Joerg Roedel
  Cc: Peter Zijlstra, kvm, x86, linux-kernel, Thomas Gleixner,
	Sean Christopherson, Jiri Kosina, Ingo Molnar, Dave Hansen,
	Borislav Petkov, Tom Lendacky

On January 28, 2023 3:24:56 AM PST, Alexey Kardashevskiy <aik@amd.com> wrote:
>
>
>On 28/1/23 04:25, Joerg Roedel wrote:
>> On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
>>> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc
>> 
>> Okay, I reproduced the problem here and the root cause turned out to be
>> that the compiler moved the DR7 read instruction before the 5-byte NOP
>> which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
>> guaranteed to cause #VC exception stack recursion if the NMI was
>> triggered on the #VC stack, and that leads to all kinds of undefined
>> behavior.
>
>Cool!
>
>(out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does not show any, is this after lifepatching?
>
>Meanwhile, this seems to be doing the right thing:
>
>
>diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
>index b049d950612f..687b15297057 100644
>--- a/arch/x86/include/asm/debugreg.h
>+++ b/arch/x86/include/asm/debugreg.h
>@@ -39,7 +39,7 @@ static __always_inline unsigned long native_get_debugreg(int regno)
>                asm("mov %%db6, %0" :"=r" (val));
>                break;
>        case 7:
>-               asm("mov %%db7, %0" :"=r" (val));
>+               asm volatile ("mov %%db7, %0" :"=r" (val));
>
>
>

It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is not... %dr6 is the status register!

I believe they should all be volatile (the compiler semantics is that volatile operations are always executed exactly once, in strict program order with respect to any other volatile operations); the real question is if there should also be memory clobbers on %dr6 reads and any %dr write.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-30 17:30           ` H. Peter Anvin
@ 2023-01-30 18:04             ` Borislav Petkov
  2023-01-31  8:57             ` Joerg Roedel
  1 sibling, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2023-01-30 18:04 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Kardashevskiy, Joerg Roedel, Peter Zijlstra, kvm, x86,
	linux-kernel, Thomas Gleixner, Sean Christopherson, Jiri Kosina,
	Ingo Molnar, Dave Hansen, Tom Lendacky

On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote:
> It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is
> not... %dr6 is the status register!

Yeah, as a precaution I think we should make all those volatile. Just in
case.
 
> I believe they should all be volatile (the compiler semantics is that
> volatile operations are always executed exactly once, in strict
> program order with respect to any other volatile operations); the real
> question is if there should also be memory clobbers on %dr6 reads and
> any %dr write.

Yes, I think so too. From gcc docs:


"6.47.2.1 Volatile
.................

...

Note that the compiler can move even 'volatile asm' instructions
relative to other code, including across jump instructions."

We already have __FORCE_ORDER for exactly things like that.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-30 17:30           ` H. Peter Anvin
  2023-01-30 18:04             ` Borislav Petkov
@ 2023-01-31  8:57             ` Joerg Roedel
  2023-01-31 15:53               ` Sean Christopherson
  1 sibling, 1 reply; 19+ messages in thread
From: Joerg Roedel @ 2023-01-31  8:57 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexey Kardashevskiy, Peter Zijlstra, kvm, x86, linux-kernel,
	Thomas Gleixner, Sean Christopherson, Jiri Kosina, Ingo Molnar,
	Dave Hansen, Borislav Petkov, Tom Lendacky

On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote:
> It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is
> not... %dr6 is the status register!

The reason is that on SEV-ES only accesses to DR7 will cause #VC
exceptions, DR0-DR6 are not intercepted.

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [tip: x86/urgent] x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses
  2023-01-27  3:56 [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?) Alexey Kardashevskiy
  2023-01-27  9:08 ` Peter Zijlstra
@ 2023-01-31 10:37 ` tip-bot2 for Joerg Roedel
  2023-01-31 11:57 ` tip-bot2 for Joerg Roedel
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot2 for Joerg Roedel @ 2023-01-31 10:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexey Kardashevskiy, Joerg Roedel, Borislav Petkov (AMD),
	stable, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     31859551393bc00f705cae2e1f9d31b80c62f365
Gitweb:        https://git.kernel.org/tip/31859551393bc00f705cae2e1f9d31b80c62f365
Author:        Joerg Roedel <jroedel@suse.de>
AuthorDate:    Tue, 31 Jan 2023 09:57:18 +01:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Tue, 31 Jan 2023 11:26:15 +01:00

x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses

In kernels compiled with CONFIG_PARAVIRT=n, the compiler re-orders the
DR7 read in exc_nmi() to happen before the call to sev_es_ist_enter().

This is problematic when running as an SEV-ES guest because in this
environment the DR7 read might cause a #VC exception, and taking #VC
exceptions is not safe in exc_nmi() before sev_es_ist_enter() has run.

The result is stack recursion if the NMI was caused on the #VC IST
stack, because a subsequent #VC exception in the NMI handler will
overwrite the stack frame of the interrupted #VC handler.

As there are no compiler barriers affecting the ordering of DR7
reads/writes, make the accesses to this register volatile, forbidding
the compiler to re-order them.

  [ bp: Massage text, make them volatile too, to make sure some
  aggressive compiler optimization pass doesn't discard them. ]

Fixes: 315562c9af3d ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Reported-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230127035616.508966-1-aik@amd.com
---
 arch/x86/include/asm/debugreg.h | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index b049d95..ff1a924 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -39,7 +39,20 @@ static __always_inline unsigned long native_get_debugreg(int regno)
 		asm("mov %%db6, %0" :"=r" (val));
 		break;
 	case 7:
-		asm("mov %%db7, %0" :"=r" (val));
+		/*
+		 * Apply __FORCE_ORDER to DR7 reads to forbid re-ordering them
+		 * with other code.
+		 *
+		 * This is needed because a DR7 access can cause a #VC exception
+		 * when running under SEV-ES. Taking a #VC exception is not a
+		 * safe thing to do just anywhere in the entry code and
+		 * re-ordering might place the access into an unsafe location.
+		 *
+		 * This happened in the NMI handler, where the DR7 read was
+		 * re-ordered to happen before the call to sev_es_ist_enter(),
+		 * causing stack recursion.
+		 */
+		asm volatile("mov %%db7, %0" : "=r" (val) : __FORCE_ORDER);
 		break;
 	default:
 		BUG();
@@ -66,8 +79,16 @@ static __always_inline void native_set_debugreg(int regno, unsigned long value)
 		asm("mov %0, %%db6"	::"r" (value));
 		break;
 	case 7:
-		asm("mov %0, %%db7"	::"r" (value));
-		break;
+		/*
+		 * Apply __FORCE_ORDER to DR7 writes to forbid re-ordering them
+		 * with other code.
+		 *
+		 * While is didn't happen with a DR7 write (see the DR7 read
+		 * comment above which explains where it happened), add the
+		 * __FORCE_ORDER here too to avoid similar problems in the
+		 * future.
+		 */
+		asm volatile("mov %0, %%db7"	::"r" (value), __FORCE_ORDER); break;
 	default:
 		BUG();
 	}

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [tip: x86/urgent] x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses
  2023-01-27  3:56 [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?) Alexey Kardashevskiy
  2023-01-27  9:08 ` Peter Zijlstra
  2023-01-31 10:37 ` [tip: x86/urgent] x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses tip-bot2 for Joerg Roedel
@ 2023-01-31 11:57 ` tip-bot2 for Joerg Roedel
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot2 for Joerg Roedel @ 2023-01-31 11:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexey Kardashevskiy, Joerg Roedel, Borislav Petkov (AMD),
	stable, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     9d2c7203ffdb846399b82b0660563c89e918c751
Gitweb:        https://git.kernel.org/tip/9d2c7203ffdb846399b82b0660563c89e918c751
Author:        Joerg Roedel <jroedel@suse.de>
AuthorDate:    Tue, 31 Jan 2023 09:57:18 +01:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Tue, 31 Jan 2023 12:51:19 +01:00

x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses

In kernels compiled with CONFIG_PARAVIRT=n, the compiler re-orders the
DR7 read in exc_nmi() to happen before the call to sev_es_ist_enter().

This is problematic when running as an SEV-ES guest because in this
environment the DR7 read might cause a #VC exception, and taking #VC
exceptions is not safe in exc_nmi() before sev_es_ist_enter() has run.

The result is stack recursion if the NMI was caused on the #VC IST
stack, because a subsequent #VC exception in the NMI handler will
overwrite the stack frame of the interrupted #VC handler.

As there are no compiler barriers affecting the ordering of DR7
reads/writes, make the accesses to this register volatile, forbidding
the compiler to re-order them.

  [ bp: Massage text, make them volatile too, to make sure some
  aggressive compiler optimization pass doesn't discard them. ]

Fixes: 315562c9af3d ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Reported-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230127035616.508966-1-aik@amd.com
---
 arch/x86/include/asm/debugreg.h | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index b049d95..ca97442 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -39,7 +39,20 @@ static __always_inline unsigned long native_get_debugreg(int regno)
 		asm("mov %%db6, %0" :"=r" (val));
 		break;
 	case 7:
-		asm("mov %%db7, %0" :"=r" (val));
+		/*
+		 * Apply __FORCE_ORDER to DR7 reads to forbid re-ordering them
+		 * with other code.
+		 *
+		 * This is needed because a DR7 access can cause a #VC exception
+		 * when running under SEV-ES. Taking a #VC exception is not a
+		 * safe thing to do just anywhere in the entry code and
+		 * re-ordering might place the access into an unsafe location.
+		 *
+		 * This happened in the NMI handler, where the DR7 read was
+		 * re-ordered to happen before the call to sev_es_ist_enter(),
+		 * causing stack recursion.
+		 */
+		asm volatile("mov %%db7, %0" : "=r" (val) : __FORCE_ORDER);
 		break;
 	default:
 		BUG();
@@ -66,7 +79,16 @@ static __always_inline void native_set_debugreg(int regno, unsigned long value)
 		asm("mov %0, %%db6"	::"r" (value));
 		break;
 	case 7:
-		asm("mov %0, %%db7"	::"r" (value));
+		/*
+		 * Apply __FORCE_ORDER to DR7 writes to forbid re-ordering them
+		 * with other code.
+		 *
+		 * While is didn't happen with a DR7 write (see the DR7 read
+		 * comment above which explains where it happened), add the
+		 * __FORCE_ORDER here too to avoid similar problems in the
+		 * future.
+		 */
+		asm volatile("mov %0, %%db7"	::"r" (value), __FORCE_ORDER);
 		break;
 	default:
 		BUG();

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-31  8:57             ` Joerg Roedel
@ 2023-01-31 15:53               ` Sean Christopherson
  2023-01-31 16:00                 ` Joerg Roedel
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2023-01-31 15:53 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: H. Peter Anvin, Alexey Kardashevskiy, Peter Zijlstra, kvm, x86,
	linux-kernel, Thomas Gleixner, Jiri Kosina, Ingo Molnar,
	Dave Hansen, Borislav Petkov, Tom Lendacky

On Tue, Jan 31, 2023, Joerg Roedel wrote:
> On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote:
> > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is
> > not... %dr6 is the status register!
> 
> The reason is that on SEV-ES only accesses to DR7 will cause #VC
> exceptions, DR0-DR6 are not intercepted.

I don't think that is technically true.  A _well-behaved_ hypervisor will not
intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES
architecture enforces that behavior.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-31 15:53               ` Sean Christopherson
@ 2023-01-31 16:00                 ` Joerg Roedel
  2023-01-31 16:47                   ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Joerg Roedel @ 2023-01-31 16:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: H. Peter Anvin, Alexey Kardashevskiy, Peter Zijlstra, kvm, x86,
	linux-kernel, Thomas Gleixner, Jiri Kosina, Ingo Molnar,
	Dave Hansen, Borislav Petkov, Tom Lendacky

On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote:
> I don't think that is technically true.  A _well-behaved_ hypervisor will not
> intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES
> architecture enforces that behavior.

Not from the hardware architecture side, but the GHCB spec does not
list NAE events for DR0-DR6 accesses, so a guest is not required to
handle them in the VC handler.

Linux under SEV-ES will crash if the HV intercepts debug registers,
except DR7.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)
  2023-01-31 16:00                 ` Joerg Roedel
@ 2023-01-31 16:47                   ` Sean Christopherson
  0 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2023-01-31 16:47 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: H. Peter Anvin, Alexey Kardashevskiy, Peter Zijlstra, kvm, x86,
	linux-kernel, Thomas Gleixner, Jiri Kosina, Ingo Molnar,
	Dave Hansen, Borislav Petkov, Tom Lendacky

On Tue, Jan 31, 2023, Joerg Roedel wrote:
> On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote:
> > I don't think that is technically true.  A _well-behaved_ hypervisor will not
> > intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES
> > architecture enforces that behavior.
> 
> Not from the hardware architecture side, but the GHCB spec does not
> list NAE events for DR0-DR6 accesses, so a guest is not required to
> handle them in the VC handler.
> 
> Linux under SEV-ES will crash if the HV intercepts debug registers,
> except DR7.

Right, I'm just objecting to the wording of "DR0-DR6 are not intercepted".  E.g.
from a security perspective, the kernel shouldn't rely on DR0-DR6 to execute
cleanly.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-01-31 16:47 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-27  3:56 [Question PATCH kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?) Alexey Kardashevskiy
2023-01-27  9:08 ` Peter Zijlstra
2023-01-27 10:37   ` Joerg Roedel
2023-01-27 11:56     ` Alexey Kardashevskiy
2023-01-27 12:59       ` Joerg Roedel
2023-01-27 17:25       ` Joerg Roedel
2023-01-28 11:24         ` Alexey Kardashevskiy
2023-01-28 13:52           ` Joerg Roedel
2023-01-30  9:17             ` Joerg Roedel
2023-01-30 17:30           ` H. Peter Anvin
2023-01-30 18:04             ` Borislav Petkov
2023-01-31  8:57             ` Joerg Roedel
2023-01-31 15:53               ` Sean Christopherson
2023-01-31 16:00                 ` Joerg Roedel
2023-01-31 16:47                   ` Sean Christopherson
2023-01-27 12:13   ` Alexey Kardashevskiy
2023-01-27 12:41     ` Peter Zijlstra
2023-01-31 10:37 ` [tip: x86/urgent] x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses tip-bot2 for Joerg Roedel
2023-01-31 11:57 ` tip-bot2 for Joerg Roedel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.