* [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot
@ 2023-05-08  7:03 Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 01/20] x86: decompressor: Use proper sequence to take the address of the GOT Ard Biesheuvel
                   ` (19 more replies)
  0 siblings, 20 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

This is v2 of [2], which updates the x86 boot path to avoid the legacy
decompressor when booting via the EFI stub.

TL;DR: the bare metal decompressor inherits the loader's 1:1 mapping of
DRAM when it is entered in 64-bit mode, and assumes that all of it is
mapped read/write/execute, which will no longer be the case on hardware
built to comply with recently tightened logo requirements (*).

Changes since v1:
- streamline existing 4/5 level switching code and call it directly from
  the EFI stub - this is covered by the first 9 patches, which can be
  applied in isolation, if desired;
- deal with SEV/SNP init explicitly;
- clear BSS when booting via the 'handover protocol';
- switch to kernel CS before calling SEV init code in kernel proper.

Conceptually, this should not conflict with the memory acceptance work
which is being done by Kirill [3], but some lexical conflicts are not
unlikely. After applying this series, the allocations done by the EFI
stub for the trampoline and the decompressed kernel will use the EFI
page allocation APIs, and will therefore not need explicit acceptance.

---- v1 cover letter follows ----

This series is conceptually a combination of Evgeniy's series [0] and
mine [1], both of which attempt to make the early decompressor code more
amenable to executing in the EFI environment with stricter handling of
memory permissions.

My series [1] implemented zboot for x86 by getting rid of the entire
x86 decompressor and replacing it with existing EFI code that does the
same but in a generic way. The downside of this is that only EFI boot is
supported, making it unviable for distros, which need to support BIOS
boot and hybrid EFI boot modes that omit the EFI stub.

Evgeniy's series [0] adapted the entire decompressor code flow to allow
it to execute in the EFI context as well as the bare metal context; this
involves changes to the 1:1 mapping code, the page fault handlers, etc.,
none of which are really needed when doing EFI boot in the first place.

So this series attempts to occupy the middle ground here: it makes
minimal changes to the existing decompressor so some of it can be called
from the EFI stub. Then, it reimplements the EFI boot flow to decompress
the kernel and boot it directly, without relying on the trampoline
allocation code, page table code or page fault handling code. This
allows us to get rid of quite a bit of unsavory EFI stub code, and
replace it with two clear invocations of the EFI firmware APIs to clear
NX restrictions from allocations that have been populated with
executable code.
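
To make that concrete, each of those invocations amounts to a single
protocol call (a minimal sketch based on the memory attributes protocol
usage introduced in patch 10; the helper name here is made up):

    /*
     * Sketch: clear the non-execute (XP) attribute from a range that
     * has been populated with executable code. 'memattr' is the
     * EFI_MEMORY_ATTRIBUTE_PROTOCOL instance located earlier via
     * LocateProtocol; rounddown()/roundup() are the kernel helpers.
     */
    static void make_range_executable(unsigned long start, unsigned long size)
    {
            unsigned long rstart = rounddown(start, EFI_PAGE_SIZE);
            unsigned long rend = roundup(start + size, EFI_PAGE_SIZE);

            efi_call_proto(memattr, clear_memory_attributes, rstart,
                           rend - rstart, EFI_MEMORY_XP);
    }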

The only code that is being reused is the decompression library itself,
along with the minimal ELF parsing that is required to copy the ELF
segments in place, and the relocation processing that fixes up absolute
symbol references to refer to the correct virtual addresses.

Note that some of Evgeniy's changes to clean up the PE/COFF header
generation will still be needed, but I've omitted those here for
brevity.

(*) IMHO the following developments are likely to occur:
- the Windows boot chain (along with 3rd party drivers) is cleaned up so
  that it never relies on memory being writable and executable at the
  same time when running under the EFI boot services;
- the EFI reference implementation gets updated to map all memory NX by
  default, and to require read-only permissions for executable mappings;
- BIOS vendors incorporate these changes into their codebases, and
  deploy them more widely than just the 'secure' SKUs;
- OEMs only care about the Windows sticker, so they only boot test
  Windows, which works fine in this more restricted context;
- Linux boot no longer works reliably on new hardware built for Windows
  unless we clean up our boot chain as well.

Cc: Evgeniy Baskov <baskov@ispras.ru>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Peter Jones <pjones@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>

[0] https://lore.kernel.org/all/cover.1678785672.git.baskov@ispras.ru/
[1] https://lore.kernel.org/all/20230416120729.2470762-1-ardb@kernel.org/
[2] https://lore.kernel.org/all/20230424165726.2245548-1-ardb@kernel.org/
[3] https://lore.kernel.org/all/20230507234618.18067-1-kirill.shutemov@linux.intel.com/

Ard Biesheuvel (20):
  x86: decompressor: Use proper sequence to take the address of the GOT
  x86: decompressor: Store boot_params pointer in callee save register
  x86: decompressor: Call trampoline as a normal function
  x86: decompressor: Use standard calling convention for trampoline
  x86: decompressor: Avoid the need for a stack in the 32-bit trampoline
  x86: decompressor: Call trampoline directly from C code
  x86: decompressor: Only call the trampoline when changing paging
    levels
  x86: decompressor: Merge trampoline cleanup with switching code
  x86: efistub: Perform 4/5 level paging switch from the stub
  x86: efistub: Prefer EFI memory attributes protocol over DXE services
  decompress: Use 8 byte alignment
  x86: decompressor: Move global symbol references to C code
  x86: decompressor: Factor out kernel decompression and relocation
  x86: head_64: Store boot_params pointer in callee-preserved register
  x86: head_64: Switch to kernel CS before enabling memory encryption
  efi: libstub: Add limit argument to efi_random_alloc()
  x86: efistub: Check SEV/SNP support while running in the firmware
  x86: efistub: Avoid legacy decompressor when doing EFI boot
  x86: efistub: Clear BSS in EFI handover protocol entrypoint
  x86: decompressor: Avoid magic offsets for EFI handover entrypoint

 arch/x86/boot/compressed/Makefile              |   5 +
 arch/x86/boot/compressed/efi_mixed.S           |  58 +---
 arch/x86/boot/compressed/head_32.S             |  34 +-
 arch/x86/boot/compressed/head_64.S             | 196 +++--------
 arch/x86/boot/compressed/misc.c                |  44 ++-
 arch/x86/boot/compressed/pgtable.h             |   6 +-
 arch/x86/boot/compressed/pgtable_64.c          |  72 ++--
 arch/x86/boot/compressed/sev.c                 |  12 +-
 arch/x86/include/asm/boot.h                    |   8 +
 arch/x86/include/asm/efi.h                     |   7 +-
 arch/x86/include/asm/sev.h                     |   4 +
 arch/x86/kernel/head_64.S                      |  33 +-
 drivers/firmware/efi/libstub/arm64-stub.c      |   2 +-
 drivers/firmware/efi/libstub/efi-stub-helper.c |   4 +
 drivers/firmware/efi/libstub/efistub.h         |   2 +-
 drivers/firmware/efi/libstub/randomalloc.c     |  10 +-
 drivers/firmware/efi/libstub/x86-stub.c        | 353 +++++++++++++-------
 drivers/firmware/efi/libstub/zboot.c           |   2 +-
 include/linux/decompress/mm.h                  |   2 +-
 19 files changed, 400 insertions(+), 454 deletions(-)

-- 
2.39.2



* [PATCH v2 01/20] x86: decompressor: Use proper sequence to take the address of the GOT
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-17 17:31   ` Borislav Petkov
  2023-05-08  7:03 ` [PATCH v2 02/20] x86: decompressor: Store boot_params pointer in callee save register Ard Biesheuvel
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

We don't actually use a global offset table (GOT) in the 32-bit
decompressor, but as is common for 32-bit position independent code, we
use the magic symbol _GLOBAL_OFFSET_TABLE_ as an anchor from which to
derive the actual runtime addresses of other symbols, using special
@GOTOFF symbol references that are resolved at link time, and populated
with the distance between the address of the magic _GLOBAL_OFFSET_TABLE_
anchor and the address of the symbol in question.

This means _GLOBAL_OFFSET_TABLE_ is the only symbol whose actual runtime
address we have to determine explicitly, which is one of the first
things we do in startup_32. However, we do so by taking the absolute
address via the immediate field of an ADD instruction (plus a small
offset), and taking absolute addresses that need to be resolved at link
time is what we are trying to avoid.

Fortunately, the assembler knows that _GLOBAL_OFFSET_TABLE_ is magic,
and emits a special relative relocation instead, and so the resulting
code works as expected. However, this is not obvious to someone reading
the code, and the use of LEA with an explicit relative addend is more
idiomatic, so use that instead.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_32.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 987ae727cf9f0d04..53cbee1e2a93efce 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -58,7 +58,7 @@ SYM_FUNC_START(startup_32)
 	leal	(BP_scratch+4)(%esi), %esp
 	call	1f
 1:	popl	%edx
-	addl	$_GLOBAL_OFFSET_TABLE_+(.-1b), %edx
+	leal	(_GLOBAL_OFFSET_TABLE_ - 1b)(%edx), %edx
 
 	/* Load new GDT */
 	leal	gdt@GOTOFF(%edx), %eax
-- 
2.39.2



* [PATCH v2 02/20] x86: decompressor: Store boot_params pointer in callee save register
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 01/20] x86: decompressor: Use proper sequence to take the address of the GOT Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 03/20] x86: decompressor: Call trampoline as a normal function Ard Biesheuvel
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

Instead of pushing and popping %RSI umpteen times to preserve the struct
boot_params pointer across the execution of the startup code, just move
it into a callee-saved register before the first call into C, and copy it
back when needed.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_64.S | 34 +++++++-------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 03c4328a88cbd5d0..59340e533dff8369 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -416,10 +416,14 @@ SYM_CODE_START(startup_64)
 	lretq
 
 .Lon_kernel_cs:
+	/*
+	 * RSI holds a pointer to a boot_params structure provided by the
+	 * loader, and this needs to be preserved across C function calls. So
+	 * move it into a callee saved register.
+	 */
+	movq	%rsi, %r15
 
-	pushq	%rsi
 	call	load_stage1_idt
-	popq	%rsi
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 	/*
@@ -432,10 +436,8 @@ SYM_CODE_START(startup_64)
 	 * detection/setup to ensure that has been done in advance of any dependent
 	 * code.
 	 */
-	pushq	%rsi
-	movq	%rsi, %rdi		/* real mode address */
+	movq	%r15, %rdi		/* pass struct boot_params pointer */
 	call	sev_enable
-	popq	%rsi
 #endif
 
 	/*
@@ -448,13 +450,9 @@ SYM_CODE_START(startup_64)
 	 *   - Non zero RDX means trampoline needs to enable 5-level
 	 *     paging.
 	 *
-	 * RSI holds real mode data and needs to be preserved across
-	 * this function call.
 	 */
-	pushq	%rsi
-	movq	%rsi, %rdi		/* real mode address */
+	movq	%r15, %rdi		/* pass struct boot_params pointer */
 	call	paging_prepare
-	popq	%rsi
 
 	/* Save the trampoline address in RCX */
 	movq	%rax, %rcx
@@ -479,14 +477,9 @@ trampoline_return:
 	 *
 	 * RDI is address of the page table to use instead of page table
 	 * in trampoline memory (if required).
-	 *
-	 * RSI holds real mode data and needs to be preserved across
-	 * this function call.
 	 */
-	pushq	%rsi
 	leaq	rva(top_pgtable)(%rbx), %rdi
 	call	cleanup_trampoline
-	popq	%rsi
 
 	/* Zero EFLAGS */
 	pushq	$0
@@ -496,7 +489,6 @@ trampoline_return:
  * Copy the compressed kernel to the end of our buffer
  * where decompression in place becomes safe.
  */
-	pushq	%rsi
 	leaq	(_bss-8)(%rip), %rsi
 	leaq	rva(_bss-8)(%rbx), %rdi
 	movl	$(_bss - startup_32), %ecx
@@ -504,7 +496,6 @@ trampoline_return:
 	std
 	rep	movsq
 	cld
-	popq	%rsi
 
 	/*
 	 * The GDT may get overwritten either during the copy we just did or
@@ -551,30 +542,27 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
 	shrq	$3, %rcx
 	rep	stosq
 
-	pushq	%rsi
 	call	load_stage2_idt
 
 	/* Pass boot_params to initialize_identity_maps() */
-	movq	(%rsp), %rdi
+	movq	%r15, %rdi		/* pass struct boot_params pointer */
 	call	initialize_identity_maps
-	popq	%rsi
 
 /*
  * Do the extraction, and jump to the new kernel..
  */
-	pushq	%rsi			/* Save the real mode argument */
-	movq	%rsi, %rdi		/* real mode address */
+	movq	%r15, %rdi		/* pass struct boot_params pointer */
 	leaq	boot_heap(%rip), %rsi	/* malloc area for uncompression */
 	leaq	input_data(%rip), %rdx  /* input_data */
 	movl	input_len(%rip), %ecx	/* input_len */
 	movq	%rbp, %r8		/* output target address */
 	movl	output_len(%rip), %r9d	/* decompressed length, end of relocs */
 	call	extract_kernel		/* returns kernel entry point in %rax */
-	popq	%rsi
 
 /*
  * Jump to the decompressed kernel.
  */
+	movq	%r15, %rsi
 	jmp	*%rax
 SYM_FUNC_END(.Lrelocated)
 
-- 
2.39.2



* [PATCH v2 03/20] x86: decompressor: Call trampoline as a normal function
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 01/20] x86: decompressor: Use proper sequence to take the address of the GOT Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 02/20] x86: decompressor: Store boot_params pointer in callee save register Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-15 13:59   ` Kirill A. Shutemov
  2023-05-08  7:03 ` [PATCH v2 04/20] x86: decompressor: Use standard calling convention for trampoline Ard Biesheuvel
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

Move the long return to switch to 32-bit mode into the trampoline code
so we can call it as an ordinary function. This will allow us to call it
directly from C code in a subsequent patch.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_64.S | 25 +++++++++-----------
 arch/x86/boot/compressed/pgtable.h |  2 +-
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 59340e533dff8369..81b53b576cdd05ae 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -457,18 +457,9 @@ SYM_CODE_START(startup_64)
 	/* Save the trampoline address in RCX */
 	movq	%rax, %rcx
 
-	/*
-	 * Load the address of trampoline_return() into RDI.
-	 * It will be used by the trampoline to return to the main code.
-	 */
-	leaq	trampoline_return(%rip), %rdi
-
-	/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
-	pushq	$__KERNEL32_CS
 	leaq	TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
-	pushq	%rax
-	lretq
-trampoline_return:
+	call	*%rax
+
 	/* Restore the stack, the 32-bit trampoline uses its own stack */
 	leaq	rva(boot_stack_end)(%rbx), %rsp
 
@@ -566,16 +557,22 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
 	jmp	*%rax
 SYM_FUNC_END(.Lrelocated)
 
-	.code32
 /*
  * This is the 32-bit trampoline that will be copied over to low memory.
  *
- * RDI contains the return address (might be above 4G).
  * ECX contains the base address of the trampoline memory.
  * Non zero RDX means trampoline needs to enable 5-level paging.
  */
 SYM_CODE_START(trampoline_32bit_src)
-	/* Set up data and stack segments */
+	popq	%rdi
+	/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
+	pushq	$__KERNEL32_CS
+	leaq	0f(%rip), %rax
+	pushq	%rax
+	lretq
+
+	.code32
+0:	/* Set up data and stack segments */
 	movl	$__KERNEL_DS, %eax
 	movl	%eax, %ds
 	movl	%eax, %ss
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index cc9b2529a08634b4..91dbb99203fbce2d 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
+#define TRAMPOLINE_32BIT_CODE_SIZE	0xA0
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
-- 
2.39.2



* [PATCH v2 04/20] x86: decompressor: Use standard calling convention for trampoline
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 03/20] x86: decompressor: Call trampoline as a normal function Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-15 14:00   ` Kirill A. Shutemov
  2023-05-08  7:03 ` [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline Ard Biesheuvel
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

Update the trampoline code so its arguments are passed via RDI and RSI,
which matches the ordinary SysV calling convention for x86_64. This will
allow us to call this code directly from C.
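
For reference: the SysV x86_64 convention passes the first two integer
or pointer arguments in RDI and RSI (then RDX, RCX, R8, R9), so the
trampoline now corresponds to the C prototype below (taken from the
pgtable.h hunk in this patch):

    /*
     * trampoline  -> RDI (EDI once running in 32-bit mode)
     * enable_5lvl -> RSI (tested via ESI)
     */
    void trampoline_32bit_src(void *trampoline, bool enable_5lvl);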

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_64.S | 30 +++++++++-----------
 arch/x86/boot/compressed/pgtable.h |  2 +-
 2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 81b53b576cdd05ae..b1f8a867777120bb 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -454,9 +454,9 @@ SYM_CODE_START(startup_64)
 	movq	%r15, %rdi		/* pass struct boot_params pointer */
 	call	paging_prepare
 
-	/* Save the trampoline address in RCX */
-	movq	%rax, %rcx
-
+	/* Pass the trampoline address and boolean flag as args #1 and #2 */
+	movq	%rax, %rdi
+	movq	%rdx, %rsi
 	leaq	TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
 	call	*%rax
 
@@ -560,11 +560,11 @@ SYM_FUNC_END(.Lrelocated)
 /*
  * This is the 32-bit trampoline that will be copied over to low memory.
  *
- * ECX contains the base address of the trampoline memory.
- * Non zero RDX means trampoline needs to enable 5-level paging.
+ * EDI contains the base address of the trampoline memory.
+ * Non-zero ESI means trampoline needs to enable 5-level paging.
  */
 SYM_CODE_START(trampoline_32bit_src)
-	popq	%rdi
+	popq	%r8
 	/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
 	pushq	$__KERNEL32_CS
 	leaq	0f(%rip), %rax
@@ -578,7 +578,7 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	%eax, %ss
 
 	/* Set up new stack */
-	leal	TRAMPOLINE_32BIT_STACK_END(%ecx), %esp
+	leal	TRAMPOLINE_32BIT_STACK_END(%edi), %esp
 
 	/* Disable paging */
 	movl	%cr0, %eax
@@ -586,7 +586,7 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	%eax, %cr0
 
 	/* Check what paging mode we want to be in after the trampoline */
-	testl	%edx, %edx
+	testl	%esi, %esi
 	jz	1f
 
 	/* We want 5-level paging: don't touch CR3 if it already points to 5-level page tables */
@@ -601,21 +601,17 @@ SYM_CODE_START(trampoline_32bit_src)
 	jz	3f
 2:
 	/* Point CR3 to the trampoline's new top level page table */
-	leal	TRAMPOLINE_32BIT_PGTABLE_OFFSET(%ecx), %eax
+	leal	TRAMPOLINE_32BIT_PGTABLE_OFFSET(%edi), %eax
 	movl	%eax, %cr3
 3:
 	/* Set EFER.LME=1 as a precaution in case hypervisor pulls the rug */
-	pushl	%ecx
-	pushl	%edx
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
 	/* Avoid writing EFER if no change was made (for TDX guest) */
 	jc	1f
 	wrmsr
-1:	popl	%edx
-	popl	%ecx
-
+1:
 #ifdef CONFIG_X86_MCE
 	/*
 	 * Preserve CR4.MCE if the kernel will enable #MC support.
@@ -632,14 +628,14 @@ SYM_CODE_START(trampoline_32bit_src)
 
 	/* Enable PAE and LA57 (if required) paging modes */
 	orl	$X86_CR4_PAE, %eax
-	testl	%edx, %edx
+	testl	%esi, %esi
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
 1:
 	movl	%eax, %cr4
 
 	/* Calculate address of paging_enabled() once we are executing in the trampoline */
-	leal	.Lpaging_enabled - trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_OFFSET(%ecx), %eax
+	leal	.Lpaging_enabled - trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_OFFSET(%edi), %eax
 
 	/* Prepare the stack for far return to Long Mode */
 	pushl	$__KERNEL_CS
@@ -656,7 +652,7 @@ SYM_CODE_END(trampoline_32bit_src)
 	.code64
 SYM_FUNC_START_LOCAL_NOALIGN(.Lpaging_enabled)
 	/* Return from the trampoline */
-	jmp	*%rdi
+	jmp	*%r8
 SYM_FUNC_END(.Lpaging_enabled)
 
 	/*
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 91dbb99203fbce2d..4e8cef135226bcbb 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -14,7 +14,7 @@
 
 extern unsigned long *trampoline_32bit;
 
-extern void trampoline_32bit_src(void *return_ptr);
+extern void trampoline_32bit_src(void *trampoline, bool enable_5lvl);
 
 #endif /* __ASSEMBLER__ */
 #endif /* BOOT_COMPRESSED_PAGETABLE_H */
-- 
2.39.2



* [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 04/20] x86: decompressor: Use standard calling convention for trampoline Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-15 14:03   ` Kirill A. Shutemov
  2023-05-17 22:40   ` Tom Lendacky
  2023-05-08  7:03 ` [PATCH v2 06/20] x86: decompressor: Call trampoline directly from C code Ard Biesheuvel
                   ` (14 subsequent siblings)
  19 siblings, 2 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

The 32-bit trampoline no longer uses the stack for anything except
performing a long return back to long mode. Currently, this stack is
allocated in the same page that carries the trampoline code, which means
this page must be mapped writable and executable, and the stack is
therefore executable as well.

So let's do a long jump instead: that way, we can pre-calculate the
return address and poke it into the code before we call it. In a later
patch, we will take advantage of this by removing writable permissions
(and adding executable ones) explicitly when booting via the EFI stub.

Not playing with the stack pointer also makes it more straightforward
to call the trampoline code as an ordinary 64-bit function from C code.
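
Concretely, the LJMP immediate is linked as the offset of the 64-bit
landing pad relative to trampoline_32bit_src, so making it absolute is
a single in-place addition (sketch of the fixup this patch adds to
paging_prepare(); 'tramp' stands for the copy destination):

    /*
     * The u32 at trampoline_ljmp_imm_offset initially holds
     * (.Lret - trampoline_32bit_src); adding the runtime base of the
     * copied code turns it into the absolute address LJMP requires.
     */
    *(u32 *)((u8 *)tramp + trampoline_ljmp_imm_offset) += (unsigned long)tramp;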

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_64.S    | 34 ++++----------------
 arch/x86/boot/compressed/pgtable.h    |  6 ++--
 arch/x86/boot/compressed/pgtable_64.c | 12 ++++++-
 3 files changed, 21 insertions(+), 31 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index b1f8a867777120bb..3b5fc851737ffc39 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -460,9 +460,6 @@ SYM_CODE_START(startup_64)
 	leaq	TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
 	call	*%rax
 
-	/* Restore the stack, the 32-bit trampoline uses its own stack */
-	leaq	rva(boot_stack_end)(%rbx), %rsp
-
 	/*
 	 * cleanup_trampoline() would restore trampoline memory.
 	 *
@@ -563,24 +560,17 @@ SYM_FUNC_END(.Lrelocated)
  * EDI contains the base address of the trampoline memory.
  * Non-zero ESI means trampoline needs to enable 5-level paging.
  */
+	.section ".rodata", "a", @progbits
 SYM_CODE_START(trampoline_32bit_src)
-	popq	%r8
 	/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
 	pushq	$__KERNEL32_CS
 	leaq	0f(%rip), %rax
 	pushq	%rax
 	lretq
+.Lret:	retq
 
 	.code32
-0:	/* Set up data and stack segments */
-	movl	$__KERNEL_DS, %eax
-	movl	%eax, %ds
-	movl	%eax, %ss
-
-	/* Set up new stack */
-	leal	TRAMPOLINE_32BIT_STACK_END(%edi), %esp
-
-	/* Disable paging */
+0:	/* Disable paging */
 	movl	%cr0, %eax
 	btrl	$X86_CR0_PG_BIT, %eax
 	movl	%eax, %cr0
@@ -634,26 +624,16 @@ SYM_CODE_START(trampoline_32bit_src)
 1:
 	movl	%eax, %cr4
 
-	/* Calculate address of paging_enabled() once we are executing in the trampoline */
-	leal	.Lpaging_enabled - trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_OFFSET(%edi), %eax
-
-	/* Prepare the stack for far return to Long Mode */
-	pushl	$__KERNEL_CS
-	pushl	%eax
-
 	/* Enable paging again. */
 	movl	%cr0, %eax
 	btsl	$X86_CR0_PG_BIT, %eax
 	movl	%eax, %cr0
 
-	lret
+.Ljmp:	ljmpl	$__KERNEL_CS, $(.Lret - trampoline_32bit_src)
 SYM_CODE_END(trampoline_32bit_src)
 
-	.code64
-SYM_FUNC_START_LOCAL_NOALIGN(.Lpaging_enabled)
-	/* Return from the trampoline */
-	jmp	*%r8
-SYM_FUNC_END(.Lpaging_enabled)
+/* keep this right after trampoline_32bit_src() so we can infer its size */
+SYM_DATA(trampoline_ljmp_imm_offset, .word  .Ljmp + 1 - trampoline_32bit_src)
 
 	/*
          * The trampoline code has a size limit.
@@ -662,7 +642,7 @@ SYM_FUNC_END(.Lpaging_enabled)
 	 */
 	.org	trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_SIZE
 
-	.code32
+	.text
 SYM_FUNC_START_LOCAL_NOALIGN(.Lno_longmode)
 	/* This isn't an x86-64 CPU, so hang intentionally, we cannot continue */
 1:
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 4e8cef135226bcbb..131488f50af55d0a 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,9 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0xA0
-
-#define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #ifndef __ASSEMBLER__
 
@@ -16,5 +14,7 @@ extern unsigned long *trampoline_32bit;
 
 extern void trampoline_32bit_src(void *trampoline, bool enable_5lvl);
 
+extern const u16 trampoline_ljmp_imm_offset;
+
 #endif /* __ASSEMBLER__ */
 #endif /* BOOT_COMPRESSED_PAGETABLE_H */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 2ac12ff4111bf8c0..09fc18180929fab3 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -109,6 +109,7 @@ static unsigned long find_trampoline_placement(void)
 struct paging_config paging_prepare(void *rmode)
 {
 	struct paging_config paging_config = {};
+	void *tramp_code;
 
 	/* Initialize boot_params. Required for cmdline_find_option_bool(). */
 	boot_params = rmode;
@@ -143,9 +144,18 @@ struct paging_config paging_prepare(void *rmode)
 	memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
 
 	/* Copy trampoline code in place */
-	memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
+	tramp_code = memcpy(trampoline_32bit +
+			TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
 			&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
 
+	/*
+	 * Avoid the need for a stack in the 32-bit trampoline code, by using
+	 * LJMP rather than LRET to return back to long mode. LJMP takes an
+	 * immediate absolute address, so we have to adjust that based on the
+	 * placement of the trampoline.
+	 */
+	*(u32 *)(tramp_code + trampoline_ljmp_imm_offset) += (unsigned long)tramp_code;
+
 	/*
 	 * The code below prepares page table in trampoline memory.
 	 *
-- 
2.39.2



* [PATCH v2 06/20] x86: decompressor: Call trampoline directly from C code
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-15 14:05   ` Kirill A. Shutemov
  2023-05-08  7:03 ` [PATCH v2 07/20] x86: decompressor: Only call the trampoline when changing paging levels Ard Biesheuvel
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

Instead of returning to the asm calling code to invoke the trampoline,
call it straight from the C code that sets the scene. That way, we don't
need the struct return type for returning two values, and we can make
the call conditional more cleanly in a subsequent patch.
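
As background: the SysV x86_64 ABI returns a small two-member struct
like this in the RAX:RDX register pair, which is what the asm caller
relied on until now (sketch; the struct is removed by this patch):

    struct paging_config {
            unsigned long trampoline_start;   /* returned in RAX */
            unsigned long l5_required;        /* returned in RDX */
    };

    struct paging_config paging_prepare(void *rmode);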

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_64.S    | 20 +++-----------
 arch/x86/boot/compressed/pgtable_64.c | 28 ++++++++------------
 2 files changed, 15 insertions(+), 33 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 3b5fc851737ffc39..94b614ecb7c2fd55 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -441,24 +441,12 @@ SYM_CODE_START(startup_64)
 #endif
 
 	/*
-	 * paging_prepare() sets up the trampoline and checks if we need to
-	 * enable 5-level paging.
-	 *
-	 * paging_prepare() returns a two-quadword structure which lands
-	 * into RDX:RAX:
-	 *   - Address of the trampoline is returned in RAX.
-	 *   - Non zero RDX means trampoline needs to enable 5-level
-	 *     paging.
-	 *
+	 * set_paging_levels() updates the number of paging levels using a
+	 * trampoline in 32-bit addressable memory if the current number does
+	 * not match the desired number.
 	 */
 	movq	%r15, %rdi		/* pass struct boot_params pointer */
-	call	paging_prepare
-
-	/* Pass the trampoline address and boolean flag as args #1 and #2 */
-	movq	%rax, %rdi
-	movq	%rdx, %rsi
-	leaq	TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
-	call	*%rax
+	call	set_paging_levels
 
 	/*
 	 * cleanup_trampoline() would restore trampoline memory.
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 09fc18180929fab3..b62b6819dcdd01be 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -16,11 +16,6 @@ unsigned int __section(".data") pgdir_shift = 39;
 unsigned int __section(".data") ptrs_per_p4d = 1;
 #endif
 
-struct paging_config {
-	unsigned long trampoline_start;
-	unsigned long l5_required;
-};
-
 /* Buffer to preserve trampoline memory */
 static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
 
@@ -106,10 +101,10 @@ static unsigned long find_trampoline_placement(void)
 	return bios_start - TRAMPOLINE_32BIT_SIZE;
 }
 
-struct paging_config paging_prepare(void *rmode)
+asmlinkage void set_paging_levels(void *rmode)
 {
-	struct paging_config paging_config = {};
-	void *tramp_code;
+	void (*toggle_la57)(void *trampoline, bool enable_5lvl);
+	bool l5_required = false;
 
 	/* Initialize boot_params. Required for cmdline_find_option_bool(). */
 	boot_params = rmode;
@@ -130,12 +125,10 @@ struct paging_config paging_prepare(void *rmode)
 			!cmdline_find_option_bool("no5lvl") &&
 			native_cpuid_eax(0) >= 7 &&
 			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
-		paging_config.l5_required = 1;
+		l5_required = true;
 	}
 
-	paging_config.trampoline_start = find_trampoline_placement();
-
-	trampoline_32bit = (unsigned long *)paging_config.trampoline_start;
+	trampoline_32bit = (unsigned long *)find_trampoline_placement();
 
 	/* Preserve trampoline memory */
 	memcpy(trampoline_save, trampoline_32bit, TRAMPOLINE_32BIT_SIZE);
@@ -144,7 +137,7 @@ struct paging_config paging_prepare(void *rmode)
 	memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
 
 	/* Copy trampoline code in place */
-	tramp_code = memcpy(trampoline_32bit +
+	toggle_la57 = memcpy(trampoline_32bit +
 			TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
 			&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
 
@@ -154,7 +147,8 @@ struct paging_config paging_prepare(void *rmode)
 	 * immediate absolute address, so we have to adjust that based on the
 	 * placement of the trampoline.
 	 */
-	*(u32 *)(tramp_code + trampoline_ljmp_imm_offset) += (unsigned long)tramp_code;
+	*(u32 *)((u8 *)toggle_la57 + trampoline_ljmp_imm_offset) +=
+						(unsigned long)toggle_la57;
 
 	/*
 	 * The code below prepares page table in trampoline memory.
@@ -170,10 +164,10 @@ struct paging_config paging_prepare(void *rmode)
 	 * We are not going to use the page table in trampoline memory if we
 	 * are already in the desired paging mode.
 	 */
-	if (paging_config.l5_required == !!(native_read_cr4() & X86_CR4_LA57))
+	if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
 		goto out;
 
-	if (paging_config.l5_required) {
+	if (l5_required) {
 		/*
 		 * For 4- to 5-level paging transition, set up current CR3 as
 		 * the first and the only entry in a new top-level page table.
@@ -196,7 +190,7 @@ struct paging_config paging_prepare(void *rmode)
 	}
 
 out:
-	return paging_config;
+	toggle_la57(trampoline_32bit, l5_required);
 }
 
 void cleanup_trampoline(void *pgtable)
-- 
2.39.2



* [PATCH v2 07/20] x86: decompressor: Only call the trampoline when changing paging levels
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (5 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 06/20] x86: decompressor: Call trampoline directly from C code Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-15 14:07   ` Kirill A. Shutemov
  2023-05-08  7:03 ` [PATCH v2 08/20] x86: decompressor: Merge trampoline cleanup with switching code Ard Biesheuvel
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

Since we know both the current and the desired number of paging levels
when preparing the trampoline, don't call the trampoline at all when
calling it would not change the number of paging levels. Given that we
are already running in long mode, the PAE
and LA57 settings are necessarily consistent with the currently active
page tables - the only difference is that we will always preserve
CR4.MCE in this case, but this will be cleared by the real kernel
startup code if CONFIG_X86_MCE is not enabled.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_64.S    | 21 +-------------------
 arch/x86/boot/compressed/pgtable_64.c | 18 +++++++----------
 2 files changed, 8 insertions(+), 31 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 94b614ecb7c2fd55..ccdfe7e55c36a40f 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -398,10 +398,6 @@ SYM_CODE_START(startup_64)
 	 * For the trampoline, we need the top page table to reside in lower
 	 * memory as we don't have a way to load 64-bit values into CR3 in
 	 * 32-bit mode.
-	 *
-	 * We go though the trampoline even if we don't have to: if we're
-	 * already in a desired paging mode. This way the trampoline code gets
-	 * tested on every boot.
 	 */
 
 	/* Make sure we have GDT with 32-bit code segment */
@@ -563,25 +559,10 @@ SYM_CODE_START(trampoline_32bit_src)
 	btrl	$X86_CR0_PG_BIT, %eax
 	movl	%eax, %cr0
 
-	/* Check what paging mode we want to be in after the trampoline */
-	testl	%esi, %esi
-	jz	1f
-
-	/* We want 5-level paging: don't touch CR3 if it already points to 5-level page tables */
-	movl	%cr4, %eax
-	testl	$X86_CR4_LA57, %eax
-	jnz	3f
-	jmp	2f
-1:
-	/* We want 4-level paging: don't touch CR3 if it already points to 4-level page tables */
-	movl	%cr4, %eax
-	testl	$X86_CR4_LA57, %eax
-	jz	3f
-2:
 	/* Point CR3 to the trampoline's new top level page table */
 	leal	TRAMPOLINE_32BIT_PGTABLE_OFFSET(%edi), %eax
 	movl	%eax, %cr3
-3:
+
 	/* Set EFER.LME=1 as a precaution in case hypervisor pulls the rug */
 	movl	$MSR_EFER, %ecx
 	rdmsr
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index b62b6819dcdd01be..b92cf1d6e156d5f6 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -128,6 +128,13 @@ asmlinkage void set_paging_levels(void *rmode)
 		l5_required = true;
 	}
 
+	/*
+	 * We are not going to use the trampoline if we
+	 * are already in the desired paging mode.
+	 */
+	if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
+		return;
+
 	trampoline_32bit = (unsigned long *)find_trampoline_placement();
 
 	/* Preserve trampoline memory */
@@ -155,18 +162,8 @@ asmlinkage void set_paging_levels(void *rmode)
 	 *
 	 * The new page table will be used by trampoline code for switching
 	 * from 4- to 5-level paging or vice versa.
-	 *
-	 * If switching is not required, the page table is unused: trampoline
-	 * code wouldn't touch CR3.
 	 */
 
-	/*
-	 * We are not going to use the page table in trampoline memory if we
-	 * are already in the desired paging mode.
-	 */
-	if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
-		goto out;
-
 	if (l5_required) {
 		/*
 		 * For 4- to 5-level paging transition, set up current CR3 as
@@ -189,7 +186,6 @@ asmlinkage void set_paging_levels(void *rmode)
 		       (void *)src, PAGE_SIZE);
 	}
 
-out:
 	toggle_la57(trampoline_32bit, l5_required);
 }
 
-- 
2.39.2



* [PATCH v2 08/20] x86: decompressor: Merge trampoline cleanup with switching code
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (6 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 07/20] x86: decompressor: Only call the trampoline when changing paging levels Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
       [not found]   ` <20230515140955.d4adbstv6gtnshp2@box.shutemov.name>
  2023-05-08  7:03 ` [PATCH v2 09/20] x86: efistub: Perform 4/5 level paging switch from the stub Ard Biesheuvel
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

Now that the trampoline setup code and the actual invocation of it are
all done from the C routine, we can merge the trampoline cleanup into
that as well, instead of returning to asm just to call another C
function.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_64.S    | 13 +++------
 arch/x86/boot/compressed/pgtable_64.c | 28 ++++++++------------
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index ccdfe7e55c36a40f..21af9cfd0dd0afb7 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -440,19 +440,14 @@ SYM_CODE_START(startup_64)
 	 * set_paging_levels() updates the number of paging levels using a
 	 * trampoline in 32-bit addressable memory if the current number does
 	 * not match the desired number.
+	 *
+	 * RSI is the relocated address of the page table to use instead of
+	 * page table in trampoline memory (if required).
 	 */
 	movq	%r15, %rdi		/* pass struct boot_params pointer */
+	leaq	rva(top_pgtable)(%rbx), %rsi
 	call	set_paging_levels
 
-	/*
-	 * cleanup_trampoline() would restore trampoline memory.
-	 *
-	 * RDI is address of the page table to use instead of page table
-	 * in trampoline memory (if required).
-	 */
-	leaq	rva(top_pgtable)(%rbx), %rdi
-	call	cleanup_trampoline
-
 	/* Zero EFLAGS */
 	pushq	$0
 	popfq
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index b92cf1d6e156d5f6..eeddad8c8335655e 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -101,9 +101,10 @@ static unsigned long find_trampoline_placement(void)
 	return bios_start - TRAMPOLINE_32BIT_SIZE;
 }
 
-asmlinkage void set_paging_levels(void *rmode)
+asmlinkage void set_paging_levels(void *rmode, void *pgtable)
 {
 	void (*toggle_la57)(void *trampoline, bool enable_5lvl);
+	void *trampoline_pgtable;
 	bool l5_required = false;
 
 	/* Initialize boot_params. Required for cmdline_find_option_bool(). */
@@ -133,7 +134,7 @@ asmlinkage void set_paging_levels(void *rmode)
 	 * are already in the desired paging mode.
 	 */
 	if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
-		return;
+		goto out;
 
 	trampoline_32bit = (unsigned long *)find_trampoline_placement();
 
@@ -163,6 +164,8 @@ asmlinkage void set_paging_levels(void *rmode)
 	 * The new page table will be used by trampoline code for switching
 	 * from 4- to 5-level paging or vice versa.
 	 */
+	trampoline_pgtable = trampoline_32bit +
+			     TRAMPOLINE_32BIT_PGTABLE_OFFSET / sizeof(unsigned long);
 
 	if (l5_required) {
 		/*
@@ -182,31 +185,21 @@ asmlinkage void set_paging_levels(void *rmode)
 		 * may be above 4G.
 		 */
 		src = *(unsigned long *)__native_read_cr3() & PAGE_MASK;
-		memcpy(trampoline_32bit + TRAMPOLINE_32BIT_PGTABLE_OFFSET / sizeof(unsigned long),
-		       (void *)src, PAGE_SIZE);
+		memcpy(trampoline_pgtable, (void *)src, PAGE_SIZE);
 	}
 
 	toggle_la57(trampoline_32bit, l5_required);
-}
-
-void cleanup_trampoline(void *pgtable)
-{
-	void *trampoline_pgtable;
-
-	trampoline_pgtable = trampoline_32bit + TRAMPOLINE_32BIT_PGTABLE_OFFSET / sizeof(unsigned long);
 
 	/*
-	 * Move the top level page table out of trampoline memory,
-	 * if it's there.
+	 * Move the top level page table out of trampoline memory.
 	 */
-	if ((void *)__native_read_cr3() == trampoline_pgtable) {
-		memcpy(pgtable, trampoline_pgtable, PAGE_SIZE);
-		native_write_cr3((unsigned long)pgtable);
-	}
+	memcpy(pgtable, trampoline_pgtable, PAGE_SIZE);
+	native_write_cr3((unsigned long)pgtable);
 
 	/* Restore trampoline memory */
 	memcpy(trampoline_32bit, trampoline_save, TRAMPOLINE_32BIT_SIZE);
 
+out:
 	/* Initialize variables for 5-level paging */
 #ifdef CONFIG_X86_5LEVEL
 	if (__read_cr4() & X86_CR4_LA57) {
@@ -215,4 +208,5 @@ void cleanup_trampoline(void *pgtable)
 		ptrs_per_p4d = 512;
 	}
 #endif
+	return;
 }
-- 
2.39.2



* [PATCH v2 09/20] x86: efistub: Perform 4/5 level paging switch from the stub
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (7 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 08/20] x86: decompressor: Merge trampoline cleanup with switching code Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-15 14:14   ` Kirill A. Shutemov
  2023-05-08  7:03 ` [PATCH v2 10/20] x86: efistub: Prefer EFI memory attributes protocol over DXE services Ard Biesheuvel
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

In preparation for updating the EFI stub boot flow to avoid the bare
metal decompressor code altogether, implement the support code for
switching between 4 and 5 levels of paging before jumping to the kernel
proper.

This reuses the newly refactored trampoline that the bare metal
decompressor uses, but relies on EFI APIs to allocate 32-bit addressable
memory and remap it with the appropriate permissions. Given that the
bare metal decompressor will no longer call into the trampoline if the
number of paging levels is already set correctly, we no longer need to
remove NX restrictions from the memory range where this trampoline may
end up.
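
The allocation itself is a single boot services call with an explicit
placement limit (sketch, mirroring efi_setup_5level_paging() below):

    /*
     * Allocate two pages of 32-bit addressable memory: one for the
     * relocated trampoline code, one for the root page table. The
     * last argument of efi_allocate_pages() is an upper bound on the
     * physical address, so U32_MAX keeps the range below 4 GiB.
     */
    status = efi_allocate_pages(2 * PAGE_SIZE, (unsigned long *)&la57_code,
                                U32_MAX);
    if (status != EFI_SUCCESS)
            return status;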

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 drivers/firmware/efi/libstub/efi-stub-helper.c |   4 +
 drivers/firmware/efi/libstub/x86-stub.c        | 119 ++++++++++++++++----
 2 files changed, 102 insertions(+), 21 deletions(-)

diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
index 1e0203d74691ffcc..fc5f3b4c45e91401 100644
--- a/drivers/firmware/efi/libstub/efi-stub-helper.c
+++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
@@ -16,6 +16,8 @@
 
 #include "efistub.h"
 
+extern bool efi_no5lvl;
+
 bool efi_nochunk;
 bool efi_nokaslr = !IS_ENABLED(CONFIG_RANDOMIZE_BASE);
 bool efi_novamap;
@@ -73,6 +75,8 @@ efi_status_t efi_parse_options(char const *cmdline)
 			efi_loglevel = CONSOLE_LOGLEVEL_QUIET;
 		} else if (!strcmp(param, "noinitrd")) {
 			efi_noinitrd = true;
+		} else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
+			efi_no5lvl = true;
 		} else if (!strcmp(param, "efi") && val) {
 			efi_nochunk = parse_option_str(val, "nochunk");
 			efi_novamap |= parse_option_str(val, "novamap");
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index a0bfd31358ba97b1..fb83a72ad905ad6e 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -267,32 +267,11 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
 	}
 }
 
-/*
- * Trampoline takes 2 pages and can be loaded in first megabyte of memory
- * with its end placed between 128k and 640k where BIOS might start.
- * (see arch/x86/boot/compressed/pgtable_64.c)
- *
- * We cannot find exact trampoline placement since memory map
- * can be modified by UEFI, and it can alter the computed address.
- */
-
-#define TRAMPOLINE_PLACEMENT_BASE ((128 - 8)*1024)
-#define TRAMPOLINE_PLACEMENT_SIZE (640*1024 - (128 - 8)*1024)
-
 void startup_32(struct boot_params *boot_params);
 
 static void
 setup_memory_protection(unsigned long image_base, unsigned long image_size)
 {
-	/*
-	 * Allow execution of possible trampoline used
-	 * for switching between 4- and 5-level page tables
-	 * and relocated kernel image.
-	 */
-
-	adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
-				       TRAMPOLINE_PLACEMENT_SIZE);
-
 #ifdef CONFIG_64BIT
 	if (image_base != (unsigned long)startup_32)
 		adjust_memory_range_protection(image_base, image_size);
@@ -760,6 +739,96 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
 	return EFI_SUCCESS;
 }
 
+bool efi_no5lvl;
+
+static void (*la57_toggle)(void *trampoline, bool enable_5lvl);
+
+extern void trampoline_32bit_src(void *, bool);
+extern const u16 trampoline_ljmp_imm_offset;
+
+/*
+ * Enabling (or disabling) 5 level paging is tricky, because it can only be
+ * done from 32-bit mode with paging disabled. This means not only that the
+ * code itself must be running from 32-bit addressable physical memory, but
+ * also that the root page table must be 32-bit addressable, as we cannot
+ * program a 64-bit value into CR3 when running in 32-bit mode.
+ */
+static efi_status_t efi_setup_5level_paging(void)
+{
+	u8 tmpl_size = (u8 *)&trampoline_ljmp_imm_offset - (u8 *)&trampoline_32bit_src;
+	efi_status_t status;
+	u8 *la57_code;
+
+	if (!efi_is_64bit())
+		return EFI_SUCCESS;
+
+	/* check for 5 level paging support */
+	if (native_cpuid_eax(0) < 7 ||
+	    !(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
+		return EFI_SUCCESS;
+
+	/* allocate some 32-bit addressable memory for code and a page table */
+	status = efi_allocate_pages(2 * PAGE_SIZE, (unsigned long *)&la57_code,
+				    U32_MAX);
+	if (status != EFI_SUCCESS)
+		return status;
+
+	la57_toggle = memcpy(la57_code, trampoline_32bit_src, tmpl_size);
+	memset(la57_code + tmpl_size, 0x90, PAGE_SIZE - tmpl_size);
+
+	/*
+	 * To avoid having to allocate a 32-bit addressable stack, we use a
+	 * ljmp to switch back to long mode. However, this takes an absolute
+	 * address, so we have to poke it in at runtime.
+	 */
+	*(u32 *)&la57_code[trampoline_ljmp_imm_offset] += (unsigned long)la57_code;
+
+	adjust_memory_range_protection((unsigned long)la57_toggle, PAGE_SIZE);
+
+	return EFI_SUCCESS;
+}
+
+static void efi_5level_switch(void)
+{
+#ifdef CONFIG_X86_64
+	static const struct desc_struct gdt[] = {
+		[GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
+		[GDT_ENTRY_KERNEL_CS]   = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
+	};
+
+	bool want_la57 = IS_ENABLED(CONFIG_X86_5LEVEL) && !efi_no5lvl;
+	bool have_la57 = native_read_cr4() & X86_CR4_LA57;
+	bool need_toggle = want_la57 ^ have_la57;
+	u64 *pgt = (void *)la57_toggle + PAGE_SIZE;
+	u64 *cr3 = (u64 *)__native_read_cr3();
+	u64 *new_cr3;
+
+	if (!la57_toggle || !need_toggle)
+		return;
+
+	if (!have_la57) {
+		/*
+		 * We are going to enable 5 level paging, so we need to
+		 * allocate a root level page from the 32-bit addressable
+		 * physical region, and plug the existing hierarchy into it.
+		 */
+		new_cr3 = memset(pgt, 0, PAGE_SIZE);
+		new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
+	} else {
+		// take the new root table pointer from the current entry #0
+		new_cr3 = (u64 *)(cr3[0] & PAGE_MASK);
+
+		// copy the new root level table if it is not 32-bit addressable
+		if ((u64)new_cr3 > U32_MAX)
+			new_cr3 = memcpy(pgt, new_cr3, PAGE_SIZE);
+	}
+
+	native_load_gdt(&(struct desc_ptr){ sizeof(gdt) - 1, (u64)gdt });
+
+	la57_toggle(new_cr3, want_la57);
+#endif
+}
+
 /*
  * On success, we return the address of startup_32, which has potentially been
  * relocated by efi_relocate_kernel.
@@ -787,6 +856,12 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
 		efi_dxe_table = NULL;
 	}
 
+	status = efi_setup_5level_paging();
+	if (status != EFI_SUCCESS) {
+		efi_err("efi_setup_5level_paging() failed!\n");
+		goto fail;
+	}
+
 	/*
 	 * If the kernel isn't already loaded at a suitable address,
 	 * relocate it.
@@ -905,6 +980,8 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
 		goto fail;
 	}
 
+	efi_5level_switch();
+
 	return bzimage_addr;
 fail:
 	efi_err("efi_main() failed!\n");
-- 
2.39.2



* [PATCH v2 10/20] x86: efistub: Prefer EFI memory attributes protocol over DXE services
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (8 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 09/20] x86: efistub: Perform 4/5 level paging switch from the stub Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 11/20] decompress: Use 8 byte alignment Ard Biesheuvel
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

Currently, we rely on DXE services in some cases to clear non-execute
restrictions from page allocations that need to be executable. This is
dodgy, because DXE services are not specified by UEFI but by PI, and
they are not intended for consumption by OS loaders. However, no
alternative existed at the time.

Now, there is a new UEFI protocol that should be used instead, so if it
exists, prefer it over the DXE services calls.
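
In code, the preference boils down to: look the new protocol up once,
use it when present, and only fall back to the DXE services otherwise
(condensed sketch of the logic in the hunks below):

    efi_guid_t guid = EFI_MEMORY_ATTRIBUTE_PROTOCOL_GUID;

    /* leaves memattr NULL on failure, selecting the DXE fallback */
    efi_bs_call(locate_protocol, &guid, NULL, (void **)&memattr);

    if (memattr != NULL) {
            efi_call_proto(memattr, clear_memory_attributes, start,
                           size, EFI_MEMORY_XP);
            return;
    }
    /* otherwise: legacy PI DXE services path (unchanged) */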

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 drivers/firmware/efi/libstub/x86-stub.c | 29 ++++++++++++++------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index fb83a72ad905ad6e..ce8434fce0c37982 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -25,6 +25,7 @@ const efi_system_table_t *efi_system_table;
 const efi_dxe_services_table_t *efi_dxe_table;
 u32 image_offset __section(".data");
 static efi_loaded_image_t *image = NULL;
+static efi_memory_attribute_protocol_t *memattr;
 
 static efi_status_t
 preserve_pci_rom_image(efi_pci_io_protocol_t *pci, struct pci_setup_rom **__rom)
@@ -221,12 +222,18 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
 	unsigned long rounded_start, rounded_end;
 	unsigned long unprotect_start, unprotect_size;
 
-	if (efi_dxe_table == NULL)
-		return;
-
 	rounded_start = rounddown(start, EFI_PAGE_SIZE);
 	rounded_end = roundup(start + size, EFI_PAGE_SIZE);
 
+	if (memattr != NULL) {
+		efi_call_proto(memattr, clear_memory_attributes, rounded_start,
+			       rounded_end - rounded_start, EFI_MEMORY_XP);
+		return;
+	}
+
+	if (efi_dxe_table == NULL)
+		return;
+
 	/*
 	 * Don't modify memory region attributes, they are
 	 * already suitable, to lower the possibility to
@@ -838,6 +845,7 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
 				  efi_system_table_t *sys_table_arg,
 				  struct boot_params *boot_params)
 {
+	efi_guid_t guid = EFI_MEMORY_ATTRIBUTE_PROTOCOL_GUID;
 	unsigned long bzimage_addr = (unsigned long)startup_32;
 	unsigned long buffer_start, buffer_end;
 	struct setup_header *hdr = &boot_params->hdr;
@@ -849,13 +857,18 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
 	if (efi_system_table->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE)
 		efi_exit(handle, EFI_INVALID_PARAMETER);
 
-	efi_dxe_table = get_efi_config_table(EFI_DXE_SERVICES_TABLE_GUID);
-	if (efi_dxe_table &&
-	    efi_dxe_table->hdr.signature != EFI_DXE_SERVICES_TABLE_SIGNATURE) {
-		efi_warn("Ignoring DXE services table: invalid signature\n");
-		efi_dxe_table = NULL;
+	if (IS_ENABLED(CONFIG_EFI_DXE_MEM_ATTRIBUTES)) {
+		efi_dxe_table = get_efi_config_table(EFI_DXE_SERVICES_TABLE_GUID);
+		if (efi_dxe_table &&
+		    efi_dxe_table->hdr.signature != EFI_DXE_SERVICES_TABLE_SIGNATURE) {
+			efi_warn("Ignoring DXE services table: invalid signature\n");
+			efi_dxe_table = NULL;
+		}
 	}
 
+	/* grab the memory attributes protocol if it exists */
+	efi_bs_call(locate_protocol, &guid, NULL, (void **)&memattr);
+
 	status = efi_setup_5level_paging();
 	if (status != EFI_SUCCESS) {
 		efi_err("efi_setup_5level_paging() failed!\n");
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 11/20] decompress: Use 8 byte alignment
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (9 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 10/20] x86: efistub: Prefer EFI memory attributes protocol over DXE services Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 12/20] x86: decompressor: Move global symbol references to C code Ard Biesheuvel
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

The ZSTD decompressor requires malloc() allocations to be 8-byte
aligned, so ensure that this is the case.
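
For illustration, the align-up idiom used in the hunk below, applied to
an arbitrary value (a standalone toy example, not part of the patch):

	unsigned long p = 0x1001;

	p = (p + 7) & ~7UL;	/* rounds up to 0x1008, a multiple of 8 */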

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 include/linux/decompress/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/decompress/mm.h b/include/linux/decompress/mm.h
index 9192986b1a731323..ac862422df158bef 100644
--- a/include/linux/decompress/mm.h
+++ b/include/linux/decompress/mm.h
@@ -48,7 +48,7 @@ MALLOC_VISIBLE void *malloc(int size)
 	if (!malloc_ptr)
 		malloc_ptr = free_mem_ptr;
 
-	malloc_ptr = (malloc_ptr + 3) & ~3;     /* Align */
+	malloc_ptr = (malloc_ptr + 7) & ~7;     /* Align */
 
 	p = (void *)malloc_ptr;
 	malloc_ptr += size;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 12/20] x86: decompressor: Move global symbol references to C code
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (10 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 11/20] decompress: Use 8 byte alignment Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 13/20] x86: decompressor: Factor out kernel decompression and relocation Ard Biesheuvel
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

We no longer have to be cautious when referring to global variables in
the position independent decompressor code, now that we use PIE codegen
and assert in the linker script that no GOT entries exist (which would
require adjustment for the actual runtime load address of the
decompressor binary).

This means we can simply refer to global variables directly, instead of
passing them into C routines from asm code. Let's do so for the code
that will be called directly from the EFI stub after a subsequent patch,
avoiding the need to pass these quantities as function arguments.
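
The linker script assertion referred to above takes roughly this form
(a sketch; the exact section layout and message in the decompressor's
vmlinux.lds.S may differ):

	.got : {
		*(.got)
	}
	ASSERT(SIZEOF(.got) == 0, "Unexpected GOT entries detected!")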

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_32.S |  8 --------
 arch/x86/boot/compressed/head_64.S |  8 +-------
 arch/x86/boot/compressed/misc.c    | 16 +++++++++-------
 3 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 53cbee1e2a93efce..8ee8749007595fcc 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -179,13 +179,7 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
  */
 	/* push arguments for extract_kernel: */
 
-	pushl	output_len@GOTOFF(%ebx)	/* decompressed length, end of relocs */
 	pushl	%ebp			/* output address */
-	pushl	input_len@GOTOFF(%ebx)	/* input_len */
-	leal	input_data@GOTOFF(%ebx), %eax
-	pushl	%eax			/* input_data */
-	leal	boot_heap@GOTOFF(%ebx), %eax
-	pushl	%eax			/* heap area */
 	pushl	%esi			/* real mode pointer */
 	call	extract_kernel		/* returns kernel entry point in %eax */
 	addl	$24, %esp
@@ -213,8 +207,6 @@ SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
  */
 	.bss
 	.balign 4
-boot_heap:
-	.fill BOOT_HEAP_SIZE, 1, 0
 boot_stack:
 	.fill BOOT_STACK_SIZE, 1, 0
 boot_stack_end:
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 21af9cfd0dd0afb7..bcf678aed81468e3 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -519,11 +519,7 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
  * Do the extraction, and jump to the new kernel..
  */
 	movq	%r15, %rdi		/* pass struct boot_params pointer */
-	leaq	boot_heap(%rip), %rsi	/* malloc area for uncompression */
-	leaq	input_data(%rip), %rdx  /* input_data */
-	movl	input_len(%rip), %ecx	/* input_len */
-	movq	%rbp, %r8		/* output target address */
-	movl	output_len(%rip), %r9d	/* decompressed length, end of relocs */
+	movq	%rbp, %rsi		/* output target address */
 	call	extract_kernel		/* returns kernel entry point in %rax */
 
 /*
@@ -651,8 +647,6 @@ SYM_DATA_END_LABEL(boot_idt, SYM_L_GLOBAL, boot_idt_end)
  */
 	.bss
 	.balign 4
-SYM_DATA_LOCAL(boot_heap,	.fill BOOT_HEAP_SIZE, 1, 0)
-
 SYM_DATA_START_LOCAL(boot_stack)
 	.fill BOOT_STACK_SIZE, 1, 0
 	.balign 16
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 014ff222bf4b35c2..1cd40cb9fb4e5027 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -330,6 +330,11 @@ static size_t parse_elf(void *output)
 	return ehdr.e_entry - LOAD_PHYSICAL_ADDR;
 }
 
+static u8 boot_heap[BOOT_HEAP_SIZE] __aligned(4);
+
+extern unsigned char input_data[];
+extern unsigned int input_len, output_len;
+
 /*
  * The compressed kernel image (ZO), has been moved so that its position
  * is against the end of the buffer used to hold the uncompressed kernel
@@ -347,14 +352,11 @@ static size_t parse_elf(void *output)
  *             |-------uncompressed kernel image---------|
  *
  */
-asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
-				  unsigned char *input_data,
-				  unsigned long input_len,
-				  unsigned char *output,
-				  unsigned long output_len)
+asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output)
 {
 	const unsigned long kernel_total_size = VO__end - VO__text;
 	unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
+	memptr heap = (memptr)boot_heap;
 	unsigned long needed_size;
 	size_t entry_offset;
 
@@ -412,7 +414,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	 * entries. This ensures the full mapped area is usable RAM
 	 * and doesn't include any reserved areas.
 	 */
-	needed_size = max(output_len, kernel_total_size);
+	needed_size = max((unsigned long)output_len, kernel_total_size);
 #ifdef CONFIG_X86_64
 	needed_size = ALIGN(needed_size, MIN_KERNEL_ALIGN);
 #endif
@@ -443,7 +445,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 #ifdef CONFIG_X86_64
 	if (heap > 0x3fffffffffffUL)
 		error("Destination address too large");
-	if (virt_addr + max(output_len, kernel_total_size) > KERNEL_IMAGE_SIZE)
+	if (virt_addr + needed_size > KERNEL_IMAGE_SIZE)
 		error("Destination virtual address is beyond the kernel mapping area");
 #else
 	if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 13/20] x86: decompressor: Factor out kernel decompression and relocation
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (11 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 12/20] x86: decompressor: Move global symbol references to C code Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 14/20] x86: head_64: Store boot_params pointer in callee-preserved register Ard Biesheuvel
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

Factor out the sequence that invokes the decompressor, parses the ELF
image and applies the relocations, so that it can be called directly
from the EFI stub.
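
Once factored out, the EFI stub will be able to invoke it along these
lines (a sketch based on a later patch in this series; error() is a
stub-local callback):

	unsigned long entry;

	entry = decompress_kernel(outbuf, virt_addr, error);
	if (entry == ULONG_MAX)	/* decompression or ELF parsing failed */
		return EFI_LOAD_ERROR;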

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/misc.c | 28 ++++++++++++++++----
 arch/x86/include/asm/boot.h     |  8 ++++++
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 1cd40cb9fb4e5027..3635cbeaca1c03cf 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -330,11 +330,33 @@ static size_t parse_elf(void *output)
 	return ehdr.e_entry - LOAD_PHYSICAL_ADDR;
 }
 
+const unsigned long kernel_total_size = VO__end - VO__text;
+
 static u8 boot_heap[BOOT_HEAP_SIZE] __aligned(4);
 
 extern unsigned char input_data[];
 extern unsigned int input_len, output_len;
 
+unsigned long decompress_kernel(unsigned char *outbuf, unsigned long virt_addr,
+				void (*error)(char *x))
+{
+	unsigned long entry;
+
+	if (!free_mem_ptr) {
+		free_mem_ptr     = (unsigned long)boot_heap;
+		free_mem_end_ptr = (unsigned long)boot_heap + sizeof(boot_heap);
+	}
+
+	if (__decompress(input_data, input_len, NULL, NULL, outbuf, output_len,
+			 NULL, error) < 0)
+		return ULONG_MAX;
+
+	entry = parse_elf(outbuf);
+	handle_relocations(outbuf, output_len, virt_addr);
+
+	return entry;
+}
+
 /*
  * The compressed kernel image (ZO), has been moved so that its position
  * is against the end of the buffer used to hold the uncompressed kernel
@@ -354,7 +376,6 @@ extern unsigned int input_len, output_len;
  */
 asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output)
 {
-	const unsigned long kernel_total_size = VO__end - VO__text;
 	unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
 	memptr heap = (memptr)boot_heap;
 	unsigned long needed_size;
@@ -457,10 +478,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output)
 #endif
 
 	debug_putstr("\nDecompressing Linux... ");
-	__decompress(input_data, input_len, NULL, NULL, output, output_len,
-			NULL, error);
-	entry_offset = parse_elf(output);
-	handle_relocations(output, output_len, virt_addr);
+	entry_offset = decompress_kernel(output, virt_addr, error);
 
 	debug_putstr("done.\nBooting the kernel (entry_offset: 0x");
 	debug_puthex(entry_offset);
diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
index 9191280d9ea3160d..4ae14339cb8cc72d 100644
--- a/arch/x86/include/asm/boot.h
+++ b/arch/x86/include/asm/boot.h
@@ -62,4 +62,12 @@
 # define BOOT_STACK_SIZE	0x1000
 #endif
 
+#ifndef __ASSEMBLY__
+extern unsigned int output_len;
+extern const unsigned long kernel_total_size;
+
+unsigned long decompress_kernel(unsigned char *outbuf, unsigned long virt_addr,
+				void (*error)(char *x));
+#endif
+
 #endif /* _ASM_X86_BOOT_H */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 14/20] x86: head_64: Store boot_params pointer in callee-preserved register
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (12 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 13/20] x86: decompressor: Factor out kernel decompression and relocation Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 15/20] x86: head_64: Switch to kernel CS before enabling memory encryption Ard Biesheuvel
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

Instead of pushing/popping %RSI to/from the stack every time we call a
function from startup_64(), just store it in a callee-preserved register
until we're ready to pass it on.
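
The transformation is mechanical (a sketch; some_init_func stands in
for the various callees):

	/* before: %rsi is caller-clobbered, so preserve it by hand */
	pushq	%rsi
	call	some_init_func
	popq	%rsi

	/* after: %r15 is callee-preserved, so no save/restore needed */
	mov	%rsi, %r15
	call	some_init_func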

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/kernel/head_64.S | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index a5df3e994f04f10f..95b12fdae10e1dc9 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -60,6 +60,7 @@ SYM_CODE_START_NOALIGN(startup_64)
 	 * compiled to run at we first fixup the physical addresses in our page
 	 * tables and then reload them.
 	 */
+	mov	%rsi, %r15		/* Preserve boot_params pointer */
 
 	/* Set up the stack for verify_cpu() */
 	leaq	(__end_init_task - PTREGS_SIZE)(%rip), %rsp
@@ -73,9 +74,7 @@ SYM_CODE_START_NOALIGN(startup_64)
 	shrq	$32,  %rdx
 	wrmsr
 
-	pushq	%rsi
 	call	startup_64_setup_env
-	popq	%rsi
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 	/*
@@ -84,10 +83,8 @@ SYM_CODE_START_NOALIGN(startup_64)
 	 * which needs to be done before any CPUID instructions are executed in
 	 * subsequent code.
 	 */
-	movq	%rsi, %rdi
-	pushq	%rsi
+	movq	%r15, %rdi
 	call	sme_enable
-	popq	%rsi
 #endif
 
 	/* Now switch to __KERNEL_CS so IRET works reliably */
@@ -109,9 +106,7 @@ SYM_CODE_START_NOALIGN(startup_64)
 	 * programmed into CR3.
 	 */
 	leaq	_text(%rip), %rdi
-	pushq	%rsi
 	call	__startup_64
-	popq	%rsi
 
 	/* Form the CR3 value being sure to include the CR3 modifier */
 	addq	$(early_top_pgt - __START_KERNEL_map), %rax
@@ -200,10 +195,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	 * %rsi carries pointer to realmode data and is callee-clobbered. Save
 	 * and restore it.
 	 */
-	pushq	%rsi
 	movq	%rax, %rdi
 	call	sev_verify_cbit
-	popq	%rsi
 
 	/*
 	 * Switch to new page-table
@@ -294,9 +287,7 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	wrmsr
 
 	/* Setup and Load IDT */
-	pushq	%rsi
 	call	early_setup_idt
-	popq	%rsi
 
 	/* Check if nx is implemented */
 	movl	$0x80000001, %eax
@@ -334,7 +325,7 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 
 	/* rsi is pointer to real mode structure with interesting info.
 	   pass it to C */
-	movq	%rsi, %rdi
+	movq	%r15, %rdi
 
 .Ljump_to_C_code:
 	/*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 15/20] x86: head_64: Switch to kernel CS before enabling memory encryption
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (13 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 14/20] x86: head_64: Store boot_params pointer in callee-preserved register Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-17 18:54   ` Tom Lendacky
  2023-05-08  7:03 ` [PATCH v2 16/20] efi: libstub: Add limit argument to efi_random_alloc() Ard Biesheuvel
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

The SME initialization triggers #VC exceptions due to the use of CPUID
instructions, and returning from an exception restores the code segment
that was active when the exception was taken.

This means we should ensure that we switch the code segment to one that
is described in the GDT we just loaded before running the SME init code.

Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/kernel/head_64.S | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 95b12fdae10e1dc9..a128ac62956ff7c4 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -76,6 +76,15 @@ SYM_CODE_START_NOALIGN(startup_64)
 
 	call	startup_64_setup_env
 
+	/* Now switch to __KERNEL_CS so IRET works reliably */
+	pushq	$__KERNEL_CS
+	leaq	.Lon_kernel_cs(%rip), %rax
+	pushq	%rax
+	lretq
+
+.Lon_kernel_cs:
+	UNWIND_HINT_END_OF_STACK
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 	/*
 	 * Activate SEV/SME memory encryption if supported/enabled. This needs to
@@ -87,15 +96,6 @@ SYM_CODE_START_NOALIGN(startup_64)
 	call	sme_enable
 #endif
 
-	/* Now switch to __KERNEL_CS so IRET works reliably */
-	pushq	$__KERNEL_CS
-	leaq	.Lon_kernel_cs(%rip), %rax
-	pushq	%rax
-	lretq
-
-.Lon_kernel_cs:
-	UNWIND_HINT_END_OF_STACK
-
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 16/20] efi: libstub: Add limit argument to efi_random_alloc()
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (14 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 15/20] x86: head_64: Switch to kernel CS before enabling memory encryption Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware Ard Biesheuvel
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

x86 will need to limit the kernel memory allocation to the lowest 512
MiB of memory, to match the behavior of the existing bare metal KASLR
physical randomization logic. So in preparation for that, add a limit
parameter to efi_random_alloc() and wire it up.
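
A caller that needs to match the bare metal KASLR behavior would then
pass the 512 MiB limit explicitly, e.g. (illustrative; the actual x86
call site and limit macro appear later in the series):

	status = efi_random_alloc(alloc_size, CONFIG_PHYSICAL_ALIGN, &addr,
				  seed, EFI_LOADER_CODE, SZ_512M - 1);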

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 drivers/firmware/efi/libstub/arm64-stub.c  |  2 +-
 drivers/firmware/efi/libstub/efistub.h     |  2 +-
 drivers/firmware/efi/libstub/randomalloc.c | 10 ++++++----
 drivers/firmware/efi/libstub/zboot.c       |  2 +-
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/firmware/efi/libstub/arm64-stub.c b/drivers/firmware/efi/libstub/arm64-stub.c
index 770b8ecb73984c61..8c40fc89f5f99209 100644
--- a/drivers/firmware/efi/libstub/arm64-stub.c
+++ b/drivers/firmware/efi/libstub/arm64-stub.c
@@ -106,7 +106,7 @@ efi_status_t handle_kernel_image(unsigned long *image_addr,
 		 */
 		status = efi_random_alloc(*reserve_size, min_kimg_align,
 					  reserve_addr, phys_seed,
-					  EFI_LOADER_CODE);
+					  EFI_LOADER_CODE, EFI_ALLOC_LIMIT);
 		if (status != EFI_SUCCESS)
 			efi_warn("efi_random_alloc() failed: 0x%lx\n", status);
 	} else {
diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index 67d5a20802e0b7c6..03e3cec87ffbe2d1 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -955,7 +955,7 @@ efi_status_t efi_get_random_bytes(unsigned long size, u8 *out);
 
 efi_status_t efi_random_alloc(unsigned long size, unsigned long align,
 			      unsigned long *addr, unsigned long random_seed,
-			      int memory_type);
+			      int memory_type, unsigned long alloc_limit);
 
 efi_status_t efi_random_get_seed(void);
 
diff --git a/drivers/firmware/efi/libstub/randomalloc.c b/drivers/firmware/efi/libstub/randomalloc.c
index 32c7a54923b4c127..674a064b8f7adc68 100644
--- a/drivers/firmware/efi/libstub/randomalloc.c
+++ b/drivers/firmware/efi/libstub/randomalloc.c
@@ -16,7 +16,8 @@
  */
 static unsigned long get_entry_num_slots(efi_memory_desc_t *md,
 					 unsigned long size,
-					 unsigned long align_shift)
+					 unsigned long align_shift,
+					 u64 alloc_limit)
 {
 	unsigned long align = 1UL << align_shift;
 	u64 first_slot, last_slot, region_end;
@@ -29,7 +30,7 @@ static unsigned long get_entry_num_slots(efi_memory_desc_t *md,
 		return 0;
 
 	region_end = min(md->phys_addr + md->num_pages * EFI_PAGE_SIZE - 1,
-			 (u64)EFI_ALLOC_LIMIT);
+			 alloc_limit);
 	if (region_end < size)
 		return 0;
 
@@ -54,7 +55,8 @@ efi_status_t efi_random_alloc(unsigned long size,
 			      unsigned long align,
 			      unsigned long *addr,
 			      unsigned long random_seed,
-			      int memory_type)
+			      int memory_type,
+			      unsigned long alloc_limit)
 {
 	unsigned long total_slots = 0, target_slot;
 	unsigned long total_mirrored_slots = 0;
@@ -76,7 +78,7 @@ efi_status_t efi_random_alloc(unsigned long size,
 		efi_memory_desc_t *md = (void *)map->map + map_offset;
 		unsigned long slots;
 
-		slots = get_entry_num_slots(md, size, ilog2(align));
+		slots = get_entry_num_slots(md, size, ilog2(align), alloc_limit);
 		MD_NUM_SLOTS(md) = slots;
 		total_slots += slots;
 		if (md->attribute & EFI_MEMORY_MORE_RELIABLE)
diff --git a/drivers/firmware/efi/libstub/zboot.c b/drivers/firmware/efi/libstub/zboot.c
index e5d7fa1f1d8fd160..bdb17eac0cb401be 100644
--- a/drivers/firmware/efi/libstub/zboot.c
+++ b/drivers/firmware/efi/libstub/zboot.c
@@ -119,7 +119,7 @@ efi_zboot_entry(efi_handle_t handle, efi_system_table_t *systab)
 		}
 
 		status = efi_random_alloc(alloc_size, min_kimg_align, &image_base,
-					  seed, EFI_LOADER_CODE);
+					  seed, EFI_LOADER_CODE, EFI_ALLOC_LIMIT);
 		if (status != EFI_SUCCESS) {
 			efi_err("Failed to allocate memory\n");
 			goto free_cmdline;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (15 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 16/20] efi: libstub: Add limit argument to efi_random_alloc() Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-18 20:16   ` Tom Lendacky
  2023-05-08  7:03 ` [PATCH v2 18/20] x86: efistub: Avoid legacy decompressor when doing EFI boot Ard Biesheuvel
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

The decompressor executes in an environment with little or no access to
a console, and without any ability to return an error to the caller
(the bootloader). So the only recourse we have when the SEV/SNP context
is not quite what the kernel expects is to terminate the guest entirely.

This is a bit harsh, and also unnecessary when booting via the EFI stub,
given that it provides all the support that SEV guests need to probe the
underlying platform.

So let's do the SEV initialization and SNP feature check before calling
ExitBootServices(), and simply return with an error if the SNP feature
mask is not as expected.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/sev.c          | 12 ++++++++----
 arch/x86/include/asm/sev.h              |  4 ++++
 drivers/firmware/efi/libstub/x86-stub.c | 17 +++++++++++++++++
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 014b89c890887b9a..19c40873fdd209b5 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -315,20 +315,24 @@ static void enforce_vmpl0(void)
  */
 #define SNP_FEATURES_PRESENT (0)
 
+u64 snp_get_unsupported_features(void)
+{
+	if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
+		return 0;
+	return sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
+}
+
 void snp_check_features(void)
 {
 	u64 unsupported;
 
-	if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
-		return;
-
 	/*
 	 * Terminate the boot if hypervisor has enabled any feature lacking
 	 * guest side implementation. Pass on the unsupported features mask through
 	 * EXIT_INFO_2 of the GHCB protocol so that those features can be reported
 	 * as part of the guest boot failure.
 	 */
-	unsupported = sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
+	unsupported = snp_get_unsupported_features();
 	if (unsupported) {
 		if (ghcb_version < 2 || (!boot_ghcb && !early_setup_ghcb()))
 			sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 13dc2a9d23c1eb25..bf27b91644d0da5a 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -157,6 +157,7 @@ static __always_inline void sev_es_nmi_complete(void)
 		__sev_es_nmi_complete();
 }
 extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
+extern void sev_enable(struct boot_params *bp);
 
 static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs)
 {
@@ -202,12 +203,14 @@ void snp_set_wakeup_secondary_cpu(void);
 bool snp_init(struct boot_params *bp);
 void __init __noreturn snp_abort(void);
 int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
+u64 snp_get_unsupported_features(void);
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
 static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
 static inline void sev_es_nmi_complete(void) { }
 static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
+static inline void sev_enable(struct boot_params *bp) { }
 static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate) { return 0; }
 static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs) { return 0; }
 static inline void setup_ghcb(void) { }
@@ -225,6 +228,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
 {
 	return -ENOTTY;
 }
+static inline u64 snp_get_unsupported_features(void) { return 0; }
 #endif
 
 #endif
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index ce8434fce0c37982..33d11ba78f1d8c4f 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -15,6 +15,7 @@
 #include <asm/setup.h>
 #include <asm/desc.h>
 #include <asm/boot.h>
+#include <asm/sev.h>
 
 #include "efistub.h"
 
@@ -714,6 +715,22 @@ static efi_status_t exit_boot_func(struct efi_boot_memmap *map,
 			  &p->efi->efi_memmap, &p->efi->efi_memmap_hi);
 	p->efi->efi_memmap_size		= map->map_size;
 
+	/*
+	 * Call the SEV init code while still running with the firmware's
+	 * GDT/IDT, so #VC exceptions will be handled by EFI.
+	 */
+	if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+		u64 unsupported;
+
+		sev_enable(p->boot_params);
+		unsupported = snp_get_unsupported_features();
+		if (unsupported) {
+			efi_err("Unsupported SEV-SNP features detected: 0x%llx\n",
+				unsupported);
+			return EFI_UNSUPPORTED;
+		}
+	}
+
 	return EFI_SUCCESS;
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 18/20] x86: efistub: Avoid legacy decompressor when doing EFI boot
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (16 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-18 20:48   ` Tom Lendacky
  2023-05-08  7:03 ` [PATCH v2 19/20] x86: efistub: Clear BSS in EFI handover protocol entrypoint Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 20/20] x86: decompressor: Avoid magic offsets for EFI handover entrypoint Ard Biesheuvel
  19 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

The bare metal decompressor code was never really intended to run in a
hosted environment such as the EFI boot services, and does a few things
that are problematic in the context of EFI boot now that the logo
requirements are getting tighter.

In particular, the decompressor moves its own executable image around in
memory and relies on demand paging to populate the identity mappings,
both of which are difficult to support in a context where memory is not
permitted to be mapped writable and executable at the same time or, at
the very least, is mapped non-executable by default and needs special
treatment for this restriction to be lifted.

Since EFI already maps all of memory 1:1, we don't need to create new
page tables or handle page faults when decompressing the kernel. That
means there is also no need to replace the special exception handlers
for SEV. Generally, there is little need to do anything that the
decompressor does beyond

- initialize SEV encryption, if needed;
- perform the 4/5 level paging switch, if needed;
- decompress the kernel;
- relocate the kernel.

So let's do all of this from the EFI stub code, and avoid the bare metal
decompressor altogether.
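
In outline, the resulting EFI boot sequence becomes (simplified
pseudocode of the hunks below; SEV init itself runs from the
exit_boot() path, as per the previous patch):

	efi_decompress_kernel(&kernel_entry);	/* allocate, decompress, relocate */
	exit_boot(boot_params, handle);		/* SEV/SNP checks + ExitBootServices() */
	efi_5level_switch();			/* 4/5 level paging switch, if needed */
	enter_kernel(kernel_entry, boot_params);/* jump with boot_params in %rsi */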

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/Makefile       |   5 +
 arch/x86/boot/compressed/efi_mixed.S    |  58 +------
 arch/x86/boot/compressed/head_32.S      |  22 +--
 arch/x86/boot/compressed/head_64.S      |  34 ----
 arch/x86/include/asm/efi.h              |   7 +-
 drivers/firmware/efi/libstub/x86-stub.c | 176 +++++++++-----------
 6 files changed, 95 insertions(+), 207 deletions(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6b6cfe607bdbdfaa..f3289f17f820f982 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -74,6 +74,11 @@ LDFLAGS_vmlinux += -z noexecstack
 ifeq ($(CONFIG_LD_IS_BFD),y)
 LDFLAGS_vmlinux += $(call ld-option,--no-warn-rwx-segments)
 endif
+ifeq ($(CONFIG_EFI_STUB),y)
+# ensure that we'll pull in the static EFI stub library even if we never
+# reference it from the startup code
+LDFLAGS_vmlinux += -u efi_pe_entry
+endif
 LDFLAGS_vmlinux += -T
 
 hostprogs	:= mkpiggy
diff --git a/arch/x86/boot/compressed/efi_mixed.S b/arch/x86/boot/compressed/efi_mixed.S
index 4ca70bf93dc0bdcd..dec579eb1caa16d5 100644
--- a/arch/x86/boot/compressed/efi_mixed.S
+++ b/arch/x86/boot/compressed/efi_mixed.S
@@ -49,9 +49,12 @@ SYM_FUNC_START(startup_64_mixed_mode)
 	lea	efi32_boot_args(%rip), %rdx
 	mov	0(%rdx), %edi
 	mov	4(%rdx), %esi
+#ifdef CONFIG_EFI_HANDOVER_PROTOCOL
 	mov	8(%rdx), %edx		// saved bootparams pointer
 	test	%edx, %edx
 	jnz	efi64_stub_entry
+#endif
+
 	/*
 	 * efi_pe_entry uses MS calling convention, which requires 32 bytes of
 	 * shadow space on the stack even if all arguments are passed in
@@ -245,10 +248,6 @@ SYM_FUNC_START(efi32_entry)
 	jmp	startup_32
 SYM_FUNC_END(efi32_entry)
 
-#define ST32_boottime		60 // offsetof(efi_system_table_32_t, boottime)
-#define BS32_handle_protocol	88 // offsetof(efi_boot_services_32_t, handle_protocol)
-#define LI32_image_base		32 // offsetof(efi_loaded_image_32_t, image_base)
-
 /*
  * efi_status_t efi32_pe_entry(efi_handle_t image_handle,
  *			       efi_system_table_32_t *sys_table)
@@ -256,8 +255,6 @@ SYM_FUNC_END(efi32_entry)
 SYM_FUNC_START(efi32_pe_entry)
 	pushl	%ebp
 	movl	%esp, %ebp
-	pushl	%eax				// dummy push to allocate loaded_image
-
 	pushl	%ebx				// save callee-save registers
 	pushl	%edi
 
@@ -266,48 +263,8 @@ SYM_FUNC_START(efi32_pe_entry)
 	movl	$0x80000003, %eax		// EFI_UNSUPPORTED
 	jnz	2f
 
-	call	1f
-1:	pop	%ebx
-
-	/* Get the loaded image protocol pointer from the image handle */
-	leal	-4(%ebp), %eax
-	pushl	%eax				// &loaded_image
-	leal	(loaded_image_proto - 1b)(%ebx), %eax
-	pushl	%eax				// pass the GUID address
-	pushl	8(%ebp)				// pass the image handle
-
-	/*
-	 * Note the alignment of the stack frame.
-	 *   sys_table
-	 *   handle             <-- 16-byte aligned on entry by ABI
-	 *   return address
-	 *   frame pointer
-	 *   loaded_image       <-- local variable
-	 *   saved %ebx		<-- 16-byte aligned here
-	 *   saved %edi
-	 *   &loaded_image
-	 *   &loaded_image_proto
-	 *   handle             <-- 16-byte aligned for call to handle_protocol
-	 */
-
-	movl	12(%ebp), %eax			// sys_table
-	movl	ST32_boottime(%eax), %eax	// sys_table->boottime
-	call	*BS32_handle_protocol(%eax)	// sys_table->boottime->handle_protocol
-	addl	$12, %esp			// restore argument space
-	testl	%eax, %eax
-	jnz	2f
-
 	movl	8(%ebp), %ecx			// image_handle
 	movl	12(%ebp), %edx			// sys_table
-	movl	-4(%ebp), %esi			// loaded_image
-	movl	LI32_image_base(%esi), %esi	// loaded_image->image_base
-	leal	(startup_32 - 1b)(%ebx), %ebp	// runtime address of startup_32
-	/*
-	 * We need to set the image_offset variable here since startup_32() will
-	 * use it before we get to the 64-bit efi_pe_entry() in C code.
-	 */
-	subl	%esi, %ebp			// calculate image_offset
-	movl	%ebp, (image_offset - 1b)(%ebx)	// save image_offset
 	xorl	%esi, %esi
 	jmp	efi32_entry			// pass %ecx, %edx, %esi
 						// no other registers remain live
@@ -318,15 +275,6 @@ SYM_FUNC_START(efi32_pe_entry)
 	RET
 SYM_FUNC_END(efi32_pe_entry)
 
-	.section ".rodata"
-	/* EFI loaded image protocol GUID */
-	.balign 4
-SYM_DATA_START_LOCAL(loaded_image_proto)
-	.long	0x5b1b31a1
-	.word	0x9562, 0x11d2
-	.byte	0x8e, 0x3f, 0x00, 0xa0, 0xc9, 0x69, 0x72, 0x3b
-SYM_DATA_END(loaded_image_proto)
-
 	.data
 	.balign	8
 SYM_DATA_START_LOCAL(efi32_boot_gdt)
diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 8ee8749007595fcc..3f9b80726070a8e7 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -84,19 +84,6 @@ SYM_FUNC_START(startup_32)
 
 #ifdef CONFIG_RELOCATABLE
 	leal	startup_32@GOTOFF(%edx), %ebx
-
-#ifdef CONFIG_EFI_STUB
-/*
- * If we were loaded via the EFI LoadImage service, startup_32() will be at an
- * offset to the start of the space allocated for the image. efi_pe_entry() will
- * set up image_offset to tell us where the image actually starts, so that we
- * can use the full available buffer.
- *	image_offset = startup_32 - image_base
- * Otherwise image_offset will be zero and has no effect on the calculations.
- */
-	subl    image_offset@GOTOFF(%edx), %ebx
-#endif
-
 	movl	BP_kernel_alignment(%esi), %eax
 	decl	%eax
 	addl    %eax, %ebx
@@ -150,15 +137,10 @@ SYM_FUNC_START(startup_32)
 	jmp	*%eax
 SYM_FUNC_END(startup_32)
 
-#ifdef CONFIG_EFI_STUB
+#ifdef CONFIG_EFI_HANDOVER_PROTOCOL
 SYM_FUNC_START(efi32_stub_entry)
-	add	$0x4, %esp
-	movl	8(%esp), %esi	/* save boot_params pointer */
-	call	efi_main
-	/* efi_main returns the possibly relocated address of startup_32 */
-	jmp	*%eax
+	jmp	efi_main
 SYM_FUNC_END(efi32_stub_entry)
-SYM_FUNC_ALIAS(efi_stub_entry, efi32_stub_entry)
 #endif
 
 	.text
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index bcf678aed81468e3..320e2825ff0b32da 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -146,19 +146,6 @@ SYM_FUNC_START(startup_32)
 
 #ifdef CONFIG_RELOCATABLE
 	movl	%ebp, %ebx
-
-#ifdef CONFIG_EFI_STUB
-/*
- * If we were loaded via the EFI LoadImage service, startup_32 will be at an
- * offset to the start of the space allocated for the image. efi_pe_entry will
- * set up image_offset to tell us where the image actually starts, so that we
- * can use the full available buffer.
- *	image_offset = startup_32 - image_base
- * Otherwise image_offset will be zero and has no effect on the calculations.
- */
-	subl    rva(image_offset)(%ebp), %ebx
-#endif
-
 	movl	BP_kernel_alignment(%esi), %eax
 	decl	%eax
 	addl	%eax, %ebx
@@ -346,20 +333,6 @@ SYM_CODE_START(startup_64)
 	/* Start with the delta to where the kernel will run at. */
 #ifdef CONFIG_RELOCATABLE
 	leaq	startup_32(%rip) /* - $startup_32 */, %rbp
-
-#ifdef CONFIG_EFI_STUB
-/*
- * If we were loaded via the EFI LoadImage service, startup_32 will be at an
- * offset to the start of the space allocated for the image. efi_pe_entry will
- * set up image_offset to tell us where the image actually starts, so that we
- * can use the full available buffer.
- *	image_offset = startup_32 - image_base
- * Otherwise image_offset will be zero and has no effect on the calculations.
- */
-	movl    image_offset(%rip), %eax
-	subq	%rax, %rbp
-#endif
-
 	movl	BP_kernel_alignment(%rsi), %eax
 	decl	%eax
 	addq	%rax, %rbp
@@ -481,19 +454,12 @@ SYM_CODE_START(startup_64)
 	jmp	*%rax
 SYM_CODE_END(startup_64)
 
-#ifdef CONFIG_EFI_STUB
 #ifdef CONFIG_EFI_HANDOVER_PROTOCOL
 	.org 0x390
-#endif
 SYM_FUNC_START(efi64_stub_entry)
 	and	$~0xf, %rsp			/* realign the stack */
-	movq	%rdx, %rbx			/* save boot_params pointer */
 	call	efi_main
-	movq	%rbx,%rsi
-	leaq	rva(startup_64)(%rax), %rax
-	jmp	*%rax
 SYM_FUNC_END(efi64_stub_entry)
-SYM_FUNC_ALIAS(efi_stub_entry, efi64_stub_entry)
 #endif
 
 	.text
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 419280d263d2e3f2..24fa828eeef7d9dd 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -88,6 +88,8 @@ static inline void efi_fpu_end(void)
 }
 
 #ifdef CONFIG_X86_32
+#define EFI_X86_KERNEL_ALLOC_LIMIT		(SZ_512M - 1)
+
 #define arch_efi_call_virt_setup()					\
 ({									\
 	efi_fpu_begin();						\
@@ -101,8 +103,7 @@ static inline void efi_fpu_end(void)
 })
 
 #else /* !CONFIG_X86_32 */
-
-#define EFI_LOADER_SIGNATURE	"EL64"
+#define EFI_X86_KERNEL_ALLOC_LIMIT		EFI_ALLOC_LIMIT
 
 extern asmlinkage u64 __efi_call(void *fp, ...);
 
@@ -216,6 +217,8 @@ efi_status_t efi_set_virtual_address_map(unsigned long memory_map_size,
 
 #ifdef CONFIG_EFI_MIXED
 
+#define EFI_ALLOC_LIMIT		(efi_is_64bit() ? ULONG_MAX : U32_MAX)
+
 #define ARCH_HAS_EFISTUB_WRAPPERS
 
 static inline bool efi_is_64bit(void)
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 33d11ba78f1d8c4f..59076e16c1ac11ee 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -15,16 +15,13 @@
 #include <asm/setup.h>
 #include <asm/desc.h>
 #include <asm/boot.h>
+#include <asm/kaslr.h>
 #include <asm/sev.h>
 
 #include "efistub.h"
 
-/* Maximum physical address for 64-bit kernel with 4-level paging */
-#define MAXMEM_X86_64_4LEVEL (1ull << 46)
-
 const efi_system_table_t *efi_system_table;
 const efi_dxe_services_table_t *efi_dxe_table;
-u32 image_offset __section(".data");
 static efi_loaded_image_t *image = NULL;
 static efi_memory_attribute_protocol_t *memattr;
 
@@ -275,33 +272,9 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
 	}
 }
 
-void startup_32(struct boot_params *boot_params);
-
-static void
-setup_memory_protection(unsigned long image_base, unsigned long image_size)
-{
-#ifdef CONFIG_64BIT
-	if (image_base != (unsigned long)startup_32)
-		adjust_memory_range_protection(image_base, image_size);
-#else
-	/*
-	 * Clear protection flags on a whole range of possible
-	 * addresses used for KASLR. We don't need to do that
-	 * on x86_64, since KASLR/extraction is performed after
-	 * dedicated identity page tables are built and we only
-	 * need to remove possible protection on relocated image
-	 * itself disregarding further relocations.
-	 */
-	adjust_memory_range_protection(LOAD_PHYSICAL_ADDR,
-				       KERNEL_IMAGE_SIZE - LOAD_PHYSICAL_ADDR);
-#endif
-}
-
 static const efi_char16_t apple[] = L"Apple";
 
-static void setup_quirks(struct boot_params *boot_params,
-			 unsigned long image_base,
-			 unsigned long image_size)
+static void setup_quirks(struct boot_params *boot_params)
 {
 	efi_char16_t *fw_vendor = (efi_char16_t *)(unsigned long)
 		efi_table_attr(efi_system_table, fw_vendor);
@@ -310,9 +283,6 @@ static void setup_quirks(struct boot_params *boot_params,
 		if (IS_ENABLED(CONFIG_APPLE_PROPERTIES))
 			retrieve_apple_device_properties(boot_params);
 	}
-
-	if (IS_ENABLED(CONFIG_EFI_DXE_MEM_ATTRIBUTES))
-		setup_memory_protection(image_base, image_size);
 }
 
 /*
@@ -432,6 +402,7 @@ static void __noreturn efi_exit(efi_handle_t handle, efi_status_t status)
 		asm("hlt");
 }
 
+static __alias(efi_main)
 void __noreturn efi_stub_entry(efi_handle_t handle,
 			       efi_system_table_t *sys_table_arg,
 			       struct boot_params *boot_params);
@@ -465,7 +436,6 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
 	}
 
 	image_base = efi_table_attr(image, image_base);
-	image_offset = (void *)startup_32 - image_base;
 
 	status = efi_allocate_pages(sizeof(struct boot_params),
 				    (unsigned long *)&boot_params, ULONG_MAX);
@@ -853,20 +823,82 @@ static void efi_5level_switch(void)
 #endif
 }
 
+static void efi_get_seed(void *seed, int size)
+{
+	efi_get_random_bytes(size, seed);
+
+	/*
+	 * This only updates seed[0] when running on 32-bit, but in that case,
+	 * we don't use seed[1] anyway, as there is no virtual KASLR on 32-bit.
+	 */
+	*(unsigned long *)seed ^= kaslr_get_random_long("EFI");
+}
+
+static void error(char *str)
+{
+	efi_warn("Decompression failed: %s\n", str);
+}
+
+static efi_status_t efi_decompress_kernel(unsigned long *kernel_entry)
+{
+	unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
+	unsigned long addr, alloc_size, entry;
+	efi_status_t status;
+	u32 seed[2] = {};
+
+	/* determine the required size of the allocation */
+	alloc_size = ALIGN(max((unsigned long)output_len, kernel_total_size),
+			   MIN_KERNEL_ALIGN);
+
+	if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && !efi_nokaslr) {
+		u64 range = KERNEL_IMAGE_SIZE - LOAD_PHYSICAL_ADDR - kernel_total_size;
+
+		efi_get_seed(seed, sizeof(seed));
+
+		virt_addr += (range * seed[1]) >> 32;
+		virt_addr &= ~(CONFIG_PHYSICAL_ALIGN - 1);
+	}
+
+	status = efi_random_alloc(alloc_size, CONFIG_PHYSICAL_ALIGN, &addr,
+				  seed[0], EFI_LOADER_CODE,
+				  EFI_X86_KERNEL_ALLOC_LIMIT);
+	if (status != EFI_SUCCESS)
+		return status;
+
+	entry = decompress_kernel((void *)addr, virt_addr, error);
+	if (entry == ULONG_MAX) {
+		efi_free(alloc_size, addr);
+		return EFI_LOAD_ERROR;
+	}
+
+	*kernel_entry = addr + entry;
+
+	adjust_memory_range_protection(addr, kernel_total_size);
+
+	return EFI_SUCCESS;
+}
+
+static void __noreturn enter_kernel(unsigned long kernel_addr,
+				    struct boot_params *boot_params)
+{
+	/* enter decompressed kernel with boot_params pointer in RSI/ESI */
+	asm("jmp *%0"::"r"(kernel_addr), "S"(boot_params));
+
+	unreachable();
+}
+
 /*
- * On success, we return the address of startup_32, which has potentially been
- * relocated by efi_relocate_kernel.
+ * On success, we jump to the relocated kernel directly and never return.
  * On failure, we exit to the firmware via efi_exit instead of returning.
  */
-asmlinkage unsigned long efi_main(efi_handle_t handle,
-				  efi_system_table_t *sys_table_arg,
-				  struct boot_params *boot_params)
+asmlinkage void __noreturn efi_main(efi_handle_t handle,
+				    efi_system_table_t *sys_table_arg,
+				    struct boot_params *boot_params)
 {
 	efi_guid_t guid = EFI_MEMORY_ATTRIBUTE_PROTOCOL_GUID;
-	unsigned long bzimage_addr = (unsigned long)startup_32;
-	unsigned long buffer_start, buffer_end;
 	struct setup_header *hdr = &boot_params->hdr;
 	const struct linux_efi_initrd *initrd = NULL;
+	unsigned long kernel_entry;
 	efi_status_t status;
 
 	efi_system_table = sys_table_arg;
@@ -892,60 +924,6 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
 		goto fail;
 	}
 
-	/*
-	 * If the kernel isn't already loaded at a suitable address,
-	 * relocate it.
-	 *
-	 * It must be loaded above LOAD_PHYSICAL_ADDR.
-	 *
-	 * The maximum address for 64-bit is 1 << 46 for 4-level paging. This
-	 * is defined as the macro MAXMEM, but unfortunately that is not a
-	 * compile-time constant if 5-level paging is configured, so we instead
-	 * define our own macro for use here.
-	 *
-	 * For 32-bit, the maximum address is complicated to figure out, for
-	 * now use KERNEL_IMAGE_SIZE, which will be 512MiB, the same as what
-	 * KASLR uses.
-	 *
-	 * Also relocate it if image_offset is zero, i.e. the kernel wasn't
-	 * loaded by LoadImage, but rather by a bootloader that called the
-	 * handover entry. The reason we must always relocate in this case is
-	 * to handle the case of systemd-boot booting a unified kernel image,
-	 * which is a PE executable that contains the bzImage and an initrd as
-	 * COFF sections. The initrd section is placed after the bzImage
-	 * without ensuring that there are at least init_size bytes available
-	 * for the bzImage, and thus the compressed kernel's startup code may
-	 * overwrite the initrd unless it is moved out of the way.
-	 */
-
-	buffer_start = ALIGN(bzimage_addr - image_offset,
-			     hdr->kernel_alignment);
-	buffer_end = buffer_start + hdr->init_size;
-
-	if ((buffer_start < LOAD_PHYSICAL_ADDR)				     ||
-	    (IS_ENABLED(CONFIG_X86_32) && buffer_end > KERNEL_IMAGE_SIZE)    ||
-	    (IS_ENABLED(CONFIG_X86_64) && buffer_end > MAXMEM_X86_64_4LEVEL) ||
-	    (image_offset == 0)) {
-		extern char _bss[];
-
-		status = efi_relocate_kernel(&bzimage_addr,
-					     (unsigned long)_bss - bzimage_addr,
-					     hdr->init_size,
-					     hdr->pref_address,
-					     hdr->kernel_alignment,
-					     LOAD_PHYSICAL_ADDR);
-		if (status != EFI_SUCCESS) {
-			efi_err("efi_relocate_kernel() failed!\n");
-			goto fail;
-		}
-		/*
-		 * Now that we've copied the kernel elsewhere, we no longer
-		 * have a set up block before startup_32(), so reset image_offset
-		 * to zero in case it was set earlier.
-		 */
-		image_offset = 0;
-	}
-
 #ifdef CONFIG_CMDLINE_BOOL
 	status = efi_parse_options(CONFIG_CMDLINE);
 	if (status != EFI_SUCCESS) {
@@ -963,6 +941,12 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
 		}
 	}
 
+	status = efi_decompress_kernel(&kernel_entry);
+	if (status != EFI_SUCCESS) {
+		efi_err("Failed to decompress kernel\n");
+		goto fail;
+	}
+
 	/*
 	 * At this point, an initrd may already have been loaded by the
 	 * bootloader and passed via bootparams. We permit an initrd loaded
@@ -1002,7 +986,7 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
 
 	setup_efi_pci(boot_params);
 
-	setup_quirks(boot_params, bzimage_addr, buffer_end - buffer_start);
+	setup_quirks(boot_params);
 
 	status = exit_boot(boot_params, handle);
 	if (status != EFI_SUCCESS) {
@@ -1012,7 +996,7 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
 
 	efi_5level_switch();
 
-	return bzimage_addr;
+	enter_kernel(kernel_entry, boot_params);
 fail:
 	efi_err("efi_main() failed!\n");
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 19/20] x86: efistub: Clear BSS in EFI handover protocol entrypoint
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (17 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 18/20] x86: efistub: Avoid legacy decompressor when doing EFI boot Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  2023-05-08  7:03 ` [PATCH v2 20/20] x86: decompressor: Avoid magic offsets for EFI handover entrypoint Ard Biesheuvel
  19 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

The so-called EFI handover protocol is value-add from the distros that
permits a loader to simply copy a PE kernel image into memory and call
an alternative entrypoint that is described by an embedded boot_params
structure.

Most implementations of this protocol do not bother to check the PE
header for minimum alignment, section placement, etc., and therefore also
don't clear the image's BSS, or even allocate enough memory for it.

Allocating more memory on the fly is rather difficult, but let's at
least clear the BSS explicitly when entering in this manner, so that the
decompressor's pseudo-malloc() does not get confused by global variables
that were not zero-initialized correctly.

Note that we shouldn't clear BSS before calling efi_main() if we are not
booting via the handover protocol, but this entrypoint is no longer
shared with the pure EFI entrypoint, so we can ignore that case here.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_32.S      |  6 -----
 arch/x86/boot/compressed/head_64.S      |  2 +-
 drivers/firmware/efi/libstub/x86-stub.c | 24 +++++++++++++++++---
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 3f9b80726070a8e7..cd9587fcd5084f22 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -137,12 +137,6 @@ SYM_FUNC_START(startup_32)
 	jmp	*%eax
 SYM_FUNC_END(startup_32)
 
-#ifdef CONFIG_EFI_HANDOVER_PROTOCOL
-SYM_FUNC_START(efi32_stub_entry)
-	jmp	efi_main
-SYM_FUNC_END(efi32_stub_entry)
-#endif
-
 	.text
 SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
 
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 320e2825ff0b32da..b7599cbbd2ea1136 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -458,7 +458,7 @@ SYM_CODE_END(startup_64)
 	.org 0x390
 SYM_FUNC_START(efi64_stub_entry)
 	and	$~0xf, %rsp			/* realign the stack */
-	call	efi_main
+	call	efi_handover_entry
 SYM_FUNC_END(efi64_stub_entry)
 #endif
 
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 59076e16c1ac11ee..0528db3e36cf636b 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -891,9 +891,9 @@ static void __noreturn enter_kernel(unsigned long kernel_addr,
  * On success, we jump to the relocated kernel directly and never return.
  * On failure, we exit to the firmware via efi_exit instead of returning.
  */
-asmlinkage void __noreturn efi_main(efi_handle_t handle,
-				    efi_system_table_t *sys_table_arg,
-				    struct boot_params *boot_params)
+static void __noreturn efi_main(efi_handle_t handle,
+				efi_system_table_t *sys_table_arg,
+				struct boot_params *boot_params)
 {
 	efi_guid_t guid = EFI_MEMORY_ATTRIBUTE_PROTOCOL_GUID;
 	struct setup_header *hdr = &boot_params->hdr;
@@ -1002,3 +1002,21 @@ asmlinkage void __noreturn efi_main(efi_handle_t handle,
 
 	efi_exit(handle, status);
 }
+
+#ifdef CONFIG_EFI_HANDOVER_PROTOCOL
+void efi_handover_entry(efi_handle_t handle, efi_system_table_t *sys_table_arg,
+			struct boot_params *boot_params)
+{
+	extern char _bss[], _ebss[];
+
+	/* Ensure that BSS is zeroed when booting via the handover protocol */
+	memset(_bss, 0, _ebss - _bss);
+	efi_main(handle, sys_table_arg, boot_params);
+}
+
+#ifdef CONFIG_X86_32
+extern __alias(efi_handover_entry)
+void efi32_stub_entry(efi_handle_t handle, efi_system_table_t *sys_table_arg,
+		      struct boot_params *boot_params);
+#endif
+#endif
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 20/20] x86: decompressor: Avoid magic offsets for EFI handover entrypoint
  2023-05-08  7:03 [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot Ard Biesheuvel
                   ` (18 preceding siblings ...)
  2023-05-08  7:03 ` [PATCH v2 19/20] x86: efistub: Clear BSS in EFI handover protocol entrypoint Ard Biesheuvel
@ 2023-05-08  7:03 ` Ard Biesheuvel
  19 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-08  7:03 UTC (permalink / raw)
  To: linux-efi
  Cc: linux-kernel, Ard Biesheuvel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

The special EFI handover protocol entrypoint's offset relative to the
startup_XX address is described in struct boot_params as
handover_offset, so that a Linux/x86-aware EFI loader can find it
there.

When mixed mode is enabled, this single field has to describe this
offset for both the 32-bit and 64-bit entrypoints, so their respective
relative offsets have to be identical.

Currently, we use hard-coded fixed offsets to ensure this, but the only
requirement is that the entrypoints are 0x200 bytes apart, and this only
matters when EFI mixed mode is configured to begin with.

So just set the required offset directly. This could potentially result
in a build error if the 32-bit startup code is much smaller than the
64-bit code, but this is currently far from the case, and is easily fixed
when that situation does arise.
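
For context, .org raises an assembler error when it would move the
location counter backwards, so the directive in the hunk below doubles
as the build-time check mentioned above. A toy example of the failure
mode:

	foo:	.fill	0x300, 1, 0x90	/* 0x300 bytes of code/data */
		.org	foo + 0x200	/* error: attempt to move .org backwards */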

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/head_64.S | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index b7599cbbd2ea1136..72780644a2272af8 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -282,7 +282,6 @@ SYM_FUNC_START(startup_32)
 SYM_FUNC_END(startup_32)
 
 #if IS_ENABLED(CONFIG_EFI_MIXED) && IS_ENABLED(CONFIG_EFI_HANDOVER_PROTOCOL)
-	.org 0x190
 SYM_FUNC_START(efi32_stub_entry)
 	add	$0x4, %esp		/* Discard return address */
 	popl	%ecx
@@ -455,7 +454,9 @@ SYM_CODE_START(startup_64)
 SYM_CODE_END(startup_64)
 
 #ifdef CONFIG_EFI_HANDOVER_PROTOCOL
-	.org 0x390
+#ifdef CONFIG_EFI_MIXED
+	.org	efi32_stub_entry + 0x200
+#endif
 SYM_FUNC_START(efi64_stub_entry)
 	and	$~0xf, %rsp			/* realign the stack */
 	call	efi_handover_entry
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 03/20] x86: decompressor: Call trampoline as a normal function
  2023-05-08  7:03 ` [PATCH v2 03/20] x86: decompressor: Call trampoline as a normal function Ard Biesheuvel
@ 2023-05-15 13:59   ` Kirill A. Shutemov
  0 siblings, 0 replies; 42+ messages in thread
From: Kirill A. Shutemov @ 2023-05-15 13:59 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

On Mon, May 08, 2023 at 09:03:13AM +0200, Ard Biesheuvel wrote:
> Move the long return to switch to 32-bit mode into the trampoline code
> so we can call it as an ordinary function. This will allow us to call it
> directly from C code in a subsequent patch.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 04/20] x86: decompressor: Use standard calling convention for trampoline
  2023-05-08  7:03 ` [PATCH v2 04/20] x86: decompressor: Use standard calling convention for trampoline Ard Biesheuvel
@ 2023-05-15 14:00   ` Kirill A. Shutemov
  0 siblings, 0 replies; 42+ messages in thread
From: Kirill A. Shutemov @ 2023-05-15 14:00 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

On Mon, May 08, 2023 at 09:03:14AM +0200, Ard Biesheuvel wrote:
> Update the trampoline code so its arguments are passed via RDI and RSI,
> which matches the ordinary SysV calling convention for x86_64. This will
> allow us to call this code directly from C.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline
  2023-05-08  7:03 ` [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline Ard Biesheuvel
@ 2023-05-15 14:03   ` Kirill A. Shutemov
  2023-05-17 22:40   ` Tom Lendacky
  1 sibling, 0 replies; 42+ messages in thread
From: Kirill A. Shutemov @ 2023-05-15 14:03 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

On Mon, May 08, 2023 at 09:03:15AM +0200, Ard Biesheuvel wrote:
> The 32-bit trampoline no longer uses the stack for anything except
> performing a long return back to long mode. Currently, this stack is
> allocated in the same page that carries the trampoline code, which means
> this page must be mapped writable and executable, and the stack is
> therefore executable as well.
> 
> So let's do a long jump instead: that way, we can pre-calculate the
> return address and poke it into the code before we call it. In a later
> patch, we will take advantage of this by removing writable permissions
> (and adding executable ones) explicitly when booting via the EFI stub.
> 
> Not playing with the stack pointer also makes it more straight-forward
> to call the trampoline code as an ordinary 64-bit function from C code.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 06/20] x86: decompressor: Call trampoline directly from C code
  2023-05-08  7:03 ` [PATCH v2 06/20] x86: decompressor: Call trampoline directly from C code Ard Biesheuvel
@ 2023-05-15 14:05   ` Kirill A. Shutemov
  0 siblings, 0 replies; 42+ messages in thread
From: Kirill A. Shutemov @ 2023-05-15 14:05 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

On Mon, May 08, 2023 at 09:03:16AM +0200, Ard Biesheuvel wrote:
> Instead of returning to the asm calling code to invoke the trampoline,
> call it straight from the C code that sets the scene. That way, we don't
> need the struct return type for returning two values, and we can make
> the call conditional more cleanly in a subsequent patch.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 07/20] x86: decompressor: Only call the trampoline when changing paging levels
  2023-05-08  7:03 ` [PATCH v2 07/20] x86: decompressor: Only call the trampoline when changing paging levels Ard Biesheuvel
@ 2023-05-15 14:07   ` Kirill A. Shutemov
  0 siblings, 0 replies; 42+ messages in thread
From: Kirill A. Shutemov @ 2023-05-15 14:07 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

On Mon, May 08, 2023 at 09:03:17AM +0200, Ard Biesheuvel wrote:
> Since we know the current and desired number of paging levels when
> preparing the trampoline, let's not call the trampoline at all when we
> know that calling it is not going to result in a change to the number of
> paging levels. Given that we are already running in long mode, the PAE
> and LA57 settings are necessarily consistent with the currently active
> page tables - the only difference is that we will always preserve
> CR4.MCE in this case, but this will be cleared by the real kernel
> startup code if CONFIG_X86_MCE is not enabled.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 09/20] x86: efistub: Perform 4/5 level paging switch from the stub
  2023-05-08  7:03 ` [PATCH v2 09/20] x86: efistub: Perform 4/5 level paging switch from the stub Ard Biesheuvel
@ 2023-05-15 14:14   ` Kirill A. Shutemov
  2023-05-16 17:53     ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Kirill A. Shutemov @ 2023-05-15 14:14 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

On Mon, May 08, 2023 at 09:03:19AM +0200, Ard Biesheuvel wrote:
> In preparation for updating the EFI stub boot flow to avoid the bare
> metal decompressor code altogether, implement the support code for
> switching between 4 and 5 levels of paging before jumping to the kernel
> proper.
> 
> This reuses the newly refactored trampoline that the bare metal
> decompressor uses, but relies on EFI APIs to allocate 32-bit addressable
> memory and remap it with the appropriate permissions. Given that the
> bare metal decompressor will no longer call into the trampoline if the
> number of paging levels is already set correctly, we no longer need to
> remove NX restrictions from the memory range where this trampoline may
> end up.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  drivers/firmware/efi/libstub/efi-stub-helper.c |   4 +
>  drivers/firmware/efi/libstub/x86-stub.c        | 119 ++++++++++++++++----
>  2 files changed, 102 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
> index 1e0203d74691ffcc..fc5f3b4c45e91401 100644
> --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> @@ -16,6 +16,8 @@
>  
>  #include "efistub.h"
>  
> +extern bool efi_no5lvl;
> +
>  bool efi_nochunk;
>  bool efi_nokaslr = !IS_ENABLED(CONFIG_RANDOMIZE_BASE);
>  bool efi_novamap;
> @@ -73,6 +75,8 @@ efi_status_t efi_parse_options(char const *cmdline)
>  			efi_loglevel = CONSOLE_LOGLEVEL_QUIET;
>  		} else if (!strcmp(param, "noinitrd")) {
>  			efi_noinitrd = true;
> +		} else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
> +			efi_no5lvl = true;
>  		} else if (!strcmp(param, "efi") && val) {
>  			efi_nochunk = parse_option_str(val, "nochunk");
>  			efi_novamap |= parse_option_str(val, "novamap");
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index a0bfd31358ba97b1..fb83a72ad905ad6e 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -267,32 +267,11 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
>  	}
>  }
>  
> -/*
> - * Trampoline takes 2 pages and can be loaded in first megabyte of memory
> - * with its end placed between 128k and 640k where BIOS might start.
> - * (see arch/x86/boot/compressed/pgtable_64.c)
> - *
> - * We cannot find exact trampoline placement since memory map
> - * can be modified by UEFI, and it can alter the computed address.
> - */
> -
> -#define TRAMPOLINE_PLACEMENT_BASE ((128 - 8)*1024)
> -#define TRAMPOLINE_PLACEMENT_SIZE (640*1024 - (128 - 8)*1024)
> -
>  void startup_32(struct boot_params *boot_params);
>  
>  static void
>  setup_memory_protection(unsigned long image_base, unsigned long image_size)
>  {
> -	/*
> -	 * Allow execution of possible trampoline used
> -	 * for switching between 4- and 5-level page tables
> -	 * and relocated kernel image.
> -	 */
> -
> -	adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
> -				       TRAMPOLINE_PLACEMENT_SIZE);
> -
>  #ifdef CONFIG_64BIT
>  	if (image_base != (unsigned long)startup_32)
>  		adjust_memory_range_protection(image_base, image_size);
> @@ -760,6 +739,96 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
>  	return EFI_SUCCESS;
>  }
>  
> +bool efi_no5lvl;
> +
> +static void (*la57_toggle)(void *trampoline, bool enable_5lvl);
> +
> +extern void trampoline_32bit_src(void *, bool);
> +extern const u16 trampoline_ljmp_imm_offset;
> +
> +/*
> + * Enabling (or disabling) 5 level paging is tricky, because it can only be
> + * done from 32-bit mode with paging disabled. This means not only that the
> + * code itself must be running from 32-bit addressable physical memory, but
> + * also that the root page table must be 32-bit addressable, as we cannot
> + * program a 64-bit value into CR3 when running in 32-bit mode.
> + */
> +static efi_status_t efi_setup_5level_paging(void)
> +{
> +	u8 tmpl_size = (u8 *)&trampoline_ljmp_imm_offset - (u8 *)&trampoline_32bit_src;
> +	efi_status_t status;
> +	u8 *la57_code;
> +
> +	if (!efi_is_64bit())
> +		return EFI_SUCCESS;
> +
> +	/* check for 5 level paging support */
> +	if (native_cpuid_eax(0) < 7 ||
> +	    !(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
> +		return EFI_SUCCESS;
> +
> +	/* allocate some 32-bit addressable memory for code and a page table */
> +	status = efi_allocate_pages(2 * PAGE_SIZE, (unsigned long *)&la57_code,
> +				    U32_MAX);
> +	if (status != EFI_SUCCESS)
> +		return status;
> +
> +	la57_toggle = memcpy(la57_code, trampoline_32bit_src, tmpl_size);
> +	memset(la57_code + tmpl_size, 0x90, PAGE_SIZE - tmpl_size);
> +
> +	/*
> +	 * To avoid having to allocate a 32-bit addressable stack, we use a
> +	 * ljmp to switch back to long mode. However, this takes an absolute
> +	 * address, so we have to poke it in at runtime.
> +	 */
> +	*(u32 *)&la57_code[trampoline_ljmp_imm_offset] += (unsigned long)la57_code;
> +
> +	adjust_memory_range_protection((unsigned long)la57_toggle, PAGE_SIZE);
> +
> +	return EFI_SUCCESS;
> +}
> +
> +static void efi_5level_switch(void)
> +{
> +#ifdef CONFIG_X86_64
> +	static const struct desc_struct gdt[] = {
> +		[GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
> +		[GDT_ENTRY_KERNEL_CS]   = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
> +	};
> +
> +	bool want_la57 = IS_ENABLED(CONFIG_X86_5LEVEL) && !efi_no5lvl;
> +	bool have_la57 = native_read_cr4() & X86_CR4_LA57;
> +	bool need_toggle = want_la57 ^ have_la57;
> +	u64 *pgt = (void *)la57_toggle + PAGE_SIZE;
> +	u64 *cr3 = (u64 *)__native_read_cr3();
> +	u64 *new_cr3;
> +
> +	if (!la57_toggle || !need_toggle)
> +		return;
> +
> +	if (!have_la57) {
> +		/*
> +		 * We are going to enable 5 level paging, so we need to
> +		 * allocate a root level page from the 32-bit addressable
> +		 * physical region, and plug the existing hierarchy into it.
> +		 */
> +		new_cr3 = memset(pgt, 0, PAGE_SIZE);
> +		new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
> +	} else {
> +		// take the new root table pointer from the current entry #0
> +		new_cr3 = (u64 *)(cr3[0] & PAGE_MASK);
> +
> +		// copy the new root level table if it is not 32-bit addressable
> +		if ((u64)new_cr3 > U32_MAX)
> +			new_cr3 = memcpy(pgt, new_cr3, PAGE_SIZE);
> +	}
> +
> +	native_load_gdt(&(struct desc_ptr){ sizeof(gdt) - 1, (u64)gdt });
> +
> +	la57_toggle(new_cr3, want_la57);
> +#endif
> +}
> +

Nit: I would prefer to have the #ifdef around the whole function, with a
dummy function for the !CONFIG_X86_64 case, if IS_ENABLED(CONFIG_X86_64)
is not an option.
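
Something like this (the function body elided):

	#ifdef CONFIG_X86_64
	static void efi_5level_switch(void)
	{
		/* ... actual switching code ... */
	}
	#else
	static void efi_5level_switch(void) { }
	#endif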

Otherwise:

	Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 08/20] x86: decompressor: Merge trampoline cleanup with switching code
       [not found]   ` <20230515140955.d4adbstv6gtnshp2@box.shutemov.name>
@ 2023-05-16 17:50     ` Ard Biesheuvel
  0 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-16 17:50 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

On Mon, 15 May 2023 at 16:10, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Mon, May 08, 2023 at 09:03:18AM +0200, Ard Biesheuvel wrote:
> > @@ -215,4 +208,5 @@ void cleanup_trampoline(void *pgtable)
> >               ptrs_per_p4d = 512;
> >       }
> >  #endif
> > +     return;
> >  }
>
> Return is redundant here.

The return (or at least a bare semicolon) is needed here because of the
goto label right before the #ifdef block: otherwise the function would
end with a label, and the compiler does not accept that.
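
As a standalone illustration (plain C, nothing kernel specific): a label
must be attached to a statement, so a function body cannot end on a bare
label:

	void f(int x)
	{
		if (!x)
			goto out;
		/* ... */
	out:
		return;	/* or just ";" - ending on "out:" alone won't compile */
	}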

> Otherwise:
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 09/20] x86: efistub: Perform 4/5 level paging switch from the stub
  2023-05-15 14:14   ` Kirill A. Shutemov
@ 2023-05-16 17:53     ` Ard Biesheuvel
  0 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-16 17:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Tom Lendacky,
	Kirill A . Shutemov, Linus Torvalds

On Mon, 15 May 2023 at 16:14, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Mon, May 08, 2023 at 09:03:19AM +0200, Ard Biesheuvel wrote:
> > In preparation for updating the EFI stub boot flow to avoid the bare
> > metal decompressor code altogether, implement the support code for
> > switching between 4 and 5 levels of paging before jumping to the kernel
> > proper.
> >
> > This reuses the newly refactored trampoline that the bare metal
> > decompressor uses, but relies on EFI APIs to allocate 32-bit addressable
> > memory and remap it with the appropriate permissions. Given that the
> > bare metal decompressor will no longer call into the trampoline if the
> > number of paging levels is already set correctly, we no longer need to
> > remove NX restrictions from the memory range where this trampoline may
> > end up.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  drivers/firmware/efi/libstub/efi-stub-helper.c |   4 +
> >  drivers/firmware/efi/libstub/x86-stub.c        | 119 ++++++++++++++++----
> >  2 files changed, 102 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
> > index 1e0203d74691ffcc..fc5f3b4c45e91401 100644
> > --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> > +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> > @@ -16,6 +16,8 @@
> >
> >  #include "efistub.h"
> >
> > +extern bool efi_no5lvl;
> > +
> >  bool efi_nochunk;
> >  bool efi_nokaslr = !IS_ENABLED(CONFIG_RANDOMIZE_BASE);
> >  bool efi_novamap;
> > @@ -73,6 +75,8 @@ efi_status_t efi_parse_options(char const *cmdline)
> >                       efi_loglevel = CONSOLE_LOGLEVEL_QUIET;
> >               } else if (!strcmp(param, "noinitrd")) {
> >                       efi_noinitrd = true;
> > +             } else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
> > +                     efi_no5lvl = true;
> >               } else if (!strcmp(param, "efi") && val) {
> >                       efi_nochunk = parse_option_str(val, "nochunk");
> >                       efi_novamap |= parse_option_str(val, "novamap");
> > diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> > index a0bfd31358ba97b1..fb83a72ad905ad6e 100644
> > --- a/drivers/firmware/efi/libstub/x86-stub.c
> > +++ b/drivers/firmware/efi/libstub/x86-stub.c
> > @@ -267,32 +267,11 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
> >       }
> >  }
> >
> > -/*
> > - * Trampoline takes 2 pages and can be loaded in first megabyte of memory
> > - * with its end placed between 128k and 640k where BIOS might start.
> > - * (see arch/x86/boot/compressed/pgtable_64.c)
> > - *
> > - * We cannot find exact trampoline placement since memory map
> > - * can be modified by UEFI, and it can alter the computed address.
> > - */
> > -
> > -#define TRAMPOLINE_PLACEMENT_BASE ((128 - 8)*1024)
> > -#define TRAMPOLINE_PLACEMENT_SIZE (640*1024 - (128 - 8)*1024)
> > -
> >  void startup_32(struct boot_params *boot_params);
> >
> >  static void
> >  setup_memory_protection(unsigned long image_base, unsigned long image_size)
> >  {
> > -     /*
> > -      * Allow execution of possible trampoline used
> > -      * for switching between 4- and 5-level page tables
> > -      * and relocated kernel image.
> > -      */
> > -
> > -     adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
> > -                                    TRAMPOLINE_PLACEMENT_SIZE);
> > -
> >  #ifdef CONFIG_64BIT
> >       if (image_base != (unsigned long)startup_32)
> >               adjust_memory_range_protection(image_base, image_size);
> > @@ -760,6 +739,96 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
> >       return EFI_SUCCESS;
> >  }
> >
> > +bool efi_no5lvl;
> > +
> > +static void (*la57_toggle)(void *trampoline, bool enable_5lvl);
> > +
> > +extern void trampoline_32bit_src(void *, bool);
> > +extern const u16 trampoline_ljmp_imm_offset;
> > +
> > +/*
> > + * Enabling (or disabling) 5 level paging is tricky, because it can only be
> > + * done from 32-bit mode with paging disabled. This means not only that the
> > + * code itself must be running from 32-bit addressable physical memory, but
> > + * also that the root page table must be 32-bit addressable, as we cannot
> > + * program a 64-bit value into CR3 when running in 32-bit mode.
> > + */
> > +static efi_status_t efi_setup_5level_paging(void)
> > +{
> > +     u8 tmpl_size = (u8 *)&trampoline_ljmp_imm_offset - (u8 *)&trampoline_32bit_src;
> > +     efi_status_t status;
> > +     u8 *la57_code;
> > +
> > +     if (!efi_is_64bit())
> > +             return EFI_SUCCESS;
> > +
> > +     /* check for 5 level paging support */
> > +     if (native_cpuid_eax(0) < 7 ||
> > +         !(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
> > +             return EFI_SUCCESS;
> > +
> > +     /* allocate some 32-bit addressable memory for code and a page table */
> > +     status = efi_allocate_pages(2 * PAGE_SIZE, (unsigned long *)&la57_code,
> > +                                 U32_MAX);
> > +     if (status != EFI_SUCCESS)
> > +             return status;
> > +
> > +     la57_toggle = memcpy(la57_code, trampoline_32bit_src, tmpl_size);
> > +     memset(la57_code + tmpl_size, 0x90, PAGE_SIZE - tmpl_size);
> > +
> > +     /*
> > +      * To avoid having to allocate a 32-bit addressable stack, we use a
> > +      * ljmp to switch back to long mode. However, this takes an absolute
> > +      * address, so we have to poke it in at runtime.
> > +      */
> > +     *(u32 *)&la57_code[trampoline_ljmp_imm_offset] += (unsigned long)la57_code;
> > +
> > +     adjust_memory_range_protection((unsigned long)la57_toggle, PAGE_SIZE);
> > +
> > +     return EFI_SUCCESS;
> > +}
> > +
> > +static void efi_5level_switch(void)
> > +{
> > +#ifdef CONFIG_X86_64
> > +     static const struct desc_struct gdt[] = {
> > +             [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
> > +             [GDT_ENTRY_KERNEL_CS]   = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
> > +     };
> > +
> > +     bool want_la57 = IS_ENABLED(CONFIG_X86_5LEVEL) && !efi_no5lvl;
> > +     bool have_la57 = native_read_cr4() & X86_CR4_LA57;
> > +     bool need_toggle = want_la57 ^ have_la57;
> > +     u64 *pgt = (void *)la57_toggle + PAGE_SIZE;
> > +     u64 *cr3 = (u64 *)__native_read_cr3();
> > +     u64 *new_cr3;
> > +
> > +     if (!la57_toggle || !need_toggle)
> > +             return;
> > +
> > +     if (!have_la57) {
> > +             /*
> > +              * We are going to enable 5 level paging, so we need to
> > +              * allocate a root level page from the 32-bit addressable
> > +              * physical region, and plug the existing hierarchy into it.
> > +              */
> > +             new_cr3 = memset(pgt, 0, PAGE_SIZE);
> > +             new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
> > +     } else {
> > +             // take the new root table pointer from the current entry #0
> > +             new_cr3 = (u64 *)(cr3[0] & PAGE_MASK);
> > +
> > +             // copy the new root level table if it is not 32-bit addressable
> > +             if ((u64)new_cr3 > U32_MAX)
> > +                     new_cr3 = memcpy(pgt, new_cr3, PAGE_SIZE);
> > +     }
> > +
> > +     native_load_gdt(&(struct desc_ptr){ sizeof(gdt) - 1, (u64)gdt });
> > +
> > +     la57_toggle(new_cr3, want_la57);
> > +#endif
> > +}
> > +
>
> Nit: I would prefer to have the #ifdef around the whole function, with a
> dummy function for the !CONFIG_X86_64 case, if IS_ENABLED(CONFIG_X86_64)
> is not an option.
>

The latter is not an option because 32-bit has no definition for
GDT_ENTRY_KERNEL32_CS.

Personally, I prefer having only a single function declaration.
Alternatively, we could move this code to a separate file and only
include it in the build on 64-bit. That would allow the use of
IS_ENABLED() at the call site, as the calls would just disappear from
the object file on 32-bit builds.
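
Roughly, with efi_5level_switch() living in an object file that is only
built on 64-bit, the call site would reduce to (a sketch of the idea,
not tested):

	if (IS_ENABLED(CONFIG_X86_64))
		efi_5level_switch();

The compiler folds the condition to zero on 32-bit, so the call - and
with it the reference to the now-undefined symbol - never makes it into
the object file.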

> Otherwise:
>
>         Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 01/20] x86: decompressor: Use proper sequence to take the address of the GOT
  2023-05-08  7:03 ` [PATCH v2 01/20] x86: decompressor: Use proper sequence to take the address of the GOT Ard Biesheuvel
@ 2023-05-17 17:31   ` Borislav Petkov
  2023-05-17 17:39     ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Borislav Petkov @ 2023-05-17 17:31 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Andy Lutomirski,
	Dave Hansen, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Alexey Khoroshilov, Peter Jones, Gerd Hoffmann, Dave Young,
	Mario Limonciello, Kees Cook, Tom Lendacky, Kirill A . Shutemov,
	Linus Torvalds

Please fix all your subjects as explained here:

https://kernel.org/doc/html/latest/process/maintainer-tip.html#patch-subject

On Mon, May 08, 2023 at 09:03:11AM +0200, Ard Biesheuvel wrote:
> We don't actually use a global offset table (GOT) in the 32-bit

Please use passive voice in your commit message: no "we" or "I", etc,
and describe your changes in imperative mood.

Personal pronouns are ambiguous in text, especially with
so many parties/companies/etc developing the kernel, so let's avoid
them, please.

> diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
> index 987ae727cf9f0d04..53cbee1e2a93efce 100644
> --- a/arch/x86/boot/compressed/head_32.S
> +++ b/arch/x86/boot/compressed/head_32.S
> @@ -58,7 +58,7 @@ SYM_FUNC_START(startup_32)
>  	leal	(BP_scratch+4)(%esi), %esp
>  	call	1f
>  1:	popl	%edx
> -	addl	$_GLOBAL_OFFSET_TABLE_+(.-1b), %edx
> +	leal	(_GLOBAL_OFFSET_TABLE_ - 1b)(%edx), %edx

Yeah, that's a bit better.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 01/20] x86: decompressor: Use proper sequence to take the address of the GOT
  2023-05-17 17:31   ` Borislav Petkov
@ 2023-05-17 17:39     ` Ard Biesheuvel
  0 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-17 17:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Andy Lutomirski,
	Dave Hansen, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Alexey Khoroshilov, Peter Jones, Gerd Hoffmann, Dave Young,
	Mario Limonciello, Kees Cook, Tom Lendacky, Kirill A . Shutemov,
	Linus Torvalds

On Wed, 17 May 2023 at 19:31, Borislav Petkov <bp@alien8.de> wrote:
>
> Please fix all your subjects as explained here:
>
> https://kernel.org/doc/html/latest/process/maintainer-tip.html#patch-subject
>
> On Mon, May 08, 2023 at 09:03:11AM +0200, Ard Biesheuvel wrote:
> > We don't actually use a global offset table (GOT) in the 32-bit
>
> Please use passive voice in your commit message: no "we" or "I", etc,
> and describe your changes in imperative mood.
>
> Personal pronouns are ambiguous in text, especially with
> so many parties/companies/etc developing the kernel so let's avoid them
> please.
>

Ack.

> > diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
> > index 987ae727cf9f0d04..53cbee1e2a93efce 100644
> > --- a/arch/x86/boot/compressed/head_32.S
> > +++ b/arch/x86/boot/compressed/head_32.S
> > @@ -58,7 +58,7 @@ SYM_FUNC_START(startup_32)
> >       leal    (BP_scratch+4)(%esi), %esp
> >       call    1f
> >  1:   popl    %edx
> > -     addl    $_GLOBAL_OFFSET_TABLE_+(.-1b), %edx
> > +     leal    (_GLOBAL_OFFSET_TABLE_ - 1b)(%edx), %edx
>
> Yeah, that's a bit better.
>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 15/20] x86: head_64: Switch to kernel CS before enabling memory encryption
  2023-05-08  7:03 ` [PATCH v2 15/20] x86: head_64: Switch to kernel CS before enabling memory encryption Ard Biesheuvel
@ 2023-05-17 18:54   ` Tom Lendacky
  0 siblings, 0 replies; 42+ messages in thread
From: Tom Lendacky @ 2023-05-17 18:54 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-efi
  Cc: linux-kernel, Evgeniy Baskov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Alexey Khoroshilov, Peter Jones, Gerd Hoffmann, Dave Young,
	Mario Limonciello, Kees Cook, Kirill A . Shutemov,
	Linus Torvalds

On 5/8/23 02:03, Ard Biesheuvel wrote:
> The SME initialization triggers #VC exceptions due to the use of CPUID
> instructions, and returning from an exception restores the code segment
> that was active when the exception was taken.
> 
> This means we should ensure that we switch the code segment to one that
> is described in the GDT we just loaded before running the SME init code.
> 
> Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Ah, just saw this as I was going through my email backlog...  I submitted 
a separate patch just a little earlier today for this issue. I guess we'll 
let the maintainers decide how they want to handle it.

Thanks,
Tom

> ---
>   arch/x86/kernel/head_64.S | 18 +++++++++---------
>   1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 95b12fdae10e1dc9..a128ac62956ff7c4 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -76,6 +76,15 @@ SYM_CODE_START_NOALIGN(startup_64)
>   
>   	call	startup_64_setup_env
>   
> +	/* Now switch to __KERNEL_CS so IRET works reliably */
> +	pushq	$__KERNEL_CS
> +	leaq	.Lon_kernel_cs(%rip), %rax
> +	pushq	%rax
> +	lretq
> +
> +.Lon_kernel_cs:
> +	UNWIND_HINT_END_OF_STACK
> +
>   #ifdef CONFIG_AMD_MEM_ENCRYPT
>   	/*
>   	 * Activate SEV/SME memory encryption if supported/enabled. This needs to
> @@ -87,15 +96,6 @@ SYM_CODE_START_NOALIGN(startup_64)
>   	call	sme_enable
>   #endif
>   
> -	/* Now switch to __KERNEL_CS so IRET works reliably */
> -	pushq	$__KERNEL_CS
> -	leaq	.Lon_kernel_cs(%rip), %rax
> -	pushq	%rax
> -	lretq
> -
> -.Lon_kernel_cs:
> -	UNWIND_HINT_END_OF_STACK
> -
>   	/* Sanitize CPU configuration */
>   	call verify_cpu
>   

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline
  2023-05-08  7:03 ` [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline Ard Biesheuvel
  2023-05-15 14:03   ` Kirill A. Shutemov
@ 2023-05-17 22:40   ` Tom Lendacky
  2023-05-18 14:55     ` Ard Biesheuvel
  1 sibling, 1 reply; 42+ messages in thread
From: Tom Lendacky @ 2023-05-17 22:40 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-efi
  Cc: linux-kernel, Evgeniy Baskov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Alexey Khoroshilov, Peter Jones, Gerd Hoffmann, Dave Young,
	Mario Limonciello, Kees Cook, Kirill A . Shutemov,
	Linus Torvalds

On 5/8/23 02:03, Ard Biesheuvel wrote:
> The 32-bit trampoline no longer uses the stack for anything except
> performing a long return back to long mode. Currently, this stack is
> allocated in the same page that carries the trampoline code, which means
> this page must be mapped writable and executable, and the stack is
> therefore executable as well.
> 
> So let's do a long jump instead: that way, we can pre-calculate the
> return address and poke it into the code before we call it. In a later
> patch, we will take advantage of this by removing writable permissions
> (and adding executable ones) explicitly when booting via the EFI stub.
> 
> Not playing with the stack pointer also makes it more straight-forward
> to call the trampoline code as an ordinary 64-bit function from C code.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>   arch/x86/boot/compressed/head_64.S    | 34 ++++----------------
>   arch/x86/boot/compressed/pgtable.h    |  6 ++--
>   arch/x86/boot/compressed/pgtable_64.c | 12 ++++++-
>   3 files changed, 21 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index b1f8a867777120bb..3b5fc851737ffc39 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -460,9 +460,6 @@ SYM_CODE_START(startup_64)
>   	leaq	TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
>   	call	*%rax
>   
> -	/* Restore the stack, the 32-bit trampoline uses its own stack */
> -	leaq	rva(boot_stack_end)(%rbx), %rsp
> -
>   	/*
>   	 * cleanup_trampoline() would restore trampoline memory.
>   	 *
> @@ -563,24 +560,17 @@ SYM_FUNC_END(.Lrelocated)
>    * EDI contains the base address of the trampoline memory.
>    * Non-zero ESI means trampoline needs to enable 5-level paging.
>    */
> +	.section ".rodata", "a", @progbits
>   SYM_CODE_START(trampoline_32bit_src)
> -	popq	%r8
>   	/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
>   	pushq	$__KERNEL32_CS
>   	leaq	0f(%rip), %rax
>   	pushq	%rax
>   	lretq
> +.Lret:	retq

Maybe just add a comment above this to explain that it is the target of
the long jump below, used to get back into long mode and return without
setting up a new stack for the 32-bit code.

And then a corresponding comment on the long jump itself. I think it would 
make it easier to understand what is going on in this part of the code.

Thanks,
Tom

>   
>   	.code32
> -0:	/* Set up data and stack segments */
> -	movl	$__KERNEL_DS, %eax
> -	movl	%eax, %ds
> -	movl	%eax, %ss
> -
> -	/* Set up new stack */
> -	leal	TRAMPOLINE_32BIT_STACK_END(%edi), %esp
> -
> -	/* Disable paging */
> +0:	/* Disable paging */
>   	movl	%cr0, %eax
>   	btrl	$X86_CR0_PG_BIT, %eax
>   	movl	%eax, %cr0
> @@ -634,26 +624,16 @@ SYM_CODE_START(trampoline_32bit_src)
>   1:
>   	movl	%eax, %cr4
>   
> -	/* Calculate address of paging_enabled() once we are executing in the trampoline */
> -	leal	.Lpaging_enabled - trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_OFFSET(%edi), %eax
> -
> -	/* Prepare the stack for far return to Long Mode */
> -	pushl	$__KERNEL_CS
> -	pushl	%eax
> -
>   	/* Enable paging again. */
>   	movl	%cr0, %eax
>   	btsl	$X86_CR0_PG_BIT, %eax
>   	movl	%eax, %cr0
>   
> -	lret
> +.Ljmp:	ljmpl	$__KERNEL_CS, $(.Lret - trampoline_32bit_src)
>   SYM_CODE_END(trampoline_32bit_src)
>   
> -	.code64
> -SYM_FUNC_START_LOCAL_NOALIGN(.Lpaging_enabled)
> -	/* Return from the trampoline */
> -	jmp	*%r8
> -SYM_FUNC_END(.Lpaging_enabled)
> +/* keep this right after trampoline_32bit_src() so we can infer its size */
> +SYM_DATA(trampoline_ljmp_imm_offset, .word  .Ljmp + 1 - trampoline_32bit_src)
>   
>   	/*
>            * The trampoline code has a size limit.
> @@ -662,7 +642,7 @@ SYM_FUNC_END(.Lpaging_enabled)
>   	 */
>   	.org	trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_SIZE
>   
> -	.code32
> +	.text
>   SYM_FUNC_START_LOCAL_NOALIGN(.Lno_longmode)
>   	/* This isn't an x86-64 CPU, so hang intentionally, we cannot continue */
>   1:
> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
> index 4e8cef135226bcbb..131488f50af55d0a 100644
> --- a/arch/x86/boot/compressed/pgtable.h
> +++ b/arch/x86/boot/compressed/pgtable.h
> @@ -6,9 +6,7 @@
>   #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
>   
>   #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
> -#define TRAMPOLINE_32BIT_CODE_SIZE	0xA0
> -
> -#define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
> +#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
>   
>   #ifndef __ASSEMBLER__
>   
> @@ -16,5 +14,7 @@ extern unsigned long *trampoline_32bit;
>   
>   extern void trampoline_32bit_src(void *trampoline, bool enable_5lvl);
>   
> +extern const u16 trampoline_ljmp_imm_offset;
> +
>   #endif /* __ASSEMBLER__ */
>   #endif /* BOOT_COMPRESSED_PAGETABLE_H */
> diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
> index 2ac12ff4111bf8c0..09fc18180929fab3 100644
> --- a/arch/x86/boot/compressed/pgtable_64.c
> +++ b/arch/x86/boot/compressed/pgtable_64.c
> @@ -109,6 +109,7 @@ static unsigned long find_trampoline_placement(void)
>   struct paging_config paging_prepare(void *rmode)
>   {
>   	struct paging_config paging_config = {};
> +	void *tramp_code;
>   
>   	/* Initialize boot_params. Required for cmdline_find_option_bool(). */
>   	boot_params = rmode;
> @@ -143,9 +144,18 @@ struct paging_config paging_prepare(void *rmode)
>   	memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
>   
>   	/* Copy trampoline code in place */
> -	memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
> +	tramp_code = memcpy(trampoline_32bit +
> +			TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
>   			&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
>   
> +	/*
> +	 * Avoid the need for a stack in the 32-bit trampoline code, by using
> +	 * LJMP rather than LRET to return back to long mode. LJMP takes an
> +	 * immediate absolute address, so we have to adjust that based on the
> +	 * placement of the trampoline.
> +	 */
> +	*(u32 *)(tramp_code + trampoline_ljmp_imm_offset) += (unsigned long)tramp_code;
> +
>   	/*
>   	 * The code below prepares page table in trampoline memory.
>   	 *

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline
  2023-05-17 22:40   ` Tom Lendacky
@ 2023-05-18 14:55     ` Ard Biesheuvel
  0 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-18 14:55 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Kirill A . Shutemov,
	Linus Torvalds

On Thu, 18 May 2023 at 00:40, Tom Lendacky <thomas.lendacky@amd.com> wrote:
>
> On 5/8/23 02:03, Ard Biesheuvel wrote:
> > The 32-bit trampoline no longer uses the stack for anything except
> > performing a long return back to long mode. Currently, this stack is
> > allocated in the same page that carries the trampoline code, which means
> > this page must be mapped writable and executable, and the stack is
> > therefore executable as well.
> >
> > So let's do a long jump instead: that way, we can pre-calculate the
> > return address and poke it into the code before we call it. In a later
> > patch, we will take advantage of this by removing writable permissions
> > (and adding executable ones) explicitly when booting via the EFI stub.
> >
> > Not playing with the stack pointer also makes it more straight-forward
> > to call the trampoline code as an ordinary 64-bit function from C code.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >   arch/x86/boot/compressed/head_64.S    | 34 ++++----------------
> >   arch/x86/boot/compressed/pgtable.h    |  6 ++--
> >   arch/x86/boot/compressed/pgtable_64.c | 12 ++++++-
> >   3 files changed, 21 insertions(+), 31 deletions(-)
> >
> > diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> > index b1f8a867777120bb..3b5fc851737ffc39 100644
> > --- a/arch/x86/boot/compressed/head_64.S
> > +++ b/arch/x86/boot/compressed/head_64.S
> > @@ -460,9 +460,6 @@ SYM_CODE_START(startup_64)
> >       leaq    TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
> >       call    *%rax
> >
> > -     /* Restore the stack, the 32-bit trampoline uses its own stack */
> > -     leaq    rva(boot_stack_end)(%rbx), %rsp
> > -
> >       /*
> >        * cleanup_trampoline() would restore trampoline memory.
> >        *
> > @@ -563,24 +560,17 @@ SYM_FUNC_END(.Lrelocated)
> >    * EDI contains the base address of the trampoline memory.
> >    * Non-zero ESI means trampoline needs to enable 5-level paging.
> >    */
> > +     .section ".rodata", "a", @progbits
> >   SYM_CODE_START(trampoline_32bit_src)
> > -     popq    %r8
> >       /* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
> >       pushq   $__KERNEL32_CS
> >       leaq    0f(%rip), %rax
> >       pushq   %rax
> >       lretq
> > +.Lret:       retq
>
> Maybe just add a comment above this to explain that this is a target of
> the long jump below to get back into long mode and be able to return
> without setting up a new stack for the 32-bit code.
>
> And then a corresponding comment on the long jump itself. I think it would
> make it easier to understand what is going on in this part of the code.
>

Fair point. I'll add that in the next version.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware
  2023-05-08  7:03 ` [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware Ard Biesheuvel
@ 2023-05-18 20:16   ` Tom Lendacky
  2023-05-18 22:27     ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Tom Lendacky @ 2023-05-18 20:16 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-efi
  Cc: linux-kernel, Evgeniy Baskov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Alexey Khoroshilov, Peter Jones, Gerd Hoffmann, Dave Young,
	Mario Limonciello, Kees Cook, Kirill A . Shutemov,
	Linus Torvalds

On 5/8/23 02:03, Ard Biesheuvel wrote:
> The decompressor executes in an environment with little or no access to
> a console, and without any ability to return an error back to the caller
> (the bootloader). So the only recourse we have when the SEV/SNP context
> is not quite what the kernel expects is to terminate the guest entirely.
> 
> This is a bit harsh, and also unnecessary when booting via the EFI stub,
> given that it provides all the support that SEV guests need to probe the
> underlying platform.
> 
> So let's do the SEV initialization and SNP feature check before calling
> ExitBootServices(), and simply return with an error if the SNP feature
> mask is not as expected.

My SEV-ES / SEV-SNP guests started crashing when I applied this patch.
It turns out that sev_es_negotiate_protocol() used to be called at a
point where no #VC exceptions were being generated, before a valid GHCB
was set up. Because of that, the current GHCB MSR value was never
saved/restored. But now, sev_es_negotiate_protocol() is called earlier
in the boot process and there are still messages being issued by UEFI, e.g.:

SetUefiImageMemoryAttributes - 0x000000007F6D7000 - 0x0000000000006000 (0x0000000000000008)

Similarly, get_hv_features() didn't worry about saving the GHCB MSR
value, and an SNP guest still crashed after sev_es_negotiate_protocol()
had been fixed.

The following changes got me past everything:

diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index 3a5b0c9c4fcc..23450628d41c 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -106,15 +106,19 @@ static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
   */
  static u64 get_hv_features(void)
  {
-	u64 val;
+	u64 val, save;
  
  	if (ghcb_version < 2)
  		return 0;
  
+	save = sev_es_rd_ghcb_msr();
+
  	sev_es_wr_ghcb_msr(GHCB_MSR_HV_FT_REQ);
  	VMGEXIT();
-
  	val = sev_es_rd_ghcb_msr();
+
+	sev_es_wr_ghcb_msr(save);
+
  	if (GHCB_RESP_CODE(val) != GHCB_MSR_HV_FT_RESP)
  		return 0;
  
@@ -139,13 +143,17 @@ static void snp_register_ghcb_early(unsigned long paddr)
  
  static bool sev_es_negotiate_protocol(void)
  {
-	u64 val;
+	u64 val, save;
+
+	save = sev_es_rd_ghcb_msr();
  
  	/* Do the GHCB protocol version negotiation */
  	sev_es_wr_ghcb_msr(GHCB_MSR_SEV_INFO_REQ);
  	VMGEXIT();
  	val = sev_es_rd_ghcb_msr();
  
+	sev_es_wr_ghcb_msr(save);
+
  	if (GHCB_MSR_INFO(val) != GHCB_MSR_SEV_INFO_RESP)
  		return false;
  

Thanks,
Tom

> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>   arch/x86/boot/compressed/sev.c          | 12 ++++++++----
>   arch/x86/include/asm/sev.h              |  4 ++++
>   drivers/firmware/efi/libstub/x86-stub.c | 17 +++++++++++++++++
>   3 files changed, 29 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
> index 014b89c890887b9a..19c40873fdd209b5 100644
> --- a/arch/x86/boot/compressed/sev.c
> +++ b/arch/x86/boot/compressed/sev.c
> @@ -315,20 +315,24 @@ static void enforce_vmpl0(void)
>    */
>   #define SNP_FEATURES_PRESENT (0)
>   
> +u64 snp_get_unsupported_features(void)
> +{
> +	if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
> +		return 0;
> +	return sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
> +}
> +
>   void snp_check_features(void)
>   {
>   	u64 unsupported;
>   
> -	if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
> -		return;
> -
>   	/*
>   	 * Terminate the boot if hypervisor has enabled any feature lacking
>   	 * guest side implementation. Pass on the unsupported features mask through
>   	 * EXIT_INFO_2 of the GHCB protocol so that those features can be reported
>   	 * as part of the guest boot failure.
>   	 */
> -	unsupported = sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
> +	unsupported = snp_get_unsupported_features();
>   	if (unsupported) {
>   		if (ghcb_version < 2 || (!boot_ghcb && !early_setup_ghcb()))
>   			sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index 13dc2a9d23c1eb25..bf27b91644d0da5a 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -157,6 +157,7 @@ static __always_inline void sev_es_nmi_complete(void)
>   		__sev_es_nmi_complete();
>   }
>   extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
> +extern void sev_enable(struct boot_params *bp);
>   
>   static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs)
>   {
> @@ -202,12 +203,14 @@ void snp_set_wakeup_secondary_cpu(void);
>   bool snp_init(struct boot_params *bp);
>   void __init __noreturn snp_abort(void);
>   int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
> +u64 snp_get_unsupported_features(void);
>   #else
>   static inline void sev_es_ist_enter(struct pt_regs *regs) { }
>   static inline void sev_es_ist_exit(void) { }
>   static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
>   static inline void sev_es_nmi_complete(void) { }
>   static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
> +static inline void sev_enable(struct boot_params *bp) { }
>   static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate) { return 0; }
>   static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs) { return 0; }
>   static inline void setup_ghcb(void) { }
> @@ -225,6 +228,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
>   {
>   	return -ENOTTY;
>   }
> +static inline u64 snp_get_unsupported_features(void) { return 0; }
>   #endif
>   
>   #endif
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index ce8434fce0c37982..33d11ba78f1d8c4f 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -15,6 +15,7 @@
>   #include <asm/setup.h>
>   #include <asm/desc.h>
>   #include <asm/boot.h>
> +#include <asm/sev.h>
>   
>   #include "efistub.h"
>   
> @@ -714,6 +715,22 @@ static efi_status_t exit_boot_func(struct efi_boot_memmap *map,
>   			  &p->efi->efi_memmap, &p->efi->efi_memmap_hi);
>   	p->efi->efi_memmap_size		= map->map_size;
>   
> +	/*
> +	 * Call the SEV init code while still running with the firmware's
> +	 * GDT/IDT, so #VC exceptions will be handled by EFI.
> +	 */
> +	if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
> +		u64 unsupported;
> +
> +		sev_enable(p->boot_params);
> +		unsupported = snp_get_unsupported_features();
> +		if (unsupported) {
> +			efi_err("Unsupported SEV-SNP features detected: 0x%llx\n",
> +				unsupported);
> +			return EFI_UNSUPPORTED;
> +		}
> +	}
> +
>   	return EFI_SUCCESS;
>   }
>   

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 18/20] x86: efistub: Avoid legacy decompressor when doing EFI boot
  2023-05-08  7:03 ` [PATCH v2 18/20] x86: efistub: Avoid legacy decompressor when doing EFI boot Ard Biesheuvel
@ 2023-05-18 20:48   ` Tom Lendacky
  2023-05-18 22:33     ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Tom Lendacky @ 2023-05-18 20:48 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-efi
  Cc: linux-kernel, Evgeniy Baskov, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Alexey Khoroshilov, Peter Jones, Gerd Hoffmann, Dave Young,
	Mario Limonciello, Kees Cook, Kirill A . Shutemov,
	Linus Torvalds

On 5/8/23 02:03, Ard Biesheuvel wrote:
> The bare metal decompressor code was never really intended to run in a
> hosted environment such as the EFI boot services, and does a few things
> that are problematic in the context of EFI boot now that the logo
> requirements are getting tighter.
> 
> In particular, the decompressor moves its own executable image around in
> memory, and relies on demand paging to populate the identity mappings,
> and these things are difficult to support in a context where memory is
> not permitted to be mapped writable and executable at the same time or,
> at the very least, is mapped non-executable by default, and needs
> special treatment for this restriction to be lifted.
> 
> Since EFI already maps all of memory 1:1, we don't need to create new
> page tables or handle page faults when decompressing the kernel. That
> means there is also no need to replace the special exception handlers
> for SEV. Generally, there is little need to do anything that the
> decompressor does beyond
> 
> - initialize SEV encryption, if needed,
> - perform the 4/5 level paging switch, if needed,
> - decompress the kernel
> - relocate the kernel
> 
> So let's do all of this from the EFI stub code, and avoid the bare metal
> decompressor altogether.

This patch crashes SEV guests, probably because the BSS is accessed 
encrypted and reads back as ciphertext where a non-SEV guest would see 
zeroes. After applying patch #19 everything started working again. From a 
bisectability perspective, you probably want patch #19 before this one.
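
For anyone unfamiliar with the failure mode, a rough sketch (the linker
symbols are the ones the compressed kernel uses; memset is assumed to be
the decompressor's own): under SEV, a .bss page only reads back as
zeroes once it has been written through the encrypted mapping, so
something like

	extern char _bss[], _ebss[];

	static void clear_bss(void)
	{
		/* stores go through the encrypted mapping, so later reads
		 * of "zero initialized" C variables really do see zero */
		memset(_bss, 0, _ebss - _bss);
	}

has to run before C code consults any static variable, which is what
patch #19 takes care of.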

Thanks,
Tom

> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>   arch/x86/boot/compressed/Makefile       |   5 +
>   arch/x86/boot/compressed/efi_mixed.S    |  58 +------
>   arch/x86/boot/compressed/head_32.S      |  22 +--
>   arch/x86/boot/compressed/head_64.S      |  34 ----
>   arch/x86/include/asm/efi.h              |   7 +-
>   drivers/firmware/efi/libstub/x86-stub.c | 176 +++++++++-----------
>   6 files changed, 95 insertions(+), 207 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index 6b6cfe607bdbdfaa..f3289f17f820f982 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -74,6 +74,11 @@ LDFLAGS_vmlinux += -z noexecstack
>   ifeq ($(CONFIG_LD_IS_BFD),y)
>   LDFLAGS_vmlinux += $(call ld-option,--no-warn-rwx-segments)
>   endif
> +ifeq ($(CONFIG_EFI_STUB),y)
> +# ensure that we'll pull in the static EFI stub library even if we never
> +# reference it from the startup code
> +LDFLAGS_vmlinux += -u efi_pe_entry
> +endif
>   LDFLAGS_vmlinux += -T
>   
>   hostprogs	:= mkpiggy
> diff --git a/arch/x86/boot/compressed/efi_mixed.S b/arch/x86/boot/compressed/efi_mixed.S
> index 4ca70bf93dc0bdcd..dec579eb1caa16d5 100644
> --- a/arch/x86/boot/compressed/efi_mixed.S
> +++ b/arch/x86/boot/compressed/efi_mixed.S
> @@ -49,9 +49,12 @@ SYM_FUNC_START(startup_64_mixed_mode)
>   	lea	efi32_boot_args(%rip), %rdx
>   	mov	0(%rdx), %edi
>   	mov	4(%rdx), %esi
> +#ifdef CONFIG_EFI_HANDOVER_PROTOCOL
>   	mov	8(%rdx), %edx		// saved bootparams pointer
>   	test	%edx, %edx
>   	jnz	efi64_stub_entry
> +#endif
> +
>   	/*
>   	 * efi_pe_entry uses MS calling convention, which requires 32 bytes of
>   	 * shadow space on the stack even if all arguments are passed in
> @@ -245,10 +248,6 @@ SYM_FUNC_START(efi32_entry)
>   	jmp	startup_32
>   SYM_FUNC_END(efi32_entry)
>   
> -#define ST32_boottime		60 // offsetof(efi_system_table_32_t, boottime)
> -#define BS32_handle_protocol	88 // offsetof(efi_boot_services_32_t, handle_protocol)
> -#define LI32_image_base		32 // offsetof(efi_loaded_image_32_t, image_base)
> -
>   /*
>    * efi_status_t efi32_pe_entry(efi_handle_t image_handle,
>    *			       efi_system_table_32_t *sys_table)
> @@ -256,8 +255,6 @@ SYM_FUNC_END(efi32_entry)
>   SYM_FUNC_START(efi32_pe_entry)
>   	pushl	%ebp
>   	movl	%esp, %ebp
> -	pushl	%eax				// dummy push to allocate loaded_image
> -
>   	pushl	%ebx				// save callee-save registers
>   	pushl	%edi
>   
> @@ -266,48 +263,8 @@ SYM_FUNC_START(efi32_pe_entry)
>   	movl	$0x80000003, %eax		// EFI_UNSUPPORTED
>   	jnz	2f
>   
> -	call	1f
> -1:	pop	%ebx
> -
> -	/* Get the loaded image protocol pointer from the image handle */
> -	leal	-4(%ebp), %eax
> -	pushl	%eax				// &loaded_image
> -	leal	(loaded_image_proto - 1b)(%ebx), %eax
> -	pushl	%eax				// pass the GUID address
> -	pushl	8(%ebp)				// pass the image handle
> -
> -	/*
> -	 * Note the alignment of the stack frame.
> -	 *   sys_table
> -	 *   handle             <-- 16-byte aligned on entry by ABI
> -	 *   return address
> -	 *   frame pointer
> -	 *   loaded_image       <-- local variable
> -	 *   saved %ebx		<-- 16-byte aligned here
> -	 *   saved %edi
> -	 *   &loaded_image
> -	 *   &loaded_image_proto
> -	 *   handle             <-- 16-byte aligned for call to handle_protocol
> -	 */
> -
> -	movl	12(%ebp), %eax			// sys_table
> -	movl	ST32_boottime(%eax), %eax	// sys_table->boottime
> -	call	*BS32_handle_protocol(%eax)	// sys_table->boottime->handle_protocol
> -	addl	$12, %esp			// restore argument space
> -	testl	%eax, %eax
> -	jnz	2f
> -
>   	movl	8(%ebp), %ecx			// image_handle
>   	movl	12(%ebp), %edx			// sys_table
> -	movl	-4(%ebp), %esi			// loaded_image
> -	movl	LI32_image_base(%esi), %esi	// loaded_image->image_base
> -	leal	(startup_32 - 1b)(%ebx), %ebp	// runtime address of startup_32
> -	/*
> -	 * We need to set the image_offset variable here since startup_32() will
> -	 * use it before we get to the 64-bit efi_pe_entry() in C code.
> -	 */
> -	subl	%esi, %ebp			// calculate image_offset
> -	movl	%ebp, (image_offset - 1b)(%ebx)	// save image_offset
>   	xorl	%esi, %esi
>   	jmp	efi32_entry			// pass %ecx, %edx, %esi
>   						// no other registers remain live
> @@ -318,15 +275,6 @@ SYM_FUNC_START(efi32_pe_entry)
>   	RET
>   SYM_FUNC_END(efi32_pe_entry)
>   
> -	.section ".rodata"
> -	/* EFI loaded image protocol GUID */
> -	.balign 4
> -SYM_DATA_START_LOCAL(loaded_image_proto)
> -	.long	0x5b1b31a1
> -	.word	0x9562, 0x11d2
> -	.byte	0x8e, 0x3f, 0x00, 0xa0, 0xc9, 0x69, 0x72, 0x3b
> -SYM_DATA_END(loaded_image_proto)
> -
>   	.data
>   	.balign	8
>   SYM_DATA_START_LOCAL(efi32_boot_gdt)
> diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
> index 8ee8749007595fcc..3f9b80726070a8e7 100644
> --- a/arch/x86/boot/compressed/head_32.S
> +++ b/arch/x86/boot/compressed/head_32.S
> @@ -84,19 +84,6 @@ SYM_FUNC_START(startup_32)
>   
>   #ifdef CONFIG_RELOCATABLE
>   	leal	startup_32@GOTOFF(%edx), %ebx
> -
> -#ifdef CONFIG_EFI_STUB
> -/*
> - * If we were loaded via the EFI LoadImage service, startup_32() will be at an
> - * offset to the start of the space allocated for the image. efi_pe_entry() will
> - * set up image_offset to tell us where the image actually starts, so that we
> - * can use the full available buffer.
> - *	image_offset = startup_32 - image_base
> - * Otherwise image_offset will be zero and has no effect on the calculations.
> - */
> -	subl    image_offset@GOTOFF(%edx), %ebx
> -#endif
> -
>   	movl	BP_kernel_alignment(%esi), %eax
>   	decl	%eax
>   	addl    %eax, %ebx
> @@ -150,15 +137,10 @@ SYM_FUNC_START(startup_32)
>   	jmp	*%eax
>   SYM_FUNC_END(startup_32)
>   
> -#ifdef CONFIG_EFI_STUB
> +#ifdef CONFIG_EFI_HANDOVER_PROTOCOL
>   SYM_FUNC_START(efi32_stub_entry)
> -	add	$0x4, %esp
> -	movl	8(%esp), %esi	/* save boot_params pointer */
> -	call	efi_main
> -	/* efi_main returns the possibly relocated address of startup_32 */
> -	jmp	*%eax
> +	jmp	efi_main
>   SYM_FUNC_END(efi32_stub_entry)
> -SYM_FUNC_ALIAS(efi_stub_entry, efi32_stub_entry)
>   #endif
>   
>   	.text
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index bcf678aed81468e3..320e2825ff0b32da 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -146,19 +146,6 @@ SYM_FUNC_START(startup_32)
>   
>   #ifdef CONFIG_RELOCATABLE
>   	movl	%ebp, %ebx
> -
> -#ifdef CONFIG_EFI_STUB
> -/*
> - * If we were loaded via the EFI LoadImage service, startup_32 will be at an
> - * offset to the start of the space allocated for the image. efi_pe_entry will
> - * set up image_offset to tell us where the image actually starts, so that we
> - * can use the full available buffer.
> - *	image_offset = startup_32 - image_base
> - * Otherwise image_offset will be zero and has no effect on the calculations.
> - */
> -	subl    rva(image_offset)(%ebp), %ebx
> -#endif
> -
>   	movl	BP_kernel_alignment(%esi), %eax
>   	decl	%eax
>   	addl	%eax, %ebx
> @@ -346,20 +333,6 @@ SYM_CODE_START(startup_64)
>   	/* Start with the delta to where the kernel will run at. */
>   #ifdef CONFIG_RELOCATABLE
>   	leaq	startup_32(%rip) /* - $startup_32 */, %rbp
> -
> -#ifdef CONFIG_EFI_STUB
> -/*
> - * If we were loaded via the EFI LoadImage service, startup_32 will be at an
> - * offset to the start of the space allocated for the image. efi_pe_entry will
> - * set up image_offset to tell us where the image actually starts, so that we
> - * can use the full available buffer.
> - *	image_offset = startup_32 - image_base
> - * Otherwise image_offset will be zero and has no effect on the calculations.
> - */
> -	movl    image_offset(%rip), %eax
> -	subq	%rax, %rbp
> -#endif
> -
>   	movl	BP_kernel_alignment(%rsi), %eax
>   	decl	%eax
>   	addq	%rax, %rbp
> @@ -481,19 +454,12 @@ SYM_CODE_START(startup_64)
>   	jmp	*%rax
>   SYM_CODE_END(startup_64)
>   
> -#ifdef CONFIG_EFI_STUB
>   #ifdef CONFIG_EFI_HANDOVER_PROTOCOL
>   	.org 0x390
> -#endif
>   SYM_FUNC_START(efi64_stub_entry)
>   	and	$~0xf, %rsp			/* realign the stack */
> -	movq	%rdx, %rbx			/* save boot_params pointer */
>   	call	efi_main
> -	movq	%rbx,%rsi
> -	leaq	rva(startup_64)(%rax), %rax
> -	jmp	*%rax
>   SYM_FUNC_END(efi64_stub_entry)
> -SYM_FUNC_ALIAS(efi_stub_entry, efi64_stub_entry)
>   #endif
>   
>   	.text
> diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> index 419280d263d2e3f2..24fa828eeef7d9dd 100644
> --- a/arch/x86/include/asm/efi.h
> +++ b/arch/x86/include/asm/efi.h
> @@ -88,6 +88,8 @@ static inline void efi_fpu_end(void)
>   }
>   
>   #ifdef CONFIG_X86_32
> +#define EFI_X86_KERNEL_ALLOC_LIMIT		(SZ_512M - 1)
> +
>   #define arch_efi_call_virt_setup()					\
>   ({									\
>   	efi_fpu_begin();						\
> @@ -101,8 +103,7 @@ static inline void efi_fpu_end(void)
>   })
>   
>   #else /* !CONFIG_X86_32 */
> -
> -#define EFI_LOADER_SIGNATURE	"EL64"
> +#define EFI_X86_KERNEL_ALLOC_LIMIT		EFI_ALLOC_LIMIT
>   
>   extern asmlinkage u64 __efi_call(void *fp, ...);
>   
> @@ -216,6 +217,8 @@ efi_status_t efi_set_virtual_address_map(unsigned long memory_map_size,
>   
>   #ifdef CONFIG_EFI_MIXED
>   
> +#define EFI_ALLOC_LIMIT		(efi_is_64bit() ? ULONG_MAX : U32_MAX)
> +
>   #define ARCH_HAS_EFISTUB_WRAPPERS
>   
>   static inline bool efi_is_64bit(void)
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index 33d11ba78f1d8c4f..59076e16c1ac11ee 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -15,16 +15,13 @@
>   #include <asm/setup.h>
>   #include <asm/desc.h>
>   #include <asm/boot.h>
> +#include <asm/kaslr.h>
>   #include <asm/sev.h>
>   
>   #include "efistub.h"
>   
> -/* Maximum physical address for 64-bit kernel with 4-level paging */
> -#define MAXMEM_X86_64_4LEVEL (1ull << 46)
> -
>   const efi_system_table_t *efi_system_table;
>   const efi_dxe_services_table_t *efi_dxe_table;
> -u32 image_offset __section(".data");
>   static efi_loaded_image_t *image = NULL;
>   static efi_memory_attribute_protocol_t *memattr;
>   
> @@ -275,33 +272,9 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
>   	}
>   }
>   
> -void startup_32(struct boot_params *boot_params);
> -
> -static void
> -setup_memory_protection(unsigned long image_base, unsigned long image_size)
> -{
> -#ifdef CONFIG_64BIT
> -	if (image_base != (unsigned long)startup_32)
> -		adjust_memory_range_protection(image_base, image_size);
> -#else
> -	/*
> -	 * Clear protection flags on a whole range of possible
> -	 * addresses used for KASLR. We don't need to do that
> -	 * on x86_64, since KASLR/extraction is performed after
> -	 * dedicated identity page tables are built and we only
> -	 * need to remove possible protection on relocated image
> -	 * itself disregarding further relocations.
> -	 */
> -	adjust_memory_range_protection(LOAD_PHYSICAL_ADDR,
> -				       KERNEL_IMAGE_SIZE - LOAD_PHYSICAL_ADDR);
> -#endif
> -}
> -
>   static const efi_char16_t apple[] = L"Apple";
>   
> -static void setup_quirks(struct boot_params *boot_params,
> -			 unsigned long image_base,
> -			 unsigned long image_size)
> +static void setup_quirks(struct boot_params *boot_params)
>   {
>   	efi_char16_t *fw_vendor = (efi_char16_t *)(unsigned long)
>   		efi_table_attr(efi_system_table, fw_vendor);
> @@ -310,9 +283,6 @@ static void setup_quirks(struct boot_params *boot_params,
>   		if (IS_ENABLED(CONFIG_APPLE_PROPERTIES))
>   			retrieve_apple_device_properties(boot_params);
>   	}
> -
> -	if (IS_ENABLED(CONFIG_EFI_DXE_MEM_ATTRIBUTES))
> -		setup_memory_protection(image_base, image_size);
>   }
>   
>   /*
> @@ -432,6 +402,7 @@ static void __noreturn efi_exit(efi_handle_t handle, efi_status_t status)
>   		asm("hlt");
>   }
>   
> +static __alias(efi_main)
>   void __noreturn efi_stub_entry(efi_handle_t handle,
>   			       efi_system_table_t *sys_table_arg,
>   			       struct boot_params *boot_params);
> @@ -465,7 +436,6 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
>   	}
>   
>   	image_base = efi_table_attr(image, image_base);
> -	image_offset = (void *)startup_32 - image_base;
>   
>   	status = efi_allocate_pages(sizeof(struct boot_params),
>   				    (unsigned long *)&boot_params, ULONG_MAX);
> @@ -853,20 +823,82 @@ static void efi_5level_switch(void)
>   #endif
>   }
>   
> +static void efi_get_seed(void *seed, int size)
> +{
> +	efi_get_random_bytes(size, seed);
> +
> +	/*
> +	 * This only updates seed[0] when running on 32-bit, but in that case,
> +	 * we don't use seed[1] anyway, as there is no virtual KASLR on 32-bit.
> +	 */
> +	*(unsigned long *)seed ^= kaslr_get_random_long("EFI");
> +}
> +
> +static void error(char *str)
> +{
> +	efi_warn("Decompression failed: %s\n", str);
> +}
> +
> +static efi_status_t efi_decompress_kernel(unsigned long *kernel_entry)
> +{
> +	unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
> +	unsigned long addr, alloc_size, entry;
> +	efi_status_t status;
> +	u32 seed[2] = {};
> +
> +	/* determine the required size of the allocation */
> +	alloc_size = ALIGN(max((unsigned long)output_len, kernel_total_size),
> +			   MIN_KERNEL_ALIGN);
> +
> +	if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && !efi_nokaslr) {
> +		u64 range = KERNEL_IMAGE_SIZE - LOAD_PHYSICAL_ADDR - kernel_total_size;
> +
> +		efi_get_seed(seed, sizeof(seed));
> +
> +		virt_addr += (range * seed[1]) >> 32;
> +		virt_addr &= ~(CONFIG_PHYSICAL_ALIGN - 1);
> +	}
> +
> +	status = efi_random_alloc(alloc_size, CONFIG_PHYSICAL_ALIGN, &addr,
> +				  seed[0], EFI_LOADER_CODE,
> +				  EFI_X86_KERNEL_ALLOC_LIMIT);
> +	if (status != EFI_SUCCESS)
> +		return status;
> +
> +	entry = decompress_kernel((void *)addr, virt_addr, error);
> +	if (entry == ULONG_MAX) {
> +		efi_free(alloc_size, addr);
> +		return EFI_LOAD_ERROR;
> +	}
> +
> +	*kernel_entry = addr + entry;
> +
> +	adjust_memory_range_protection(addr, kernel_total_size);
> +
> +	return EFI_SUCCESS;
> +}
> +
> +static void __noreturn enter_kernel(unsigned long kernel_addr,
> +				    struct boot_params *boot_params)
> +{
> +	/* enter decompressed kernel with boot_params pointer in RSI/ESI */
> +	asm("jmp *%0"::"r"(kernel_addr), "S"(boot_params));
> +
> +	unreachable();
> +}
> +
>   /*
> - * On success, we return the address of startup_32, which has potentially been
> - * relocated by efi_relocate_kernel.
> + * On success, we jump to the relocated kernel directly and never return.
>    * On failure, we exit to the firmware via efi_exit instead of returning.
>    */
> -asmlinkage unsigned long efi_main(efi_handle_t handle,
> -				  efi_system_table_t *sys_table_arg,
> -				  struct boot_params *boot_params)
> +asmlinkage void __noreturn efi_main(efi_handle_t handle,
> +				    efi_system_table_t *sys_table_arg,
> +				    struct boot_params *boot_params)
>   {
>   	efi_guid_t guid = EFI_MEMORY_ATTRIBUTE_PROTOCOL_GUID;
> -	unsigned long bzimage_addr = (unsigned long)startup_32;
> -	unsigned long buffer_start, buffer_end;
>   	struct setup_header *hdr = &boot_params->hdr;
>   	const struct linux_efi_initrd *initrd = NULL;
> +	unsigned long kernel_entry;
>   	efi_status_t status;
>   
>   	efi_system_table = sys_table_arg;
> @@ -892,60 +924,6 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
>   		goto fail;
>   	}
>   
> -	/*
> -	 * If the kernel isn't already loaded at a suitable address,
> -	 * relocate it.
> -	 *
> -	 * It must be loaded above LOAD_PHYSICAL_ADDR.
> -	 *
> -	 * The maximum address for 64-bit is 1 << 46 for 4-level paging. This
> -	 * is defined as the macro MAXMEM, but unfortunately that is not a
> -	 * compile-time constant if 5-level paging is configured, so we instead
> -	 * define our own macro for use here.
> -	 *
> -	 * For 32-bit, the maximum address is complicated to figure out, for
> -	 * now use KERNEL_IMAGE_SIZE, which will be 512MiB, the same as what
> -	 * KASLR uses.
> -	 *
> -	 * Also relocate it if image_offset is zero, i.e. the kernel wasn't
> -	 * loaded by LoadImage, but rather by a bootloader that called the
> -	 * handover entry. The reason we must always relocate in this case is
> -	 * to handle the case of systemd-boot booting a unified kernel image,
> -	 * which is a PE executable that contains the bzImage and an initrd as
> -	 * COFF sections. The initrd section is placed after the bzImage
> -	 * without ensuring that there are at least init_size bytes available
> -	 * for the bzImage, and thus the compressed kernel's startup code may
> -	 * overwrite the initrd unless it is moved out of the way.
> -	 */
> -
> -	buffer_start = ALIGN(bzimage_addr - image_offset,
> -			     hdr->kernel_alignment);
> -	buffer_end = buffer_start + hdr->init_size;
> -
> -	if ((buffer_start < LOAD_PHYSICAL_ADDR)				     ||
> -	    (IS_ENABLED(CONFIG_X86_32) && buffer_end > KERNEL_IMAGE_SIZE)    ||
> -	    (IS_ENABLED(CONFIG_X86_64) && buffer_end > MAXMEM_X86_64_4LEVEL) ||
> -	    (image_offset == 0)) {
> -		extern char _bss[];
> -
> -		status = efi_relocate_kernel(&bzimage_addr,
> -					     (unsigned long)_bss - bzimage_addr,
> -					     hdr->init_size,
> -					     hdr->pref_address,
> -					     hdr->kernel_alignment,
> -					     LOAD_PHYSICAL_ADDR);
> -		if (status != EFI_SUCCESS) {
> -			efi_err("efi_relocate_kernel() failed!\n");
> -			goto fail;
> -		}
> -		/*
> -		 * Now that we've copied the kernel elsewhere, we no longer
> -		 * have a set up block before startup_32(), so reset image_offset
> -		 * to zero in case it was set earlier.
> -		 */
> -		image_offset = 0;
> -	}
> -
>   #ifdef CONFIG_CMDLINE_BOOL
>   	status = efi_parse_options(CONFIG_CMDLINE);
>   	if (status != EFI_SUCCESS) {
> @@ -963,6 +941,12 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
>   		}
>   	}
>   
> +	status = efi_decompress_kernel(&kernel_entry);
> +	if (status != EFI_SUCCESS) {
> +		efi_err("Failed to decompress kernel\n");
> +		goto fail;
> +	}
> +
>   	/*
>   	 * At this point, an initrd may already have been loaded by the
>   	 * bootloader and passed via bootparams. We permit an initrd loaded
> @@ -1002,7 +986,7 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
>   
>   	setup_efi_pci(boot_params);
>   
> -	setup_quirks(boot_params, bzimage_addr, buffer_end - buffer_start);
> +	setup_quirks(boot_params);
>   
>   	status = exit_boot(boot_params, handle);
>   	if (status != EFI_SUCCESS) {
> @@ -1012,7 +996,7 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
>   
>   	efi_5level_switch();
>   
> -	return bzimage_addr;
> +	enter_kernel(kernel_entry, boot_params);
>   fail:
>   	efi_err("efi_main() failed!\n");
>   
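
As an aside, the virtual KASLR computation in efi_decompress_kernel()
above is a fixed-point multiply: with seed[1] uniform over [0, 2^32),
the value (range * seed[1]) >> 32 is uniform over [0, range), so the
virtual load address slides anywhere in the remaining KERNEL_IMAGE_SIZE
window. A tiny standalone illustration of the scaling (made-up numbers,
ignoring the image size term for brevity; not kernel code):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* e.g. a 512M window above a 16M load address */
		uint64_t range = (512ULL << 20) - (16ULL << 20);
		uint32_t seed  = 0x80000000u;	/* "random": halfway */

		/* uniform in [0, range); here ~range/2 = 0xf800000 */
		uint64_t slide = (range * seed) >> 32;

		printf("slide = %#llx\n", (unsigned long long)slide);
		return 0;
	}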


* Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware
  2023-05-18 20:16   ` Tom Lendacky
@ 2023-05-18 22:27     ` Ard Biesheuvel
  2023-05-19 14:04       ` Tom Lendacky
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-18 22:27 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Kirill A . Shutemov,
	Linus Torvalds

On Thu, 18 May 2023 at 22:16, Tom Lendacky <thomas.lendacky@amd.com> wrote:
>
> On 5/8/23 02:03, Ard Biesheuvel wrote:
> > The decompressor executes in an environment with little or no access to
> > a console, and without any ability to return an error back to the caller
> > (the bootloader). So the only recourse we have when the SEV/SNP context
> > is not quite what the kernel expects is to terminate the guest entirely.
> >
> > This is a bit harsh, and also unnecessary when booting via the EFI stub,
> > given that it provides all the support that SEV guests need to probe the
> > underlying platform.
> >
> > So let's do the SEV initialization and SNP feature check before calling
> > ExitBootServices(), and simply return with an error if the SNP feature
> > mask is not as expected.
>
> My SEV-ES / SEV-SNP guests started crashing when I applied this patch.
> Turns out that sev_es_negotiate_protocol() used to be called when no #VC
> exceptions were being generated before a valid GHCB was set up. Because
> of that, the current GHCB MSR value was not saved/restored. But now,
> sev_es_negotiate_protocol() is called earlier in the boot process and
> there are still messages being issued by UEFI, e.g.:
>
> SetUefiImageMemoryAttributes - 0x000000007F6D7000 - 0x0000000000006000 (0x0000000000000008)
>
> Similarly, get_hv_features() didn't worry about saving the GHCB MSR value
> and an SNP guest crashed after fixing sev_es_negotiate_protocol().
>

Thanks for the clarification

So the underlying assumption here is that performing these checks
before ExitBootServices() is better because we can still return to the
bootloader, which - like GRUB does - could simply attempt booting the
next kernel in the list.

I was obviously unaware of the complication you are hitting here. So I
wonder what your take is on this: should we defer this check until
after ExitBootServices(), and crash and burn like before if the test
fails? Or is the change below reasonable in your opinion, and we can
incorporate it? Or is there a third option, i.e., is there an
SEV-specific EFI protocol that the stub should be using to establish
whether the underlying platform meets the kernel's expectations?
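
For context: in the MSR-based GHCB protocol, the request is written to
the shared GHCB MSR, VMGEXIT hands control to the hypervisor, and the
response is read back from the same MSR - so the exchange clobbers
whatever value the firmware had left there. A simplified sketch of the
pattern (Tom's fix below adds exactly this save/restore):

	u64 save, val;

	save = sev_es_rd_ghcb_msr();	/* preserve the firmware's value */

	sev_es_wr_ghcb_msr(GHCB_MSR_SEV_INFO_REQ);
	VMGEXIT();			/* world switch to the hypervisor */
	val = sev_es_rd_ghcb_msr();	/* response overwrites the MSR */

	sev_es_wr_ghcb_msr(save);	/* hand the MSR back intact */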


> The following changes got me past everything:
>
> diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
> index 3a5b0c9c4fcc..23450628d41c 100644
> --- a/arch/x86/kernel/sev-shared.c
> +++ b/arch/x86/kernel/sev-shared.c
> @@ -106,15 +106,19 @@ static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
>    */
>   static u64 get_hv_features(void)
>   {
> -       u64 val;
> +       u64 val, save;
>
>         if (ghcb_version < 2)
>                 return 0;
>
> +       save = sev_es_rd_ghcb_msr();
> +
>         sev_es_wr_ghcb_msr(GHCB_MSR_HV_FT_REQ);
>         VMGEXIT();
> -
>         val = sev_es_rd_ghcb_msr();
> +
> +       sev_es_wr_ghcb_msr(save);
> +
>         if (GHCB_RESP_CODE(val) != GHCB_MSR_HV_FT_RESP)
>                 return 0;
>
> @@ -139,13 +143,17 @@ static void snp_register_ghcb_early(unsigned long paddr)
>
>   static bool sev_es_negotiate_protocol(void)
>   {
> -       u64 val;
> +       u64 val, save;
> +
> +       save = sev_es_rd_ghcb_msr();
>
>         /* Do the GHCB protocol version negotiation */
>         sev_es_wr_ghcb_msr(GHCB_MSR_SEV_INFO_REQ);
>         VMGEXIT();
>         val = sev_es_rd_ghcb_msr();
>
> +       sev_es_wr_ghcb_msr(save);
> +
>         if (GHCB_MSR_INFO(val) != GHCB_MSR_SEV_INFO_RESP)
>                 return false;
>
>
> Thanks,
> Tom
>
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >   arch/x86/boot/compressed/sev.c          | 12 ++++++++----
> >   arch/x86/include/asm/sev.h              |  4 ++++
> >   drivers/firmware/efi/libstub/x86-stub.c | 17 +++++++++++++++++
> >   3 files changed, 29 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
> > index 014b89c890887b9a..19c40873fdd209b5 100644
> > --- a/arch/x86/boot/compressed/sev.c
> > +++ b/arch/x86/boot/compressed/sev.c
> > @@ -315,20 +315,24 @@ static void enforce_vmpl0(void)
> >    */
> >   #define SNP_FEATURES_PRESENT (0)
> >
> > +u64 snp_get_unsupported_features(void)
> > +{
> > +     if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
> > +             return 0;
> > +     return sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
> > +}
> > +
> >   void snp_check_features(void)
> >   {
> >       u64 unsupported;
> >
> > -     if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
> > -             return;
> > -
> >       /*
> >        * Terminate the boot if hypervisor has enabled any feature lacking
> >        * guest side implementation. Pass on the unsupported features mask through
> >        * EXIT_INFO_2 of the GHCB protocol so that those features can be reported
> >        * as part of the guest boot failure.
> >        */
> > -     unsupported = sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
> > +     unsupported = snp_get_unsupported_features();
> >       if (unsupported) {
> >               if (ghcb_version < 2 || (!boot_ghcb && !early_setup_ghcb()))
> >                       sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
> > diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> > index 13dc2a9d23c1eb25..bf27b91644d0da5a 100644
> > --- a/arch/x86/include/asm/sev.h
> > +++ b/arch/x86/include/asm/sev.h
> > @@ -157,6 +157,7 @@ static __always_inline void sev_es_nmi_complete(void)
> >               __sev_es_nmi_complete();
> >   }
> >   extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
> > +extern void sev_enable(struct boot_params *bp);
> >
> >   static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs)
> >   {
> > @@ -202,12 +203,14 @@ void snp_set_wakeup_secondary_cpu(void);
> >   bool snp_init(struct boot_params *bp);
> >   void __init __noreturn snp_abort(void);
> >   int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
> > +u64 snp_get_unsupported_features(void);
> >   #else
> >   static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> >   static inline void sev_es_ist_exit(void) { }
> >   static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
> >   static inline void sev_es_nmi_complete(void) { }
> >   static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
> > +static inline void sev_enable(struct boot_params *bp) { }
> >   static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate) { return 0; }
> >   static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs) { return 0; }
> >   static inline void setup_ghcb(void) { }
> > @@ -225,6 +228,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
> >   {
> >       return -ENOTTY;
> >   }
> > +static inline u64 snp_get_unsupported_features(void) { return 0; }
> >   #endif
> >
> >   #endif
> > diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> > index ce8434fce0c37982..33d11ba78f1d8c4f 100644
> > --- a/drivers/firmware/efi/libstub/x86-stub.c
> > +++ b/drivers/firmware/efi/libstub/x86-stub.c
> > @@ -15,6 +15,7 @@
> >   #include <asm/setup.h>
> >   #include <asm/desc.h>
> >   #include <asm/boot.h>
> > +#include <asm/sev.h>
> >
> >   #include "efistub.h"
> >
> > @@ -714,6 +715,22 @@ static efi_status_t exit_boot_func(struct efi_boot_memmap *map,
> >                         &p->efi->efi_memmap, &p->efi->efi_memmap_hi);
> >       p->efi->efi_memmap_size         = map->map_size;
> >
> > +     /*
> > +      * Call the SEV init code while still running with the firmware's
> > +      * GDT/IDT, so #VC exceptions will be handled by EFI.
> > +      */
> > +     if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
> > +             u64 unsupported;
> > +
> > +             sev_enable(p->boot_params);
> > +             unsupported = snp_get_unsupported_features();
> > +             if (unsupported) {
> > +                     efi_err("Unsupported SEV-SNP features detected: 0x%llx\n",
> > +                             unsupported);
> > +                     return EFI_UNSUPPORTED;
> > +             }
> > +     }
> > +
> >       return EFI_SUCCESS;
> >   }
> >


* Re: [PATCH v2 18/20] x86: efistub: Avoid legacy decompressor when doing EFI boot
  2023-05-18 20:48   ` Tom Lendacky
@ 2023-05-18 22:33     ` Ard Biesheuvel
  0 siblings, 0 replies; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-18 22:33 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Kirill A . Shutemov,
	Linus Torvalds

On Thu, 18 May 2023 at 22:48, Tom Lendacky <thomas.lendacky@amd.com> wrote:
>
> On 5/8/23 02:03, Ard Biesheuvel wrote:
> > The bare metal decompressor code was never really intended to run in a
> > hosted environment such as the EFI boot services, and does a few things
> > that are problematic in the context of EFI boot now that the logo
> > requirements are getting tighter.
> >
> > In particular, the decompressor moves its own executable image around in
> > memory, and relies on demand paging to populate the identity mappings,
> > and these things are difficult to support in a context where memory is
> > not permitted to be mapped writable and executable at the same time or,
> > at the very least, is mapped non-executable by default, and needs
> > special treatment for this restriction to be lifted.
> >
> > Since EFI already maps all of memory 1:1, we don't need to create new
> > page tables or handle page faults when decompressing the kernel. That
> > means there is also no need to replace the special exception handlers
> > for SEV. Generally, there is little need to do anything that the
> > decompressor does beyond
> >
> > - initialize SEV encryption, if needed,
> > - perform the 4/5 level paging switch, if needed,
> > - decompress the kernel
> > - relocate the kernel
> >
> > So let's do all of this from the EFI stub code, and avoid the bare metal
> > decompressor altogether.
>
> This patch crashes SEV guests, probably because the BSS is accessed
> encrypted and yields ciphertext where a non-SEV guest would see zeroes.
> After applying patch #19 everything started working again. From a
> bisectability perspective, you probably want patch #19 before this one.
>

Noted, thanks.


* Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware
  2023-05-18 22:27     ` Ard Biesheuvel
@ 2023-05-19 14:04       ` Tom Lendacky
  2023-05-22 12:48         ` Joerg Roedel
  0 siblings, 1 reply; 42+ messages in thread
From: Tom Lendacky @ 2023-05-19 14:04 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-efi, linux-kernel, Evgeniy Baskov, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Alexey Khoroshilov, Peter Jones, Gerd Hoffmann,
	Dave Young, Mario Limonciello, Kees Cook, Kirill A . Shutemov,
	Linus Torvalds, Joerg Roedel

On 5/18/23 17:27, Ard Biesheuvel wrote:
> On Thu, 18 May 2023 at 22:16, Tom Lendacky <thomas.lendacky@amd.com> wrote:
>>
>> On 5/8/23 02:03, Ard Biesheuvel wrote:
>>> The decompressor executes in an environment with little or no access to
>>> a console, and without any ability to return an error back to the caller
>>> (the bootloader). So the only recourse we have when the SEV/SNP context
>>> is not quite what the kernel expects is to terminate the guest entirely.
>>>
>>> This is a bit harsh, and also unnecessary when booting via the EFI stub,
>>> given that it provides all the support that SEV guests need to probe the
>>> underlying platform.
>>>
>>> So let's do the SEV initialization and SNP feature check before calling
>>> ExitBootServices(), and simply return with an error if the SNP feature
>>> mask is not as expected.
>>
>> My SEV-ES / SEV-SNP guests started crashing when I applied this patch.
>> Turns out that sev_es_negotiate_protocol() used to be called when no #VC
>> exceptions were being generated before a valid GHCB was set up. Because
>> of that, the current GHCB MSR value was not saved/restored. But now,
>> sev_es_negotiate_protocol() is called earlier in the boot process and
>> there are still messages being issued by UEFI, e.g.:
>>
>> SetUefiImageMemoryAttributes - 0x000000007F6D7000 - 0x0000000000006000 (0x0000000000000008)
>>
>> Similarly, get_hv_features() didn't worry about saving the GHCB MSR value
>> and an SNP guest crashed after fixing sev_es_negotiate_protocol().
>>
> 
> Thanks for the clarification
> 
> So the underlying assumption here is that performing these checks
> before ExitBootServices() is better because we can still return to the
> bootloader, which - like GRUB does - could simply attempt booting the
> next kernel in the list.
> 
> I was obviously unaware of the complication you are hitting here. So I
> wonder what your take is on this: should we defer this check until
> after ExitBootServices(), and crash and burn like before if the test
> fails? Or is the change below reasonable in your opinion, and we can

Deferring the checks is probably the safest thing to do, since that would
match the way things are done today and are known to work. I'm not sure
what other things might pop up if we stay with this approach - for
example, page state change calls using the GHCB MSR protocol that also
don't save/restore the MSR value.

It is possible to audit these areas and stay with this approach, but I'm 
wondering if that wouldn't be better done as a separate patch series.

Adding @Joerg for any additional thoughts he might have around this area, too.

> incorporate it? Or is there a third option, i.e., is there an
> SEV-specific EFI protocol that the stub should be using to establish
> whether the underlying platform meets the kernel's expectations?

There isn't currently an EFI protocol to do that.

Thanks,
Tom

> 
> 
>> The following changes got me past everything:
>>
>> diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
>> index 3a5b0c9c4fcc..23450628d41c 100644
>> --- a/arch/x86/kernel/sev-shared.c
>> +++ b/arch/x86/kernel/sev-shared.c
>> @@ -106,15 +106,19 @@ static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
>>     */
>>    static u64 get_hv_features(void)
>>    {
>> -       u64 val;
>> +       u64 val, save;
>>
>>          if (ghcb_version < 2)
>>                  return 0;
>>
>> +       save = sev_es_rd_ghcb_msr();
>> +
>>          sev_es_wr_ghcb_msr(GHCB_MSR_HV_FT_REQ);
>>          VMGEXIT();
>> -
>>          val = sev_es_rd_ghcb_msr();
>> +
>> +       sev_es_wr_ghcb_msr(save);
>> +
>>          if (GHCB_RESP_CODE(val) != GHCB_MSR_HV_FT_RESP)
>>                  return 0;
>>
>> @@ -139,13 +143,17 @@ static void snp_register_ghcb_early(unsigned long paddr)
>>
>>    static bool sev_es_negotiate_protocol(void)
>>    {
>> -       u64 val;
>> +       u64 val, save;
>> +
>> +       save = sev_es_rd_ghcb_msr();
>>
>>          /* Do the GHCB protocol version negotiation */
>>          sev_es_wr_ghcb_msr(GHCB_MSR_SEV_INFO_REQ);
>>          VMGEXIT();
>>          val = sev_es_rd_ghcb_msr();
>>
>> +       sev_es_wr_ghcb_msr(save);
>> +
>>          if (GHCB_MSR_INFO(val) != GHCB_MSR_SEV_INFO_RESP)
>>                  return false;
>>
>>
>> Thanks,
>> Tom
>>
>>>
>>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>>> ---
>>>    arch/x86/boot/compressed/sev.c          | 12 ++++++++----
>>>    arch/x86/include/asm/sev.h              |  4 ++++
>>>    drivers/firmware/efi/libstub/x86-stub.c | 17 +++++++++++++++++
>>>    3 files changed, 29 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
>>> index 014b89c890887b9a..19c40873fdd209b5 100644
>>> --- a/arch/x86/boot/compressed/sev.c
>>> +++ b/arch/x86/boot/compressed/sev.c
>>> @@ -315,20 +315,24 @@ static void enforce_vmpl0(void)
>>>     */
>>>    #define SNP_FEATURES_PRESENT (0)
>>>
>>> +u64 snp_get_unsupported_features(void)
>>> +{
>>> +     if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
>>> +             return 0;
>>> +     return sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
>>> +}
>>> +
>>>    void snp_check_features(void)
>>>    {
>>>        u64 unsupported;
>>>
>>> -     if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
>>> -             return;
>>> -
>>>        /*
>>>         * Terminate the boot if hypervisor has enabled any feature lacking
>>>         * guest side implementation. Pass on the unsupported features mask through
>>>         * EXIT_INFO_2 of the GHCB protocol so that those features can be reported
>>>         * as part of the guest boot failure.
>>>         */
>>> -     unsupported = sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
>>> +     unsupported = snp_get_unsupported_features();
>>>        if (unsupported) {
>>>                if (ghcb_version < 2 || (!boot_ghcb && !early_setup_ghcb()))
>>>                        sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
>>> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
>>> index 13dc2a9d23c1eb25..bf27b91644d0da5a 100644
>>> --- a/arch/x86/include/asm/sev.h
>>> +++ b/arch/x86/include/asm/sev.h
>>> @@ -157,6 +157,7 @@ static __always_inline void sev_es_nmi_complete(void)
>>>                __sev_es_nmi_complete();
>>>    }
>>>    extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
>>> +extern void sev_enable(struct boot_params *bp);
>>>
>>>    static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs)
>>>    {
>>> @@ -202,12 +203,14 @@ void snp_set_wakeup_secondary_cpu(void);
>>>    bool snp_init(struct boot_params *bp);
>>>    void __init __noreturn snp_abort(void);
>>>    int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
>>> +u64 snp_get_unsupported_features(void);
>>>    #else
>>>    static inline void sev_es_ist_enter(struct pt_regs *regs) { }
>>>    static inline void sev_es_ist_exit(void) { }
>>>    static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
>>>    static inline void sev_es_nmi_complete(void) { }
>>>    static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
>>> +static inline void sev_enable(struct boot_params *bp) { }
>>>    static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate) { return 0; }
>>>    static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs) { return 0; }
>>>    static inline void setup_ghcb(void) { }
>>> @@ -225,6 +228,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
>>>    {
>>>        return -ENOTTY;
>>>    }
>>> +static inline u64 snp_get_unsupported_features(void) { return 0; }
>>>    #endif
>>>
>>>    #endif
>>> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
>>> index ce8434fce0c37982..33d11ba78f1d8c4f 100644
>>> --- a/drivers/firmware/efi/libstub/x86-stub.c
>>> +++ b/drivers/firmware/efi/libstub/x86-stub.c
>>> @@ -15,6 +15,7 @@
>>>    #include <asm/setup.h>
>>>    #include <asm/desc.h>
>>>    #include <asm/boot.h>
>>> +#include <asm/sev.h>
>>>
>>>    #include "efistub.h"
>>>
>>> @@ -714,6 +715,22 @@ static efi_status_t exit_boot_func(struct efi_boot_memmap *map,
>>>                          &p->efi->efi_memmap, &p->efi->efi_memmap_hi);
>>>        p->efi->efi_memmap_size         = map->map_size;
>>>
>>> +     /*
>>> +      * Call the SEV init code while still running with the firmware's
>>> +      * GDT/IDT, so #VC exceptions will be handled by EFI.
>>> +      */
>>> +     if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
>>> +             u64 unsupported;
>>> +
>>> +             sev_enable(p->boot_params);
>>> +             unsupported = snp_get_unsupported_features();
>>> +             if (unsupported) {
>>> +                     efi_err("Unsupported SEV-SNP features detected: 0x%llx\n",
>>> +                             unsupported);
>>> +                     return EFI_UNSUPPORTED;
>>> +             }
>>> +     }
>>> +
>>>        return EFI_SUCCESS;
>>>    }
>>>


* Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware
  2023-05-19 14:04       ` Tom Lendacky
@ 2023-05-22 12:48         ` Joerg Roedel
  2023-05-22 13:07           ` Ard Biesheuvel
  0 siblings, 1 reply; 42+ messages in thread
From: Joerg Roedel @ 2023-05-22 12:48 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Ard Biesheuvel, linux-efi, linux-kernel, Evgeniy Baskov,
	Borislav Petkov, Andy Lutomirski, Dave Hansen, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexey Khoroshilov, Peter Jones,
	Gerd Hoffmann, Dave Young, Mario Limonciello, Kees Cook,
	Kirill A . Shutemov, Linus Torvalds

On Fri, May 19, 2023 at 09:04:46AM -0500, Tom Lendacky wrote:
> Deferring the checks is probably the safest thing to do, since that would
> match the way things are done today and are known to work. I'm not sure
> what other things might pop up if we stay with this approach - for
> example, page state change calls using the GHCB MSR protocol that also
> don't save/restore the MSR value.
> 
> It is possible to audit these areas and stay with this approach, but I'm
> wondering if that wouldn't be better done as a separate patch series.
> 
> Adding @Joerg for any additional thoughts he might have around this area, too.

If I got it correctly, the patch actually moves two things before
ExitBootServices():

	1) SEV features check

	2) SEV initialization

I think it makes a lot of sense to have 1) before ExitBootServices(). It
allows us to soft-fail in case the kernel does not support all required
SEV-SNP features and to move on to a kernel which does. This check also
only needs the SEV_STATUS MSR and not any GHCB calls.

The problem is the GHCB protocol negotiation with the HV, but the GHCB
protocol is backward-compatible, so an older kernel can work with a
newer HV.

But 2) needs to stay after ExitBootServices(), as it needs resources
owned by UEFI, e.g. the GHCB MSR and potentially the configured GHCB
itself. Fiddling around with the GHCB MSR while it is still owned by
UEFI will bite us in one way or another (e.g. UEFI, before
ExitBootServices(), is free to take IRQs with handlers that rely on the
GHCB MSR content).

Regards,

	Joerg
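
In code, the early check described under 1) could look roughly like
this - a hypothetical sketch, not an actual patch; sev_detected()
stands in for the CPUID 0x8000001f based detection, since the MSR must
only be read on SEV platforms:

	static u64 early_snp_unsupported_features(void)
	{
		u64 status;

		if (!sev_detected())
			return 0;

		/* SEV_STATUS only - no GHCB traffic, safe pre-EBS */
		status = __rdmsr(MSR_AMD64_SEV);
		if (!(status & MSR_AMD64_SEV_SNP_ENABLED))
			return 0;

		return status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
	}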



* Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware
  2023-05-22 12:48         ` Joerg Roedel
@ 2023-05-22 13:07           ` Ard Biesheuvel
  2023-05-22 13:35             ` Joerg Roedel
  0 siblings, 1 reply; 42+ messages in thread
From: Ard Biesheuvel @ 2023-05-22 13:07 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Tom Lendacky, linux-efi, linux-kernel, Evgeniy Baskov,
	Borislav Petkov, Andy Lutomirski, Dave Hansen, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexey Khoroshilov, Peter Jones,
	Gerd Hoffmann, Dave Young, Mario Limonciello, Kees Cook,
	Kirill A . Shutemov, Linus Torvalds

On Mon, 22 May 2023 at 14:48, Joerg Roedel <jroedel@suse.de> wrote:
>
> On Fri, May 19, 2023 at 09:04:46AM -0500, Tom Lendacky wrote:
> > Deferring the checks is probably the safest thing to do, since that would
> > match the way things are done today and known to work. I'm not sure what
> > other things might pop up if we stay with this approach, for example, page
> > state change calls using the GHCB MSR protocol that also don't save/restore
> > the MSR value.
> >
> > It is possible to audit these areas and stay with this approach, but I'm
> > wondering if that wouldn't be better done as a separate patch series.
> >
> > Adding @Joerg for any additional thoughts he might have around this area, too.
>
> If I got it correctly, the patch actually moves two things before
> ExitBootServices():
>
>         1) SEV features check
>
>         2) SEV initialization
>
> I think it makes a lot of sense to have 1) before ExitBootServices(). It
> allows us to soft-fail in case the kernel does not support all required
> SEV-SNP features and to move on to a kernel which does. This check also
> only needs the SEV_STATUS MSR and not any GHCB calls.
>
> The problem is the GHCB protocol negotiation with the HV, but the GHCB
> protocol is backward-compatible, so an older kernel can work with a
> newer HV.
>
> But 2) needs to stay after ExitBootServices(), as it needs resources
> owned by UEFI, e.g. the GHCB MSR and potentially the configured GHCB
> itself. Fiddling around with the GHCB MSR while it is still owned by
> UEFI will bite us in one way or another (e.g. UEFI, before
> ExitBootServices(), is free to take IRQs with handlers that rely on the
> GHCB MSR content).
>

Thanks for the insight. Note that I have sent a v3 in the meantime
that moves all of this *after* ExitBootServices() [0], but I failed to
cc you - apologies.

So IIUC, we could read sev_status much earlier just to perform
the SNP feature check, and fail the boot gracefully on a mismatch. And
the sev_enable() call needs to move after ExitBootServices(), right?

That would result in only very minor duplication, afaict. I'll have a
stab at implementing this for v4.

Thanks,




[0] https://lore.kernel.org/all/20230522071415.501717-21-ardb@kernel.org/
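
In terms of control flow in the stub, the plan would presumably look
something like this (a rough sketch only, not the actual v4 patch;
early_snp_unsupported_features() is the hypothetical MSR-only helper
sketched earlier in the thread):

	/* before ExitBootServices(): cheap check, graceful failure */
	if (early_snp_unsupported_features()) {
		efi_err("Unsupported SEV-SNP features detected\n");
		return EFI_UNSUPPORTED;	/* loader may try another kernel */
	}

	status = exit_boot(boot_params, handle);
	if (status != EFI_SUCCESS)
		goto fail;

	/* after ExitBootServices(): the GHCB MSR now belongs to us */
	sev_enable(boot_params);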


* Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware
  2023-05-22 13:07           ` Ard Biesheuvel
@ 2023-05-22 13:35             ` Joerg Roedel
  0 siblings, 0 replies; 42+ messages in thread
From: Joerg Roedel @ 2023-05-22 13:35 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Tom Lendacky, linux-efi, linux-kernel, Evgeniy Baskov,
	Borislav Petkov, Andy Lutomirski, Dave Hansen, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Alexey Khoroshilov, Peter Jones,
	Gerd Hoffmann, Dave Young, Mario Limonciello, Kees Cook,
	Kirill A . Shutemov, Linus Torvalds

On Mon, May 22, 2023 at 03:07:12PM +0200, Ard Biesheuvel wrote:
> So IIUC, we could read sev_status much earlier just to perform
> the SNP feature check, and fail the boot gracefully on a mismatch. And
> the sev_enable() call needs to move after ExitBootServices(), right?

Right, sev_enable() negotiates the GHCB protocol version, which needs
the GHCB MSR, so that has to stay after ExitBootServices(). The SEV
feature check, on the other hand, only needs to read the SEV_STATUS
MSR, which is no problem before ExitBootServices() (as long as it is
only read on SEV platforms).

> That would result in only very minor duplication, afaict. I'll have a
> stab at implementing this for v4.

Thanks,

-- 
Jörg Rödel
jroedel@suse.de

SUSE Software Solutions Germany GmbH
Frankenstraße 146
90461 Nürnberg
Germany

(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman


